git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* problem with parsing of patch files for patch-id
@ 2024-06-21 10:33 Rob Linden
  2024-06-21 17:05 ` Junio C Hamano
  0 siblings, 1 reply; 5+ messages in thread
From: Rob Linden @ 2024-06-21 10:33 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 2144 bytes --]

Hello!

We noticed a problem with the parsing of a patch file for "git
patch-id". I understand that the patch file format is very difficult
and unpredictable and probably it's not even possible to correctly
parse all of them (mostly due to missing restrictions on or escaping
of commit messages).

But in our specific case it can be improved to handle it correctly.

I attached an example patch file. When feeding that to "git patch-id"
(with git version 2.45.2.561.g66ac6e4bcd) the output is:

068a8a30a5b0e55b93fdc16b2a7dcd6886e420f3
1111111111111111111111111111111111111111
818756276fff2c6075da8effe36c65d25e6ed592
dcc59dffa5116bf96618065cd60742cb660224b8
b033e3eca8a60741bb414689ddfe00a0c1a09de5
3333333333333333333333333333333333333333

But it should be:

068a8a30a5b0e55b93fdc16b2a7dcd6886e420f3
1111111111111111111111111111111111111111
818756276fff2c6075da8effe36c65d25e6ed592
2222222222222222222222222222222222222222
b033e3eca8a60741bb414689ddfe00a0c1a09de5
3333333333333333333333333333333333333333

The reason is that the commit message of the second patch contains
commit hashes which are parsed as if they were the commit hash for the
patch, and not just some message.

This patch (also attached) fixes it by only considering commit hashes
in a "From xxxxx..." line:

diff --git a/builtin/patch-id.c b/builtin/patch-id.c
index 583099cacf..4b8a41bde8 100644
--- a/builtin/patch-id.c
+++ b/builtin/patch-id.c
@@ -86,7 +86,7 @@ static int get_one_patchid(struct object_id
*next_oid, struct object_id *result,
                        continue;
                }

-               if (!get_oid_hex(p, next_oid)) {
+               if (starts_with(p-5, "From ") && !get_oid_hex(p, next_oid)) {
                        found_next = 1;
                        break;
                }

But I'm not sure it is ok in all other cases [which are handled
correctly now, i.e. it only makes it better for cases like ours,
without making it worse for anything else). The unit-tests pass ok but
I didn't check how comprehensive they are.
Can somebody please have a look and tell me what they think about
patch file parsing?

Thanks & all the best,
rob

[-- Attachment #2: test.patch --]
[-- Type: text/x-patch, Size: 3119 bytes --]

From 1111111111111111111111111111111111111111 Mon Sep 17 00:00:00 2001
From: rlinden@redhat.com
Date: Wed, 24 Jan 2024 14:49:21 +0100
Subject: [PATCH 1/3] foo foo

Resolves: bug-123
---
 test/units/testsuite-07.main-PID-change.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/test/units/testsuite-07.main-PID-change.sh b/test/units/testsuite-07.main-PID-change.sh
index be4631f10d..bd5a63a272 100755
--- a/test/units/testsuite-07.main-PID-change.sh
+++ b/test/units/testsuite-07.main-PID-change.sh
@@ -151,6 +151,8 @@ systemd-run --unit=test-mainpidsh3.service \
             -p RuntimeDirectory=mainpidsh3 \
             -p PIDFile=/run/mainpidsh3/pid \
             -p DynamicUser=1 \
+            `# Make sanitizers happy when DynamicUser=1 pulls in instrumented systemd NSS modules` \
+            -p EnvironmentFile=-/usr/lib/systemd/systemd-asan-env \
             -p TimeoutStartSec=2s \
             /dev/shm/test-mainpid3.sh \
     && { echo 'unexpected success'; exit 1; }

From 2222222222222222222222222222222222222222 Mon Sep 17 00:00:00 2001
From: rlinden@redhat.com
Date: Wed, 24 Jan 2024 15:49:21 +0100
Subject: [PATCH 2/3] bar bar

we should not mention these other commits in our commit message like this:
55d337de1940076855c1687ffd588498d068724e
dcc59dffa5116bf96618065cd60742cb660224b8

---
 test/units/testsuite-07.main-PID-change.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/test/units/testsuite-07.main-PID-change.sh b/test/units/testsuite-07.main-PID-change.sh
index be4631f10d..bd5a63a272 100755
--- a/test/units/testsuite-07.main-PID-change.sh
+++ b/test/units/testsuite-07.main-PID-change.sh
@@ -151,6 +151,8 @@ systemd-run --unit=test-mainpidsh3.service \
             -p RuntimeDirectory=mainpidsh3 \
             -p PIDFile=/run/mainpidsh3/pid \
             -p DynamicUser=1 \
+            `# Make sanitizers happy when DynamicUser=1 pulls in instrumented systemd NSS modules` \
+            -p EnvironmentFile=-/usr/share/lib/systemd/systemd-asan-env \
             -p TimeoutStartSec=2s \
             /dev/shm/test-mainpid3.sh \
     && { echo 'unexpected success'; exit 1; }

From 3333333333333333333333333333333333333333 Mon Sep 17 00:00:00 2001
From: rlinden@redhat.com
Date: Wed, 24 Jan 2024 16:49:21 +0100
Subject: [PATCH 3/3] baz baz

that's it
---
 test/units/testsuite-07.main-PID-change.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/test/units/testsuite-07.main-PID-change.sh b/test/units/testsuite-07.main-PID-change.sh
index be4631f10d..bd5a63a272 100755
--- a/test/units/testsuite-07.main-PID-change.sh
+++ b/test/units/testsuite-07.main-PID-change.sh
@@ -151,6 +151,8 @@ systemd-run --unit=test-mainpidsh3.service \
             -p RuntimeDirectory=mainpidsh3 \
             -p PIDFile=/run/mainpidsh3/pid \
             -p DynamicUser=1 \
+            `# Make sanitizers happy when DynamicUser=1 pulls in instrumented systemd NSS modules` \
+            -p EnvironmentFile=-/usr/local/lib/systemd/systemd-asan-env \
             -p TimeoutStartSec=2s \
             /dev/shm/test-mainpid3.sh \
     && { echo 'unexpected success'; exit 1; }

[-- Attachment #3: fix.patch --]
[-- Type: text/x-patch, Size: 396 bytes --]

diff --git a/builtin/patch-id.c b/builtin/patch-id.c
index 583099cacf..4b8a41bde8 100644
--- a/builtin/patch-id.c
+++ b/builtin/patch-id.c
@@ -86,7 +86,7 @@ static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
 			continue;
 		}
 
-		if (!get_oid_hex(p, next_oid)) {
+		if (starts_with(p-5, "From ") && !get_oid_hex(p, next_oid)) {
 			found_next = 1;
 			break;
 		}

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: problem with parsing of patch files for patch-id
  2024-06-21 10:33 problem with parsing of patch files for patch-id Rob Linden
@ 2024-06-21 17:05 ` Junio C Hamano
  2024-06-21 18:15   ` Rob Linden
  2024-06-21 18:53   ` Junio C Hamano
  0 siblings, 2 replies; 5+ messages in thread
From: Junio C Hamano @ 2024-06-21 17:05 UTC (permalink / raw)
  To: Rob Linden; +Cc: git

Rob Linden <rlinden@redhat.com> writes:

> This patch (also attached) fixes it by only considering commit hashes
> in a "From xxxxx..." line:

If I am not mistaken, "git patch-id" was designed to read from

    git rev-list ... commit range ... | git diff-tree --stdin -p

where we see

    9005149a4a77e2d3409c6127bf4fd1a0893c3495
    diff --git a/path b/path
    index ...
    ... patch text here ...

so I would suspect that limiting the commit object names only to
those that follow "From " (i.e. the format-patch output or output
with the "--format=email" option) would break existing use cases big
time.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: problem with parsing of patch files for patch-id
  2024-06-21 17:05 ` Junio C Hamano
@ 2024-06-21 18:15   ` Rob Linden
  2024-07-13 16:25     ` Junio C Hamano
  2024-06-21 18:53   ` Junio C Hamano
  1 sibling, 1 reply; 5+ messages in thread
From: Rob Linden @ 2024-06-21 18:15 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Hello Junio!
Thanks for the clue, You're right... I only work with the email format
so I didn't think of that.
My solution doesn't work then...
I had a different idea first: to check if we already got an oid and
only read a new one once
the current diff is finished (and wasn't empty so far). The other one
seemed just simpler.
I will try again...
Thanks & all the best,
rob

On Fri, Jun 21, 2024 at 7:05 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Rob Linden <rlinden@redhat.com> writes:
>
> > This patch (also attached) fixes it by only considering commit hashes
> > in a "From xxxxx..." line:
>
> If I am not mistaken, "git patch-id" was designed to read from
>
>     git rev-list ... commit range ... | git diff-tree --stdin -p
>
> where we see
>
>     9005149a4a77e2d3409c6127bf4fd1a0893c3495
>     diff --git a/path b/path
>     index ...
>     ... patch text here ...
>
> so I would suspect that limiting the commit object names only to
> those that follow "From " (i.e. the format-patch output or output
> with the "--format=email" option) would break existing use cases big
> time.
>


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: problem with parsing of patch files for patch-id
  2024-06-21 17:05 ` Junio C Hamano
  2024-06-21 18:15   ` Rob Linden
@ 2024-06-21 18:53   ` Junio C Hamano
  1 sibling, 0 replies; 5+ messages in thread
From: Junio C Hamano @ 2024-06-21 18:53 UTC (permalink / raw)
  To: Rob Linden; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> Rob Linden <rlinden@redhat.com> writes:
>
>> This patch (also attached) fixes it by only considering commit hashes
>> in a "From xxxxx..." line:
>
> If I am not mistaken, "git patch-id" was designed to read from
>
>     git rev-list ... commit range ... | git diff-tree --stdin -p
>
> where we see
>
>     9005149a4a77e2d3409c6127bf4fd1a0893c3495
>     diff --git a/path b/path
>     index ...
>     ... patch text here ...
>
> so I would suspect that limiting the commit object names only to
> those that follow "From " (i.e. the format-patch output or output
> with the "--format=email" option) would break existing use cases big
> time.

Let's do this to make sure we have a baseline that we will not
break.

------- >8 ------------- >8 ------------- >8 -------
[PATCH] t4204: patch-id supports various input format

"git patch-id" was first developed to read from "git diff-tree
--stdin -p" output.  Later it was enhanced to read from "git
diff-tree --stdin -p -v", which was the downstream of an early
imitation of "git log" ("git rev-list" run in the upstream of a pipe
to feed the "diff-tree").  These days, we also read from "git
format-patch".

Their output begins slightly differently, but the patch-id computed
over them for the same commit should be the same.  Ensure that we
won't accidentally break this expectation.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 t/t4204-patch-id.sh | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git c/t/t4204-patch-id.sh w/t/t4204-patch-id.sh
index 605faea0c7..ebce72b2ce 100755
--- c/t/t4204-patch-id.sh
+++ w/t/t4204-patch-id.sh
@@ -114,6 +114,29 @@ test_expect_success 'patch-id supports git-format-patch output' '
 	test "$2" = $(git rev-parse HEAD)
 '
 
+test_expect_success 'patch-id computes the same for various formats' '
+	# This test happens to consider "git log -p -1" output
+	# the canonical input format, so use it as the norm.
+	git log -1 -p same >log-p.output &&
+	git patch-id <log-p.output >expect &&
+
+	# format-patch begins with "From <commit object name>"
+	git format-patch -1 --stdout same >format-patch.output &&
+	git patch-id <format-patch.output >actual &&
+	test_cmp actual expect &&
+
+	# "diff-tree --stdin -p" begins with "<commit object name>"
+	same=$(git rev-parse same) &&
+	echo $same | git diff-tree --stdin -p >diff-tree.output &&
+	git patch-id <diff-tree.output >actual &&
+	test_cmp actual expect &&
+
+	# "diff-tree --stdin -v -p" begins with "commit <commit object name>"
+	echo $same | git diff-tree --stdin -p -v >diff-tree-v.output &&
+	git patch-id <diff-tree-v.output >actual &&
+	test_cmp actual expect
+'
+
 test_expect_success 'whitespace is irrelevant in footer' '
 	get_patch_id main &&
 	git checkout same &&

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: problem with parsing of patch files for patch-id
  2024-06-21 18:15   ` Rob Linden
@ 2024-07-13 16:25     ` Junio C Hamano
  0 siblings, 0 replies; 5+ messages in thread
From: Junio C Hamano @ 2024-07-13 16:25 UTC (permalink / raw)
  To: Rob Linden; +Cc: git

Rob Linden <rlinden@redhat.com> writes:

>> Rob Linden <rlinden@redhat.com> writes:
>>
>> > This patch (also attached) fixes it by only considering commit hashes
>> > in a "From xxxxx..." line:
>>
>> If I am not mistaken, "git patch-id" was designed to read from
>>
>>     git rev-list ... commit range ... | git diff-tree --stdin -p
>>
>> where we see
>>
>>     9005149a4a77e2d3409c6127bf4fd1a0893c3495
>>     diff --git a/path b/path
>>     index ...
>>     ... patch text here ...
>>
>> so I would suspect that limiting the commit object names only to
>> those that follow "From " (i.e. the format-patch output or output
>> with the "--format=email" option) would break existing use cases big
>> time.
> ...
> Hello Junio!
> Thanks for the clue, You're right... I only work with the email format
> so I didn't think of that.
> My solution doesn't work then...
> I had a different idea first: to check if we already got an oid and
> only read a new one once
> the current diff is finished (and wasn't empty so far). The other one
> seemed just simpler.
> I will try again...
> Thanks & all the best,
> rob

Since then I sent a series [*] that was designed to address the
issue you raised here, but unfortunately nobody seems to have paid
attention and the patches are left hanging.  It is part of my 'seen'
branch and in the broken-out format parked on the jc/patch-id branch
in the https://github.com/gitster/git/ repository, ending with the
commit 3226bd87 (patch-id: tighten code to detect the patch header,
2024-06-21).

If you can test (and if possible code review) them, that may help
the series to move forward.

Thanks.



[Reference]

 * https://lore.kernel.org/git/20240621231826.3280338-1-gitster@pobox.com/

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-07-13 16:25 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-21 10:33 problem with parsing of patch files for patch-id Rob Linden
2024-06-21 17:05 ` Junio C Hamano
2024-06-21 18:15   ` Rob Linden
2024-07-13 16:25     ` Junio C Hamano
2024-06-21 18:53   ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).