* git format-patch escaping issues in the patch format
@ 2024-11-04 19:24 Christoph Anton Mitterer
2024-11-04 22:15 ` Kristoffer Haugsbakk
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Christoph Anton Mitterer @ 2024-11-04 19:24 UTC (permalink / raw)
To: git
Hey there.
For some special use case, I wanted to write a parser for the patch
format created by git format-patch, especially where I can separate
headers, commit message and the actual unified diffs.
There seems unfortunately only little (written) definition of that
format, git-format-patch(1) merely says it's in UNIX mailbox format
(which itself is, AFAIK, not really formally defined).
Anyway, it seems to turn out, that no escaping is done for the commit
message in the patch format and that this can cause actual breakage
with valid commit messages.
Consider the following example:
1. I create a fresh repo, add a test file and use a commit message,
which contains a From line (even with the "magic" timestamp) and
some made up commit id (0000...)
~/test$ git init foo; cd foo
Initialized empty Git repository in /home/calestyo/test/foo/.git/
~/test/foo$ echo a >f; git add f
~/test/foo$ git commit -m "msg1
From 0000000000000000000000000000000061603705 Mon Sep 17 00:00:00 2001
--
---"
[master (root-commit) c08debc] msg1
1 file changed, 1 insertion(+)
create mode 100644 f
2. The format-patch for that looks already suspicious:
- The From line is not escaped (as some variants of mbox would do,
some properly some, causing corruption by the escaping with >
itself).
- What the format may think of as a separator after the commit
message (namely the ---) cannot be used as that either, as a ---
in the commit message is again not escaped.
~/test/foo$ git format-patch --root; cat 0001-msg1.patch; rm -f 0001-msg1.patch
0001-msg1.patch
From c08debcc502c78786ec71d50686ff0445a13b654 Mon Sep 17 00:00:00 2001
From: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
Date: Mon, 4 Nov 2024 19:58:45 +0100
Subject: [PATCH] msg1
From 0000000000000000000000000000000061603705 Mon Sep 17 00:00:00 2001
--
---
---
f | 1 +
1 file changed, 1 insertion(+)
create mode 100644 f
diff --git a/f b/f
new file mode 100644
index 0000000..7898192
--- /dev/null
+++ b/f
@@ -0,0 +1 @@
+a
--
2.45.2
3. Adding a 2nd commit, this time using the unified diff from the above
patch as commit message body(!).
~/test/foo$ echo b >>f; git add f
~/test/foo$ git commit -m "msg2
diff --git a/f b/f
new file mode 100644
index 0000000..7898192
--- /dev/null
+++ b/f
@@ -0,0 +1 @@
+a
--
2.45.2"
[master 6bbe38c] msg2
1 file changed, 1 insertion(+)
~/test/foo$ git format-patch --root
0001-msg1.patch
0002-msg2.patch
4. To no surprise, git itself of course knows the difference between
commit message and actual patch, as show e.g. by the following,
where the commit message is indented (by git):
$ git log --patch | cat
commit 6bbe38c33680239ac9767e0e5095f9f32ad41ade
Author: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
Date: Mon Nov 4 20:00:20 2024 +0100
msg2
diff --git a/f b/f
new file mode 100644
index 0000000..7898192
--- /dev/null
+++ b/f
@@ -0,0 +1 @@
+a
--
2.45.2
diff --git a/f b/f
index 7898192..422c2b7 100644
--- a/f
+++ b/f
@@ -1 +1,2 @@
a
+b
commit c08debcc502c78786ec71d50686ff0445a13b654
Author: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
Date: Mon Nov 4 19:58:45 2024 +0100
msg1
From 0000000000000000000000000000000061603705 Mon Sep 17 00:00:00 2001
--
---
diff --git a/f b/f
new file mode 100644
index 0000000..7898192
--- /dev/null
+++ b/f
@@ -0,0 +1 @@
+a
5. Next I try whether git am can use the patches created above in a
fresh repo:
~/test/foo$ cd ..; git init bar; cd bar
Initialized empty Git repository in /home/calestyo/test/bar/.git/
~/test/bar$ git am ../foo/0001-msg1.patch
Patch is empty.
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To record the empty patch as an empty commit, run "git am --allow-empty".
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
That already fails for the first patch, the reason probably being my
From 0000...
line in the commit message.
6. So trying again with simply that From 000.. line removed
~/test/bar$ sed -i '/^From 00000/d' ../foo/0001-msg1.patch
~/test/bar$ git am ../foo/0001-msg1.patch
fatal: previous rebase directory .git/rebase-apply still exists but mbox given.
and again on a freshly created repo:
~/test/bar$ cd ..; rm -rf bar; git init bar; cd bar
Initialized empty Git repository in /home/calestyo/test/bar/.git/
~/test/bar$ git am ../foo/0001-msg1.patch
Applying: msg1
applying to an empty history
Ah, now it works, so it was indeed the (unusual but still valid commit message).
7. Now that 0001-msg1.patch is applied, let's try the 2nd patch:
~/test/bar$ git am ../foo/0002-msg2.patch
Applying: msg2
error: f: already exists in index
Patch failed at 0001 msg2
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
~/test/bar$
And again, ... the reason most likely git not being able to get that
the "first diff" is not actually a diff but part of the commit message.
btw and shamelessly off-topic question:
Any chance that git format-patch / am will ever support keeping track
of the branch/merge history of generated / applied patches?
That would be really neat.
Thanks,
Chris.
PS: Not subscribed, so please keep me CCed in case you want me to read
any possible replies :-)
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: git format-patch escaping issues in the patch format
2024-11-04 19:24 git format-patch escaping issues in the patch format Christoph Anton Mitterer
@ 2024-11-04 22:15 ` Kristoffer Haugsbakk
2024-11-05 1:26 ` Christoph Anton Mitterer
2024-11-04 23:54 ` Jeff King
2024-11-05 0:22 ` Junio C Hamano
2 siblings, 1 reply; 10+ messages in thread
From: Kristoffer Haugsbakk @ 2024-11-04 22:15 UTC (permalink / raw)
To: Christoph Anton Mitterer, git
On Mon, Nov 4, 2024, at 20:24, Christoph Anton Mitterer wrote:
> Hey there.
>
> For some special use case, I wanted to write a parser for the patch
> format created by git format-patch, especially where I can separate
> headers, commit message and the actual unified diffs.
>
> There seems unfortunately only little (written) definition of that
> format, git-format-patch(1) merely says it's in UNIX mailbox format
> (which itself is, AFAIK, not really formally defined).
It seems to me (totally naïve) that you would do something like
1. Blank line terminates headers
2. Then there might be some optional commit-headers which can override
things (`From`)
3. Commit message
4. `---`
5. Look for a regex `^diff` line
• Now the indentation will tell you when it ends
6. `^Range-diff` and `^Interdiff` can also make an appearance in this
section
> Anyway, it seems to turn out, that no escaping is done for the commit
> message in the patch format and that this can cause actual breakage
> with valid commit messages.
>
> Consider the following example:
> 1. I create a fresh repo, add a test file and use a commit message,
> which contains a From line (even with the "magic" timestamp) and
> some made up commit id (0000...)
>
> ~/test$ git init foo; cd foo
> Initialized empty Git repository in /home/calestyo/test/foo/.git/
> ~/test/foo$ echo a >f; git add f
> ~/test/foo$ git commit -m "msg1
>
> From 0000000000000000000000000000000061603705 Mon Sep 17 00:00:00 2001
> --
> ---"
> [master (root-commit) c08debc] msg1
> 1 file changed, 1 insertion(+)
> create mode 100644 f
At first I thought that it is a magic string for a reason. But there
could be an organic use-case here. Like someone who is merely
code-quoting the magic string. And also using code fences.
Parse the magic mbox string
Turns out that Git has something weird: a magic string to delimit emails
or something. Look here:
```
From 0000000000000000000000000000000061603705 Mon Sep 17 00:00:00 2001
```
It will look exactly like this except all those numbers in the middle.
Detecting this is all we need to parse that 80-message file that Pedro
sent us of our school project backup.
This will not apply because “Patch is empty”.
But if I change the commit message to use indentation for the
code block:
Turns out that Git has something weird: a magic string to delimit emails
or something. Look here:
From 0000000000000000000000000000000061603705 Mon Sep 17 00:00:00 2001
It does apply.
As for the `---`: it seems natural for people to use this as a section
break. And I do end up needing that from time to time (maybe once a
year). For people who stick to ASCII (`-` times 5):
Detecting this is all we need to parse that 80-message file that Pedro
sent us of our school project backup.
-----
We should look into Pedro’s backup system. Why does he use email?
Or `***`.
> 3. Adding a 2nd commit, this time using the unified diff from the above
> patch as commit message body(!).
>
> ~/test/foo$ echo b >>f; git add f
> ~/test/foo$ git commit -m "msg2
>
> diff --git a/f b/f
> new file mode 100644
> index 0000000..7898192
> --- /dev/null
> +++ b/f
> @@ -0,0 +1 @@
> +a
> --
> 2.45.2"
> [master 6bbe38c] msg2
> 1 file changed, 1 insertion(+)
> ~/test/foo$ git format-patch --root
> 0001-msg1.patch
> 0002-msg2.patch
>
>
> 4. To no surprise, git itself of course knows the difference between
> commit message and actual patch, as show e.g. by the following,
> where the commit message is indented (by git):
This has happened out in the wild before.[1] And the advice was the same
as what you’re doing here.
🔗 1: https://lore.kernel.org/git/16297305.cDA1TJNmNo@earendil/#t
>
> $ git log --patch | cat
> commit 6bbe38c33680239ac9767e0e5095f9f32ad41ade
> Author: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
> Date: Mon Nov 4 20:00:20 2024 +0100
>
> msg2
>
> diff --git a/f b/f
> new file mode 100644
> index 0000000..7898192
> --- /dev/null
> +++ b/f
> @@ -0,0 +1 @@
> +a
> --
> 2.45.2
>
> diff --git a/f b/f
> index 7898192..422c2b7 100644
> --- a/f
> +++ b/f
> @@ -1 +1,2 @@
> a
> +b
>
> commit c08debcc502c78786ec71d50686ff0445a13b654
> Author: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
> Date: Mon Nov 4 19:58:45 2024 +0100
>
> msg1
>
> From 0000000000000000000000000000000061603705 Mon Sep 17 00:00:00 2001
> --
> ---
>
> diff --git a/f b/f
> new file mode 100644
> index 0000000..7898192
> --- /dev/null
> +++ b/f
> @@ -0,0 +1 @@
> +a
>
>
> 5. Next I try whether git am can use the patches created above in a
> fresh repo:
>
> ~/test/foo$ cd ..; git init bar; cd bar
> Initialized empty Git repository in /home/calestyo/test/bar/.git/
> ~/test/bar$ git am ../foo/0001-msg1.patch
> Patch is empty.
> hint: When you have resolved this problem, run "git am --continue".
> hint: If you prefer to skip this patch, run "git am --skip" instead.
> hint: To record the empty patch as an empty commit, run "git am
> --allow-empty".
> hint: To restore the original branch and stop patching, run "git am
> --abort".
> hint: Disable this message with "git config advice.mergeConflict
> false"
>
> That already fails for the first patch, the reason probably being my
> From 0000...
> line in the commit message.
>
>
> 6. So trying again with simply that From 000.. line removed
>
> ~/test/bar$ sed -i '/^From 00000/d' ../foo/0001-msg1.patch
> ~/test/bar$ git am ../foo/0001-msg1.patch
> fatal: previous rebase directory .git/rebase-apply still exists but
> mbox given.
>
> and again on a freshly created repo:
>
> ~/test/bar$ cd ..; rm -rf bar; git init bar; cd bar
> Initialized empty Git repository in /home/calestyo/test/bar/.git/
> ~/test/bar$ git am ../foo/0001-msg1.patch
> Applying: msg1
> applying to an empty history
>
> Ah, now it works, so it was indeed the (unusual but still valid
> commit message).
>
>
> 7. Now that 0001-msg1.patch is applied, let's try the 2nd patch:
>
> ~/test/bar$ git am ../foo/0002-msg2.patch
> Applying: msg2
> error: f: already exists in index
> Patch failed at 0001 msg2
> hint: Use 'git am --show-current-patch=diff' to see the failed patch
> hint: When you have resolved this problem, run "git am --continue".
> hint: If you prefer to skip this patch, run "git am --skip" instead.
> hint: To restore the original branch and stop patching, run "git am --abort".
> hint: Disable this message with "git config advice.mergeConflict false"
> ~/test/bar$
>
> And again, ... the reason most likely git not being able to get that
> the "first diff" is not actually a diff but part of the commit message.
It did the right thing for me with this (last part) of the commit message:
We should look into Pedro’s backup system. Why does he use email?
diff --git a/a.txt b/a.txt
index ce01362503..a32119c8aa 100644
--- a/a.txt
+++ b/a.txt
@@ -1 +1,2 @@
hello
+goodbye
***
The `---` part could show up in a commit message naturally. I guess the
status quote remedy (?) is to get burned once with a failed application
and not use that as a section break any more.
It seems like it would be nice if format-patch complained if it found
regex `^---$` in the commit body. It is not at all a magic
(obfuscated/unlikely) string. If it crashes on that you can fix it
straight away. Then you don’t have to wait for two email roundtrips and
an git-am(1) application.
The magic string is unlikely but could happen. The solution is to use
an indented block. Same for the diff. (Hopefully few have to
code-quote diffs)
But escaping things in this format? That has knock-on effects like
escaping-escaping. Certainly with one-character escapes. I mean if you
are trying to be resistant to all kinds of strings and characters, and
code blocks (with fences, not indented) can contain basically anything,
you could end up with likely collisions. Then the format gets harder to
read.
Maybe with yet more magic strings since (unlike single chars) they are
unlikely to occur in the input:
Detecting this is all we need to parse that 80-message file that Pedro
sent us of our school project backup.
(blasphemous bath tub)---
We should look into Pedro’s backup system. Why does he use email?
And if you really meant to write that literally:
Detecting this is all we need to parse that 80-message file that Pedro
sent us of our school project backup.
(blasphemous bath tub)(blasphemous bath tub)---
We should look into Pedro’s backup system. Why does he use email?
But then the problem is that people have to look at that. Just because
they wanted to use `---` for section break. Then they’re probably just
gonna stop using it for section breaks. Back to square one (could have
just forbidden it).
> btw and shamelessly off-topic question:
> Any chance that git format-patch / am will ever support keeping track
> of the branch/merge history of generated / applied patches?
> That would be really neat.
What does this mean more concretely? :)
> PS: Not subscribed, so please keep me CCed in case you want me to read
> any possible replies :-)
That’s the list policy.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: git format-patch escaping issues in the patch format
2024-11-04 19:24 git format-patch escaping issues in the patch format Christoph Anton Mitterer
2024-11-04 22:15 ` Kristoffer Haugsbakk
@ 2024-11-04 23:54 ` Jeff King
2024-11-05 1:03 ` Christoph Anton Mitterer
2024-11-05 0:22 ` Junio C Hamano
2 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2024-11-04 23:54 UTC (permalink / raw)
To: Christoph Anton Mitterer; +Cc: git
On Mon, Nov 04, 2024 at 08:24:14PM +0100, Christoph Anton Mitterer wrote:
> 2. The format-patch for that looks already suspicious:
> - The From line is not escaped (as some variants of mbox would do,
> some properly some, causing corruption by the escaping with >
> itself).
As you note, the mbox format is not well defined. :) The variant with
">"-quoting of "From" lines is often called "mboxrd", and you can get it
with the "--format=mboxrd" option.
> - What the format may think of as a separator after the commit
> message (namely the ---) cannot be used as that either, as a ---
> in the commit message is again not escaped.
For this, though, I don't think there is any solution. The receiving
side of "git am" does not know of any unquoting mechanism. So even if
you wanted to quote it, you could not get a lossless transmission of the
commit message.
This does occasionally cause confusion. Especially if you include an
unindented diff in your commit message, which similarly (and
intentionally) triggers the "---" detection. But I think it's one of
those things that just doesn't come up often enough for anybody to have
cared about trying to address.
-Peff
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: git format-patch escaping issues in the patch format
2024-11-04 19:24 git format-patch escaping issues in the patch format Christoph Anton Mitterer
2024-11-04 22:15 ` Kristoffer Haugsbakk
2024-11-04 23:54 ` Jeff King
@ 2024-11-05 0:22 ` Junio C Hamano
2024-11-05 1:01 ` Christoph Anton Mitterer
2 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2024-11-05 0:22 UTC (permalink / raw)
To: Christoph Anton Mitterer; +Cc: git
Christoph Anton Mitterer <calestyo@scientia.org> writes:
> There seems unfortunately only little (written) definition of that
> format, git-format-patch(1) merely says it's in UNIX mailbox format
> (which itself is, AFAIK, not really formally defined).
>
>
> Anyway, it seems to turn out, that no escaping is done for the commit
> message in the patch format and that this can cause actual breakage
> with valid commit messages.
Yes, so ...
> Consider the following example:
> ~/test/foo$ git commit -m "msg1
>
> From 0000000000000000000000000000000061603705 Mon Sep 17 00:00:00 2001
> --
... this falls squarely into "if it hurts, don't do it" category.
A commonly used trick, when you are working on Git and have to
include such a line in the commit log message, e.g., when you
discuss how the output produced by "git log --format=email" looks
like, is to indent such lines in the paragraph you talk about them,
e.g.
In a format-patch output file, the contents of each commit
begins with
From 8f8d6eee531b3fa1a8ef14f169b0cb5035f7a772 Mon Sep 17 00:00:00 2001
where the string 8f8d... is replaced with the name of the commit
object the patch was taken from.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: git format-patch escaping issues in the patch format
2024-11-05 0:22 ` Junio C Hamano
@ 2024-11-05 1:01 ` Christoph Anton Mitterer
0 siblings, 0 replies; 10+ messages in thread
From: Christoph Anton Mitterer @ 2024-11-05 1:01 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
Hey.
On Mon, 2024-11-04 at 16:22 -0800, Junio C Hamano wrote:
> ... this falls squarely into "if it hurts, don't do it" category.
I rather thought it would be the category "if it's a bug, that is
completely non-obvious and not really to expect, with not even a
warning in place,... it should better be fixed" ;-)
There is not reason for any commit author to assume that these strings
are not valid, or possibly cause later breakage.
Even if so, it seems not really feasible to know any possible strings,
especially when people might just copy&paste output from others into
their commit message.
To the least there should be an error when format-patch creates a one
that cannot be parsed afterwards.
And a proper solution would employ some form of escaping.
And if one's on the other side, and wants to write a parser, one has no
real chance of assuring that only "valid" input is used by users in a
commit message.
One can easily imagine hat this could accidentally (or intentionally)
be used apply undesired patches.
Cheers,
Chris.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: git format-patch escaping issues in the patch format
2024-11-04 23:54 ` Jeff King
@ 2024-11-05 1:03 ` Christoph Anton Mitterer
2024-11-05 15:05 ` Jeff King
0 siblings, 1 reply; 10+ messages in thread
From: Christoph Anton Mitterer @ 2024-11-05 1:03 UTC (permalink / raw)
To: Jeff King; +Cc: git
On Mon, 2024-11-04 at 18:54 -0500, Jeff King wrote:
> As you note, the mbox format is not well defined. :) The variant with
> ">"-quoting of "From" lines is often called "mboxrd", and you can get
> it
> with the "--format=mboxrd" option.
But as you've said below, here too, the receiving side most likely
doesn’t know that and then a wrong commit message would be applied
(with no unescaping being performed or it would simply break again when
the magic From is used).
Cheers,
Chris.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: git format-patch escaping issues in the patch format
2024-11-04 22:15 ` Kristoffer Haugsbakk
@ 2024-11-05 1:26 ` Christoph Anton Mitterer
2024-11-05 14:32 ` Kristoffer Haugsbakk
0 siblings, 1 reply; 10+ messages in thread
From: Christoph Anton Mitterer @ 2024-11-05 1:26 UTC (permalink / raw)
To: Kristoffer Haugsbakk, git
On Mon, 2024-11-04 at 23:15 +0100, Kristoffer Haugsbakk wrote:
> It seems to me (totally naïve) that you would do something like
>
> 1. Blank line terminates headers
> 2. Then there might be some optional commit-headers which can
> override
> things (`From`)
> 3. Commit message
> 4. `---`
> 5. Look for a regex `^diff` line
> • Now the indentation will tell you when it ends
> 6. `^Range-diff` and `^Interdiff` can also make an appearance in this
> section
Well as you've seen by the follow-up, such a naive approach is not
really possible, as the commit message may also contain ---, unified
diffs, etc.
> At first I thought that it is a magic string for a reason.
I think such magic cookies can always only (safely) be used to detect a
type, but not as a separator, at least not when free text is allowed,
too.
> It did the right thing for me with this (last part) of the commit
> message:
>
> We should look into Pedro’s backup system. Why does he use
> email?
>
> diff --git a/a.txt b/a.txt
> index ce01362503..a32119c8aa 100644
> --- a/a.txt
> +++ b/a.txt
> @@ -1 +1,2 @@
> hello
> +goodbye
>
> ***
But yours look indented? I mean then it's clear it works.
> It seems like it would be nice if format-patch complained if it found
> regex `^---$` in the commit body.
Actually already when committing... cause there it's taken as valid and
then it should also work with any following tools.
> The magic string is unlikely but could happen. The solution is to
> use
> an indented block. Same for the diff. (Hopefully few have to
> code-quote diffs)
As written in the other mail, there is nothing real obvious for the
user that this wouldn’t be allowed, and in fact committing and such
works.
The simple problem here is the fuzzy format which cannot be parsed
properly.
> But escaping things in this format?
Coudln't one do something like this:
If the line consisting of three - is the line that ends the commit
message, check during format-patch whether it's contained in that.
If not, generate the patch as now.
If so, use another magic timestamp and/or (since that might get lost
when sending as mail, set some X-git-patch-format: header, there adding
perhaps a flag like "escaped" and if that's set, any line that matches
the regexp:
>*---
get's another > prepended when escaping, and one removed when
unescaping (well in the latter only lines that match >+---).
* = 0..n
+ = 1..n
Or probably thinking about some more sophisticated solution or at least
a better character than > .
> > btw and shamelessly off-topic question:
> > Any chance that git format-patch / am will ever support keeping
> > track
> > of the branch/merge history of generated / applied patches?
> > That would be really neat.
>
> What does this mean more concretely? :)
I guess I want what's been asked there:
https://stackoverflow.com/questions/2285699/git-how-to-create-patches-for-a-merge
In short, imagine I have the following commit tree (just a simple
example with --no-ff):
*
|\
|*
|/
*
And I make a series of git format-patch patches from that, it would be
nice if git am *.patches, give back that structure (i.e. with the
branch and the merge).
Cheers,
Chris.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: git format-patch escaping issues in the patch format
2024-11-05 1:26 ` Christoph Anton Mitterer
@ 2024-11-05 14:32 ` Kristoffer Haugsbakk
2024-11-05 15:02 ` Christoph Anton Mitterer
0 siblings, 1 reply; 10+ messages in thread
From: Kristoffer Haugsbakk @ 2024-11-05 14:32 UTC (permalink / raw)
To: Christoph Anton Mitterer, git
On Tue, Nov 5, 2024, at 02:26, Christoph Anton Mitterer wrote:
> On Mon, 2024-11-04 at 23:15 +0100, Kristoffer Haugsbakk wrote:
>> It seems to me (totally naïve) that you would do something like
>>
>> 1. Blank line terminates headers
>> 2. Then there might be some optional commit-headers which can
>> override
>> things (`From`)
>> 3. Commit message
>> 4. `---`
>> 5. Look for a regex `^diff` line
>> • Now the indentation will tell you when it ends
>> 6. `^Range-diff` and `^Interdiff` can also make an appearance in this
>> section
>
> Well as you've seen by the follow-up, such a naive approach is not
> really possible, as the commit message may also contain ---, unified
> diffs, etc.
It’s possible in the sense that it would work just as well as
git-format-patch(1).
You could make it more robust with some backtracking, like finding the
last `^diff` and movig back. That’s OK in a file with only one patch
(`.patch`) but harder to do for an mbox file.
> […]
>> It seems like it would be nice if format-patch complained if it found
>> regex `^---$` in the commit body.
>
> Actually already when committing... cause there it's taken as valid and
> then it should also work with any following tools.
That would inconvenience all users that never use format-patch. You
could provide an advice/hint about some configuration variable which
turns off this new default. But that’s a lot of work to benefit
format-patch users.
Such an error would have to be a default because an opt-in would only be
discovered (in practice) after you’ve been burned once. And then the
opt-in error is of very little value. You’re realistically not going to
forget and write commit messages with `^---$` after that.
>> The magic string is unlikely but could happen. The solution is to
>> use
>> an indented block. Same for the diff. (Hopefully few have to
>> code-quote diffs)
>
> As written in the other mail, there is nothing real obvious for the
> user that this wouldn’t be allowed, and in fact committing and such
> works.
> The simple problem here is the fuzzy format which cannot be parsed
> properly.
It’s not non-obvious either. The simple format is apparent if you
review your patches before sending them out into the world. (I’m
paranoid so I do that)
>> But escaping things in this format?
>
> Coudln't one do something like this:
>
> If the line consisting of three - is the line that ends the commit
> message, check during format-patch whether it's contained in that.
> If not, generate the patch as now.
> If so, use another magic timestamp and/or (since that might get lost
> when sending as mail, set some X-git-patch-format: header, there adding
> perhaps a flag like "escaped" and if that's set, any line that matches
> the regexp:
>>*---
> get's another > prepended when escaping, and one removed when
> unescaping (well in the latter only lines that match >+---).
> * = 0..n
> + = 1..n
>
> Or probably thinking about some more sophisticated solution or at least
> a better character than > .
That’s interesting and a good idea to use an email header to signal the
escaping.
Thinking just about `^---$`: an email header could be generated if
`^---$` occurs in the commit message. Then it could suggest something
non-occurring instead. Simply `-----` or `***` or something.
But this might not just help with the commit message separator.
Apparently the status quo is that git-am(1) will try to apply the diff
in a commit message. Even though it hasn’t seen `^---$` yet. But if it
was required to see that first then you could do something like:
```
diff ...
<diff content>
```
And not have any issues.
Of course you can’t change the magic mbox string. But I don’t see how
you can’t do some more expensive parsing and require that you see the
end of the commit message before parsing this string as the
marker/delimiter.
Some niggles: the commit message might just be the subject line and
there might be no diff (empty patch). Then looking for `^---$` will
take you too far. Maybe just look for the signature line.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: git format-patch escaping issues in the patch format
2024-11-05 14:32 ` Kristoffer Haugsbakk
@ 2024-11-05 15:02 ` Christoph Anton Mitterer
0 siblings, 0 replies; 10+ messages in thread
From: Christoph Anton Mitterer @ 2024-11-05 15:02 UTC (permalink / raw)
To: Kristoffer Haugsbakk, git
On Tue, 2024-11-05 at 15:32 +0100, Kristoffer Haugsbakk wrote:
> You could make it more robust with some backtracking, like finding
> the
> last `^diff` and movig back. That’s OK in a file with only one patch
> (`.patch`) but harder to do for an mbox file.
I wonder whether that (parsing from the end) is really a proper
solution or whether it could still break.
A patch can contain multiple diffs, like in:
From b3b82e59ea52fc059dc23ecc1a3cc0810d297b10 Mon Sep 17 00:00:00 2001
From: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
Date: Tue, 5 Nov 2024 15:40:13 +0100
Subject: [PATCH] foo
---
a | 1 +
b | 1 +
2 files changed, 2 insertions(+)
create mode 100644 a
create mode 100644 b
diff --git a/a b/a
new file mode 100644
index 0000000..ae5b3c5
--- /dev/null
+++ b/a
@@ -0,0 +1 @@
+Aaa
diff --git a/b b/b
new file mode 100644
index 0000000..f761ec1
--- /dev/null
+++ b/b
@@ -0,0 +1 @@
+bbb
--
2.45.2
But the unified diff shouldn't be able to contain newlines or a line
consisting only of --- .
So would a proper description be:
From <ID> Mon Sep 17 00:00:00 2001
<mail style headers, no empty lines>
<exactly one empty line to separate headers from commit message>
<any arbitrary sequence of zero or more lines>
---
<a number of lines (not sure whether it could be zero) all starting with a space>
<exactly one empty line>
<n unified diffs, no empty lines, no lines consisting solely of --- >
--
<version>
?
Seems still pretty brittle.
>
> > Actually already when committing... cause there it's taken as valid
> > and
> > then it should also work with any following tools.
>
> That would inconvenience all users that never use format-patch.
Sure, the real solution would still be to do proper escaping.
> It’s not non-obvious either. The simple format is apparent if you
> review your patches before sending them out into the world. (I’m
> paranoid so I do that)
The whole arguing "the user must check what he writes in the commit
message" seems a bit to me as if users would need to think about not
being allowed to use characters like < etc. when they write text which
might be stored as HTML, because that wouldn't provide a quoting
mechanism which an editor could automatically employ.
> That’s interesting and a good idea to use an email header to signal
> the
> escaping.
My first idea was using MIME, but I guess many people wouldn’t be all
too delighted seeing patches with MIME and the commit message encoded
as base64 or so ;-)
So any quoting should be still human readable (though git already does
use some IMO not so human readable encoding for Subject: lines).
Using something that is already standard would be of course nicer, but
if it's not accepted, than better a simple schema that still works.
> Thinking just about `^---$`: an email header could be generated if
> `^---$` occurs in the commit message. Then it could suggest
> something
> non-occurring instead. Simply `-----` or `***` or something.
Should work, but if one has strange enough commit messages, either the
new separator would get stranger and stranger, or, if there was just a
hardcoded list of alternatives, one could run out of alternatives.
If a header would get introduced one should make it generic enough...
e.g. to allow for different quoting schemas, or even to allow for
completely other format information.
> Some niggles: the commit message might just be the subject line and
> there might be no diff (empty patch). Then looking for `^---$` will
> take you too far. Maybe just look for the signature line.
Another way might be to simply store in a header just how many lines of
commit messages follow.
But I think that would have also some downsides.
Cheers,
Chris.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: git format-patch escaping issues in the patch format
2024-11-05 1:03 ` Christoph Anton Mitterer
@ 2024-11-05 15:05 ` Jeff King
0 siblings, 0 replies; 10+ messages in thread
From: Jeff King @ 2024-11-05 15:05 UTC (permalink / raw)
To: Christoph Anton Mitterer; +Cc: git
On Tue, Nov 05, 2024 at 02:03:21AM +0100, Christoph Anton Mitterer wrote:
> On Mon, 2024-11-04 at 18:54 -0500, Jeff King wrote:
> > As you note, the mbox format is not well defined. :) The variant with
> > ">"-quoting of "From" lines is often called "mboxrd", and you can get
> > it
> > with the "--format=mboxrd" option.
>
> But as you've said below, here too, the receiving side most likely
> doesn’t know that and then a wrong commit message would be applied
> (with no unescaping being performed or it would simply break again when
> the magic From is used).
I think there are two differences between ">From" and quote "---":
1. There are already mbox parsers that understand and unquote ">From".
If you know you are feeding your patches to such a parser (e.g.,
the one in mutt) then using "mboxrd" makes sense.
2. The consumer of the mbox format is much "closer" than the consumer
of the "---" line.
If you produce some patches with format-patch, you might feed them
to your MUA or another local tool, and you know how that tool will
unquote them. Once you email the patches, that ">From" quoting is
irrelevant, because the receiver will get individual emails without
the quoting. They may themselves choose to store them in an mbox,
of course, but the details are up to them.
Whereas the "---" line is inside the email, and will be interpreted
on the other end by a tool like "git am". So if we introduce
quoting in format-patch and unquoting in git-am, then correctly
unquoting will depend on the version that the other side is using.
IMHO that is not necessarily a reason to avoid implementing quoting. If
you do accidentally put "---" in a commit message, the current outcome
is for your commit message to get silently cut off by "git am". But in a
world where we quote it and the other side is too old to unquote, then
they may end up with an extra quoting character in the result. But that
is probably a less bad outcome overall.
There's also another flip-side mistake, which is somebody who doesn't
quote sends to somebody who unquotes, and the result is similarly
slightly corrupted. The mbox equivalent is that you meant to say
">From" with the ">", but the receiver turned it into just "From".
So I don't think it's a totally wrong direction to go. But like I said,
it doesn't seem to happen often enough in practice that anybody has been
motivated to add the feature.
-Peff
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2024-11-05 15:07 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-04 19:24 git format-patch escaping issues in the patch format Christoph Anton Mitterer
2024-11-04 22:15 ` Kristoffer Haugsbakk
2024-11-05 1:26 ` Christoph Anton Mitterer
2024-11-05 14:32 ` Kristoffer Haugsbakk
2024-11-05 15:02 ` Christoph Anton Mitterer
2024-11-04 23:54 ` Jeff King
2024-11-05 1:03 ` Christoph Anton Mitterer
2024-11-05 15:05 ` Jeff King
2024-11-05 0:22 ` Junio C Hamano
2024-11-05 1:01 ` Christoph Anton Mitterer
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).