* [PATCH] submodule: truncate the oid when fetchig commits
@ 2025-08-14 15:06 Michael Schroeder
2025-08-14 15:42 ` Junio C Hamano
2025-08-14 22:16 ` [PATCH] " brian m. carlson
0 siblings, 2 replies; 6+ messages in thread
From: Michael Schroeder @ 2025-08-14 15:06 UTC (permalink / raw)
To: git; +Cc: gitster
If a submodule uses a different hash algorithm than used in
the main repository, the recorded submodule commit is padded
with zeros. This is usually not a problem as the default is to
do submodule clones non-shallow and the commit can be found
in the local objects.
But this is not true if the --shallow-submodules clone option is
used (or the --depth option in the submodule update call).
In this case, the commit is often not reachable and a fetch of the
specific commit is done. But the fetch cannot deal with the zero
padding and interprets the commit as a name. Because of this,
the checkout will fail.
Implement truncation of the recorded commit to the correct size
corresponding to the hash algorithm used in the submodule.
Signed-off-by: Michael Schroeder <mls@suse.de>
---
builtin/submodule--helper.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/builtin/submodule--helper.c b/builtin/submodule--helper.c
index 07a1935cbe..ef21eb42b8 100644
--- a/builtin/submodule--helper.c
+++ b/builtin/submodule--helper.c
@@ -72,7 +72,7 @@ static char *resolve_relative_url(const char *rel_url, const char *up_path, int
return resolved_url;
}
-static int get_default_remote_submodule(const char *module_path, char **default_remote)
+static int get_default_remote_submodule(const char *module_path, char **default_remote, const struct git_hash_algo **hash_algo)
{
const struct submodule *sub;
struct repository subrepo;
@@ -106,6 +106,9 @@ static int get_default_remote_submodule(const char *module_path, char **default_
*default_remote = xstrdup(remote_name);
+ if (hash_algo)
+ *hash_algo = subrepo.hash_algo;
+
repo_clear(&subrepo);
free(url);
@@ -1272,7 +1275,7 @@ static void sync_submodule(const char *path, const char *prefix,
goto cleanup;
strbuf_reset(&sb);
- code = get_default_remote_submodule(path, &default_remote);
+ code = get_default_remote_submodule(path, &default_remote, NULL);
if (code)
exit(code);
@@ -2319,16 +2322,19 @@ static int fetch_in_submodule(const char *module_path, int depth, int quiet,
if (depth)
strvec_pushf(&cp.args, "--depth=%d", depth);
if (oid) {
- char *hex = oid_to_hex(oid);
+ char hexbuffer[GIT_MAX_HEXSZ + 1];
+ char *hex = oid_to_hex_r(hexbuffer, oid);
char *remote;
+ const struct git_hash_algo *hash_algo = NULL;
int code;
- code = get_default_remote_submodule(module_path, &remote);
+ code = get_default_remote_submodule(module_path, &remote, &hash_algo);
if (code) {
child_process_clear(&cp);
return code;
}
-
+ if (hash_algo)
+ hex[hash_algo->hexsz] = 0; /* truncate to correct size */
strvec_pushl(&cp.args, remote, hex, NULL);
free(remote);
}
@@ -2635,7 +2641,7 @@ static int update_submodule(struct update_data *update_data)
char *remote_ref;
int code;
- code = get_default_remote_submodule(update_data->sm_path, &remote_name);
+ code = get_default_remote_submodule(update_data->sm_path, &remote_name, NULL);
if (code)
return code;
code = remote_submodule_branch(update_data->sm_path, &branch);
--
2.50.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] submodule: truncate the oid when fetchig commits
2025-08-14 15:06 [PATCH] submodule: truncate the oid when fetchig commits Michael Schroeder
@ 2025-08-14 15:42 ` Junio C Hamano
2025-08-18 9:15 ` [PATCH v2] " Michael Schroeder
2025-08-14 22:16 ` [PATCH] " brian m. carlson
1 sibling, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2025-08-14 15:42 UTC (permalink / raw)
To: Michael Schroeder; +Cc: git
Michael Schroeder <mls@suse.de> writes:
> If a submodule uses a different hash algorithm than used in
> the main repository, the recorded submodule commit is padded
> with zeros.
Hmph. I suspect that this "extra zero bits" would happen when your
superproject uses a longer hash (e.g. SHA-256) than the hash used by
the submodule you are fetching? If the arrangement is reversed,
would we see a different and possibly an even severe problem?
I do not mean to say that you need to solve the problem in both
direction at all. But perhaps
... uses a hash algorithm with shorter hash length than what is
used in the main repository, ...
would clarify what problem you are solving in this patch better.
> This is usually not a problem as the default is to
> do submodule clones non-shallow and the commit can be found
> in the local objects.
>
> But this is not true if the --shallow-submodules clone option is
> used (or the --depth option in the submodule update call).
> In this case, the commit is often not reachable and a fetch of the
> specific commit is done. But the fetch cannot deal with the zero
> padding and interprets the commit as a name. Because of this,
> the checkout will fail.
>
> Implement truncation of the recorded commit to the correct size
> corresponding to the hash algorithm used in the submodule.
When we use the verb "truncate" in the context of this project, I
think we always use the word to mean making it shorter than its
natural length. But in the above, the word is used to remove excess
so that we use it at its natural length. I cannot offhand offer a
better phrase, so I'll let it go for now, but suggestions for better
wording is very much welcome.
> builtin/submodule--helper.c | 18 ++++++++++++------
> 1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/builtin/submodule--helper.c b/builtin/submodule--helper.c
> index 07a1935cbe..ef21eb42b8 100644
> --- a/builtin/submodule--helper.c
> +++ b/builtin/submodule--helper.c
> @@ -72,7 +72,7 @@ static char *resolve_relative_url(const char *rel_url, const char *up_path, int
> return resolved_url;
> }
>
> -static int get_default_remote_submodule(const char *module_path, char **default_remote)
> +static int get_default_remote_submodule(const char *module_path, char **default_remote, const struct git_hash_algo **hash_algo)
> {
> const struct submodule *sub;
> struct repository subrepo;
> @@ -106,6 +106,9 @@ static int get_default_remote_submodule(const char *module_path, char **default_
>
> *default_remote = xstrdup(remote_name);
>
> + if (hash_algo)
> + *hash_algo = subrepo.hash_algo;
> +
> repo_clear(&subrepo);
> free(url);
>
> @@ -1272,7 +1275,7 @@ static void sync_submodule(const char *path, const char *prefix,
> goto cleanup;
>
> strbuf_reset(&sb);
> - code = get_default_remote_submodule(path, &default_remote);
> + code = get_default_remote_submodule(path, &default_remote, NULL);
> if (code)
> exit(code);
>
> @@ -2319,16 +2322,19 @@ static int fetch_in_submodule(const char *module_path, int depth, int quiet,
> if (depth)
> strvec_pushf(&cp.args, "--depth=%d", depth);
> if (oid) {
> - char *hex = oid_to_hex(oid);
> + char hexbuffer[GIT_MAX_HEXSZ + 1];
> + char *hex = oid_to_hex_r(hexbuffer, oid);
> char *remote;
> + const struct git_hash_algo *hash_algo = NULL;
> int code;
>
> - code = get_default_remote_submodule(module_path, &remote);
> + code = get_default_remote_submodule(module_path, &remote, &hash_algo);
It feels highly unsatisfying to have oid_to_hex_r() blindly read and
convert oid to hex (assuming the hash length of the current
repository) without using the knowledge of how much of that object
name is valid. We obtain that knowledge shortly after we call
oid_to_hex_r() in hash_algo here. That is the only reason why we
manually have to truncate below, isn't it?
In other words, shouldn't we be finding out hash_algo first, and
then use something like hash_to_hex_algop_r() to convert only the
valid bits of the oid into the buffer?. That way, we do not have to
manually add '\0' in this function below.
> if (code) {
> child_process_clear(&cp);
> return code;
> }
> -
> + if (hash_algo)
> + hex[hash_algo->hexsz] = 0; /* truncate to correct size */
> strvec_pushl(&cp.args, remote, hex, NULL);
> free(remote);
> }
> @@ -2635,7 +2641,7 @@ static int update_submodule(struct update_data *update_data)
> char *remote_ref;
> int code;
>
> - code = get_default_remote_submodule(update_data->sm_path, &remote_name);
> + code = get_default_remote_submodule(update_data->sm_path, &remote_name, NULL);
> if (code)
> return code;
> code = remote_submodule_branch(update_data->sm_path, &branch);
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] submodule: truncate the oid when fetchig commits
2025-08-14 15:06 [PATCH] submodule: truncate the oid when fetchig commits Michael Schroeder
2025-08-14 15:42 ` Junio C Hamano
@ 2025-08-14 22:16 ` brian m. carlson
2025-08-18 9:30 ` Michael Schroeder
1 sibling, 1 reply; 6+ messages in thread
From: brian m. carlson @ 2025-08-14 22:16 UTC (permalink / raw)
To: Michael Schroeder; +Cc: git, gitster
[-- Attachment #1: Type: text/plain, Size: 1007 bytes --]
On 2025-08-14 at 15:06:32, Michael Schroeder wrote:
> If a submodule uses a different hash algorithm than used in
> the main repository, the recorded submodule commit is padded
> with zeros. This is usually not a problem as the default is to
> do submodule clones non-shallow and the commit can be found
> in the local objects.
This should not even work at all. It may currently behave as you
suggest when the main repository is SHA-256 and the submodule is SHA-1,
but it will corrupt the data if the submodule is SHA-256 and the main
repository is SHA-1, since then the data will be truncated.
The proper way for this to work is that the SHA-1 version of the
repository stores submodules in their SHA-1 states and the SHA-256
version of the repository stores submodules in their SHA-256 states.
Yes, this means that you have to convert submodules to the same
algorithm, but that's required because trees are binary and not text.
--
brian m. carlson (they/them)
Toronto, Ontario, CA
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2] submodule: truncate the oid when fetchig commits
2025-08-14 15:42 ` Junio C Hamano
@ 2025-08-18 9:15 ` Michael Schroeder
0 siblings, 0 replies; 6+ messages in thread
From: Michael Schroeder @ 2025-08-18 9:15 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
Hi git team,
here's the new version of the patch based on your comments:
From 3962b9cb065504f52875f728f03ecbf8dc1f23e2 Mon Sep 17 00:00:00 2001
From: Michael Schroeder <mls@suse.de>
Date: Thu, 14 Aug 2025 16:12:53 +0200
Subject: [PATCH v2] submodule: use subrepo algo when fetchig commits
If a submodule uses a hash algorithm with shorter hash length than
what is used in the main repository, the recorded submodule commit
is padded with zeros. This is usually not a problem as the default
is to do submodule clones non-shallow and the commit can be found
in the local objects.
But this is not true if the --shallow-submodules clone option is
used (or the --depth option in the submodule update call).
In this case, the commit is often not reachable and a fetch of the
specific commit is done. But the fetch cannot deal with the zero
padding and interprets the commit as a name. Because of this,
the checkout will fail.
Use the subrepo algorithm when converting the stored commit to hex
so that the result matches the expected oid length of the submodule.
Signed-off-by: Michael Schroeder <mls@suse.de>
---
builtin/submodule--helper.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/builtin/submodule--helper.c b/builtin/submodule--helper.c
index 07a1935cbe..5c2e96b517 100644
--- a/builtin/submodule--helper.c
+++ b/builtin/submodule--helper.c
@@ -72,7 +72,7 @@ static char *resolve_relative_url(const char *rel_url, const char *up_path, int
return resolved_url;
}
-static int get_default_remote_submodule(const char *module_path, char **default_remote)
+static int get_default_remote_submodule(const char *module_path, char **default_remote, const struct git_hash_algo **hash_algo)
{
const struct submodule *sub;
struct repository subrepo;
@@ -106,6 +106,9 @@ static int get_default_remote_submodule(const char *module_path, char **default_
*default_remote = xstrdup(remote_name);
+ if (hash_algo)
+ *hash_algo = subrepo.hash_algo;
+
repo_clear(&subrepo);
free(url);
@@ -1272,7 +1275,7 @@ static void sync_submodule(const char *path, const char *prefix,
goto cleanup;
strbuf_reset(&sb);
- code = get_default_remote_submodule(path, &default_remote);
+ code = get_default_remote_submodule(path, &default_remote, NULL);
if (code)
exit(code);
@@ -2319,16 +2322,18 @@ static int fetch_in_submodule(const char *module_path, int depth, int quiet,
if (depth)
strvec_pushf(&cp.args, "--depth=%d", depth);
if (oid) {
- char *hex = oid_to_hex(oid);
+ const struct git_hash_algo *hash_algo = NULL;
+ char *hex;
char *remote;
int code;
- code = get_default_remote_submodule(module_path, &remote);
+ code = get_default_remote_submodule(module_path, &remote, &hash_algo);
if (code) {
child_process_clear(&cp);
return code;
}
+ hex = hash_to_hex_algop(oid->hash, hash_algo);
strvec_pushl(&cp.args, remote, hex, NULL);
free(remote);
}
@@ -2635,7 +2640,7 @@ static int update_submodule(struct update_data *update_data)
char *remote_ref;
int code;
- code = get_default_remote_submodule(update_data->sm_path, &remote_name);
+ code = get_default_remote_submodule(update_data->sm_path, &remote_name, NULL);
if (code)
return code;
code = remote_submodule_branch(update_data->sm_path, &branch);
--
2.50.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] submodule: truncate the oid when fetchig commits
2025-08-14 22:16 ` [PATCH] " brian m. carlson
@ 2025-08-18 9:30 ` Michael Schroeder
2025-08-19 0:45 ` brian m. carlson
0 siblings, 1 reply; 6+ messages in thread
From: Michael Schroeder @ 2025-08-18 9:30 UTC (permalink / raw)
To: brian m. carlson, git, gitster
On Thu, Aug 14, 2025 at 10:16:24PM +0000, brian m. carlson wrote:
> On 2025-08-14 at 15:06:32, Michael Schroeder wrote:
> > If a submodule uses a different hash algorithm than used in
> > the main repository, the recorded submodule commit is padded
> > with zeros. This is usually not a problem as the default is to
> > do submodule clones non-shallow and the commit can be found
> > in the local objects.
>
> This should not even work at all. It may currently behave as you
> suggest when the main repository is SHA-256 and the submodule is SHA-1,
> but it will corrupt the data if the submodule is SHA-256 and the main
> repository is SHA-1, since then the data will be truncated.
But it works, and I'm pretty sure people already use it. If you
have a sha1 main repo and a sha256 submodule, git will truncate
the commit when recording the gitlink. The checkout done by
git submodule update will work as it does the normal prefix matching.
If you have a sha256 main repo and a sha submodule, the recorded
commit is padded with zero. The checkout will also work as git
ignores the extra data since commit 52fca06db2 (object-names: support
input of oids in any supported hash, 2023-10-01).
What doesn't work is if a shallow clone is done for the submodule.
In that case the commit is not reachable and git tries a direct
fetch. This fetch can be made to work if the commit was padded with
zeros. If the commit was truncated, we would probably need some
protocol extension to make the server do a prefix match for a
"want" request.
> The proper way for this to work is that the SHA-1 version of the
> repository stores submodules in their SHA-1 states and the SHA-256
> version of the repository stores submodules in their SHA-256 states.
You mean by using "compatObjectFormat"? I couldn't make that work,
but maybe I missed something. Anyway, I think this also will not
work for shallow clones.
Cheers,
Michael.
--
Michael Schroeder SUSE Software Solutions Germany GmbH
mls@suse.de GF: Ivo Totev HRB 36809, AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] submodule: truncate the oid when fetchig commits
2025-08-18 9:30 ` Michael Schroeder
@ 2025-08-19 0:45 ` brian m. carlson
0 siblings, 0 replies; 6+ messages in thread
From: brian m. carlson @ 2025-08-19 0:45 UTC (permalink / raw)
To: Michael Schroeder; +Cc: git, gitster
[-- Attachment #1: Type: text/plain, Size: 2995 bytes --]
On 2025-08-18 at 09:30:51, Michael Schroeder wrote:
> On Thu, Aug 14, 2025 at 10:16:24PM +0000, brian m. carlson wrote:
> > On 2025-08-14 at 15:06:32, Michael Schroeder wrote:
> > > If a submodule uses a different hash algorithm than used in
> > > the main repository, the recorded submodule commit is padded
> > > with zeros. This is usually not a problem as the default is to
> > > do submodule clones non-shallow and the commit can be found
> > > in the local objects.
> >
> > This should not even work at all. It may currently behave as you
> > suggest when the main repository is SHA-256 and the submodule is SHA-1,
> > but it will corrupt the data if the submodule is SHA-256 and the main
> > repository is SHA-1, since then the data will be truncated.
>
> But it works, and I'm pretty sure people already use it. If you
> have a sha1 main repo and a sha256 submodule, git will truncate
> the commit when recording the gitlink. The checkout done by
> git submodule update will work as it does the normal prefix matching.
Unfortunately, that will break with the interoperability work. The
protocol will learn to convert the object ID on the server side by
announcing the mapping and when the object ID doesn't exist, the client
will die because it can't remap the object and the process will fail.
By doing that, you'll end up with a repository that you can never use
interoperability code on, ever, without rewriting history. There's no
way around this problem because we don't keep the object format in
trees, so we can't distinguish between a SHA-256 submodule that happens
to end in 24 zeros and a SHA-1 submodule.
The entire hash function transition has mandated exactly one object
format on disk and in data structures from the very beginning:
This affects both object names and object content -- both the names
of objects and all references to other objects within an object are
switched to the new hash function.
I apologize that I didn't think about this problem and make the code die
on this case earlier, but it's not a supported configuration and it will
absolutely break in the future. Sorry to be the bearer of bad news.
> > The proper way for this to work is that the SHA-1 version of the
> > repository stores submodules in their SHA-1 states and the SHA-256
> > version of the repository stores submodules in their SHA-256 states.
>
> You mean by using "compatObjectFormat"? I couldn't make that work,
> but maybe I missed something. Anyway, I think this also will not
> work for shallow clones.
There is interoperability code only for loose objects now. The code
that handles packs and interoperability between repositories is in a
branch on my remote. It's work that I'm doing for a talk at Git Merge
and I will send it upstream when it's ready.
Right now, I'm working on shallow clones at the moment and then
submodules are next.
--
brian m. carlson (they/them)
Toronto, Ontario, CA
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2025-08-19 0:45 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-14 15:06 [PATCH] submodule: truncate the oid when fetchig commits Michael Schroeder
2025-08-14 15:42 ` Junio C Hamano
2025-08-18 9:15 ` [PATCH v2] " Michael Schroeder
2025-08-14 22:16 ` [PATCH] " brian m. carlson
2025-08-18 9:30 ` Michael Schroeder
2025-08-19 0:45 ` brian m. carlson
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).