* [PATCH v3 1/5] version: refactor strbuf_sanitize()
2024-12-06 12:42 ` [PATCH v3 0/5] " Christian Couder
@ 2024-12-06 12:42 ` Christian Couder
2024-12-07 6:21 ` Junio C Hamano
2024-12-06 12:42 ` [PATCH v3 2/5] strbuf: refactor strbuf_trim_trailing_ch() Christian Couder
` (5 subsequent siblings)
6 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2024-12-06 12:42 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Taylor Blau,
Eric Sunshine, Christian Couder, Christian Couder
The git_user_agent_sanitized() function performs some sanitizing to
avoid special characters being sent over the line and possibly messing
up with the protocol or with the parsing on the other side.
Let's extract this sanitizing into a new strbuf_sanitize() function, as
we will want to reuse it in a following patch, and let's put it into
strbuf.{c,h}.
While at it, let's also make a few small improvements:
- use 'size_t' for 'i' instead of 'int',
- move the declaration of 'i' inside the 'for ( ... )',
- use strbuf_detach() to explicitly detach the string contained by
the 'sb' strbuf.
Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
strbuf.c | 9 +++++++++
strbuf.h | 7 +++++++
version.c | 9 ++-------
3 files changed, 18 insertions(+), 7 deletions(-)
diff --git a/strbuf.c b/strbuf.c
index 3d2189a7f6..cccfdec0e3 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -1082,3 +1082,12 @@ void strbuf_strip_file_from_path(struct strbuf *sb)
char *path_sep = find_last_dir_sep(sb->buf);
strbuf_setlen(sb, path_sep ? path_sep - sb->buf + 1 : 0);
}
+
+void strbuf_sanitize(struct strbuf *sb)
+{
+ strbuf_trim(sb);
+ for (size_t i = 0; i < sb->len; i++) {
+ if (sb->buf[i] <= 32 || sb->buf[i] >= 127)
+ sb->buf[i] = '.';
+ }
+}
diff --git a/strbuf.h b/strbuf.h
index 003f880ff7..884157873e 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -664,6 +664,13 @@ typedef int (*char_predicate)(char ch);
void strbuf_addstr_urlencode(struct strbuf *sb, const char *name,
char_predicate allow_unencoded_fn);
+/*
+ * Trim and replace each character with ascii code below 32 or above
+ * 127 (included) using a dot '.' character. Useful for sending
+ * capabilities.
+ */
+void strbuf_sanitize(struct strbuf *sb);
+
__attribute__((format (printf,1,2)))
int printf_ln(const char *fmt, ...);
__attribute__((format (printf,2,3)))
diff --git a/version.c b/version.c
index 41b718c29e..951e6dca74 100644
--- a/version.c
+++ b/version.c
@@ -24,15 +24,10 @@ const char *git_user_agent_sanitized(void)
if (!agent) {
struct strbuf buf = STRBUF_INIT;
- int i;
strbuf_addstr(&buf, git_user_agent());
- strbuf_trim(&buf);
- for (i = 0; i < buf.len; i++) {
- if (buf.buf[i] <= 32 || buf.buf[i] >= 127)
- buf.buf[i] = '.';
- }
- agent = buf.buf;
+ strbuf_sanitize(&buf);
+ agent = strbuf_detach(&buf, NULL);
}
return agent;
--
2.47.1.402.gc25c94707f
^ permalink raw reply related [flat|nested] 110+ messages in thread
* Re: [PATCH v3 1/5] version: refactor strbuf_sanitize()
2024-12-06 12:42 ` [PATCH v3 1/5] version: refactor strbuf_sanitize() Christian Couder
@ 2024-12-07 6:21 ` Junio C Hamano
2025-01-27 15:07 ` Christian Couder
0 siblings, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2024-12-07 6:21 UTC (permalink / raw)
To: Christian Couder
Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Christian Couder
Christian Couder <christian.couder@gmail.com> writes:
> +/*
> + * Trim and replace each character with ascii code below 32 or above
> + * 127 (included) using a dot '.' character. Useful for sending
> + * capabilities.
> + */
> +void strbuf_sanitize(struct strbuf *sb);
I am not getting "Useful for sending capabilities" here, and feel
that it is somewhat an unsubstantiated claim. If some information
is going to be transferred (which the phrase "sending capabilities"
hints), I'd expect that we try as hard as possible not to lose
information, but redact-non-ASCII is the total opposite of "not
losing information".
> diff --git a/version.c b/version.c
> index 41b718c29e..951e6dca74 100644
> --- a/version.c
> +++ b/version.c
> @@ -24,15 +24,10 @@ const char *git_user_agent_sanitized(void)
>
> if (!agent) {
> struct strbuf buf = STRBUF_INIT;
> - int i;
>
> strbuf_addstr(&buf, git_user_agent());
> - strbuf_trim(&buf);
> - for (i = 0; i < buf.len; i++) {
> - if (buf.buf[i] <= 32 || buf.buf[i] >= 127)
> - buf.buf[i] = '.';
> - }
> - agent = buf.buf;
> + strbuf_sanitize(&buf);
> + agent = strbuf_detach(&buf, NULL);
> }
>
> return agent;
This is very faithful rewrite of the original. The original had a
strbuf on stack, and after creating user-agent string in it, a
function scope static variable "agent" is made to point at it and
then the stack the strbuf was on is allowed to go out of scope.
Since the variable "agent" is holding onto the piece of memory, the
leak checker does not complain about anything. The rewritten
version is leak-free for exactly the same reason, but because it
calls strbuf_detach() before the strbuf goes out of scope to
officially transfer the ownership to the variable "agent", it tells
what is going on to readers a lot more clearly.
Nicely done.
By the way, as we are trimming, I am very very much tempted to
squish a run of non-ASCII bytes into one dot, perhaps like
void redact_non_printables(struct strbuf *sb)
{
size_t dst = 0;
int skipped = 0;
strbuf_trim(sb);
for (size_t src = 0; src < sb->len; src++) {
int ch = sb->buf[src];
if (ch <= 32 && 127 <= ch) {
if (skipped)
continue;
ch = '.';
}
sb->buf[dst++] = ch;
skipped = (ch == '.');
}
}
or even without strbuf_trim(), which would turn any leading or
trailing run of whitespaces into '.'.
But that is an improvement that can be easily done on top after the
dust settles and better left as #leftoverbits material.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v3 1/5] version: refactor strbuf_sanitize()
2024-12-07 6:21 ` Junio C Hamano
@ 2025-01-27 15:07 ` Christian Couder
0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:07 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Christian Couder, karthik nayak
On Sat, Dec 7, 2024 at 7:21 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > +/*
> > + * Trim and replace each character with ascii code below 32 or above
> > + * 127 (included) using a dot '.' character. Useful for sending
> > + * capabilities.
> > + */
> > +void strbuf_sanitize(struct strbuf *sb);
>
> I am not getting "Useful for sending capabilities" here, and feel
> that it is somewhat an unsubstantiated claim. If some information
> is going to be transferred (which the phrase "sending capabilities"
> hints), I'd expect that we try as hard as possible not to lose
> information, but redact-non-ASCII is the total opposite of "not
> losing information".
Ok, "Useful for sending capabilities" will be removed.
> By the way, as we are trimming, I am very very much tempted to
> squish a run of non-ASCII bytes into one dot, perhaps like
>
> void redact_non_printables(struct strbuf *sb)
> {
> size_t dst = 0;
> int skipped = 0;
>
> strbuf_trim(sb);
> for (size_t src = 0; src < sb->len; src++) {
> int ch = sb->buf[src];
> if (ch <= 32 && 127 <= ch) {
> if (skipped)
> continue;
> ch = '.';
> }
> sb->buf[dst++] = ch;
> skipped = (ch == '.');
> }
> }
>
> or even without strbuf_trim(), which would turn any leading or
> trailing run of whitespaces into '.'.
>
> But that is an improvement that can be easily done on top after the
> dust settles and better left as #leftoverbits material.
Usman's patch series about introducing a "os-version" capability needs
such a feature too, and Usman already reworked this code according to
your comments here. It looks like you found it good too. So I will
just reuse his patches related to this in the version 4 of this patch
series.
^ permalink raw reply [flat|nested] 110+ messages in thread
* [PATCH v3 2/5] strbuf: refactor strbuf_trim_trailing_ch()
2024-12-06 12:42 ` [PATCH v3 0/5] " Christian Couder
2024-12-06 12:42 ` [PATCH v3 1/5] version: refactor strbuf_sanitize() Christian Couder
@ 2024-12-06 12:42 ` Christian Couder
2024-12-07 6:35 ` Junio C Hamano
2024-12-16 11:47 ` karthik nayak
2024-12-06 12:42 ` [PATCH v3 3/5] Add 'promisor-remote' capability to protocol v2 Christian Couder
` (4 subsequent siblings)
6 siblings, 2 replies; 110+ messages in thread
From: Christian Couder @ 2024-12-06 12:42 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Taylor Blau,
Eric Sunshine, Christian Couder, Christian Couder
We often have to split strings at some specified terminator character.
The strbuf_split*() functions, that we can use for this purpose,
return substrings that include the terminator character, so we often
need to remove that character.
When it is a whitespace, newline or directory separator, the
terminator character can easily be removed using an existing triming
function like strbuf_rtrim(), strbuf_trim_trailing_newline() or
strbuf_trim_trailing_dir_sep(). There is no function to remove that
character when it's not one of those characters though.
Let's introduce a new strbuf_trim_trailing_ch() function that can be
used to remove any trailing character, and let's refactor existing code
that manually removed trailing characters using this new function.
We are also going to use this new function in a following commit.
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
strbuf.c | 7 +++++++
strbuf.h | 3 +++
trace2/tr2_cfg.c | 10 ++--------
3 files changed, 12 insertions(+), 8 deletions(-)
diff --git a/strbuf.c b/strbuf.c
index cccfdec0e3..c986ec28f4 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -134,6 +134,13 @@ void strbuf_trim_trailing_dir_sep(struct strbuf *sb)
sb->buf[sb->len] = '\0';
}
+void strbuf_trim_trailing_ch(struct strbuf *sb, int c)
+{
+ while (sb->len > 0 && sb->buf[sb->len - 1] == c)
+ sb->len--;
+ sb->buf[sb->len] = '\0';
+}
+
void strbuf_trim_trailing_newline(struct strbuf *sb)
{
if (sb->len > 0 && sb->buf[sb->len - 1] == '\n') {
diff --git a/strbuf.h b/strbuf.h
index 884157873e..5e389ab065 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -197,6 +197,9 @@ void strbuf_trim_trailing_dir_sep(struct strbuf *sb);
/* Strip trailing LF or CR/LF */
void strbuf_trim_trailing_newline(struct strbuf *sb);
+/* Strip trailing character c */
+void strbuf_trim_trailing_ch(struct strbuf *sb, int c);
+
/**
* Replace the contents of the strbuf with a reencoded form. Returns -1
* on error, 0 on success.
diff --git a/trace2/tr2_cfg.c b/trace2/tr2_cfg.c
index 22a99a0682..9da1f8466c 100644
--- a/trace2/tr2_cfg.c
+++ b/trace2/tr2_cfg.c
@@ -35,10 +35,7 @@ static int tr2_cfg_load_patterns(void)
tr2_cfg_patterns = strbuf_split_buf(envvar, strlen(envvar), ',', -1);
for (s = tr2_cfg_patterns; *s; s++) {
- struct strbuf *buf = *s;
-
- if (buf->len && buf->buf[buf->len - 1] == ',')
- strbuf_setlen(buf, buf->len - 1);
+ strbuf_trim_trailing_ch(*s, ',');
strbuf_trim_trailing_newline(*s);
strbuf_trim(*s);
}
@@ -74,10 +71,7 @@ static int tr2_load_env_vars(void)
tr2_cfg_env_vars = strbuf_split_buf(varlist, strlen(varlist), ',', -1);
for (s = tr2_cfg_env_vars; *s; s++) {
- struct strbuf *buf = *s;
-
- if (buf->len && buf->buf[buf->len - 1] == ',')
- strbuf_setlen(buf, buf->len - 1);
+ strbuf_trim_trailing_ch(*s, ',');
strbuf_trim_trailing_newline(*s);
strbuf_trim(*s);
}
--
2.47.1.402.gc25c94707f
^ permalink raw reply related [flat|nested] 110+ messages in thread
* Re: [PATCH v3 2/5] strbuf: refactor strbuf_trim_trailing_ch()
2024-12-06 12:42 ` [PATCH v3 2/5] strbuf: refactor strbuf_trim_trailing_ch() Christian Couder
@ 2024-12-07 6:35 ` Junio C Hamano
2025-01-27 15:07 ` Christian Couder
2024-12-16 11:47 ` karthik nayak
1 sibling, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2024-12-07 6:35 UTC (permalink / raw)
To: Christian Couder
Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Christian Couder
Christian Couder <christian.couder@gmail.com> writes:
> We often have to split strings at some specified terminator character.
> The strbuf_split*() functions, that we can use for this purpose,
> return substrings that include the terminator character, so we often
> need to remove that character.
>
> When it is a whitespace, newline or directory separator, the
> terminator character can easily be removed using an existing triming
> function like strbuf_rtrim(), strbuf_trim_trailing_newline() or
> strbuf_trim_trailing_dir_sep(). There is no function to remove that
> character when it's not one of those characters though.
Heh, totally uninteresting (alternative being open coding this one).
If we pass, instead of a single character 'c', an array of characters
to be stripped from the right (like strspn() allows you to skip from
the left), I may have been a bit more receptive, though ;-)
> +void strbuf_trim_trailing_ch(struct strbuf *sb, int c)
> +{
> + while (sb->len > 0 && sb->buf[sb->len - 1] == c)
> + sb->len--;
> + sb->buf[sb->len] = '\0';
> +}
So, trim_trailing will leave "foo" when "foo,,," is fed with c set
to ','.
> diff --git a/trace2/tr2_cfg.c b/trace2/tr2_cfg.c
> index 22a99a0682..9da1f8466c 100644
> --- a/trace2/tr2_cfg.c
> +++ b/trace2/tr2_cfg.c
> @@ -35,10 +35,7 @@ static int tr2_cfg_load_patterns(void)
>
> tr2_cfg_patterns = strbuf_split_buf(envvar, strlen(envvar), ',', -1);
> for (s = tr2_cfg_patterns; *s; s++) {
> - struct strbuf *buf = *s;
> -
> - if (buf->len && buf->buf[buf->len - 1] == ',')
> - strbuf_setlen(buf, buf->len - 1);
> + strbuf_trim_trailing_ch(*s, ',');
And the only thing that prevents this rewrite from being buggy is
the use of misdesigned strbuf_split_buf() function (which by now we
should have deprecated!). Because it splits at ',', we won't have
more than one ',' trailing, but we still split that one trailing
comma because the misdesigned strbuf_split_buf() leaves the
separator at the end of each element.
This does not look like a very convincing example to demonstrate why
the new helper function is useful, at least to me.
If somebody would touch this area of code, I think a lot nicer
clean-up would be to rewrite the thing into a helper function that
is called from here, and the other one in the next hunk in a single
patch, and then clean up the refactored helper function not to use
the strbuf_split_buf(). Looking at the way tr2_cfg_patterns and
tr2_cfg_env_vars are used, they have *NO* valid reason why they have
to be a strbuf. Once populated, they are only used for a constant
string pointed at by their .buf member. A string_list constructed
by appending (i.e. not sorted) would be a lot more suitable data
structure.
> strbuf_trim_trailing_newline(*s);
> strbuf_trim(*s);
> }
> @@ -74,10 +71,7 @@ static int tr2_load_env_vars(void)
>
> tr2_cfg_env_vars = strbuf_split_buf(varlist, strlen(varlist), ',', -1);
> for (s = tr2_cfg_env_vars; *s; s++) {
> - struct strbuf *buf = *s;
> -
> - if (buf->len && buf->buf[buf->len - 1] == ',')
> - strbuf_setlen(buf, buf->len - 1);
> + strbuf_trim_trailing_ch(*s, ',');
> strbuf_trim_trailing_newline(*s);
> strbuf_trim(*s);
> }
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v3 2/5] strbuf: refactor strbuf_trim_trailing_ch()
2024-12-07 6:35 ` Junio C Hamano
@ 2025-01-27 15:07 ` Christian Couder
0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:07 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Christian Couder, karthik nayak
On Sat, Dec 7, 2024 at 7:35 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > We often have to split strings at some specified terminator character.
> > The strbuf_split*() functions, that we can use for this purpose,
> > return substrings that include the terminator character, so we often
> > need to remove that character.
> >
> > When it is a whitespace, newline or directory separator, the
> > terminator character can easily be removed using an existing triming
> > function like strbuf_rtrim(), strbuf_trim_trailing_newline() or
> > strbuf_trim_trailing_dir_sep(). There is no function to remove that
> > character when it's not one of those characters though.
>
> Heh, totally uninteresting (alternative being open coding this one).
> If we pass, instead of a single character 'c', an array of characters
> to be stripped from the right (like strspn() allows you to skip from
> the left), I may have been a bit more receptive, though ;-)
Yeah, I realized strbuf_strip_suffix() can do the job in the following
patches, so I dropped this patch and used strbuf_strip_suffix() in the
version 4 of this series.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v3 2/5] strbuf: refactor strbuf_trim_trailing_ch()
2024-12-06 12:42 ` [PATCH v3 2/5] strbuf: refactor strbuf_trim_trailing_ch() Christian Couder
2024-12-07 6:35 ` Junio C Hamano
@ 2024-12-16 11:47 ` karthik nayak
1 sibling, 0 replies; 110+ messages in thread
From: karthik nayak @ 2024-12-16 11:47 UTC (permalink / raw)
To: Christian Couder, git
Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Taylor Blau,
Eric Sunshine, Christian Couder
[-- Attachment #1: Type: text/plain, Size: 1304 bytes --]
Christian Couder <christian.couder@gmail.com> writes:
> We often have to split strings at some specified terminator character.
> The strbuf_split*() functions, that we can use for this purpose,
> return substrings that include the terminator character, so we often
> need to remove that character.
>
> When it is a whitespace, newline or directory separator, the
> terminator character can easily be removed using an existing triming
Nit: s/triming/trimming
> function like strbuf_rtrim(), strbuf_trim_trailing_newline() or
> strbuf_trim_trailing_dir_sep(). There is no function to remove that
> character when it's not one of those characters though.
>
> Let's introduce a new strbuf_trim_trailing_ch() function that can be
> used to remove any trailing character, and let's refactor existing code
> that manually removed trailing characters using this new function.
>
> We are also going to use this new function in a following commit.
>
> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---
> strbuf.c | 7 +++++++
> strbuf.h | 3 +++
> trace2/tr2_cfg.c | 10 ++--------
> 3 files changed, 12 insertions(+), 8 deletions(-)
>
Shouldn't this patch also add unit tests? We already have some in
't/unit-tests/t-strbuf.c'. This applies to the previous patch too.
[snip]
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 690 bytes --]
^ permalink raw reply [flat|nested] 110+ messages in thread
* [PATCH v3 3/5] Add 'promisor-remote' capability to protocol v2
2024-12-06 12:42 ` [PATCH v3 0/5] " Christian Couder
2024-12-06 12:42 ` [PATCH v3 1/5] version: refactor strbuf_sanitize() Christian Couder
2024-12-06 12:42 ` [PATCH v3 2/5] strbuf: refactor strbuf_trim_trailing_ch() Christian Couder
@ 2024-12-06 12:42 ` Christian Couder
2024-12-07 7:59 ` Junio C Hamano
2024-12-06 12:42 ` [PATCH v3 4/5] promisor-remote: check advertised name or URL Christian Couder
` (3 subsequent siblings)
6 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2024-12-06 12:42 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Taylor Blau,
Eric Sunshine, Christian Couder, Christian Couder
When a server S knows that some objects from a repository are available
from a promisor remote X, S might want to suggest to a client C cloning
or fetching the repo from S that C should use X directly instead of S
for these objects.
Note that this could happen both in the case S itself doesn't have the
objects and borrows them from X, and in the case S has the objects but
knows that X is better connected to the world (e.g., it is in a
$LARGEINTERNETCOMPANY datacenter with petabit/s backbone connections)
than S. Implementation of the latter case, which would require S to
omit in its response the objects available on X, is left for future
improvement though.
Then C might or might not, want to get the objects from X, and should
let S know about this.
To allow S and C to agree and let each other know about C using X or
not, let's introduce a new "promisor-remote" capability in the
protocol v2, as well as a few new configuration variables:
- "promisor.advertise" on the server side, and:
- "promisor.acceptFromServer" on the client side.
By default, or if "promisor.advertise" is set to 'false', a server S will
not advertise the "promisor-remote" capability.
If S doesn't advertise the "promisor-remote" capability, then a client C
replying to S shouldn't advertise the "promisor-remote" capability
either.
If "promisor.advertise" is set to 'true', S will advertise its promisor
remotes with a string like:
promisor-remote=<pr-info>[;<pr-info>]...
where each <pr-info> element contains information about a single
promisor remote in the form:
name=<pr-name>[,url=<pr-url>]
where <pr-name> is the urlencoded name of a promisor remote and
<pr-url> is the urlencoded URL of the promisor remote named <pr-name>.
For now, the URL is passed in addition to the name. In the future, it
might be possible to pass other information like a filter-spec that the
client should use when cloning from S, or a token that the client should
use when retrieving objects from X.
It might also be possible in the future for "promisor.advertise" to have
other values. For example a value like "onlyName" could prevent S from
advertising URLs, which could help in case C should use a different URL
for X than the URL S is using. (The URL S is using might be an internal
one on the server side for example.)
By default or if "promisor.acceptFromServer" is set to "None", C will
not accept to use the promisor remotes that might have been advertised
by S. In this case, C will not advertise any "promisor-remote"
capability in its reply to S.
If "promisor.acceptFromServer" is set to "All" and S advertised some
promisor remotes, then on the contrary, C will accept to use all the
promisor remotes that S advertised and C will reply with a string like:
promisor-remote=<pr-name>[;<pr-name>]...
where the <pr-name> elements are the urlencoded names of all the
promisor remotes S advertised.
In a following commit, other values for "promisor.acceptFromServer" will
be implemented, so that C will be able to decide the promisor remotes it
accepts depending on the name and URL it received from S. So even if
that name and URL information is not used much right now, it will be
needed soon.
Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
Documentation/config/promisor.txt | 17 ++
Documentation/gitprotocol-v2.txt | 54 ++++++
connect.c | 9 +
promisor-remote.c | 195 +++++++++++++++++++++
promisor-remote.h | 36 +++-
serve.c | 26 +++
t/t5710-promisor-remote-capability.sh | 241 ++++++++++++++++++++++++++
upload-pack.c | 3 +
8 files changed, 580 insertions(+), 1 deletion(-)
create mode 100755 t/t5710-promisor-remote-capability.sh
diff --git a/Documentation/config/promisor.txt b/Documentation/config/promisor.txt
index 98c5cb2ec2..9cbfe3e59e 100644
--- a/Documentation/config/promisor.txt
+++ b/Documentation/config/promisor.txt
@@ -1,3 +1,20 @@
promisor.quiet::
If set to "true" assume `--quiet` when fetching additional
objects for a partial clone.
+
+promisor.advertise::
+ If set to "true", a server will use the "promisor-remote"
+ capability, see linkgit:gitprotocol-v2[5], to advertise the
+ promisor remotes it is using, if it uses some. Default is
+ "false", which means the "promisor-remote" capability is not
+ advertised.
+
+promisor.acceptFromServer::
+ If set to "all", a client will accept all the promisor remotes
+ a server might advertise using the "promisor-remote"
+ capability. Default is "none", which means no promisor remote
+ advertised by a server will be accepted. By accepting a
+ promisor remote, the client agrees that the server might omit
+ objects that are lazily fetchable from this promisor remote
+ from its responses to "fetch" and "clone" requests from the
+ client. See linkgit:gitprotocol-v2[5].
diff --git a/Documentation/gitprotocol-v2.txt b/Documentation/gitprotocol-v2.txt
index 1652fef3ae..f25a9a6ad8 100644
--- a/Documentation/gitprotocol-v2.txt
+++ b/Documentation/gitprotocol-v2.txt
@@ -781,6 +781,60 @@ retrieving the header from a bundle at the indicated URI, and thus
save themselves and the server(s) the request(s) needed to inspect the
headers of that bundle or bundles.
+promisor-remote=<pr-infos>
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The server may advertise some promisor remotes it is using or knows
+about to a client which may want to use them as its promisor remotes,
+instead of this repository. In this case <pr-infos> should be of the
+form:
+
+ pr-infos = pr-info | pr-infos ";" pr-info
+
+ pr-info = "name=" pr-name | "name=" pr-name "," "url=" pr-url
+
+where `pr-name` is the urlencoded name of a promisor remote, and
+`pr-url` the urlencoded URL of that promisor remote.
+
+In this case, if the client decides to use one or more promisor
+remotes the server advertised, it can reply with
+"promisor-remote=<pr-names>" where <pr-names> should be of the form:
+
+ pr-names = pr-name | pr-names ";" pr-name
+
+where `pr-name` is the urlencoded name of a promisor remote the server
+advertised and the client accepts.
+
+Note that, everywhere in this document, `pr-name` MUST be a valid
+remote name, and the ';' and ',' characters MUST be encoded if they
+appear in `pr-name` or `pr-url`.
+
+If the server doesn't know any promisor remote that could be good for
+a client to use, or prefers a client not to use any promisor remote it
+uses or knows about, it shouldn't advertise the "promisor-remote"
+capability at all.
+
+In this case, or if the client doesn't want to use any promisor remote
+the server advertised, the client shouldn't advertise the
+"promisor-remote" capability at all in its reply.
+
+The "promisor.advertise" and "promisor.acceptFromServer" configuration
+options can be used on the server and client side respectively to
+control what they advertise or accept respectively. See the
+documentation of these configuration options for more information.
+
+Note that in the future it would be nice if the "promisor-remote"
+protocol capability could be used by the server, when responding to
+`git fetch` or `git clone`, to advertise better-connected remotes that
+the client can use as promisor remotes, instead of this repository, so
+that the client can lazily fetch objects from these other
+better-connected remotes. This would require the server to omit in its
+response the objects available on the better-connected remotes that
+the client has accepted. This hasn't been implemented yet though. So
+for now this "promisor-remote" capability is useful only when the
+server advertises some promisor remotes it already uses to borrow
+objects from.
+
GIT
---
Part of the linkgit:git[1] suite
diff --git a/connect.c b/connect.c
index 58f53d8dcb..898bf3b438 100644
--- a/connect.c
+++ b/connect.c
@@ -22,6 +22,7 @@
#include "protocol.h"
#include "alias.h"
#include "bundle-uri.h"
+#include "promisor-remote.h"
static char *server_capabilities_v1;
static struct strvec server_capabilities_v2 = STRVEC_INIT;
@@ -487,6 +488,7 @@ void check_stateless_delimiter(int stateless_rpc,
static void send_capabilities(int fd_out, struct packet_reader *reader)
{
const char *hash_name;
+ const char *promisor_remote_info;
if (server_supports_v2("agent"))
packet_write_fmt(fd_out, "agent=%s", git_user_agent_sanitized());
@@ -500,6 +502,13 @@ static void send_capabilities(int fd_out, struct packet_reader *reader)
} else {
reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
}
+ if (server_feature_v2("promisor-remote", &promisor_remote_info)) {
+ char *reply = promisor_remote_reply(promisor_remote_info);
+ if (reply) {
+ packet_write_fmt(fd_out, "promisor-remote=%s", reply);
+ free(reply);
+ }
+ }
}
int get_remote_bundle_uri(int fd_out, struct packet_reader *reader,
diff --git a/promisor-remote.c b/promisor-remote.c
index 9345ae3db2..ea418c4094 100644
--- a/promisor-remote.c
+++ b/promisor-remote.c
@@ -11,6 +11,7 @@
#include "strvec.h"
#include "packfile.h"
#include "environment.h"
+#include "url.h"
struct promisor_remote_config {
struct promisor_remote *promisors;
@@ -221,6 +222,18 @@ int repo_has_promisor_remote(struct repository *r)
return !!repo_promisor_remote_find(r, NULL);
}
+int repo_has_accepted_promisor_remote(struct repository *r)
+{
+ struct promisor_remote *p;
+
+ promisor_remote_init(r);
+
+ for (p = r->promisor_remote_config->promisors; p; p = p->next)
+ if (p->accepted)
+ return 1;
+ return 0;
+}
+
static int remove_fetched_oids(struct repository *repo,
struct object_id **oids,
int oid_nr, int to_free)
@@ -292,3 +305,185 @@ void promisor_remote_get_direct(struct repository *repo,
if (to_free)
free(remaining_oids);
}
+
+static int allow_unsanitized(char ch)
+{
+ if (ch == ',' || ch == ';' || ch == '%')
+ return 0;
+ return ch > 32 && ch < 127;
+}
+
+static void promisor_info_vecs(struct repository *repo,
+ struct strvec *names,
+ struct strvec *urls)
+{
+ struct promisor_remote *r;
+
+ promisor_remote_init(repo);
+
+ for (r = repo->promisor_remote_config->promisors; r; r = r->next) {
+ char *url;
+ char *url_key = xstrfmt("remote.%s.url", r->name);
+
+ strvec_push(names, r->name);
+ strvec_push(urls, git_config_get_string(url_key, &url) ? NULL : url);
+
+ free(url);
+ free(url_key);
+ }
+}
+
+char *promisor_remote_info(struct repository *repo)
+{
+ struct strbuf sb = STRBUF_INIT;
+ int advertise_promisors = 0;
+ struct strvec names = STRVEC_INIT;
+ struct strvec urls = STRVEC_INIT;
+
+ git_config_get_bool("promisor.advertise", &advertise_promisors);
+
+ if (!advertise_promisors)
+ return NULL;
+
+ promisor_info_vecs(repo, &names, &urls);
+
+ if (!names.nr)
+ return NULL;
+
+ for (size_t i = 0; i < names.nr; i++) {
+ if (i)
+ strbuf_addch(&sb, ';');
+ strbuf_addstr(&sb, "name=");
+ strbuf_addstr_urlencode(&sb, names.v[i], allow_unsanitized);
+ if (urls.v[i]) {
+ strbuf_addstr(&sb, ",url=");
+ strbuf_addstr_urlencode(&sb, urls.v[i], allow_unsanitized);
+ }
+ }
+
+ strbuf_sanitize(&sb);
+
+ strvec_clear(&names);
+ strvec_clear(&urls);
+
+ return strbuf_detach(&sb, NULL);
+}
+
+enum accept_promisor {
+ ACCEPT_NONE = 0,
+ ACCEPT_ALL
+};
+
+static int should_accept_remote(enum accept_promisor accept,
+ const char *remote_name UNUSED,
+ const char *remote_url UNUSED)
+{
+ if (accept == ACCEPT_ALL)
+ return 1;
+
+ BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+}
+
+static void filter_promisor_remote(struct strvec *accepted, const char *info)
+{
+ struct strbuf **remotes;
+ const char *accept_str;
+ enum accept_promisor accept = ACCEPT_NONE;
+
+ if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
+ if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
+ accept = ACCEPT_NONE;
+ else if (!strcasecmp("All", accept_str))
+ accept = ACCEPT_ALL;
+ else
+ warning(_("unknown '%s' value for '%s' config option"),
+ accept_str, "promisor.acceptfromserver");
+ }
+
+ if (accept == ACCEPT_NONE)
+ return;
+
+ /* Parse remote info received */
+
+ remotes = strbuf_split_str(info, ';', 0);
+
+ for (size_t i = 0; remotes[i]; i++) {
+ struct strbuf **elems;
+ const char *remote_name = NULL;
+ const char *remote_url = NULL;
+ char *decoded_name = NULL;
+ char *decoded_url = NULL;
+
+ strbuf_trim_trailing_ch(remotes[i], ';');
+ elems = strbuf_split_str(remotes[i]->buf, ',', 0);
+
+ for (size_t j = 0; elems[j]; j++) {
+ int res;
+ strbuf_trim_trailing_ch(elems[j], ',');
+ res = skip_prefix(elems[j]->buf, "name=", &remote_name) ||
+ skip_prefix(elems[j]->buf, "url=", &remote_url);
+ if (!res)
+ warning(_("unknown element '%s' from remote info"),
+ elems[j]->buf);
+ }
+
+ if (remote_name)
+ decoded_name = url_percent_decode(remote_name);
+ if (remote_url)
+ decoded_url = url_percent_decode(remote_url);
+
+ if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url))
+ strvec_push(accepted, decoded_name);
+
+ strbuf_list_free(elems);
+ free(decoded_name);
+ free(decoded_url);
+ }
+
+ strbuf_list_free(remotes);
+}
+
+char *promisor_remote_reply(const char *info)
+{
+ struct strvec accepted = STRVEC_INIT;
+ struct strbuf reply = STRBUF_INIT;
+
+ filter_promisor_remote(&accepted, info);
+
+ if (!accepted.nr)
+ return NULL;
+
+ for (size_t i = 0; i < accepted.nr; i++) {
+ if (i)
+ strbuf_addch(&reply, ';');
+ strbuf_addstr_urlencode(&reply, accepted.v[i], allow_unsanitized);
+ }
+
+ strvec_clear(&accepted);
+
+ return strbuf_detach(&reply, NULL);
+}
+
+void mark_promisor_remotes_as_accepted(struct repository *r, const char *remotes)
+{
+ struct strbuf **accepted_remotes = strbuf_split_str(remotes, ';', 0);
+
+ for (size_t i = 0; accepted_remotes[i]; i++) {
+ struct promisor_remote *p;
+ char *decoded_remote;
+
+ strbuf_trim_trailing_ch(accepted_remotes[i], ';');
+ decoded_remote = url_percent_decode(accepted_remotes[i]->buf);
+
+ p = repo_promisor_remote_find(r, decoded_remote);
+ if (p)
+ p->accepted = 1;
+ else
+ warning(_("accepted promisor remote '%s' not found"),
+ decoded_remote);
+
+ free(decoded_remote);
+ }
+
+ strbuf_list_free(accepted_remotes);
+}
diff --git a/promisor-remote.h b/promisor-remote.h
index 88cb599c39..814ca248c7 100644
--- a/promisor-remote.h
+++ b/promisor-remote.h
@@ -9,11 +9,13 @@ struct object_id;
* Promisor remote linked list
*
* Information in its fields come from remote.XXX config entries or
- * from extensions.partialclone.
+ * from extensions.partialclone, except for 'accepted' which comes
+ * from protocol v2 capabilities exchange.
*/
struct promisor_remote {
struct promisor_remote *next;
char *partial_clone_filter;
+ unsigned int accepted : 1;
const char name[FLEX_ARRAY];
};
@@ -32,4 +34,36 @@ void promisor_remote_get_direct(struct repository *repo,
const struct object_id *oids,
int oid_nr);
+/*
+ * Prepare a "promisor-remote" advertisement by a server.
+ * Check the value of "promisor.advertise" and maybe the configured
+ * promisor remotes, if any, to prepare information to send in an
+ * advertisement.
+ * Return value is NULL if no promisor remote advertisement should be
+ * made. Otherwise it contains the names and urls of the advertised
+ * promisor remotes separated by ';'
+ */
+char *promisor_remote_info(struct repository *repo);
+
+/*
+ * Prepare a reply to a "promisor-remote" advertisement from a server.
+ * Check the value of "promisor.acceptfromserver" and maybe the
+ * configured promisor remotes, if any, to prepare the reply.
+ * Return value is NULL if no promisor remote from the server
+ * is accepted. Otherwise it contains the names of the accepted promisor
+ * remotes separated by ';'.
+ */
+char *promisor_remote_reply(const char *info);
+
+/*
+ * Set the 'accepted' flag for some promisor remotes. Useful when some
+ * promisor remotes have been accepted by the client.
+ */
+void mark_promisor_remotes_as_accepted(struct repository *repo, const char *remotes);
+
+/*
+ * Has any promisor remote been accepted by the client?
+ */
+int repo_has_accepted_promisor_remote(struct repository *r);
+
#endif /* PROMISOR_REMOTE_H */
diff --git a/serve.c b/serve.c
index d674764a25..5a40a7abb7 100644
--- a/serve.c
+++ b/serve.c
@@ -12,6 +12,7 @@
#include "upload-pack.h"
#include "bundle-uri.h"
#include "trace2.h"
+#include "promisor-remote.h"
static int advertise_sid = -1;
static int advertise_object_info = -1;
@@ -31,6 +32,26 @@ static int agent_advertise(struct repository *r UNUSED,
return 1;
}
+static int promisor_remote_advertise(struct repository *r,
+ struct strbuf *value)
+{
+ if (value) {
+ char *info = promisor_remote_info(r);
+ if (!info)
+ return 0;
+ strbuf_addstr(value, info);
+ free(info);
+ }
+ return 1;
+}
+
+static void promisor_remote_receive(struct repository *r,
+ const char *remotes)
+{
+ mark_promisor_remotes_as_accepted(r, remotes);
+}
+
+
static int object_format_advertise(struct repository *r,
struct strbuf *value)
{
@@ -157,6 +178,11 @@ static struct protocol_capability capabilities[] = {
.advertise = bundle_uri_advertise,
.command = bundle_uri_command,
},
+ {
+ .name = "promisor-remote",
+ .advertise = promisor_remote_advertise,
+ .receive = promisor_remote_receive,
+ },
};
void protocol_v2_advertise_capabilities(void)
diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
new file mode 100755
index 0000000000..000cb4c0f6
--- /dev/null
+++ b/t/t5710-promisor-remote-capability.sh
@@ -0,0 +1,241 @@
+#!/bin/sh
+
+test_description='handling of promisor remote advertisement'
+
+. ./test-lib.sh
+
+# Setup the repository with three commits, this way HEAD is always
+# available and we can hide commit 1 or 2.
+test_expect_success 'setup: create "template" repository' '
+ git init template &&
+ test_commit -C template 1 &&
+ test_commit -C template 2 &&
+ test_commit -C template 3 &&
+ test-tool genrandom foo 10240 >template/foo &&
+ git -C template add foo &&
+ git -C template commit -m foo
+'
+
+# A bare repo will act as a server repo with unpacked objects.
+test_expect_success 'setup: create bare "server" repository' '
+ git clone --bare --no-local template server &&
+ mv server/objects/pack/pack-* . &&
+ packfile=$(ls pack-*.pack) &&
+ git -C server unpack-objects --strict <"$packfile"
+'
+
+check_missing_objects () {
+ git -C "$1" rev-list --objects --all --missing=print > all.txt &&
+ perl -ne 'print if s/^[?]//' all.txt >missing.txt &&
+ test_line_count = "$2" missing.txt &&
+ if test "$2" -lt 2
+ then
+ test "$3" = "$(cat missing.txt)"
+ else
+ test -f "$3" &&
+ sort <"$3" >expected_sorted &&
+ sort <missing.txt >actual_sorted &&
+ test_cmp expected_sorted actual_sorted
+ fi
+}
+
+initialize_server () {
+ count="$1"
+ missing_oids="$2"
+
+ # Repack everything first
+ git -C server -c repack.writebitmaps=false repack -a -d &&
+
+ # Remove promisor file in case they exist, useful when reinitializing
+ rm -rf server/objects/pack/*.promisor &&
+
+ # Repack without the largest object and create a promisor pack on server
+ git -C server -c repack.writebitmaps=false repack -a -d \
+ --filter=blob:limit=5k --filter-to="$(pwd)/pack" &&
+ promisor_file=$(ls server/objects/pack/*.pack | sed "s/\.pack/.promisor/") &&
+ >"$promisor_file" &&
+
+ # Check objects missing on the server
+ check_missing_objects server "$count" "$missing_oids"
+}
+
+copy_to_server2 () {
+ oid_path="$(test_oid_to_path $1)" &&
+ path="server/objects/$oid_path" &&
+ path2="server2/objects/$oid_path" &&
+ mkdir -p $(dirname "$path2") &&
+ cp "$path" "$path2"
+}
+
+test_expect_success "setup for testing promisor remote advertisement" '
+ # Create another bare repo called "server2"
+ git init --bare server2 &&
+
+ # Copy the largest object from server to server2
+ obj="HEAD:foo" &&
+ oid="$(git -C server rev-parse $obj)" &&
+ copy_to_server2 "$oid" &&
+
+ initialize_server 1 "$oid" &&
+
+ # Configure server2 as promisor remote for server
+ git -C server remote add server2 "file://$(pwd)/server2" &&
+ git -C server config remote.server2.promisor true &&
+
+ git -C server2 config uploadpack.allowFilter true &&
+ git -C server2 config uploadpack.allowAnySHA1InWant true &&
+ git -C server config uploadpack.allowFilter true &&
+ git -C server config uploadpack.allowAnySHA1InWant true
+'
+
+test_expect_success "clone with promisor.advertise set to 'true'" '
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+ -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+ -c remote.server2.url="file://$(pwd)/server2" \
+ -c promisor.acceptfromserver=All \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
+
+ # Check that the largest object is still missing on the server
+ check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.advertise set to 'false'" '
+ git -C server config promisor.advertise false &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+ -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+ -c remote.server2.url="file://$(pwd)/server2" \
+ -c promisor.acceptfromserver=All \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
+
+ # Check that the largest object is not missing on the server
+ check_missing_objects server 0 "" &&
+
+ # Reinitialize server so that the largest object is missing again
+ initialize_server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.acceptfromserver set to 'None'" '
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+ -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+ -c remote.server2.url="file://$(pwd)/server2" \
+ -c promisor.acceptfromserver=None \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
+
+ # Check that the largest object is not missing on the server
+ check_missing_objects server 0 "" &&
+
+ # Reinitialize server so that the largest object is missing again
+ initialize_server 1 "$oid"
+'
+
+test_expect_success "init + fetch with promisor.advertise set to 'true'" '
+ git -C server config promisor.advertise true &&
+
+ test_when_finished "rm -rf client" &&
+ mkdir client &&
+ git -C client init &&
+ git -C client config remote.server2.promisor true &&
+ git -C client config remote.server2.fetch "+refs/heads/*:refs/remotes/server2/*" &&
+ git -C client config remote.server2.url "file://$(pwd)/server2" &&
+ git -C client config remote.server.url "file://$(pwd)/server" &&
+ git -C client config remote.server.fetch "+refs/heads/*:refs/remotes/server/*" &&
+ git -C client config promisor.acceptfromserver All &&
+ GIT_NO_LAZY_FETCH=0 git -C client fetch --filter="blob:limit=5k" server &&
+
+ # Check that the largest object is still missing on the server
+ check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.advertise set to 'true' but don't delete the client" '
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+ -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+ -c remote.server2.url="file://$(pwd)/server2" \
+ -c promisor.acceptfromserver=All \
+ --no-local --filter="blob:limit=5k" server client &&
+
+ # Check that the largest object is still missing on the server
+ check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "setup for subsequent fetches" '
+ # Generate new commit with large blob
+ test-tool genrandom bar 10240 >template/bar &&
+ git -C template add bar &&
+ git -C template commit -m bar &&
+
+ # Fetch new commit with large blob
+ git -C server fetch origin &&
+ git -C server update-ref HEAD FETCH_HEAD &&
+ git -C server rev-parse HEAD >expected_head &&
+
+ # Repack everything twice and remove .promisor files before
+ # each repack. This makes sure everything gets repacked
+ # into a single packfile. The second repack is necessary
+ # because the first one fetches from server2 and creates a new
+ # packfile and its associated .promisor file.
+
+ rm -f server/objects/pack/*.promisor &&
+ git -C server -c repack.writebitmaps=false repack -a -d &&
+ rm -f server/objects/pack/*.promisor &&
+ git -C server -c repack.writebitmaps=false repack -a -d &&
+
+ # Unpack everything
+ rm pack-* &&
+ mv server/objects/pack/pack-* . &&
+ packfile=$(ls pack-*.pack) &&
+ git -C server unpack-objects --strict <"$packfile" &&
+
+ # Copy new large object to server2
+ obj_bar="HEAD:bar" &&
+ oid_bar="$(git -C server rev-parse $obj_bar)" &&
+ copy_to_server2 "$oid_bar" &&
+
+ # Reinitialize server so that the 2 largest objects are missing
+ printf "%s\n" "$oid" "$oid_bar" >expected_missing.txt &&
+ initialize_server 2 expected_missing.txt &&
+
+ # Create one more client
+ cp -r client client2
+'
+
+test_expect_success "subsequent fetch from a client when promisor.advertise is true" '
+ git -C server config promisor.advertise true &&
+
+ GIT_NO_LAZY_FETCH=0 git -C client pull origin &&
+
+ git -C client rev-parse HEAD >actual &&
+ test_cmp expected_head actual &&
+
+ cat client/bar >/dev/null &&
+
+ check_missing_objects server 2 expected_missing.txt
+'
+
+test_expect_success "subsequent fetch from a client when promisor.advertise is false" '
+ git -C server config promisor.advertise false &&
+
+ GIT_NO_LAZY_FETCH=0 git -C client2 pull origin &&
+
+ git -C client2 rev-parse HEAD >actual &&
+ test_cmp expected_head actual &&
+
+ cat client2/bar >/dev/null &&
+
+ check_missing_objects server 1 "$oid"
+'
+
+test_done
diff --git a/upload-pack.c b/upload-pack.c
index 43006c0614..c6550a8d51 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -31,6 +31,7 @@
#include "write-or-die.h"
#include "json-writer.h"
#include "strmap.h"
+#include "promisor-remote.h"
/* Remember to update object flag allocation in object.h */
#define THEY_HAVE (1u << 11)
@@ -318,6 +319,8 @@ static void create_pack_file(struct upload_pack_data *pack_data,
strvec_push(&pack_objects.args, "--delta-base-offset");
if (pack_data->use_include_tag)
strvec_push(&pack_objects.args, "--include-tag");
+ if (repo_has_accepted_promisor_remote(the_repository))
+ strvec_push(&pack_objects.args, "--missing=allow-promisor");
if (pack_data->filter_options.choice) {
const char *spec =
expand_list_objects_filter_spec(&pack_data->filter_options);
--
2.47.1.402.gc25c94707f
^ permalink raw reply related [flat|nested] 110+ messages in thread
* Re: [PATCH v3 3/5] Add 'promisor-remote' capability to protocol v2
2024-12-06 12:42 ` [PATCH v3 3/5] Add 'promisor-remote' capability to protocol v2 Christian Couder
@ 2024-12-07 7:59 ` Junio C Hamano
2025-01-27 15:08 ` Christian Couder
0 siblings, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2024-12-07 7:59 UTC (permalink / raw)
To: Christian Couder
Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Christian Couder
Christian Couder <christian.couder@gmail.com> writes:
> Then C might or might not, want to get the objects from X, and should
> let S know about this.
I only left this instance quoted in this reply, but I found that
there are too many "should" in the description (both in the proposed
log message and in the documentation patch), which do not help the
readers with accompanying explanation on the reason why it is a good
idea to follow these "should". For example, S may suggest X to C,
and C (imagine a third-party reimplementation of Git, which is not
bound by your "should") may take advantage of that suggestion and
use X as a better connected alternative, and C might want to do so
without even telling S. What entices C to tell S? IOW, how are
these two parties expected to collaborate with that information at
hand? Without answering that question ...
> To allow S and C to agree and let each other know about C using X or
> not, let's introduce a new "promisor-remote" capability in the
> protocol v2, as well as a few new configuration variables:
>
> - "promisor.advertise" on the server side, and:
> - "promisor.acceptFromServer" on the client side.
... the need for a mechanism to share that information between S and
C is hard to sell. "By telling S, C allows S to omit objects that
can be obtained from X when answering C's request?" or something,
perhaps?
> +Note that in the future it would be nice if the "promisor-remote"
> +protocol capability could be used by the server, when responding to
> +`git fetch` or `git clone`, to advertise better-connected remotes that
> +the client can use as promisor remotes, instead of this repository, so
> +that the client can lazily fetch objects from these other
> +better-connected remotes. This would require the server to omit in its
> +response the objects available on the better-connected remotes that
> +the client has accepted. This hasn't been implemented yet though. So
> +for now this "promisor-remote" capability is useful only when the
> +server advertises some promisor remotes it already uses to borrow
> +objects from.
We need to figure out before etching the protocol specification in
stone what to do when the network situations observable by C and S
are different. For example, C may need to go over a proxy to reach
S, S may directly have connection to X, but C cannot reach X
directly, and C needs another proxy, different from the one it uses
to go to S, to reach X. How is S expected to know about C's network
situation, and use the knowledge to tell C how to reach X? Or is X
so well known a name that it is C's responsibility to arrange how it
can reach X? I suspect that this was designed primarily to allow a
server to better help clients owned by the same enterprise entity,
so it might be tempting to distribute pieces of information we
usually do not consider Git's concern, like proxy configuration,
over the same protocol. I personally would strongly prefer *not* to
go in that direction, and if we agree that we won't go there from
the beginning, I'd be a lot happier ;-)
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v3 3/5] Add 'promisor-remote' capability to protocol v2
2024-12-07 7:59 ` Junio C Hamano
@ 2025-01-27 15:08 ` Christian Couder
0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:08 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Christian Couder, karthik nayak
On Sat, Dec 7, 2024 at 8:59 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > Then C might or might not, want to get the objects from X, and should
> > let S know about this.
>
> I only left this instance quoted in this reply, but I found that
> there are too many "should" in the description (both in the proposed
> log message and in the documentation patch), which do not help the
> readers with accompanying explanation on the reason why it is a good
> idea to follow these "should".
In the next version, I have changed the commit message to replace many
"should" with something else.
> For example, S may suggest X to C,
> and C (imagine a third-party reimplementation of Git, which is not
> bound by your "should") may take advantage of that suggestion and
> use X as a better connected alternative, and C might want to do so
> without even telling S. What entices C to tell S? IOW, how are
> these two parties expected to collaborate with that information at
> hand? Without answering that question ...
The improved commit message in the next version says earlier that "If
S and C can agree on C using X directly, S can then omit objects that
can be obtained from X when answering C's request."
> > To allow S and C to agree and let each other know about C using X or
> > not, let's introduce a new "promisor-remote" capability in the
> > protocol v2, as well as a few new configuration variables:
> >
> > - "promisor.advertise" on the server side, and:
> > - "promisor.acceptFromServer" on the client side.
>
> ... the need for a mechanism to share that information between S and
> C is hard to sell. "By telling S, C allows S to omit objects that
> can be obtained from X when answering C's request?" or something,
> perhaps?
Yeah, now this is mentioned earlier.
> > +Note that in the future it would be nice if the "promisor-remote"
> > +protocol capability could be used by the server, when responding to
> > +`git fetch` or `git clone`, to advertise better-connected remotes that
> > +the client can use as promisor remotes, instead of this repository, so
> > +that the client can lazily fetch objects from these other
> > +better-connected remotes. This would require the server to omit in its
> > +response the objects available on the better-connected remotes that
> > +the client has accepted. This hasn't been implemented yet though. So
> > +for now this "promisor-remote" capability is useful only when the
> > +server advertises some promisor remotes it already uses to borrow
> > +objects from.
>
> We need to figure out before etching the protocol specification in
> stone what to do when the network situations observable by C and S
> are different. For example, C may need to go over a proxy to reach
> S, S may directly have connection to X, but C cannot reach X
> directly, and C needs another proxy, different from the one it uses
> to go to S, to reach X. How is S expected to know about C's network
> situation, and use the knowledge to tell C how to reach X? Or is X
> so well known a name that it is C's responsibility to arrange how it
> can reach X?
Yeah, it's C's responsibility to arrange how it can reach X.
> I suspect that this was designed primarily to allow a
> server to better help clients owned by the same enterprise entity,
> so it might be tempting to distribute pieces of information we
> usually do not consider Git's concern, like proxy configuration,
> over the same protocol. I personally would strongly prefer *not* to
> go in that direction, and if we agree that we won't go there from
> the beginning, I'd be a lot happier ;-)
I don't want to go into that direction either. I have added the
following into the commit message:
"It is C's responsibility to arrange how it can reach X though, so pieces
of information that are usually outside Git's concern, like proxy
configuration, must not be distributed over this protocol."
I think that requiring some global configuration is a good thing. What
we should particularly make easier and more flexible are some details
about the best ways to access each individual repo, like which filter
spec it is best to use. So that if the repo admins decide to move some
smaller objects to the LOP, each client doesn't have to adjust the
filter spec.
^ permalink raw reply [flat|nested] 110+ messages in thread
* [PATCH v3 4/5] promisor-remote: check advertised name or URL
2024-12-06 12:42 ` [PATCH v3 0/5] " Christian Couder
` (2 preceding siblings ...)
2024-12-06 12:42 ` [PATCH v3 3/5] Add 'promisor-remote' capability to protocol v2 Christian Couder
@ 2024-12-06 12:42 ` Christian Couder
2024-12-06 12:42 ` [PATCH v3 5/5] doc: add technical design doc for large object promisors Christian Couder
` (2 subsequent siblings)
6 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2024-12-06 12:42 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Taylor Blau,
Eric Sunshine, Christian Couder, Christian Couder
A previous commit introduced a "promisor.acceptFromServer" configuration
variable with only "None" or "All" as valid values.
Let's introduce "KnownName" and "KnownUrl" as valid values for this
configuration option to give more choice to a client about which
promisor remotes it might accept among those that the server advertised.
In case of "KnownName", the client will accept promisor remotes which
are already configured on the client and have the same name as those
advertised by the client. This could be useful in a corporate setup
where servers and clients are trusted to not switch names and URLs, but
where some kind of control is still useful.
In case of "KnownUrl", the client will accept promisor remotes which
have both the same name and the same URL configured on the client as the
name and URL advertised by the server. This is the most secure option,
so it should be used if possible.
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
Documentation/config/promisor.txt | 22 ++++++---
promisor-remote.c | 60 ++++++++++++++++++++---
t/t5710-promisor-remote-capability.sh | 68 +++++++++++++++++++++++++++
3 files changed, 138 insertions(+), 12 deletions(-)
diff --git a/Documentation/config/promisor.txt b/Documentation/config/promisor.txt
index 9cbfe3e59e..d1364bc018 100644
--- a/Documentation/config/promisor.txt
+++ b/Documentation/config/promisor.txt
@@ -12,9 +12,19 @@ promisor.advertise::
promisor.acceptFromServer::
If set to "all", a client will accept all the promisor remotes
a server might advertise using the "promisor-remote"
- capability. Default is "none", which means no promisor remote
- advertised by a server will be accepted. By accepting a
- promisor remote, the client agrees that the server might omit
- objects that are lazily fetchable from this promisor remote
- from its responses to "fetch" and "clone" requests from the
- client. See linkgit:gitprotocol-v2[5].
+ capability. If set to "knownName" the client will accept
+ promisor remotes which are already configured on the client
+ and have the same name as those advertised by the client. This
+ is not very secure, but could be used in a corporate setup
+ where servers and clients are trusted to not switch name and
+ URLs. If set to "knownUrl", the client will accept promisor
+ remotes which have both the same name and the same URL
+ configured on the client as the name and URL advertised by the
+ server. This is more secure than "all" or "knownUrl", so it
+ should be used if possible instead of those options. Default
+ is "none", which means no promisor remote advertised by a
+ server will be accepted. By accepting a promisor remote, the
+ client agrees that the server might omit objects that are
+ lazily fetchable from this promisor remote from its responses
+ to "fetch" and "clone" requests from the client. See
+ linkgit:gitprotocol-v2[5].
diff --git a/promisor-remote.c b/promisor-remote.c
index ea418c4094..b72d539c19 100644
--- a/promisor-remote.c
+++ b/promisor-remote.c
@@ -369,30 +369,73 @@ char *promisor_remote_info(struct repository *repo)
return strbuf_detach(&sb, NULL);
}
+/*
+ * Find first index of 'vec' where there is 'val'. 'val' is compared
+ * case insensively to the strings in 'vec'. If not found 'vec->nr' is
+ * returned.
+ */
+static size_t strvec_find_index(struct strvec *vec, const char *val)
+{
+ for (size_t i = 0; i < vec->nr; i++)
+ if (!strcasecmp(vec->v[i], val))
+ return i;
+ return vec->nr;
+}
+
enum accept_promisor {
ACCEPT_NONE = 0,
+ ACCEPT_KNOWN_URL,
+ ACCEPT_KNOWN_NAME,
ACCEPT_ALL
};
static int should_accept_remote(enum accept_promisor accept,
- const char *remote_name UNUSED,
- const char *remote_url UNUSED)
+ const char *remote_name, const char *remote_url,
+ struct strvec *names, struct strvec *urls)
{
+ size_t i;
+
if (accept == ACCEPT_ALL)
return 1;
- BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+ i = strvec_find_index(names, remote_name);
+
+ if (i >= names->nr)
+ /* We don't know about that remote */
+ return 0;
+
+ if (accept == ACCEPT_KNOWN_NAME)
+ return 1;
+
+ if (accept != ACCEPT_KNOWN_URL)
+ BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+
+ if (!strcasecmp(urls->v[i], remote_url))
+ return 1;
+
+ warning(_("known remote named '%s' but with url '%s' instead of '%s'"),
+ remote_name, urls->v[i], remote_url);
+
+ return 0;
}
-static void filter_promisor_remote(struct strvec *accepted, const char *info)
+static void filter_promisor_remote(struct repository *repo,
+ struct strvec *accepted,
+ const char *info)
{
struct strbuf **remotes;
const char *accept_str;
enum accept_promisor accept = ACCEPT_NONE;
+ struct strvec names = STRVEC_INIT;
+ struct strvec urls = STRVEC_INIT;
if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
accept = ACCEPT_NONE;
+ else if (!strcasecmp("KnownUrl", accept_str))
+ accept = ACCEPT_KNOWN_URL;
+ else if (!strcasecmp("KnownName", accept_str))
+ accept = ACCEPT_KNOWN_NAME;
else if (!strcasecmp("All", accept_str))
accept = ACCEPT_ALL;
else
@@ -403,6 +446,9 @@ static void filter_promisor_remote(struct strvec *accepted, const char *info)
if (accept == ACCEPT_NONE)
return;
+ if (accept != ACCEPT_ALL)
+ promisor_info_vecs(repo, &names, &urls);
+
/* Parse remote info received */
remotes = strbuf_split_str(info, ';', 0);
@@ -432,7 +478,7 @@ static void filter_promisor_remote(struct strvec *accepted, const char *info)
if (remote_url)
decoded_url = url_percent_decode(remote_url);
- if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url))
+ if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url, &names, &urls))
strvec_push(accepted, decoded_name);
strbuf_list_free(elems);
@@ -440,6 +486,8 @@ static void filter_promisor_remote(struct strvec *accepted, const char *info)
free(decoded_url);
}
+ strvec_clear(&names);
+ strvec_clear(&urls);
strbuf_list_free(remotes);
}
@@ -448,7 +496,7 @@ char *promisor_remote_reply(const char *info)
struct strvec accepted = STRVEC_INIT;
struct strbuf reply = STRBUF_INIT;
- filter_promisor_remote(&accepted, info);
+ filter_promisor_remote(the_repository, &accepted, info);
if (!accepted.nr)
return NULL;
diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
index 000cb4c0f6..483cc8e16d 100755
--- a/t/t5710-promisor-remote-capability.sh
+++ b/t/t5710-promisor-remote-capability.sh
@@ -157,6 +157,74 @@ test_expect_success "init + fetch with promisor.advertise set to 'true'" '
check_missing_objects server 1 "$oid"
'
+test_expect_success "clone with promisor.acceptfromserver set to 'KnownName'" '
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+ -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+ -c remote.server2.url="file://$(pwd)/server2" \
+ -c promisor.acceptfromserver=KnownName \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
+
+ # Check that the largest object is still missing on the server
+ check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with 'KnownName' and different remote names" '
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.serverTwo.promisor=true \
+ -c remote.serverTwo.fetch="+refs/heads/*:refs/remotes/server2/*" \
+ -c remote.serverTwo.url="file://$(pwd)/server2" \
+ -c promisor.acceptfromserver=KnownName \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
+
+ # Check that the largest object is not missing on the server
+ check_missing_objects server 0 "" &&
+
+ # Reinitialize server so that the largest object is missing again
+ initialize_server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.acceptfromserver set to 'KnownUrl'" '
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+ -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+ -c remote.server2.url="file://$(pwd)/server2" \
+ -c promisor.acceptfromserver=KnownUrl \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
+
+ # Check that the largest object is still missing on the server
+ check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with 'KnownUrl' and different remote urls" '
+ ln -s server2 serverTwo &&
+
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+ -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+ -c remote.server2.url="file://$(pwd)/serverTwo" \
+ -c promisor.acceptfromserver=KnownUrl \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
+
+ # Check that the largest object is not missing on the server
+ check_missing_objects server 0 "" &&
+
+ # Reinitialize server so that the largest object is missing again
+ initialize_server 1 "$oid"
+'
+
test_expect_success "clone with promisor.advertise set to 'true' but don't delete the client" '
git -C server config promisor.advertise true &&
--
2.47.1.402.gc25c94707f
^ permalink raw reply related [flat|nested] 110+ messages in thread
* [PATCH v3 5/5] doc: add technical design doc for large object promisors
2024-12-06 12:42 ` [PATCH v3 0/5] " Christian Couder
` (3 preceding siblings ...)
2024-12-06 12:42 ` [PATCH v3 4/5] promisor-remote: check advertised name or URL Christian Couder
@ 2024-12-06 12:42 ` Christian Couder
2024-12-10 1:28 ` Junio C Hamano
2024-12-10 11:43 ` Junio C Hamano
2024-12-09 8:04 ` [PATCH v3 0/5] Introduce a "promisor-remote" capability Junio C Hamano
2025-01-27 15:16 ` [PATCH v4 0/6] " Christian Couder
6 siblings, 2 replies; 110+ messages in thread
From: Christian Couder @ 2024-12-06 12:42 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Taylor Blau,
Eric Sunshine, Christian Couder, Christian Couder
Let's add a design doc about how we could improve handling liarge blobs
using "Large Object Promisors" (LOPs). It's a set of features with the
goal of using special dedicated promisor remotes to store large blobs,
and having them accessed directly by main remotes and clients.
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
.../technical/large-object-promisors.txt | 530 ++++++++++++++++++
1 file changed, 530 insertions(+)
create mode 100644 Documentation/technical/large-object-promisors.txt
diff --git a/Documentation/technical/large-object-promisors.txt b/Documentation/technical/large-object-promisors.txt
new file mode 100644
index 0000000000..267c65b0d5
--- /dev/null
+++ b/Documentation/technical/large-object-promisors.txt
@@ -0,0 +1,530 @@
+Large Object Promisors
+======================
+
+Since Git has been created, users have been complaining about issues
+with storing large files in Git. Some solutions have been created to
+help, but they haven't helped much with some issues.
+
+Git currently supports multiple promisor remotes, which could help
+with some of these remaining issues, but it's very hard to use them to
+help, because a number of important features are missing.
+
+The goal of the effort described in this document is to add these
+important features.
+
+We will call a "Large Object Promisor", or "LOP" in short, a promisor
+remote which is used to store only large blobs and which is separate
+from the main remote that should store the other Git objects and the
+rest of the repos.
+
+By extension, we will also call "Large Object Promisor", or LOP, the
+effort described in this document to add a set of features to make it
+easier to handle large blobs/files in Git by using LOPs.
+
+This effort would especially improve things on the server side, and
+especially for large blobs that are already compressed in a binary
+format.
+
+This effort could help provide an alternative to Git LFS
+(https://git-lfs.com/) and similar tools like git-annex
+(https://git-annex.branchable.com/) for handling large files, even
+though a complete alternative would very likely require other efforts
+especially on the client side, where it would likely help to implement
+a new object representation for large blobs as discussed in:
+
+https://lore.kernel.org/git/xmqqbkdometi.fsf@gitster.g/
+
+0) Non goals
+------------
+
+- We will not discuss those client side improvements here, as they
+ would require changes in different parts of Git than this effort.
++
+So we don't pretend to fully replace Git LFS with only this effort,
+but we nevertheless believe that it can significantly improve the
+current situation on the server side, and that other separate
+efforts could also improve the situation on the client side.
+
+- In the same way, we are not going to discuss all the possible ways
+ to implement a LOP or their underlying object storage.
++
+In particular we are not going to discuss pluggable ODBs or other
+object database backends that could chunk large blobs, dedup the
+chunks and store them efficiently. Sure, that would be a nice
+improvement to store large blobs on the server side, but we believe
+it can just be a separate effort as it's also not technically very
+related to this effort.
+
+In other words, the goal of this document is not to talk about all the
+possible ways to optimize how Git could handle large blobs, but to
+describe how a LOP based solution could work well and alleviate a
+number of current issues in the context of Git clients and servers
+sharing Git objects.
+
+I) Issues with the current situation
+------------------------------------
+
+- Some statistics made on GitLab repos have shown that more than 75%
+ of the disk space is used by blobs that are larger than 1MB and
+ often in a binary format.
+
+- So even if users could use Git LFS or similar tools to store a lot
+ of large blobs out of their repos, it's a fact that in practice they
+ don't do it as much as they probably should.
+
+- On the server side ideally, the server should be able to decide for
+ itself how it stores things. It should not depend on users deciding
+ to use tools like Git LFS on some blobs or not.
+
+- It's much more expensive to store large blobs that don't delta
+ compress well on regular fast seeking drives (like SSDs) than on
+ object storage (like Amazon S3 or GCP Buckets). Using fast drives
+ for regular Git repos makes sense though, as serving regular Git
+ content (blobs containing text or code) needs drives where seeking
+ is fast, but the content is relatively small. On the other hand,
+ object storage for Git LFS blobs makes sense as seeking speed is not
+ as important when dealing with large files, while costs are more
+ important. So the fact that users don't use Git LFS or similar tools
+ for a significant number of large blobs has likely some bad
+ consequences on the cost of repo storage for most Git hosting
+ platforms.
+
+- Having large blobs handled in the same way as other blobs and Git
+ objects in Git repos instead of on object storage also has a cost in
+ increased memory and CPU usage, and therefore decreased performance,
+ when creating packfiles. (This is because Git tries to use delta
+ compression or zlib compression which is unlikely to work well on
+ already compressed binary content.) So it's not just a storage cost
+ increase.
+
+- When a large blob has been committed into a repo, it might not be
+ possible to remove this blob from the repo without rewriting
+ history, even if the user then decides to use Git LFS or a similar
+ tool to handle it.
+
+- In fact Git LFS and similar tools are not very flexible in letting
+ users change their minds about the blobs they should handle or not.
+
+- Even when users are using Git LFS or similar tools, they are often
+ complaining that these tools require significant effort to set up,
+ learn and use correctly.
+
+II) Main features of the "Large Object Promisors" solution
+----------------------------------------------------------
+
+The main features below should give a rough overview of how the
+solution may work. Details about needed elements can be found in
+following sections.
+
+Even if each feature below is very useful for the full solution, it is
+very likely to be also useful on its own in some cases where the full
+solution is not required. However, we'll focus primarily on the big
+picture here.
+
+Also each feature doesn't need to be implemented entirely in Git
+itself. Some could be scripts, hooks or helpers that are not part of
+the Git repo. It could be helpful if those could be shared and
+improved on collaboratively though.
+
+1) Large blobs are stored on LOPs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Large blobs should be stored on special promisor remotes that we will
+call "Large Object Promisors" or LOPs. These LOPs should be additional
+remotes dedicated to contain large blobs especially those in binary
+format. They should be used along with main remotes that contain the
+other objects.
+
+Note 1
+++++++
+
+To clarify, a LOP is a normal promisor remote, except that:
+
+- it should store only large blobs,
+
+- it should be separate from the main remote, so that the main remote
+ can focus on serving other objects and the rest of the repos (see
+ feature 4) below) and can use the LOP as a promisor remote for
+ itself.
+
+Note 2
+++++++
+
+Git already makes it possible for a main remote to also be a promisor
+remote storing both regular objects and large blobs for a client that
+clones from it with a filter on blob size. But here we explicitly want
+to avoid that.
+
+Rationale
++++++++++
+
+LOP remotes should be good at handling large blobs while main remotes
+should be good at handling other objects.
+
+Implementation
+++++++++++++++
+
+Git already has support for multiple promisor remotes, see
+link:partial-clone.html#using-many-promisor-remotes[the partial clone documentation].
+
+Also, Git already has support for partial clone using a filter on the
+size of the blobs (with `git clone --filter=blob:limit=<size>`). Most
+of the other main features below are based on these existing features
+and are about making them easy and efficient to use for the purpose of
+better handling large blobs.
+
+2) LOPs can use object storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A LOP could be using object storage, like an Amazon S3 or GCP Bucket
+to actually store the large blobs, and could be accessed through a Git
+remote helper (see linkgit:gitremote-helpers[7]) which makes the
+underlying object storage appears like a remote to Git.
+
+Note
+++++
+
+A LOP could be a promisor remote accessed using a remote helper by
+both some clients and the main remote.
+
+Rationale
++++++++++
+
+This looks like the simplest way to create LOPs that can cheaply
+handle many large blobs.
+
+Implementation
+++++++++++++++
+
+Remote helpers are quite easy to write as shell scripts, but it might
+be more efficient and maintainable to write them using other languages
+like Go.
+
+Other ways to implement LOPs are certainly possible, but the goal of
+this document is not to discuss how to best implement a LOP or its
+underlying object storage (see the "0) Non goals" section above).
+
+3) LOP object storage can be Git LFS storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The underlying object storage that a LOP uses could also serve as
+storage for large files handled by Git LFS.
+
+Rationale
++++++++++
+
+This would simplify the server side if it wants to both use a LOP and
+act as a Git LFS server.
+
+4) A main remote can offload to a LOP with a configurable threshold
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+On the server side, a main remote should have a way to offload to a
+LOP all its blobs with a size over a configurable threshold.
+
+Rationale
++++++++++
+
+This makes it easy to set things up and to clean things up. For
+example, an admin could use this to manually convert a repo not using
+LOPs to a repo using a LOP. On a repo already using a LOP but where
+some users would sometimes push large blobs, a cron job could use this
+to regularly make sure the large blobs are moved to the LOP.
+
+Implementation
+++++++++++++++
+
+Using something based on `git repack --filter=...` to separate the
+blobs we want to offload from the other Git objects could be a good
+idea. The missing part is to connect to the LOP, check if the blobs we
+want to offload are already there and if not send them.
+
+5) A main remote should try to remain clean from large blobs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A main remote should try to avoid containing a lot of oversize
+blobs. For that purpose, it should offload as needed to a LOP and it
+should have ways to prevent oversize blobs to be fetched, and also
+perhaps pushed, into it.
+
+Rationale
++++++++++
+
+A main remote containing many oversize blobs would defeat the purpose
+of LOPs.
+
+Implementation
+++++++++++++++
+
+The way to offload to a LOP discussed in 4) above can be used to
+regularly offload oversize blobs. About preventing oversize blobs to
+be fetched into the repo see 6) below. About preventing oversize blob
+pushes, a pre-receive hook could be used.
+
+Also there are different scenarios in which large blobs could get
+fetched into the main remote, for example:
+
+- A client that doesn't implement the "promisor-remote" protocol
+ (described in 6) below) clones from the main remote.
+
+- The main remote gets a request for information about a large blob
+ and is not able to get that information without fetching the blob
+ from the LOP.
+
+It might not be possible to completely prevent all these scenarios
+from happening. So the goal here should be to implement features that
+make the fetching of large blobs less likely. For example adding a
+`remote-object-info` command in the `git cat-file --batch*` protocol
+might make it possible for a main repo to respond to some requests
+about large blobs without fetching them.
+
+6) A protocol negotiation should happen when a client clones
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a client clones from a main repo, there should be a protocol
+negotiation so that the server can advertise one or more LOPs and so
+that the client and the server can discuss if the client could
+directly use a LOP the server is advertising. If the client and the
+server can agree on that, then the client would be able to get the
+large blobs directly from the LOP and the server would not need to
+fetch those blobs from the LOP to be able to serve the client.
+
+Note
+++++
+
+For fetches instead of clones, see the "What about fetches?" FAQ entry
+below.
+
+Rationale
++++++++++
+
+Security, configurability and efficiency of setting things up.
+
+Implementation
+++++++++++++++
+
+A "promisor-remote" protocol v2 capability looks like a good way to
+implement this. The way the client and server use this capability
+could be controlled by configuration variables.
+
+Information that the server could send to the client through that
+protocol could be things like: LOP name, LOP URL, filter-spec (for
+example `blob:limit=<size>`) or just size limit that should be used as
+a filter when cloning, token to be used with the LOP, etc..
+
+7) A client can offload to a LOP
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a client is using a LOP that is also a LOP of its main remote,
+the client should be able to offload some large blobs it has fetched,
+but might not need anymore, to the LOP.
+
+Rationale
++++++++++
+
+On the client, the easiest way to deal with unneeded large blobs is to
+offload them.
+
+Implementation
+++++++++++++++
+
+This is very similar to what 4) above is about, except on the client
+side instead of the server side. So a good solution to 4) could likely
+be adapted to work on the client side too.
+
+There might be some security issues here, as there is no negotiation,
+but they might be mitigated if the client can reuse a token it got
+when cloning (see 6) above). Also if the large blobs were fetched from
+a LOP, it is likely, and can easily be confirmed, that the LOP still
+has them, so that they can just be removed from the client.
+
+III) Benefits of using LOPs
+---------------------------
+
+Many benefits are related to the issues discussed in "I) Issues with
+the current situation" above:
+
+- No need to rewrite history when deciding which blobs are worth
+ handling separately than other objects, or when moving or removing
+ the threshold.
+
+- If the protocol between client and server is developed and secured
+ enough, then many details might be setup on the server side only and
+ all the clients could then easily get all the configuration
+ information and use it to set themselves up mostly automatically.
+
+- Storage costs benefits on the server side.
+
+- Reduced memory and CPU needs on main remotes on the server side.
+
+- Reduced storage needs on the client side.
+
+IV) FAQ
+-------
+
+What about using multiple LOPs on the server and client side?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+That could perhaps be useful in some cases, but it's more likely for
+now than in most cases a single LOP will be advertised by the server
+and should be used by the client.
+
+A case where it could be useful for a server to advertise multiple
+LOPs is if a LOP is better for some users while a different LOP is
+better for other users. For example some clients might have a better
+connection to a LOP than others.
+
+In those cases it's the responsibility of the server to have some
+documentation to help clients. It could say for example something like
+"Users in this part of the world might want to pick only LOP A as it
+is likely to be better connected to them, while users in other parts
+of the world should pick only LOP B for the same reason."
+
+Trusting the LOPs advertised by the server, or not trusting them?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In some contexts, like in corporate setup where the server and all the
+clients are parts of an internal network in a company where admins
+have all the rights on every system, it's Ok, and perhaps even a good
+thing, if the clients fully trust the server, as it can help ensure
+that all the clients are on the same page.
+
+There are also contexts in which clients trust a code hosting platform
+serving them some repos, but might not fully trust other users
+managing or contributing to some of these repos. For example, the code
+hosting platform could have hooks in place to check that any object it
+receives doesn't contain malware or otherwise bad content. In this
+case it might be OK for the client to use a main remote and its LOP if
+they are both hosted by the code hosting platform, but not if the LOP
+is hosted elsewhere (where the content is not checked).
+
+In other contexts, a client should just not trust a server.
+
+So there should be different ways to configure how the client should
+behave when a server advertises a LOP to it at clone time.
+
+As the basic elements that a server can advertise about a LOP are a
+LOP name and a LOP URL, the client should base its decision about
+accepting a LOP on these elements.
+
+One simple way to be very strict in the LOP it accepts is for example
+for the client to check that the LOP is already configured on the
+client with the same name and URL as what the server advertises.
+
+In general default and "safe" settings should require that the LOP are
+configured on the client separately from the "promisor-remote"
+protocol and that the client accepts a LOP only when information about
+it from the protocol matches what has been already configured
+separately.
+
+What about LOP names?
+~~~~~~~~~~~~~~~~~~~~~
+
+In some contexts, for example if the clients sometimes fetch from each
+other, it can be a good idea for all the clients to use the same names
+for all the remotes they use, including LOPs.
+
+In other contexts, each client might want to be able to give the name
+it wants to each remote, including each LOP, it interacts with.
+
+So there should be different ways to configure how the client accepts
+or not the LOP name the server advertises.
+
+If a default or "safe" setting is used, then as such a setting should
+require that the LOP be configured separately, then the name would be
+configured separately and there is no risk that the server could
+dictate a name to a client.
+
+Could the main remote be bogged down by old or paranoid clients?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Yes, it could happen if there are too many clients that are either
+unwilling to trust the main remote or that just don't implement the
+"promisor-remote" protocol because they are too old or not fully
+compatible with the 'git' client.
+
+When serving such a client, the main remote has no other choice than
+to first fetch from its LOP, to then be able to provide to the client
+everything it requested. So the main remote, even if it has cleanup
+mechanisms (see section II.4 above), would be burdened at least
+temporarily with the large blobs it had to fetch from its LOP.
+
+Not behaving like this would be breaking backward compatibility, and
+could be seen as segregating clients. For example, it might be
+possible to implement a special mode that allows the server to just
+reject clients that don't implement the "promisor-remote" protocol or
+aren't willing to trust the main remote. This mode might be useful in
+a special context like a corporate environment. There is no plan to
+implement such a mode though, and this should be discussed separately
+later anyway.
+
+A better way to proceed is probably for the main remote to show a
+message telling clients that don't implement the protocol or are
+unwilling to accept the advertised LOP(s) that they would get faster
+clone and fetches by upgrading client software or properly setting
+them up to accept LOP(s).
+
+Waiting for clients to upgrade, monitoring these upgrades and limiting
+the use of LOPs to repos that are not very frequently accessed might
+be other good ways to make sure that some benefits are still reaped
+from LOPs. Over time, as more and more clients upgrade and benefit
+from LOPs, using them in more and more frequently accessed repos will
+become worth it.
+
+Corporate environments, where it might be easier to make sure that all
+the clients are up-to-date and properly configured, could hopefully
+benefit more and earlier from using LOPs.
+
+What about fetches?
+~~~~~~~~~~~~~~~~~~~
+
+There are different kinds of fetches. A regular fetch happens when
+some refs have been updated on the server and the client wants the ref
+updates and possibly the new objects added with them. A "backfill" or
+"lazy" fetch, on the contrary, happens when the client needs to use
+some objects it already knows about but doesn't have because they are
+on a promisor remote.
+
+Regular fetch
++++++++++++++
+
+In a regular fetch, the client will contact the main remote and a
+protocol negotiation will happen between them. It's a good thing that
+a protocol negotiation happens every time, as the configuration on the
+client or the main remote could have changed since the previous
+protocol negotiation. In this case, the new protocol negotiation
+should ensure that the new fetch will happen in a way that satisfies
+the new configuration of both the client and the server.
+
+In most cases though, the configurations on the client and the main
+remote will not have changed between 2 fetches or between the initial
+clone and a subsequent fetch. This means that the result of a new
+protocol negotiation will be the same as the previous result, so the
+new fetch will happen in the same way as the previous clone or fetch,
+using, or not using, the same LOP(s) as last time.
+
+"Backfill" or "lazy" fetch
+++++++++++++++++++++++++++
+
+When there is a backfill fetch, the client doesn't necessarily contact
+the main remote first. It will try to fetch from its promisor remotes
+in the order they appear in the config file, except that a remote
+configured using the `extensions.partialClone` config variable will be
+tried last. See
+link:partial-clone.html#using-many-promisor-remotes[the partial clone documentation].
+
+This is not new with this effort. In fact this is how multiple remotes
+have already been working for around 5 years.
+
+When using LOPs, having the main remote configured using
+`extensions.partialClone`, so it's tried last, makes sense, as missing
+objects should only be large blobs that are on LOPs.
+
+This means that a protocol negotiation will likely not happen as the
+missing objects will be fetched from the LOPs, and then there will be
+nothing left to fetch from the main remote.
+
+To secure that, it could be a good idea for LOPs to require a token
+from the client when it fetches from them. The client could get the
+token when performing a protocol negotiation with the main remote (see
+section II.6 above).
--
2.47.1.402.gc25c94707f
^ permalink raw reply related [flat|nested] 110+ messages in thread
* Re: [PATCH v3 5/5] doc: add technical design doc for large object promisors
2024-12-06 12:42 ` [PATCH v3 5/5] doc: add technical design doc for large object promisors Christian Couder
@ 2024-12-10 1:28 ` Junio C Hamano
2025-01-27 15:12 ` Christian Couder
2024-12-10 11:43 ` Junio C Hamano
1 sibling, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2024-12-10 1:28 UTC (permalink / raw)
To: Christian Couder
Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Christian Couder
Christian Couder <christian.couder@gmail.com> writes:
> Let's add a design doc about how we could improve handling liarge blobs
> using "Large Object Promisors" (LOPs). It's a set of features with the
> goal of using special dedicated promisor remotes to store large blobs,
> and having them accessed directly by main remotes and clients.
>
> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---
> .../technical/large-object-promisors.txt | 530 ++++++++++++++++++
> 1 file changed, 530 insertions(+)
> create mode 100644 Documentation/technical/large-object-promisors.txt
Kudos to whoever suggested to write this kind of birds-eye view
document to help readers understand the bigger picture. Such a "we
want to go in this direction, and this small piece fits within that
larger picture this way" is a good way to motivate readers.
Hopefully I'll have time to comment on different parts of the
documents, but the impression I got was that we should write with
fewer "we could" and instead say more "we aim to", i.e. be more
assertive.
Thanks.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v3 5/5] doc: add technical design doc for large object promisors
2024-12-10 1:28 ` Junio C Hamano
@ 2025-01-27 15:12 ` Christian Couder
0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:12 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Christian Couder
On Tue, Dec 10, 2024 at 2:28 AM Junio C Hamano <gitster@pobox.com> wrote:
> Hopefully I'll have time to comment on different parts of the
> documents, but the impression I got was that we should write with
> fewer "we could" and instead say more "we aim to", i.e. be more
> assertive.
I have tried to make the next version of the document more assertive
in some places and clearer in other places by replacing some "could"
with other terms.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v3 5/5] doc: add technical design doc for large object promisors
2024-12-06 12:42 ` [PATCH v3 5/5] doc: add technical design doc for large object promisors Christian Couder
2024-12-10 1:28 ` Junio C Hamano
@ 2024-12-10 11:43 ` Junio C Hamano
2024-12-16 9:00 ` Patrick Steinhardt
2025-01-27 15:11 ` Christian Couder
1 sibling, 2 replies; 110+ messages in thread
From: Junio C Hamano @ 2024-12-10 11:43 UTC (permalink / raw)
To: Christian Couder
Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Christian Couder
Christian Couder <christian.couder@gmail.com> writes:
> +We will call a "Large Object Promisor", or "LOP" in short, a promisor
> +remote which is used to store only large blobs and which is separate
> +from the main remote that should store the other Git objects and the
> +rest of the repos.
> +
> +By extension, we will also call "Large Object Promisor", or LOP, the
> +effort described in this document to add a set of features to make it
> +easier to handle large blobs/files in Git by using LOPs.
> +
> +This effort would especially improve things on the server side, and
> +especially for large blobs that are already compressed in a binary
> +format.
The implementation on the server side can be hidden and be improved
as long as we have a reasonable wire protocol. As it stands, even
with the promisor-remote referral extension, the data coming from
LOP still is expected to be a pack stream, which I am not sure is a
good match. Is the expectation (yes, I know the document later says
it won't go into storage layer, but still, in order to get the
details of the protocol extension right, we MUST have some idea on
the characteristics the storage layer has so that the protocol would
work well with the storage implementation with such characteristics)
that we give up on deltifying these LOP objects (which might be a
sensible assumption, if they are incompressible large binary gunk),
we store each object in LOP as base representation inside a pack
stream (i.e. the in-pack "undeltified representation" defined in
Documentation/gitformat-pack.txt), so that to send these LOP objects
is just the matter of preparing the pack header (PACK + version +
numobjects) and then concatenating these objects while computing the
running checksum to place in the trailer of the pack stream? Could
it still be too expensive for the server side, having to compute the
running sum, and we might want to update the object transfer part of
the pack stream definition somehow to reduce the load on the server
side?
> +- We will not discuss those client side improvements here, as they
> + would require changes in different parts of Git than this effort.
> ++
> +So we don't pretend to fully replace Git LFS with only this effort,
> +but we nevertheless believe that it can significantly improve the
> +current situation on the server side, and that other separate
> +efforts could also improve the situation on the client side.
We still need to come up with a minimally working client side
components, if our goal were to only improve the server side, in
order to demonstrate the benefit of the effort.
> +In other words, the goal of this document is not to talk about all the
> +possible ways to optimize how Git could handle large blobs, but to
> +describe how a LOP based solution could work well and alleviate a
> +number of current issues in the context of Git clients and servers
> +sharing Git objects.
But if you do not discuss even a single way, and handwave "we'll
have this magical object storage that would solve all the problems
for us", then we cannot really tell if the problem is solved by us,
or by handwaved away by assuming the magical object storage. We'd
need at least one working example.
> +6) A protocol negotiation should happen when a client clones
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +When a client clones from a main repo, there should be a protocol
> +negotiation so that the server can advertise one or more LOPs and so
> +that the client and the server can discuss if the client could
> +directly use a LOP the server is advertising. If the client and the
> +server can agree on that, then the client would be able to get the
> +large blobs directly from the LOP and the server would not need to
> +fetch those blobs from the LOP to be able to serve the client.
> +
> +Note
> +++++
> +
> +For fetches instead of clones, see the "What about fetches?" FAQ entry
> +below.
> +
> +Rationale
> ++++++++++
> +
> +Security, configurability and efficiency of setting things up.
It is unclear how it improves security and configurability if we
limit the protocol exchange only at the clone time (implying that
later either side cannot change it). It will lead to security
issues if we assume that it is impossible for one side to "lie" to
the other side what they earlier agreed on (unless we somehow make
it actually impossible to lie to the other side, of course).
> +7) A client can offload to a LOP
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +When a client is using a LOP that is also a LOP of its main remote,
> +the client should be able to offload some large blobs it has fetched,
> +but might not need anymore, to the LOP.
For a client that _creates_ a large object, the situation would be
the same, right? After it creates several versions of the opening
segment of, say, a movie, the latest version may be still wanted,
but the creating client may want to offload earlier versions.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v3 5/5] doc: add technical design doc for large object promisors
2024-12-10 11:43 ` Junio C Hamano
@ 2024-12-16 9:00 ` Patrick Steinhardt
2025-01-27 15:11 ` Christian Couder
1 sibling, 0 replies; 110+ messages in thread
From: Patrick Steinhardt @ 2024-12-16 9:00 UTC (permalink / raw)
To: Junio C Hamano
Cc: Christian Couder, git, John Cai, Taylor Blau, Eric Sunshine,
Christian Couder
On Tue, Dec 10, 2024 at 08:43:03PM +0900, Junio C Hamano wrote:
> Christian Couder <christian.couder@gmail.com> writes:
> > +In other words, the goal of this document is not to talk about all the
> > +possible ways to optimize how Git could handle large blobs, but to
> > +describe how a LOP based solution could work well and alleviate a
> > +number of current issues in the context of Git clients and servers
> > +sharing Git objects.
>
> But if you do not discuss even a single way, and handwave "we'll
> have this magical object storage that would solve all the problems
> for us", then we cannot really tell if the problem is solved by us,
> or by handwaved away by assuming the magical object storage. We'd
> need at least one working example.
It's something we're working on in parallel with the effort to slowly
move towards pluggable object databases. We aren't yet totally clear
on how exactly to store such objects, but there are a couple of ideas:
- Store large objects verbatim in a separate path without any kind of
compression at all. It solves the problem of wasting compute time
during compression, but does not solve the problem of having to
store blobs multiple times even if only a tiny part of them change.
- Use a rolling hash function to split up large objects into smaller
hunks that can be deduplicated. This solves the issue of only small
parts of the binary file changing as we'd only have to store the
hunk that has changed.
This has been discussed e.g. in [1], and I've been talking with some
people about rolling hash functions.
In any case, getting to pluggale ODBs is likely a multi-year effort, so
I wonder how detailed we should be in the context of the document here.
We might want to mention that there are ideas and maybe even provide
some pointers, but I think it makes sense to defer the technical
discussion of how exactly this could look like to the future. Mostly
because I think it's going to be a rather big discussion on its own.
Patrick
[1]: https://lore.kernel.org/git/xmqqbkdometi.fsf@gitster.g/
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v3 5/5] doc: add technical design doc for large object promisors
2024-12-10 11:43 ` Junio C Hamano
2024-12-16 9:00 ` Patrick Steinhardt
@ 2025-01-27 15:11 ` Christian Couder
2025-01-27 18:02 ` Junio C Hamano
1 sibling, 1 reply; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:11 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Christian Couder
On Tue, Dec 10, 2024 at 12:43 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > +We will call a "Large Object Promisor", or "LOP" in short, a promisor
> > +remote which is used to store only large blobs and which is separate
> > +from the main remote that should store the other Git objects and the
> > +rest of the repos.
> > +
> > +By extension, we will also call "Large Object Promisor", or LOP, the
> > +effort described in this document to add a set of features to make it
> > +easier to handle large blobs/files in Git by using LOPs.
> > +
> > +This effort would especially improve things on the server side, and
> > +especially for large blobs that are already compressed in a binary
> > +format.
>
> The implementation on the server side can be hidden and be improved
> as long as we have a reasonable wire protocol. As it stands, even
> with the promisor-remote referral extension, the data coming from
> LOP still is expected to be a pack stream, which I am not sure is a
> good match.
I agree it might not be a good match.
> Is the expectation (yes, I know the document later says
> it won't go into storage layer, but still, in order to get the
> details of the protocol extension right, we MUST have some idea on
> the characteristics the storage layer has so that the protocol would
> work well with the storage implementation with such characteristics)
> that we give up on deltifying these LOP objects (which might be a
> sensible assumption, if they are incompressible large binary gunk),
Yes, there is a section (II.2) called "LOPs can use object storage" about this.
In the next version I have tried to clarified this early in the doc by
saying the following in the non-goal section:
"Our opinion is that the simplest solution for now is for LOPs to use
object storage through a remote helper (see section II.2 below for
more details) to store their objects. So we consider that this is the
default implementation. If there are improvements on top of this,
that's great, but our opinion is that such improvements are not
necessary for LOPs to already be useful. Such improvements are likely
a different technical topic, and can be taken care of separately
anyway."
> we store each object in LOP as base representation inside a pack
> stream (i.e. the in-pack "undeltified representation" defined in
> Documentation/gitformat-pack.txt), so that to send these LOP objects
> is just the matter of preparing the pack header (PACK + version +
> numobjects) and then concatenating these objects while computing the
> running checksum to place in the trailer of the pack stream? Could
> it still be too expensive for the server side, having to compute the
> running sum, and we might want to update the object transfer part of
> the pack stream definition somehow to reduce the load on the server
> side?
I agree that this might be an interesting thing to look at, but I
think it's not necessary to work on this now. It's more important for
now that the storage for large blobs on LOPs is cheap.
As clients may not all migrate soon to a version of Git that supports
LOPs well, it's likely that LOPs will be used for repos that are
mostly inactive first (at least that's our plan at GitLab), so there
would not be much traffic. This would give us time to look at
optimizing data transfer.
> > +- We will not discuss those client side improvements here, as they
> > + would require changes in different parts of Git than this effort.
> > ++
> > +So we don't pretend to fully replace Git LFS with only this effort,
> > +but we nevertheless believe that it can significantly improve the
> > +current situation on the server side, and that other separate
> > +efforts could also improve the situation on the client side.
>
> We still need to come up with a minimally working client side
> components, if our goal were to only improve the server side, in
> order to demonstrate the benefit of the effort.
How would clients work worse with large files compared to the current
situation, when the benefit of the current effort (the
"promisor-remote" capability) makes it easier for them, but doesn't
force them, to use promisor remotes?
If clients can use promisor remotes more, especially when cloning,
they can benefit from having fewer large files locally when they don't
need them. So they should just work better. And again they are not
forced to use promisor remotes, if they still prefer not to use them,
they still can perform a regular clone, and they will not work
differently than they do now.
> > +In other words, the goal of this document is not to talk about all the
> > +possible ways to optimize how Git could handle large blobs, but to
> > +describe how a LOP based solution could work well and alleviate a
> > +number of current issues in the context of Git clients and servers
> > +sharing Git objects.
>
> But if you do not discuss even a single way, and handwave "we'll
> have this magical object storage that would solve all the problems
> for us", then we cannot really tell if the problem is solved by us,
> or by handwaved away by assuming the magical object storage.
> We'd need at least one working example.
It's not magical object storage. Amazon S3, GCP Bucket and MinIO
(which is open source), for example, already exist and are used a lot
in the industry. Some Git remote helpers to access them can even be
found online under open source licenses, like for example:
- https://github.com/awslabs/git-remote-s3
- https://gitlab.com/eric.p.ju/git-remote-gs
Writing a remote helper to use some object storage as a promisor
remote is also not very difficult. Yeah, perhaps optimizing them would
be worth the effort, but they are, or would likely be, at least for
now, separate projects, and nothing prevents people interested in
optimizing them from contributing to these projects.
I have added some details about these object storage technologies and
remote helpers to access them in the next version of the doc.
> > +6) A protocol negotiation should happen when a client clones
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +When a client clones from a main repo, there should be a protocol
> > +negotiation so that the server can advertise one or more LOPs and so
> > +that the client and the server can discuss if the client could
> > +directly use a LOP the server is advertising. If the client and the
> > +server can agree on that, then the client would be able to get the
> > +large blobs directly from the LOP and the server would not need to
> > +fetch those blobs from the LOP to be able to serve the client.
> > +
> > +Note
> > +++++
> > +
> > +For fetches instead of clones, see the "What about fetches?" FAQ entry
> > +below.
> > +
> > +Rationale
> > ++++++++++
> > +
> > +Security, configurability and efficiency of setting things up.
>
> It is unclear how it improves security and configurability if we
> limit the protocol exchange only at the clone time (implying that
> later either side cannot change it). It will lead to security
> issues if we assume that it is impossible for one side to "lie" to
> the other side what they earlier agreed on (unless we somehow make
> it actually impossible to lie to the other side, of course).
It's not limited to clone time. There are tests in the patch series
that test that the protocol is used and works when fetching.
The "What about fetches?" FAQ entry also says:
"In a regular fetch, the client will contact the main remote and a
protocol negotiation will happen between them."
Or are you talking about lazy fetches? There it is mentioned that a
token could be used to secure this. Other parts of the doc mention
using such a token by the way.
I have changed the note about fetches to be like this:
"For fetches instead of clones, a protocol negotiation might not always
happen, see the "What about fetches?" FAQ entry below for details."
> > +7) A client can offload to a LOP
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +When a client is using a LOP that is also a LOP of its main remote,
> > +the client should be able to offload some large blobs it has fetched,
> > +but might not need anymore, to the LOP.
>
> For a client that _creates_ a large object, the situation would be
> the same, right? After it creates several versions of the opening
> segment of, say, a movie, the latest version may be still wanted,
> but the creating client may want to offload earlier versions.
Yeah, but it's not clear if the versions of the opening segment should
be sent directly to the LOP without the main remote checking them in
some ways (hooks might be configured only on the main remote) and/or
checking that they are connected to the repo. I guess it depends on
the context if it would be OK or not.
I have added the following note:
"It might depend on the context if it should be OK or not for clients
to offload large blobs they have created, instead of fetched, directly
to the LOP without the main remote checking them in some ways
(possibly using hooks or other tools)."
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v3 5/5] doc: add technical design doc for large object promisors
2025-01-27 15:11 ` Christian Couder
@ 2025-01-27 18:02 ` Junio C Hamano
2025-02-18 11:42 ` Christian Couder
0 siblings, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2025-01-27 18:02 UTC (permalink / raw)
To: Christian Couder
Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Christian Couder
Christian Couder <christian.couder@gmail.com> writes:
>> > +In other words, the goal of this document is not to talk about all the
>> > +possible ways to optimize how Git could handle large blobs, but to
>> > +describe how a LOP based solution could work well and alleviate a
>> > +number of current issues in the context of Git clients and servers
>> > +sharing Git objects.
>>
>> But if you do not discuss even a single way, and handwave "we'll
>> have this magical object storage that would solve all the problems
>> for us", then we cannot really tell if the problem is solved by us,
>> or by handwaved away by assuming the magical object storage.
>> We'd need at least one working example.
>
> It's not magical object storage. Amazon S3, GCP Bucket and MinIO
> (which is open source), for example, already exist and are used a lot
> in the industry.
That's just "we can store bunch of bytes and ask them to be
retrieved". What I said about handwaving the presence of magical
"object storage" is exactly the "optimize how to handle large blobs"
part. I agree that we do not need to discuss _ALL_ the possible
ways. But without telling what our thoughts on _how_ to use these
"lower cost and safe by duplication but with high latency" services
to store our objects efficiently enough to make it practical, I'd
have to call what we see in the document "magical object storage".
>> > +7) A client can offload to a LOP
>> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> > +
>> > +When a client is using a LOP that is also a LOP of its main remote,
>> > +the client should be able to offload some large blobs it has fetched,
>> > +but might not need anymore, to the LOP.
>>
>> For a client that _creates_ a large object, the situation would be
>> the same, right? After it creates several versions of the opening
>> segment of, say, a movie, the latest version may be still wanted,
>> but the creating client may want to offload earlier versions.
>
> Yeah, but it's not clear if the versions of the opening segment should
> be sent directly to the LOP without the main remote checking them in
> some ways (hooks might be configured only on the main remote) and/or
> checking that they are connected to the repo. I guess it depends on
> the context if it would be OK or not.
If it is not clear to us or whoever writes this document, the users
would have a hard time to make effective use of it, which is why I
am worried about the current design in this feature.
Thanks for clarifying other parts of my confusion.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v3 5/5] doc: add technical design doc for large object promisors
2025-01-27 18:02 ` Junio C Hamano
@ 2025-02-18 11:42 ` Christian Couder
0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-02-18 11:42 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Christian Couder
On Mon, Jan 27, 2025 at 7:02 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> >> > +In other words, the goal of this document is not to talk about all the
> >> > +possible ways to optimize how Git could handle large blobs, but to
> >> > +describe how a LOP based solution could work well and alleviate a
> >> > +number of current issues in the context of Git clients and servers
> >> > +sharing Git objects.
> >>
> >> But if you do not discuss even a single way, and handwave "we'll
> >> have this magical object storage that would solve all the problems
> >> for us", then we cannot really tell if the problem is solved by us,
> >> or by handwaved away by assuming the magical object storage.
> >> We'd need at least one working example.
> >
> > It's not magical object storage. Amazon S3, GCP Bucket and MinIO
> > (which is open source), for example, already exist and are used a lot
> > in the industry.
>
> That's just "we can store bunch of bytes and ask them to be
> retrieved". What I said about handwaving the presence of magical
> "object storage" is exactly the "optimize how to handle large blobs"
> part. I agree that we do not need to discuss _ALL_ the possible
> ways. But without telling what our thoughts on _how_ to use these
> "lower cost and safe by duplication but with high latency" services
> to store our objects efficiently enough to make it practical, I'd
> have to call what we see in the document "magical object storage".
I have added the following:
Even if LOPs are used not very efficiently, they can still be useful
and worth using in some cases because, as we will see in more details
later in this document:
- they can make it simpler for clients to use promisor remotes and
therefore avoid fetching a lot of large blobs they might not need
locally,
- they can make it significantly cheaper or easier for servers to
host a significant part of the current repository content, and
even more to host content with larger blobs or more large blobs
than currently.
I hope this addresses some of your concerns. I could also talk about
remote helpers and object storage here, but this would be duplicating
the "2) LOPs can use object storage" section. If you think that we
should tell our thoughts about how to improve remote helpers and
object storage performance, I think this should go into that section
rather than here.
> >> > +7) A client can offload to a LOP
> >> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >> > +
> >> > +When a client is using a LOP that is also a LOP of its main remote,
> >> > +the client should be able to offload some large blobs it has fetched,
> >> > +but might not need anymore, to the LOP.
> >>
> >> For a client that _creates_ a large object, the situation would be
> >> the same, right? After it creates several versions of the opening
> >> segment of, say, a movie, the latest version may be still wanted,
> >> but the creating client may want to offload earlier versions.
> >
> > Yeah, but it's not clear if the versions of the opening segment should
> > be sent directly to the LOP without the main remote checking them in
> > some ways (hooks might be configured only on the main remote) and/or
> > checking that they are connected to the repo. I guess it depends on
> > the context if it would be OK or not.
>
> If it is not clear to us or whoever writes this document, the users
> would have a hard time to make effective use of it, which is why I
> am worried about the current design in this feature.
Yeah, but this feature doesn't exist at all yet, and it might not even
be a priority, so I prefer not to promise too much.
For now, I have added:
"This should be discussed and refined when we get closer to
implementing this feature."
just after:
"It might depend on the context if it should be OK or not for clients
to offload large blobs they have created, instead of fetched, directly
to the LOP without the main remote checking them in some ways
(possibly using hooks or other tools)."
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v3 0/5] Introduce a "promisor-remote" capability
2024-12-06 12:42 ` [PATCH v3 0/5] " Christian Couder
` (4 preceding siblings ...)
2024-12-06 12:42 ` [PATCH v3 5/5] doc: add technical design doc for large object promisors Christian Couder
@ 2024-12-09 8:04 ` Junio C Hamano
2024-12-09 10:40 ` Christian Couder
2025-01-27 15:16 ` [PATCH v4 0/6] " Christian Couder
6 siblings, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2024-12-09 8:04 UTC (permalink / raw)
To: Christian Couder
Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine
Christian Couder <christian.couder@gmail.com> writes:
> This work is part of some effort to better handle large files/blobs in
> a client-server context using promisor remotes dedicated to storing
> large blobs. To help understand this effort, this series now contains
> a patch (patch 5/5) that adds design documentation about this effort.
https://github.com/git/git/actions/runs/12229786922/job/34110073072
is a CI-run on 'seen' with this topic. linux-TEST-vars job is failing.
A CI-run for the same topics in 'seen' but without this topic is
https://github.com/git/git/actions/runs/12230853182/job/34112864500
This topic seems to break linux-TEST-vars CI job (where different
settings like + export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=master
is used).
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v3 0/5] Introduce a "promisor-remote" capability
2024-12-09 8:04 ` [PATCH v3 0/5] Introduce a "promisor-remote" capability Junio C Hamano
@ 2024-12-09 10:40 ` Christian Couder
2024-12-09 10:42 ` Christian Couder
2024-12-09 23:01 ` Junio C Hamano
0 siblings, 2 replies; 110+ messages in thread
From: Christian Couder @ 2024-12-09 10:40 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine
On Mon, Dec 9, 2024 at 9:04 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > This work is part of some effort to better handle large files/blobs in
> > a client-server context using promisor remotes dedicated to storing
> > large blobs. To help understand this effort, this series now contains
> > a patch (patch 5/5) that adds design documentation about this effort.
>
> https://github.com/git/git/actions/runs/12229786922/job/34110073072
> is a CI-run on 'seen' with this topic. linux-TEST-vars job is failing.
>
> A CI-run for the same topics in 'seen' but without this topic is
> https://github.com/git/git/actions/runs/12230853182/job/34112864500
>
> This topic seems to break linux-TEST-vars CI job (where different
> settings like + export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=master
> is used).
Yeah, in the "CI tests" section in the cover letter I wrote:
> One test, linux-TEST-vars, failed much earlier, in what doesn't look
> like a CI issue as I could reproduce the failure locally when setting
> GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL to 1. I will investigate,
> but in the meantime I think I can send this as-is so we can start
> discussing.
I noticed that fcb2205b77 (midx: implement support for writing
incremental MIDX chains, 2024-08-06)
which introduced GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL adds lines like:
GIT_TEST_MULTI_PACK_INDEX=0
GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0
at the top of a number of repack related test scripts like
t7700-repack.sh, so I guess that it should be OK to add the same lines
at the top of the t5710 test script added by this series. This should
fix the CI failures.
I have made this change in my current version.
Thanks.
Yeah, not sure why
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v3 0/5] Introduce a "promisor-remote" capability
2024-12-09 10:40 ` Christian Couder
@ 2024-12-09 10:42 ` Christian Couder
2024-12-09 23:01 ` Junio C Hamano
1 sibling, 0 replies; 110+ messages in thread
From: Christian Couder @ 2024-12-09 10:42 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine
On Mon, Dec 9, 2024 at 11:40 AM Christian Couder
<christian.couder@gmail.com> wrote:
> Yeah, not sure why
Sorry for this. It's an editing mistake.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v3 0/5] Introduce a "promisor-remote" capability
2024-12-09 10:40 ` Christian Couder
2024-12-09 10:42 ` Christian Couder
@ 2024-12-09 23:01 ` Junio C Hamano
2025-01-27 15:05 ` Christian Couder
1 sibling, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2024-12-09 23:01 UTC (permalink / raw)
To: Christian Couder
Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine
Christian Couder <christian.couder@gmail.com> writes:
> I noticed that fcb2205b77 (midx: implement support for writing
> incremental MIDX chains, 2024-08-06)
> which introduced GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL adds lines like:
>
> GIT_TEST_MULTI_PACK_INDEX=0
> GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0
>
> at the top of a number of repack related test scripts like
> t7700-repack.sh, so I guess that it should be OK to add the same lines
> at the top of the t5710 test script added by this series. This should
> fix the CI failures.
>
> I have made this change in my current version.
Thanks.
Is it because the feature is fundamentally incompatible with the
multi-pack index (or its incremental writing), or is it merely
because the way the feature is verified assumes that the multi-pack
index is not used, even though the protocol exchange, capability
selection, and the actual behaviour adjustment for the capability
are all working just fine? I am assuming it is the latter, but just
to make sure we know where we stand...
Thanks, again.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v3 0/5] Introduce a "promisor-remote" capability
2024-12-09 23:01 ` Junio C Hamano
@ 2025-01-27 15:05 ` Christian Couder
2025-01-27 19:38 ` Junio C Hamano
0 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:05 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine
On Tue, Dec 10, 2024 at 12:01 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > I noticed that fcb2205b77 (midx: implement support for writing
> > incremental MIDX chains, 2024-08-06)
> > which introduced GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL adds lines like:
> >
> > GIT_TEST_MULTI_PACK_INDEX=0
> > GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0
> >
> > at the top of a number of repack related test scripts like
> > t7700-repack.sh, so I guess that it should be OK to add the same lines
> > at the top of the t5710 test script added by this series. This should
> > fix the CI failures.
> >
> > I have made this change in my current version.
>
> Thanks.
>
> Is it because the feature is fundamentally incompatible with the
> multi-pack index (or its incremental writing),
It's not an incompatibility with the feature developed in this series.
Adding the following test script on top of master or even fcb2205b77
(midx: implement support for writing incremental MIDX chains,
2024-08-06), shows that it fails in the same way without any code
change to `git` itself from this series:
diff --git a/t/t5709-midx-increment-write.sh b/t/t5709-midx-increment-write.sh
new file mode 100755
index 0000000000..8801222374
--- /dev/null
+++ b/t/t5709-midx-increment-write.sh
@@ -0,0 +1,132 @@
+#!/bin/sh
+
+test_description='test midx incremental write'
+
+. ./test-lib.sh
+
+export GIT_TEST_MULTI_PACK_INDEX=1
+export GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=1
+
+# Setup the repository with three commits, this way HEAD is always
+# available and we can hide commit 1 or 2.
+test_expect_success 'setup: create "template" repository' '
+ git init template &&
+ test_commit -C template 1 &&
+ test_commit -C template 2 &&
+ test_commit -C template 3 &&
+ test-tool genrandom foo 10240 >template/foo &&
+ git -C template add foo &&
+ git -C template commit -m foo
+'
+
+# A bare repo will act as a server repo with unpacked objects.
+test_expect_success 'setup: create bare "server" repository' '
+ git clone --bare --no-local template server &&
+ mv server/objects/pack/pack-* . &&
+ packfile=$(ls pack-*.pack) &&
+ git -C server unpack-objects --strict <"$packfile"
+'
+
+check_missing_objects () {
+ git -C "$1" rev-list --objects --all --missing=print > all.txt &&
+ perl -ne 'print if s/^[?]//' all.txt >missing.txt &&
+ test_line_count = "$2" missing.txt &&
+ if test "$2" -lt 2
+ then
+ test "$3" = "$(cat missing.txt)"
+ else
+ test -f "$3" &&
+ sort <"$3" >expected_sorted &&
+ sort <missing.txt >actual_sorted &&
+ test_cmp expected_sorted actual_sorted
+ fi
+}
+
+initialize_server () {
+ count="$1"
+ missing_oids="$2"
+
+ # Repack everything first
+ git -C server -c repack.writebitmaps=false repack -a -d &&
+
+ # Remove promisor file in case they exist, useful when reinitializing
+ rm -rf server/objects/pack/*.promisor &&
+
+ # Repack without the largest object and create a promisor pack on server
+ git -C server -c repack.writebitmaps=false repack -a -d \
+ --filter=blob:limit=5k --filter-to="$(pwd)/pack" &&
+ promisor_file=$(ls server/objects/pack/*.pack | sed
"s/\.pack/.promisor/") &&
+ >"$promisor_file" &&
+
+ # Check objects missing on the server
+ check_missing_objects server "$count" "$missing_oids"
+}
+
+copy_to_server2 () {
+ oid_path="$(test_oid_to_path $1)" &&
+ path="server/objects/$oid_path" &&
+ path2="server2/objects/$oid_path" &&
+ mkdir -p $(dirname "$path2") &&
+ cp "$path" "$path2"
+}
+
+test_expect_success "setup for testing promisor remote advertisement" '
+ # Create another bare repo called "server2"
+ git init --bare server2 &&
+
+ # Copy the largest object from server to server2
+ obj="HEAD:foo" &&
+ oid="$(git -C server rev-parse $obj)" &&
+ copy_to_server2 "$oid" &&
+
+ initialize_server 1 "$oid" &&
+
+ # Configure server2 as promisor remote for server
+ git -C server remote add server2 "file://$(pwd)/server2" &&
+ git -C server config remote.server2.promisor true &&
+
+ git -C server2 config uploadpack.allowFilter true &&
+ git -C server2 config uploadpack.allowAnySHA1InWant true &&
+ git -C server config uploadpack.allowFilter true &&
+ git -C server config uploadpack.allowAnySHA1InWant true
+'
+
+test_expect_success "setup for subsequent fetches" '
+ # Generate new commit with large blob
+ test-tool genrandom bar 10240 >template/bar &&
+ git -C template add bar &&
+ git -C template commit -m bar &&
+
+ # Fetch new commit with large blob
+ git -C server fetch origin &&
+ git -C server update-ref HEAD FETCH_HEAD &&
+ git -C server rev-parse HEAD >expected_head &&
+
+ # Repack everything twice and remove .promisor files before
+ # each repack. This makes sure everything gets repacked
+ # into a single packfile. The second repack is necessary
+ # because the first one fetches from server2 and creates a new
+ # packfile and its associated .promisor file.
+
+ rm -f server/objects/pack/*.promisor &&
+ git -C server -c repack.writebitmaps=false repack -a -d &&
+ rm -f server/objects/pack/*.promisor &&
+ git -C server -c repack.writebitmaps=false repack -a -d &&
+
+ # Unpack everything
+ rm pack-* &&
+ mv server/objects/pack/pack-* . &&
+ packfile=$(ls pack-*.pack) &&
+ git -C server unpack-objects --strict <"$packfile" &&
+
+ # Copy new large object to server2
+ obj_bar="HEAD:bar" &&
+ oid_bar="$(git -C server rev-parse $obj_bar)" &&
+ copy_to_server2 "$oid_bar" &&
+
+ # Reinitialize server so that the 2 largest objects are missing
+ printf "%s\n" "$oid" "$oid_bar" >expected_missing.txt &&
+ initialize_server 2 expected_missing.txt
+'
+
+test_done
Changing `export GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=1` into
`export GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0` at the top of
the file makes it work.
This could probably be simplified, but I think it shows that it's just
the incremental writing of the multi-pack index that is incompatible
or has a bug when doing some repacking.
> or is it merely
> because the way the feature is verified assumes that the multi-pack
> index is not used, even though the protocol exchange, capability
> selection, and the actual behaviour adjustment for the capability
> are all working just fine? I am assuming it is the latter, but just
> to make sure we know where we stand...
Let me know if you need more than the above, but I think it's fair for
now to just use:
GIT_TEST_MULTI_PACK_INDEX=0
GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0
at the top of the tests, like it's done in the version 4 of this
series I will send soon.
^ permalink raw reply related [flat|nested] 110+ messages in thread
* Re: [PATCH v3 0/5] Introduce a "promisor-remote" capability
2025-01-27 15:05 ` Christian Couder
@ 2025-01-27 19:38 ` Junio C Hamano
0 siblings, 0 replies; 110+ messages in thread
From: Junio C Hamano @ 2025-01-27 19:38 UTC (permalink / raw)
To: Christian Couder
Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine
Christian Couder <christian.couder@gmail.com> writes:
>> or is it merely
>> because the way the feature is verified assumes that the multi-pack
>> index is not used, even though the protocol exchange, capability
>> selection, and the actual behaviour adjustment for the capability
>> are all working just fine? I am assuming it is the latter, but just
>> to make sure we know where we stand...
>
> Let me know if you need more than the above,
Hard to say if I got a test script when I asked for a simple yes-or-no
question.
> but I think it's fair for
> now to just use:
>
> GIT_TEST_MULTI_PACK_INDEX=0
> GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0
>
> at the top of the tests, like it's done in the version 4 of this
> series I will send soon.
Doesn't it mean that people should not use multi-pack-index or
incremental writing with this feature? If we cannot make both of
them work together even in our controlled testing environment, how
would the users know what combinations of features are safe to use
and what are incompatible? That sounds far from fair at least to me.
I see Taylor is included in the Cc: list, so hopefully, we'll get
the anomalies you found in the multi-pack stuff resolved and see how
well these two things would work together.
Thanks.
^ permalink raw reply [flat|nested] 110+ messages in thread
* [PATCH v4 0/6] Introduce a "promisor-remote" capability
2024-12-06 12:42 ` [PATCH v3 0/5] " Christian Couder
` (5 preceding siblings ...)
2024-12-09 8:04 ` [PATCH v3 0/5] Introduce a "promisor-remote" capability Junio C Hamano
@ 2025-01-27 15:16 ` Christian Couder
2025-01-27 15:16 ` [PATCH v4 1/6] version: replace manual ASCII checks with isprint() for clarity Christian Couder
` (7 more replies)
6 siblings, 8 replies; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:16 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
Randall S . Becker, Christian Couder
This work is part of some effort to better handle large files/blobs in
a client-server context using promisor remotes dedicated to storing
large blobs. To help understand this effort, this series now contains
a patch (patch 6/6) that adds design documentation about this effort.
Last year, I sent 3 versions of a patch series with the goal of
allowing a client C to clone from a server S while using the same
promisor remote X that S already use. See:
https://lore.kernel.org/git/20240418184043.2900955-1-christian.couder@gmail.com/
Junio suggested to implement that feature using:
"a protocol extension that lets S tell C that S wants C to fetch
missing objects from X (which means that if C knows about X in its
".git/config" then there is no need for end-user interaction at all),
or a protocol extension that C tells S that C is willing to see
objects available from X omitted when S does not have them (again,
this could be done by looking at ".git/config" at C, but there may be
security implications???)"
This patch series implements that protocol extension called
"promisor-remote" (that name is open to change or simplification)
which allows S and C to agree on C using X directly or not.
I have tried to implement it in a quite generic way that could allow S
and C to share more information about promisor remotes and how to use
them.
For now, C doesn't use the information it gets from S when cloning.
That information is only used to decide if C is OK to use the promisor
remotes advertised by S. But this could change in the future which
could make it much simpler for clients than using the current way of
passing information about X with the `-c` option of `git clone` many
times on the command line.
Another improvement could be to not require GIT_NO_LAZY_FETCH=0 when S
and C have agreed on using S.
Changes compared to version 3
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Patches 1/6 and 2/6 are new in this series. They come from the
patch series Usman Akinyemi is working on
(https://lore.kernel.org/git/20250124122217.250925-1-usmanakinyemi202@gmail.com/).
We need a similar redact_non_printables() function as the one he
has been working on in his patch series, so it's just simpler to
reuse his patches related to this function, and to build on top of
them.
- Patch 2/5 in version 3 has been removed. It created a new
strbuf_trim_trailing_ch() function as part of the strbuf API, but
we can reuse an existing function, strbuf_strip_suffix(), instead.
- Patch 3/6 is new. It makes the redact_non_printables() non-static
to be able to reuse it in a following patch.
- In patch 4/6, the commit message has been improved:
- Some "should" have been replaced with "may".
- It states early that "If S and C can agree on C using X
directly, S can then omit objects that can be obtained from X
when answering C's request."
- It mentions that "pieces of information that are usually
outside Git's concern, like proxy configuration, must not be
distributed over this protocol."
- In patch 4/6, there are also some code changes:
- redact_non_printables() is used instead of strbuf_sanitize(),
see changes in patches 1/6 to 3/6 above.
- strbuf_strip_suffix() is used instead of
strbuf_trim_trailing_ch(), see the removal of patch 2/5 in
version 3 mentioned above.
- strbuf_split() is used instead of strbuf_split_str() when
possible to simplifies the code a bit.
- In patch 4/6, there is also a small change in the tests. In t5710
testing with multi pack index and especially its incremental write
are disabled. An issue has been found between the setup code in
this test script and the multi pack index incremental write.
- In patch 6/6 (doc: add technical design doc for large object
promisors) there are a number of changes:
- "aim to" is used more often to better outline the direction of
the effort. And in general some similarly small changes have
been made to make the document more assertive.
- The "0) Non goal" section has been improved to mention that we
want to focus for now on using existing object storage
solutions accessed through remote helpers, and that we don't
want to discuss data transfer improvements between LOPs and
clients or servers.
- A few typos, grammos and such have been fixed.
- Examples of existing remote helpers to access existing object
storage solutions have been added.
- A note has been improved to mention that a protocol
negotiation might not always happen when fetching.
- A new note has been added about clients offloading objects
they created directly to a LOP.
- A new "V) Future improvements" section has been added.
Thanks to Junio, Patrick, Eric, Karthik, Kristoffer, brian, Randall
and Taylor for their suggestions to improve this patch series.
CI tests
~~~~~~~~
All the CI tests passed, see:
https://github.com/chriscool/git/actions/runs/12989763108
Range diff compared to version 3
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1: 13dd730641 < -: ---------- version: refactor strbuf_sanitize()
2: 8f2aecf6a1 < -: ---------- strbuf: refactor strbuf_trim_trailing_ch()
-: ---------- > 1: 9e646013be version: replace manual ASCII checks with isprint() for clarity
-: ---------- > 2: f4b22ef39d version: refactor redact_non_printables()
-: ---------- > 3: 8bfa6f7a20 version: make redact_non_printables() non-static
3: 57e1481bc4 ! 4: 652ce32892 Add 'promisor-remote' capability to protocol v2
@@ Commit message
When a server S knows that some objects from a repository are available
from a promisor remote X, S might want to suggest to a client C cloning
- or fetching the repo from S that C should use X directly instead of S
- for these objects.
+ or fetching the repo from S that C may use X directly instead of S for
+ these objects.
Note that this could happen both in the case S itself doesn't have the
objects and borrows them from X, and in the case S has the objects but
@@ Commit message
omit in its response the objects available on X, is left for future
improvement though.
- Then C might or might not, want to get the objects from X, and should
- let S know about this.
+ Then C might or might not, want to get the objects from X. If S and C
+ can agree on C using X directly, S can then omit objects that can be
+ obtained from X when answering C's request.
To allow S and C to agree and let each other know about C using X or
not, let's introduce a new "promisor-remote" capability in the
@@ Commit message
For now, the URL is passed in addition to the name. In the future, it
might be possible to pass other information like a filter-spec that the
- client should use when cloning from S, or a token that the client should
- use when retrieving objects from X.
+ client may use when cloning from S, or a token that the client may use
+ when retrieving objects from X.
+
+ It is C's responsibility to arrange how it can reach X though, so pieces
+ of information that are usually outside Git's concern, like proxy
+ configuration, must not be distributed over this protocol.
It might also be possible in the future for "promisor.advertise" to have
other values. For example a value like "onlyName" could prevent S from
@@ promisor-remote.c
#include "packfile.h"
#include "environment.h"
+#include "url.h"
++#include "version.h"
struct promisor_remote_config {
struct promisor_remote *promisors;
@@ promisor-remote.c: void promisor_remote_get_direct(struct repository *repo,
+ }
+ }
+
-+ strbuf_sanitize(&sb);
++ redact_non_printables(&sb);
+
+ strvec_clear(&names);
+ strvec_clear(&urls);
@@ promisor-remote.c: void promisor_remote_get_direct(struct repository *repo,
+ char *decoded_name = NULL;
+ char *decoded_url = NULL;
+
-+ strbuf_trim_trailing_ch(remotes[i], ';');
-+ elems = strbuf_split_str(remotes[i]->buf, ',', 0);
++ strbuf_strip_suffix(remotes[i], ";");
++ elems = strbuf_split(remotes[i], ',');
+
+ for (size_t j = 0; elems[j]; j++) {
+ int res;
-+ strbuf_trim_trailing_ch(elems[j], ',');
++ strbuf_strip_suffix(elems[j], ",");
+ res = skip_prefix(elems[j]->buf, "name=", &remote_name) ||
+ skip_prefix(elems[j]->buf, "url=", &remote_url);
+ if (!res)
@@ promisor-remote.c: void promisor_remote_get_direct(struct repository *repo,
+ struct promisor_remote *p;
+ char *decoded_remote;
+
-+ strbuf_trim_trailing_ch(accepted_remotes[i], ';');
++ strbuf_strip_suffix(accepted_remotes[i], ";");
+ decoded_remote = url_percent_decode(accepted_remotes[i]->buf);
+
+ p = repo_promisor_remote_find(r, decoded_remote);
@@ serve.c: static struct protocol_capability capabilities[] = {
+ },
};
- void protocol_v2_advertise_capabilities(void)
+ void protocol_v2_advertise_capabilities(struct repository *r)
+
+ ## t/meson.build ##
+@@ t/meson.build: integration_tests = [
+ 't5703-upload-pack-ref-in-want.sh',
+ 't5704-protocol-violations.sh',
+ 't5705-session-id-in-capabilities.sh',
++ 't5710-promisor-remote-capability.sh',
+ 't5730-protocol-v2-bundle-uri-file.sh',
+ 't5731-protocol-v2-bundle-uri-git.sh',
+ 't5732-protocol-v2-bundle-uri-http.sh',
## t/t5710-promisor-remote-capability.sh (new) ##
@@
@@ t/t5710-promisor-remote-capability.sh (new)
+
+. ./test-lib.sh
+
++GIT_TEST_MULTI_PACK_INDEX=0
++GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0
++
+# Setup the repository with three commits, this way HEAD is always
+# available and we can hide commit 1 or 2.
+test_expect_success 'setup: create "template" repository' '
4: 7fcc619e41 = 5: 979a0af1c3 promisor-remote: check advertised name or URL
5: c25c94707f ! 6: 3a0c134e09 doc: add technical design doc for large object promisors
@@ Documentation/technical/large-object-promisors.txt (new)
+effort described in this document to add a set of features to make it
+easier to handle large blobs/files in Git by using LOPs.
+
-+This effort would especially improve things on the server side, and
++This effort aims to especially improve things on the server side, and
+especially for large blobs that are already compressed in a binary
+format.
+
-+This effort could help provide an alternative to Git LFS
++This effort aims to provide an alternative to Git LFS
+(https://git-lfs.com/) and similar tools like git-annex
+(https://git-annex.branchable.com/) for handling large files, even
+though a complete alternative would very likely require other efforts
@@ Documentation/technical/large-object-promisors.txt (new)
+efforts could also improve the situation on the client side.
+
+- In the same way, we are not going to discuss all the possible ways
-+ to implement a LOP or their underlying object storage.
++ to implement a LOP or their underlying object storage, or to
++ optimize how LOP works.
++
-+In particular we are not going to discuss pluggable ODBs or other
++Our opinion is that the simplest solution for now is for LOPs to use
++object storage through a remote helper (see section II.2 below for
++more details) to store their objects. So we consider that this is the
++default implementation. If there are improvements on top of this,
++that's great, but our opinion is that such improvements are not
++necessary for LOPs to already be useful. Such improvements are likely
++a different technical topic, and can be taken care of separately
++anyway.
+++
++So in particular we are not going to discuss pluggable ODBs or other
+object database backends that could chunk large blobs, dedup the
+chunks and store them efficiently. Sure, that would be a nice
+improvement to store large blobs on the server side, but we believe
+it can just be a separate effort as it's also not technically very
+related to this effort.
+++
++We are also not going to discuss data transfer improvements between
++LOPs and clients or servers. Sure, there might be some easy and very
++effective optimizations there (as we know that objects on LOPs are
++very likely incompressible and not deltifying well), but this can be
++dealt with separately in a separate effort.
+
+In other words, the goal of this document is not to talk about all the
+possible ways to optimize how Git could handle large blobs, but to
-+describe how a LOP based solution could work well and alleviate a
-+number of current issues in the context of Git clients and servers
++describe how a LOP based solution can already work well and alleviate
++a number of current issues in the context of Git clients and servers
+sharing Git objects.
+
+I) Issues with the current situation
@@ Documentation/technical/large-object-promisors.txt (new)
+
+Also each feature doesn't need to be implemented entirely in Git
+itself. Some could be scripts, hooks or helpers that are not part of
-+the Git repo. It could be helpful if those could be shared and
-+improved on collaboratively though.
++the Git repo. It would be helpful if those could be shared and
++improved on collaboratively though. So we want to encourage sharing
++them.
+
+1) Large blobs are stored on LOPs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ Documentation/technical/large-object-promisors.txt (new)
+Rationale
++++++++++
+
-+LOP remotes should be good at handling large blobs while main remotes
-+should be good at handling other objects.
++LOPs aim to be good at handling large blobs while main remotes are
++already good at handling other objects.
+
+Implementation
+++++++++++++++
@@ Documentation/technical/large-object-promisors.txt (new)
+2) LOPs can use object storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
-+A LOP could be using object storage, like an Amazon S3 or GCP Bucket
-+to actually store the large blobs, and could be accessed through a Git
++LOPs can be implemented using object storage, like an Amazon S3 or GCP
++Bucket or MinIO (which is open source under the GNU AGPLv3 license) to
++actually store the large blobs, and can be accessed through a Git
+remote helper (see linkgit:gitremote-helpers[7]) which makes the
-+underlying object storage appears like a remote to Git.
++underlying object storage appear like a remote to Git.
+
+Note
+++++
+
-+A LOP could be a promisor remote accessed using a remote helper by
++A LOP can be a promisor remote accessed using a remote helper by
+both some clients and the main remote.
+
+Rationale
@@ Documentation/technical/large-object-promisors.txt (new)
+be more efficient and maintainable to write them using other languages
+like Go.
+
++Some already exist under open source licenses, for example:
++
++ - https://github.com/awslabs/git-remote-s3
++ - https://gitlab.com/eric.p.ju/git-remote-gs
++
+Other ways to implement LOPs are certainly possible, but the goal of
+this document is not to discuss how to best implement a LOP or its
+underlying object storage (see the "0) Non goals" section above).
@@ Documentation/technical/large-object-promisors.txt (new)
+++++++++++++++
+
+The way to offload to a LOP discussed in 4) above can be used to
-+regularly offload oversize blobs. About preventing oversize blobs to
-+be fetched into the repo see 6) below. About preventing oversize blob
-+pushes, a pre-receive hook could be used.
++regularly offload oversize blobs. About preventing oversize blobs from
++being fetched into the repo see 6) below. About preventing oversize
++blob pushes, a pre-receive hook could be used.
+
+Also there are different scenarios in which large blobs could get
+fetched into the main remote, for example:
@@ Documentation/technical/large-object-promisors.txt (new)
+It might not be possible to completely prevent all these scenarios
+from happening. So the goal here should be to implement features that
+make the fetching of large blobs less likely. For example adding a
-+`remote-object-info` command in the `git cat-file --batch*` protocol
-+might make it possible for a main repo to respond to some requests
-+about large blobs without fetching them.
++`remote-object-info` command in the `git cat-file --batch` protocol
++and its variants might make it possible for a main repo to respond to
++some requests about large blobs without fetching them.
+
+6) A protocol negotiation should happen when a client clones
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ Documentation/technical/large-object-promisors.txt (new)
+Note
+++++
+
-+For fetches instead of clones, see the "What about fetches?" FAQ entry
-+below.
++For fetches instead of clones, a protocol negotiation might not always
++happen, see the "What about fetches?" FAQ entry below for details.
+
+Rationale
++++++++++
@@ Documentation/technical/large-object-promisors.txt (new)
+Information that the server could send to the client through that
+protocol could be things like: LOP name, LOP URL, filter-spec (for
+example `blob:limit=<size>`) or just size limit that should be used as
-+a filter when cloning, token to be used with the LOP, etc..
++a filter when cloning, token to be used with the LOP, etc.
+
+7) A client can offload to a LOP
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ Documentation/technical/large-object-promisors.txt (new)
+the client should be able to offload some large blobs it has fetched,
+but might not need anymore, to the LOP.
+
++Note
++++++
++
++It might depend on the context if it should be OK or not for clients
++to offload large blobs they have created, instead of fetched, directly
++to the LOP without the main remote checking them in some ways
++(possibly using hooks or other tools).
++
+Rationale
++++++++++
+
@@ Documentation/technical/large-object-promisors.txt (new)
+What about using multiple LOPs on the server and client side?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
-+That could perhaps be useful in some cases, but it's more likely for
-+now than in most cases a single LOP will be advertised by the server
-+and should be used by the client.
++That could perhaps be useful in some cases, but for now it's more
++likely that in most cases a single LOP will be advertised by the
++server and should be used by the client.
+
+A case where it could be useful for a server to advertise multiple
+LOPs is if a LOP is better for some users while a different LOP is
@@ Documentation/technical/large-object-promisors.txt (new)
+is likely to be better connected to them, while users in other parts
+of the world should pick only LOP B for the same reason."
+
-+Trusting the LOPs advertised by the server, or not trusting them?
-+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
++When should we trust or not trust the LOPs advertised by the server?
++~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In some contexts, like in corporate setup where the server and all the
+clients are parts of an internal network in a company where admins
-+have all the rights on every system, it's Ok, and perhaps even a good
++have all the rights on every system, it's OK, and perhaps even a good
+thing, if the clients fully trust the server, as it can help ensure
+that all the clients are on the same page.
+
@@ Documentation/technical/large-object-promisors.txt (new)
+from the client when it fetches from them. The client could get the
+token when performing a protocol negotiation with the main remote (see
+section II.6 above).
++
++V) Future improvements
++----------------------
++
++It is expected that at the beginning using LOPs will be mostly worth
++it either in a corporate context where the Git version that clients
++use can easily be controlled, or on repos that are infrequently
++accessed. (See the "Could the main remote be bogged down by old or
++paranoid clients?" section in the FAQ above.)
++
++Over time, as more and more clients upgrade to a version that
++implements the "promisor-remote" protocol v2 capability described
++above in section II.6), it will be worth it to use LOPs more widely.
++
++A lot of improvements may also help using LOPs more widely. Some of
++these improvements are part of the scope of this document like the
++following:
++
++ - Implementing a "remote-object-info" command in the
++ `git cat-file --batch` protocol and its variants to allow main
++ remotes to respond to requests about large blobs without fetching
++ them. (Eric Ju has started working on this based on previous work
++ by Calvin Wan.)
++
++ - Creating better cleanup and offload mechanisms for main remotes
++ and clients to prevent accumulation of large blobs.
++
++ - Developing more sophisticated protocol negotiation capabilities
++ between clients and servers for handling LOPs, for example adding
++ a filter-spec (e.g., blob:limit=<size>) or size limit for
++ filtering when cloning, or adding a token for LOP authentication.
++
++ - Improving security measures for LOP access, particularly around
++ token handling and authentication.
++
++ - Developing standardized ways to configure and manage multiple LOPs
++ across different environments. Especially in the case where
++ different LOPs serve the same content to clients in different
++ geographical locations, there is a need for replication or
++ synchronization between LOPs.
++
++Some improvements, including some that have been mentioned in the "0)
++Non Goals" section of this document, are out of the scope of this
++document:
++
++ - Implementing a new object representation for large blobs on the
++ client side.
++
++ - Developing pluggable ODBs or other object database backends that
++ could chunk large blobs, dedup the chunks and store them
++ efficiently.
++
++ - Optimizing data transfer between LOPs and clients/servers,
++ particularly for incompressible and non-deltifying content.
++
++ - Creating improved client side tools for managing large objects
++ more effectively, for example tools for migrating from Git LFS or
++ git-annex, or tools to find which objects could be offloaded and
++ how much disk space could be reclaimed by offloading them.
++
++Some improvements could be seen as part of the scope of this document,
++but might already have their own separate projects from the Git
++project, like:
++
++ - Improving existing remote helpers to access object storage or
++ developing new ones.
++
++ - Improving existing object storage solutions or developing new
++ ones.
++
++Even though all the above improvements may help, this document and the
++LOP effort should try to focus, at least first, on a relatively small
++number of improvements mostly those that are in its current scope.
++
++For example introducing pluggable ODBs and a new object database
++backend is likely a multi-year effort on its own that can happen
++separately in parallel. It has different technical requirements,
++touches other part of the Git code base and should have its own design
++document(s).
Christian Couder (4):
version: make redact_non_printables() non-static
Add 'promisor-remote' capability to protocol v2
promisor-remote: check advertised name or URL
doc: add technical design doc for large object promisors
Usman Akinyemi (2):
version: replace manual ASCII checks with isprint() for clarity
version: refactor redact_non_printables()
Documentation/config/promisor.txt | 27 +
Documentation/gitprotocol-v2.txt | 54 ++
.../technical/large-object-promisors.txt | 640 ++++++++++++++++++
connect.c | 9 +
promisor-remote.c | 244 +++++++
promisor-remote.h | 36 +-
serve.c | 26 +
t/meson.build | 1 +
t/t5710-promisor-remote-capability.sh | 312 +++++++++
upload-pack.c | 3 +
version.c | 18 +-
version.h | 8 +
12 files changed, 1371 insertions(+), 7 deletions(-)
create mode 100644 Documentation/technical/large-object-promisors.txt
create mode 100755 t/t5710-promisor-remote-capability.sh
--
2.46.0.rc0.95.gcbf174a634
^ permalink raw reply [flat|nested] 110+ messages in thread
* [PATCH v4 1/6] version: replace manual ASCII checks with isprint() for clarity
2025-01-27 15:16 ` [PATCH v4 0/6] " Christian Couder
@ 2025-01-27 15:16 ` Christian Couder
2025-01-27 15:16 ` [PATCH v4 2/6] version: refactor redact_non_printables() Christian Couder
` (6 subsequent siblings)
7 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:16 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
Randall S . Becker, Usman Akinyemi, Christian Couder
From: Usman Akinyemi <usmanakinyemi202@gmail.com>
Since the isprint() function checks for printable characters, let's
replace the existing hardcoded ASCII checks with it. However, since
the original checks also handled spaces, we need to account for spaces
explicitly in the new check.
Mentored-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Usman Akinyemi <usmanakinyemi202@gmail.com>
---
version.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/version.c b/version.c
index 4786c4e0a5..c9192a5beb 100644
--- a/version.c
+++ b/version.c
@@ -1,6 +1,7 @@
#include "git-compat-util.h"
#include "version.h"
#include "strbuf.h"
+#include "sane-ctype.h"
#ifndef GIT_VERSION_H
# include "version-def.h"
@@ -34,7 +35,7 @@ const char *git_user_agent_sanitized(void)
strbuf_addstr(&buf, git_user_agent());
strbuf_trim(&buf);
for (size_t i = 0; i < buf.len; i++) {
- if (buf.buf[i] <= 32 || buf.buf[i] >= 127)
+ if (!isprint(buf.buf[i]) || buf.buf[i] == ' ')
buf.buf[i] = '.';
}
agent = buf.buf;
--
2.46.0.rc0.95.gcbf174a634
^ permalink raw reply related [flat|nested] 110+ messages in thread
* [PATCH v4 2/6] version: refactor redact_non_printables()
2025-01-27 15:16 ` [PATCH v4 0/6] " Christian Couder
2025-01-27 15:16 ` [PATCH v4 1/6] version: replace manual ASCII checks with isprint() for clarity Christian Couder
@ 2025-01-27 15:16 ` Christian Couder
2025-01-27 15:16 ` [PATCH v4 3/6] version: make redact_non_printables() non-static Christian Couder
` (5 subsequent siblings)
7 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:16 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
Randall S . Becker, Usman Akinyemi, Christian Couder
From: Usman Akinyemi <usmanakinyemi202@gmail.com>
The git_user_agent_sanitized() function performs some sanitizing to
avoid special characters being sent over the line and possibly messing
up with the protocol or with the parsing on the other side.
Let's extract this sanitizing into a new redact_non_printables() function,
as we will want to reuse it in a following patch.
For now the new redact_non_printables() function is still static as
it's only needed locally.
While at it, let's use strbuf_detach() to explicitly detach the string
contained by the 'buf' strbuf.
Mentored-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Usman Akinyemi <usmanakinyemi202@gmail.com>
---
version.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/version.c b/version.c
index c9192a5beb..4f37b4499d 100644
--- a/version.c
+++ b/version.c
@@ -12,6 +12,19 @@
const char git_version_string[] = GIT_VERSION;
const char git_built_from_commit_string[] = GIT_BUILT_FROM_COMMIT;
+/*
+ * Trim and replace each character with ascii code below 32 or above
+ * 127 (included) using a dot '.' character.
+ */
+static void redact_non_printables(struct strbuf *buf)
+{
+ strbuf_trim(buf);
+ for (size_t i = 0; i < buf->len; i++) {
+ if (!isprint(buf->buf[i]) || buf->buf[i] == ' ')
+ buf->buf[i] = '.';
+ }
+}
+
const char *git_user_agent(void)
{
static const char *agent = NULL;
@@ -33,12 +46,8 @@ const char *git_user_agent_sanitized(void)
struct strbuf buf = STRBUF_INIT;
strbuf_addstr(&buf, git_user_agent());
- strbuf_trim(&buf);
- for (size_t i = 0; i < buf.len; i++) {
- if (!isprint(buf.buf[i]) || buf.buf[i] == ' ')
- buf.buf[i] = '.';
- }
- agent = buf.buf;
+ redact_non_printables(&buf);
+ agent = strbuf_detach(&buf, NULL);
}
return agent;
--
2.46.0.rc0.95.gcbf174a634
^ permalink raw reply related [flat|nested] 110+ messages in thread
* [PATCH v4 3/6] version: make redact_non_printables() non-static
2025-01-27 15:16 ` [PATCH v4 0/6] " Christian Couder
2025-01-27 15:16 ` [PATCH v4 1/6] version: replace manual ASCII checks with isprint() for clarity Christian Couder
2025-01-27 15:16 ` [PATCH v4 2/6] version: refactor redact_non_printables() Christian Couder
@ 2025-01-27 15:16 ` Christian Couder
2025-01-30 10:51 ` Patrick Steinhardt
2025-01-27 15:16 ` [PATCH v4 4/6] Add 'promisor-remote' capability to protocol v2 Christian Couder
` (4 subsequent siblings)
7 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:16 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
Randall S . Becker, Christian Couder
As we are going to reuse redact_non_printables() outside "version.c",
let's make it non-static.
---
version.c | 6 +-----
version.h | 8 ++++++++
2 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/version.c b/version.c
index 4f37b4499d..77423fcaf3 100644
--- a/version.c
+++ b/version.c
@@ -12,11 +12,7 @@
const char git_version_string[] = GIT_VERSION;
const char git_built_from_commit_string[] = GIT_BUILT_FROM_COMMIT;
-/*
- * Trim and replace each character with ascii code below 32 or above
- * 127 (included) using a dot '.' character.
- */
-static void redact_non_printables(struct strbuf *buf)
+void redact_non_printables(struct strbuf *buf)
{
strbuf_trim(buf);
for (size_t i = 0; i < buf->len; i++) {
diff --git a/version.h b/version.h
index 7c62e80577..fcc1816685 100644
--- a/version.h
+++ b/version.h
@@ -4,7 +4,15 @@
extern const char git_version_string[];
extern const char git_built_from_commit_string[];
+struct strbuf;
+
const char *git_user_agent(void);
const char *git_user_agent_sanitized(void);
+/*
+ * Trim and replace each character with ascii code below 32 or above
+ * 127 (included) using a dot '.' character.
+*/
+void redact_non_printables(struct strbuf *buf);
+
#endif /* VERSION_H */
--
2.46.0.rc0.95.gcbf174a634
^ permalink raw reply related [flat|nested] 110+ messages in thread
* Re: [PATCH v4 3/6] version: make redact_non_printables() non-static
2025-01-27 15:16 ` [PATCH v4 3/6] version: make redact_non_printables() non-static Christian Couder
@ 2025-01-30 10:51 ` Patrick Steinhardt
2025-02-18 11:42 ` Christian Couder
0 siblings, 1 reply; 110+ messages in thread
From: Patrick Steinhardt @ 2025-01-30 10:51 UTC (permalink / raw)
To: Christian Couder
Cc: git, Junio C Hamano, Taylor Blau, Eric Sunshine, Karthik Nayak,
Kristoffer Haugsbakk, brian m . carlson, Randall S . Becker
On Mon, Jan 27, 2025 at 04:16:58PM +0100, Christian Couder wrote:
> As we are going to reuse redact_non_printables() outside "version.c",
> let's make it non-static.
Missing the DCO.
> diff --git a/version.h b/version.h
> index 7c62e80577..fcc1816685 100644
> --- a/version.h
> +++ b/version.h
> @@ -4,7 +4,15 @@
> extern const char git_version_string[];
> extern const char git_built_from_commit_string[];
>
> +struct strbuf;
> +
> const char *git_user_agent(void);
> const char *git_user_agent_sanitized(void);
>
> +/*
> + * Trim and replace each character with ascii code below 32 or above
> + * 127 (included) using a dot '.' character.
> +*/
> +void redact_non_printables(struct strbuf *buf);
Is this header really the right spot though? If I want to redact
characters I certainly wouldn't be looking at "version.h" for that
functionality.
Patrick
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v4 3/6] version: make redact_non_printables() non-static
2025-01-30 10:51 ` Patrick Steinhardt
@ 2025-02-18 11:42 ` Christian Couder
0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-02-18 11:42 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: git, Junio C Hamano, Taylor Blau, Eric Sunshine, Karthik Nayak,
Kristoffer Haugsbakk, brian m . carlson, Randall S . Becker
On Thu, Jan 30, 2025 at 11:51 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Mon, Jan 27, 2025 at 04:16:58PM +0100, Christian Couder wrote:
> > As we are going to reuse redact_non_printables() outside "version.c",
> > let's make it non-static.
>
> Missing the DCO.
Thanks for spotting this.
> > diff --git a/version.h b/version.h
> > index 7c62e80577..fcc1816685 100644
> > --- a/version.h
> > +++ b/version.h
> > @@ -4,7 +4,15 @@
> > extern const char git_version_string[];
> > extern const char git_built_from_commit_string[];
> >
> > +struct strbuf;
> > +
> > const char *git_user_agent(void);
> > const char *git_user_agent_sanitized(void);
> >
> > +/*
> > + * Trim and replace each character with ascii code below 32 or above
> > + * 127 (included) using a dot '.' character.
> > +*/
> > +void redact_non_printables(struct strbuf *buf);
>
> Is this header really the right spot though? If I want to redact
> characters I certainly wouldn't be looking at "version.h" for that
> functionality.
In previous versions of this series, I wanted to put this in the
strbuf API but it appeared not to be a good idea.
Anyway, now I think that this patch is not needed, thanks to a comment
you made about the following patch. So we don't need to find a good
place for it for now.
^ permalink raw reply [flat|nested] 110+ messages in thread
* [PATCH v4 4/6] Add 'promisor-remote' capability to protocol v2
2025-01-27 15:16 ` [PATCH v4 0/6] " Christian Couder
` (2 preceding siblings ...)
2025-01-27 15:16 ` [PATCH v4 3/6] version: make redact_non_printables() non-static Christian Couder
@ 2025-01-27 15:16 ` Christian Couder
2025-01-30 10:51 ` Patrick Steinhardt
2025-01-27 15:17 ` [PATCH v4 5/6] promisor-remote: check advertised name or URL Christian Couder
` (3 subsequent siblings)
7 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:16 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
Randall S . Becker, Christian Couder, Christian Couder
When a server S knows that some objects from a repository are available
from a promisor remote X, S might want to suggest to a client C cloning
or fetching the repo from S that C may use X directly instead of S for
these objects.
Note that this could happen both in the case S itself doesn't have the
objects and borrows them from X, and in the case S has the objects but
knows that X is better connected to the world (e.g., it is in a
$LARGEINTERNETCOMPANY datacenter with petabit/s backbone connections)
than S. Implementation of the latter case, which would require S to
omit in its response the objects available on X, is left for future
improvement though.
Then C might or might not, want to get the objects from X. If S and C
can agree on C using X directly, S can then omit objects that can be
obtained from X when answering C's request.
To allow S and C to agree and let each other know about C using X or
not, let's introduce a new "promisor-remote" capability in the
protocol v2, as well as a few new configuration variables:
- "promisor.advertise" on the server side, and:
- "promisor.acceptFromServer" on the client side.
By default, or if "promisor.advertise" is set to 'false', a server S will
not advertise the "promisor-remote" capability.
If S doesn't advertise the "promisor-remote" capability, then a client C
replying to S shouldn't advertise the "promisor-remote" capability
either.
If "promisor.advertise" is set to 'true', S will advertise its promisor
remotes with a string like:
promisor-remote=<pr-info>[;<pr-info>]...
where each <pr-info> element contains information about a single
promisor remote in the form:
name=<pr-name>[,url=<pr-url>]
where <pr-name> is the urlencoded name of a promisor remote and
<pr-url> is the urlencoded URL of the promisor remote named <pr-name>.
For now, the URL is passed in addition to the name. In the future, it
might be possible to pass other information like a filter-spec that the
client may use when cloning from S, or a token that the client may use
when retrieving objects from X.
It is C's responsibility to arrange how it can reach X though, so pieces
of information that are usually outside Git's concern, like proxy
configuration, must not be distributed over this protocol.
It might also be possible in the future for "promisor.advertise" to have
other values. For example a value like "onlyName" could prevent S from
advertising URLs, which could help in case C should use a different URL
for X than the URL S is using. (The URL S is using might be an internal
one on the server side for example.)
By default or if "promisor.acceptFromServer" is set to "None", C will
not accept to use the promisor remotes that might have been advertised
by S. In this case, C will not advertise any "promisor-remote"
capability in its reply to S.
If "promisor.acceptFromServer" is set to "All" and S advertised some
promisor remotes, then on the contrary, C will accept to use all the
promisor remotes that S advertised and C will reply with a string like:
promisor-remote=<pr-name>[;<pr-name>]...
where the <pr-name> elements are the urlencoded names of all the
promisor remotes S advertised.
In a following commit, other values for "promisor.acceptFromServer" will
be implemented, so that C will be able to decide the promisor remotes it
accepts depending on the name and URL it received from S. So even if
that name and URL information is not used much right now, it will be
needed soon.
Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
Documentation/config/promisor.txt | 17 ++
Documentation/gitprotocol-v2.txt | 54 ++++++
connect.c | 9 +
promisor-remote.c | 196 +++++++++++++++++++++
promisor-remote.h | 36 +++-
serve.c | 26 +++
t/meson.build | 1 +
t/t5710-promisor-remote-capability.sh | 244 ++++++++++++++++++++++++++
upload-pack.c | 3 +
9 files changed, 585 insertions(+), 1 deletion(-)
create mode 100755 t/t5710-promisor-remote-capability.sh
diff --git a/Documentation/config/promisor.txt b/Documentation/config/promisor.txt
index 98c5cb2ec2..9cbfe3e59e 100644
--- a/Documentation/config/promisor.txt
+++ b/Documentation/config/promisor.txt
@@ -1,3 +1,20 @@
promisor.quiet::
If set to "true" assume `--quiet` when fetching additional
objects for a partial clone.
+
+promisor.advertise::
+ If set to "true", a server will use the "promisor-remote"
+ capability, see linkgit:gitprotocol-v2[5], to advertise the
+ promisor remotes it is using, if it uses some. Default is
+ "false", which means the "promisor-remote" capability is not
+ advertised.
+
+promisor.acceptFromServer::
+ If set to "all", a client will accept all the promisor remotes
+ a server might advertise using the "promisor-remote"
+ capability. Default is "none", which means no promisor remote
+ advertised by a server will be accepted. By accepting a
+ promisor remote, the client agrees that the server might omit
+ objects that are lazily fetchable from this promisor remote
+ from its responses to "fetch" and "clone" requests from the
+ client. See linkgit:gitprotocol-v2[5].
diff --git a/Documentation/gitprotocol-v2.txt b/Documentation/gitprotocol-v2.txt
index 1652fef3ae..f25a9a6ad8 100644
--- a/Documentation/gitprotocol-v2.txt
+++ b/Documentation/gitprotocol-v2.txt
@@ -781,6 +781,60 @@ retrieving the header from a bundle at the indicated URI, and thus
save themselves and the server(s) the request(s) needed to inspect the
headers of that bundle or bundles.
+promisor-remote=<pr-infos>
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The server may advertise some promisor remotes it is using or knows
+about to a client which may want to use them as its promisor remotes,
+instead of this repository. In this case <pr-infos> should be of the
+form:
+
+ pr-infos = pr-info | pr-infos ";" pr-info
+
+ pr-info = "name=" pr-name | "name=" pr-name "," "url=" pr-url
+
+where `pr-name` is the urlencoded name of a promisor remote, and
+`pr-url` the urlencoded URL of that promisor remote.
+
+In this case, if the client decides to use one or more promisor
+remotes the server advertised, it can reply with
+"promisor-remote=<pr-names>" where <pr-names> should be of the form:
+
+ pr-names = pr-name | pr-names ";" pr-name
+
+where `pr-name` is the urlencoded name of a promisor remote the server
+advertised and the client accepts.
+
+Note that, everywhere in this document, `pr-name` MUST be a valid
+remote name, and the ';' and ',' characters MUST be encoded if they
+appear in `pr-name` or `pr-url`.
+
+If the server doesn't know any promisor remote that could be good for
+a client to use, or prefers a client not to use any promisor remote it
+uses or knows about, it shouldn't advertise the "promisor-remote"
+capability at all.
+
+In this case, or if the client doesn't want to use any promisor remote
+the server advertised, the client shouldn't advertise the
+"promisor-remote" capability at all in its reply.
+
+The "promisor.advertise" and "promisor.acceptFromServer" configuration
+options can be used on the server and client side respectively to
+control what they advertise or accept respectively. See the
+documentation of these configuration options for more information.
+
+Note that in the future it would be nice if the "promisor-remote"
+protocol capability could be used by the server, when responding to
+`git fetch` or `git clone`, to advertise better-connected remotes that
+the client can use as promisor remotes, instead of this repository, so
+that the client can lazily fetch objects from these other
+better-connected remotes. This would require the server to omit in its
+response the objects available on the better-connected remotes that
+the client has accepted. This hasn't been implemented yet though. So
+for now this "promisor-remote" capability is useful only when the
+server advertises some promisor remotes it already uses to borrow
+objects from.
+
GIT
---
Part of the linkgit:git[1] suite
diff --git a/connect.c b/connect.c
index 10fad43e98..7d309c4a7b 100644
--- a/connect.c
+++ b/connect.c
@@ -23,6 +23,7 @@
#include "protocol.h"
#include "alias.h"
#include "bundle-uri.h"
+#include "promisor-remote.h"
static char *server_capabilities_v1;
static struct strvec server_capabilities_v2 = STRVEC_INIT;
@@ -488,6 +489,7 @@ void check_stateless_delimiter(int stateless_rpc,
static void send_capabilities(int fd_out, struct packet_reader *reader)
{
const char *hash_name;
+ const char *promisor_remote_info;
if (server_supports_v2("agent"))
packet_write_fmt(fd_out, "agent=%s", git_user_agent_sanitized());
@@ -501,6 +503,13 @@ static void send_capabilities(int fd_out, struct packet_reader *reader)
} else {
reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
}
+ if (server_feature_v2("promisor-remote", &promisor_remote_info)) {
+ char *reply = promisor_remote_reply(promisor_remote_info);
+ if (reply) {
+ packet_write_fmt(fd_out, "promisor-remote=%s", reply);
+ free(reply);
+ }
+ }
}
int get_remote_bundle_uri(int fd_out, struct packet_reader *reader,
diff --git a/promisor-remote.c b/promisor-remote.c
index c714f4f007..5ac282ed27 100644
--- a/promisor-remote.c
+++ b/promisor-remote.c
@@ -11,6 +11,8 @@
#include "strvec.h"
#include "packfile.h"
#include "environment.h"
+#include "url.h"
+#include "version.h"
struct promisor_remote_config {
struct promisor_remote *promisors;
@@ -221,6 +223,18 @@ int repo_has_promisor_remote(struct repository *r)
return !!repo_promisor_remote_find(r, NULL);
}
+int repo_has_accepted_promisor_remote(struct repository *r)
+{
+ struct promisor_remote *p;
+
+ promisor_remote_init(r);
+
+ for (p = r->promisor_remote_config->promisors; p; p = p->next)
+ if (p->accepted)
+ return 1;
+ return 0;
+}
+
static int remove_fetched_oids(struct repository *repo,
struct object_id **oids,
int oid_nr, int to_free)
@@ -292,3 +306,185 @@ void promisor_remote_get_direct(struct repository *repo,
if (to_free)
free(remaining_oids);
}
+
+static int allow_unsanitized(char ch)
+{
+ if (ch == ',' || ch == ';' || ch == '%')
+ return 0;
+ return ch > 32 && ch < 127;
+}
+
+static void promisor_info_vecs(struct repository *repo,
+ struct strvec *names,
+ struct strvec *urls)
+{
+ struct promisor_remote *r;
+
+ promisor_remote_init(repo);
+
+ for (r = repo->promisor_remote_config->promisors; r; r = r->next) {
+ char *url;
+ char *url_key = xstrfmt("remote.%s.url", r->name);
+
+ strvec_push(names, r->name);
+ strvec_push(urls, git_config_get_string(url_key, &url) ? NULL : url);
+
+ free(url);
+ free(url_key);
+ }
+}
+
+char *promisor_remote_info(struct repository *repo)
+{
+ struct strbuf sb = STRBUF_INIT;
+ int advertise_promisors = 0;
+ struct strvec names = STRVEC_INIT;
+ struct strvec urls = STRVEC_INIT;
+
+ git_config_get_bool("promisor.advertise", &advertise_promisors);
+
+ if (!advertise_promisors)
+ return NULL;
+
+ promisor_info_vecs(repo, &names, &urls);
+
+ if (!names.nr)
+ return NULL;
+
+ for (size_t i = 0; i < names.nr; i++) {
+ if (i)
+ strbuf_addch(&sb, ';');
+ strbuf_addstr(&sb, "name=");
+ strbuf_addstr_urlencode(&sb, names.v[i], allow_unsanitized);
+ if (urls.v[i]) {
+ strbuf_addstr(&sb, ",url=");
+ strbuf_addstr_urlencode(&sb, urls.v[i], allow_unsanitized);
+ }
+ }
+
+ redact_non_printables(&sb);
+
+ strvec_clear(&names);
+ strvec_clear(&urls);
+
+ return strbuf_detach(&sb, NULL);
+}
+
+enum accept_promisor {
+ ACCEPT_NONE = 0,
+ ACCEPT_ALL
+};
+
+static int should_accept_remote(enum accept_promisor accept,
+ const char *remote_name UNUSED,
+ const char *remote_url UNUSED)
+{
+ if (accept == ACCEPT_ALL)
+ return 1;
+
+ BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+}
+
+static void filter_promisor_remote(struct strvec *accepted, const char *info)
+{
+ struct strbuf **remotes;
+ const char *accept_str;
+ enum accept_promisor accept = ACCEPT_NONE;
+
+ if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
+ if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
+ accept = ACCEPT_NONE;
+ else if (!strcasecmp("All", accept_str))
+ accept = ACCEPT_ALL;
+ else
+ warning(_("unknown '%s' value for '%s' config option"),
+ accept_str, "promisor.acceptfromserver");
+ }
+
+ if (accept == ACCEPT_NONE)
+ return;
+
+ /* Parse remote info received */
+
+ remotes = strbuf_split_str(info, ';', 0);
+
+ for (size_t i = 0; remotes[i]; i++) {
+ struct strbuf **elems;
+ const char *remote_name = NULL;
+ const char *remote_url = NULL;
+ char *decoded_name = NULL;
+ char *decoded_url = NULL;
+
+ strbuf_strip_suffix(remotes[i], ";");
+ elems = strbuf_split(remotes[i], ',');
+
+ for (size_t j = 0; elems[j]; j++) {
+ int res;
+ strbuf_strip_suffix(elems[j], ",");
+ res = skip_prefix(elems[j]->buf, "name=", &remote_name) ||
+ skip_prefix(elems[j]->buf, "url=", &remote_url);
+ if (!res)
+ warning(_("unknown element '%s' from remote info"),
+ elems[j]->buf);
+ }
+
+ if (remote_name)
+ decoded_name = url_percent_decode(remote_name);
+ if (remote_url)
+ decoded_url = url_percent_decode(remote_url);
+
+ if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url))
+ strvec_push(accepted, decoded_name);
+
+ strbuf_list_free(elems);
+ free(decoded_name);
+ free(decoded_url);
+ }
+
+ strbuf_list_free(remotes);
+}
+
+char *promisor_remote_reply(const char *info)
+{
+ struct strvec accepted = STRVEC_INIT;
+ struct strbuf reply = STRBUF_INIT;
+
+ filter_promisor_remote(&accepted, info);
+
+ if (!accepted.nr)
+ return NULL;
+
+ for (size_t i = 0; i < accepted.nr; i++) {
+ if (i)
+ strbuf_addch(&reply, ';');
+ strbuf_addstr_urlencode(&reply, accepted.v[i], allow_unsanitized);
+ }
+
+ strvec_clear(&accepted);
+
+ return strbuf_detach(&reply, NULL);
+}
+
+void mark_promisor_remotes_as_accepted(struct repository *r, const char *remotes)
+{
+ struct strbuf **accepted_remotes = strbuf_split_str(remotes, ';', 0);
+
+ for (size_t i = 0; accepted_remotes[i]; i++) {
+ struct promisor_remote *p;
+ char *decoded_remote;
+
+ strbuf_strip_suffix(accepted_remotes[i], ";");
+ decoded_remote = url_percent_decode(accepted_remotes[i]->buf);
+
+ p = repo_promisor_remote_find(r, decoded_remote);
+ if (p)
+ p->accepted = 1;
+ else
+ warning(_("accepted promisor remote '%s' not found"),
+ decoded_remote);
+
+ free(decoded_remote);
+ }
+
+ strbuf_list_free(accepted_remotes);
+}
diff --git a/promisor-remote.h b/promisor-remote.h
index 88cb599c39..814ca248c7 100644
--- a/promisor-remote.h
+++ b/promisor-remote.h
@@ -9,11 +9,13 @@ struct object_id;
* Promisor remote linked list
*
* Information in its fields come from remote.XXX config entries or
- * from extensions.partialclone.
+ * from extensions.partialclone, except for 'accepted' which comes
+ * from protocol v2 capabilities exchange.
*/
struct promisor_remote {
struct promisor_remote *next;
char *partial_clone_filter;
+ unsigned int accepted : 1;
const char name[FLEX_ARRAY];
};
@@ -32,4 +34,36 @@ void promisor_remote_get_direct(struct repository *repo,
const struct object_id *oids,
int oid_nr);
+/*
+ * Prepare a "promisor-remote" advertisement by a server.
+ * Check the value of "promisor.advertise" and maybe the configured
+ * promisor remotes, if any, to prepare information to send in an
+ * advertisement.
+ * Return value is NULL if no promisor remote advertisement should be
+ * made. Otherwise it contains the names and urls of the advertised
+ * promisor remotes separated by ';'
+ */
+char *promisor_remote_info(struct repository *repo);
+
+/*
+ * Prepare a reply to a "promisor-remote" advertisement from a server.
+ * Check the value of "promisor.acceptfromserver" and maybe the
+ * configured promisor remotes, if any, to prepare the reply.
+ * Return value is NULL if no promisor remote from the server
+ * is accepted. Otherwise it contains the names of the accepted promisor
+ * remotes separated by ';'.
+ */
+char *promisor_remote_reply(const char *info);
+
+/*
+ * Set the 'accepted' flag for some promisor remotes. Useful when some
+ * promisor remotes have been accepted by the client.
+ */
+void mark_promisor_remotes_as_accepted(struct repository *repo, const char *remotes);
+
+/*
+ * Has any promisor remote been accepted by the client?
+ */
+int repo_has_accepted_promisor_remote(struct repository *r);
+
#endif /* PROMISOR_REMOTE_H */
diff --git a/serve.c b/serve.c
index f6dfe34a2b..e3ccf1505c 100644
--- a/serve.c
+++ b/serve.c
@@ -10,6 +10,7 @@
#include "upload-pack.h"
#include "bundle-uri.h"
#include "trace2.h"
+#include "promisor-remote.h"
static int advertise_sid = -1;
static int advertise_object_info = -1;
@@ -29,6 +30,26 @@ static int agent_advertise(struct repository *r UNUSED,
return 1;
}
+static int promisor_remote_advertise(struct repository *r,
+ struct strbuf *value)
+{
+ if (value) {
+ char *info = promisor_remote_info(r);
+ if (!info)
+ return 0;
+ strbuf_addstr(value, info);
+ free(info);
+ }
+ return 1;
+}
+
+static void promisor_remote_receive(struct repository *r,
+ const char *remotes)
+{
+ mark_promisor_remotes_as_accepted(r, remotes);
+}
+
+
static int object_format_advertise(struct repository *r,
struct strbuf *value)
{
@@ -155,6 +176,11 @@ static struct protocol_capability capabilities[] = {
.advertise = bundle_uri_advertise,
.command = bundle_uri_command,
},
+ {
+ .name = "promisor-remote",
+ .advertise = promisor_remote_advertise,
+ .receive = promisor_remote_receive,
+ },
};
void protocol_v2_advertise_capabilities(struct repository *r)
diff --git a/t/meson.build b/t/meson.build
index 7b35eadbc8..20e15c407c 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -727,6 +727,7 @@ integration_tests = [
't5703-upload-pack-ref-in-want.sh',
't5704-protocol-violations.sh',
't5705-session-id-in-capabilities.sh',
+ 't5710-promisor-remote-capability.sh',
't5730-protocol-v2-bundle-uri-file.sh',
't5731-protocol-v2-bundle-uri-git.sh',
't5732-protocol-v2-bundle-uri-http.sh',
diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
new file mode 100755
index 0000000000..0390c1dbad
--- /dev/null
+++ b/t/t5710-promisor-remote-capability.sh
@@ -0,0 +1,244 @@
+#!/bin/sh
+
+test_description='handling of promisor remote advertisement'
+
+. ./test-lib.sh
+
+GIT_TEST_MULTI_PACK_INDEX=0
+GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0
+
+# Setup the repository with three commits, this way HEAD is always
+# available and we can hide commit 1 or 2.
+test_expect_success 'setup: create "template" repository' '
+ git init template &&
+ test_commit -C template 1 &&
+ test_commit -C template 2 &&
+ test_commit -C template 3 &&
+ test-tool genrandom foo 10240 >template/foo &&
+ git -C template add foo &&
+ git -C template commit -m foo
+'
+
+# A bare repo will act as a server repo with unpacked objects.
+test_expect_success 'setup: create bare "server" repository' '
+ git clone --bare --no-local template server &&
+ mv server/objects/pack/pack-* . &&
+ packfile=$(ls pack-*.pack) &&
+ git -C server unpack-objects --strict <"$packfile"
+'
+
+check_missing_objects () {
+ git -C "$1" rev-list --objects --all --missing=print > all.txt &&
+ perl -ne 'print if s/^[?]//' all.txt >missing.txt &&
+ test_line_count = "$2" missing.txt &&
+ if test "$2" -lt 2
+ then
+ test "$3" = "$(cat missing.txt)"
+ else
+ test -f "$3" &&
+ sort <"$3" >expected_sorted &&
+ sort <missing.txt >actual_sorted &&
+ test_cmp expected_sorted actual_sorted
+ fi
+}
+
+initialize_server () {
+ count="$1"
+ missing_oids="$2"
+
+ # Repack everything first
+ git -C server -c repack.writebitmaps=false repack -a -d &&
+
+ # Remove promisor file in case they exist, useful when reinitializing
+ rm -rf server/objects/pack/*.promisor &&
+
+ # Repack without the largest object and create a promisor pack on server
+ git -C server -c repack.writebitmaps=false repack -a -d \
+ --filter=blob:limit=5k --filter-to="$(pwd)/pack" &&
+ promisor_file=$(ls server/objects/pack/*.pack | sed "s/\.pack/.promisor/") &&
+ >"$promisor_file" &&
+
+ # Check objects missing on the server
+ check_missing_objects server "$count" "$missing_oids"
+}
+
+copy_to_server2 () {
+ oid_path="$(test_oid_to_path $1)" &&
+ path="server/objects/$oid_path" &&
+ path2="server2/objects/$oid_path" &&
+ mkdir -p $(dirname "$path2") &&
+ cp "$path" "$path2"
+}
+
+test_expect_success "setup for testing promisor remote advertisement" '
+ # Create another bare repo called "server2"
+ git init --bare server2 &&
+
+ # Copy the largest object from server to server2
+ obj="HEAD:foo" &&
+ oid="$(git -C server rev-parse $obj)" &&
+ copy_to_server2 "$oid" &&
+
+ initialize_server 1 "$oid" &&
+
+ # Configure server2 as promisor remote for server
+ git -C server remote add server2 "file://$(pwd)/server2" &&
+ git -C server config remote.server2.promisor true &&
+
+ git -C server2 config uploadpack.allowFilter true &&
+ git -C server2 config uploadpack.allowAnySHA1InWant true &&
+ git -C server config uploadpack.allowFilter true &&
+ git -C server config uploadpack.allowAnySHA1InWant true
+'
+
+test_expect_success "clone with promisor.advertise set to 'true'" '
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+ -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+ -c remote.server2.url="file://$(pwd)/server2" \
+ -c promisor.acceptfromserver=All \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
+
+ # Check that the largest object is still missing on the server
+ check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.advertise set to 'false'" '
+ git -C server config promisor.advertise false &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+ -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+ -c remote.server2.url="file://$(pwd)/server2" \
+ -c promisor.acceptfromserver=All \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
+
+ # Check that the largest object is not missing on the server
+ check_missing_objects server 0 "" &&
+
+ # Reinitialize server so that the largest object is missing again
+ initialize_server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.acceptfromserver set to 'None'" '
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+ -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+ -c remote.server2.url="file://$(pwd)/server2" \
+ -c promisor.acceptfromserver=None \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
+
+ # Check that the largest object is not missing on the server
+ check_missing_objects server 0 "" &&
+
+ # Reinitialize server so that the largest object is missing again
+ initialize_server 1 "$oid"
+'
+
+test_expect_success "init + fetch with promisor.advertise set to 'true'" '
+ git -C server config promisor.advertise true &&
+
+ test_when_finished "rm -rf client" &&
+ mkdir client &&
+ git -C client init &&
+ git -C client config remote.server2.promisor true &&
+ git -C client config remote.server2.fetch "+refs/heads/*:refs/remotes/server2/*" &&
+ git -C client config remote.server2.url "file://$(pwd)/server2" &&
+ git -C client config remote.server.url "file://$(pwd)/server" &&
+ git -C client config remote.server.fetch "+refs/heads/*:refs/remotes/server/*" &&
+ git -C client config promisor.acceptfromserver All &&
+ GIT_NO_LAZY_FETCH=0 git -C client fetch --filter="blob:limit=5k" server &&
+
+ # Check that the largest object is still missing on the server
+ check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.advertise set to 'true' but don't delete the client" '
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+ -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+ -c remote.server2.url="file://$(pwd)/server2" \
+ -c promisor.acceptfromserver=All \
+ --no-local --filter="blob:limit=5k" server client &&
+
+ # Check that the largest object is still missing on the server
+ check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "setup for subsequent fetches" '
+ # Generate new commit with large blob
+ test-tool genrandom bar 10240 >template/bar &&
+ git -C template add bar &&
+ git -C template commit -m bar &&
+
+ # Fetch new commit with large blob
+ git -C server fetch origin &&
+ git -C server update-ref HEAD FETCH_HEAD &&
+ git -C server rev-parse HEAD >expected_head &&
+
+ # Repack everything twice and remove .promisor files before
+ # each repack. This makes sure everything gets repacked
+ # into a single packfile. The second repack is necessary
+ # because the first one fetches from server2 and creates a new
+ # packfile and its associated .promisor file.
+
+ rm -f server/objects/pack/*.promisor &&
+ git -C server -c repack.writebitmaps=false repack -a -d &&
+ rm -f server/objects/pack/*.promisor &&
+ git -C server -c repack.writebitmaps=false repack -a -d &&
+
+ # Unpack everything
+ rm pack-* &&
+ mv server/objects/pack/pack-* . &&
+ packfile=$(ls pack-*.pack) &&
+ git -C server unpack-objects --strict <"$packfile" &&
+
+ # Copy new large object to server2
+ obj_bar="HEAD:bar" &&
+ oid_bar="$(git -C server rev-parse $obj_bar)" &&
+ copy_to_server2 "$oid_bar" &&
+
+ # Reinitialize server so that the 2 largest objects are missing
+ printf "%s\n" "$oid" "$oid_bar" >expected_missing.txt &&
+ initialize_server 2 expected_missing.txt &&
+
+ # Create one more client
+ cp -r client client2
+'
+
+test_expect_success "subsequent fetch from a client when promisor.advertise is true" '
+ git -C server config promisor.advertise true &&
+
+ GIT_NO_LAZY_FETCH=0 git -C client pull origin &&
+
+ git -C client rev-parse HEAD >actual &&
+ test_cmp expected_head actual &&
+
+ cat client/bar >/dev/null &&
+
+ check_missing_objects server 2 expected_missing.txt
+'
+
+test_expect_success "subsequent fetch from a client when promisor.advertise is false" '
+ git -C server config promisor.advertise false &&
+
+ GIT_NO_LAZY_FETCH=0 git -C client2 pull origin &&
+
+ git -C client2 rev-parse HEAD >actual &&
+ test_cmp expected_head actual &&
+
+ cat client2/bar >/dev/null &&
+
+ check_missing_objects server 1 "$oid"
+'
+
+test_done
diff --git a/upload-pack.c b/upload-pack.c
index 728b2477fc..7498b45e2e 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -32,6 +32,7 @@
#include "write-or-die.h"
#include "json-writer.h"
#include "strmap.h"
+#include "promisor-remote.h"
/* Remember to update object flag allocation in object.h */
#define THEY_HAVE (1u << 11)
@@ -319,6 +320,8 @@ static void create_pack_file(struct upload_pack_data *pack_data,
strvec_push(&pack_objects.args, "--delta-base-offset");
if (pack_data->use_include_tag)
strvec_push(&pack_objects.args, "--include-tag");
+ if (repo_has_accepted_promisor_remote(the_repository))
+ strvec_push(&pack_objects.args, "--missing=allow-promisor");
if (pack_data->filter_options.choice) {
const char *spec =
expand_list_objects_filter_spec(&pack_data->filter_options);
--
2.46.0.rc0.95.gcbf174a634
^ permalink raw reply related [flat|nested] 110+ messages in thread
* Re: [PATCH v4 4/6] Add 'promisor-remote' capability to protocol v2
2025-01-27 15:16 ` [PATCH v4 4/6] Add 'promisor-remote' capability to protocol v2 Christian Couder
@ 2025-01-30 10:51 ` Patrick Steinhardt
2025-02-18 11:41 ` Christian Couder
0 siblings, 1 reply; 110+ messages in thread
From: Patrick Steinhardt @ 2025-01-30 10:51 UTC (permalink / raw)
To: Christian Couder
Cc: git, Junio C Hamano, Taylor Blau, Eric Sunshine, Karthik Nayak,
Kristoffer Haugsbakk, brian m . carlson, Randall S . Becker,
Christian Couder
On Mon, Jan 27, 2025 at 04:16:59PM +0100, Christian Couder wrote:
> When a server S knows that some objects from a repository are available
> from a promisor remote X, S might want to suggest to a client C cloning
> or fetching the repo from S that C may use X directly instead of S for
> these objects.
A lot of the commit message seems to be duplicated with the technical
documentation that you add. I wonder whether it would make sense to
simply refer to that instead of repeating all of it? That would make it
easier to spot the actually-important bits in the commit message that
add context to the patch.
One very important bit of context that I was lacking is what exactly we
wire up and where we do so. I have been searching for longer than I want
to admit where the client ends up using the promisor remotes, until I
eventually figured out that the client-side isn't wired up at all. It
makes sense in retrospect, but it would've been nice if the reader was
guided a bit.
> diff --git a/Documentation/gitprotocol-v2.txt b/Documentation/gitprotocol-v2.txt
> index 1652fef3ae..f25a9a6ad8 100644
> --- a/Documentation/gitprotocol-v2.txt
> +++ b/Documentation/gitprotocol-v2.txt
> @@ -781,6 +781,60 @@ retrieving the header from a bundle at the indicated URI, and thus
> save themselves and the server(s) the request(s) needed to inspect the
> headers of that bundle or bundles.
>
> +promisor-remote=<pr-infos>
> +~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The server may advertise some promisor remotes it is using or knows
> +about to a client which may want to use them as its promisor remotes,
> +instead of this repository. In this case <pr-infos> should be of the
> +form:
> +
> + pr-infos = pr-info | pr-infos ";" pr-info
> +
> + pr-info = "name=" pr-name | "name=" pr-name "," "url=" pr-url
> +
> +where `pr-name` is the urlencoded name of a promisor remote, and
> +`pr-url` the urlencoded URL of that promisor remote.
> +
> +In this case, if the client decides to use one or more promisor
> +remotes the server advertised, it can reply with
> +"promisor-remote=<pr-names>" where <pr-names> should be of the form:
> +
> + pr-names = pr-name | pr-names ";" pr-name
> +
> +where `pr-name` is the urlencoded name of a promisor remote the server
> +advertised and the client accepts.
> +
> +Note that, everywhere in this document, `pr-name` MUST be a valid
> +remote name, and the ';' and ',' characters MUST be encoded if they
> +appear in `pr-name` or `pr-url`.
> +
> +If the server doesn't know any promisor remote that could be good for
> +a client to use, or prefers a client not to use any promisor remote it
> +uses or knows about, it shouldn't advertise the "promisor-remote"
> +capability at all.
> +
> +In this case, or if the client doesn't want to use any promisor remote
> +the server advertised, the client shouldn't advertise the
> +"promisor-remote" capability at all in its reply.
> +
> +The "promisor.advertise" and "promisor.acceptFromServer" configuration
> +options can be used on the server and client side respectively to
s/respectively//, as you already say that in the next line.
> +control what they advertise or accept respectively. See the
> +documentation of these configuration options for more information.
> +
> +Note that in the future it would be nice if the "promisor-remote"
> +protocol capability could be used by the server, when responding to
> +`git fetch` or `git clone`, to advertise better-connected remotes that
> +the client can use as promisor remotes, instead of this repository, so
> +that the client can lazily fetch objects from these other
> +better-connected remotes. This would require the server to omit in its
> +response the objects available on the better-connected remotes that
> +the client has accepted. This hasn't been implemented yet though. So
> +for now this "promisor-remote" capability is useful only when the
> +server advertises some promisor remotes it already uses to borrow
> +objects from.
I'd leave away this bit as it doesn't really add a lot to the document.
It's a possibility for the future, but without it being implemented
anywhere it's not that helpful from my point of view.
> diff --git a/promisor-remote.c b/promisor-remote.c
> index c714f4f007..5ac282ed27 100644
> --- a/promisor-remote.c
> +++ b/promisor-remote.c
> @@ -292,3 +306,185 @@ void promisor_remote_get_direct(struct repository *repo,
> if (to_free)
> free(remaining_oids);
> }
> +
> +static int allow_unsanitized(char ch)
> +{
> + if (ch == ',' || ch == ';' || ch == '%')
> + return 0;
> + return ch > 32 && ch < 127;
> +}
Isn't this too lenient? It would allow also allow e.g. '=' and all kinds
of other characters. This does make sense for URLs, but it doesn't make
sense for remote names as they aren't supposed to contain punctuation in
the first place. So for these remote names I'd think we should be way
stricter and return an error in case they contain non-alphanumeric data.
> +static void promisor_info_vecs(struct repository *repo,
> + struct strvec *names,
> + struct strvec *urls)
I wonder whether it would make more sense to track these as a strmap
instead of two arrays which are expected to have related entries in the
same place.
> +{
> + struct promisor_remote *r;
> +
> + promisor_remote_init(repo);
> +
> + for (r = repo->promisor_remote_config->promisors; r; r = r->next) {
> + char *url;
> + char *url_key = xstrfmt("remote.%s.url", r->name);
> +
> + strvec_push(names, r->name);
> + strvec_push(urls, git_config_get_string(url_key, &url) ? NULL : url);
> +
> + free(url);
> + free(url_key);
> + }
> +}
> +
> +char *promisor_remote_info(struct repository *repo)
> +{
> + struct strbuf sb = STRBUF_INIT;
> + int advertise_promisors = 0;
> + struct strvec names = STRVEC_INIT;
> + struct strvec urls = STRVEC_INIT;
> +
> + git_config_get_bool("promisor.advertise", &advertise_promisors);
> +
> + if (!advertise_promisors)
> + return NULL;
> +
> + promisor_info_vecs(repo, &names, &urls);
> +
> + if (!names.nr)
> + return NULL;
> +
> + for (size_t i = 0; i < names.nr; i++) {
> + if (i)
> + strbuf_addch(&sb, ';');
> + strbuf_addstr(&sb, "name=");
> + strbuf_addstr_urlencode(&sb, names.v[i], allow_unsanitized);
> + if (urls.v[i]) {
> + strbuf_addstr(&sb, ",url=");
> + strbuf_addstr_urlencode(&sb, urls.v[i], allow_unsanitized);
> + }
> + }
> +
> + redact_non_printables(&sb);
So here we replace non-printable characters with dots as far as I
understand. But didn't we just URL-encode the strings? So is there ever
a possibility for non-printable characters here?
> + strvec_clear(&names);
> + strvec_clear(&urls);
> +
> + return strbuf_detach(&sb, NULL);
> +}
> +
> +enum accept_promisor {
> + ACCEPT_NONE = 0,
> + ACCEPT_ALL
> +};
> +
> +static int should_accept_remote(enum accept_promisor accept,
> + const char *remote_name UNUSED,
> + const char *remote_url UNUSED)
> +{
> + if (accept == ACCEPT_ALL)
> + return 1;
> +
> + BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
> +}
> +
> +static void filter_promisor_remote(struct strvec *accepted, const char *info)
> +{
> + struct strbuf **remotes;
> + const char *accept_str;
> + enum accept_promisor accept = ACCEPT_NONE;
> +
> + if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
> + if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
> + accept = ACCEPT_NONE;
> + else if (!strcasecmp("All", accept_str))
> + accept = ACCEPT_ALL;
> + else
> + warning(_("unknown '%s' value for '%s' config option"),
> + accept_str, "promisor.acceptfromserver");
> + }
> +
> + if (accept == ACCEPT_NONE)
> + return;
> +
> + /* Parse remote info received */
> +
> + remotes = strbuf_split_str(info, ';', 0);
> +
> + for (size_t i = 0; remotes[i]; i++) {
> + struct strbuf **elems;
> + const char *remote_name = NULL;
> + const char *remote_url = NULL;
> + char *decoded_name = NULL;
> + char *decoded_url = NULL;
> +
> + strbuf_strip_suffix(remotes[i], ";");
> + elems = strbuf_split(remotes[i], ',');
> +
> + for (size_t j = 0; elems[j]; j++) {
> + int res;
> + strbuf_strip_suffix(elems[j], ",");
> + res = skip_prefix(elems[j]->buf, "name=", &remote_name) ||
> + skip_prefix(elems[j]->buf, "url=", &remote_url);
> + if (!res)
> + warning(_("unknown element '%s' from remote info"),
> + elems[j]->buf);
> + }
> +
> + if (remote_name)
> + decoded_name = url_percent_decode(remote_name);
> + if (remote_url)
> + decoded_url = url_percent_decode(remote_url);
This is data we have received from a potentially-untrusted remote, so we
should double-check that the data we have received doesn't contain any
weird characters:
- For the remote name we should verify that it consists only of
alphanumeric characters.
- For the remote URL we need to verify that it's a proper URL without
any newlines, non-printable characters or anything else.
We'll eventually end up storing that data in the configuration, so these
verifications are quite important so that an adversarial server cannot
perform config-injection and thus cause remote code execution.
[snip]
> +void mark_promisor_remotes_as_accepted(struct repository *r, const char *remotes)
> +{
> + struct strbuf **accepted_remotes = strbuf_split_str(remotes, ';', 0);
> +
> + for (size_t i = 0; accepted_remotes[i]; i++) {
> + struct promisor_remote *p;
> + char *decoded_remote;
> +
> + strbuf_strip_suffix(accepted_remotes[i], ";");
> + decoded_remote = url_percent_decode(accepted_remotes[i]->buf);
> +
> + p = repo_promisor_remote_find(r, decoded_remote);
> + if (p)
> + p->accepted = 1;
> + else
> + warning(_("accepted promisor remote '%s' not found"),
> + decoded_remote);
My initial understanding of this code was that it is about the
client-side accepting a remote, but this is about the server-side and
tracks whether a promisor remote has been accepted by the client. It
feels a bit weird to modify semi-global state for this, as I'd have
rather expected that we pass around a vector of accepted remotes
instead.
But I guess ultimately this isn't too bad. It would be nice though if
it was more obvious whether we're on the server- or client-side.
> diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
> new file mode 100755
> index 0000000000..0390c1dbad
> --- /dev/null
> +++ b/t/t5710-promisor-remote-capability.sh
> @@ -0,0 +1,244 @@
[snip]
> +initialize_server () {
> + count="$1"
> + missing_oids="$2"
> +
> + # Repack everything first
> + git -C server -c repack.writebitmaps=false repack -a -d &&
> +
> + # Remove promisor file in case they exist, useful when reinitializing
> + rm -rf server/objects/pack/*.promisor &&
> +
> + # Repack without the largest object and create a promisor pack on server
> + git -C server -c repack.writebitmaps=false repack -a -d \
> + --filter=blob:limit=5k --filter-to="$(pwd)/pack" &&
> + promisor_file=$(ls server/objects/pack/*.pack | sed "s/\.pack/.promisor/") &&
> + >"$promisor_file" &&
> +
> + # Check objects missing on the server
> + check_missing_objects server "$count" "$missing_oids"
> +}
> +
> +copy_to_server2 () {
Nit: `server2` could be renamed to `promisor` to make the relation
between the two servers more obvious.
> diff --git a/upload-pack.c b/upload-pack.c
> index 728b2477fc..7498b45e2e 100644
> --- a/upload-pack.c
> +++ b/upload-pack.c
> @@ -319,6 +320,8 @@ static void create_pack_file(struct upload_pack_data *pack_data,
> strvec_push(&pack_objects.args, "--delta-base-offset");
> if (pack_data->use_include_tag)
> strvec_push(&pack_objects.args, "--include-tag");
> + if (repo_has_accepted_promisor_remote(the_repository))
> + strvec_push(&pack_objects.args, "--missing=allow-promisor");
This is nice and simple, I like it.
Patrick
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v4 4/6] Add 'promisor-remote' capability to protocol v2
2025-01-30 10:51 ` Patrick Steinhardt
@ 2025-02-18 11:41 ` Christian Couder
0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-02-18 11:41 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: git, Junio C Hamano, Taylor Blau, Eric Sunshine, Karthik Nayak,
Kristoffer Haugsbakk, brian m . carlson, Randall S . Becker,
Christian Couder
On Thu, Jan 30, 2025 at 11:51 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Mon, Jan 27, 2025 at 04:16:59PM +0100, Christian Couder wrote:
> > When a server S knows that some objects from a repository are available
> > from a promisor remote X, S might want to suggest to a client C cloning
> > or fetching the repo from S that C may use X directly instead of S for
> > these objects.
>
> A lot of the commit message seems to be duplicated with the technical
> documentation that you add. I wonder whether it would make sense to
> simply refer to that instead of repeating all of it? That would make it
> easier to spot the actually-important bits in the commit message that
> add context to the patch.
I thought that commit messages should be self-contained as much as
possible. I am fine with adding a sentence saying that a design doc to
help with seeing the big picture will follow in one of the next
commits if it helps though.
> One very important bit of context that I was lacking is what exactly we
> wire up and where we do so. I have been searching for longer than I want
> to admit where the client ends up using the promisor remotes, until I
> eventually figured out that the client-side isn't wired up at all. It
> makes sense in retrospect, but it would've been nice if the reader was
> guided a bit.
The protocol side is implemented on both the client and the server
side in this patch. The rest already works on the client side because
using promisor remotes already works on the client side. We are just
making sure client and server agree on using a promisor remote before
the server allows it by passing "--missing=allow-promisor" to `git
pack-objects`, see below . The tests show that this single change is
enough to make things work.
> > diff --git a/Documentation/gitprotocol-v2.txt b/Documentation/gitprotocol-v2.txt
> > index 1652fef3ae..f25a9a6ad8 100644
> > --- a/Documentation/gitprotocol-v2.txt
> > +++ b/Documentation/gitprotocol-v2.txt
> > @@ -781,6 +781,60 @@ retrieving the header from a bundle at the indicated URI, and thus
> > save themselves and the server(s) the request(s) needed to inspect the
> > headers of that bundle or bundles.
> >
> > +promisor-remote=<pr-infos>
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +The server may advertise some promisor remotes it is using or knows
> > +about to a client which may want to use them as its promisor remotes,
> > +instead of this repository. In this case <pr-infos> should be of the
> > +form:
> > +
> > + pr-infos = pr-info | pr-infos ";" pr-info
> > +
> > + pr-info = "name=" pr-name | "name=" pr-name "," "url=" pr-url
> > +
> > +where `pr-name` is the urlencoded name of a promisor remote, and
> > +`pr-url` the urlencoded URL of that promisor remote.
> > +
> > +In this case, if the client decides to use one or more promisor
> > +remotes the server advertised, it can reply with
> > +"promisor-remote=<pr-names>" where <pr-names> should be of the form:
> > +
> > + pr-names = pr-name | pr-names ";" pr-name
> > +
> > +where `pr-name` is the urlencoded name of a promisor remote the server
> > +advertised and the client accepts.
> > +
> > +Note that, everywhere in this document, `pr-name` MUST be a valid
> > +remote name, and the ';' and ',' characters MUST be encoded if they
> > +appear in `pr-name` or `pr-url`.
> > +
> > +If the server doesn't know any promisor remote that could be good for
> > +a client to use, or prefers a client not to use any promisor remote it
> > +uses or knows about, it shouldn't advertise the "promisor-remote"
> > +capability at all.
> > +
> > +In this case, or if the client doesn't want to use any promisor remote
> > +the server advertised, the client shouldn't advertise the
> > +"promisor-remote" capability at all in its reply.
> > +
> > +The "promisor.advertise" and "promisor.acceptFromServer" configuration
> > +options can be used on the server and client side respectively to
>
> s/respectively//, as you already say that in the next line.
I have removed it in the next version.
> > +control what they advertise or accept respectively. See the
> > +documentation of these configuration options for more information.
> > +
> > +Note that in the future it would be nice if the "promisor-remote"
> > +protocol capability could be used by the server, when responding to
> > +`git fetch` or `git clone`, to advertise better-connected remotes that
> > +the client can use as promisor remotes, instead of this repository, so
> > +that the client can lazily fetch objects from these other
> > +better-connected remotes. This would require the server to omit in its
> > +response the objects available on the better-connected remotes that
> > +the client has accepted. This hasn't been implemented yet though. So
> > +for now this "promisor-remote" capability is useful only when the
> > +server advertises some promisor remotes it already uses to borrow
> > +objects from.
>
> I'd leave away this bit as it doesn't really add a lot to the document.
> It's a possibility for the future, but without it being implemented
> anywhere it's not that helpful from my point of view.
In previous iterations, Junio talked about this as an interesting
possibility to implement in the future, so I thought it could be
interesting to mention it in some places. I would be Ok to remove it
if no one cares though.
> > diff --git a/promisor-remote.c b/promisor-remote.c
> > index c714f4f007..5ac282ed27 100644
> > --- a/promisor-remote.c
> > +++ b/promisor-remote.c
> > @@ -292,3 +306,185 @@ void promisor_remote_get_direct(struct repository *repo,
> > if (to_free)
> > free(remaining_oids);
> > }
> > +
> > +static int allow_unsanitized(char ch)
> > +{
> > + if (ch == ',' || ch == ';' || ch == '%')
> > + return 0;
> > + return ch > 32 && ch < 127;
> > +}
>
> Isn't this too lenient? It would allow also allow e.g. '=' and all kinds
> of other characters. This does make sense for URLs, but it doesn't make
> sense for remote names as they aren't supposed to contain punctuation in
> the first place. So for these remote names I'd think we should be way
> stricter and return an error in case they contain non-alphanumeric data.
This is used only to determine which characters are URL-encoded, not
which characters we pass or not to the other side. See below.
> > +static void promisor_info_vecs(struct repository *repo,
> > + struct strvec *names,
> > + struct strvec *urls)
>
> I wonder whether it would make more sense to track these as a strmap
> instead of two arrays which are expected to have related entries in the
> same place.
In the future we might have more generic code with perhaps a
configuration option (maybe "promisor.advertiseFields") that lists the
remote fields, like "name, url, token, filter, id", that should be
advertised by the server. If that happens, then it will make a lot of
sense to use a strmap indeed. For now we just don't know how that code
will evolve, so I think it's not worth risking overengineering this.
> > +{
> > + struct promisor_remote *r;
> > +
> > + promisor_remote_init(repo);
> > +
> > + for (r = repo->promisor_remote_config->promisors; r; r = r->next) {
> > + char *url;
> > + char *url_key = xstrfmt("remote.%s.url", r->name);
> > +
> > + strvec_push(names, r->name);
> > + strvec_push(urls, git_config_get_string(url_key, &url) ? NULL : url);
> > +
> > + free(url);
> > + free(url_key);
> > + }
> > +}
> > +
> > +char *promisor_remote_info(struct repository *repo)
> > +{
> > + struct strbuf sb = STRBUF_INIT;
> > + int advertise_promisors = 0;
> > + struct strvec names = STRVEC_INIT;
> > + struct strvec urls = STRVEC_INIT;
> > +
> > + git_config_get_bool("promisor.advertise", &advertise_promisors);
> > +
> > + if (!advertise_promisors)
> > + return NULL;
> > +
> > + promisor_info_vecs(repo, &names, &urls);
> > +
> > + if (!names.nr)
> > + return NULL;
> > +
> > + for (size_t i = 0; i < names.nr; i++) {
> > + if (i)
> > + strbuf_addch(&sb, ';');
> > + strbuf_addstr(&sb, "name=");
> > + strbuf_addstr_urlencode(&sb, names.v[i], allow_unsanitized);
> > + if (urls.v[i]) {
> > + strbuf_addstr(&sb, ",url=");
> > + strbuf_addstr_urlencode(&sb, urls.v[i], allow_unsanitized);
> > + }
> > + }
> > +
> > + redact_non_printables(&sb);
>
> So here we replace non-printable characters with dots as far as I
> understand. But didn't we just URL-encode the strings? So is there ever
> a possibility for non-printable characters here?
Yeah, right. I am removing this call in the next version then. This is
nice because it allows us to remove the first 3 patches in this series
and not depend on Usman's "extend agent capability to include OS name"
series (https://lore.kernel.org/git/20250215155130.1756934-1-usmanakinyemi202@gmail.com/).
> > + strvec_clear(&names);
> > + strvec_clear(&urls);
> > +
> > + return strbuf_detach(&sb, NULL);
> > +}
[...]
> > +static void filter_promisor_remote(struct strvec *accepted, const char *info)
> > +{
> > + struct strbuf **remotes;
> > + const char *accept_str;
> > + enum accept_promisor accept = ACCEPT_NONE;
> > +
> > + if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
> > + if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
> > + accept = ACCEPT_NONE;
> > + else if (!strcasecmp("All", accept_str))
> > + accept = ACCEPT_ALL;
> > + else
> > + warning(_("unknown '%s' value for '%s' config option"),
> > + accept_str, "promisor.acceptfromserver");
> > + }
> > +
> > + if (accept == ACCEPT_NONE)
> > + return;
> > +
> > + /* Parse remote info received */
> > +
> > + remotes = strbuf_split_str(info, ';', 0);
> > +
> > + for (size_t i = 0; remotes[i]; i++) {
> > + struct strbuf **elems;
> > + const char *remote_name = NULL;
> > + const char *remote_url = NULL;
> > + char *decoded_name = NULL;
> > + char *decoded_url = NULL;
> > +
> > + strbuf_strip_suffix(remotes[i], ";");
> > + elems = strbuf_split(remotes[i], ',');
> > +
> > + for (size_t j = 0; elems[j]; j++) {
> > + int res;
> > + strbuf_strip_suffix(elems[j], ",");
> > + res = skip_prefix(elems[j]->buf, "name=", &remote_name) ||
> > + skip_prefix(elems[j]->buf, "url=", &remote_url);
> > + if (!res)
> > + warning(_("unknown element '%s' from remote info"),
> > + elems[j]->buf);
> > + }
> > +
> > + if (remote_name)
> > + decoded_name = url_percent_decode(remote_name);
> > + if (remote_url)
> > + decoded_url = url_percent_decode(remote_url);
>
> This is data we have received from a potentially-untrusted remote, so we
> should double-check that the data we have received doesn't contain any
> weird characters:
>
> - For the remote name we should verify that it consists only of
> alphanumeric characters.
>
> - For the remote URL we need to verify that it's a proper URL without
> any newlines, non-printable characters or anything else.
>
> We'll eventually end up storing that data in the configuration, so these
> verifications are quite important so that an adversarial server cannot
> perform config-injection and thus cause remote code execution.
We currently don't store that data in the configuration. We just use
it to compare it with what is already configured on the client side. I
agree that if we ever make changes in a future series to store that
data, we should be careful to double-check it.
> > +void mark_promisor_remotes_as_accepted(struct repository *r, const char *remotes)
> > +{
> > + struct strbuf **accepted_remotes = strbuf_split_str(remotes, ';', 0);
> > +
> > + for (size_t i = 0; accepted_remotes[i]; i++) {
> > + struct promisor_remote *p;
> > + char *decoded_remote;
> > +
> > + strbuf_strip_suffix(accepted_remotes[i], ";");
> > + decoded_remote = url_percent_decode(accepted_remotes[i]->buf);
> > +
> > + p = repo_promisor_remote_find(r, decoded_remote);
> > + if (p)
> > + p->accepted = 1;
> > + else
> > + warning(_("accepted promisor remote '%s' not found"),
> > + decoded_remote);
>
> My initial understanding of this code was that it is about the
> client-side accepting a remote, but this is about the server-side and
> tracks whether a promisor remote has been accepted by the client. It
> feels a bit weird to modify semi-global state for this, as I'd have
> rather expected that we pass around a vector of accepted remotes
> instead.
>
> But I guess ultimately this isn't too bad. It would be nice though if
> it was more obvious whether we're on the server- or client-side.
I have changed the description of the function like this in "promisor-remote.h":
/*
* Set the 'accepted' flag for some promisor remotes. Useful on the
* server side when some promisor remotes have been accepted by the
* client.
*/
void mark_promisor_remotes_as_accepted(struct repository *repo, const
char *remotes);
> > diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
> > new file mode 100755
> > index 0000000000..0390c1dbad
> > --- /dev/null
> > +++ b/t/t5710-promisor-remote-capability.sh
> > @@ -0,0 +1,244 @@
> [snip]
> > +initialize_server () {
> > + count="$1"
> > + missing_oids="$2"
> > +
> > + # Repack everything first
> > + git -C server -c repack.writebitmaps=false repack -a -d &&
> > +
> > + # Remove promisor file in case they exist, useful when reinitializing
> > + rm -rf server/objects/pack/*.promisor &&
> > +
> > + # Repack without the largest object and create a promisor pack on server
> > + git -C server -c repack.writebitmaps=false repack -a -d \
> > + --filter=blob:limit=5k --filter-to="$(pwd)/pack" &&
> > + promisor_file=$(ls server/objects/pack/*.pack | sed "s/\.pack/.promisor/") &&
> > + >"$promisor_file" &&
> > +
> > + # Check objects missing on the server
> > + check_missing_objects server "$count" "$missing_oids"
> > +}
> > +
> > +copy_to_server2 () {
>
> Nit: `server2` could be renamed to `promisor` to make the relation
> between the two servers more obvious.
I think "promisor" might be confusing as that is already used in parts
of some config variable names. For example we would have to set
"remote.promisor.promisor" to "true" several times. I have renamed it
to "lop" instead.
> > diff --git a/upload-pack.c b/upload-pack.c
> > index 728b2477fc..7498b45e2e 100644
> > --- a/upload-pack.c
> > +++ b/upload-pack.c
> > @@ -319,6 +320,8 @@ static void create_pack_file(struct upload_pack_data *pack_data,
> > strvec_push(&pack_objects.args, "--delta-base-offset");
> > if (pack_data->use_include_tag)
> > strvec_push(&pack_objects.args, "--include-tag");
> > + if (repo_has_accepted_promisor_remote(the_repository))
> > + strvec_push(&pack_objects.args, "--missing=allow-promisor");
>
> This is nice and simple, I like it.
Yeah, this is really the only change that is needed for a client to be
able to lazy fetch from promisor remotes at clone time.
Thanks.
^ permalink raw reply [flat|nested] 110+ messages in thread
* [PATCH v4 5/6] promisor-remote: check advertised name or URL
2025-01-27 15:16 ` [PATCH v4 0/6] " Christian Couder
` (3 preceding siblings ...)
2025-01-27 15:16 ` [PATCH v4 4/6] Add 'promisor-remote' capability to protocol v2 Christian Couder
@ 2025-01-27 15:17 ` Christian Couder
2025-01-27 23:48 ` Junio C Hamano
2025-01-27 15:17 ` [PATCH v4 6/6] doc: add technical design doc for large object promisors Christian Couder
` (2 subsequent siblings)
7 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:17 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
Randall S . Becker, Christian Couder, Christian Couder
A previous commit introduced a "promisor.acceptFromServer" configuration
variable with only "None" or "All" as valid values.
Let's introduce "KnownName" and "KnownUrl" as valid values for this
configuration option to give more choice to a client about which
promisor remotes it might accept among those that the server advertised.
In case of "KnownName", the client will accept promisor remotes which
are already configured on the client and have the same name as those
advertised by the client. This could be useful in a corporate setup
where servers and clients are trusted to not switch names and URLs, but
where some kind of control is still useful.
In case of "KnownUrl", the client will accept promisor remotes which
have both the same name and the same URL configured on the client as the
name and URL advertised by the server. This is the most secure option,
so it should be used if possible.
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
Documentation/config/promisor.txt | 22 ++++++---
promisor-remote.c | 60 ++++++++++++++++++++---
t/t5710-promisor-remote-capability.sh | 68 +++++++++++++++++++++++++++
3 files changed, 138 insertions(+), 12 deletions(-)
diff --git a/Documentation/config/promisor.txt b/Documentation/config/promisor.txt
index 9cbfe3e59e..d1364bc018 100644
--- a/Documentation/config/promisor.txt
+++ b/Documentation/config/promisor.txt
@@ -12,9 +12,19 @@ promisor.advertise::
promisor.acceptFromServer::
If set to "all", a client will accept all the promisor remotes
a server might advertise using the "promisor-remote"
- capability. Default is "none", which means no promisor remote
- advertised by a server will be accepted. By accepting a
- promisor remote, the client agrees that the server might omit
- objects that are lazily fetchable from this promisor remote
- from its responses to "fetch" and "clone" requests from the
- client. See linkgit:gitprotocol-v2[5].
+ capability. If set to "knownName" the client will accept
+ promisor remotes which are already configured on the client
+ and have the same name as those advertised by the client. This
+ is not very secure, but could be used in a corporate setup
+ where servers and clients are trusted to not switch name and
+ URLs. If set to "knownUrl", the client will accept promisor
+ remotes which have both the same name and the same URL
+ configured on the client as the name and URL advertised by the
+ server. This is more secure than "all" or "knownUrl", so it
+ should be used if possible instead of those options. Default
+ is "none", which means no promisor remote advertised by a
+ server will be accepted. By accepting a promisor remote, the
+ client agrees that the server might omit objects that are
+ lazily fetchable from this promisor remote from its responses
+ to "fetch" and "clone" requests from the client. See
+ linkgit:gitprotocol-v2[5].
diff --git a/promisor-remote.c b/promisor-remote.c
index 5ac282ed27..790a96aa19 100644
--- a/promisor-remote.c
+++ b/promisor-remote.c
@@ -370,30 +370,73 @@ char *promisor_remote_info(struct repository *repo)
return strbuf_detach(&sb, NULL);
}
+/*
+ * Find first index of 'vec' where there is 'val'. 'val' is compared
+ * case insensively to the strings in 'vec'. If not found 'vec->nr' is
+ * returned.
+ */
+static size_t strvec_find_index(struct strvec *vec, const char *val)
+{
+ for (size_t i = 0; i < vec->nr; i++)
+ if (!strcasecmp(vec->v[i], val))
+ return i;
+ return vec->nr;
+}
+
enum accept_promisor {
ACCEPT_NONE = 0,
+ ACCEPT_KNOWN_URL,
+ ACCEPT_KNOWN_NAME,
ACCEPT_ALL
};
static int should_accept_remote(enum accept_promisor accept,
- const char *remote_name UNUSED,
- const char *remote_url UNUSED)
+ const char *remote_name, const char *remote_url,
+ struct strvec *names, struct strvec *urls)
{
+ size_t i;
+
if (accept == ACCEPT_ALL)
return 1;
- BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+ i = strvec_find_index(names, remote_name);
+
+ if (i >= names->nr)
+ /* We don't know about that remote */
+ return 0;
+
+ if (accept == ACCEPT_KNOWN_NAME)
+ return 1;
+
+ if (accept != ACCEPT_KNOWN_URL)
+ BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+
+ if (!strcasecmp(urls->v[i], remote_url))
+ return 1;
+
+ warning(_("known remote named '%s' but with url '%s' instead of '%s'"),
+ remote_name, urls->v[i], remote_url);
+
+ return 0;
}
-static void filter_promisor_remote(struct strvec *accepted, const char *info)
+static void filter_promisor_remote(struct repository *repo,
+ struct strvec *accepted,
+ const char *info)
{
struct strbuf **remotes;
const char *accept_str;
enum accept_promisor accept = ACCEPT_NONE;
+ struct strvec names = STRVEC_INIT;
+ struct strvec urls = STRVEC_INIT;
if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
accept = ACCEPT_NONE;
+ else if (!strcasecmp("KnownUrl", accept_str))
+ accept = ACCEPT_KNOWN_URL;
+ else if (!strcasecmp("KnownName", accept_str))
+ accept = ACCEPT_KNOWN_NAME;
else if (!strcasecmp("All", accept_str))
accept = ACCEPT_ALL;
else
@@ -404,6 +447,9 @@ static void filter_promisor_remote(struct strvec *accepted, const char *info)
if (accept == ACCEPT_NONE)
return;
+ if (accept != ACCEPT_ALL)
+ promisor_info_vecs(repo, &names, &urls);
+
/* Parse remote info received */
remotes = strbuf_split_str(info, ';', 0);
@@ -433,7 +479,7 @@ static void filter_promisor_remote(struct strvec *accepted, const char *info)
if (remote_url)
decoded_url = url_percent_decode(remote_url);
- if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url))
+ if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url, &names, &urls))
strvec_push(accepted, decoded_name);
strbuf_list_free(elems);
@@ -441,6 +487,8 @@ static void filter_promisor_remote(struct strvec *accepted, const char *info)
free(decoded_url);
}
+ strvec_clear(&names);
+ strvec_clear(&urls);
strbuf_list_free(remotes);
}
@@ -449,7 +497,7 @@ char *promisor_remote_reply(const char *info)
struct strvec accepted = STRVEC_INIT;
struct strbuf reply = STRBUF_INIT;
- filter_promisor_remote(&accepted, info);
+ filter_promisor_remote(the_repository, &accepted, info);
if (!accepted.nr)
return NULL;
diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
index 0390c1dbad..5bce99f5eb 100755
--- a/t/t5710-promisor-remote-capability.sh
+++ b/t/t5710-promisor-remote-capability.sh
@@ -160,6 +160,74 @@ test_expect_success "init + fetch with promisor.advertise set to 'true'" '
check_missing_objects server 1 "$oid"
'
+test_expect_success "clone with promisor.acceptfromserver set to 'KnownName'" '
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+ -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+ -c remote.server2.url="file://$(pwd)/server2" \
+ -c promisor.acceptfromserver=KnownName \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
+
+ # Check that the largest object is still missing on the server
+ check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with 'KnownName' and different remote names" '
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.serverTwo.promisor=true \
+ -c remote.serverTwo.fetch="+refs/heads/*:refs/remotes/server2/*" \
+ -c remote.serverTwo.url="file://$(pwd)/server2" \
+ -c promisor.acceptfromserver=KnownName \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
+
+ # Check that the largest object is not missing on the server
+ check_missing_objects server 0 "" &&
+
+ # Reinitialize server so that the largest object is missing again
+ initialize_server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.acceptfromserver set to 'KnownUrl'" '
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+ -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+ -c remote.server2.url="file://$(pwd)/server2" \
+ -c promisor.acceptfromserver=KnownUrl \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
+
+ # Check that the largest object is still missing on the server
+ check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with 'KnownUrl' and different remote urls" '
+ ln -s server2 serverTwo &&
+
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+ -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+ -c remote.server2.url="file://$(pwd)/serverTwo" \
+ -c promisor.acceptfromserver=KnownUrl \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
+
+ # Check that the largest object is not missing on the server
+ check_missing_objects server 0 "" &&
+
+ # Reinitialize server so that the largest object is missing again
+ initialize_server 1 "$oid"
+'
+
test_expect_success "clone with promisor.advertise set to 'true' but don't delete the client" '
git -C server config promisor.advertise true &&
--
2.46.0.rc0.95.gcbf174a634
^ permalink raw reply related [flat|nested] 110+ messages in thread
* Re: [PATCH v4 5/6] promisor-remote: check advertised name or URL
2025-01-27 15:17 ` [PATCH v4 5/6] promisor-remote: check advertised name or URL Christian Couder
@ 2025-01-27 23:48 ` Junio C Hamano
2025-01-28 0:01 ` Junio C Hamano
` (2 more replies)
0 siblings, 3 replies; 110+ messages in thread
From: Junio C Hamano @ 2025-01-27 23:48 UTC (permalink / raw)
To: Christian Couder
Cc: git, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
Randall S . Becker, Christian Couder
Christian Couder <christian.couder@gmail.com> writes:
> A previous commit introduced a "promisor.acceptFromServer" configuration
> variable with only "None" or "All" as valid values.
>
> Let's introduce "KnownName" and "KnownUrl" as valid values for this
> configuration option to give more choice to a client about which
> promisor remotes it might accept among those that the server advertised.
OK.
> promisor.acceptFromServer::
> If set to "all", a client will accept all the promisor remotes
> a server might advertise using the "promisor-remote"
> - capability. Default is "none", which means no promisor remote
> - advertised by a server will be accepted. By accepting a
> - promisor remote, the client agrees that the server might omit
> - objects that are lazily fetchable from this promisor remote
> - from its responses to "fetch" and "clone" requests from the
> - client. See linkgit:gitprotocol-v2[5].
> + capability. If set to "knownName" the client will accept
> + promisor remotes which are already configured on the client
> + and have the same name as those advertised by the client. This
> + is not very secure, but could be used in a corporate setup
> + where servers and clients are trusted to not switch name and
> + URLs.
I wonder if the reader needs to be told a bit more about the
security argument here. I imagine that the attack vector behind the
use of "secure" in the above paragraph is for a malicious server
that guesses a promisor remote name the client already uses, which
has a different URL from what the client expects to be associated
with the name, thereby such an acceptance means that the URL used in
future fetches would be replaced without the user's consent. Being
able to silently repoint the remote.origin.url at an evil repository
you control is indeed a powerful thing, I would guess. Of course,
in a corp environment, such a mechanism to drive the clients to a
new repository after upgrading or migrating may be extremely handy.
Or does the above paragraph assumes some other attack vectors,
perhaps?
> + If set to "knownUrl", the client will accept promisor
> + remotes which have both the same name and the same URL
> + configured on the client as the name and URL advertised by the
> + server. This is more secure than "all" or "knownUrl", so it
> + should be used if possible instead of those options. Default
> + is "none", which means no promisor remote advertised by a
> + server will be accepted.
OK.
> diff --git a/promisor-remote.c b/promisor-remote.c
> index 5ac282ed27..790a96aa19 100644
> --- a/promisor-remote.c
> +++ b/promisor-remote.c
> @@ -370,30 +370,73 @@ char *promisor_remote_info(struct repository *repo)
> return strbuf_detach(&sb, NULL);
> }
>
> +/*
> + * Find first index of 'vec' where there is 'val'. 'val' is compared
> + * case insensively to the strings in 'vec'. If not found 'vec->nr' is
> + * returned.
> + */
> +static size_t strvec_find_index(struct strvec *vec, const char *val)
> +{
> + for (size_t i = 0; i < vec->nr; i++)
> + if (!strcasecmp(vec->v[i], val))
> + return i;
> + return vec->nr;
> +}
Hmph, without the hardcoded strcasecmp(), strvec_find() might make a
fine public API in <strvec.h>.
Unless we intend to create a generic function that qualifies as a
part of the public strvec API, we shouldn't call it strvec_anything.
This is a great helper that finds a matching remote nickname from
list of remote nicknames, so
remote_nick_find(struct strvec *nicks, const char *nick)
may be more appropriate. When we lift it out of here and make it
more generic to move it to strvec.[ch], perhaps
size_t strvec_find(struct strvec *vec, void *needle,
int (*match)(const char *, void *)) {
for (size_t ix = 0; ix < vec->nr, ix++)
if (match(vec->v[ix], needle))
return ix;
return vec->nr;
}
which will be used to rewrite remote_nick_find() like so:
static int nicks_match(const char *nick, void *needle)
{
return !strcasecmp(nick, (conat char *)needle);
}
remote_hick_find(struct strvec *nicks, const char *nick)
{
return strvec_find(nicks, nick, nicks_match);
}
it would be better to use a more generic parameter name "vec", but
until then, it is better to be more specific and explicit about the
reason why the immediate callers call the function for, which is
where my "nicks" vs "nick" comes from (it is OK to call the latter
"needle", though).
> enum accept_promisor {
> ACCEPT_NONE = 0,
> + ACCEPT_KNOWN_URL,
> + ACCEPT_KNOWN_NAME,
> ACCEPT_ALL
> };
>
> static int should_accept_remote(enum accept_promisor accept,
> - const char *remote_name UNUSED,
> - const char *remote_url UNUSED)
> + const char *remote_name, const char *remote_url,
> + struct strvec *names, struct strvec *urls)
> {
> + size_t i;
> +
> if (accept == ACCEPT_ALL)
> return 1;
>
> - BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
> + i = strvec_find_index(names, remote_name);
> +
> + if (i >= names->nr)
> + /* We don't know about that remote */
> + return 0;
OK.
> + if (accept == ACCEPT_KNOWN_NAME)
> + return 1;
> +
> + if (accept != ACCEPT_KNOWN_URL)
> + BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
I can see why this defensiveness may be a good idea than not having
any, but I wonder if we can take advantage of compile time checks
some compilers have to ensure that case arms in a switch statement
are exhausitive?
> + if (!strcasecmp(urls->v[i], remote_url))
> + return 1;
This is iffy. The <schema>://<host>/ part might want to be compared
case insensitively, but the rest of the URL is generally case
sensitive (unless the material served is stored on a machine with
case-insensitive filesystem)?
Given that the existing URL must have come by either cloning from
this server or another related server or by an earlier
acceptFromServer behaviour, I do not see a need for being extra lax
here. We should be more careful about our use of case-insensitive
comparison, and I do not see how this URL comparison could be
something the end users would expect to be done case insensitively.
> -static void filter_promisor_remote(struct strvec *accepted, const char *info)
> +static void filter_promisor_remote(struct repository *repo,
> + struct strvec *accepted,
> + const char *info)
> {
> struct strbuf **remotes;
> const char *accept_str;
> enum accept_promisor accept = ACCEPT_NONE;
> + struct strvec names = STRVEC_INIT;
> + struct strvec urls = STRVEC_INIT;
>
> if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
> if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
Not a fault of this step, but is it sensible to even expect
!accept_str in an error case? *accept_str could be NUL, but
accept_str be either left uninitialized (because this caller does
not initialize it) when the get_string_tmp() returns non-zero, or
points at the internal cached value in the config_set if it returns
0 (and the control comes into this block).
> accept = ACCEPT_NONE;
> + else if (!strcasecmp("KnownUrl", accept_str))
> + accept = ACCEPT_KNOWN_URL;
> + else if (!strcasecmp("KnownName", accept_str))
> + accept = ACCEPT_KNOWN_NAME;
> else if (!strcasecmp("All", accept_str))
> accept = ACCEPT_ALL;
> else
Ditto about icase for all of the above.
> +test_expect_success "clone with 'KnownUrl' and different remote urls" '
> + ln -s server2 serverTwo &&
> +
> + git -C server config promisor.advertise true &&
> +
> + # Clone from server to create a client
> + GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
> + -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
> + -c remote.server2.url="file://$(pwd)/serverTwo" \
> + -c promisor.acceptfromserver=KnownUrl \
> + --no-local --filter="blob:limit=5k" server client &&
> + test_when_finished "rm -rf client" &&
> +
> + # Check that the largest object is not missing on the server
> + check_missing_objects server 0 "" &&
> +
> + # Reinitialize server so that the largest object is missing again
> + initialize_server 1 "$oid"
> +'
Nice ;-)
Here, I also notice that we are not testing that serverTwo and
servertwo are considered the same thanks to the use of icase
comparison. We shouldn't compare URLs with strcasecmp().
Thanks.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v4 5/6] promisor-remote: check advertised name or URL
2025-01-27 23:48 ` Junio C Hamano
@ 2025-01-28 0:01 ` Junio C Hamano
2025-01-30 10:51 ` Patrick Steinhardt
2025-02-18 11:42 ` Christian Couder
2 siblings, 0 replies; 110+ messages in thread
From: Junio C Hamano @ 2025-01-28 0:01 UTC (permalink / raw)
To: Christian Couder
Cc: git, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
Randall S . Becker, Christian Couder
Junio C Hamano <gitster@pobox.com> writes:
>> + if (!strcasecmp(urls->v[i], remote_url))
>> + return 1;
>
> This is iffy. The <schema>://<host>/ part might want to be compared
> case insensitively, but the rest of the URL is generally case
> sensitive (unless the material served is stored on a machine with
> case-insensitive filesystem)?
>
> Given that the existing URL must have come by either cloning from
> this server or another related server or by an earlier
> acceptFromServer behaviour, I do not see a need for being extra lax
> here. We should be more careful about our use of case-insensitive
> comparison, and I do not see how this URL comparison could be
> something the end users would expect to be done case insensitively.
Note that I am not advocating to compare the earlier part case
insensitively while comparing the remainder case sensitively.
Because we are not comparing URLs that come from random sources, but
we know they come from a only few very controlled sources (i.e., the
original server we cloned from, and the promisor remotes sugggested
by the original server and other promisor remotes whose suggestion
we accepted, recursively), it should be sufficient to compare the
whole string case sensitively.
Thanks.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v4 5/6] promisor-remote: check advertised name or URL
2025-01-27 23:48 ` Junio C Hamano
2025-01-28 0:01 ` Junio C Hamano
@ 2025-01-30 10:51 ` Patrick Steinhardt
2025-02-18 11:41 ` Christian Couder
2025-02-18 11:42 ` Christian Couder
2 siblings, 1 reply; 110+ messages in thread
From: Patrick Steinhardt @ 2025-01-30 10:51 UTC (permalink / raw)
To: Junio C Hamano
Cc: Christian Couder, git, Taylor Blau, Eric Sunshine, Karthik Nayak,
Kristoffer Haugsbakk, brian m . carlson, Randall S . Becker,
Christian Couder
On Mon, Jan 27, 2025 at 03:48:08PM -0800, Junio C Hamano wrote:
> Christian Couder <christian.couder@gmail.com> writes:
> > promisor.acceptFromServer::
> > If set to "all", a client will accept all the promisor remotes
> > a server might advertise using the "promisor-remote"
> > - capability. Default is "none", which means no promisor remote
> > - advertised by a server will be accepted. By accepting a
> > - promisor remote, the client agrees that the server might omit
> > - objects that are lazily fetchable from this promisor remote
> > - from its responses to "fetch" and "clone" requests from the
> > - client. See linkgit:gitprotocol-v2[5].
> > + capability. If set to "knownName" the client will accept
> > + promisor remotes which are already configured on the client
> > + and have the same name as those advertised by the client. This
> > + is not very secure, but could be used in a corporate setup
> > + where servers and clients are trusted to not switch name and
> > + URLs.
>
> I wonder if the reader needs to be told a bit more about the
> security argument here. I imagine that the attack vector behind the
> use of "secure" in the above paragraph is for a malicious server
> that guesses a promisor remote name the client already uses, which
> has a different URL from what the client expects to be associated
> with the name, thereby such an acceptance means that the URL used in
> future fetches would be replaced without the user's consent. Being
> able to silently repoint the remote.origin.url at an evil repository
> you control is indeed a powerful thing, I would guess. Of course,
> in a corp environment, such a mechanism to drive the clients to a
> new repository after upgrading or migrating may be extremely handy.
I'm still very hesitant about letting the server-side control remote
names at all, as I've already mentioned in previous review rounds. I
think that it opens up the client for a whole lot of issues that should
rather be avoided. Most importantly, it takes control away from the
user, as they are not free anymore to name the remotes however they want
to. It also casts into stone current behaviour because it is now part of
the protocol.
That being said, I get the point that it may make sense to be "agile"
regarding the promisor remotes. But I think we can achieve that without
having to compromise on either usability or security by using something
like a promisor ID instead.
Instead of announcing remote names, each announced promisor would have
an ID. This ID is opaque and merely used to identify the promisor after
the fact. It could for example be a UUID or something else that is
mostly unique.
The client will then create a promisor remote for each of the remote
names. The name of the promisor is derived from the remote name that it
is being created from. When there's a single promisor only it could for
example be called "origin-promisor". When there are multiple ones they
could be enumerated as "origin-promisor-1". In practice, we can even
roll the dice to generate the name, even though that may not be as user
friendly.
These names are _not_ used to identify the promisor. Instead, we also
write "remote.origin-promisor.id" and point it to the UUID that the
server has advertised. Furthermore, for each promisor that gets added in
this way, we'll also add "remote.origin.promisor" pointing to the
promisor name.
So on a subsequent fetch, we can now:
1. Look up all the promisors for the remote we're fetching from via
the "remote.origin.promisor" multivalue config.
2. For each promisor, we figure out whether its ID is still being
advertised by the remote server. If not, then it is a stale
promisor and we can optionally remove it.
3. If the promisor ID is still being announced we double check whether
the URL we have stored is still valid. If not, we can optionally
update it to point to the new URL.
This buys us a bunch of things:
- We have promisor agility and are easily able to update URLs and
prune out stale promisors.
- Promisors can be renamed by the user at will, as they are identified
by ID and not by remote name. We have to add logic to update the
"remote.*.promisor" links, but that should be doable.
- Each remote has its own set of promisors that cannot conflict with
one another.
From hereon, I'd also redesign "promisor.acceptFromServer" a bit:
- "new" allows newly announced promisor remotes.
- "update" allows updating existing promisor remotes.
- "prune" allows pruning existing promisor remotes.
All of that only applies to promisors connected to the current remote,
of course. Furthermore, the values may be combined arbitrarily with one
another, e.g. you can say "new,update" to only accept new or updated
remotes but not allow pruning, or "update,prune" to only allow updating
or pruning promisors without adding new ones.
I realize that this is a bit more work than what we currently have, but
I think that the design is significantly better than the proposed one.
From my point of view none of this really needs to be part of the
current patch series though, as these are all client-side changes in the
first place, and as far as I understand we don't have the client-side
ready yet anyway.
The only change required would be to adapt the protocol so that we don't
advertise a promisor names anymore, but instead promisor IDs.
Patrick
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v4 5/6] promisor-remote: check advertised name or URL
2025-01-30 10:51 ` Patrick Steinhardt
@ 2025-02-18 11:41 ` Christian Couder
0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-02-18 11:41 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: Junio C Hamano, git, Taylor Blau, Eric Sunshine, Karthik Nayak,
Kristoffer Haugsbakk, brian m . carlson, Randall S . Becker,
Christian Couder
On Thu, Jan 30, 2025 at 11:51 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Mon, Jan 27, 2025 at 03:48:08PM -0800, Junio C Hamano wrote:
> > I wonder if the reader needs to be told a bit more about the
> > security argument here. I imagine that the attack vector behind the
> > use of "secure" in the above paragraph is for a malicious server
> > that guesses a promisor remote name the client already uses, which
> > has a different URL from what the client expects to be associated
> > with the name, thereby such an acceptance means that the URL used in
> > future fetches would be replaced without the user's consent. Being
> > able to silently repoint the remote.origin.url at an evil repository
> > you control is indeed a powerful thing, I would guess. Of course,
> > in a corp environment, such a mechanism to drive the clients to a
> > new repository after upgrading or migrating may be extremely handy.
>
> I'm still very hesitant about letting the server-side control remote
> names at all, as I've already mentioned in previous review rounds. I
> think that it opens up the client for a whole lot of issues that should
> rather be avoided. Most importantly, it takes control away from the
> user, as they are not free anymore to name the remotes however they want
> to. It also casts into stone current behaviour because it is now part of
> the protocol.
The server-side doesn't control remote names at all in this series.
There is just a match or no match, depending on the value of
promisor.acceptFromServer on the client-side, between what the client
already has configured (for example using the clone -c option) and
what the server advertises.
> That being said, I get the point that it may make sense to be "agile"
> regarding the promisor remotes. But I think we can achieve that without
> having to compromise on either usability or security by using something
> like a promisor ID instead.
Thanks for the suggestion and the ideas, but I think that what you
suggest could be discussed and implemented as part of a follow up
patch series. This patch series implements basic checks with
information (name and URL) that already exists on the server side and
might also be available on the client side. For a number of use cases
it is likely enough, and it's also not very complex.
I would be fine with resending the series without this patch, if
that's what is prefered though.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v4 5/6] promisor-remote: check advertised name or URL
2025-01-27 23:48 ` Junio C Hamano
2025-01-28 0:01 ` Junio C Hamano
2025-01-30 10:51 ` Patrick Steinhardt
@ 2025-02-18 11:42 ` Christian Couder
2 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-02-18 11:42 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
Randall S . Becker, Christian Couder
On Tue, Jan 28, 2025 at 12:48 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > A previous commit introduced a "promisor.acceptFromServer" configuration
> > variable with only "None" or "All" as valid values.
> >
> > Let's introduce "KnownName" and "KnownUrl" as valid values for this
> > configuration option to give more choice to a client about which
> > promisor remotes it might accept among those that the server advertised.
>
> OK.
>
> > promisor.acceptFromServer::
> > If set to "all", a client will accept all the promisor remotes
> > a server might advertise using the "promisor-remote"
> > - capability. Default is "none", which means no promisor remote
> > - advertised by a server will be accepted. By accepting a
> > - promisor remote, the client agrees that the server might omit
> > - objects that are lazily fetchable from this promisor remote
> > - from its responses to "fetch" and "clone" requests from the
> > - client. See linkgit:gitprotocol-v2[5].
> > + capability. If set to "knownName" the client will accept
> > + promisor remotes which are already configured on the client
> > + and have the same name as those advertised by the client. This
> > + is not very secure, but could be used in a corporate setup
> > + where servers and clients are trusted to not switch name and
> > + URLs.
>
> I wonder if the reader needs to be told a bit more about the
> security argument here. I imagine that the attack vector behind the
> use of "secure" in the above paragraph is for a malicious server
> that guesses a promisor remote name the client already uses, which
> has a different URL from what the client expects to be associated
> with the name, thereby such an acceptance means that the URL used in
> future fetches would be replaced without the user's consent.
There is currently no mechanism for the URL to be replaced on the
client side by the one advertised by the server. The client will still
use the URL that has been configured in another way, likely the clone
`-c` option. But yeah it could lead to misunderstandings between the
client and the server. And if we later develop such a mechanism to
replace the URL on the client side, or to just temporarily use the one
advertised by the server, this could be a problem.
> Being
> able to silently repoint the remote.origin.url at an evil repository
> you control is indeed a powerful thing, I would guess. Of course,
> in a corp environment, such a mechanism to drive the clients to a
> new repository after upgrading or migrating may be extremely handy.
Yeah, that's why there are chances that such a mechanism will be
developed later, and we should take care of warning users even if
currently there are no real security risks.
> Or does the above paragraph assumes some other attack vectors,
> perhaps?
No, I don't see another attack vector.
> > + If set to "knownUrl", the client will accept promisor
> > + remotes which have both the same name and the same URL
> > + configured on the client as the name and URL advertised by the
> > + server. This is more secure than "all" or "knownUrl", so it
Here I see that it should be "knownName" instead of "knownUrl". I have
fixed this in the next version I will send soon.
> > + should be used if possible instead of those options. Default
> > + is "none", which means no promisor remote advertised by a
> > + server will be accepted.
>
> OK.
>
> > diff --git a/promisor-remote.c b/promisor-remote.c
> > index 5ac282ed27..790a96aa19 100644
> > --- a/promisor-remote.c
> > +++ b/promisor-remote.c
> > @@ -370,30 +370,73 @@ char *promisor_remote_info(struct repository *repo)
> > return strbuf_detach(&sb, NULL);
> > }
> >
> > +/*
> > + * Find first index of 'vec' where there is 'val'. 'val' is compared
> > + * case insensively to the strings in 'vec'. If not found 'vec->nr' is
I mean "insensitively" instead of "insensively". This is fixed in the
next version.
> > + * returned.
> > + */
> > +static size_t strvec_find_index(struct strvec *vec, const char *val)
> > +{
> > + for (size_t i = 0; i < vec->nr; i++)
> > + if (!strcasecmp(vec->v[i], val))
> > + return i;
> > + return vec->nr;
> > +}
>
> Hmph, without the hardcoded strcasecmp(), strvec_find() might make a
> fine public API in <strvec.h>.
Yeah, but I didn't find any other places in the code where a
strvec_find() function could be useful.
> Unless we intend to create a generic function that qualifies as a
> part of the public strvec API, we shouldn't call it strvec_anything.
> This is a great helper that finds a matching remote nickname from
> list of remote nicknames, so
>
> remote_nick_find(struct strvec *nicks, const char *nick)
>
> may be more appropriate.
Ok, I have renamed it remote_nick_find() in the next version.
> When we lift it out of here and make it
> more generic to move it to strvec.[ch], perhaps
>
> size_t strvec_find(struct strvec *vec, void *needle,
> int (*match)(const char *, void *)) {
> for (size_t ix = 0; ix < vec->nr, ix++)
> if (match(vec->v[ix], needle))
> return ix;
> return vec->nr;
> }
>
> which will be used to rewrite remote_nick_find() like so:
>
> static int nicks_match(const char *nick, void *needle)
> {
> return !strcasecmp(nick, (conat char *)needle);
> }
>
> remote_hick_find(struct strvec *nicks, const char *nick)
> {
> return strvec_find(nicks, nick, nicks_match);
> }
>
> it would be better to use a more generic parameter name "vec", but
> until then, it is better to be more specific and explicit about the
> reason why the immediate callers call the function for, which is
> where my "nicks" vs "nick" comes from (it is OK to call the latter
> "needle", though).
Yeah, I would be fine with this solution if there were other places
where strvec_find() could be useful.
> > enum accept_promisor {
> > ACCEPT_NONE = 0,
> > + ACCEPT_KNOWN_URL,
> > + ACCEPT_KNOWN_NAME,
> > ACCEPT_ALL
> > };
> >
> > static int should_accept_remote(enum accept_promisor accept,
> > - const char *remote_name UNUSED,
> > - const char *remote_url UNUSED)
> > + const char *remote_name, const char *remote_url,
> > + struct strvec *names, struct strvec *urls)
> > {
> > + size_t i;
> > +
> > if (accept == ACCEPT_ALL)
> > return 1;
> >
> > - BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
> > + i = strvec_find_index(names, remote_name);
> > +
> > + if (i >= names->nr)
> > + /* We don't know about that remote */
> > + return 0;
>
> OK.
>
> > + if (accept == ACCEPT_KNOWN_NAME)
> > + return 1;
> > +
> > + if (accept != ACCEPT_KNOWN_URL)
> > + BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
>
> I can see why this defensiveness may be a good idea than not having
> any, but I wonder if we can take advantage of compile time checks
> some compilers have to ensure that case arms in a switch statement
> are exhausitive?
Perhaps, but otherwise I am not sure that using a switch statement
would make the code better. The ACCEPT_KNOWN_NAME and ACCEPT_KNOWN_URL
cases need to share some code and the ACCEPT_NONE case seems better
handled by the caller.
> > + if (!strcasecmp(urls->v[i], remote_url))
> > + return 1;
>
> This is iffy. The <schema>://<host>/ part might want to be compared
> case insensitively, but the rest of the URL is generally case
> sensitive (unless the material served is stored on a machine with
> case-insensitive filesystem)?
I am fine with comparing the whole URL case sensitively. So
"strcasecmp()" is replaced with "strcmp()" in the next version.
> Given that the existing URL must have come by either cloning from
> this server or another related server or by an earlier
> acceptFromServer behaviour, I do not see a need for being extra lax
> here. We should be more careful about our use of case-insensitive
> comparison, and I do not see how this URL comparison could be
> something the end users would expect to be done case insensitively.
In another email you also said:
> Note that I am not advocating to compare the earlier part case
> insensitively while comparing the remainder case sensitively.
>
> Because we are not comparing URLs that come from random sources, but
> we know they come from a only few very controlled sources (i.e., the
> original server we cloned from, and the promisor remotes sugggested
> by the original server and other promisor remotes whose suggestion
> we accepted, recursively), it should be sufficient to compare the
> whole string case sensitively.
When I implemented this, I was just thinking that some users might for
example spell the scheme part "HTTPS" in their client config and then
complain that it should work when the server advertises the same URL
with "https" instead of "HTTPS", because yeah the <schema>://<host>/
part should be case insensitive. But I agree we can start with
everything being case sensitive and improve on this (likely by
comparing the <schema>://<host>/ part case insensitively and the rest
case sensitively) if/when users complain.
> > -static void filter_promisor_remote(struct strvec *accepted, const char *info)
> > +static void filter_promisor_remote(struct repository *repo,
> > + struct strvec *accepted,
> > + const char *info)
> > {
> > struct strbuf **remotes;
> > const char *accept_str;
> > enum accept_promisor accept = ACCEPT_NONE;
> > + struct strvec names = STRVEC_INIT;
> > + struct strvec urls = STRVEC_INIT;
> >
> > if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
> > if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
>
> Not a fault of this step, but is it sensible to even expect
> !accept_str in an error case? *accept_str could be NUL, but
> accept_str be either left uninitialized (because this caller does
> not initialize it) when the get_string_tmp() returns non-zero, or
> points at the internal cached value in the config_set if it returns
> 0 (and the control comes into this block).
Yeah, I agree accept_str cannot be NULL here. I have removed
"!accept_str || " in the next version.
> > accept = ACCEPT_NONE;
> > + else if (!strcasecmp("KnownUrl", accept_str))
> > + accept = ACCEPT_KNOWN_URL;
> > + else if (!strcasecmp("KnownName", accept_str))
> > + accept = ACCEPT_KNOWN_NAME;
> > else if (!strcasecmp("All", accept_str))
> > accept = ACCEPT_ALL;
> > else
>
> Ditto about icase for all of the above.
These are config values that can take only a specific set of values. I
think those are most often compared case insensitively in Git, for
example there is no distinction between "True" and "true" for bool
values. So I am not sure what you suggest here.
> > +test_expect_success "clone with 'KnownUrl' and different remote urls" '
> > + ln -s server2 serverTwo &&
> > +
> > + git -C server config promisor.advertise true &&
> > +
> > + # Clone from server to create a client
> > + GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
> > + -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
> > + -c remote.server2.url="file://$(pwd)/serverTwo" \
> > + -c promisor.acceptfromserver=KnownUrl \
> > + --no-local --filter="blob:limit=5k" server client &&
> > + test_when_finished "rm -rf client" &&
> > +
> > + # Check that the largest object is not missing on the server
> > + check_missing_objects server 0 "" &&
> > +
> > + # Reinitialize server so that the largest object is missing again
> > + initialize_server 1 "$oid"
> > +'
>
> Nice ;-)
>
> Here, I also notice that we are not testing that serverTwo and
> servertwo are considered the same thanks to the use of icase
> comparison. We shouldn't compare URLs with strcasecmp().
Ok, thanks.
^ permalink raw reply [flat|nested] 110+ messages in thread
* [PATCH v4 6/6] doc: add technical design doc for large object promisors
2025-01-27 15:16 ` [PATCH v4 0/6] " Christian Couder
` (4 preceding siblings ...)
2025-01-27 15:17 ` [PATCH v4 5/6] promisor-remote: check advertised name or URL Christian Couder
@ 2025-01-27 15:17 ` Christian Couder
2025-01-27 21:14 ` [PATCH v4 0/6] Introduce a "promisor-remote" capability Junio C Hamano
2025-02-18 11:32 ` [PATCH v5 0/3] " Christian Couder
7 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:17 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
Randall S . Becker, Christian Couder, Christian Couder
Let's add a design doc about how we could improve handling liarge blobs
using "Large Object Promisors" (LOPs). It's a set of features with the
goal of using special dedicated promisor remotes to store large blobs,
and having them accessed directly by main remotes and clients.
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
.../technical/large-object-promisors.txt | 640 ++++++++++++++++++
1 file changed, 640 insertions(+)
create mode 100644 Documentation/technical/large-object-promisors.txt
diff --git a/Documentation/technical/large-object-promisors.txt b/Documentation/technical/large-object-promisors.txt
new file mode 100644
index 0000000000..1984f11a55
--- /dev/null
+++ b/Documentation/technical/large-object-promisors.txt
@@ -0,0 +1,640 @@
+Large Object Promisors
+======================
+
+Since Git has been created, users have been complaining about issues
+with storing large files in Git. Some solutions have been created to
+help, but they haven't helped much with some issues.
+
+Git currently supports multiple promisor remotes, which could help
+with some of these remaining issues, but it's very hard to use them to
+help, because a number of important features are missing.
+
+The goal of the effort described in this document is to add these
+important features.
+
+We will call a "Large Object Promisor", or "LOP" in short, a promisor
+remote which is used to store only large blobs and which is separate
+from the main remote that should store the other Git objects and the
+rest of the repos.
+
+By extension, we will also call "Large Object Promisor", or LOP, the
+effort described in this document to add a set of features to make it
+easier to handle large blobs/files in Git by using LOPs.
+
+This effort aims to especially improve things on the server side, and
+especially for large blobs that are already compressed in a binary
+format.
+
+This effort aims to provide an alternative to Git LFS
+(https://git-lfs.com/) and similar tools like git-annex
+(https://git-annex.branchable.com/) for handling large files, even
+though a complete alternative would very likely require other efforts
+especially on the client side, where it would likely help to implement
+a new object representation for large blobs as discussed in:
+
+https://lore.kernel.org/git/xmqqbkdometi.fsf@gitster.g/
+
+0) Non goals
+------------
+
+- We will not discuss those client side improvements here, as they
+ would require changes in different parts of Git than this effort.
++
+So we don't pretend to fully replace Git LFS with only this effort,
+but we nevertheless believe that it can significantly improve the
+current situation on the server side, and that other separate
+efforts could also improve the situation on the client side.
+
+- In the same way, we are not going to discuss all the possible ways
+ to implement a LOP or their underlying object storage, or to
+ optimize how LOP works.
++
+Our opinion is that the simplest solution for now is for LOPs to use
+object storage through a remote helper (see section II.2 below for
+more details) to store their objects. So we consider that this is the
+default implementation. If there are improvements on top of this,
+that's great, but our opinion is that such improvements are not
+necessary for LOPs to already be useful. Such improvements are likely
+a different technical topic, and can be taken care of separately
+anyway.
++
+So in particular we are not going to discuss pluggable ODBs or other
+object database backends that could chunk large blobs, dedup the
+chunks and store them efficiently. Sure, that would be a nice
+improvement to store large blobs on the server side, but we believe
+it can just be a separate effort as it's also not technically very
+related to this effort.
++
+We are also not going to discuss data transfer improvements between
+LOPs and clients or servers. Sure, there might be some easy and very
+effective optimizations there (as we know that objects on LOPs are
+very likely incompressible and not deltifying well), but this can be
+dealt with separately in a separate effort.
+
+In other words, the goal of this document is not to talk about all the
+possible ways to optimize how Git could handle large blobs, but to
+describe how a LOP based solution can already work well and alleviate
+a number of current issues in the context of Git clients and servers
+sharing Git objects.
+
+I) Issues with the current situation
+------------------------------------
+
+- Some statistics made on GitLab repos have shown that more than 75%
+ of the disk space is used by blobs that are larger than 1MB and
+ often in a binary format.
+
+- So even if users could use Git LFS or similar tools to store a lot
+ of large blobs out of their repos, it's a fact that in practice they
+ don't do it as much as they probably should.
+
+- On the server side ideally, the server should be able to decide for
+ itself how it stores things. It should not depend on users deciding
+ to use tools like Git LFS on some blobs or not.
+
+- It's much more expensive to store large blobs that don't delta
+ compress well on regular fast seeking drives (like SSDs) than on
+ object storage (like Amazon S3 or GCP Buckets). Using fast drives
+ for regular Git repos makes sense though, as serving regular Git
+ content (blobs containing text or code) needs drives where seeking
+ is fast, but the content is relatively small. On the other hand,
+ object storage for Git LFS blobs makes sense as seeking speed is not
+ as important when dealing with large files, while costs are more
+ important. So the fact that users don't use Git LFS or similar tools
+ for a significant number of large blobs has likely some bad
+ consequences on the cost of repo storage for most Git hosting
+ platforms.
+
+- Having large blobs handled in the same way as other blobs and Git
+ objects in Git repos instead of on object storage also has a cost in
+ increased memory and CPU usage, and therefore decreased performance,
+ when creating packfiles. (This is because Git tries to use delta
+ compression or zlib compression which is unlikely to work well on
+ already compressed binary content.) So it's not just a storage cost
+ increase.
+
+- When a large blob has been committed into a repo, it might not be
+ possible to remove this blob from the repo without rewriting
+ history, even if the user then decides to use Git LFS or a similar
+ tool to handle it.
+
+- In fact Git LFS and similar tools are not very flexible in letting
+ users change their minds about the blobs they should handle or not.
+
+- Even when users are using Git LFS or similar tools, they are often
+ complaining that these tools require significant effort to set up,
+ learn and use correctly.
+
+II) Main features of the "Large Object Promisors" solution
+----------------------------------------------------------
+
+The main features below should give a rough overview of how the
+solution may work. Details about needed elements can be found in
+following sections.
+
+Even if each feature below is very useful for the full solution, it is
+very likely to be also useful on its own in some cases where the full
+solution is not required. However, we'll focus primarily on the big
+picture here.
+
+Also each feature doesn't need to be implemented entirely in Git
+itself. Some could be scripts, hooks or helpers that are not part of
+the Git repo. It would be helpful if those could be shared and
+improved on collaboratively though. So we want to encourage sharing
+them.
+
+1) Large blobs are stored on LOPs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Large blobs should be stored on special promisor remotes that we will
+call "Large Object Promisors" or LOPs. These LOPs should be additional
+remotes dedicated to contain large blobs especially those in binary
+format. They should be used along with main remotes that contain the
+other objects.
+
+Note 1
+++++++
+
+To clarify, a LOP is a normal promisor remote, except that:
+
+- it should store only large blobs,
+
+- it should be separate from the main remote, so that the main remote
+ can focus on serving other objects and the rest of the repos (see
+ feature 4) below) and can use the LOP as a promisor remote for
+ itself.
+
+Note 2
+++++++
+
+Git already makes it possible for a main remote to also be a promisor
+remote storing both regular objects and large blobs for a client that
+clones from it with a filter on blob size. But here we explicitly want
+to avoid that.
+
+Rationale
++++++++++
+
+LOPs aim to be good at handling large blobs while main remotes are
+already good at handling other objects.
+
+Implementation
+++++++++++++++
+
+Git already has support for multiple promisor remotes, see
+link:partial-clone.html#using-many-promisor-remotes[the partial clone documentation].
+
+Also, Git already has support for partial clone using a filter on the
+size of the blobs (with `git clone --filter=blob:limit=<size>`). Most
+of the other main features below are based on these existing features
+and are about making them easy and efficient to use for the purpose of
+better handling large blobs.
+
+2) LOPs can use object storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+LOPs can be implemented using object storage, like an Amazon S3 or GCP
+Bucket or MinIO (which is open source under the GNU AGPLv3 license) to
+actually store the large blobs, and can be accessed through a Git
+remote helper (see linkgit:gitremote-helpers[7]) which makes the
+underlying object storage appear like a remote to Git.
+
+Note
+++++
+
+A LOP can be a promisor remote accessed using a remote helper by
+both some clients and the main remote.
+
+Rationale
++++++++++
+
+This looks like the simplest way to create LOPs that can cheaply
+handle many large blobs.
+
+Implementation
+++++++++++++++
+
+Remote helpers are quite easy to write as shell scripts, but it might
+be more efficient and maintainable to write them using other languages
+like Go.
+
+Some already exist under open source licenses, for example:
+
+ - https://github.com/awslabs/git-remote-s3
+ - https://gitlab.com/eric.p.ju/git-remote-gs
+
+Other ways to implement LOPs are certainly possible, but the goal of
+this document is not to discuss how to best implement a LOP or its
+underlying object storage (see the "0) Non goals" section above).
+
+3) LOP object storage can be Git LFS storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The underlying object storage that a LOP uses could also serve as
+storage for large files handled by Git LFS.
+
+Rationale
++++++++++
+
+This would simplify the server side if it wants to both use a LOP and
+act as a Git LFS server.
+
+4) A main remote can offload to a LOP with a configurable threshold
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+On the server side, a main remote should have a way to offload to a
+LOP all its blobs with a size over a configurable threshold.
+
+Rationale
++++++++++
+
+This makes it easy to set things up and to clean things up. For
+example, an admin could use this to manually convert a repo not using
+LOPs to a repo using a LOP. On a repo already using a LOP but where
+some users would sometimes push large blobs, a cron job could use this
+to regularly make sure the large blobs are moved to the LOP.
+
+Implementation
+++++++++++++++
+
+Using something based on `git repack --filter=...` to separate the
+blobs we want to offload from the other Git objects could be a good
+idea. The missing part is to connect to the LOP, check if the blobs we
+want to offload are already there and if not send them.
+
+5) A main remote should try to remain clean from large blobs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A main remote should try to avoid containing a lot of oversize
+blobs. For that purpose, it should offload as needed to a LOP and it
+should have ways to prevent oversize blobs to be fetched, and also
+perhaps pushed, into it.
+
+Rationale
++++++++++
+
+A main remote containing many oversize blobs would defeat the purpose
+of LOPs.
+
+Implementation
+++++++++++++++
+
+The way to offload to a LOP discussed in 4) above can be used to
+regularly offload oversize blobs. About preventing oversize blobs from
+being fetched into the repo see 6) below. About preventing oversize
+blob pushes, a pre-receive hook could be used.
+
+Also there are different scenarios in which large blobs could get
+fetched into the main remote, for example:
+
+- A client that doesn't implement the "promisor-remote" protocol
+ (described in 6) below) clones from the main remote.
+
+- The main remote gets a request for information about a large blob
+ and is not able to get that information without fetching the blob
+ from the LOP.
+
+It might not be possible to completely prevent all these scenarios
+from happening. So the goal here should be to implement features that
+make the fetching of large blobs less likely. For example adding a
+`remote-object-info` command in the `git cat-file --batch` protocol
+and its variants might make it possible for a main repo to respond to
+some requests about large blobs without fetching them.
+
+6) A protocol negotiation should happen when a client clones
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a client clones from a main repo, there should be a protocol
+negotiation so that the server can advertise one or more LOPs and so
+that the client and the server can discuss if the client could
+directly use a LOP the server is advertising. If the client and the
+server can agree on that, then the client would be able to get the
+large blobs directly from the LOP and the server would not need to
+fetch those blobs from the LOP to be able to serve the client.
+
+Note
+++++
+
+For fetches instead of clones, a protocol negotiation might not always
+happen, see the "What about fetches?" FAQ entry below for details.
+
+Rationale
++++++++++
+
+Security, configurability and efficiency of setting things up.
+
+Implementation
+++++++++++++++
+
+A "promisor-remote" protocol v2 capability looks like a good way to
+implement this. The way the client and server use this capability
+could be controlled by configuration variables.
+
+Information that the server could send to the client through that
+protocol could be things like: LOP name, LOP URL, filter-spec (for
+example `blob:limit=<size>`) or just size limit that should be used as
+a filter when cloning, token to be used with the LOP, etc.
+
+7) A client can offload to a LOP
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a client is using a LOP that is also a LOP of its main remote,
+the client should be able to offload some large blobs it has fetched,
+but might not need anymore, to the LOP.
+
+Note
+++++
+
+It might depend on the context if it should be OK or not for clients
+to offload large blobs they have created, instead of fetched, directly
+to the LOP without the main remote checking them in some ways
+(possibly using hooks or other tools).
+
+Rationale
++++++++++
+
+On the client, the easiest way to deal with unneeded large blobs is to
+offload them.
+
+Implementation
+++++++++++++++
+
+This is very similar to what 4) above is about, except on the client
+side instead of the server side. So a good solution to 4) could likely
+be adapted to work on the client side too.
+
+There might be some security issues here, as there is no negotiation,
+but they might be mitigated if the client can reuse a token it got
+when cloning (see 6) above). Also if the large blobs were fetched from
+a LOP, it is likely, and can easily be confirmed, that the LOP still
+has them, so that they can just be removed from the client.
+
+III) Benefits of using LOPs
+---------------------------
+
+Many benefits are related to the issues discussed in "I) Issues with
+the current situation" above:
+
+- No need to rewrite history when deciding which blobs are worth
+ handling separately than other objects, or when moving or removing
+ the threshold.
+
+- If the protocol between client and server is developed and secured
+ enough, then many details might be setup on the server side only and
+ all the clients could then easily get all the configuration
+ information and use it to set themselves up mostly automatically.
+
+- Storage costs benefits on the server side.
+
+- Reduced memory and CPU needs on main remotes on the server side.
+
+- Reduced storage needs on the client side.
+
+IV) FAQ
+-------
+
+What about using multiple LOPs on the server and client side?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+That could perhaps be useful in some cases, but for now it's more
+likely that in most cases a single LOP will be advertised by the
+server and should be used by the client.
+
+A case where it could be useful for a server to advertise multiple
+LOPs is if a LOP is better for some users while a different LOP is
+better for other users. For example some clients might have a better
+connection to a LOP than others.
+
+In those cases it's the responsibility of the server to have some
+documentation to help clients. It could say for example something like
+"Users in this part of the world might want to pick only LOP A as it
+is likely to be better connected to them, while users in other parts
+of the world should pick only LOP B for the same reason."
+
+When should we trust or not trust the LOPs advertised by the server?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In some contexts, like in corporate setup where the server and all the
+clients are parts of an internal network in a company where admins
+have all the rights on every system, it's OK, and perhaps even a good
+thing, if the clients fully trust the server, as it can help ensure
+that all the clients are on the same page.
+
+There are also contexts in which clients trust a code hosting platform
+serving them some repos, but might not fully trust other users
+managing or contributing to some of these repos. For example, the code
+hosting platform could have hooks in place to check that any object it
+receives doesn't contain malware or otherwise bad content. In this
+case it might be OK for the client to use a main remote and its LOP if
+they are both hosted by the code hosting platform, but not if the LOP
+is hosted elsewhere (where the content is not checked).
+
+In other contexts, a client should just not trust a server.
+
+So there should be different ways to configure how the client should
+behave when a server advertises a LOP to it at clone time.
+
+As the basic elements that a server can advertise about a LOP are a
+LOP name and a LOP URL, the client should base its decision about
+accepting a LOP on these elements.
+
+One simple way to be very strict in the LOP it accepts is for example
+for the client to check that the LOP is already configured on the
+client with the same name and URL as what the server advertises.
+
+In general default and "safe" settings should require that the LOP are
+configured on the client separately from the "promisor-remote"
+protocol and that the client accepts a LOP only when information about
+it from the protocol matches what has been already configured
+separately.
+
+What about LOP names?
+~~~~~~~~~~~~~~~~~~~~~
+
+In some contexts, for example if the clients sometimes fetch from each
+other, it can be a good idea for all the clients to use the same names
+for all the remotes they use, including LOPs.
+
+In other contexts, each client might want to be able to give the name
+it wants to each remote, including each LOP, it interacts with.
+
+So there should be different ways to configure how the client accepts
+or not the LOP name the server advertises.
+
+If a default or "safe" setting is used, then as such a setting should
+require that the LOP be configured separately, then the name would be
+configured separately and there is no risk that the server could
+dictate a name to a client.
+
+Could the main remote be bogged down by old or paranoid clients?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Yes, it could happen if there are too many clients that are either
+unwilling to trust the main remote or that just don't implement the
+"promisor-remote" protocol because they are too old or not fully
+compatible with the 'git' client.
+
+When serving such a client, the main remote has no other choice than
+to first fetch from its LOP, to then be able to provide to the client
+everything it requested. So the main remote, even if it has cleanup
+mechanisms (see section II.4 above), would be burdened at least
+temporarily with the large blobs it had to fetch from its LOP.
+
+Not behaving like this would be breaking backward compatibility, and
+could be seen as segregating clients. For example, it might be
+possible to implement a special mode that allows the server to just
+reject clients that don't implement the "promisor-remote" protocol or
+aren't willing to trust the main remote. This mode might be useful in
+a special context like a corporate environment. There is no plan to
+implement such a mode though, and this should be discussed separately
+later anyway.
+
+A better way to proceed is probably for the main remote to show a
+message telling clients that don't implement the protocol or are
+unwilling to accept the advertised LOP(s) that they would get faster
+clone and fetches by upgrading client software or properly setting
+them up to accept LOP(s).
+
+Waiting for clients to upgrade, monitoring these upgrades and limiting
+the use of LOPs to repos that are not very frequently accessed might
+be other good ways to make sure that some benefits are still reaped
+from LOPs. Over time, as more and more clients upgrade and benefit
+from LOPs, using them in more and more frequently accessed repos will
+become worth it.
+
+Corporate environments, where it might be easier to make sure that all
+the clients are up-to-date and properly configured, could hopefully
+benefit more and earlier from using LOPs.
+
+What about fetches?
+~~~~~~~~~~~~~~~~~~~
+
+There are different kinds of fetches. A regular fetch happens when
+some refs have been updated on the server and the client wants the ref
+updates and possibly the new objects added with them. A "backfill" or
+"lazy" fetch, on the contrary, happens when the client needs to use
+some objects it already knows about but doesn't have because they are
+on a promisor remote.
+
+Regular fetch
++++++++++++++
+
+In a regular fetch, the client will contact the main remote and a
+protocol negotiation will happen between them. It's a good thing that
+a protocol negotiation happens every time, as the configuration on the
+client or the main remote could have changed since the previous
+protocol negotiation. In this case, the new protocol negotiation
+should ensure that the new fetch will happen in a way that satisfies
+the new configuration of both the client and the server.
+
+In most cases though, the configurations on the client and the main
+remote will not have changed between 2 fetches or between the initial
+clone and a subsequent fetch. This means that the result of a new
+protocol negotiation will be the same as the previous result, so the
+new fetch will happen in the same way as the previous clone or fetch,
+using, or not using, the same LOP(s) as last time.
+
+"Backfill" or "lazy" fetch
+++++++++++++++++++++++++++
+
+When there is a backfill fetch, the client doesn't necessarily contact
+the main remote first. It will try to fetch from its promisor remotes
+in the order they appear in the config file, except that a remote
+configured using the `extensions.partialClone` config variable will be
+tried last. See
+link:partial-clone.html#using-many-promisor-remotes[the partial clone documentation].
+
+This is not new with this effort. In fact this is how multiple remotes
+have already been working for around 5 years.
+
+When using LOPs, having the main remote configured using
+`extensions.partialClone`, so it's tried last, makes sense, as missing
+objects should only be large blobs that are on LOPs.
+
+This means that a protocol negotiation will likely not happen as the
+missing objects will be fetched from the LOPs, and then there will be
+nothing left to fetch from the main remote.
+
+To secure that, it could be a good idea for LOPs to require a token
+from the client when it fetches from them. The client could get the
+token when performing a protocol negotiation with the main remote (see
+section II.6 above).
+
+V) Future improvements
+----------------------
+
+It is expected that at the beginning using LOPs will be mostly worth
+it either in a corporate context where the Git version that clients
+use can easily be controlled, or on repos that are infrequently
+accessed. (See the "Could the main remote be bogged down by old or
+paranoid clients?" section in the FAQ above.)
+
+Over time, as more and more clients upgrade to a version that
+implements the "promisor-remote" protocol v2 capability described
+above in section II.6), it will be worth it to use LOPs more widely.
+
+A lot of improvements may also help using LOPs more widely. Some of
+these improvements are part of the scope of this document like the
+following:
+
+ - Implementing a "remote-object-info" command in the
+ `git cat-file --batch` protocol and its variants to allow main
+ remotes to respond to requests about large blobs without fetching
+ them. (Eric Ju has started working on this based on previous work
+ by Calvin Wan.)
+
+ - Creating better cleanup and offload mechanisms for main remotes
+ and clients to prevent accumulation of large blobs.
+
+ - Developing more sophisticated protocol negotiation capabilities
+ between clients and servers for handling LOPs, for example adding
+ a filter-spec (e.g., blob:limit=<size>) or size limit for
+ filtering when cloning, or adding a token for LOP authentication.
+
+ - Improving security measures for LOP access, particularly around
+ token handling and authentication.
+
+ - Developing standardized ways to configure and manage multiple LOPs
+ across different environments. Especially in the case where
+ different LOPs serve the same content to clients in different
+ geographical locations, there is a need for replication or
+ synchronization between LOPs.
+
+Some improvements, including some that have been mentioned in the "0)
+Non Goals" section of this document, are out of the scope of this
+document:
+
+ - Implementing a new object representation for large blobs on the
+ client side.
+
+ - Developing pluggable ODBs or other object database backends that
+ could chunk large blobs, dedup the chunks and store them
+ efficiently.
+
+ - Optimizing data transfer between LOPs and clients/servers,
+ particularly for incompressible and non-deltifying content.
+
+ - Creating improved client side tools for managing large objects
+ more effectively, for example tools for migrating from Git LFS or
+ git-annex, or tools to find which objects could be offloaded and
+ how much disk space could be reclaimed by offloading them.
+
+Some improvements could be seen as part of the scope of this document,
+but might already have their own separate projects from the Git
+project, like:
+
+ - Improving existing remote helpers to access object storage or
+ developing new ones.
+
+ - Improving existing object storage solutions or developing new
+ ones.
+
+Even though all the above improvements may help, this document and the
+LOP effort should try to focus, at least first, on a relatively small
+number of improvements mostly those that are in its current scope.
+
+For example introducing pluggable ODBs and a new object database
+backend is likely a multi-year effort on its own that can happen
+separately in parallel. It has different technical requirements,
+touches other part of the Git code base and should have its own design
+document(s).
--
2.46.0.rc0.95.gcbf174a634
^ permalink raw reply related [flat|nested] 110+ messages in thread
* Re: [PATCH v4 0/6] Introduce a "promisor-remote" capability
2025-01-27 15:16 ` [PATCH v4 0/6] " Christian Couder
` (5 preceding siblings ...)
2025-01-27 15:17 ` [PATCH v4 6/6] doc: add technical design doc for large object promisors Christian Couder
@ 2025-01-27 21:14 ` Junio C Hamano
2025-02-18 11:40 ` Christian Couder
2025-02-18 11:32 ` [PATCH v5 0/3] " Christian Couder
7 siblings, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2025-01-27 21:14 UTC (permalink / raw)
To: Christian Couder
Cc: git, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
Randall S . Becker
Christian Couder <christian.couder@gmail.com> writes:
> This work is part of some effort to better handle large files/blobs in
> a client-server context using promisor remotes dedicated to storing
> large blobs. To help understand this effort, this series now contains
> a patch (patch 6/6) that adds design documentation about this effort.
>
> Last year, I sent 3 versions of a patch series with the goal of
> allowing a client C to clone from a server S while using the same
> promisor remote X that S already use. See:
>
> https://lore.kernel.org/git/20240418184043.2900955-1-christian.couder@gmail.com/
>
> Junio suggested to implement that feature using:
>
> "a protocol extension that lets S tell C that S wants C to fetch
> missing objects from X (which means that if C knows about X in its
> ".git/config" then there is no need for end-user interaction at all),
> or a protocol extension that C tells S that C is willing to see
> objects available from X omitted when S does not have them (again,
> this could be done by looking at ".git/config" at C, but there may be
> security implications???)"
>
> This patch series implements that protocol extension called
> "promisor-remote" (that name is open to change or simplification)
> which allows S and C to agree on C using X directly or not.
>
> I have tried to implement it in a quite generic way that could allow S
> and C to share more information about promisor remotes and how to use
> them.
>
> For now, C doesn't use the information it gets from S when cloning.
> That information is only used to decide if C is OK to use the promisor
> remotes advertised by S. But this could change in the future which
> could make it much simpler for clients than using the current way of
> passing information about X with the `-c` option of `git clone` many
> times on the command line.
>
> Another improvement could be to not require GIT_NO_LAZY_FETCH=0 when S
> and C have agreed on using S.
>
> Changes compared to version 3
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> - Patches 1/6 and 2/6 are new in this series. They come from the
> patch series Usman Akinyemi is working on
> (https://lore.kernel.org/git/20250124122217.250925-1-usmanakinyemi202@gmail.com/).
> We need a similar redact_non_printables() function as the one he
> has been working on in his patch series, so it's just simpler to
> reuse his patches related to this function, and to build on top of
> them.
Two topics in flight, neither of which hit 'next', sharing a handful
of patches is cumbersome to keep track of. Typically our strategy
dealing with such a situation has been for these topics to halt and
have the authors work together to help the common part solidify a
bit better before continuing. Otherwise, every time any one of the
topics that share the same early parts of the series needs to change
them even a bit, it would result in a huge rebase chaos, and worse
yet, even if the two (or more) topics share the need for these two
early parts, they may have different dependency requirements (e.g.
this may be OK with these two early patches directly applied on
'maint', while the other topic may need to have these two early
patches on 'master').
I think [3/6] falls into the same category as [1/6] and [2/6], that
is, to lay foundation of the remainder?
> - In patch 4/6, the commit message has been improved:
> - In patch 4/6, there are also some code changes:
> - In patch 4/6, there is also a small change in the tests.
All good changes.
Will queue, but we should find a better way to manage the "an
earlier part is shared across multiple topics" situation.
Thanks.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v4 0/6] Introduce a "promisor-remote" capability
2025-01-27 21:14 ` [PATCH v4 0/6] Introduce a "promisor-remote" capability Junio C Hamano
@ 2025-02-18 11:40 ` Christian Couder
0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-02-18 11:40 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
Randall S . Becker
On Mon, Jan 27, 2025 at 10:14 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
> > - Patches 1/6 and 2/6 are new in this series. They come from the
> > patch series Usman Akinyemi is working on
> > (https://lore.kernel.org/git/20250124122217.250925-1-usmanakinyemi202@gmail.com/).
> > We need a similar redact_non_printables() function as the one he
> > has been working on in his patch series, so it's just simpler to
> > reuse his patches related to this function, and to build on top of
> > them.
>
> Two topics in flight, neither of which hit 'next', sharing a handful
> of patches is cumbersome to keep track of. Typically our strategy
> dealing with such a situation has been for these topics to halt and
> have the authors work together to help the common part solidify a
> bit better before continuing. Otherwise, every time any one of the
> topics that share the same early parts of the series needs to change
> them even a bit, it would result in a huge rebase chaos, and worse
> yet, even if the two (or more) topics share the need for these two
> early parts, they may have different dependency requirements (e.g.
> this may be OK with these two early patches directly applied on
> 'maint', while the other topic may need to have these two early
> patches on 'master').
>
> I think [3/6] falls into the same category as [1/6] and [2/6], that
> is, to lay foundation of the remainder?
Yeah, but patches 1/6, 2/6 and 3/6 are removed in the next version,
thanks to a comment by Patrick...
> > - In patch 4/6, the commit message has been improved:
> > - In patch 4/6, there are also some code changes:
> > - In patch 4/6, there is also a small change in the tests.
>
> All good changes.
>
> Will queue, but we should find a better way to manage the "an
> earlier part is shared across multiple topics" situation.
... so no problem anymore with this earlier part.
Thanks!
^ permalink raw reply [flat|nested] 110+ messages in thread
* [PATCH v5 0/3] Introduce a "promisor-remote" capability
2025-01-27 15:16 ` [PATCH v4 0/6] " Christian Couder
` (6 preceding siblings ...)
2025-01-27 21:14 ` [PATCH v4 0/6] Introduce a "promisor-remote" capability Junio C Hamano
@ 2025-02-18 11:32 ` Christian Couder
2025-02-18 11:32 ` [PATCH v5 1/3] Add 'promisor-remote' capability to protocol v2 Christian Couder
` (4 more replies)
7 siblings, 5 replies; 110+ messages in thread
From: Christian Couder @ 2025-02-18 11:32 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
Randall S . Becker, Christian Couder
This work is part of some effort to better handle large files/blobs in
a client-server context using promisor remotes dedicated to storing
large blobs. To help understand this effort, this series now contains
a patch (patch 6/6) that adds design documentation about this effort.
Last year, I sent 3 versions of a patch series with the goal of
allowing a client C to clone from a server S while using the same
promisor remote X that S already use. See:
https://lore.kernel.org/git/20240418184043.2900955-1-christian.couder@gmail.com/
Junio suggested to implement that feature using:
"a protocol extension that lets S tell C that S wants C to fetch
missing objects from X (which means that if C knows about X in its
".git/config" then there is no need for end-user interaction at all),
or a protocol extension that C tells S that C is willing to see
objects available from X omitted when S does not have them (again,
this could be done by looking at ".git/config" at C, but there may be
security implications???)"
This patch series implements that protocol extension called
"promisor-remote" (that name is open to change or simplification)
which allows S and C to agree on C using X directly or not.
I have tried to implement it in a quite generic way that could allow S
and C to share more information about promisor remotes and how to use
them.
For now, C doesn't use the information it gets from S when cloning.
That information is only used to decide if C is OK to use the promisor
remotes advertised by S. But this could change in the future which
could make it much simpler for clients than using the current way of
passing information about X with the `-c` option of `git clone` many
times on the command line.
Another improvement could be to not require GIT_NO_LAZY_FETCH=0 when S
and C have agreed on using S.
Changes compared to version 4
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- The series is rebased on top 0394451348 (The eleventh batch,
2025-02-14). This is to take into account some recent changes like
some documentation files using the ".adoc" extension instead of
".txt".
- Patches 1/6, 2/6 and 3/6 from version 4 have been removed, as it
looks like using redact_non_printables() is not necessary after
all.
- Patch 1/3 ("Add 'promisor-remote' capability to protocol v2") has
a number of small changes:
- In the protocol-v2 doc, "respectively" is not repeated.
- In "promisor-remote.c", the useless call to
redact_non_printables() has been removed.
- In "promisor-remote.c", a useless "!accept_str" check has been
removed.
- In "promisor-remote.h", references to gitprotocol-v2(5) have
been added to some comments.
- In "promisor-remote.h", a comment has been improved to say
that mark_promisor_remotes_as_accepted() is useful on the
server side.
- In "t/t5710-promisor-remote-capability.sh", "server2" has been
replaced with "lop".
- In patch 2/3 ("promisor-remote: check advertised name or URL"),
there are also a number of small changes:
- In "Documentation/config/promisor.adoc", an instance of
"knownUrl" has been replaced with "knownName" to fix a
mistake.
- In "promisor-remote.c", strvec_find_index() has been renamed
remote_nick_find() and its arguments have been renamed
accordingly. Its comment doc has also been updated
accordingly.
- In "promisor-remote.c", URLs are now compared case
sensitively, so a call to strcasecmp() has been replaced with
a call to strcmp().
- In patch 3/3 ("doc: add technical design doc for large object
promisors"), there are a few small changes:
- A paragraph was added to tell that even if used not very
efficiently LOPs can be useful.
- A small sentence was added to acknowledge that more discussion
will be needed before implementing a feature to offload large
blobs from clients.
Thanks to Junio, Patrick, Eric, Karthik, Kristoffer, brian, Randall
and Taylor for their suggestions to improve this patch series.
CI tests
~~~~~~~~
All the CI tests passed, see:
https://github.com/chriscool/git/actions/runs/13388314841
Range diff compared to version 4
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1: 9e646013be < -: ---------- version: replace manual ASCII checks with isprint() for clarity
2: f4b22ef39d < -: ---------- version: refactor redact_non_printables()
3: 8bfa6f7a20 < -: ---------- version: make redact_non_printables() non-static
4: 652ce32892 ! 1: 918515f5ee Add 'promisor-remote' capability to protocol v2
@@ Commit message
Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
- ## Documentation/config/promisor.txt ##
+ ## Documentation/config/promisor.adoc ##
@@
promisor.quiet::
If set to "true" assume `--quiet` when fetching additional
@@ Documentation/config/promisor.txt
+ from its responses to "fetch" and "clone" requests from the
+ client. See linkgit:gitprotocol-v2[5].
- ## Documentation/gitprotocol-v2.txt ##
-@@ Documentation/gitprotocol-v2.txt: retrieving the header from a bundle at the indicated URI, and thus
+ ## Documentation/gitprotocol-v2.adoc ##
+@@ Documentation/gitprotocol-v2.adoc: retrieving the header from a bundle at the indicated URI, and thus
save themselves and the server(s) the request(s) needed to inspect the
headers of that bundle or bundles.
@@ Documentation/gitprotocol-v2.txt: retrieving the header from a bundle at the ind
+"promisor-remote" capability at all in its reply.
+
+The "promisor.advertise" and "promisor.acceptFromServer" configuration
-+options can be used on the server and client side respectively to
-+control what they advertise or accept respectively. See the
-+documentation of these configuration options for more information.
++options can be used on the server and client side to control what they
++advertise or accept respectively. See the documentation of these
++configuration options for more information.
+
+Note that in the future it would be nice if the "promisor-remote"
+protocol capability could be used by the server, when responding to
@@ promisor-remote.c: void promisor_remote_get_direct(struct repository *repo,
+ }
+ }
+
-+ redact_non_printables(&sb);
-+
+ strvec_clear(&names);
+ strvec_clear(&urls);
+
@@ promisor-remote.c: void promisor_remote_get_direct(struct repository *repo,
+ enum accept_promisor accept = ACCEPT_NONE;
+
+ if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
-+ if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
++ if (!*accept_str || !strcasecmp("None", accept_str))
+ accept = ACCEPT_NONE;
+ else if (!strcasecmp("All", accept_str))
+ accept = ACCEPT_ALL;
@@ promisor-remote.h: void promisor_remote_get_direct(struct repository *repo,
+ * advertisement.
+ * Return value is NULL if no promisor remote advertisement should be
+ * made. Otherwise it contains the names and urls of the advertised
-+ * promisor remotes separated by ';'
++ * promisor remotes separated by ';'. See gitprotocol-v2(5).
+ */
+char *promisor_remote_info(struct repository *repo);
+
@@ promisor-remote.h: void promisor_remote_get_direct(struct repository *repo,
+ * configured promisor remotes, if any, to prepare the reply.
+ * Return value is NULL if no promisor remote from the server
+ * is accepted. Otherwise it contains the names of the accepted promisor
-+ * remotes separated by ';'.
++ * remotes separated by ';'. See gitprotocol-v2(5).
+ */
+char *promisor_remote_reply(const char *info);
+
+/*
-+ * Set the 'accepted' flag for some promisor remotes. Useful when some
-+ * promisor remotes have been accepted by the client.
++ * Set the 'accepted' flag for some promisor remotes. Useful on the
++ * server side when some promisor remotes have been accepted by the
++ * client.
+ */
+void mark_promisor_remotes_as_accepted(struct repository *repo, const char *remotes);
+
@@ t/t5710-promisor-remote-capability.sh (new)
+ check_missing_objects server "$count" "$missing_oids"
+}
+
-+copy_to_server2 () {
++copy_to_lop () {
+ oid_path="$(test_oid_to_path $1)" &&
+ path="server/objects/$oid_path" &&
-+ path2="server2/objects/$oid_path" &&
++ path2="lop/objects/$oid_path" &&
+ mkdir -p $(dirname "$path2") &&
+ cp "$path" "$path2"
+}
+
+test_expect_success "setup for testing promisor remote advertisement" '
-+ # Create another bare repo called "server2"
-+ git init --bare server2 &&
++ # Create another bare repo called "lop" (for Large Object Promisor)
++ git init --bare lop &&
+
-+ # Copy the largest object from server to server2
++ # Copy the largest object from server to lop
+ obj="HEAD:foo" &&
+ oid="$(git -C server rev-parse $obj)" &&
-+ copy_to_server2 "$oid" &&
++ copy_to_lop "$oid" &&
+
+ initialize_server 1 "$oid" &&
+
-+ # Configure server2 as promisor remote for server
-+ git -C server remote add server2 "file://$(pwd)/server2" &&
-+ git -C server config remote.server2.promisor true &&
++ # Configure lop as promisor remote for server
++ git -C server remote add lop "file://$(pwd)/lop" &&
++ git -C server config remote.lop.promisor true &&
+
-+ git -C server2 config uploadpack.allowFilter true &&
-+ git -C server2 config uploadpack.allowAnySHA1InWant true &&
++ git -C lop config uploadpack.allowFilter true &&
++ git -C lop config uploadpack.allowAnySHA1InWant true &&
+ git -C server config uploadpack.allowFilter true &&
+ git -C server config uploadpack.allowAnySHA1InWant true
+'
@@ t/t5710-promisor-remote-capability.sh (new)
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
-+ GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
-+ -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
-+ -c remote.server2.url="file://$(pwd)/server2" \
++ GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
++ -c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
++ -c remote.lop.url="file://$(pwd)/lop" \
+ -c promisor.acceptfromserver=All \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
@@ t/t5710-promisor-remote-capability.sh (new)
+ git -C server config promisor.advertise false &&
+
+ # Clone from server to create a client
-+ GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
-+ -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
-+ -c remote.server2.url="file://$(pwd)/server2" \
++ GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
++ -c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
++ -c remote.lop.url="file://$(pwd)/lop" \
+ -c promisor.acceptfromserver=All \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
@@ t/t5710-promisor-remote-capability.sh (new)
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
-+ GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
-+ -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
-+ -c remote.server2.url="file://$(pwd)/server2" \
++ GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
++ -c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
++ -c remote.lop.url="file://$(pwd)/lop" \
+ -c promisor.acceptfromserver=None \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
@@ t/t5710-promisor-remote-capability.sh (new)
+ test_when_finished "rm -rf client" &&
+ mkdir client &&
+ git -C client init &&
-+ git -C client config remote.server2.promisor true &&
-+ git -C client config remote.server2.fetch "+refs/heads/*:refs/remotes/server2/*" &&
-+ git -C client config remote.server2.url "file://$(pwd)/server2" &&
++ git -C client config remote.lop.promisor true &&
++ git -C client config remote.lop.fetch "+refs/heads/*:refs/remotes/lop/*" &&
++ git -C client config remote.lop.url "file://$(pwd)/lop" &&
+ git -C client config remote.server.url "file://$(pwd)/server" &&
+ git -C client config remote.server.fetch "+refs/heads/*:refs/remotes/server/*" &&
+ git -C client config promisor.acceptfromserver All &&
@@ t/t5710-promisor-remote-capability.sh (new)
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
-+ GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
-+ -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
-+ -c remote.server2.url="file://$(pwd)/server2" \
++ GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
++ -c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
++ -c remote.lop.url="file://$(pwd)/lop" \
+ -c promisor.acceptfromserver=All \
+ --no-local --filter="blob:limit=5k" server client &&
+
@@ t/t5710-promisor-remote-capability.sh (new)
+ # Repack everything twice and remove .promisor files before
+ # each repack. This makes sure everything gets repacked
+ # into a single packfile. The second repack is necessary
-+ # because the first one fetches from server2 and creates a new
++ # because the first one fetches from lop and creates a new
+ # packfile and its associated .promisor file.
+
+ rm -f server/objects/pack/*.promisor &&
@@ t/t5710-promisor-remote-capability.sh (new)
+ packfile=$(ls pack-*.pack) &&
+ git -C server unpack-objects --strict <"$packfile" &&
+
-+ # Copy new large object to server2
++ # Copy new large object to lop
+ obj_bar="HEAD:bar" &&
+ oid_bar="$(git -C server rev-parse $obj_bar)" &&
-+ copy_to_server2 "$oid_bar" &&
++ copy_to_lop "$oid_bar" &&
+
+ # Reinitialize server so that the 2 largest objects are missing
+ printf "%s\n" "$oid" "$oid_bar" >expected_missing.txt &&
5: 979a0af1c3 ! 2: 89e20976ba promisor-remote: check advertised name or URL
@@ Commit message
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
- ## Documentation/config/promisor.txt ##
-@@ Documentation/config/promisor.txt: promisor.advertise::
+ ## Documentation/config/promisor.adoc ##
+@@ Documentation/config/promisor.adoc: promisor.advertise::
promisor.acceptFromServer::
If set to "all", a client will accept all the promisor remotes
a server might advertise using the "promisor-remote"
@@ Documentation/config/promisor.txt: promisor.advertise::
+ URLs. If set to "knownUrl", the client will accept promisor
+ remotes which have both the same name and the same URL
+ configured on the client as the name and URL advertised by the
-+ server. This is more secure than "all" or "knownUrl", so it
++ server. This is more secure than "all" or "knownName", so it
+ should be used if possible instead of those options. Default
+ is "none", which means no promisor remote advertised by a
+ server will be accepted. By accepting a promisor remote, the
@@ promisor-remote.c: char *promisor_remote_info(struct repository *repo)
}
+/*
-+ * Find first index of 'vec' where there is 'val'. 'val' is compared
-+ * case insensively to the strings in 'vec'. If not found 'vec->nr' is
-+ * returned.
++ * Find first index of 'nicks' where there is 'nick'. 'nick' is
++ * compared case insensitively to the strings in 'nicks'. If not found
++ * 'nicks->nr' is returned.
+ */
-+static size_t strvec_find_index(struct strvec *vec, const char *val)
++static size_t remote_nick_find(struct strvec *nicks, const char *nick)
+{
-+ for (size_t i = 0; i < vec->nr; i++)
-+ if (!strcasecmp(vec->v[i], val))
++ for (size_t i = 0; i < nicks->nr; i++)
++ if (!strcasecmp(nicks->v[i], nick))
+ return i;
-+ return vec->nr;
++ return nicks->nr;
+}
+
enum accept_promisor {
@@ promisor-remote.c: char *promisor_remote_info(struct repository *repo)
return 1;
- BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
-+ i = strvec_find_index(names, remote_name);
++ i = remote_nick_find(names, remote_name);
+
+ if (i >= names->nr)
+ /* We don't know about that remote */
@@ promisor-remote.c: char *promisor_remote_info(struct repository *repo)
+ if (accept != ACCEPT_KNOWN_URL)
+ BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+
-+ if (!strcasecmp(urls->v[i], remote_url))
++ if (!strcmp(urls->v[i], remote_url))
+ return 1;
+
+ warning(_("known remote named '%s' but with url '%s' instead of '%s'"),
@@ promisor-remote.c: char *promisor_remote_info(struct repository *repo)
+ struct strvec urls = STRVEC_INIT;
if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
- if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
+ if (!*accept_str || !strcasecmp("None", accept_str))
accept = ACCEPT_NONE;
+ else if (!strcasecmp("KnownUrl", accept_str))
+ accept = ACCEPT_KNOWN_URL;
@@ t/t5710-promisor-remote-capability.sh: test_expect_success "init + fetch with pr
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
-+ GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
-+ -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
-+ -c remote.server2.url="file://$(pwd)/server2" \
++ GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
++ -c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
++ -c remote.lop.url="file://$(pwd)/lop" \
+ -c promisor.acceptfromserver=KnownName \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
@@ t/t5710-promisor-remote-capability.sh: test_expect_success "init + fetch with pr
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.serverTwo.promisor=true \
-+ -c remote.serverTwo.fetch="+refs/heads/*:refs/remotes/server2/*" \
-+ -c remote.serverTwo.url="file://$(pwd)/server2" \
++ -c remote.serverTwo.fetch="+refs/heads/*:refs/remotes/lop/*" \
++ -c remote.serverTwo.url="file://$(pwd)/lop" \
+ -c promisor.acceptfromserver=KnownName \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
@@ t/t5710-promisor-remote-capability.sh: test_expect_success "init + fetch with pr
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
-+ GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
-+ -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
-+ -c remote.server2.url="file://$(pwd)/server2" \
++ GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
++ -c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
++ -c remote.lop.url="file://$(pwd)/lop" \
+ -c promisor.acceptfromserver=KnownUrl \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
@@ t/t5710-promisor-remote-capability.sh: test_expect_success "init + fetch with pr
+'
+
+test_expect_success "clone with 'KnownUrl' and different remote urls" '
-+ ln -s server2 serverTwo &&
++ ln -s lop serverTwo &&
+
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
-+ GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
-+ -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
-+ -c remote.server2.url="file://$(pwd)/serverTwo" \
++ GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
++ -c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
++ -c remote.lop.url="file://$(pwd)/serverTwo" \
+ -c promisor.acceptfromserver=KnownUrl \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
6: 3a0c134e09 ! 3: e980fe0aa2 doc: add technical design doc for large object promisors
@@ Documentation/technical/large-object-promisors.txt (new)
+a number of current issues in the context of Git clients and servers
+sharing Git objects.
+
++Even if LOPs are used not very efficiently, they can still be useful
++and worth using in some cases because, as we will see in more details
++later in this document:
++
++ - they can make it simpler for clients to use promisor remotes and
++ therefore avoid fetching a lot of large blobs they might not need
++ locally,
++
++ - they can make it significantly cheaper or easier for servers to
++ host a significant part of the current repository content, and
++ even more to host content with larger blobs or more large blobs
++ than currently.
++
+I) Issues with the current situation
+------------------------------------
+
@@ Documentation/technical/large-object-promisors.txt (new)
+to the LOP without the main remote checking them in some ways
+(possibly using hooks or other tools).
+
++This should be discussed and refined when we get closer to
++implementing this feature.
++
+Rationale
++++++++++
+
Christian Couder (3):
Add 'promisor-remote' capability to protocol v2
promisor-remote: check advertised name or URL
doc: add technical design doc for large object promisors
Documentation/config/promisor.adoc | 27 +
Documentation/gitprotocol-v2.adoc | 54 ++
.../technical/large-object-promisors.txt | 656 ++++++++++++++++++
connect.c | 9 +
promisor-remote.c | 242 +++++++
promisor-remote.h | 37 +-
serve.c | 26 +
t/meson.build | 1 +
t/t5710-promisor-remote-capability.sh | 312 +++++++++
upload-pack.c | 3 +
10 files changed, 1366 insertions(+), 1 deletion(-)
create mode 100644 Documentation/technical/large-object-promisors.txt
create mode 100755 t/t5710-promisor-remote-capability.sh
--
2.48.1.359.ge980fe0aa2
^ permalink raw reply [flat|nested] 110+ messages in thread
* [PATCH v5 1/3] Add 'promisor-remote' capability to protocol v2
2025-02-18 11:32 ` [PATCH v5 0/3] " Christian Couder
@ 2025-02-18 11:32 ` Christian Couder
2025-02-18 11:32 ` [PATCH v5 2/3] promisor-remote: check advertised name or URL Christian Couder
` (3 subsequent siblings)
4 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-02-18 11:32 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
Randall S . Becker, Christian Couder, Christian Couder
When a server S knows that some objects from a repository are available
from a promisor remote X, S might want to suggest to a client C cloning
or fetching the repo from S that C may use X directly instead of S for
these objects.
Note that this could happen both in the case S itself doesn't have the
objects and borrows them from X, and in the case S has the objects but
knows that X is better connected to the world (e.g., it is in a
$LARGEINTERNETCOMPANY datacenter with petabit/s backbone connections)
than S. Implementation of the latter case, which would require S to
omit in its response the objects available on X, is left for future
improvement though.
Then C might or might not, want to get the objects from X. If S and C
can agree on C using X directly, S can then omit objects that can be
obtained from X when answering C's request.
To allow S and C to agree and let each other know about C using X or
not, let's introduce a new "promisor-remote" capability in the
protocol v2, as well as a few new configuration variables:
- "promisor.advertise" on the server side, and:
- "promisor.acceptFromServer" on the client side.
By default, or if "promisor.advertise" is set to 'false', a server S will
not advertise the "promisor-remote" capability.
If S doesn't advertise the "promisor-remote" capability, then a client C
replying to S shouldn't advertise the "promisor-remote" capability
either.
If "promisor.advertise" is set to 'true', S will advertise its promisor
remotes with a string like:
promisor-remote=<pr-info>[;<pr-info>]...
where each <pr-info> element contains information about a single
promisor remote in the form:
name=<pr-name>[,url=<pr-url>]
where <pr-name> is the urlencoded name of a promisor remote and
<pr-url> is the urlencoded URL of the promisor remote named <pr-name>.
For now, the URL is passed in addition to the name. In the future, it
might be possible to pass other information like a filter-spec that the
client may use when cloning from S, or a token that the client may use
when retrieving objects from X.
It is C's responsibility to arrange how it can reach X though, so pieces
of information that are usually outside Git's concern, like proxy
configuration, must not be distributed over this protocol.
It might also be possible in the future for "promisor.advertise" to have
other values. For example a value like "onlyName" could prevent S from
advertising URLs, which could help in case C should use a different URL
for X than the URL S is using. (The URL S is using might be an internal
one on the server side for example.)
By default or if "promisor.acceptFromServer" is set to "None", C will
not accept to use the promisor remotes that might have been advertised
by S. In this case, C will not advertise any "promisor-remote"
capability in its reply to S.
If "promisor.acceptFromServer" is set to "All" and S advertised some
promisor remotes, then on the contrary, C will accept to use all the
promisor remotes that S advertised and C will reply with a string like:
promisor-remote=<pr-name>[;<pr-name>]...
where the <pr-name> elements are the urlencoded names of all the
promisor remotes S advertised.
In a following commit, other values for "promisor.acceptFromServer" will
be implemented, so that C will be able to decide the promisor remotes it
accepts depending on the name and URL it received from S. So even if
that name and URL information is not used much right now, it will be
needed soon.
Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
Documentation/config/promisor.adoc | 17 ++
Documentation/gitprotocol-v2.adoc | 54 ++++++
connect.c | 9 +
promisor-remote.c | 194 ++++++++++++++++++++
promisor-remote.h | 37 +++-
serve.c | 26 +++
t/meson.build | 1 +
t/t5710-promisor-remote-capability.sh | 244 ++++++++++++++++++++++++++
upload-pack.c | 3 +
9 files changed, 584 insertions(+), 1 deletion(-)
create mode 100755 t/t5710-promisor-remote-capability.sh
diff --git a/Documentation/config/promisor.adoc b/Documentation/config/promisor.adoc
index 98c5cb2ec2..9cbfe3e59e 100644
--- a/Documentation/config/promisor.adoc
+++ b/Documentation/config/promisor.adoc
@@ -1,3 +1,20 @@
promisor.quiet::
If set to "true" assume `--quiet` when fetching additional
objects for a partial clone.
+
+promisor.advertise::
+ If set to "true", a server will use the "promisor-remote"
+ capability, see linkgit:gitprotocol-v2[5], to advertise the
+ promisor remotes it is using, if it uses some. Default is
+ "false", which means the "promisor-remote" capability is not
+ advertised.
+
+promisor.acceptFromServer::
+ If set to "all", a client will accept all the promisor remotes
+ a server might advertise using the "promisor-remote"
+ capability. Default is "none", which means no promisor remote
+ advertised by a server will be accepted. By accepting a
+ promisor remote, the client agrees that the server might omit
+ objects that are lazily fetchable from this promisor remote
+ from its responses to "fetch" and "clone" requests from the
+ client. See linkgit:gitprotocol-v2[5].
diff --git a/Documentation/gitprotocol-v2.adoc b/Documentation/gitprotocol-v2.adoc
index 1652fef3ae..c20b74aac0 100644
--- a/Documentation/gitprotocol-v2.adoc
+++ b/Documentation/gitprotocol-v2.adoc
@@ -781,6 +781,60 @@ retrieving the header from a bundle at the indicated URI, and thus
save themselves and the server(s) the request(s) needed to inspect the
headers of that bundle or bundles.
+promisor-remote=<pr-infos>
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The server may advertise some promisor remotes it is using or knows
+about to a client which may want to use them as its promisor remotes,
+instead of this repository. In this case <pr-infos> should be of the
+form:
+
+ pr-infos = pr-info | pr-infos ";" pr-info
+
+ pr-info = "name=" pr-name | "name=" pr-name "," "url=" pr-url
+
+where `pr-name` is the urlencoded name of a promisor remote, and
+`pr-url` the urlencoded URL of that promisor remote.
+
+In this case, if the client decides to use one or more promisor
+remotes the server advertised, it can reply with
+"promisor-remote=<pr-names>" where <pr-names> should be of the form:
+
+ pr-names = pr-name | pr-names ";" pr-name
+
+where `pr-name` is the urlencoded name of a promisor remote the server
+advertised and the client accepts.
+
+Note that, everywhere in this document, `pr-name` MUST be a valid
+remote name, and the ';' and ',' characters MUST be encoded if they
+appear in `pr-name` or `pr-url`.
+
+If the server doesn't know any promisor remote that could be good for
+a client to use, or prefers a client not to use any promisor remote it
+uses or knows about, it shouldn't advertise the "promisor-remote"
+capability at all.
+
+In this case, or if the client doesn't want to use any promisor remote
+the server advertised, the client shouldn't advertise the
+"promisor-remote" capability at all in its reply.
+
+The "promisor.advertise" and "promisor.acceptFromServer" configuration
+options can be used on the server and client side to control what they
+advertise or accept respectively. See the documentation of these
+configuration options for more information.
+
+Note that in the future it would be nice if the "promisor-remote"
+protocol capability could be used by the server, when responding to
+`git fetch` or `git clone`, to advertise better-connected remotes that
+the client can use as promisor remotes, instead of this repository, so
+that the client can lazily fetch objects from these other
+better-connected remotes. This would require the server to omit in its
+response the objects available on the better-connected remotes that
+the client has accepted. This hasn't been implemented yet though. So
+for now this "promisor-remote" capability is useful only when the
+server advertises some promisor remotes it already uses to borrow
+objects from.
+
GIT
---
Part of the linkgit:git[1] suite
diff --git a/connect.c b/connect.c
index 91f3990014..125150ac25 100644
--- a/connect.c
+++ b/connect.c
@@ -22,6 +22,7 @@
#include "protocol.h"
#include "alias.h"
#include "bundle-uri.h"
+#include "promisor-remote.h"
static char *server_capabilities_v1;
static struct strvec server_capabilities_v2 = STRVEC_INIT;
@@ -487,6 +488,7 @@ void check_stateless_delimiter(int stateless_rpc,
static void send_capabilities(int fd_out, struct packet_reader *reader)
{
const char *hash_name;
+ const char *promisor_remote_info;
if (server_supports_v2("agent"))
packet_write_fmt(fd_out, "agent=%s", git_user_agent_sanitized());
@@ -500,6 +502,13 @@ static void send_capabilities(int fd_out, struct packet_reader *reader)
} else {
reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
}
+ if (server_feature_v2("promisor-remote", &promisor_remote_info)) {
+ char *reply = promisor_remote_reply(promisor_remote_info);
+ if (reply) {
+ packet_write_fmt(fd_out, "promisor-remote=%s", reply);
+ free(reply);
+ }
+ }
}
int get_remote_bundle_uri(int fd_out, struct packet_reader *reader,
diff --git a/promisor-remote.c b/promisor-remote.c
index c714f4f007..918be6528f 100644
--- a/promisor-remote.c
+++ b/promisor-remote.c
@@ -11,6 +11,8 @@
#include "strvec.h"
#include "packfile.h"
#include "environment.h"
+#include "url.h"
+#include "version.h"
struct promisor_remote_config {
struct promisor_remote *promisors;
@@ -221,6 +223,18 @@ int repo_has_promisor_remote(struct repository *r)
return !!repo_promisor_remote_find(r, NULL);
}
+int repo_has_accepted_promisor_remote(struct repository *r)
+{
+ struct promisor_remote *p;
+
+ promisor_remote_init(r);
+
+ for (p = r->promisor_remote_config->promisors; p; p = p->next)
+ if (p->accepted)
+ return 1;
+ return 0;
+}
+
static int remove_fetched_oids(struct repository *repo,
struct object_id **oids,
int oid_nr, int to_free)
@@ -292,3 +306,183 @@ void promisor_remote_get_direct(struct repository *repo,
if (to_free)
free(remaining_oids);
}
+
+static int allow_unsanitized(char ch)
+{
+ if (ch == ',' || ch == ';' || ch == '%')
+ return 0;
+ return ch > 32 && ch < 127;
+}
+
+static void promisor_info_vecs(struct repository *repo,
+ struct strvec *names,
+ struct strvec *urls)
+{
+ struct promisor_remote *r;
+
+ promisor_remote_init(repo);
+
+ for (r = repo->promisor_remote_config->promisors; r; r = r->next) {
+ char *url;
+ char *url_key = xstrfmt("remote.%s.url", r->name);
+
+ strvec_push(names, r->name);
+ strvec_push(urls, git_config_get_string(url_key, &url) ? NULL : url);
+
+ free(url);
+ free(url_key);
+ }
+}
+
+char *promisor_remote_info(struct repository *repo)
+{
+ struct strbuf sb = STRBUF_INIT;
+ int advertise_promisors = 0;
+ struct strvec names = STRVEC_INIT;
+ struct strvec urls = STRVEC_INIT;
+
+ git_config_get_bool("promisor.advertise", &advertise_promisors);
+
+ if (!advertise_promisors)
+ return NULL;
+
+ promisor_info_vecs(repo, &names, &urls);
+
+ if (!names.nr)
+ return NULL;
+
+ for (size_t i = 0; i < names.nr; i++) {
+ if (i)
+ strbuf_addch(&sb, ';');
+ strbuf_addstr(&sb, "name=");
+ strbuf_addstr_urlencode(&sb, names.v[i], allow_unsanitized);
+ if (urls.v[i]) {
+ strbuf_addstr(&sb, ",url=");
+ strbuf_addstr_urlencode(&sb, urls.v[i], allow_unsanitized);
+ }
+ }
+
+ strvec_clear(&names);
+ strvec_clear(&urls);
+
+ return strbuf_detach(&sb, NULL);
+}
+
+enum accept_promisor {
+ ACCEPT_NONE = 0,
+ ACCEPT_ALL
+};
+
+static int should_accept_remote(enum accept_promisor accept,
+ const char *remote_name UNUSED,
+ const char *remote_url UNUSED)
+{
+ if (accept == ACCEPT_ALL)
+ return 1;
+
+ BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+}
+
+static void filter_promisor_remote(struct strvec *accepted, const char *info)
+{
+ struct strbuf **remotes;
+ const char *accept_str;
+ enum accept_promisor accept = ACCEPT_NONE;
+
+ if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
+ if (!*accept_str || !strcasecmp("None", accept_str))
+ accept = ACCEPT_NONE;
+ else if (!strcasecmp("All", accept_str))
+ accept = ACCEPT_ALL;
+ else
+ warning(_("unknown '%s' value for '%s' config option"),
+ accept_str, "promisor.acceptfromserver");
+ }
+
+ if (accept == ACCEPT_NONE)
+ return;
+
+ /* Parse remote info received */
+
+ remotes = strbuf_split_str(info, ';', 0);
+
+ for (size_t i = 0; remotes[i]; i++) {
+ struct strbuf **elems;
+ const char *remote_name = NULL;
+ const char *remote_url = NULL;
+ char *decoded_name = NULL;
+ char *decoded_url = NULL;
+
+ strbuf_strip_suffix(remotes[i], ";");
+ elems = strbuf_split(remotes[i], ',');
+
+ for (size_t j = 0; elems[j]; j++) {
+ int res;
+ strbuf_strip_suffix(elems[j], ",");
+ res = skip_prefix(elems[j]->buf, "name=", &remote_name) ||
+ skip_prefix(elems[j]->buf, "url=", &remote_url);
+ if (!res)
+ warning(_("unknown element '%s' from remote info"),
+ elems[j]->buf);
+ }
+
+ if (remote_name)
+ decoded_name = url_percent_decode(remote_name);
+ if (remote_url)
+ decoded_url = url_percent_decode(remote_url);
+
+ if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url))
+ strvec_push(accepted, decoded_name);
+
+ strbuf_list_free(elems);
+ free(decoded_name);
+ free(decoded_url);
+ }
+
+ strbuf_list_free(remotes);
+}
+
+char *promisor_remote_reply(const char *info)
+{
+ struct strvec accepted = STRVEC_INIT;
+ struct strbuf reply = STRBUF_INIT;
+
+ filter_promisor_remote(&accepted, info);
+
+ if (!accepted.nr)
+ return NULL;
+
+ for (size_t i = 0; i < accepted.nr; i++) {
+ if (i)
+ strbuf_addch(&reply, ';');
+ strbuf_addstr_urlencode(&reply, accepted.v[i], allow_unsanitized);
+ }
+
+ strvec_clear(&accepted);
+
+ return strbuf_detach(&reply, NULL);
+}
+
+void mark_promisor_remotes_as_accepted(struct repository *r, const char *remotes)
+{
+ struct strbuf **accepted_remotes = strbuf_split_str(remotes, ';', 0);
+
+ for (size_t i = 0; accepted_remotes[i]; i++) {
+ struct promisor_remote *p;
+ char *decoded_remote;
+
+ strbuf_strip_suffix(accepted_remotes[i], ";");
+ decoded_remote = url_percent_decode(accepted_remotes[i]->buf);
+
+ p = repo_promisor_remote_find(r, decoded_remote);
+ if (p)
+ p->accepted = 1;
+ else
+ warning(_("accepted promisor remote '%s' not found"),
+ decoded_remote);
+
+ free(decoded_remote);
+ }
+
+ strbuf_list_free(accepted_remotes);
+}
diff --git a/promisor-remote.h b/promisor-remote.h
index 88cb599c39..263d331a55 100644
--- a/promisor-remote.h
+++ b/promisor-remote.h
@@ -9,11 +9,13 @@ struct object_id;
* Promisor remote linked list
*
* Information in its fields come from remote.XXX config entries or
- * from extensions.partialclone.
+ * from extensions.partialclone, except for 'accepted' which comes
+ * from protocol v2 capabilities exchange.
*/
struct promisor_remote {
struct promisor_remote *next;
char *partial_clone_filter;
+ unsigned int accepted : 1;
const char name[FLEX_ARRAY];
};
@@ -32,4 +34,37 @@ void promisor_remote_get_direct(struct repository *repo,
const struct object_id *oids,
int oid_nr);
+/*
+ * Prepare a "promisor-remote" advertisement by a server.
+ * Check the value of "promisor.advertise" and maybe the configured
+ * promisor remotes, if any, to prepare information to send in an
+ * advertisement.
+ * Return value is NULL if no promisor remote advertisement should be
+ * made. Otherwise it contains the names and urls of the advertised
+ * promisor remotes separated by ';'. See gitprotocol-v2(5).
+ */
+char *promisor_remote_info(struct repository *repo);
+
+/*
+ * Prepare a reply to a "promisor-remote" advertisement from a server.
+ * Check the value of "promisor.acceptfromserver" and maybe the
+ * configured promisor remotes, if any, to prepare the reply.
+ * Return value is NULL if no promisor remote from the server
+ * is accepted. Otherwise it contains the names of the accepted promisor
+ * remotes separated by ';'. See gitprotocol-v2(5).
+ */
+char *promisor_remote_reply(const char *info);
+
+/*
+ * Set the 'accepted' flag for some promisor remotes. Useful on the
+ * server side when some promisor remotes have been accepted by the
+ * client.
+ */
+void mark_promisor_remotes_as_accepted(struct repository *repo, const char *remotes);
+
+/*
+ * Has any promisor remote been accepted by the client?
+ */
+int repo_has_accepted_promisor_remote(struct repository *r);
+
#endif /* PROMISOR_REMOTE_H */
diff --git a/serve.c b/serve.c
index f6dfe34a2b..e3ccf1505c 100644
--- a/serve.c
+++ b/serve.c
@@ -10,6 +10,7 @@
#include "upload-pack.h"
#include "bundle-uri.h"
#include "trace2.h"
+#include "promisor-remote.h"
static int advertise_sid = -1;
static int advertise_object_info = -1;
@@ -29,6 +30,26 @@ static int agent_advertise(struct repository *r UNUSED,
return 1;
}
+static int promisor_remote_advertise(struct repository *r,
+ struct strbuf *value)
+{
+ if (value) {
+ char *info = promisor_remote_info(r);
+ if (!info)
+ return 0;
+ strbuf_addstr(value, info);
+ free(info);
+ }
+ return 1;
+}
+
+static void promisor_remote_receive(struct repository *r,
+ const char *remotes)
+{
+ mark_promisor_remotes_as_accepted(r, remotes);
+}
+
+
static int object_format_advertise(struct repository *r,
struct strbuf *value)
{
@@ -155,6 +176,11 @@ static struct protocol_capability capabilities[] = {
.advertise = bundle_uri_advertise,
.command = bundle_uri_command,
},
+ {
+ .name = "promisor-remote",
+ .advertise = promisor_remote_advertise,
+ .receive = promisor_remote_receive,
+ },
};
void protocol_v2_advertise_capabilities(struct repository *r)
diff --git a/t/meson.build b/t/meson.build
index a03ebc81fd..75ad6726c4 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -728,6 +728,7 @@ integration_tests = [
't5703-upload-pack-ref-in-want.sh',
't5704-protocol-violations.sh',
't5705-session-id-in-capabilities.sh',
+ 't5710-promisor-remote-capability.sh',
't5730-protocol-v2-bundle-uri-file.sh',
't5731-protocol-v2-bundle-uri-git.sh',
't5732-protocol-v2-bundle-uri-http.sh',
diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
new file mode 100755
index 0000000000..51cf2269e1
--- /dev/null
+++ b/t/t5710-promisor-remote-capability.sh
@@ -0,0 +1,244 @@
+#!/bin/sh
+
+test_description='handling of promisor remote advertisement'
+
+. ./test-lib.sh
+
+GIT_TEST_MULTI_PACK_INDEX=0
+GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0
+
+# Setup the repository with three commits, this way HEAD is always
+# available and we can hide commit 1 or 2.
+test_expect_success 'setup: create "template" repository' '
+ git init template &&
+ test_commit -C template 1 &&
+ test_commit -C template 2 &&
+ test_commit -C template 3 &&
+ test-tool genrandom foo 10240 >template/foo &&
+ git -C template add foo &&
+ git -C template commit -m foo
+'
+
+# A bare repo will act as a server repo with unpacked objects.
+test_expect_success 'setup: create bare "server" repository' '
+ git clone --bare --no-local template server &&
+ mv server/objects/pack/pack-* . &&
+ packfile=$(ls pack-*.pack) &&
+ git -C server unpack-objects --strict <"$packfile"
+'
+
+check_missing_objects () {
+ git -C "$1" rev-list --objects --all --missing=print > all.txt &&
+ perl -ne 'print if s/^[?]//' all.txt >missing.txt &&
+ test_line_count = "$2" missing.txt &&
+ if test "$2" -lt 2
+ then
+ test "$3" = "$(cat missing.txt)"
+ else
+ test -f "$3" &&
+ sort <"$3" >expected_sorted &&
+ sort <missing.txt >actual_sorted &&
+ test_cmp expected_sorted actual_sorted
+ fi
+}
+
+initialize_server () {
+ count="$1"
+ missing_oids="$2"
+
+ # Repack everything first
+ git -C server -c repack.writebitmaps=false repack -a -d &&
+
+ # Remove promisor file in case they exist, useful when reinitializing
+ rm -rf server/objects/pack/*.promisor &&
+
+ # Repack without the largest object and create a promisor pack on server
+ git -C server -c repack.writebitmaps=false repack -a -d \
+ --filter=blob:limit=5k --filter-to="$(pwd)/pack" &&
+ promisor_file=$(ls server/objects/pack/*.pack | sed "s/\.pack/.promisor/") &&
+ >"$promisor_file" &&
+
+ # Check objects missing on the server
+ check_missing_objects server "$count" "$missing_oids"
+}
+
+copy_to_lop () {
+ oid_path="$(test_oid_to_path $1)" &&
+ path="server/objects/$oid_path" &&
+ path2="lop/objects/$oid_path" &&
+ mkdir -p $(dirname "$path2") &&
+ cp "$path" "$path2"
+}
+
+test_expect_success "setup for testing promisor remote advertisement" '
+ # Create another bare repo called "lop" (for Large Object Promisor)
+ git init --bare lop &&
+
+ # Copy the largest object from server to lop
+ obj="HEAD:foo" &&
+ oid="$(git -C server rev-parse $obj)" &&
+ copy_to_lop "$oid" &&
+
+ initialize_server 1 "$oid" &&
+
+ # Configure lop as promisor remote for server
+ git -C server remote add lop "file://$(pwd)/lop" &&
+ git -C server config remote.lop.promisor true &&
+
+ git -C lop config uploadpack.allowFilter true &&
+ git -C lop config uploadpack.allowAnySHA1InWant true &&
+ git -C server config uploadpack.allowFilter true &&
+ git -C server config uploadpack.allowAnySHA1InWant true
+'
+
+test_expect_success "clone with promisor.advertise set to 'true'" '
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
+ -c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
+ -c remote.lop.url="file://$(pwd)/lop" \
+ -c promisor.acceptfromserver=All \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
+
+ # Check that the largest object is still missing on the server
+ check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.advertise set to 'false'" '
+ git -C server config promisor.advertise false &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
+ -c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
+ -c remote.lop.url="file://$(pwd)/lop" \
+ -c promisor.acceptfromserver=All \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
+
+ # Check that the largest object is not missing on the server
+ check_missing_objects server 0 "" &&
+
+ # Reinitialize server so that the largest object is missing again
+ initialize_server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.acceptfromserver set to 'None'" '
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
+ -c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
+ -c remote.lop.url="file://$(pwd)/lop" \
+ -c promisor.acceptfromserver=None \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
+
+ # Check that the largest object is not missing on the server
+ check_missing_objects server 0 "" &&
+
+ # Reinitialize server so that the largest object is missing again
+ initialize_server 1 "$oid"
+'
+
+test_expect_success "init + fetch with promisor.advertise set to 'true'" '
+ git -C server config promisor.advertise true &&
+
+ test_when_finished "rm -rf client" &&
+ mkdir client &&
+ git -C client init &&
+ git -C client config remote.lop.promisor true &&
+ git -C client config remote.lop.fetch "+refs/heads/*:refs/remotes/lop/*" &&
+ git -C client config remote.lop.url "file://$(pwd)/lop" &&
+ git -C client config remote.server.url "file://$(pwd)/server" &&
+ git -C client config remote.server.fetch "+refs/heads/*:refs/remotes/server/*" &&
+ git -C client config promisor.acceptfromserver All &&
+ GIT_NO_LAZY_FETCH=0 git -C client fetch --filter="blob:limit=5k" server &&
+
+ # Check that the largest object is still missing on the server
+ check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.advertise set to 'true' but don't delete the client" '
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
+ -c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
+ -c remote.lop.url="file://$(pwd)/lop" \
+ -c promisor.acceptfromserver=All \
+ --no-local --filter="blob:limit=5k" server client &&
+
+ # Check that the largest object is still missing on the server
+ check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "setup for subsequent fetches" '
+ # Generate new commit with large blob
+ test-tool genrandom bar 10240 >template/bar &&
+ git -C template add bar &&
+ git -C template commit -m bar &&
+
+ # Fetch new commit with large blob
+ git -C server fetch origin &&
+ git -C server update-ref HEAD FETCH_HEAD &&
+ git -C server rev-parse HEAD >expected_head &&
+
+ # Repack everything twice and remove .promisor files before
+ # each repack. This makes sure everything gets repacked
+ # into a single packfile. The second repack is necessary
+ # because the first one fetches from lop and creates a new
+ # packfile and its associated .promisor file.
+
+ rm -f server/objects/pack/*.promisor &&
+ git -C server -c repack.writebitmaps=false repack -a -d &&
+ rm -f server/objects/pack/*.promisor &&
+ git -C server -c repack.writebitmaps=false repack -a -d &&
+
+ # Unpack everything
+ rm pack-* &&
+ mv server/objects/pack/pack-* . &&
+ packfile=$(ls pack-*.pack) &&
+ git -C server unpack-objects --strict <"$packfile" &&
+
+ # Copy new large object to lop
+ obj_bar="HEAD:bar" &&
+ oid_bar="$(git -C server rev-parse $obj_bar)" &&
+ copy_to_lop "$oid_bar" &&
+
+ # Reinitialize server so that the 2 largest objects are missing
+ printf "%s\n" "$oid" "$oid_bar" >expected_missing.txt &&
+ initialize_server 2 expected_missing.txt &&
+
+ # Create one more client
+ cp -r client client2
+'
+
+test_expect_success "subsequent fetch from a client when promisor.advertise is true" '
+ git -C server config promisor.advertise true &&
+
+ GIT_NO_LAZY_FETCH=0 git -C client pull origin &&
+
+ git -C client rev-parse HEAD >actual &&
+ test_cmp expected_head actual &&
+
+ cat client/bar >/dev/null &&
+
+ check_missing_objects server 2 expected_missing.txt
+'
+
+test_expect_success "subsequent fetch from a client when promisor.advertise is false" '
+ git -C server config promisor.advertise false &&
+
+ GIT_NO_LAZY_FETCH=0 git -C client2 pull origin &&
+
+ git -C client2 rev-parse HEAD >actual &&
+ test_cmp expected_head actual &&
+
+ cat client2/bar >/dev/null &&
+
+ check_missing_objects server 1 "$oid"
+'
+
+test_done
diff --git a/upload-pack.c b/upload-pack.c
index 728b2477fc..7498b45e2e 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -32,6 +32,7 @@
#include "write-or-die.h"
#include "json-writer.h"
#include "strmap.h"
+#include "promisor-remote.h"
/* Remember to update object flag allocation in object.h */
#define THEY_HAVE (1u << 11)
@@ -319,6 +320,8 @@ static void create_pack_file(struct upload_pack_data *pack_data,
strvec_push(&pack_objects.args, "--delta-base-offset");
if (pack_data->use_include_tag)
strvec_push(&pack_objects.args, "--include-tag");
+ if (repo_has_accepted_promisor_remote(the_repository))
+ strvec_push(&pack_objects.args, "--missing=allow-promisor");
if (pack_data->filter_options.choice) {
const char *spec =
expand_list_objects_filter_spec(&pack_data->filter_options);
--
2.48.1.359.ge980fe0aa2
^ permalink raw reply related [flat|nested] 110+ messages in thread
* [PATCH v5 2/3] promisor-remote: check advertised name or URL
2025-02-18 11:32 ` [PATCH v5 0/3] " Christian Couder
2025-02-18 11:32 ` [PATCH v5 1/3] Add 'promisor-remote' capability to protocol v2 Christian Couder
@ 2025-02-18 11:32 ` Christian Couder
2025-02-18 11:32 ` [PATCH v5 3/3] doc: add technical design doc for large object promisors Christian Couder
` (2 subsequent siblings)
4 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-02-18 11:32 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
Randall S . Becker, Christian Couder, Christian Couder
A previous commit introduced a "promisor.acceptFromServer" configuration
variable with only "None" or "All" as valid values.
Let's introduce "KnownName" and "KnownUrl" as valid values for this
configuration option to give more choice to a client about which
promisor remotes it might accept among those that the server advertised.
In case of "KnownName", the client will accept promisor remotes which
are already configured on the client and have the same name as those
advertised by the client. This could be useful in a corporate setup
where servers and clients are trusted to not switch names and URLs, but
where some kind of control is still useful.
In case of "KnownUrl", the client will accept promisor remotes which
have both the same name and the same URL configured on the client as the
name and URL advertised by the server. This is the most secure option,
so it should be used if possible.
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
Documentation/config/promisor.adoc | 22 ++++++---
promisor-remote.c | 60 ++++++++++++++++++++---
t/t5710-promisor-remote-capability.sh | 68 +++++++++++++++++++++++++++
3 files changed, 138 insertions(+), 12 deletions(-)
diff --git a/Documentation/config/promisor.adoc b/Documentation/config/promisor.adoc
index 9cbfe3e59e..9192acfd24 100644
--- a/Documentation/config/promisor.adoc
+++ b/Documentation/config/promisor.adoc
@@ -12,9 +12,19 @@ promisor.advertise::
promisor.acceptFromServer::
If set to "all", a client will accept all the promisor remotes
a server might advertise using the "promisor-remote"
- capability. Default is "none", which means no promisor remote
- advertised by a server will be accepted. By accepting a
- promisor remote, the client agrees that the server might omit
- objects that are lazily fetchable from this promisor remote
- from its responses to "fetch" and "clone" requests from the
- client. See linkgit:gitprotocol-v2[5].
+ capability. If set to "knownName" the client will accept
+ promisor remotes which are already configured on the client
+ and have the same name as those advertised by the client. This
+ is not very secure, but could be used in a corporate setup
+ where servers and clients are trusted to not switch name and
+ URLs. If set to "knownUrl", the client will accept promisor
+ remotes which have both the same name and the same URL
+ configured on the client as the name and URL advertised by the
+ server. This is more secure than "all" or "knownName", so it
+ should be used if possible instead of those options. Default
+ is "none", which means no promisor remote advertised by a
+ server will be accepted. By accepting a promisor remote, the
+ client agrees that the server might omit objects that are
+ lazily fetchable from this promisor remote from its responses
+ to "fetch" and "clone" requests from the client. See
+ linkgit:gitprotocol-v2[5].
diff --git a/promisor-remote.c b/promisor-remote.c
index 918be6528f..6a0a61382f 100644
--- a/promisor-remote.c
+++ b/promisor-remote.c
@@ -368,30 +368,73 @@ char *promisor_remote_info(struct repository *repo)
return strbuf_detach(&sb, NULL);
}
+/*
+ * Find first index of 'nicks' where there is 'nick'. 'nick' is
+ * compared case insensitively to the strings in 'nicks'. If not found
+ * 'nicks->nr' is returned.
+ */
+static size_t remote_nick_find(struct strvec *nicks, const char *nick)
+{
+ for (size_t i = 0; i < nicks->nr; i++)
+ if (!strcasecmp(nicks->v[i], nick))
+ return i;
+ return nicks->nr;
+}
+
enum accept_promisor {
ACCEPT_NONE = 0,
+ ACCEPT_KNOWN_URL,
+ ACCEPT_KNOWN_NAME,
ACCEPT_ALL
};
static int should_accept_remote(enum accept_promisor accept,
- const char *remote_name UNUSED,
- const char *remote_url UNUSED)
+ const char *remote_name, const char *remote_url,
+ struct strvec *names, struct strvec *urls)
{
+ size_t i;
+
if (accept == ACCEPT_ALL)
return 1;
- BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+ i = remote_nick_find(names, remote_name);
+
+ if (i >= names->nr)
+ /* We don't know about that remote */
+ return 0;
+
+ if (accept == ACCEPT_KNOWN_NAME)
+ return 1;
+
+ if (accept != ACCEPT_KNOWN_URL)
+ BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+
+ if (!strcmp(urls->v[i], remote_url))
+ return 1;
+
+ warning(_("known remote named '%s' but with url '%s' instead of '%s'"),
+ remote_name, urls->v[i], remote_url);
+
+ return 0;
}
-static void filter_promisor_remote(struct strvec *accepted, const char *info)
+static void filter_promisor_remote(struct repository *repo,
+ struct strvec *accepted,
+ const char *info)
{
struct strbuf **remotes;
const char *accept_str;
enum accept_promisor accept = ACCEPT_NONE;
+ struct strvec names = STRVEC_INIT;
+ struct strvec urls = STRVEC_INIT;
if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
if (!*accept_str || !strcasecmp("None", accept_str))
accept = ACCEPT_NONE;
+ else if (!strcasecmp("KnownUrl", accept_str))
+ accept = ACCEPT_KNOWN_URL;
+ else if (!strcasecmp("KnownName", accept_str))
+ accept = ACCEPT_KNOWN_NAME;
else if (!strcasecmp("All", accept_str))
accept = ACCEPT_ALL;
else
@@ -402,6 +445,9 @@ static void filter_promisor_remote(struct strvec *accepted, const char *info)
if (accept == ACCEPT_NONE)
return;
+ if (accept != ACCEPT_ALL)
+ promisor_info_vecs(repo, &names, &urls);
+
/* Parse remote info received */
remotes = strbuf_split_str(info, ';', 0);
@@ -431,7 +477,7 @@ static void filter_promisor_remote(struct strvec *accepted, const char *info)
if (remote_url)
decoded_url = url_percent_decode(remote_url);
- if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url))
+ if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url, &names, &urls))
strvec_push(accepted, decoded_name);
strbuf_list_free(elems);
@@ -439,6 +485,8 @@ static void filter_promisor_remote(struct strvec *accepted, const char *info)
free(decoded_url);
}
+ strvec_clear(&names);
+ strvec_clear(&urls);
strbuf_list_free(remotes);
}
@@ -447,7 +495,7 @@ char *promisor_remote_reply(const char *info)
struct strvec accepted = STRVEC_INIT;
struct strbuf reply = STRBUF_INIT;
- filter_promisor_remote(&accepted, info);
+ filter_promisor_remote(the_repository, &accepted, info);
if (!accepted.nr)
return NULL;
diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
index 51cf2269e1..d2cc69a17e 100755
--- a/t/t5710-promisor-remote-capability.sh
+++ b/t/t5710-promisor-remote-capability.sh
@@ -160,6 +160,74 @@ test_expect_success "init + fetch with promisor.advertise set to 'true'" '
check_missing_objects server 1 "$oid"
'
+test_expect_success "clone with promisor.acceptfromserver set to 'KnownName'" '
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
+ -c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
+ -c remote.lop.url="file://$(pwd)/lop" \
+ -c promisor.acceptfromserver=KnownName \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
+
+ # Check that the largest object is still missing on the server
+ check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with 'KnownName' and different remote names" '
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.serverTwo.promisor=true \
+ -c remote.serverTwo.fetch="+refs/heads/*:refs/remotes/lop/*" \
+ -c remote.serverTwo.url="file://$(pwd)/lop" \
+ -c promisor.acceptfromserver=KnownName \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
+
+ # Check that the largest object is not missing on the server
+ check_missing_objects server 0 "" &&
+
+ # Reinitialize server so that the largest object is missing again
+ initialize_server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.acceptfromserver set to 'KnownUrl'" '
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
+ -c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
+ -c remote.lop.url="file://$(pwd)/lop" \
+ -c promisor.acceptfromserver=KnownUrl \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
+
+ # Check that the largest object is still missing on the server
+ check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with 'KnownUrl' and different remote urls" '
+ ln -s lop serverTwo &&
+
+ git -C server config promisor.advertise true &&
+
+ # Clone from server to create a client
+ GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
+ -c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
+ -c remote.lop.url="file://$(pwd)/serverTwo" \
+ -c promisor.acceptfromserver=KnownUrl \
+ --no-local --filter="blob:limit=5k" server client &&
+ test_when_finished "rm -rf client" &&
+
+ # Check that the largest object is not missing on the server
+ check_missing_objects server 0 "" &&
+
+ # Reinitialize server so that the largest object is missing again
+ initialize_server 1 "$oid"
+'
+
test_expect_success "clone with promisor.advertise set to 'true' but don't delete the client" '
git -C server config promisor.advertise true &&
--
2.48.1.359.ge980fe0aa2
^ permalink raw reply related [flat|nested] 110+ messages in thread
* [PATCH v5 3/3] doc: add technical design doc for large object promisors
2025-02-18 11:32 ` [PATCH v5 0/3] " Christian Couder
2025-02-18 11:32 ` [PATCH v5 1/3] Add 'promisor-remote' capability to protocol v2 Christian Couder
2025-02-18 11:32 ` [PATCH v5 2/3] promisor-remote: check advertised name or URL Christian Couder
@ 2025-02-18 11:32 ` Christian Couder
2025-02-21 8:33 ` Patrick Steinhardt
2025-02-18 19:07 ` [PATCH v5 0/3] Introduce a "promisor-remote" capability Junio C Hamano
2025-02-21 8:34 ` Patrick Steinhardt
4 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2025-02-18 11:32 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
Randall S . Becker, Christian Couder, Christian Couder
Let's add a design doc about how we could improve handling liarge blobs
using "Large Object Promisors" (LOPs). It's a set of features with the
goal of using special dedicated promisor remotes to store large blobs,
and having them accessed directly by main remotes and clients.
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
.../technical/large-object-promisors.txt | 656 ++++++++++++++++++
1 file changed, 656 insertions(+)
create mode 100644 Documentation/technical/large-object-promisors.txt
diff --git a/Documentation/technical/large-object-promisors.txt b/Documentation/technical/large-object-promisors.txt
new file mode 100644
index 0000000000..ebbbd7c18f
--- /dev/null
+++ b/Documentation/technical/large-object-promisors.txt
@@ -0,0 +1,656 @@
+Large Object Promisors
+======================
+
+Since Git has been created, users have been complaining about issues
+with storing large files in Git. Some solutions have been created to
+help, but they haven't helped much with some issues.
+
+Git currently supports multiple promisor remotes, which could help
+with some of these remaining issues, but it's very hard to use them to
+help, because a number of important features are missing.
+
+The goal of the effort described in this document is to add these
+important features.
+
+We will call a "Large Object Promisor", or "LOP" in short, a promisor
+remote which is used to store only large blobs and which is separate
+from the main remote that should store the other Git objects and the
+rest of the repos.
+
+By extension, we will also call "Large Object Promisor", or LOP, the
+effort described in this document to add a set of features to make it
+easier to handle large blobs/files in Git by using LOPs.
+
+This effort aims to especially improve things on the server side, and
+especially for large blobs that are already compressed in a binary
+format.
+
+This effort aims to provide an alternative to Git LFS
+(https://git-lfs.com/) and similar tools like git-annex
+(https://git-annex.branchable.com/) for handling large files, even
+though a complete alternative would very likely require other efforts
+especially on the client side, where it would likely help to implement
+a new object representation for large blobs as discussed in:
+
+https://lore.kernel.org/git/xmqqbkdometi.fsf@gitster.g/
+
+0) Non goals
+------------
+
+- We will not discuss those client side improvements here, as they
+ would require changes in different parts of Git than this effort.
++
+So we don't pretend to fully replace Git LFS with only this effort,
+but we nevertheless believe that it can significantly improve the
+current situation on the server side, and that other separate
+efforts could also improve the situation on the client side.
+
+- In the same way, we are not going to discuss all the possible ways
+ to implement a LOP or their underlying object storage, or to
+ optimize how LOP works.
++
+Our opinion is that the simplest solution for now is for LOPs to use
+object storage through a remote helper (see section II.2 below for
+more details) to store their objects. So we consider that this is the
+default implementation. If there are improvements on top of this,
+that's great, but our opinion is that such improvements are not
+necessary for LOPs to already be useful. Such improvements are likely
+a different technical topic, and can be taken care of separately
+anyway.
++
+So in particular we are not going to discuss pluggable ODBs or other
+object database backends that could chunk large blobs, dedup the
+chunks and store them efficiently. Sure, that would be a nice
+improvement to store large blobs on the server side, but we believe
+it can just be a separate effort as it's also not technically very
+related to this effort.
++
+We are also not going to discuss data transfer improvements between
+LOPs and clients or servers. Sure, there might be some easy and very
+effective optimizations there (as we know that objects on LOPs are
+very likely incompressible and not deltifying well), but this can be
+dealt with separately in a separate effort.
+
+In other words, the goal of this document is not to talk about all the
+possible ways to optimize how Git could handle large blobs, but to
+describe how a LOP based solution can already work well and alleviate
+a number of current issues in the context of Git clients and servers
+sharing Git objects.
+
+Even if LOPs are used not very efficiently, they can still be useful
+and worth using in some cases because, as we will see in more details
+later in this document:
+
+ - they can make it simpler for clients to use promisor remotes and
+ therefore avoid fetching a lot of large blobs they might not need
+ locally,
+
+ - they can make it significantly cheaper or easier for servers to
+ host a significant part of the current repository content, and
+ even more to host content with larger blobs or more large blobs
+ than currently.
+
+I) Issues with the current situation
+------------------------------------
+
+- Some statistics made on GitLab repos have shown that more than 75%
+ of the disk space is used by blobs that are larger than 1MB and
+ often in a binary format.
+
+- So even if users could use Git LFS or similar tools to store a lot
+ of large blobs out of their repos, it's a fact that in practice they
+ don't do it as much as they probably should.
+
+- On the server side ideally, the server should be able to decide for
+ itself how it stores things. It should not depend on users deciding
+ to use tools like Git LFS on some blobs or not.
+
+- It's much more expensive to store large blobs that don't delta
+ compress well on regular fast seeking drives (like SSDs) than on
+ object storage (like Amazon S3 or GCP Buckets). Using fast drives
+ for regular Git repos makes sense though, as serving regular Git
+ content (blobs containing text or code) needs drives where seeking
+ is fast, but the content is relatively small. On the other hand,
+ object storage for Git LFS blobs makes sense as seeking speed is not
+ as important when dealing with large files, while costs are more
+ important. So the fact that users don't use Git LFS or similar tools
+ for a significant number of large blobs has likely some bad
+ consequences on the cost of repo storage for most Git hosting
+ platforms.
+
+- Having large blobs handled in the same way as other blobs and Git
+ objects in Git repos instead of on object storage also has a cost in
+ increased memory and CPU usage, and therefore decreased performance,
+ when creating packfiles. (This is because Git tries to use delta
+ compression or zlib compression which is unlikely to work well on
+ already compressed binary content.) So it's not just a storage cost
+ increase.
+
+- When a large blob has been committed into a repo, it might not be
+ possible to remove this blob from the repo without rewriting
+ history, even if the user then decides to use Git LFS or a similar
+ tool to handle it.
+
+- In fact Git LFS and similar tools are not very flexible in letting
+ users change their minds about the blobs they should handle or not.
+
+- Even when users are using Git LFS or similar tools, they are often
+ complaining that these tools require significant effort to set up,
+ learn and use correctly.
+
+II) Main features of the "Large Object Promisors" solution
+----------------------------------------------------------
+
+The main features below should give a rough overview of how the
+solution may work. Details about needed elements can be found in
+following sections.
+
+Even if each feature below is very useful for the full solution, it is
+very likely to be also useful on its own in some cases where the full
+solution is not required. However, we'll focus primarily on the big
+picture here.
+
+Also each feature doesn't need to be implemented entirely in Git
+itself. Some could be scripts, hooks or helpers that are not part of
+the Git repo. It would be helpful if those could be shared and
+improved on collaboratively though. So we want to encourage sharing
+them.
+
+1) Large blobs are stored on LOPs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Large blobs should be stored on special promisor remotes that we will
+call "Large Object Promisors" or LOPs. These LOPs should be additional
+remotes dedicated to contain large blobs especially those in binary
+format. They should be used along with main remotes that contain the
+other objects.
+
+Note 1
+++++++
+
+To clarify, a LOP is a normal promisor remote, except that:
+
+- it should store only large blobs,
+
+- it should be separate from the main remote, so that the main remote
+ can focus on serving other objects and the rest of the repos (see
+ feature 4) below) and can use the LOP as a promisor remote for
+ itself.
+
+Note 2
+++++++
+
+Git already makes it possible for a main remote to also be a promisor
+remote storing both regular objects and large blobs for a client that
+clones from it with a filter on blob size. But here we explicitly want
+to avoid that.
+
+Rationale
++++++++++
+
+LOPs aim to be good at handling large blobs while main remotes are
+already good at handling other objects.
+
+Implementation
+++++++++++++++
+
+Git already has support for multiple promisor remotes, see
+link:partial-clone.html#using-many-promisor-remotes[the partial clone documentation].
+
+Also, Git already has support for partial clone using a filter on the
+size of the blobs (with `git clone --filter=blob:limit=<size>`). Most
+of the other main features below are based on these existing features
+and are about making them easy and efficient to use for the purpose of
+better handling large blobs.
+
+2) LOPs can use object storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+LOPs can be implemented using object storage, like an Amazon S3 or GCP
+Bucket or MinIO (which is open source under the GNU AGPLv3 license) to
+actually store the large blobs, and can be accessed through a Git
+remote helper (see linkgit:gitremote-helpers[7]) which makes the
+underlying object storage appear like a remote to Git.
+
+Note
+++++
+
+A LOP can be a promisor remote accessed using a remote helper by
+both some clients and the main remote.
+
+Rationale
++++++++++
+
+This looks like the simplest way to create LOPs that can cheaply
+handle many large blobs.
+
+Implementation
+++++++++++++++
+
+Remote helpers are quite easy to write as shell scripts, but it might
+be more efficient and maintainable to write them using other languages
+like Go.
+
+Some already exist under open source licenses, for example:
+
+ - https://github.com/awslabs/git-remote-s3
+ - https://gitlab.com/eric.p.ju/git-remote-gs
+
+Other ways to implement LOPs are certainly possible, but the goal of
+this document is not to discuss how to best implement a LOP or its
+underlying object storage (see the "0) Non goals" section above).
+
+3) LOP object storage can be Git LFS storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The underlying object storage that a LOP uses could also serve as
+storage for large files handled by Git LFS.
+
+Rationale
++++++++++
+
+This would simplify the server side if it wants to both use a LOP and
+act as a Git LFS server.
+
+4) A main remote can offload to a LOP with a configurable threshold
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+On the server side, a main remote should have a way to offload to a
+LOP all its blobs with a size over a configurable threshold.
+
+Rationale
++++++++++
+
+This makes it easy to set things up and to clean things up. For
+example, an admin could use this to manually convert a repo not using
+LOPs to a repo using a LOP. On a repo already using a LOP but where
+some users would sometimes push large blobs, a cron job could use this
+to regularly make sure the large blobs are moved to the LOP.
+
+Implementation
+++++++++++++++
+
+Using something based on `git repack --filter=...` to separate the
+blobs we want to offload from the other Git objects could be a good
+idea. The missing part is to connect to the LOP, check if the blobs we
+want to offload are already there and if not send them.
+
+5) A main remote should try to remain clean from large blobs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A main remote should try to avoid containing a lot of oversize
+blobs. For that purpose, it should offload as needed to a LOP and it
+should have ways to prevent oversize blobs to be fetched, and also
+perhaps pushed, into it.
+
+Rationale
++++++++++
+
+A main remote containing many oversize blobs would defeat the purpose
+of LOPs.
+
+Implementation
+++++++++++++++
+
+The way to offload to a LOP discussed in 4) above can be used to
+regularly offload oversize blobs. About preventing oversize blobs from
+being fetched into the repo see 6) below. About preventing oversize
+blob pushes, a pre-receive hook could be used.
+
+Also there are different scenarios in which large blobs could get
+fetched into the main remote, for example:
+
+- A client that doesn't implement the "promisor-remote" protocol
+ (described in 6) below) clones from the main remote.
+
+- The main remote gets a request for information about a large blob
+ and is not able to get that information without fetching the blob
+ from the LOP.
+
+It might not be possible to completely prevent all these scenarios
+from happening. So the goal here should be to implement features that
+make the fetching of large blobs less likely. For example adding a
+`remote-object-info` command in the `git cat-file --batch` protocol
+and its variants might make it possible for a main repo to respond to
+some requests about large blobs without fetching them.
+
+6) A protocol negotiation should happen when a client clones
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a client clones from a main repo, there should be a protocol
+negotiation so that the server can advertise one or more LOPs and so
+that the client and the server can discuss if the client could
+directly use a LOP the server is advertising. If the client and the
+server can agree on that, then the client would be able to get the
+large blobs directly from the LOP and the server would not need to
+fetch those blobs from the LOP to be able to serve the client.
+
+Note
+++++
+
+For fetches instead of clones, a protocol negotiation might not always
+happen, see the "What about fetches?" FAQ entry below for details.
+
+Rationale
++++++++++
+
+Security, configurability and efficiency of setting things up.
+
+Implementation
+++++++++++++++
+
+A "promisor-remote" protocol v2 capability looks like a good way to
+implement this. The way the client and server use this capability
+could be controlled by configuration variables.
+
+Information that the server could send to the client through that
+protocol could be things like: LOP name, LOP URL, filter-spec (for
+example `blob:limit=<size>`) or just size limit that should be used as
+a filter when cloning, token to be used with the LOP, etc.
+
+7) A client can offload to a LOP
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a client is using a LOP that is also a LOP of its main remote,
+the client should be able to offload some large blobs it has fetched,
+but might not need anymore, to the LOP.
+
+Note
+++++
+
+It might depend on the context if it should be OK or not for clients
+to offload large blobs they have created, instead of fetched, directly
+to the LOP without the main remote checking them in some ways
+(possibly using hooks or other tools).
+
+This should be discussed and refined when we get closer to
+implementing this feature.
+
+Rationale
++++++++++
+
+On the client, the easiest way to deal with unneeded large blobs is to
+offload them.
+
+Implementation
+++++++++++++++
+
+This is very similar to what 4) above is about, except on the client
+side instead of the server side. So a good solution to 4) could likely
+be adapted to work on the client side too.
+
+There might be some security issues here, as there is no negotiation,
+but they might be mitigated if the client can reuse a token it got
+when cloning (see 6) above). Also if the large blobs were fetched from
+a LOP, it is likely, and can easily be confirmed, that the LOP still
+has them, so that they can just be removed from the client.
+
+III) Benefits of using LOPs
+---------------------------
+
+Many benefits are related to the issues discussed in "I) Issues with
+the current situation" above:
+
+- No need to rewrite history when deciding which blobs are worth
+ handling separately than other objects, or when moving or removing
+ the threshold.
+
+- If the protocol between client and server is developed and secured
+ enough, then many details might be setup on the server side only and
+ all the clients could then easily get all the configuration
+ information and use it to set themselves up mostly automatically.
+
+- Storage costs benefits on the server side.
+
+- Reduced memory and CPU needs on main remotes on the server side.
+
+- Reduced storage needs on the client side.
+
+IV) FAQ
+-------
+
+What about using multiple LOPs on the server and client side?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+That could perhaps be useful in some cases, but for now it's more
+likely that in most cases a single LOP will be advertised by the
+server and should be used by the client.
+
+A case where it could be useful for a server to advertise multiple
+LOPs is if a LOP is better for some users while a different LOP is
+better for other users. For example some clients might have a better
+connection to a LOP than others.
+
+In those cases it's the responsibility of the server to have some
+documentation to help clients. It could say for example something like
+"Users in this part of the world might want to pick only LOP A as it
+is likely to be better connected to them, while users in other parts
+of the world should pick only LOP B for the same reason."
+
+When should we trust or not trust the LOPs advertised by the server?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In some contexts, like in corporate setup where the server and all the
+clients are parts of an internal network in a company where admins
+have all the rights on every system, it's OK, and perhaps even a good
+thing, if the clients fully trust the server, as it can help ensure
+that all the clients are on the same page.
+
+There are also contexts in which clients trust a code hosting platform
+serving them some repos, but might not fully trust other users
+managing or contributing to some of these repos. For example, the code
+hosting platform could have hooks in place to check that any object it
+receives doesn't contain malware or otherwise bad content. In this
+case it might be OK for the client to use a main remote and its LOP if
+they are both hosted by the code hosting platform, but not if the LOP
+is hosted elsewhere (where the content is not checked).
+
+In other contexts, a client should just not trust a server.
+
+So there should be different ways to configure how the client should
+behave when a server advertises a LOP to it at clone time.
+
+As the basic elements that a server can advertise about a LOP are a
+LOP name and a LOP URL, the client should base its decision about
+accepting a LOP on these elements.
+
+One simple way to be very strict in the LOP it accepts is for example
+for the client to check that the LOP is already configured on the
+client with the same name and URL as what the server advertises.
+
+In general default and "safe" settings should require that the LOP are
+configured on the client separately from the "promisor-remote"
+protocol and that the client accepts a LOP only when information about
+it from the protocol matches what has been already configured
+separately.
+
+What about LOP names?
+~~~~~~~~~~~~~~~~~~~~~
+
+In some contexts, for example if the clients sometimes fetch from each
+other, it can be a good idea for all the clients to use the same names
+for all the remotes they use, including LOPs.
+
+In other contexts, each client might want to be able to give the name
+it wants to each remote, including each LOP, it interacts with.
+
+So there should be different ways to configure how the client accepts
+or not the LOP name the server advertises.
+
+If a default or "safe" setting is used, then as such a setting should
+require that the LOP be configured separately, then the name would be
+configured separately and there is no risk that the server could
+dictate a name to a client.
+
+Could the main remote be bogged down by old or paranoid clients?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Yes, it could happen if there are too many clients that are either
+unwilling to trust the main remote or that just don't implement the
+"promisor-remote" protocol because they are too old or not fully
+compatible with the 'git' client.
+
+When serving such a client, the main remote has no other choice than
+to first fetch from its LOP, to then be able to provide to the client
+everything it requested. So the main remote, even if it has cleanup
+mechanisms (see section II.4 above), would be burdened at least
+temporarily with the large blobs it had to fetch from its LOP.
+
+Not behaving like this would be breaking backward compatibility, and
+could be seen as segregating clients. For example, it might be
+possible to implement a special mode that allows the server to just
+reject clients that don't implement the "promisor-remote" protocol or
+aren't willing to trust the main remote. This mode might be useful in
+a special context like a corporate environment. There is no plan to
+implement such a mode though, and this should be discussed separately
+later anyway.
+
+A better way to proceed is probably for the main remote to show a
+message telling clients that don't implement the protocol or are
+unwilling to accept the advertised LOP(s) that they would get faster
+clone and fetches by upgrading client software or properly setting
+them up to accept LOP(s).
+
+Waiting for clients to upgrade, monitoring these upgrades and limiting
+the use of LOPs to repos that are not very frequently accessed might
+be other good ways to make sure that some benefits are still reaped
+from LOPs. Over time, as more and more clients upgrade and benefit
+from LOPs, using them in more and more frequently accessed repos will
+become worth it.
+
+Corporate environments, where it might be easier to make sure that all
+the clients are up-to-date and properly configured, could hopefully
+benefit more and earlier from using LOPs.
+
+What about fetches?
+~~~~~~~~~~~~~~~~~~~
+
+There are different kinds of fetches. A regular fetch happens when
+some refs have been updated on the server and the client wants the ref
+updates and possibly the new objects added with them. A "backfill" or
+"lazy" fetch, on the contrary, happens when the client needs to use
+some objects it already knows about but doesn't have because they are
+on a promisor remote.
+
+Regular fetch
++++++++++++++
+
+In a regular fetch, the client will contact the main remote and a
+protocol negotiation will happen between them. It's a good thing that
+a protocol negotiation happens every time, as the configuration on the
+client or the main remote could have changed since the previous
+protocol negotiation. In this case, the new protocol negotiation
+should ensure that the new fetch will happen in a way that satisfies
+the new configuration of both the client and the server.
+
+In most cases though, the configurations on the client and the main
+remote will not have changed between 2 fetches or between the initial
+clone and a subsequent fetch. This means that the result of a new
+protocol negotiation will be the same as the previous result, so the
+new fetch will happen in the same way as the previous clone or fetch,
+using, or not using, the same LOP(s) as last time.
+
+"Backfill" or "lazy" fetch
+++++++++++++++++++++++++++
+
+When there is a backfill fetch, the client doesn't necessarily contact
+the main remote first. It will try to fetch from its promisor remotes
+in the order they appear in the config file, except that a remote
+configured using the `extensions.partialClone` config variable will be
+tried last. See
+link:partial-clone.html#using-many-promisor-remotes[the partial clone documentation].
+
+This is not new with this effort. In fact this is how multiple remotes
+have already been working for around 5 years.
+
+When using LOPs, having the main remote configured using
+`extensions.partialClone`, so it's tried last, makes sense, as missing
+objects should only be large blobs that are on LOPs.
+
+This means that a protocol negotiation will likely not happen as the
+missing objects will be fetched from the LOPs, and then there will be
+nothing left to fetch from the main remote.
+
+To secure that, it could be a good idea for LOPs to require a token
+from the client when it fetches from them. The client could get the
+token when performing a protocol negotiation with the main remote (see
+section II.6 above).
+
+V) Future improvements
+----------------------
+
+It is expected that at the beginning using LOPs will be mostly worth
+it either in a corporate context where the Git version that clients
+use can easily be controlled, or on repos that are infrequently
+accessed. (See the "Could the main remote be bogged down by old or
+paranoid clients?" section in the FAQ above.)
+
+Over time, as more and more clients upgrade to a version that
+implements the "promisor-remote" protocol v2 capability described
+above in section II.6), it will be worth it to use LOPs more widely.
+
+A lot of improvements may also help using LOPs more widely. Some of
+these improvements are part of the scope of this document like the
+following:
+
+ - Implementing a "remote-object-info" command in the
+ `git cat-file --batch` protocol and its variants to allow main
+ remotes to respond to requests about large blobs without fetching
+ them. (Eric Ju has started working on this based on previous work
+ by Calvin Wan.)
+
+ - Creating better cleanup and offload mechanisms for main remotes
+ and clients to prevent accumulation of large blobs.
+
+ - Developing more sophisticated protocol negotiation capabilities
+ between clients and servers for handling LOPs, for example adding
+ a filter-spec (e.g., blob:limit=<size>) or size limit for
+ filtering when cloning, or adding a token for LOP authentication.
+
+ - Improving security measures for LOP access, particularly around
+ token handling and authentication.
+
+ - Developing standardized ways to configure and manage multiple LOPs
+ across different environments. Especially in the case where
+ different LOPs serve the same content to clients in different
+ geographical locations, there is a need for replication or
+ synchronization between LOPs.
+
+Some improvements, including some that have been mentioned in the "0)
+Non Goals" section of this document, are out of the scope of this
+document:
+
+ - Implementing a new object representation for large blobs on the
+ client side.
+
+ - Developing pluggable ODBs or other object database backends that
+ could chunk large blobs, dedup the chunks and store them
+ efficiently.
+
+ - Optimizing data transfer between LOPs and clients/servers,
+ particularly for incompressible and non-deltifying content.
+
+ - Creating improved client side tools for managing large objects
+ more effectively, for example tools for migrating from Git LFS or
+ git-annex, or tools to find which objects could be offloaded and
+ how much disk space could be reclaimed by offloading them.
+
+Some improvements could be seen as part of the scope of this document,
+but might already have their own separate projects from the Git
+project, like:
+
+ - Improving existing remote helpers to access object storage or
+ developing new ones.
+
+ - Improving existing object storage solutions or developing new
+ ones.
+
+Even though all the above improvements may help, this document and the
+LOP effort should try to focus, at least first, on a relatively small
+number of improvements mostly those that are in its current scope.
+
+For example introducing pluggable ODBs and a new object database
+backend is likely a multi-year effort on its own that can happen
+separately in parallel. It has different technical requirements,
+touches other part of the Git code base and should have its own design
+document(s).
--
2.48.1.359.ge980fe0aa2
^ permalink raw reply related [flat|nested] 110+ messages in thread
* Re: [PATCH v5 3/3] doc: add technical design doc for large object promisors
2025-02-18 11:32 ` [PATCH v5 3/3] doc: add technical design doc for large object promisors Christian Couder
@ 2025-02-21 8:33 ` Patrick Steinhardt
2025-03-03 16:58 ` Junio C Hamano
0 siblings, 1 reply; 110+ messages in thread
From: Patrick Steinhardt @ 2025-02-21 8:33 UTC (permalink / raw)
To: Christian Couder
Cc: git, Junio C Hamano, Taylor Blau, Eric Sunshine, Karthik Nayak,
Kristoffer Haugsbakk, brian m . carlson, Randall S . Becker,
Christian Couder
On Tue, Feb 18, 2025 at 12:32:04PM +0100, Christian Couder wrote:
> diff --git a/Documentation/technical/large-object-promisors.txt b/Documentation/technical/large-object-promisors.txt
> new file mode 100644
> index 0000000000..ebbbd7c18f
> --- /dev/null
> +++ b/Documentation/technical/large-object-promisors.txt
> @@ -0,0 +1,656 @@
> +In other words, the goal of this document is not to talk about all the
> +possible ways to optimize how Git could handle large blobs, but to
> +describe how a LOP based solution can already work well and alleviate
> +a number of current issues in the context of Git clients and servers
> +sharing Git objects.
> +
> +Even if LOPs are used not very efficiently, they can still be useful
> +and worth using in some cases because, as we will see in more details
s/because//
Patrick
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v5 3/3] doc: add technical design doc for large object promisors
2025-02-21 8:33 ` Patrick Steinhardt
@ 2025-03-03 16:58 ` Junio C Hamano
0 siblings, 0 replies; 110+ messages in thread
From: Junio C Hamano @ 2025-03-03 16:58 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: Christian Couder, git, Taylor Blau, Eric Sunshine, Karthik Nayak,
Kristoffer Haugsbakk, brian m . carlson, Randall S . Becker,
Christian Couder
Patrick Steinhardt <ps@pks.im> writes:
> On Tue, Feb 18, 2025 at 12:32:04PM +0100, Christian Couder wrote:
>> diff --git a/Documentation/technical/large-object-promisors.txt b/Documentation/technical/large-object-promisors.txt
>> new file mode 100644
>> index 0000000000..ebbbd7c18f
>> --- /dev/null
>> +++ b/Documentation/technical/large-object-promisors.txt
>> @@ -0,0 +1,656 @@
>> +In other words, the goal of this document is not to talk about all the
>> +possible ways to optimize how Git could handle large blobs, but to
>> +describe how a LOP based solution can already work well and alleviate
>> +a number of current issues in the context of Git clients and servers
>> +sharing Git objects.
>> +
>> +Even if LOPs are used not very efficiently, they can still be useful
>> +and worth using in some cases because, as we will see in more details
>
> s/because//
I've squashed this in and it seems everything is in order in this
topic, so let's mark it for 'next' now.
Thanks, all.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v5 0/3] Introduce a "promisor-remote" capability
2025-02-18 11:32 ` [PATCH v5 0/3] " Christian Couder
` (2 preceding siblings ...)
2025-02-18 11:32 ` [PATCH v5 3/3] doc: add technical design doc for large object promisors Christian Couder
@ 2025-02-18 19:07 ` Junio C Hamano
2025-02-21 8:34 ` Patrick Steinhardt
4 siblings, 0 replies; 110+ messages in thread
From: Junio C Hamano @ 2025-02-18 19:07 UTC (permalink / raw)
To: Christian Couder
Cc: git, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
Randall S . Becker
Christian Couder <christian.couder@gmail.com> writes:
> Changes compared to version 4
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> - The series is rebased on top 0394451348 (The eleventh batch,
> 2025-02-14). This is to take into account some recent changes like
> some documentation files using the ".adoc" extension instead of
> ".txt".
That would make it easier to work for you and anybody who wants to
improve on these changes, which is very much welcome. The topic is
not a maint material to fix anything, so the rebase is pretty much
welcome.
> - Patches 1/6, 2/6 and 3/6 from version 4 have been removed, as it
> looks like using redact_non_printables() is not necessary after
> all.
That would make my work a lot simpler ;-) I had to juggle the two
topics every time one of them changed.
Will queue.
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v5 0/3] Introduce a "promisor-remote" capability
2025-02-18 11:32 ` [PATCH v5 0/3] " Christian Couder
` (3 preceding siblings ...)
2025-02-18 19:07 ` [PATCH v5 0/3] Introduce a "promisor-remote" capability Junio C Hamano
@ 2025-02-21 8:34 ` Patrick Steinhardt
2025-02-21 18:40 ` Junio C Hamano
4 siblings, 1 reply; 110+ messages in thread
From: Patrick Steinhardt @ 2025-02-21 8:34 UTC (permalink / raw)
To: Christian Couder
Cc: git, Junio C Hamano, Taylor Blau, Eric Sunshine, Karthik Nayak,
Kristoffer Haugsbakk, brian m . carlson, Randall S . Becker
On Tue, Feb 18, 2025 at 12:32:01PM +0100, Christian Couder wrote:
> This work is part of some effort to better handle large files/blobs in
> a client-server context using promisor remotes dedicated to storing
> large blobs. To help understand this effort, this series now contains
> a patch (patch 6/6) that adds design documentation about this effort.
>
> Last year, I sent 3 versions of a patch series with the goal of
> allowing a client C to clone from a server S while using the same
> promisor remote X that S already use. See:
>
> https://lore.kernel.org/git/20240418184043.2900955-1-christian.couder@gmail.com/
>
> Junio suggested to implement that feature using:
>
> "a protocol extension that lets S tell C that S wants C to fetch
> missing objects from X (which means that if C knows about X in its
> ".git/config" then there is no need for end-user interaction at all),
> or a protocol extension that C tells S that C is willing to see
> objects available from X omitted when S does not have them (again,
> this could be done by looking at ".git/config" at C, but there may be
> security implications???)"
>
> This patch series implements that protocol extension called
> "promisor-remote" (that name is open to change or simplification)
> which allows S and C to agree on C using X directly or not.
>
> I have tried to implement it in a quite generic way that could allow S
> and C to share more information about promisor remotes and how to use
> them.
>
> For now, C doesn't use the information it gets from S when cloning.
> That information is only used to decide if C is OK to use the promisor
> remotes advertised by S. But this could change in the future which
> could make it much simpler for clients than using the current way of
> passing information about X with the `-c` option of `git clone` many
> times on the command line.
>
> Another improvement could be to not require GIT_NO_LAZY_FETCH=0 when S
> and C have agreed on using S.
I'm fine with this version of the patch series. There are a couple of
features that we probably want to have eventually:
- Persisting announced promisors. As far as I understand, we don't yet
write them into the client-side configuration of the repository at
all.
- Promisor remote agility. When the set of announced promisors
changes, we should optionally update the set of promisors connected
to that remote on the client-side.
- Authentication. In case the promisor remote requires authentication
we'll somehow need to communicate the credentials to the client.
All of these feel like topics that can be implemented incrementally once
the foundation has landed, so I don't think they have to be implemented
as part of the patch series here. I also don't see anything obvious that
would block any of these features with the current design.
Thanks for working on this!
Patrick
^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [PATCH v5 0/3] Introduce a "promisor-remote" capability
2025-02-21 8:34 ` Patrick Steinhardt
@ 2025-02-21 18:40 ` Junio C Hamano
0 siblings, 0 replies; 110+ messages in thread
From: Junio C Hamano @ 2025-02-21 18:40 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: Christian Couder, git, Taylor Blau, Eric Sunshine, Karthik Nayak,
Kristoffer Haugsbakk, brian m . carlson, Randall S . Becker
Patrick Steinhardt <ps@pks.im> writes:
> I'm fine with this version of the patch series. There are a couple of
> features that we probably want to have eventually:
>
> - Persisting announced promisors. As far as I understand, we don't yet
> write them into the client-side configuration of the repository at
> all.
>
> - Promisor remote agility. When the set of announced promisors
> changes, we should optionally update the set of promisors connected
> to that remote on the client-side.
>
> - Authentication. In case the promisor remote requires authentication
> we'll somehow need to communicate the credentials to the client.
>
> All of these feel like topics that can be implemented incrementally once
> the foundation has landed, so I don't think they have to be implemented
> as part of the patch series here. I also don't see anything obvious that
> would block any of these features with the current design.
All of them smell like with grave security implications to me.
I am happy to see none of them are included in this round, as
getting the details of them right would take a lot of time and
effort; it is great to have the fundamentals first without having to
worry about them.
> Thanks for working on this!
Likewise.
^ permalink raw reply [flat|nested] 110+ messages in thread