[PATCH 0/4] Introduce a "promisor-remote" capability

* [PATCH 0/4] Introduce a "promisor-remote" capability
@ 2024-07-31 13:40 Christian Couder
  2024-07-31 13:40 ` [PATCH 1/4] version: refactor strbuf_sanitize() Christian Couder
                   ` (6 more replies)
  0 siblings, 7 replies; 110+ messages in thread
From: Christian Couder @ 2024-07-31 13:40 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Christian Couder

Earlier this year, I sent 3 versions of a patch series with the goal
of allowing a client C to clone from a server S while using the same
promisor remote X that S already uses. See:

https://lore.kernel.org/git/20240418184043.2900955-1-christian.couder@gmail.com/

Junio suggested implementing that feature instead using:

"a protocol extension that lets S tell C that S wants C to fetch
missing objects from X (which means that if C knows about X in its
".git/config" then there is no need for end-user interaction at all),
or a protocol extension that C tells S that C is willing to see
objects available from X omitted when S does not have them (again,
this could be done by looking at ".git/config" at C, but there may be
security implications???)"

This patch series implements that protocol extension called
"promisor-remote" (that name is open to change or simplification)
which allows S and C to agree on C using X directly or not.

I have tried to implement it in a quite generic way that could allow S
and C to share more information about promisor remotes and how to use
them.

For now, C doesn't use the information it gets from S when cloning.
That information is only used to decide whether C is OK with using the
promisor remotes advertised by S. This could change in the future,
which would make things much simpler for clients than the current
approach of passing information about X on the command line with
many `-c` options to `git clone`.

Another improvement could be to not require GIT_NO_LAZY_FETCH=0 when S
and C have agreed on using X.

Christian Couder (4):
  version: refactor strbuf_sanitize()
  strbuf: refactor strbuf_trim_trailing_ch()
  Add 'promisor-remote' capability to protocol v2
  promisor-remote: check advertised name or URL

 Documentation/config/promisor.txt     |  18 ++
 Documentation/gitprotocol-v2.txt      |  37 +++++
 connect.c                             |   7 +
 promisor-remote.c                     | 228 ++++++++++++++++++++++++++
 promisor-remote.h                     |  26 ++-
 serve.c                               |  21 +++
 strbuf.c                              |  16 ++
 strbuf.h                              |  10 ++
 t/t5555-http-smart-common.sh          |   1 +
 t/t5701-git-serve.sh                  |   1 +
 t/t5710-promisor-remote-capability.sh | 192 ++++++++++++++++++++++
 trace2/tr2_cfg.c                      |  10 +-
 upload-pack.c                         |   3 +
 version.c                             |   9 +-
 14 files changed, 563 insertions(+), 16 deletions(-)
 create mode 100755 t/t5710-promisor-remote-capability.sh

-- 
2.46.0.4.gbcb884ee16

* [PATCH 1/4] version: refactor strbuf_sanitize()
  2024-07-31 13:40 [PATCH 0/4] Introduce a "promisor-remote" capability Christian Couder
@ 2024-07-31 13:40 ` Christian Couder
  2024-07-31 17:18   ` Junio C Hamano
  2024-07-31 13:40 ` [PATCH 2/4] strbuf: refactor strbuf_trim_trailing_ch() Christian Couder
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2024-07-31 13:40 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Christian Couder,
	Christian Couder

The git_user_agent_sanitized() function performs some sanitizing to
avoid special characters being sent over the line and possibly messing
up the protocol or the parsing on the other side.

Let's extract this sanitizing into a new strbuf_sanitize() function, as
we will want to reuse it in a following patch, and let's put it into
strbuf.{c,h}.

While at it, let's also make a few small improvements:
  - use 'size_t' for 'i' instead of 'int',
  - move the declaration of 'i' inside the 'for ( ... )',
  - use strbuf_detach() to explicitly detach the string contained by
    the 'sb' strbuf.
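
To illustrate the sanitizing rule described above outside of Git's
tree, here is a minimal standalone sketch. It works on a plain
NUL-terminated buffer instead of a 'struct strbuf', and the names
sanitize_in_place() and is_blank() are hypothetical. The whitespace
set used for trimming is an approximation of what strbuf_trim() does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Approximation of the whitespace that strbuf_trim() removes. */
static int is_blank(char c)
{
	return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/*
 * Hypothetical helper: trim the string, then replace every byte with
 * a value <= 32 or >= 127 by a dot, as strbuf_sanitize() does.
 */
static void sanitize_in_place(char *s)
{
	size_t len, start = 0;

	assert(s != NULL);
	len = strlen(s);

	/* Trim trailing, then leading whitespace. */
	while (len > 0 && is_blank(s[len - 1]))
		len--;
	while (start < len && is_blank(s[start]))
		start++;
	memmove(s, s + start, len - start);
	len -= start;
	s[len] = '\0';

	/* Replace anything that is not a printable, non-space ASCII byte. */
	for (size_t i = 0; i < len; i++) {
		unsigned char c = (unsigned char)s[i];
		if (c <= 32 || c >= 127)
			s[i] = '.';
	}
}
```

With this sketch, an agent string with inner spaces keeps its length
but has each space replaced by a dot, so the value can be sent as a
single capability token.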

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 strbuf.c  | 9 +++++++++
 strbuf.h  | 7 +++++++
 version.c | 9 ++-------
 3 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/strbuf.c b/strbuf.c
index 3d2189a7f6..cccfdec0e3 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -1082,3 +1082,12 @@ void strbuf_strip_file_from_path(struct strbuf *sb)
 	char *path_sep = find_last_dir_sep(sb->buf);
 	strbuf_setlen(sb, path_sep ? path_sep - sb->buf + 1 : 0);
 }
+
+void strbuf_sanitize(struct strbuf *sb)
+{
+	strbuf_trim(sb);
+	for (size_t i = 0; i < sb->len; i++) {
+		if (sb->buf[i] <= 32 || sb->buf[i] >= 127)
+			sb->buf[i] = '.';
+	}
+}
diff --git a/strbuf.h b/strbuf.h
index 003f880ff7..884157873e 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -664,6 +664,13 @@ typedef int (*char_predicate)(char ch);
 void strbuf_addstr_urlencode(struct strbuf *sb, const char *name,
 			     char_predicate allow_unencoded_fn);
 
+/*
+ * Trim and replace each character with ascii code below 32 or above
+ * 127 (included) using a dot '.' character. Useful for sending
+ * capabilities.
+ */
+void strbuf_sanitize(struct strbuf *sb);
+
 __attribute__((format (printf,1,2)))
 int printf_ln(const char *fmt, ...);
 __attribute__((format (printf,2,3)))
diff --git a/version.c b/version.c
index 41b718c29e..951e6dca74 100644
--- a/version.c
+++ b/version.c
@@ -24,15 +24,10 @@ const char *git_user_agent_sanitized(void)
 
 	if (!agent) {
 		struct strbuf buf = STRBUF_INIT;
-		int i;
 
 		strbuf_addstr(&buf, git_user_agent());
-		strbuf_trim(&buf);
-		for (i = 0; i < buf.len; i++) {
-			if (buf.buf[i] <= 32 || buf.buf[i] >= 127)
-				buf.buf[i] = '.';
-		}
-		agent = buf.buf;
+		strbuf_sanitize(&buf);
+		agent = strbuf_detach(&buf, NULL);
 	}
 
 	return agent;
-- 
2.46.0.4.gbcb884ee16


* [PATCH 2/4] strbuf: refactor strbuf_trim_trailing_ch()
  2024-07-31 13:40 [PATCH 0/4] Introduce a "promisor-remote" capability Christian Couder
  2024-07-31 13:40 ` [PATCH 1/4] version: refactor strbuf_sanitize() Christian Couder
@ 2024-07-31 13:40 ` Christian Couder
  2024-07-31 17:29   ` Junio C Hamano
  2024-07-31 13:40 ` [PATCH 3/4] Add 'promisor-remote' capability to protocol v2 Christian Couder
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2024-07-31 13:40 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Christian Couder,
	Christian Couder

We often have to split strings at some specified terminator character.
The strbuf_split*() functions, which we can use for this purpose,
return substrings that include the terminator character, so we often
need to remove that character.

When it is a whitespace, newline or directory separator, the
terminator character can easily be removed using an existing trimming
function like strbuf_rtrim(), strbuf_trim_trailing_newline() or
strbuf_trim_trailing_dir_sep(). There is no function to remove that
character when it's not one of those characters though.

Let's introduce a new strbuf_trim_trailing_ch() function that can be
used to remove any trailing character, and let's refactor existing code
that manually removed trailing characters using this new function.

We are also going to use this new function in a following commit.
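
As a rough standalone illustration of the behavior, here is a sketch
on a plain C string rather than a 'struct strbuf'. The name
trim_trailing_ch() is hypothetical. Note that, like the new function
in this patch, it strips every trailing occurrence of the character,
while the code it replaces in trace2/tr2_cfg.c stripped at most one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: remove all trailing occurrences of 'c' from
 * the NUL-terminated string 's', in place.
 */
static void trim_trailing_ch(char *s, int c)
{
	size_t len;

	assert(s != NULL);
	len = strlen(s);
	while (len > 0 && s[len - 1] == c)
		len--;
	s[len] = '\0';
}
```

After splitting "core.*,status.*" at ',', the first piece is "core.*,"
and passing it through such a helper with ',' leaves "core.*".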

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 strbuf.c         |  7 +++++++
 strbuf.h         |  3 +++
 trace2/tr2_cfg.c | 10 ++--------
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/strbuf.c b/strbuf.c
index cccfdec0e3..c986ec28f4 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -134,6 +134,13 @@ void strbuf_trim_trailing_dir_sep(struct strbuf *sb)
 	sb->buf[sb->len] = '\0';
 }
 
+void strbuf_trim_trailing_ch(struct strbuf *sb, int c)
+{
+	while (sb->len > 0 && sb->buf[sb->len - 1] == c)
+		sb->len--;
+	sb->buf[sb->len] = '\0';
+}
+
 void strbuf_trim_trailing_newline(struct strbuf *sb)
 {
 	if (sb->len > 0 && sb->buf[sb->len - 1] == '\n') {
diff --git a/strbuf.h b/strbuf.h
index 884157873e..5e389ab065 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -197,6 +197,9 @@ void strbuf_trim_trailing_dir_sep(struct strbuf *sb);
 /* Strip trailing LF or CR/LF */
 void strbuf_trim_trailing_newline(struct strbuf *sb);
 
+/* Strip trailing character c */
+void strbuf_trim_trailing_ch(struct strbuf *sb, int c);
+
 /**
  * Replace the contents of the strbuf with a reencoded form.  Returns -1
  * on error, 0 on success.
diff --git a/trace2/tr2_cfg.c b/trace2/tr2_cfg.c
index d96d908bb9..356fcd38f4 100644
--- a/trace2/tr2_cfg.c
+++ b/trace2/tr2_cfg.c
@@ -33,10 +33,7 @@ static int tr2_cfg_load_patterns(void)
 
 	tr2_cfg_patterns = strbuf_split_buf(envvar, strlen(envvar), ',', -1);
 	for (s = tr2_cfg_patterns; *s; s++) {
-		struct strbuf *buf = *s;
-
-		if (buf->len && buf->buf[buf->len - 1] == ',')
-			strbuf_setlen(buf, buf->len - 1);
+		strbuf_trim_trailing_ch(*s, ',');
 		strbuf_trim_trailing_newline(*s);
 		strbuf_trim(*s);
 	}
@@ -72,10 +69,7 @@ static int tr2_load_env_vars(void)
 
 	tr2_cfg_env_vars = strbuf_split_buf(varlist, strlen(varlist), ',', -1);
 	for (s = tr2_cfg_env_vars; *s; s++) {
-		struct strbuf *buf = *s;
-
-		if (buf->len && buf->buf[buf->len - 1] == ',')
-			strbuf_setlen(buf, buf->len - 1);
+		strbuf_trim_trailing_ch(*s, ',');
 		strbuf_trim_trailing_newline(*s);
 		strbuf_trim(*s);
 	}
-- 
2.46.0.4.gbcb884ee16


* [PATCH 3/4] Add 'promisor-remote' capability to protocol v2
  2024-07-31 13:40 [PATCH 0/4] Introduce a "promisor-remote" capability Christian Couder
  2024-07-31 13:40 ` [PATCH 1/4] version: refactor strbuf_sanitize() Christian Couder
  2024-07-31 13:40 ` [PATCH 2/4] strbuf: refactor strbuf_trim_trailing_ch() Christian Couder
@ 2024-07-31 13:40 ` Christian Couder
  2024-07-31 15:40   ` Taylor Blau
                     ` (3 more replies)
  2024-07-31 13:40 ` [PATCH 4/4] promisor-remote: check advertised name or URL Christian Couder
                   ` (3 subsequent siblings)
  6 siblings, 4 replies; 110+ messages in thread
From: Christian Couder @ 2024-07-31 13:40 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Christian Couder,
	Christian Couder

When a server repository S borrows some objects from a promisor remote X,
then a client repository C which would like to clone or fetch from S might,
or might not, want to also borrow objects from X. Also S might, or might
not, want to advertise X as a good way for C to get objects directly,
instead of C getting everything through S.

To allow S and C to agree on C using X or not, let's introduce a new
"promisor-remote" capability in the protocol v2, as well as a few new
configuration variables:

  - "promisor.advertise" on the server side, and:
  - "promisor.acceptFromServer" on the client side.

By default, or if "promisor.advertise" is set to 'false', a server S will
advertise only the "promisor-remote" capability without passing any
argument through this capability. This means that S supports the new
capability but doesn't wish any client C to directly access any promisor
remote X S might use.

If "promisor.advertise" is set to 'true', S will advertise its promisor
remotes with a string like:

  promisor-remote=<pm-info>[;<pm-info>]...

where each <pm-info> element contains information about a single
promisor remote in the form:

  name=<pm-name>[,url=<pm-url>]

where <pm-name> is the name of a promisor remote and <pm-url> is the
urlencoded url of the promisor remote named <pm-name>.
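
The following standalone sketch shows how such a value could be built
from a list of remotes. It is not the code from this patch: it uses
fixed-size buffers instead of strbufs, append_pr_info() is a
hypothetical name, and the lowercase hex digits in the percent
encoding are an assumption of the sketch. allow_unencoded() mirrors
the allow_unsanitized() predicate of the patch, so that ',', ';' and
'%' in a URL cannot be confused with the separators.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same rule as allow_unsanitized() in the patch. */
static int allow_unencoded(char ch)
{
	if (ch == ',' || ch == ';' || ch == '%')
		return 0;
	return ch > 32 && ch < 127;
}

/*
 * Hypothetical helper: append one "name=<name>[,url=<encoded-url>]"
 * element to the NUL-terminated string in 'out', preceded by ';' if
 * 'out' is not empty. Returns 0 on success, or -1 if 'out' is too
 * small (in which case 'out' may hold a truncated element).
 */
static int append_pr_info(char *out, size_t outsz,
			  const char *name, const char *url)
{
	size_t len;
	int n;

	assert(out != NULL && name != NULL);
	len = strlen(out);

	n = snprintf(out + len, outsz - len, "%sname=%s", len ? ";" : "", name);
	if (n < 0 || (size_t)n >= outsz - len)
		return -1;
	len += (size_t)n;

	if (!url)
		return 0;

	n = snprintf(out + len, outsz - len, ",url=");
	if (n < 0 || (size_t)n >= outsz - len)
		return -1;
	len += (size_t)n;

	for (; *url; url++) {
		if (allow_unencoded(*url))
			n = snprintf(out + len, outsz - len, "%c", *url);
		else
			n = snprintf(out + len, outsz - len, "%%%02x",
				     (unsigned)(unsigned char)*url);
		if (n < 0 || (size_t)n >= outsz - len)
			return -1;
		len += (size_t)n;
	}
	return 0;
}
```

Calling it once per promisor remote on an initially empty buffer
yields the part that follows "promisor-remote=" in the advertisement.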

For now, the URL is passed in addition to the name. In the future, it
might be possible to pass other information like a filter-spec that the
client should use when cloning from S, or a token that the client should
use when retrieving objects from X.

It might also be possible in the future for "promisor.advertise" to have
other values like "onlyName", so that no URL is advertised.

By default, or if "promisor.acceptFromServer" is set to "None", the
client will refuse to use the promisor remotes that might have been
advertised by the server. In this case, the client will advertise only
"promisor-remote" in its reply to the server. This means that the client
has the "promisor-remote" capability but decided not to use any of the
promisor remotes that the server might have advertised.

If "promisor.acceptFromServer" is set to "All", on the contrary, the
client will agree to use all the promisor remotes that the server
advertised and it will reply with a string like:

  promisor-remote=<pm-name>[;<pm-name>]...

where the <pm-name> elements are the names of all the promisor remotes
the server advertised. If the server advertised no promisor remote
though, the client will reply with just "promisor-remote".
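
As an illustration of the reply format just described, here is a
small standalone sketch. It follows the wire format as stated in this
message, not necessarily the exact code path of the patch, and
build_reply() is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write the client's reply into 'out'. With no
 * accepted remote the reply is just "promisor-remote"; otherwise it is
 * "promisor-remote=<name>[;<name>]...". Returns 0 on success, or -1 if
 * 'out' is too small.
 */
static int build_reply(char *out, size_t outsz,
		       const char *const *names, size_t nr)
{
	static const char cap[] = "promisor-remote";
	size_t len = strlen(cap);

	assert(out != NULL);
	if (len + 1 > outsz)
		return -1;
	memcpy(out, cap, len + 1);

	for (size_t i = 0; i < nr; i++) {
		/* '=' introduces the first name, ';' separates the others. */
		int n = snprintf(out + len, outsz - len, "%c%s",
				 i ? ';' : '=', names[i]);
		if (n < 0 || (size_t)n >= outsz - len)
			return -1;
		len += (size_t)n;
	}
	return 0;
}
```

The same function covers both the "None" case (called with no names)
and the "All" case (called with every advertised name).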

In a following commit, other values for "promisor.acceptFromServer" will
be implemented so that the client will be able to decide the promisor
remotes it accepts depending on the name and URL it received from the
server. So even if that name and URL information is not used much right
now, it will be needed soon.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/promisor.txt     |  13 ++
 Documentation/gitprotocol-v2.txt      |  37 ++++++
 connect.c                             |   7 +
 promisor-remote.c                     | 182 ++++++++++++++++++++++++++
 promisor-remote.h                     |  26 +++-
 serve.c                               |  21 +++
 t/t5555-http-smart-common.sh          |   1 +
 t/t5701-git-serve.sh                  |   1 +
 t/t5710-promisor-remote-capability.sh | 124 ++++++++++++++++++
 upload-pack.c                         |   3 +
 10 files changed, 414 insertions(+), 1 deletion(-)
 create mode 100755 t/t5710-promisor-remote-capability.sh

diff --git a/Documentation/config/promisor.txt b/Documentation/config/promisor.txt
index 98c5cb2ec2..e3939d83a9 100644
--- a/Documentation/config/promisor.txt
+++ b/Documentation/config/promisor.txt
@@ -1,3 +1,16 @@
 promisor.quiet::
 	If set to "true" assume `--quiet` when fetching additional
 	objects for a partial clone.
+
+promisor.advertise::
+	If set to "true", a server will use the "promisor-remote"
+	capability, see linkgit:gitprotocol-v2[5], to advertise the
+	promisor remotes it is using if any. Default is "false", which
+	means no promisor remote is advertised.
+
+promisor.acceptFromServer::
+	If set to "all", a client will accept all the promisor remotes
+	a server might advertise using the "promisor-remote"
+	capability, see linkgit:gitprotocol-v2[5]. Default is "none",
+	which means no promisor remote advertised by a server will be
+	accepted.
diff --git a/Documentation/gitprotocol-v2.txt b/Documentation/gitprotocol-v2.txt
index 414bc625d5..4d8d3839c4 100644
--- a/Documentation/gitprotocol-v2.txt
+++ b/Documentation/gitprotocol-v2.txt
@@ -781,6 +781,43 @@ retrieving the header from a bundle at the indicated URI, and thus
 save themselves and the server(s) the request(s) needed to inspect the
 headers of that bundle or bundles.
 
+promisor-remote=<pr-infos>
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The server may advertise some promisor remotes it is using, if it's OK
+for the server that a client uses them too. In this case <pr-infos>
+should be of the form:
+
+	pr-infos = pr-info | pr-infos ";" pr-info
+
+	pr-info = "name=" pr-name | "name=" pr-name "," "url=" pr-url
+
+where `pr-name` is the name of a promisor remote, and `pr-url` the
+urlencoded URL of that promisor remote.
+
+In this case a client wanting to use one or more promisor remotes the
+server advertised should reply with "promisor-remote=<pr-names>" where
+<pr-names> should be of the form:
+
+	pr-names = pr-name | pr-names ";" pr-name
+
+where `pr-name` is the name of a promisor remote the server
+advertised.
+
+If the server prefers a client not to use any promisor remote the
+server uses, or if the server doesn't use any promisor remote, it
+should only advertise "promisor-remote" without any value or "=" sign
+after it.
+
+In this case, or if the client doesn't want to use any promisor remote
+the server advertised, the client should reply only "promisor-remote"
+without any value or "=" sign after it.
+
+The "promisor.advertise" and "promisor.acceptFromServer" configuration
+options can be used on the server and client side respectively to
+control what they advertise or accept respectively. See the
+documentation of these configuration options for more information.
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/connect.c b/connect.c
index cf84e631e9..284ea3cf12 100644
--- a/connect.c
+++ b/connect.c
@@ -20,6 +20,7 @@
 #include "protocol.h"
 #include "alias.h"
 #include "bundle-uri.h"
+#include "promisor-remote.h"
 
 static char *server_capabilities_v1;
 static struct strvec server_capabilities_v2 = STRVEC_INIT;
@@ -485,6 +486,7 @@ void check_stateless_delimiter(int stateless_rpc,
 static void send_capabilities(int fd_out, struct packet_reader *reader)
 {
 	const char *hash_name;
+	const char *promisor_remote_info;
 
 	if (server_supports_v2("agent"))
 		packet_write_fmt(fd_out, "agent=%s", git_user_agent_sanitized());
@@ -498,6 +500,11 @@ static void send_capabilities(int fd_out, struct packet_reader *reader)
 	} else {
 		reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
 	}
+	if (server_feature_v2("promisor-remote", &promisor_remote_info)) {
+		char *reply = promisor_remote_reply(promisor_remote_info);
+		packet_write_fmt(fd_out, "promisor-remote%s", reply ? reply : "");
+		free(reply);
+	}
 }
 
 int get_remote_bundle_uri(int fd_out, struct packet_reader *reader,
diff --git a/promisor-remote.c b/promisor-remote.c
index 317e1b127f..d347f4d9b5 100644
--- a/promisor-remote.c
+++ b/promisor-remote.c
@@ -11,6 +11,7 @@
 #include "strvec.h"
 #include "packfile.h"
 #include "environment.h"
+#include "url.h"
 
 struct promisor_remote_config {
 	struct promisor_remote *promisors;
@@ -219,6 +220,18 @@ int repo_has_promisor_remote(struct repository *r)
 	return !!repo_promisor_remote_find(r, NULL);
 }
 
+int repo_has_accepted_promisor_remote(struct repository *r)
+{
+	struct promisor_remote *p;
+
+	promisor_remote_init(r);
+
+	for (p = r->promisor_remote_config->promisors; p; p = p->next)
+		if (p->accepted)
+			return 1;
+	return 0;
+}
+
 static int remove_fetched_oids(struct repository *repo,
 			       struct object_id **oids,
 			       int oid_nr, int to_free)
@@ -290,3 +303,172 @@ void promisor_remote_get_direct(struct repository *repo,
 	if (to_free)
 		free(remaining_oids);
 }
+
+static int allow_unsanitized(char ch)
+{
+	if (ch == ',' || ch == ';' || ch == '%')
+		return 0;
+	return ch > 32 && ch < 127;
+}
+
+static void promisor_info_vecs(struct repository *repo,
+			       struct strvec *names,
+			       struct strvec *urls)
+{
+	struct promisor_remote *r;
+
+	promisor_remote_init(repo);
+
+	for (r = repo->promisor_remote_config->promisors; r; r = r->next) {
+		char *url = NULL;
+		char *url_key = xstrfmt("remote.%s.url", r->name);
+
+		strvec_push(names, r->name);
+		strvec_push(urls, git_config_get_string(url_key, &url) ? NULL : url);
+
+		free(url);
+		free(url_key);
+	}
+}
+
+void promisor_remote_info(struct repository *repo, struct strbuf *buf)
+{
+	struct strbuf sb = STRBUF_INIT;
+	int advertise_promisors = 0;
+	struct strvec names = STRVEC_INIT;
+	struct strvec urls = STRVEC_INIT;
+
+	git_config_get_bool("promisor.advertise", &advertise_promisors);
+
+	if (!advertise_promisors)
+		return;
+
+	promisor_info_vecs(repo, &names, &urls);
+
+	for (size_t i = 0; i < names.nr; i++) {
+		if (sb.len)
+			strbuf_addch(&sb, ';');
+		strbuf_addf(&sb, "name=%s", names.v[i]);
+		if (urls.v[i]) {
+			strbuf_addstr(&sb, ",url=");
+			strbuf_addstr_urlencode(&sb, urls.v[i], allow_unsanitized);
+		}
+	}
+
+	strbuf_sanitize(&sb);
+	strbuf_addbuf(buf, &sb);
+
+	strvec_clear(&names);
+	strvec_clear(&urls);
+}
+
+enum accept_promisor {
+	ACCEPT_NONE = 0,
+	ACCEPT_ALL
+};
+
+static int should_accept_remote(enum accept_promisor accept,
+				const char *remote_name UNUSED,
+				const char *remote_url UNUSED)
+{
+	if (accept == ACCEPT_ALL)
+		return 1;
+
+	BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+}
+
+static void filter_promisor_remote(struct repository *repo,
+				   struct strvec *accepted,
+				   const char *info)
+{
+	struct strbuf **remotes;
+	char *accept_str;
+	enum accept_promisor accept = ACCEPT_NONE;
+
+	if (!git_config_get_string("promisor.acceptfromserver", &accept_str)) {
+		if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
+			accept = ACCEPT_NONE;
+		else if (!strcasecmp("All", accept_str))
+			accept = ACCEPT_ALL;
+		else
+			warning(_("unknown '%s' value for '%s' config option"),
+				accept_str, "promisor.acceptfromserver");
+	}
+
+	if (accept == ACCEPT_NONE)
+		return;
+
+	/* Parse remote info received */
+
+	remotes = strbuf_split_str(info, ';', 0);
+
+	for (size_t i = 0; remotes[i]; i++) {
+		struct strbuf **elems;
+		const char *remote_name = NULL;
+		const char *remote_url = NULL;
+		char *decoded_url = NULL;
+
+		strbuf_trim_trailing_ch(remotes[i], ';');
+		elems = strbuf_split_str(remotes[i]->buf, ',', 0);
+
+		for (size_t j = 0; elems[j]; j++) {
+			int res;
+			strbuf_trim_trailing_ch(elems[j], ',');
+			res = skip_prefix(elems[j]->buf, "name=", &remote_name) ||
+				skip_prefix(elems[j]->buf, "url=", &remote_url);
+			if (!res)
+				warning(_("unknown element '%s' from remote info"),
+					elems[j]->buf);
+		}
+
+		decoded_url = url_decode(remote_url);
+
+		if (should_accept_remote(accept, remote_name, decoded_url))
+			strvec_push(accepted, remote_name);
+
+		strbuf_list_free(elems);
+		free(decoded_url);
+	}
+
+	free(accept_str);
+	strbuf_list_free(remotes);
+}
+
+char *promisor_remote_reply(const char *info)
+{
+	struct strvec accepted = STRVEC_INIT;
+	struct strbuf reply = STRBUF_INIT;
+
+	filter_promisor_remote(the_repository, &accepted, info);
+
+	strbuf_addch(&reply, '=');
+
+	for (size_t i = 0; i < accepted.nr; i++) {
+		if (i != 0)
+			strbuf_addch(&reply, ';');
+		strbuf_addstr(&reply, accepted.v[i]);
+	}
+
+	strvec_clear(&accepted);
+
+	return strbuf_detach(&reply, NULL);
+}
+
+void mark_promisor_remotes_as_accepted(struct repository *r, const char *remotes)
+{
+	struct strbuf **accepted_remotes = strbuf_split_str(remotes, ';', 0);
+
+	for (size_t i = 0; accepted_remotes[i]; i++) {
+		struct promisor_remote *p;
+
+		strbuf_trim_trailing_ch(accepted_remotes[i], ';');
+		p = repo_promisor_remote_find(r, accepted_remotes[i]->buf);
+		if (p)
+			p->accepted = 1;
+		else
+			warning(_("accepted promisor remote '%s' not found"),
+				accepted_remotes[i]->buf);
+	}
+
+	strbuf_list_free(accepted_remotes);
+}
diff --git a/promisor-remote.h b/promisor-remote.h
index 88cb599c39..82f060b5af 100644
--- a/promisor-remote.h
+++ b/promisor-remote.h
@@ -9,11 +9,13 @@ struct object_id;
  * Promisor remote linked list
  *
  * Information in its fields come from remote.XXX config entries or
- * from extensions.partialclone.
+ * from extensions.partialclone, except for 'accepted' which comes
+ * from protocol v2 capabilities exchange.
  */
 struct promisor_remote {
 	struct promisor_remote *next;
 	char *partial_clone_filter;
+	unsigned int accepted : 1;
 	const char name[FLEX_ARRAY];
 };
 
@@ -32,4 +34,26 @@ void promisor_remote_get_direct(struct repository *repo,
 				const struct object_id *oids,
 				int oid_nr);
 
+/*
+ * Append promisor remote info to buf. Useful for a server to
+ * advertise the promisor remotes it uses.
+ */
+void promisor_remote_info(struct repository *repo, struct strbuf *buf);
+
+/*
+ * Prepare a reply to a "promisor-remote" advertisement from a server.
+ */
+char *promisor_remote_reply(const char *info);
+
+/*
+ * Set the 'accepted' flag for some promisor remotes. Useful when some
+ * promisor remotes have been accepted by the client.
+ */
+void mark_promisor_remotes_as_accepted(struct repository *repo, const char *remotes);
+
+/*
+ * Has any promisor remote been accepted by the client?
+ */
+int repo_has_accepted_promisor_remote(struct repository *r);
+
 #endif /* PROMISOR_REMOTE_H */
diff --git a/serve.c b/serve.c
index 884cd84ca8..7c5c7c9856 100644
--- a/serve.c
+++ b/serve.c
@@ -12,6 +12,7 @@
 #include "upload-pack.h"
 #include "bundle-uri.h"
 #include "trace2.h"
+#include "promisor-remote.h"
 
 static int advertise_sid = -1;
 static int advertise_object_info = -1;
@@ -31,6 +32,21 @@ static int agent_advertise(struct repository *r UNUSED,
 	return 1;
 }
 
+static int promisor_remote_advertise(struct repository *r,
+				     struct strbuf *value)
+{
+       if (value)
+	       promisor_remote_info(r, value);
+       return 1;
+}
+
+static void promisor_remote_receive(struct repository *r,
+				    const char *remotes)
+{
+	mark_promisor_remotes_as_accepted(r, remotes);
+}
+
+
 static int object_format_advertise(struct repository *r,
 				   struct strbuf *value)
 {
@@ -157,6 +173,11 @@ static struct protocol_capability capabilities[] = {
 		.advertise = bundle_uri_advertise,
 		.command = bundle_uri_command,
 	},
+	{
+		.name = "promisor-remote",
+		.advertise = promisor_remote_advertise,
+		.receive = promisor_remote_receive,
+	},
 };
 
 void protocol_v2_advertise_capabilities(void)
diff --git a/t/t5555-http-smart-common.sh b/t/t5555-http-smart-common.sh
index 3dcb3340a3..27300a8bf5 100755
--- a/t/t5555-http-smart-common.sh
+++ b/t/t5555-http-smart-common.sh
@@ -131,6 +131,7 @@ test_expect_success 'git upload-pack --advertise-refs: v2' '
 	fetch=shallow wait-for-done
 	server-option
 	object-format=$(test_oid algo)
+	promisor-remote
 	0000
 	EOF
 
diff --git a/t/t5701-git-serve.sh b/t/t5701-git-serve.sh
index c48830de8f..c858c43db2 100755
--- a/t/t5701-git-serve.sh
+++ b/t/t5701-git-serve.sh
@@ -22,6 +22,7 @@ test_expect_success 'test capability advertisement' '
 	object-format=$(test_oid algo)
 	EOF
 	cat >expect.trailer <<-EOF &&
+	promisor-remote
 	0000
 	EOF
 	cat expect.base expect.trailer >expect &&
diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
new file mode 100755
index 0000000000..7e44ad15ce
--- /dev/null
+++ b/t/t5710-promisor-remote-capability.sh
@@ -0,0 +1,124 @@
+#!/bin/sh
+
+test_description='handling of promisor remote advertisement'
+
+. ./test-lib.sh
+
+# Setup the repository with three commits, this way HEAD is always
+# available and we can hide commit 1 or 2.
+test_expect_success 'setup: create "template" repository' '
+	git init template &&
+	test_commit -C template 1 &&
+	test_commit -C template 2 &&
+	test_commit -C template 3 &&
+	test-tool genrandom foo 10240 >template/foo &&
+	git -C template add foo &&
+	git -C template commit -m foo
+'
+
+# A bare repo will act as a server repo with unpacked objects.
+test_expect_success 'setup: create bare "server" repository' '
+	git clone --bare --no-local template server &&
+	mv server/objects/pack/pack-* . &&
+	packfile=$(ls pack-*.pack) &&
+	git -C server unpack-objects --strict <"$packfile"
+'
+
+check_missing_objects () {
+	git -C "$1" rev-list --objects --all --missing=print > all.txt &&
+	perl -ne 'print if s/^[?]//' all.txt >missing.txt &&
+	test_line_count = "$2" missing.txt &&
+	test "$3" = "$(cat missing.txt)"
+}
+
+initialize_server () {
+	# Repack everything first
+	git -C server -c repack.writebitmaps=false repack -a -d &&
+
+	# Remove promisor file in case they exist, useful when reinitializing
+	rm -rf server/objects/pack/*.promisor &&
+
+	# Repack without the largest object and create a promisor pack on server
+	git -C server -c repack.writebitmaps=false repack -a -d \
+	    --filter=blob:limit=5k --filter-to="$(pwd)" &&
+	promisor_file=$(ls server/objects/pack/*.pack | sed "s/\.pack/.promisor/") &&
+	touch "$promisor_file" &&
+
+	# Check that only one object is missing on the server
+	check_missing_objects server 1 "$oid"
+}
+
+test_expect_success "setup for testing promisor remote advertisement" '
+	# Create another bare repo called "server2"
+	git init --bare server2 &&
+
+	# Copy the largest object from server to server2
+	obj="HEAD:foo" &&
+	oid="$(git -C server rev-parse $obj)" &&
+	oid_path="$(test_oid_to_path $oid)" &&
+	path="server/objects/$oid_path" &&
+	path2="server2/objects/$oid_path" &&
+	mkdir -p $(dirname "$path2") &&
+	cp "$path" "$path2" &&
+
+	initialize_server &&
+
+	# Configure server2 as promisor remote for server
+	git -C server remote add server2 "file://$(pwd)/server2" &&
+	git -C server config remote.server2.promisor true &&
+
+	git -C server2 config uploadpack.allowFilter true &&
+	git -C server2 config uploadpack.allowAnySHA1InWant true &&
+	git -C server config uploadpack.allowFilter true &&
+	git -C server config uploadpack.allowAnySHA1InWant true
+'
+
+test_expect_success "fetch with promisor.advertise set to 'true'" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=All \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "fetch with promisor.advertise set to 'false'" '
+	git -C server config promisor.advertise false &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=All \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is not missing on the server
+	check_missing_objects server 0 "" &&
+
+	# Reinitialize server so that the largest object is missing again
+	initialize_server
+'
+
+test_expect_success "fetch with promisor.acceptfromserver set to 'None'" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=None \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is not missing on the server
+	check_missing_objects server 0 ""
+'
+
+test_done
diff --git a/upload-pack.c b/upload-pack.c
index 0052c6a4dc..0cff76c845 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -31,6 +31,7 @@
 #include "write-or-die.h"
 #include "json-writer.h"
 #include "strmap.h"
+#include "promisor-remote.h"
 
 /* Remember to update object flag allocation in object.h */
 #define THEY_HAVE	(1u << 11)
@@ -317,6 +318,8 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 		strvec_push(&pack_objects.args, "--delta-base-offset");
 	if (pack_data->use_include_tag)
 		strvec_push(&pack_objects.args, "--include-tag");
+	if (repo_has_accepted_promisor_remote(the_repository))
+		strvec_push(&pack_objects.args, "--missing=allow-promisor");
 	if (pack_data->filter_options.choice) {
 		const char *spec =
 			expand_list_objects_filter_spec(&pack_data->filter_options);
-- 
2.46.0.4.gbcb884ee16


* [PATCH 4/4] promisor-remote: check advertised name or URL
  2024-07-31 13:40 [PATCH 0/4] Introduce a "promisor-remote" capability Christian Couder
                   ` (2 preceding siblings ...)
  2024-07-31 13:40 ` [PATCH 3/4] Add 'promisor-remote' capability to protocol v2 Christian Couder
@ 2024-07-31 13:40 ` Christian Couder
  2024-07-31 18:35   ` Junio C Hamano
  2024-07-31 16:01 ` [PATCH 0/4] Introduce a "promisor-remote" capability Junio C Hamano
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2024-07-31 13:40 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Christian Couder,
	Christian Couder

A previous commit introduced a "promisor.acceptFromServer" configuration
variable with only "None" or "All" as valid values.

Let's introduce "KnownName" and "KnownUrl" as valid values for this
configuration option to give more choice to a client about which
promisor remotes it might accept among those that the server advertised.

In case of "KnownName", the client will accept promisor remotes which
are already configured on the client and have the same name as those
advertised by the server.

In case of "KnownUrl", the client will accept promisor remotes which
have both the same name and the same URL configured on the client as the
name and URL advertised by the server.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/promisor.txt     | 11 +++--
 promisor-remote.c                     | 54 +++++++++++++++++++--
 t/t5710-promisor-remote-capability.sh | 68 +++++++++++++++++++++++++++
 3 files changed, 126 insertions(+), 7 deletions(-)

diff --git a/Documentation/config/promisor.txt b/Documentation/config/promisor.txt
index e3939d83a9..fadf593621 100644
--- a/Documentation/config/promisor.txt
+++ b/Documentation/config/promisor.txt
@@ -11,6 +11,11 @@ promisor.advertise::
 promisor.acceptFromServer::
 	If set to "all", a client will accept all the promisor remotes
 	a server might advertise using the "promisor-remote"
-	capability, see linkgit:gitprotocol-v2[5]. Default is "none",
-	which means no promisor remote advertised by a server will be
-	accepted.
+	capability, see linkgit:gitprotocol-v2[5]. If set to
+	"knownName" the client will accept promisor remotes which are
+	already configured on the client and have the same name as
+	those advertised by the server. If set to "knownUrl", the
+	client will accept promisor remotes which have both the same
+	name and the same URL configured on the client as the name and
+	URL advertised by the server. Default is "none", which means
+	no promisor remote advertised by a server will be accepted.
diff --git a/promisor-remote.c b/promisor-remote.c
index d347f4d9b5..0ff26b835e 100644
--- a/promisor-remote.c
+++ b/promisor-remote.c
@@ -362,19 +362,54 @@ void promisor_remote_info(struct repository *repo, struct strbuf *buf)
 	strvec_clear(&urls);
 }
 
+/*
+ * Find first index of 'vec' where there is 'val'. 'val' is compared
+ * case insensitively to the strings in 'vec'. If not found 'vec->nr' is
+ * returned.
+ */
+static size_t strvec_find_index(struct strvec *vec, const char *val)
+{
+	for (size_t i = 0; i < vec->nr; i++)
+		if (!strcasecmp(vec->v[i], val))
+			return i;
+	return vec->nr;
+}
+
 enum accept_promisor {
 	ACCEPT_NONE = 0,
+	ACCEPT_KNOWN_URL,
+	ACCEPT_KNOWN_NAME,
 	ACCEPT_ALL
 };
 
 static int should_accept_remote(enum accept_promisor accept,
-				const char *remote_name UNUSED,
-				const char *remote_url UNUSED)
+				const char *remote_name, const char *remote_url,
+				struct strvec *names, struct strvec *urls)
 {
+	size_t i;
+
 	if (accept == ACCEPT_ALL)
 		return 1;
 
-	BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+	i = strvec_find_index(names, remote_name);
+
+	if (i >= names->nr)
+		/* We don't know about that remote */
+		return 0;
+
+	if (accept == ACCEPT_KNOWN_NAME)
+		return 1;
+
+	if (accept != ACCEPT_KNOWN_URL)
+		BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+
+	if (!strcasecmp(urls->v[i], remote_url))
+		return 1;
+
+	warning(_("known remote named '%s' but with url '%s' instead of '%s'"),
+		remote_name, urls->v[i], remote_url);
+
+	return 0;
 }
 
 static void filter_promisor_remote(struct repository *repo,
@@ -384,10 +419,16 @@ static void filter_promisor_remote(struct repository *repo,
 	struct strbuf **remotes;
 	char *accept_str;
 	enum accept_promisor accept = ACCEPT_NONE;
+	struct strvec names = STRVEC_INIT;
+	struct strvec urls = STRVEC_INIT;
 
 	if (!git_config_get_string("promisor.acceptfromserver", &accept_str)) {
 		if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
 			accept = ACCEPT_NONE;
+		else if (!strcasecmp("KnownUrl", accept_str))
+			accept = ACCEPT_KNOWN_URL;
+		else if (!strcasecmp("KnownName", accept_str))
+			accept = ACCEPT_KNOWN_NAME;
 		else if (!strcasecmp("All", accept_str))
 			accept = ACCEPT_ALL;
 		else
@@ -398,6 +439,9 @@ static void filter_promisor_remote(struct repository *repo,
 	if (accept == ACCEPT_NONE)
 		return;
 
+	if (accept != ACCEPT_ALL)
+		promisor_info_vecs(repo, &names, &urls);
+
 	/* Parse remote info received */
 
 	remotes = strbuf_split_str(info, ';', 0);
@@ -423,7 +467,7 @@ static void filter_promisor_remote(struct repository *repo,
 
 		decoded_url = url_decode(remote_url);
 
-		if (should_accept_remote(accept, remote_name, decoded_url))
+		if (should_accept_remote(accept, remote_name, decoded_url, &names, &urls))
 			strvec_push(accepted, remote_name);
 
 		strbuf_list_free(elems);
@@ -431,6 +475,8 @@ static void filter_promisor_remote(struct repository *repo,
 	}
 
 	free(accept_str);
+	strvec_clear(&names);
+	strvec_clear(&urls);
 	strbuf_list_free(remotes);
 }
 
diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
index 7e44ad15ce..c2c83a5914 100755
--- a/t/t5710-promisor-remote-capability.sh
+++ b/t/t5710-promisor-remote-capability.sh
@@ -117,6 +117,74 @@ test_expect_success "fetch with promisor.acceptfromserver set to 'None'" '
 		--no-local --filter="blob:limit=5k" server client &&
 	test_when_finished "rm -rf client" &&
 
+	# Check that the largest object is not missing on the server
+	check_missing_objects server 0 "" &&
+
+	# Reinitialize server so that the largest object is missing again
+	initialize_server
+'
+
+test_expect_success "fetch with promisor.acceptfromserver set to 'KnownName'" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=KnownName \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "fetch with 'KnownName' and different remote names" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.serverTwo.promisor=true \
+		-c remote.serverTwo.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.serverTwo.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=KnownName \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is not missing on the server
+	check_missing_objects server 0 "" &&
+
+	# Reinitialize server so that the largest object is missing again
+	initialize_server
+'
+
+test_expect_success "fetch with promisor.acceptfromserver set to 'KnownUrl'" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=KnownUrl \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "fetch with 'KnownUrl' and different remote urls" '
+	ln -s server2 serverTwo &&
+
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/serverTwo" \
+		-c promisor.acceptfromserver=KnownUrl \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
 	# Check that the largest object is not missing on the server
 	check_missing_objects server 0 ""
 '
-- 
2.46.0.4.gbcb884ee16



* Re: [PATCH 3/4] Add 'promisor-remote' capability to protocol v2
  2024-07-31 13:40 ` [PATCH 3/4] Add 'promisor-remote' capability to protocol v2 Christian Couder
@ 2024-07-31 15:40   ` Taylor Blau
  2024-08-20 11:32     ` Christian Couder
  2024-07-31 16:16   ` Taylor Blau
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 110+ messages in thread
From: Taylor Blau @ 2024-07-31 15:40 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Patrick Steinhardt,
	Christian Couder

On Wed, Jul 31, 2024 at 03:40:13PM +0200, Christian Couder wrote:
> By default, or if "promisor.advertise" is set to 'false', a server S will
> advertise only the "promisor-remote" capability without passing any
> argument through this capability. This means that S supports the new
> capability but doesn't wish any client C to directly access any promisor
> remote X S might use.

Even if the server supports this new capability, is there a reason to
advertise it to the client if the server knows ahead of time that it has
no promisor remotes to advertise?

I am not sure what action the client would take if it knows the server
supports this capability, but does not actually have any promisor
remotes to advertise. I would suggest that setting promisor.advertise to
false indeed prevents advertising it as a capability in the first place.

Selfishly, it prevents some issues that I have when rolling out new Git
versions within GitHub's infrastructure, since our push proxy layer
picks a single replica to replay the capabilities from, but obviously
replays the client's response to all replicas. So if only some replicas
understand the new 'promisor-remote' capability, we can run into issues.

I'm not sure if the client even bothers to send back promisor-remote if
the server did not send any such remotes to begin with, but between that
and what I wrote in the second paragraph here, I don't see a reason to
advertise the capability when promisor.advertise is false.

Thanks,
Taylor


* Re: [PATCH 0/4] Introduce a "promisor-remote" capability
  2024-07-31 13:40 [PATCH 0/4] Introduce a "promisor-remote" capability Christian Couder
                   ` (3 preceding siblings ...)
  2024-07-31 13:40 ` [PATCH 4/4] promisor-remote: check advertised name or URL Christian Couder
@ 2024-07-31 16:01 ` Junio C Hamano
  2024-07-31 16:17 ` Taylor Blau
  2024-09-10 16:29 ` [PATCH v2 " Christian Couder
  6 siblings, 0 replies; 110+ messages in thread
From: Junio C Hamano @ 2024-07-31 16:01 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, John Cai, Patrick Steinhardt

Christian Couder <christian.couder@gmail.com> writes:

> Earlier this year, I sent 3 versions of a patch series with the goal
> of allowing a client C to clone from a server S while using the same
> promisor remote X that S already use. See:
>
> https://lore.kernel.org/git/20240418184043.2900955-1-christian.couder@gmail.com/
>
> Junio suggested to instead implement that feature using:

I actually do not see it as "instead".  The end result would be the
same when things go right.  The only "instead" part is that a protocol
exchange gives you a chance to make sure that the server can tell that
it is OK to omit objects available elsewhere and the fetcher knows
about it, instead of letting the server blindly assume that it is
fine to omit objects.

> This patch series implements that protocol extension called
> "promisor-remote" (that name is open to change or simplification)
> which allows S and C to agree on C using X directly or not.

;-)



* Re: [PATCH 3/4] Add 'promisor-remote' capability to protocol v2
  2024-07-31 13:40 ` [PATCH 3/4] Add 'promisor-remote' capability to protocol v2 Christian Couder
  2024-07-31 15:40   ` Taylor Blau
@ 2024-07-31 16:16   ` Taylor Blau
  2024-08-20 11:32     ` Christian Couder
  2024-07-31 18:25   ` Junio C Hamano
  2024-08-05 13:48   ` Patrick Steinhardt
  3 siblings, 1 reply; 110+ messages in thread
From: Taylor Blau @ 2024-07-31 16:16 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Patrick Steinhardt,
	Christian Couder

On Wed, Jul 31, 2024 at 03:40:13PM +0200, Christian Couder wrote:
> diff --git a/Documentation/gitprotocol-v2.txt b/Documentation/gitprotocol-v2.txt
> index 414bc625d5..4d8d3839c4 100644
> --- a/Documentation/gitprotocol-v2.txt
> +++ b/Documentation/gitprotocol-v2.txt
> @@ -781,6 +781,43 @@ retrieving the header from a bundle at the indicated URI, and thus
>  save themselves and the server(s) the request(s) needed to inspect the
>  headers of that bundle or bundles.
>
> +promisor-remote=<pr-infos>
> +~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The server may advertise some promisor remotes it is using, if it's OK
> +for the server that a client uses them too. In this case <pr-infos>
> +should be of the form:
> +
> +	pr-infos = pr-info | pr-infos ";" pr-info

So <pr-infos> uses the ';' character to delimit multiple <pr-info>s,
which means that <pr-info> can't use ';' itself. You mention above that
<pr-info> is supposed to be generic so that we can add other fields to
it in the future. Do you imagine that any of those fields might want to
use the ';' in their values?

One that comes to mind is the shared token example you wrote about
above. It would be nice to not restrict what characters the token can
contain.

I wonder if it would instead be useful to have <pr-infos> first write
out how many <pr-info>s it contains, and then write out each <pr-info>
separated by a NUL byte, so that none of the fields in the <pr-info>
itself are restricted in what characters they can use.
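
To make the delimiter concern concrete, here is a minimal sketch in
plain C (not Git's strbuf-based parser) of how a naive split on ';'
counts <pr-info> elements. The function name count_pr_infos() and the
"token=" field are assumptions made up for illustration; only "name="
and "url=" appear in the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the <pr-info> elements in a <pr-infos> string by splitting on
 * ';', the way a naive parser would. An empty string has no elements.
 * Hypothetical helper, not part of the patch under review.
 */
static size_t count_pr_infos(const char *infos)
{
	size_t count;

	assert(infos != NULL);
	if (!*infos)
		return 0;

	count = 1;
	for (const char *p = infos; *p; p++)
		if (*p == ';')
			count++;
	return count;
}
```

With this sketch, "name=a,url=x;name=b,url=y" is counted as two
elements, as intended, but a single element carrying a raw ';' in a
value, such as the hypothetical "name=a,token=p;q", is also counted as
two. That is the ambiguity an encoding rule or a length-prefixed
format would have to remove.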

> +static void promisor_info_vecs(struct repository *repo,
> +			       struct strvec *names,
> +			       struct strvec *urls)
> +{
> +	struct promisor_remote *r;
> +
> +	promisor_remote_init(repo);
> +
> +	for (r = repo->promisor_remote_config->promisors; r; r = r->next) {
> +		char *url;
> +		char *url_key = xstrfmt("remote.%s.url", r->name);
> +
> +		strvec_push(names, r->name);
> +		strvec_push(urls, git_config_get_string(url_key, &url) ? NULL : url);

Do you mean to push NULL onto urls here? It seems risky since you have
to check that each entry in the strvec is non-NULL before printing it
out (which you do below in promisor_remote_info()).

Or maybe you need to in order to advertise promisor remotes without
URLs? If so, I'm not sure what the benefit would be to the client if it
doesn't know where to go to retrieve any objects without having a URL.

Thanks,
Taylor


* Re: [PATCH 0/4] Introduce a "promisor-remote" capability
  2024-07-31 13:40 [PATCH 0/4] Introduce a "promisor-remote" capability Christian Couder
                   ` (4 preceding siblings ...)
  2024-07-31 16:01 ` [PATCH 0/4] Introduce a "promisor-remote" capability Junio C Hamano
@ 2024-07-31 16:17 ` Taylor Blau
  2024-09-10 16:29 ` [PATCH v2 " Christian Couder
  6 siblings, 0 replies; 110+ messages in thread
From: Taylor Blau @ 2024-07-31 16:17 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, Junio C Hamano, John Cai, Patrick Steinhardt

On Wed, Jul 31, 2024 at 03:40:10PM +0200, Christian Couder wrote:
> I have tried to implement it in a quite generic way that could allow S
> and C to share more information about promisor remotes and how to use
> them.
>
> For now C doesn't use the information it gets from S when cloning.
> That information is only used to decide if C is Ok to use the promisor
> remotes advertised by S. But this could change which could make it
> much simpler for clients than using the current way of passing
> information about X with the `-c` option of `git clone` many times on
> the command line.

I left a review after carefully reading these patches. I had a couple of
technical questions and suggestions of things to change.

But it's hard to have a definite opinion about the feature overall
without seeing how it is used in practice. I didn't see anything that
made me concerned, though, so I think this is a worthwhile experimental
feature.

Thanks,
Taylor


* Re: [PATCH 1/4] version: refactor strbuf_sanitize()
  2024-07-31 13:40 ` [PATCH 1/4] version: refactor strbuf_sanitize() Christian Couder
@ 2024-07-31 17:18   ` Junio C Hamano
  2024-08-20 11:29     ` Christian Couder
  0 siblings, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2024-07-31 17:18 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, John Cai, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> diff --git a/strbuf.c b/strbuf.c
> index 3d2189a7f6..cccfdec0e3 100644
> --- a/strbuf.c
> +++ b/strbuf.c
> @@ -1082,3 +1082,12 @@ void strbuf_strip_file_from_path(struct strbuf *sb)
>  	char *path_sep = find_last_dir_sep(sb->buf);
>  	strbuf_setlen(sb, path_sep ? path_sep - sb->buf + 1 : 0);
>  }
> +
> +void strbuf_sanitize(struct strbuf *sb)
> +{
> +	strbuf_trim(sb);
> +	for (size_t i = 0; i < sb->len; i++) {
> +		if (sb->buf[i] <= 32 || sb->buf[i] >= 127)
> +			sb->buf[i] = '.';
> +	}
> +}

This looked a bit _too_ specific for the use of the transport layer
(which raises the question if it should even live in strbuf.[ch]).
It also made me wonder if different callers likely want to have
different variants (e.g., do not trim, only trim at the tail, squash
a run of unprintables into a single '.', use '?'  instead of '.',
etc., etc.).

It turns out that there is only *one* existing caller that gets
replaced with this "common" version, which made it a Meh to me.

Let's hope that there will be many new callers to make this step
worthwhile.

>  __attribute__((format (printf,1,2)))
>  int printf_ln(const char *fmt, ...);
>  __attribute__((format (printf,2,3)))
> diff --git a/version.c b/version.c
> index 41b718c29e..951e6dca74 100644
> --- a/version.c
> +++ b/version.c
> @@ -24,15 +24,10 @@ const char *git_user_agent_sanitized(void)
>  
>  	if (!agent) {
>  		struct strbuf buf = STRBUF_INIT;
> -		int i;
>  
>  		strbuf_addstr(&buf, git_user_agent());
> -		strbuf_trim(&buf);
> -		for (i = 0; i < buf.len; i++) {
> -			if (buf.buf[i] <= 32 || buf.buf[i] >= 127)
> -				buf.buf[i] = '.';
> -		}
> -		agent = buf.buf;
> +		strbuf_sanitize(&buf);
> +		agent = strbuf_detach(&buf, NULL);
>  	}
>  
>  	return agent;


* Re: [PATCH 2/4] strbuf: refactor strbuf_trim_trailing_ch()
  2024-07-31 13:40 ` [PATCH 2/4] strbuf: refactor strbuf_trim_trailing_ch() Christian Couder
@ 2024-07-31 17:29   ` Junio C Hamano
  2024-07-31 21:49     ` Taylor Blau
  2024-08-20 11:29     ` Christian Couder
  0 siblings, 2 replies; 110+ messages in thread
From: Junio C Hamano @ 2024-07-31 17:29 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, John Cai, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> We often have to split strings at some specified terminator character.
> The strbuf_split*() functions, that we can use for this purpose,
> return substrings that include the terminator character, so we often
> need to remove that character.
>
> When it is a whitespace, newline or directory separator, the
> terminator character can easily be removed using an existing trimming
> function like strbuf_rtrim(), strbuf_trim_trailing_newline() or
> strbuf_trim_trailing_dir_sep(). There is no function to remove that
> character when it's not one of those characters though.

OK.

> Let's introduce a new strbuf_trim_trailing_ch() function that can be
> used to remove any trailing character, and let's refactor existing code
> that manually removed trailing characters using this new function.

It is disappointing that this new one is not adequate to rewrite any
of the existing strbuf_trim* functions in terms of it, but that's
probably OK.  At least for this one we have two existing callers, but
it makes me wonder if these callers are doing sensible things in the
first place.  After trimming trailing commas, there may be trailing
newlines to be trimmed, and then again whitespaces around the whole
thing may need to be trimmed---what kind of input is that?  The
value has to be " junk \n\n,,,", but " junk, \n\n, " will only
become "junk, \n\n," without being cleaned up further, and it is very
dubious how that is useful.

But that is not an issue this patch introduces ;-)
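
The two inputs mentioned above can be traced with a small sketch. The
helpers below are plain C stand-ins written for this illustration, not
Git's strbuf functions; in particular, it is an assumption that the
trailing-character trim removes every trailing occurrence and that the
newline trim removes a single "\n" or "\r\n".

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Stand-in: remove every trailing occurrence of 'c'. */
static void trim_trailing_ch(char *s, char c)
{
	size_t len = strlen(s);

	while (len > 0 && s[len - 1] == c)
		len--;
	s[len] = '\0';
}

/* Stand-in: remove one trailing "\n" or "\r\n". */
static void trim_trailing_newline(char *s)
{
	size_t len = strlen(s);

	if (len > 0 && s[len - 1] == '\n') {
		len--;
		if (len > 0 && s[len - 1] == '\r')
			len--;
	}
	s[len] = '\0';
}

/* Stand-in: remove whitespace at both ends. */
static void trim(char *s)
{
	size_t len = strlen(s);
	size_t start = 0;

	while (len > 0 && isspace((unsigned char)s[len - 1]))
		len--;
	s[len] = '\0';
	while (start < len && isspace((unsigned char)s[start]))
		start++;
	memmove(s, s + start, len - start + 1);
}

/* Same order of calls as in the quoted tr2_cfg.c loops. */
static void clean_pattern(char *s)
{
	assert(s != NULL);
	trim_trailing_ch(s, ',');
	trim_trailing_newline(s);
	trim(s);
}
```

Under these assumptions, calling clean_pattern() on a writable copy of
" junk \n\n,,," leaves "junk", while " junk, \n\n, " ends up as
"junk, \n\n," because the final space shields the comma and newlines
from the first two steps.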

> diff --git a/trace2/tr2_cfg.c b/trace2/tr2_cfg.c
> index d96d908bb9..356fcd38f4 100644
> --- a/trace2/tr2_cfg.c
> +++ b/trace2/tr2_cfg.c
> @@ -33,10 +33,7 @@ static int tr2_cfg_load_patterns(void)
>  
>  	tr2_cfg_patterns = strbuf_split_buf(envvar, strlen(envvar), ',', -1);
>  	for (s = tr2_cfg_patterns; *s; s++) {
> -		struct strbuf *buf = *s;
> -
> -		if (buf->len && buf->buf[buf->len - 1] == ',')
> -			strbuf_setlen(buf, buf->len - 1);
> +		strbuf_trim_trailing_ch(*s, ',');
>  		strbuf_trim_trailing_newline(*s);
>  		strbuf_trim(*s);
>  	}
> @@ -72,10 +69,7 @@ static int tr2_load_env_vars(void)
>  
>  	tr2_cfg_env_vars = strbuf_split_buf(varlist, strlen(varlist), ',', -1);
>  	for (s = tr2_cfg_env_vars; *s; s++) {
> -		struct strbuf *buf = *s;
> -
> -		if (buf->len && buf->buf[buf->len - 1] == ',')
> -			strbuf_setlen(buf, buf->len - 1);
> +		strbuf_trim_trailing_ch(*s, ',');
>  		strbuf_trim_trailing_newline(*s);
>  		strbuf_trim(*s);
>  	}


* Re: [PATCH 3/4] Add 'promisor-remote' capability to protocol v2
  2024-07-31 13:40 ` [PATCH 3/4] Add 'promisor-remote' capability to protocol v2 Christian Couder
  2024-07-31 15:40   ` Taylor Blau
  2024-07-31 16:16   ` Taylor Blau
@ 2024-07-31 18:25   ` Junio C Hamano
  2024-07-31 19:34     ` Junio C Hamano
  2024-08-20 12:21     ` Christian Couder
  2024-08-05 13:48   ` Patrick Steinhardt
  3 siblings, 2 replies; 110+ messages in thread
From: Junio C Hamano @ 2024-07-31 18:25 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, John Cai, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> When a server repository S borrows some objects from a promisor remote X,
> then a client repository C which would like to clone or fetch from S might,
> or might not, want to also borrow objects from X. Also S might, or might
> not, want to advertise X as a good way for C to directly get objects from,
> instead of C getting everything through S.

If S is a clone that is keeping up to date with X, even if it does
not borrow anything from X, as long as X is known to be much better
connected to the world (e.g., it is in a $LARGEINTERNETCOMPANY
datacenter with petabit/s backbone connections) than S is (e.g., it
is my deskside box on a cable modem), it may be beneficial if S can
omit objects from its "git fetch" response to C, if C is willing to
fill the gap using X.

So it is of dubious value to limit the feature only to cases where S
"borrows" from X, is it?

> To allow S and C to agree on C using X or not, let's introduce a new
> "promisor-remote" capability in the protocol v2, as well as a few new
> configuration variables:
>
>   - "promisor.advertise" on the server side, and:
>   - "promisor.acceptFromServer" on the client side.
>
> By default, or if "promisor.advertise" is set to 'false', a server S will
> advertise only the "promisor-remote" capability without passing any
> argument through this capability. This means that S supports the new
> capability but doesn't wish any client C to directly access any promisor
> remote X S might use.

I would find it more natural if, when .advertise is turned off by
setting it explicitly to "false", we pretended as if we had never
even heard of such a capability.

> If "promisor.advertise" is set to 'true', S will advertise its promisor
> remotes with a string like:
>
>   promisor-remote=<pm-info>[;<pm-info>]...
>
> where each <pm-info> element contains information about a single
> promisor remote in the form:
>
>   name=<pm-name>[,url=<pm-url>]
> where <pm-name> is the name of a promisor remote and <pm-url> is the
> urlencoded url of the promisor remote named <pm-name>.

OK, so pm-name cannot have ';' or ',' in it (which is sensible, or define
pm-name more tightly, like "only lowercase alnum").  URL cannot have
';' or ',' in it either, but that is an OK limitation as URL encoding
can hide them.
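
As an illustration of how encoding can hide the two delimiters, here
is a minimal sketch in plain C. The function encode_delims() is
hypothetical and is not the URL-encoding helper the patch uses; it
escapes only ';', ',' and '%' (the latter so that the result can be
decoded unambiguously), which is narrower than a full URL encoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Percent-encode ';', ',' and '%' from 'in' into 'out', whose capacity
 * 'cap' includes the terminating NUL. Returns 0 on success and -1 if
 * 'out' is too small. Hypothetical helper for illustration only.
 */
static int encode_delims(const char *in, char *out, size_t cap)
{
	size_t n = 0;

	assert(in != NULL && out != NULL);
	for (; *in; in++) {
		if (strchr(";,%", *in)) {
			/* three output characters plus the final NUL */
			if (n + 4 > cap)
				return -1;
			snprintf(out + n, 4, "%%%02X", (unsigned char)*in);
			n += 3;
		} else {
			/* one output character plus the final NUL */
			if (n + 2 > cap)
				return -1;
			out[n++] = *in;
		}
	}
	if (n + 1 > cap)
		return -1;
	out[n] = '\0';
	return 0;
}
```

For example, encoding "https://x.example/a;b,c" (a made-up URL) gives
"https://x.example/a%3Bb%2Cc", which can then be placed after "url="
without colliding with the ';' and ',' separators.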

> For now, the URL is passed in addition to the name. In the future, it
> might be possible to pass other information like a filter-spec that the
> client should use when cloning from S, or a token that the client should
> use when retrieving objects from X.

OK.  And obviously they cannot have ';' or ',' in them without encoding
similarly.

> It might also be possible in the future for "promisor.advertise" to have
> other values like "onlyName", so that no URL is advertised.

Saying "<pm-info> is expected to be extended" should be sufficient,
without inviting discussions like "what good does it do to give only
names" that is irrelevant at least at this moment.

> By default or if "promisor.acceptFromServer" is set to "None", the
> client will not accept to use the promisor remotes that might have been
> advertised by the server. In this case, the client will advertise only
> "promisor-remote" in its reply to the server. This means that the client
> has the "promisor-remote" capability but decided not to use any of the
> promisor remotes that the server might have advertised.

OK, that is a signal to the server side that it is not allowed to
omit any objects from its response to "git fetch" request, even
though they might be available via better connected remotes.

> If "promisor.acceptFromServer" is set to "All", on the contrary, the
> client will accept to use all the promisor remotes that the server
> advertised and it will reply with a string like:
>
>   promisor-remote=<pm-name>[;<pm-name>]...
>
> where the <pm-name> elements are the names of all the promisor remotes
> the server advertised.

So, this is why we need "name" for each "pm-info"---to give a short
name associated with the URL of the remote repository.

Presumably, C has an option to see if each of the remote suggested
is reachable and omit remotes that are not available to C from its
response, so even when .accept is set to "all", the response may not
list all the names of remotes that S advertised, in general.

> If the server advertised no promisor remote
> though, the client will reply with just "promisor-remote".

In other words, at the protocol level:

 - S uses promisor-remote capability to tell C what are potentially
   useful alternate remotes to obtain objects that C may want to
   fetch from S

 - C uses promisor-remote capability to tell S that among the
   remotes advertised by S, it is willing to use the named remotes
   as its promisor, which permits S to omit objects from its
   response to "git fetch" request from C as long as they are known
   to be available from these remotes.
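
The client half of this exchange, as described in the quoted commit
message, can be sketched as follows. This is plain C with a
hypothetical build_reply() helper, not the code from the patch; the
remote name "mirror" is made up, while "server2" is borrowed from the
test script. The bare "promisor-remote" reply for an empty list
follows the commit message as quoted, which is the very behavior being
questioned in this review.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Write the client's capability reply into 'out' (capacity 'cap'):
 * "promisor-remote=<name>[;<name>]..." for the accepted names, or a
 * bare "promisor-remote" when none were accepted. Returns 0 on
 * success, -1 if 'out' is too small. Hypothetical helper.
 */
static int build_reply(const char **names, size_t nr, char *out, size_t cap)
{
	int n;
	size_t used;

	assert(out != NULL && cap > 0);
	n = snprintf(out, cap, "promisor-remote");
	if (n < 0 || (size_t)n >= cap)
		return -1;
	used = (size_t)n;

	for (size_t i = 0; i < nr; i++) {
		/* '=' introduces the first name, ';' separates the rest */
		n = snprintf(out + used, cap - used, "%c%s",
			     i ? ';' : '=', names[i]);
		if (n < 0 || (size_t)n >= cap - used)
			return -1;
		used += (size_t)n;
	}
	return 0;
}
```

Given the names "server2" and "mirror", the sketch produces
"promisor-remote=server2;mirror"; with no accepted names it produces
just "promisor-remote".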

I think that makes sense, but I do not see the point of sending an
empty promisor-remote capability at all.

What practical difference would it make to S and C, if S chooses not
to advertise the capability at all, instead of advertising an empty
remote list with the capability?  Both tell C that it is useless to
request the promisor-remote capability from S in its response.

What practical difference would it make to S and C, if C chooses not
to advertise the capability at all, instead of advertising an empty
remote list with the capability?  Both tell S that S is not allowed
to omit objects that are obtainable from elsewhere.

> In a following commit, other values for "promisor.acceptFromServer" will
> be implemented so that the client will be able to decide the promisor
> remotes it accepts depending on the name and URL it received from the
> server. So even if that name and URL information is not used much right
> now, it will be needed soon.

OK.

> diff --git a/Documentation/config/promisor.txt b/Documentation/config/promisor.txt
> index 98c5cb2ec2..e3939d83a9 100644
> --- a/Documentation/config/promisor.txt
> +++ b/Documentation/config/promisor.txt
> @@ -1,3 +1,16 @@
>  promisor.quiet::
>  	If set to "true" assume `--quiet` when fetching additional
>  	objects for a partial clone.
> +
> +promisor.advertise::
> +	If set to "true", a server will use the "promisor-remote"
> +	capability, see linkgit:gitprotocol-v2[5], to advertise the
> +	promisor remotes it is using if any. Default is "false", which
> +	means no promisor remote is advertised.

Even though I said that there logically is not much reason to tie
this advertisement to the use of promisor remotes by the serving
side, I am OK if the initial implementation is limited to that
arrangement.  It would be an easy change to allow this variable
to take a list of remote repositories that may (or may not) be a
promisor remote of this repository (in other words, "they are clones
that are better connected than me") in the future, but that does not
have to happen in the initial iteration.

It would be less confusing to first-time readers if you described
the intent a bit better.  Why would the server want to advertise and
how would the client take advantage of the information?  I see that
the update in this patch to protocol document is skimpy on this point,
and end-user facing documentation has better exposure anyway, so
let's see what we can do here.

    The "promisor-remote" protocol capability can be used by the
    responder to "git fetch" to advertise better-connected remotes
    that the requester can use as promisor remotes, instead of this
    repository, so that the "git fetch" requester can lazily fetch
    objects from these other better-connected remotes.  If this
    configuration variable is set to "true",...

or something, perhaps?

"no promisor remote is advertised" -> "no promisor-remote capability
is advertised".

> +promisor.acceptFromServer::
> +	If set to "all", a client will accept all the promisor remotes
> +	a server might advertise using the "promisor-remote"
> +	capability, see linkgit:gitprotocol-v2[5]. Default is "none",
> +	which means no promisor remote advertised by a server will be
> +	accepted.

Similarly, readers would want to know what the implication is to
"accept" promisor remotes.

	accept ..., and adds them to its promisor remotes, allowing
        the server to omit objects from its response to "fetch"
        requests that are lazily fetchable from these promisor
        remotes, see linkgit:gitprotocol-v2[5].

or something?

> diff --git a/Documentation/gitprotocol-v2.txt b/Documentation/gitprotocol-v2.txt
> index 414bc625d5..4d8d3839c4 100644
> --- a/Documentation/gitprotocol-v2.txt
> +++ b/Documentation/gitprotocol-v2.txt
> @@ -781,6 +781,43 @@ retrieving the header from a bundle at the indicated URI, and thus
>  save themselves and the server(s) the request(s) needed to inspect the
>  headers of that bundle or bundles.
>  
> +promisor-remote=<pr-infos>
> +~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The server may advertise some promisor remotes it is using, if it's OK
> +for the server that a client uses them too. In this case <pr-infos>
> +should be of the form:

As this is the protocol documentation, we should describe what goes
over the wire and what they mean, regardless of how limited the
initial implementation on either end is.  Advertising the promisor
remotes the server side relies on is probably not what we want to
see this capability limited to forever (remember the previous "X is
much better connected than S" example).

    "it is using, if it's OK ..." -> "the other side may want to use
    as its promisor remotes, instead of this repository"

> +	pr-infos = pr-info | pr-infos ";" pr-info
> +
> +	pr-info = "name=" pr-name | "name=" pr-name "," "url=" pr-url
> +
> +where `pr-name` is the name of a promisor remote, and `pr-url` the
> +urlencoded URL of that promisor remote.

Clarify what the syntax for a valid name is here.  Also stress the
fact that ';' and ',' MUST be encoded if they appear in 'pr-url'.
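
To illustrate the encoding requirement, here is a minimal sketch in
plain C.  pr_url_encode() is a hypothetical helper, not Git's API, and
it assumes a conservative rule: keep only the RFC 3986 unreserved
characters and percent-encode everything else, so that ';' and ','
can never appear raw in 'pr-url'.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of Git): percent-encode `src` into
 * `dst`.  Only RFC 3986 unreserved characters are copied as-is;
 * every other byte, including ';' and ',', becomes %XX.
 * Returns the encoded length, or -1 if `dst` is too small.
 */
static int pr_url_encode(const char *src, char *dst, size_t dstlen)
{
	static const char hex[] = "0123456789ABCDEF";
	size_t n = 0;

	if (!dstlen)
		return -1;
	for (; *src; src++) {
		unsigned char c = (unsigned char)*src;

		if (isalnum(c) || c == '-' || c == '.' || c == '_' || c == '~') {
			if (n + 1 >= dstlen)
				return -1;
			dst[n++] = (char)c;
		} else {
			if (n + 3 >= dstlen)
				return -1;
			dst[n++] = '%';
			dst[n++] = hex[c >> 4];
			dst[n++] = hex[c & 15];
		}
	}
	dst[n] = '\0';
	return (int)n;
}
```

With this rule the two delimiters of the grammar stay unambiguous: a
receiver can split on ';' and ',' first and decode each value
afterwards.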

> +In this case a client wanting to use one or more promisor remotes the
> +server advertised should reply with "promisor-remote=<pr-names>" where
> +<pr-names> should be of the form:
> +
> +	pr-names = pr-name | pr-names ";" pr-name
> +
> +where `pr-name` is the name of a promisor remote the server
> +advertised.

After seeing the advertisement, the client can use some it picked, but
it does not have to tell the server about it.  Why would it respond
with the promisor remotes, and what effect does it have to give the
list of promisor remotes it uses?

    If the "git fetch" side decides to use one or more promisor
    remotes advertised, it can reply with ...
    ...
    where ... the server advertised.  Doing so allows the server to
    make its response smaller by omitting objects that are known to
    be lazily fetchable from these other promisor remote
    repositories.

perhaps?

> +If the server prefers a client not to use any promisor remote the
> +server uses, or if the server doesn't use any promisor remote, it
> +should only advertise "promisor-remote" without any value or "=" sign
> +after it.

It should not advertise "promisor-remote" capability at all.

> +In this case, or if the client doesn't want to use any promisor remote
> +the server advertised, the client should reply only "promisor-remote"
> +without any value or "=" sign after it.

Likewise.  It should not advertise "promisor-remote" capability at
all.

> +The "promisor.advertise" and "promisor.acceptFromServer" configuration
> +options can be used on the server and client side respectively to
> +control what they advertise or accept respectively. See the
> +documentation of these configuration options for more information.

OK.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 4/4] promisor-remote: check advertised name or URL
  2024-07-31 13:40 ` [PATCH 4/4] promisor-remote: check advertised name or URL Christian Couder
@ 2024-07-31 18:35   ` Junio C Hamano
  2024-09-10 16:32     ` Christian Couder
  0 siblings, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2024-07-31 18:35 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, John Cai, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> A previous commit introduced a "promisor.acceptFromServer" configuration
> variable with only "None" or "All" as valid values.
>
> Let's introduce "KnownName" and "KnownUrl" as valid values for this
> configuration option to give more choice to a client about which
> promisor remotes it might accept among those that the server advertised.

A malicious server can switch name and url correspondence.  The URLs
this repository uses to lazily fetch missing objects from are the
only thing that matters, and it does not matter what name the server
gives these URLs, so I am not sure what value, if any, KnownName has,
other than adding a potential security hole.

> In case of "KnownUrl", the client will accept promisor remotes which
> have both the same name and the same URL configured on the client as the
> name and URL advertised by the server.

This makes sense, especially if we had updates to documents I
suggested in my review of [3/4].  If the side effect of "accepting"
a suggested promisor remote were to only use it as a promisor remote
on this side, there is no reason to "accept" the same thing again,
but because the main effect at the protocol level of "accepting" is
to affect the behaviour of the server in such a way that it is now
allowed to omit objects that are requested but would be available
lazily from the promisor remotes in the response, we _do_ need to
be able to respond with the promisor remotes we are willing to use and
have been using.
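
A minimal sketch of the "KnownUrl" check described above, in plain C.
The struct and function names are hypothetical and do not come from
the patch; the sketch assumes remote names are unique on the client
and that the advertised URL has already been decoded.  Requiring the
URL to match as well as the name is what makes a swapped name/URL
pair from the server get rejected.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical view of one remote configured on the client. */
struct known_remote {
	const char *name;
	const char *url;	/* NULL if no URL is configured */
};

/*
 * Sketch of a "KnownUrl"-style policy: accept an advertised promisor
 * remote only if a locally configured remote has both the same name
 * and the same URL.  Returns 1 to accept, 0 to reject.
 */
static int accept_known_url(const struct known_remote *known, size_t nr,
			    const char *adv_name, const char *adv_url)
{
	size_t i;

	if (!adv_name || !adv_url)
		return 0;
	for (i = 0; i < nr; i++) {
		if (strcmp(known[i].name, adv_name))
			continue;
		/* Same name: the URL must match too. */
		return known[i].url && !strcmp(known[i].url, adv_url);
	}
	return 0;
}
```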

This iteration does not seem to have the true server side support to
slim its response by omitting objects that are available elsewhere,
but I agree that it is a good approach to get the protocol support
right.

Thanks.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 3/4] Add 'promisor-remote' capability to protocol v2
  2024-07-31 18:25   ` Junio C Hamano
@ 2024-07-31 19:34     ` Junio C Hamano
  2024-08-20 12:21     ` Christian Couder
  1 sibling, 0 replies; 110+ messages in thread
From: Junio C Hamano @ 2024-07-31 19:34 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, John Cai, Patrick Steinhardt, Christian Couder

Junio C Hamano <gitster@pobox.com> writes:

> Christian Couder <christian.couder@gmail.com> writes:
>
>> When a server repository S borrows some objects from a promisor remote X,
>> then a client repository C which would like to clone or fetch from S might,
>> or might not, want to also borrow objects from X. Also S might, or might
>> not, want to advertise X as a good way for C to directly get objects from,
>> instead of C getting everything through S.
>
> If S is a clone that is keeping up to date with X, even if it does
> not borrow anything from X, as long as X is known to be much better
> connected to the world (e.g., it is in a $LARGEINTERNETCOMPANY
> datacenter with petabit/s backbone connections) than S is (e.g., it
> is my deskside box on a cable modem), it may be beneficial if S can
> omit objects from its "git fetch" response to C, if C is willing to
> fill the gap using X.
>
> So it is of dubious value to limit the feature only to cases where S
> "borrows" from X, is it?

An even better example is if S on my deskside box is the source of
truth, and X in $LARGEINTERNETCOMPANY datacenter that is much better
connected is a publishing repository I use to push from S to X.

Even if you originally cloned from S and use S as your promisor
remote, as the operator of S, I would like you to always consult X
first to reduce the load on S when lazily fetching objects that you
are missing.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 2/4] strbuf: refactor strbuf_trim_trailing_ch()
  2024-07-31 17:29   ` Junio C Hamano
@ 2024-07-31 21:49     ` Taylor Blau
  2024-08-20 11:29       ` Christian Couder
  2024-08-20 11:29     ` Christian Couder
  1 sibling, 1 reply; 110+ messages in thread
From: Taylor Blau @ 2024-07-31 21:49 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Christian Couder, git, John Cai, Patrick Steinhardt,
	Christian Couder

On Wed, Jul 31, 2024 at 10:29:00AM -0700, Junio C Hamano wrote:
> > Let's introduce a new strbuf_trim_trailing_ch() function that can be
> > used to remove any trailing character, and let's refactor existing code
> > that manually removed trailing characters using this new function.
>
> It is disappointing that this new one is not adequate to rewrite any
> of the existing strbuf_trim* functions in terms of it, but that's
> probably OK.

I don't think it's possible without some awkwardness. strbuf_[lr]trim()
both trim characters for which isspace(c) is true, and this new function
only trims a single character (also from the right-hand side of the
string, so strbuf_ltrim() would not be a candidate[^1]).

Likewise for strbuf_trim_trailing_dir_sep(), which uses the
platform-dependent is_dir_sep(). strbuf_trim_trailing_newline() is also
complicated because it only removes '\n' or '\r\n' from the end of a
buffer, but not a lone '\r' character.
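
The difference can be sketched on plain C strings.  Both functions
below are stand-ins written for illustration, not the strbuf
functions themselves, and the first one assumes that the new function
drops a whole run of trailing occurrences of the given character.

```c
#include <stddef.h>
#include <string.h>

/*
 * Stand-in for the idea behind strbuf_trim_trailing_ch(): remove
 * every trailing occurrence of the single character `c`.
 * Returns the new length.
 */
static size_t trim_trailing_ch(char *s, char c)
{
	size_t len = strlen(s);

	while (len > 0 && s[len - 1] == c)
		len--;
	s[len] = '\0';
	return len;
}

/*
 * Newline trimming as described above: remove one trailing "\n" or
 * "\r\n", but leave a lone trailing "\r" alone.  This cannot be
 * expressed as "trim one character", hence the separate function.
 */
static size_t trim_trailing_newline(char *s)
{
	size_t len = strlen(s);

	if (len > 0 && s[len - 1] == '\n') {
		len--;
		if (len > 0 && s[len - 1] == '\r')
			len--;
		s[len] = '\0';
	}
	return len;
}
```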

Thanks,
Taylor

[^1]: Unless you had a function to swap the order of the underlying
  buffer, then call the trim function on the right-hand side, before
  swapping it back. But that's obviously disgusting and clearly a bad
  idea.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 3/4] Add 'promisor-remote' capability to protocol v2
  2024-07-31 13:40 ` [PATCH 3/4] Add 'promisor-remote' capability to protocol v2 Christian Couder
                     ` (2 preceding siblings ...)
  2024-07-31 18:25   ` Junio C Hamano
@ 2024-08-05 13:48   ` Patrick Steinhardt
  2024-08-19 20:00     ` Junio C Hamano
  2024-09-10 16:31     ` Christian Couder
  3 siblings, 2 replies; 110+ messages in thread
From: Patrick Steinhardt @ 2024-08-05 13:48 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, Junio C Hamano, John Cai, Christian Couder

On Wed, Jul 31, 2024 at 03:40:13PM +0200, Christian Couder wrote:
> diff --git a/Documentation/gitprotocol-v2.txt b/Documentation/gitprotocol-v2.txt
> index 414bc625d5..4d8d3839c4 100644
> --- a/Documentation/gitprotocol-v2.txt
> +++ b/Documentation/gitprotocol-v2.txt
> @@ -781,6 +781,43 @@ retrieving the header from a bundle at the indicated URI, and thus
>  save themselves and the server(s) the request(s) needed to inspect the
>  headers of that bundle or bundles.
>  
> +promisor-remote=<pr-infos>
> +~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The server may advertise some promisor remotes it is using, if it's OK
> +for the server that a client uses them too. In this case <pr-infos>
> +should be of the form:
> +
> +	pr-infos = pr-info | pr-infos ";" pr-info
> +
> +	pr-info = "name=" pr-name | "name=" pr-name "," "url=" pr-url
> +
> +where `pr-name` is the name of a promisor remote, and `pr-url` the
> +urlencoded URL of that promisor remote.
> +
> +In this case a client wanting to use one or more promisor remotes the
> +server advertised should reply with "promisor-remote=<pr-names>" where
> +<pr-names> should be of the form:
> +
> +	pr-names = pr-name | pr-names ";" pr-name
> +
> +where `pr-name` is the name of a promisor remote the server
> +advertised.
> +
> +If the server prefers a client not to use any promisor remote the
> +server uses, or if the server doesn't use any promisor remote, it
> +should only advertise "promisor-remote" without any value or "=" sign
> +after it.
> +
> +In this case, or if the client doesn't want to use any promisor remote
> +the server advertised, the client should reply only "promisor-remote"
> +without any value or "=" sign after it.

Why does the client have to advertise anything if they don't want to use
any of the promisor remotes?

> +The "promisor.advertise" and "promisor.acceptFromServer" configuration
> +options can be used on the server and client side respectively to
> +control what they advertise or accept respectively. See the
> +documentation of these configuration options for more information.

One thing I'm not totally clear on is the consequence of this
capability. What is the expected consequence if the client accepts one
of the promisor remotes? What is the consequence if the client accepts
none?

In the former case I'd expect that the server is free to omit objects,
but that isn't made explicit anywhere, I think. Also, is there any
mechanism that tells the client exactly which objects have been omitted?
In the latter case I assume that the result will be a full clone, that
is the server fetched any objects it didn't have from the promisor?

Or does the server side continue to only honor whatever the client has
provided as object filters, but signals to the client that it shall
please contact somebody else when backfilling those promised objects?

Patrick

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 3/4] Add 'promisor-remote' capability to protocol v2
  2024-08-05 13:48   ` Patrick Steinhardt
@ 2024-08-19 20:00     ` Junio C Hamano
  2024-09-10 16:31     ` Christian Couder
  1 sibling, 0 replies; 110+ messages in thread
From: Junio C Hamano @ 2024-08-19 20:00 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Christian Couder, git, John Cai, Christian Couder

Patrick Steinhardt <ps@pks.im> writes:

>> +In this case, or if the client doesn't want to use any promisor remote
>> +the server advertised, the client should reply only "promisor-remote"
>> +without any value or "=" sign after it.
>
> Why does the client have to advertise anything if they don't want to use
> any of the promisor remotes?

Yeah, it is not very well justified why an empty capability needs to
be sent (from both sides).  My recommendation is to drop that part
of the design, but if there is a reason to keep it, it should be done
by explaining how differently the other side should behave when the
capability is not sent at all and when the capability with no
promisor remote is sent.

>> +The "promisor.advertise" and "promisor.acceptFromServer" configuration
>> +options can be used on the server and client side respectively to
>> +control what they advertise or accept respectively. See the
>> +documentation of these configuration options for more information.
>
> One thing I'm not totally clear on is the consequence of this
> capability. What is the expected consequence if the client accepts one
> of the promisor remotes? What is the consequence if the client accepts
> none?

Yes, I also found the documentation lacking in that respect.  The
series talks about how the exchange can proceed, without saying much
(if anything) about what both sides want to exchange promisor-remote
for---what effect does it have on the behaviour of both sides to
send one.  I covered this point in one of my reviews a bit more.

  https://lore.kernel.org/git/xmqqikwl2zca.fsf@gitster.g/

> In the former case I'd expect that the server is free to omit objects,
> but that isn't made explicit anywhere, I think. Also, is there any
> mechanism that tells the client exactly which objects have been omitted?
> In the latter case I assume that the result will be a full clone, that
> is the server fetched any objects it didn't have from the promisor?
>
> Or does the server side continue to only honor whatever the client has
> provided as object filters, but signals to the client that it shall
> please contact somebody else when backfilling those promised objects?

Thanks.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 1/4] version: refactor strbuf_sanitize()
  2024-07-31 17:18   ` Junio C Hamano
@ 2024-08-20 11:29     ` Christian Couder
  0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2024-08-20 11:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, John Cai, Patrick Steinhardt, Christian Couder

On Wed, Jul 31, 2024 at 7:18 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > diff --git a/strbuf.c b/strbuf.c
> > index 3d2189a7f6..cccfdec0e3 100644
> > --- a/strbuf.c
> > +++ b/strbuf.c
> > @@ -1082,3 +1082,12 @@ void strbuf_strip_file_from_path(struct strbuf *sb)
> >       char *path_sep = find_last_dir_sep(sb->buf);
> >       strbuf_setlen(sb, path_sep ? path_sep - sb->buf + 1 : 0);
> >  }
> > +
> > +void strbuf_sanitize(struct strbuf *sb)
> > +{
> > +     strbuf_trim(sb);
> > +     for (size_t i = 0; i < sb->len; i++) {
> > +             if (sb->buf[i] <= 32 || sb->buf[i] >= 127)
> > +                     sb->buf[i] = '.';
> > +     }
> > +}
>
> This looked a bit _too_ specific for the use of the transport layer
> (which raises the question if it should even live in strbuf.[ch]).
> It also made me wonder if different callers likely want to have
> different variants (e.g., do not trim, only trim at the tail, squash
> a run of unprintables into a single '.', use '?'  instead of '.',
> etc., etc.).
>
> It turns out that there is only *one* existing caller that gets
> replaced with this "common" version, which made it a Meh to me.
>
> Let's hope that there will be many new callers to make this step
> worthwhile.

A very similar step was also part of my previous patch series to add
an OS version to the protocol. See:

https://lore.kernel.org/git/20240619125708.3719150-2-christian.couder@gmail.com/

My opinion is that the code is doing something often needed when
dealing with the protocol, so it is worth it to refactor that code
soon, and then adapt it later when needed with options (to not trim,
only trim at the tail, use '?'  instead of '.', etc).

I am not sure if it should live in strbuf.[ch], but on the other hand
if we indeed adapt it over time with a number of options for different
use cases, it might end up in strbuf.[ch], so it is a reasonable bet
to put it there right away. I must also say that I don't know which
other place(s) would be a good home for it.
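
As a rough illustration of what such an option could look like, here
is a sketch on a plain C string rather than a strbuf.  The function
name and the `repl` parameter are hypothetical; only the trimming and
the "<= 32 || >= 127" rule come from the quoted patch.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical variant of the sanitizing step: trim whitespace on
 * both ends, then replace every byte that is <= 32 or >= 127 with
 * `repl` ('.' in the patch, but '?' would work the same way).
 * Returns the new length.
 */
static size_t sanitize(char *s, char repl)
{
	size_t len = strlen(s), start = 0, i;

	while (len > 0 && isspace((unsigned char)s[len - 1]))
		len--;
	while (start < len && isspace((unsigned char)s[start]))
		start++;
	memmove(s, s + start, len - start);
	len -= start;
	s[len] = '\0';

	for (i = 0; i < len; i++) {
		/* Compare as unsigned so bytes >= 128 are caught explicitly. */
		unsigned char c = (unsigned char)s[i];

		if (c <= 32 || c >= 127)
			s[i] = repl;
	}
	return len;
}
```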

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 2/4] strbuf: refactor strbuf_trim_trailing_ch()
  2024-07-31 17:29   ` Junio C Hamano
  2024-07-31 21:49     ` Taylor Blau
@ 2024-08-20 11:29     ` Christian Couder
  1 sibling, 0 replies; 110+ messages in thread
From: Christian Couder @ 2024-08-20 11:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, John Cai, Patrick Steinhardt, Christian Couder

On Wed, Jul 31, 2024 at 7:29 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > We often have to split strings at some specified terminator character.
> > The strbuf_split*() functions, that we can use for this purpose,
> > return substrings that include the terminator character, so we often
> > need to remove that character.
> >
> > When it is a whitespace, newline or directory separator, the
> > terminator character can easily be removed using an existing trimming
> > function like strbuf_rtrim(), strbuf_trim_trailing_newline() or
> > strbuf_trim_trailing_dir_sep(). There is no function to remove that
> > character when it's not one of those characters though.
>
> OK.
>
> > Let's introduce a new strbuf_trim_trailing_ch() function that can be
> > used to remove any trailing character, and let's refactor existing code
> > that manually removed trailing characters using this new function.
>
> It is disappointing that this new one is not adequate to rewrite any
> of the existing strbuf_trim* functions in terms of it, but that's
> probably OK.

Yeah, I took a look at that but thought it wasn't worth trying to
unify the trim functions as they each have quite specific code and
requirements.


> At least this one we have two existing callers, but
> makes me wonder if these callers are doing sensible things in the
> first place.  After trimming trailing commas, there may be trailing
> newlines to be trimmed, and then again whitespaces around the whole
> thing may need to be trimmed---what kind of input is that?  The
> value has to be " junk \n\n,,,", but " junk, \n\n, " will only
> become "junk, \n\n," without further cleaned up, and it is very
> dubious how that is useful.
>
> But that is not an issue this patch introduces ;-)

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 2/4] strbuf: refactor strbuf_trim_trailing_ch()
  2024-07-31 21:49     ` Taylor Blau
@ 2024-08-20 11:29       ` Christian Couder
  0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2024-08-20 11:29 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Junio C Hamano, git, John Cai, Patrick Steinhardt,
	Christian Couder

On Wed, Jul 31, 2024 at 11:49 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Wed, Jul 31, 2024 at 10:29:00AM -0700, Junio C Hamano wrote:
> > > Let's introduce a new strbuf_trim_trailing_ch() function that can be
> > > used to remove any trailing character, and let's refactor existing code
> > > that manually removed trailing characters using this new function.
> >
> > It is disappointing that this new one is not adequate to rewrite any
> > of the existing strbuf_trim* functions in terms of it, but that's
> > probably OK.
>
> I don't think it's possible without some awkwardness. strbuf_[lr]trim()
> both trim characters for which isspace(c) is true, and this new function
> only trims a single character (also from the right-hand side of the
> string, so strbuf_ltrim() would not be a candidate[^1]).
>
> Likewise for strbuf_trim_trailing_dir_sep(), which uses the
> platform-dependent is_dir_sep(). strbuf_trim_trailing_newline() is also
> complicated because it only removes '\n' or '\r\n' from the end of a
> buffer, but not a lone '\r' character.

Yeah, I agree with that analysis.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 3/4] Add 'promisor-remote' capability to protocol v2
  2024-07-31 16:16   ` Taylor Blau
@ 2024-08-20 11:32     ` Christian Couder
  2024-08-20 16:55       ` Junio C Hamano
  2024-09-10 16:32       ` Christian Couder
  0 siblings, 2 replies; 110+ messages in thread
From: Christian Couder @ 2024-08-20 11:32 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Junio C Hamano, John Cai, Patrick Steinhardt,
	Christian Couder

On Wed, Jul 31, 2024 at 6:16 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Wed, Jul 31, 2024 at 03:40:13PM +0200, Christian Couder wrote:
> > diff --git a/Documentation/gitprotocol-v2.txt b/Documentation/gitprotocol-v2.txt
> > index 414bc625d5..4d8d3839c4 100644
> > --- a/Documentation/gitprotocol-v2.txt
> > +++ b/Documentation/gitprotocol-v2.txt
> > @@ -781,6 +781,43 @@ retrieving the header from a bundle at the indicated URI, and thus
> >  save themselves and the server(s) the request(s) needed to inspect the
> >  headers of that bundle or bundles.
> >
> > +promisor-remote=<pr-infos>
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +The server may advertise some promisor remotes it is using, if it's OK
> > +for the server that a client uses them too. In this case <pr-infos>
> > +should be of the form:
> > +
> > +     pr-infos = pr-info | pr-infos ";" pr-info
>
> So <pr-infos> uses the ';' character to delimit multiple <pr-info>s,
> which means that <pr-info> can't use ';' itself. You mention above that
> <pr-info> is supposed to be generic so that we can add other fields to
> it in the future. Do you imagine that any of those fields might want to
> use the ';' in their values?

Yeah, but, as for the 'url' field where the value is urlencoded, the
value can be encoded if it could contain some special characters.

> One that comes to mind is the shared token example you wrote about
> above. It would be nice to not restrict what characters the token can
> contain.

I agree but I think urlencoding, or maybe other simple encodings like
base64, should be easy and simple enough to work around this.

> I wonder if it would instead be useful to have <pr-infos> first write
> out how many <pr-info>s it contains, and then write out each <pr-info>
> separated by a NUL byte, so that none of the fields in the <pr-info>
> itself are restricted in what characters they can use.

I am not sure how NUL bytes would interfere with the pkt-line.[ch]
code though.

> > +static void promisor_info_vecs(struct repository *repo,
> > +                            struct strvec *names,
> > +                            struct strvec *urls)
> > +{
> > +     struct promisor_remote *r;
> > +
> > +     promisor_remote_init(repo);
> > +
> > +     for (r = repo->promisor_remote_config->promisors; r; r = r->next) {
> > +             char *url;
> > +             char *url_key = xstrfmt("remote.%s.url", r->name);
> > +
> > +             strvec_push(names, r->name);
> > +             strvec_push(urls, git_config_get_string(url_key, &url) ? NULL : url);
>
> Do you mean to push NULL onto urls here? It seems risky since you have
> to check that each entry in the strvec is non-NULL before printing it
> out (which you do below in promisor_remote_info()).

The code doesn't seem risky to me as it allows us to treat the case
when git_config_get_string() fails and when it succeeds but possibly
sets 'url' to NULL (not sure if it's possible though as I didn't
check) in the same way.

Yeah, it means that we have to check if each entry in the strvec is
non-NULL, but I think it's quite easy, and honestly I didn't want to
ask myself questions like should we treat a URL of a remote
configured as an empty string in the same way as the URL not
configured. I think it's much simpler to just pass as-is the content,
if any, that we get from git_config_get_string().
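
To make the NULL handling concrete, here is a small sketch of how one
pr-info element could be formatted from such a name/URL pair.
format_pr_info() is a hypothetical name, not the function in the
patch, and it assumes the URL was already urlencoded by the caller.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical formatter for one pr-info element.  A NULL `url`
 * means "no URL available", in which case only the name is sent,
 * matching the `"name=" pr-name` alternative of the grammar.
 * Returns what snprintf() returns.
 */
static int format_pr_info(char *out, size_t outlen,
			  const char *name, const char *url)
{
	if (url)
		return snprintf(out, outlen, "name=%s,url=%s", name, url);
	return snprintf(out, outlen, "name=%s", name);
}
```

The NULL check then lives in exactly one place, and an empty string
is passed through as-is instead of being treated like a missing URL.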

> Or maybe you need to in order to advertise promisor remotes without
> URLs?

Yeah, I think we should advertise promisor remotes that don't have a
URL configured. It might seem strange, but servers might want in
the future to have hidden/secret URLs (URLs that they use, likely
internally on the server, but don't want to pass for some reason to a
client).

> If so, I'm not sure what the benefit would be to the client if it
> doesn't know where to go to retrieve any objects without having a URL.

The client might already have a URL for the promisor-remote (and it
might be a different one than the one the server would use if
hidden/secret URLs become a thing). That's why patch 4/4 implements
the "KnownName" value for the "promisor.acceptFromServer" config
option.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 3/4] Add 'promisor-remote' capability to protocol v2
  2024-07-31 15:40   ` Taylor Blau
@ 2024-08-20 11:32     ` Christian Couder
  2024-08-20 17:01       ` Junio C Hamano
  0 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2024-08-20 11:32 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Junio C Hamano, John Cai, Patrick Steinhardt,
	Christian Couder

On Wed, Jul 31, 2024 at 5:40 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Wed, Jul 31, 2024 at 03:40:13PM +0200, Christian Couder wrote:
> > By default, or if "promisor.advertise" is set to 'false', a server S will
> > advertise only the "promisor-remote" capability without passing any
> > argument through this capability. This means that S supports the new
> > capability but doesn't wish any client C to directly access any promisor
> > remote X S might use.
>
> Even if the server supports this new capability, is there a reason to
> advertise it to the client if the server knows ahead of time that it has
> no promisor remotes to advertise?

I think it could be useful at least in some cases for C to know that S
has the capability to advertise promisor remotes but decided not to
advertise any. For example, if C knows that the repo has a lot of very
large files, it might realize that S is likely not a good mirror of
the repo if it doesn't have the 'promisor-remote' capability.

I agree that it's more useful the other way though. That is for a
server to know that the client has the capability but might not want
to use it.

For example, when C clones without using X directly, it can be a
burden for S to have to fetch large objects from X (as it would use
precious disk space on S, and unnecessarily duplicate large objects).
So S might want to say "please use a newer or different client that
has the 'promisor-remote' capability" if it knows that the client
doesn't have this capability. If S knows that C has the capability but
didn't configure it or doesn't want to use it, it could instead say
something like "please consider activating the 'promisor-remote'
capability by doing this and that to avoid burdening this server and
get a faster clone".

Note that the client might not be 'git'. It might be a "compatible"
implementation (libgit2, gix, JGit, etc), so using the version passed
in the "agent" protocol capability is not a good way to detect if the
client has the capability or not.

In the end, as it looks very useful for S to know if C has the
capability or not, and as it seems natural that S and C behave the
same regarding advertising the capability, I think the choice of
always advertising the capability, even when not using it, is the
right one.

> I am not sure what action the client would take if it knows the server
> supports this capability, but does not actually have any promisor
> remotes to advertise. I would suggest that setting promisor.advertise to
> false indeed prevents advertising it as a capability in the first place.

It could, in some cases, help C realize that S is likely using old or
unoptimized server software for the repo, and C could decide based on
this to use a different mirror repo. For example if C wants to clone
some well known open source AI repo that has a lot of very large files
and is mirrored on many common repo hosting platforms (GitHub, GitLab,
etc), C might be happy to get a clue of how likely the different
mirrors are to be optimized to serve that repo.

I agree that it might not be a very good reason right now, but I think
it might be in the future. Anyway the main reason for such a behavior
is (as I said above) that it is very useful for S to know if C has the
'promisor-remote' capability or not.

> Selfishly, it prevents some issues that I have when rolling out new Git
> versions within GitHub's infrastructure, since our push proxy layer
> picks a single replica to replay the capabilities from, but obviously
> replays the client's response to all replicas. So if only some replicas
> understand the new 'promisor-remote' capability, we can run into issues.

I understand the problem, but I think it might be worked around by
first deploying on a single replica with the new 'promisor-remote'
capability disabled in the config, which is the default. Yeah, that
replica might behave a bit differently than the others, but the client
behavior shouldn't change much. And when things work well with a
single replica with the capability disabled, then more replicas with
that capability disabled can be rolled out until they all have it.

More issues are likely to happen when actually enabling the
capability, but this is independent of the fact that the
'promisor-remote' capability is advertised even if it is not enabled.

> I'm not sure if the client even bothers to send back promisor-remote if
> the server did not send any such remotes to begin with,

If S sends 'promisor-remote' even without sending any remote
information then C should reply using 'promisor-remote' too. I think
it can help S decide if setting up promisor remotes is worth it or
not, if S can easily know if many of its clients could use them or
not.

> but between that
> and what I wrote in the second paragraph here, I don't see a reason to
> advertise the capability when promisor.advertise is false.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 3/4] Add 'promisor-remote' capability to protocol v2
  2024-07-31 18:25   ` Junio C Hamano
  2024-07-31 19:34     ` Junio C Hamano
@ 2024-08-20 12:21     ` Christian Couder
  1 sibling, 0 replies; 110+ messages in thread
From: Christian Couder @ 2024-08-20 12:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, John Cai, Patrick Steinhardt, Christian Couder

On Wed, Jul 31, 2024 at 8:25 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > When a server repository S borrows some objects from a promisor remote X,
> > then a client repository C which would like to clone or fetch from S might,
> > or might not, want to also borrow objects from X. Also S might, or might
> > not, want to advertise X as a good way for C to directly get objects from,
> > instead of C getting everything through S.
>
> If S is a clone that is keeping up to date with X, even if it does
> not borrow anything from X, as long as X is known to be much better
> connected to the world (e.g., it is in a $LARGEINTERNETCOMPANY
> datacenter with petabit/s backbone connections) than S is (e.g., it
> is my deskside box on a cable modem), it may be beneficial if S can
> omit objects from its "git fetch" response to C, if C is willing to
> fill the gap using X.
>
> So it is of dubious value to limit the feature only to cases where S
> "borrows" from X, is it?

I agree. I just blindly copied the word from something you said in a
previous thread, but I should have thought more about the use case you
suggest.

> > To allow S and C to agree on C using X or not, let's introduce a new
> > "promisor-remote" capability in the protocol v2, as well as a few new
> > configuration variables:
> >
> >   - "promisor.advertise" on the server side, and:
> >   - "promisor.acceptFromServer" on the client side.
> >
> > By default, or if "promisor.advertise" is set to 'false', a server S will
> > advertise only the "promisor-remote" capability without passing any
> > argument through this capability. This means that S supports the new
> > capability but doesn't wish any client C to directly access any promisor
> > remote X S might use.
>
> I would find it more natural if .advertise is turned off by setting
> it explicitly to "false", we would pretend as if we have never even
> heard of such a capability.

In a reply to Taylor I just sent, I tried to explain why I think it's
a good thing that S can know if C has the capability even if neither S
nor C actually use it.

If, in the future, many servers and repos are transitioned to setups
where promisor remotes and this capability are used, then I think it
could help a lot if S can help C take advantage of X, and better
diagnose issues. And perhaps the other way around (C knowing that S
has the capability or not) could help a bit too.

> > If "promisor.advertise" is set to 'true', S will advertise its promisor
> > remotes with a string like:
> >
> >   promisor-remote=<pm-info>[;<pm-info>]...
> >
> > where each <pm-info> element contains information about a single
> > promisor remote in the form:
> >
> >   name=<pm-name>[,url=<pm-url>]
> >
> > where <pm-name> is the name of a promisor remote and <pm-url> is the
> > urlencoded url of the promisor remote named <pm-name>.
>
> OK, so pm-name cannot have ";," in it (which is sensible, or define
> pm-name more tightly, like "only lowercase alnum").

Not sure what our config allows here. Ideally I think the capability
should support everything our config supports.

>  URL cannot have
> ';' or ',' in it that is an OK limitation as URL encoding can hide
> them.

Yeah, right.
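
To illustrate how URL encoding hides the separators, here is a rough
Python sketch. It is not code from this series: `encode_pm_info()` and
`decode_pm_info()` are made-up names, and `urllib.parse.quote()` with
`safe=""` is only an assumed stand-in for the urlencoding the patches
use. Only the URL is encoded here, as in this version of the series.

```python
from urllib.parse import quote, unquote


def encode_pm_info(name, url=None):
    """Build one <pm-info> element; only the URL is percent-encoded."""
    info = "name=" + name
    if url is not None:
        # safe="" makes quote() escape every reserved character,
        # including ';' and ',' which the capability uses as separators.
        info += ",url=" + quote(url, safe="")
    return info


def decode_pm_info(info):
    """Split one element back into (name, url); url is None if absent."""
    fields = dict(field.split("=", 1) for field in info.split(","))
    url = fields.get("url")
    return fields["name"], (unquote(url) if url is not None else None)


# A URL that contains both separator characters (hypothetical example).
url = "https://example.com/repo.git;v=1,mirror"
element = encode_pm_info("origin", url)
print(element)
```

Because the encoded URL no longer contains ';' or ',', splitting the
advertisement on those characters stays unambiguous.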

> > For now, the URL is passed in addition to the name. In the future, it
> > might be possible to pass other information like a filter-spec that the
> > client should use when cloning from S, or a token that the client should
> > use when retrieving objects from X.
>
> OK.  And obviously they cannot have ";," in them without encoding
> similarly.

Yeah, or maybe using a different encoding if it's better for some reason.

> > It might also be possible in the future for "promisor.advertise" to have
> > other values like "onlyName", so that no URL is advertised.
>
> Saying "<pm-info> is expected to be extended" should be sufficient,
> without inviting discussions like "what good does it do to give only
> names" that is irrelevant at least at this moment.

Yeah, but in this case Taylor had comments related to advertising
remotes that have a name but no URL, so I think this example could
help people having related questions understand that it could be a
good thing to advertise only a name and no URL.

> > By default or if "promisor.acceptFromServer" is set to "None", the
> > client will not accept to use the promisor remotes that might have been
> > advertised by the server. In this case, the client will advertise only
> > "promisor-remote" in its reply to the server. This means that the client
> > has the "promisor-remote" capability but decided not to use any of the
> > promisor remotes that the server might have advertised.
>
> OK, that is a signal to the server side that it is not allowed to
> omit any objects from its response to "git fetch" request, even
> though they might be available via a better connected remotes.

Yeah, right.

> > If "promisor.acceptFromServer" is set to "All", on the contrary, the
> > client will accept to use all the promisor remotes that the server
> > advertised and it will reply with a string like:
> >
> >   promisor-remote=<pm-name>[;<pm-name>]...
> >
> > where the <pm-name> elements are the names of all the promisor remotes
> > the server advertised.
>
> So, this is why we need "name" for each "pm-info"---to give a short
> name associated with the URL of the remote repository.

Yeah, I think it can be valuable to use names to agree on
promisor-remotes that should or shouldn't be used, like we commonly
use remote names with `git clone`, `git fetch`, etc.

> Presumably, C has an option to see if each of the remote suggested
> is reachable and omit remotes that are not available to C from its
> response, so even when .accept is set to "all", the response may not
> list all the names of remotes that S advertised, in general.

I didn't think about reachability or connectivity specifically, but I
agree it might be useful for C to be able to filter in some ways the
promisor remotes that S advertised.

> > If the server advertised no promisor remote
> > though, the client will reply with just "promisor-remote".
>
> In other words, at the protocol level:
>
>  - S uses promisor-remote capability to tell C what are potentially
>    useful alternate remotes to obtain objects that C may want to
>    fetch from S
>
>  - C uses promisor-remote capability to tell S that among the
>    remotes advertised by S, it is willing to use the named remotes
>    as its promisor, which permits S to omit objects from its
>    response to "git fetch" request from C as long as they are known
>    to be available from these remotes.

Yeah, right.

> I think that makes sense, but I do not see the point of sending an
> empty promisor-remote capability at all.

I hope I gave good enough reasons above and in my replies to Taylor.

> What practical difference would it make to S and C, if S chooses not
> to advertise the capability at all, instead of advertising an empty
> remote list with the capability?  Both tell C that it is useless to
> request promisor-remote capability to S in its response.
>
> What practical difference would it make to S and C, if C chooses not
> to advertise the capability at all, instead of advertising an empty
> remote list with the capability?  Both tell S that S is not allowed
> to omit objects that are obtainable from elsewhere.

I agree that when looking at how Git related things technically work,
it doesn't change anything, but I think we should look at the big
picture too. For GitLab, for example (but I suppose it will be similar
for GitHub and other hosting sites), it will be important to help
users take advantage, as seamlessly as possible, of the feature, and
the more we know about the client they use and its capabilities, the
better job we can do to help them.

> > In a following commit, other values for "promisor.acceptFromServer" will
> > be implemented so that the client will be able to decide the promisor
> > remotes it accepts depending on the name and URL it received from the
> > server. So even if that name and URL information is not used much right
> > now, it will be needed soon.
>
> OK.
>
> > diff --git a/Documentation/config/promisor.txt b/Documentation/config/promisor.txt
> > index 98c5cb2ec2..e3939d83a9 100644
> > --- a/Documentation/config/promisor.txt
> > +++ b/Documentation/config/promisor.txt
> > @@ -1,3 +1,16 @@
> >  promisor.quiet::
> >       If set to "true" assume `--quiet` when fetching additional
> >       objects for a partial clone.
> > +
> > +promisor.advertise::
> > +     If set to "true", a server will use the "promisor-remote"
> > +     capability, see linkgit:gitprotocol-v2[5], to advertise the
> > +     promisor remotes it is using if any. Default is "false", which
> > +     means no promisor remote is advertised.
>
> Even though I said that there logically is not much reason to tie
> this advertisement to the use of promisor remote by the serving
> side, I am OK if the initial implementation is limited to that
> arrangement.  It would be an easy change to allow this variable
> to take a list of remote repositories that may (or may not) be a
> promisor remote of this repository (in other words, "they are clones
> that are better connected than me") in the future, but that does not
> have to happen in the initial iteration.

Yeah, I will mention something like this in the next iteration.

> It would be less confusing to first-time readers if you described
> the intent a bit better.  Why would the server want to advertise and
> how would the client take advantage of the information?

Yeah, I will try to better answer that question in the doc, cover
letter and commit messages.

> I see that
> the update in this patch to protocol document is skimpy on this point,
> and end-user facing documentation has better exposure anyway, so
> let's see what we can do here.
>
>     The "promisor-remote" protocol capability can be used by the
>     responder to "git fetch" to advertise better-connected remotes
>     that the requester can use as promisor remotes, instead of this
>     repository, so that "git fetch" requestor can lazily fetch
>     objects from these other better-connected remotes.  If this
>     configuration variable is set to "true",...
>
> or something, perhaps?

Thanks for the suggestion. I will base the changes in version 2 on it.

> "no promisor remote is advertised" -> "no promisor-remote capability
> is advertised".

Right.

> > +promisor.acceptFromServer::
> > +     If set to "all", a client will accept all the promisor remotes
> > +     a server might advertise using the "promisor-remote"
> > +     capability, see linkgit:gitprotocol-v2[5]. Default is "none",
> > +     which means no promisor remote advertised by a server will be
> > +     accepted.
>
> Similarly, readers would want to know what the implication is to
> "accept" promisor remotes.
>
>         accept ..., and adds them to its promisor remotes, allowing
>         the server to omit objects from its response to "fetch"
>         requests that are lazily fetchable from these promisor
>         remotes, see linkgit:gitprotocol-v2[5].
>
> or something?

Thanks for the suggestion. I will improve the wording based on it.

> > diff --git a/Documentation/gitprotocol-v2.txt b/Documentation/gitprotocol-v2.txt
> > index 414bc625d5..4d8d3839c4 100644
> > --- a/Documentation/gitprotocol-v2.txt
> > +++ b/Documentation/gitprotocol-v2.txt
> > @@ -781,6 +781,43 @@ retrieving the header from a bundle at the indicated URI, and thus
> >  save themselves and the server(s) the request(s) needed to inspect the
> >  headers of that bundle or bundles.
> >
> > +promisor-remote=<pr-infos>
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +The server may advertise some promisor remotes it is using, if it's OK
> > +for the server that a client uses them too. In this case <pr-infos>
> > +should be of the form:
>
> As this is the protocol documentation, we should describe what goes
> over the wire and what they mean, regardless of how limited the
> initial implementation on either end is.  Advertising the promisor
> remotes the server side relies on is probably not what we want to
> see this capability limited to forever (remember the previous "X is
> much better connected than S" example).
>
>     "it is using, if it's OK ..." -> "the other side may want to use
>     as its promisor remotes, instead of this repository"

Right, thanks.

> > +     pr-infos = pr-info | pr-infos ";" pr-info
> > +
> > +     pr-info = "name=" pr-name | "name=" pr-name "," "url=" pr-url
> > +
> > +where `pr-name` is the name of a promisor remote, and `pr-url` the
> > +urlencoded URL of that promisor remote.
>
> Clarify what the syntax for a valid name is here.  Also stress the fact
> that ';' and ',' MUST be encoded if it appears in 'pr-url'.

Yeah, I will do that.

> > +In this case a client wanting to use one or more promisor remotes the
> > +server advertised should reply with "promisor-remote=<pr-names>" where
> > +<pr-names> should be of the form:
> > +
> > +     pr-names = pr-name | pr-names ";" pr-name
> > +
> > +where `pr-name` is the name of a promisor remote the server
> > +advertised.
>
> After seeing advertisement, client can use some it picked but it
> does not have to tell the server about it.  Why would it respond
> with the promisor remotes, and what effect does it have to give the
> list of promisor remotes it uses?
>
>     If the "git fetch" side decides to use one or more promisor
>     remotes advertised, it can reply with ...
>     ...
>     where ... the server advertised.  Doing so allows the server to
>     make its response smaller by omitting objects that are known to
>     be lazily fetchable from these other promisor remotes
>     repositories.
>
> perhaps?

Yeah, right.

> > +If the server prefers a client not to use any promisor remote the
> > +server uses, or if the server doesn't use any promisor remote, it
> > +should only advertise "promisor-remote" without any value or "=" sign
> > +after it.
>
> It should not advertise "promisor-remote" capability at all.

Let's discuss this above or in a thread started by Taylor's reviews.

> > +In this case, or if the client doesn't want to use any promisor remote
> > +the server advertised, the client should reply only "promisor-remote"
> > +without any value or "=" sign after it.
>
> Likewise.  It should not advertise "promisor-remote" capability at
> all.
>
> > +The "promisor.advertise" and "promisor.acceptFromServer" configuration
> > +options can be used on the server and client side respectively to
> > +control what they advertise or accept respectively. See the
> > +documentation of these configuration options for more information.
>
> OK.

Thanks for the review and suggestions.

* Re: [PATCH 3/4] Add 'promisor-remote' capability to protocol v2
  2024-08-20 11:32     ` Christian Couder
@ 2024-08-20 16:55       ` Junio C Hamano
  2024-09-10 16:32       ` Christian Couder
  1 sibling, 0 replies; 110+ messages in thread
From: Junio C Hamano @ 2024-08-20 16:55 UTC (permalink / raw)
  To: Christian Couder
  Cc: Taylor Blau, git, John Cai, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> On Wed, Jul 31, 2024 at 6:16 PM Taylor Blau <me@ttaylorr.com> wrote:
> ...
> I am not sure how NUL bytes would interfere with the pkt-line.[c,h] code though.

Heh, you had 20 days to compose this response, which would be
plenty to see that pkt-line.[ch] is about <length> and <bytes>.
After all, it is used to transfer the pack data stream that can have
arbitrary bytes ;-)
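
For readers following along, a rough Python sketch of that framing. It
is not pkt-line.c itself: the function names are hypothetical, and the
special short packets such as the flush packet "0000" are ignored for
brevity. Each line is four hex digits giving the total length (the
length field included) followed by that many bytes, so a NUL in the
payload is just another byte.

```python
MAX_PKT_LEN = 65520  # 65516 payload bytes + 4 bytes of length field


def pkt_line(payload: bytes) -> bytes:
    """Frame payload as one pkt-line: 4 hex digits of length, then bytes."""
    length = len(payload) + 4  # the length counts its own 4 bytes
    if length > MAX_PKT_LEN:
        raise ValueError("payload too large for one pkt-line")
    return b"%04x" % length + payload


def read_pkt_line(data: bytes) -> bytes:
    """Return the payload of the pkt-line at the start of data."""
    length = int(data[:4].decode("ascii"), 16)
    # Slice by the declared length; no byte value acts as a terminator.
    return data[4:length]


payload = b"name=a\x00b\n"  # contains a NUL byte on purpose
framed = pkt_line(payload)
print(framed)
```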


* Re: [PATCH 3/4] Add 'promisor-remote' capability to protocol v2
  2024-08-20 11:32     ` Christian Couder
@ 2024-08-20 17:01       ` Junio C Hamano
  2024-09-10 16:32         ` Christian Couder
  0 siblings, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2024-08-20 17:01 UTC (permalink / raw)
  To: Christian Couder
  Cc: Taylor Blau, git, John Cai, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> I agree that it's more useful the other way though. That is for a
> server to know that the client has the capability but might not want
> to use it.
>
> For example, when C clones without using X directly, it can be a
> burden for S to have to fetch large objects from X (as it would use
> precious disk space on S, and unnecessarily duplicate large objects).
> So S might want to say "please use a newer or different client that
> has the 'promisor-remote' capability" if it knows that the client
> doesn't have this capability. If S knows that C has the capability but
> didn't configure it or doesn't want to use it, it could instead say
> something like "please consider activating the 'promisor-remote'
> capability by doing this and that to avoid burdening this server and
> get a faster clone".
>
> Note that the client might not be 'git'. It might be a "compatible"
> implementation (libgit2, gix, JGit, etc), so using the version passed
> in the "agent" protocol capability is not a good way to detect if the
> client has the capability or not.

It is none of S's business to even know about C's "true" capability,
if C does not want to use it with S.  I do not quite find the above
a credible justification.

* [PATCH v2 0/4] Introduce a "promisor-remote" capability
  2024-07-31 13:40 [PATCH 0/4] Introduce a "promisor-remote" capability Christian Couder
                   ` (5 preceding siblings ...)
  2024-07-31 16:17 ` Taylor Blau
@ 2024-09-10 16:29 ` Christian Couder
  2024-09-10 16:29   ` [PATCH v2 1/4] version: refactor strbuf_sanitize() Christian Couder
                     ` (5 more replies)
  6 siblings, 6 replies; 110+ messages in thread
From: Christian Couder @ 2024-09-10 16:29 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Taylor Blau,
	Eric Sunshine, Christian Couder

Earlier this year, I sent 3 versions of a patch series with the goal
of allowing a client C to clone from a server S while using the same
promisor remote X that S already uses. See:

https://lore.kernel.org/git/20240418184043.2900955-1-christian.couder@gmail.com/

Junio suggested implementing that feature using:

"a protocol extension that lets S tell C that S wants C to fetch
missing objects from X (which means that if C knows about X in its
".git/config" then there is no need for end-user interaction at all),
or a protocol extension that C tells S that C is willing to see
objects available from X omitted when S does not have them (again,
this could be done by looking at ".git/config" at C, but there may be
security implications???)"

This patch series implements that protocol extension called
"promisor-remote" (that name is open to change or simplification)
which allows S and C to agree on C using X directly or not.

I have tried to implement it in a quite generic way that could allow S
and C to share more information about promisor remotes and how to use
them.

For now, C doesn't use the information it gets from S when cloning.
That information is only used to decide if C is OK to use the promisor
remotes advertised by S. But this could change in the future which
could make it much simpler for clients than using the current way of
passing information about X with the `-c` option of `git clone` many
times on the command line.

Another improvement could be to not require GIT_NO_LAZY_FETCH=0 when S
and C have agreed on using X.

Changes compared to version 1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  - In patch 1/4, fixed a typo spotted by Eric Sunshine when I sent
    this patch in a previous patch-series.

  - In patch 3/4, changed server and client behavior, so that they
    don't advertise the "promisor-remote" capability at all if they
    don't use it. Changed the doc and commit message accordingly.

  - In patch 3/4, related to the previous change, modified both
    promisor_remote_reply() and promisor_remote_info() so they return
    NULL when the "promisor-remote" capability shouldn't be advertised
    in the protocol, and the string that should appear in the
    advertisement otherwise. Adapted the callers and the doc of these
    functions accordingly.

  - In patch 3/4, fixed lack of urlencoding for remote names, as it
    looks like remote names can contain a lot of characters like ','
    and ';'. So it's just better to urlencode them in the same way
    URLs are also urlencoded. Also used url_percent_decode(), instead
    of url_decode(), to decode both names and urls, as it looks more
    suited to decode what strbuf_addstr_urlencode() encoded.

  - In patch 3/4, simplified a bit some `if` conditions in
    loops. Changed `if (sb.len)` or `if (i != 0)` to just `if (i)`.

  - In patch 3/4, improved doc and commit message to talk about the
    case of a server advertising promisor remotes which are better
    connected to the world.

  - In patch 3/4, improved doc and commit message to talk about the
    consequences for the client and the server of promisor remotes
    being advertised or accepted.

  - In patch 3/4, improved the doc to say that `pr-name` MUST be a
    valid remote name, and the ';' and ',' characters MUST be encoded
    if they appear in `pr-name` or `pr-url`.

  - In patch 3/4, changed 'pm-*' to 'pr-*' in commit message to match
    what is in the doc.

  - In patch 3/4, made a number of other minor improvements to the doc
    and commit message.

  - In patch 4/4, improved doc and commit message to add information
    about the security implications of the new "KnownName" and
    "KnownUrl" options for the "promisor.acceptFromServer" config
    variable.
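
As a rough illustration of the patch 3/4 behavior listed above, here is
a Python sketch (not the C code from the series). `server_advertisement()`
and `client_reply()` are made-up names that loosely mirror the
NULL-returning behavior described for promisor_remote_info() and
promisor_remote_reply(). Assumptions: `urllib.parse.quote()` with
`safe=""` stands in for the series' urlencoding, the config value is
compared case-insensitively, and only "all" and "none" are handled.

```python
from urllib.parse import quote, unquote


def server_advertisement(remotes, advertise):
    """Return the advertised value, or None when nothing is advertised.

    `remotes` is a list of (name, url) pairs; url may be None.
    """
    if not advertise or not remotes:
        return None  # capability is not advertised at all
    infos = []
    for name, url in remotes:
        # Names are urlencoded too, since they may contain ';' or ','.
        info = "name=" + quote(name, safe="")
        if url is not None:
            info += ",url=" + quote(url, safe="")
        infos.append(info)
    return ";".join(infos)


def client_reply(advertised, accept_from_server):
    """Return the reply value, or None when the client sends nothing."""
    if advertised is None or accept_from_server.lower() != "all":
        return None  # simplification: any value other than "all" is "none"
    names = []
    for info in advertised.split(";"):
        fields = dict(f.split("=", 1) for f in info.split(","))
        # Decode and re-encode the name for the reply.
        names.append(quote(unquote(fields["name"]), safe=""))
    return ";".join(names)


adv = server_advertisement(
    [("x;y", "https://x.example/r.git"), ("z", None)], advertise=True
)
print("promisor-remote=" + adv)
print("promisor-remote=" + client_reply(adv, "All"))
```

When either function returns None, the corresponding side simply leaves
the "promisor-remote" capability out of what it sends.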

Thanks to Junio, Patrick, Eric and Taylor for their suggestions.

CI tests
~~~~~~~~

See: https://github.com/chriscool/git/actions/runs/10796188005

Range diff compared to version 1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1:  6260022d20 ! 1:  0d9d094181 version: refactor strbuf_sanitize()
    @@ Commit message
         While at it, let's also make a few small improvements:
           - use 'size_t' for 'i' instead of 'int',
           - move the declaration of 'i' inside the 'for ( ... )',
    -      - use strbuf_detach() to explicitely detach the string contained by
    +      - use strbuf_detach() to explicitly detach the string contained by
             the 'sb' strbuf.
     
    +    Helped-by: Eric Sunshine <sunshine@sunshineco.com>
         Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
     
      ## strbuf.c ##
2:  ff1246b07c = 2:  fc53229eff strbuf: refactor strbuf_trim_trailing_ch()
3:  cb7250d06e ! 3:  5c507e427f Add 'promisor-remote' capability to protocol v2
    @@ Metadata
      ## Commit message ##
         Add 'promisor-remote' capability to protocol v2
     
    -    When a server repository S borrows some objects from a promisor remote X,
    -    then a client repository C which would like to clone or fetch from S might,
    -    or might not, want to also borrow objects from X. Also S might, or might
    -    not, want to advertise X as a good way for C to directly get objects from,
    -    instead of C getting everything through S.
    +    When a server S knows that some objects from a repository are available
    +    from a promisor remote X, S might want to suggest to a client C cloning
    +    or fetching the repo from S that C should use X directly instead of S
    +    for these objects.
     
    -    To allow S and C to agree on C using X or not, let's introduce a new
    -    "promisor-remote" capability in the protocol v2, as well as a few new
    -    configuration variables:
    +    Note that this could happen both in the case S itself doesn't have the
    +    objects and borrows them from X, and in the case S has the objects but
    +    knows that X is better connected to the world (e.g., it is in a
    +    $LARGEINTERNETCOMPANY datacenter with petabit/s backbone connections)
    +    than S. Implementation of the latter case, which would require S to
    +    omit in its response the objects available on X, is left for future
    +    improvement though.
    +
    +    Then C might or might not, want to get the objects from X, and should
    +    let S know about this.
    +
    +    To allow S and C to agree and let each other know about C using X or
    +    not, let's introduce a new "promisor-remote" capability in the
    +    protocol v2, as well as a few new configuration variables:
     
           - "promisor.advertise" on the server side, and:
           - "promisor.acceptFromServer" on the client side.
     
         By default, or if "promisor.advertise" is set to 'false', a server S will
    -    advertise only the "promisor-remote" capability without passing any
    -    argument through this capability. This means that S supports the new
    -    capability but doesn't wish any client C to directly access any promisor
    -    remote X S might use.
    +    not advertise the "promisor-remote" capability.
    +
    +    If S doesn't advertise the "promisor-remote" capability, then a client C
    +    replying to S shouldn't advertise the "promisor-remote" capability
    +    either.
     
         If "promisor.advertise" is set to 'true', S will advertise its promisor
         remotes with a string like:
     
    -      promisor-remote=<pm-info>[;<pm-info>]...
    +      promisor-remote=<pr-info>[;<pr-info>]...
     
    -    where each <pm-info> element contains information about a single
    +    where each <pr-info> element contains information about a single
         promisor remote in the form:
     
    -      name=<pm-name>[,url=<pm-url>]
    +      name=<pr-name>[,url=<pr-url>]
     
    -    where <pm-name> is the name of a promisor remote and <pm-url> is the
    -    urlencoded url of the promisor remote named <pm-name>.
    +    where <pr-name> is the urlencoded name of a promisor remote and
    +    <pr-url> is the urlencoded URL of the promisor remote named <pr-name>.
     
         For now, the URL is passed in addition to the name. In the future, it
         might be possible to pass other information like a filter-spec that the
    @@ Commit message
         use when retrieving objects from X.
     
         It might also be possible in the future for "promisor.advertise" to have
    -    other values like "onlyName", so that no URL is advertised.
    +    other values. For example a value like "onlyName" could prevent S from
    +    advertising URLs, which could help in case C should use a different URL
    +    for X than the URL S is using. (The URL S is using might be an internal
    +    one on the server side for example.)
     
    -    By default or if "promisor.acceptFromServer" is set to "None", the
    -    client will not accept to use the promisor remotes that might have been
    -    advertised by the server. In this case, the client will advertise only
    -    "promisor-remote" in its reply to the server. This means that the client
    -    has the "promisor-remote" capability but decided not to use any of the
    -    promisor remotes that the server might have advertised.
    +    By default or if "promisor.acceptFromServer" is set to "None", C will
    +    not accept to use the promisor remotes that might have been advertised
    +    by S. In this case, C will not advertise any "promisor-remote"
    +    capability in its reply to S.
     
    -    If "promisor.acceptFromServer" is set to "All", on the contrary, the
    -    client will accept to use all the promisor remotes that the server
    -    advertised and it will reply with a string like:
    +    If "promisor.acceptFromServer" is set to "All" and S advertised some
    +    promisor remotes, then on the contrary, C will accept to use all the
    +    promisor remotes that S advertised and C will reply with a string like:
     
    -      promisor-remote=<pm-name>[;<pm-name>]...
    +      promisor-remote=<pr-name>[;<pr-name>]...
     
    -    where the <pm-name> elements are the names of all the promisor remotes
    -    the server advertised. If the server advertised no promisor remote
    -    though, the client will reply with just "promisor-remote".
    +    where the <pr-name> elements are the urlencoded names of all the
    +    promisor remotes S advertised.
     
         In a following commit, other values for "promisor.acceptFromServer" will
    -    be implemented so that the client will be able to decide the promisor
    -    remotes it accepts depending on the name and URL it received from the
    -    server. So even if that name and URL information is not used much right
    -    now, it will be needed soon.
    +    be implemented, so that C will be able to decide the promisor remotes it
    +    accepts depending on the name and URL it received from S. So even if
    +    that name and URL information is not used much right now, it will be
    +    needed soon.
     
    +    Helped-by: Taylor Blau <me@ttaylorr.com>
    +    Helped-by: Patrick Steinhardt <ps@pks.im>
         Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
     
      ## Documentation/config/promisor.txt ##
    @@ Documentation/config/promisor.txt
     +promisor.advertise::
     +  If set to "true", a server will use the "promisor-remote"
     +  capability, see linkgit:gitprotocol-v2[5], to advertise the
    -+  promisor remotes it is using if any. Default is "false", which
    -+  means no promisor remote is advertised.
    ++  promisor remotes it is using, if it uses some. Default is
    ++  "false", which means the "promisor-remote" capability is not
    ++  advertised.
     +
     +promisor.acceptFromServer::
     +  If set to "all", a client will accept all the promisor remotes
     +  a server might advertise using the "promisor-remote"
    -+  capability, see linkgit:gitprotocol-v2[5]. Default is "none",
    -+  which means no promisor remote advertised by a server will be
    -+  accepted.
    ++  capability. Default is "none", which means no promisor remote
    ++  advertised by a server will be accepted. By accepting a
    ++  promisor remote, the client agrees that the server might omit
    ++  objects that are lazily fetchable from this promisor remote
    ++  from its responses to "fetch" and "clone" requests from the
    ++  client. See linkgit:gitprotocol-v2[5].
     
      ## Documentation/gitprotocol-v2.txt ##
     @@ Documentation/gitprotocol-v2.txt: retrieving the header from a bundle at the indicated URI, and thus
    @@ Documentation/gitprotocol-v2.txt: retrieving the header from a bundle at the ind
     +promisor-remote=<pr-infos>
     +~~~~~~~~~~~~~~~~~~~~~~~~~~
     +
    -+The server may advertise some promisor remotes it is using, if it's OK
    -+for the server that a client uses them too. In this case <pr-infos>
    -+should be of the form:
    ++The server may advertise some promisor remotes it is using or knows
    ++about to a client which may want to use them as its promisor remotes,
    ++instead of this repository. In this case <pr-infos> should be of the
    ++form:
     +
     +  pr-infos = pr-info | pr-infos ";" pr-info
     +
     +  pr-info = "name=" pr-name | "name=" pr-name "," "url=" pr-url
     +
    -+where `pr-name` is the name of a promisor remote, and `pr-url` the
    -+urlencoded URL of that promisor remote.
    ++where `pr-name` is the urlencoded name of a promisor remote, and
    ++`pr-url` the urlencoded URL of that promisor remote.
     +
    -+In this case a client wanting to use one or more promisor remotes the
    -+server advertised should reply with "promisor-remote=<pr-names>" where
    -+<pr-names> should be of the form:
    ++In this case, if the client decides to use one or more promisor
    ++remotes the server advertised, it can reply with
    ++"promisor-remote=<pr-names>" where <pr-names> should be of the form:
     +
     +  pr-names = pr-name | pr-names ";" pr-name
     +
    -+where `pr-name` is the name of a promisor remote the server
    -+advertised.
    ++where `pr-name` is the urlencoded name of a promisor remote the server
    ++advertised and the client accepts.
    ++
    ++Note that, everywhere in this document, `pr-name` MUST be a valid
    ++remote name, and the ';' and ',' characters MUST be encoded if they
    ++appear in `pr-name` or `pr-url`.
     +
    -+If the server prefers a client not to use any promisor remote the
    -+server uses, or if the server doesn't use any promisor remote, it
    -+should only advertise "promisor-remote" without any value or "=" sign
    -+after it.
    ++If the server doesn't know any promisor remote that could be good for
    ++a client to use, or prefers a client not to use any promisor remote it
    ++uses or knows about, it shouldn't advertise the "promisor-remote"
    ++capability at all.
     +
     +In this case, or if the client doesn't want to use any promisor remote
    -+the server advertised, the client should reply only "promisor-remote"
    -+without any value or "=" sign after it.
    ++the server advertised, the client shouldn't advertise the
    ++"promisor-remote" capability at all in its reply.
     +
     +The "promisor.advertise" and "promisor.acceptFromServer" configuration
     +options can be used on the server and client side respectively to
     +control what they advertise or accept respectively. See the
     +documentation of these configuration options for more information.
    ++
    ++Note that in the future it would be nice if the "promisor-remote"
    ++protocol capability could be used by the server, when responding to
    ++`git fetch` or `git clone`, to advertise better-connected remotes that
    ++the client can use as promisor remotes, instead of this repository, so
    ++that the client can lazily fetch objects from these other
    ++better-connected remotes. This would require the server to omit in its
    ++response the objects available on the better-connected remotes that
    ++the client has accepted. This hasn't been implemented yet though. So
    ++for now this "promisor-remote" capability is useful only when the
    ++server advertises some promisor remotes it already uses to borrow
    ++objects from.
     +
      GIT
      ---
    @@ connect.c: static void send_capabilities(int fd_out, struct packet_reader *reade
        }
     +  if (server_feature_v2("promisor-remote", &promisor_remote_info)) {
     +          char *reply = promisor_remote_reply(promisor_remote_info);
    -+          packet_write_fmt(fd_out, "promisor-remote%s", reply ? reply : "");
    -+          free(reply);
    ++          if (reply) {
    ++                  packet_write_fmt(fd_out, "promisor-remote=%s", reply);
    ++                  free(reply);
    ++          }
     +  }
      }
      
    @@ promisor-remote.c: void promisor_remote_get_direct(struct repository *repo,
     +  }
     +}
     +
    -+void promisor_remote_info(struct repository *repo, struct strbuf *buf)
    ++char *promisor_remote_info(struct repository *repo)
     +{
     +  struct strbuf sb = STRBUF_INIT;
     +  int advertise_promisors = 0;
    @@ promisor-remote.c: void promisor_remote_get_direct(struct repository *repo,
     +  git_config_get_bool("promisor.advertise", &advertise_promisors);
     +
     +  if (!advertise_promisors)
    -+          return;
    ++          return NULL;
     +
     +  promisor_info_vecs(repo, &names, &urls);
     +
    ++  if (!names.nr)
    ++          return NULL;
    ++
     +  for (size_t i = 0; i < names.nr; i++) {
    -+          if (sb.len)
    ++          if (i)
     +                  strbuf_addch(&sb, ';');
    -+          strbuf_addf(&sb, "name=%s", names.v[i]);
    ++          strbuf_addstr(&sb, "name=");
    ++          strbuf_addstr_urlencode(&sb, names.v[i], allow_unsanitized);
     +          if (urls.v[i]) {
     +                  strbuf_addstr(&sb, ",url=");
     +                  strbuf_addstr_urlencode(&sb, urls.v[i], allow_unsanitized);
    @@ promisor-remote.c: void promisor_remote_get_direct(struct repository *repo,
     +  }
     +
     +  strbuf_sanitize(&sb);
    -+  strbuf_addbuf(buf, &sb);
     +
     +  strvec_clear(&names);
     +  strvec_clear(&urls);
    ++
    ++  return strbuf_detach(&sb, NULL);
     +}
     +
     +enum accept_promisor {
    @@ promisor-remote.c: void promisor_remote_get_direct(struct repository *repo,
     +          struct strbuf **elems;
     +          const char *remote_name = NULL;
     +          const char *remote_url = NULL;
    ++          char *decoded_name = NULL;
     +          char *decoded_url = NULL;
     +
     +          strbuf_trim_trailing_ch(remotes[i], ';');
    @@ promisor-remote.c: void promisor_remote_get_direct(struct repository *repo,
     +                                  elems[j]->buf);
     +          }
     +
    -+          decoded_url = url_decode(remote_url);
    ++          if (remote_name)
    ++                  decoded_name = url_percent_decode(remote_name);
    ++          if (remote_url)
    ++                  decoded_url = url_percent_decode(remote_url);
     +
    -+          if (should_accept_remote(accept, remote_name, decoded_url))
    -+                  strvec_push(accepted, remote_name);
    ++          if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url))
    ++                  strvec_push(accepted, decoded_name);
     +
     +          strbuf_list_free(elems);
    ++          free(decoded_name);
     +          free(decoded_url);
     +  }
     +
    @@ promisor-remote.c: void promisor_remote_get_direct(struct repository *repo,
     +
     +  filter_promisor_remote(the_repository, &accepted, info);
     +
    -+  strbuf_addch(&reply, '=');
    ++  if (!accepted.nr)
    ++          return NULL;
     +
     +  for (size_t i = 0; i < accepted.nr; i++) {
    -+          if (i != 0)
    ++          if (i)
     +                  strbuf_addch(&reply, ';');
    -+          strbuf_addstr(&reply, accepted.v[i]);
    ++          strbuf_addstr_urlencode(&reply, accepted.v[i], allow_unsanitized);
     +  }
     +
     +  strvec_clear(&accepted);
    @@ promisor-remote.c: void promisor_remote_get_direct(struct repository *repo,
     +
     +  for (size_t i = 0; accepted_remotes[i]; i++) {
     +          struct promisor_remote *p;
    ++          char *decoded_remote;
     +
     +          strbuf_trim_trailing_ch(accepted_remotes[i], ';');
    -+          p = repo_promisor_remote_find(r, accepted_remotes[i]->buf);
    ++          decoded_remote = url_percent_decode(accepted_remotes[i]->buf);
    ++
    ++          p = repo_promisor_remote_find(r, decoded_remote);
     +          if (p)
     +                  p->accepted = 1;
     +          else
     +                  warning(_("accepted promisor remote '%s' not found"),
    -+                          accepted_remotes[i]->buf);
    ++                          decoded_remote);
    ++
    ++          free(decoded_remote);
     +  }
     +
     +  strbuf_list_free(accepted_remotes);
    @@ promisor-remote.h: void promisor_remote_get_direct(struct repository *repo,
                                int oid_nr);
      
     +/*
    -+ * Append promisor remote info to buf. Useful for a server to
    -+ * advertise the promisor remotes it uses.
    ++ * Prepare a "promisor-remote" advertisement by a server.
    ++ * Check the value of "promisor.advertise" and maybe the configured
    ++ * promisor remotes, if any, to prepare information to send in an
    ++ * advertisement.
    ++ * Return value is NULL if no promisor remote advertisement should be
    ++ * made. Otherwise it contains the names and urls of the advertised
    ++ * promisor remotes separated by ';'
     + */
    -+void promisor_remote_info(struct repository *repo, struct strbuf *buf);
    ++char *promisor_remote_info(struct repository *repo);
     +
     +/*
     + * Prepare a reply to a "promisor-remote" advertisement from a server.
    ++ * Check the value of "promisor.acceptfromserver" and maybe the
    ++ * configured promisor remotes, if any, to prepare the reply.
    ++ * Return value is NULL if no promisor remote from the server
    ++ * is accepted. Otherwise it contains the names of the accepted promisor
    ++ * remotes separated by ';'.
     + */
     +char *promisor_remote_reply(const char *info);
     +
    @@ serve.c: static int agent_advertise(struct repository *r UNUSED,
     +static int promisor_remote_advertise(struct repository *r,
     +                               struct strbuf *value)
     +{
    -+       if (value)
    -+         promisor_remote_info(r, value);
    -+       return 1;
    ++  if (value) {
    ++          char *info = promisor_remote_info(r);
    ++          if (!info)
    ++                  return 0;
    ++          strbuf_addstr(value, info);
    ++          free(info);
    ++  }
    ++  return 1;
     +}
     +
     +static void promisor_remote_receive(struct repository *r,
    @@ serve.c: static struct protocol_capability capabilities[] = {
      
      void protocol_v2_advertise_capabilities(void)
     
    - ## t/t5555-http-smart-common.sh ##
    -@@ t/t5555-http-smart-common.sh: test_expect_success 'git upload-pack --advertise-refs: v2' '
    -   fetch=shallow wait-for-done
    -   server-option
    -   object-format=$(test_oid algo)
    -+  promisor-remote
    -   0000
    -   EOF
    - 
    -
    - ## t/t5701-git-serve.sh ##
    -@@ t/t5701-git-serve.sh: test_expect_success 'test capability advertisement' '
    -   object-format=$(test_oid algo)
    -   EOF
    -   cat >expect.trailer <<-EOF &&
    -+  promisor-remote
    -   0000
    -   EOF
    -   cat expect.base expect.trailer >expect &&
    -
      ## t/t5710-promisor-remote-capability.sh (new) ##
     @@
     +#!/bin/sh
4:  bcb884ee16 ! 4:  1c2794f139 promisor-remote: check advertised name or URL
    @@ Commit message
     
         In case of "KnownName", the client will accept promisor remotes which
         are already configured on the client and have the same name as those
    -    advertised by the client.
    +    advertised by the server. This could be useful in a corporate setup
    +    where servers and clients are trusted to not switch names and URLs, but
    +    where some kind of control is still useful.
     
         In case of "KnownUrl", the client will accept promisor remotes which
         have both the same name and the same URL configured on the client as the
    -    name and URL advertised by the server.
    +    name and URL advertised by the server. This is the most secure option,
    +    so it should be used if possible.
     
         Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
     
    @@ Documentation/config/promisor.txt: promisor.advertise::
      promisor.acceptFromServer::
        If set to "all", a client will accept all the promisor remotes
        a server might advertise using the "promisor-remote"
    --  capability, see linkgit:gitprotocol-v2[5]. Default is "none",
    --  which means no promisor remote advertised by a server will be
    --  accepted.
    -+  capability, see linkgit:gitprotocol-v2[5]. If set to
    -+  "knownName" the client will accept promisor remotes which are
    -+  already configured on the client and have the same name as
    -+  those advertised by the client. If set to "knownUrl", the
    -+  client will accept promisor remotes which have both the same
    -+  name and the same URL configured on the client as the name and
    -+  URL advertised by the server. Default is "none", which means
    -+  no promisor remote advertised by a server will be accepted.
    +-  capability. Default is "none", which means no promisor remote
    +-  advertised by a server will be accepted. By accepting a
    +-  promisor remote, the client agrees that the server might omit
    +-  objects that are lazily fetchable from this promisor remote
    +-  from its responses to "fetch" and "clone" requests from the
    +-  client. See linkgit:gitprotocol-v2[5].
    ++  capability. If set to "knownName" the client will accept
    ++  promisor remotes which are already configured on the client
    ++  and have the same name as those advertised by the server. This
    ++  is not very secure, but could be used in a corporate setup
    ++  where servers and clients are trusted to not switch names and
    ++  URLs. If set to "knownUrl", the client will accept promisor
    ++  remotes which have both the same name and the same URL
    ++  configured on the client as the name and URL advertised by the
    ++  server. This is more secure than "all" or "knownName", so it
    ++  should be used if possible instead of those options. Default
    ++  is "none", which means no promisor remote advertised by a
    ++  server will be accepted. By accepting a promisor remote, the
    ++  client agrees that the server might omit objects that are
    ++  lazily fetchable from this promisor remote from its responses
    ++  to "fetch" and "clone" requests from the client. See
    ++  linkgit:gitprotocol-v2[5].
     
      ## promisor-remote.c ##
    -@@ promisor-remote.c: void promisor_remote_info(struct repository *repo, struct strbuf *buf)
    -   strvec_clear(&urls);
    +@@ promisor-remote.c: char *promisor_remote_info(struct repository *repo)
    +   return strbuf_detach(&sb, NULL);
      }
      
     +/*
    @@ promisor-remote.c: static void filter_promisor_remote(struct repository *repo,
      
        remotes = strbuf_split_str(info, ';', 0);
     @@ promisor-remote.c: static void filter_promisor_remote(struct repository *repo,
    +           if (remote_url)
    +                   decoded_url = url_percent_decode(remote_url);
      
    -           decoded_url = url_decode(remote_url);
    - 
    --          if (should_accept_remote(accept, remote_name, decoded_url))
    -+          if (should_accept_remote(accept, remote_name, decoded_url, &names, &urls))
    -                   strvec_push(accepted, remote_name);
    +-          if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url))
    ++          if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url, &names, &urls))
    +                   strvec_push(accepted, decoded_name);
      
                strbuf_list_free(elems);
     @@ promisor-remote.c: static void filter_promisor_remote(struct repository *repo,


Christian Couder (4):
  version: refactor strbuf_sanitize()
  strbuf: refactor strbuf_trim_trailing_ch()
  Add 'promisor-remote' capability to protocol v2
  promisor-remote: check advertised name or URL

 Documentation/config/promisor.txt     |  27 +++
 Documentation/gitprotocol-v2.txt      |  54 ++++++
 connect.c                             |   9 +
 promisor-remote.c                     | 244 ++++++++++++++++++++++++++
 promisor-remote.h                     |  36 +++-
 serve.c                               |  26 +++
 strbuf.c                              |  16 ++
 strbuf.h                              |  10 ++
 t/t5710-promisor-remote-capability.sh | 192 ++++++++++++++++++++
 trace2/tr2_cfg.c                      |  10 +-
 upload-pack.c                         |   3 +
 version.c                             |   9 +-
 12 files changed, 620 insertions(+), 16 deletions(-)
 create mode 100755 t/t5710-promisor-remote-capability.sh

-- 
2.46.0.4.g7a37e584ed



* [PATCH v2 1/4] version: refactor strbuf_sanitize()
  2024-09-10 16:29 ` [PATCH v2 " Christian Couder
@ 2024-09-10 16:29   ` Christian Couder
  2024-09-10 16:29   ` [PATCH v2 2/4] strbuf: refactor strbuf_trim_trailing_ch() Christian Couder
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2024-09-10 16:29 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Taylor Blau,
	Eric Sunshine, Christian Couder, Christian Couder

The git_user_agent_sanitized() function performs some sanitizing to
avoid special characters being sent over the line and possibly messing
up the protocol or the parsing on the other side.

Let's extract this sanitizing into a new strbuf_sanitize() function, as
we will want to reuse it in a following patch, and let's put it into
strbuf.{c,h}.

While at it, let's also make a few small improvements:
  - use 'size_t' for 'i' instead of 'int',
  - move the declaration of 'i' inside the 'for ( ... )',
  - use strbuf_detach() to explicitly detach the string contained by
    the 'sb' strbuf.
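
As a rough illustration of the behaviour being extracted, here is a
minimal standalone sketch that works on a plain NUL-terminated buffer
instead of a strbuf. The names sanitize_cstr() and is_space() are
hypothetical and not part of this patch, and the whitespace set used
for trimming is an assumption of the sketch.

```c
#include <stddef.h>
#include <string.h>

/* Whitespace as this sketch understands it (an assumption). */
static int is_space(unsigned char c)
{
	return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/*
 * Hypothetical plain-C analogue of strbuf_sanitize(): trim the
 * string in place, then replace every byte that is <= 32 or >= 127
 * with a dot.
 */
static void sanitize_cstr(char *s)
{
	size_t len = strlen(s);
	size_t start = 0;

	/* Trim trailing, then leading whitespace. */
	while (len > 0 && is_space((unsigned char)s[len - 1]))
		len--;
	while (start < len && is_space((unsigned char)s[start]))
		start++;
	len -= start;
	memmove(s, s + start, len);
	s[len] = '\0';

	/* Replace anything that is not printable, non-space ASCII. */
	for (size_t i = 0; i < len; i++) {
		unsigned char c = (unsigned char)s[i];
		if (c <= 32 || c >= 127)
			s[i] = '.';
	}
}
```

With this sketch, an input such as "  git/2.46.0 (custom build)\n"
would be rewritten in place without the surrounding whitespace and
with the inner spaces turned into dots.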

Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 strbuf.c  | 9 +++++++++
 strbuf.h  | 7 +++++++
 version.c | 9 ++-------
 3 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/strbuf.c b/strbuf.c
index 3d2189a7f6..cccfdec0e3 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -1082,3 +1082,12 @@ void strbuf_strip_file_from_path(struct strbuf *sb)
 	char *path_sep = find_last_dir_sep(sb->buf);
 	strbuf_setlen(sb, path_sep ? path_sep - sb->buf + 1 : 0);
 }
+
+void strbuf_sanitize(struct strbuf *sb)
+{
+	strbuf_trim(sb);
+	for (size_t i = 0; i < sb->len; i++) {
+		if (sb->buf[i] <= 32 || sb->buf[i] >= 127)
+			sb->buf[i] = '.';
+	}
+}
diff --git a/strbuf.h b/strbuf.h
index 003f880ff7..884157873e 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -664,6 +664,13 @@ typedef int (*char_predicate)(char ch);
 void strbuf_addstr_urlencode(struct strbuf *sb, const char *name,
 			     char_predicate allow_unencoded_fn);
 
+/*
+ * Trim, then replace each character whose ASCII code is 32 or
+ * below, or 127 or above, with a dot '.' character. Useful for
+ * sending capabilities.
+ */
+void strbuf_sanitize(struct strbuf *sb);
+
 __attribute__((format (printf,1,2)))
 int printf_ln(const char *fmt, ...);
 __attribute__((format (printf,2,3)))
diff --git a/version.c b/version.c
index 41b718c29e..951e6dca74 100644
--- a/version.c
+++ b/version.c
@@ -24,15 +24,10 @@ const char *git_user_agent_sanitized(void)
 
 	if (!agent) {
 		struct strbuf buf = STRBUF_INIT;
-		int i;
 
 		strbuf_addstr(&buf, git_user_agent());
-		strbuf_trim(&buf);
-		for (i = 0; i < buf.len; i++) {
-			if (buf.buf[i] <= 32 || buf.buf[i] >= 127)
-				buf.buf[i] = '.';
-		}
-		agent = buf.buf;
+		strbuf_sanitize(&buf);
+		agent = strbuf_detach(&buf, NULL);
 	}
 
 	return agent;
-- 
2.46.0.4.g7a37e584ed



* [PATCH v2 2/4] strbuf: refactor strbuf_trim_trailing_ch()
  2024-09-10 16:29 ` [PATCH v2 " Christian Couder
  2024-09-10 16:29   ` [PATCH v2 1/4] version: refactor strbuf_sanitize() Christian Couder
@ 2024-09-10 16:29   ` Christian Couder
  2024-09-10 16:29   ` [PATCH v2 3/4] Add 'promisor-remote' capability to protocol v2 Christian Couder
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2024-09-10 16:29 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Taylor Blau,
	Eric Sunshine, Christian Couder, Christian Couder

We often have to split strings at some specified terminator character.
The strbuf_split*() functions, which we can use for this purpose,
return substrings that include the terminator character, so we often
need to remove that character.

When it is whitespace, a newline or a directory separator, the
terminator character can easily be removed using an existing trimming
function like strbuf_rtrim(), strbuf_trim_trailing_newline() or
strbuf_trim_trailing_dir_sep(). There is no function to remove that
character when it's not one of those characters though.

Let's introduce a new strbuf_trim_trailing_ch() function that can be
used to remove any trailing character, and let's refactor existing code
that manually removed trailing characters using this new function.
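
To make the intended semantics concrete, here is a small sketch of
the same idea on a plain NUL-terminated string. The function name
trim_trailing_ch() and its length return value are assumptions made
for this illustration only; the real function operates on a strbuf.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical plain-C analogue of strbuf_trim_trailing_ch():
 * drop every trailing occurrence of 'c' and return the new length.
 */
static size_t trim_trailing_ch(char *s, int c)
{
	size_t len = strlen(s);

	while (len > 0 && s[len - 1] == c)
		len--;
	s[len] = '\0';
	return len;
}
```

Note that, like the function added below, this removes all trailing
occurrences of the character, whereas the code replaced in
trace2/tr2_cfg.c removed at most one.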

We are also going to use this new function in a following commit.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 strbuf.c         |  7 +++++++
 strbuf.h         |  3 +++
 trace2/tr2_cfg.c | 10 ++--------
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/strbuf.c b/strbuf.c
index cccfdec0e3..c986ec28f4 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -134,6 +134,13 @@ void strbuf_trim_trailing_dir_sep(struct strbuf *sb)
 	sb->buf[sb->len] = '\0';
 }
 
+void strbuf_trim_trailing_ch(struct strbuf *sb, int c)
+{
+	while (sb->len > 0 && sb->buf[sb->len - 1] == c)
+		sb->len--;
+	sb->buf[sb->len] = '\0';
+}
+
 void strbuf_trim_trailing_newline(struct strbuf *sb)
 {
 	if (sb->len > 0 && sb->buf[sb->len - 1] == '\n') {
diff --git a/strbuf.h b/strbuf.h
index 884157873e..5e389ab065 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -197,6 +197,9 @@ void strbuf_trim_trailing_dir_sep(struct strbuf *sb);
 /* Strip trailing LF or CR/LF */
 void strbuf_trim_trailing_newline(struct strbuf *sb);
 
+/* Strip trailing character c */
+void strbuf_trim_trailing_ch(struct strbuf *sb, int c);
+
 /**
  * Replace the contents of the strbuf with a reencoded form.  Returns -1
  * on error, 0 on success.
diff --git a/trace2/tr2_cfg.c b/trace2/tr2_cfg.c
index d96d908bb9..356fcd38f4 100644
--- a/trace2/tr2_cfg.c
+++ b/trace2/tr2_cfg.c
@@ -33,10 +33,7 @@ static int tr2_cfg_load_patterns(void)
 
 	tr2_cfg_patterns = strbuf_split_buf(envvar, strlen(envvar), ',', -1);
 	for (s = tr2_cfg_patterns; *s; s++) {
-		struct strbuf *buf = *s;
-
-		if (buf->len && buf->buf[buf->len - 1] == ',')
-			strbuf_setlen(buf, buf->len - 1);
+		strbuf_trim_trailing_ch(*s, ',');
 		strbuf_trim_trailing_newline(*s);
 		strbuf_trim(*s);
 	}
@@ -72,10 +69,7 @@ static int tr2_load_env_vars(void)
 
 	tr2_cfg_env_vars = strbuf_split_buf(varlist, strlen(varlist), ',', -1);
 	for (s = tr2_cfg_env_vars; *s; s++) {
-		struct strbuf *buf = *s;
-
-		if (buf->len && buf->buf[buf->len - 1] == ',')
-			strbuf_setlen(buf, buf->len - 1);
+		strbuf_trim_trailing_ch(*s, ',');
 		strbuf_trim_trailing_newline(*s);
 		strbuf_trim(*s);
 	}
-- 
2.46.0.4.g7a37e584ed



* [PATCH v2 3/4] Add 'promisor-remote' capability to protocol v2
  2024-09-10 16:29 ` [PATCH v2 " Christian Couder
  2024-09-10 16:29   ` [PATCH v2 1/4] version: refactor strbuf_sanitize() Christian Couder
  2024-09-10 16:29   ` [PATCH v2 2/4] strbuf: refactor strbuf_trim_trailing_ch() Christian Couder
@ 2024-09-10 16:29   ` Christian Couder
  2024-09-30  7:56     ` Patrick Steinhardt
                       ` (2 more replies)
  2024-09-10 16:30   ` [PATCH v2 4/4] promisor-remote: check advertised name or URL Christian Couder
                     ` (2 subsequent siblings)
  5 siblings, 3 replies; 110+ messages in thread
From: Christian Couder @ 2024-09-10 16:29 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Taylor Blau,
	Eric Sunshine, Christian Couder, Christian Couder

When a server S knows that some objects from a repository are available
from a promisor remote X, S might want to suggest to a client C cloning
or fetching the repo from S that C should use X directly instead of S
for these objects.

Note that this could happen both in the case S itself doesn't have the
objects and borrows them from X, and in the case S has the objects but
knows that X is better connected to the world (e.g., it is in a
$LARGEINTERNETCOMPANY datacenter with petabit/s backbone connections)
than S. Implementation of the latter case, which would require S to
omit in its response the objects available on X, is left for future
improvement though.

Then C might or might not want to get the objects from X, and should
let S know about this.

To allow S and C to agree and let each other know about C using X or
not, let's introduce a new "promisor-remote" capability in the
protocol v2, as well as a few new configuration variables:

  - "promisor.advertise" on the server side, and
  - "promisor.acceptFromServer" on the client side.

By default, or if "promisor.advertise" is set to 'false', a server S will
not advertise the "promisor-remote" capability.

If S doesn't advertise the "promisor-remote" capability, then a client C
replying to S shouldn't advertise the "promisor-remote" capability
either.

If "promisor.advertise" is set to 'true', S will advertise its promisor
remotes with a string like:

  promisor-remote=<pr-info>[;<pr-info>]...

where each <pr-info> element contains information about a single
promisor remote in the form:

  name=<pr-name>[,url=<pr-url>]

where <pr-name> is the urlencoded name of a promisor remote and
<pr-url> is the urlencoded URL of the promisor remote named <pr-name>.
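
The format above can be sketched with fixed-size buffers as follows.
This is only an illustration under stated assumptions: the helper
names (keep_unencoded(), append_raw(), append_encoded(),
append_pr_info()) are hypothetical, and the real code uses strbufs
and strbuf_addstr_urlencode() instead. The set of characters left
unencoded mirrors allow_unsanitized() from this patch.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Keep printable ASCII except ',', ';' and '%' unencoded. */
static int keep_unencoded(unsigned char ch)
{
	if (ch == ',' || ch == ';' || ch == '%')
		return 0;
	return ch > 32 && ch < 127;
}

/* Append src verbatim; return -1 if it does not fit in cap bytes. */
static int append_raw(char *dst, size_t cap, const char *src)
{
	size_t len = strlen(dst), n = strlen(src);

	if (len + n + 1 > cap)
		return -1;
	memcpy(dst + len, src, n + 1);
	return 0;
}

/* Append src percent-encoded; return -1 if it does not fit. */
static int append_encoded(char *dst, size_t cap, const char *src)
{
	size_t len = strlen(dst);

	for (; *src; src++) {
		unsigned char ch = (unsigned char)*src;

		if (keep_unencoded(ch)) {
			if (len + 2 > cap)
				return -1;
			dst[len++] = (char)ch;
			dst[len] = '\0';
		} else {
			if (len + 4 > cap)
				return -1;
			snprintf(dst + len, 4, "%%%02x", (unsigned int)ch);
			len += 3;
		}
	}
	return 0;
}

/* Append one <pr-info>, preceded by ';' if dst is not empty. */
static int append_pr_info(char *dst, size_t cap,
			  const char *name, const char *url)
{
	if (dst[0] && append_raw(dst, cap, ";"))
		return -1;
	if (append_raw(dst, cap, "name=") ||
	    append_encoded(dst, cap, name))
		return -1;
	if (url && (append_raw(dst, cap, ",url=") ||
		    append_encoded(dst, cap, url)))
		return -1;
	return 0;
}
```

Calling append_pr_info() once per promisor remote on an initially
empty buffer yields the value that would follow "promisor-remote="
in the advertisement. Encoding ',' and ';' is what keeps the two
separators unambiguous for the receiving side.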

For now, the URL is passed in addition to the name. In the future, it
might be possible to pass other information like a filter-spec that the
client should use when cloning from S, or a token that the client should
use when retrieving objects from X.

It might also be possible in the future for "promisor.advertise" to have
other values. For example a value like "onlyName" could prevent S from
advertising URLs, which could help in case C should use a different URL
for X than the URL S is using. (The URL S is using might be an internal
one on the server side for example.)

By default or if "promisor.acceptFromServer" is set to "None", C will
not agree to use the promisor remotes that might have been advertised
by S. In this case, C will not advertise any "promisor-remote"
capability in its reply to S.

If "promisor.acceptFromServer" is set to "All" and S advertised some
promisor remotes, then, on the contrary, C will agree to use all the
promisor remotes that S advertised and C will reply with a string like:

  promisor-remote=<pr-name>[;<pr-name>]...

where the <pr-name> elements are the urlencoded names of all the
promisor remotes S advertised.
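
The client side of the "All" case can be sketched in the same
spirit. The helpers below (hexval(), percent_decode(),
reply_accept_all()) are hypothetical names used for illustration.
As a simplification, the sketch copies each name in its encoded form
straight into the reply, whereas the code in this patch decodes the
name and URL first and re-encodes the accepted names.

```c
#include <stddef.h>
#include <string.h>

static int hexval(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/* Decode %XX escapes in place, e.g. before looking up a remote. */
static void percent_decode(char *s)
{
	char *out = s;

	while (*s) {
		if (s[0] == '%' && hexval(s[1]) >= 0 && hexval(s[2]) >= 0) {
			*out++ = (char)(hexval(s[1]) * 16 + hexval(s[2]));
			s += 3;
		} else {
			*out++ = *s++;
		}
	}
	*out = '\0';
}

/*
 * Build "<pr-name>[;<pr-name>]..." from an advertised <pr-infos>
 * value, accepting every entry. Returns -1 on a malformed entry or
 * if the reply buffer is too small.
 */
static int reply_accept_all(const char *info, char *reply, size_t cap)
{
	size_t len = 0;

	if (!cap)
		return -1;
	reply[0] = '\0';

	while (*info) {
		size_t entry_len = strcspn(info, ";");
		const char *name;
		size_t name_len;

		if (strncmp(info, "name=", 5))
			return -1;
		name = info + 5;
		name_len = strcspn(name, ",;");
		if (!name_len || len + (len ? 1 : 0) + name_len + 1 > cap)
			return -1;
		if (len)
			reply[len++] = ';';
		memcpy(reply + len, name, name_len);
		len += name_len;
		reply[len] = '\0';

		/* Skip the rest of this entry and its ';' separator. */
		info += entry_len;
		if (*info == ';')
			info++;
	}
	return 0;
}
```

The resulting string is what would follow "promisor-remote=" in the
client's reply. Because ',' and ';' are always encoded inside names
and URLs, splitting on the raw separators before decoding is safe.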

In a following commit, other values for "promisor.acceptFromServer" will
be implemented, so that C will be able to decide the promisor remotes it
accepts depending on the name and URL it received from S. So even if
that name and URL information is not used much right now, it will be
needed soon.

Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/promisor.txt     |  17 +++
 Documentation/gitprotocol-v2.txt      |  54 +++++++
 connect.c                             |   9 ++
 promisor-remote.c                     | 198 ++++++++++++++++++++++++++
 promisor-remote.h                     |  36 ++++-
 serve.c                               |  26 ++++
 t/t5710-promisor-remote-capability.sh | 124 ++++++++++++++++
 upload-pack.c                         |   3 +
 8 files changed, 466 insertions(+), 1 deletion(-)
 create mode 100755 t/t5710-promisor-remote-capability.sh

diff --git a/Documentation/config/promisor.txt b/Documentation/config/promisor.txt
index 98c5cb2ec2..9cbfe3e59e 100644
--- a/Documentation/config/promisor.txt
+++ b/Documentation/config/promisor.txt
@@ -1,3 +1,20 @@
 promisor.quiet::
 	If set to "true" assume `--quiet` when fetching additional
 	objects for a partial clone.
+
+promisor.advertise::
+	If set to "true", a server will use the "promisor-remote"
+	capability, see linkgit:gitprotocol-v2[5], to advertise the
+	promisor remotes it is using, if it uses some. Default is
+	"false", which means the "promisor-remote" capability is not
+	advertised.
+
+promisor.acceptFromServer::
+	If set to "all", a client will accept all the promisor remotes
+	a server might advertise using the "promisor-remote"
+	capability. Default is "none", which means no promisor remote
+	advertised by a server will be accepted. By accepting a
+	promisor remote, the client agrees that the server might omit
+	objects that are lazily fetchable from this promisor remote
+	from its responses to "fetch" and "clone" requests from the
+	client. See linkgit:gitprotocol-v2[5].
diff --git a/Documentation/gitprotocol-v2.txt b/Documentation/gitprotocol-v2.txt
index 414bc625d5..65d5256baf 100644
--- a/Documentation/gitprotocol-v2.txt
+++ b/Documentation/gitprotocol-v2.txt
@@ -781,6 +781,60 @@ retrieving the header from a bundle at the indicated URI, and thus
 save themselves and the server(s) the request(s) needed to inspect the
 headers of that bundle or bundles.
 
+promisor-remote=<pr-infos>
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The server may advertise some promisor remotes it is using or knows
+about to a client which may want to use them as its promisor remotes,
+instead of this repository. In this case <pr-infos> should be of the
+form:
+
+	pr-infos = pr-info | pr-infos ";" pr-info
+
+	pr-info = "name=" pr-name | "name=" pr-name "," "url=" pr-url
+
+where `pr-name` is the urlencoded name of a promisor remote, and
+`pr-url` the urlencoded URL of that promisor remote.
+
+In this case, if the client decides to use one or more promisor
+remotes the server advertised, it can reply with
+"promisor-remote=<pr-names>" where <pr-names> should be of the form:
+
+	pr-names = pr-name | pr-names ";" pr-name
+
+where `pr-name` is the urlencoded name of a promisor remote the server
+advertised and the client accepts.
+
+Note that, everywhere in this document, `pr-name` MUST be a valid
+remote name, and the ';' and ',' characters MUST be encoded if they
+appear in `pr-name` or `pr-url`.
+
+If the server doesn't know any promisor remote that could be good for
+a client to use, or prefers a client not to use any promisor remote it
+uses or knows about, it shouldn't advertise the "promisor-remote"
+capability at all.
+
+In this case, or if the client doesn't want to use any promisor remote
+the server advertised, the client shouldn't advertise the
+"promisor-remote" capability at all in its reply.
+
+The "promisor.advertise" and "promisor.acceptFromServer" configuration
+options can be used on the server and client side respectively to
+control what they advertise or accept respectively. See the
+documentation of these configuration options for more information.
+
+Note that in the future it would be nice if the "promisor-remote"
+protocol capability could be used by the server, when responding to
+`git fetch` or `git clone`, to advertise better-connected remotes that
+the client can use as promisor remotes, instead of this repository, so
+that the client can lazily fetch objects from these other
+better-connected remotes. This would require the server to omit in its
+response the objects available on the better-connected remotes that
+the client has accepted. This hasn't been implemented yet though. So
+for now this "promisor-remote" capability is useful only when the
+server advertises some promisor remotes it already uses to borrow
+objects from.
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/connect.c b/connect.c
index cf84e631e9..1650bbd71d 100644
--- a/connect.c
+++ b/connect.c
@@ -20,6 +20,7 @@
 #include "protocol.h"
 #include "alias.h"
 #include "bundle-uri.h"
+#include "promisor-remote.h"
 
 static char *server_capabilities_v1;
 static struct strvec server_capabilities_v2 = STRVEC_INIT;
@@ -485,6 +486,7 @@ void check_stateless_delimiter(int stateless_rpc,
 static void send_capabilities(int fd_out, struct packet_reader *reader)
 {
 	const char *hash_name;
+	const char *promisor_remote_info;
 
 	if (server_supports_v2("agent"))
 		packet_write_fmt(fd_out, "agent=%s", git_user_agent_sanitized());
@@ -498,6 +500,13 @@ static void send_capabilities(int fd_out, struct packet_reader *reader)
 	} else {
 		reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
 	}
+	if (server_feature_v2("promisor-remote", &promisor_remote_info)) {
+		char *reply = promisor_remote_reply(promisor_remote_info);
+		if (reply) {
+			packet_write_fmt(fd_out, "promisor-remote=%s", reply);
+			free(reply);
+		}
+	}
 }
 
 int get_remote_bundle_uri(int fd_out, struct packet_reader *reader,
diff --git a/promisor-remote.c b/promisor-remote.c
index 317e1b127f..baacbe9d94 100644
--- a/promisor-remote.c
+++ b/promisor-remote.c
@@ -11,6 +11,7 @@
 #include "strvec.h"
 #include "packfile.h"
 #include "environment.h"
+#include "url.h"
 
 struct promisor_remote_config {
 	struct promisor_remote *promisors;
@@ -219,6 +220,18 @@ int repo_has_promisor_remote(struct repository *r)
 	return !!repo_promisor_remote_find(r, NULL);
 }
 
+int repo_has_accepted_promisor_remote(struct repository *r)
+{
+	struct promisor_remote *p;
+
+	promisor_remote_init(r);
+
+	for (p = r->promisor_remote_config->promisors; p; p = p->next)
+		if (p->accepted)
+			return 1;
+	return 0;
+}
+
 static int remove_fetched_oids(struct repository *repo,
 			       struct object_id **oids,
 			       int oid_nr, int to_free)
@@ -290,3 +303,188 @@ void promisor_remote_get_direct(struct repository *repo,
 	if (to_free)
 		free(remaining_oids);
 }
+
+static int allow_unsanitized(char ch)
+{
+	if (ch == ',' || ch == ';' || ch == '%')
+		return 0;
+	return ch > 32 && ch < 127;
+}
+
+static void promisor_info_vecs(struct repository *repo,
+			       struct strvec *names,
+			       struct strvec *urls)
+{
+	struct promisor_remote *r;
+
+	promisor_remote_init(repo);
+
+	for (r = repo->promisor_remote_config->promisors; r; r = r->next) {
+		char *url = NULL;
+		char *url_key = xstrfmt("remote.%s.url", r->name);
+
+		strvec_push(names, r->name);
+		strvec_push(urls, git_config_get_string(url_key, &url) ? NULL : url);
+
+		free(url);
+		free(url_key);
+	}
+}
+
+char *promisor_remote_info(struct repository *repo)
+{
+	struct strbuf sb = STRBUF_INIT;
+	int advertise_promisors = 0;
+	struct strvec names = STRVEC_INIT;
+	struct strvec urls = STRVEC_INIT;
+
+	git_config_get_bool("promisor.advertise", &advertise_promisors);
+
+	if (!advertise_promisors)
+		return NULL;
+
+	promisor_info_vecs(repo, &names, &urls);
+
+	if (!names.nr)
+		return NULL;
+
+	for (size_t i = 0; i < names.nr; i++) {
+		if (i)
+			strbuf_addch(&sb, ';');
+		strbuf_addstr(&sb, "name=");
+		strbuf_addstr_urlencode(&sb, names.v[i], allow_unsanitized);
+		if (urls.v[i]) {
+			strbuf_addstr(&sb, ",url=");
+			strbuf_addstr_urlencode(&sb, urls.v[i], allow_unsanitized);
+		}
+	}
+
+	strbuf_sanitize(&sb);
+
+	strvec_clear(&names);
+	strvec_clear(&urls);
+
+	return strbuf_detach(&sb, NULL);
+}
+
+enum accept_promisor {
+	ACCEPT_NONE = 0,
+	ACCEPT_ALL
+};
+
+static int should_accept_remote(enum accept_promisor accept,
+				const char *remote_name UNUSED,
+				const char *remote_url UNUSED)
+{
+	if (accept == ACCEPT_ALL)
+		return 1;
+
+	BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+}
+
+static void filter_promisor_remote(struct repository *repo,
+				   struct strvec *accepted,
+				   const char *info)
+{
+	struct strbuf **remotes;
+	char *accept_str;
+	enum accept_promisor accept = ACCEPT_NONE;
+
+	if (!git_config_get_string("promisor.acceptfromserver", &accept_str)) {
+		if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
+			accept = ACCEPT_NONE;
+		else if (!strcasecmp("All", accept_str))
+			accept = ACCEPT_ALL;
+		else
+			warning(_("unknown '%s' value for '%s' config option"),
+				accept_str, "promisor.acceptfromserver");
+	}
+
+	if (accept == ACCEPT_NONE)
+		return;
+
+	/* Parse remote info received */
+
+	remotes = strbuf_split_str(info, ';', 0);
+
+	for (size_t i = 0; remotes[i]; i++) {
+		struct strbuf **elems;
+		const char *remote_name = NULL;
+		const char *remote_url = NULL;
+		char *decoded_name = NULL;
+		char *decoded_url = NULL;
+
+		strbuf_trim_trailing_ch(remotes[i], ';');
+		elems = strbuf_split_str(remotes[i]->buf, ',', 0);
+
+		for (size_t j = 0; elems[j]; j++) {
+			int res;
+			strbuf_trim_trailing_ch(elems[j], ',');
+			res = skip_prefix(elems[j]->buf, "name=", &remote_name) ||
+				skip_prefix(elems[j]->buf, "url=", &remote_url);
+			if (!res)
+				warning(_("unknown element '%s' from remote info"),
+					elems[j]->buf);
+		}
+
+		if (remote_name)
+			decoded_name = url_percent_decode(remote_name);
+		if (remote_url)
+			decoded_url = url_percent_decode(remote_url);
+
+		if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url))
+			strvec_push(accepted, decoded_name);
+
+		strbuf_list_free(elems);
+		free(decoded_name);
+		free(decoded_url);
+	}
+
+	free(accept_str);
+	strbuf_list_free(remotes);
+}
+
+char *promisor_remote_reply(const char *info)
+{
+	struct strvec accepted = STRVEC_INIT;
+	struct strbuf reply = STRBUF_INIT;
+
+	filter_promisor_remote(the_repository, &accepted, info);
+
+	if (!accepted.nr)
+		return NULL;
+
+	for (size_t i = 0; i < accepted.nr; i++) {
+		if (i)
+			strbuf_addch(&reply, ';');
+		strbuf_addstr_urlencode(&reply, accepted.v[i], allow_unsanitized);
+	}
+
+	strvec_clear(&accepted);
+
+	return strbuf_detach(&reply, NULL);
+}
+
+void mark_promisor_remotes_as_accepted(struct repository *r, const char *remotes)
+{
+	struct strbuf **accepted_remotes = strbuf_split_str(remotes, ';', 0);
+
+	for (size_t i = 0; accepted_remotes[i]; i++) {
+		struct promisor_remote *p;
+		char *decoded_remote;
+
+		strbuf_trim_trailing_ch(accepted_remotes[i], ';');
+		decoded_remote = url_percent_decode(accepted_remotes[i]->buf);
+
+		p = repo_promisor_remote_find(r, decoded_remote);
+		if (p)
+			p->accepted = 1;
+		else
+			warning(_("accepted promisor remote '%s' not found"),
+				decoded_remote);
+
+		free(decoded_remote);
+	}
+
+	strbuf_list_free(accepted_remotes);
+}
diff --git a/promisor-remote.h b/promisor-remote.h
index 88cb599c39..814ca248c7 100644
--- a/promisor-remote.h
+++ b/promisor-remote.h
@@ -9,11 +9,13 @@ struct object_id;
  * Promisor remote linked list
  *
  * Information in its fields come from remote.XXX config entries or
- * from extensions.partialclone.
+ * from extensions.partialclone, except for 'accepted' which comes
+ * from protocol v2 capabilities exchange.
  */
 struct promisor_remote {
 	struct promisor_remote *next;
 	char *partial_clone_filter;
+	unsigned int accepted : 1;
 	const char name[FLEX_ARRAY];
 };
 
@@ -32,4 +34,36 @@ void promisor_remote_get_direct(struct repository *repo,
 				const struct object_id *oids,
 				int oid_nr);
 
+/*
+ * Prepare a "promisor-remote" advertisement by a server.
+ * Check the value of "promisor.advertise" and maybe the configured
+ * promisor remotes, if any, to prepare information to send in an
+ * advertisement.
+ * Return value is NULL if no promisor remote advertisement should be
+ * made. Otherwise it contains the names and urls of the advertised
+ * promisor remotes separated by ';'
+ */
+char *promisor_remote_info(struct repository *repo);
+
+/*
+ * Prepare a reply to a "promisor-remote" advertisement from a server.
+ * Check the value of "promisor.acceptfromserver" and maybe the
+ * configured promisor remotes, if any, to prepare the reply.
+ * Return value is NULL if no promisor remote from the server
+ * is accepted. Otherwise it contains the names of the accepted promisor
+ * remotes separated by ';'.
+ */
+char *promisor_remote_reply(const char *info);
+
+/*
+ * Set the 'accepted' flag for some promisor remotes. Useful when some
+ * promisor remotes have been accepted by the client.
+ */
+void mark_promisor_remotes_as_accepted(struct repository *repo, const char *remotes);
+
+/*
+ * Has any promisor remote been accepted by the client?
+ */
+int repo_has_accepted_promisor_remote(struct repository *r);
+
 #endif /* PROMISOR_REMOTE_H */
diff --git a/serve.c b/serve.c
index 884cd84ca8..a8935571d6 100644
--- a/serve.c
+++ b/serve.c
@@ -12,6 +12,7 @@
 #include "upload-pack.h"
 #include "bundle-uri.h"
 #include "trace2.h"
+#include "promisor-remote.h"
 
 static int advertise_sid = -1;
 static int advertise_object_info = -1;
@@ -31,6 +32,26 @@ static int agent_advertise(struct repository *r UNUSED,
 	return 1;
 }
 
+static int promisor_remote_advertise(struct repository *r,
+				     struct strbuf *value)
+{
+	if (value) {
+		char *info = promisor_remote_info(r);
+		if (!info)
+			return 0;
+		strbuf_addstr(value, info);
+		free(info);
+	}
+	return 1;
+}
+
+static void promisor_remote_receive(struct repository *r,
+				    const char *remotes)
+{
+	mark_promisor_remotes_as_accepted(r, remotes);
+}
+
+
 static int object_format_advertise(struct repository *r,
 				   struct strbuf *value)
 {
@@ -157,6 +178,11 @@ static struct protocol_capability capabilities[] = {
 		.advertise = bundle_uri_advertise,
 		.command = bundle_uri_command,
 	},
+	{
+		.name = "promisor-remote",
+		.advertise = promisor_remote_advertise,
+		.receive = promisor_remote_receive,
+	},
 };
 
 void protocol_v2_advertise_capabilities(void)
diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
new file mode 100755
index 0000000000..7e44ad15ce
--- /dev/null
+++ b/t/t5710-promisor-remote-capability.sh
@@ -0,0 +1,124 @@
+#!/bin/sh
+
+test_description='handling of promisor remote advertisement'
+
+. ./test-lib.sh
+
+# Set up the repository with three commits; this way HEAD is always
+# available and we can hide commit 1 or 2.
+test_expect_success 'setup: create "template" repository' '
+	git init template &&
+	test_commit -C template 1 &&
+	test_commit -C template 2 &&
+	test_commit -C template 3 &&
+	test-tool genrandom foo 10240 >template/foo &&
+	git -C template add foo &&
+	git -C template commit -m foo
+'
+
+# A bare repo will act as a server repo with unpacked objects.
+test_expect_success 'setup: create bare "server" repository' '
+	git clone --bare --no-local template server &&
+	mv server/objects/pack/pack-* . &&
+	packfile=$(ls pack-*.pack) &&
+	git -C server unpack-objects --strict <"$packfile"
+'
+
+check_missing_objects () {
+	git -C "$1" rev-list --objects --all --missing=print > all.txt &&
+	perl -ne 'print if s/^[?]//' all.txt >missing.txt &&
+	test_line_count = "$2" missing.txt &&
+	test "$3" = "$(cat missing.txt)"
+}
+
+initialize_server () {
+	# Repack everything first
+	git -C server -c repack.writebitmaps=false repack -a -d &&
+
+	# Remove promisor files in case they exist, useful when reinitializing
+	rm -rf server/objects/pack/*.promisor &&
+
+	# Repack without the largest object and create a promisor pack on server
+	git -C server -c repack.writebitmaps=false repack -a -d \
+	    --filter=blob:limit=5k --filter-to="$(pwd)" &&
+	promisor_file=$(ls server/objects/pack/*.pack | sed "s/\.pack/.promisor/") &&
+	touch "$promisor_file" &&
+
+	# Check that only one object is missing on the server
+	check_missing_objects server 1 "$oid"
+}
+
+test_expect_success "setup for testing promisor remote advertisement" '
+	# Create another bare repo called "server2"
+	git init --bare server2 &&
+
+	# Copy the largest object from server to server2
+	obj="HEAD:foo" &&
+	oid="$(git -C server rev-parse $obj)" &&
+	oid_path="$(test_oid_to_path $oid)" &&
+	path="server/objects/$oid_path" &&
+	path2="server2/objects/$oid_path" &&
+	mkdir -p $(dirname "$path2") &&
+	cp "$path" "$path2" &&
+
+	initialize_server &&
+
+	# Configure server2 as promisor remote for server
+	git -C server remote add server2 "file://$(pwd)/server2" &&
+	git -C server config remote.server2.promisor true &&
+
+	git -C server2 config uploadpack.allowFilter true &&
+	git -C server2 config uploadpack.allowAnySHA1InWant true &&
+	git -C server config uploadpack.allowFilter true &&
+	git -C server config uploadpack.allowAnySHA1InWant true
+'
+
+test_expect_success "fetch with promisor.advertise set to 'true'" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=All \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "fetch with promisor.advertise set to 'false'" '
+	git -C server config promisor.advertise false &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=All \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is not missing on the server
+	check_missing_objects server 0 "" &&
+
+	# Reinitialize server so that the largest object is missing again
+	initialize_server
+'
+
+test_expect_success "fetch with promisor.acceptfromserver set to 'None'" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=None \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is not missing on the server
+	check_missing_objects server 0 ""
+'
+
+test_done
diff --git a/upload-pack.c b/upload-pack.c
index 0052c6a4dc..0cff76c845 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -31,6 +31,7 @@
 #include "write-or-die.h"
 #include "json-writer.h"
 #include "strmap.h"
+#include "promisor-remote.h"
 
 /* Remember to update object flag allocation in object.h */
 #define THEY_HAVE	(1u << 11)
@@ -317,6 +318,8 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 		strvec_push(&pack_objects.args, "--delta-base-offset");
 	if (pack_data->use_include_tag)
 		strvec_push(&pack_objects.args, "--include-tag");
+	if (repo_has_accepted_promisor_remote(the_repository))
+		strvec_push(&pack_objects.args, "--missing=allow-promisor");
 	if (pack_data->filter_options.choice) {
 		const char *spec =
 			expand_list_objects_filter_spec(&pack_data->filter_options);
-- 
2.46.0.4.g7a37e584ed



* [PATCH v2 4/4] promisor-remote: check advertised name or URL
  2024-09-10 16:29 ` [PATCH v2 " Christian Couder
                     ` (2 preceding siblings ...)
  2024-09-10 16:29   ` [PATCH v2 3/4] Add 'promisor-remote' capability to protocol v2 Christian Couder
@ 2024-09-10 16:30   ` Christian Couder
  2024-09-30  7:57     ` Patrick Steinhardt
  2024-09-26 18:09   ` [PATCH v2 0/4] Introduce a "promisor-remote" capability Junio C Hamano
  2024-12-06 12:42   ` [PATCH v3 0/5] " Christian Couder
  5 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2024-09-10 16:30 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Taylor Blau,
	Eric Sunshine, Christian Couder, Christian Couder

A previous commit introduced a "promisor.acceptFromServer" configuration
variable with only "None" or "All" as valid values.

Let's introduce "KnownName" and "KnownUrl" as valid values for this
configuration option to give more choice to a client about which
promisor remotes it might accept among those that the server advertised.

In case of "KnownName", the client will accept promisor remotes which
are already configured on the client and have the same name as those
advertised by the server. This could be useful in a corporate setup
where servers and clients are trusted to not switch names and URLs, but
where some kind of control is still useful.

In case of "KnownUrl", the client will accept promisor remotes which
have both the same name and the same URL configured on the client as the
name and URL advertised by the server. This is the most secure option,
so it should be used if possible.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/promisor.txt     | 22 ++++++---
 promisor-remote.c                     | 54 +++++++++++++++++++--
 t/t5710-promisor-remote-capability.sh | 68 +++++++++++++++++++++++++++
 3 files changed, 134 insertions(+), 10 deletions(-)
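
For readers skimming the diff, the decision that "KnownName" and
"KnownUrl" add can be sketched outside of Git as below. This is only an
illustration under stated assumptions: `struct known_remote`,
`find_known()` and `sketch_should_accept()` are hypothetical names,
plain arrays stand in for the strvecs used in the patch, and a missing
advertised URL is treated as a mismatch, which is an assumption of this
sketch rather than something the patch spells out.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>

/* Hypothetical stand-in for a promisor remote configured on the client. */
struct known_remote {
	const char *name;
	const char *url;
};

enum accept_policy {
	POLICY_NONE = 0,
	POLICY_KNOWN_URL,
	POLICY_KNOWN_NAME,
	POLICY_ALL
};

/* Return the configured remote whose name matches case-insensitively, or NULL. */
static const struct known_remote *find_known(const struct known_remote *known,
					     size_t nr, const char *name)
{
	for (size_t i = 0; i < nr; i++)
		if (!strcasecmp(known[i].name, name))
			return &known[i];
	return NULL;
}

/* Decide whether an advertised (name, url) pair is accepted under 'policy'. */
static int sketch_should_accept(enum accept_policy policy,
				const struct known_remote *known, size_t nr,
				const char *adv_name, const char *adv_url)
{
	const struct known_remote *k;

	assert(adv_name);

	if (policy == POLICY_NONE)
		return 0;
	if (policy == POLICY_ALL)
		return 1;

	k = find_known(known, nr, adv_name);
	if (!k)
		return 0; /* the client does not know this remote */
	if (policy == POLICY_KNOWN_NAME)
		return 1;

	/* POLICY_KNOWN_URL; assumption: a missing URL on either side is a mismatch */
	if (!adv_url || !k->url)
		return 0;
	return !strcasecmp(k->url, adv_url);
}
```

With a client that only knows "server2" at one URL, the sketch accepts
an advertised "server2" under "KnownName" whatever URL comes with it,
but under "KnownUrl" it rejects the same name with a different URL
string. The comparison is on the string itself, which is why the
symlinked "serverTwo" URL in the last test is expected to be refused.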

diff --git a/Documentation/config/promisor.txt b/Documentation/config/promisor.txt
index 9cbfe3e59e..d1364bc018 100644
--- a/Documentation/config/promisor.txt
+++ b/Documentation/config/promisor.txt
@@ -12,9 +12,19 @@ promisor.advertise::
 promisor.acceptFromServer::
 	If set to "all", a client will accept all the promisor remotes
 	a server might advertise using the "promisor-remote"
-	capability. Default is "none", which means no promisor remote
-	advertised by a server will be accepted. By accepting a
-	promisor remote, the client agrees that the server might omit
-	objects that are lazily fetchable from this promisor remote
-	from its responses to "fetch" and "clone" requests from the
-	client. See linkgit:gitprotocol-v2[5].
+	capability. If set to "knownName" the client will accept
+	promisor remotes which are already configured on the client
+	and have the same name as those advertised by the server. This
+	is not very secure, but could be used in a corporate setup
+	where servers and clients are trusted to not switch name and
+	URLs. If set to "knownUrl", the client will accept promisor
+	remotes which have both the same name and the same URL
+	configured on the client as the name and URL advertised by the
+	server. This is more secure than "all" or "knownName", so it
+	should be used if possible instead of those options. Default
+	is "none", which means no promisor remote advertised by a
+	server will be accepted. By accepting a promisor remote, the
+	client agrees that the server might omit objects that are
+	lazily fetchable from this promisor remote from its responses
+	to "fetch" and "clone" requests from the client. See
+	linkgit:gitprotocol-v2[5].
diff --git a/promisor-remote.c b/promisor-remote.c
index baacbe9d94..f713595eb0 100644
--- a/promisor-remote.c
+++ b/promisor-remote.c
@@ -367,19 +367,54 @@ char *promisor_remote_info(struct repository *repo)
 	return strbuf_detach(&sb, NULL);
 }
 
+/*
+ * Find first index of 'vec' where there is 'val'. 'val' is compared
+ * case insensitively to the strings in 'vec'. If not found 'vec->nr' is
+ * returned.
+ */
+static size_t strvec_find_index(struct strvec *vec, const char *val)
+{
+	for (size_t i = 0; i < vec->nr; i++)
+		if (!strcasecmp(vec->v[i], val))
+			return i;
+	return vec->nr;
+}
+
 enum accept_promisor {
 	ACCEPT_NONE = 0,
+	ACCEPT_KNOWN_URL,
+	ACCEPT_KNOWN_NAME,
 	ACCEPT_ALL
 };
 
 static int should_accept_remote(enum accept_promisor accept,
-				const char *remote_name UNUSED,
-				const char *remote_url UNUSED)
+				const char *remote_name, const char *remote_url,
+				struct strvec *names, struct strvec *urls)
 {
+	size_t i;
+
 	if (accept == ACCEPT_ALL)
 		return 1;
 
-	BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+	i = strvec_find_index(names, remote_name);
+
+	if (i >= names->nr)
+		/* We don't know about that remote */
+		return 0;
+
+	if (accept == ACCEPT_KNOWN_NAME)
+		return 1;
+
+	if (accept != ACCEPT_KNOWN_URL)
+		BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+
+	if (!strcasecmp(urls->v[i], remote_url))
+		return 1;
+
+	warning(_("known remote named '%s' but with url '%s' instead of '%s'"),
+		remote_name, urls->v[i], remote_url);
+
+	return 0;
 }
 
 static void filter_promisor_remote(struct repository *repo,
@@ -389,10 +424,16 @@ static void filter_promisor_remote(struct repository *repo,
 	struct strbuf **remotes;
 	char *accept_str;
 	enum accept_promisor accept = ACCEPT_NONE;
+	struct strvec names = STRVEC_INIT;
+	struct strvec urls = STRVEC_INIT;
 
 	if (!git_config_get_string("promisor.acceptfromserver", &accept_str)) {
 		if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
 			accept = ACCEPT_NONE;
+		else if (!strcasecmp("KnownUrl", accept_str))
+			accept = ACCEPT_KNOWN_URL;
+		else if (!strcasecmp("KnownName", accept_str))
+			accept = ACCEPT_KNOWN_NAME;
 		else if (!strcasecmp("All", accept_str))
 			accept = ACCEPT_ALL;
 		else
@@ -403,6 +444,9 @@ static void filter_promisor_remote(struct repository *repo,
 	if (accept == ACCEPT_NONE)
 		return;
 
+	if (accept != ACCEPT_ALL)
+		promisor_info_vecs(repo, &names, &urls);
+
 	/* Parse remote info received */
 
 	remotes = strbuf_split_str(info, ';', 0);
@@ -432,7 +476,7 @@ static void filter_promisor_remote(struct repository *repo,
 		if (remote_url)
 			decoded_url = url_percent_decode(remote_url);
 
-		if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url))
+		if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url, &names, &urls))
 			strvec_push(accepted, decoded_name);
 
 		strbuf_list_free(elems);
@@ -441,6 +485,8 @@ static void filter_promisor_remote(struct repository *repo,
 	}
 
 	free(accept_str);
+	strvec_clear(&names);
+	strvec_clear(&urls);
 	strbuf_list_free(remotes);
 }
 
diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
index 7e44ad15ce..c2c83a5914 100755
--- a/t/t5710-promisor-remote-capability.sh
+++ b/t/t5710-promisor-remote-capability.sh
@@ -117,6 +117,74 @@ test_expect_success "fetch with promisor.acceptfromserver set to 'None'" '
 		--no-local --filter="blob:limit=5k" server client &&
 	test_when_finished "rm -rf client" &&
 
+	# Check that the largest object is not missing on the server
+	check_missing_objects server 0 "" &&
+
+	# Reinitialize server so that the largest object is missing again
+	initialize_server
+'
+
+test_expect_success "fetch with promisor.acceptfromserver set to 'KnownName'" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=KnownName \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "fetch with 'KnownName' and different remote names" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.serverTwo.promisor=true \
+		-c remote.serverTwo.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.serverTwo.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=KnownName \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is not missing on the server
+	check_missing_objects server 0 "" &&
+
+	# Reinitialize server so that the largest object is missing again
+	initialize_server
+'
+
+test_expect_success "fetch with promisor.acceptfromserver set to 'KnownUrl'" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=KnownUrl \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "fetch with 'KnownUrl' and different remote urls" '
+	ln -s server2 serverTwo &&
+
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/serverTwo" \
+		-c promisor.acceptfromserver=KnownUrl \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
 	# Check that the largest object is not missing on the server
 	check_missing_objects server 0 ""
 '
-- 
2.46.0.4.g7a37e584ed



* Re: [PATCH 3/4] Add 'promisor-remote' capability to protocol v2
  2024-08-05 13:48   ` Patrick Steinhardt
  2024-08-19 20:00     ` Junio C Hamano
@ 2024-09-10 16:31     ` Christian Couder
  1 sibling, 0 replies; 110+ messages in thread
From: Christian Couder @ 2024-09-10 16:31 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Junio C Hamano, John Cai, Christian Couder

On Mon, Aug 5, 2024 at 8:01 PM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Wed, Jul 31, 2024 at 03:40:13PM +0200, Christian Couder wrote:

> > +The server may advertise some promisor remotes it is using, if it's OK
> > +for the server that a client uses them too. In this case <pr-infos>
> > +should be of the form:
> > +
> > +     pr-infos = pr-info | pr-infos ";" pr-info
> > +
> > +     pr-info = "name=" pr-name | "name=" pr-name "," "url=" pr-url
> > +
> > +where `pr-name` is the name of a promisor remote, and `pr-url` the
> > +urlencoded URL of that promisor remote.
> > +
> > +In this case a client wanting to use one or more promisor remotes the
> > +server advertised should reply with "promisor-remote=<pr-names>" where
> > +<pr-names> should be of the form:
> > +
> > +     pr-names = pr-name | pr-names ";" pr-name
> > +
> > +where `pr-name` is the name of a promisor remote the server
> > +advertised.
> > +
> > +If the server prefers a client not to use any promisor remote the
> > +server uses, or if the server doesn't use any promisor remote, it
> > +should only advertise "promisor-remote" without any value or "=" sign
> > +after it.
> > +
> > +In this case, or if the client doesn't want to use any promisor remote
> > +the server advertised, the client should reply only "promisor-remote"
> > +without any value or "=" sign after it.
>
> Why does the client have to advertise anything if they don't want to use
> any of the promisor remotes?

I have tried to explain it in a reply to Taylor, but as you, Junio and
others seem to prefer the capability not to be advertised at all when
not used, I have changed this in version 2.
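
For reference, the <pr-infos> grammar quoted above could be read by a
receiver roughly as follows. This is a standalone sketch, not the code
from the patch: `struct pr_info`, `copy_field()` and `parse_pr_infos()`
are hypothetical names, fixed-size buffers replace strbufs, unknown
elements are silently skipped, and the values are left urlencoded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FIELD 256

/* Hypothetical holder for one parsed <pr-info>; values stay urlencoded. */
struct pr_info {
	char name[MAX_FIELD];
	char url[MAX_FIELD]; /* empty string when no "url=" element was sent */
};

/* Copy 'len' bytes of 'src' into 'dst', truncating to fit and NUL-terminating. */
static void copy_field(char *dst, const char *src, size_t len)
{
	if (len >= MAX_FIELD)
		len = MAX_FIELD - 1;
	memcpy(dst, src, len);
	dst[len] = '\0';
}

/*
 * Parse "name=a,url=b;name=c" into 'out' (at most 'max' entries).
 * Returns the number of entries that carried a non-empty "name=" element.
 */
static size_t parse_pr_infos(const char *infos, struct pr_info *out, size_t max)
{
	size_t nr = 0;
	const char *p = infos;

	assert(infos && out);

	while (*p && nr < max) {
		const char *end = p + strcspn(p, ";"); /* end of this <pr-info> */
		const char *e = p;
		struct pr_info cur = { "", "" };

		while (e < end) {
			size_t elem_len = strcspn(e, ",;");

			if (elem_len >= 5 && !strncmp(e, "name=", 5))
				copy_field(cur.name, e + 5, elem_len - 5);
			else if (elem_len >= 4 && !strncmp(e, "url=", 4))
				copy_field(cur.url, e + 4, elem_len - 4);
			/* unknown elements are skipped in this sketch */

			e += elem_len;
			if (e < end)
				e++; /* skip the ',' separator */
		}

		if (*cur.name)
			out[nr++] = cur;

		p = end;
		if (*p == ';')
			p++;
	}
	return nr;
}
```

Feeding it "name=server2,url=file:///srv/server2;name=other" gives two
entries, the second one with an empty URL, which matches the two
alternatives of the <pr-info> rule.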

> > +The "promisor.advertise" and "promisor.acceptFromServer" configuration
> > +options can be used on the server and client side respectively to
> > +control what they advertise or accept respectively. See the
> > +documentation of these configuration options for more information.
>
> One thing I'm not totally clear on is the consequence of this
> capability. What is the expected consequence if the client accepts one
> of the promisor remotes? What is the consequence if the client accepts
> none?

I have tried to improve the documentation significantly, especially
according to Junio's suggestion, in version 2.

> In the former case I'd expect that the server is free to omit objects,
> but that isn't made explicit anywhere, I think.

Junio also suggested making it explicit so I have done that in version 2.

> Also, is there any
> mechanism that tells the client exactly which objects have been omitted?

I don't think it's necessary. Agreeing on which promisor remote (name
and URL) to use should be enough security-wise. When using bundle-uri,
for example, the server is not telling the client exactly which
objects are in the bundle.

> In the latter case I assume that the result will be a full clone, that
> is the server fetched any objects it didn't have from the promisor?

Yeah, the server should fetch any objects it doesn't have from the
promisor, so it can send everything to the client.

> Or does the server side continue to only honor whatever the client has
> provided as object filters, but signals to the client that it shall
> please contact somebody else when backfilling those promised objects?

No. Options to enable things like this could be built on top later if
needed though.

Thanks for the review.


* Re: [PATCH 3/4] Add 'promisor-remote' capability to protocol v2
  2024-08-20 11:32     ` Christian Couder
  2024-08-20 16:55       ` Junio C Hamano
@ 2024-09-10 16:32       ` Christian Couder
  2024-09-10 17:46         ` Junio C Hamano
  1 sibling, 1 reply; 110+ messages in thread
From: Christian Couder @ 2024-09-10 16:32 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Junio C Hamano, John Cai, Patrick Steinhardt,
	Christian Couder

On Tue, Aug 20, 2024 at 1:32 PM Christian Couder
<christian.couder@gmail.com> wrote:
>
> On Wed, Jul 31, 2024 at 6:16 PM Taylor Blau <me@ttaylorr.com> wrote:
> >
> > On Wed, Jul 31, 2024 at 03:40:13PM +0200, Christian Couder wrote:
> > > diff --git a/Documentation/gitprotocol-v2.txt b/Documentation/gitprotocol-v2.txt
> > > index 414bc625d5..4d8d3839c4 100644
> > > --- a/Documentation/gitprotocol-v2.txt
> > > +++ b/Documentation/gitprotocol-v2.txt
> > > @@ -781,6 +781,43 @@ retrieving the header from a bundle at the indicated URI, and thus
> > >  save themselves and the server(s) the request(s) needed to inspect the
> > >  headers of that bundle or bundles.
> > >
> > > +promisor-remote=<pr-infos>
> > > +~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > +
> > > +The server may advertise some promisor remotes it is using, if it's OK
> > > +for the server that a client uses them too. In this case <pr-infos>
> > > +should be of the form:
> > > +
> > > +     pr-infos = pr-info | pr-infos ";" pr-info

[...]

> > I wonder if it would instead be useful to have <pr-infos> first write
> > out how many <pr-info>s it contains, and then write out each <pr-info>
> > separated by a NUL byte, so that none of the files in the <pr-info>
> > itself are restricted in what characters they can use.
>
> I am not sure how NUL bytes would interfere with the pkt-line.[c,h] code though.

As Junio said pkt-line.[ch] is about <length> and <bytes> and it is
used to transfer the pack data stream that can have arbitrary bytes,
so there is no problem with NUL bytes. Sorry for not checking.
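
To make that concrete, here is a rough sketch of pkt-line framing as I
understand it from the protocol documentation: four hex digits giving
the total length (header included), followed by the payload bytes
as-is. `pkt_line_frame()` is a hypothetical helper written for this
message; it is not the pkt-line.[ch] API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PKT_MAX_PAYLOAD 65516 /* 65520 minus the 4-byte length header */

/*
 * Frame 'len' bytes of 'data' as one pkt-line in 'out'.
 * Returns the total number of bytes written, or 0 if it does not fit.
 */
static size_t pkt_line_frame(char *out, size_t out_size,
			     const void *data, size_t len)
{
	char hdr[5];

	assert(out && data);

	if (len > PKT_MAX_PAYLOAD || out_size < len + 4)
		return 0;

	/* the length counts the 4 header bytes as well as the payload */
	snprintf(hdr, sizeof(hdr), "%04x", (unsigned int)(len + 4));
	memcpy(out, hdr, 4);
	memcpy(out + 4, data, len); /* payload is opaque, NUL bytes included */
	return len + 4;
}
```

Framing the 8 bytes of "name=a\0b" this way yields the header "000c"
followed by the payload unchanged, NUL included, since nothing in the
framing looks at the payload bytes.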

However I still think that capabilities have been using a simple text
format for now which works well, and that it's better to respect that
format and not introduce complexity in it if it's not necessary.

For example t5555-http-smart-common.sh has:

cat >expect <<-EOF &&
    version 2
    agent=FAKE
    ls-refs=unborn
    fetch=shallow wait-for-done
    server-option
    object-format=$(test_oid algo)
    0000
    EOF

to check the capabilities sent by `git upload-pack --advertise-refs`.

t5701 also uses similar instructions to check protocol v2 server commands.

So I think it's nice for tests and debugging if we keep using a simple
text format.

Also writing the number of <pr-info>s and then each <pr-info>
separated by a NUL byte might not save a lot of bytes compared to
urlencoded content if necessary, as I don't think many special
characters will need to be urlencoded most of the time.
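
As a rough illustration of that overhead, the sketch below
percent-encodes a string using the same rule as allow_unsanitized() in
the patch (printable ASCII except ',', ';' and '%' passes through).
`sketch_allowed()` and `sketch_urlencode()` are hypothetical names, and
the fixed output buffer is a simplification compared to the strbuf code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same rule as allow_unsanitized() in the patch: printable ASCII minus , ; % */
static int sketch_allowed(char ch)
{
	if (ch == ',' || ch == ';' || ch == '%')
		return 0;
	return ch > 32 && ch < 127;
}

/*
 * Percent-encode 'in' into 'out' and NUL-terminate it.
 * Returns the encoded length, or 0 if 'out' is too small
 * (an empty input also yields 0).
 */
static size_t sketch_urlencode(char *out, size_t out_size, const char *in)
{
	size_t n = 0;

	assert(out && in);

	if (!out_size)
		return 0;

	for (; *in; in++) {
		size_t need = sketch_allowed(*in) ? 1 : 3;

		/* keep room for this character and the final NUL */
		if (n + need + 1 > out_size)
			return 0;
		if (need == 1)
			out[n] = *in;
		else
			snprintf(out + n, 4, "%%%02x", (unsigned char)*in);
		n += need;
	}
	out[n] = '\0';
	return n;
}
```

A typical URL such as "file:///srv/server2" passes through unchanged,
so it costs nothing extra; only bytes like a space or ';' grow from one
byte to three, as in "my repo;v1" becoming "my%20repo%3bv1".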

So in the version 2 of this patch series, I haven't changed this.

Thanks for the review.


* Re: [PATCH 3/4] Add 'promisor-remote' capability to protocol v2
  2024-08-20 17:01       ` Junio C Hamano
@ 2024-09-10 16:32         ` Christian Couder
  0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2024-09-10 16:32 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Taylor Blau, git, John Cai, Patrick Steinhardt, Christian Couder

On Tue, Aug 20, 2024 at 7:01 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > I agree that it's more useful the other way though. That is for a
> > server to know that the client has the capability but might not want
> > to use it.
> >
> > For example, when C clones without using X directly, it can be a
> > burden for S to have to fetch large objects from X (as it would use
> > precious disk space on S, and unnecessarily duplicate large objects).
> > So S might want to say "please use a newer or different client that
> > has the 'promisor-remote' capability" if it knows that the client
> > doesn't have this capability. If S knows that C has the capability but
> > didn't configure it or doesn't want to use it, it could instead say
> > something like "please consider activating the 'promisor-remote'
> > capability by doing this and that to avoid burdening this server and
> > get a faster clone".
> >
> > Note that the client might not be 'git'. It might be a "compatible"
> > implementation (libgit2, gix, JGit, etc), so using the version passed
> > in the "agent" protocol capability is not a good way to detect if the
> > client has the capability or not.
>
> It is none of S's business to even know about C's "true" capability,
> if C does not want to use it with S.  I do not quite find the above
> a credible justification.

Ok, as you and others have said that the "promisor-remote" capability
should not be advertised by the server or the client if they aren't
actually using it, then I have changed the implementation in the
version 2 of the patch series according to that.

I still think that this change might make it harder than necessary
(for example for support teams at GitHub and GitLab) to help users and
debug issues related to this.

The only downside I saw with always advertising the "promisor-remote"
capability even when not using it, was that it added a bit of bloat in
the protocol, but there are a number of things that could be done to
avoid that. For example changing the name of the capability to just
"promisor" or even "pr" instead of "promisor-remote" could reduce the
size of the overhead.

Thanks for the review.


* Re: [PATCH 4/4] promisor-remote: check advertised name or URL
  2024-07-31 18:35   ` Junio C Hamano
@ 2024-09-10 16:32     ` Christian Couder
  0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2024-09-10 16:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, John Cai, Patrick Steinhardt, Christian Couder

On Wed, Jul 31, 2024 at 8:35 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > A previous commit introduced a "promisor.acceptFromServer" configuration
> > variable with only "None" or "All" as valid values.
> >
> > Let's introduce "KnownName" and "KnownUrl" as valid values for this
> > configuration option to give more choice to a client about which
> > promisor remotes it might accept among those that the server advertised.
>
> A malicious server can switch name and url correspondence.  The URLs
> this repository uses to lazily fetch missing objects from are the
> only thing that matters, and it does not matter what name the server
> calls these URLs.  I am not sure what value, if any, KnownName has,
> other than adding a potential security hole.

In a corporate setup where clients and servers trust each other to not
switch names and URLs, it could be valuable to still have a bit of
control in a simple way, for example:
  - if servers use many promisor remotes, but clients should only use
a subset of them, or:
  - if the URLs used by clients should not be the same as the URLs
used by servers

In version 2, I have updated the "promisor.acceptFromServer"
documentation and the commit message of this patch to better explain
cases where the new "KnownName" and "KnownUrl" could be useful.

> > In case of "KnownUrl", the client will accept promisor remotes which
> > have both the same name and the same URL configured on the client as the
> > name and URL advertised by the server.
>
> This makes sense, especially if we had updates to documents I
> suggested in my review of [3/4].  If the side effect of "accepting"
> a suggested promisor remote were to only use it as a promisor remote
> on this side, there is no reason to "accept" the same thing again,
> but because the main effect at the protocol level of "accepting" is
> to affect the behaviour of the server in such a way that it is now
> allowed to omit objects that are requested but would be available
> lazily from the promisor remotes in the response, we _do_ need to
> be able to respond with the promisor remotes we are willing to and
> have been using.

Yeah, it is better to let the server know.
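
The accept values discussed in this message can be sketched as a small
filter over what the server advertised. This is only an illustration in
Python, not the code from the patch: the function name, the data shapes
and the case-insensitive matching of the config value are assumptions.

```python
def accepted_remotes(policy, advertised, configured):
    """Return the names of advertised promisor remotes the client accepts.

    advertised: list of (name, url_or_None) pairs sent by the server.
    configured: dict mapping remote name -> URL already in the client config.
    Both shapes are assumptions made for this sketch.
    """
    policy = policy.lower()
    if policy == "none":
        return []
    if policy == "all":
        return [name for name, _ in advertised]
    if policy == "knownname":
        # Only the name has to match something the client already has.
        return [name for name, _ in advertised if name in configured]
    if policy == "knownurl":
        # Both the name and the URL have to match the client's own config.
        return [name for name, url in advertised
                if url is not None and configured.get(name) == url]
    raise ValueError(f"unknown promisor.acceptFromServer value: {policy!r}")


# Made-up example: the server advertises a known name with a different URL.
configured = {"lop": "https://lop.example.com/repo.git"}
advertised = [("lop", "https://elsewhere.example.net/repo.git"),
              ("other", "https://other.example.com/repo.git")]
print(accepted_remotes("knownName", advertised, configured))
print(accepted_remotes("knownUrl", advertised, configured))
```

With "knownName" the mismatching URL goes unnoticed, which is the
concern raised in the review; "knownUrl" rejects it.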

> This iteration does not seem to have the true server side support to
> slim its response by omitting objects that are available elsewhere,

Yeah, in version 2, the commit message of patch 3/4 has been improved
to say that implementation of this case, which would require S to omit
in its response the objects available on X, is left for future
improvement.

> but I agree that it is a good approach to get the protocol support
> right.

Thanks.


* Re: [PATCH 3/4] Add 'promisor-remote' capability to protocol v2
  2024-09-10 16:32       ` Christian Couder
@ 2024-09-10 17:46         ` Junio C Hamano
  0 siblings, 0 replies; 110+ messages in thread
From: Junio C Hamano @ 2024-09-10 17:46 UTC (permalink / raw)
  To: Christian Couder
  Cc: Taylor Blau, git, John Cai, Patrick Steinhardt, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

>> > > +     pr-infos = pr-info | pr-infos ";" pr-info
>
> [...]
>
>> > I wonder if it would instead be useful to have <pr-infos> first write
>> > out how many <pr-info>s it contains, and then write out each <pr-info>
>> > separated by a NUL byte, so that none of the fields in the <pr-info>
>> > itself are restricted in what characters they can use.
>>
>> I am not sure how NUL bytes would interfere with the pkt-line.[c,h] code though.
> ...
> However I still think that capabilities have been using a simple text
> format for now which works well, and that it's better to respect that
> format and not introduce complexity in it if it's not necessary.

Yup, especially when we are in control of what goes into "pr-info",
I do not see much reason to go binary.  It helps debuggability
greatly to stay in text format when you can.

Thanks.


* Re: [PATCH v2 0/4] Introduce a "promisor-remote" capability
  2024-09-10 16:29 ` [PATCH v2 " Christian Couder
                     ` (3 preceding siblings ...)
  2024-09-10 16:30   ` [PATCH v2 4/4] promisor-remote: check advertised name or URL Christian Couder
@ 2024-09-26 18:09   ` Junio C Hamano
  2024-09-27  9:15     ` Christian Couder
  2024-12-06 12:42   ` [PATCH v3 0/5] " Christian Couder
  5 siblings, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2024-09-26 18:09 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine

Christian Couder <christian.couder@gmail.com> writes:

> Changes compared to version 1
> ...
> Thanks to Junio, Patrick, Eric and Taylor for their suggestions.

We haven't heard from anybody in support of (or against, for that
matter) this series even after a few weeks, which is not a good
sign, even with everybody away for GitMerge for a few days.

IIRC, the comments that the initial iteration has received were
mostly about clarifying the intent of this new capability (and some
typofixes).  What are opinions on this round from folks (especially
those who did not read the initial round)?  Does this round clearly
explain what the capability means and why projects want to use it
under what condition?

Personally, I still find that knownName is increasing potential
attack surface without much benefit, but in a tightly controlled
intranet environment, it might have convenience value.  I dunno.


* Re: [PATCH v2 0/4] Introduce a "promisor-remote" capability
  2024-09-26 18:09   ` [PATCH v2 0/4] Introduce a "promisor-remote" capability Junio C Hamano
@ 2024-09-27  9:15     ` Christian Couder
  2024-09-27 22:48       ` Junio C Hamano
  0 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2024-09-27  9:15 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Michael Haggerty

On Thu, Sep 26, 2024 at 8:09 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > Changes compared to version 1
> > ...
> > Thanks to Junio, Patrick, Eric and Taylor for their suggestions.
>
> We haven't heard from anybody in support of (or against, for that
> matter) this series even after a few weeks, which is not a good
> sign, even with everybody away for GitMerge for a few days.

By the way there was an unconference breakout session on day 2 of the
Git Merge called "Git LFS Can we do better?" where this was discussed
with a number of people. Scott Chacon took some notes:

https://github.com/git/git-merge/blob/main/breakouts/git-lfs.md

It was in parallel with the Contributor Summit, so few contributors
participated in this session (maybe only Michael Haggerty, John Cai
and me). But the impression of GitLab people there, including me, was
that folks in general would be happy to have an alternative to Git LFS
based on this.

> IIRC, the comments that the initial iteration has received were
> mostly about clarifying the intent of this new capability (and some
> typofixes).  What are opinions on this round from folks (especially
> those who did not read the initial round)?  Does this round clearly
> explain what the capability means and why projects want to use it
> under what condition?
>
> Personally, I still find that knownName is increasing potential
> attack surface without much benefit, but in a tightly controlled
> intranet environment, it might have convenience value.  I dunno.


* Re: [PATCH v2 0/4] Introduce a "promisor-remote" capability
  2024-09-27  9:15     ` Christian Couder
@ 2024-09-27 22:48       ` Junio C Hamano
  2024-09-27 23:31         ` rsbecker
  2024-09-30  7:57         ` Patrick Steinhardt
  0 siblings, 2 replies; 110+ messages in thread
From: Junio C Hamano @ 2024-09-27 22:48 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Michael Haggerty

Christian Couder <christian.couder@gmail.com> writes:

> By the way there was an unconference breakout session on day 2 of the
> Git Merge called "Git LFS Can we do better?" where this was discussed
> with a number of people. Scott Chacon took some notes:
>
> https://github.com/git/git-merge/blob/main/breakouts/git-lfs.md

Thanks for a link.

> It was in parallel with the Contributor Summit, so few contributors
> participated in this session (maybe only Michael Haggerty, John Cai
> and me). But the impression of GitLab people there, including me, was
> that folks in general would be happy to have an alternative to Git LFS
> based on this.

I am not sure what "based on this" is really about, though.

This series adds a feature to redirect requests to one server to
another, but does it really have much to solve the problem LFS wants
to solve?  I would imagine that you would want to be able to manage
larger objects separately to avoid affecting the performance and
convenience when handling smaller objects, and to serve these larger
objects from a dedicated server.  You certainly can filter the
larger blobs away with blob size filter, but when you really need
these larger blobs, it is unclear how the new capability helps, as
you cannot really tell what the criteria the serving side that gave
you the "promisor-remote" capability wants you to use to sift your
requests between the original server and the new promisor.  Wouldn't
your requests _all_ be redirected to a single place, the promisor
remote you learned via the capability?

Coming up with a better alternative to LFS is certainly good, and it
is a worthwhile addition to the system.  I just do not see how the
topic of this series helps further that goal.

Thanks.


* RE: [PATCH v2 0/4] Introduce a "promisor-remote" capability
  2024-09-27 22:48       ` Junio C Hamano
@ 2024-09-27 23:31         ` rsbecker
  2024-09-28 10:56           ` Kristoffer Haugsbakk
  2024-09-30  7:57         ` Patrick Steinhardt
  1 sibling, 1 reply; 110+ messages in thread
From: rsbecker @ 2024-09-27 23:31 UTC (permalink / raw)
  To: 'Junio C Hamano', 'Christian Couder'
  Cc: git, 'John Cai', 'Patrick Steinhardt',
	'Taylor Blau', 'Eric Sunshine',
	'Michael Haggerty'

On September 27, 2024 6:48 PM, Junio C Hamano wrote:
>Christian Couder <christian.couder@gmail.com> writes:
>
>> By the way there was an unconference breakout session on day 2 of the
>> Git Merge called "Git LFS Can we do better?" where this was discussed
>> with a number of people. Scott Chacon took some notes:
>>
>> https://github.com/git/git-merge/blob/main/breakouts/git-lfs.md
>
>Thanks for a link.
>
>> It was in parallel with the Contributor Summit, so few contributors
>> participated in this session (maybe only Michael Haggerty, John Cai
>> and me). But the impression of GitLab people there, including me, was
>> that folks in general would be happy to have an alternative to Git LFS
>> based on this.
>
>I am not sure what "based on this" is really about, though.
>
>This series adds a feature to redirect requests to one server to another, but does it
>really have much to solve the problem LFS wants to solve?  I would imagine that
>you would want to be able to manage larger objects separately to avoid affecting
>the performance and convenience when handling smaller objects, and to serve
>these larger objects from a dedicated server.  You certainly can filter the larger blobs
>away with blob size filter, but when you really need these larger blobs, it is unclear
>how the new capability helps, as you cannot really tell what the criteria the serving
>side that gave you the "promisor-remote" capability wants you to use to sift your
>requests between the original server and the new promisor.  Wouldn't your
>requests _all_ be redirected to a single place, the promisor remote you learned via
>the capability?
>
>Coming up with a better alternative to LFS is certainly good, and it is a worthwhile
>addition to the system.  I just do not see how the topic of this series helps further
>that goal.

I am one of those who really would like to see an improvement in this area. My
community needs large binaries, and the GitHub LFS support limits sizes to the
point of being pretty much not enough. I would be happy to participate in
requirements gathering for this effort (even if it goes to Rust 😉 )
--Randall



* Re: [PATCH v2 0/4] Introduce a "promisor-remote" capability
  2024-09-27 23:31         ` rsbecker
@ 2024-09-28 10:56           ` Kristoffer Haugsbakk
  0 siblings, 0 replies; 110+ messages in thread
From: Kristoffer Haugsbakk @ 2024-09-28 10:56 UTC (permalink / raw)
  To: rsbecker, Junio C Hamano, Christian Couder
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	'Michael Haggerty'

On Sat, Sep 28, 2024, at 01:31, rsbecker@nexbridge.com wrote:
> I am one of those who really would like to see an improvement in this area. My
> community needs large binaries, and the GitHub LFS support limits sizes to the
> point of being pretty much not enough. I would be happy to participate in
> requirements gathering for this effort (even if it goes to Rust 😉 )

git-annex is an alternative to Git LFS which doesn’t have any size
limits since you can use any (multiple) remotes for the “big files”
storage. Like an external drive.

(written in Haskell)

https://git-annex.branchable.com/

-- 
Kristoffer Haugsbakk


* Re: [PATCH v2 3/4] Add 'promisor-remote' capability to protocol v2
  2024-09-10 16:29   ` [PATCH v2 3/4] Add 'promisor-remote' capability to protocol v2 Christian Couder
@ 2024-09-30  7:56     ` Patrick Steinhardt
  2024-09-30 13:28       ` Christian Couder
  2024-11-06 14:04     ` Patrick Steinhardt
  2024-11-28  5:47     ` Junio C Hamano
  2 siblings, 1 reply; 110+ messages in thread
From: Patrick Steinhardt @ 2024-09-30  7:56 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Taylor Blau, Eric Sunshine,
	Christian Couder

On Tue, Sep 10, 2024 at 06:29:59PM +0200, Christian Couder wrote:
> diff --git a/Documentation/config/promisor.txt b/Documentation/config/promisor.txt
> index 98c5cb2ec2..9cbfe3e59e 100644
> --- a/Documentation/config/promisor.txt
> +++ b/Documentation/config/promisor.txt
> @@ -1,3 +1,20 @@
>  promisor.quiet::
>  	If set to "true" assume `--quiet` when fetching additional
>  	objects for a partial clone.
> +
> +promisor.advertise::
> +	If set to "true", a server will use the "promisor-remote"
> +	capability, see linkgit:gitprotocol-v2[5], to advertise the
> +	promisor remotes it is using, if it uses some. Default is
> +	"false", which means the "promisor-remote" capability is not
> +	advertised.
> +
> +promisor.acceptFromServer::
> +	If set to "all", a client will accept all the promisor remotes
> +	a server might advertise using the "promisor-remote"
> +	capability. Default is "none", which means no promisor remote
> +	advertised by a server will be accepted. By accepting a
> +	promisor remote, the client agrees that the server might omit
> +	objects that are lazily fetchable from this promisor remote
> +	from its responses to "fetch" and "clone" requests from the
> +	client. See linkgit:gitprotocol-v2[5].

I wonder a bit about whether making this an option is all that sensible,
because that would of course apply globally to every server that you
might want to clone from. Wouldn't it be more sensible to make this
configurable per server?

Another question: servers may advertise bogus addresses to us, and as
far as I can see there are currently no precautions in place against
malicious cases. The server might for example use this to redirect us to
a remote that uses no encryption, the Git protocol or even the "file://"
protocol. I guess the sane thing here would be to default to allow
clones via "https://" only, but make the set of accepted protocols
configurable.
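
A minimal sketch of such a check, in Python for brevity. The default
set and the function name are hypothetical and not part of the series;
they only illustrate "https only by default, but configurable".

```python
from urllib.parse import urlsplit

# Hypothetical default: accept only encrypted HTTP unless configured otherwise.
DEFAULT_ALLOWED_SCHEMES = frozenset({"https"})


def is_acceptable_url(url, allowed=DEFAULT_ALLOWED_SCHEMES):
    """Return True if an advertised promisor URL uses an allowed scheme.

    URLs without an explicit scheme (plain paths, scp-like syntax) are
    rejected, because their transport cannot be told from the scheme.
    """
    scheme = urlsplit(url).scheme.lower()
    return scheme in allowed


for candidate in ("https://lop.example.com/repo.git",
                  "http://lop.example.com/repo.git",
                  "git://lop.example.com/repo.git",
                  "file:///srv/git/repo.git",
                  "/srv/git/repo.git"):
    print(candidate, is_acceptable_url(candidate))
```

A site that trusts its internal network could widen the set, for
example `is_acceptable_url(url, allowed={"https", "ssh"})`.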

> diff --git a/Documentation/gitprotocol-v2.txt b/Documentation/gitprotocol-v2.txt
> index 414bc625d5..65d5256baf 100644
> --- a/Documentation/gitprotocol-v2.txt
> +++ b/Documentation/gitprotocol-v2.txt
> @@ -781,6 +781,60 @@ retrieving the header from a bundle at the indicated URI, and thus
>  save themselves and the server(s) the request(s) needed to inspect the
>  headers of that bundle or bundles.
>  
> +promisor-remote=<pr-infos>
> +~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The server may advertise some promisor remotes it is using or knows
> +about to a client which may want to use them as its promisor remotes,
> +instead of this repository. In this case <pr-infos> should be of the
> +form:
> +
> +	pr-infos = pr-info | pr-infos ";" pr-info

Wouldn't it be preferable to make this multiple lines so that we cannot
ever burst through the pktline limits?
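
For reference, a pkt-line carries at most 65516 bytes of payload (65520
bytes including its 4-byte hex length header). A minimal Python sketch
of the check a sender would need if all <pr-infos> stay on a single
line; `capability_line` is a made-up helper, not something from the
patch.

```python
PKT_MAX_TOTAL = 65520                 # length header + payload
PKT_MAX_PAYLOAD = PKT_MAX_TOTAL - 4   # 65516 bytes of actual data


def pkt_line(payload: bytes) -> bytes:
    """Frame payload as one pkt-line: 4 hex digits of total length, then data."""
    if len(payload) > PKT_MAX_PAYLOAD:
        raise ValueError(
            f"payload of {len(payload)} bytes exceeds the pkt-line limit")
    return b"%04x" % (len(payload) + 4) + payload


def capability_line(pr_infos: str) -> bytes:
    """Hypothetical helper: frame the whole capability as a single pkt-line."""
    return pkt_line(f"promisor-remote={pr_infos}\n".encode())


print(capability_line("name=lop"))
```

Many remotes with long URLs would make `capability_line` raise, which
is the failure mode that splitting the advertisement over several lines
would avoid.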

> +	pr-info = "name=" pr-name | "name=" pr-name "," "url=" pr-url
> +
> +where `pr-name` is the urlencoded name of a promisor remote, and
> +`pr-url` the urlencoded URL of that promisor remote.
> +In this case, if the client decides to use one or more promisor
> +remotes the server advertised, it can reply with
> +"promisor-remote=<pr-names>" where <pr-names> should be of the form:

One of the things that LFS provides is custom transfer types. It is for
example possible to use NFS or some other arbitrary protocol to fetch or
upload data. It should be possible to provide similar functionality on
the Git side via custom transport helpers, too, and if we make the
accepted set of helpers configurable as proposed further up this could
be made safe, too.

But one thing I'm missing here is any documentation around how the
client would know which promisor-remote to pick when the remote
advertises multiple of them. The easiest schema would of course be to
pick the first one whose transport helper the client understands and
considers to be safe. But given that we're talking about offloading of
large blobs, would we have usecases for advertising e.g. region-scoped
remotes that require more information on the client-side?

Also, are the promisor remotes promising to each contain all objects? Or
would the client have to ask each promisor remote until it finds a
desired object?
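
If the answer is "ask each one in turn", the client-side loop could
look roughly like the Python sketch below. `fetch_from` is a
hypothetical callback standing in for a real fetch; nothing here is
taken from the series.

```python
def fetch_missing(oids, remotes, fetch_from):
    """Ask each promisor remote in turn until every object is found.

    fetch_from(remote, wanted) is a hypothetical callback returning the
    subset of `wanted` it managed to obtain from `remote`.
    Returns the set of object IDs that no remote could provide.
    """
    remaining = set(oids)
    for remote in remotes:
        if not remaining:
            break  # everything was found, no need to contact more remotes
        remaining -= set(fetch_from(remote, frozenset(remaining)))
    return remaining


# Toy stand-in for real remotes: each "remote" holds a fixed set of IDs.
stores = {"lop": {"aaa", "bbb"}, "cdn": {"ccc"}}


def lookup(remote, wanted):
    return wanted & stores[remote]


print(sorted(fetch_missing({"aaa", "ccc", "ddd"}, ["lop", "cdn"], lookup)))
# prints ['ddd']
```

The cost of this approach is one round trip per remote in the worst
case, which is why a promise that each remote holds everything would be
cheaper for clients.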

> +	pr-names = pr-name | pr-names ";" pr-name
> +
> +where `pr-name` is the urlencoded name of a promisor remote the server
> +advertised and the client accepts.
> +
> +Note that, everywhere in this document, `pr-name` MUST be a valid
> +remote name, and the ';' and ',' characters MUST be encoded if they
> +appear in `pr-name` or `pr-url`.

So I assume the intent here is to let the client add that promisor
remote with that exact, server-provided name? That makes me wonder about
two different scenarios:

  - We must keep the remote from announcing "origin".

  - What if we eventually decide to allow users to provide their own
    names for remotes during git-clone(1)?

Overall, I don't think that it's a good idea to let the remote dictate
which name a client's remotes have.
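
To make the quoted grammar concrete, here is a minimal Python sketch of
encoding and parsing a <pr-infos> value. It assumes every character
outside the unreserved set gets percent-encoded, which is stricter than
the grammar requires (only ';' and ',' MUST be encoded); the function
names are made up.

```python
from urllib.parse import quote, unquote


def encode_pr_infos(remotes):
    """Build a <pr-infos> string from (name, url_or_None) pairs."""
    parts = []
    for name, url in remotes:
        info = "name=" + quote(name, safe="")
        if url is not None:
            info += ",url=" + quote(url, safe="")
        parts.append(info)
    return ";".join(parts)


def parse_pr_infos(value):
    """Parse a <pr-infos> string back into (name, url_or_None) pairs.

    Raises ValueError on a malformed field or a <pr-info> without a name.
    """
    remotes = []
    for info in value.split(";"):
        # Each field is key=value; ',' and ';' cannot appear unencoded.
        fields = dict(field.split("=", 1) for field in info.split(","))
        if "name" not in fields:
            raise ValueError(f"pr-info without a name: {info!r}")
        url = unquote(fields["url"]) if "url" in fields else None
        remotes.append((unquote(fields["name"]), url))
    return remotes


wire = encode_pr_infos([("lop", "https://lop.example.com/a;b,c.git"),
                        ("plain", None)])
print(wire)
print(parse_pr_infos(wire))
```

Note that the parser only returns what the server said; whether the
advertised name is ever used for a local remote is a separate policy
decision, which is the concern raised above.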

> +If the server doesn't know any promisor remote that could be good for
> +a client to use, or prefers a client not to use any promisor remote it
> +uses or knows about, it shouldn't advertise the "promisor-remote"
> +capability at all.
> +
> +In this case, or if the client doesn't want to use any promisor remote
> +the server advertised, the client shouldn't advertise the
> +"promisor-remote" capability at all in its reply.
> +
> +The "promisor.advertise" and "promisor.acceptFromServer" configuration
> +options can be used on the server and client side respectively to
> +control what they advertise or accept respectively. See the
> +documentation of these configuration options for more information.
> +
> +Note that in the future it would be nice if the "promisor-remote"
> +protocol capability could be used by the server, when responding to
> +`git fetch` or `git clone`, to advertise better-connected remotes that
> +the client can use as promisor remotes, instead of this repository, so
> +that the client can lazily fetch objects from these other
> +better-connected remotes. This would require the server to omit in its
> +response the objects available on the better-connected remotes that
> +the client has accepted. This hasn't been implemented yet though. So
> +for now this "promisor-remote" capability is useful only when the
> +server advertises some promisor remotes it already uses to borrow
> +objects from.

In the cover letter you mention that the server may not even have some
objects at all in the future. I wonder how that is supposed to interact
with clients that do not know about the "promisor-remote" capability at
all though.

From my point of view the server should be able to handle that just
fine and provide a full packfile to the client. That would of course
require the server to fetch missing objects from its own promisor
remotes. Do we want to state explicitly that this is a MUST for servers
so that we don't end up in a future where clients wouldn't be able to
fetch from some forges anymore?

Patrick


* Re: [PATCH v2 4/4] promisor-remote: check advertised name or URL
  2024-09-10 16:30   ` [PATCH v2 4/4] promisor-remote: check advertised name or URL Christian Couder
@ 2024-09-30  7:57     ` Patrick Steinhardt
  0 siblings, 0 replies; 110+ messages in thread
From: Patrick Steinhardt @ 2024-09-30  7:57 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Taylor Blau, Eric Sunshine,
	Christian Couder

Oh, so here you address my previous comments. I'd propose to either
squash those two commits or to not introduce "acceptFromServer" in the
preceding commit in the first place.

On Tue, Sep 10, 2024 at 06:30:00PM +0200, Christian Couder wrote:
> diff --git a/Documentation/config/promisor.txt b/Documentation/config/promisor.txt
> index 9cbfe3e59e..d1364bc018 100644
> --- a/Documentation/config/promisor.txt
> +++ b/Documentation/config/promisor.txt
> @@ -12,9 +12,19 @@ promisor.advertise::
>  promisor.acceptFromServer::
>  	If set to "all", a client will accept all the promisor remotes
>  	a server might advertise using the "promisor-remote"
> -	capability. Default is "none", which means no promisor remote
> -	advertised by a server will be accepted. By accepting a
> -	promisor remote, the client agrees that the server might omit
> -	objects that are lazily fetchable from this promisor remote
> -	from its responses to "fetch" and "clone" requests from the
> -	client. See linkgit:gitprotocol-v2[5].
> +	capability. If set to "knownName" the client will accept
> +	promisor remotes which are already configured on the client
> +	and have the same name as those advertised by the server.

Wait, does this mean that a server can start advertising new promisor
remotes at any point in time and we'd backfill them on the client
whenever we execute git-fetch(1)? That sounds fishy to me -- I wouldn't
want anything to touch my configuration after I have created the repo
unless I explicitly tell it to.

If so, how does this handle the case where I manually added a remote
that by accident (or malicious intent of the server) matches one of
the newly-advertised promisor remotes? This goes back to one of my
previous comments, where I said that it's likely not a good idea to let
the remote dictate names of our remotes in the first place.

If not, where do the known names come from?

Patrick


* Re: [PATCH v2 0/4] Introduce a "promisor-remote" capability
  2024-09-27 22:48       ` Junio C Hamano
  2024-09-27 23:31         ` rsbecker
@ 2024-09-30  7:57         ` Patrick Steinhardt
  2024-09-30  9:17           ` Christian Couder
                             ` (2 more replies)
  1 sibling, 3 replies; 110+ messages in thread
From: Patrick Steinhardt @ 2024-09-30  7:57 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Christian Couder, git, John Cai, Taylor Blau, Eric Sunshine,
	Michael Haggerty, brian m. carlson

On Fri, Sep 27, 2024 at 03:48:11PM -0700, Junio C Hamano wrote:
> Christian Couder <christian.couder@gmail.com> writes:
> 
> > By the way there was an unconference breakout session on day 2 of the
> > Git Merge called "Git LFS Can we do better?" where this was discussed
> > with a number of people. Scott Chacon took some notes:
> >
> > https://github.com/git/git-merge/blob/main/breakouts/git-lfs.md
> 
> Thanks for a link.
> 
> > It was in parallel with the Contributor Summit, so few contributors
> > participated in this session (maybe only Michael Haggerty, John Cai
> > and me). But the impression of GitLab people there, including me, was
> > that folks in general would be happy to have an alternative to Git LFS
> > based on this.
> 
> I am not sure what "based on this" is really about, though.
> 
> This series adds a feature to redirect requests to one server to
> another, but does it really have much to solve the problem LFS wants
> to solve?  I would imagine that you would want to be able to manage
> larger objects separately to avoid affecting the performance and
> convenience when handling smaller objects, and to serve these larger
> objects from a dedicated server.  You certainly can filter the
> larger blobs away with blob size filter, but when you really need
> these larger blobs, it is unclear how the new capability helps, as
> you cannot really tell what the criteria the serving side that gave
> you the "promisor-remote" capability wants you to use to sift your
> requests between the original server and the new promisor.  Wouldn't
> your requests _all_ be redirected to a single place, the promisor
> remote you learned via the capability?
> 
> Coming up with a better alternative to LFS is certainly good, and it
> is a worthwhile addition to the system.  I just do not see how the
> topic of this series helps further that goal.

I guess it helps to address part of the problem. I'm not sure whether my
understanding is aligned with Chris' intention, but I could certainly
see that at some point in time we start to advertise promisor remote
URLs that use different transport helpers to fetch objects. This would
allow hosting providers to offload objects to e.g. blob storage or
somesuch thing and the client would know how to fetch them.

But there are still a couple of pieces missing in the bigger puzzle:

  - How would a client know to omit certain objects? Right now it only
    knows that there are promisor remotes, but it doesn't know that it
    e.g. should omit every blob larger than X megabytes. The answer
    could of course be that the client should just know to do a partial
    clone by themselves.

  - Storing those large objects locally is still expensive. We had
    discussions in the past where such objects could be stored
    uncompressed to stop wasting compute here. At GitLab, we're thinking
    about the ability to use rolling hash functions to chunk such big
    objects into smaller parts to also allow for somewhat efficient
    deduplication. We're also thinking about how to make the overall ODB
    pluggable such that we can eventually make it more scalable in this
    context. But that's of course thinking into the future quite a bit.

  - Local repositories would likely want to prune large objects that
    have not been accessed for a while to eventually regain some storage
    space.
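
As a rough illustration of the rolling-hash idea in the second point,
here is a toy content-defined chunker in Python. It is not a
description of any planned design: the window size, mask, hash and
helper name are arbitrary choices for the sketch.

```python
import random

WINDOW = 16                        # bytes covered by the rolling hash
MASK = (1 << 6) - 1                # cut when the low 6 bits are zero
BASE = 257
MOD = (1 << 31) - 1
TOP = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window


def split_chunks(data: bytes):
    """Split data where a hash of the last WINDOW bytes hits the mask.

    Cut points depend only on nearby content, so an edit early in a blob
    does not shift every later chunk boundary.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * TOP) % MOD   # drop the oldest byte
        h = (h * BASE + byte) % MOD                  # add the newest byte
        if i + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


rng = random.Random(0)
blob_a = bytes(rng.getrandbits(8) for _ in range(4096))
blob_b = b"small edit" + blob_a    # same content with a short prefix added
a, b = split_chunks(blob_a), split_chunks(blob_b)
print(len(a), len(b), len(set(a) & set(b)))
```

Because boundaries resynchronize after the edit, every chunk of
`blob_a` except possibly the first also appears in `blob_b`, so a store
keyed by chunk would only need to keep the differing chunks once more.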

I think chipping away the problems one by one is fine. But it would be
nice to draw something like a "big picture" of where we eventually want
to end up at and how all the parts connect with each other to form a
viable native replacement for Git LFS.

Also Cc'ing brian, who likely has a thing or two to say about this :)

Patrick


* Re: [PATCH v2 0/4] Introduce a "promisor-remote" capability
  2024-09-30  7:57         ` Patrick Steinhardt
@ 2024-09-30  9:17           ` Christian Couder
  2024-09-30 16:52             ` Junio C Hamano
  2024-10-01 10:14             ` Patrick Steinhardt
  2024-09-30 16:34           ` Junio C Hamano
  2024-09-30 21:26           ` brian m. carlson
  2 siblings, 2 replies; 110+ messages in thread
From: Christian Couder @ 2024-09-30  9:17 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Junio C Hamano, git, John Cai, Taylor Blau, Eric Sunshine,
	Michael Haggerty, brian m. carlson

On Mon, Sep 30, 2024 at 9:57 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Fri, Sep 27, 2024 at 03:48:11PM -0700, Junio C Hamano wrote:
> > Christian Couder <christian.couder@gmail.com> writes:
> >
> > > By the way there was an unconference breakout session on day 2 of the
> > > Git Merge called "Git LFS Can we do better?" where this was discussed
> > > with a number of people. Scott Chacon took some notes:
> > >
> > > https://github.com/git/git-merge/blob/main/breakouts/git-lfs.md
> >
> > Thanks for a link.
> >
> > > It was in parallel with the Contributor Summit, so few contributors
> > > participated in this session (maybe only Michael Haggerty, John Cai
> > > and me). But the impression of GitLab people there, including me, was
> > > that folks in general would be happy to have an alternative to Git LFS
> > > based on this.
> >
> > I am not sure what "based on this" is really about, though.
> >
> > This series adds a feature to redirect requests to one server to
> > another, but does it really have much to solve the problem LFS wants
> > to solve?  I would imagine that you would want to be able to manage
> > larger objects separately to avoid affecting the performance and
> > convenience when handling smaller objects, and to serve these larger
> > objects from a dedicated server.  You certainly can filter the
> > larger blobs away with blob size filter, but when you really need
> > these larger blobs, it is unclear how the new capability helps, as
> > you cannot really tell what the criteria the serving side that gave
> > you the "promisor-remote" capability wants you to use to sift your
> > requests between the original server and the new promisor.  Wouldn't
> > your requests _all_ be redirected to a single place, the promisor
> > remote you learned via the capability?
> >
> > Coming up with a better alternative to LFS is certainly good, and it
> > is a worthwhile addition to the system.  I just do not see how the
> > topic of this series helps further that goal.
>
> I guess it helps to address part of the problem. I'm not sure whether my
> understanding is aligned with Chris' intention, but I could certainly
> see that at some point in time we start to advertise promisor remote
> URLs that use different transport helpers to fetch objects. This would
> allow hosting providers to offload objects to e.g. blob storage or
> somesuch thing and the client would know how to fetch them.
>
> But there are still a couple of pieces missing in the bigger puzzle:
>
>   - How would a client know to omit certain objects? Right now it only
>     knows that there are promisor remotes, but it doesn't know that it
>     e.g. should omit every blob larger than X megabytes. The answer
>     could of course be that the client should just know to do a partial
>     clone by themselves.

If we add a "filter" field to the "promisor-remote" capability in a
future patch series, then the server could pass information like a
filter-spec that the client could use to omit some large blobs.

Patch 3/4 has the following in its commit message about it: "In the
future, it might be possible to pass other information like a
filter-spec that the client should use when cloning from S".

>   - Storing those large objects locally is still expensive. We had
>     discussions in the past where such objects could be stored
>     uncompressed to stop wasting compute here.

Yeah, I think a new "verbatim" object representation in the object
database as discussed in
https://lore.kernel.org/git/xmqqbkdometi.fsf@gitster.g/ is the most
likely and easiest in the short term.

> At GitLab, we're thinking
>     about the ability to use rolling hash functions to chunk such big
>     objects into smaller parts to also allow for somewhat efficient
>     deduplication. We're also thinking about how to make the overall ODB
>     pluggable such that we can eventually make it more scalable in this
>     context. But that's of course thinking into the future quite a bit.

Yeah, there are different options for this. For example HuggingFace
(https://huggingface.co/) recently acquired the XetHub company (see
https://huggingface.co/blog/xethub-joins-hf), and said they might open
source XetHub software that does chunking and deduplicates chunks, so
that could be an option too.

>   - Local repositories would likely want to prune large objects that
>     have not been accessed for a while to eventually regain some storage
>     space.

`git repack --filter` and such might already help a bit in this area.
I agree that more work is needed though.

> I think chipping away the problems one by one is fine. But it would be
> nice to draw something like a "big picture" of where we eventually want
> to end up at and how all the parts connect with each other to form a
> viable native replacement for Git LFS.

I have tried to discuss this at the Git Merge 2022 and 2024 and
perhaps even before that. But as you know it's difficult to make
people agree on big projects that are not backed by patches and that
might span over several years (especially when very few people
actually work on them and when they might have other things to work on
too).

Thanks,
Christian.


* Re: [PATCH v2 3/4] Add 'promisor-remote' capability to protocol v2
  2024-09-30  7:56     ` Patrick Steinhardt
@ 2024-09-30 13:28       ` Christian Couder
  2024-10-01 10:14         ` Patrick Steinhardt
  0 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2024-09-30 13:28 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Junio C Hamano, John Cai, Taylor Blau, Eric Sunshine,
	Christian Couder

On Mon, Sep 30, 2024 at 9:57 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Tue, Sep 10, 2024 at 06:29:59PM +0200, Christian Couder wrote:
> > diff --git a/Documentation/config/promisor.txt b/Documentation/config/promisor.txt
> > index 98c5cb2ec2..9cbfe3e59e 100644
> > --- a/Documentation/config/promisor.txt
> > +++ b/Documentation/config/promisor.txt
> > @@ -1,3 +1,20 @@
> >  promisor.quiet::
> >       If set to "true" assume `--quiet` when fetching additional
> >       objects for a partial clone.
> > +
> > +promisor.advertise::
> > +     If set to "true", a server will use the "promisor-remote"
> > +     capability, see linkgit:gitprotocol-v2[5], to advertise the
> > +     promisor remotes it is using, if it uses some. Default is
> > +     "false", which means the "promisor-remote" capability is not
> > +     advertised.
> > +
> > +promisor.acceptFromServer::
> > +     If set to "all", a client will accept all the promisor remotes
> > +     a server might advertise using the "promisor-remote"
> > +     capability. Default is "none", which means no promisor remote
> > +     advertised by a server will be accepted. By accepting a
> > +     promisor remote, the client agrees that the server might omit
> > +     objects that are lazily fetchable from this promisor remote
> > +     from its responses to "fetch" and "clone" requests from the
> > +     client. See linkgit:gitprotocol-v2[5].
>
> I wonder a bit about whether making this an option is all that sensible,
> because that would of course apply globally to every server that you
> might want to clone from. Wouldn't it be more sensible to make this
> configurable per server?

It depends. If, for example, you are in a corporate environment where
you will interact only with trusted servers, then it might be easier
to have only one option and configure it once for all the servers you
are going to interact with. I am Ok to also have an option
configurable per server in the future though.

> Another question: servers may advertise bogus addresses to us, and as
> far as I can see there are currently no precautions in place against
> malicious cases.

The commit message says:

"In a following commit, other values for "promisor.acceptFromServer" will
be implemented, so that C will be able to decide the promisor remotes it
accepts depending on the name and URL it received from S."

and indeed as you noticed in your review of patch 4/4, this concern is
addressed by patch 4/4.

> The server might for example use this to redirect us to
> a remote that uses no encryption, the Git protocol or even the "file://"
> protocol. I guess the sane thing here would be to default to allow
> clones via "https://" only, but make the set of accepted protocols
> configurable.

Yeah, it is another potential config option that could be added. At
this stage I don't want to send a lot of patches with a large number
of possibly useful configuration options as it might appear later that
very few are actually used and useful.

> > diff --git a/Documentation/gitprotocol-v2.txt b/Documentation/gitprotocol-v2.txt
> > index 414bc625d5..65d5256baf 100644
> > --- a/Documentation/gitprotocol-v2.txt
> > +++ b/Documentation/gitprotocol-v2.txt
> > @@ -781,6 +781,60 @@ retrieving the header from a bundle at the indicated URI, and thus
> >  save themselves and the server(s) the request(s) needed to inspect the
> >  headers of that bundle or bundles.
> >
> > +promisor-remote=<pr-infos>
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +The server may advertise some promisor remotes it is using or knows
> > +about to a client which may want to use them as its promisor remotes,
> > +instead of this repository. In this case <pr-infos> should be of the
> > +form:
> > +
> > +     pr-infos = pr-info | pr-infos ";" pr-info
>
> Wouldn't it be preferable to make this multiple lines so that we cannot
> ever burst through the pktline limits?

LARGE_PACKET_MAX is 65520 which looks more than enough to me. So
having the pktline limit could actually prevent malicious servers from
sending too much junk.

I wouldn't be against multiple lines if there were reasonable cases
where the current pktline limit might not be enough (or if such cases
appeared in the future) though. It's just that I can't think of any
such reasonable case.
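
To make the size argument concrete, here is a rough sketch of how a
"promisor-remote=<pr-infos>" line could be built and checked against
the limit. It follows the grammar quoted from the patch (pr-info
entries joined with ";", name and URL joined with ","). The function
names are made up for illustration, Python's quote() is only assumed
to be close to Git's own urlencoding, and counting the 4-byte length
header plus a trailing newline is likewise an assumption, not
something taken from the series.

```python
from urllib.parse import quote

# pkt-line limit cited above; assumed here to include the 4-byte length header.
LARGE_PACKET_MAX = 65520
PKT_HEADER_LEN = 4


def encode_pr_info(name, url=None):
    """Build one pr-info: "name=" pr-name, optionally "," "url=" pr-url."""
    # safe="" makes quote() percent-encode ';' and ',' too, as the spec requires.
    info = "name=" + quote(name, safe="")
    if url is not None:
        info += ",url=" + quote(url, safe="")
    return info


def encode_pr_infos(remotes):
    """Join (name, url) pairs into pr-infos, separated by ';'."""
    return ";".join(encode_pr_info(name, url) for name, url in remotes)


def fits_in_one_pkt_line(pr_infos):
    """Check that the whole capability line fits in a single pkt-line."""
    payload = ("promisor-remote=" + pr_infos + "\n").encode("utf-8")
    return PKT_HEADER_LEN + len(payload) <= LARGE_PACKET_MAX
```

With URLs of around a hundred bytes each, several hundred remotes
would fit in one line under these assumptions.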

> > +     pr-info = "name=" pr-name | "name=" pr-name "," "url=" pr-url
> > +
> > +where `pr-name` is the urlencoded name of a promisor remote, and
> > +`pr-url` the urlencoded URL of that promisor remote.
> > +In this case, if the client decides to use one or more promisor
> > +remotes the server advertised, it can reply with
> > +"promisor-remote=<pr-names>" where <pr-names> should be of the form:
>
> One of the things that LFS provides is custom transfer types. It is for
> example possible to use NFS or some other arbitrary protocol to fetch or
> upload data. It should be possible to provide similar functionality on
> the Git side via custom transport helpers, too, and if we make the
> accepted set of helpers configurable as proposed further up this could
> be made safe, too.

It's already possible to use remote helpers using different protocols
with promisor remotes and the URL security issues are addressed by
patch 4/4.

> But one thing I'm missing here is any documentation around how the
> client would know which promisor-remote to pick when the remote
> advertises multiple of them.

In most cases for now, I think the server should advertise only one,
and the client should configure that promisor remote on its own and
set "promisor.acceptFromServer" to "KnownUrl", or maybe "KnownName" in
a corporate setting, (see patch 4/4).
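
As an illustration only, the client-side decision could look roughly
like the sketch below. The semantics are an assumption based on how
the modes are described in this thread ("KnownName" accepts an
advertised remote whose name is already configured locally,
"KnownUrl" additionally requires the URL to match); the function name,
the data shapes and the case-insensitive comparison of the mode are
hypothetical and not taken from patch 4/4.

```python
def accepted_remotes(mode, advertised, configured):
    """Return the names of advertised promisor remotes the client accepts.

    mode:       value of promisor.acceptFromServer
    advertised: list of (name, url) pairs sent by the server (url may be None)
    configured: dict mapping locally configured remote names to their URLs
    """
    mode = mode.lower()  # assumption: the value is compared case-insensitively
    if mode == "none":
        return []
    if mode == "all":
        return [name for name, _ in advertised]
    if mode == "knownname":
        # Accept only remotes whose name the client has already configured.
        return [name for name, _ in advertised if name in configured]
    if mode == "knownurl":
        # Accept only remotes whose name and URL both match the local config.
        return [name for name, url in advertised
                if url is not None and configured.get(name) == url]
    raise ValueError("unknown promisor.acceptFromServer value: " + mode)
```

The accepted names would then be joined with ";" to form the
<pr-names> part of the client's reply.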

If a server advertises more than one, it should have some docs to
explain why it does that and which one(s) should be picked by which
client. For example, it could say something like "Users in this part
of the world might want to pick only promisor remote A as it is likely
to be better connected to them, while users in other parts of the
world should pick only promisor remote B for the same reason."

> The easiest schema would of course be to
> pick the first one whose transport helper the client understands and
> considers to be safe. But given that we're talking about offloading of
> large blobs, would we have usecases for advertising e.g. region-scoped
> remotes that require more information on the client-side?

If region-scoped means something like the example I talk about above,
then yeah, as also discussed with Junio, this could be an interesting
use case.

> Also, are the promisor remotes promising to each contain all objects? Or
> would the client have to ask each promisor remote until it finds a
> desired object?

I think both use cases could be interesting.

> > +     pr-names = pr-name | pr-names ";" pr-name
> > +
> > +where `pr-name` is the urlencoded name of a promisor remote the server
> > +advertised and the client accepts.
> > +
> > +Note that, everywhere in this document, `pr-name` MUST be a valid
> > +remote name, and the ';' and ',' characters MUST be encoded if they
> > +appear in `pr-name` or `pr-url`.
>
> So I assume the intent here is to let the client add that promisor
> remote with that exact, server-provided name? That makes me wonder about
> two different scenarios:
>
>   - We must keep the remote from announcing "origin".

I agree that it might not be a good idea to have something else than
the main remote named origin. I am not sure it's necessary to
explicitly disallow it though.

>   - What if we eventually decide to allow users to provide their own
>     names for remotes during git-clone(1)?

I think it could be confusing, so I would say that we should wait
until a concrete case where it could be useful appears before allowing
this.

> Overall, I don't think that it's a good idea to let the remote dictate
> which name a client's remotes have.

Maybe a new mode like "KnownURL" but where only the URL and not the
name should match could be interesting in some cases then? If that's
the case it's very simple to add it. I just prefer not to do it for
now as I am not yet convinced there is a very relevant use case. I
think that if a client doesn't want to trust and cooperate with the
server at all, it might just be better in most cases for it to just
leave the server alone and not access it at all, independently of
using promisor remote or not.

> > +If the server doesn't know any promisor remote that could be good for
> > +a client to use, or prefers a client not to use any promisor remote it
> > +uses or knows about, it shouldn't advertise the "promisor-remote"
> > +capability at all.
> > +
> > +In this case, or if the client doesn't want to use any promisor remote
> > +the server advertised, the client shouldn't advertise the
> > +"promisor-remote" capability at all in its reply.
> > +
> > +The "promisor.advertise" and "promisor.acceptFromServer" configuration
> > +options can be used on the server and client side respectively to
> > +control what they advertise or accept respectively. See the
> > +documentation of these configuration options for more information.
> > +
> > +Note that in the future it would be nice if the "promisor-remote"
> > +protocol capability could be used by the server, when responding to
> > +`git fetch` or `git clone`, to advertise better-connected remotes that
> > +the client can use as promisor remotes, instead of this repository, so
> > +that the client can lazily fetch objects from these other
> > +better-connected remotes. This would require the server to omit in its
> > +response the objects available on the better-connected remotes that
> > +the client has accepted. This hasn't been implemented yet though. So
> > +for now this "promisor-remote" capability is useful only when the
> > +server advertises some promisor remotes it already uses to borrow
> > +objects from.
>
> In the cover letter you mention that the server may not even have some
> objects at all in the future.

I am not sure which part of the cover letter this refers to. If S uses
X as a promisor remote, then yeah, it might not have some objects that
are on X. But perhaps there is some wrong wording or a
misunderstanding here.

> I wonder how that is supposed to interact
> with clients that do not know about the "promisor-remote" capability at
> all though.

When that happens, S can fetch from X the objects it doesn't have, and
then proceed as usual to respond to the client. This has the drawback
of duplicating these objects on S, but perhaps there could be some
kind of garbage collection process that would regularly remove those
duplicated objects from S.

Another possibility that could be added in the future would be for S
to warn the client that it should be upgraded to have the
"promisor-remote" capability. Or S could just refuse to serve the
client in that case. I don't think we should implement these
possibilities right now, but it could be useful to do it in the
future.

> From my point of view the server should be able to handle that just
> fine and provide a full packfile to the client. That would of course
> require the server to fetch missing objects from its own promisor
> remotes.

It's what already happens.

> Do we want to state explicitly that this is a MUST for servers
> so that we don't end up in a future where clients wouldn't be able to
> fetch from some forges anymore?

I don't think we should enforce anything like this. For example in
corporate setups, it might be easy to install the latest version of
Git and it might be a good thing to make sure the server doesn't get
overloaded with large files when they are supposed to only be stored
on a promisor remote.


* Re: [PATCH v2 0/4] Introduce a "promisor-remote" capability
  2024-09-30  7:57         ` Patrick Steinhardt
  2024-09-30  9:17           ` Christian Couder
@ 2024-09-30 16:34           ` Junio C Hamano
  2024-09-30 21:26           ` brian m. carlson
  2 siblings, 0 replies; 110+ messages in thread
From: Junio C Hamano @ 2024-09-30 16:34 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Christian Couder, git, John Cai, Taylor Blau, Eric Sunshine,
	Michael Haggerty, brian m. carlson

Patrick Steinhardt <ps@pks.im> writes:

> I guess it helps to address part of the problem. I'm not sure whether my
> understanding is aligned with Chris' intention, but I could certainly
> see that at some point in time we start to advertise promisor remote
> URLs that use different transport helpers to fetch objects. This would
> allow hosting providers to offload objects to e.g. blob storage or
> somesuch thing and the client would know how to fetch them.
>
> But there are still a couple of pieces missing in the bigger puzzle:
> ...
> I think chipping away the problems one by one is fine. But it would be
> nice to draw something like a "big picture" of where we eventually want
> to end up at and how all the parts connect with each other to form a
> viable native replacement for Git LFS.

Yes, thanks for stating this a lot more clearly than I said in the
reviews so far.

> Also Cc'ing brian, who likely has a thing or two to say about this :)


* Re: [PATCH v2 0/4] Introduce a "promisor-remote" capability
  2024-09-30  9:17           ` Christian Couder
@ 2024-09-30 16:52             ` Junio C Hamano
  2024-10-01 10:14             ` Patrick Steinhardt
  1 sibling, 0 replies; 110+ messages in thread
From: Junio C Hamano @ 2024-09-30 16:52 UTC (permalink / raw)
  To: Christian Couder
  Cc: Patrick Steinhardt, git, John Cai, Taylor Blau, Eric Sunshine,
	Michael Haggerty, brian m. carlson

Christian Couder <christian.couder@gmail.com> writes:

>> But there are still a couple of pieces missing in the bigger puzzle:
>>
>>   - How would a client know to omit certain objects? Right now it only
>>     knows that there are promisor remotes, but it doesn't know that it
>>     e.g. should omit every blob larger than X megabytes. The answer
>>     could of course be that the client should just know to do a partial
>>     clone by themselves.
>
> If we add a "filter" field to the "promisor-remote" capability in a
> future patch series, then the server could pass information like a
> filter-spec that the client could use to omit some large blobs.

Yes, but at that point, the current scheme of marking a promisor
pack with a single bit, the fact that the pack came from a promisor
remote (which one? and what filter settings did that remote use?),
becomes insufficient, doesn't it?  Chipping away one by one is
fine, but we'd at least need to be aware that it is one of the
things we need to upgrade in the scope of the bigger picture.

It may even be OK to upgrade the on-the-wire protocol side before
the code on both ends learns to take advantage of the feature
(e.g., to add "promisor-remote" capability itself, or to add the
capability that can also convey the associated filter specification
to that remote), but without even the design (let alone the
implementation) of what runs on both ends of the connection to
make use of what is communicated via the capability, it is rather
hard to get the details of the protocol design right.

As on-the-wire protocol is harder to upgrade due to compatibility
constraints, it smells like it is a better order to do things if it
is left as the _last_ piece to be designed and implemented, if we
were to chip away one-by-one.  That may, for example, go like this:

 (0) We want to ensure that the projects can specify what kind of
     objects are to be offloaded to other transports.

 (1) We design the client end first.  We may want to be able to
     choose what remote to run a lazy fetch against, based on a
     filter spec, for example.  We realize and make a mental note
     that our new "capability" needs to tell the client enough
     information to make such a decision.

 (2) We design the server end to supply the above pieces of
     information to the client end.  During this process, we may
     realize that some pieces of information cannot be prepared on
     the server end and (1) may need to get adjusted.

 (3) There may be tons of other things that need to be designed and
     implemented before we know what pieces of information our new
     "capability" needs to convey, and what these pieces of
     information mean by iterating (1) and (2).

 (4) Once we nail (3) down, we can add a new protocol capability,
     knowing how it should work, and knowing that the client and the
     server ends would work well once it is implemented.

>> At GitLab, we're thinking
>>     about the ability to use rolling hash functions to chunk such big
>>     objects into smaller parts to also allow for somewhat efficient
>>     deduplication. We're also thinking about how to make the overall ODB
>>     pluggable such that we can eventually make it more scalable in this
>>     context. But that's of course thinking into the future quite a bit.

Reminds me of rsync and bup ;-).

Thanks.


* Re: [PATCH v2 0/4] Introduce a "promisor-remote" capability
  2024-09-30  7:57         ` Patrick Steinhardt
  2024-09-30  9:17           ` Christian Couder
  2024-09-30 16:34           ` Junio C Hamano
@ 2024-09-30 21:26           ` brian m. carlson
  2024-09-30 22:27             ` Junio C Hamano
  2 siblings, 1 reply; 110+ messages in thread
From: brian m. carlson @ 2024-09-30 21:26 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Junio C Hamano, Christian Couder, git, John Cai, Taylor Blau,
	Eric Sunshine, Michael Haggerty


On 2024-09-30 at 07:57:17, Patrick Steinhardt wrote:
> But there are still a couple of pieces missing in the bigger puzzle:
> 
>   - How would a client know to omit certain objects? Right now it only
>     knows that there are promisor remotes, but it doesn't know that it
>     e.g. should omit every blob larger than X megabytes. The answer
>     could of course be that the client should just know to do a partial
>     clone by themselves.

It would be helpful to have some sort of protocol v2 feature that says
that a partial clone (of whatever sort) is recommended and let honouring
that be a config flag.  Otherwise, you're going to have a bunch of users
who try to download every giant object in the repository when they don't
need to.

Git LFS has the advantage that this is the default behaviour, which is
really valuable.

>   - Storing those large objects locally is still expensive. We had
>     discussions in the past where such objects could be stored
>     uncompressed to stop wasting compute here. At GitLab, we're thinking
>     about the ability to use rolling hash functions to chunk such big
>     objects into smaller parts to also allow for somewhat efficient
>     deduplication. We're also thinking about how to make the overall ODB
>     pluggable such that we can eventually make it more scalable in this
>     context. But that's of course thinking into the future quite a bit.

Git LFS has a `git lfs dedup` command, which takes the files in the
working tree and creates a copy using the copy-on-write functionality in
the operating system and file system to avoid duplicating them.  There
are certainly some users who simply cannot afford to store multiple
copies of the file system (say, because their repository is 500 GB), and
this is important functionality for them.

Note that this doesn't work for all file systems.  It does for APFS on
macOS, XFS and Btrfs on Linux, and ReFS on Windows, but not HFS+, ext4,
or NTFS, which lack copy-on-write functionality.

We'd probably need to add an extension for uncompressed objects for
this, since it's a repository format change, but it shouldn't be hard to
do.

In Git LFS, it's also possible to share a set of objects across
repositories although one must be careful not to prune them.  We already
have that through alternates, so I don't think we're lacking anything
there.

>   - Local repositories would likely want to prune large objects that
>     have not been accessed for a while to eventually regain some storage
>     space.

Git LFS has a `git lfs prune` command for this as well.  It does have to
be run manually, though.

> I think chipping away the problems one by one is fine. But it would be
> nice to draw something like a "big picture" of where we eventually want
> to end up at and how all the parts connect with each other to form a
> viable native replacement for Git LFS.

I think a native replacement would be a valuable feature.  Part of the
essential component is going to be a way to handle this gracefully
during pushes, since part of the goal of Git LFS is to get large blobs
off the main server storage where they tend to make repacks extremely
expensive and into an external store.  Without that, it's unlikely that
this feature is going to be viable on the server side.  GitHub doesn't
allow large blobs for exactly that reason, so we'd want some way to
store them outside the main repository but still have the repo think
they were present.

One idea I had about this was pluggable storage backends, which might be
a nice feature to add via a dynamically loaded shared library.  In
addition, this seems like the kind of feature that one might like to use
Rust for, since it probably will involve HTTP code, and generally people
like doing that less in C (I do, at least).

> Also Cc'ing brian, who likely has a thing or two to say about this :)

I certainly have thought about this a lot.  I will say that I've stepped
down from being one of the Git LFS maintainers (endless supply of work,
not nearly enough time), but I am still familiar with the architecture
of the project.
-- 
brian m. carlson (they/them or he/him)
Toronto, Ontario, CA



* Re: [PATCH v2 0/4] Introduce a "promisor-remote" capability
  2024-09-30 21:26           ` brian m. carlson
@ 2024-09-30 22:27             ` Junio C Hamano
  2024-10-01 10:13               ` Patrick Steinhardt
  0 siblings, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2024-09-30 22:27 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Patrick Steinhardt, Christian Couder, git, John Cai, Taylor Blau,
	Eric Sunshine, Michael Haggerty

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> One idea I had about this was pluggable storage backends, which might be
> a nice feature to add via a dynamically loaded shared library.  In
> addition, this seems like the kind of feature that one might like to use
> Rust for, since it probably will involve HTTP code, and generally people
> like doing that less in C (I do, at least).

Yes, yes, and yes.

>> Also Cc'ing brian, who likely has a thing or two to say about this :)
>
> I certainly have thought about this a lot.  I will say that I've stepped
> down from being one of the Git LFS maintainers (endless supply of work,
> not nearly enough time), but I am still familiar with the architecture
> of the project.

Thanks.


* Re: [PATCH v2 0/4] Introduce a "promisor-remote" capability
  2024-09-30 22:27             ` Junio C Hamano
@ 2024-10-01 10:13               ` Patrick Steinhardt
  0 siblings, 0 replies; 110+ messages in thread
From: Patrick Steinhardt @ 2024-10-01 10:13 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: brian m. carlson, Christian Couder, git, John Cai, Taylor Blau,
	Eric Sunshine, Michael Haggerty

On Mon, Sep 30, 2024 at 03:27:14PM -0700, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> > One idea I had about this was pluggable storage backends, which might be
> > a nice feature to add via a dynamically loaded shared library.  In
> > addition, this seems like the kind of feature that one might like to use
> > Rust for, since it probably will involve HTTP code, and generally people
> > like doing that less in C (I do, at least).
> 
> Yes, yes, and yes.

Indeed, I strongly agree with this. In fact, pluggable ODBs are the next
big topic I'll be working on now that the refdb is pluggable. Naturally
this is a huge undertaking that will likely take more on the order of
years to realize, but one has to start somewhen, I guess.

I'm also aligned with the idea of having something like dlopen-style
implementations of the backends. While the reftable-library is nice and
fixes some of the issues that we have at GitLab, the more important win
is that this demonstrates that the abstractions that we have hold. Which
also means that adding a new backend has gotten a ton easier now.

And yes, being able to implement self-contained features like a refdb
implementation or an ODB implementation in Rust would be a sensible
first step for adopting it. It doesn't interact with anything else and
initially we could continue to support platforms that do not have Rust
by simply not compiling such a backend.

Patrick


* Re: [PATCH v2 3/4] Add 'promisor-remote' capability to protocol v2
  2024-09-30 13:28       ` Christian Couder
@ 2024-10-01 10:14         ` Patrick Steinhardt
  2024-10-01 18:47           ` Junio C Hamano
  0 siblings, 1 reply; 110+ messages in thread
From: Patrick Steinhardt @ 2024-10-01 10:14 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Taylor Blau, Eric Sunshine,
	Christian Couder

On Mon, Sep 30, 2024 at 03:28:20PM +0200, Christian Couder wrote:
> On Mon, Sep 30, 2024 at 9:57 AM Patrick Steinhardt <ps@pks.im> wrote:
> > So I assume the intent here is to let the client add that promisor
> > remote with that exact, server-provided name? That makes me wonder about
> > two different scenarios:
> >
> >   - We must keep the remote from announcing "origin".
> 
> I agree that it might not be a good idea to have something else than
> the main remote named origin. I am not sure it's necessary to
> explicitly disallow it though.
> 
> >   - What if we eventually decide to allow users to provide their own
> >     names for remotes during git-clone(1)?
> 
> I think it could be confusing, so I would say that we should wait
> until a concrete case where it could be useful appears before allowing
> this.

I think we've been talking past one another on this item. What I'm worried
about is a potential future where the default remote isn't called
"origin", but something else. I for example quite frequently rename the
remote right after cloning because I add a handful of remotes, and
"origin" would be too confusing. So there is a usecase that may at one
point in the future cause us to make this configurable at clone-time.

Which brings me to the issue with the current design: if the remote
dictates the names of additional remotes we basically cannot do the
above change anymore because we have to assume that no matter which
remote name is chosen, it could already be used by a promisor remote.
Our hands are bound by potential implementations of this feature by a
third party, which I think is not a good idea in general.

Now I'm not against advertising a name and storing it in our config when
we create the additional remote, for example by storing it as a separate
key "remote.<generated>.promisor-name". But the name of the remote
itself should not be controlled by the server, but should instead be
generated by the client.
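
A minimal sketch of that idea, under stated assumptions: the client
picks a local remote name that does not collide with any existing
remote and records the server-advertised name under a separate key.
The "promisor-N" naming scheme and the helper names are hypothetical,
and "remote.<generated>.promisor-name" is only the key floated above,
not an existing Git setting.

```python
def local_remote_name(existing, prefix="promisor"):
    """Pick a client-generated remote name that is not already in use."""
    n = 1
    while f"{prefix}-{n}" in existing:
        n += 1
    return f"{prefix}-{n}"


def config_for_advertised_remote(existing, advertised_name, url):
    """Return the config entries for a newly accepted promisor remote.

    The server-provided name is stored as data only; it never becomes
    the name of the local remote.
    """
    local = local_remote_name(existing)
    return {
        f"remote.{local}.url": url,
        f"remote.{local}.promisor": "true",
        # Hypothetical key, as proposed in this message.
        f"remote.{local}.promisor-name": advertised_name,
    }
```

With this scheme a server announcing "origin" cannot clash with the
user's own remotes, whatever name the default remote ends up having.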

> > Overall, I don't think that it's a good idea to let the remote dictate
> > which name a client's remotes have.
> 
> Maybe a new mode like "KnownURL" but where only the URL and not the
> name should match could be interesting in some cases then? If that's
> the case it's very simple to add it. I just prefer not to do it for
> now as I am not yet convinced there is a very relevant use case. I
> think that if a client doesn't want to trust and cooperate with the
> server at all, it might just be better in most cases for it to just
> leave the server alone and not access it at all, independently of
> using promisor remote or not.

It's not only about trust, as explained above. It's more about not
letting server operators dictate how Git can evolve in that context and
not taking away the ability of a user to configure their repository how
they want to.

> > I wonder how that is supposed to interact
> > with clients that do not know about the "promisor-remote" capability at
> > all though.
> 
> When that happens, S can fetch from X the objects it doesn't have, and
> then proceed as usual to respond to the client. This has the drawback
> of duplicating these objects on S, but perhaps there could be some
> kind of garbage collection process that would regularly remove those
> duplicated objects from S.
> 
> Another possibility that could be added in the future would be for S
> to warn the client that it should be upgraded to have the
> "promisor-remote" capability. Or S could just refuse to serve the
> client in that case. I don't think we should implement these
> possibilities right now, but it could be useful to do it in the
> future.
> 
> > From my point of view the server should be able to handle that just
> > fine and provide a full packfile to the client. That would of course
> > require the server to fetch missing objects from its own promisor
> > remotes.
> 
> It's what already happens.
> 
> > Do we want to state explicitly that this is a MUST for servers
> > so that we don't end up in a future where clients wouldn't be able to
> > fetch from some forges anymore?
> 
> I don't think we should enforce anything like this. For example in
> corporate setups, it might be easy to install the latest version of
> Git and it might be a good thing to make sure the server doesn't get
> overloaded with large files when they are supposed to only be stored
> on a promisor remote.

Partitioning the Git userbase depending on the Git version they can use
doesn't feel sensible to me. We have been able to get by without
breaking backwards compatibility on the transport layer until now. So it
would be too bad if this new feature would break that.

Also, the argument with a corporate setup cuts both ways, I think. If
the administrators tightly control the Git version anyway they can just
upgrade it for all clients, and consequently all of the clients would
know how to handle the new capability and thus the server wouldn't be
overloaded.
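
The backwards-compatible behavior argued for here can be summarized as
a small decision function. This is only a sketch of the policy under
discussion; the enum and function names are made up for illustration
and do not exist in Git.

```c
/* What the server does when it lacks an object the client asked for. */
enum missing_action {
	OMIT_AND_LET_CLIENT_FETCH, /* client gets the object from X itself */
	FETCH_FROM_X_THEN_SEND     /* server lazily fetches it, sends a full pack */
};

/*
 * Only a client that both sent the capability and accepted the
 * advertised promisor remote may receive a pack with objects omitted.
 * Every other client, including older ones that know nothing about
 * the capability, still gets a complete packfile.
 */
static enum missing_action on_missing_object(int client_sent_capability,
					     int client_accepted_remote)
{
	if (client_sent_capability && client_accepted_remote)
		return OMIT_AND_LET_CLIENT_FETCH;
	return FETCH_FROM_X_THEN_SEND;
}
```

An old client corresponds to `on_missing_object(0, 0)`, which falls
through to the full-packfile path.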

Patrick

* Re: [PATCH v2 0/4] Introduce a "promisor-remote" capability
  2024-09-30  9:17           ` Christian Couder
  2024-09-30 16:52             ` Junio C Hamano
@ 2024-10-01 10:14             ` Patrick Steinhardt
  1 sibling, 0 replies; 110+ messages in thread
From: Patrick Steinhardt @ 2024-10-01 10:14 UTC (permalink / raw)
  To: Christian Couder
  Cc: Junio C Hamano, git, John Cai, Taylor Blau, Eric Sunshine,
	Michael Haggerty, brian m. carlson

On Mon, Sep 30, 2024 at 11:17:48AM +0200, Christian Couder wrote:
> On Mon, Sep 30, 2024 at 9:57 AM Patrick Steinhardt <ps@pks.im> wrote:
> > I think chipping away the problems one by one is fine. But it would be
> > nice to draw something like a "big picture" of where we eventually want
> > to end up at and how all the parts connect with each other to form a
> > viable native replacement for Git LFS.
> 
> I have tried to discuss this at the Git Merge 2022 and 2024 and
> perhaps even before that. But as you know it's difficult to make
> people agree on big projects that are not backed by patches and that
> might span over several years (especially when very few people
> actually work on them and when they might have other things to work on
> too).

Certainly true, yeah. But we did have documents in the past that
outlined long-term visions in our tree, and it may help the project as a
whole to better understand the long-term vision we're headed into. And
by encouraging discussion up front we may be able to spot any weaknesses
and address them before it is too late.

Patrick

* Re: [PATCH v2 3/4] Add 'promisor-remote' capability to protocol v2
  2024-10-01 10:14         ` Patrick Steinhardt
@ 2024-10-01 18:47           ` Junio C Hamano
  0 siblings, 0 replies; 110+ messages in thread
From: Junio C Hamano @ 2024-10-01 18:47 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Christian Couder, git, John Cai, Taylor Blau, Eric Sunshine,
	Christian Couder

Patrick Steinhardt <ps@pks.im> writes:

> Now I'm not against advertising a name and storing it in our config when
> we create the additional remote, for example by storing it as a separate
> key "remote.<generated>.promisor-name". But the name of the remote
> itself should not be controlled by the server, but should instead be
> generated by the client.

Thanks.  In an earlier round of the review, I noticed that the
remote side gives each promisor remote it suggests a name, but I
failed to realize that it is used without any say from the user at
the receiving end in the local repository---which is horrible.

The remote end wants to keep referring to a promisor remote in such
a way that both sides can understand when the same promisor remote
is referred to in the future, and I am OK for the protocol to allow
the remote to give a name to a promisor remote.  Such a name needs
to be kept separate from the name the end-user locally uses to refer
to the promisor remote (if they follow the suggestion given over the
protocol).  Do we need some mapping mechanism to do so?  A name N
the remote A gave to another remote B has to keep referring to
the remote we know as B today, even if we rename B to C.
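
One possible shape for such a mapping, as a rough sketch: keep the
advertised name as a separate field and always look remotes up by that
field when talking to the server. The struct and both helpers are
hypothetical and only illustrate the idea.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory view of one configured remote. */
struct remote_entry {
	const char *local_name;      /* name the user sees and may change */
	const char *advertised_name; /* name N given by the server, or NULL */
};

/* Find the local remote that the server refers to as `advertised`. */
static struct remote_entry *find_by_advertised(struct remote_entry *remotes,
					       size_t nr,
					       const char *advertised)
{
	for (size_t i = 0; i < nr; i++)
		if (remotes[i].advertised_name &&
		    !strcmp(remotes[i].advertised_name, advertised))
			return &remotes[i];
	return NULL;
}

/* A local rename touches only the local name, never the advertised one. */
static void rename_remote(struct remote_entry *remote, const char *new_local)
{
	remote->local_name = new_local;
}
```

With an entry { "B", "N" }, a lookup of "N" finds it both before and
after `rename_remote(entry, "C")`, while a lookup of "B" finds nothing
because local names are never matched against what the server sends.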

Thanks.

* Re: [PATCH v2 3/4] Add 'promisor-remote' capability to protocol v2
  2024-09-10 16:29   ` [PATCH v2 3/4] Add 'promisor-remote' capability to protocol v2 Christian Couder
  2024-09-30  7:56     ` Patrick Steinhardt
@ 2024-11-06 14:04     ` Patrick Steinhardt
  2024-11-28  5:47     ` Junio C Hamano
  2 siblings, 0 replies; 110+ messages in thread
From: Patrick Steinhardt @ 2024-11-06 14:04 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Taylor Blau, Eric Sunshine,
	Christian Couder

On Tue, Sep 10, 2024 at 06:29:59PM +0200, Christian Couder wrote:
[snip]
> +static void filter_promisor_remote(struct repository *repo,
> +				   struct strvec *accepted,
> +				   const char *info)
> +{
> +	struct strbuf **remotes;
> +	char *accept_str;
> +	enum accept_promisor accept = ACCEPT_NONE;
> +
> +	if (!git_config_get_string("promisor.acceptfromserver", &accept_str)) {
> +		if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
> +			accept = ACCEPT_NONE;
> +		else if (!strcasecmp("All", accept_str))
> +			accept = ACCEPT_ALL;
> +		else
> +			warning(_("unknown '%s' value for '%s' config option"),
> +				accept_str, "promisor.acceptfromserver");
> +	}
> +
> +	if (accept == ACCEPT_NONE)
> +		return;

This code path is leaking memory because we don't free `accept_str`.
Once you reroll, I'd propose to have below patch on top to fix the leak.

Patrick

diff --git a/promisor-remote.c b/promisor-remote.c
index 06507b2ee1..0a4f7f1188 100644
--- a/promisor-remote.c
+++ b/promisor-remote.c
@@ -424,12 +424,12 @@ static void filter_promisor_remote(struct repository *repo,
 				   const char *info)
 {
 	struct strbuf **remotes;
-	char *accept_str;
+	const char *accept_str;
 	enum accept_promisor accept = ACCEPT_NONE;
 	struct strvec names = STRVEC_INIT;
 	struct strvec urls = STRVEC_INIT;
 
-	if (!git_config_get_string("promisor.acceptfromserver", &accept_str)) {
+	if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
 		if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
 			accept = ACCEPT_NONE;
 		else if (!strcasecmp("KnownUrl", accept_str))
@@ -486,7 +486,6 @@ static void filter_promisor_remote(struct repository *repo,
 		free(decoded_url);
 	}
 
-	free(accept_str);
 	strvec_clear(&names);
 	strvec_clear(&urls);
 	strbuf_list_free(remotes);
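
For readers less familiar with the two config getters, the ownership
difference behind the patch above can be sketched with stand-ins. The
`mock_*` functions below are not Git code; they only model that
git_config_get_string() hands the caller an allocated copy, while
git_config_get_string_tmp() hands out a borrowed pointer. The
comparison uses strcmp() instead of strcasecmp() to stay within
standard C.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for the value stored under "promisor.acceptfromserver". */
static const char *stored_value = "All";

/* Models git_config_get_string(): on success (0) the caller owns *dest. */
static int mock_get_string(char **dest)
{
	size_t len = strlen(stored_value) + 1;

	*dest = malloc(len);
	if (!*dest)
		return -1;
	memcpy(*dest, stored_value, len);
	return 0;
}

/* Models git_config_get_string_tmp(): *dest is borrowed, never freed. */
static int mock_get_string_tmp(const char **dest)
{
	*dest = stored_value;
	return 0;
}

/* Owned variant: every path after a successful get must free the copy. */
static int accepts_all_owned(void)
{
	char *value;
	int ret;

	if (mock_get_string(&value))
		return 0;
	ret = !strcmp(value, "All");
	free(value); /* an early return before this line would leak */
	return ret;
}

/* Borrowed variant: nothing to free, so early returns are harmless. */
static int accepts_all_borrowed(void)
{
	const char *value;

	if (mock_get_string_tmp(&value))
		return 0;
	return !strcmp(value, "All");
}
```

The borrowed variant is why the patch can drop the trailing
free(accept_str) and still be safe with the early return on
ACCEPT_NONE.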

* Re: [PATCH v2 3/4] Add 'promisor-remote' capability to protocol v2
  2024-09-10 16:29   ` [PATCH v2 3/4] Add 'promisor-remote' capability to protocol v2 Christian Couder
  2024-09-30  7:56     ` Patrick Steinhardt
  2024-11-06 14:04     ` Patrick Steinhardt
@ 2024-11-28  5:47     ` Junio C Hamano
  2024-11-28 15:31       ` Christian Couder
  2 siblings, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2024-11-28  5:47 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Christian Couder

I was looking at test breakages caused by this topic (in 'seen',
t5710 fails leak checking).

Then I noticed something strange.  Next to the "$(TRASH_DIRECTORY)",
running this script leaves a few garbage files under the "t/"
directory.

I think the culprit is this helper function.

> +initialize_server () {
> +	# Repack everything first
> +	git -C server -c repack.writebitmaps=false repack -a -d &&
> +
> +	# Remove promisor file in case they exist, useful when reinitializing
> +	rm -rf server/objects/pack/*.promisor &&
> +
> +	# Repack without the largest object and create a promisor pack on server
> +	git -C server -c repack.writebitmaps=false repack -a -d \
> +	    --filter=blob:limit=5k --filter-to="$(pwd)" &&

This --filter-to="$(pwd)" expands to $(TRASH_DIRECTORY), which is
"..../t/trash-directory.t5710-promisor-remote-capability".  I think
that is the cause for two extra trash files that are _OUTSIDE_ the
trash directory, which is an absolute no-no for tests to be safely
runnable.  Next to the trash directory, this ends up creating three
files

trash directory.t5710-...-980d3ff591aae1651cdd52f7dfad4fb6319ee3c2.idx
trash directory.t5710-...-980d3ff591aae1651cdd52f7dfad4fb6319ee3c2.pack
trash directory.t5710-...-980d3ff591aae1651cdd52f7dfad4fb6319ee3c2.rev

> +	promisor_file=$(ls server/objects/pack/*.pack | sed "s/\.pack/.promisor/") &&
> +	touch "$promisor_file" &&

Style: don't "touch" a single file to create it.  Instead >redirect_into_it.


The first failure in leak check seems to be

not ok 5 - fetch with promisor.advertise set to 'false'
#
#               git -C server config promisor.advertise false &&
#
#               # Clone from server to create a client
#               GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
#                       -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
#                       -c remote.server2.url="file://$(pwd)/server2" \
#                       -c promisor.acceptfromserver=All \
#                       --no-local --filter="blob:limit=5k" server client &&
#               test_when_finished "rm -rf client" &&
#
#               # Check that the largest object is not missing on the server
#               check_missing_objects server 0 "" &&
#
#               # Reinitialize server so that the largest object is missing again
#               initialize_server

but I didn't dig further.  Can you take a look?  I'll eject the
topic from 'seen' in the meantime to unblock the CI.

Thanks.

* Re: [PATCH v2 3/4] Add 'promisor-remote' capability to protocol v2
  2024-11-28  5:47     ` Junio C Hamano
@ 2024-11-28 15:31       ` Christian Couder
  2024-11-29  1:31         ` Junio C Hamano
  0 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2024-11-28 15:31 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Christian Couder

On Thu, Nov 28, 2024 at 6:47 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> I was looking at test breakages caused by this topic (in 'seen',
> t5710 fails leak checking).
>
> Then I noticed something strange.  Next to the "$(TRASH_DIRECTORY)",
> running this script leaves a few garbage files under the "t/"
> directory.
>
> I think the culprit is this helper function.
>
> > +initialize_server () {
> > +     # Repack everything first
> > +     git -C server -c repack.writebitmaps=false repack -a -d &&
> > +
> > +     # Remove promisor file in case they exist, useful when reinitializing
> > +     rm -rf server/objects/pack/*.promisor &&
> > +
> > +     # Repack without the largest object and create a promisor pack on server
> > +     git -C server -c repack.writebitmaps=false repack -a -d \
> > +         --filter=blob:limit=5k --filter-to="$(pwd)" &&
>
> This --filter-to="$(pwd)" expands to $(TRASH_DIRECTORY), which is
> "..../t/trash-directory.t5710-promisor-remote-capability".  I think
> that is the cause for two extra trash files that are _OUTSIDE_ the
> trash directory, which is an absolute no-no for tests to be safely
> runnable.  Next to the trash directory, this ends up creating three
> files
>
> trash directory.t5710-...-980d3ff591aae1651cdd52f7dfad4fb6319ee3c2.idx
> trash directory.t5710-...-980d3ff591aae1651cdd52f7dfad4fb6319ee3c2.pack
> trash directory.t5710-...-980d3ff591aae1651cdd52f7dfad4fb6319ee3c2.rev

Yeah, right. It should be --filter-to="$(pwd)/pack"

> > +     promisor_file=$(ls server/objects/pack/*.pack | sed "s/\.pack/.promisor/") &&
> > +     touch "$promisor_file" &&
>
> Style: don't "touch" a single file to create it.  Instead >redirect_into_it.

I have fixed this in the current version.

> The first failure in leak check seems to be
>
> not ok 5 - fetch with promisor.advertise set to 'false'
> #
> #               git -C server config promisor.advertise false &&
> #
> #               # Clone from server to create a client
> #               GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
> #                       -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
> #                       -c remote.server2.url="file://$(pwd)/server2" \
> #                       -c promisor.acceptfromserver=All \
> #                       --no-local --filter="blob:limit=5k" server client &&
> #               test_when_finished "rm -rf client" &&
> #
> #               # Check that the largest object is not missing on the server
> #               check_missing_objects server 0 "" &&
> #
> #               # Reinitialize server so that the largest object is missing again
> #               initialize_server
>
> but I didn't dig further.  Can you take a look?  I'll eject the
> topic from 'seen' in the meantime to unblock the CI.

No problem with ejecting the topic from 'seen'. I hope to send a new
version with a design doc next week.

Thanks.

* Re: [PATCH v2 3/4] Add 'promisor-remote' capability to protocol v2
  2024-11-28 15:31       ` Christian Couder
@ 2024-11-29  1:31         ` Junio C Hamano
  0 siblings, 0 replies; 110+ messages in thread
From: Junio C Hamano @ 2024-11-29  1:31 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

>> > +     git -C server -c repack.writebitmaps=false repack -a -d \
>> > +         --filter=blob:limit=5k --filter-to="$(pwd)" &&
>>
>> This --filter-to="$(pwd)" expands to $(TRASH_DIRECTORY), which is
>> "..../t/trash-directory.t5710-promisor-remote-capability".  I think
>> that is the cause for two extra trash files that are _OUTSIDE_ the
>> trash directory, which is an absolute no-no for tests to be safely
>> runnable.  Next to the trash directory, this ends up creating three
>> files
>>
>> trash directory.t5710-...-980d3ff591aae1651cdd52f7dfad4fb6319ee3c2.idx
>> trash directory.t5710-...-980d3ff591aae1651cdd52f7dfad4fb6319ee3c2.pack
>> trash directory.t5710-...-980d3ff591aae1651cdd52f7dfad4fb6319ee3c2.rev
>
> Yeah, right. It should be --filter-to="$(pwd)/pack"

It would create "pack-*" next to (not inside) "server", both as
immediate children of the "$TRASH_DIRECTORY".  I guess it is fine as
long as they are not created inside server/objects/ ;-)
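
Judging only from the file names reported in this thread, the value
given to --filter-to behaves like a path prefix to which
"-<hash>.<ext>" is appended. The following sketch models that naming
and nothing more; `filtered_pack_path()` is a hypothetical helper, not
how "git repack" is implemented.

```c
#include <stdio.h>

/*
 * Build "<prefix>-<hash>.<ext>" into buf and return snprintf()'s
 * result. This mirrors the observed file names only.
 */
static int filtered_pack_path(char *buf, size_t size, const char *prefix,
			      const char *hash, const char *ext)
{
	return snprintf(buf, size, "%s-%s.%s", prefix, hash, ext);
}
```

With a prefix equal to the trash directory itself, say "/t/trash", the
result is "/t/trash-<hash>.pack", a sibling of the trash directory.
With "/t/trash/pack" it becomes "/t/trash/pack-<hash>.pack", which
stays inside it.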

>> The first failure in leak check seems to be
>>
>> not ok 5 - fetch with promisor.advertise set to 'false'
>> #
>> #               git -C server config promisor.advertise false &&
>> #
>> #               # Clone from server to create a client
>> #               GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
>> #                       -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
>> #                       -c remote.server2.url="file://$(pwd)/server2" \
>> #                       -c promisor.acceptfromserver=All \
>> #                       --no-local --filter="blob:limit=5k" server client &&
>> #               test_when_finished "rm -rf client" &&
>> #
>> #               # Check that the largest object is not missing on the server
>> #               check_missing_objects server 0 "" &&
>> #
>> #               # Reinitialize server so that the largest object is missing again
>> #               initialize_server
>>
>> but I didn't dig further.  Can you take a look?  I'll eject the
>> topic from 'seen' in the meantime to unblock the CI.
>
> No problem with ejecting the topic from 'seen'. I hope to send a new
> version with a design doc next week.

Thanks.

* [PATCH v3 0/5] Introduce a "promisor-remote" capability
  2024-09-10 16:29 ` [PATCH v2 " Christian Couder
                     ` (4 preceding siblings ...)
  2024-09-26 18:09   ` [PATCH v2 0/4] Introduce a "promisor-remote" capability Junio C Hamano
@ 2024-12-06 12:42   ` Christian Couder
  2024-12-06 12:42     ` [PATCH v3 1/5] version: refactor strbuf_sanitize() Christian Couder
                       ` (6 more replies)
  5 siblings, 7 replies; 110+ messages in thread
From: Christian Couder @ 2024-12-06 12:42 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Taylor Blau,
	Eric Sunshine, Christian Couder

This work is part of some effort to better handle large files/blobs in
a client-server context using promisor remotes dedicated to storing
large blobs. To help understand this effort, this series now contains
a patch (patch 5/5) that adds design documentation about this effort.

Earlier this year, I sent 3 versions of a patch series with the goal
of allowing a client C to clone from a server S while using the same
promisor remote X that S already uses. See:

https://lore.kernel.org/git/20240418184043.2900955-1-christian.couder@gmail.com/

Junio suggested to implement that feature using:

"a protocol extension that lets S tell C that S wants C to fetch
missing objects from X (which means that if C knows about X in its
".git/config" then there is no need for end-user interaction at all),
or a protocol extension that C tells S that C is willing to see
objects available from X omitted when S does not have them (again,
this could be done by looking at ".git/config" at C, but there may be
security implications???)"

This patch series implements that protocol extension called
"promisor-remote" (that name is open to change or simplification)
which allows S and C to agree on C using X directly or not.

I have tried to implement it in a quite generic way that could allow S
and C to share more information about promisor remotes and how to use
them.

For now, C doesn't use the information it gets from S when cloning.
That information is only used to decide if C is OK to use the promisor
remotes advertised by S. But this could change in the future which
could make it much simpler for clients than using the current way of
passing information about X with the `-c` option of `git clone` many
times on the command line.

Another improvement could be to not require GIT_NO_LAZY_FETCH=0 when S
and C have agreed on using X.

Changes compared to version 2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Summary of the changes: there are few changes in the C code, but a
number of them in the tests, and a new design doc.

  - To avoid conflicts and benefit from recent improvements (like leak
    checks) this series has been rebased to a recent master:
    23692e08c6 (The thirteenth batch, 2024-12-04)

  - In patch 3/5, some functions are not passed a
    `struct repository *repo` argument anymore as this argument is only used
    in patch 4/5, so it's better to introduce it in that patch.

  - In patch 3/5, a memory leak was fixed using
    git_config_get_string_tmp() instead of git_config_get_string() as
    suggested by Patrick.

  - In patch 3/5, a number of tests using `git fetch` are added. This
    is why there are a number of other refactorings and improvements
    in the tests described in some points below.

  - In patch 3/5, some tests using `git clone` had a title that used
    "fetch" instead of "clone". This has been corrected.

  - In patch 3/5, the test helper function initialize_server() now
    takes 2 arguments to make it more generic.

  - In patch 3/5, a new copy_to_server2() test helper function has
    been introduced.

  - In patch 3/5, a test repacking with a filter was writing outside
    the test directory. This has been corrected using
    `--filter-to="$(pwd)/pack"`.

  - In patch 3/5, a test was using `touch "$promisor_file"`. This was
    replaced with `>"$promisor_file"`.

  - In patch 4/5, some functions are now passed a
    `struct repository *repo` argument. This is related to the
    corresponding change in patch 3/5 that removed this argument.

  - In patch 4/5, some tests using `git clone` had a title that used
    "fetch" instead of "clone". This has been corrected (in the same
    way as in patch 3/5).

  - Patch 5/5 is new. It adds design documentation that could help
    understand the broader context of this patch series, as this was
    requested by some reviewers. This patch is optional. I am OK with
    removing it or discussing it as a single separate patch.

Thanks to Junio, Patrick, Eric and Taylor for their suggestions.

CI tests
~~~~~~~~

https://github.com/chriscool/git/actions/runs/12197064528

Unfortunately some tests (linux-sha256, linux-reftable, linux-gcc and
linux-gcc-default) failed after around 46 minutes as the dependencies
couldn't be installed.

One test, linux-TEST-vars, failed much earlier, in what doesn't look
like a CI issue as I could reproduce the failure locally when setting
GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL to 1. I will investigate,
but in the meantime I think I can send this as-is so we can start
discussing.

Range diff compared to version 2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1:  0d9d094181 = 1:  13dd730641 version: refactor strbuf_sanitize()
2:  fc53229eff = 2:  8f2aecf6a1 strbuf: refactor strbuf_trim_trailing_ch()
3:  5c507e427f ! 3:  57e1481bc4 Add 'promisor-remote' capability to protocol v2
    @@ promisor-remote.c: void promisor_remote_get_direct(struct repository *repo,
     +  BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
     +}
     +
    -+static void filter_promisor_remote(struct repository *repo,
    -+                             struct strvec *accepted,
    -+                             const char *info)
    ++static void filter_promisor_remote(struct strvec *accepted, const char *info)
     +{
     +  struct strbuf **remotes;
    -+  char *accept_str;
    ++  const char *accept_str;
     +  enum accept_promisor accept = ACCEPT_NONE;
     +
    -+  if (!git_config_get_string("promisor.acceptfromserver", &accept_str)) {
    ++  if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
     +          if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
     +                  accept = ACCEPT_NONE;
     +          else if (!strcasecmp("All", accept_str))
    @@ promisor-remote.c: void promisor_remote_get_direct(struct repository *repo,
     +          free(decoded_url);
     +  }
     +
    -+  free(accept_str);
     +  strbuf_list_free(remotes);
     +}
     +
    @@ promisor-remote.c: void promisor_remote_get_direct(struct repository *repo,
     +  struct strvec accepted = STRVEC_INIT;
     +  struct strbuf reply = STRBUF_INIT;
     +
    -+  filter_promisor_remote(the_repository, &accepted, info);
    ++  filter_promisor_remote(&accepted, info);
     +
     +  if (!accepted.nr)
     +          return NULL;
    @@ t/t5710-promisor-remote-capability.sh (new)
     +  git -C "$1" rev-list --objects --all --missing=print > all.txt &&
     +  perl -ne 'print if s/^[?]//' all.txt >missing.txt &&
     +  test_line_count = "$2" missing.txt &&
    -+  test "$3" = "$(cat missing.txt)"
    ++  if test "$2" -lt 2
    ++  then
    ++          test "$3" = "$(cat missing.txt)"
    ++  else
    ++          test -f "$3" &&
    ++          sort <"$3" >expected_sorted &&
    ++          sort <missing.txt >actual_sorted &&
    ++          test_cmp expected_sorted actual_sorted
    ++  fi
     +}
     +
     +initialize_server () {
    ++  count="$1"
    ++  missing_oids="$2"
    ++
     +  # Repack everything first
     +  git -C server -c repack.writebitmaps=false repack -a -d &&
     +
    @@ t/t5710-promisor-remote-capability.sh (new)
     +
     +  # Repack without the largest object and create a promisor pack on server
     +  git -C server -c repack.writebitmaps=false repack -a -d \
    -+      --filter=blob:limit=5k --filter-to="$(pwd)" &&
    ++      --filter=blob:limit=5k --filter-to="$(pwd)/pack" &&
     +  promisor_file=$(ls server/objects/pack/*.pack | sed "s/\.pack/.promisor/") &&
    -+  touch "$promisor_file" &&
    ++  >"$promisor_file" &&
     +
    -+  # Check that only one object is missing on the server
    -+  check_missing_objects server 1 "$oid"
    ++  # Check objects missing on the server
    ++  check_missing_objects server "$count" "$missing_oids"
    ++}
    ++
    ++copy_to_server2 () {
    ++  oid_path="$(test_oid_to_path $1)" &&
    ++  path="server/objects/$oid_path" &&
    ++  path2="server2/objects/$oid_path" &&
    ++  mkdir -p $(dirname "$path2") &&
    ++  cp "$path" "$path2"
     +}
     +
     +test_expect_success "setup for testing promisor remote advertisement" '
    @@ t/t5710-promisor-remote-capability.sh (new)
     +  # Copy the largest object from server to server2
     +  obj="HEAD:foo" &&
     +  oid="$(git -C server rev-parse $obj)" &&
    -+  oid_path="$(test_oid_to_path $oid)" &&
    -+  path="server/objects/$oid_path" &&
    -+  path2="server2/objects/$oid_path" &&
    -+  mkdir -p $(dirname "$path2") &&
    -+  cp "$path" "$path2" &&
    ++  copy_to_server2 "$oid" &&
     +
    -+  initialize_server &&
    ++  initialize_server 1 "$oid" &&
     +
     +  # Configure server2 as promisor remote for server
     +  git -C server remote add server2 "file://$(pwd)/server2" &&
    @@ t/t5710-promisor-remote-capability.sh (new)
     +  git -C server config uploadpack.allowAnySHA1InWant true
     +'
     +
    -+test_expect_success "fetch with promisor.advertise set to 'true'" '
    ++test_expect_success "clone with promisor.advertise set to 'true'" '
     +  git -C server config promisor.advertise true &&
     +
     +  # Clone from server to create a client
    @@ t/t5710-promisor-remote-capability.sh (new)
     +  check_missing_objects server 1 "$oid"
     +'
     +
    -+test_expect_success "fetch with promisor.advertise set to 'false'" '
    ++test_expect_success "clone with promisor.advertise set to 'false'" '
     +  git -C server config promisor.advertise false &&
     +
     +  # Clone from server to create a client
    @@ t/t5710-promisor-remote-capability.sh (new)
     +  check_missing_objects server 0 "" &&
     +
     +  # Reinitialize server so that the largest object is missing again
    -+  initialize_server
    ++  initialize_server 1 "$oid"
     +'
     +
    -+test_expect_success "fetch with promisor.acceptfromserver set to 'None'" '
    ++test_expect_success "clone with promisor.acceptfromserver set to 'None'" '
     +  git -C server config promisor.advertise true &&
     +
     +  # Clone from server to create a client
    @@ t/t5710-promisor-remote-capability.sh (new)
     +  test_when_finished "rm -rf client" &&
     +
     +  # Check that the largest object is not missing on the server
    -+  check_missing_objects server 0 ""
    ++  check_missing_objects server 0 "" &&
    ++
    ++  # Reinitialize server so that the largest object is missing again
    ++  initialize_server 1 "$oid"
    ++'
    ++
    ++test_expect_success "init + fetch with promisor.advertise set to 'true'" '
    ++  git -C server config promisor.advertise true &&
    ++
    ++  test_when_finished "rm -rf client" &&
    ++  mkdir client &&
    ++  git -C client init &&
    ++  git -C client config remote.server2.promisor true &&
    ++  git -C client config remote.server2.fetch "+refs/heads/*:refs/remotes/server2/*" &&
    ++  git -C client config remote.server2.url "file://$(pwd)/server2" &&
    ++  git -C client config remote.server.url "file://$(pwd)/server" &&
    ++  git -C client config remote.server.fetch "+refs/heads/*:refs/remotes/server/*" &&
    ++  git -C client config promisor.acceptfromserver All &&
    ++  GIT_NO_LAZY_FETCH=0 git -C client fetch --filter="blob:limit=5k" server &&
    ++
    ++  # Check that the largest object is still missing on the server
    ++  check_missing_objects server 1 "$oid"
    ++'
    ++
    ++test_expect_success "clone with promisor.advertise set to 'true' but don't delete the client" '
    ++  git -C server config promisor.advertise true &&
    ++
    ++  # Clone from server to create a client
    ++  GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
    ++          -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
    ++          -c remote.server2.url="file://$(pwd)/server2" \
    ++          -c promisor.acceptfromserver=All \
    ++          --no-local --filter="blob:limit=5k" server client &&
    ++
    ++  # Check that the largest object is still missing on the server
    ++  check_missing_objects server 1 "$oid"
    ++'
    ++
    ++test_expect_success "setup for subsequent fetches" '
    ++  # Generate new commit with large blob
    ++  test-tool genrandom bar 10240 >template/bar &&
    ++  git -C template add bar &&
    ++  git -C template commit -m bar &&
    ++
    ++  # Fetch new commit with large blob
    ++  git -C server fetch origin &&
    ++  git -C server update-ref HEAD FETCH_HEAD &&
    ++  git -C server rev-parse HEAD >expected_head &&
    ++
    ++  # Repack everything twice and remove .promisor files before
    ++  # each repack. This makes sure everything gets repacked
    ++  # into a single packfile. The second repack is necessary
    ++  # because the first one fetches from server2 and creates a new
    ++  # packfile and its associated .promisor file.
    ++
    ++  rm -f server/objects/pack/*.promisor &&
    ++  git -C server -c repack.writebitmaps=false repack -a -d &&
    ++  rm -f server/objects/pack/*.promisor &&
    ++  git -C server -c repack.writebitmaps=false repack -a -d &&
    ++
    ++  # Unpack everything
    ++  rm pack-* &&
    ++  mv server/objects/pack/pack-* . &&
    ++  packfile=$(ls pack-*.pack) &&
    ++  git -C server unpack-objects --strict <"$packfile" &&
    ++
    ++  # Copy new large object to server2
    ++  obj_bar="HEAD:bar" &&
    ++  oid_bar="$(git -C server rev-parse $obj_bar)" &&
    ++  copy_to_server2 "$oid_bar" &&
    ++
    ++  # Reinitialize server so that the 2 largest objects are missing
    ++  printf "%s\n" "$oid" "$oid_bar" >expected_missing.txt &&
    ++  initialize_server 2 expected_missing.txt &&
    ++
    ++  # Create one more client
    ++  cp -r client client2
    ++'
    ++
    ++test_expect_success "subsequent fetch from a client when promisor.advertise is true" '
    ++  git -C server config promisor.advertise true &&
    ++
    ++  GIT_NO_LAZY_FETCH=0 git -C client pull origin &&
    ++
    ++  git -C client rev-parse HEAD >actual &&
    ++  test_cmp expected_head actual &&
    ++
    ++  cat client/bar >/dev/null &&
    ++
    ++  check_missing_objects server 2 expected_missing.txt
    ++'
    ++
    ++test_expect_success "subsequent fetch from a client when promisor.advertise is false" '
    ++  git -C server config promisor.advertise false &&
    ++
    ++  GIT_NO_LAZY_FETCH=0 git -C client2 pull origin &&
    ++
    ++  git -C client2 rev-parse HEAD >actual &&
    ++  test_cmp expected_head actual &&
    ++
    ++  cat client2/bar >/dev/null &&
    ++
    ++  check_missing_objects server 1 "$oid"
     +'
     +
     +test_done
4:  1c2794f139 ! 4:  7fcc619e41 promisor-remote: check advertised name or URL
    @@ promisor-remote.c: char *promisor_remote_info(struct repository *repo)
     +  return 0;
      }
      
    - static void filter_promisor_remote(struct repository *repo,
    -@@ promisor-remote.c: static void filter_promisor_remote(struct repository *repo,
    +-static void filter_promisor_remote(struct strvec *accepted, const char *info)
    ++static void filter_promisor_remote(struct repository *repo,
    ++                             struct strvec *accepted,
    ++                             const char *info)
    + {
        struct strbuf **remotes;
    -   char *accept_str;
    +   const char *accept_str;
        enum accept_promisor accept = ACCEPT_NONE;
     +  struct strvec names = STRVEC_INIT;
     +  struct strvec urls = STRVEC_INIT;
      
    -   if (!git_config_get_string("promisor.acceptfromserver", &accept_str)) {
    +   if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
                if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
                        accept = ACCEPT_NONE;
     +          else if (!strcasecmp("KnownUrl", accept_str))
    @@ promisor-remote.c: static void filter_promisor_remote(struct repository *repo,
                else if (!strcasecmp("All", accept_str))
                        accept = ACCEPT_ALL;
                else
    -@@ promisor-remote.c: static void filter_promisor_remote(struct repository *repo,
    +@@ promisor-remote.c: static void filter_promisor_remote(struct strvec *accepted, const char *info)
        if (accept == ACCEPT_NONE)
                return;
      
    @@ promisor-remote.c: static void filter_promisor_remote(struct repository *repo,
        /* Parse remote info received */
      
        remotes = strbuf_split_str(info, ';', 0);
    -@@ promisor-remote.c: static void filter_promisor_remote(struct repository *repo,
    +@@ promisor-remote.c: static void filter_promisor_remote(struct strvec *accepted, const char *info)
                if (remote_url)
                        decoded_url = url_percent_decode(remote_url);
      
    @@ promisor-remote.c: static void filter_promisor_remote(struct repository *repo,
                        strvec_push(accepted, decoded_name);
      
                strbuf_list_free(elems);
    -@@ promisor-remote.c: static void filter_promisor_remote(struct repository *repo,
    +@@ promisor-remote.c: static void filter_promisor_remote(struct strvec *accepted, const char *info)
    +           free(decoded_url);
        }
      
    -   free(accept_str);
     +  strvec_clear(&names);
     +  strvec_clear(&urls);
        strbuf_list_free(remotes);
      }
      
    +@@ promisor-remote.c: char *promisor_remote_reply(const char *info)
    +   struct strvec accepted = STRVEC_INIT;
    +   struct strbuf reply = STRBUF_INIT;
    + 
    +-  filter_promisor_remote(&accepted, info);
    ++  filter_promisor_remote(the_repository, &accepted, info);
    + 
    +   if (!accepted.nr)
    +           return NULL;
     
      ## t/t5710-promisor-remote-capability.sh ##
    -@@ t/t5710-promisor-remote-capability.sh: test_expect_success "fetch with promisor.acceptfromserver set to 'None'" '
    -           --no-local --filter="blob:limit=5k" server client &&
    -   test_when_finished "rm -rf client" &&
    +@@ t/t5710-promisor-remote-capability.sh: test_expect_success "init + fetch with promisor.advertise set to 'true'" '
    +   check_missing_objects server 1 "$oid"
    + '
      
    -+  # Check that the largest object is not missing on the server
    -+  check_missing_objects server 0 "" &&
    -+
    -+  # Reinitialize server so that the largest object is missing again
    -+  initialize_server
    -+'
    -+
    -+test_expect_success "fetch with promisor.acceptfromserver set to 'KnownName'" '
    ++test_expect_success "clone with promisor.acceptfromserver set to 'KnownName'" '
     +  git -C server config promisor.advertise true &&
     +
     +  # Clone from server to create a client
    @@ t/t5710-promisor-remote-capability.sh: test_expect_success "fetch with promisor.
     +  check_missing_objects server 1 "$oid"
     +'
     +
    -+test_expect_success "fetch with 'KnownName' and different remote names" '
    ++test_expect_success "clone with 'KnownName' and different remote names" '
     +  git -C server config promisor.advertise true &&
     +
     +  # Clone from server to create a client
    @@ t/t5710-promisor-remote-capability.sh: test_expect_success "fetch with promisor.
     +  check_missing_objects server 0 "" &&
     +
     +  # Reinitialize server so that the largest object is missing again
    -+  initialize_server
    ++  initialize_server 1 "$oid"
     +'
     +
    -+test_expect_success "fetch with promisor.acceptfromserver set to 'KnownUrl'" '
    ++test_expect_success "clone with promisor.acceptfromserver set to 'KnownUrl'" '
     +  git -C server config promisor.advertise true &&
     +
     +  # Clone from server to create a client
    @@ t/t5710-promisor-remote-capability.sh: test_expect_success "fetch with promisor.
     +  check_missing_objects server 1 "$oid"
     +'
     +
    -+test_expect_success "fetch with 'KnownUrl' and different remote urls" '
    ++test_expect_success "clone with 'KnownUrl' and different remote urls" '
     +  ln -s server2 serverTwo &&
     +
     +  git -C server config promisor.advertise true &&
    @@ t/t5710-promisor-remote-capability.sh: test_expect_success "fetch with promisor.
     +          --no-local --filter="blob:limit=5k" server client &&
     +  test_when_finished "rm -rf client" &&
     +
    -   # Check that the largest object is not missing on the server
    -   check_missing_objects server 0 ""
    - '
    ++  # Check that the largest object is not missing on the server
    ++  check_missing_objects server 0 "" &&
    ++
    ++  # Reinitialize server so that the largest object is missing again
    ++  initialize_server 1 "$oid"
    ++'
    ++
    + test_expect_success "clone with promisor.advertise set to 'true' but don't delete the client" '
    +   git -C server config promisor.advertise true &&
    + 
-:  ---------- > 5:  c25c94707f doc: add technical design doc for large object promisors


Christian Couder (5):
  version: refactor strbuf_sanitize()
  strbuf: refactor strbuf_trim_trailing_ch()
  Add 'promisor-remote' capability to protocol v2
  promisor-remote: check advertised name or URL
  doc: add technical design doc for large object promisors

 Documentation/config/promisor.txt             |  27 +
 Documentation/gitprotocol-v2.txt              |  54 ++
 .../technical/large-object-promisors.txt      | 530 ++++++++++++++++++
 connect.c                                     |   9 +
 promisor-remote.c                             | 243 ++++++++
 promisor-remote.h                             |  36 +-
 serve.c                                       |  26 +
 strbuf.c                                      |  16 +
 strbuf.h                                      |  10 +
 t/t5710-promisor-remote-capability.sh         | 309 ++++++++++
 trace2/tr2_cfg.c                              |  10 +-
 upload-pack.c                                 |   3 +
 version.c                                     |   9 +-
 13 files changed, 1266 insertions(+), 16 deletions(-)
 create mode 100644 Documentation/technical/large-object-promisors.txt
 create mode 100755 t/t5710-promisor-remote-capability.sh

-- 
2.47.1.402.gc25c94707f


^ permalink raw reply	[flat|nested] 110+ messages in thread

* [PATCH v3 1/5] version: refactor strbuf_sanitize()
  2024-12-06 12:42   ` [PATCH v3 0/5] " Christian Couder
@ 2024-12-06 12:42     ` Christian Couder
  2024-12-07  6:21       ` Junio C Hamano
  2024-12-06 12:42     ` [PATCH v3 2/5] strbuf: refactor strbuf_trim_trailing_ch() Christian Couder
                       ` (5 subsequent siblings)
  6 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2024-12-06 12:42 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Taylor Blau,
	Eric Sunshine, Christian Couder, Christian Couder

The git_user_agent_sanitized() function performs some sanitizing to
avoid special characters being sent over the line and possibly messing
up the protocol or the parsing on the other side.

Let's extract this sanitizing into a new strbuf_sanitize() function, as
we will want to reuse it in a following patch, and let's put it into
strbuf.{c,h}.

While at it, let's also make a few small improvements:
  - use 'size_t' for 'i' instead of 'int',
  - move the declaration of 'i' inside the 'for ( ... )',
  - use strbuf_detach() to explicitly detach the string contained by
    the 'sb' strbuf.
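
As a rough illustration of the sanitizing described above, here is a
stand-alone sketch that works on a plain NUL-terminated buffer instead
of a strbuf. The name sanitize_cstr() is hypothetical and not part of
this patch; it only assumes the same rule as the code being moved
(trim, then replace bytes <= 32 or >= 127 with '.').

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical analogue of strbuf_sanitize() for a plain C string.
 * Trims surrounding whitespace in place, then replaces every byte
 * outside the range 33..126 with a dot.
 */
static void sanitize_cstr(char *s)
{
	size_t len = strlen(s);
	size_t start = 0;

	/* Trim trailing whitespace. */
	while (len > 0 && isspace((unsigned char)s[len - 1]))
		len--;
	/* Trim leading whitespace. */
	while (start < len && isspace((unsigned char)s[start]))
		start++;
	memmove(s, s + start, len - start);
	len -= start;
	s[len] = '\0';

	/* Replace spaces, control characters and non-ASCII bytes. */
	for (size_t i = 0; i < len; i++) {
		unsigned char c = (unsigned char)s[i];
		if (c <= 32 || c >= 127)
			s[i] = '.';
	}
}
```

Note that an inner space is replaced too, so a value such as
"git/2.47 (x y)" would be sent as "git/2.47.(x.y)".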

Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 strbuf.c  | 9 +++++++++
 strbuf.h  | 7 +++++++
 version.c | 9 ++-------
 3 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/strbuf.c b/strbuf.c
index 3d2189a7f6..cccfdec0e3 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -1082,3 +1082,12 @@ void strbuf_strip_file_from_path(struct strbuf *sb)
 	char *path_sep = find_last_dir_sep(sb->buf);
 	strbuf_setlen(sb, path_sep ? path_sep - sb->buf + 1 : 0);
 }
+
+void strbuf_sanitize(struct strbuf *sb)
+{
+	strbuf_trim(sb);
+	for (size_t i = 0; i < sb->len; i++) {
+		if (sb->buf[i] <= 32 || sb->buf[i] >= 127)
+			sb->buf[i] = '.';
+	}
+}
diff --git a/strbuf.h b/strbuf.h
index 003f880ff7..884157873e 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -664,6 +664,13 @@ typedef int (*char_predicate)(char ch);
 void strbuf_addstr_urlencode(struct strbuf *sb, const char *name,
 			     char_predicate allow_unencoded_fn);
 
+/*
+ * Trim, then replace each character whose ASCII code is 32 or below,
+ * or 127 or above, with a dot '.' character. Useful for sending
+ * capabilities.
+ */
+void strbuf_sanitize(struct strbuf *sb);
+
 __attribute__((format (printf,1,2)))
 int printf_ln(const char *fmt, ...);
 __attribute__((format (printf,2,3)))
diff --git a/version.c b/version.c
index 41b718c29e..951e6dca74 100644
--- a/version.c
+++ b/version.c
@@ -24,15 +24,10 @@ const char *git_user_agent_sanitized(void)
 
 	if (!agent) {
 		struct strbuf buf = STRBUF_INIT;
-		int i;
 
 		strbuf_addstr(&buf, git_user_agent());
-		strbuf_trim(&buf);
-		for (i = 0; i < buf.len; i++) {
-			if (buf.buf[i] <= 32 || buf.buf[i] >= 127)
-				buf.buf[i] = '.';
-		}
-		agent = buf.buf;
+		strbuf_sanitize(&buf);
+		agent = strbuf_detach(&buf, NULL);
 	}
 
 	return agent;
-- 
2.47.1.402.gc25c94707f



* [PATCH v3 2/5] strbuf: refactor strbuf_trim_trailing_ch()
  2024-12-06 12:42   ` [PATCH v3 0/5] " Christian Couder
  2024-12-06 12:42     ` [PATCH v3 1/5] version: refactor strbuf_sanitize() Christian Couder
@ 2024-12-06 12:42     ` Christian Couder
  2024-12-07  6:35       ` Junio C Hamano
  2024-12-16 11:47       ` karthik nayak
  2024-12-06 12:42     ` [PATCH v3 3/5] Add 'promisor-remote' capability to protocol v2 Christian Couder
                       ` (4 subsequent siblings)
  6 siblings, 2 replies; 110+ messages in thread
From: Christian Couder @ 2024-12-06 12:42 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Taylor Blau,
	Eric Sunshine, Christian Couder, Christian Couder

We often have to split strings at some specified terminator character.
The strbuf_split*() functions, which we can use for this purpose,
return substrings that include the terminator character, so we often
need to remove that character.

When it is whitespace, a newline or a directory separator, the
terminator character can easily be removed using an existing trimming
function like strbuf_rtrim(), strbuf_trim_trailing_newline() or
strbuf_trim_trailing_dir_sep(). There is no function to remove that
character when it's not one of those characters though.

Let's introduce a new strbuf_trim_trailing_ch() function that can be
used to remove any trailing character, and let's refactor existing code
that manually removed trailing characters using this new function.

We are also going to use this new function in a following commit.
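
To make the intended behavior concrete, here is a minimal sketch on a
plain C string rather than a strbuf. The helper name trim_trailing_ch()
is hypothetical; it only mirrors the loop introduced by this patch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical plain-string analogue of strbuf_trim_trailing_ch():
 * drop every trailing occurrence of 'c' from 's', in place.
 */
static void trim_trailing_ch(char *s, int c)
{
	size_t len = strlen(s);

	while (len > 0 && s[len - 1] == c)
		len--;
	s[len] = '\0';
}
```

Unlike the hand-written checks it replaces, which dropped at most one
terminator, this loop removes all trailing occurrences. Each piece
returned by the strbuf_split*() functions ends with at most one
terminator, so the result should be the same for those callers.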

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 strbuf.c         |  7 +++++++
 strbuf.h         |  3 +++
 trace2/tr2_cfg.c | 10 ++--------
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/strbuf.c b/strbuf.c
index cccfdec0e3..c986ec28f4 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -134,6 +134,13 @@ void strbuf_trim_trailing_dir_sep(struct strbuf *sb)
 	sb->buf[sb->len] = '\0';
 }
 
+void strbuf_trim_trailing_ch(struct strbuf *sb, int c)
+{
+	while (sb->len > 0 && sb->buf[sb->len - 1] == c)
+		sb->len--;
+	sb->buf[sb->len] = '\0';
+}
+
 void strbuf_trim_trailing_newline(struct strbuf *sb)
 {
 	if (sb->len > 0 && sb->buf[sb->len - 1] == '\n') {
diff --git a/strbuf.h b/strbuf.h
index 884157873e..5e389ab065 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -197,6 +197,9 @@ void strbuf_trim_trailing_dir_sep(struct strbuf *sb);
 /* Strip trailing LF or CR/LF */
 void strbuf_trim_trailing_newline(struct strbuf *sb);
 
+/* Strip trailing character c */
+void strbuf_trim_trailing_ch(struct strbuf *sb, int c);
+
 /**
  * Replace the contents of the strbuf with a reencoded form.  Returns -1
  * on error, 0 on success.
diff --git a/trace2/tr2_cfg.c b/trace2/tr2_cfg.c
index 22a99a0682..9da1f8466c 100644
--- a/trace2/tr2_cfg.c
+++ b/trace2/tr2_cfg.c
@@ -35,10 +35,7 @@ static int tr2_cfg_load_patterns(void)
 
 	tr2_cfg_patterns = strbuf_split_buf(envvar, strlen(envvar), ',', -1);
 	for (s = tr2_cfg_patterns; *s; s++) {
-		struct strbuf *buf = *s;
-
-		if (buf->len && buf->buf[buf->len - 1] == ',')
-			strbuf_setlen(buf, buf->len - 1);
+		strbuf_trim_trailing_ch(*s, ',');
 		strbuf_trim_trailing_newline(*s);
 		strbuf_trim(*s);
 	}
@@ -74,10 +71,7 @@ static int tr2_load_env_vars(void)
 
 	tr2_cfg_env_vars = strbuf_split_buf(varlist, strlen(varlist), ',', -1);
 	for (s = tr2_cfg_env_vars; *s; s++) {
-		struct strbuf *buf = *s;
-
-		if (buf->len && buf->buf[buf->len - 1] == ',')
-			strbuf_setlen(buf, buf->len - 1);
+		strbuf_trim_trailing_ch(*s, ',');
 		strbuf_trim_trailing_newline(*s);
 		strbuf_trim(*s);
 	}
-- 
2.47.1.402.gc25c94707f



* [PATCH v3 3/5] Add 'promisor-remote' capability to protocol v2
  2024-12-06 12:42   ` [PATCH v3 0/5] " Christian Couder
  2024-12-06 12:42     ` [PATCH v3 1/5] version: refactor strbuf_sanitize() Christian Couder
  2024-12-06 12:42     ` [PATCH v3 2/5] strbuf: refactor strbuf_trim_trailing_ch() Christian Couder
@ 2024-12-06 12:42     ` Christian Couder
  2024-12-07  7:59       ` Junio C Hamano
  2024-12-06 12:42     ` [PATCH v3 4/5] promisor-remote: check advertised name or URL Christian Couder
                       ` (3 subsequent siblings)
  6 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2024-12-06 12:42 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Taylor Blau,
	Eric Sunshine, Christian Couder, Christian Couder

When a server S knows that some objects from a repository are available
from a promisor remote X, S might want to suggest to a client C cloning
or fetching the repo from S that C should use X directly instead of S
for these objects.

Note that this could happen both in the case S itself doesn't have the
objects and borrows them from X, and in the case S has the objects but
knows that X is better connected to the world (e.g., it is in a
$LARGEINTERNETCOMPANY datacenter with petabit/s backbone connections)
than S. Implementation of the latter case, which would require S to
omit in its response the objects available on X, is left for future
improvement though.

Then C might, or might not, want to get the objects from X, and should
let S know about this.

To allow S and C to agree and let each other know about C using X or
not, let's introduce a new "promisor-remote" capability in the
protocol v2, as well as a few new configuration variables:

  - "promisor.advertise" on the server side, and:
  - "promisor.acceptFromServer" on the client side.

By default, or if "promisor.advertise" is set to 'false', a server S will
not advertise the "promisor-remote" capability.

If S doesn't advertise the "promisor-remote" capability, then a client C
replying to S shouldn't advertise the "promisor-remote" capability
either.

If "promisor.advertise" is set to 'true', S will advertise its promisor
remotes with a string like:

  promisor-remote=<pr-info>[;<pr-info>]...

where each <pr-info> element contains information about a single
promisor remote in the form:

  name=<pr-name>[,url=<pr-url>]

where <pr-name> is the urlencoded name of a promisor remote and
<pr-url> is the urlencoded URL of the promisor remote named <pr-name>.
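
The advertisement format above can be sketched with fixed-size buffers
as follows. This is only an illustration under stated assumptions: the
helpers append_raw(), append_encoded() and append_pr_info() are
hypothetical, keep_unencoded() copies the rule of allow_unsanitized()
from this patch, and lowercase hex digits are an arbitrary choice made
for the sketch.

```c
#include <stdio.h>
#include <string.h>

/* Same rule as allow_unsanitized(): ',', ';' and '%' must be encoded. */
static int keep_unencoded(char ch)
{
	if (ch == ',' || ch == ';' || ch == '%')
		return 0;
	return ch > 32 && ch < 127;
}

/* Append 'src' verbatim to the string in 'dst'; -1 if it does not fit. */
static int append_raw(char *dst, size_t cap, const char *src)
{
	size_t len = strlen(dst), n = strlen(src);

	if (len + n + 1 > cap)
		return -1;
	memcpy(dst + len, src, n + 1);
	return 0;
}

/* Append 'src' to 'dst', percent-encoding bytes that are not allowed. */
static int append_encoded(char *dst, size_t cap, const char *src)
{
	size_t len = strlen(dst);

	for (; *src; src++) {
		if (keep_unencoded(*src)) {
			if (len + 2 > cap)
				return -1;
			dst[len++] = *src;
		} else {
			if (len + 4 > cap)
				return -1;
			snprintf(dst + len, 4, "%%%02x", (unsigned char)*src);
			len += 3;
		}
		dst[len] = '\0';
	}
	return 0;
}

/* Append one <pr-info> element; 'url' may be NULL when none is known. */
static int append_pr_info(char *dst, size_t cap,
			  const char *name, const char *url)
{
	if (*dst && append_raw(dst, cap, ";"))
		return -1;
	if (append_raw(dst, cap, "name=") || append_encoded(dst, cap, name))
		return -1;
	if (url && (append_raw(dst, cap, ",url=") ||
		    append_encoded(dst, cap, url)))
		return -1;
	return 0;
}
```

Calling append_pr_info() once per promisor remote on an initially empty
buffer yields the part after "promisor-remote=". A ',' inside a URL is
sent as "%2c", so it cannot be confused with the field separator.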

For now, the URL is passed in addition to the name. In the future, it
might be possible to pass other information like a filter-spec that the
client should use when cloning from S, or a token that the client should
use when retrieving objects from X.

It might also be possible in the future for "promisor.advertise" to have
other values. For example a value like "onlyName" could prevent S from
advertising URLs, which could help in case C should use a different URL
for X than the URL S is using. (The URL S is using might be an internal
one on the server side for example.)

By default or if "promisor.acceptFromServer" is set to "None", C will
not accept any of the promisor remotes that might have been advertised
by S. In this case, C will not advertise any "promisor-remote"
capability in its reply to S.

If "promisor.acceptFromServer" is set to "All" and S advertised some
promisor remotes, then, on the contrary, C will accept all the
promisor remotes that S advertised, and C will reply with a string like:

  promisor-remote=<pr-name>[;<pr-name>]...

where the <pr-name> elements are the urlencoded names of all the
promisor remotes S advertised.
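
As a sketch of the "All" case, the reply can be derived from the
advertisement by keeping only the names. The function below is
hypothetical and simplified: it assumes a well-formed advertisement
with one "name=" field per element, and it copies the still-encoded
names as they are, whereas the patch decodes each name and encodes it
again before replying.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: given the <pr-info> list sent by the server,
 * write "<pr-name>[;<pr-name>]..." into 'reply'.
 * Returns 0 on success, -1 if 'reply' is too small.
 */
static int build_accept_all_reply(const char *info, char *reply, size_t cap)
{
	size_t len = 0;

	if (!cap)
		return -1;
	reply[0] = '\0';

	while (*info) {
		/* One <pr-info> element ends at ';' or at the end of input. */
		const char *end = info + strcspn(info, ";");
		const char *p = info;

		/* Walk the ','-separated fields of this element. */
		while (p < end) {
			size_t field_len = strcspn(p, ",;");

			if (field_len > 5 && !strncmp(p, "name=", 5)) {
				size_t n = field_len - 5;
				size_t need = n + (len ? 1 : 0);

				if (len + need + 1 > cap)
					return -1;
				if (len)
					reply[len++] = ';';
				memcpy(reply + len, p + 5, n);
				len += n;
				reply[len] = '\0';
			}
			p += field_len;
			if (p < end)
				p++; /* skip the ',' separator */
		}

		info = end;
		if (*info == ';')
			info++;
	}
	return 0;
}
```

An empty result corresponds to the case where C does not send the
"promisor-remote" capability at all.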

In a following commit, other values for "promisor.acceptFromServer" will
be implemented, so that C will be able to decide the promisor remotes it
accepts depending on the name and URL it received from S. So even if
that name and URL information is not used much right now, it will be
needed soon.

Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/promisor.txt     |  17 ++
 Documentation/gitprotocol-v2.txt      |  54 ++++++
 connect.c                             |   9 +
 promisor-remote.c                     | 195 +++++++++++++++++++++
 promisor-remote.h                     |  36 +++-
 serve.c                               |  26 +++
 t/t5710-promisor-remote-capability.sh | 241 ++++++++++++++++++++++++++
 upload-pack.c                         |   3 +
 8 files changed, 580 insertions(+), 1 deletion(-)
 create mode 100755 t/t5710-promisor-remote-capability.sh

diff --git a/Documentation/config/promisor.txt b/Documentation/config/promisor.txt
index 98c5cb2ec2..9cbfe3e59e 100644
--- a/Documentation/config/promisor.txt
+++ b/Documentation/config/promisor.txt
@@ -1,3 +1,20 @@
 promisor.quiet::
 	If set to "true" assume `--quiet` when fetching additional
 	objects for a partial clone.
+
+promisor.advertise::
+	If set to "true", a server will use the "promisor-remote"
+	capability, see linkgit:gitprotocol-v2[5], to advertise the
+	promisor remotes it is using, if it uses some. Default is
+	"false", which means the "promisor-remote" capability is not
+	advertised.
+
+promisor.acceptFromServer::
+	If set to "all", a client will accept all the promisor remotes
+	a server might advertise using the "promisor-remote"
+	capability. Default is "none", which means no promisor remote
+	advertised by a server will be accepted. By accepting a
+	promisor remote, the client agrees that the server might omit
+	objects that are lazily fetchable from this promisor remote
+	from its responses to "fetch" and "clone" requests from the
+	client. See linkgit:gitprotocol-v2[5].
diff --git a/Documentation/gitprotocol-v2.txt b/Documentation/gitprotocol-v2.txt
index 1652fef3ae..f25a9a6ad8 100644
--- a/Documentation/gitprotocol-v2.txt
+++ b/Documentation/gitprotocol-v2.txt
@@ -781,6 +781,60 @@ retrieving the header from a bundle at the indicated URI, and thus
 save themselves and the server(s) the request(s) needed to inspect the
 headers of that bundle or bundles.
 
+promisor-remote=<pr-infos>
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The server may advertise some promisor remotes it is using or knows
+about to a client which may want to use them as its promisor remotes,
+instead of this repository. In this case <pr-infos> should be of the
+form:
+
+	pr-infos = pr-info | pr-infos ";" pr-info
+
+	pr-info = "name=" pr-name | "name=" pr-name "," "url=" pr-url
+
+where `pr-name` is the urlencoded name of a promisor remote, and
+`pr-url` the urlencoded URL of that promisor remote.
+
+In this case, if the client decides to use one or more promisor
+remotes the server advertised, it can reply with
+"promisor-remote=<pr-names>" where <pr-names> should be of the form:
+
+	pr-names = pr-name | pr-names ";" pr-name
+
+where `pr-name` is the urlencoded name of a promisor remote the server
+advertised and the client accepts.
+
+Note that, everywhere in this document, `pr-name` MUST be a valid
+remote name, and the ';' and ',' characters MUST be encoded if they
+appear in `pr-name` or `pr-url`.
+
+If the server doesn't know any promisor remote that could be good for
+a client to use, or prefers a client not to use any promisor remote it
+uses or knows about, it shouldn't advertise the "promisor-remote"
+capability at all.
+
+In this case, or if the client doesn't want to use any promisor remote
+the server advertised, the client shouldn't advertise the
+"promisor-remote" capability at all in its reply.
+
+The "promisor.advertise" and "promisor.acceptFromServer" configuration
+options can be used on the server and client side respectively to
+control what they advertise or accept respectively. See the
+documentation of these configuration options for more information.
+
+Note that in the future it would be nice if the "promisor-remote"
+protocol capability could be used by the server, when responding to
+`git fetch` or `git clone`, to advertise better-connected remotes that
+the client can use as promisor remotes, instead of this repository, so
+that the client can lazily fetch objects from these other
+better-connected remotes. This would require the server to omit in its
+response the objects available on the better-connected remotes that
+the client has accepted. This hasn't been implemented yet though. So
+for now this "promisor-remote" capability is useful only when the
+server advertises some promisor remotes it already uses to borrow
+objects from.
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/connect.c b/connect.c
index 58f53d8dcb..898bf3b438 100644
--- a/connect.c
+++ b/connect.c
@@ -22,6 +22,7 @@
 #include "protocol.h"
 #include "alias.h"
 #include "bundle-uri.h"
+#include "promisor-remote.h"
 
 static char *server_capabilities_v1;
 static struct strvec server_capabilities_v2 = STRVEC_INIT;
@@ -487,6 +488,7 @@ void check_stateless_delimiter(int stateless_rpc,
 static void send_capabilities(int fd_out, struct packet_reader *reader)
 {
 	const char *hash_name;
+	const char *promisor_remote_info;
 
 	if (server_supports_v2("agent"))
 		packet_write_fmt(fd_out, "agent=%s", git_user_agent_sanitized());
@@ -500,6 +502,13 @@ static void send_capabilities(int fd_out, struct packet_reader *reader)
 	} else {
 		reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
 	}
+	if (server_feature_v2("promisor-remote", &promisor_remote_info)) {
+		char *reply = promisor_remote_reply(promisor_remote_info);
+		if (reply) {
+			packet_write_fmt(fd_out, "promisor-remote=%s", reply);
+			free(reply);
+		}
+	}
 }
 
 int get_remote_bundle_uri(int fd_out, struct packet_reader *reader,
diff --git a/promisor-remote.c b/promisor-remote.c
index 9345ae3db2..ea418c4094 100644
--- a/promisor-remote.c
+++ b/promisor-remote.c
@@ -11,6 +11,7 @@
 #include "strvec.h"
 #include "packfile.h"
 #include "environment.h"
+#include "url.h"
 
 struct promisor_remote_config {
 	struct promisor_remote *promisors;
@@ -221,6 +222,18 @@ int repo_has_promisor_remote(struct repository *r)
 	return !!repo_promisor_remote_find(r, NULL);
 }
 
+int repo_has_accepted_promisor_remote(struct repository *r)
+{
+	struct promisor_remote *p;
+
+	promisor_remote_init(r);
+
+	for (p = r->promisor_remote_config->promisors; p; p = p->next)
+		if (p->accepted)
+			return 1;
+	return 0;
+}
+
 static int remove_fetched_oids(struct repository *repo,
 			       struct object_id **oids,
 			       int oid_nr, int to_free)
@@ -292,3 +305,185 @@ void promisor_remote_get_direct(struct repository *repo,
 	if (to_free)
 		free(remaining_oids);
 }
+
+static int allow_unsanitized(char ch)
+{
+	if (ch == ',' || ch == ';' || ch == '%')
+		return 0;
+	return ch > 32 && ch < 127;
+}
+
+static void promisor_info_vecs(struct repository *repo,
+			       struct strvec *names,
+			       struct strvec *urls)
+{
+	struct promisor_remote *r;
+
+	promisor_remote_init(repo);
+
+	for (r = repo->promisor_remote_config->promisors; r; r = r->next) {
+		char *url = NULL;
+		char *url_key = xstrfmt("remote.%s.url", r->name);
+
+		strvec_push(names, r->name);
+		strvec_push(urls, git_config_get_string(url_key, &url) ? NULL : url);
+
+		free(url);
+		free(url_key);
+	}
+}
+
+char *promisor_remote_info(struct repository *repo)
+{
+	struct strbuf sb = STRBUF_INIT;
+	int advertise_promisors = 0;
+	struct strvec names = STRVEC_INIT;
+	struct strvec urls = STRVEC_INIT;
+
+	git_config_get_bool("promisor.advertise", &advertise_promisors);
+
+	if (!advertise_promisors)
+		return NULL;
+
+	promisor_info_vecs(repo, &names, &urls);
+
+	if (!names.nr)
+		return NULL;
+
+	for (size_t i = 0; i < names.nr; i++) {
+		if (i)
+			strbuf_addch(&sb, ';');
+		strbuf_addstr(&sb, "name=");
+		strbuf_addstr_urlencode(&sb, names.v[i], allow_unsanitized);
+		if (urls.v[i]) {
+			strbuf_addstr(&sb, ",url=");
+			strbuf_addstr_urlencode(&sb, urls.v[i], allow_unsanitized);
+		}
+	}
+
+	strbuf_sanitize(&sb);
+
+	strvec_clear(&names);
+	strvec_clear(&urls);
+
+	return strbuf_detach(&sb, NULL);
+}
+
+enum accept_promisor {
+	ACCEPT_NONE = 0,
+	ACCEPT_ALL
+};
+
+static int should_accept_remote(enum accept_promisor accept,
+				const char *remote_name UNUSED,
+				const char *remote_url UNUSED)
+{
+	if (accept == ACCEPT_ALL)
+		return 1;
+
+	BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+}
+
+static void filter_promisor_remote(struct strvec *accepted, const char *info)
+{
+	struct strbuf **remotes;
+	const char *accept_str;
+	enum accept_promisor accept = ACCEPT_NONE;
+
+	if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
+		if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
+			accept = ACCEPT_NONE;
+		else if (!strcasecmp("All", accept_str))
+			accept = ACCEPT_ALL;
+		else
+			warning(_("unknown '%s' value for '%s' config option"),
+				accept_str, "promisor.acceptfromserver");
+	}
+
+	if (accept == ACCEPT_NONE)
+		return;
+
+	/* Parse remote info received */
+
+	remotes = strbuf_split_str(info, ';', 0);
+
+	for (size_t i = 0; remotes[i]; i++) {
+		struct strbuf **elems;
+		const char *remote_name = NULL;
+		const char *remote_url = NULL;
+		char *decoded_name = NULL;
+		char *decoded_url = NULL;
+
+		strbuf_trim_trailing_ch(remotes[i], ';');
+		elems = strbuf_split_str(remotes[i]->buf, ',', 0);
+
+		for (size_t j = 0; elems[j]; j++) {
+			int res;
+			strbuf_trim_trailing_ch(elems[j], ',');
+			res = skip_prefix(elems[j]->buf, "name=", &remote_name) ||
+				skip_prefix(elems[j]->buf, "url=", &remote_url);
+			if (!res)
+				warning(_("unknown element '%s' from remote info"),
+					elems[j]->buf);
+		}
+
+		if (remote_name)
+			decoded_name = url_percent_decode(remote_name);
+		if (remote_url)
+			decoded_url = url_percent_decode(remote_url);
+
+		if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url))
+			strvec_push(accepted, decoded_name);
+
+		strbuf_list_free(elems);
+		free(decoded_name);
+		free(decoded_url);
+	}
+
+	strbuf_list_free(remotes);
+}
+
+char *promisor_remote_reply(const char *info)
+{
+	struct strvec accepted = STRVEC_INIT;
+	struct strbuf reply = STRBUF_INIT;
+
+	filter_promisor_remote(&accepted, info);
+
+	if (!accepted.nr)
+		return NULL;
+
+	for (size_t i = 0; i < accepted.nr; i++) {
+		if (i)
+			strbuf_addch(&reply, ';');
+		strbuf_addstr_urlencode(&reply, accepted.v[i], allow_unsanitized);
+	}
+
+	strvec_clear(&accepted);
+
+	return strbuf_detach(&reply, NULL);
+}
+
+void mark_promisor_remotes_as_accepted(struct repository *r, const char *remotes)
+{
+	struct strbuf **accepted_remotes = strbuf_split_str(remotes, ';', 0);
+
+	for (size_t i = 0; accepted_remotes[i]; i++) {
+		struct promisor_remote *p;
+		char *decoded_remote;
+
+		strbuf_trim_trailing_ch(accepted_remotes[i], ';');
+		decoded_remote = url_percent_decode(accepted_remotes[i]->buf);
+
+		p = repo_promisor_remote_find(r, decoded_remote);
+		if (p)
+			p->accepted = 1;
+		else
+			warning(_("accepted promisor remote '%s' not found"),
+				decoded_remote);
+
+		free(decoded_remote);
+	}
+
+	strbuf_list_free(accepted_remotes);
+}
diff --git a/promisor-remote.h b/promisor-remote.h
index 88cb599c39..814ca248c7 100644
--- a/promisor-remote.h
+++ b/promisor-remote.h
@@ -9,11 +9,13 @@ struct object_id;
  * Promisor remote linked list
  *
  * Information in its fields come from remote.XXX config entries or
- * from extensions.partialclone.
+ * from extensions.partialclone, except for 'accepted' which comes
+ * from protocol v2 capabilities exchange.
  */
 struct promisor_remote {
 	struct promisor_remote *next;
 	char *partial_clone_filter;
+	unsigned int accepted : 1;
 	const char name[FLEX_ARRAY];
 };
 
@@ -32,4 +34,36 @@ void promisor_remote_get_direct(struct repository *repo,
 				const struct object_id *oids,
 				int oid_nr);
 
+/*
+ * Prepare a "promisor-remote" advertisement by a server.
+ * Check the value of "promisor.advertise" and maybe the configured
+ * promisor remotes, if any, to prepare information to send in an
+ * advertisement.
+ * Return value is NULL if no promisor remote advertisement should be
+ * made. Otherwise it contains the names and urls of the advertised
+ * promisor remotes separated by ';'
+ */
+char *promisor_remote_info(struct repository *repo);
+
+/*
+ * Prepare a reply to a "promisor-remote" advertisement from a server.
+ * Check the value of "promisor.acceptfromserver" and maybe the
+ * configured promisor remotes, if any, to prepare the reply.
+ * Return value is NULL if no promisor remote from the server
+ * is accepted. Otherwise it contains the names of the accepted promisor
+ * remotes separated by ';'.
+ */
+char *promisor_remote_reply(const char *info);
+
+/*
+ * Set the 'accepted' flag for some promisor remotes. Useful when some
+ * promisor remotes have been accepted by the client.
+ */
+void mark_promisor_remotes_as_accepted(struct repository *repo, const char *remotes);
+
+/*
+ * Has any promisor remote been accepted by the client?
+ */
+int repo_has_accepted_promisor_remote(struct repository *r);
+
 #endif /* PROMISOR_REMOTE_H */
diff --git a/serve.c b/serve.c
index d674764a25..5a40a7abb7 100644
--- a/serve.c
+++ b/serve.c
@@ -12,6 +12,7 @@
 #include "upload-pack.h"
 #include "bundle-uri.h"
 #include "trace2.h"
+#include "promisor-remote.h"
 
 static int advertise_sid = -1;
 static int advertise_object_info = -1;
@@ -31,6 +32,26 @@ static int agent_advertise(struct repository *r UNUSED,
 	return 1;
 }
 
+static int promisor_remote_advertise(struct repository *r,
+				     struct strbuf *value)
+{
+	if (value) {
+		char *info = promisor_remote_info(r);
+		if (!info)
+			return 0;
+		strbuf_addstr(value, info);
+		free(info);
+	}
+	return 1;
+}
+
+static void promisor_remote_receive(struct repository *r,
+				    const char *remotes)
+{
+	mark_promisor_remotes_as_accepted(r, remotes);
+}
+
+
 static int object_format_advertise(struct repository *r,
 				   struct strbuf *value)
 {
@@ -157,6 +178,11 @@ static struct protocol_capability capabilities[] = {
 		.advertise = bundle_uri_advertise,
 		.command = bundle_uri_command,
 	},
+	{
+		.name = "promisor-remote",
+		.advertise = promisor_remote_advertise,
+		.receive = promisor_remote_receive,
+	},
 };
 
 void protocol_v2_advertise_capabilities(void)
diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
new file mode 100755
index 0000000000..000cb4c0f6
--- /dev/null
+++ b/t/t5710-promisor-remote-capability.sh
@@ -0,0 +1,241 @@
+#!/bin/sh
+
+test_description='handling of promisor remote advertisement'
+
+. ./test-lib.sh
+
+# Setup the repository with three commits, this way HEAD is always
+# available and we can hide commit 1 or 2.
+test_expect_success 'setup: create "template" repository' '
+	git init template &&
+	test_commit -C template 1 &&
+	test_commit -C template 2 &&
+	test_commit -C template 3 &&
+	test-tool genrandom foo 10240 >template/foo &&
+	git -C template add foo &&
+	git -C template commit -m foo
+'
+
+# A bare repo will act as a server repo with unpacked objects.
+test_expect_success 'setup: create bare "server" repository' '
+	git clone --bare --no-local template server &&
+	mv server/objects/pack/pack-* . &&
+	packfile=$(ls pack-*.pack) &&
+	git -C server unpack-objects --strict <"$packfile"
+'
+
+check_missing_objects () {
+	git -C "$1" rev-list --objects --all --missing=print > all.txt &&
+	perl -ne 'print if s/^[?]//' all.txt >missing.txt &&
+	test_line_count = "$2" missing.txt &&
+	if test "$2" -lt 2
+	then
+		test "$3" = "$(cat missing.txt)"
+	else
+		test -f "$3" &&
+		sort <"$3" >expected_sorted &&
+		sort <missing.txt >actual_sorted &&
+		test_cmp expected_sorted actual_sorted
+	fi
+}
+
+initialize_server () {
+	count="$1"
+	missing_oids="$2"
+
+	# Repack everything first
+	git -C server -c repack.writebitmaps=false repack -a -d &&
+
+	# Remove promisor files in case they exist, useful when reinitializing
+	rm -rf server/objects/pack/*.promisor &&
+
+	# Repack without the largest object and create a promisor pack on server
+	git -C server -c repack.writebitmaps=false repack -a -d \
+	    --filter=blob:limit=5k --filter-to="$(pwd)/pack" &&
+	promisor_file=$(ls server/objects/pack/*.pack | sed "s/\.pack/.promisor/") &&
+	>"$promisor_file" &&
+
+	# Check objects missing on the server
+	check_missing_objects server "$count" "$missing_oids"
+}
+
+copy_to_server2 () {
+	oid_path="$(test_oid_to_path $1)" &&
+	path="server/objects/$oid_path" &&
+	path2="server2/objects/$oid_path" &&
+	mkdir -p $(dirname "$path2") &&
+	cp "$path" "$path2"
+}
+
+test_expect_success "setup for testing promisor remote advertisement" '
+	# Create another bare repo called "server2"
+	git init --bare server2 &&
+
+	# Copy the largest object from server to server2
+	obj="HEAD:foo" &&
+	oid="$(git -C server rev-parse $obj)" &&
+	copy_to_server2 "$oid" &&
+
+	initialize_server 1 "$oid" &&
+
+	# Configure server2 as promisor remote for server
+	git -C server remote add server2 "file://$(pwd)/server2" &&
+	git -C server config remote.server2.promisor true &&
+
+	git -C server2 config uploadpack.allowFilter true &&
+	git -C server2 config uploadpack.allowAnySHA1InWant true &&
+	git -C server config uploadpack.allowFilter true &&
+	git -C server config uploadpack.allowAnySHA1InWant true
+'
+
+test_expect_success "clone with promisor.advertise set to 'true'" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=All \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.advertise set to 'false'" '
+	git -C server config promisor.advertise false &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=All \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is not missing on the server
+	check_missing_objects server 0 "" &&
+
+	# Reinitialize server so that the largest object is missing again
+	initialize_server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.acceptfromserver set to 'None'" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=None \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is not missing on the server
+	check_missing_objects server 0 "" &&
+
+	# Reinitialize server so that the largest object is missing again
+	initialize_server 1 "$oid"
+'
+
+test_expect_success "init + fetch with promisor.advertise set to 'true'" '
+	git -C server config promisor.advertise true &&
+
+	test_when_finished "rm -rf client" &&
+	mkdir client &&
+	git -C client init &&
+	git -C client config remote.server2.promisor true &&
+	git -C client config remote.server2.fetch "+refs/heads/*:refs/remotes/server2/*" &&
+	git -C client config remote.server2.url "file://$(pwd)/server2" &&
+	git -C client config remote.server.url "file://$(pwd)/server" &&
+	git -C client config remote.server.fetch "+refs/heads/*:refs/remotes/server/*" &&
+	git -C client config promisor.acceptfromserver All &&
+	GIT_NO_LAZY_FETCH=0 git -C client fetch --filter="blob:limit=5k" server &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.advertise set to 'true' but don't delete the client" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=All \
+		--no-local --filter="blob:limit=5k" server client &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "setup for subsequent fetches" '
+	# Generate new commit with large blob
+	test-tool genrandom bar 10240 >template/bar &&
+	git -C template add bar &&
+	git -C template commit -m bar &&
+
+	# Fetch new commit with large blob
+	git -C server fetch origin &&
+	git -C server update-ref HEAD FETCH_HEAD &&
+	git -C server rev-parse HEAD >expected_head &&
+
+	# Repack everything twice and remove .promisor files before
+	# each repack. This makes sure everything gets repacked
+	# into a single packfile. The second repack is necessary
+	# because the first one fetches from server2 and creates a new
+	# packfile and its associated .promisor file.
+
+	rm -f server/objects/pack/*.promisor &&
+	git -C server -c repack.writebitmaps=false repack -a -d &&
+	rm -f server/objects/pack/*.promisor &&
+	git -C server -c repack.writebitmaps=false repack -a -d &&
+
+	# Unpack everything
+	rm pack-* &&
+	mv server/objects/pack/pack-* . &&
+	packfile=$(ls pack-*.pack) &&
+	git -C server unpack-objects --strict <"$packfile" &&
+
+	# Copy new large object to server2
+	obj_bar="HEAD:bar" &&
+	oid_bar="$(git -C server rev-parse $obj_bar)" &&
+	copy_to_server2 "$oid_bar" &&
+
+	# Reinitialize server so that the 2 largest objects are missing
+	printf "%s\n" "$oid" "$oid_bar" >expected_missing.txt &&
+	initialize_server 2 expected_missing.txt &&
+
+	# Create one more client
+	cp -r client client2
+'
+
+test_expect_success "subsequent fetch from a client when promisor.advertise is true" '
+	git -C server config promisor.advertise true &&
+
+	GIT_NO_LAZY_FETCH=0 git -C client pull origin &&
+
+	git -C client rev-parse HEAD >actual &&
+	test_cmp expected_head actual &&
+
+	cat client/bar >/dev/null &&
+
+	check_missing_objects server 2 expected_missing.txt
+'
+
+test_expect_success "subsequent fetch from a client when promisor.advertise is false" '
+	git -C server config promisor.advertise false &&
+
+	GIT_NO_LAZY_FETCH=0 git -C client2 pull origin &&
+
+	git -C client2 rev-parse HEAD >actual &&
+	test_cmp expected_head actual &&
+
+	cat client2/bar >/dev/null &&
+
+	check_missing_objects server 1 "$oid"
+'
+
+test_done
diff --git a/upload-pack.c b/upload-pack.c
index 43006c0614..c6550a8d51 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -31,6 +31,7 @@
 #include "write-or-die.h"
 #include "json-writer.h"
 #include "strmap.h"
+#include "promisor-remote.h"
 
 /* Remember to update object flag allocation in object.h */
 #define THEY_HAVE	(1u << 11)
@@ -318,6 +319,8 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 		strvec_push(&pack_objects.args, "--delta-base-offset");
 	if (pack_data->use_include_tag)
 		strvec_push(&pack_objects.args, "--include-tag");
+	if (repo_has_accepted_promisor_remote(the_repository))
+		strvec_push(&pack_objects.args, "--missing=allow-promisor");
 	if (pack_data->filter_options.choice) {
 		const char *spec =
 			expand_list_objects_filter_spec(&pack_data->filter_options);
-- 
2.47.1.402.gc25c94707f



* [PATCH v3 4/5] promisor-remote: check advertised name or URL
  2024-12-06 12:42   ` [PATCH v3 0/5] " Christian Couder
                       ` (2 preceding siblings ...)
  2024-12-06 12:42     ` [PATCH v3 3/5] Add 'promisor-remote' capability to protocol v2 Christian Couder
@ 2024-12-06 12:42     ` Christian Couder
  2024-12-06 12:42     ` [PATCH v3 5/5] doc: add technical design doc for large object promisors Christian Couder
                       ` (2 subsequent siblings)
  6 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2024-12-06 12:42 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Taylor Blau,
	Eric Sunshine, Christian Couder, Christian Couder

A previous commit introduced a "promisor.acceptFromServer" configuration
variable with only "None" or "All" as valid values.

Let's introduce "KnownName" and "KnownUrl" as valid values for this
configuration option to give more choice to a client about which
promisor remotes it might accept among those that the server advertised.

In case of "KnownName", the client will accept promisor remotes which
are already configured on the client and have the same name as those
advertised by the server. This could be useful in a corporate setup
where servers and clients are trusted to not switch names and URLs, but
where some kind of control is still useful.

In case of "KnownUrl", the client will accept promisor remotes which
have both the same name and the same URL configured on the client as the
name and URL advertised by the server. This is the most secure option,
so it should be used if possible.
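
For illustration only, the acceptance modes described above can be
sketched as a small standalone shell function. This is not the code
from this patch: `accept_remote`, `lower` and the argument order are
hypothetical, and the sketch compares against a single known remote
whereas the patch walks the list of promisor remotes configured on
the client.

```shell
#!/bin/sh
# Sketch only: models the acceptance modes from this commit message.
# accept_remote and lower are hypothetical helpers, not part of Git.
#
# usage: accept_remote <mode> <advertised-name> <advertised-url> \
#                      <known-name> <known-url>
# Returns 0 to accept the advertised remote, 1 to reject it.
# Comparisons ignore case, mirroring the strcasecmp() calls in the patch.

lower () {
	printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

accept_remote () {
	mode=$(lower "$1")
	case "$mode" in
	all)
		# Accept whatever the server advertises
		return 0 ;;
	knownname|knownurl)
		# The advertised name must match a remote we already have
		test "$(lower "$2")" = "$(lower "$4")" || return 1
		test "$mode" = knownname && return 0
		# KnownUrl additionally requires the same URL
		test "$(lower "$3")" = "$(lower "$5")" ;;
	*)
		# "None" or empty: accept nothing. Rejecting any other
		# value here is an assumption of this sketch.
		return 1 ;;
	esac
}

if accept_remote KnownUrl server2 "file:///srv/server2" \
		 server2 "file:///srv/server2"
then
	echo "accepted"
fi
```

With "KnownName" the same call would also succeed if the two URLs
differed, which is why "KnownUrl" is the stricter choice.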

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/promisor.txt     | 22 ++++++---
 promisor-remote.c                     | 60 ++++++++++++++++++++---
 t/t5710-promisor-remote-capability.sh | 68 +++++++++++++++++++++++++++
 3 files changed, 138 insertions(+), 12 deletions(-)

diff --git a/Documentation/config/promisor.txt b/Documentation/config/promisor.txt
index 9cbfe3e59e..d1364bc018 100644
--- a/Documentation/config/promisor.txt
+++ b/Documentation/config/promisor.txt
@@ -12,9 +12,19 @@ promisor.advertise::
 promisor.acceptFromServer::
 	If set to "all", a client will accept all the promisor remotes
 	a server might advertise using the "promisor-remote"
-	capability. Default is "none", which means no promisor remote
-	advertised by a server will be accepted. By accepting a
-	promisor remote, the client agrees that the server might omit
-	objects that are lazily fetchable from this promisor remote
-	from its responses to "fetch" and "clone" requests from the
-	client. See linkgit:gitprotocol-v2[5].
+	capability. If set to "knownName" the client will accept
+	promisor remotes which are already configured on the client
+	and have the same name as those advertised by the server. This
+	is not very secure, but could be used in a corporate setup
+	where servers and clients are trusted to not switch name and
+	URLs. If set to "knownUrl", the client will accept promisor
+	remotes which have both the same name and the same URL
+	configured on the client as the name and URL advertised by the
+	server. This is more secure than "all" or "knownName", so it
+	should be used if possible instead of those options. Default
+	is "none", which means no promisor remote advertised by a
+	server will be accepted. By accepting a promisor remote, the
+	client agrees that the server might omit objects that are
+	lazily fetchable from this promisor remote from its responses
+	to "fetch" and "clone" requests from the client. See
+	linkgit:gitprotocol-v2[5].
diff --git a/promisor-remote.c b/promisor-remote.c
index ea418c4094..b72d539c19 100644
--- a/promisor-remote.c
+++ b/promisor-remote.c
@@ -369,30 +369,73 @@ char *promisor_remote_info(struct repository *repo)
 	return strbuf_detach(&sb, NULL);
 }
 
+/*
+ * Find first index of 'vec' where there is 'val'. 'val' is compared
+ * case insensitively to the strings in 'vec'. If not found 'vec->nr' is
+ * returned.
+ */
+static size_t strvec_find_index(struct strvec *vec, const char *val)
+{
+	for (size_t i = 0; i < vec->nr; i++)
+		if (!strcasecmp(vec->v[i], val))
+			return i;
+	return vec->nr;
+}
+
 enum accept_promisor {
 	ACCEPT_NONE = 0,
+	ACCEPT_KNOWN_URL,
+	ACCEPT_KNOWN_NAME,
 	ACCEPT_ALL
 };
 
 static int should_accept_remote(enum accept_promisor accept,
-				const char *remote_name UNUSED,
-				const char *remote_url UNUSED)
+				const char *remote_name, const char *remote_url,
+				struct strvec *names, struct strvec *urls)
 {
+	size_t i;
+
 	if (accept == ACCEPT_ALL)
 		return 1;
 
-	BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+	i = strvec_find_index(names, remote_name);
+
+	if (i >= names->nr)
+		/* We don't know about that remote */
+		return 0;
+
+	if (accept == ACCEPT_KNOWN_NAME)
+		return 1;
+
+	if (accept != ACCEPT_KNOWN_URL)
+		BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+
+	if (!strcasecmp(urls->v[i], remote_url))
+		return 1;
+
+	warning(_("known remote named '%s' but with url '%s' instead of '%s'"),
+		remote_name, urls->v[i], remote_url);
+
+	return 0;
 }
 
-static void filter_promisor_remote(struct strvec *accepted, const char *info)
+static void filter_promisor_remote(struct repository *repo,
+				   struct strvec *accepted,
+				   const char *info)
 {
 	struct strbuf **remotes;
 	const char *accept_str;
 	enum accept_promisor accept = ACCEPT_NONE;
+	struct strvec names = STRVEC_INIT;
+	struct strvec urls = STRVEC_INIT;
 
 	if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
 		if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
 			accept = ACCEPT_NONE;
+		else if (!strcasecmp("KnownUrl", accept_str))
+			accept = ACCEPT_KNOWN_URL;
+		else if (!strcasecmp("KnownName", accept_str))
+			accept = ACCEPT_KNOWN_NAME;
 		else if (!strcasecmp("All", accept_str))
 			accept = ACCEPT_ALL;
 		else
@@ -403,6 +446,9 @@ static void filter_promisor_remote(struct strvec *accepted, const char *info)
 	if (accept == ACCEPT_NONE)
 		return;
 
+	if (accept != ACCEPT_ALL)
+		promisor_info_vecs(repo, &names, &urls);
+
 	/* Parse remote info received */
 
 	remotes = strbuf_split_str(info, ';', 0);
@@ -432,7 +478,7 @@ static void filter_promisor_remote(struct strvec *accepted, const char *info)
 		if (remote_url)
 			decoded_url = url_percent_decode(remote_url);
 
-		if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url))
+		if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url, &names, &urls))
 			strvec_push(accepted, decoded_name);
 
 		strbuf_list_free(elems);
@@ -440,6 +486,8 @@ static void filter_promisor_remote(struct strvec *accepted, const char *info)
 		free(decoded_url);
 	}
 
+	strvec_clear(&names);
+	strvec_clear(&urls);
 	strbuf_list_free(remotes);
 }
 
@@ -448,7 +496,7 @@ char *promisor_remote_reply(const char *info)
 	struct strvec accepted = STRVEC_INIT;
 	struct strbuf reply = STRBUF_INIT;
 
-	filter_promisor_remote(&accepted, info);
+	filter_promisor_remote(the_repository, &accepted, info);
 
 	if (!accepted.nr)
 		return NULL;
diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
index 000cb4c0f6..483cc8e16d 100755
--- a/t/t5710-promisor-remote-capability.sh
+++ b/t/t5710-promisor-remote-capability.sh
@@ -157,6 +157,74 @@ test_expect_success "init + fetch with promisor.advertise set to 'true'" '
 	check_missing_objects server 1 "$oid"
 '
 
+test_expect_success "clone with promisor.acceptfromserver set to 'KnownName'" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=KnownName \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with 'KnownName' and different remote names" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.serverTwo.promisor=true \
+		-c remote.serverTwo.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.serverTwo.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=KnownName \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is not missing on the server
+	check_missing_objects server 0 "" &&
+
+	# Reinitialize server so that the largest object is missing again
+	initialize_server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.acceptfromserver set to 'KnownUrl'" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=KnownUrl \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with 'KnownUrl' and different remote urls" '
+	ln -s server2 serverTwo &&
+
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/serverTwo" \
+		-c promisor.acceptfromserver=KnownUrl \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is not missing on the server
+	check_missing_objects server 0 "" &&
+
+	# Reinitialize server so that the largest object is missing again
+	initialize_server 1 "$oid"
+'
+
 test_expect_success "clone with promisor.advertise set to 'true' but don't delete the client" '
 	git -C server config promisor.advertise true &&
 
-- 
2.47.1.402.gc25c94707f



* [PATCH v3 5/5] doc: add technical design doc for large object promisors
  2024-12-06 12:42   ` [PATCH v3 0/5] " Christian Couder
                       ` (3 preceding siblings ...)
  2024-12-06 12:42     ` [PATCH v3 4/5] promisor-remote: check advertised name or URL Christian Couder
@ 2024-12-06 12:42     ` Christian Couder
  2024-12-10  1:28       ` Junio C Hamano
  2024-12-10 11:43       ` Junio C Hamano
  2024-12-09  8:04     ` [PATCH v3 0/5] Introduce a "promisor-remote" capability Junio C Hamano
  2025-01-27 15:16     ` [PATCH v4 0/6] " Christian Couder
  6 siblings, 2 replies; 110+ messages in thread
From: Christian Couder @ 2024-12-06 12:42 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Taylor Blau,
	Eric Sunshine, Christian Couder, Christian Couder

Let's add a design doc about how we could improve handling large blobs
using "Large Object Promisors" (LOPs). It's a set of features with the
goal of using special dedicated promisor remotes to store large blobs,
and having them accessed directly by main remotes and clients.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 .../technical/large-object-promisors.txt      | 530 ++++++++++++++++++
 1 file changed, 530 insertions(+)
 create mode 100644 Documentation/technical/large-object-promisors.txt

diff --git a/Documentation/technical/large-object-promisors.txt b/Documentation/technical/large-object-promisors.txt
new file mode 100644
index 0000000000..267c65b0d5
--- /dev/null
+++ b/Documentation/technical/large-object-promisors.txt
@@ -0,0 +1,530 @@
+Large Object Promisors
+======================
+
+Ever since Git was created, users have been complaining about issues
+with storing large files in Git. Some solutions have been created to
+help, but they haven't helped much with some issues.
+
+Git currently supports multiple promisor remotes, which could help
+with some of these remaining issues, but it's very hard to use them to
+help, because a number of important features are missing.
+
+The goal of the effort described in this document is to add these
+important features.
+
+We will call a "Large Object Promisor", or "LOP" in short, a promisor
+remote which is used to store only large blobs and which is separate
+from the main remote that should store the other Git objects and the
+rest of the repos.
+
+By extension, we will also call "Large Object Promisor", or LOP, the
+effort described in this document to add a set of features to make it
+easier to handle large blobs/files in Git by using LOPs.
+
+This effort would especially improve things on the server side, and
+especially for large blobs that are already compressed in a binary
+format.
+
+This effort could help provide an alternative to Git LFS
+(https://git-lfs.com/) and similar tools like git-annex
+(https://git-annex.branchable.com/) for handling large files, even
+though a complete alternative would very likely require other efforts
+especially on the client side, where it would likely help to implement
+a new object representation for large blobs as discussed in:
+
+https://lore.kernel.org/git/xmqqbkdometi.fsf@gitster.g/
+
+0) Non goals
+------------
+
+- We will not discuss those client side improvements here, as they
+  would require changes in different parts of Git than this effort.
++
+So we don't pretend to fully replace Git LFS with only this effort,
+but we nevertheless believe that it can significantly improve the
+current situation on the server side, and that other separate
+efforts could also improve the situation on the client side.
+
+- In the same way, we are not going to discuss all the possible ways
+  to implement a LOP or their underlying object storage.
++
+In particular we are not going to discuss pluggable ODBs or other
+object database backends that could chunk large blobs, dedup the
+chunks and store them efficiently. Sure, that would be a nice
+improvement to store large blobs on the server side, but we believe
+it can just be a separate effort as it's also not technically very
+related to this effort.
+
+In other words, the goal of this document is not to talk about all the
+possible ways to optimize how Git could handle large blobs, but to
+describe how a LOP based solution could work well and alleviate a
+number of current issues in the context of Git clients and servers
+sharing Git objects.
+
+I) Issues with the current situation
+------------------------------------
+
+- Some statistics made on GitLab repos have shown that more than 75%
+  of the disk space is used by blobs that are larger than 1MB and
+  often in a binary format.
+
+- So even if users could use Git LFS or similar tools to store a lot
+  of large blobs out of their repos, it's a fact that in practice they
+  don't do it as much as they probably should.
+
+- On the server side ideally, the server should be able to decide for
+  itself how it stores things. It should not depend on users deciding
+  to use tools like Git LFS on some blobs or not.
+
+- It's much more expensive to store large blobs that don't delta
+  compress well on regular fast seeking drives (like SSDs) than on
+  object storage (like Amazon S3 or GCP Buckets). Using fast drives
+  for regular Git repos makes sense though, as serving regular Git
+  content (blobs containing text or code) needs drives where seeking
+  is fast, but the content is relatively small. On the other hand,
+  object storage for Git LFS blobs makes sense as seeking speed is not
+  as important when dealing with large files, while costs are more
+  important. So the fact that users don't use Git LFS or similar tools
+  for a significant number of large blobs likely has some bad
+  consequences on the cost of repo storage for most Git hosting
+  platforms.
+
+- Having large blobs handled in the same way as other blobs and Git
+  objects in Git repos instead of on object storage also has a cost in
+  increased memory and CPU usage, and therefore decreased performance,
+  when creating packfiles. (This is because Git tries to use delta
+  compression or zlib compression which is unlikely to work well on
+  already compressed binary content.) So it's not just a storage cost
+  increase.
+
+- When a large blob has been committed into a repo, it might not be
+  possible to remove this blob from the repo without rewriting
+  history, even if the user then decides to use Git LFS or a similar
+  tool to handle it.
+
+- In fact Git LFS and similar tools are not very flexible in letting
+  users change their minds about the blobs they should handle or not.
+
+- Even when users are using Git LFS or similar tools, they are often
+  complaining that these tools require significant effort to set up,
+  learn and use correctly.
+
+II) Main features of the "Large Object Promisors" solution
+----------------------------------------------------------
+
+The main features below should give a rough overview of how the
+solution may work. Details about needed elements can be found in
+following sections.
+
+Even if each feature below is very useful for the full solution, it is
+very likely to be also useful on its own in some cases where the full
+solution is not required. However, we'll focus primarily on the big
+picture here.
+
+Also each feature doesn't need to be implemented entirely in Git
+itself. Some could be scripts, hooks or helpers that are not part of
+the Git repo. It could be helpful if those could be shared and
+improved on collaboratively though.
+
+1) Large blobs are stored on LOPs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Large blobs should be stored on special promisor remotes that we will
+call "Large Object Promisors" or LOPs. These LOPs should be additional
+remotes dedicated to contain large blobs especially those in binary
+format. They should be used along with main remotes that contain the
+other objects.
+
+Note 1
+++++++
+
+To clarify, a LOP is a normal promisor remote, except that:
+
+- it should store only large blobs,
+
+- it should be separate from the main remote, so that the main remote
+  can focus on serving other objects and the rest of the repos (see
+  feature 4) below) and can use the LOP as a promisor remote for
+  itself.
+
+Note 2
+++++++
+
+Git already makes it possible for a main remote to also be a promisor
+remote storing both regular objects and large blobs for a client that
+clones from it with a filter on blob size. But here we explicitly want
+to avoid that.
+
+Rationale
++++++++++
+
+LOP remotes should be good at handling large blobs while main remotes
+should be good at handling other objects.
+
+Implementation
+++++++++++++++
+
+Git already has support for multiple promisor remotes, see
+link:partial-clone.html#using-many-promisor-remotes[the partial clone documentation].
+
+Also, Git already has support for partial clone using a filter on the
+size of the blobs (with `git clone --filter=blob:limit=<size>`).  Most
+of the other main features below are based on these existing features
+and are about making them easy and efficient to use for the purpose of
+better handling large blobs.
+
+2) LOPs can use object storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A LOP could be using object storage, like an Amazon S3 or GCP Bucket
+to actually store the large blobs, and could be accessed through a Git
+remote helper (see linkgit:gitremote-helpers[7]) which makes the
+underlying object storage appear like a remote to Git.
+
+Note
+++++
+
+A LOP could be a promisor remote accessed using a remote helper by
+both some clients and the main remote.
+
+Rationale
++++++++++
+
+This looks like the simplest way to create LOPs that can cheaply
+handle many large blobs.
+
+Implementation
+++++++++++++++
+
+Remote helpers are quite easy to write as shell scripts, but it might
+be more efficient and maintainable to write them using other languages
+like Go.
+
+Other ways to implement LOPs are certainly possible, but the goal of
+this document is not to discuss how to best implement a LOP or its
+underlying object storage (see the "0) Non goals" section above).
+
+3) LOP object storage can be Git LFS storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The underlying object storage that a LOP uses could also serve as
+storage for large files handled by Git LFS.
+
+Rationale
++++++++++
+
+This would simplify the server side if it wants to both use a LOP and
+act as a Git LFS server.
+
+4) A main remote can offload to a LOP with a configurable threshold
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+On the server side, a main remote should have a way to offload to a
+LOP all its blobs with a size over a configurable threshold.
+
+Rationale
++++++++++
+
+This makes it easy to set things up and to clean things up. For
+example, an admin could use this to manually convert a repo not using
+LOPs to a repo using a LOP. On a repo already using a LOP but where
+some users would sometimes push large blobs, a cron job could use this
+to regularly make sure the large blobs are moved to the LOP.
+
+Implementation
+++++++++++++++
+
+Using something based on `git repack --filter=...` to separate the
+blobs we want to offload from the other Git objects could be a good
+idea. The missing part is to connect to the LOP, check if the blobs we
+want to offload are already there and if not send them.
+
+5) A main remote should try to remain clean from large blobs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A main remote should try to avoid containing a lot of oversize
+blobs. For that purpose, it should offload as needed to a LOP and it
+should have ways to prevent oversize blobs from being fetched, and
+also perhaps pushed, into it.
+
+Rationale
++++++++++
+
+A main remote containing many oversize blobs would defeat the purpose
+of LOPs.
+
+Implementation
+++++++++++++++
+
+The way to offload to a LOP discussed in 4) above can be used to
+regularly offload oversize blobs. About preventing oversize blobs
+from being fetched into the repo, see 6) below. About preventing
+oversize blob pushes, a pre-receive hook could be used.
+
+Also there are different scenarios in which large blobs could get
+fetched into the main remote, for example:
+
+- A client that doesn't implement the "promisor-remote" protocol
+  (described in 6) below) clones from the main remote.
+
+- The main remote gets a request for information about a large blob
+  and is not able to get that information without fetching the blob
+  from the LOP.
+
+It might not be possible to completely prevent all these scenarios
+from happening. So the goal here should be to implement features that
+make the fetching of large blobs less likely. For example adding a
+`remote-object-info` command in the `git cat-file --batch*` protocol
+might make it possible for a main repo to respond to some requests
+about large blobs without fetching them.
+
+6) A protocol negotiation should happen when a client clones
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a client clones from a main repo, there should be a protocol
+negotiation so that the server can advertise one or more LOPs and so
+that the client and the server can discuss if the client could
+directly use a LOP the server is advertising. If the client and the
+server can agree on that, then the client would be able to get the
+large blobs directly from the LOP and the server would not need to
+fetch those blobs from the LOP to be able to serve the client.
+
+Note
+++++
+
+For fetches instead of clones, see the "What about fetches?" FAQ entry
+below.
+
+Rationale
++++++++++
+
+Security, configurability and efficiency of setting things up.
+
+Implementation
+++++++++++++++
+
+A "promisor-remote" protocol v2 capability looks like a good way to
+implement this. The way the client and server use this capability
+could be controlled by configuration variables.
+
+Information that the server could send to the client through that
+protocol could be things like: LOP name, LOP URL, filter-spec (for
+example `blob:limit=<size>`) or just size limit that should be used as
+a filter when cloning, token to be used with the LOP, etc.
+
+7) A client can offload to a LOP
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a client is using a LOP that is also a LOP of its main remote,
+the client should be able to offload some large blobs it has fetched,
+but might not need anymore, to the LOP.
+
+Rationale
++++++++++
+
+On the client, the easiest way to deal with unneeded large blobs is to
+offload them.
+
+Implementation
+++++++++++++++
+
+This is very similar to what 4) above is about, except on the client
+side instead of the server side. So a good solution to 4) could likely
+be adapted to work on the client side too.
+
+There might be some security issues here, as there is no negotiation,
+but they might be mitigated if the client can reuse a token it got
+when cloning (see 6) above). Also if the large blobs were fetched from
+a LOP, it is likely, and can easily be confirmed, that the LOP still
+has them, so that they can just be removed from the client.
+
+III) Benefits of using LOPs
+---------------------------
+
+Many benefits are related to the issues discussed in "I) Issues with
+the current situation" above:
+
+- No need to rewrite history when deciding which blobs are worth
+  handling separately from other objects, or when moving or removing
+  the threshold.
+
+- If the protocol between client and server is developed and secured
+  enough, then many details might be set up on the server side only and
+  all the clients could then easily get all the configuration
+  information and use it to set themselves up mostly automatically.
+
+- Storage cost benefits on the server side.
+
+- Reduced memory and CPU needs on main remotes on the server side.
+
+- Reduced storage needs on the client side.
+
+IV) FAQ
+-------
+
+What about using multiple LOPs on the server and client side?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+That could perhaps be useful in some cases, but it's more likely for
+now that in most cases a single LOP will be advertised by the server
+and should be used by the client.
+
+A case where it could be useful for a server to advertise multiple
+LOPs is if a LOP is better for some users while a different LOP is
+better for other users. For example some clients might have a better
+connection to a LOP than others.
+
+In those cases it's the responsibility of the server to have some
+documentation to help clients. It could say for example something like
+"Users in this part of the world might want to pick only LOP A as it
+is likely to be better connected to them, while users in other parts
+of the world should pick only LOP B for the same reason."
+
+Trusting the LOPs advertised by the server, or not trusting them?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In some contexts, like in a corporate setup where the server and all
+the clients are parts of an internal network in a company where admins
+have all the rights on every system, it's Ok, and perhaps even a good
+thing, if the clients fully trust the server, as it can help ensure
+that all the clients are on the same page.
+
+There are also contexts in which clients trust a code hosting platform
+serving them some repos, but might not fully trust other users
+managing or contributing to some of these repos. For example, the code
+hosting platform could have hooks in place to check that any object it
+receives doesn't contain malware or otherwise bad content. In this
+case it might be OK for the client to use a main remote and its LOP if
+they are both hosted by the code hosting platform, but not if the LOP
+is hosted elsewhere (where the content is not checked).
+
+In other contexts, a client should just not trust a server.
+
+So there should be different ways to configure how the client should
+behave when a server advertises a LOP to it at clone time.
+
+As the basic elements that a server can advertise about a LOP are a
+LOP name and a LOP URL, the client should base its decision about
+accepting a LOP on these elements.
+
+One simple way to be very strict in the LOP it accepts is for example
+for the client to check that the LOP is already configured on the
+client with the same name and URL as what the server advertises.
+
+In general default and "safe" settings should require that the LOPs
+are configured on the client separately from the "promisor-remote"
+protocol and that the client accepts a LOP only when information about
+it from the protocol matches what has been already configured
+separately.
+
+What about LOP names?
+~~~~~~~~~~~~~~~~~~~~~
+
+In some contexts, for example if the clients sometimes fetch from each
+other, it can be a good idea for all the clients to use the same names
+for all the remotes they use, including LOPs.
+
+In other contexts, each client might want to be able to give the name
+it wants to each remote, including each LOP, it interacts with.
+
+So there should be different ways to configure how the client accepts
+or not the LOP name the server advertises.
+
+If a default or "safe" setting is used, then as such a setting should
+require that the LOP be configured separately, then the name would be
+configured separately and there is no risk that the server could
+dictate a name to a client.
+
+Could the main remote be bogged down by old or paranoid clients?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Yes, it could happen if there are too many clients that are either
+unwilling to trust the main remote or that just don't implement the
+"promisor-remote" protocol because they are too old or not fully
+compatible with the 'git' client.
+
+When serving such a client, the main remote has no other choice than
+to first fetch from its LOP, to then be able to provide to the client
+everything it requested. So the main remote, even if it has cleanup
+mechanisms (see section II.4 above), would be burdened at least
+temporarily with the large blobs it had to fetch from its LOP.
+
+Not behaving like this would be breaking backward compatibility, and
+could be seen as segregating clients. For example, it might be
+possible to implement a special mode that allows the server to just
+reject clients that don't implement the "promisor-remote" protocol or
+aren't willing to trust the main remote. This mode might be useful in
+a special context like a corporate environment. There is no plan to
+implement such a mode though, and this should be discussed separately
+later anyway.
+
+A better way to proceed is probably for the main remote to show a
+message telling clients that don't implement the protocol or are
+unwilling to accept the advertised LOP(s) that they would get faster
+clones and fetches by upgrading client software or properly setting
+them up to accept LOP(s).
+
+Waiting for clients to upgrade, monitoring these upgrades and limiting
+the use of LOPs to repos that are not very frequently accessed might
+be other good ways to make sure that some benefits are still reaped
+from LOPs. Over time, as more and more clients upgrade and benefit
+from LOPs, using them in more and more frequently accessed repos will
+become worth it.
+
+Corporate environments, where it might be easier to make sure that all
+the clients are up-to-date and properly configured, could hopefully
+benefit more and earlier from using LOPs.
+
+What about fetches?
+~~~~~~~~~~~~~~~~~~~
+
+There are different kinds of fetches. A regular fetch happens when
+some refs have been updated on the server and the client wants the ref
+updates and possibly the new objects added with them. A "backfill" or
+"lazy" fetch, on the contrary, happens when the client needs to use
+some objects it already knows about but doesn't have because they are
+on a promisor remote.
+
+Regular fetch
++++++++++++++
+
+In a regular fetch, the client will contact the main remote and a
+protocol negotiation will happen between them. It's a good thing that
+a protocol negotiation happens every time, as the configuration on the
+client or the main remote could have changed since the previous
+protocol negotiation. In this case, the new protocol negotiation
+should ensure that the new fetch will happen in a way that satisfies
+the new configuration of both the client and the server.
+
+In most cases though, the configurations on the client and the main
+remote will not have changed between 2 fetches or between the initial
+clone and a subsequent fetch. This means that the result of a new
+protocol negotiation will be the same as the previous result, so the
+new fetch will happen in the same way as the previous clone or fetch,
+using, or not using, the same LOP(s) as last time.
+
+"Backfill" or "lazy" fetch
+++++++++++++++++++++++++++
+
+When there is a backfill fetch, the client doesn't necessarily contact
+the main remote first. It will try to fetch from its promisor remotes
+in the order they appear in the config file, except that a remote
+configured using the `extensions.partialClone` config variable will be
+tried last. See
+link:partial-clone.html#using-many-promisor-remotes[the partial clone documentation].
+
+This is not new with this effort. In fact this is how multiple remotes
+have already been working for around 5 years.
+
+When using LOPs, having the main remote configured using
+`extensions.partialClone`, so it's tried last, makes sense, as missing
+objects should only be large blobs that are on LOPs.
+
+This means that a protocol negotiation will likely not happen as the
+missing objects will be fetched from the LOPs, and then there will be
+nothing left to fetch from the main remote.
+
+To secure that, it could be a good idea for LOPs to require a token
+from the client when it fetches from them. The client could get the
+token when performing a protocol negotiation with the main remote (see
+section II.6 above).
-- 
2.47.1.402.gc25c94707f



* Re: [PATCH v3 1/5] version: refactor strbuf_sanitize()
  2024-12-06 12:42     ` [PATCH v3 1/5] version: refactor strbuf_sanitize() Christian Couder
@ 2024-12-07  6:21       ` Junio C Hamano
  2025-01-27 15:07         ` Christian Couder
  0 siblings, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2024-12-07  6:21 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> +/*
> + * Trim and replace each character with ascii code below 32 or above
> + * 127 (included) using a dot '.' character. Useful for sending
> + * capabilities.
> + */
> +void strbuf_sanitize(struct strbuf *sb);

I am not getting "Useful for sending capabilities" here, and feel
that it is somewhat an unsubstantiated claim.  If some information
is going to be transferred (which the phrase "sending capabilities"
hints), I'd expect that we try as hard as possible not to lose
information, but redact-non-ASCII is the total opposite of "not
losing information".

> diff --git a/version.c b/version.c
> index 41b718c29e..951e6dca74 100644
> --- a/version.c
> +++ b/version.c
> @@ -24,15 +24,10 @@ const char *git_user_agent_sanitized(void)
>  
>  	if (!agent) {
>  		struct strbuf buf = STRBUF_INIT;
> -		int i;
>  
>  		strbuf_addstr(&buf, git_user_agent());
> -		strbuf_trim(&buf);
> -		for (i = 0; i < buf.len; i++) {
> -			if (buf.buf[i] <= 32 || buf.buf[i] >= 127)
> -				buf.buf[i] = '.';
> -		}
> -		agent = buf.buf;
> +		strbuf_sanitize(&buf);
> +		agent = strbuf_detach(&buf, NULL);
>  	}
>  
>  	return agent;

This is very faithful rewrite of the original.  The original had a
strbuf on stack, and after creating user-agent string in it, a
function scope static variable "agent" is made to point at it and
then the stack the strbuf was on is allowed to go out of scope.
Since the variable "agent" is holding onto the piece of memory, the
leak checker does not complain about anything.  The rewritten
version is leak-free for exactly the same reason, but because it
calls strbuf_detach() before the strbuf goes out of scope to
officially transfer the ownership to the variable "agent", it tells
what is going on to readers a lot more clearly.

Nicely done.

By the way, as we are trimming, I am very very much tempted to
squish a run of non-ASCII bytes into one dot, perhaps like

	void redact_non_printables(struct strbuf *sb)
	{
		size_t dst = 0;
		int skipped = 0;

		strbuf_trim(sb);
		for (size_t src = 0; src < sb->len; src++) {
			int ch = sb->buf[src];
			/* redact control bytes, space, DEL and non-ASCII */
			if (ch <= 32 || 127 <= ch) {
				if (skipped)
					continue;
				ch = '.';
			}
			sb->buf[dst++] = ch;
			skipped = (ch == '.');
		}
		/* shrink the buffer to the squished length */
		strbuf_setlen(sb, dst);
	}

or even without strbuf_trim(), which would turn any leading or
trailing run of whitespaces into '.'.

But that is an improvement that can be easily done on top after the
dust settles and better left as #leftoverbits material.
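
The run-squishing idea above can also be tried outside of Git.  What
follows is only a sketch under a few assumptions: it works on a plain
NUL-terminated buffer instead of a struct strbuf, it does no trimming,
and the name squish_non_printables() is made up for illustration.
Unlike the variant above, a literal '.' in the input is not merged
with a neighbouring redacted run.

```c
#include <stddef.h>

/*
 * Hypothetical stand-alone helper (not part of Git): collapse every
 * run of bytes that are <= 32 or >= 127 into a single '.' in place.
 * Returns the new length of the string.
 */
static size_t squish_non_printables(char *buf)
{
	size_t dst = 0;
	int in_run = 0;

	for (size_t src = 0; buf[src]; src++) {
		unsigned char ch = (unsigned char)buf[src];

		if (ch <= 32 || ch >= 127) {
			/* only the first byte of a run emits a dot */
			if (in_run)
				continue;
			in_run = 1;
			buf[dst++] = '.';
		} else {
			in_run = 0;
			buf[dst++] = (char)ch;
		}
	}
	buf[dst] = '\0';
	return dst;
}
```

Whether a literal '.' should merge with an adjacent redacted run is a
design choice; the sketch keeps them separate so that printable input
is never altered.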


* Re: [PATCH v3 2/5] strbuf: refactor strbuf_trim_trailing_ch()
  2024-12-06 12:42     ` [PATCH v3 2/5] strbuf: refactor strbuf_trim_trailing_ch() Christian Couder
@ 2024-12-07  6:35       ` Junio C Hamano
  2025-01-27 15:07         ` Christian Couder
  2024-12-16 11:47       ` karthik nayak
  1 sibling, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2024-12-07  6:35 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> We often have to split strings at some specified terminator character.
> The strbuf_split*() functions, that we can use for this purpose,
> return substrings that include the terminator character, so we often
> need to remove that character.
>
> When it is a whitespace, newline or directory separator, the
> terminator character can easily be removed using an existing trimming
> function like strbuf_rtrim(), strbuf_trim_trailing_newline() or
> strbuf_trim_trailing_dir_sep(). There is no function to remove that
> character when it's not one of those characters though.

Heh, totally uninteresting (alternative being open coding this one).
If we pass, instead of a single character 'c', an array of characters
to be stripped from the right (like strspn() allows you to skip from
the left), I may have been a bit more receptive, though ;-)

> +void strbuf_trim_trailing_ch(struct strbuf *sb, int c)
> +{
> +	while (sb->len > 0 && sb->buf[sb->len - 1] == c)
> +		sb->len--;
> +	sb->buf[sb->len] = '\0';
> +}

So, trim_trailing will leave "foo" when "foo,,," is fed with c set
to ','.
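
To make the "array of characters" alternative mentioned earlier
concrete, here is a rough sketch.  It is an illustration only: it
operates on a plain C string rather than a struct strbuf, and
trim_trailing_chars() is a hypothetical name, not an existing strbuf
function.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Git's strbuf API): strip from the
 * right end of `buf` every byte that appears in `set`, the mirror
 * image of skipping strspn(buf, set) bytes from the left.
 * Returns the new length.
 */
static size_t trim_trailing_chars(char *buf, const char *set)
{
	size_t len = strlen(buf);

	/* buf[len - 1] is never NUL here, so strchr() cannot match the terminator */
	while (len > 0 && strchr(set, buf[len - 1]))
		len--;
	buf[len] = '\0';
	return len;
}
```

With set being "," this turns "foo,,," into "foo", just like the
single-character helper; with set being ", ;" it would also eat
trailing spaces and semicolons in one pass.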

> diff --git a/trace2/tr2_cfg.c b/trace2/tr2_cfg.c
> index 22a99a0682..9da1f8466c 100644
> --- a/trace2/tr2_cfg.c
> +++ b/trace2/tr2_cfg.c
> @@ -35,10 +35,7 @@ static int tr2_cfg_load_patterns(void)
>  
>  	tr2_cfg_patterns = strbuf_split_buf(envvar, strlen(envvar), ',', -1);
>  	for (s = tr2_cfg_patterns; *s; s++) {
> -		struct strbuf *buf = *s;
> -
> -		if (buf->len && buf->buf[buf->len - 1] == ',')
> -			strbuf_setlen(buf, buf->len - 1);
> +		strbuf_trim_trailing_ch(*s, ',');

And the only thing that prevents this rewrite from being buggy is
the use of misdesigned strbuf_split_buf() function (which by now we
should have deprecated!).  Because it splits at ',', we won't have
more than one ',' trailing, but we still strip that one trailing
comma because the misdesigned strbuf_split_buf() leaves the
separator at the end of each element.

This does not look like a very convincing example to demonstrate why
the new helper function is useful, at least to me.  

If somebody would touch this area of code, I think a lot nicer
clean-up would be to rewrite the thing into a helper function that
is called from here, and the other one in the next hunk in a single
patch, and then clean up the refactored helper function not to use
the strbuf_split_buf().  Looking at the way tr2_cfg_patterns and
tr2_cfg_env_vars are used, they have *NO* valid reason why they have
to be a strbuf.  Once populated, they are only used for a constant
string pointed at by their .buf member.  A string_list constructed
by appending (i.e. not sorted) would be a lot more suitable data
structure.
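
As an illustration of splitting that does not leave the separator at
the end of each element, a minimal sketch follows.  It is not the
string_list API: split_in_place() is a hypothetical helper that
overwrites each separator with NUL and records pointers into the
caller's buffer, which is enough when the pieces are only read as
constant strings afterwards.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Git): split `list` in place at
 * each `sep` (which must not be NUL), storing at most `max` pointers
 * to the pieces in `items`.  The separator is never part of a piece.
 * Pieces beyond `max` are silently dropped.  Returns the number of
 * pieces stored.
 */
static size_t split_in_place(char *list, char sep, char **items, size_t max)
{
	size_t nr = 0;

	while (nr < max) {
		char *end = strchr(list, sep);

		items[nr++] = list;
		if (!end)
			break;
		*end = '\0';	/* terminate this piece where the separator was */
		list = end + 1;
	}
	return nr;
}
```

Because no piece ever carries a trailing ',', the per-element
separator stripping in the loops quoted here would become unnecessary;
trimming whitespace and newlines would still be needed.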

>  		strbuf_trim_trailing_newline(*s);
>  		strbuf_trim(*s);
>  	}
> @@ -74,10 +71,7 @@ static int tr2_load_env_vars(void)
>  
>  	tr2_cfg_env_vars = strbuf_split_buf(varlist, strlen(varlist), ',', -1);
>  	for (s = tr2_cfg_env_vars; *s; s++) {
> -		struct strbuf *buf = *s;
> -
> -		if (buf->len && buf->buf[buf->len - 1] == ',')
> -			strbuf_setlen(buf, buf->len - 1);
> +		strbuf_trim_trailing_ch(*s, ',');
>  		strbuf_trim_trailing_newline(*s);
>  		strbuf_trim(*s);
>  	}


* Re: [PATCH v3 3/5] Add 'promisor-remote' capability to protocol v2
  2024-12-06 12:42     ` [PATCH v3 3/5] Add 'promisor-remote' capability to protocol v2 Christian Couder
@ 2024-12-07  7:59       ` Junio C Hamano
  2025-01-27 15:08         ` Christian Couder
  0 siblings, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2024-12-07  7:59 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> Then C might or might not, want to get the objects from X, and should
> let S know about this.

I only left this instance quoted in this reply, but I found that
there are too many "should"s in the description (both in the proposed
log message and in the documentation patch) that do not come with an
accompanying explanation of why it is a good idea to follow them,
which does not help readers.  For example, S may suggest X to C,
and C (imagine a third-party reimplementation of Git, which is not
bound by your "should") may take advantage of that suggestion and
use X as a better connected alternative, and C might want to do so
without even telling S.  What entices C to tell S?  IOW, how are
these two parties expected to collaborate with that information at
hand?  Without answering that question ...

> To allow S and C to agree and let each other know about C using X or
> not, let's introduce a new "promisor-remote" capability in the
> protocol v2, as well as a few new configuration variables:
>
>   - "promisor.advertise" on the server side, and:
>   - "promisor.acceptFromServer" on the client side.

... the need for a mechanism to share that information between S and
C is hard to sell.  "By telling S, C allows S to omit objects that
can be obtained from X when answering C's request?" or something,
perhaps?

> +Note that in the future it would be nice if the "promisor-remote"
> +protocol capability could be used by the server, when responding to
> +`git fetch` or `git clone`, to advertise better-connected remotes that
> +the client can use as promisor remotes, instead of this repository, so
> +that the client can lazily fetch objects from these other
> +better-connected remotes. This would require the server to omit in its
> +response the objects available on the better-connected remotes that
> +the client has accepted. This hasn't been implemented yet though. So
> +for now this "promisor-remote" capability is useful only when the
> +server advertises some promisor remotes it already uses to borrow
> +objects from.

We need to figure out before etching the protocol specification in
stone what to do when the network situations observable by C and S
are different.  For example, C may need to go over a proxy to reach
S, S may directly have connection to X, but C cannot reach X
directly, and C needs another proxy, different from the one it uses
to go to S, to reach X.  How is S expected to know about C's network
situation, and use the knowledge to tell C how to reach X?  Or is X
so well known a name that it is C's responsibility to arrange how it
can reach X?  I suspect that this was designed primarily to allow a
server to better help clients owned by the same enterprise entity,
so it might be tempting to distribute pieces of information we
usually do not consider Git's concern, like proxy configuration,
over the same protocol.  I personally would strongly prefer *not* to
go in that direction, and if we agree that we won't go there from
the beginning, I'd be a lot happier ;-)


* Re: [PATCH v3 0/5] Introduce a "promisor-remote" capability
  2024-12-06 12:42   ` [PATCH v3 0/5] " Christian Couder
                       ` (4 preceding siblings ...)
  2024-12-06 12:42     ` [PATCH v3 5/5] doc: add technical design doc for large object promisors Christian Couder
@ 2024-12-09  8:04     ` Junio C Hamano
  2024-12-09 10:40       ` Christian Couder
  2025-01-27 15:16     ` [PATCH v4 0/6] " Christian Couder
  6 siblings, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2024-12-09  8:04 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine

Christian Couder <christian.couder@gmail.com> writes:

> This work is part of some effort to better handle large files/blobs in
> a client-server context using promisor remotes dedicated to storing
> large blobs. To help understand this effort, this series now contains
> a patch (patch 5/5) that adds design documentation about this effort.

https://github.com/git/git/actions/runs/12229786922/job/34110073072
is a CI-run on 'seen' with this topic.  linux-TEST-vars job is failing.

A CI-run for the same topics in 'seen' but without this topic is
https://github.com/git/git/actions/runs/12230853182/job/34112864500

This topic seems to break linux-TEST-vars CI job (where different
settings like + export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=master
is used).



* Re: [PATCH v3 0/5] Introduce a "promisor-remote" capability
  2024-12-09  8:04     ` [PATCH v3 0/5] Introduce a "promisor-remote" capability Junio C Hamano
@ 2024-12-09 10:40       ` Christian Couder
  2024-12-09 10:42         ` Christian Couder
  2024-12-09 23:01         ` Junio C Hamano
  0 siblings, 2 replies; 110+ messages in thread
From: Christian Couder @ 2024-12-09 10:40 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine

On Mon, Dec 9, 2024 at 9:04 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > This work is part of some effort to better handle large files/blobs in
> > a client-server context using promisor remotes dedicated to storing
> > large blobs. To help understand this effort, this series now contains
> > a patch (patch 5/5) that adds design documentation about this effort.
>
> https://github.com/git/git/actions/runs/12229786922/job/34110073072
> is a CI-run on 'seen' with this topic.  linux-TEST-vars job is failing.
>
> A CI-run for the same topics in 'seen' but without this topic is
> https://github.com/git/git/actions/runs/12230853182/job/34112864500
>
> This topic seems to break linux-TEST-vars CI job (where different
> settings like + export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=master
> is used).

Yeah, in the "CI tests" section in the cover letter I wrote:

> One test, linux-TEST-vars, failed much earlier, in what doesn't look
> like a CI issue as I could reproduce the failure locally when setting
> GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL to 1. I will investigate,
> but in the meantime I think I can send this as-is so we can start
> discussing.

I noticed that fcb2205b77 (midx: implement support for writing
incremental MIDX chains, 2024-08-06)
which introduced GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL adds lines like:

GIT_TEST_MULTI_PACK_INDEX=0
GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0

at the top of a number of repack related test scripts like
t7700-repack.sh, so I guess that it should be OK to add the same lines
at the top of the t5710 test script added by this series. This should
fix the CI failures.

I have made this change in my current version.

Thanks.



Yeah, not sure why


* Re: [PATCH v3 0/5] Introduce a "promisor-remote" capability
  2024-12-09 10:40       ` Christian Couder
@ 2024-12-09 10:42         ` Christian Couder
  2024-12-09 23:01         ` Junio C Hamano
  1 sibling, 0 replies; 110+ messages in thread
From: Christian Couder @ 2024-12-09 10:42 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine

On Mon, Dec 9, 2024 at 11:40 AM Christian Couder
<christian.couder@gmail.com> wrote:

> Yeah, not sure why

Sorry for this. It's an editing mistake.


* Re: [PATCH v3 0/5] Introduce a "promisor-remote" capability
  2024-12-09 10:40       ` Christian Couder
  2024-12-09 10:42         ` Christian Couder
@ 2024-12-09 23:01         ` Junio C Hamano
  2025-01-27 15:05           ` Christian Couder
  1 sibling, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2024-12-09 23:01 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine

Christian Couder <christian.couder@gmail.com> writes:

> I noticed that fcb2205b77 (midx: implement support for writing
> incremental MIDX chains, 2024-08-06)
> which introduced GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL adds lines like:
>
> GIT_TEST_MULTI_PACK_INDEX=0
> GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0
>
> at the top of a number of repack related test scripts like
> t7700-repack.sh, so I guess that it should be OK to add the same lines
> at the top of the t5710 test script added by this series. This should
> fix the CI failures.
>
> I have made this change in my current version.

Thanks.

Is it because the feature is fundamentally incompatible with the
multi-pack index (or its incremental writing), or is it merely
because the way the feature is verified assumes that the multi-pack
index is not used, even though the protocol exchange, capability
selection, and the actual behaviour adjustment for the capability
are all working just fine?  I am assuming it is the latter, but just
to make sure we know where we stand...

Thanks, again.


* Re: [PATCH v3 5/5] doc: add technical design doc for large object promisors
  2024-12-06 12:42     ` [PATCH v3 5/5] doc: add technical design doc for large object promisors Christian Couder
@ 2024-12-10  1:28       ` Junio C Hamano
  2025-01-27 15:12         ` Christian Couder
  2024-12-10 11:43       ` Junio C Hamano
  1 sibling, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2024-12-10  1:28 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> Let's add a design doc about how we could improve handling large blobs
> using "Large Object Promisors" (LOPs). It's a set of features with the
> goal of using special dedicated promisor remotes to store large blobs,
> and having them accessed directly by main remotes and clients.
>
> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---
>  .../technical/large-object-promisors.txt      | 530 ++++++++++++++++++
>  1 file changed, 530 insertions(+)
>  create mode 100644 Documentation/technical/large-object-promisors.txt

Kudos to whoever suggested to write this kind of bird's-eye view
document to help readers understand the bigger picture.  Such a "we
want to go in this direction, and this small piece fits within that
larger picture this way" is a good way to motivate readers.

Hopefully I'll have time to comment on different parts of the
documents, but the impression I got was that we should write with
fewer "we could" and instead say more "we aim to", i.e. be more
assertive.

Thanks.


* Re: [PATCH v3 5/5] doc: add technical design doc for large object promisors
  2024-12-06 12:42     ` [PATCH v3 5/5] doc: add technical design doc for large object promisors Christian Couder
  2024-12-10  1:28       ` Junio C Hamano
@ 2024-12-10 11:43       ` Junio C Hamano
  2024-12-16  9:00         ` Patrick Steinhardt
  2025-01-27 15:11         ` Christian Couder
  1 sibling, 2 replies; 110+ messages in thread
From: Junio C Hamano @ 2024-12-10 11:43 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> +We will call a "Large Object Promisor", or "LOP" in short, a promisor
> +remote which is used to store only large blobs and which is separate
> +from the main remote that should store the other Git objects and the
> +rest of the repos.
> +
> +By extension, we will also call "Large Object Promisor", or LOP, the
> +effort described in this document to add a set of features to make it
> +easier to handle large blobs/files in Git by using LOPs.
> +
> +This effort would especially improve things on the server side, and
> +especially for large blobs that are already compressed in a binary
> +format.

The implementation on the server side can be hidden and be improved
as long as we have a reasonable wire protocol.  As it stands, even
with the promisor-remote referral extension, the data coming from
LOP still is expected to be a pack stream, which I am not sure is a
good match.  Is the expectation (yes, I know the document later says
it won't go into storage layer, but still, in order to get the
details of the protocol extension right, we MUST have some idea on
the characteristics the storage layer has so that the protocol would
work well with the storage implementation with such characteristics)
that we give up on deltifying these LOP objects (which might be a
sensible assumption, if they are incompressible large binary gunk),
we store each object in LOP as base representation inside a pack
stream (i.e. the in-pack "undeltified representation" defined in
Documentation/gitformat-pack.txt), so that to send these LOP objects
is just the matter of preparing the pack header (PACK + version +
numobjects) and then concatenating these objects while computing the
running checksum to place in the trailer of the pack stream?  Could
it still be too expensive for the server side, having to compute the
running sum, and we might want to update the object transfer part of
the pack stream definition somehow to reduce the load on the server
side?
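
For illustration, the framing described above (pack header, then one
undeltified entry per object) could be sketched in C roughly as below.
This is only a sketch based on the layout in
Documentation/gitformat-pack.txt, not code from the series: the
function names write_pack_header() and write_object_header() are made
up for this example, and both the deflated object contents that follow
each entry header and the trailing checksum over the whole stream are
left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OBJ_BLOB 3 /* object type number used for blobs in a pack */

/* Write the 12-byte pack header: "PACK", version 2, object count. */
void write_pack_header(unsigned char out[12], uint32_t nr_objects)
{
	const uint32_t version = 2;
	int i;

	memcpy(out, "PACK", 4);
	for (i = 0; i < 4; i++) {
		/* both fields are 4-byte integers in network byte order */
		out[4 + i] = (unsigned char)(version >> (24 - 8 * i));
		out[8 + i] = (unsigned char)(nr_objects >> (24 - 8 * i));
	}
}

/*
 * Write the header of one undeltified entry: 3 bits of type and the
 * low 4 bits of the size in the first byte, then 7 more size bits per
 * following byte, with the high bit set on every byte but the last.
 * Returns the number of bytes written (at most 10 for a 64-bit size).
 */
size_t write_object_header(unsigned char *out, unsigned type, uint64_t size)
{
	size_t n = 0;
	unsigned char c;

	assert(type >= 1 && type <= 7);
	c = (unsigned char)((type << 4) | (size & 0x0f));
	size >>= 4;
	while (size) {
		out[n++] = (unsigned char)(c | 0x80);
		c = (unsigned char)(size & 0x7f);
		size >>= 7;
	}
	out[n++] = c;
	return n;
}
```

The per-object work in such a scheme is small; the cost asked about
above comes from hashing every byte of the stream for the trailer,
which this sketch does not attempt.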

> +- We will not discuss those client side improvements here, as they
> +  would require changes in different parts of Git than this effort.
> ++
> +So we don't pretend to fully replace Git LFS with only this effort,
> +but we nevertheless believe that it can significantly improve the
> +current situation on the server side, and that other separate
> +efforts could also improve the situation on the client side.

We still need to come up with a minimally working client side
component, if our goal were to only improve the server side, in
order to demonstrate the benefit of the effort.

> +In other words, the goal of this document is not to talk about all the
> +possible ways to optimize how Git could handle large blobs, but to
> +describe how a LOP based solution could work well and alleviate a
> +number of current issues in the context of Git clients and servers
> +sharing Git objects.

But if you do not discuss even a single way, and handwave "we'll
have this magical object storage that would solve all the problems
for us", then we cannot really tell if the problem is solved by us,
or handwaved away by assuming the magical object storage.  We'd
need at least one working example.

> +6) A protocol negotiation should happen when a client clones
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +When a client clones from a main repo, there should be a protocol
> +negotiation so that the server can advertise one or more LOPs and so
> +that the client and the server can discuss if the client could
> +directly use a LOP the server is advertising. If the client and the
> +server can agree on that, then the client would be able to get the
> +large blobs directly from the LOP and the server would not need to
> +fetch those blobs from the LOP to be able to serve the client.
> +
> +Note
> +++++
> +
> +For fetches instead of clones, see the "What about fetches?" FAQ entry
> +below.
> +
> +Rationale
> ++++++++++
> +
> +Security, configurability and efficiency of setting things up.

It is unclear how it improves security and configurability if we
limit the protocol exchange only at the clone time (implying that
later either side cannot change it).  It will lead to security
issues if we assume that it is impossible for one side to "lie" to
the other side what they earlier agreed on (unless we somehow make
it actually impossible to lie to the other side, of course).

> +7) A client can offload to a LOP
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +When a client is using a LOP that is also a LOP of its main remote,
> +the client should be able to offload some large blobs it has fetched,
> +but might not need anymore, to the LOP.

For a client that _creates_ a large object, the situation would be
the same, right?  After it creates several versions of the opening
segment of, say, a movie, the latest version may be still wanted,
but the creating client may want to offload earlier versions.

* Re: [PATCH v3 5/5] doc: add technical design doc for large object promisors
  2024-12-10 11:43       ` Junio C Hamano
@ 2024-12-16  9:00         ` Patrick Steinhardt
  2025-01-27 15:11         ` Christian Couder
  1 sibling, 0 replies; 110+ messages in thread
From: Patrick Steinhardt @ 2024-12-16  9:00 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Christian Couder, git, John Cai, Taylor Blau, Eric Sunshine,
	Christian Couder

On Tue, Dec 10, 2024 at 08:43:03PM +0900, Junio C Hamano wrote:
> Christian Couder <christian.couder@gmail.com> writes:
> > +In other words, the goal of this document is not to talk about all the
> > +possible ways to optimize how Git could handle large blobs, but to
> > +describe how a LOP based solution could work well and alleviate a
> > +number of current issues in the context of Git clients and servers
> > +sharing Git objects.
> 
> But if you do not discuss even a single way, and handwave "we'll
> have this magical object storage that would solve all the problems
> for us", then we cannot really tell if the problem is solved by us,
> or handwaved away by assuming the magical object storage.  We'd
> need at least one working example.

It's something we're working on in parallel with the effort to slowly
move towards pluggable object databases. We aren't yet totally clear
on how exactly to store such objects, but there are a couple of ideas:

  - Store large objects verbatim in a separate path without any kind of
    compression at all. It solves the problem of wasting compute time
    during compression, but does not solve the problem of having to
    store blobs multiple times even if only a tiny part of them change.

  - Use a rolling hash function to split up large objects into smaller
    hunks that can be deduplicated. This solves the issue of only small
    parts of the binary file changing as we'd only have to store the
    hunk that has changed.

This has been discussed e.g. in [1], and I've been talking with some
people about rolling hash functions.
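
To make the second idea a bit more concrete, here is a minimal C
sketch of content-defined chunking with a rolling hash. It is only an
illustration under stated assumptions, not a proposal for how Git
would do it: the "gear"-style hash, the size limits, the boundary mask
and the names gear_value() and next_chunk_len() are all arbitrary
choices made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIN_CHUNK 64	/* never cut before this many bytes */
#define MAX_CHUNK 4096	/* always cut at this many bytes */
/* cut when the top 10 bits of the hash are zero */
#define BOUNDARY_MASK 0xffc0000000000000ULL

/* Map a byte to a fixed pseudo-random 64-bit value (splitmix64 mix). */
static uint64_t gear_value(unsigned char byte)
{
	uint64_t z = (uint64_t)byte + 0x9e3779b97f4a7c15ULL;

	z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
	z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
	return z ^ (z >> 31);
}

/*
 * Return the length of the first chunk of data[0..len). Each step
 * shifts the hash left by one bit, so a byte stops influencing it
 * after 64 steps and the cut point depends only on nearby content,
 * not on the absolute offset within the file.
 */
size_t next_chunk_len(const unsigned char *data, size_t len)
{
	uint64_t hash = 0;
	size_t i;

	assert(data != NULL || len == 0);
	if (len <= MIN_CHUNK)
		return len;
	if (len > MAX_CHUNK)
		len = MAX_CHUNK;
	for (i = 0; i < len; i++) {
		hash = (hash << 1) + gear_value(data[i]);
		if (i + 1 >= MIN_CHUNK && (hash & BOUNDARY_MASK) == 0)
			return i + 1;
	}
	return len;
}
```

A caller would invoke next_chunk_len() repeatedly, advancing by the
returned length, and store each chunk under the hash of its contents.
Because boundaries follow the content, an edit in the middle of a
large file usually changes only the chunks around it, and unchanged
chunks can be shared between versions.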

In any case, getting to pluggable ODBs is likely a multi-year effort, so
I wonder how detailed we should be in the context of the document here.
We might want to mention that there are ideas and maybe even provide
some pointers, but I think it makes sense to defer the technical
discussion of how exactly this could look like to the future. Mostly
because I think it's going to be a rather big discussion on its own.

Patrick

[1]: https://lore.kernel.org/git/xmqqbkdometi.fsf@gitster.g/

* Re: [PATCH v3 2/5] strbuf: refactor strbuf_trim_trailing_ch()
  2024-12-06 12:42     ` [PATCH v3 2/5] strbuf: refactor strbuf_trim_trailing_ch() Christian Couder
  2024-12-07  6:35       ` Junio C Hamano
@ 2024-12-16 11:47       ` karthik nayak
  1 sibling, 0 replies; 110+ messages in thread
From: karthik nayak @ 2024-12-16 11:47 UTC (permalink / raw)
  To: Christian Couder, git
  Cc: Junio C Hamano, John Cai, Patrick Steinhardt, Taylor Blau,
	Eric Sunshine, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> We often have to split strings at some specified terminator character.
> The strbuf_split*() functions, that we can use for this purpose,
> return substrings that include the terminator character, so we often
> need to remove that character.
>
> When it is a whitespace, newline or directory separator, the
> terminator character can easily be removed using an existing triming

Nit: s/triming/trimming

> function like strbuf_rtrim(), strbuf_trim_trailing_newline() or
> strbuf_trim_trailing_dir_sep(). There is no function to remove that
> character when it's not one of those characters though.
>
> Let's introduce a new strbuf_trim_trailing_ch() function that can be
> used to remove any trailing character, and let's refactor existing code
> that manually removed trailing characters using this new function.
>
> We are also going to use this new function in a following commit.
>
> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---
>  strbuf.c         |  7 +++++++
>  strbuf.h         |  3 +++
>  trace2/tr2_cfg.c | 10 ++--------
>  3 files changed, 12 insertions(+), 8 deletions(-)
>

Shouldn't this patch also add unit tests? We already have some in
't/unit-tests/t-strbuf.c'. This applies to the previous patch too.

[snip]

* Re: [PATCH v3 0/5] Introduce a "promisor-remote" capability
  2024-12-09 23:01         ` Junio C Hamano
@ 2025-01-27 15:05           ` Christian Couder
  2025-01-27 19:38             ` Junio C Hamano
  0 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:05 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine

On Tue, Dec 10, 2024 at 12:01 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > I noticed that fcb2205b77 (midx: implement support for writing
> > incremental MIDX chains, 2024-08-06)
> > which introduced GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL adds lines like:
> >
> > GIT_TEST_MULTI_PACK_INDEX=0
> > GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0
> >
> > at the top of a number of repack related test scripts like
> > t7700-repack.sh, so I guess that it should be OK to add the same lines
> > at the top of the t5710 test script added by this series. This should
> > fix the CI failures.
> >
> > I have made this change in my current version.
>
> Thanks.
>
> Is it because the feature is fundamentally incompatible with the
> multi-pack index (or its incremental writing),

It's not an incompatibility with the feature developed in this series.

Adding the following test script on top of master or even fcb2205b77
(midx: implement support for writing incremental MIDX chains,
2024-08-06), shows that it fails in the same way without any code
change to `git` itself from this series:

diff --git a/t/t5709-midx-increment-write.sh b/t/t5709-midx-increment-write.sh
new file mode 100755
index 0000000000..8801222374
--- /dev/null
+++ b/t/t5709-midx-increment-write.sh
@@ -0,0 +1,132 @@
+#!/bin/sh
+
+test_description='test midx incremental write'
+
+. ./test-lib.sh
+
+export GIT_TEST_MULTI_PACK_INDEX=1
+export GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=1
+
+# Setup the repository with three commits, this way HEAD is always
+# available and we can hide commit 1 or 2.
+test_expect_success 'setup: create "template" repository' '
+       git init template &&
+       test_commit -C template 1 &&
+       test_commit -C template 2 &&
+       test_commit -C template 3 &&
+       test-tool genrandom foo 10240 >template/foo &&
+       git -C template add foo &&
+       git -C template commit -m foo
+'
+
+# A bare repo will act as a server repo with unpacked objects.
+test_expect_success 'setup: create bare "server" repository' '
+       git clone --bare --no-local template server &&
+       mv server/objects/pack/pack-* . &&
+       packfile=$(ls pack-*.pack) &&
+       git -C server unpack-objects --strict <"$packfile"
+'
+
+check_missing_objects () {
+       git -C "$1" rev-list --objects --all --missing=print > all.txt &&
+       perl -ne 'print if s/^[?]//' all.txt >missing.txt &&
+       test_line_count = "$2" missing.txt &&
+       if test "$2" -lt 2
+       then
+               test "$3" = "$(cat missing.txt)"
+       else
+               test -f "$3" &&
+               sort <"$3" >expected_sorted &&
+               sort <missing.txt >actual_sorted &&
+               test_cmp expected_sorted actual_sorted
+       fi
+}
+
+initialize_server () {
+       count="$1"
+       missing_oids="$2"
+
+       # Repack everything first
+       git -C server -c repack.writebitmaps=false repack -a -d &&
+
+       # Remove promisor file in case they exist, useful when reinitializing
+       rm -rf server/objects/pack/*.promisor &&
+
+       # Repack without the largest object and create a promisor pack on server
+       git -C server -c repack.writebitmaps=false repack -a -d \
+           --filter=blob:limit=5k --filter-to="$(pwd)/pack" &&
+       promisor_file=$(ls server/objects/pack/*.pack | sed "s/\.pack/.promisor/") &&
+       >"$promisor_file" &&
+
+       # Check objects missing on the server
+       check_missing_objects server "$count" "$missing_oids"
+}
+
+copy_to_server2 () {
+       oid_path="$(test_oid_to_path $1)" &&
+       path="server/objects/$oid_path" &&
+       path2="server2/objects/$oid_path" &&
+       mkdir -p $(dirname "$path2") &&
+       cp "$path" "$path2"
+}
+
+test_expect_success "setup for testing promisor remote advertisement" '
+       # Create another bare repo called "server2"
+       git init --bare server2 &&
+
+       # Copy the largest object from server to server2
+       obj="HEAD:foo" &&
+       oid="$(git -C server rev-parse $obj)" &&
+       copy_to_server2 "$oid" &&
+
+       initialize_server 1 "$oid" &&
+
+       # Configure server2 as promisor remote for server
+       git -C server remote add server2 "file://$(pwd)/server2" &&
+       git -C server config remote.server2.promisor true &&
+
+       git -C server2 config uploadpack.allowFilter true &&
+       git -C server2 config uploadpack.allowAnySHA1InWant true &&
+       git -C server config uploadpack.allowFilter true &&
+       git -C server config uploadpack.allowAnySHA1InWant true
+'
+
+test_expect_success "setup for subsequent fetches" '
+       # Generate new commit with large blob
+       test-tool genrandom bar 10240 >template/bar &&
+       git -C template add bar &&
+       git -C template commit -m bar &&
+
+       # Fetch new commit with large blob
+       git -C server fetch origin &&
+       git -C server update-ref HEAD FETCH_HEAD &&
+       git -C server rev-parse HEAD >expected_head &&
+
+       # Repack everything twice and remove .promisor files before
+       # each repack. This makes sure everything gets repacked
+       # into a single packfile. The second repack is necessary
+       # because the first one fetches from server2 and creates a new
+       # packfile and its associated .promisor file.
+
+       rm -f server/objects/pack/*.promisor &&
+       git -C server -c repack.writebitmaps=false repack -a -d &&
+       rm -f server/objects/pack/*.promisor &&
+       git -C server -c repack.writebitmaps=false repack -a -d &&
+
+       # Unpack everything
+       rm pack-* &&
+       mv server/objects/pack/pack-* . &&
+       packfile=$(ls pack-*.pack) &&
+       git -C server unpack-objects --strict <"$packfile" &&
+
+       # Copy new large object to server2
+       obj_bar="HEAD:bar" &&
+       oid_bar="$(git -C server rev-parse $obj_bar)" &&
+       copy_to_server2 "$oid_bar" &&
+
+       # Reinitialize server so that the 2 largest objects are missing
+       printf "%s\n" "$oid" "$oid_bar" >expected_missing.txt &&
+       initialize_server 2 expected_missing.txt
+'
+
+test_done

Changing `export GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=1` into
`export GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0` at the top of
the file makes it work.

This could probably be simplified, but I think it shows that it's just
the incremental writing of the multi-pack index that is incompatible
or has a bug when doing some repacking.

> or is it merely
> because the way the feature is verified assumes that the multi-pack
> index is not used, even though the protocol exchange, capability
> selection, and the actual behaviour adjustment for the capability
> are all working just fine?  I am assuming it is the latter, but just
> to make sure we know where we stand...

Let me know if you need more than the above, but I think it's fair for
now to just use:

GIT_TEST_MULTI_PACK_INDEX=0
GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0

at the top of the tests, like it's done in the version 4 of this
series I will send soon.

* Re: [PATCH v3 1/5] version: refactor strbuf_sanitize()
  2024-12-07  6:21       ` Junio C Hamano
@ 2025-01-27 15:07         ` Christian Couder
  0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:07 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Christian Couder, karthik nayak

On Sat, Dec 7, 2024 at 7:21 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > +/*
> > + * Trim and replace each character with ascii code below 32 or above
> > + * 127 (included) using a dot '.' character. Useful for sending
> > + * capabilities.
> > + */
> > +void strbuf_sanitize(struct strbuf *sb);
>
> I am not getting "Useful for sending capabilities" here, and feel
> that it is somewhat an unsubstantiated claim.  If some information
> is going to be transferred (which the phrase "sending capabilities"
> hints), I'd expect that we try as hard as possible not to lose
> information, but redact-non-ASCII is the total opposite of "not
> losing information".

Ok, "Useful for sending capabilities" will be removed.

> By the way, as we are trimming, I am very very much tempted to
> squish a run of non-ASCII bytes into one dot, perhaps like
>
>         void redact_non_printables(struct strbuf *sb)
>         {
>                 size_t dst = 0;
>                 int skipped = 0;
>
>                 strbuf_trim(sb);
>                 for (size_t src = 0; src < sb->len; src++) {
>                         int ch = sb->buf[src];
>                         if (ch <= 32 || 127 <= ch) {
>                                 if (skipped)
>                                         continue;
>                                 ch = '.';
>                         }
>                         sb->buf[dst++] = ch;
>                         skipped = (ch == '.');
>                 }
>                 strbuf_setlen(sb, dst);
>         }
>
> or even without strbuf_trim(), which would turn any leading or
> trailing run of whitespaces into '.'.
>
> But that is an improvement that can be easily done on top after the
> dust settles and better left as #leftoverbits material.

Usman's patch series about introducing a "os-version" capability needs
such a feature too, and Usman already reworked this code according to
your comments here. It looks like you found it good too. So I will
just reuse his patches related to this in the version 4 of this patch
series.
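
For readers who want to try the squishing behaviour outside of Git,
here is a standalone C sketch of the variant without trimming that is
mentioned in the quoted text. It works on a plain NUL-terminated
buffer instead of a struct strbuf, and the name
squish_non_printables() is made up for this example. It is not the
code from Usman's series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Replace each run of bytes outside the printable range with a single
 * '.' character, in place. Bytes <= 32 or >= 127 count as
 * non-printable here, so whitespace is squished too. Returns the new
 * length of the string.
 */
size_t squish_non_printables(char *buf)
{
	size_t src, dst = 0;
	int in_run = 0;

	assert(buf != NULL);
	for (src = 0; buf[src]; src++) {
		unsigned char ch = (unsigned char)buf[src];

		if (ch <= 32 || ch >= 127) {
			if (in_run)
				continue; /* this run already has its dot */
			ch = '.';
		}
		buf[dst++] = (char)ch;
		in_run = (ch == '.');
	}
	buf[dst] = '\0';
	return dst;
}
```

For example, the string "Linux 6.1" becomes "Linux.6.1". Note that, as
in the quoted sketch, a literal '.' also starts a run, so a
non-printable byte right after a dot is dropped instead of adding a
second dot.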

* Re: [PATCH v3 2/5] strbuf: refactor strbuf_trim_trailing_ch()
  2024-12-07  6:35       ` Junio C Hamano
@ 2025-01-27 15:07         ` Christian Couder
  0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:07 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Christian Couder, karthik nayak

On Sat, Dec 7, 2024 at 7:35 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > We often have to split strings at some specified terminator character.
> > The strbuf_split*() functions, that we can use for this purpose,
> > return substrings that include the terminator character, so we often
> > need to remove that character.
> >
> > When it is a whitespace, newline or directory separator, the
> > terminator character can easily be removed using an existing triming
> > function like strbuf_rtrim(), strbuf_trim_trailing_newline() or
> > strbuf_trim_trailing_dir_sep(). There is no function to remove that
> > character when it's not one of those characters though.
>
> Heh, totally uninteresting (alternative being open coding this one).
> If we pass, instead of a single character 'c', an array of characters
> to be stripped from the right (like strspn() allows you to skip from
> the left), I may have been a bit more receptive, though ;-)

Yeah, I realized strbuf_strip_suffix() can do the job in the following
patches, so I dropped this patch and used strbuf_strip_suffix() in the
version 4 of this series.
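
As an aside, the more general helper suggested in the quoted text,
which strips any of a set of characters from the right in the way
strspn() lets you skip them from the left, could look roughly like
the C sketch below. It uses a plain C string rather than a struct
strbuf, and the name rtrim_chars() is made up for this example;
nothing like it is part of this series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Remove from the end of 's' every character that appears in 'set',
 * stopping at the first character that does not. Returns the new
 * length of the string.
 */
size_t rtrim_chars(char *s, const char *set)
{
	size_t len;

	assert(s != NULL && set != NULL);
	len = strlen(s);
	while (len > 0 && strchr(set, s[len - 1]))
		len--;
	s[len] = '\0';
	return len;
}
```

With this, rtrim_chars(buf, "/;") turns "refs/heads/main//;" into
"refs/heads/main", and an empty set leaves the string untouched.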

* Re: [PATCH v3 3/5] Add 'promisor-remote' capability to protocol v2
  2024-12-07  7:59       ` Junio C Hamano
@ 2025-01-27 15:08         ` Christian Couder
  0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:08 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Christian Couder, karthik nayak

On Sat, Dec 7, 2024 at 8:59 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > Then C might or might not, want to get the objects from X, and should
> > let S know about this.
>
> I only left this instance quoted in this reply, but I found that
> there are too many "should" in the description (both in the proposed
> log message and in the documentation patch), which do not help the
> readers with accompanying explanation on the reason why it is a good
> idea to follow these "should".

In the next version, I have changed the commit message to replace many
"should" with something else.

> For example, S may suggest X to C,
> and C (imagine a third-party reimplementation of Git, which is not
> bound by your "should") may take advantage of that suggestion and
> use X as a better connected alternative, and C might want to do so
> without even telling S.  What entices C to tell S?  IOW, how are
> these two parties expected to collaborate with that information at
> hand?  Without answering that question ...

The improved commit message in the next version says earlier that "If
S and C can agree on C using X directly, S can then omit objects that
can be obtained from X when answering C's request."

> > To allow S and C to agree and let each other know about C using X or
> > not, let's introduce a new "promisor-remote" capability in the
> > protocol v2, as well as a few new configuration variables:
> >
> >   - "promisor.advertise" on the server side, and:
> >   - "promisor.acceptFromServer" on the client side.
>
> ... the need for a mechanism to share that information between S and
> C is hard to sell.  "By telling S, C allows S to omit objects that
> can be obtained from X when answering C's request?" or something,
> perhaps?

Yeah, now this is mentioned earlier.

> > +Note that in the future it would be nice if the "promisor-remote"
> > +protocol capability could be used by the server, when responding to
> > +`git fetch` or `git clone`, to advertise better-connected remotes that
> > +the client can use as promisor remotes, instead of this repository, so
> > +that the client can lazily fetch objects from these other
> > +better-connected remotes. This would require the server to omit in its
> > +response the objects available on the better-connected remotes that
> > +the client has accepted. This hasn't been implemented yet though. So
> > +for now this "promisor-remote" capability is useful only when the
> > +server advertises some promisor remotes it already uses to borrow
> > +objects from.
>
> We need to figure out before etching the protocol specification in
> stone what to do when the network situations observable by C and S
> are different.  For example, C may need to go over a proxy to reach
> S, S may directly have connection to X, but C cannot reach X
> directly, and C needs another proxy, different from the one it uses
> to go to S, to reach X.  How is S expected to know about C's network
> situation, and use the knowledge to tell C how to reach X?  Or is X
> so well known a name that it is C's responsibility to arrange how it
> can reach X?

Yeah, it's C's responsibility to arrange how it can reach X.

> I suspect that this was designed primarily to allow a
> server to better help clients owned by the same enterprise entity,
> so it might be tempting to distribute pieces of information we
> usually do not consider Git's concern, like proxy configuration,
> over the same protocol.  I personally would strongly prefer *not* to
> go in that direction, and if we agree that we won't go there from
> the beginning, I'd be a lot happier ;-)

I don't want to go into that direction either. I have added the
following into the commit message:

"It is C's responsibility to arrange how it can reach X though, so pieces
of information that are usually outside Git's concern, like proxy
configuration, must not be distributed over this protocol."

I think that requiring some global configuration is a good thing. What
we should particularly make easier and more flexible are some details
about the best ways to access each individual repo, like which filter
spec it is best to use. So that if the repo admins decide to move some
smaller objects to the LOP, each client doesn't have to adjust the
filter spec.

* Re: [PATCH v3 5/5] doc: add technical design doc for large object promisors
  2024-12-10 11:43       ` Junio C Hamano
  2024-12-16  9:00         ` Patrick Steinhardt
@ 2025-01-27 15:11         ` Christian Couder
  2025-01-27 18:02           ` Junio C Hamano
  1 sibling, 1 reply; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:11 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Christian Couder

On Tue, Dec 10, 2024 at 12:43 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > +We will call a "Large Object Promisor", or "LOP" in short, a promisor
> > +remote which is used to store only large blobs and which is separate
> > +from the main remote that should store the other Git objects and the
> > +rest of the repos.
> > +
> > +By extension, we will also call "Large Object Promisor", or LOP, the
> > +effort described in this document to add a set of features to make it
> > +easier to handle large blobs/files in Git by using LOPs.
> > +
> > +This effort would especially improve things on the server side, and
> > +especially for large blobs that are already compressed in a binary
> > +format.
>
> The implementation on the server side can be hidden and be improved
> as long as we have a reasonable wire protocol.  As it stands, even
> with the promisor-remote referral extension, the data coming from
> LOP still is expected to be a pack stream, which I am not sure is a
> good match.

I agree it might not be a good match.

> Is the expectation (yes, I know the document later says
> it won't go into storage layer, but still, in order to get the
> details of the protocol extension right, we MUST have some idea on
> the characteristics the storage layer has so that the protocol would
> work well with the storage implementation with such characteristics)
> that we give up on deltifying these LOP objects (which might be a
> sensible assumption, if they are incompressible large binary gunk),

Yes, there is a section (II.2) called "LOPs can use object storage" about this.

In the next version I have tried to clarify this early in the doc by
saying the following in the non-goal section:

"Our opinion is that the simplest solution for now is for LOPs to use
object storage through a remote helper (see section II.2 below for
more details) to store their objects. So we consider that this is the
default implementation. If there are improvements on top of this,
that's great, but our opinion is that such improvements are not
necessary for LOPs to already be useful. Such improvements are likely
a different technical topic, and can be taken care of separately
anyway."

> we store each object in LOP as base representation inside a pack
> stream (i.e. the in-pack "undeltified representation" defined in
> Documentation/gitformat-pack.txt), so that to send these LOP objects
> is just the matter of preparing the pack header (PACK + version +
> numobjects) and then concatenating these objects while computing the
> running checksum to place in the trailer of the pack stream? Could
> it still be too expensive for the server side, having to compute the
> running sum, and we might want to update the object transfer part of
> the pack stream definition somehow to reduce the load on the server
> side?

I agree that this might be an interesting thing to look at, but I
think it's not necessary to work on this now. It's more important for
now that the storage for large blobs on LOPs is cheap.

As clients may not all migrate soon to a version of Git that supports
LOPs well, it's likely that LOPs will be used for repos that are
mostly inactive first (at least that's our plan at GitLab), so there
would not be much traffic. This would give us time to look at
optimizing data transfer.

> > +- We will not discuss those client side improvements here, as they
> > +  would require changes in different parts of Git than this effort.
> > ++
> > +So we don't pretend to fully replace Git LFS with only this effort,
> > +but we nevertheless believe that it can significantly improve the
> > +current situation on the server side, and that other separate
> > +efforts could also improve the situation on the client side.
>
> We still need to come up with a minimally working client side
> component, if our goal were to only improve the server side, in
> order to demonstrate the benefit of the effort.

How would clients work worse with large files compared to the current
situation, when the benefit of the current effort (the
"promisor-remote" capability) makes it easier for them, but doesn't
force them, to use promisor remotes?

If clients can use promisor remotes more, especially when cloning,
they can benefit from having fewer large files locally when they don't
need them. So they should just work better. And again they are not
forced to use promisor remotes, if they still prefer not to use them,
they still can perform a regular clone, and they will not work
differently than they do now.

> > +In other words, the goal of this document is not to talk about all the
> > +possible ways to optimize how Git could handle large blobs, but to
> > +describe how a LOP based solution could work well and alleviate a
> > +number of current issues in the context of Git clients and servers
> > +sharing Git objects.
>
> But if you do not discuss even a single way, and handwave "we'll
> have this magical object storage that would solve all the problems
> for us", then we cannot really tell if the problem is solved by us,
> or handwaved away by assuming the magical object storage.
> We'd need at least one working example.

It's not magical object storage. Amazon S3, GCP Bucket and MinIO
(which is open source), for example, already exist and are used a lot
in the industry. Some Git remote helpers to access them can even be
found online under open source licenses, like for example:

  - https://github.com/awslabs/git-remote-s3
  - https://gitlab.com/eric.p.ju/git-remote-gs

Writing a remote helper to use some object storage as a promisor
remote is also not very difficult. Yeah, perhaps optimizing them would
be worth the effort, but they are, or would likely be, at least for
now, separate projects, and nothing prevents people interested in
optimizing them from contributing to these projects.

I have added some details about these object storage technologies and
remote helpers to access them in the next version of the doc.

> > +6) A protocol negotiation should happen when a client clones
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +When a client clones from a main repo, there should be a protocol
> > +negotiation so that the server can advertise one or more LOPs and so
> > +that the client and the server can discuss if the client could
> > +directly use a LOP the server is advertising. If the client and the
> > +server can agree on that, then the client would be able to get the
> > +large blobs directly from the LOP and the server would not need to
> > +fetch those blobs from the LOP to be able to serve the client.
> > +
> > +Note
> > +++++
> > +
> > +For fetches instead of clones, see the "What about fetches?" FAQ entry
> > +below.
> > +
> > +Rationale
> > ++++++++++
> > +
> > +Security, configurability and efficiency of setting things up.
>
> It is unclear how it improves security and configurability if we
> limit the protocol exchange only at the clone time (implying that
> later either side cannot change it).  It will lead to security
> issues if we assume that it is impossible for one side to "lie" to
> the other side what they earlier agreed on (unless we somehow make
> it actually impossible to lie to the other side, of course).

It's not limited to clone time. There are tests in the patch series
that test that the protocol is used and works when fetching.

The "What about fetches?" FAQ entry also says:

"In a regular fetch, the client will contact the main remote and a
protocol negotiation will happen between them."

Or are you talking about lazy fetches? There it is mentioned that a
token could be used to secure this. Other parts of the doc mention
using such a token by the way.

I have changed the note about fetches to be like this:

"For fetches instead of clones, a protocol negotiation might not always
happen, see the "What about fetches?" FAQ entry below for details."

> > +7) A client can offload to a LOP
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +When a client is using a LOP that is also a LOP of its main remote,
> > +the client should be able to offload some large blobs it has fetched,
> > +but might not need anymore, to the LOP.
>
> For a client that _creates_ a large object, the situation would be
> the same, right?  After it creates several versions of the opening
> segment of, say, a movie, the latest version may be still wanted,
> but the creating client may want to offload earlier versions.

Yeah, but it's not clear if the versions of the opening segment should
be sent directly to the LOP without the main remote checking them in
some ways (hooks might be configured only on the main remote) and/or
checking that they are connected to the repo. I guess it depends on
the context if it would be OK or not.

I have added the following note:

"It might depend on the context if it should be OK or not for clients
to offload large blobs they have created, instead of fetched, directly
to the LOP without the main remote checking them in some ways
(possibly using hooks or other tools)."


* Re: [PATCH v3 5/5] doc: add technical design doc for large object promisors
  2024-12-10  1:28       ` Junio C Hamano
@ 2025-01-27 15:12         ` Christian Couder
  0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:12 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Christian Couder

On Tue, Dec 10, 2024 at 2:28 AM Junio C Hamano <gitster@pobox.com> wrote:

> Hopefully I'll have time to comment on different parts of the
> documents, but the impression I got was that we should write with
> fewer "we could" and instead say more "we aim to", i.e. be more
> assertive.

I have tried to make the next version of the document more assertive
in some places and clearer in other places by replacing some "could"
with other terms.


* [PATCH v4 0/6] Introduce a "promisor-remote" capability
  2024-12-06 12:42   ` [PATCH v3 0/5] " Christian Couder
                       ` (5 preceding siblings ...)
  2024-12-09  8:04     ` [PATCH v3 0/5] Introduce a "promisor-remote" capability Junio C Hamano
@ 2025-01-27 15:16     ` Christian Couder
  2025-01-27 15:16       ` [PATCH v4 1/6] version: replace manual ASCII checks with isprint() for clarity Christian Couder
                         ` (7 more replies)
  6 siblings, 8 replies; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:16 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
	Randall S . Becker, Christian Couder

This work is part of some effort to better handle large files/blobs in
a client-server context using promisor remotes dedicated to storing
large blobs. To help understand this effort, this series now contains
a patch (patch 6/6) that adds design documentation about this effort.

Last year, I sent 3 versions of a patch series with the goal of
allowing a client C to clone from a server S while using the same
promisor remote X that S already uses. See:

https://lore.kernel.org/git/20240418184043.2900955-1-christian.couder@gmail.com/

Junio suggested implementing that feature using:

"a protocol extension that lets S tell C that S wants C to fetch
missing objects from X (which means that if C knows about X in its
".git/config" then there is no need for end-user interaction at all),
or a protocol extension that C tells S that C is willing to see
objects available from X omitted when S does not have them (again,
this could be done by looking at ".git/config" at C, but there may be
security implications???)"

This patch series implements that protocol extension called
"promisor-remote" (that name is open to change or simplification)
which allows S and C to agree on C using X directly or not.

I have tried to implement it in a quite generic way that could allow S
and C to share more information about promisor remotes and how to use
them.
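
As an illustration of the kind of information exchanged, the parsing
code in the range diff below suggests that each advertised remote is a
comma-separated list of `name=` and `url=` fields. The following is a
minimal standalone sketch of parsing one such entry, not the code from
the series: `struct remote_info`, `parse_remote()` and the buffer sizes
are hypothetical, percent-decoding is left out, and rejecting unknown
fields is an assumption made here for simplicity.

```c
#include <string.h>

/* Hypothetical container for one advertised promisor remote. */
struct remote_info {
	char name[64];
	char url[256];
};

/*
 * Parse one entry such as "name=lop,url=file:///tmp/lop".
 * Returns 0 on success, -1 if a field is unknown or too long.
 * Assumption: values are used as-is (no percent-decoding here).
 */
static int parse_remote(const char *entry, struct remote_info *out)
{
	out->name[0] = '\0';
	out->url[0] = '\0';

	while (*entry) {
		/* Length of the current "key=value" field. */
		size_t len = strcspn(entry, ",");
		size_t skip, cap;
		char *dst;

		if (!strncmp(entry, "name=", 5)) {
			dst = out->name;
			cap = sizeof(out->name);
			skip = 5;
		} else if (!strncmp(entry, "url=", 4)) {
			dst = out->url;
			cap = sizeof(out->url);
			skip = 4;
		} else {
			return -1; /* unknown field (sketch-only policy) */
		}

		if (len - skip >= cap)
			return -1; /* value does not fit in the buffer */

		memcpy(dst, entry + skip, len - skip);
		dst[len - skip] = '\0';

		/* Move past this field and its separator, if any. */
		entry += len;
		if (*entry == ',')
			entry++;
	}
	return 0;
}
```

For example, calling `parse_remote("name=lop,url=file:///tmp/lop", &r)`
on this sketch would fill `r.name` with "lop" and `r.url` with
"file:///tmp/lop".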

For now, C doesn't use the information it gets from S when cloning.
That information is only used to decide if C is OK to use the promisor
remotes advertised by S. But this could change in the future, which
could make things much simpler for clients than the current way of
passing information about X with the `-c` option of `git clone` many
times on the command line.

Another improvement could be to not require GIT_NO_LAZY_FETCH=0 when S
and C have agreed on using X.

Changes compared to version 3
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  - Patches 1/6 and 2/6 are new in this series. They come from the
    patch series Usman Akinyemi is working on
    (https://lore.kernel.org/git/20250124122217.250925-1-usmanakinyemi202@gmail.com/).
    We need a redact_non_printables() function similar to the one he
    has been working on in his patch series, so it's just simpler to
    reuse his patches related to this function, and to build on top of
    them.

  - Patch 2/5 in version 3 has been removed. It created a new
    strbuf_trim_trailing_ch() function as part of the strbuf API, but
    we can reuse an existing function, strbuf_strip_suffix(), instead.

  - Patch 3/6 is new. It makes redact_non_printables() non-static so
    that it can be reused in a following patch.

  - In patch 4/6, the commit message has been improved:

      - Some "should" have been replaced with "may".
      
      - It states early that "If S and C can agree on C using X
        directly, S can then omit objects that can be obtained from X
        when answering C's request."

      - It mentions that "pieces of information that are usually
        outside Git's concern, like proxy configuration, must not be
        distributed over this protocol."

  - In patch 4/6, there are also some code changes:

      - redact_non_printables() is used instead of strbuf_sanitize(),
        see changes in patches 1/6 to 3/6 above.

      - strbuf_strip_suffix() is used instead of
        strbuf_trim_trailing_ch(), see the removal of patch 2/5 in
        version 3 mentioned above.

      - strbuf_split() is used instead of strbuf_split_str() when
        possible to simplify the code a bit.

  - In patch 4/6, there is also a small change in the tests. In t5710,
    testing with the multi pack index, and especially its incremental
    write, is disabled. An issue has been found between the setup code
    in this test script and the multi pack index incremental write.

  - In patch 6/6 (doc: add technical design doc for large object
    promisors) there are a number of changes:

      - "aim to" is used more often to better outline the direction of
        the effort. And in general some similarly small changes have
        been made to make the document more assertive.

      - The "0) Non goal" section has been improved to mention that we
        want to focus for now on using existing object storage
        solutions accessed through remote helpers, and that we don't
        want to discuss data transfer improvements between LOPs and
        clients or servers.

      - A few typos, grammos and such have been fixed.

      - Examples of existing remote helpers to access existing object
        storage solutions have been added.

      - A note has been improved to mention that a protocol
        negotiation might not always happen when fetching.

      - A new note has been added about clients offloading objects
        they created directly to a LOP.

      - A new "V) Future improvements" section has been added.

Thanks to Junio, Patrick, Eric, Karthik, Kristoffer, brian, Randall
and Taylor for their suggestions to improve this patch series.

CI tests
~~~~~~~~

All the CI tests passed, see:

https://github.com/chriscool/git/actions/runs/12989763108

Range diff compared to version 3
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1:  13dd730641 < -:  ---------- version: refactor strbuf_sanitize()
2:  8f2aecf6a1 < -:  ---------- strbuf: refactor strbuf_trim_trailing_ch()
-:  ---------- > 1:  9e646013be version: replace manual ASCII checks with isprint() for clarity
-:  ---------- > 2:  f4b22ef39d version: refactor redact_non_printables()
-:  ---------- > 3:  8bfa6f7a20 version: make redact_non_printables() non-static
3:  57e1481bc4 ! 4:  652ce32892 Add 'promisor-remote' capability to protocol v2
    @@ Commit message
     
         When a server S knows that some objects from a repository are available
         from a promisor remote X, S might want to suggest to a client C cloning
    -    or fetching the repo from S that C should use X directly instead of S
    -    for these objects.
    +    or fetching the repo from S that C may use X directly instead of S for
    +    these objects.
     
         Note that this could happen both in the case S itself doesn't have the
         objects and borrows them from X, and in the case S has the objects but
    @@ Commit message
         omit in its response the objects available on X, is left for future
         improvement though.
     
    -    Then C might or might not, want to get the objects from X, and should
    -    let S know about this.
    +    Then C might or might not, want to get the objects from X. If S and C
    +    can agree on C using X directly, S can then omit objects that can be
    +    obtained from X when answering C's request.
     
         To allow S and C to agree and let each other know about C using X or
         not, let's introduce a new "promisor-remote" capability in the
    @@ Commit message
     
         For now, the URL is passed in addition to the name. In the future, it
         might be possible to pass other information like a filter-spec that the
    -    client should use when cloning from S, or a token that the client should
    -    use when retrieving objects from X.
    +    client may use when cloning from S, or a token that the client may use
    +    when retrieving objects from X.
    +
    +    It is C's responsibility to arrange how it can reach X though, so pieces
    +    of information that are usually outside Git's concern, like proxy
    +    configuration, must not be distributed over this protocol.
     
         It might also be possible in the future for "promisor.advertise" to have
         other values. For example a value like "onlyName" could prevent S from
    @@ promisor-remote.c
      #include "packfile.h"
      #include "environment.h"
     +#include "url.h"
    ++#include "version.h"
      
      struct promisor_remote_config {
        struct promisor_remote *promisors;
    @@ promisor-remote.c: void promisor_remote_get_direct(struct repository *repo,
     +          }
     +  }
     +
    -+  strbuf_sanitize(&sb);
    ++  redact_non_printables(&sb);
     +
     +  strvec_clear(&names);
     +  strvec_clear(&urls);
    @@ promisor-remote.c: void promisor_remote_get_direct(struct repository *repo,
     +          char *decoded_name = NULL;
     +          char *decoded_url = NULL;
     +
    -+          strbuf_trim_trailing_ch(remotes[i], ';');
    -+          elems = strbuf_split_str(remotes[i]->buf, ',', 0);
    ++          strbuf_strip_suffix(remotes[i], ";");
    ++          elems = strbuf_split(remotes[i], ',');
     +
     +          for (size_t j = 0; elems[j]; j++) {
     +                  int res;
    -+                  strbuf_trim_trailing_ch(elems[j], ',');
    ++                  strbuf_strip_suffix(elems[j], ",");
     +                  res = skip_prefix(elems[j]->buf, "name=", &remote_name) ||
     +                          skip_prefix(elems[j]->buf, "url=", &remote_url);
     +                  if (!res)
    @@ promisor-remote.c: void promisor_remote_get_direct(struct repository *repo,
     +          struct promisor_remote *p;
     +          char *decoded_remote;
     +
    -+          strbuf_trim_trailing_ch(accepted_remotes[i], ';');
    ++          strbuf_strip_suffix(accepted_remotes[i], ";");
     +          decoded_remote = url_percent_decode(accepted_remotes[i]->buf);
     +
     +          p = repo_promisor_remote_find(r, decoded_remote);
    @@ serve.c: static struct protocol_capability capabilities[] = {
     +  },
      };
      
    - void protocol_v2_advertise_capabilities(void)
    + void protocol_v2_advertise_capabilities(struct repository *r)
    +
    + ## t/meson.build ##
    +@@ t/meson.build: integration_tests = [
    +   't5703-upload-pack-ref-in-want.sh',
    +   't5704-protocol-violations.sh',
    +   't5705-session-id-in-capabilities.sh',
    ++  't5710-promisor-remote-capability.sh',
    +   't5730-protocol-v2-bundle-uri-file.sh',
    +   't5731-protocol-v2-bundle-uri-git.sh',
    +   't5732-protocol-v2-bundle-uri-http.sh',
     
      ## t/t5710-promisor-remote-capability.sh (new) ##
     @@
    @@ t/t5710-promisor-remote-capability.sh (new)
     +
     +. ./test-lib.sh
     +
    ++GIT_TEST_MULTI_PACK_INDEX=0
    ++GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0
    ++
     +# Setup the repository with three commits, this way HEAD is always
     +# available and we can hide commit 1 or 2.
     +test_expect_success 'setup: create "template" repository' '
4:  7fcc619e41 = 5:  979a0af1c3 promisor-remote: check advertised name or URL
5:  c25c94707f ! 6:  3a0c134e09 doc: add technical design doc for large object promisors
    @@ Documentation/technical/large-object-promisors.txt (new)
     +effort described in this document to add a set of features to make it
     +easier to handle large blobs/files in Git by using LOPs.
     +
    -+This effort would especially improve things on the server side, and
    ++This effort aims to especially improve things on the server side, and
     +especially for large blobs that are already compressed in a binary
     +format.
     +
    -+This effort could help provide an alternative to Git LFS
    ++This effort aims to provide an alternative to Git LFS
     +(https://git-lfs.com/) and similar tools like git-annex
     +(https://git-annex.branchable.com/) for handling large files, even
     +though a complete alternative would very likely require other efforts
    @@ Documentation/technical/large-object-promisors.txt (new)
     +efforts could also improve the situation on the client side.
     +
     +- In the same way, we are not going to discuss all the possible ways
    -+  to implement a LOP or their underlying object storage.
    ++  to implement a LOP or their underlying object storage, or to
    ++  optimize how LOP works.
     ++
    -+In particular we are not going to discuss pluggable ODBs or other
    ++Our opinion is that the simplest solution for now is for LOPs to use
    ++object storage through a remote helper (see section II.2 below for
    ++more details) to store their objects. So we consider that this is the
    ++default implementation. If there are improvements on top of this,
    ++that's great, but our opinion is that such improvements are not
    ++necessary for LOPs to already be useful. Such improvements are likely
    ++a different technical topic, and can be taken care of separately
    ++anyway.
    +++
    ++So in particular we are not going to discuss pluggable ODBs or other
     +object database backends that could chunk large blobs, dedup the
     +chunks and store them efficiently. Sure, that would be a nice
     +improvement to store large blobs on the server side, but we believe
     +it can just be a separate effort as it's also not technically very
     +related to this effort.
    +++
    ++We are also not going to discuss data transfer improvements between
    ++LOPs and clients or servers. Sure, there might be some easy and very
    ++effective optimizations there (as we know that objects on LOPs are
    ++very likely incompressible and not deltifying well), but this can be
    ++dealt with separately in a separate effort.
     +
     +In other words, the goal of this document is not to talk about all the
     +possible ways to optimize how Git could handle large blobs, but to
    -+describe how a LOP based solution could work well and alleviate a
    -+number of current issues in the context of Git clients and servers
    ++describe how a LOP based solution can already work well and alleviate
    ++a number of current issues in the context of Git clients and servers
     +sharing Git objects.
     +
     +I) Issues with the current situation
    @@ Documentation/technical/large-object-promisors.txt (new)
     +
     +Also each feature doesn't need to be implemented entirely in Git
     +itself. Some could be scripts, hooks or helpers that are not part of
    -+the Git repo. It could be helpful if those could be shared and
    -+improved on collaboratively though.
    ++the Git repo. It would be helpful if those could be shared and
    ++improved on collaboratively though. So we want to encourage sharing
    ++them.
     +
     +1) Large blobs are stored on LOPs
     +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    @@ Documentation/technical/large-object-promisors.txt (new)
     +Rationale
     ++++++++++
     +
    -+LOP remotes should be good at handling large blobs while main remotes
    -+should be good at handling other objects.
    ++LOPs aim to be good at handling large blobs while main remotes are
    ++already good at handling other objects.
     +
     +Implementation
     +++++++++++++++
    @@ Documentation/technical/large-object-promisors.txt (new)
     +2) LOPs can use object storage
     +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     +
    -+A LOP could be using object storage, like an Amazon S3 or GCP Bucket
    -+to actually store the large blobs, and could be accessed through a Git
    ++LOPs can be implemented using object storage, like an Amazon S3 or GCP
    ++Bucket or MinIO (which is open source under the GNU AGPLv3 license) to
    ++actually store the large blobs, and can be accessed through a Git
     +remote helper (see linkgit:gitremote-helpers[7]) which makes the
    -+underlying object storage appears like a remote to Git.
    ++underlying object storage appear like a remote to Git.
     +
     +Note
     +++++
     +
    -+A LOP could be a promisor remote accessed using a remote helper by
    ++A LOP can be a promisor remote accessed using a remote helper by
     +both some clients and the main remote.
     +
     +Rationale
    @@ Documentation/technical/large-object-promisors.txt (new)
     +be more efficient and maintainable to write them using other languages
     +like Go.
     +
    ++Some already exist under open source licenses, for example:
    ++
    ++  - https://github.com/awslabs/git-remote-s3
    ++  - https://gitlab.com/eric.p.ju/git-remote-gs
    ++
     +Other ways to implement LOPs are certainly possible, but the goal of
     +this document is not to discuss how to best implement a LOP or its
     +underlying object storage (see the "0) Non goals" section above).
    @@ Documentation/technical/large-object-promisors.txt (new)
     +++++++++++++++
     +
     +The way to offload to a LOP discussed in 4) above can be used to
    -+regularly offload oversize blobs. About preventing oversize blobs to
    -+be fetched into the repo see 6) below. About preventing oversize blob
    -+pushes, a pre-receive hook could be used.
    ++regularly offload oversize blobs. About preventing oversize blobs from
    ++being fetched into the repo see 6) below. About preventing oversize
    ++blob pushes, a pre-receive hook could be used.
     +
     +Also there are different scenarios in which large blobs could get
     +fetched into the main remote, for example:
    @@ Documentation/technical/large-object-promisors.txt (new)
     +It might not be possible to completely prevent all these scenarios
     +from happening. So the goal here should be to implement features that
     +make the fetching of large blobs less likely. For example adding a
    -+`remote-object-info` command in the `git cat-file --batch*` protocol
    -+might make it possible for a main repo to respond to some requests
    -+about large blobs without fetching them.
    ++`remote-object-info` command in the `git cat-file --batch` protocol
    ++and its variants might make it possible for a main repo to respond to
    ++some requests about large blobs without fetching them.
     +
     +6) A protocol negotiation should happen when a client clones
     +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    @@ Documentation/technical/large-object-promisors.txt (new)
     +Note
     +++++
     +
    -+For fetches instead of clones, see the "What about fetches?" FAQ entry
    -+below.
    ++For fetches instead of clones, a protocol negotiation might not always
    ++happen, see the "What about fetches?" FAQ entry below for details.
     +
     +Rationale
     ++++++++++
    @@ Documentation/technical/large-object-promisors.txt (new)
     +Information that the server could send to the client through that
     +protocol could be things like: LOP name, LOP URL, filter-spec (for
     +example `blob:limit=<size>`) or just size limit that should be used as
    -+a filter when cloning, token to be used with the LOP, etc..
    ++a filter when cloning, token to be used with the LOP, etc.
     +
     +7) A client can offload to a LOP
     +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    @@ Documentation/technical/large-object-promisors.txt (new)
     +the client should be able to offload some large blobs it has fetched,
     +but might not need anymore, to the LOP.
     +
    ++Note
    ++++++
    ++
    ++It might depend on the context if it should be OK or not for clients
    ++to offload large blobs they have created, instead of fetched, directly
    ++to the LOP without the main remote checking them in some ways
    ++(possibly using hooks or other tools).
    ++
     +Rationale
     ++++++++++
     +
    @@ Documentation/technical/large-object-promisors.txt (new)
     +What about using multiple LOPs on the server and client side?
     +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     +
    -+That could perhaps be useful in some cases, but it's more likely for
    -+now than in most cases a single LOP will be advertised by the server
    -+and should be used by the client.
    ++That could perhaps be useful in some cases, but for now it's more
    ++likely that in most cases a single LOP will be advertised by the
    ++server and should be used by the client.
     +
     +A case where it could be useful for a server to advertise multiple
     +LOPs is if a LOP is better for some users while a different LOP is
    @@ Documentation/technical/large-object-promisors.txt (new)
     +is likely to be better connected to them, while users in other parts
     +of the world should pick only LOP B for the same reason."
     +
    -+Trusting the LOPs advertised by the server, or not trusting them?
    -+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    ++When should we trust or not trust the LOPs advertised by the server?
    ++~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     +
     +In some contexts, like in corporate setup where the server and all the
     +clients are parts of an internal network in a company where admins
    -+have all the rights on every system, it's Ok, and perhaps even a good
    ++have all the rights on every system, it's OK, and perhaps even a good
     +thing, if the clients fully trust the server, as it can help ensure
     +that all the clients are on the same page.
     +
    @@ Documentation/technical/large-object-promisors.txt (new)
     +from the client when it fetches from them. The client could get the
     +token when performing a protocol negotiation with the main remote (see
     +section II.6 above).
    ++
    ++V) Future improvements
    ++----------------------
    ++
    ++It is expected that at the beginning using LOPs will be mostly worth
    ++it either in a corporate context where the Git version that clients
    ++use can easily be controlled, or on repos that are infrequently
    ++accessed. (See the "Could the main remote be bogged down by old or
    ++paranoid clients?" section in the FAQ above.)
    ++
    ++Over time, as more and more clients upgrade to a version that
    ++implements the "promisor-remote" protocol v2 capability described
    ++above in section II.6), it will be worth it to use LOPs more widely.
    ++
    ++A lot of improvements may also help using LOPs more widely. Some of
    ++these improvements are part of the scope of this document like the
    ++following:
    ++
    ++  - Implementing a "remote-object-info" command in the
    ++    `git cat-file --batch` protocol and its variants to allow main
    ++    remotes to respond to requests about large blobs without fetching
    ++    them. (Eric Ju has started working on this based on previous work
    ++    by Calvin Wan.)
    ++
    ++  - Creating better cleanup and offload mechanisms for main remotes
    ++    and clients to prevent accumulation of large blobs.
    ++
    ++  - Developing more sophisticated protocol negotiation capabilities
    ++    between clients and servers for handling LOPs, for example adding
    ++    a filter-spec (e.g., blob:limit=<size>) or size limit for
    ++    filtering when cloning, or adding a token for LOP authentication.
    ++
    ++  - Improving security measures for LOP access, particularly around
    ++    token handling and authentication.
    ++
    ++  - Developing standardized ways to configure and manage multiple LOPs
    ++    across different environments. Especially in the case where
    ++    different LOPs serve the same content to clients in different
    ++    geographical locations, there is a need for replication or
    ++    synchronization between LOPs.
    ++
    ++Some improvements, including some that have been mentioned in the "0)
    ++Non Goals" section of this document, are out of the scope of this
    ++document:
    ++
    ++  - Implementing a new object representation for large blobs on the
    ++    client side.
    ++
    ++  - Developing pluggable ODBs or other object database backends that
    ++    could chunk large blobs, dedup the chunks and store them
    ++    efficiently.
    ++
    ++  - Optimizing data transfer between LOPs and clients/servers,
    ++    particularly for incompressible and non-deltifying content.
    ++
    ++  - Creating improved client side tools for managing large objects
    ++    more effectively, for example tools for migrating from Git LFS or
    ++    git-annex, or tools to find which objects could be offloaded and
    ++    how much disk space could be reclaimed by offloading them.
    ++
    ++Some improvements could be seen as part of the scope of this document,
    ++but might already have their own separate projects from the Git
    ++project, like:
    ++
    ++  - Improving existing remote helpers to access object storage or
    ++    developing new ones.
    ++
    ++  - Improving existing object storage solutions or developing new
    ++    ones.
    ++
    ++Even though all the above improvements may help, this document and the
    ++LOP effort should try to focus, at least first, on a relatively small
    ++number of improvements mostly those that are in its current scope.
    ++
    ++For example introducing pluggable ODBs and a new object database
    ++backend is likely a multi-year effort on its own that can happen
    ++separately in parallel. It has different technical requirements,
    ++touches other part of the Git code base and should have its own design
    ++document(s).


Christian Couder (4):
  version: make redact_non_printables() non-static
  Add 'promisor-remote' capability to protocol v2
  promisor-remote: check advertised name or URL
  doc: add technical design doc for large object promisors

Usman Akinyemi (2):
  version: replace manual ASCII checks with isprint() for clarity
  version: refactor redact_non_printables()

 Documentation/config/promisor.txt             |  27 +
 Documentation/gitprotocol-v2.txt              |  54 ++
 .../technical/large-object-promisors.txt      | 640 ++++++++++++++++++
 connect.c                                     |   9 +
 promisor-remote.c                             | 244 +++++++
 promisor-remote.h                             |  36 +-
 serve.c                                       |  26 +
 t/meson.build                                 |   1 +
 t/t5710-promisor-remote-capability.sh         | 312 +++++++++
 upload-pack.c                                 |   3 +
 version.c                                     |  18 +-
 version.h                                     |   8 +
 12 files changed, 1371 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/technical/large-object-promisors.txt
 create mode 100755 t/t5710-promisor-remote-capability.sh

-- 
2.46.0.rc0.95.gcbf174a634



* [PATCH v4 1/6] version: replace manual ASCII checks with isprint() for clarity
  2025-01-27 15:16     ` [PATCH v4 0/6] " Christian Couder
@ 2025-01-27 15:16       ` Christian Couder
  2025-01-27 15:16       ` [PATCH v4 2/6] version: refactor redact_non_printables() Christian Couder
                         ` (6 subsequent siblings)
  7 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:16 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
	Randall S . Becker, Usman Akinyemi, Christian Couder

From: Usman Akinyemi <usmanakinyemi202@gmail.com>

Since the isprint() function checks for printable characters, let's
replace the existing hardcoded ASCII checks with it. However, since
the original checks also handled spaces, we need to account for spaces
explicitly in the new check.

Mentored-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Usman Akinyemi <usmanakinyemi202@gmail.com>
---
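The equivalence claimed above can be checked exhaustively. The following
is only a sketch: sketch_isprint() is a stand-in written for this note,
assumed to match Git's locale-independent isprint() (0x20..0x7e), and the
other names are hypothetical helpers, not part of the patch.

```c
#include <limits.h>

/* Stand-in for Git's isprint(); assumed to accept exactly 0x20..0x7e. */
static int sketch_isprint(int c)
{
	return c >= 0x20 && c <= 0x7e;
}

/* The check removed by this patch. */
static int old_needs_redaction(char c)
{
	return c <= 32 || c >= 127;
}

/* The check added by this patch. */
static int new_needs_redaction(char c)
{
	return !sketch_isprint(c) || c == ' ';
}

/* Count the char values on which the two checks disagree. */
static int count_mismatches(void)
{
	int mismatches = 0;

	for (int v = CHAR_MIN; v <= CHAR_MAX; v++) {
		char c = (char)v;
		if (old_needs_redaction(c) != new_needs_redaction(c))
			mismatches++;
	}
	return mismatches;
}
```

Under that assumption the two checks agree for every char value, whether
plain char is signed or unsigned, which is why the explicit space test is
needed in the new condition.
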
 version.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/version.c b/version.c
index 4786c4e0a5..c9192a5beb 100644
--- a/version.c
+++ b/version.c
@@ -1,6 +1,7 @@
 #include "git-compat-util.h"
 #include "version.h"
 #include "strbuf.h"
+#include "sane-ctype.h"
 
 #ifndef GIT_VERSION_H
 # include "version-def.h"
@@ -34,7 +35,7 @@ const char *git_user_agent_sanitized(void)
 		strbuf_addstr(&buf, git_user_agent());
 		strbuf_trim(&buf);
 		for (size_t i = 0; i < buf.len; i++) {
-			if (buf.buf[i] <= 32 || buf.buf[i] >= 127)
+			if (!isprint(buf.buf[i]) || buf.buf[i] == ' ')
 				buf.buf[i] = '.';
 		}
 		agent = buf.buf;
-- 
2.46.0.rc0.95.gcbf174a634



* [PATCH v4 2/6] version: refactor redact_non_printables()
  2025-01-27 15:16     ` [PATCH v4 0/6] " Christian Couder
  2025-01-27 15:16       ` [PATCH v4 1/6] version: replace manual ASCII checks with isprint() for clarity Christian Couder
@ 2025-01-27 15:16       ` Christian Couder
  2025-01-27 15:16       ` [PATCH v4 3/6] version: make redact_non_printables() non-static Christian Couder
                         ` (5 subsequent siblings)
  7 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:16 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
	Randall S . Becker, Usman Akinyemi, Christian Couder

From: Usman Akinyemi <usmanakinyemi202@gmail.com>

The git_user_agent_sanitized() function performs some sanitizing to
avoid special characters being sent over the line and possibly
interfering with the protocol or with the parsing on the other side.

Let's extract this sanitizing into a new redact_non_printables() function,
as we will want to reuse it in a following patch.

For now the new redact_non_printables() function is still static as
it's only needed locally.

While at it, let's use strbuf_detach() to explicitly detach the string
contained by the 'buf' strbuf.

Mentored-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Usman Akinyemi <usmanakinyemi202@gmail.com>
---
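For readers without the strbuf API at hand, the behaviour being extracted
can be sketched on a plain NUL-terminated buffer. This is an illustration
only: sketch_is_space() and sketch_redact_non_printables() are
hypothetical names, and the whitespace set (space, tab, LF, CR) is an
assumption about what strbuf_trim() strips.

```c
#include <string.h>

/* Assumed whitespace set used when trimming. */
static int sketch_is_space(char c)
{
	return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/* Trim 's' in place, then replace every remaining byte <= 32 or >= 127 by '.'. */
static void sketch_redact_non_printables(char *s)
{
	size_t len = strlen(s);
	size_t start = 0;

	/* Trim trailing, then leading whitespace. */
	while (len && sketch_is_space(s[len - 1]))
		len--;
	while (start < len && sketch_is_space(s[start]))
		start++;
	memmove(s, s + start, len - start);
	len -= start;
	s[len] = '\0';

	/* Redact what is left, including inner spaces. */
	for (size_t i = 0; i < len; i++)
		if (s[i] <= 32 || s[i] >= 127)
			s[i] = '.';
}
```

Trimming happens before redaction, so leading and trailing whitespace is
dropped rather than turned into dots, while inner spaces become dots.
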
 version.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/version.c b/version.c
index c9192a5beb..4f37b4499d 100644
--- a/version.c
+++ b/version.c
@@ -12,6 +12,19 @@
 const char git_version_string[] = GIT_VERSION;
 const char git_built_from_commit_string[] = GIT_BUILT_FROM_COMMIT;
 
+/*
+ * Trim and replace each character with ascii code below 32 or above
+ * 127 (included) using a dot '.' character.
+ */
+static void redact_non_printables(struct strbuf *buf)
+{
+	strbuf_trim(buf);
+	for (size_t i = 0; i < buf->len; i++) {
+		if (!isprint(buf->buf[i]) || buf->buf[i] == ' ')
+			buf->buf[i] = '.';
+	}
+}
+
 const char *git_user_agent(void)
 {
 	static const char *agent = NULL;
@@ -33,12 +46,8 @@ const char *git_user_agent_sanitized(void)
 		struct strbuf buf = STRBUF_INIT;
 
 		strbuf_addstr(&buf, git_user_agent());
-		strbuf_trim(&buf);
-		for (size_t i = 0; i < buf.len; i++) {
-			if (!isprint(buf.buf[i]) || buf.buf[i] == ' ')
-				buf.buf[i] = '.';
-		}
-		agent = buf.buf;
+		redact_non_printables(&buf);
+		agent = strbuf_detach(&buf, NULL);
 	}
 
 	return agent;
-- 
2.46.0.rc0.95.gcbf174a634



* [PATCH v4 3/6] version: make redact_non_printables() non-static
  2025-01-27 15:16     ` [PATCH v4 0/6] " Christian Couder
  2025-01-27 15:16       ` [PATCH v4 1/6] version: replace manual ASCII checks with isprint() for clarity Christian Couder
  2025-01-27 15:16       ` [PATCH v4 2/6] version: refactor redact_non_printables() Christian Couder
@ 2025-01-27 15:16       ` Christian Couder
  2025-01-30 10:51         ` Patrick Steinhardt
  2025-01-27 15:16       ` [PATCH v4 4/6] Add 'promisor-remote' capability to protocol v2 Christian Couder
                         ` (4 subsequent siblings)
  7 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:16 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
	Randall S . Becker, Christian Couder

As we are going to reuse redact_non_printables() outside "version.c",
let's make it non-static.
---
 version.c | 6 +-----
 version.h | 8 ++++++++
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/version.c b/version.c
index 4f37b4499d..77423fcaf3 100644
--- a/version.c
+++ b/version.c
@@ -12,11 +12,7 @@
 const char git_version_string[] = GIT_VERSION;
 const char git_built_from_commit_string[] = GIT_BUILT_FROM_COMMIT;
 
-/*
- * Trim and replace each character with ascii code below 32 or above
- * 127 (included) using a dot '.' character.
- */
-static void redact_non_printables(struct strbuf *buf)
+void redact_non_printables(struct strbuf *buf)
 {
 	strbuf_trim(buf);
 	for (size_t i = 0; i < buf->len; i++) {
diff --git a/version.h b/version.h
index 7c62e80577..fcc1816685 100644
--- a/version.h
+++ b/version.h
@@ -4,7 +4,15 @@
 extern const char git_version_string[];
 extern const char git_built_from_commit_string[];
 
+struct strbuf;
+
 const char *git_user_agent(void);
 const char *git_user_agent_sanitized(void);
 
+/*
+ * Trim and replace each character with ascii code below 32 or above
+ * 127 (included) using a dot '.' character.
+ */
+void redact_non_printables(struct strbuf *buf);
+
 #endif /* VERSION_H */
-- 
2.46.0.rc0.95.gcbf174a634



* [PATCH v4 4/6] Add 'promisor-remote' capability to protocol v2
  2025-01-27 15:16     ` [PATCH v4 0/6] " Christian Couder
                         ` (2 preceding siblings ...)
  2025-01-27 15:16       ` [PATCH v4 3/6] version: make redact_non_printables() non-static Christian Couder
@ 2025-01-27 15:16       ` Christian Couder
  2025-01-30 10:51         ` Patrick Steinhardt
  2025-01-27 15:17       ` [PATCH v4 5/6] promisor-remote: check advertised name or URL Christian Couder
                         ` (3 subsequent siblings)
  7 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:16 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
	Randall S . Becker, Christian Couder, Christian Couder

When a server S knows that some objects from a repository are available
from a promisor remote X, S might want to suggest to a client C cloning
or fetching the repo from S that C may use X directly instead of S for
these objects.

Note that this could happen both in the case S itself doesn't have the
objects and borrows them from X, and in the case S has the objects but
knows that X is better connected to the world (e.g., it is in a
$LARGEINTERNETCOMPANY datacenter with petabit/s backbone connections)
than S. Implementation of the latter case, which would require S to
omit in its response the objects available on X, is left for future
improvement though.

Then C might or might not want to get the objects from X. If S and C
can agree on C using X directly, S can then omit objects that can be
obtained from X when answering C's request.

To allow S and C to agree and let each other know about C using X or
not, let's introduce a new "promisor-remote" capability in the
protocol v2, as well as a few new configuration variables:

  - "promisor.advertise" on the server side, and:
  - "promisor.acceptFromServer" on the client side.

By default, or if "promisor.advertise" is set to 'false', a server S will
not advertise the "promisor-remote" capability.

If S doesn't advertise the "promisor-remote" capability, then a client C
replying to S shouldn't advertise the "promisor-remote" capability
either.

If "promisor.advertise" is set to 'true', S will advertise its promisor
remotes with a string like:

  promisor-remote=<pr-info>[;<pr-info>]...

where each <pr-info> element contains information about a single
promisor remote in the form:

  name=<pr-name>[,url=<pr-url>]

where <pr-name> is the urlencoded name of a promisor remote and
<pr-url> is the urlencoded URL of the promisor remote named <pr-name>.
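
As an illustration of the format above, here is a minimal sketch of how
one <pr-info> element could be built with fixed-size buffers. All the
sketch_* names are hypothetical. The set of characters left unencoded
mirrors allow_unsanitized() from this patch, and the lowercase "%xx"
output is an assumption about what strbuf_addstr_urlencode() emits.

```c
#include <stdio.h>
#include <string.h>

/* Mirrors allow_unsanitized(): ',', ';' and '%' are always encoded. */
static int sketch_allowed(char ch)
{
	if (ch == ',' || ch == ';' || ch == '%')
		return 0;
	return ch > 32 && ch < 127;
}

/* Append 's' verbatim; returns -1 if it does not fit in 'out'. */
static int sketch_append_raw(char *out, size_t cap, size_t *len, const char *s)
{
	size_t n = strlen(s);

	if (*len + n + 1 > cap)
		return -1;
	memcpy(out + *len, s, n + 1);
	*len += n;
	return 0;
}

/* Append the percent-encoded form of 'in'; returns -1 if it does not fit. */
static int sketch_append_encoded(char *out, size_t cap, size_t *len, const char *in)
{
	for (; *in; in++) {
		if (sketch_allowed(*in)) {
			if (*len + 2 > cap)
				return -1;
			out[(*len)++] = *in;
		} else {
			if (*len + 4 > cap)
				return -1;
			snprintf(out + *len, 4, "%%%02x", (unsigned char)*in);
			*len += 3;
		}
		out[*len] = '\0';
	}
	return 0;
}

/* Format "name=<pr-name>[,url=<pr-url>]"; 'url' may be NULL. */
static int sketch_format_pr_info(char *out, size_t cap, const char *name, const char *url)
{
	size_t len = 0;

	if (!cap)
		return -1;
	out[0] = '\0';
	if (sketch_append_raw(out, cap, &len, "name=") ||
	    sketch_append_encoded(out, cap, &len, name))
		return -1;
	if (url &&
	    (sketch_append_raw(out, cap, &len, ",url=") ||
	     sketch_append_encoded(out, cap, &len, url)))
		return -1;
	return 0;
}
```

Because ',' and ';' are always encoded inside names and URLs, the
receiver can split on them without ambiguity.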

For now, the URL is passed in addition to the name. In the future, it
might be possible to pass other information like a filter-spec that the
client may use when cloning from S, or a token that the client may use
when retrieving objects from X.

It is C's responsibility to arrange how it can reach X though, so pieces
of information that are usually outside Git's concern, like proxy
configuration, must not be distributed over this protocol.

It might also be possible in the future for "promisor.advertise" to have
other values. For example a value like "onlyName" could prevent S from
advertising URLs, which could help in case C should use a different URL
for X than the URL S is using. (The URL S is using might be an internal
one on the server side for example.)

By default, or if "promisor.acceptFromServer" is set to "None", C will
not accept any of the promisor remotes that S might have advertised.
In this case, C will not advertise any "promisor-remote" capability in
its reply to S.

If "promisor.acceptFromServer" is set to "All" and S advertised some
promisor remotes, then, on the contrary, C will accept all the promisor
remotes that S advertised and will reply with a string like:

  promisor-remote=<pr-name>[;<pr-name>]...

where the <pr-name> elements are the urlencoded names of all the
promisor remotes S advertised.
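
A rough sketch of the "All" case follows: it walks an advertisement and
collects the names into a reply string. The function name is
hypothetical, and it makes two simplifications compared to this patch.
It passes the still-encoded names through unchanged, whereas the patch
decodes and re-encodes them, and it keeps only the first "name=" element
of each <pr-info>.

```c
#include <string.h>

/*
 * Build "<pr-name>[;<pr-name>]..." from "<pr-info>[;<pr-info>]...".
 * Returns the number of accepted names, or -1 if 'out' is too small.
 */
static int sketch_reply_accept_all(const char *info, char *out, size_t cap)
{
	size_t len = 0;
	int accepted = 0;

	if (!cap)
		return -1;
	out[0] = '\0';

	while (*info) {
		/* One <pr-info> runs up to the next ';' or the end. */
		const char *end = info + strcspn(info, ";");
		const char *elem = info;

		while (elem < end) {
			/* One element runs up to the next ',' in this <pr-info>. */
			size_t elem_len = 0;

			while (elem + elem_len < end && elem[elem_len] != ',')
				elem_len++;

			if (elem_len > 5 && !strncmp(elem, "name=", 5)) {
				size_t name_len = elem_len - 5;
				size_t need = name_len + (accepted ? 1 : 0);

				if (len + need + 1 > cap)
					return -1;
				if (accepted)
					out[len++] = ';';
				memcpy(out + len, elem + 5, name_len);
				len += name_len;
				out[len] = '\0';
				accepted++;
				break;
			}
			elem += elem_len;
			if (elem < end)
				elem++; /* skip ',' */
		}

		info = end;
		if (*info == ';')
			info++;
	}
	return accepted;
}
```

A return value of 0 corresponds to the case where C sends no
"promisor-remote" capability at all in its reply.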

In a following commit, other values for "promisor.acceptFromServer" will
be implemented, so that C will be able to decide the promisor remotes it
accepts depending on the name and URL it received from S. So even if
that name and URL information is not used much right now, it will be
needed soon.

Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/promisor.txt     |  17 ++
 Documentation/gitprotocol-v2.txt      |  54 ++++++
 connect.c                             |   9 +
 promisor-remote.c                     | 196 +++++++++++++++++++++
 promisor-remote.h                     |  36 +++-
 serve.c                               |  26 +++
 t/meson.build                         |   1 +
 t/t5710-promisor-remote-capability.sh | 244 ++++++++++++++++++++++++++
 upload-pack.c                         |   3 +
 9 files changed, 585 insertions(+), 1 deletion(-)
 create mode 100755 t/t5710-promisor-remote-capability.sh

diff --git a/Documentation/config/promisor.txt b/Documentation/config/promisor.txt
index 98c5cb2ec2..9cbfe3e59e 100644
--- a/Documentation/config/promisor.txt
+++ b/Documentation/config/promisor.txt
@@ -1,3 +1,20 @@
 promisor.quiet::
 	If set to "true" assume `--quiet` when fetching additional
 	objects for a partial clone.
+
+promisor.advertise::
+	If set to "true", a server will use the "promisor-remote"
+	capability, see linkgit:gitprotocol-v2[5], to advertise the
+	promisor remotes it is using, if it uses some. Default is
+	"false", which means the "promisor-remote" capability is not
+	advertised.
+
+promisor.acceptFromServer::
+	If set to "all", a client will accept all the promisor remotes
+	a server might advertise using the "promisor-remote"
+	capability. Default is "none", which means no promisor remote
+	advertised by a server will be accepted. By accepting a
+	promisor remote, the client agrees that the server might omit
+	objects that are lazily fetchable from this promisor remote
+	from its responses to "fetch" and "clone" requests from the
+	client. See linkgit:gitprotocol-v2[5].
diff --git a/Documentation/gitprotocol-v2.txt b/Documentation/gitprotocol-v2.txt
index 1652fef3ae..f25a9a6ad8 100644
--- a/Documentation/gitprotocol-v2.txt
+++ b/Documentation/gitprotocol-v2.txt
@@ -781,6 +781,60 @@ retrieving the header from a bundle at the indicated URI, and thus
 save themselves and the server(s) the request(s) needed to inspect the
 headers of that bundle or bundles.
 
+promisor-remote=<pr-infos>
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The server may advertise some promisor remotes it is using or knows
+about to a client which may want to use them as its promisor remotes,
+instead of this repository. In this case <pr-infos> should be of the
+form:
+
+	pr-infos = pr-info | pr-infos ";" pr-info
+
+	pr-info = "name=" pr-name | "name=" pr-name "," "url=" pr-url
+
+where `pr-name` is the urlencoded name of a promisor remote, and
+`pr-url` the urlencoded URL of that promisor remote.
+
+In this case, if the client decides to use one or more promisor
+remotes the server advertised, it can reply with
+"promisor-remote=<pr-names>" where <pr-names> should be of the form:
+
+	pr-names = pr-name | pr-names ";" pr-name
+
+where `pr-name` is the urlencoded name of a promisor remote the server
+advertised and the client accepts.
+
+Note that, everywhere in this document, `pr-name` MUST be a valid
+remote name, and the ';' and ',' characters MUST be encoded if they
+appear in `pr-name` or `pr-url`.
+
+If the server doesn't know any promisor remote that could be good for
+a client to use, or prefers a client not to use any promisor remote it
+uses or knows about, it shouldn't advertise the "promisor-remote"
+capability at all.
+
+In this case, or if the client doesn't want to use any promisor remote
+the server advertised, the client shouldn't advertise the
+"promisor-remote" capability at all in its reply.
+
+The "promisor.advertise" and "promisor.acceptFromServer" configuration
+options can be used on the server and client side respectively to
+control what they advertise or accept respectively. See the
+documentation of these configuration options for more information.
+
+Note that in the future it would be nice if the "promisor-remote"
+protocol capability could be used by the server, when responding to
+`git fetch` or `git clone`, to advertise better-connected remotes that
+the client can use as promisor remotes, instead of this repository, so
+that the client can lazily fetch objects from these other
+better-connected remotes. This would require the server to omit in its
+response the objects available on the better-connected remotes that
+the client has accepted. This hasn't been implemented yet though. So
+for now this "promisor-remote" capability is useful only when the
+server advertises some promisor remotes it already uses to borrow
+objects from.
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/connect.c b/connect.c
index 10fad43e98..7d309c4a7b 100644
--- a/connect.c
+++ b/connect.c
@@ -23,6 +23,7 @@
 #include "protocol.h"
 #include "alias.h"
 #include "bundle-uri.h"
+#include "promisor-remote.h"
 
 static char *server_capabilities_v1;
 static struct strvec server_capabilities_v2 = STRVEC_INIT;
@@ -488,6 +489,7 @@ void check_stateless_delimiter(int stateless_rpc,
 static void send_capabilities(int fd_out, struct packet_reader *reader)
 {
 	const char *hash_name;
+	const char *promisor_remote_info;
 
 	if (server_supports_v2("agent"))
 		packet_write_fmt(fd_out, "agent=%s", git_user_agent_sanitized());
@@ -501,6 +503,13 @@ static void send_capabilities(int fd_out, struct packet_reader *reader)
 	} else {
 		reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
 	}
+	if (server_feature_v2("promisor-remote", &promisor_remote_info)) {
+		char *reply = promisor_remote_reply(promisor_remote_info);
+		if (reply) {
+			packet_write_fmt(fd_out, "promisor-remote=%s", reply);
+			free(reply);
+		}
+	}
 }
 
 int get_remote_bundle_uri(int fd_out, struct packet_reader *reader,
diff --git a/promisor-remote.c b/promisor-remote.c
index c714f4f007..5ac282ed27 100644
--- a/promisor-remote.c
+++ b/promisor-remote.c
@@ -11,6 +11,8 @@
 #include "strvec.h"
 #include "packfile.h"
 #include "environment.h"
+#include "url.h"
+#include "version.h"
 
 struct promisor_remote_config {
 	struct promisor_remote *promisors;
@@ -221,6 +223,18 @@ int repo_has_promisor_remote(struct repository *r)
 	return !!repo_promisor_remote_find(r, NULL);
 }
 
+int repo_has_accepted_promisor_remote(struct repository *r)
+{
+	struct promisor_remote *p;
+
+	promisor_remote_init(r);
+
+	for (p = r->promisor_remote_config->promisors; p; p = p->next)
+		if (p->accepted)
+			return 1;
+	return 0;
+}
+
 static int remove_fetched_oids(struct repository *repo,
 			       struct object_id **oids,
 			       int oid_nr, int to_free)
@@ -292,3 +306,185 @@ void promisor_remote_get_direct(struct repository *repo,
 	if (to_free)
 		free(remaining_oids);
 }
+
+static int allow_unsanitized(char ch)
+{
+	if (ch == ',' || ch == ';' || ch == '%')
+		return 0;
+	return ch > 32 && ch < 127;
+}
+
+static void promisor_info_vecs(struct repository *repo,
+			       struct strvec *names,
+			       struct strvec *urls)
+{
+	struct promisor_remote *r;
+
+	promisor_remote_init(repo);
+
+	for (r = repo->promisor_remote_config->promisors; r; r = r->next) {
+		char *url = NULL;
+		char *url_key = xstrfmt("remote.%s.url", r->name);
+
+		strvec_push(names, r->name);
+		strvec_push(urls, git_config_get_string(url_key, &url) ? NULL : url);
+
+		free(url);
+		free(url_key);
+	}
+}
+
+char *promisor_remote_info(struct repository *repo)
+{
+	struct strbuf sb = STRBUF_INIT;
+	int advertise_promisors = 0;
+	struct strvec names = STRVEC_INIT;
+	struct strvec urls = STRVEC_INIT;
+
+	git_config_get_bool("promisor.advertise", &advertise_promisors);
+
+	if (!advertise_promisors)
+		return NULL;
+
+	promisor_info_vecs(repo, &names, &urls);
+
+	if (!names.nr)
+		return NULL;
+
+	for (size_t i = 0; i < names.nr; i++) {
+		if (i)
+			strbuf_addch(&sb, ';');
+		strbuf_addstr(&sb, "name=");
+		strbuf_addstr_urlencode(&sb, names.v[i], allow_unsanitized);
+		if (urls.v[i]) {
+			strbuf_addstr(&sb, ",url=");
+			strbuf_addstr_urlencode(&sb, urls.v[i], allow_unsanitized);
+		}
+	}
+
+	redact_non_printables(&sb);
+
+	strvec_clear(&names);
+	strvec_clear(&urls);
+
+	return strbuf_detach(&sb, NULL);
+}
+
+enum accept_promisor {
+	ACCEPT_NONE = 0,
+	ACCEPT_ALL
+};
+
+static int should_accept_remote(enum accept_promisor accept,
+				const char *remote_name UNUSED,
+				const char *remote_url UNUSED)
+{
+	if (accept == ACCEPT_ALL)
+		return 1;
+
+	BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+}
+
+static void filter_promisor_remote(struct strvec *accepted, const char *info)
+{
+	struct strbuf **remotes;
+	const char *accept_str;
+	enum accept_promisor accept = ACCEPT_NONE;
+
+	if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
+		if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
+			accept = ACCEPT_NONE;
+		else if (!strcasecmp("All", accept_str))
+			accept = ACCEPT_ALL;
+		else
+			warning(_("unknown '%s' value for '%s' config option"),
+				accept_str, "promisor.acceptfromserver");
+	}
+
+	if (accept == ACCEPT_NONE)
+		return;
+
+	/* Parse remote info received */
+
+	remotes = strbuf_split_str(info, ';', 0);
+
+	for (size_t i = 0; remotes[i]; i++) {
+		struct strbuf **elems;
+		const char *remote_name = NULL;
+		const char *remote_url = NULL;
+		char *decoded_name = NULL;
+		char *decoded_url = NULL;
+
+		strbuf_strip_suffix(remotes[i], ";");
+		elems = strbuf_split(remotes[i], ',');
+
+		for (size_t j = 0; elems[j]; j++) {
+			int res;
+			strbuf_strip_suffix(elems[j], ",");
+			res = skip_prefix(elems[j]->buf, "name=", &remote_name) ||
+				skip_prefix(elems[j]->buf, "url=", &remote_url);
+			if (!res)
+				warning(_("unknown element '%s' from remote info"),
+					elems[j]->buf);
+		}
+
+		if (remote_name)
+			decoded_name = url_percent_decode(remote_name);
+		if (remote_url)
+			decoded_url = url_percent_decode(remote_url);
+
+		if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url))
+			strvec_push(accepted, decoded_name);
+
+		strbuf_list_free(elems);
+		free(decoded_name);
+		free(decoded_url);
+	}
+
+	strbuf_list_free(remotes);
+}
+
+char *promisor_remote_reply(const char *info)
+{
+	struct strvec accepted = STRVEC_INIT;
+	struct strbuf reply = STRBUF_INIT;
+
+	filter_promisor_remote(&accepted, info);
+
+	if (!accepted.nr)
+		return NULL;
+
+	for (size_t i = 0; i < accepted.nr; i++) {
+		if (i)
+			strbuf_addch(&reply, ';');
+		strbuf_addstr_urlencode(&reply, accepted.v[i], allow_unsanitized);
+	}
+
+	strvec_clear(&accepted);
+
+	return strbuf_detach(&reply, NULL);
+}
+
+void mark_promisor_remotes_as_accepted(struct repository *r, const char *remotes)
+{
+	struct strbuf **accepted_remotes = strbuf_split_str(remotes, ';', 0);
+
+	for (size_t i = 0; accepted_remotes[i]; i++) {
+		struct promisor_remote *p;
+		char *decoded_remote;
+
+		strbuf_strip_suffix(accepted_remotes[i], ";");
+		decoded_remote = url_percent_decode(accepted_remotes[i]->buf);
+
+		p = repo_promisor_remote_find(r, decoded_remote);
+		if (p)
+			p->accepted = 1;
+		else
+			warning(_("accepted promisor remote '%s' not found"),
+				decoded_remote);
+
+		free(decoded_remote);
+	}
+
+	strbuf_list_free(accepted_remotes);
+}
diff --git a/promisor-remote.h b/promisor-remote.h
index 88cb599c39..814ca248c7 100644
--- a/promisor-remote.h
+++ b/promisor-remote.h
@@ -9,11 +9,13 @@ struct object_id;
  * Promisor remote linked list
  *
  * Information in its fields come from remote.XXX config entries or
- * from extensions.partialclone.
+ * from extensions.partialclone, except for 'accepted' which comes
+ * from protocol v2 capabilities exchange.
  */
 struct promisor_remote {
 	struct promisor_remote *next;
 	char *partial_clone_filter;
+	unsigned int accepted : 1;
 	const char name[FLEX_ARRAY];
 };
 
@@ -32,4 +34,36 @@ void promisor_remote_get_direct(struct repository *repo,
 				const struct object_id *oids,
 				int oid_nr);
 
+/*
+ * Prepare a "promisor-remote" advertisement by a server.
+ * Check the value of "promisor.advertise" and maybe the configured
+ * promisor remotes, if any, to prepare information to send in an
+ * advertisement.
+ * Return value is NULL if no promisor remote advertisement should be
+ * made. Otherwise it contains the names and urls of the advertised
+ * promisor remotes separated by ';'
+ */
+char *promisor_remote_info(struct repository *repo);
+
+/*
+ * Prepare a reply to a "promisor-remote" advertisement from a server.
+ * Check the value of "promisor.acceptfromserver" and maybe the
+ * configured promisor remotes, if any, to prepare the reply.
+ * Return value is NULL if no promisor remote from the server
+ * is accepted. Otherwise it contains the names of the accepted promisor
+ * remotes separated by ';'.
+ */
+char *promisor_remote_reply(const char *info);
+
+/*
+ * Set the 'accepted' flag for some promisor remotes. Useful when some
+ * promisor remotes have been accepted by the client.
+ */
+void mark_promisor_remotes_as_accepted(struct repository *repo, const char *remotes);
+
+/*
+ * Has any promisor remote been accepted by the client?
+ */
+int repo_has_accepted_promisor_remote(struct repository *r);
+
 #endif /* PROMISOR_REMOTE_H */
diff --git a/serve.c b/serve.c
index f6dfe34a2b..e3ccf1505c 100644
--- a/serve.c
+++ b/serve.c
@@ -10,6 +10,7 @@
 #include "upload-pack.h"
 #include "bundle-uri.h"
 #include "trace2.h"
+#include "promisor-remote.h"
 
 static int advertise_sid = -1;
 static int advertise_object_info = -1;
@@ -29,6 +30,26 @@ static int agent_advertise(struct repository *r UNUSED,
 	return 1;
 }
 
+static int promisor_remote_advertise(struct repository *r,
+				     struct strbuf *value)
+{
+	if (value) {
+		char *info = promisor_remote_info(r);
+		if (!info)
+			return 0;
+		strbuf_addstr(value, info);
+		free(info);
+	}
+	return 1;
+}
+
+static void promisor_remote_receive(struct repository *r,
+				    const char *remotes)
+{
+	mark_promisor_remotes_as_accepted(r, remotes);
+}
+
+
 static int object_format_advertise(struct repository *r,
 				   struct strbuf *value)
 {
@@ -155,6 +176,11 @@ static struct protocol_capability capabilities[] = {
 		.advertise = bundle_uri_advertise,
 		.command = bundle_uri_command,
 	},
+	{
+		.name = "promisor-remote",
+		.advertise = promisor_remote_advertise,
+		.receive = promisor_remote_receive,
+	},
 };
 
 void protocol_v2_advertise_capabilities(struct repository *r)
diff --git a/t/meson.build b/t/meson.build
index 7b35eadbc8..20e15c407c 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -727,6 +727,7 @@ integration_tests = [
   't5703-upload-pack-ref-in-want.sh',
   't5704-protocol-violations.sh',
   't5705-session-id-in-capabilities.sh',
+  't5710-promisor-remote-capability.sh',
   't5730-protocol-v2-bundle-uri-file.sh',
   't5731-protocol-v2-bundle-uri-git.sh',
   't5732-protocol-v2-bundle-uri-http.sh',
diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
new file mode 100755
index 0000000000..0390c1dbad
--- /dev/null
+++ b/t/t5710-promisor-remote-capability.sh
@@ -0,0 +1,244 @@
+#!/bin/sh
+
+test_description='handling of promisor remote advertisement'
+
+. ./test-lib.sh
+
+GIT_TEST_MULTI_PACK_INDEX=0
+GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0
+
+# Setup the repository with three commits, this way HEAD is always
+# available and we can hide commit 1 or 2.
+test_expect_success 'setup: create "template" repository' '
+	git init template &&
+	test_commit -C template 1 &&
+	test_commit -C template 2 &&
+	test_commit -C template 3 &&
+	test-tool genrandom foo 10240 >template/foo &&
+	git -C template add foo &&
+	git -C template commit -m foo
+'
+
+# A bare repo will act as a server repo with unpacked objects.
+test_expect_success 'setup: create bare "server" repository' '
+	git clone --bare --no-local template server &&
+	mv server/objects/pack/pack-* . &&
+	packfile=$(ls pack-*.pack) &&
+	git -C server unpack-objects --strict <"$packfile"
+'
+
+check_missing_objects () {
+	git -C "$1" rev-list --objects --all --missing=print >all.txt &&
+	perl -ne 'print if s/^[?]//' all.txt >missing.txt &&
+	test_line_count = "$2" missing.txt &&
+	if test "$2" -lt 2
+	then
+		test "$3" = "$(cat missing.txt)"
+	else
+		test -f "$3" &&
+		sort <"$3" >expected_sorted &&
+		sort <missing.txt >actual_sorted &&
+		test_cmp expected_sorted actual_sorted
+	fi
+}
+
+initialize_server () {
+	count="$1"
+	missing_oids="$2"
+
+	# Repack everything first
+	git -C server -c repack.writebitmaps=false repack -a -d &&
+
+	# Remove promisor files in case they exist, useful when reinitializing
+	rm -rf server/objects/pack/*.promisor &&
+
+	# Repack without the largest object and create a promisor pack on server
+	git -C server -c repack.writebitmaps=false repack -a -d \
+	    --filter=blob:limit=5k --filter-to="$(pwd)/pack" &&
+	promisor_file=$(ls server/objects/pack/*.pack | sed "s/\.pack/.promisor/") &&
+	>"$promisor_file" &&
+
+	# Check objects missing on the server
+	check_missing_objects server "$count" "$missing_oids"
+}
+
+copy_to_server2 () {
+	oid_path="$(test_oid_to_path $1)" &&
+	path="server/objects/$oid_path" &&
+	path2="server2/objects/$oid_path" &&
+	mkdir -p $(dirname "$path2") &&
+	cp "$path" "$path2"
+}
+
+test_expect_success "setup for testing promisor remote advertisement" '
+	# Create another bare repo called "server2"
+	git init --bare server2 &&
+
+	# Copy the largest object from server to server2
+	obj="HEAD:foo" &&
+	oid="$(git -C server rev-parse $obj)" &&
+	copy_to_server2 "$oid" &&
+
+	initialize_server 1 "$oid" &&
+
+	# Configure server2 as promisor remote for server
+	git -C server remote add server2 "file://$(pwd)/server2" &&
+	git -C server config remote.server2.promisor true &&
+
+	git -C server2 config uploadpack.allowFilter true &&
+	git -C server2 config uploadpack.allowAnySHA1InWant true &&
+	git -C server config uploadpack.allowFilter true &&
+	git -C server config uploadpack.allowAnySHA1InWant true
+'
+
+test_expect_success "clone with promisor.advertise set to 'true'" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=All \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.advertise set to 'false'" '
+	git -C server config promisor.advertise false &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=All \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is not missing on the server
+	check_missing_objects server 0 "" &&
+
+	# Reinitialize server so that the largest object is missing again
+	initialize_server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.acceptfromserver set to 'None'" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=None \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is not missing on the server
+	check_missing_objects server 0 "" &&
+
+	# Reinitialize server so that the largest object is missing again
+	initialize_server 1 "$oid"
+'
+
+test_expect_success "init + fetch with promisor.advertise set to 'true'" '
+	git -C server config promisor.advertise true &&
+
+	test_when_finished "rm -rf client" &&
+	mkdir client &&
+	git -C client init &&
+	git -C client config remote.server2.promisor true &&
+	git -C client config remote.server2.fetch "+refs/heads/*:refs/remotes/server2/*" &&
+	git -C client config remote.server2.url "file://$(pwd)/server2" &&
+	git -C client config remote.server.url "file://$(pwd)/server" &&
+	git -C client config remote.server.fetch "+refs/heads/*:refs/remotes/server/*" &&
+	git -C client config promisor.acceptfromserver All &&
+	GIT_NO_LAZY_FETCH=0 git -C client fetch --filter="blob:limit=5k" server &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.advertise set to 'true' but don't delete the client" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=All \
+		--no-local --filter="blob:limit=5k" server client &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "setup for subsequent fetches" '
+	# Generate new commit with large blob
+	test-tool genrandom bar 10240 >template/bar &&
+	git -C template add bar &&
+	git -C template commit -m bar &&
+
+	# Fetch new commit with large blob
+	git -C server fetch origin &&
+	git -C server update-ref HEAD FETCH_HEAD &&
+	git -C server rev-parse HEAD >expected_head &&
+
+	# Repack everything twice and remove .promisor files before
+	# each repack. This makes sure everything gets repacked
+	# into a single packfile. The second repack is necessary
+	# because the first one fetches from server2 and creates a new
+	# packfile and its associated .promisor file.
+
+	rm -f server/objects/pack/*.promisor &&
+	git -C server -c repack.writebitmaps=false repack -a -d &&
+	rm -f server/objects/pack/*.promisor &&
+	git -C server -c repack.writebitmaps=false repack -a -d &&
+
+	# Unpack everything
+	rm pack-* &&
+	mv server/objects/pack/pack-* . &&
+	packfile=$(ls pack-*.pack) &&
+	git -C server unpack-objects --strict <"$packfile" &&
+
+	# Copy new large object to server2
+	obj_bar="HEAD:bar" &&
+	oid_bar="$(git -C server rev-parse $obj_bar)" &&
+	copy_to_server2 "$oid_bar" &&
+
+	# Reinitialize server so that the 2 largest objects are missing
+	printf "%s\n" "$oid" "$oid_bar" >expected_missing.txt &&
+	initialize_server 2 expected_missing.txt &&
+
+	# Create one more client
+	cp -r client client2
+'
+
+test_expect_success "subsequent fetch from a client when promisor.advertise is true" '
+	git -C server config promisor.advertise true &&
+
+	GIT_NO_LAZY_FETCH=0 git -C client pull origin &&
+
+	git -C client rev-parse HEAD >actual &&
+	test_cmp expected_head actual &&
+
+	cat client/bar >/dev/null &&
+
+	check_missing_objects server 2 expected_missing.txt
+'
+
+test_expect_success "subsequent fetch from a client when promisor.advertise is false" '
+	git -C server config promisor.advertise false &&
+
+	GIT_NO_LAZY_FETCH=0 git -C client2 pull origin &&
+
+	git -C client2 rev-parse HEAD >actual &&
+	test_cmp expected_head actual &&
+
+	cat client2/bar >/dev/null &&
+
+	check_missing_objects server 1 "$oid"
+'
+
+test_done
diff --git a/upload-pack.c b/upload-pack.c
index 728b2477fc..7498b45e2e 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -32,6 +32,7 @@
 #include "write-or-die.h"
 #include "json-writer.h"
 #include "strmap.h"
+#include "promisor-remote.h"
 
 /* Remember to update object flag allocation in object.h */
 #define THEY_HAVE	(1u << 11)
@@ -319,6 +320,8 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 		strvec_push(&pack_objects.args, "--delta-base-offset");
 	if (pack_data->use_include_tag)
 		strvec_push(&pack_objects.args, "--include-tag");
+	if (repo_has_accepted_promisor_remote(the_repository))
+		strvec_push(&pack_objects.args, "--missing=allow-promisor");
 	if (pack_data->filter_options.choice) {
 		const char *spec =
 			expand_list_objects_filter_spec(&pack_data->filter_options);
-- 
2.46.0.rc0.95.gcbf174a634


* [PATCH v4 5/6] promisor-remote: check advertised name or URL
  2025-01-27 15:16     ` [PATCH v4 0/6] " Christian Couder
                         ` (3 preceding siblings ...)
  2025-01-27 15:16       ` [PATCH v4 4/6] Add 'promisor-remote' capability to protocol v2 Christian Couder
@ 2025-01-27 15:17       ` Christian Couder
  2025-01-27 23:48         ` Junio C Hamano
  2025-01-27 15:17       ` [PATCH v4 6/6] doc: add technical design doc for large object promisors Christian Couder
                         ` (2 subsequent siblings)
  7 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:17 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
	Randall S . Becker, Christian Couder, Christian Couder

A previous commit introduced a "promisor.acceptFromServer" configuration
variable with only "None" or "All" as valid values.

Let's introduce "KnownName" and "KnownUrl" as valid values for this
configuration option to give more choice to a client about which
promisor remotes it might accept among those that the server advertised.

In case of "KnownName", the client will accept promisor remotes which
are already configured on the client and have the same name as those
advertised by the server. This could be useful in a corporate setup
where servers and clients are trusted to not switch names and URLs, but
where some kind of control is still useful.

In case of "KnownUrl", the client will accept promisor remotes which
have both the same name and the same URL configured on the client as the
name and URL advertised by the server. This is the most secure option,
so it should be used if possible.
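
As a rough illustration of the difference between the accepted values,
the decision described above can be modelled in plain shell. This is
only a sketch and not code from this series: the `should_accept`
function, the `lower` helper, the `known_remotes` variable and the
`file:///srv/git/server2` URL are all hypothetical names chosen for
the example. It assumes names and URLs are compared case
insensitively and that only the first configured remote with a
matching name is considered.

```shell
#!/bin/sh
# Hypothetical model of the client-side decision; not part of Git.
# known_remotes holds "name url" pairs, one per line, standing in for
# the promisor remotes already configured on the client.
known_remotes="server2 file:///srv/git/server2"

# Lowercase a string so that comparisons ignore case.
lower() { printf '%s' "$1" | tr '[:upper:]' '[:lower:]'; }

# should_accept MODE NAME URL: succeed (exit status 0) if a remote
# advertised with NAME and URL would be accepted under MODE.
should_accept() {
	mode=$(lower "$1") name=$(lower "$2") url=$(lower "$3")
	case "$mode" in
	all) return 0 ;;             # accept anything advertised
	knownname|knownurl) ;;       # need a lookup in known_remotes
	*) return 1 ;;               # "none" or anything else: reject
	esac
	while read -r known_name known_url
	do
		test "$(lower "$known_name")" = "$name" || continue
		# The name matches a remote configured on the client.
		test "$mode" = knownname && return 0
		# knownurl: the configured URL must match as well.
		test "$(lower "$known_url")" = "$url" && return 0
		return 1
	done <<EOF
$known_remotes
EOF
	return 1                     # no configured remote has that name
}
```

For example, `should_accept KnownUrl server2 file:///srv/git/serverTwo`
fails in this model because the name is known but the URL differs,
while the same call with "KnownName" succeeds.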

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/promisor.txt     | 22 ++++++---
 promisor-remote.c                     | 60 ++++++++++++++++++++---
 t/t5710-promisor-remote-capability.sh | 68 +++++++++++++++++++++++++++
 3 files changed, 138 insertions(+), 12 deletions(-)

diff --git a/Documentation/config/promisor.txt b/Documentation/config/promisor.txt
index 9cbfe3e59e..d1364bc018 100644
--- a/Documentation/config/promisor.txt
+++ b/Documentation/config/promisor.txt
@@ -12,9 +12,19 @@ promisor.advertise::
 promisor.acceptFromServer::
 	If set to "all", a client will accept all the promisor remotes
 	a server might advertise using the "promisor-remote"
-	capability. Default is "none", which means no promisor remote
-	advertised by a server will be accepted. By accepting a
-	promisor remote, the client agrees that the server might omit
-	objects that are lazily fetchable from this promisor remote
-	from its responses to "fetch" and "clone" requests from the
-	client. See linkgit:gitprotocol-v2[5].
+	capability. If set to "knownName" the client will accept
+	promisor remotes which are already configured on the client
+	and have the same name as those advertised by the server. This
+	is not very secure, but could be used in a corporate setup
+	where servers and clients are trusted to not switch names and
+	URLs. If set to "knownUrl", the client will accept promisor
+	remotes which have both the same name and the same URL
+	configured on the client as the name and URL advertised by the
+	server. This is more secure than "all" or "knownName", so it
+	should be used if possible instead of those options. Default
+	is "none", which means no promisor remote advertised by a
+	server will be accepted. By accepting a promisor remote, the
+	client agrees that the server might omit objects that are
+	lazily fetchable from this promisor remote from its responses
+	to "fetch" and "clone" requests from the client. See
+	linkgit:gitprotocol-v2[5].
diff --git a/promisor-remote.c b/promisor-remote.c
index 5ac282ed27..790a96aa19 100644
--- a/promisor-remote.c
+++ b/promisor-remote.c
@@ -370,30 +370,73 @@ char *promisor_remote_info(struct repository *repo)
 	return strbuf_detach(&sb, NULL);
 }
 
+/*
+ * Find first index of 'vec' where there is 'val'. 'val' is compared
+ * case insensitively to the strings in 'vec'. If not found 'vec->nr' is
+ * returned.
+ */
+static size_t strvec_find_index(struct strvec *vec, const char *val)
+{
+	for (size_t i = 0; i < vec->nr; i++)
+		if (!strcasecmp(vec->v[i], val))
+			return i;
+	return vec->nr;
+}
+
 enum accept_promisor {
 	ACCEPT_NONE = 0,
+	ACCEPT_KNOWN_URL,
+	ACCEPT_KNOWN_NAME,
 	ACCEPT_ALL
 };
 
 static int should_accept_remote(enum accept_promisor accept,
-				const char *remote_name UNUSED,
-				const char *remote_url UNUSED)
+				const char *remote_name, const char *remote_url,
+				struct strvec *names, struct strvec *urls)
 {
+	size_t i;
+
 	if (accept == ACCEPT_ALL)
 		return 1;
 
-	BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+	i = strvec_find_index(names, remote_name);
+
+	if (i >= names->nr)
+		/* We don't know about that remote */
+		return 0;
+
+	if (accept == ACCEPT_KNOWN_NAME)
+		return 1;
+
+	if (accept != ACCEPT_KNOWN_URL)
+		BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+
+	if (!strcasecmp(urls->v[i], remote_url))
+		return 1;
+
+	warning(_("known remote named '%s' but with url '%s' instead of '%s'"),
+		remote_name, urls->v[i], remote_url);
+
+	return 0;
 }
 
-static void filter_promisor_remote(struct strvec *accepted, const char *info)
+static void filter_promisor_remote(struct repository *repo,
+				   struct strvec *accepted,
+				   const char *info)
 {
 	struct strbuf **remotes;
 	const char *accept_str;
 	enum accept_promisor accept = ACCEPT_NONE;
+	struct strvec names = STRVEC_INIT;
+	struct strvec urls = STRVEC_INIT;
 
 	if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
 		if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
 			accept = ACCEPT_NONE;
+		else if (!strcasecmp("KnownUrl", accept_str))
+			accept = ACCEPT_KNOWN_URL;
+		else if (!strcasecmp("KnownName", accept_str))
+			accept = ACCEPT_KNOWN_NAME;
 		else if (!strcasecmp("All", accept_str))
 			accept = ACCEPT_ALL;
 		else
@@ -404,6 +447,9 @@ static void filter_promisor_remote(struct strvec *accepted, const char *info)
 	if (accept == ACCEPT_NONE)
 		return;
 
+	if (accept != ACCEPT_ALL)
+		promisor_info_vecs(repo, &names, &urls);
+
 	/* Parse remote info received */
 
 	remotes = strbuf_split_str(info, ';', 0);
@@ -433,7 +479,7 @@ static void filter_promisor_remote(struct strvec *accepted, const char *info)
 		if (remote_url)
 			decoded_url = url_percent_decode(remote_url);
 
-		if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url))
+		if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url, &names, &urls))
 			strvec_push(accepted, decoded_name);
 
 		strbuf_list_free(elems);
@@ -441,6 +487,8 @@ static void filter_promisor_remote(struct strvec *accepted, const char *info)
 		free(decoded_url);
 	}
 
+	strvec_clear(&names);
+	strvec_clear(&urls);
 	strbuf_list_free(remotes);
 }
 
@@ -449,7 +497,7 @@ char *promisor_remote_reply(const char *info)
 	struct strvec accepted = STRVEC_INIT;
 	struct strbuf reply = STRBUF_INIT;
 
-	filter_promisor_remote(&accepted, info);
+	filter_promisor_remote(the_repository, &accepted, info);
 
 	if (!accepted.nr)
 		return NULL;
diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
index 0390c1dbad..5bce99f5eb 100755
--- a/t/t5710-promisor-remote-capability.sh
+++ b/t/t5710-promisor-remote-capability.sh
@@ -160,6 +160,74 @@ test_expect_success "init + fetch with promisor.advertise set to 'true'" '
 	check_missing_objects server 1 "$oid"
 '
 
+test_expect_success "clone with promisor.acceptfromserver set to 'KnownName'" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=KnownName \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with 'KnownName' and different remote names" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.serverTwo.promisor=true \
+		-c remote.serverTwo.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.serverTwo.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=KnownName \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is not missing on the server
+	check_missing_objects server 0 "" &&
+
+	# Reinitialize server so that the largest object is missing again
+	initialize_server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.acceptfromserver set to 'KnownUrl'" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		-c promisor.acceptfromserver=KnownUrl \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with 'KnownUrl' and different remote urls" '
+	ln -s server2 serverTwo &&
+
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/serverTwo" \
+		-c promisor.acceptfromserver=KnownUrl \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is not missing on the server
+	check_missing_objects server 0 "" &&
+
+	# Reinitialize server so that the largest object is missing again
+	initialize_server 1 "$oid"
+'
+
 test_expect_success "clone with promisor.advertise set to 'true' but don't delete the client" '
 	git -C server config promisor.advertise true &&
 
-- 
2.46.0.rc0.95.gcbf174a634



* [PATCH v4 6/6] doc: add technical design doc for large object promisors
  2025-01-27 15:16     ` [PATCH v4 0/6] " Christian Couder
                         ` (4 preceding siblings ...)
  2025-01-27 15:17       ` [PATCH v4 5/6] promisor-remote: check advertised name or URL Christian Couder
@ 2025-01-27 15:17       ` Christian Couder
  2025-01-27 21:14       ` [PATCH v4 0/6] Introduce a "promisor-remote" capability Junio C Hamano
  2025-02-18 11:32       ` [PATCH v5 0/3] " Christian Couder
  7 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-01-27 15:17 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
	Randall S . Becker, Christian Couder, Christian Couder

Let's add a design doc about how we could improve handling large blobs
using "Large Object Promisors" (LOPs). It's a set of features with the
goal of using special dedicated promisor remotes to store large blobs,
and having them accessed directly by main remotes and clients.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 .../technical/large-object-promisors.txt      | 640 ++++++++++++++++++
 1 file changed, 640 insertions(+)
 create mode 100644 Documentation/technical/large-object-promisors.txt

diff --git a/Documentation/technical/large-object-promisors.txt b/Documentation/technical/large-object-promisors.txt
new file mode 100644
index 0000000000..1984f11a55
--- /dev/null
+++ b/Documentation/technical/large-object-promisors.txt
@@ -0,0 +1,640 @@
+Large Object Promisors
+======================
+
+Ever since Git was created, users have been complaining about issues
+with storing large files in Git. Some solutions have been created to
+help, but they haven't helped much with some issues.
+
+Git currently supports multiple promisor remotes, which could help
+with some of these remaining issues, but it's very hard to use them to
+help, because a number of important features are missing.
+
+The goal of the effort described in this document is to add these
+important features.
+
+We will call a "Large Object Promisor", or "LOP" in short, a promisor
+remote which is used to store only large blobs and which is separate
+from the main remote that should store the other Git objects and the
+rest of the repos.
+
+By extension, we will also call "Large Object Promisor", or LOP, the
+effort described in this document to add a set of features to make it
+easier to handle large blobs/files in Git by using LOPs.
+
+This effort aims to especially improve things on the server side, and
+especially for large blobs that are already compressed in a binary
+format.
+
+This effort aims to provide an alternative to Git LFS
+(https://git-lfs.com/) and similar tools like git-annex
+(https://git-annex.branchable.com/) for handling large files, even
+though a complete alternative would very likely require other efforts
+especially on the client side, where it would likely help to implement
+a new object representation for large blobs as discussed in:
+
+https://lore.kernel.org/git/xmqqbkdometi.fsf@gitster.g/
+
+0) Non goals
+------------
+
+- We will not discuss those client side improvements here, as they
+  would require changes in different parts of Git than this effort.
++
+So we don't pretend to fully replace Git LFS with only this effort,
+but we nevertheless believe that it can significantly improve the
+current situation on the server side, and that other separate
+efforts could also improve the situation on the client side.
+
+- In the same way, we are not going to discuss all the possible ways
+  to implement a LOP or their underlying object storage, or to
+  optimize how LOP works.
++
+Our opinion is that the simplest solution for now is for LOPs to use
+object storage through a remote helper (see section II.2 below for
+more details) to store their objects. So we consider that this is the
+default implementation. If there are improvements on top of this,
+that's great, but our opinion is that such improvements are not
+necessary for LOPs to already be useful. Such improvements are likely
+a different technical topic, and can be taken care of separately
+anyway.
++
+So in particular we are not going to discuss pluggable ODBs or other
+object database backends that could chunk large blobs, dedup the
+chunks and store them efficiently. Sure, that would be a nice
+improvement to store large blobs on the server side, but we believe
+it can just be a separate effort as it's also not technically very
+related to this effort.
++
+We are also not going to discuss data transfer improvements between
+LOPs and clients or servers. Sure, there might be some easy and very
+effective optimizations there (as we know that objects on LOPs are
+very likely incompressible and not deltifying well), but this can be
+dealt with in a separate effort.
+
+In other words, the goal of this document is not to talk about all the
+possible ways to optimize how Git could handle large blobs, but to
+describe how a LOP based solution can already work well and alleviate
+a number of current issues in the context of Git clients and servers
+sharing Git objects.
+
+I) Issues with the current situation
+------------------------------------
+
+- Some statistics made on GitLab repos have shown that more than 75%
+  of the disk space is used by blobs that are larger than 1MB and
+  often in a binary format.
+
+- So even if users could use Git LFS or similar tools to store a lot
+  of large blobs out of their repos, it's a fact that in practice they
+  don't do it as much as they probably should.
+
+- On the server side ideally, the server should be able to decide for
+  itself how it stores things. It should not depend on users deciding
+  to use tools like Git LFS on some blobs or not.
+
+- It's much more expensive to store large blobs that don't delta
+  compress well on regular fast seeking drives (like SSDs) than on
+  object storage (like Amazon S3 or GCP Buckets). Using fast drives
+  for regular Git repos makes sense though, as serving regular Git
+  content (blobs containing text or code) needs drives where seeking
+  is fast, but the content is relatively small. On the other hand,
+  object storage for Git LFS blobs makes sense as seeking speed is not
+  as important when dealing with large files, while costs are more
+  important. So the fact that users don't use Git LFS or similar tools
+  for a significant number of large blobs likely has some bad
+  consequences on the cost of repo storage for most Git hosting
+  platforms.
+
+- Having large blobs handled in the same way as other blobs and Git
+  objects in Git repos instead of on object storage also has a cost in
+  increased memory and CPU usage, and therefore decreased performance,
+  when creating packfiles. (This is because Git tries to use delta
+  compression or zlib compression which is unlikely to work well on
+  already compressed binary content.) So it's not just a storage cost
+  increase.
+
+- When a large blob has been committed into a repo, it might not be
+  possible to remove this blob from the repo without rewriting
+  history, even if the user then decides to use Git LFS or a similar
+  tool to handle it.
+
+- In fact Git LFS and similar tools are not very flexible in letting
+  users change their minds about the blobs they should handle or not.
+
+- Even when users are using Git LFS or similar tools, they are often
+  complaining that these tools require significant effort to set up,
+  learn and use correctly.
+
+II) Main features of the "Large Object Promisors" solution
+----------------------------------------------------------
+
+The main features below should give a rough overview of how the
+solution may work. Details about needed elements can be found in
+following sections.
+
+Even if each feature below is very useful for the full solution, it is
+very likely to be also useful on its own in some cases where the full
+solution is not required. However, we'll focus primarily on the big
+picture here.
+
+Also each feature doesn't need to be implemented entirely in Git
+itself. Some could be scripts, hooks or helpers that are not part of
+the Git repo. It would be helpful if those could be shared and
+improved on collaboratively though. So we want to encourage sharing
+them.
+
+1) Large blobs are stored on LOPs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Large blobs should be stored on special promisor remotes that we will
+call "Large Object Promisors" or LOPs. These LOPs should be additional
+remotes dedicated to containing large blobs, especially those in binary
+format. They should be used along with main remotes that contain the
+other objects.
+
+Note 1
+++++++
+
+To clarify, a LOP is a normal promisor remote, except that:
+
+- it should store only large blobs,
+
+- it should be separate from the main remote, so that the main remote
+  can focus on serving other objects and the rest of the repos (see
+  feature 4) below) and can use the LOP as a promisor remote for
+  itself.
+
+Note 2
+++++++
+
+Git already makes it possible for a main remote to also be a promisor
+remote storing both regular objects and large blobs for a client that
+clones from it with a filter on blob size. But here we explicitly want
+to avoid that.
+
+Rationale
++++++++++
+
+LOPs aim to be good at handling large blobs while main remotes are
+already good at handling other objects.
+
+Implementation
+++++++++++++++
+
+Git already has support for multiple promisor remotes, see
+link:partial-clone.html#using-many-promisor-remotes[the partial clone documentation].
+
+Also, Git already has support for partial clone using a filter on the
+size of the blobs (with `git clone --filter=blob:limit=<size>`).  Most
+of the other main features below are based on these existing features
+and are about making them easy and efficient to use for the purpose of
+better handling large blobs.
+
+2) LOPs can use object storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+LOPs can be implemented using object storage, like an Amazon S3 or GCP
+Bucket or MinIO (which is open source under the GNU AGPLv3 license) to
+actually store the large blobs, and can be accessed through a Git
+remote helper (see linkgit:gitremote-helpers[7]) which makes the
+underlying object storage appear like a remote to Git.
+
+Note
+++++
+
+A LOP can be a promisor remote accessed using a remote helper by
+both some clients and the main remote.
+
+Rationale
++++++++++
+
+This looks like the simplest way to create LOPs that can cheaply
+handle many large blobs.
+
+Implementation
+++++++++++++++
+
+Remote helpers are quite easy to write as shell scripts, but it might
+be more efficient and maintainable to write them using other languages
+like Go.
+
+Some already exist under open source licenses, for example:
+
+  - https://github.com/awslabs/git-remote-s3
+  - https://gitlab.com/eric.p.ju/git-remote-gs
+
+Other ways to implement LOPs are certainly possible, but the goal of
+this document is not to discuss how to best implement a LOP or its
+underlying object storage (see the "0) Non goals" section above).
+
+3) LOP object storage can be Git LFS storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The underlying object storage that a LOP uses could also serve as
+storage for large files handled by Git LFS.
+
+Rationale
++++++++++
+
+This would simplify the server side if it wants to both use a LOP and
+act as a Git LFS server.
+
+4) A main remote can offload to a LOP with a configurable threshold
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+On the server side, a main remote should have a way to offload to a
+LOP all its blobs with a size over a configurable threshold.
+
+Rationale
++++++++++
+
+This makes it easy to set things up and to clean things up. For
+example, an admin could use this to manually convert a repo not using
+LOPs to a repo using a LOP. On a repo already using a LOP but where
+some users would sometimes push large blobs, a cron job could use this
+to regularly make sure the large blobs are moved to the LOP.
+
+Implementation
+++++++++++++++
+
+Using something based on `git repack --filter=...` to separate the
+blobs we want to offload from the other Git objects could be a good
+idea. The missing part is to connect to the LOP, check if the blobs we
+want to offload are already there and if not send them.
+
+5) A main remote should try to remain clean from large blobs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A main remote should try to avoid containing a lot of oversize
+blobs. For that purpose, it should offload as needed to a LOP and it
+should have ways to prevent oversize blobs from being fetched, and also
+perhaps pushed, into it.
+
+Rationale
++++++++++
+
+A main remote containing many oversize blobs would defeat the purpose
+of LOPs.
+
+Implementation
+++++++++++++++
+
+The way to offload to a LOP discussed in 4) above can be used to
+regularly offload oversize blobs. About preventing oversize blobs from
+being fetched into the repo see 6) below. About preventing oversize
+blob pushes, a pre-receive hook could be used.
+
+Also there are different scenarios in which large blobs could get
+fetched into the main remote, for example:
+
+- A client that doesn't implement the "promisor-remote" protocol
+  (described in 6) below) clones from the main remote.
+
+- The main remote gets a request for information about a large blob
+  and is not able to get that information without fetching the blob
+  from the LOP.
+
+It might not be possible to completely prevent all these scenarios
+from happening. So the goal here should be to implement features that
+make the fetching of large blobs less likely. For example adding a
+`remote-object-info` command in the `git cat-file --batch` protocol
+and its variants might make it possible for a main repo to respond to
+some requests about large blobs without fetching them.
+
+6) A protocol negotiation should happen when a client clones
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a client clones from a main repo, there should be a protocol
+negotiation so that the server can advertise one or more LOPs and so
+that the client and the server can discuss if the client could
+directly use a LOP the server is advertising. If the client and the
+server can agree on that, then the client would be able to get the
+large blobs directly from the LOP and the server would not need to
+fetch those blobs from the LOP to be able to serve the client.
+
+Note
+++++
+
+For fetches instead of clones, a protocol negotiation might not always
+happen, see the "What about fetches?" FAQ entry below for details.
+
+Rationale
++++++++++
+
+Security, configurability and efficiency of setting things up.
+
+Implementation
+++++++++++++++
+
+A "promisor-remote" protocol v2 capability looks like a good way to
+implement this. The way the client and server use this capability
+could be controlled by configuration variables.
+
+Information that the server could send to the client through that
+protocol could be things like: LOP name, LOP URL, filter-spec (for
+example `blob:limit=<size>`) or just size limit that should be used as
+a filter when cloning, token to be used with the LOP, etc.
+
+7) A client can offload to a LOP
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a client is using a LOP that is also a LOP of its main remote,
+the client should be able to offload some large blobs it has fetched,
+but might not need anymore, to the LOP.
+
+Note
+++++
+
+It might depend on the context if it should be OK or not for clients
+to offload large blobs they have created, instead of fetched, directly
+to the LOP without the main remote checking them in some ways
+(possibly using hooks or other tools).
+
+Rationale
++++++++++
+
+On the client, the easiest way to deal with unneeded large blobs is to
+offload them.
+
+Implementation
+++++++++++++++
+
+This is very similar to what 4) above is about, except on the client
+side instead of the server side. So a good solution to 4) could likely
+be adapted to work on the client side too.
+
+There might be some security issues here, as there is no negotiation,
+but they might be mitigated if the client can reuse a token it got
+when cloning (see 6) above). Also if the large blobs were fetched from
+a LOP, it is likely, and can easily be confirmed, that the LOP still
+has them, so that they can just be removed from the client.
+
+III) Benefits of using LOPs
+---------------------------
+
+Many benefits are related to the issues discussed in "I) Issues with
+the current situation" above:
+
+- No need to rewrite history when deciding which blobs are worth
+  handling separately from other objects, or when moving or removing
+  the threshold.
+
+- If the protocol between client and server is developed and secured
+  enough, then many details might be set up on the server side only and
+  all the clients could then easily get all the configuration
+  information and use it to set themselves up mostly automatically.
+
+- Storage cost benefits on the server side.
+
+- Reduced memory and CPU needs on main remotes on the server side.
+
+- Reduced storage needs on the client side.
+
+IV) FAQ
+-------
+
+What about using multiple LOPs on the server and client side?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+That could perhaps be useful in some cases, but for now it's more
+likely that in most cases a single LOP will be advertised by the
+server and should be used by the client.
+
+A case where it could be useful for a server to advertise multiple
+LOPs is if a LOP is better for some users while a different LOP is
+better for other users. For example some clients might have a better
+connection to a LOP than others.
+
+In those cases it's the responsibility of the server to have some
+documentation to help clients. It could say for example something like
+"Users in this part of the world might want to pick only LOP A as it
+is likely to be better connected to them, while users in other parts
+of the world should pick only LOP B for the same reason."
+
+When should we trust or not trust the LOPs advertised by the server?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In some contexts, like in a corporate setup where the server and all the
+clients are parts of an internal network in a company where admins
+have all the rights on every system, it's OK, and perhaps even a good
+thing, if the clients fully trust the server, as it can help ensure
+that all the clients are on the same page.
+
+There are also contexts in which clients trust a code hosting platform
+serving them some repos, but might not fully trust other users
+managing or contributing to some of these repos. For example, the code
+hosting platform could have hooks in place to check that any object it
+receives doesn't contain malware or otherwise bad content. In this
+case it might be OK for the client to use a main remote and its LOP if
+they are both hosted by the code hosting platform, but not if the LOP
+is hosted elsewhere (where the content is not checked).
+
+In other contexts, a client should just not trust a server.
+
+So there should be different ways to configure how the client should
+behave when a server advertises a LOP to it at clone time.
+
+As the basic elements that a server can advertise about a LOP are a
+LOP name and a LOP URL, the client should base its decision about
+accepting a LOP on these elements.
+
+One simple way for the client to be very strict about the LOPs it
+accepts is, for example, to check that the LOP is already configured
+on the client with the same name and URL as what the server
+advertises.
+
+In general, default and "safe" settings should require that LOPs are
+configured on the client separately from the "promisor-remote"
+protocol and that the client accepts a LOP only when information about
+it from the protocol matches what has been already configured
+separately.
+
+What about LOP names?
+~~~~~~~~~~~~~~~~~~~~~
+
+In some contexts, for example if the clients sometimes fetch from each
+other, it can be a good idea for all the clients to use the same names
+for all the remotes they use, including LOPs.
+
+In other contexts, each client might want to be able to give the name
+it wants to each remote, including each LOP, it interacts with.
+
+So there should be different ways to configure whether the client
+accepts the LOP name the server advertises.
+
+If a default or "safe" setting is used, the LOP has to be configured
+separately, so its name is configured separately too, and there is no
+risk that the server could dictate a name to a client.
+
+Could the main remote be bogged down by old or paranoid clients?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Yes, it could happen if there are too many clients that are either
+unwilling to trust the main remote or that just don't implement the
+"promisor-remote" protocol because they are too old or not fully
+compatible with the 'git' client.
+
+When serving such a client, the main remote has no other choice than
+to first fetch from its LOP, to then be able to provide to the client
+everything it requested. So the main remote, even if it has cleanup
+mechanisms (see section II.4 above), would be burdened at least
+temporarily with the large blobs it had to fetch from its LOP.
+
+Not behaving like this would be breaking backward compatibility, and
+could be seen as segregating clients. For example, it might be
+possible to implement a special mode that allows the server to just
+reject clients that don't implement the "promisor-remote" protocol or
+aren't willing to trust the main remote. This mode might be useful in
+a special context like a corporate environment. There is no plan to
+implement such a mode though, and this should be discussed separately
+later anyway.
+
+A better way to proceed is probably for the main remote to show a
+message telling clients that don't implement the protocol or are
+unwilling to accept the advertised LOP(s) that they would get faster
+clones and fetches by upgrading client software or properly setting
+them up to accept LOP(s).
+
+Waiting for clients to upgrade, monitoring these upgrades and limiting
+the use of LOPs to repos that are not very frequently accessed might
+be other good ways to make sure that some benefits are still reaped
+from LOPs. Over time, as more and more clients upgrade and benefit
+from LOPs, using them in more and more frequently accessed repos will
+become worth it.
+
+Corporate environments, where it might be easier to make sure that all
+the clients are up-to-date and properly configured, could hopefully
+benefit more and earlier from using LOPs.
+
+What about fetches?
+~~~~~~~~~~~~~~~~~~~
+
+There are different kinds of fetches. A regular fetch happens when
+some refs have been updated on the server and the client wants the ref
+updates and possibly the new objects added with them. A "backfill" or
+"lazy" fetch, on the contrary, happens when the client needs to use
+some objects it already knows about but doesn't have because they are
+on a promisor remote.
+
+Regular fetch
++++++++++++++
+
+In a regular fetch, the client will contact the main remote and a
+protocol negotiation will happen between them. It's a good thing that
+a protocol negotiation happens every time, as the configuration on the
+client or the main remote could have changed since the previous
+protocol negotiation. In this case, the new protocol negotiation
+should ensure that the new fetch will happen in a way that satisfies
+the new configuration of both the client and the server.
+
+In most cases though, the configurations on the client and the main
+remote will not have changed between 2 fetches or between the initial
+clone and a subsequent fetch. This means that the result of a new
+protocol negotiation will be the same as the previous result, so the
+new fetch will happen in the same way as the previous clone or fetch,
+using, or not using, the same LOP(s) as last time.
+
+"Backfill" or "lazy" fetch
+++++++++++++++++++++++++++
+
+When there is a backfill fetch, the client doesn't necessarily contact
+the main remote first. It will try to fetch from its promisor remotes
+in the order they appear in the config file, except that a remote
+configured using the `extensions.partialClone` config variable will be
+tried last. See
+link:partial-clone.html#using-many-promisor-remotes[the partial clone documentation].
+
+This is not new with this effort. In fact this is how multiple remotes
+have already been working for around 5 years.
+
+When using LOPs, having the main remote configured using
+`extensions.partialClone`, so it's tried last, makes sense, as missing
+objects should only be large blobs that are on LOPs.
+
+This means that a protocol negotiation will likely not happen as the
+missing objects will be fetched from the LOPs, and then there will be
+nothing left to fetch from the main remote.
+
+To secure such fetches, it could be a good idea for LOPs to require a token
+from the client when it fetches from them. The client could get the
+token when performing a protocol negotiation with the main remote (see
+section II.6 above).
+
+V) Future improvements
+----------------------
+
+It is expected that at the beginning using LOPs will be mostly worth
+it either in a corporate context where the Git version that clients
+use can easily be controlled, or on repos that are infrequently
+accessed. (See the "Could the main remote be bogged down by old or
+paranoid clients?" section in the FAQ above.)
+
+Over time, as more and more clients upgrade to a version that
+implements the "promisor-remote" protocol v2 capability (described
+above in section II.6), it will be worth it to use LOPs more widely.
+
+A lot of improvements may also help using LOPs more widely. Some of
+these improvements are part of the scope of this document like the
+following:
+
+  - Implementing a "remote-object-info" command in the
+    `git cat-file --batch` protocol and its variants to allow main
+    remotes to respond to requests about large blobs without fetching
+    them. (Eric Ju has started working on this based on previous work
+    by Calvin Wan.)
+
+  - Creating better cleanup and offload mechanisms for main remotes
+    and clients to prevent accumulation of large blobs.
+
+  - Developing more sophisticated protocol negotiation capabilities
+    between clients and servers for handling LOPs, for example adding
+    a filter-spec (e.g., blob:limit=<size>) or size limit for
+    filtering when cloning, or adding a token for LOP authentication.
+
+  - Improving security measures for LOP access, particularly around
+    token handling and authentication.
+
+  - Developing standardized ways to configure and manage multiple LOPs
+    across different environments. Especially in the case where
+    different LOPs serve the same content to clients in different
+    geographical locations, there is a need for replication or
+    synchronization between LOPs.
+
+Some improvements, including some that have been mentioned in the "0)
+Non Goals" section of this document, are out of the scope of this
+document:
+
+  - Implementing a new object representation for large blobs on the
+    client side.
+
+  - Developing pluggable ODBs or other object database backends that
+    could chunk large blobs, dedup the chunks and store them
+    efficiently.
+
+  - Optimizing data transfer between LOPs and clients/servers,
+    particularly for incompressible and non-deltifying content.
+
+  - Creating improved client side tools for managing large objects
+    more effectively, for example tools for migrating from Git LFS or
+    git-annex, or tools to find which objects could be offloaded and
+    how much disk space could be reclaimed by offloading them.
+
+Some improvements could be seen as part of the scope of this document,
+but might already have their own separate projects from the Git
+project, like:
+
+  - Improving existing remote helpers to access object storage or
+    developing new ones.
+
+  - Improving existing object storage solutions or developing new
+    ones.
+
+Even though all the above improvements may help, this document and the
+LOP effort should try to focus, at least first, on a relatively small
+number of improvements, mostly those that are in its current scope.
+
+For example introducing pluggable ODBs and a new object database
+backend is likely a multi-year effort on its own that can happen
+separately in parallel. It has different technical requirements,
+touches other parts of the Git code base and should have its own design
+document(s).
-- 
2.46.0.rc0.95.gcbf174a634



* Re: [PATCH v3 5/5] doc: add technical design doc for large object promisors
  2025-01-27 15:11         ` Christian Couder
@ 2025-01-27 18:02           ` Junio C Hamano
  2025-02-18 11:42             ` Christian Couder
  0 siblings, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2025-01-27 18:02 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

>> > +In other words, the goal of this document is not to talk about all the
>> > +possible ways to optimize how Git could handle large blobs, but to
>> > +describe how a LOP based solution could work well and alleviate a
>> > +number of current issues in the context of Git clients and servers
>> > +sharing Git objects.
>>
>> But if you do not discuss even a single way, and handwave "we'll
>> have this magical object storage that would solve all the problems
>> for us", then we cannot really tell if the problem is solved by us,
>> or handwaved away by assuming the magical object storage.
>> We'd need at least one working example.
>
> It's not magical object storage. Amazon S3, GCP Bucket and MinIO
> (which is open source), for example, already exist and are used a lot
> in the industry.

That's just "we can store a bunch of bytes and ask them to be
retrieved".  What I said about handwaving the presence of magical
"object storage" is exactly the "optimize how to handle large blobs"
part.  I agree that we do not need to discuss _ALL_ the possible
ways.  But without telling what our thoughts are on _how_ to use these
"lower cost and safe by duplication but with high latency" services
to store our objects efficiently enough to make it practical, I'd
have to call what we see in the document "magical object storage".

>> > +7) A client can offload to a LOP
>> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> > +
>> > +When a client is using a LOP that is also a LOP of its main remote,
>> > +the client should be able to offload some large blobs it has fetched,
>> > +but might not need anymore, to the LOP.
>>
>> For a client that _creates_ a large object, the situation would be
>> the same, right?  After it creates several versions of the opening
>> segment of, say, a movie, the latest version may be still wanted,
>> but the creating client may want to offload earlier versions.
>
> Yeah, but it's not clear if the versions of the opening segment should
> be sent directly to the LOP without the main remote checking them in
> some ways (hooks might be configured only on the main remote) and/or
> checking that they are connected to the repo. I guess it depends on
> the context if it would be OK or not.

If it is not clear to us or whoever writes this document, the users
would have a hard time to make effective use of it, which is why I
am worried about the current design in this feature.

Thanks for clarifying other parts of my confusion.




* Re: [PATCH v3 0/5] Introduce a "promisor-remote" capability
  2025-01-27 15:05           ` Christian Couder
@ 2025-01-27 19:38             ` Junio C Hamano
  0 siblings, 0 replies; 110+ messages in thread
From: Junio C Hamano @ 2025-01-27 19:38 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine

Christian Couder <christian.couder@gmail.com> writes:

>> or is it merely
>> because the way the feature is verified assumes that the multi-pack
>> index is not used, even though the protocol exchange, capability
>> selection, and the actual behaviour adjustment for the capability
>> are all working just fine?  I am assuming it is the latter, but just
>> to make sure we know where we stand...
>
> Let me know if you need more than the above,

Hard to say if I got a test script when I asked for a simple yes-or-no
question.

> but I think it's fair for
> now to just use:
>
> GIT_TEST_MULTI_PACK_INDEX=0
> GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0
>
> at the top of the tests, like it's done in the version 4 of this
> series I will send soon.

Doesn't it mean that people should not use multi-pack-index or
incremental writing with this feature?  If we cannot make both of
them work together even in our controlled testing environment, how
would the users know what combinations of features are safe to use
and what are incompatible?  That sounds far from fair at least to me.

I see Taylor is included in the Cc: list, so hopefully, we'll get
the anomalies you found in the multi-pack stuff resolved and see how
well these two things would work together.

Thanks.


* Re: [PATCH v4 0/6] Introduce a "promisor-remote" capability
  2025-01-27 15:16     ` [PATCH v4 0/6] " Christian Couder
                         ` (5 preceding siblings ...)
  2025-01-27 15:17       ` [PATCH v4 6/6] doc: add technical design doc for large object promisors Christian Couder
@ 2025-01-27 21:14       ` Junio C Hamano
  2025-02-18 11:40         ` Christian Couder
  2025-02-18 11:32       ` [PATCH v5 0/3] " Christian Couder
  7 siblings, 1 reply; 110+ messages in thread
From: Junio C Hamano @ 2025-01-27 21:14 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
	Randall S . Becker

Christian Couder <christian.couder@gmail.com> writes:

> This work is part of some effort to better handle large files/blobs in
> a client-server context using promisor remotes dedicated to storing
> large blobs. To help understand this effort, this series now contains
> a patch (patch 6/6) that adds design documentation about this effort.
>
> Last year, I sent 3 versions of a patch series with the goal of
> allowing a client C to clone from a server S while using the same
> promisor remote X that S already uses. See:
>
> https://lore.kernel.org/git/20240418184043.2900955-1-christian.couder@gmail.com/
>
> Junio suggested to implement that feature using:
>
> "a protocol extension that lets S tell C that S wants C to fetch
> missing objects from X (which means that if C knows about X in its
> ".git/config" then there is no need for end-user interaction at all),
> or a protocol extension that C tells S that C is willing to see
> objects available from X omitted when S does not have them (again,
> this could be done by looking at ".git/config" at C, but there may be
> security implications???)"
>
> This patch series implements that protocol extension called
> "promisor-remote" (that name is open to change or simplification)
> which allows S and C to agree on C using X directly or not.
>
> I have tried to implement it in a quite generic way that could allow S
> and C to share more information about promisor remotes and how to use
> them.
>
> For now, C doesn't use the information it gets from S when cloning.
> That information is only used to decide if C is OK to use the promisor
> remotes advertised by S. But this could change in the future which
> could make it much simpler for clients than using the current way of
> passing information about X with the `-c` option of `git clone` many
> times on the command line.
>
> Another improvement could be to not require GIT_NO_LAZY_FETCH=0 when S
> and C have agreed on using S.
>
> Changes compared to version 3
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>   - Patches 1/6 and 2/6 are new in this series. They come from the
>     patch series Usman Akinyemi is working on
>     (https://lore.kernel.org/git/20250124122217.250925-1-usmanakinyemi202@gmail.com/).
>     We need a similar redact_non_printables() function as the one he
>     has been working on in his patch series, so it's just simpler to
>     reuse his patches related to this function, and to build on top of
>     them.

Two topics in flight, neither of which hit 'next', sharing a handful
of patches is cumbersome to keep track of.  Typically our strategy
dealing with such a situation has been for these topics to halt and
have the authors work together to help the common part solidify a
bit better before continuing.  Otherwise, every time any one of the
topics that share the same early parts of the series needs to change
them even a bit, it would result in a huge rebase chaos, and worse
yet, even if the two (or more) topics share the need for these two
early parts, they may have different dependency requirements (e.g.
this may be OK with these two early patches directly applied on
'maint', while the other topic may need to have these two early
patches on 'master').

I think [3/6] falls into the same category as [1/6] and [2/6], that
is, to lay the foundation for the remainder?

>   - In patch 4/6, the commit message has been improved:
>   - In patch 4/6, there are also some code changes:
>   - In patch 4/6, there is also a small change in the tests.

All good changes.

Will queue, but we should find a better way to manage the "an
earlier part is shared across multiple topics" situation.

Thanks.


* Re: [PATCH v4 5/6] promisor-remote: check advertised name or URL
  2025-01-27 15:17       ` [PATCH v4 5/6] promisor-remote: check advertised name or URL Christian Couder
@ 2025-01-27 23:48         ` Junio C Hamano
  2025-01-28  0:01           ` Junio C Hamano
                             ` (2 more replies)
  0 siblings, 3 replies; 110+ messages in thread
From: Junio C Hamano @ 2025-01-27 23:48 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
	Randall S . Becker, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> A previous commit introduced a "promisor.acceptFromServer" configuration
> variable with only "None" or "All" as valid values.
>
> Let's introduce "KnownName" and "KnownUrl" as valid values for this
> configuration option to give more choice to a client about which
> promisor remotes it might accept among those that the server advertised.

OK.

>  promisor.acceptFromServer::
>  	If set to "all", a client will accept all the promisor remotes
>  	a server might advertise using the "promisor-remote"
> -	capability. Default is "none", which means no promisor remote
> -	advertised by a server will be accepted. By accepting a
> -	promisor remote, the client agrees that the server might omit
> -	objects that are lazily fetchable from this promisor remote
> -	from its responses to "fetch" and "clone" requests from the
> -	client. See linkgit:gitprotocol-v2[5].
> +	capability. If set to "knownName" the client will accept
> +	promisor remotes which are already configured on the client
> +	and have the same name as those advertised by the server. This
> +	is not very secure, but could be used in a corporate setup
> +	where servers and clients are trusted to not switch name and
> +	URLs.

I wonder if the reader needs to be told a bit more about the
security argument here.  I imagine that the attack vector behind the
use of "secure" in the above paragraph is for a malicious server
that guesses a promisor remote name the client already uses, which
has a different URL from what the client expects to be associated
with the name, thereby such an acceptance means that the URL used in
future fetches would be replaced without the user's consent.  Being
able to silently repoint the remote.origin.url at an evil repository
you control is indeed a powerful thing, I would guess.  Of course,
in a corp environment, such a mechanism to drive the clients to a
new repository after upgrading or migrating may be extremely handy.

Or does the above paragraph assume some other attack vectors,
perhaps?

> +	If set to "knownUrl", the client will accept promisor
> +	remotes which have both the same name and the same URL
> +	configured on the client as the name and URL advertised by the
> +	server. This is more secure than "all" or "knownName", so it
> +	should be used if possible instead of those options. Default
> +	is "none", which means no promisor remote advertised by a
> +	server will be accepted.

OK.

> diff --git a/promisor-remote.c b/promisor-remote.c
> index 5ac282ed27..790a96aa19 100644
> --- a/promisor-remote.c
> +++ b/promisor-remote.c
> @@ -370,30 +370,73 @@ char *promisor_remote_info(struct repository *repo)
>  	return strbuf_detach(&sb, NULL);
>  }
>  
> +/*
> + * Find first index of 'vec' where there is 'val'. 'val' is compared
> + * case insensitively to the strings in 'vec'. If not found 'vec->nr' is
> + * returned.
> + */
> +static size_t strvec_find_index(struct strvec *vec, const char *val)
> +{
> +	for (size_t i = 0; i < vec->nr; i++)
> +		if (!strcasecmp(vec->v[i], val))
> +			return i;
> +	return vec->nr;
> +}

Hmph, without the hardcoded strcasecmp(), strvec_find() might make a
fine public API in <strvec.h>.  

Unless we intend to create a generic function that qualifies as a
part of the public strvec API, we shouldn't call it strvec_anything.
This is a great helper that finds a matching remote nickname from
list of remote nicknames, so

    remote_nick_find(struct strvec *nicks, const char *nick)

may be more appropriate.  When we lift it out of here and make it
more generic to move it to strvec.[ch], perhaps 

	size_t strvec_find(struct strvec *vec, const void *needle,
		 int (*match)(const char *, const void *)) {
		for (size_t ix = 0; ix < vec->nr; ix++)
			if (match(vec->v[ix], needle))
				return ix;
		return vec->nr;
	}

which will be used to rewrite remote_nick_find() like so:

	static int nicks_match(const char *nick, const void *needle)
	{
		return !strcasecmp(nick, (const char *)needle);
	}

	static size_t remote_nick_find(struct strvec *nicks, const char *nick)
	{
		return strvec_find(nicks, nick, nicks_match);
	}

it would be better to use a more generic parameter name "vec", but
until then, it is better to be more specific and explicit about the
reason why the immediate callers call the function, which is
where my "nicks" vs "nick" comes from (it is OK to call the latter
"needle", though).

>  enum accept_promisor {
>  	ACCEPT_NONE = 0,
> +	ACCEPT_KNOWN_URL,
> +	ACCEPT_KNOWN_NAME,
>  	ACCEPT_ALL
>  };
>  
>  static int should_accept_remote(enum accept_promisor accept,
> -				const char *remote_name UNUSED,
> -				const char *remote_url UNUSED)
> +				const char *remote_name, const char *remote_url,
> +				struct strvec *names, struct strvec *urls)
>  {
> +	size_t i;
> +
>  	if (accept == ACCEPT_ALL)
>  		return 1;
>  
> -	BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
> +	i = strvec_find_index(names, remote_name);
> +
> +	if (i >= names->nr)
> +		/* We don't know about that remote */
> +		return 0;

OK.

> +	if (accept == ACCEPT_KNOWN_NAME)
> +		return 1;
> +
> +	if (accept != ACCEPT_KNOWN_URL)
> +		BUG("Unhandled 'enum accept_promisor' value '%d'", accept);

I can see why this defensiveness may be a better idea than not having
any, but I wonder if we can take advantage of compile time checks
some compilers have to ensure that case arms in a switch statement
are exhaustive?
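
For illustration, a minimal sketch of what a switch-based variant could
look like.  It is not the code from the patch: accept_remote_sketch()
and its 'known_url' parameter are hypothetical names, with 'known_url'
standing for the URL the client already has configured for the
advertised name (NULL if there is none).  Leaving out the "default:" arm
lets -Wswitch, which GCC and Clang enable with -Wall, flag any
enumerator that is later added to the enum but not handled.

```c
#include <assert.h>
#include <string.h>

enum accept_promisor {
	ACCEPT_NONE = 0,
	ACCEPT_KNOWN_URL,
	ACCEPT_KNOWN_NAME,
	ACCEPT_ALL
};

/*
 * Hypothetical helper, not part of the patch. 'known_url' is the URL the
 * client has configured for the advertised remote name, or NULL when no
 * remote with that name is configured. 'advertised_url' may be NULL when
 * the server advertised only a name.
 */
static int accept_remote_sketch(enum accept_promisor accept,
				const char *known_url,
				const char *advertised_url)
{
	/* No "default:" arm, so -Wswitch can report unhandled enumerators. */
	switch (accept) {
	case ACCEPT_NONE:
		return 0;
	case ACCEPT_ALL:
		return 1;
	case ACCEPT_KNOWN_NAME:
		return known_url != NULL;
	case ACCEPT_KNOWN_URL:
		return known_url && advertised_url &&
		       !strcmp(known_url, advertised_url);
	}

	/* Only reachable with a value outside the enum. */
	assert(!"unhandled 'enum accept_promisor' value");
	return 0;
}
```

The trailing assert() plays the role that BUG() has in the patch; it can
only trigger when a value outside the enum is passed in.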

> +	if (!strcasecmp(urls->v[i], remote_url))
> +		return 1;

This is iffy.  The <schema>://<host>/ part might want to be compared
case insensitively, but the rest of the URL is generally case
sensitive (unless the material served is stored on a machine with
case-insensitive filesystem)?

Given that the existing URL must have come by either cloning from
this server or another related server or by an earlier
acceptFromServer behaviour, I do not see a need for being extra lax
here.  We should be more careful about our use of case-insensitive
comparison, and I do not see how this URL comparison could be
something the end users would expect to be done case insensitively.
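
As a small sketch of the stricter behaviour, assuming a hypothetical
helper named same_remote_url() (not something the patch defines), a
byte-for-byte comparison would look like this:

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: compare the URL configured on the client with the
 * URL advertised by the server, byte for byte and case sensitively.
 */
static int same_remote_url(const char *configured, const char *advertised)
{
	assert(configured && advertised);
	return !strcmp(configured, advertised);
}
```

With such a helper, "file:///srv/git/serverTwo" and
"file:///srv/git/servertwo" (made-up paths) are two different URLs,
whereas a strcasecmp() based check treats them as the same one.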

> -static void filter_promisor_remote(struct strvec *accepted, const char *info)
> +static void filter_promisor_remote(struct repository *repo,
> +				   struct strvec *accepted,
> +				   const char *info)
>  {
>  	struct strbuf **remotes;
>  	const char *accept_str;
>  	enum accept_promisor accept = ACCEPT_NONE;
> +	struct strvec names = STRVEC_INIT;
> +	struct strvec urls = STRVEC_INIT;
>  
>  	if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
>  		if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))

Not a fault of this step, but is it sensible to even expect
!accept_str in an error case?  *accept_str could be NUL, but
accept_str would either be left uninitialized (because this caller
does not initialize it) when the get_string_tmp() returns non-zero, or
point at the internal cached value in the config_set if it returns
0 (and the control comes into this block).

>  			accept = ACCEPT_NONE;
> +		else if (!strcasecmp("KnownUrl", accept_str))
> +			accept = ACCEPT_KNOWN_URL;
> +		else if (!strcasecmp("KnownName", accept_str))
> +			accept = ACCEPT_KNOWN_NAME;
>  		else if (!strcasecmp("All", accept_str))
>  			accept = ACCEPT_ALL;
>  		else

Ditto about icase for all of the above.

> +test_expect_success "clone with 'KnownUrl' and different remote urls" '
> +	ln -s server2 serverTwo &&
> +
> +	git -C server config promisor.advertise true &&
> +
> +	# Clone from server to create a client
> +	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
> +		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
> +		-c remote.server2.url="file://$(pwd)/serverTwo" \
> +		-c promisor.acceptfromserver=KnownUrl \
> +		--no-local --filter="blob:limit=5k" server client &&
> +	test_when_finished "rm -rf client" &&
> +
> +	# Check that the largest object is not missing on the server
> +	check_missing_objects server 0 "" &&
> +
> +	# Reinitialize server so that the largest object is missing again
> +	initialize_server 1 "$oid"
> +'

Nice ;-)

Here, I also notice that we are not testing that serverTwo and
servertwo are considered the same thanks to the use of icase
comparison.  We shouldn't compare URLs with strcasecmp().

Thanks.


* Re: [PATCH v4 5/6] promisor-remote: check advertised name or URL
  2025-01-27 23:48         ` Junio C Hamano
@ 2025-01-28  0:01           ` Junio C Hamano
  2025-01-30 10:51           ` Patrick Steinhardt
  2025-02-18 11:42           ` Christian Couder
  2 siblings, 0 replies; 110+ messages in thread
From: Junio C Hamano @ 2025-01-28  0:01 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
	Randall S . Becker, Christian Couder

Junio C Hamano <gitster@pobox.com> writes:

>> +	if (!strcasecmp(urls->v[i], remote_url))
>> +		return 1;
>
> This is iffy.  The <schema>://<host>/ part might want to be compared
> case insensitively, but the rest of the URL is generally case
> sensitive (unless the material served is stored on a machine with
> case-insensitive filesystem)?
>
> Given that the existing URL must have come by either cloning from
> this server or another related server or by an earlier
> acceptFromServer behaviour, I do not see a need for being extra lax
> here.  We should be more careful about our use of case-insensitive
> comparison, and I do not see how this URL comparison could be
> something the end users would expect to be done case insensitively.

Note that I am not advocating to compare the earlier part case
insensitively while comparing the remainder case sensitively.

Because we are not comparing URLs that come from random sources, but
we know they come from only a few very controlled sources (i.e., the
original server we cloned from, and the promisor remotes suggested
by the original server and other promisor remotes whose suggestion
we accepted, recursively), it should be sufficient to compare the
whole string case sensitively.

Thanks.


* Re: [PATCH v4 4/6] Add 'promisor-remote' capability to protocol v2
  2025-01-27 15:16       ` [PATCH v4 4/6] Add 'promisor-remote' capability to protocol v2 Christian Couder
@ 2025-01-30 10:51         ` Patrick Steinhardt
  2025-02-18 11:41           ` Christian Couder
  0 siblings, 1 reply; 110+ messages in thread
From: Patrick Steinhardt @ 2025-01-30 10:51 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, Taylor Blau, Eric Sunshine, Karthik Nayak,
	Kristoffer Haugsbakk, brian m . carlson, Randall S . Becker,
	Christian Couder

On Mon, Jan 27, 2025 at 04:16:59PM +0100, Christian Couder wrote:
> When a server S knows that some objects from a repository are available
> from a promisor remote X, S might want to suggest to a client C cloning
> or fetching the repo from S that C may use X directly instead of S for
> these objects.

A lot of the commit message seems to be duplicated with the technical
documentation that you add. I wonder whether it would make sense to
simply refer to that instead of repeating all of it? That would make it
easier to spot the actually-important bits in the commit message that
add context to the patch.

One very important bit of context that I was lacking is what exactly we
wire up and where we do so. I have been searching for longer than I want
to admit where the client ends up using the promisor remotes, until I
eventually figured out that the client-side isn't wired up at all. It
makes sense in retrospect, but it would've been nice if the reader was
guided a bit.

> diff --git a/Documentation/gitprotocol-v2.txt b/Documentation/gitprotocol-v2.txt
> index 1652fef3ae..f25a9a6ad8 100644
> --- a/Documentation/gitprotocol-v2.txt
> +++ b/Documentation/gitprotocol-v2.txt
> @@ -781,6 +781,60 @@ retrieving the header from a bundle at the indicated URI, and thus
>  save themselves and the server(s) the request(s) needed to inspect the
>  headers of that bundle or bundles.
>  
> +promisor-remote=<pr-infos>
> +~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The server may advertise some promisor remotes it is using or knows
> +about to a client which may want to use them as its promisor remotes,
> +instead of this repository. In this case <pr-infos> should be of the
> +form:
> +
> +	pr-infos = pr-info | pr-infos ";" pr-info
> +
> +	pr-info = "name=" pr-name | "name=" pr-name "," "url=" pr-url
> +
> +where `pr-name` is the urlencoded name of a promisor remote, and
> +`pr-url` the urlencoded URL of that promisor remote.
> +
> +In this case, if the client decides to use one or more promisor
> +remotes the server advertised, it can reply with
> +"promisor-remote=<pr-names>" where <pr-names> should be of the form:
> +
> +	pr-names = pr-name | pr-names ";" pr-name
> +
> +where `pr-name` is the urlencoded name of a promisor remote the server
> +advertised and the client accepts.
> +
> +Note that, everywhere in this document, `pr-name` MUST be a valid
> +remote name, and the ';' and ',' characters MUST be encoded if they
> +appear in `pr-name` or `pr-url`.
> +
> +If the server doesn't know any promisor remote that could be good for
> +a client to use, or prefers a client not to use any promisor remote it
> +uses or knows about, it shouldn't advertise the "promisor-remote"
> +capability at all.
> +
> +In this case, or if the client doesn't want to use any promisor remote
> +the server advertised, the client shouldn't advertise the
> +"promisor-remote" capability at all in its reply.
> +
> +The "promisor.advertise" and "promisor.acceptFromServer" configuration
> +options can be used on the server and client side respectively to

s/respectively//, as you already say that in the next line.

> +control what they advertise or accept respectively. See the
> +documentation of these configuration options for more information.
> +
> +Note that in the future it would be nice if the "promisor-remote"
> +protocol capability could be used by the server, when responding to
> +`git fetch` or `git clone`, to advertise better-connected remotes that
> +the client can use as promisor remotes, instead of this repository, so
> +that the client can lazily fetch objects from these other
> +better-connected remotes. This would require the server to omit in its
> +response the objects available on the better-connected remotes that
> +the client has accepted. This hasn't been implemented yet though. So
> +for now this "promisor-remote" capability is useful only when the
> +server advertises some promisor remotes it already uses to borrow
> +objects from.

I'd leave out this bit as it doesn't really add a lot to the document.
It's a possibility for the future, but without it being implemented
anywhere it's not that helpful from my point of view.

> diff --git a/promisor-remote.c b/promisor-remote.c
> index c714f4f007..5ac282ed27 100644
> --- a/promisor-remote.c
> +++ b/promisor-remote.c
> @@ -292,3 +306,185 @@ void promisor_remote_get_direct(struct repository *repo,
>  	if (to_free)
>  		free(remaining_oids);
>  }
> +
> +static int allow_unsanitized(char ch)
> +{
> +	if (ch == ',' || ch == ';' || ch == '%')
> +		return 0;
> +	return ch > 32 && ch < 127;
> +}

Isn't this too lenient? It would also allow e.g. '=' and all kinds
of other characters. This does make sense for URLs, but it doesn't make
sense for remote names as they aren't supposed to contain punctuation in
the first place. So for these remote names I'd think we should be way
stricter and return an error in case they contain non-alphanumeric data.
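
A rough sketch of what such a stricter check could look like. The helper
name `is_strict_remote_name` is hypothetical and not part of the patch,
and whether characters like '-' or '_' should also be allowed is left
open here:

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of the patch: accept a remote name
 * only if it is non-empty and made of ASCII letters and digits.
 */
static int is_strict_remote_name(const char *name)
{
	if (!name || !*name)
		return 0;
	for (const char *p = name; *p; p++) {
		int is_digit = *p >= '0' && *p <= '9';
		int is_upper = *p >= 'A' && *p <= 'Z';
		int is_lower = *p >= 'a' && *p <= 'z';

		if (!is_digit && !is_upper && !is_lower)
			return 0;
	}
	return 1;
}
```

With a check like this, a caller could refuse to advertise a name such as
"a=b" instead of percent-encoding it.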

> +static void promisor_info_vecs(struct repository *repo,
> +			       struct strvec *names,
> +			       struct strvec *urls)

I wonder whether it would make more sense to track these as a strmap
instead of two arrays which are expected to have related entries in the
same place.
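
To illustrate the idea of keeping each name next to its URL, here is a
minimal sketch. It does not use git's strmap; a plain array of structs
stands in for it, and `struct promisor_entry` and `format_entries` are
made-up names. Encoding of the values is left out on purpose:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical pairing of a promisor remote's name with its URL. */
struct promisor_entry {
	const char *name;
	const char *url; /* NULL when no URL is configured */
};

/*
 * Write "name=<n>[,url=<u>]" items joined by ';' into out.
 * Returns the length written, or -1 if out is too small.
 */
static int format_entries(const struct promisor_entry *e, size_t nr,
			  char *out, size_t outsz)
{
	size_t len = 0;

	if (!outsz)
		return -1;
	out[0] = '\0';
	for (size_t i = 0; i < nr; i++) {
		int n;

		if (e[i].url)
			n = snprintf(out + len, outsz - len, "%sname=%s,url=%s",
				     i ? ";" : "", e[i].name, e[i].url);
		else
			n = snprintf(out + len, outsz - len, "%sname=%s",
				     i ? ";" : "", e[i].name);
		if (n < 0 || (size_t)n >= outsz - len)
			return -1;
		len += (size_t)n;
	}
	return (int)len;
}
```

Because a name and its URL live in the same entry, there is no way for
the two to get out of sync by index.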

> +{
> +	struct promisor_remote *r;
> +
> +	promisor_remote_init(repo);
> +
> +	for (r = repo->promisor_remote_config->promisors; r; r = r->next) {
> +		char *url;
> +		char *url_key = xstrfmt("remote.%s.url", r->name);
> +
> +		strvec_push(names, r->name);
> +		strvec_push(urls, git_config_get_string(url_key, &url) ? NULL : url);
> +
> +		free(url);
> +		free(url_key);
> +	}
> +}
> +
> +char *promisor_remote_info(struct repository *repo)
> +{
> +	struct strbuf sb = STRBUF_INIT;
> +	int advertise_promisors = 0;
> +	struct strvec names = STRVEC_INIT;
> +	struct strvec urls = STRVEC_INIT;
> +
> +	git_config_get_bool("promisor.advertise", &advertise_promisors);
> +
> +	if (!advertise_promisors)
> +		return NULL;
> +
> +	promisor_info_vecs(repo, &names, &urls);
> +
> +	if (!names.nr)
> +		return NULL;
> +
> +	for (size_t i = 0; i < names.nr; i++) {
> +		if (i)
> +			strbuf_addch(&sb, ';');
> +		strbuf_addstr(&sb, "name=");
> +		strbuf_addstr_urlencode(&sb, names.v[i], allow_unsanitized);
> +		if (urls.v[i]) {
> +			strbuf_addstr(&sb, ",url=");
> +			strbuf_addstr_urlencode(&sb, urls.v[i], allow_unsanitized);
> +		}
> +	}
> +
> +	redact_non_printables(&sb);

So here we replace non-printable characters with dots as far as I
understand. But didn't we just URL-encode the strings? So is there ever
a possibility for non-printable characters here?
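
A small sketch to check that reasoning. `percent_encode` is a simplified
stand-in written for this example, not the real
strbuf_addstr_urlencode(); it only assumes that every byte rejected by
the predicate is written as "%XX":

```c
#include <stddef.h>

/* Same predicate as in the quoted patch. */
static int allow_unsanitized(char ch)
{
	if (ch == ',' || ch == ';' || ch == '%')
		return 0;
	return ch > 32 && ch < 127;
}

/*
 * Simplified stand-in for strbuf_addstr_urlencode(): copy allowed
 * bytes as-is and write every other byte as "%XX".
 * Returns the encoded length, or -1 if out is too small.
 */
static int percent_encode(const char *in, char *out, size_t outsz)
{
	static const char hex[] = "0123456789abcdef";
	size_t len = 0;

	for (; *in; in++) {
		unsigned char c = (unsigned char)*in;

		if (allow_unsanitized(*in)) {
			if (len + 1 >= outsz)
				return -1;
			out[len++] = *in;
		} else {
			if (len + 3 >= outsz)
				return -1;
			out[len++] = '%';
			out[len++] = hex[c >> 4];
			out[len++] = hex[c & 15];
		}
	}
	if (len >= outsz)
		return -1;
	out[len] = '\0';
	return (int)len;
}
```

Under that assumption the output consists only of bytes accepted by the
predicate plus '%' and hex digits, all of which are printable, so a later
redaction pass would have nothing left to replace.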

> +	strvec_clear(&names);
> +	strvec_clear(&urls);
> +
> +	return strbuf_detach(&sb, NULL);
> +}
> +
> +enum accept_promisor {
> +	ACCEPT_NONE = 0,
> +	ACCEPT_ALL
> +};
> +
> +static int should_accept_remote(enum accept_promisor accept,
> +				const char *remote_name UNUSED,
> +				const char *remote_url UNUSED)
> +{
> +	if (accept == ACCEPT_ALL)
> +		return 1;
> +
> +	BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
> +}
> +
> +static void filter_promisor_remote(struct strvec *accepted, const char *info)
> +{
> +	struct strbuf **remotes;
> +	const char *accept_str;
> +	enum accept_promisor accept = ACCEPT_NONE;
> +
> +	if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
> +		if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
> +			accept = ACCEPT_NONE;
> +		else if (!strcasecmp("All", accept_str))
> +			accept = ACCEPT_ALL;
> +		else
> +			warning(_("unknown '%s' value for '%s' config option"),
> +				accept_str, "promisor.acceptfromserver");
> +	}
> +
> +	if (accept == ACCEPT_NONE)
> +		return;
> +
> +	/* Parse remote info received */
> +
> +	remotes = strbuf_split_str(info, ';', 0);
> +
> +	for (size_t i = 0; remotes[i]; i++) {
> +		struct strbuf **elems;
> +		const char *remote_name = NULL;
> +		const char *remote_url = NULL;
> +		char *decoded_name = NULL;
> +		char *decoded_url = NULL;
> +
> +		strbuf_strip_suffix(remotes[i], ";");
> +		elems = strbuf_split(remotes[i], ',');
> +
> +		for (size_t j = 0; elems[j]; j++) {
> +			int res;
> +			strbuf_strip_suffix(elems[j], ",");
> +			res = skip_prefix(elems[j]->buf, "name=", &remote_name) ||
> +				skip_prefix(elems[j]->buf, "url=", &remote_url);
> +			if (!res)
> +				warning(_("unknown element '%s' from remote info"),
> +					elems[j]->buf);
> +		}
> +
> +		if (remote_name)
> +			decoded_name = url_percent_decode(remote_name);
> +		if (remote_url)
> +			decoded_url = url_percent_decode(remote_url);

This is data we have received from a potentially-untrusted remote, so we
should double-check that the data we have received doesn't contain any
weird characters:

  - For the remote name we should verify that it consists only of
    alphanumeric characters.

  - For the remote URL we need to verify that it's a proper URL without
    any newlines, non-printable characters or anything else.

We'll eventually end up storing that data in the configuration, so these
verifications are quite important so that an adversarial server cannot
perform config-injection and thus cause remote code execution.
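
For the URL part, a character-level check along these lines might be a
starting point. `is_safe_url_text` is a hypothetical name, and the sketch
deliberately does not verify that the string is a well-formed URL (scheme,
host and so on); it only rejects bytes that have no business being in
one:

```c
#include <stddef.h>

/*
 * Hypothetical check, not part of the patch: reject a decoded URL
 * that is empty or contains a space, a control character (newline
 * included), DEL or any non-ASCII byte.
 */
static int is_safe_url_text(const char *url)
{
	if (!url || !*url)
		return 0;
	for (const unsigned char *p = (const unsigned char *)url; *p; p++)
		if (*p <= 32 || *p >= 127)
			return 0;
	return 1;
}
```

A decoded value such as "https://example.com/x\n[core]" would be rejected
before it could ever reach a config file.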

[snip]
> +void mark_promisor_remotes_as_accepted(struct repository *r, const char *remotes)
> +{
> +	struct strbuf **accepted_remotes = strbuf_split_str(remotes, ';', 0);
> +
> +	for (size_t i = 0; accepted_remotes[i]; i++) {
> +		struct promisor_remote *p;
> +		char *decoded_remote;
> +
> +		strbuf_strip_suffix(accepted_remotes[i], ";");
> +		decoded_remote = url_percent_decode(accepted_remotes[i]->buf);
> +
> +		p = repo_promisor_remote_find(r, decoded_remote);
> +		if (p)
> +			p->accepted = 1;
> +		else
> +			warning(_("accepted promisor remote '%s' not found"),
> +				decoded_remote);

My initial understanding of this code was that it is about the
client-side accepting a remote, but this is about the server-side and
tracks whether a promisor remote has been accepted by the client. It
feels a bit weird to modify semi-global state for this, as I'd have
rather expected that we pass around a vector of accepted remotes
instead.

But I guess ultimately this isn't too bad. It would be nice though if
it was more obvious whether we're on the server- or client-side.

> diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
> new file mode 100755
> index 0000000000..0390c1dbad
> --- /dev/null
> +++ b/t/t5710-promisor-remote-capability.sh
> @@ -0,0 +1,244 @@
[snip]
> +initialize_server () {
> +	count="$1"
> +	missing_oids="$2"
> +
> +	# Repack everything first
> +	git -C server -c repack.writebitmaps=false repack -a -d &&
> +
> +	# Remove promisor file in case they exist, useful when reinitializing
> +	rm -rf server/objects/pack/*.promisor &&
> +
> +	# Repack without the largest object and create a promisor pack on server
> +	git -C server -c repack.writebitmaps=false repack -a -d \
> +	    --filter=blob:limit=5k --filter-to="$(pwd)/pack" &&
> +	promisor_file=$(ls server/objects/pack/*.pack | sed "s/\.pack/.promisor/") &&
> +	>"$promisor_file" &&
> +
> +	# Check objects missing on the server
> +	check_missing_objects server "$count" "$missing_oids"
> +}
> +
> +copy_to_server2 () {

Nit: `server2` could be renamed to `promisor` to make the relation
between the two servers more obvious.

> diff --git a/upload-pack.c b/upload-pack.c
> index 728b2477fc..7498b45e2e 100644
> --- a/upload-pack.c
> +++ b/upload-pack.c
> @@ -319,6 +320,8 @@ static void create_pack_file(struct upload_pack_data *pack_data,
>  		strvec_push(&pack_objects.args, "--delta-base-offset");
>  	if (pack_data->use_include_tag)
>  		strvec_push(&pack_objects.args, "--include-tag");
> +	if (repo_has_accepted_promisor_remote(the_repository))
> +		strvec_push(&pack_objects.args, "--missing=allow-promisor");

This is nice and simple, I like it.

Patrick

* Re: [PATCH v4 5/6] promisor-remote: check advertised name or URL
  2025-01-27 23:48         ` Junio C Hamano
  2025-01-28  0:01           ` Junio C Hamano
@ 2025-01-30 10:51           ` Patrick Steinhardt
  2025-02-18 11:41             ` Christian Couder
  2025-02-18 11:42           ` Christian Couder
  2 siblings, 1 reply; 110+ messages in thread
From: Patrick Steinhardt @ 2025-01-30 10:51 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Christian Couder, git, Taylor Blau, Eric Sunshine, Karthik Nayak,
	Kristoffer Haugsbakk, brian m . carlson, Randall S . Becker,
	Christian Couder

On Mon, Jan 27, 2025 at 03:48:08PM -0800, Junio C Hamano wrote:
> Christian Couder <christian.couder@gmail.com> writes:
> >  promisor.acceptFromServer::
> >  	If set to "all", a client will accept all the promisor remotes
> >  	a server might advertise using the "promisor-remote"
> > -	capability. Default is "none", which means no promisor remote
> > -	advertised by a server will be accepted. By accepting a
> > -	promisor remote, the client agrees that the server might omit
> > -	objects that are lazily fetchable from this promisor remote
> > -	from its responses to "fetch" and "clone" requests from the
> > -	client. See linkgit:gitprotocol-v2[5].
> > +	capability. If set to "knownName" the client will accept
> > +	promisor remotes which are already configured on the client
> > +	and have the same name as those advertised by the client. This
> > +	is not very secure, but could be used in a corporate setup
> > +	where servers and clients are trusted to not switch name and
> > +	URLs.
> 
> I wonder if the reader needs to be told a bit more about the
> security argument here.  I imagine that the attack vector behind the
> use of "secure" in the above paragraph is for a malicious server
> that guesses a promisor remote name the client already uses, which
> has a different URL from what the client expects to be associated
> with the name, thereby such an acceptance means that the URL used in
> future fetches would be replaced without the user's consent.  Being
> able to silently repoint the remote.origin.url at an evil repository
> you control is indeed a powerful thing, I would guess.  Of course,
> in a corp environment, such a mechanism to drive the clients to a
> new repository after upgrading or migrating may be extremely handy.

I'm still very hesitant about letting the server-side control remote
names at all, as I've already mentioned in previous review rounds. I
think that it opens up the client for a whole lot of issues that should
rather be avoided. Most importantly, it takes control away from the
user, as they are not free anymore to name the remotes however they want
to. It also casts into stone current behaviour because it is now part of
the protocol.

That being said, I get the point that it may make sense to be "agile"
regarding the promisor remotes. But I think we can achieve that without
having to compromise on either usability or security by using something
like a promisor ID instead.

Instead of announcing remote names, each announced promisor would have
an ID. This ID is opaque and merely used to identify the promisor after
the fact. It could for example be a UUID or something else that is
mostly unique.

The client will then create a promisor remote for each of the remote
names. The name of the promisor is derived from the remote name that it
is being created from. When there's a single promisor only it could for
example be called "origin-promisor". When there are multiple ones they
could be enumerated as "origin-promisor-1". In practice, we can even
roll the dice to generate the name, even though that may not be as user
friendly.

These names are _not_ used to identify the promisor. Instead, we also
write "remote.origin-promisor.id" and point it to the UUID that the
server has advertised. Furthermore, for each promisor that gets added in
this way, we'll also add "remote.origin.promisor" pointing to the
promisor name.

So on a subsequent fetch, we can now:

  1. Look up all the promisors for the remote we're fetching from via
     the "remote.origin.promisor" multivalue config.

  2. For each promisor, we figure out whether its ID is still being
     advertised by the remote server. If not, then it is a stale
     promisor and we can optionally remove it.

  3. If the promisor ID is still being announced we double check whether
     the URL we have stored is still valid. If not, we can optionally
     update it to point to the new URL.
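
The three steps above could be sketched roughly as follows. All names
(`struct promisor`, `reconcile`, the action enum) are invented for this
example, and "still valid" is interpreted as "still equal to the URL the
server advertises for that ID", which is an assumption:

```c
#include <string.h>
#include <stddef.h>

/* Hypothetical view of a promisor: an opaque ID plus its URL. */
struct promisor {
	const char *id;
	const char *url;
};

enum promisor_action {
	PROMISOR_KEEP,
	PROMISOR_UPDATE_URL,
	PROMISOR_STALE
};

/*
 * Decide what to do with one locally stored promisor given the list
 * the server currently advertises. On PROMISOR_UPDATE_URL, *new_url
 * is set to the advertised URL.
 */
static enum promisor_action reconcile(const struct promisor *local,
				      const struct promisor *adv, size_t nr,
				      const char **new_url)
{
	for (size_t i = 0; i < nr; i++) {
		if (strcmp(local->id, adv[i].id))
			continue;
		if (!strcmp(local->url, adv[i].url))
			return PROMISOR_KEEP;
		*new_url = adv[i].url;
		return PROMISOR_UPDATE_URL;
	}
	return PROMISOR_STALE;
}
```

Whether a stale promisor is actually removed, or a URL actually
rewritten, would still be up to the client's configuration.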

This buys us a bunch of things:

  - We have promisor agility and are easily able to update URLs and
    prune out stale promisors.

  - Promisors can be renamed by the user at will, as they are identified
    by ID and not by remote name. We have to add logic to update the
    "remote.*.promisor" links, but that should be doable.

  - Each remote has its own set of promisors that cannot conflict with
    one another.

From here on, I'd also redesign "promisor.acceptFromServer" a bit:

  - "new" allows newly announced promisor remotes.

  - "update" allows updating existing promisor remotes.

  - "prune" allows pruning existing promisor remotes.

All of that only applies to promisors connected to the current remote,
of course. Furthermore, the values may be combined arbitrarily with one
another, e.g. you can say "new,update" to only accept new or updated
remotes but not allow pruning, or "update,prune" to only allow updating
or pruning promisors without adding new ones.
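
Parsing such a combined value could look roughly like this. The flag
names and `parse_accept_flags` are hypothetical, and the handling of
edge cases (empty value, trailing comma) is just one possible choice:

```c
#include <string.h>
#include <stddef.h>

enum {
	ACCEPT_NEW    = 1 << 0,
	ACCEPT_UPDATE = 1 << 1,
	ACCEPT_PRUNE  = 1 << 2,
};

/*
 * Parse a comma-separated value such as "new,update" into a bitmask.
 * Returns -1 if an item is unknown or empty (e.g. "new,,update").
 * An empty string yields 0; a trailing comma is tolerated.
 */
static int parse_accept_flags(const char *value)
{
	int flags = 0;

	while (*value) {
		size_t len = strcspn(value, ",");

		if (len == 3 && !strncmp(value, "new", len))
			flags |= ACCEPT_NEW;
		else if (len == 6 && !strncmp(value, "update", len))
			flags |= ACCEPT_UPDATE;
		else if (len == 5 && !strncmp(value, "prune", len))
			flags |= ACCEPT_PRUNE;
		else
			return -1;
		value += len;
		if (*value == ',')
			value++;
	}
	return flags;
}
```

A bitmask keeps the three permissions independent, so "new,update" and
"update,prune" need no special cases.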

I realize that this is a bit more work than what we currently have, but
I think that the design is significantly better than the proposed one.
From my point of view none of this really needs to be part of the
current patch series though, as these are all client-side changes in the
first place, and as far as I understand we don't have the client-side
ready yet anyway.

The only change required would be to adapt the protocol so that we don't
advertise promisor names anymore, but promisor IDs instead.

Patrick

* Re: [PATCH v4 3/6] version: make redact_non_printables() non-static
  2025-01-27 15:16       ` [PATCH v4 3/6] version: make redact_non_printables() non-static Christian Couder
@ 2025-01-30 10:51         ` Patrick Steinhardt
  2025-02-18 11:42           ` Christian Couder
  0 siblings, 1 reply; 110+ messages in thread
From: Patrick Steinhardt @ 2025-01-30 10:51 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, Taylor Blau, Eric Sunshine, Karthik Nayak,
	Kristoffer Haugsbakk, brian m . carlson, Randall S . Becker

On Mon, Jan 27, 2025 at 04:16:58PM +0100, Christian Couder wrote:
> As we are going to reuse redact_non_printables() outside "version.c",
> let's make it non-static.

Missing the DCO.

> diff --git a/version.h b/version.h
> index 7c62e80577..fcc1816685 100644
> --- a/version.h
> +++ b/version.h
> @@ -4,7 +4,15 @@
>  extern const char git_version_string[];
>  extern const char git_built_from_commit_string[];
>  
> +struct strbuf;
> +
>  const char *git_user_agent(void);
>  const char *git_user_agent_sanitized(void);
>  
> +/*
> + * Trim and replace each character with ascii code below 32 or above
> + * 127 (included) using a dot '.' character.
> +*/
> +void redact_non_printables(struct strbuf *buf);

Is this header really the right spot though? If I want to redact
characters I certainly wouldn't be looking at "version.h" for that
functionality.

Patrick

* [PATCH v5 0/3] Introduce a "promisor-remote" capability
  2025-01-27 15:16     ` [PATCH v4 0/6] " Christian Couder
                         ` (6 preceding siblings ...)
  2025-01-27 21:14       ` [PATCH v4 0/6] Introduce a "promisor-remote" capability Junio C Hamano
@ 2025-02-18 11:32       ` Christian Couder
  2025-02-18 11:32         ` [PATCH v5 1/3] Add 'promisor-remote' capability to protocol v2 Christian Couder
                           ` (4 more replies)
  7 siblings, 5 replies; 110+ messages in thread
From: Christian Couder @ 2025-02-18 11:32 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
	Randall S . Becker, Christian Couder

This work is part of some effort to better handle large files/blobs in
a client-server context using promisor remotes dedicated to storing
large blobs. To help understand this effort, this series now contains
a patch (patch 3/3) that adds design documentation about this effort.

Last year, I sent 3 versions of a patch series with the goal of
allowing a client C to clone from a server S while using the same
promisor remote X that S already uses. See:

https://lore.kernel.org/git/20240418184043.2900955-1-christian.couder@gmail.com/

Junio suggested implementing that feature using:

"a protocol extension that lets S tell C that S wants C to fetch
missing objects from X (which means that if C knows about X in its
".git/config" then there is no need for end-user interaction at all),
or a protocol extension that C tells S that C is willing to see
objects available from X omitted when S does not have them (again,
this could be done by looking at ".git/config" at C, but there may be
security implications???)"

This patch series implements that protocol extension called
"promisor-remote" (that name is open to change or simplification)
which allows S and C to agree on C using X directly or not.

I have tried to implement it in a quite generic way that could allow S
and C to share more information about promisor remotes and how to use
them.

For now, C doesn't use the information it gets from S when cloning.
That information is only used to decide if C is OK to use the promisor
remotes advertised by S. But this could change in the future which
could make it much simpler for clients than using the current way of
passing information about X with the `-c` option of `git clone` many
times on the command line.

Another improvement could be to not require GIT_NO_LAZY_FETCH=0 when S
and C have agreed on using X.

Changes compared to version 4
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  - The series is rebased on top of 0394451348 (The eleventh batch,
    2025-02-14). This is to take into account some recent changes like
    some documentation files using the ".adoc" extension instead of
    ".txt".

  - Patches 1/6, 2/6 and 3/6 from version 4 have been removed, as it
    looks like using redact_non_printables() is not necessary after
    all.

  - Patch 1/3 ("Add 'promisor-remote' capability to protocol v2") has
    a number of small changes:

      - In the protocol-v2 doc, "respectively" is not repeated.

      - In "promisor-remote.c", the useless call to
        redact_non_printables() has been removed.

      - In "promisor-remote.c", a useless "!accept_str" check has been
        removed.

      - In "promisor-remote.h", references to gitprotocol-v2(5) have
        been added to some comments.

      - In "promisor-remote.h", a comment has been improved to say
        that mark_promisor_remotes_as_accepted() is useful on the
        server side.

      - In "t/t5710-promisor-remote-capability.sh", "server2" has been
        replaced with "lop".

  - In patch 2/3 ("promisor-remote: check advertised name or URL"),
    there are also a number of small changes:

      - In "Documentation/config/promisor.adoc", an instance of
        "knownUrl" has been replaced with "knownName" to fix a
        mistake.

      - In "promisor-remote.c", strvec_find_index() has been renamed
        remote_nick_find() and its arguments have been renamed
        accordingly. Its comment doc has also been updated
        accordingly.

      - In "promisor-remote.c", URLs are now compared case
        sensitively, so a call to strcasecmp() has been replaced with
        a call to strcmp().

  - In patch 3/3 ("doc: add technical design doc for large object
    promisors"), there are a few small changes:

      - A paragraph was added to explain that LOPs can be useful even
        when they are not used very efficiently.

      - A small sentence was added to acknowledge that more discussion
        will be needed before implementing a feature to offload large
        blobs from clients.

Thanks to Junio, Patrick, Eric, Karthik, Kristoffer, brian, Randall
and Taylor for their suggestions to improve this patch series.

CI tests
~~~~~~~~

All the CI tests passed, see:

https://github.com/chriscool/git/actions/runs/13388314841

Range diff compared to version 4
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1:  9e646013be < -:  ---------- version: replace manual ASCII checks with isprint() for clarity
2:  f4b22ef39d < -:  ---------- version: refactor redact_non_printables()
3:  8bfa6f7a20 < -:  ---------- version: make redact_non_printables() non-static
4:  652ce32892 ! 1:  918515f5ee Add 'promisor-remote' capability to protocol v2
    @@ Commit message
         Helped-by: Patrick Steinhardt <ps@pks.im>
         Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
     
    - ## Documentation/config/promisor.txt ##
    + ## Documentation/config/promisor.adoc ##
     @@
      promisor.quiet::
        If set to "true" assume `--quiet` when fetching additional
    @@ Documentation/config/promisor.txt
     +  from its responses to "fetch" and "clone" requests from the
     +  client. See linkgit:gitprotocol-v2[5].
     
    - ## Documentation/gitprotocol-v2.txt ##
    -@@ Documentation/gitprotocol-v2.txt: retrieving the header from a bundle at the indicated URI, and thus
    + ## Documentation/gitprotocol-v2.adoc ##
    +@@ Documentation/gitprotocol-v2.adoc: retrieving the header from a bundle at the indicated URI, and thus
      save themselves and the server(s) the request(s) needed to inspect the
      headers of that bundle or bundles.
      
    @@ Documentation/gitprotocol-v2.txt: retrieving the header from a bundle at the ind
     +"promisor-remote" capability at all in its reply.
     +
     +The "promisor.advertise" and "promisor.acceptFromServer" configuration
    -+options can be used on the server and client side respectively to
    -+control what they advertise or accept respectively. See the
    -+documentation of these configuration options for more information.
    ++options can be used on the server and client side to control what they
    ++advertise or accept respectively. See the documentation of these
    ++configuration options for more information.
     +
     +Note that in the future it would be nice if the "promisor-remote"
     +protocol capability could be used by the server, when responding to
    @@ promisor-remote.c: void promisor_remote_get_direct(struct repository *repo,
     +          }
     +  }
     +
    -+  redact_non_printables(&sb);
    -+
     +  strvec_clear(&names);
     +  strvec_clear(&urls);
     +
    @@ promisor-remote.c: void promisor_remote_get_direct(struct repository *repo,
     +  enum accept_promisor accept = ACCEPT_NONE;
     +
     +  if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
    -+          if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
    ++          if (!*accept_str || !strcasecmp("None", accept_str))
     +                  accept = ACCEPT_NONE;
     +          else if (!strcasecmp("All", accept_str))
     +                  accept = ACCEPT_ALL;
    @@ promisor-remote.h: void promisor_remote_get_direct(struct repository *repo,
     + * advertisement.
     + * Return value is NULL if no promisor remote advertisement should be
     + * made. Otherwise it contains the names and urls of the advertised
    -+ * promisor remotes separated by ';'
    ++ * promisor remotes separated by ';'. See gitprotocol-v2(5).
     + */
     +char *promisor_remote_info(struct repository *repo);
     +
    @@ promisor-remote.h: void promisor_remote_get_direct(struct repository *repo,
     + * configured promisor remotes, if any, to prepare the reply.
     + * Return value is NULL if no promisor remote from the server
     + * is accepted. Otherwise it contains the names of the accepted promisor
    -+ * remotes separated by ';'.
    ++ * remotes separated by ';'. See gitprotocol-v2(5).
     + */
     +char *promisor_remote_reply(const char *info);
     +
     +/*
    -+ * Set the 'accepted' flag for some promisor remotes. Useful when some
    -+ * promisor remotes have been accepted by the client.
    ++ * Set the 'accepted' flag for some promisor remotes. Useful on the
    ++ * server side when some promisor remotes have been accepted by the
    ++ * client.
     + */
     +void mark_promisor_remotes_as_accepted(struct repository *repo, const char *remotes);
     +
    @@ t/t5710-promisor-remote-capability.sh (new)
     +  check_missing_objects server "$count" "$missing_oids"
     +}
     +
    -+copy_to_server2 () {
    ++copy_to_lop () {
     +  oid_path="$(test_oid_to_path $1)" &&
     +  path="server/objects/$oid_path" &&
    -+  path2="server2/objects/$oid_path" &&
    ++  path2="lop/objects/$oid_path" &&
     +  mkdir -p $(dirname "$path2") &&
     +  cp "$path" "$path2"
     +}
     +
     +test_expect_success "setup for testing promisor remote advertisement" '
    -+  # Create another bare repo called "server2"
    -+  git init --bare server2 &&
    ++  # Create another bare repo called "lop" (for Large Object Promisor)
    ++  git init --bare lop &&
     +
    -+  # Copy the largest object from server to server2
    ++  # Copy the largest object from server to lop
     +  obj="HEAD:foo" &&
     +  oid="$(git -C server rev-parse $obj)" &&
    -+  copy_to_server2 "$oid" &&
    ++  copy_to_lop "$oid" &&
     +
     +  initialize_server 1 "$oid" &&
     +
    -+  # Configure server2 as promisor remote for server
    -+  git -C server remote add server2 "file://$(pwd)/server2" &&
    -+  git -C server config remote.server2.promisor true &&
    ++  # Configure lop as promisor remote for server
    ++  git -C server remote add lop "file://$(pwd)/lop" &&
    ++  git -C server config remote.lop.promisor true &&
     +
    -+  git -C server2 config uploadpack.allowFilter true &&
    -+  git -C server2 config uploadpack.allowAnySHA1InWant true &&
    ++  git -C lop config uploadpack.allowFilter true &&
    ++  git -C lop config uploadpack.allowAnySHA1InWant true &&
     +  git -C server config uploadpack.allowFilter true &&
     +  git -C server config uploadpack.allowAnySHA1InWant true
     +'
    @@ t/t5710-promisor-remote-capability.sh (new)
     +  git -C server config promisor.advertise true &&
     +
     +  # Clone from server to create a client
    -+  GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
    -+          -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
    -+          -c remote.server2.url="file://$(pwd)/server2" \
    ++  GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
    ++          -c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
    ++          -c remote.lop.url="file://$(pwd)/lop" \
     +          -c promisor.acceptfromserver=All \
     +          --no-local --filter="blob:limit=5k" server client &&
     +  test_when_finished "rm -rf client" &&
    @@ t/t5710-promisor-remote-capability.sh (new)
     +  git -C server config promisor.advertise false &&
     +
     +  # Clone from server to create a client
    -+  GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
    -+          -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
    -+          -c remote.server2.url="file://$(pwd)/server2" \
    ++  GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
    ++          -c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
    ++          -c remote.lop.url="file://$(pwd)/lop" \
     +          -c promisor.acceptfromserver=All \
     +          --no-local --filter="blob:limit=5k" server client &&
     +  test_when_finished "rm -rf client" &&
    @@ t/t5710-promisor-remote-capability.sh (new)
     +  git -C server config promisor.advertise true &&
     +
     +  # Clone from server to create a client
    -+  GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
    -+          -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
    -+          -c remote.server2.url="file://$(pwd)/server2" \
    ++  GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
    ++          -c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
    ++          -c remote.lop.url="file://$(pwd)/lop" \
     +          -c promisor.acceptfromserver=None \
     +          --no-local --filter="blob:limit=5k" server client &&
     +  test_when_finished "rm -rf client" &&
    @@ t/t5710-promisor-remote-capability.sh (new)
     +  test_when_finished "rm -rf client" &&
     +  mkdir client &&
     +  git -C client init &&
    -+  git -C client config remote.server2.promisor true &&
    -+  git -C client config remote.server2.fetch "+refs/heads/*:refs/remotes/server2/*" &&
    -+  git -C client config remote.server2.url "file://$(pwd)/server2" &&
    ++  git -C client config remote.lop.promisor true &&
    ++  git -C client config remote.lop.fetch "+refs/heads/*:refs/remotes/lop/*" &&
    ++  git -C client config remote.lop.url "file://$(pwd)/lop" &&
     +  git -C client config remote.server.url "file://$(pwd)/server" &&
     +  git -C client config remote.server.fetch "+refs/heads/*:refs/remotes/server/*" &&
     +  git -C client config promisor.acceptfromserver All &&
    @@ t/t5710-promisor-remote-capability.sh (new)
     +  git -C server config promisor.advertise true &&
     +
     +  # Clone from server to create a client
    -+  GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
    -+          -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
    -+          -c remote.server2.url="file://$(pwd)/server2" \
    ++  GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
    ++          -c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
    ++          -c remote.lop.url="file://$(pwd)/lop" \
     +          -c promisor.acceptfromserver=All \
     +          --no-local --filter="blob:limit=5k" server client &&
     +
    @@ t/t5710-promisor-remote-capability.sh (new)
     +  # Repack everything twice and remove .promisor files before
     +  # each repack. This makes sure everything gets repacked
     +  # into a single packfile. The second repack is necessary
    -+  # because the first one fetches from server2 and creates a new
    ++  # because the first one fetches from lop and creates a new
     +  # packfile and its associated .promisor file.
     +
     +  rm -f server/objects/pack/*.promisor &&
    @@ t/t5710-promisor-remote-capability.sh (new)
     +  packfile=$(ls pack-*.pack) &&
     +  git -C server unpack-objects --strict <"$packfile" &&
     +
    -+  # Copy new large object to server2
    ++  # Copy new large object to lop
     +  obj_bar="HEAD:bar" &&
     +  oid_bar="$(git -C server rev-parse $obj_bar)" &&
    -+  copy_to_server2 "$oid_bar" &&
    ++  copy_to_lop "$oid_bar" &&
     +
     +  # Reinitialize server so that the 2 largest objects are missing
     +  printf "%s\n" "$oid" "$oid_bar" >expected_missing.txt &&
5:  979a0af1c3 ! 2:  89e20976ba promisor-remote: check advertised name or URL
    @@ Commit message
     
         Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
     
    - ## Documentation/config/promisor.txt ##
    -@@ Documentation/config/promisor.txt: promisor.advertise::
    + ## Documentation/config/promisor.adoc ##
    +@@ Documentation/config/promisor.adoc: promisor.advertise::
      promisor.acceptFromServer::
        If set to "all", a client will accept all the promisor remotes
        a server might advertise using the "promisor-remote"
    @@ Documentation/config/promisor.txt: promisor.advertise::
     +  URLs. If set to "knownUrl", the client will accept promisor
     +  remotes which have both the same name and the same URL
     +  configured on the client as the name and URL advertised by the
    -+  server. This is more secure than "all" or "knownUrl", so it
    ++  server. This is more secure than "all" or "knownName", so it
     +  should be used if possible instead of those options. Default
     +  is "none", which means no promisor remote advertised by a
     +  server will be accepted. By accepting a promisor remote, the
    @@ promisor-remote.c: char *promisor_remote_info(struct repository *repo)
      }
      
     +/*
    -+ * Find first index of 'vec' where there is 'val'. 'val' is compared
    -+ * case insensively to the strings in 'vec'. If not found 'vec->nr' is
    -+ * returned.
    ++ * Find first index of 'nicks' where there is 'nick'. 'nick' is
    ++ * compared case insensitively to the strings in 'nicks'. If not found
    ++ * 'nicks->nr' is returned.
     + */
    -+static size_t strvec_find_index(struct strvec *vec, const char *val)
    ++static size_t remote_nick_find(struct strvec *nicks, const char *nick)
     +{
    -+  for (size_t i = 0; i < vec->nr; i++)
    -+          if (!strcasecmp(vec->v[i], val))
    ++  for (size_t i = 0; i < nicks->nr; i++)
    ++          if (!strcasecmp(nicks->v[i], nick))
     +                  return i;
    -+  return vec->nr;
    ++  return nicks->nr;
     +}
     +
      enum accept_promisor {
    @@ promisor-remote.c: char *promisor_remote_info(struct repository *repo)
                return 1;
      
     -  BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
    -+  i = strvec_find_index(names, remote_name);
    ++  i = remote_nick_find(names, remote_name);
     +
     +  if (i >= names->nr)
     +          /* We don't know about that remote */
    @@ promisor-remote.c: char *promisor_remote_info(struct repository *repo)
     +  if (accept != ACCEPT_KNOWN_URL)
     +          BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
     +
    -+  if (!strcasecmp(urls->v[i], remote_url))
    ++  if (!strcmp(urls->v[i], remote_url))
     +          return 1;
     +
     +  warning(_("known remote named '%s' but with url '%s' instead of '%s'"),
    @@ promisor-remote.c: char *promisor_remote_info(struct repository *repo)
     +  struct strvec urls = STRVEC_INIT;
      
        if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
    -           if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
    +           if (!*accept_str || !strcasecmp("None", accept_str))
                        accept = ACCEPT_NONE;
     +          else if (!strcasecmp("KnownUrl", accept_str))
     +                  accept = ACCEPT_KNOWN_URL;
    @@ t/t5710-promisor-remote-capability.sh: test_expect_success "init + fetch with pr
     +  git -C server config promisor.advertise true &&
     +
     +  # Clone from server to create a client
    -+  GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
    -+          -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
    -+          -c remote.server2.url="file://$(pwd)/server2" \
    ++  GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
    ++          -c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
    ++          -c remote.lop.url="file://$(pwd)/lop" \
     +          -c promisor.acceptfromserver=KnownName \
     +          --no-local --filter="blob:limit=5k" server client &&
     +  test_when_finished "rm -rf client" &&
    @@ t/t5710-promisor-remote-capability.sh: test_expect_success "init + fetch with pr
     +
     +  # Clone from server to create a client
     +  GIT_NO_LAZY_FETCH=0 git clone -c remote.serverTwo.promisor=true \
    -+          -c remote.serverTwo.fetch="+refs/heads/*:refs/remotes/server2/*" \
    -+          -c remote.serverTwo.url="file://$(pwd)/server2" \
    ++          -c remote.serverTwo.fetch="+refs/heads/*:refs/remotes/lop/*" \
    ++          -c remote.serverTwo.url="file://$(pwd)/lop" \
     +          -c promisor.acceptfromserver=KnownName \
     +          --no-local --filter="blob:limit=5k" server client &&
     +  test_when_finished "rm -rf client" &&
    @@ t/t5710-promisor-remote-capability.sh: test_expect_success "init + fetch with pr
     +  git -C server config promisor.advertise true &&
     +
     +  # Clone from server to create a client
    -+  GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
    -+          -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
    -+          -c remote.server2.url="file://$(pwd)/server2" \
    ++  GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
    ++          -c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
    ++          -c remote.lop.url="file://$(pwd)/lop" \
     +          -c promisor.acceptfromserver=KnownUrl \
     +          --no-local --filter="blob:limit=5k" server client &&
     +  test_when_finished "rm -rf client" &&
    @@ t/t5710-promisor-remote-capability.sh: test_expect_success "init + fetch with pr
     +'
     +
     +test_expect_success "clone with 'KnownUrl' and different remote urls" '
    -+  ln -s server2 serverTwo &&
    ++  ln -s lop serverTwo &&
     +
     +  git -C server config promisor.advertise true &&
     +
     +  # Clone from server to create a client
    -+  GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
    -+          -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
    -+          -c remote.server2.url="file://$(pwd)/serverTwo" \
    ++  GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
    ++          -c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
    ++          -c remote.lop.url="file://$(pwd)/serverTwo" \
     +          -c promisor.acceptfromserver=KnownUrl \
     +          --no-local --filter="blob:limit=5k" server client &&
     +  test_when_finished "rm -rf client" &&
6:  3a0c134e09 ! 3:  e980fe0aa2 doc: add technical design doc for large object promisors
    @@ Documentation/technical/large-object-promisors.txt (new)
     +a number of current issues in the context of Git clients and servers
     +sharing Git objects.
     +
    ++Even if LOPs are used not very efficiently, they can still be useful
    ++and worth using in some cases because, as we will see in more details
    ++later in this document:
    ++
    ++  - they can make it simpler for clients to use promisor remotes and
    ++    therefore avoid fetching a lot of large blobs they might not need
    ++    locally,
    ++
    ++  - they can make it significantly cheaper or easier for servers to
    ++    host a significant part of the current repository content, and
    ++    even more to host content with larger blobs or more large blobs
    ++    than currently.
    ++
     +I) Issues with the current situation
     +------------------------------------
     +
    @@ Documentation/technical/large-object-promisors.txt (new)
     +to the LOP without the main remote checking them in some ways
     +(possibly using hooks or other tools).
     +
    ++This should be discussed and refined when we get closer to
    ++implementing this feature.
    ++
     +Rationale
     ++++++++++
     +


Christian Couder (3):
  Add 'promisor-remote' capability to protocol v2
  promisor-remote: check advertised name or URL
  doc: add technical design doc for large object promisors

 Documentation/config/promisor.adoc            |  27 +
 Documentation/gitprotocol-v2.adoc             |  54 ++
 .../technical/large-object-promisors.txt      | 656 ++++++++++++++++++
 connect.c                                     |   9 +
 promisor-remote.c                             | 242 +++++++
 promisor-remote.h                             |  37 +-
 serve.c                                       |  26 +
 t/meson.build                                 |   1 +
 t/t5710-promisor-remote-capability.sh         | 312 +++++++++
 upload-pack.c                                 |   3 +
 10 files changed, 1366 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/technical/large-object-promisors.txt
 create mode 100755 t/t5710-promisor-remote-capability.sh

-- 
2.48.1.359.ge980fe0aa2



* [PATCH v5 1/3] Add 'promisor-remote' capability to protocol v2
  2025-02-18 11:32       ` [PATCH v5 0/3] " Christian Couder
@ 2025-02-18 11:32         ` Christian Couder
  2025-02-18 11:32         ` [PATCH v5 2/3] promisor-remote: check advertised name or URL Christian Couder
                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-02-18 11:32 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
	Randall S . Becker, Christian Couder, Christian Couder

When a server S knows that some objects from a repository are available
from a promisor remote X, S might want to suggest to a client C cloning
or fetching the repo from S that C may use X directly instead of S for
these objects.

Note that this could happen both in the case S itself doesn't have the
objects and borrows them from X, and in the case S has the objects but
knows that X is better connected to the world (e.g., it is in a
$LARGEINTERNETCOMPANY datacenter with petabit/s backbone connections)
than S. Implementation of the latter case, which would require S to
omit in its response the objects available on X, is left for future
improvement though.

Then C might or might not want to get the objects from X. If S and C
can agree on C using X directly, S can then omit objects that can be
obtained from X when answering C's request.

To allow S and C to agree and let each other know about C using X or
not, let's introduce a new "promisor-remote" capability in the
protocol v2, as well as a few new configuration variables:

  - "promisor.advertise" on the server side, and
  - "promisor.acceptFromServer" on the client side.

By default, or if "promisor.advertise" is set to 'false', a server S will
not advertise the "promisor-remote" capability.

If S doesn't advertise the "promisor-remote" capability, then a client C
replying to S shouldn't advertise the "promisor-remote" capability
either.

If "promisor.advertise" is set to 'true', S will advertise its promisor
remotes with a string like:

  promisor-remote=<pr-info>[;<pr-info>]...

where each <pr-info> element contains information about a single
promisor remote in the form:

  name=<pr-name>[,url=<pr-url>]

where <pr-name> is the urlencoded name of a promisor remote and
<pr-url> is the urlencoded URL of the promisor remote named <pr-name>.
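
As a rough illustration of the advertised string, here is a minimal
standalone sketch that does not use Git's strbuf/strvec API. The helper
names append_literal(), append_encoded() and format_pr_infos(), as well
as the fixed-size output buffer, are assumptions made for this example
only. The encoding rule mirrors allow_unsanitized() from this patch:
',', ';', '%' and anything outside printable ASCII get percent-encoded.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same rule as allow_unsanitized() in this patch. */
static int allow_unsanitized(char ch)
{
	if (ch == ',' || ch == ';' || ch == '%')
		return 0;
	return ch > 32 && ch < 127;
}

/* Append 's' verbatim to 'out' and return the new length. */
static size_t append_literal(char *out, size_t len, size_t cap, const char *s)
{
	size_t n = strlen(s);

	assert(len + n < cap);
	memcpy(out + len, s, n + 1);
	return len + n;
}

/* Append 's' percent-encoded to 'out' and return the new length. */
static size_t append_encoded(char *out, size_t len, size_t cap, const char *s)
{
	assert(len < cap);
	for (; *s; s++) {
		if (allow_unsanitized(*s)) {
			assert(len + 1 < cap);
			out[len++] = *s;
		} else {
			assert(len + 3 < cap);
			len += (size_t)snprintf(out + len, cap - len, "%%%02x",
						(unsigned char)*s);
		}
	}
	out[len] = '\0';
	return len;
}

/*
 * Build "name=<pr-name>[,url=<pr-url>]" entries joined by ';'.
 * urls[i] may be NULL when no URL is known for names[i].
 */
void format_pr_infos(char *out, size_t cap,
		     const char **names, const char **urls, size_t nr)
{
	size_t len = 0;

	assert(cap > 0);
	out[0] = '\0';
	for (size_t i = 0; i < nr; i++) {
		if (i)
			len = append_literal(out, len, cap, ";");
		len = append_literal(out, len, cap, "name=");
		len = append_encoded(out, len, cap, names[i]);
		if (urls[i]) {
			len = append_literal(out, len, cap, ",url=");
			len = append_encoded(out, len, cap, urls[i]);
		}
	}
}
```

With a hypothetical remote named "lop" whose URL is
"file:///srv/lop", the resulting value would be
"name=lop,url=file:///srv/lop", which S would send as
"promisor-remote=name=lop,url=file:///srv/lop".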

For now, the URL is passed in addition to the name. In the future, it
might be possible to pass other information like a filter-spec that the
client may use when cloning from S, or a token that the client may use
when retrieving objects from X.

It is C's responsibility to arrange how it can reach X though, so pieces
of information that are usually outside Git's concern, like proxy
configuration, must not be distributed over this protocol.

It might also be possible in the future for "promisor.advertise" to have
other values. For example a value like "onlyName" could prevent S from
advertising URLs, which could help in case C should use a different URL
for X than the URL S is using. (The URL S is using might be an internal
one on the server side for example.)

By default or if "promisor.acceptFromServer" is set to "None", C will
not accept any of the promisor remotes that might have been advertised
by S. In this case, C will not advertise any "promisor-remote"
capability in its reply to S.

If "promisor.acceptFromServer" is set to "All" and S advertised some
promisor remotes, then on the contrary, C will accept all the
promisor remotes that S advertised and will reply with a string like:

  promisor-remote=<pr-name>[;<pr-name>]...

where the <pr-name> elements are the urlencoded names of all the
promisor remotes S advertised.
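
To make the client side more concrete, the following standalone sketch
extracts the decoded remote names from an advertisement, which is what
C needs before it can build its reply. It is not the code from this
patch: parse_pr_names(), percent_decode(), hexval() and the 64-byte
name limit are assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <string.h>

/* Return the value of hex digit 'c', or -1 if it is not one. */
static int hexval(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/* Decode %XX sequences in place; malformed ones are copied as-is. */
static void percent_decode(char *s)
{
	char *out = s;

	while (*s) {
		if (s[0] == '%' && hexval(s[1]) >= 0 && hexval(s[2]) >= 0) {
			*out++ = (char)(hexval(s[1]) * 16 + hexval(s[2]));
			s += 3;
		} else {
			*out++ = *s++;
		}
	}
	*out = '\0';
}

/*
 * Store the decoded name of each ';'-separated entry of 'info' in
 * names[], up to 'max' entries. Entries without a "name=" field, or
 * with a name that does not fit, are skipped. Returns the number of
 * names stored.
 */
size_t parse_pr_names(const char *info, char names[][64], size_t max)
{
	size_t nr = 0;

	while (*info && nr < max) {
		const char *end = info + strcspn(info, ";");
		const char *p = info;

		while (p < end) {
			size_t field_len = strcspn(p, ",;");

			if (field_len > 5 && !strncmp(p, "name=", 5) &&
			    field_len - 5 < 64) {
				memcpy(names[nr], p + 5, field_len - 5);
				names[nr][field_len - 5] = '\0';
				percent_decode(names[nr]);
				nr++;
				break;
			}
			p += field_len;
			if (p < end)
				p++; /* skip the ',' separator */
		}
		info = end;
		if (*info == ';')
			info++;
	}
	return nr;
}
```

With "promisor.acceptFromServer" set to "All", C would then urlencode
each name found this way again and join the results with ';' to form
the value of its "promisor-remote=" reply.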

In a following commit, other values for "promisor.acceptFromServer" will
be implemented, so that C will be able to decide the promisor remotes it
accepts depending on the name and URL it received from S. So even if
that name and URL information is not used much right now, it will be
needed soon.

Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/promisor.adoc    |  17 ++
 Documentation/gitprotocol-v2.adoc     |  54 ++++++
 connect.c                             |   9 +
 promisor-remote.c                     | 194 ++++++++++++++++++++
 promisor-remote.h                     |  37 +++-
 serve.c                               |  26 +++
 t/meson.build                         |   1 +
 t/t5710-promisor-remote-capability.sh | 244 ++++++++++++++++++++++++++
 upload-pack.c                         |   3 +
 9 files changed, 584 insertions(+), 1 deletion(-)
 create mode 100755 t/t5710-promisor-remote-capability.sh

diff --git a/Documentation/config/promisor.adoc b/Documentation/config/promisor.adoc
index 98c5cb2ec2..9cbfe3e59e 100644
--- a/Documentation/config/promisor.adoc
+++ b/Documentation/config/promisor.adoc
@@ -1,3 +1,20 @@
 promisor.quiet::
 	If set to "true" assume `--quiet` when fetching additional
 	objects for a partial clone.
+
+promisor.advertise::
+	If set to "true", a server will use the "promisor-remote"
+	capability, see linkgit:gitprotocol-v2[5], to advertise the
+	promisor remotes it is using, if it uses some. Default is
+	"false", which means the "promisor-remote" capability is not
+	advertised.
+
+promisor.acceptFromServer::
+	If set to "all", a client will accept all the promisor remotes
+	a server might advertise using the "promisor-remote"
+	capability. Default is "none", which means no promisor remote
+	advertised by a server will be accepted. By accepting a
+	promisor remote, the client agrees that the server might omit
+	objects that are lazily fetchable from this promisor remote
+	from its responses to "fetch" and "clone" requests from the
+	client. See linkgit:gitprotocol-v2[5].
diff --git a/Documentation/gitprotocol-v2.adoc b/Documentation/gitprotocol-v2.adoc
index 1652fef3ae..c20b74aac0 100644
--- a/Documentation/gitprotocol-v2.adoc
+++ b/Documentation/gitprotocol-v2.adoc
@@ -781,6 +781,60 @@ retrieving the header from a bundle at the indicated URI, and thus
 save themselves and the server(s) the request(s) needed to inspect the
 headers of that bundle or bundles.
 
+promisor-remote=<pr-infos>
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The server may advertise some promisor remotes it is using or knows
+about to a client which may want to use them as its promisor remotes,
+instead of this repository. In this case <pr-infos> should be of the
+form:
+
+	pr-infos = pr-info | pr-infos ";" pr-info
+
+	pr-info = "name=" pr-name | "name=" pr-name "," "url=" pr-url
+
+where `pr-name` is the urlencoded name of a promisor remote, and
+`pr-url` the urlencoded URL of that promisor remote.
+
+In this case, if the client decides to use one or more promisor
+remotes the server advertised, it can reply with
+"promisor-remote=<pr-names>" where <pr-names> should be of the form:
+
+	pr-names = pr-name | pr-names ";" pr-name
+
+where `pr-name` is the urlencoded name of a promisor remote the server
+advertised and the client accepts.
+
+Note that, everywhere in this document, `pr-name` MUST be a valid
+remote name, and the ';' and ',' characters MUST be encoded if they
+appear in `pr-name` or `pr-url`.
+
+If the server doesn't know any promisor remote that could be good for
+a client to use, or prefers a client not to use any promisor remote it
+uses or knows about, it shouldn't advertise the "promisor-remote"
+capability at all.
+
+In this case, or if the client doesn't want to use any promisor remote
+the server advertised, the client shouldn't advertise the
+"promisor-remote" capability at all in its reply.
+
+The "promisor.advertise" and "promisor.acceptFromServer" configuration
+options can be used on the server and client side to control what they
+advertise or accept respectively. See the documentation of these
+configuration options for more information.
+
+Note that in the future it would be nice if the "promisor-remote"
+protocol capability could be used by the server, when responding to
+`git fetch` or `git clone`, to advertise better-connected remotes that
+the client can use as promisor remotes, instead of this repository, so
+that the client can lazily fetch objects from these other
+better-connected remotes. This would require the server to omit in its
+response the objects available on the better-connected remotes that
+the client has accepted. This hasn't been implemented yet though. So
+for now this "promisor-remote" capability is useful only when the
+server advertises some promisor remotes it already uses to borrow
+objects from.
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/connect.c b/connect.c
index 91f3990014..125150ac25 100644
--- a/connect.c
+++ b/connect.c
@@ -22,6 +22,7 @@
 #include "protocol.h"
 #include "alias.h"
 #include "bundle-uri.h"
+#include "promisor-remote.h"
 
 static char *server_capabilities_v1;
 static struct strvec server_capabilities_v2 = STRVEC_INIT;
@@ -487,6 +488,7 @@ void check_stateless_delimiter(int stateless_rpc,
 static void send_capabilities(int fd_out, struct packet_reader *reader)
 {
 	const char *hash_name;
+	const char *promisor_remote_info;
 
 	if (server_supports_v2("agent"))
 		packet_write_fmt(fd_out, "agent=%s", git_user_agent_sanitized());
@@ -500,6 +502,13 @@ static void send_capabilities(int fd_out, struct packet_reader *reader)
 	} else {
 		reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
 	}
+	if (server_feature_v2("promisor-remote", &promisor_remote_info)) {
+		char *reply = promisor_remote_reply(promisor_remote_info);
+		if (reply) {
+			packet_write_fmt(fd_out, "promisor-remote=%s", reply);
+			free(reply);
+		}
+	}
 }
 
 int get_remote_bundle_uri(int fd_out, struct packet_reader *reader,
diff --git a/promisor-remote.c b/promisor-remote.c
index c714f4f007..918be6528f 100644
--- a/promisor-remote.c
+++ b/promisor-remote.c
@@ -11,6 +11,8 @@
 #include "strvec.h"
 #include "packfile.h"
 #include "environment.h"
+#include "url.h"
+#include "version.h"
 
 struct promisor_remote_config {
 	struct promisor_remote *promisors;
@@ -221,6 +223,18 @@ int repo_has_promisor_remote(struct repository *r)
 	return !!repo_promisor_remote_find(r, NULL);
 }
 
+int repo_has_accepted_promisor_remote(struct repository *r)
+{
+	struct promisor_remote *p;
+
+	promisor_remote_init(r);
+
+	for (p = r->promisor_remote_config->promisors; p; p = p->next)
+		if (p->accepted)
+			return 1;
+	return 0;
+}
+
 static int remove_fetched_oids(struct repository *repo,
 			       struct object_id **oids,
 			       int oid_nr, int to_free)
@@ -292,3 +306,183 @@ void promisor_remote_get_direct(struct repository *repo,
 	if (to_free)
 		free(remaining_oids);
 }
+
+static int allow_unsanitized(char ch)
+{
+	if (ch == ',' || ch == ';' || ch == '%')
+		return 0;
+	return ch > 32 && ch < 127;
+}
+
+static void promisor_info_vecs(struct repository *repo,
+			       struct strvec *names,
+			       struct strvec *urls)
+{
+	struct promisor_remote *r;
+
+	promisor_remote_init(repo);
+
+	for (r = repo->promisor_remote_config->promisors; r; r = r->next) {
+		char *url;
+		char *url_key = xstrfmt("remote.%s.url", r->name);
+
+		strvec_push(names, r->name);
+		strvec_push(urls, git_config_get_string(url_key, &url) ? NULL : url);
+
+		free(url);
+		free(url_key);
+	}
+}
+
+char *promisor_remote_info(struct repository *repo)
+{
+	struct strbuf sb = STRBUF_INIT;
+	int advertise_promisors = 0;
+	struct strvec names = STRVEC_INIT;
+	struct strvec urls = STRVEC_INIT;
+
+	git_config_get_bool("promisor.advertise", &advertise_promisors);
+
+	if (!advertise_promisors)
+		return NULL;
+
+	promisor_info_vecs(repo, &names, &urls);
+
+	if (!names.nr)
+		return NULL;
+
+	for (size_t i = 0; i < names.nr; i++) {
+		if (i)
+			strbuf_addch(&sb, ';');
+		strbuf_addstr(&sb, "name=");
+		strbuf_addstr_urlencode(&sb, names.v[i], allow_unsanitized);
+		if (urls.v[i]) {
+			strbuf_addstr(&sb, ",url=");
+			strbuf_addstr_urlencode(&sb, urls.v[i], allow_unsanitized);
+		}
+	}
+
+	strvec_clear(&names);
+	strvec_clear(&urls);
+
+	return strbuf_detach(&sb, NULL);
+}
+
+enum accept_promisor {
+	ACCEPT_NONE = 0,
+	ACCEPT_ALL
+};
+
+static int should_accept_remote(enum accept_promisor accept,
+				const char *remote_name UNUSED,
+				const char *remote_url UNUSED)
+{
+	if (accept == ACCEPT_ALL)
+		return 1;
+
+	BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+}
+
+static void filter_promisor_remote(struct strvec *accepted, const char *info)
+{
+	struct strbuf **remotes;
+	const char *accept_str;
+	enum accept_promisor accept = ACCEPT_NONE;
+
+	if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
+		if (!*accept_str || !strcasecmp("None", accept_str))
+			accept = ACCEPT_NONE;
+		else if (!strcasecmp("All", accept_str))
+			accept = ACCEPT_ALL;
+		else
+			warning(_("unknown '%s' value for '%s' config option"),
+				accept_str, "promisor.acceptfromserver");
+	}
+
+	if (accept == ACCEPT_NONE)
+		return;
+
+	/* Parse remote info received */
+
+	remotes = strbuf_split_str(info, ';', 0);
+
+	for (size_t i = 0; remotes[i]; i++) {
+		struct strbuf **elems;
+		const char *remote_name = NULL;
+		const char *remote_url = NULL;
+		char *decoded_name = NULL;
+		char *decoded_url = NULL;
+
+		strbuf_strip_suffix(remotes[i], ";");
+		elems = strbuf_split(remotes[i], ',');
+
+		for (size_t j = 0; elems[j]; j++) {
+			int res;
+			strbuf_strip_suffix(elems[j], ",");
+			res = skip_prefix(elems[j]->buf, "name=", &remote_name) ||
+				skip_prefix(elems[j]->buf, "url=", &remote_url);
+			if (!res)
+				warning(_("unknown element '%s' from remote info"),
+					elems[j]->buf);
+		}
+
+		if (remote_name)
+			decoded_name = url_percent_decode(remote_name);
+		if (remote_url)
+			decoded_url = url_percent_decode(remote_url);
+
+		if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url))
+			strvec_push(accepted, decoded_name);
+
+		strbuf_list_free(elems);
+		free(decoded_name);
+		free(decoded_url);
+	}
+
+	strbuf_list_free(remotes);
+}
+
+char *promisor_remote_reply(const char *info)
+{
+	struct strvec accepted = STRVEC_INIT;
+	struct strbuf reply = STRBUF_INIT;
+
+	filter_promisor_remote(&accepted, info);
+
+	if (!accepted.nr)
+		return NULL;
+
+	for (size_t i = 0; i < accepted.nr; i++) {
+		if (i)
+			strbuf_addch(&reply, ';');
+		strbuf_addstr_urlencode(&reply, accepted.v[i], allow_unsanitized);
+	}
+
+	strvec_clear(&accepted);
+
+	return strbuf_detach(&reply, NULL);
+}
+
+void mark_promisor_remotes_as_accepted(struct repository *r, const char *remotes)
+{
+	struct strbuf **accepted_remotes = strbuf_split_str(remotes, ';', 0);
+
+	for (size_t i = 0; accepted_remotes[i]; i++) {
+		struct promisor_remote *p;
+		char *decoded_remote;
+
+		strbuf_strip_suffix(accepted_remotes[i], ";");
+		decoded_remote = url_percent_decode(accepted_remotes[i]->buf);
+
+		p = repo_promisor_remote_find(r, decoded_remote);
+		if (p)
+			p->accepted = 1;
+		else
+			warning(_("accepted promisor remote '%s' not found"),
+				decoded_remote);
+
+		free(decoded_remote);
+	}
+
+	strbuf_list_free(accepted_remotes);
+}
diff --git a/promisor-remote.h b/promisor-remote.h
index 88cb599c39..263d331a55 100644
--- a/promisor-remote.h
+++ b/promisor-remote.h
@@ -9,11 +9,13 @@ struct object_id;
  * Promisor remote linked list
  *
  * Information in its fields come from remote.XXX config entries or
- * from extensions.partialclone.
+ * from extensions.partialclone, except for 'accepted' which comes
+ * from protocol v2 capabilities exchange.
  */
 struct promisor_remote {
 	struct promisor_remote *next;
 	char *partial_clone_filter;
+	unsigned int accepted : 1;
 	const char name[FLEX_ARRAY];
 };
 
@@ -32,4 +34,37 @@ void promisor_remote_get_direct(struct repository *repo,
 				const struct object_id *oids,
 				int oid_nr);
 
+/*
+ * Prepare a "promisor-remote" advertisement by a server.
+ * Check the value of "promisor.advertise" and maybe the configured
+ * promisor remotes, if any, to prepare information to send in an
+ * advertisement.
+ * Return value is NULL if no promisor remote advertisement should be
+ * made. Otherwise it contains the names and urls of the advertised
+ * promisor remotes separated by ';'. See gitprotocol-v2(5).
+ */
+char *promisor_remote_info(struct repository *repo);
+
+/*
+ * Prepare a reply to a "promisor-remote" advertisement from a server.
+ * Check the value of "promisor.acceptfromserver" and maybe the
+ * configured promisor remotes, if any, to prepare the reply.
+ * Return value is NULL if no promisor remote from the server
+ * is accepted. Otherwise it contains the names of the accepted promisor
+ * remotes separated by ';'. See gitprotocol-v2(5).
+ */
+char *promisor_remote_reply(const char *info);
+
+/*
+ * Set the 'accepted' flag for some promisor remotes. Useful on the
+ * server side when some promisor remotes have been accepted by the
+ * client.
+ */
+void mark_promisor_remotes_as_accepted(struct repository *repo, const char *remotes);
+
+/*
+ * Has any promisor remote been accepted by the client?
+ */
+int repo_has_accepted_promisor_remote(struct repository *r);
+
 #endif /* PROMISOR_REMOTE_H */
diff --git a/serve.c b/serve.c
index f6dfe34a2b..e3ccf1505c 100644
--- a/serve.c
+++ b/serve.c
@@ -10,6 +10,7 @@
 #include "upload-pack.h"
 #include "bundle-uri.h"
 #include "trace2.h"
+#include "promisor-remote.h"
 
 static int advertise_sid = -1;
 static int advertise_object_info = -1;
@@ -29,6 +30,26 @@ static int agent_advertise(struct repository *r UNUSED,
 	return 1;
 }
 
+static int promisor_remote_advertise(struct repository *r,
+				     struct strbuf *value)
+{
+	if (value) {
+		char *info = promisor_remote_info(r);
+		if (!info)
+			return 0;
+		strbuf_addstr(value, info);
+		free(info);
+	}
+	return 1;
+}
+
+static void promisor_remote_receive(struct repository *r,
+				    const char *remotes)
+{
+	mark_promisor_remotes_as_accepted(r, remotes);
+}
+
+
 static int object_format_advertise(struct repository *r,
 				   struct strbuf *value)
 {
@@ -155,6 +176,11 @@ static struct protocol_capability capabilities[] = {
 		.advertise = bundle_uri_advertise,
 		.command = bundle_uri_command,
 	},
+	{
+		.name = "promisor-remote",
+		.advertise = promisor_remote_advertise,
+		.receive = promisor_remote_receive,
+	},
 };
 
 void protocol_v2_advertise_capabilities(struct repository *r)
diff --git a/t/meson.build b/t/meson.build
index a03ebc81fd..75ad6726c4 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -728,6 +728,7 @@ integration_tests = [
   't5703-upload-pack-ref-in-want.sh',
   't5704-protocol-violations.sh',
   't5705-session-id-in-capabilities.sh',
+  't5710-promisor-remote-capability.sh',
   't5730-protocol-v2-bundle-uri-file.sh',
   't5731-protocol-v2-bundle-uri-git.sh',
   't5732-protocol-v2-bundle-uri-http.sh',
diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
new file mode 100755
index 0000000000..51cf2269e1
--- /dev/null
+++ b/t/t5710-promisor-remote-capability.sh
@@ -0,0 +1,244 @@
+#!/bin/sh
+
+test_description='handling of promisor remote advertisement'
+
+. ./test-lib.sh
+
+GIT_TEST_MULTI_PACK_INDEX=0
+GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0
+
+# Setup the repository with three commits, this way HEAD is always
+# available and we can hide commit 1 or 2.
+test_expect_success 'setup: create "template" repository' '
+	git init template &&
+	test_commit -C template 1 &&
+	test_commit -C template 2 &&
+	test_commit -C template 3 &&
+	test-tool genrandom foo 10240 >template/foo &&
+	git -C template add foo &&
+	git -C template commit -m foo
+'
+
+# A bare repo will act as a server repo with unpacked objects.
+test_expect_success 'setup: create bare "server" repository' '
+	git clone --bare --no-local template server &&
+	mv server/objects/pack/pack-* . &&
+	packfile=$(ls pack-*.pack) &&
+	git -C server unpack-objects --strict <"$packfile"
+'
+
+check_missing_objects () {
+	git -C "$1" rev-list --objects --all --missing=print > all.txt &&
+	perl -ne 'print if s/^[?]//' all.txt >missing.txt &&
+	test_line_count = "$2" missing.txt &&
+	if test "$2" -lt 2
+	then
+		test "$3" = "$(cat missing.txt)"
+	else
+		test -f "$3" &&
+		sort <"$3" >expected_sorted &&
+		sort <missing.txt >actual_sorted &&
+		test_cmp expected_sorted actual_sorted
+	fi
+}
+
+initialize_server () {
+	count="$1"
+	missing_oids="$2"
+
+	# Repack everything first
+	git -C server -c repack.writebitmaps=false repack -a -d &&
+
+	# Remove promisor files in case they exist, useful when reinitializing
+	rm -rf server/objects/pack/*.promisor &&
+
+	# Repack without the largest object and create a promisor pack on server
+	git -C server -c repack.writebitmaps=false repack -a -d \
+	    --filter=blob:limit=5k --filter-to="$(pwd)/pack" &&
+	promisor_file=$(ls server/objects/pack/*.pack | sed "s/\.pack/.promisor/") &&
+	>"$promisor_file" &&
+
+	# Check objects missing on the server
+	check_missing_objects server "$count" "$missing_oids"
+}
+
+copy_to_lop () {
+	oid_path="$(test_oid_to_path $1)" &&
+	path="server/objects/$oid_path" &&
+	path2="lop/objects/$oid_path" &&
+	mkdir -p $(dirname "$path2") &&
+	cp "$path" "$path2"
+}
+
+test_expect_success "setup for testing promisor remote advertisement" '
+	# Create another bare repo called "lop" (for Large Object Promisor)
+	git init --bare lop &&
+
+	# Copy the largest object from server to lop
+	obj="HEAD:foo" &&
+	oid="$(git -C server rev-parse $obj)" &&
+	copy_to_lop "$oid" &&
+
+	initialize_server 1 "$oid" &&
+
+	# Configure lop as promisor remote for server
+	git -C server remote add lop "file://$(pwd)/lop" &&
+	git -C server config remote.lop.promisor true &&
+
+	git -C lop config uploadpack.allowFilter true &&
+	git -C lop config uploadpack.allowAnySHA1InWant true &&
+	git -C server config uploadpack.allowFilter true &&
+	git -C server config uploadpack.allowAnySHA1InWant true
+'
+
+test_expect_success "clone with promisor.advertise set to 'true'" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
+		-c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
+		-c remote.lop.url="file://$(pwd)/lop" \
+		-c promisor.acceptfromserver=All \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.advertise set to 'false'" '
+	git -C server config promisor.advertise false &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
+		-c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
+		-c remote.lop.url="file://$(pwd)/lop" \
+		-c promisor.acceptfromserver=All \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is not missing on the server
+	check_missing_objects server 0 "" &&
+
+	# Reinitialize server so that the largest object is missing again
+	initialize_server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.acceptfromserver set to 'None'" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
+		-c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
+		-c remote.lop.url="file://$(pwd)/lop" \
+		-c promisor.acceptfromserver=None \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is not missing on the server
+	check_missing_objects server 0 "" &&
+
+	# Reinitialize server so that the largest object is missing again
+	initialize_server 1 "$oid"
+'
+
+test_expect_success "init + fetch with promisor.advertise set to 'true'" '
+	git -C server config promisor.advertise true &&
+
+	test_when_finished "rm -rf client" &&
+	mkdir client &&
+	git -C client init &&
+	git -C client config remote.lop.promisor true &&
+	git -C client config remote.lop.fetch "+refs/heads/*:refs/remotes/lop/*" &&
+	git -C client config remote.lop.url "file://$(pwd)/lop" &&
+	git -C client config remote.server.url "file://$(pwd)/server" &&
+	git -C client config remote.server.fetch "+refs/heads/*:refs/remotes/server/*" &&
+	git -C client config promisor.acceptfromserver All &&
+	GIT_NO_LAZY_FETCH=0 git -C client fetch --filter="blob:limit=5k" server &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.advertise set to 'true' but don't delete the client" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
+		-c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
+		-c remote.lop.url="file://$(pwd)/lop" \
+		-c promisor.acceptfromserver=All \
+		--no-local --filter="blob:limit=5k" server client &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "setup for subsequent fetches" '
+	# Generate new commit with large blob
+	test-tool genrandom bar 10240 >template/bar &&
+	git -C template add bar &&
+	git -C template commit -m bar &&
+
+	# Fetch new commit with large blob
+	git -C server fetch origin &&
+	git -C server update-ref HEAD FETCH_HEAD &&
+	git -C server rev-parse HEAD >expected_head &&
+
+	# Repack everything twice and remove .promisor files before
+	# each repack. This makes sure everything gets repacked
+	# into a single packfile. The second repack is necessary
+	# because the first one fetches from lop and creates a new
+	# packfile and its associated .promisor file.
+
+	rm -f server/objects/pack/*.promisor &&
+	git -C server -c repack.writebitmaps=false repack -a -d &&
+	rm -f server/objects/pack/*.promisor &&
+	git -C server -c repack.writebitmaps=false repack -a -d &&
+
+	# Unpack everything
+	rm pack-* &&
+	mv server/objects/pack/pack-* . &&
+	packfile=$(ls pack-*.pack) &&
+	git -C server unpack-objects --strict <"$packfile" &&
+
+	# Copy new large object to lop
+	obj_bar="HEAD:bar" &&
+	oid_bar="$(git -C server rev-parse $obj_bar)" &&
+	copy_to_lop "$oid_bar" &&
+
+	# Reinitialize server so that the 2 largest objects are missing
+	printf "%s\n" "$oid" "$oid_bar" >expected_missing.txt &&
+	initialize_server 2 expected_missing.txt &&
+
+	# Create one more client
+	cp -r client client2
+'
+
+test_expect_success "subsequent fetch from a client when promisor.advertise is true" '
+	git -C server config promisor.advertise true &&
+
+	GIT_NO_LAZY_FETCH=0 git -C client pull origin &&
+
+	git -C client rev-parse HEAD >actual &&
+	test_cmp expected_head actual &&
+
+	cat client/bar >/dev/null &&
+
+	check_missing_objects server 2 expected_missing.txt
+'
+
+test_expect_success "subsequent fetch from a client when promisor.advertise is false" '
+	git -C server config promisor.advertise false &&
+
+	GIT_NO_LAZY_FETCH=0 git -C client2 pull origin &&
+
+	git -C client2 rev-parse HEAD >actual &&
+	test_cmp expected_head actual &&
+
+	cat client2/bar >/dev/null &&
+
+	check_missing_objects server 1 "$oid"
+'
+
+test_done
diff --git a/upload-pack.c b/upload-pack.c
index 728b2477fc..7498b45e2e 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -32,6 +32,7 @@
 #include "write-or-die.h"
 #include "json-writer.h"
 #include "strmap.h"
+#include "promisor-remote.h"
 
 /* Remember to update object flag allocation in object.h */
 #define THEY_HAVE	(1u << 11)
@@ -319,6 +320,8 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 		strvec_push(&pack_objects.args, "--delta-base-offset");
 	if (pack_data->use_include_tag)
 		strvec_push(&pack_objects.args, "--include-tag");
+	if (repo_has_accepted_promisor_remote(the_repository))
+		strvec_push(&pack_objects.args, "--missing=allow-promisor");
 	if (pack_data->filter_options.choice) {
 		const char *spec =
 			expand_list_objects_filter_spec(&pack_data->filter_options);
-- 
2.48.1.359.ge980fe0aa2



* [PATCH v5 2/3] promisor-remote: check advertised name or URL
  2025-02-18 11:32       ` [PATCH v5 0/3] " Christian Couder
  2025-02-18 11:32         ` [PATCH v5 1/3] Add 'promisor-remote' capability to protocol v2 Christian Couder
@ 2025-02-18 11:32         ` Christian Couder
  2025-02-18 11:32         ` [PATCH v5 3/3] doc: add technical design doc for large object promisors Christian Couder
                           ` (2 subsequent siblings)
  4 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-02-18 11:32 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
	Randall S . Becker, Christian Couder, Christian Couder

A previous commit introduced a "promisor.acceptFromServer" configuration
variable with only "None" or "All" as valid values.

Let's introduce "KnownName" and "KnownUrl" as valid values for this
configuration option to give more choice to a client about which
promisor remotes it might accept among those that the server advertised.

In case of "KnownName", the client will accept promisor remotes which
are already configured on the client and have the same name as those
advertised by the server. This could be useful in a corporate setup
where servers and clients are trusted to not switch names and URLs, but
where some kind of control is still useful.

In case of "KnownUrl", the client will accept promisor remotes which
have both the same name and the same URL configured on the client as the
name and URL advertised by the server. This is the most secure option,
so it should be used if possible.
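
As a minimal sketch of the client side (not part of this patch), the
setup for "KnownUrl" could look like the following. The remote name
"lop" and the URL are placeholders and only an assumption here; they
would have to match the name and URL that the server advertises.

```shell
#!/bin/sh
# Hypothetical client-side setup: "lop" and its URL are placeholders
# that must match what the server advertises for "knownUrl" to accept.
set -e

# Work in a throwaway repository so nothing else is touched.
repo=$(mktemp -d)
git init -q "$repo"

# Declare the promisor remote the client already knows about.
git -C "$repo" config remote.lop.promisor true
git -C "$repo" config remote.lop.url "https://lop.example.com/repo.git"

# Only accept advertised promisor remotes whose name and URL both match.
git -C "$repo" config promisor.acceptFromServer knownUrl

# Read the setting back to confirm what was stored.
accept=$(git -C "$repo" config promisor.acceptFromServer)
echo "promisor.acceptFromServer=$accept"
```

With "knownName" instead of "knownUrl", only the remote name would
have to match.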

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/promisor.adoc    | 22 ++++++---
 promisor-remote.c                     | 60 ++++++++++++++++++++---
 t/t5710-promisor-remote-capability.sh | 68 +++++++++++++++++++++++++++
 3 files changed, 138 insertions(+), 12 deletions(-)

diff --git a/Documentation/config/promisor.adoc b/Documentation/config/promisor.adoc
index 9cbfe3e59e..9192acfd24 100644
--- a/Documentation/config/promisor.adoc
+++ b/Documentation/config/promisor.adoc
@@ -12,9 +12,19 @@ promisor.advertise::
 promisor.acceptFromServer::
 	If set to "all", a client will accept all the promisor remotes
 	a server might advertise using the "promisor-remote"
-	capability. Default is "none", which means no promisor remote
-	advertised by a server will be accepted. By accepting a
-	promisor remote, the client agrees that the server might omit
-	objects that are lazily fetchable from this promisor remote
-	from its responses to "fetch" and "clone" requests from the
-	client. See linkgit:gitprotocol-v2[5].
+	capability. If set to "knownName" the client will accept
+	promisor remotes which are already configured on the client
+	and have the same name as those advertised by the server. This
+	is not very secure, but could be used in a corporate setup
+	where servers and clients are trusted to not switch names and
+	URLs. If set to "knownUrl", the client will accept promisor
+	remotes which have both the same name and the same URL
+	configured on the client as the name and URL advertised by the
+	server. This is more secure than "all" or "knownName", so it
+	should be used if possible instead of those options. Default
+	is "none", which means no promisor remote advertised by a
+	server will be accepted. By accepting a promisor remote, the
+	client agrees that the server might omit objects that are
+	lazily fetchable from this promisor remote from its responses
+	to "fetch" and "clone" requests from the client. See
+	linkgit:gitprotocol-v2[5].
diff --git a/promisor-remote.c b/promisor-remote.c
index 918be6528f..6a0a61382f 100644
--- a/promisor-remote.c
+++ b/promisor-remote.c
@@ -368,30 +368,73 @@ char *promisor_remote_info(struct repository *repo)
 	return strbuf_detach(&sb, NULL);
 }
 
+/*
+ * Find first index of 'nicks' where there is 'nick'. 'nick' is
+ * compared case insensitively to the strings in 'nicks'. If not found
+ * 'nicks->nr' is returned.
+ */
+static size_t remote_nick_find(struct strvec *nicks, const char *nick)
+{
+	for (size_t i = 0; i < nicks->nr; i++)
+		if (!strcasecmp(nicks->v[i], nick))
+			return i;
+	return nicks->nr;
+}
+
 enum accept_promisor {
 	ACCEPT_NONE = 0,
+	ACCEPT_KNOWN_URL,
+	ACCEPT_KNOWN_NAME,
 	ACCEPT_ALL
 };
 
 static int should_accept_remote(enum accept_promisor accept,
-				const char *remote_name UNUSED,
-				const char *remote_url UNUSED)
+				const char *remote_name, const char *remote_url,
+				struct strvec *names, struct strvec *urls)
 {
+	size_t i;
+
 	if (accept == ACCEPT_ALL)
 		return 1;
 
-	BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+	i = remote_nick_find(names, remote_name);
+
+	if (i >= names->nr)
+		/* We don't know about that remote */
+		return 0;
+
+	if (accept == ACCEPT_KNOWN_NAME)
+		return 1;
+
+	if (accept != ACCEPT_KNOWN_URL)
+		BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
+
+	if (!strcmp(urls->v[i], remote_url))
+		return 1;
+
+	warning(_("known remote named '%s' but with url '%s' instead of '%s'"),
+		remote_name, urls->v[i], remote_url);
+
+	return 0;
 }
 
-static void filter_promisor_remote(struct strvec *accepted, const char *info)
+static void filter_promisor_remote(struct repository *repo,
+				   struct strvec *accepted,
+				   const char *info)
 {
 	struct strbuf **remotes;
 	const char *accept_str;
 	enum accept_promisor accept = ACCEPT_NONE;
+	struct strvec names = STRVEC_INIT;
+	struct strvec urls = STRVEC_INIT;
 
 	if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
 		if (!*accept_str || !strcasecmp("None", accept_str))
 			accept = ACCEPT_NONE;
+		else if (!strcasecmp("KnownUrl", accept_str))
+			accept = ACCEPT_KNOWN_URL;
+		else if (!strcasecmp("KnownName", accept_str))
+			accept = ACCEPT_KNOWN_NAME;
 		else if (!strcasecmp("All", accept_str))
 			accept = ACCEPT_ALL;
 		else
@@ -402,6 +445,9 @@ static void filter_promisor_remote(struct strvec *accepted, const char *info)
 	if (accept == ACCEPT_NONE)
 		return;
 
+	if (accept != ACCEPT_ALL)
+		promisor_info_vecs(repo, &names, &urls);
+
 	/* Parse remote info received */
 
 	remotes = strbuf_split_str(info, ';', 0);
@@ -431,7 +477,7 @@ static void filter_promisor_remote(struct strvec *accepted, const char *info)
 		if (remote_url)
 			decoded_url = url_percent_decode(remote_url);
 
-		if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url))
+		if (decoded_name && should_accept_remote(accept, decoded_name, decoded_url, &names, &urls))
 			strvec_push(accepted, decoded_name);
 
 		strbuf_list_free(elems);
@@ -439,6 +485,8 @@ static void filter_promisor_remote(struct strvec *accepted, const char *info)
 		free(decoded_url);
 	}
 
+	strvec_clear(&names);
+	strvec_clear(&urls);
 	strbuf_list_free(remotes);
 }
 
@@ -447,7 +495,7 @@ char *promisor_remote_reply(const char *info)
 	struct strvec accepted = STRVEC_INIT;
 	struct strbuf reply = STRBUF_INIT;
 
-	filter_promisor_remote(&accepted, info);
+	filter_promisor_remote(the_repository, &accepted, info);
 
 	if (!accepted.nr)
 		return NULL;
diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
index 51cf2269e1..d2cc69a17e 100755
--- a/t/t5710-promisor-remote-capability.sh
+++ b/t/t5710-promisor-remote-capability.sh
@@ -160,6 +160,74 @@ test_expect_success "init + fetch with promisor.advertise set to 'true'" '
 	check_missing_objects server 1 "$oid"
 '
 
+test_expect_success "clone with promisor.acceptfromserver set to 'KnownName'" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
+		-c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
+		-c remote.lop.url="file://$(pwd)/lop" \
+		-c promisor.acceptfromserver=KnownName \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with 'KnownName' and different remote names" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.serverTwo.promisor=true \
+		-c remote.serverTwo.fetch="+refs/heads/*:refs/remotes/lop/*" \
+		-c remote.serverTwo.url="file://$(pwd)/lop" \
+		-c promisor.acceptfromserver=KnownName \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is not missing on the server
+	check_missing_objects server 0 "" &&
+
+	# Reinitialize server so that the largest object is missing again
+	initialize_server 1 "$oid"
+'
+
+test_expect_success "clone with promisor.acceptfromserver set to 'KnownUrl'" '
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
+		-c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
+		-c remote.lop.url="file://$(pwd)/lop" \
+		-c promisor.acceptfromserver=KnownUrl \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "clone with 'KnownUrl' and different remote urls" '
+	ln -s lop serverTwo &&
+
+	git -C server config promisor.advertise true &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.lop.promisor=true \
+		-c remote.lop.fetch="+refs/heads/*:refs/remotes/lop/*" \
+		-c remote.lop.url="file://$(pwd)/serverTwo" \
+		-c promisor.acceptfromserver=KnownUrl \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is not missing on the server
+	check_missing_objects server 0 "" &&
+
+	# Reinitialize server so that the largest object is missing again
+	initialize_server 1 "$oid"
+'
+
 test_expect_success "clone with promisor.advertise set to 'true' but don't delete the client" '
 	git -C server config promisor.advertise true &&
 
-- 
2.48.1.359.ge980fe0aa2



* [PATCH v5 3/3] doc: add technical design doc for large object promisors
  2025-02-18 11:32       ` [PATCH v5 0/3] " Christian Couder
  2025-02-18 11:32         ` [PATCH v5 1/3] Add 'promisor-remote' capability to protocol v2 Christian Couder
  2025-02-18 11:32         ` [PATCH v5 2/3] promisor-remote: check advertised name or URL Christian Couder
@ 2025-02-18 11:32         ` Christian Couder
  2025-02-21  8:33           ` Patrick Steinhardt
  2025-02-18 19:07         ` [PATCH v5 0/3] Introduce a "promisor-remote" capability Junio C Hamano
  2025-02-21  8:34         ` Patrick Steinhardt
  4 siblings, 1 reply; 110+ messages in thread
From: Christian Couder @ 2025-02-18 11:32 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
	Randall S . Becker, Christian Couder, Christian Couder

Let's add a design doc about how we could improve handling large blobs
using "Large Object Promisors" (LOPs). It's a set of features with the
goal of using special dedicated promisor remotes to store large blobs,
and having them accessed directly by main remotes and clients.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 .../technical/large-object-promisors.txt      | 656 ++++++++++++++++++
 1 file changed, 656 insertions(+)
 create mode 100644 Documentation/technical/large-object-promisors.txt

diff --git a/Documentation/technical/large-object-promisors.txt b/Documentation/technical/large-object-promisors.txt
new file mode 100644
index 0000000000..ebbbd7c18f
--- /dev/null
+++ b/Documentation/technical/large-object-promisors.txt
@@ -0,0 +1,656 @@
+Large Object Promisors
+======================
+
+Ever since Git was created, users have been complaining about issues
+with storing large files in Git. Some solutions have been created to
+help, but they haven't helped much with some issues.
+
+Git currently supports multiple promisor remotes, which could help
+with some of these remaining issues, but it's very hard to use them to
+help, because a number of important features are missing.
+
+The goal of the effort described in this document is to add these
+important features.
+
+We will call a "Large Object Promisor", or "LOP" for short, a promisor
+remote which is used to store only large blobs and which is separate
+from the main remote that should store the other Git objects and the
+rest of the repos.
+
+By extension, we will also call "Large Object Promisor", or LOP, the
+effort described in this document to add a set of features to make it
+easier to handle large blobs/files in Git by using LOPs.
+
+This effort aims to especially improve things on the server side, and
+especially for large blobs that are already compressed in a binary
+format.
+
+This effort aims to provide an alternative to Git LFS
+(https://git-lfs.com/) and similar tools like git-annex
+(https://git-annex.branchable.com/) for handling large files, even
+though a complete alternative would very likely require other efforts
+especially on the client side, where it would likely help to implement
+a new object representation for large blobs as discussed in:
+
+https://lore.kernel.org/git/xmqqbkdometi.fsf@gitster.g/
+
+0) Non goals
+------------
+
+- We will not discuss those client side improvements here, as they
+  would require changes in different parts of Git than this effort.
++
+So we don't pretend to fully replace Git LFS with only this effort,
+but we nevertheless believe that it can significantly improve the
+current situation on the server side, and that other separate
+efforts could also improve the situation on the client side.
+
+- In the same way, we are not going to discuss all the possible ways
+  to implement a LOP or their underlying object storage, or to
+  optimize how LOP works.
++
+Our opinion is that the simplest solution for now is for LOPs to use
+object storage through a remote helper (see section II.2 below for
+more details) to store their objects. So we consider that this is the
+default implementation. If there are improvements on top of this,
+that's great, but our opinion is that such improvements are not
+necessary for LOPs to already be useful. Such improvements are likely
+a different technical topic, and can be taken care of separately
+anyway.
++
+So in particular we are not going to discuss pluggable ODBs or other
+object database backends that could chunk large blobs, dedup the
+chunks and store them efficiently. Sure, that would be a nice
+improvement to store large blobs on the server side, but we believe
+it can just be a separate effort as it's also not technically very
+related to this effort.
++
+We are also not going to discuss data transfer improvements between
+LOPs and clients or servers. Sure, there might be some easy and very
+effective optimizations there (as we know that objects on LOPs are
+very likely incompressible and not deltifying well), but this can be
+dealt with in a separate effort.
+
+In other words, the goal of this document is not to talk about all the
+possible ways to optimize how Git could handle large blobs, but to
+describe how a LOP based solution can already work well and alleviate
+a number of current issues in the context of Git clients and servers
+sharing Git objects.
+
+Even if LOPs are not used very efficiently, they can still be useful
+and worth using in some cases because, as we will see in more detail
+later in this document:
+
+  - they can make it simpler for clients to use promisor remotes and
+    therefore avoid fetching a lot of large blobs they might not need
+    locally,
+
+  - they can make it significantly cheaper or easier for servers to
+    host a significant part of the current repository content, and
+    even more to host content with larger blobs or more large blobs
+    than currently.
+
+I) Issues with the current situation
+------------------------------------
+
+- Some statistics made on GitLab repos have shown that more than 75%
+  of the disk space is used by blobs that are larger than 1MB and
+  often in a binary format.
+
+- So even if users could use Git LFS or similar tools to store a lot
+  of large blobs out of their repos, it's a fact that in practice they
+  don't do it as much as they probably should.
+
+- On the server side ideally, the server should be able to decide for
+  itself how it stores things. It should not depend on users deciding
+  to use tools like Git LFS on some blobs or not.
+
+- It's much more expensive to store large blobs that don't delta
+  compress well on regular fast seeking drives (like SSDs) than on
+  object storage (like Amazon S3 or GCP Buckets). Using fast drives
+  for regular Git repos makes sense though, as serving regular Git
+  content (blobs containing text or code) needs drives where seeking
+  is fast, but the content is relatively small. On the other hand,
+  object storage for Git LFS blobs makes sense as seeking speed is not
+  as important when dealing with large files, while costs are more
+  important. So the fact that users don't use Git LFS or similar tools
+  for a significant number of large blobs likely has some bad
+  consequences on the cost of repo storage for most Git hosting
+  platforms.
+
+- Having large blobs handled in the same way as other blobs and Git
+  objects in Git repos instead of on object storage also has a cost in
+  increased memory and CPU usage, and therefore decreased performance,
+  when creating packfiles. (This is because Git tries to use delta
+  compression or zlib compression which is unlikely to work well on
+  already compressed binary content.) So it's not just a storage cost
+  increase.
+
+- When a large blob has been committed into a repo, it might not be
+  possible to remove this blob from the repo without rewriting
+  history, even if the user then decides to use Git LFS or a similar
+  tool to handle it.
+
+- In fact Git LFS and similar tools are not very flexible in letting
+  users change their minds about the blobs they should handle or not.
+
+- Even when users are using Git LFS or similar tools, they often
+  complain that these tools require significant effort to set up,
+  learn and use correctly.
+
+II) Main features of the "Large Object Promisors" solution
+----------------------------------------------------------
+
+The main features below should give a rough overview of how the
+solution may work. Details about needed elements can be found in
+following sections.
+
+Even if each feature below is very useful for the full solution, it is
+very likely to be also useful on its own in some cases where the full
+solution is not required. However, we'll focus primarily on the big
+picture here.
+
+Also each feature doesn't need to be implemented entirely in Git
+itself. Some could be scripts, hooks or helpers that are not part of
+the Git repo. It would be helpful if those could be shared and
+improved on collaboratively though. So we want to encourage sharing
+them.
+
+1) Large blobs are stored on LOPs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Large blobs should be stored on special promisor remotes that we will
+call "Large Object Promisors" or LOPs. These LOPs should be additional
+remotes dedicated to containing large blobs, especially those in binary
+format. They should be used along with main remotes that contain the
+other objects.
+
+Note 1
+++++++
+
+To clarify, a LOP is a normal promisor remote, except that:
+
+- it should store only large blobs,
+
+- it should be separate from the main remote, so that the main remote
+  can focus on serving other objects and the rest of the repos (see
+  feature 4) below) and can use the LOP as a promisor remote for
+  itself.
+
+Note 2
+++++++
+
+Git already makes it possible for a main remote to also be a promisor
+remote storing both regular objects and large blobs for a client that
+clones from it with a filter on blob size. But here we explicitly want
+to avoid that.
+
+Rationale
++++++++++
+
+LOPs aim to be good at handling large blobs while main remotes are
+already good at handling other objects.
+
+Implementation
+++++++++++++++
+
+Git already has support for multiple promisor remotes, see
+link:partial-clone.html#using-many-promisor-remotes[the partial clone documentation].
+
+Also, Git already has support for partial clone using a filter on the
+size of the blobs (with `git clone --filter=blob:limit=<size>`).  Most
+of the other main features below are based on these existing features
+and are about making them easy and efficient to use for the purpose of
+better handling large blobs.
+
+2) LOPs can use object storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+LOPs can be implemented using object storage, like an Amazon S3 or GCP
+Bucket or MinIO (which is open source under the GNU AGPLv3 license) to
+actually store the large blobs, and can be accessed through a Git
+remote helper (see linkgit:gitremote-helpers[7]) which makes the
+underlying object storage appear like a remote to Git.
+
+Note
+++++
+
+A LOP can be a promisor remote accessed using a remote helper by
+both some clients and the main remote.
+
+Rationale
++++++++++
+
+This looks like the simplest way to create LOPs that can cheaply
+handle many large blobs.
+
+Implementation
+++++++++++++++
+
+Remote helpers are quite easy to write as shell scripts, but it might
+be more efficient and maintainable to write them using other languages
+like Go.
+
+Some already exist under open source licenses, for example:
+
+  - https://github.com/awslabs/git-remote-s3
+  - https://gitlab.com/eric.p.ju/git-remote-gs
+
+Other ways to implement LOPs are certainly possible, but the goal of
+this document is not to discuss how to best implement a LOP or its
+underlying object storage (see the "0) Non goals" section above).
+
+3) LOP object storage can be Git LFS storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The underlying object storage that a LOP uses could also serve as
+storage for large files handled by Git LFS.
+
+Rationale
++++++++++
+
+This would simplify the server side if it wants to both use a LOP and
+act as a Git LFS server.
+
+4) A main remote can offload to a LOP with a configurable threshold
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+On the server side, a main remote should have a way to offload to a
+LOP all its blobs with a size over a configurable threshold.
+
+Rationale
++++++++++
+
+This makes it easy to set things up and to clean things up. For
+example, an admin could use this to manually convert a repo not using
+LOPs to a repo using a LOP. On a repo already using a LOP but where
+some users would sometimes push large blobs, a cron job could use this
+to regularly make sure the large blobs are moved to the LOP.
+
+Implementation
+++++++++++++++
+
+Using something based on `git repack --filter=...` to separate the
+blobs we want to offload from the other Git objects could be a good
+idea. The missing part is to connect to the LOP, check if the blobs we
+want to offload are already there and if not send them.
+
+5) A main remote should try to remain clean from large blobs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A main remote should try to avoid containing a lot of oversize
+blobs. For that purpose, it should offload as needed to a LOP and it
+should have ways to prevent oversize blobs from being fetched, and
+also perhaps pushed, into it.
+
+Rationale
++++++++++
+
+A main remote containing many oversize blobs would defeat the purpose
+of LOPs.
+
+Implementation
+++++++++++++++
+
+The way to offload to a LOP discussed in 4) above can be used to
+regularly offload oversize blobs. About preventing oversize blobs from
+being fetched into the repo see 6) below. About preventing oversize
+blob pushes, a pre-receive hook could be used.
+
+Also there are different scenarios in which large blobs could get
+fetched into the main remote, for example:
+
+- A client that doesn't implement the "promisor-remote" protocol
+  (described in 6) below) clones from the main remote.
+
+- The main remote gets a request for information about a large blob
+  and is not able to get that information without fetching the blob
+  from the LOP.
+
+It might not be possible to completely prevent all these scenarios
+from happening. So the goal here should be to implement features that
+make the fetching of large blobs less likely. For example adding a
+`remote-object-info` command in the `git cat-file --batch` protocol
+and its variants might make it possible for a main repo to respond to
+some requests about large blobs without fetching them.
+
+6) A protocol negotiation should happen when a client clones
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a client clones from a main repo, there should be a protocol
+negotiation so that the server can advertise one or more LOPs and so
+that the client and the server can discuss if the client could
+directly use a LOP the server is advertising. If the client and the
+server can agree on that, then the client would be able to get the
+large blobs directly from the LOP and the server would not need to
+fetch those blobs from the LOP to be able to serve the client.
+
+Note
+++++
+
+For fetches instead of clones, a protocol negotiation might not always
+happen, see the "What about fetches?" FAQ entry below for details.
+
+Rationale
++++++++++
+
+Security, configurability and efficiency of setting things up.
+
+Implementation
+++++++++++++++
+
+A "promisor-remote" protocol v2 capability looks like a good way to
+implement this. The way the client and server use this capability
+could be controlled by configuration variables.
+
+Information that the server could send to the client through that
+protocol could be things like: LOP name, LOP URL, filter-spec (for
+example `blob:limit=<size>`) or just size limit that should be used as
+a filter when cloning, token to be used with the LOP, etc.
+
+7) A client can offload to a LOP
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a client is using a LOP that is also a LOP of its main remote,
+the client should be able to offload some large blobs it has fetched,
+but might not need anymore, to the LOP.
+
+Note
+++++
+
+It might depend on the context if it should be OK or not for clients
+to offload large blobs they have created, instead of fetched, directly
+to the LOP without the main remote checking them in some ways
+(possibly using hooks or other tools).
+
+This should be discussed and refined when we get closer to
+implementing this feature.
+
+Rationale
++++++++++
+
+On the client, the easiest way to deal with unneeded large blobs is to
+offload them.
+
+Implementation
+++++++++++++++
+
+This is very similar to what 4) above is about, except on the client
+side instead of the server side. So a good solution to 4) could likely
+be adapted to work on the client side too.
+
+There might be some security issues here, as there is no negotiation,
+but they might be mitigated if the client can reuse a token it got
+when cloning (see 6) above). Also if the large blobs were fetched from
+a LOP, it is likely, and can easily be confirmed, that the LOP still
+has them, so that they can just be removed from the client.
+
+III) Benefits of using LOPs
+---------------------------
+
+Many benefits are related to the issues discussed in "I) Issues with
+the current situation" above:
+
+- No need to rewrite history when deciding which blobs are worth
+  handling separately from other objects, or when moving or removing
+  the threshold.
+
+- If the protocol between client and server is developed and secured
+  enough, then many details might be set up on the server side only and
+  all the clients could then easily get all the configuration
+  information and use it to set themselves up mostly automatically.
+
+- Storage cost benefits on the server side.
+
+- Reduced memory and CPU needs on main remotes on the server side.
+
+- Reduced storage needs on the client side.
+
+IV) FAQ
+-------
+
+What about using multiple LOPs on the server and client side?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+That could perhaps be useful in some cases, but for now it's more
+likely that in most cases a single LOP will be advertised by the
+server and should be used by the client.
+
+A case where it could be useful for a server to advertise multiple
+LOPs is if a LOP is better for some users while a different LOP is
+better for other users. For example some clients might have a better
+connection to a LOP than others.
+
+In those cases it's the responsibility of the server to have some
+documentation to help clients. It could say for example something like
+"Users in this part of the world might want to pick only LOP A as it
+is likely to be better connected to them, while users in other parts
+of the world should pick only LOP B for the same reason."
+
+When should we trust or not trust the LOPs advertised by the server?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In some contexts, like in a corporate setup where the server and all the
+clients are parts of an internal network in a company where admins
+have all the rights on every system, it's OK, and perhaps even a good
+thing, if the clients fully trust the server, as it can help ensure
+that all the clients are on the same page.
+
+There are also contexts in which clients trust a code hosting platform
+serving them some repos, but might not fully trust other users
+managing or contributing to some of these repos. For example, the code
+hosting platform could have hooks in place to check that any object it
+receives doesn't contain malware or otherwise bad content. In this
+case it might be OK for the client to use a main remote and its LOP if
+they are both hosted by the code hosting platform, but not if the LOP
+is hosted elsewhere (where the content is not checked).
+
+In other contexts, a client should just not trust a server.
+
+So there should be different ways to configure how the client should
+behave when a server advertises a LOP to it at clone time.
+
+As the basic elements that a server can advertise about a LOP are a
+LOP name and a LOP URL, the client should base its decision about
+accepting a LOP on these elements.
+
+One simple way to be very strict in the LOP it accepts is for example
+for the client to check that the LOP is already configured on the
+client with the same name and URL as what the server advertises.
+
+In general default and "safe" settings should require that LOPs are
+configured on the client separately from the "promisor-remote"
+protocol and that the client accepts a LOP only when information about
+it from the protocol matches what has been already configured
+separately.
+
+What about LOP names?
+~~~~~~~~~~~~~~~~~~~~~
+
+In some contexts, for example if the clients sometimes fetch from each
+other, it can be a good idea for all the clients to use the same names
+for all the remotes they use, including LOPs.
+
+In other contexts, each client might want to be able to give the name
+it wants to each remote, including each LOP, it interacts with.
+
+So there should be different ways to configure how the client accepts
+or not the LOP name the server advertises.
+
+If a default or "safe" setting is used, then as such a setting should
+require that the LOP be configured separately, then the name would be
+configured separately and there is no risk that the server could
+dictate a name to a client.
+
+Could the main remote be bogged down by old or paranoid clients?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Yes, it could happen if there are too many clients that are either
+unwilling to trust the main remote or that just don't implement the
+"promisor-remote" protocol because they are too old or not fully
+compatible with the 'git' client.
+
+When serving such a client, the main remote has no other choice than
+to first fetch from its LOP, to then be able to provide to the client
+everything it requested. So the main remote, even if it has cleanup
+mechanisms (see section II.4 above), would be burdened at least
+temporarily with the large blobs it had to fetch from its LOP.
+
+Not behaving like this would be breaking backward compatibility, and
+could be seen as segregating clients. For example, it might be
+possible to implement a special mode that allows the server to just
+reject clients that don't implement the "promisor-remote" protocol or
+aren't willing to trust the main remote. This mode might be useful in
+a special context like a corporate environment. There is no plan to
+implement such a mode though, and this should be discussed separately
+later anyway.
+
+A better way to proceed is probably for the main remote to show a
+message telling clients that don't implement the protocol or are
+unwilling to accept the advertised LOP(s) that they would get faster
+clones and fetches by upgrading client software or properly setting
+them up to accept LOP(s).
+
+Waiting for clients to upgrade, monitoring these upgrades and limiting
+the use of LOPs to repos that are not very frequently accessed might
+be other good ways to make sure that some benefits are still reaped
+from LOPs. Over time, as more and more clients upgrade and benefit
+from LOPs, using them in more and more frequently accessed repos will
+become worth it.
+
+Corporate environments, where it might be easier to make sure that all
+the clients are up-to-date and properly configured, could hopefully
+benefit more and earlier from using LOPs.
+
+What about fetches?
+~~~~~~~~~~~~~~~~~~~
+
+There are different kinds of fetches. A regular fetch happens when
+some refs have been updated on the server and the client wants the ref
+updates and possibly the new objects added with them. A "backfill" or
+"lazy" fetch, on the contrary, happens when the client needs to use
+some objects it already knows about but doesn't have because they are
+on a promisor remote.
+
+Regular fetch
++++++++++++++
+
+In a regular fetch, the client will contact the main remote and a
+protocol negotiation will happen between them. It's a good thing that
+a protocol negotiation happens every time, as the configuration on the
+client or the main remote could have changed since the previous
+protocol negotiation. In this case, the new protocol negotiation
+should ensure that the new fetch will happen in a way that satisfies
+the new configuration of both the client and the server.
+
+In most cases though, the configurations on the client and the main
+remote will not have changed between 2 fetches or between the initial
+clone and a subsequent fetch. This means that the result of a new
+protocol negotiation will be the same as the previous result, so the
+new fetch will happen in the same way as the previous clone or fetch,
+using, or not using, the same LOP(s) as last time.
+
+"Backfill" or "lazy" fetch
+++++++++++++++++++++++++++
+
+When there is a backfill fetch, the client doesn't necessarily contact
+the main remote first. It will try to fetch from its promisor remotes
+in the order they appear in the config file, except that a remote
+configured using the `extensions.partialClone` config variable will be
+tried last. See
+link:partial-clone.html#using-many-promisor-remotes[the partial clone documentation].
+
+This is not new with this effort. In fact this is how multiple remotes
+have already been working for around 5 years.
+
+When using LOPs, having the main remote configured using
+`extensions.partialClone`, so it's tried last, makes sense, as missing
+objects should only be large blobs that are on LOPs.
+
+This means that a protocol negotiation will likely not happen as the
+missing objects will be fetched from the LOPs, and then there will be
+nothing left to fetch from the main remote.
+
+To secure that, it could be a good idea for LOPs to require a token
+from the client when it fetches from them. The client could get the
+token when performing a protocol negotiation with the main remote (see
+section II.6 above).
+
+V) Future improvements
+----------------------
+
+It is expected that at the beginning using LOPs will be mostly worth
+it either in a corporate context where the Git version that clients
+use can easily be controlled, or on repos that are infrequently
+accessed. (See the "Could the main remote be bogged down by old or
+paranoid clients?" section in the FAQ above.)
+
+Over time, as more and more clients upgrade to a version that
+implements the "promisor-remote" protocol v2 capability (described
+above in section II.6), it will be worth it to use LOPs more widely.
+
+A lot of improvements may also help using LOPs more widely. Some of
+these improvements are part of the scope of this document like the
+following:
+
+  - Implementing a "remote-object-info" command in the
+    `git cat-file --batch` protocol and its variants to allow main
+    remotes to respond to requests about large blobs without fetching
+    them. (Eric Ju has started working on this based on previous work
+    by Calvin Wan.)
+
+  - Creating better cleanup and offload mechanisms for main remotes
+    and clients to prevent accumulation of large blobs.
+
+  - Developing more sophisticated protocol negotiation capabilities
+    between clients and servers for handling LOPs, for example adding
+    a filter-spec (e.g., blob:limit=<size>) or size limit for
+    filtering when cloning, or adding a token for LOP authentication.
+
+  - Improving security measures for LOP access, particularly around
+    token handling and authentication.
+
+  - Developing standardized ways to configure and manage multiple LOPs
+    across different environments. Especially in the case where
+    different LOPs serve the same content to clients in different
+    geographical locations, there is a need for replication or
+    synchronization between LOPs.
+
+Some improvements, including some that have been mentioned in the "0)
+Non Goals" section of this document, are out of the scope of this
+document:
+
+  - Implementing a new object representation for large blobs on the
+    client side.
+
+  - Developing pluggable ODBs or other object database backends that
+    could chunk large blobs, dedup the chunks and store them
+    efficiently.
+
+  - Optimizing data transfer between LOPs and clients/servers,
+    particularly for incompressible and non-deltifying content.
+
+  - Creating improved client side tools for managing large objects
+    more effectively, for example tools for migrating from Git LFS or
+    git-annex, or tools to find which objects could be offloaded and
+    how much disk space could be reclaimed by offloading them.
+
+Some improvements could be seen as part of the scope of this document,
+but might already have their own separate projects from the Git
+project, like:
+
+  - Improving existing remote helpers to access object storage or
+    developing new ones.
+
+  - Improving existing object storage solutions or developing new
+    ones.
+
+Even though all the above improvements may help, this document and the
+LOP effort should try to focus, at least first, on a relatively small
+number of improvements, mostly those that are in its current scope.
+
+For example introducing pluggable ODBs and a new object database
+backend is likely a multi-year effort on its own that can happen
+separately in parallel. It has different technical requirements,
+touches other parts of the Git code base and should have its own design
+document(s).
-- 
2.48.1.359.ge980fe0aa2



* Re: [PATCH v4 0/6] Introduce a "promisor-remote" capability
  2025-01-27 21:14       ` [PATCH v4 0/6] Introduce a "promisor-remote" capability Junio C Hamano
@ 2025-02-18 11:40         ` Christian Couder
  0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-02-18 11:40 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
	Randall S . Becker

On Mon, Jan 27, 2025 at 10:14 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:

> >   - Patches 1/6 and 2/6 are new in this series. They come from the
> >     patch series Usman Akinyemi is working on
> >     (https://lore.kernel.org/git/20250124122217.250925-1-usmanakinyemi202@gmail.com/).
> >     We need a similar redact_non_printables() function as the one he
> >     has been working on in his patch series, so it's just simpler to
> >     reuse his patches related to this function, and to build on top of
> >     them.
>
> Two topics in flight, neither of which hit 'next', sharing a handful
> of patches is cumbersome to keep track of.  Typically our strategy
> dealing with such a situation has been for these topics to halt and
> have the authors work together to help the common part solidify a
> bit better before continuing.  Otherwise, every time any one of the
> topics that share the same early parts of the series needs to change
> them even a bit, it would result in a huge rebase chaos, and worse
> yet, even if the two (or more) topics share the need for these two
> early parts, they may have different dependency requirements (e.g.
> this may be OK with these two early patches directly applied on
> 'maint', while the other topic may need to have these two early
> patches on 'master').
>
> I think [3/6] falls into the same category as [1/6] and [2/6], that
> is, to lay foundation of the remainder?

Yeah, but patches 1/6, 2/6 and 3/6 are removed in the next version,
thanks to a comment by Patrick...

> >   - In patch 4/6, the commit message has been improved:
> >   - In patch 4/6, there are also some code changes:
> >   - In patch 4/6, there is also a small change in the tests.
>
> All good changes.
>
> Will queue, but we should find a better way to manage the "an
> earlier part is shared across multiple topics" situation.

... so no problem anymore with this earlier part.

Thanks!


* Re: [PATCH v4 4/6] Add 'promisor-remote' capability to protocol v2
  2025-01-30 10:51         ` Patrick Steinhardt
@ 2025-02-18 11:41           ` Christian Couder
  0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-02-18 11:41 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Junio C Hamano, Taylor Blau, Eric Sunshine, Karthik Nayak,
	Kristoffer Haugsbakk, brian m . carlson, Randall S . Becker,
	Christian Couder

On Thu, Jan 30, 2025 at 11:51 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Mon, Jan 27, 2025 at 04:16:59PM +0100, Christian Couder wrote:
> > When a server S knows that some objects from a repository are available
> > from a promisor remote X, S might want to suggest to a client C cloning
> > or fetching the repo from S that C may use X directly instead of S for
> > these objects.
>
> A lot of the commit message seems to be duplicated with the technical
> documentation that you add. I wonder whether it would make sense to
> simply refer to that instead of repeating all of it? That would make it
> easier to spot the actually-important bits in the commit message that
> add context to the patch.

I thought that commit messages should be self-contained as much as
possible. I am fine with adding a sentence saying that a design doc to
help with seeing the big picture will follow in one of the next
commits if it helps though.

> One very important bit of context that I was lacking is what exactly we
> wire up and where we do so. I have been searching for longer than I want
> to admit where the client ends up using the promisor remotes, until I
> eventually figured out that the client-side isn't wired up at all. It
> makes sense in retrospect, but it would've been nice if the reader was
> guided a bit.

The protocol side is implemented on both the client and the server
side in this patch. The rest already works on the client side because
using promisor remotes already works on the client side. We are just
making sure client and server agree on using a promisor remote before
the server allows it by passing "--missing=allow-promisor" to `git
pack-objects`, see below. The tests show that this single change is
enough to make things work.

> > diff --git a/Documentation/gitprotocol-v2.txt b/Documentation/gitprotocol-v2.txt
> > index 1652fef3ae..f25a9a6ad8 100644
> > --- a/Documentation/gitprotocol-v2.txt
> > +++ b/Documentation/gitprotocol-v2.txt
> > @@ -781,6 +781,60 @@ retrieving the header from a bundle at the indicated URI, and thus
> >  save themselves and the server(s) the request(s) needed to inspect the
> >  headers of that bundle or bundles.
> >
> > +promisor-remote=<pr-infos>
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +The server may advertise some promisor remotes it is using or knows
> > +about to a client which may want to use them as its promisor remotes,
> > +instead of this repository. In this case <pr-infos> should be of the
> > +form:
> > +
> > +     pr-infos = pr-info | pr-infos ";" pr-info
> > +
> > +     pr-info = "name=" pr-name | "name=" pr-name "," "url=" pr-url
> > +
> > +where `pr-name` is the urlencoded name of a promisor remote, and
> > +`pr-url` the urlencoded URL of that promisor remote.
> > +
> > +In this case, if the client decides to use one or more promisor
> > +remotes the server advertised, it can reply with
> > +"promisor-remote=<pr-names>" where <pr-names> should be of the form:
> > +
> > +     pr-names = pr-name | pr-names ";" pr-name
> > +
> > +where `pr-name` is the urlencoded name of a promisor remote the server
> > +advertised and the client accepts.
> > +
> > +Note that, everywhere in this document, `pr-name` MUST be a valid
> > +remote name, and the ';' and ',' characters MUST be encoded if they
> > +appear in `pr-name` or `pr-url`.
> > +
> > +If the server doesn't know any promisor remote that could be good for
> > +a client to use, or prefers a client not to use any promisor remote it
> > +uses or knows about, it shouldn't advertise the "promisor-remote"
> > +capability at all.
> > +
> > +In this case, or if the client doesn't want to use any promisor remote
> > +the server advertised, the client shouldn't advertise the
> > +"promisor-remote" capability at all in its reply.
> > +
> > +The "promisor.advertise" and "promisor.acceptFromServer" configuration
> > +options can be used on the server and client side respectively to
>
> s/respectively//, as you already say that in the next line.

I have removed it in the next version.

> > +control what they advertise or accept respectively. See the
> > +documentation of these configuration options for more information.
> > +
> > +Note that in the future it would be nice if the "promisor-remote"
> > +protocol capability could be used by the server, when responding to
> > +`git fetch` or `git clone`, to advertise better-connected remotes that
> > +the client can use as promisor remotes, instead of this repository, so
> > +that the client can lazily fetch objects from these other
> > +better-connected remotes. This would require the server to omit in its
> > +response the objects available on the better-connected remotes that
> > +the client has accepted. This hasn't been implemented yet though. So
> > +for now this "promisor-remote" capability is useful only when the
> > +server advertises some promisor remotes it already uses to borrow
> > +objects from.
>
> I'd leave away this bit as it doesn't really add a lot to the document.
> It's a possibility for the future, but without it being implemented
> anywhere it's not that helpful from my point of view.

In previous iterations, Junio talked about this as an interesting
possibility to implement in the future, so I thought it could be
interesting to mention it in some places. I would be Ok to remove it
if no one cares though.

> > diff --git a/promisor-remote.c b/promisor-remote.c
> > index c714f4f007..5ac282ed27 100644
> > --- a/promisor-remote.c
> > +++ b/promisor-remote.c
> > @@ -292,3 +306,185 @@ void promisor_remote_get_direct(struct repository *repo,
> >       if (to_free)
> >               free(remaining_oids);
> >  }
> > +
> > +static int allow_unsanitized(char ch)
> > +{
> > +     if (ch == ',' || ch == ';' || ch == '%')
> > +             return 0;
> > +     return ch > 32 && ch < 127;
> > +}
>
> Isn't this too lenient? It would also allow e.g. '=' and all kinds
> of other characters. This does make sense for URLs, but it doesn't make
> sense for remote names as they aren't supposed to contain punctuation in
> the first place. So for these remote names I'd think we should be way
> stricter and return an error in case they contain non-alphanumeric data.

This is used only to determine which characters are URL-encoded, not
which characters we pass or not to the other side. See below.
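
To illustrate that point, here is a minimal, self-contained sketch (not
the code from the patch) of percent-encoding driven by such a predicate.
`sketch_urlencode()` is a hypothetical stand-in for
`strbuf_addstr_urlencode()` that writes into a fixed-size buffer instead
of a strbuf; the lowercase hex digits are an assumption of the sketch.

```c
#include <stddef.h>

/* Same predicate as in the quoted hunk: printable ASCII except ',', ';', '%'. */
static int allow_unsanitized(char ch)
{
	if (ch == ',' || ch == ';' || ch == '%')
		return 0;
	return ch > 32 && ch < 127;
}

/*
 * Hypothetical helper: copy `in` to `out`, writing every byte rejected
 * by `allow` as %XX. Returns the encoded length, or -1 if `out` is too
 * small to hold the result and its terminating NUL.
 */
static int sketch_urlencode(const char *in, char *out, size_t out_size,
			    int (*allow)(char))
{
	static const char hex[] = "0123456789abcdef";
	size_t n = 0;

	for (; *in; in++) {
		unsigned char ch = (unsigned char)*in;

		if (allow(*in)) {
			/* One byte plus room for the final NUL. */
			if (n + 1 >= out_size)
				return -1;
			out[n++] = *in;
		} else {
			/* Three bytes plus room for the final NUL. */
			if (n + 3 >= out_size)
				return -1;
			out[n++] = '%';
			out[n++] = hex[ch >> 4];
			out[n++] = hex[ch & 0xf];
		}
	}
	if (n >= out_size)
		return -1;
	out[n] = '\0';
	return (int)n;
}
```

With this sketch, encoding `my,lop;x=1 %` gives `my%2clop%3bx=1%20%25`:
'=' passes through untouched, while ',', ';', '%' and the space are
encoded. The predicate only selects what gets escaped; nothing is
dropped or rejected at this stage.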

> > +static void promisor_info_vecs(struct repository *repo,
> > +                            struct strvec *names,
> > +                            struct strvec *urls)
>
> I wonder whether it would make more sense to track these as a strmap
> instead of two arrays which are expected to have related entries in the
> same place.

In the future we might have more generic code with perhaps a
configuration option (maybe "promisor.advertiseFields") that lists the
remote fields, like "name, url, token, filter, id", that should be
advertised by the server. If that happens, then it will make a lot of
sense to use a strmap indeed. For now we just don't know how that code
will evolve, so I think it's not worth risking overengineering this.

> > +{
> > +     struct promisor_remote *r;
> > +
> > +     promisor_remote_init(repo);
> > +
> > +     for (r = repo->promisor_remote_config->promisors; r; r = r->next) {
> > +             char *url;
> > +             char *url_key = xstrfmt("remote.%s.url", r->name);
> > +
> > +             strvec_push(names, r->name);
> > +             strvec_push(urls, git_config_get_string(url_key, &url) ? NULL : url);
> > +
> > +             free(url);
> > +             free(url_key);
> > +     }
> > +}
> > +
> > +char *promisor_remote_info(struct repository *repo)
> > +{
> > +     struct strbuf sb = STRBUF_INIT;
> > +     int advertise_promisors = 0;
> > +     struct strvec names = STRVEC_INIT;
> > +     struct strvec urls = STRVEC_INIT;
> > +
> > +     git_config_get_bool("promisor.advertise", &advertise_promisors);
> > +
> > +     if (!advertise_promisors)
> > +             return NULL;
> > +
> > +     promisor_info_vecs(repo, &names, &urls);
> > +
> > +     if (!names.nr)
> > +             return NULL;
> > +
> > +     for (size_t i = 0; i < names.nr; i++) {
> > +             if (i)
> > +                     strbuf_addch(&sb, ';');
> > +             strbuf_addstr(&sb, "name=");
> > +             strbuf_addstr_urlencode(&sb, names.v[i], allow_unsanitized);
> > +             if (urls.v[i]) {
> > +                     strbuf_addstr(&sb, ",url=");
> > +                     strbuf_addstr_urlencode(&sb, urls.v[i], allow_unsanitized);
> > +             }
> > +     }
> > +
> > +     redact_non_printables(&sb);
>
> So here we replace non-printable characters with dots as far as I
> understand. But didn't we just URL-encode the strings? So is there ever
> a possibility for non-printable characters here?

Yeah, right. I am removing this call in the next version then. This is
nice because it allows us to remove the first 3 patches in this series
and not depend on Usman's "extend agent capability to include OS name"
series (https://lore.kernel.org/git/20250215155130.1756934-1-usmanakinyemi202@gmail.com/).

> > +     strvec_clear(&names);
> > +     strvec_clear(&urls);
> > +
> > +     return strbuf_detach(&sb, NULL);
> > +}

[...]

> > +static void filter_promisor_remote(struct strvec *accepted, const char *info)
> > +{
> > +     struct strbuf **remotes;
> > +     const char *accept_str;
> > +     enum accept_promisor accept = ACCEPT_NONE;
> > +
> > +     if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
> > +             if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
> > +                     accept = ACCEPT_NONE;
> > +             else if (!strcasecmp("All", accept_str))
> > +                     accept = ACCEPT_ALL;
> > +             else
> > +                     warning(_("unknown '%s' value for '%s' config option"),
> > +                             accept_str, "promisor.acceptfromserver");
> > +     }
> > +
> > +     if (accept == ACCEPT_NONE)
> > +             return;
> > +
> > +     /* Parse remote info received */
> > +
> > +     remotes = strbuf_split_str(info, ';', 0);
> > +
> > +     for (size_t i = 0; remotes[i]; i++) {
> > +             struct strbuf **elems;
> > +             const char *remote_name = NULL;
> > +             const char *remote_url = NULL;
> > +             char *decoded_name = NULL;
> > +             char *decoded_url = NULL;
> > +
> > +             strbuf_strip_suffix(remotes[i], ";");
> > +             elems = strbuf_split(remotes[i], ',');
> > +
> > +             for (size_t j = 0; elems[j]; j++) {
> > +                     int res;
> > +                     strbuf_strip_suffix(elems[j], ",");
> > +                     res = skip_prefix(elems[j]->buf, "name=", &remote_name) ||
> > +                             skip_prefix(elems[j]->buf, "url=", &remote_url);
> > +                     if (!res)
> > +                             warning(_("unknown element '%s' from remote info"),
> > +                                     elems[j]->buf);
> > +             }
> > +
> > +             if (remote_name)
> > +                     decoded_name = url_percent_decode(remote_name);
> > +             if (remote_url)
> > +                     decoded_url = url_percent_decode(remote_url);
>
> This is data we have received from a potentially-untrusted remote, so we
> should double-check that the data we have received doesn't contain any
> weird characters:
>
>   - For the remote name we should verify that it consists only of
>     alphanumeric characters.
>
>   - For the remote URL we need to verify that it's a proper URL without
>     any newlines, non-printable characters or anything else.
>
> We'll eventually end up storing that data in the configuration, so these
> verifications are quite important so that an adversarial server cannot
> perform config-injection and thus cause remote code execution.

We currently don't store that data in the configuration. We just use
it to compare it with what is already configured on the client side. I
agree that if we ever make changes in a future series to store that
data, we should be careful to double-check it.
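
As a rough sketch of what "compare it with what is already configured"
could look like for one `pr-info` element of the grammar quoted earlier:
the code below decodes the advertised name and URL and accepts them only
if both match local values. All `sketch_*` names are hypothetical, the
configured name and URL are passed in as plain strings instead of being
read from the config, and rejecting `%00` is an extra precaution of the
sketch, not something the patch is known to do.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

static int hexval(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/* Decode %XX escapes in in[0..len); NULL on malformed input or %00. */
static char *sketch_percent_decode(const char *in, size_t len)
{
	char *out = malloc(len + 1);
	size_t n = 0;

	if (!out)
		return NULL;
	for (size_t i = 0; i < len; i++) {
		int byte = (unsigned char)in[i];

		if (in[i] == '%') {
			int hi = i + 2 < len ? hexval(in[i + 1]) : -1;
			int lo = i + 2 < len ? hexval(in[i + 2]) : -1;

			byte = hi < 0 || lo < 0 ? 0 : hi * 16 + lo;
			i += 2;
		}
		if (!byte) {	/* malformed escape or embedded NUL */
			free(out);
			return NULL;
		}
		out[n++] = (char)byte;
	}
	out[n] = '\0';
	return out;
}

/*
 * Return 1 if one advertised pr-info ("name=<n>[,url=<u>]") carries both
 * a name and a URL equal to the locally configured ones, 0 otherwise.
 * Unknown or repeated elements are ignored.
 */
static int sketch_pr_info_matches(const char *pr_info,
				  const char *cfg_name, const char *cfg_url)
{
	const char *p = pr_info;
	char *name = NULL, *url = NULL;
	int match;

	while (*p) {
		size_t len = strcspn(p, ",");

		if (!name && !strncmp(p, "name=", 5))
			name = sketch_percent_decode(p + 5, len - 5);
		else if (!url && !strncmp(p, "url=", 4))
			url = sketch_percent_decode(p + 4, len - 4);
		p += len;
		if (*p == ',')
			p++;
	}
	match = name && url && !strcmp(name, cfg_name) && !strcmp(url, cfg_url);
	free(name);
	free(url);
	return match;
}
```

Because the advertised values are only compared against existing local
settings, nothing received from the server is written anywhere in this
sketch. A `pr-info` without a URL, or with a malformed escape, simply
does not match.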

> > +void mark_promisor_remotes_as_accepted(struct repository *r, const char *remotes)
> > +{
> > +     struct strbuf **accepted_remotes = strbuf_split_str(remotes, ';', 0);
> > +
> > +     for (size_t i = 0; accepted_remotes[i]; i++) {
> > +             struct promisor_remote *p;
> > +             char *decoded_remote;
> > +
> > +             strbuf_strip_suffix(accepted_remotes[i], ";");
> > +             decoded_remote = url_percent_decode(accepted_remotes[i]->buf);
> > +
> > +             p = repo_promisor_remote_find(r, decoded_remote);
> > +             if (p)
> > +                     p->accepted = 1;
> > +             else
> > +                     warning(_("accepted promisor remote '%s' not found"),
> > +                             decoded_remote);
>
> My initial understanding of this code was that it is about the
> client-side accepting a remote, but this is about the server-side and
> tracks whether a promisor remote has been accepted by the client. It
> feels a bit weird to modify semi-global state for this, as I'd have
> rather expected that we pass around a vector of accepted remotes
> instead.
>
> But I guess ultimately this isn't too bad. It would be nice though if
> it was more obvious whether we're on the server- or client-side.

I have changed the description of the function like this in "promisor-remote.h":

/*
 * Set the 'accepted' flag for some promisor remotes. Useful on the
 * server side when some promisor remotes have been accepted by the
 * client.
 */
void mark_promisor_remotes_as_accepted(struct repository *repo,
				       const char *remotes);

> > diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
> > new file mode 100755
> > index 0000000000..0390c1dbad
> > --- /dev/null
> > +++ b/t/t5710-promisor-remote-capability.sh
> > @@ -0,0 +1,244 @@
> [snip]
> > +initialize_server () {
> > +     count="$1"
> > +     missing_oids="$2"
> > +
> > +     # Repack everything first
> > +     git -C server -c repack.writebitmaps=false repack -a -d &&
> > +
> > +     # Remove promisor files in case they exist, useful when reinitializing
> > +     rm -rf server/objects/pack/*.promisor &&
> > +
> > +     # Repack without the largest object and create a promisor pack on server
> > +     git -C server -c repack.writebitmaps=false repack -a -d \
> > +         --filter=blob:limit=5k --filter-to="$(pwd)/pack" &&
> > +     promisor_file=$(ls server/objects/pack/*.pack | sed "s/\.pack/.promisor/") &&
> > +     >"$promisor_file" &&
> > +
> > +     # Check objects missing on the server
> > +     check_missing_objects server "$count" "$missing_oids"
> > +}
> > +
> > +copy_to_server2 () {
>
> Nit: `server2` could be renamed to `promisor` to make the relation
> between the two servers more obvious.

I think "promisor" might be confusing as that is already used in parts
of some config variable names. For example we would have to set
"remote.promisor.promisor" to "true" several times. I have renamed it
to "lop" instead.

> > diff --git a/upload-pack.c b/upload-pack.c
> > index 728b2477fc..7498b45e2e 100644
> > --- a/upload-pack.c
> > +++ b/upload-pack.c
> > @@ -319,6 +320,8 @@ static void create_pack_file(struct upload_pack_data *pack_data,
> >               strvec_push(&pack_objects.args, "--delta-base-offset");
> >       if (pack_data->use_include_tag)
> >               strvec_push(&pack_objects.args, "--include-tag");
> > +     if (repo_has_accepted_promisor_remote(the_repository))
> > +             strvec_push(&pack_objects.args, "--missing=allow-promisor");
>
> This is nice and simple, I like it.

Yeah, this is really the only change that is needed for a client to be
able to lazy fetch from promisor remotes at clone time.

Thanks.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v4 5/6] promisor-remote: check advertised name or URL
  2025-01-30 10:51           ` Patrick Steinhardt
@ 2025-02-18 11:41             ` Christian Couder
  0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-02-18 11:41 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Junio C Hamano, git, Taylor Blau, Eric Sunshine, Karthik Nayak,
	Kristoffer Haugsbakk, brian m . carlson, Randall S . Becker,
	Christian Couder

On Thu, Jan 30, 2025 at 11:51 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Mon, Jan 27, 2025 at 03:48:08PM -0800, Junio C Hamano wrote:

> > I wonder if the reader needs to be told a bit more about the
> > security argument here.  I imagine that the attack vector behind the
> > use of "secure" in the above paragraph is for a malicious server
> > that guesses a promisor remote name the client already uses, which
> > has a different URL from what the client expects to be associated
> > with the name, thereby such an acceptance means that the URL used in
> > future fetches would be replaced without the user's consent.  Being
> > able to silently repoint the remote.origin.url at an evil repository
> > you control is indeed a powerful thing, I would guess.  Of course,
> > in a corp environment, such a mechanism to drive the clients to a
> > new repository after upgrading or migrating may be extremely handy.
>
> I'm still very hesitant about letting the server-side control remote
> names at all, as I've already mentioned in previous review rounds. I
> think that it opens up the client for a whole lot of issues that should
> rather be avoided. Most importantly, it takes control away from the
> user, as they are not free anymore to name the remotes however they want
> to. It also casts into stone current behaviour because it is now part of
> the protocol.

The server-side doesn't control remote names at all in this series.
There is just a match or no match, depending on the value of
promisor.acceptFromServer on the client-side, between what the client
already has configured (for example using the clone -c option) and
what the server advertises.

> That being said, I get the point that it may make sense to be "agile"
> regarding the promisor remotes. But I think we can achieve that without
> having to compromise on either usability or security by using something
> like a promisor ID instead.

Thanks for the suggestion and the ideas, but I think that what you
suggest could be discussed and implemented as part of a follow up
patch series. This patch series implements basic checks with
information (name and URL) that already exists on the server side and
might also be available on the client side. For a number of use cases
it is likely enough, and it's also not very complex.

I would be fine with resending the series without this patch, if
that's what is preferred though.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v4 5/6] promisor-remote: check advertised name or URL
  2025-01-27 23:48         ` Junio C Hamano
  2025-01-28  0:01           ` Junio C Hamano
  2025-01-30 10:51           ` Patrick Steinhardt
@ 2025-02-18 11:42           ` Christian Couder
  2 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-02-18 11:42 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
	Randall S . Becker, Christian Couder

On Tue, Jan 28, 2025 at 12:48 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > A previous commit introduced a "promisor.acceptFromServer" configuration
> > variable with only "None" or "All" as valid values.
> >
> > Let's introduce "KnownName" and "KnownUrl" as valid values for this
> > configuration option to give more choice to a client about which
> > promisor remotes it might accept among those that the server advertised.
>
> OK.
>
> >  promisor.acceptFromServer::
> >       If set to "all", a client will accept all the promisor remotes
> >       a server might advertise using the "promisor-remote"
> > -     capability. Default is "none", which means no promisor remote
> > -     advertised by a server will be accepted. By accepting a
> > -     promisor remote, the client agrees that the server might omit
> > -     objects that are lazily fetchable from this promisor remote
> > -     from its responses to "fetch" and "clone" requests from the
> > -     client. See linkgit:gitprotocol-v2[5].
> > +     capability. If set to "knownName" the client will accept
> > +     promisor remotes which are already configured on the client
> > +     and have the same name as those advertised by the server. This
> > +     is not very secure, but could be used in a corporate setup
> > +     where servers and clients are trusted to not switch name and
> > +     URLs.
>
> I wonder if the reader needs to be told a bit more about the
> security argument here.  I imagine that the attack vector behind the
> use of "secure" in the above paragraph is for a malicious server
> that guesses a promisor remote name the client already uses, which
> has a different URL from what the client expects to be associated
> with the name, thereby such an acceptance means that the URL used in
> future fetches would be replaced without the user's consent.

There is currently no mechanism for the URL to be replaced on the
client side by the one advertised by the server. The client will still
use the URL that has been configured in another way, likely the clone
`-c` option. But yeah it could lead to misunderstandings between the
client and the server. And if we later develop such a mechanism to
replace the URL on the client side, or to just temporarily use the one
advertised by the server, this could be a problem.

> Being
> able to silently repoint the remote.origin.url at an evil repository
> you control is indeed a powerful thing, I would guess.  Of course,
> in a corp environment, such a mechanism to drive the clients to a
> new repository after upgrading or migrating may be extremely handy.

Yeah, that's why there are chances that such a mechanism will be
developed later, and we should take care of warning users even if
currently there are no real security risks.

> Or does the above paragraph assume some other attack vectors,
> perhaps?

No, I don't see another attack vector.

> > +     If set to "knownUrl", the client will accept promisor
> > +     remotes which have both the same name and the same URL
> > +     configured on the client as the name and URL advertised by the
> > +     server. This is more secure than "all" or "knownUrl", so it

Here I see that it should be "knownName" instead of "knownUrl". I have
fixed this in the next version I will send soon.

> > +     should be used if possible instead of those options. Default
> > +     is "none", which means no promisor remote advertised by a
> > +     server will be accepted.
>
> OK.
>
> > diff --git a/promisor-remote.c b/promisor-remote.c
> > index 5ac282ed27..790a96aa19 100644
> > --- a/promisor-remote.c
> > +++ b/promisor-remote.c
> > @@ -370,30 +370,73 @@ char *promisor_remote_info(struct repository *repo)
> >       return strbuf_detach(&sb, NULL);
> >  }
> >
> > +/*
> > + * Find first index of 'vec' where there is 'val'. 'val' is compared
> > + * case insensively to the strings in 'vec'. If not found 'vec->nr' is

I mean "insensitively" instead of "insensively". This is fixed in the
next version.

> > + * returned.
> > + */
> > +static size_t strvec_find_index(struct strvec *vec, const char *val)
> > +{
> > +     for (size_t i = 0; i < vec->nr; i++)
> > +             if (!strcasecmp(vec->v[i], val))
> > +                     return i;
> > +     return vec->nr;
> > +}
>
> Hmph, without the hardcoded strcasecmp(), strvec_find() might make a
> fine public API in <strvec.h>.

Yeah, but I didn't find any other places in the code where a
strvec_find() function could be useful.

> Unless we intend to create a generic function that qualifies as a
> part of the public strvec API, we shouldn't call it strvec_anything.
> This is a great helper that finds a matching remote nickname from
> list of remote nicknames, so
>
>     remote_nick_find(struct strvec *nicks, const char *nick)
>
> may be more appropriate.

Ok, I have renamed it remote_nick_find() in the next version.

> When we lift it out of here and make it
> more generic to move it to strvec.[ch], perhaps
>
>         size_t strvec_find(struct strvec *vec, void *needle,
>                  int (*match)(const char *, void *)) {
>                 for (size_t ix = 0; ix < vec->nr; ix++)
>                         if (match(vec->v[ix], needle))
>                                 return ix;
>                 return vec->nr;
>         }
>
> which will be used to rewrite remote_nick_find() like so:
>
>         static int nicks_match(const char *nick, void *needle)
>         {
>                 return !strcasecmp(nick, (const char *)needle);
>         }
>
>         static size_t remote_nick_find(struct strvec *nicks, const char *nick)
>         {
>                 return strvec_find(nicks, (void *)nick, nicks_match);
>         }
>
> it would be better to use a more generic parameter name "vec", but
> until then, it is better to be more specific and explicit about the
> reason why the immediate callers call the function, which is
> where my "nicks" vs "nick" comes from (it is OK to call the latter
> "needle", though).

Yeah, I would be fine with this solution if there were other places
where strvec_find() could be useful.
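
For readers following along, here is a self-contained sketch of the
callback-based variant discussed above. It is only an illustration:
`struct strvec` below is a minimal stand-in with just the two fields
used here, `strvec_find()` is not part of Git's actual strvec API, and
taking the needle as `const void *` is an assumption made to avoid
casting away constness.

```c
#include <stddef.h>
#include <strings.h>

/* Minimal stand-in for Git's 'struct strvec' (only the fields used here). */
struct strvec {
	const char **v;
	size_t nr;
};

/* Returns non-zero when 'item' matches 'needle'. */
typedef int (*strvec_match_fn)(const char *item, const void *needle);

/* Index of the first matching item, or vec->nr when nothing matches. */
static size_t strvec_find(const struct strvec *vec, const void *needle,
			  strvec_match_fn match)
{
	for (size_t ix = 0; ix < vec->nr; ix++)
		if (match(vec->v[ix], needle))
			return ix;
	return vec->nr;
}

/* Remote nicknames are compared case insensitively. */
static int nicks_match(const char *nick, const void *needle)
{
	return !strcasecmp(nick, (const char *)needle);
}

static size_t remote_nick_find(const struct strvec *nicks, const char *nick)
{
	return strvec_find(nicks, nick, nicks_match);
}
```

Returning `vec->nr` for "not found" keeps the return type unsigned, at
the cost of every caller having to compare against the vector length.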

> >  enum accept_promisor {
> >       ACCEPT_NONE = 0,
> > +     ACCEPT_KNOWN_URL,
> > +     ACCEPT_KNOWN_NAME,
> >       ACCEPT_ALL
> >  };
> >
> >  static int should_accept_remote(enum accept_promisor accept,
> > -                             const char *remote_name UNUSED,
> > -                             const char *remote_url UNUSED)
> > +                             const char *remote_name, const char *remote_url,
> > +                             struct strvec *names, struct strvec *urls)
> >  {
> > +     size_t i;
> > +
> >       if (accept == ACCEPT_ALL)
> >               return 1;
> >
> > -     BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
> > +     i = strvec_find_index(names, remote_name);
> > +
> > +     if (i >= names->nr)
> > +             /* We don't know about that remote */
> > +             return 0;
>
> OK.
>
> > +     if (accept == ACCEPT_KNOWN_NAME)
> > +             return 1;
> > +
> > +     if (accept != ACCEPT_KNOWN_URL)
> > +             BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
>
> I can see why this defensiveness may be a better idea than not having
> any, but I wonder if we can take advantage of compile time checks
> some compilers have to ensure that case arms in a switch statement
> are exhaustive?

Perhaps, but otherwise I am not sure that using a switch statement
would make the code better. The ACCEPT_KNOWN_NAME and ACCEPT_KNOWN_URL
cases need to share some code and the ACCEPT_NONE case seems better
handled by the caller.
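
To make the trade-off concrete, here is a hedged sketch of what a
switch-based version could look like. It is not the code from the
series: `struct known_remote` and `should_accept()` are hypothetical,
and the name lookup is assumed to have been done by the caller (a NULL
pointer means the advertised name is not configured on the client).
Leaving out a `default:` arm is what lets GCC and Clang's `-Wswitch`
(part of `-Wall`) warn when a new enumerator is not handled.

```c
#include <stddef.h>
#include <string.h>

enum accept_promisor {
	ACCEPT_NONE = 0,
	ACCEPT_KNOWN_URL,
	ACCEPT_KNOWN_NAME,
	ACCEPT_ALL
};

/* Hypothetical view of a remote already configured on the client. */
struct known_remote {
	const char *name;
	const char *url;
};

/*
 * 'known' is the client-side remote whose name matched the advertised
 * one, or NULL when the client has no remote with that name.
 */
static int should_accept(enum accept_promisor accept,
			 const struct known_remote *known,
			 const char *advertised_url)
{
	switch (accept) {
	case ACCEPT_NONE:
		return 0;
	case ACCEPT_ALL:
		return 1;
	case ACCEPT_KNOWN_NAME:
		return known != NULL;
	case ACCEPT_KNOWN_URL:
		/* The whole URL is compared case sensitively. */
		return known && !strcmp(known->url, advertised_url);
	}
	return 0; /* only reached for values outside the enum */
}
```

The shared code between the two "known" cases moves into the caller in
this shape, which is arguably the cost mentioned in the reply above.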

> > +     if (!strcasecmp(urls->v[i], remote_url))
> > +             return 1;
>
> This is iffy.  The <scheme>://<host>/ part might want to be compared
> case insensitively, but the rest of the URL is generally case
> sensitive (unless the material served is stored on a machine with
> case-insensitive filesystem)?

I am fine with comparing the whole URL case sensitively. So
"strcasecmp()" is replaced with "strcmp()" in the next version.

> Given that the existing URL must have come by either cloning from
> this server or another related server or by an earlier
> acceptFromServer behaviour, I do not see a need for being extra lax
> here.  We should be more careful about our use of case-insensitive
> comparison, and I do not see how this URL comparison could be
> something the end users would expect to be done case insensitively.

In another email you also said:

> Note that I am not advocating to compare the earlier part case
> insensitively while comparing the remainder case sensitively.
>
> Because we are not comparing URLs that come from random sources, but
> we know they come from only a few very controlled sources (i.e., the
> original server we cloned from, and the promisor remotes suggested
> by the original server and other promisor remotes whose suggestion
> we accepted, recursively), it should be sufficient to compare the
> whole string case sensitively.

When I implemented this, I was just thinking that some users might for
example spell the scheme part "HTTPS" in their client config and then
complain that it should work when the server advertises the same URL
with "https" instead of "HTTPS", because yeah the <scheme>://<host>/
part should be case insensitive. But I agree we can start with
everything being case sensitive and improve on this (likely by
comparing the <scheme>://<host>/ part case insensitively and the rest
case sensitively) if/when users complain.
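
For illustration, the possible later refinement mentioned above
(scheme and host compared case insensitively, the rest case
sensitively) could be sketched as follows. This is not what the next
version does, since it uses a plain strcmp(). The helper names are
hypothetical, and the sketch makes a simplifying assumption: everything
between "://" and the next '/' is treated as the host, so any userinfo
such as "user@" would also be compared case insensitively.

```c
#include <string.h>
#include <strings.h>

/*
 * Length of the "<scheme>://<host>" prefix of 'url', or 0 when 'url'
 * contains no "://" separator.
 */
static size_t authority_len(const char *url)
{
	const char *sep = strstr(url, "://");

	if (!sep)
		return 0;
	sep += 3;
	return (size_t)(sep - url) + strcspn(sep, "/");
}

/* Non-zero when the two URLs match under the rules described above. */
static int promisor_urls_match(const char *a, const char *b)
{
	size_t la = authority_len(a), lb = authority_len(b);

	if (la != lb)
		return 0;
	if (strncasecmp(a, b, la))
		return 0;
	return !strcmp(a + la, b + lb);
}
```

With this sketch, "HTTPS://Example.com/repo.git" matches
"https://example.com/repo.git", while "file:///srv/serverTwo" and
"file:///srv/servertwo" still differ, which is the case the test in the
patch exercises.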

> > -static void filter_promisor_remote(struct strvec *accepted, const char *info)
> > +static void filter_promisor_remote(struct repository *repo,
> > +                                struct strvec *accepted,
> > +                                const char *info)
> >  {
> >       struct strbuf **remotes;
> >       const char *accept_str;
> >       enum accept_promisor accept = ACCEPT_NONE;
> > +     struct strvec names = STRVEC_INIT;
> > +     struct strvec urls = STRVEC_INIT;
> >
> >       if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
> >               if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
>
> Not a fault of this step, but is it sensible to even expect
> !accept_str in an error case?  *accept_str could be NUL, but
> accept_str is either left uninitialized (because this caller does
> not initialize it) when the get_string_tmp() returns non-zero, or
> points at the internal cached value in the config_set if it returns
> 0 (and the control comes into this block).

Yeah, I agree accept_str cannot be NULL here. I have removed
"!accept_str || " in the next version.

> >                       accept = ACCEPT_NONE;
> > +             else if (!strcasecmp("KnownUrl", accept_str))
> > +                     accept = ACCEPT_KNOWN_URL;
> > +             else if (!strcasecmp("KnownName", accept_str))
> > +                     accept = ACCEPT_KNOWN_NAME;
> >               else if (!strcasecmp("All", accept_str))
> >                       accept = ACCEPT_ALL;
> >               else
>
> Ditto about icase for all of the above.

These are config values that can take only a specific set of values. I
think those are most often compared case insensitively in Git, for
example there is no distinction between "True" and "true" for bool
values. So I am not sure what you suggest here.
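
As a small sketch of the point being made (a fixed set of config
values matched without regard to case), the if/else chain could also
be written as a table lookup. `parse_accept_value()` and the table are
hypothetical, not code from the series; treating an empty string as
"none" mirrors the `!*accept_str` check in the quoted hunk.

```c
#include <stddef.h>
#include <strings.h>

enum accept_promisor {
	ACCEPT_NONE = 0,
	ACCEPT_KNOWN_URL,
	ACCEPT_KNOWN_NAME,
	ACCEPT_ALL
};

static const struct {
	const char *name;
	enum accept_promisor value;
} accept_values[] = {
	{ "none", ACCEPT_NONE },
	{ "knownUrl", ACCEPT_KNOWN_URL },
	{ "knownName", ACCEPT_KNOWN_NAME },
	{ "all", ACCEPT_ALL },
};

/*
 * Store the parsed value in '*out' and return 0, or return -1 when
 * 'str' is not one of the known values. Matching ignores case.
 */
static int parse_accept_value(const char *str, enum accept_promisor *out)
{
	if (!*str) {
		*out = ACCEPT_NONE;
		return 0;
	}
	for (size_t i = 0; i < sizeof(accept_values) / sizeof(accept_values[0]); i++) {
		if (!strcasecmp(accept_values[i].name, str)) {
			*out = accept_values[i].value;
			return 0;
		}
	}
	return -1;
}
```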

> > +test_expect_success "clone with 'KnownUrl' and different remote urls" '
> > +     ln -s server2 serverTwo &&
> > +
> > +     git -C server config promisor.advertise true &&
> > +
> > +     # Clone from server to create a client
> > +     GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
> > +             -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
> > +             -c remote.server2.url="file://$(pwd)/serverTwo" \
> > +             -c promisor.acceptfromserver=KnownUrl \
> > +             --no-local --filter="blob:limit=5k" server client &&
> > +     test_when_finished "rm -rf client" &&
> > +
> > +     # Check that the largest object is not missing on the server
> > +     check_missing_objects server 0 "" &&
> > +
> > +     # Reinitialize server so that the largest object is missing again
> > +     initialize_server 1 "$oid"
> > +'
>
> Nice ;-)
>
> Here, I also notice that we are not testing that serverTwo and
> servertwo are considered the same thanks to the use of icase
> comparison.  We shouldn't compare URLs with strcasecmp().

Ok, thanks.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v4 3/6] version: make redact_non_printables() non-static
  2025-01-30 10:51         ` Patrick Steinhardt
@ 2025-02-18 11:42           ` Christian Couder
  0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-02-18 11:42 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Junio C Hamano, Taylor Blau, Eric Sunshine, Karthik Nayak,
	Kristoffer Haugsbakk, brian m . carlson, Randall S . Becker

On Thu, Jan 30, 2025 at 11:51 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Mon, Jan 27, 2025 at 04:16:58PM +0100, Christian Couder wrote:
> > As we are going to reuse redact_non_printables() outside "version.c",
> > let's make it non-static.
>
> Missing the DCO.

Thanks for spotting this.

> > diff --git a/version.h b/version.h
> > index 7c62e80577..fcc1816685 100644
> > --- a/version.h
> > +++ b/version.h
> > @@ -4,7 +4,15 @@
> >  extern const char git_version_string[];
> >  extern const char git_built_from_commit_string[];
> >
> > +struct strbuf;
> > +
> >  const char *git_user_agent(void);
> >  const char *git_user_agent_sanitized(void);
> >
> > +/*
> > + * Trim and replace each character with ascii code below 32 or above
> > + * 127 (included) using a dot '.' character.
> > +*/
> > +void redact_non_printables(struct strbuf *buf);
>
> Is this header really the right spot though? If I want to redact
> characters I certainly wouldn't be looking at "version.h" for that
> functionality.

In previous versions of this series, I wanted to put this in the
strbuf API but it appeared not to be a good idea.

Anyway, now I think that this patch is not needed, thanks to a comment
you made about the following patch. So we don't need to find a good
place for it for now.
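
For context on what the dropped helper was meant to do, here is a
rough sketch based only on the header comment quoted above. It works
on a plain NUL-terminated buffer instead of a `struct strbuf`, the
function name is hypothetical, and it assumes "(included)" means that
code 127 itself is replaced too.

```c
#include <ctype.h>
#include <string.h>

/*
 * Trim leading and trailing whitespace in place, then replace every
 * byte below 32 or at/above 127 with a dot.
 */
static void redact_non_printables_sketch(char *s)
{
	size_t len = strlen(s), start = 0;

	while (len && isspace((unsigned char)s[len - 1]))
		len--;
	while (start < len && isspace((unsigned char)s[start]))
		start++;
	memmove(s, s + start, len - start);
	len -= start;
	s[len] = '\0';

	for (size_t i = 0; i < len; i++) {
		unsigned char c = (unsigned char)s[i];

		if (c < 32 || c >= 127)
			s[i] = '.';
	}
}
```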

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v3 5/5] doc: add technical design doc for large object promisors
  2025-01-27 18:02           ` Junio C Hamano
@ 2025-02-18 11:42             ` Christian Couder
  0 siblings, 0 replies; 110+ messages in thread
From: Christian Couder @ 2025-02-18 11:42 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Christian Couder

On Mon, Jan 27, 2025 at 7:02 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> >> > +In other words, the goal of this document is not to talk about all the
> >> > +possible ways to optimize how Git could handle large blobs, but to
> >> > +describe how a LOP based solution could work well and alleviate a
> >> > +number of current issues in the context of Git clients and servers
> >> > +sharing Git objects.
> >>
> >> But if you do not discuss even a single way, and handwave "we'll
> >> have this magical object storage that would solve all the problems
> >> for us", then we cannot really tell if the problem is solved by us,
> >> or handwaved away by assuming the magical object storage.
> >> We'd need at least one working example.
> >
> > It's not magical object storage. Amazon S3, GCP Bucket and MinIO
> > (which is open source), for example, already exist and are used a lot
> > in the industry.
>
> That's just "we can store a bunch of bytes and ask them to be
> retrieved".  What I said about handwaving the presence of magical
> "object storage" is exactly the "optimize how to handle large blobs"
> part.  I agree that we do not need to discuss _ALL_ the possible
> ways.  But without telling what our thoughts are on _how_ to use these
> "lower cost and safe by duplication but with high latency" services
> to store our objects efficiently enough to make it practical, I'd
> have to call what we see in the document "magical object storage".

I have added the following:

Even if LOPs are used not very efficiently, they can still be useful
and worth using in some cases because, as we will see in more details
later in this document:

  - they can make it simpler for clients to use promisor remotes and
    therefore avoid fetching a lot of large blobs they might not need
    locally,

  - they can make it significantly cheaper or easier for servers to
    host a significant part of the current repository content, and
    even more to host content with larger blobs or more large blobs
    than currently.

I hope this addresses some of your concerns. I could also talk about
remote helpers and object storage here, but this would be duplicating
the "2) LOPs can use object storage" section. If you think that we
should tell our thoughts about how to improve remote helpers and
object storage performance, I think this should go into that section
rather than here.

> >> > +7) A client can offload to a LOP
> >> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >> > +
> >> > +When a client is using a LOP that is also a LOP of its main remote,
> >> > +the client should be able to offload some large blobs it has fetched,
> >> > +but might not need anymore, to the LOP.
> >>
> >> For a client that _creates_ a large object, the situation would be
> >> the same, right?  After it creates several versions of the opening
> >> segment of, say, a movie, the latest version may be still wanted,
> >> but the creating client may want to offload earlier versions.
> >
> > Yeah, but it's not clear if the versions of the opening segment should
> > be sent directly to the LOP without the main remote checking them in
> > some ways (hooks might be configured only on the main remote) and/or
> > checking that they are connected to the repo. I guess it depends on
> > the context if it would be OK or not.
>
> If it is not clear to us or whoever writes this document, the users
> would have a hard time making effective use of it, which is why I
> am worried about the current design in this feature.

Yeah, but this feature doesn't exist at all yet, and it might not even
be a priority, so I prefer not to promise too much.

For now, I have added:

"This should be discussed and refined when we get closer to
implementing this feature."

just after:

"It might depend on the context if it should be OK or not for clients
to offload large blobs they have created, instead of fetched, directly
to the LOP without the main remote checking them in some ways
(possibly using hooks or other tools)."

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v5 0/3] Introduce a "promisor-remote" capability
  2025-02-18 11:32       ` [PATCH v5 0/3] " Christian Couder
                           ` (2 preceding siblings ...)
  2025-02-18 11:32         ` [PATCH v5 3/3] doc: add technical design doc for large object promisors Christian Couder
@ 2025-02-18 19:07         ` Junio C Hamano
  2025-02-21  8:34         ` Patrick Steinhardt
  4 siblings, 0 replies; 110+ messages in thread
From: Junio C Hamano @ 2025-02-18 19:07 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Patrick Steinhardt, Taylor Blau, Eric Sunshine,
	Karthik Nayak, Kristoffer Haugsbakk, brian m . carlson,
	Randall S . Becker

Christian Couder <christian.couder@gmail.com> writes:

> Changes compared to version 4
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>   - The series is rebased on top of 0394451348 (The eleventh batch,
>     2025-02-14). This is to take into account some recent changes like
>     some documentation files using the ".adoc" extension instead of
>     ".txt".

That would make it easier to work for you and anybody who wants to
improve on these changes, which is very much welcome.  The topic is
not a maint material to fix anything, so the rebase is pretty much
welcome.

>   - Patches 1/6, 2/6 and 3/6 from version 4 have been removed, as it
>     looks like using redact_non_printables() is not necessary after
>     all.

That would make my work a lot simpler ;-)  I had to juggle the two
topics every time one of them changed.

Will queue.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v5 3/3] doc: add technical design doc for large object promisors
  2025-02-18 11:32         ` [PATCH v5 3/3] doc: add technical design doc for large object promisors Christian Couder
@ 2025-02-21  8:33           ` Patrick Steinhardt
  2025-03-03 16:58             ` Junio C Hamano
  0 siblings, 1 reply; 110+ messages in thread
From: Patrick Steinhardt @ 2025-02-21  8:33 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, Taylor Blau, Eric Sunshine, Karthik Nayak,
	Kristoffer Haugsbakk, brian m . carlson, Randall S . Becker,
	Christian Couder

On Tue, Feb 18, 2025 at 12:32:04PM +0100, Christian Couder wrote:
> diff --git a/Documentation/technical/large-object-promisors.txt b/Documentation/technical/large-object-promisors.txt
> new file mode 100644
> index 0000000000..ebbbd7c18f
> --- /dev/null
> +++ b/Documentation/technical/large-object-promisors.txt
> @@ -0,0 +1,656 @@
> +In other words, the goal of this document is not to talk about all the
> +possible ways to optimize how Git could handle large blobs, but to
> +describe how a LOP based solution can already work well and alleviate
> +a number of current issues in the context of Git clients and servers
> +sharing Git objects.
> +
> +Even if LOPs are used not very efficiently, they can still be useful
> +and worth using in some cases because, as we will see in more details

s/because//

Patrick

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v5 0/3] Introduce a "promisor-remote" capability
  2025-02-18 11:32       ` [PATCH v5 0/3] " Christian Couder
                           ` (3 preceding siblings ...)
  2025-02-18 19:07         ` [PATCH v5 0/3] Introduce a "promisor-remote" capability Junio C Hamano
@ 2025-02-21  8:34         ` Patrick Steinhardt
  2025-02-21 18:40           ` Junio C Hamano
  4 siblings, 1 reply; 110+ messages in thread
From: Patrick Steinhardt @ 2025-02-21  8:34 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, Taylor Blau, Eric Sunshine, Karthik Nayak,
	Kristoffer Haugsbakk, brian m . carlson, Randall S . Becker

On Tue, Feb 18, 2025 at 12:32:01PM +0100, Christian Couder wrote:
> This work is part of some effort to better handle large files/blobs in
> a client-server context using promisor remotes dedicated to storing
> large blobs. To help understand this effort, this series now contains
> a patch (patch 3/3) that adds design documentation about this effort.
> 
> Last year, I sent 3 versions of a patch series with the goal of
> allowing a client C to clone from a server S while using the same
> promisor remote X that S already uses. See:
> 
> https://lore.kernel.org/git/20240418184043.2900955-1-christian.couder@gmail.com/
> 
> Junio suggested to implement that feature using:
> 
> "a protocol extension that lets S tell C that S wants C to fetch
> missing objects from X (which means that if C knows about X in its
> ".git/config" then there is no need for end-user interaction at all),
> or a protocol extension that C tells S that C is willing to see
> objects available from X omitted when S does not have them (again,
> this could be done by looking at ".git/config" at C, but there may be
> security implications???)"
> 
> This patch series implements that protocol extension called
> "promisor-remote" (that name is open to change or simplification)
> which allows S and C to agree on C using X directly or not.
> 
> I have tried to implement it in a quite generic way that could allow S
> and C to share more information about promisor remotes and how to use
> them.
> 
> For now, C doesn't use the information it gets from S when cloning.
> That information is only used to decide if C is OK to use the promisor
> remotes advertised by S. But this could change in the future which
> could make it much simpler for clients than using the current way of
> passing information about X with the `-c` option of `git clone` many
> times on the command line.
> 
> Another improvement could be to not require GIT_NO_LAZY_FETCH=0 when S
> and C have agreed on using X.

I'm fine with this version of the patch series. There are a couple of
features that we probably want to have eventually:

  - Persisting announced promisors. As far as I understand, we don't yet
    write them into the client-side configuration of the repository at
    all.

  - Promisor remote agility. When the set of announced promisors
    changes, we should optionally update the set of promisors connected
    to that remote on the client-side.

  - Authentication. In case the promisor remote requires authentication
    we'll somehow need to communicate the credentials to the client.

All of these feel like topics that can be implemented incrementally once
the foundation has landed, so I don't think they have to be implemented
as part of the patch series here. I also don't see anything obvious that
would block any of these features with the current design.

Thanks for working on this!

Patrick

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v5 0/3] Introduce a "promisor-remote" capability
  2025-02-21  8:34         ` Patrick Steinhardt
@ 2025-02-21 18:40           ` Junio C Hamano
  0 siblings, 0 replies; 110+ messages in thread
From: Junio C Hamano @ 2025-02-21 18:40 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Christian Couder, git, Taylor Blau, Eric Sunshine, Karthik Nayak,
	Kristoffer Haugsbakk, brian m . carlson, Randall S . Becker

Patrick Steinhardt <ps@pks.im> writes:

> I'm fine with this version of the patch series. There are a couple of
> features that we probably want to have eventually:
>
>   - Persisting announced promisors. As far as I understand, we don't yet
>     write them into the client-side configuration of the repository at
>     all.
>
>   - Promisor remote agility. When the set of announced promisors
>     changes, we should optionally update the set of promisors connected
>     to that remote on the client-side.
>
>   - Authentication. In case the promisor remote requires authentication
>     we'll somehow need to communicate the credentials to the client.
>
> All of these feel like topics that can be implemented incrementally once
> the foundation has landed, so I don't think they have to be implemented
> as part of the patch series here. I also don't see anything obvious that
> would block any of these features with the current design.

All of them smell like they carry grave security implications to me.

I am happy to see none of them are included in this round, as
getting the details of them right would take a lot of time and
effort; it is great to have the fundamentals first without having to
worry about them.

> Thanks for working on this!

Likewise.

* Re: [PATCH v5 3/3] doc: add technical design doc for large object promisors
  2025-02-21  8:33           ` Patrick Steinhardt
@ 2025-03-03 16:58             ` Junio C Hamano
  0 siblings, 0 replies; 110+ messages in thread
From: Junio C Hamano @ 2025-03-03 16:58 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Christian Couder, git, Taylor Blau, Eric Sunshine, Karthik Nayak,
	Kristoffer Haugsbakk, brian m . carlson, Randall S . Becker,
	Christian Couder

Patrick Steinhardt <ps@pks.im> writes:

> On Tue, Feb 18, 2025 at 12:32:04PM +0100, Christian Couder wrote:
>> diff --git a/Documentation/technical/large-object-promisors.txt b/Documentation/technical/large-object-promisors.txt
>> new file mode 100644
>> index 0000000000..ebbbd7c18f
>> --- /dev/null
>> +++ b/Documentation/technical/large-object-promisors.txt
>> @@ -0,0 +1,656 @@
>> +In other words, the goal of this document is not to talk about all the
>> +possible ways to optimize how Git could handle large blobs, but to
>> +describe how a LOP based solution can already work well and alleviate
>> +a number of current issues in the context of Git clients and servers
>> +sharing Git objects.
>> +
>> +Even if LOPs are used not very efficiently, they can still be useful
>> +and worth using in some cases because, as we will see in more details
>
> s/because//

I've squashed this in and it seems everything is in order in this
topic, so let's mark it for 'next' now.

Thanks, all.

Thread overview: 110+ messages
2024-07-31 13:40 [PATCH 0/4] Introduce a "promisor-remote" capability Christian Couder
2024-07-31 13:40 ` [PATCH 1/4] version: refactor strbuf_sanitize() Christian Couder
2024-07-31 17:18   ` Junio C Hamano
2024-08-20 11:29     ` Christian Couder
2024-07-31 13:40 ` [PATCH 2/4] strbuf: refactor strbuf_trim_trailing_ch() Christian Couder
2024-07-31 17:29   ` Junio C Hamano
2024-07-31 21:49     ` Taylor Blau
2024-08-20 11:29       ` Christian Couder
2024-08-20 11:29     ` Christian Couder
2024-07-31 13:40 ` [PATCH 3/4] Add 'promisor-remote' capability to protocol v2 Christian Couder
2024-07-31 15:40   ` Taylor Blau
2024-08-20 11:32     ` Christian Couder
2024-08-20 17:01       ` Junio C Hamano
2024-09-10 16:32         ` Christian Couder
2024-07-31 16:16   ` Taylor Blau
2024-08-20 11:32     ` Christian Couder
2024-08-20 16:55       ` Junio C Hamano
2024-09-10 16:32       ` Christian Couder
2024-09-10 17:46         ` Junio C Hamano
2024-07-31 18:25   ` Junio C Hamano
2024-07-31 19:34     ` Junio C Hamano
2024-08-20 12:21     ` Christian Couder
2024-08-05 13:48   ` Patrick Steinhardt
2024-08-19 20:00     ` Junio C Hamano
2024-09-10 16:31     ` Christian Couder
2024-07-31 13:40 ` [PATCH 4/4] promisor-remote: check advertised name or URL Christian Couder
2024-07-31 18:35   ` Junio C Hamano
2024-09-10 16:32     ` Christian Couder
2024-07-31 16:01 ` [PATCH 0/4] Introduce a "promisor-remote" capability Junio C Hamano
2024-07-31 16:17 ` Taylor Blau
2024-09-10 16:29 ` [PATCH v2 " Christian Couder
2024-09-10 16:29   ` [PATCH v2 1/4] version: refactor strbuf_sanitize() Christian Couder
2024-09-10 16:29   ` [PATCH v2 2/4] strbuf: refactor strbuf_trim_trailing_ch() Christian Couder
2024-09-10 16:29   ` [PATCH v2 3/4] Add 'promisor-remote' capability to protocol v2 Christian Couder
2024-09-30  7:56     ` Patrick Steinhardt
2024-09-30 13:28       ` Christian Couder
2024-10-01 10:14         ` Patrick Steinhardt
2024-10-01 18:47           ` Junio C Hamano
2024-11-06 14:04     ` Patrick Steinhardt
2024-11-28  5:47     ` Junio C Hamano
2024-11-28 15:31       ` Christian Couder
2024-11-29  1:31         ` Junio C Hamano
2024-09-10 16:30   ` [PATCH v2 4/4] promisor-remote: check advertised name or URL Christian Couder
2024-09-30  7:57     ` Patrick Steinhardt
2024-09-26 18:09   ` [PATCH v2 0/4] Introduce a "promisor-remote" capability Junio C Hamano
2024-09-27  9:15     ` Christian Couder
2024-09-27 22:48       ` Junio C Hamano
2024-09-27 23:31         ` rsbecker
2024-09-28 10:56           ` Kristoffer Haugsbakk
2024-09-30  7:57         ` Patrick Steinhardt
2024-09-30  9:17           ` Christian Couder
2024-09-30 16:52             ` Junio C Hamano
2024-10-01 10:14             ` Patrick Steinhardt
2024-09-30 16:34           ` Junio C Hamano
2024-09-30 21:26           ` brian m. carlson
2024-09-30 22:27             ` Junio C Hamano
2024-10-01 10:13               ` Patrick Steinhardt
2024-12-06 12:42   ` [PATCH v3 0/5] " Christian Couder
2024-12-06 12:42     ` [PATCH v3 1/5] version: refactor strbuf_sanitize() Christian Couder
2024-12-07  6:21       ` Junio C Hamano
2025-01-27 15:07         ` Christian Couder
2024-12-06 12:42     ` [PATCH v3 2/5] strbuf: refactor strbuf_trim_trailing_ch() Christian Couder
2024-12-07  6:35       ` Junio C Hamano
2025-01-27 15:07         ` Christian Couder
2024-12-16 11:47       ` karthik nayak
2024-12-06 12:42     ` [PATCH v3 3/5] Add 'promisor-remote' capability to protocol v2 Christian Couder
2024-12-07  7:59       ` Junio C Hamano
2025-01-27 15:08         ` Christian Couder
2024-12-06 12:42     ` [PATCH v3 4/5] promisor-remote: check advertised name or URL Christian Couder
2024-12-06 12:42     ` [PATCH v3 5/5] doc: add technical design doc for large object promisors Christian Couder
2024-12-10  1:28       ` Junio C Hamano
2025-01-27 15:12         ` Christian Couder
2024-12-10 11:43       ` Junio C Hamano
2024-12-16  9:00         ` Patrick Steinhardt
2025-01-27 15:11         ` Christian Couder
2025-01-27 18:02           ` Junio C Hamano
2025-02-18 11:42             ` Christian Couder
2024-12-09  8:04     ` [PATCH v3 0/5] Introduce a "promisor-remote" capability Junio C Hamano
2024-12-09 10:40       ` Christian Couder
2024-12-09 10:42         ` Christian Couder
2024-12-09 23:01         ` Junio C Hamano
2025-01-27 15:05           ` Christian Couder
2025-01-27 19:38             ` Junio C Hamano
2025-01-27 15:16     ` [PATCH v4 0/6] " Christian Couder
2025-01-27 15:16       ` [PATCH v4 1/6] version: replace manual ASCII checks with isprint() for clarity Christian Couder
2025-01-27 15:16       ` [PATCH v4 2/6] version: refactor redact_non_printables() Christian Couder
2025-01-27 15:16       ` [PATCH v4 3/6] version: make redact_non_printables() non-static Christian Couder
2025-01-30 10:51         ` Patrick Steinhardt
2025-02-18 11:42           ` Christian Couder
2025-01-27 15:16       ` [PATCH v4 4/6] Add 'promisor-remote' capability to protocol v2 Christian Couder
2025-01-30 10:51         ` Patrick Steinhardt
2025-02-18 11:41           ` Christian Couder
2025-01-27 15:17       ` [PATCH v4 5/6] promisor-remote: check advertised name or URL Christian Couder
2025-01-27 23:48         ` Junio C Hamano
2025-01-28  0:01           ` Junio C Hamano
2025-01-30 10:51           ` Patrick Steinhardt
2025-02-18 11:41             ` Christian Couder
2025-02-18 11:42           ` Christian Couder
2025-01-27 15:17       ` [PATCH v4 6/6] doc: add technical design doc for large object promisors Christian Couder
2025-01-27 21:14       ` [PATCH v4 0/6] Introduce a "promisor-remote" capability Junio C Hamano
2025-02-18 11:40         ` Christian Couder
2025-02-18 11:32       ` [PATCH v5 0/3] " Christian Couder
2025-02-18 11:32         ` [PATCH v5 1/3] Add 'promisor-remote' capability to protocol v2 Christian Couder
2025-02-18 11:32         ` [PATCH v5 2/3] promisor-remote: check advertised name or URL Christian Couder
2025-02-18 11:32         ` [PATCH v5 3/3] doc: add technical design doc for large object promisors Christian Couder
2025-02-21  8:33           ` Patrick Steinhardt
2025-03-03 16:58             ` Junio C Hamano
2025-02-18 19:07         ` [PATCH v5 0/3] Introduce a "promisor-remote" capability Junio C Hamano
2025-02-21  8:34         ` Patrick Steinhardt
2025-02-21 18:40           ` Junio C Hamano