From: "Matheus Afonso Martins Moreira via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: "Torsten Bögershausen" <tboegi@web.de>,
	"Ghanshyam Thakkar" <shyamthakkar001@gmail.com>,
	"Matheus Moreira" <matheus@matheusmoreira.com>,
	"Matheus Afonso Martins Moreira" <matheus@matheusmoreira.com>
Subject: [PATCH v3 5/8] urlmatch: define url_parse function
Date: Sat, 02 May 2026 05:28:39 +0000
Message-ID: <89932a70f3ace6ff1198628873df702f40f1442a.1777699722.git.gitgitgadget@gmail.com>
In-Reply-To: <pull.1715.v3.git.git.1777699722.gitgitgadget@gmail.com>

From: Matheus Afonso Martins Moreira <matheus@matheusmoreira.com>

Define url_parse, a general parsing function that supports all Git URLs,
including scp-style URLs such as hostname:~user/repo.

It is adapted from the algorithm in connect.c's parse_connect_url
and reuses the shared enum url_scheme and the url_get_scheme function
that previous commits made available in url.h, so the new parser and
the connect path agree on scheme classification. url_parse has the
same interface as url_normalize and uses the same data structures.

url_parse and parse_connect_url accept the same URL forms, with one
deliberate exception: bare local paths such as "/abs/path", "./rel"
or "repo". parse_connect_url accepts them as URL_SCHEME_LOCAL, but
url_parse rejects them because url_normalize requires a URL of the
form scheme://host. A consumer that wants to handle both URLs and
local paths needs to dispatch on url_is_local_not_ssh before calling
url_parse, just as the connect path does internally.
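
As an illustration of that dispatch, here is a minimal standalone
sketch. It does not call into Git: looks_local_not_ssh is a
hypothetical, simplified stand-in for url_is_local_not_ssh (it
ignores DOS drive prefixes), and enum input_kind and classify_input
are names invented for this example.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical classification for this sketch only. */
enum input_kind { INPUT_LOCAL_PATH, INPUT_SCP_STYLE, INPUT_URL };

/*
 * Simplified stand-in for url_is_local_not_ssh(): an input is a
 * local path when it has no ':' at all, or when a '/' appears
 * before the first ':'. DOS drive prefixes are not handled here.
 */
static int looks_local_not_ssh(const char *s)
{
	const char *colon = strchr(s, ':');
	const char *slash = strchr(s, '/');

	return !colon || (slash && slash < colon);
}

/* Decide which code path a consumer would take for this input. */
static enum input_kind classify_input(const char *s)
{
	assert(s != NULL);

	if (strstr(s, "://"))
		return INPUT_URL;        /* candidate for a URL parser */
	if (looks_local_not_ssh(s))
		return INPUT_LOCAL_PATH; /* handle as a filesystem path */
	return INPUT_SCP_STYLE;          /* host:path short form */
}
```

A caller would pass only INPUT_URL and INPUT_SCP_STYLE inputs on to
a parser such as url_parse and handle INPUT_LOCAL_PATH itself.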

The duplication with parse_connect_url is intentional.
The two functions have different contracts:

  - parse_connect_url

    Calls die() on an unknown scheme
    and returns NUL-terminated host and path
    strings for the connect path.

  - url_parse

    Returns NULL on failure while populating
    out_info->err, and exposes components
    as offset/length pairs into the normalized
    URL buffer, matching url_normalize.

Reconciling the two is possible, but outside the scope
of this patch set.
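
To illustrate the offset/length contract described for url_parse,
here is a minimal standalone sketch. struct path_span is a
hypothetical reduction of struct url_info to its two path fields,
and copy_span and skip_slash_before_tilde are names invented for
this example. The second helper mirrors the "~" adjustment that
url_parse applies to git:// and ssh:// URLs, with an added length
guard.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical subset of struct url_info: the path component only. */
struct path_span {
	size_t path_off; /* offset of the path in the normalized URL */
	size_t path_len; /* length of the path, without a NUL */
};

/* Copy buf[off .. off+len) into a new NUL-terminated string. */
static char *copy_span(const char *buf, size_t off, size_t len)
{
	char *out;

	assert(buf != NULL && off + len <= strlen(buf));
	out = malloc(len + 1);
	if (!out)
		return NULL;
	memcpy(out, buf + off, len);
	out[len] = '\0';
	return out;
}

/*
 * For a path such as "/~user/repo", move the span past the leading
 * '/' so that it starts at the '~'.
 */
static void skip_slash_before_tilde(const char *buf, struct path_span *span)
{
	if (span->path_len > 1 && buf[span->path_off + 1] == '~') {
		span->path_off++;
		span->path_len--;
	}
}
```

With the buffer "ssh://host/~user/repo" and a span of offset 10 and
length 11, the adjusted span has offset 11 and length 10, and
copy_span returns "~user/repo". The caller frees the copy, just as
check_parsed_path in the unit tests frees its xstrndup result.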

Signed-off-by: Matheus Afonso Martins Moreira <matheus@matheusmoreira.com>
---
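Note, not part of the commit: the scp-style rewriting done in
url_parse can be tried outside of Git with the standalone sketch
below. It is only an approximation under stated assumptions:
scp_to_ssh_url and remove_char are names invented for this note, a
caller-provided buffer replaces strbuf, and unlike the patch the
sketch returns -1 when no ':' separator is found instead of handing
the string to url_normalize.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Remove the single character at p from a NUL-terminated string. */
static void remove_char(char *p)
{
	memmove(p, p + 1, strlen(p + 1) + 1);
}

/*
 * Rewrite "[user@]host:path" into "ssh://[user@]host/path".
 * Brackets around a plain host or host:port are stripped; brackets
 * around an IPv6 literal (two or more inner colons) are kept.
 * Returns 0 on success, -1 on overflow or a missing separator.
 */
static int scp_to_ssh_url(const char *scp, char *out, size_t out_size)
{
	char *host_start, *user_at, *separator;
	int n;

	assert(scp != NULL && out != NULL);
	n = snprintf(out, out_size, "ssh://%s", scp);
	if (n < 0 || (size_t)n >= out_size)
		return -1;

	host_start = out + strlen("ssh://");
	user_at = strchr(host_start, '@');
	if (user_at)
		host_start = user_at + 1;

	if (*host_start == '[') {
		char *bracket_end = strchr(host_start, ']');
		int inner_colons = 0;
		char *p;

		for (p = host_start + 1; bracket_end && p < bracket_end; p++)
			if (*p == ':')
				inner_colons++;

		if (bracket_end && inner_colons <= 1) {
			/* Drop ']' then '['; the text shifts left twice. */
			remove_char(bracket_end);
			remove_char(host_start);
			separator = bracket_end - 1;
		} else if (bracket_end) {
			separator = strchr(bracket_end + 1, ':');
		} else {
			separator = strchr(host_start, ':');
		}
	} else {
		separator = strchr(host_start, ':');
	}

	/* Stricter than the patch: insist on a real ':' separator. */
	if (!separator || *separator != ':')
		return -1;
	if (separator[1] == '/')
		remove_char(separator); /* "host:/abs" becomes "host/abs" */
	else
		*separator = '/';       /* "host:rel" becomes "host/rel" */
	return 0;
}
```

With this sketch, "[host:123]:src" becomes "ssh://host:123/src",
"[::1]:repo" becomes "ssh://[::1]/repo" and "host:~user/repo"
becomes "ssh://host/~user/repo".
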
 t/unit-tests/u-urlmatch-normalization.c |  45 +++++++++
 urlmatch.c                              | 127 ++++++++++++++++++++++++
 urlmatch.h                              |   1 +
 3 files changed, 173 insertions(+)

diff --git a/t/unit-tests/u-urlmatch-normalization.c b/t/unit-tests/u-urlmatch-normalization.c
index 39f6e1ba26..3595d893a2 100644
--- a/t/unit-tests/u-urlmatch-normalization.c
+++ b/t/unit-tests/u-urlmatch-normalization.c
@@ -245,3 +245,48 @@ void test_urlmatch_normalization__equivalents(void)
 	compare_normalized_urls("https://@x.y/^/../abc", "httpS://@x.y:0443/abc", 1);
 	compare_normalized_urls("https://@x.y/^/..", "httpS://@x.y:0443/", 1);
 }
+
+static void check_parsed_path(const char *url, const char *expected_path)
+{
+	struct url_info info;
+	char *parsed = url_parse(url, &info);
+	char *path;
+
+	cl_assert(parsed != NULL);
+	path = xstrndup(parsed + info.path_off, info.path_len);
+	cl_assert_equal_s(path, expected_path);
+	free(path);
+	free(parsed);
+}
+
+void test_urlmatch_normalization__parse_scp(void)
+{
+	check_parsed_path("host:path", "/path");
+	check_parsed_path("user@host:path", "/path");
+	check_parsed_path("host:~user/repo", "~user/repo");
+	check_parsed_path("user@host:~user/repo", "~user/repo");
+	check_parsed_path("[host]:src", "/src");
+	check_parsed_path("[host:123]:src", "/src");
+	check_parsed_path("[::1]:repo", "/repo");
+	check_parsed_path("user@[::1]:repo", "/repo");
+}
+
+void test_urlmatch_normalization__parse_url_form(void)
+{
+	check_parsed_path("ssh://host/repo", "/repo");
+	check_parsed_path("ssh://host/~user/repo", "~user/repo");
+	check_parsed_path("git://host:9418/repo", "/repo");
+	check_parsed_path("git://host/~user/repo", "~user/repo");
+	check_parsed_path("ssh://[::1]:1234/repo", "/repo");
+	check_parsed_path("http://[2001:db8::1]/repo", "/repo");
+}
+
+void test_urlmatch_normalization__parse_strips_query_and_fragment(void)
+{
+	check_parsed_path("ssh://host/~user/repo?q", "~user/repo");
+	check_parsed_path("ssh://host/~user/repo#frag", "~user/repo");
+	check_parsed_path("git://host/~user/repo?q", "~user/repo");
+	check_parsed_path("user@host:~user/repo?q", "~user/repo");
+	check_parsed_path("https://host/repo?q", "/repo");
+	check_parsed_path("https://host/repo#frag", "/repo");
+}
diff --git a/urlmatch.c b/urlmatch.c
index eea8300489..bf8cce6de9 100644
--- a/urlmatch.c
+++ b/urlmatch.c
@@ -5,6 +5,7 @@
 #include "hex-ll.h"
 #include "strbuf.h"
 #include "urlmatch.h"
+#include "url.h"
 
 #define URL_ALPHA "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
 #define URL_DIGIT "0123456789"
@@ -440,6 +441,132 @@ char *url_normalize(const char *url, struct url_info *out_info)
 	return url_normalize_1(url, out_info, 0);
 }
 
+char *url_parse(const char *url_orig, struct url_info *out_info)
+{
+	struct strbuf url;
+	char *host, *separator;
+	char *detached, *normalized;
+	char *url_decoded;
+	enum url_scheme scheme = URL_SCHEME_LOCAL;
+	struct url_info local_info;
+	struct url_info *info = out_info ? out_info : &local_info;
+	bool scp_syntax = false;
+
+	if (is_url(url_orig))
+		url_decoded = url_decode(url_orig);
+	else
+		url_decoded = xstrdup(url_orig);
+
+	strbuf_init(&url, strlen(url_decoded) + sizeof("ssh://"));
+	strbuf_addstr(&url, url_decoded);
+	free(url_decoded);
+
+	host = strstr(url.buf, "://");
+	if (host) {
+		/*
+		 * Temporarily NUL-terminate the scheme name
+		 * so we can pass it to url_get_scheme(),
+		 * then restore the ':' so the buffer
+		 * is intact for url_normalize() below.
+		 */
+		char saved = *host;
+		*host = '\0';
+		scheme = url_get_scheme(url.buf);
+		*host = saved;
+		host += 3;
+	} else {
+		if (!url_is_local_not_ssh(url.buf)) {
+			scp_syntax = true;
+			scheme = URL_SCHEME_SSH;
+			strbuf_insertstr(&url, 0, "ssh://");
+			host = url.buf + strlen("ssh://");
+		}
+	}
+
+	/*
+	 * Path starts after ':' in scp style SSH URLs.
+	 *
+	 * The host portion can begin with an optional "user@",
+	 * and the host itself can be wrapped in '[' ']' brackets.
+	 * The bracket form is git's legacy way of supporting:
+	 *
+	 *   - IPv6 literals: [::1]:repo
+	 *   - host:port pairs in the short form: [myhost:123]:src
+	 *   - Plain hostnames that happen to need bracketing: [host]:path
+	 *
+	 * Treat '[' followed by 0 or 1 inner colons as the host:port
+	 * or plain hostname form and strip the brackets so url_normalize
+	 * sees host[:port] natively. Two or more inner colons mark an
+	 * IPv6 literal: keep the brackets for url_normalize to recognize.
+	 *
+	 * The scp path separator is the ':' that follows the host part,
+	 * and we must skip over user@ and any '[...]' before searching.
+	 */
+	if (scp_syntax) {
+		char *user_at;
+		char *host_start;
+		char *bracket_end;
+
+		user_at = strchr(host, '@');
+		host_start = user_at ? user_at + 1 : host;
+
+		if (*host_start == '[') {
+			char *p;
+			int inner_colons;
+
+			bracket_end = strchr(host_start, ']');
+			inner_colons = 0;
+			for (p = host_start + 1; bracket_end && p < bracket_end; p++)
+				if (*p == ':')
+					inner_colons++;
+
+			if (bracket_end && inner_colons <= 1) {
+				size_t close_off = bracket_end - url.buf;
+				size_t open_off = host_start - url.buf;
+				strbuf_remove(&url, close_off, 1);
+				strbuf_remove(&url, open_off, 1);
+				separator = url.buf + close_off - 1;
+			} else if (bracket_end) {
+				separator = strchr(bracket_end + 1, ':');
+			} else {
+				separator = strchr(host_start, ':');
+			}
+		} else {
+			separator = strchr(host_start, ':');
+		}
+
+		if (separator) {
+			if (separator[1] == '/')
+				strbuf_remove(&url, separator - url.buf, 1);
+			else
+				*separator = '/';
+		}
+	}
+
+	detached = strbuf_detach(&url, NULL);
+	normalized = url_normalize(detached, info);
+	free(detached);
+
+	if (!normalized)
+		return NULL;
+
+	/*
+	 * Point path to ~ for URLs like this:
+	 *
+	 *     ssh://host.xz/~user/repo
+	 *     git://host.xz/~user/repo
+	 *     host.xz:~user/repo
+	 */
+	if (scheme == URL_SCHEME_GIT || scheme == URL_SCHEME_SSH) {
+		if (normalized[info->path_off + 1] == '~') {
+			info->path_off++;
+			info->path_len--;
+		}
+	}
+
+	return normalized;
+}
+
 static size_t url_match_prefix(const char *url,
 			       const char *url_prefix,
 			       size_t url_prefix_len)
diff --git a/urlmatch.h b/urlmatch.h
index 5ba85cea13..6b3ce42858 100644
--- a/urlmatch.h
+++ b/urlmatch.h
@@ -35,6 +35,7 @@ struct url_info {
 };
 
 char *url_normalize(const char *, struct url_info *);
+char *url_parse(const char *, struct url_info *);
 
 struct urlmatch_item {
 	size_t hostmatch_len;
-- 
gitgitgadget

