From: "Matheus Afonso Martins Moreira via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: "Torsten Bögershausen" <tboegi@web.de>,
"Ghanshyam Thakkar" <shyamthakkar001@gmail.com>,
"Matheus Moreira" <matheus@matheusmoreira.com>,
"Matheus Afonso Martins Moreira" <matheus@matheusmoreira.com>
Subject: [PATCH v2 5/8] urlmatch: define url_parse function
Date: Fri, 01 May 2026 23:15:07 +0000
Message-ID: <89932a70f3ace6ff1198628873df702f40f1442a.1777677310.git.gitgitgadget@gmail.com>
In-Reply-To: <pull.1715.v2.git.git.1777677310.gitgitgadget@gmail.com>
From: Matheus Afonso Martins Moreira <matheus@matheusmoreira.com>
Define url_parse, a general parsing function that supports all Git URLs
including scp style URLs such as hostname:~user/repo.
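To illustrate what supporting scp style URLs involves, here is a minimal
standalone sketch of the basic rewrite from "[user@]host:path" to
"ssh://[user@]host/path". It is not the code in this patch: the name
sketch_scp_to_ssh is hypothetical, it uses plain libc instead of strbuf,
and it deliberately leaves out the bracketed host forms that url_parse
handles.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Rewrite "[user@]host:path" as "ssh://[user@]host/path".
 * Returns a malloc'd string, or NULL if no ':' separator is found
 * or allocation fails. Bracketed hosts are not handled here.
 */
static char *sketch_scp_to_ssh(const char *scp)
{
	const char *at, *host_start, *sep, *path;
	size_t host_len, size;
	char *out;

	assert(scp != NULL);
	at = strchr(scp, '@');
	host_start = at ? at + 1 : scp;
	sep = strchr(host_start, ':');
	if (!sep)
		return NULL;

	host_len = (size_t)(sep - scp);	/* includes any "user@" */
	path = sep + 1;
	if (*path == '/')
		path++;			/* "host:/abs" keeps a single '/' */

	size = strlen("ssh://") + host_len + 1 + strlen(path) + 1;
	out = malloc(size);
	if (!out)
		return NULL;
	snprintf(out, size, "ssh://%.*s/%s", (int)host_len, scp, path);
	return out;
}
```

With this sketch, "host.xz:~user/repo" becomes "ssh://host.xz/~user/repo",
which is a form url_normalize can accept.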
It is adapted from the algorithm in connect.c's parse_connect_url
and reuses the shared enum url_scheme and url_get_scheme function
that previous commits made available in url.h. The new parser and
the connect path agree on scheme classification. url_parse has the
same interface as url_normalize and uses the same data structures.
url_parse and parse_connect_url accept the same URL forms, with one
deliberate exception: bare local paths such as "/abs/path", "./rel"
or "repo" are accepted by parse_connect_url as URL_SCHEME_LOCAL,
but rejected by url_parse because url_normalize requires a URL
of the form scheme://host. A consumer that wants to handle both
URLs and local paths needs to dispatch on url_is_local_not_ssh
before calling url_parse, just as the connect path does internally.
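The dispatch described above could look roughly like the following
standalone sketch. The sketch_ names are hypothetical, and
sketch_is_local_not_ssh is a simplified stand-in for Git's
url_is_local_not_ssh that ignores DOS drive prefixes, so treat it as an
approximation rather than the real helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Simplified stand-in for url_is_local_not_ssh(): a string is
 * treated as a local path when it has no ':' or when a '/' appears
 * before the first ':'. DOS drive prefixes are ignored here.
 */
static bool sketch_is_local_not_ssh(const char *url)
{
	const char *colon, *slash;

	assert(url != NULL);
	colon = strchr(url, ':');
	slash = strchr(url, '/');
	return !colon || (slash && slash < colon);
}

enum sketch_kind { SKETCH_LOCAL_PATH, SKETCH_URL };

/* Decide which code path a caller would take before parsing. */
static enum sketch_kind sketch_classify(const char *arg)
{
	if (strstr(arg, "://"))
		return SKETCH_URL;		/* scheme://host form */
	if (sketch_is_local_not_ssh(arg))
		return SKETCH_LOCAL_PATH;	/* handle as a filesystem path */
	return SKETCH_URL;			/* scp style, e.g. host:path */
}
```

A consumer would call url_parse only for arguments classified as
SKETCH_URL and handle SKETCH_LOCAL_PATH arguments as filesystem paths.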
The duplication with parse_connect_url is intentional.
The two functions have different contracts:
- parse_connect_url
    Calls die() on an unknown scheme and returns NUL-terminated
    host/path strings for the connect path.
- url_parse
    Returns NULL on failure while populating out_info->err, and
    exposes components as offset/length pairs into the normalized
    URL buffer, matching url_normalize.
Reconciling the two is possible, but outside the scope
of the current patch set.
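As an illustration of the offset/length contract, the standalone sketch
below copies a component out of a URL buffer and applies the same '~'
adjustment this patch makes for ssh and git URLs. struct sketch_url_info
is a hypothetical two-field subset of struct url_info, and the helper
names are assumptions made for the example only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical subset of struct url_info: only the path fields. */
struct sketch_url_info {
	size_t path_off;	/* offset of the path within the URL buffer */
	size_t path_len;	/* length of the path in bytes */
};

/* Copy one component out of the buffer; caller frees the result. */
static char *sketch_extract(const char *buf, size_t off, size_t len)
{
	char *out;

	assert(buf != NULL && off + len <= strlen(buf));
	out = malloc(len + 1);
	if (!out)
		return NULL;
	memcpy(out, buf + off, len);
	out[len] = '\0';
	return out;
}

/* Make "/~user/repo" read as "~user/repo" by skipping the '/'. */
static void sketch_point_at_tilde(const char *buf,
				  struct sketch_url_info *info)
{
	if (info->path_len > 1 && buf[info->path_off + 1] == '~') {
		info->path_off++;
		info->path_len--;
	}
}
```

Because components are described by offsets into a single buffer, the
caller frees only the returned URL string and copies out a component
only when it needs a NUL-terminated string.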
Signed-off-by: Matheus Afonso Martins Moreira <matheus@matheusmoreira.com>
---
t/unit-tests/u-urlmatch-normalization.c | 45 +++++++++
urlmatch.c | 127 ++++++++++++++++++++++++
urlmatch.h | 1 +
3 files changed, 173 insertions(+)
diff --git a/t/unit-tests/u-urlmatch-normalization.c b/t/unit-tests/u-urlmatch-normalization.c
index 39f6e1ba26..3595d893a2 100644
--- a/t/unit-tests/u-urlmatch-normalization.c
+++ b/t/unit-tests/u-urlmatch-normalization.c
@@ -245,3 +245,48 @@ void test_urlmatch_normalization__equivalents(void)
compare_normalized_urls("https://@x.y/^/../abc", "httpS://@x.y:0443/abc", 1);
compare_normalized_urls("https://@x.y/^/..", "httpS://@x.y:0443/", 1);
}
+
+static void check_parsed_path(const char *url, const char *expected_path)
+{
+ struct url_info info;
+ char *parsed = url_parse(url, &info);
+ char *path;
+
+ cl_assert(parsed != NULL);
+ path = xstrndup(parsed + info.path_off, info.path_len);
+ cl_assert_equal_s(path, expected_path);
+ free(path);
+ free(parsed);
+}
+
+void test_urlmatch_normalization__parse_scp(void)
+{
+ check_parsed_path("host:path", "/path");
+ check_parsed_path("user@host:path", "/path");
+ check_parsed_path("host:~user/repo", "~user/repo");
+ check_parsed_path("user@host:~user/repo", "~user/repo");
+ check_parsed_path("[host]:src", "/src");
+ check_parsed_path("[host:123]:src", "/src");
+ check_parsed_path("[::1]:repo", "/repo");
+ check_parsed_path("user@[::1]:repo", "/repo");
+}
+
+void test_urlmatch_normalization__parse_url_form(void)
+{
+ check_parsed_path("ssh://host/repo", "/repo");
+ check_parsed_path("ssh://host/~user/repo", "~user/repo");
+ check_parsed_path("git://host:9418/repo", "/repo");
+ check_parsed_path("git://host/~user/repo", "~user/repo");
+ check_parsed_path("ssh://[::1]:1234/repo", "/repo");
+ check_parsed_path("http://[2001:db8::1]/repo", "/repo");
+}
+
+void test_urlmatch_normalization__parse_strips_query_and_fragment(void)
+{
+ check_parsed_path("ssh://host/~user/repo?q", "~user/repo");
+ check_parsed_path("ssh://host/~user/repo#frag", "~user/repo");
+ check_parsed_path("git://host/~user/repo?q", "~user/repo");
+ check_parsed_path("user@host:~user/repo?q", "~user/repo");
+ check_parsed_path("https://host/repo?q", "/repo");
+ check_parsed_path("https://host/repo#frag", "/repo");
+}
diff --git a/urlmatch.c b/urlmatch.c
index eea8300489..bf8cce6de9 100644
--- a/urlmatch.c
+++ b/urlmatch.c
@@ -5,6 +5,7 @@
#include "hex-ll.h"
#include "strbuf.h"
#include "urlmatch.h"
+#include "url.h"
#define URL_ALPHA "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
#define URL_DIGIT "0123456789"
@@ -440,6 +441,132 @@ char *url_normalize(const char *url, struct url_info *out_info)
return url_normalize_1(url, out_info, 0);
}
+char *url_parse(const char *url_orig, struct url_info *out_info)
+{
+ struct strbuf url;
+ char *host, *separator;
+ char *detached, *normalized;
+ char *url_decoded;
+ enum url_scheme scheme = URL_SCHEME_LOCAL;
+ struct url_info local_info;
+ struct url_info *info = out_info ? out_info : &local_info;
+ bool scp_syntax = false;
+
+ if (is_url(url_orig))
+ url_decoded = url_decode(url_orig);
+ else
+ url_decoded = xstrdup(url_orig);
+
+ strbuf_init(&url, strlen(url_decoded) + sizeof("ssh://"));
+ strbuf_addstr(&url, url_decoded);
+ free(url_decoded);
+
+ host = strstr(url.buf, "://");
+ if (host) {
+ /*
+ * Temporarily NUL-terminate the scheme name
+ * so we can pass it to url_get_scheme(),
+ * then restore the ':' so the buffer
+ * is intact for url_normalize() below.
+ */
+ char saved = *host;
+ *host = '\0';
+ scheme = url_get_scheme(url.buf);
+ *host = saved;
+ host += 3;
+ } else {
+ if (!url_is_local_not_ssh(url.buf)) {
+ scp_syntax = true;
+ scheme = URL_SCHEME_SSH;
+ strbuf_insertstr(&url, 0, "ssh://");
+ host = url.buf + strlen("ssh://");
+ }
+ }
+
+ /*
+ * Path starts after ':' in scp style SSH URLs.
+ *
+ * The host portion can begin with an optional "user@",
+ * and the host itself can be wrapped in '[' ']' brackets.
+ * The bracket form is git's legacy way of supporting:
+ *
+ * - IPv6 literals: [::1]:repo
+ * - host:port pairs in the short form: [myhost:123]:src
+ * - Plain hostnames that happen to need bracketing: [host]:path
+ *
+ * Treat '[' followed by 0 or 1 inner colons as the host:port
+ * or plain hostname form and strip the brackets so url_normalize
+ * sees host[:port] natively. Two or more inner colons mark an
+ * IPv6 literal: keep the brackets for url_normalize to recognize.
+ *
+ * The scp path separator is the ':' that follows the host part,
+ * and we must skip over user@ and any '[...]' before searching.
+ */
+ if (scp_syntax) {
+ char *user_at;
+ char *host_start;
+ char *bracket_end;
+
+ user_at = strchr(host, '@');
+ host_start = user_at ? user_at + 1 : host;
+
+ if (*host_start == '[') {
+ char *p;
+ int inner_colons;
+
+ bracket_end = strchr(host_start, ']');
+ inner_colons = 0;
+ for (p = host_start + 1; bracket_end && p < bracket_end; p++)
+ if (*p == ':')
+ inner_colons++;
+
+ if (bracket_end && inner_colons <= 1) {
+ size_t close_off = bracket_end - url.buf;
+ size_t open_off = host_start - url.buf;
+ strbuf_remove(&url, close_off, 1);
+ strbuf_remove(&url, open_off, 1);
+ separator = url.buf + close_off - 1;
+ } else if (bracket_end) {
+ separator = strchr(bracket_end + 1, ':');
+ } else {
+ separator = strchr(host_start, ':');
+ }
+ } else {
+ separator = strchr(host_start, ':');
+ }
+
+ if (separator) {
+ if (separator[1] == '/')
+ strbuf_remove(&url, separator - url.buf, 1);
+ else
+ *separator = '/';
+ }
+ }
+
+ detached = strbuf_detach(&url, NULL);
+ normalized = url_normalize(detached, info);
+ free(detached);
+
+ if (!normalized)
+ return NULL;
+
+ /*
+ * Point path to ~ for URLs like this:
+ *
+ * ssh://host.xz/~user/repo
+ * git://host.xz/~user/repo
+ * host.xz:~user/repo
+ */
+ if (scheme == URL_SCHEME_GIT || scheme == URL_SCHEME_SSH) {
+ if (normalized[info->path_off + 1] == '~') {
+ info->path_off++;
+ info->path_len--;
+ }
+ }
+
+ return normalized;
+}
+
static size_t url_match_prefix(const char *url,
const char *url_prefix,
size_t url_prefix_len)
diff --git a/urlmatch.h b/urlmatch.h
index 5ba85cea13..6b3ce42858 100644
--- a/urlmatch.h
+++ b/urlmatch.h
@@ -35,6 +35,7 @@ struct url_info {
};
char *url_normalize(const char *, struct url_info *);
+char *url_parse(const char *, struct url_info *);
struct urlmatch_item {
size_t hostmatch_len;
--
gitgitgadget