[PATCH 0/1] Filter C and POSIX out of Accept-Language

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/1] Filter C and POSIX out of Accept-Language
@ 2025-07-10 22:16 brian m. carlson
  2025-07-10 22:16 ` [PATCH 1/1] http: don't send C or POSIX in Accept-Language brian m. carlson
  2025-07-10 22:45 ` [PATCH 0/1] Filter C and POSIX out of Accept-Language Junio C Hamano
  0 siblings, 2 replies; 18+ messages in thread
From: brian m. carlson @ 2025-07-10 22:16 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Taylor Blau, Yi EungJun

At work, I've seen some cases where people provide "C" in the
Accept-Language header of their Git requests, such as when they provide
us with debugging traces, but "C" and "POSIX", while valid locales, are
not valid languages and do not belong in the Accept-Language header.

It turns out this is actually very easy to reproduce and fix, so there's
a patch to filter these out.  I have not actually myself seen "POSIX" in
the header, but it's equivalent to "C" and I've seen it in non-Git
requests in various places online, so we reject that as well.

This can be seen in GitLab's issues as well at
https://gitlab.com/gitlab-org/gitlab/-/issues/412077.

brian m. carlson (1):
  http: don't send C or POSIX in Accept-Language

 http.c                     |  8 ++++++++
 t/t5541-http-push-smart.sh | 18 ++++++++++++++++++
 2 files changed, 26 insertions(+)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/1] http: don't send C or POSIX in Accept-Language
  2025-07-10 22:16 [PATCH 0/1] Filter C and POSIX out of Accept-Language brian m. carlson
@ 2025-07-10 22:16 ` brian m. carlson
  2025-07-10 22:47   ` Junio C Hamano
                     ` (2 more replies)
  2025-07-10 22:45 ` [PATCH 0/1] Filter C and POSIX out of Accept-Language Junio C Hamano
  1 sibling, 3 replies; 18+ messages in thread
From: brian m. carlson @ 2025-07-10 22:16 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Taylor Blau, Yi EungJun

The LANGUAGE environment variable is not specified by POSIX, but a
variety of programs using GNU gettext accept it.  The Linux manpages
state that it can contain a colon-separated list of locales.

However, not all locales are valid as languages.  The C and POSIX
locales, for instance, are not languages and are not registered with
IANA, nor are they a part of ISO 639.  In fact, "C" is too short to
match the ABNF production for a language, which must be at least two
characters in length.

Nonetheless, many users provide these values in the LANGUAGE environment
variable for unknown reasons and if they do, we do not want to send a
malformed Accept-Language header to the server.  If there are no other
valid language tags, then send no header; otherwise, send only the valid
tags, ignoring "C" and "POSIX" wherever they may appear, as well as any
variants (such as the "C.UTF-8" locale found on some Linux systems).

We do not reject all possible invalid language tags since doing so
would require bundling a copy of the IANA database and would risk poor
behavior in the face of uncommon languages or values that are not
registered but meet the production for private use or other restricted
interchange.  However, these two values are widely used in the LANGUAGE
header, are well-known and widely used non-language locales, and have
been seen in the wild on the server side.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 http.c                     |  8 ++++++++
 t/t5541-http-push-smart.sh | 18 ++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/http.c b/http.c
index d88e79fbde..a96df4fcdb 100644
--- a/http.c
+++ b/http.c
@@ -2022,6 +2022,14 @@ static void write_accept_language(struct strbuf *buf)
 			s++;

 		if (tag.len) {
+			/*
+			 * These are not valid languages: do not send them to
+			 * the server.
+			 */
+			if (!strcmp(tag.buf, "C") || !strcmp(tag.buf, "POSIX")) {
+				strbuf_reset(&tag);
+				continue;
+			}
 			num_langs++;
 			REALLOC_ARRAY(language_tags, num_langs);
 			language_tags[num_langs - 1] = strbuf_detach(&tag, NULL);
diff --git a/t/t5541-http-push-smart.sh b/t/t5541-http-push-smart.sh
index 538b603f03..96a6833e67 100755
--- a/t/t5541-http-push-smart.sh
+++ b/t/t5541-http-push-smart.sh
@@ -86,6 +86,24 @@ test_expect_success 'push to remote repository (standard) with sending Accept-La
 	GIT_TRACE_CURL=true LANGUAGE="ko_KR.UTF-8" git push -v -v 2>err &&
 	! grep "Expect: 100-continue" err &&

+	grep "=> Send header: Accept-Language:" err >err.language &&
+	test_cmp exp err.language &&
+
+	test_commit C-is-not-a-language &&
+	GIT_TRACE_CURL=true LANGUAGE="C" git push -v -v 2>err &&
+
+	! grep "=> Send header: Accept-Language:" err >err.language &&
+	test_must_be_empty err.language &&
+
+	test_commit POSIX-is-not-a-language-either &&
+	GIT_TRACE_CURL=true LANGUAGE="POSIX" git push -v -v 2>err &&
+
+	! grep "=> Send header: Accept-Language:" err >err.language &&
+	test_must_be_empty err.language &&
+
+	test_commit ignore-C-and-POSIX-as-languages-wherever-provided &&
+	GIT_TRACE_CURL=true LANGUAGE="C.UTF-8:ko_KR.UTF-8:POSIX" git push -v -v 2>err &&
+
 	grep "=> Send header: Accept-Language:" err >err.language &&
 	test_cmp exp err.language
 '

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
  2025-07-10 22:16 ` [PATCH 1/1] http: don't send C or POSIX in Accept-Language brian m. carlson
@ 2025-07-10 22:47   ` Junio C Hamano
  2025-07-10 23:08     ` brian m. carlson
  2025-07-11 15:23   ` Justin Tobler
  2025-07-15  4:38   ` Eli Schwartz
  2 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2025-07-10 22:47 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Taylor Blau, Yi EungJun

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> ...  However, these two values are widely used in the LANGUAGE
> header, are well-known and widely used non-language locales, and have
> been seen in the wild on the server side.

"header" -> "environment variable" I presume?  


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
  2025-07-10 22:47   ` Junio C Hamano
@ 2025-07-10 23:08     ` brian m. carlson
  0 siblings, 0 replies; 18+ messages in thread
From: brian m. carlson @ 2025-07-10 23:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Taylor Blau, Yi EungJun

[-- Attachment #1: Type: text/plain, Size: 501 bytes --]

On 2025-07-10 at 22:47:56, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> > ...  However, these two values are widely used in the LANGUAGE
> > header, are well-known and widely used non-language locales, and have
> > been seen in the wild on the server side.
> 
> "header" -> "environment variable" I presume?  

Ah, yes.  I'll fix that for a v2 in a few days while I wait for other
comments.
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
  2025-07-10 22:16 ` [PATCH 1/1] http: don't send C or POSIX in Accept-Language brian m. carlson
  2025-07-10 22:47   ` Junio C Hamano
@ 2025-07-11 15:23   ` Justin Tobler
  2025-07-11 17:02     ` Collin Funk
  2025-07-11 18:32     ` Junio C Hamano
  2025-07-15  4:38   ` Eli Schwartz
  2 siblings, 2 replies; 18+ messages in thread
From: Justin Tobler @ 2025-07-11 15:23 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Junio C Hamano, Taylor Blau, Yi EungJun

On 25/07/10 10:16PM, brian m. carlson wrote:
> The LANGUAGE environment variable is not specified by POSIX, but a
> variety of programs using GNU gettext accept it.  The Linux manpages
> state that it can contain a colon-separated list of locales.
> 
> However, not all locales are valid as languages.  The C and POSIX
> locales, for instance, are not languages and are not registered with
> IANA, nor are they a part of ISO 639.  In fact, "C" is too short to
> match the ABNF production for a language, which must be at least two
> characters in length.
> 
> Nonetheless, many users provide these values in the LANGUAGE environment
> variable for unknown reasons and if they do, we do not want to send a
> malformed Accept-Language header to the server.  If there are no other
> valid language tags, then send no header; otherwise, send only the valid
> tags, ignoring "C" and "POSIX" wherever they may appear, as well as any
> variants (such as the "C.UTF-8" locale found on some Linux systems).

Ok so the languages returned by `get_preferred_languages()` are used to
write the Accept-Language header when making requests.

Looking at `get_preferred_languages()` when NO_GETTEXT is defined, we
already filter out "C" and "POSIX". So doing this for the LANGUAGE
environment variable when writing the header also makes sense.

> We do not reject all possible invalid language tags since doing so
> would require bundling a copy of the IANA database and would risk poor
> behavior in the face of uncommon languages or values that are not
> registered but meet the production for private use or other restricted
> interchange.  However, these two values are widely used in the LANGUAGE
> header, are well-known and widely used non-language locales, and have
> been seen in the wild on the server side.
> 
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
>  http.c                     |  8 ++++++++
>  t/t5541-http-push-smart.sh | 18 ++++++++++++++++++
>  2 files changed, 26 insertions(+)
> 
> diff --git a/http.c b/http.c
> index d88e79fbde..a96df4fcdb 100644
> --- a/http.c
> +++ b/http.c
> @@ -2022,6 +2022,14 @@ static void write_accept_language(struct strbuf *buf)
>  			s++;
>  
>  		if (tag.len) {
> +			/*
> +			 * These are not valid languages: do not send them to
> +			 * the server.
> +			 */
> +			if (!strcmp(tag.buf, "C") || !strcmp(tag.buf, "POSIX")) {
> +				strbuf_reset(&tag);
> +				continue;
> +			}

From my understanding, each language is expected to be defined in the
following form:

  language[_territory][.codeset][@modifier]

When we parse the list of languages we only care about the
`language[_territory]` part though.

From looking at ISO 639 language codes, only codes with two or three
characters are valid. If we wanted to be a bit more strict, we could
check the length of the language code (everything before the first '_')
and filter out anything outside of those limits. This would naturally
filter out "C" and "POSIX" without having to mention them explicitly.

Not sure if being more strict adds much more value here in practice
though. So it may be fine to keep it as-is. :)

>  			num_langs++;
>  			REALLOC_ARRAY(language_tags, num_langs);
>  			language_tags[num_langs - 1] = strbuf_detach(&tag, NULL);
> diff --git a/t/t5541-http-push-smart.sh b/t/t5541-http-push-smart.sh
> index 538b603f03..96a6833e67 100755
> --- a/t/t5541-http-push-smart.sh
> +++ b/t/t5541-http-push-smart.sh
> @@ -86,6 +86,24 @@ test_expect_success 'push to remote repository (standard) with sending Accept-La
>  	GIT_TRACE_CURL=true LANGUAGE="ko_KR.UTF-8" git push -v -v 2>err &&
>  	! grep "Expect: 100-continue" err &&
>  
> +	grep "=> Send header: Accept-Language:" err >err.language &&
> +	test_cmp exp err.language &&
> +
> +	test_commit C-is-not-a-language &&
> +	GIT_TRACE_CURL=true LANGUAGE="C" git push -v -v 2>err &&
> +
> +	! grep "=> Send header: Accept-Language:" err >err.language &&
> +	test_must_be_empty err.language &&
> +
> +	test_commit POSIX-is-not-a-language-either &&
> +	GIT_TRACE_CURL=true LANGUAGE="POSIX" git push -v -v 2>err &&
> +
> +	! grep "=> Send header: Accept-Language:" err >err.language &&
> +	test_must_be_empty err.language &&

The above two tests demonstrate that the Accept-Language header is not
sent if no valid languages are found.

> +
> +	test_commit ignore-C-and-POSIX-as-languages-wherever-provided &&
> +	GIT_TRACE_CURL=true LANGUAGE="C.UTF-8:ko_KR.UTF-8:POSIX" git push -v -v 2>err &&
> +
>  	grep "=> Send header: Accept-Language:" err >err.language &&
>  	test_cmp exp err.language
>  '

And here we see only the valid languages sent in the header. Looks good!

-Justin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
  2025-07-11 15:23   ` Justin Tobler
@ 2025-07-11 17:02     ` Collin Funk
  2025-07-11 20:57       ` Carlo Marcelo Arenas Belón
  2025-07-11 18:32     ` Junio C Hamano
  1 sibling, 1 reply; 18+ messages in thread
From: Collin Funk @ 2025-07-11 17:02 UTC (permalink / raw)
  To: Justin Tobler
  Cc: brian m. carlson, git, Junio C Hamano, Taylor Blau, Yi EungJun

Justin Tobler <jltobler@gmail.com> writes:

> From my understanding, each language is expected to be defined in the
> following form:
>
>   language[_territory][.codeset][@modifier]
>
> When we parse the list of languages we only care about the
> `language[_territory]` part though.
>
> From looking at ISO 639 language codes, only codes with two or three
> characters are valid. If we wanted to be a bit more strict, we could
> check the length of the language code (everything before the first '_')
> and filter out anything outside of those limits. This would naturally
> filter out "C" and "POSIX" without having to mention them explicitly.
>
> Not sure if being more strict adds much more value here in practice
> though. So it may be fine to keep it as-is. :)

Filtering out anything that isn't 2-3 letters seems like a good
heuristic to me.

It seems better than only filtering out "C" and "POSIX" and allowing
anything else. And it keeps us from having to keep a list of updated BCP
47 language tags.

Collin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
  2025-07-11 17:02     ` Collin Funk
@ 2025-07-11 20:57       ` Carlo Marcelo Arenas Belón
  2025-07-11 21:29         ` brian m. carlson
  0 siblings, 1 reply; 18+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2025-07-11 20:57 UTC (permalink / raw)
  To: Collin Funk
  Cc: Justin Tobler, brian m. carlson, git, Junio C Hamano, Taylor Blau,
	Yi EungJun

On Fri, Jul 11, 2025 at 10:02:01AM -0800, Collin Funk wrote:
> Justin Tobler <jltobler@gmail.com> writes:
> 
> > From my understanding, each language is expected to be defined in the
> > following form:
> >
> >   language[_territory][.codeset][@modifier]
> >
> > When we parse the list of languages we only care about the
> > `language[_territory]` part though.
> >
> > From looking at ISO 639 language codes, only codes with two or three
> > characters are valid. If we wanted to be a bit more strict, we could
> > check the length of the language code (everything before the first '_')
> > and filter out anything outside of those limits. This would naturally
> > filter out "C" and "POSIX" without having to mention them explicitly.
> 
> Filtering out anything that isn't 2-3 letters seems like a good
> heuristic to me.

except that it would be incorrect, as language tags are defined in RFC5646
and are larger than that.

most importantly, deriving language tags from locales provides some very
useful tags when including the characters after the _, because zh_CN and
zh_HK use completely different scripts, for example.

Carlo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
  2025-07-11 20:57       ` Carlo Marcelo Arenas Belón
@ 2025-07-11 21:29         ` brian m. carlson
  2025-07-11 22:12           ` Carlo Arenas
  0 siblings, 1 reply; 18+ messages in thread
From: brian m. carlson @ 2025-07-11 21:29 UTC (permalink / raw)
  To: Carlo Marcelo Arenas Belón
  Cc: Collin Funk, Justin Tobler, git, Junio C Hamano, Taylor Blau,
	Yi EungJun

[-- Attachment #1: Type: text/plain, Size: 1596 bytes --]

On 2025-07-11 at 20:57:03, Carlo Marcelo Arenas Belón wrote:
> except that it would be incorrect, as language tags are defined in RFC5646
> and are larger than that.
> 
> most importantly, deriving language tags from locales provides some very
> useful tags when including the characters after the _, because zh_CN and
> zh_HK use completely different scripts, for example.

Yes, that's true.  You have some private use and some irregular tags and
you also have some tags that include scripts or country codes.

For instance, Swahili can be written in Latin or Arabic script.  As I
understand it, the Arabic script form is older and less common these
days, so if I learned Swahili (which I would like to), then I might only
learn the Latin script variant in a course.  I would need to specify
that script in the language code to be sure that I was presented with
content in a form that I could read and understand.  Similar concerns
exist with the variants of Serbo-Croatian: some are written in Latin
scripts, some in Cyrillic, and some in both, and it's not guaranteed
that all speakers understand all forms.

And then there's pt-PT and pt-BR, which are not always mutually
intelligible.  Most free software I've seen ships these as separate
translations.

I don't want to implement language tag parsing here since we don't need
to do that.  I would like to do the simple thing to prevent commonly
used locales that don't represent actual language tags from being
included and not overengineer this design.
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
  2025-07-11 21:29         ` brian m. carlson
@ 2025-07-11 22:12           ` Carlo Arenas
  2025-07-11 22:17             ` Collin Funk
  0 siblings, 1 reply; 18+ messages in thread
From: Carlo Arenas @ 2025-07-11 22:12 UTC (permalink / raw)
  To: brian m. carlson, Carlo Marcelo Arenas Belón, Collin Funk,
	Justin Tobler, git, Junio C Hamano, Taylor Blau, semtlenori

On Fri, Jul 11, 2025 at 2:29 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
>
> On 2025-07-11 at 20:57:03, Carlo Marcelo Arenas Belón wrote:
> > except that it would be incorrect, as language tags are defined in RFC5646
> > and are larger than that.
> >
> > most importantly, deriving language tags from locales provides some very
> > useful tags when including the characters after the _, because zh_CN and
> > zh_HK use completely different scripts, for example.
>
> Yes, that's true.  You have some private use and some irregular tags and
> you also have some tags that include scripts or country codes.
>
> For instance, Swahili can be written in Latin or Arabic script.  As I
> understand it, the Arabic script form is older and less common these
> days, so if I learned Swahili (which I would like to), then I might only
> learn the Latin script variant in a course.  I would need to specify
> that script in the language code to be sure that I was presented with
> content in a form that I could read and understand.  Similar concerns
> exist with the variants of Serbo-Croatian: some are written in Latin
> scripts, some in Cyrillic, and some in both, and it's not guaranteed
> that all speakers understand all forms.
>
> And then there's pt-PT and pt-BR, which are not always mutually
> intelligible.  Most free software I've seen ships these as separate
> translations.
>
> I don't want to implement language tag parsing here since we don't need
> to do that.  I would like to do the simple thing to prevent commonly
> used locales that don't represent actual language tags from being
> included and not overengineer this design

I think that your design of filtering C and POSIX accomplishes that,
even if it might seem like hardcoding those two values is a little dirty.

Moving the logic (including the filtering, which is already happening
for the `!NO_GETTEXT `code path adds several chances to modernize
and cleanup the code though which will be beneficial (ex: using and
strvec or even a hashtable to process the candidates, improve
validation and tests)

Carlo

CC Yi EungJun at a hopefully working email address with link to thread
https://lore.kernel.org/git/20250710221641.857081-1-sandals@crustytoothpaste.net/

.
> --
> brian m. carlson (they/them)
> Toronto, Ontario, CA

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
  2025-07-11 22:12           ` Carlo Arenas
@ 2025-07-11 22:17             ` Collin Funk
  0 siblings, 0 replies; 18+ messages in thread
From: Collin Funk @ 2025-07-11 22:17 UTC (permalink / raw)
  To: Carlo Arenas
  Cc: brian m. carlson, Justin Tobler, git, Junio C Hamano, Taylor Blau,
	semtlenori

Carlo Arenas <carenas@gmail.com> writes:

>> I don't want to implement language tag parsing here since we don't need
>> to do that.  I would like to do the simple thing to prevent commonly
>> used locales that don't represent actual language tags from being
>> included and not overengineer this design
>
> I think that your design of filtering C and POSIX accomplishes that,
> even if it might seem like hardcoding those two values is a little dirty.

Thanks for correcting me regarding language tag length Carlo.

I guess I am fine with filtering out "C" and "POSIX" now. Not perfect,
but I think everyone agrees that we don't want to maintain a database of
language tags just for this.

Collin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
  2025-07-11 15:23   ` Justin Tobler
  2025-07-11 17:02     ` Collin Funk
@ 2025-07-11 18:32     ` Junio C Hamano
  2025-07-11 20:22       ` Carlo Marcelo Arenas Belón
  1 sibling, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2025-07-11 18:32 UTC (permalink / raw)
  To: Justin Tobler; +Cc: brian m. carlson, git, Taylor Blau, Yi EungJun

Justin Tobler <jltobler@gmail.com> writes:

> Looking at `get_preferred_languages()` when NO_GETTEXT is defined, we
> already filter out "C" and "POSIX". So doing this for the LANGUAGE
> environment variable when writing the header also makes sense.

True.  I wonder if it makes sense to do the check in that helper
function, though.  I.e. something like

diff --git c/gettext.c w/gettext.c
index 8d08a61f84..e2e0fe339d 100644
--- c/gettext.c
+++ w/gettext.c
@@ -41,6 +41,16 @@ static const char *locale_charset(void)
 
 static const char *charset;
 
+static const char *filter_out_non_languages(const char *candidate)
+{
+	if (candidate && *candidate &&
+	    strcmp(candidate, "C") &&
+	    strcmp(candidate, "POSIX"))
+		return candidate;
+	else
+		return NULL;
+}
+
 /*
  * Guess the user's preferred languages from the value in LANGUAGE environment
  * variable and LC_MESSAGES locale category if NO_GETTEXT is not defined.
@@ -51,15 +61,13 @@ const char *get_preferred_languages(void)
 {
 	const char *retval;
 
-	retval = getenv("LANGUAGE");
-	if (retval && *retval)
+	retval = filter_out_non_languages(getenv("LANGUAGE"));
+	if (retval)
 		return retval;
 
 #ifndef NO_GETTEXT
-	retval = setlocale(LC_MESSAGES, NULL);
-	if (retval && *retval &&
-		strcmp(retval, "C") &&
-		strcmp(retval, "POSIX"))
+	retval = filter_out_non_languages(setlocale(LC_MESSAGES, NULL));
+	if (retval)
 		return retval;
 #endif
 

In the production code, we should have a comment before that new
helper function that explains why we exclude C and POSIX, if we were
to go that route.

> Not sure if being more strict adds much more value here in practice
> though. So it may be fine to keep it as-is. :)

Yup.  I care more about having a single place that checks using the
same logic, than what that logic exactly is ;-).

Thanks.

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
  2025-07-11 18:32     ` Junio C Hamano
@ 2025-07-11 20:22       ` Carlo Marcelo Arenas Belón
  0 siblings, 0 replies; 18+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2025-07-11 20:22 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Justin Tobler, brian m. carlson, git, Taylor Blau, Yi EungJun

On Fri, Jul 11, 2025 at 11:32:19AM -0800, Junio C Hamano wrote:
> Justin Tobler <jltobler@gmail.com> writes:
> 
> > Looking at `get_preferred_languages()` when NO_GETTEXT is defined, we
> > already filter out "C" and "POSIX". So doing this for the LANGUAGE
> > environment variable when writing the header also makes sense.
> 
> True.  I wonder if it makes sense to do the check in that helper
> function, though.  I.e. something like

Definitely, and might also fix another bug, as IMHO the current logic have
a couple of issues:

* LANGUAGE is not meant to be relevant unless LANG is set to a valid locale
  as per the SPEC[1], allthough for our use case it might be better to still
  do, specially if there are users in the wild setting C and POSIX there.
* it might make more sense to use the union of LANGUAGE and LC_MESSAGES
  instead.

Carlo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
  2025-07-10 22:16 ` [PATCH 1/1] http: don't send C or POSIX in Accept-Language brian m. carlson
  2025-07-10 22:47   ` Junio C Hamano
  2025-07-11 15:23   ` Justin Tobler
@ 2025-07-15  4:38   ` Eli Schwartz
  2 siblings, 0 replies; 18+ messages in thread
From: Eli Schwartz @ 2025-07-15  4:38 UTC (permalink / raw)
  To: brian m. carlson, git
  Cc: Junio C Hamano, Taylor Blau, Carlo Arenas, Justin Tobler,
	semtlenori, Collin Funk

[-- Attachment #1.1: Type: text/plain, Size: 3676 bytes --]

On 7/10/25 6:16 PM, brian m. carlson wrote:
> The LANGUAGE environment variable is not specified by POSIX, but a
> variety of programs using GNU gettext accept it.  The Linux manpages
> state that it can contain a colon-separated list of locales.
> 
> However, not all locales are valid as languages.  The C and POSIX
> locales, for instance, are not languages and are not registered with
> IANA, nor are they a part of ISO 639.  In fact, "C" is too short to
> match the ABNF production for a language, which must be at least two
> characters in length.
> 
> Nonetheless, many users provide these values in the LANGUAGE environment
> variable for unknown reasons and if they do, we do not want to send a
> malformed Accept-Language header to the server.  If there are no other
> valid language tags, then send no header; otherwise, send only the valid
> tags, ignoring "C" and "POSIX" wherever they may appear, as well as any
> variants (such as the "C.UTF-8" locale found on some Linux systems).

Better docs -- the gettext manpages suck:
https://www.gnu.org/software/gettext/manual/html_node/Locale-Names.html
https://www.gnu.org/software/gettext/manual/html_node/The-LANGUAGE-variable.html

At minimum this commit message needs revising. Gettext was adopted into
POSIX 2024 (Issue 8).

Respected by tools of course:
https://pubs.opengroup.org/onlinepubs/9799919799/utilities/gettext.html#tag_20_54_08
https://pubs.opengroup.org/onlinepubs/9799919799/functions/gettext.html

$LANGUAGE docs can be found at

https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap08.html#tag_08_02

"""
The value of LANGUAGE shall be a list of locale names separated by a
<colon> (':') character. If LANGUAGE is set to a non-empty string, each
locale name shall be tried in the specified order and if a messages
object is found, it shall be used for translation. If a locale name has
the format language[_territory][.codeset][@modifier], additional
searches of locale names without .codeset (if present), without
_territory (if present), and without @modifier (if present) may be performed
"""

And, for locale name values,

"""
If the locale value is "C" or "POSIX", the POSIX locale shall be used
and the standard utilities behave in accordance with the rules in 7.2
POSIX Locale for the associated category.

If the locale value begins with a <slash>, it shall be interpreted as
the pathname of a file that was created in the output format used by the
localedef utility; see OUTPUT FILES under localedef. Referencing such a
pathname shall result in that locale being used for the indicated category.

[XSI] [Option Start] If the locale value has the form:

language[_territory][.codeset]

it refers to an implementation-provided locale, where settings of
language, territory, and codeset are implementation-defined.

LC_COLLATE , LC_CTYPE , LC_MESSAGES , LC_MONETARY , LC_NUMERIC , and
LC_TIME are defined to accept an additional field @modifier, which
allows the user to select a specific instance of localization data
within a single category (for example, for selecting the dictionary as
opposed to the character ordering of data). The syntax for these
environment variables is thus defined as:

[language[_territory][.codeset][@modifier]]
"""

Your tests and code are probably broken -- they appear to normalize
nearly none of the standard grammar into valid Accept-Language entries.
Of course, "surely nobody actually does that" (except when they do!) --
but it's a relatively simple grammar structure, simply getting the
"shape" correct seems like a good idea.

-- 
Eli Schwartz

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 236 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/1] Filter C and POSIX out of Accept-Language
  2025-07-10 22:16 [PATCH 0/1] Filter C and POSIX out of Accept-Language brian m. carlson
  2025-07-10 22:16 ` [PATCH 1/1] http: don't send C or POSIX in Accept-Language brian m. carlson
@ 2025-07-10 22:45 ` Junio C Hamano
  2025-07-10 23:08   ` brian m. carlson
  1 sibling, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2025-07-10 22:45 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Taylor Blau, Yi EungJun

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> At work, I've seen some cases where people provide "C" in the
> Accept-Language header of their Git requests, such as when they provide
> us with debugging traces, but "C" and "POSIX", while valid locales, are
> not valid languages and do not belong in the Accept-Language header.
>
> It turns out this is actually very easy to reproduce and fix, so there's
> a patch to filter these out.  I have not actually myself seen "POSIX" in
> the header, but it's equivalent to "C" and I've seen it in non-Git
> requests in various places online, so we reject that as well.
>
> This can be seen in GitLab's issues as well at
> https://gitlab.com/gitlab-org/gitlab/-/issues/412077.

Sorry, I am confused.  Is that Authentication failure in the cited
issue "caused by" the client sending "Accept-Language: C"?

"reproduce and fix" makes it sound like a correct exchange between
such a client and a server is somehow broken (i.e. unable to clone,
unable to authenticate, etc.) if the client sends C (or POSIX) as if
it were a langauge, but is there a breakage there?

I understand and agree with the change in patch 1/1 that it is the
right thing to do (to more strictly adhere to the standard in what
we send out) for hygiene.  I just want to understand if this caused
real problems, or if it is primarily a preemptive clean-up to avoid
non-standard behaviour causing problems in the future.

Thanks.

> brian m. carlson (1):
>   http: don't send C or POSIX in Accept-Language
>
>  http.c                     |  8 ++++++++
>  t/t5541-http-push-smart.sh | 18 ++++++++++++++++++
>  2 files changed, 26 insertions(+)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/1] Filter C and POSIX out of Accept-Language
  2025-07-10 22:45 ` [PATCH 0/1] Filter C and POSIX out of Accept-Language Junio C Hamano
@ 2025-07-10 23:08   ` brian m. carlson
  2025-07-10 23:26     ` Collin Funk
  0 siblings, 1 reply; 18+ messages in thread
From: brian m. carlson @ 2025-07-10 23:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Taylor Blau

[-- Attachment #1: Type: text/plain, Size: 2267 bytes --]

[Dropping Yi EungJun from CC because their email bounced.]

On 2025-07-10 at 22:45:20, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> > At work, I've seen some cases where people provide "C" in the
> > Accept-Language header of their Git requests, such as when they provide
> > us with debugging traces, but "C" and "POSIX", while valid locales, are
> > not valid languages and do not belong in the Accept-Language header.
> >
> > It turns out this is actually very easy to reproduce and fix, so there's
> > a patch to filter these out.  I have not actually myself seen "POSIX" in
> > the header, but it's equivalent to "C" and I've seen it in non-Git
> > requests in various places online, so we reject that as well.
> >
> > This can be seen in GitLab's issues as well at
> > https://gitlab.com/gitlab-org/gitlab/-/issues/412077.
> 
> Sorry, I am confused.  Is that Authentication failure in the cited
> issue "caused by" the client sending "Accept-Language: C"?
> 
> "reproduce and fix" makes it sound like a correct exchange between
> such a client and a server is somehow broken (i.e. unable to clone,
> unable to authenticate, etc.) if the client sends C (or POSIX) as if
> it were a langauge, but is there a breakage there?

No, sorry.  I just meant that the trace in that issue demonstrates the
incorrect Accept-Language header; it's unrelated to the authentication
problem that the issue is about (which I think is a GitLab issue).

> I understand and agree with the change in patch 1/1 that it is the
> right thing to do (to more strictly adhere to the standard in what
> we send out) for hygiene.  I just want to understand if this caused
> real problems, or if it is primarily a preemptive clean-up to avoid
> non-standard behaviour causing problems in the future.

I'm not aware of it causing any practical problems for people, although
I could imagine some cases where it could, in theory, break things.  I
merely noticed this in trace output and thought we should tidy it up.
If users are using the header and expecting a localized response, this
will make it more likely that they get the one they were expecting.
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/1] Filter C and POSIX out of Accept-Language
  2025-07-10 23:08   ` brian m. carlson
@ 2025-07-10 23:26     ` Collin Funk
  2025-07-11  2:49       ` [External] " Han Young
  0 siblings, 1 reply; 18+ messages in thread
From: Collin Funk @ 2025-07-10 23:26 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Junio C Hamano, git, Taylor Blau

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> I'm not aware of it causing any practical problems for people, although
> I could imagine some cases where it could, in theory, break things.  I
> merely noticed this in trace output and thought we should tidy it up.
> If users are using the header and expecting a localized response, this
> will make it more likely that they get the one they were expecting.

I feel like it is a bit strange to only exclude "C" or "POSIX".

I think the correct behavior would be to accept any values, or convert
the current locale to the closest BCP 47 language tag.

But as you mentioned converting them would require a database of all
tags...

Collin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [External] Re: [PATCH 0/1] Filter C and POSIX out of Accept-Language
  2025-07-10 23:26     ` Collin Funk
@ 2025-07-11  2:49       ` Han Young
  2025-07-11 16:46         ` Junio C Hamano
  0 siblings, 1 reply; 18+ messages in thread
From: Han Young @ 2025-07-11  2:49 UTC (permalink / raw)
  To: Collin Funk; +Cc: brian m. carlson, Junio C Hamano, git, Taylor Blau

On Fri, Jul 11, 2025 at 7:26 AM Collin Funk <collin.funk1@gmail.com> wrote:

> I think the correct behavior would be to accept any values, or convert
> the current locale to the closest BCP 47 language tag.

On some Linux systems, not all BCP languages are supported. Not all
Linux distributions generate all the locales, and musl doesn't even support
locales. Converting to the closest BCP 47 language alone does not
ensure the locale is valid. Not to mention the tricky heuristics of language
matching (pt_PT or pt_BR if LANGUAGE is pt?).

> But as you mentioned converting them would require a database of all
> tags...

Hardcoding all the locale names in the code should be fine, I guess?
Though the problem of filtering out locales unsupported by glibc is more
troublesome.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [External] Re: [PATCH 0/1] Filter C and POSIX out of Accept-Language
  2025-07-11  2:49       ` [External] " Han Young
@ 2025-07-11 16:46         ` Junio C Hamano
  0 siblings, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2025-07-11 16:46 UTC (permalink / raw)
  To: Han Young; +Cc: Collin Funk, brian m. carlson, git, Taylor Blau

Han Young <hanyang.tony@bytedance.com> writes:

>> But as you mentioned converting them would require a database of all
>> tags...
>
> Hardcoding all the locale names in the code should be fine, I guess?

That is exactly the "database" we do not want to have to maintain,
so not fine.

> Though the problem of filtering out locales unsupported by glibc is more
> troublesome.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-07-15  4:38 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-10 22:16 [PATCH 0/1] Filter C and POSIX out of Accept-Language brian m. carlson
2025-07-10 22:16 ` [PATCH 1/1] http: don't send C or POSIX in Accept-Language brian m. carlson
2025-07-10 22:47   ` Junio C Hamano
2025-07-10 23:08     ` brian m. carlson
2025-07-11 15:23   ` Justin Tobler
2025-07-11 17:02     ` Collin Funk
2025-07-11 20:57       ` Carlo Marcelo Arenas Belón
2025-07-11 21:29         ` brian m. carlson
2025-07-11 22:12           ` Carlo Arenas
2025-07-11 22:17             ` Collin Funk
2025-07-11 18:32     ` Junio C Hamano
2025-07-11 20:22       ` Carlo Marcelo Arenas Belón
2025-07-15  4:38   ` Eli Schwartz
2025-07-10 22:45 ` [PATCH 0/1] Filter C and POSIX out of Accept-Language Junio C Hamano
2025-07-10 23:08   ` brian m. carlson
2025-07-10 23:26     ` Collin Funk
2025-07-11  2:49       ` [External] " Han Young
2025-07-11 16:46         ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).