* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
2025-07-10 22:16 ` [PATCH 1/1] http: don't send C or POSIX in Accept-Language brian m. carlson
@ 2025-07-10 22:47 ` Junio C Hamano
2025-07-10 23:08 ` brian m. carlson
2025-07-11 15:23 ` Justin Tobler
2025-07-15 4:38 ` Eli Schwartz
2 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2025-07-10 22:47 UTC (permalink / raw)
To: brian m. carlson; +Cc: git, Taylor Blau, Yi EungJun
"brian m. carlson" <sandals@crustytoothpaste.net> writes:
> ... However, these two values are widely used in the LANGUAGE
> header, are well-known and widely used non-language locales, and have
> been seen in the wild on the server side.
"header" -> "environment variable" I presume?
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
2025-07-10 22:47 ` Junio C Hamano
@ 2025-07-10 23:08 ` brian m. carlson
0 siblings, 0 replies; 18+ messages in thread
From: brian m. carlson @ 2025-07-10 23:08 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Taylor Blau, Yi EungJun
[-- Attachment #1: Type: text/plain, Size: 501 bytes --]
On 2025-07-10 at 22:47:56, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
>
> > ... However, these two values are widely used in the LANGUAGE
> > header, are well-known and widely used non-language locales, and have
> > been seen in the wild on the server side.
>
> "header" -> "environment variable" I presume?
Ah, yes. I'll fix that for a v2 in a few days while I wait for other
comments.
--
brian m. carlson (they/them)
Toronto, Ontario, CA
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
2025-07-10 22:16 ` [PATCH 1/1] http: don't send C or POSIX in Accept-Language brian m. carlson
2025-07-10 22:47 ` Junio C Hamano
@ 2025-07-11 15:23 ` Justin Tobler
2025-07-11 17:02 ` Collin Funk
2025-07-11 18:32 ` Junio C Hamano
2025-07-15 4:38 ` Eli Schwartz
2 siblings, 2 replies; 18+ messages in thread
From: Justin Tobler @ 2025-07-11 15:23 UTC (permalink / raw)
To: brian m. carlson; +Cc: git, Junio C Hamano, Taylor Blau, Yi EungJun
On 25/07/10 10:16PM, brian m. carlson wrote:
> The LANGUAGE environment variable is not specified by POSIX, but a
> variety of programs using GNU gettext accept it. The Linux manpages
> state that it can contain a colon-separated list of locales.
>
> However, not all locales are valid as languages. The C and POSIX
> locales, for instance, are not languages and are not registered with
> IANA, nor are they a part of ISO 639. In fact, "C" is too short to
> match the ABNF production for a language, which must be at least two
> characters in length.
>
> Nonetheless, many users provide these values in the LANGUAGE environment
> variable for unknown reasons and if they do, we do not want to send a
> malformed Accept-Language header to the server. If there are no other
> valid language tags, then send no header; otherwise, send only the valid
> tags, ignoring "C" and "POSIX" wherever they may appear, as well as any
> variants (such as the "C.UTF-8" locale found on some Linux systems).
Ok so the languages returned by `get_preferred_languages()` are used to
write the Accept-Language header when making requests.
Looking at `get_preferred_languages()` when NO_GETTEXT is defined, we
already filter out "C" and "POSIX". So doing this for the LANGUAGE
environment variable when writing the header also makes sense.
> We do not reject all possible invalid language tags since doing so
> would require bundling a copy of the IANA database and would risk poor
> behavior in the face of uncommon languages or values that are not
> registered but meet the production for private use or other restricted
> interchange. However, these two values are widely used in the LANGUAGE
> header, are well-known and widely used non-language locales, and have
> been seen in the wild on the server side.
>
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
> http.c | 8 ++++++++
> t/t5541-http-push-smart.sh | 18 ++++++++++++++++++
> 2 files changed, 26 insertions(+)
>
> diff --git a/http.c b/http.c
> index d88e79fbde..a96df4fcdb 100644
> --- a/http.c
> +++ b/http.c
> @@ -2022,6 +2022,14 @@ static void write_accept_language(struct strbuf *buf)
> s++;
>
> if (tag.len) {
> + /*
> + * These are not valid languages: do not send them to
> + * the server.
> + */
> + if (!strcmp(tag.buf, "C") || !strcmp(tag.buf, "POSIX")) {
> + strbuf_reset(&tag);
> + continue;
> + }
From my understanding, each language is expected to be defined in the
following form:
language[_territory][.codeset][@modifier]
When we parse the list of languages we only care about the
`language[_territory]` part though.
From looking at ISO 639 language codes, only codes with two or three
characters are valid. If we wanted to be a bit more strict, we could
check the length of the language code (everything before the first '_')
and filter out anything outside of those limits. This would naturally
filter out "C" and "POSIX" without having to mention them explicitly.
Not sure if being more strict adds much more value here in practice
though. So it may be fine to keep it as-is. :)
> num_langs++;
> REALLOC_ARRAY(language_tags, num_langs);
> language_tags[num_langs - 1] = strbuf_detach(&tag, NULL);
> diff --git a/t/t5541-http-push-smart.sh b/t/t5541-http-push-smart.sh
> index 538b603f03..96a6833e67 100755
> --- a/t/t5541-http-push-smart.sh
> +++ b/t/t5541-http-push-smart.sh
> @@ -86,6 +86,24 @@ test_expect_success 'push to remote repository (standard) with sending Accept-La
> GIT_TRACE_CURL=true LANGUAGE="ko_KR.UTF-8" git push -v -v 2>err &&
> ! grep "Expect: 100-continue" err &&
>
> + grep "=> Send header: Accept-Language:" err >err.language &&
> + test_cmp exp err.language &&
> +
> + test_commit C-is-not-a-language &&
> + GIT_TRACE_CURL=true LANGUAGE="C" git push -v -v 2>err &&
> +
> + ! grep "=> Send header: Accept-Language:" err >err.language &&
> + test_must_be_empty err.language &&
> +
> + test_commit POSIX-is-not-a-language-either &&
> + GIT_TRACE_CURL=true LANGUAGE="POSIX" git push -v -v 2>err &&
> +
> + ! grep "=> Send header: Accept-Language:" err >err.language &&
> + test_must_be_empty err.language &&
The above two tests demonstrate that the Accept-Language header is not
sent if no valid languages are found.
> +
> + test_commit ignore-C-and-POSIX-as-languages-wherever-provided &&
> + GIT_TRACE_CURL=true LANGUAGE="C.UTF-8:ko_KR.UTF-8:POSIX" git push -v -v 2>err &&
> +
> grep "=> Send header: Accept-Language:" err >err.language &&
> test_cmp exp err.language
> '
And here we see only the valid languages sent in the header. Looks good!
-Justin
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
2025-07-11 15:23 ` Justin Tobler
@ 2025-07-11 17:02 ` Collin Funk
2025-07-11 20:57 ` Carlo Marcelo Arenas Belón
2025-07-11 18:32 ` Junio C Hamano
1 sibling, 1 reply; 18+ messages in thread
From: Collin Funk @ 2025-07-11 17:02 UTC (permalink / raw)
To: Justin Tobler
Cc: brian m. carlson, git, Junio C Hamano, Taylor Blau, Yi EungJun
Justin Tobler <jltobler@gmail.com> writes:
> From my understanding, each language is expected to be defined in the
> following form:
>
> language[_territory][.codeset][@modifier]
>
> When we parse the list of languages we only care about the
> `language[_territory]` part though.
>
> From looking at ISO 639 language codes, only codes with two or three
> characters are valid. If we wanted to be a bit more strict, we could
> check the length of the language code (everything before the first '_')
> and filter out anything outside of those limits. This would naturally
> filter out "C" and "POSIX" without having to mention them explicitly.
>
> Not sure if being more strict adds much more value here in practice
> though. So it may be fine to keep it as-is. :)
Filtering out anything that isn't 2-3 letters seems like a good
heuristic to me.
It seems better than only filtering out "C" and "POSIX" and allowing
anything else. And it keeps us from having to keep a list of updated BCP
47 language tags.
Collin
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
2025-07-11 17:02 ` Collin Funk
@ 2025-07-11 20:57 ` Carlo Marcelo Arenas Belón
2025-07-11 21:29 ` brian m. carlson
0 siblings, 1 reply; 18+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2025-07-11 20:57 UTC (permalink / raw)
To: Collin Funk
Cc: Justin Tobler, brian m. carlson, git, Junio C Hamano, Taylor Blau,
Yi EungJun
On Fri, Jul 11, 2025 at 10:02:01AM -0800, Collin Funk wrote:
> Justin Tobler <jltobler@gmail.com> writes:
>
> > From my understanding, each language is expected to be defined in the
> > following form:
> >
> > language[_territory][.codeset][@modifier]
> >
> > When we parse the list of languages we only care about the
> > `language[_territory]` part though.
> >
> > From looking at ISO 639 language codes, only codes with two or three
> > characters are valid. If we wanted to be a bit more strict, we could
> > check the length of the language code (everything before the first '_')
> > and filter out anything outside of those limits. This would naturally
> > filter out "C" and "POSIX" without having to mention them explicitly.
>
> Filtering out anything that isn't 2-3 letters seems like a good
> heuristic to me.
except that it would be incorrect, as language tags are defined in RFC5646
and are larger than that.
most importantly, deriving language tags from locales provides some very
useful tags when including the characters after the _, because zh_CN and
zh_HK use completely different scripts, for example.
Carlo
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
2025-07-11 20:57 ` Carlo Marcelo Arenas Belón
@ 2025-07-11 21:29 ` brian m. carlson
2025-07-11 22:12 ` Carlo Arenas
0 siblings, 1 reply; 18+ messages in thread
From: brian m. carlson @ 2025-07-11 21:29 UTC (permalink / raw)
To: Carlo Marcelo Arenas Belón
Cc: Collin Funk, Justin Tobler, git, Junio C Hamano, Taylor Blau,
Yi EungJun
[-- Attachment #1: Type: text/plain, Size: 1596 bytes --]
On 2025-07-11 at 20:57:03, Carlo Marcelo Arenas Belón wrote:
> except that it would be incorrect, as language tags are defined in RFC5646
> and are larger than that.
>
> most importantly, deriving language tags from locales provides some very
> useful tags when including the characters after the _, because zh_CN and
> zh_HK use completely different scripts, for example.
Yes, that's true. You have some private use and some irregular tags and
you also have some tags that include scripts or country codes.
For instance, Swahili can be written in Latin or Arabic script. As I
understand it, the Arabic script form is older and less common these
days, so if I learned Swahili (which I would like to), then I might only
learn the Latin script variant in a course. I would need to specify
that script in the language code to be sure that I was presented with
content in a form that I could read and understand. Similar concerns
exist with the variants of Serbo-Croatian: some are written in Latin
scripts, some in Cyrillic, and some in both, and it's not guaranteed
that all speakers understand all forms.
And then there's pt-PT and pt-BR, which are not always mutually
intelligible. Most free software I've seen ships these as separate
translations.
I don't want to implement language tag parsing here since we don't need
to do that. I would like to do the simple thing to prevent commonly
used locales that don't represent actual language tags from being
included and not overengineer this design.
--
brian m. carlson (they/them)
Toronto, Ontario, CA
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
2025-07-11 21:29 ` brian m. carlson
@ 2025-07-11 22:12 ` Carlo Arenas
2025-07-11 22:17 ` Collin Funk
0 siblings, 1 reply; 18+ messages in thread
From: Carlo Arenas @ 2025-07-11 22:12 UTC (permalink / raw)
To: brian m. carlson, Carlo Marcelo Arenas Belón, Collin Funk,
Justin Tobler, git, Junio C Hamano, Taylor Blau, semtlenori
On Fri, Jul 11, 2025 at 2:29 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
>
> On 2025-07-11 at 20:57:03, Carlo Marcelo Arenas Belón wrote:
> > except that it would be incorrect, as language tags are defined in RFC5646
> > and are larger than that.
> >
> > most importantly, deriving language tags from locales provides some very
> > useful tags when including the characters after the _, because zh_CN and
> > zh_HK use completely different scripts, for example.
>
> Yes, that's true. You have some private use and some irregular tags and
> you also have some tags that include scripts or country codes.
>
> For instance, Swahili can be written in Latin or Arabic script. As I
> understand it, the Arabic script form is older and less common these
> days, so if I learned Swahili (which I would like to), then I might only
> learn the Latin script variant in a course. I would need to specify
> that script in the language code to be sure that I was presented with
> content in a form that I could read and understand. Similar concerns
> exist with the variants of Serbo-Croatian: some are written in Latin
> scripts, some in Cyrillic, and some in both, and it's not guaranteed
> that all speakers understand all forms.
>
> And then there's pt-PT and pt-BR, which are not always mutually
> intelligible. Most free software I've seen ships these as separate
> translations.
>
> I don't want to implement language tag parsing here since we don't need
> to do that. I would like to do the simple thing to prevent commonly
> used locales that don't represent actual language tags from being
> included and not overengineer this design
I think that your design of filtering C and POSIX accomplishes that,
even if it might seem like hardcoding those two values is a little dirty.
Moving the logic (including the filtering, which is already happening
for the `!NO_GETTEXT `code path adds several chances to modernize
and cleanup the code though which will be beneficial (ex: using and
strvec or even a hashtable to process the candidates, improve
validation and tests)
Carlo
CC Yi EungJun at a hopefully working email address with link to thread
https://lore.kernel.org/git/20250710221641.857081-1-sandals@crustytoothpaste.net/
.
> --
> brian m. carlson (they/them)
> Toronto, Ontario, CA
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
2025-07-11 22:12 ` Carlo Arenas
@ 2025-07-11 22:17 ` Collin Funk
0 siblings, 0 replies; 18+ messages in thread
From: Collin Funk @ 2025-07-11 22:17 UTC (permalink / raw)
To: Carlo Arenas
Cc: brian m. carlson, Justin Tobler, git, Junio C Hamano, Taylor Blau,
semtlenori
Carlo Arenas <carenas@gmail.com> writes:
>> I don't want to implement language tag parsing here since we don't need
>> to do that. I would like to do the simple thing to prevent commonly
>> used locales that don't represent actual language tags from being
>> included and not overengineer this design
>
> I think that your design of filtering C and POSIX accomplishes that,
> even if it might seem like hardcoding those two values is a little dirty.
Thanks for correcting me regarding language tag length Carlo.
I guess I am fine with filtering out "C" and "POSIX" now. Not perfect,
but I think everyone agrees that we don't want to maintain a database of
language tags just for this.
Collin
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
2025-07-11 15:23 ` Justin Tobler
2025-07-11 17:02 ` Collin Funk
@ 2025-07-11 18:32 ` Junio C Hamano
2025-07-11 20:22 ` Carlo Marcelo Arenas Belón
1 sibling, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2025-07-11 18:32 UTC (permalink / raw)
To: Justin Tobler; +Cc: brian m. carlson, git, Taylor Blau, Yi EungJun
Justin Tobler <jltobler@gmail.com> writes:
> Looking at `get_preferred_languages()` when NO_GETTEXT is defined, we
> already filter out "C" and "POSIX". So doing this for the LANGUAGE
> environment variable when writing the header also makes sense.
True. I wonder if it makes sense to do the check in that helper
function, though. I.e. something like
diff --git c/gettext.c w/gettext.c
index 8d08a61f84..e2e0fe339d 100644
--- c/gettext.c
+++ w/gettext.c
@@ -41,6 +41,16 @@ static const char *locale_charset(void)
static const char *charset;
+static const char *filter_out_non_languages(const char *candidate)
+{
+ if (candidate && *candidate &&
+ strcmp(candidate, "C") &&
+ strcmp(candidate, "POSIX"))
+ return candidate;
+ else
+ return NULL;
+}
+
/*
* Guess the user's preferred languages from the value in LANGUAGE environment
* variable and LC_MESSAGES locale category if NO_GETTEXT is not defined.
@@ -51,15 +61,13 @@ const char *get_preferred_languages(void)
{
const char *retval;
- retval = getenv("LANGUAGE");
- if (retval && *retval)
+ retval = filter_out_non_languages(getenv("LANGUAGE"));
+ if (retval)
return retval;
#ifndef NO_GETTEXT
- retval = setlocale(LC_MESSAGES, NULL);
- if (retval && *retval &&
- strcmp(retval, "C") &&
- strcmp(retval, "POSIX"))
+ retval = filter_out_non_languages(setlocale(LC_MESSAGES, NULL));
+ if (retval)
return retval;
#endif
In the production code, we should have a comment before that new
helper function that explains why we exclude C and POSIX, if we were
to go that route.
> Not sure if being more strict adds much more value here in practice
> though. So it may be fine to keep it as-is. :)
Yup. I care more about having a single place that checks using the
same logic, than what that logic exactly is ;-).
Thanks.
^ permalink raw reply related [flat|nested] 18+ messages in thread* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
2025-07-11 18:32 ` Junio C Hamano
@ 2025-07-11 20:22 ` Carlo Marcelo Arenas Belón
0 siblings, 0 replies; 18+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2025-07-11 20:22 UTC (permalink / raw)
To: Junio C Hamano
Cc: Justin Tobler, brian m. carlson, git, Taylor Blau, Yi EungJun
On Fri, Jul 11, 2025 at 11:32:19AM -0800, Junio C Hamano wrote:
> Justin Tobler <jltobler@gmail.com> writes:
>
> > Looking at `get_preferred_languages()` when NO_GETTEXT is defined, we
> > already filter out "C" and "POSIX". So doing this for the LANGUAGE
> > environment variable when writing the header also makes sense.
>
> True. I wonder if it makes sense to do the check in that helper
> function, though. I.e. something like
Definitely, and might also fix another bug, as IMHO the current logic have
a couple of issues:
* LANGUAGE is not meant to be relevant unless LANG is set to a valid locale
as per the SPEC[1], allthough for our use case it might be better to still
do, specially if there are users in the wild setting C and POSIX there.
* it might make more sense to use the union of LANGUAGE and LC_MESSAGES
instead.
Carlo
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/1] http: don't send C or POSIX in Accept-Language
2025-07-10 22:16 ` [PATCH 1/1] http: don't send C or POSIX in Accept-Language brian m. carlson
2025-07-10 22:47 ` Junio C Hamano
2025-07-11 15:23 ` Justin Tobler
@ 2025-07-15 4:38 ` Eli Schwartz
2 siblings, 0 replies; 18+ messages in thread
From: Eli Schwartz @ 2025-07-15 4:38 UTC (permalink / raw)
To: brian m. carlson, git
Cc: Junio C Hamano, Taylor Blau, Carlo Arenas, Justin Tobler,
semtlenori, Collin Funk
[-- Attachment #1.1: Type: text/plain, Size: 3676 bytes --]
On 7/10/25 6:16 PM, brian m. carlson wrote:
> The LANGUAGE environment variable is not specified by POSIX, but a
> variety of programs using GNU gettext accept it. The Linux manpages
> state that it can contain a colon-separated list of locales.
>
> However, not all locales are valid as languages. The C and POSIX
> locales, for instance, are not languages and are not registered with
> IANA, nor are they a part of ISO 639. In fact, "C" is too short to
> match the ABNF production for a language, which must be at least two
> characters in length.
>
> Nonetheless, many users provide these values in the LANGUAGE environment
> variable for unknown reasons and if they do, we do not want to send a
> malformed Accept-Language header to the server. If there are no other
> valid language tags, then send no header; otherwise, send only the valid
> tags, ignoring "C" and "POSIX" wherever they may appear, as well as any
> variants (such as the "C.UTF-8" locale found on some Linux systems).
Better docs -- the gettext manpages suck:
https://www.gnu.org/software/gettext/manual/html_node/Locale-Names.html
https://www.gnu.org/software/gettext/manual/html_node/The-LANGUAGE-variable.html
At minimum this commit message needs revising. Gettext was adopted into
POSIX 2024 (Issue 8).
Respected by tools of course:
https://pubs.opengroup.org/onlinepubs/9799919799/utilities/gettext.html#tag_20_54_08
https://pubs.opengroup.org/onlinepubs/9799919799/functions/gettext.html
$LANGUAGE docs can be found at
https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap08.html#tag_08_02
"""
The value of LANGUAGE shall be a list of locale names separated by a
<colon> (':') character. If LANGUAGE is set to a non-empty string, each
locale name shall be tried in the specified order and if a messages
object is found, it shall be used for translation. If a locale name has
the format language[_territory][.codeset][@modifier], additional
searches of locale names without .codeset (if present), without
_territory (if present), and without @modifier (if present) may be performed
"""
And, for locale name values,
"""
If the locale value is "C" or "POSIX", the POSIX locale shall be used
and the standard utilities behave in accordance with the rules in 7.2
POSIX Locale for the associated category.
If the locale value begins with a <slash>, it shall be interpreted as
the pathname of a file that was created in the output format used by the
localedef utility; see OUTPUT FILES under localedef. Referencing such a
pathname shall result in that locale being used for the indicated category.
[XSI] [Option Start] If the locale value has the form:
language[_territory][.codeset]
it refers to an implementation-provided locale, where settings of
language, territory, and codeset are implementation-defined.
LC_COLLATE , LC_CTYPE , LC_MESSAGES , LC_MONETARY , LC_NUMERIC , and
LC_TIME are defined to accept an additional field @modifier, which
allows the user to select a specific instance of localization data
within a single category (for example, for selecting the dictionary as
opposed to the character ordering of data). The syntax for these
environment variables is thus defined as:
[language[_territory][.codeset][@modifier]]
"""
Your tests and code are probably broken -- they appear to normalize
nearly none of the standard grammar into valid Accept-Language entries.
Of course, "surely nobody actually does that" (except when they do!) --
but it's a relatively simple grammar structure, simply getting the
"shape" correct seems like a good idea.
--
Eli Schwartz
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 236 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread