Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Eric Sunshine <sunshine@sunshineco.com>
To: Lars Schneider <lars.schneider@autodesk.com>
Cc: "Git List" <git@vger.kernel.org>,
	"Junio C Hamano" <gitster@pobox.com>,
	"Torsten Bögershausen" <tboegi@web.de>,
	"Johannes Sixt" <j6t@kdbg.org>, "Jeff King" <peff@peff.net>,
	"Ramsay Jones" <ramsay@ramsayjones.plus.com>,
	"Johannes Schindelin" <Johannes.Schindelin@gmx.de>,
	"Lars Schneider" <larsxschneider@gmail.com>
Subject: Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings
Date: Wed, 7 Mar 2018 13:04:44 -0500	[thread overview]
Message-ID: <CAPig+cTsRufRNZbKJS-02fvbtaLB30FO3Em6HR_LZrsK+CfOjw@mail.gmail.com> (raw)
In-Reply-To: <20180307173026.30058-8-lars.schneider@autodesk.com>

On Wed, Mar 7, 2018 at 12:30 PM,  <lars.schneider@autodesk.com> wrote:
> Check that new content is valid with respect to the user defined
> 'working-tree-encoding' attribute.
>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
> diff --git a/convert.c b/convert.c
> @@ -266,6 +266,58 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
> +static int validate_encoding(const char *path, const char *enc,
> +                     const char *data, size_t len, int die_on_error)
> +{
> +       /* We only check for UTF here as UTF?? can be an alias for UTF-?? */
> +       if (startscase_with(enc, "UTF")) {
> +               /*
> +                * Check for detectable errors in UTF encodings
> +                */
> +               if (has_prohibited_utf_bom(enc, data, len)) {
> +                       const char *error_msg = _(
> +                               "BOM is prohibited in '%s' if encoded as %s");
> +                       /*
> +                        * This advice is shown for UTF-??BE and UTF-??LE encodings.
> +                        * We cut off the last two characters of the encoding name
> +                        # to generate the encoding name suitable for BOMs.

s/#/*/

> +                        */
> +                       const char *advise_msg = _(
> +                               "The file '%s' contains a byte order "
> +                               "mark (BOM). Please use %s as "
> +                               "working-tree-encoding.");
> +                       char *upper_enc = xstrdup_toupper(enc);
> +                       upper_enc[strlen(upper_enc)-2] = '\0';

Due to startscase_with(...,"UTF"), we know at this point that the
string is at least 3 characters long, thus it's safe to back up by 2.
Good.

> +                       advise(advise_msg, path, upper_enc);
> +                       free(upper_enc);
> +                       if (die_on_error)
> +                               die(error_msg, path, enc);
> +                       else {
> +                               return error(error_msg, path, enc);
> +                       }
> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
> @@ -62,6 +62,46 @@ test_expect_success 'check $GIT_DIR/info/attributes support' '
>  for i in 16 32
>  do
> +       test_expect_success "check prohibited UTF-${i} BOM" '
> +               test_when_finished "git reset --hard HEAD" &&
> +
> +               echo "*.utf${i}be text working-tree-encoding=utf-${i}be" >>.gitattributes &&
> +               echo "*.utf${i}le text working-tree-encoding=utf-${i}le" >>.gitattributes &&

v10 is checking only hyphenated lowercase encoding name; earlier
versions checked uppercase. For better coverage, it would be nice to
check several combinations: all uppercase, all lowercase, mixed case,
hyphenated, not hyphenated.

I'm not suggesting running all the tests repeatedly but rather just
varying the format of the encoding name in these tests you're adding.
For instance, the above could instead be:

    echo "*.utf${i}be text working-tree-encoding=UTF-${i}be" >>.gitattributes &&
    echo "*.utf${i}le text working-tree-encoding=utf${i}LE" >>.gitattributes &&

or something.

> +               # Here we add a UTF-16 (resp. UTF-32) files with BOM (big/little-endian)
> +               # but we tell Git to treat it as UTF-16BE/UTF-16LE (resp. UTF-32).
> +               # In these cases the BOM is prohibited.
> +               cp bebom.utf${i}be.raw bebom.utf${i}be &&
> +               test_must_fail git add bebom.utf${i}be 2>err.out &&
> +               test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out &&
> +
> +               cp lebom.utf${i}le.raw lebom.utf${i}be &&
> +               test_must_fail git add lebom.utf${i}be 2>err.out &&
> +               test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out &&
> +
> +               cp bebom.utf${i}be.raw bebom.utf${i}le &&
> +               test_must_fail git add bebom.utf${i}le 2>err.out &&
> +               test_i18ngrep "fatal: BOM is prohibited .* utf-${i}le" err.out &&
> +
> +               cp lebom.utf${i}le.raw lebom.utf${i}le &&
> +               test_must_fail git add lebom.utf${i}le 2>err.out &&
> +               test_i18ngrep "fatal: BOM is prohibited .* utf-${i}le" err.out
> +       '
> +
> +       test_expect_success "check required UTF-${i} BOM" '
> +               test_when_finished "git reset --hard HEAD" &&
> +
> +               echo "*.utf${i} text working-tree-encoding=utf-${i}" >>.gitattributes &&

This is another opportunity for checking a variation (case,
hyphenation) of the encoding name rather than using only hyphenated
lowercase.

> +
> +               cp nobom.utf${i}be.raw nobom.utf${i} &&
> +               test_must_fail git add nobom.utf${i} 2>err.out &&
> +               test_i18ngrep "fatal: BOM is required .* utf-${i}" err.out &&
> +
> +               cp nobom.utf${i}le.raw nobom.utf${i} &&
> +               test_must_fail git add nobom.utf${i} 2>err.out &&
> +               test_i18ngrep "fatal: BOM is required .* utf-${i}" err.out
> +       '

next prev parent reply	other threads:[~2018-03-07 18:04 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-03-07 17:30 [PATCH v10 0/9] convert: add support for different encodings lars.schneider
2018-03-07 17:30 ` [PATCH v10 1/9] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
2018-03-07 17:30 ` [PATCH v10 2/9] strbuf: add xstrdup_toupper() lars.schneider
2018-03-07 17:30 ` [PATCH v10 3/9] strbuf: add a case insensitive starts_with() lars.schneider
2018-03-08  0:31   ` Duy Nguyen
2018-03-08 23:12     ` Junio C Hamano
2018-03-09 15:54       ` Lars Schneider
2018-03-09 17:20         ` Junio C Hamano
2018-03-09 19:06           ` Ævar Arnfjörð Bjarmason
2018-03-07 17:30 ` [PATCH v10 4/9] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
2018-03-07 17:30 ` [PATCH v10 5/9] utf8: add function to detect a missing " lars.schneider
2018-03-07 17:30 ` [PATCH v10 6/9] convert: add 'working-tree-encoding' attribute lars.schneider
2018-03-07 17:54   ` Eric Sunshine
2018-03-07 22:56     ` Lars Schneider
2018-03-07 22:57       ` Junio C Hamano
2018-03-07 19:35   ` Junio C Hamano
2018-03-07 17:30 ` [PATCH v10 7/9] convert: check for detectable errors in UTF encodings lars.schneider
2018-03-07 18:04   ` Eric Sunshine [this message]
2018-03-09 17:02     ` Lars Schneider
2018-03-07 19:49   ` Junio C Hamano
2018-03-07 22:12     ` Lars Schneider
2018-03-07 22:32       ` Junio C Hamano
2018-03-07 22:49         ` Lars Schneider
2018-03-07 22:57           ` Junio C Hamano
2018-03-07 23:19             ` Lars Schneider
2018-03-07 23:34               ` Junio C Hamano
2018-03-07 17:30 ` [PATCH v10 8/9] convert: add tracing for 'working-tree-encoding' attribute lars.schneider
2018-03-07 17:30 ` [PATCH v10 9/9] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
2018-03-07 19:59   ` Junio C Hamano
2018-03-07 22:44     ` Lars Schneider
2018-03-07 22:52       ` Junio C Hamano
2018-03-07 22:58         ` Lars Schneider

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAPig+cTsRufRNZbKJS-02fvbtaLB30FO3Em6HR_LZrsK+CfOjw@mail.gmail.com \
    --to=sunshine@sunshineco.com \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=j6t@kdbg.org \
    --cc=lars.schneider@autodesk.com \
    --cc=larsxschneider@gmail.com \
    --cc=peff@peff.net \
    --cc=ramsay@ramsayjones.plus.com \
    --cc=tboegi@web.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).