From: "Torsten Bögershausen" <tboegi@web.de>
To: Lars Schneider <larsxschneider@gmail.com>
Cc: Junio C Hamano <gitster@pobox.com>,
lars.schneider@autodesk.com, git@vger.kernel.org, j6t@kdbg.org,
sunshine@sunshineco.com, peff@peff.net,
ramsay@ramsayjones.plus.com, Johannes.Schindelin@gmx.de
Subject: Re: [PATCH v7 0/7] convert: add support for different encodings
Date: Fri, 16 Feb 2018 17:58:15 +0100 [thread overview]
Message-ID: <20180216165815.GA4681@tor.lan> (raw)
In-Reply-To: <DC552BF4-3E87-41E0-BF92-4BA9633D374E@gmail.com>
On Fri, Feb 16, 2018 at 03:42:35PM +0100, Lars Schneider wrote:
[]
>
> Agreed. However, people using ShiftJIS are not my target audience.
> My target audience are:
>
> (1) People that have to encode their text files in UTF-16 (for
> whatever reason - usually because of legacy processes or tools).
>
> (2) People that want to see textual diffs of their UTF-16 encoded
> files in their Git tools without special adjustments (on the
> command line, on the web, etc).
>
> That was my primary motivation. The fact that w-t-e supports any
> other encoding too is just a nice side effect. I don't foresee people
> using other w-t-encodings other than UTF-16 in my organization.
>
> I have the suspicion that the feature could be useful for the Git
> community at large. Consider this Stack Overflow question:
> https://stackoverflow.com/questions/777949/can-i-make-git-recognize-a-utf-16-file-as-text
>
> This question was viewed 42k times and there is no good solution.
> I believe w-t-e could be a good solution.
>
If it was only about a diff of UTF-16 files, I may suggest a patch.
I simply copy-paste it here for review, if someone thinks that it may
be useful, I can send it as a real patch/RFC.
git show HEAD
commit 9f7d43f29eaf6017b7b16261ce91d8ef182cf415
Author: Torsten Bögershausen <tboegi@web.de>
Date: Fri Feb 2 15:35:23 2018 +0100
Auto diff of UTF-16 files in UTF-8
When an UTF-16 file is commited and later changed, `git diff` shows
"Binary files XX and YY differ".
When the user wants a diff in UTF-8, a textconv needs to be specified
in .gitattributes and the textconv must be configured.
A more user-friendly diff can be produced for UTF-16 if
- the user did not use `git diff --binary`
- the blob is identified as binary
- the blob has an UTF-16 BOM
- the blob can be converted into UTF-8
Enhance the diff machinery to auto-detect UTF-16 blobs and show them
as UTF-8, unless the user specifies `git diff --binary` which creates
a binary diff.
diff --git a/diff.c b/diff.c
index fb22b19f09..51831ee94d 100644
--- a/diff.c
+++ b/diff.c
@@ -3192,6 +3192,10 @@ static void builtin_diff(const char *name_a,
strbuf_reset(&header);
}
+ if (one && one->reencoded_from_utf16)
+ strbuf_addf(&header, "a is converted to UTF-8 from UTF-16\n");
+ if (two && two->reencoded_from_utf16)
+ strbuf_addf(&header, "b is converted to UTF-8 from UTF-16\n");
mf1.size = fill_textconv(textconv_one, one, &mf1.ptr);
mf2.size = fill_textconv(textconv_two, two, &mf2.ptr);
@@ -3611,8 +3615,25 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
s->size = size;
s->should_free = 1;
}
- }
- else {
+ if (!s->binary && buffer_is_binary(s->data, s->size) &&
+ buffer_has_utf16_bom(s->data, s->size)) {
+ int outsz = 0;
+ char *outbuf;
+ outbuf = reencode_string_len(s->data, (int)s->size,
+ "UTF-8", "UTF-16", &outsz);
+ if (outbuf) {
+ if (s->should_free)
+ free(s->data);
+ if (s->should_munmap)
+ munmap(s->data, s->size);
+ s->should_munmap = 0;
+ s->data = outbuf;
+ s->size = outsz;
+ s->reencoded_from_utf16 = 1;
+ s->should_free = 1;
+ }
+ }
+ } else {
enum object_type type;
if (size_only || (flags & CHECK_BINARY)) {
type = sha1_object_info(s->oid.hash, &s->size);
@@ -3629,6 +3650,19 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
s->data = read_sha1_file(s->oid.hash, &type, &s->size);
if (!s->data)
die("unable to read %s", oid_to_hex(&s->oid));
+ if (!s->binary && buffer_is_binary(s->data, s->size) &&
+ buffer_has_utf16_bom(s->data, s->size)) {
+ int outsz = 0;
+ char *buf;
+ buf = reencode_string_len(s->data, (int)s->size,
+ "UTF-8", "UTF-16", &outsz);
+ if (buf) {
+ free(s->data);
+ s->data = buf;
+ s->size = outsz;
+ s->reencoded_from_utf16 = 1;
+ }
+ }
s->should_free = 1;
}
return 0;
@@ -5695,6 +5729,10 @@ static int diff_filespec_is_identical(struct diff_filespec *one,
static int diff_filespec_check_stat_unmatch(struct diff_filepair *p)
{
+ if (p->binary) {
+ p->one->binary = 1;
+ p->two->binary = 1;
+ }
if (p->done_skip_stat_unmatch)
return p->skip_stat_unmatch_result;
@@ -5735,6 +5773,7 @@ static void diffcore_skip_stat_unmatch(struct diff_options *diffopt)
for (i = 0; i < q->nr; i++) {
struct diff_filepair *p = q->queue[i];
+ p->binary = diffopt->flags.binary;
if (diff_filespec_check_stat_unmatch(p))
diff_q(&outq, p);
else {
diff --git a/diffcore.h b/diffcore.h
index a30da161da..3cd97bb93b 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -47,6 +47,8 @@ struct diff_filespec {
unsigned has_more_entries : 1; /* only appear in combined diff */
/* data should be considered "binary"; -1 means "don't know yet" */
signed int is_binary : 2;
+ unsigned binary : 1;
+ unsigned reencoded_from_utf16 : 1;
struct userdiff_driver *driver;
};
@@ -72,6 +74,7 @@ struct diff_filepair {
unsigned is_unmerged : 1;
unsigned done_skip_stat_unmatch : 1;
unsigned skip_stat_unmatch_result : 1;
+ unsigned binary : 1;
};
#define DIFF_PAIR_UNMERGED(p) ((p)->is_unmerged)
diff --git a/t/t4066-diff-encoding.sh b/t/t4066-diff-encoding.sh
new file mode 100755
index 0000000000..9bb3c70ada
--- /dev/null
+++ b/t/t4066-diff-encoding.sh
@@ -0,0 +1,98 @@
+#!/bin/sh
+
+test_description='git diff with encoding attribute'
+
+. ./test-lib.sh
+
+printf '\303\244rger\n\303\266se\n\303\274bel\n' |
+ iconv -f UTF-8 -t UTF-16 >UTF-16
+printf '\303\266se\n\303\274bel\n\303\245st\n' |
+ iconv -f UTF-8 -t UTF-16 >file2
+
+test_expect_success 'setup' '
+ cp UTF-16 file &&
+ git add file &&
+ git commit -m "add file in UTF-16" &&
+ test_tick &&
+ echo "file encoding=UTF-16" >.gitattributes
+'
+
+test_expect_success 'diff against local change' '
+ cp file2 file &&
+ test_tick &&
+ cat >expect <<-\EOF &&
+ diff --git a/file b/file
+ index 26acf09..e98d27a 100644
+ a is converted to UTF-8 from UTF-16
+ b is converted to UTF-8 from UTF-16
+ --- a/file
+ +++ b/file
+ @@ -1,3 +1,3 @@
+ -ärger
+ öse
+ übel
+ +åst
+EOF
+ git diff file >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'diff --binary against local change' '
+ cp file2 file &&
+ test_tick &&
+ cat >expect <<-\EOF &&
+ diff --git a/file b/file
+ index 26acf09b0aad19fb22566956d1a39cb4e2a3b420..e98d27acfb90cfcfc84fcc5173baa4aa7828290f 100644
+ GIT binary patch
+ literal 28
+ ecmezW?;ArgLn;Fo!ykquAe{qbJq3!C0BHb{ln3Pi
+
+ literal 32
+ icmezW?+HT@Lpnn$kmO?c#!w7oaWVX1NCMJ1Ko$VA_z0~4
+
+EOF
+ git diff --binary file >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'commit local change' '
+ git add file &&
+ git commit -m "add file V2 in UTF-16" &&
+ test_tick
+'
+
+test_expect_success 'diff HEAD against HEAD^' '
+ cat >expect <<-\EOF &&
+ diff --git a/file b/file
+ index 26acf09..e98d27a 100644
+ a is converted to UTF-8 from UTF-16
+ b is converted to UTF-8 from UTF-16
+ --- a/file
+ +++ b/file
+ @@ -1,3 +1,3 @@
+ -ärger
+ öse
+ übel
+ +åst
+EOF
+ git diff HEAD^ HEAD -- file >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'diff --binary HEAD against HEAD^' '
+ cat >expect <<-\EOF &&
+ diff --git a/file b/file
+ index 26acf09b0aad19fb22566956d1a39cb4e2a3b420..e98d27acfb90cfcfc84fcc5173baa4aa7828290f 100644
+ GIT binary patch
+ literal 28
+ ecmezW?;ArgLn;Fo!ykquAe{qbJq3!C0BHb{ln3Pi
+
+ literal 32
+ icmezW?+HT@Lpnn$kmO?c#!w7oaWVX1NCMJ1Ko$VA_z0~4
+
+EOF
+ git diff --binary HEAD^ HEAD -- file >actual &&
+ test_cmp expect actual
+'
+
+test_done
diff --git a/utf8.h b/utf8.h
index 6bbcf31a83..a2184d0300 100644
--- a/utf8.h
+++ b/utf8.h
@@ -16,6 +16,17 @@ int utf8_fprintf(FILE *, const char *, ...);
extern const char utf8_bom[];
extern int skip_utf8_bom(char **, size_t);
+static inline int buffer_has_utf16_bom(const void *buf, size_t len) {
+ const unsigned char *text = (unsigned char *)buf;
+ if (!text || len < 2)
+ return 0;
+ if (text[0] == 0xff && text[1] == 0xfe)
+ return 1;
+ if (text[0] == 0xfe && text[1] == 0xff)
+ return 1;
+ return 0;
+}
+
void strbuf_add_wrapped_text(struct strbuf *buf,
const char *text, int indent, int indent2, int width);
void strbuf_add_wrapped_bytes(struct strbuf *buf, const char *data, int len,
next prev parent reply other threads:[~2018-02-16 16:58 UTC|newest]
Thread overview: 41+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-02-15 15:27 [PATCH v7 0/7] convert: add support for different encodings lars.schneider
2018-02-15 15:27 ` [PATCH v7 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
2018-02-16 12:55 ` Ævar Arnfjörð Bjarmason
2018-02-16 18:45 ` Jeff King
2018-02-16 19:30 ` Junio C Hamano
2018-02-15 15:27 ` [PATCH v7 2/7] strbuf: add xstrdup_toupper() lars.schneider
2018-02-15 15:27 ` [PATCH v7 3/7] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
2018-02-15 15:27 ` [PATCH v7 4/7] utf8: add function to detect a missing " lars.schneider
2018-02-15 15:27 ` [PATCH v7 5/7] convert: add 'working-tree-encoding' attribute lars.schneider
2018-02-15 15:27 ` [PATCH v7 6/7] convert: add tracing for " lars.schneider
2018-02-15 15:27 ` [PATCH v7 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
2018-02-15 20:03 ` [PATCH v7 0/7] convert: add support for different encodings Junio C Hamano
2018-02-15 22:09 ` Jeff King
2018-02-16 18:55 ` Junio C Hamano
2018-02-16 19:25 ` Jeff King
2018-02-16 19:27 ` Jeff King
2018-02-16 19:41 ` Junio C Hamano
2018-02-21 18:06 ` Lars Schneider
2018-02-16 14:42 ` Lars Schneider
2018-02-16 16:58 ` Torsten Bögershausen [this message]
2018-02-22 20:00 ` Lars Schneider
2018-02-22 20:12 ` Jeff King
2018-02-23 16:35 ` Junio C Hamano
2018-02-23 20:11 ` Junio C Hamano
2018-02-24 15:18 ` Lars Schneider
2018-02-26 1:44 ` Jeff King
2018-02-26 17:35 ` Torsten Bögershausen
2018-02-26 20:46 ` Jeff King
2018-02-27 21:05 ` Torsten Bögershausen
2018-02-27 21:25 ` Jeff King
2018-02-27 21:55 ` Junio C Hamano
2018-02-27 21:58 ` Jeff King
2018-02-27 22:10 ` Junio C Hamano
2018-02-27 22:20 ` Jeff King
2018-02-28 8:20 ` Torsten Bögershausen
2018-02-28 13:21 ` Jeff King
2018-02-28 17:42 ` Junio C Hamano
2018-03-01 7:49 ` Jeff King
2018-03-04 10:16 ` Torsten Bögershausen
2018-02-28 20:46 ` Lars Schneider
2018-02-16 19:04 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20180216165815.GA4681@tor.lan \
--to=tboegi@web.de \
--cc=Johannes.Schindelin@gmx.de \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=j6t@kdbg.org \
--cc=lars.schneider@autodesk.com \
--cc=larsxschneider@gmail.com \
--cc=peff@peff.net \
--cc=ramsay@ramsayjones.plus.com \
--cc=sunshine@sunshineco.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.