* [PATCH 1/2] utf8_width(): allow non NUL-terminated input
2008-01-02 5:26 ` Junio C Hamano
@ 2008-01-02 9:49 ` Junio C Hamano
2008-01-02 9:50 ` [PATCH 2/2] diff: do not chomp hunk-header in the middle of a character Junio C Hamano
2008-01-02 9:50 ` [PATCH 3/2] attribute "coding": specify blob encoding Junio C Hamano
2 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2008-01-02 9:49 UTC (permalink / raw)
To: Tsugikazu Shibata; +Cc: git
The original interface assumed that the input string to be
always terminated with NUL, but that wasn't too useful.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
* This is a preparatory step for the next one that follows what
I outlined in my response. As we are in pre-1.5.4 freeze, I
won't be applying this to 'master' but this and the one that
implements the sane truncation might be 'next' material.
utf8.c | 68 ++++++++++++++++++++++++++++++++++++++-------------------------
utf8.h | 2 +-
2 files changed, 42 insertions(+), 28 deletions(-)
diff --git a/utf8.c b/utf8.c
index 9efcdb9..5d641fa 100644
--- a/utf8.c
+++ b/utf8.c
@@ -157,57 +157,71 @@ static int git_wcwidth(ucs_char_t ch)
* pointed to by the variable start. The pointer is updated to point at
* the next character. If it was not valid UTF-8, the pointer is set to NULL.
*/
-int utf8_width(const char **start)
+int utf8_width(const char **start, size_t *remainder_p)
{
unsigned char *s = (unsigned char *)*start;
ucs_char_t ch;
+ size_t remainder, incr;
- if (*s < 0x80) {
+ /*
+ * A caller that assumes NUL terminated text can choose
+ * not to bother with the remainder length.
+ */
+ remainder = (remainder_p ? *remainder_p : 999);
+
+ if (remainder < 1) {
+ goto invalid;
+ } else if (*s < 0x80) {
/* 0xxxxxxx */
ch = *s;
- *start += 1;
+ incr = 1;
} else if ((s[0] & 0xe0) == 0xc0) {
/* 110XXXXx 10xxxxxx */
- if ((s[1] & 0xc0) != 0x80 ||
- /* overlong? */
- (s[0] & 0xfe) == 0xc0)
+ if (remainder < 2 ||
+ (s[1] & 0xc0) != 0x80 ||
+ (s[0] & 0xfe) == 0xc0)
goto invalid;
ch = ((s[0] & 0x1f) << 6) | (s[1] & 0x3f);
- *start += 2;
+ incr = 2;
} else if ((s[0] & 0xf0) == 0xe0) {
/* 1110XXXX 10Xxxxxx 10xxxxxx */
- if ((s[1] & 0xc0) != 0x80 ||
- (s[2] & 0xc0) != 0x80 ||
- /* overlong? */
- (s[0] == 0xe0 && (s[1] & 0xe0) == 0x80) ||
- /* surrogate? */
- (s[0] == 0xed && (s[1] & 0xe0) == 0xa0) ||
- /* U+FFFE or U+FFFF? */
- (s[0] == 0xef && s[1] == 0xbf &&
- (s[2] & 0xfe) == 0xbe))
+ if (remainder < 3 ||
+ (s[1] & 0xc0) != 0x80 ||
+ (s[2] & 0xc0) != 0x80 ||
+ /* overlong? */
+ (s[0] == 0xe0 && (s[1] & 0xe0) == 0x80) ||
+ /* surrogate? */
+ (s[0] == 0xed && (s[1] & 0xe0) == 0xa0) ||
+ /* U+FFFE or U+FFFF? */
+ (s[0] == 0xef && s[1] == 0xbf &&
+ (s[2] & 0xfe) == 0xbe))
goto invalid;
ch = ((s[0] & 0x0f) << 12) |
((s[1] & 0x3f) << 6) | (s[2] & 0x3f);
- *start += 3;
+ incr = 3;
} else if ((s[0] & 0xf8) == 0xf0) {
/* 11110XXX 10XXxxxx 10xxxxxx 10xxxxxx */
- if ((s[1] & 0xc0) != 0x80 ||
- (s[2] & 0xc0) != 0x80 ||
- (s[3] & 0xc0) != 0x80 ||
- /* overlong? */
- (s[0] == 0xf0 && (s[1] & 0xf0) == 0x80) ||
- /* > U+10FFFF? */
- (s[0] == 0xf4 && s[1] > 0x8f) || s[0] > 0xf4)
+ if (remainder < 4 ||
+ (s[1] & 0xc0) != 0x80 ||
+ (s[2] & 0xc0) != 0x80 ||
+ (s[3] & 0xc0) != 0x80 ||
+ /* overlong? */
+ (s[0] == 0xf0 && (s[1] & 0xf0) == 0x80) ||
+ /* > U+10FFFF? */
+ (s[0] == 0xf4 && s[1] > 0x8f) || s[0] > 0xf4)
goto invalid;
ch = ((s[0] & 0x07) << 18) | ((s[1] & 0x3f) << 12) |
((s[2] & 0x3f) << 6) | (s[3] & 0x3f);
- *start += 4;
+ incr = 4;
} else {
invalid:
*start = NULL;
return 0;
}
+ *start += incr;
+ if (remainder_p)
+ *remainder_p = remainder - incr;
return git_wcwidth(ch);
}
@@ -218,7 +232,7 @@ int is_utf8(const char *text)
text++;
continue;
}
- utf8_width(&text);
+ utf8_width(&text, NULL);
if (!text)
return 0;
}
@@ -278,7 +292,7 @@ int print_wrapped_text(const char *text, int indent, int indent2, int width)
continue;
}
if (assume_utf8)
- w += utf8_width(&text);
+ w += utf8_width(&text, NULL);
else {
w++;
text++;
diff --git a/utf8.h b/utf8.h
index 15db6f1..4295a19 100644
--- a/utf8.h
+++ b/utf8.h
@@ -1,7 +1,7 @@
#ifndef GIT_UTF8_H
#define GIT_UTF8_H
-int utf8_width(const char **start);
+int utf8_width(const char **start, size_t *remain);
int is_utf8(const char *text);
int is_encoding_utf8(const char *name);
--
1.5.4.rc2.13.g690bd
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 2/2] diff: do not chomp hunk-header in the middle of a character
2008-01-02 5:26 ` Junio C Hamano
2008-01-02 9:49 ` [PATCH 1/2] utf8_width(): allow non NUL-terminated input Junio C Hamano
@ 2008-01-02 9:50 ` Junio C Hamano
2008-01-02 9:50 ` [PATCH 3/2] attribute "coding": specify blob encoding Junio C Hamano
2 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2008-01-02 9:50 UTC (permalink / raw)
To: Tsugikazu Shibata; +Cc: git
We truncate hunk-header line at 80 bytes, but that 80th byte
could be in the middle of a character, which is bad. This uses
utf8_width() function to make sure we do not cut a character in
the middle.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
* And this implements the sane_truncate_line() that
unconditionally assumes UTF-8.
The 80-byte cutoff is meant to chomp at 80 display columns,
and in many encodings including UTF-8 80 cols are usually
more than 80 bytes. To implement 80 column chomping, we
would need to tweak the low level xdl_find_func() interface
further so that it gives us back more than 80-bytes. Ideally
that layer should not be doing any chomping.
Another thing I noticed but did not bother to fix was that
the emit_line() interface is doing bad things with color.
It emits:
start-color text (= line, len) reset-color
and the text often ends with LF. We should emit reset-color
before the terminating LF instead.
diff.c | 28 ++++++++++++++++++++++++++++
1 files changed, 28 insertions(+), 0 deletions(-)
diff --git a/diff.c b/diff.c
index 5bdc111..b9159bc 100644
--- a/diff.c
+++ b/diff.c
@@ -10,6 +10,7 @@
#include "color.h"
#include "attr.h"
#include "run-command.h"
+#include "utf8.h"
#ifdef NO_FAST_WORKING_DIRECTORY
#define FAST_WORKING_DIRECTORY 0
@@ -465,10 +466,13 @@ static void diff_words_show(struct diff_words_data *diff_words)
}
}
+typedef unsigned long (*sane_truncate_fn)(char *line, unsigned long len);
+
struct emit_callback {
struct xdiff_emit_state xm;
int nparents, color_diff;
unsigned ws_rule;
+ sane_truncate_fn truncate;
const char **label_path;
struct diff_words_data *diff_words;
int *found_changesp;
@@ -521,6 +525,27 @@ static void emit_add_line(const char *reset, struct emit_callback *ecbdata, cons
}
}
+static unsigned long sane_truncate_line(struct emit_callback *ecb, char *line, unsigned long len)
+{
+ const char *cp;
+ unsigned long width, allot;
+
+ if (ecb->truncate)
+ return ecb->truncate(line, len);
+ cp = line;
+ width = 0;
+ allot = len;
+ while (0 < len) {
+ int w = utf8_width(&cp, &len);
+ if (!cp)
+ break; /* truncated in the middle? */
+ width += w;
+ if (allot <= width)
+ break;
+ }
+ return allot - len;
+}
+
static void fn_out_consume(void *priv, char *line, unsigned long len)
{
int i;
@@ -551,8 +576,11 @@ static void fn_out_consume(void *priv, char *line, unsigned long len)
;
if (2 <= i && i < len && line[i] == ' ') {
ecbdata->nparents = i - 1;
+ len = sane_truncate_line(ecbdata, line, len);
emit_line(diff_get_color(ecbdata->color_diff, DIFF_FRAGINFO),
reset, line, len);
+ if (line[len-1] != '\n')
+ putchar('\n');
return;
}
--
1.5.4.rc2.13.g690bd
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 3/2] attribute "coding": specify blob encoding
2008-01-02 5:26 ` Junio C Hamano
2008-01-02 9:49 ` [PATCH 1/2] utf8_width(): allow non NUL-terminated input Junio C Hamano
2008-01-02 9:50 ` [PATCH 2/2] diff: do not chomp hunk-header in the middle of a character Junio C Hamano
@ 2008-01-02 9:50 ` Junio C Hamano
2008-01-03 21:23 ` しらいしななこ
2 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2008-01-02 9:50 UTC (permalink / raw)
To: Tsugikazu Shibata; +Cc: git
This teaches "diff hunk header" function about custom character
encoding per path via gitattributes(5) so that it can sensibly
chomp a line without truncating a character in the middle.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
* This is not intended for serious inclusion, but was done more
as a demonstration of the concept, hence [3/2].
Makefile | 5 ++-
coding.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
coding.h | 10 +++++++++
diff.c | 48 ++++++++++++++++++++++++++++++++++++++++-----
diffcore.h | 1 +
5 files changed, 118 insertions(+), 8 deletions(-)
create mode 100644 coding.c
create mode 100644 coding.h
diff --git a/Makefile b/Makefile
index 21c80e6..de205e6 100644
--- a/Makefile
+++ b/Makefile
@@ -293,7 +293,8 @@ LIB_H = \
run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h \
tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h \
utf8.h reflog-walk.h patch-ids.h attr.h decorate.h progress.h \
- mailmap.h remote.h parse-options.h transport.h diffcore.h hash.h
+ mailmap.h remote.h parse-options.h transport.h diffcore.h hash.h \
+ coding.h
DIFF_OBJS = \
diff.o diff-lib.o diffcore-break.o diffcore-order.o \
@@ -316,7 +317,7 @@ LIB_OBJS = \
alloc.o merge-file.o path-list.o help.o unpack-trees.o $(DIFF_OBJS) \
color.o wt-status.o archive-zip.o archive-tar.o shallow.o utf8.o \
convert.o attr.o decorate.o progress.o mailmap.o symlinks.o remote.o \
- transport.o bundle.o walker.o parse-options.o ws.o
+ transport.o bundle.o walker.o parse-options.o ws.o coding.o
BUILTIN_OBJS = \
builtin-add.o \
diff --git a/coding.c b/coding.c
new file mode 100644
index 0000000..63c5dc7
--- /dev/null
+++ b/coding.c
@@ -0,0 +1,62 @@
+#include "coding.h"
+
+unsigned long truncate_euc_jp(char *line, unsigned long len)
+{
+ /*
+ * line contains len bytes but the end of the string could
+ * be incomplete (chomped in the middle of a character).
+ * grab complete lines and count the display width, and
+ * return the number of bytes to chomp at (which could
+ * be smaller than 'len' given, to discard the corrupt
+ * characters at the end
+ */
+ char *ep = line + len;
+ unsigned long sofar = 0;
+
+ while (line < ep) {
+ int ch = (*line++ & 0xFF);
+ int width;
+
+ if (ch < 0x80) {
+ /* JIS X 0201 lower half */
+ sofar++;
+ width = 1;
+ } else if (0xA1 <= ch && ch <= 0xFE) {
+ /* JIS X 0208 */
+ if (ep <= line)
+ goto invalid;
+ ch = (*line++ & 0xFF);
+ if (ch < 0xA1 || 0xFE < ch)
+ goto invalid;
+ width = 2;
+ sofar += 2;
+ } else if (ch == 0x8E) {
+ /* JIS X 0201 upper half */
+ if (ep <= line)
+ goto invalid;
+ ch = (*line++ & 0xFF);
+ if (ch <= 0xA1 || 0xDF < ch)
+ goto invalid;
+ width = 1;
+ sofar += 2;
+ } else if (ch == 0x8F) {
+ /* JIS X 0212 */
+ if (ep <= line)
+ goto invalid;
+ ch = (*line++ & 0xFF);
+ if (ch < 0xA1 || 0xFE < ch)
+ goto invalid;
+ if (ep <= line)
+ goto invalid;
+ ch = (*line++ & 0xFF);
+ if (ch < 0xA1 || 0xFE < ch)
+ goto invalid;
+ width = 2;
+ sofar += 3;
+ } else {
+ invalid:
+ return sofar;
+ }
+ }
+ return sofar;
+}
diff --git a/coding.h b/coding.h
new file mode 100644
index 0000000..0e02701
--- /dev/null
+++ b/coding.h
@@ -0,0 +1,10 @@
+#ifndef CODING_H
+#define CODING_H
+
+#include "cache.h"
+
+typedef unsigned long (*sane_truncate_fn)(char *line, unsigned long len);
+
+unsigned long truncate_euc_jp(char *, unsigned long);
+
+#endif
diff --git a/diff.c b/diff.c
index b9159bc..40c8d2b 100644
--- a/diff.c
+++ b/diff.c
@@ -11,6 +11,7 @@
#include "attr.h"
#include "run-command.h"
#include "utf8.h"
+#include "coding.h"
#ifdef NO_FAST_WORKING_DIRECTORY
#define FAST_WORKING_DIRECTORY 0
@@ -466,8 +467,6 @@ static void diff_words_show(struct diff_words_data *diff_words)
}
}
-typedef unsigned long (*sane_truncate_fn)(char *line, unsigned long len);
-
struct emit_callback {
struct xdiff_emit_state xm;
int nparents, color_diff;
@@ -1125,30 +1124,34 @@ static void emit_binary_diff(mmfile_t *one, mmfile_t *two)
static void setup_diff_attr_check(struct git_attr_check *check)
{
static struct git_attr *attr_diff;
+ static struct git_attr *attr_coding;
if (!attr_diff) {
attr_diff = git_attr("diff", 4);
+ attr_coding = git_attr("coding", 6);
}
check[0].attr = attr_diff;
+ check[1].attr = attr_coding;
}
static void diff_filespec_check_attr(struct diff_filespec *one)
{
- struct git_attr_check attr_diff_check;
+ struct git_attr_check attr_diff_check[2];
int check_from_data = 0;
if (one->checked_attr)
return;
- setup_diff_attr_check(&attr_diff_check);
+ setup_diff_attr_check(attr_diff_check);
one->is_binary = 0;
one->funcname_pattern_ident = NULL;
+ one->coding = NULL;
- if (!git_checkattr(one->path, 1, &attr_diff_check)) {
+ if (!git_checkattr(one->path, 2, attr_diff_check)) {
const char *value;
/* binaryness */
- value = attr_diff_check.value;
+ value = attr_diff_check[0].value;
if (ATTR_TRUE(value))
;
else if (ATTR_FALSE(value))
@@ -1161,6 +1164,15 @@ static void diff_filespec_check_attr(struct diff_filespec *one)
;
else
one->funcname_pattern_ident = value;
+
+ /* coding */
+ value = attr_diff_check[1].value;
+ if (ATTR_TRUE(value) ||
+ ATTR_FALSE(value) ||
+ ATTR_UNSET(value))
+ ;
+ else
+ one->coding = value;
}
if (check_from_data) {
@@ -1233,6 +1245,27 @@ static const char *diff_funcname_pattern(struct diff_filespec *one)
return NULL;
}
+static struct {
+ const char *coding;
+ sane_truncate_fn fn;
+} builtin_truncate_fn[] = {
+ { "euc-jp", truncate_euc_jp },
+ { "utf-8", NULL },
+};
+
+static sane_truncate_fn diff_truncate_line_fn(struct diff_filespec *one)
+{
+ int i;
+
+ diff_filespec_check_attr(one);
+ if (!one->coding)
+ return NULL;
+ for (i = 0; i < ARRAY_SIZE(builtin_truncate_fn); i++)
+ if (!strcmp(one->coding, builtin_truncate_fn[i].coding))
+ return builtin_truncate_fn[i].fn;
+ return NULL;
+}
+
static void builtin_diff(const char *name_a,
const char *name_b,
struct diff_filespec *one,
@@ -1318,6 +1351,9 @@ static void builtin_diff(const char *name_a,
ecbdata.color_diff = DIFF_OPT_TST(o, COLOR_DIFF);
ecbdata.found_changesp = &o->found_changes;
ecbdata.ws_rule = whitespace_rule(name_b ? name_b : name_a);
+ ecbdata.truncate = diff_truncate_line_fn(one);
+ if (!ecbdata.truncate)
+ ecbdata.truncate = diff_truncate_line_fn(two);
xpp.flags = XDF_NEED_MINIMAL | o->xdl_opts;
xecfg.ctxlen = o->context;
xecfg.flags = XDL_EMIT_FUNCNAMES;
diff --git a/diffcore.h b/diffcore.h
index cc96c20..711a4df 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -28,6 +28,7 @@ struct diff_filespec {
void *data;
void *cnt_data;
const char *funcname_pattern_ident;
+ const char *coding; /* blob text coding */
unsigned long size;
int count; /* Reference count */
int xfrm_flags; /* for use by the xfrm */
--
1.5.4.rc2.13.g690bd
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 3/2] attribute "coding": specify blob encoding
2008-01-02 9:50 ` [PATCH 3/2] attribute "coding": specify blob encoding Junio C Hamano
@ 2008-01-03 21:23 ` しらいしななこ
2008-01-03 21:54 ` Junio C Hamano
0 siblings, 1 reply; 9+ messages in thread
From: しらいしななこ @ 2008-01-03 21:23 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Tsugikazu Shibata, git
Quoting Junio C Hamano <gitster@pobox.com>:
> This teaches "diff hunk header" function about custom character
> encoding per path via gitattributes(5) so that it can sensibly
> chomp a line without truncating a character in the middle.
>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
>
> * This is not intended for serious inclusion, but was done more
> as a demonstration of the concept, hence [3/2].
Why not? It looks a useful addition for us Japanese people.
> +static struct {
> + const char *coding;
> + sane_truncate_fn fn;
> +} builtin_truncate_fn[] = {
> + { "euc-jp", truncate_euc_jp },
> + { "utf-8", NULL },
> +};
Can you also do JIS and Shift JIS? I ask because many of my old notes are in Shift JIS and I think it is the same for many other people.
--
Nanako Shiraishi
http://ivory.ap.teacup.com/nanako3/
----------------------------------------------------------------------
Get a free email account with anti spam protection.
http://www.bluebottle.com/tag/2
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 3/2] attribute "coding": specify blob encoding
2008-01-03 21:23 ` しらいしななこ
@ 2008-01-03 21:54 ` Junio C Hamano
2008-01-04 16:16 ` Tsugikazu Shibata
0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2008-01-03 21:54 UTC (permalink / raw)
To: しらいしななこ
Cc: Tsugikazu Shibata, git
しらいしななこ <nanako3@bluebottle.com> writes:
> Quoting Junio C Hamano <gitster@pobox.com>:
>
>> This teaches "diff hunk header" function about custom character
>> encoding per path via gitattributes(5) so that it can sensibly
>> chomp a line without truncating a character in the middle.
>>
>> Signed-off-by: Junio C Hamano <gitster@pobox.com>
>> ---
>>
>> * This is not intended for serious inclusion, but was done more
>> as a demonstration of the concept, hence [3/2].
>
> Why not? It looks a useful addition for us Japanese people.
(offtopic) I was once told that "us Japanese people" is a
bad thing to say in public because it sets an unfriendly
tone by creating a psychological divide between "us" and
"others". After all I am one of you ;-)
The reason I do not like the patch as-is is because I have
doubts about the way "coding" acts in the patch.
There already is clean/smudge filter mechanism. Even if your
work tree has files in euc-jp or Shift_JIS, you could choose to
internally use UTF-8 at git object level. Then the part that
deals with diff hunk headers (the topic of the patch we are
discussing) would have to work only on UTF-8 data.
Side note: when getting the data from a file in the work
tree, we convert into internal representation before
running diff (see diff.c::diff_populate_filespec()), but
we do not convert it back to external representation by
running the smudge filter on the diff output. We might
optionally want to but if somebody is going to do this,
the patch accepting side also needs to be modified to
reverse the conversion.
The solution with clean/smudge is not applicable to everybody.
It needs to be agreed upon project-wide what encoding is used as
the canonical encoding for the project, and when the project
chooses to use UTF-8, the above would become a cleaner and
workable approach.
If the project, on the other hand, chooses to use a non UTF-8
encoding (e.g. euc-jp) as the canonical representation,
something like my patch may be necessary.
Between these two ways to skin the cat, I do not want to close
the door for either one of them too early, although I am
somewhat partial to "internally everything is UTF-8" approach.
Maybe we would want to use "coding" (short, sweet and nice name
for an attribute) to mean a canned smudge/clean filter pair that
runs to/from UTF-8 iconv, making the "internally, everything is
UTF-8" approach a more officially supported option. If we
choose to go that route, the way "coding" attribute was used in
the patch directly conflicts with that design, as the world view
my "coding" patch takes is "whatever coding project chooses is
used internally, and the attribute is used to teach coding
specific actions to the underlying logic".
>> +static struct {
>> + const char *coding;
>> + sane_truncate_fn fn;
>> +} builtin_truncate_fn[] = {
>> + { "euc-jp", truncate_euc_jp },
>> + { "utf-8", NULL },
>> +};
>
>Can you also do JIS and Shift JIS? I ask because many of my
>old notes are in Shift JIS and I think it is the same for many
>other people.
I guess somebody else could (hint, hint,...). Shift_JIS should
be more or less straightforward to add.
With the current code structure, however, ISO-2022 (you said
"JIS" -- Japanese often use that word to mean 7-bit ISO-2022 and
so did you in this context) is a bit cumbersome to handle, as
you cannot just truncate but also have to add a few escape bytes
to go back to ASCII at the end of line.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 3/2] attribute "coding": specify blob encoding
2008-01-03 21:54 ` Junio C Hamano
@ 2008-01-04 16:16 ` Tsugikazu Shibata
2008-01-04 20:53 ` Junio C Hamano
0 siblings, 1 reply; 9+ messages in thread
From: Tsugikazu Shibata @ 2008-01-04 16:16 UTC (permalink / raw)
To: gitster; +Cc: nanako3, git, tshibata
On Thu, 03 Jan 2008 13:54:58 -0800, gitster wrote:
> しらいしななこ <nanako3@bluebottle.com> writes:
>
> > Quoting Junio C Hamano <gitster@pobox.com>:
> >
> >> This teaches "diff hunk header" function about custom character
> >> encoding per path via gitattributes(5) so that it can sensibly
> >> chomp a line without truncating a character in the middle.
> >>
> >> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> >> ---
> >>
> >> * This is not intended for serious inclusion, but was done more
> >> as a demonstration of the concept, hence [3/2].
> >
...
> >> +static struct {
> >> + const char *coding;
> >> + sane_truncate_fn fn;
> >> +} builtin_truncate_fn[] = {
> >> + { "euc-jp", truncate_euc_jp },
> >> + { "utf-8", NULL },
> >> +};
> >
> >Can you also do JIS and Shift JIS? I ask because many of my
> >old notes are in Shift JIS and I think it is the same for many
> >other people.
>
> I guess somebody else could (hint, hint,...). Shift_JIS should
> be more or less straightforward to add.
>
> With the current code structure, however, ISO-2022 (you said
> "JIS" -- Japanese often use that word to mean 7-bit ISO-2022 and
> so did you in this context) is a bit cumbersome to handle, as
> you cannot just truncate but also have to add a few escape bytes
> to go back to ASCII at the end of line.
I guess that there are many other encodings and support everything are
not reasonable.
In my experience, It seems that chopping a multi-byte character
occurs the field after "@@". I believe this field is for the name of
the appropriate function. Also, I believe most of the function names
are in ASCII.
So, question is why we should think of this field in case of non
programming language files ?
In case of text file using any coding, Should we need to add something
after "@@"?
How about not to add anything after "@@" when the file name was .txt
or no extension (ie. HOWTO)?
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 3/2] attribute "coding": specify blob encoding
2008-01-04 16:16 ` Tsugikazu Shibata
@ 2008-01-04 20:53 ` Junio C Hamano
0 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2008-01-04 20:53 UTC (permalink / raw)
To: Tsugikazu Shibata; +Cc: nanako3, git
Tsugikazu Shibata <tshibata@ab.jp.nec.com> writes:
> So, question is why we should think of this field in case of non
> programming language files ?
> In case of text file using any coding, Should we need to add something
> after "@@"?
See gitattributes(5), "Defining a custom hunk-header". The
associated configuration variable for this feature is somewhat
misnamed [*1*], but if you do not want to see anything, I think
you should be able to define a custom pattern that match
emptyness.
For .txt files, I always hoped that someday, somebody clever
would come up with a custom pattern to grok AsciiDoc mark-up to
pick a label-looking line so that the hunks in patches to our
documentation would be easier to find. A simpler and more
generally applicable possibility is to find the previous line
that starts a paragraph (e.g. you would pick "For .txt files..."
for this part because it is the non-blank line that immediately
follows the first blank line before this part). But I think the
code to find an appropriate string for hunk header is not
capable of looking at more than one line, so neither of these
would be easy to implement without changing the underlying logic
and semantics of "diff.*.funcname" configuration.
[Footnote]
*1* It may be misnamed but it is in line with GNU diff, though.
They call -p "show-c-function" and -F "show-function-line" --
both assume function names are what people are interested in
having on hunk headers.
^ permalink raw reply [flat|nested] 9+ messages in thread