* [PATCH v3] safecrlf: Add mechanism to warn about irreversible crlf conversions
@ 2008-01-13 16:30 Steffen Prohaska
2008-01-13 22:02 ` Dmitry Potapov
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Steffen Prohaska @ 2008-01-13 16:30 UTC (permalink / raw)
To: gitster; +Cc: dpotapov, torvalds, git, Steffen Prohaska
This one could be applied.
I mentioned earlier that crlf_to_git() would be called twice. Unfortunately,
I can't reproduce this behaviour and are not even sure if it ever happend.
Maybe it was only an illusion in my terminal?
Steffen
---- snip snap ---
CRLF conversion bears a slight chance of corrupting data.
autocrlf=true will convert CRLF to LF during commit and LF to
CRLF during checkout. A file that containes a mixture of LF and
CRLF before the commit cannot be recreated by git. For text
files this does not really matter because we do not care about
the line endings anyway; but for binary files that are
accidentally classified as text the conversion can result in
corrupted data.
If you recognize such corruption during commit you can easily fix
it by setting the conversion type explicitly in .gitattributes.
Right after committing you still have the original file in your
work tree and this file is not yet corrupted.
However, in mixed Windows/Unix environments text files quite
easily can end up containing a mixture of CRLF and LF line
endings and git should handle such situations gracefully. For
example a user could copy a CRLF file from Windows to Unix and
mix it with an existing LF file there. The result would contain
both types of line endings.
Unfortunately, the desired effect of cleaning up text files
with mixed lineendings and undesired effect of corrupting binary
files can not be distinguished. In both cases CRLF are removed
in an irreversible way. For text files this is the right thing
to do, while for binary file its corrupting data.
In a sane environment committing and checking out the same file
should not modify the origin file in the work tree. For
autocrlf=input the original file must not contain CRLF. For
autocrlf=true the original file must not contain LF without
preceding CR. Otherwise the conversion is irreversible. Note,
git might be able to recreate the original file with different
autocrlf settings, but in the current environment checking out
will yield a file that differs from the file before the commit.
This patch adds a mechanism that can either warn the user about
an irreversible conversion or can even refuse to convert. The
mechanism is controlled by the variable core.safecrlf, with the
following values
- false: disable safecrlf mechanism
- warn: warn about irreversible conversions
- true: refuse irreversible conversions
The default is to warn.
The concept of a safety check was originally proposed in a similar
way by Linus Torvalds. Thanks to Dimitry Potapov for insisting
on getting the naked LF/autocrlf=true case right.
Signed-off-by: Steffen Prohaska <prohaska@zib.de>
---
Documentation/config.txt | 21 ++++++++++++++++++
Documentation/gitattributes.txt | 5 ++++
cache.h | 8 +++++++
config.c | 9 +++++++
convert.c | 28 +++++++++++++++++++++--
environment.c | 1 +
t/t0020-crlf.sh | 45 +++++++++++++++++++++++++++++++++++++++
7 files changed, 114 insertions(+), 3 deletions(-)
diff --git a/Documentation/config.txt b/Documentation/config.txt
index df091d1..d2a9fdd 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -139,6 +139,27 @@ core.autocrlf::
"text" (i.e. be subjected to the autocrlf mechanism) is
decided purely based on the contents.
+core.safecrlf::
+ If true, makes git check if converting `CRLF` as controlled by
+ `core.autocrlf` is reversible. Git will check if committing a
+ file followed by checking out the same file will yield the
+ original file in the work tree. If this is not the case for
+ the current setting of `core.autocrlf`, git will reject the
+ file. The variable can be set to "warn", in which case git
+ will only warn about an irreversible conversion but accept the
+ file.
++
+Note, this safety check does not mean that a checkout will generate a
+file identical to the original file for different settings of
+`core.autocrlf`, but only for the current one. For example, a text
+file with `LF` would be accepted with `core.autocrlf=input` and could
+later be checked out with `core.autocrlf=true`, in which case the
+resulting file would contain `CRLF`, although the original file
+contained `LF`. However, in both work trees the line endings would be
+consistent, that is either all `LF` or all `CRLF`, but never mixed. A
+file with mixed line endings would be reported by the `core.safecrlf`
+mechanism.
+
core.symlinks::
If false, symbolic links are checked out as small plain files that
contain the link text. linkgit:git-update-index[1] and
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 35a29fd..2ffe1fb 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -133,6 +133,11 @@ When `core.autocrlf` is set to "input", line endings are
converted to LF upon checkin, but there is no conversion done
upon checkout.
+If `core.safecrlf` is set to "true" or "warn", git verifies if
+the conversion is reversible for the current setting of
+`core.autocrlf`. For "true", git rejects irreversible
+conversions; for "warn", git only prints a warning but accepts
+an irreversible conversion.
`ident`
^^^^^^^
diff --git a/cache.h b/cache.h
index 39331c2..4e03e3d 100644
--- a/cache.h
+++ b/cache.h
@@ -330,6 +330,14 @@ extern size_t packed_git_limit;
extern size_t delta_base_cache_limit;
extern int auto_crlf;
+enum safe_crlf {
+ SAFE_CRLF_FALSE = 0,
+ SAFE_CRLF_FAIL = 1,
+ SAFE_CRLF_WARN = 2,
+};
+
+extern enum safe_crlf safe_crlf;
+
#define GIT_REPO_VERSION 0
extern int repository_format_version;
extern int check_repository_format(void);
diff --git a/config.c b/config.c
index 857deb6..0a46046 100644
--- a/config.c
+++ b/config.c
@@ -407,6 +407,15 @@ int git_default_config(const char *var, const char *value)
return 0;
}
+ if (!strcmp(var, "core.safecrlf")) {
+ if (value && !strcasecmp(value, "warn")) {
+ safe_crlf = SAFE_CRLF_WARN;
+ return 0;
+ }
+ safe_crlf = git_config_bool(var, value);
+ return 0;
+ }
+
if (!strcmp(var, "user.name")) {
strlcpy(git_default_name, value, sizeof(git_default_name));
return 0;
diff --git a/convert.c b/convert.c
index 4df7559..c9678ee 100644
--- a/convert.c
+++ b/convert.c
@@ -90,9 +90,6 @@ static int crlf_to_git(const char *path, const char *src, size_t len,
return 0;
gather_stats(src, len, &stats);
- /* No CR? Nothing to convert, regardless. */
- if (!stats.cr)
- return 0;
if (action == CRLF_GUESS) {
/*
@@ -110,6 +107,20 @@ static int crlf_to_git(const char *path, const char *src, size_t len,
return 0;
}
+ if (safe_crlf && auto_crlf > 0 && action != CRLF_INPUT) {
+ /* CRLFs would be added by checkout: check if we have "naked" LFs */
+ if (stats.lf != stats.crlf) {
+ if (safe_crlf == SAFE_CRLF_WARN)
+ warning("Checkout will replace LFs with CRLF in %s", path);
+ else
+ die("Checkout would replace LFs with CRLF in %s", path);
+ }
+ }
+
+ /* Optimization: No CR? Nothing to convert, regardless. */
+ if (!stats.cr)
+ return 0;
+
/* only grow if not in place */
if (strbuf_avail(buf) + buf->len < len)
strbuf_grow(buf, len - buf->len);
@@ -132,6 +143,17 @@ static int crlf_to_git(const char *path, const char *src, size_t len,
*dst++ = c;
} while (--len);
}
+
+ if (safe_crlf && (action == CRLF_INPUT || auto_crlf <= 0)) {
+ /* CRLFs would not be restored by checkout: check if we removed CRLFs */
+ if (buf->len != dst - buf->buf) {
+ if (safe_crlf == SAFE_CRLF_WARN)
+ warning("Stripped CRLF from %s.", path);
+ else
+ die("Refusing to strip CRLF from %s.", path);
+ }
+ }
+
strbuf_setlen(buf, dst - buf->buf);
return 1;
}
diff --git a/environment.c b/environment.c
index 18a1c4e..e351e99 100644
--- a/environment.c
+++ b/environment.c
@@ -35,6 +35,7 @@ int pager_use_color = 1;
char *editor_program;
char *excludes_file;
int auto_crlf = 0; /* 1: both ways, -1: only when adding git objects */
+enum safe_crlf safe_crlf = SAFE_CRLF_WARN;
unsigned whitespace_rule_cfg = WS_DEFAULT_RULE;
/* This is set by setup_git_dir_gently() and/or git_default_config() */
diff --git a/t/t0020-crlf.sh b/t/t0020-crlf.sh
index 89baebd..e2e0f7b 100755
--- a/t/t0020-crlf.sh
+++ b/t/t0020-crlf.sh
@@ -8,6 +8,10 @@ q_to_nul () {
tr Q '\000'
}
+q_to_cr () {
+ tr Q '\015'
+}
+
append_cr () {
sed -e 's/$/Q/' | tr Q '\015'
}
@@ -42,6 +46,47 @@ test_expect_success setup '
echo happy.
'
+test_expect_failure 'safecrlf: autocrlf=input, all CRLF' '
+
+ git repo-config core.autocrlf input &&
+ git repo-config core.safecrlf true &&
+
+ for w in I am all CRLF; do echo $w; done | append_cr >allcrlf &&
+ git add allcrlf
+'
+
+test_expect_failure 'safecrlf: autocrlf=input, mixed LF/CRLF' '
+
+ git repo-config core.autocrlf input &&
+ git repo-config core.safecrlf true &&
+
+ for w in Oh here is CRLFQ in text; do echo $w; done | q_to_cr >mixed &&
+ git add mixed
+'
+
+test_expect_failure 'safecrlf: autocrlf=true, all LF' '
+
+ git repo-config core.autocrlf true &&
+ git repo-config core.safecrlf true &&
+
+ for w in I am all LF; do echo $w; done >alllf &&
+ git add alllf
+'
+
+test_expect_failure 'safecrlf: autocrlf=true mixed LF/CRLF' '
+
+ git repo-config core.autocrlf true &&
+ git repo-config core.safecrlf true &&
+
+ for w in Oh here is CRLFQ in text; do echo $w; done | q_to_cr >mixed &&
+ git add mixed
+'
+
+test_expect_success 'switch off autocrlf, safecrlf' '
+ git repo-config core.autocrlf false &&
+ git repo-config core.safecrlf false
+'
+
test_expect_success 'update with autocrlf=input' '
rm -f tmp one dir/two three &&
--
1.5.4.rc2.60.g46ee
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH v3] safecrlf: Add mechanism to warn about irreversible crlf conversions
2008-01-13 16:30 [PATCH v3] safecrlf: Add mechanism to warn about irreversible crlf conversions Steffen Prohaska
@ 2008-01-13 22:02 ` Dmitry Potapov
2008-01-13 22:13 ` Dmitry Potapov
2008-01-13 23:46 ` [PATCH] Don't display crlf warning twice Dmitry Potapov
2 siblings, 0 replies; 15+ messages in thread
From: Dmitry Potapov @ 2008-01-13 22:02 UTC (permalink / raw)
To: Steffen Prohaska; +Cc: gitster, torvalds, git
Hi Steffan,
The patch looks good to me. So:
Acked-by: Dmitry Potapov <dpotapov@gmail.com>
Dmitry
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v3] safecrlf: Add mechanism to warn about irreversible crlf conversions
2008-01-13 16:30 [PATCH v3] safecrlf: Add mechanism to warn about irreversible crlf conversions Steffen Prohaska
2008-01-13 22:02 ` Dmitry Potapov
@ 2008-01-13 22:13 ` Dmitry Potapov
2008-01-13 23:46 ` [PATCH] Don't display crlf warning twice Dmitry Potapov
2 siblings, 0 replies; 15+ messages in thread
From: Dmitry Potapov @ 2008-01-13 22:13 UTC (permalink / raw)
To: Steffen Prohaska; +Cc: gitster, torvalds, git
Hi Steffan,
It looks like I was to quick to ack your patch
On Sun, Jan 13, 2008 at 05:30:47PM +0100, Steffen Prohaska wrote:
>
> I mentioned earlier that crlf_to_git() would be called twice. Unfortunately,
> I can't reproduce this behaviour and are not even sure if it ever happend.
that happened to me right now. My setting is autocrlf = input.
and file that I am adding has mixing line-ending:
===
foo
bar\r
===
Dmitry
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH] Don't display crlf warning twice
2008-01-13 16:30 [PATCH v3] safecrlf: Add mechanism to warn about irreversible crlf conversions Steffen Prohaska
2008-01-13 22:02 ` Dmitry Potapov
2008-01-13 22:13 ` Dmitry Potapov
@ 2008-01-13 23:46 ` Dmitry Potapov
2008-01-14 6:17 ` Steffen Prohaska
2 siblings, 1 reply; 15+ messages in thread
From: Dmitry Potapov @ 2008-01-13 23:46 UTC (permalink / raw)
To: Steffen Prohaska; +Cc: gitster, torvalds, git, Dmitry Potapov
'git add' could call crlf_to_git() twice, and this caused that the same
crlf warning being display twice. The first time crlf_to_git() is called
when a file is added to index, and it could be called the second time
during writing the index.
This patches sets safe_crlf to false before the second call.
Signed-off-by: Dmitry Potapov <dpotapov@gmail.com>
---
On Sun, Jan 13, 2008 at 05:30:47PM +0100, Steffen Prohaska wrote:
>_
> I mentioned earlier that crlf_to_git() would be called twice. Unfortunately,
> I can't reproduce this behaviour and are not even sure if it ever happend.
I think I have found the cause. It can be seen from the following trace:
==============
#0 crlf_to_git (path=0x814b37e "a", src=0xb7fb4000 "Hello\r\nHello\r\nHello\n", len=20, buf=0xbfad504c, action=-1) at convert.c:89
#1 0x080e0454 in convert_to_git (path=0x814b37e "a", src=0xb7fb4000 "Hello\r\nHello\r\nHello\n", len=20, dst=0xbfad504c) at convert.c:578
#2 0x080b7194 in index_fd (sha1=0xbfad508c "", fd=7, st=0xbfad50d8, write_object=0, type=OBJ_BLOB, path=0x814b37e "a") at sha1_file.c:2345
#3 0x080a7dd4 in ce_compare_data (ce=0x814b340, st=0xbfad50d8) at read-cache.c:56
#4 0x080a8045 in ce_modified_check_fs (ce=0x814b340, st=0xbfad50d8) at read-cache.c:111
#5 0x080aa66d in ce_smudge_racily_clean_entry (ce=0x814b340) at read-cache.c:1121
#6 0x080aa79d in write_index (istate=0x814a3e0, newfd=6) at read-cache.c:1177
#7 0x0804c66a in cmd_add (argc=1, argv=0xbfad6408, prefix=0x0) at builtin-add.c:261
==============
#0 crlf_to_git (path=0x814b094 "a", src=0xb7fb4000 "Hello\r\nHello\r\nHello\n", len=20, buf=0xbfad50ec, action=-1) at convert.c:89
#1 0x080e0454 in convert_to_git (path=0x814b094 "a", src=0xb7fb4000 "Hello\r\nHello\r\nHello\n", len=20, dst=0xbfad50ec) at convert.c:578
#2 0x080b7194 in index_fd (sha1=0x814b368 "", fd=7, st=0xbfad5174, write_object=1, type=OBJ_BLOB, path=0x814b094 "a") at sha1_file.c:2345
#3 0x080b731f in index_path (sha1=0x814b368 "", path=0x814b094 "a", st=0xbfad5174, write_object=1) at sha1_file.c:2377
#4 0x080a8c00 in add_file_to_index (istate=0x814a3e0, path=0x814b094 "a", verbose=0) at read-cache.c:433
#5 0x0804c640 in cmd_add (argc=1, argv=0xbfad6408, prefix=0x0) at builtin-add.c:257
==============
This patch works for me but it certainly needs better testing.
builtin-add.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/builtin-add.c b/builtin-add.c
index 5c29cc2..f113fc1 100644
--- a/builtin-add.c
+++ b/builtin-add.c
@@ -258,9 +258,12 @@ int cmd_add(int argc, const char **argv, const char *prefix)
finish:
if (active_cache_changed) {
+ enum safe_crlf old = safe_crlf;
+ safe_crlf = SAFE_CRLF_FALSE;
if (write_cache(newfd, active_cache, active_nr) ||
close(newfd) || commit_locked_index(&lock_file))
die("Unable to write new index file");
+ safe_crlf = old;
}
return 0;
--
1.5.3.5
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH] Don't display crlf warning twice
2008-01-13 23:46 ` [PATCH] Don't display crlf warning twice Dmitry Potapov
@ 2008-01-14 6:17 ` Steffen Prohaska
2008-01-14 6:40 ` Dmitry Potapov
0 siblings, 1 reply; 15+ messages in thread
From: Steffen Prohaska @ 2008-01-14 6:17 UTC (permalink / raw)
To: Dmitry Potapov; +Cc: gitster, torvalds, git
Dimitry,
Thanks for finding the code path that leads to double printing
the warning.
Your traces reveal that it is a racy condition that can trigger
the double warnings. In my test cases this apparently does not
happen, at least not on my machine.
Do you have a test case that reliably triggers the second call to
convert_to_git()?
On Jan 14, 2008, at 12:46 AM, Dmitry Potapov wrote:
> 'git add' could call crlf_to_git() twice, and this caused that the
> same
> crlf warning being display twice. The first time crlf_to_git() is
> called
> when a file is added to index, and it could be called the second time
> during writing the index.
>
> This patches sets safe_crlf to false before the second call.
This would certainly work.
However I do not think it is the right place to fix ...
> Signed-off-by: Dmitry Potapov <dpotapov@gmail.com>
> ---
>
> On Sun, Jan 13, 2008 at 05:30:47PM +0100, Steffen Prohaska wrote:
>> _
>> I mentioned earlier that crlf_to_git() would be called twice.
>> Unfortunately,
>> I can't reproduce this behaviour and are not even sure if it ever
>> happend.
>
> I think I have found the cause. It can be seen from the following
> trace:
>
> ==============
> #0 crlf_to_git (path=0x814b37e "a", src=0xb7fb4000 "Hello\r\nHello
> \r\nHello\n", len=20, buf=0xbfad504c, action=-1) at convert.c:89
> #1 0x080e0454 in convert_to_git (path=0x814b37e "a",
> src=0xb7fb4000 "Hello\r\nHello\r\nHello\n", len=20, dst=0xbfad504c)
> at convert.c:578
> #2 0x080b7194 in index_fd (sha1=0xbfad508c "", fd=7,
> st=0xbfad50d8, write_object=0, type=OBJ_BLOB, path=0x814b37e "a")
> at sha1_file.c:2345
Here, index_fd() is called with write_object=0 ...
> #3 0x080a7dd4 in ce_compare_data (ce=0x814b340, st=0xbfad50d8) at
> read-cache.c:56
> #4 0x080a8045 in ce_modified_check_fs (ce=0x814b340,
> st=0xbfad50d8) at read-cache.c:111
> #5 0x080aa66d in ce_smudge_racily_clean_entry (ce=0x814b340) at
> read-cache.c:1121
> #6 0x080aa79d in write_index (istate=0x814a3e0, newfd=6) at read-
> cache.c:1177
> #7 0x0804c66a in cmd_add (argc=1, argv=0xbfad6408, prefix=0x0) at
> builtin-add.c:261
> ==============
> #0 crlf_to_git (path=0x814b094 "a", src=0xb7fb4000 "Hello\r\nHello
> \r\nHello\n", len=20, buf=0xbfad50ec, action=-1) at convert.c:89
> #1 0x080e0454 in convert_to_git (path=0x814b094 "a",
> src=0xb7fb4000 "Hello\r\nHello\r\nHello\n", len=20, dst=0xbfad50ec)
> at convert.c:578
> #2 0x080b7194 in index_fd (sha1=0x814b368 "", fd=7, st=0xbfad5174,
> write_object=1, type=OBJ_BLOB, path=0x814b094 "a") at sha1_file.c:2345
while here index_fd() is called with write_object=1.
> #3 0x080b731f in index_path (sha1=0x814b368 "", path=0x814b094
> "a", st=0xbfad5174, write_object=1) at sha1_file.c:2377
> #4 0x080a8c00 in add_file_to_index (istate=0x814a3e0,
> path=0x814b094 "a", verbose=0) at read-cache.c:433
> #5 0x0804c640 in cmd_add (argc=1, argv=0xbfad6408, prefix=0x0) at
> builtin-add.c:257
> ==============
Without digging deeply through the code, this looks to me as if
we could base our decision wether to print a warning or not on
this difference.
My first take is: we should print a warning if write_object=1
and should be quiet if write_object=0. The flag can easily be
passed to convert_to_git().
My assumption is that users would only be interested in the
warning if data is really written to the repository (and that
this is the meaning of write_object=1). But maybe there are
cases where a warning is also appropriate even if the data
is not written? I don't now.
Another question is if we should die() even if write_object=0;
or only if write_object=1?
Steffen
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] Don't display crlf warning twice
2008-01-14 6:17 ` Steffen Prohaska
@ 2008-01-14 6:40 ` Dmitry Potapov
2008-01-14 6:53 ` Steffen Prohaska
0 siblings, 1 reply; 15+ messages in thread
From: Dmitry Potapov @ 2008-01-14 6:40 UTC (permalink / raw)
To: Steffen Prohaska; +Cc: gitster, torvalds, git
On Mon, Jan 14, 2008 at 07:17:26AM +0100, Steffen Prohaska wrote:
>
> Your traces reveal that it is a racy condition that can trigger
> the double warnings.
It is not a racy condition. It is just another branch in the
code, which happens when an existing file is modified. Your
tests always added a new file, so they could not reproduce
the problem.
>
> Do you have a test case that reliably triggers the second call to
> convert_to_git()?
===========================================
git config core.autocrlf input
echo Hello > foo
git add foo
git commit -m 'add foo'
echo Hello >> foo
unix2dos foo
# This should trigger double crlf warning
git add foo
===========================================
Dmitry
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] Don't display crlf warning twice
2008-01-14 6:40 ` Dmitry Potapov
@ 2008-01-14 6:53 ` Steffen Prohaska
2008-01-14 23:20 ` [PATCH] safecrlf: Add flag to convert_to_git() to disable safecrlf check Steffen Prohaska
0 siblings, 1 reply; 15+ messages in thread
From: Steffen Prohaska @ 2008-01-14 6:53 UTC (permalink / raw)
To: Dmitry Potapov; +Cc: gitster, torvalds, git
On Jan 14, 2008, at 7:40 AM, Dmitry Potapov wrote:
> On Mon, Jan 14, 2008 at 07:17:26AM +0100, Steffen Prohaska wrote:
>>
>> Your traces reveal that it is a racy condition that can trigger
>> the double warnings.
>
> It is not a racy condition. It is just another branch in the
> code, which happens when an existing file is modified. Your
> tests always added a new file, so they could not reproduce
> the problem.
>
>>
>> Do you have a test case that reliably triggers the second call to
>> convert_to_git()?
>
> ===========================================
> git config core.autocrlf input
>
> echo Hello > foo
> git add foo
> git commit -m 'add foo'
>
> echo Hello >> foo
> unix2dos foo
> # This should trigger double crlf warning
> git add foo
> ===========================================
Yes, this reproduces the problem.
Thanks,
Steffen
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH] safecrlf: Add flag to convert_to_git() to disable safecrlf check
2008-01-14 6:53 ` Steffen Prohaska
@ 2008-01-14 23:20 ` Steffen Prohaska
2008-01-14 23:58 ` Junio C Hamano
2008-01-15 10:26 ` Dmitry Potapov
0 siblings, 2 replies; 15+ messages in thread
From: Steffen Prohaska @ 2008-01-14 23:20 UTC (permalink / raw)
To: dpotapov; +Cc: gitster, torvalds, git, Steffen Prohaska
This commit goes on top of
"[PATCH v3] safecrlf: Add mechanism to warn about irreversible crlf conversions"
I send it as a separate patch to make reviewing easier.
Eventually it should be squashed.
I looked briefly at the various places where convert_to_git() is
called. I think that only the one code path through index_fd()
actually writes data to the repsitory. Maybe someone else with
a better understanding of git's internals should confirm this.
Steffen
---- snip ---
We want to verify if an autocrlf conversion is reversible only if
the converted data is actually written to the repository. Only
in this case the file would be modified during the next checkout.
But convert_to_git() is used for some other purposes.
This commit adds a flag to convert_to_git() that controls if the
safecrlf check is enabled. The check is only enabled if
index_fd() writes the object converted. In all other cases the
check is disabled.
Signed-off-by: Steffen Prohaska <prohaska@zib.de>
---
builtin-apply.c | 2 +-
builtin-blame.c | 2 +-
cache.h | 2 +-
convert.c | 10 +++++-----
diff.c | 2 +-
sha1_file.c | 2 +-
6 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/builtin-apply.c b/builtin-apply.c
index d57bb6e..9fb6fe3 100644
--- a/builtin-apply.c
+++ b/builtin-apply.c
@@ -1430,7 +1430,7 @@ static int read_old_data(struct stat *st, const char *path, struct strbuf *buf)
case S_IFREG:
if (strbuf_read_file(buf, path, st->st_size) != st->st_size)
return error("unable to open or read %s", path);
- convert_to_git(path, buf->buf, buf->len, buf);
+ convert_to_git(path, buf->buf, buf->len, buf, 0);
return 0;
default:
return -1;
diff --git a/builtin-blame.c b/builtin-blame.c
index 9b4c02e..c361ee1 100644
--- a/builtin-blame.c
+++ b/builtin-blame.c
@@ -2073,7 +2073,7 @@ static struct commit *fake_working_tree_commit(const char *path, const char *con
if (strbuf_read(&buf, 0, 0) < 0)
die("read error %s from stdin", strerror(errno));
}
- convert_to_git(path, buf.buf, buf.len, &buf);
+ convert_to_git(path, buf.buf, buf.len, &buf, 0);
origin->file.ptr = buf.buf;
origin->file.size = buf.len;
pretend_sha1_file(buf.buf, buf.len, OBJ_BLOB, origin->blob_sha1);
diff --git a/cache.h b/cache.h
index 4e03e3d..43e7d0e 100644
--- a/cache.h
+++ b/cache.h
@@ -640,7 +640,7 @@ extern void trace_argv_printf(const char **argv, const char *format, ...);
/* convert.c */
/* returns 1 if *dst was used */
-extern int convert_to_git(const char *path, const char *src, size_t len, struct strbuf *dst);
+extern int convert_to_git(const char *path, const char *src, size_t len, struct strbuf *dst, int checksafe);
extern int convert_to_working_tree(const char *path, const char *src, size_t len, struct strbuf *dst);
/* add */
diff --git a/convert.c b/convert.c
index c9678ee..5adef4f 100644
--- a/convert.c
+++ b/convert.c
@@ -81,7 +81,7 @@ static int is_binary(unsigned long size, struct text_stat *stats)
}
static int crlf_to_git(const char *path, const char *src, size_t len,
- struct strbuf *buf, int action)
+ struct strbuf *buf, int action, int checksafe)
{
struct text_stat stats;
char *dst;
@@ -107,7 +107,7 @@ static int crlf_to_git(const char *path, const char *src, size_t len,
return 0;
}
- if (safe_crlf && auto_crlf > 0 && action != CRLF_INPUT) {
+ if (safe_crlf && checksafe && auto_crlf > 0 && action != CRLF_INPUT) {
/* CRLFs would be added by checkout: check if we have "naked" LFs */
if (stats.lf != stats.crlf) {
if (safe_crlf == SAFE_CRLF_WARN)
@@ -144,7 +144,7 @@ static int crlf_to_git(const char *path, const char *src, size_t len,
} while (--len);
}
- if (safe_crlf && (action == CRLF_INPUT || auto_crlf <= 0)) {
+ if (safe_crlf && checksafe && (action == CRLF_INPUT || auto_crlf <= 0)) {
/* CRLFs would not be restored by checkout: check if we removed CRLFs */
if (buf->len != dst - buf->buf) {
if (safe_crlf == SAFE_CRLF_WARN)
@@ -553,7 +553,7 @@ static int git_path_check_ident(const char *path, struct git_attr_check *check)
return !!ATTR_TRUE(value);
}
-int convert_to_git(const char *path, const char *src, size_t len, struct strbuf *dst)
+int convert_to_git(const char *path, const char *src, size_t len, struct strbuf *dst, int checksafe)
{
struct git_attr_check check[3];
int crlf = CRLF_GUESS;
@@ -575,7 +575,7 @@ int convert_to_git(const char *path, const char *src, size_t len, struct strbuf
src = dst->buf;
len = dst->len;
}
- ret |= crlf_to_git(path, src, len, dst, crlf);
+ ret |= crlf_to_git(path, src, len, dst, crlf, checksafe);
if (ret) {
src = dst->buf;
len = dst->len;
diff --git a/diff.c b/diff.c
index b18c140..b5ac4f9 100644
--- a/diff.c
+++ b/diff.c
@@ -1624,7 +1624,7 @@ int diff_populate_filespec(struct diff_filespec *s, int size_only)
* Convert from working tree format to canonical git format
*/
strbuf_init(&buf, 0);
- if (convert_to_git(s->path, s->data, s->size, &buf)) {
+ if (convert_to_git(s->path, s->data, s->size, &buf, 0)) {
size_t size = 0;
munmap(s->data, s->size);
s->should_munmap = 0;
diff --git a/sha1_file.c b/sha1_file.c
index 6583797..3b5413d 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2342,7 +2342,7 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object,
if ((type == OBJ_BLOB) && S_ISREG(st->st_mode)) {
struct strbuf nbuf;
strbuf_init(&nbuf, 0);
- if (convert_to_git(path, buf, size, &nbuf)) {
+ if (convert_to_git(path, buf, size, &nbuf, write_object)) {
munmap(buf, size);
buf = strbuf_detach(&nbuf, &size);
re_allocated = 1;
--
1.5.4.rc2.60.g46ee
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH] safecrlf: Add flag to convert_to_git() to disable safecrlf check
2008-01-14 23:20 ` [PATCH] safecrlf: Add flag to convert_to_git() to disable safecrlf check Steffen Prohaska
@ 2008-01-14 23:58 ` Junio C Hamano
2008-01-15 20:52 ` Steffen Prohaska
2008-01-15 10:26 ` Dmitry Potapov
1 sibling, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2008-01-14 23:58 UTC (permalink / raw)
To: Steffen Prohaska; +Cc: dpotapov, gitster, torvalds, git
Steffen Prohaska <prohaska@zib.de> writes:
> We want to verify if an autocrlf conversion is reversible only if
> the converted data is actually written to the repository. Only
> in this case the file would be modified during the next checkout.
> But convert_to_git() is used for some other purposes.
> This commit adds a flag to convert_to_git() that controls if the
> safecrlf check is enabled...
At first this felt dirty to me as convert_to_git() is not
limited to crlf, but about external vs canonical representation.
The variable name being "checksafe" however makes it much more
palatable. It is clear that it is talking about irreversible
conversion.
When running diff with a work tree file and the index (or a
named tree), we read the work tree file and run convert_to_git()
on it before comparing it with what we have in the object store
(either index or a named tree). When running apply without
touching the index, we also use convert_to_git() on the work
tree file. The patch file is supposed to record the data in
canonical format, I think.
Of course, "git add" on the path will warn or fail with your
patch, but we may somehow want to be warned about the breakage
before "git add" on that path triggers it. Perhaps we can have
a separate "check-work-tree" command that iterates over locally
modified work tree files and runs convert_to_git() with checking
enabled.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] safecrlf: Add flag to convert_to_git() to disable safecrlf check
2008-01-14 23:20 ` [PATCH] safecrlf: Add flag to convert_to_git() to disable safecrlf check Steffen Prohaska
2008-01-14 23:58 ` Junio C Hamano
@ 2008-01-15 10:26 ` Dmitry Potapov
2008-01-15 20:39 ` Steffen Prohaska
1 sibling, 1 reply; 15+ messages in thread
From: Dmitry Potapov @ 2008-01-15 10:26 UTC (permalink / raw)
To: Steffen Prohaska; +Cc: gitster, torvalds, git
On Tue, Jan 15, 2008 at 12:20:40AM +0100, Steffen Prohaska wrote:
>
> I looked briefly at the various places where convert_to_git() is
> called. I think that only the one code path through index_fd()
> actually writes data to the repsitory. Maybe someone else with
> a better understanding of git's internals should confirm this.
Your patch is certainly better than my quick hack for git-add,
and perhaps you are right that the check should be done only
when data are written, but it also means that there is no longer
any warning when you are running git diff with the work tree,
which would be useful, because it is what most users do before
adding anything.
However, my real concern is that it seems we have two different
heuristics for binary -- one that is used inside of convert.c
and the other one is buffer_is_binary() in xdiff-interface.c.
So, I am running 'git diff' for some test file, and it says:
diff --git a/foo b/foo
index e965047..c102bdc 100644
Binary files a/foo and b/foo differ
okay, now I want to add this *binary* file, so I run 'git add':
warning: Stripped CRLF from foo.
I imagine a user saying: "What the hell! Why did this stupid Git
strip CRLF from my _binary_ file?"
And the current version of Git, which does not print CRLF warning,
seems to be dangerous, because when 'git diff' told me that it is
a _binary_ file, I expect that Git will put it as *binary*. So,
from the user's point of view, it looks like a bug.
So, I suppose that at least we should make is_binary heuristic in
convert.c more strict than those that is used by diff. Namely, if
there is at least one NUL byte in the buffer, it should be treated
as binary.
Dmitry
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] safecrlf: Add flag to convert_to_git() to disable safecrlf check
2008-01-15 10:26 ` Dmitry Potapov
@ 2008-01-15 20:39 ` Steffen Prohaska
2008-01-15 23:03 ` Dmitry Potapov
0 siblings, 1 reply; 15+ messages in thread
From: Steffen Prohaska @ 2008-01-15 20:39 UTC (permalink / raw)
To: Dmitry Potapov; +Cc: gitster, torvalds, git
On Jan 15, 2008, at 11:26 AM, Dmitry Potapov wrote:
> On Tue, Jan 15, 2008 at 12:20:40AM +0100, Steffen Prohaska wrote:
>>
>> I looked briefly at the various places where convert_to_git() is
>> called. I think that only the one code path through index_fd()
>> actually writes data to the repsitory. Maybe someone else with
>> a better understanding of git's internals should confirm this.
>
> Your patch is certainly better than my quick hack for git-add,
> and perhaps you are right that the check should be done only
> when data are written, but it also means that there is no longer
> any warning when you are running git diff with the work tree,
> which would be useful, because it is what most users do before
> adding anything.
My first concern is to avoid data corruption. But we should also
avoid to unnecessarily bother users with annoying warnings. Thus
I thought that guarding only the code paths that modify data in
an irreversible way is sufficient and hence I only guarded the
code path that writes to the repository. The underlying
assumption is that git checkout is the only way of destroying the
original data. Unless you check out you still have the original
data and can copy them to a safe place.
However, there might be other ways by which git will modify the
files in the work tree. And even commands that never touch the
local file system could be dangerous. For example a user could
run git diff to create a patch, which is sent to someone else and
applied there. The change is committed in the remote repository,
published, and pulled back to the local repository and eventually
this might result in the corruption we originally tried to avoid.
So perhaps we should guard _every_ code path that calls
convert_to_git() in an irreversible way; and only avoid printing
the same warning twice for a single run of git add. This is
exactly what your "quick hack" does.
More in my reply to Junios mail ...
> However, my real concern is that it seems we have two different
> heuristics for binary -- one that is used inside of convert.c
> and the other one is buffer_is_binary() in xdiff-interface.c.
>
> So, I am running 'git diff' for some test file, and it says:
>
> diff --git a/foo b/foo
> index e965047..c102bdc 100644
> Binary files a/foo and b/foo differ
>
> okay, now I want to add this *binary* file, so I run 'git add':
>
> warning: Stripped CRLF from foo.
>
> I imagine a user saying: "What the hell! Why did this stupid Git
> strip CRLF from my _binary_ file?"
>
> And the current version of Git, which does not print CRLF warning,
> seems to be dangerous, because when 'git diff' told me that it is
> ca _binary_ file, I expect that Git will put it as *binary*. So,
> from the user's point of view, it looks like a bug.
>
> So, I suppose that at least we should make is_binary heuristic in
> convert.c more strict than those that is used by diff. Namely, if
> there is at least one NUL byte in the buffer, it should be treated
> as binary.
I think we should do this.
Perhaps we should also place a hint for future developers in
xdiff-interface.c, like the patch below.
Steffen
diff --git a/xdiff-interface.c b/xdiff-interface.c
index 4b8e5cc..baa3664 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -160,6 +160,10 @@ int read_mmfile(mmfile_t *ptr, const char
*filename)
return 0;
}
+/* If you ever modify buffer_is_binary() you should check that
is_binary()
+ in convert.c always uses a stricter heuristic. That is all files
that are
+ classified as binary here should also be classified as binary in
convert.c.
+ */
#define FIRST_FEW_BYTES 8000
int buffer_is_binary(const char *ptr, unsigned long size)
{
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH] safecrlf: Add flag to convert_to_git() to disable safecrlf check
2008-01-14 23:58 ` Junio C Hamano
@ 2008-01-15 20:52 ` Steffen Prohaska
2008-01-15 21:41 ` Steffen Prohaska
0 siblings, 1 reply; 15+ messages in thread
From: Steffen Prohaska @ 2008-01-15 20:52 UTC (permalink / raw)
To: Junio C Hamano; +Cc: dpotapov, torvalds, git
On Jan 15, 2008, at 12:58 AM, Junio C Hamano wrote:
> Steffen Prohaska <prohaska@zib.de> writes:
>
>> We want to verify if an autocrlf conversion is reversible only if
>> the converted data is actually written to the repository. Only
>> in this case the file would be modified during the next checkout.
>> But convert_to_git() is used for some other purposes.
>> This commit adds a flag to convert_to_git() that controls if the
>> safecrlf check is enabled...
>
> At first this felt dirty to me as convert_to_git() is not
> limited to crlf, but about external vs canonical representation.
> The variable name being "checksafe" however makes it much more
> palatable. It is clear that it is talking about irreversible
> conversion.
>
> When running diff with a work tree file and the index (or a
> named tree), we read the work tree file and run convert_to_git()
> on it before comparing it with what we have in the object store
> (either index or a named tree). When running apply without
> touching the index, we also use convert_to_git() on the work
> tree file. The patch file is supposed to record the data in
> canonical format, I think.
>
> Of course, "git add" on the path will warn or fail with your
> patch, but we may somehow want to be warned about the breakage
> before "git add" on that path triggers it. Perhaps we can have
> a separate "check-work-tree" command that iterates over locally
> modified work tree files and runs convert_to_git() with checking
> enabled.
We could certainly have such a command, yet the question remains
when to call it. Do you have in mind calling it when we enter
the work tree, such that all files in the work tree will always
be verified? Running the check once during start up should be
sufficient and we could switch it off for the remainder of the
execution.
We would certainly print all paths with an irreversible conversion
and only die() afterwards if requested by core.safecrlf=true.
All information would be printed at once in an ordered way. This
could be more user friendly than the current approach.
I'll work on this.
Steffen
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] safecrlf: Add flag to convert_to_git() to disable safecrlf check
2008-01-15 20:52 ` Steffen Prohaska
@ 2008-01-15 21:41 ` Steffen Prohaska
2008-01-15 23:23 ` Junio C Hamano
0 siblings, 1 reply; 15+ messages in thread
From: Steffen Prohaska @ 2008-01-15 21:41 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Dmitry Potapov, Linus Torvalds, Git Mailing List
On Jan 15, 2008, at 9:52 PM, Steffen Prohaska wrote:
>> Of course, "git add" on the path will warn or fail with your
>> patch, but we may somehow want to be warned about the breakage
>> before "git add" on that path triggers it. Perhaps we can have
>> a separate "check-work-tree" command that iterates over locally
>> modified work tree files and runs convert_to_git() with checking
>> enabled.
>
> We could certainly have such a command, yet the question remains
> when to call it. Do you have in mind calling it when we enter
> the work tree, such that all files in the work tree will always
> be verified? Running the check once during start up should be
> sufficient and we could switch it off for the remainder of the
> execution.
>
> We would certainly print all paths with an irreversible conversion
> and only die() afterwards if requested by core.safecrlf=true.
> All information would be printed at once in an ordered way. This
> could be more user friendly than the current approach.
>
> I'll work on this.
What is the right way to iterate over the changed files?
Should I copy and adapt the following from wt-status.c?
static void wt_status_print_changed(struct wt_status *s)
{
struct rev_info rev;
init_revisions(&rev, "");
setup_revisions(0, NULL, &rev, NULL);
rev.diffopt.output_format |= DIFF_FORMAT_CALLBACK;
rev.diffopt.format_callback = wt_status_print_changed_cb;
rev.diffopt.format_callback_data = s;
wt_read_cache(s);
run_diff_files(&rev, 0);
}
Steffen
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] safecrlf: Add flag to convert_to_git() to disable safecrlf check
2008-01-15 20:39 ` Steffen Prohaska
@ 2008-01-15 23:03 ` Dmitry Potapov
0 siblings, 0 replies; 15+ messages in thread
From: Dmitry Potapov @ 2008-01-15 23:03 UTC (permalink / raw)
To: Steffen Prohaska; +Cc: gitster, torvalds, git
On Tue, Jan 15, 2008 at 09:39:00PM +0100, Steffen Prohaska wrote:
>
> On Jan 15, 2008, at 11:26 AM, Dmitry Potapov wrote:
>
> >but it also means that there is no longer
> >any warning when you are running git diff with the work tree,
> >which would be useful, because it is what most users do before
> >adding anything.
>
> My first concern is to avoid data corruption. But we should also
> avoid to unnecessarily bother users with annoying warnings.
I don't think we bother users with unnecessary warnings. If there
is a problem then it is better to report it earlier than later,
and we *really* want that the user took some action, i.e. either
to mark this file as binary using .gitattributes or corrected its
endings. However, reporting the same problem twice during running
one command does not look nice...
> Thus
> I thought that guarding only the code paths that modify data in
> an irreversible way is sufficient and hence I only guarded the
> code path that writes to the repository.
This policy makes sense to me, it is just I would prefer if we
could find a way to warn a user a bit earlier...
> The underlying
> assumption is that git checkout is the only way of destroying the
> original data. Unless you check out you still have the original
> data and can copy them to a safe place.
It is obvious to you but it may be not so obvious to a new user.
Running 'git diff' before check in is a common practice, so if
the warning was reported at that moment then he or she would not
check in.
Anyway, I like your solution for being simple and guarding the
most important pass reliably.
Dmitry
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] safecrlf: Add flag to convert_to_git() to disable safecrlf check
2008-01-15 21:41 ` Steffen Prohaska
@ 2008-01-15 23:23 ` Junio C Hamano
0 siblings, 0 replies; 15+ messages in thread
From: Junio C Hamano @ 2008-01-15 23:23 UTC (permalink / raw)
To: Steffen Prohaska; +Cc: Dmitry Potapov, Linus Torvalds, Git Mailing List
Steffen Prohaska <prohaska@zib.de> writes:
> What is the right way to iterate over the changed files?
I think something like
read_cache();
for (i = 0; i < active_nr; i++) {
struct cache_entry *ce = active_cache[i];
struct stat st;
if (!lstat(ce->name, &st) &&
ce_match_stat(ce, &st, 0)) {
/* do your thing */
}
}
would do.
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2008-01-15 23:24 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-13 16:30 [PATCH v3] safecrlf: Add mechanism to warn about irreversible crlf conversions Steffen Prohaska
2008-01-13 22:02 ` Dmitry Potapov
2008-01-13 22:13 ` Dmitry Potapov
2008-01-13 23:46 ` [PATCH] Don't display crlf warning twice Dmitry Potapov
2008-01-14 6:17 ` Steffen Prohaska
2008-01-14 6:40 ` Dmitry Potapov
2008-01-14 6:53 ` Steffen Prohaska
2008-01-14 23:20 ` [PATCH] safecrlf: Add flag to convert_to_git() to disable safecrlf check Steffen Prohaska
2008-01-14 23:58 ` Junio C Hamano
2008-01-15 20:52 ` Steffen Prohaska
2008-01-15 21:41 ` Steffen Prohaska
2008-01-15 23:23 ` Junio C Hamano
2008-01-15 10:26 ` Dmitry Potapov
2008-01-15 20:39 ` Steffen Prohaska
2008-01-15 23:03 ` Dmitry Potapov
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).