git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] is_hfs_dotgit: loosen over-eager match of \u{..47}
@ 2014-12-23  8:45 Jeff King
  2014-12-23 15:24 ` Torsten Bögershausen
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Jeff King @ 2014-12-23  8:45 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Linus Torvalds

Our is_hfs_dotgit function relies on the hackily-implemented
next_hfs_char to give us the next character that an HFS+
filename comparison would look at. It's hacky because it
doesn't implement the full case-folding table of HFS+; it
gives us just enough to see if the path matches ".git".

At the end of next_hfs_char, we use tolower() to convert our
32-bit code point to lowercase. Our tolower() implementation
only takes an 8-bit char, though; it throws away the upper
24 bits. This means we can't have any false negatives for
is_hfs_dotgit. We only care about matching 7-bit ASCII
characters in ".git", and we will correctly process 'G' or
'g'.

However, we _can_ have false positives. Because we throw
away the upper bits, code point \u{0147} (for example) will
look like 'G' and get downcased to 'g'. It's not known
whether a sequence of code points whose truncation ends up
as ".git" is meaningful in any language, but it does not
hurt to be more accurate here. We can just pass out the full
32-bit code point, and compare it manually to the upper and
lowercase characters we care about.

Signed-off-by: Jeff King <peff@peff.net>
---
I saw Linus ask about this on G+. I had done the "no false
negative" analysis when writing the patch, but didn't
consider the false positive.

Another way of accomplishing the same thing is for next_hfs_char to
continue folding case, but _only_ do so for 8-bit code points. Like:

  return (out & 0xffffff00) ? out : tolower(out);

I think the what's below is more obvious (and is actually
how I originally wrote it; I switched to using tolower()
during development to try to make it more readable).

 t/t1450-fsck.sh | 16 ++++++++++++++++
 utf8.c          | 17 ++++++++++++-----
 2 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index 1f04b8a..3f5883d 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -349,6 +349,22 @@ dot-backslash-case .\\\\.GIT\\\\foobar
 dotgit-case-backslash .git\\\\foobar
 EOF
 
+test_expect_success 'fsck allows .Ňit' '
+	(
+		git init not-dotgit &&
+		cd not-dotgit &&
+		echo content >file &&
+		git add file &&
+		git commit -m base &&
+		blob=$(git rev-parse :file) &&
+		printf "100644 blob $blob\t.\\305\\207it" >tree &&
+		tree=$(git mktree <tree) &&
+		git fsck 2>err &&
+		cat err &&
+		! test -s err
+	)
+'
+
 # create a static test repo which is broken by omitting
 # one particular object ($1, which is looked up via rev-parse
 # in the new repository).
diff --git a/utf8.c b/utf8.c
index 9a3f4ad..34a779e 100644
--- a/utf8.c
+++ b/utf8.c
@@ -606,7 +606,7 @@ static ucs_char_t next_hfs_char(const char **in)
 		 * but this is enough to catch anything that will convert
 		 * to ".git"
 		 */
-		return tolower(out);
+		return out;
 	}
 }
 
@@ -614,10 +614,17 @@ int is_hfs_dotgit(const char *path)
 {
 	ucs_char_t c;
 
-	if (next_hfs_char(&path) != '.' ||
-	    next_hfs_char(&path) != 'g' ||
-	    next_hfs_char(&path) != 'i' ||
-	    next_hfs_char(&path) != 't')
+	c = next_hfs_char(&path);
+	if (c != '.')
+		return 0;
+	c = next_hfs_char(&path);
+	if (c != 'g' && c != 'G')
+		return 0;
+	c = next_hfs_char(&path);
+	if (c != 'i' && c != 'I')
+		return 0;
+	c = next_hfs_char(&path);
+	if (c != 't' && c != 'T')
 		return 0;
 	c = next_hfs_char(&path);
 	if (c && !is_dir_sep(c))
-- 
2.2.1.376.gec59d43

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] is_hfs_dotgit: loosen over-eager match of \u{..47}
  2014-12-23  8:45 [PATCH] is_hfs_dotgit: loosen over-eager match of \u{..47} Jeff King
@ 2014-12-23 15:24 ` Torsten Bögershausen
  2014-12-23 18:17   ` Junio C Hamano
  2014-12-23 18:18   ` Jeff King
  2014-12-23 20:14 ` Jonathan Nieder
  2014-12-23 20:31 ` Johannes Sixt
  2 siblings, 2 replies; 10+ messages in thread
From: Torsten Bögershausen @ 2014-12-23 15:24 UTC (permalink / raw)
  To: Jeff King, git; +Cc: Junio C Hamano, Linus Torvalds

On 2014-12-23 09.45, Jeff King wrote:
> Our is_hfs_dotgit function relies on the hackily-implemented
> next_hfs_char to give us the next character that an HFS+
> filename comparison would look at. It's hacky because it
> doesn't implement the full case-folding table of HFS+; it
> gives us just enough to see if the path matches ".git".
> 
> At the end of next_hfs_char, we use tolower() to convert our
> 32-bit code point to lowercase. Our tolower() implementation
> only takes an 8-bit char, though; it throws away the upper
> 24 bits. This means we can't have any false negatives for
> is_hfs_dotgit. We only care about matching 7-bit ASCII
> characters in ".git", and we will correctly process 'G' or
> 'g'.
> 
> However, we _can_ have false positives. Because we throw
> away the upper bits, code point \u{0147} (for example) will
> look like 'G' and get downcased to 'g'. It's not known
> whether a sequence of code points whose truncation ends up
> as ".git" is meaningful in any language, but it does not
> hurt to be more accurate here. We can just pass out the full
> 32-bit code point, and compare it manually to the upper and
> lowercase characters we care about.
> 
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> I saw Linus ask about this on G+. I had done the "no false
> negative" analysis when writing the patch, but didn't
> consider the false positive.
> 
> Another way of accomplishing the same thing is for next_hfs_char to
> continue folding case, but _only_ do so for 8-bit code points. Like:
> 
Don't we have the same possible problem under NTFS?
Under Linux + VFAT ?
Under all OS + VFAT ?


And would it make sense to turn this
>   return (out & 0xffffff00) ? out : tolower(out);
into this:
static ucs_char_t unicode_tolower(ucs_char_t ch) {
   return (ch & 0xffffff00) ? ch : tolower(ch);
}


And what happens if I export NTFS to Mac OS X?
(Other combinations possible)
Shouldn't fsck under all OS warn for NTFS and hfs possible attacks ?
 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] is_hfs_dotgit: loosen over-eager match of \u{..47}
  2014-12-23 15:24 ` Torsten Bögershausen
@ 2014-12-23 18:17   ` Junio C Hamano
  2014-12-23 18:18   ` Jeff King
  1 sibling, 0 replies; 10+ messages in thread
From: Junio C Hamano @ 2014-12-23 18:17 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: Jeff King, git, Linus Torvalds

Torsten Bögershausen <tboegi@web.de> writes:

> Don't we have the same possible problem under NTFS?
> Under Linux + VFAT ?
> Under all OS + VFAT ?

I do not think so, as NTFS codepath does not use the (previouly too
wide but now fixed) is_hfs_dotgit() in the first place, and
is_ntfs_dotgit() does not read one unicode codepoint at a time if I
remember correctly.

> And what happens if I export NTFS to Mac OS X?
> (Other combinations possible)
> Shouldn't fsck under all OS warn for NTFS and hfs possible attacks ?

Doesn't it?  Have you read the whole series you are commenting on?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] is_hfs_dotgit: loosen over-eager match of \u{..47}
  2014-12-23 15:24 ` Torsten Bögershausen
  2014-12-23 18:17   ` Junio C Hamano
@ 2014-12-23 18:18   ` Jeff King
  1 sibling, 0 replies; 10+ messages in thread
From: Jeff King @ 2014-12-23 18:18 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: git, Junio C Hamano, Linus Torvalds

On Tue, Dec 23, 2014 at 04:24:57PM +0100, Torsten Bögershausen wrote:

> Don't we have the same possible problem under NTFS?
> Under Linux + VFAT ?
> Under all OS + VFAT ?

I'm not sure what you mean.

This code path is _only_ about checking for HFS+-specific problems. We
check general case-insensitivity in another code path. And we check
NTFS-specific problems in another code path.

(Technically we could make a "check all" function that runs over the
data only once, which would be more efficient. But doing it this way
makes the code much easier to follow).

> And would it make sense to turn this
> >   return (out & 0xffffff00) ? out : tolower(out);
> into this:
> static ucs_char_t unicode_tolower(ucs_char_t ch) {
>    return (ch & 0xffffff00) ? ch : tolower(ch);
> }

Perhaps, but you would need a big warning at the top of that function
that is _only_ downcasing ASCII characters. I.e., it is fine for our use
here, but you would not want other unicode-aware code to call it. The
next_hfs_char already has such a warning at the top.

> And what happens if I export NTFS to Mac OS X?
> (Other combinations possible)
> Shouldn't fsck under all OS warn for NTFS and hfs possible attacks ?

Fsck already warns for all system types, no matter what platform you're
on. And we do case-insensitive blocking of ".git" entering the index for
all platforms.

We don't turn on filesystem-specific blocking of paths entering the
index on all platforms. You get HFS+ blocking by default on OS X, and
NTFS on Windows. If you are using HFS+ on Linux, you should set
core.protectHFS.

Possibly we should just turn on all checks everywhere. That would be a
safer default, at the cost to Linux folks of:

  1. Some slight inefficiency in each read-tree, as we have to do extra
     processing on each name.

  2. Some names would be disallowed that are otherwise OK. For the most
     part these are not names that would be used in practice, though
     (e.g., losing the ability to create ".git\u200c" is not a big loss
     to anyone). I think Git for Windows has started blocking other
     stuff like "CON" that is not specifically related to .git, though,
     and that is more plausible for somebody to use in real life.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] is_hfs_dotgit: loosen over-eager match of \u{..47}
  2014-12-23  8:45 [PATCH] is_hfs_dotgit: loosen over-eager match of \u{..47} Jeff King
  2014-12-23 15:24 ` Torsten Bögershausen
@ 2014-12-23 20:14 ` Jonathan Nieder
  2014-12-23 21:02   ` Junio C Hamano
  2014-12-23 21:09   ` Jeff King
  2014-12-23 20:31 ` Johannes Sixt
  2 siblings, 2 replies; 10+ messages in thread
From: Jonathan Nieder @ 2014-12-23 20:14 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano, Linus Torvalds

Jeff King wrote:

> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  t/t1450-fsck.sh | 16 ++++++++++++++++
>  utf8.c          | 17 ++++++++++++-----
>  2 files changed, 28 insertions(+), 5 deletions(-)

Mph.  Sorry I missed this before.

Reviewed-by: Jonathan Nieder <jrnieder@gmail.com>

[...]
> +++ b/t/t1450-fsck.sh
[...]
> +		git fsck 2>err &&
> +		cat err &&
> +		! test -s err

Nit: if this said

		test_line_count = 0 err

then the error message would be more obvious when it fails with
--verbose.

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] is_hfs_dotgit: loosen over-eager match of \u{..47}
  2014-12-23  8:45 [PATCH] is_hfs_dotgit: loosen over-eager match of \u{..47} Jeff King
  2014-12-23 15:24 ` Torsten Bögershausen
  2014-12-23 20:14 ` Jonathan Nieder
@ 2014-12-23 20:31 ` Johannes Sixt
  2014-12-23 21:11   ` Jeff King
  2 siblings, 1 reply; 10+ messages in thread
From: Johannes Sixt @ 2014-12-23 20:31 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano, Linus Torvalds

Am 23.12.2014 um 09:45 schrieb Jeff King:
> @@ -606,7 +606,7 @@ static ucs_char_t next_hfs_char(const char **in)
>  		 * but this is enough to catch anything that will convert
>  		 * to ".git"
>  		 */
> -		return tolower(out);
> +		return out;

Did you consider changing the comment that we see in the pre-context here?

-- Hannes

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] is_hfs_dotgit: loosen over-eager match of \u{..47}
  2014-12-23 20:14 ` Jonathan Nieder
@ 2014-12-23 21:02   ` Junio C Hamano
  2014-12-23 21:12     ` Jeff King
  2014-12-23 21:09   ` Jeff King
  1 sibling, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2014-12-23 21:02 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jeff King, git, Linus Torvalds

Jonathan Nieder <jrnieder@gmail.com> writes:

>> +++ b/t/t1450-fsck.sh
> [...]
>> +		git fsck 2>err &&
>> +		cat err &&
>> +		! test -s err
>
> Nit: if this said
>
> 		test_line_count = 0 err
>
> then the error message would be more obvious when it fails with
> --verbose.

That's a good suggestion, I think.  This is meant to apply on top of
d08c13b, and we already had test_line_count back then.

So far I collected these follow-ups to squash into Peff's patch.



 t/t1450-fsck.sh |  3 +--
 utf8.c          | 15 ++++++++-------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index 5e42385..0279b2b 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -284,8 +284,7 @@ test_expect_success 'fsck allows .Ňit' '
 		printf "100644 blob $blob\t.\\305\\207it" >tree &&
 		tree=$(git mktree <tree) &&
 		git fsck 2>err &&
-		cat err &&
-		! test -s err
+		test_line_count = 0 err
 	)
 '
 
diff --git a/utf8.c b/utf8.c
index 18a8f42..9c9fa3a 100644
--- a/utf8.c
+++ b/utf8.c
@@ -630,8 +630,8 @@ int mbs_chrlen(const char **text, size_t *remainder_p, const char *encoding)
 }
 
 /*
- * Pick the next char from the stream, folding as an HFS+ filename comparison
- * would. Note that this is _not_ complete by any means. It's just enough
+ * Pick the next char from the stream, ignoring codepoints an HFS+ would.
+ * Note that this is _not_ complete by any means. It's just enough
  * to make is_hfs_dotgit() work, and should not be used otherwise.
  */
 static ucs_char_t next_hfs_char(const char **in)
@@ -668,11 +668,6 @@ static ucs_char_t next_hfs_char(const char **in)
 			continue;
 		}
 
-		/*
-		 * there's a great deal of other case-folding that occurs,
-		 * but this is enough to catch anything that will convert
-		 * to ".git"
-		 */
 		return out;
 	}
 }
@@ -685,6 +680,12 @@ int is_hfs_dotgit(const char *path)
 	if (c != '.')
 		return 0;
 	c = next_hfs_char(&path);
+
+	/*
+	 * there's a great deal of other case-folding that occurs
+	 * in HFS+, but this is enough to catch anything that will
+	 * convert to ".git"
+	 */
 	if (c != 'g' && c != 'G')
 		return 0;
 	c = next_hfs_char(&path);

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] is_hfs_dotgit: loosen over-eager match of \u{..47}
  2014-12-23 20:14 ` Jonathan Nieder
  2014-12-23 21:02   ` Junio C Hamano
@ 2014-12-23 21:09   ` Jeff King
  1 sibling, 0 replies; 10+ messages in thread
From: Jeff King @ 2014-12-23 21:09 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, Junio C Hamano, Linus Torvalds

On Tue, Dec 23, 2014 at 12:14:16PM -0800, Jonathan Nieder wrote:

> > +++ b/t/t1450-fsck.sh
> [...]
> > +		git fsck 2>err &&
> > +		cat err &&
> > +		! test -s err
> 
> Nit: if this said
> 
> 		test_line_count = 0 err
> 
> then the error message would be more obvious when it fails with
> --verbose.

Thanks, I actually considered adding a "test_file_is_empty" for that
reason, but empty line-count does the same thing.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] is_hfs_dotgit: loosen over-eager match of \u{..47}
  2014-12-23 20:31 ` Johannes Sixt
@ 2014-12-23 21:11   ` Jeff King
  0 siblings, 0 replies; 10+ messages in thread
From: Jeff King @ 2014-12-23 21:11 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git, Junio C Hamano, Linus Torvalds

On Tue, Dec 23, 2014 at 09:31:10PM +0100, Johannes Sixt wrote:

> Am 23.12.2014 um 09:45 schrieb Jeff King:
> > @@ -606,7 +606,7 @@ static ucs_char_t next_hfs_char(const char **in)
> >  		 * but this is enough to catch anything that will convert
> >  		 * to ".git"
> >  		 */
> > -		return tolower(out);
> > +		return out;
> 
> Did you consider changing the comment that we see in the pre-context here?

I did consider it, but the comment that is there was actually written
for the _original_ version, before I added tolower in the first place
(it also applied equally to the tolower() version, so I left it).

So it was clear either way, at least in my brain. :)

Of course that does not say anything about people's brains who did not
write the patch.  The changes Junio suggested elsewhere in the thread do
make it more clear.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] is_hfs_dotgit: loosen over-eager match of \u{..47}
  2014-12-23 21:02   ` Junio C Hamano
@ 2014-12-23 21:12     ` Jeff King
  0 siblings, 0 replies; 10+ messages in thread
From: Jeff King @ 2014-12-23 21:12 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Nieder, git, Linus Torvalds

On Tue, Dec 23, 2014 at 01:02:23PM -0800, Junio C Hamano wrote:

> Jonathan Nieder <jrnieder@gmail.com> writes:
> 
> >> +++ b/t/t1450-fsck.sh
> > [...]
> >> +		git fsck 2>err &&
> >> +		cat err &&
> >> +		! test -s err
> >
> > Nit: if this said
> >
> > 		test_line_count = 0 err
> >
> > then the error message would be more obvious when it fails with
> > --verbose.
> 
> That's a good suggestion, I think.  This is meant to apply on top of
> d08c13b, and we already had test_line_count back then.
> 
> So far I collected these follow-ups to squash into Peff's patch.

They all look good to me. Thanks.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-12-23 21:12 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-12-23  8:45 [PATCH] is_hfs_dotgit: loosen over-eager match of \u{..47} Jeff King
2014-12-23 15:24 ` Torsten Bögershausen
2014-12-23 18:17   ` Junio C Hamano
2014-12-23 18:18   ` Jeff King
2014-12-23 20:14 ` Jonathan Nieder
2014-12-23 21:02   ` Junio C Hamano
2014-12-23 21:12     ` Jeff King
2014-12-23 21:09   ` Jeff King
2014-12-23 20:31 ` Johannes Sixt
2014-12-23 21:11   ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).