[JGIT PATCH 0/4] Decodings

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [JGIT PATCH 0/4] Decodings
@ 2008-10-12 22:50 Robin Rosenberg
  2008-10-12 22:50 ` [JGIT PATCH 1/4] log command: Use explicit US locale for dates Robin Rosenberg
  0 siblings, 1 reply; 8+ messages in thread
From: Robin Rosenberg @ 2008-10-12 22:50 UTC (permalink / raw)
  To: spearce; +Cc: git

These patches might be slightly controversial. Since there is no perfect
solution, we may want to try something that works to some extent and gives
what a user might expect, i.e. presenting to a viewer the same glyphs 
that the user who entered them saw, to the extent it is possible.

We already handle commit messages like this for the old style objects (sort of).
This patch set also affects other data like refs. Currenly no sane solution
exists in git so nothing really works well outside the non-ascii range for
refs anyway so we can discuss what should happen with refs that contain
non-ascii characters. The best thing is to avoid them, but some of us live
in countries with funny dots in what we do and other have even stranger ways
of expressing what they do, and hence things like branch names etc.

Legacy SCM to GIT conversion programs seem to do every variation of transcoding/
not transcoding commit messages and file names to UTF-8 so there is an issue here.
The nice thing about transcoding filenames to UTF-8 is that they work on all
platforms. A non-UTF-8 filename in a UTF-8 environement doesn't. In particular 
such filenames are more or less inaccessible to a Java programs. For the reverse
case it looks really bad. C Git currently does not transform file names. Missing from
this patch set is test cases. As it is quite undefined in git what happens that
is sort of ok so far, but I'd like to define it too in the same way.

-- robin

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [JGIT PATCH 1/4] log command: Use explicit US locale for dates
  2008-10-12 22:50 [JGIT PATCH 0/4] Decodings Robin Rosenberg
@ 2008-10-12 22:50 ` Robin Rosenberg
  2008-10-12 22:50   ` [JGIT PATCH 2/4] jgit programs: Use i18n.logOutputEncoding or user's locale for output Robin Rosenberg
  0 siblings, 1 reply; 8+ messages in thread
From: Robin Rosenberg @ 2008-10-12 22:50 UTC (permalink / raw)
  To: spearce; +Cc: git, Robin Rosenberg

The format used is a variation of the US locale and looks odd when localized
anyway so do not try to localize it.

Signed-off-by: Robin Rosenberg <robin.rosenberg@dewire.com>
---
 .../src/org/spearce/jgit/pgm/Log.java              |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/org.spearce.jgit.pgm/src/org/spearce/jgit/pgm/Log.java b/org.spearce.jgit.pgm/src/org/spearce/jgit/pgm/Log.java
index e16387b..e3a32c8 100644
--- a/org.spearce.jgit.pgm/src/org/spearce/jgit/pgm/Log.java
+++ b/org.spearce.jgit.pgm/src/org/spearce/jgit/pgm/Log.java
@@ -40,6 +40,7 @@
 
 import java.text.DateFormat;
 import java.text.SimpleDateFormat;
+import java.util.Locale;
 import java.util.TimeZone;
 
 import org.spearce.jgit.lib.PersonIdent;
@@ -52,7 +53,7 @@
 	private final DateFormat fmt;
 
 	Log() {
-		fmt = new SimpleDateFormat("EEE MMM dd HH:mm:ss yyyy ZZZZZ");
+		fmt = new SimpleDateFormat("EEE MMM dd HH:mm:ss yyyy ZZZZZ", Locale.US);
 	}
 
 	@Override
-- 
1.6.0.2.308.gef4a

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [JGIT PATCH 2/4] jgit programs: Use i18n.logOutputEncoding or user's locale for output
  2008-10-12 22:50 ` [JGIT PATCH 1/4] log command: Use explicit US locale for dates Robin Rosenberg
@ 2008-10-12 22:50   ` Robin Rosenberg
  2008-10-12 22:50     ` [JGIT PATCH 3/4] The git config file is case insensitive Robin Rosenberg
  0 siblings, 1 reply; 8+ messages in thread
From: Robin Rosenberg @ 2008-10-12 22:50 UTC (permalink / raw)
  To: spearce; +Cc: git, Robin Rosenberg

We should present the data in the user's locale to make it readable.

This prevents garbage from being displayed for user's whose locale is not
UTF-8, when non-ascii appears in most cases. If the characters cannot be
converted garbage will appear in any case.

When jgit gains the capability to present blob data this code must be
replaced to handle data and metadata differently.

Signed-off-by: Robin Rosenberg <robin.rosenberg@dewire.com>
---
 .../src/org/spearce/jgit/pgm/TextBuiltin.java      |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/org.spearce.jgit.pgm/src/org/spearce/jgit/pgm/TextBuiltin.java b/org.spearce.jgit.pgm/src/org/spearce/jgit/pgm/TextBuiltin.java
index a68d87c..d1bf9e0 100644
--- a/org.spearce.jgit.pgm/src/org/spearce/jgit/pgm/TextBuiltin.java
+++ b/org.spearce.jgit.pgm/src/org/spearce/jgit/pgm/TextBuiltin.java
@@ -86,8 +86,15 @@ final void setCommandName(final String name) {
 
 	void init(final Repository repo) {
 		try {
-			out = new PrintWriter(new BufferedWriter(new OutputStreamWriter(
-					System.out, "UTF-8")));
+			String outputEncoding = repo.getConfig().getString("i18n", null,
+					"logOutputEncoding");
+			System.out.println("Encoding = "+ outputEncoding);
+			if (outputEncoding != null)
+				out = new PrintWriter(new BufferedWriter(
+						new OutputStreamWriter(System.out, outputEncoding)));
+			else
+				out = new PrintWriter(new BufferedWriter(
+						new OutputStreamWriter(System.out)));
 		} catch (IOException e) {
 			throw die("cannot create output stream");
 		}
-- 
1.6.0.2.308.gef4a

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [JGIT PATCH 3/4] The git config file is case insensitive
  2008-10-12 22:50   ` [JGIT PATCH 2/4] jgit programs: Use i18n.logOutputEncoding or user's locale for output Robin Rosenberg
@ 2008-10-12 22:50     ` Robin Rosenberg
  2008-10-12 22:51       ` [JGIT PATCH 4/4] Intelligent parsing of ambiguously encoded meta data Robin Rosenberg
  2008-10-13  2:36       ` [JGIT PATCH 3/4] The git config file is case insensitive Shawn O. Pearce
  0 siblings, 2 replies; 8+ messages in thread
From: Robin Rosenberg @ 2008-10-12 22:50 UTC (permalink / raw)
  To: spearce; +Cc: git, Robin Rosenberg

Signed-off-by: Robin Rosenberg <robin.rosenberg@dewire.com>
---
 .../org/spearce/jgit/lib/RepositoryConfigTest.java |    8 ++++++++
 .../src/org/spearce/jgit/lib/RepositoryConfig.java |   18 ++++++++++++------
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/org.spearce.jgit.test/tst/org/spearce/jgit/lib/RepositoryConfigTest.java b/org.spearce.jgit.test/tst/org/spearce/jgit/lib/RepositoryConfigTest.java
index da7e704..bd5329c 100644
--- a/org.spearce.jgit.test/tst/org/spearce/jgit/lib/RepositoryConfigTest.java
+++ b/org.spearce.jgit.test/tst/org/spearce/jgit/lib/RepositoryConfigTest.java
@@ -109,4 +109,12 @@ assertTrue(Arrays.equals(values.toArray(), repositoryConfig
 				.getStringList("my", null, "somename")));
 		checkFile(cfgFile, "[my]\n\tsomename = value1\n\tsomename = value2\n");
 	}
+
+	public void test006_readCaseInsensitive() throws IOException {
+		final File path = writeTrashFile("config_001", "[Foo]\nBar\n");
+		RepositoryConfig repositoryConfig = new RepositoryConfig(null, path);
+		System.out.println(repositoryConfig.getString("foo", null, "bar"));
+		assertEquals(true, repositoryConfig.getBoolean("foo", null, "bar", false));
+		assertEquals("", repositoryConfig.getString("foo", null, "bar"));
+	}
 }
diff --git a/org.spearce.jgit/src/org/spearce/jgit/lib/RepositoryConfig.java b/org.spearce.jgit/src/org/spearce/jgit/lib/RepositoryConfig.java
index 45c2f8a..7a34cde 100644
--- a/org.spearce.jgit/src/org/spearce/jgit/lib/RepositoryConfig.java
+++ b/org.spearce.jgit/src/org/spearce/jgit/lib/RepositoryConfig.java
@@ -52,12 +52,13 @@
 import java.io.PrintWriter;
 import java.util.ArrayList;
 import java.util.Collections;
-import java.util.HashMap;
+import java.util.Comparator;
 import java.util.HashSet;
 import java.util.Iterator;
 import java.util.List;
 import java.util.Map;
 import java.util.Set;
+import java.util.TreeMap;
 
 import org.spearce.jgit.util.FS;
 
@@ -236,9 +237,9 @@ protected boolean getBoolean(final String section, String subsection,
 			return defaultValue;
 
 		n = n.toLowerCase();
-		if (MAGIC_EMPTY_VALUE.equals(n) || "yes".equals(n) || "true".equals(n) || "1".equals(n)) {
+		if (MAGIC_EMPTY_VALUE.equals(n) || "yes".equalsIgnoreCase(n) || "true".equalsIgnoreCase(n) || "1".equals(n)) {
 			return true;
-		} else if ("no".equals(n) || "false".equals(n) || "0".equals(n)) {
+		} else if ("no".equalsIgnoreCase(n) || "false".equalsIgnoreCase(n) || "0".equalsIgnoreCase(n)) {
 			return false;
 		} else {
 			throw new IllegalArgumentException("Invalid boolean value: "
@@ -300,7 +301,7 @@ public String getString(final String section, String subsection, final String na
 		final Set<String> result = new HashSet<String>();
 
 		for (final Entry e : entries) {
-			if (section.equals(e.base) && e.extendedBase != null)
+			if (section.equalsIgnoreCase(e.base) && e.extendedBase != null)
 				result.add(e.extendedBase);
 		}
 		if (baseConfig != null)
@@ -682,7 +683,12 @@ public void load() throws IOException {
 
 	private void clear() {
 		entries = new ArrayList<Entry>();
-		byName = new HashMap<String, Object>();
+		byName = new TreeMap<String, Object>(new Comparator<String>() {
+
+			public int compare(String o1, String o2) {
+				return o1.compareToIgnoreCase(o2);
+			}
+		});
 	}
 
 	@SuppressWarnings("unchecked")
@@ -954,7 +960,7 @@ private static boolean eq(final String a, final String b) {
 				return true;
 			if (a == null || b == null)
 				return false;
-			return a.equals(b);
+			return a.equalsIgnoreCase(b);
 		}
 	}
 }
-- 
1.6.0.2.308.gef4a

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [JGIT PATCH 4/4] Intelligent parsing of ambiguously encoded meta data.
  2008-10-12 22:50     ` [JGIT PATCH 3/4] The git config file is case insensitive Robin Rosenberg
@ 2008-10-12 22:51       ` Robin Rosenberg
  2008-10-13  2:27         ` Shawn O. Pearce
  2008-10-13  2:36       ` [JGIT PATCH 3/4] The git config file is case insensitive Shawn O. Pearce
  1 sibling, 1 reply; 8+ messages in thread
From: Robin Rosenberg @ 2008-10-12 22:51 UTC (permalink / raw)
  To: spearce; +Cc: git, Robin Rosenberg

We cannot trust meta data to be encoded in any particular way, so we try
different encodings. First we try UTF-8, which is the only sane encoding
for non-local data, even when used in regions where eight bit legacy
encodings are common. The chance of mistakenly parsing non-UTF-8 data
as valid UTF-8 is varies from extremely low (western encodings) to low
for most other encodings. If the data does not look like UTF-8, we try the
suggested encoding. If that fails we try the user locale and finally, if
that fails we try ISO-8859-1, which cannot fail.

Signed-off-by: Robin Rosenberg <robin.rosenberg@dewire.com>
---
 .../spearce/jgit/revwalk/RevCommitParseTest.java   |  119 ++++++++++++++++++++
 .../src/org/spearce/jgit/util/RawParseUtils.java   |   63 ++++++++++-
 2 files changed, 179 insertions(+), 3 deletions(-)

diff --git a/org.spearce.jgit.test/tst/org/spearce/jgit/revwalk/RevCommitParseTest.java b/org.spearce.jgit.test/tst/org/spearce/jgit/revwalk/RevCommitParseTest.java
index 3d9d42d..805e29e 100644
--- a/org.spearce.jgit.test/tst/org/spearce/jgit/revwalk/RevCommitParseTest.java
+++ b/org.spearce.jgit.test/tst/org/spearce/jgit/revwalk/RevCommitParseTest.java
@@ -37,6 +37,8 @@
 
 package org.spearce.jgit.revwalk;
 
+import java.io.ByteArrayOutputStream;
+
 import org.spearce.jgit.lib.ObjectId;
 import org.spearce.jgit.lib.PersonIdent;
 import org.spearce.jgit.lib.RepositoryTestCase;
@@ -130,6 +132,123 @@ public void testParse_WeirdHeaderOnlyCommit() throws Exception {
 		assertEquals("", c.getShortMessage());
 	}
 
+	public void testParse_implicit_UTF8_encoded() throws Exception {
+		final ByteArrayOutputStream b = new ByteArrayOutputStream();
+		b.write("tree 9788669ad918b6fcce64af8882fc9a81cb6aba67\n".getBytes("UTF-8"));
+		b.write("author F\u00f6r fattare <a_u_thor@example.com> 1218123387 +0700\n".getBytes("UTF-8"));
+		b.write("committer C O. Miter <c@example.com> 1218123390 -0500\n".getBytes("UTF-8"));
+		b.write("\n".getBytes("UTF-8"));
+		b.write("Sm\u00f6rg\u00e5sbord\n".getBytes("UTF-8"));
+		b.write("\n".getBytes("UTF-8"));
+		b.write("\u304d\u308c\u3044\n".getBytes("UTF-8"));
+		final RevCommit c;
+		c = new RevCommit(id("9473095c4cb2f12aefe1db8a355fe3fafba42f67")); // bogus id
+		c.parseCanonical(new RevWalk(db), b.toByteArray());
+
+		assertEquals("F\u00f6r fattare", c.getAuthorIdent().getName());
+		assertEquals("Sm\u00f6rg\u00e5sbord", c.getShortMessage());
+		assertEquals("Sm\u00f6rg\u00e5sbord\n\n\u304d\u308c\u3044\n", c.getFullMessage());
+	}
+
+	public void testParse_implicit_mixed_encoded() throws Exception {
+		final ByteArrayOutputStream b = new ByteArrayOutputStream();
+		b.write("tree 9788669ad918b6fcce64af8882fc9a81cb6aba67\n".getBytes("UTF-8"));
+		b.write("author F\u00f6r fattare <a_u_thor@example.com> 1218123387 +0700\n".getBytes("ISO-8859-1"));
+		b.write("committer C O. Miter <c@example.com> 1218123390 -0500\n".getBytes("UTF-8"));
+		b.write("\n".getBytes("UTF-8"));
+		b.write("Sm\u00f6rg\u00e5sbord\n".getBytes("UTF-8"));
+		b.write("\n".getBytes("UTF-8"));
+		b.write("\u304d\u308c\u3044\n".getBytes("UTF-8"));
+		final RevCommit c;
+		c = new RevCommit(id("9473095c4cb2f12aefe1db8a355fe3fafba42f67")); // bogus id
+		c.parseCanonical(new RevWalk(db), b.toByteArray());
+
+		assertEquals("F\u00f6r fattare", c.getAuthorIdent().getName());
+		assertEquals("Sm\u00f6rg\u00e5sbord", c.getShortMessage());
+		assertEquals("Sm\u00f6rg\u00e5sbord\n\n\u304d\u308c\u3044\n", c.getFullMessage());
+	}
+
+	/**
+	 * Test parsing of a commit whose encoding is given and works.
+	 *
+	 * @throws Exception
+	 */
+	public void testParse_explicit_encoded() throws Exception {
+		final ByteArrayOutputStream b = new ByteArrayOutputStream();
+		b.write("tree 9788669ad918b6fcce64af8882fc9a81cb6aba67\n".getBytes("EUC-JP"));
+		b.write("author F\u00f6r fattare <a_u_thor@example.com> 1218123387 +0700\n".getBytes("EUC-JP"));
+		b.write("committer C O. Miter <c@example.com> 1218123390 -0500\n".getBytes("EUC-JP"));
+		b.write("encoding euc_JP\n".getBytes("EUC-JP"));
+		b.write("\n".getBytes("EUC-JP"));
+		b.write("\u304d\u308c\u3044\n".getBytes("EUC-JP"));
+		b.write("\n".getBytes("EUC-JP"));
+		b.write("Hi\n".getBytes("EUC-JP"));
+		final RevCommit c;
+		c = new RevCommit(id("9473095c4cb2f12aefe1db8a355fe3fafba42f67")); // bogus id
+		c.parseCanonical(new RevWalk(db), b.toByteArray());
+
+		assertEquals("F\u00f6r fattare", c.getAuthorIdent().getName());
+		assertEquals("\u304d\u308c\u3044", c.getShortMessage());
+		assertEquals("\u304d\u308c\u3044\n\nHi\n", c.getFullMessage());
+	}
+
+	/**
+	 * This is a twisted case, but show what we expect here. We can revise the
+	 * expectations provided this case is updated.
+	 *
+	 * What happens here is that an encoding us given, but data is not encoded
+	 * that way (and we can detect it), so we try other encodings.
+	 *
+	 * @throws Exception
+	 */
+	public void testParse_explicit_bad_encoded() throws Exception {
+		final ByteArrayOutputStream b = new ByteArrayOutputStream();
+		b.write("tree 9788669ad918b6fcce64af8882fc9a81cb6aba67\n".getBytes("UTF-8"));
+		b.write("author F\u00f6r fattare <a_u_thor@example.com> 1218123387 +0700\n".getBytes("ISO-8859-1"));
+		b.write("committer C O. Miter <c@example.com> 1218123390 -0500\n".getBytes("UTF-8"));
+		b.write("encoding EUC-JP\n".getBytes("UTF-8"));
+		b.write("\n".getBytes("UTF-8"));
+		b.write("\u304d\u308c\u3044\n".getBytes("UTF-8"));
+		b.write("\n".getBytes("UTF-8"));
+		b.write("Hi\n".getBytes("UTF-8"));
+		final RevCommit c;
+		c = new RevCommit(id("9473095c4cb2f12aefe1db8a355fe3fafba42f67")); // bogus id
+		c.parseCanonical(new RevWalk(db), b.toByteArray());
+
+		assertEquals("F\u00f6r fattare", c.getAuthorIdent().getName());
+		assertEquals("\u304d\u308c\u3044", c.getShortMessage());
+		assertEquals("\u304d\u308c\u3044\n\nHi\n", c.getFullMessage());
+	}
+
+	/**
+	 * This is a twisted case too, but show what we expect here. We can revise the
+	 * expectations provided this case is updated.
+	 *
+	 * What happens here is that an encoding us given, but data is not encoded
+	 * that way (and we can detect it), so we try other encodings. Here data could
+	 * actually be decoded in the stated encoding, but we overide using UTF-8.
+	 *
+	 * @throws Exception
+	 */
+	public void testParse_explicit_bad_encoded2() throws Exception {
+		final ByteArrayOutputStream b = new ByteArrayOutputStream();
+		b.write("tree 9788669ad918b6fcce64af8882fc9a81cb6aba67\n".getBytes("UTF-8"));
+		b.write("author F\u00f6r fattare <a_u_thor@example.com> 1218123387 +0700\n".getBytes("UTF-8"));
+		b.write("committer C O. Miter <c@example.com> 1218123390 -0500\n".getBytes("UTF-8"));
+		b.write("encoding ISO-8859-1\n".getBytes("UTF-8"));
+		b.write("\n".getBytes("UTF-8"));
+		b.write("\u304d\u308c\u3044\n".getBytes("UTF-8"));
+		b.write("\n".getBytes("UTF-8"));
+		b.write("Hi\n".getBytes("UTF-8"));
+		final RevCommit c;
+		c = new RevCommit(id("9473095c4cb2f12aefe1db8a355fe3fafba42f67")); // bogus id
+		c.parseCanonical(new RevWalk(db), b.toByteArray());
+
+		assertEquals("F\u00f6r fattare", c.getAuthorIdent().getName());
+		assertEquals("\u304d\u308c\u3044", c.getShortMessage());
+		assertEquals("\u304d\u308c\u3044\n\nHi\n", c.getFullMessage());
+	}
+
 	public void testParse_NoMessage() throws Exception {
 		final String msg = "";
 		final RevCommit c = create(msg);
diff --git a/org.spearce.jgit/src/org/spearce/jgit/util/RawParseUtils.java b/org.spearce.jgit/src/org/spearce/jgit/util/RawParseUtils.java
index a31734b..7c16394 100644
--- a/org.spearce.jgit/src/org/spearce/jgit/util/RawParseUtils.java
+++ b/org.spearce.jgit/src/org/spearce/jgit/util/RawParseUtils.java
@@ -42,7 +42,10 @@
 import static org.spearce.jgit.lib.ObjectChecker.encoding;
 
 import java.nio.ByteBuffer;
+import java.nio.charset.CharacterCodingException;
 import java.nio.charset.Charset;
+import java.nio.charset.CharsetDecoder;
+import java.nio.charset.CodingErrorAction;
 import java.util.Arrays;
 
 import org.spearce.jgit.lib.Constants;
@@ -376,7 +379,10 @@ public static PersonIdent parsePersonIdent(final byte[] raw, final int nameB) {
 	}
 
 	/**
-	 * Decode a region of the buffer under the specified character set.
+	 * Decode a region of the buffer under the specified character set if possible.
+	 *
+	 * If the byte stream cannot be decoded that way, the platform default is tried
+	 * and if that too fails, the fail-safe ISO-8859-1 encoding is tried.
 	 * 
 	 * @param cs
 	 *            character set to use when decoding the buffer.
@@ -393,12 +399,63 @@ public static PersonIdent parsePersonIdent(final byte[] raw, final int nameB) {
 	public static String decode(final Charset cs, final byte[] buffer,
 			final int start, final int end) {
 		final ByteBuffer b = ByteBuffer.wrap(buffer, start, end - start);
-		return cs.decode(b).toString();
+		b.mark();
+		for (int i = 0;; ++i) {
+			try {
+				Charset charset;
+				switch (i) {
+				case 0:
+					/*
+					 * Try our built-in favorite. The assumption here is that
+					 * decoding will fail if the data is not actually encoded
+					 * using that encoder.
+					 */
+					charset = Constants.CHARSET;
+					break;
+				case 1:
+					/*
+					 * Try the suggested encoding, it might be right since it
+					 * was provided
+					 */
+					if (cs.equals(Constants.CHARSET))
+						continue;
+					b.reset();
+					charset = cs;
+					break;
+				case 2:
+					/*
+					 * Try the default character set. A small group of people
+					 * might actually use the same (or very similar) locale.
+					 */
+					charset = Charset.defaultCharset();
+					if (charset.equals(Constants.CHARSET))
+						continue;
+					if (charset.equals(cs))
+						continue;
+					b.reset();
+					break;
+				default:
+					/*
+					 * This one is to make sure we do no fail. Data may look
+					 * funny but there is nothing we can do here withou much
+					 * more advanced guessing.
+					 */
+					b.reset();
+					charset = Charset.forName("ISO-8859-1");
+				}
+				CharsetDecoder d = charset.newDecoder();
+				d.onMalformedInput(CodingErrorAction.REPORT);
+				d.onUnmappableCharacter(CodingErrorAction.REPORT);
+				return d.decode(b).toString();
+			} catch (CharacterCodingException e1) {
+				continue;
+			}
+		}
 	}
 
 	/**
 	 * Locate the position of the commit message body.
-	 * 
+	 *
 	 * @param b
 	 *            buffer to scan.
 	 * @param ptr
-- 
1.6.0.2.308.gef4a

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [JGIT PATCH 4/4] Intelligent parsing of ambiguously encoded meta data.
  2008-10-12 22:51       ` [JGIT PATCH 4/4] Intelligent parsing of ambiguously encoded meta data Robin Rosenberg
@ 2008-10-13  2:27         ` Shawn O. Pearce
  2008-10-13 17:10           ` Robin Rosenberg
  0 siblings, 1 reply; 8+ messages in thread
From: Shawn O. Pearce @ 2008-10-13  2:27 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: git

Robin Rosenberg <robin.rosenberg@dewire.com> wrote:
> We cannot trust meta data to be encoded in any particular way, so we try
> different encodings. First we try UTF-8, which is the only sane encoding
> for non-local data, even when used in regions where eight bit legacy
> encodings are common. The chance of mistakenly parsing non-UTF-8 data
> as valid UTF-8 is varies from extremely low (western encodings) to low
> for most other encodings. If the data does not look like UTF-8, we try the
> suggested encoding. If that fails we try the user locale and finally, if
> that fails we try ISO-8859-1, which cannot fail.

Hmm.  I'm concerned about the infinite loop you have here.
If ISO-8859-1 fails we'd be stuck here until the end of time.
Plus its a bit ugly to read.

I wonder if this is any better.  It passes your tests and is 2
lines shorter.

diff --git a/org.spearce.jgit/src/org/spearce/jgit/util/RawParseUtils.java b/org.spearce.jgit/src/org/spearce/jgit/util/RawParseUtils.java
index a31734b..6c0e339 100644
--- a/org.spearce.jgit/src/org/spearce/jgit/util/RawParseUtils.java
+++ b/org.spearce.jgit/src/org/spearce/jgit/util/RawParseUtils.java
@@ -42,7 +42,10 @@
 import static org.spearce.jgit.lib.ObjectChecker.encoding;
 
 import java.nio.ByteBuffer;
+import java.nio.charset.CharacterCodingException;
 import java.nio.charset.Charset;
+import java.nio.charset.CharsetDecoder;
+import java.nio.charset.CodingErrorAction;
 import java.util.Arrays;
 
 import org.spearce.jgit.lib.Constants;
@@ -376,7 +379,10 @@ public static PersonIdent parsePersonIdent(final byte[] raw, final int nameB) {
 	}
 
 	/**
-	 * Decode a region of the buffer under the specified character set.
+	 * Decode a region of the buffer under the specified character set if possible.
+	 *
+	 * If the byte stream cannot be decoded that way, the platform default is tried
+	 * and if that too fails, the fail-safe ISO-8859-1 encoding is tried.
 	 * 
 	 * @param cs
 	 *            character set to use when decoding the buffer.
@@ -393,7 +399,56 @@ public static PersonIdent parsePersonIdent(final byte[] raw, final int nameB) {
 	public static String decode(final Charset cs, final byte[] buffer,
 			final int start, final int end) {
 		final ByteBuffer b = ByteBuffer.wrap(buffer, start, end - start);
-		return cs.decode(b).toString();
+		b.mark();
+
+		// Try our built-in favorite. The assumption here is that
+		// decoding will fail if the data is not actually encoded
+		// using that encoder.
+		//
+		try {
+			return decode(b, Constants.CHARSET);
+		} catch (CharacterCodingException e) {
+			b.reset();
+		}
+
+		if (!cs.equals(Constants.CHARSET)) {
+			// Try the suggested encoding, it might be right since it was
+			// provided by the caller.
+			//
+			try {
+				return decode(b, cs);
+			} catch (CharacterCodingException e) {
+				b.reset();
+			}
+		}
+
+		// Try the default character set. A small group of people
+		// might actually use the same (or very similar) locale.
+		//
+		final Charset defcs = Charset.defaultCharset();
+		if (!defcs.equals(cs) && !defcs.equals(Constants.CHARSET)) {
+			try {
+				return decode(b, defcs);
+			} catch (CharacterCodingException e) {
+				b.reset();
+			}
+		}
+
+		// Fall back to an ISO-8859-1 style encoding. At least all of
+		// the bytes will be present in the output.
+		//
+		final StringBuilder r = new StringBuilder(end - start);
+		for (int i = start; i < end; i++)
+			r.append((char) (buffer[i] & 0xff));
+		return r.toString();
+	}
+
+	private static String decode(final ByteBuffer b, final Charset charset)
+			throws CharacterCodingException {
+		final CharsetDecoder d = charset.newDecoder();
+		d.onMalformedInput(CodingErrorAction.REPORT);
+		d.onUnmappableCharacter(CodingErrorAction.REPORT);
+		return d.decode(b).toString();
 	}
 
 	/**


-- 
Shawn.

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [JGIT PATCH 3/4] The git config file is case insensitive
  2008-10-12 22:50     ` [JGIT PATCH 3/4] The git config file is case insensitive Robin Rosenberg
  2008-10-12 22:51       ` [JGIT PATCH 4/4] Intelligent parsing of ambiguously encoded meta data Robin Rosenberg
@ 2008-10-13  2:36       ` Shawn O. Pearce
  1 sibling, 0 replies; 8+ messages in thread
From: Shawn O. Pearce @ 2008-10-13  2:36 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: git

Robin Rosenberg <robin.rosenberg@dewire.com> wrote:
> diff --git a/org.spearce.jgit/src/org/spearce/jgit/lib/RepositoryConfig.java b/org.spearce.jgit/src/org/spearce/jgit/lib/RepositoryConfig.java
> index 45c2f8a..7a34cde 100644
> --- a/org.spearce.jgit/src/org/spearce/jgit/lib/RepositoryConfig.java
> +++ b/org.spearce.jgit/src/org/spearce/jgit/lib/RepositoryConfig.java
> @@ -682,7 +683,12 @@ public void load() throws IOException {
>  
>  	private void clear() {
>  		entries = new ArrayList<Entry>();
> -		byName = new HashMap<String, Object>();
> +		byName = new TreeMap<String, Object>(new Comparator<String>() {
> +
> +			public int compare(String o1, String o2) {
> +				return o1.compareToIgnoreCase(o2);
> +			}
> +		});
>  	}

This isn't necessary.  Everyone who does a get or a put against the
byName map already is forming a lower case key string.  I'd rather
keep the lookup O(1) than O(log N), especially if the code has a
ton of .toLowerCase() calls in it to normalize the keys.

If you are going to change it to a TreeMap with a custom Comparator
then maybe we should cleanup the code that operates on byName so it
can use the original input strings, instead of the .toLowerCase()
forms.

For now I'm going to apply your patch without this one hunk.  If you
want to switch to a TreeMap lets also cleanup the get/put calls.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [JGIT PATCH 4/4] Intelligent parsing of ambiguously encoded meta data.
  2008-10-13  2:27         ` Shawn O. Pearce
@ 2008-10-13 17:10           ` Robin Rosenberg
  0 siblings, 0 replies; 8+ messages in thread
From: Robin Rosenberg @ 2008-10-13 17:10 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: git

måndagen den 13 oktober 2008 04.27.08 skrev Shawn O. Pearce:
> Robin Rosenberg <robin.rosenberg@dewire.com> wrote:
> > We cannot trust meta data to be encoded in any particular way, so we try
> > different encodings. First we try UTF-8, which is the only sane encoding
> > for non-local data, even when used in regions where eight bit legacy
> > encodings are common. The chance of mistakenly parsing non-UTF-8 data
> > as valid UTF-8 is varies from extremely low (western encodings) to low
> > for most other encodings. If the data does not look like UTF-8, we try the
> > suggested encoding. If that fails we try the user locale and finally, if
> > that fails we try ISO-8859-1, which cannot fail.
> 
> Hmm.  I'm concerned about the infinite loop you have here.
> If ISO-8859-1 fails we'd be stuck here until the end of time.
> Plus its a bit ugly to read.
> 
> I wonder if this is any better.  It passes your tests and is 2
> lines shorter.

Yes. Not sure what I was thinking with the loop there... :)

-- robin

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2008-10-13 17:13 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-10-12 22:50 [JGIT PATCH 0/4] Decodings Robin Rosenberg
2008-10-12 22:50 ` [JGIT PATCH 1/4] log command: Use explicit US locale for dates Robin Rosenberg
2008-10-12 22:50   ` [JGIT PATCH 2/4] jgit programs: Use i18n.logOutputEncoding or user's locale for output Robin Rosenberg
2008-10-12 22:50     ` [JGIT PATCH 3/4] The git config file is case insensitive Robin Rosenberg
2008-10-12 22:51       ` [JGIT PATCH 4/4] Intelligent parsing of ambiguously encoded meta data Robin Rosenberg
2008-10-13  2:27         ` Shawn O. Pearce
2008-10-13 17:10           ` Robin Rosenberg
2008-10-13  2:36       ` [JGIT PATCH 3/4] The git config file is case insensitive Shawn O. Pearce

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).