All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Tao Klerks via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: Tao Klerks <tao@klerks.biz>, Tao Klerks <tao@klerks.biz>
Subject: [PATCH] git-p4: preserve utf8 BOM when importing from p4 to git
Date: Mon, 04 Apr 2022 05:50:36 +0000	[thread overview]
Message-ID: <pull.1203.git.1649051436934.gitgitgadget@gmail.com> (raw)

From: Tao Klerks <tao@klerks.biz>

Perforce has a file type "utf8" which represents a text file with
explicit BOM. utf8-encoded files *without* BOM are stored as
regular file type "text". The "utf8" file type behaves like text
in all but one important way: it is stored, internally, without
the leading 3 BOM bytes.

git-p4 has historically imported utf8-with-BOM files (files stored,
in Perforce, as type "utf8") the same way as regular text files -
losing the BOM in the process.

Under most circumstances this issue has little functional impact,
as most systems consider the BOM to be optional and redundant, but
this *is* a correctness failure, and can have lead to practical
issues for example when BOMs are explicitly included in test files,
for example in a file encoding test suite.

Fix the handling of utf8-with-BOM files when importing changes from
p4 to git, and introduce a test that checks it is working correctly.

Signed-off-by: Tao Klerks <tao@klerks.biz>
---
    git-p4: preserve utf8 BOM when importing from p4 to git
    
    I manually tested these changes with python2 and python3 - I don't know
    whether there is a more rigorous approach possible than changing the
    system default python and rerunning the "t98xx" tests, but that did work
    (and initially highlighted an issue, now fixed).

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1203%2FTaoK%2Fgit-p4-utf8-bom-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1203/TaoK/git-p4-utf8-bom-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1203

 git-p4.py                  | 10 ++++++++++
 t/t9802-git-p4-filetype.sh | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+)

diff --git a/git-p4.py b/git-p4.py
index a9b1f904410..6d932e7ed76 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -2885,6 +2885,16 @@ class P4Sync(Command, P4UserMap):
             print("\nIgnoring apple filetype file %s" % file['depotFile'])
             return
 
+        if type_base == "utf8":
+            # The type utf8 explicitly means utf8 *with BOM*. These are
+            # streamed just like regular text files, however, without
+            # the BOM in the stream.
+            # Therefore, to accurately import these files into git, we
+            # need to explicitly re-add the BOM before writing.
+            # 'contents' is a set of bytes in this case, so create the
+            # BOM prefix as a b'' literal.
+            contents = [b'\xef\xbb\xbf' + contents[0]] + contents[1:]
+
         # Note that we do not try to de-mangle keywords on utf16 files,
         # even though in theory somebody may want that.
         regexp = p4_keywords_regexp_for_type(type_base, type_mods)
diff --git a/t/t9802-git-p4-filetype.sh b/t/t9802-git-p4-filetype.sh
index 19073c6e9f8..2a6ee2a4678 100755
--- a/t/t9802-git-p4-filetype.sh
+++ b/t/t9802-git-p4-filetype.sh
@@ -333,4 +333,38 @@ test_expect_success SYMLINKS 'empty symlink target' '
 	)
 '
 
+test_expect_success SYMLINKS 'utf-8 with and without BOM in text file' '
+	(
+		cd "$cli" &&
+
+		# some utf8 content
+		echo some tǣxt >utf8-nobom-test &&
+
+		# same utf8 content as before but with bom
+		echo some tǣxt | sed '\''s/^/\xef\xbb\xbf/'\'' >utf8-bom-test &&
+
+		# bom only
+		dd bs=1 count=3 if=utf8-bom-test of=utf8-bom-empty-test &&
+
+		p4 add utf8-nobom-test utf8-bom-test utf8-bom-empty-test &&
+		p4 submit -d "add utf8 test files"
+	) &&
+	test_when_finished cleanup_git &&
+
+	git p4 clone --dest="$git" //depot@all &&
+	(
+		cd "$git" &&
+		git checkout refs/remotes/p4/master &&
+
+		echo some tǣxt >utf8-nobom-check &&
+		test_cmp utf8-nobom-check utf8-nobom-test &&
+
+		echo some tǣxt | sed '\''s/^/\xef\xbb\xbf/'\'' >utf8-bom-check &&
+		test_cmp utf8-bom-check utf8-bom-test &&
+
+		dd bs=1 count=3 if=utf8-bom-check of=utf8-bom-empty-check &&
+		test_cmp utf8-bom-empty-check utf8-bom-empty-test
+	)
+'
+
 test_done

base-commit: 4b6846d9dcd391164b72bd70e8a0c0e09776afe3
-- 
gitgitgadget

             reply	other threads:[~2022-04-04  5:50 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-04  5:50 Tao Klerks via GitGitGadget [this message]
  -- strict thread matches above, loose matches on Subject: below --
2022-12-14  6:10 [PATCH] git-p4: preserve utf8 BOM when importing from p4 to git Tzadik Vanderhoof
2022-12-14  8:29 ` Junio C Hamano
2022-12-14 18:24 ` Tao Klerks
2022-12-14 23:11   ` Junio C Hamano
2022-12-19  9:09     ` Tao Klerks
2022-12-22  8:26       ` Tao Klerks

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=pull.1203.git.1649051436934.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=tao@klerks.biz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.