From: "Chen Xuewei via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: Chen Xuewei <316403398@qq.com>, Chen Xuewei <316403398@qq.com>
Subject: [PATCH] fix: platform accordance while calculating murmur3
Date: Thu, 04 Jan 2024 13:56:46 +0000
Message-ID: <pull.1636.git.git.1704376606625.gitgitgadget@gmail.com>
From: Chen Xuewei <316403398@qq.com>
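Whether the highest bit is sign-extended when a char is converted to
uint32_t depends on whether plain char is signed, which differs between
platforms (unsigned on ARM, signed on x86). The same path can therefore
hash to different values. Mask each byte with 0xFF so that all
architectures compute the same hash.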
Signed-off-by: Chen Xuewei <316403398@qq.com>
---
Problem background
==================
Consider git log --max-count=1 <commit> -- <path> served by a cluster
with mixed CPU architectures (both ARM and x86 machines), where all
machines share the same repository disk and <path> contains Chinese or
other characters whose bytes have the highest bit set. The result is
inconsistent: sometimes the file is found in the commit, and sometimes
it is not.
Conditions
==========
1. The file path includes Chinese characters or other characters whose
bytes have the highest bit set.
2. The Git service runs on a cluster with mixed CPU architectures.
Reason
======
Suppose a Git server cluster has two or more machines, including at
least one ARM and one x86 machine, and the commit-graph's Bloom filter
feature is enabled. The Bloom filter stores file paths as hash values
computed by the murmur3 function. Whether the highest bit of a char is
extended depends on which machine does the computation. For example:
on arm, char(11100110) to uint32(00000000 00000000 00000000 11100110)
on x86, char(11100110) to uint32(11111111 11111111 11111111 11100110)
With the murmur3 function that Git currently uses, the two machines
therefore compute different hash values for the same path. If the
filter is written and queried on machines of the same CPU architecture,
the values agree and everything works. If the architectures differ, for
example the Bloom filter is written on ARM and queried on x86, the
query hash does not match the stored one, and the file is wrongly
reported as not touched by the commit.
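The difference described above can be reproduced on a single machine
with a minimal sketch. The helper names below are hypothetical and are
not part of bloom.c; they model the two behaviours of plain char
explicitly instead of relying on the host platform.

```c
#include <stdint.h>

/*
 * Hypothetical helpers for illustration only (not part of bloom.c).
 * Each one takes the raw byte value and widens it the way a platform
 * with signed or unsigned plain char would.
 */

/* Models a platform where plain char is signed (x86 in the text above). */
static uint32_t widen_as_signed_char(unsigned char raw)
{
	/* Value the byte would have as a signed 8-bit char. */
	int as_signed = raw >= 0x80 ? (int)raw - 256 : (int)raw;

	/* Converting a negative int to uint32_t sets the upper 24 bits. */
	return (uint32_t)as_signed;
}

/* Models a platform where plain char is unsigned (arm in the text above). */
static uint32_t widen_as_unsigned_char(unsigned char raw)
{
	/* Upper 24 bits stay zero. */
	return (uint32_t)raw;
}
```

For the byte 11100110 (0xE6) the two helpers return 0xFFFFFFE6 and
0x000000E6 respectively, while for any byte below 0x80 they agree. This
is why only paths containing non-ASCII bytes are affected.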
Solution
========
Regardless of what the upper 24 bits are after a char is converted to
uint32_t, the murmur3 function only needs the char itself, that is, the
lowest 8 bits. Applying & 0xFF (binary 11111111) to the converted value
keeps only those 8 bits, so the result is the same on every platform.
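As a rough sketch of the masking idea, the function below assembles
four bytes into one 32-bit word the same way the patched loop body
does. The name pack_le32_masked is hypothetical and chosen only for
this illustration; in bloom.c the logic is inline in murmur3_seeded().

```c
#include <stdint.h>

/*
 * Hypothetical helper for illustration only: combine four bytes of
 * `data` into a little-endian 32-bit word. Each byte is masked with
 * 0xFF first, so any sign extension from a signed plain char is
 * discarded before the shifts.
 */
static uint32_t pack_le32_masked(const char *data)
{
	uint32_t byte1 = ((uint32_t)data[0]) & 0xFF;
	uint32_t byte2 = ((uint32_t)data[1]) & 0xFF;
	uint32_t byte3 = ((uint32_t)data[2]) & 0xFF;
	uint32_t byte4 = ((uint32_t)data[3]) & 0xFF;

	return byte1 | (byte2 << 8) | (byte3 << 16) | (byte4 << 24);
}
```

With the input bytes E6 96 87 61 this yields 0x618796E6 whether plain
char is signed or unsigned. Without the mask, a sign-extended first
byte would OR 0xFFFFFF00 into the word and overwrite the other three
bytes.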
Others
======
After this fix, the Bloom filter data already stored in the
commit-graph needs to be updated, because the existing path hashes may
have been computed the wrong way. This needs to be done in the
repository.
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1636%2Fcdegree%2Fmaster-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1636/cdegree/master-v1
Pull-Request: https://github.com/git/git/pull/1636
bloom.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/bloom.c b/bloom.c
index 1474aa19fa5..bc40edac795 100644
--- a/bloom.c
+++ b/bloom.c
@@ -116,11 +116,11 @@ uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
uint32_t k;
for (i = 0; i < len4; i++) {
- uint32_t byte1 = (uint32_t)data[4*i];
- uint32_t byte2 = ((uint32_t)data[4*i + 1]) << 8;
- uint32_t byte3 = ((uint32_t)data[4*i + 2]) << 16;
- uint32_t byte4 = ((uint32_t)data[4*i + 3]) << 24;
- k = byte1 | byte2 | byte3 | byte4;
+ uint32_t byte1 = ((uint32_t)data[4*i]) & 0xFF;
+ uint32_t byte2 = ((uint32_t)data[4*i + 1]) & 0xFF;
+ uint32_t byte3 = ((uint32_t)data[4*i + 2]) & 0xFF;
+ uint32_t byte4 = ((uint32_t)data[4*i + 3]) & 0xFF;
+ k = byte1 | (byte2 << 8) | (byte3 << 16) | (byte4 << 24);
k *= c1;
k = rotate_left(k, r1);
k *= c2;
base-commit: a26002b62827b89a19b1084bd75d9371d565d03c
--
gitgitgadget