From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
Garima Singh <garima.singh@microsoft.com>,
Derrick Stolee <stolee@gmail.com>,
Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
Taylor Blau <me@ttaylorr.com>
Subject: Re: [PATCH 15/34] commit-graph-format.txt: document the modified path Bloom filter chunks
Date: Tue, 2 Jun 2020 19:59:49 +0200 [thread overview]
Message-ID: <20200602175949.GA2898@szeder.dev> (raw)
In-Reply-To: <20200529085038.26008-16-szeder.dev@gmail.com>
On Fri, May 29, 2020 at 10:50:19AM +0200, SZEDER Gábor wrote:
> Modified Path Bloom Filters
> Each modified path Bloom filter consists of:
>
> - 4 bytes specifying the number of bits in the Bloom filter's bit
> array.
>
> For practical purposes 32 bit is more than sufficient to store the
> number of bits in the Bloom filter's array. When using k = 7
> hashes, i.e. 10 bits per path, then we could store over 400
> million paths in a single Bloom filter; with k = 32 hashes we'd
> use 46 bit per path, which is still over 93 million paths.
> Alternatives considered
> -----------------------
>
> Here are some alternatives that I've considered but discarded and
> ideas that I haven't (yet) followed through:
> - Varints. Using 4 bytes for the size of each Bloom filter in the
> Modified Path Bloom Filters chunk is a lot more than necessary.
> How much space can be saved by using varints?
Not that much, at least compared to the whole commit-graph file.
Since 63 bit modified path Bloom filters are embedded in the index
entries, it's pointless to store smaller Bloom filters, so the size of
non-embedded Bloom filters can be defined as
nr_bits = 64 + decode_varint(...)
A one byte varint can encode values up to 127, while a two bytes
varint can encode values up to 16511. So the max nr_bits is 191 and
16575, respectively, which means max 19 or 1657 paths with 7 hashes,
i.e. 10 bits per path. The table below shows the percentage of
commits with embedded modified path Bloom filters and commits with
non-embedded Bloom filters with one byte and two bytes varints, and
how much space is saved.
Percentage of commits |
modifying <= N paths | varint commit-graph
compared to first parent | size diff file size
<=6 <=19 <=1657 | (bytes) (ls -lh)
------------------------------------------+-------------------------------
elasticsearch 18.32% 65.56% 99.79% | -90158 9.5M -0.9%
jdk 26.62% 70.46% 97.94% | -90262 16M -0.6%
webkit 38.42% 82.24% 99.92% | -298824 19M -1.6%
android-base 42.32% 88.02% 99.98% | -132327 30M -0.5%
llvm-project 53.60% 93.86% 99.99% | -344239 24M -1.4%
gecko-dev 54.54% 87.15% 99.87% | -625029 63M -1.0%
tensorflow 55.17% 90.76% 99.52% | -83758 9.0M -0.9%
rails 58.71% 95.13% 99.99% | -53153 6.0M -0.9%
rust 59.29% 88.03% 99.96% | -90677 8.2M -1.1%
glibc 61.59% 91.11% 99.96% | -38422 3.8M -1.0%
gcc 63.80% 95.39% 99.97% | -179463 18M -1.0%
go 65.31% 95.19% 99.99% | -39109 3.2M -1.2%
cmssw 67.58% 93.02% 99.91% | -154440 23M -0.7%
linux 72.79% 95.27% 99.78% | -527045 78M -0.7%
cpython 81.91% 97.78% 100.00% | -40678 7.8M -0.5%
git 90.28% 98.56% 100.00% | -13406 3.9M -0.4%
homebrew-cask 98.61% 99.58% 99.99% | -2992 6.6M -0.1%
homebrew-core 99.81% 99.93% 100.00% | -810 11M -0.1%
> How much is the runtime overhead of decoding those varints?
Not enough to be above noise level. ("a lot of paths",
MurmurHash3, k = 7, enhanced double hashing, 32 bit uint arithmetic)
| Average time spent
Average runtime | loading and querying
| Bloom filters
uint32 varint | uint32 varint
-------------------------------------+-----------------------------
android-base 0.1456s 0.144172s | 0.01550s 0.01571s 1.3%
cmssw 0.0313s 0.032046s | 0.00383s 0.00395s 3.1%
cpython 0.0810s 0.081649s | 0.00765s 0.00785s 2.6%
elasticsearch 0.0136s 0.013899s | 0.00246s 0.00258s 4.8%
gcc 0.2114s 0.211080s | 0.02259s 0.02261s 0%
gecko-dev 0.4815s 0.474903s | 0.06004s 0.06028s 0.3%
git 0.0310s 0.031156s | 0.00192s 0.00195s 1.5%
glibc 0.0282s 0.029032s | 0.00320s 0.00342s 6.8%
go 0.0403s 0.039408s | 0.00428s 0.00414s -3.3%
jdk 0.0068s 0.006666s | 0.00117s 0.00113s -3.5%
linux 0.0873s 0.087438s | 0.01109s 0.01133s 2.1%
llvm-project 0.4135s 0.418390s | 0.04716s 0.04834s 2.5%
rails 0.0391s 0.038997s | 0.00449s 0.00448s -0.3%
rust 0.0439s 0.044569s | 0.00444s 0.00461s 3.8%
tensorflow 0.0487s 0.049166s | 0.00618s 0.00634s 2.5%
webkit 0.2420s 0.241807s | 0.03353s 0.03379s 0.7%
> How can we ensure that varint decoding doesn't read past the end
> of the mmapped memory region?
With a decode_varint() variant that takes the max number of bytes to
read as an additional parameter, and returns 0 if the varint is too
long.
next prev parent reply other threads:[~2020-06-02 18:00 UTC|newest]
Thread overview: 43+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-05-29 8:50 [PoC PATCH 00/34] An alternative modified path Bloom filters implementation SZEDER Gábor
2020-05-28 22:09 ` Johannes Schindelin
2020-05-29 8:50 ` [PATCH 01/34] tree-walk.c: don't match submodule entries for 'submod/anything' SZEDER Gábor
2020-05-29 8:50 ` [PATCH 02/34] commit-graph: fix parsing the Chunk Lookup table SZEDER Gábor
2020-05-29 8:50 ` [PATCH 03/34] commit-graph-format.txt: all multi-byte numbers are in network byte order SZEDER Gábor
2020-05-29 8:50 ` [PATCH 04/34] commit-slab: add a function to deep free entries on the slab SZEDER Gábor
2020-06-04 16:43 ` Derrick Stolee
2020-06-27 15:56 ` SZEDER Gábor
2020-06-29 11:30 ` Derrick Stolee
2020-05-29 8:50 ` [PATCH 05/34] diff.h: drop diff_tree_oid() & friends' return value SZEDER Gábor
2020-05-29 8:50 ` [PATCH 06/34] commit-graph: clean up #includes SZEDER Gábor
2020-05-29 8:50 ` [PATCH 07/34] commit-graph: simplify parse_commit_graph() #1 SZEDER Gábor
2020-05-29 8:50 ` [PATCH 08/34] commit-graph: simplify parse_commit_graph() #2 SZEDER Gábor
2020-05-29 8:50 ` [PATCH 09/34] commit-graph: simplify write_commit_graph_file() #1 SZEDER Gábor
2020-05-29 8:50 ` [PATCH 10/34] commit-graph: simplify write_commit_graph_file() #2 SZEDER Gábor
2020-05-29 8:50 ` [PATCH 11/34] commit-graph: allocate the 'struct chunk_info' array dinamically SZEDER Gábor
2020-05-29 8:50 ` [PATCH 12/34] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor
2020-05-29 8:50 ` [PATCH 13/34] commit-graph: simplify write_commit_graph_file() #3 SZEDER Gábor
2020-05-29 8:50 ` [PATCH 14/34] commit-graph: check chunk sizes after writing SZEDER Gábor
2020-05-29 8:50 ` [PATCH 15/34] commit-graph-format.txt: document the modified path Bloom filter chunks SZEDER Gábor
2020-06-02 17:59 ` SZEDER Gábor [this message]
2020-05-29 8:50 ` [PATCH 16/34] Add a generic and minimal Bloom filter implementation SZEDER Gábor
2020-05-29 8:50 ` [PATCH 17/34] Import a streaming-capable Murmur3 hash function implementation SZEDER Gábor
2020-05-29 8:50 ` [PATCH 18/34] commit-graph: write "empty" Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29 8:50 ` [PATCH 19/34] commit-graph: add commit slab for modified path Bloom filters SZEDER Gábor
2020-05-29 8:50 ` [PATCH 20/34] commit-graph: fill the Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29 8:50 ` [PATCH 21/34] commit-graph: load and use " SZEDER Gábor
2020-05-29 8:50 ` [PATCH 22/34] commit-graph: write the Modified Path Bloom Filters chunk SZEDER Gábor
2020-05-29 8:50 ` [PATCH 23/34] commit-graph: load and use " SZEDER Gábor
2020-05-29 8:50 ` [PATCH 24/34] commit-graph: check all leading directories in modified path Bloom filters SZEDER Gábor
2020-05-29 8:50 ` [PATCH 25/34] commit-graph: check embedded modified path Bloom filters with a mask SZEDER Gábor
2020-05-29 8:50 ` [PATCH 26/34] commit-graph: deduplicate modified path Bloom filters SZEDER Gábor
2020-05-29 8:50 ` [PATCH 27/34] commit-graph: load modified path Bloom filters for merge commits SZEDER Gábor
2020-05-29 8:50 ` [PATCH 28/34] commit-graph: write Modified Path Bloom Filter Merge Index chunk SZEDER Gábor
2020-05-29 8:50 ` [PATCH 29/34] commit-graph: extract init and free write_commit_graph_context SZEDER Gábor
2020-05-29 8:50 ` [PATCH 30/34] commit-graph: move write_commit_graph_reachable below write_commit_graph SZEDER Gábor
2020-05-29 8:50 ` [PATCH 31/34] t7007-show: make the first test compatible with the next patch SZEDER Gábor
2020-05-29 8:50 ` [PATCH 32/34] PoC commit-graph: use revision walk machinery for '--reachable' SZEDER Gábor
2020-05-29 8:50 ` [PATCH 33/34] commit-graph: write modified path Bloom filters in "history order" SZEDER Gábor
2020-05-29 8:50 ` [PATCH 34/34] commit-graph: use modified path Bloom filters with wildcards, if possible SZEDER Gábor
2020-05-29 13:59 ` [PoC PATCH 00/34] An alternative modified path Bloom filters implementation Derrick Stolee
2020-06-01 23:25 ` Taylor Blau
2020-06-02 17:08 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200602175949.GA2898@szeder.dev \
--to=szeder.dev@gmail.com \
--cc=garima.singh@microsoft.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=jnareb@gmail.com \
--cc=me@ttaylorr.com \
--cc=peff@peff.net \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.