From: Michael Haggerty <mhagger@alum.mit.edu>
To: Junio C Hamano <gitster@pobox.com>
Cc: "Stefan Beller" <sbeller@google.com>,
"Johannes Schindelin" <Johannes.Schindelin@gmx.de>,
"Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>,
"Jeff King" <peff@peff.net>,
"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
"Brandon Williams" <bmwill@google.com>,
git@vger.kernel.org, "Michael Haggerty" <mhagger@alum.mit.edu>
Subject: [PATCH v2 00/21] Read `packed-refs` using mmap()
Date: Tue, 19 Sep 2017 08:22:08 +0200
Message-ID: <cover.1505799700.git.mhagger@alum.mit.edu>
This is v2 of a patch series that changes the reading and caching of
the `packed-refs` file to use `mmap()`. Thanks to Junio, Stefan, and
Johannes for their comments about v1 [1].
The main change since v1 is to accommodate Windows, which doesn't let
you replace a file using `rename()` if the file is currently mmapped.
This is unfortunate, because it means that Windows will never get the
O(N) → O(lg N) improvement for reading single references that more
capable systems can now enjoy.
The background was discussed on the mailing list [2]. The bottom line
is that on Windows, keeping the `packed-refs` file mmapped would be
tantamount to holding a reader lock on that file, preventing anybody
(even unrelated processes) from changing the `packed-refs` file while
it is mmapped. This is even worse than the situation for packfiles
(which is solved using `close_all_packs()`), because a packfile, once
created, never needs to be replaced—every packfile has a filename that
is determined from its contents. The worst that can happen if a
packfile is locked is that another process cannot remove it, but that
is not critical for correctness. The `packed-refs` file, on the other
hand, always has the same filename and needs to be overwritten for
correctness.
So the approach taken here is that a new compile-time option,
`MMAP_PREVENTS_DELETE`, is introduced. When this option is set, the
`packed-refs` file is read quickly into memory and then closed.
Even in that case, though, this branch brings significant performance
benefits, because instead of parsing the whole file and storing it
into lots of little objects in a `ref_cache` (which also involves a
lot of small memory allocations), we copy the verbatim contents of the
file into memory. Then we use the same binary search techniques to
find any references that we need to read, just as we would do if the
file were memory mapped. This means that we only have to fully parse
the references that we are interested in, and hardly have to allocate
any additional memory.
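To make the lookup concrete, here is a minimal sketch of a binary search
over the raw file contents. It is not the code from this series: the
names `find_record()`, `cmp_record()`, `line_start()`, and `line_end()`
are hypothetical, and it assumes a buffer that holds only
`<oid> <refname>` lines sorted by refname, each terminated by LF, with
no header line and no peeled (`^`) lines.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Back up from p to the first byte of the line that contains it. */
static const char *line_start(const char *buf, const char *p)
{
	while (p > buf && p[-1] != '\n')
		p--;
	return p;
}

/* Return the first byte after the line that contains p. */
static const char *line_end(const char *p, const char *eof)
{
	const char *nl = memchr(p, '\n', (size_t)(eof - p));
	return nl ? nl + 1 : eof;
}

/*
 * Compare the refname of the record starting at rec with refname.
 * Only this one record is parsed; nothing is copied or allocated.
 */
static int cmp_record(const char *rec, const char *eof, const char *refname)
{
	const char *end = line_end(rec, eof);
	const char *sp = memchr(rec, ' ', (size_t)(end - rec));
	const char *name;
	size_t name_len, want_len, n;
	int c;

	assert(sp != NULL);		/* assumption: well-formed record */
	name = sp + 1;
	name_len = (size_t)(end - name) - 1;	/* drop the trailing LF */
	want_len = strlen(refname);
	n = name_len < want_len ? name_len : want_len;
	c = memcmp(name, refname, n);
	if (c)
		return c;
	return (name_len > want_len) - (name_len < want_len);
}

/*
 * Binary search over byte offsets. lo and hi always sit on record
 * boundaries. Returns the start of the matching record, or NULL.
 */
const char *find_record(const char *buf, size_t len, const char *refname)
{
	const char *lo = buf, *hi = buf + len, *eof = buf + len;

	assert(len == 0 || buf[len - 1] == '\n');
	while (lo < hi) {
		const char *mid = line_start(buf, lo + (hi - lo) / 2);
		int c = cmp_record(mid, eof, refname);

		if (c < 0)
			lo = line_end(mid, eof);
		else if (c > 0)
			hi = mid;
		else
			return mid;
	}
	return NULL;
}
```

The same function works whether `buf` points into an mmapped region or
into a heap copy of the file, which is why both build variants can share
the lookup code. Only the records that the search visits are ever
parsed.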
I did some more careful benchmarks of this code vs. Git 2.14.1 on a
repository that is not quite as pathological. The test repo has 110k
references that are fully packed in a `packed-refs` file that has the
`sorted` trait. The current version is compiled three ways:
* `NO_MMAP=YesPlease`—prevents all use of `mmap()`. This variant is
O(N) when reading a single reference.
* `MMAP_PREVENTS_DELETE=YesPlease`—uses mmap for the initial read, but
quickly copies the contents to heap-allocated memory and munmaps
right away. This variant is also O(N) when reading a single
reference.
* default (mmap enabled)—the `packed-refs` file is kept mmapped for as
long as it is in use.
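As a rough illustration of the difference between the last two build
variants, the following sketch chooses at compile time between keeping
the mapping and copying it to the heap. Only the macro name
`MMAP_PREVENTS_DELETE` comes from this series; `struct ref_buffer`,
`load_ref_buffer()`, and `release_ref_buffer()` are hypothetical names,
the code assumes a POSIX system, and `NO_MMAP` is not handled here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical holder for the file contents; not a type from this series. */
struct ref_buffer {
	char *buf;
	size_t len;
	int mmapped;	/* nonzero: release with munmap(); zero: with free() */
};

/* Load the file at path into rb. Returns 0 on success, -1 on error. */
int load_ref_buffer(const char *path, struct ref_buffer *rb)
{
	struct stat st;
	void *map;
	int fd;

	assert(path != NULL && rb != NULL);
	rb->buf = NULL;
	rb->len = 0;
	rb->mmapped = 0;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	if (st.st_size == 0) {		/* mmap() rejects zero-length mappings */
		close(fd);
		return 0;
	}

	map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);			/* the mapping outlives the descriptor */
	if (map == MAP_FAILED)
		return -1;
	rb->len = (size_t)st.st_size;

#ifdef MMAP_PREVENTS_DELETE
	/* Copy to the heap and unmap right away so the file can be replaced. */
	rb->buf = malloc(rb->len);
	if (rb->buf)
		memcpy(rb->buf, map, rb->len);
	munmap(map, rb->len);
	if (!rb->buf) {
		rb->len = 0;
		return -1;
	}
#else
	/* Keep the file mapped for as long as the buffer is in use. */
	rb->buf = map;
	rb->mmapped = 1;
#endif
	return 0;
}

/* Release the buffer with the call that matches how it was obtained. */
void release_ref_buffer(struct ref_buffer *rb)
{
	assert(rb != NULL);
	if (rb->mmapped)
		munmap(rb->buf, rb->len);
	else
		free(rb->buf);
	rb->buf = NULL;
	rb->len = 0;
	rb->mmapped = 0;
}
```

Either way the caller ends up with a contiguous, read-only view of the
file, so the code that searches the buffer does not need to know which
variant was compiled in.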
The commands that I timed were as follows:
    # for-each-ref, warm cache:
    $ git -C lots-of-refs for-each-ref --format="%(objectname) %(refname)" >/dev/null
    # rev-parse, warm cache (this command was run 10 times then the total
    # time divided by 10):
    $ git -C lots-of-refs rev-parse --verify refs/remotes/origin/pr/38733
    # rev-parse, cold cache (but git binary warm):
    $ sync ; sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'; git rev-parse v1.0.0; time git -C lots-of-refs rev-parse --verify refs/remotes/origin/pr/38733
(Note that the `rev-parse` commands involve a handful of reference
lookups as the argument is DWIMmed.)
Results:
                                  for-each-ref    rev-parse    rev-parse
                                    warm cache   warm cache   cold cache
                                  ------------   ----------   ----------
v2.14.1                                  92 ms      23.7 ms        30 ms
NO_MMAP=YesPlease                        83 ms       3.4 ms        10 ms
MMAP_PREVENTS_DELETE=YesPlease           82 ms       3.5 ms        11 ms
default (mmap enabled)                   81 ms       0.8 ms         6 ms
So this branch is a little bit faster at iterating over all
references, but it really has big advantages when looking up single
references. The advantage is smaller if `NO_MMAP` or
`MMAP_PREVENTS_DELETE` is set, but is still quite significant.
This branch is also available from my fork on GitHub [3] as branch
`mmap-packed-refs`.
My main uncertainties are:
1. Does this code actually work on Windows?
2. Did I implement the new compile-time option correctly? (I just
cargo-culted some existing options.) Is there some existing option
that I could piggy-back off of instead of adding a new one?
3. Is a compile-time option sufficient, or would the `mmap()` option
need to be configurable at runtime, or even tested at repository
create time as is done for some other filesystem properties in
`create_default_files()`?
Michael
[1] https://public-inbox.org/git/cover.1505319366.git.mhagger@alum.mit.edu/
[2] https://public-inbox.org/git/alpine.DEB.2.21.1.1709142101560.4132@virtualbox/
https://public-inbox.org/git/alpine.DEB.2.21.1.1709150012550.219280@virtualbox/
[3] https://github.com/mhagger/git
Jeff King (1):
  prefix_ref_iterator: break when we leave the prefix
Michael Haggerty (20):
  ref_iterator: keep track of whether the iterator output is ordered
  packed_ref_cache: add a backlink to the associated `packed_ref_store`
  die_unterminated_line(), die_invalid_line(): new functions
  read_packed_refs(): use mmap to read the `packed-refs` file
  read_packed_refs(): only check for a header at the top of the file
  read_packed_refs(): make parsing of the header line more robust
  read_packed_refs(): read references with minimal copying
  packed_ref_cache: remember the file-wide peeling state
  mmapped_ref_iterator: add iterator over a packed-refs file
  mmapped_ref_iterator_advance(): no peeled value for broken refs
  packed-backend.c: reorder some definitions
  packed_ref_cache: keep the `packed-refs` file mmapped if possible
  read_packed_refs(): ensure that references are ordered when read
  packed_ref_iterator_begin(): iterate using `mmapped_ref_iterator`
  packed_read_raw_ref(): read the reference from the mmapped buffer
  ref_store: implement `refs_peel_ref()` generically
  packed_ref_store: get rid of the `ref_cache` entirely
  ref_cache: remove support for storing peeled values
  mmapped_ref_iterator: inline into `packed_ref_iterator`
  packed-backend.c: rename a bunch of things and update comments
Makefile | 10 +
config.mak.uname | 3 +
refs.c | 22 +-
refs/files-backend.c | 54 +--
refs/iterator.c | 47 ++-
refs/packed-backend.c | 978 ++++++++++++++++++++++++++++++++++++++------------
refs/ref-cache.c | 44 +--
refs/ref-cache.h | 35 +-
refs/refs-internal.h | 26 +-
9 files changed, 850 insertions(+), 369 deletions(-)
--
2.14.1