From: Patrick Steinhardt <ps@pks.im>
To: git@vger.kernel.org
Subject: Re: [PATCH v2 4/7] reftable/block: refactor binary search over restart points
Date: Tue, 2 Apr 2024 19:15:42 +0200 [thread overview]
Message-ID: <Zgw9PlgGDcmwLnDB@tanuki> (raw)
In-Reply-To: <45v2z6uszlkanwl5qhvio4ikrrytztohbmdpnmdwiefyznclum@xhbvlvnfmkmq>
[-- Attachment #1: Type: text/plain, Size: 4686 bytes --]
On Tue, Apr 02, 2024 at 11:42:29AM -0500, Justin Tobler wrote:
> On 24/03/25 11:10AM, Patrick Steinhardt wrote:
> > When seeking a record in our block reader we perform a binary search
> > over the block's restart points so that we don't have to do a linear
> > scan over the whole block. The logic to do so is quite intricate though,
> > which makes it hard to understand.
> >
> > Improve documentation and rename some of the functions and variables so
> > that the code becomes easier to understand overall. This refactoring
> > should not result in any change in behaviour.
> >
> > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> > ---
> ...
> > - i = binsearch(br->restart_count, &restart_key_less, &args);
> > + /*
> > + * Perform a binary search over the block's restart points, which
> > + * avoids doing a linear scan over the whole block. Like this, we
> > + * identify the section of the block that should contain our key.
> > + *
> > + * Note that we explicitly search for the first restart point _greater_
> > + * than the sought-after record, not _greater or equal_ to it. In case
> > + * the sought-after record is located directly at the restart point we
> > + * would otherwise start doing the linear search at the preceding
> > + * restart point. While that works alright, we would end up scanning
> > + * too many record.
> > + */
> > + i = binsearch(br->restart_count, &restart_needle_less, &args);
> > +
> > + /*
> > + * Now there are multiple cases:
> > + *
> > + * - `i == 0`: The wanted record must be contained before the first
> > + * restart point. We will thus start searching for the record in
> > + * that section after accounting for the header offset.
>
> To clarify my understanding, does a restart_offset not exist at the
> beginning of a block? I was originally under the impression that the
> first reset point would be at the beginning of the block (or just after
> the header). If this was the case, the first restart point being greater
> would indicate that the wanted record is not contained within the block.
It wouldn't make much sense to include it as a restart point. A restart
point is a point where the prefix compression will be reset such that
the record at that point can be read without reading preceding records.
This is always implicitly true for the first record in a block as it is
never prefix-compressed. Consequently, writing a restart point for the
first record would be a waste of disk space.
Patrick
> -Justin
>
> > + *
> > + * - `i == restart_count`: The wanted record was not found at any of
> > + * the restart points. As there is no restart point at the end of
> > + * the section the record may thus be contained in the last block.
> > + *
> > + * - `i > 0`: The wanted record must be contained in the section
> > + * before the found restart point. We thus do a linear search
> > + * starting from the preceding restart point.
> > + */
> > if (i > 0)
> > it->next_off = block_reader_restart_offset(br, i - 1);
> > else
> > @@ -399,21 +429,34 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
> >
> > reftable_record_init(&rec, block_reader_type(br));
> >
> > - /* We're looking for the last entry less/equal than the wanted key, so
> > - we have to go one entry too far and then back up.
> > - */
> > + /*
> > + * We're looking for the last entry less than the wanted key so that
> > + * the next call to `block_reader_next()` would yield the wanted
> > + * record. We thus don't want to position our reader at the sought
> > + * after record, but one before. To do so, we have to go one entry too
> > + * far and then back up.
> > + */
> > while (1) {
> > block_iter_copy_from(&next, it);
> > err = block_iter_next(&next, &rec);
> > if (err < 0)
> > goto done;
> > -
> > - reftable_record_key(&rec, &it->last_key);
> > - if (err > 0 || strbuf_cmp(&it->last_key, want) >= 0) {
> > + if (err > 0) {
> > err = 0;
> > goto done;
> > }
> >
> > + /*
> > + * Check whether the current key is greater or equal to the
> > + * sought-after key. In case it is greater we know that the
> > + * record does not exist in the block and can thus abort early.
> > + * In case it is equal to the sought-after key we have found
> > + * the desired record.
> > + */
> > + reftable_record_key(&rec, &it->last_key);
> > + if (strbuf_cmp(&it->last_key, want) >= 0)
> > + goto done;
> > +
> > block_iter_copy_from(it, &next);
> > }
> >
> > --
> > 2.44.GIT
> >
>
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
next prev parent reply other threads:[~2024-04-02 17:15 UTC|newest]
Thread overview: 45+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-03-22 12:22 [PATCH 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
2024-03-22 12:22 ` [PATCH 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
2024-03-22 12:22 ` [PATCH 2/7] reftable/basics: improve `binsearch()` test Patrick Steinhardt
2024-03-22 18:46 ` Justin Tobler
2024-03-25 10:07 ` Patrick Steinhardt
2024-03-22 12:22 ` [PATCH 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
2024-03-22 18:55 ` Justin Tobler
2024-03-25 10:07 ` Patrick Steinhardt
2024-03-22 12:22 ` [PATCH 4/7] reftable/block: refactor binary search over restart points Patrick Steinhardt
2024-03-22 12:22 ` [PATCH 5/7] reftable/block: fix error handling when searching " Patrick Steinhardt
2024-03-22 12:22 ` [PATCH 6/7] reftable/record: extract function to decode key lengths Patrick Steinhardt
2024-03-22 12:22 ` [PATCH 7/7] reftable/block: avoid decoding keys when searching restart points Patrick Steinhardt
2024-03-25 10:10 ` [PATCH v2 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
2024-03-25 10:10 ` [PATCH v2 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
2024-03-25 10:10 ` [PATCH v2 2/7] reftable/basics: improve `binsearch()` test Patrick Steinhardt
2024-03-25 10:10 ` [PATCH v2 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
2024-04-02 16:27 ` Justin Tobler
2024-04-02 17:15 ` Patrick Steinhardt
2024-03-25 10:10 ` [PATCH v2 4/7] reftable/block: refactor binary search over restart points Patrick Steinhardt
2024-04-02 16:42 ` Justin Tobler
2024-04-02 17:15 ` Patrick Steinhardt [this message]
2024-04-02 17:46 ` Justin Tobler
2024-04-03 6:01 ` Patrick Steinhardt
2024-03-25 10:10 ` [PATCH v2 5/7] reftable/block: fix error handling when searching " Patrick Steinhardt
2024-03-25 10:10 ` [PATCH v2 6/7] reftable/record: extract function to decode key lengths Patrick Steinhardt
2024-03-25 10:11 ` [PATCH v2 7/7] reftable/block: avoid decoding keys when searching restart points Patrick Steinhardt
2024-04-02 16:47 ` Justin Tobler
2024-04-02 17:15 ` Patrick Steinhardt
2024-04-02 17:24 ` [PATCH v3 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
2024-04-02 17:24 ` [PATCH v3 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
2024-04-02 17:24 ` [PATCH v3 2/7] reftable/basics: improve `binsearch()` test Patrick Steinhardt
2024-04-02 17:24 ` [PATCH v3 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
2024-04-02 17:24 ` [PATCH v3 4/7] reftable/block: refactor binary search over restart points Patrick Steinhardt
2024-04-02 17:24 ` [PATCH v3 5/7] reftable/block: fix error handling when searching " Patrick Steinhardt
2024-04-02 17:25 ` [PATCH v3 6/7] reftable/record: extract function to decode key lengths Patrick Steinhardt
2024-04-02 17:25 ` [PATCH v3 7/7] reftable/block: avoid decoding keys when searching restart points Patrick Steinhardt
2024-04-02 17:49 ` [PATCH v3 0/7] reftable: improvements for the `binsearch()` mechanism Justin Tobler
2024-04-03 6:03 ` [PATCH v4 " Patrick Steinhardt
2024-04-03 6:03 ` [PATCH v4 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
2024-04-03 6:04 ` [PATCH v4 2/7] reftable/basics: improve `binsearch()` test Patrick Steinhardt
2024-04-03 6:04 ` [PATCH v4 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
2024-04-03 6:04 ` [PATCH v4 4/7] reftable/block: refactor binary search over restart points Patrick Steinhardt
2024-04-03 6:04 ` [PATCH v4 5/7] reftable/block: fix error handling when searching " Patrick Steinhardt
2024-04-03 6:04 ` [PATCH v4 6/7] reftable/record: extract function to decode key lengths Patrick Steinhardt
2024-04-03 6:04 ` [PATCH v4 7/7] reftable/block: avoid decoding keys when searching restart points Patrick Steinhardt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Zgw9PlgGDcmwLnDB@tanuki \
--to=ps@pks.im \
--cc=git@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).