Git development


* Re: [PATCH v5 2/3] trailer: find the end of the log message
From: Linus Arver @ 2023-12-29  6:42 UTC (permalink / raw)
  To: Junio C Hamano, Linus Arver via GitGitGadget
  Cc: git, Glen Choo, Christian Couder, Phillip Wood, Jonathan Tan
In-Reply-To: <xmqqr0lpoue3.fsf@gitster.g>

TL;DR: I'm working on a new approach.

Junio C Hamano <gitster@pobox.com> writes:
> Other than that, I didn't find anything questionable in any of the
> patches in this round.  Looking good.

So actually, I'm now taking a much more aggressive approach to libifying
the trailer subsystem. Instead of incrementally simplifying/improving
things as in this series, I think I need to get to the root problem,
which is that the trailer.h API isn't rich enough to make it pleasant
for clients to use, including our own builtin/interpret-trailers.c
client. That is, the problem we have today is that the trailer subsystem
is not very ergonomic for internal use, much less external use (outside
of Git itself).

As an example, the current API exposes process_trailers() which does a
whole bunch of things that only builtin/interpret-trailers.c cares
about. Multiple other clients of trailer.h exist in our codebase (e.g.,
sequencer.c, pretty.c, ref-filter.c) but none of them use
process_trailers().

One really useful data structure is the trailer_iterator that was
introduced in f0939a0eb1 (trailer: add interface for iterating over
commit trailers, 2020-09-27). The only problem is that it is not generic
enough for interpret-trailers.c to use it.

My new goal is to introduce a new API in trailer.h so that
interpret-trailers.c and everyone else can start using these new data
structures and associated functions (while preserving the
trailer_iterator interface). So the order of operations should be:

(1) enrich the trailer API (make trailer.h have simpler data structures
    and practical functions that clients can readily use), and
(2) make builtin/interpret-trailers.c, and other clients in the Git
    codebase use this new API.

This way when the unit test framework selection process is finalized we
can

(3) write unit tests for the functions in the (enriched) trailer API,

which is one of the major goals for my efforts around this area.

The work I've started locally for (1) does not depend on this series,
and I think it'll be cleaner (less churn) that way. So, feel free to
drop this series in favor of the forthcoming work described in this
message.

Thanks.


* Re: [PATCH v2 03/12] refs: refactor logic to look up storage backends
From: Junio C Hamano @ 2023-12-28 20:42 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Karthik Nayak
In-Reply-To: <ZY3Wcua6dtzO2jG4@framework>

Patrick Steinhardt <ps@pks.im> writes:

> Yeah, we do not really discern those two cases for now and instead just
> return `NULL` for any unknown ref storage format. All callers know
> to handle `NULL`, but the error handling will only report a generic
> "unknown" backend error.
>
> The easiest way to discern those cases would be to `BUG()` when being
> passed an invalid ref storage format smaller than 0 or larger than the
> number of known backends. Because ultimately it is just that, a bug that
> shouldn't ever occur.
>
> Not sure whether this is worth a reroll?

By using an unsigned type, you no longer have to worry about getting
handed a negative index, as the "must be smaller than ARRAY_SIZE()"
check will be sufficient to catch anybody who passes "-1" (cast to
unsigned by parameter passing).  So I would say that would be a good
enough reason to reroll, whether we differentiate 0 and an index
that is larger than refs_backends[] (or a negative one) with an
explicit BUG(), or just leave it to the caller by returning NULL.
As to the error handling, I suspect it is sufficient to return NULL
and let the caller handle it.
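
A minimal sketch of the unsigned-index argument, using hypothetical
stand-in names (`toy_backend`, `toy_find_backend()`) rather than the real
refs.c definitions: once the parameter is unsigned, a caller's -1 arrives
as UINT_MAX and fails the single upper-bound check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real refs.c types and constants. */
#define TOY_ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

enum { TOY_FORMAT_UNKNOWN = 0, TOY_FORMAT_FILES = 1 };

struct toy_backend {
	const char *name;
};

static const struct toy_backend toy_be_files = { "files" };

/* Slot 0 (TOY_FORMAT_UNKNOWN) is left NULL by the designated initializer. */
static const struct toy_backend *toy_backends[] = {
	[TOY_FORMAT_FILES] = &toy_be_files,
};

/*
 * With an unsigned parameter, -1 arrives as UINT_MAX, so the single
 * upper-bound comparison rejects it; no separate "< 0" test is needed.
 */
static const struct toy_backend *toy_find_backend(unsigned int format)
{
	if (format < TOY_ARRAY_SIZE(toy_backends))
		return toy_backends[format];
	return NULL;
}
```

In this sketch both the unknown format (0) and an out-of-range index
yield NULL, so the caller still cannot tell the two apart.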

Thanks.

>
> Patrick
>
>> > +static const struct ref_storage_be *refs_backends[] = {
>> > +	[REF_STORAGE_FORMAT_FILES] = &refs_be_files,
>> > +};
>> > ...
>> > +static const struct ref_storage_be *find_ref_storage_backend(int ref_storage_format)
>> >  {
>> > +	if (ref_storage_format < ARRAY_SIZE(refs_backends))
>> > +		return refs_backends[ref_storage_format];
>> >  	return NULL;
>> >  }


* Re: [PATCH v4] sideband.c: remove redundant 'NEEDSWORK' tag
From: Junio C Hamano @ 2023-12-28 20:33 UTC (permalink / raw)
  To: Chandra Pratap via GitGitGadget
  Cc: git, Torsten Bögershausen, Chandra Pratap, Chandra Pratap
In-Reply-To: <pull.1625.v4.git.1703750460527.gitgitgadget@gmail.com>

"Chandra Pratap via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Subject: Re: [PATCH v4] sideband.c: remove redundant 'NEEDSWORK' tag

The reason for removal is not that it was redundant and we said the
same thing elsewhere.  Rather, what it claimed to be necessary has
turned out to be unwanted.  So something like

    Subject: sideband.c: update stale NEEDSWORK comment

    If we really wanted to change the type of the parameter to this
    function to "size_t", we should also update its callers to hold
    the values they use to compute the parameter also in "size_t".

    But in this callchain, "int" is wide enough.  Avoid tempting
    future developers into wasting their time on using "size_t"
    around this function.

or along that line would be more appropriate, perhaps?

Thanks.

> From: Chandra Pratap <chandrapratap3519@gmail.com>
>
> Signed-off-by: Chandra Pratap <chandrapratap3519@gmail.com>
> ---
>     sideband.c: replace int with size_t for clarity
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1625%2FChand-ra%2Fdusra-v4
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1625/Chand-ra/dusra-v4
> Pull-Request: https://github.com/gitgitgadget/git/pull/1625
>
> Range-diff vs v3:
>
>  1:  273415aa6a4 ! 1:  8c003256e5b sideband.c: remove redundant 'NEEDSWORK' tag
>      @@ sideband.c: void list_config_color_sideband_slots(struct string_list *list, cons
>         *
>       - * NEEDSWORK: use "size_t n" instead for clarity.
>       + * It is fine to use "int n" here instead of "size_t n" as all calls to this
>      -+ * function pass an 'int' parameter.
>      ++ * function pass an 'int' parameter. Additionally, the buffer involved in
>      ++ * storing these 'int' values takes input from a packet via the pkt-line
>      ++ * interface, which is capable of transferring only 64kB at a time.
>         */
>        static void maybe_colorize_sideband(struct strbuf *dest, const char *src, int n)
>        {
>
>
>  sideband.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/sideband.c b/sideband.c
> index 6cbfd391c47..266a67342be 100644
> --- a/sideband.c
> +++ b/sideband.c
> @@ -69,7 +69,10 @@ void list_config_color_sideband_slots(struct string_list *list, const char *pref
>   * of the line. This should be called for a single line only, which is
>   * passed as the first N characters of the SRC array.
>   *
> - * NEEDSWORK: use "size_t n" instead for clarity.
> + * It is fine to use "int n" here instead of "size_t n" as all calls to this
> + * function pass an 'int' parameter. Additionally, the buffer involved in
> + * storing these 'int' values takes input from a packet via the pkt-line
> + * interface, which is capable of transferring only 64kB at a time.
>   */
>  static void maybe_colorize_sideband(struct strbuf *dest, const char *src, int n)
>  {
>
> base-commit: 1a87c842ece327d03d08096395969aca5e0a6996


* Re: [PATCH v2 02/12] worktree: skip reading HEAD when repairing worktrees
From: Patrick Steinhardt @ 2023-12-28 20:18 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: git, Karthik Nayak, Junio C Hamano
In-Reply-To: <CAPig+cSKpzOCOzC_mtNoA4yYmHCtMxB-Ujsd7YYHK-SPJvgt8w@mail.gmail.com>

On Thu, Dec 28, 2023 at 01:13:04PM -0500, Eric Sunshine wrote:
> On Thu, Dec 28, 2023 at 1:08 PM Eric Sunshine <sunshine@sunshineco.com> wrote:
> > Having said all that, I'm not overly opposed to this patch, especially
> > since your main focus is on getting the reftable backend integrated,
> > and because the changes (and ugliness) introduced by this patch are
> > entirely self-contained and private to worktree.c, so are not a
> > show-stopper by any means. Rather, I wanted to get down to writing
> > what I think would be a better future approach if someone gets around
> > to tackling it. (There is no pressing need at the moment, and that
> > someone doesn't have to be you.)
> 
> I forgot to mention that, if you reroll for some reason, the
> get_worktrees()/get_worktrees_internal() dance might deserve an
> in-source NEEDSWORK comment explaining that get_worktrees_internal()
> exists to work around the shortcoming that a corruption-tolerant
> function for retrieving worktree metadata (for use by the "repair"
> function) does not yet exist.

Thanks for sharing your thoughts, will do.

Patrick


* Re: [PATCH v2 03/12] refs: refactor logic to look up storage backends
From: Patrick Steinhardt @ 2023-12-28 20:11 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Karthik Nayak
In-Reply-To: <xmqqjzoygrx8.fsf@gitster.g>

On Thu, Dec 28, 2023 at 09:25:55AM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > In order to look up ref storage backends, we're currently using a linked
> > list of backends, where each backend is expected to set up its `next`
> > pointer to the next ref storage backend. This is kind of a weird setup
> > as backends need to be aware of other backends without much of a reason.
> >
> > Refactor the code so that the array of backends is centrally defined in
> > "refs.c", where each backend is now identified by an integer constant.
> > Expose functions to translate from those integer constants to the name
> > and vice versa, which will be required by subsequent patches.
> 
> A small question.  Does this have to be "int", or is "unsigned" (or
> even an enum, rewritten from the "REF_STORAGE_FORMAT_*" family of CPP
> macro constants) good enough?  I am only wondering what happens when
> you call find_ref_storage_backend() with a negative index.

No, it does not have to be an `int`, and handling a negative index would
be a bug. I tried to stick to what we have with `GIT_HASH_UNKNOWN`,
`GIT_HASH_SHA1` etc, which is exactly similar in spirit. Whether it's
the perfect way to handle this... probably not. Without the context I
would've used an `enum`, but instead I opted for consistency.

> For that matter, how REF_STORAGE_FORMAT_UNKNOWN (whose value is 0)
> is handled by the function also gets curious.  The caller may have
> to find that the backend hasn't been specified by receiving an
> element in the refs_backends[] array that corresponds to it, but the
> error behaviour of this function is also to return NULL, so it has
> to be prepared to handle both cases?

Yeah, we do not really discern those two cases for now and instead just
return `NULL` for any unknown ref storage format. All callers know
to handle `NULL`, but the error handling will only report a generic
"unknown" backend error.

The easiest way to discern those cases would be to `BUG()` when being
passed an invalid ref storage format smaller than 0 or larger than the
number of known backends. Because ultimately it is just that, a bug that
shouldn't ever occur.
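
One way to picture the distinction, as a sketch only: all names below are
hypothetical, and a status code stands in for the `BUG()` call so that
the out-of-range case can be observed without aborting.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

enum { DEMO_FORMAT_UNKNOWN = 0, DEMO_FORMAT_FILES = 1 };

enum demo_lookup {
	DEMO_LOOKUP_OK,      /* a backend is registered for this format */
	DEMO_LOOKUP_UNKNOWN, /* in range, but no backend (e.g. format 0) */
	DEMO_LOOKUP_INVALID  /* out of range: where BUG() would fire */
};

struct demo_backend {
	const char *name;
};

static const struct demo_backend demo_be_files = { "files" };

static const struct demo_backend *demo_backends[] = {
	[DEMO_FORMAT_FILES] = &demo_be_files,
};

/* Classify a format value instead of collapsing every failure into NULL. */
static enum demo_lookup demo_classify_format(int format)
{
	if (format < 0 || (size_t)format >= DEMO_ARRAY_SIZE(demo_backends))
		return DEMO_LOOKUP_INVALID;
	if (!demo_backends[format])
		return DEMO_LOOKUP_UNKNOWN;
	return DEMO_LOOKUP_OK;
}
```

Here the "unknown" format stays an ordinary, reportable condition, while
an index outside the table is flagged separately as a programming error.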

Not sure whether this is worth a reroll?

Patrick

> > +static const struct ref_storage_be *refs_backends[] = {
> > +	[REF_STORAGE_FORMAT_FILES] = &refs_be_files,
> > +};
> > ...
> > +static const struct ref_storage_be *find_ref_storage_backend(int ref_storage_format)
> >  {
> > +	if (ref_storage_format < ARRAY_SIZE(refs_backends))
> > +		return refs_backends[ref_storage_format];
> >  	return NULL;
> >  }


* Re: [PATCH 04/12] setup: start tracking ref storage format when
From: Patrick Steinhardt @ 2023-12-28 20:01 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <xmqqplyqgsem.fsf@gitster.g>

On Thu, Dec 28, 2023 at 09:15:29AM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > Makes me wonder whether we should then also add the following diff to
> > "setup: set repository's format on init" when both topics are being
> > merged together:
> >
> > diff --git a/setup.c b/setup.c
> > index 3d980814bc..3d35c78c68 100644
> > --- a/setup.c
> > +++ b/setup.c
> > @@ -2210,6 +2210,7 @@ int init_db(const char *git_dir, const char *real_git_dir,
> >  	 * format we can update the repository's settings accordingly.
> >  	 */
> >  	repo_set_hash_algo(the_repository, repo_fmt.hash_algo);
> > +	repo_set_compat_hash_algo(the_repository, repo_fmt.compat_hash_algo);
> >  	repo_set_ref_storage_format(the_repository, repo_fmt.ref_storage_format);
> >  
> >  	if (!(flags & INIT_DB_SKIP_REFDB))
> 
> Shouldn't that come from the series that wants .compat_hash_algo in
> the repo_fmt structure, whichever it is, not added by an evil merge?

Well, the above code is newly added by my series to ensure that
`init_db()` results in a properly initialized repo upon return. So the
compat hash algo series cannot yet call `repo_set_compat_hash_algo()`
because the code site doesn't exist, whereas my series cannot yet add
the call because there is no compat hash algo yet.

So depending on which series lands first we'll either have to adapt the
respective other series or do an evil merge.

Patrick


* Re: [PATCH 0/6] worktree: initialize refdb via ref backends
From: Patrick Steinhardt @ 2023-12-28 19:57 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <xmqqedf6gpt8.fsf@gitster.g>

On Thu, Dec 28, 2023 at 10:11:31AM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > when initializing worktrees we manually create the on-disk data
> > structures required for the ref backend in "worktree.c". This works just
> > fine right now where we only have a single user-exposed ref backend, but
> > it will become unwieldy once we have multiple ref backends. This patch
> > series thus refactors how we initialize worktrees so that we can use
> > `refs_init_db()` to initialize required files for us.
> >
> > This patch series conflicts with ps/refstorage-extension. The conflict
> > can be solved as shown below. I'm happy to defer this patch series
> > though until the topic has landed on `master` in case this causes
> > issues.
> 
> Resolution is not all that bad, but the change in function signature
> means comments/explanations near both the caller and the callee of
> the get_linked_worktree() function may need updating, I would think.
> For example, ...
> 
> > diff --git a/worktree.h b/worktree.h
> > index 8a75691eac..f14784a2ff 100644
> > --- a/worktree.h
> > +++ b/worktree.h
> > @@ -61,7 +61,8 @@ struct worktree *find_worktree(struct worktree **list,
> >   * Look up the worktree corresponding to `id`, or NULL of no such worktree
> >   * exists.
> >   */
> > -struct worktree *get_linked_worktree(const char *id);
> > +struct worktree *get_linked_worktree(const char *id,
> > +				     int skip_reading_head);
> 
> ... this now needs to help developers who may want to add new
> callers what to pass in "skip_reading_head" and why.
> 
> We may indeed want to build this on top of the refstorage-extension
> thing, as it seems to be relatively close to completion.

Fair enough. I'll wait for the refstorage extension topic to hit `next`
or `master` first so as to not build deep dependency chains when things
may still move around. I don't mind waiting another one or two weeks,
especially during holidays where things are moving slower anyway.

> Thanks (and a happy new year).

Thanks, the same to you, too.

Patrick


* Re: [PATCH v2] mem-pool: fix big allocations
From: phillip.wood123 @ 2023-12-28 19:36 UTC (permalink / raw)
  To: René Scharfe, Git List
  Cc: Jameson Miller, Phillip Wood, Elijah Newren, Junio C Hamano
In-Reply-To: <1c39c0e7-05b2-4726-a90c-f78df4356a41@web.de>

Hi René

On 28/12/2023 19:19, René Scharfe wrote:
> Interdiff against v1:
>    diff --git a/t/unit-tests/t-mem-pool.c b/t/unit-tests/t-mem-pool.c
>    index 2295779b0b..a0d57df761 100644
>    --- a/t/unit-tests/t-mem-pool.c
>    +++ b/t/unit-tests/t-mem-pool.c
>    @@ -1,8 +1,6 @@
>     #include "test-lib.h"
>     #include "mem-pool.h"
> 
>    -#define check_ptr(a, op, b) check_int(((a) op (b)), ==, 1)
>    -
>     static void setup_static(void (*f)(struct mem_pool *), size_t block_alloc)
>     {
>     	struct mem_pool pool = { .block_alloc = block_alloc };
>    @@ -16,11 +14,10 @@ static void t_calloc_100(struct mem_pool *pool)
>     	char *buffer = mem_pool_calloc(pool, 1, size);
>     	for (size_t i = 0; i < size; i++)
>     		check_int(buffer[i], ==, 0);
>    -	if (!check_ptr(pool->mp_block, !=, NULL))
>    +	if (!check(pool->mp_block != NULL))
>     		return;
>    -	check_ptr(pool->mp_block->next_free, <=, pool->mp_block->end);
>    -	check_ptr(pool->mp_block->next_free, !=, NULL);
>    -	check_ptr(pool->mp_block->end, !=, NULL);
>    +	check(pool->mp_block->next_free != NULL);
>    +	check(pool->mp_block->end != NULL);
>     }

The changes to the unit tests look good to me (I haven't really looked 
at the actual bug fix in the mem_pool code).

Best Wishes

Phillip


* Re: [PATCH] mem-pool: fix big allocations
From: phillip.wood123 @ 2023-12-28 19:34 UTC (permalink / raw)
  To: René Scharfe, phillip.wood, Git List; +Cc: Jameson Miller
In-Reply-To: <34f5913f-b187-43c3-99b7-3d57065dba12@web.de>

On 28/12/2023 18:56, René Scharfe wrote:
> Am 28.12.23 um 17:48 schrieb phillip.wood123@gmail.com:
>> On 28/12/2023 16:05, René Scharfe wrote:
>>> Am 28.12.23 um 16:10 schrieb Phillip Wood:
>>>> The diff at the end of
>>>> this email shows a possible implementation of a check_ptr() macro for
>>>> the unit test library. I'm wary of adding it though because I'm not sure
>>>> printing the pointer values is actually very useful most of the
>>>> time. I'm also concerned that the rules around pointer arithmetic and
>>>> comparisons mean that many pointer tests such as
>>>>
>>>>       check_ptr(pool->mp_block->next_free, <=, pool->mp_block->end);
>>>>
>>>> will be undefined if they fail.
>>>
>>> True, the compiler could legally emit mush when it finds out that the
>>> pointers are for different objects.  And the error being fixed produces
>>> such unrelated pointer pairs -- oops.
>>>
>>> This check is not important here, we can just drop it.
>>>
>>> mem_pool_contains() has the same problem, by the way.
>>>
>>> Restricting ourselves to only equality comparisons for pointers prevents
>>> some interesting sanity checks, though.  Casting to intptr_t or
>>> uintptr_t would allow arbitrary comparisons without risk of undefined
>>> behavior, though.  Perhaps that would make a check_ptr() macro viable
>>> and useful.
>>
>> That certainly helps and the check_ptr() macro in my previous email
>> casts the pointers to uintptr_t before comparing them. Maybe I'm
>> worrying too much, but my concern is that in a failing comparison it
>> is likely one of the pointers is invalid (for example it is the
>> result of some undefined pointer arithmetic) and the program is
>> undefined from the point the invalid pointer is created.
> 
> There are no restrictions on integer comparisons.  So comparing after
> casting to uintptr_t should not invoke undefined behavior.  If undefined
> behavior was involved in calculating the pointers in the first place
> then the compiler might still legally go crazy, but not due to the
> comparison.  Right?

Exactly, my worry is that if the comparison fails it is likely that 
there will have been undefined behavior involved in calculating the 
pointer before we get to the comparison, in which case casting to
uintptr_t in the comparison does not help.

> Whether the result of a uintptr_t-cast comparison of pointers to
> different objects is meaningful is a different question.  Hopefully
> range checks are possible.
> 
>> The
>> documentation for check_ptr() in my previous mail contains the
>> following example
>>
>>      For example if `start` and `end` are pointers to the beginning and
>>      end of an allocation and `offset` is an integer then
>>
>>          check_ptr(start + offset, <=, end)
>>
>>      is undefined when `offset` is larger than `end - start`. Rewriting
>>      the comparison as
>>
>>          check_uint(offset, <=, end - start)
>>
>>      avoids undefined behavior when offset is too large, but is still
>>      undefined if there is a bug that means `start` and `end` do not
>>      point to the same allocation.
> 
> True, but in such a unit test we'd need additional checks verifying
> that start and end belong to the same object.  Or perhaps use a
> numerical size instead of an end pointer.

Agreed, but I think the implication is that there will be cases we 
should be using check_uint() as in the second comparison above rather 
than check_ptr() as in the first comparison above. I'm not opposed to 
adding check_ptr() if we think it will be useful but I am worried it is 
easy to misuse it. If we do add check_ptr() we should have some 
guidelines about when it makes sense to use it.
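
Such a guideline might look like the following sketch, which validates an
offset against a numeric size instead of forming `start + offset` first.
`offset_fits()` and `advance_checked()` are hypothetical helpers, not part
of the unit test library.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: report whether `offset` stays within an allocation
 * of `size` bytes. Only integers are compared, so no out-of-bounds pointer
 * is ever formed, unlike `start + offset <= end`.
 */
static int offset_fits(size_t offset, size_t size)
{
	return offset <= size;
}

/* Advance within a buffer only after the integer check has passed. */
static char *advance_checked(char *start, size_t size, size_t offset)
{
	if (!offset_fits(offset, size))
		return NULL;
	return start + offset; /* at most one past the end, which is valid */
}
```

The pointer arithmetic happens only once the offset is known to be in
range, which is the property the check_uint() form is after.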

Best Wishes

Phillip


* Re: [PATCH 1/1] Replace SID with domain/username
From: Eric Sunshine @ 2023-12-28 19:27 UTC (permalink / raw)
  To: Sören Krecker; +Cc: git
In-Reply-To: <20231228132844.4240-2-soekkle@freenet.de>

On Thu, Dec 28, 2023 at 8:29 AM Sören Krecker <soekkle@freenet.de> wrote:
> From: soekkle <soekkle@freenet.de>
>
> Replace SID with domain/username in erromessage, if owner of repository
> and user are not equal on windows systems.
>
> Signed-off-by: Sören Krecker <soekkle@freenet.de>
> ---

I don't do Windows (anymore), thus I'm not qualified to comment on the
substance of this patch, so I'll just make some general, hopefully
helpful, observations.

Typo: "erromessage" should be "error message"

Your name in the "From:" header and Signed-off-by: should be the same.

Perhaps Windows folks can understand the purpose of this patch without
further explanation, but for other readers, it's not clear what
problem the patch is trying to solve. The commit message is a good
place to explain _why_ this change is desirable.

> diff --git a/compat/mingw.c b/compat/mingw.c
> @@ -2684,6 +2684,25 @@ static PSID get_current_user_sid(void)
> +BOOL user_sid_to_string(PSID sid, LPSTR* str)

In this codebase, '*' sticks to the variable name, not the type, so:

    BOOL user_sid_to_string(PSID sid, LPSTR *str)

> +{
> +       SID_NAME_USE peUse;
> +       DWORD lenName = { 0 }, lenDomain = { 0 };

Looking through compat/mingw.c, it appears that (as with the rest of
the project), variable names tend to use underscores rather than
camel-case, so for consistency these might be better expressed as
"pe_use" (whatever that means), "name_len", and "domain_len".

I was curious about the `{ 0 }` initializer. It seems we have a mix of
both `{0}` and `{ 0 }` in the codebase, so what you have here is
likely fine.

> +       LookupAccountSidA(NULL, sid, NULL, &lenName, NULL,
> +                                       &lenDomain, &peUse); // returns only FALSE, because the string pointers are NULL

As with the rest of the project, compat/mingw.c still shuns "//"
comments. Use /*...*/ comments instead.

> +       ALLOC_ARRAY((*str), (size_t)lenDomain + (size_t)lenName); // Alloc neded Space of the strings

Typo: "neded" -> "needed"

(and "Space" -> "space")

> +       BOOL retVal = LookupAccountSidA(NULL, sid, (*str) + lenDomain, &lenName,
> +                                      *str,
> +                                       &lenDomain, &peUse);
> +       *(*str + lenDomain) = '/';
> +       if (retVal == FALSE)
> +       {
> +               free(*str);
> +               *str = NULL;

The FREE_AND_NULL() macro from git-compat-util.h is a good companion
to the ALLOC_ARRAY() macro used above, so freeing and nullifying could
be done in one line:

    FREE_AND_NULL(*str);
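
For readers unfamiliar with the idiom, a self-contained sketch follows.
`DEMO_FREE_AND_NULL` and `demo_fill()` are hypothetical stand-ins written
for this illustration; consult git-compat-util.h for the real
FREE_AND_NULL() definition.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for a free-and-nullify macro: release the
 * allocation and clear the pointer so that no dangling value is left.
 */
#define DEMO_FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)

/* Mimic the error path: on failure the out-parameter is freed and reset. */
static int demo_fill(char **str, int succeed)
{
	*str = malloc(8);
	if (!*str)
		return 0;
	if (!succeed)
		DEMO_FREE_AND_NULL(*str);
	return succeed;
}
```

After a failed call the caller sees a NULL pointer rather than a freed
one, so an accidental later use or second free() is harmless.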

> +       }
> +       return retVal;
> +}

Perhaps a variable name such as `ok` would convey more to the reader
than the generic `retVal`?


* [PATCH v2] mem-pool: fix big allocations
From: René Scharfe @ 2023-12-28 19:19 UTC (permalink / raw)
  To: Git List; +Cc: Jameson Miller, Phillip Wood, Elijah Newren, Junio C Hamano
In-Reply-To: <fa89d269-1a23-4ed6-bebc-30c0b629f444@web.de>

Memory pool allocations that require a new block and would fill at
least half of it are handled specially.  Before 158dfeff3d (mem-pool:
add life cycle management functions, 2018-07-02) they used to be
allocated outside of the pool.  That commit made mem_pool_alloc() create
a bespoke block instead, to allow releasing it when the pool gets
discarded.

Unfortunately mem_pool_alloc() returns a pointer to the start of such a
bespoke block, i.e. to the struct mp_block at its top.  When the caller
writes to it, the management information gets corrupted.  This affects
mem_pool_discard() and -- if there are no other blocks in the pool --
also mem_pool_alloc().

Return the payload pointer of bespoke blocks, just like for smaller
allocations, to protect the management struct.

Also update next_free to mark the block as full.  This is only strictly
necessary for the first allocated block, because subsequent ones are
inserted after the current block and never considered for further
allocations, but it's easier to just do it in all cases.

Add a basic unit test to demonstrate the issue by using
mem_pool_calloc() with a tiny block size, which forces the creation of a
bespoke block.

Helped-by: Phillip Wood <phillip.wood123@gmail.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
---
Changes since v1:
- simply use check() instead of a custom check_ptr() macro
- drop unnecessary comparison of next_free and end pointers
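
The corruption described in the commit message can be pictured in
isolation. The following is a toy sketch, not the real mem-pool.c:
`toy_block`, `toy_new_block()`, and the two accessors are hypothetical
names chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy block: management fields first, payload after them. */
struct toy_block {
	struct toy_block *next;
	char *next_free;
	char *end;
	uintmax_t space[]; /* payload starts here */
};

static struct toy_block *toy_new_block(size_t len)
{
	struct toy_block *b = malloc(sizeof(*b) + len);
	if (!b)
		abort();
	b->next = NULL;
	b->next_free = (char *)b->space;
	b->end = b->next_free + len;
	return b;
}

/* Buggy variant: the caller gets the header, so its writes clobber it. */
static void *toy_take_buggy(struct toy_block *b)
{
	return b;
}

/* Fixed variant: hand out the payload and mark the block as full. */
static void *toy_take_fixed(struct toy_block *b, size_t len)
{
	void *r = b->next_free;
	b->next_free += len;
	return r;
}
```

Writing through the pointer from toy_take_fixed() leaves `next`,
`next_free`, and `end` intact, whereas writing through the one from
toy_take_buggy() would overwrite them.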

Interdiff against v1:
  diff --git a/t/unit-tests/t-mem-pool.c b/t/unit-tests/t-mem-pool.c
  index 2295779b0b..a0d57df761 100644
  --- a/t/unit-tests/t-mem-pool.c
  +++ b/t/unit-tests/t-mem-pool.c
  @@ -1,8 +1,6 @@
   #include "test-lib.h"
   #include "mem-pool.h"

  -#define check_ptr(a, op, b) check_int(((a) op (b)), ==, 1)
  -
   static void setup_static(void (*f)(struct mem_pool *), size_t block_alloc)
   {
   	struct mem_pool pool = { .block_alloc = block_alloc };
  @@ -16,11 +14,10 @@ static void t_calloc_100(struct mem_pool *pool)
   	char *buffer = mem_pool_calloc(pool, 1, size);
   	for (size_t i = 0; i < size; i++)
   		check_int(buffer[i], ==, 0);
  -	if (!check_ptr(pool->mp_block, !=, NULL))
  +	if (!check(pool->mp_block != NULL))
   		return;
  -	check_ptr(pool->mp_block->next_free, <=, pool->mp_block->end);
  -	check_ptr(pool->mp_block->next_free, !=, NULL);
  -	check_ptr(pool->mp_block->end, !=, NULL);
  +	check(pool->mp_block->next_free != NULL);
  +	check(pool->mp_block->end != NULL);
   }

   int cmd_main(int argc, const char **argv)

 Makefile                  |  1 +
 mem-pool.c                |  6 +++---
 t/unit-tests/t-mem-pool.c | 31 +++++++++++++++++++++++++++++++
 3 files changed, 35 insertions(+), 3 deletions(-)
 create mode 100644 t/unit-tests/t-mem-pool.c

diff --git a/Makefile b/Makefile
index 88ba7a3c51..15990ff312 100644
--- a/Makefile
+++ b/Makefile
@@ -1340,6 +1340,7 @@ THIRD_PARTY_SOURCES += sha1collisiondetection/%
 THIRD_PARTY_SOURCES += sha1dc/%

 UNIT_TEST_PROGRAMS += t-basic
+UNIT_TEST_PROGRAMS += t-mem-pool
 UNIT_TEST_PROGRAMS += t-strbuf
 UNIT_TEST_PROGS = $(patsubst %,$(UNIT_TEST_BIN)/%$X,$(UNIT_TEST_PROGRAMS))
 UNIT_TEST_OBJS = $(patsubst %,$(UNIT_TEST_DIR)/%.o,$(UNIT_TEST_PROGRAMS))
diff --git a/mem-pool.c b/mem-pool.c
index c34846d176..e8d976c3ee 100644
--- a/mem-pool.c
+++ b/mem-pool.c
@@ -99,9 +99,9 @@ void *mem_pool_alloc(struct mem_pool *pool, size_t len)

 	if (!p) {
 		if (len >= (pool->block_alloc / 2))
-			return mem_pool_alloc_block(pool, len, pool->mp_block);
-
-		p = mem_pool_alloc_block(pool, pool->block_alloc, NULL);
+			p = mem_pool_alloc_block(pool, len, pool->mp_block);
+		else
+			p = mem_pool_alloc_block(pool, pool->block_alloc, NULL);
 	}

 	r = p->next_free;
diff --git a/t/unit-tests/t-mem-pool.c b/t/unit-tests/t-mem-pool.c
new file mode 100644
index 0000000000..a0d57df761
--- /dev/null
+++ b/t/unit-tests/t-mem-pool.c
@@ -0,0 +1,31 @@
+#include "test-lib.h"
+#include "mem-pool.h"
+
+static void setup_static(void (*f)(struct mem_pool *), size_t block_alloc)
+{
+	struct mem_pool pool = { .block_alloc = block_alloc };
+	f(&pool);
+	mem_pool_discard(&pool, 0);
+}
+
+static void t_calloc_100(struct mem_pool *pool)
+{
+	size_t size = 100;
+	char *buffer = mem_pool_calloc(pool, 1, size);
+	for (size_t i = 0; i < size; i++)
+		check_int(buffer[i], ==, 0);
+	if (!check(pool->mp_block != NULL))
+		return;
+	check(pool->mp_block->next_free != NULL);
+	check(pool->mp_block->end != NULL);
+}
+
+int cmd_main(int argc, const char **argv)
+{
+	TEST(setup_static(t_calloc_100, 1024 * 1024),
+	     "mem_pool_calloc returns 100 zeroed bytes with big block");
+	TEST(setup_static(t_calloc_100, 1),
+	     "mem_pool_calloc returns 100 zeroed bytes with tiny block");
+
+	return test_done();
+}
--
2.43.0


* Re: [PATCH] mem-pool: fix big allocations
From: René Scharfe @ 2023-12-28 18:56 UTC (permalink / raw)
  To: phillip.wood, Git List; +Cc: Jameson Miller
In-Reply-To: <e1e43a6c-3e06-4453-88a3-f00476132bcd@gmail.com>

Am 28.12.23 um 17:48 schrieb phillip.wood123@gmail.com:
> On 28/12/2023 16:05, René Scharfe wrote:
>> Am 28.12.23 um 16:10 schrieb Phillip Wood:
>>> The diff at the end of
>>> this email shows a possible implementation of a check_ptr() macro for
>>> the unit test library. I'm wary of adding it though because I'm not sure
>>> printing the pointer values is actually very useful most of the
>>> time. I'm also concerned that the rules around pointer arithmetic and
>>> comparisons mean that many pointer tests such as
>>>
>>>      check_ptr(pool->mp_block->next_free, <=, pool->mp_block->end);
>>>
>>> will be undefined if they fail.
>>
>> True, the compiler could legally emit mush when it finds out that the
>> pointers are for different objects.  And the error being fixed produces
>> such unrelated pointer pairs -- oops.
>>
>> This check is not important here, we can just drop it.
>>
>> mem_pool_contains() has the same problem, by the way.
>>
>> Restricting ourselves to only equality comparisons for pointers prevents
>> some interesting sanity checks, though.  Casting to intptr_t or
>> uintptr_t would allow arbitrary comparisons without risk of undefined
>> behavior, though.  Perhaps that would make a check_ptr() macro viable
>> and useful.
>
> That certainly helps and the check_ptr() macro in my previous email
> casts the pointers to uintptr_t before comparing them. Maybe I'm
> worrying too much, but my concern is that in a failing comparison it
> is likely one of the pointers is invalid (for example it is the
> result of some undefined pointer arithmetic) and the program is
> undefined from the point the invalid pointer is created.

There are no restrictions on integer comparisons.  So comparing after
casting to uintptr_t should not invoke undefined behavior.  If undefined
behavior was involved in calculating the pointers in the first place
then the compiler might still legally go crazy, but not due to the
comparison.  Right?

Whether the result of a uintptr_t-cast comparison of pointers to
different objects is meaningful is a different question.  Hopefully
range checks are possible.
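
A minimal sketch of such a range check, assuming a flat address space
where the integer value of a pointer reflects its address (the result of
the pointer-to-integer conversion is implementation-defined). The helper
name `ptr_in_range` is hypothetical and not part of Git or its unit test
library:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: report whether `p` lies in [start, start + len)
 * by comparing integer representations instead of the pointers.
 */
static int ptr_in_range(const void *p, const void *start, size_t len)
{
	uintptr_t a = (uintptr_t)p;
	uintptr_t s = (uintptr_t)start;

	/* Plain integer comparison and subtraction from here on. */
	return a >= s && a - s < len;
}
```

The comparison itself is well defined for any two values; whether the
answer is meaningful for pointers into unrelated objects still depends
on the platform.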

> The
> documentation for check_ptr() in my previous mail contains the
> following example
>
>     For example if `start` and `end` are pointers to the beginning and
>     end of an allocation and `offset` is an integer then
>
>         check_ptr(start + offset, <=, end)
>
>     is undefined when `offset` is larger than `end - start`. Rewriting
>     the comparison as
>
>         check_uint(offset, <=, end - start)
>
>     avoids undefined behavior when offset is too large, but is still
>     undefined if there is a bug that means `start` and `end` do not
>     point to the same allocation.

True, but in such a unit test we'd need additional checks verifying
that start and end belong to the same object.  Or perhaps use a
numerical size instead of an end pointer.
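
The "numerical size" variant could look roughly like the sketch below.
The names `demo_region` and `demo_region_at` are hypothetical and only
illustrate the idea; they do not exist in mem-pool.c:

```c
#include <stddef.h>

/*
 * Hypothetical region descriptor: a start pointer plus a size, so that
 * bounds checks compare integers and `start + offset` is only formed
 * once the offset is known to be in range.
 */
struct demo_region {
	char *start;
	size_t size;
};

/* Return a pointer `offset` bytes into the region, or NULL if out of bounds. */
static char *demo_region_at(const struct demo_region *r, size_t offset)
{
	if (offset > r->size)
		return NULL;
	/* At most one past the end, which is still valid to form. */
	return r->start + offset;
}
```

With this shape a test can check `offset <= size` as two integers and
never needs a relational comparison between pointers.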

René

* Re: [PATCH v2 02/12] worktree: skip reading HEAD when repairing worktrees
From: Eric Sunshine @ 2023-12-28 18:13 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Karthik Nayak, Junio C Hamano
In-Reply-To: <CAPig+cT6mRyJijL1qo2g56Yny-JxkDYjjmGpAncyS_4Hcpaz6Q@mail.gmail.com>

On Thu, Dec 28, 2023 at 1:08 PM Eric Sunshine <sunshine@sunshineco.com> wrote:
> Having said all that, I'm not overly opposed to this patch, especially
> since your main focus is on getting the reftable backend integrated,
> and because the changes (and ugliness) introduced by this patch are
> entirely self-contained and private to worktree.c, so are not a
> show-stopper by any means. Rather, I wanted to get down to writing
> what I think would be a better future approach if someone gets around
> to tackling it. (There is no pressing need at the moment, and that
> someone doesn't have to be you.)

I forgot to mention that, if you reroll for some reason, the
get_worktrees()/get_worktrees_internal() dance might deserve an
in-source NEEDSWORK comment explaining that get_worktrees_internal()
exists to work around the shortcoming that a corruption-tolerant
function for retrieving worktree metadata (for use by the "repair"
function) does not yet exist.

* Re: [PATCH 0/6] worktree: initialize refdb via ref backends
From: Junio C Hamano @ 2023-12-28 18:11 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git
In-Reply-To: <cover.1703754513.git.ps@pks.im>

Patrick Steinhardt <ps@pks.im> writes:

> when initializing worktrees we manually create the on-disk data
> structures required for the ref backend in "worktree.c". This works just
> fine right now where we only have a single user-exposed ref backend, but
> it will become unwieldy once we have multiple ref backends. This patch
> series thus refactors how we initialize worktrees so that we can use
> `refs_init_db()` to initialize required files for us.
>
> This patch series conflicts with ps/refstorage-extension. The conflict
> can be solved as shown below. I'm happy to defer this patch series
> though until the topic has landed on `master` in case this causes
> issues.

Resolution is not all that bad, but the change in function signature
means comments/explanations near both the caller and the callee of
the get_linked_worktree() function may need updating, I would think.
For example, ...

> diff --git a/worktree.h b/worktree.h
> index 8a75691eac..f14784a2ff 100644
> --- a/worktree.h
> +++ b/worktree.h
> @@ -61,7 +61,8 @@ struct worktree *find_worktree(struct worktree **list,
>   * Look up the worktree corresponding to `id`, or NULL of no such worktree
>   * exists.
>   */
> -struct worktree *get_linked_worktree(const char *id);
> +struct worktree *get_linked_worktree(const char *id,
> +				     int skip_reading_head);

... this now needs to tell developers who may want to add new
callers what to pass in "skip_reading_head" and why.

We may indeed want to build this on top of the refstorage-extension
thing, as it seems to be relatively close to completion.

Thanks (and a happy new year).

* Re: [PATCH v2 02/12] worktree: skip reading HEAD when repairing worktrees
From: Eric Sunshine @ 2023-12-28 18:08 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Karthik Nayak, Junio C Hamano
In-Reply-To: <ecf4f1ddee36643f0ff7e3d40b9aa7c7e6e6ce43.1703753910.git.ps@pks.im>

On Thu, Dec 28, 2023 at 4:57 AM Patrick Steinhardt <ps@pks.im> wrote:
> When calling `git init --separate-git-dir=<new-path>` on a preexisting
> repository, we move the Git directory of that repository to the new path
> specified by the user. If there are worktrees present in the repository,
> we need to repair the worktrees so that their gitlinks point to the new
> location of the repository.
>
> This repair logic will load repositories via `get_worktrees()`, which
> will enumerate and initialize all worktrees. Part of initialization
> is logic that resolves their respective worktree HEADs, even though
> that information may not actually be needed in the end by all callers.
>
> In the context of git-init(1) this is about to become a problem, because
> we do not have a repository that was set up via `setup_git_directory()`
> or friends. Consequentially, it is not yet fully initialized at the time
> of calling `repair_worktrees()`, and properly setting up all parts of
> the repository in `init_db()` before we repair worktrees is not an easy
> thing to do. While this is okay right now where we only have a single
> reference backend in Git, once we gain a second one we would be trying
> to look up the worktree HEADs before we have figured out the reference
> format, which does not work.

s/Consequentially/Consequently/

I found it difficult to digest this paragraph with its foreshadowing
phrase "about to become a problem" since it wasn't apparent until the
very final sentence in the paragraph what the actual problem would be.
Perhaps if you mention early on that the reftable backend will have
trouble with the current code, it would be easier to grasp. Maybe
something like this:

    Although not a problem presently with the file-based reference
    backend, it will become a problem with the upcoming reftable
    backend.  In the context of git-init(1) a fully-materialized
    repository set up via `setup_git_directory()` or friends is not
    yet present.  Consequently, it is not yet fully initialized at the
    time `repair_worktrees()` is called, and properly setting up all
    parts of the repository in `init_db()` before we repair worktrees
    is not an easy task.  With the introduction of the reftable backend,
    it would try to look up the worktree HEADs before we have figured
    out the reference format, thus would not work.

> We do not require the worktree HEADs at all to repair worktrees. So
> let's fix this issue by skipping over the step that reads them.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> diff --git a/worktree.c b/worktree.c
> @@ -51,7 +51,7 @@ static void add_head_info(struct worktree *wt)
> -static struct worktree *get_main_worktree(void)
> +static struct worktree *get_main_worktree(int skip_reading_head)
>  {
> -       add_head_info(worktree);
> +       if (!skip_reading_head)
> +               add_head_info(worktree);

This is so special-case that it feels more than a little dirty.

> @@ -591,7 +599,7 @@ static void repair_noop(int iserr UNUSED,
>  void repair_worktrees(worktree_repair_fn fn, void *cb_data)
>  {
> -       struct worktree **worktrees = get_worktrees();
> +       struct worktree **worktrees = get_worktrees_internal(1);

In an ideal world, a repair function should not be calling
get_worktrees() at all since get_worktrees() is not tolerant of
corruption of the worktree administrative files. (Plus, as you note,
it does more work than necessary for the current set of repairs
performed by `git worktree repair`.)

Even as I was implementing the worktree repair code, I wavered back
and forth multiple times between calling get_worktrees() and writing a
custom corruption-tolerant function to retrieve worktree
administrative information. In the end, I opted for get_worktrees()
for the pragmatic reason that it allowed me to narrow the scope of the
patches to the types of repairs which were the current focus without
getting mired down in the involved details of writing a
corruption-tolerant function for retrieving worktree metadata.
However, that decision was made with the understanding that the
pragmatic choice of the moment would not rule out the possibility of
returning later and implementing the more correct approach of having a
corruption-tolerant function for retrieving worktree metadata.

The special-case ugliness of this patch suggests strongly in favor of
implementing the earlier-envisioned corruption-tolerant function for
retrieving worktree metadata rather than the band-aid approach taken
by this patch. The generic name get_worktrees_internal() isn't helpful
either; it doesn't do a good job of conveying any particular meaning
to the reader.

Having said all that, I'm not overly opposed to this patch, especially
since your main focus is on getting the reftable backend integrated,
and because the changes (and ugliness) introduced by this patch are
entirely self-contained and private to worktree.c, so are not a
show-stopper by any means. Rather, I wanted to get down to writing
what I think would be a better future approach if someone gets around
to tackling it. (There is no pressing need at the moment, and that
someone doesn't have to be you.)

* Re: [PATCH v2 03/12] refs: refactor logic to look up storage backends
From: Junio C Hamano @ 2023-12-28 17:25 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Karthik Nayak
In-Reply-To: <12329b99b753f79fe93fe017e71b08227d213c1e.1703753910.git.ps@pks.im>

Patrick Steinhardt <ps@pks.im> writes:

> In order to look up ref storage backends, we're currently using a linked
> list of backends, where each backend is expected to set up its `next`
> pointer to the next ref storage backend. This is kind of a weird setup
> as backends need to be aware of other backends without much of a reason.
>
> Refactor the code so that the array of backends is centrally defined in
> "refs.c", where each backend is now identified by an integer constant.
> Expose functions to translate from those integer constants to the name
> and vice versa, which will be required by subsequent patches.

A small question.  Does this have to be "int", or is "unsigned" (or
even an enum, rewritten from the "REF_STORAGE_FORMAT_*" family of CPP
macro constants) good enough?  I am only wondering what happens when
you call find_ref_storage_backend() with a negative index.

For that matter, I am also curious how REF_STORAGE_FORMAT_UNKNOWN
(whose value is 0) is handled by the function.  The caller may have
to find that the backend hasn't been specified by receiving an
element in the refs_backends[] array that corresponds to it, but the
error behaviour of this function is also to return NULL, so it has
to be prepared to handle both cases?

> +static const struct ref_storage_be *refs_backends[] = {
> +	[REF_STORAGE_FORMAT_FILES] = &refs_be_files,
> +};
> ...
> +static const struct ref_storage_be *find_ref_storage_backend(int ref_storage_format)
>  {
> +	if (ref_storage_format < ARRAY_SIZE(refs_backends))
> +		return refs_backends[ref_storage_format];
>  	return NULL;
>  }
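
To make the two questions concrete, here is a small self-contained
sketch of the same lookup shape taking an unsigned index. All `demo_*`
and `DEMO_*` names are hypothetical stand-ins, not Git's definitions:

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration only. */
enum demo_format {
	DEMO_FORMAT_UNKNOWN = 0,
	DEMO_FORMAT_FILES = 1,
};

struct demo_backend {
	const char *name;
};

static const struct demo_backend demo_be_files = { "files" };

/* Slot 0 (unknown) is left NULL by the designated initializer. */
static const struct demo_backend *demo_backends[] = {
	[DEMO_FORMAT_FILES] = &demo_be_files,
};

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/*
 * With an unsigned parameter, a negative value passed by a caller
 * wraps to a large number and fails the bounds check. "Unknown" and
 * "out of range" both yield NULL, so a caller that needs to tell them
 * apart has to inspect the format value itself.
 */
static const struct demo_backend *demo_find_backend(unsigned int format)
{
	if (format < DEMO_ARRAY_SIZE(demo_backends))
		return demo_backends[format];
	return NULL;
}
```

In this sketch the NULL returned for index 0 is indistinguishable from
the NULL returned for an invalid index, which is the ambiguity the
second question is about.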

* Re: [PATCH 04/12] setup: start tracking ref storage format when
From: Junio C Hamano @ 2023-12-28 17:15 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git
In-Reply-To: <ZY04OlTNUEZs5T-T@tanuki>

Patrick Steinhardt <ps@pks.im> writes:

> Makes me wonder whether we should then also add the following diff to
> "setup: set repository's format on init" when both topics are being
> merged together:
>
> diff --git a/setup.c b/setup.c
> index 3d980814bc..3d35c78c68 100644
> --- a/setup.c
> +++ b/setup.c
> @@ -2210,6 +2210,7 @@ int init_db(const char *git_dir, const char *real_git_dir,
>  	 * format we can update the repository's settings accordingly.
>  	 */
>  	repo_set_hash_algo(the_repository, repo_fmt.hash_algo);
> +	repo_set_compat_hash_algo(the_repository, repo_fmt.compat_hash_algo);
>  	repo_set_ref_storage_format(the_repository, repo_fmt.ref_storage_format);
>  
>  	if (!(flags & INIT_DB_SKIP_REFDB))

Shouldn't that come from the series that wants .compat_hash_algo in
the repo_fmt structure, whichever it is, not added by an evil merge?

* Re: [PATCH v2 8/8] reftable/merged: transfer ownership of records when iterating
From: Junio C Hamano @ 2023-12-28 17:04 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Han-Wen Nienhuys
In-Reply-To: <25a3919e583a9d13403b8add92260529e19e08fb.1703743174.git.ps@pks.im>

Patrick Steinhardt <ps@pks.im> writes:

> When iterating over recods with the merged iterator we put the records

"records"?

I commented on a few patches in this iteration that I found a bit
harder to follow than necessary; aside from these small nits,
everything in the series looked quite well explained and executed.

Thanks.

* Re: [PATCH v2 5/8] reftable/record: store "val1" hashes as static arrays
From: Junio C Hamano @ 2023-12-28 17:03 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Han-Wen Nienhuys
In-Reply-To: <46ca3a37f805cd36faa26927220c2793d4cdd561.1703743174.git.ps@pks.im>

Patrick Steinhardt <ps@pks.im> writes:

> When reading ref records of type "val1" we store its object ID in an

I'd find it easier to follow if we had a comma before "we store",
but perhaps I am old fashioned.

> allocated array. This results in an additional allocation for every
> single ref record we read, which is rather inefficient especially when
> iterating over refs.
>
> Refactor the code to instead use a static array of `GIT_MAX_RAWSZ`

"a static" -> "an embedded", perhaps?  The struct as a whole may
or may not be static but the point of this patch is that the array
is embedded in it.

> bytes. While this means that `struct ref_record` is bigger now, we
> typically do not store all refs in an array anyway and instead only
> handle a limited number of records at the same point in time.

Nicely explained.
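
A minimal sketch of the embedded-array layout being discussed.
`DEMO_MAX_RAWSZ`, `demo_record` and `demo_record_set_hash` are
hypothetical names chosen for illustration; the real definition is in
the reftable-record.h hunk quoted further down:

```c
#include <string.h>

#define DEMO_MAX_RAWSZ 32 /* hypothetical stand-in for GIT_MAX_RAWSZ */

/* The hash lives inside the record, so no separate allocation or free. */
struct demo_record {
	const char *refname;
	unsigned char val1[DEMO_MAX_RAWSZ];
};

static void demo_record_set_hash(struct demo_record *rec,
				 const unsigned char *hash, size_t hash_size)
{
	/* Only the first hash_size bytes are meaningful, e.g. 20 for SHA-1. */
	memcpy(rec->val1, hash, hash_size);
}
```

The cost is a larger struct; the gain is that reading a record no longer
needs one heap allocation per hash.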

> Using `git show-ref --quiet` in a repository with ~350k refs this leads
> to a significant drop in allocations. Before:
>
>     HEAP SUMMARY:
>         in use at exit: 21,098 bytes in 192 blocks
>       total heap usage: 2,116,683 allocs, 2,116,491 frees, 76,098,060 bytes allocated
>
> After:
>
>     HEAP SUMMARY:
>         in use at exit: 21,098 bytes in 192 blocks
>       total heap usage: 1,419,031 allocs, 1,418,839 frees, 62,145,036 bytes allocated
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  reftable/block_test.c      |  4 +---
>  reftable/merged_test.c     | 16 ++++++----------
>  reftable/readwrite_test.c  | 14 ++++----------
>  reftable/record.c          |  3 ---
>  reftable/record_test.c     |  1 -
>  reftable/reftable-record.h |  3 ++-
>  reftable/stack_test.c      |  2 --
>  7 files changed, 13 insertions(+), 30 deletions(-)
>
> diff --git a/reftable/block_test.c b/reftable/block_test.c
> index c00bbc8aed..dedb05c7d8 100644
> --- a/reftable/block_test.c
> +++ b/reftable/block_test.c
> @@ -49,13 +49,11 @@ static void test_block_read_write(void)
>  
>  	for (i = 0; i < N; i++) {
>  		char name[100];
> -		uint8_t hash[GIT_SHA1_RAWSZ];
>  		snprintf(name, sizeof(name), "branch%02d", i);
> -		memset(hash, i, sizeof(hash));
>  
>  		rec.u.ref.refname = name;
>  		rec.u.ref.value_type = REFTABLE_REF_VAL1;
> -		rec.u.ref.value.val1 = hash;
> +		memset(rec.u.ref.value.val1, i, GIT_SHA1_RAWSZ);
>  
>  		names[i] = xstrdup(name);
>  		n = block_writer_add(&bw, &rec);
> diff --git a/reftable/merged_test.c b/reftable/merged_test.c
> index d08c16abef..b3927a5d73 100644
> --- a/reftable/merged_test.c
> +++ b/reftable/merged_test.c
> @@ -123,13 +123,11 @@ static void readers_destroy(struct reftable_reader **readers, size_t n)
>  
>  static void test_merged_between(void)
>  {
> -	uint8_t hash1[GIT_SHA1_RAWSZ] = { 1, 2, 3, 0 };
> -
>  	struct reftable_ref_record r1[] = { {
>  		.refname = "b",
>  		.update_index = 1,
>  		.value_type = REFTABLE_REF_VAL1,
> -		.value.val1 = hash1,
> +		.value.val1 = { 1, 2, 3, 0 },
>  	} };
>  	struct reftable_ref_record r2[] = { {
>  		.refname = "a",
> @@ -165,26 +163,24 @@ static void test_merged_between(void)
>  
>  static void test_merged(void)
>  {
> -	uint8_t hash1[GIT_SHA1_RAWSZ] = { 1 };
> -	uint8_t hash2[GIT_SHA1_RAWSZ] = { 2 };
>  	struct reftable_ref_record r1[] = {
>  		{
>  			.refname = "a",
>  			.update_index = 1,
>  			.value_type = REFTABLE_REF_VAL1,
> -			.value.val1 = hash1,
> +			.value.val1 = { 1 },
>  		},
>  		{
>  			.refname = "b",
>  			.update_index = 1,
>  			.value_type = REFTABLE_REF_VAL1,
> -			.value.val1 = hash1,
> +			.value.val1 = { 1 },
>  		},
>  		{
>  			.refname = "c",
>  			.update_index = 1,
>  			.value_type = REFTABLE_REF_VAL1,
> -			.value.val1 = hash1,
> +			.value.val1 = { 1 },
>  		}
>  	};
>  	struct reftable_ref_record r2[] = { {
> @@ -197,13 +193,13 @@ static void test_merged(void)
>  			.refname = "c",
>  			.update_index = 3,
>  			.value_type = REFTABLE_REF_VAL1,
> -			.value.val1 = hash2,
> +			.value.val1 = { 2 },
>  		},
>  		{
>  			.refname = "d",
>  			.update_index = 3,
>  			.value_type = REFTABLE_REF_VAL1,
> -			.value.val1 = hash1,
> +			.value.val1 = { 1 },
>  		},
>  	};
>  
> diff --git a/reftable/readwrite_test.c b/reftable/readwrite_test.c
> index 9c16e0504e..87b238105c 100644
> --- a/reftable/readwrite_test.c
> +++ b/reftable/readwrite_test.c
> @@ -60,18 +60,15 @@ static void write_table(char ***names, struct strbuf *buf, int N,
>  	*names = reftable_calloc(sizeof(char *) * (N + 1));
>  	reftable_writer_set_limits(w, update_index, update_index);
>  	for (i = 0; i < N; i++) {
> -		uint8_t hash[GIT_SHA256_RAWSZ] = { 0 };
>  		char name[100];
>  		int n;
>  
> -		set_test_hash(hash, i);
> -
>  		snprintf(name, sizeof(name), "refs/heads/branch%02d", i);
>  
>  		ref.refname = name;
>  		ref.update_index = update_index;
>  		ref.value_type = REFTABLE_REF_VAL1;
> -		ref.value.val1 = hash;
> +		set_test_hash(ref.value.val1, i);
>  		(*names)[i] = xstrdup(name);
>  
>  		n = reftable_writer_add_ref(w, &ref);
> @@ -675,11 +672,10 @@ static void test_write_object_id_min_length(void)
>  	struct strbuf buf = STRBUF_INIT;
>  	struct reftable_writer *w =
>  		reftable_new_writer(&strbuf_add_void, &buf, &opts);
> -	uint8_t hash[GIT_SHA1_RAWSZ] = {42};
>  	struct reftable_ref_record ref = {
>  		.update_index = 1,
>  		.value_type = REFTABLE_REF_VAL1,
> -		.value.val1 = hash,
> +		.value.val1 = {42},
>  	};
>  	int err;
>  	int i;
> @@ -711,11 +707,10 @@ static void test_write_object_id_length(void)
>  	struct strbuf buf = STRBUF_INIT;
>  	struct reftable_writer *w =
>  		reftable_new_writer(&strbuf_add_void, &buf, &opts);
> -	uint8_t hash[GIT_SHA1_RAWSZ] = {42};
>  	struct reftable_ref_record ref = {
>  		.update_index = 1,
>  		.value_type = REFTABLE_REF_VAL1,
> -		.value.val1 = hash,
> +		.value.val1 = {42},
>  	};
>  	int err;
>  	int i;
> @@ -814,11 +809,10 @@ static void test_write_multiple_indices(void)
>  	writer = reftable_new_writer(&strbuf_add_void, &writer_buf, &opts);
>  	reftable_writer_set_limits(writer, 1, 1);
>  	for (i = 0; i < 100; i++) {
> -		unsigned char hash[GIT_SHA1_RAWSZ] = {i};
>  		struct reftable_ref_record ref = {
>  			.update_index = 1,
>  			.value_type = REFTABLE_REF_VAL1,
> -			.value.val1 = hash,
> +			.value.val1 = {i},
>  		};
>  
>  		strbuf_reset(&buf);
> diff --git a/reftable/record.c b/reftable/record.c
> index 5e258c734b..a67a6b4d8a 100644
> --- a/reftable/record.c
> +++ b/reftable/record.c
> @@ -219,7 +219,6 @@ static void reftable_ref_record_copy_from(void *rec, const void *src_rec,
>  	case REFTABLE_REF_DELETION:
>  		break;
>  	case REFTABLE_REF_VAL1:
> -		ref->value.val1 = reftable_malloc(hash_size);
>  		memcpy(ref->value.val1, src->value.val1, hash_size);
>  		break;
>  	case REFTABLE_REF_VAL2:
> @@ -303,7 +302,6 @@ void reftable_ref_record_release(struct reftable_ref_record *ref)
>  		reftable_free(ref->value.val2.value);
>  		break;
>  	case REFTABLE_REF_VAL1:
> -		reftable_free(ref->value.val1);
>  		break;
>  	case REFTABLE_REF_DELETION:
>  		break;
> @@ -394,7 +392,6 @@ static int reftable_ref_record_decode(void *rec, struct strbuf key,
>  			return -1;
>  		}
>  
> -		r->value.val1 = reftable_malloc(hash_size);
>  		memcpy(r->value.val1, in.buf, hash_size);
>  		string_view_consume(&in, hash_size);
>  		break;
> diff --git a/reftable/record_test.c b/reftable/record_test.c
> index 70ae78feca..5c94d26e35 100644
> --- a/reftable/record_test.c
> +++ b/reftable/record_test.c
> @@ -119,7 +119,6 @@ static void test_reftable_ref_record_roundtrip(void)
>  		case REFTABLE_REF_DELETION:
>  			break;
>  		case REFTABLE_REF_VAL1:
> -			in.u.ref.value.val1 = reftable_malloc(GIT_SHA1_RAWSZ);
>  			set_hash(in.u.ref.value.val1, 1);
>  			break;
>  		case REFTABLE_REF_VAL2:
> diff --git a/reftable/reftable-record.h b/reftable/reftable-record.h
> index f7eb2d6015..7f3a0df635 100644
> --- a/reftable/reftable-record.h
> +++ b/reftable/reftable-record.h
> @@ -9,6 +9,7 @@ license that can be found in the LICENSE file or at
>  #ifndef REFTABLE_RECORD_H
>  #define REFTABLE_RECORD_H
>  
> +#include "hash-ll.h"
>  #include <stdint.h>
>  
>  /*
> @@ -38,7 +39,7 @@ struct reftable_ref_record {
>  #define REFTABLE_NR_REF_VALUETYPES 4
>  	} value_type;
>  	union {
> -		uint8_t *val1; /* malloced hash. */
> +		unsigned char val1[GIT_MAX_RAWSZ];
>  		struct {
>  			uint8_t *value; /* first value, malloced hash  */
>  			uint8_t *target_value; /* second value, malloced hash */
> diff --git a/reftable/stack_test.c b/reftable/stack_test.c
> index 14a3fc11ee..feab49d7f7 100644
> --- a/reftable/stack_test.c
> +++ b/reftable/stack_test.c
> @@ -463,7 +463,6 @@ static void test_reftable_stack_add(void)
>  		refs[i].refname = xstrdup(buf);
>  		refs[i].update_index = i + 1;
>  		refs[i].value_type = REFTABLE_REF_VAL1;
> -		refs[i].value.val1 = reftable_malloc(GIT_SHA1_RAWSZ);
>  		set_test_hash(refs[i].value.val1, i);
>  
>  		logs[i].refname = xstrdup(buf);
> @@ -600,7 +599,6 @@ static void test_reftable_stack_tombstone(void)
>  		refs[i].update_index = i + 1;
>  		if (i % 2 == 0) {
>  			refs[i].value_type = REFTABLE_REF_VAL1;
> -			refs[i].value.val1 = reftable_malloc(GIT_SHA1_RAWSZ);
>  			set_test_hash(refs[i].value.val1, i);
>  		}

* Re: [PATCH v2 7/8] reftable/merged: really reuse buffers to compute record keys
From: Junio C Hamano @ 2023-12-28 17:03 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Han-Wen Nienhuys
In-Reply-To: <6313f8affdc136b183c1bd411d481efe5c676aee.1703743174.git.ps@pks.im>

Patrick Steinhardt <ps@pks.im> writes:

> In 829231dc20 (reftable/merged: reuse buffer to compute record keys,
> 2023-12-11), we have refactored the merged iterator to reuse a set of
> buffers that it would otherwise have to reallocate on every single
> iteration. Unfortunately, there was a brown-paper-bag-style bug here as
> we continue to release these buffers after the iteration, and thus we
> have essentially gained nothing.

s/-style// perhaps.  It took me more than just reading the above;
I needed to see the code before I noticed that you are talking
about strbuf_release().  Only after that I think I understood what
you meant as a bug.

    With the change, instead of using a new "entry_key" strbuf for
    each iteration, the code now passes mi->entry_key to
    reftable_record_key(), which will reuse the existing .buf member
    of the strbuf to avoid reallocation.  But releasing the strbuf in
    each iteration defeats such optimization.

I suspect that a Git developer who will be reading "git log" output
in 6 months and finds the above paragraph would understand the problem
and its fix better if the description hinted at strbuf_reset() near
where it mentions "release", something like:

    ... to reuse a pair of long-living strbufs by relying on the
    fact that reftable_record_key() tries to reuse its already
    allocated .buf member by calling strbuf_reset(), which should
    give us significantly fewer reallocations compared to the old
    code that used on-stack strbufs that are allocated for each and
    every iteration.  Unfortunately, we called strbuf_release() on
    these long-living strbufs that we meant to reuse, defeating the
    optimization.

or along that line, perhaps?
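
The reset-versus-release distinction can be sketched with a miniature
growable buffer. This is not Git's strbuf; `demo_buf` and the functions
around it are hypothetical and only mirror the behaviour described
above (reset keeps the allocation, release frees it):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of a growable buffer; not Git's strbuf. */
struct demo_buf {
	char *buf;
	size_t len, alloc;
};

static int demo_allocs; /* counts calls that had to (re)allocate */

static void demo_buf_reset(struct demo_buf *b)
{
	b->len = 0; /* keep the allocation for reuse */
}

static void demo_buf_release(struct demo_buf *b)
{
	free(b->buf);
	b->buf = NULL;
	b->len = b->alloc = 0;
}

static void demo_buf_set(struct demo_buf *b, const char *s)
{
	size_t n = strlen(s);

	demo_buf_reset(b);
	if (n + 1 > b->alloc) {
		char *p = realloc(b->buf, n + 1);
		if (!p)
			abort();
		b->buf = p;
		b->alloc = n + 1;
		demo_allocs++;
	}
	memcpy(b->buf, s, n + 1);
	b->len = n;
}

/* Fill the buffer `rounds` times, optionally releasing after each use. */
static int demo_count_allocs(int rounds, int release_each_time)
{
	struct demo_buf b = { NULL, 0, 0 };

	demo_allocs = 0;
	for (int i = 0; i < rounds; i++) {
		demo_buf_set(&b, "refs/heads/main");
		if (release_each_time)
			demo_buf_release(&b);
	}
	demo_buf_release(&b);
	return demo_allocs;
}
```

Releasing inside the loop forces one allocation per round, while
releasing only at the end lets every round after the first reuse the
same memory.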

Other than that, a very reasonable fix.  Thanks for a pleasant read.

> Fix this performance issue by not releasing those buffers on iteration
> anymore, where we instead rely on `merged_iter_close()` to release the
> buffers for us.
>
> Using `git show-ref --quiet` in a repository with ~350k refs this leads
> to a significant drop in allocations. Before:
>
>     HEAP SUMMARY:
>         in use at exit: 21,163 bytes in 193 blocks
>       total heap usage: 1,410,148 allocs, 1,409,955 frees, 61,976,068 bytes allocated
>
> After:
>
>     HEAP SUMMARY:
>         in use at exit: 21,163 bytes in 193 blocks
>       total heap usage: 708,058 allocs, 707,865 frees, 36,783,255 bytes allocated
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  reftable/merged.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/reftable/merged.c b/reftable/merged.c
> index 556bb5c556..a28bb99aaf 100644
> --- a/reftable/merged.c
> +++ b/reftable/merged.c
> @@ -128,8 +128,6 @@ static int merged_iter_next_entry(struct merged_iter *mi,
>  
>  done:
>  	reftable_record_release(&entry.rec);
> -	strbuf_release(&mi->entry_key);
> -	strbuf_release(&mi->key);
>  	return err;
>  }

* Re: [PATCH] mem-pool: fix big allocations
From: phillip.wood123 @ 2023-12-28 16:48 UTC (permalink / raw)
  To: René Scharfe, phillip.wood, Git List; +Cc: Jameson Miller
In-Reply-To: <c5d35735-10e2-4b71-8fc7-6218e7002549@web.de>

On 28/12/2023 16:05, René Scharfe wrote:
> On 28.12.23 at 16:10, Phillip Wood wrote:
>> The diff at the end of
>> this email shows a possible implementation of a check_ptr() macro for
>> the unit test library. I'm wary of adding it though because I'm not sure
>> printing the pointer values is actually very useful most of the
>> time. I'm also concerned that the rules around pointer arithmetic and
>> comparisons mean that many pointer tests such as
>>
>>      check_ptr(pool->mp_block->next_free, <=, pool->mp_block->end);
>>
>> will be undefined if they fail.
> 
> True, the compiler could legally emit mush when it finds out that the
> pointers are for different objects.  And the error being fixed produces
> such unrelated pointer pairs -- oops.
> 
> This check is not important here, we can just drop it.
> 
> mem_pool_contains() has the same problem, by the way.
> 
> Restricting ourselves to only equality comparisons for pointers prevents
> some interesting sanity checks, though.  Casting to intptr_t or
> uintptr_t would allow arbitrary comparisons without risk of undefined
> behavior.  Perhaps that would make a check_ptr() macro viable
> and useful.

That certainly helps and the check_ptr() macro in my previous email 
casts the pointers to uintptr_t before comparing them. Maybe I'm 
worrying too much, but my concern is that in a failing comparison it is 
likely one of the pointers is invalid (for example it is the result of 
some undefined pointer arithmetic) and the program is undefined from the 
point the invalid pointer is created. The documentation for check_ptr() 
in my previous mail contains the following example

     For example if `start` and `end` are pointers to the beginning and
     end of an allocation and `offset` is an integer then

         check_ptr(start + offset, <=, end)

     is undefined when `offset` is larger than `end - start`. Rewriting
     the comparison as

         check_uint(offset, <=, end - start)

     avoids undefined behavior when offset is too large, but is still
     undefined if there is a bug that means `start` and `end` do not
     point to the same allocation.

I agree it would be nice to allow arbitrary pointer comparisons but it 
would be good to do it in a way that does not expose us to undefined 
behavior. I'm not sure what the right balance is here.

Best Wishes

Phillip

* Re: [PATCH] mem-pool: fix big allocations
From: René Scharfe @ 2023-12-28 16:05 UTC (permalink / raw)
  To: phillip.wood, Git List; +Cc: Jameson Miller
In-Reply-To: <9aad15c8-8d3b-475b-bd44-5d24121cb793@gmail.com>

On 28.12.23 at 16:10, Phillip Wood wrote:
> Hi René
>
> On 21/12/2023 23:13, René Scharfe wrote:
>> +#define check_ptr(a, op, b) check_int(((a) op (b)), ==, 1)
>> [...]
>> +static void t_calloc_100(struct mem_pool *pool)
>> +{
>> +    size_t size = 100;
>> +    char *buffer = mem_pool_calloc(pool, 1, size);
>> +    for (size_t i = 0; i < size; i++)
>> +        check_int(buffer[i], ==, 0);
>> +    if (!check_ptr(pool->mp_block, !=, NULL))
>> +        return;
>> +    check_ptr(pool->mp_block->next_free, <=, pool->mp_block->end);
>> +    check_ptr(pool->mp_block->next_free, !=, NULL);
>> +    check_ptr(pool->mp_block->end, !=, NULL);
>> +}
>
> It's great to see the unit test framework being used here. I wonder
> though if it would be simpler just to use
>
>     check(ptr != NULL)

Yes, that's better.

> as I'm not sure what the check_ptr() macro adds. The diff at the end of
> this email shows a possible implementation of a check_ptr() macro for
> the unit test library. I'm wary of adding it though because I'm not sure
> printing the pointer values is actually very useful most of the
> time. I'm also concerned that the rules around pointer arithmetic and
> comparisons mean that many pointer tests such as
>
>     check_ptr(pool->mp_block->next_free, <=, pool->mp_block->end);
>
> will be undefined if they fail.

True, the compiler could legally emit mush when it finds out that the
pointers are for different objects.  And the error being fixed produces
such unrelated pointer pairs -- oops.

This check is not important here, we can just drop it.

mem_pool_contains() has the same problem, by the way.

Restricting ourselves to only equality comparisons for pointers prevents
some interesting sanity checks, though.  Casting to intptr_t or
uintptr_t would allow arbitrary comparisons without risk of undefined
behavior.  Perhaps that would make a check_ptr() macro viable
and useful.

René

* Re: [PATCH] Port helper/test-ctype.c to unit-tests/t-ctype.c
From: René Scharfe @ 2023-12-28 16:05 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Christian Couder, Achu Luma, git, Christian Couder, Phillip Wood,
	Josh Steadmon
In-Reply-To: <xmqqcyurky00.fsf@gitster.g>

On 28.12.23 at 00:48, Junio C Hamano wrote:
> René Scharfe <l.s.r@web.de> writes:
>
>>> Also it might not be a big issue here, but when the new unit test
>>> framework was proposed, I commented on the fact that "left" and
>>> "right" were perhaps a bit less explicit than "actual" and "expected".
>>
>> True.
>> ...
>> The added repetition is a bit grating.  With a bit of setup, loop
>> unrolling and stringification you can retain the property of only having
>> to mention the class name once.  Demo patch below.
>
> Nice.
>
> This (and your mempool thing) being one of the early efforts to
> adopt the unit-test framework outside the initial set of sample
> tests, it is understandable that we might find what the framework offers
> is still lacking.  But at the same time, while the macro tricks
> demonstrated here are all amusing to read and admire, it feels a bit
> too much to expect that the test writers are willing to invent
> something like these every time they want to test.
>
> Being a relatively faithful conversion of the original ctype tests,
> with its thorough enumeration of test samples and expected output,
> is what makes this test program require these macro tricks, and it
> does not have much to do with the features (or lack thereof) of the
> framework, I guess.

*nod*

>
>> +struct ctype {
>> +	const char *name;
>> +	const char *expect;
>> +	int actual[256];
>> +};
>> +
>> +static void test_ctype(const struct ctype *class)
>> +{
>> +	for (int i = 0; i < 256; i++) {
>> +		int expect = is_in(class->expect, i);
>> +		int actual = class->actual[i];
>> +		int res = test_assert(TEST_LOCATION(), class->name,
>> +				      actual == expect);
>> +		if (!res)
>> +			test_msg("%s classifies char %d (0x%02x) wrongly",
>> +				 class->name, i, i);
>> +	}
>>  }
>
> Somehow, the "test_assert" does not seem to be adding much value
> here (i.e. we can do "res = (actual == expect)" there).  Is this
> because we want to be able to report success, too?
>
>     ... goes and looks at test_assert() ...
>
> Ah, is it because we want to be able to "skip" (which pretends that
> the assert() was satisfied)?  OK, but then the error reporting from
> it is redundant with our own test_msg().

True, the test_msg() emits the old message here, but it doesn't have to
report that the check failed anymore, because test_assert() already
covers that part.  It would only have to report the misclassified
character and perhaps the expected result.
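
As a rough illustration of such a trimmed-down message, here is a
standalone sketch.  format_mismatch() is a hypothetical helper, not part of
the unit-test library; in the real test the formatted text would go through
test_msg() instead of a caller-supplied buffer:

```c
#include <stdio.h>

/*
 * Hypothetical helper: describe only the misclassified character and
 * the expected/actual results, leaving the "check failed" part to the
 * assertion machinery.  Returns what snprintf() returns.
 */
static int format_mismatch(char *out, size_t len, int ch,
			   int expect, int actual)
{
	return snprintf(out, len, "char %d (0x%02x): expected %d, got %d",
			ch, (unsigned int)ch, expect, actual);
}
```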

René


* Re: Git Rename Detection Bug
From: Philip Oakley @ 2023-12-28 15:33 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Jeremy Pridmore, git@vger.kernel.org, Paul Baumgartner
In-Reply-To: <CABPp-BEdSGBt7DCrJCmOtG+RgZ2F3fNZQJ91PjZQxNa-ShKf8g@mail.gmail.com>

Hi Elijah,
Many thanks. Personal notes inline.

On 24/12/2023 07:46, Elijah Newren wrote:
> Hi Philip,
> 
> Sorry for the late reply; I somehow missed this earlier.
> 
> On Wed, Nov 15, 2023 at 8:51 AM Philip Oakley <philipoakley@iee.email> wrote:
>>
>> Hi Elijah,
>>
>> On 11/11/2023 05:46, Elijah Newren wrote:
>>> * filename similarity is extraordinarily expensive compared to exact
>>> renames, and if not carefully handled, can sometimes rival the cost of
>>> file content similarity computations given our spanhash
>>> representations.
>>
>> I've not heard of spanhash representation before. Any references or
>> further reading?
> 
> You can find more in diffcore-delta.c, especially the big comment near
> the top of the file.

+1

>         But here's a short explanation of spanhashes:
>   * Split files into chunks delimited either by LF or 64 bytes,
> whichever comes first.

neat


>   * Hash every chunk into an integer between 0 and 107926

As per the comment, this is one less than a nice prime, 107927, that fits
in 17 bits.
Some discussion at
https://lore.kernel.org/git/7vwtezt202.fsf@assigned-by-dhcp.cox.net/ and
surrounding messages.

The hash is very similar to a CRC: a rotating 64-bit value, using 7-bit
shifts and an 8-bit char addition, then reduced to a hash computed at ~#L157

>   * Keep a character count for each of those integers as well (thus if
> a line has N characters, but appears twice in the file, the associated
> count for that integer will be 2N).
>   * A "spanhash" is the combination of the integer that a chunk (or
> span) hashes to, plus the count associated with it.
>   * The list/array of spanhashes for a file (i.e. the list/array of
> integers and character counts) is used to compare one file to another.

I was surprised to see that I'd been in the area at #L162 ;-)

Thank you for the useful summary.
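
The steps in the summary above can be sketched roughly as follows.  This is
a simplified illustration, not the code in diffcore-delta.c: the directly
indexed table, the names span_counts, count_spans() and common_bytes(), and
the exact mixing step are assumptions made for the example, and CRLF
handling is left out.  It also assumes a 32-bit unsigned int:

```c
#include <stddef.h>

#define HASHBASE 107927u	/* prime; hash values fall in 0..107926 */
#define MAX_SPAN 64		/* a chunk ends at LF or after 64 bytes */

/* Simplified table: one byte counter per possible hash value. */
struct span_counts {
	unsigned int bytes[HASHBASE];
};

/* Split buf into chunks, hash each chunk and add its length to its slot. */
static void count_spans(struct span_counts *t,
			const unsigned char *buf, size_t sz)
{
	unsigned int accum1 = 0, accum2 = 0, n = 0;

	for (size_t i = 0; i < sz; i++) {
		unsigned int c = buf[i];
		unsigned int old1 = accum1;

		/* rotate the two accumulators by 7 bits, then add the byte */
		accum1 = (accum1 << 7) ^ (accum2 >> 25);
		accum2 = (accum2 << 7) ^ (old1 >> 25);
		accum1 += c;
		n++;
		/* keep going until LF, 64 bytes, or end of input */
		if (n < MAX_SPAN && c != '\n' && i + 1 < sz)
			continue;
		t->bytes[(accum1 + accum2 * 0x61) % HASHBASE] += n;
		n = 0;
		accum1 = accum2 = 0;
	}
}

/* One possible comparison: bytes that both files have in common. */
static unsigned long common_bytes(const struct span_counts *a,
				  const struct span_counts *b)
{
	unsigned long total = 0;

	for (size_t i = 0; i < HASHBASE; i++)
		total += a->bytes[i] < b->bytes[i] ? a->bytes[i] : b->bytes[i];
	return total;
}
```

With this layout a 4-byte line that appears twice contributes 8 to its
slot, matching the "2N" remark in the summary.  A table this size is best
kept off the stack, e.g. as a static or heap-allocated object.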


> 
> Now, why do I claim that comparison of filenames can rival cost of
> file content similarity?  Well, in a monorepo of interest, the median
> sized file is named something close to
> "modules/client-resources/src/main/resources/buttons/smallTriangleBlackRight.png"
> and is 2875 bytes.  As a png, all its chunks are probably the full 64
> characters, which works out to about 45 chunks (assuming the 64-byte
> chunks are different from each other).  The filename is 79 characters.
> So, for this case, 45 pairs of integers vs 79 characters.  So, the
> comparison cost is roughly the same order of magnitude.
> (Yes, creating the spanhashes is a heavy overhead; however, we only
> initialize it once and then do N comparisons of each spanhash to the
> other spanhashes.  And we'd be doing N comparisons of each filename to
> other filenames, so the overhead of creating the spanhashes can be
> overlooked if your merge has enough files modified on both sides of
> history.)
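
The back-of-the-envelope numbers above can be reproduced with a few lines
of C.  chunk_count() is a hypothetical helper that assumes every chunk runs
the full 64 bytes (no chunk ends early at an LF), which is the same
assumption the estimate makes:

```c
#include <stddef.h>
#include <string.h>

#define SPAN_BYTES 64

/* Number of chunks if every chunk is a full 64 bytes (last may be short). */
static size_t chunk_count(size_t file_size)
{
	return (file_size + SPAN_BYTES - 1) / SPAN_BYTES;
}

/* The example path from the message above. */
static const char example_path[] =
	"modules/client-resources/src/main/resources/buttons/"
	"smallTriangleBlackRight.png";
```

Here chunk_count(2875) gives 45 and strlen(example_path) gives 79, so each
content comparison touches about 45 (hash, count) pairs while each filename
comparison touches 79 characters.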

Nice point about the hashes only being computed once.

> 
> Yes, this particular repository is a case I randomly picked that you
> can argue is special.  But rather than look at the particular example,
> I think it's interesting to check how the spanhash size vs. filename
> size scale with repository size.  From my experience: (1) I don't
> think the median-sized file varies all that much between small and big
> repositories; every time I check a repo the median size seems to be
> on the order of a few thousand bytes, regardless of whether the repository
> I'm looking at is tiny or huge, (2) while small repositories often
> have much shorter filenames, big repositories often will have
> filenames even longer than my example; length of filename tends to
> grow with repository size from deep directory nestings.  So, between
> these two facts, I'd expect the filename comparison costs to grow
> relative to file content comparison costs, when considering only
> median-sized files being modified.  And since it's common to have
> merges or rebases or diffs where only approximately-median-sized files
> are involved, I think this is relevant to look at.  Finally, since I
> already had an example that showed the cost likely roughly comparable
> for a random repository of interest, and it's not even all that big a
> repository compared to many out there, I think the combination
> motivates pretty well my claim that filename similarity costs _could_
> rival file content similarity costs if one wasn't careful.
> 
> I don't have a rigorous proof here.  And, in fact, I ended up doing
> this rough back-of-the-envelope analysis _after_ implementing some
> filename similarity comparison ideas and seeing performance degrade
> badly, and wondering why it made such a difference.  I don't know if I
> ever got exact numbers, but I certainly didn't record them.  This
> rough analysis, though, was what made me realize that I needed to be
> careful with any such added filename comparisons, and is why
> I'm leery of adding more.

Thanks again.


* Re: [PATCH] mem-pool: fix big allocations
From: Phillip Wood @ 2023-12-28 15:10 UTC (permalink / raw)
  To: René Scharfe, Git List; +Cc: Jameson Miller
In-Reply-To: <fa89d269-1a23-4ed6-bebc-30c0b629f444@web.de>

Hi René

On 21/12/2023 23:13, René Scharfe wrote:
> +#define check_ptr(a, op, b) check_int(((a) op (b)), ==, 1)
> [...]
> +static void t_calloc_100(struct mem_pool *pool)
> +{
> +	size_t size = 100;
> +	char *buffer = mem_pool_calloc(pool, 1, size);
> +	for (size_t i = 0; i < size; i++)
> +		check_int(buffer[i], ==, 0);
> +	if (!check_ptr(pool->mp_block, !=, NULL))
> +		return;
> +	check_ptr(pool->mp_block->next_free, <=, pool->mp_block->end);
> +	check_ptr(pool->mp_block->next_free, !=, NULL);
> +	check_ptr(pool->mp_block->end, !=, NULL);
> +}

It's great to see the unit test framework being used here. I wonder
though if it would be simpler just to use

	check(ptr != NULL)

as I'm not sure what the check_ptr() macro adds. The diff at the end of
this email shows a possible implementation of a check_ptr() macro for
the unit test library. I'm wary of adding it though because I'm not sure
printing the pointer values is actually very useful most of the
time. I'm also concerned that the rules around pointer arithmetic and
comparisons mean that many pointer tests such as

     check_ptr(pool->mp_block->next_free, <=, pool->mp_block->end);

will be undefined if they fail. The documentation for check_ptr() below
tries to illustrate that concern. If the compiler can prove that a check
is undefined when that check fails it is at liberty to hard code the
test as passing. In practice I think most failing pointer comparisons
would fall into the category of "this is undefined but the compiler
can't prove it" but that doesn't really make me any happier.
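
One way to sidestep the problem, sketched under the assumption that the
code under test can expose sizes and offsets rather than only pointers, is
to do the bounds check on integers.  fits() is a hypothetical helper, not
part of mem-pool or the test library:

```c
#include <stddef.h>

/*
 * Hypothetical helper: would `len` bytes starting at `offset` fit into
 * a buffer of `size` bytes?  No pointer is formed or compared, and the
 * subtraction is only evaluated once offset <= size is known, so the
 * check cannot wrap around.
 */
static int fits(size_t offset, size_t len, size_t size)
{
	return offset <= size && len <= size - offset;
}
```

A test could then assert fits(used, wanted, capacity) with plain integer
checks, and a failure would be well defined whatever the values are.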

Best Wishes

Phillip

---- >8 ----
diff --git a/t/unit-tests/test-lib.h b/t/unit-tests/test-lib.h
index a8f07ae0b7..ecd1fce17d 100644
--- a/t/unit-tests/test-lib.h
+++ b/t/unit-tests/test-lib.h
@@ -99,6 +99,39 @@ int check_int_loc(const char *loc, const char *check, int ok,
  int check_uint_loc(const char *loc, const char *check, int ok,
  		   uintmax_t a, uintmax_t b);
  
+/*
+ * Compare two pointers. Prints a message with the two values if the
+ * comparison fails. NB this is not thread safe.
+ *
+ * Use this with care. The rules around pointer arithmetic and comparison
+ * in C are quite strict and violating them results in undefined behavior.
+ * To avoid a failing comparison resulting in undefined behavior we compare
+ * the integer value of the pointers. While this avoids undefined
+ * behavior in the comparison, in many cases a failing test will be the
+ * result of creating an invalid pointer in a way that violates the
+ * rules on pointer arithmetic. For example if `start` and `end` are
+ * pointers to the beginning and end of an allocation and `offset` is an
+ * integer then
+ *
+ *     check_ptr(start + offset, <=, end)
+ *
+ * is undefined when `offset` is larger than `end - start`. Rewriting the
+ * comparison as
+ *
+ *     check_uint(offset, <=, end - start)
+ *
+ * avoids undefined behavior when offset is too large, but is still
+ * undefined if there is a bug that means `start` and `end` do not point
+ * to the same allocation.
+ */
+#define check_ptr(a, op, b)						\
+	(test__tmp[0].p = (a), test__tmp[1].p = (b),			\
+	 check_ptr_loc(TEST_LOCATION(), #a" "#op" "#b,			\
+		       (uintptr_t)test__tmp[0].p op (uintptr_t)test__tmp[1].p,	\
+			test__tmp[0].p, test__tmp[1].p))
+
+int check_ptr_loc(const char *loc, const char *check, int ok, void *a, void *b);
+
  /*
   * Compare two chars. Prints a message with the two values if the
   * comparison fails. NB this is not thread safe.
@@ -133,6 +166,7 @@ int check_str_loc(const char *loc, const char *check,
  #define TEST__MAKE_LOCATION(line) __FILE__ ":" TEST__STR(line)
  
  union test__tmp {
+	void *p;
  	intmax_t i;
  	uintmax_t u;
  	char c;
diff --git a/t/unit-tests/test-lib.c b/t/unit-tests/test-lib.c
index 7bf9dfdb95..cb757edbd8 100644
--- a/t/unit-tests/test-lib.c
+++ b/t/unit-tests/test-lib.c
@@ -311,6 +311,18 @@ int check_uint_loc(const char *loc, const char *check, int ok,
  	return ret;
  }
  
+int check_ptr_loc(const char *loc, const char *check, int ok, void *a, void *b)
+{
+	int ret = test_assert(loc, check, ok);
+
+	if (!ret) {
+		test_msg("   left: %p", a);
+		test_msg("  right: %p", b);
+	}
+
+	return ret;
+}
+
  static void print_one_char(char ch, char quote)
  {
  	if ((unsigned char)ch < 0x20u || ch == 0x7f) {

