[PATCH] git-compat-util: introduce `count

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] git-compat-util: introduce `count_t` typedef
@ 2025-08-07  9:22 Patrick Steinhardt
  2025-08-07 11:00 ` Matthias Aßhauer
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Patrick Steinhardt @ 2025-08-07  9:22 UTC (permalink / raw)
  To: git; +Cc: Oswald Buddenhagen, Junio C Hamano, Taylor Blau, Jeff King

Historically, Git has been very lenient with its use of integer types
and didn't really give much thought into which type to use in what
situation. We interchangeably mix and match signed and unsigned types
and often times blindly convert them. This use has led to several
out-of-bounds reads and writes in the past, some of which could be
turned into arbitrary code execution.

As a counter measure we have eventually enabled "-Wsign-compare"
warnings. Most of our code base generates heaps of warnings, which is
why we have a macro `DISABLE_SIGN_COMPARE_WARNINGS` defined for every
such file. The expectation is that slowly but surely we'll convert our
code base to have better hygiene around signedness, and new code that is
being added handles types correctly from the start.

There are regular discussions around whether or not these warnings are
sensible to have in the first place. My (biased) opinion with having
fixed several out-of-bounds reads and writes is that they are senisble,
as they would have provided warnings around code sites that had those
issues. And arguably, we still have _lots_ of sites that are susceptible
to using the wrong type, and more likely than not some of those will be
exploitable.

Furthermore, I would claim that the question of whether or not those
warnings are helpful wouldn't have come up if we had the warnings
enabled from the inception of Git. The churn caused by the fixes for
such warnings is real, and they need to be done with a lot of care. But
since we have removed this project from our microprojects page we don't
see "random" contributions in this area anymore.

So overall, the conversions are on the painful side, but in the long
term they will help us to protect against introducing new exploits.

A discussion that regularly comes up in this context though is what
types to use for counting entities:

  - One question is whether the type should be signed or unsigned.
    Arguably, the answer should be to use unsigned types as long as we
    know that we never need a negative value, e.g. as a sentinel. This
    helps guide the reader and explicitly conveys the sense that such a
    counter is only ever going to be a non-negative number. Otherwise,
    code would need to be more careful as it may hold negative values.

  - Another question is what type to use. In lots of situations we have
    used `size_t`, but this is conflating semantics. `size_t` is used to
    count bytes, not entities.

Introduce a new typedef for `count_t` that is of type `uintptr_t` to
give clear guidance what type to use for counting entities. This type
was chosen because in the worst case, an entity may be a single byte and
we fill all of our memory with these entities. As `uintptr_t` is
guaranteed to hold at least the value of a pointer, we know that it
could be used to index into every single such entity.

Amend the coding guidelines to state when to use `size_t` and when to
use `count_t`. Convert an example file to use the new type.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 Documentation/CodingGuidelines |  3 +++
 builtin/rm.c                   | 25 ++++++++++++-------------
 git-compat-util.h              | 15 +++++++++++++++
 3 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/Documentation/CodingGuidelines b/Documentation/CodingGuidelines
index 224f0978a8..2e9f3c07ff 100644
--- a/Documentation/CodingGuidelines
+++ b/Documentation/CodingGuidelines
@@ -238,6 +238,9 @@ For shell scripts specifically (not exhaustive):
 
 For C programs:
 
+ - We use `size_t` to count the number of bytes and `count_t` to count the
+   number of entities of a given type.
+
  - We use tabs to indent, and interpret tabs as taking up to
    8 spaces.
 
diff --git a/builtin/rm.c b/builtin/rm.c
index 05d89e98c3..99b845cf34 100644
--- a/builtin/rm.c
+++ b/builtin/rm.c
@@ -33,11 +33,11 @@ static const char * const builtin_rm_usage[] = {
 };
 
 static struct {
-	int nr, alloc;
 	struct {
 		const char *name;
 		char is_submodule;
 	} *entry;
+	count_t entry_nr, entry_alloc;
 } list;
 
 static int get_ours_cache_pos(const char *path, unsigned int pos)
@@ -73,8 +73,7 @@ static void print_error_files(struct string_list *files_list,
 
 static void submodules_absorb_gitdir_if_needed(void)
 {
-	int i;
-	for (i = 0; i < list.nr; i++) {
+	for (count_t i = 0; i < list.entry_nr; i++) {
 		const char *name = list.entry[i].name;
 		int pos;
 		const struct cache_entry *ce;
@@ -106,14 +105,14 @@ static int check_local_mod(struct object_id *head, int index_only)
 	 * lazy, and who cares if removal of files is a tad
 	 * slower than the theoretical maximum speed?
 	 */
-	int i, no_head;
+	int no_head;
 	int errs = 0;
 	struct string_list files_staged = STRING_LIST_INIT_NODUP;
 	struct string_list files_cached = STRING_LIST_INIT_NODUP;
 	struct string_list files_local = STRING_LIST_INIT_NODUP;
 
 	no_head = is_null_oid(head);
-	for (i = 0; i < list.nr; i++) {
+	for (count_t i = 0; i < list.entry_nr; i++) {
 		struct stat st;
 		int pos;
 		const struct cache_entry *ce;
@@ -268,7 +267,7 @@ int cmd_rm(int argc,
 	   struct repository *repo UNUSED)
 {
 	struct lock_file lock_file = LOCK_INIT;
-	int i, ret = 0;
+	int ret = 0;
 	struct pathspec pathspec;
 	char *seen;
 
@@ -321,10 +320,10 @@ int cmd_rm(int argc,
 			continue;
 		if (!ce_path_match(the_repository->index, ce, &pathspec, seen))
 			continue;
-		ALLOC_GROW(list.entry, list.nr + 1, list.alloc);
-		list.entry[list.nr].name = xstrdup(ce->name);
-		list.entry[list.nr].is_submodule = S_ISGITLINK(ce->ce_mode);
-		if (list.entry[list.nr++].is_submodule &&
+		ALLOC_GROW(list.entry, list.entry_nr + 1, list.entry_alloc);
+		list.entry[list.entry_nr].name = xstrdup(ce->name);
+		list.entry[list.entry_nr].is_submodule = S_ISGITLINK(ce->ce_mode);
+		if (list.entry[list.entry_nr++].is_submodule &&
 		    !is_staging_gitmodules_ok(the_repository->index))
 			die(_("please stage your changes to .gitmodules or stash them to proceed"));
 	}
@@ -335,7 +334,7 @@ int cmd_rm(int argc,
 		char *skip_worktree_seen = NULL;
 		struct string_list only_match_skip_worktree = STRING_LIST_INIT_NODUP;
 
-		for (i = 0; i < pathspec.nr; i++) {
+		for (int i = 0; i < pathspec.nr; i++) {
 			original = pathspec.items[i].original;
 			if (seen[i])
 				seen_any = 1;
@@ -390,7 +389,7 @@ int cmd_rm(int argc,
 	 * First remove the names from the index: we won't commit
 	 * the index unless all of them succeed.
 	 */
-	for (i = 0; i < list.nr; i++) {
+	for (count_t i = 0; i < list.entry_nr; i++) {
 		const char *path = list.entry[i].name;
 		if (!quiet)
 			printf("rm '%s'\n", path);
@@ -414,7 +413,7 @@ int cmd_rm(int argc,
 		int removed = 0, gitmodules_modified = 0;
 		struct strbuf buf = STRBUF_INIT;
 		int flag = force ? REMOVE_DIR_PURGE_ORIGINAL_CWD : 0;
-		for (i = 0; i < list.nr; i++) {
+		for (count_t i = 0; i < list.entry_nr; i++) {
 			const char *path = list.entry[i].name;
 			if (list.entry[i].is_submodule) {
 				strbuf_reset(&buf);
diff --git a/git-compat-util.h b/git-compat-util.h
index 9408f463e3..e9c30d59e8 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -610,6 +610,21 @@ static inline bool strip_suffix(const char *str, const char *suffix,
 int git_open_cloexec(const char *name, int flags);
 #define git_open(name) git_open_cloexec(name, O_RDONLY)
 
+/*
+ * The type used to count the number of entities, e.g. in an array. We have
+ * historically used `size_t` for this, but `size_t` is expected to count the
+ * maximum number of _bytes_, not entities.
+ *
+ * The counter is unsigned. If you need to store sentinel values like `-1` you
+ * should use a different type.
+ *
+ * Note that we pick `uintptr_t` because in the theoretical worst case, every
+ * entity is a single byte and we populate the entire address space with them.
+ * As `uintptr_t` is able to point to every addressable byte it would also be
+ * able to count them all.
+ */
+typedef uintptr_t count_t;
+
 static inline size_t st_add(size_t a, size_t b)
 {
 	if (unsigned_add_overflows(a, b))

---
base-commit: 64cbe5e2e8a7b0f92c780b210e602496bd5cad0f
change-id: 20250807-pks-introduce-count-t-0f4499f80221


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] git-compat-util: introduce `count_t` typedef
  2025-08-07  9:22 [PATCH] git-compat-util: introduce `count_t` typedef Patrick Steinhardt
@ 2025-08-07 11:00 ` Matthias Aßhauer
  2025-08-07 14:17 ` Phillip Wood
  2025-08-07 16:38 ` Junio C Hamano
  2 siblings, 0 replies; 6+ messages in thread
From: Matthias Aßhauer @ 2025-08-07 11:00 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Oswald Buddenhagen, Junio C Hamano, Taylor Blau, Jeff King



On Thu, 7 Aug 2025, Patrick Steinhardt wrote:

> Historically, Git has been very lenient with its use of integer types
> and didn't really give much thought into which type to use in what
> situation. We interchangeably mix and match signed and unsigned types
> and often times blindly convert them. This use has led to several
> out-of-bounds reads and writes in the past, some of which could be
> turned into arbitrary code execution.
>
> As a counter measure we have eventually enabled "-Wsign-compare"
> warnings. Most of our code base generates heaps of warnings, which is
> why we have a macro `DISABLE_SIGN_COMPARE_WARNINGS` defined for every
> such file. The expectation is that slowly but surely we'll convert our
> code base to have better hygiene around signedness, and new code that is
> being added handles types correctly from the start.
>
> There are regular discussions around whether or not these warnings are
> sensible to have in the first place. My (biased) opinion with having
> fixed several out-of-bounds reads and writes is that they are senisble,
> as they would have provided warnings around code sites that had those
> issues. And arguably, we still have _lots_ of sites that are susceptible
> to using the wrong type, and more likely than not some of those will be
> exploitable.
>
> Furthermore, I would claim that the question of whether or not those
> warnings are helpful wouldn't have come up if we had the warnings
> enabled from the inception of Git. The churn caused by the fixes for
> such warnings is real, and they need to be done with a lot of care. But
> since we have removed this project from our microprojects page we don't
> see "random" contributions in this area anymore.
>
> So overall, the conversions are on the painful side, but in the long
> term they will help us to protect against introducing new exploits.
>
> A discussion that regularly comes up in this context though is what
> types to use for counting entities:
>
>  - One question is whether the type should be signed or unsigned.
>    Arguably, the answer should be to use unsigned types as long as we
>    know that we never need a negative value, e.g. as a sentinel. This
>    helps guide the reader and explicitly conveys the sense that such a
>    counter is only ever going to be a non-negative number. Otherwise,
>    code would need to be more careful as it may hold negative values.
>
>  - Another question is what type to use. In lots of situations we have
>    used `size_t`, but this is conflating semantics. `size_t` is used to
>    count bytes, not entities.
>
> Introduce a new typedef for `count_t` that is of type `uintptr_t` to
> give clear guidance what type to use for counting entities. This type
> was chosen because in the worst case, an entity may be a single byte and
> we fill all of our memory with these entities. As `uintptr_t` is
> guaranteed to hold at least the value of a pointer, we know that it
> could be used to index into every single such entity.
>
> Amend the coding guidelines to state when to use `size_t` and when to
> use `count_t`. Convert an example file to use the new type.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> Documentation/CodingGuidelines |  3 +++
> builtin/rm.c                   | 25 ++++++++++++-------------
> git-compat-util.h              | 15 +++++++++++++++
> 3 files changed, 30 insertions(+), 13 deletions(-)
>
> diff --git a/Documentation/CodingGuidelines b/Documentation/CodingGuidelines
> index 224f0978a8..2e9f3c07ff 100644
> --- a/Documentation/CodingGuidelines
> +++ b/Documentation/CodingGuidelines
> @@ -238,6 +238,9 @@ For shell scripts specifically (not exhaustive):
>
> For C programs:
>
> + - We use `size_t` to count the number of bytes and `count_t` to count the
> +   number of entities of a given type.
> +
>  - We use tabs to indent, and interpret tabs as taking up to
>    8 spaces.
>
> diff --git a/builtin/rm.c b/builtin/rm.c
> index 05d89e98c3..99b845cf34 100644
> --- a/builtin/rm.c
> +++ b/builtin/rm.c
> @@ -33,11 +33,11 @@ static const char * const builtin_rm_usage[] = {
> };
>
> static struct {
> -	int nr, alloc;
> 	struct {
> 		const char *name;
> 		char is_submodule;
> 	} *entry;
> +	count_t entry_nr, entry_alloc;
> } list;
>
> static int get_ours_cache_pos(const char *path, unsigned int pos)
> @@ -73,8 +73,7 @@ static void print_error_files(struct string_list *files_list,
>
> static void submodules_absorb_gitdir_if_needed(void)
> {
> -	int i;
> -	for (i = 0; i < list.nr; i++) {
> +	for (count_t i = 0; i < list.entry_nr; i++) {
> 		const char *name = list.entry[i].name;
> 		int pos;
> 		const struct cache_entry *ce;
> @@ -106,14 +105,14 @@ static int check_local_mod(struct object_id *head, int index_only)
> 	 * lazy, and who cares if removal of files is a tad
> 	 * slower than the theoretical maximum speed?
> 	 */
> -	int i, no_head;
> +	int no_head;
> 	int errs = 0;
> 	struct string_list files_staged = STRING_LIST_INIT_NODUP;
> 	struct string_list files_cached = STRING_LIST_INIT_NODUP;
> 	struct string_list files_local = STRING_LIST_INIT_NODUP;
>
> 	no_head = is_null_oid(head);
> -	for (i = 0; i < list.nr; i++) {
> +	for (count_t i = 0; i < list.entry_nr; i++) {
> 		struct stat st;
> 		int pos;
> 		const struct cache_entry *ce;
> @@ -268,7 +267,7 @@ int cmd_rm(int argc,
> 	   struct repository *repo UNUSED)
> {
> 	struct lock_file lock_file = LOCK_INIT;
> -	int i, ret = 0;
> +	int ret = 0;
> 	struct pathspec pathspec;
> 	char *seen;
>
> @@ -321,10 +320,10 @@ int cmd_rm(int argc,
> 			continue;
> 		if (!ce_path_match(the_repository->index, ce, &pathspec, seen))
> 			continue;
> -		ALLOC_GROW(list.entry, list.nr + 1, list.alloc);
> -		list.entry[list.nr].name = xstrdup(ce->name);
> -		list.entry[list.nr].is_submodule = S_ISGITLINK(ce->ce_mode);
> -		if (list.entry[list.nr++].is_submodule &&
> +		ALLOC_GROW(list.entry, list.entry_nr + 1, list.entry_alloc);
> +		list.entry[list.entry_nr].name = xstrdup(ce->name);
> +		list.entry[list.entry_nr].is_submodule = S_ISGITLINK(ce->ce_mode);
> +		if (list.entry[list.entry_nr++].is_submodule &&
> 		    !is_staging_gitmodules_ok(the_repository->index))
> 			die(_("please stage your changes to .gitmodules or stash them to proceed"));
> 	}

This hunk doesn't deal with count_t at all. Should we split the renaming 
of nr and alloc into a separate patch?
> @@ -335,7 +334,7 @@ int cmd_rm(int argc,
> 		char *skip_worktree_seen = NULL;
> 		struct string_list only_match_skip_worktree = STRING_LIST_INIT_NODUP;
>
> -		for (i = 0; i < pathspec.nr; i++) {
> +		for (int i = 0; i < pathspec.nr; i++) {

Is this i intentionally still an int?

> 			original = pathspec.items[i].original;
> 			if (seen[i])
> 				seen_any = 1;
> @@ -390,7 +389,7 @@ int cmd_rm(int argc,
> 	 * First remove the names from the index: we won't commit
> 	 * the index unless all of them succeed.
> 	 */
> -	for (i = 0; i < list.nr; i++) {
> +	for (count_t i = 0; i < list.entry_nr; i++) {
> 		const char *path = list.entry[i].name;
> 		if (!quiet)
> 			printf("rm '%s'\n", path);
> @@ -414,7 +413,7 @@ int cmd_rm(int argc,
> 		int removed = 0, gitmodules_modified = 0;
> 		struct strbuf buf = STRBUF_INIT;
> 		int flag = force ? REMOVE_DIR_PURGE_ORIGINAL_CWD : 0;
> -		for (i = 0; i < list.nr; i++) {
> +		for (count_t i = 0; i < list.entry_nr; i++) {
> 			const char *path = list.entry[i].name;
> 			if (list.entry[i].is_submodule) {
> 				strbuf_reset(&buf);
> diff --git a/git-compat-util.h b/git-compat-util.h
> index 9408f463e3..e9c30d59e8 100644
> --- a/git-compat-util.h
> +++ b/git-compat-util.h
> @@ -610,6 +610,21 @@ static inline bool strip_suffix(const char *str, const char *suffix,
> int git_open_cloexec(const char *name, int flags);
> #define git_open(name) git_open_cloexec(name, O_RDONLY)
>
> +/*
> + * The type used to count the number of entities, e.g. in an array. We have
> + * historically used `size_t` for this, but `size_t` is expected to count the
> + * maximum number of _bytes_, not entities.
> + *
> + * The counter is unsigned. If you need to store sentinel values like `-1` you
> + * should use a different type.

Do we want to make a recommendation of a "different type" here to keep 
things consistent?
> + *
> + * Note that we pick `uintptr_t` because in the theoretical worst case, every
> + * entity is a single byte and we populate the entire address space with them.
> + * As `uintptr_t` is able to point to every addressable byte it would also be
> + * able to count them all.
> + */
> +typedef uintptr_t count_t;
> +
> static inline size_t st_add(size_t a, size_t b)
> {
> 	if (unsigned_add_overflows(a, b))
>
> ---
> base-commit: 64cbe5e2e8a7b0f92c780b210e602496bd5cad0f
> change-id: 20250807-pks-introduce-count-t-0f4499f80221
>
>

Best regards

Matthias

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] git-compat-util: introduce `count_t` typedef
  2025-08-07  9:22 [PATCH] git-compat-util: introduce `count_t` typedef Patrick Steinhardt
  2025-08-07 11:00 ` Matthias Aßhauer
@ 2025-08-07 14:17 ` Phillip Wood
  2025-08-07 16:43   ` Junio C Hamano
  2025-08-07 16:38 ` Junio C Hamano
  2 siblings, 1 reply; 6+ messages in thread
From: Phillip Wood @ 2025-08-07 14:17 UTC (permalink / raw)
  To: Patrick Steinhardt, git
  Cc: Oswald Buddenhagen, Junio C Hamano, Taylor Blau, Jeff King

Hi Patrick

On 07/08/2025 10:22, Patrick Steinhardt wrote:
> Historically, Git has been very lenient with its use of integer types
> and didn't really give much thought into which type to use in what
> situation. We interchangeably mix and match signed and unsigned types
> and often times blindly convert them. This use has led to several
> out-of-bounds reads and writes in the past, some of which could be
> turned into arbitrary code execution.

My feeling is that one of the main problems has been using different 
types for loop indexes and loop limits. If we mandated that the loop 
index had to be the same type as the limit that would improve things 
considerably and without mandating a particular type.

> A discussion that regularly comes up in this context though is what
> types to use for counting entities:
> 
>    - One question is whether the type should be signed or unsigned.
>      Arguably, the answer should be to use unsigned types as long as we
>      know that we never need a negative value, e.g. as a sentinel. This
>      helps guide the reader and explicitly conveys the sense that such a
>      counter is only ever going to be a non-negative number. Otherwise,
>      code would need to be more careful as it may hold negative values.

The counter argument to this is that it is easy to write incorrect loops 
when counting down if the loop variable is unsigned. Using a typedef 
that hides the actual type makes that harder to spot as it is not 
immediately obvious whether the loop index is signed or not. As we have 
cases that do need to store a negative value then we're still left with 
using a mix of signed and unsigned types for counting in our code base.

> Introduce a new typedef for `count_t` that is of type `uintptr_t` to
> give clear guidance what type to use for counting entities. This type
> was chosen because in the worst case, an entity may be a single byte and
> we fill all of our memory with these entities. As `uintptr_t` is
> guaranteed to hold at least the value of a pointer, we know that it
> could be used to index into every single such entity.

How many sites actually allocate anything like that number of 
entities?Generally we use ALLOC_GROW() or ALLOC_GROW_BY() which means 
that we're not normally counting bytes. ALLOC_GROW_BY() assumes the 
number of entities fits into a size_t so should be be changing that to 
use count_t? If we're worried about overflows then maybe we should look 
at alloc_nr() which calculates the new allocation with

     (nr + 16) * 3 / 2

which which will start overflowing long before we starting allocating 
UINTPTR_MAX single byte entities.

Thanks

Phillip

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] git-compat-util: introduce `count_t` typedef
  2025-08-07  9:22 [PATCH] git-compat-util: introduce `count_t` typedef Patrick Steinhardt
  2025-08-07 11:00 ` Matthias Aßhauer
  2025-08-07 14:17 ` Phillip Wood
@ 2025-08-07 16:38 ` Junio C Hamano
  2025-08-07 22:07   ` Taylor Blau
  2 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2025-08-07 16:38 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Oswald Buddenhagen, Taylor Blau, Jeff King

Patrick Steinhardt <ps@pks.im> writes:

>  For C programs:
>  
> + - We use `size_t` to count the number of bytes and `count_t` to count the
> +   number of entities of a given type.

I am not interested in this specific implementation at all for a
number of reasons, but I am excited to see people thinking about the
issues.  The following is a random list of things, both positive and
negative, that came to my mind after skimming the changes.

 * We do not want to pretend that one size fits all.  If it were a
   good idea for developers to express "This variable is a simple
   counter that counts up from 0 and never goes negative" by using
   an unsigned type (which is dubious), it should be equally, or not
   more, a good idea to allow them to say "We will not have more
   than 256 fan-out directories under .git/objects/ and this is a
   counter to count them, so I know 'unsigned short' is big enough
   on any platforms".

 * As far as I can tell, the patch does not seem to address the
   biggest concern of unsigned integer wraparound.  We often see

	ALLOC_GROW(thing.entry, thing.nr + 1, thing.alloc);

   with the arithmetic "thing.nr + 1" checked by nobody.
   ALLOC_GROW_BY() is slightly better in this regard, but nobody
   uses it with only small exceptions.  And of course, alloc_nr()
   does even riskier arithmetic that is unchecked.

 * Standardising the names used for <item[], item_nr, item_alloc>
   somehow is very much welcome (we can see an example in the change
   to builtin/rm.c below).  Such a naming convention would allow us
   to write

	#define ALLOC_INCR(thing) ALLOC_INCR_BY(thing, 1)
	ALLOC_INCR_BY(thing, increment)

   that do ALLOC_GROW(thing, thing_nr + increment, thing_alloc) more
   safely than what the current code does, perhaps?  Also, we should
   be able to use any unsigned integral type and perform sensible
   bound checking with typeof().

 * The codebase avoids inventing a new type with typedef, with the
   exception of callback function type, following old tradition we
   inherited from the Linux kernel project.  And even when we create
   a new type, of course, we do not want to give it a name that ends
   with "_t".

> diff --git a/builtin/rm.c b/builtin/rm.c
> index 05d89e98c3..99b845cf34 100644
> --- a/builtin/rm.c
> +++ b/builtin/rm.c
> @@ -33,11 +33,11 @@ static const char * const builtin_rm_usage[] = {
>  };
>  
>  static struct {
> -	int nr, alloc;
>  	struct {
>  		const char *name;
>  		char is_submodule;
>  	} *entry;
> +	count_t entry_nr, entry_alloc;
>  } list;

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] git-compat-util: introduce `count_t` typedef
  2025-08-07 14:17 ` Phillip Wood
@ 2025-08-07 16:43   ` Junio C Hamano
  0 siblings, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2025-08-07 16:43 UTC (permalink / raw)
  To: Phillip Wood
  Cc: Patrick Steinhardt, git, Oswald Buddenhagen, Taylor Blau,
	Jeff King

Phillip Wood <phillip.wood123@gmail.com> writes:

> Hi Patrick
>
> On 07/08/2025 10:22, Patrick Steinhardt wrote:
>> Historically, Git has been very lenient with its use of integer types
>> and didn't really give much thought into which type to use in what
>> situation. We interchangeably mix and match signed and unsigned types
>> and often times blindly convert them. This use has led to several
>> out-of-bounds reads and writes in the past, some of which could be
>> turned into arbitrary code execution.
>
> My feeling is that one of the main problems has been using different
> types for loop indexes and loop limits. If we mandated that the loop
> index had to be the same type as the limit that would improve things
> considerably and without mandating a particular type.

Yup.  And the limit being unsigned would force the counter to be
also unsigned, which can introduce buggy constructs (like counting
down).

>> A discussion that regularly comes up in this context though is what
>> types to use for counting entities:
>>    - One question is whether the type should be signed or unsigned.
>>      Arguably, the answer should be to use unsigned types as long as we
>>      know that we never need a negative value, e.g. as a sentinel. This
>>      helps guide the reader and explicitly conveys the sense that such a
>>      counter is only ever going to be a non-negative number. Otherwise,
>>      code would need to be more careful as it may hold negative values.
>
> The counter argument to this is that it is easy to write incorrect
> loops when counting down if the loop variable is unsigned. Using a
> typedef that hides the actual type makes that harder to spot as it is
> not immediately obvious whether the loop index is signed or not.

This is very valid argument against typedef for something trivial
like an integer.  Use of proposed count_t loses both size and
signedness information.

Thanks.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] git-compat-util: introduce `count_t` typedef
  2025-08-07 16:38 ` Junio C Hamano
@ 2025-08-07 22:07   ` Taylor Blau
  0 siblings, 0 replies; 6+ messages in thread
From: Taylor Blau @ 2025-08-07 22:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Patrick Steinhardt, git, Oswald Buddenhagen, Jeff King

On Thu, Aug 07, 2025 at 09:38:46AM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
> >  For C programs:
> >
> > + - We use `size_t` to count the number of bytes and `count_t` to count the
> > +   number of entities of a given type.
>
> I am not interested in this specific implementation at all for a
> number of reasons, but I am excited to see people thinking about the
> issues.  The following is a random list of things, both positive and
> negative, that came to my mind after skimming the changes.
>
>  * We do not want to pretend that one size fits all.  If it were a
>    good idea for developers to express "This variable is a simple
>    counter that counts up from 0 and never goes negative" by using
>    an unsigned type (which is dubious), it should be equally, or not
>    more, a good idea to allow them to say "We will not have more
>    than 256 fan-out directories under .git/objects/ and this is a
>    counter to count them, so I know 'unsigned short' is big enough
>    on any platforms".

This to me is the most compelling argument against a "count_t" typedef
or something similar. Different callers have different needs (the ones
you pointed out above are the ones that I thought of as most relevant),
and we shouldn't force them to all use the same type, or pretend that
one type is best for all of them.

>  * As far as I can tell, the patch does not seem to address the
>    biggest concern of unsigned integer wraparound.  We often see
>
> 	ALLOC_GROW(thing.entry, thing.nr + 1, thing.alloc);
>
>    with the arithmetic "thing.nr + 1" checked by nobody.
>    ALLOC_GROW_BY() is slightly better in this regard, but nobody
>    uses it with only small exceptions.  And of course, alloc_nr()
>    does even riskier arithmetic that is unchecked.

I wonder if we should push more people towards ALLOC_GROW_BY() for that
reason. We could do something like recommend that callers use
ALLOC_GROW_BY() instead of ALLOC_GROW() in cases like:

    @@
    expression array, nr, n, alloc;
    @@
    - ALLOC_GROW(array, nr + n, alloc)
    + ALLOC_GROW_BY(array, nr, n, alloc)

, but I'm not sure that's a good idea as a blanket rule, since it's
changing the behavior away from using alloc_nr() to instead grow by a
fixed amount.

We have definitely talked before about adding overflow checks to
alloc_nr() before, but I think the slow-down made it a non-starter
(IIRC). I wonder if something like this:

    diff --git a/git-compat-util.h b/git-compat-util.h
    index 9408f463e31..22b8701b40d 100644
    --- a/git-compat-util.h
    +++ b/git-compat-util.h
    @@ -852,11 +852,14 @@ static inline void move_array(void *dst, const void *src, size_t n, size_t size)
      */
     #define ALLOC_GROW(x, nr, alloc) \
      do { \
    +		size_t __alloc__ = alloc; \
        if ((nr) > alloc) { \
          if (alloc_nr(alloc) < (nr)) \
            alloc = (nr); \
          else \
            alloc = alloc_nr(alloc); \
    +			if (alloc < __alloc__) \
    +				BUG("negative growth in ALLOC_GROW"); \
          REALLOC_ARRAY(x, alloc); \
        } \
      } while (0)

would be a reasonable compromise? It's not quite as careful as checking
each step of the computation done by alloc_nr(), but it's better than
not checking at all.

So perhaps we should do some combination of the two ;-).

>  * Standardising the names used for <item[], item_nr, item_alloc>
>    somehow is very much welcome (we can see an example in the change
>    to builtin/rm.c below).  Such a naming convention would allow us
>    to write
>
> 	#define ALLOC_INCR(thing) ALLOC_INCR_BY(thing, 1)
> 	ALLOC_INCR_BY(thing, increment)
>
>    that do ALLOC_GROW(thing, thing_nr + increment, thing_alloc) more
>    safely than what the current code does, perhaps?  Also, we should
>    be able to use any unsigned integral type and perform sensible
>    bound checking with typeof().

...meaning that ALLOC_INCR() and ALLOC_INCR_BY() would use thing##_nr? I
do like the idea of standardizing on that naming scheme, but the
thing##_nr approach is a bit magical for my taste.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-08-07 22:07 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-07  9:22 [PATCH] git-compat-util: introduce `count_t` typedef Patrick Steinhardt
2025-08-07 11:00 ` Matthias Aßhauer
2025-08-07 14:17 ` Phillip Wood
2025-08-07 16:43   ` Junio C Hamano
2025-08-07 16:38 ` Junio C Hamano
2025-08-07 22:07   ` Taylor Blau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).