Git development
 help / color / mirror / Atom feed
* [PATCH] object-file: don't use object database without a repository
@ 2026-04-04 17:28 Luca Stefani
  2026-04-05  6:03 ` Pushkar Singh
  2026-04-05  6:46 ` Jeff King
  0 siblings, 2 replies; 10+ messages in thread
From: Luca Stefani @ 2026-04-04 17:28 UTC (permalink / raw)
  To: git, cat; +Cc: Luca Stefani

When running `git diff -- $file1 $file2' on large enough files,
index_fd() attempts to use 'the_repository->objects', assuming it
is initialized, but that's not the case for non-repository usecases.

When git diff is invoked without a backing repository,
INDEX_WRITE_OBJECT is never set in flags, meaning only the hash is
needed and nothing should be written to the object store.

Enforce the use of index_core() in this case.

Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
---
 object-file.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/object-file.c b/object-file.c
index f0b029ff0b..68303aa99c 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1654,7 +1654,8 @@ int index_fd(struct index_state *istate, struct object_id *oid,
 	} else if ((st->st_size >= 0 &&
 		    (size_t)st->st_size <= repo_settings_get_big_file_threshold(istate->repo)) ||
 		   type != OBJ_BLOB ||
-		   (path && would_convert_to_git(istate, path))) {
+		   (path && would_convert_to_git(istate, path)) ||
+		   !(flags & INDEX_WRITE_OBJECT)) {
 		ret = index_core(istate, oid, fd, xsize_t(st->st_size),
 				 type, path, flags);
 	} else {
-- 
2.54.0.rc0.dirty


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] object-file: don't use object database without a repository
  2026-04-04 17:28 [PATCH] object-file: don't use object database without a repository Luca Stefani
@ 2026-04-05  6:03 ` Pushkar Singh
  2026-04-05  6:46 ` Jeff King
  1 sibling, 0 replies; 10+ messages in thread
From: Pushkar Singh @ 2026-04-05  6:03 UTC (permalink / raw)
  To: Luca Stefani; +Cc: git, cat

Hi Luca,

Thanks for the patch, this was interesting to read.

[snip]
> When git diff is invoked without a backing repository,
> INDEX_WRITE_OBJECT is never set in flags, meaning only the hash is
> needed and nothing should be written to the object store.

From my understanding, this avoids using the object database in
non-repository scenarios by forcing the use of index_core() when
INDEX_WRITE_OBJECT is not set, which makes sense since we only
need the hash in that case.

I had a small question regarding coverage:

- Do we already have tests for cases like:
  git diff -- <file1> <file2> outside a repository,
  especially with large files triggering this path?

It might be useful to add one to ensure this behavior is
preserved.

Also, are there any other callers of index_fd() that might
rely on similar assumptions about repository initialization?

Thanks,
Pushkar

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] object-file: don't use object database without a repository
  2026-04-04 17:28 [PATCH] object-file: don't use object database without a repository Luca Stefani
  2026-04-05  6:03 ` Pushkar Singh
@ 2026-04-05  6:46 ` Jeff King
  2026-04-05 16:10   ` Luca Stefani
  1 sibling, 1 reply; 10+ messages in thread
From: Jeff King @ 2026-04-05  6:46 UTC (permalink / raw)
  To: Luca Stefani; +Cc: git, cat

On Sat, Apr 04, 2026 at 07:28:17PM +0200, Luca Stefani wrote:

> When running `git diff -- $file1 $file2' on large enough files,
> index_fd() attempts to use 'the_repository->objects', assuming it
> is initialized, but that's not the case for non-repository usecases.
> 
> When git diff is invoked without a backing repository,
> INDEX_WRITE_OBJECT is never set in flags, meaning only the hash is
> needed and nothing should be written to the object store.
> 
> Enforce the use of index_core() in this case.

I don't think we want to use index_core() for a large file, though. A
test like this:

diff --git a/t/t4053-diff-no-index.sh b/t/t4053-diff-no-index.sh
index 15076dfe0d..7ef5604430 100755
--- a/t/t4053-diff-no-index.sh
+++ b/t/t4053-diff-no-index.sh
@@ -413,4 +413,10 @@ test_expect_success 'diff --no-index with pathspec glob and exclude' '
 	test_cmp expect actual
 '
 
+test_expect_success 'diff --no-index on a huge file' '
+	dd if=/dev/zero bs=1M count=4000 >big.file &&
+	echo whatever >small.file &&
+	test_expect_code 1 git diff --no-index big.file small.file
+'
+
 test_done


will now fail on a 32-bit system, because we try to mmap the whole file,
which will fail.  We really do want to follow the streaming code path
(which knows to respect the lack of a WRITE_OBJECT flag and works
without an odb in that case).

It's kind of an expensive test, though, so we probably don't want to
actually include it in the test suite.

-Peff

PS I'd expect a 4GB+ file to work, too, but it looks like the diff code
   barfs when trying to stuff the file into a diff_filespec. A simpler
   example is:

     dd if=/dev/zero bs=1G count=5 >big.file
     git hash-object big.file

   but that dies, too! It looks like the streaming helper uses a size_t
   to take the size, which is wrong. It really should be an off_t. So I
   dunno, maybe nobody cares about ever working with 4GB files on 32-bit
   systems these days. It still feels like we should avoid a large mmap,
   though.

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] object-file: don't use object database without a repository
  2026-04-05  6:46 ` Jeff King
@ 2026-04-05 16:10   ` Luca Stefani
  2026-04-05 19:17     ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: Luca Stefani @ 2026-04-05 16:10 UTC (permalink / raw)
  To: Jeff King; +Cc: git, cat


On 05/04/2026 08:46, Jeff King wrote:
> On Sat, Apr 04, 2026 at 07:28:17PM +0200, Luca Stefani wrote:
>
>> When running `git diff -- $file1 $file2' on large enough files,
>> index_fd() attempts to use 'the_repository->objects', assuming it
>> is initialized, but that's not the case for non-repository usecases.
>>
>> When git diff is invoked without a backing repository,
>> INDEX_WRITE_OBJECT is never set in flags, meaning only the hash is
>> needed and nothing should be written to the object store.
>>
>> Enforce the use of index_core() in this case.
> I don't think we want to use index_core() for a large file, though. A
> test like this:

I don't know what would be the right approach, index_core sure is slow, 
but maybe that's expected for those sizes.

This fix by itself simply avoids entering into the broken case, and it 
still gives me a working diff.

> diff --git a/t/t4053-diff-no-index.sh b/t/t4053-diff-no-index.sh
> index 15076dfe0d..7ef5604430 100755
> --- a/t/t4053-diff-no-index.sh
> +++ b/t/t4053-diff-no-index.sh
> @@ -413,4 +413,10 @@ test_expect_success 'diff --no-index with pathspec glob and exclude' '
>   	test_cmp expect actual
>   '
>   
> +test_expect_success 'diff --no-index on a huge file' '
> +	dd if=/dev/zero bs=1M count=4000 >big.file &&
> +	echo whatever >small.file &&
> +	test_expect_code 1 git diff --no-index big.file small.file
> +'
> +
>   test_done

If you want  I can send a V2 with that, but given it's your test suit 
I'd rather you handle it.

Especially when it comes to multi-arch, as I only really care about amd64

>
> will now fail on a 32-bit system, because we try to mmap the whole file,
> which will fail.  We really do want to follow the streaming code path
> (which knows to respect the lack of a WRITE_OBJECT flag and works
> without an odb in that case).
>
> It's kind of an expensive test, though, so we probably don't want to
> actually include it in the test suite.
>
> -Peff
>
> PS I'd expect a 4GB+ file to work, too, but it looks like the diff code
>     barfs when trying to stuff the file into a diff_filespec. A simpler
>     example is:
>
>       dd if=/dev/zero bs=1G count=5 >big.file
>       git hash-object big.file
>
>     but that dies, too! It looks like the streaming helper uses a size_t
>     to take the size, which is wrong. It really should be an off_t. So I
>     dunno, maybe nobody cares about ever working with 4GB files on 32-bit
>     systems these days. It still feels like we should avoid a large mmap,
>     though.

Ah I I just happened to have a 3G file and I threw it at 'git diff' :)

With `#define _FILE_OFFSET_BITS 64` off_t is properly sized, but if 
other places downcast it then there's little hope on 32bit.

Now even with that in mind not sure if it's worth fixing...

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] object-file: don't use object database without a repository
  2026-04-05 16:10   ` Luca Stefani
@ 2026-04-05 19:17     ` Jeff King
  2026-04-06 18:17       ` Justin Tobler
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2026-04-05 19:17 UTC (permalink / raw)
  To: Luca Stefani; +Cc: Justin Tobler, git, cat

On Sun, Apr 05, 2026 at 06:10:33PM +0200, Luca Stefani wrote:

> > > Enforce the use of index_core() in this case.
> > I don't think we want to use index_core() for a large file, though. A
> > test like this:
> 
> I don't know what would be the right approach, index_core sure is slow, but
> maybe that's expected for those sizes.

It's always going to be slow because we're hashing a lot of data. But
the point is that we should be streaming it, and trying to allocate a
huge buffer (even via mmap). So it's not that it's slow, it's that some
cases which used to work will not do so any longer.

We want to keep going into the streaming code path, and not
index_core(). But the streaming code path has been broken outside of a
repo, by ce1661f9da (odb: add transaction interface, 2025-09-16) and its
follow-on patches. And we should fix that instead of avoiding it.

> This fix by itself simply avoids entering into the broken case, and it
> still gives me a working diff.

For some files, yes, but it breaks other cases (like the one I
demonstrated).

> > diff --git a/t/t4053-diff-no-index.sh b/t/t4053-diff-no-index.sh
> > index 15076dfe0d..7ef5604430 100755
> > --- a/t/t4053-diff-no-index.sh
> > +++ b/t/t4053-diff-no-index.sh
> > @@ -413,4 +413,10 @@ test_expect_success 'diff --no-index with pathspec glob and exclude' '
> >   	test_cmp expect actual
> >   '
> > +test_expect_success 'diff --no-index on a huge file' '
> > +	dd if=/dev/zero bs=1M count=4000 >big.file &&
> > +	echo whatever >small.file &&
> > +	test_expect_code 1 git diff --no-index big.file small.file
> > +'
> > +
> >   test_done
> 
> If you want  I can send a V2 with that, but given it's your test suit I'd
> rather you handle it.

If you sent a v2 with this, the tests would not pass. ;)

But like I said, I don't think we want that in the test suite because it
is too expensive to run. What we probably do want is a cheap
demonstration of the segfault, which is this:

diff --git a/t/t1517-outside-repo.sh b/t/t1517-outside-repo.sh
index c824c1a25c..31965908a6 100755
--- a/t/t1517-outside-repo.sh
+++ b/t/t1517-outside-repo.sh
@@ -149,4 +149,9 @@ test_expect_success 'fmt-merge-msg does not crash with -h' '
 	test_grep "[Uu]sage: git fmt-merge-msg " usage
 '
 
+test_expect_success 'indexing large file outside repo' '
+	nongit dd if=/dev/zero of=big.file bs=10k count=1 &&
+	nongit git -c core.bigfilethreshold=5k hash-object big.file
+'
+
 test_done

But I think the actual code change in your patch is the wrong thing, so
I also don't think we'd want to just squash that test in. I'm hoping
Justin has some insights on how to do a more complete fix.

-Peff

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] object-file: don't use object database without a repository
  2026-04-05 19:17     ` Jeff King
@ 2026-04-06 18:17       ` Justin Tobler
  2026-04-06 19:31         ` Luca Stefani
  2026-04-06 20:06         ` Jeff King
  0 siblings, 2 replies; 10+ messages in thread
From: Justin Tobler @ 2026-04-06 18:17 UTC (permalink / raw)
  To: Jeff King; +Cc: Luca Stefani, git, cat

On 26/04/05 03:17PM, Jeff King wrote:
> But I think the actual code change in your patch is the wrong thing, so
> I also don't think we'd want to just squash that test in. I'm hoping
> Justin has some insights on how to do a more complete fix.

I agree with Peff here that the correct fix should continue to use the
object streaming mechanisms. To avoid this segfault, we really should
avoid using ODB transactions when there isn't an ODB in the first place.

I replied in another thread[1] with how we could go about fixing. To
summarize, it just so happens that I already have a patch[2] out on the
list that appears to resolve this issue.

For the use case here, git-diff(1) is only interested in generating the
hash for the "large" blobs and not actually writing anything to the ODB.
This patch introduces a separate "hash-only" variant of
`index_blob_packfile_transaction()` and is used to bypass creating an
ODB transaction when object writes are not needed.

If this is the route we want to go down, I can extract this patch from
the current series and send it as a separate fix. :)

-Justin

[1]: https://lore.kernel.org/git/adPjXKGIT5O7SK6E@denethor/T/#m9cee420941b66abfb0244ea4b7762ba8d0ff7b52
[2]: https://lore.kernel.org/git/20260402213220.2651523-5-jltobler@gmail.com/

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] object-file: don't use object database without a repository
  2026-04-06 18:17       ` Justin Tobler
@ 2026-04-06 19:31         ` Luca Stefani
  2026-04-06 20:31           ` Justin Tobler
  2026-04-06 20:06         ` Jeff King
  1 sibling, 1 reply; 10+ messages in thread
From: Luca Stefani @ 2026-04-06 19:31 UTC (permalink / raw)
  To: Justin Tobler, Jeff King; +Cc: git, cat


On 06/04/2026 20:17, Justin Tobler wrote:
> On 26/04/05 03:17PM, Jeff King wrote:
>> But I think the actual code change in your patch is the wrong thing, so
>> I also don't think we'd want to just squash that test in. I'm hoping
>> Justin has some insights on how to do a more complete fix.
> I agree with Peff here that the correct fix should continue to use the
> object streaming mechanisms. To avoid this segfault, we really should
> avoid using ODB transactions when there isn't an ODB in the first place.
>
> I replied in another thread[1] with how we could go about fixing. To
> summarize, it just so happens that I already have a patch[2] out on the
> list that appears to resolve this issue.

Thanks, just verified it works as expected.

>
> For the use case here, git-diff(1) is only interested in generating the
> hash for the "large" blobs and not actually writing anything to the ODB.
> This patch introduces a separate "hash-only" variant of
> `index_blob_packfile_transaction()` and is used to bypass creating an
> ODB transaction when object writes are not needed.
>
> If this is the route we want to go down, I can extract this patch from
> the current series and send it as a separate fix. :)
If this ends up happening CC me and I'll gladly stamp it with Tested-by :)
>
> -Justin
>
> [1]: https://lore.kernel.org/git/adPjXKGIT5O7SK6E@denethor/T/#m9cee420941b66abfb0244ea4b7762ba8d0ff7b52
> [2]: https://lore.kernel.org/git/20260402213220.2651523-5-jltobler@gmail.com/

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] object-file: don't use object database without a repository
  2026-04-06 18:17       ` Justin Tobler
  2026-04-06 19:31         ` Luca Stefani
@ 2026-04-06 20:06         ` Jeff King
  2026-04-06 20:38           ` Justin Tobler
  1 sibling, 1 reply; 10+ messages in thread
From: Jeff King @ 2026-04-06 20:06 UTC (permalink / raw)
  To: Justin Tobler; +Cc: Luca Stefani, git, cat

On Mon, Apr 06, 2026 at 01:17:17PM -0500, Justin Tobler wrote:

> On 26/04/05 03:17PM, Jeff King wrote:
> > But I think the actual code change in your patch is the wrong thing, so
> > I also don't think we'd want to just squash that test in. I'm hoping
> > Justin has some insights on how to do a more complete fix.
> 
> I agree with Peff here that the correct fix should continue to use the
> object streaming mechanisms. To avoid this segfault, we really should
> avoid using ODB transactions when there isn't an ODB in the first place.
> 
> I replied in another thread[1] with how we could go about fixing. To
> summarize, it just so happens that I already have a patch[2] out on the
> list that appears to resolve this issue.
> 
> For the use case here, git-diff(1) is only interested in generating the
> hash for the "large" blobs and not actually writing anything to the ODB.
> This patch introduces a separate "hash-only" variant of
> `index_blob_packfile_transaction()` and is used to bypass creating an
> ODB transaction when object writes are not needed.
> 
> If this is the route we want to go down, I can extract this patch from
> the current series and send it as a separate fix. :)

Yeah, I think this is a good path forward. I took a look at making the
transaction begin/end conditional, but that's not nearly enough anymore.
The transaction object stores state which is used under the hood by
index_blob_packfile_transaction(). So we'd really need some kind of fake
noop transaction that understands how to stream.

Just having the caller divert to a "hash this without having an odb"
interface is way simpler (especially since this is the only spot that
needs it, so we are only paying the price once either way).

I gave a cursory look at the patch you linked. For a maint fix like this
I think we could probably slim it down a bit: introduce the new
hash-only helper but _don't_ actually rip flag support out of
index_blob_packfile_transaction(), so we know that we can't accidentally
break it. Though maybe that is being overly cautious; it only has one
caller, and that caller would no longer be passing in any meaningful
flags.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] object-file: don't use object database without a repository
  2026-04-06 19:31         ` Luca Stefani
@ 2026-04-06 20:31           ` Justin Tobler
  0 siblings, 0 replies; 10+ messages in thread
From: Justin Tobler @ 2026-04-06 20:31 UTC (permalink / raw)
  To: Luca Stefani; +Cc: Jeff King, git, cat

On 26/04/06 09:31PM, Luca Stefani wrote:
> 
> On 06/04/2026 20:17, Justin Tobler wrote:
> > On 26/04/05 03:17PM, Jeff King wrote:
> > > But I think the actual code change in your patch is the wrong thing, so
> > > I also don't think we'd want to just squash that test in. I'm hoping
> > > Justin has some insights on how to do a more complete fix.
> > I agree with Peff here that the correct fix should continue to use the
> > object streaming mechanisms. To avoid this segfault, we really should
> > avoid using ODB transactions when there isn't an ODB in the first place.
> > 
> > I replied in another thread[1] with how we could go about fixing. To
> > summarize, it just so happens that I already have a patch[2] out on the
> > list that appears to resolve this issue.
> 
> Thanks, just verified it works as expected.

Thanks for testing! :)

> > 
> > For the use case here, git-diff(1) is only interested in generating the
> > hash for the "large" blobs and not actually writing anything to the ODB.
> > This patch introduces a separate "hash-only" variant of
> > `index_blob_packfile_transaction()` and is used to bypass creating an
> > ODB transaction when object writes are not needed.
> > 
> > If this is the route we want to go down, I can extract this patch from
> > the current series and send it as a separate fix. :)
> If this ends up happening CC me and I'll gladly stamp it with Tested-by :)

Will do!

-Justin

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] object-file: don't use object database without a repository
  2026-04-06 20:06         ` Jeff King
@ 2026-04-06 20:38           ` Justin Tobler
  0 siblings, 0 replies; 10+ messages in thread
From: Justin Tobler @ 2026-04-06 20:38 UTC (permalink / raw)
  To: Jeff King; +Cc: Luca Stefani, git, cat

On 26/04/06 04:06PM, Jeff King wrote:
> On Mon, Apr 06, 2026 at 01:17:17PM -0500, Justin Tobler wrote:
> 
> > I agree with Peff here that the correct fix should continue to use the
> > object streaming mechanisms. To avoid this segfault, we really should
> > avoid using ODB transactions when there isn't an ODB in the first place.
> > 
> > I replied in another thread[1] with how we could go about fixing. To
> > summarize, it just so happens that I already have a patch[2] out on the
> > list that appears to resolve this issue.
> > 
> > For the use case here, git-diff(1) is only interested in generating the
> > hash for the "large" blobs and not actually writing anything to the ODB.
> > This patch introduces a separate "hash-only" variant of
> > `index_blob_packfile_transaction()` and is used to bypass creating an
> > ODB transaction when object writes are not needed.
> > 
> > If this is the route we want to go down, I can extract this patch from
> > the current series and send it as a separate fix. :)
> 
> Yeah, I think this is a good path forward. I took a look at making the
> transaction begin/end conditional, but that's not nearly enough anymore.
> The transaction object stores state which is used under the hood by
> index_blob_packfile_transaction(). So we'd really need some kind of fake
> noop transaction that understands how to stream.
> 
> Just having the caller divert to a "hash this without having an odb"
> interface is way simpler (especially since this is the only spot that
> needs it, so we are only paying the price once either way).
> 
> I gave a cursory look at the patch you linked. For a maint fix like this
> I think we could probably slim it down a bit: introduce the new
> hash-only helper but _don't_ actually rip flag support out of
> index_blob_packfile_transaction(), so we know that we can't accidentally
> break it. Though maybe that is being overly cautious; it only has one
> caller, and that caller would no longer be passing in any meaningful
> flags.

Ya, I think slimming down the patch probably makes sense. I'll start
working on it and make sure to include some tests too. :)

Thanks,
-Justin

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2026-04-06 20:38 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-04 17:28 [PATCH] object-file: don't use object database without a repository Luca Stefani
2026-04-05  6:03 ` Pushkar Singh
2026-04-05  6:46 ` Jeff King
2026-04-05 16:10   ` Luca Stefani
2026-04-05 19:17     ` Jeff King
2026-04-06 18:17       ` Justin Tobler
2026-04-06 19:31         ` Luca Stefani
2026-04-06 20:31           ` Justin Tobler
2026-04-06 20:06         ` Jeff King
2026-04-06 20:38           ` Justin Tobler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox