From: "René Scharfe" <l.s.r@web.de>
To: Jason Hatton <jhatton@globalfinishing.com>,
"git@vger.kernel.org" <git@vger.kernel.org>
Cc: Junio C Hamano <gitster@pobox.com>
Subject: Re: Git status extremely slow if any file is a multiple of 8GBi
Date: Thu, 5 May 2022 23:04:03 +0200
Message-ID: <0d78c98a-841e-719b-add3-acc7a7a2d7c6@web.de>
In-Reply-To: <CY4PR16MB16558FE8E69B2045435AD59DAFC39@CY4PR16MB1655.namprd16.prod.outlook.com>
On 04.05.22 at 19:47, Jason Hatton wrote:
>>> The condition sd_size==0 is used as a signal for "no, we really need
>>> to compare the contents", and causes the contents to be hashed, and
>>> if the contents match the object name recorded in the index, the
>>> on-disk size is stored in sd_size and the entry is marked as
>>> CE_UPTODATE. Alas, if the truncated st_size is 0, the resulting
>>> entry would have sd_size==0 again, so a workaround like what you
>>> outlined is needed.
>>
>> Junio C Hamano <gitster@pobox.com> writes:
>>
>> This is of secondary importance, but the fact that Jason observed
>> 8GBi files gets hashed over and over unnecessarily means that we
>> would do the same for an empty file, opening, reading 0-bytes,
>> hashing, and closing, without taking advantage of the fact that
>> CE_UPTODATE bit says the file contents should be up-to-date with
>> respect to the cached object name, doesn't it?
>>
>> Or do we have "if st_size == 0 and sd_size == 0 then we know what it
>> hashes to (i.e. EMPTY_BLOB_SHA*) and there is no need to do the
>> usual open-read-hash-close dance" logic (I didn't check)?
>
> Junio C Hamano
>
> As best as I can tell, it rechecks the zero sized files. My Linux box can run
> git ls in .006 seconds with 1000 zero sized files in the repo. Rehashing every
> file that is a multiple of 2^32 with every "git ls" on the other hand...
>
> I managed to actually compile git with the proposed changes.

Meaning that file sizes of n * 2^32 bytes get recorded as 1 byte instead
of 0 bytes? Why 1 and not e.g. 2^32-1 or 2^31 (or 42)?
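To make the question concrete, here is a minimal sketch of the truncation and
the proposed substitution as described in this thread. It is not git's actual
code: the name munge_size and the placeholder parameter are hypothetical, and
the index field is simply assumed to keep the low 32 bits of st_size.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not taken from git's source: reduce a 64-bit
 * file size to the 32 bits kept in the index entry. A nonempty file
 * whose size is a multiple of 2^32 would truncate to 0, which is the
 * "must compare contents" signal, so a nonzero placeholder is stored
 * instead.
 */
static uint32_t munge_size(uint64_t st_size, uint32_t placeholder)
{
	uint32_t sd_size = (uint32_t)st_size;	/* keeps the low 32 bits */

	if (st_size != 0 && sd_size == 0)
		return placeholder;
	return sd_size;
}
```

With a placeholder of 1, an empty file still yields 0, an 8GiB file yields 1,
and a file of 4GiB + 5 bytes yields 5. Any other nonzero placeholder would
avoid the 0 sentinel equally well in this sketch.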
> It seems to correct
> the problem and "make test" passes. If upgrading to the patched version of git,
> git will rehash the 8GBi files once and work normally. If downgrading to an
> unpatched version, git will perceive that the 8GBi files have changes. This
> needs to be corrected with "git add" or "git checkout".

Not nice, but safe. Can there be an unsafe scenario as well? For example,
a 4GiB file gets added to the index by the new version, which records a
size of 1, then the file is extended by one byte while mtime stays the
same. Would an old git then fail to detect the change?
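The scenario can be checked with a small size-only model. This is a sketch
under assumptions, not git's code: old_size, new_size and
old_git_sees_same_size are hypothetical names, the new version is assumed to
store 1 for nonempty multiples of 2^32, and the other stat fields git
compares (such as mtime) are assumed to be unchanged.

```c
#include <stdint.h>

/* Assumed old behaviour: plain truncation to the low 32 bits. */
static uint32_t old_size(uint64_t st_size)
{
	return (uint32_t)st_size;
}

/* Assumed new behaviour: store 1 where truncation would give 0. */
static uint32_t new_size(uint64_t st_size)
{
	uint32_t sd_size = (uint32_t)st_size;

	return (st_size != 0 && sd_size == 0) ? 1 : sd_size;
}

/*
 * Nonzero if an old git, reading an entry written by a new git when
 * the file had size_at_add bytes, finds the recorded size equal to
 * the truncated current size and so sees no size change.
 */
static int old_git_sees_same_size(uint64_t size_at_add, uint64_t size_now)
{
	return new_size(size_at_add) == old_size(size_now);
}
```

In this model an unchanged 4GiB file compares as different (recorded 1,
observed 0), which is the safe case from the quoted paragraph, while a file
grown from 4GiB to 4GiB + 1 bytes compares as equal. A different placeholder
p would move the collision to sizes of n * 2^32 + p rather than remove it.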
> If you people are
> interested, I may be able to find a way to send a patch to the list or put it
> on github.

Patches are always welcome, they make discussions and testing easier.

René