* [PATCH v2 0/2] Prevent re-reading 4 GiB files on every status @ 2023-10-12 16:09 brian m. carlson 2023-10-12 16:09 ` [PATCH v2 1/2] t: add a test helper to truncate files brian m. carlson 2023-10-12 16:09 ` [PATCH v2 2/2] Prevent git from rehashing 4GiB files brian m. carlson 0 siblings, 2 replies; 16+ messages in thread From: brian m. carlson @ 2023-10-12 16:09 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Taylor Blau, Jason Hatton Several people have noticed that Git continually re-reads and re-hashes files that are an exact multiple of 4 GiB every time the index is refreshed (for example, when "git status" is called). This is slow and expensive, especially with SHA-1 DC, and it also causes performance problems when these files are used with Git LFS (because the same issue occurs, just with Git LFS hashing the data instead of Git). Jason Hatton sent a patch previously to fix this, but it lacked tests and didn't get picked up. I've adopted their patch, making some minor changes to the commit message and including some tests, and also including a suitable test helper to make the tests possible. All credit should be directed to Jason, and I'll accept all the responsibility for any problems. I don't anticipate this being in any way controversial, so I'm not expecting a huge number of rerolls, but of course one or two might be necessary. Jason Hatton (1): Prevent git from rehashing 4GiB files brian m. carlson (1): t: add a test helper to truncate files Makefile | 1 + statinfo.c | 20 ++++++++++++++++++-- t/helper/test-tool.c | 1 + t/helper/test-tool.h | 1 + t/helper/test-truncate.c | 27 +++++++++++++++++++++++++++ t/t7508-status.sh | 16 ++++++++++++++++ 6 files changed, 64 insertions(+), 2 deletions(-) create mode 100644 t/helper/test-truncate.c ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v2 1/2] t: add a test helper to truncate files 2023-10-12 16:09 [PATCH v2 0/2] Prevent re-reading 4 GiB files on every status brian m. carlson @ 2023-10-12 16:09 ` brian m. carlson 2023-10-12 17:49 ` Eric Sunshine ` (2 more replies) 2023-10-12 16:09 ` [PATCH v2 2/2] Prevent git from rehashing 4GiB files brian m. carlson 1 sibling, 3 replies; 16+ messages in thread From: brian m. carlson @ 2023-10-12 16:09 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Taylor Blau, Jason Hatton From: "brian m. carlson" <bk2204@github.com> In a future commit, we're going to work with some large files which will be at least 4 GiB in size. To take advantage of the sparseness functionality on most Unix systems and avoid running the system out of disk, it would be convenient to use truncate(2) to simply create a sparse file of sufficient size. However, the GNU truncate(1) utility isn't portable, so let's write a tiny test helper that does the work for us. Signed-off-by: brian m. carlson <bk2204@github.com> --- Makefile | 1 + t/helper/test-tool.c | 1 + t/helper/test-tool.h | 1 + t/helper/test-truncate.c | 27 +++++++++++++++++++++++++++ 4 files changed, 30 insertions(+) create mode 100644 t/helper/test-truncate.c diff --git a/Makefile b/Makefile index 9c6a2f125f..03adcb5a48 100644 --- a/Makefile +++ b/Makefile @@ -852,6 +852,7 @@ TEST_BUILTINS_OBJS += test-submodule-nested-repo-config.o TEST_BUILTINS_OBJS += test-submodule.o TEST_BUILTINS_OBJS += test-subprocess.o TEST_BUILTINS_OBJS += test-trace2.o +TEST_BUILTINS_OBJS += test-truncate.o TEST_BUILTINS_OBJS += test-urlmatch-normalization.o TEST_BUILTINS_OBJS += test-userdiff.o TEST_BUILTINS_OBJS += test-wildmatch.o diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c index 9010ac6de7..876cd2dc31 100644 --- a/t/helper/test-tool.c +++ b/t/helper/test-tool.c @@ -86,6 +86,7 @@ static struct test_cmd cmds[] = { { "submodule-nested-repo-config", cmd__submodule_nested_repo_config }, { "subprocess", cmd__subprocess }, { "trace2", cmd__trace2 }, + { "truncate", cmd__truncate }, { "userdiff", cmd__userdiff }, { "urlmatch-normalization", cmd__urlmatch_normalization }, { "xml-encode", cmd__xml_encode }, diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h index f134f96b97..70dd4eba11 100644 --- a/t/helper/test-tool.h +++ b/t/helper/test-tool.h @@ -79,6 +79,7 @@ int cmd__submodule_config(int argc, const char **argv); int cmd__submodule_nested_repo_config(int argc, const char **argv); int cmd__subprocess(int argc, const char **argv); int cmd__trace2(int argc, const char **argv); +int cmd__truncate(int argc, const char **argv); int cmd__userdiff(int argc, const char **argv); int cmd__urlmatch_normalization(int argc, const char **argv); int cmd__xml_encode(int argc, const char **argv); diff --git a/t/helper/test-truncate.c b/t/helper/test-truncate.c new file mode 100644 index 0000000000..bd3fde364b --- /dev/null +++ b/t/helper/test-truncate.c @@ -0,0 +1,27 @@ +#include "test-tool.h" +#include "git-compat-util.h" + +/* + * Truncate a file to the given size. + */ +int cmd__truncate(int argc, const char **argv) +{ + char *p = NULL; + uintmax_t sz = 0; + int fd = -1; + + if (argc != 3) + die("expected filename and size"); + + sz = strtoumax(argv[2], &p, 0); + if (*p) + die("invalid size"); + + fd = open(argv[1], O_WRONLY | O_CREAT, 0600); + if (fd < 0) + die_errno("failed to open file %s", argv[1]); + + if (ftruncate(fd, (off_t) sz) < 0) + die_errno("failed to truncate file"); + return 0; +} ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/2] t: add a test helper to truncate files 2023-10-12 16:09 ` [PATCH v2 1/2] t: add a test helper to truncate files brian m. carlson @ 2023-10-12 17:49 ` Eric Sunshine 2023-10-13 20:23 ` brian m. carlson 2023-10-12 22:52 ` Junio C Hamano 2023-10-16 23:53 ` Jeff King 2 siblings, 1 reply; 16+ messages in thread From: Eric Sunshine @ 2023-10-12 17:49 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Junio C Hamano, Taylor Blau, Jason Hatton On Thu, Oct 12, 2023 at 12:10 PM brian m. carlson <sandals@crustytoothpaste.net> wrote: > In a future commit, we're going to work with some large files which will > be at least 4 GiB in size. To take advantage of the sparseness > functionality on most Unix systems and avoid running the system out of > disk, it would be convenient to use truncate(2) to simply create a > sparse file of sufficient size. > > However, the GNU truncate(1) utility isn't portable, so let's write a > tiny test helper that does the work for us. > > Signed-off-by: brian m. carlson <bk2204@github.com> > --- > diff --git a/t/helper/test-truncate.c b/t/helper/test-truncate.c > @@ -0,0 +1,27 @@ > +int cmd__truncate(int argc, const char **argv) > +{ > + char *p = NULL; > + uintmax_t sz = 0; > + int fd = -1; > + > + if (argc != 3) > + die("expected filename and size"); > + > + sz = strtoumax(argv[2], &p, 0); > + if (*p) > + die("invalid size"); Do you want to check 'errno' here, as well (probably before the '*p' check)? Or is that being too defensive for a 'test-tool' command? > + fd = open(argv[1], O_WRONLY | O_CREAT, 0600); > + if (fd < 0) > + die_errno("failed to open file %s", argv[1]); > + > + if (ftruncate(fd, (off_t) sz) < 0) > + die_errno("failed to truncate file"); > + return 0; > +} ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/2] t: add a test helper to truncate files 2023-10-12 17:49 ` Eric Sunshine @ 2023-10-13 20:23 ` brian m. carlson 0 siblings, 0 replies; 16+ messages in thread From: brian m. carlson @ 2023-10-13 20:23 UTC (permalink / raw) To: Eric Sunshine; +Cc: git, Junio C Hamano, Taylor Blau, Jason Hatton [-- Attachment #1: Type: text/plain, Size: 912 bytes --] On 2023-10-12 at 17:49:13, Eric Sunshine wrote: > > + sz = strtoumax(argv[2], &p, 0); > > + if (*p) > > + die("invalid size"); > > Do you want to check 'errno' here, as well (probably before the '*p' check)? > > Or is that being too defensive for a 'test-tool' command? I don't believe that's necessary. The Linux manual page leads me to believe that if *argv[2] is not 0 but *p is 0, then the entire string is valid, which would imply that errno is not set. I'm happy to ignore for the moment the case where the user specifies "" as the argument, because it is a test helper and "don't do weird, unexpected things with the test helper without looking at the source code first" is legitimate advice for Git developers. If this were user-facing, improved robustness would be warranted, I agree. -- brian m. carlson (he/him or they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 263 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/2] t: add a test helper to truncate files 2023-10-12 16:09 ` [PATCH v2 1/2] t: add a test helper to truncate files brian m. carlson 2023-10-12 17:49 ` Eric Sunshine @ 2023-10-12 22:52 ` Junio C Hamano 2023-10-13 20:18 ` brian m. carlson 2023-10-16 23:53 ` Jeff King 2 siblings, 1 reply; 16+ messages in thread From: Junio C Hamano @ 2023-10-12 22:52 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Taylor Blau, Jason Hatton "brian m. carlson" <sandals@crustytoothpaste.net> writes: > + fd = open(argv[1], O_WRONLY | O_CREAT, 0600); > + if (fd < 0) > + die_errno("failed to open file %s", argv[1]); contrib/coccinelle/xopen.cocci tells us to write this simply as fd = xopen(argv[1], O_WRONLY | O_CREAT, 0600); https://github.com/git/git/actions/runs/6500777388/job/17656837393#step:4:657 ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/2] t: add a test helper to truncate files 2023-10-12 22:52 ` Junio C Hamano @ 2023-10-13 20:18 ` brian m. carlson 2023-10-13 20:32 ` Junio C Hamano 0 siblings, 1 reply; 16+ messages in thread From: brian m. carlson @ 2023-10-13 20:18 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Taylor Blau, Jason Hatton [-- Attachment #1: Type: text/plain, Size: 458 bytes --] On 2023-10-12 at 22:52:59, Junio C Hamano wrote: > "brian m. carlson" <sandals@crustytoothpaste.net> writes: > > > + fd = open(argv[1], O_WRONLY | O_CREAT, 0600); > > + if (fd < 0) > > + die_errno("failed to open file %s", argv[1]); > > contrib/coccinelle/xopen.cocci tells us to write this simply as > > fd = xopen(argv[1], O_WRONLY | O_CREAT, 0600); Sure, I can do that. -- brian m. carlson (he/him or they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 263 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/2] t: add a test helper to truncate files 2023-10-13 20:18 ` brian m. carlson @ 2023-10-13 20:32 ` Junio C Hamano 0 siblings, 0 replies; 16+ messages in thread From: Junio C Hamano @ 2023-10-13 20:32 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Taylor Blau, Jason Hatton "brian m. carlson" <sandals@crustytoothpaste.net> writes: > On 2023-10-12 at 22:52:59, Junio C Hamano wrote: >> "brian m. carlson" <sandals@crustytoothpaste.net> writes: >> >> > + fd = open(argv[1], O_WRONLY | O_CREAT, 0600); >> > + if (fd < 0) >> > + die_errno("failed to open file %s", argv[1]); >> >> contrib/coccinelle/xopen.cocci tells us to write this simply as >> >> fd = xopen(argv[1], O_WRONLY | O_CREAT, 0600); > > Sure, I can do that. Unless there are other changes needed, I'll handle it on this end. Thanks. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/2] t: add a test helper to truncate files 2023-10-12 16:09 ` [PATCH v2 1/2] t: add a test helper to truncate files brian m. carlson 2023-10-12 17:49 ` Eric Sunshine 2023-10-12 22:52 ` Junio C Hamano @ 2023-10-16 23:53 ` Jeff King 2 siblings, 0 replies; 16+ messages in thread From: Jeff King @ 2023-10-16 23:53 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Junio C Hamano, Taylor Blau, Jason Hatton On Thu, Oct 12, 2023 at 04:09:29PM +0000, brian m. carlson wrote: > +int cmd__truncate(int argc, const char **argv) > +{ > + char *p = NULL; > + uintmax_t sz = 0; > + int fd = -1; > + > + if (argc != 3) > + die("expected filename and size"); > + > + sz = strtoumax(argv[2], &p, 0); > + if (*p) > + die("invalid size"); > + > + fd = open(argv[1], O_WRONLY | O_CREAT, 0600); > + if (fd < 0) > + die_errno("failed to open file %s", argv[1]); > + > + if (ftruncate(fd, (off_t) sz) < 0) > + die_errno("failed to truncate file"); > + return 0; > +} Coverity flagged this as leaking the descriptor "fd" (which is obviously true). I guess it doesn't matter much in practice, since we're exiting the process directly afterwards. If it were a memory leak we'd free() it anyway to shut up the leak detector, but I don't know if we want to be as careful here (in theory it creates noise that distracts from real problems, but Coverity is noisy enough that I don't even bother looking at existing problems). -Peff ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v2 2/2] Prevent git from rehashing 4GiB files 2023-10-12 16:09 [PATCH v2 0/2] Prevent re-reading 4 GiB files on every status brian m. carlson 2023-10-12 16:09 ` [PATCH v2 1/2] t: add a test helper to truncate files brian m. carlson @ 2023-10-12 16:09 ` brian m. carlson 2023-10-12 17:58 ` Junio C Hamano 2023-10-17 0:00 ` Jeff King 1 sibling, 2 replies; 16+ messages in thread From: brian m. carlson @ 2023-10-12 16:09 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Taylor Blau, Jason Hatton From: Jason Hatton <jhatton@globalfinishing.com> The index stores file sizes using a uint32_t. This causes any file that is a multiple of 2^32 to have a cached file size of zero. Zero is a special value used by racily clean. This causes git to rehash every file that is a multiple of 2^32 every time git status or git commit is run. This patch mitigates the problem by making all files that are a multiple of 2^32 appear to have a size of 1<<31 instead of zero. The value of 1<<31 is chosen to keep it as far away from zero as possible to help prevent things getting mixed up with unpatched versions of git. An example would be to have a 2^32 sized file in the index of patched git. Patched git would save the file as 2^31 in the cache. An unpatched git would very much see the file has changed in size and force it to rehash the file, which is safe. The file would have to grow or shrink by exactly 2^31 and retain all of its ctime, mtime, and other attributes for old git to not notice the change. This patch does not change the behavior of any file that is not an exact multiple of 2^32. Signed-off-by: Jason D. Hatton <jhatton@globalfinishing.com> Signed-off-by: brian m. carlson <bk2204@github.com> --- statinfo.c | 20 ++++++++++++++++++-- t/t7508-status.sh | 16 ++++++++++++++++ 2 files changed, 34 insertions(+), 2 deletions(-) diff --git a/statinfo.c b/statinfo.c index 17bb8966c3..9367ca099c 100644 --- a/statinfo.c +++ b/statinfo.c @@ -2,6 +2,22 @@ #include "environment.h" #include "statinfo.h" +/* + * Munge st_size into an unsigned int. + */ +static unsigned int munge_st_size(off_t st_size) { + unsigned int sd_size = st_size; + + /* + * If the file is an exact multiple of 4 GiB, modify the value so it + * doesn't get marked as racily clean (zero). + */ + if (!sd_size && st_size) + return 0x80000000; + else + return sd_size; +} + void fill_stat_data(struct stat_data *sd, struct stat *st) { sd->sd_ctime.sec = (unsigned int)st->st_ctime; @@ -12,7 +28,7 @@ void fill_stat_data(struct stat_data *sd, struct stat *st) sd->sd_ino = st->st_ino; sd->sd_uid = st->st_uid; sd->sd_gid = st->st_gid; - sd->sd_size = st->st_size; + sd->sd_size = munge_st_size(st->st_size); } int match_stat_data(const struct stat_data *sd, struct stat *st) @@ -51,7 +67,7 @@ int match_stat_data(const struct stat_data *sd, struct stat *st) changed |= INODE_CHANGED; #endif - if (sd->sd_size != (unsigned int) st->st_size) + if (sd->sd_size != munge_st_size(st->st_size)) changed |= DATA_CHANGED; return changed; diff --git a/t/t7508-status.sh b/t/t7508-status.sh index 6928fd89f5..6c46648e11 100755 --- a/t/t7508-status.sh +++ b/t/t7508-status.sh @@ -1745,4 +1745,20 @@ test_expect_success 'slow status advice when core.untrackedCache true, and fsmon ) ' +test_expect_success EXPENSIVE 'status does not re-read unchanged 4 or 8 GiB file' ' + ( + mkdir large-file && + cd large-file && + # Files are 2 GiB, 4 GiB, and 8 GiB sparse files. + test-tool truncate file-a 0x080000000 && + test-tool truncate file-b 0x100000000 && + test-tool truncate file-c 0x200000000 && + # This will be slow. + git add file-a file-b file-c && + git commit -m "add large files" && + git diff-index HEAD file-a file-b file-c >actual && + test_must_be_empty actual + ) +' + test_done ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/2] Prevent git from rehashing 4GiB files 2023-10-12 16:09 ` [PATCH v2 2/2] Prevent git from rehashing 4GiB files brian m. carlson @ 2023-10-12 17:58 ` Junio C Hamano 2023-10-12 21:58 ` brian m. carlson 2023-10-17 0:00 ` Jeff King 1 sibling, 1 reply; 16+ messages in thread From: Junio C Hamano @ 2023-10-12 17:58 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Taylor Blau, Jason Hatton "brian m. carlson" <sandals@crustytoothpaste.net> writes: > An example would be to have a 2^32 sized file in the index of > patched git. Patched git would save the file as 2^31 in the cache. > An unpatched git would very much see the file has changed in size > and force it to rehash the file, which is safe. The reason why this is "safe" is because an older Git will would keep rehashing whether 2^31 or 0 is stored as its sd_size, so the change is not making things worse? With older git, "git diff-files" will report that such a file is not up to date, and then the user will refresh the index, which will store 0 as its sd_file, so tentatively "git status" may give a wrong information, but we probalby do not care? Is that how the reasoning goes? > +/* > + * Munge st_size into an unsigned int. > + */ > +static unsigned int munge_st_size(off_t st_size) { > + unsigned int sd_size = st_size; > + > + /* > + * If the file is an exact multiple of 4 GiB, modify the value so it > + * doesn't get marked as racily clean (zero). > + */ > + if (!sd_size && st_size) > + return 0x80000000; > + else > + return sd_size; > +} This assumes typeof(sd_size) aka "unsigned int" is always 32-bit, which does not sound reasonable. Reference to 4GiB, 2^32 and 2^31 in the code and the proposed commit log message need to be qualified with "on a platform whose uint is 32-bit" or something, or better yet, phrased in a way that is agnostic to the integer size. At the very least, the hardcoded 0x80000000 needs to be rethought, I am afraid. Other than that, the workaround for the racy-git avoidance code does sound good. I actually wonder if we should always use 1 regardless of integer size. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/2] Prevent git from rehashing 4GiB files 2023-10-12 17:58 ` Junio C Hamano @ 2023-10-12 21:58 ` brian m. carlson 2023-10-12 22:11 ` Junio C Hamano 0 siblings, 1 reply; 16+ messages in thread From: brian m. carlson @ 2023-10-12 21:58 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Taylor Blau, Jason Hatton [-- Attachment #1: Type: text/plain, Size: 3498 bytes --] On 2023-10-12 at 17:58:42, Junio C Hamano wrote: > "brian m. carlson" <sandals@crustytoothpaste.net> writes: > > > An example would be to have a 2^32 sized file in the index of > > patched git. Patched git would save the file as 2^31 in the cache. > > An unpatched git would very much see the file has changed in size > > and force it to rehash the file, which is safe. > > The reason why this is "safe" is because an older Git will would > keep rehashing whether 2^31 or 0 is stored as its sd_size, so the > change is not making things worse? With older git, "git diff-files" > will report that such a file is not up to date, and then the user > will refresh the index, which will store 0 as its sd_file, so > tentatively "git status" may give a wrong information, but we > probalby do not care? Is that how the reasoning goes? Correct. An older Git will rehash it either way, on every git status. If you run git diff-files on an older Git, it will always show it as modified (hence why I used that in the tests). The git status will rehash it and then realize that it isn't modified, not printing any changes, so the behaviour is not wrong, it's just extremely slow (SHA-1 DC on 4 GiB of data). If you use a new Git with an old index, git status will still rehash it the first time and correctly determine that it isn't changed, then write the 0x80000000 to the index. It just won't rehash it on subsequent calls to git status. The only incorrect behaviour (assuming that slow is not incorrect) that I'm aware of on older Git is that git diff-files marks it as modified when it's not. There is no incorrect behaviour on a fixed Git. > > +/* > > + * Munge st_size into an unsigned int. > > + */ > > +static unsigned int munge_st_size(off_t st_size) { > > + unsigned int sd_size = st_size; > > + > > + /* > > + * If the file is an exact multiple of 4 GiB, modify the value so it > > + * doesn't get marked as racily clean (zero). > > + */ > > + if (!sd_size && st_size) > > + return 0x80000000; > > + else > > + return sd_size; > > +} > > This assumes typeof(sd_size) aka "unsigned int" is always 32-bit, > which does not sound reasonable. Reference to 4GiB, 2^32 and 2^31 > in the code and the proposed commit log message need to be qualified > with "on a platform whose uint is 32-bit" or something, or better > yet, phrased in a way that is agnostic to the integer size. At > the very least, the hardcoded 0x80000000 needs to be rethought, I > am afraid. unsigned int is 32-bit on every modern 32- or 64-bit platform known to exist today. I don't believe we support MS-DOS or systems of its era, nor should we support 8- or 16-bit systems. If you'd prefer "uint32_t", I can do that. Note that the problem is reproducible on a stock Ubuntu amd64 system, so it is definitely a problem on all modern systems. > Other than that, the workaround for the racy-git avoidance code does > sound good. I actually wonder if we should always use 1 regardless > of integer size. I think using 2^31 is better because it's far away from very small values and very large values, which means that it's a hard to modify a file which used to have its value as a multiple of 4 GiB and accidentally make Git think it was unchanged. Using 1 would make a simple "echo > foo" possibly think things were unchanged in some cases, which we should avoid. -- brian m. carlson (he/him or they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 263 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/2] Prevent git from rehashing 4GiB files 2023-10-12 21:58 ` brian m. carlson @ 2023-10-12 22:11 ` Junio C Hamano 0 siblings, 0 replies; 16+ messages in thread From: Junio C Hamano @ 2023-10-12 22:11 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Taylor Blau, Jason Hatton "brian m. carlson" <sandals@crustytoothpaste.net> writes: > nor should we support 8- or 16-bit systems. If you'd prefer "uint32_t", > I can do that. Going into statinfo.h and updating sd_size to uint32_t is totally outside the scope of this fix. > I think using 2^31 is better because it's far away from very small > values and very large values, which means that it's a hard to modify a > file which used to have its value as a multiple of 4 GiB and > accidentally make Git think it was unchanged. Using 1 would make a > simple "echo > foo" possibly think things were unchanged in some cases, > which we should avoid. The reason I gave the extreme "1-byte" was to gauge how much actual "size" we are relying on the correctness of this change. As mtime is giving the primary protection from false matching of cached stat information, I do not think "echo >foo" would be a huge issue. IOW my stance is 1U<<31 is as good as 1U<<0, so I do not oppose to the former, either. But in a few years, 64-bit integers may cease to be too odd, who knows ;-) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/2] Prevent git from rehashing 4GiB files 2023-10-12 16:09 ` [PATCH v2 2/2] Prevent git from rehashing 4GiB files brian m. carlson 2023-10-12 17:58 ` Junio C Hamano @ 2023-10-17 0:00 ` Jeff King 2023-10-17 14:49 ` Jason Hatton 2023-10-18 0:42 ` brian m. carlson 1 sibling, 2 replies; 16+ messages in thread From: Jeff King @ 2023-10-17 0:00 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Junio C Hamano, Taylor Blau, Jason Hatton On Thu, Oct 12, 2023 at 04:09:30PM +0000, brian m. carlson wrote: > +static unsigned int munge_st_size(off_t st_size) { > + unsigned int sd_size = st_size; > + > + /* > + * If the file is an exact multiple of 4 GiB, modify the value so it > + * doesn't get marked as racily clean (zero). > + */ > + if (!sd_size && st_size) > + return 0x80000000; > + else > + return sd_size; > +} Coverity complained that the "true" side of this conditional is unreachable, since sd_size is assigned from st_size, so the two values cannot be both true and false. But obviously we are depending here on the truncation of off_t to "unsigned int". I'm not sure if Coverity is just dumb, or if it somehow has a different size for off_t. I don't _think_ this would ever cause confusion in a real compiler, as assignment from a larger type to a smaller has well-defined truncation, as far as I know. But I do wonder if an explicit "& 0xFFFFFFFF" would make it more obvious what is happening (which would also do the right thing if in some hypothetical platform "unsigned int" ended up larger than 32 bits). -Peff ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/2] Prevent git from rehashing 4GiB files 2023-10-17 0:00 ` Jeff King @ 2023-10-17 14:49 ` Jason Hatton 2023-10-17 17:02 ` Junio C Hamano 2023-10-18 0:42 ` brian m. carlson 1 sibling, 1 reply; 16+ messages in thread From: Jason Hatton @ 2023-10-17 14:49 UTC (permalink / raw) To: Jeff King, brian m. carlson Cc: git@vger.kernel.org, Junio C Hamano, Taylor Blau From: Jeff King <peff@peff.net> > On Thu, Oct 12, 2023 at 04:09:30PM +0000, brian m. carlson wrote: > > > +static unsigned int munge_st_size(off_t st_size) { > > + unsigned int sd_size = st_size; > > + > > + /* > > + * If the file is an exact multiple of 4 GiB, modify the value so it > > + * doesn't get marked as racily clean (zero). > > + */ > > + if (!sd_size && st_size) > > + return 0x80000000; > > + else > > + return sd_size; > > +} > > Coverity complained that the "true" side of this conditional is > unreachable, since sd_size is assigned from st_size, so the two values > cannot be both true and false. But obviously we are depending here on > the truncation of off_t to "unsigned int". I'm not sure if Coverity is > just dumb, or if it somehow has a different size for off_t. > > I don't _think_ this would ever cause confusion in a real compiler, as > assignment from a larger type to a smaller has well-defined truncation, > as far as I know. > > But I do wonder if an explicit "& 0xFFFFFFFF" would make it more obvious > what is happening (which would also do the right thing if in some > hypothetical platform "unsigned int" ended up larger than 32 bits). > > -Peff I originally wrote the code this way to work exactly like the original code with one exception: Never truncate a nonzero st_size to a zero sd_size. The original code is here in fill_stat_data: I was attempting to use exactly the same implicit type conversion and types as the original. We could probably write the below to do the same thing. void fill_stat_data(struct stat_data *sd, struct stat *st) { sd->sd_ctime.sec = (unsigned int)st->st_ctime; sd->sd_mtime.sec = (unsigned int)st->st_mtime; sd->sd_ctime.nsec = ST_CTIME_NSEC(*st); sd->sd_mtime.nsec = ST_MTIME_NSEC(*st); sd->sd_dev = st->st_dev; sd->sd_ino = st->st_ino; sd->sd_uid = st->st_uid; sd->sd_gid = st->st_gid; sd->sd_size = st->st_size; + if (sd->sd_size == 0 && st->st_size!= 0) { + sd->sd_size = 1; + } } - Jason D. Hatton ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/2] Prevent git from rehashing 4GiB files 2023-10-17 14:49 ` Jason Hatton @ 2023-10-17 17:02 ` Junio C Hamano 0 siblings, 0 replies; 16+ messages in thread From: Junio C Hamano @ 2023-10-17 17:02 UTC (permalink / raw) To: Jason Hatton Cc: Jeff King, brian m. carlson, git@vger.kernel.org, Taylor Blau Jason Hatton <jhatton@globalfinishing.com> writes: > We could probably write the below to do the same thing. > > void fill_stat_data(struct stat_data *sd, struct stat *st) > { > sd->sd_ctime.sec = (unsigned int)st->st_ctime; > sd->sd_mtime.sec = (unsigned int)st->st_mtime; > sd->sd_ctime.nsec = ST_CTIME_NSEC(*st); > sd->sd_mtime.nsec = ST_MTIME_NSEC(*st); > sd->sd_dev = st->st_dev; > sd->sd_ino = st->st_ino; > sd->sd_uid = st->st_uid; > sd->sd_gid = st->st_gid; > sd->sd_size = st->st_size; > + if (sd->sd_size == 0 && st->st_size!= 0) { > + sd->sd_size = 1; > + } > } The above is a fairly straight-forward inlining (except that it does explicit comparisons with zero) of the helper function the version of patch under discussion added, and uses 1 instead of (1<<31) as an arbigrary nonzero number that can be used to work around the issue. So I agree with you that it would do the same thing. I am not surprised if it also gets scolded by Coverity the same way, though. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/2] Prevent git from rehashing 4GiB files 2023-10-17 0:00 ` Jeff King 2023-10-17 14:49 ` Jason Hatton @ 2023-10-18 0:42 ` brian m. carlson 1 sibling, 0 replies; 16+ messages in thread From: brian m. carlson @ 2023-10-18 0:42 UTC (permalink / raw) To: Jeff King; +Cc: git, Junio C Hamano, Taylor Blau, Jason Hatton [-- Attachment #1: Type: text/plain, Size: 2403 bytes --] On 2023-10-17 at 00:00:19, Jeff King wrote: > On Thu, Oct 12, 2023 at 04:09:30PM +0000, brian m. carlson wrote: > > > +static unsigned int munge_st_size(off_t st_size) { > > + unsigned int sd_size = st_size; > > + > > + /* > > + * If the file is an exact multiple of 4 GiB, modify the value so it > > + * doesn't get marked as racily clean (zero). > > + */ > > + if (!sd_size && st_size) > > + return 0x80000000; > > + else > > + return sd_size; > > +} > > Coverity complained that the "true" side of this conditional is > unreachable, since sd_size is assigned from st_size, so the two values > cannot be both true and false. But obviously we are depending here on > the truncation of off_t to "unsigned int". I'm not sure if Coverity is > just dumb, or if it somehow has a different size for off_t. Technically, on 32-bit machines, you can have a 32-bit off_t if you don't compile with -D_FILE_OFFSET_BITS=64. However, such a program is not very useful on modern systems, since it wouldn't be able to handle files that are 2 GiB or larger, and so everyone considers that a silly and buggy way to compile programs. > I don't _think_ this would ever cause confusion in a real compiler, as > assignment from a larger type to a smaller has well-defined truncation, > as far as I know. > > But I do wonder if an explicit "& 0xFFFFFFFF" would make it more obvious > what is happening (which would also do the right thing if in some > hypothetical platform "unsigned int" ended up larger than 32 bits). Such a hypothetical platform is of course allowed by the C standard, but it's also going to run near zero real Unix programs or kernels. I am at this point reflecting on the prudent decision Rust made to make almost everything an explicit bit size (e.g., u32, i64). Nonetheless, I'll reroll this with the other things you mentioned, and I'll switch that "unsigned int" above to "uint32_t", which I think makes this more obvious. I don't, however, intend to change the constant from 0x8000000 to 1, because I think that's a poorer choice, but I'll try to explain it better in the commit message. (I had originally aimed to avoid editing it as much as possible since it's not my name on the commit to avoid Jason getting blamed unnecessarily for any mistakes I might make.) -- brian m. carlson (he/him or they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 263 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2023-10-18 0:42 UTC | newest] Thread overview: 16+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-10-12 16:09 [PATCH v2 0/2] Prevent re-reading 4 GiB files on every status brian m. carlson 2023-10-12 16:09 ` [PATCH v2 1/2] t: add a test helper to truncate files brian m. carlson 2023-10-12 17:49 ` Eric Sunshine 2023-10-13 20:23 ` brian m. carlson 2023-10-12 22:52 ` Junio C Hamano 2023-10-13 20:18 ` brian m. carlson 2023-10-13 20:32 ` Junio C Hamano 2023-10-16 23:53 ` Jeff King 2023-10-12 16:09 ` [PATCH v2 2/2] Prevent git from rehashing 4GiB files brian m. carlson 2023-10-12 17:58 ` Junio C Hamano 2023-10-12 21:58 ` brian m. carlson 2023-10-12 22:11 ` Junio C Hamano 2023-10-17 0:00 ` Jeff King 2023-10-17 14:49 ` Jason Hatton 2023-10-17 17:02 ` Junio C Hamano 2023-10-18 0:42 ` brian m. carlson
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).