* linux-leaks CI failure on master @ 2022-10-18 18:37 Taylor Blau 2022-10-18 19:40 ` Jeff King 0 siblings, 1 reply; 8+ messages in thread From: Taylor Blau @ 2022-10-18 18:37 UTC (permalink / raw) To: git Cc: Victoria Dye, Ævar Arnfjörð Bjarmason, Jeff King, Junio C Hamano After Junio pushed out the tags for v2.38.1 and friends, I noticed that our linux-leaks CI job is failing t1300.104 and t1300.109, claiming that there is a leak here: Direct leak of 7 byte(s) in 1 object(s) allocated from: #0 0x7fc4a5b319c1 in __interceptor_malloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:54 #1 0x7fc4a598638e in __GI___strdup /build/glibc-SzIz7B/glibc-2.31/string/strdup.c:42 #2 0x55ba23538f7c in xstrdup wrapper.c:39 #3 0x55ba233e258c in git_config_string config.c:1445 #4 0x55ba233e6b06 in git_config_include config.c:439 #5 0x55ba233e063f in get_value config.c:910 #6 0x55ba233e063f in git_parse_source config.c:1092 #7 0x55ba233e063f in do_config_from config.c:1937 #8 0x55ba233e3d2d in do_config_from_file config.c:1965 #9 0x55ba233e3d2d in git_config_from_file_with_options config.c:1987 #10 0x55ba233e4793 in git_config_from_file config.c:1996 #11 0x55ba233e4793 in do_git_config_sequence config.c:2143 #12 0x55ba233e4793 in config_with_options config.c:2198 #13 0x55ba233e4a50 in read_early_config config.c:2255 #14 0x55ba233acb36 in alias_lookup alias.c:35 #15 0x55ba232d0748 in handle_alias git.c:346 #16 0x55ba232d0748 in run_argv git.c:851 #17 0x55ba232d0748 in cmd_main git.c:921 #18 0x55ba232cee03 in main common-main.c:57 #19 0x7fc4a590b082 in __libc_start_main ../csu/libc-start.c:308 #20 0x55ba232cee8d in _start (git+0x1fe8d) I can't reproduce the failure locally with gcc (Debian 10.3.0-15) 10.3.0, but Victoria (CC'd) can reproduce the failure with 9.4.0. Interestingly, the failure only appears when compiling with `-O2`, but not `-O0` or `-O1`. This is reminiscent to me of the discussion in: https://lore.kernel.org/git/Yy4eo6500C0ijhk+@coredump.intra.peff.net/ I'm not sure if I'm content to treat the 9.4.0 behavior as a compiler bug, but definitely running the linux-leaks build with `-O2` is suspicious. I suppose we could temporarily mark t1300 as not passing with SANITIZE=leak turned on, but I tend to agree with Peff that that feels like a hack working around compiler behavior, that will ultimately result in us playing whack-a-mole. So my preference would be to run the linux-leaks build with `-O0` in its CFLAGS, optionally with a newer compiler if one is available for Focal. Thoughts? Thanks, Taylor ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: linux-leaks CI failure on master 2022-10-18 18:37 linux-leaks CI failure on master Taylor Blau @ 2022-10-18 19:40 ` Jeff King 2022-10-18 20:15 ` [PATCH] Makefile: force -O0 when compiling with SANITIZE=leak Jeff King 0 siblings, 1 reply; 8+ messages in thread From: Jeff King @ 2022-10-18 19:40 UTC (permalink / raw) To: Taylor Blau Cc: git, Victoria Dye, Ævar Arnfjörð Bjarmason, Junio C Hamano On Tue, Oct 18, 2022 at 02:37:44PM -0400, Taylor Blau wrote: > After Junio pushed out the tags for v2.38.1 and friends, I noticed that > our linux-leaks CI job is failing t1300.104 and t1300.109, claiming that > there is a leak here: > [...] > #14 0x55ba233acb36 in alias_lookup alias.c:35 > #15 0x55ba232d0748 in handle_alias git.c:346 These are the interesting part of the trace. alias_lookup() returns an allocated string, and then split_cmdline() fails, so we call die() with: fatal: bad alias.split-cmdline-fix string: unclosed quote So yeah, I think it's probably the same issue as discussed previously: the compiler presumably puts alias_string into a register, and then clobbers the register when calling die(), because it knows we're never coming back. > I can't reproduce the failure locally with gcc (Debian 10.3.0-15) > 10.3.0, but Victoria (CC'd) can reproduce the failure with 9.4.0. > Interestingly, the failure only appears when compiling with `-O2`, but > not `-O0` or `-O1`. I can't reproduce on debian unstable using any of gcc 9-12 (but note gcc-9 here is 9.5.0). But then, I had trouble convincing gcc to find _actual_ leaks with lsan. Clang is much more reliable for me, but it turns up only the failure in t1300.135 that I reported earlier. But given the trace above plus the findings on gcc 9.4.0, I feel pretty confident saying this is another instance of the same problem. > I'm not sure if I'm content to treat the 9.4.0 behavior as a compiler > bug, but definitely running the linux-leaks build with `-O2` is > suspicious. It's definitely not a compiler bug. What the optimizer is doing is perfectly reasonable; it's just that the leak checker interacts badly with it. > I suppose we could temporarily mark t1300 as not passing with > SANITIZE=leak turned on, but I tend to agree with Peff that that feels > like a hack working around compiler behavior, that will ultimately > result in us playing whack-a-mole. > > So my preference would be to run the linux-leaks build with `-O0` in its > CFLAGS, optionally with a newer compiler if one is available for Focal. Yes, I still think disabling optimizations is the best path forward. Not just to avoid whack-a-mole, but this is also something we'd eventually need to confront when the code base really is leak-free. I don't think there's any need for a newer compiler. While the optimizer behavior may change between versions, none of what we've seen is any compiler being _wrong_, just different. As a lesser change, I suspect that making NORETURN a noop in leak-checking builds would help in practice, because the compiler wouldn't realize that die() doesn't return. But it's not foolproof (the same thing might trigger with a direct exit() call, or one that becomes direct via inlining). Using -O0 is the more complete fix, and IMHO it's not important to try to get optimal performance during the leak-checking test run. -Peff ^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH] Makefile: force -O0 when compiling with SANITIZE=leak 2022-10-18 19:40 ` Jeff King @ 2022-10-18 20:15 ` Jeff King 2022-10-18 21:00 ` Ævar Arnfjörð Bjarmason 2022-10-19 15:36 ` Junio C Hamano 0 siblings, 2 replies; 8+ messages in thread From: Jeff King @ 2022-10-18 20:15 UTC (permalink / raw) To: Taylor Blau Cc: git, Victoria Dye, Ævar Arnfjörð Bjarmason, Junio C Hamano On Tue, Oct 18, 2022 at 03:40:45PM -0400, Jeff King wrote: > > I suppose we could temporarily mark t1300 as not passing with > > SANITIZE=leak turned on, but I tend to agree with Peff that that feels > > like a hack working around compiler behavior, that will ultimately > > result in us playing whack-a-mole. > > > > So my preference would be to run the linux-leaks build with `-O0` in its > > CFLAGS, optionally with a newer compiler if one is available for Focal. > > Yes, I still think disabling optimizations is the best path forward. Not > just to avoid whack-a-mole, but this is also something we'd eventually > need to confront when the code base really is leak-free. So here's a patch that does so. I think one of Ævar's reservations in the other thread was not wanting to have to switch optimization levels manually for various builds. I think the patch below should make things Just Work, whether in CI or running a custom build locally. I tested it with the clang t1300.135 failure I can reproduce, and it does indeed make the problem go away. -- >8 -- Subject: Makefile: force -O0 when compiling with SANITIZE=leak Compiling with -O2 can interact badly with LSan's leak-checker, causing false positives. Imagine a simplified example like: char *str = allocate_some_string(); if (some_func(str) < 0) die("bad str"); free(str); The compiler may eliminate "str" as a stack variable, and just leave it in a register. The register is preserved through most of the function, including across the call to some_func(), since we'd eventually need to free it. But because die() is marked with NORETURN, the compiler knows that it doesn't need to save registers, and just clobbers it. When die() eventually exits, the leak-checker runs. It looks in registers and on the stack for any reference to the memory allocated by str (which would indicate that it's not leaked), but can't find one. So it reports it as a leak. Neither system is wrong, really. The C standard (mostly section 5.1.2.3) defines an abstract machine, and compilers are allowed to modify the program as long as the observable behavior of that abstract machine is unchanged. Looking at random memory values on the stack is undefined behavior, and not something that the optimizer needs to support. But there really isn't any other way for a leak checker to work; it inherently has to do undefined things like scouring memory for pointers. So the two things are inherently at odds with each other. We can't fix it by changing the code, because from the perspective of the program running in an abstract machine, there is no leak. This has caused real false positives in the past, like: - https://lore.kernel.org/git/patch-v3-5.6-9a44204c4c9-20211022T175227Z-avarab@gmail.com/ - https://lore.kernel.org/git/Yy4eo6500C0ijhk+@coredump.intra.peff.net/ - https://lore.kernel.org/git/Y07yeEQu+C7AH7oN@nand.local/ This patch makes those go away by forcing -O0 when compiling with LSan. There are a few ways we could do this: - we could just teach the linux-leaks CI job to set -O0. That's the smallest change, and means we wouldn't get spurious CI failures. But it doesn't help people looking for leaks manually or in a specific test (and because the problem depends on the vagaries of the optimizer, investigating these can waste a lot of time in head-scratching as the problem comes and goes) - we default to -O2 in CFLAGS; we could pull this out to a separate variable ("-O$(O)" or something) and modify "O" when LSan is in use. This is the most flexible, in that you could still build with "make O=2 SANITIZE=leak" if you really wanted to (say, for experimenting). But it would also fail to kick in if the user defines their own CFLAGS variable, which again leads to head-scratching. - we can just stick -O0 into BASIC_CFLAGS when enabling LSan. Since this comes after the user-provided CFLAGS, it will override any previous -O setting found there. This is more foolproof, albeit less flexible. If you want to experiment with an optimized leak-checking build, you'll have to put "-O2 -fsanitize=leak" into CFLAGS manually, rather than using our SANITIZE=leak Makefile magic. Since the final one is the least likely to break in normal use, this patch uses that approach. The resulting build is a little slower, of course, but since LSan is already about 2x slower than a regular build, another 10% slowdown isn't that big a deal. Signed-off-by: Jeff King <peff@peff.net> --- Makefile | 1 + 1 file changed, 1 insertion(+) diff --git a/Makefile b/Makefile index d93ad956e5..85f03c6aed 100644 --- a/Makefile +++ b/Makefile @@ -1339,6 +1339,7 @@ BASIC_CFLAGS += -DSHA1DC_FORCE_ALIGNED_ACCESS endif ifneq ($(filter leak,$(SANITIZERS)),) BASIC_CFLAGS += -DSUPPRESS_ANNOTATED_LEAKS +BASIC_CFLAGS += -O0 SANITIZE_LEAK = YesCompiledWithIt endif ifneq ($(filter address,$(SANITIZERS)),) -- 2.38.1.387.g7766a44117 ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH] Makefile: force -O0 when compiling with SANITIZE=leak 2022-10-18 20:15 ` [PATCH] Makefile: force -O0 when compiling with SANITIZE=leak Jeff King @ 2022-10-18 21:00 ` Ævar Arnfjörð Bjarmason 2022-10-19 7:10 ` Jeff King 2022-10-19 15:36 ` Junio C Hamano 1 sibling, 1 reply; 8+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2022-10-18 21:00 UTC (permalink / raw) To: Jeff King; +Cc: Taylor Blau, git, Victoria Dye, Junio C Hamano On Tue, Oct 18 2022, Jeff King wrote: Note that we're discussing two different leak here, git_config_push_env() one in https://lore.kernel.org/git/Yxl91jfycCo7O7Pp@coredump.intra.peff.net/; and now another one that originates in the exit() behavior in git.c. They're both triggered from t1300, but they're otherwise unrelated. > On Tue, Oct 18, 2022 at 03:40:45PM -0400, Jeff King wrote: > >> > I suppose we could temporarily mark t1300 as not passing with >> > SANITIZE=leak turned on, but I tend to agree with Peff that that feels >> > like a hack working around compiler behavior, that will ultimately >> > result in us playing whack-a-mole. >> > >> > So my preference would be to run the linux-leaks build with `-O0` in its >> > CFLAGS, optionally with a newer compiler if one is available for Focal. >> >> Yes, I still think disabling optimizations is the best path forward. Not >> just to avoid whack-a-mole, but this is also something we'd eventually >> need to confront when the code base really is leak-free. > > So here's a patch that does so. I think one of Ævar's reservations in > the other thread was not wanting to have to switch optimization levels > manually for various builds. I think the patch below should make things > Just Work, whether in CI or running a custom build locally. No, the opposite. I want to try out various combinations of compile flags & optimization levels and not have SANITIZE=leak second-guess it. Now if I want to compile with -O2 and SANITIZE=leak I'll need to monkeypatch the Makefile. I'd prefer not to need to do that, because while non-O0 is noisy sometimes, I've also found that it points (at least indirectly) to some useful edge-cases. > The compiler may eliminate "str" as a stack variable, and just leave it > in a register. The register is preserved through most of the function, > including across the call to some_func(), since we'd eventually need to > free it. But because die() is marked with NORETURN, the compiler knows > that it doesn't need to save registers, and just clobbers it. > > When die() eventually exits, the leak-checker runs. It looks in > registers and on the stack for any reference to the memory allocated by > str (which would indicate that it's not leaked), but can't find one. So > it reports it as a leak. > > Neither system is wrong, really. The C standard (mostly section 5.1.2.3) > defines an abstract machine, and compilers are allowed to modify the > program as long as the observable behavior of that abstract machine is > unchanged. Looking at random memory values on the stack is undefined > behavior, and not something that the optimizer needs to support. But > there really isn't any other way for a leak checker to work; it > inherently has to do undefined things like scouring memory for pointers. > So the two things are inherently at odds with each other. We can't fix > it by changing the code, because from the perspective of the program > running in an abstract machine, there is no leak. In the abstract program yes, there is a leak, and you can see that it leaks with e.g. valgrind regardless of optimization level: $ valgrind --leak-check=full --show-leak-kinds=all ./git --config-env=foo.flag= config --bool foo.flag [...] ==6540== 13 bytes in 1 blocks are still reachable in loss record 3 of 17 ==6540== at 0x48437B4: malloc (vg_replace_malloc.c:381) ==6540== by 0x40278B: do_xmalloc (wrapper.c:51) ==6540== by 0x402884: do_xmallocz (wrapper.c:85) ==6540== by 0x4028C0: xmallocz (wrapper.c:93) ==6540== by 0x4028FD: xmemdupz (wrapper.c:109) ==6540== by 0x402962: xstrndup (wrapper.c:115) ==6540== by 0x342F53: strip_path_suffix (path.c:1300) ==6540== by 0x2B4C89: system_prefix (exec-cmd.c:50) ==6540== by 0x2B5057: system_path (exec-cmd.c:268) ==6540== by 0x2C5E17: git_setup_gettext (gettext.c:109) ==6540== by 0x21DC57: main (common-main.c:44) I think I elaborated on some of this in https://lore.kernel.org/git/220218.861r00ib86.gmgdl@evledraar.gmail.com/ The tl;dr is that I think you use "leak" in the sense that valgrind talks about "still reachable" leaks, which is conceptually where -fsanitize=leak draws the line. I.e. it's not a malloc()/free() tracker, but tries to see what's "in scope". This is getting a bit academic, but I don't see how you can both say that the "compilers are allowed to modify the program as long as the observable behavior of that abstract machine is unchanged" *and* claim that e.g. the git_config_push_env() case isn't a real leak. Because surely the thing that makes it a "leak" by your definition (and what LSAN strives for) is that it's attributed to a variable that's "in scope", but the compiler is free to re-arrange all of that. Anyway, one reason I wanted to punt on that "git --config-env" issue is because we can entirely avoid the malloc()/free() there. See the "-- >8 --" below, but if we just malloc() after we do our assertions we can un-confuse clang. And that seems like a good idea in general, and re whether the "leak" is gone, at that point valgrind will stop reporting it, so we're really not leaking at all, not just in the "still reachable" sense. The reason I mention all of that is to try to define the problem here. I haven't seen cases where the compilers get it wrong about there being a leak, it's just that they're mis-categorizing them as "still reachable" or not, re your "abstract machine" summary. > This has caused real false positives in the past, like: > > - https://lore.kernel.org/git/patch-v3-5.6-9a44204c4c9-20211022T175227Z-avarab@gmail.com/ > > - https://lore.kernel.org/git/Yy4eo6500C0ijhk+@coredump.intra.peff.net/ > > - https://lore.kernel.org/git/Y07yeEQu+C7AH7oN@nand.local/ Now, the other cases in t1300 are from git.c using exit() instead of retur-ing to our main(). Among numerous other leak fixes I have queued up I have a fix for that, which fixes the other t1300 cases that have been reported now: https://github.com/avar/git/tree/avar/git-c-do-not-exit So as fuzzy as the "abstract machine" can be sometimes, I've found this useful. -- >8 -- Subject: config.c: do sanity checks before xmemdupz() Adjust the code added in 1ff21c05ba9 (config: store "git -c" variables using more robust format, 2021-01-12) to avoid allocating a "key" until after we've done the sanity checks on it. There was no reason to allocate it this early in the function, except because we were using the "env_name" pointer we were about to increment, let's save a copy of it instead. As a result of this early allocation the "clang" at higher optimization levels would report that we had a leak here. E.g. with Debian clang 14.0.6-2 with CFLAGS="-O3 -g": $ ./git --config-env=foo.flag= config --bool foo.flag fatal: missing environment variable name for configuration 'foo.flag' ================================================================= ==18018==ERROR: LeakSanitizer: detected memory leaks Direct leak of 9 byte(s) in 1 object(s) allocated from: #0 0x55e40aad7cd2 in __interceptor_malloc (/home/avar/g/git/git+0x78cd2) (BuildId: b89fa8f797ccb02ef1148beed300071a5f9b9ab1) #1 0x55e40ad41071 in do_xmalloc /home/avar/g/git/wrapper.c:51:8 #2 0x55e40ad41071 in do_xmallocz /home/avar/g/git/wrapper.c:85:8 #3 0x55e40ad41071 in xmallocz /home/avar/g/git/wrapper.c:93:9 #4 0x55e40ad41071 in xmemdupz /home/avar/g/git/wrapper.c:109:16 #5 0x55e40abec960 in git_config_push_env /home/avar/g/git/config.c:521:8 #6 0x55e40aadb8b9 in handle_options /home/avar/g/git/git.c:268:4 #7 0x55e40aada68d in cmd_main /home/avar/g/git/git.c:896:2 #8 0x55e40abae219 in main /home/avar/g/git/common-main.c:57:11 #9 0x7fbb5287e209 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16 #10 0x7fbb5287e2bb in __libc_start_main csu/../csu/libc-start.c:389:3 #11 0x55e40aaac130 in _start (/home/avar/g/git/git+0x4d130) (BuildId: b89fa8f797ccb02ef1148beed300071a5f9b9ab1) SUMMARY: LeakSanitizer: 9 byte(s) leaked in 1 allocation(s). Aborted This has come up before, with an earlier proposed fix of mine[1] to sweep this under the rug. The actual meaningful fix here is that we don't need to do this allocation at all. The only reason it's needed is because there's no variant of "sq_quote_buf()" in quote.c that takes a "len" parameter (but I have one locally). If we had that we could just pass a "key" and "key_len" to git_config_push_split_parameter() instead and avoid the allocation altogether. But in lieu of that better fix (which other API users of quote.c would benefit from) let's defer the allocation, which happens to fix the leak reporting on 1. https://lore.kernel.org/git/patch-1.1-fb2f0a660c0-20220908T153252Z-avarab@gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- config.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/config.c b/config.c index cbb5a3bab74..e49fd87bbd3 100644 --- a/config.c +++ b/config.c @@ -512,13 +512,14 @@ void git_config_push_parameter(const char *text) void git_config_push_env(const char *spec) { char *key; + const char *p; const char *env_name; const char *env_value; env_name = strrchr(spec, '='); if (!env_name) die(_("invalid config format: %s"), spec); - key = xmemdupz(spec, env_name - spec); + p = env_name; env_name++; if (!*env_name) die(_("missing environment variable name for configuration '%.*s'"), @@ -529,6 +530,7 @@ void git_config_push_env(const char *spec) die(_("missing environment variable '%s' for configuration '%.*s'"), env_name, (int)(env_name - spec - 1), spec); + key = xmemdupz(spec, p - spec); git_config_push_split_parameter(key, env_value); free(key); } -- 2.38.0.1093.gcd4a685f0b1 ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH] Makefile: force -O0 when compiling with SANITIZE=leak 2022-10-18 21:00 ` Ævar Arnfjörð Bjarmason @ 2022-10-19 7:10 ` Jeff King 2022-10-20 18:57 ` Rubén Justo 0 siblings, 1 reply; 8+ messages in thread From: Jeff King @ 2022-10-19 7:10 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Taylor Blau, git, Victoria Dye, Junio C Hamano On Tue, Oct 18, 2022 at 11:00:54PM +0200, Ævar Arnfjörð Bjarmason wrote: > Note that we're discussing two different leak here, > git_config_push_env() one in > https://lore.kernel.org/git/Yxl91jfycCo7O7Pp@coredump.intra.peff.net/; > and now another one that originates in the exit() behavior in > git.c. They're both triggered from t1300, but they're otherwise > unrelated. Yes, there are two different cases being discussed. But they _are_ related, in that they are caused by similar compiler optimizations. > > So here's a patch that does so. I think one of Ævar's reservations in > > the other thread was not wanting to have to switch optimization levels > > manually for various builds. I think the patch below should make things > > Just Work, whether in CI or running a custom build locally. > > No, the opposite. I want to try out various combinations of compile > flags & optimization levels and not have SANITIZE=leak second-guess it. > > Now if I want to compile with -O2 and SANITIZE=leak I'll need to > monkeypatch the Makefile. You don't have to patch it. You can do: make CFLAGS='-O2 -fsanitize=leak -DSUPPRESS_ANNOTATED_LEAKS' Yes, that's a mouthful. If you really really want it, we can make the "-O0" behavior conditional. But I remain entirely unconvinced that what you're trying to do is useful. > I'd prefer not to need to do that, because while non-O0 is noisy > sometimes, I've also found that it points (at least indirectly) to some > useful edge-cases. I haven't seen any evidence that leak-checking with -O2 produces anything of value. And I have seen several examples where it produces garbage. I guess you're referring here to the elaboration you give below in your message. But I think you're just wrong about it being useful. > The tl;dr is that I think you use "leak" in the sense that valgrind > talks about "still reachable" leaks, which is conceptually where > -fsanitize=leak draws the line. I.e. it's not a malloc()/free() tracker, > but tries to see what's "in scope". No, these "still reachable" cases are not at all interesting. If we end the program by returning up the stack, then perhaps they could be (but IMHO they are not generally, because it's perfectly reasonable to leave globals pointing to allocated memory at program exit). But if we exit in the middle of a function, and variables are left on the stack, what in the world is the use of a leak-checker complaining about that? It is not hurting anything, because by definition we have exited the program and released the memory. And it is near-impossible to remove these entirely. In the example we're discussing, there's a local variable in git_config_push_env() that _could_ be freed before calling die(). But not in the general case: - what about heap allocations in callers? Imagine a program like this: void foo(const char *str) { if (some_func(str)) die("bad str"); } void bar(void) { char *str = strdup("some string"); foo(str); free(str); } If foo() calls die(), then bar()'s str is a "still reachable" leak. But we can't fix it in any reasonable way. foo() doesn't know about freeing the string. And bar() doesn't know about calling die(). - likewise, we may actually need the heap-allocated string to call die! Consider this: void foo(void) { char *str = strdup("some string"); if (some_func(str)) die("bad str: %s", str); free(str); } You can't avoid a "still reachable" leak here, because die() would need to use the string, then free it, then exit. So if these "still reachable" leaks are not hurting anything, and if they are impossible to get rid of, why do we want to care about them? Doing so just causes pain with no benefit. > This is getting a bit academic, but I don't see how you can both say > that the "compilers are allowed to modify the program as long as the > observable behavior of that abstract machine is unchanged" *and* claim > that e.g. the git_config_push_env() case isn't a real leak. The words "observable behavior" are doing a lot of heavy lifting there. That is the behavior which a conforming C program can observe. And in this case, nothing is violated. We did not reach the free() call, so there was no need to make sure we had the parameters available. The C standard's abstract machine doesn't know about things like "the stack is memory that you can look at". Doing that is undefined behavior, and a program which cares about that is not conforming. I agree this is mostly academic. My point in bringing it up was just to say that no, this isn't a bug in the optimizer. Clobbering values for NORETURN is perfectly reasonable and valid according to the standard. And leak-checking, while basically undefined behavior from the perspective of the standard, is a useful tool. It's the interaction between the two that is ugly, and the most useful thing for us to do is avoid using both at the same time. > Because surely the thing that makes it a "leak" by your definition (and > what LSAN strives for) is that it's attributed to a variable that's "in > scope", but the compiler is free to re-arrange all of that. If you mean "not a leak", then yes: there's still an in-scope variable that points to it, which is what makes it not a leak. > Anyway, one reason I wanted to punt on that "git --config-env" issue is > because we can entirely avoid the malloc()/free() there. See the "-- >8 > --" below, but if we just malloc() after we do our assertions we can > un-confuse clang. > > And that seems like a good idea in general, and re whether the "leak" is > gone, at that point valgrind will stop reporting it, so we're really not > leaking at all, not just in the "still reachable" sense. I'm not convinced it's a good idea. The resulting code requires an extra variable and is (IMHO) slightly harder to understand. And what have we accomplished? We silenced one false positive, but we know there are others. And the linux-leaks CI job is failing on master _right now_, and the patch below doesn't fix that. That makes the job somewhat worthless. And worse, it makes all of CI less useful, because if it says "failed" on every build, people will start to ignore it. > The reason I mention all of that is to try to define the problem here. I > haven't seen cases where the compilers get it wrong about there being a > leak, it's just that they're mis-categorizing them as "still reachable" > or not, re your "abstract machine" summary. This paragraph makes it sound like we are mixing up a real leak and a "still reachable" leak. But it is mixing up "there is no leak and nothing to fix" with "there is a still reachable leak". > Now, the other cases in t1300 are from git.c using exit() instead of > retur-ing to our main(). > > Among numerous other leak fixes I have queued up I have a fix for that, > which fixes the other t1300 cases that have been reported now: > https://github.com/avar/git/tree/avar/git-c-do-not-exit I said it in the last round of discussion, and I'll say it again: this is the wrong direction. These are not real leaks, and we should not be twisting our code to try to avoid them. We should be fixing our leak-checking so that they don't come up. > The actual meaningful fix here is that we don't need to do this > allocation at all. The only reason it's needed is because there's no > variant of "sq_quote_buf()" in quote.c that takes a "len" > parameter (but I have one locally). > > If we had that we could just pass a "key" and "key_len" to > git_config_push_split_parameter() instead and avoid the allocation > altogether. But in lieu of that better fix (which other API users of > quote.c would benefit from) let's defer the allocation, which happens > to fix the leak reporting on So as you probably guessed, I absolutely hate this patch and think we should not take it. If we had a version of sq_quote_buf() that took a "len" parameter, then sure, it would be reasonable to use. But it absolutely is not worth caring about to silence a leak that is a false positive. There is _nothing_ wrong with the code as written. -Peff ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] Makefile: force -O0 when compiling with SANITIZE=leak 2022-10-19 7:10 ` Jeff King @ 2022-10-20 18:57 ` Rubén Justo 2022-10-21 6:04 ` Jeff King 0 siblings, 1 reply; 8+ messages in thread From: Rubén Justo @ 2022-10-20 18:57 UTC (permalink / raw) To: Jeff King, Ævar Arnfjörð Bjarmason Cc: Taylor Blau, git, Victoria Dye, Junio C Hamano On 19/10/22 9:10, Jeff King wrote: > Yes, that's a mouthful. If you really really want it, we can make the > "-O0" behavior conditional. But I remain entirely unconvinced that what > you're trying to do is useful. > >> I'd prefer not to need to do that, because while non-O0 is noisy >> sometimes, I've also found that it points (at least indirectly) to some >> useful edge-cases. > > I haven't seen any evidence that leak-checking with -O2 produces > anything of value. And I have seen several examples where it produces > garbage. I guess you're referring here to the elaboration you give below > in your message. But I think you're just wrong about it being useful. > >> The tl;dr is that I think you use "leak" in the sense that valgrind >> talks about "still reachable" leaks, which is conceptually where >> -fsanitize=leak draws the line. I.e. it's not a malloc()/free() tracker, >> but tries to see what's "in scope". > > No, these "still reachable" cases are not at all interesting. If we end > the program by returning up the stack, then perhaps they could be (but > IMHO they are not generally, because it's perfectly reasonable to leave > globals pointing to allocated memory at program exit). But if we exit in > the middle of a function, and variables are left on the stack, what in > the world is the use of a leak-checker complaining about that? > > It is not hurting anything, because by definition we have exited the > program and released the memory. And it is near-impossible to remove > these entirely. In the example we're discussing, there's a local > variable in git_config_push_env() that _could_ be freed before calling > die(). But not in the general case: > > - what about heap allocations in callers? Imagine a program like this: > > void foo(const char *str) > { > if (some_func(str)) > die("bad str"); > } > void bar(void) > { > char *str = strdup("some string"); > foo(str); > free(str); > } > > If foo() calls die(), then bar()'s str is a "still reachable" leak. > But we can't fix it in any reasonable way. foo() doesn't know about > freeing the string. And bar() doesn't know about calling die(). > > - likewise, we may actually need the heap-allocated string to call > die! Consider this: > > void foo(void) > { > char *str = strdup("some string"); > if (some_func(str)) > die("bad str: %s", str); > free(str); > } > > You can't avoid a "still reachable" leak here, because die() would > need to use the string, then free it, then exit. > > So if these "still reachable" leaks are not hurting anything, and if > they are impossible to get rid of, why do we want to care about them? > Doing so just causes pain with no benefit. > >> This is getting a bit academic, but I don't see how you can both say >> that the "compilers are allowed to modify the program as long as the >> observable behavior of that abstract machine is unchanged" *and* claim >> that e.g. the git_config_push_env() case isn't a real leak. > > The words "observable behavior" are doing a lot of heavy lifting there. > That is the behavior which a conforming C program can observe. And in > this case, nothing is violated. We did not reach the free() call, so > there was no need to make sure we had the parameters available. The C > standard's abstract machine doesn't know about things like "the stack is > memory that you can look at". Doing that is undefined behavior, and a > program which cares about that is not conforming. > > I agree this is mostly academic. My point in bringing it up was just to > say that no, this isn't a bug in the optimizer. Clobbering values for > NORETURN is perfectly reasonable and valid according to the standard. > And leak-checking, while basically undefined behavior from the > perspective of the standard, is a useful tool. It's the interaction > between the two that is ugly, and the most useful thing for us to do is > avoid using both at the same time. > >> Because surely the thing that makes it a "leak" by your definition (and >> what LSAN strives for) is that it's attributed to a variable that's "in >> scope", but the compiler is free to re-arrange all of that. > > If you mean "not a leak", then yes: there's still an in-scope variable > that points to it, which is what makes it not a leak. > >> Anyway, one reason I wanted to punt on that "git --config-env" issue is >> because we can entirely avoid the malloc()/free() there. See the "-- >8 >> --" below, but if we just malloc() after we do our assertions we can >> un-confuse clang. >> >> And that seems like a good idea in general, and re whether the "leak" is >> gone, at that point valgrind will stop reporting it, so we're really not >> leaking at all, not just in the "still reachable" sense. > > I'm not convinced it's a good idea. The resulting code requires an extra > variable and is (IMHO) slightly harder to understand. And what have we > accomplished? We silenced one false positive, but we know there are > others. And the linux-leaks CI job is failing on master _right now_, and > the patch below doesn't fix that. > > That makes the job somewhat worthless. And worse, it makes all of CI > less useful, because if it says "failed" on every build, people will > start to ignore it. > >> The reason I mention all of that is to try to define the problem here. I >> haven't seen cases where the compilers get it wrong about there being a >> leak, it's just that they're mis-categorizing them as "still reachable" >> or not, re your "abstract machine" summary. > > This paragraph makes it sound like we are mixing up a real leak and a > "still reachable" leak. But it is mixing up "there is no leak and > nothing to fix" with "there is a still reachable leak". > >> Now, the other cases in t1300 are from git.c using exit() instead of >> retur-ing to our main(). >> >> Among numerous other leak fixes I have queued up I have a fix for that, >> which fixes the other t1300 cases that have been reported now: >> https://github.com/avar/git/tree/avar/git-c-do-not-exit > > I said it in the last round of discussion, and I'll say it again: this > is the wrong direction. These are not real leaks, and we should not be > twisting our code to try to avoid them. We should be fixing our > leak-checking so that they don't come up. > >> The actual meaningful fix here is that we don't need to do this >> allocation at all. The only reason it's needed is because there's no >> variant of "sq_quote_buf()" in quote.c that takes a "len" >> parameter (but I have one locally). >> >> If we had that we could just pass a "key" and "key_len" to >> git_config_push_split_parameter() instead and avoid the allocation >> altogether. But in lieu of that better fix (which other API users of >> quote.c would benefit from) let's defer the allocation, which happens >> to fix the leak reporting on > > So as you probably guessed, I absolutely hate this patch and think we > should not take it. > > If we had a version of sq_quote_buf() that took a "len" parameter, then > sure, it would be reasonable to use. But it absolutely is not worth > caring about to silence a leak that is a false positive. There is > _nothing_ wrong with the code as written. > > -Peff > "-O2" is what goes public, testing it, directly or indirectly, is useful. Another thing is the pay-off.. We're losing a few eyeballs on "-O2" and some test performance in the way, in exchange of avoiding some false positives. And even with "-O0" the compiler is free to keep playing the whack-a-mole with us. I can easily find motivations for people to switch to "-O0" and do its local tests for example, and so we add eyeballs... but not so easily the other way around, to switch to "-O2". "False positives" makes me think the leak checker does its best. And the compiler. This "-O0 leak" is ignored by both, with "-O2": void func() { malloc(1024); exit(1); } UNLEAK(), "-O0" as a /second opinion/ /confirmation/, some attention to the false positives,... are things that doesn't sound so bad. Maybe there is a better way. Dunno. Un saludo. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] Makefile: force -O0 when compiling with SANITIZE=leak 2022-10-20 18:57 ` Rubén Justo @ 2022-10-21 6:04 ` Jeff King 0 siblings, 0 replies; 8+ messages in thread From: Jeff King @ 2022-10-21 6:04 UTC (permalink / raw) To: Rubén Justo Cc: Ævar Arnfjörð Bjarmason, Taylor Blau, git, Victoria Dye, Junio C Hamano On Thu, Oct 20, 2022 at 08:57:31PM +0200, Rubén Justo wrote: > "-O2" is what goes public, testing it, directly or indirectly, is useful. > Another thing is the pay-off.. Sort of. The leak-check build itself is not what goes public, so we have already diverged from a release build. But yes, if there were cases that -O2 caught that -O0 did not, that would be of interest. But are there? I.e., we already know of false positives with -O2. Are there false negatives with -O0? The example you give below goes the other way: > "False positives" makes me think the leak checker does its best. And the > compiler. This "-O0 leak" is ignored by both, with "-O2": > > void func() > { > malloc(1024); > exit(1); > } It is a false negative with -O2. Which again seems to argue for using -O0. > UNLEAK(), "-O0" as a /second opinion/ /confirmation/, some attention to the > false positives,... are things that doesn't sound so bad. Maybe there is a > better way. Dunno. I don't know why we'd want to prefer -O2 if we know it produces worse results, but don't have any cases where it produce better ones (or even a plausible explanation for why it might). Sure, we could squelch its false positives with UNLEAK(), but: - that's per-site work, versus setting -O0 once - UNLEAK() suppresses false positives, but it also would suppress true positives (e.g., if the code later changed and somebody introduced a leak). It seems better to avoid that if possible. -Peff ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] Makefile: force -O0 when compiling with SANITIZE=leak 2022-10-18 20:15 ` [PATCH] Makefile: force -O0 when compiling with SANITIZE=leak Jeff King 2022-10-18 21:00 ` Ævar Arnfjörð Bjarmason @ 2022-10-19 15:36 ` Junio C Hamano 1 sibling, 0 replies; 8+ messages in thread From: Junio C Hamano @ 2022-10-19 15:36 UTC (permalink / raw) To: Jeff King Cc: Taylor Blau, git, Victoria Dye, Ævar Arnfjörð Bjarmason Jeff King <peff@peff.net> writes: > Subject: Makefile: force -O0 when compiling with SANITIZE=leak > > Compiling with -O2 can interact badly with LSan's leak-checker, causing > false positives. Imagine a simplified example like: > > char *str = allocate_some_string(); > if (some_func(str) < 0) > die("bad str"); > free(str); > > The compiler may eliminate "str" as a stack variable, and just leave it > in a register. The register is preserved through most of the function, > including across the call to some_func(), since we'd eventually need to > free it. But because die() is marked with NORETURN, the compiler knows > that it doesn't need to save registers, and just clobbers it. Yup, this is one weak point in the runtime checker in that it must see the pointer held in the stack or register to ignore a still reachable cruft that does not matter upon program exit, which cannot work well with certain optimizations. Theoretically there may be no guarantee that -O0 would disable all optimizations that are potentially problematic to what LSan expects to see, but I fully agree with you that this is the right direction. Will queue. Thanks. ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2022-10-21 6:04 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2022-10-18 18:37 linux-leaks CI failure on master Taylor Blau 2022-10-18 19:40 ` Jeff King 2022-10-18 20:15 ` [PATCH] Makefile: force -O0 when compiling with SANITIZE=leak Jeff King 2022-10-18 21:00 ` Ævar Arnfjörð Bjarmason 2022-10-19 7:10 ` Jeff King 2022-10-20 18:57 ` Rubén Justo 2022-10-21 6:04 ` Jeff King 2022-10-19 15:36 ` Junio C Hamano
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).