linux-leaks CI failure on master

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* linux-leaks CI failure on master
@ 2022-10-18 18:37 Taylor Blau
  2022-10-18 19:40 ` Jeff King
  0 siblings, 1 reply; 8+ messages in thread
From: Taylor Blau @ 2022-10-18 18:37 UTC (permalink / raw)
  To: git
  Cc: Victoria Dye, Ævar Arnfjörð Bjarmason, Jeff King,
	Junio C Hamano

After Junio pushed out the tags for v2.38.1 and friends, I noticed that
our linux-leaks CI job is failing t1300.104 and t1300.109, claiming that
there is a leak here:

    Direct leak of 7 byte(s) in 1 object(s) allocated from:
    #0 0x7fc4a5b319c1 in __interceptor_malloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:54
    #1 0x7fc4a598638e in __GI___strdup /build/glibc-SzIz7B/glibc-2.31/string/strdup.c:42
    #2 0x55ba23538f7c in xstrdup wrapper.c:39
    #3 0x55ba233e258c in git_config_string config.c:1445
    #4 0x55ba233e6b06 in git_config_include config.c:439
    #5 0x55ba233e063f in get_value config.c:910
    #6 0x55ba233e063f in git_parse_source config.c:1092
    #7 0x55ba233e063f in do_config_from config.c:1937
    #8 0x55ba233e3d2d in do_config_from_file config.c:1965
    #9 0x55ba233e3d2d in git_config_from_file_with_options config.c:1987
    #10 0x55ba233e4793 in git_config_from_file config.c:1996
    #11 0x55ba233e4793 in do_git_config_sequence config.c:2143
    #12 0x55ba233e4793 in config_with_options config.c:2198
    #13 0x55ba233e4a50 in read_early_config config.c:2255
    #14 0x55ba233acb36 in alias_lookup alias.c:35
    #15 0x55ba232d0748 in handle_alias git.c:346
    #16 0x55ba232d0748 in run_argv git.c:851
    #17 0x55ba232d0748 in cmd_main git.c:921
    #18 0x55ba232cee03 in main common-main.c:57
    #19 0x7fc4a590b082 in __libc_start_main ../csu/libc-start.c:308
    #20 0x55ba232cee8d in _start (git+0x1fe8d)

I can't reproduce the failure locally with gcc (Debian 10.3.0-15)
10.3.0, but Victoria (CC'd) can reproduce the failure with 9.4.0.
Interestingly, the failure only appears when compiling with `-O2`, but
not `-O0` or `-O1`.

This is reminiscent to me of the discussion in:

  https://lore.kernel.org/git/Yy4eo6500C0ijhk+@coredump.intra.peff.net/

I'm not sure if I'm content to treat the 9.4.0 behavior as a compiler
bug, but definitely running the linux-leaks build with `-O2` is
suspicious.

I suppose we could temporarily mark t1300 as not passing with
SANITIZE=leak turned on, but I tend to agree with Peff that that feels
like a hack working around compiler behavior, that will ultimately
result in us playing whack-a-mole.

So my preference would be to run the linux-leaks build with `-O0` in its
CFLAGS, optionally with a newer compiler if one is available for Focal.

Thoughts?

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: linux-leaks CI failure on master
  2022-10-18 18:37 linux-leaks CI failure on master Taylor Blau
@ 2022-10-18 19:40 ` Jeff King
  2022-10-18 20:15   ` [PATCH] Makefile: force -O0 when compiling with SANITIZE=leak Jeff King
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff King @ 2022-10-18 19:40 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Victoria Dye, Ævar Arnfjörð Bjarmason,
	Junio C Hamano

On Tue, Oct 18, 2022 at 02:37:44PM -0400, Taylor Blau wrote:

> After Junio pushed out the tags for v2.38.1 and friends, I noticed that
> our linux-leaks CI job is failing t1300.104 and t1300.109, claiming that
> there is a leak here:
> [...]
>     #14 0x55ba233acb36 in alias_lookup alias.c:35
>     #15 0x55ba232d0748 in handle_alias git.c:346

These are the interesting part of the trace. alias_lookup() returns an
allocated string, and then split_cmdline() fails, so we call die() with:

  fatal: bad alias.split-cmdline-fix string: unclosed quote

So yeah, I think it's probably the same issue as discussed previously:
the compiler presumably puts alias_string into a register, and then
clobbers the register when calling die(), because it knows we're never
coming back.

> I can't reproduce the failure locally with gcc (Debian 10.3.0-15)
> 10.3.0, but Victoria (CC'd) can reproduce the failure with 9.4.0.
> Interestingly, the failure only appears when compiling with `-O2`, but
> not `-O0` or `-O1`.

I can't reproduce on debian unstable using any of gcc 9-12 (but note
gcc-9 here is 9.5.0). But then, I had trouble convincing gcc to find
_actual_ leaks with lsan. Clang is much more reliable for me, but it
turns up only the failure in t1300.135 that I reported earlier.

But given the trace above plus the findings on gcc 9.4.0, I feel pretty
confident saying this is another instance of the same problem.

> I'm not sure if I'm content to treat the 9.4.0 behavior as a compiler
> bug, but definitely running the linux-leaks build with `-O2` is
> suspicious.

It's definitely not a compiler bug. What the optimizer is doing is
perfectly reasonable; it's just that the leak checker interacts badly
with it.

> I suppose we could temporarily mark t1300 as not passing with
> SANITIZE=leak turned on, but I tend to agree with Peff that that feels
> like a hack working around compiler behavior, that will ultimately
> result in us playing whack-a-mole.
> 
> So my preference would be to run the linux-leaks build with `-O0` in its
> CFLAGS, optionally with a newer compiler if one is available for Focal.

Yes, I still think disabling optimizations is the best path forward. Not
just to avoid whack-a-mole, but this is also something we'd eventually
need to confront when the code base really is leak-free.

I don't think there's any need for a newer compiler. While the optimizer
behavior may change between versions, none of what we've seen is any
compiler being _wrong_, just different.

As a lesser change, I suspect that making NORETURN a noop in
leak-checking builds would help in practice, because the compiler
wouldn't realize that die() doesn't return. But it's not foolproof (the
same thing might trigger with a direct exit() call, or one that becomes
direct via inlining). Using -O0 is the more complete fix, and IMHO it's
not important to try to get optimal performance during the leak-checking
test run.

-Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH] Makefile: force -O0 when compiling with SANITIZE=leak
  2022-10-18 19:40 ` Jeff King
@ 2022-10-18 20:15   ` Jeff King
  2022-10-18 21:00     ` Ævar Arnfjörð Bjarmason
  2022-10-19 15:36     ` Junio C Hamano
  0 siblings, 2 replies; 8+ messages in thread
From: Jeff King @ 2022-10-18 20:15 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Victoria Dye, Ævar Arnfjörð Bjarmason,
	Junio C Hamano

On Tue, Oct 18, 2022 at 03:40:45PM -0400, Jeff King wrote:

> > I suppose we could temporarily mark t1300 as not passing with
> > SANITIZE=leak turned on, but I tend to agree with Peff that that feels
> > like a hack working around compiler behavior, that will ultimately
> > result in us playing whack-a-mole.
> > 
> > So my preference would be to run the linux-leaks build with `-O0` in its
> > CFLAGS, optionally with a newer compiler if one is available for Focal.
> 
> Yes, I still think disabling optimizations is the best path forward. Not
> just to avoid whack-a-mole, but this is also something we'd eventually
> need to confront when the code base really is leak-free.

So here's a patch that does so. I think one of Ævar's reservations in
the other thread was not wanting to have to switch optimization levels
manually for various builds. I think the patch below should make things
Just Work, whether in CI or running a custom build locally.

I tested it with the clang t1300.135 failure I can reproduce, and it
does indeed make the problem go away.

-- >8 --
Subject: Makefile: force -O0 when compiling with SANITIZE=leak

Compiling with -O2 can interact badly with LSan's leak-checker, causing
false positives. Imagine a simplified example like:

  char *str = allocate_some_string();
  if (some_func(str) < 0)
          die("bad str");
  free(str);

The compiler may eliminate "str" as a stack variable, and just leave it
in a register. The register is preserved through most of the function,
including across the call to some_func(), since we'd eventually need to
free it. But because die() is marked with NORETURN, the compiler knows
that it doesn't need to save registers, and just clobbers it.

When die() eventually exits, the leak-checker runs. It looks in
registers and on the stack for any reference to the memory allocated by
str (which would indicate that it's not leaked), but can't find one.  So
it reports it as a leak.

Neither system is wrong, really. The C standard (mostly section 5.1.2.3)
defines an abstract machine, and compilers are allowed to modify the
program as long as the observable behavior of that abstract machine is
unchanged. Looking at random memory values on the stack is undefined
behavior, and not something that the optimizer needs to support. But
there really isn't any other way for a leak checker to work; it
inherently has to do undefined things like scouring memory for pointers.
So the two things are inherently at odds with each other. We can't fix
it by changing the code, because from the perspective of the program
running in an abstract machine, there is no leak.

This has caused real false positives in the past, like:

  - https://lore.kernel.org/git/patch-v3-5.6-9a44204c4c9-20211022T175227Z-avarab@gmail.com/

  - https://lore.kernel.org/git/Yy4eo6500C0ijhk+@coredump.intra.peff.net/

  - https://lore.kernel.org/git/Y07yeEQu+C7AH7oN@nand.local/

This patch makes those go away by forcing -O0 when compiling with LSan.
There are a few ways we could do this:

  - we could just teach the linux-leaks CI job to set -O0. That's the
    smallest change, and means we wouldn't get spurious CI failures. But
    it doesn't help people looking for leaks manually or in a specific
    test (and because the problem depends on the vagaries of the
    optimizer, investigating these can waste a lot of time in
    head-scratching as the problem comes and goes)

  - we default to -O2 in CFLAGS; we could pull this out to a separate
    variable ("-O$(O)" or something) and modify "O" when LSan is in use.
    This is the most flexible, in that you could still build with "make
    O=2 SANITIZE=leak" if you really wanted to (say, for experimenting).
    But it would also fail to kick in if the user defines their own
    CFLAGS variable, which again leads to head-scratching.

  - we can just stick -O0 into BASIC_CFLAGS when enabling LSan. Since
    this comes after the user-provided CFLAGS, it will override any
    previous -O setting found there. This is more foolproof, albeit less
    flexible. If you want to experiment with an optimized leak-checking
    build, you'll have to put "-O2 -fsanitize=leak" into CFLAGS
    manually, rather than using our SANITIZE=leak Makefile magic.

Since the final one is the least likely to break in normal use, this
patch uses that approach.

The resulting build is a little slower, of course, but since LSan is
already about 2x slower than a regular build, another 10% slowdown isn't
that big a deal.

Signed-off-by: Jeff King <peff@peff.net>
---
 Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Makefile b/Makefile
index d93ad956e5..85f03c6aed 100644
--- a/Makefile
+++ b/Makefile
@@ -1339,6 +1339,7 @@ BASIC_CFLAGS += -DSHA1DC_FORCE_ALIGNED_ACCESS
 endif
 ifneq ($(filter leak,$(SANITIZERS)),)
 BASIC_CFLAGS += -DSUPPRESS_ANNOTATED_LEAKS
+BASIC_CFLAGS += -O0
 SANITIZE_LEAK = YesCompiledWithIt
 endif
 ifneq ($(filter address,$(SANITIZERS)),)
-- 
2.38.1.387.g7766a44117

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] Makefile: force -O0 when compiling with SANITIZE=leak
  2022-10-18 20:15   ` [PATCH] Makefile: force -O0 when compiling with SANITIZE=leak Jeff King
@ 2022-10-18 21:00     ` Ævar Arnfjörð Bjarmason
  2022-10-19  7:10       ` Jeff King
  2022-10-19 15:36     ` Junio C Hamano
  1 sibling, 1 reply; 8+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-10-18 21:00 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, Victoria Dye, Junio C Hamano


On Tue, Oct 18 2022, Jeff King wrote:

Note that we're discussing two different leak here,
git_config_push_env() one in
https://lore.kernel.org/git/Yxl91jfycCo7O7Pp@coredump.intra.peff.net/;
and now another one that originates in the exit() behavior in
git.c. They're both triggered from t1300, but they're otherwise
unrelated.

> On Tue, Oct 18, 2022 at 03:40:45PM -0400, Jeff King wrote:
>
>> > I suppose we could temporarily mark t1300 as not passing with
>> > SANITIZE=leak turned on, but I tend to agree with Peff that that feels
>> > like a hack working around compiler behavior, that will ultimately
>> > result in us playing whack-a-mole.
>> > 
>> > So my preference would be to run the linux-leaks build with `-O0` in its
>> > CFLAGS, optionally with a newer compiler if one is available for Focal.
>> 
>> Yes, I still think disabling optimizations is the best path forward. Not
>> just to avoid whack-a-mole, but this is also something we'd eventually
>> need to confront when the code base really is leak-free.
>
> So here's a patch that does so. I think one of Ævar's reservations in
> the other thread was not wanting to have to switch optimization levels
> manually for various builds. I think the patch below should make things
> Just Work, whether in CI or running a custom build locally.

No, the opposite. I want to try out various combinations of compile
flags & optimization levels and not have SANITIZE=leak second-guess it.

Now if I want to compile with -O2 and SANITIZE=leak I'll need to
monkeypatch the Makefile.

I'd prefer not to need to do that, because while non-O0 is noisy
sometimes, I've also found that it points (at least indirectly) to some
useful edge-cases.

> The compiler may eliminate "str" as a stack variable, and just leave it
> in a register. The register is preserved through most of the function,
> including across the call to some_func(), since we'd eventually need to
> free it. But because die() is marked with NORETURN, the compiler knows
> that it doesn't need to save registers, and just clobbers it.
>
> When die() eventually exits, the leak-checker runs. It looks in
> registers and on the stack for any reference to the memory allocated by
> str (which would indicate that it's not leaked), but can't find one.  So
> it reports it as a leak.
>
> Neither system is wrong, really. The C standard (mostly section 5.1.2.3)
> defines an abstract machine, and compilers are allowed to modify the
> program as long as the observable behavior of that abstract machine is
> unchanged. Looking at random memory values on the stack is undefined
> behavior, and not something that the optimizer needs to support. But
> there really isn't any other way for a leak checker to work; it
> inherently has to do undefined things like scouring memory for pointers.
> So the two things are inherently at odds with each other. We can't fix
> it by changing the code, because from the perspective of the program
> running in an abstract machine, there is no leak.

In the abstract program yes, there is a leak, and you can see that it
leaks with e.g. valgrind regardless of optimization level:

	$ valgrind --leak-check=full --show-leak-kinds=all ./git --config-env=foo.flag= config --bool foo.flag
        [...]
	==6540== 13 bytes in 1 blocks are still reachable in loss record 3 of 17
	==6540==    at 0x48437B4: malloc (vg_replace_malloc.c:381)
	==6540==    by 0x40278B: do_xmalloc (wrapper.c:51)
	==6540==    by 0x402884: do_xmallocz (wrapper.c:85)
	==6540==    by 0x4028C0: xmallocz (wrapper.c:93)
	==6540==    by 0x4028FD: xmemdupz (wrapper.c:109)
	==6540==    by 0x402962: xstrndup (wrapper.c:115)
	==6540==    by 0x342F53: strip_path_suffix (path.c:1300)
	==6540==    by 0x2B4C89: system_prefix (exec-cmd.c:50)
	==6540==    by 0x2B5057: system_path (exec-cmd.c:268)
	==6540==    by 0x2C5E17: git_setup_gettext (gettext.c:109)
	==6540==    by 0x21DC57: main (common-main.c:44)

I think I elaborated on some of this in
https://lore.kernel.org/git/220218.861r00ib86.gmgdl@evledraar.gmail.com/

The tl;dr is that I think you use "leak" in the sense that valgrind
talks about "still reachable" leaks, which is conceptually where
-fsanitize=leak draws the line. I.e. it's not a malloc()/free() tracker,
but tries to see what's "in scope".

This is getting a bit academic, but I don't see how you can both say
that the "compilers are allowed to modify the program as long as the
observable behavior of that abstract machine is unchanged" *and* claim
that e.g. the git_config_push_env() case isn't a real leak.

Because surely the thing that makes it a "leak" by your definition (and
what LSAN strives for) is that it's attributed to a variable that's "in
scope", but the compiler is free to re-arrange all of that.

Anyway, one reason I wanted to punt on that "git --config-env" issue is
because we can entirely avoid the malloc()/free() there. See the "-- >8
--" below, but if we just malloc() after we do our assertions we can
un-confuse clang.

And that seems like a good idea in general, and re whether the "leak" is
gone, at that point valgrind will stop reporting it, so we're really not
leaking at all, not just in the "still reachable" sense.

The reason I mention all of that is to try to define the problem here. I
haven't seen cases where the compilers get it wrong about there being a
leak, it's just that they're mis-categorizing them as "still reachable"
or not, re your "abstract machine" summary.

> This has caused real false positives in the past, like:
>
>   - https://lore.kernel.org/git/patch-v3-5.6-9a44204c4c9-20211022T175227Z-avarab@gmail.com/
>
>   - https://lore.kernel.org/git/Yy4eo6500C0ijhk+@coredump.intra.peff.net/
>
>   - https://lore.kernel.org/git/Y07yeEQu+C7AH7oN@nand.local/

Now, the other cases in t1300 are from git.c using exit() instead of
retur-ing to our main().

Among numerous other leak fixes I have queued up I have a fix for that,
which fixes the other t1300 cases that have been reported now:
https://github.com/avar/git/tree/avar/git-c-do-not-exit

So as fuzzy as the "abstract machine" can be sometimes, I've found this
useful.

-- >8 --
Subject: config.c: do sanity checks before xmemdupz()

Adjust the code added in 1ff21c05ba9 (config: store "git -c" variables
using more robust format, 2021-01-12) to avoid allocating a "key"
until after we've done the sanity checks on it.

There was no reason to allocate it this early in the function, except
because we were using the "env_name" pointer we were about to
increment, let's save a copy of it instead.

As a result of this early allocation the "clang" at higher
optimization levels would report that we had a leak here. E.g. with
Debian clang 14.0.6-2 with CFLAGS="-O3 -g":

	$ ./git --config-env=foo.flag= config --bool foo.flag
	fatal: missing environment variable name for configuration 'foo.flag'

	=================================================================
	==18018==ERROR: LeakSanitizer: detected memory leaks

	Direct leak of 9 byte(s) in 1 object(s) allocated from:
	    #0 0x55e40aad7cd2 in __interceptor_malloc (/home/avar/g/git/git+0x78cd2) (BuildId: b89fa8f797ccb02ef1148beed300071a5f9b9ab1)
	    #1 0x55e40ad41071 in do_xmalloc /home/avar/g/git/wrapper.c:51:8
	    #2 0x55e40ad41071 in do_xmallocz /home/avar/g/git/wrapper.c:85:8
	    #3 0x55e40ad41071 in xmallocz /home/avar/g/git/wrapper.c:93:9
	    #4 0x55e40ad41071 in xmemdupz /home/avar/g/git/wrapper.c:109:16
	    #5 0x55e40abec960 in git_config_push_env /home/avar/g/git/config.c:521:8
	    #6 0x55e40aadb8b9 in handle_options /home/avar/g/git/git.c:268:4
	    #7 0x55e40aada68d in cmd_main /home/avar/g/git/git.c:896:2
	    #8 0x55e40abae219 in main /home/avar/g/git/common-main.c:57:11
	    #9 0x7fbb5287e209 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
	    #10 0x7fbb5287e2bb in __libc_start_main csu/../csu/libc-start.c:389:3
	    #11 0x55e40aaac130 in _start (/home/avar/g/git/git+0x4d130) (BuildId: b89fa8f797ccb02ef1148beed300071a5f9b9ab1)

	SUMMARY: LeakSanitizer: 9 byte(s) leaked in 1 allocation(s).
	Aborted

This has come up before, with an earlier proposed fix of mine[1] to
sweep this under the rug.

The actual meaningful fix here is that we don't need to do this
allocation at all. The only reason it's needed is because there's no
variant of "sq_quote_buf()" in quote.c that takes a "len"
parameter (but I have one locally).

If we had that we could just pass a "key" and "key_len" to
git_config_push_split_parameter() instead and avoid the allocation
altogether. But in lieu of that better fix (which other API users of
quote.c would benefit from) let's defer the allocation, which happens
to fix the leak reporting on

1. https://lore.kernel.org/git/patch-1.1-fb2f0a660c0-20220908T153252Z-avarab@gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 config.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/config.c b/config.c
index cbb5a3bab74..e49fd87bbd3 100644
--- a/config.c
+++ b/config.c
@@ -512,13 +512,14 @@ void git_config_push_parameter(const char *text)
 void git_config_push_env(const char *spec)
 {
 	char *key;
+	const char *p;
 	const char *env_name;
 	const char *env_value;
 
 	env_name = strrchr(spec, '=');
 	if (!env_name)
 		die(_("invalid config format: %s"), spec);
-	key = xmemdupz(spec, env_name - spec);
+	p = env_name;
 	env_name++;
 	if (!*env_name)
 		die(_("missing environment variable name for configuration '%.*s'"),
@@ -529,6 +530,7 @@ void git_config_push_env(const char *spec)
 		die(_("missing environment variable '%s' for configuration '%.*s'"),
 		    env_name, (int)(env_name - spec - 1), spec);
 
+	key = xmemdupz(spec, p - spec);
 	git_config_push_split_parameter(key, env_value);
 	free(key);
 }
-- 
2.38.0.1093.gcd4a685f0b1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] Makefile: force -O0 when compiling with SANITIZE=leak
  2022-10-18 21:00     ` Ævar Arnfjörð Bjarmason
@ 2022-10-19  7:10       ` Jeff King
  2022-10-20 18:57         ` Rubén Justo
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff King @ 2022-10-19  7:10 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Taylor Blau, git, Victoria Dye, Junio C Hamano

On Tue, Oct 18, 2022 at 11:00:54PM +0200, Ævar Arnfjörð Bjarmason wrote:

> Note that we're discussing two different leak here,
> git_config_push_env() one in
> https://lore.kernel.org/git/Yxl91jfycCo7O7Pp@coredump.intra.peff.net/;
> and now another one that originates in the exit() behavior in
> git.c. They're both triggered from t1300, but they're otherwise
> unrelated.

Yes, there are two different cases being discussed. But they _are_
related, in that they are caused by similar compiler optimizations.

> > So here's a patch that does so. I think one of Ævar's reservations in
> > the other thread was not wanting to have to switch optimization levels
> > manually for various builds. I think the patch below should make things
> > Just Work, whether in CI or running a custom build locally.
> 
> No, the opposite. I want to try out various combinations of compile
> flags & optimization levels and not have SANITIZE=leak second-guess it.
>
> Now if I want to compile with -O2 and SANITIZE=leak I'll need to
> monkeypatch the Makefile.

You don't have to patch it. You can do:

  make CFLAGS='-O2 -fsanitize=leak -DSUPPRESS_ANNOTATED_LEAKS'

Yes, that's a mouthful. If you really really want it, we can make the
"-O0" behavior conditional. But I remain entirely unconvinced that what
you're trying to do is useful.

> I'd prefer not to need to do that, because while non-O0 is noisy
> sometimes, I've also found that it points (at least indirectly) to some
> useful edge-cases.

I haven't seen any evidence that leak-checking with -O2 produces
anything of value. And I have seen several examples where it produces
garbage. I guess you're referring here to the elaboration you give below
in your message. But I think you're just wrong about it being useful.

> The tl;dr is that I think you use "leak" in the sense that valgrind
> talks about "still reachable" leaks, which is conceptually where
> -fsanitize=leak draws the line. I.e. it's not a malloc()/free() tracker,
> but tries to see what's "in scope".

No, these "still reachable" cases are not at all interesting. If we end
the program by returning up the stack, then perhaps they could be (but
IMHO they are not generally, because it's perfectly reasonable to leave
globals pointing to allocated memory at program exit). But if we exit in
the middle of a function, and variables are left on the stack, what in
the world is the use of a leak-checker complaining about that?

It is not hurting anything, because by definition we have exited the
program and released the memory. And it is near-impossible to remove
these entirely. In the example we're discussing, there's a local
variable in git_config_push_env() that _could_ be freed before calling
die(). But not in the general case:

  - what about heap allocations in callers? Imagine a program like this:

      void foo(const char *str)
      {
        if (some_func(str))
          die("bad str");
      }
      void bar(void)
      {
        char *str = strdup("some string");
	foo(str);
	free(str);
      }

    If foo() calls die(), then bar()'s str is a "still reachable" leak.
    But we can't fix it in any reasonable way. foo() doesn't know about
    freeing the string. And bar() doesn't know about calling die().

  - likewise, we may actually need the heap-allocated string to call
    die! Consider this:

      void foo(void)
      {
        char *str = strdup("some string");
	if (some_func(str))
	  die("bad str: %s", str);
	free(str);
      }

    You can't avoid a "still reachable" leak here, because die() would
    need to use the string, then free it, then exit.

So if these "still reachable" leaks are not hurting anything, and if
they are impossible to get rid of, why do we want to care about them?
Doing so just causes pain with no benefit.

> This is getting a bit academic, but I don't see how you can both say
> that the "compilers are allowed to modify the program as long as the
> observable behavior of that abstract machine is unchanged" *and* claim
> that e.g. the git_config_push_env() case isn't a real leak.

The words "observable behavior" are doing a lot of heavy lifting there.
That is the behavior which a conforming C program can observe. And in
this case, nothing is violated. We did not reach the free() call, so
there was no need to make sure we had the parameters available. The C
standard's abstract machine doesn't know about things like "the stack is
memory that you can look at". Doing that is undefined behavior, and a
program which cares about that is not conforming.

I agree this is mostly academic. My point in bringing it up was just to
say that no, this isn't a bug in the optimizer. Clobbering values for
NORETURN is perfectly reasonable and valid according to the standard.
And leak-checking, while basically undefined behavior from the
perspective of the standard, is a useful tool. It's the interaction
between the two that is ugly, and the most useful thing for us to do is
avoid using both at the same time.

> Because surely the thing that makes it a "leak" by your definition (and
> what LSAN strives for) is that it's attributed to a variable that's "in
> scope", but the compiler is free to re-arrange all of that.

If you mean "not a leak", then yes: there's still an in-scope variable
that points to it, which is what makes it not a leak.

> Anyway, one reason I wanted to punt on that "git --config-env" issue is
> because we can entirely avoid the malloc()/free() there. See the "-- >8
> --" below, but if we just malloc() after we do our assertions we can
> un-confuse clang.
> 
> And that seems like a good idea in general, and re whether the "leak" is
> gone, at that point valgrind will stop reporting it, so we're really not
> leaking at all, not just in the "still reachable" sense.

I'm not convinced it's a good idea. The resulting code requires an extra
variable and is (IMHO) slightly harder to understand. And what have we
accomplished? We silenced one false positive, but we know there are
others. And the linux-leaks CI job is failing on master _right now_, and
the patch below doesn't fix that.

That makes the job somewhat worthless. And worse, it makes all of CI
less useful, because if it says "failed" on every build, people will
start to ignore it.

> The reason I mention all of that is to try to define the problem here. I
> haven't seen cases where the compilers get it wrong about there being a
> leak, it's just that they're mis-categorizing them as "still reachable"
> or not, re your "abstract machine" summary.

This paragraph makes it sound like we are mixing up a real leak and a
"still reachable" leak. But it is mixing up "there is no leak and
nothing to fix" with "there is a still reachable leak".

> Now, the other cases in t1300 are from git.c using exit() instead of
> retur-ing to our main().
> 
> Among numerous other leak fixes I have queued up I have a fix for that,
> which fixes the other t1300 cases that have been reported now:
> https://github.com/avar/git/tree/avar/git-c-do-not-exit

I said it in the last round of discussion, and I'll say it again: this
is the wrong direction. These are not real leaks, and we should not be
twisting our code to try to avoid them. We should be fixing our
leak-checking so that they don't come up.

> The actual meaningful fix here is that we don't need to do this
> allocation at all. The only reason it's needed is because there's no
> variant of "sq_quote_buf()" in quote.c that takes a "len"
> parameter (but I have one locally).
> 
> If we had that we could just pass a "key" and "key_len" to
> git_config_push_split_parameter() instead and avoid the allocation
> altogether. But in lieu of that better fix (which other API users of
> quote.c would benefit from) let's defer the allocation, which happens
> to fix the leak reporting on

So as you probably guessed, I absolutely hate this patch and think we
should not take it.

If we had a version of sq_quote_buf() that took a "len" parameter, then
sure, it would be reasonable to use. But it absolutely is not worth
caring about to silence a leak that is a false positive. There is
_nothing_ wrong with the code as written.

-Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Makefile: force -O0 when compiling with SANITIZE=leak
  2022-10-19  7:10       ` Jeff King
@ 2022-10-20 18:57         ` Rubén Justo
  2022-10-21  6:04           ` Jeff King
  0 siblings, 1 reply; 8+ messages in thread
From: Rubén Justo @ 2022-10-20 18:57 UTC (permalink / raw)
  To: Jeff King, Ævar Arnfjörð Bjarmason
  Cc: Taylor Blau, git, Victoria Dye, Junio C Hamano

On 19/10/22 9:10, Jeff King wrote:
 
> Yes, that's a mouthful. If you really really want it, we can make the
> "-O0" behavior conditional. But I remain entirely unconvinced that what
> you're trying to do is useful.
> 
>> I'd prefer not to need to do that, because while non-O0 is noisy
>> sometimes, I've also found that it points (at least indirectly) to some
>> useful edge-cases.
> 
> I haven't seen any evidence that leak-checking with -O2 produces
> anything of value. And I have seen several examples where it produces
> garbage. I guess you're referring here to the elaboration you give below
> in your message. But I think you're just wrong about it being useful.
> 
>> The tl;dr is that I think you use "leak" in the sense that valgrind
>> talks about "still reachable" leaks, which is conceptually where
>> -fsanitize=leak draws the line. I.e. it's not a malloc()/free() tracker,
>> but tries to see what's "in scope".
> 
> No, these "still reachable" cases are not at all interesting. If we end
> the program by returning up the stack, then perhaps they could be (but
> IMHO they are not generally, because it's perfectly reasonable to leave
> globals pointing to allocated memory at program exit). But if we exit in
> the middle of a function, and variables are left on the stack, what in
> the world is the use of a leak-checker complaining about that?
> 
> It is not hurting anything, because by definition we have exited the
> program and released the memory. And it is near-impossible to remove
> these entirely. In the example we're discussing, there's a local
> variable in git_config_push_env() that _could_ be freed before calling
> die(). But not in the general case:
> 
>   - what about heap allocations in callers? Imagine a program like this:
> 
>       void foo(const char *str)
>       {
>         if (some_func(str))
>           die("bad str");
>       }
>       void bar(void)
>       {
>         char *str = strdup("some string");
> 	foo(str);
> 	free(str);
>       }
> 
>     If foo() calls die(), then bar()'s str is a "still reachable" leak.
>     But we can't fix it in any reasonable way. foo() doesn't know about
>     freeing the string. And bar() doesn't know about calling die().
> 
>   - likewise, we may actually need the heap-allocated string to call
>     die! Consider this:
> 
>       void foo(void)
>       {
>         char *str = strdup("some string");
> 	if (some_func(str))
> 	  die("bad str: %s", str);
> 	free(str);
>       }
> 
>     You can't avoid a "still reachable" leak here, because die() would
>     need to use the string, then free it, then exit.
> 
> So if these "still reachable" leaks are not hurting anything, and if
> they are impossible to get rid of, why do we want to care about them?
> Doing so just causes pain with no benefit.
> 
>> This is getting a bit academic, but I don't see how you can both say
>> that the "compilers are allowed to modify the program as long as the
>> observable behavior of that abstract machine is unchanged" *and* claim
>> that e.g. the git_config_push_env() case isn't a real leak.
> 
> The words "observable behavior" are doing a lot of heavy lifting there.
> That is the behavior which a conforming C program can observe. And in
> this case, nothing is violated. We did not reach the free() call, so
> there was no need to make sure we had the parameters available. The C
> standard's abstract machine doesn't know about things like "the stack is
> memory that you can look at". Doing that is undefined behavior, and a
> program which cares about that is not conforming.
> 
> I agree this is mostly academic. My point in bringing it up was just to
> say that no, this isn't a bug in the optimizer. Clobbering values for
> NORETURN is perfectly reasonable and valid according to the standard.
> And leak-checking, while basically undefined behavior from the
> perspective of the standard, is a useful tool. It's the interaction
> between the two that is ugly, and the most useful thing for us to do is
> avoid using both at the same time.
> 
>> Because surely the thing that makes it a "leak" by your definition (and
>> what LSAN strives for) is that it's attributed to a variable that's "in
>> scope", but the compiler is free to re-arrange all of that.
> 
> If you mean "not a leak", then yes: there's still an in-scope variable
> that points to it, which is what makes it not a leak.
> 
>> Anyway, one reason I wanted to punt on that "git --config-env" issue is
>> because we can entirely avoid the malloc()/free() there. See the "-- >8
>> --" below, but if we just malloc() after we do our assertions we can
>> un-confuse clang.
>>
>> And that seems like a good idea in general, and re whether the "leak" is
>> gone, at that point valgrind will stop reporting it, so we're really not
>> leaking at all, not just in the "still reachable" sense.
> 
> I'm not convinced it's a good idea. The resulting code requires an extra
> variable and is (IMHO) slightly harder to understand. And what have we
> accomplished? We silenced one false positive, but we know there are
> others. And the linux-leaks CI job is failing on master _right now_, and
> the patch below doesn't fix that.
> 
> That makes the job somewhat worthless. And worse, it makes all of CI
> less useful, because if it says "failed" on every build, people will
> start to ignore it.
> 
>> The reason I mention all of that is to try to define the problem here. I
>> haven't seen cases where the compilers get it wrong about there being a
>> leak, it's just that they're mis-categorizing them as "still reachable"
>> or not, re your "abstract machine" summary.
> 
> This paragraph makes it sound like we are mixing up a real leak and a
> "still reachable" leak. But it is mixing up "there is no leak and
> nothing to fix" with "there is a still reachable leak".
> 
>> Now, the other cases in t1300 are from git.c using exit() instead of
>> retur-ing to our main().
>>
>> Among numerous other leak fixes I have queued up I have a fix for that,
>> which fixes the other t1300 cases that have been reported now:
>> https://github.com/avar/git/tree/avar/git-c-do-not-exit
> 
> I said it in the last round of discussion, and I'll say it again: this
> is the wrong direction. These are not real leaks, and we should not be
> twisting our code to try to avoid them. We should be fixing our
> leak-checking so that they don't come up.
> 
>> The actual meaningful fix here is that we don't need to do this
>> allocation at all. The only reason it's needed is because there's no
>> variant of "sq_quote_buf()" in quote.c that takes a "len"
>> parameter (but I have one locally).
>>
>> If we had that we could just pass a "key" and "key_len" to
>> git_config_push_split_parameter() instead and avoid the allocation
>> altogether. But in lieu of that better fix (which other API users of
>> quote.c would benefit from) let's defer the allocation, which happens
>> to fix the leak reporting on
> 
> So as you probably guessed, I absolutely hate this patch and think we
> should not take it.
> 
> If we had a version of sq_quote_buf() that took a "len" parameter, then
> sure, it would be reasonable to use. But it absolutely is not worth
> caring about to silence a leak that is a false positive. There is
> _nothing_ wrong with the code as written.
> 
> -Peff
> 

"-O2" is what goes public, testing it, directly or indirectly, is useful.
Another thing is the pay-off..

We're losing a few eyeballs on "-O2" and some test performance in the way, in
exchange of avoiding some false positives.  And even with "-O0" the compiler is
free to keep playing the whack-a-mole with us.

I can easily find motivations for people to switch to "-O0" and do its local
tests for example, and so we add eyeballs... but not so easily the other way
around, to switch to "-O2".

"False positives" makes me think the leak checker does its best.  And the
compiler.  This "-O0 leak" is ignored by both, with "-O2":

	void func()
	{
		malloc(1024);
		exit(1);
	}

UNLEAK(), "-O0" as a /second opinion/ /confirmation/, some attention to the
false positives,... are things that doesn't sound so bad.  Maybe there is a
better way.  Dunno.

Un saludo.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Makefile: force -O0 when compiling with SANITIZE=leak
  2022-10-20 18:57         ` Rubén Justo
@ 2022-10-21  6:04           ` Jeff King
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff King @ 2022-10-21  6:04 UTC (permalink / raw)
  To: Rubén Justo
  Cc: Ævar Arnfjörð Bjarmason, Taylor Blau, git,
	Victoria Dye, Junio C Hamano

On Thu, Oct 20, 2022 at 08:57:31PM +0200, Rubén Justo wrote:

> "-O2" is what goes public, testing it, directly or indirectly, is useful.
> Another thing is the pay-off..

Sort of. The leak-check build itself is not what goes public, so we have
already diverged from a release build. But yes, if there were cases that
-O2 caught that -O0 did not, that would be of interest.

But are there?

I.e., we already know of false positives with -O2. Are there false
negatives with -O0?

The example you give below goes the other way:

> "False positives" makes me think the leak checker does its best.  And the
> compiler.  This "-O0 leak" is ignored by both, with "-O2":
> 
> 	void func()
> 	{
> 		malloc(1024);
> 		exit(1);
> 	}

It is a false negative with -O2. Which again seems to argue for using
-O0.

> UNLEAK(), "-O0" as a /second opinion/ /confirmation/, some attention to the
> false positives,... are things that doesn't sound so bad.  Maybe there is a
> better way.  Dunno.

I don't know why we'd want to prefer -O2 if we know it produces worse
results, but don't have any cases where it produce better ones (or even
a plausible explanation for why it might). Sure, we could squelch its
false positives with UNLEAK(), but:

  - that's per-site work, versus setting -O0 once

  - UNLEAK() suppresses false positives, but it also would suppress true
    positives (e.g., if the code later changed and somebody introduced a
    leak). It seems better to avoid that if possible.

-Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Makefile: force -O0 when compiling with SANITIZE=leak
  2022-10-18 20:15   ` [PATCH] Makefile: force -O0 when compiling with SANITIZE=leak Jeff King
  2022-10-18 21:00     ` Ævar Arnfjörð Bjarmason
@ 2022-10-19 15:36     ` Junio C Hamano
  1 sibling, 0 replies; 8+ messages in thread
From: Junio C Hamano @ 2022-10-19 15:36 UTC (permalink / raw)
  To: Jeff King
  Cc: Taylor Blau, git, Victoria Dye,
	Ævar Arnfjörð Bjarmason

Jeff King <peff@peff.net> writes:

> Subject: Makefile: force -O0 when compiling with SANITIZE=leak
>
> Compiling with -O2 can interact badly with LSan's leak-checker, causing
> false positives. Imagine a simplified example like:
>
>   char *str = allocate_some_string();
>   if (some_func(str) < 0)
>           die("bad str");
>   free(str);
>
> The compiler may eliminate "str" as a stack variable, and just leave it
> in a register. The register is preserved through most of the function,
> including across the call to some_func(), since we'd eventually need to
> free it. But because die() is marked with NORETURN, the compiler knows
> that it doesn't need to save registers, and just clobbers it.

Yup, this is one weak point in the runtime checker in that it must
see the pointer held in the stack or register to ignore a still
reachable cruft that does not matter upon program exit, which cannot
work well with certain optimizations.

Theoretically there may be no guarantee that -O0 would disable all
optimizations that are potentially problematic to what LSan expects
to see, but I fully agree with you that this is the right direction.

Will queue.  Thanks.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-10-21  6:04 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-10-18 18:37 linux-leaks CI failure on master Taylor Blau
2022-10-18 19:40 ` Jeff King
2022-10-18 20:15   ` [PATCH] Makefile: force -O0 when compiling with SANITIZE=leak Jeff King
2022-10-18 21:00     ` Ævar Arnfjörð Bjarmason
2022-10-19  7:10       ` Jeff King
2022-10-20 18:57         ` Rubén Justo
2022-10-21  6:04           ` Jeff King
2022-10-19 15:36     ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).