* [PATCH 0/5] fixing thread races in linux-leaks CI
@ 2024-12-30 4:23 Jeff King
2024-12-30 4:24 ` [PATCH 1/5] test-lib: use individual lsan dir for --stress runs Jeff King
` (4 more replies)
0 siblings, 5 replies; 11+ messages in thread
From: Jeff King @ 2024-12-30 4:23 UTC (permalink / raw)
To: git; +Cc: René Scharfe, Patrick Steinhardt, Junio C Hamano
This series should fix the races we see in linux-leaks CI jobs. Or at
least some of them. We tried previously in 993d38a066 (index-pack: spawn
threads atomically, 2024-01-05), but it wasn't enough. This series
covers index-pack as well as git-grep, which are the two I've seen in
practice. We may run into more, but the general pattern should be
applicable if we do.
This series takes the minimal approach, and only provides a knob to use
pthread_barrier_t if your system supports it (and then turns it on for
our CI jobs). The obvious alternative is to actually implement wrappers
for other systems, and then use it everywhere. That doesn't buy us much
for this problem, but it's possible we'd find a use for barriers
elsewhere.
The even farther alternatives are:
1. Do nothing. This is arguably an LSan bug, which should be doing its
own locking to synchronize the at-exit leak check with the
per-thread setup code. I don't think we've even reported it there,
so one option would be to work with them. Even if we do that, it
might be nice to remove the false positive pain in the meantime
with a series like this (and certainly patches 1 and 2 here are
worth applying independently).
2. Change our posture on thread exit. In index-pack, the issue is that
one worker thread calls exit() via die(), taking down the whole
program (including other worker threads in unknown states). If it
installed a die() handler that called pthread_exit() instead, then
the main thread could see that failure and cancel/join the
remaining threads.
But I suspect it's not so simple:
a. We still want the error in the worker thread to cancel ongoing
work immediately. Right now index-pack just called
pthread_join() in sequence to wait for all threads to finish.
But instead we'd need to use some synchronization primitives
to report the error. Probably not too hard, but new
potentially tricky code.
b. In the git-grep case, it's actually the main thread that calls
die(). So the rule is not really "threads should install a die
handler that calls pthread_exit()", but _any_ die() call when
threads are established would need to cleanly ask all threads
to stop (whether by signaling them via the work queue, or just
hitting them with pthread_cancel; and that's assuming that
there's no race between LSan's setup code and cancel).
c. When we call die(), all bets are off about what state various
data structures are in. For example, a thread could be holding
a lock, and if it's cancelled by another thread, that lock
will remain. So it's not safe to do any real work in the other
threads, unless we start setting up pthread_cleanup handlers,
etc. That seems like a recipe for obscure, racy deadlocks.
I think that may be an appealing direction in the long run,
especially since die() itself is not thread-safe. But coupled with
libification, we probably want less "threads can die() cleanly" and
more "threads pass errors up the stack and report the problem back
to the work queue". Either way, though, it's a much bigger approach
change than I think we want just to try to address LSan races.
So I think we want something like this (either this, or the variant
where we really implement barriers everywhere) in the near-term.
[1/5]: test-lib: use individual lsan dir for --stress runs
[2/5]: Revert "index-pack: spawn threads atomically"
[3/5]: thread-utils: introduce optional barrier type
[4/5]: index-pack: work around LSan threading race with barrier
[5/5]: grep: work around LSan threading race with barrier
Makefile | 7 +++++++
builtin/grep.c | 8 ++++++++
builtin/index-pack.c | 8 ++++++--
ci/lib.sh | 1 +
t/test-lib.sh | 2 +-
thread-utils.h | 17 +++++++++++++++++
6 files changed, 40 insertions(+), 3 deletions(-)
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 1/5] test-lib: use individual lsan dir for --stress runs
2024-12-30 4:23 [PATCH 0/5] fixing thread races in linux-leaks CI Jeff King
@ 2024-12-30 4:24 ` Jeff King
2024-12-30 4:34 ` Jeff King
2024-12-30 17:08 ` Junio C Hamano
2024-12-30 4:26 ` [PATCH 2/5] Revert "index-pack: spawn threads atomically" Jeff King
` (3 subsequent siblings)
4 siblings, 2 replies; 11+ messages in thread
From: Jeff King @ 2024-12-30 4:24 UTC (permalink / raw)
To: git; +Cc: René Scharfe, Patrick Steinhardt, Junio C Hamano
When storing output in test-results/, we usually give each numbered run
in a --stress set its own output file. But we don't do that for storing
LSan logs, so something like:
./t0003-attributes.sh --stress
will have many scripts simultaneously creating, writing to, and deleting
the test-results/t0003-attributes.leak directory. This can cause logs
from one run to be attributed to another, spurious failures when
creation and deletion race, and so on.
This has always been broken, but nobody noticed because it's rare to do
a --stress run with LSan (since the point is for the code to run quickly
many times in order to hit races). But if you're trying to find a race
in the leak sanitizing code, it makes sense to use these together.
We can fix it by using $TEST_RESULTS_BASE, which already incorporates
the stress job suffix.
Signed-off-by: Jeff King <peff@peff.net>
---
t/test-lib.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/t/test-lib.sh b/t/test-lib.sh
index 1a67adb207..96f2dfb69d 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -331,7 +331,7 @@ TEST_RESULTS_BASE="$TEST_RESULTS_DIR/$TEST_NAME$TEST_STRESS_JOB_SFX"
TEST_RESULTS_SAN_FILE_PFX=trace
TEST_RESULTS_SAN_DIR_SFX=leak
TEST_RESULTS_SAN_FILE=
-TEST_RESULTS_SAN_DIR="$TEST_RESULTS_DIR/$TEST_NAME.$TEST_RESULTS_SAN_DIR_SFX"
+TEST_RESULTS_SAN_DIR="$TEST_RESULTS_BASE.$TEST_RESULTS_SAN_DIR_SFX"
TRASH_DIRECTORY="trash directory.$TEST_NAME$TEST_STRESS_JOB_SFX"
test -n "$root" && TRASH_DIRECTORY="$root/$TRASH_DIRECTORY"
case "$TRASH_DIRECTORY" in
--
bar
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 2/5] Revert "index-pack: spawn threads atomically"
2024-12-30 4:23 [PATCH 0/5] fixing thread races in linux-leaks CI Jeff King
2024-12-30 4:24 ` [PATCH 1/5] test-lib: use individual lsan dir for --stress runs Jeff King
@ 2024-12-30 4:26 ` Jeff King
2024-12-30 4:28 ` [PATCH 3/5] thread-utils: introduce optional barrier type Jeff King
` (2 subsequent siblings)
4 siblings, 0 replies; 11+ messages in thread
From: Jeff King @ 2024-12-30 4:26 UTC (permalink / raw)
To: git; +Cc: René Scharfe, Patrick Steinhardt, Junio C Hamano
This reverts commit 993d38a0669a8056d496797516e743e26b6b8b54.
That commit was trying to solve a race between LSan setting up the
threads stack and another thread calling exit(), by making sure that all
pthread_create() calls have finished before doing any work that might
trigger the exit().
But that isn't sufficient. The setup code actually runs in the
individual threads themselves, not in the spawning thread's call to
pthread_create(). So while it may have improved the race a bit, you can
still trigger it pretty quickly with:
make SANITIZE=leak
cd t
./t5309-pack-delta-cycles.sh --stress
Let's back out that failed attempt so we can try again.
Signed-off-by: Jeff King <peff@peff.net>
---
I'm still puzzled why I ever thought 93d38a066 solved this, because the
stress test above is exactly what I was using back then to test it.
Possibly LSan changed in some way, or possibly I just got lucky in my
stress run, but the more likely reason is that I simply bungled the
testing (e.g., if you accidentally use a non-LSan build, then it
naturally appears to work).
builtin/index-pack.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index d773809c4c..0b62b2589f 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1336,15 +1336,13 @@ static void resolve_deltas(struct pack_idx_option *opts)
base_cache_limit = opts->delta_base_cache_limit * nr_threads;
if (nr_threads > 1 || getenv("GIT_FORCE_THREADS")) {
init_thread();
- work_lock();
for (i = 0; i < nr_threads; i++) {
int ret = pthread_create(&thread_data[i].thread, NULL,
threaded_second_pass, thread_data + i);
if (ret)
die(_("unable to create thread: %s"),
strerror(ret));
}
- work_unlock();
for (i = 0; i < nr_threads; i++)
pthread_join(thread_data[i].thread, NULL);
cleanup_thread();
--
bar
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 3/5] thread-utils: introduce optional barrier type
2024-12-30 4:23 [PATCH 0/5] fixing thread races in linux-leaks CI Jeff King
2024-12-30 4:24 ` [PATCH 1/5] test-lib: use individual lsan dir for --stress runs Jeff King
2024-12-30 4:26 ` [PATCH 2/5] Revert "index-pack: spawn threads atomically" Jeff King
@ 2024-12-30 4:28 ` Jeff King
2024-12-30 7:03 ` Patrick Steinhardt
2024-12-30 4:29 ` [PATCH 4/5] index-pack: work around LSan threading race with barrier Jeff King
2024-12-30 4:30 ` [PATCH 5/5] grep: " Jeff King
4 siblings, 1 reply; 11+ messages in thread
From: Jeff King @ 2024-12-30 4:28 UTC (permalink / raw)
To: git; +Cc: René Scharfe, Patrick Steinhardt, Junio C Hamano
One thread primitive we don't yet support is a barrier: it waits for all
threads to reach a synchronization point before letting any of them
continue. This would be useful for avoiding the LSan race we see in
index-pack (and other places) by having all threads complete their
initialization before any of them start to do real work.
POSIX introduced a pthread_barrier_t in 2004, which does what we want.
But if we want to rely on it:
1. Our Windows pthread emulation would need a new set of wrapper
functions. There's a Synchronization Barrier primitive there, which
was introduced in Windows 8 (which is old enough for us to depend
on).
2. macOS (and possibly other systems) has pthreads but not
pthread_barrier_t. So there we'd have to implement our own barrier
based on the mutex and cond primitives.
Those are do-able, but since we only care about avoiding races in our
LSan builds, there's an easier way: make it a noop on systems without a
native pthread barrier.
This patch introduces a "maybe_thread_barrier" API. The clunky name
(rather than just using pthread_barrier directly) should hopefully clue
people in that on some systems it will do nothing. It's wired to a
Makefile knob which has to be triggered manually, and we enable it for
the linux-leaks CI jobs (since we know we'll have it there).
There are some other possible options:
- we could turn it on all the time for Linux systems based on uname.
But we really only care about it for LSan builds, and there is no
need to add extra code to regular builds.
- we could turn it on only for LSan builds. But that would break
builds on non-Linux platforms (like macOS) that otherwise should
support sanitizers.
- we could trigger only on the combination of Linux and LSan together.
This isn't too hard to do, but the uname check isn't completely
accurate. It is really about what your libc supports, and non-glibc
systems might not have it (though at least musl seems to).
So we'd risk breaking builds on those systems, which would need to
add a new knob. Though the upside would be that running local "make
SANITIZE=leak test" would be protected automatically.
And of course none of this protects LSan runs from races on systems
without pthread barriers. It's probably OK in practice to protect only
our CI jobs, though. The race is rare-ish and most leak-checking happens
through CI.
Signed-off-by: Jeff King <peff@peff.net>
---
Makefile | 7 +++++++
ci/lib.sh | 1 +
thread-utils.h | 17 +++++++++++++++++
3 files changed, 25 insertions(+)
diff --git a/Makefile b/Makefile
index 97e8385b66..2c6dad8a75 100644
--- a/Makefile
+++ b/Makefile
@@ -141,6 +141,10 @@ include shared.mak
#
# Define NO_PTHREADS if you do not have or do not want to use Pthreads.
#
+# Define THREAD_BARRIER_PTHREAD if your system has pthread_barrier_t. Barrier
+# support is optional and is only helpful when building with SANITIZE=leak, as
+# it is used to eliminate some races in the leak-checker.
+#
# Define NO_PREAD if you have a problem with pread() system call (e.g.
# cygwin1.dll before v1.5.22).
#
@@ -2079,6 +2083,9 @@ ifdef NO_PTHREADS
else
BASIC_CFLAGS += $(PTHREAD_CFLAGS)
EXTLIBS += $(PTHREAD_LIBS)
+ ifdef THREAD_BARRIER_PTHREAD
+ BASIC_CFLAGS += -DTHREAD_BARRIER_PTHREAD
+ endif
endif
ifdef HAVE_PATHS_H
diff --git a/ci/lib.sh b/ci/lib.sh
index 8885ee3c3f..6a1267fbcb 100755
--- a/ci/lib.sh
+++ b/ci/lib.sh
@@ -385,6 +385,7 @@ linux-musl)
;;
linux-leaks|linux-reftable-leaks)
export SANITIZE=leak
+ export THREAD_BARRIER_PTHREAD=1
;;
linux-asan-ubsan)
export SANITIZE=address,undefined
diff --git a/thread-utils.h b/thread-utils.h
index 4961487ed9..3df5be9916 100644
--- a/thread-utils.h
+++ b/thread-utils.h
@@ -53,5 +53,22 @@ int dummy_pthread_init(void *);
int online_cpus(void);
int init_recursive_mutex(pthread_mutex_t*);
+#ifdef THREAD_BARRIER_PTHREAD
+#define maybe_thread_barrier_t pthread_barrier_t
+#define maybe_thread_barrier_init pthread_barrier_init
+#define maybe_thread_barrier_wait pthread_barrier_wait
+#define maybe_thread_barrier_destroy pthread_barrier_destroy
+#else
+#define maybe_thread_barrier_t int
+static inline int maybe_thread_barrier_init(maybe_thread_barrier_t *b UNUSED,
+ void *attr UNUSED,
+ unsigned nr UNUSED)
+{
+ errno = ENOSYS;
+ return -1;
+}
+#define maybe_thread_barrier_wait(barrier)
+#define maybe_thread_barrier_destroy(barrier)
+#endif
#endif /* THREAD_COMPAT_H */
--
bar
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 4/5] index-pack: work around LSan threading race with barrier
2024-12-30 4:23 [PATCH 0/5] fixing thread races in linux-leaks CI Jeff King
` (2 preceding siblings ...)
2024-12-30 4:28 ` [PATCH 3/5] thread-utils: introduce optional barrier type Jeff King
@ 2024-12-30 4:29 ` Jeff King
2024-12-30 4:30 ` [PATCH 5/5] grep: " Jeff King
4 siblings, 0 replies; 11+ messages in thread
From: Jeff King @ 2024-12-30 4:29 UTC (permalink / raw)
To: git; +Cc: René Scharfe, Patrick Steinhardt, Junio C Hamano
We sometimes get false positives from our linux-leaks CI job because of
a race in LSan itself. The problem is that one thread is still
initializing its stack in LSan's code (and allocating memory to do so)
while anothe thread calls die(), taking down the whole process and
triggering a leak check.
The problem is described in more detail in 993d38a066 (index-pack: spawn
threads atomically, 2024-01-05), which tried to fix it by pausing worker
threads until all calls to pthread_create() had completed. But that's
not enough to fix the problem, because the LSan setup code runs in the
threads themselves. So even though pthread_create() has returned, we
have no idea if all threads actually finished their setup before letting
any of them do real work.
We can fix that by using a barrier inside the threads themselves,
waiting for all of them to hit the start of their main function before
any of them proceed.
You can test for the race by running:
make SANITIZE=leak THREAD_BARRIER_PTHREAD=YesOnLinux
cd t
./t5309-pack-delta-cycles.sh --stress
which fails quickly before this patch, and should run indefinitely
without it.
Signed-off-by: Jeff King <peff@peff.net>
---
builtin/index-pack.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 0b62b2589f..27b120f26c 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -185,6 +185,8 @@ static pthread_mutex_t deepest_delta_mutex;
static pthread_key_t key;
+static maybe_thread_barrier_t start_barrier;
+
static inline void lock_mutex(pthread_mutex_t *mutex)
{
if (threads_active)
@@ -209,6 +211,7 @@ static void init_thread(void)
if (show_stat)
pthread_mutex_init(&deepest_delta_mutex, NULL);
pthread_key_create(&key, NULL);
+ maybe_thread_barrier_init(&start_barrier, NULL, nr_threads);
CALLOC_ARRAY(thread_data, nr_threads);
for (i = 0; i < nr_threads; i++) {
thread_data[i].pack_fd = xopen(curr_pack, O_RDONLY);
@@ -231,6 +234,7 @@ static void cleanup_thread(void)
for (i = 0; i < nr_threads; i++)
close(thread_data[i].pack_fd);
pthread_key_delete(key);
+ maybe_thread_barrier_destroy(&start_barrier);
free(thread_data);
}
@@ -1100,6 +1104,8 @@ static int compare_ref_delta_entry(const void *a, const void *b)
static void *threaded_second_pass(void *data)
{
+ if (threads_active)
+ maybe_thread_barrier_wait(&start_barrier);
if (data)
set_thread_data(data);
for (;;) {
--
bar
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 5/5] grep: work around LSan threading race with barrier
2024-12-30 4:23 [PATCH 0/5] fixing thread races in linux-leaks CI Jeff King
` (3 preceding siblings ...)
2024-12-30 4:29 ` [PATCH 4/5] index-pack: work around LSan threading race with barrier Jeff King
@ 2024-12-30 4:30 ` Jeff King
4 siblings, 0 replies; 11+ messages in thread
From: Jeff King @ 2024-12-30 4:30 UTC (permalink / raw)
To: git; +Cc: René Scharfe, Patrick Steinhardt, Junio C Hamano
There's a race with LSan when spawning threads and one of the threads
calls die(). We worked around one such problem with index-pack in the
previous commit, but it exists in git-grep, too. You can see it with:
make SANITIZE=leak THREAD_BARRIER_PTHREAD=YesOnLinux
cd t
./t0003-attributes.sh --stress
which fails pretty quickly with:
==git==4096424==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 32 byte(s) in 1 object(s) allocated from:
#0 0x7f906de14556 in realloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:98
#1 0x7f906dc9d2c1 in __pthread_getattr_np nptl/pthread_getattr_np.c:180
#2 0x7f906de2500d in __sanitizer::GetThreadStackTopAndBottom(bool, unsigned long*, unsigned long*) ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:150
#3 0x7f906de25187 in __sanitizer::GetThreadStackAndTls(bool, unsigned long*, unsigned long*, unsigned long*, unsigned long*) ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:614
#4 0x7f906de17d18 in __lsan::ThreadStart(unsigned int, unsigned long long, __sanitizer::ThreadType) ../../../../src/libsanitizer/lsan/lsan_posix.cpp:53
#5 0x7f906de143a9 in ThreadStartFunc<false> ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:431
#6 0x7f906dc9bf51 in start_thread nptl/pthread_create.c:447
#7 0x7f906dd1a677 in __clone3 ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
As with the previous commit, we can fix this by inserting a barrier that
makes sure all threads have finished their setup before continuing. But
there's one twist in this case: the thread which calls die() is not one
of the worker threads, but the main thread itself!
So we need the main thread to wait in the barrier, too, until all
threads have gotten to it. And thus we initialize the barrier for
num_threads+1, to account for all of the worker threads plus the main
one.
If we then test as above, t0003 should run indefinitely.
Signed-off-by: Jeff King <peff@peff.net>
---
builtin/grep.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/builtin/grep.c b/builtin/grep.c
index d00ee76f24..61b2c27490 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -101,6 +101,9 @@ static pthread_cond_t cond_write;
/* Signalled when we are finished with everything. */
static pthread_cond_t cond_result;
+/* Synchronize the start of all threads */
+static maybe_thread_barrier_t start_barrier;
+
static int skip_first_line;
static void add_work(struct grep_opt *opt, struct grep_source *gs)
@@ -198,6 +201,8 @@ static void *run(void *arg)
int hit = 0;
struct grep_opt *opt = arg;
+ maybe_thread_barrier_wait(&start_barrier);
+
while (1) {
struct work_item *w = get_work();
if (!w)
@@ -229,6 +234,7 @@ static void start_threads(struct grep_opt *opt)
pthread_cond_init(&cond_add, NULL);
pthread_cond_init(&cond_write, NULL);
pthread_cond_init(&cond_result, NULL);
+ maybe_thread_barrier_init(&start_barrier, NULL, num_threads + 1);
grep_use_locks = 1;
enable_obj_read_lock();
@@ -248,6 +254,7 @@ static void start_threads(struct grep_opt *opt)
die(_("grep: failed to create thread: %s"),
strerror(err));
}
+ maybe_thread_barrier_wait(&start_barrier);
}
static int wait_all(void)
@@ -284,6 +291,7 @@ static int wait_all(void)
pthread_cond_destroy(&cond_add);
pthread_cond_destroy(&cond_write);
pthread_cond_destroy(&cond_result);
+ maybe_thread_barrier_destroy(&start_barrier);
grep_use_locks = 0;
disable_obj_read_lock();
--
bar
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 1/5] test-lib: use individual lsan dir for --stress runs
2024-12-30 4:24 ` [PATCH 1/5] test-lib: use individual lsan dir for --stress runs Jeff King
@ 2024-12-30 4:34 ` Jeff King
2024-12-30 17:08 ` Junio C Hamano
1 sibling, 0 replies; 11+ messages in thread
From: Jeff King @ 2024-12-30 4:34 UTC (permalink / raw)
To: git; +Cc: René Scharfe, Patrick Steinhardt, Junio C Hamano
On Sun, Dec 29, 2024 at 11:24:01PM -0500, Jeff King wrote:
> --
> bar
BTW, in case anybody is curious about this, it's because I still had:
GIT_VERSION = bar
in my config.mak from testing Patrick's GIT-VERSION-GEN series. :)
Naturally "foo" was taken from testing:
make GIT_VERSION=foo
-Peff
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/5] thread-utils: introduce optional barrier type
2024-12-30 4:28 ` [PATCH 3/5] thread-utils: introduce optional barrier type Jeff King
@ 2024-12-30 7:03 ` Patrick Steinhardt
2025-01-01 18:28 ` Jeff King
0 siblings, 1 reply; 11+ messages in thread
From: Patrick Steinhardt @ 2024-12-30 7:03 UTC (permalink / raw)
To: Jeff King; +Cc: git, René Scharfe, Junio C Hamano
On Sun, Dec 29, 2024 at 11:28:30PM -0500, Jeff King wrote:
> One thread primitive we don't yet support is a barrier: it waits for all
> threads to reach a synchronization point before letting any of them
> continue. This would be useful for avoiding the LSan race we see in
> index-pack (and other places) by having all threads complete their
> initialization before any of them start to do real work.
>
> POSIX introduced a pthread_barrier_t in 2004, which does what we want.
> But if we want to rely on it:
>
> 1. Our Windows pthread emulation would need a new set of wrapper
> functions. There's a Synchronization Barrier primitive there, which
> was introduced in Windows 8 (which is old enough for us to depend
> on).
>
> 2. macOS (and possibly other systems) has pthreads but not
> pthread_barrier_t. So there we'd have to implement our own barrier
> based on the mutex and cond primitives.
>
> Those are do-able, but since we only care about avoiding races in our
> LSan builds, there's an easier way: make it a noop on systems without a
> native pthread barrier.
I think this is fine for a first iteration. If we ever feel the need for
having barriers anywhere else for actual correctness we can iterate on
the solution and provide wrappers for those platforms.
> This patch introduces a "maybe_thread_barrier" API. The clunky name
> (rather than just using pthread_barrier directly) should hopefully clue
> people in that on some systems it will do nothing. It's wired to a
> Makefile knob which has to be triggered manually, and we enable it for
> the linux-leaks CI jobs (since we know we'll have it there).
>
> There are some other possible options:
>
> - we could turn it on all the time for Linux systems based on uname.
Tiniest of nits: these are all full sentences, which should start with
an upper-case letter.
> But we really only care about it for LSan builds, and there is no
> need to add extra code to regular builds.
>
> - we could turn it on only for LSan builds. But that would break
> builds on non-Linux platforms (like macOS) that otherwise should
> support sanitizers.
Mh. I'm not a huge fan of having extra code for just a subset of our
builds and think that having the code generally enabled on platforms
that support it is preferable to reduce the number of build variants.
But...
> - we could trigger only on the combination of Linux and LSan together.
> This isn't too hard to do, but the uname check isn't completely
> accurate. It is really about what your libc supports, and non-glibc
> systems might not have it (though at least musl seems to).
>
> So we'd risk breaking builds on those systems, which would need to
> add a new knob. Though the upside would be that running local "make
> SANITIZE=leak test" would be protected automatically.
... this is a fair remark. So I dunno.
> And of course none of this protects LSan runs from races on systems
> without pthread barriers. It's probably OK in practice to protect only
> our CI jobs, though. The race is rare-ish and most leak-checking happens
> through CI.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> Makefile | 7 +++++++
> ci/lib.sh | 1 +
> thread-utils.h | 17 +++++++++++++++++
> 3 files changed, 25 insertions(+)
>
> diff --git a/Makefile b/Makefile
> index 97e8385b66..2c6dad8a75 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -141,6 +141,10 @@ include shared.mak
> #
> # Define NO_PTHREADS if you do not have or do not want to use Pthreads.
> #
> +# Define THREAD_BARRIER_PTHREAD if your system has pthread_barrier_t. Barrier
> +# support is optional and is only helpful when building with SANITIZE=leak, as
> +# it is used to eliminate some races in the leak-checker.
> +#
> # Define NO_PREAD if you have a problem with pread() system call (e.g.
> # cygwin1.dll before v1.5.22).
> #
> @@ -2079,6 +2083,9 @@ ifdef NO_PTHREADS
> else
> BASIC_CFLAGS += $(PTHREAD_CFLAGS)
> EXTLIBS += $(PTHREAD_LIBS)
> + ifdef THREAD_BARRIER_PTHREAD
> + BASIC_CFLAGS += -DTHREAD_BARRIER_PTHREAD
> + endif
> endif
>
> ifdef HAVE_PATHS_H
Okay. The Meson equivalent would be:
diff --git a/meson.build b/meson.build
index a0654a3f24..db4c1e6929 100644
--- a/meson.build
+++ b/meson.build
@@ -788,6 +788,10 @@ threads = dependency('threads', required: false)
if threads.found()
libgit_dependencies += threads
build_options_config.set('NO_PTHREADS', '')
+
+ if get_option('b_sanitize').contains('leak') and compiler.has_function('pthread_barrier_init', dependencies: threads)
+ libgit_c_args += '-DTHREAD_BARRIER_PTHREAD'
+ endif
else
libgit_c_args += '-DNO_PTHREADS'
build_options_config.set('NO_PTHREADS', '1')
> diff --git a/thread-utils.h b/thread-utils.h
> index 4961487ed9..3df5be9916 100644
> --- a/thread-utils.h
> +++ b/thread-utils.h
> @@ -53,5 +53,22 @@ int dummy_pthread_init(void *);
> int online_cpus(void);
> int init_recursive_mutex(pthread_mutex_t*);
>
> +#ifdef THREAD_BARRIER_PTHREAD
> +#define maybe_thread_barrier_t pthread_barrier_t
> +#define maybe_thread_barrier_init pthread_barrier_init
> +#define maybe_thread_barrier_wait pthread_barrier_wait
> +#define maybe_thread_barrier_destroy pthread_barrier_destroy
> +#else
> +#define maybe_thread_barrier_t int
Out of curiosity: why did you pick a define here and not a typedef?
> +static inline int maybe_thread_barrier_init(maybe_thread_barrier_t *b UNUSED,
> + void *attr UNUSED,
> + unsigned nr UNUSED)
> +{
> + errno = ENOSYS;
> + return -1;
> +}
> +#define maybe_thread_barrier_wait(barrier)
> +#define maybe_thread_barrier_destroy(barrier)
So the way these wrappers are implemented it is not possible to check
for errors of `pthread_barrier_init()` et al. When the implementation
exists we do have return codes, but if it's stubbed out we don't.
I think we should align these two implementations so that it does become
possible to check for errors, or otherwise we wouldn't be using the
pthread APIs correctly. It does raise the question though whether we
should really return `-1` in the stubbed-out variant or whether we
should instead pretend as if things were alright.
An alternative would be to die in case the pthread-functions return an
error.
Patrick
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 1/5] test-lib: use individual lsan dir for --stress runs
2024-12-30 4:24 ` [PATCH 1/5] test-lib: use individual lsan dir for --stress runs Jeff King
2024-12-30 4:34 ` Jeff King
@ 2024-12-30 17:08 ` Junio C Hamano
1 sibling, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2024-12-30 17:08 UTC (permalink / raw)
To: Jeff King; +Cc: git, René Scharfe, Patrick Steinhardt
Jeff King <peff@peff.net> writes:
> When storing output in test-results/, we usually give each numbered run
> in a --stress set its own output file. But we don't do that for storing
> LSan logs, so something like:
>
> ./t0003-attributes.sh --stress
>
> will have many scripts simultaneously creating, writing to, and deleting
> the test-results/t0003-attributes.leak directory. This can cause logs
> from one run to be attributed to another, spurious failures when
> creation and deletion race, and so on.
>
> This has always been broken, but nobody noticed because it's rare to do
> a --stress run with LSan (since the point is for the code to run quickly
> many times in order to hit races). But if you're trying to find a race
> in the leak sanitizing code, it makes sense to use these together.
Makes sense.
> We can fix it by using $TEST_RESULTS_BASE, which already incorporates
> the stress job suffix.
Thanks. Queued.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/5] thread-utils: introduce optional barrier type
2024-12-30 7:03 ` Patrick Steinhardt
@ 2025-01-01 18:28 ` Jeff King
2025-01-03 6:45 ` Patrick Steinhardt
0 siblings, 1 reply; 11+ messages in thread
From: Jeff King @ 2025-01-01 18:28 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: git, René Scharfe, Junio C Hamano
On Mon, Dec 30, 2024 at 08:03:38AM +0100, Patrick Steinhardt wrote:
> > - we could turn it on only for LSan builds. But that would break
> > builds on non-Linux platforms (like macOS) that otherwise should
> > support sanitizers.
>
> Mh. I'm not a huge fan of having extra code for just a subset of our
> builds and think that having the code generally enabled on platforms
> that support it is preferable to reduce the number of build variants.
> But...
>
> > - we could trigger only on the combination of Linux and LSan together.
> > This isn't too hard to do, but the uname check isn't completely
> > accurate. It is really about what your libc supports, and non-glibc
> > systems might not have it (though at least musl seems to).
> >
> > So we'd risk breaking builds on those systems, which would need to
> > add a new knob. Though the upside would be that running local "make
> > SANITIZE=leak test" would be protected automatically.
>
> ... this is a fair remark. So I dunno.
Yeah. I do not like having the behavior differ only for LSan, or only on
Linux. But I also do not like having tricky threading code that we do
not otherwise need in all of the other builds. So it is really about
picking the least-bad option, and they all seem similarly bad to me. So
I picked the one that involved writing the least amount of code. ;)
> Okay. The Meson equivalent would be:
> [...]
Yeah, I figured it would need something similar (but for our CI it does
not yet matter). Do you want to prepare that as a patch on top? (Though
also see the message I'm about to send that we might be able to avoid
this series entirely!).
> > +#ifdef THREAD_BARRIER_PTHREAD
> > +#define maybe_thread_barrier_t pthread_barrier_t
> > +#define maybe_thread_barrier_init pthread_barrier_init
> > +#define maybe_thread_barrier_wait pthread_barrier_wait
> > +#define maybe_thread_barrier_destroy pthread_barrier_destroy
> > +#else
> > +#define maybe_thread_barrier_t int
>
> Out of curiosity: why did you pick a define here and not a typedef?
That's what we do for all of the NO_PTHREADS fallbacks, so I was just
following that style. I suspect it matters more there, because you would
not want to typedef pthread_t on a system that is building with
NO_PTHREADS but actually does define that type (whereas the "maybe"
variants are our own invention, so we can be confident those names won't
conflict).
I don't think it matters too much either way.
> > +static inline int maybe_thread_barrier_init(maybe_thread_barrier_t *b UNUSED,
> > + void *attr UNUSED,
> > + unsigned nr UNUSED)
> > +{
> > + errno = ENOSYS;
> > + return -1;
> > +}
> > +#define maybe_thread_barrier_wait(barrier)
> > +#define maybe_thread_barrier_destroy(barrier)
>
> So the way these wrappers are implemented it is not possible to check
> for errors of `pthread_barrier_init()` et al. When the implementation
> exists we do have return codes, but if it's stubbed out we don't.
Yeah, again I was following the NO_PTHREADS fallbacks defined above.
Perhaps those are different in that we wouldn't generally expect to ever
call them if we don't have threads at all (whereas these "maybe" ones
are meant to be quiet noops).
> I think we should align these two implementations so that it does become
> possible to check for errors, or otherwise we wouldn't be using the
> pthread APIs correctly. It does raise the question though whether we
> should really return `-1` in the stubbed-out variant or whether we
> should instead pretend as if things were alright.
You'd have to return a fake success if you expect to check errors, I'd
think. Unless you want to introduce a bunch of conditional code to say
"well, we tried to initialize a barrier but it didn't work, so let's not
actually wait".
For all of the existing pthread calls, we simply assume init stuff like
pthread_mutex_init() doesn't fail. Possibly we should change that
globally, but this is just following the same pattern.
> An alternative would be to die in case the pthread-functions return an
> error.
That's certainly akin to xmalloc(). But probably we don't want to go
that direction, simply because it works against libification in the long
run.
-Peff
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/5] thread-utils: introduce optional barrier type
2025-01-01 18:28 ` Jeff King
@ 2025-01-03 6:45 ` Patrick Steinhardt
0 siblings, 0 replies; 11+ messages in thread
From: Patrick Steinhardt @ 2025-01-03 6:45 UTC (permalink / raw)
To: Jeff King; +Cc: git, René Scharfe, Junio C Hamano
On Wed, Jan 01, 2025 at 01:28:26PM -0500, Jeff King wrote:
> On Mon, Dec 30, 2024 at 08:03:38AM +0100, Patrick Steinhardt wrote:
> > Okay. The Meson equivalent would be:
> > [...]
>
> Yeah, I figured it would need something similar (but for our CI it does
> not yet matter). Do you want to prepare that as a patch on top? (Though
> also see the message I'm about to send that we might be able to avoid
> this series entirely!).
Looks like your version was reverted in favor of the new version you
have sent. So I'll refrain from doing that :)
Patrick
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2025-01-03 6:45 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-30 4:23 [PATCH 0/5] fixing thread races in linux-leaks CI Jeff King
2024-12-30 4:24 ` [PATCH 1/5] test-lib: use individual lsan dir for --stress runs Jeff King
2024-12-30 4:34 ` Jeff King
2024-12-30 17:08 ` Junio C Hamano
2024-12-30 4:26 ` [PATCH 2/5] Revert "index-pack: spawn threads atomically" Jeff King
2024-12-30 4:28 ` [PATCH 3/5] thread-utils: introduce optional barrier type Jeff King
2024-12-30 7:03 ` Patrick Steinhardt
2025-01-01 18:28 ` Jeff King
2025-01-03 6:45 ` Patrick Steinhardt
2024-12-30 4:29 ` [PATCH 4/5] index-pack: work around LSan threading race with barrier Jeff King
2024-12-30 4:30 ` [PATCH 5/5] grep: " Jeff King
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).