Petr Tesařík writes: > On Fri, 05 Jan 2024 10:38:16 -0800 > Stephen Brennan wrote: > >> Hi Petr, >> >> I recently encountered a segmentation fault with libkdumpfile & drgn >> which appears to be related to the cache implementation. I've included >> the stack trace at the end of this message, since it's a bit of a longer >> one. The exact issue occurred with a test vmcore that I could probably >> share with you privately if you'd like. In any case, the reproducer is >> fairly straightforward in drgn code: >> >> for t in for_each_task(prog): >> prog.stack_trace(t) >> for t in for_each_task(prog): >> prog.stack_trace(t) >> >> The repetition is required, the segfault only occurs on the second >> iteration of the loop. Which, in hindsight, is a textbook sign that the >> issue has to do with caching. I'd expect that the issue is specific to >> this vmcore, it doesn't reproduce on others. >> >> I stuck that into a git bisect script and bisected the libkdumpfile >> commit that introduced it: >> >> commit 487a8042ea5da580e1fdb5b8f91c8bd7cad05cd6 >> Author: Petr Tesarik >> Date: Wed Jan 11 22:53:01 2023 +0100 >> >> Cache: Calculate eprobe in reinit_entry() >> >> If this function is called to reuse a ghost entry, the probe list >> has not been walked yet, so eprobe is left uninitialized. >> >> This passed the test case, because the correct old value was left >> on stack. Modify the test case to poison the stack. >> >> Signed-off-by: Petr Tesarik >> >> src/kdumpfile/cache.c | 6 +++++- >> src/kdumpfile/test-cache.c | 13 +++++++++++++ >> 2 files changed, 18 insertions(+), 1 deletion(-) > > This looks like a red herring to me. The cache most likely continues in > a corrupted state without this commit, which may mask the issue (until > it resurfaces later). I see, that makes a lot of sense. >> I haven't yet tried to debug the logic of the cache implementation and >> create a patch. I'm totally willing to try that, but I figured I would >> send this report to you first, to see if there's something obvious that >> sticks out to your eyes. > > No, but I should be able to recreate the issue if I get a log of the > cache API calls: > > - cache_alloc() - to know the number of elements > - cache_get_entry() > - cache_put_entry() > - cache_insert() > - cache_discard() > - cache_flush() - not likely after initialization, but... I went ahead and logged each of these calls as you suggested, I tried to log them at the beginning of the function call and always include the cache pointer, cache_entry, and the key. I took the resulting log and filtered it to just contain the most recently logged cache prior to the crash, compressed it, and attached it. For completeness, the patch I used is below (applies to tip branch 8254897 ("Merge pull request #78 from fweimer-rh/c99")). I'll also see if I can reproduce it based on the log. diff --git a/src/kdumpfile/cache.c b/src/kdumpfile/cache.c index 826d632..1f5820b 100644 --- a/src/kdumpfile/cache.c +++ b/src/kdumpfile/cache.c @@ -33,6 +33,7 @@ #include "kdumpfile-priv.h" #include +#include #include /** Simple cache. @@ -531,6 +532,7 @@ cache_get_entry(struct cache *cache, cache_key_t key) entry = cache_get_entry_noref(cache, key); if (entry) ++entry->refcnt; + fprintf(stderr, "%p: cache_get_entry(%lu) -> %p\n", cache, key, entry); return entry; } @@ -549,6 +551,7 @@ cache_insert(struct cache *cache, struct cache_entry *entry) { unsigned idx; + fprintf(stderr, "%p: cache_insert(%p) (key=%lu)\n", cache, entry, entry->key); if (cache_entry_valid(entry)) return; @@ -584,6 +587,7 @@ cache_insert(struct cache *cache, struct cache_entry *entry) void cache_put_entry(struct cache *cache, struct cache_entry *entry) { + fprintf(stderr, "%p: cache_put_entry(%p) (key=%lu)\n", cache, entry, entry->key); --entry->refcnt; } @@ -605,6 +609,8 @@ cache_discard(struct cache *cache, struct cache_entry *entry) { unsigned n, idx, eprobe; + fprintf(stderr, "%p: cache_discard_entry(%p) (key=%lu)\n", cache, entry, entry->key); + if (--entry->refcnt) return; if (cache_entry_valid(entry)) @@ -669,6 +675,7 @@ cache_flush(struct cache *cache) { unsigned i, n; + fprintf(stderr, "%p: cache_flush()\n", cache); cleanup_entries(cache); n = 2 * cache->cap; @@ -709,6 +716,8 @@ cache_alloc(unsigned n, size_t size) if (!cache) return cache; + fprintf(stderr, "%p: cache_alloc(%u, %zu)\n", cache, n, size); + cache->elemsize = size; cache->cap = n; cache->hits.number = 0;