Re: Segmentation fault with drgn + libkdumpfile

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Stephen Brennan <stephen.s.brennan@oracle.com>
To: "Petr Tesařík" <petr@tesarici.cz>
Cc: Omar Sandoval <osandov@osandov.com>, linux-debuggers@vger.kernel.org
Subject: Re: Segmentation fault with drgn + libkdumpfile
Date: Wed, 10 Jan 2024 11:48:22 -0800	[thread overview]
Message-ID: <87ply9ynp5.fsf@oracle.com> (raw)
In-Reply-To: <20240110190354.0ff56adc@meshulam.tesarici.cz>

Petr Tesařík <petr@tesarici.cz> writes:
> On Wed, 10 Jan 2024 14:49:28 +0100
> Petr Tesařík <petr@tesarici.cz> wrote:
>
>> On Wed, 10 Jan 2024 09:36:00 +0100
>> Petr Tesařík <petr@tesarici.cz> wrote:
>> 
>> > On Tue, 09 Jan 2024 17:40:15 -0800
>> > Stephen Brennan <stephen.s.brennan@oracle.com> wrote:
>> > 
>> > > Petr Tesařík <petr@tesarici.cz> writes:
>> > >   
>> > > > On Mon, 8 Jan 2024 21:40:08 +0100
>> > > > Petr Tesařík <petr@tesarici.cz> wrote:
>> > > >    
>> > > >> On Fri, 05 Jan 2024 13:53:15 -0800
>> > > >> Stephen Brennan <stephen.s.brennan@oracle.com> wrote:
>> > > >>     
>> > > >> > Petr Tesařík <petr@tesarici.cz> writes:      
>> > > >> > > On Fri, 05 Jan 2024 10:38:16 -0800
>> > > >> > > Stephen Brennan <stephen.s.brennan@oracle.com> wrote:
>> > > >> > >        
>> > > >> > >> Hi Petr,
>> > > >> > >> 
>> > > >> > >> I recently encountered a segmentation fault with libkdumpfile & drgn
>> > > >> > >> which appears to be related to the cache implementation. I've included
>> > > >> > >> the stack trace at the end of this message, since it's a bit of a longer
>> > > >> > >> one. The exact issue occurred with a test vmcore that I could probably
>> > > >> > >> share with you privately if you'd like. In any case, the reproducer is
>> > > >> > >> fairly straightforward in drgn code:
>> > > >> > >> 
>> > > >> > >> for t in for_each_task(prog):
>> > > >> > >>     prog.stack_trace(t)
>> > > >> > >> for t in for_each_task(prog):
>> > > >> > >>     prog.stack_trace(t)
>> > > >> > >> 
>> > > >> > >> The repetition is required, the segfault only occurs on the second
>> > > >> > >> iteration of the loop. Which, in hindsight, is a textbook sign that the
>> > > >> > >> issue has to do with caching. I'd expect that the issue is specific to
>> > > >> > >> this vmcore, it doesn't reproduce on others.
>> > > >> > >> 
>> > > >> > >> I stuck that into a git bisect script and bisected the libkdumpfile
>> > > >> > >> commit that introduced it:
>> > > >> > >> 
>> > > >> > >> commit 487a8042ea5da580e1fdb5b8f91c8bd7cad05cd6
>> > > >> > >> Author: Petr Tesarik <petr@tesarici.cz>
>> > > >> > >> Date:   Wed Jan 11 22:53:01 2023 +0100
>> > > >> > >> 
>> > > >> > >>     Cache: Calculate eprobe in reinit_entry()
>> > > >> > >> 
>> > > >> > >>     If this function is called to reuse a ghost entry, the probe list
>> > > >> > >>     has not been walked yet, so eprobe is left uninitialized.
>> > > >> > >> 
>> > > >> > >>     This passed the test case, because the correct old value was left
>> > > >> > >>     on stack. Modify the test case to poison the stack.
>> > > >> > >> 
>> > > >> > >>     Signed-off-by: Petr Tesarik <petr@tesarici.cz>
>> > > >> > >> 
>> > > >> > >>  src/kdumpfile/cache.c      |  6 +++++-
>> > > >> > >>  src/kdumpfile/test-cache.c | 13 +++++++++++++
>> > > >> > >>  2 files changed, 18 insertions(+), 1 deletion(-)        
>> > > >> > >
>> > > >> > > This looks like a red herring to me. The cache most likely continues in
>> > > >> > > a corrupted state without this commit, which may mask the issue (until
>> > > >> > > it resurfaces later).        
>> > > >> > 
>> > > >> > I see, that makes a lot of sense.
>> > > >> >       
>> > > >> > >> I haven't yet tried to debug the logic of the cache implementation and
>> > > >> > >> create a patch. I'm totally willing to try that, but I figured I would
>> > > >> > >> send this report to you first, to see if there's something obvious that
>> > > >> > >> sticks out to your eyes.        
>> > > >> > >
>> > > >> > > No, but I should be able to recreate the issue if I get a log of the
>> > > >> > > cache API calls:
>> > > >> > >
>> > > >> > > - cache_alloc() - to know the number of elements
>> > > >> > > - cache_get_entry()
>> > > >> > > - cache_put_entry()
>> > > >> > > - cache_insert()
>> > > >> > > - cache_discard()
>> > > >> > > - cache_flush() - not likely after initialization, but...        
>> > > >> > 
>> > > >> > I went ahead and logged each of these calls as you suggested, I tried to
>> > > >> > log them at the beginning of the function call and always include the
>> > > >> > cache pointer, cache_entry, and the key. I took the resulting log and
>> > > >> > filtered it to just contain the most recently logged cache prior to the
>> > > >> > crash, compressed it, and attached it. For completeness, the patch
>> > > >> > I used is below (applies to tip branch 8254897 ("Merge pull request #78
>> > > >> > from fweimer-rh/c99")).
>> > > >> > 
>> > > >> > I'll also see if I can reproduce it based on the log.      
>> > > >> 
>> > > >> Thank you for the log. I haven't had much time to look at it, but the
>> > > >> first line is a good hint already:
>> > > >> 
>> > > >> 0x56098b68c4c0: cache_alloc(1024, 0)
>> > > >> 
>> > > >> Zero size means the data pointers are managed by the caller, so this
>> > > >> must be the cache of mmap()'ed segments. That's the only cache which
>> > > >> installs a cleanup callback with set_cache_entry_cleanup(). There is
>> > > >> only one call to the cleanup callback for evicted entries in cache.c:
>> > > >> 
>> > > >> 		/* Get an unused cached entry. */
>> > > >> 		if (cs->nuprobe != 0 &&
>> > > >> 		    (cs->nuprec == 0 || cache->nprobe + bias > cache->dprobe))
>> > > >> 			evict = evict_probe(cache, cs);
>> > > >> 		else
>> > > >> 			evict = evict_prec(cache, cs);
>> > > >> 		if (cache->entry_cleanup)
>> > > >> 			cache->entry_cleanup(cache->cleanup_data, evict);
>> > > >> 
>> > > >> The entries can be evicted from the probe partition or from the precious
>> > > >> partition. This might be relevant. Please, can you re-run and log where
>> > > >> the evict entry comes from?    
>> > > >
>> > > > I found some time this morning, and it wouldn't help. Because of a bug
>> > > > in fcache_new(), the number of elements in the cache is big enough that
>> > > > cache entries are never evicted in your case. It's quite weird to hit a
>> > > > cache metadata bug after elements have been inserted. FWIW I am not
>> > > > able to reproduce the bug by replaying the logged file read pattern.
>> > > >
>> > > > Since you have a reliable reproducer, it cannot be a Heisenbug. But it
>> > > > could be caused by the other cache - the cache of decompressed pages.
>> > > > Do you know for sure that lzo1x_decompress_safe() crashes while trying
>> > > > to _read_ from the input buffer, and not while trying to _write_ to the
>> > > > output buffer?    
>> > > 
>> > > Hi Petr,
>> > > 
>> > > Sorry for the delay here, I got pulled into other issues and am trying
>> > > to attend to all my work in a round-robin fashion :)  
>> > 
>> > Hi Stephen,
>> > 
>> > that's fine, I also work on this only as time permits. ;-)
>> > 
>> > > The fault is definitely in lzo1x_decompress_safe() *writing* to address
>> > > 0. I fetched debuginfo for all the necessary libraries and we see the
>> > > following stack trace:
>> > > 
>> > > %<-----------------------
>> > > #0  0x00007fcd9adddef3 in lzo1x_decompress_safe (in=<optimized out>,
>> > >     in_len=<optimized out>, out=0x0, out_len=0x7ffdee2c1388, wrkmem=<optimized out>)
>> > >     at src/lzo1x_d.ch:120
>> > > #1  0x00007fcd9ae25be1 in diskdump_read_page (pio=0x7ffdee2c1590) at diskdump.c:584
>> > > #2  0x00007fcd9ae32d4d in _kdumpfile_priv_cache_get_page (pio=0x7ffdee2c1590,
>> > >     fn=0x7fcd9ae257ae <diskdump_read_page>) at read.c:69
>> > > #3  0x00007fcd9ae25e44 in diskdump_get_page (pio=0x7ffdee2c1590) at diskdump.c:647
>> > > #4  0x00007fcd9ae32be0 in get_page (pio=0x7ffdee2c1590)
>> > >     at /home/stepbren/repos/libkdumpfile/src/kdumpfile/kdumpfile-priv.h:1512
>> > > #5  0x00007fcd9ae32ed4 in get_page_xlat (pio=0x7ffdee2c1590) at read.c:126
>> > > #6  0x00007fcd9ae32f22 in get_page_maybe_xlat (pio=0x7ffdee2c1590) at read.c:137
>> > > #7  0x00007fcd9ae32fb1 in _kdumpfile_priv_read_locked (ctx=0x55745bfca8f0,
>> > >     as=KDUMP_KVADDR, addr=18446612133360081960, buffer=0x7ffdee2c17df,
>> > >     plength=0x7ffdee2c1698) at read.c:169
>> > > #8  0x00007fcd9ae330dd in kdump_read (ctx=0x55745bfca8f0, as=KDUMP_KVADDR,
>> > >     addr=18446612133360081960, buffer=0x7ffdee2c17df, plength=0x7ffdee2c1698)
>> > >     at read.c:196
>> > > #9  0x00007fcd9afb0cc4 in drgn_read_kdump (buf=0x7ffdee2c17df,
>> > >     address=18446612133360081960, count=4, offset=18446612133360081960,
>> > >     arg=0x55745bfca8f0, physical=false) at ../../libdrgn/kdump.c:73
>> > > %<-----------------------
>> > > 
>> > > In frame 1 where we are calling the decompressor:
>> > > 
>> > > %<-----------------------
>> > > (gdb) frame 1
>> > > #1  0x00007fcd9ae25be1 in diskdump_read_page (pio=0x7ffdee2c1590) at diskdump.c:584
>> > > 584                     int ret = lzo1x_decompress_safe(fch.data, pd.size,
>> > > (gdb) list
>> > > 579                     if (ret != KDUMP_OK)
>> > > 580                             return ret;
>> > > 581             } else if (pd.flags & DUMP_DH_COMPRESSED_LZO) {
>> > > 582     #if USE_LZO
>> > > 583                     lzo_uint retlen = get_page_size(ctx);
>> > > 584                     int ret = lzo1x_decompress_safe(fch.data, pd.size,
>> > > 585                                                     pio->chunk.data,
>> > > 586                                                     &retlen,
>> > > 587                                                     LZO1X_MEM_DECOMPRESS);
>> > > 588                     fcache_put_chunk(&fch);
>> > > (gdb) p retlen
>> > > $7 = 0  
>> > 
>> > This is a bit weird. Looking at liblzo sources, it seems to me that
>> > the output length is not changed until right before returning from
>> > lzo1x_decompress_safe().
>> > 
>> > > (gdb) p pio->chunk.data
>> > > $8 = (void *) 0x0  
>> > 
>> > OK, here's our immediate root cause. ;-)
>> > 
>> > > (gdb) p fch.data
>> > > $9 = (void *) 0x7fcd7cc33da4  
>> > 
>> > This looks sane.
>> > 
>> > > (gdb) p pd.size
>> > > $10 = 816  
>> > 
>> > This also looks sane.
>> > 
>> > > %<-----------------------
>> > > 
>> > > As far as I can tell, pio->chunk.data comes directly from the
>> > > cache_get_page() function in frame 2:
>> > > 
>> > > %<-----------------------
>> > > (gdb) up
>> > > #2  0x00007fcd9ae32d4d in _kdumpfile_priv_cache_get_page (pio=0x7ffdee2c1590,
>> > >     fn=0x7fcd9ae257ae <diskdump_read_page>) at read.c:69
>> > > 69              ret = fn(pio);
>> > > (gdb) list
>> > > 64              pio->chunk.data = entry->data;
>> > > 65              pio->chunk.embed_fces->ce = entry;
>> > > 66              if (cache_entry_valid(entry))
>> > > 67                      return KDUMP_OK;
>> > > 68
>> > > 69              ret = fn(pio);
>> > > 70              mutex_lock(&ctx->shared->cache_lock);
>> > > 71              if (ret == KDUMP_OK)
>> > > 72                      cache_insert(pio->chunk.embed_fces->cache, entry);
>> > > 73              else
>> > > (gdb) p *entry
>> > > $11 = {key = 1045860353, state = cs_precious, next = 626, prev = 626, refcnt = 1,
>> > >   data = 0x0}  
>> > 
>> > The key (0x3e569000 | ADDRXLAT_MACHPHYSADDR) corresponds to the
>> > requested virtual address 0xffff88003e569c28.
>> > 
>> > > (gdb) p *pio
>> > > $12 = {ctx = 0x55745bfca8f0, addr = {addr = 1045860352, as = ADDRXLAT_MACHPHYSADDR},
>> > >   chunk = {data = 0x0, nent = 1, {embed_fces = {{data = 0xffff880ff1470788,
>> > >           len = 140728599320032, ce = 0x55745c1003d8, cache = 0x55745c0fb540}, {
>> > >           data = 0x55745bfd42f0, len = 140728599320112,
>> > >           ce = 0x7fcd9ae330ef <kdump_read+102>, cache = 0xffff88003e569c28}},
>> > >       fces = 0xffff880ff1470788}}}
>> > > %<-----------------------  
>> > 
>> > 
>> > Looking at pio->chunk->embed_fces->ce, struct cache_entry is at
>> > 0x55745c1003d8. Assuming that sizeof(struct cache_entry) == 32 on your
>> > system, this is element 626 in the cache entry array. The next and
>> > prev indices are also 626, which looks good, because cache->ninflight
>> > is 1, so this is the only element in the (circular) in-flight list.
>> > 
>> > Since state is cs_precious, but the data was discarded, this cache
>> > entry has just been recovered from a ghost partition, evicting another
>> > entry, and _that_ entry had a NULL data pointer.
>> > 
>> > It would be really helpful if I could get the log for this cache
>> > instead of the one you posted earlier.
>> 
>> Let's recap:
>> 
>> 1. Data indicates that a reused ghost entry has a NULL data pointer.
>> 2. Reverting commit 487a8042ea5da580e1fdb5b8f91c8bd7cad05cd6 masks the issue.
>> 
>> My conclusion is that the data pointer was taken from an entry in the
>> unused partition. This partition is empty, except when the cache is
>> new, or after calling cache_discard(). Given the statistic counter
>> values, the latter is the case here,
>> 
>> I think I found the bug: reinit_entry() finds the unused partition by
>> skipping over the ghost probe partition, but if the target itself is a
>> reused ghost probe entry, cache->ngprobe was already decremented in
>> get_ghost_or_missed_entry().
>> 
>> I'm going to write a test case and fix.
>
> I was able to write a test case for the NULL data pointer. The bug
> should be fixed by commit e63dab40c8cefdfc013dc57915de9852dbb283e4.
>
> Stephen, can you rebuild and verify with your dump file that this was
> indeed the same bug?

Hi Petr,

I went ahead and pulled the latest from tip and ran my reproducer. The
segmentation fault is indeed gone!

I tested c10036e ("cache: Optimize data reclaim for missed entries"), to
be exact, which contains e63dab4 as well.

Thank you so much for taking the time to go back & forth with me on this
one! I suppose there's a reason that caching is one of the two hard
problems in computer science ;)

I owe you a beverage of your choice next time we see each other!

Thanks,
Stephen

next prev parent reply	other threads:[~2024-01-10 19:48 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-05 18:38 Segmentation fault with drgn + libkdumpfile Stephen Brennan
2024-01-05 19:23 ` Petr Tesařík
2024-01-05 21:53   ` Stephen Brennan
2024-01-08 20:40     ` Petr Tesařík
2024-01-09  9:06       ` Petr Tesařík
2024-01-10  1:40         ` Stephen Brennan
2024-01-10  8:36           ` Petr Tesařík
2024-01-10 13:49             ` Petr Tesařík
2024-01-10 18:03               ` Petr Tesařík
2024-01-10 19:48                 ` Stephen Brennan [this message]
2024-01-10 19:58                   ` Petr Tesařík

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87ply9ynp5.fsf@oracle.com \
    --to=stephen.s.brennan@oracle.com \
    --cc=linux-debuggers@vger.kernel.org \
    --cc=osandov@osandov.com \
    --cc=petr@tesarici.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.