All of lore.kernel.org
 help / color / mirror / Atom feed
From: SeongJae Park <sj@kernel.org>
To: SeongJae Park <sj@kernel.org>
Cc: Guenter Roeck <linux@roeck-us.net>, damon@lists.linux.dev
Subject: Re: Problems running DAMON kunit tests with spinlock debugging enabled
Date: Wed,  4 Sep 2024 23:14:15 -0700	[thread overview]
Message-ID: <20240905061415.1294-1-sj@kernel.org> (raw)
In-Reply-To: <20240904175306.1862-1-sj@kernel.org>

On Wed,  4 Sep 2024 10:53:06 -0700 SeongJae Park <sj@kernel.org> wrote:

> Hi Guenter,
> 
> On Tue, 3 Sep 2024 19:46:44 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
> 
> > On 9/3/24 19:24, Guenter Roeck wrote:
> > > On 9/3/24 18:00, SeongJae Park wrote:
> > >> Hi Guenter,
> > >>
> > >> On Tue, 3 Sep 2024 12:23:50 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> when trying to run DAMON kunit tests with spinlock debugging enabled,
> > >>> I get the following error messages.
> > >>>
> > >>>      BUG: spinlock bad magic on CPU#0, kunit_try_catch/209
> > >>>       lock: mm.13+0x40/0x840, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
> > >>>      CPU: 0 UID: 0 PID: 209 Comm: kunit_try_catch Tainted: G                 N 6.11.0-rc6-00148-g178297ec52d6 #1
> > >>>      Tainted: [N]=TEST
> > >>>      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> > >>>      Call Trace:
> > >>>       <TASK>
> > >>>       dump_stack_lvl+0x9e/0xe0
> > >>>       do_raw_spin_lock+0x63/0xb0
> > >>>       damon_test_three_regions_in_vmas+0x121/0x450S
> > >>>       ...
> > >>>
> > >>> This happens because damon_test_three_regions_in_vmas() calls
> > >>> mt_init_flags() with MM_MT_FLAGS, which sets MT_FLAGS_LOCK_EXTERN.
> > >>> Because of this, the spinlock is not initialized, which then results
> > >>> in the error message when mas_lock() is called during the test.
> > >>>
> > >>> Unfortunately, replacing MM_MT_FLAGS with MT_FLAGS_ALLOC_RANGE |
> > >>> MT_FLAGS_USE_RCU or just with MT_FLAGS_ALLOC_RANGE doesn't help;
> > >>> it results in various "suspicious RCU usage" messages.
> > >>
> > >> Thank you very much for this report with the grateful detailed investigation!
> > >> I was able to make and send a fix[1], thanks to the investigation.  Seems it
> > >> may need some more revisions[2], though.  I'll continue discussion on the
> > >> thread.
> > >>
> > >> [1] https://lore.kernel.org/20240904004534.1189-1-sj@kernel.org
> > >> [2] https://lore.kernel.org/jy6263g6em4jsdhp6tknmh2cljpuvq652kvcet4ko3z2xt7pym@ltc5h5twsszu/
> > >>
> > >>
> > > 
> > > Some more problems. I tried your patch on various architectures.
> > > Unfortunately, that didn't turn out well. See below for some of the failures.
> 
> Nice finding, thank you for reporting this!  Just to be clear, seems this is
> independent from the originally reported issue or the fix for it.  Please let
> me know if not.
> 
> > > 
> > > [ I am not sure I understand the code in damon_test_nr_accesses_to_accesses_bp(),
> > >    especially
> > >      .aggr_interval = ((unsigned long)UINT_MAX + 1) * 10
> > >    What happens if sizeof(int) == sizeof(long) ?
> > >    Unless I am really missing something, .aggr_interval will be
> > >    0 in that case.
> 
> I agree.  aggr_interval would be zero in the case.  And in this case,
> damon_max_nr_accesses(attrs) in damon_nr_accesses_to_accesses_bp() returns
> zero, so damon_nr_accesses_to_accesses_bp() will get a divide by zero problem.
> That is, the problem happens when damon_nr_accesses_to_accesses_bp() is called
> with zero aggr_interval, regardless of the architecture.
> 
> > > ]
> > > 
> > The diff below fixes the problem in damon_test_nr_accesses_to_accesses_bp().
> > Obviously I don't know if this is the correct fix.
> > 
> > Guenter
> > 
> > ---
> > diff --git a/mm/damon/core-test.h b/mm/damon/core-test.h
> > index 0cee634f3544..8072f83b2bba 100644
> > --- a/mm/damon/core-test.h
> > +++ b/mm/damon/core-test.h
> > @@ -309,6 +309,9 @@ static void damon_test_nr_accesses_to_accesses_bp(struct kunit *test)
> >                  .aggr_interval = ((unsigned long)UINT_MAX + 1) * 10
> >          };
> > 
> > +       if (sizeof(int) == sizeof(long))
> > +               kunit_skip(test, "Test requires sizeof(int) != sizeof(long)");
> > +
> >          KUNIT_EXPECT_EQ(test, damon_nr_accesses_to_accesses_bp(123, &attrs), 0);
> >   }
> 
> So, above fix would work, but I think doing the skip if the resulting
> aggr_interval is zero would make more sense.  Having more comments would also
> be nice.

Now I think this is enough fix for this issue, since this test code is only
place that can trigger the issue.  More detail below.

> 
> Nonetheless, if the issue can be triggered outside of the test, we may need to
> think a better fundamental fix.  And seems it is the case.
> 
> Sharing current incomplete findings here first.
> 
> First of all, users can set the zero aggr_interval.  And the problematic
> function (damon_nr_accesses_to_accesses_bp()) is called when user updates the
> intervals while DAMON is running, with the old intervals attributes.  Hence,
> users could trigger the issue by making DAMON to run with zero aggregation
> interval, and then updating the intervals to any values.

It turned out this is not the case.  Only the test code can trigger this issue.
The only possible callstack for the problematic function is as below.

    damon_nr_accesses_to_accesses_bp()
    damon_nr_accesses_for_new_attrs()
    damon_update_monitoring_result()
    damon_update_monitoring_results()
    damon_set_attrs()

And damon_update_monitoring_results() does not call
damon_update_monitoring_result() if the aggregation interval of the old context
is zero, as below.

    static void damon_update_monitoring_results(struct damon_ctx *ctx,
                    struct damon_attrs *new_attrs)
    {
            struct damon_attrs *old_attrs = &ctx->attrs;
            [...]
    
            /* if any interval is zero, simply forgive conversion */
            if (!old_attrs->sample_interval || !old_attrs->aggr_interval ||
                            !new_attrs->sample_interval ||
                            !new_attrs->aggr_interval)
                    return;
    
            damon_for_each_target(t, ctx)
                    damon_for_each_region(r, t)
                            damon_update_monitoring_result(
                                            r, old_attrs, new_attrs);
    }

Hence, again, only the test code can invoke damon_nr_accesses_to_accesses_bp()
with zero aggregation interval.  Hence fixing the test code and adding comments
would be sufficient for now.


Thanks,
SJ

      parent reply	other threads:[~2024-09-05  6:14 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-03 19:23 Problems running DAMON kunit tests with spinlock debugging enabled Guenter Roeck
2024-09-04  1:00 ` SeongJae Park
2024-09-04  2:24   ` Guenter Roeck
2024-09-04  2:46     ` Guenter Roeck
2024-09-04 17:53       ` SeongJae Park
2024-09-04 19:06         ` Guenter Roeck
2024-09-05 16:30           ` SeongJae Park
2024-09-04 21:04         ` Guenter Roeck
2024-09-05 17:35           ` SeongJae Park
2024-09-05 18:04             ` Guenter Roeck
2024-09-05 18:08               ` SeongJae Park
2024-09-05  6:14         ` SeongJae Park [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240905061415.1294-1-sj@kernel.org \
    --to=sj@kernel.org \
    --cc=damon@lists.linux.dev \
    --cc=linux@roeck-us.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.