From: Bob Peterson <rpeterso@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
Date: Fri, 1 Feb 2019 09:34:17 -0500 (EST)
Message-ID: <2061149090.69285339.1549031657280.JavaMail.zimbra@redhat.com>
In-Reply-To: <9b9dd4a6-6529-98e9-b1a1-aab8426cfa86@citrix.com>
Hi Ross,
----- Original Message -----
(snip)
> We haven't observed any problems that can be directly attributed to this
> without KASAN, although it is hard to tell what a stray write may do. We
> have hit sporadic asserts and filesystem corruption during testing.
>
> When I added tracing, the time between freeing a glock and writing to it
> varied but could be up to hundreds of milliseconds so I would guess that
> this could easily happen without KASAN. It is relatively easy to
> reproduce in our test environment.
>
> Do you have any suggestions for tracking down the root cause?
In the past, I've debugged problems with glock reference counting by
using kernel tracing and instrumentation. Unfortunately, the "glock_put"
trace point only shows you when the glock ref count goes to 0. It
doesn't show when or how the glock was first created, so you can't tell
whether it was created and destroyed multiple times. That is often
important to figuring these out; otherwise it's just a lot of chaos.

In the past, I've added my own temporary kernel trace point for when new
glocks are created, and called it "glock_new." You probably also want to
modify the glock put functions, such as gfs2_glock_put and
gfs2_glock_queue_put, to call a trace point so you can see each put as
well, and have it save off the gl_lockref reference count in the trace.

Then recreate the problem with the trace running. I attached a script I
often use for these purposes. The script contains several bogus trace
point references for various sets of temporary trace points I've added
and deleted over the years, like a generic "debug" trace point where I
can add generic messages of what's happening. So don't be surprised if
you get errors about trying to write values into non-existent debugfs
files. Just ignore them.

The script DOES contain a trigger for a "glock_new" trace point for just
this purpose. I can try to dig out whether I still have that trace point
(glock_new) and the generic debug trace point lying around somewhere in
my many git repositories, but it might take longer than just writing
them again from scratch. I know it pre-dates the concept of a
"queued_put" so things will need to be tweaked anyway.

The script has a bunch of declares at the top for which trace points to
monitor and collect. I modified it for glock_new and glock_put, but
you can play with it.
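
For illustration, enabling a declared set of trace points while tolerating
the missing ones might look roughly like the sketch below. This is not taken
from the attached gfs2trace.sh: the function name enable_gfs2_events, the
TRACING_DIR variable, and the event name gfs2_glock_new are all assumptions
(glock_new is a temporary trace point that has to be added to the kernel by
hand).

```shell
#!/bin/bash
# Sketch only. TRACING_DIR and enable_gfs2_events are hypothetical names,
# not part of the attached script.
TRACING_DIR="${TRACING_DIR:-/sys/kernel/debug/tracing}"

# Enable each named gfs2 trace event. Events that do not exist in the
# running kernel (such as a hand-added gfs2_glock_new) are skipped with
# a message instead of failing.
enable_gfs2_events() {
    local ev file
    for ev in "$@"; do
        file="$TRACING_DIR/events/gfs2/$ev/enable"
        if [ -w "$file" ]; then
            echo 1 > "$file"
        else
            echo "skipping missing trace point: $ev" >&2
        fi
    done
}
```

It would be called as something like
"enable_gfs2_events gfs2_glock_put gfs2_glock_new" (as root, with debugfs
mounted).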
To run the script and collect the trace, just do this:

    ./gfs2trace.sh &
    (recreate the problem)
    rm /var/run/gfs2-tracepoints.pid

Removing that file triggers the trace script to stop tracing and save
the results to a file in /tmp/ named after the machine's name
(so we can keep them straight in clustered situations).
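
The stop-on-file-removal pattern described above can be sketched as follows.
This is a guess at the general shape, not the attached script: the function
name, the PIDFILE, OUTDIR and POLL_SECS variables, and the output file name
"gfs2trace.<hostname>" are assumptions.

```shell
#!/bin/bash
# Sketch only. All names here are hypothetical.
PIDFILE="${PIDFILE:-/var/run/gfs2-tracepoints.pid}"
OUTDIR="${OUTDIR:-/tmp}"
POLL_SECS="${POLL_SECS:-1}"

# Create the pid file, wait until someone removes it, then copy the
# trace source (e.g. /sys/kernel/debug/tracing/trace) to a file in
# OUTDIR named after this machine. Prints the output file name.
trace_until_pidfile_removed() {
    local src="$1"
    local out
    out="$OUTDIR/gfs2trace.$(uname -n)"
    echo "$$" > "$PIDFILE"
    while [ -e "$PIDFILE" ]; do
        sleep "$POLL_SECS"
    done
    cat "$src" > "$out"
    echo "$out"
}
```

Naming the output after the host is what keeps traces from different
cluster nodes apart when they are collected in one place.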
Then, of course, someone needs to analyze the resulting trace file and
figure out where the count is getting off. I hope this helps.
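
As a starting point for that analysis, something like the sketch below could
flag puts that arrive after a glock's count has already reached 0. It
assumes the raw trace has first been reduced to a simplified, hypothetical
format with one event per line: "<timestamp> glock_new <glock>" or
"<timestamp> glock_put <glock> <count after the put>". Real ftrace output
looks different and would need to be converted first; the function name
find_stray_puts is also made up.

```shell
#!/bin/bash
# Sketch only. The input format is an assumed, pre-processed one:
#   <timestamp> glock_new <glock>
#   <timestamp> glock_put <glock> <ref count after the put>

# Report every glock_put seen while the glock is not live, i.e. before
# any glock_new or after a previous put took the count to 0.
find_stray_puts() {
    awk '
        $2 == "glock_new" { live[$3] = 1; created[$3]++; next }
        $2 == "glock_put" {
            if (!live[$3])
                printf "%s stray put on %s (created %d times)\n",
                       $1, $3, created[$3]
            if (NF >= 4 && $4 == 0)
                live[$3] = 0
        }
    ' "$@"
}
```

Run as "find_stray_puts reduced-trace.txt"; each reported line is a
candidate for the write-after-free, and the creation count shows whether the
glock was recreated in between.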
Regards,
Bob Peterson
Red Hat File Systems
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gfs2trace.sh
Type: application/x-shellscript
Size: 9477 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/cluster-devel/attachments/20190201/3c6a6bdf/attachment.bin>