From: Roosen Henri <Henri.Roosen@ginzinger.com>
To: "bigeasy@linutronix.de" <bigeasy@linutronix.de>
Cc: "linux-rt-users@vger.kernel.org" <linux-rt-users@vger.kernel.org>
Subject: Re: inconsistent lock state on v4.14.20-rt17
Date: Thu, 8 Mar 2018 17:38:59 +0000 [thread overview]
Message-ID: <1520530739.1744.18.camel@ginzinger.com> (raw)
In-Reply-To: <20180308155741.vbdqprs6stx5zu4q@linutronix.de>
On Thu, 2018-03-08 at 16:57 +0100, bigeasy@linutronix.de wrote:
> On 2018-03-07 08:26:40 [+0000], Roosen Henri wrote:
> > > The backtrace suggests that the rq lock is taken with interrupts
> > > disabled and then with interrupts enabled. But based on the call-
> > > trace
> > > it should be with interrupts disabled in both cases.
> > > I do have a imx6q running hackbench on a regular basis and I
> > > haven't
> > > seen this. Do you see this backtrace on every hackbench
> > > invocation or
> > > just after some time. The uptime suggest after ~5 hours.
> >
> > The crashes are not seen at every invocation of hackbench, but
> > unfortunately after a few hours to after a few weeks. The test is
> > based
> > on a "dohell.sh" script from the Xenomai project.
>
> I kicked a hackbench on a imx6 with your config.
Thanks Sebastian for looking into this!
> Is the backtrace, that you receive from lockdep, always the same or
> is
> it different sometimes?
It is different each time. So my gut feeling tells me it might be a
memory corruption of some kind.. maybe caused by a use after free or
so..
I restarted the target yesterday evening and this morning it was frozen
without any trace on the terminal. Attaching a JTAG showed different
call-stacks than yesterday; Core #2 (trying to print the info to the
terminal) and #3 were spinning on a spin-lock, don't understand what
core #0 and #1 were doing.
Core#0:
__vectors_start() at entry-armv.S:1,234 0xffff000c
__pabt_svc() at entry-armv.S:319 0xc010cd84
Core#1:
__vectors_start() at entry-armv.S:1,234 0xffff000c
__pabt_svc() at entry-armv.S:319 0xc010cd84
Core#2:
arch_spin_lock() at spinlock.h:77 0xc017bf38
do_raw_spin_lock() at spinlock_debug.c:115 0xc017bf38
__raw_spin_lock() at spinlock_api_smp.h:143 0xc0996158
_raw_spin_lock() at spinlock.c:155 0xc0996158
vprintk_emit() at printk.c:1,804 0xc0183a8c
vprintk_default() at printk.c:1,906 0xc0183e8c
vprintk_func() at printk_safe.c:382 0xc0185764
printk() at printk.c:1,939 0xc0184fc4
__warn() at panic.c:523 0xc0123754
warn_slowpath_fmt() at panic.c:567 0xc0123870
Core#3:
arch_spin_lock() at spinlock.h:77 0xc017bf38
do_raw_spin_lock() at spinlock_debug.c:115 0xc017bf38
__raw_spin_lock() at spinlock_api_smp.h:143 0xc0996158
_raw_spin_lock() at spinlock.c:155 0xc0996158
__task_rq_lock() at core.c:105 0xc015347c
rt_mutex_setprio() at core.c:3,792 0xc0157148
rt_mutex_adjust_prio() at rtmutex.c:380 0xc017a100
task_blocks_on_rt_mutex() at rtmutex.c:1,374 0xc017a780
rt_spin_lock_slowlock_locked() at rtmutex.c:1,061 0xc09941ec
rt_spin_lock_slowlock() at rtmutex.c:1,116 0xc099445c
rt_spin_lock_fastlock() at rtmutex.c:979 0xc09966c4
rt_spin_lock() at rtmutex.c:1,146 0xc09966c4
skb_queue_tail() at skbuff.c:2,915 0xc076eb08
unix_stream_sendmsg() at af_unix.c:1,908 0xc081dcf8
sock_sendmsg_nosec() at socket.c:639 0xc0760214
sock_sendmsg() at socket.c:643 0xc0760214
sock_write_iter() at socket.c:912 0xc07602b8
call_write_iter() at fs.h:1,771 0xc0255c68
new_sync_write() at read_write.c:469 0xc0255c68
__vfs_write() at read_write.c:482 0xc0255c68
vfs_write() at read_write.c:544 0xc0255e2c
SYSC_write() at read_write.c:589 0xc0255fd4
SyS_write() at read_write.c:581 0xc0255fd4
0xc01074e0
Most of the times the call-stacks start at SyS_write() or SyS_read()
from hackbench.
Some things I found out by testing on v4.9:
- minimum test to reproduce problem "while true; do hackbench -g 100 -l
1000; done &"
- reproducible with "hackbench -T" (threads)
- reproducible only on iMX6Q, not (yet) on iMX6S, iMX6D
- NOT reproducible with "hackbench -p" (pipes)
As that might be pointing towards the streaming unix socketpair
hackbench is using from multiple forked processes, I had a look at
net/unix/af_unix.c and wondered why unix_stream_sendmsg() doesn't
increase the reference count on the "other" socket the same as
unix_dgram_sendmsg() does. I don't see a reason why "other" is handled
differently in both functions, so it smells fishy to me. But I'm not
familiar with the net-code, so maybe you could review if the diff below
would make sense:
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 7f46bab4ce5c..5f2ca91bc54d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1854,7 +1854,7 @@ static int unix_stream_sendmsg(struct socket
*sock, struct msghdr *msg,
goto out_err;
} else {
err = -ENOTCONN;
- other = unix_peer(sk);
+ other = unix_peer_get(sk);
if (!other)
goto out_err;
}
@@ -1910,7 +1910,7 @@ static int unix_stream_sendmsg(struct socket
*sock, struct msghdr *msg,
other->sk_data_ready(other);
sent += size;
}
-
+ sock_put(other);
scm_destroy(&scm);
return sent;
@@ -1923,6 +1923,8 @@ static int unix_stream_sendmsg(struct socket
*sock, struct msghdr *msg,
send_sig(SIGPIPE, current, 0);
err = -EPIPE;
out_err:
+ if (other)
+ sock_put(other);
scm_destroy(&scm);
return sent ? : err;
}
@@ -1934,13 +1936,13 @@ static ssize_t unix_stream_sendpage(struct
socket *socket, struct page *page,
bool send_sigpipe = false;
bool init_scm = true;
struct scm_cookie scm;
- struct sock *other, *sk = socket->sk;
+ struct sock *other = NULL, *sk = socket->sk;
struct sk_buff *skb, *newskb = NULL, *tail = NULL;
if (flags & MSG_OOB)
return -EOPNOTSUPP;
- other = unix_peer(sk);
+ other = unix_peer_get(sk);
if (!other || sk->sk_state != TCP_ESTABLISHED)
return -ENOTCONN;
@@ -2027,6 +2029,7 @@ static ssize_t unix_stream_sendpage(struct socket
*socket, struct page *page,
mutex_unlock(&unix_sk(other)->iolock);
other->sk_data_ready(other);
+ sock_put(other);
scm_destroy(&scm);
return size;
@@ -2038,6 +2041,8 @@ static ssize_t unix_stream_sendpage(struct socket
*socket, struct page *page,
kfree_skb(newskb);
if (send_sigpipe && !(flags & MSG_NOSIGNAL))
send_sig(SIGPIPE, current, 0);
+ if (other)
+ sock_put(other);
if (!init_scm)
scm_destroy(&scm);
return err;
> Sebastian
Thanks,
Henri
>
next prev parent reply other threads:[~2018-03-08 17:39 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-03-06 15:27 inconsistent lock state on v4.14.20-rt17 Roosen Henri
2018-03-06 18:21 ` Sebastian Andrzej Siewior
[not found] ` <1520411200.1744.11.camel@ginzinger.com>
2018-03-08 15:57 ` bigeasy
2018-03-08 17:38 ` Roosen Henri [this message]
2018-03-08 18:00 ` bigeasy
2018-03-09 9:47 ` Roosen Henri
2018-03-14 19:55 ` bigeasy
2018-03-16 10:30 ` bigeasy
2018-03-17 21:03 ` bigeasy
2018-03-21 8:31 ` Roosen Henri
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1520530739.1744.18.camel@ginzinger.com \
--to=henri.roosen@ginzinger.com \
--cc=bigeasy@linutronix.de \
--cc=linux-rt-users@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).