From: "Luis R. Rodriguez" <mcgrof@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: "Михаил Гаврилов" <mikhail.v.gavrilov@gmail.com>,
"Jan Kara" <jack@suse.cz>,
"Christoph Hellwig" <hch@infradead.org>,
linux-xfs@vger.kernel.org, linux-mm@kvack.org,
"Aleksa Sarai" <asarai@suse.com>,
"Hannes Reinecke" <hare@suse.de>,
"Eric W. Biederman" <ebiederm@xmission.com>,
"Jan Blunck" <jblunck@infradead.org>,
"Oscar Salvador" <osalvador@suse.com>
Subject: Re: kernel BUG at fs/xfs/xfs_aops.c:853! in kernel 4.13 rc6
Date: Mon, 9 Oct 2017 20:31:29 +0200 [thread overview]
Message-ID: <20171009183129.GE11645@wotan.suse.de> (raw)
In-Reply-To: <20171009000529.GY3666@dastard>
On Mon, Oct 09, 2017 at 11:05:29AM +1100, Dave Chinner wrote:
> On Sat, Oct 07, 2017 at 01:10:58PM +0500, Михаил Гаврилов wrote:
> > But seems now got another issue:
> >
> > [ 1966.953781] INFO: task tracker-store:8578 blocked for more than 120 seconds.
> > [ 1966.953797] Not tainted 4.13.4-301.fc27.x86_64+debug #1
> > [ 1966.953800] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [ 1966.953804] tracker-store D12840 8578 1655 0x00000000
> > [ 1966.953811] Call Trace:
> > [ 1966.953823] __schedule+0x2dc/0xbb0
> > [ 1966.953830] ? wait_on_page_bit_common+0xfb/0x1a0
> > [ 1966.953838] schedule+0x3d/0x90
> > [ 1966.953843] io_schedule+0x16/0x40
> > [ 1966.953847] wait_on_page_bit_common+0x10a/0x1a0
> > [ 1966.953857] ? page_cache_tree_insert+0x170/0x170
> > [ 1966.953865] __filemap_fdatawait_range+0x101/0x1a0
> > [ 1966.953883] file_write_and_wait_range+0x63/0xc0
>
> Ok, that's in wait_on_page_writeback(page)
> ......
>
> > And yet another
> >
> > [41288.797026] INFO: task tracker-store:4535 blocked for more than 120 seconds.
> > [41288.797034] Not tainted 4.13.4-301.fc27.x86_64+debug #1
> > [41288.797037] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [41288.797041] tracker-store D10616 4535 1655 0x00000000
> > [41288.797049] Call Trace:
> > [41288.797061] __schedule+0x2dc/0xbb0
> > [41288.797072] ? bit_wait+0x60/0x60
> > [41288.797076] schedule+0x3d/0x90
> > [41288.797082] io_schedule+0x16/0x40
> > [41288.797086] bit_wait_io+0x11/0x60
> > [41288.797091] __wait_on_bit+0x31/0x90
> > [41288.797099] out_of_line_wait_on_bit+0x94/0xb0
> > [41288.797106] ? bit_waitqueue+0x40/0x40
> > [41288.797113] __block_write_begin_int+0x265/0x550
> > [41288.797132] iomap_write_begin.constprop.14+0x7d/0x130
>
> And that's in wait_on_buffer().
>
> In both cases we are waiting on a bit lock for IO completion. In the
> first case it is on a page, in the second it's on sub-page read IO
> completion during a write.
>
> Triggering hung task timeouts like this doesn't usually indicate a
> filesystem problem.
<-- snip -->
> None of these things are usually filesystem problems, and the train smash
> of blocked tasks on filesystem locks is typical for these types of
> "blocked indefinitely with locks held" situations. It does
> tend to indicate that there is quite a bit of load on the
> filesystem, though...
As Jan Kara noted, we have also seen this on a customer's SLE12-SP2 kernel (4.4
based). Although we were never able to root cause it either, and that bug
is now closed on our end, I figured it would be worth mentioning two
theories we discussed, one more recent than the other.
One theory came from the logs, which hinted at an older issue our Docker
team had been looking into for a while: libdm mounts being leaked into
another container's namespace. Apparently the clue that this has happened
is a message like the following in the docker logs:
2017-06-22T18:59:47.925917+08:00 host-docker-01 dockerd[1957]: time="2017-06-22T18:59:47.925857351+08:00" level=info msg="Container
6b8f678a27d61939f358614c673224675d64f1527bb2046943e2d493f095c865 failed to exit within 30 seconds of signal 15 - using the force"
Signal 15 is SIGTERM. Seeing the above message is a good hint that the
container did not actually exit on its own, but instead had to be forcibly
terminated via 'docker kill' or 'docker rm -f'. To be clear, docker could not
stop the container with SIGTERM and so resorted to SIGKILL.
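For context, the escalation behind that message is the usual one: ask nicely,
wait out a grace period, then kill. Roughly, as an illustrative C sketch (not
Docker's actual code, which is Go):

#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Illustration only: dockerd's stop logic, reduced to its shape. */
static bool stop_container(pid_t init_pid, unsigned int grace_secs)
{
        kill(init_pid, SIGTERM);                /* signal 15 first */
        for (unsigned int i = 0; i < grace_secs; i++) {
                if (waitpid(init_pid, NULL, WNOHANG) == init_pid)
                        return true;            /* exited within the grace period */
                sleep(1);
        }
        kill(init_pid, SIGKILL);                /* "using the force" */
        waitpid(init_pid, NULL, 0);
        return false;
}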
After this comes the next clueful message, which interests us as a lead for
reproducing the originally reported issue:
2017-06-22T18:59:48.071374+08:00 host-docker-01 dockerd[1957]: time="2017-06-22T18:59:48.071314219+08:00" level=error msg="devmapper: Error
unmounting device
Aleksa indicates he has been studying this code for a while, and since he
has fixes for this in Docker he understands what is going on here. The
above error message indicates that Docker in turn *failed* to kill the
container even with SIGKILL. This, he indicates, is due to libdm mounts
leaking from one container's namespace into another container's namespace.
The error is not that docker is unable to umount; it is actually that
docker cannot remove the backing device.
It is after this that the bug triggers. That does not explain *why* we hit
the bug on XFS, but it should be a clue. Aleksa is, however, absolutely sure
that the bugs found internally *are* somehow caused by mount leakage, and
all we know is that the XFS oops happened after this.
He suggests this could in theory be reproduced by doing the following
while you have a Docker daemon running:
o unshare -m
o mount --make-rprivate /
This effectively forces a mount leak. We had someone try to reproduce it
internally, but I don't think they were able to in the end.
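In C terms those two steps amount to roughly the following (a minimal sketch,
needs root, run while the Docker daemon is up):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
        /* unshare -m: new mount namespace with copies of the host's mounts */
        if (unshare(CLONE_NEWNS)) {
                perror("unshare");
                return 1;
        }
        /* mount --make-rprivate /: cut the copies off from propagation */
        if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL)) {
                perror("mount");
                return 1;
        }
        /* while we live, any mount the host tears down stays pinned here */
        pause();
        return 0;
}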
If this docker issue leads to a kernel issue, the severity might be a bit more
serious: it would mean an unprivileged user namespace can effectively perform a
DoS against a host that is operating on devicemapper mounts.
Then *another* theory came recently from Jan Blunck during ALPSS: he noted he
had traced a similar issue back to a systemd misuse, i.e. systemd setting the
root mount's propagation to 'shared' by default. Originally the root mount is
'private', but systemd decided it would be a good idea to make it 'shared',
which would quite easily explain the mount namespace leaking.
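For reference, what systemd does there amounts to roughly the following early
in boot (my paraphrase in C, not systemd's literal code):

#include <sys/mount.h>

/* Make the whole root mount tree recursively shared, so mounts propagate
 * into every namespace cloned off it afterwards. */
static int make_root_shared(void)
{
        return mount(NULL, "/", NULL, MS_REC | MS_SHARED, NULL);
}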
Aleksa, however, contends that the mount namespace leaking happens because of Docker's
architecture (all of the mounts are created in one mount namespace, and all of
the runc invocations happen in that same mount namespace). Even if you
explicitly set different sharing options (which runc already supports), the
problem still exists.
But if you are *not* using docker, it gives an idea of how you could perhaps
run into a similar issue.
Regardless, Aleksa points out there is still a more fundamental issue of "I can
leak a mountpoint as an unprivileged user, no matter what the propagation is":
% unshare -rm
% mount --make-rprivate /
% # I now have the mount alive.
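A rough C equivalent of that unprivileged reproducer, spelling out the user
namespace setup that 'unshare -rm' does for you (again only a sketch):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <unistd.h>

static void write_file(const char *path, const char *buf)
{
        int fd = open(path, O_WRONLY);

        if (fd >= 0) {
                if (write(fd, buf, strlen(buf)) < 0)
                        perror(path);
                close(fd);
        }
}

int main(void)
{
        char map[64];
        uid_t uid = getuid();
        gid_t gid = getgid();

        /* unshare -rm: new user + mount namespace, no privileges needed */
        if (unshare(CLONE_NEWUSER | CLONE_NEWNS)) {
                perror("unshare");
                return 1;
        }
        /* map ourselves to root inside the new user namespace */
        write_file("/proc/self/setgroups", "deny");
        snprintf(map, sizeof(map), "0 %u 1", uid);
        write_file("/proc/self/uid_map", map);
        snprintf(map, sizeof(map), "0 %u 1", gid);
        write_file("/proc/self/gid_map", map);

        /* mount --make-rprivate /: the copied mounts are now ours alone */
        if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL)) {
                perror("mount");
                return 1;
        }
        /* the leaked mounts stay alive for as long as we do */
        pause();
        return 0;
}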
Luis
Thread overview: 32+ messages
[not found] <CABXGCsOL+_OgC0dpO1+Zeg=iu7ryZRZT4S7k-io8EGB0ZRgZGw@mail.gmail.com>
2017-09-03 7:43 ` kernel BUG at fs/xfs/xfs_aops.c:853! in kernel 4.13 rc6 Christoph Hellwig
2017-09-03 14:08 ` Михаил Гаврилов
2017-09-04 12:30 ` Jan Kara
2017-10-07 8:10 ` Михаил Гаврилов
2017-10-07 9:22 ` Михаил Гаврилов
2017-10-09 0:05 ` Dave Chinner
2017-10-09 18:31 ` Luis R. Rodriguez [this message]
2017-10-09 19:02 ` Eric W. Biederman
2017-10-15 8:53 ` Aleksa Sarai
2017-10-15 13:06 ` Theodore Ts'o
2017-10-15 22:14 ` Eric W. Biederman
2017-10-15 23:22 ` Dave Chinner
2017-10-16 17:44 ` Eric W. Biederman
2017-10-16 21:38 ` Dave Chinner
2017-10-16 1:13 ` Theodore Ts'o
2017-10-16 17:53 ` Eric W. Biederman
2017-10-16 18:50 ` Theodore Ts'o
2017-10-16 22:00 ` Dave Chinner
2017-10-17 1:34 ` Theodore Ts'o
2017-10-17 0:59 ` Aleksa Sarai
2017-10-17 9:20 ` Jan Kara
2017-10-17 14:12 ` Theodore Ts'o
2017-11-06 19:25 ` Luis R. Rodriguez
2017-11-07 15:26 ` Jan Kara
2017-10-09 22:28 ` Dave Chinner
2017-10-10 7:57 ` Jan Kara
2017-09-04 1:43 ` Dave Chinner
2017-09-04 2:20 ` Darrick J. Wong
2017-09-04 12:14 ` Jan Kara
2017-09-04 22:36 ` Dave Chinner
2017-09-05 16:17 ` Jan Kara
2017-09-05 23:42 ` Dave Chinner