From: Mike Snitzer <snitzer@redhat.com>
To: Stanislas Oger <stanislas.oger@gmail.com>
Cc: dm-devel@redhat.com, Maza Benjamin <b.maza@nectardecode.com>
Subject: Re: Kernel BUG at dm-cache-policy-mq.c
Date: Tue, 21 Mar 2017 12:26:31 -0400
Message-ID: <20170321162631.GA1299@redhat.com>
In-Reply-To: <2f3fa6fc-b425-bfa2-b25a-776b7dbdd477@gmail.com>

On Tue, Mar 21 2017 at 9:02am -0400,
Stanislas Oger <stanislas.oger@gmail.com> wrote:
> Hi,
>
> We are currently encountering a critical issue on a Proxmox cluster
> we operate, which seems to be triggered by a bug in dm-cache ("kernel
> BUG at drivers/md/dm-cache-policy-mq.c:1079!", see syslog below).
>
>
> 1/ Context
>
> The Proxmox cluster uses a 4.4 kernel; the VM storage is a DRBD9
> cluster on top of LVM with SSD caching. The underlying disks are on
> a MegaRAID hardware RAID.
> The problem started occurring after we installed a VM (a mail server)
> that performs many disk reads on many small files (~ 1 million),
> taking a read lock with flock at each read. With the VM fully running,
> the IO wait of the system is less than 1%.
>
>
> 2/ The problem
>
> Randomly, without any warning signs, syslog reports a bug in
> dm-cache-policy-mq.c (see below). A few minutes later all write
> operations block indefinitely. A few minutes after the node stops
> performing write operations, the other DRBD9 nodes stop writing too.
> At this point the whole cluster is down. Reads can be done as usual,
> but write operations block indefinitely.
>
> The only way we have found to recover from this situation is to
> perform a hard reboot of the failing node. As soon as the failing node
> is down, the other nodes resume normal activity. When the failing
> node is up again, DRBD9 performs disk resynchronization and the
> cluster resumes normal activity, as if nothing had happened.
>
> The bug occurred with both 4.4.35 and 4.4.40 kernels, with a
> frequency of about once every 10 days.

How large is your cache? (What are the sizes of the slow and fast devices?)

Have you tried the smq policy? mq is no longer maintained: upstream it
has been removed and made an alias of smq, see commit 9ed84698fdda
("dm cache: make the 'mq' policy an alias for 'smq'").

It should be noted that dm-cache is changing significantly in 4.12
(already staged in linux-next), see:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.12

The new smq code doesn't have the BUG_ON() in question.
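
For reference, both questions above (cache size and active policy) can be read off the dm-cache status line that `dmsetup status` prints. Below is a minimal sketch of a parser for that line, assuming the field order described in the kernel's device-mapper cache documentation. The function name `parse_cache_status` and the `SAMPLE` line are hypothetical and made up for illustration; they are not taken from the reporter's system.

```python
SECTOR_BYTES = 512

# Hypothetical sample line (not from the reporter's system), laid out as:
# <start> <len> cache <md blk size> <md used/total> <cache blk size>
# <cache used/total> <7 counters> <#features> <features>* <#core args>
# <core args>* <policy name> <#policy args> <policy args>* <mode> ...
SAMPLE = (
    "0 20971520 cache 8 27/1024 128 1024/16384 100 200 300 400 5 6 7 "
    "1 writeback 2 migration_threshold 2048 mq 10 random_threshold 4 "
    "sequential_threshold 512 discard_promote_adjustment 1 "
    "read_promote_adjustment 4 write_promote_adjustment 8 rw -"
)


def parse_cache_status(line):
    """Extract origin size, cache size and policy from one status line."""
    fields = line.split()
    # dmsetup prefixes "<name>:" when it lists several devices.
    if fields and fields[0].endswith(":"):
        fields = fields[1:]
    if len(fields) < 3 or fields[2] != "cache":
        raise ValueError("not a dm-cache status line")

    origin_sectors = int(fields[1])
    f = fields[3:]
    cache_block_sectors = int(f[2])
    used_blocks, total_blocks = (int(x) for x in f[3].split("/"))

    # f[4:11] holds the seven hit/miss/demotion/promotion/dirty counters.
    i = 11
    i += 1 + int(f[i])  # skip <#features> and the feature names
    i += 1 + int(f[i])  # skip <#core args> and the core args
    policy = f[i]

    return {
        "origin_bytes": origin_sectors * SECTOR_BYTES,
        "cache_bytes": total_blocks * cache_block_sectors * SECTOR_BYTES,
        "cache_used_blocks": used_blocks,
        "policy": policy,
    }


info = parse_cache_status(SAMPLE)
print(info["policy"], info["cache_bytes"], info["origin_bytes"])
```

On a real system the input would come from `dmsetup status <device>`; the sketch only reads fields and does not change the policy.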