From: Vojtech Pavlik <vojtech@suse.com>
To: "Jens-U. Mozdzen" <jmozdzen@nde.ag>
Cc: Kent Overstreet <kent.overstreet@gmail.com>,
	linux-bcache@vger.kernel.org
Subject: Re: Bcache stuck at writeback of a key, consuming 100% CPU, not possible to detach
Date: Mon, 7 Sep 2015 17:52:17 +0200
Message-ID: <20150907155217.GA27227@suse.com>
In-Reply-To: <20150907171318.Horde.LCssHlKH7jqwmHt1ENeCSCd@www3.nde.ag>

On Mon, Sep 07, 2015 at 05:13:18PM +0200, Jens-U. Mozdzen wrote:

> and the patches
> 
> bcache001.eml:Subject: [PATCH] bcache: [BUG] clear
> BCACHE_DEV_UNLINK_DONE flag when attaching a backing device
> bcache002.eml:Subject: [PATCH] bcache: fix a livelock in btree lock
> bcache003.eml:Subject: [PATCH] bcache: unregister reboot notifier
> when bcache fails to register a block device
> bcache004.eml:Subject: [PATCH] fix a leak in bch_cached_dev_run()
> bcache005.eml:Subject: [PATCH] bcache: Fix writeback_thread never
> writing back incomplete stripes.
> 
> I can confirm that running with writeback_percent set to zero now
> works much smoother (or "at all", for certain circumstances).

I'm glad to hear that.

> >>PS: We're still facing random reboots (of unknown cause), which may
> >>correlate with bcache's "amount dirty" being near the limit set by
> >>writeback_percent.
> 
> For a test, after a few hours running the latest patch, I switched
> from writeback_percent==0 to writeback_percent==1, and had a full
> kernel crash within an hour! Luckily, I still had a console open on
> the machine, so I could for the first time see a hint (but not much
> more) of what is going on:

I'm running the most recent openSUSE stable kernel, available here:

	http://download.opensuse.org/repositories/Kernel:/stable/standard/

It's currently at 4.2.0 and contains all of the above patches. I've
seen crashes in __find_stripe a couple of times, a few months apart, on
older kernels, but those are unlikely to be related to bcache. They
looked similar to this:

	https://bugzilla.kernel.org/show_bug.cgi?id=100321

Apart from these, the system has been running stably (at
writeback_percent=40 for the last few months), so I would bet that your
crashes come from something other than bcache.

> --- cut here ---
> Message from syslogd@san02 at Sep  7 14:56:15 ...
>  kernel:[74182.424659] Kernel panic - not syncing: stack-protector:
> Kernel stack is corrupted in: ffffffffa001a815
> 
> Message from syslogd@san02 at Sep  7 14:56:15 ...
>  kernel:[74182.424659]
> 
> Message from syslogd@san02 at Sep  7 14:56:15 ...
>  kernel:[74182.474050] Kernel Offset: 0x0 from 0xffffffff81000000
> (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
> --- cut here ---

Maybe you could set up a serial console? That way you'd be able to catch
all the kernel messages.
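
Until a full trace is captured, the one address in the panic line can at
least be compared against the relocation range printed next to it. The
sketch below only illustrates that comparison; the function name
classify_panic_address and the regular expressions are made up for this
example and are not part of any kernel tool.

```python
import re

# Hypothetical helper for illustration only; not an existing kernel utility.
PANIC_RE = re.compile(r"Kernel stack is corrupted in:\s*([0-9a-fA-F]+)")
RANGE_RE = re.compile(r"relocation range:\s*0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+)")


def classify_panic_address(log: str) -> str:
    """Report whether the stack-protector address lies inside the
    relocation range that the kernel printed in the same panic output."""
    # Drop mail quote markers and join wrapped lines before matching.
    text = " ".join(line.lstrip("> ").strip() for line in log.splitlines())
    addr_match = PANIC_RE.search(text)
    range_match = RANGE_RE.search(text)
    if addr_match is None or range_match is None:
        return "not enough information"
    addr = int(addr_match.group(1), 16)
    low = int(range_match.group(1), 16)
    high = int(range_match.group(2), 16)
    if low <= addr <= high:
        return "inside kernel image range"
    return "outside kernel image range"


# The panic excerpt exactly as quoted in this thread.
SAMPLE = """\
> kernel:[74182.424659] Kernel panic - not syncing: stack-protector:
> Kernel stack is corrupted in: ffffffffa001a815
> kernel:[74182.474050] Kernel Offset: 0x0 from 0xffffffff81000000
> (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
"""

if __name__ == "__main__":
    print(classify_panic_address(SAMPLE))
```

For the excerpt quoted above, the address lies above the relocation
range, i.e. outside the core kernel image. On x86_64 the region above
that range is normally where loadable modules are mapped, which would fit
a module being involved, but that is an assumption to check against
/proc/modules on the affected machine.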

> Since there's no stack trace, this leaves much room for speculation.
> But at least I now have an idea where the reboots (and two other
> "full stops") might stem from: stack corruption. I have run
> scripts/checkstack.pl on the bcache module and found no excessive
> stack use, but checking for memset() and memcpy() in bcache's code
> gave a number of hits - I'll have to have a look at them, one by
> one, and hope to find my way around.
> 
> I'll give my servers at least two weeks to run with your patch and
> writeback_percent==0 to see if we're hit by reboots with that code
> as well. If not, I'll take that as an indicator that the
> implementation of the "PID regulator" may need a closer look.
>
> Kent, do you remember having fixed anything that might explain this
> stack corruption behavior, in code later than what's included in
> kernel 3.18.8?

-- 
Vojtech Pavlik
Director SUSE Labs
