From: "Coly Li" <colyli@fnnas.com>
To: 顾泽兵 <guzebing@bytedance.com>
Cc: "linux-bcache@vger.kernel.org" <linux-bcache@vger.kernel.org>,
"kent.overstreet@linux.dev" <kent.overstreet@linux.dev>
Subject: Re: [QUESTION] Using bcache to mask transient I/O hangs and errors from an unstable backing device
Date: Tue, 24 Mar 2026 14:37:23 +0800
Message-ID: <acIutX0i9rbAR3K7@studio.local>
In-Reply-To: <4b210910448ef2227190f426e97614787d15d32b.53a515ea.f6d9.4988.8d81.b40b559c0503@bytedance.com>
On Tue, Mar 24, 2026 at 12:50:55PM +0800, 顾泽兵 wrote:
> Hi all,
>
> I'd like to describe a problem we're facing and a proposed solution based on bcache. I'm writing to ask whether this direction makes sense, and whether there is a better existing mechanism in the kernel that I might have missed.
>
> == Problem ==
> We use kernel RBD (krbd) as a block device backed by a Ceph cluster. Due to network instability and other infrastructure issues, the krbd device occasionally suffers transient I/O hangs or I/O errors. These episodes can last anywhere from a few seconds to several minutes, after which krbd recovers on its own.
> During these periods, upper-layer applications (filesystems, databases, etc.) observe hung or failed I/O and may degrade or crash. Our goal is to make these transient backing-device failures completely invisible to the application layer.
> To put it more generally: when a block device is unstable and may experience intermittent I/O hangs or I/O errors, how can we guarantee I/O stability for the layers above it?
>
> == Proposed approach ==
> We plan to use bcache with a local NVMe device as the cache, sitting in front of the krbd backing device. The idea is to let the NVMe absorb all I/O during a krbd stall and drain dirty data back to krbd once it recovers. Specifically:
> - The NVMe cache partition is sized equal to the krbd device, so the entire working set can reside in cache. This maximises the read cache hit rate, which is the most important point for us.
Data buckets on the cache device are written in an append-only manner, which
means old data is not removed until garbage collection runs. A cache
partition exactly equal in size to the backing device will therefore hold
less, possibly much less, than the whole data set on the backing device. The
actual amount of cached data depends on how the stale data is reclaimed by
garbage collection.
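To illustrate the point, here is a toy model (plain user-space C, not bcache
code; all numbers are made up) of why a cache partition equal in size to the
backing device still cannot hold the whole working set once part of it has
been rewritten and garbage collection has not yet reclaimed the stale
buckets:

#include <stdio.h>

int main(void)
{
	const double cache_gib       = 100.0;  /* cache partition size              */
	const double working_set_gib = 100.0;  /* data set on the backing device    */
	const double rewrite_ratio   = 0.3;    /* fraction rewritten before GC runs */

	/* Space used = one live copy of everything plus stale copies of the
	 * rewritten fraction, since old buckets are only reclaimed by GC. */
	double needed = working_set_gib * (1.0 + rewrite_ratio);

	/* Rough estimate of how much live data actually fits in the cache. */
	double fits = needed <= cache_gib ? working_set_gib
					  : cache_gib / (1.0 + rewrite_ratio);

	printf("space needed before GC runs: %.1f GiB\n", needed);
	printf("live data that fits in the cache: %.1f of %.1f GiB\n",
	       fits, working_set_gib);
	return 0;
}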
> - We use writeback mode, so both reads and writes are served from the NVMe first and asynchronously flushed to the krbd backing device.
> - The workload is a mix of reads and writes.
That means read misses are still possible and may be frequent. How to handle
a read failure on a read miss while the backing device is temporarily
unavailable is an open question.
> bcache already supports most of what we need. However, the current writeback mode does not fully isolate the upper I/O path from backing-device failures. When krbd hangs or returns errors during dirty-data flushing, bcache may still propagate those failures upward or stall the cache device.
>
> == What we think is needed ==
> We believe a relatively small addition — a new cache mode alongside the existing write-through / writeback / write-around / none modes — could solve this. The semantics of this new mode would be:
> * All reads and writes are served exclusively from the cache device.
> * Dirty data is flushed to the backing device asynchronously.
> * Any I/O errors or hangs on the backing device during flushing are handled gracefully — retried later rather than propagated to the upper layer.
> * When the backing device is healthy, dirty data drains normally.
> This would allow bcache to act as a resilience layer, not just a performance cache. The required changes seem modest and would not affect the existing modes.
>
Indeed you don't need a new cache mode. It should work if retries are added
to the writeback failure path, provided that all the related pieces, e.g.
writeback ordering and writeback throttling, are handled properly.
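To make the suggestion concrete, below is a minimal stand-alone sketch (plain
user-space C; none of these names come from the actual bcache writeback code)
of what "retry the flush instead of propagating the error" could look like:
on a backing-device error the data simply stays dirty in the cache and the
flush is retried with a capped exponential backoff.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-flush state; not a real bcache structure. */
struct wb_request {
	long long sector;       /* where the dirty data lives on the backing dev */
	unsigned int retries;   /* failed flush attempts so far                  */
	unsigned int delay_ms;  /* backoff before the next attempt               */
};

#define WB_MAX_DELAY_MS 60000u  /* cap the backoff at one minute */

/* Stand-in for submitting the flush to the backing device.  Here it
 * simply pretends the device recovers after three failures. */
static bool submit_to_backing_dev(struct wb_request *req)
{
	return req->retries >= 3;
}

static void writeback_one(struct wb_request *req)
{
	while (!submit_to_backing_dev(req)) {
		/* Backing device hung or erroring: keep the data dirty in
		 * the cache, back off, and retry.  Nothing is reported to
		 * the upper layer. */
		req->retries++;
		req->delay_ms = req->delay_ms ? req->delay_ms * 2 : 100;
		if (req->delay_ms > WB_MAX_DELAY_MS)
			req->delay_ms = WB_MAX_DELAY_MS;
		printf("flush of sector %lld failed, retry %u in %u ms\n",
		       req->sector, req->retries, req->delay_ms);
	}
	printf("sector %lld flushed, dirty key can be cleared\n", req->sector);
}

int main(void)
{
	struct wb_request req = { .sector = 8192 };

	writeback_one(&req);
	return 0;
}

In the real writeback path the retry would of course be re-queued as delayed
work rather than looped in place, and the ordering and throttling mentioned
above still have to be respected.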
> == Questions ==
> 1) Is this direction sound within the bcache architecture? Is there anything fundamental that would make it impractical?
1, There is no assurance that the whole data set of the backing device can be
cached on the cache device, whatever the cache size is.
2, If the backing device is unavailable when a read miss happens, how to
handle it is an open question.
> 2) Would adding such a new mode to bcache be considered meaningful and welcome? I'm willing to do the development and submit patches, but I want to make sure this is not out of scope for the project.
> 3) Is there an existing in-kernel solution — dm-cache, dm-writecache, or some other mechanism — that already handles the "mask transient backing-device failures" use case and that I may have overlooked?
>
> Any feedback, pointers, or alternative suggestions would be greatly appreciated. Thank you for your time.
Just a brief reply for now.
Coly Li