public inbox for linux-bcache@vger.kernel.org
* [QUESTION] Using bcache to mask transient I/O hangs and errors from an unstable backing device
@ 2026-03-24  4:50 顾泽兵
  2026-03-24  6:37 ` Coly Li
  0 siblings, 1 reply; 3+ messages in thread
From: 顾泽兵 @ 2026-03-24  4:50 UTC (permalink / raw)
  To: linux-bcache@vger.kernel.org, colyli@fnnas.com,
	kent.overstreet@linux.dev

Hi all,

I'd like to describe a problem we're facing and a proposed solution based on bcache. I'm writing to ask whether this direction makes sense, and whether there is a better existing mechanism in the kernel that I might have missed.

== Problem ==
We use kernel RBD (krbd) as a block device backed by a Ceph cluster. Due to network instability and other infrastructure issues, the krbd device occasionally suffers transient I/O hangs or I/O errors. These episodes can last anywhere from a few seconds to several minutes, after which krbd recovers on its own.
During these periods, upper-layer applications (filesystems, databases, etc.) observe hung or failed I/O and may degrade or crash. Our goal is to make these transient backing-device failures completely invisible to the application layer.
To put it more generally: when a block device is unstable and may experience intermittent I/O hangs or I/O errors, how can we guarantee I/O stability for the layers above it?

== Proposed approach ==
We plan to use bcache with a local NVMe device as the cache, sitting in front of the krbd backing device. The idea is to let the NVMe absorb all I/O during a krbd stall and drain dirty data back to krbd once it recovers. Specifically:
  - The NVMe cache partition is sized equal to the krbd device, so the entire working set can reside in cache. This maximises the read cache hit rate, which is the most important point for us.
  - We use writeback mode, so both reads and writes are served from the NVMe first and asynchronously flushed to the krbd backing device.
  - The workload is a mix of reads and writes.
bcache already supports most of what we need. However, the current writeback mode does not fully isolate the upper I/O path from backing-device failures. When krbd hangs or returns errors during dirty-data flushing, bcache may still propagate those failures upward or stall the cache device.

== What we think is needed ==
We believe a relatively small addition — a new cache mode alongside the existing writethrough / writeback / writearound / none modes — could solve this. The semantics of this new mode would be:
  * All reads and writes are served exclusively from the cache device.
  * Dirty data is flushed to the backing device asynchronously.
  * Any I/O errors or hangs on the backing device during flushing are handled gracefully — retried later rather than propagated to the upper layer.
  * When the backing device is healthy, dirty data drains normally.
This would allow bcache to act as a resilience layer, not just a performance cache. The required changes seem modest and would not affect the existing modes.
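The semantics above can be sketched as a small user-space toy model (not kernel code; `FlakyBacking` is a hypothetical stand-in for krbd, and all names are illustrative). It shows the key invariant of the proposed mode: the upper layer never sees a backing-device error, and dirty data simply waits in cache until the backing device recovers.

```python
import collections

class FlakyBacking:
    """Toy backing device that fails while 'down' (stand-in for a stalled krbd)."""
    def __init__(self):
        self.down = False
        self.data = {}
    def write(self, lba, val):
        if self.down:
            raise IOError("backing device unavailable")
        self.data[lba] = val

class CacheOnlyMode:
    """Sketch of the proposed mode: serve all I/O from the cache, drain
    dirty data asynchronously, and retry (never propagate) backing errors."""
    def __init__(self, backing):
        self.backing = backing
        self.cache = {}
        self.dirty = collections.OrderedDict()  # preserves writeback order
    def write(self, lba, val):
        # Always succeeds: the cache absorbs the write unconditionally.
        self.cache[lba] = val
        self.dirty[lba] = val
    def read(self, lba):
        # Served exclusively from the cache device.
        return self.cache[lba]
    def flush_once(self):
        """One writeback pass; blocks that fail stay dirty for a later retry."""
        for lba in list(self.dirty):
            try:
                self.backing.write(lba, self.dirty[lba])
                del self.dirty[lba]
            except IOError:
                break  # keep writeback order; retry on the next pass

backing = FlakyBacking()
dev = CacheOnlyMode(backing)
backing.down = True                     # transient outage begins
dev.write(0, b"a"); dev.write(1, b"b")  # upper layer sees no error
dev.flush_once()                        # flush fails quietly, data stays dirty
assert dev.read(0) == b"a" and len(dev.dirty) == 2
backing.down = False                    # krbd recovers
dev.flush_once()                        # dirty data drains normally
assert backing.data == {0: b"a", 1: b"b"} and not dev.dirty
```

The model glosses over ordering across flush passes, throttling, and crash consistency, but it captures the error-isolation semantics being proposed.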

== Questions ==
  1) Is this direction sound within the bcache architecture? Is there anything fundamental that would make it impractical?
  2) Would adding such a new mode to bcache be considered meaningful and welcome? I'm willing to do the development and submit patches, but I want to make sure this is not out of scope for the project.
  3) Is there an existing in-kernel solution — dm-cache, dm-writecache, or some other mechanism — that already handles the "mask transient backing-device failures" use case and that I may have overlooked?

Any feedback, pointers, or alternative suggestions would be greatly appreciated. Thank you for your time.

Best regards.

^ permalink raw reply	[flat|nested] 3+ messages in thread
* Re: [QUESTION] Using bcache to mask transient I/O hangs and errors from an unstable backing device
@ 2026-03-25  4:41 顾泽兵
  0 siblings, 0 replies; 3+ messages in thread
From: 顾泽兵 @ 2026-03-25  4:41 UTC (permalink / raw)
  To: Coly Li
  Cc: 徐广治, 常凤楠,
	kent.overstreet@linux.dev, linux-bcache@vger.kernel.org

(Resending with linux-bcache@vger.kernel.org and kent.overstreet@linux.dev on CC, which were accidentally dropped in my previous reply.)

> From: "Coly Li"<colyli@fnnas.com>
> Date:  Tue, Mar 24, 2026, 14:37
> Subject:  Re: [QUESTION] Using bcache to mask transient I/O hangs and errors from an unstable backing device
> To: "顾泽兵"<guzebing@bytedance.com>
> Cc: "linux-bcache@vger.kernel.org"<linux-bcache@vger.kernel.org>, "kent.overstreet@linux.dev"<kent.overstreet@linux.dev>
> On Tue, Mar 24, 2026 at 12:50:55PM +0800, 顾泽兵 wrote:
> > Hi all,
> > 
> > I'd like to describe a problem we're facing and a proposed solution based on bcache. I'm writing to ask whether this direction makes sense, and whether there is a better existing mechanism in the kernel that I might have missed.
> > 
> > == Problem ==
> > We use kernel RBD (krbd) as a block device backed by a Ceph cluster. Due to network instability and other infrastructure issues, the krbd device occasionally suffers transient I/O hangs or I/O errors. These episodes can last anywhere from a few seconds to several minutes, after which krbd recovers on its own.
> > During these periods, upper-layer applications (filesystems, databases, etc.) observe hung or failed I/O and may degrade or crash. Our goal is to make these transient backing-device failures completely invisible to the application layer.
> > To put it more generally: when a block device is unstable and may experience intermittent I/O hangs or I/O errors, how can we guarantee I/O stability for the layers above it?
> > 
> > == Proposed approach ==
> > We plan to use bcache with a local NVMe device as the cache, sitting in front of the krbd backing device. The idea is to let the NVMe absorb all I/O during a krbd stall and drain dirty data back to krbd once it recovers. Specifically:
> >   - The NVMe cache partition is sized equal to the krbd device, so the entire working set can reside in cache. This maximises the read cache hit rate, which is the most important point for us.
> 
> Data buckets on the cache device are stored in an append-only way, which
> means old data is not deleted until a garbage collection runs. A cache
> partition of exactly equal size will therefore hold less, possibly much
> less, data than the whole data set on the backing device. The actual
> cached data size depends on how old data is handled by garbage collection.

Thanks for the explanation and reply; now I understand.
Because cache buckets are append-only, the effective usable cache capacity
can be smaller than the raw cache-device size, depending on the overwrite
rate and GC behavior.
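A toy model makes this concrete (numbers are illustrative, not real bucket accounting): with append-only allocation, every overwrite consumes a fresh slot, so a cache sized exactly equal to the logical device fills up with stale copies before the full data set fits.

```python
# Toy model of append-only allocation: each overwrite consumes a new slot,
# and the stale copy is not reclaimed until garbage collection runs.
CACHE_SLOTS = 1000                       # cache sized equal to the backing device
writes = [i % 700 for i in range(1000)]  # 700 distinct blocks, 300 overwrites

used = 0
live = set()
for lba in writes:
    used += 1        # append-only: the old copy still occupies its slot
    live.add(lba)

assert used == CACHE_SLOTS  # the cache is completely full...
assert len(live) == 700     # ...yet holds only 700 live blocks
print(f"slots used: {used}, live blocks: {len(live)}, "
      f"stale copies: {used - len(live)}")
```

With a 30% overwrite rate the equal-sized cache is exhausted while caching only 70% of the distinct blocks, which is exactly the gap Coly describes.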

My remaining question is about the full-set-caching case. If the goal is
to keep serving I/O from the local device even while the backing device
is temporarily unhealthy, then it seems this may no longer fit bcache's
current append-only cache model very well. Maybe in-place overwrites
would work here and avoid the need for a GC thread.

This would move bcache away from a pure caching role and closer to a
local-persistence layer with asynchronous writeback to the backing device.
Before I spend time prototyping anything larger: would you consider that
direction fundamentally out of scope for bcache?

> 
> 
> >   - We use writeback mode, so both reads and writes are served from the NVMe first and asynchronously flushed to the krbd backing device.
> >   - The workload is a mix of reads and writes.
> 
> That means read misses are still possible, and frequent. How to handle a
> read failure on a read miss while the backing device is temporarily
> unavailable is an open question.
> 
> > bcache already supports most of what we need. However, the current writeback mode does not fully isolate the upper I/O path from backing-device failures. When krbd hangs or returns errors during dirty-data flushing, bcache may still propagate those failures upward or stall the cache device.
> > 
> > == What we think is needed ==
> > We believe a relatively small addition — a new cache mode alongside the existing write-through / writeback / write-around / none modes — could solve this. The semantics of this new mode would be:
> >   * All reads and writes are served exclusively from the cache device.
> >   * Dirty data is flushed to the backing device asynchronously.
> >   * Any I/O errors or hangs on the backing device during flushing are handled gracefully — retried later rather than propagated to the upper layer.
> >   * When the backing device is healthy, dirty data drains normally.
> > This would allow bcache to act as a resilience layer, not just a performance cache. The required changes seem modest and would not affect the existing modes.
> >
> 
> Indeed you don't need a new cache mode. It should work if multiple retries
> are added to the writeback failure path, provided you handle all the
> related details, e.g. writeback order and writeback throttling, properly.

Yes, I agree. If full-set caching is not a hard requirement and partial caching is
acceptable, such that most read I/O can hit the cache without requiring a 100%
hit rate, then retrying writeback failures should be sufficient.

We are still evaluating whether to relax the original goal from full-set caching to
partial caching, but that is one option we are considering.
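For the retry-on-writeback-failure approach Coly suggests, a minimal user-space sketch (assumed names; real kernel code would resubmit a bio, not call a Python function) is bounded retries with exponential backoff, so a transient krbd stall is absorbed rather than surfaced on the first error:

```python
import time

def writeback_with_retry(submit, max_tries=5, base_delay=0.01):
    """Retry a failed writeback a bounded number of times with exponential
    backoff instead of propagating the first error. `submit` is an assumed
    callable standing in for one writeback flush attempt."""
    for attempt in range(max_tries):
        try:
            return submit()
        except IOError:
            if attempt == max_tries - 1:
                raise  # give up only after repeated failures
            time.sleep(base_delay * (2 ** attempt))  # back off, then retry

# A flush that fails twice (transient stall), then succeeds:
attempts = []
def flaky_submit():
    attempts.append(1)
    if len(attempts) < 3:
        raise IOError("transient krbd stall")
    return "flushed"

assert writeback_with_retry(flaky_submit) == "flushed"
assert len(attempts) == 3
```

In bcache itself the retry interval and cap would need to respect writeback ordering and throttling, as noted above; this only illustrates the error-absorption behavior.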

> 
>  
> > == Questions ==
> >   1) Is this direction sound within the bcache architecture? Is there anything fundamental that would make it impractical?
> 
> 1, There is no assurance that the whole data set of the backing device can
>    be cached on the cache device, whatever the cache size.
> 
> 2, If the backing device is unavailable when a read miss happens, how to handle it is an open question.
> 
> 
> >   2) Would adding such a new mode to bcache be considered meaningful and welcome? I'm willing to do the development and submit patches, but I want to make sure this is not out of scope for the project.
> >   3) Is there an existing in-kernel solution — dm-cache, dm-writecache, or some other mechanism — that already handles the "mask transient backing-device failures" use case and that I may have overlooked?
> > 
> > Any feedback, pointers, or alternative suggestions would be greatly appreciated. Thank you for your time.
> 
> Just a very simple reply at this moment.
> 
> Coly Li
> 

Thanks again.
Guzebing
