* [PATCH] bcache: enhancing the security of dirty data writeback
@ 2025-07-31 6:21 Zhou Jifeng
2025-07-31 15:49 ` Kent Overstreet
2025-08-05 4:57 ` Coly Li
0 siblings, 2 replies; 25+ messages in thread
From: Zhou Jifeng @ 2025-07-31 6:21 UTC (permalink / raw)
To: colyli, kent.overstreet; +Cc: linux-bcache, linux-kernel, Zhou Jifeng
There is a potential data consistency risk in bcache's writeback mode: when
an application calls fsync, bcache returns success after completing the
journal write, persisting the data on the cache device, and flushing the
HDD's internal cache. However, at this point the application's data may
still be dirty and resident only on the cache device. When that data is
later written back to the HDD asynchronously via REQ_OP_WRITE, there is no
forced flush to ensure physical placement on the disk, and the HDD's
volatile write cache may have no power-loss protection, so the data is at
risk of being lost. This can lead the application to believe its data has
been persisted when the actual storage state differs, and it violates the
semantic guarantee that fsync ensures data persistence.
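Concretely, the current writeback path looks roughly like this (lines taken
from write_dirty() and write_dirty_finish() in
drivers/md/bcache/writeback.c): the dirty data goes out as a plain write,
and the dirty mark is dropped as soon as the bio completes, with no flush
of the HDD's volatile cache in between.

        /* write_dirty(): dirty data is sent to the backing device as a
         * plain write -- no REQ_PREFLUSH/REQ_FUA, so it may still sit in
         * the drive's volatile write cache when the bio completes.
         */
        io->bio.bi_opf = REQ_OP_WRITE;

        /* write_dirty_finish(): the dirty mark is cleared as soon as the
         * write completes, before the data is known to be durable.
         */
        SET_KEY_DIRTY(keys.top, false);
        bch_keylist_push(&keys);
        ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key);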
This patch aims to make dirty data writeback more reliable by using
PREFLUSH, so that the dirty data mark in the cache device is cleared only
after the dirty data has actually reached the backing disk. The main
features are:
1. A dynamic switch: the feature is controlled through the
/sys/block/bcache0/bcache/writeback_flush parameter; the default value is
"off".
2. Two trigger conditions for issuing a PREFLUSH (see the sketch below):
- the number of dirty bkeys written back since the last flush reaches a
threshold (tunable via /sys/block/bcache0/bcache/flush_interval, default
20000; increasing this value reduces the performance impact of flushing),
or
- more than 30 seconds have passed since the last flush.
When either condition is met, a PREFLUSH command is sent to the backing
HDD, and the corresponding dirty bkey marks are cleared only after the
PREFLUSH has completed successfully.
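In the writeback thread this amounts to roughly the following (simplified
from the diff below; flush_backing_device() issues the PREFLUSH via
submit_bio_wait() and clears KEY_DIRTY for the accumulated bkeys only on
success):

        /* after read_dirty(): flush the backing HDD and, only on success,
         * clear KEY_DIRTY in the btree for the bkeys written back since
         * the last flush
         */
        if (dc->preflush_keys.count >= dc->flush_interval ||
            time_after(jiffies, last_flush_jiffies + 30 * HZ)) {
                flush_backing_device(dc);
                last_flush_jiffies = jiffies;
        }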
Two tests were conducted:
1. Writeback efficiency comparison:
Group A: native Linux kernel driver, with all other system parameters at
their default values.
Group B: native Linux kernel plus this patch, with
/sys/block/bcache0/bcache/writeback_flush set to "on" and all other
parameters at their default values.
Test script:
https://github.com/jifengzhou/file/blob/main/writebackflush-test.sh
The script proceeds as follows:
Preparation phase: run 4KB random writes continuously for 10 minutes.
Initial record: at the end of the preparation phase, immediately record the
current time and bcache's dirty_data value.
Monitoring phase: continuously poll dirty_data and record the completion
time when it drops to 0.
Metric: from the start/end time difference and the initial dirty_data
value, compute the average writeback speed (initial dirty_data value /
elapsed time).
Test results:
Group A:
Initial record: started writing back dirty data at 2025-07-23 09:05:47,
with dirty_data = 133.8GB
Monitoring: dirty_data reached 0 at 2025-07-23 17:31:45
Total duration: 8 hours 25 minutes 58 seconds
Average writeback rate: 4.51MB/s
Group B:
Initial record: started writing back dirty data at 2025-07-22 20:00:09,
with dirty_data = 131.2GB
Monitoring: dirty_data reached 0 at 2025-07-23 04:32:33
Total duration: 8 hours 32 minutes 24 seconds
Average writeback rate: 4.37MB/s
Monitoring data comparison:
HDD iostat monitoring data curve comparison chart (covers the full period
from the initial record to the end of the monitoring phase):
https://github.com/jifengzhou/file/blob/main/HDD%20iostat%20data%20for%20test%20scenario%201.xlsx
Summary of this test: over these roughly 8.5-hour writeback runs, the
writeback speed with this patch was approximately 96.9% of the speed
without it.
2. Performance comparison test:
Group C: native Linux kernel driver, with all other system parameters at
their default values.
Group D: native Linux kernel plus this patch, with
/sys/block/bcache0/bcache/writeback_flush set to "on" and all other
parameters at their default values.
Test plan:
Test tool: vdbench storage performance test tool
Load type: 4KB random writes
Duration: 48-hour continuous stress test
Configuration file:
https://github.com/jifengzhou/file/blob/main/block.conf
Command: vdbench -f block.conf
Test results:
Comparison curves of the vdbench output for groups C and D:
https://github.com/jifengzhou/file/blob/main/48-hour%204K%20Random%20Write.xlsx
Summary of this test: over the 48-hour run, the average IOPS of Group C was
942.5 while Group D averaged 937.1, a very small difference. Group C
completed its first GC at around 14 hours and Group D at around 14.5 hours,
also a small difference.
From the above two results, this patch significantly enhances data
reliability while having a relatively minor impact on performance. The
impact on performance and writeback efficiency can be reduced further by
increasing the value of /sys/block/bcache0/bcache/flush_interval.
Signed-off-by: Zhou Jifeng <zhoujifeng@kylinos.com.cn>
---
drivers/md/bcache/bcache.h | 23 ++++
drivers/md/bcache/bcache_ondisk.h | 4 +
drivers/md/bcache/sysfs.c | 47 ++++++++
drivers/md/bcache/writeback.c | 174 +++++++++++++++++++++++++++---
drivers/md/bcache/writeback.h | 4 +
5 files changed, 239 insertions(+), 13 deletions(-)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 785b0d9008fa..09424938437b 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -247,6 +247,17 @@ struct keybuf {
DECLARE_ARRAY_ALLOCATOR(struct keybuf_key, freelist, KEYBUF_NR);
};
+struct keybuf_preflush {
+ spinlock_t lock;
+ struct list_head list;
+ u32 count;
+};
+
+struct flush_key_entry {
+ struct keybuf_key key;
+ struct list_head list;
+};
+
struct bcache_device {
struct closure cl;
@@ -346,6 +357,18 @@ struct cached_dev {
struct keybuf writeback_keys;
+ /*
+ * Before issuing a preflush to the backing device, temporarily
+ * store the bkeys whose dirty marks are waiting to be cleared
+ */
+ struct keybuf_preflush preflush_keys;
+ /*
+ * flush_interval specifies how many dirty bkeys may be written
+ * back before a PREFLUSH operation is issued to the backing
+ * device each time.
+ */
+ unsigned int flush_interval;
+
struct task_struct *status_update_thread;
/*
* Order the write-half of writeback operations strongly in dispatch
diff --git a/drivers/md/bcache/bcache_ondisk.h b/drivers/md/bcache/bcache_ondisk.h
index 6620a7f8fffc..df5800838e40 100644
--- a/drivers/md/bcache/bcache_ondisk.h
+++ b/drivers/md/bcache/bcache_ondisk.h
@@ -294,6 +294,10 @@ BITMASK(BDEV_CACHE_MODE, struct cache_sb, flags, 0, 4);
#define CACHE_MODE_WRITEBACK 1U
#define CACHE_MODE_WRITEAROUND 2U
#define CACHE_MODE_NONE 3U
+BITMASK(BDEV_WRITEBACK_FLUSH, struct cache_sb, flags, 4, 1);
+#define WRITEBACK_FLUSH_OFF 0U
+#define WRITEBACK_FLUSH_ON 1U
+
BITMASK(BDEV_STATE, struct cache_sb, flags, 61, 2);
#define BDEV_STATE_NONE 0U
#define BDEV_STATE_CLEAN 1U
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index e8f696cb58c0..cc228e592ab6 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -28,6 +28,18 @@ static const char * const bch_cache_modes[] = {
NULL
};
+/*
+ * Default is 0 ("off")
+ * off: Do nothing
+ * on: Use FLUSH when writing back dirty data.
+ */
+static const char * const bch_writeback_flush[] = {
+ "off",
+ "on",
+ NULL
+};
+
+
static const char * const bch_reada_cache_policies[] = {
"all",
"meta-only",
@@ -151,6 +163,19 @@ rw_attribute(copy_gc_enabled);
rw_attribute(idle_max_writeback_rate);
rw_attribute(gc_after_writeback);
rw_attribute(size);
+/*
+ * The "writeback_flush" has two options: "off" and "on". "off" is
+ * the default value.
+ * off: Do nothing
+ * on: Use FLUSH when writing back dirty data.
+ */
+rw_attribute(writeback_flush);
+/*
+ * "flush_interval" is used to specify that a PROFLUSH operation will
+ * be issued once a certain number of dirty bkeys have been written
+ * each time.
+ */
+rw_attribute(flush_interval);
static ssize_t bch_snprint_string_list(char *buf,
size_t size,
@@ -213,6 +238,7 @@ SHOW(__bch_cached_dev)
var_print(writeback_rate_fp_term_mid);
var_print(writeback_rate_fp_term_high);
var_print(writeback_rate_minimum);
+ var_print(flush_interval);
if (attr == &sysfs_writeback_rate_debug) {
char rate[20];
@@ -283,6 +309,11 @@ SHOW(__bch_cached_dev)
return strlen(buf);
}
+ if (attr == &sysfs_writeback_flush)
+ return bch_snprint_string_list(buf, PAGE_SIZE,
+ bch_writeback_flush,
+ BDEV_WRITEBACK_FLUSH(&dc->sb));
+
#undef var
return 0;
}
@@ -354,6 +385,9 @@ STORE(__cached_dev)
sysfs_strtoul_clamp(io_error_limit, dc->error_limit, 0, INT_MAX);
+ sysfs_strtoul_clamp(flush_interval, dc->flush_interval,
+ WRITEBACK_FLUSH_INTERVAL_MIN, WRITEBACK_FLUSH_INTERVAL_MAX);
+
if (attr == &sysfs_io_disable) {
int v = strtoul_or_return(buf);
@@ -451,6 +485,17 @@ STORE(__cached_dev)
if (attr == &sysfs_stop)
bcache_device_stop(&dc->disk);
+ if (attr == &sysfs_writeback_flush) {
+ v = __sysfs_match_string(bch_writeback_flush, -1, buf);
+ if (v < 0)
+ return v;
+
+ if ((unsigned int) v != BDEV_WRITEBACK_FLUSH(&dc->sb)) {
+ SET_BDEV_WRITEBACK_FLUSH(&dc->sb, v);
+ bch_write_bdev_super(dc, NULL);
+ }
+ }
+
return size;
}
@@ -541,6 +586,8 @@ static struct attribute *bch_cached_dev_attrs[] = {
#endif
&sysfs_backing_dev_name,
&sysfs_backing_dev_uuid,
+ &sysfs_writeback_flush,
+ &sysfs_flush_interval,
NULL
};
ATTRIBUTE_GROUPS(bch_cached_dev);
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 453efbbdc8ee..530eea2b953a 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -348,8 +348,121 @@ static CLOSURE_CALLBACK(dirty_io_destructor)
kfree(io);
}
+static int bcache_add_preflush_key(struct cached_dev *dc, struct keybuf_key *key)
+{
+ struct flush_key_entry *entry = kmalloc(sizeof(*entry), GFP_ATOMIC);
+
+ if (!entry) {
+ pr_info("the preflush bkey memory allocation failed.\n");
+ return -ENOMEM;
+ }
+
+ memcpy(&entry->key, key, sizeof(*key));
+ INIT_LIST_HEAD(&entry->list);
+
+ spin_lock(&dc->preflush_keys.lock);
+ list_add_tail(&entry->list, &dc->preflush_keys.list);
+ dc->preflush_keys.count++;
+ spin_unlock(&dc->preflush_keys.lock);
+
+ return 0;
+}
+
+static void bcache_mark_preflush_keys_clean(struct cached_dev *dc)
+{
+ struct flush_key_entry *e, *tmp;
+
+ list_for_each_entry_safe(e, tmp, &dc->preflush_keys.list, list) {
+ list_del(&e->list);
+ kfree(e);
+ }
+ dc->preflush_keys.count = 0;
+}
+
+static void launch_async_preflush_endio(struct bio *bio)
+{
+ if (bio->bi_status)
+ pr_err("flush backing device error %d.\n", bio->bi_status);
+}
+
+
+static inline void launch_async_preflush_request(struct cached_dev *dc)
+{
+ struct bio flush;
+
+ bio_init(&flush, dc->bdev, NULL, 0, REQ_OP_WRITE | REQ_PREFLUSH);
+
+ flush.bi_private = dc;
+ flush.bi_end_io = launch_async_preflush_endio;
+
+ submit_bio(&flush);
+}
+
+
+static void flush_backing_device(struct cached_dev *dc)
+{
+ int ret;
+ unsigned int i;
+ struct keylist keys;
+ struct flush_key_entry *e, *tmp;
+ struct bio flush;
+
+ if (dc->preflush_keys.count == 0)
+ return;
+
+ bio_init(&flush, dc->bdev, NULL, 0, REQ_OP_WRITE | REQ_PREFLUSH);
+ ret = submit_bio_wait(&flush);
+ if (ret) {
+ pr_err("flush backing device error %d.\n", ret);
+
+ /*
+ * In the case of flush failure, do not update the status of bkey
+ * in the btree, and wait until the next time to re-write the dirty
+ * data.
+ */
+ bcache_mark_preflush_keys_clean(dc);
+ return;
+ }
+
+ /*
+ * The dirty data has been written back and confirmed to be on the
+ * disk, so now update the status of the bkeys in the btree.
+ */
+ list_for_each_entry_safe(e, tmp, &dc->preflush_keys.list, list) {
+ memset(keys.inline_keys, 0, sizeof(keys.inline_keys));
+ bch_keylist_init(&keys);
+
+ bkey_copy(keys.top, &(e->key.key));
+ SET_KEY_DIRTY(keys.top, false);
+ bch_keylist_push(&keys);
+
+ for (i = 0; i < KEY_PTRS(&(e->key.key)); i++)
+ atomic_inc(&PTR_BUCKET(dc->disk.c, &(e->key.key), i)->pin);
+
+ ret = bch_btree_insert(dc->disk.c, &keys, NULL, &(e->key.key));
+
+ if (ret)
+ trace_bcache_writeback_collision(&(e->key.key));
+
+ atomic_long_inc(ret
+ ? &dc->disk.c->writeback_keys_failed
+ : &dc->disk.c->writeback_keys_done);
+
+ list_del(&e->list);
+ kfree(e);
+
+ /* For those bkeys that failed to be inserted, you can
+ * ignore them and they will be processed again in the
+ * next write-back scan.
+ */
+ }
+
+ dc->preflush_keys.count = 0;
+}
+
static CLOSURE_CALLBACK(write_dirty_finish)
{
+ int ret;
closure_type(io, struct dirty_io, cl);
struct keybuf_key *w = io->bio.bi_private;
struct cached_dev *dc = io->dc;
@@ -358,27 +471,41 @@ static CLOSURE_CALLBACK(write_dirty_finish)
/* This is kind of a dumb way of signalling errors. */
if (KEY_DIRTY(&w->key)) {
- int ret;
unsigned int i;
struct keylist keys;
- bch_keylist_init(&keys);
+ if (!BDEV_WRITEBACK_FLUSH(&dc->sb)) {
+update_bkey:
+ bch_keylist_init(&keys);
- bkey_copy(keys.top, &w->key);
- SET_KEY_DIRTY(keys.top, false);
- bch_keylist_push(&keys);
+ bkey_copy(keys.top, &w->key);
+ SET_KEY_DIRTY(keys.top, false);
+ bch_keylist_push(&keys);
- for (i = 0; i < KEY_PTRS(&w->key); i++)
- atomic_inc(&PTR_BUCKET(dc->disk.c, &w->key, i)->pin);
+ for (i = 0; i < KEY_PTRS(&w->key); i++)
+ atomic_inc(&PTR_BUCKET(dc->disk.c, &w->key, i)->pin);
- ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key);
+ ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key);
- if (ret)
- trace_bcache_writeback_collision(&w->key);
+ if (ret)
+ trace_bcache_writeback_collision(&w->key);
- atomic_long_inc(ret
- ? &dc->disk.c->writeback_keys_failed
- : &dc->disk.c->writeback_keys_done);
+ atomic_long_inc(ret
+ ? &dc->disk.c->writeback_keys_failed
+ : &dc->disk.c->writeback_keys_done);
+ } else {
+ /* After flushing the backing device, update the btree */
+ ret = bcache_add_preflush_key(dc, w);
+
+ /*
+ * When memory allocation fails, immediately send PREFLUSH
+ * and then update the btree.
+ */
+ if (ret) {
+ launch_async_preflush_request(dc);
+ goto update_bkey;
+ }
+ }
}
bch_keybuf_del(&dc->writeback_keys, w);
@@ -435,6 +562,7 @@ static CLOSURE_CALLBACK(write_dirty)
if (KEY_DIRTY(&w->key)) {
dirty_init(w);
io->bio.bi_opf = REQ_OP_WRITE;
+
io->bio.bi_iter.bi_sector = KEY_START(&w->key);
bio_set_dev(&io->bio, io->dc->bdev);
io->bio.bi_end_io = dirty_endio;
@@ -741,6 +869,7 @@ static int bch_writeback_thread(void *arg)
struct cached_dev *dc = arg;
struct cache_set *c = dc->disk.c;
bool searched_full_index;
+ unsigned long last_flush_jiffies = jiffies;
bch_ratelimit_reset(&dc->writeback_rate);
@@ -819,9 +948,23 @@ static int bch_writeback_thread(void *arg)
read_dirty(dc);
+ /*
+ * If the accumulated preflush_keys exceed a certain quantity or
+ * the interval time exceeds 30 seconds, issue the PREFLUSH command
+ * once.
+ */
+ if (dc->preflush_keys.count >= dc->flush_interval ||
+ time_after(jiffies, last_flush_jiffies + 30 * HZ)) {
+ flush_backing_device(dc);
+ last_flush_jiffies = jiffies;
+ }
+
if (searched_full_index) {
unsigned int delay = dc->writeback_delay * HZ;
+ /* Clean up the remaining preflush_keys. */
+ flush_backing_device(dc);
+
while (delay &&
!kthread_should_stop() &&
!test_bit(CACHE_SET_IO_DISABLE, &c->flags) &&
@@ -1068,10 +1211,15 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
dc->writeback_rate_fp_term_mid = 10;
dc->writeback_rate_fp_term_high = 1000;
dc->writeback_rate_i_term_inverse = 10000;
+ dc->flush_interval = WRITEBACK_FLUSH_INTERVAL_DEFAULT;
/* For dc->writeback_lock contention in update_writeback_rate() */
dc->rate_update_retry = 0;
+ INIT_LIST_HEAD(&dc->preflush_keys.list);
+ spin_lock_init(&dc->preflush_keys.lock);
+ dc->preflush_keys.count = 0;
+
WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags));
INIT_DELAYED_WORK(&dc->writeback_rate_update, update_writeback_rate);
}
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 31df716951f6..1cd3e4addba2 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -14,6 +14,10 @@
#define WRITEBACK_RATE_UPDATE_SECS_MAX 60
#define WRITEBACK_RATE_UPDATE_SECS_DEFAULT 5
+#define WRITEBACK_FLUSH_INTERVAL_MIN 500
+#define WRITEBACK_FLUSH_INTERVAL_MAX 50000
+#define WRITEBACK_FLUSH_INTERVAL_DEFAULT 20000 /* the number of bkey */
+
#define BCH_AUTO_GC_DIRTY_THRESHOLD 50
#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW 50
--
2.18.1
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-07-31 6:21 [PATCH] bcache: enhancing the security of dirty data writeback Zhou Jifeng
@ 2025-07-31 15:49 ` Kent Overstreet
2025-08-01 2:27 ` Zhou Jifeng
2025-08-05 4:57 ` Coly Li
1 sibling, 1 reply; 25+ messages in thread
From: Kent Overstreet @ 2025-07-31 15:49 UTC (permalink / raw)
To: Zhou Jifeng; +Cc: colyli, linux-bcache, linux-kernel
On Thu, Jul 31, 2025 at 02:21:40PM +0800, Zhou Jifeng wrote:
> There is a potential data consistency risk in bcache's writeback mode:when
> the application calls fsync, bcache returns success after completing the
> log write, persisting the cache disk data, and persisting the HDD internal
> cache. However, at this point, the actual application data may still be in
> a dirty state and remain stuck in the cache disk. when these data are
> subsequently written back to the HDD asynchronously through REQ_OP_WRITE,
> there is no forced refresh mechanism to ensure physical placement on the
> disk, and there may be no power-off protection measures, which poses a risk
> of data loss. This mechanism may cause the application to misjudge that the
> data has been persisted, which is different from the actual storage state,
> and also violates the semantic agreement that fsync should ensure data
> persistence.
So, the way you start out describing the patch, it sounds like you're
addressing some sort of bug in REQ_OP_FLUSH handling, but it looks like
what you're actually doing is adding a mode where flushes also flush
bcache?
This doesn't sound like a bugfix, this sounds like a completely new
mode - we'd need an explanation of the use case you're aiming for.
The model for bcache has always been that since the cache is persistent,
when you're in writeback mode there are no fsync/REQ_OP_FLUSH
considerations for dirty data; once we've properly persisted (and
flushed) that data we're good.
If you want a mode where you can run in writeback mode, but obey flushes
so that it's still safe to lose or yank the cache device - is that what
you're after?
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-07-31 15:49 ` Kent Overstreet
@ 2025-08-01 2:27 ` Zhou Jifeng
2025-08-01 2:37 ` Kent Overstreet
0 siblings, 1 reply; 25+ messages in thread
From: Zhou Jifeng @ 2025-08-01 2:27 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Coly Li, linux-bcache, linux-kernel
On Fri, 1 Aug 2025 at 00:02, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> On Thu, Jul 31, 2025 at 02:21:40PM +0800, Zhou Jifeng wrote:
> > There is a potential data consistency risk in bcache's writeback mode:when
> > the application calls fsync, bcache returns success after completing the
> > log write, persisting the cache disk data, and persisting the HDD internal
> > cache. However, at this point, the actual application data may still be in
> > a dirty state and remain stuck in the cache disk. when these data are
> > subsequently written back to the HDD asynchronously through REQ_OP_WRITE,
> > there is no forced refresh mechanism to ensure physical placement on the
> > disk, and there may be no power-off protection measures, which poses a risk
> > of data loss. This mechanism may cause the application to misjudge that the
> > data has been persisted, which is different from the actual storage state,
> > and also violates the semantic agreement that fsync should ensure data
> > persistence.
>
> So, the way you start out describing the patch, it sounds like you're
> addressing some sort of bug in REQ_OP_FLUSH handling, but it looks like
> what you're actually doing is adding a mode where flushes also flush
> bcache?
>
> This doesn't sound like a bugfix, this sounds like a completely new
> mode - we'd need an explanation of the use case you're aiming for.
Yes, this is not about trying to fix a particular defect. Instead, it is about
adding a new method within the current writeback mode to make the
process of writing dirty data in the cache to the HDD more reliable.
> The model for bcache has always been that since the cache is persistent,
> when you're in writeback mode there are no fsync/REQ_OP_FLUSH
> considerations for dirty data; once we've properly persisted (and
> flushed) that data we're good.
In writeback mode, the current bcache code writes dirty data back with
REQ_OP_WRITE and clears the bkey dirty flag in the btree from the bio
completion callback. I think there is a potential risk here: on an
unexpected power outage, data still sitting in the HDD's hardware cache may
not have been persisted yet and can be lost. Since the bkey dirty flag in
the btree has already been cleared by that point, the state recorded by the
bkey no longer matches the actual contents of the SSD and HDD.
Am I understanding this correctly?
> If you want a mode where you can run in writeback mode, but obey flushes
> so that it's still safe to lose or yank the cache device - is that what
> you're after?
When dirty data is held in the HDD's hardware cache (not yet persisted to
the disk medium) and an unexpected power outage or unplugging occurs, that
data may be lost. In such a scenario the dirty flag of the bkey in the
btree has already been cleared, yet the dirty data corresponding to the
bkey has not actually been written back to the HDD. I want a way to avoid
this situation.
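The patch's approach is to keep the bkeys dirty until a PREFLUSH to the
backing device has completed, and only then clear KEY_DIRTY in the btree.
Stripped down (see flush_backing_device() in the patch), the idea is:

        struct bio flush;

        bio_init(&flush, dc->bdev, NULL, 0, REQ_OP_WRITE | REQ_PREFLUSH);
        ret = submit_bio_wait(&flush);
        if (ret) {
                /* flush failed: leave the btree untouched, the keys stay
                 * dirty and the data will be written back again later
                 */
                return;
        }
        /* flush succeeded: now it is safe to clear KEY_DIRTY */
        SET_KEY_DIRTY(keys.top, false);
        ret = bch_btree_insert(dc->disk.c, &keys, NULL, &e->key.key);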
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-08-01 2:27 ` Zhou Jifeng
@ 2025-08-01 2:37 ` Kent Overstreet
2025-08-01 3:30 ` Zhou Jifeng
0 siblings, 1 reply; 25+ messages in thread
From: Kent Overstreet @ 2025-08-01 2:37 UTC (permalink / raw)
To: Zhou Jifeng; +Cc: Coly Li, linux-bcache, linux-kernel
On Fri, Aug 01, 2025 at 10:27:21AM +0800, Zhou Jifeng wrote:
> In the writeback mode, the current bcache code uses the
> REQ_OP_WRITE operation to handle dirty data, and clears the bkey
> dirty flag in the btree during the bio completion callback. I think
> there might be a potential risk: if in the event of an unexpected
> power outage, the data in the HDD hardware cache may not have
> had time to be persisted, then the data in the HDD hardware cache
> that is pending processing may be lost. Since at this time the bkey
> dirty flag in the btree has been cleared, the data status recorded
> by the bkey does not match the actual situation of the SSD and
> HDD.
> Am I understanding this correctly?
For what you're describing, we need to make sure the backing device is
flushed when we're flushing the journal.
It's possible that this isn't handled correctly in bcache; bcachefs
does, and I wrote that code after bcache - but the bcache version would
look quite different.
You've read that code more recently than I have - have you checked for
that?
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-08-01 2:37 ` Kent Overstreet
@ 2025-08-01 3:30 ` Zhou Jifeng
2025-08-01 3:41 ` Kent Overstreet
0 siblings, 1 reply; 25+ messages in thread
From: Zhou Jifeng @ 2025-08-01 3:30 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Coly Li, linux-bcache, linux-kernel
On Fri, 1 Aug 2025 at 10:37, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> On Fri, Aug 01, 2025 at 10:27:21AM +0800, Zhou Jifeng wrote:
> > In the writeback mode, the current bcache code uses the
> > REQ_OP_WRITE operation to handle dirty data, and clears the bkey
> > dirty flag in the btree during the bio completion callback. I think
> > there might be a potential risk: if in the event of an unexpected
> > power outage, the data in the HDD hardware cache may not have
> > had time to be persisted, then the data in the HDD hardware cache
> > that is pending processing may be lost. Since at this time the bkey
> > dirty flag in the btree has been cleared, the data status recorded
> > by the bkey does not match the actual situation of the SSD and
> > HDD.
> > Am I understanding this correctly?
>
> For what you're describing, we need to make sure the backing device is
> flushed when we're flushing the journal.
>
> It's possible that this isn't handled correctly in bcache; bcachefs
> does, and I wrote that code after bcache - but the bcache version would
> look quite different.
>
> You've read that code more recently than I have - have you checked for
> that?
In the `write_dirty_finish` function the `bkey` status is updated, but I
did not see any journal write there. And in `journal_write_unlocked`,
bcache's core journal-write function, I also could not find any code that
sends a FLUSH command to the backing HDD.
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-08-01 3:30 ` Zhou Jifeng
@ 2025-08-01 3:41 ` Kent Overstreet
2025-08-01 6:10 ` Zhou Jifeng
0 siblings, 1 reply; 25+ messages in thread
From: Kent Overstreet @ 2025-08-01 3:41 UTC (permalink / raw)
To: Zhou Jifeng; +Cc: Coly Li, linux-bcache, linux-kernel
On Fri, Aug 01, 2025 at 11:30:43AM +0800, Zhou Jifeng wrote:
> On Fri, 1 Aug 2025 at 10:37, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > On Fri, Aug 01, 2025 at 10:27:21AM +0800, Zhou Jifeng wrote:
> > > In the writeback mode, the current bcache code uses the
> > > REQ_OP_WRITE operation to handle dirty data, and clears the bkey
> > > dirty flag in the btree during the bio completion callback. I think
> > > there might be a potential risk: if in the event of an unexpected
> > > power outage, the data in the HDD hardware cache may not have
> > > had time to be persisted, then the data in the HDD hardware cache
> > > that is pending processing may be lost. Since at this time the bkey
> > > dirty flag in the btree has been cleared, the data status recorded
> > > by the bkey does not match the actual situation of the SSD and
> > > HDD.
> > > Am I understanding this correctly?
> >
> > For what you're describing, we need to make sure the backing device is
> > flushed when we're flushing the journal.
> >
> > It's possible that this isn't handled correctly in bcache; bcachefs
> > does, and I wrote that code after bcache - but the bcache version would
> > look quite different.
> >
> > You've read that code more recently than I have - have you checked for
> > that?
>
> In the `write_dirty_finish` function, there is an attempt to update the
> `bkey` status, but I did not observe any logging writing process. In the
> core function `journal_write_unlocked` of bcache for writing logs, I
> also couldn't find the code logic for sending a FLUSH command to the
> backend HDD.
The right place for it would be in the journal code: before doing a
journal write, issue flushes to the backing devices.
Can you check for that?
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-08-01 3:41 ` Kent Overstreet
@ 2025-08-01 6:10 ` Zhou Jifeng
2025-08-02 17:29 ` Coly Li
0 siblings, 1 reply; 25+ messages in thread
From: Zhou Jifeng @ 2025-08-01 6:10 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Coly Li, linux-bcache, linux-kernel
On Fri, 1 Aug 2025 at 11:42, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> On Fri, Aug 01, 2025 at 11:30:43AM +0800, Zhou Jifeng wrote:
> > On Fri, 1 Aug 2025 at 10:37, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > >
> > > On Fri, Aug 01, 2025 at 10:27:21AM +0800, Zhou Jifeng wrote:
> > > > In the writeback mode, the current bcache code uses the
> > > > REQ_OP_WRITE operation to handle dirty data, and clears the bkey
> > > > dirty flag in the btree during the bio completion callback. I think
> > > > there might be a potential risk: if in the event of an unexpected
> > > > power outage, the data in the HDD hardware cache may not have
> > > > had time to be persisted, then the data in the HDD hardware cache
> > > > that is pending processing may be lost. Since at this time the bkey
> > > > dirty flag in the btree has been cleared, the data status recorded
> > > > by the bkey does not match the actual situation of the SSD and
> > > > HDD.
> > > > Am I understanding this correctly?
> > >
> > > For what you're describing, we need to make sure the backing device is
> > > flushed when we're flushing the journal.
> > >
> > > It's possible that this isn't handled correctly in bcache; bcachefs
> > > does, and I wrote that code after bcache - but the bcache version would
> > > look quite different.
> > >
> > > You've read that code more recently than I have - have you checked for
> > > that?
> >
> > In the `write_dirty_finish` function, there is an attempt to update the
> > `bkey` status, but I did not observe any logging writing process. In the
> > core function `journal_write_unlocked` of bcache for writing logs, I
> > also couldn't find the code logic for sending a FLUSH command to the
> > backend HDD.
>
> The right place for it would be in the journal code: before doing a
> journal write, issue flushes to the backing devices.
>
> Can you check for that?
>
I checked and found that there is no code that sends a flush request to
the backing device before the journal is written. Additionally, the
completion callback that runs after dirty data has been written back
updates the bkey without inserting that update into the journal.
The following callback, run after dirty data is written back, only updates
the bkey and does not record the inserted bkey in the journal:
static CLOSURE_CALLBACK(write_dirty_finish)
{
        closure_type(io, struct dirty_io, cl);
        struct keybuf_key *w = io->bio.bi_private;
        struct cached_dev *dc = io->dc;

        bio_free_pages(&io->bio);

        /* This is kind of a dumb way of signalling errors. */
        if (KEY_DIRTY(&w->key)) {
                int ret;
                unsigned int i;
                struct keylist keys;

                bch_keylist_init(&keys);

                bkey_copy(keys.top, &w->key);
                SET_KEY_DIRTY(keys.top, false);
                bch_keylist_push(&keys);

                for (i = 0; i < KEY_PTRS(&w->key); i++)
                        atomic_inc(&PTR_BUCKET(dc->disk.c, &w->key, i)->pin);

                ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key);

                if (ret)
                        trace_bcache_writeback_collision(&w->key);

                atomic_long_inc(ret
                                ? &dc->disk.c->writeback_keys_failed
                                : &dc->disk.c->writeback_keys_done);
        }

        bch_keybuf_del(&dc->writeback_keys, w);
        up(&dc->in_flight);

        closure_return_with_destructor(cl, dirty_io_destructor);
}
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-08-01 6:10 ` Zhou Jifeng
@ 2025-08-02 17:29 ` Coly Li
2025-08-02 18:49 ` Kent Overstreet
2025-08-04 3:47 ` Zhou Jifeng
0 siblings, 2 replies; 25+ messages in thread
From: Coly Li @ 2025-08-02 17:29 UTC (permalink / raw)
To: Zhou Jifeng; +Cc: Kent Overstreet, linux-bcache, linux-kernel
On Fri, Aug 01, 2025 at 02:10:12PM +0800, Zhou Jifeng wrote:
> On Fri, 1 Aug 2025 at 11:42, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > On Fri, Aug 01, 2025 at 11:30:43AM +0800, Zhou Jifeng wrote:
> > > On Fri, 1 Aug 2025 at 10:37, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > >
> > > > On Fri, Aug 01, 2025 at 10:27:21AM +0800, Zhou Jifeng wrote:
> > > > > In the writeback mode, the current bcache code uses the
> > > > > REQ_OP_WRITE operation to handle dirty data, and clears the bkey
> > > > > dirty flag in the btree during the bio completion callback. I think
> > > > > there might be a potential risk: if in the event of an unexpected
> > > > > power outage, the data in the HDD hardware cache may not have
> > > > > had time to be persisted, then the data in the HDD hardware cache
> > > > > that is pending processing may be lost. Since at this time the bkey
> > > > > dirty flag in the btree has been cleared, the data status recorded
> > > > > by the bkey does not match the actual situation of the SSD and
> > > > > HDD.
> > > > > Am I understanding this correctly?
> > > >
> > > > For what you're describing, we need to make sure the backing device is
> > > > flushed when we're flushing the journal.
> > > >
> > > > It's possible that this isn't handled correctly in bcache; bcachefs
> > > > does, and I wrote that code after bcache - but the bcache version would
> > > > look quite different.
> > > >
> > > > You've read that code more recently than I have - have you checked for
> > > > that?
> > >
> > > In the `write_dirty_finish` function, there is an attempt to update the
> > > `bkey` status, but I did not observe any logging writing process. In the
> > > core function `journal_write_unlocked` of bcache for writing logs, I
> > > also couldn't find the code logic for sending a FLUSH command to the
> > > backend HDD.
> >
> > The right place for it would be in the journal code: before doing a
> > journal write, issue flushes to the backing devices.
> >
> > Can you check for that?
> >
>
> I checked and found that there was no code for sending a flush request
> to the backend device before the execution log was written. Additionally,
> in the callback function after the dirty data was written back, when it
> updated the bkey, it did not insert this update into the log.
>
It doesn't have to. If the previous dirty version of the key is already on
the cache device and a power-off happens, then even if the clean version of
the key is gone, the dirty version and its data are all valid. If the LBA
range of this key has been allocated to a new key, a GC must have already
happened, and the dirty key is invalid because the bucket generation was
increased. So don't worry, the clean key does not need to go into the
journal in the writeback scenario.
IMHO, you may try to flush the backing device in a kworker before calling
set_gc_sectors() in bch_gc_thread(). That way the disk cache is flushed in
time before the still-dirty-on-disk keys are invalidated by increasing the
bucket gen. And also flush the backing device after searched_full_index
becomes true in the writeback thread main loop (as you do now).
Flushing the backing device after read_dirty() returns is too heavy, even
with the flush hint keys, and the flush operations are unnecessary before
the keys are reclaimed by garbage collection.
Just my suggestion for your consideration.
Thanks.
Coly Li
> The following callback function after dirty data is written back only
> updates the bkey, without recording the inserted bkey in the journal:
> static CLOSURE_CALLBACK(write_dirty_finish)
> {
> closure_type(io, struct dirty_io, cl);
> struct keybuf_key *w = io->bio.bi_private;
> struct cached_dev *dc = io->dc;
>
> bio_free_pages(&io->bio);
>
> /* This is kind of a dumb way of signalling errors. */
> if (KEY_DIRTY(&w->key)) {
> int ret;
> unsigned int i;
> struct keylist keys;
>
> bch_keylist_init(&keys);
>
> bkey_copy(keys.top, &w->key);
> SET_KEY_DIRTY(keys.top, false);
> bch_keylist_push(&keys);
>
> for (i = 0; i < KEY_PTRS(&w->key); i++)
> atomic_inc(&PTR_BUCKET(dc->disk.c, &w->key, i)->pin);
>
> ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key);
>
> if (ret)
> trace_bcache_writeback_collision(&w->key);
>
> atomic_long_inc(ret
> ? &dc->disk.c->writeback_keys_failed
> : &dc->disk.c->writeback_keys_done);
> }
>
> bch_keybuf_del(&dc->writeback_keys, w);
> up(&dc->in_flight);
>
> closure_return_with_destructor(cl, dirty_io_destructor);
> }
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-08-02 17:29 ` Coly Li
@ 2025-08-02 18:49 ` Kent Overstreet
2025-08-04 3:47 ` Zhou Jifeng
1 sibling, 0 replies; 25+ messages in thread
From: Kent Overstreet @ 2025-08-02 18:49 UTC (permalink / raw)
To: Coly Li; +Cc: Zhou Jifeng, linux-bcache, linux-kernel
On Sun, Aug 03, 2025 at 01:29:55AM +0800, Coly Li wrote:
> On Fri, Aug 01, 2025 at 02:10:12PM +0800, Zhou Jifeng wrote:
> > On Fri, 1 Aug 2025 at 11:42, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > >
> > > On Fri, Aug 01, 2025 at 11:30:43AM +0800, Zhou Jifeng wrote:
> > > > On Fri, 1 Aug 2025 at 10:37, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > > >
> > > > > On Fri, Aug 01, 2025 at 10:27:21AM +0800, Zhou Jifeng wrote:
> > > > > > In the writeback mode, the current bcache code uses the
> > > > > > REQ_OP_WRITE operation to handle dirty data, and clears the bkey
> > > > > > dirty flag in the btree during the bio completion callback. I think
> > > > > > there might be a potential risk: if in the event of an unexpected
> > > > > > power outage, the data in the HDD hardware cache may not have
> > > > > > had time to be persisted, then the data in the HDD hardware cache
> > > > > > that is pending processing may be lost. Since at this time the bkey
> > > > > > dirty flag in the btree has been cleared, the data status recorded
> > > > > > by the bkey does not match the actual situation of the SSD and
> > > > > > HDD.
> > > > > > Am I understanding this correctly?
> > > > >
> > > > > For what you're describing, we need to make sure the backing device is
> > > > > flushed when we're flushing the journal.
> > > > >
> > > > > It's possible that this isn't handled correctly in bcache; bcachefs
> > > > > does, and I wrote that code after bcache - but the bcache version would
> > > > > look quite different.
> > > > >
> > > > > You've read that code more recently than I have - have you checked for
> > > > > that?
> > > >
> > > > In the `write_dirty_finish` function, there is an attempt to update the
> > > > `bkey` status, but I did not observe any logging writing process. In the
> > > > core function `journal_write_unlocked` of bcache for writing logs, I
> > > > also couldn't find the code logic for sending a FLUSH command to the
> > > > backend HDD.
> > >
> > > The right place for it would be in the journal code: before doing a
> > > journal write, issue flushes to the backing devices.
> > >
> > > Can you check for that?
> > >
> >
> > I checked and found that there was no code for sending a flush request
> > to the backend device before the execution log was written. Additionally,
> > in the callback function after the dirty data was written back, when it
> > updated the bkey, it did not insert this update into the log.
> >
>
> It doesn't have to. If the previous dirty version of the key is on cache device
> already, and power off happens, even the clean version of the key is gone, the
> dirty version and its data are all valid. If the LBA range of this key is
> allocated to a new key, a GC must have alrady happened, and the dirty key is
> invalid due to bucket generation increased. So don't worry, the clean key is
> unncessary to go into journal in the writeback scenario.
>
> IMHO, you may try to flush backing device in a kworker before calling
> set_gc_sectors() in bch_gc_thread(). The disk cache can be flushed in time
> before the still-dirty-on-disk keys are invalidated by increase bucket key
> gen. And also flushing backing device after searched_full_index becomes
> true in the writeback thread main loop (as you did now).
>
> Flushing backing device after read_dirty() returns is too heavy, event with
> the flush hint keys. And the flushing operations are unnecessary before the
> keys are reclaimed by garbage collaction.
No, a flush is necessary.
And it's not after read_dirty, the flush just needs to happen before
bcache does its journal commit.
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-08-02 17:29 ` Coly Li
2025-08-02 18:49 ` Kent Overstreet
@ 2025-08-04 3:47 ` Zhou Jifeng
2025-08-04 4:17 ` Kent Overstreet
1 sibling, 1 reply; 25+ messages in thread
From: Zhou Jifeng @ 2025-08-04 3:47 UTC (permalink / raw)
To: Coly Li; +Cc: kent.overstreet, linux-bcache, linux-kernel
On Sun, 3 Aug 2025 at 01:30, Coly Li <colyli@kernel.org> wrote:
>
> On Fri, Aug 01, 2025 at 02:10:12PM +0800, Zhou Jifeng wrote:
> > On Fri, 1 Aug 2025 at 11:42, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > >
> > > On Fri, Aug 01, 2025 at 11:30:43AM +0800, Zhou Jifeng wrote:
> > > > On Fri, 1 Aug 2025 at 10:37, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > > >
> > > > > On Fri, Aug 01, 2025 at 10:27:21AM +0800, Zhou Jifeng wrote:
> > > > > > In the writeback mode, the current bcache code uses the
> > > > > > REQ_OP_WRITE operation to handle dirty data, and clears the bkey
> > > > > > dirty flag in the btree during the bio completion callback. I think
> > > > > > there might be a potential risk: if in the event of an unexpected
> > > > > > power outage, the data in the HDD hardware cache may not have
> > > > > > had time to be persisted, then the data in the HDD hardware cache
> > > > > > that is pending processing may be lost. Since at this time the bkey
> > > > > > dirty flag in the btree has been cleared, the data status recorded
> > > > > > by the bkey does not match the actual situation of the SSD and
> > > > > > HDD.
> > > > > > Am I understanding this correctly?
> > > > >
> > > > > For what you're describing, we need to make sure the backing device is
> > > > > flushed when we're flushing the journal.
> > > > >
> > > > > It's possible that this isn't handled correctly in bcache; bcachefs
> > > > > does, and I wrote that code after bcache - but the bcache version would
> > > > > look quite different.
> > > > >
> > > > > You've read that code more recently than I have - have you checked for
> > > > > that?
> > > >
> > > > In the `write_dirty_finish` function, there is an attempt to update the
> > > > `bkey` status, but I did not observe any logging writing process. In the
> > > > core function `journal_write_unlocked` of bcache for writing logs, I
> > > > also couldn't find the code logic for sending a FLUSH command to the
> > > > backend HDD.
> > >
> > > The right place for it would be in the journal code: before doing a
> > > journal write, issue flushes to the backing devices.
> > >
> > > Can you check for that?
> > >
> >
> > I checked and found that there was no code for sending a flush request
> > to the backend device before the execution log was written. Additionally,
> > in the callback function after the dirty data was written back, when it
> > updated the bkey, it did not insert this update into the log.
> >
>
> It doesn't have to. If the previous dirty version of the key is on cache device
> already, and power off happens, even the clean version of the key is gone, the
> dirty version and its data are all valid. If the LBA range of this key is
> allocated to a new key, a GC must have alrady happened, and the dirty key is
> invalid due to bucket generation increased. So don't worry, the clean key is
> unncessary to go into journal in the writeback scenario.
>
> IMHO, you may try to flush backing device in a kworker before calling
> set_gc_sectors() in bch_gc_thread(). The disk cache can be flushed in time
> before the still-dirty-on-disk keys are invalidated by increase bucket key
> gen. And also flushing backing device after searched_full_index becomes
> true in the writeback thread main loop (as you did now).
>
The "flush" command previously issued by GC was supposed to alleviate
the problems in some scenarios. However, I thought of a situation where
this "flush" command issued before GC might not be sufficient to solve
the issue.
Suppose such a scenario: after a power failure, some hardware cache data
in the HDD is lost, while the corresponding bkey(with the dirty flag cleared)
update in the SSD has been persisted. After the power is restored, if
bcache sends a flush before GC, will this cause data loss?
> Flushing backing device after read_dirty() returns is too heavy, event with
> the flush hint keys. And the flushing operations are unnecessary before the
> keys are reclaimed by garbage collaction.
>
> Just my suggestion for your consideration.
>
> Thanks.
>
> Coly Li
>
>
> > The following callback function after dirty data is written back only
> > updates the bkey, without recording the inserted bkey in the journal:
> > static CLOSURE_CALLBACK(write_dirty_finish)
> > {
> > closure_type(io, struct dirty_io, cl);
> > struct keybuf_key *w = io->bio.bi_private;
> > struct cached_dev *dc = io->dc;
> >
> > bio_free_pages(&io->bio);
> >
> > /* This is kind of a dumb way of signalling errors. */
> > if (KEY_DIRTY(&w->key)) {
> > int ret;
> > unsigned int i;
> > struct keylist keys;
> >
> > bch_keylist_init(&keys);
> >
> > bkey_copy(keys.top, &w->key);
> > SET_KEY_DIRTY(keys.top, false);
> > bch_keylist_push(&keys);
> >
> > for (i = 0; i < KEY_PTRS(&w->key); i++)
> > atomic_inc(&PTR_BUCKET(dc->disk.c, &w->key, i)->pin);
> >
> > ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key);
> >
> > if (ret)
> > trace_bcache_writeback_collision(&w->key);
> >
> > atomic_long_inc(ret
> > ? &dc->disk.c->writeback_keys_failed
> > : &dc->disk.c->writeback_keys_done);
> > }
> >
> > bch_keybuf_del(&dc->writeback_keys, w);
> > up(&dc->in_flight);
> >
> > closure_return_with_destructor(cl, dirty_io_destructor);
> > }
>
>
>
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-08-04 3:47 ` Zhou Jifeng
@ 2025-08-04 4:17 ` Kent Overstreet
2025-08-04 15:31 ` Coly Li
0 siblings, 1 reply; 25+ messages in thread
From: Kent Overstreet @ 2025-08-04 4:17 UTC (permalink / raw)
To: Zhou Jifeng; +Cc: Coly Li, linux-bcache, linux-kernel
On Mon, Aug 04, 2025 at 11:47:57AM +0800, Zhou Jifeng wrote:
> On Sun, 3 Aug 2025 at 01:30, Coly Li <colyli@kernel.org> wrote:
> >
> > On Fri, Aug 01, 2025 at 02:10:12PM +0800, Zhou Jifeng wrote:
> > > On Fri, 1 Aug 2025 at 11:42, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > >
> > > > On Fri, Aug 01, 2025 at 11:30:43AM +0800, Zhou Jifeng wrote:
> > > > > On Fri, 1 Aug 2025 at 10:37, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > > > >
> > > > > > On Fri, Aug 01, 2025 at 10:27:21AM +0800, Zhou Jifeng wrote:
> > > > > > > In the writeback mode, the current bcache code uses the
> > > > > > > REQ_OP_WRITE operation to handle dirty data, and clears the bkey
> > > > > > > dirty flag in the btree during the bio completion callback. I think
> > > > > > > there might be a potential risk: if in the event of an unexpected
> > > > > > > power outage, the data in the HDD hardware cache may not have
> > > > > > > had time to be persisted, then the data in the HDD hardware cache
> > > > > > > that is pending processing may be lost. Since at this time the bkey
> > > > > > > dirty flag in the btree has been cleared, the data status recorded
> > > > > > > by the bkey does not match the actual situation of the SSD and
> > > > > > > HDD.
> > > > > > > Am I understanding this correctly?
> > > > > >
> > > > > > For what you're describing, we need to make sure the backing device is
> > > > > > flushed when we're flushing the journal.
> > > > > >
> > > > > > It's possible that this isn't handled correctly in bcache; bcachefs
> > > > > > does, and I wrote that code after bcache - but the bcache version would
> > > > > > look quite different.
> > > > > >
> > > > > > You've read that code more recently than I have - have you checked for
> > > > > > that?
> > > > >
> > > > > In the `write_dirty_finish` function, there is an attempt to update the
> > > > > `bkey` status, but I did not observe any logging writing process. In the
> > > > > core function `journal_write_unlocked` of bcache for writing logs, I
> > > > > also couldn't find the code logic for sending a FLUSH command to the
> > > > > backend HDD.
> > > >
> > > > The right place for it would be in the journal code: before doing a
> > > > journal write, issue flushes to the backing devices.
> > > >
> > > > Can you check for that?
> > > >
> > >
> > > I checked and found that there was no code for sending a flush request
> > > to the backend device before the execution log was written. Additionally,
> > > in the callback function after the dirty data was written back, when it
> > > updated the bkey, it did not insert this update into the log.
> > >
> >
> > It doesn't have to. If the previous dirty version of the key is on cache device
> > already, and power off happens, even the clean version of the key is gone, the
> > dirty version and its data are all valid. If the LBA range of this key is
> > allocated to a new key, a GC must have alrady happened, and the dirty key is
> > invalid due to bucket generation increased. So don't worry, the clean key is
> > unncessary to go into journal in the writeback scenario.
> >
> > IMHO, you may try to flush backing device in a kworker before calling
> > set_gc_sectors() in bch_gc_thread(). The disk cache can be flushed in time
> > before the still-dirty-on-disk keys are invalidated by increase bucket key
> > gen. And also flushing backing device after searched_full_index becomes
> > true in the writeback thread main loop (as you did now).
> >
>
> The "flush" command previously issued by GC was supposed to alleviate
> the problems in some scenarios. However, I thought of a situation where
> this "flush" command issued before GC might not be sufficient to solve
> the issue.
>
> Suppose such a scenario: after a power failure, some hardware cache data
> in the HDD is lost, while the corresponding bkey(with the dirty flag cleared)
> update in the SSD has been persisted. After the power is restored, if
> bcache sends a flush before GC, will this cause data loss?
Yes.
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-08-04 4:17 ` Kent Overstreet
@ 2025-08-04 15:31 ` Coly Li
2025-08-04 16:07 ` Kent Overstreet
0 siblings, 1 reply; 25+ messages in thread
From: Coly Li @ 2025-08-04 15:31 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Zhou Jifeng, linux-bcache, linux-kernel
On Mon, Aug 04, 2025 at 12:17:28AM -0400, Kent Overstreet wrote:
> On Mon, Aug 04, 2025 at 11:47:57AM +0800, Zhou Jifeng wrote:
> > On Sun, 3 Aug 2025 at 01:30, Coly Li <colyli@kernel.org> wrote:
> > >
> > > On Fri, Aug 01, 2025 at 02:10:12PM +0800, Zhou Jifeng wrote:
> > > > On Fri, 1 Aug 2025 at 11:42, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > > >
> > > > > On Fri, Aug 01, 2025 at 11:30:43AM +0800, Zhou Jifeng wrote:
> > > > > > On Fri, 1 Aug 2025 at 10:37, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > > > > >
> > > > > > > On Fri, Aug 01, 2025 at 10:27:21AM +0800, Zhou Jifeng wrote:
> > > > > > > > In the writeback mode, the current bcache code uses the
> > > > > > > > REQ_OP_WRITE operation to handle dirty data, and clears the bkey
> > > > > > > > dirty flag in the btree during the bio completion callback. I think
> > > > > > > > there might be a potential risk: if in the event of an unexpected
> > > > > > > > power outage, the data in the HDD hardware cache may not have
> > > > > > > > had time to be persisted, then the data in the HDD hardware cache
> > > > > > > > that is pending processing may be lost. Since at this time the bkey
> > > > > > > > dirty flag in the btree has been cleared, the data status recorded
> > > > > > > > by the bkey does not match the actual situation of the SSD and
> > > > > > > > HDD.
> > > > > > > > Am I understanding this correctly?
> > > > > > >
> > > > > > > For what you're describing, we need to make sure the backing device is
> > > > > > > flushed when we're flushing the journal.
> > > > > > >
> > > > > > > It's possible that this isn't handled correctly in bcache; bcachefs
> > > > > > > does, and I wrote that code after bcache - but the bcache version would
> > > > > > > look quite different.
> > > > > > >
> > > > > > > You've read that code more recently than I have - have you checked for
> > > > > > > that?
> > > > > >
> > > > > > In the `write_dirty_finish` function, there is an attempt to update the
> > > > > > `bkey` status, but I did not observe any logging writing process. In the
> > > > > > core function `journal_write_unlocked` of bcache for writing logs, I
> > > > > > also couldn't find the code logic for sending a FLUSH command to the
> > > > > > backend HDD.
> > > > >
> > > > > The right place for it would be in the journal code: before doing a
> > > > > journal write, issue flushes to the backing devices.
> > > > >
> > > > > Can you check for that?
> > > > >
> > > >
> > > > I checked and found that there was no code for sending a flush request
> > > > to the backend device before the execution log was written. Additionally,
> > > > in the callback function after the dirty data was written back, when it
> > > > updated the bkey, it did not insert this update into the log.
> > > >
> > >
> > > It doesn't have to. If the previous dirty version of the key is on cache device
> > > already, and power off happens, even the clean version of the key is gone, the
> > > dirty version and its data are all valid. If the LBA range of this key is
> > > allocated to a new key, a GC must have alrady happened, and the dirty key is
> > > invalid due to bucket generation increased. So don't worry, the clean key is
> > > unncessary to go into journal in the writeback scenario.
> > >
> > > IMHO, you may try to flush backing device in a kworker before calling
> > > set_gc_sectors() in bch_gc_thread(). The disk cache can be flushed in time
> > > before the still-dirty-on-disk keys are invalidated by increase bucket key
> > > gen. And also flushing backing device after searched_full_index becomes
> > > true in the writeback thread main loop (as you did now).
> > >
> >
> > The "flush" command previously issued by GC was supposed to alleviate
> > the problems in some scenarios. However, I thought of a situation where
> > this "flush" command issued before GC might not be sufficient to solve
> > the issue.
> >
> > Suppose such a scenario: after a power failure, some hardware cache data
> > in the HDD is lost, while the corresponding bkey(with the dirty flag cleared)
> > update in the SSD has been persisted. After the power is restored, if
> > bcache sends a flush before GC, will this cause data loss?
>
> Yes.
The cleared key is updated in place within the in-memory btree node, so
flushing the backing devices before committing the journal doesn't help.
I'd like to avoid adding the cleared key to the journal; under high write
pressure, such a synchronized link between writeback, gc and the journal
makes me really uncomfortable.
Another option might be to add a tag in struct btree and set it when a
cleared key is updated in place in the btree node. When writing a bset and
the tag is set, flush the corresponding backing device before writing the
btree node. That may hurt performance less than flushing the backing device
before committing the journal set.
What do you think, Kent?
Thanks.
Coly Li
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-08-04 15:31 ` Coly Li
@ 2025-08-04 16:07 ` Kent Overstreet
2025-08-05 1:17 ` Zhou Jifeng
0 siblings, 1 reply; 25+ messages in thread
From: Kent Overstreet @ 2025-08-04 16:07 UTC (permalink / raw)
To: Coly Li; +Cc: Zhou Jifeng, linux-bcache, linux-kernel
On Mon, Aug 04, 2025 at 11:31:30PM +0800, Coly Li wrote:
> On Mon, Aug 04, 2025 at 12:17:28AM -0400, Kent Overstreet wrote:
> > On Mon, Aug 04, 2025 at 11:47:57AM +0800, Zhou Jifeng wrote:
> > > On Sun, 3 Aug 2025 at 01:30, Coly Li <colyli@kernel.org> wrote:
> > > >
> > > > On Fri, Aug 01, 2025 at 02:10:12PM +0800, Zhou Jifeng wrote:
> > > > > On Fri, 1 Aug 2025 at 11:42, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > > > >
> > > > > > On Fri, Aug 01, 2025 at 11:30:43AM +0800, Zhou Jifeng wrote:
> > > > > > > On Fri, 1 Aug 2025 at 10:37, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > > > > > >
> > > > > > > > On Fri, Aug 01, 2025 at 10:27:21AM +0800, Zhou Jifeng wrote:
> > > > > > > > > In the writeback mode, the current bcache code uses the
> > > > > > > > > REQ_OP_WRITE operation to handle dirty data, and clears the bkey
> > > > > > > > > dirty flag in the btree during the bio completion callback. I think
> > > > > > > > > there might be a potential risk: if in the event of an unexpected
> > > > > > > > > power outage, the data in the HDD hardware cache may not have
> > > > > > > > > had time to be persisted, then the data in the HDD hardware cache
> > > > > > > > > that is pending processing may be lost. Since at this time the bkey
> > > > > > > > > dirty flag in the btree has been cleared, the data status recorded
> > > > > > > > > by the bkey does not match the actual situation of the SSD and
> > > > > > > > > HDD.
> > > > > > > > > Am I understanding this correctly?
> > > > > > > >
> > > > > > > > For what you're describing, we need to make sure the backing device is
> > > > > > > > flushed when we're flushing the journal.
> > > > > > > >
> > > > > > > > It's possible that this isn't handled correctly in bcache; bcachefs
> > > > > > > > does, and I wrote that code after bcache - but the bcache version would
> > > > > > > > look quite different.
> > > > > > > >
> > > > > > > > You've read that code more recently than I have - have you checked for
> > > > > > > > that?
> > > > > > >
> > > > > > > In the `write_dirty_finish` function, there is an attempt to update the
> > > > > > > `bkey` status, but I did not observe any logging writing process. In the
> > > > > > > core function `journal_write_unlocked` of bcache for writing logs, I
> > > > > > > also couldn't find the code logic for sending a FLUSH command to the
> > > > > > > backend HDD.
> > > > > >
> > > > > > The right place for it would be in the journal code: before doing a
> > > > > > journal write, issue flushes to the backing devices.
> > > > > >
> > > > > > Can you check for that?
> > > > > >
> > > > >
> > > > > I checked and found that there was no code for sending a flush request
> > > > > to the backend device before the execution log was written. Additionally,
> > > > > in the callback function after the dirty data was written back, when it
> > > > > updated the bkey, it did not insert this update into the log.
> > > > >
> > > >
> > > > It doesn't have to. If the previous dirty version of the key is on cache device
> > > > already, and power off happens, even the clean version of the key is gone, the
> > > > dirty version and its data are all valid. If the LBA range of this key is
> > > > allocated to a new key, a GC must have alrady happened, and the dirty key is
> > > > invalid due to bucket generation increased. So don't worry, the clean key is
> > > > unncessary to go into journal in the writeback scenario.
> > > >
> > > > IMHO, you may try to flush backing device in a kworker before calling
> > > > set_gc_sectors() in bch_gc_thread(). The disk cache can be flushed in time
> > > > before the still-dirty-on-disk keys are invalidated by increase bucket key
> > > > gen. And also flushing backing device after searched_full_index becomes
> > > > true in the writeback thread main loop (as you did now).
> > > >
> > >
> > > The "flush" command previously issued by GC was supposed to alleviate
> > > the problems in some scenarios. However, I thought of a situation where
> > > this "flush" command issued before GC might not be sufficient to solve
> > > the issue.
> > >
> > > Suppose such a scenario: after a power failure, some hardware cache data
> > > in the HDD is lost, while the corresponding bkey(with the dirty flag cleared)
> > > update in the SSD has been persisted. After the power is restored, if
> > > bcache sends a flush before GC, will this cause data loss?
> >
> > Yes.
>
> > The cleared key is updated in place within the in-memory btree node, so
> > flushing backing devices before committing the journal doesn't help.
Yes, it would, although obviously we wouldn't want to do a flush every
time we clear the dirty bit - it needs batching.
Any time you're doing writes to multiple devices that have ordering
dependencies, a flush needs to be involved.
> I try to avoid adding the cleared key into the journal; under high write
> pressure, such a synchronized link between writeback, gc and the journal
> makes me really uncomfortable.
>
> Another choice might be adding a tag in struct btree, and setting the tag
> when the cleared key is updated in place in the btree node. When writing a
> bset and the tag is set, flush the corresponding backing device before
> writing the btree node. That may hurt performance less than flushing the
> backing device before committing the journal set.
>
> What do you think of it, Kent?
Have a look at the code for this in bcachefs, fs/bcachefs/journal_io.c,
journal_write_preflush().
If it's a multi device filesystem, we issue flushes separately from the
journal write and wait for them to complete before doing the REQ_FUA
journal write - that ensures that any cross device IO dependencies are
ordered correctly.
That approach would work in bcache as well, but it'd have higher
performance overhead than in bcachefs because bcache doesn't have the
concept of noflush (non commit) journal writes - every journal write is
FLUSH/FUA, and there are also writes that bypass the cache, which we'll
be flushing unnecessarily.
Having a flag/bitmask for "we cleared dirty bits, these backing
device(s) need flushes" would probably have acceptable performance
overhead.
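A rough sketch of what that could look like -- the bitmap field and the
backing_dev() lookup are hypothetical, only blkdev_issue_flush() is real:

        /* Sketch: set when writeback clears dirty bits for backing device i. */
        static void note_backing_needs_flush(struct cache_set *c, unsigned int i)
        {
                set_bit(i, c->backing_needs_flush);     /* hypothetical bitmap */
        }

        /* Called before issuing the FLUSH/FUA journal write. */
        static void flush_backing_before_journal(struct cache_set *c)
        {
                unsigned int i;

                for (i = 0; i < c->nr_uuids; i++) {
                        struct cached_dev *dc;

                        if (!test_and_clear_bit(i, c->backing_needs_flush))
                                continue;

                        dc = backing_dev(c, i);         /* hypothetical lookup */
                        if (dc)
                                blkdev_issue_flush(dc->bdev);
                }
        }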
Also, we're getting damn close to being ready to lift the experimental
label on bcachefs, so maybe have a look at that too :)
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-08-04 16:07 ` Kent Overstreet
@ 2025-08-05 1:17 ` Zhou Jifeng
2025-08-05 1:31 ` Kent Overstreet
0 siblings, 1 reply; 25+ messages in thread
From: Zhou Jifeng @ 2025-08-05 1:17 UTC (permalink / raw)
To: Kent Overstreet, Coly Li; +Cc: linux-bcache, linux-kernel
On Tue, 5 Aug 2025 at 00:07, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> On Mon, Aug 04, 2025 at 11:31:30PM +0800, Coly Li wrote:
> > On Mon, Aug 04, 2025 at 12:17:28AM -0400, Kent Overstreet wrote:
> > > On Mon, Aug 04, 2025 at 11:47:57AM +0800, Zhou Jifeng wrote:
> > > > On Sun, 3 Aug 2025 at 01:30, Coly Li <colyli@kernel.org> wrote:
> > > > >
> > > > > On Fri, Aug 01, 2025 at 02:10:12PM +0800, Zhou Jifeng wrote:
> > > > > > On Fri, 1 Aug 2025 at 11:42, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > > > > >
> > > > > > > On Fri, Aug 01, 2025 at 11:30:43AM +0800, Zhou Jifeng wrote:
> > > > > > > > On Fri, 1 Aug 2025 at 10:37, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Aug 01, 2025 at 10:27:21AM +0800, Zhou Jifeng wrote:
> > > > > > > > > > In the writeback mode, the current bcache code uses the
> > > > > > > > > > REQ_OP_WRITE operation to handle dirty data, and clears the bkey
> > > > > > > > > > dirty flag in the btree during the bio completion callback. I think
> > > > > > > > > > there might be a potential risk: if in the event of an unexpected
> > > > > > > > > > power outage, the data in the HDD hardware cache may not have
> > > > > > > > > > had time to be persisted, then the data in the HDD hardware cache
> > > > > > > > > > that is pending processing may be lost. Since at this time the bkey
> > > > > > > > > > dirty flag in the btree has been cleared, the data status recorded
> > > > > > > > > > by the bkey does not match the actual situation of the SSD and
> > > > > > > > > > HDD.
> > > > > > > > > > Am I understanding this correctly?
> > > > > > > > >
> > > > > > > > > For what you're describing, we need to make sure the backing device is
> > > > > > > > > flushed when we're flushing the journal.
> > > > > > > > >
> > > > > > > > > It's possible that this isn't handled correctly in bcache; bcachefs
> > > > > > > > > does, and I wrote that code after bcache - but the bcache version would
> > > > > > > > > look quite different.
> > > > > > > > >
> > > > > > > > > You've read that code more recently than I have - have you checked for
> > > > > > > > > that?
> > > > > > > >
> > > > > > > > In the `write_dirty_finish` function, there is an attempt to update the
> > > > > > > > `bkey` status, but I did not observe any logging writing process. In the
> > > > > > > > core function `journal_write_unlocked` of bcache for writing logs, I
> > > > > > > > also couldn't find the code logic for sending a FLUSH command to the
> > > > > > > > backend HDD.
> > > > > > >
> > > > > > > The right place for it would be in the journal code: before doing a
> > > > > > > journal write, issue flushes to the backing devices.
> > > > > > >
> > > > > > > Can you check for that?
> > > > > > >
> > > > > >
> > > > > > I checked and found that there was no code for sending a flush request
> > > > > > to the backend device before the execution log was written. Additionally,
> > > > > > in the callback function after the dirty data was written back, when it
> > > > > > updated the bkey, it did not insert this update into the log.
> > > > > >
> > > > >
> > > > > It doesn't have to. If the previous dirty version of the key is on cache device
> > > > > already, and power off happens, even the clean version of the key is gone, the
> > > > > dirty version and its data are all valid. If the LBA range of this key is
> > > > > allocated to a new key, a GC must have alrady happened, and the dirty key is
> > > > > invalid due to bucket generation increased. So don't worry, the clean key is
> > > > > unncessary to go into journal in the writeback scenario.
> > > > >
> > > > > IMHO, you may try to flush backing device in a kworker before calling
> > > > > set_gc_sectors() in bch_gc_thread(). The disk cache can be flushed in time
> > > > > before the still-dirty-on-disk keys are invalidated by increase bucket key
> > > > > gen. And also flushing backing device after searched_full_index becomes
> > > > > true in the writeback thread main loop (as you did now).
> > > > >
> > > >
> > > > The "flush" command previously issued by GC was supposed to alleviate
> > > > the problems in some scenarios. However, I thought of a situation where
> > > > this "flush" command issued before GC might not be sufficient to solve
> > > > the issue.
> > > >
> > > > Suppose such a scenario: after a power failure, some hardware cache data
> > > > in the HDD is lost, while the corresponding bkey(with the dirty flag cleared)
> > > > update in the SSD has been persisted. After the power is restored, if
> > > > bcache sends a flush before GC, will this cause data loss?
> > >
> > > Yes.
> >
> > The cleared key is updated in place within the in-memory btree node, so
> > flushing backing devices before committing the journal doesn't help.
>
> Yes, it would, although obviously we wouldn't want to do a flush every
> time we clear the dirty bit - it needs batching.
>
> Any time you're doing writes to multiple devices that have ordering
> dependencies, a flush needs to be involved.
>
> > I try to avoid adding the cleared key into the journal; under high write
> > pressure, such a synchronized link between writeback, gc and the journal
> > makes me really uncomfortable.
> >
> > Another choice might be adding a tag in struct btree, and setting the tag
> > when the cleared key is updated in place in the btree node. When writing a
> > bset and the tag is set, flush the corresponding backing device before
> > writing the btree node. That may hurt performance less than flushing the
> > backing device before committing the journal set.
> >
> > What do you think of it, Kent?
>
> Have a look at the code for this in bcachefs, fs/bcachefs/journal_io.c,
> journal_write_preflush().
>
> If it's a multi device filesystem, we issue flushes separately from the
> journal write and wait for them to complete before doing the REQ_FUA
> journal write - that ensures that any cross device IO dependencies are
> ordered correctly.
>
> That approach would work in bcache as well, but it'd have higher
> performance overhead than in bcachefs because bcache doesn't have the
> concept of noflush (non commit) journal writes - every journal write is
> FLUSH/FUA, and there are also writes that bypass the cache, which we'll
> be flushing unnecessarily.
>
> Having a flag/bitmask for "we cleared dirty bits, these backing
> device(s) need flushes" would probably have acceptable performance
> overhead.
>
> Also, we're getting damn close to being ready to lift the experimental
> label on bcachefs, so maybe have a look at that too :)
>
Could we consider the solution I submitted, which is based on the
following main principle:
1. First, in the write_dirty_finish stage, the bkeys whose dirty flags are
to be cleared are not inserted into the btree immediately. Instead, they
are temporarily stored in an internal memory queue called Alist.
2. Then, when the number of bkeys in Alist exceeds a certain limit, a
flush request is sent to the backend HDD.
3. After the flush completes, the bkeys recorded in Alist are then
inserted into the btree.
This process ensures that the written-back dirty data has reached the disk
before the btree is updated. The length of Alist can be configured,
allowing better control of the flush frequency and reducing the impact of
flushes on the write speed.
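A minimal sketch of that flow, with placeholder names for Alist and its
helpers (this is not the submitted patch code):

        struct alist_entry {
                struct list_head        list;
                struct keybuf_key       key;
        };

        struct alist {
                spinlock_t              lock;
                struct list_head        keys;   /* bkeys waiting for a flush */
                unsigned int            count;
                unsigned int            limit;  /* configurable threshold */
        };

        /* 1. write_dirty_finish(): queue the key instead of inserting it now */
        static int alist_add(struct alist *a, struct keybuf_key *w)
        {
                struct alist_entry *e = kmalloc(sizeof(*e), GFP_NOIO);

                if (!e)
                        return -ENOMEM;

                e->key = *w;
                spin_lock(&a->lock);
                list_add_tail(&e->list, &a->keys);
                a->count++;
                spin_unlock(&a->lock);
                return 0;
        }

        /* 2 + 3. writeback thread: flush the HDD, then insert the queued keys */
        static void alist_flush_and_insert(struct cached_dev *dc, struct alist *a)
        {
                if (a->count < a->limit)
                        return;

                if (blkdev_issue_flush(dc->bdev))
                        return; /* flush failed: keep the keys dirty, retry later */

                /* placeholder: clear KEY_DIRTY(), bch_btree_insert(), empty list */
                insert_clean_keys(dc, a);
                a->count = 0;
        }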
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-08-05 1:17 ` Zhou Jifeng
@ 2025-08-05 1:31 ` Kent Overstreet
2025-08-05 3:31 ` Coly Li
0 siblings, 1 reply; 25+ messages in thread
From: Kent Overstreet @ 2025-08-05 1:31 UTC (permalink / raw)
To: Zhou Jifeng; +Cc: Coly Li, linux-bcache, linux-kernel
On Tue, Aug 05, 2025 at 09:17:31AM +0800, Zhou Jifeng wrote:
> On Tue, 5 Aug 2025 at 00:07, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > On Mon, Aug 04, 2025 at 11:31:30PM +0800, Coly Li wrote:
> > > On Mon, Aug 04, 2025 at 12:17:28AM -0400, Kent Overstreet wrote:
> > > > On Mon, Aug 04, 2025 at 11:47:57AM +0800, Zhou Jifeng wrote:
> > > > > On Sun, 3 Aug 2025 at 01:30, Coly Li <colyli@kernel.org> wrote:
> > > > > >
> > > > > > On Fri, Aug 01, 2025 at 02:10:12PM +0800, Zhou Jifeng wrote:
> > > > > > > On Fri, 1 Aug 2025 at 11:42, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > > > > > >
> > > > > > > > On Fri, Aug 01, 2025 at 11:30:43AM +0800, Zhou Jifeng wrote:
> > > > > > > > > On Fri, 1 Aug 2025 at 10:37, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, Aug 01, 2025 at 10:27:21AM +0800, Zhou Jifeng wrote:
> > > > > > > > > > > In the writeback mode, the current bcache code uses the
> > > > > > > > > > > REQ_OP_WRITE operation to handle dirty data, and clears the bkey
> > > > > > > > > > > dirty flag in the btree during the bio completion callback. I think
> > > > > > > > > > > there might be a potential risk: if in the event of an unexpected
> > > > > > > > > > > power outage, the data in the HDD hardware cache may not have
> > > > > > > > > > > had time to be persisted, then the data in the HDD hardware cache
> > > > > > > > > > > that is pending processing may be lost. Since at this time the bkey
> > > > > > > > > > > dirty flag in the btree has been cleared, the data status recorded
> > > > > > > > > > > by the bkey does not match the actual situation of the SSD and
> > > > > > > > > > > HDD.
> > > > > > > > > > > Am I understanding this correctly?
> > > > > > > > > >
> > > > > > > > > > For what you're describing, we need to make sure the backing device is
> > > > > > > > > > flushed when we're flushing the journal.
> > > > > > > > > >
> > > > > > > > > > It's possible that this isn't handled correctly in bcache; bcachefs
> > > > > > > > > > does, and I wrote that code after bcache - but the bcache version would
> > > > > > > > > > look quite different.
> > > > > > > > > >
> > > > > > > > > > You've read that code more recently than I have - have you checked for
> > > > > > > > > > that?
> > > > > > > > >
> > > > > > > > > In the `write_dirty_finish` function, there is an attempt to update the
> > > > > > > > > `bkey` status, but I did not observe any logging writing process. In the
> > > > > > > > > core function `journal_write_unlocked` of bcache for writing logs, I
> > > > > > > > > also couldn't find the code logic for sending a FLUSH command to the
> > > > > > > > > backend HDD.
> > > > > > > >
> > > > > > > > The right place for it would be in the journal code: before doing a
> > > > > > > > journal write, issue flushes to the backing devices.
> > > > > > > >
> > > > > > > > Can you check for that?
> > > > > > > >
> > > > > > >
> > > > > > > I checked and found that there was no code for sending a flush request
> > > > > > > to the backend device before the execution log was written. Additionally,
> > > > > > > in the callback function after the dirty data was written back, when it
> > > > > > > updated the bkey, it did not insert this update into the log.
> > > > > > >
> > > > > >
> > > > > > It doesn't have to. If the previous dirty version of the key is on cache device
> > > > > > already, and power off happens, even the clean version of the key is gone, the
> > > > > > dirty version and its data are all valid. If the LBA range of this key is
> > > > > > allocated to a new key, a GC must have alrady happened, and the dirty key is
> > > > > > invalid due to bucket generation increased. So don't worry, the clean key is
> > > > > > unncessary to go into journal in the writeback scenario.
> > > > > >
> > > > > > IMHO, you may try to flush backing device in a kworker before calling
> > > > > > set_gc_sectors() in bch_gc_thread(). The disk cache can be flushed in time
> > > > > > before the still-dirty-on-disk keys are invalidated by increase bucket key
> > > > > > gen. And also flushing backing device after searched_full_index becomes
> > > > > > true in the writeback thread main loop (as you did now).
> > > > > >
> > > > >
> > > > > The "flush" command previously issued by GC was supposed to alleviate
> > > > > the problems in some scenarios. However, I thought of a situation where
> > > > > this "flush" command issued before GC might not be sufficient to solve
> > > > > the issue.
> > > > >
> > > > > Suppose such a scenario: after a power failure, some hardware cache data
> > > > > in the HDD is lost, while the corresponding bkey(with the dirty flag cleared)
> > > > > update in the SSD has been persisted. After the power is restored, if
> > > > > bcache sends a flush before GC, will this cause data loss?
> > > >
> > > > Yes.
> > >
> > > The cleared key is updated in place within the in-memory btree node, so
> > > flushing backing devices before committing the journal doesn't help.
> >
> > Yes, it would, although obviously we wouldn't want to do a flush every
> > time we clear the dirty bit - it needs batching.
> >
> > Any time you're doing writes to multiple devices that have ordering
> > dependencies, a flush needs to be involved.
> >
> > > I try to avoid adding the cleared key into the journal; under high write
> > > pressure, such a synchronized link between writeback, gc and the journal
> > > makes me really uncomfortable.
> > >
> > > Another choice might be adding a tag in struct btree, and setting the tag
> > > when the cleared key is updated in place in the btree node. When writing a
> > > bset and the tag is set, flush the corresponding backing device before
> > > writing the btree node. That may hurt performance less than flushing the
> > > backing device before committing the journal set.
> > >
> > > What do you think of it, Kent?
> >
> > Have a look at the code for this in bcachefs, fs/bcachefs/journal_io.c,
> > journal_write_preflush().
> >
> > If it's a multi device filesystem, we issue flushes separately from the
> > journal write and wait for them to complete before doing the REQ_FUA
> > journal write - that ensures that any cross device IO dependencies are
> > ordered correctly.
> >
> > That approach would work in bcache as well, but it'd have higher
> > performance overhead than in bcachefs because bcache doesn't have the
> > concept of noflush (non commit) journal writes - every journal write is
> > FLUSH/FUA, and there are also writes that bypass the cache, which we'll
> > be flushing unnecessarily.
> >
> > Having a flag/bitmask for "we cleared dirty bits, these backing
> > device(s) need flushes" would probably have acceptable performance
> > overhead.
> >
> > Also, we're getting damn close to being ready to lift the experimental
> > label on bcachefs, so maybe have a look at that too :)
> >
>
> Could we consider the solution I submitted, which is based on the
> following main principle:
> 1. First, in the write_dirty_finish stage, the bkeys whose dirty flags are
> to be cleared are not inserted into the btree immediately. Instead, they
> are temporarily stored in an internal memory queue called Alist.
> 2. Then, when the number of bkeys in Alist exceeds a certain limit, a
> flush request is sent to the backend HDD.
> 3. After the flush completes, the bkeys recorded in Alist are then
> inserted into the btree.
> This process ensures that the written-back dirty data has reached the disk
> before the btree is updated. The length of Alist can be configured,
> allowing better control of the flush frequency and reducing the impact of
> flushes on the write speed.
That approach should work as well. You'll want to make the list size
rather bit, and add statistics for how often flushes are being issued.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-08-05 1:31 ` Kent Overstreet
@ 2025-08-05 3:31 ` Coly Li
2025-08-05 3:34 ` Kent Overstreet
0 siblings, 1 reply; 25+ messages in thread
From: Coly Li @ 2025-08-05 3:31 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Zhou Jifeng, linux-bcache, linux-kernel
On Mon, Aug 04, 2025 at 09:31:38PM -0400, Kent Overstreet wrote:
> > Could we consider the solution I submitted, which is based on the
> > following main principle:
> > 1. First, in the write_dirty_finish stage, the bkeys whose dirty flags are
> > to be cleared are not inserted into the btree immediately. Instead, they
> > are temporarily stored in an internal memory queue called Alist.
> > 2. Then, when the number of bkeys in Alist exceeds a certain limit, a
> > flush request is sent to the backend HDD.
> > 3. After the flush completes, the bkeys recorded in Alist are then
> > inserted into the btree.
> > This process ensures that the written-back dirty data has reached the disk
> > before the btree is updated. The length of Alist can be configured,
> > allowing better control of the flush frequency and reducing the impact of
> > flushes on the write speed.
>
> That approach should work as well. You'll want to make the list size
> rather bit, and add statistics for how often flushes are being issued.
>
OK, then let me review this patch.
Coly Li
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-08-05 3:31 ` Coly Li
@ 2025-08-05 3:34 ` Kent Overstreet
0 siblings, 0 replies; 25+ messages in thread
From: Kent Overstreet @ 2025-08-05 3:34 UTC (permalink / raw)
To: Coly Li; +Cc: Zhou Jifeng, linux-bcache, linux-kernel
On Tue, Aug 05, 2025 at 11:31:20AM +0800, Coly Li wrote:
> On Mon, Aug 04, 2025 at 09:31:38PM -0400, Kent Overstreet wrote:
> > > Could we consider the solution I submitted, which is based on the
> > > following main principle:
> > > 1. First, in the write_dirty_finish stage, the bkeys whose dirty flags are
> > > to be cleared are not inserted into the btree immediately. Instead, they
> > > are temporarily stored in an internal memory queue called Alist.
> > > 2. Then, when the number of bkeys in Alist exceeds a certain limit, a
> > > flush request is sent to the backend HDD.
> > > 3. After the flush completes, the bkeys recorded in Alist are then
> > > inserted into the btree.
> > > This process ensures that the written-back dirty data has reached the disk
> > > before the btree is updated. The length of Alist can be configured,
> > > allowing better control of the flush frequency and reducing the impact of
> > > flushes on the write speed.
> >
> > That approach should work as well. You'll want to make the list size
> > rather bit, and add statistics for how often flushes are being issued.
> >
>
> OK, then let me review this patch.
>
> Coly Li
s/rather bit/rather large
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-07-31 6:21 [PATCH] bcache: enhancing the security of dirty data writeback Zhou Jifeng
2025-07-31 15:49 ` Kent Overstreet
@ 2025-08-05 4:57 ` Coly Li
2025-08-05 9:37 ` Zhou Jifeng
1 sibling, 1 reply; 25+ messages in thread
From: Coly Li @ 2025-08-05 4:57 UTC (permalink / raw)
To: Zhou Jifeng; +Cc: kent.overstreet, linux-bcache, linux-kernel
On Thu, Jul 31, 2025 at 02:21:40PM +0800, Zhou Jifeng wrote:
> There is a potential data consistency risk in bcache's writeback mode:when
> the application calls fsync, bcache returns success after completing the
> log write, persisting the cache disk data, and persisting the HDD internal
> cache. However, at this point, the actual application data may still be in
> a dirty state and remain stuck in the cache disk. when these data are
> subsequently written back to the HDD asynchronously through REQ_OP_WRITE,
> there is no forced refresh mechanism to ensure physical placement on the
> disk, and there may be no power-off protection measures, which poses a risk
> of data loss. This mechanism may cause the application to misjudge that the
> data has been persisted, which is different from the actual storage state,
> and also violates the semantic agreement that fsync should ensure data
> persistence.
>
[snipped]
> Signed-off-by: Zhou Jifeng <zhoujifeng@kylinos.com.cn>
> ---
> drivers/md/bcache/bcache.h | 23 ++++
> drivers/md/bcache/bcache_ondisk.h | 4 +
> drivers/md/bcache/sysfs.c | 47 ++++++++
> drivers/md/bcache/writeback.c | 174 +++++++++++++++++++++++++++---
> drivers/md/bcache/writeback.h | 4 +
> 5 files changed, 239 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 785b0d9008fa..09424938437b 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -247,6 +247,17 @@ struct keybuf {
> DECLARE_ARRAY_ALLOCATOR(struct keybuf_key, freelist, KEYBUF_NR);
> };
>
> +struct keybuf_preflush {
> + spinlock_t lock;
> + struct list_head list;
> + u32 count;
> +};
> +
> +struct flush_key_entry {
> + struct keybuf_key key;
> + struct list_head list;
> +};
> +
> struct bcache_device {
> struct closure cl;
>
> @@ -346,6 +357,18 @@ struct cached_dev {
>
> struct keybuf writeback_keys;
>
> + /*
> + * Before performing preflush to the backing device, temporarily
> + * store the bkey waiting to clean up the dirty mark
> + */
> + struct keybuf_preflush preflush_keys;
> + /*
> + * flush_interval is used to specify that a PROFLUSH operation will
> + * be issued once a certain number of dirty bkeys have been written
> + * each time.
> + */
> + unsigned int flush_interval;
> +
> struct task_struct *status_update_thread;
> /*
> * Order the write-half of writeback operations strongly in dispatch
> diff --git a/drivers/md/bcache/bcache_ondisk.h b/drivers/md/bcache/bcache_ondisk.h
> index 6620a7f8fffc..df5800838e40 100644
> --- a/drivers/md/bcache/bcache_ondisk.h
> +++ b/drivers/md/bcache/bcache_ondisk.h
> @@ -294,6 +294,10 @@ BITMASK(BDEV_CACHE_MODE, struct cache_sb, flags, 0, 4);
> #define CACHE_MODE_WRITEBACK 1U
> #define CACHE_MODE_WRITEAROUND 2U
> #define CACHE_MODE_NONE 3U
> +BITMASK(BDEV_WRITEBACK_FLUSH, struct cache_sb, flags, 4, 1);
> +#define WRITEBACK_FLUSH_OFF 0U
> +#define WRITEBACK_FLUSH_ON 1U
> +
We should avoid changing the on-disk format. Can you use another method
to check whether the cached device has to be flushed? E.g. checking
dc->preflush_keys.
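For example, a check along these lines could reuse the patch's own counter
instead of a new superblock bit (sketch only):

        /* Sketch: a flush is only needed while clean-pending keys are queued. */
        static bool cached_dev_needs_flush(struct cached_dev *dc)
        {
                return READ_ONCE(dc->preflush_keys.count) > 0;
        }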
> BITMASK(BDEV_STATE, struct cache_sb, flags, 61, 2);
> #define BDEV_STATE_NONE 0U
> #define BDEV_STATE_CLEAN 1U
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index e8f696cb58c0..cc228e592ab6 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -28,6 +28,18 @@ static const char * const bch_cache_modes[] = {
> NULL
> };
>
> +/*
> + * Default is 0 ("off")
> + * off: Do nothing
> + * on: Use FLUSH when writing back dirty data.
> + */
> +static const char * const bch_writeback_flush[] = {
> + "off",
> + "on",
> + NULL
> +};
> +
> +
> static const char * const bch_reada_cache_policies[] = {
> "all",
> "meta-only",
> @@ -151,6 +163,19 @@ rw_attribute(copy_gc_enabled);
> rw_attribute(idle_max_writeback_rate);
> rw_attribute(gc_after_writeback);
> rw_attribute(size);
> +/*
> + * The "writeback_flush" has two options: "off" and "on". "off" is
> + * the default value.
> + * off: Do nothing
> + * on: Use FLUSH when writing back dirty data.
> + */
> +rw_attribute(writeback_flush);
> +/*
> + * "flush_interval" is used to specify that a PROFLUSH operation will
> + * be issued once a certain number of dirty bkeys have been written
> + * each time.
> + */
> +rw_attribute(flush_interval);
>
> static ssize_t bch_snprint_string_list(char *buf,
> size_t size,
> @@ -213,6 +238,7 @@ SHOW(__bch_cached_dev)
> var_print(writeback_rate_fp_term_mid);
> var_print(writeback_rate_fp_term_high);
> var_print(writeback_rate_minimum);
> + var_print(flush_interval);
>
> if (attr == &sysfs_writeback_rate_debug) {
> char rate[20];
> @@ -283,6 +309,11 @@ SHOW(__bch_cached_dev)
> return strlen(buf);
> }
>
> + if (attr == &sysfs_writeback_flush)
> + return bch_snprint_string_list(buf, PAGE_SIZE,
> + bch_writeback_flush,
> + BDEV_WRITEBACK_FLUSH(&dc->sb));
> +
> #undef var
> return 0;
> }
> @@ -354,6 +385,9 @@ STORE(__cached_dev)
>
> sysfs_strtoul_clamp(io_error_limit, dc->error_limit, 0, INT_MAX);
>
> + sysfs_strtoul_clamp(flush_interval, dc->flush_interval,
> + WRITEBACK_FLUSH_INTERVAL_MIN, WRITEBACK_FLUSH_INTERVAL_MAX);
> +
> if (attr == &sysfs_io_disable) {
> int v = strtoul_or_return(buf);
>
> @@ -451,6 +485,17 @@ STORE(__cached_dev)
> if (attr == &sysfs_stop)
> bcache_device_stop(&dc->disk);
>
> + if (attr == &sysfs_writeback_flush) {
> + v = __sysfs_match_string(bch_writeback_flush, -1, buf);
> + if (v < 0)
> + return v;
> +
> + if ((unsigned int) v != BDEV_WRITEBACK_FLUSH(&dc->sb)) {
> + SET_BDEV_WRITEBACK_FLUSH(&dc->sb, v);
> + bch_write_bdev_super(dc, NULL);
> + }
> + }
> +
> return size;
> }
>
> @@ -541,6 +586,8 @@ static struct attribute *bch_cached_dev_attrs[] = {
> #endif
> &sysfs_backing_dev_name,
> &sysfs_backing_dev_uuid,
> + &sysfs_writeback_flush,
> + &sysfs_flush_interval,
> NULL
> };
> ATTRIBUTE_GROUPS(bch_cached_dev);
We don't need this sysfs item. Once the issue is fixed, this is
mandatory for data security, even if it hurts performance.
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index 453efbbdc8ee..530eea2b953a 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -348,8 +348,121 @@ static CLOSURE_CALLBACK(dirty_io_destructor)
> kfree(io);
> }
>
> +static int bcache_add_preflush_key(struct cached_dev *dc, struct keybuf_key *key)
> +{
> + struct flush_key_entry *entry = kmalloc(sizeof(*entry), GFP_ATOMIC);
> +
> + if (!entry) {
> + pr_info("the preflush bkey memory allocation failed.\n");
The debug pr_info() can be removed.
> + return -ENOMEM;
> + }
> +
> + memcpy(&entry->key, key, sizeof(*key));
You may want to look at bkey_copy().
> + INIT_LIST_HEAD(&entry->list);
> +
> + spin_lock(&dc->preflush_keys.lock);
> + list_add_tail(&entry->list, &dc->preflush_keys.list);
> + dc->preflush_keys.count++;
> + spin_unlock(&dc->preflush_keys.lock);
> +
> + return 0;
> +}
> +
> +static void bcache_mark_preflush_keys_clean(struct cached_dev *dc)
> +{
> + struct flush_key_entry *e, *tmp;
> +
> + list_for_each_entry_safe(e, tmp, &dc->preflush_keys.list, list) {
> + list_del(&e->list);
> + kfree(e);
> + }
> + dc->preflush_keys.count = 0;
> +}
> +
For dc->preflush_keys, I would prefer a statically allocated buffer to a
dynamic list. You may check the heap routines in util.h; that is what I'd
suggest using.
You may initialize the heap buffer quite large, and store the preflush
keys in order. It will be useful when you insert the keys back into the btree.
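A minimal sketch of what a statically sized buffer could look like -- the
struct, the size and the helper below are illustrative (ordering/heap
maintenance is omitted), not the util.h heap macros themselves:

        #define PREFLUSH_KEYS_MAX       20000   /* "quite large", tune as needed */

        struct preflush_buf {
                spinlock_t              lock;
                unsigned int            used;
                struct keybuf_key       keys[PREFLUSH_KEYS_MAX]; /* keys waiting for a flush */
        };

        /* Returns false when the buffer is full, so the caller flushes and
         * drains it before retrying. */
        static bool preflush_buf_add(struct preflush_buf *b, struct keybuf_key *w)
        {
                bool added = false;

                spin_lock(&b->lock);
                if (b->used < PREFLUSH_KEYS_MAX) {
                        b->keys[b->used++] = *w;
                        added = true;
                }
                spin_unlock(&b->lock);

                return added;
        }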
> +static void launch_async_preflush_endio(struct bio *bio)
> +{
> + if (bio->bi_status)
> + pr_err("flush backing device error %d.\n", bio->bi_status);
> +}
> +
> +
> +static inline void launch_async_preflush_request(struct cached_dev *dc)
> +{
> + struct bio flush;
> +
> + bio_init(&flush, dc->bdev, NULL, 0, REQ_OP_WRITE | REQ_PREFLUSH);
> +
> + flush.bi_private = dc;
> + flush.bi_end_io = launch_async_preflush_endio;
> +
> + submit_bio(&flush);
> +}
> +
> +
> +static void flush_backing_device(struct cached_dev *dc)
> +{
> + int ret;
> + unsigned int i;
> + struct keylist keys;
> + struct flush_key_entry *e, *tmp;
> + struct bio flush;
> +
> + if (dc->preflush_keys.count == 0)
> + return;
> +
> + bio_init(&flush, dc->bdev, NULL, 0, REQ_OP_WRITE | REQ_PREFLUSH);
> + ret = submit_bio_wait(&flush);
> + if (ret) {
> + pr_err("flush backing device error %d.\n", ret);
> +
> + /*
> + * In the case of flush failure, do not update the status of bkey
> + * in the btree, and wait until the next time to re-write the dirty
> + * data.
> + */
> + bcache_mark_preflush_keys_clean(dc);
> + return;
> + }
> +
> + /*
> + * The dirty data was successfully written back and confirmed to be written
> + * to the disk. The status of the bkey in the btree was updated.
> + */
> + list_for_each_entry_safe(e, tmp, &dc->preflush_keys.list, list) {
> + memset(keys.inline_keys, 0, sizeof(keys.inline_keys));
> + bch_keylist_init(&keys);
> +
> + bkey_copy(keys.top, &(e->key.key));
> + SET_KEY_DIRTY(keys.top, false);
> + bch_keylist_push(&keys);
> +
If all the preflush keys are stored in a min heap, as I suggested
previously, you can assemble a keylist with multiple keys and send them
into bch_btree_insert(). The keys can be inserted in a batch, which will
be a bit faster.
> + for (i = 0; i < KEY_PTRS(&(e->key.key)); i++)
> + atomic_inc(&PTR_BUCKET(dc->disk.c, &(e->key.key), i)->pin);
> +
> + ret = bch_btree_insert(dc->disk.c, &keys, NULL, &(e->key.key));
> +
> + if (ret)
> + trace_bcache_writeback_collision(&(e->key.key));
> +
> + atomic_long_inc(ret
> + ? &dc->disk.c->writeback_keys_failed
> + : &dc->disk.c->writeback_keys_done);
> +
> + list_del(&e->list);
> + kfree(e);
> +
> + /* For those bkeys that failed to be inserted, you can
> + * ignore them and they will be processed again in the
> + * next write-back scan.
> + */
> + }
> +
Then the above code might be something like,

        while (heap is not empty) {
                while (keylist is not full && heap is not empty) {
                        key = heap_pop()
                        bch_keylist_add(keylist, key)
                }

                bch_btree_insert(keylist);
        }

You need to check bch_btree_insert() to decide whether the heap should be a
max heap or a min heap.
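A slightly more concrete sketch of that loop -- preflush_heap_empty(),
preflush_heap_pop() and keylist_has_room() are placeholders, and per-key
collision handling (the replace_key argument) is omitted:

        static void insert_clean_keys_batched(struct cached_dev *dc)
        {
                struct keylist keys;
                struct bkey *k;

                while (!preflush_heap_empty(dc)) {
                        bch_keylist_init(&keys);

                        do {
                                k = preflush_heap_pop(dc);      /* placeholder */
                                bkey_copy(keys.top, k);
                                SET_KEY_DIRTY(keys.top, false);
                                bch_keylist_push(&keys);
                        } while (!preflush_heap_empty(dc) &&
                                 keylist_has_room(&keys));      /* placeholder */

                        bch_btree_insert(dc->disk.c, &keys, NULL, NULL);
                }
        }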
> + dc->preflush_keys.count = 0;
> +}
> +
> static CLOSURE_CALLBACK(write_dirty_finish)
> {
> + int ret;
> closure_type(io, struct dirty_io, cl);
> struct keybuf_key *w = io->bio.bi_private;
> struct cached_dev *dc = io->dc;
> @@ -358,27 +471,41 @@ static CLOSURE_CALLBACK(write_dirty_finish)
>
> /* This is kind of a dumb way of signalling errors. */
> if (KEY_DIRTY(&w->key)) {
> - int ret;
> unsigned int i;
> struct keylist keys;
>
> - bch_keylist_init(&keys);
> + if (!BDEV_WRITEBACK_FLUSH(&dc->sb)) {
> +update_bkey:
> + bch_keylist_init(&keys);
>
> - bkey_copy(keys.top, &w->key);
> - SET_KEY_DIRTY(keys.top, false);
> - bch_keylist_push(&keys);
> + bkey_copy(keys.top, &w->key);
> + SET_KEY_DIRTY(keys.top, false);
> + bch_keylist_push(&keys);
>
> - for (i = 0; i < KEY_PTRS(&w->key); i++)
> - atomic_inc(&PTR_BUCKET(dc->disk.c, &w->key, i)->pin);
> + for (i = 0; i < KEY_PTRS(&w->key); i++)
> + atomic_inc(&PTR_BUCKET(dc->disk.c, &w->key, i)->pin);
>
> - ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key);
> + ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key);
>
> - if (ret)
> - trace_bcache_writeback_collision(&w->key);
> + if (ret)
> + trace_bcache_writeback_collision(&w->key);
>
> - atomic_long_inc(ret
> - ? &dc->disk.c->writeback_keys_failed
> - : &dc->disk.c->writeback_keys_done);
> + atomic_long_inc(ret
> + ? &dc->disk.c->writeback_keys_failed
> + : &dc->disk.c->writeback_keys_done);
> + } else {
> + /* After flushing the backing device, update the btree */
> + ret = bcache_add_preflush_key(dc, w);
> +
> + /*
> + * When memory allocation fails, immediately send PREFLUSH
> + * and then update the btree.
> + */
> + if (ret) {
> + launch_async_preflush_request(dc);
> + goto update_bkey;
> + }
> + }
> }
>
If, before the cleared key is inserted into the btree, a new write lands
in an LBA range overlapping the cleared key and a dirty key is inserted,
then the cleared key is inserted afterwards and overwrites the dirty key,
while the dirty data on the cache has not been written back to the backing
device yet. How do you handle such a situation?
> bch_keybuf_del(&dc->writeback_keys, w);
> @@ -435,6 +562,7 @@ static CLOSURE_CALLBACK(write_dirty)
> if (KEY_DIRTY(&w->key)) {
> dirty_init(w);
> io->bio.bi_opf = REQ_OP_WRITE;
> +
> io->bio.bi_iter.bi_sector = KEY_START(&w->key);
> bio_set_dev(&io->bio, io->dc->bdev);
> io->bio.bi_end_io = dirty_endio;
> @@ -741,6 +869,7 @@ static int bch_writeback_thread(void *arg)
> struct cached_dev *dc = arg;
> struct cache_set *c = dc->disk.c;
> bool searched_full_index;
> + unsigned long last_flush_jiffies = jiffies;
>
> bch_ratelimit_reset(&dc->writeback_rate);
>
> @@ -819,9 +948,23 @@ static int bch_writeback_thread(void *arg)
>
Thanks.
Coly Li
> read_dirty(dc);
>
> + /*
> + * If the accumulated preflush_keys exceed a certain quantity or
> + * the interval time exceeds 30 seconds, issue the PREFLUSH command
> + * once.
> + */
> + if (dc->preflush_keys.count >= dc->flush_interval ||
> + time_after(jiffies, last_flush_jiffies + 30 * HZ)) {
> + flush_backing_device(dc);
> + last_flush_jiffies = jiffies;
> + }
> +
> if (searched_full_index) {
> unsigned int delay = dc->writeback_delay * HZ;
>
> + /* Clean up the remaining preflush_keys. */
> + flush_backing_device(dc);
> +
> while (delay &&
> !kthread_should_stop() &&
> !test_bit(CACHE_SET_IO_DISABLE, &c->flags) &&
> @@ -1068,10 +1211,15 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
> dc->writeback_rate_fp_term_mid = 10;
> dc->writeback_rate_fp_term_high = 1000;
> dc->writeback_rate_i_term_inverse = 10000;
> + dc->flush_interval = WRITEBACK_FLUSH_INTERVAL_DEFAULT;
>
> /* For dc->writeback_lock contention in update_writeback_rate() */
> dc->rate_update_retry = 0;
>
> + INIT_LIST_HEAD(&dc->preflush_keys.list);
> + spin_lock_init(&dc->preflush_keys.lock);
> + dc->preflush_keys.count = 0;
> +
> WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags));
> INIT_DELAYED_WORK(&dc->writeback_rate_update, update_writeback_rate);
> }
> diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
> index 31df716951f6..1cd3e4addba2 100644
> --- a/drivers/md/bcache/writeback.h
> +++ b/drivers/md/bcache/writeback.h
> @@ -14,6 +14,10 @@
> #define WRITEBACK_RATE_UPDATE_SECS_MAX 60
> #define WRITEBACK_RATE_UPDATE_SECS_DEFAULT 5
>
> +#define WRITEBACK_FLUSH_INTERVAL_MIN 500
> +#define WRITEBACK_FLUSH_INTERVAL_MAX 50000
> +#define WRITEBACK_FLUSH_INTERVAL_DEFAULT 20000 /* the number of bkey */
> +
> #define BCH_AUTO_GC_DIRTY_THRESHOLD 50
>
> #define BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW 50
> --
> 2.18.1
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-08-05 4:57 ` Coly Li
@ 2025-08-05 9:37 ` Zhou Jifeng
2025-08-05 16:29 ` Coly Li
0 siblings, 1 reply; 25+ messages in thread
From: Zhou Jifeng @ 2025-08-05 9:37 UTC (permalink / raw)
To: Coly Li; +Cc: kent.overstreet, linux-bcache, linux-kernel
On Tue, 5 Aug 2025 at 13:00, Coly Li <colyli@kernel.org> wrote:
>
> On Thu, Jul 31, 2025 at 02:21:40PM +0800, Zhou Jifeng wrote:
> > There is a potential data consistency risk in bcache's writeback mode:when
> > the application calls fsync, bcache returns success after completing the
> > log write, persisting the cache disk data, and persisting the HDD internal
> > cache. However, at this point, the actual application data may still be in
> > a dirty state and remain stuck in the cache disk. when these data are
> > subsequently written back to the HDD asynchronously through REQ_OP_WRITE,
> > there is no forced refresh mechanism to ensure physical placement on the
> > disk, and there may be no power-off protection measures, which poses a risk
> > of data loss. This mechanism may cause the application to misjudge that the
> > data has been persisted, which is different from the actual storage state,
> > and also violates the semantic agreement that fsync should ensure data
> > persistence.
> >
>
> [snipped]
>
>
> > Signed-off-by: Zhou Jifeng <zhoujifeng@kylinos.com.cn>
> > ---
> > drivers/md/bcache/bcache.h | 23 ++++
> > drivers/md/bcache/bcache_ondisk.h | 4 +
> > drivers/md/bcache/sysfs.c | 47 ++++++++
> > drivers/md/bcache/writeback.c | 174 +++++++++++++++++++++++++++---
> > drivers/md/bcache/writeback.h | 4 +
> > 5 files changed, 239 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> > index 785b0d9008fa..09424938437b 100644
> > --- a/drivers/md/bcache/bcache.h
> > +++ b/drivers/md/bcache/bcache.h
> > @@ -247,6 +247,17 @@ struct keybuf {
> > DECLARE_ARRAY_ALLOCATOR(struct keybuf_key, freelist, KEYBUF_NR);
> > };
> >
> > +struct keybuf_preflush {
> > + spinlock_t lock;
> > + struct list_head list;
> > + u32 count;
> > +};
> > +
> > +struct flush_key_entry {
> > + struct keybuf_key key;
> > + struct list_head list;
> > +};
> > +
> > struct bcache_device {
> > struct closure cl;
> >
> > @@ -346,6 +357,18 @@ struct cached_dev {
> >
> > struct keybuf writeback_keys;
> >
> > + /*
> > + * Before performing preflush to the backing device, temporarily
> > + * store the bkey waiting to clean up the dirty mark
> > + */
> > + struct keybuf_preflush preflush_keys;
> > + /*
> > + * flush_interval is used to specify that a PROFLUSH operation will
> > + * be issued once a certain number of dirty bkeys have been written
> > + * each time.
> > + */
> > + unsigned int flush_interval;
> > +
> > struct task_struct *status_update_thread;
> > /*
> > * Order the write-half of writeback operations strongly in dispatch
> > diff --git a/drivers/md/bcache/bcache_ondisk.h b/drivers/md/bcache/bcache_ondisk.h
> > index 6620a7f8fffc..df5800838e40 100644
> > --- a/drivers/md/bcache/bcache_ondisk.h
> > +++ b/drivers/md/bcache/bcache_ondisk.h
> > @@ -294,6 +294,10 @@ BITMASK(BDEV_CACHE_MODE, struct cache_sb, flags, 0, 4);
> > #define CACHE_MODE_WRITEBACK 1U
> > #define CACHE_MODE_WRITEAROUND 2U
> > #define CACHE_MODE_NONE 3U
> > +BITMASK(BDEV_WRITEBACK_FLUSH, struct cache_sb, flags, 4, 1);
> > +#define WRITEBACK_FLUSH_OFF 0U
> > +#define WRITEBACK_FLUSH_ON 1U
> > +
>
>
> We should avoid changing the on-disk format. Can you use another method
> to check whether the cached device has to be flushed? E.g. checking
> dc->preflush_keys.
>
>
> > BITMASK(BDEV_STATE, struct cache_sb, flags, 61, 2);
> > #define BDEV_STATE_NONE 0U
> > #define BDEV_STATE_CLEAN 1U
> > diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> > index e8f696cb58c0..cc228e592ab6 100644
> > --- a/drivers/md/bcache/sysfs.c
> > +++ b/drivers/md/bcache/sysfs.c
> > @@ -28,6 +28,18 @@ static const char * const bch_cache_modes[] = {
> > NULL
> > };
> >
> > +/*
> > + * Default is 0 ("off")
> > + * off: Do nothing
> > + * on: Use FLUSH when writing back dirty data.
> > + */
> > +static const char * const bch_writeback_flush[] = {
> > + "off",
> > + "on",
> > + NULL
> > +};
> > +
> > +
> > static const char * const bch_reada_cache_policies[] = {
> > "all",
> > "meta-only",
> > @@ -151,6 +163,19 @@ rw_attribute(copy_gc_enabled);
> > rw_attribute(idle_max_writeback_rate);
> > rw_attribute(gc_after_writeback);
> > rw_attribute(size);
> > +/*
> > + * The "writeback_flush" has two options: "off" and "on". "off" is
> > + * the default value.
> > + * off: Do nothing
> > + * on: Use FLUSH when writing back dirty data.
> > + */
> > +rw_attribute(writeback_flush);
> > +/*
> > + * "flush_interval" is used to specify that a PROFLUSH operation will
> > + * be issued once a certain number of dirty bkeys have been written
> > + * each time.
> > + */
> > +rw_attribute(flush_interval);
> >
> > static ssize_t bch_snprint_string_list(char *buf,
> > size_t size,
> > @@ -213,6 +238,7 @@ SHOW(__bch_cached_dev)
> > var_print(writeback_rate_fp_term_mid);
> > var_print(writeback_rate_fp_term_high);
> > var_print(writeback_rate_minimum);
> > + var_print(flush_interval);
> >
> > if (attr == &sysfs_writeback_rate_debug) {
> > char rate[20];
> > @@ -283,6 +309,11 @@ SHOW(__bch_cached_dev)
> > return strlen(buf);
> > }
> >
> > + if (attr == &sysfs_writeback_flush)
> > + return bch_snprint_string_list(buf, PAGE_SIZE,
> > + bch_writeback_flush,
> > + BDEV_WRITEBACK_FLUSH(&dc->sb));
> > +
> > #undef var
> > return 0;
> > }
> > @@ -354,6 +385,9 @@ STORE(__cached_dev)
> >
> > sysfs_strtoul_clamp(io_error_limit, dc->error_limit, 0, INT_MAX);
> >
> > + sysfs_strtoul_clamp(flush_interval, dc->flush_interval,
> > + WRITEBACK_FLUSH_INTERVAL_MIN, WRITEBACK_FLUSH_INTERVAL_MAX);
> > +
> > if (attr == &sysfs_io_disable) {
> > int v = strtoul_or_return(buf);
> >
> > @@ -451,6 +485,17 @@ STORE(__cached_dev)
> > if (attr == &sysfs_stop)
> > bcache_device_stop(&dc->disk);
> >
> > + if (attr == &sysfs_writeback_flush) {
> > + v = __sysfs_match_string(bch_writeback_flush, -1, buf);
> > + if (v < 0)
> > + return v;
> > +
> > + if ((unsigned int) v != BDEV_WRITEBACK_FLUSH(&dc->sb)) {
> > + SET_BDEV_WRITEBACK_FLUSH(&dc->sb, v);
> > + bch_write_bdev_super(dc, NULL);
> > + }
> > + }
> > +
> > return size;
> > }
> >
> > @@ -541,6 +586,8 @@ static struct attribute *bch_cached_dev_attrs[] = {
> > #endif
> > &sysfs_backing_dev_name,
> > &sysfs_backing_dev_uuid,
> > + &sysfs_writeback_flush,
> > + &sysfs_flush_interval,
> > NULL
> > };
> > ATTRIBUTE_GROUPS(bch_cached_dev);
>
> We don't need this sysfs item. Once the issue is fixed, this is
> mandatory for data security, even if it hurts performance.
>
>
>
> > diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> > index 453efbbdc8ee..530eea2b953a 100644
> > --- a/drivers/md/bcache/writeback.c
> > +++ b/drivers/md/bcache/writeback.c
> > @@ -348,8 +348,121 @@ static CLOSURE_CALLBACK(dirty_io_destructor)
> > kfree(io);
> > }
> >
> > +static int bcache_add_preflush_key(struct cached_dev *dc, struct keybuf_key *key)
> > +{
> > + struct flush_key_entry *entry = kmalloc(sizeof(*entry), GFP_ATOMIC);
> > +
> > + if (!entry) {
> > + pr_info("the preflush bkey memory allocation failed.\n");
>
> The debug pr_info() can be removed.
>
>
>
> > + return -ENOMEM;
> > + }
> > +
> > + memcpy(&entry->key, key, sizeof(*key));
>
> You may want to look at bkey_copy().
>
> > + INIT_LIST_HEAD(&entry->list);
> > +
> > + spin_lock(&dc->preflush_keys.lock);
> > + list_add_tail(&entry->list, &dc->preflush_keys.list);
> > + dc->preflush_keys.count++;
> > + spin_unlock(&dc->preflush_keys.lock);
> > +
> > + return 0;
> > +}
> > +
> > +static void bcache_mark_preflush_keys_clean(struct cached_dev *dc)
> > +{
> > + struct flush_key_entry *e, *tmp;
> > +
> > + list_for_each_entry_safe(e, tmp, &dc->preflush_keys.list, list) {
> > + list_del(&e->list);
> > + kfree(e);
> > + }
> > + dc->preflush_keys.count = 0;
> > +}
> > +
>
> For dc->preflush_keys, I would prefer a statically allocated buffer to a
> dynamic list. You may check the heap routines in util.h; that is what I'd
> suggest using.
>
> You may initialize the heap buffer quite large, and store the preflush
> keys in order. It will be useful when you insert the keys back into the btree.
>
>
> > +static void launch_async_preflush_endio(struct bio *bio)
> > +{
> > + if (bio->bi_status)
> > + pr_err("flush backing device error %d.\n", bio->bi_status);
> > +}
> > +
> > +
> > +static inline void launch_async_preflush_request(struct cached_dev *dc)
> > +{
> > + struct bio flush;
> > +
> > + bio_init(&flush, dc->bdev, NULL, 0, REQ_OP_WRITE | REQ_PREFLUSH);
> > +
> > + flush.bi_private = dc;
> > + flush.bi_end_io = launch_async_preflush_endio;
> > +
> > + submit_bio(&flush);
> > +}
> > +
> > +
> > +static void flush_backing_device(struct cached_dev *dc)
> > +{
> > + int ret;
> > + unsigned int i;
> > + struct keylist keys;
> > + struct flush_key_entry *e, *tmp;
> > + struct bio flush;
> > +
> > + if (dc->preflush_keys.count == 0)
> > + return;
> > +
> > + bio_init(&flush, dc->bdev, NULL, 0, REQ_OP_WRITE | REQ_PREFLUSH);
> > + ret = submit_bio_wait(&flush);
> > + if (ret) {
> > + pr_err("flush backing device error %d.\n", ret);
> > +
> > + /*
> > + * In the case of flush failure, do not update the status of bkey
> > + * in the btree, and wait until the next time to re-write the dirty
> > + * data.
> > + */
> > + bcache_mark_preflush_keys_clean(dc);
> > + return;
> > + }
> > +
> > + /*
> > + * The dirty data was successfully written back and confirmed to be written
> > + * to the disk. The status of the bkey in the btree was updated.
> > + */
> > + list_for_each_entry_safe(e, tmp, &dc->preflush_keys.list, list) {
> > + memset(keys.inline_keys, 0, sizeof(keys.inline_keys));
> > + bch_keylist_init(&keys);
> > +
> > + bkey_copy(keys.top, &(e->key.key));
> > + SET_KEY_DIRTY(keys.top, false);
> > + bch_keylist_push(&keys);
> > +
>
> If all the preflush keys are stored in a min heap, as I suggested
> previously, you can assemble a keylist with multiple keys and send them
> into bch_btree_insert(). The keys can be inserted in a batch, which will
> be a bit faster.
>
>
> > + for (i = 0; i < KEY_PTRS(&(e->key.key)); i++)
> > + atomic_inc(&PTR_BUCKET(dc->disk.c, &(e->key.key), i)->pin);
> > +
> > + ret = bch_btree_insert(dc->disk.c, &keys, NULL, &(e->key.key));
> > +
> > + if (ret)
> > + trace_bcache_writeback_collision(&(e->key.key));
> > +
> > + atomic_long_inc(ret
> > + ? &dc->disk.c->writeback_keys_failed
> > + : &dc->disk.c->writeback_keys_done);
> > +
> > + list_del(&e->list);
> > + kfree(e);
> > +
> > + /* For those bkeys that failed to be inserted, you can
> > + * ignore them and they will be processed again in the
> > + * next write-back scan.
> > + */
> > + }
> > +
>
> Then the above code might be something like,
>
>         while (heap is not empty) {
>                 while (keylist is not full && heap is not empty) {
>                         key = heap_pop()
>                         bch_keylist_add(keylist, key)
>                 }
>
>                 bch_btree_insert(keylist);
>         }
>
> You need to check bch_btree_insert() to decide whether the heap should be a
> max heap or a min heap.
>
>
> > + dc->preflush_keys.count = 0;
> > +}
> > +
> > static CLOSURE_CALLBACK(write_dirty_finish)
> > {
> > + int ret;
> > closure_type(io, struct dirty_io, cl);
> > struct keybuf_key *w = io->bio.bi_private;
> > struct cached_dev *dc = io->dc;
> > @@ -358,27 +471,41 @@ static CLOSURE_CALLBACK(write_dirty_finish)
> >
> > /* This is kind of a dumb way of signalling errors. */
> > if (KEY_DIRTY(&w->key)) {
> > - int ret;
> > unsigned int i;
> > struct keylist keys;
> >
> > - bch_keylist_init(&keys);
> > + if (!BDEV_WRITEBACK_FLUSH(&dc->sb)) {
> > +update_bkey:
> > + bch_keylist_init(&keys);
> >
> > - bkey_copy(keys.top, &w->key);
> > - SET_KEY_DIRTY(keys.top, false);
> > - bch_keylist_push(&keys);
> > + bkey_copy(keys.top, &w->key);
> > + SET_KEY_DIRTY(keys.top, false);
> > + bch_keylist_push(&keys);
> >
> > - for (i = 0; i < KEY_PTRS(&w->key); i++)
> > - atomic_inc(&PTR_BUCKET(dc->disk.c, &w->key, i)->pin);
> > + for (i = 0; i < KEY_PTRS(&w->key); i++)
> > + atomic_inc(&PTR_BUCKET(dc->disk.c, &w->key, i)->pin);
> >
> > - ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key);
> > + ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key);
> >
> > - if (ret)
> > - trace_bcache_writeback_collision(&w->key);
> > + if (ret)
> > + trace_bcache_writeback_collision(&w->key);
> >
> > - atomic_long_inc(ret
> > - ? &dc->disk.c->writeback_keys_failed
> > - : &dc->disk.c->writeback_keys_done);
> > + atomic_long_inc(ret
> > + ? &dc->disk.c->writeback_keys_failed
> > + : &dc->disk.c->writeback_keys_done);
> > + } else {
> > + /* After flushing the backing device, update the btree */
> > + ret = bcache_add_preflush_key(dc, w);
> > +
> > + /*
> > + * When memory allocation fails, immediately send PREFLUSH
> > + * and then update the btree.
> > + */
> > + if (ret) {
> > + launch_async_preflush_request(dc);
> > + goto update_bkey;
> > + }
> > + }
> > }
> >
>
> If, before the cleared key is inserted into the btree, a new write lands
> in an LBA range overlapping the cleared key and a dirty key is inserted,
> then the cleared key is inserted afterwards and overwrites the dirty key,
> while the dirty data on the cache has not been written back to the backing
> device yet. How do you handle such a situation?
>
There are indeed some issues here. I have initially come up with a
solution: utilize the existing dc->writeback_keys mechanism for
protection. The general processing flow is as follows:
1. In the write_dirty_finish() function, remove the update and insertion
of the bkey, and delete the bch_keybuf_del(&dc->writeback_keys, w) call.
2. After read_dirty(dc) returns, perform the flush, then insert the
updated bkeys, and finally remove them from dc->writeback_keys. This is
equivalent to sending a flush for every KEYBUF_NR bkeys written back.
3. Support a configurable KEYBUF_NR to indirectly control the flush
frequency.
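Roughly, step 2 above could look like the following sketch (the iterator
and insert_clean_key() are placeholders for the real keybuf walk and btree
insert):

        static void writeback_flush_and_finish(struct cached_dev *dc)
        {
                struct keybuf_key *w, *next;

                /* One flush covers everything read_dirty() just wrote back. */
                if (blkdev_issue_flush(dc->bdev))
                        return; /* flush failed: keys stay dirty, retried next pass */

                for_each_writeback_key_safe(w, next, &dc->writeback_keys) {
                        insert_clean_key(dc, w);        /* clear KEY_DIRTY + insert */
                        bch_keybuf_del(&dc->writeback_keys, w);
                }
        }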
Is this plan appropriate? Or are there any better ways to handle it?
> > bch_keybuf_del(&dc->writeback_keys, w);
> > @@ -435,6 +562,7 @@ static CLOSURE_CALLBACK(write_dirty)
> > if (KEY_DIRTY(&w->key)) {
> > dirty_init(w);
> > io->bio.bi_opf = REQ_OP_WRITE;
> > +
> > io->bio.bi_iter.bi_sector = KEY_START(&w->key);
> > bio_set_dev(&io->bio, io->dc->bdev);
> > io->bio.bi_end_io = dirty_endio;
> > @@ -741,6 +869,7 @@ static int bch_writeback_thread(void *arg)
> > struct cached_dev *dc = arg;
> > struct cache_set *c = dc->disk.c;
> > bool searched_full_index;
> > + unsigned long last_flush_jiffies = jiffies;
> >
> > bch_ratelimit_reset(&dc->writeback_rate);
> >
> > @@ -819,9 +948,23 @@ static int bch_writeback_thread(void *arg)
> >
>
> Thanks.
>
> Coly Li
> > read_dirty(dc);
> >
> > + /*
> > + * If the accumulated preflush_keys exceed a certain quantity or
> > + * the interval time exceeds 30 seconds, issue the PREFLUSH command
> > + * once.
> > + */
> > + if (dc->preflush_keys.count >= dc->flush_interval ||
> > + time_after(jiffies, last_flush_jiffies + 30 * HZ)) {
> > + flush_backing_device(dc);
> > + last_flush_jiffies = jiffies;
> > + }
> > +
> > if (searched_full_index) {
> > unsigned int delay = dc->writeback_delay * HZ;
> >
> > + /* Clean up the remaining preflush_keys. */
> > + flush_backing_device(dc);
> > +
> > while (delay &&
> > !kthread_should_stop() &&
> > !test_bit(CACHE_SET_IO_DISABLE, &c->flags) &&
> > @@ -1068,10 +1211,15 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
> > dc->writeback_rate_fp_term_mid = 10;
> > dc->writeback_rate_fp_term_high = 1000;
> > dc->writeback_rate_i_term_inverse = 10000;
> > + dc->flush_interval = WRITEBACK_FLUSH_INTERVAL_DEFAULT;
> >
> > /* For dc->writeback_lock contention in update_writeback_rate() */
> > dc->rate_update_retry = 0;
> >
> > + INIT_LIST_HEAD(&dc->preflush_keys.list);
> > + spin_lock_init(&dc->preflush_keys.lock);
> > + dc->preflush_keys.count = 0;
> > +
> > WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags));
> > INIT_DELAYED_WORK(&dc->writeback_rate_update, update_writeback_rate);
> > }
> > diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
> > index 31df716951f6..1cd3e4addba2 100644
> > --- a/drivers/md/bcache/writeback.h
> > +++ b/drivers/md/bcache/writeback.h
> > @@ -14,6 +14,10 @@
> > #define WRITEBACK_RATE_UPDATE_SECS_MAX 60
> > #define WRITEBACK_RATE_UPDATE_SECS_DEFAULT 5
> >
> > +#define WRITEBACK_FLUSH_INTERVAL_MIN 500
> > +#define WRITEBACK_FLUSH_INTERVAL_MAX 50000
> > +#define WRITEBACK_FLUSH_INTERVAL_DEFAULT 20000 /* the number of bkey */
> > +
> > #define BCH_AUTO_GC_DIRTY_THRESHOLD 50
> >
> > #define BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW 50
> > --
> > 2.18.1
> >
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-08-05 9:37 ` Zhou Jifeng
@ 2025-08-05 16:29 ` Coly Li
2025-08-06 11:19 ` Zhou Jifeng
0 siblings, 1 reply; 25+ messages in thread
From: Coly Li @ 2025-08-05 16:29 UTC (permalink / raw)
To: Zhou Jifeng; +Cc: kent.overstreet, linux-bcache, linux-kernel
On Tue, Aug 05, 2025 at 05:37:44PM +0800, Zhou Jifeng wrote:
> On Tue, 5 Aug 2025 at 13:00, Coly Li <colyli@kernel.org> wrote:
> >
> > On Thu, Jul 31, 2025 at 02:21:40PM +0800, Zhou Jifeng wrote:
> > > There is a potential data consistency risk in bcache's writeback mode:when
> > > the application calls fsync, bcache returns success after completing the
> > > log write, persisting the cache disk data, and persisting the HDD internal
> > > cache. However, at this point, the actual application data may still be in
> > > a dirty state and remain stuck in the cache disk. when these data are
> > > subsequently written back to the HDD asynchronously through REQ_OP_WRITE,
> > > there is no forced refresh mechanism to ensure physical placement on the
> > > disk, and there may be no power-off protection measures, which poses a risk
> > > of data loss. This mechanism may cause the application to misjudge that the
> > > data has been persisted, which is different from the actual storage state,
> > > and also violates the semantic agreement that fsync should ensure data
> > > persistence.
> > >
> >
> > [snipped]
> >
> >
> >
> > If before the cleared key inserted into the btree, there are new write
> > into overlapped LBA range of the cleared key and a dirty key inserted.
> > Then the cleared key is inserted and overwrites the dirty key, but the
> > dirty data on cache is not written back to backing device yet. How to
> > handle such situation?
> >
>
> There are indeed some issues here. I have initially come up with a
> solution: Utilize the existing dc->writeback_keys mechanism for
> protection. The general processing flow is as follows:
> 1. In the write_dirty_finish() function, remove the operation of
> updating bkey insertion, and delete the code bch_keybuf_del(&dc
> ->writeback_keys, w).
> 2. After executing the read_dirty(dc) code, perform flush, then
> insert the updated bkey, and finally remove the bkey from dc->
> writeback_keys. This process is equivalent to sending a flush
> every KEYBUF_NR bkeys are written back.
> 3. Support configurable KEYBUF_NR to indirectly control the
> frequency of flush.
>
> Is this plan appropriate? Or are there any better ways to handle it?
No, I won't suggest this way. It sounds complicated and changes the main
code flow too much in an implicit way; this should be avoided.
So it seems Kent's suggestion to flush backing device before committing
jset is the proper method I can see now.
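For concreteness only, a minimal sketch of that direction; the helper
name and the exact hook point in the journal code are assumptions, not
something proposed in this thread:

static int flush_backing_before_jset(struct cached_dev *dc)
{
	struct bio flush;

	/* Make previously submitted writeback writes durable on the
	 * backing device before the journal set (and therefore the
	 * fsync that waits on it) is allowed to complete.
	 */
	bio_init(&flush, dc->bdev, NULL, 0, REQ_OP_WRITE | REQ_PREFLUSH);
	return submit_bio_wait(&flush);
}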
Coly Li
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-08-05 16:29 ` Coly Li
@ 2025-08-06 11:19 ` Zhou Jifeng
2025-08-06 16:10 ` Coly Li
0 siblings, 1 reply; 25+ messages in thread
From: Zhou Jifeng @ 2025-08-06 11:19 UTC (permalink / raw)
To: Coly Li; +Cc: kent.overstreet, linux-bcache, linux-kernel
On Wed, 6 Aug 2025 at 00:29, Coly Li <colyli@kernel.org> wrote:
>
> On Tue, Aug 05, 2025 at 05:37:44PM +0800, Zhou Jifeng wrote:
> > On Tue, 5 Aug 2025 at 13:00, Coly Li <colyli@kernel.org> wrote:
> > >
> > > On Thu, Jul 31, 2025 at 02:21:40PM +0800, Zhou Jifeng wrote:
> > > > There is a potential data consistency risk in bcache's writeback mode:when
> > > > the application calls fsync, bcache returns success after completing the
> > > > log write, persisting the cache disk data, and persisting the HDD internal
> > > > cache. However, at this point, the actual application data may still be in
> > > > a dirty state and remain stuck in the cache disk. when these data are
> > > > subsequently written back to the HDD asynchronously through REQ_OP_WRITE,
> > > > there is no forced refresh mechanism to ensure physical placement on the
> > > > disk, and there may be no power-off protection measures, which poses a risk
> > > > of data loss. This mechanism may cause the application to misjudge that the
> > > > data has been persisted, which is different from the actual storage state,
> > > > and also violates the semantic agreement that fsync should ensure data
> > > > persistence.
> > > >
> > >
> > > [snipped]
> > >
> > >
> > >
> > > If before the cleared key inserted into the btree, there are new write
> > > into overlapped LBA range of the cleared key and a dirty key inserted.
> > > Then the cleared key is inserted and overwrites the dirty key, but the
> > > dirty data on cache is not written back to backing device yet. How to
> > > handle such situation?
> > >
> >
> > There are indeed some issues here. I have initially come up with a
> > solution: Utilize the existing dc->writeback_keys mechanism for
> > protection. The general processing flow is as follows:
> > 1. In the write_dirty_finish() function, remove the operation of
> > updating bkey insertion, and delete the code bch_keybuf_del(&dc
> > ->writeback_keys, w).
> > 2. After executing the read_dirty(dc) code, perform flush, then
> > insert the updated bkey, and finally remove the bkey from dc->
> > writeback_keys. This process is equivalent to sending a flush
> > every KEYBUF_NR bkeys are written back.
> > 3. Support configurable KEYBUF_NR to indirectly control the
> > frequency of flush.
> >
> > Is this plan appropriate? Or are there any better ways to handle it?
>
> No, I won't suggest this way. It sounds complicaed and changes the main
> code flow too much in an implicit way, this should be avoided.
>
> So it seems Kent's suggestion to flush backing device before committing
> jset is the proper method I can see now.
>
> Coly Li
>
Sorry, my previous response was not rigorous enough. I have carefully
considered your question about "the bkey being overwritten". In fact,
there is no overwrite issue. bcache has a well-designed replace
mechanism: in my code, the bkey with the dirty flag cleared is
inserted using the replace method. This method handles address
overlaps during bkey insertion and will not overwrite a bkey generated
by concurrent writes. The main code for the replace mechanism is in
bch_btree_insert_key->bch_extent_insert_fixup, which calls the
bch_bkey_equal_header function, another crucial checkpoint.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-08-06 11:19 ` Zhou Jifeng
@ 2025-08-06 16:10 ` Coly Li
2025-08-07 2:01 ` Zhou Jifeng
0 siblings, 1 reply; 25+ messages in thread
From: Coly Li @ 2025-08-06 16:10 UTC (permalink / raw)
To: Zhou Jifeng; +Cc: kent.overstreet, linux-bcache, linux-kernel
On Wed, Aug 06, 2025 at 07:19:49PM +0800, Zhou Jifeng wrote:
> On Wed, 6 Aug 2025 at 00:29, Coly Li <colyli@kernel.org> wrote:
> >
> > On Tue, Aug 05, 2025 at 05:37:44PM +0800, Zhou Jifeng wrote:
> > > On Tue, 5 Aug 2025 at 13:00, Coly Li <colyli@kernel.org> wrote:
> > > >
> > > > On Thu, Jul 31, 2025 at 02:21:40PM +0800, Zhou Jifeng wrote:
> > > > > There is a potential data consistency risk in bcache's writeback mode:when
> > > > > the application calls fsync, bcache returns success after completing the
> > > > > log write, persisting the cache disk data, and persisting the HDD internal
> > > > > cache. However, at this point, the actual application data may still be in
> > > > > a dirty state and remain stuck in the cache disk. when these data are
> > > > > subsequently written back to the HDD asynchronously through REQ_OP_WRITE,
> > > > > there is no forced refresh mechanism to ensure physical placement on the
> > > > > disk, and there may be no power-off protection measures, which poses a risk
> > > > > of data loss. This mechanism may cause the application to misjudge that the
> > > > > data has been persisted, which is different from the actual storage state,
> > > > > and also violates the semantic agreement that fsync should ensure data
> > > > > persistence.
> > > > >
> > > >
> > > > [snipped]
> > > >
> > > >
> > > >
> > > > If before the cleared key inserted into the btree, there are new write
> > > > into overlapped LBA range of the cleared key and a dirty key inserted.
> > > > Then the cleared key is inserted and overwrites the dirty key, but the
> > > > dirty data on cache is not written back to backing device yet. How to
> > > > handle such situation?
> > > >
> > >
> > > There are indeed some issues here. I have initially come up with a
> > > solution: Utilize the existing dc->writeback_keys mechanism for
> > > protection. The general processing flow is as follows:
> > > 1. In the write_dirty_finish() function, remove the operation of
> > > updating bkey insertion, and delete the code bch_keybuf_del(&dc
> > > ->writeback_keys, w).
> > > 2. After executing the read_dirty(dc) code, perform flush, then
> > > insert the updated bkey, and finally remove the bkey from dc->
> > > writeback_keys. This process is equivalent to sending a flush
> > > every KEYBUF_NR bkeys are written back.
> > > 3. Support configurable KEYBUF_NR to indirectly control the
> > > frequency of flush.
> > >
> > > Is this plan appropriate? Or are there any better ways to handle it?
> >
> > No, I won't suggest this way. It sounds complicaed and changes the main
> > code flow too much in an implicit way, this should be avoided.
> >
> > So it seems Kent's suggestion to flush backing device before committing
> > jset is the proper method I can see now.
> >
> > Coly Li
> >
>
> Sorry, my previous response was not rigorous enough. I have carefully
> considered your question about "the bkey being overwritten". In fact,
> there is no issue of being overwritten. The bcache has ingeniously
> designed a replace mechanism. In my code, the bkey with the dirty flag
> cleared is inserted using the replace method. This method handles
> address overlaps ingeniously during the insertion of the bkey and will
> not overwrite the bkey generated by concurrent writes. The main code
> for the replace mechanism is located in bch_btree_insert_key->bch_extent_insert_fixup
> , which calls the bch_bkey_equal_header function, which is also a
> crucial checkpoint.
I am not able to make a judgement from the above description. Can you
post a patch so I can see how you insert the key with the replace
parameter?
Coly Li
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH] bcache: enhancing the security of dirty data writeback
2025-08-06 16:10 ` Coly Li
@ 2025-08-07 2:01 ` Zhou Jifeng
2025-08-13 2:12 ` [PATCH v2] " Zhou Jifeng
0 siblings, 1 reply; 25+ messages in thread
From: Zhou Jifeng @ 2025-08-07 2:01 UTC (permalink / raw)
To: Coly Li; +Cc: kent.overstreet, linux-bcache, linux-kernel
On Thu, 7 Aug 2025 at 00:11, Coly Li <colyli@kernel.org> wrote:
>
> On Wed, Aug 06, 2025 at 07:19:49PM +0800, Zhou Jifeng wrote:
> > On Wed, 6 Aug 2025 at 00:29, Coly Li <colyli@kernel.org> wrote:
> > >
> > > On Tue, Aug 05, 2025 at 05:37:44PM +0800, Zhou Jifeng wrote:
> > > > On Tue, 5 Aug 2025 at 13:00, Coly Li <colyli@kernel.org> wrote:
> > > > >
> > > > > On Thu, Jul 31, 2025 at 02:21:40PM +0800, Zhou Jifeng wrote:
> > > > > > There is a potential data consistency risk in bcache's writeback mode:when
> > > > > > the application calls fsync, bcache returns success after completing the
> > > > > > log write, persisting the cache disk data, and persisting the HDD internal
> > > > > > cache. However, at this point, the actual application data may still be in
> > > > > > a dirty state and remain stuck in the cache disk. when these data are
> > > > > > subsequently written back to the HDD asynchronously through REQ_OP_WRITE,
> > > > > > there is no forced refresh mechanism to ensure physical placement on the
> > > > > > disk, and there may be no power-off protection measures, which poses a risk
> > > > > > of data loss. This mechanism may cause the application to misjudge that the
> > > > > > data has been persisted, which is different from the actual storage state,
> > > > > > and also violates the semantic agreement that fsync should ensure data
> > > > > > persistence.
> > > > > >
> > > > >
> > > > > [snipped]
> > > > >
> > > > >
> > > > >
> > > > > If before the cleared key inserted into the btree, there are new write
> > > > > into overlapped LBA range of the cleared key and a dirty key inserted.
> > > > > Then the cleared key is inserted and overwrites the dirty key, but the
> > > > > dirty data on cache is not written back to backing device yet. How to
> > > > > handle such situation?
> > > > >
> > > >
> > > > There are indeed some issues here. I have initially come up with a
> > > > solution: Utilize the existing dc->writeback_keys mechanism for
> > > > protection. The general processing flow is as follows:
> > > > 1. In the write_dirty_finish() function, remove the operation of
> > > > updating bkey insertion, and delete the code bch_keybuf_del(&dc
> > > > ->writeback_keys, w).
> > > > 2. After executing the read_dirty(dc) code, perform flush, then
> > > > insert the updated bkey, and finally remove the bkey from dc->
> > > > writeback_keys. This process is equivalent to sending a flush
> > > > every KEYBUF_NR bkeys are written back.
> > > > 3. Support configurable KEYBUF_NR to indirectly control the
> > > > frequency of flush.
> > > >
> > > > Is this plan appropriate? Or are there any better ways to handle it?
> > >
> > > No, I won't suggest this way. It sounds complicaed and changes the main
> > > code flow too much in an implicit way, this should be avoided.
> > >
> > > So it seems Kent's suggestion to flush backing device before committing
> > > jset is the proper method I can see now.
> > >
> > > Coly Li
> > >
> >
> > Sorry, my previous response was not rigorous enough. I have carefully
> > considered your question about "the bkey being overwritten". In fact,
> > there is no issue of being overwritten. The bcache has ingeniously
> > designed a replace mechanism. In my code, the bkey with the dirty flag
> > cleared is inserted using the replace method. This method handles
> > address overlaps ingeniously during the insertion of the bkey and will
> > not overwrite the bkey generated by concurrent writes. The main code
> > for the replace mechanism is located in bch_btree_insert_key->bch_extent_insert_fixup
> > , which calls the bch_bkey_equal_header function, which is also a
> > crucial checkpoint.
>
> I am not able to make judgement by the above description, can you post a patch
> then I can see how you insert the key with replace parameter.
>
> Coly Li
>
In the patch I submitted earlier, clearing the dirty mark on a bkey is
implemented by the following code:
+ /*
+ * The dirty data was successfully written back and confirmed to be written
+ * to the disk. The status of the bkey in the btree was updated.
+ */
+ list_for_each_entry_safe(e, tmp, &dc->preflush_keys.list, list) {
+ memset(keys.inline_keys, 0, sizeof(keys.inline_keys));
+ bch_keylist_init(&keys);
+
+ bkey_copy(keys.top, &(e->key.key));
+ SET_KEY_DIRTY(keys.top, false);
+ bch_keylist_push(&keys);
+
+ for (i = 0; i < KEY_PTRS(&(e->key.key)); i++)
+ atomic_inc(&PTR_BUCKET(dc->disk.c, &(e->key.key), i)->pin);
+
+ ret = bch_btree_insert(dc->disk.c, &keys, NULL, &(e->key.key));
+
+ if (ret)
+ trace_bcache_writeback_collision(&(e->key.key));
+
+ atomic_long_inc(ret
+ ? &dc->disk.c->writeback_keys_failed
+ : &dc->disk.c->writeback_keys_done);
+
+ list_del(&e->list);
+ kfree(e);
+
+ /* For those bkeys that failed to be inserted, you can
+ * ignore them and they will be processed again in the
+ * next write-back scan.
+ */
+ }
Code Explanation:
1. dc->preflush_keys.list stores all bkeys whose dirty flags are to
be cleared.
2. In the list_for_each_entry_safe loop, there are two important
variables: variable 'e' stores the bkeys whose dirty flags are to be
cleared (the original bkeys waiting to be replaced), and variable
'keys' stores the bkeys whose dirty flags have been cleared (the
new bkeys waiting to be inserted).
3. In the bch_btree_insert(dc->disk.c, &keys, NULL, &(e->key.key))
function call, the second parameter '&keys' represents the new bkey
to be inserted, whose dirty flag has been cleared, and the fourth
parameter '&(e->key.key)' represents the original bkey to be replaced.
4. The prototype definition of the bch_btree_insert function is
"int bch_btree_insert(struct cache_set *c, struct keylist *keys,
atomic_t *journal_ref, struct bkey *replace_key)". Its fourth parameter
specifies the bkey to be replaced.
5. The core code call stack for replacing the bkey: bch_btree_insert->
bch_btree_map_leaf_nodes->bch_btree_map_nodes_recurse->
btree_insert_fn->bch_btree_insert_node->bch_btree_insert_keys->
btree_insert_key->bch_btree_insert_key->bch_extent_insert_fixup
6. The prototype of the bch_extent_insert_fixup function is
"static bool bch_extent_insert_fixup(struct btree_keys *b, struct bkey
*insert, struct btree_iter *iter, struct bkey *replace_key)". Internally
it checks whether the replace_key parameter is present and, if so,
initiates the replacement process. This replacement process ensures
that a bkey is not erroneously overwritten when a write request and
dirty-data writeback run concurrently (a minimal sketch of such a
replace-style insert is shown below).
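A minimal sketch of such a replace-style insert, using the
bch_btree_insert() prototype quoted above (the helper name is
illustrative, and bucket pinning and statistics are omitted):

static int clear_dirty_with_replace(struct cache_set *c, struct bkey *orig)
{
	struct keylist keys;
	BKEY_PADDED(key) replace_key;

	bch_keylist_init(&keys);
	bkey_copy(keys.top, orig);		/* same extent ...         */
	SET_KEY_DIRTY(keys.top, false);		/* ... but no longer dirty */
	bch_keylist_push(&keys);

	/* Insert only as a replacement of the original key. If a newer
	 * overlapping dirty key was inserted meanwhile, the replace
	 * fails and the extent stays dirty for the next writeback scan.
	 */
	bkey_copy(&replace_key.key, orig);
	return bch_btree_insert(c, &keys, NULL, &replace_key.key);
}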
Zhou Jifeng
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v2] bcache: enhancing the security of dirty data writeback
2025-08-07 2:01 ` Zhou Jifeng
@ 2025-08-13 2:12 ` Zhou Jifeng
2025-08-13 7:11 ` Zhou Jifeng
0 siblings, 1 reply; 25+ messages in thread
From: Zhou Jifeng @ 2025-08-13 2:12 UTC (permalink / raw)
To: zhoujifeng, colyli; +Cc: kent.overstreet, linux-bcache, linux-kernel
There is a potential data consistency risk in bcache's writeback mode:when
the application calls fsync, bcache returns success after completing the
log write, persisting the cache disk data, and persisting the HDD internal
cache. However, at this point, the actual application data may still be in
a dirty state and remain stuck in the cache disk. when these data are
subsequently written back to the HDD asynchronously through REQ_OP_WRITE,
there is no forced refresh mechanism to ensure physical placement on the
disk, and there may be no power-off protection measures, which poses a risk
of data loss. This mechanism may cause the application to misjudge that the
data has been persisted, which is different from the actual storage state,
and also violates the semantic agreement that fsync should ensure data
persistence.
This patch aims to enhance the reliability of dirty data writeback through
PREFLUSH, ensuring that the dirty data mark in the cache device is cleared
only after the dirty data is written to the disk. Double triggering
conditions for PREFLUSH:
1. When the cumulative number of dirty bkeys written back reaches the
threshold(Dynamic control parameters:/sys/block/bcache0/bcache/
flush_interval, the default value is 20000. by increasing the value of this
parameter, the impact of flush on performance can be reduced.)
2. When the interval since the last refresh exceeds 30 seconds
If any of the conditions are met, the system will send a PREFLUSH command
to the backend HDD, and clear the corresponding dirty bkey mark only after
confirming that the PREFLUSH is executed successfully.
Signed-off-by: Zhou Jifeng <zhoujifeng@kylinos.com.cn>
---
v1 -> v2: Make revisions according to the review comments.
drivers/md/bcache/bcache.h | 29 +++++++
drivers/md/bcache/bcache_ondisk.h | 4 +
drivers/md/bcache/btree.c | 136 ++++++++++++++++++++++++++++++
drivers/md/bcache/btree.h | 4 +
drivers/md/bcache/sysfs.c | 11 +++
drivers/md/bcache/writeback.c | 115 +++++++++++++++++++++----
6 files changed, 283 insertions(+), 16 deletions(-)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 785b0d9008fa..09012202658a 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -247,6 +247,23 @@ struct keybuf {
DECLARE_ARRAY_ALLOCATOR(struct keybuf_key, freelist, KEYBUF_NR);
};
+struct preflush_bkey {
+ BKEY_PADDED(key);
+ struct list_head list;
+};
+
+#define WRITEBACK_FLUSH_INTERVAL_DEFAULT 20000 /* the number of bkey */
+#define WRITEBACK_FLUSH_INTERVAL_MIN 500
+#define WRITEBACK_FLUSH_INTERVAL_MAX 50000
+
+struct preflush_buf {
+ spinlock_t lock;
+ u32 count;
+ struct list_head allocated_head;
+
+ DECLARE_ARRAY_ALLOCATOR(struct preflush_bkey, freelist, WRITEBACK_FLUSH_INTERVAL_MAX);
+};
+
struct bcache_device {
struct closure cl;
@@ -346,6 +363,18 @@ struct cached_dev {
struct keybuf writeback_keys;
+ /*
+ * Temporarily store the bkeys that are waiting to have their dirty
+ * marks cleared after a PREFLUSH to the backing device.
+ */
+ struct preflush_buf preflush_keys;
+ /*
+ * flush_interval is used to specify that a PREFLUSH operation will
+ * be issued each time this many dirty bkeys have been written
+ * back.
+ */
+ unsigned int flush_interval;
+
struct task_struct *status_update_thread;
/*
* Order the write-half of writeback operations strongly in dispatch
diff --git a/drivers/md/bcache/bcache_ondisk.h b/drivers/md/bcache/bcache_ondisk.h
index 6620a7f8fffc..df5800838e40 100644
--- a/drivers/md/bcache/bcache_ondisk.h
+++ b/drivers/md/bcache/bcache_ondisk.h
@@ -294,6 +294,10 @@ BITMASK(BDEV_CACHE_MODE, struct cache_sb, flags, 0, 4);
#define CACHE_MODE_WRITEBACK 1U
#define CACHE_MODE_WRITEAROUND 2U
#define CACHE_MODE_NONE 3U
+BITMASK(BDEV_WRITEBACK_FLUSH, struct cache_sb, flags, 4, 1);
+#define WRITEBACK_FLUSH_OFF 0U
+#define WRITEBACK_FLUSH_ON 1U
+
BITMASK(BDEV_STATE, struct cache_sb, flags, 61, 2);
#define BDEV_STATE_NONE 0U
#define BDEV_STATE_CLEAN 1U
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index ed40d8600656..3b47ef20f1f8 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1830,6 +1830,126 @@ static void bch_btree_gc_finish(struct cache_set *c)
mutex_unlock(&c->bucket_lock);
}
+void bcache_add_preflush_buf(struct cached_dev *dc, struct bkey *key)
+{
+ struct preflush_bkey *p;
+
+ spin_lock(&dc->preflush_keys.lock);
+
+ p = array_alloc(&dc->preflush_keys.freelist);
+
+ if (!p) {
+ spin_unlock(&dc->preflush_keys.lock);
+ return;
+ }
+
+ bkey_copy(&p->key, key);
+ INIT_LIST_HEAD(&p->list);
+
+ list_add_tail(&p->list, &dc->preflush_keys.allocated_head);
+ dc->preflush_keys.count++;
+
+ spin_unlock(&dc->preflush_keys.lock);
+}
+
+void bcache_mark_preflush_keys_clean(struct cached_dev *dc)
+{
+ struct preflush_bkey *p;
+
+ spin_lock(&dc->preflush_keys.lock);
+
+ /*
+ * Release the reference added to the bucket during the
+ * "write_dirty_finish" period.
+ */
+ list_for_each_entry(p, &dc->preflush_keys.allocated_head, list) {
+ bkey_put(dc->disk.c, &p->key);
+ }
+
+ dc->preflush_keys.count = 0;
+ array_allocator_init(&dc->preflush_keys.freelist);
+ INIT_LIST_HEAD(&dc->preflush_keys.allocated_head);
+
+ spin_unlock(&dc->preflush_keys.lock);
+}
+
+static void gc_preflush_keys_writeback(struct cache_set *c)
+{
+ unsigned int i, j;
+ struct preflush_bkey *p, *tmp;
+ struct keylist keys;
+ struct bio flush;
+ struct bkey_padded {
+ BKEY_PADDED(key);
+ } *copy;
+ unsigned int copied = 0;
+ int ret = 0;
+
+ for (i = 0; i < c->devices_max_used; i++) {
+ struct bcache_device *d = c->devices[i];
+ struct cached_dev *dc;
+
+ copied = 0;
+
+ if (!d || UUID_FLASH_ONLY(&c->uuids[i]))
+ continue;
+ dc = container_of(d, struct cached_dev, disk);
+
+ copy = kmalloc_array(WRITEBACK_FLUSH_INTERVAL_MAX,
+ sizeof(*copy), GFP_KERNEL);
+ if (!copy)
+ return;
+
+ spin_lock(&dc->preflush_keys.lock);
+
+ list_for_each_entry_safe(p, tmp, &dc->preflush_keys.allocated_head, list) {
+ bkey_copy(&copy[copied].key, &p->key);
+ copied++;
+
+ list_del(&p->list);
+ dc->preflush_keys.count--;
+ bch_preflush_buf_del(&dc->preflush_keys, p);
+ }
+
+ spin_unlock(&dc->preflush_keys.lock);
+
+ bio_init(&flush, dc->bdev, NULL, 0, REQ_OP_WRITE | REQ_PREFLUSH);
+ ret = submit_bio_wait(&flush);
+ if (ret) {
+ pr_err("flush backing device error %d.\n", ret);
+
+ /*
+ * In the case of flush failure, do not update the status of bkey
+ * in the btree, and wait until the next time to re-write the dirty
+ * data.
+ */
+ bcache_mark_preflush_keys_clean(dc);
+ kfree(copy);
+ return;
+ }
+
+ for (j = 0; j < copied; j++) {
+ memset(keys.inline_keys, 0, sizeof(keys.inline_keys));
+ bch_keylist_init(&keys);
+
+ bkey_copy(keys.top, &copy[j].key);
+ SET_KEY_DIRTY(keys.top, false);
+ bch_keylist_push(&keys);
+
+ ret = bch_btree_insert(dc->disk.c, &keys, NULL, &copy[j].key);
+
+ if (ret)
+ trace_bcache_writeback_collision(&copy[j].key);
+
+ atomic_long_inc(ret
+ ? &dc->disk.c->writeback_keys_failed
+ : &dc->disk.c->writeback_keys_done);
+ }
+
+ kfree(copy);
+ }
+}
+
static void bch_btree_gc(struct cache_set *c)
{
int ret;
@@ -1844,6 +1964,8 @@ static void bch_btree_gc(struct cache_set *c)
closure_init_stack(&writes);
bch_btree_op_init(&op, SHRT_MAX);
+ gc_preflush_keys_writeback(c);
+
btree_gc_start(c);
/* if CACHE_SET_IO_DISABLE set, gc thread should stop too */
@@ -2749,6 +2871,11 @@ void bch_keybuf_del(struct keybuf *buf, struct keybuf_key *w)
spin_unlock(&buf->lock);
}
+void bch_preflush_buf_del(struct preflush_buf *buf, struct preflush_bkey *p)
+{
+ array_free(&buf->freelist, p);
+}
+
bool bch_keybuf_check_overlapping(struct keybuf *buf, struct bkey *start,
struct bkey *end)
{
@@ -2828,6 +2955,15 @@ void bch_keybuf_init(struct keybuf *buf)
array_allocator_init(&buf->freelist);
}
+void bch_preflush_buf_init(struct preflush_buf *buf)
+{
+ buf->count = 0;
+
+ spin_lock_init(&buf->lock);
+ array_allocator_init(&buf->freelist);
+ INIT_LIST_HEAD(&buf->allocated_head);
+}
+
void bch_btree_exit(void)
{
if (btree_io_wq)
diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h
index 45d64b54115a..59a2ee7c81f7 100644
--- a/drivers/md/bcache/btree.h
+++ b/drivers/md/bcache/btree.h
@@ -414,4 +414,8 @@ struct keybuf_key *bch_keybuf_next_rescan(struct cache_set *c,
struct bkey *end,
keybuf_pred_fn *pred);
void bch_update_bucket_in_use(struct cache_set *c, struct gc_stat *stats);
+void bch_preflush_buf_init(struct preflush_buf *buf);
+void bch_preflush_buf_del(struct preflush_buf *buf, struct preflush_bkey *p);
+void bcache_add_preflush_buf(struct cached_dev *dc, struct bkey *key);
+void bcache_mark_preflush_keys_clean(struct cached_dev *dc);
#endif
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index e8f696cb58c0..69332aa39b4e 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -151,6 +151,12 @@ rw_attribute(copy_gc_enabled);
rw_attribute(idle_max_writeback_rate);
rw_attribute(gc_after_writeback);
rw_attribute(size);
+/*
+ * "flush_interval" is used to specify that a PROFLUSH operation will
+ * be issued once a certain number of dirty bkeys have been written
+ * each time.
+ */
+rw_attribute(flush_interval);
static ssize_t bch_snprint_string_list(char *buf,
size_t size,
@@ -213,6 +219,7 @@ SHOW(__bch_cached_dev)
var_print(writeback_rate_fp_term_mid);
var_print(writeback_rate_fp_term_high);
var_print(writeback_rate_minimum);
+ var_print(flush_interval);
if (attr == &sysfs_writeback_rate_debug) {
char rate[20];
@@ -354,6 +361,9 @@ STORE(__cached_dev)
sysfs_strtoul_clamp(io_error_limit, dc->error_limit, 0, INT_MAX);
+ sysfs_strtoul_clamp(flush_interval, dc->flush_interval,
+ WRITEBACK_FLUSH_INTERVAL_MIN, WRITEBACK_FLUSH_INTERVAL_MAX);
+
if (attr == &sysfs_io_disable) {
int v = strtoul_or_return(buf);
@@ -541,6 +551,7 @@ static struct attribute *bch_cached_dev_attrs[] = {
#endif
&sysfs_backing_dev_name,
&sysfs_backing_dev_uuid,
+ &sysfs_flush_interval,
NULL
};
ATTRIBUTE_GROUPS(bch_cached_dev);
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 453efbbdc8ee..69e2d37ddbef 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -348,37 +348,100 @@ static CLOSURE_CALLBACK(dirty_io_destructor)
kfree(io);
}
-static CLOSURE_CALLBACK(write_dirty_finish)
+static void flush_backing_device(struct cached_dev *dc)
{
- closure_type(io, struct dirty_io, cl);
- struct keybuf_key *w = io->bio.bi_private;
- struct cached_dev *dc = io->dc;
+ int ret;
+ struct keylist keys;
+ struct bio flush;
+ struct preflush_bkey *p;
- bio_free_pages(&io->bio);
+ BKEY_PADDED(key) replace_key;
- /* This is kind of a dumb way of signalling errors. */
- if (KEY_DIRTY(&w->key)) {
- int ret;
- unsigned int i;
- struct keylist keys;
+ spin_lock(&dc->preflush_keys.lock);
+ if (dc->preflush_keys.count == 0) {
+ spin_unlock(&dc->preflush_keys.lock);
+ return;
+ }
+ spin_unlock(&dc->preflush_keys.lock);
+
+ bio_init(&flush, dc->bdev, NULL, 0, REQ_OP_WRITE | REQ_PREFLUSH);
+ ret = submit_bio_wait(&flush);
+ if (ret) {
+ pr_err("flush backing device error %d.\n", ret);
+
+ /*
+ * In the case of flush failure, do not update the status of bkey
+ * in the btree, and wait until the next time to re-write the dirty
+ * data.
+ */
+ bcache_mark_preflush_keys_clean(dc);
+ return;
+ }
+ /*
+ * The dirty data was successfully written back and confirmed to be written
+ * to the disk. The status of the bkey in the btree was updated.
+ */
+ while (1) {
+ memset(keys.inline_keys, 0, sizeof(keys.inline_keys));
bch_keylist_init(&keys);
- bkey_copy(keys.top, &w->key);
+ spin_lock(&dc->preflush_keys.lock);
+ if (!list_empty(&dc->preflush_keys.allocated_head)) {
+ p = list_first_entry(&dc->preflush_keys.allocated_head,
+ struct preflush_bkey,
+ list);
+
+ bkey_copy(keys.top, &(p->key));
+ bkey_copy(&replace_key.key, &(p->key));
+ list_del(&p->list);
+ dc->preflush_keys.count--;
+ bch_preflush_buf_del(&dc->preflush_keys, p);
+ } else {
+ spin_unlock(&dc->preflush_keys.lock);
+ break;
+ }
+ spin_unlock(&dc->preflush_keys.lock);
+
SET_KEY_DIRTY(keys.top, false);
bch_keylist_push(&keys);
- for (i = 0; i < KEY_PTRS(&w->key); i++)
- atomic_inc(&PTR_BUCKET(dc->disk.c, &w->key, i)->pin);
-
- ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key);
+ ret = bch_btree_insert(dc->disk.c, &keys, NULL, &replace_key.key);
if (ret)
- trace_bcache_writeback_collision(&w->key);
+ trace_bcache_writeback_collision(&replace_key.key);
atomic_long_inc(ret
? &dc->disk.c->writeback_keys_failed
: &dc->disk.c->writeback_keys_done);
+
+ /* For those bkeys that failed to be inserted, you can
+ * ignore them and they will be processed again in the
+ * next write-back scan.
+ */
+ }
+}
+
+static CLOSURE_CALLBACK(write_dirty_finish)
+{
+ closure_type(io, struct dirty_io, cl);
+ struct keybuf_key *w = io->bio.bi_private;
+ struct cached_dev *dc = io->dc;
+
+ bio_free_pages(&io->bio);
+
+ /* This is kind of a dumb way of signalling errors. */
+ if (KEY_DIRTY(&w->key)) {
+ unsigned int i;
+
+ /*
+ * Add the bucket reference before inserting the bkey into
+ * the btree.
+ */
+ for (i = 0; i < KEY_PTRS(&w->key); i++)
+ atomic_inc(&PTR_BUCKET(dc->disk.c, &w->key, i)->pin);
+
+ bcache_add_preflush_buf(dc, &w->key);
}
bch_keybuf_del(&dc->writeback_keys, w);
@@ -435,6 +498,7 @@ static CLOSURE_CALLBACK(write_dirty)
if (KEY_DIRTY(&w->key)) {
dirty_init(w);
io->bio.bi_opf = REQ_OP_WRITE;
+
io->bio.bi_iter.bi_sector = KEY_START(&w->key);
bio_set_dev(&io->bio, io->dc->bdev);
io->bio.bi_end_io = dirty_endio;
@@ -741,6 +805,7 @@ static int bch_writeback_thread(void *arg)
struct cached_dev *dc = arg;
struct cache_set *c = dc->disk.c;
bool searched_full_index;
+ unsigned long last_flush_jiffies = jiffies;
bch_ratelimit_reset(&dc->writeback_rate);
@@ -819,9 +884,23 @@ static int bch_writeback_thread(void *arg)
read_dirty(dc);
+ /*
+ * If the accumulated preflush_keys exceed a certain quantity or
+ * the interval time exceeds 30 seconds, issue the PREFLUSH command
+ * once.
+ */
+ if (dc->preflush_keys.count >= (dc->flush_interval - KEYBUF_NR) ||
+ time_after(jiffies, last_flush_jiffies + 30 * HZ)) {
+ flush_backing_device(dc);
+ last_flush_jiffies = jiffies;
+ }
+
if (searched_full_index) {
unsigned int delay = dc->writeback_delay * HZ;
+ /* Clean up the remaining preflush_keys. */
+ flush_backing_device(dc);
+
while (delay &&
!kthread_should_stop() &&
!test_bit(CACHE_SET_IO_DISABLE, &c->flags) &&
@@ -832,6 +911,8 @@ static int bch_writeback_thread(void *arg)
}
}
+ flush_backing_device(dc);
+
if (dc->writeback_write_wq)
destroy_workqueue(dc->writeback_write_wq);
@@ -1053,6 +1134,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
sema_init(&dc->in_flight, 64);
init_rwsem(&dc->writeback_lock);
bch_keybuf_init(&dc->writeback_keys);
+ bch_preflush_buf_init(&dc->preflush_keys);
dc->writeback_metadata = true;
dc->writeback_running = false;
@@ -1068,6 +1150,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
dc->writeback_rate_fp_term_mid = 10;
dc->writeback_rate_fp_term_high = 1000;
dc->writeback_rate_i_term_inverse = 10000;
+ dc->flush_interval = WRITEBACK_FLUSH_INTERVAL_DEFAULT;
/* For dc->writeback_lock contention in update_writeback_rate() */
dc->rate_update_retry = 0;
--
2.18.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re:[PATCH v2] bcache: enhancing the security of dirty data writeback
2025-08-13 2:12 ` [PATCH v2] " Zhou Jifeng
@ 2025-08-13 7:11 ` Zhou Jifeng
0 siblings, 0 replies; 25+ messages in thread
From: Zhou Jifeng @ 2025-08-13 7:11 UTC (permalink / raw)
To: 周继峰, Coly Li
Cc: kent.overstreet, linux-bcache, linux-kernel
On Wed, 13 Aug 2025 at 10:14, Zhou Jifeng <zhoujifeng@kylinos.com.cn> wrote:
>
> There is a potential data consistency risk in bcache's writeback mode:when
> the application calls fsync, bcache returns success after completing the
> log write, persisting the cache disk data, and persisting the HDD internal
> cache. However, at this point, the actual application data may still be in
> a dirty state and remain stuck in the cache disk. when these data are
> subsequently written back to the HDD asynchronously through REQ_OP_WRITE,
> there is no forced refresh mechanism to ensure physical placement on the
> disk, and there may be no power-off protection measures, which poses a risk
> of data loss. This mechanism may cause the application to misjudge that the
> data has been persisted, which is different from the actual storage state,
> and also violates the semantic agreement that fsync should ensure data
> persistence.
>
> This patch aims to enhance the reliability of dirty data writeback through
> PREFLUSH, ensuring that the dirty data mark in the cache device is cleared
> only after the dirty data is written to the disk. Double triggering
> conditions for PREFLUSH:
> 1、When the cumulative number of dirty bkeys written back reaches the
> threshold(Dynamic control parameters:/sys/block/bcache0/bcache/
> flush_interval, the default value is 20000. by increasing the value of this
> parameter, the impact of flush on performance can be reduced.)
> 2、When the interval since the last refresh exceeds 30 seconds
> If any of the conditions are met, the system will send a PREFLUSH command
> to the backend HDD, and clear the corresponding dirty bkey mark only after
> confirming that the PREFLUSH is executed successfully.
>
> Signed-off-by: Zhou Jifeng <zhoujifeng@kylinos.com.cn>
> ---
> v1 -> v2: Make revisions according to the review comments.
>
>
> [snipped]
>
>
> diff --git a/drivers/md/bcache/bcache_ondisk.h b/drivers/md/bcache/bcache_ondisk.h
> index 6620a7f8fffc..df5800838e40 100644
> --- a/drivers/md/bcache/bcache_ondisk.h
> +++ b/drivers/md/bcache/bcache_ondisk.h
> @@ -294,6 +294,10 @@ BITMASK(BDEV_CACHE_MODE, struct cache_sb, flags, 0, 4);
> #define CACHE_MODE_WRITEBACK 1U
> #define CACHE_MODE_WRITEAROUND 2U
> #define CACHE_MODE_NONE 3U
> +BITMASK(BDEV_WRITEBACK_FLUSH, struct cache_sb, flags, 4, 1);
> +#define WRITEBACK_FLUSH_OFF 0U
> +#define WRITEBACK_FLUSH_ON 1U
> +
> BITMASK(BDEV_STATE, struct cache_sb, flags, 61, 2);
> #define BDEV_STATE_NONE 0U
> #define BDEV_STATE_CLEAN 1U
Sorry, my inspection was not thorough enough. This is dead code: it was
left over from an earlier version and I forgot to delete it.
^ permalink raw reply [flat|nested] 25+ messages in thread
Thread overview: 25+ messages
2025-07-31 6:21 [PATCH] bcache: enhancing the security of dirty data writeback Zhou Jifeng
2025-07-31 15:49 ` Kent Overstreet
2025-08-01 2:27 ` Zhou Jifeng
2025-08-01 2:37 ` Kent Overstreet
2025-08-01 3:30 ` Zhou Jifeng
2025-08-01 3:41 ` Kent Overstreet
2025-08-01 6:10 ` Zhou Jifeng
2025-08-02 17:29 ` Coly Li
2025-08-02 18:49 ` Kent Overstreet
2025-08-04 3:47 ` Zhou Jifeng
2025-08-04 4:17 ` Kent Overstreet
2025-08-04 15:31 ` Coly Li
2025-08-04 16:07 ` Kent Overstreet
2025-08-05 1:17 ` Zhou Jifeng
2025-08-05 1:31 ` Kent Overstreet
2025-08-05 3:31 ` Coly Li
2025-08-05 3:34 ` Kent Overstreet
2025-08-05 4:57 ` Coly Li
2025-08-05 9:37 ` Zhou Jifeng
2025-08-05 16:29 ` Coly Li
2025-08-06 11:19 ` Zhou Jifeng
2025-08-06 16:10 ` Coly Li
2025-08-07 2:01 ` Zhou Jifeng
2025-08-13 2:12 ` [PATCH v2] " Zhou Jifeng
2025-08-13 7:11 ` Zhou Jifeng