* [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices
@ 2025-06-05 14:22 Dongsheng Yang
2025-06-05 14:22 ` [RFC PATCH 01/11] dm-pcache: add pcache_internal.h Dongsheng Yang
` (11 more replies)
0 siblings, 12 replies; 18+ messages in thread
From: Dongsheng Yang @ 2025-06-05 14:22 UTC (permalink / raw)
To: mpatocka, agk, snitzer, axboe, hch, dan.j.williams,
Jonathan.Cameron
Cc: linux-block, linux-kernel, linux-cxl, nvdimm, dm-devel,
Dongsheng Yang
Hi Mikulas and all,
This is *RFC v2* of the *pcache* series, a persistent-memory backed cache.
Compared with *RFC v1*
<https://lore.kernel.org/lkml/20250414014505.20477-1-dongsheng.yang@linux.dev/>
the most important change is that the whole cache has been *ported to
the Device-Mapper framework* and is now exposed as a regular DM target.
Code:
https://github.com/DataTravelGuide/linux/tree/dm-pcache
Full RFC v2 test results:
https://datatravelguide.github.io/dtg-blog/pcache/pcache_rfc_v2_result/results.html
All 962 xfstests cases passed successfully under four different
pcache configurations.
One of the detailed xfstests run:
https://datatravelguide.github.io/dtg-blog/pcache/pcache_rfc_v2_result/test-results/02-._pcache.py_PcacheTest.test_run-crc-enable-gc-gc0-test_script-xfstests-a515/debug.log
Below is a quick tour through the three layers of the implementation,
followed by an example invocation.
----------------------------------------------------------------------
1. pmem access layer
----------------------------------------------------------------------
* All reads use *copy_mc_to_kernel()* so that uncorrectable media
errors are detected and reported.
* All writes go through *memcpy_flushcache()* to guarantee durability
on real persistent memory.
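For reference, a minimal sketch of this access pattern (the helper names
below are illustrative only, not taken from the series):

  /* read from pmem, detecting uncorrectable media errors */
  static int example_pmem_read(void *dst, void *pmem_src, size_t len)
  {
          /* copy_mc_to_kernel() returns the number of bytes left uncopied */
          if (copy_mc_to_kernel(dst, pmem_src, len))
                  return -EIO;
          return 0;
  }

  /* write to pmem and make it durable */
  static void example_pmem_write(void *pmem_dst, const void *src, size_t len)
  {
          memcpy_flushcache(pmem_dst, src, len);
          pmem_wmb();     /* order the flush before any later metadata update */
  }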
----------------------------------------------------------------------
2. cache-logic layer (segments / keys / workers)
----------------------------------------------------------------------
Main features
- 16 MiB pmem segments, log-structured allocation.
- Multi-subtree RB-tree index for high parallelism.
- Optional per-entry *CRC32* on cached data.
- Background *write-back* worker and watermark-driven *GC*.
- Crash-safe replay: key-sets are scanned from *key_tail* on start-up.
Current limitations
- Only *write-back* mode implemented.
- Only FIFO cache invalidation; others (LRU, ARC, ...) are planned.
----------------------------------------------------------------------
3. dm-pcache target integration
----------------------------------------------------------------------
* Table line
`pcache <pmem_dev> <origin_dev> writeback <true|false>`
* Features advertised to DM:
- `ti->flush_supported = true`, so *PREFLUSH* and *FUA* are honoured
(they force all open key-sets to close and data to be durable).
* Not yet supported:
- Discard / TRIM.
- dynamic `dmsetup reload`.
Runtime controls
- `dmsetup message <dev> 0 gc_percent <0-90>` adjusts the GC trigger.
Status line reports super-block flags, segment counts, GC threshold and
the three tail/head pointers (see the RST document for details).
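To illustrate the watermark semantics behind `gc_percent`, a conceptual
sketch (not the exact check used in the GC patch):

  /* start GC once the used-segment ratio reaches the configured percentage */
  static bool example_gc_needed(u32 segs_used, u32 segs_total, u32 gc_percent)
  {
          return (u64)segs_used * 100 >= (u64)segs_total * gc_percent;
  }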
----------------------------------------------------------------------
Example
----------------------------------------------------------------------
# 1. choose a pmem device and a backing SSD
pmem=/dev/pmem0
ssd=/dev/sdb
# 2. map a pcache device in front.
dmsetup create pcache_sdb --table \
"0 $(blockdev --getsz $ssd) pcache $pmem $ssd writeback true"
# 3. format and mount
mkfs.ext4 /dev/mapper/pcache_sdb
mount /dev/mapper/pcache_sdb /mnt
# 4. tune GC to 80 %
dmsetup message pcache_sdb 0 gc_percent 80
# 5. monitor
watch -n1 'dmsetup status pcache_sdb'
Testing:
The test suite for pcache is hosted in the dtg-tests project, built
on top of the Avocado Framework. It currently includes:
- Management-related test cases for pcache devices.
- Data verification and validation tests.
- Complete execution of the xfstests suite under multiple
configurations.
Thanx
Dongsheng
Dongsheng Yang (11):
dm-pcache: add pcache_internal.h
dm-pcache: add backing device management
dm-pcache: add cache device
dm-pcache: add segment layer
dm-pcache: add cache_segment
dm-pcache: add cache_writeback
dm-pcache: add cache_gc
dm-pcache: add cache_key
dm-pcache: add cache_req
dm-pcache: add cache core
dm-pcache: initial dm-pcache target
.../admin-guide/device-mapper/dm-pcache.rst | 200 ++++
MAINTAINERS | 9 +
drivers/md/Kconfig | 2 +
drivers/md/Makefile | 1 +
drivers/md/dm-pcache/Kconfig | 17 +
drivers/md/dm-pcache/Makefile | 3 +
drivers/md/dm-pcache/backing_dev.c | 305 ++++++
drivers/md/dm-pcache/backing_dev.h | 84 ++
drivers/md/dm-pcache/cache.c | 443 +++++++++
drivers/md/dm-pcache/cache.h | 601 ++++++++++++
drivers/md/dm-pcache/cache_dev.c | 310 ++++++
drivers/md/dm-pcache/cache_dev.h | 70 ++
drivers/md/dm-pcache/cache_gc.c | 170 ++++
drivers/md/dm-pcache/cache_key.c | 907 ++++++++++++++++++
drivers/md/dm-pcache/cache_req.c | 810 ++++++++++++++++
drivers/md/dm-pcache/cache_segment.c | 300 ++++++
drivers/md/dm-pcache/cache_writeback.c | 239 +++++
drivers/md/dm-pcache/dm_pcache.c | 388 ++++++++
drivers/md/dm-pcache/dm_pcache.h | 61 ++
drivers/md/dm-pcache/pcache_internal.h | 116 +++
drivers/md/dm-pcache/segment.c | 63 ++
drivers/md/dm-pcache/segment.h | 74 ++
22 files changed, 5173 insertions(+)
create mode 100644 Documentation/admin-guide/device-mapper/dm-pcache.rst
create mode 100644 drivers/md/dm-pcache/Kconfig
create mode 100644 drivers/md/dm-pcache/Makefile
create mode 100644 drivers/md/dm-pcache/backing_dev.c
create mode 100644 drivers/md/dm-pcache/backing_dev.h
create mode 100644 drivers/md/dm-pcache/cache.c
create mode 100644 drivers/md/dm-pcache/cache.h
create mode 100644 drivers/md/dm-pcache/cache_dev.c
create mode 100644 drivers/md/dm-pcache/cache_dev.h
create mode 100644 drivers/md/dm-pcache/cache_gc.c
create mode 100644 drivers/md/dm-pcache/cache_key.c
create mode 100644 drivers/md/dm-pcache/cache_req.c
create mode 100644 drivers/md/dm-pcache/cache_segment.c
create mode 100644 drivers/md/dm-pcache/cache_writeback.c
create mode 100644 drivers/md/dm-pcache/dm_pcache.c
create mode 100644 drivers/md/dm-pcache/dm_pcache.h
create mode 100644 drivers/md/dm-pcache/pcache_internal.h
create mode 100644 drivers/md/dm-pcache/segment.c
create mode 100644 drivers/md/dm-pcache/segment.h
--
2.34.1
* [RFC PATCH 01/11] dm-pcache: add pcache_internal.h
2025-06-05 14:22 [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices Dongsheng Yang
@ 2025-06-05 14:22 ` Dongsheng Yang
2025-06-05 14:22 ` [RFC PATCH 02/11] dm-pcache: add backing device management Dongsheng Yang
` (10 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Dongsheng Yang @ 2025-06-05 14:22 UTC (permalink / raw)
To: mpatocka, agk, snitzer, axboe, hch, dan.j.williams,
Jonathan.Cameron
Cc: linux-block, linux-kernel, linux-cxl, nvdimm, dm-devel,
Dongsheng Yang
Consolidate common PCACHE helpers into a new header so that subsequent
patches can include them without repeating boiler-plate.
- Logging macros with unified prefix and location info.
- Common constants (KB/MB helpers, metadata replica count, CRC seed).
- On-disk metadata header definition and CRC helper.
- Sequence-number comparison that handles wrap-around.
- pcache_meta_find_latest() to pick the newest valid metadata copy.
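A hypothetical caller, to show the intended calling convention; the struct
and the 4 KiB replica stride below are assumptions made for this example:

  struct example_meta {
          struct pcache_meta_header header;
          __u64 payload;
  };

  static int example_load_meta(struct pcache_meta_header *replica_base,
                               struct example_meta *out)
  {
          void *latest;

          /* two replicas, each in its own 4 KiB slot */
          latest = pcache_meta_find_latest(replica_base, sizeof(*out),
                                           4096, out);
          if (IS_ERR(latest))
                  return PTR_ERR(latest); /* uncorrectable pmem error */
          if (!latest)
                  return -EIO;            /* no replica with a valid CRC */
          return 0;                       /* *out now holds the newest copy */
  }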
Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
---
drivers/md/dm-pcache/pcache_internal.h | 116 +++++++++++++++++++++++++
1 file changed, 116 insertions(+)
create mode 100644 drivers/md/dm-pcache/pcache_internal.h
diff --git a/drivers/md/dm-pcache/pcache_internal.h b/drivers/md/dm-pcache/pcache_internal.h
new file mode 100644
index 000000000000..389d1fa667c9
--- /dev/null
+++ b/drivers/md/dm-pcache/pcache_internal.h
@@ -0,0 +1,116 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _PCACHE_INTERNAL_H
+#define _PCACHE_INTERNAL_H
+
+#include <linux/delay.h>
+#include <linux/crc32.h>
+
+#define pcache_err(fmt, ...) \
+ pr_err("dm-pcache: %s:%u " fmt, __func__, __LINE__, ##__VA_ARGS__)
+#define pcache_info(fmt, ...) \
+ pr_info("dm-pcache: %s:%u " fmt, __func__, __LINE__, ##__VA_ARGS__)
+#define pcache_debug(fmt, ...) \
+ pr_debug("dm-pcache: %s:%u " fmt, __func__, __LINE__, ##__VA_ARGS__)
+
+#define PCACHE_KB (1024ULL)
+#define PCACHE_MB (1024 * PCACHE_KB)
+
+/* Maximum number of metadata indices */
+#define PCACHE_META_INDEX_MAX 2
+
+#define PCACHE_CRC_SEED 0x3B15A
+/*
+ * struct pcache_meta_header - PCACHE metadata header structure
+ * @crc: CRC checksum for validating metadata integrity.
+ * @seq: Sequence number to track metadata updates.
+ * @version: Metadata version.
+ * @res: Reserved space for future use.
+ */
+struct pcache_meta_header {
+ __u32 crc;
+ __u8 seq;
+ __u8 version;
+ __u16 res;
+};
+
+/*
+ * pcache_meta_crc - Calculate CRC for the given metadata header.
+ * @header: Pointer to the metadata header.
+ * @meta_size: Size of the metadata structure.
+ *
+ * Returns the CRC checksum calculated by excluding the CRC field itself.
+ */
+static inline u32 pcache_meta_crc(struct pcache_meta_header *header, u32 meta_size)
+{
+ return crc32(PCACHE_CRC_SEED, (void *)header + 4, meta_size - 4);
+}
+
+/*
+ * pcache_meta_seq_after - Check if a sequence number is more recent, accounting for overflow.
+ * @seq1: First sequence number.
+ * @seq2: Second sequence number.
+ *
+ * Determines if @seq1 is more recent than @seq2 by calculating the signed
+ * difference between them. This approach allows handling sequence number
+ * overflow correctly because the difference wraps naturally, and any value
+ * greater than zero indicates that @seq1 is "after" @seq2. This method
+ * assumes 8-bit unsigned sequence numbers, where the difference wraps
+ * around if seq1 overflows past seq2.
+ *
+ * Returns:
+ * - true if @seq1 is more recent than @seq2, indicating it comes "after"
+ * - false otherwise.
+ */
+static inline bool pcache_meta_seq_after(u8 seq1, u8 seq2)
+{
+ return (s8)(seq1 - seq2) > 0;
+}
+
+/*
+ * pcache_meta_find_latest - Find the latest valid metadata replica.
+ * @header: Pointer to the first metadata replica.
+ * @meta_size: Size of the metadata structure.
+ * @meta_max_size: Stride between replicas (size of one metadata slot).
+ * @meta_ret: Buffer that receives a copy of the newest valid replica.
+ *
+ * Skips replicas with a bad CRC, copies the newest (highest sequence) one
+ * into @meta_ret and returns its on-media address. Returns NULL if no
+ * valid replica exists, or an ERR_PTR() on an uncorrectable media error.
+ */
+static inline void __must_check *pcache_meta_find_latest(struct pcache_meta_header *header,
+ u32 meta_size, u32 meta_max_size,
+ void *meta_ret)
+{
+ struct pcache_meta_header *meta, *latest = NULL;
+ u32 i, seq_latest = 0;
+ void *meta_addr;
+
+ meta = meta_ret;
+
+ for (i = 0; i < PCACHE_META_INDEX_MAX; i++) {
+ meta_addr = (void *)header + (i * meta_max_size);
+ if (copy_mc_to_kernel(meta, meta_addr, meta_size)) {
+ pcache_err("hardware memory error when copy meta");
+ return ERR_PTR(-EIO);
+ }
+
+ /* Skip if CRC check fails */
+ if (meta->crc != pcache_meta_crc(meta, meta_size))
+ continue;
+
+ /* Update latest if a more recent sequence is found */
+ if (!latest || pcache_meta_seq_after(meta->seq, seq_latest)) {
+ seq_latest = meta->seq;
+ latest = (void *)header + (i * meta_max_size);
+ }
+ }
+
+ if (latest) {
+ if (copy_mc_to_kernel(meta_ret, latest, meta_size)) {
+ pcache_err("hardware memory error");
+ return ERR_PTR(-EIO);
+ }
+ }
+
+ return latest;
+}
+
+#endif /* _PCACHE_INTERNAL_H */
--
2.34.1
* [RFC PATCH 02/11] dm-pcache: add backing device management
2025-06-05 14:22 [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices Dongsheng Yang
2025-06-05 14:22 ` [RFC PATCH 01/11] dm-pcache: add pcache_internal.h Dongsheng Yang
@ 2025-06-05 14:22 ` Dongsheng Yang
2025-06-05 14:22 ` [RFC PATCH 03/11] dm-pcache: add cache device Dongsheng Yang
` (9 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Dongsheng Yang @ 2025-06-05 14:22 UTC (permalink / raw)
To: mpatocka, agk, snitzer, axboe, hch, dan.j.williams,
Jonathan.Cameron
Cc: linux-block, linux-kernel, linux-cxl, nvdimm, dm-devel,
Dongsheng Yang
This patch introduces *backing_dev.{c,h}*, a self-contained layer that
handles all interaction with the *backing block device* where cache
write-back and cache-miss reads are serviced. Isolating this logic
keeps the core dm-pcache code free of low-level bio plumbing.
* Device setup / teardown
- Opens the backing device with `dm_get_device()`, records its size,
and sets up a dedicated `kmem_cache` for request objects.
- Gracefully releases resources via `backing_dev_stop()`.
* Request object (`struct pcache_backing_dev_req`)
- Two request flavours:
- REQ-type – cloned from an upper `struct bio` issued to
dm-pcache; trimmed and re-targeted to the backing LBA.
- KMEM-type – maps an arbitrary kernel memory buffer into a
freshly built bio (see the sketch after this list).
- Private completion callback (`end_req`) propagates status to the
upper layer and handles resource recycling.
* Submission & completion path
- A lock-protected submit list plus a worker (`req_submit_work`) let
pcache push many requests asynchronously and allow callers to submit
a backing_dev_req from atomic context.
- End-io handler moves finished requests to a completion list processed
by `req_complete_work`, ensuring callbacks run in process context.
- Direct-submit option for non-atomic context.
* Flush
- `backing_dev_flush()` issues a flush to persist backing-device data.
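As a sketch of the API added here, a KMEM-type write could be built and
queued as below; the buffer, offset and completion callback are placeholders:

  static int example_kmem_write(struct pcache_backing_dev *backing_dev,
                                void *buf, u32 len, u64 backing_off,
                                backing_req_end_fn_t done)
  {
          struct pcache_backing_dev_req *backing_req;
          struct pcache_backing_dev_req_opts opts = {
                  .type     = BACKING_DEV_REQ_TYPE_KMEM,
                  .gfp_mask = GFP_NOIO,
                  .end_fn   = done,
                  .kmem = {
                          .data        = buf,
                          .opf         = REQ_OP_WRITE,
                          .len         = len,
                          .backing_off = backing_off,
                  },
          };

          backing_req = backing_dev_req_create(backing_dev, &opts);
          if (!backing_req)
                  return -ENOMEM;

          /* direct=false: defer to req_submit_work, safe in atomic context */
          backing_dev_req_submit(backing_req, false);
          return 0;
  }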
Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
---
drivers/md/dm-pcache/backing_dev.c | 305 +++++++++++++++++++++++++++++
drivers/md/dm-pcache/backing_dev.h | 84 ++++++++
2 files changed, 389 insertions(+)
create mode 100644 drivers/md/dm-pcache/backing_dev.c
create mode 100644 drivers/md/dm-pcache/backing_dev.h
diff --git a/drivers/md/dm-pcache/backing_dev.c b/drivers/md/dm-pcache/backing_dev.c
new file mode 100644
index 000000000000..080944cd6c93
--- /dev/null
+++ b/drivers/md/dm-pcache/backing_dev.c
@@ -0,0 +1,305 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#include <linux/blkdev.h>
+
+#include "../dm-core.h"
+#include "pcache_internal.h"
+#include "cache_dev.h"
+#include "backing_dev.h"
+#include "cache.h"
+#include "dm_pcache.h"
+
+static void backing_dev_exit(struct pcache_backing_dev *backing_dev)
+{
+ kmem_cache_destroy(backing_dev->backing_req_cache);
+}
+
+static void req_submit_fn(struct work_struct *work);
+static void req_complete_fn(struct work_struct *work);
+static int backing_dev_init(struct dm_pcache *pcache)
+{
+ struct pcache_backing_dev *backing_dev = &pcache->backing_dev;
+ int ret;
+
+ backing_dev->backing_req_cache = KMEM_CACHE(pcache_backing_dev_req, 0);
+ if (!backing_dev->backing_req_cache) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ INIT_LIST_HEAD(&backing_dev->submit_list);
+ INIT_LIST_HEAD(&backing_dev->complete_list);
+ spin_lock_init(&backing_dev->submit_lock);
+ spin_lock_init(&backing_dev->complete_lock);
+ INIT_WORK(&backing_dev->req_submit_work, req_submit_fn);
+ INIT_WORK(&backing_dev->req_complete_work, req_complete_fn);
+
+ return 0;
+err:
+ return ret;
+}
+
+static int backing_dev_open(struct pcache_backing_dev *backing_dev, const char *path)
+{
+ struct dm_pcache *pcache = BACKING_DEV_TO_PCACHE(backing_dev);
+ int ret;
+
+ ret = dm_get_device(pcache->ti, path,
+ BLK_OPEN_READ | BLK_OPEN_WRITE, &backing_dev->dm_dev);
+ if (ret) {
+ pcache_dev_err(pcache, "failed to open dm_dev: %s: %d", path, ret);
+ goto err;
+ }
+ backing_dev->dev_size = bdev_nr_sectors(backing_dev->dm_dev->bdev);
+
+ return 0;
+err:
+ return ret;
+}
+
+static void backing_dev_close(struct pcache_backing_dev *backing_dev)
+{
+ struct dm_pcache *pcache = BACKING_DEV_TO_PCACHE(backing_dev);
+
+ dm_put_device(pcache->ti, backing_dev->dm_dev);
+}
+
+int backing_dev_start(struct dm_pcache *pcache, const char *backing_dev_path)
+{
+ struct pcache_backing_dev *backing_dev = &pcache->backing_dev;
+ int ret;
+
+ ret = backing_dev_init(pcache);
+ if (ret)
+ goto err;
+
+ ret = backing_dev_open(backing_dev, backing_dev_path);
+ if (ret)
+ goto destroy_backing_dev;
+
+ return 0;
+
+destroy_backing_dev:
+ backing_dev_exit(backing_dev);
+err:
+ return ret;
+}
+
+void backing_dev_stop(struct dm_pcache *pcache)
+{
+ struct pcache_backing_dev *backing_dev = &pcache->backing_dev;
+
+ flush_work(&backing_dev->req_submit_work);
+ flush_work(&backing_dev->req_complete_work);
+
+ /* There should be no inflight backing_dev_request */
+ BUG_ON(!list_empty(&backing_dev->submit_list));
+ BUG_ON(!list_empty(&backing_dev->complete_list));
+
+ backing_dev_close(backing_dev);
+ backing_dev_exit(backing_dev);
+}
+
+/* pcache_backing_dev_req functions */
+void backing_dev_req_end(struct pcache_backing_dev_req *backing_req)
+{
+ struct pcache_backing_dev *backing_dev = backing_req->backing_dev;
+
+ if (backing_req->end_req)
+ backing_req->end_req(backing_req, backing_req->ret);
+
+ switch (backing_req->type) {
+ case BACKING_DEV_REQ_TYPE_REQ:
+ pcache_req_put(backing_req->req.upper_req, backing_req->ret);
+ break;
+ case BACKING_DEV_REQ_TYPE_KMEM:
+ kfree(backing_req->kmem.bvecs);
+ break;
+ default:
+ BUG();
+ }
+
+ kmem_cache_free(backing_dev->backing_req_cache, backing_req);
+}
+
+static void req_complete_fn(struct work_struct *work)
+{
+ struct pcache_backing_dev *backing_dev = container_of(work, struct pcache_backing_dev, req_complete_work);
+ struct pcache_backing_dev_req *backing_req;
+ unsigned long flags;
+ LIST_HEAD(tmp_list);
+
+ spin_lock_irqsave(&backing_dev->complete_lock, flags);
+ list_splice_init(&backing_dev->complete_list, &tmp_list);
+ spin_unlock_irqrestore(&backing_dev->complete_lock, flags);
+
+ while (!list_empty(&tmp_list)) {
+ backing_req = list_first_entry(&tmp_list,
+ struct pcache_backing_dev_req, node);
+ list_del_init(&backing_req->node);
+ backing_dev_req_end(backing_req);
+ }
+}
+
+static void backing_dev_bio_end(struct bio *bio)
+{
+ struct pcache_backing_dev_req *backing_req = bio->bi_private;
+ struct pcache_backing_dev *backing_dev = backing_req->backing_dev;
+
+ backing_req->ret = bio->bi_status;
+
+ spin_lock(&backing_dev->complete_lock);
+ list_move_tail(&backing_req->node, &backing_dev->complete_list);
+ spin_unlock(&backing_dev->complete_lock);
+
+ queue_work(BACKING_DEV_TO_PCACHE(backing_dev)->task_wq, &backing_dev->req_complete_work);
+}
+
+static void req_submit_fn(struct work_struct *work)
+{
+ struct pcache_backing_dev *backing_dev = container_of(work, struct pcache_backing_dev, req_submit_work);
+ struct pcache_backing_dev_req *backing_req;
+ LIST_HEAD(tmp_list);
+
+ spin_lock(&backing_dev->submit_lock);
+ list_splice_init(&backing_dev->submit_list, &tmp_list);
+ spin_unlock(&backing_dev->submit_lock);
+
+ while (!list_empty(&tmp_list)) {
+ backing_req = list_first_entry(&tmp_list,
+ struct pcache_backing_dev_req, node);
+ list_del_init(&backing_req->node);
+ submit_bio_noacct(&backing_req->bio);
+ }
+}
+
+void backing_dev_req_submit(struct pcache_backing_dev_req *backing_req, bool direct)
+{
+ struct pcache_backing_dev *backing_dev = backing_req->backing_dev;
+
+ if (direct) {
+ submit_bio_noacct(&backing_req->bio);
+ return;
+ }
+
+ spin_lock(&backing_dev->submit_lock);
+ list_add_tail(&backing_req->node, &backing_dev->submit_list);
+ spin_unlock(&backing_dev->submit_lock);
+
+ queue_work(BACKING_DEV_TO_PCACHE(backing_dev)->task_wq, &backing_dev->req_submit_work);
+}
+
+static struct pcache_backing_dev_req *req_type_req_create(struct pcache_backing_dev *backing_dev,
+ struct pcache_backing_dev_req_opts *opts)
+{
+ struct pcache_request *pcache_req = opts->req.upper_req;
+ struct pcache_backing_dev_req *backing_req;
+ struct bio *clone, *orig = pcache_req->bio;
+ u32 off = opts->req.req_off;
+ u32 len = opts->req.len;
+ int ret;
+
+ backing_req = kmem_cache_zalloc(backing_dev->backing_req_cache, opts->gfp_mask);
+ if (!backing_req)
+ return NULL;
+
+ ret = bio_init_clone(backing_dev->dm_dev->bdev, &backing_req->bio, orig, opts->gfp_mask);
+ if (ret)
+ goto err_free_req;
+
+ backing_req->type = BACKING_DEV_REQ_TYPE_REQ;
+
+ clone = &backing_req->bio;
+ BUG_ON(off & SECTOR_MASK);
+ BUG_ON(len & SECTOR_MASK);
+ bio_trim(clone, off >> SECTOR_SHIFT, len >> SECTOR_SHIFT);
+
+ clone->bi_iter.bi_sector = (pcache_req->off + off) >> SECTOR_SHIFT;
+ clone->bi_private = backing_req;
+ clone->bi_end_io = backing_dev_bio_end;
+
+ backing_req->backing_dev = backing_dev;
+ INIT_LIST_HEAD(&backing_req->node);
+ backing_req->end_req = opts->end_fn;
+
+ pcache_req_get(pcache_req);
+ backing_req->req.upper_req = pcache_req;
+ backing_req->req.bio_off = off;
+
+ return backing_req;
+
+err_free_req:
+ kmem_cache_free(backing_dev->backing_req_cache, backing_req);
+ return NULL;
+}
+
+static void bio_map(struct bio *bio, void *base, size_t size)
+{
+ if (is_vmalloc_addr(base))
+ flush_kernel_vmap_range(base, size);
+
+ while (size) {
+ struct page *page = is_vmalloc_addr(base)
+ ? vmalloc_to_page(base)
+ : virt_to_page(base);
+ unsigned int offset = offset_in_page(base);
+ unsigned int len = min_t(size_t, PAGE_SIZE - offset, size);
+
+ BUG_ON(!bio_add_page(bio, page, len, offset));
+ size -= len;
+ base += len;
+ }
+}
+
+static struct pcache_backing_dev_req *kmem_type_req_create(struct pcache_backing_dev *backing_dev,
+ struct pcache_backing_dev_req_opts *opts)
+{
+ struct pcache_backing_dev_req *backing_req;
+ struct bio *backing_bio;
+ u32 n_vecs = DIV_ROUND_UP(opts->kmem.len, PAGE_SIZE);
+
+ backing_req = kmem_cache_zalloc(backing_dev->backing_req_cache, opts->gfp_mask);
+ if (!backing_req)
+ return NULL;
+
+ backing_req->kmem.bvecs = kcalloc(n_vecs, sizeof(struct bio_vec), opts->gfp_mask);
+ if (!backing_req->kmem.bvecs)
+ goto err_free_req;
+
+ backing_req->type = BACKING_DEV_REQ_TYPE_KMEM;
+
+ bio_init(&backing_req->bio, backing_dev->dm_dev->bdev, backing_req->kmem.bvecs,
+ n_vecs, opts->kmem.opf);
+
+ backing_bio = &backing_req->bio;
+ bio_map(backing_bio, opts->kmem.data, opts->kmem.len);
+
+ backing_bio->bi_iter.bi_sector = (opts->kmem.backing_off) >> SECTOR_SHIFT;
+ backing_bio->bi_private = backing_req;
+ backing_bio->bi_end_io = backing_dev_bio_end;
+
+ backing_req->backing_dev = backing_dev;
+ INIT_LIST_HEAD(&backing_req->node);
+ backing_req->end_req = opts->end_fn;
+
+ return backing_req;
+
+err_free_req:
+ kmem_cache_free(backing_dev->backing_req_cache, backing_req);
+ return NULL;
+}
+
+struct pcache_backing_dev_req *backing_dev_req_create(struct pcache_backing_dev *backing_dev,
+ struct pcache_backing_dev_req_opts *opts)
+{
+ if (opts->type == BACKING_DEV_REQ_TYPE_REQ)
+ return req_type_req_create(backing_dev, opts);
+ else if (opts->type == BACKING_DEV_REQ_TYPE_KMEM)
+ return kmem_type_req_create(backing_dev, opts);
+
+ return NULL;
+}
+
+void backing_dev_flush(struct pcache_backing_dev *backing_dev)
+{
+ blkdev_issue_flush(backing_dev->dm_dev->bdev);
+}
diff --git a/drivers/md/dm-pcache/backing_dev.h b/drivers/md/dm-pcache/backing_dev.h
new file mode 100644
index 000000000000..935fdd88ef6e
--- /dev/null
+++ b/drivers/md/dm-pcache/backing_dev.h
@@ -0,0 +1,84 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _BACKING_DEV_H
+#define _BACKING_DEV_H
+
+#include <linux/device-mapper.h>
+
+#include "pcache_internal.h"
+
+struct pcache_backing_dev_req;
+typedef void (*backing_req_end_fn_t)(struct pcache_backing_dev_req *backing_req, int ret);
+
+#define BACKING_DEV_REQ_TYPE_REQ 1
+#define BACKING_DEV_REQ_TYPE_KMEM 2
+
+struct pcache_request;
+struct pcache_backing_dev_req {
+ u8 type;
+ struct bio bio;
+ struct pcache_backing_dev *backing_dev;
+
+ void *priv_data;
+ backing_req_end_fn_t end_req;
+
+ struct list_head node;
+ int ret;
+
+ union {
+ struct {
+ struct pcache_request *upper_req;
+ u32 bio_off;
+ } req;
+ struct {
+ struct bio_vec *bvecs;
+ } kmem;
+ };
+};
+
+struct pcache_backing_dev {
+ struct pcache_cache *cache;
+
+ struct dm_dev *dm_dev;
+ struct kmem_cache *backing_req_cache;
+
+ struct list_head submit_list;
+ spinlock_t submit_lock;
+ struct work_struct req_submit_work;
+
+ struct list_head complete_list;
+ spinlock_t complete_lock;
+ struct work_struct req_complete_work;
+
+ u64 dev_size;
+};
+
+struct dm_pcache;
+int backing_dev_start(struct dm_pcache *pcache, const char *backing_dev_path);
+void backing_dev_stop(struct dm_pcache *pcache);
+
+struct pcache_backing_dev_req_opts {
+ u32 type;
+ union {
+ struct {
+ struct pcache_request *upper_req;
+ u32 req_off;
+ u32 len;
+ } req;
+ struct {
+ void *data;
+ blk_opf_t opf;
+ u32 len;
+ u64 backing_off;
+ } kmem;
+ };
+
+ gfp_t gfp_mask;
+ backing_req_end_fn_t end_fn;
+};
+
+void backing_dev_req_submit(struct pcache_backing_dev_req *backing_req, bool direct);
+void backing_dev_req_end(struct pcache_backing_dev_req *backing_req);
+struct pcache_backing_dev_req *backing_dev_req_create(struct pcache_backing_dev *backing_dev,
+ struct pcache_backing_dev_req_opts *opts);
+void backing_dev_flush(struct pcache_backing_dev *backing_dev);
+#endif /* _BACKING_DEV_H */
--
2.34.1
* [RFC PATCH 03/11] dm-pcache: add cache device
2025-06-05 14:22 [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices Dongsheng Yang
2025-06-05 14:22 ` [RFC PATCH 01/11] dm-pcache: add pcache_internal.h Dongsheng Yang
2025-06-05 14:22 ` [RFC PATCH 02/11] dm-pcache: add backing device management Dongsheng Yang
@ 2025-06-05 14:22 ` Dongsheng Yang
2025-06-05 14:22 ` [RFC PATCH 04/11] dm-pcache: add segment layer Dongsheng Yang
` (8 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Dongsheng Yang @ 2025-06-05 14:22 UTC (permalink / raw)
To: mpatocka, agk, snitzer, axboe, hch, dan.j.williams,
Jonathan.Cameron
Cc: linux-block, linux-kernel, linux-cxl, nvdimm, dm-devel,
Dongsheng Yang
Add cache_dev.{c,h} to manage the persistent-memory device that stores
all pcache metadata and data segments. Splitting this logic out keeps
the main dm-pcache code focused on policy while cache_dev handles the
low-level interaction with the DAX block device.
* DAX mapping
- Opens the underlying device via dm_get_device().
- Uses dax_direct_access() to obtain a direct linear mapping; falls
back to vmap() when the range is fragmented.
* On-disk layout
┌─ 4 KB ─┐ super-block (SB)
├─ 4 KB ─┤ cache_info[0]
├─ 4 KB ─┤ cache_info[1]
├─ 4 KB ─┤ cache_ctrl
└─ ... ─┘ segments
Constants and macros in the header expose these offsets and sizes
(see the address sketch after this list).
* Super-block handling
- sb_read(), sb_validate(), sb_init() verify magic, CRC32 and host
endianness (flag *PCACHE_SB_F_BIGENDIAN*).
- Formatting zeroes the metadata replicas and initialises the segment
bitmap when the SB is blank.
* Segment allocator
- Bitmap protected by seg_lock; find_next_zero_bit() yields the next
free 16 MB segment.
* Lifecycle helpers
- cache_dev_start()/stop() encapsulate init/exit and are invoked by
dm-pcache core.
- Gracefully handles errors: CRC mismatch, wrong endianness, device
too small (< 512 MB), or failed DAX mapping.
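For readers mapping the layout diagram to the macros in cache_dev.h, the
derived offsets are (a sketch, not part of the patch):

  static void *example_addr(struct pcache_cache_dev *cache_dev, u32 n)
  {
          /* CACHE_DEV_SB():           mapping +  4 KB             */
          /* CACHE_DEV_CACHE_INFO():   mapping +  8 KB (2 copies)  */
          /* CACHE_DEV_CACHE_CTRL():   mapping + 16 KB             */
          /* CACHE_DEV_SEGMENT(, n):   mapping + 20 KB + n * 16 MB */
          return CACHE_DEV_SEGMENT(cache_dev, n);
  }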
Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
---
drivers/md/dm-pcache/cache_dev.c | 310 +++++++++++++++++++++++++++++++
drivers/md/dm-pcache/cache_dev.h | 70 +++++++
2 files changed, 380 insertions(+)
create mode 100644 drivers/md/dm-pcache/cache_dev.c
create mode 100644 drivers/md/dm-pcache/cache_dev.h
diff --git a/drivers/md/dm-pcache/cache_dev.c b/drivers/md/dm-pcache/cache_dev.c
new file mode 100644
index 000000000000..8089518fe5c9
--- /dev/null
+++ b/drivers/md/dm-pcache/cache_dev.c
@@ -0,0 +1,310 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include <linux/blkdev.h>
+#include <linux/dax.h>
+#include <linux/vmalloc.h>
+#include <linux/pfn_t.h>
+#include <linux/parser.h>
+
+#include "cache_dev.h"
+#include "backing_dev.h"
+#include "cache.h"
+#include "dm_pcache.h"
+
+static void cache_dev_dax_exit(struct pcache_cache_dev *cache_dev)
+{
+ struct dm_pcache *pcache = CACHE_DEV_TO_PCACHE(cache_dev);
+
+ if (cache_dev->use_vmap)
+ vunmap(cache_dev->mapping);
+
+ dm_put_device(pcache->ti, cache_dev->dm_dev);
+}
+
+static int build_vmap(struct dax_device *dax_dev, long total_pages, void **vaddr)
+{
+ struct page **pages;
+ long i = 0, chunk;
+ pfn_t pfn;
+ int ret;
+
+ pages = vmalloc_array(total_pages, sizeof(struct page *));
+ if (!pages)
+ return -ENOMEM;
+
+ do {
+ chunk = dax_direct_access(dax_dev, i, total_pages - i,
+ DAX_ACCESS, NULL, &pfn);
+ if (chunk <= 0) {
+ ret = chunk ? chunk : -EINVAL;
+ goto out_free;
+ }
+
+ if (!pfn_t_has_page(pfn)) {
+ ret = -EOPNOTSUPP;
+ goto out_free;
+ }
+
+ while (chunk-- && i < total_pages) {
+ pages[i++] = pfn_t_to_page(pfn);
+ pfn.val++;
+ if (!(i & 15))
+ cond_resched();
+ }
+ } while (i < total_pages);
+
+ *vaddr = vmap(pages, total_pages, VM_MAP, PAGE_KERNEL);
+ if (!*vaddr)
+ ret = -ENOMEM;
+out_free:
+ vfree(pages);
+ return ret;
+}
+
+static int cache_dev_dax_init(struct pcache_cache_dev *cache_dev, const char *path)
+{
+ struct dm_pcache *pcache = CACHE_DEV_TO_PCACHE(cache_dev);
+ struct dax_device *dax_dev;
+ long total_pages, mapped_pages;
+ u64 bdev_size;
+ void *vaddr;
+ int ret, id;
+ pfn_t pfn;
+
+ ret = dm_get_device(pcache->ti, path,
+ BLK_OPEN_READ | BLK_OPEN_WRITE, &cache_dev->dm_dev);
+ if (ret) {
+ pcache_dev_err(pcache, "failed to open dm_dev: %s: %d", path, ret);
+ goto err;
+ }
+
+ dax_dev = cache_dev->dm_dev->dax_dev;
+
+ /* total size check */
+ bdev_size = bdev_nr_bytes(cache_dev->dm_dev->bdev);
+ if (!bdev_size) {
+ ret = -ENODEV;
+ pcache_dev_err(pcache, "device %s has zero size\n", path);
+ goto put_dm;
+ }
+
+ total_pages = bdev_size >> PAGE_SHIFT;
+ /* attempt: direct-map the whole range */
+ id = dax_read_lock();
+ mapped_pages = dax_direct_access(dax_dev, 0, total_pages,
+ DAX_ACCESS, &vaddr, &pfn);
+ if (mapped_pages < 0) {
+ pcache_dev_err(pcache, "dax_direct_access failed: %ld\n", mapped_pages);
+ ret = mapped_pages;
+ goto unlock;
+ }
+
+ if (!pfn_t_has_page(pfn)) {
+ ret = -EOPNOTSUPP;
+ goto unlock;
+ }
+
+ if (mapped_pages == total_pages) {
+ /* success: contiguous direct mapping */
+ cache_dev->mapping = vaddr;
+ } else {
+ /* need vmap fallback */
+ ret = build_vmap(dax_dev, total_pages, &vaddr);
+ if (ret) {
+ pcache_dev_err(pcache, "vmap fallback failed: %d\n", ret);
+ goto unlock;
+ }
+
+ cache_dev->mapping = vaddr;
+ cache_dev->use_vmap = true;
+ }
+ dax_read_unlock(id);
+
+ return 0;
+unlock:
+ dax_read_unlock(id);
+put_dm:
+ dm_put_device(pcache->ti, cache_dev->dm_dev);
+err:
+ return ret;
+}
+
+void cache_dev_zero_range(struct pcache_cache_dev *cache_dev, void *pos, u32 size)
+{
+ memset(pos, 0, size);
+ dax_flush(cache_dev->dm_dev->dax_dev, pos, size);
+}
+
+static int sb_read(struct pcache_cache_dev *cache_dev, struct pcache_sb *sb)
+{
+ struct pcache_sb *sb_addr = CACHE_DEV_SB(cache_dev);
+
+ if (copy_mc_to_kernel(sb, sb_addr, sizeof(struct pcache_sb)))
+ return -EIO;
+
+ return 0;
+}
+
+static void sb_write(struct pcache_cache_dev *cache_dev, struct pcache_sb *sb)
+{
+ struct pcache_sb *sb_addr = CACHE_DEV_SB(cache_dev);
+
+ memcpy_flushcache(sb_addr, sb, sizeof(struct pcache_sb));
+ pmem_wmb();
+}
+
+static int sb_init(struct pcache_cache_dev *cache_dev, struct pcache_sb *sb)
+{
+ struct dm_pcache *pcache = CACHE_DEV_TO_PCACHE(cache_dev);
+ u64 nr_segs;
+ u64 cache_dev_size;
+ u64 magic;
+ u32 flags = 0;
+
+ magic = le64_to_cpu(sb->magic);
+ if (magic)
+ return -EEXIST;
+
+ cache_dev_size = bdev_nr_bytes(file_bdev(cache_dev->dm_dev->bdev_file));
+ if (cache_dev_size < PCACHE_CACHE_DEV_SIZE_MIN) {
+ pcache_dev_err(pcache, "dax device is too small, required at least %llu",
+ PCACHE_CACHE_DEV_SIZE_MIN);
+ return -ENOSPC;
+ }
+
+ nr_segs = (cache_dev_size - PCACHE_SEGMENTS_OFF) / ((PCACHE_SEG_SIZE));
+
+#if defined(__BYTE_ORDER) ? (__BIG_ENDIAN == __BYTE_ORDER) : defined(__BIG_ENDIAN)
+ flags |= PCACHE_SB_F_BIGENDIAN;
+#endif
+ sb->flags = cpu_to_le32(flags);
+ sb->magic = cpu_to_le64(PCACHE_MAGIC);
+ sb->seg_num = cpu_to_le32(nr_segs);
+ sb->crc = cpu_to_le32(crc32(PCACHE_CRC_SEED, (void *)(sb) + 4, sizeof(struct pcache_sb) - 4));
+
+ cache_dev_zero_range(cache_dev, CACHE_DEV_CACHE_INFO(cache_dev),
+ PCACHE_CACHE_INFO_SIZE * PCACHE_META_INDEX_MAX +
+ PCACHE_CACHE_CTRL_SIZE);
+
+ return 0;
+}
+
+static int sb_validate(struct pcache_cache_dev *cache_dev, struct pcache_sb *sb)
+{
+ struct dm_pcache *pcache = CACHE_DEV_TO_PCACHE(cache_dev);
+ u32 flags;
+ u32 crc;
+
+ if (le64_to_cpu(sb->magic) != PCACHE_MAGIC) {
+ pcache_dev_err(pcache, "unexpected magic: %llx\n",
+ le64_to_cpu(sb->magic));
+ return -EINVAL;
+ }
+
+ crc = crc32(PCACHE_CRC_SEED, (void *)(sb) + 4, sizeof(struct pcache_sb) - 4);
+ if (crc != le32_to_cpu(sb->crc)) {
+ pcache_dev_err(pcache, "corrupted sb: %u, expected: %u\n", crc, le32_to_cpu(sb->crc));
+ return -EINVAL;
+ }
+
+ flags = le32_to_cpu(sb->flags);
+#if defined(__BYTE_ORDER) ? (__BIG_ENDIAN == __BYTE_ORDER) : defined(__BIG_ENDIAN)
+ if (!(flags & PCACHE_SB_F_BIGENDIAN)) {
+ pcache_dev_err(pcache, "cache_dev is not big endian\n");
+ return -EINVAL;
+ }
+#else
+ if (flags & PCACHE_SB_F_BIGENDIAN) {
+ pcache_dev_err(pcache, "cache_dev is big endian\n");
+ return -EINVAL;
+ }
+#endif
+ return 0;
+}
+
+static int cache_dev_init(struct pcache_cache_dev *cache_dev, u32 seg_num)
+{
+ cache_dev->seg_num = seg_num;
+ cache_dev->seg_bitmap = bitmap_zalloc(cache_dev->seg_num, GFP_KERNEL);
+ if (!cache_dev->seg_bitmap)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static void cache_dev_exit(struct pcache_cache_dev *cache_dev)
+{
+ bitmap_free(cache_dev->seg_bitmap);
+}
+
+void cache_dev_stop(struct dm_pcache *pcache)
+{
+ struct pcache_cache_dev *cache_dev = &pcache->cache_dev;
+
+ cache_dev_exit(cache_dev);
+ cache_dev_dax_exit(cache_dev);
+}
+
+int cache_dev_start(struct dm_pcache *pcache, const char *cache_dev_path)
+{
+ struct pcache_cache_dev *cache_dev = &pcache->cache_dev;
+ struct pcache_sb sb;
+ bool format = false;
+ int ret;
+
+ mutex_init(&cache_dev->seg_lock);
+
+ ret = cache_dev_dax_init(cache_dev, cache_dev_path);
+ if (ret) {
+ pcache_dev_err(pcache, "failed to init cache_dev via dax way: %d.", ret);
+ goto err;
+ }
+
+ ret = sb_read(cache_dev, &sb);
+ if (ret)
+ goto dax_release;
+
+ if (le64_to_cpu(sb.magic) == 0) {
+ format = true;
+ ret = sb_init(cache_dev, &sb);
+ if (ret < 0)
+ goto dax_release;
+ }
+
+ ret = sb_validate(cache_dev, &sb);
+ if (ret)
+ goto dax_release;
+
+ cache_dev->sb_flags = le32_to_cpu(sb.flags);
+ ret = cache_dev_init(cache_dev, sb.seg_num);
+ if (ret)
+ goto dax_release;
+
+ if (format)
+ sb_write(cache_dev, &sb);
+
+ return 0;
+
+dax_release:
+ cache_dev_dax_exit(cache_dev);
+err:
+ return ret;
+}
+
+int cache_dev_get_empty_segment_id(struct pcache_cache_dev *cache_dev, u32 *seg_id)
+{
+ int ret;
+
+ mutex_lock(&cache_dev->seg_lock);
+ *seg_id = find_next_zero_bit(cache_dev->seg_bitmap, cache_dev->seg_num, 0);
+ if (*seg_id == cache_dev->seg_num) {
+ ret = -ENOSPC;
+ goto unlock;
+ }
+
+ set_bit(*seg_id, cache_dev->seg_bitmap);
+ ret = 0;
+unlock:
+ mutex_unlock(&cache_dev->seg_lock);
+ return ret;
+}
diff --git a/drivers/md/dm-pcache/cache_dev.h b/drivers/md/dm-pcache/cache_dev.h
new file mode 100644
index 000000000000..3b5249f7128e
--- /dev/null
+++ b/drivers/md/dm-pcache/cache_dev.h
@@ -0,0 +1,70 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _PCACHE_CACHE_DEV_H
+#define _PCACHE_CACHE_DEV_H
+
+#include <linux/device.h>
+#include <linux/device-mapper.h>
+
+#include "pcache_internal.h"
+
+#define PCACHE_MAGIC 0x65B05EFA96C596EFULL
+
+#define PCACHE_SB_OFF (4 * PCACHE_KB)
+#define PCACHE_SB_SIZE (4 * PCACHE_KB)
+
+#define PCACHE_CACHE_INFO_OFF (PCACHE_SB_OFF + PCACHE_SB_SIZE)
+#define PCACHE_CACHE_INFO_SIZE (4 * PCACHE_KB)
+
+#define PCACHE_CACHE_CTRL_OFF (PCACHE_CACHE_INFO_OFF + (PCACHE_CACHE_INFO_SIZE * PCACHE_META_INDEX_MAX))
+#define PCACHE_CACHE_CTRL_SIZE (4 * PCACHE_KB)
+
+#define PCACHE_SEGMENTS_OFF (PCACHE_CACHE_CTRL_OFF + PCACHE_CACHE_CTRL_SIZE)
+#define PCACHE_SEG_INFO_SIZE (4 * PCACHE_KB)
+
+#define PCACHE_CACHE_DEV_SIZE_MIN (512 * PCACHE_MB) /* 512 MB */
+#define PCACHE_SEG_SIZE (16 * PCACHE_MB) /* Size of each PCACHE segment (16 MB) */
+
+#define CACHE_DEV_SB(cache_dev) ((struct pcache_sb *)(cache_dev->mapping + PCACHE_SB_OFF))
+#define CACHE_DEV_CACHE_INFO(cache_dev) ((void *)cache_dev->mapping + PCACHE_CACHE_INFO_OFF)
+#define CACHE_DEV_CACHE_CTRL(cache_dev) ((void *)cache_dev->mapping + PCACHE_CACHE_CTRL_OFF)
+#define CACHE_DEV_SEGMENTS(cache_dev) ((void *)cache_dev->mapping + PCACHE_SEGMENTS_OFF)
+#define CACHE_DEV_SEGMENT(cache_dev, id) ((void *)CACHE_DEV_SEGMENTS(cache_dev) + (u64)id * PCACHE_SEG_SIZE)
+
+/*
+ * PCACHE SB flags configured during formatting
+ *
+ * The PCACHE_SB_F_xxx flags define registration requirements based on cache_dev
+ * formatting. For a machine to register a cache_dev:
+ * - PCACHE_SB_F_BIGENDIAN: Requires a big-endian machine.
+ */
+#define PCACHE_SB_F_BIGENDIAN BIT(0)
+
+struct pcache_sb {
+ __le32 crc;
+ __le32 flags;
+ __le64 magic;
+
+ __le32 seg_num;
+};
+
+struct pcache_cache_dev {
+ u32 sb_flags;
+ u32 seg_num;
+ void *mapping;
+ bool use_vmap;
+
+ struct dm_dev *dm_dev;
+
+ struct mutex seg_lock;
+ unsigned long *seg_bitmap;
+};
+
+struct dm_pcache;
+int cache_dev_start(struct dm_pcache *pcache, const char *cache_dev_path);
+void cache_dev_stop(struct dm_pcache *pcache);
+
+void cache_dev_zero_range(struct pcache_cache_dev *cache_dev, void *pos, u32 size);
+
+int cache_dev_get_empty_segment_id(struct pcache_cache_dev *cache_dev, u32 *seg_id);
+
+#endif /* _PCACHE_CACHE_DEV_H */
--
2.34.1
* [RFC PATCH 04/11] dm-pcache: add segment layer
2025-06-05 14:22 [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices Dongsheng Yang
` (2 preceding siblings ...)
2025-06-05 14:22 ` [RFC PATCH 03/11] dm-pcache: add cache device Dongsheng Yang
@ 2025-06-05 14:22 ` Dongsheng Yang
2025-06-05 14:23 ` [RFC PATCH 05/11] dm-pcache: add cache_segment Dongsheng Yang
` (7 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Dongsheng Yang @ 2025-06-05 14:22 UTC (permalink / raw)
To: mpatocka, agk, snitzer, axboe, hch, dan.j.williams,
Jonathan.Cameron
Cc: linux-block, linux-kernel, linux-cxl, nvdimm, dm-devel,
Dongsheng Yang
Introduce segment.{c,h}, an internal abstraction that encapsulates
everything related to a single pcache *segment* (the fixed-size
allocation unit stored on the cache-device).
* On-disk metadata (`struct pcache_segment_info`)
- Embedded `struct pcache_meta_header` for CRC/sequence handling.
- `flags` field encodes a “has-next” bit and a 4-bit *type* class
(`CACHE_DATA` added as the first type).
* Initialisation
- `pcache_segment_init()` populates the in-memory
`struct pcache_segment` from a given segment id, data offset and
metadata pointer, computing the usable `data_size` and virtual
address within the DAX mapping.
* IO helpers
- `segment_copy_to_bio()` / `segment_copy_from_bio()` move data
between pmem and a bio, using `_copy_mc_to_iter()` and
`_copy_from_iter_flushcache()` to tolerate hw memory errors and
ensure durability.
- `segment_pos_advance()` advances an internal offset while staying
inside the segment’s data area.
These helpers allow upper layers (cache key management, write-back
logic, GC, etc.) to treat a segment as a contiguous byte array without
knowing about DAX mappings or persistence details.
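A sketch of how an upper layer might append one key's payload with these
helpers (everything except the two helpers is illustrative):

  static int example_append(struct pcache_segment *segment,
                            struct pcache_segment_pos *pos,
                            struct bio *bio, u32 bio_off, u32 len)
  {
          int ret;

          /* assumes pos->segment == segment and enough space remains */
          ret = segment_copy_from_bio(segment, pos->off, len, bio, bio_off);
          if (ret)
                  return ret;             /* -EIO on a short or failed copy */

          /* log-structured: simply advance the append position */
          segment_pos_advance(pos, len);
          return 0;
  }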
Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
---
drivers/md/dm-pcache/segment.c | 63 +++++++++++++++++++++++++++++
drivers/md/dm-pcache/segment.h | 74 ++++++++++++++++++++++++++++++++++
2 files changed, 137 insertions(+)
create mode 100644 drivers/md/dm-pcache/segment.c
create mode 100644 drivers/md/dm-pcache/segment.h
diff --git a/drivers/md/dm-pcache/segment.c b/drivers/md/dm-pcache/segment.c
new file mode 100644
index 000000000000..0597e2bb053b
--- /dev/null
+++ b/drivers/md/dm-pcache/segment.c
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#include <linux/dax.h>
+
+#include "pcache_internal.h"
+#include "cache_dev.h"
+#include "segment.h"
+
+int segment_copy_to_bio(struct pcache_segment *segment,
+ u32 data_off, u32 data_len, struct bio *bio, u32 bio_off)
+{
+ struct iov_iter iter;
+ size_t copied;
+ void *src;
+
+ iov_iter_bvec(&iter, ITER_DEST, &bio->bi_io_vec[bio->bi_iter.bi_idx],
+ bio_segments(bio), bio->bi_iter.bi_size);
+ iter.iov_offset = bio->bi_iter.bi_bvec_done;
+ if (bio_off)
+ iov_iter_advance(&iter, bio_off);
+
+ src = segment->data + data_off;
+ copied = _copy_mc_to_iter(src, data_len, &iter);
+ if (copied != data_len)
+ return -EIO;
+
+ return 0;
+}
+
+int segment_copy_from_bio(struct pcache_segment *segment,
+ u32 data_off, u32 data_len, struct bio *bio, u32 bio_off)
+{
+ struct iov_iter iter;
+ size_t copied;
+ void *dst;
+
+ iov_iter_bvec(&iter, ITER_SOURCE, &bio->bi_io_vec[bio->bi_iter.bi_idx],
+ bio_segments(bio), bio->bi_iter.bi_size);
+ iter.iov_offset = bio->bi_iter.bi_bvec_done;
+ if (bio_off)
+ iov_iter_advance(&iter, bio_off);
+
+ dst = segment->data + data_off;
+ copied = _copy_from_iter_flushcache(dst, data_len, &iter);
+ if (copied != data_len)
+ return -EIO;
+ pmem_wmb();
+
+ return 0;
+}
+
+void pcache_segment_init(struct pcache_cache_dev *cache_dev, struct pcache_segment *segment,
+ struct pcache_segment_init_options *options)
+{
+ segment->seg_info = options->seg_info;
+
+ segment_info_set_type(segment->seg_info, options->type);
+ segment->seg_info->seg_id = options->seg_id;
+ segment->seg_info->data_off = options->data_off;
+
+ segment->cache_dev = cache_dev;
+ segment->data_size = PCACHE_SEG_SIZE - options->data_off;
+ segment->data = CACHE_DEV_SEGMENT(cache_dev, options->seg_id) + options->data_off;
+}
diff --git a/drivers/md/dm-pcache/segment.h b/drivers/md/dm-pcache/segment.h
new file mode 100644
index 000000000000..6d732709c425
--- /dev/null
+++ b/drivers/md/dm-pcache/segment.h
@@ -0,0 +1,74 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _PCACHE_SEGMENT_H
+#define _PCACHE_SEGMENT_H
+
+#include <linux/bio.h>
+
+#include "pcache_internal.h"
+
+struct pcache_segment_info {
+ struct pcache_meta_header header; /* Metadata header for the segment */
+ __u32 flags;
+ __u32 next_seg;
+ __u32 seg_id;
+ __u32 data_off;
+} __packed;
+
+#define PCACHE_SEG_INFO_FLAGS_HAS_NEXT BIT(0)
+
+#define PCACHE_SEG_INFO_FLAGS_TYPE_MASK GENMASK(4, 1)
+#define PCACHE_SEGMENT_TYPE_CACHE_DATA 1
+
+static inline bool segment_info_has_next(struct pcache_segment_info *seg_info)
+{
+ return (seg_info->flags & PCACHE_SEG_INFO_FLAGS_HAS_NEXT);
+}
+
+static inline void segment_info_set_type(struct pcache_segment_info *seg_info, u8 type)
+{
+ seg_info->flags &= ~PCACHE_SEG_INFO_FLAGS_TYPE_MASK;
+ seg_info->flags |= FIELD_PREP(PCACHE_SEG_INFO_FLAGS_TYPE_MASK, type);
+}
+
+static inline u8 segment_info_get_type(struct pcache_segment_info *seg_info)
+{
+ return FIELD_GET(PCACHE_SEG_INFO_FLAGS_TYPE_MASK, seg_info->flags);
+}
+
+struct pcache_segment_pos {
+ struct pcache_segment *segment; /* Segment associated with the position */
+ u32 off; /* Offset within the segment */
+};
+
+struct pcache_segment_init_options {
+ u8 type;
+ u32 seg_id;
+ u32 data_off;
+
+ struct pcache_segment_info *seg_info;
+};
+
+struct pcache_segment {
+ struct pcache_cache_dev *cache_dev;
+
+ void *data;
+ u32 data_size;
+
+ struct pcache_segment_info *seg_info;
+};
+
+int segment_copy_to_bio(struct pcache_segment *segment,
+ u32 data_off, u32 data_len, struct bio *bio, u32 bio_off);
+int segment_copy_from_bio(struct pcache_segment *segment,
+ u32 data_off, u32 data_len, struct bio *bio, u32 bio_off);
+
+static inline void segment_pos_advance(struct pcache_segment_pos *seg_pos, u32 len)
+{
+ BUG_ON(seg_pos->off + len > seg_pos->segment->data_size);
+
+ seg_pos->off += len;
+}
+
+void pcache_segment_init(struct pcache_cache_dev *cache_dev, struct pcache_segment *segment,
+ struct pcache_segment_init_options *options);
+#endif /* _PCACHE_SEGMENT_H */
--
2.34.1
* [RFC PATCH 05/11] dm-pcache: add cache_segment
2025-06-05 14:22 [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices Dongsheng Yang
` (3 preceding siblings ...)
2025-06-05 14:22 ` [RFC PATCH 04/11] dm-pcache: add segment layer Dongsheng Yang
@ 2025-06-05 14:23 ` Dongsheng Yang
2025-06-05 14:23 ` [RFC PATCH 06/11] dm-pcache: add cache_writeback Dongsheng Yang
` (6 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Dongsheng Yang @ 2025-06-05 14:23 UTC (permalink / raw)
To: mpatocka, agk, snitzer, axboe, hch, dan.j.williams,
Jonathan.Cameron
Cc: linux-block, linux-kernel, linux-cxl, nvdimm, dm-devel,
Dongsheng Yang
Introduce *cache_segment.c*, the in-memory/on-disk glue that lets a
`struct pcache_cache` manage its array of data segments.
* Metadata handling
- Loads the most-recent replica of both the segment-info block
(`struct pcache_segment_info`) and per-segment generation counter
(`struct pcache_cache_seg_gen`) using `pcache_meta_find_latest()`.
- Updates those structures atomically with CRC + sequence rollover,
writing alternately to the two metadata slots inside each segment
(see the sketch after this list).
* Segment initialisation (`cache_seg_init`)
- Builds a `struct pcache_segment` pointing to the segment’s data
area, sets up locks, generation counters, and, when formatting a new
cache, zeroes the on-segment kset header.
* Linked-list of segments
- `cache_seg_set_next_seg()` stores the *next* segment id in
`seg_info->next_seg` and sets the HAS_NEXT flag, allowing a cache to
span multiple segments. This also makes it possible to add other
segment types in the future.
* Runtime life-cycle
- Reference counting (`cache_seg_get/put`) with invalidate-on-last-put
that clears the bitmap slot and schedules cleanup work.
- Generation bump (`cache_seg_gen_increase`) persists a new generation
record whenever the segment is modified.
* Allocator
- `get_cache_segment()` uses a bitmap and per-cache hint to pick the
next free segment, retrying with micro-delays when none are
immediately available.
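The alternating-slot update mentioned in the metadata-handling item above,
reduced to its essence (a sketch; the real code is cache_seg_info_write()):

  static void example_meta_update(void *slot0, u32 stride, u32 *index,
                                  struct pcache_segment_info *info)
  {
          void *dst = slot0 + *index * stride;    /* current replica slot */

          info->header.seq++;
          info->header.crc = pcache_meta_crc(&info->header,
                                             sizeof(struct pcache_segment_info));
          memcpy_flushcache(dst, info, sizeof(struct pcache_segment_info));
          pmem_wmb();

          /* the next update goes to the other slot */
          *index = (*index + 1) % PCACHE_META_INDEX_MAX;
  }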
Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
---
drivers/md/dm-pcache/cache_segment.c | 300 +++++++++++++++++++++++++++
1 file changed, 300 insertions(+)
create mode 100644 drivers/md/dm-pcache/cache_segment.c
diff --git a/drivers/md/dm-pcache/cache_segment.c b/drivers/md/dm-pcache/cache_segment.c
new file mode 100644
index 000000000000..0dc6e73a030b
--- /dev/null
+++ b/drivers/md/dm-pcache/cache_segment.c
@@ -0,0 +1,300 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include "cache_dev.h"
+#include "cache.h"
+#include "backing_dev.h"
+#include "dm_pcache.h"
+
+static inline struct pcache_segment_info *get_seg_info_addr(struct pcache_cache_segment *cache_seg)
+{
+ struct pcache_segment_info *seg_info_addr;
+ u32 seg_id = cache_seg->segment.seg_info->seg_id;
+ void *seg_addr;
+
+ seg_addr = CACHE_DEV_SEGMENT(cache_seg->cache->cache_dev, seg_id);
+ seg_info_addr = seg_addr + PCACHE_SEG_INFO_SIZE * cache_seg->info_index;
+
+ return seg_info_addr;
+}
+
+static void cache_seg_info_write(struct pcache_cache_segment *cache_seg)
+{
+ struct pcache_segment_info *seg_info_addr;
+ struct pcache_segment_info *seg_info = &cache_seg->cache_seg_info;
+
+ mutex_lock(&cache_seg->info_lock);
+ seg_info->header.seq++;
+ seg_info->header.crc = pcache_meta_crc(&seg_info->header, sizeof(struct pcache_segment_info));
+
+ seg_info_addr = get_seg_info_addr(cache_seg);
+ memcpy_flushcache(seg_info_addr, seg_info, sizeof(struct pcache_segment_info));
+ pmem_wmb();
+
+ cache_seg->info_index = (cache_seg->info_index + 1) % PCACHE_META_INDEX_MAX;
+ mutex_unlock(&cache_seg->info_lock);
+}
+
+static int cache_seg_info_load(struct pcache_cache_segment *cache_seg)
+{
+ struct pcache_segment_info *cache_seg_info_addr_base, *cache_seg_info_addr;
+ struct pcache_cache_dev *cache_dev = cache_seg->cache->cache_dev;
+ struct dm_pcache *pcache = CACHE_DEV_TO_PCACHE(cache_dev);
+ u32 seg_id = cache_seg->segment.seg_info->seg_id;
+ int ret = 0;
+
+ cache_seg_info_addr_base = CACHE_DEV_SEGMENT(cache_dev, seg_id);
+
+ mutex_lock(&cache_seg->info_lock);
+ cache_seg_info_addr = pcache_meta_find_latest(&cache_seg_info_addr_base->header,
+ sizeof(struct pcache_segment_info),
+ PCACHE_SEG_INFO_SIZE,
+ &cache_seg->cache_seg_info);
+ if (IS_ERR(cache_seg_info_addr)) {
+ ret = PTR_ERR(cache_seg_info_addr);
+ goto out;
+ } else if (!cache_seg_info_addr) {
+ ret = -EIO;
+ goto out;
+ }
+ cache_seg->info_index = cache_seg_info_addr - cache_seg_info_addr_base;
+out:
+ mutex_unlock(&cache_seg->info_lock);
+
+ if (ret)
+ pcache_dev_err(pcache, "can't read segment info of segment: %u, ret: %d\n",
+ cache_seg->segment.seg_info->seg_id, ret);
+ return ret;
+}
+
+static int cache_seg_ctrl_load(struct pcache_cache_segment *cache_seg)
+{
+ struct pcache_cache_seg_ctrl *cache_seg_ctrl = cache_seg->cache_seg_ctrl;
+ struct pcache_cache_seg_gen cache_seg_gen, *cache_seg_gen_addr;
+ int ret = 0;
+
+ mutex_lock(&cache_seg->ctrl_lock);
+ cache_seg_gen_addr = pcache_meta_find_latest(&cache_seg_ctrl->gen->header,
+ sizeof(struct pcache_cache_seg_gen),
+ sizeof(struct pcache_cache_seg_gen),
+ &cache_seg_gen);
+ if (IS_ERR(cache_seg_gen_addr)) {
+ ret = PTR_ERR(cache_seg_gen_addr);
+ goto out;
+ }
+
+ if (!cache_seg_gen_addr) {
+ cache_seg->gen = 0;
+ cache_seg->gen_seq = 0;
+ cache_seg->gen_index = 0;
+ goto out;
+ }
+
+ cache_seg->gen = cache_seg_gen.gen;
+ cache_seg->gen_seq = cache_seg_gen.header.seq;
+ cache_seg->gen_index = (cache_seg_gen_addr - cache_seg_ctrl->gen);
+out:
+ mutex_unlock(&cache_seg->ctrl_lock);
+
+ return ret;
+}
+
+static inline struct pcache_cache_seg_gen *get_cache_seg_gen_addr(struct pcache_cache_segment *cache_seg)
+{
+ struct pcache_cache_seg_ctrl *cache_seg_ctrl = cache_seg->cache_seg_ctrl;
+
+ return (cache_seg_ctrl->gen + cache_seg->gen_index);
+}
+
+static void cache_seg_ctrl_write(struct pcache_cache_segment *cache_seg)
+{
+ struct pcache_cache_seg_gen cache_seg_gen;
+
+ mutex_lock(&cache_seg->ctrl_lock);
+ cache_seg_gen.gen = cache_seg->gen;
+ cache_seg_gen.header.seq = ++cache_seg->gen_seq;
+ cache_seg_gen.header.crc = pcache_meta_crc(&cache_seg_gen.header,
+ sizeof(struct pcache_cache_seg_gen));
+
+ memcpy_flushcache(get_cache_seg_gen_addr(cache_seg), &cache_seg_gen, sizeof(struct pcache_cache_seg_gen));
+ pmem_wmb();
+
+ cache_seg->gen_index = (cache_seg->gen_index + 1) % PCACHE_META_INDEX_MAX;
+ mutex_unlock(&cache_seg->ctrl_lock);
+}
+
+static void cache_seg_ctrl_init(struct pcache_cache_segment *cache_seg)
+{
+ cache_seg->gen = 0;
+ cache_seg->gen_seq = 0;
+ cache_seg->gen_index = 0;
+ cache_seg_ctrl_write(cache_seg);
+}
+
+static int cache_seg_meta_load(struct pcache_cache_segment *cache_seg)
+{
+ int ret;
+
+ ret = cache_seg_info_load(cache_seg);
+ if (ret)
+ goto err;
+
+ ret = cache_seg_ctrl_load(cache_seg);
+ if (ret)
+ goto err;
+
+ return 0;
+err:
+ return ret;
+}
+
+/**
+ * cache_seg_set_next_seg - Sets the ID of the next segment
+ * @cache_seg: Pointer to the cache segment structure.
+ * @seg_id: The segment ID to set as the next segment.
+ *
+ * A pcache_cache allocates multiple cache segments, which are linked together
+ * through next_seg. When loading a pcache_cache, the first cache segment can
+ * be found using cache->seg_id, which allows access to all the cache segments.
+ */
+void cache_seg_set_next_seg(struct pcache_cache_segment *cache_seg, u32 seg_id)
+{
+ cache_seg->cache_seg_info.flags |= PCACHE_SEG_INFO_FLAGS_HAS_NEXT;
+ cache_seg->cache_seg_info.next_seg = seg_id;
+ cache_seg_info_write(cache_seg);
+}
+
+int cache_seg_init(struct pcache_cache *cache, u32 seg_id, u32 cache_seg_id,
+ bool new_cache)
+{
+ struct pcache_cache_dev *cache_dev = cache->cache_dev;
+ struct pcache_cache_segment *cache_seg = &cache->segments[cache_seg_id];
+ struct pcache_segment_init_options seg_options = { 0 };
+ struct pcache_segment *segment = &cache_seg->segment;
+ int ret;
+
+ cache_seg->cache = cache;
+ cache_seg->cache_seg_id = cache_seg_id;
+ spin_lock_init(&cache_seg->gen_lock);
+ atomic_set(&cache_seg->refs, 0);
+ mutex_init(&cache_seg->info_lock);
+ mutex_init(&cache_seg->ctrl_lock);
+
+ /* init pcache_segment */
+ seg_options.type = PCACHE_SEGMENT_TYPE_CACHE_DATA;
+ seg_options.data_off = PCACHE_CACHE_SEG_CTRL_OFF + PCACHE_CACHE_SEG_CTRL_SIZE;
+ seg_options.seg_id = seg_id;
+ seg_options.seg_info = &cache_seg->cache_seg_info;
+ pcache_segment_init(cache_dev, segment, &seg_options);
+
+ cache_seg->cache_seg_ctrl = CACHE_DEV_SEGMENT(cache_dev, seg_id) + PCACHE_CACHE_SEG_CTRL_OFF;
+
+ if (new_cache) {
+ cache_dev_zero_range(cache_dev, CACHE_DEV_SEGMENT(cache_dev, seg_id),
+ PCACHE_SEG_INFO_SIZE * PCACHE_META_INDEX_MAX +
+ PCACHE_CACHE_SEG_CTRL_SIZE);
+
+ cache_seg_ctrl_init(cache_seg);
+
+ cache_seg->info_index = 0;
+ cache_seg_info_write(cache_seg);
+
+ /* clear outdated kset in segment */
+ memcpy_flushcache(segment->data, &pcache_empty_kset, sizeof(struct pcache_cache_kset_onmedia));
+ pmem_wmb();
+ } else {
+ ret = cache_seg_meta_load(cache_seg);
+ if (ret)
+ goto err;
+ }
+
+ return 0;
+err:
+ return ret;
+}
+
+#define PCACHE_WAIT_NEW_CACHE_INTERVAL 100
+#define PCACHE_WAIT_NEW_CACHE_COUNT 100
+
+/**
+ * get_cache_segment - Retrieves a free cache segment from the cache.
+ * @cache: Pointer to the cache structure.
+ *
+ * This function attempts to find a free cache segment that can be used.
+ * It locks the segment map and checks for the next available segment ID.
+ * If no segment is available, it waits for a predefined interval and retries.
+ * If a free segment is found, it initializes it and returns a pointer to the
+ * cache segment structure. Returns NULL if no segments are available after
+ * waiting for a specified count.
+ */
+struct pcache_cache_segment *get_cache_segment(struct pcache_cache *cache)
+{
+ struct pcache_cache_segment *cache_seg;
+ u32 seg_id;
+ u32 wait_count = 0;
+
+again:
+ spin_lock(&cache->seg_map_lock);
+ seg_id = find_next_zero_bit(cache->seg_map, cache->n_segs, cache->last_cache_seg);
+ if (seg_id == cache->n_segs) {
+ spin_unlock(&cache->seg_map_lock);
+ /* reset the hint of ->last_cache_seg and retry */
+ if (cache->last_cache_seg) {
+ cache->last_cache_seg = 0;
+ goto again;
+ }
+
+ if (++wait_count >= PCACHE_WAIT_NEW_CACHE_COUNT)
+ return NULL;
+
+ udelay(PCACHE_WAIT_NEW_CACHE_INTERVAL);
+ goto again;
+ }
+
+ /*
+ * found an available cache_seg, mark it used in seg_map
+ * and update the search hint ->last_cache_seg
+ */
+ set_bit(seg_id, cache->seg_map);
+ cache->last_cache_seg = seg_id;
+ spin_unlock(&cache->seg_map_lock);
+
+ cache_seg = &cache->segments[seg_id];
+ cache_seg->cache_seg_id = seg_id;
+
+ return cache_seg;
+}
+
+static void cache_seg_gen_increase(struct pcache_cache_segment *cache_seg)
+{
+ spin_lock(&cache_seg->gen_lock);
+ cache_seg->gen++;
+ spin_unlock(&cache_seg->gen_lock);
+
+ cache_seg_ctrl_write(cache_seg);
+}
+
+void cache_seg_get(struct pcache_cache_segment *cache_seg)
+{
+ atomic_inc(&cache_seg->refs);
+}
+
+static void cache_seg_invalidate(struct pcache_cache_segment *cache_seg)
+{
+ struct pcache_cache *cache;
+
+ cache = cache_seg->cache;
+ cache_seg_gen_increase(cache_seg);
+
+ spin_lock(&cache->seg_map_lock);
+ clear_bit(cache_seg->cache_seg_id, cache->seg_map);
+ spin_unlock(&cache->seg_map_lock);
+
+ /* clean_work will clean the bad key in key_tree*/
+ queue_work(cache_get_wq(cache), &cache->clean_work);
+}
+
+void cache_seg_put(struct pcache_cache_segment *cache_seg)
+{
+ if (atomic_dec_and_test(&cache_seg->refs))
+ cache_seg_invalidate(cache_seg);
+}
--
2.34.1
* [RFC PATCH 06/11] dm-pcache: add cache_writeback
2025-06-05 14:22 [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices Dongsheng Yang
` (4 preceding siblings ...)
2025-06-05 14:23 ` [RFC PATCH 05/11] dm-pcache: add cache_segment Dongsheng Yang
@ 2025-06-05 14:23 ` Dongsheng Yang
2025-06-05 14:23 ` [RFC PATCH 07/11] dm-pcache: add cache_gc Dongsheng Yang
` (5 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Dongsheng Yang @ 2025-06-05 14:23 UTC (permalink / raw)
To: mpatocka, agk, snitzer, axboe, hch, dan.j.williams,
Jonathan.Cameron
Cc: linux-block, linux-kernel, linux-cxl, nvdimm, dm-devel,
Dongsheng Yang
Introduce cache_writeback.c, which implements the asynchronous write-back
path for pcache. The new file is responsible for detecting dirty data,
organising it into an in-memory tree, issuing bios to the backing block
device, and advancing the cache’s *dirty tail* pointer once data has
been safely persisted.
* Dirty-state detection
- `is_cache_clean()` reads the kset header at `dirty_tail`, checks
magic and CRC, and thus decides whether there is anything to flush.
* Write-back scheduler
- `cache_writeback_work` is queued on the cache task-workqueue and
re-arms itself at `PCACHE_CACHE_WRITEBACK_INTERVAL`.
- Uses an internal spin-protected `writeback_key_tree` to batch keys
belonging to the same stripe before IO.
* Key processing
- `cache_kset_insert_tree()` decodes each key inside the on-media
kset, allocates an in-memory key object, and inserts it into the
writeback_key_tree.
- `cache_key_writeback()` builds a *KMEM-type* backing request that
maps the persistent-memory range directly into a WRITE bio and
submits it with `submit_bio_noacct()`.
- After all keys from the writeback_key_tree have been flushed,
`backing_dev_flush()` issues a single FLUSH to ensure durability.
* Tail advancement
- Once a kset is written back, `cache_pos_advance()` moves
`cache->dirty_tail` by the exact on-disk size and the new position is
persisted via `cache_encode_dirty_tail()`.
- When the `PCACHE_KSET_FLAGS_LAST` flag is seen, the write-back
engine switches to the next segment indicated by `next_cache_seg_id`.
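For orientation, the steps above reduce to the simplified loop below. This
is only a sketch of cache_writeback_fn() as added in this patch: the real
function additionally takes writeback_lock/dirty_tail_lock, checks whether
the pcache is stopping, and handles errors.

    while (!is_cache_clean(cache, &dirty_tail)) {
            /* is_cache_clean() left the kset at dirty_tail in wb_kset_onmedia_buf */
            if (kset_onmedia->flags & PCACHE_KSET_FLAGS_LAST) {
                    last_kset_writeback(cache, kset_onmedia);  /* jump to next_cache_seg_id */
                    continue;
            }

            cache_kset_insert_tree(cache, kset_onmedia);  /* decode keys into the wb tree */
            cache_wb_tree_writeback(cache);               /* WRITE bios + one backing FLUSH */

            /* advance and persist dirty_tail only after the data is durable */
            cache_pos_advance(&cache->dirty_tail, get_kset_onmedia_size(kset_onmedia));
            cache_encode_dirty_tail(cache);
    }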
Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
---
drivers/md/dm-pcache/cache_writeback.c | 239 +++++++++++++++++++++++++
1 file changed, 239 insertions(+)
create mode 100644 drivers/md/dm-pcache/cache_writeback.c
diff --git a/drivers/md/dm-pcache/cache_writeback.c b/drivers/md/dm-pcache/cache_writeback.c
new file mode 100644
index 000000000000..fe07e7fad15e
--- /dev/null
+++ b/drivers/md/dm-pcache/cache_writeback.c
@@ -0,0 +1,239 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include <linux/bio.h>
+
+#include "cache.h"
+#include "backing_dev.h"
+#include "cache_dev.h"
+#include "dm_pcache.h"
+
+static inline bool is_cache_clean(struct pcache_cache *cache, struct pcache_cache_pos *dirty_tail)
+{
+ struct dm_pcache *pcache = CACHE_TO_PCACHE(cache);
+ struct pcache_cache_kset_onmedia *kset_onmedia;
+ u32 to_copy;
+ void *addr;
+ int ret;
+
+ addr = cache_pos_addr(dirty_tail);
+ kset_onmedia = (struct pcache_cache_kset_onmedia *)cache->wb_kset_onmedia_buf;
+
+ to_copy = min(PCACHE_KSET_ONMEDIA_SIZE_MAX, PCACHE_SEG_SIZE - dirty_tail->seg_off);
+ ret = copy_mc_to_kernel(kset_onmedia, addr, to_copy);
+ if (ret) {
+ pcache_dev_err(pcache, "failed to read kset: %d", ret);
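+ /* Treat an unreadable kset as "clean": writeback stops here and the
+ * delayed work will retry at the next interval.
+ */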
+ return true;
+ }
+
+ /* Check if the magic number matches the expected value */
+ if (kset_onmedia->magic != PCACHE_KSET_MAGIC) {
+ pcache_dev_debug(pcache, "dirty_tail: %u:%u magic: %llx, not expected: %llx\n",
+ dirty_tail->cache_seg->cache_seg_id, dirty_tail->seg_off,
+ kset_onmedia->magic, PCACHE_KSET_MAGIC);
+ return true;
+ }
+
+ /* Verify the CRC checksum for data integrity */
+ if (kset_onmedia->crc != cache_kset_crc(kset_onmedia)) {
+ pcache_dev_debug(pcache, "dirty_tail: %u:%u crc: %x, not expected: %x\n",
+ dirty_tail->cache_seg->cache_seg_id, dirty_tail->seg_off,
+ cache_kset_crc(kset_onmedia), kset_onmedia->crc);
+ return true;
+ }
+
+ return false;
+}
+
+void cache_writeback_exit(struct pcache_cache *cache)
+{
+ cancel_delayed_work_sync(&cache->writeback_work);
+ cache_tree_exit(&cache->writeback_key_tree);
+}
+
+int cache_writeback_init(struct pcache_cache *cache)
+{
+ int ret;
+
+ ret = cache_tree_init(cache, &cache->writeback_key_tree, 1);
+ if (ret)
+ goto err;
+
+ /* Queue delayed work to start writeback handling */
+ queue_delayed_work(cache_get_wq(cache), &cache->writeback_work, 0);
+
+ return 0;
+err:
+ return ret;
+}
+
+static int cache_key_writeback(struct pcache_cache *cache, struct pcache_cache_key *key)
+{
+ struct pcache_backing_dev_req *writeback_req;
+ struct pcache_backing_dev_req_opts writeback_req_opts = { 0 };
+ struct pcache_cache_pos *pos;
+ void *addr;
+ u32 seg_remain;
+ u64 off;
+
+ if (cache_key_clean(key))
+ return 0;
+
+ pos = &key->cache_pos;
+
+ seg_remain = cache_seg_remain(pos);
+ BUG_ON(seg_remain < key->len);
+
+ addr = cache_pos_addr(pos);
+ off = key->off;
+
+ writeback_req_opts.type = BACKING_DEV_REQ_TYPE_KMEM;
+ writeback_req_opts.end_fn = NULL;
+ writeback_req_opts.gfp_mask = GFP_NOIO;
+
+ writeback_req_opts.kmem.data = addr;
+ writeback_req_opts.kmem.opf = REQ_OP_WRITE;
+ writeback_req_opts.kmem.len = key->len;
+ writeback_req_opts.kmem.backing_off = off;
+
+ writeback_req = backing_dev_req_create(cache->backing_dev, &writeback_req_opts);
+ if (!writeback_req)
+ return -EIO;
+
+ backing_dev_req_submit(writeback_req, true);
+
+ return 0;
+}
+
+static int cache_wb_tree_writeback(struct pcache_cache *cache)
+{
+ struct dm_pcache *pcache = CACHE_TO_PCACHE(cache);
+ struct pcache_cache_tree *cache_tree = &cache->writeback_key_tree;
+ struct pcache_cache_subtree *cache_subtree;
+ struct rb_node *node;
+ struct pcache_cache_key *key;
+ int ret;
+ u32 i;
+
+ for (i = 0; i < cache_tree->n_subtrees; i++) {
+ cache_subtree = &cache_tree->subtrees[i];
+
+ node = rb_first(&cache_subtree->root);
+ while (node) {
+ key = CACHE_KEY(node);
+ node = rb_next(node);
+
+ ret = cache_key_writeback(cache, key);
+ if (ret) {
+ pcache_dev_err(pcache, "writeback error: %d\n", ret);
+ return ret;
+ }
+
+ cache_key_delete(key);
+ }
+ }
+
+ backing_dev_flush(cache->backing_dev);
+
+ return 0;
+}
+
+static int cache_kset_insert_tree(struct pcache_cache *cache, struct pcache_cache_kset_onmedia *kset_onmedia)
+{
+ struct pcache_cache_key_onmedia *key_onmedia;
+ struct pcache_cache_key *key;
+ int ret;
+ u32 i;
+
+ /* Iterate through all keys in the kset and write each back to storage */
+ for (i = 0; i < kset_onmedia->key_num; i++) {
+ key_onmedia = &kset_onmedia->data[i];
+
+ key = cache_key_alloc(&cache->writeback_key_tree);
+ if (!key)
+ return -ENOMEM;
+
+ ret = cache_key_decode(cache, key_onmedia, key);
+ if (ret) {
+ cache_key_delete(key);
+ return ret;
+ }
+
+ ret = cache_key_insert(&cache->writeback_key_tree, key, true);
+ if (ret) {
+ cache_key_delete(key);
+ return ret;
+ }
+ }
+
+ return 0;
+}
+
+static void last_kset_writeback(struct pcache_cache *cache,
+ struct pcache_cache_kset_onmedia *last_kset_onmedia)
+{
+ struct dm_pcache *pcache = CACHE_TO_PCACHE(cache);
+ struct pcache_cache_segment *next_seg;
+
+ pcache_dev_debug(pcache, "last kset, next: %u\n", last_kset_onmedia->next_cache_seg_id);
+
+ next_seg = &cache->segments[last_kset_onmedia->next_cache_seg_id];
+
+ mutex_lock(&cache->dirty_tail_lock);
+ cache->dirty_tail.cache_seg = next_seg;
+ cache->dirty_tail.seg_off = 0;
+ cache_encode_dirty_tail(cache);
+ mutex_unlock(&cache->dirty_tail_lock);
+}
+
+void cache_writeback_fn(struct work_struct *work)
+{
+ struct pcache_cache *cache = container_of(work, struct pcache_cache, writeback_work.work);
+ struct dm_pcache *pcache = CACHE_TO_PCACHE(cache);
+ struct pcache_cache_pos dirty_tail;
+ struct pcache_cache_kset_onmedia *kset_onmedia;
+ int ret = 0;
+
+ mutex_lock(&cache->writeback_lock);
+ kset_onmedia = (struct pcache_cache_kset_onmedia *)cache->wb_kset_onmedia_buf;
+ /* Loop until all dirty data is written back and the cache is clean */
+ while (true) {
+ if (pcache_is_stopping(pcache)) {
+ mutex_unlock(&cache->writeback_lock);
+ return;
+ }
+
+ /* Get new dirty tail */
+ mutex_lock(&cache->dirty_tail_lock);
+ cache_pos_copy(&dirty_tail, &cache->dirty_tail);
+ mutex_unlock(&cache->dirty_tail_lock);
+
+ if (is_cache_clean(cache, &dirty_tail))
+ break;
+
+ if (kset_onmedia->flags & PCACHE_KSET_FLAGS_LAST) {
+ last_kset_writeback(cache, kset_onmedia);
+ continue;
+ }
+
+ ret = cache_kset_insert_tree(cache, kset_onmedia);
+ if (ret)
+ break;
+
+ ret = cache_wb_tree_writeback(cache);
+ if (ret)
+ break;
+
+ pcache_dev_debug(pcache, "writeback advance: %u:%u %u\n",
+ dirty_tail.cache_seg->cache_seg_id,
+ dirty_tail.seg_off,
+ get_kset_onmedia_size(kset_onmedia));
+
+ mutex_lock(&cache->dirty_tail_lock);
+ cache_pos_advance(&cache->dirty_tail, get_kset_onmedia_size(kset_onmedia));
+ cache_encode_dirty_tail(cache);
+ mutex_unlock(&cache->dirty_tail_lock);
+ }
+ mutex_unlock(&cache->writeback_lock);
+
+ queue_delayed_work(cache_get_wq(cache), &cache->writeback_work, PCACHE_CACHE_WRITEBACK_INTERVAL);
+}
--
2.34.1
* [RFC PATCH 07/11] dm-pcache: add cache_gc
2025-06-05 14:22 [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices Dongsheng Yang
` (5 preceding siblings ...)
2025-06-05 14:23 ` [RFC PATCH 06/11] dm-pcache: add cache_writeback Dongsheng Yang
@ 2025-06-05 14:23 ` Dongsheng Yang
2025-06-05 14:23 ` [RFC PATCH 08/11] dm-pcache: add cache_key Dongsheng Yang
` (4 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Dongsheng Yang @ 2025-06-05 14:23 UTC (permalink / raw)
To: mpatocka, agk, snitzer, axboe, hch, dan.j.williams,
Jonathan.Cameron
Cc: linux-block, linux-kernel, linux-cxl, nvdimm, dm-devel,
Dongsheng Yang
Introduce cache_gc.c, a self-contained engine that reclaims cache
segments whose data have already been flushed to the backing device.
Running in the cache workqueue, the GC keeps segment usage below the
user-configurable *cache_gc_percent* threshold.
* need_gc() – decides when to trigger GC by checking:
- *dirty_tail* vs *key_tail* position,
- kset integrity (magic + CRC),
- bitmap utilisation against the gc-percent threshold (see the worked example below).
* Per-key reclamation
- Decodes each key in the target kset (`cache_key_decode()`).
- Drops the segment reference with `cache_seg_put()`, allowing the
segment to be invalidated once all keys are gone.
- When the reference count hits zero the segment is cleared from
`seg_map`, making it immediately reusable by the allocator.
* Scheduling
- `pcache_cache_gc_fn()` loops until no more work is needed, then
re-queues itself after *PCACHE_CACHE_GC_INTERVAL*.
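As a concrete illustration of the threshold check in need_gc() (the numbers
below are only an example, not driver defaults):

    segs_used         = bitmap_weight(cache->seg_map, cache->n_segs);
    segs_gc_threshold = cache->n_segs * pcache_cache_get_gc_percent(cache) / 100;

    /*
     * e.g. n_segs = 256 and gc_percent = 70 give a threshold of 179
     * segments; need_gc() returns false while segs_used < 179, so GC
     * only starts reclaiming ksets from key_tail once at least 179
     * segments are marked used in seg_map.
     */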
Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
---
drivers/md/dm-pcache/cache_gc.c | 170 ++++++++++++++++++++++++++++++++
1 file changed, 170 insertions(+)
create mode 100644 drivers/md/dm-pcache/cache_gc.c
diff --git a/drivers/md/dm-pcache/cache_gc.c b/drivers/md/dm-pcache/cache_gc.c
new file mode 100644
index 000000000000..a122d95c8f46
--- /dev/null
+++ b/drivers/md/dm-pcache/cache_gc.c
@@ -0,0 +1,170 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#include "cache.h"
+#include "backing_dev.h"
+#include "cache_dev.h"
+#include "dm_pcache.h"
+
+/**
+ * cache_key_gc - Releases the reference of a cache key segment.
+ * @cache: Pointer to the pcache_cache structure.
+ * @key: Pointer to the cache key to be garbage collected.
+ *
+ * This function decrements the reference count of the cache segment
+ * associated with the given key. If the reference count drops to zero,
+ * the segment may be invalidated and reused.
+ */
+static void cache_key_gc(struct pcache_cache *cache, struct pcache_cache_key *key)
+{
+ cache_seg_put(key->cache_pos.cache_seg);
+}
+
+static bool need_gc(struct pcache_cache *cache, struct pcache_cache_pos *dirty_tail, struct pcache_cache_pos *key_tail)
+{
+ struct dm_pcache *pcache = CACHE_TO_PCACHE(cache);
+ struct pcache_cache_kset_onmedia *kset_onmedia;
+ void *dirty_addr, *key_addr;
+ u32 segs_used, segs_gc_threshold, to_copy;
+ int ret;
+
+ dirty_addr = cache_pos_addr(dirty_tail);
+ key_addr = cache_pos_addr(key_tail);
+ if (dirty_addr == key_addr) {
+ pcache_dev_debug(pcache, "key tail is equal to dirty tail: %u:%u\n",
+ dirty_tail->cache_seg->cache_seg_id,
+ dirty_tail->seg_off);
+ return false;
+ }
+
+ kset_onmedia = (struct pcache_cache_kset_onmedia *)cache->gc_kset_onmedia_buf;
+
+ to_copy = min(PCACHE_KSET_ONMEDIA_SIZE_MAX, PCACHE_SEG_SIZE - key_tail->seg_off);
+ ret = copy_mc_to_kernel(kset_onmedia, key_addr, to_copy);
+ if (ret) {
+ pcache_dev_err(pcache, "failed to read kset: %d", ret);
+ return false;
+ }
+
+ /* Check if kset_onmedia is corrupted */
+ if (kset_onmedia->magic != PCACHE_KSET_MAGIC) {
+ pcache_dev_debug(pcache, "gc error: magic is not as expected. key_tail: %u:%u magic: %llx, expected: %llx\n",
+ key_tail->cache_seg->cache_seg_id, key_tail->seg_off,
+ kset_onmedia->magic, PCACHE_KSET_MAGIC);
+ return false;
+ }
+
+ /* Verify the CRC of the kset_onmedia */
+ if (kset_onmedia->crc != cache_kset_crc(kset_onmedia)) {
+ pcache_dev_debug(pcache, "gc error: crc is not as expected. crc: %x, expected: %x\n",
+ cache_kset_crc(kset_onmedia), kset_onmedia->crc);
+ return false;
+ }
+
+ segs_used = bitmap_weight(cache->seg_map, cache->n_segs);
+ segs_gc_threshold = cache->n_segs * pcache_cache_get_gc_percent(cache) / 100;
+ if (segs_used < segs_gc_threshold) {
+ pcache_dev_debug(pcache, "segs_used: %u, segs_gc_threshold: %u\n", segs_used, segs_gc_threshold);
+ return false;
+ }
+
+ return true;
+}
+
+/**
+ * last_kset_gc - Advances the garbage collection for the last kset.
+ * @cache: Pointer to the pcache_cache structure.
+ * @kset_onmedia: Pointer to the kset_onmedia structure for the last kset.
+ */
+static void last_kset_gc(struct pcache_cache *cache, struct pcache_cache_kset_onmedia *kset_onmedia)
+{
+ struct dm_pcache *pcache = CACHE_TO_PCACHE(cache);
+ struct pcache_cache_segment *cur_seg, *next_seg;
+
+ cur_seg = cache->key_tail.cache_seg;
+
+ next_seg = &cache->segments[kset_onmedia->next_cache_seg_id];
+
+ mutex_lock(&cache->key_tail_lock);
+ cache->key_tail.cache_seg = next_seg;
+ cache->key_tail.seg_off = 0;
+ cache_encode_key_tail(cache);
+ mutex_unlock(&cache->key_tail_lock);
+
+ pcache_dev_debug(pcache, "gc advance kset seg: %u\n", cur_seg->cache_seg_id);
+
+ spin_lock(&cache->seg_map_lock);
+ clear_bit(cur_seg->cache_seg_id, cache->seg_map);
+ spin_unlock(&cache->seg_map_lock);
+}
+
+void pcache_cache_gc_fn(struct work_struct *work)
+{
+ struct pcache_cache *cache = container_of(work, struct pcache_cache, gc_work.work);
+ struct dm_pcache *pcache = CACHE_TO_PCACHE(cache);
+ struct pcache_cache_pos dirty_tail, key_tail;
+ struct pcache_cache_kset_onmedia *kset_onmedia;
+ struct pcache_cache_key_onmedia *key_onmedia;
+ struct pcache_cache_key *key;
+ int ret;
+ int i;
+
+ kset_onmedia = (struct pcache_cache_kset_onmedia *)cache->gc_kset_onmedia_buf;
+
+ while (true) {
+ if (pcache_is_stopping(pcache) || atomic_read(&cache->gc_errors))
+ return;
+
+ /* Get new tail positions */
+ mutex_lock(&cache->dirty_tail_lock);
+ cache_pos_copy(&dirty_tail, &cache->dirty_tail);
+ mutex_unlock(&cache->dirty_tail_lock);
+
+ mutex_lock(&cache->key_tail_lock);
+ cache_pos_copy(&key_tail, &cache->key_tail);
+ mutex_unlock(&cache->key_tail_lock);
+
+ if (!need_gc(cache, &dirty_tail, &key_tail))
+ break;
+
+ if (kset_onmedia->flags & PCACHE_KSET_FLAGS_LAST) {
+ /* Don't move to the next segment if dirty_tail has not moved */
+ if (dirty_tail.cache_seg == key_tail.cache_seg)
+ break;
+
+ last_kset_gc(cache, kset_onmedia);
+ continue;
+ }
+
+ for (i = 0; i < kset_onmedia->key_num; i++) {
+ struct pcache_cache_key key_tmp = { 0 };
+
+ key_onmedia = &kset_onmedia->data[i];
+
+ key = &key_tmp;
+ cache_key_init(&cache->req_key_tree, key);
+
+ ret = cache_key_decode(cache, key_onmedia, key);
+ if (ret) {
+ /* Return without re-arming the gc work, and prevent future
+ * gc: we cannot retry a partially GC-ed kset.
+ */
+ atomic_inc(&cache->gc_errors);
+ pcache_dev_err(pcache, "failed to decode cache key in gc\n");
+ return;
+ }
+
+ cache_key_gc(cache, key);
+ }
+
+ pcache_dev_debug(pcache, "gc advance: %u:%u %u\n",
+ key_tail.cache_seg->cache_seg_id,
+ key_tail.seg_off,
+ get_kset_onmedia_size(kset_onmedia));
+
+ mutex_lock(&cache->key_tail_lock);
+ cache_pos_advance(&cache->key_tail, get_kset_onmedia_size(kset_onmedia));
+ cache_encode_key_tail(cache);
+ mutex_unlock(&cache->key_tail_lock);
+ }
+
+ queue_delayed_work(cache_get_wq(cache), &cache->gc_work, PCACHE_CACHE_GC_INTERVAL);
+}
--
2.34.1
* [RFC PATCH 08/11] dm-pcache: add cache_key
2025-06-05 14:22 [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices Dongsheng Yang
` (6 preceding siblings ...)
2025-06-05 14:23 ` [RFC PATCH 07/11] dm-pcache: add cache_gc Dongsheng Yang
@ 2025-06-05 14:23 ` Dongsheng Yang
2025-06-05 14:23 ` [RFC PATCH 09/11] dm-pcache: add cache_req Dongsheng Yang
` (3 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Dongsheng Yang @ 2025-06-05 14:23 UTC (permalink / raw)
To: mpatocka, agk, snitzer, axboe, hch, dan.j.williams,
Jonathan.Cameron
Cc: linux-block, linux-kernel, linux-cxl, nvdimm, dm-devel,
Dongsheng Yang
Add *cache_key.c*, the heart of dm-pcache’s in-memory index and
on-media key-set (“kset”) format.
* Key objects (`struct pcache_cache_key`)
- Slab-backed allocator & ref-count helpers
- `cache_key_encode()/decode()` translate between in-memory keys and
their on-disk representation, validating CRC when
*cache_data_crc* is enabled.
* Kset construction & persistence
- Per-kset buffer lives in `struct pcache_cache_kset`; keys are
appended until full or *force_close* triggers an immediate flush.
- `cache_kset_close()` writes the kset to the *key_head* segment,
automatically chaining a *LAST* kset header when rolling over to a
freshly allocated segment.
* Red-black tree with striping
- Cache space is divided into *subtrees* to reduce lock
contention; each subtree owns its own RB-root + spinlock.
- Complex overlap-resolution logic (`cache_insert_fixup()`) ensures
newly inserted keys never leave overlapping stale ranges behind
(head/tail/contain/contained cases handled; see the example below).
* Replay on start-up
- `cache_replay()` walks from *key_tail* to *key_head*, re-hydrates
keys, validates CRC/magic, seamlessly
skipping placeholder “empty” keys left by read-misses.
* Background maintenance
- `clean_work` lazily prunes invalidated keys after GC.
- `kset_flush_work` closes partially filled ksets in the background.
With this patch dm-pcache can persistently track cached extents, rebuild
its index after a crash, and guarantee a non-overlapping key space, paving
the way for functional read/write caching.
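A concrete example of the overlap fixups (the offsets are arbitrary): the
tree already holds a key covering 0K-8K and an 8K write arrives at offset
4K. cache_insert_fixup() resolves this as the "head" overlap case:

    before:  |--------|           key_tmp: off = 0,  len = 8K  (existing)
                 |=========|      key:     off = 4K, len = 8K  (new write)

    after:   |----|               key_tmp cut back to len = 4K (covers 0K-4K)
                 |=========|      key inserted, now owning 4K-12K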
Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
---
drivers/md/dm-pcache/cache_key.c | 907 +++++++++++++++++++++++++++++++
1 file changed, 907 insertions(+)
create mode 100644 drivers/md/dm-pcache/cache_key.c
diff --git a/drivers/md/dm-pcache/cache_key.c b/drivers/md/dm-pcache/cache_key.c
new file mode 100644
index 000000000000..ee9d482f963b
--- /dev/null
+++ b/drivers/md/dm-pcache/cache_key.c
@@ -0,0 +1,907 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#include "cache.h"
+#include "backing_dev.h"
+#include "cache_dev.h"
+#include "dm_pcache.h"
+
+struct pcache_cache_kset_onmedia pcache_empty_kset = { 0 };
+
+void cache_key_init(struct pcache_cache_tree *cache_tree, struct pcache_cache_key *key)
+{
+ kref_init(&key->ref);
+ key->cache_tree = cache_tree;
+ INIT_LIST_HEAD(&key->list_node);
+ RB_CLEAR_NODE(&key->rb_node);
+}
+
+struct pcache_cache_key *cache_key_alloc(struct pcache_cache_tree *cache_tree)
+{
+ struct pcache_cache_key *key;
+
+ key = kmem_cache_zalloc(cache_tree->key_cache, GFP_NOWAIT);
+ if (!key)
+ return NULL;
+
+ cache_key_init(cache_tree, key);
+
+ return key;
+}
+
+/**
+ * cache_key_get - Increment the reference count of a cache key.
+ * @key: Pointer to the pcache_cache_key structure.
+ *
+ * This function increments the reference count of the specified cache key,
+ * ensuring that it is not freed while still in use.
+ */
+void cache_key_get(struct pcache_cache_key *key)
+{
+ kref_get(&key->ref);
+}
+
+/**
+ * cache_key_destroy - Free a cache key structure when its reference count drops to zero.
+ * @ref: Pointer to the kref structure.
+ *
+ * This function is called when the reference count of the cache key reaches zero.
+ * It frees the allocated cache key back to the slab cache.
+ */
+static void cache_key_destroy(struct kref *ref)
+{
+ struct pcache_cache_key *key = container_of(ref, struct pcache_cache_key, ref);
+ struct pcache_cache_tree *cache_tree = key->cache_tree;
+
+ kmem_cache_free(cache_tree->key_cache, key);
+}
+
+void cache_key_put(struct pcache_cache_key *key)
+{
+ kref_put(&key->ref, cache_key_destroy);
+}
+
+void cache_pos_advance(struct pcache_cache_pos *pos, u32 len)
+{
+ /* Ensure enough space remains in the current segment */
+ BUG_ON(cache_seg_remain(pos) < len);
+
+ pos->seg_off += len;
+}
+
+static void cache_key_encode(struct pcache_cache *cache,
+ struct pcache_cache_key_onmedia *key_onmedia,
+ struct pcache_cache_key *key)
+{
+ key_onmedia->off = key->off;
+ key_onmedia->len = key->len;
+
+ key_onmedia->cache_seg_id = key->cache_pos.cache_seg->cache_seg_id;
+ key_onmedia->cache_seg_off = key->cache_pos.seg_off;
+
+ key_onmedia->seg_gen = key->seg_gen;
+ key_onmedia->flags = key->flags;
+
+ if (cache_data_crc_on(cache))
+ key_onmedia->data_crc = cache_key_data_crc(key);
+}
+
+int cache_key_decode(struct pcache_cache *cache,
+ struct pcache_cache_key_onmedia *key_onmedia,
+ struct pcache_cache_key *key)
+{
+ struct dm_pcache *pcache = CACHE_TO_PCACHE(cache);
+
+ key->off = key_onmedia->off;
+ key->len = key_onmedia->len;
+
+ key->cache_pos.cache_seg = &cache->segments[key_onmedia->cache_seg_id];
+ key->cache_pos.seg_off = key_onmedia->cache_seg_off;
+
+ key->seg_gen = key_onmedia->seg_gen;
+ key->flags = key_onmedia->flags;
+
+ if (cache_data_crc_on(cache) &&
+ key_onmedia->data_crc != cache_key_data_crc(key)) {
+ pcache_dev_err(pcache, "key: %llu:%u seg %u:%u data_crc error: %x, expected: %x\n",
+ key->off, key->len, key->cache_pos.cache_seg->cache_seg_id,
+ key->cache_pos.seg_off, cache_key_data_crc(key), key_onmedia->data_crc);
+ return -EIO;
+ }
+
+ return 0;
+}
+
+static void append_last_kset(struct pcache_cache *cache, u32 next_seg)
+{
+ struct pcache_cache_kset_onmedia kset_onmedia = { 0 };
+
+ kset_onmedia.flags |= PCACHE_KSET_FLAGS_LAST;
+ kset_onmedia.next_cache_seg_id = next_seg;
+ kset_onmedia.magic = PCACHE_KSET_MAGIC;
+ kset_onmedia.crc = cache_kset_crc(&kset_onmedia);
+
+ memcpy_flushcache(get_key_head_addr(cache), &kset_onmedia, sizeof(struct pcache_cache_kset_onmedia));
+ pmem_wmb();
+ cache_pos_advance(&cache->key_head, sizeof(struct pcache_cache_kset_onmedia));
+}
+
+int cache_kset_close(struct pcache_cache *cache, struct pcache_cache_kset *kset)
+{
+ struct pcache_cache_kset_onmedia *kset_onmedia;
+ u32 kset_onmedia_size;
+ int ret;
+
+ kset_onmedia = &kset->kset_onmedia;
+
+ if (!kset_onmedia->key_num)
+ return 0;
+
+ kset_onmedia_size = struct_size(kset_onmedia, data, kset_onmedia->key_num);
+
+ spin_lock(&cache->key_head_lock);
+again:
+ /* Reserve space for the last kset */
+ if (cache_seg_remain(&cache->key_head) < kset_onmedia_size + sizeof(struct pcache_cache_kset_onmedia)) {
+ struct pcache_cache_segment *next_seg;
+
+ next_seg = get_cache_segment(cache);
+ if (!next_seg) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ /* clear outdated kset in next seg */
+ memcpy_flushcache(next_seg->segment.data, &pcache_empty_kset,
+ sizeof(struct pcache_cache_kset_onmedia));
+ append_last_kset(cache, next_seg->cache_seg_id);
+ cache->key_head.cache_seg = next_seg;
+ cache->key_head.seg_off = 0;
+ goto again;
+ }
+
+ kset_onmedia->magic = PCACHE_KSET_MAGIC;
+ kset_onmedia->crc = cache_kset_crc(kset_onmedia);
+
+ /* clear outdated kset after current kset */
+ memcpy_flushcache(get_key_head_addr(cache) + kset_onmedia_size, &pcache_empty_kset,
+ sizeof(struct pcache_cache_kset_onmedia));
+ /* write current kset into segment */
+ memcpy_flushcache(get_key_head_addr(cache), kset_onmedia, kset_onmedia_size);
+ pmem_wmb();
+
+ /* reset kset_onmedia */
+ memset(kset_onmedia, 0, sizeof(struct pcache_cache_kset_onmedia));
+ cache_pos_advance(&cache->key_head, kset_onmedia_size);
+
+ ret = 0;
+out:
+ spin_unlock(&cache->key_head_lock);
+
+ return ret;
+}
+
+/**
+ * cache_key_append - Append a cache key to the related kset.
+ * @cache: Pointer to the pcache_cache structure.
+ * @key: Pointer to the cache key structure to append.
+ *
+ * This function appends a cache key to the appropriate kset. If the kset
+ * is full, it closes the kset. If not, it queues a flush work to write
+ * the kset to media.
+ *
+ * Returns 0 on success, or a negative error code on failure.
+ */
+int cache_key_append(struct pcache_cache *cache, struct pcache_cache_key *key, bool force_close)
+{
+ struct pcache_cache_kset *kset;
+ struct pcache_cache_kset_onmedia *kset_onmedia;
+ struct pcache_cache_key_onmedia *key_onmedia;
+ u32 kset_id = get_kset_id(cache, key->off);
+ int ret = 0;
+
+ kset = get_kset(cache, kset_id);
+ kset_onmedia = &kset->kset_onmedia;
+
+ spin_lock(&kset->kset_lock);
+ key_onmedia = &kset_onmedia->data[kset_onmedia->key_num];
+ cache_key_encode(cache, key_onmedia, key);
+
+ /* Check if the current kset has reached the maximum number of keys */
+ if (++kset_onmedia->key_num == PCACHE_KSET_KEYS_MAX || force_close) {
+ /* If full, close the kset */
+ ret = cache_kset_close(cache, kset);
+ if (ret) {
+ kset_onmedia->key_num--;
+ goto out;
+ }
+ } else {
+ /* If not full, queue a delayed work to flush the kset */
+ queue_delayed_work(cache_get_wq(cache), &kset->flush_work, 1 * HZ);
+ }
+out:
+ spin_unlock(&kset->kset_lock);
+
+ return ret;
+}
+
+/**
+ * cache_subtree_walk - Traverse the cache tree.
+ * @ctx: Pointer to the walk context, which carries the start node, the
+ *       key being processed and the per-case overlap callbacks.
+ *
+ * This function traverses the cache tree starting from the specified node.
+ * It calls the appropriate callback functions based on the relationships
+ * between the keys in the cache tree.
+ *
+ * Returns 0 on success, or a negative error code on failure.
+ */
+int cache_subtree_walk(struct pcache_cache_subtree_walk_ctx *ctx)
+{
+ struct pcache_cache_key *key_tmp, *key;
+ struct rb_node *node_tmp;
+ int ret;
+
+ key = ctx->key;
+ node_tmp = ctx->start_node;
+
+ while (node_tmp) {
+ if (ctx->walk_done && ctx->walk_done(ctx))
+ break;
+
+ key_tmp = CACHE_KEY(node_tmp);
+ /*
+ * If key_tmp ends before the start of key, continue to the next node.
+ * |----------|
+ * |=====|
+ */
+ if (cache_key_lend(key_tmp) <= cache_key_lstart(key)) {
+ if (ctx->after) {
+ ret = ctx->after(key, key_tmp, ctx);
+ if (ret)
+ goto out;
+ }
+ goto next;
+ }
+
+ /*
+ * If key_tmp starts after the end of key, stop traversing.
+ * |--------|
+ * |====|
+ */
+ if (cache_key_lstart(key_tmp) >= cache_key_lend(key)) {
+ if (ctx->before) {
+ ret = ctx->before(key, key_tmp, ctx);
+ if (ret)
+ goto out;
+ }
+ break;
+ }
+
+ /* Handle overlapping keys */
+ if (cache_key_lstart(key_tmp) >= cache_key_lstart(key)) {
+ /*
+ * If key_tmp encompasses key.
+ * |----------------| key_tmp
+ * |===========| key
+ */
+ if (cache_key_lend(key_tmp) >= cache_key_lend(key)) {
+ if (ctx->overlap_tail) {
+ ret = ctx->overlap_tail(key, key_tmp, ctx);
+ if (ret)
+ goto out;
+ }
+ break;
+ }
+
+ /*
+ * If key_tmp is contained within key.
+ * |----| key_tmp
+ * |==========| key
+ */
+ if (ctx->overlap_contain) {
+ ret = ctx->overlap_contain(key, key_tmp, ctx);
+ if (ret)
+ goto out;
+ }
+
+ goto next;
+ }
+
+ /*
+ * If key_tmp starts before key ends but ends after key.
+ * |-----------| key_tmp
+ * |====| key
+ */
+ if (cache_key_lend(key_tmp) > cache_key_lend(key)) {
+ if (ctx->overlap_contained) {
+ ret = ctx->overlap_contained(key, key_tmp, ctx);
+ if (ret)
+ goto out;
+ }
+ break;
+ }
+
+ /*
+ * If key_tmp starts before key and ends within key.
+ * |--------| key_tmp
+ * |==========| key
+ */
+ if (ctx->overlap_head) {
+ ret = ctx->overlap_head(key, key_tmp, ctx);
+ if (ret)
+ goto out;
+ }
+next:
+ node_tmp = rb_next(node_tmp);
+ }
+
+ if (ctx->walk_finally) {
+ ret = ctx->walk_finally(ctx);
+ if (ret)
+ goto out;
+ }
+
+ return 0;
+out:
+ return ret;
+}
+
+/**
+ * cache_subtree_search - Search for a key in the cache tree.
+ * @cache_subtree: Pointer to the cache tree structure.
+ * @key: Pointer to the cache key to search for.
+ * @parentp: Pointer to store the parent node of the found node.
+ * @newp: Pointer to store the location where the new node should be inserted.
+ * @delete_key_list: List to collect invalid keys for deletion.
+ *
+ * This function searches the cache tree for a specific key and returns
+ * the node that is the predecessor of the key, or the first node if the key is
+ * less than all keys in the tree. If any invalid keys are found during
+ * the search, they are added to the delete_key_list for later cleanup.
+ *
+ * Returns a pointer to the previous node.
+ */
+struct rb_node *cache_subtree_search(struct pcache_cache_subtree *cache_subtree, struct pcache_cache_key *key,
+ struct rb_node **parentp, struct rb_node ***newp,
+ struct list_head *delete_key_list)
+{
+ struct rb_node **new, *parent = NULL;
+ struct pcache_cache_key *key_tmp;
+ struct rb_node *prev_node = NULL;
+
+ new = &(cache_subtree->root.rb_node);
+ while (*new) {
+ key_tmp = container_of(*new, struct pcache_cache_key, rb_node);
+ if (cache_key_invalid(key_tmp))
+ list_add(&key_tmp->list_node, delete_key_list);
+
+ parent = *new;
+ if (key_tmp->off >= key->off) {
+ new = &((*new)->rb_left);
+ } else {
+ prev_node = *new;
+ new = &((*new)->rb_right);
+ }
+ }
+
+ if (!prev_node)
+ prev_node = rb_first(&cache_subtree->root);
+
+ if (parentp)
+ *parentp = parent;
+
+ if (newp)
+ *newp = new;
+
+ return prev_node;
+}
+
+/**
+ * fixup_overlap_tail - Adjust the key when it overlaps at the tail.
+ * @key: Pointer to the new cache key being inserted.
+ * @key_tmp: Pointer to the existing key that overlaps.
+ * @ctx: Pointer to the context for walking the cache tree.
+ *
+ * This function modifies the existing key (key_tmp) when there is an
+ * overlap at the tail with the new key. If the modified key becomes
+ * empty, it is deleted. Returns 0 on success, or -EAGAIN if the key
+ * needs to be reinserted.
+ */
+static int fixup_overlap_tail(struct pcache_cache_key *key,
+ struct pcache_cache_key *key_tmp,
+ struct pcache_cache_subtree_walk_ctx *ctx)
+{
+ int ret;
+
+ /*
+ * |----------------| key_tmp
+ * |===========| key
+ */
+ if (cache_key_empty(key_tmp)) {
+ cache_key_delete(key_tmp);
+ ret = -EAGAIN;
+ goto out;
+ }
+
+ cache_key_cutfront(key_tmp, cache_key_lend(key) - cache_key_lstart(key_tmp));
+ if (key_tmp->len == 0) {
+ cache_key_delete(key_tmp);
+ ret = -EAGAIN;
+
+ /*
+ * Deleting key_tmp may change the structure of the
+ * entire cache tree, so we need to re-search the tree
+ * to determine the new insertion point for the key.
+ */
+ goto out;
+ }
+
+ return 0;
+out:
+ return ret;
+}
+
+/**
+ * fixup_overlap_contain - Handle case where new key completely contains an existing key.
+ * @key: Pointer to the new cache key being inserted.
+ * @key_tmp: Pointer to the existing key that is being contained.
+ * @ctx: Pointer to the context for walking the cache tree.
+ *
+ * This function deletes the existing key (key_tmp) when the new key
+ * completely contains it. It returns -EAGAIN to indicate that the
+ * tree structure may have changed, necessitating a re-insertion of
+ * the new key.
+ */
+static int fixup_overlap_contain(struct pcache_cache_key *key,
+ struct pcache_cache_key *key_tmp,
+ struct pcache_cache_subtree_walk_ctx *ctx)
+{
+ /*
+ * |----| key_tmp
+ * |==========| key
+ */
+ cache_key_delete(key_tmp);
+
+ return -EAGAIN;
+}
+
+/**
+ * fixup_overlap_contained - Handle overlap when a new key is contained in an existing key.
+ * @key: The new cache key being inserted.
+ * @key_tmp: The existing cache key that overlaps with the new key.
+ * @ctx: Context for the cache tree walk.
+ *
+ * This function adjusts the existing key if the new key is contained
+ * within it. If the existing key is empty, it indicates a placeholder key
+ * that was inserted during a miss read. This placeholder will later be
+ * updated with real data from the backing_dev, making it no longer an empty key.
+ *
+ * If we delete or insert a key, the structure of the entire cache tree may change,
+ * requiring a full re-search of the tree to find a new insertion point.
+ */
+static int fixup_overlap_contained(struct pcache_cache_key *key,
+ struct pcache_cache_key *key_tmp, struct pcache_cache_subtree_walk_ctx *ctx)
+{
+ struct pcache_cache_tree *cache_tree = ctx->cache_tree;
+ int ret;
+
+ /*
+ * |-----------| key_tmp
+ * |====| key
+ */
+ if (cache_key_empty(key_tmp)) {
+ /* If key_tmp is empty, don't split it;
+ * it's a placeholder key for miss reads that will be updated later.
+ */
+ cache_key_cutback(key_tmp, cache_key_lend(key_tmp) - cache_key_lstart(key));
+ if (key_tmp->len == 0) {
+ cache_key_delete(key_tmp);
+ ret = -EAGAIN;
+ goto out;
+ }
+ } else {
+ struct pcache_cache_key *key_fixup;
+ bool need_research = false;
+
+ /* Allocate a new cache key for splitting key_tmp */
+ key_fixup = cache_key_alloc(cache_tree);
+ if (!key_fixup) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ cache_key_copy(key_fixup, key_tmp);
+
+ /* Split key_tmp based on the new key's range */
+ cache_key_cutback(key_tmp, cache_key_lend(key_tmp) - cache_key_lstart(key));
+ if (key_tmp->len == 0) {
+ cache_key_delete(key_tmp);
+ need_research = true;
+ }
+
+ /* Create a new portion for key_fixup */
+ cache_key_cutfront(key_fixup, cache_key_lend(key) - cache_key_lstart(key_tmp));
+ if (key_fixup->len == 0) {
+ cache_key_put(key_fixup);
+ } else {
+ /* Insert the new key into the cache */
+ ret = cache_key_insert(cache_tree, key_fixup, false);
+ if (ret)
+ goto out;
+ need_research = true;
+ }
+
+ if (need_research) {
+ ret = -EAGAIN;
+ goto out;
+ }
+ }
+
+ return 0;
+out:
+ return ret;
+}
+
+/**
+ * fixup_overlap_head - Handle overlap when a new key overlaps with the head of an existing key.
+ * @key: The new cache key being inserted.
+ * @key_tmp: The existing cache key that overlaps with the new key.
+ * @ctx: Context for the cache tree walk.
+ *
+ * This function adjusts the existing key if the new key overlaps
+ * with the beginning of it. If the resulting key length is zero
+ * after the adjustment, the key is deleted. This indicates that
+ * the key no longer holds valid data and requires the tree to be
+ * re-searched for a new insertion point.
+ */
+static int fixup_overlap_head(struct pcache_cache_key *key,
+ struct pcache_cache_key *key_tmp, struct pcache_cache_subtree_walk_ctx *ctx)
+{
+ /*
+ * |--------| key_tmp
+ * |==========| key
+ */
+ /* Adjust key_tmp by cutting back based on the new key's start */
+ cache_key_cutback(key_tmp, cache_key_lend(key_tmp) - cache_key_lstart(key));
+ if (key_tmp->len == 0) {
+ /* If the adjusted key_tmp length is zero, delete it */
+ cache_key_delete(key_tmp);
+ return -EAGAIN;
+ }
+
+ return 0;
+}
+
+/**
+ * cache_insert_fixup - Fix up overlaps when inserting a new key.
+ * @cache_tree: Pointer to the cache_tree structure.
+ * @key: The new cache key to insert.
+ * @prev_node: The last visited node during the search.
+ *
+ * This function initializes a walking context and calls the
+ * cache_subtree_walk function to handle potential overlaps between
+ * the new key and existing keys in the cache tree. Various
+ * fixup functions are provided to manage different overlap scenarios.
+ */
+static int cache_insert_fixup(struct pcache_cache_tree *cache_tree,
+ struct pcache_cache_key *key, struct rb_node *prev_node)
+{
+ struct pcache_cache_subtree_walk_ctx walk_ctx = { 0 };
+
+ /* Set up the context with the cache, start node, and new key */
+ walk_ctx.cache_tree = cache_tree;
+ walk_ctx.start_node = prev_node;
+ walk_ctx.key = key;
+
+ /* Assign overlap handling functions for different scenarios */
+ walk_ctx.overlap_tail = fixup_overlap_tail;
+ walk_ctx.overlap_head = fixup_overlap_head;
+ walk_ctx.overlap_contain = fixup_overlap_contain;
+ walk_ctx.overlap_contained = fixup_overlap_contained;
+
+ /* Begin walking the cache tree to fix overlaps */
+ return cache_subtree_walk(&walk_ctx);
+}
+
+/**
+ * cache_key_insert - Insert a new cache key into the cache tree.
+ * @cache_tree: Pointer to the cache_tree structure.
+ * @key: The cache key to insert.
+ * @fixup: Whether overlaps with existing keys should be fixed up during insertion.
+ *
+ * This function searches for the appropriate location to insert
+ * a new cache key into the cache tree. It handles key overlaps
+ * and ensures any invalid keys are removed before insertion.
+ *
+ * Returns 0 on success or a negative error code on failure.
+ */
+int cache_key_insert(struct pcache_cache_tree *cache_tree, struct pcache_cache_key *key, bool fixup)
+{
+ struct rb_node **new, *parent = NULL;
+ struct pcache_cache_subtree *cache_subtree;
+ struct pcache_cache_key *key_tmp = NULL, *key_next;
+ struct rb_node *prev_node = NULL;
+ LIST_HEAD(delete_key_list);
+ int ret;
+
+ cache_subtree = get_subtree(cache_tree, key->off);
+ key->cache_subtree = cache_subtree;
+search:
+ prev_node = cache_subtree_search(cache_subtree, key, &parent, &new, &delete_key_list);
+ if (!list_empty(&delete_key_list)) {
+ /* Remove invalid keys from the delete list */
+ list_for_each_entry_safe(key_tmp, key_next, &delete_key_list, list_node) {
+ list_del_init(&key_tmp->list_node);
+ cache_key_delete(key_tmp);
+ }
+ goto search;
+ }
+
+ if (fixup) {
+ ret = cache_insert_fixup(cache_tree, key, prev_node);
+ if (ret == -EAGAIN)
+ goto search;
+ if (ret)
+ goto out;
+ }
+
+ /* Link and insert the new key into the red-black tree */
+ rb_link_node(&key->rb_node, parent, new);
+ rb_insert_color(&key->rb_node, &cache_subtree->root);
+
+ return 0;
+out:
+ return ret;
+}
+
+/**
+ * clean_fn - Cleanup function to remove invalid keys from the cache tree.
+ * @work: Pointer to the work_struct associated with the cleanup.
+ *
+ * This function cleans up invalid keys from the cache tree in the background
+ * after a cache segment has been invalidated during cache garbage collection.
+ * It processes a maximum of PCACHE_CLEAN_KEYS_MAX keys per iteration and holds
+ * the tree lock to ensure thread safety.
+ */
+void clean_fn(struct work_struct *work)
+{
+ struct pcache_cache *cache = container_of(work, struct pcache_cache, clean_work);
+ struct pcache_cache_subtree *cache_subtree;
+ struct rb_node *node;
+ struct pcache_cache_key *key;
+ int i, count;
+
+ for (i = 0; i < cache->req_key_tree.n_subtrees; i++) {
+ cache_subtree = &cache->req_key_tree.subtrees[i];
+
+again:
+ if (pcache_is_stopping(CACHE_TO_PCACHE(cache)))
+ return;
+
+ /* Delete up to PCACHE_CLEAN_KEYS_MAX keys in one iteration */
+ count = 0;
+ spin_lock(&cache_subtree->tree_lock);
+ node = rb_first(&cache_subtree->root);
+ while (node) {
+ key = CACHE_KEY(node);
+ node = rb_next(node);
+ if (cache_key_invalid(key)) {
+ count++;
+ cache_key_delete(key);
+ }
+
+ if (count >= PCACHE_CLEAN_KEYS_MAX) {
+ /* Unlock and pause before continuing cleanup */
+ spin_unlock(&cache_subtree->tree_lock);
+ usleep_range(1000, 2000);
+ goto again;
+ }
+ }
+ spin_unlock(&cache_subtree->tree_lock);
+ }
+}
+
+/*
+ * kset_flush_fn - Flush work for a cache kset.
+ *
+ * This delayed work is queued from cache_key_append() for ksets that are
+ * not closed immediately there (a full kset, or a forced close, is handled
+ * right in cache_key_append()); when the work fires it closes the kset so
+ * that its keys reach the media.
+ *
+ * If cache_kset_close() detects that a new segment is needed to store the
+ * kset and no segment is available, it returns an error and the flush
+ * work is re-queued to retry later.
+ */
+void kset_flush_fn(struct work_struct *work)
+{
+ struct pcache_cache_kset *kset = container_of(work, struct pcache_cache_kset, flush_work.work);
+ struct pcache_cache *cache = kset->cache;
+ int ret;
+
+ if (pcache_is_stopping(CACHE_TO_PCACHE(cache)))
+ return;
+
+ spin_lock(&kset->kset_lock);
+ ret = cache_kset_close(cache, kset);
+ spin_unlock(&kset->kset_lock);
+
+ if (ret) {
+ /* Failed to flush kset, schedule a retry. */
+ queue_delayed_work(cache_get_wq(cache), &kset->flush_work, msecs_to_jiffies(100));
+ }
+}
+
+static int kset_replay(struct pcache_cache *cache, struct pcache_cache_kset_onmedia *kset_onmedia)
+{
+ struct pcache_cache_key_onmedia *key_onmedia;
+ struct pcache_cache_key *key;
+ int ret;
+ int i;
+
+ for (i = 0; i < kset_onmedia->key_num; i++) {
+ key_onmedia = &kset_onmedia->data[i];
+
+ key = cache_key_alloc(&cache->req_key_tree);
+ if (!key) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ ret = cache_key_decode(cache, key_onmedia, key);
+ if (ret) {
+ cache_key_put(key);
+ goto err;
+ }
+
+ /* Mark the segment as used in the segment map. */
+ set_bit(key->cache_pos.cache_seg->cache_seg_id, cache->seg_map);
+
+ /* Check if the segment generation is valid for insertion. */
+ if (key->seg_gen < key->cache_pos.cache_seg->gen) {
+ cache_key_put(key);
+ } else {
+ ret = cache_key_insert(&cache->req_key_tree, key, true);
+ if (ret) {
+ cache_key_put(key);
+ goto err;
+ }
+ }
+
+ cache_seg_get(key->cache_pos.cache_seg);
+ }
+
+ return 0;
+err:
+ return ret;
+}
+
+int cache_replay(struct pcache_cache *cache)
+{
+ struct dm_pcache *pcache = CACHE_TO_PCACHE(cache);
+ struct pcache_cache_pos pos_tail;
+ struct pcache_cache_pos *pos;
+ struct pcache_cache_kset_onmedia *kset_onmedia;
+ u32 to_copy, count = 0;
+ int ret = 0;
+
+ kset_onmedia = kzalloc(PCACHE_KSET_ONMEDIA_SIZE_MAX, GFP_KERNEL);
+ if (!kset_onmedia)
+ return -ENOMEM;
+
+ cache_pos_copy(&pos_tail, &cache->key_tail);
+ pos = &pos_tail;
+
+ /* Mark the segment as used in the segment map. */
+ set_bit(pos->cache_seg->cache_seg_id, cache->seg_map);
+
+ while (true) {
+ to_copy = min(PCACHE_KSET_ONMEDIA_SIZE_MAX, PCACHE_SEG_SIZE - pos->seg_off);
+ ret = copy_mc_to_kernel(kset_onmedia, cache_pos_addr(pos), to_copy);
+ if (ret) {
+ ret = -EIO;
+ goto out;
+ }
+
+ if (kset_onmedia->magic != PCACHE_KSET_MAGIC ||
+ kset_onmedia->crc != cache_kset_crc(kset_onmedia)) {
+ break;
+ }
+
+ /* Process the last kset and prepare for the next segment. */
+ if (kset_onmedia->flags & PCACHE_KSET_FLAGS_LAST) {
+ struct pcache_cache_segment *next_seg;
+
+ pcache_dev_debug(pcache, "last kset replay, next: %u\n", kset_onmedia->next_cache_seg_id);
+
+ next_seg = &cache->segments[kset_onmedia->next_cache_seg_id];
+
+ pos->cache_seg = next_seg;
+ pos->seg_off = 0;
+
+ set_bit(pos->cache_seg->cache_seg_id, cache->seg_map);
+ continue;
+ }
+
+ /* Replay the kset and check for errors. */
+ ret = kset_replay(cache, kset_onmedia);
+ if (ret)
+ goto out;
+
+ /* Advance the position after processing the kset. */
+ cache_pos_advance(pos, get_kset_onmedia_size(kset_onmedia));
+ if (++count > 512) {
+ cond_resched();
+ count = 0;
+ }
+ }
+
+ /* Update the key_head position after replaying. */
+ spin_lock(&cache->key_head_lock);
+ cache_pos_copy(&cache->key_head, pos);
+ spin_unlock(&cache->key_head_lock);
+out:
+ kfree(kset_onmedia);
+ return ret;
+}
+
+int cache_tree_init(struct pcache_cache *cache, struct pcache_cache_tree *cache_tree, u32 n_subtrees)
+{
+ int ret;
+ u32 i;
+
+ cache_tree->cache = cache;
+ cache_tree->n_subtrees = n_subtrees;
+
+ cache_tree->key_cache = KMEM_CACHE(pcache_cache_key, 0);
+ if (!cache_tree->key_cache) {
+ ret = -ENOMEM;
+ goto err;
+ }
+ /*
+ * Allocate and initialize the subtrees array.
+ * Each element is a cache tree structure that contains
+ * an RB tree root and a spinlock for protecting its contents.
+ */
+ cache_tree->subtrees = kvcalloc(cache_tree->n_subtrees, sizeof(struct pcache_cache_subtree), GFP_KERNEL);
+ if (!cache_tree->subtrees) {
+ ret = -ENOMEM;
+ goto destroy_key_cache;
+ }
+
+ for (i = 0; i < cache_tree->n_subtrees; i++) {
+ struct pcache_cache_subtree *cache_subtree = &cache_tree->subtrees[i];
+
+ cache_subtree->root = RB_ROOT;
+ spin_lock_init(&cache_subtree->tree_lock);
+ }
+
+ return 0;
+
+destroy_key_cache:
+ kmem_cache_destroy(cache_tree->key_cache);
+err:
+ return ret;
+}
+
+void cache_tree_exit(struct pcache_cache_tree *cache_tree)
+{
+ struct pcache_cache_subtree *cache_subtree;
+ struct rb_node *node;
+ struct pcache_cache_key *key;
+ u32 i;
+
+ for (i = 0; i < cache_tree->n_subtrees; i++) {
+ cache_subtree = &cache_tree->subtrees[i];
+
+ spin_lock(&cache_subtree->tree_lock);
+ node = rb_first(&cache_subtree->root);
+ while (node) {
+ key = CACHE_KEY(node);
+ node = rb_next(node);
+
+ cache_key_delete(key);
+ }
+ spin_unlock(&cache_subtree->tree_lock);
+ }
+ kvfree(cache_tree->subtrees);
+ kmem_cache_destroy(cache_tree->key_cache);
+}
--
2.34.1
* [RFC PATCH 09/11] dm-pcache: add cache_req
2025-06-05 14:22 [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices Dongsheng Yang
` (7 preceding siblings ...)
2025-06-05 14:23 ` [RFC PATCH 08/11] dm-pcache: add cache_key Dongsheng Yang
@ 2025-06-05 14:23 ` Dongsheng Yang
2025-06-05 14:23 ` [RFC PATCH 10/11] dm-pcache: add cache core Dongsheng Yang
` (2 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Dongsheng Yang @ 2025-06-05 14:23 UTC (permalink / raw)
To: mpatocka, agk, snitzer, axboe, hch, dan.j.williams,
Jonathan.Cameron
Cc: linux-block, linux-kernel, linux-cxl, nvdimm, dm-devel,
Dongsheng Yang
Introduce cache_req.c, the high-level engine that
drives I/O requests through dm-pcache. It decides whether data is served
from the cache or fetched from the backing device, allocates new cache
space on writes, and flushes dirty ksets when required.
* Read path
- Traverses the striped RB-trees to locate cached extents (a short walk-through follows below).
- Generates backing READ requests for gaps and inserts placeholder
“empty” keys to avoid duplicate fetches.
- Copies valid data directly from pmem into the caller’s bio; CRC and
generation checks guard against stale segments.
* Write path
- Allocates space in the current data segment via cache_data_alloc().
- Copies data from the bio into pmem, then inserts or updates keys,
splitting or trimming overlapped ranges as needed.
- Adds each new key to the active kset; forces kset close when FUA is
requested or the kset is full.
* Miss handling
- create_cache_miss_req() builds a backing READ, optionally attaching
an empty key.
- miss_read_end_req() replaces the placeholder with real data once the
READ completes, or deletes it on error.
* Flush support
- cache_flush() iterates over all ksets and forces them to close,
ensuring data durability when REQ_PREFLUSH is received.
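A short walk-through of the read path (the extents below are hypothetical):
assume only 4K-8K of the requested range is currently cached and a 16K READ
arrives at offset 0. The subtree walk splits it into:

    0K-4K  : gap before the cached extent -> a backing READ is created and
             a placeholder "empty" key is inserted for the range
    4K-8K  : cache hit -> cache_copy_to_req_bio() copies the data from pmem
             into the caller's bio (after the segment-generation check)
    8K-16K : gap after the cached extent -> another backing READ plus
             placeholder key

When the backing READs complete, miss_read_end_req() allocates cache space
for each placeholder, copies the returned data into pmem and appends the
now-valid keys to the current kset.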
Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
---
drivers/md/dm-pcache/cache_req.c | 810 +++++++++++++++++++++++++++++++
1 file changed, 810 insertions(+)
create mode 100644 drivers/md/dm-pcache/cache_req.c
diff --git a/drivers/md/dm-pcache/cache_req.c b/drivers/md/dm-pcache/cache_req.c
new file mode 100644
index 000000000000..ab4dd4446d70
--- /dev/null
+++ b/drivers/md/dm-pcache/cache_req.c
@@ -0,0 +1,810 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include "cache.h"
+#include "backing_dev.h"
+#include "cache_dev.h"
+#include "dm_pcache.h"
+
+static int cache_data_head_init(struct pcache_cache *cache)
+{
+ struct pcache_cache_segment *next_seg;
+ struct pcache_cache_data_head *data_head;
+
+ data_head = get_data_head(cache);
+ next_seg = get_cache_segment(cache);
+ if (!next_seg)
+ return -EBUSY;
+
+ cache_seg_get(next_seg);
+ data_head->head_pos.cache_seg = next_seg;
+ data_head->head_pos.seg_off = 0;
+
+ return 0;
+}
+
+/*
+ * cache_data_alloc - Allocate data for a cache key.
+ * @cache: Pointer to the cache structure.
+ * @key: Pointer to the cache key to allocate data for.
+ *
+ * This function tries to allocate space from the cache segment specified by the
+ * data head. If the remaining space in the segment is insufficient to allocate
+ * the requested length for the cache key, it will allocate whatever is available
+ * and adjust the key's length accordingly. This function does not allocate
+ * space that crosses segment boundaries.
+ */
+static int cache_data_alloc(struct pcache_cache *cache, struct pcache_cache_key *key)
+{
+ struct pcache_cache_data_head *data_head;
+ struct pcache_cache_pos *head_pos;
+ struct pcache_cache_segment *cache_seg;
+ u32 seg_remain;
+ u32 allocated = 0, to_alloc;
+ int ret = 0;
+
+ preempt_disable();
+ data_head = get_data_head(cache);
+again:
+ if (!data_head->head_pos.cache_seg) {
+ seg_remain = 0;
+ } else {
+ cache_pos_copy(&key->cache_pos, &data_head->head_pos);
+ key->seg_gen = key->cache_pos.cache_seg->gen;
+
+ head_pos = &data_head->head_pos;
+ cache_seg = head_pos->cache_seg;
+ seg_remain = cache_seg_remain(head_pos);
+ to_alloc = key->len - allocated;
+ }
+
+ if (seg_remain > to_alloc) {
+ /* If remaining space in segment is sufficient for the cache key, allocate it. */
+ cache_pos_advance(head_pos, to_alloc);
+ allocated += to_alloc;
+ cache_seg_get(cache_seg);
+ } else if (seg_remain) {
+ /* If remaining space is not enough, allocate the remaining space and adjust the cache key length. */
+ cache_pos_advance(head_pos, seg_remain);
+ key->len = seg_remain;
+
+ /* Get for key: obtain a reference to the cache segment for the key. */
+ cache_seg_get(cache_seg);
+ /* Put for head_pos->cache_seg: release the reference for the current head's segment. */
+ cache_seg_put(head_pos->cache_seg);
+ head_pos->cache_seg = NULL;
+ } else {
+ /* Initialize a new data head if no segment is available. */
+ ret = cache_data_head_init(cache);
+ if (ret)
+ goto out;
+
+ goto again;
+ }
+
+out:
+ preempt_enable();
+
+ return ret;
+}
+
+static int cache_copy_from_req_bio(struct pcache_cache *cache, struct pcache_cache_key *key,
+ struct pcache_request *pcache_req, u32 bio_off)
+{
+ struct pcache_cache_pos *pos = &key->cache_pos;
+ struct pcache_segment *segment;
+
+ segment = &pos->cache_seg->segment;
+
+ return segment_copy_from_bio(segment, pos->seg_off, key->len, pcache_req->bio, bio_off);
+}
+
+static int cache_copy_to_req_bio(struct pcache_cache *cache, struct pcache_request *pcache_req,
+ u32 bio_off, u32 len, struct pcache_cache_pos *pos, u64 key_gen)
+{
+ struct pcache_cache_segment *cache_seg = pos->cache_seg;
+ struct pcache_segment *segment = &cache_seg->segment;
+ int ret;
+
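+ /*
+ * cache_seg->gen is bumped when the segment is invalidated for reuse;
+ * a key carrying an older generation points at stale data, so fail
+ * the copy and let the caller drop the stale key.
+ */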
+ spin_lock(&cache_seg->gen_lock);
+ if (key_gen < cache_seg->gen) {
+ spin_unlock(&cache_seg->gen_lock);
+ return -EINVAL;
+ }
+
+ ret = segment_copy_to_bio(segment, pos->seg_off, len, pcache_req->bio, bio_off);
+ spin_unlock(&cache_seg->gen_lock);
+
+ return ret;
+}
+
+/**
+ * miss_read_end_req - Handle the end of a miss read request.
+ * @backing_req: Pointer to the completed backing request.
+ * @read_ret: Return value of read.
+ *
+ * This function is called when a backing request to read data from
+ * the backing_dev is completed. If the key associated with the request
+ * is empty (a placeholder), it allocates cache space for the key,
+ * copies the data read from the bio into the cache, and updates
+ * the key's status. If the key has been overwritten by a write
+ * request during this process, it will be deleted from the cache
+ * tree and no further action will be taken.
+ */
+static void miss_read_end_req(struct pcache_backing_dev_req *backing_req, int read_ret)
+{
+ void *priv_data = backing_req->priv_data;
+ struct pcache_request *pcache_req = backing_req->req.upper_req;
+ struct pcache_cache *cache = backing_req->backing_dev->cache;
+ int ret;
+
+ if (priv_data) {
+ struct pcache_cache_key *key;
+ struct pcache_cache_subtree *cache_subtree;
+
+ key = (struct pcache_cache_key *)priv_data;
+ cache_subtree = key->cache_subtree;
+
+ /* If this key was deleted from the cache_subtree by a write, key->flags will have been cleared,
+ * so cache_key_empty() returning true means the key is still in the cache_subtree.
+ */
+ spin_lock(&cache_subtree->tree_lock);
+ if (cache_key_empty(key)) {
+ /* Check if the backing request was successful. */
+ if (read_ret) {
+ cache_key_delete(key);
+ goto unlock;
+ }
+
+ /* Allocate cache space for the key and copy data from the backing_dev. */
+ ret = cache_data_alloc(cache, key);
+ if (ret) {
+ cache_key_delete(key);
+ goto unlock;
+ }
+
+ ret = cache_copy_from_req_bio(cache, key, pcache_req, backing_req->req.bio_off);
+ if (ret) {
+ cache_seg_put(key->cache_pos.cache_seg);
+ cache_key_delete(key);
+ goto unlock;
+ }
+ key->flags &= ~PCACHE_CACHE_KEY_FLAGS_EMPTY;
+ key->flags |= PCACHE_CACHE_KEY_FLAGS_CLEAN;
+
+ /* Append the key to the cache. */
+ ret = cache_key_append(cache, key, false);
+ if (ret) {
+ cache_seg_put(key->cache_pos.cache_seg);
+ cache_key_delete(key);
+ goto unlock;
+ }
+ }
+unlock:
+ spin_unlock(&cache_subtree->tree_lock);
+ cache_key_put(key);
+ }
+}
+
+/**
+ * submit_cache_miss_req - Submit a backing request when cache data is missing
+ * @cache: The cache context that manages cache operations
+ * @backing_req: The backing request created for the missing data range
+ *
+ * This function is used to handle cases where a cache read request cannot locate
+ * the required data in the cache. When such a miss occurs during `cache_subtree_walk`,
+ * it triggers a backing read request to fetch data from the backing storage.
+ *
+ * If `backing_req->priv_data` is set, it points to a `pcache_cache_key`, representing
+ * a new cache key to be inserted into the cache. The function calls `cache_key_insert`
+ * to attempt adding the key. On insertion failure, it releases the key reference and
+ * clears `priv_data` to avoid further processing.
+ */
+static void submit_cache_miss_req(struct pcache_cache *cache, struct pcache_backing_dev_req *backing_req)
+{
+ int ret;
+
+ if (backing_req->priv_data) {
+ struct pcache_cache_key *key;
+
+ /* Attempt to insert the key into the cache if priv_data is set */
+ key = (struct pcache_cache_key *)backing_req->priv_data;
+ ret = cache_key_insert(&cache->req_key_tree, key, true);
+ if (ret) {
+ /* Release the key if insertion fails */
+ cache_key_put(key);
+ backing_req->priv_data = NULL;
+ backing_req->ret = ret;
+ backing_dev_req_end(backing_req);
+ return;
+ }
+ }
+ backing_dev_req_submit(backing_req, false);
+}
+
+/**
+ * create_cache_miss_req - Create a backing read request for a cache miss
+ * @cache: The cache structure that manages cache operations
+ * @parent: The parent request structure initiating the miss read
+ * @off: Offset in the parent request to read from
+ * @len: Length of data to read from the backing_dev
+ * @insert_key: Determines whether to insert a placeholder empty key in the cache tree
+ *
+ * This function generates a new backing read request when a cache miss occurs. The
+ * `insert_key` parameter controls whether a placeholder (empty) cache key should be
+ * added to the cache tree to prevent multiple backing requests for the same missing
+ * data. Generally, when the miss read targets a range that is not yet present in the
+ * cache tree, a placeholder key is created and inserted.
+ *
+ * However, if the cache tree already has an empty key at the location for this
+ * read, there is no need to create another. Instead, this function just sends the
+ * new request without adding a duplicate placeholder.
+ *
+ * Returns:
+ * A pointer to the newly created request structure on success, or NULL on failure.
+ * If an empty key is created, it will be released if any errors occur during the
+ * process to ensure proper cleanup.
+ */
+static struct pcache_backing_dev_req *create_cache_miss_req(struct pcache_cache *cache, struct pcache_request *parent,
+ u32 off, u32 len, bool insert_key)
+{
+ struct pcache_backing_dev *backing_dev = cache->backing_dev;
+ struct pcache_backing_dev_req *backing_req;
+ struct pcache_cache_key *key = NULL;
+ struct pcache_backing_dev_req_opts req_opts = { 0 };
+
+ req_opts.type = BACKING_DEV_REQ_TYPE_REQ;
+ req_opts.gfp_mask = GFP_NOWAIT;
+ req_opts.req.upper_req = parent;
+ req_opts.req.req_off = off;
+ req_opts.req.len = len;
+ req_opts.end_fn = miss_read_end_req;
+
+ backing_req = backing_dev_req_create(backing_dev, &req_opts);
+ if (!backing_req)
+ goto out;
+
+ /* Allocate a new empty key if insert_key is set */
+ if (insert_key) {
+ key = cache_key_alloc(&cache->req_key_tree);
+ if (!key) {
+ backing_req->ret = -ENOMEM;
+ goto end_req;
+ }
+
+ /* Initialize the empty key with offset, length, and empty flag */
+ key->off = parent->off + off;
+ key->len = len;
+ key->flags |= PCACHE_CACHE_KEY_FLAGS_EMPTY;
+ }
+
+ /* Attach the empty key to the request if it was created */
+ if (key) {
+ cache_key_get(key);
+ backing_req->priv_data = key;
+ }
+
+ return backing_req;
+
+end_req:
+ backing_dev_req_end(backing_req);
+out:
+ return NULL;
+}
+
+static int send_cache_miss_req(struct pcache_cache *cache, struct pcache_request *pcache_req,
+ u32 off, u32 len, bool insert_key)
+{
+ struct pcache_backing_dev_req *backing_req;
+
+ backing_req = create_cache_miss_req(cache, pcache_req, off, len, insert_key);
+ if (!backing_req)
+ return -ENOMEM;
+
+ submit_cache_miss_req(cache, backing_req);
+
+ return 0;
+}
+
+/*
+ * In the process of walking the cache tree to locate cached data, this
+ * function handles the situation where the requested data range lies
+ * entirely before an existing cache node (`key_tmp`). This outcome
+ * signifies that the target data is absent from the cache (cache miss).
+ *
+ * To fulfill this portion of the read request, the function creates a
+ * backing request (`backing_req`) for the missing data range represented
+ * by `key`. It then appends this request to the submission list in the
+ * `ctx`, which will later be processed to retrieve the data from backing
+ * storage. After setting up the backing request, `req_done` in `ctx` is
+ * updated to reflect the length of the handled range, and the range
+ * in `key` is adjusted by trimming off the portion that is now handled.
+ *
+ * The scenario handled here:
+ *
+ * |--------| key_tmp (existing cached range)
+ * |====| key (requested range, preceding key_tmp)
+ *
+ * Since `key` is before `key_tmp`, it signifies that the requested data
+ * range is missing in the cache (cache miss) and needs retrieval from
+ * backing storage.
+ */
+static int read_before(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp,
+ struct pcache_cache_subtree_walk_ctx *ctx)
+{
+ struct pcache_backing_dev_req *backing_req;
+ int ret;
+
+ /*
+ * In this scenario, `key` represents a range that precedes `key_tmp`,
+ * meaning the requested data range is missing from the cache tree
+ * and must be retrieved from the backing_dev.
+ */
+ backing_req = create_cache_miss_req(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done, key->len, true);
+ if (!backing_req) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ list_add(&backing_req->node, ctx->submit_req_list);
+ ctx->req_done += key->len;
+ cache_key_cutfront(key, key->len);
+
+ return 0;
+out:
+ return ret;
+}
+
+/*
+ * During cache_subtree_walk, this function manages a scenario where part of the
+ * requested data range overlaps with an existing cache node (`key_tmp`).
+ *
+ * |----------------| key_tmp (existing cached range)
+ * |===========| key (requested range, overlapping the tail of key_tmp)
+ */
+static int read_overlap_tail(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp,
+ struct pcache_cache_subtree_walk_ctx *ctx)
+{
+ struct pcache_backing_dev_req *backing_req;
+ u32 io_len;
+ int ret;
+
+ /*
+ * Calculate the length of the non-overlapping portion of `key`
+ * before `key_tmp`, representing the data missing in the cache.
+ */
+ io_len = cache_key_lstart(key_tmp) - cache_key_lstart(key);
+ if (io_len) {
+ backing_req = create_cache_miss_req(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done, io_len, true);
+ if (!backing_req) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ list_add(&backing_req->node, ctx->submit_req_list);
+ ctx->req_done += io_len;
+ cache_key_cutfront(key, io_len);
+ }
+
+ /*
+ * Handle the overlapping portion by calculating the length of
+ * the remaining data in `key` that coincides with `key_tmp`.
+ */
+ io_len = cache_key_lend(key) - cache_key_lstart(key_tmp);
+ if (cache_key_empty(key_tmp)) {
+ ret = send_cache_miss_req(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done, io_len, false);
+ if (ret)
+ goto out;
+ } else {
+ ret = cache_copy_to_req_bio(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done,
+ io_len, &key_tmp->cache_pos, key_tmp->seg_gen);
+ if (ret) {
+ list_add(&key_tmp->list_node, ctx->delete_key_list);
+ goto out;
+ }
+ }
+
+ ctx->req_done += io_len;
+ cache_key_cutfront(key, io_len);
+
+ return 0;
+
+out:
+ return ret;
+}
+
+/*
+ * The scenario handled here:
+ *
+ * |----| key_tmp (existing cached range)
+ * |==========| key (requested range)
+ */
+static int read_overlap_contain(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp,
+ struct pcache_cache_subtree_walk_ctx *ctx)
+{
+ struct pcache_backing_dev_req *backing_req;
+ u32 io_len;
+ int ret;
+
+ /*
+ * Calculate the non-overlapping part of `key` before `key_tmp`
+ * to identify the missing data length.
+ */
+ io_len = cache_key_lstart(key_tmp) - cache_key_lstart(key);
+ if (io_len) {
+ backing_req = create_cache_miss_req(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done, io_len, true);
+ if (!backing_req) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ list_add(&backing_req->node, ctx->submit_req_list);
+
+ ctx->req_done += io_len;
+ cache_key_cutfront(key, io_len);
+ }
+
+ /*
+ * Handle the overlapping portion between `key` and `key_tmp`.
+ */
+ io_len = key_tmp->len;
+ if (cache_key_empty(key_tmp)) {
+ ret = send_cache_miss_req(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done, io_len, false);
+ if (ret)
+ goto out;
+ } else {
+ ret = cache_copy_to_req_bio(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done,
+ io_len, &key_tmp->cache_pos, key_tmp->seg_gen);
+ if (ret) {
+ list_add(&key_tmp->list_node, ctx->delete_key_list);
+ goto out;
+ }
+ }
+
+ ctx->req_done += io_len;
+ cache_key_cutfront(key, io_len);
+
+ return 0;
+out:
+ return ret;
+}
+
+/*
+ * |-----------| key_tmp (existing cached range)
+ * |====| key (requested range, fully within key_tmp)
+ *
+ * If `key_tmp` contains valid cached data, this function copies the relevant
+ * portion to the request's bio. Otherwise, it sends a backing request to
+ * fetch the required data range.
+ */
+static int read_overlap_contained(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp,
+ struct pcache_cache_subtree_walk_ctx *ctx)
+{
+ struct pcache_cache_pos pos;
+ int ret;
+
+ /*
+ * Check if `key_tmp` is empty, indicating a miss. If so, initiate
+ * a backing request to fetch the required data for `key`.
+ */
+ if (cache_key_empty(key_tmp)) {
+ ret = send_cache_miss_req(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done, key->len, false);
+ if (ret)
+ goto out;
+ } else {
+ cache_pos_copy(&pos, &key_tmp->cache_pos);
+ cache_pos_advance(&pos, cache_key_lstart(key) - cache_key_lstart(key_tmp));
+
+ ret = cache_copy_to_req_bio(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done,
+ key->len, &pos, key_tmp->seg_gen);
+ if (ret) {
+ list_add(&key_tmp->list_node, ctx->delete_key_list);
+ goto out;
+ }
+ }
+
+ ctx->req_done += key->len;
+ cache_key_cutfront(key, key->len);
+
+ return 0;
+out:
+ return ret;
+}
+
+/*
+ * |--------| key_tmp (existing cached range)
+ * |==========| key (requested range, overlapping the head of key_tmp)
+ */
+static int read_overlap_head(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp,
+ struct pcache_cache_subtree_walk_ctx *ctx)
+{
+ struct pcache_cache_pos pos;
+ u32 io_len;
+ int ret;
+
+ io_len = cache_key_lend(key_tmp) - cache_key_lstart(key);
+
+ if (cache_key_empty(key_tmp)) {
+ ret = send_cache_miss_req(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done, io_len, false);
+ if (ret)
+ goto out;
+ } else {
+ cache_pos_copy(&pos, &key_tmp->cache_pos);
+ cache_pos_advance(&pos, cache_key_lstart(key) - cache_key_lstart(key_tmp));
+
+ ret = cache_copy_to_req_bio(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done,
+ io_len, &pos, key_tmp->seg_gen);
+ if (ret) {
+ list_add(&key_tmp->list_node, ctx->delete_key_list);
+ goto out;
+ }
+ }
+
+ ctx->req_done += io_len;
+ cache_key_cutfront(key, io_len);
+
+ return 0;
+out:
+ return ret;
+}
+
+/*
+ * read_walk_finally - Finalizes the cache read tree walk by submitting any
+ * remaining backing requests
+ * @ctx: Context structure holding information about the cache,
+ * read request, and submission list
+ *
+ * This function is called at the end of the `cache_subtree_walk` during a
+ * cache read operation. It completes the walk by checking if any data
+ * requested by `key` was not found in the cache tree, and if so, it sends
+ * a backing request to retrieve that data. Then, it iterates through the
+ * submission list of backing requests created during the walk, removing
+ * each request from the list and submitting it.
+ *
+ * The scenario managed here includes:
+ * - Sending a backing request for the remaining length of `key` if it was
+ * not fulfilled by existing cache entries.
+ * - Iterating through `ctx->submit_req_list` to submit each backing request
+ * enqueued during the walk.
+ *
+ * This ensures all necessary backing requests for cache misses are submitted
+ * to the backing storage to retrieve any data that could not be found in
+ * the cache.
+ */
+static int read_walk_finally(struct pcache_cache_subtree_walk_ctx *ctx)
+{
+ struct pcache_backing_dev_req *backing_req, *next_req;
+ struct pcache_cache_key *key = ctx->key;
+ int ret;
+
+ if (key->len) {
+ ret = send_cache_miss_req(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done, key->len, true);
+ if (ret)
+ goto out;
+ ctx->req_done += key->len;
+ }
+
+ list_for_each_entry_safe(backing_req, next_req, ctx->submit_req_list, node) {
+ list_del_init(&backing_req->node);
+ submit_cache_miss_req(ctx->cache_tree->cache, backing_req);
+ }
+
+ return 0;
+
+out:
+ return ret;
+}
+
+/*
+ * This function is used within `cache_subtree_walk` to determine whether the
+ * read operation has covered the requested data length. It compares the
+ * amount of data processed (`ctx->req_done`) with the total data length
+ * specified in the original request (`ctx->pcache_req->data_len`).
+ *
+ * If `req_done` meets or exceeds the required data length, the function
+ * returns `true`, indicating the walk is complete. Otherwise, it returns `false`,
+ * signaling that additional data processing is needed to fulfill the request.
+ */
+static bool read_walk_done(struct pcache_cache_subtree_walk_ctx *ctx)
+{
+ return (ctx->req_done >= ctx->pcache_req->data_len);
+}
+
+/*
+ * cache_read - Process a read request by traversing the cache tree
+ * @cache: Cache structure holding cache trees and related configurations
+ * @pcache_req: Request structure with information about the data to read
+ *
+ * This function attempts to fulfill a read request by traversing the cache tree(s)
+ * to locate cached data for the requested range. If parts of the data are missing
+ * in the cache, backing requests are generated to retrieve the required segments.
+ *
+ * The function operates by initializing a key for the requested data range and
+ * preparing a context (`walk_ctx`) to manage the cache tree traversal. The context
+ * includes pointers to functions (e.g., `read_before`, `read_overlap_tail`) that handle
+ * specific conditions encountered during the traversal. The `walk_finally` and `walk_done`
+ * functions manage the end stages of the traversal, while the `delete_key_list` and
+ * `submit_req_list` lists track any keys to be deleted or requests to be submitted.
+ *
+ * The function first calculates the requested range and checks if it fits within the
+ * current cache tree (based on the tree's size limits). It then locks the cache tree
+ * and performs a search to locate any matching keys. If there are outdated keys,
+ * these are deleted, and the search is restarted to ensure accurate data retrieval.
+ *
+ * If the requested range spans multiple cache trees, the function moves on to the
+ * next tree once the current range has been processed. This continues until the
+ * entire requested data length has been handled.
+ */
+static int cache_read(struct pcache_cache *cache, struct pcache_request *pcache_req)
+{
+ struct pcache_cache_key key_data = { .off = pcache_req->off, .len = pcache_req->data_len };
+ struct pcache_cache_subtree *cache_subtree;
+ struct pcache_cache_key *key_tmp = NULL, *key_next;
+ struct rb_node *prev_node = NULL;
+ struct pcache_cache_key *key = &key_data;
+ struct pcache_cache_subtree_walk_ctx walk_ctx = { 0 };
+ LIST_HEAD(delete_key_list);
+ LIST_HEAD(submit_req_list);
+ int ret;
+
+ walk_ctx.cache_tree = &cache->req_key_tree;
+ walk_ctx.req_done = 0;
+ walk_ctx.pcache_req = pcache_req;
+ walk_ctx.before = read_before;
+ walk_ctx.overlap_tail = read_overlap_tail;
+ walk_ctx.overlap_head = read_overlap_head;
+ walk_ctx.overlap_contain = read_overlap_contain;
+ walk_ctx.overlap_contained = read_overlap_contained;
+ walk_ctx.walk_finally = read_walk_finally;
+ walk_ctx.walk_done = read_walk_done;
+ walk_ctx.delete_key_list = &delete_key_list;
+ walk_ctx.submit_req_list = &submit_req_list;
+
+next_tree:
+ key->off = pcache_req->off + walk_ctx.req_done;
+ key->len = pcache_req->data_len - walk_ctx.req_done;
+ if (key->len > PCACHE_CACHE_SUBTREE_SIZE - (key->off & PCACHE_CACHE_SUBTREE_SIZE_MASK))
+ key->len = PCACHE_CACHE_SUBTREE_SIZE - (key->off & PCACHE_CACHE_SUBTREE_SIZE_MASK);
+
+ cache_subtree = get_subtree(&cache->req_key_tree, key->off);
+ spin_lock(&cache_subtree->tree_lock);
+
+search:
+ prev_node = cache_subtree_search(cache_subtree, key, NULL, NULL, &delete_key_list);
+
+cleanup_tree:
+ if (!list_empty(&delete_key_list)) {
+ list_for_each_entry_safe(key_tmp, key_next, &delete_key_list, list_node) {
+ list_del_init(&key_tmp->list_node);
+ cache_key_delete(key_tmp);
+ }
+ goto search;
+ }
+
+ walk_ctx.start_node = prev_node;
+ walk_ctx.key = key;
+
+ ret = cache_subtree_walk(&walk_ctx);
+ if (ret == -EINVAL)
+ goto cleanup_tree;
+ else if (ret)
+ goto out;
+
+ spin_unlock(&cache_subtree->tree_lock);
+
+ if (walk_ctx.req_done < pcache_req->data_len)
+ goto next_tree;
+
+ return 0;
+out:
+ spin_unlock(&cache_subtree->tree_lock);
+
+ return ret;
+}
+
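+/*
+ * cache_write - Process a write request by inserting data into the cache
+ * @cache: Cache structure holding cache trees and related configurations
+ * @pcache_req: Request structure with information about the data to write
+ *
+ * The requested range is split into keys that never cross a cache subtree
+ * boundary. For each key, cache space is allocated, the bio data is copied
+ * into that space, the key is inserted into the request key tree, and the
+ * key is appended to a kset so it can be persisted. REQ_FUA on the bio is
+ * passed down as the force_close argument of cache_key_append(), closing
+ * the kset immediately for FUA writes.
+ */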
+static int cache_write(struct pcache_cache *cache, struct pcache_request *pcache_req)
+{
+ struct pcache_cache_subtree *cache_subtree;
+ struct pcache_cache_key *key;
+ u64 offset = pcache_req->off;
+ u32 length = pcache_req->data_len;
+ u32 io_done = 0;
+ int ret;
+
+ while (true) {
+ if (io_done >= length)
+ break;
+
+ key = cache_key_alloc(&cache->req_key_tree);
+ if (!key) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ key->off = offset + io_done;
+ key->len = length - io_done;
+ if (key->len > PCACHE_CACHE_SUBTREE_SIZE - (key->off & PCACHE_CACHE_SUBTREE_SIZE_MASK))
+ key->len = PCACHE_CACHE_SUBTREE_SIZE - (key->off & PCACHE_CACHE_SUBTREE_SIZE_MASK);
+
+ ret = cache_data_alloc(cache, key);
+ if (ret) {
+ cache_key_put(key);
+ goto err;
+ }
+
+ ret = cache_copy_from_req_bio(cache, key, pcache_req, io_done);
+ if (ret) {
+ cache_seg_put(key->cache_pos.cache_seg);
+ cache_key_put(key);
+ goto err;
+ }
+
+ cache_subtree = get_subtree(&cache->req_key_tree, key->off);
+ spin_lock(&cache_subtree->tree_lock);
+ ret = cache_key_insert(&cache->req_key_tree, key, true);
+ if (ret) {
+ cache_seg_put(key->cache_pos.cache_seg);
+ cache_key_put(key);
+ goto unlock;
+ }
+
+ ret = cache_key_append(cache, key, pcache_req->bio->bi_opf & REQ_FUA);
+ if (ret) {
+ cache_seg_put(key->cache_pos.cache_seg);
+ cache_key_delete(key);
+ goto unlock;
+ }
+
+ io_done += key->len;
+ spin_unlock(&cache_subtree->tree_lock);
+ }
+
+ return 0;
+unlock:
+ spin_unlock(&cache_subtree->tree_lock);
+err:
+ return ret;
+}
+
+/**
+ * cache_flush - Flush all ksets to persist any pending cache data
+ * @cache: Pointer to the cache structure
+ *
+ * This function iterates through all ksets associated with the provided `cache`
+ * and ensures that any data marked for persistence is written to media. For each
+ * kset, it acquires the kset lock, then invokes `cache_kset_close`, which handles
+ * the persistence logic for that kset.
+ *
+ * If `cache_kset_close` encounters an error, the function exits immediately with
+ * the respective error code, preventing the flush operation from proceeding to
+ * subsequent ksets.
+ */
+int cache_flush(struct pcache_cache *cache)
+{
+ struct pcache_cache_kset *kset;
+	u32 i;
+	int ret;
+
+ for (i = 0; i < cache->n_ksets; i++) {
+ kset = get_kset(cache, i);
+
+ spin_lock(&kset->kset_lock);
+ ret = cache_kset_close(cache, kset);
+ spin_unlock(&kset->kset_lock);
+
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+int pcache_cache_handle_req(struct pcache_cache *cache, struct pcache_request *pcache_req)
+{
+ struct bio *bio = pcache_req->bio;
+
+ if (unlikely(bio->bi_opf & REQ_PREFLUSH))
+ return cache_flush(cache);
+
+ if (bio_data_dir(bio) == READ)
+ return cache_read(cache, pcache_req);
+
+ return cache_write(cache, pcache_req);
+}
--
2.34.1
* [RFC PATCH 10/11] dm-pcache: add cache core
2025-06-05 14:22 [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices Dongsheng Yang
` (8 preceding siblings ...)
2025-06-05 14:23 ` [RFC PATCH 09/11] dm-pcache: add cache_req Dongsheng Yang
@ 2025-06-05 14:23 ` Dongsheng Yang
2025-06-05 14:23 ` [RFC PATCH 11/11] dm-pcache: initial dm-pcache target Dongsheng Yang
2025-06-12 16:57 ` [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices Mikulas Patocka
11 siblings, 0 replies; 18+ messages in thread
From: Dongsheng Yang @ 2025-06-05 14:23 UTC (permalink / raw)
To: mpatocka, agk, snitzer, axboe, hch, dan.j.williams,
Jonathan.Cameron
Cc: linux-block, linux-kernel, linux-cxl, nvdimm, dm-devel,
Dongsheng Yang
Add cache.c and cache.h that introduce the top-level
“struct pcache_cache”. This object glues together the backing block
device, the persistent-memory cache device, segment array, RB-tree
indexes, and the background workers for write-back and garbage
collection.
* Persistent metadata
- pcache_cache_info tracks options such as cache mode, data-crc flag
and GC threshold, written atomically with CRC+sequence.
- key_tail and dirty_tail positions are double-buffered and recovered
at mount time.
* Segment management
- kvcalloc()’d array of pcache_cache_segment objects, bitmap for fast
allocation, refcounts and generation numbers so GC can invalidate
old extents safely.
- First segment hosts a pcache_cache_ctrl block shared by all
threads.
* Request path hooks
- pcache_cache_handle_req() dispatches READ, WRITE and FLUSH bios to
the engines added in earlier patches.
- Per-CPU data_heads support lock-free allocation of space for new
writes.
* Background workers
- Delayed work items for write-back (5 s) and GC (5 s).
- clean_work removes stale keys after segments are reclaimed.
* Lifecycle helpers
- pcache_cache_start()/stop() bring the cache online, replay keys,
start workers, and flush everything on shutdown.
With this piece in place dm-pcache has a fully initialised cache object
capable of serving I/O and maintaining its on-disk structures.
Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
---
drivers/md/dm-pcache/cache.c | 443 ++++++++++++++++++++++++++
drivers/md/dm-pcache/cache.h | 601 +++++++++++++++++++++++++++++++++++
2 files changed, 1044 insertions(+)
create mode 100644 drivers/md/dm-pcache/cache.c
create mode 100644 drivers/md/dm-pcache/cache.h
diff --git a/drivers/md/dm-pcache/cache.c b/drivers/md/dm-pcache/cache.c
new file mode 100644
index 000000000000..af2e59f16be8
--- /dev/null
+++ b/drivers/md/dm-pcache/cache.c
@@ -0,0 +1,443 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#include <linux/blk_types.h>
+
+#include "cache.h"
+#include "cache_dev.h"
+#include "backing_dev.h"
+#include "dm_pcache.h"
+
+static inline struct pcache_cache_info *get_cache_info_addr(struct pcache_cache *cache)
+{
+ return cache->cache_info_addr + cache->info_index;
+}
+
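+/*
+ * cache_info_write - Persist cache_info to the cache device
+ *
+ * The on-media cache_info is kept in PCACHE_META_INDEX_MAX copies; writes
+ * alternate between them with an increasing sequence number and CRC, so
+ * pcache_meta_find_latest() can pick the newest valid copy on start-up.
+ */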
+static void cache_info_write(struct pcache_cache *cache)
+{
+ struct pcache_cache_info *cache_info = &cache->cache_info;
+
+ cache_info->header.seq++;
+ cache_info->header.crc = pcache_meta_crc(&cache_info->header,
+ sizeof(struct pcache_cache_info));
+
+ memcpy_flushcache(get_cache_info_addr(cache), cache_info,
+ sizeof(struct pcache_cache_info));
+
+ cache->info_index = (cache->info_index + 1) % PCACHE_META_INDEX_MAX;
+}
+
+static void cache_info_init_default(struct pcache_cache *cache);
+static int cache_info_init(struct pcache_cache *cache, struct pcache_cache_options *opts)
+{
+ struct dm_pcache *pcache = CACHE_TO_PCACHE(cache);
+ struct pcache_cache_info *cache_info_addr;
+
+ cache_info_addr = pcache_meta_find_latest(&cache->cache_info_addr->header,
+ sizeof(struct pcache_cache_info),
+ PCACHE_CACHE_INFO_SIZE,
+ &cache->cache_info);
+ if (IS_ERR(cache_info_addr))
+ return PTR_ERR(cache_info_addr);
+
+ if (cache_info_addr) {
+ if (opts->data_crc !=
+ (cache->cache_info.flags & PCACHE_CACHE_FLAGS_DATA_CRC)) {
+ pcache_dev_err(pcache, "invalid option for data_crc: %s, expected: %s",
+ opts->data_crc ? "true" : "false",
+ cache->cache_info.flags & PCACHE_CACHE_FLAGS_DATA_CRC ? "true" : "false");
+ return -EINVAL;
+ }
+
+ return 0;
+ }
+
+ /* init cache_info for new cache */
+ cache_info_init_default(cache);
+
+ cache->cache_info.flags &= ~PCACHE_CACHE_FLAGS_CACHE_MODE_MASK;
+ cache->cache_info.flags |= FIELD_PREP(PCACHE_CACHE_FLAGS_CACHE_MODE_MASK, opts->cache_mode);
+
+ if (opts->data_crc)
+ cache->cache_info.flags |= PCACHE_CACHE_FLAGS_DATA_CRC;
+
+ return 0;
+}
+
+static void cache_info_set_gc_percent(struct pcache_cache_info *cache_info, u8 percent)
+{
+ cache_info->flags &= ~PCACHE_CACHE_FLAGS_GC_PERCENT_MASK;
+ cache_info->flags |= FIELD_PREP(PCACHE_CACHE_FLAGS_GC_PERCENT_MASK, percent);
+}
+
+int pcache_cache_set_gc_percent(struct pcache_cache *cache, u8 percent)
+{
+ if (percent > PCACHE_CACHE_GC_PERCENT_MAX || percent < PCACHE_CACHE_GC_PERCENT_MIN)
+ return -EINVAL;
+
+ mutex_lock(&cache->cache_info_lock);
+ cache_info_set_gc_percent(&cache->cache_info, percent);
+
+ cache_info_write(cache);
+ mutex_unlock(&cache->cache_info_lock);
+
+ return 0;
+}
+
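+/*
+ * cache_pos_encode/cache_pos_decode - persist and recover a cache position
+ * (e.g. key_tail or dirty_tail).
+ *
+ * Positions are double-buffered on media: encode alternates between the
+ * PCACHE_META_INDEX_MAX slots, stamping each copy with an increasing
+ * sequence number and CRC; decode uses pcache_meta_find_latest() to select
+ * the newest valid slot after a restart.
+ */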
+void cache_pos_encode(struct pcache_cache *cache,
+ struct pcache_cache_pos_onmedia *pos_onmedia_base,
+ struct pcache_cache_pos *pos, u64 seq, u32 *index)
+{
+ struct pcache_cache_pos_onmedia pos_onmedia;
+ struct pcache_cache_pos_onmedia *pos_onmedia_addr = pos_onmedia_base + *index;
+
+ pos_onmedia.cache_seg_id = pos->cache_seg->cache_seg_id;
+ pos_onmedia.seg_off = pos->seg_off;
+ pos_onmedia.header.seq = seq;
+ pos_onmedia.header.crc = cache_pos_onmedia_crc(&pos_onmedia);
+
+ memcpy_flushcache(pos_onmedia_addr, &pos_onmedia, sizeof(struct pcache_cache_pos_onmedia));
+ pmem_wmb();
+
+ *index = (*index + 1) % PCACHE_META_INDEX_MAX;
+}
+
+int cache_pos_decode(struct pcache_cache *cache,
+ struct pcache_cache_pos_onmedia *pos_onmedia,
+ struct pcache_cache_pos *pos, u64 *seq, u32 *index)
+{
+ struct pcache_cache_pos_onmedia latest, *latest_addr;
+
+ latest_addr = pcache_meta_find_latest(&pos_onmedia->header,
+ sizeof(struct pcache_cache_pos_onmedia),
+ sizeof(struct pcache_cache_pos_onmedia),
+ &latest);
+ if (IS_ERR(latest_addr))
+ return PTR_ERR(latest_addr);
+
+ if (!latest_addr)
+ return -EIO;
+
+ pos->cache_seg = &cache->segments[latest.cache_seg_id];
+ pos->seg_off = latest.seg_off;
+ *seq = latest.header.seq;
+ *index = (latest_addr - pos_onmedia);
+
+ return 0;
+}
+
+static inline void cache_info_set_seg_id(struct pcache_cache *cache, u32 seg_id)
+{
+ cache->cache_info.seg_id = seg_id;
+}
+
+static int cache_init(struct dm_pcache *pcache)
+{
+ struct pcache_cache *cache = &pcache->cache;
+ struct pcache_backing_dev *backing_dev = &pcache->backing_dev;
+ struct pcache_cache_dev *cache_dev = &pcache->cache_dev;
+ int ret;
+
+ cache->segments = kvcalloc(cache_dev->seg_num, sizeof(struct pcache_cache_segment), GFP_KERNEL);
+ if (!cache->segments) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ cache->seg_map = bitmap_zalloc(cache_dev->seg_num, GFP_KERNEL);
+ if (!cache->seg_map) {
+ ret = -ENOMEM;
+ goto free_segments;
+ }
+
+ cache->req_cache = KMEM_CACHE(pcache_backing_dev_req, 0);
+ if (!cache->req_cache) {
+ ret = -ENOMEM;
+ goto free_bitmap;
+ }
+
+ cache->backing_dev = backing_dev;
+ cache->cache_dev = &pcache->cache_dev;
+ cache->n_segs = cache_dev->seg_num;
+ atomic_set(&cache->gc_errors, 0);
+ spin_lock_init(&cache->seg_map_lock);
+ spin_lock_init(&cache->key_head_lock);
+
+ mutex_init(&cache->cache_info_lock);
+ mutex_init(&cache->key_tail_lock);
+ mutex_init(&cache->dirty_tail_lock);
+ mutex_init(&cache->writeback_lock);
+
+ INIT_DELAYED_WORK(&cache->writeback_work, cache_writeback_fn);
+ INIT_DELAYED_WORK(&cache->gc_work, pcache_cache_gc_fn);
+ INIT_WORK(&cache->clean_work, clean_fn);
+
+ return 0;
+
+free_bitmap:
+ bitmap_free(cache->seg_map);
+free_segments:
+ kvfree(cache->segments);
+err:
+ return ret;
+}
+
+static void cache_exit(struct pcache_cache *cache)
+{
+ kmem_cache_destroy(cache->req_cache);
+ bitmap_free(cache->seg_map);
+ kvfree(cache->segments);
+}
+
+static void cache_info_init_default(struct pcache_cache *cache)
+{
+ struct pcache_cache_info *cache_info = &cache->cache_info;
+
+ cache_info->header.seq = 0;
+ cache_info->n_segs = cache->cache_dev->seg_num;
+ cache_info_set_gc_percent(cache_info, PCACHE_CACHE_GC_PERCENT_DEFAULT);
+}
+
+static int cache_tail_init(struct pcache_cache *cache)
+{
+ struct dm_pcache *pcache = CACHE_TO_PCACHE(cache);
+ bool new_cache = !(cache->cache_info.flags & PCACHE_CACHE_FLAGS_INIT_DONE);
+ int ret;
+
+ if (new_cache) {
+ set_bit(0, cache->seg_map);
+
+ cache->key_head.cache_seg = &cache->segments[0];
+ cache->key_head.seg_off = 0;
+ cache_pos_copy(&cache->key_tail, &cache->key_head);
+ cache_pos_copy(&cache->dirty_tail, &cache->key_head);
+
+ cache_encode_dirty_tail(cache);
+ cache_encode_key_tail(cache);
+ } else {
+ if (cache_decode_key_tail(cache) || cache_decode_dirty_tail(cache)) {
+ pcache_dev_err(pcache, "Corrupted key tail or dirty tail.\n");
+ ret = -EIO;
+ goto err;
+ }
+ }
+ return 0;
+err:
+ return ret;
+}
+
+static int get_seg_id(struct pcache_cache *cache,
+ struct pcache_cache_segment *prev_cache_seg,
+ bool new_cache, u32 *seg_id)
+{
+ struct dm_pcache *pcache = CACHE_TO_PCACHE(cache);
+ struct pcache_cache_dev *cache_dev = cache->cache_dev;
+ int ret;
+
+ if (new_cache) {
+ ret = cache_dev_get_empty_segment_id(cache_dev, seg_id);
+ if (ret) {
+ pcache_dev_err(pcache, "no available segment\n");
+ goto err;
+ }
+
+ if (prev_cache_seg)
+ cache_seg_set_next_seg(prev_cache_seg, *seg_id);
+ else
+ cache_info_set_seg_id(cache, *seg_id);
+ } else {
+ if (prev_cache_seg) {
+ struct pcache_segment_info *prev_seg_info;
+
+ prev_seg_info = &prev_cache_seg->cache_seg_info;
+ if (!segment_info_has_next(prev_seg_info)) {
+ ret = -EFAULT;
+ goto err;
+ }
+ *seg_id = prev_cache_seg->cache_seg_info.next_seg;
+ } else {
+ *seg_id = cache->cache_info.seg_id;
+ }
+ }
+ return 0;
+err:
+ return ret;
+}
+
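+/*
+ * cache_segs_init - Initialize the array of cache segments
+ *
+ * For a brand-new cache, empty segments are allocated from the cache_dev and
+ * chained together via next_seg (the first one is recorded in cache_info).
+ * For an existing cache, the recorded chain is walked to rediscover the same
+ * segments, and each segment is then initialized with cache_seg_init().
+ */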
+static int cache_segs_init(struct pcache_cache *cache)
+{
+ struct pcache_cache_segment *prev_cache_seg = NULL;
+ struct pcache_cache_info *cache_info = &cache->cache_info;
+ bool new_cache = !(cache->cache_info.flags & PCACHE_CACHE_FLAGS_INIT_DONE);
+ u32 seg_id;
+ int ret;
+ u32 i;
+
+ for (i = 0; i < cache_info->n_segs; i++) {
+ ret = get_seg_id(cache, prev_cache_seg, new_cache, &seg_id);
+ if (ret)
+ goto err;
+
+ ret = cache_seg_init(cache, seg_id, i, new_cache);
+ if (ret)
+ goto err;
+
+ prev_cache_seg = &cache->segments[i];
+ }
+ return 0;
+err:
+ return ret;
+}
+
+static int cache_init_req_keys(struct pcache_cache *cache, u32 n_paral)
+{
+ struct dm_pcache *pcache = CACHE_TO_PCACHE(cache);
+ u32 n_subtrees;
+ int ret;
+ u32 i, cpu;
+
+ /* Calculate number of cache trees based on the device size */
+ n_subtrees = DIV_ROUND_UP(cache->dev_size << SECTOR_SHIFT, PCACHE_CACHE_SUBTREE_SIZE);
+ ret = cache_tree_init(cache, &cache->req_key_tree, n_subtrees);
+ if (ret)
+ goto err;
+
+ cache->n_ksets = n_paral;
+ cache->ksets = kcalloc(cache->n_ksets, PCACHE_KSET_SIZE, GFP_KERNEL);
+ if (!cache->ksets) {
+ ret = -ENOMEM;
+ goto req_tree_exit;
+ }
+
+ /*
+ * Initialize each kset with a spinlock and delayed work for flushing.
+ * Each kset is associated with one queue to ensure independent handling
+ * of cache keys across multiple queues, maximizing multiqueue concurrency.
+ */
+ for (i = 0; i < cache->n_ksets; i++) {
+ struct pcache_cache_kset *kset = get_kset(cache, i);
+
+ kset->cache = cache;
+ spin_lock_init(&kset->kset_lock);
+ INIT_DELAYED_WORK(&kset->flush_work, kset_flush_fn);
+ }
+
+ cache->data_heads = alloc_percpu(struct pcache_cache_data_head);
+ if (!cache->data_heads) {
+ ret = -ENOMEM;
+ goto free_kset;
+ }
+
+ for_each_possible_cpu(cpu) {
+ struct pcache_cache_data_head *h =
+ per_cpu_ptr(cache->data_heads, cpu);
+ h->head_pos.cache_seg = NULL;
+ }
+
+ /*
+ * Replay persisted cache keys using cache_replay.
+ * This function loads and replays cache keys from previously stored
+ * ksets, allowing the cache to restore its state after a restart.
+ */
+ ret = cache_replay(cache);
+ if (ret) {
+ pcache_dev_err(pcache, "failed to replay keys\n");
+ goto free_heads;
+ }
+
+ return 0;
+
+free_heads:
+ free_percpu(cache->data_heads);
+free_kset:
+ kfree(cache->ksets);
+req_tree_exit:
+ cache_tree_exit(&cache->req_key_tree);
+err:
+ return ret;
+}
+
+static void cache_destroy_req_keys(struct pcache_cache *cache)
+{
+ u32 i;
+
+ for (i = 0; i < cache->n_ksets; i++) {
+ struct pcache_cache_kset *kset = get_kset(cache, i);
+
+ cancel_delayed_work_sync(&kset->flush_work);
+ }
+
+ free_percpu(cache->data_heads);
+ kfree(cache->ksets);
+ cache_tree_exit(&cache->req_key_tree);
+}
+
+int pcache_cache_start(struct dm_pcache *pcache, struct pcache_cache_options *opts)
+{
+ struct pcache_backing_dev *backing_dev = &pcache->backing_dev;
+ struct pcache_cache *cache = &pcache->cache;
+ int ret;
+
+ ret = cache_init(pcache);
+ if (ret)
+ return ret;
+
+ cache->cache_info_addr = CACHE_DEV_CACHE_INFO(cache->cache_dev);
+ cache->cache_ctrl = CACHE_DEV_CACHE_CTRL(cache->cache_dev);
+ backing_dev->cache = cache;
+ cache->dev_size = backing_dev->dev_size;
+
+ ret = cache_info_init(cache, opts);
+ if (ret)
+ goto cache_exit;
+
+ ret = cache_segs_init(cache);
+ if (ret)
+ goto cache_exit;
+
+ ret = cache_tail_init(cache);
+ if (ret)
+ goto cache_exit;
+
+ ret = cache_init_req_keys(cache, num_online_cpus());
+ if (ret)
+ goto cache_exit;
+
+ ret = cache_writeback_init(cache);
+ if (ret)
+ goto destroy_keys;
+
+ cache->cache_info.flags |= PCACHE_CACHE_FLAGS_INIT_DONE;
+ cache_info_write(cache);
+ queue_delayed_work(cache_get_wq(cache), &cache->gc_work, 0);
+
+ return 0;
+
+destroy_keys:
+ cache_destroy_req_keys(cache);
+cache_exit:
+ cache_exit(cache);
+
+ return ret;
+}
+
+void pcache_cache_stop(struct dm_pcache *pcache)
+{
+ struct pcache_cache *cache = &pcache->cache;
+
+ cache_flush(cache);
+
+ cancel_delayed_work_sync(&cache->gc_work);
+ flush_work(&cache->clean_work);
+ cache_writeback_exit(cache);
+
+ if (cache->req_key_tree.n_subtrees)
+ cache_destroy_req_keys(cache);
+
+ cache_exit(cache);
+}
+
+struct workqueue_struct *cache_get_wq(struct pcache_cache *cache)
+{
+ struct dm_pcache *pcache = CACHE_TO_PCACHE(cache);
+
+ return pcache->task_wq;
+}
diff --git a/drivers/md/dm-pcache/cache.h b/drivers/md/dm-pcache/cache.h
new file mode 100644
index 000000000000..edb1f9d85c1c
--- /dev/null
+++ b/drivers/md/dm-pcache/cache.h
@@ -0,0 +1,601 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _PCACHE_CACHE_H
+#define _PCACHE_CACHE_H
+
+#include "segment.h"
+
+/* Garbage collection thresholds */
+#define PCACHE_CACHE_GC_PERCENT_MIN 0 /* Minimum GC percentage */
+#define PCACHE_CACHE_GC_PERCENT_MAX 90 /* Maximum GC percentage */
+#define PCACHE_CACHE_GC_PERCENT_DEFAULT 70 /* Default GC percentage */
+
+#define PCACHE_CACHE_SUBTREE_SIZE (4 * 1024 * 1024) /* 4 MiB per subtree */
+#define PCACHE_CACHE_SUBTREE_SIZE_MASK 0x3FFFFF /* Offset mask within a subtree */
+#define PCACHE_CACHE_SUBTREE_SIZE_SHIFT 22 /* Bit shift for subtree size */
+
+/* Maximum number of keys per key set */
+#define PCACHE_KSET_KEYS_MAX 128
+#define PCACHE_CACHE_SEGS_MAX (1024 * 1024) /* maximum cache size for each device is 16T */
+#define PCACHE_KSET_ONMEDIA_SIZE_MAX struct_size_t(struct pcache_cache_kset_onmedia, data, PCACHE_KSET_KEYS_MAX)
+#define PCACHE_KSET_SIZE (sizeof(struct pcache_cache_kset) + sizeof(struct pcache_cache_key_onmedia) * PCACHE_KSET_KEYS_MAX)
+
+/* Maximum number of keys to clean in one round of clean_work */
+#define PCACHE_CLEAN_KEYS_MAX 10
+
+/* Writeback and garbage collection intervals in jiffies */
+#define PCACHE_CACHE_WRITEBACK_INTERVAL (5 * HZ)
+#define PCACHE_CACHE_GC_INTERVAL (5 * HZ)
+
+/* Macro to get the cache key structure from an rb_node pointer */
+#define CACHE_KEY(node) (container_of(node, struct pcache_cache_key, rb_node))
+
+struct pcache_cache_pos_onmedia {
+ struct pcache_meta_header header;
+ __u32 cache_seg_id;
+ __u32 seg_off;
+} __packed;
+
+/* Offset and size definitions for cache segment control */
+#define PCACHE_CACHE_SEG_CTRL_OFF (PCACHE_SEG_INFO_SIZE * PCACHE_META_INDEX_MAX)
+#define PCACHE_CACHE_SEG_CTRL_SIZE (4 * PCACHE_KB)
+
+struct pcache_cache_seg_gen {
+ struct pcache_meta_header header;
+ __u64 gen;
+} __packed;
+
+/* Control structure for cache segments */
+struct pcache_cache_seg_ctrl {
+ struct pcache_cache_seg_gen gen[PCACHE_META_INDEX_MAX];
+ __u64 res[64];
+} __packed;
+
+#define PCACHE_CACHE_FLAGS_DATA_CRC BIT(0)
+#define PCACHE_CACHE_FLAGS_INIT_DONE BIT(1)
+
+#define PCACHE_CACHE_FLAGS_CACHE_MODE_MASK GENMASK(5, 2)
+#define PCACHE_CACHE_MODE_WRITEBACK 0
+#define PCACHE_CACHE_MODE_WRITETHROUGH 1
+#define PCACHE_CACHE_MODE_WRITEAROUND 2
+#define PCACHE_CACHE_MODE_WRITEONLY 3
+
+#define PCACHE_CACHE_FLAGS_GC_PERCENT_MASK GENMASK(12, 6)
+
+struct pcache_cache_info {
+ struct pcache_meta_header header;
+ __u32 seg_id;
+ __u32 n_segs;
+ __u32 flags;
+ __u32 reserved;
+} __packed;
+
+struct pcache_cache_pos {
+ struct pcache_cache_segment *cache_seg;
+ u32 seg_off;
+};
+
+struct pcache_cache_segment {
+ struct pcache_cache *cache;
+ u32 cache_seg_id; /* Index in cache->segments */
+ struct pcache_segment segment;
+ atomic_t refs;
+
+ struct pcache_segment_info cache_seg_info;
+ struct mutex info_lock;
+ u32 info_index;
+
+ spinlock_t gen_lock;
+ u64 gen;
+ u64 gen_seq;
+ u32 gen_index;
+
+ struct pcache_cache_seg_ctrl *cache_seg_ctrl;
+ struct mutex ctrl_lock;
+};
+
+/* rbtree for cache entries */
+struct pcache_cache_subtree {
+ struct rb_root root;
+ spinlock_t tree_lock;
+};
+
+struct pcache_cache_tree {
+ struct pcache_cache *cache;
+ u32 n_subtrees;
+ struct kmem_cache *key_cache;
+ struct pcache_cache_subtree *subtrees;
+};
+
+struct pcache_cache_key {
+ struct pcache_cache_tree *cache_tree;
+ struct pcache_cache_subtree *cache_subtree;
+ struct kref ref;
+ struct rb_node rb_node;
+ struct list_head list_node;
+ u64 off;
+ u32 len;
+ u32 flags;
+ struct pcache_cache_pos cache_pos;
+ u64 seg_gen;
+};
+
+#define PCACHE_CACHE_KEY_FLAGS_EMPTY BIT(0)
+#define PCACHE_CACHE_KEY_FLAGS_CLEAN BIT(1)
+
+struct pcache_cache_key_onmedia {
+ __u64 off;
+ __u32 len;
+ __u32 flags;
+ __u32 cache_seg_id;
+ __u32 cache_seg_off;
+ __u64 seg_gen;
+ __u32 data_crc;
+ __u32 reserved;
+} __packed;
+
+struct pcache_cache_kset_onmedia {
+ __u32 crc;
+ union {
+ __u32 key_num;
+ __u32 next_cache_seg_id;
+ };
+ __u64 magic;
+ __u64 flags;
+ struct pcache_cache_key_onmedia data[];
+} __packed;
+
+struct pcache_cache {
+ struct pcache_backing_dev *backing_dev;
+ struct pcache_cache_dev *cache_dev;
+ struct pcache_cache_ctrl *cache_ctrl;
+ u64 dev_size;
+
+ struct pcache_cache_data_head __percpu *data_heads;
+
+ spinlock_t key_head_lock;
+ struct pcache_cache_pos key_head;
+ u32 n_ksets;
+ struct pcache_cache_kset *ksets;
+
+ struct mutex key_tail_lock;
+ struct pcache_cache_pos key_tail;
+ u64 key_tail_seq;
+ u32 key_tail_index;
+
+ struct mutex dirty_tail_lock;
+ struct pcache_cache_pos dirty_tail;
+ u64 dirty_tail_seq;
+ u32 dirty_tail_index;
+
+ struct pcache_cache_tree req_key_tree;
+ struct work_struct clean_work;
+
+ struct mutex writeback_lock;
+ char wb_kset_onmedia_buf[PCACHE_KSET_ONMEDIA_SIZE_MAX];
+ struct pcache_cache_tree writeback_key_tree;
+ struct delayed_work writeback_work;
+
+ char gc_kset_onmedia_buf[PCACHE_KSET_ONMEDIA_SIZE_MAX];
+ struct delayed_work gc_work;
+ atomic_t gc_errors;
+
+ struct kmem_cache *req_cache;
+
+ struct mutex cache_info_lock;
+ struct pcache_cache_info cache_info;
+ struct pcache_cache_info *cache_info_addr;
+ u32 info_index;
+
+ u32 n_segs;
+ unsigned long *seg_map;
+ u32 last_cache_seg;
+ spinlock_t seg_map_lock;
+ struct pcache_cache_segment *segments;
+};
+
+struct workqueue_struct *cache_get_wq(struct pcache_cache *cache);
+
+struct dm_pcache;
+struct pcache_cache_options {
+ u32 cache_mode:4;
+ u32 data_crc:1;
+};
+int pcache_cache_start(struct dm_pcache *pcache, struct pcache_cache_options *opts);
+void pcache_cache_stop(struct dm_pcache *pcache);
+
+struct pcache_cache_ctrl {
+ /* Updated by gc_thread */
+ struct pcache_cache_pos_onmedia key_tail_pos[PCACHE_META_INDEX_MAX];
+
+ /* Updated by writeback_thread */
+ struct pcache_cache_pos_onmedia dirty_tail_pos[PCACHE_META_INDEX_MAX];
+};
+
+struct pcache_cache_data_head {
+ struct pcache_cache_pos head_pos;
+};
+
+static inline u16 pcache_cache_get_gc_percent(struct pcache_cache *cache)
+{
+ return FIELD_GET(PCACHE_CACHE_FLAGS_GC_PERCENT_MASK, cache->cache_info.flags);
+}
+
+int pcache_cache_set_gc_percent(struct pcache_cache *cache, u8 percent);
+
+/* cache key */
+struct pcache_cache_key *cache_key_alloc(struct pcache_cache_tree *cache_tree);
+void cache_key_init(struct pcache_cache_tree *cache_tree, struct pcache_cache_key *key);
+void cache_key_get(struct pcache_cache_key *key);
+void cache_key_put(struct pcache_cache_key *key);
+int cache_key_append(struct pcache_cache *cache, struct pcache_cache_key *key, bool force_close);
+int cache_key_insert(struct pcache_cache_tree *cache_tree, struct pcache_cache_key *key, bool fixup);
+int cache_key_decode(struct pcache_cache *cache,
+ struct pcache_cache_key_onmedia *key_onmedia,
+ struct pcache_cache_key *key);
+void cache_pos_advance(struct pcache_cache_pos *pos, u32 len);
+
+#define PCACHE_KSET_FLAGS_LAST BIT(0)
+#define PCACHE_KSET_MAGIC 0x676894a64e164f1aULL
+
+struct pcache_cache_kset {
+ struct pcache_cache *cache;
+ spinlock_t kset_lock;
+ struct delayed_work flush_work;
+ struct pcache_cache_kset_onmedia kset_onmedia;
+};
+
+extern struct pcache_cache_kset_onmedia pcache_empty_kset;
+
+struct pcache_cache_subtree_walk_ctx {
+ struct pcache_cache_tree *cache_tree;
+ struct rb_node *start_node;
+ struct pcache_request *pcache_req;
+ u32 req_done;
+ struct pcache_cache_key *key;
+
+ struct list_head *delete_key_list;
+ struct list_head *submit_req_list;
+
+ /*
+ * |--------| key_tmp
+ * |====| key
+ */
+ int (*before)(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp,
+ struct pcache_cache_subtree_walk_ctx *ctx);
+
+ /*
+ * |----------| key_tmp
+ * |=====| key
+ */
+ int (*after)(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp,
+ struct pcache_cache_subtree_walk_ctx *ctx);
+
+ /*
+ * |----------------| key_tmp
+ * |===========| key
+ */
+ int (*overlap_tail)(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp,
+ struct pcache_cache_subtree_walk_ctx *ctx);
+
+ /*
+ * |--------| key_tmp
+ * |==========| key
+ */
+ int (*overlap_head)(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp,
+ struct pcache_cache_subtree_walk_ctx *ctx);
+
+ /*
+ * |----| key_tmp
+ * |==========| key
+ */
+ int (*overlap_contain)(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp,
+ struct pcache_cache_subtree_walk_ctx *ctx);
+
+ /*
+ * |-----------| key_tmp
+ * |====| key
+ */
+ int (*overlap_contained)(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp,
+ struct pcache_cache_subtree_walk_ctx *ctx);
+
+ int (*walk_finally)(struct pcache_cache_subtree_walk_ctx *ctx);
+ bool (*walk_done)(struct pcache_cache_subtree_walk_ctx *ctx);
+};
+
+int cache_subtree_walk(struct pcache_cache_subtree_walk_ctx *ctx);
+struct rb_node *cache_subtree_search(struct pcache_cache_subtree *cache_subtree, struct pcache_cache_key *key,
+ struct rb_node **parentp, struct rb_node ***newp,
+ struct list_head *delete_key_list);
+int cache_kset_close(struct pcache_cache *cache, struct pcache_cache_kset *kset);
+void clean_fn(struct work_struct *work);
+void kset_flush_fn(struct work_struct *work);
+int cache_replay(struct pcache_cache *cache);
+int cache_tree_init(struct pcache_cache *cache, struct pcache_cache_tree *cache_tree, u32 n_subtrees);
+void cache_tree_exit(struct pcache_cache_tree *cache_tree);
+
+/* cache segments */
+struct pcache_cache_segment *get_cache_segment(struct pcache_cache *cache);
+int cache_seg_init(struct pcache_cache *cache, u32 seg_id, u32 cache_seg_id,
+ bool new_cache);
+void cache_seg_get(struct pcache_cache_segment *cache_seg);
+void cache_seg_put(struct pcache_cache_segment *cache_seg);
+void cache_seg_set_next_seg(struct pcache_cache_segment *cache_seg, u32 seg_id);
+
+/* cache request */
+int cache_flush(struct pcache_cache *cache);
+void miss_read_end_work_fn(struct work_struct *work);
+int pcache_cache_handle_req(struct pcache_cache *cache, struct pcache_request *pcache_req);
+
+/* gc */
+void pcache_cache_gc_fn(struct work_struct *work);
+
+/* writeback */
+void cache_writeback_exit(struct pcache_cache *cache);
+int cache_writeback_init(struct pcache_cache *cache);
+void cache_writeback_fn(struct work_struct *work);
+
+/* inline functions */
+static inline struct pcache_cache_subtree *get_subtree(struct pcache_cache_tree *cache_tree, u64 off)
+{
+ if (cache_tree->n_subtrees == 1)
+ return &cache_tree->subtrees[0];
+
+ return &cache_tree->subtrees[off >> PCACHE_CACHE_SUBTREE_SIZE_SHIFT];
+}
+
+static inline void *cache_pos_addr(struct pcache_cache_pos *pos)
+{
+ return (pos->cache_seg->segment.data + pos->seg_off);
+}
+
+static inline void *get_key_head_addr(struct pcache_cache *cache)
+{
+ return cache_pos_addr(&cache->key_head);
+}
+
+static inline u32 get_kset_id(struct pcache_cache *cache, u64 off)
+{
+ return (off >> PCACHE_CACHE_SUBTREE_SIZE_SHIFT) % cache->n_ksets;
+}
+
+static inline struct pcache_cache_kset *get_kset(struct pcache_cache *cache, u32 kset_id)
+{
+ return (void *)cache->ksets + PCACHE_KSET_SIZE * kset_id;
+}
+
+static inline struct pcache_cache_data_head *get_data_head(struct pcache_cache *cache)
+{
+ return this_cpu_ptr(cache->data_heads);
+}
+
+static inline bool cache_key_empty(struct pcache_cache_key *key)
+{
+ return key->flags & PCACHE_CACHE_KEY_FLAGS_EMPTY;
+}
+
+static inline bool cache_key_clean(struct pcache_cache_key *key)
+{
+ return key->flags & PCACHE_CACHE_KEY_FLAGS_CLEAN;
+}
+
+static inline void cache_pos_copy(struct pcache_cache_pos *dst, struct pcache_cache_pos *src)
+{
+ memcpy(dst, src, sizeof(struct pcache_cache_pos));
+}
+
+/**
+ * cache_seg_is_ctrl_seg - Checks if a cache segment is a cache ctrl segment.
+ * @cache_seg_id: ID of the cache segment.
+ *
+ * Returns true if the cache segment ID corresponds to a cache ctrl segment.
+ *
+ * Note: We extend the segment control of the first cache segment
+ * (cache segment ID 0) to serve as the cache control (pcache_cache_ctrl)
+ * for the entire PCACHE cache. This function determines whether the given
+ * cache segment is the one storing the pcache_cache_ctrl information.
+ */
+static inline bool cache_seg_is_ctrl_seg(u32 cache_seg_id)
+{
+ return (cache_seg_id == 0);
+}
+
+/**
+ * cache_key_cutfront - Cuts a specified length from the front of a cache key.
+ * @key: Pointer to pcache_cache_key structure.
+ * @cut_len: Length to cut from the front.
+ *
+ * Advances the cache key position by cut_len and adjusts offset and length accordingly.
+ */
+static inline void cache_key_cutfront(struct pcache_cache_key *key, u32 cut_len)
+{
+ if (key->cache_pos.cache_seg)
+ cache_pos_advance(&key->cache_pos, cut_len);
+
+ key->off += cut_len;
+ key->len -= cut_len;
+}
+
+/**
+ * cache_key_cutback - Cuts a specified length from the back of a cache key.
+ * @key: Pointer to pcache_cache_key structure.
+ * @cut_len: Length to cut from the back.
+ *
+ * Reduces the length of the cache key by cut_len.
+ */
+static inline void cache_key_cutback(struct pcache_cache_key *key, u32 cut_len)
+{
+ key->len -= cut_len;
+}
+
+static inline void cache_key_delete(struct pcache_cache_key *key)
+{
+ struct pcache_cache_subtree *cache_subtree;
+
+ cache_subtree = key->cache_subtree;
+ if (!cache_subtree)
+ return;
+
+ rb_erase(&key->rb_node, &cache_subtree->root);
+ key->flags = 0;
+ cache_key_put(key);
+}
+
+static inline bool cache_data_crc_on(struct pcache_cache *cache)
+{
+ return (cache->cache_info.flags & PCACHE_CACHE_FLAGS_DATA_CRC);
+}
+
+/**
+ * cache_key_data_crc - Calculates CRC for data in a cache key.
+ * @key: Pointer to the pcache_cache_key structure.
+ *
+ * Returns the CRC-32 checksum of the data within the cache key's position.
+ */
+static inline u32 cache_key_data_crc(struct pcache_cache_key *key)
+{
+ void *data;
+
+ data = cache_pos_addr(&key->cache_pos);
+
+ return crc32(PCACHE_CRC_SEED, data, key->len);
+}
+
+static inline u32 cache_kset_crc(struct pcache_cache_kset_onmedia *kset_onmedia)
+{
+ u32 crc_size;
+
+ if (kset_onmedia->flags & PCACHE_KSET_FLAGS_LAST)
+ crc_size = sizeof(struct pcache_cache_kset_onmedia) - 4;
+ else
+ crc_size = struct_size(kset_onmedia, data, kset_onmedia->key_num) - 4;
+
+ return crc32(PCACHE_CRC_SEED, (void *)kset_onmedia + 4, crc_size);
+}
+
+static inline u32 get_kset_onmedia_size(struct pcache_cache_kset_onmedia *kset_onmedia)
+{
+ return struct_size_t(struct pcache_cache_kset_onmedia, data, kset_onmedia->key_num);
+}
+
+/**
+ * cache_seg_remain - Computes remaining space in a cache segment.
+ * @pos: Pointer to pcache_cache_pos structure.
+ *
+ * Returns the amount of remaining space in the segment data starting from
+ * the current position offset.
+ */
+static inline u32 cache_seg_remain(struct pcache_cache_pos *pos)
+{
+ struct pcache_cache_segment *cache_seg;
+ struct pcache_segment *segment;
+ u32 seg_remain;
+
+ cache_seg = pos->cache_seg;
+ segment = &cache_seg->segment;
+ seg_remain = segment->data_size - pos->seg_off;
+
+ return seg_remain;
+}
+
+/**
+ * cache_key_invalid - Checks if a cache key is invalid.
+ * @key: Pointer to pcache_cache_key structure.
+ *
+ * Returns true if the cache key is invalid due to its generation being
+ * less than the generation of its segment; otherwise returns false.
+ *
+ * When the GC (garbage collection) thread identifies a segment
+ * as reclaimable, it increments the segment's generation (gen). However,
+ * it does not immediately remove all related cache keys. When accessing
+ * such a cache key, this function can be used to determine if the cache
+ * key has already become invalid.
+ */
+static inline bool cache_key_invalid(struct pcache_cache_key *key)
+{
+ if (cache_key_empty(key))
+ return false;
+
+ return (key->seg_gen < key->cache_pos.cache_seg->gen);
+}
+
+/**
+ * cache_key_lstart - Retrieves the logical start offset of a cache key.
+ * @key: Pointer to pcache_cache_key structure.
+ *
+ * Returns the logical start offset for the cache key.
+ */
+static inline u64 cache_key_lstart(struct pcache_cache_key *key)
+{
+ return key->off;
+}
+
+/**
+ * cache_key_lend - Retrieves the logical end offset of a cache key.
+ * @key: Pointer to pcache_cache_key structure.
+ *
+ * Returns the logical end offset for the cache key.
+ */
+static inline u64 cache_key_lend(struct pcache_cache_key *key)
+{
+ return key->off + key->len;
+}
+
+static inline void cache_key_copy(struct pcache_cache_key *key_dst, struct pcache_cache_key *key_src)
+{
+ key_dst->off = key_src->off;
+ key_dst->len = key_src->len;
+ key_dst->seg_gen = key_src->seg_gen;
+ key_dst->cache_tree = key_src->cache_tree;
+ key_dst->cache_subtree = key_src->cache_subtree;
+ key_dst->flags = key_src->flags;
+
+ cache_pos_copy(&key_dst->cache_pos, &key_src->cache_pos);
+}
+
+/**
+ * cache_pos_onmedia_crc - Calculates the CRC for an on-media cache position.
+ * @pos_om: Pointer to pcache_cache_pos_onmedia structure.
+ *
+ * Calculates the CRC-32 checksum of the position, excluding the first 4 bytes.
+ * Returns the computed CRC value.
+ */
+static inline u32 cache_pos_onmedia_crc(struct pcache_cache_pos_onmedia *pos_om)
+{
+ return pcache_meta_crc(&pos_om->header, sizeof(struct pcache_cache_pos_onmedia));
+}
+
+void cache_pos_encode(struct pcache_cache *cache,
+ struct pcache_cache_pos_onmedia *pos_onmedia,
+ struct pcache_cache_pos *pos, u64 seq, u32 *index);
+int cache_pos_decode(struct pcache_cache *cache,
+ struct pcache_cache_pos_onmedia *pos_onmedia,
+ struct pcache_cache_pos *pos, u64 *seq, u32 *index);
+
+static inline void cache_encode_key_tail(struct pcache_cache *cache)
+{
+ cache_pos_encode(cache, cache->cache_ctrl->key_tail_pos,
+ &cache->key_tail, ++cache->key_tail_seq,
+ &cache->key_tail_index);
+}
+
+static inline int cache_decode_key_tail(struct pcache_cache *cache)
+{
+ return cache_pos_decode(cache, cache->cache_ctrl->key_tail_pos,
+ &cache->key_tail, &cache->key_tail_seq,
+ &cache->key_tail_index);
+}
+
+static inline void cache_encode_dirty_tail(struct pcache_cache *cache)
+{
+ cache_pos_encode(cache, cache->cache_ctrl->dirty_tail_pos,
+ &cache->dirty_tail, ++cache->dirty_tail_seq,
+ &cache->dirty_tail_index);
+}
+
+static inline int cache_decode_dirty_tail(struct pcache_cache *cache)
+{
+ return cache_pos_decode(cache, cache->cache_ctrl->dirty_tail_pos,
+ &cache->dirty_tail, &cache->dirty_tail_seq,
+ &cache->dirty_tail_index);
+}
+#endif /* _PCACHE_CACHE_H */
--
2.34.1
* [RFC PATCH 11/11] dm-pcache: initial dm-pcache target
2025-06-05 14:22 [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices Dongsheng Yang
` (9 preceding siblings ...)
2025-06-05 14:23 ` [RFC PATCH 10/11] dm-pcache: add cache core Dongsheng Yang
@ 2025-06-05 14:23 ` Dongsheng Yang
2025-06-12 16:57 ` [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices Mikulas Patocka
11 siblings, 0 replies; 18+ messages in thread
From: Dongsheng Yang @ 2025-06-05 14:23 UTC (permalink / raw)
To: mpatocka, agk, snitzer, axboe, hch, dan.j.williams,
Jonathan.Cameron
Cc: linux-block, linux-kernel, linux-cxl, nvdimm, dm-devel,
Dongsheng Yang
Add the top-level integration pieces that make the new persistent-memory
cache target usable from device-mapper:
* Documentation
- `Documentation/admin-guide/device-mapper/dm-pcache.rst` explains the
design, table syntax, status fields and runtime messages.
* Core target implementation
- `dm_pcache.c` and `dm_pcache.h` register the `"pcache"` DM target,
parse constructor arguments, create workqueues, and forward bios to
the cache core added in earlier patches.
- Supports flush/FUA, status reporting, and a “gc_percent” message.
- Does not support discard currently.
- Does not support table reload for a live target currently.
* Device-mapper tables now accept lines like
pcache <pmem_dev> <backing_dev> writeback <true|false>
Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
---
.../admin-guide/device-mapper/dm-pcache.rst | 200 +++++++++
MAINTAINERS | 9 +
drivers/md/Kconfig | 2 +
drivers/md/Makefile | 1 +
drivers/md/dm-pcache/Kconfig | 17 +
drivers/md/dm-pcache/Makefile | 3 +
drivers/md/dm-pcache/dm_pcache.c | 388 ++++++++++++++++++
drivers/md/dm-pcache/dm_pcache.h | 61 +++
8 files changed, 681 insertions(+)
create mode 100644 Documentation/admin-guide/device-mapper/dm-pcache.rst
create mode 100644 drivers/md/dm-pcache/Kconfig
create mode 100644 drivers/md/dm-pcache/Makefile
create mode 100644 drivers/md/dm-pcache/dm_pcache.c
create mode 100644 drivers/md/dm-pcache/dm_pcache.h
diff --git a/Documentation/admin-guide/device-mapper/dm-pcache.rst b/Documentation/admin-guide/device-mapper/dm-pcache.rst
new file mode 100644
index 000000000000..faf797cf29ca
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-pcache.rst
@@ -0,0 +1,200 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================================
+dm-pcache — Persistent Cache
+=================================
+
+*Author: Dongsheng Yang <dongsheng.yang@linux.dev>*
+
+This document describes *dm-pcache*, a Device-Mapper target that lets a
+byte-addressable *DAX* (persistent-memory, “pmem”) region act as a
+high-performance, crash-persistent cache in front of a slower block
+device. The code lives in `drivers/md/dm-pcache/`.
+
+Quick feature summary
+=====================
+
+* *Write-back* caching (only mode currently supported).
+* *16 MiB segments* allocated on the pmem device.
+* *Data CRC32* verification (optional, per cache).
+* Crash-safe: every metadata structure is duplicated
+  (``PCACHE_META_INDEX_MAX == 2``) and protected with CRC+sequence numbers.
+* *Multi-subtree RB-tree indexing* (the logical address space is split into
+  fixed-size subtrees, each with its own lock) for high pmem parallelism.
+* Pure *DAX path* I/O – no extra BIO round-trips.
+* *Log-structured write-back* that preserves backend crash-consistency.
+
+-------------------------------------------------------------------------------
+Constructor
+===========
+
+::
+
+ pcache <cache_dev> <backing_dev> <cache_mode> <data_crc>
+
+========================= ====================================================
+``cache_dev`` Any DAX-capable block device (``/dev/pmem0``…).
+ All metadata *and* cached blocks are stored here.
+
+``backing_dev`` The slow block device to be cached.
+
+``cache_mode`` Only ``writeback`` is accepted at the moment.
+
+``data_crc`` ``true`` – store CRC32 for every cached entry and
+ verify on reads
+ ``false`` – skip CRC (faster)
+========================= ====================================================
+
+Example
+-------
+
+.. code-block:: shell
+
+ dmsetup create pcache_sdb --table \
+ "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb writeback true"
+
+The first time a pmem device is used, dm-pcache formats it automatically
+(super-block, cache_info, etc.).
+
+-------------------------------------------------------------------------------
+Status line
+===========
+
+``dmsetup status <device>`` (``STATUSTYPE_INFO``) prints:
+
+::
+
+ <sb_flags> <seg_total> <cache_segs> <segs_used> \
+ <gc_percent> <cache_flags> \
+ <key_head_seg>:<key_head_off> \
+ <dirty_tail_seg>:<dirty_tail_off> \
+ <key_tail_seg>:<key_tail_off>
+
+Field meanings
+--------------
+
+=============================== =============================================
+``sb_flags`` Super-block flags (e.g. endian marker).
+
+``seg_total`` Number of physical *pmem* segments.
+
+``cache_segs`` Number of segments used for cache.
+
+``segs_used`` Segments currently allocated (bitmap weight).
+
+``gc_percent`` Current GC high-water mark (0-90).
+
+``cache_flags`` Bit 0 – DATA_CRC enabled
+ Bit 1 – INIT_DONE (cache initialised)
+ Bits 2-5 – cache mode (0 == WB).
+
+``key_head`` Where new key-sets are being written.
+
+``dirty_tail`` First dirty key-set that still needs
+ write-back to the backing device.
+
+``key_tail`` First key-set that may be reclaimed by GC.
+=============================== =============================================
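+
+For a device created as ``pcache_sdb`` (as in the examples in this document),
+the line can be inspected with::
+
+    dmsetup status pcache_sdb
+
+Note that device-mapper prepends the usual ``<start> <length> pcache`` triple
+to the target-specific fields described above.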
+
+-------------------------------------------------------------------------------
+Messages
+========
+
+*Change GC trigger*
+
+::
+
+ dmsetup message <dev> 0 gc_percent <0-90>
+
+-------------------------------------------------------------------------------
+Theory of operation
+===================
+
+Sub-devices
+-----------
+
+==================== =========================================================
+backing_dev Any block device (SSD/HDD/loop/LVM, etc.).
+cache_dev DAX device; must expose direct-access memory.
+==================== =========================================================
+
+Segments and key-sets
+---------------------
+
+* The pmem space is divided into *16 MiB segments*.
+* Each write allocates space from a per-CPU *data_head* inside a segment.
+* A *cache-key* records a logical range on the origin and where it lives
+ inside pmem (segment + offset + generation).
+* 128 keys form a *key-set* (kset); ksets are written sequentially in pmem
+ and are themselves crash-safe (CRC).
+* The *(key_tail, dirty_tail)* pair delimits clean/dirty and live/dead ksets.
+
+Write-back
+----------
+
+Dirty keys are queued into a tree; a background worker copies data
+back to the backing_dev and advances *dirty_tail*. A FLUSH/FUA bio from the
+upper layers forces an immediate metadata commit.
+
+Garbage collection
+------------------
+
+GC starts when ``segs_used >= seg_total * gc_percent / 100``. It walks
+from *key_tail*, frees segments whose every key has been invalidated, and
+advances *key_tail*.
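+
+For example (illustrative numbers): with ``seg_total = 1000`` and the default
+``gc_percent = 70``, GC begins once 700 segments are in use.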
+
+CRC verification
+----------------
+
+If ``data_crc`` is enabled, dm-pcache computes a CRC32 over every cached data
+range when it is inserted and stores it in the on-media key. Reads
+validate the CRC before copying to the caller.
+
+-------------------------------------------------------------------------------
+Failure handling
+================
+
+* *pmem media errors* – all metadata copies are read with
+  ``copy_mc_to_kernel``; on an uncorrectable error the failure is logged and
+  initialisation is aborted.
+* *Cache full* – if no free segment can be found, writes return ``-EBUSY``;
+ dm-pcache retries internally (request deferral).
+* *System crash* – on attach, the driver replays ksets from *key_tail* to
+ rebuild the in-core trees; every segment’s generation guards against
+ use-after-free keys.
+
+-------------------------------------------------------------------------------
+Limitations & TODO
+==================
+
+* Only *write-back* mode; other modes planned.
+* Only FIFO cache invalidate; other (LRU, ARC...) planned.
+* Table reload is not supported currently.
+* Discard planned.
+
+-------------------------------------------------------------------------------
+Example workflow
+================
+
+.. code-block:: shell
+
+ # 1. Create devices
+ dmsetup create pcache_sdb --table \
+ "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb writeback true"
+
+ # 2. Put a filesystem on top
+ mkfs.ext4 /dev/mapper/pcache_sdb
+ mount /dev/mapper/pcache_sdb /mnt
+
+ # 3. Tune GC threshold to 80 %
+ dmsetup message pcache_sdb 0 gc_percent 80
+
+ # 4. Observe status
+ watch -n1 'dmsetup status pcache_sdb'
+
+ # 5. Shutdown
+ umount /mnt
+ dmsetup remove pcache_sdb
+
+-------------------------------------------------------------------------------
+``dm-pcache`` is under active development; feedback, bug reports and patches
+are very welcome!
diff --git a/MAINTAINERS b/MAINTAINERS
index dd844ac8d910..aba4e84316b6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6838,6 +6838,15 @@ S: Maintained
F: Documentation/admin-guide/device-mapper/vdo*.rst
F: drivers/md/dm-vdo/
+DEVICE-MAPPER PCACHE TARGET
+M: Dongsheng Yang <dongsheng.yang@linux.dev>
+M: Zheng Gu <cengku@gmail.com>
+R: Linggang Zeng <linggang.linux@gmail.com>
+L: dm-devel@lists.linux.dev
+S: Maintained
+F: Documentation/admin-guide/device-mapper/dm-pcache.rst
+F: drivers/md/dm-pcache/
+
DEVLINK
M: Jiri Pirko <jiri@resnulli.us>
L: netdev@vger.kernel.org
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index ddb37f6670de..cd4d8d1705bb 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -659,4 +659,6 @@ config DM_AUDIT
source "drivers/md/dm-vdo/Kconfig"
+source "drivers/md/dm-pcache/Kconfig"
+
endif # MD
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 87bdfc9fe14c..f91a3133677f 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -71,6 +71,7 @@ obj-$(CONFIG_DM_RAID) += dm-raid.o
obj-$(CONFIG_DM_THIN_PROVISIONING) += dm-thin-pool.o
obj-$(CONFIG_DM_VERITY) += dm-verity.o
obj-$(CONFIG_DM_VDO) += dm-vdo/
+obj-$(CONFIG_DM_PCACHE) += dm-pcache/
obj-$(CONFIG_DM_CACHE) += dm-cache.o
obj-$(CONFIG_DM_CACHE_SMQ) += dm-cache-smq.o
obj-$(CONFIG_DM_EBS) += dm-ebs.o
diff --git a/drivers/md/dm-pcache/Kconfig b/drivers/md/dm-pcache/Kconfig
new file mode 100644
index 000000000000..0e251eca892e
--- /dev/null
+++ b/drivers/md/dm-pcache/Kconfig
@@ -0,0 +1,17 @@
+config DM_PCACHE
+ tristate "Persistent cache for Block Device (Experimental)"
+ depends on BLK_DEV_DM
+ depends on DEV_DAX
+ help
+ PCACHE provides a mechanism to use persistent memory (e.g., CXL persistent memory,
+ DAX-enabled devices) as a high-performance cache layer in front of
+ traditional block devices such as SSDs or HDDs.
+
+ PCACHE is implemented as a kernel module that integrates with the block
+ layer and supports direct access (DAX) to persistent memory for low-latency,
+ byte-addressable caching.
+
+ Note: This feature is experimental and should be tested thoroughly
+ before use in production environments.
+
+ If unsure, say 'N'.
diff --git a/drivers/md/dm-pcache/Makefile b/drivers/md/dm-pcache/Makefile
new file mode 100644
index 000000000000..86776e4acad2
--- /dev/null
+++ b/drivers/md/dm-pcache/Makefile
@@ -0,0 +1,3 @@
+dm-pcache-y := dm_pcache.o cache_dev.o segment.o backing_dev.o cache.o cache_gc.o cache_writeback.o cache_segment.o cache_key.o cache_req.o
+
+obj-m += dm-pcache.o
diff --git a/drivers/md/dm-pcache/dm_pcache.c b/drivers/md/dm-pcache/dm_pcache.c
new file mode 100644
index 000000000000..c4ac51a9cef0
--- /dev/null
+++ b/drivers/md/dm-pcache/dm_pcache.c
@@ -0,0 +1,388 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#include <linux/module.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+
+#include "../dm-core.h"
+#include "cache_dev.h"
+#include "backing_dev.h"
+#include "cache.h"
+#include "dm_pcache.h"
+
+static void defer_req(struct pcache_request *pcache_req)
+{
+ struct dm_pcache *pcache = pcache_req->pcache;
+
+ BUG_ON(!list_empty(&pcache_req->list_node));
+
+ spin_lock(&pcache->defered_req_list_lock);
+ list_add(&pcache_req->list_node, &pcache->defered_req_list);
+ spin_unlock(&pcache->defered_req_list_lock);
+
+ queue_delayed_work(pcache->task_wq, &pcache->defered_req_work, msecs_to_jiffies(100));
+}
+
+static void defered_req_fn(struct work_struct *work)
+{
+ struct dm_pcache *pcache = container_of(work, struct dm_pcache, defered_req_work.work);
+ struct pcache_request *pcache_req;
+ LIST_HEAD(tmp_list);
+ int ret;
+
+ if (pcache_is_stopping(pcache))
+ return;
+
+ spin_lock(&pcache->defered_req_list_lock);
+ list_splice_init(&pcache->defered_req_list, &tmp_list);
+ spin_unlock(&pcache->defered_req_list_lock);
+
+ while (!list_empty(&tmp_list)) {
+ pcache_req = list_first_entry(&tmp_list,
+ struct pcache_request, list_node);
+ list_del_init(&pcache_req->list_node);
+ pcache_req->ret = 0;
+ ret = pcache_cache_handle_req(&pcache->cache, pcache_req);
+ if (ret == -ENOMEM || ret == -EBUSY) {
+ defer_req(pcache_req);
+ ret = 0;
+ } else {
+ pcache_req_put(pcache_req, ret);
+ }
+ }
+}
+
+void pcache_req_get(struct pcache_request *pcache_req)
+{
+ kref_get(&pcache_req->ref);
+}
+
+static void end_req(struct kref *ref)
+{
+ struct pcache_request *pcache_req = container_of(ref, struct pcache_request, ref);
+ struct bio *bio = pcache_req->bio;
+ int ret = pcache_req->ret;
+
+ if (ret == -ENOMEM || ret == -EBUSY) {
+ pcache_req_get(pcache_req);
+ defer_req(pcache_req);
+ } else {
+ bio->bi_status = errno_to_blk_status(ret);
+ bio_endio(bio);
+ }
+}
+
+void pcache_req_put(struct pcache_request *pcache_req, int ret)
+{
+ /* Set the return status if it is not already set */
+ if (ret && !pcache_req->ret)
+ pcache_req->ret = ret;
+
+ kref_put(&pcache_req->ref, end_req);
+}
+
+static int parse_cache_dev(struct dm_pcache *pcache, struct dm_arg_set *as,
+ char **error)
+{
+ const char *cache_dev_path;
+ int ret;
+
+ if (!as->argc) {
+ *error = "Cache_dev_path required";
+ return -EINVAL;
+ }
+
+ cache_dev_path = dm_shift_arg(as);
+ ret = cache_dev_start(pcache, cache_dev_path);
+ if (ret) {
+ *error = "Failed to start cache dev";
+ return ret;
+ }
+
+ return 0;
+}
+
+static int parse_backing_dev(struct dm_pcache *pcache, struct dm_arg_set *as,
+ char **error)
+{
+ const char *backing_dev_path;
+ int ret;
+
+ if (!as->argc) {
+ *error = "Backing_dev_path required";
+ return -EINVAL;
+ }
+
+ backing_dev_path = dm_shift_arg(as);
+ ret = backing_dev_start(pcache, backing_dev_path);
+ if (ret) {
+ *error = "Failed to start backing dev";
+ return ret;
+ }
+
+ return 0;
+}
+
+static int parse_cache_opts(struct dm_pcache *pcache, struct dm_arg_set *as,
+ char **error)
+{
+ struct pcache_cache_options opts = { 0 };
+ const char *cache_mode;
+ const char *data_crc_str;
+
+ if (as->argc != 2) {
+ *error = "Bad cache options, need '<cache_mode> <data_crc(true|false)>'";
+ return -EINVAL;
+ }
+
+ cache_mode = dm_shift_arg(as);
+ if (!strcmp(cache_mode, "writeback")) {
+ opts.cache_mode = PCACHE_CACHE_MODE_WRITEBACK;
+ } else {
+ *error = "only writeback cache mode supported currently";
+ return -EINVAL;
+ }
+
+ data_crc_str = dm_shift_arg(as);
+ if (!strcmp(data_crc_str, "true")) {
+ opts.data_crc = true;
+ } else if (!strcmp(data_crc_str, "false")) {
+ opts.data_crc = false;
+ } else {
+ *error = "Invalid option for data_crc";
+ return -EINVAL;
+ }
+
+ return pcache_cache_start(pcache, &opts);
+}
+
+static int parse_pcache_args(struct dm_pcache *pcache, unsigned int argc, char **argv,
+ char **error)
+{
+ struct dm_arg_set as;
+ int ret;
+
+ if (argc != 4) {
+ ret = -EINVAL;
+ goto err;
+ }
+
+ as.argc = argc;
+ as.argv = argv;
+
+ ret = parse_cache_dev(pcache, &as, error);
+ if (ret)
+ goto err;
+
+ ret = parse_backing_dev(pcache, &as, error);
+ if (ret)
+ goto stop_cache_dev;
+
+ ret = parse_cache_opts(pcache, &as, error);
+ if (ret)
+ goto stop_backing_dev;
+
+ return 0;
+
+stop_backing_dev:
+ backing_dev_stop(pcache);
+stop_cache_dev:
+ cache_dev_stop(pcache);
+err:
+ return ret;
+}
+
+static int dm_pcache_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+ struct mapped_device *md = ti->table->md;
+ struct dm_pcache *pcache;
+ int ret;
+
+ if (md->map) {
+ ti->error = "Don't support table loading for live md";
+ return -EOPNOTSUPP;
+ }
+
+ /* Allocate memory for the cache structure */
+ pcache = kzalloc(sizeof(struct dm_pcache), GFP_KERNEL);
+ if (!pcache)
+ return -ENOMEM;
+
+ pcache->task_wq = alloc_workqueue("pcache-%s-wq", WQ_UNBOUND | WQ_MEM_RECLAIM,
+ 0, md->name);
+ if (!pcache->task_wq) {
+ ret = -ENOMEM;
+ goto free_pcache;
+ }
+
+ spin_lock_init(&pcache->defered_req_list_lock);
+ INIT_LIST_HEAD(&pcache->defered_req_list);
+ INIT_DELAYED_WORK(&pcache->defered_req_work, defered_req_fn);
+ pcache->ti = ti;
+
+ ret = parse_pcache_args(pcache, argc, argv, &ti->error);
+ if (ret)
+ goto destroy_wq;
+
+ ti->num_flush_bios = 1;
+ ti->flush_supported = true;
+ ti->per_io_data_size = sizeof(struct pcache_request);
+ ti->private = pcache;
+ atomic_set(&pcache->state, PCACHE_STATE_RUNNING);
+
+ return 0;
+
+destroy_wq:
+ destroy_workqueue(pcache->task_wq);
+free_pcache:
+ kfree(pcache);
+
+ return ret;
+}
+
+static void defer_req_stop(struct dm_pcache *pcache)
+{
+ struct pcache_request *pcache_req;
+ LIST_HEAD(tmp_list);
+
+ cancel_delayed_work_sync(&pcache->defered_req_work);
+
+ spin_lock(&pcache->defered_req_list_lock);
+ list_splice_init(&pcache->defered_req_list, &tmp_list);
+ spin_unlock(&pcache->defered_req_list_lock);
+
+ while (!list_empty(&tmp_list)) {
+ pcache_req = list_first_entry(&tmp_list,
+ struct pcache_request, list_node);
+ list_del_init(&pcache_req->list_node);
+ pcache_req_put(pcache_req, -EIO);
+ }
+}
+
+static void dm_pcache_dtr(struct dm_target *ti)
+{
+ struct dm_pcache *pcache;
+
+ pcache = ti->private;
+
+ atomic_set(&pcache->state, PCACHE_STATE_STOPPING);
+
+ defer_req_stop(pcache);
+ pcache_cache_stop(pcache);
+ backing_dev_stop(pcache);
+ cache_dev_stop(pcache);
+
+ drain_workqueue(pcache->task_wq);
+ destroy_workqueue(pcache->task_wq);
+
+ kfree(pcache);
+}
+
+static int dm_pcache_map_bio(struct dm_target *ti, struct bio *bio)
+{
+ struct pcache_request *pcache_req = dm_per_bio_data(bio, sizeof(struct pcache_request));
+ struct dm_pcache *pcache = ti->private;
+ int ret;
+
+ pcache_req->pcache = pcache;
+ kref_init(&pcache_req->ref);
+ pcache_req->ret = 0;
+ pcache_req->bio = bio;
+ pcache_req->off = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
+ pcache_req->data_len = bio->bi_iter.bi_size;
+ INIT_LIST_HEAD(&pcache_req->list_node);
+ bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);
+
+ ret = pcache_cache_handle_req(&pcache->cache, pcache_req);
+ if (ret == -ENOMEM || ret == -EBUSY) {
+ defer_req(pcache_req);
+ } else {
+ pcache_req_put(pcache_req, ret);
+ }
+
+ return DM_MAPIO_SUBMITTED;
+}
+
+static void dm_pcache_status(struct dm_target *ti, status_type_t type,
+ unsigned int status_flags, char *result,
+ unsigned int maxlen)
+{
+ struct dm_pcache *pcache = ti->private;
+ struct pcache_cache_dev *cache_dev = &pcache->cache_dev;
+ struct pcache_backing_dev *backing_dev = &pcache->backing_dev;
+ struct pcache_cache *cache = &pcache->cache;
+ unsigned int sz = 0;
+
+ switch (type) {
+ case STATUSTYPE_INFO:
+ DMEMIT("%x %u %u %u %u %x %u:%u %u:%u %u:%u",
+ cache_dev->sb_flags,
+ cache_dev->seg_num,
+ cache->n_segs,
+ bitmap_weight(cache->seg_map, cache->n_segs),
+ pcache_cache_get_gc_percent(cache),
+ cache->cache_info.flags,
+ cache->key_head.cache_seg->cache_seg_id,
+ cache->key_head.seg_off,
+ cache->dirty_tail.cache_seg->cache_seg_id,
+ cache->dirty_tail.seg_off,
+ cache->key_tail.cache_seg->cache_seg_id,
+ cache->key_tail.seg_off);
+ break;
+ case STATUSTYPE_TABLE:
+ DMEMIT("%s %s writeback %s",
+ cache_dev->dm_dev->name,
+ backing_dev->dm_dev->name,
+ cache_data_crc_on(cache) ? "true" : "false");
+ break;
+ case STATUSTYPE_IMA:
+ *result = '\0';
+ break;
+ }
+}
+
+static int dm_pcache_message(struct dm_target *ti, unsigned int argc,
+ char **argv, char *result, unsigned int maxlen)
+{
+ struct dm_pcache *pcache = ti->private;
+ unsigned long val;
+
+ if (argc != 2)
+ goto err;
+
+ if (!strcasecmp(argv[0], "gc_percent")) {
+ if (kstrtoul(argv[1], 10, &val))
+ goto err;
+
+ return pcache_cache_set_gc_percent(&pcache->cache, val);
+ }
+err:
+ return -EINVAL;
+}
+
+static struct target_type dm_pcache_target = {
+ .name = "pcache",
+ .version = {0, 1, 0},
+ .module = THIS_MODULE,
+ .features = DM_TARGET_SINGLETON,
+ .ctr = dm_pcache_ctr,
+ .dtr = dm_pcache_dtr,
+ .map = dm_pcache_map_bio,
+ .status = dm_pcache_status,
+ .message = dm_pcache_message,
+};
+
+static int __init dm_pcache_init(void)
+{
+ return dm_register_target(&dm_pcache_target);
+}
+module_init(dm_pcache_init);
+
+static void __exit dm_pcache_exit(void)
+{
+ dm_unregister_target(&dm_pcache_target);
+}
+module_exit(dm_pcache_exit);
+
+MODULE_DESCRIPTION("dm-pcache Persistent Cache for block device");
+MODULE_AUTHOR("Dongsheng Yang <dongsheng.yang@linux.dev>");
+MODULE_LICENSE("GPL");
diff --git a/drivers/md/dm-pcache/dm_pcache.h b/drivers/md/dm-pcache/dm_pcache.h
new file mode 100644
index 000000000000..07be2e2a2f5b
--- /dev/null
+++ b/drivers/md/dm-pcache/dm_pcache.h
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _DM_PCACHE_H
+#define _DM_PCACHE_H
+#include <linux/device-mapper.h>
+
+#include "../dm-core.h"
+
+#define CACHE_DEV_TO_PCACHE(cache_dev) (container_of(cache_dev, struct dm_pcache, cache_dev))
+#define BACKING_DEV_TO_PCACHE(backing_dev) (container_of(backing_dev, struct dm_pcache, backing_dev))
+#define CACHE_TO_PCACHE(cache) (container_of(cache, struct dm_pcache, cache))
+
+#define PCACHE_STATE_RUNNING 1
+#define PCACHE_STATE_STOPPING 2
+
+struct pcache_cache_dev;
+struct pcache_backing_dev;
+struct pcache_cache;
+struct dm_pcache {
+ struct dm_target *ti;
+ struct pcache_cache_dev cache_dev;
+ struct pcache_backing_dev backing_dev;
+ struct pcache_cache cache;
+
+ spinlock_t defered_req_list_lock;
+ struct list_head defered_req_list;
+ struct workqueue_struct *task_wq;
+
+ struct delayed_work defered_req_work;
+
+ atomic_t state;
+};
+
+static inline bool pcache_is_stopping(struct dm_pcache *pcache)
+{
+ return (atomic_read(&pcache->state) == PCACHE_STATE_STOPPING);
+}
+
+#define pcache_dev_err(pcache, fmt, ...) \
+ pcache_err("%s " fmt, pcache->ti->table->md->name, ##__VA_ARGS__)
+#define pcache_dev_info(pcache, fmt, ...) \
+ pcache_info("%s " fmt, pcache->ti->table->md->name, ##__VA_ARGS__)
+#define pcache_dev_debug(pcache, fmt, ...) \
+ pcache_debug("%s " fmt, pcache->ti->table->md->name, ##__VA_ARGS__)
+
+struct pcache_request {
+ struct dm_pcache *pcache;
+ struct bio *bio;
+
+ u64 off;
+ u32 data_len;
+
+ struct kref ref;
+ int ret;
+
+ struct list_head list_node;
+};
+
+void pcache_req_get(struct pcache_request *pcache_req);
+void pcache_req_put(struct pcache_request *pcache_req, int ret);
+
+#endif /* _DM_PCACHE_H */
--
2.34.1
* Re: [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices
2025-06-05 14:22 [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices Dongsheng Yang
` (10 preceding siblings ...)
2025-06-05 14:23 ` [RFC PATCH 11/11] dm-pcache: initial dm-pcache target Dongsheng Yang
@ 2025-06-12 16:57 ` Mikulas Patocka
2025-06-13 3:39 ` Dongsheng Yang
2025-06-23 3:13 ` Dongsheng Yang
11 siblings, 2 replies; 18+ messages in thread
From: Mikulas Patocka @ 2025-06-12 16:57 UTC (permalink / raw)
To: Dongsheng Yang
Cc: agk, snitzer, axboe, hch, dan.j.williams, Jonathan.Cameron,
linux-block, linux-kernel, linux-cxl, nvdimm, dm-devel
Hi
On Thu, 5 Jun 2025, Dongsheng Yang wrote:
> Hi Mikulas and all,
>
> This is *RFC v2* of the *pcache* series, a persistent-memory backed cache.
> Compared with *RFC v1*
> <https://lore.kernel.org/lkml/20250414014505.20477-1-dongsheng.yang@linux.dev/>
> the most important change is that the whole cache has been *ported to
> the Device-Mapper framework* and is now exposed as a regular DM target.
>
> Code:
> https://github.com/DataTravelGuide/linux/tree/dm-pcache
>
> Full RFC v2 test results:
> https://datatravelguide.github.io/dtg-blog/pcache/pcache_rfc_v2_result/results.html
>
> All 962 xfstests cases passed successfully under four different
> pcache configurations.
>
> One of the detailed xfstests run:
> https://datatravelguide.github.io/dtg-blog/pcache/pcache_rfc_v2_result/test-results/02-._pcache.py_PcacheTest.test_run-crc-enable-gc-gc0-test_script-xfstests-a515/debug.log
>
> Below is a quick tour through the three layers of the implementation,
> followed by an example invocation.
>
> ----------------------------------------------------------------------
> 1. pmem access layer
> ----------------------------------------------------------------------
>
> * All reads use *copy_mc_to_kernel()* so that uncorrectable media
> errors are detected and reported.
> * All writes go through *memcpy_flushcache()* to guarantee durability
> on real persistent memory.
You could also try to use normal write and clflushopt for big writes - I
found out that for larger regions it is better - see the function
memcpy_flushcache_optimized in dm-writecache. Test which way is better.
> ----------------------------------------------------------------------
> 2. cache-logic layer (segments / keys / workers)
> ----------------------------------------------------------------------
>
> Main features
> - 16 MiB pmem segments, log-structured allocation.
> - Multi-subtree RB-tree index for high parallelism.
> - Optional per-entry *CRC32* on cached data.
Would it be better to use crc32c because it has hardware support in the
SSE4.2 instruction set?
> - Background *write-back* worker and watermark-driven *GC*.
> - Crash-safe replay: key-sets are scanned from *key_tail* on start-up.
>
> Current limitations
> - Only *write-back* mode implemented.
> - Only FIFO cache invalidate; other (LRU, ARC...) planned.
>
> ----------------------------------------------------------------------
> 3. dm-pcache target integration
> ----------------------------------------------------------------------
>
> * Table line
> `pcache <pmem_dev> <origin_dev> writeback <true|false>`
> * Features advertised to DM:
> - `ti->flush_supported = true`, so *PREFLUSH* and *FUA* are honoured
> (they force all open key-sets to close and data to be durable).
> * Not yet supported:
> - Discard / TRIM.
> - dynamic `dmsetup reload`.
If you don't support it, you should at least try to detect that the user
did a reload and return an error - so that there won't be data corruption in
this case.
But it would be better to support table reload. You can support it with a
mechanism similar to "__handover_exceptions" in the dm-snap.c driver.
> Runtime controls
> - `dmsetup message <dev> 0 gc_percent <0-90>` adjusts the GC trigger.
>
> Status line reports super-block flags, segment counts, GC threshold and
> the three tail/head pointers (see the RST document for details).
Perhaps these are not real bugs (I didn't analyze it thoroughly), but
there are some GFP_NOWAIT and GFP_KERNEL allocations.
GFP_NOWAIT can fail anytime (for example, if the machine receives too many
network packets), so you must handle the error gracefully.
GFP_KERNEL allocations may recurse back into the I/O path through swapping
or file writeback, and thus may cause deadlocks. You should use
GFP_KERNEL in the target constructor or destructor because there is no I/O
to be processed at that time, but it shouldn't be used in the I/O
processing path.
I see that when you get ENOMEM, you retry the request in 100ms - putting
arbitrary waits in the code is generally bad practice - this won't work if
the user is swapping to the dm-pcache device. It may be possible that
there is no memory free, thus retrying won't help and it will deadlock.
I suggest to use mempools to guarantee forward progress in out-of-memory
situation. A mempool_alloc(GFP_IO) will never return NULL, it will just
wait until some other process frees some entry into the mempool.
Generally, a convention among device mapper targets is that they have a few
fixed parameters first, then a count of optional parameters, and then the
optional parameters themselves (either in "parameter:123" or
"parameter 123" format). You should follow this convention, so that it can
be easily extended with new parameters later.
The __packed attribute causes performance degradation on risc machines
without hardware support for unaligned accesses - the compiler will
generate byte-by-byte accesses - I suggest to not use it and instead make
sure that the members in the structures are naturally aligned (and
inserting explicit padding if needed).
The function "memcpy_flushcache" in arch/x86/include/asm/string_64.h is
optimized for 4, 8 and 16-byte accesses (because that's what dm-writecache
uses) - I suggest to add more optimizations to it for constant sizes that
fit the usage pattern of dm-pcache.
I see that you are using "queue_delayed_work(cache_get_wq(cache),
&cache->writeback_work, 0);" and "queue_delayed_work(cache_get_wq(cache),
&cache->writeback_work, delay);" - the problem here is that if the entry
is already queued with a delay and you attempt to queue it again with zero
delay, the new queue attempt will be ignored - I'm not sure if this is
intended behavior or not.
req_complete_fn: this will never run with interrupts disabled, so you can
replace spin_lock_irqsave/spin_unlock_irqrestore with
spin_lock_irq/spin_unlock_irq.
backing_dev_bio_end: there's a bug in this function - it may be called
both with interrupts disabled and interrupts enabled, so you should not
use spin_lock/spin_unlock. You should call
spin_lock_irqsave/spin_unlock_irqrestore.
queue_work(BACKING_DEV_TO_PCACHE - I would move it inside the spinlock -
see the commit 829451beaed6165eb11d7a9fb4e28eb17f489980 for a similar
problem.
bio_map - bio vectors can hold arbitrarily long entries - if the "base"
variable is not from vmalloc, you can just add it as one bvec entry.
"backing_req->kmem.bvecs = kcalloc" - you can use kmalloc_array instead of
kcalloc, there's no need to zero the value.
> + if (++wait_count >= PCACHE_WAIT_NEW_CACHE_COUNT)
> + return NULL;
> +
> + udelay(PCACHE_WAIT_NEW_CACHE_INTERVAL);
> + goto again;
It is not good practice to insert arbitrary waits (here, the wait is
burning CPU power, which makes it even worse). You should add the process
to a wait queue and wake up the queue.
See the functions writecache_wait_on_freelist and writecache_free_entry
for an example, how to wait correctly.
> +static int dm_pcache_map_bio(struct dm_target *ti, struct bio *bio)
> +{
> + struct pcache_request *pcache_req = dm_per_bio_data(bio, sizeof(struct pcache_request));
> + struct dm_pcache *pcache = ti->private;
> + int ret;
> +
> + pcache_req->pcache = pcache;
> + kref_init(&pcache_req->ref);
> + pcache_req->ret = 0;
> + pcache_req->bio = bio;
> + pcache_req->off = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
> + pcache_req->data_len = bio->bi_iter.bi_size;
> + INIT_LIST_HEAD(&pcache_req->list_node);
> + bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);
This looks suspicious because you store the original bi_sector to
pcache_req->off and then subtract the target offset from it. Shouldn't
"bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);"
be before "pcache_req->off = (u64)bio->bi_iter.bi_sector <<
SECTOR_SHIFT;"?
Generally, the code doesn't seem bad. After reworking the out-of-memory
handling and replacing arbitrary waits with wait queues, I can merge it.
Mikulas
* Re: [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices
2025-06-12 16:57 ` [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices Mikulas Patocka
@ 2025-06-13 3:39 ` Dongsheng Yang
2025-06-23 3:13 ` Dongsheng Yang
1 sibling, 0 replies; 18+ messages in thread
From: Dongsheng Yang @ 2025-06-13 3:39 UTC (permalink / raw)
To: Mikulas Patocka
Cc: agk, snitzer, axboe, hch, dan.j.williams, Jonathan.Cameron,
linux-block, linux-kernel, linux-cxl, nvdimm, dm-devel
On 2025/6/13 0:57, Mikulas Patocka wrote:
> Hi
>
>
> On Thu, 5 Jun 2025, Dongsheng Yang wrote:
>
>> Hi Mikulas and all,
...
>
> Generally, the code doesn't seem bad. After reworking the out-of-memory
> handling and replacing arbitrary waits with wait queues, I can merge it.
Hi Mikulas,
Thanks for your review. I will go through and respond to your
review comments one by one over the next few days, before the next version.
Thanx
Dongsheng
>
> Mikulas
>
* Re: [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices
2025-06-12 16:57 ` [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices Mikulas Patocka
2025-06-13 3:39 ` Dongsheng Yang
@ 2025-06-23 3:13 ` Dongsheng Yang
2025-06-23 4:18 ` Dongsheng Yang
2025-06-30 15:57 ` Mikulas Patocka
1 sibling, 2 replies; 18+ messages in thread
From: Dongsheng Yang @ 2025-06-23 3:13 UTC (permalink / raw)
To: Mikulas Patocka
Cc: agk, snitzer, axboe, hch, dan.j.williams, Jonathan.Cameron,
linux-block, linux-kernel, linux-cxl, nvdimm, dm-devel
[-- Attachment #1.1: Type: text/plain, Size: 13898 bytes --]
Hi Mikulas:
I will send dm-pcache V1 soon; below are my responses to your comments.
在 6/13/2025 12:57 AM, Mikulas Patocka 写道:
> Hi
>
>
> On Thu, 5 Jun 2025, Dongsheng Yang wrote:
>
>> Hi Mikulas and all,
>>
>> This is *RFC v2* of the *pcache* series, a persistent-memory backed cache.
>> Compared with *RFC v1*
>> <https://lore.kernel.org/lkml/20250414014505.20477-1-dongsheng.yang@linux.dev/>
>> the most important change is that the whole cache has been *ported to
>> the Device-Mapper framework* and is now exposed as a regular DM target.
>>
>> Code:
>> https://github.com/DataTravelGuide/linux/tree/dm-pcache
>>
>> Full RFC v2 test results:
>> https://datatravelguide.github.io/dtg-blog/pcache/pcache_rfc_v2_result/results.html
>>
>> All 962 xfstests cases passed successfully under four different
>> pcache configurations.
>>
>> One of the detailed xfstests run:
>> https://datatravelguide.github.io/dtg-blog/pcache/pcache_rfc_v2_result/test-results/02-._pcache.py_PcacheTest.test_run-crc-enable-gc-gc0-test_script-xfstests-a515/debug.log
>>
>> Below is a quick tour through the three layers of the implementation,
>> followed by an example invocation.
>>
>> ----------------------------------------------------------------------
>> 1. pmem access layer
>> ----------------------------------------------------------------------
>>
>> * All reads use *copy_mc_to_kernel()* so that uncorrectable media
>> errors are detected and reported.
>> * All writes go through *memcpy_flushcache()* to guarantee durability
>> on real persistent memory.
> You could also try to use normal write and clflushopt for big writes - I
> found out that for larger regions it is better - see the function
> memcpy_flushcache_optimized in dm-writecache. Test, which way is better.
I did a test with fio on /dev/pmem0, with the attached patch applied to nd_pmem.ko.
When I use a memmap pmem device, I get results similar to the comment
in memcpy_flushcache_optimized():

Test (memmap pmem)      clflushopt      flushcache
-------------------------------------------------
test_randwrite_512      200 MiB/s       228 MiB/s
test_randwrite_1024     378 MiB/s       431 MiB/s
test_randwrite_2K       773 MiB/s       769 MiB/s
test_randwrite_4K       1364 MiB/s      1272 MiB/s
test_randwrite_8K       2078 MiB/s      1817 MiB/s
test_randwrite_16K      2745 MiB/s      2098 MiB/s
test_randwrite_32K      3232 MiB/s      2231 MiB/s
test_randwrite_64K      3660 MiB/s      2411 MiB/s
test_randwrite_128K     3922 MiB/s      2513 MiB/s
test_randwrite_1M       3824 MiB/s      2537 MiB/s
test_write_512          228 MiB/s       228 MiB/s
test_write_1024         439 MiB/s       423 MiB/s
test_write_2K           841 MiB/s       800 MiB/s
test_write_4K           1364 MiB/s      1308 MiB/s
test_write_8K           2107 MiB/s      1838 MiB/s
test_write_16K          2752 MiB/s      2166 MiB/s
test_write_32K          3213 MiB/s      2247 MiB/s
test_write_64K          3661 MiB/s      2415 MiB/s
test_write_128K         3902 MiB/s      2514 MiB/s
test_write_1M           3808 MiB/s      2529 MiB/s

But I got a different result when I use Optane pmem100:

Test (Optane pmem100)   clflushopt      flushcache
-------------------------------------------------
test_randwrite_512      167 MiB/s       226 MiB/s
test_randwrite_1024     301 MiB/s       420 MiB/s
test_randwrite_2K       615 MiB/s       639 MiB/s
test_randwrite_4K       967 MiB/s       1024 MiB/s
test_randwrite_8K       1047 MiB/s      1314 MiB/s
test_randwrite_16K      1096 MiB/s      1377 MiB/s
test_randwrite_32K      1155 MiB/s      1382 MiB/s
test_randwrite_64K      1184 MiB/s      1452 MiB/s
test_randwrite_128K     1199 MiB/s      1488 MiB/s
test_randwrite_1M       1178 MiB/s      1499 MiB/s
test_write_512          233 MiB/s       233 MiB/s
test_write_1024         424 MiB/s       391 MiB/s
test_write_2K           706 MiB/s       760 MiB/s
test_write_4K           978 MiB/s       1076 MiB/s
test_write_8K           1059 MiB/s      1296 MiB/s
test_write_16K          1119 MiB/s      1380 MiB/s
test_write_32K          1158 MiB/s      1387 MiB/s
test_write_64K          1184 MiB/s      1448 MiB/s
test_write_128K         1198 MiB/s      1481 MiB/s
test_write_1M           1178 MiB/s      1486 MiB/s

So for now I’d rather keep using flushcache in pcache. In future, once
we’ve come up with a general-purpose optimization, we can switch to that.
>
>> ----------------------------------------------------------------------
>> 2. cache-logic layer (segments / keys / workers)
>> ----------------------------------------------------------------------
>>
>> Main features
>> - 16 MiB pmem segments, log-structured allocation.
>> - Multi-subtree RB-tree index for high parallelism.
>> - Optional per-entry *CRC32* on cached data.
> Would it be better to use crc32c because it has hardware support in the
> SSE4.2 instruction set?
Good suggestion, thanx.
>
>> - Background *write-back* worker and watermark-driven *GC*.
>> - Crash-safe replay: key-sets are scanned from *key_tail* on start-up.
>>
>> Current limitations
>> - Only *write-back* mode implemented.
>> - Only FIFO cache invalidate; other (LRU, ARC...) planned.
>>
>> ----------------------------------------------------------------------
>> 3. dm-pcache target integration
>> ----------------------------------------------------------------------
>>
>> * Table line
>> `pcache <pmem_dev> <origin_dev> writeback <true|false>`
>> * Features advertised to DM:
>> - `ti->flush_supported = true`, so *PREFLUSH* and *FUA* are honoured
>> (they force all open key-sets to close and data to be durable).
>> * Not yet supported:
>> - Discard / TRIM.
>> - dynamic `dmsetup reload`.
> If you don't support it, you should at least try to detect that the user
> did reload and return error - so that there won't be data corruption in
> this case.
Yes, actually it already returns -EOPNOTSUPP:

static int dm_pcache_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{
	struct mapped_device *md = ti->table->md;
	struct dm_pcache *pcache;
	int ret;

	if (md->map) {
		ti->error = "Don't support table loading for live md";
		return -EOPNOTSUPP;
	}
>
> But it would be better to support table reload. You can support it by a
> similar mechanism to "__handover_exceptions" in the dm-snap.c driver.
Of course, that's planned, but we don't think it's necessary in the initial
version.
>
>> Runtime controls
>> - `dmsetup message <dev> 0 gc_percent <0-90>` adjusts the GC trigger.
>>
>> Status line reports super-block flags, segment counts, GC threshold and
>> the three tail/head pointers (see the RST document for details).
> Perhaps these are not real bugs (I didn't analyze it thoroughly), but
> there are some GFP_NOWAIT and GFP_KERNEL allocations.
>
> GFP_NOWAIT can fail anytime (for example, if the machine receives too many
> network packets), so you must handle the error gracefully.
>
> GFP_KERNEL allocation may recurse back into the I/O path through swapping
> or file writeback, thus they may cause deadlocks. You should use
> GFP_KERNEL in the target constructor or destructor because there is no I/O
> to be processed in this time, but they shouldn't be used in the I/O
> processing path.
**Yes**, I'm using the approach I described above.
- **GFP_KERNEL** is only used on the constructor path.
- **GFP_NOWAIT** is used in two cases: one for allocating `cache_key`,
and another for allocating `backing_dev_req` on a cache miss.
Both of these NOWAIT allocations happen while traversing the
`cache_subtree` under the `subtree_lock` spinlock—i.e., in contexts
where scheduling isn't allowed. That’s why we use `GFP_NOWAIT`.
>
> I see that when you get ENOMEM, you retry the request in 100ms - putting
> arbitrary waits in the code is generally bad practice - this won't work if
> the user is swapping to the dm-pcache device. It may be possible that
> there is no memory free, thus retrying won't help and it will deadlock.
In **V1**, I removed the retry-on-ENOMEM mechanism to make pcache more
coherent. Now it only retries when the cache is actually full.
- When the cache is full, the `pcache_req` is added to a `defer_list`.
- When cache invalidation occurs, we kick off all deferred requests to
retry.
This eliminates arbitrary waits.
>
> I suggest to use mempools to guarantee forward progress in out-of-memory
> situation. A mempool_alloc(GFP_IO) will never return NULL, it will just
> wait until some other process frees some entry into the mempool.
Both `cache_key` and `backing_dev_req` now use **mempools**, but—as
explained—they may still be allocated under a spinlock, so they use
**GFP_NOWAIT**.
>
> Generally, a convention among device mapper targets is that the have a few
> fixed parameters first, then there is a number of optional parameters and
> then there are optional parameters (either in "parameter:123" or
> "parameter 123" format). You should follow this convention, so that it can
> be easily extended with new parameters later.
Thanks, that's a good suggestion.
In **V1**, we changed the parameter format to:
pcache <cache_dev> <backing_dev> [<number_of_optional_arguments>
<cache_mode writeback> <data_crc true|false>]
>
> The __packed attribute causes performance degradation on risc machines
> without hardware support for unaligned accesses - the compiled will
> generate byte-by-byte accesses - I suggest to not use it and instead make
> sure that the members in the structures are naturally aligned (and
> inserting explicit padding if needed).
Thanks — we removed `__packed` in V1.
>
> The function "memcpy_flushcache" in arch/x86/include/asm/string_64.h is
> optimized for 4, 8 and 16-byte accesess (because that's what dm-writecache
> uses) - I suggest to add more optimizations to it for constant sizes that
> fit the usage pattern of dm-pcache,
Thanks again — as mentioned above, we plan to optimize later, possibly
with a more generic solution.
>
> I see that you are using "queue_delayed_work(cache_get_wq(cache),
> &cache->writeback_work, 0);" and "queue_delayed_work(cache_get_wq(cache),
> &cache->writeback_work, delay);" - the problem here is that if the entry
> is already queued with a delay and you attempt to queue it again with zero
> again, this new queue attempt will be ignored - I'm not sure if this is
> intended behavior or not.
Yes, in simple terms:
1. If after writeback completes the cache is clean, we should delay.
2. If it's not clean, set delay = 0 so the next writeback cycle can
begin immediately.
In theory these are two separate branches, even if a zero delay might be
ignored. Since writeback work isn't highly time-sensitive, I believe
this is harmless.
>
> req_complete_fn: this will never run with interrupts disabled, so you can
> replace spin_lock_irqsave/spin_unlock_irqrestore with
> spin_lock_irq/spin_unlock_irq.
Thanx, fixed in V1.
>
> backing_dev_bio_end: there's a bug in this function - it may be called
> both with interrupts disabled and interrupts enabled, so you should not
> use spin_lock/spin_unlock. You should be called
> spin_lock_irqsave/spin_unlock_irqrestore.
Good catch, it's fixed in V1.
>
> queue_work(BACKING_DEV_TO_PCACHE - i would move it inside the spinlock -
> see the commit 829451beaed6165eb11d7a9fb4e28eb17f489980 for a similar
> problem.
Thanx, fixed in V1.
>
> bio_map - bio vectors can hold arbitrarily long entries - if the "base"
> variable is not from vmalloc, you can just add it one bvec entry.
Good idea — in V1 we introduced `inline_bvecs`, and when `base` isn’t
vmalloc, we use a single bvec entry.
> "backing_req->kmem.bvecs = kcalloc" - you can use kmalloc_array instead of
> kcalloc, there's no need to zero the value.
Fixed in V1
>
>> + if (++wait_count >= PCACHE_WAIT_NEW_CACHE_COUNT)
>> + return NULL;
>> +
>> + udelay(PCACHE_WAIT_NEW_CACHE_INTERVAL);
>> + goto again;
> This is not good practice to insert arbitrary waits (here, the wait is
> burning CPU power, which makes it even worse). You should add the process
> to a wait queue and wake up the queue.
>
> See the functions writecache_wait_on_freelist and writecache_free_entry
> for an example, how to wait correctly.
`get_cache_segment()` may be called under a spinlock, so I originally
designed a "short busy-wait" for a free segment: if wait_count >=
PCACHE_WAIT_NEW_CACHE_COUNT, it returns -EBUSY and lets the dm-pcache
defer_list retry.
However, I later discovered that when the cache is full, there's rarely
enough time during that brief short-busy-wait to free a segment.
Therefore in V1 I removed it and instead return `-EBUSY`, allowing
dm-pcache’s retry mechanism to wait for a cache invalidation before
retrying.
>
>> +static int dm_pcache_map_bio(struct dm_target *ti, struct bio *bio)
>> +{
>> + struct pcache_request *pcache_req = dm_per_bio_data(bio, sizeof(struct pcache_request));
>> + struct dm_pcache *pcache = ti->private;
>> + int ret;
>> +
>> + pcache_req->pcache = pcache;
>> + kref_init(&pcache_req->ref);
>> + pcache_req->ret = 0;
>> + pcache_req->bio = bio;
>> + pcache_req->off = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
>> + pcache_req->data_len = bio->bi_iter.bi_size;
>> + INIT_LIST_HEAD(&pcache_req->list_node);
>> + bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);
> This looks suspicious because you store the original bi_sector to
> pcache_req->off and then subtract the target offset from it. Shouldn't
> "bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);"
> be before "pcache_req->off = (u64)bio->bi_iter.bi_sector <<
> SECTOR_SHIFT;"?
Yes, that logic is indeed questionable, but it works in testing.
Since we define dm-pcache as a **singleton**, both behaviors should
effectively be equivalent, IIUC. Also, in V1 I moved the call to
`dm_target_offset()` so it runs before setting up `pcache_req->off`,
making the code logic correct.
Thanx
Dongsheng
>
> Generally, the code doesn't seem bad. After reworking the out-of-memory
> handling and replacing arbitrary waits with wait queues, I can merge it.
>
> Mikulas
>
[-- Attachment #2: 0001-memcpy_flushcache_optimized-on-nd_pmem.ko.patch --]
[-- Type: text/plain, Size: 2455 bytes --]
From 88476f95801f7177c3a8c30c663cdd788f73a3b5 Mon Sep 17 00:00:00 2001
From: Dongsheng Yang <dongsheng.yang@linux.dev>
Date: Mon, 23 Jun 2025 10:10:37 +0800
Subject: [PATCH] memcpy_flushcache_optimized on nd_pmem.ko
Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
---
drivers/nvdimm/pmem.c | 39 ++++++++++++++++++++++++++++++++++++++-
1 file changed, 38 insertions(+), 1 deletion(-)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index aa50006b7616..78c2e8481630 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -122,6 +122,42 @@ static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
return BLK_STS_OK;
}
+static void memcpy_flushcache_optimized(void *dest, void *source, size_t size)
+{
+ /*
+ * clflushopt performs better with block size 1024, 2048, 4096
+ * non-temporal stores perform better with block size 512
+ *
+ * block size 512 1024 2048 4096
+ * movnti 496 MB/s 642 MB/s 725 MB/s 744 MB/s
+ * clflushopt 373 MB/s 688 MB/s 1.1 GB/s 1.2 GB/s
+ *
+ * We see that movnti performs better for 512-byte blocks, and
+ * clflushopt performs better for 1024-byte and larger blocks. So, we
+ * prefer clflushopt for sizes >= 768.
+ *
+ * NOTE: this happens to be the case now (with dm-writecache's single
+ * threaded model) but re-evaluate this once memcpy_flushcache() is
+ * enabled to use movdir64b which might invalidate this performance
+ * advantage seen with cache-allocating-writes plus flushing.
+ */
+#ifdef CONFIG_X86
+ if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) &&
+ likely(boot_cpu_data.x86_clflush_size == 64) &&
+ likely(size >= 0)) {
+ do {
+ memcpy((void *)dest, (void *)source, 64);
+ clflushopt((void *)dest);
+ dest += 64;
+ source += 64;
+ size -= 64;
+ } while (size >= 64);
+ return;
+ }
+#endif
+ memcpy_flushcache(dest, source, size);
+}
+
static void write_pmem(void *pmem_addr, struct page *page,
unsigned int off, unsigned int len)
{
@@ -131,7 +167,8 @@ static void write_pmem(void *pmem_addr, struct page *page,
while (len) {
mem = kmap_atomic(page);
chunk = min_t(unsigned int, len, PAGE_SIZE - off);
- memcpy_flushcache(pmem_addr, mem + off, chunk);
+ //memcpy_flushcache(pmem_addr, mem + off, chunk);
+ memcpy_flushcache_optimized(pmem_addr, mem + off, chunk);
kunmap_atomic(mem);
len -= chunk;
off = 0;
--
2.43.0
* Re: [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices
2025-06-23 3:13 ` Dongsheng Yang
@ 2025-06-23 4:18 ` Dongsheng Yang
2025-06-30 15:57 ` Mikulas Patocka
1 sibling, 0 replies; 18+ messages in thread
From: Dongsheng Yang @ 2025-06-23 4:18 UTC (permalink / raw)
To: Mikulas Patocka
Cc: agk, snitzer, axboe, hch, dan.j.williams, Jonathan.Cameron,
linux-block, linux-kernel, linux-cxl, nvdimm, dm-devel
[-- Attachment #1.1: Type: text/plain, Size: 3310 bytes --]
在 6/23/2025 11:13 AM, Dongsheng Yang 写道:
>
> Hi Mikulas:
>
> I will send dm-pcache V1 soon, below is my response to your comments.
>
> 在 6/13/2025 12:57 AM, Mikulas Patocka 写道:
>> Hi
>>
>>
>> On Thu, 5 Jun 2025, Dongsheng Yang wrote:
>>
>>> Hi Mikulas and all,
>>>
>>> This is *RFC v2* of the *pcache* series, a persistent-memory backed cache.
>>>
>>>
>>> ----------------------------------------------------------------------
>>> 1. pmem access layer
>>> ----------------------------------------------------------------------
>>>
>>> * All reads use *copy_mc_to_kernel()* so that uncorrectable media
>>> errors are detected and reported.
>>> * All writes go through *memcpy_flushcache()* to guarantee durability
>>> on real persistent memory.
>> You could also try to use normal write and clflushopt for big writes - I
>> found out that for larger regions it is better - see the function
>> memcpy_flushcache_optimized in dm-writecache. Test, which way is better.
>
> I did a test with fio on /dev/pmem0, with an attached patch on nd_pmem.ko:
>
> when I use memmap pmem device, I got a similar result with the comment
> in memcpy_flushcache_optimized():
>
> Test (memmap pmem) clflushopt flushcache
> ------------------------------------------------- test_randwrite_512
> 200 MiB/s 228 MiB/s test_randwrite_1024 378 MiB/s 431 MiB/s
> test_randwrite_2K 773 MiB/s 769 MiB/s test_randwrite_4K 1364 MiB/s
> 1272 MiB/s test_randwrite_8K 2078 MiB/s 1817 MiB/s test_randwrite_16K
> 2745 MiB/s 2098 MiB/s test_randwrite_32K 3232 MiB/s 2231 MiB/s
> test_randwrite_64K 3660 MiB/s 2411 MiB/s test_randwrite_128K 3922
> MiB/s 2513 MiB/s test_randwrite_1M 3824 MiB/s 2537 MiB/s
> test_write_512 228 MiB/s 228 MiB/s test_write_1024 439 MiB/s 423 MiB/s
> test_write_2K 841 MiB/s 800 MiB/s test_write_4K 1364 MiB/s 1308 MiB/s
> test_write_8K 2107 MiB/s 1838 MiB/s test_write_16K 2752 MiB/s 2166
> MiB/s test_write_32K 3213 MiB/s 2247 MiB/s test_write_64K 3661 MiB/s
> 2415 MiB/s test_write_128K 3902 MiB/s 2514 MiB/s test_write_1M 3808
> MiB/s 2529 MiB/s
>
> But I got a different result when I use Optane pmem100:
>
> Test (Optane pmem100) clflushopt flushcache
> ------------------------------------------------- test_randwrite_512
> 167 MiB/s 226 MiB/s test_randwrite_1024 301 MiB/s 420 MiB/s
> test_randwrite_2K 615 MiB/s 639 MiB/s test_randwrite_4K 967 MiB/s 1024
> MiB/s test_randwrite_8K 1047 MiB/s 1314 MiB/s test_randwrite_16K 1096
> MiB/s 1377 MiB/s test_randwrite_32K 1155 MiB/s 1382 MiB/s
> test_randwrite_64K 1184 MiB/s 1452 MiB/s test_randwrite_128K 1199
> MiB/s 1488 MiB/s test_randwrite_1M 1178 MiB/s 1499 MiB/s
> test_write_512 233 MiB/s 233 MiB/s test_write_1024 424 MiB/s 391 MiB/s
> test_write_2K 706 MiB/s 760 MiB/s test_write_4K 978 MiB/s 1076 MiB/s
> test_write_8K 1059 MiB/s 1296 MiB/s test_write_16K 1119 MiB/s 1380
> MiB/s test_write_32K 1158 MiB/s 1387 MiB/s test_write_64K 1184 MiB/s
> 1448 MiB/s test_write_128K 1198 MiB/s 1481 MiB/s test_write_1M 1178
> MiB/s 1486 MiB/s
>
>
> So for now I’d rather keep using flushcache in pcache. In future, once
> we’ve come up with a general-purpose optimization, we can switch to that.
>
Sorry for the formatting issue; the table can be found in the attached
<pmem_test_result.txt>.
Thanx
Dongsheng
[-- Attachment #2: pmem_test_result.txt --]
[-- Type: text/plain, Size: 2516 bytes --]
Test (memmap pmem) clflushopt flushcache
-------------------------------------------------
test_randwrite_512 200 MiB/s 228 MiB/s
test_randwrite_1024 378 MiB/s 431 MiB/s
test_randwrite_2K 773 MiB/s 769 MiB/s
test_randwrite_4K 1364 MiB/s 1272 MiB/s
test_randwrite_8K 2078 MiB/s 1817 MiB/s
test_randwrite_16K 2745 MiB/s 2098 MiB/s
test_randwrite_32K 3232 MiB/s 2231 MiB/s
test_randwrite_64K 3660 MiB/s 2411 MiB/s
test_randwrite_128K 3922 MiB/s 2513 MiB/s
test_randwrite_1M 3824 MiB/s 2537 MiB/s
test_write_512 228 MiB/s 228 MiB/s
test_write_1024 439 MiB/s 423 MiB/s
test_write_2K 841 MiB/s 800 MiB/s
test_write_4K 1364 MiB/s 1308 MiB/s
test_write_8K 2107 MiB/s 1838 MiB/s
test_write_16K 2752 MiB/s 2166 MiB/s
test_write_32K 3213 MiB/s 2247 MiB/s
test_write_64K 3661 MiB/s 2415 MiB/s
test_write_128K 3902 MiB/s 2514 MiB/s
test_write_1M 3808 MiB/s 2529 MiB/s
Test (Optane pmem100) clflushopt flushcache
-------------------------------------------------
test_randwrite_512 167 MiB/s 226 MiB/s
test_randwrite_1024 301 MiB/s 420 MiB/s
test_randwrite_2K 615 MiB/s 639 MiB/s
test_randwrite_4K 967 MiB/s 1024 MiB/s
test_randwrite_8K 1047 MiB/s 1314 MiB/s
test_randwrite_16K 1096 MiB/s 1377 MiB/s
test_randwrite_32K 1155 MiB/s 1382 MiB/s
test_randwrite_64K 1184 MiB/s 1452 MiB/s
test_randwrite_128K 1199 MiB/s 1488 MiB/s
test_randwrite_1M 1178 MiB/s 1499 MiB/s
test_write_512 233 MiB/s 233 MiB/s
test_write_1024 424 MiB/s 391 MiB/s
test_write_2K 706 MiB/s 760 MiB/s
test_write_4K 978 MiB/s 1076 MiB/s
test_write_8K 1059 MiB/s 1296 MiB/s
test_write_16K 1119 MiB/s 1380 MiB/s
test_write_32K 1158 MiB/s 1387 MiB/s
test_write_64K 1184 MiB/s 1448 MiB/s
test_write_128K 1198 MiB/s 1481 MiB/s
test_write_1M 1178 MiB/s 1486 MiB/s
* Re: [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices
2025-06-23 3:13 ` Dongsheng Yang
2025-06-23 4:18 ` Dongsheng Yang
@ 2025-06-30 15:57 ` Mikulas Patocka
2025-06-30 16:28 ` Dongsheng Yang
1 sibling, 1 reply; 18+ messages in thread
From: Mikulas Patocka @ 2025-06-30 15:57 UTC (permalink / raw)
To: Dongsheng Yang
Cc: agk, snitzer, axboe, hch, dan.j.williams, Jonathan.Cameron,
linux-block, linux-kernel, linux-cxl, nvdimm, dm-devel
On Mon, 23 Jun 2025, Dongsheng Yang wrote:
> +static int dm_pcache_map_bio(struct dm_target *ti, struct bio *bio)
> +{
> + struct pcache_request *pcache_req = dm_per_bio_data(bio, sizeof(struct pcache_request));
> + struct dm_pcache *pcache = ti->private;
> + int ret;
> +
> + pcache_req->pcache = pcache;
> + kref_init(&pcache_req->ref);
> + pcache_req->ret = 0;
> + pcache_req->bio = bio;
> + pcache_req->off = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
> + pcache_req->data_len = bio->bi_iter.bi_size;
> + INIT_LIST_HEAD(&pcache_req->list_node);
> + bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);
>
> This looks suspicious because you store the original bi_sector to
> pcache_req->off and then subtract the target offset from it. Shouldn't
> "bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);"
> be before "pcache_req->off = (u64)bio->bi_iter.bi_sector <<
> SECTOR_SHIFT;"?
>
>
> Yes, that logic is indeed questionable, but it works in testing.
>
> Since we define dm-pcache as a **singleton**, both behaviors should
> effectively be equivalent, IIUC. Also, in V1 I moved the call to
> `dm_target_offset()` so it runs before setting up `pcache_req->off`,
> making the code logic correct.
If this target is a singleton, you can delete the call to dm_target_offset
altogether.
That call is harmless, but it looks confusing when reviewing the code,
because pcache_req->off is set to the absolute bio sector (from the start
of the table) and bio->bi_iter.bi_sector is set to the relative bio sector
(from the start of the target). If the target always starts at offset 0,
dm_target_offset just returns bi_sector.
Mikulas
* Re: [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices
2025-06-30 15:57 ` Mikulas Patocka
@ 2025-06-30 16:28 ` Dongsheng Yang
0 siblings, 0 replies; 18+ messages in thread
From: Dongsheng Yang @ 2025-06-30 16:28 UTC (permalink / raw)
To: Mikulas Patocka
Cc: agk, snitzer, axboe, hch, dan.j.williams, Jonathan.Cameron,
linux-block, linux-kernel, linux-cxl, nvdimm, dm-devel
在 6/30/2025 11:57 PM, Mikulas Patocka 写道:
>
> On Mon, 23 Jun 2025, Dongsheng Yang wrote:
>
>> +static int dm_pcache_map_bio(struct dm_target *ti, struct bio *bio)
>> +{
>> + struct pcache_request *pcache_req = dm_per_bio_data(bio, sizeof(struct pcache_request));
>> + struct dm_pcache *pcache = ti->private;
>> + int ret;
>> +
>> + pcache_req->pcache = pcache;
>> + kref_init(&pcache_req->ref);
>> + pcache_req->ret = 0;
>> + pcache_req->bio = bio;
>> + pcache_req->off = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
>> + pcache_req->data_len = bio->bi_iter.bi_size;
>> + INIT_LIST_HEAD(&pcache_req->list_node);
>> + bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);
>>
>> This looks suspicious because you store the original bi_sector to
>> pcache_req->off and then subtract the target offset from it. Shouldn't
>> "bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);"
>> be before "pcache_req->off = (u64)bio->bi_iter.bi_sector <<
>> SECTOR_SHIFT;"?
>>
>>
>> Yes, that logic is indeed questionable, but it works in testing.
>>
>> Since we define dm-pcache as a **singleton**, both behaviors should
>> effectively be equivalent, IIUC. Also, in V1 I moved the call to
>> `dm_target_offset()` so it runs before setting up `pcache_req->off`,
>> making the code logic correct.
> If this target is singleton, you can delete the call to dm_target_offset
> at all.
>
> That call is harmless, but it looks confusing when reviewing the code,
> because pcache_req->off is set to the absolute bio sector (from the start
> of the table) and bio->bi_iter.bi_sector is set to the relative bio sector
> (from the start of the target). If the target always starts at offset 0,
> dm_target_offset just returns bi_sector.
That makes sense
Thanx
>
> Mikulas
>
Thread overview: 18+ messages
2025-06-05 14:22 [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices Dongsheng Yang
2025-06-05 14:22 ` [RFC PATCH 01/11] dm-pcache: add pcache_internal.h Dongsheng Yang
2025-06-05 14:22 ` [RFC PATCH 02/11] dm-pcache: add backing device management Dongsheng Yang
2025-06-05 14:22 ` [RFC PATCH 03/11] dm-pcache: add cache device Dongsheng Yang
2025-06-05 14:22 ` [RFC PATCH 04/11] dm-pcache: add segment layer Dongsheng Yang
2025-06-05 14:23 ` [RFC PATCH 05/11] dm-pcache: add cache_segment Dongsheng Yang
2025-06-05 14:23 ` [RFC PATCH 06/11] dm-pcache: add cache_writeback Dongsheng Yang
2025-06-05 14:23 ` [RFC PATCH 07/11] dm-pcache: add cache_gc Dongsheng Yang
2025-06-05 14:23 ` [RFC PATCH 08/11] dm-pcache: add cache_key Dongsheng Yang
2025-06-05 14:23 ` [RFC PATCH 09/11] dm-pcache: add cache_req Dongsheng Yang
2025-06-05 14:23 ` [RFC PATCH 10/11] dm-pcache: add cache core Dongsheng Yang
2025-06-05 14:23 ` [RFC PATCH 11/11] dm-pcache: initial dm-pcache target Dongsheng Yang
2025-06-12 16:57 ` [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices Mikulas Patocka
2025-06-13 3:39 ` Dongsheng Yang
2025-06-23 3:13 ` Dongsheng Yang
2025-06-23 4:18 ` Dongsheng Yang
2025-06-30 15:57 ` Mikulas Patocka
2025-06-30 16:28 ` Dongsheng Yang