All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dongsheng Yang <dongsheng.yang@linux.dev>
To: Mikulas Patocka <mpatocka@redhat.com>
Cc: agk@redhat.com, snitzer@kernel.org, axboe@kernel.dk, hch@lst.de,
	dan.j.williams@intel.com, Jonathan.Cameron@Huawei.com,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-cxl@vger.kernel.org, nvdimm@lists.linux.dev,
	dm-devel@lists.linux.dev
Subject: Re: [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices
Date: Mon, 23 Jun 2025 11:13:04 +0800	[thread overview]
Message-ID: <3c9f304a-b830-4242-8e01-04efab4e0eaa@linux.dev> (raw)
In-Reply-To: <dc019764-5128-526e-d8ea-effa78e37b39@redhat.com>


[-- Attachment #1.1: Type: text/plain, Size: 13898 bytes --]

Hi Mikulas:

      I will send dm-pcache V1 soon, below is my response to your comments.

在 6/13/2025 12:57 AM, Mikulas Patocka 写道:
> Hi
>
>
> On Thu, 5 Jun 2025, Dongsheng Yang wrote:
>
>> Hi Mikulas and all,
>>
>> This is *RFC v2* of the *pcache* series, a persistent-memory backed cache.
>> Compared with *RFC v1*
>> <https://lore.kernel.org/lkml/20250414014505.20477-1-dongsheng.yang@linux.dev/>  
>> the most important change is that the whole cache has been *ported to
>> the Device-Mapper framework* and is now exposed as a regular DM target.
>>
>> Code:
>>      https://github.com/DataTravelGuide/linux/tree/dm-pcache
>>
>> Full RFC v2 test results:
>>      https://datatravelguide.github.io/dtg-blog/pcache/pcache_rfc_v2_result/results.html
>>
>>      All 962 xfstests cases passed successfully under four different
>> pcache configurations.
>>
>>      One of the detailed xfstests run:
>>          https://datatravelguide.github.io/dtg-blog/pcache/pcache_rfc_v2_result/test-results/02-._pcache.py_PcacheTest.test_run-crc-enable-gc-gc0-test_script-xfstests-a515/debug.log
>>
>> Below is a quick tour through the three layers of the implementation,
>> followed by an example invocation.
>>
>> ----------------------------------------------------------------------
>> 1. pmem access layer
>> ----------------------------------------------------------------------
>>
>> * All reads use *copy_mc_to_kernel()* so that uncorrectable media
>>    errors are detected and reported.
>> * All writes go through *memcpy_flushcache()* to guarantee durability
>>    on real persistent memory.
> You could also try to use normal write and clflushopt for big writes - I
> found out that for larger regions it is better - see the function
> memcpy_flushcache_optimized in dm-writecache. Test, which way is better.

I did a test with fio on /dev/pmem0, with an attached patch on nd_pmem.ko:

when I use memmap pmem device, I got a similar result with the comment 
in memcpy_flushcache_optimized():

Test (memmap pmem) clflushopt flushcache 
------------------------------------------------- test_randwrite_512 200 
MiB/s 228 MiB/s test_randwrite_1024 378 MiB/s 431 MiB/s 
test_randwrite_2K 773 MiB/s 769 MiB/s test_randwrite_4K 1364 MiB/s 1272 
MiB/s test_randwrite_8K 2078 MiB/s 1817 MiB/s test_randwrite_16K 2745 
MiB/s 2098 MiB/s test_randwrite_32K 3232 MiB/s 2231 MiB/s 
test_randwrite_64K 3660 MiB/s 2411 MiB/s test_randwrite_128K 3922 MiB/s 
2513 MiB/s test_randwrite_1M 3824 MiB/s 2537 MiB/s test_write_512 228 
MiB/s 228 MiB/s test_write_1024 439 MiB/s 423 MiB/s test_write_2K 841 
MiB/s 800 MiB/s test_write_4K 1364 MiB/s 1308 MiB/s test_write_8K 2107 
MiB/s 1838 MiB/s test_write_16K 2752 MiB/s 2166 MiB/s test_write_32K 
3213 MiB/s 2247 MiB/s test_write_64K 3661 MiB/s 2415 MiB/s 
test_write_128K 3902 MiB/s 2514 MiB/s test_write_1M 3808 MiB/s 2529 MiB/s

But I got a different result when I use Optane pmem100:

Test (Optane pmem100) clflushopt flushcache 
------------------------------------------------- test_randwrite_512 167 
MiB/s 226 MiB/s test_randwrite_1024 301 MiB/s 420 MiB/s 
test_randwrite_2K 615 MiB/s 639 MiB/s test_randwrite_4K 967 MiB/s 1024 
MiB/s test_randwrite_8K 1047 MiB/s 1314 MiB/s test_randwrite_16K 1096 
MiB/s 1377 MiB/s test_randwrite_32K 1155 MiB/s 1382 MiB/s 
test_randwrite_64K 1184 MiB/s 1452 MiB/s test_randwrite_128K 1199 MiB/s 
1488 MiB/s test_randwrite_1M 1178 MiB/s 1499 MiB/s test_write_512 233 
MiB/s 233 MiB/s test_write_1024 424 MiB/s 391 MiB/s test_write_2K 706 
MiB/s 760 MiB/s test_write_4K 978 MiB/s 1076 MiB/s test_write_8K 1059 
MiB/s 1296 MiB/s test_write_16K 1119 MiB/s 1380 MiB/s test_write_32K 
1158 MiB/s 1387 MiB/s test_write_64K 1184 MiB/s 1448 MiB/s 
test_write_128K 1198 MiB/s 1481 MiB/s test_write_1M 1178 MiB/s 1486 MiB/s


So for now I’d rather keep using flushcache in pcache. In future, once 
we’ve come up with a general-purpose optimization, we can switch to that.

>
>> ----------------------------------------------------------------------
>> 2. cache-logic layer (segments / keys / workers)
>> ----------------------------------------------------------------------
>>
>> Main features
>>    - 16 MiB pmem segments, log-structured allocation.
>>    - Multi-subtree RB-tree index for high parallelism.
>>    - Optional per-entry *CRC32* on cached data.
> Would it be better to use crc32c because it has hardware support in the
> SSE4.2 instruction set?

Good suggestion, thanx.

>
>>    - Background *write-back* worker and watermark-driven *GC*.
>>    - Crash-safe replay: key-sets are scanned from *key_tail* on start-up.
>>
>> Current limitations
>>    - Only *write-back* mode implemented.
>>    - Only FIFO cache invalidate; other (LRU, ARC...) planned.
>>
>> ----------------------------------------------------------------------
>> 3. dm-pcache target integration
>> ----------------------------------------------------------------------
>>
>> * Table line
>>      `pcache <pmem_dev> <origin_dev> writeback <true|false>`
>> * Features advertised to DM:
>>    - `ti->flush_supported = true`, so *PREFLUSH* and *FUA* are honoured
>>      (they force all open key-sets to close and data to be durable).
>> * Not yet supported:
>>    - Discard / TRIM.
>>    - dynamic `dmsetup reload`.
> If you don't support it, you should at least try to detect that the user
> did reload and return error - so that there won't be data corruption in
> this case.

Yes, actually It did return -EOPNOTSUPP。

static int dm_pcache_ctr(struct dm_target *ti, unsigned int argc, char 
**argv)
{
         struct mapped_device *md = ti->table->md;
         struct dm_pcache *pcache;
         int ret;

         if (md->map) {
                 ti->error = "Don't support table loading for live md";
                 return -EOPNOTSUPP;
         }

>
> But it would be better to support table reload. You can support it by a
> similar mechanism to "__handover_exceptions" in the dm-snap.c driver.


Of course, that's planned but we think that's not necessary in initial 
version of it.

>
>> Runtime controls
>>    - `dmsetup message <dev> 0 gc_percent <0-90>` adjusts the GC trigger.
>>
>> Status line reports super-block flags, segment counts, GC threshold and
>> the three tail/head pointers (see the RST document for details).
> Perhaps these are not real bugs (I didn't analyze it thoroughly), but
> there are some GFP_NOWAIT and GFP_KERNEL allocations.
>
> GFP_NOWAIT can fail anytime (for example, if the machine receives too many
> network packets), so you must handle the error gracefully.
>
> GFP_KERNEL allocation may recurse back into the I/O path through swapping
> or file writeback, thus they may cause deadlocks. You should use
> GFP_KERNEL in the target constructor or destructor because there is no I/O
> to be processed in this time, but they shouldn't be used in the I/O
> processing path.


**Yes**, I'm using the approach I described above.

- **GFP_KERNEL** is only used on the constructor path.
- **GFP_NOWAIT** is used in two cases: one for allocating `cache_key`, 
and another for allocating `backing_dev_req` on a cache miss.

Both of these NOWAIT allocations happen while traversing the 
`cache_subtree` under the `subtree_lock` spinlock—i.e., in contexts 
where scheduling isn't allowed. That’s why we use `GFP_NOWAIT`.


>
> I see that when you get ENOMEM, you retry the request in 100ms - putting
> arbitrary waits in the code is generally bad practice - this won't work if
> the user is swapping to the dm-pcache device. It may be possible that
> there is no memory free, thus retrying won't help and it will deadlock.


In **V1**, I removed the retry-on-ENOMEM mechanism to make pcache more 
coherent. Now it only retries when the cache is actually full.

- When the cache is full, the `pcache_req` is added to a `defer_list`.
- When cache invalidation occurs, we kick off all deferred requests to 
retry.

This eliminates arbitrary waits.

>
> I suggest to use mempools to guarantee forward progress in out-of-memory
> situation. A mempool_alloc(GFP_IO) will never return NULL, it will just
> wait until some other process frees some entry into the mempool.

Both `cache_key` and `backing_dev_req` now use **mempools**, but—as 
explained—they may still be allocated under a spinlock, so they use 
**GFP_NOWAIT**.

>
> Generally, a convention among device mapper targets is that the have a few
> fixed parameters first, then there is a number of optional parameters and
> then there are optional parameters (either in "parameter:123" or
> "parameter 123" format). You should follow this convention, so that it can
> be easily extended with new parameters later.
Thanks, that's a good suggestion.

In **V1**, we changed the parameter format to:

  pcache <cache_dev> <backing_dev> [<number_of_optional_arguments> 
<cache_mode writeback> <data_crc true|false>]

>
> The __packed attribute causes performance degradation on risc machines
> without hardware support for unaligned accesses - the compiled will
> generate byte-by-byte accesses - I suggest to not use it and instead make
> sure that the members in the structures are naturally aligned (and
> inserting explicit padding if needed).


Thanks — we removed `__packed` in V1.

>
> The function "memcpy_flushcache" in arch/x86/include/asm/string_64.h is
> optimized for 4, 8 and 16-byte accesess (because that's what dm-writecache
> uses) - I suggest to add more optimizations to it for constant sizes that
> fit the usage pattern of dm-pcache,

Thanks again — as mentioned above, we plan to optimize later, possibly 
with a more generic solution.

>
> I see that you are using "queue_delayed_work(cache_get_wq(cache),
> &cache->writeback_work, 0);" and "queue_delayed_work(cache_get_wq(cache),
> &cache->writeback_work, delay);" - the problem here is that if the entry
> is already queued with a delay and you attempt to queue it again with zero
> again, this new queue attempt will be ignored - I'm not sure if this is
> intended behavior or not.

Yes, in simple terms:
1. If after writeback completes the cache is clean, we should delay.
2. If it's not clean, set delay = 0 so the next writeback cycle can 
begin immediately.

In theory these are two separate branches, even if a zero delay might be 
ignored. Since writeback work isn't highly time-sensitive, I believe 
this is harmless.

>
> req_complete_fn: this will never run with interrupts disabled, so you can
> replace spin_lock_irqsave/spin_unlock_irqrestore with
> spin_lock_irq/spin_unlock_irq.

thanx  Fixed in V1

>
> backing_dev_bio_end: there's a bug in this function - it may be called
> both with interrupts disabled and interrupts enabled, so you should not
> use spin_lock/spin_unlock. You should be called
> spin_lock_irqsave/spin_unlock_irqrestore.

Good catch,it's fixed in V1

>
> queue_work(BACKING_DEV_TO_PCACHE - i would move it inside the spinlock -
> see the commit 829451beaed6165eb11d7a9fb4e28eb17f489980 for a similar
> problem.

Thanx, Fixed in V1

>
> bio_map - bio vectors can hold arbitrarily long entries - if the "base"
> variable is not from vmalloc, you can just add it one bvec entry.


Good idea — in V1 we introduced `inline_bvecs`, and when `base` isn’t 
vmalloc, we use a single bvec entry.

> "backing_req->kmem.bvecs = kcalloc" - you can use kmalloc_array instead of
> kcalloc, there's no need to zero the value.
Fixed in V1
>
>> +                if (++wait_count >= PCACHE_WAIT_NEW_CACHE_COUNT)
>> +                        return NULL;
>> +
>> +                udelay(PCACHE_WAIT_NEW_CACHE_INTERVAL);
>> +                goto again;
> This is not good practice to insert arbitrary waits (here, the wait is
> burning CPU power, which makes it even worse). You should add the process
> to a wait queue and wake up the queue.
>
> See the functions writecache_wait_on_freelist and writecache_free_entry
> for an example, how to wait correctly.


`get_cache_segment()` may be called under a spinlock, so I originally 
designed a “short-busy-wait” to wait for a free segment,

If wait_count >= PCACHE_WAIT_NEW_CACHE_COUNT, then return -EBUSY to let 
dm-pcache defer_list to retry.

However, I later discovered that when the cache is full, there's rarely 
enough time during that brief short-busy-wait to free a segment.

Therefore in V1 I removed it and instead return `-EBUSY`, allowing 
dm-pcache’s retry mechanism to wait for a cache invalidation before 
retrying.

>
>> +static int dm_pcache_map_bio(struct dm_target *ti, struct bio *bio)
>> +{
>> +     struct pcache_request *pcache_req = dm_per_bio_data(bio, sizeof(struct pcache_request));
>> +     struct dm_pcache *pcache = ti->private;
>> +     int ret;
>> +
>> +     pcache_req->pcache = pcache;
>> +     kref_init(&pcache_req->ref);
>> +     pcache_req->ret = 0;
>> +     pcache_req->bio = bio;
>> +     pcache_req->off = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
>> +     pcache_req->data_len = bio->bi_iter.bi_size;
>> +     INIT_LIST_HEAD(&pcache_req->list_node);
>> +     bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);
> This looks suspicious because you store the original bi_sector to
> pcache_req->off and then subtract the target offset from it. Shouldn't
> "bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);"
> be before "pcache_req->off = (u64)bio->bi_iter.bi_sector <<
> SECTOR_SHIFT;"?


Yes, that logic is indeed questionable, but it works in testing.

Since we define dm-pcache as a **singleton**, both behaviors should 
effectively be equivalent, IIUC. Also, in V1 I moved the call to 
`dm_target_offset()` so it runs before setting up `pcache_req->off`, 
making the code logic correct.

Thanx

Dongsheng

>
> Generally, the code doesn't seem bad. After reworking the out-of-memory
> handling and replacing arbitrary waits with wait queues, I can merge it.
>
> Mikulas
>

[-- Attachment #1.2: Type: text/html, Size: 20639 bytes --]

[-- Attachment #2: 0001-memcpy_flushcache_optimized-on-nd_pmem.ko.patch --]
[-- Type: text/plain, Size: 2455 bytes --]

From 88476f95801f7177c3a8c30c663cdd788f73a3b5 Mon Sep 17 00:00:00 2001
From: Dongsheng Yang <dongsheng.yang@linux.dev>
Date: Mon, 23 Jun 2025 10:10:37 +0800
Subject: [PATCH] memcpy_flushcache_optimized on nd_pmem.ko

Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
---
 drivers/nvdimm/pmem.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index aa50006b7616..78c2e8481630 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -122,6 +122,42 @@ static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
 	return BLK_STS_OK;
 }
 
+static void memcpy_flushcache_optimized(void *dest, void *source, size_t size)
+{
+	/*
+	 * clflushopt performs better with block size 1024, 2048, 4096
+	 * non-temporal stores perform better with block size 512
+	 *
+	 * block size   512             1024            2048            4096
+	 * movnti       496 MB/s        642 MB/s        725 MB/s        744 MB/s
+	 * clflushopt   373 MB/s        688 MB/s        1.1 GB/s        1.2 GB/s
+	 *
+	 * We see that movnti performs better for 512-byte blocks, and
+	 * clflushopt performs better for 1024-byte and larger blocks. So, we
+	 * prefer clflushopt for sizes >= 768.
+	 *
+	 * NOTE: this happens to be the case now (with dm-writecache's single
+	 * threaded model) but re-evaluate this once memcpy_flushcache() is
+	 * enabled to use movdir64b which might invalidate this performance
+	 * advantage seen with cache-allocating-writes plus flushing.
+	 */
+#ifdef CONFIG_X86
+	if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) &&
+	    likely(boot_cpu_data.x86_clflush_size == 64) &&
+	    likely(size >= 0)) {
+		do {
+			memcpy((void *)dest, (void *)source, 64);
+			clflushopt((void *)dest);
+			dest += 64;
+			source += 64;
+			size -= 64;
+		} while (size >= 64);
+		return;
+	}
+#endif
+	memcpy_flushcache(dest, source, size);
+}
+
 static void write_pmem(void *pmem_addr, struct page *page,
 		unsigned int off, unsigned int len)
 {
@@ -131,7 +167,8 @@ static void write_pmem(void *pmem_addr, struct page *page,
 	while (len) {
 		mem = kmap_atomic(page);
 		chunk = min_t(unsigned int, len, PAGE_SIZE - off);
-		memcpy_flushcache(pmem_addr, mem + off, chunk);
+		//memcpy_flushcache(pmem_addr, mem + off, chunk);
+		memcpy_flushcache_optimized(pmem_addr, mem + off, chunk);
 		kunmap_atomic(mem);
 		len -= chunk;
 		off = 0;
-- 
2.43.0


  parent reply	other threads:[~2025-06-23  3:13 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-05 14:22 [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices Dongsheng Yang
2025-06-05 14:22 ` [RFC PATCH 01/11] dm-pcache: add pcache_internal.h Dongsheng Yang
2025-06-05 14:22 ` [RFC PATCH 02/11] dm-pcache: add backing device management Dongsheng Yang
2025-06-05 14:22 ` [RFC PATCH 03/11] dm-pcache: add cache device Dongsheng Yang
2025-06-05 14:22 ` [RFC PATCH 04/11] dm-pcache: add segment layer Dongsheng Yang
2025-06-05 14:23 ` [RFC PATCH 05/11] dm-pcache: add cache_segment Dongsheng Yang
2025-06-05 14:23 ` [RFC PATCH 06/11] dm-pcache: add cache_writeback Dongsheng Yang
2025-06-05 14:23 ` [RFC PATCH 07/11] dm-pcache: add cache_gc Dongsheng Yang
2025-06-05 14:23 ` [RFC PATCH 08/11] dm-pcache: add cache_key Dongsheng Yang
2025-06-05 14:23 ` [RFC PATCH 09/11] dm-pcache: add cache_req Dongsheng Yang
2025-06-05 14:23 ` [RFC PATCH 10/11] dm-pcache: add cache core Dongsheng Yang
2025-06-05 14:23 ` [RFC PATCH 11/11] dm-pcache: initial dm-pcache target Dongsheng Yang
2025-06-12 16:57 ` [RFC v2 00/11] dm-pcache – persistent-memory cache for block devices Mikulas Patocka
2025-06-13  3:39   ` Dongsheng Yang
2025-06-23  3:13   ` Dongsheng Yang [this message]
2025-06-23  4:18     ` Dongsheng Yang
2025-06-30 15:57     ` Mikulas Patocka
2025-06-30 16:28       ` Dongsheng Yang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3c9f304a-b830-4242-8e01-04efab4e0eaa@linux.dev \
    --to=dongsheng.yang@linux.dev \
    --cc=Jonathan.Cameron@Huawei.com \
    --cc=agk@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=dan.j.williams@intel.com \
    --cc=dm-devel@lists.linux.dev \
    --cc=hch@lst.de \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mpatocka@redhat.com \
    --cc=nvdimm@lists.linux.dev \
    --cc=snitzer@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.