Linux Device Mapper development
From: Ken Raeburn <raeburn@redhat.com>
To: Mike Snitzer <snitzer@kernel.org>
Cc: linux-block@vger.kernel.org, vdo-devel@redhat.com,
	dm-devel@redhat.com, ebiggers@kernel.org, tj@kernel.org
Subject: Re: [dm-devel] [vdo-devel] [PATCH v2 00/39] Add the dm-vdo deduplication and compression device mapper target.
Date: Mon, 24 Jul 2023 14:03:45 -0400
Message-ID: <87mszl9ofy.fsf@redhat.com>
In-Reply-To: <ZLa086NuWiMkJKJE@redhat.com> (Mike Snitzer's message of "Tue, 18 Jul 2023 11:51:15 -0400")


(Apologies for the re-send ... I neglected to turn off HTML and so
linux-block bounced the email as spam.)

On Tue, Jul 18, 2023 at 11:51 AM Mike Snitzer <snitzer@kernel.org> wrote:

> But the long-standing dependency on VDO's work-queue data
> struct is still lingering (drivers/md/dm-vdo/work-queue.c). At a
> minimum we need to work toward pinning down _exactly_ why that is, and
> I think the best way to answer that is by simply converting the VDO
> code over to using Linux's workqueues.  If doing so causes serious
> inherent performance (or functionality) loss then we need to
> understand why -- and fix Linux's workqueue code accordingly. (I've
> cc'd Tejun so he is aware).

We tried this experiment and did indeed see significant performance
differences: nearly a 7x slowdown in some cases.

VDO can be pretty CPU-intensive. In addition to hashing and
compression, it scans some big in-memory data structures as part of
the deduplication process. Some data structures are split across
multiple "zones" to enable concurrency (usually split based on bits
of an address or something like that), but some are not, and a couple
of the threads that own those unsplit structures can sometimes exceed
50% CPU utilization, even 90%, depending on the system and test data
configuration. (Usually this is while pushing over 1 GB/s through the
deduplication and compression processing on a system with fast
storage. On a slow VM with spinning storage, the CPU load is much
smaller.)

We use a sort of message-passing arrangement where a worker thread is
responsible for updating certain data structures as needed for the
I/Os in progress, rather than having the processing of each I/O
contend for locks on those data structures. It gives us good
throughput under load, but it does mean upwards of a dozen handoffs
per 4 kB write, depending on compressibility, whether the block is a
duplicate, and various other factors. So processing 1 GB/s means
handling over 3 million messages per second, though each step of
processing is generally lightweight. For our dedicated worker threads,
it's not unusual for a thread to wake up and process a few tens or
even hundreds of updates to its data structures (likely benefiting
from CPU caching of those structures) before running out of available
work and going back to sleep.
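
(Again purely as an illustrative sketch with invented names: the
handoff pattern is roughly a lock-free inbox per dedicated thread,
drained in batches before the thread sleeps, along these lines.)

#include <linux/kthread.h>
#include <linux/llist.h>
#include <linux/wait.h>

struct message {
	struct llist_node node;
	/* ... the update to apply to this thread's structures ... */
};

static LLIST_HEAD(inbox);
static DECLARE_WAIT_QUEUE_HEAD(inbox_wait);

static void process_message(struct message *msg); /* hypothetical handler */

/* producer side: one "handoff" is a push plus a wakeup */
static void post_message(struct message *msg)
{
	llist_add(&msg->node, &inbox);
	wake_up(&inbox_wait);
}

/* dedicated worker: drain everything available, then sleep */
static int worker_fn(void *unused)
{
	while (!kthread_should_stop()) {
		struct llist_node *batch = llist_del_all(&inbox);
		struct message *msg, *next;

		/* note: llist_del_all hands back the batch in LIFO order */
		llist_for_each_entry_safe(msg, next, batch, node)
			process_message(msg);

		wait_event_interruptible(inbox_wait,
					 !llist_empty(&inbox) ||
					 kthread_should_stop());
	}
	return 0;
}

The batching is where the cache benefit comes from: the worker's data
structures stay hot across tens or hundreds of updates per wakeup.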

The experiment I ran was to create an ordered workqueue in place of
each dedicated thread where we need serialization, and unordered
workqueues where concurrency is allowed. On our slower test systems
(a >10-year-old Supermicro with a Xeon E5-1650 v2 and RAID-0 storage
using SSDs or HDDs), the slowdown was less significant (under 2x),
but on our faster system (a 4-5? year old Supermicro 1029P-WTR, 2x
Xeon Gold 6128 = 12 cores, NVMe storage) we got nearly a 7x slowdown
overall. I haven't yet dug deeply into _why_ the kernel workqueues
are slower in this sort of setup. I did run "perf top" briefly during
one test with kernel workqueues, and the largest single use of CPU
cycles was in spin lock acquisition, but I didn't get call graphs.
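
(Schematically, the conversion looked like the sketch below; the
names are invented, but the workqueue calls are the stock kernel API.)

#include <linux/errno.h>
#include <linux/workqueue.h>

/*
 * Sketch of the experiment: an ordered workqueue stands in for each
 * dedicated thread that needs serialization, and a plain unbound
 * workqueue is used where concurrency is allowed.
 */
static struct workqueue_struct *serial_wq;
static struct workqueue_struct *parallel_wq;

struct vdo_msg {
	struct work_struct work;
	/* ... payload ... */
};

static void apply_update(struct vdo_msg *msg); /* hypothetical */

static void msg_handler(struct work_struct *work)
{
	struct vdo_msg *msg = container_of(work, struct vdo_msg, work);

	apply_update(msg); /* lightweight update, possibly another handoff */
}

static int setup_queues(void)
{
	serial_wq = alloc_ordered_workqueue("vdo_serial", 0);
	parallel_wq = alloc_workqueue("vdo_parallel", WQ_UNBOUND, 0);
	if (!serial_wq || !parallel_wq)
		return -ENOMEM;
	return 0;
}

/* each queue_work() here replaces one thread-to-thread handoff */
static void post(struct vdo_msg *msg)
{
	INIT_WORK(&msg->work, msg_handler);
	queue_work(serial_wq, &msg->work);
}

At over 3 million messages per second, even a small per-queue_work()
cost gets multiplied quickly, which would be consistent with the spin
lock time showing up in perf.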

(This was with Fedora 37 6.2.12-200 and 6.2.15-200 kernels, without
the latest submissions from Tejun, which look interesting, though I
suspect we care more about cache locality for some of our
thread-specific data structures than for access to the I/O
structures.)

Ken

