Linux Device Mapper development
From: Ken Raeburn <raeburn@redhat.com>
To: Mike Snitzer <snitzer@kernel.org>
Cc: linux-block@vger.kernel.org, vdo-devel@redhat.com,
	dm-devel@redhat.com, ebiggers@kernel.org, tj@kernel.org
Subject: Re: [dm-devel] [vdo-devel] [PATCH v2 00/39] Add the dm-vdo deduplication and compression device mapper target.
Date: Wed, 26 Jul 2023 19:32:56 -0400	[thread overview]
Message-ID: <87cz0e9rkn.fsf@redhat.com> (raw)
In-Reply-To: <CAK1Ur396ThV5AAZx2336uAW3FqSY+bHiiwEPofHB_Kwwr4ag5A@mail.gmail.com> (Kenneth Raeburn's message of "Fri, 21 Jul 2023 21:59:05 -0400")


An offline discussion suggested maybe I should've gone into a little
more detail about how VDO uses its work queues.

VDO is sufficiently work-intensive that we found long ago that doing all
the work in one thread wouldn't keep up.

Our multithreaded design started many years ago and grew out of our
existing design for UDS (VDO's central deduplication index). Somewhat
akin to partitioning and sharding in databases, UDS scans the in-memory
part of its "database" of values across some number of threads (fixed
at startup), with the data and work divided up based on certain bits of
the hash value being looked up; its I/O and callbacks run on certain
other threads. We aren't splitting work across multiple machines as
database systems sometimes do, but across multiple threads and
potentially multiple NUMA nodes.

We try to optimize for keeping the busy case fast, even if it means
lightly loaded operation doesn't perform quite as well as it could. We
try to reduce contention between threads by avoiding locks when we can,
preferring a fast queueing mechanism or loose synchronization between
threads. (We haven't kept to that strictly, but we've mostly tried.)

In VDO, at the first level, the work is split according to the
collection of data structures to be updated (e.g., recovery journal vs
disk block allocation vs block address mapping management).

For some data structures, we split the structures further based on
values of relevant bit-strings for the data structure in question (block
addresses, hash values). Currently we can split the work N ways for many
small values of N, but it's hard to change N without restarting. The
processing of a read or write operation generally doesn't need to touch
more than one "zone" in any of these sets (or two, in a certain write
case).

Giving one thread exclusive access to the data structures means we can
do away with the locking. Of course, with so many different threads
owning data structures, we get a lot of queueing in exchange, but we
depend on a fast, nearly-lock-free MPSC queueing mechanism to keep that
reasonably efficient.

There's a little more to it in places where we need to preserve the
order of processing of multiple VIOs in a couple of different sections
of the write path. There we make some higher-level use of the known
ordering behavior of the queues we add work to, rather than just
turning loose a bunch of threads to contend for a just-released mutex.

Some other bits of work, like computing the hash value, don't update any
shared data structures. Not only would they be amenable to kernel
workqueue conversion with concurrency greater than 1, but such a
conversion might open up some interesting options, like hashing on the
CPU or NUMA node where the data block is likely to reside in cache. For
now, though, using one work management mechanism has been easier than
two.

The experiment I referred to in my earlier email with using kernel
workqueues in VDO kept the same model of protecting data structures by
making them exclusive to specific threads (or in this case,
concurrency-1 workqueues) to serialize all access and using message
passing; it didn't change everything over to using mutexes instead.

I hope some of this helps. I'm happy to answer further questions.

Ken

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

