stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Nikos Tsironis <ntsironis@arrikto.com>,
	Mike Snitzer <snitzer@redhat.com>
Subject: [PATCH 5.4 55/80] dm clone: Flush destination device before committing metadata
Date: Thu, 19 Dec 2019 19:34:47 +0100	[thread overview]
Message-ID: <20191219183129.010437125@linuxfoundation.org> (raw)
In-Reply-To: <20191219183031.278083125@linuxfoundation.org>

From: Nikos Tsironis <ntsironis@arrikto.com>

commit 8b3fd1f53af3591d5624ab9df718369b14d09ed1 upstream.

dm-clone maintains an on-disk bitmap which records which regions are
valid in the destination device, i.e., which regions have already been
hydrated, or have been written to directly, via user I/O.

Setting a bit in the on-disk bitmap meas the corresponding region is
valid in the destination device and we redirect all I/O regarding it to
the destination device.

Suppose the destination device has a volatile write-back cache and the
following sequence of events occur:

1. A region gets hydrated, either through the background hydration or
   because it was written to directly, via user I/O.

2. The commit timeout expires and we commit the metadata, marking that
   region as valid in the destination device.

3. The system crashes and the destination device's cache has not been
   flushed, meaning the region's data are lost.

The next time we read that region we read it from the destination
device, since the metadata have been successfully committed, but the
data are lost due to the crash, so we read garbage instead of the old
data.

This has several implications:

1. In case of background hydration or of writes with size smaller than
   the region size (which means we first copy the whole region and then
   issue the smaller write), we corrupt data that the user never
   touched.

2. In case of writes with size equal to the device's logical block size,
   we fail to provide atomic sector writes. When the system recovers the
   user will read garbage from the sector instead of the old data or the
   new data.

3. In case of writes without the FUA flag set, after the system
   recovers, the written sectors will contain garbage instead of a
   random mix of sectors containing either old data or new data, thus we
   fail again to provide atomic sector writes.

4. Even when the user flushes the dm-clone device, because we first
   commit the metadata and then pass down the flush, the same risk for
   corruption exists (if the system crashes after the metadata have been
   committed but before the flush is passed down).

The only case which is unaffected is that of writes with size equal to
the region size and with the FUA flag set. But, because FUA writes
trigger metadata commits, this case can trigger the corruption
indirectly.

To solve this and avoid the potential data corruption we flush the
destination device **before** committing the metadata.

This ensures that any freshly hydrated regions, for which we commit the
metadata, are properly written to non-volatile storage and won't be lost
in case of a crash.

Fixes: 7431b7835f55 ("dm: add clone target")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 drivers/md/dm-clone-target.c |   46 +++++++++++++++++++++++++++++++++++++------
 1 file changed, 40 insertions(+), 6 deletions(-)

--- a/drivers/md/dm-clone-target.c
+++ b/drivers/md/dm-clone-target.c
@@ -86,6 +86,12 @@ struct clone {
 
 	struct dm_clone_metadata *cmd;
 
+	/*
+	 * bio used to flush the destination device, before committing the
+	 * metadata.
+	 */
+	struct bio flush_bio;
+
 	/* Region hydration hash table */
 	struct hash_table_bucket *ht;
 
@@ -1106,10 +1112,13 @@ static bool need_commit_due_to_time(stru
 /*
  * A non-zero return indicates read-only or fail mode.
  */
-static int commit_metadata(struct clone *clone)
+static int commit_metadata(struct clone *clone, bool *dest_dev_flushed)
 {
 	int r = 0;
 
+	if (dest_dev_flushed)
+		*dest_dev_flushed = false;
+
 	mutex_lock(&clone->commit_lock);
 
 	if (!dm_clone_changed_this_transaction(clone->cmd))
@@ -1126,6 +1135,19 @@ static int commit_metadata(struct clone
 		goto out;
 	}
 
+	bio_reset(&clone->flush_bio);
+	bio_set_dev(&clone->flush_bio, clone->dest_dev->bdev);
+	clone->flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
+
+	r = submit_bio_wait(&clone->flush_bio);
+	if (unlikely(r)) {
+		__metadata_operation_failed(clone, "flush destination device", r);
+		goto out;
+	}
+
+	if (dest_dev_flushed)
+		*dest_dev_flushed = true;
+
 	r = dm_clone_metadata_commit(clone->cmd);
 	if (unlikely(r)) {
 		__metadata_operation_failed(clone, "dm_clone_metadata_commit", r);
@@ -1199,6 +1221,7 @@ static void process_deferred_flush_bios(
 {
 	struct bio *bio;
 	unsigned long flags;
+	bool dest_dev_flushed;
 	struct bio_list bios = BIO_EMPTY_LIST;
 	struct bio_list bio_completions = BIO_EMPTY_LIST;
 
@@ -1218,7 +1241,7 @@ static void process_deferred_flush_bios(
 	    !(dm_clone_changed_this_transaction(clone->cmd) && need_commit_due_to_time(clone)))
 		return;
 
-	if (commit_metadata(clone)) {
+	if (commit_metadata(clone, &dest_dev_flushed)) {
 		bio_list_merge(&bios, &bio_completions);
 
 		while ((bio = bio_list_pop(&bios)))
@@ -1232,8 +1255,17 @@ static void process_deferred_flush_bios(
 	while ((bio = bio_list_pop(&bio_completions)))
 		bio_endio(bio);
 
-	while ((bio = bio_list_pop(&bios)))
-		generic_make_request(bio);
+	while ((bio = bio_list_pop(&bios))) {
+		if ((bio->bi_opf & REQ_PREFLUSH) && dest_dev_flushed) {
+			/* We just flushed the destination device as part of
+			 * the metadata commit, so there is no reason to send
+			 * another flush.
+			 */
+			bio_endio(bio);
+		} else {
+			generic_make_request(bio);
+		}
+	}
 }
 
 static void do_worker(struct work_struct *work)
@@ -1405,7 +1437,7 @@ static void clone_status(struct dm_targe
 
 		/* Commit to ensure statistics aren't out-of-date */
 		if (!(status_flags & DM_STATUS_NOFLUSH_FLAG) && !dm_suspended(ti))
-			(void) commit_metadata(clone);
+			(void) commit_metadata(clone, NULL);
 
 		r = dm_clone_get_free_metadata_block_count(clone->cmd, &nr_free_metadata_blocks);
 
@@ -1839,6 +1871,7 @@ static int clone_ctr(struct dm_target *t
 	bio_list_init(&clone->deferred_flush_completions);
 	clone->hydration_offset = 0;
 	atomic_set(&clone->hydrations_in_flight, 0);
+	bio_init(&clone->flush_bio, NULL, 0);
 
 	clone->wq = alloc_workqueue("dm-" DM_MSG_PREFIX, WQ_MEM_RECLAIM, 0);
 	if (!clone->wq) {
@@ -1912,6 +1945,7 @@ static void clone_dtr(struct dm_target *
 	struct clone *clone = ti->private;
 
 	mutex_destroy(&clone->commit_lock);
+	bio_uninit(&clone->flush_bio);
 
 	for (i = 0; i < clone->nr_ctr_args; i++)
 		kfree(clone->ctr_args[i]);
@@ -1966,7 +2000,7 @@ static void clone_postsuspend(struct dm_
 	wait_event(clone->hydration_stopped, !atomic_read(&clone->hydrations_in_flight));
 	flush_workqueue(clone->wq);
 
-	(void) commit_metadata(clone);
+	(void) commit_metadata(clone, NULL);
 }
 
 static void clone_resume(struct dm_target *ti)



  parent reply	other threads:[~2019-12-19 18:58 UTC|newest]

Thread overview: 90+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-12-19 18:33 [PATCH 5.4 00/80] 5.4.6-stable review Greg Kroah-Hartman
2019-12-19 18:33 ` [PATCH 5.4 01/80] USB: Fix incorrect DMA allocations for local memory pool drivers Greg Kroah-Hartman
2019-12-19 18:33 ` [PATCH 5.4 02/80] mmc: block: Make card_busy_detect() a bit more generic Greg Kroah-Hartman
2019-12-19 18:33 ` [PATCH 5.4 03/80] mmc: block: Add CMD13 polling for MMC IOCTLS with R1B response Greg Kroah-Hartman
2019-12-19 18:33 ` [PATCH 5.4 04/80] mmc: core: Drop check for mmc_card_is_removable() in mmc_rescan() Greg Kroah-Hartman
2019-12-19 18:33 ` [PATCH 5.4 05/80] mmc: core: Re-work HW reset for SDIO cards Greg Kroah-Hartman
2019-12-19 18:33 ` [PATCH 5.4 06/80] PCI/switchtec: Read all 64 bits of part_event_bitmap Greg Kroah-Hartman
2019-12-19 18:33 ` [PATCH 5.4 07/80] PCI/PM: Always return devices to D0 when thawing Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 08/80] PCI: pciehp: Avoid returning prematurely from sysfs requests Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 09/80] PCI: Fix Intel ACS quirk UPDCR register address Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 10/80] PCI/MSI: Fix incorrect MSI-X masking on resume Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 11/80] PCI: Do not use bus number zero from EA capability Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 12/80] PCI: rcar: Fix missing MACCTLR register setting in initialization sequence Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 13/80] PCI: Apply Cavium ACS quirk to ThunderX2 and ThunderX3 Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 14/80] PM / QoS: Redefine FREQ_QOS_MAX_DEFAULT_VALUE to S32_MAX Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 15/80] block: fix "check bi_size overflow before merge" Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 16/80] xtensa: use MEMBLOCK_ALLOC_ANYWHERE for KASAN shadow map Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 17/80] gfs2: Multi-block allocations in gfs2_page_mkwrite Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 18/80] gfs2: fix glock reference problem in gfs2_trans_remove_revoke Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 19/80] xtensa: fix TLB sanity checker Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 20/80] xtensa: fix syscall_set_return_value Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 21/80] rpmsg: glink: Set tail pointer to 0 at end of FIFO Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 22/80] rpmsg: glink: Fix reuse intents memory leak issue Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 23/80] rpmsg: glink: Fix use after free in open_ack TIMEOUT case Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 24/80] rpmsg: glink: Put an extra reference during cleanup Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 25/80] rpmsg: glink: Fix rpmsg_register_device err handling Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 26/80] rpmsg: glink: Dont send pending rx_done during remove Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 27/80] rpmsg: glink: Free pending deferred work on remove Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 28/80] cifs: smbd: Return -EAGAIN when transport is reconnecting Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 29/80] cifs: smbd: Only queue work for error recovery on memory registration Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 30/80] cifs: smbd: Add messages on RDMA session destroy and reconnection Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 31/80] cifs: smbd: Return -EINVAL when the number of iovs exceeds SMBDIRECT_MAX_SGE Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 32/80] cifs: smbd: Return -ECONNABORTED when trasnport is not in connected state Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 33/80] cifs: Dont display RDMA transport on reconnect Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 34/80] CIFS: Respect O_SYNC and O_DIRECT flags during reconnect Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 35/80] CIFS: Close open handle after interrupted close Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 36/80] CIFS: Do not miss cancelled OPEN responses Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 37/80] CIFS: Fix NULL pointer dereference in mid callback Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 38/80] cifs: Fix retrieval of DFS referrals in cifs_mount() Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 39/80] ARM: dts: s3c64xx: Fix init order of clock providers Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 40/80] ARM: tegra: Fix FLOW_CTLR_HALT register clobbering by tegra_resume() Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 41/80] vfio/pci: call irq_bypass_unregister_producer() before freeing irq Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 42/80] dma-buf: Fix memory leak in sync_file_merge() Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 43/80] drm/panfrost: Fix a race in panfrost_ioctl_madvise() Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 44/80] drm/panfrost: Fix a BO leak in panfrost_ioctl_mmap_bo() Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 45/80] drm/panfrost: Fix a race in panfrost_gem_free_object() Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 46/80] drm/mgag200: Extract device type from flags Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 47/80] drm/mgag200: Store flags from PCI driver data in device structure Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 48/80] drm/mgag200: Add workaround for HW that does not support startadd Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 49/80] drm/mgag200: Flag all G200 SE A machines as broken wrt <startadd> Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 50/80] drm: meson: venc: cvbs: fix CVBS mode matching Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 51/80] dm mpath: remove harmful bio-based optimization Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 52/80] dm btree: increase rebalance threshold in __rebalance2() Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 53/80] dm clone metadata: Track exact changes per transaction Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 54/80] dm clone metadata: Use a two phase commit Greg Kroah-Hartman
2019-12-19 18:34 ` Greg Kroah-Hartman [this message]
2019-12-19 18:34 ` [PATCH 5.4 56/80] dm thin metadata: Add support for a pre-commit callback Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 57/80] dm thin: Flush data device before committing metadata Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 58/80] scsi: ufs: Disable autohibern8 feature in Cadence UFS Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 59/80] scsi: iscsi: Fix a potential deadlock in the timeout handler Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 60/80] scsi: qla2xxx: Ignore NULL pointer in tcm_qla2xxx_free_mcmd Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 61/80] scsi: qla2xxx: Initialize free_work before flushing it Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 62/80] scsi: qla2xxx: Added support for MPI and PEP regions for ISP28XX Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 63/80] scsi: qla2xxx: Change discovery state before PLOGI Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 64/80] scsi: qla2xxx: Correctly retrieve and interpret active flash region Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 65/80] scsi: qla2xxx: Fix incorrect SFUB length used for Secure Flash Update MB Cmd Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 66/80] drm/nouveau/kms/nv50-: Call outp_atomic_check_view() before handling PBN Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 67/80] drm/nouveau/kms/nv50-: Store the bpc were using in nv50_head_atom Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 68/80] drm/nouveau/kms/nv50-: Limit MST BPC to 8 Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 69/80] drm/i915/fbc: Disable fbc by default on all glk+ Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 70/80] drm/radeon: fix r1xx/r2xx register checker for POT textures Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 71/80] drm/dp_mst: Correct the bug in drm_dp_update_payload_part1() Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 72/80] drm/amd/display: re-enable wait in pipelock, but add timeout Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 73/80] drm/amd/display: add default clocks if not able to fetch them Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 74/80] drm/amdgpu: initialize vm_inv_eng0_sem for gfxhub and mmhub Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 75/80] drm/amdgpu: invalidate mmhub semaphore workaround in gmc9/gmc10 Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 76/80] drm/amdgpu/gfx10: explicitly wait for cp idle after halt/unhalt Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 77/80] drm/amdgpu/gfx10: re-init clear state buffer after gpu reset Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 78/80] drm/i915/gvt: Fix cmd length check for MI_ATOMIC Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 79/80] drm/amdgpu: avoid using invalidate semaphore for picasso Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 80/80] drm/amdgpu: add invalidate semaphore limit for SRIOV and picasso in gmc9 Greg Kroah-Hartman
2019-12-20  4:30 ` [PATCH 5.4 00/80] 5.4.6-stable review shuah
2019-12-20  6:29   ` Greg Kroah-Hartman
2019-12-20  7:22 ` Naresh Kamboju
2019-12-20  8:20   ` Greg Kroah-Hartman
2019-12-20 10:30 ` Jon Hunter
2019-12-20 13:50   ` Greg Kroah-Hartman
2019-12-20 18:50 ` Guenter Roeck
2019-12-20 21:21   ` Greg Kroah-Hartman
2019-12-21  9:36 ` Jeffrin Jose

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191219183129.010437125@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ntsironis@arrikto.com \
    --cc=snitzer@redhat.com \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).