stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Nikos Tsironis <ntsironis@arrikto.com>,
	Joe Thornber <ejt@redhat.com>, Mike Snitzer <snitzer@redhat.com>
Subject: [PATCH 5.4 57/80] dm thin: Flush data device before committing metadata
Date: Thu, 19 Dec 2019 19:34:49 +0100	[thread overview]
Message-ID: <20191219183130.904727172@linuxfoundation.org> (raw)
In-Reply-To: <20191219183031.278083125@linuxfoundation.org>

From: Nikos Tsironis <ntsironis@arrikto.com>

commit 694cfe7f31db36912725e63a38a5179c8628a496 upstream.

The thin provisioning target maintains per thin device mappings that map
virtual blocks to data blocks in the data device.

When we write to a shared block, in case of internal snapshots, or
provision a new block, in case of external snapshots, we copy the shared
block to a new data block (COW), update the mapping for the relevant
virtual block and then issue the write to the new data block.

Suppose the data device has a volatile write-back cache and the
following sequence of events occur:

1. We write to a shared block
2. A new data block is allocated
3. We copy the shared block to the new data block using kcopyd (COW)
4. We insert the new mapping for the virtual block in the btree for that
   thin device.
5. The commit timeout expires and we commit the metadata, that now
   includes the new mapping from step (4).
6. The system crashes and the data device's cache has not been flushed,
   meaning that the COWed data are lost.

The next time we read that virtual block of the thin device we read it
from the data block allocated in step (2), since the metadata have been
successfully committed. The data are lost due to the crash, so we read
garbage instead of the old, shared data.

This has the following implications:

1. In case of writes to shared blocks, with size smaller than the pool's
   block size (which means we first copy the whole block and then issue
   the smaller write), we corrupt data that the user never touched.

2. In case of writes to shared blocks, with size equal to the device's
   logical block size, we fail to provide atomic sector writes. When the
   system recovers the user will read garbage from that sector instead
   of the old data or the new data.

3. Even for writes to shared blocks, with size equal to the pool's block
   size (overwrites), after the system recovers, the written sectors
   will contain garbage instead of a random mix of sectors containing
   either old data or new data, thus we fail again to provide atomic
   sectors writes.

4. Even when the user flushes the thin device, because we first commit
   the metadata and then pass down the flush, the same risk for
   corruption exists (if the system crashes after the metadata have been
   committed but before the flush is passed down to the data device.)

The only case which is unaffected is that of writes with size equal to
the pool's block size and with the FUA flag set. But, because FUA writes
trigger metadata commits, this case can trigger the corruption
indirectly.

Moreover, apart from internal and external snapshots, the same issue
exists for newly provisioned blocks, when block zeroing is enabled.
After the system recovers the provisioned blocks might contain garbage
instead of zeroes.

To solve this and avoid the potential data corruption we flush the
pool's data device **before** committing its metadata.

This ensures that the data blocks of any newly inserted mappings are
properly written to non-volatile storage and won't be lost in case of a
crash.

Cc: stable@vger.kernel.org
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 drivers/md/dm-thin.c |   42 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 40 insertions(+), 2 deletions(-)

--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -328,6 +328,7 @@ struct pool_c {
 	dm_block_t low_water_blocks;
 	struct pool_features requested_pf; /* Features requested during table load */
 	struct pool_features adjusted_pf;  /* Features used after adjusting for constituent devices */
+	struct bio flush_bio;
 };
 
 /*
@@ -2392,8 +2393,16 @@ static void process_deferred_bios(struct
 	while ((bio = bio_list_pop(&bio_completions)))
 		bio_endio(bio);
 
-	while ((bio = bio_list_pop(&bios)))
-		generic_make_request(bio);
+	while ((bio = bio_list_pop(&bios))) {
+		/*
+		 * The data device was flushed as part of metadata commit,
+		 * so complete redundant flushes immediately.
+		 */
+		if (bio->bi_opf & REQ_PREFLUSH)
+			bio_endio(bio);
+		else
+			generic_make_request(bio);
+	}
 }
 
 static void do_worker(struct work_struct *ws)
@@ -3127,6 +3136,7 @@ static void pool_dtr(struct dm_target *t
 	__pool_dec(pt->pool);
 	dm_put_device(ti, pt->metadata_dev);
 	dm_put_device(ti, pt->data_dev);
+	bio_uninit(&pt->flush_bio);
 	kfree(pt);
 
 	mutex_unlock(&dm_thin_pool_table.mutex);
@@ -3192,6 +3202,29 @@ static void metadata_low_callback(void *
 	dm_table_event(pool->ti->table);
 }
 
+/*
+ * We need to flush the data device **before** committing the metadata.
+ *
+ * This ensures that the data blocks of any newly inserted mappings are
+ * properly written to non-volatile storage and won't be lost in case of a
+ * crash.
+ *
+ * Failure to do so can result in data corruption in the case of internal or
+ * external snapshots and in the case of newly provisioned blocks, when block
+ * zeroing is enabled.
+ */
+static int metadata_pre_commit_callback(void *context)
+{
+	struct pool_c *pt = context;
+	struct bio *flush_bio = &pt->flush_bio;
+
+	bio_reset(flush_bio);
+	bio_set_dev(flush_bio, pt->data_dev->bdev);
+	flush_bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
+
+	return submit_bio_wait(flush_bio);
+}
+
 static sector_t get_dev_size(struct block_device *bdev)
 {
 	return i_size_read(bdev->bd_inode) >> SECTOR_SHIFT;
@@ -3360,6 +3393,7 @@ static int pool_ctr(struct dm_target *ti
 	pt->data_dev = data_dev;
 	pt->low_water_blocks = low_water_blocks;
 	pt->adjusted_pf = pt->requested_pf = pf;
+	bio_init(&pt->flush_bio, NULL, 0);
 	ti->num_flush_bios = 1;
 
 	/*
@@ -3386,6 +3420,10 @@ static int pool_ctr(struct dm_target *ti
 	if (r)
 		goto out_flags_changed;
 
+	dm_pool_register_pre_commit_callback(pt->pool->pmd,
+					     metadata_pre_commit_callback,
+					     pt);
+
 	pt->callbacks.congested_fn = pool_is_congested;
 	dm_table_add_target_callbacks(ti->table, &pt->callbacks);
 



  parent reply	other threads:[~2019-12-19 18:55 UTC|newest]

Thread overview: 90+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-12-19 18:33 [PATCH 5.4 00/80] 5.4.6-stable review Greg Kroah-Hartman
2019-12-19 18:33 ` [PATCH 5.4 01/80] USB: Fix incorrect DMA allocations for local memory pool drivers Greg Kroah-Hartman
2019-12-19 18:33 ` [PATCH 5.4 02/80] mmc: block: Make card_busy_detect() a bit more generic Greg Kroah-Hartman
2019-12-19 18:33 ` [PATCH 5.4 03/80] mmc: block: Add CMD13 polling for MMC IOCTLS with R1B response Greg Kroah-Hartman
2019-12-19 18:33 ` [PATCH 5.4 04/80] mmc: core: Drop check for mmc_card_is_removable() in mmc_rescan() Greg Kroah-Hartman
2019-12-19 18:33 ` [PATCH 5.4 05/80] mmc: core: Re-work HW reset for SDIO cards Greg Kroah-Hartman
2019-12-19 18:33 ` [PATCH 5.4 06/80] PCI/switchtec: Read all 64 bits of part_event_bitmap Greg Kroah-Hartman
2019-12-19 18:33 ` [PATCH 5.4 07/80] PCI/PM: Always return devices to D0 when thawing Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 08/80] PCI: pciehp: Avoid returning prematurely from sysfs requests Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 09/80] PCI: Fix Intel ACS quirk UPDCR register address Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 10/80] PCI/MSI: Fix incorrect MSI-X masking on resume Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 11/80] PCI: Do not use bus number zero from EA capability Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 12/80] PCI: rcar: Fix missing MACCTLR register setting in initialization sequence Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 13/80] PCI: Apply Cavium ACS quirk to ThunderX2 and ThunderX3 Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 14/80] PM / QoS: Redefine FREQ_QOS_MAX_DEFAULT_VALUE to S32_MAX Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 15/80] block: fix "check bi_size overflow before merge" Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 16/80] xtensa: use MEMBLOCK_ALLOC_ANYWHERE for KASAN shadow map Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 17/80] gfs2: Multi-block allocations in gfs2_page_mkwrite Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 18/80] gfs2: fix glock reference problem in gfs2_trans_remove_revoke Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 19/80] xtensa: fix TLB sanity checker Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 20/80] xtensa: fix syscall_set_return_value Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 21/80] rpmsg: glink: Set tail pointer to 0 at end of FIFO Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 22/80] rpmsg: glink: Fix reuse intents memory leak issue Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 23/80] rpmsg: glink: Fix use after free in open_ack TIMEOUT case Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 24/80] rpmsg: glink: Put an extra reference during cleanup Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 25/80] rpmsg: glink: Fix rpmsg_register_device err handling Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 26/80] rpmsg: glink: Dont send pending rx_done during remove Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 27/80] rpmsg: glink: Free pending deferred work on remove Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 28/80] cifs: smbd: Return -EAGAIN when transport is reconnecting Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 29/80] cifs: smbd: Only queue work for error recovery on memory registration Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 30/80] cifs: smbd: Add messages on RDMA session destroy and reconnection Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 31/80] cifs: smbd: Return -EINVAL when the number of iovs exceeds SMBDIRECT_MAX_SGE Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 32/80] cifs: smbd: Return -ECONNABORTED when trasnport is not in connected state Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 33/80] cifs: Dont display RDMA transport on reconnect Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 34/80] CIFS: Respect O_SYNC and O_DIRECT flags during reconnect Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 35/80] CIFS: Close open handle after interrupted close Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 36/80] CIFS: Do not miss cancelled OPEN responses Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 37/80] CIFS: Fix NULL pointer dereference in mid callback Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 38/80] cifs: Fix retrieval of DFS referrals in cifs_mount() Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 39/80] ARM: dts: s3c64xx: Fix init order of clock providers Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 40/80] ARM: tegra: Fix FLOW_CTLR_HALT register clobbering by tegra_resume() Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 41/80] vfio/pci: call irq_bypass_unregister_producer() before freeing irq Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 42/80] dma-buf: Fix memory leak in sync_file_merge() Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 43/80] drm/panfrost: Fix a race in panfrost_ioctl_madvise() Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 44/80] drm/panfrost: Fix a BO leak in panfrost_ioctl_mmap_bo() Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 45/80] drm/panfrost: Fix a race in panfrost_gem_free_object() Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 46/80] drm/mgag200: Extract device type from flags Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 47/80] drm/mgag200: Store flags from PCI driver data in device structure Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 48/80] drm/mgag200: Add workaround for HW that does not support startadd Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 49/80] drm/mgag200: Flag all G200 SE A machines as broken wrt <startadd> Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 50/80] drm: meson: venc: cvbs: fix CVBS mode matching Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 51/80] dm mpath: remove harmful bio-based optimization Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 52/80] dm btree: increase rebalance threshold in __rebalance2() Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 53/80] dm clone metadata: Track exact changes per transaction Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 54/80] dm clone metadata: Use a two phase commit Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 55/80] dm clone: Flush destination device before committing metadata Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 56/80] dm thin metadata: Add support for a pre-commit callback Greg Kroah-Hartman
2019-12-19 18:34 ` Greg Kroah-Hartman [this message]
2019-12-19 18:34 ` [PATCH 5.4 58/80] scsi: ufs: Disable autohibern8 feature in Cadence UFS Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 59/80] scsi: iscsi: Fix a potential deadlock in the timeout handler Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 60/80] scsi: qla2xxx: Ignore NULL pointer in tcm_qla2xxx_free_mcmd Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 61/80] scsi: qla2xxx: Initialize free_work before flushing it Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 62/80] scsi: qla2xxx: Added support for MPI and PEP regions for ISP28XX Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 63/80] scsi: qla2xxx: Change discovery state before PLOGI Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 64/80] scsi: qla2xxx: Correctly retrieve and interpret active flash region Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 65/80] scsi: qla2xxx: Fix incorrect SFUB length used for Secure Flash Update MB Cmd Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 66/80] drm/nouveau/kms/nv50-: Call outp_atomic_check_view() before handling PBN Greg Kroah-Hartman
2019-12-19 18:34 ` [PATCH 5.4 67/80] drm/nouveau/kms/nv50-: Store the bpc were using in nv50_head_atom Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 68/80] drm/nouveau/kms/nv50-: Limit MST BPC to 8 Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 69/80] drm/i915/fbc: Disable fbc by default on all glk+ Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 70/80] drm/radeon: fix r1xx/r2xx register checker for POT textures Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 71/80] drm/dp_mst: Correct the bug in drm_dp_update_payload_part1() Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 72/80] drm/amd/display: re-enable wait in pipelock, but add timeout Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 73/80] drm/amd/display: add default clocks if not able to fetch them Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 74/80] drm/amdgpu: initialize vm_inv_eng0_sem for gfxhub and mmhub Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 75/80] drm/amdgpu: invalidate mmhub semaphore workaround in gmc9/gmc10 Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 76/80] drm/amdgpu/gfx10: explicitly wait for cp idle after halt/unhalt Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 77/80] drm/amdgpu/gfx10: re-init clear state buffer after gpu reset Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 78/80] drm/i915/gvt: Fix cmd length check for MI_ATOMIC Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 79/80] drm/amdgpu: avoid using invalidate semaphore for picasso Greg Kroah-Hartman
2019-12-19 18:35 ` [PATCH 5.4 80/80] drm/amdgpu: add invalidate semaphore limit for SRIOV and picasso in gmc9 Greg Kroah-Hartman
2019-12-20  4:30 ` [PATCH 5.4 00/80] 5.4.6-stable review shuah
2019-12-20  6:29   ` Greg Kroah-Hartman
2019-12-20  7:22 ` Naresh Kamboju
2019-12-20  8:20   ` Greg Kroah-Hartman
2019-12-20 10:30 ` Jon Hunter
2019-12-20 13:50   ` Greg Kroah-Hartman
2019-12-20 18:50 ` Guenter Roeck
2019-12-20 21:21   ` Greg Kroah-Hartman
2019-12-21  9:36 ` Jeffrin Jose

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191219183130.904727172@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=ejt@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ntsironis@arrikto.com \
    --cc=snitzer@redhat.com \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).