From: Sasha Levin <Alexander.Levin@microsoft.com>
To: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"stable@vger.kernel.org" <stable@vger.kernel.org>
Cc: NeilBrown <neilb@suse.com>, Mike Snitzer <snitzer@redhat.com>,
Sasha Levin <Alexander.Levin@microsoft.com>
Subject: [PATCH AUTOSEL for 4.15 10/78] dm: ensure bio submission follows a depth-first tree walk
Date: Thu, 8 Mar 2018 04:56:05 +0000 [thread overview]
Message-ID: <20180308045525.7662-10-alexander.levin@microsoft.com> (raw)
In-Reply-To: <20180308045525.7662-1-alexander.levin@microsoft.com>
From: NeilBrown <neilb@suse.com>
[ Upstream commit 18a25da84354c6bb655320de6072c00eda6eb602 ]
A dm device can, in general, represent a tree of targets, each of which
handles a sub-range of the range of blocks handled by the parent.
The bio sequencing managed by generic_make_request() requires that bios
are generated and handled in a depth-first manner. Each call to a
make_request_fn() may submit bios to a single member device, and may
submit bios for a reduced region of the same device as the
make_request_fn.
In particular, any bios submitted to member devices must be expected to
be processed in order, so a later one must never wait for an earlier
one.
This ordering is usually achieved by using bio_split() to reduce a bio
to a size that can be completely handled by one target, and resubmitting
the remainder to the originating device. bio_queue_split() shows the
canonical approach.
dm doesn't follow this approach, largely because it has needed to split
bios since long before bio_split() was available. It currently can
submit bios to separate targets within the one dm_make_request() call.
Dependencies between these targets, as can happen with dm-snap, can
cause deadlocks if either bios gets stuck behind the other in the queues
managed by generic_make_request(). This requires the 'rescue'
functionality provided by dm_offload_{start,end}.
Some of this requirement can be removed by changing the order of bio
submission to follow the canonical approach. That is, if dm finds that
it needs to split a bio, the remainder should be sent to
generic_make_request() rather than being handled immediately. This
delays the handling until the first part is completely processed, so the
deadlock problems do not occur.
__split_and_process_bio() can be called both from dm_make_request() and
from dm_wq_work(). When called from dm_wq_work() the current approach
is perfectly satisfactory as each bio will be processed immediately.
When called from dm_make_request(), current->bio_list will be non-NULL,
and in this case it is best to create a separate "clone" bio for the
remainder.
When we use bio_clone_bioset() to split off the front part of a bio
and chain the two together and submit the remainder to
generic_make_request(), it is important that the newly allocated
bio is used as the head to be processed immediately, and the original
bio gets "bio_advance()"d and sent to generic_make_request() as the
remainder. Otherwise, if the newly allocated bio is used as the
remainder, and if it then needs to be split again, then the next
bio_clone_bioset() call will be made while holding a reference a bio
(result of the first clone) from the same bioset. This can potentially
exhaust the bioset mempool and result in a memory allocation deadlock.
Note that there is no race caused by reassigning cio.io->bio after already
calling __map_bio(). This bio will only be dereferenced again after
dec_pending() has found io->io_count to be zero, and this cannot happen
before the dec_pending() call at the end of __split_and_process_bio().
To provide the clone bio when splitting, we use q->bio_split. This
was previously being freed by bio-based dm to avoid having excess
rescuer threads. As bio_split bio sets no longer create rescuer
threads, there is little cost and much gain from restoring the
q->bio_split bio set.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
---
drivers/md/dm.c | 33 ++++++++++++++++++++++++---------
1 file changed, 24 insertions(+), 9 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 1c42b00d3be2..04402d2ccb20 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1499,8 +1499,29 @@ static void __split_and_process_bio(struct mapped_device *md,
} else {
ci.bio = bio;
ci.sector_count = bio_sectors(bio);
- while (ci.sector_count && !error)
+ while (ci.sector_count && !error) {
error = __split_and_process_non_flush(&ci);
+ if (current->bio_list && ci.sector_count && !error) {
+ /*
+ * Remainder must be passed to generic_make_request()
+ * so that it gets handled *after* bios already submitted
+ * have been completely processed.
+ * We take a clone of the original to store in
+ * ci.io->bio to be used by end_io_acct() and
+ * for dec_pending to use for completion handling.
+ * As this path is not used for REQ_OP_ZONE_REPORT,
+ * the usage of io->bio in dm_remap_zone_report()
+ * won't be affected by this reassignment.
+ */
+ struct bio *b = bio_clone_bioset(bio, GFP_NOIO,
+ md->queue->bio_split);
+ ci.io->bio = b;
+ bio_advance(bio, (bio_sectors(bio) - ci.sector_count) << 9);
+ bio_chain(b, bio);
+ generic_make_request(bio);
+ break;
+ }
+ }
}
/* drop the extra reference count */
@@ -1511,8 +1532,8 @@ static void __split_and_process_bio(struct mapped_device *md,
*---------------------------------------------------------------*/
/*
- * The request function that just remaps the bio built up by
- * dm_merge_bvec.
+ * The request function that remaps the bio to one target and
+ * splits off any remainder.
*/
static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio)
{
@@ -2035,12 +2056,6 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t)
case DM_TYPE_DAX_BIO_BASED:
dm_init_normal_md_queue(md);
blk_queue_make_request(md->queue, dm_make_request);
- /*
- * DM handles splitting bios as needed. Free the bio_split bioset
- * since it won't be used (saves 1 process per bio-based DM device).
- */
- bioset_free(md->queue->bio_split);
- md->queue->bio_split = NULL;
if (type == DM_TYPE_DAX_BIO_BASED)
queue_flag_set_unlocked(QUEUE_FLAG_DAX, md->queue);
--
2.14.1
next prev parent reply other threads:[~2018-03-08 4:56 UTC|newest]
Thread overview: 81+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-03-08 4:56 [PATCH AUTOSEL for 4.15 01/78] ipmi_si: Fix error handling of platform device Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 03/78] Bluetooth: hci_qca: Avoid setup failure on missing rampatch Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 02/78] drm/amdgpu: use polling mem to set SDMA3 wptr for VF Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 05/78] cpufreq: longhaul: Revert transition_delay_us to 200 ms Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 04/78] Bluetooth: btqcomsmd: Fix skb double free corruption Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 06/78] dt-bindings: net: add TI CC2560 Bluetooth chip Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 07/78] media: c8sectpfe: fix potential NULL pointer dereference in c8sectpfe_timer_interrupt Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 08/78] drm/msm: fix leak in failed get_pages Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 09/78] net: fec: add phy_reset_after_clk_enable() support Sasha Levin
2018-03-08 4:56 ` Sasha Levin [this message]
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 11/78] IB/ipoib: Warn when one port fails to initialize Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 13/78] hv_netvsc: Fix the receive buffer size limit Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 12/78] RDMA/iwpm: Fix uninitialized error code in iwpm_send_mapinfo() Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 15/78] tcp: allow TLP in ECN CWR Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 14/78] hv_netvsc: Fix the TX/RX buffer default sizes Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 16/78] KVM: x86: add support for emulating UMIP Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 17/78] spi: sh-msiof: Avoid writing to registers from spi_master.setup() Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 18/78] libbpf: prefer global symbols as bpf program name source Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 19/78] rtlwifi: rtl_pci: Fix the bug when inactiveps is enabled Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 20/78] rtlwifi: always initialize variables given to RT_TRACE() Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 22/78] ath10k: handling qos at STA side based on AP WMM enable/disable Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 21/78] media: bt8xx: Fix err 'bt878_probe()' Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 23/78] media: [RESEND] media: dvb-frontends: Add delay to Si2168 restart Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 24/78] qmi_wwan: set FLAG_SEND_ZLP to avoid network initiated disconnect Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 25/78] tty: goldfish: Enable 'earlycon' only if built-in Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 26/78] serial: 8250_dw: Disable clock on error Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 27/78] cros_ec: fix nul-termination for firmware build info Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 28/78] watchdog: Fix potential kref imbalance when opening watchdog Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 30/78] platform/chrome: Use proper protocol transfer function Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 29/78] watchdog: Fix kref imbalance seen if handle_boot_enabled=0 Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 31/78] dmaengine: zynqmp_dma: Fix race condition in the probe Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 32/78] drm/tilcdc: ensure nonatomic iowrite64 is not used Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 33/78] mmc: avoid removing non-removable hosts during suspend Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 34/78] mmc: block: fix logical error to avoid memory leak Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 35/78] /dev/mem: Add bounce buffer for copy-out Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 36/78] net: phy: meson-gxl: check phy_write return value Sasha Levin
2018-03-08 10:18 ` Jerome Brunet
2018-03-08 12:34 ` Greg KH
2018-03-19 15:28 ` Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 37/78] sfp: fix EEPROM reading in the case of non-SFF8472 SFPs Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 39/78] media: s5p-mfc: Fix lock contention - request_firmware() once Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 38/78] sfp: fix non-detection of PHY Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 41/78] IB/ipoib: Avoid memory leak if the SA returns a different DGID Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 42/78] RDMA/cma: Use correct size when writing netlink stats Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 40/78] rtc: ac100: Fix multiple race conditions Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 43/78] IB/umem: Fix use of npages/nmap fields Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 44/78] iser-target: avoid reinitializing rdma contexts for isert commands Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 46/78] PCI/ASPM: Calculate LTR_L1.2_THRESHOLD from device characteristics Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 45/78] bpf/cgroup: fix a verification error for a CGROUP_DEVICE type prog Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 47/78] vgacon: Set VGA struct resource types Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 48/78] omapdrm: panel: fix compatible vendor string for td028ttec1 Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 50/78] drm/omap: DMM: Check for DMM readiness after successful transaction commit Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 49/78] mmc: sdhci-xenon: wait 5ms after set 1.8V signal enable Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 51/78] pty: cancel pty slave port buf's work in tty_release Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 53/78] PCI: designware-ep: Fix ->get_msi() to check MSI_EN bit Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 52/78] coresight: Fix disabling of CoreSight TPIU Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 56/78] media: davinci: fix a debug printk Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 55/78] PCI: rcar: Handle rcar_pcie_parse_request_of_pci_ranges() failures Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 54/78] PCI: endpoint: Fix find_first_zero_bit() usage Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 58/78] dt-bindings: display: panel: Fix compatible string for Toshiba LT089AC29000 Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 57/78] clk: check ops pointer on clock register Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 60/78] pinctrl: Really force states during suspend/resume Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 59/78] clk: use round rate to bail out early in set_rate Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 62/78] iommu/vt-d: clean up pr_irq if request_threaded_irq fails Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 61/78] pinctrl: rockchip: enable clock when reading pin direction register Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 63/78] ip6_vti: adjust vti mtu according to mtu of lower device Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 64/78] ip_gre: fix error path when erspan_rcv failed Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 65/78] ip_gre: fix potential memory leak in erspan_rcv Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 66/78] soc: qcom: smsm: fix child-node lookup Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 68/78] scsi: lpfc: Fix issues connecting with nvme initiator Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 67/78] scsi: lpfc: Fix SCSI LUN discovery when SCSI and NVME enabled Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 69/78] RDMA/ocrdma: Fix permissions for OCRDMA_RESET_STATS Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 71/78] nfsd4: permit layoutget of executable-only files Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 70/78] ARM: dts: aspeed-evb: Add unit name to memory node Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 73/78] clk: Don't touch hardware when reparenting during registration Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 72/78] clk: at91: pmc: Wait for clocks when resuming Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 75/78] clk: si5351: Rename internal plls to avoid name collisions Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 74/78] clk: axi-clkgen: Correctly handle nocount bit in recalc_rate() Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 76/78] crypto: artpec6 - set correct iv size for gcm(aes) Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 77/78] hwrng: core - Clean up RNG list when last hwrng is unregistered Sasha Levin
2018-03-08 4:56 ` [PATCH AUTOSEL for 4.15 78/78] dmaengine: ti-dma-crossbar: Fix event mapping for TPCC_EVT_MUX_60_63 Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20180308045525.7662-10-alexander.levin@microsoft.com \
--to=alexander.levin@microsoft.com \
--cc=linux-kernel@vger.kernel.org \
--cc=neilb@suse.com \
--cc=snitzer@redhat.com \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox