From: Sasha Levin <Alexander.Levin@microsoft.com>
To: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>
Cc: NeilBrown <neilb@suse.com>, Mike Snitzer <snitzer@redhat.com>,
	Sasha Levin <Alexander.Levin@microsoft.com>
Subject: [PATCH AUTOSEL for 4.15 10/78] dm: ensure bio submission follows a depth-first tree walk
Date: Thu, 8 Mar 2018 04:56:05 +0000
Message-ID: <20180308045525.7662-10-alexander.levin@microsoft.com>
In-Reply-To: <20180308045525.7662-1-alexander.levin@microsoft.com>

From: NeilBrown <neilb@suse.com>

[ Upstream commit 18a25da84354c6bb655320de6072c00eda6eb602 ]

A dm device can, in general, represent a tree of targets, each of which
handles a sub-range of the range of blocks handled by the parent.

The bio sequencing managed by generic_make_request() requires that bios
are generated and handled in a depth-first manner. Each call to a
make_request_fn() may submit bios to a single member device, and may
submit bios for a reduced region of the same device as the
make_request_fn.

In particular, any bios submitted to member devices must be expected to
be processed in order, so a later one must never wait for an earlier
one.

This ordering is usually achieved by using bio_split() to reduce a bio
to a size that can be completely handled by one target, and resubmitting
the remainder to the originating device. blk_queue_split() shows the
canonical approach.

dm doesn't follow this approach, largely because it has needed to split
bios since long before bio_split() was available. It currently can
submit bios to separate targets within the one dm_make_request() call.
Dependencies between these targets, as can happen with dm-snap, can
cause deadlocks if either bio gets stuck behind the other in the queues
managed by generic_make_request(). This requires the 'rescue'
functionality provided by dm_offload_{start,end}.

Some of this requirement can be removed by changing the order of bio
submission to follow the canonical approach. That is, if dm finds that
it needs to split a bio, the remainder should be sent to
generic_make_request() rather than being handled immediately. This
delays the handling until the first part is completely processed, so the
deadlock problems do not occur.

__split_and_process_bio() can be called both from dm_make_request() and
from dm_wq_work(). When called from dm_wq_work() the current approach
is perfectly satisfactory as each bio will be processed immediately.
When called from dm_make_request(), current->bio_list will be non-NULL,
and in this case it is best to create a separate "clone" bio for the
remainder.

When we use bio_clone_bioset() to split off the front part of a bio
and chain the two together and submit the remainder to
generic_make_request(), it is important that the newly allocated
bio is used as the head to be processed immediately, and the original
bio gets "bio_advance()"d and sent to generic_make_request() as the
remainder. Otherwise, if the newly allocated bio is used as the
remainder, and if it then needs to be split again, then the next
bio_clone_bioset() call will be made while holding a reference to a bio
(result of the first clone) from the same bioset. This can potentially
exhaust the bioset mempool and result in a memory allocation deadlock.

Note that there is no race caused by reassigning cio.io->bio after already
calling __map_bio(). This bio will only be dereferenced again after
dec_pending() has found io->io_count to be zero, and this cannot happen
before the dec_pending() call at the end of __split_and_process_bio().

To provide the clone bio when splitting, we use q->bio_split. This
was previously being freed by bio-based dm to avoid having excess
rescuer threads. As bio_split bio sets no longer create rescuer
threads, there is little cost and much gain from restoring the
q->bio_split bio set.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
---
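Illustrative note, not part of the upstream commit: the ordering argued
for in the commit message can be sketched in userspace C. This is only a
toy model under stated assumptions. Every toy_* name, the fixed
TARGET_SECTORS target size, and the plain FIFO queue are hypothetical
stand-ins; the real generic_make_request() keeps its pending bios on
current->bio_list and does more bookkeeping than a FIFO. The sketch maps
only the front piece that fits in one target and requeues the remainder,
so the remainder is looked at only after the front piece has been handled.

```c
#include <assert.h>

#define TARGET_SECTORS 8u   /* hypothetical: every target spans 8 sectors */
#define MAX_PENDING    16u  /* hypothetical bound for this toy model */

/* A toy bio: just a sector range on the top-level device. */
struct toy_bio {
	unsigned start;   /* first sector */
	unsigned count;   /* number of sectors */
};

/* Stand-in for the queue that generic_make_request() drains. */
static struct toy_bio pending[MAX_PENDING];
static unsigned head, tail;

/* Pieces handed to a single target, in the order they were mapped. */
static struct toy_bio mapped[MAX_PENDING];
static unsigned nr_mapped;

/* Like a nested submission: only queue the bio, never recurse. */
static void toy_submit(struct toy_bio b)
{
	assert(tail < MAX_PENDING);
	pending[tail++] = b;
}

/* Map the front piece that fits in one target; requeue the remainder. */
static void toy_make_request(struct toy_bio b)
{
	unsigned room = TARGET_SECTORS - (b.start % TARGET_SECTORS);
	unsigned len = b.count < room ? b.count : room;

	assert(nr_mapped < MAX_PENDING);
	mapped[nr_mapped++] = (struct toy_bio){ b.start, len };

	if (len < b.count)
		toy_submit((struct toy_bio){ b.start + len, b.count - len });
}

/* Outermost submission: drain the queue until nothing is pending. */
static unsigned toy_run(struct toy_bio b)
{
	head = tail = nr_mapped = 0;
	toy_submit(b);
	while (head < tail)
		toy_make_request(pending[head++]);
	return nr_mapped;
}
```

Calling toy_run() on a 20-sector bio starting at sector 5 leaves four
pieces in mapped[], in ascending sector order, none of which crosses a
target boundary.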
drivers/md/dm.c | 33 ++++++++++++++++++++++++---------
1 file changed, 24 insertions(+), 9 deletions(-)
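Illustrative note, not part of the upstream commit: the patch below
advances the original bio by "(bio_sectors(bio) - ci.sector_count) << 9"
bytes, i.e. the number of 512-byte sectors already handled, converted to
bytes. The arithmetic can be sketched in userspace C as follows. The
toy_range type and both helper names are hypothetical; only the shift by
9 (512-byte sectors) is taken from the patch itself.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SECTOR_SHIFT 9   /* 512-byte sectors, as in the "<< 9" of the patch */

/* Hypothetical stand-in for the position and size a bio iterator tracks. */
struct toy_range {
	uint64_t sector;   /* current start sector */
	uint32_t bytes;    /* bytes still covered by the range */
};

/* Bytes already handled: (total - remaining) sectors, converted to bytes. */
static uint32_t processed_bytes(uint32_t total_sectors, uint32_t remaining_sectors)
{
	assert(remaining_sectors <= total_sectors);
	return (total_sectors - remaining_sectors) << TOY_SECTOR_SHIFT;
}

/* Move the start forward and shrink the range by a whole number of sectors. */
static void toy_advance(struct toy_range *r, uint32_t bytes)
{
	assert(bytes <= r->bytes);
	assert((bytes & ((1u << TOY_SECTOR_SHIFT) - 1)) == 0);
	r->sector += bytes >> TOY_SECTOR_SHIFT;
	r->bytes -= bytes;
}
```

For a 24-sector range starting at sector 100 with 16 sectors still to
do, processed_bytes(24, 16) is 4096, and advancing by that amount leaves
a range that starts at sector 108 and covers 16 * 512 bytes.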
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 1c42b00d3be2..04402d2ccb20 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1499,8 +1499,29 @@ static void __split_and_process_bio(struct mapped_device *md,
 	} else {
 		ci.bio = bio;
 		ci.sector_count = bio_sectors(bio);
-		while (ci.sector_count && !error)
+		while (ci.sector_count && !error) {
 			error = __split_and_process_non_flush(&ci);
+			if (current->bio_list && ci.sector_count && !error) {
+				/*
+				 * Remainder must be passed to generic_make_request()
+				 * so that it gets handled *after* bios already submitted
+				 * have been completely processed.
+				 * We take a clone of the original to store in
+				 * ci.io->bio to be used by end_io_acct() and
+				 * for dec_pending to use for completion handling.
+				 * As this path is not used for REQ_OP_ZONE_REPORT,
+				 * the usage of io->bio in dm_remap_zone_report()
+				 * won't be affected by this reassignment.
+				 */
+				struct bio *b = bio_clone_bioset(bio, GFP_NOIO,
+								 md->queue->bio_split);
+				ci.io->bio = b;
+				bio_advance(bio, (bio_sectors(bio) - ci.sector_count) << 9);
+				bio_chain(b, bio);
+				generic_make_request(bio);
+				break;
+			}
+		}
 	}
 
 	/* drop the extra reference count */
@@ -1511,8 +1532,8 @@ static void __split_and_process_bio(struct mapped_device *md,
  *---------------------------------------------------------------*/
 
 /*
- * The request function that just remaps the bio built up by
- * dm_merge_bvec.
+ * The request function that remaps the bio to one target and
+ * splits off any remainder.
  */
 static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio)
 {
@@ -2035,12 +2056,6 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t)
 	case DM_TYPE_DAX_BIO_BASED:
 		dm_init_normal_md_queue(md);
 		blk_queue_make_request(md->queue, dm_make_request);
-		/*
-		 * DM handles splitting bios as needed. Free the bio_split bioset
-		 * since it won't be used (saves 1 process per bio-based DM device).
-		 */
-		bioset_free(md->queue->bio_split);
-		md->queue->bio_split = NULL;
 
 		if (type == DM_TYPE_DAX_BIO_BASED)
 			queue_flag_set_unlocked(QUEUE_FLAG_DAX, md->queue);
--
2.14.1