From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
stable@vger.kernel.org, Sage Weil <sage@inktank.com>,
Josh Durgin <josh.durgin@inktank.com>
Subject: [PATCH 3.13 14/46] libceph: block I/O when PAUSE or FULL osd map flags are set
Date: Fri, 28 Mar 2014 10:31:58 -0700
Message-ID: <20140328173136.576332715@linuxfoundation.org>
In-Reply-To: <20140328173134.630198216@linuxfoundation.org>

3.13-stable review patch. If anyone has any objections, please let me know.

------------------

From: Josh Durgin <josh.durgin@inktank.com>

commit d29adb34a94715174c88ca93e8aba955850c9bde upstream.

The PAUSEWR and PAUSERD flags are meant to stop the cluster from
processing writes and reads, respectively. The FULL flag is set when
the cluster determines that it is out of space, and will no longer
process writes. PAUSEWR and PAUSERD are purely client-side settings
already implemented in userspace clients. The osd does nothing special
with these flags.

When the FULL flag is set, however, the osd responds to all writes
with -ENOSPC. For cephfs, this makes sense, but for rbd the block
layer translates this into EIO. If a cluster goes from full to
non-full quickly, a filesystem on top of rbd will not behave well,
since some writes succeed while others get EIO.

Fix this by blocking any writes when the FULL flag is set in the osd
client. This is the same strategy used by userspace, so apply it by
default. A follow-on patch makes this configurable.

__map_request() is called to re-target osd requests in case the
available osds changed. Add a paused field to a ceph_osd_request, and
set it whenever an appropriate osd map flag is set. Avoid queueing
paused requests in __map_request(), but force them to be resent if
they become unpaused.

Also subscribe to the next osd map from the monitor if any of these
flags are set, so paused requests can be unblocked as soon as
possible.

Fixes: http://tracker.ceph.com/issues/6079
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
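Illustrative note, not part of the commit: the flag rules described in
the changelog can be modelled in plain userspace C as below. This is only
a sketch. The MODEL_* names and bit values are made up for the example
and are not the kernel's CEPH_OSDMAP_* or CEPH_OSD_FLAG_* definitions.
It shows which requests would be held back for a given set of map flags,
and when the client would ask the monitor for the next osd map.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical osd map flag bits, chosen for this sketch only. */
enum {
	MODEL_MAP_FULL    = 1u << 0,
	MODEL_MAP_PAUSERD = 1u << 1,
	MODEL_MAP_PAUSEWR = 1u << 2,
};

/* Hypothetical request flag bits, chosen for this sketch only. */
enum {
	MODEL_REQ_READ  = 1u << 0,
	MODEL_REQ_WRITE = 1u << 1,
};

/*
 * Reads are held while PAUSERD is set; writes are held while either
 * PAUSEWR or FULL is set.
 */
static bool model_should_pause(uint32_t map_flags, uint32_t req_flags)
{
	bool pauserd = map_flags & MODEL_MAP_PAUSERD;
	bool pausewr = map_flags & (MODEL_MAP_PAUSEWR | MODEL_MAP_FULL);

	return ((req_flags & MODEL_REQ_READ) && pauserd) ||
	       ((req_flags & MODEL_REQ_WRITE) && pausewr);
}

/* Any of the three flags means a newer map is worth requesting. */
static bool model_wants_next_map(uint32_t map_flags)
{
	return map_flags & (MODEL_MAP_FULL | MODEL_MAP_PAUSERD |
			    MODEL_MAP_PAUSEWR);
}
```

With these definitions, a write against a FULL map is held back while a
read against the same map is not, which matches the behaviour the
changelog describes for rbd.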
include/linux/ceph/osd_client.h | 1 +
net/ceph/osd_client.c | 29 +++++++++++++++++++++++++++--
2 files changed, 28 insertions(+), 2 deletions(-)
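
Illustrative note, not part of the commit: the paused/unpaused handling
that the changelog describes for __map_request() can be sketched as a
small state transition. This is a simplified userspace model under
stated assumptions: struct model_req, model_map_request() and the
target_unchanged parameter are hypothetical stand-ins, and the "no up
osd" case and the real osd target comparison are left out.

```c
#include <assert.h>
#include <stdbool.h>

enum model_action { MODEL_NO_CHANGE, MODEL_REQUEUE };

struct model_req {
	bool paused;	/* plays the role of the new r_paused field */
};

/*
 * Decide whether a request is left alone or queued for (re)sending.
 * target_unchanged stands in for the kernel's check that the request
 * is still aimed at the same osd and acting set.
 */
static enum model_action model_map_request(struct model_req *req,
					   bool should_pause,
					   bool target_unchanged,
					   bool force_resend)
{
	bool was_paused = req->paused;

	req->paused = should_pause;

	/* A request that just became unpaused must be resent. */
	if (was_paused && !req->paused)
		force_resend = true;

	/* Paused requests are never queued. */
	if (req->paused)
		return MODEL_NO_CHANGE;

	if (!force_resend && target_unchanged)
		return MODEL_NO_CHANGE;

	return MODEL_REQUEUE;
}
```

The point of the model is the middle case: a request whose target did
not change is still requeued once, on the map update that clears its
pause condition.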
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -138,6 +138,7 @@ struct ceph_osd_request {
 	__le64           *r_request_pool;
 	void             *r_request_pgid;
 	__le32           *r_request_attempts;
+	bool              r_paused;
 	struct ceph_eversion *r_request_reassert_version;
 
 	int               r_result;
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1232,6 +1232,22 @@ void ceph_osdc_set_request_linger(struct
 EXPORT_SYMBOL(ceph_osdc_set_request_linger);
 
 /*
+ * Returns whether a request should be blocked from being sent
+ * based on the current osdmap and osd_client settings.
+ *
+ * Caller should hold map_sem for read.
+ */
+static bool __req_should_be_paused(struct ceph_osd_client *osdc,
+				   struct ceph_osd_request *req)
+{
+	bool pauserd = ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSERD);
+	bool pausewr = ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSEWR) ||
+		ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL);
+	return (req->r_flags & CEPH_OSD_FLAG_READ && pauserd) ||
+		(req->r_flags & CEPH_OSD_FLAG_WRITE && pausewr);
+}
+
+/*
  * Pick an osd (the first 'up' osd in the pg), allocate the osd struct
  * (as needed), and set the request r_osd appropriately. If there is
  * no up osd, set r_osd to NULL. Move the request to the appropriate list
@@ -1248,6 +1264,7 @@ static int __map_request(struct ceph_osd
 	int acting[CEPH_PG_MAX_SIZE];
 	int o = -1, num = 0;
 	int err;
+	bool was_paused;
 
 	dout("map_request %p tid %lld\n", req, req->r_tid);
 	err = ceph_calc_ceph_pg(&pgid, req->r_oid, osdc->osdmap,
@@ -1264,12 +1281,18 @@ static int __map_request(struct ceph_osd
 		num = err;
 	}
 
+	was_paused = req->r_paused;
+	req->r_paused = __req_should_be_paused(osdc, req);
+	if (was_paused && !req->r_paused)
+		force_resend = 1;
+
 	if ((!force_resend &&
 	     req->r_osd && req->r_osd->o_osd == o &&
 	     req->r_sent >= req->r_osd->o_incarnation &&
 	     req->r_num_pg_osds == num &&
 	     memcmp(req->r_pg_osds, acting, sizeof(acting[0])*num) == 0) ||
-	    (req->r_osd == NULL && o == -1))
+	    (req->r_osd == NULL && o == -1) ||
+	    req->r_paused)
 		return 0; /* no change */
 
 	dout("map_request tid %llu pgid %lld.%x osd%d (was osd%d)\n",
@@ -1804,7 +1827,9 @@ done:
 	 * we find out when we are no longer full and stop returning
 	 * ENOSPC.
 	 */
-	if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
+	if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL) ||
+		ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSERD) ||
+		ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSEWR))
 		ceph_monc_request_next_osdmap(&osdc->client->monc);
 
 	mutex_lock(&osdc->request_mutex);