From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
stable@vger.kernel.org, Sage Weil <sage@inktank.com>,
Josh Durgin <josh.durgin@inktank.com>
Subject: [PATCH 3.13 15/46] libceph: resend all writes after the osdmap loses the full flag
Date: Fri, 28 Mar 2014 10:31:59 -0700 [thread overview]
Message-ID: <20140328173136.713645652@linuxfoundation.org> (raw)
In-Reply-To: <20140328173134.630198216@linuxfoundation.org>
3.13-stable review patch. If anyone has any objections, please let me know.
------------------
From: Josh Durgin <josh.durgin@inktank.com>
commit 9a1ea2dbff11547a8e664f143c1ffefc586a577a upstream.
With the current full handling, there is a race between osds and
clients getting the first map marked full. If the osd wins, it will
return -ENOSPC to any writes, but the client may already have writes
in flight. This results in the client getting the error and
propagating it up the stack. For rbd, the block layer turns this into
EIO, which can cause corruption in filesystems above it.
To avoid this race, osds are being changed to drop writes that came
from clients with an osdmap older than the last osdmap marked full.
In order for this to work, clients must resend all writes after they
encounter a full -> not full transition in the osdmap. osds will wait
for an updated map instead of processing a request from a client with
a newer map, so resent writes will not be dropped by the osd unless
there is another not full -> full transition.
This approach requires both osds and clients to be fixed to avoid the
race. Old clients talking to osds with this fix may hang instead of
returning EIO and potentially corrupting an fs. New clients talking to
old osds have the same behavior as before if they encounter this race.
Fixes: http://tracker.ceph.com/issues/6938
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
net/ceph/osd_client.c | 28 ++++++++++++++++++++++------
1 file changed, 22 insertions(+), 6 deletions(-)
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1636,14 +1636,17 @@ static void reset_changed_osds(struct ce
*
* Caller should hold map_sem for read.
*/
-static void kick_requests(struct ceph_osd_client *osdc, int force_resend)
+static void kick_requests(struct ceph_osd_client *osdc, bool force_resend,
+ bool force_resend_writes)
{
struct ceph_osd_request *req, *nreq;
struct rb_node *p;
int needmap = 0;
int err;
+ bool force_resend_req;
- dout("kick_requests %s\n", force_resend ? " (force resend)" : "");
+ dout("kick_requests %s %s\n", force_resend ? " (force resend)" : "",
+ force_resend_writes ? " (force resend writes)" : "");
mutex_lock(&osdc->request_mutex);
for (p = rb_first(&osdc->requests); p; ) {
req = rb_entry(p, struct ceph_osd_request, r_node);
@@ -1668,7 +1671,10 @@ static void kick_requests(struct ceph_os
continue;
}
- err = __map_request(osdc, req, force_resend);
+ force_resend_req = force_resend ||
+ (force_resend_writes &&
+ req->r_flags & CEPH_OSD_FLAG_WRITE);
+ err = __map_request(osdc, req, force_resend_req);
if (err < 0)
continue; /* error */
if (req->r_osd == NULL) {
@@ -1688,7 +1694,8 @@ static void kick_requests(struct ceph_os
r_linger_item) {
dout("linger req=%p req->r_osd=%p\n", req, req->r_osd);
- err = __map_request(osdc, req, force_resend);
+ err = __map_request(osdc, req,
+ force_resend || force_resend_writes);
dout("__map_request returned %d\n", err);
if (err == 0)
continue; /* no change and no osd was specified */
@@ -1730,6 +1737,7 @@ void ceph_osdc_handle_map(struct ceph_os
struct ceph_osdmap *newmap = NULL, *oldmap;
int err;
struct ceph_fsid fsid;
+ bool was_full;
dout("handle_map have %u\n", osdc->osdmap ? osdc->osdmap->epoch : 0);
p = msg->front.iov_base;
@@ -1743,6 +1751,8 @@ void ceph_osdc_handle_map(struct ceph_os
down_write(&osdc->map_sem);
+ was_full = ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL);
+
/* incremental maps */
ceph_decode_32_safe(&p, end, nr_maps, bad);
dout(" %d inc maps\n", nr_maps);
@@ -1767,7 +1777,10 @@ void ceph_osdc_handle_map(struct ceph_os
ceph_osdmap_destroy(osdc->osdmap);
osdc->osdmap = newmap;
}
- kick_requests(osdc, 0);
+ was_full = was_full ||
+ ceph_osdmap_flag(osdc->osdmap,
+ CEPH_OSDMAP_FULL);
+ kick_requests(osdc, 0, was_full);
} else {
dout("ignoring incremental map %u len %d\n",
epoch, maplen);
@@ -1810,7 +1823,10 @@ void ceph_osdc_handle_map(struct ceph_os
skipped_map = 1;
ceph_osdmap_destroy(oldmap);
}
- kick_requests(osdc, skipped_map);
+ was_full = was_full ||
+ ceph_osdmap_flag(osdc->osdmap,
+ CEPH_OSDMAP_FULL);
+ kick_requests(osdc, skipped_map, was_full);
}
p += maplen;
nr_maps--;
next prev parent reply other threads:[~2014-03-28 17:31 UTC|newest]
Thread overview: 53+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-03-28 17:31 [PATCH 3.13 00/46] 3.13.8-stable review Greg Kroah-Hartman
2014-03-28 17:31 ` [PATCH 3.13 01/46] HID: hidraw: fix warning destroying hidraw device files after parent Greg Kroah-Hartman
2014-03-28 17:31 ` [PATCH 3.13 02/46] ALSA: compress: Pass through return value of open ops callback Greg Kroah-Hartman
2014-03-28 17:31 ` [PATCH 3.13 03/46] clocksource: vf_pit_timer: use complement for sched_clock reading Greg Kroah-Hartman
2014-03-28 17:31 ` [PATCH 3.13 04/46] drm/i915: Fix PSR programming Greg Kroah-Hartman
2014-03-28 17:31 ` [PATCH 3.13 05/46] drm/i915: Dont enable display error interrupts from the start Greg Kroah-Hartman
2014-03-28 17:31 ` [PATCH 3.13 06/46] drm/i915: Disable stolen memory when DMAR is active Greg Kroah-Hartman
2014-03-28 17:31 ` [PATCH 3.13 07/46] tracing: Fix array size mismatch in format string Greg Kroah-Hartman
2014-03-28 17:31 ` [PATCH 3.13 08/46] partly revert commit 8a10bc9: parisc/sti_console: prefer Linux fonts over built-in ROM fonts Greg Kroah-Hartman
2014-03-28 17:31 ` [PATCH 3.13 09/46] net: davinci_emac: Replace devm_request_irq with request_irq Greg Kroah-Hartman
2014-03-28 17:31 ` [PATCH 3.13 10/46] NFSv4: Use the correct net namespace in nfs4_update_server Greg Kroah-Hartman
2014-03-28 17:31 ` [PATCH 3.13 11/46] media: cxusb: unlock on error in cxusb_i2c_xfer() Greg Kroah-Hartman
2014-03-28 17:31 ` [PATCH 3.13 12/46] media: dw2102: some missing unlocks on error Greg Kroah-Hartman
2014-03-28 17:31 ` [PATCH 3.13 13/46] media: cx18: check for allocation failure in cx18_read_eeprom() Greg Kroah-Hartman
2014-03-28 17:31 ` [PATCH 3.13 14/46] libceph: block I/O when PAUSE or FULL osd map flags are set Greg Kroah-Hartman
2014-03-28 17:31 ` Greg Kroah-Hartman [this message]
2014-03-28 17:32 ` [PATCH 3.13 16/46] ASoC: max98090: make REVISION_ID readable Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 17/46] stop_machine: Fix^2 race between stop_two_cpus() and stop_cpus() Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 18/46] sfc: Use the correct maximum TX DMA ring size for SFC9100 Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 19/46] ARM: 7941/2: Fix incorrect FDT initrd parameter override Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 20/46] SUNRPC: Fix a pipe_version reference leak Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 21/46] x86: bpf_jit: support negative offsets Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 22/46] printk: fix syslog() overflowing user buffer Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 23/46] Fix uses of dma_max_pfn() when converting to a limiting address Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 24/46] perf tools: Fix AAAAARGH64 memory barriers Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 25/46] deb-pkg: Fix building for MIPS big-endian or ARM OABI Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 26/46] deb-pkg: Fix cross-building linux-headers package Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 27/46] MIPS: Fix build error seen in some configurations Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 28/46] p54: clamp properly instead of just truncating Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 29/46] regulator: core: Replace direct ops->disable usage Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 30/46] powerpc/powernv: Move PHB-diag dump functions around Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 31/46] powerpc/eeh: Handle multiple EEH errors Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 32/46] powerpc/powernv: Dump PHB diag-data immediately Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 33/46] powerpc/powernv: Refactor PHB diag-data dump Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 34/46] fs/proc/proc_devtree.c: remove empty /proc/device-tree when no openfirmware exists Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 35/46] Input: elantech - improve clickpad detection Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 36/46] KVM: MMU: handle invalid root_hpa at __direct_map Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 37/46] KVM: x86: handle invalid root_hpa everywhere Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 38/46] KVM: VMX: fix use after free of vmx->loaded_vmcs Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 39/46] Input: wacom - make sure touch_max is set for touch devices Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 40/46] Input: wacom - add support for three new Intuos devices Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 41/46] Input: wacom - add reporting of SW_MUTE_DEVICE events Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 42/46] xhci: Fix resume issues on Renesas chips in Samsung laptops Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 43/46] e100: Fix "disabling already-disabled device" warning Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 44/46] libceph: rename ceph_msg::front_max to front_alloc_len Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 45/46] libceph: rename front to front_len in get_reply() Greg Kroah-Hartman
2014-03-28 17:32 ` [PATCH 3.13 46/46] libceph: fix preallocation check " Greg Kroah-Hartman
2014-03-29 1:12 ` [PATCH 3.13 00/46] 3.13.8-stable review Guenter Roeck
2014-03-29 1:28 ` Greg Kroah-Hartman
2014-03-29 12:19 ` Satoru Takeuchi
2014-03-29 17:01 ` Greg Kroah-Hartman
2014-03-30 1:25 ` Shuah Khan
2014-03-30 2:49 ` Greg Kroah-Hartman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140328173136.713645652@linuxfoundation.org \
--to=gregkh@linuxfoundation.org \
--cc=josh.durgin@inktank.com \
--cc=linux-kernel@vger.kernel.org \
--cc=sage@inktank.com \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).