[PATCH 0/3] block I/O when cluster is full

* [PATCH 0/3] block I/O when cluster is full
@ 2013-12-03 23:12 Josh Durgin
  2013-12-03 23:12 ` [PATCH 1/3] libceph: block I/O when PAUSE or FULL osd map flags are set Josh Durgin
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Josh Durgin @ 2013-12-03 23:12 UTC (permalink / raw)
  To: ceph-devel

These patches allow rbd to block writes instead of returning errors
when OSDs are full enough that the FULL flag is set in the osd map.
This avoids filesystems on top of rbd getting confused by transient
EIOs if the cluster oscillates between full and non-full.

These are also available in the wip-full branch of ceph-client.git.

Josh Durgin (3):
  libceph: block I/O when PAUSE or FULL osd map flags are set
  libceph: add an option to configure client behavior when osds are
    full
  rbd: document rbd-specific options

 Documentation/ABI/testing/sysfs-bus-rbd |   19 ++++++++++++++++++
 include/linux/ceph/libceph.h            |    7 +++++++
 include/linux/ceph/osd_client.h         |    1 +
 net/ceph/ceph_common.c                  |   13 +++++++++++++
 net/ceph/osd_client.c                   |   32 +++++++++++++++++++++++++++++--
 5 files changed, 70 insertions(+), 2 deletions(-)

-- 
1.7.10.4



* [PATCH 1/3] libceph: block I/O when PAUSE or FULL osd map flags are set
  2013-12-03 23:12 [PATCH 0/3] block I/O when cluster is full Josh Durgin
@ 2013-12-03 23:12 ` Josh Durgin
  2013-12-07  3:02   ` Li Wang
  2013-12-03 23:12 ` [PATCH 2/3] libceph: add an option to configure client behavior when osds are full Josh Durgin
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Josh Durgin @ 2013-12-03 23:12 UTC (permalink / raw)
  To: ceph-devel

The PAUSEWR and PAUSERD flags are meant to stop the cluster from
processing writes and reads, respectively. The FULL flag is set when
the cluster determines that it is out of space, and will no longer
process writes.  PAUSEWR and PAUSERD are purely client-side settings
already implemented in userspace clients. The osd does nothing special
with these flags.

When the FULL flag is set, however, the osd responds to all writes
with -ENOSPC. For cephfs, this makes sense, but for rbd the block
layer translates this into EIO.  If a cluster goes from full to
non-full quickly, a filesystem on top of rbd will not behave well,
since some writes succeed while others get EIO.

Fix this by blocking any writes when the FULL flag is set in the osd
client. This is the same strategy used by userspace, so apply it by
default.  A follow-on patch makes this configurable.

__map_request() is called to re-target osd requests in case the
available osds changed.  Add a paused field to a ceph_osd_request, and
set it whenever an appropriate osd map flag is set.  Avoid queueing
paused requests in __map_request(), but force them to be resent if
they become unpaused.

Also subscribe to the next osd map from the monitor if any of these
flags are set, so paused requests can be unblocked as soon as
possible.

Fixes: http://tracker.ceph.com/issues/6079

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
---
 include/linux/ceph/osd_client.h |    1 +
 net/ceph/osd_client.c           |   29 +++++++++++++++++++++++++++--
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 8f47625..4fb6a89 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -138,6 +138,7 @@ struct ceph_osd_request {
 	__le64           *r_request_pool;
 	void             *r_request_pgid;
 	__le32           *r_request_attempts;
+	bool              r_paused;
 	struct ceph_eversion *r_request_reassert_version;
 
 	int               r_result;
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 2b4b32a..21476be 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1232,6 +1232,22 @@ void ceph_osdc_set_request_linger(struct ceph_osd_client *osdc,
 EXPORT_SYMBOL(ceph_osdc_set_request_linger);
 
 /*
+ * Returns whether a request should be blocked from being sent
+ * based on the current osdmap and osd_client settings.
+ *
+ * Caller should hold map_sem for read.
+ */
+static bool __req_should_be_paused(struct ceph_osd_client *osdc,
+				   struct ceph_osd_request *req)
+{
+	bool pauserd = ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSERD);
+	bool pausewr = ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSEWR) ||
+		ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL);
+	return (req->r_flags & CEPH_OSD_FLAG_READ && pauserd) ||
+		(req->r_flags & CEPH_OSD_FLAG_WRITE && pausewr);
+}
+
+/*
  * Pick an osd (the first 'up' osd in the pg), allocate the osd struct
  * (as needed), and set the request r_osd appropriately.  If there is
  * no up osd, set r_osd to NULL.  Move the request to the appropriate list
@@ -1248,6 +1264,7 @@ static int __map_request(struct ceph_osd_client *osdc,
 	int acting[CEPH_PG_MAX_SIZE];
 	int o = -1, num = 0;
 	int err;
+	bool was_paused;
 
 	dout("map_request %p tid %lld\n", req, req->r_tid);
 	err = ceph_calc_ceph_pg(&pgid, req->r_oid, osdc->osdmap,
@@ -1264,12 +1281,18 @@ static int __map_request(struct ceph_osd_client *osdc,
 		num = err;
 	}
 
+	was_paused = req->r_paused;
+	req->r_paused = __req_should_be_paused(osdc, req);
+	if (was_paused && !req->r_paused)
+		force_resend = 1;
+
 	if ((!force_resend &&
 	     req->r_osd && req->r_osd->o_osd == o &&
 	     req->r_sent >= req->r_osd->o_incarnation &&
 	     req->r_num_pg_osds == num &&
 	     memcmp(req->r_pg_osds, acting, sizeof(acting[0])*num) == 0) ||
-	    (req->r_osd == NULL && o == -1))
+	    (req->r_osd == NULL && o == -1) ||
+	    req->r_paused)
 		return 0;  /* no change */
 
 	dout("map_request tid %llu pgid %lld.%x osd%d (was osd%d)\n",
@@ -1804,7 +1827,9 @@ done:
 	 * we find out when we are no longer full and stop returning
 	 * ENOSPC.
 	 */
-	if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
+	if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL) ||
+		ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSERD) ||
+		ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSEWR))
 		ceph_monc_request_next_osdmap(&osdc->client->monc);
 
 	mutex_lock(&osdc->request_mutex);
-- 
1.7.10.4
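
The pause decision described in the commit message above can be sketched as a small standalone C fragment. This is only an illustration, not the kernel code: the flag values and every pause_* / PAUSE_* name are made up for the example, and locking is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag values only; the real ones live in the ceph headers. */
enum {
	PAUSE_MAP_FULL    = 1 << 0,
	PAUSE_MAP_PAUSERD = 1 << 1,
	PAUSE_MAP_PAUSEWR = 1 << 2,
};

enum {
	PAUSE_REQ_READ  = 1 << 0,
	PAUSE_REQ_WRITE = 1 << 1,
};

/* True if a request with req_flags must be held back under map_flags. */
static bool pause_should_hold(uint32_t map_flags, uint32_t req_flags)
{
	bool pauserd = map_flags & PAUSE_MAP_PAUSERD;
	bool pausewr = map_flags & (PAUSE_MAP_PAUSEWR | PAUSE_MAP_FULL);

	return ((req_flags & PAUSE_REQ_READ) && pauserd) ||
	       ((req_flags & PAUSE_REQ_WRITE) && pausewr);
}

/*
 * Same bookkeeping idea as was_paused in the patch: update the stored
 * paused state and return true when a previously paused request has just
 * become sendable, i.e. when a resend should be forced.
 */
static bool pause_update(bool *paused, uint32_t map_flags, uint32_t req_flags)
{
	bool was_paused = *paused;

	*paused = pause_should_hold(map_flags, req_flags);
	return was_paused && !*paused;
}
```

A write is held while FULL or PAUSEWR is set and a read only while PAUSERD is set; pause_update() reports the paused -> unpaused edge, which is the point where the patch sets force_resend.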



* [PATCH 2/3] libceph: add an option to configure client behavior when osds are full
  2013-12-03 23:12 [PATCH 0/3] block I/O when cluster is full Josh Durgin
  2013-12-03 23:12 ` [PATCH 1/3] libceph: block I/O when PAUSE or FULL osd map flags are set Josh Durgin
@ 2013-12-03 23:12 ` Josh Durgin
  2013-12-03 23:12 ` [PATCH 3/3] rbd: document rbd-specific options Josh Durgin
  2013-12-06  1:47 ` [PATCH 0/3] block I/O when cluster is full Josh Durgin
  3 siblings, 0 replies; 13+ messages in thread
From: Josh Durgin @ 2013-12-03 23:12 UTC (permalink / raw)
  To: ceph-devel

Default to blocking requests to be consistent with userspace. Some
applications may prefer the previous behavior of returning an error
instead, so make that an option. CephFS implements returning -ENOSPC
at a higher level, so only rbd is really affected by this.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
---
 include/linux/ceph/libceph.h |    7 +++++++
 net/ceph/ceph_common.c       |   13 +++++++++++++
 net/ceph/osd_client.c        |    5 ++++-
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
index 2e30248..77b28ac 100644
--- a/include/linux/ceph/libceph.h
+++ b/include/linux/ceph/libceph.h
@@ -32,6 +32,12 @@
 
 #define CEPH_OPT_DEFAULT   (0)
 
+/* osd full behavior */
+enum {
+	CEPH_OSD_FULL_ERROR,
+	CEPH_OSD_FULL_BLOCK,
+};
+
 #define ceph_set_opt(client, opt) \
 	(client)->options->flags |= CEPH_OPT_##opt;
 #define ceph_test_opt(client, opt) \
@@ -44,6 +50,7 @@ struct ceph_options {
 	int mount_timeout;
 	int osd_idle_ttl;
 	int osd_keepalive_timeout;
+	int osd_full_behavior;
 
 	/*
 	 * any type that can't be simply compared or doesn't need need
diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
index 34b11ee..d029fc5 100644
--- a/net/ceph/ceph_common.c
+++ b/net/ceph/ceph_common.c
@@ -217,6 +217,7 @@ enum {
 	Opt_secret,
 	Opt_key,
 	Opt_ip,
+	Opt_osd_full_behavior,
 	Opt_last_string,
 	/* string args above */
 	Opt_share,
@@ -236,6 +237,7 @@ static match_table_t opt_tokens = {
 	{Opt_secret, "secret=%s"},
 	{Opt_key, "key=%s"},
 	{Opt_ip, "ip=%s"},
+	{Opt_osd_full_behavior, "osd_full_behavior=%s"},
 	/* string args above */
 	{Opt_share, "share"},
 	{Opt_noshare, "noshare"},
@@ -329,6 +331,7 @@ ceph_parse_options(char *options, const char *dev_name,
 	opt->osd_keepalive_timeout = CEPH_OSD_KEEPALIVE_DEFAULT;
 	opt->mount_timeout = CEPH_MOUNT_TIMEOUT_DEFAULT; /* seconds */
 	opt->osd_idle_ttl = CEPH_OSD_IDLE_TTL_DEFAULT;   /* seconds */
+	opt->osd_full_behavior = CEPH_OSD_FULL_BLOCK;
 
 	/* get mon ip(s) */
 	/* ip1[:port1][,ip2[:port2]...] */
@@ -408,6 +411,16 @@ ceph_parse_options(char *options, const char *dev_name,
 			if (err < 0)
 				goto out;
 			break;
+		case Opt_osd_full_behavior:
+			if (!strcmp(argstr[0].from, "error")) {
+				opt->osd_full_behavior = CEPH_OSD_FULL_ERROR;
+			} else if (!strcmp(argstr[0].from, "block")) {
+				opt->osd_full_behavior = CEPH_OSD_FULL_BLOCK;
+			} else {
+				err = -EINVAL;
+				goto out;
+			}
+			break;
 
 			/* misc */
 		case Opt_osdtimeout:
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 21476be..664432e 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1240,9 +1240,12 @@ EXPORT_SYMBOL(ceph_osdc_set_request_linger);
 static bool __req_should_be_paused(struct ceph_osd_client *osdc,
 				   struct ceph_osd_request *req)
 {
+	bool block_on_full =
+		osdc->client->options->osd_full_behavior & CEPH_OSD_FULL_BLOCK;
 	bool pauserd = ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSERD);
 	bool pausewr = ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSEWR) ||
-		ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL);
+		(ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL) &&
+			block_on_full);
 	return (req->r_flags & CEPH_OSD_FLAG_READ && pauserd) ||
 		(req->r_flags & CEPH_OSD_FLAG_WRITE && pausewr);
 }
-- 
1.7.10.4
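
As a rough userspace illustration of the option handling proposed in this patch, the string-to-enum mapping could look like the sketch below. The opt_* names and the -1 error return are assumptions for the example; the patch itself uses the kernel's match_token() machinery and returns -EINVAL.

```c
#include <string.h>

/* Hypothetical stand-ins for CEPH_OSD_FULL_ERROR / CEPH_OSD_FULL_BLOCK. */
enum opt_full_behavior {
	OPT_FULL_ERROR,
	OPT_FULL_BLOCK,
};

/* Returns 0 and stores the behavior on success, -1 for an unknown value. */
static int opt_parse_full_behavior(const char *arg,
				   enum opt_full_behavior *out)
{
	if (strcmp(arg, "error") == 0)
		*out = OPT_FULL_ERROR;
	else if (strcmp(arg, "block") == 0)
		*out = OPT_FULL_BLOCK;
	else
		return -1;
	return 0;
}

/* Equality test rather than a bit test, since the values are not bit flags. */
static int opt_blocks_on_full(enum opt_full_behavior behavior)
{
	return behavior == OPT_FULL_BLOCK;
}
```

Comparing with == keeps the check correct even if more enumerators are added later; that is a design choice of this sketch, not something taken from the patch.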



* [PATCH 3/3] rbd: document rbd-specific options
  2013-12-03 23:12 [PATCH 0/3] block I/O when cluster is full Josh Durgin
  2013-12-03 23:12 ` [PATCH 1/3] libceph: block I/O when PAUSE or FULL osd map flags are set Josh Durgin
  2013-12-03 23:12 ` [PATCH 2/3] libceph: add an option to configure client behavior when osds are full Josh Durgin
@ 2013-12-03 23:12 ` Josh Durgin
  2013-12-06  1:47 ` [PATCH 0/3] block I/O when cluster is full Josh Durgin
  3 siblings, 0 replies; 13+ messages in thread
From: Josh Durgin @ 2013-12-03 23:12 UTC (permalink / raw)
  To: ceph-devel

osd_full_behavior only affects rbd, so document it along with
read-only and read-write.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
---
 Documentation/ABI/testing/sysfs-bus-rbd |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-rbd b/Documentation/ABI/testing/sysfs-bus-rbd
index 0a30647..15f3ba6 100644
--- a/Documentation/ABI/testing/sysfs-bus-rbd
+++ b/Documentation/ABI/testing/sysfs-bus-rbd
@@ -18,6 +18,25 @@ Removal of a device:
 
   $ echo <dev-id> > /sys/bus/rbd/remove
 
+Options
+-------
+
+read_only/ro
+
+	The mapped device will only handle reads. This is the default for
+	snapshots.
+
+read_write/rw
+
+	The mapped device will handle reads and writes. This is invalid
+	for snapshots.
+
+osd_full_behavior
+
+	Choose how to handle writes to a full ceph cluster. Options are
+	"block" to pause I/O until there is space (the default), or
+	"error", to return an I/O error.
+
 Entries under /sys/bus/rbd/devices/<dev-id>/
 --------------------------------------------
 
-- 
1.7.10.4



* Re: [PATCH 0/3] block I/O when cluster is full
  2013-12-03 23:12 [PATCH 0/3] block I/O when cluster is full Josh Durgin
                   ` (2 preceding siblings ...)
  2013-12-03 23:12 ` [PATCH 3/3] rbd: document rbd-specific options Josh Durgin
@ 2013-12-06  1:47 ` Josh Durgin
  2013-12-06  4:58   ` Gregory Farnum
  3 siblings, 1 reply; 13+ messages in thread
From: Josh Durgin @ 2013-12-06  1:47 UTC (permalink / raw)
  To: ceph-devel

On 12/03/2013 03:12 PM, Josh Durgin wrote:
> These patches allow rbd to block writes instead of returning errors
> when OSDs are full enough that the FULL flag is set in the osd map.
> This avoids filesystems on top of rbd getting confused by transient
> EIOs if the cluster oscillates between full and non-full.
>
> These are also available in the wip-full branch of ceph-client.git.
>
> Josh Durgin (3):
>    libceph: block I/O when PAUSE or FULL osd map flags are set
>    libceph: add an option to configure client behavior when osds are
>      full
>    rbd: document rbd-specific options

Due to a race condition between clients and osds in handling maps
marked FULL, it's not feasible to offer the 'error' option, so patches
2 and 3 can be ignored.

http://tracker.ceph.com/issues/6938



* Re: [PATCH 0/3] block I/O when cluster is full
  2013-12-06  1:47 ` [PATCH 0/3] block I/O when cluster is full Josh Durgin
@ 2013-12-06  4:58   ` Gregory Farnum
  2013-12-07  2:16     ` Josh Durgin
  0 siblings, 1 reply; 13+ messages in thread
From: Gregory Farnum @ 2013-12-06  4:58 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel@vger.kernel.org

On Thu, Dec 5, 2013 at 5:47 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
> On 12/03/2013 03:12 PM, Josh Durgin wrote:
>>
>> These patches allow rbd to block writes instead of returning errors
>> when OSDs are full enough that the FULL flag is set in the osd map.
>> This avoids filesystems on top of rbd getting confused by transient
>> EIOs if the cluster oscillates between full and non-full.
>>
>> These are also available in the wip-full branch of ceph-client.git.
>>
>> Josh Durgin (3):
>>    libceph: block I/O when PAUSE or FULL osd map flags are set
>>    libceph: add an option to configure client behavior when osds are
>>      full
>>    rbd: document rbd-specific options
>
>
> Due to a race condition between clients and osds in handling maps
> marked FULL, it's not feasible to offer the 'error' option, so patches
> 2 and 3 can be ignored.
>
> http://tracker.ceph.com/issues/6938

It's not clear to me — are you going to assume all ENOSPC means the
map is marked as full and intercept it, or that you can't reliably
block IO so don't bother trying?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 0/3] block I/O when cluster is full
  2013-12-06  4:58   ` Gregory Farnum
@ 2013-12-07  2:16     ` Josh Durgin
  2013-12-07  2:24       ` Gregory Farnum
  0 siblings, 1 reply; 13+ messages in thread
From: Josh Durgin @ 2013-12-07  2:16 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel@vger.kernel.org

On 12/05/2013 08:58 PM, Gregory Farnum wrote:
> On Thu, Dec 5, 2013 at 5:47 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
>> On 12/03/2013 03:12 PM, Josh Durgin wrote:
>>>
>>> These patches allow rbd to block writes instead of returning errors
>>> when OSDs are full enough that the FULL flag is set in the osd map.
>>> This avoids filesystems on top of rbd getting confused by transient
>>> EIOs if the cluster oscillates between full and non-full.
>>>
>>> These are also available in the wip-full branch of ceph-client.git.
>>>
>>> Josh Durgin (3):
>>>     libceph: block I/O when PAUSE or FULL osd map flags are set
>>>     libceph: add an option to configure client behavior when osds are
>>>       full
>>>     rbd: document rbd-specific options
>>
>>
>> Due to a race condition between clients and osds in handling maps
>> marked FULL, it's not feasible to offer the 'error' option, so patches
>> 2 and 3 can be ignored.
>>
>> http://tracker.ceph.com/issues/6938
>
> It's not clear to me — are you going to assume all ENOSPC means the
> map is marked as full and intercept it, or that you can't reliably
> block IO so don't bother trying?

Don't bother trying to stop ENOSPC on the client side, since it'd need
some restructuring on the kernel side and would be prone to screwing up
write ordering.

Instead, drop writes on the osd side when they have a map marked full,
and have clients resend all writes when a map transitions from
full -> nonfull. The userspace side is https://github.com/ceph/ceph/pull/914
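
A minimal sketch of the client half of that scheme, assuming a simple flag word per osd map: detect the full -> nonfull edge and mark every outstanding write for resend. All trans_* / TRANS_* names are hypothetical and none of this is taken from the kernel or from the userspace pull request.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TRANS_MAP_FULL (1u << 0)	/* illustrative flag value */

struct trans_req {
	bool is_write;
	bool needs_resend;
};

/* True only when the old map was marked full and the new one is not. */
static bool trans_full_cleared(uint32_t old_flags, uint32_t new_flags)
{
	return (old_flags & TRANS_MAP_FULL) && !(new_flags & TRANS_MAP_FULL);
}

/*
 * On a full -> nonfull transition, mark all writes for resend, since the
 * osd may have dropped them while it held a full map. Returns the number
 * of requests marked.
 */
static size_t trans_handle_map(uint32_t old_flags, uint32_t new_flags,
			       struct trans_req *reqs, size_t n)
{
	size_t i, marked = 0;

	if (!trans_full_cleared(old_flags, new_flags))
		return 0;

	for (i = 0; i < n; i++) {
		if (reqs[i].is_write) {
			reqs[i].needs_resend = true;
			marked++;
		}
	}
	return marked;
}
```

Reads are left alone in this sketch because only writes are described as being dropped on the osd side.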


* Re: [PATCH 0/3] block I/O when cluster is full
  2013-12-07  2:16     ` Josh Durgin
@ 2013-12-07  2:24       ` Gregory Farnum
  2013-12-10  0:11         ` Josh Durgin
  0 siblings, 1 reply; 13+ messages in thread
From: Gregory Farnum @ 2013-12-07  2:24 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel@vger.kernel.org

On Fri, Dec 6, 2013 at 6:16 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
> On 12/05/2013 08:58 PM, Gregory Farnum wrote:
>>
>> On Thu, Dec 5, 2013 at 5:47 PM, Josh Durgin <josh.durgin@inktank.com>
>> wrote:
>>>
>>> On 12/03/2013 03:12 PM, Josh Durgin wrote:
>>>>
>>>>
>>>> These patches allow rbd to block writes instead of returning errors
>>>> when OSDs are full enough that the FULL flag is set in the osd map.
>>>> This avoids filesystems on top of rbd getting confused by transient
>>>> EIOs if the cluster oscillates between full and non-full.
>>>>
>>>> These are also available in the wip-full branch of ceph-client.git.
>>>>
>>>> Josh Durgin (3):
>>>>     libceph: block I/O when PAUSE or FULL osd map flags are set
>>>>     libceph: add an option to configure client behavior when osds are
>>>>       full
>>>>     rbd: document rbd-specific options
>>>
>>>
>>>
>>> Due to a race condition between clients and osds in handling maps
>>> marked FULL, it's not feasible to offer the 'error' option, so patches
>>> 2 and 3 can be ignored.
>>>
>>> http://tracker.ceph.com/issues/6938
>>
>>
>> It's not clear to me — are you going to assume all ENOSPC means the
>> map is marked as full and intercept it, or that you can't reliably
>> block IO so don't bother trying?
>
>
> Don't bother trying to stop ENOSPC on the client side, since it'd need some
> restructuring on the kernel side and would be prone to screwing up
> write ordering.
>
> Instead, drop writes on the osd side when they have a map marked full,
> and have clients resend all writes when a map transitions from
> full -> nonfull. The userspace side is https://github.com/ceph/ceph/pull/914

Do previous client implementations already satisfy that requirement?
We can't drop requests if older clients expect a response...
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

* Re: [PATCH 1/3] libceph: block I/O when PAUSE or FULL osd map flags are set
  2013-12-03 23:12 ` [PATCH 1/3] libceph: block I/O when PAUSE or FULL osd map flags are set Josh Durgin
@ 2013-12-07  3:02   ` Li Wang
  2013-12-09 23:52     ` Josh Durgin
  0 siblings, 1 reply; 13+ messages in thread
From: Li Wang @ 2013-12-07  3:02 UTC (permalink / raw)
  To: Josh Durgin, ceph-devel

I just had a quick look and have not thought it through thoroughly.
(1) There may be a race condition: an earlier write gets blocked by
FULL, a later write happens to be sent to the osd after the FULL ->
NOFULL transition, and then the earlier write is resent, causing the old
data to overwrite the new data.
(2) If the cluster stays FULL, how long does the write request have to
wait? The upper fs writepage/writepages() kernel thread or sync process
will hang there.


On 2013/12/4 7:12, Josh Durgin wrote:
> The PAUSEWR and PAUSERD flags are meant to stop the cluster from
> processing writes and reads, respectively. The FULL flag is set when
> the cluster determines that it is out of space, and will no longer
> process writes.  PAUSEWR and PAUSERD are purely client-side settings
> already implemented in userspace clients. The osd does nothing special
> with these flags.
>
> When the FULL flag is set, however, the osd responds to all writes
> with -ENOSPC. For cephfs, this makes sense, but for rbd the block
> layer translates this into EIO.  If a cluster goes from full to
> non-full quickly, a filesystem on top of rbd will not behave well,
> since some writes succeed while others get EIO.
>
> Fix this by blocking any writes when the FULL flag is set in the osd
> client. This is the same strategy used by userspace, so apply it by
> default.  A follow-on patch makes this configurable.
>
> __map_request() is called to re-target osd requests in case the
> available osds changed.  Add a paused field to a ceph_osd_request, and
> set it whenever an appropriate osd map flag is set.  Avoid queueing
> paused requests in __map_request(), but force them to be resent if
> they become unpaused.
>
> Also subscribe to the next osd map from the monitor if any of these
> flags are set, so paused requests can be unblocked as soon as
> possible.
>
> Fixes: http://tracker.ceph.com/issues/6079
>
> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
> ---
>   include/linux/ceph/osd_client.h |    1 +
>   net/ceph/osd_client.c           |   29 +++++++++++++++++++++++++++--
>   2 files changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index 8f47625..4fb6a89 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -138,6 +138,7 @@ struct ceph_osd_request {
>   	__le64           *r_request_pool;
>   	void             *r_request_pgid;
>   	__le32           *r_request_attempts;
> +	bool              r_paused;
>   	struct ceph_eversion *r_request_reassert_version;
>
>   	int               r_result;
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 2b4b32a..21476be 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -1232,6 +1232,22 @@ void ceph_osdc_set_request_linger(struct ceph_osd_client *osdc,
>   EXPORT_SYMBOL(ceph_osdc_set_request_linger);
>
>   /*
> + * Returns whether a request should be blocked from being sent
> + * based on the current osdmap and osd_client settings.
> + *
> + * Caller should hold map_sem for read.
> + */
> +static bool __req_should_be_paused(struct ceph_osd_client *osdc,
> +				   struct ceph_osd_request *req)
> +{
> +	bool pauserd = ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSERD);
> +	bool pausewr = ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSEWR) ||
> +		ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL);
> +	return (req->r_flags & CEPH_OSD_FLAG_READ && pauserd) ||
> +		(req->r_flags & CEPH_OSD_FLAG_WRITE && pausewr);
> +}
> +
> +/*
>    * Pick an osd (the first 'up' osd in the pg), allocate the osd struct
>    * (as needed), and set the request r_osd appropriately.  If there is
>    * no up osd, set r_osd to NULL.  Move the request to the appropriate list
> @@ -1248,6 +1264,7 @@ static int __map_request(struct ceph_osd_client *osdc,
>   	int acting[CEPH_PG_MAX_SIZE];
>   	int o = -1, num = 0;
>   	int err;
> +	bool was_paused;
>
>   	dout("map_request %p tid %lld\n", req, req->r_tid);
>   	err = ceph_calc_ceph_pg(&pgid, req->r_oid, osdc->osdmap,
> @@ -1264,12 +1281,18 @@ static int __map_request(struct ceph_osd_client *osdc,
>   		num = err;
>   	}
>
> +	was_paused = req->r_paused;
> +	req->r_paused = __req_should_be_paused(osdc, req);
> +	if (was_paused && !req->r_paused)
> +		force_resend = 1;
> +
>   	if ((!force_resend &&
>   	     req->r_osd && req->r_osd->o_osd == o &&
>   	     req->r_sent >= req->r_osd->o_incarnation &&
>   	     req->r_num_pg_osds == num &&
>   	     memcmp(req->r_pg_osds, acting, sizeof(acting[0])*num) == 0) ||
> -	    (req->r_osd == NULL && o == -1))
> +	    (req->r_osd == NULL && o == -1) ||
> +	    req->r_paused)
>   		return 0;  /* no change */
>
>   	dout("map_request tid %llu pgid %lld.%x osd%d (was osd%d)\n",
> @@ -1804,7 +1827,9 @@ done:
>   	 * we find out when we are no longer full and stop returning
>   	 * ENOSPC.
>   	 */
> -	if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
> +	if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL) ||
> +		ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSERD) ||
> +		ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSEWR))
>   		ceph_monc_request_next_osdmap(&osdc->client->monc);
>
>   	mutex_lock(&osdc->request_mutex);
>


* Re: [PATCH 1/3] libceph: block I/O when PAUSE or FULL osd map flags are set
  2013-12-07  3:02   ` Li Wang
@ 2013-12-09 23:52     ` Josh Durgin
  0 siblings, 0 replies; 13+ messages in thread
From: Josh Durgin @ 2013-12-09 23:52 UTC (permalink / raw)
  To: Li Wang, ceph-devel

On 12/06/2013 07:02 PM, Li Wang wrote:
> I just had a quick look and have not thought it through thoroughly.
> (1) There may be a race condition: an earlier write gets blocked by
> FULL, a later write happens to be sent to the osd after the FULL ->
> NOFULL transition, and then the earlier write is resent, causing the old
> data to overwrite the new data.

The first osdmap that contains the FULL -> NOFULL transition will queue
all paused writes to be resent in their original order (by osd_client
tid). This happens before any further writes are possible since
osdc->map_sem is locked exclusively, preventing the race.
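
The ordering argument can be illustrated with a small standalone sketch: once the map clears the flag, paused requests are unpaused and sent in ascending tid order. The order_* names are hypothetical, and the exclusive map_sem hold that keeps new writes out during this step is assumed rather than shown.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct order_req {
	uint64_t tid;	/* submission id, assigned in increasing order */
	int paused;
};

/* qsort comparator: ascending tid, written to avoid overflow. */
static int order_cmp_tid(const void *a, const void *b)
{
	const struct order_req *ra = a;
	const struct order_req *rb = b;

	return (ra->tid > rb->tid) - (ra->tid < rb->tid);
}

/*
 * Unpause every paused request and record its tid in out[] in the order
 * it would be sent. out must have room for n entries. Returns the number
 * of requests resent.
 */
static size_t order_resend_paused(struct order_req *reqs, size_t n,
				  uint64_t *out)
{
	size_t i, sent = 0;

	qsort(reqs, n, sizeof(*reqs), order_cmp_tid);
	for (i = 0; i < n; i++) {
		if (!reqs[i].paused)
			continue;
		reqs[i].paused = 0;
		out[sent++] = reqs[i].tid;
	}
	return sent;
}
```

Because every paused write is queued before any new write can be submitted, an older write cannot land after a newer one to the same object in this model.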

> (2) If the cluster stays FULL, how long does the write request have to
> wait? The upper fs writepage/writepages() kernel thread or sync process
> will hang there.

There is no upper bound on the wait. This is a conservative approach
to prevent any EIOs causing fs corruption for fses on top of rbd.
Thin-provisioned lvm devices behave this way as well, blocking when
they run out of space until more space is available. Do you have an
idea for avoiding this?

> On 2013/12/4 7:12, Josh Durgin wrote:
>> The PAUSEWR and PAUSERD flags are meant to stop the cluster from
>> processing writes and reads, respectively. The FULL flag is set when
>> the cluster determines that it is out of space, and will no longer
>> process writes.  PAUSEWR and PAUSERD are purely client-side settings
>> already implemented in userspace clients. The osd does nothing special
>> with these flags.
>>
>> When the FULL flag is set, however, the osd responds to all writes
>> with -ENOSPC. For cephfs, this makes sense, but for rbd the block
>> layer translates this into EIO.  If a cluster goes from full to
>> non-full quickly, a filesystem on top of rbd will not behave well,
>> since some writes succeed while others get EIO.
>>
>> Fix this by blocking any writes when the FULL flag is set in the osd
>> client. This is the same strategy used by userspace, so apply it by
>> default.  A follow-on patch makes this configurable.
>>
>> __map_request() is called to re-target osd requests in case the
>> available osds changed.  Add a paused field to a ceph_osd_request, and
>> set it whenever an appropriate osd map flag is set.  Avoid queueing
>> paused requests in __map_request(), but force them to be resent if
>> they become unpaused.
>>
>> Also subscribe to the next osd map from the monitor if any of these
>> flags are set, so paused requests can be unblocked as soon as
>> possible.
>>
>> Fixes: http://tracker.ceph.com/issues/6079
>>
>> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
>> ---
>>   include/linux/ceph/osd_client.h |    1 +
>>   net/ceph/osd_client.c           |   29 +++++++++++++++++++++++++++--
>>   2 files changed, 28 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/ceph/osd_client.h
>> b/include/linux/ceph/osd_client.h
>> index 8f47625..4fb6a89 100644
>> --- a/include/linux/ceph/osd_client.h
>> +++ b/include/linux/ceph/osd_client.h
>> @@ -138,6 +138,7 @@ struct ceph_osd_request {
>>       __le64           *r_request_pool;
>>       void             *r_request_pgid;
>>       __le32           *r_request_attempts;
>> +    bool              r_paused;
>>       struct ceph_eversion *r_request_reassert_version;
>>
>>       int               r_result;
>> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
>> index 2b4b32a..21476be 100644
>> --- a/net/ceph/osd_client.c
>> +++ b/net/ceph/osd_client.c
>> @@ -1232,6 +1232,22 @@ void ceph_osdc_set_request_linger(struct
>> ceph_osd_client *osdc,
>>   EXPORT_SYMBOL(ceph_osdc_set_request_linger);
>>
>>   /*
>> + * Returns whether a request should be blocked from being sent
>> + * based on the current osdmap and osd_client settings.
>> + *
>> + * Caller should hold map_sem for read.
>> + */
>> +static bool __req_should_be_paused(struct ceph_osd_client *osdc,
>> +                   struct ceph_osd_request *req)
>> +{
>> +    bool pauserd = ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSERD);
>> +    bool pausewr = ceph_osdmap_flag(osdc->osdmap,
>> CEPH_OSDMAP_PAUSEWR) ||
>> +        ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL);
>> +    return (req->r_flags & CEPH_OSD_FLAG_READ && pauserd) ||
>> +        (req->r_flags & CEPH_OSD_FLAG_WRITE && pausewr);
>> +}
>> +
>> +/*
>>    * Pick an osd (the first 'up' osd in the pg), allocate the osd struct
>>    * (as needed), and set the request r_osd appropriately.  If there is
>>    * no up osd, set r_osd to NULL.  Move the request to the
>> appropriate list
>> @@ -1248,6 +1264,7 @@ static int __map_request(struct ceph_osd_client
>> *osdc,
>>       int acting[CEPH_PG_MAX_SIZE];
>>       int o = -1, num = 0;
>>       int err;
>> +    bool was_paused;
>>
>>       dout("map_request %p tid %lld\n", req, req->r_tid);
>>       err = ceph_calc_ceph_pg(&pgid, req->r_oid, osdc->osdmap,
>> @@ -1264,12 +1281,18 @@ static int __map_request(struct
>> ceph_osd_client *osdc,
>>           num = err;
>>       }
>>
>> +    was_paused = req->r_paused;
>> +    req->r_paused = __req_should_be_paused(osdc, req);
>> +    if (was_paused && !req->r_paused)
>> +        force_resend = 1;
>> +
>>       if ((!force_resend &&
>>            req->r_osd && req->r_osd->o_osd == o &&
>>            req->r_sent >= req->r_osd->o_incarnation &&
>>            req->r_num_pg_osds == num &&
>>            memcmp(req->r_pg_osds, acting, sizeof(acting[0])*num) == 0) ||
>> -        (req->r_osd == NULL && o == -1))
>> +        (req->r_osd == NULL && o == -1) ||
>> +        req->r_paused)
>>           return 0;  /* no change */
>>
>>       dout("map_request tid %llu pgid %lld.%x osd%d (was osd%d)\n",
>> @@ -1804,7 +1827,9 @@ done:
>>        * we find out when we are no longer full and stop returning
>>        * ENOSPC.
>>        */
>> -    if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
>> +    if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL) ||
>> +        ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSERD) ||
>> +        ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSEWR))
>>           ceph_monc_request_next_osdmap(&osdc->client->monc);
>>
>>       mutex_lock(&osdc->request_mutex);
>>



* Re: [PATCH 0/3] block I/O when cluster is full
  2013-12-07  2:24       ` Gregory Farnum
@ 2013-12-10  0:11         ` Josh Durgin
  2013-12-10  0:19           ` Gregory Farnum
  0 siblings, 1 reply; 13+ messages in thread
From: Josh Durgin @ 2013-12-10  0:11 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel@vger.kernel.org

On 12/06/2013 06:24 PM, Gregory Farnum wrote:
> On Fri, Dec 6, 2013 at 6:16 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
>> On 12/05/2013 08:58 PM, Gregory Farnum wrote:
>>>
>>> On Thu, Dec 5, 2013 at 5:47 PM, Josh Durgin <josh.durgin@inktank.com>
>>> wrote:
>>>>
>>>> On 12/03/2013 03:12 PM, Josh Durgin wrote:
>>>>>
>>>>>
>>>>> These patches allow rbd to block writes instead of returning errors
>>>>> when OSDs are full enough that the FULL flag is set in the osd map.
>>>>> This avoids filesystems on top of rbd getting confused by transient
>>>>> EIOs if the cluster oscillates between full and non-full.
>>>>>
>>>>> These are also available in the wip-full branch of ceph-client.git.
>>>>>
>>>>> Josh Durgin (3):
>>>>>      libceph: block I/O when PAUSE or FULL osd map flags are set
>>>>>      libceph: add an option to configure client behavior when osds are
>>>>>        full
>>>>>      rbd: document rbd-specific options
>>>>
>>>>
>>>>
>>>> Due to a race condition between clients and osds in handling maps
>>>> marked FULL, it's not feasible to offer the 'error' option, so patches
>>>> 2 and 3 can be ignored.
>>>>
>>>> http://tracker.ceph.com/issues/6938
>>>
>>>
>>> It's not clear to me — are you going to assume all ENOSPC means the
>>> map is marked as full and intercept it, or that you can't reliably
>>> block IO so don't bother trying?
>>
>>
>> Don't bother trying to stop ENOSPC on the client side, since it'd need some
>> restructuring in the kernel side and would be prone to screwing up
>> write ordering.
>>
>> Instead drop writes on the osd side when they have a map marked full,
>> and have clients resend all writes when a map transitions from
>> full -> nonfull. The userspace side is https://github.com/ceph/ceph/pull/914
>
> Do previous client implementations already satisfy that requirement?
> We can't drop requests if older clients expect a response...

No, previous clients do not do this. For old rbd clients, this turns a
potential corruption into a hang, which is a good trade-off imo.

For userspace clients, this only happens when the osd gets the FULL map
first, and rejects a write in flight before the client got a FULL map.

The kernel client already rejects writes at the fs layer when the FULL
flag is set, so kcephfs will only be affected when it hits this race as
well.
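
A rough userspace model of the scheme quoted above, as a sketch rather than the actual OSD or libceph code: the OSD drops writes that arrive while its map is marked full, and the client resends every unacknowledged write when its map goes from full to nonfull. This assumes "they" in the quoted text refers to the OSDs. All names here (`struct model_write`, `osd_handle_write`, `client_handle_map`, `MODEL_FLAG_FULL`) are hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the FULL osd map flag; not the kernel value. */
#define MODEL_FLAG_FULL 0x1u

struct model_write {
	bool acked;      /* OSD applied the write and replied */
	int  times_sent; /* how often the client has sent it */
};

/* OSD side: silently drop a write while the OSD's map is marked full. */
static bool osd_handle_write(unsigned int osd_map_flags, struct model_write *w)
{
	if (osd_map_flags & MODEL_FLAG_FULL)
		return false;   /* dropped: no reply, no ENOSPC */
	w->acked = true;
	return true;
}

/*
 * Client side: on a new map, resend every unacked write if the map
 * went from full to nonfull. Returns the number of writes resent.
 */
static size_t client_handle_map(unsigned int old_flags, unsigned int new_flags,
				struct model_write *writes, size_t n)
{
	size_t i, resent = 0;

	if (!(old_flags & MODEL_FLAG_FULL) || (new_flags & MODEL_FLAG_FULL))
		return 0;       /* not a full -> nonfull transition */

	for (i = 0; i < n; i++) {
		if (writes[i].acked)
			continue;
		writes[i].times_sent++;
		osd_handle_write(new_flags, &writes[i]);
		resent++;
	}
	return resent;
}
```

An old client that lacks the resend step would leave a dropped write unacknowledged indefinitely in this model, which is the hang described above.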


* Re: [PATCH 0/3] block I/O when cluster is full
  2013-12-10  0:11         ` Josh Durgin
@ 2013-12-10  0:19           ` Gregory Farnum
  2013-12-10  0:45             ` Josh Durgin
  0 siblings, 1 reply; 13+ messages in thread
From: Gregory Farnum @ 2013-12-10  0:19 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel@vger.kernel.org

On Mon, Dec 9, 2013 at 4:11 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
> On 12/06/2013 06:24 PM, Gregory Farnum wrote:
>>
>> On Fri, Dec 6, 2013 at 6:16 PM, Josh Durgin <josh.durgin@inktank.com>
>> wrote:
>>> Don't bother trying to stop ENOSPC on the client side, since it'd need
>>> some
>>> restructuring in the kernel side and would be prone to screwing up
>>> write ordering.
>>>
>>> Instead drop writes on the osd side when they have a map marked full,
>>> and have clients resend all writes when a map transitions from
>>> full -> nonfull. The userspace side is
>>> https://github.com/ceph/ceph/pull/914
>>
>>
>> Do previous client implementations already satisfy that requirement?
>> We can't drop requests if older clients expect a response...
>
>
> No, previous clients do not do this. For old rbd clients, this turns a
> potential corruption into a hang, which is a good trade-off imo.
>
> For userspace clients, this only happens when the osd gets the FULL map
> first, and rejects a write in flight before the client got a FULL map.
>
> The kernel client already rejects writes at the fs layer when the FULL
> flag is set, so kcephfs will only be affected when it hits this race as
> well.

Hrm, do we have mechanisms in the kernel for re-sending ops that are
waiting? Hanging instead of corrupting doesn't help us much if we have
no way to get the proper state on disk.
-Greg


* Re: [PATCH 0/3] block I/O when cluster is full
  2013-12-10  0:19           ` Gregory Farnum
@ 2013-12-10  0:45             ` Josh Durgin
  0 siblings, 0 replies; 13+ messages in thread
From: Josh Durgin @ 2013-12-10  0:45 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel@vger.kernel.org

On 12/09/2013 04:19 PM, Gregory Farnum wrote:
> On Mon, Dec 9, 2013 at 4:11 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
>> On 12/06/2013 06:24 PM, Gregory Farnum wrote:
>>>
>>> On Fri, Dec 6, 2013 at 6:16 PM, Josh Durgin <josh.durgin@inktank.com>
>>> wrote:
>>>> Don't bother trying to stop ENOSPC on the client side, since it'd need
>>>> some
>>>> restructuring in the kernel side and would be prone to screwing up
>>>> write ordering.
>>>>
>>>> Instead drop writes on the osd side when they have a map marked full,
>>>> and have clients resend all writes when a map transitions from
>>>> full -> nonfull. The userspace side is
>>>> https://github.com/ceph/ceph/pull/914
>>>
>>>
>>> Do previous client implementations already satisfy that requirement?
>>> We can't drop requests if older clients expect a response...
>>
>>
>> No, previous clients do not do this. For old rbd clients, this turns a
>> potential corruption into a hang, which is a good trade-off imo.
>>
>> For userspace clients, this only happens when the osd gets the FULL map
>> first, and rejects a write in flight before the client got a FULL map.
>>
>> The kernel client already rejects writes at the fs layer when the FULL
>> flag is set, so kcephfs will only be affected when it hits this race as
>> well.
>
> Hrm, do we have mechanisms in the kernel for re-sending ops that are
> waiting? Hanging instead of corrupting doesn't help us much if we have
> no way to get the proper state on disk.

Yes, that's what 1/3 uses.
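
For reference, the mechanism in 1/3 is the was_paused check in __map_request() quoted earlier in this thread: a request that was paused and no longer needs to be gets force_resend set. Below is a minimal userspace sketch of that logic. The flag values, the struct layout, and the body of `req_should_be_paused()` are assumptions based on the 1/3 description (PAUSERD stops reads, PAUSEWR and FULL stop writes), not copies of the kernel definitions, and the model ignores changes in the OSD mapping.

```c
#include <stdbool.h>

/* Illustrative flag values only; the real ones live in the ceph headers. */
enum { MAP_FULL = 1 << 0, MAP_PAUSERD = 1 << 1, MAP_PAUSEWR = 1 << 2 };
enum { REQ_READ = 1 << 0, REQ_WRITE = 1 << 1 };

struct model_req {
	unsigned int flags;  /* REQ_READ and/or REQ_WRITE */
	bool paused;         /* mirrors req->r_paused in the patch */
};

/* Assumed pause rule: reads stop on PAUSERD, writes on PAUSEWR or FULL. */
static bool req_should_be_paused(unsigned int map_flags,
				 const struct model_req *req)
{
	bool pauserd = map_flags & MAP_PAUSERD;
	bool pausewr = map_flags & (MAP_PAUSEWR | MAP_FULL);

	return ((req->flags & REQ_READ) && pauserd) ||
	       ((req->flags & REQ_WRITE) && pausewr);
}

/* Returns true if the request must be (re)sent under the new map. */
static bool map_request(unsigned int map_flags, struct model_req *req,
			bool force_resend)
{
	bool was_paused = req->paused;

	req->paused = req_should_be_paused(map_flags, req);
	if (was_paused && !req->paused)
		force_resend = true;    /* unpaused: must go out again */

	if (req->paused)
		return false;           /* stays queued, nothing is sent */
	return force_resend;
}
```

In this model a write paused under a FULL map is reported for resend on the first call made with a map that no longer has the flag, and not again afterwards.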


