All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] RFC: Enable clients to distinguish busy and unreachable OSDs
@ 2011-06-22 18:26 Jim Schutt
  2011-06-22 18:26 ` [PATCH] ceph: distinguish between unreachable and busy osds when resetting a connection Jim Schutt
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Jim Schutt @ 2011-06-22 18:26 UTC (permalink / raw)
  To: ceph-devel; +Cc: Jim Schutt

Previously, when clients' sustained offered write load exceeded the
sustained throughput of the OSDs, normal operation was that client
messages timed out while waiting to be processed by the OSDs.  The
client response to this was to reset the connection to the OSD
handling a timed-out message.

That has at least two types of impact:
    
- the reset frequently happens while data is being sent, so data
  that was successfully received must be discarded and resent.

- after several such connection resets, many sockets can remain open,
  waiting for readers to be granted space by the policy throttler,
  so that they can notice that the pipe has been shut down, and the
  socket can be closed.

This patchset causes Ceph OSDs to send keepalives when waiting for 
sufficient buffer space to receive a message from a client. There is
also a companion kernel client patch that causes clients to notice
the keepalives, and not reset a connection serving a timed-out
message if anything, particularly a keepalive, has been received
recently.

This patchset also has the operational impact of eliminating client log
messages about resetting OSDs under normal operation with a heavy write
load, which makes it easier to notice other issues in the client logs.


Jim Schutt (3):
  common/Throttle: Remove unused return type on Throttle::get()
  common/Throttle: Add timed_wait().
  msgr: Send keepalive periodically when waiting in policy throttler

 src/common/Throttle.h      |   45 +++++++++++++++++++++++++++++++++++++++++--
 src/common/config.cc       |    1 +
 src/common/config.h        |    1 +
 src/msg/SimpleMessenger.cc |    6 ++++-
 4 files changed, 49 insertions(+), 4 deletions(-)



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] ceph: distinguish between unreachable and busy osds when resetting a connection
  2011-06-22 18:26 [PATCH 0/3] RFC: Enable clients to distinguish busy and unreachable OSDs Jim Schutt
@ 2011-06-22 18:26 ` Jim Schutt
  2011-06-22 18:26 ` [PATCH 1/3] common/Throttle: Remove unused return type on Throttle::get() Jim Schutt
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Jim Schutt @ 2011-06-22 18:26 UTC (permalink / raw)
  To: ceph-devel; +Cc: Jim Schutt

Previously, when clients' sustained offered write load exceeded the
sustained throughput of the OSDs, normal operation was that client
messages timed out while waiting to be processed by the OSDs.  The
client response to this was to reset the connection to the OSD
handling a timed-out message.

Ceph OSDs can now send keepalives when waiting for sufficient buffer
space to receive a message from a client.  This patch causes clients
to notice the keepalives, and not reset a connection serving a
timed-out message if anything, particularly a keepalive, has been
received recently.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
---
 include/linux/ceph/messenger.h |    1 +
 net/ceph/messenger.c           |    9 +++++++++
 net/ceph/osd_client.c          |    9 +++++++++
 3 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 31d91a6..0b12f5e 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -141,6 +141,7 @@ struct ceph_connection {
 	struct ceph_messenger *msgr;
 	struct socket *sock;
 	unsigned long state;	/* connection state (see flags above) */
+	unsigned long last_rcv;
 	const char *error_msg;  /* error message, if any */
 
 	struct ceph_entity_addr peer_addr; /* peer address */
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 78b55f4..9eea67e 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -416,6 +416,7 @@ void ceph_con_init(struct ceph_messenger *msgr, struct ceph_connection *con)
 	memset(con, 0, sizeof(*con));
 	atomic_set(&con->nref, 1);
 	con->msgr = msgr;
+	con->last_rcv = jiffies;
 	mutex_init(&con->mutex);
 	INIT_LIST_HEAD(&con->out_queue);
 	INIT_LIST_HEAD(&con->out_sent);
@@ -1855,6 +1856,7 @@ more:
 		ret = process_connect(con);
 		if (ret < 0)
 			goto out;
+		con->last_rcv = jiffies;
 		goto more;
 	}
 
@@ -1870,6 +1872,7 @@ more:
 		ret = ceph_tcp_recvmsg(con->sock, buf, skip);
 		if (ret <= 0)
 			goto out;
+		con->last_rcv = jiffies;
 		con->in_base_pos += ret;
 		if (con->in_base_pos)
 			goto more;
@@ -1881,6 +1884,7 @@ more:
 		ret = ceph_tcp_recvmsg(con->sock, &con->in_tag, 1);
 		if (ret <= 0)
 			goto out;
+		con->last_rcv = jiffies;
 		dout("try_read got tag %d\n", (int)con->in_tag);
 		switch (con->in_tag) {
 		case CEPH_MSGR_TAG_MSG:
@@ -1889,6 +1893,9 @@ more:
 		case CEPH_MSGR_TAG_ACK:
 			prepare_read_ack(con);
 			break;
+		case CEPH_MSGR_TAG_KEEPALIVE:
+			prepare_read_tag(con);
+			goto out;
 		case CEPH_MSGR_TAG_CLOSE:
 			set_bit(CLOSED, &con->state);   /* fixme */
 			goto out;
@@ -1910,6 +1917,7 @@ more:
 			}
 			goto out;
 		}
+		con->last_rcv = jiffies;
 		if (con->in_tag == CEPH_MSGR_TAG_READY)
 			goto more;
 		process_message(con);
@@ -1919,6 +1927,7 @@ more:
 		ret = read_partial_ack(con);
 		if (ret <= 0)
 			goto out;
+		con->last_rcv = jiffies;
 		process_ack(con);
 		goto more;
 	}
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 7330c27..30fa648 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1094,6 +1094,15 @@ static void handle_timeout(struct work_struct *work)
 
 		osd = req->r_osd;
 		BUG_ON(!osd);
+
+		/*
+		 * Only reset osd if we haven't recently received something
+		 * from it - if we have, it's just busy, and hasn't gotten
+		 * to this request yet.
+		 */
+		if (time_before(jiffies, osd->o_con.last_rcv + timeout))
+			break;
+
 		pr_warning(" tid %llu timed out on osd%d, will reset osd\n",
 			   req->r_tid, osd->o_osd);
 		__kick_osd_requests(osdc, osd);
-- 
1.7.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 1/3] common/Throttle: Remove unused return type on Throttle::get()
  2011-06-22 18:26 [PATCH 0/3] RFC: Enable clients to distinguish busy and unreachable OSDs Jim Schutt
  2011-06-22 18:26 ` [PATCH] ceph: distinguish between unreachable and busy osds when resetting a connection Jim Schutt
@ 2011-06-22 18:26 ` Jim Schutt
  2011-06-22 18:26 ` [PATCH 2/3] common/Throttle: Add timed_wait() Jim Schutt
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Jim Schutt @ 2011-06-22 18:26 UTC (permalink / raw)
  To: ceph-devel; +Cc: Jim Schutt


Signed-off-by: Jim Schutt <jaschut@sandia.gov>
---
 src/common/Throttle.h |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/src/common/Throttle.h b/src/common/Throttle.h
index cd772dd..8fd2625 100644
--- a/src/common/Throttle.h
+++ b/src/common/Throttle.h
@@ -68,16 +68,15 @@ public:
     return count;
   }
 
-  bool get(int64_t c = 1, int64_t m = 0) {
+  void get(int64_t c = 1, int64_t m = 0) {
     assert(c >= 0);
     Mutex::Locker l(lock);
     if (m) {
       assert(m > 0);
       _reset_max(m);
     }
-    bool waited = _wait(c);
+    _wait(c);
     count += c;
-    return waited;
   }
 
   /* Returns true if it successfully got the requested amount,
-- 
1.7.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/3] common/Throttle: Add timed_wait().
  2011-06-22 18:26 [PATCH 0/3] RFC: Enable clients to distinguish busy and unreachable OSDs Jim Schutt
  2011-06-22 18:26 ` [PATCH] ceph: distinguish between unreachable and busy osds when resetting a connection Jim Schutt
  2011-06-22 18:26 ` [PATCH 1/3] common/Throttle: Remove unused return type on Throttle::get() Jim Schutt
@ 2011-06-22 18:26 ` Jim Schutt
  2011-06-23 11:02   ` Colin McCabe
  2011-06-22 18:26 ` [PATCH 3/3] msgr: Send keepalive periodically when waiting in policy throttler Jim Schutt
  2011-06-22 20:31 ` [PATCH 0/3] RFC: Enable clients to distinguish busy and unreachable OSDs Sage Weil
  4 siblings, 1 reply; 12+ messages in thread
From: Jim Schutt @ 2011-06-22 18:26 UTC (permalink / raw)
  To: ceph-devel; +Cc: Jim Schutt

This will enable SimpleMessenger to send keepalives to clients while
blocked in the policy throttler.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
---
 src/common/Throttle.h |   40 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/src/common/Throttle.h b/src/common/Throttle.h
index 8fd2625..5a028cc 100644
--- a/src/common/Throttle.h
+++ b/src/common/Throttle.h
@@ -27,6 +27,9 @@ private:
       ((c < max && count + c > max) ||   // normally stay under max
        (c >= max && count > max));       // except for large c
   }
+
+  /* Returns true if it had to block/wait, false otherwise.
+   */
   bool _wait(int64_t c) {
     bool waited = false;
     if (_should_wait(c)) {
@@ -44,6 +47,28 @@ private:
     return waited;
   }
 
+  /* Returns true if it timed out while blocked/waiting,
+   * false otherwise.
+   */
+  bool _timed_wait(utime_t interval, int64_t c) {
+    if (_should_wait(c)) {
+      utime_t timeout_at = g_clock.now() + interval;
+      waiting += c;
+      do {
+	if (cond.WaitUntil(lock, timeout_at)) {
+	  waiting -= c;
+	  return true;
+	}
+      } while (_should_wait(c));
+      waiting -= c;
+
+      // wake up the next guy
+      if (waiting)
+	cond.SignalOne();
+    }
+    return false;
+  }
+
 public:
   int64_t get_current() {
     Mutex::Locker l(lock);
@@ -79,6 +104,21 @@ public:
     count += c;
   }
 
+  /* Returns true if it got the requested amount within
+   * the specified interval, false otherwise.
+   */
+  bool timed_get(utime_t interval, int64_t c = 1, int64_t m = 0) {
+    assert(c >= 0);
+    Mutex::Locker l(lock);
+    if (m) {
+      assert(m > 0);
+      _reset_max(m);
+    }
+    if (_timed_wait(interval, c)) return false;
+    count += c;
+    return true;
+  }
+
   /* Returns true if it successfully got the requested amount,
    * or false if it would block.
    */
-- 
1.7.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 3/3] msgr: Send keepalive periodically when waiting in policy throttler
  2011-06-22 18:26 [PATCH 0/3] RFC: Enable clients to distinguish busy and unreachable OSDs Jim Schutt
                   ` (2 preceding siblings ...)
  2011-06-22 18:26 ` [PATCH 2/3] common/Throttle: Add timed_wait() Jim Schutt
@ 2011-06-22 18:26 ` Jim Schutt
  2011-06-22 20:31 ` [PATCH 0/3] RFC: Enable clients to distinguish busy and unreachable OSDs Sage Weil
  4 siblings, 0 replies; 12+ messages in thread
From: Jim Schutt @ 2011-06-22 18:26 UTC (permalink / raw)
  To: ceph-devel; +Cc: Jim Schutt

Cause read_message() to periodically send a keepalive while waiting
in the policy throttler.  Clients can then notice that a connection
is still active, and avoid resetting it when a message times out.

Without this patch, when clients are offering a sustained write
load that is higher than the sustained bandwidth available from the
OSDs, messages time out continuously due to the OSDs being busy.

That has at least two types of impact:

- the reset frequently happens while data is being sent, so data
  that was successfully received must be discarded and resent.

- after several such connection resets, many sockets can remain open,
  waiting for readers to be granted space by the policy throttler,
  so that they can notice that the pipe has been shut down, and the
  socket can be closed.

This patch, combined with the companion kernel client patch, also has
the operational impact of eliminating client log messages about
resetting OSDs under normal operation with a heavy write load, which
makes it easier to notice other issues in the client logs.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
---
 src/common/config.cc       |    1 +
 src/common/config.h        |    1 +
 src/msg/SimpleMessenger.cc |    6 +++++-
 3 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/src/common/config.cc b/src/common/config.cc
index ca359a3..86b1917 100644
--- a/src/common/config.cc
+++ b/src/common/config.cc
@@ -183,6 +183,7 @@ struct config_option config_optionsp[] = {
   OPTION(ms_tcp_nodelay, OPT_BOOL, true),
   OPTION(ms_initial_backoff, OPT_DOUBLE, .2),
   OPTION(ms_max_backoff, OPT_DOUBLE, 15.0),
+  OPTION(ms_reader_keepalive, OPT_INT, 30),  // seconds, readers blocked in throttler send keepalive this often
   OPTION(ms_nocrc, OPT_BOOL, false),
   OPTION(ms_die_on_bad_msg, OPT_BOOL, false),
   OPTION(ms_dispatch_throttle_bytes, OPT_U64, 100 << 20),
diff --git a/src/common/config.h b/src/common/config.h
index 1a389d9..5ea0f69 100644
--- a/src/common/config.h
+++ b/src/common/config.h
@@ -211,6 +211,7 @@ public:
   bool ms_tcp_nodelay;
   double ms_initial_backoff;
   double ms_max_backoff;
+  int ms_reader_keepalive;
   bool ms_nocrc;
   bool ms_die_on_bad_msg;
   uint64_t ms_dispatch_throttle_bytes;
diff --git a/src/msg/SimpleMessenger.cc b/src/msg/SimpleMessenger.cc
index 321c55a..2a533be 100644
--- a/src/msg/SimpleMessenger.cc
+++ b/src/msg/SimpleMessenger.cc
@@ -1876,10 +1876,14 @@ int SimpleMessenger::Pipe::read_message(Message **pm)
   uint64_t message_size = header.front_len + header.middle_len + header.data_len;
   if (message_size) {
     if (policy.throttler) {
+      utime_t kato = utime_t(g_conf.ms_reader_keepalive, 0);
       dout(10) << "reader wants " << message_size << " from policy throttler "
 	       << policy.throttler->get_current() << "/"
 	       << policy.throttler->get_max() << dendl;
-      policy.throttler->get(message_size);
+      while (!policy.throttler->timed_get(kato, message_size)) {
+	dout(20) << "sending keepalive while waiting for policy throttler" << dendl;
+	send_keepalive();
+      }
     }
 
     // throttle total bytes waiting for dispatch.  do this _after_ the
-- 
1.7.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/3] RFC: Enable clients to distinguish busy and unreachable OSDs
  2011-06-22 18:26 [PATCH 0/3] RFC: Enable clients to distinguish busy and unreachable OSDs Jim Schutt
                   ` (3 preceding siblings ...)
  2011-06-22 18:26 ` [PATCH 3/3] msgr: Send keepalive periodically when waiting in policy throttler Jim Schutt
@ 2011-06-22 20:31 ` Sage Weil
  2011-06-22 21:22   ` Sage Weil
  4 siblings, 1 reply; 12+ messages in thread
From: Sage Weil @ 2011-06-22 20:31 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

Hi Jim,

On Wed, 22 Jun 2011, Jim Schutt wrote:
> Previously, when clients' sustained offered write load exceeded the
> sustained throughput of the OSDs, normal operation was that client
> messages timed out while waiting to be processed by the OSDs.  The
> client response to this was to reset the connection to the OSD
> handling a timed-out message.
> 
> That has at least two types of impact:
>     
> - the reset frequently happens while data is being sent, so data
>   that was successfully received must be discarded and resent.
> 
> - after several such connection resets, many sockets can remain open,
>   waiting for readers to be granted space by the policy throttler,
>   so that they can notice that the pipe has been shut down, and the
>   socket can be closed.
> 
> This patchset causes Ceph OSDs to send keepalives when waiting for 
> sufficient buffer space to receive a message from a client. There is
> also a companion kernel client patch that causes clients to notice
> the keepalives, and not reset a connection serving a timed-out
> message if anything, particularly a keepalive, has been received
> recently.
> 
> This patchset also has the operational impact of eliminating client log
> messages about resetting OSDs under normal operation with a heavy write
> load, which makes it easier to notice other issues in the client logs.

This looks pretty good, with one exception: the keepalive needs to be 
somehow be specific to the message getting throttled in order for it to 
make sense.  Otherwise, we might

 - client sends request A, B
 - osd msgr receives A, starts processing, it hits a bug and hangs
 - osd msgr throttles on B, sends keepalives
 - client never times out A

Basically, the whole purpose of the timeout on the client side is to make 
noise and retry if the OSD is buggy or broken.  If we have a coarse 
never-timeout-anything-on-this-connection flag we may as well just turn 
the timeouts off (mount -o osdtimeout=0 I think).

The key for this to be useful is to only make the requests being throttled 
(and those that follow) avoid timing out.  I think we can accomplish that 
by looking at which messages were ACKed (as well as the new last_rcv)... 
going to start a branch and take a closer look!

sage


> 
> 
> Jim Schutt (3):
>   common/Throttle: Remove unused return type on Throttle::get()
>   common/Throttle: Add timed_wait().
>   msgr: Send keepalive periodically when waiting in policy throttler
> 
>  src/common/Throttle.h      |   45 +++++++++++++++++++++++++++++++++++++++++--
>  src/common/config.cc       |    1 +
>  src/common/config.h        |    1 +
>  src/msg/SimpleMessenger.cc |    6 ++++-
>  4 files changed, 49 insertions(+), 4 deletions(-)
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/3] RFC: Enable clients to distinguish busy and unreachable OSDs
  2011-06-22 20:31 ` [PATCH 0/3] RFC: Enable clients to distinguish busy and unreachable OSDs Sage Weil
@ 2011-06-22 21:22   ` Sage Weil
  2011-06-23 14:22     ` Jim Schutt
  0 siblings, 1 reply; 12+ messages in thread
From: Sage Weil @ 2011-06-22 21:22 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

Hey Jim-

I wonder if the below is sufficient, actually.  This avoids any change on 
the server side, and just changes the client to start the per-message 
timeout "clock" when the message is actually received by the server... 

This way we still time out if the request gets stuck in cosd's request 
queues somewhere, or if the disk blocks up, or something.  Any requests 
that didn't get received don't time out, though.

What do you think?

sage



From e129e4f3f500f4e77cd1a7c64ff64edc54a9a9ea Mon Sep 17 00:00:00 2001
From: Sage Weil <sage@newdream.net>
Date: Wed, 22 Jun 2011 13:43:06 -0700
Subject: [PATCH] libceph: don't time out osd requests that haven't been received

Keep track of when an outgoing message is ACKed (i.e., the server fully
received it and, presumably, queued it for processing).  Time out OSD
requests only if it's been too long since they've been received.

This prevents timeouts and connection thrashing when the OSDs are simply
busy and are throttling the requests they read off the network.

Signed-off-by: Sage Weil <sage@newdream.net>
---
 include/linux/ceph/messenger.h |    1 +
 net/ceph/messenger.c           |   12 +++++-------
 net/ceph/osd_client.c          |    6 ++++++
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 31d91a6..d7adf15 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -94,6 +94,7 @@ struct ceph_msg {
 	bool more_to_follow;
 	bool needs_out_seq;
 	int front_max;
+	unsigned long ack_stamp;        /* tx: when we were acked */
 
 	struct ceph_msgpool *pool;
 };
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 78b55f4..c340e2e 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -486,13 +486,10 @@ static void prepare_write_message(struct ceph_connection *con)
 	m = list_first_entry(&con->out_queue,
 		       struct ceph_msg, list_head);
 	con->out_msg = m;
-	if (test_bit(LOSSYTX, &con->state)) {
-		list_del_init(&m->list_head);
-	} else {
-		/* put message on sent list */
-		ceph_msg_get(m);
-		list_move_tail(&m->list_head, &con->out_sent);
-	}
+
+	/* put message on sent list */
+	ceph_msg_get(m);
+	list_move_tail(&m->list_head, &con->out_sent);
 
 	/*
 	 * only assign outgoing seq # if we haven't sent this message
@@ -1399,6 +1396,7 @@ static void process_ack(struct ceph_connection *con)
 			break;
 		dout("got ack for seq %llu type %d at %p\n", seq,
 		     le16_to_cpu(m->hdr.type), m);
+		m->ack_stamp = jiffies;
 		ceph_msg_remove(m);
 	}
 	prepare_read_tag(con);
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 7330c27..ce310ee 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1085,9 +1085,15 @@ static void handle_timeout(struct work_struct *work)
 		req = list_entry(osdc->req_lru.next, struct ceph_osd_request,
 				 r_req_lru_item);
 
+		/* hasn't been long enough since we sent it? */
 		if (time_before(jiffies, req->r_stamp + timeout))
 			break;
 
+		/* hasn't been long enough since it was acked? */
+		if (req->r_request->ack_stamp == 0 ||
+		    time_before(jiffies, req->r_request->ack_stamp + timeout))
+			break;
+
 		BUG_ON(req == last_req && req->r_stamp == last_stamp);
 		last_req = req;
 		last_stamp = req->r_stamp;
-- 
1.7.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/3] common/Throttle: Add timed_wait().
  2011-06-22 18:26 ` [PATCH 2/3] common/Throttle: Add timed_wait() Jim Schutt
@ 2011-06-23 11:02   ` Colin McCabe
  2011-06-23 15:53     ` Jim Schutt
  0 siblings, 1 reply; 12+ messages in thread
From: Colin McCabe @ 2011-06-23 11:02 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

On Wed, Jun 22, 2011 at 11:26 AM, Jim Schutt <jaschut@sandia.gov> wrote:
> This will enable SimpleMessenger to send keepalives to clients while
> blocked in the policy throttler.
>
> Signed-off-by: Jim Schutt <jaschut@sandia.gov>
> ---
>  src/common/Throttle.h |   40 ++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 40 insertions(+), 0 deletions(-)

This looks good. Only a small nitpick-- it's better not to pass
utime_t by value, since it's a struct. There's a bunch of other code
doing it, but it's better to pass a const reference. Also, I guess we
need to rebase this because g_clock has been replaced by
ceph_clock_now() in master. Should be easy though.

cheers,
Colin


>
> diff --git a/src/common/Throttle.h b/src/common/Throttle.h
> index 8fd2625..5a028cc 100644
> --- a/src/common/Throttle.h
> +++ b/src/common/Throttle.h
> @@ -27,6 +27,9 @@ private:
>       ((c < max && count + c > max) ||   // normally stay under max
>        (c >= max && count > max));       // except for large c
>   }
> +
> +  /* Returns true if it had to block/wait, false otherwise.
> +   */
>   bool _wait(int64_t c) {
>     bool waited = false;
>     if (_should_wait(c)) {
> @@ -44,6 +47,28 @@ private:
>     return waited;
>   }
>
> +  /* Returns true if it timed out while blocked/waiting,
> +   * false otherwise.
> +   */
> +  bool _timed_wait(utime_t interval, int64_t c) {
> +    if (_should_wait(c)) {
> +      utime_t timeout_at = g_clock.now() + interval;
> +      waiting += c;
> +      do {
> +       if (cond.WaitUntil(lock, timeout_at)) {
> +         waiting -= c;
> +         return true;
> +       }
> +      } while (_should_wait(c));
> +      waiting -= c;
> +
> +      // wake up the next guy
> +      if (waiting)
> +       cond.SignalOne();
> +    }
> +    return false;
> +  }
> +
>  public:
>   int64_t get_current() {
>     Mutex::Locker l(lock);
> @@ -79,6 +104,21 @@ public:
>     count += c;
>   }
>
> +  /* Returns true if it got the requested amount within
> +   * the specified interval, false otherwise.
> +   */
> +  bool timed_get(utime_t interval, int64_t c = 1, int64_t m = 0) {
> +    assert(c >= 0);
> +    Mutex::Locker l(lock);
> +    if (m) {
> +      assert(m > 0);
> +      _reset_max(m);
> +    }
> +    if (_timed_wait(interval, c)) return false;
> +    count += c;
> +    return true;
> +  }
> +
>   /* Returns true if it successfully got the requested amount,
>    * or false if it would block.
>    */
> --
> 1.7.1
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/3] RFC: Enable clients to distinguish busy and unreachable OSDs
  2011-06-22 21:22   ` Sage Weil
@ 2011-06-23 14:22     ` Jim Schutt
  2011-06-23 19:05       ` Jim Schutt
  0 siblings, 1 reply; 12+ messages in thread
From: Jim Schutt @ 2011-06-23 14:22 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi Sage,

Sage Weil wrote:
> Hey Jim-
> 
> I wonder if the below is sufficient, actually.  This avoids any change on 
> the server side, and just changes the client to start the per-message 
> timeout "clock" when the message is actually received by the server... 
> 
> This way we still time out if the request gets stuck in cosd's request 
> queues somewhere, or if the disk blocks up, or something.  Any requests 
> that didn't get received don't time out, though.
> 
> What do you think?

I like it.  I'm pretty sure you've addressed the
case I'm after.  Let me give it a try.

Thanks!

-- Jim


> 
> sage
> 
> 
> 
>>From e129e4f3f500f4e77cd1a7c64ff64edc54a9a9ea Mon Sep 17 00:00:00 2001
> From: Sage Weil <sage@newdream.net>
> Date: Wed, 22 Jun 2011 13:43:06 -0700
> Subject: [PATCH] libceph: don't time out osd requests that haven't been received
> 
> Keep track of when an outgoing message is ACKed (i.e., the server fully
> received it and, presumably, queued it for processing).  Time out OSD
> requests only if it's been too long since they've been received.
> 
> This prevents timeouts and connection thrashing when the OSDs are simply
> busy and are throttling the requests they read off the network.
> 
> Signed-off-by: Sage Weil <sage@newdream.net>
> ---
>  include/linux/ceph/messenger.h |    1 +
>  net/ceph/messenger.c           |   12 +++++-------
>  net/ceph/osd_client.c          |    6 ++++++
>  3 files changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
> index 31d91a6..d7adf15 100644
> --- a/include/linux/ceph/messenger.h
> +++ b/include/linux/ceph/messenger.h
> @@ -94,6 +94,7 @@ struct ceph_msg {
>  	bool more_to_follow;
>  	bool needs_out_seq;
>  	int front_max;
> +	unsigned long ack_stamp;        /* tx: when we were acked */
>  
>  	struct ceph_msgpool *pool;
>  };
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index 78b55f4..c340e2e 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -486,13 +486,10 @@ static void prepare_write_message(struct ceph_connection *con)
>  	m = list_first_entry(&con->out_queue,
>  		       struct ceph_msg, list_head);
>  	con->out_msg = m;
> -	if (test_bit(LOSSYTX, &con->state)) {
> -		list_del_init(&m->list_head);
> -	} else {
> -		/* put message on sent list */
> -		ceph_msg_get(m);
> -		list_move_tail(&m->list_head, &con->out_sent);
> -	}
> +
> +	/* put message on sent list */
> +	ceph_msg_get(m);
> +	list_move_tail(&m->list_head, &con->out_sent);
>  
>  	/*
>  	 * only assign outgoing seq # if we haven't sent this message
> @@ -1399,6 +1396,7 @@ static void process_ack(struct ceph_connection *con)
>  			break;
>  		dout("got ack for seq %llu type %d at %p\n", seq,
>  		     le16_to_cpu(m->hdr.type), m);
> +		m->ack_stamp = jiffies;
>  		ceph_msg_remove(m);
>  	}
>  	prepare_read_tag(con);
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 7330c27..ce310ee 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -1085,9 +1085,15 @@ static void handle_timeout(struct work_struct *work)
>  		req = list_entry(osdc->req_lru.next, struct ceph_osd_request,
>  				 r_req_lru_item);
>  
> +		/* hasn't been long enough since we sent it? */
>  		if (time_before(jiffies, req->r_stamp + timeout))
>  			break;
>  
> +		/* hasn't been long enough since it was acked? */
> +		if (req->r_request->ack_stamp == 0 ||
> +		    time_before(jiffies, req->r_request->ack_stamp + timeout))
> +			break;
> +
>  		BUG_ON(req == last_req && req->r_stamp == last_stamp);
>  		last_req = req;
>  		last_stamp = req->r_stamp;



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/3] common/Throttle: Add timed_wait().
  2011-06-23 11:02   ` Colin McCabe
@ 2011-06-23 15:53     ` Jim Schutt
  2011-06-23 17:14       ` Colin Patrick McCabe
  0 siblings, 1 reply; 12+ messages in thread
From: Jim Schutt @ 2011-06-23 15:53 UTC (permalink / raw)
  To: Colin McCabe; +Cc: ceph-devel

Hi Colin,

Colin McCabe wrote:
> On Wed, Jun 22, 2011 at 11:26 AM, Jim Schutt <jaschut@sandia.gov> wrote:
>> This will enable SimpleMessenger to send keepalives to clients while
>> blocked in the policy throttler.
>>
>> Signed-off-by: Jim Schutt <jaschut@sandia.gov>
>> ---
>>  src/common/Throttle.h |   40 ++++++++++++++++++++++++++++++++++++++++
>>  1 files changed, 40 insertions(+), 0 deletions(-)
> 
> This looks good. Only a small nitpick-- it's better not to pass
> utime_t by value, since it's a struct. There's a bunch of other code
> doing it, but it's better to pass a const reference. Also, I guess we
> need to rebase this because g_clock has been replaced by
> ceph_clock_now() in master. Should be easy though.

Given Sage's much cleaner approach to fixing the osd reset
issue, is this change useful on its own?  If so, I'm happy
to fix it up and resubmit.

Let me know.

-- Jim

> 
> cheers,
> Colin
> 
> 
>> diff --git a/src/common/Throttle.h b/src/common/Throttle.h
>> index 8fd2625..5a028cc 100644
>> --- a/src/common/Throttle.h
>> +++ b/src/common/Throttle.h
>> @@ -27,6 +27,9 @@ private:
>>       ((c < max && count + c > max) ||   // normally stay under max
>>        (c >= max && count > max));       // except for large c
>>   }
>> +
>> +  /* Returns true if it had to block/wait, false otherwise.
>> +   */
>>   bool _wait(int64_t c) {
>>     bool waited = false;
>>     if (_should_wait(c)) {
>> @@ -44,6 +47,28 @@ private:
>>     return waited;
>>   }
>>
>> +  /* Returns true if it timed out while blocked/waiting,
>> +   * false otherwise.
>> +   */
>> +  bool _timed_wait(utime_t interval, int64_t c) {
>> +    if (_should_wait(c)) {
>> +      utime_t timeout_at = g_clock.now() + interval;
>> +      waiting += c;
>> +      do {
>> +       if (cond.WaitUntil(lock, timeout_at)) {
>> +         waiting -= c;
>> +         return true;
>> +       }
>> +      } while (_should_wait(c));
>> +      waiting -= c;
>> +
>> +      // wake up the next guy
>> +      if (waiting)
>> +       cond.SignalOne();
>> +    }
>> +    return false;
>> +  }
>> +
>>  public:
>>   int64_t get_current() {
>>     Mutex::Locker l(lock);
>> @@ -79,6 +104,21 @@ public:
>>     count += c;
>>   }
>>
>> +  /* Returns true if it got the requested amount within
>> +   * the specified interval, false otherwise.
>> +   */
>> +  bool timed_get(utime_t interval, int64_t c = 1, int64_t m = 0) {
>> +    assert(c >= 0);
>> +    Mutex::Locker l(lock);
>> +    if (m) {
>> +      assert(m > 0);
>> +      _reset_max(m);
>> +    }
>> +    if (_timed_wait(interval, c)) return false;
>> +    count += c;
>> +    return true;
>> +  }
>> +
>>   /* Returns true if it successfully got the requested amount,
>>    * or false if it would block.
>>    */
>> --
>> 1.7.1
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 
> 



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/3] common/Throttle: Add timed_wait().
  2011-06-23 15:53     ` Jim Schutt
@ 2011-06-23 17:14       ` Colin Patrick McCabe
  0 siblings, 0 replies; 12+ messages in thread
From: Colin Patrick McCabe @ 2011-06-23 17:14 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

On Thu, Jun 23, 2011 at 8:53 AM, Jim Schutt <jaschut@sandia.gov> wrote:
> Hi Colin,
>
> Colin McCabe wrote:
>>
>> On Wed, Jun 22, 2011 at 11:26 AM, Jim Schutt <jaschut@sandia.gov> wrote:
>>>
>>> This will enable SimpleMessenger to send keepalives to clients while
>>> blocked in the policy throttler.
>>>
>>> Signed-off-by: Jim Schutt <jaschut@sandia.gov>
>>> ---
>>>  src/common/Throttle.h |   40 ++++++++++++++++++++++++++++++++++++++++
>>>  1 files changed, 40 insertions(+), 0 deletions(-)
>>
>> This looks good. Only a small nitpick-- it's better not to pass
>> utime_t by value, since it's a struct. There's a bunch of other code
>> doing it, but it's better to pass a const reference. Also, I guess we
>> need to rebase this because g_clock has been replaced by
>> ceph_clock_now() in master. Should be easy though.
>
> Given Sage's much cleaner approach to fixing the osd reset
> issue, is this change useful on its own?  If so, I'm happy
> to fix it up and resubmit.

I have a feeling that adding timeouts to the Throttler would be useful
in general. I guess we'll probably wait until we have a user for the
function first, though.
We should definitely pull the first (cleanup) patch for Throttle.h, though.

sincerely,
Colin


>
> Let me know.
>
> -- Jim
>
>>
>> cheers,
>> Colin
>>
>>
>>> diff --git a/src/common/Throttle.h b/src/common/Throttle.h
>>> index 8fd2625..5a028cc 100644
>>> --- a/src/common/Throttle.h
>>> +++ b/src/common/Throttle.h
>>> @@ -27,6 +27,9 @@ private:
>>>      ((c < max && count + c > max) ||   // normally stay under max
>>>       (c >= max && count > max));       // except for large c
>>>  }
>>> +
>>> +  /* Returns true if it had to block/wait, false otherwise.
>>> +   */
>>>  bool _wait(int64_t c) {
>>>    bool waited = false;
>>>    if (_should_wait(c)) {
>>> @@ -44,6 +47,28 @@ private:
>>>    return waited;
>>>  }
>>>
>>> +  /* Returns true if it timed out while blocked/waiting,
>>> +   * false otherwise.
>>> +   */
>>> +  bool _timed_wait(utime_t interval, int64_t c) {
>>> +    if (_should_wait(c)) {
>>> +      utime_t timeout_at = g_clock.now() + interval;
>>> +      waiting += c;
>>> +      do {
>>> +       if (cond.WaitUntil(lock, timeout_at)) {
>>> +         waiting -= c;
>>> +         return true;
>>> +       }
>>> +      } while (_should_wait(c));
>>> +      waiting -= c;
>>> +
>>> +      // wake up the next guy
>>> +      if (waiting)
>>> +       cond.SignalOne();
>>> +    }
>>> +    return false;
>>> +  }
>>> +
>>>  public:
>>>  int64_t get_current() {
>>>    Mutex::Locker l(lock);
>>> @@ -79,6 +104,21 @@ public:
>>>    count += c;
>>>  }
>>>
>>> +  /* Returns true if it got the requested amount within
>>> +   * the specified interval, false otherwise.
>>> +   */
>>> +  bool timed_get(utime_t interval, int64_t c = 1, int64_t m = 0) {
>>> +    assert(c >= 0);
>>> +    Mutex::Locker l(lock);
>>> +    if (m) {
>>> +      assert(m > 0);
>>> +      _reset_max(m);
>>> +    }
>>> +    if (_timed_wait(interval, c)) return false;
>>> +    count += c;
>>> +    return true;
>>> +  }
>>> +
>>>  /* Returns true if it successfully got the requested amount,
>>>   * or false if it would block.
>>>   */
>>> --
>>> 1.7.1
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>>
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/3] RFC: Enable clients to distinguish busy and unreachable OSDs
  2011-06-23 14:22     ` Jim Schutt
@ 2011-06-23 19:05       ` Jim Schutt
  0 siblings, 0 replies; 12+ messages in thread
From: Jim Schutt @ 2011-06-23 19:05 UTC (permalink / raw)
  To: Jim Schutt; +Cc: Sage Weil, ceph-devel

Jim Schutt wrote:
> Hi Sage,
> 
> Sage Weil wrote:
>> Hey Jim-
>>
>> I wonder if the below is sufficient, actually.  This avoids any change 
>> on the server side, and just changes the client to start the 
>> per-message timeout "clock" when the message is actually received by 
>> the server...
>> This way we still time out if the request gets stuck in cosd's request 
>> queues somewhere, or if the disk blocks up, or something.  Any 
>> requests that didn't get received don't time out, though.
>>
>> What do you think?
> 
> I like it.  I'm pretty sure you've addressed the
> case I'm after.  Let me give it a try.

Please consider queuing this patch up.

It has greatly increased the S/N ratio in those resets -
I'm still getting a few, and the ones I've looked at
so far don't make sense to me yet.  Probably as I learn
more I'll understand why they should be happening, but
maybe they're the result of bugs; in any event I would
never have noticed them without this patch.

Thanks -- Jim

> 
> Thanks!
> 
> -- Jim
> 
> 
>>
>> sage
>>
>>
>>
>>> From e129e4f3f500f4e77cd1a7c64ff64edc54a9a9ea Mon Sep 17 00:00:00 2001
>> From: Sage Weil <sage@newdream.net>
>> Date: Wed, 22 Jun 2011 13:43:06 -0700
>> Subject: [PATCH] libceph: don't time out osd requests that haven't 
>> been received
>>
>> Keep track of when an outgoing message is ACKed (i.e., the server fully
>> received it and, presumably, queued it for processing).  Time out OSD
>> requests only if it's been too long since they've been received.
>>
>> This prevents timeouts and connection thrashing when the OSDs are simply
>> busy and are throttling the requests they read off the network.
>>
>> Signed-off-by: Sage Weil <sage@newdream.net>
>> ---
>>  include/linux/ceph/messenger.h |    1 +
>>  net/ceph/messenger.c           |   12 +++++-------
>>  net/ceph/osd_client.c          |    6 ++++++
>>  3 files changed, 12 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/ceph/messenger.h 
>> b/include/linux/ceph/messenger.h
>> index 31d91a6..d7adf15 100644
>> --- a/include/linux/ceph/messenger.h
>> +++ b/include/linux/ceph/messenger.h
>> @@ -94,6 +94,7 @@ struct ceph_msg {
>>      bool more_to_follow;
>>      bool needs_out_seq;
>>      int front_max;
>> +    unsigned long ack_stamp;        /* tx: when we were acked */
>>  
>>      struct ceph_msgpool *pool;
>>  };
>> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
>> index 78b55f4..c340e2e 100644
>> --- a/net/ceph/messenger.c
>> +++ b/net/ceph/messenger.c
>> @@ -486,13 +486,10 @@ static void prepare_write_message(struct 
>> ceph_connection *con)
>>      m = list_first_entry(&con->out_queue,
>>                 struct ceph_msg, list_head);
>>      con->out_msg = m;
>> -    if (test_bit(LOSSYTX, &con->state)) {
>> -        list_del_init(&m->list_head);
>> -    } else {
>> -        /* put message on sent list */
>> -        ceph_msg_get(m);
>> -        list_move_tail(&m->list_head, &con->out_sent);
>> -    }
>> +
>> +    /* put message on sent list */
>> +    ceph_msg_get(m);
>> +    list_move_tail(&m->list_head, &con->out_sent);
>>  
>>      /*
>>       * only assign outgoing seq # if we haven't sent this message
>> @@ -1399,6 +1396,7 @@ static void process_ack(struct ceph_connection 
>> *con)
>>              break;
>>          dout("got ack for seq %llu type %d at %p\n", seq,
>>               le16_to_cpu(m->hdr.type), m);
>> +        m->ack_stamp = jiffies;
>>          ceph_msg_remove(m);
>>      }
>>      prepare_read_tag(con);
>> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
>> index 7330c27..ce310ee 100644
>> --- a/net/ceph/osd_client.c
>> +++ b/net/ceph/osd_client.c
>> @@ -1085,9 +1085,15 @@ static void handle_timeout(struct work_struct 
>> *work)
>>          req = list_entry(osdc->req_lru.next, struct ceph_osd_request,
>>                   r_req_lru_item);
>>  
>> +        /* hasn't been long enough since we sent it? */
>>          if (time_before(jiffies, req->r_stamp + timeout))
>>              break;
>>  
>> +        /* hasn't been long enough since it was acked? */
>> +        if (req->r_request->ack_stamp == 0 ||
>> +            time_before(jiffies, req->r_request->ack_stamp + timeout))
>> +            break;
>> +
>>          BUG_ON(req == last_req && req->r_stamp == last_stamp);
>>          last_req = req;
>>          last_stamp = req->r_stamp;
> 
> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 



^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2011-06-23 19:05 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-06-22 18:26 [PATCH 0/3] RFC: Enable clients to distinguish busy and unreachable OSDs Jim Schutt
2011-06-22 18:26 ` [PATCH] ceph: distinguish between unreachable and busy osds when resetting a connection Jim Schutt
2011-06-22 18:26 ` [PATCH 1/3] common/Throttle: Remove unused return type on Throttle::get() Jim Schutt
2011-06-22 18:26 ` [PATCH 2/3] common/Throttle: Add timed_wait() Jim Schutt
2011-06-23 11:02   ` Colin McCabe
2011-06-23 15:53     ` Jim Schutt
2011-06-23 17:14       ` Colin Patrick McCabe
2011-06-22 18:26 ` [PATCH 3/3] msgr: Send keepalive periodically when waiting in policy throttler Jim Schutt
2011-06-22 20:31 ` [PATCH 0/3] RFC: Enable clients to distinguish busy and unreachable OSDs Sage Weil
2011-06-22 21:22   ` Sage Weil
2011-06-23 14:22     ` Jim Schutt
2011-06-23 19:05       ` Jim Schutt

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.