[Drbd-dev] Huge latency issue with 8.2.6

Distributed Replicated Block Device (DRBD) development
 help / color / mirror / Atom feed

* [Drbd-dev] Huge latency issue with 8.2.6
@ 2008-08-12 16:31 Graham, Simon
  2008-08-16 15:22 ` Lars Ellenberg
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Graham, Simon @ 2008-08-12 16:31 UTC (permalink / raw)
  To: drbd-dev

[-- Attachment #1: Type: text/plain, Size: 3368 bytes --]

We've been benchmarking DRBD 8.2.6 and have found that some specific benchmarks (SQL) have absolutely terrible performance (a factor of 100 worse than the non DRBD case - 30 transactions per second versus 3000). These issues go away when we power off the secondary system, so it seems likely that it's somehow related to the network component. After some analysis of network traces, we found the following:

1. When we are doing 30TPS, we're also doing about 30 1K writes/s - the conclusion here is
   that one transaction needs 1 1K (2 block) write. This means we are seeing a write-to-write
   time of around 33ms. To hit the 3000TPS mark, we'd need to be handling 3000 1K writes/s
   which means a total write-to-write time of 333us

2. When we do a tcpdump on the node running the benchmark, we see the following DRBD protocol 
   consistently:
   . Node issues barrier + 1K write + unplug remote in a single packet
   . Receives barrier ack on meta-data connection 30-130us later
   . Receives Data ack on meta-data connection ~250us later (after original rq issued)
   . Receives TCP level ack on data connection 35-40ms later
   . The next write is not sent on the wire for 35-40ms

3. tcpdump on the other node shows the time between sending the barrierack and sending the
   data ack is around 120us -- this is basically the disk write time.

Conclusion 1 -- network latency has nothing to do with the horrendous perf we are seeing. What's more, we are adding (250 - write_time)us to the overall time to write the block - it seems that the disk write time is of the order of 120us, so we are adding around 130us to the total write time -- this should lead us to a max possible TPS value around 4000...

Conclusion 2 -- the problem here has to do with the time is takes the secondary to send the TCP ACK.

I should also note that we are running this system with GSO disabled -- this in turn means that the zero-copy writes done by DRBD actually turn into non-zc writes inside TCP (if you look at tcp_sendpage in tcp.c, you will see that zero-copy is disabled unless scatter gather is enabled and scatter gather is disabled if you don't have checksum offload enabled).

So, my theory is that the first write completes very quickly (around 300us), then the app issues the second write which is passed to TCP BUT TCP does not send it because it is waiting for buffers to be free which doesn't happen until the TCP level ACK is received on the data connection. In addition, the data connection is basically unidirectional -- all the other DRBD protocol messages flow on the meta-data connection.

So, what we are running into here is the TCP ACK-Delay timer on the secondary node - since we never send any data from the secondary to the primary on this connection, we have to wait for the timer to expire before acknowledging the data.

I have prototyped a change that uses the TCP_QUICKACK socket option to force acks to be sent when the UnplugRemote message is received and this has made a huge difference - in our original tests (which, in the interests of full disclosure, were run against DRBD8.0) we saw the performance go from 30TPS to 1600+TPS. 

Forcing the Ack to be sent when the UnplugRemote is received seems the right fix to me, but please comment... candidate git patch against the HEAD of 8.2 is attached.

Simon

[-- Attachment #2: 0001-Ensure-data-is-ACK-d-at-TCP-level-in-a-timely-fashio.patch --]
[-- Type: application/octet-stream, Size: 1413 bytes --]

From ff676b469fba17227df7cb6a88d3652f6c4fc061 Mon Sep 17 00:00:00 2001
From: Simon P. Graham <Simon.Graham@stratus.com>
Date: Mon, 11 Aug 2008 10:27:08 -0400
Subject: [PATCH] Ensure data is ACK'd at TCP level in a timely fashion.

---
 drbd/drbd_int.h      |    6 ++++++
 drbd/drbd_receiver.c |    3 +++
 2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/drbd/drbd_int.h b/drbd/drbd_int.h
index 576c72c..68c5427 100644
--- a/drbd/drbd_int.h
+++ b/drbd/drbd_int.h
@@ -1522,6 +1522,12 @@ static inline void drbd_tcp_flush(struct socket *sock)
 	tcp_setsockopt(sock->sk, SOL_TCP, TCP_CORK, (char __user *)&val, sizeof(val) );
 }

+static inline void drbd_tcp_quickack(struct socket *sock)
+{
+	int val = 1;
+	tcp_setsockopt(sock->sk, SOL_TCP, TCP_QUICKACK, (char*)&val, sizeof(val) );
+}
+
 /* drbd_proc.c */
 extern struct proc_dir_entry *drbd_proc;
 extern struct file_operations drbd_proc_fops;
diff --git a/drbd/drbd_receiver.c b/drbd/drbd_receiver.c
index 71e7601..c619c7a 100644
--- a/drbd/drbd_receiver.c
+++ b/drbd/drbd_receiver.c
@@ -2738,6 +2738,9 @@ STATIC int receive_UnplugRemote(struct drbd_conf *mdev, struct Drbd_Header *h)
 {
 	if (mdev->state.disk >= Inconsistent)
 		drbd_kick_lo(mdev);
+	/* Make sure we've acked all the TCP data associated
+	   with the data requests being unplugged */
+	drbd_tcp_quickack(mdev->data.socket);
 	return TRUE; /* cannot fail. */
 }

-- 
1.5.4.rc1

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [Drbd-dev] Huge latency issue with 8.2.6
  2008-08-12 16:31 [Drbd-dev] Huge latency issue with 8.2.6 Graham, Simon
@ 2008-08-16 15:22 ` Lars Ellenberg
  2008-08-16 16:44 ` Graham, Simon
       [not found] ` <DA0E7D869C862D4095C265233CD1D41EEE9B7D@EXNA.corp.strat us.com>
  2 siblings, 0 replies; 5+ messages in thread
From: Lars Ellenberg @ 2008-08-16 15:22 UTC (permalink / raw)
  To: drbd-dev

On Tue, Aug 12, 2008 at 12:31:42PM -0400, Graham, Simon wrote:
> We've been benchmarking DRBD 8.2.6 and have found that some specific benchmarks (SQL) have absolutely terrible performance (a factor of 100 worse than the non DRBD case - 30 transactions per second versus 3000). These issues go away when we power off the secondary system, so it seems likely that it's somehow related to the network component. After some analysis of network traces, we found the following:
> 
> 1. When we are doing 30TPS, we're also doing about 30 1K writes/s - the conclusion here is
>    that one transaction needs 1 1K (2 block) write. This means we are seeing a write-to-write
>    time of around 33ms. To hit the 3000TPS mark, we'd need to be handling 3000 1K writes/s
>    which means a total write-to-write time of 333us
> 
> 2. When we do a tcpdump on the node running the benchmark, we see the following DRBD protocol 
>    consistently:
>    . Node issues barrier + 1K write + unplug remote in a single packet
>    . Receives barrier ack on meta-data connection 30-130us later
>    . Receives Data ack on meta-data connection ~250us later (after original rq issued)
>    . Receives TCP level ack on data connection 35-40ms later
>    . The next write is not sent on the wire for 35-40ms
> 
> 3. tcpdump on the other node shows the time between sending the barrierack and sending the
>    data ack is around 120us -- this is basically the disk write time.
> 
> Conclusion 1 -- network latency has nothing to do with the horrendous
> perf we are seeing. What's more, we are adding (250 - write_time)us to
> the overall time to write the block - it seems that the disk write
> time is of the order of 120us, so we are adding around 130us to the
> total write time -- this should lead us to a max possible TPS value
> around 4000...
> 
> Conclusion 2 -- the problem here has to do with the time is takes the secondary to send the TCP ACK.

in git on the way to 8.2.7, we added the TCP_NODELAY socket option,

we also added the possibility to set "sndbuf-size" to 0,
to leverage tcp stack autotuning of tcp-buffer size.

both have been released with 8.0.13, and will be released with 8.2.7.

it should help here as well.

-- 
: Lars Ellenberg                
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting    http://www.linbit.com

DRBD® and LINBIT® are registered trademarks
of LINBIT Information Technologies GmbH
__
please don't Cc me, but send to list   --   I'm subscribed

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [Drbd-dev] Huge latency issue with 8.2.6
  2008-08-12 16:31 [Drbd-dev] Huge latency issue with 8.2.6 Graham, Simon
  2008-08-16 15:22 ` Lars Ellenberg
@ 2008-08-16 16:44 ` Graham, Simon
  2008-08-16 16:55   ` Lars Ellenberg
       [not found] ` <DA0E7D869C862D4095C265233CD1D41EEE9B7D@EXNA.corp.strat us.com>
  2 siblings, 1 reply; 5+ messages in thread
From: Graham, Simon @ 2008-08-16 16:44 UTC (permalink / raw)
  To: drbd-dev

> > Conclusion 2 -- the problem here has to do with the time is takes
the
> secondary to send the TCP ACK.
> 
> in git on the way to 8.2.7, we added the TCP_NODELAY socket option,
> 

I saw that and it's goodness BUT this controls the Nagle algorithm and
NOT the Ack Delay timer - you need (I believe) TCP_QUICKACK as well (or
a global change on the connection to lower the ack delay but I think
that would be bad - the nice thing about TCP_QUICKACK is that it's a one
time setting that causes pending acks to be sent right now but does not
change the normal ack delay).

Simon

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Drbd-dev] Huge latency issue with 8.2.6
  2008-08-16 16:44 ` Graham, Simon
@ 2008-08-16 16:55   ` Lars Ellenberg
  0 siblings, 0 replies; 5+ messages in thread
From: Lars Ellenberg @ 2008-08-16 16:55 UTC (permalink / raw)
  To: drbd-dev

On Sat, Aug 16, 2008 at 12:44:33PM -0400, Graham, Simon wrote:
> > > Conclusion 2 -- the problem here has to do with the time is takes
> > > the secondary to send the TCP ACK.
> > 
> > in git on the way to 8.2.7, we added the TCP_NODELAY socket option,
> > 
> 
> I saw that and it's goodness BUT this controls the Nagle algorithm and
> NOT the Ack Delay timer - you need (I believe) TCP_QUICKACK as well (or
> a global change on the connection to lower the ack delay but I think
> that would be bad - the nice thing about TCP_QUICKACK is that it's a one
> time setting that causes pending acks to be sent right now but does not
> change the normal ack delay).

I had the impression that the NODELAY would imply QUICKACK behaviour,
and together witht the CORK and UNCORK should do the trick.
but I'm happy to be enlightened.

will commit a slightly modified verion of your quickack soonish.

thanks,


-- 
: Lars Ellenberg                
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting    http://www.linbit.com

DRBD® and LINBIT® are registered trademarks
of LINBIT Information Technologies GmbH
__
please don't Cc me, but send to list   --   I'm subscribed

^ permalink raw reply	[flat|nested] 5+ messages in thread

[parent not found: <DA0E7D869C862D4095C265233CD1D41EEE9B7D@EXNA.corp.strat us.com>]

* RE: [Drbd-dev] Huge latency issue with 8.2.6
       [not found] ` <DA0E7D869C862D4095C265233CD1D41EEE9B7D@EXNA.corp.strat us.com>
@ 2008-08-16 19:35   ` Graham, Simon
  0 siblings, 0 replies; 5+ messages in thread
From: Graham, Simon @ 2008-08-16 19:35 UTC (permalink / raw)
  To: drbd-dev

> I had the impression that the NODELAY would imply QUICKACK behaviour,
> and together witht the CORK and UNCORK should do the trick.
> but I'm happy to be enlightened.
> 

I had the same thought but I'm reasonably sure NODELAY only affects
sending data. I think the underlying cause of this is that the data
connection is basically unidirectional -- all replies to data received
go on the meta data connection. If, for example, the data-ack were sent
on the data connection, then the TCP ACK would be piggybacked and
NODELAY would do everything you need...

> will commit a slightly modified verion of your quickack soonish.

Thanks
Simon

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2008-08-16 19:35 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-08-12 16:31 [Drbd-dev] Huge latency issue with 8.2.6 Graham, Simon
2008-08-16 15:22 ` Lars Ellenberg
2008-08-16 16:44 ` Graham, Simon
2008-08-16 16:55   ` Lars Ellenberg
     [not found] ` <DA0E7D869C862D4095C265233CD1D41EEE9B7D@EXNA.corp.strat us.com>
2008-08-16 19:35   ` Graham, Simon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox