[Drbd-dev] DRBD small synchronous writes performance improvements


* [Drbd-dev] DRBD small synchronous writes performance improvements
@ 2010-04-29 20:00 Guzovsky, Eduard
  2010-04-29 21:26 ` Lars Ellenberg
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Guzovsky, Eduard @ 2010-04-29 20:00 UTC (permalink / raw)
  To: philipp.reisner, lars.ellenberg; +Cc: drbd-dev

Hi guys,

We analyzed DRBD performance of small synchronous write operations on
systems with RAID controllers.  This i/o pattern happens frequently in
database transaction processing workloads. Large RAID caches ensure
that disk i/o overhead is small - about 100us per 32KB block - and
network overhead turns into a dominant factor.  In addition to network
stack processing time, network overhead has two large components

1. "Wire time" - the actual time it takes to transmit data. On 1Gig
network it takes about 270us to transmit 32KB block, on 10Gig - about
27us.

2. "NIC latency" - the time it takes NIC to start transmitting a packet
on the sending node plus the time on the receiving node between the
packet reception by the NIC and delivering the packet to the driver. NIC
latency is chip specific and depends on "packet/interrupt coalescing"
setting configurable via "ethtool -C ". The default setting results in
about a 100us to 125us "short packet round trip" latency.  We reduced it
to about 50us by effectively disabling packet coalescing.  Have you or
your customers experimented with this? We are obviously concerned with
adverse side effects of disabling packet coalescing on whole system
performance during high network loads.
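
As a rough cross-check of the wire-time figures in item 1, here is a small
sketch that computes the pure serialization time of a block. The helper names
(wire_time_us, print_wire_times) are made up for this illustration, and the
calculation assumes payload bytes only, with no Ethernet/IP/TCP framing
overhead. Under that assumption a 32KB block takes roughly 262us on 1Gig and
roughly 26us on 10Gig; the gap to the measured ~270us is presumably framing
and header overhead.

```c
#include <assert.h>
#include <stdio.h>

/* Serialization time in microseconds for `bytes` of payload on a link
 * running at `link_bits_per_sec`.  Payload only: framing overhead is
 * deliberately ignored (assumption), so real numbers are a bit higher. */
static double wire_time_us(unsigned long bytes, double link_bits_per_sec)
{
	assert(link_bits_per_sec > 0.0);
	return (double)bytes * 8.0 * 1e6 / link_bits_per_sec;
}

/* Convenience printer for the two link speeds discussed above. */
static void print_wire_times(unsigned long bytes)
{
	printf("1Gig:  %.1f us\n", wire_time_us(bytes, 1e9));
	printf("10Gig: %.1f us\n", wire_time_us(bytes, 1e10));
}
```

Calling print_wire_times(32 * 1024) prints both estimates for the 32KB case.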

We found in our testing that certain DRBD changes - see the patch below
- improve performance of small synchronous writes.  For reference,
the testing was done in Xen environment on Dell T610 system with Xeon
E5520 2.27 GHz CPU, and Dell PERC 6/I RAID controller.  DRBD code was
running in the 2 way SMP Dom0.
The patch is made against DRBD version 8.2.7, but it is equally relevant
to 8.3.7 as corresponding parts of the code did not change
significantly. This patch is a "rough draft request for comment". It
contains several changes.

1. TCP_QUICKACK option is set incorrectly. The goal was to force TCP to
send an ACK as a "one time" event.  Instead the code permanently puts the
connection in the QUICKACK mode.

2. Socket sndbufsize/rcvbufsize setting is done incorrectly. The code
sets socket buffer sizes _after_ connection is established.  In order
for these settings to take effect they should be set _before_ connection
is established.  We made a quick and dirty change that makes identical
setting for both meta and data connections.  It would require a bigger
change to have separate settings because in the current code it is not
known in advance which socket will be used for which connection.

3. We noticed that on the primary node it takes about 20us to schedule
DRBD worker thread that packages and sends write request to the
secondary node. We think it would be better to send request to the
secondary ASAP and only then continue with primary node processing. So I
added "schedule()"  hack to drbd_make_request_common() and raised the
priority of the worker thread. That reduced worker thread scheduling
delay to about 7us. I am not 100% sure if this hack is safe - would be very
interested in your opinion on it.

4. We disabled TCP_CORK through drbdsetup utility and modified the code
to do implicit corking using MSG_MORE flag. TCP code tries to postpone
sending partial message until the whole message is assembled. So we try
to send drbd request header first to let the secondary node start
preparations to receive the data part while the primary node is still
transmitting the data. Maybe this behavior should be a configurable
variant of tcp corking, because it might not be advantageous for every
NIC/link speed configuration.

We were also considering implementation of "zero copy" receive that
should improve performance for 10Gig links - that is not part of the
patch. The basic idea is to intercept incoming data before they get
queued to the socket in the same way NFS and iSCSI code do it through
tcp_read_sock() api. Then drbd could convert sk_buff chain to bio, if
alignment is right, and avoid expensive data copy. Do you plan to add
something like this into your future DRBD release so we do not have to do
it ourselves? ;-)

We would appreciate your comments on the patch.

Thanks,

-Ed

-----------------------------

diff -aur src.orig/drbd/drbd_int.h src/drbd/drbd_int.h
--- src.orig/drbd/drbd_int.h	2010-04-01 15:47:54.000000000 -0400
+++ src/drbd/drbd_int.h	2010-04-26 18:09:14.000000000 -0400
@@ -1124,7 +1124,7 @@
 extern int drbd_send_ack_ex(struct drbd_conf *mdev, enum Drbd_Packet_Cmd cmd,
 			    sector_t sector, int blksize, u64 block_id);
 extern int _drbd_send_page(struct drbd_conf *mdev, struct page *page,
-			int offset, size_t size);
+			int offset, size_t size, int flags);
 extern int drbd_send_block(struct drbd_conf *mdev, enum Drbd_Packet_Cmd cmd,
 			   struct Tl_epoch_entry *e);
 extern int drbd_send_dblock(struct drbd_conf *mdev, struct drbd_request *req);
@@ -1596,7 +1596,7 @@

 static inline void drbd_tcp_quickack(struct socket *sock)
 {
-	int __user val = 1;
+	int __user val = 2;
 	(void) drbd_setsockopt(sock, SOL_TCP, TCP_QUICKACK,
 			(char __user *)&val, sizeof(val));
 }
diff -aur src.orig/drbd/drbd_main.c src/drbd/drbd_main.c
--- src.orig/drbd/drbd_main.c	2010-04-01 15:47:54.000000000 -0400
+++ src/drbd/drbd_main.c	2010-04-28 15:25:48.000000000 -0400
@@ -2084,7 +2084,7 @@
 }

 int _drbd_send_page(struct drbd_conf *mdev, struct page *page,
-		    int offset, size_t size)
+		    int offset, size_t size, int flags)
 {
 	mm_segment_t oldfs = get_fs();
 	int sent, ok;
@@ -2130,7 +2130,7 @@
 	do {
 		sent = mdev->data.socket->ops->sendpage(mdev->data.socket, page,
 							offset, len,
-							MSG_NOSIGNAL);
+							flags | MSG_NOSIGNAL);
 		if (sent == -EAGAIN) {
 			if (we_should_drop_the_connection(mdev,
 							  mdev->data.socket))
@@ -2168,13 +2168,20 @@
 	return 1;
 }

-static inline int _drbd_send_zc_bio(struct drbd_conf *mdev, struct bio *bio)
+static inline int _drbd_send_zc_bio(struct drbd_conf *mdev, struct bio *bio,
+                                    int flags)
 {
 	struct bio_vec *bvec;
 	int i;
+        unsigned int len = 0;
+
 	__bio_for_each_segment(bvec, bio, i, 0) {
+
+                if ((len += bvec->bv_len) == bio->bi_size)
+                        flags = 0;
+
 		if (!_drbd_send_page(mdev, bvec->bv_page,
-				     bvec->bv_offset, bvec->bv_len))
+				     bvec->bv_offset, bvec->bv_len, flags))
 			return 0;
 	}

@@ -2224,7 +2231,7 @@
 	dump_packet(mdev, mdev->data.socket, 0, (void *)&p, __FILE__, __LINE__);
 	blk_add_trace_bio(mdev->rq_queue, req->master_bio, BLK_TA_GETRQ);
 	ok = (sizeof(p) ==
-		drbd_send(mdev, mdev->data.socket, &p, sizeof(p), MSG_MORE));
+                drbd_send(mdev, mdev->data.socket, &p, sizeof(p), 0));
 	if (ok && dgs) {
 		dgb = mdev->int_dig_out;
 		drbd_csum(mdev, mdev->integrity_w_tfm, req->master_bio, dgb);
@@ -2234,7 +2241,7 @@
 		if (mdev->net_conf->wire_protocol == DRBD_PROT_A)
 			ok = _drbd_send_bio(mdev, req->master_bio);
 		else
-			ok = _drbd_send_zc_bio(mdev, req->master_bio);
+			ok = _drbd_send_zc_bio(mdev, req->master_bio, MSG_MORE);
 	}

 	drbd_put_data_sock(mdev);
@@ -2281,7 +2288,7 @@
 		ok = drbd_send(mdev, mdev->data.socket, dgb, dgs, MSG_MORE);
 	}
 	if (ok)
-		ok = _drbd_send_zc_bio(mdev, e->private_bio);
+		ok = _drbd_send_zc_bio(mdev, e->private_bio, 0);

 	drbd_put_data_sock(mdev);
 	return ok;
Only in src/drbd: drbd_main.c~
diff -aur src.orig/drbd/drbd_receiver.c src/drbd/drbd_receiver.c
--- src.orig/drbd/drbd_receiver.c	2010-04-01 15:47:54.000000000 -0400
+++ src/drbd/drbd_receiver.c	2010-04-28 15:27:17.000000000 -0400
@@ -620,6 +620,15 @@
 	sock->sk->sk_rcvtimeo =
 	sock->sk->sk_sndtimeo =  mdev->net_conf->try_connect_int*HZ;

+	if (mdev->net_conf->sndbuf_size) {
+		/* FIXME fold to limits. should be done during configuration */
+		/* this is setsockopt SO_SNDBUFFORCE and SO_RCVBUFFORCE,
+		 * done directly. */
+		sock->sk->sk_sndbuf = mdev->net_conf->sndbuf_size;
+		sock->sk->sk_rcvbuf = mdev->net_conf->sndbuf_size;
+		sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK | SOCK_RCVBUF_LOCK;
+	}
+
        /* explicitly bind to the configured IP as source IP
 	*  for the outgoing connections.
 	*  This is needed for multihomed hosts and to be
@@ -699,6 +708,16 @@
 	s_listen->sk->sk_rcvtimeo =
 	s_listen->sk->sk_sndtimeo =  mdev->net_conf->try_connect_int*HZ;

+	if (mdev->net_conf->sndbuf_size) {
+		/* FIXME fold to limits. should be done during configuration */
+		/* this is setsockopt SO_SNDBUFFORCE and SO_RCVBUFFORCE,
+		 * done directly. */
+		s_listen->sk->sk_sndbuf = mdev->net_conf->sndbuf_size;
+		s_listen->sk->sk_rcvbuf = mdev->net_conf->sndbuf_size;
+		s_listen->sk->sk_userlocks |= 
+                    SOCK_SNDBUF_LOCK | SOCK_RCVBUF_LOCK;
+	}
+
 	what = "bind before listen";
 	err = s_listen->ops->bind(s_listen,
 			      (struct sockaddr *) mdev->net_conf->my_addr,
@@ -885,6 +904,7 @@
 	sock->sk->sk_priority = TC_PRIO_INTERACTIVE_BULK;
 	msock->sk->sk_priority = TC_PRIO_INTERACTIVE;

+#if 0 
 	if (mdev->net_conf->sndbuf_size) {
 		/* FIXME fold to limits. should be done during configuration */
 		/* this is setsockopt SO_SNDBUFFORCE and SO_RCVBUFFORCE,
@@ -893,6 +913,7 @@
 		sock->sk->sk_rcvbuf = mdev->net_conf->sndbuf_size;
 		sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK | SOCK_RCVBUF_LOCK;
 	}
+#endif 

 #if 0 /* don't pin the msock bufsize, autotuning should work better */
 	msock->sk->sk_sndbuf = 2*32767;
diff -aur src.orig/drbd/drbd_req.c src/drbd/drbd_req.c
--- src.orig/drbd/drbd_req.c	2010-04-01 15:47:54.000000000 -0400
+++ src/drbd/drbd_req.c	2010-04-28 15:28:27.000000000 -0400
@@ -1110,6 +1110,13 @@
 			_req_mod(req, queue_for_net_read, 0);
 	}
 	spin_unlock_irq(&mdev->req_lock);
+
+     /* give a worker thread a chance to pick up the request */
+	if (remote) {
+            if (!in_atomic())
+                    schedule();
+     }
+
 	kfree(b); /* if someone else has beaten us to it... */

 	if (local) {
diff -aur src.orig/drbd/drbd_worker.c src/drbd/drbd_worker.c
--- src.orig/drbd/drbd_worker.c	2010-04-01 15:47:54.000000000 -0400
+++ src/drbd/drbd_worker.c	2010-04-26 18:25:17.000000000 -0400
@@ -1237,6 +1237,9 @@

 	sprintf(current->comm, "drbd%d_worker", mdev_to_minor(mdev));

+	current->policy = SCHED_RR;  /* Make this a realtime task! */
+	current->rt_priority = 2;    /* more important than all other tasks */
+
 	while (get_t_state(thi) == Running) {
 		drbd_thread_current_set_cpu(mdev);


* Re: [Drbd-dev] DRBD small synchronous writes performance improvements
  2010-04-29 20:00 [Drbd-dev] DRBD small synchronous writes performance improvements Guzovsky, Eduard
@ 2010-04-29 21:26 ` Lars Ellenberg
  2010-04-30 22:11 ` [Drbd-dev] DRBD small synchronous writes performanceimprovements Guzovsky, Eduard
       [not found] ` <BC2F8964429F14468EA0E6D6CF00E8C99483A1@EXHQ.corp.strat us.com>
  2 siblings, 0 replies; 6+ messages in thread
From: Lars Ellenberg @ 2010-04-29 21:26 UTC (permalink / raw)
  To: drbd-dev

On Thu, Apr 29, 2010 at 04:00:50PM -0400, Guzovsky, Eduard wrote:
> Hi guys,
> 
> We analyzed DRBD performance of small synchronous write operations on
> systems with RAID controllers.  This i/o pattern happens frequently in
> database transaction processing workloads. Large RAID caches ensure
> that disk i/o overhead is small - about 100us per 32KB block - and
> network overhead turns into a dominant factor.  In addition to network
> stack processing time, network overhead has two large components
> 
> 1. "Wire time" - the actual time it takes to transmit data. On 1Gig
> network it takes about 270us to transmit 32KB block, on 10Gig - about
> 27us.
> 
> 2. "NIC latency" - the time it takes NIC to start transmitting a packet
> on the sending node plus the time on the receiving node between the
> packet reception by the NIC and delivering the packet to the driver. NIC
> latency is chip specific and depends on "packet/interrupt coalescing"
> setting configurable via "ethtool -C ". The default setting results in
> about a 100us to 125us "short packet round trip" latency.  We reduced it
> to about 50us by effectively disabling packet coalescing.  Have you or
> your customers experimented with this? We are obviously concerned with
> adverse side effects of disabling packet coalescing on whole system
> performance during high network loads.
> 
> We found in our testing that certain DRBD changes - see the patch below
> - improve performance of small synchronous writes.  For reference,
> the testing was done in Xen environment on Dell T610 system with Xeon
> E5520 2.27 GHz CPU, and Dell PERC 6/I RAID controller.  DRBD code was
> running in the 2 way SMP Dom0.
> The patch is made against DRBD version 8.2.7, but it is equally relevant
> to 8.3.7 as corresponding parts of the code did not change
> significantly. This patch is a "rough draft request for comment". It
> contains several changes.
> 
> 1. TCP_QUICKACK option is set incorrectly. The goal was to force TCP to
> send an ACK as a "one time" event.  Instead the code permanently puts the
> connection in the QUICKACK mode.

Oh, it is not permanent, tcp will re-enable "pingpong" mode
when it "feels like it".  But you are right, using val = 2 will
re-enter pingpong mode immediately if there actually
have been pending ACKs forced out.
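
To make the difference between val = 1 and val = 2 concrete, here is a toy
model of the behaviour described above. It is not kernel code: struct
ack_state and model_set_quickack() are invented for this sketch, and the
even/odd test on val is an assumption about how the stack tells 1 and 2 apart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a connection's delayed-ACK state (hypothetical type). */
struct ack_state {
	bool pingpong;     /* true: delayed-ACK ("pingpong") mode is on */
	bool ack_pending;  /* true: an ACK is scheduled but not yet sent */
};

/* Models setting TCP_QUICKACK to `val`.  Returns true if a pending ACK
 * was forced out.  Assumption: an even non-zero val re-enters pingpong
 * mode right after forcing the ACK, an odd val stays in quickack mode. */
static bool model_set_quickack(struct ack_state *s, int val)
{
	bool forced = false;

	assert(s != NULL);
	if (val == 0) {
		s->pingpong = true;
		return false;
	}
	s->pingpong = false;
	if (s->ack_pending) {
		s->ack_pending = false;
		forced = true;
		if ((val & 1) == 0)
			s->pingpong = true;
	}
	return forced;
}
```

In this model, val = 2 with a pending ACK behaves as the "one time" event the
patch is after, while val = 1 leaves the connection in quickack mode until TCP
re-enables pingpong on its own.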

> 2. Socket sndbufsize/rcvbufsize setting is done incorrectly. The code
> sets socket buffer sizes _after_ connection is established.  In order
> for these settings to take effect they should be set _before_ connection
> is established.  We made a quick and dirty change that makes identical
> setting for both meta and data connections.  It would require a bigger
> change to have separate settings because in the current code it is not
> known in advance which socket will be used for which connection.

Apparently I need to re-read some kernel code on this.
If you want to point me to a specific area of code?

> 3. We noticed that on the primary node it takes about 20us to schedule
> DRBD worker thread that packages and sends write request to the
> secondary node. We think it would be better to send request to the
> secondary ASAP and only then continue with primary node processing. So I
> added "schedule()"  hack to drbd_make_request_common() and raised the
> priority of the worker thread. That reduced worker thread scheduling
> delay to about 7us. I am not 100% sure if this hack is safe - would be very
> interested in your opinion on it.

That's an interesting hack ;-)
What priority did you choose?
What is your "cpu-mask" for the drbd threads?

> 4. We disabled TCP_CORK through drbdsetup utility and modified the code
> to do implicit corking using MSG_MORE flag. TCP code tries to postpone
> sending partial message until the whole message is assembled. So we try
> to send drbd request header first to let the secondary node start
> preparations to receive the data part while the primary node is still
> transmitting the data. Maybe this behavior should be a configurable
> variant of tcp corking, because it might not be advantageous for every
> NIC/link speed configuration.

Ok.
We'll see what this does to our test hardware.  Anyways, if it seems to
be beneficial for you, we can certainly add some config option for it.

> We were also considering implementation of "zero copy" receive that
> should improve performance for 10Gig links - that is not part of the
> patch. The basic idea is to intercept incoming data before they get
> queued to the socket in the same way NFS and iSCSI code do it through
> tcp_read_sock() api.

Yes. I recently suggested that as an enhancement for IET on the
iscsit-target mailing list myself, though to get rid of an additional
memcopy they do for their "fileio" mode.

I don't think it is that easy to adapt for DRBD (or their "block io"
mode), because:

> Then drbd could convert sk_buff chain to bio, if alignment is right,

that is a big iff.
I'm not at all sure how you want to achieve this.
Usually the alignment will be just wrong.

> and avoid expensive data copy. Do you plan to add something like this
> into your future DRBD release so we do not have to do it ourselves? ;-)

Something like this should be very beneficial, but I don't see how we
can achieve the proper alignment of the data pages in the sk_buff.

"native RDMA mode" for DRBD would be a nice thing to have, and possibly
solve this as well.  Maybe we find a feature sponsor for that ;-)

> We would appreciate your comments on the patch.

Will have to do actual review within the next few days.

Thanks,

>  	ok = (sizeof(p) ==
> -		drbd_send(mdev, mdev->data.socket, &p, sizeof(p), MSG_MORE));
> +		drbd_send(mdev, mdev->data.socket, &p, sizeof(p), 0));

> @@ -2234,7 +2241,7 @@
> -			ok = _drbd_send_zc_bio(mdev, req->master_bio);
> +			ok = _drbd_send_zc_bio(mdev, req->master_bio, MSG_MORE);

Ok, I see where you are going.
Maybe rather not have the flags in _drbd_send_zc_bio, but have the
_drbd_send_zc_bio itself add MSG_MORE to all but the last sendpage?

> @@ -2281,7 +2288,7 @@
> -		ok = _drbd_send_zc_bio(mdev, e->private_bio);
> +		ok = _drbd_send_zc_bio(mdev, e->private_bio, 0);

no MSG_MORE here?


> +
> +     /* give a worker thread a chance to pick up the request */
> +	if (remote) {
> +            if (!in_atomic())
> +                    schedule();

You may well drop the if (!in_atomic()),
it cannot possibly be in atomic context there.
Also, the immediately preceding spin_unlock_irq() is a pre-emption
point.  So actually this should not even be necessary.

> +     }
> +
>  	kfree(b); /* if someone else has beaten us to it... */
>  
>  	if (local) {
> diff -aur src.orig/drbd/drbd_worker.c src/drbd/drbd_worker.c
> --- src.orig/drbd/drbd_worker.c	2010-04-01 15:47:54.000000000 -0400
> +++ src/drbd/drbd_worker.c	2010-04-26 18:25:17.000000000 -0400
> @@ -1237,6 +1237,9 @@
>  
>  	sprintf(current->comm, "drbd%d_worker", mdev_to_minor(mdev));
>  
> +	current->policy = SCHED_RR;  /* Make this a realtime task! */
> +	current->rt_priority = 2;    /* more important than all other
> tasks */
> +

Not sure about this.
I don't really want to do crypto operations
from a real time kernel thread.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


* Re: [Drbd-dev] DRBD small synchronous writes performanceimprovements
  2010-04-29 20:00 [Drbd-dev] DRBD small synchronous writes performance improvements Guzovsky, Eduard
  2010-04-29 21:26 ` Lars Ellenberg
@ 2010-04-30 22:11 ` Guzovsky, Eduard
  2010-05-03  7:08   ` Lars Ellenberg
       [not found] ` <BC2F8964429F14468EA0E6D6CF00E8C99483A1@EXHQ.corp.strat us.com>
  2 siblings, 1 reply; 6+ messages in thread
From: Guzovsky, Eduard @ 2010-04-30 22:11 UTC (permalink / raw)
  To: Lars Ellenberg, drbd-dev

Hi Lars, thank you for your quick response. My answers/comments are
inline.

> 
> > 2. Socket sndbufsize/rcvbufsize setting is done incorrectly. The code
> > sets socket buffer sizes _after_ connection is established.  In order
> > for these settings to take effect they should be set _before_ connection
> > is established.  We made a quick and dirty change that makes identical
> > setting for both meta and data connections.  It would require a bigger
> > change to have separate settings because in the current code it is not
> > known in advance which socket will be used for which connection.
> 
> Apparently I need to re-read some kernel code on this.
> If you want to point me to a specific area of code?
> 

I never had a chance to track this behavior to a specific area of the
linux tcp code, but I was burned by this problem before. 

Here is a quote from tcp(7) man page.

 "On individual connections, the socket buffer size must be
 set prior to the listen() or connect() calls in order to
 have it take effect"

I verified via a simple experiment that this problem exists in DRBD and
that the suggested patch fixes it.

Just configure a large DRBD socket buffer size, let's say 4MB. Initiate
large disk writes. You can see in the tcpdump capture that the secondary
node never advertises a receive window above 128KB. With the suggested
patch the receive window will go up to 2MB.

It goes without saying that the patch is kind of crude - it sets the socket
buffer size on both data and meta connections. It would require more
significant code changes to only fix the data connection because it is not
known a priori which of the two sockets will be used for the data
connection.

Also, auto-tuning alleviates the whole problem. 
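
For comparison, this is what the ordering from the tcp(7) quote looks like in
a userspace sketch: the buffer size is requested on a fresh socket, before any
connect() or listen(). The helper name tcp_socket_with_rcvbuf() is made up for
this illustration; the DRBD patch itself assigns sk_rcvbuf directly inside the
kernel rather than going through setsockopt().

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Create a TCP socket and request `bytes` of receive buffer *before* the
 * socket is connected or put into listen state.  Returns the fd, or -1 on
 * error.  `*effective` receives the size the kernel reports afterwards. */
static int tcp_socket_with_rcvbuf(int bytes, int *effective)
{
	socklen_t len = sizeof(*effective);
	int fd;

	assert(effective != NULL);
	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0 ||
	    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, effective, &len) < 0) {
		close(fd);
		return -1;
	}
	/* The caller would now connect() or listen(), and close() the fd
	 * when done. */
	return fd;
}
```

On Linux the reported value is normally not the requested one: the kernel
doubles it for bookkeeping and caps plain SO_RCVBUF requests at
net.core.rmem_max, which the direct assignment in the patch bypasses.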


> > 3. We noticed that on the primary node it takes about 20us to schedule
> > DRBD worker thread that packages and sends write request to the
> > secondary node. We think it would be better to send request to the
> > secondary ASAP and only then continue with primary node processing. So I
> > added "schedule()"  hack to drbd_make_request_common() and raised the
> > priority of the worker thread. That reduced worker thread scheduling
> > delay to about 7us. I am not 100% sure if this hack is safe - would be very
> > interested in your opinion on it.
> 
> That's an interesting hack ;-)
> What priority did you choose?

Real time, RR, 2 - the same as asender thread.

> What is your "cpu-mask" for the drbd threads?

We do not specify affinity - any cpu is up for grabs.

> 
> > 4. We disabled TCP_CORK through drbdsetup utility and modified the code
> > to do implicit corking using MSG_MORE flag. TCP code tries to postpone
> > sending partial message until the whole message is assembled. So we try
> > to send drbd request header first to let the secondary node start
> > preparations to receive the data part while the primary node is still
> > transmitting the data. Maybe this behavior should be a configurable
> > variant of tcp corking, because it might not be advantageous for every
> > NIC/link speed configuration.
> 
> Ok.
> We'll see what this does to our test hardware.  Anyways, if it seems to
> be beneficial for you, we can certainly add some config option for it.

DRBD config option would be great because this method might not be
advantageous in all configurations.

> 
> > We were also considering implementation of "zero copy" receive that
> > should improve performance for 10Gig links - that is not part of the
> > patch. The basic idea is to intercept incoming data before they get
> > queued to the socket in the same way NFS and iSCSI code do it through
> > tcp_read_sock() api.
> 
> Yes. I recently suggested that as an enhancement for IET on the
> iscsit-target mailing list myself, though to get rid of an additional
> memcopy they do for their "fileio" mode.
> 
> I don't think it is that easy to adapt for DRBD (or their "block io"
> mode), because:
> 
> > Then drbd could convert sk_buff chain to bio, if alignment is right,
> 
> that is a big iff.
> I'm not at all sure how you want to achieve this.
> Usually the alignment will be just wrong.
> 
> > and avoid expensive data copy. Do you plan to add something like this
> > into your future DRBD release so we do not have to do it ourselves? ;-)
> 
> Something like this should be very beneficial, but I don't see how we
> can achieve the proper alignment of the data pages in the sk_buff.
> 
> "native RDMA mode" for DRBD would be a nice thing to have, and possibly
> solve this as well.  Maybe we find a feature sponsor for that ;-)
> 

Here is a plan for getting alignment right. I will assume usage of the
Intel 82599 10Gig chip and the corresponding ixgbe driver. 

The nice thing about this chip and the driver is that by default they
support packet splitting. That means that the Ethernet/TCP/IP header of the
incoming packet is received in one memory buffer, while the data portion
is received into another memory buffer. This second buffer is half-page
(2KB) aligned. I guess they did not make it the whole page aligned to
reduce memory waste. Still, AFAIK that should more than satisfy bio
alignment requirements. Is it 512 bytes?

We set interface mtu to 9000. Let's say DRBD does a 32KB write. DRBD can
control (or at least give a hint to TCP) how the whole request should be
"packetized" using the MSG_MORE flag. The DRBD Request Header is sent as one
packet (no MSG_MORE flag). Then each even (counting from 0) data page is
sent with the MSG_MORE flag, and each odd data page is sent without it.
This should result in two data pages per transmitted packet.
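
The even/odd rule can be written down as a small sketch. The helper names
pair_flags_for_page() and expected_pushes() are made up for this illustration,
and the code only models the hint given to TCP; since MSG_MORE is a hint, the
actual segmentation on the wire may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>	/* MSG_MORE (Linux) */

/* Flag for the send of data page `idx` (counting from 0): even pages ask
 * TCP to wait for more data, odd pages let it push what it has. */
static int pair_flags_for_page(size_t idx)
{
	return (idx % 2 == 0) ? MSG_MORE : 0;
}

/* Number of points at which TCP is allowed to push a segment for
 * `nr_pages` data pages.  A trailing page still marked MSG_MORE counts
 * as one more push, on the assumption that it is flushed eventually. */
static size_t expected_pushes(size_t nr_pages)
{
	size_t i, pushes = 0;
	int last_flags = 0;

	for (i = 0; i < nr_pages; i++) {
		last_flags = pair_flags_for_page(i);
		if (last_flags == 0)
			pushes++;
	}
	if (nr_pages > 0 && last_flags == MSG_MORE)
		pushes++;
	assert(pushes <= nr_pages);
	return pushes;
}
```

Assuming 4KB pages, a 32KB write has 8 data pages, so expected_pushes(8)
gives 4 pushes of two pages (8KB) each.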

I instrumented DRBD to do just that. I also instrumented ixgbe driver to
dump skb with the received data on the secondary node. Here is what I
got for a 32KB write.

 skb 0xebc04c80 len 84 data_len 32 frags 1     <-- 52 bytes TCP/IP header
  frag_page 0xc146ba40 offset 0 size 32        <-- 32 bytes Drbd_Data_Packet

 skb 0xe362b0c0 len 8244 data_len 8192 frags 4 <-- 52 bytes TCP/IP header
  frag_page 0xc146ba60 offset 2048 size 2048   <-- 8KB of data
  frag_page 0xc146baa0 offset 2048 size 2048
  frag_page 0xc146bac0 offset 0 size 2048
  frag_page 0xc146bae0 offset 2048 size 2048

 skb 0xe35a9440 len 8244 data_len 8192 frags 4 <-- 52 bytes TCP/IP header
  frag_page 0xc146bb00 offset 2048 size 2048   <-- 8KB of data
  frag_page 0xc146bb40 offset 2048 size 2048
  frag_page 0xc146bb60 offset 0 size 2048
  frag_page 0xc146bb80 offset 2048 size 2048

 skb 0xe99ada80 len 8244 data_len 8192 frags 4 <-- 52 bytes TCP/IP header
  frag_page 0xc146bbc0 offset 0 size 2048      <-- 8KB of data
  frag_page 0xc146bc00 offset 2048 size 2048
  frag_page 0xc146bc20 offset 0 size 2048
  frag_page 0xc146bc40 offset 2048 size 2048

 skb 0xebc4c300 len 8244 data_len 8192 frags 4 <-- 52 bytes TCP/IP header
  frag_page 0xc146bc60 offset 0 size 2048      <-- 8KB of data
  frag_page 0xc146bca0 offset 0 size 2048
  frag_page 0xc146bcc0 offset 0 size 2048
  frag_page 0xc146bce0 offset 2048 size 2048

As you can see the data is 2KB aligned.
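
The check a zero-copy receive path would have to make per fragment can be
sketched as follows. This is only an illustration: struct frag and both
helpers are hypothetical names, and the 512-byte unit is the alignment guessed
at above, not a confirmed bio requirement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SECTOR_SIZE 512u	/* assumed alignment unit, see text above */

/* One fragment as printed in the dump: offset within the page and size. */
struct frag {
	unsigned int offset;
	unsigned int size;
};

/* A fragment qualifies if both its start and its length are multiples
 * of the sector size. */
static bool frag_is_sector_aligned(const struct frag *f)
{
	assert(f != NULL);
	return f->size != 0 &&
	       f->offset % SECTOR_SIZE == 0 &&
	       f->size % SECTOR_SIZE == 0;
}

/* True only if every fragment of an skb qualifies; otherwise the data
 * would have to be copied instead. */
static bool all_frags_aligned(const struct frag *frags, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!frag_is_sector_aligned(&frags[i]))
			return false;
	return true;
}
```

With the values from the dump, every 2048-byte data fragment passes, while
the 32-byte header fragment does not and would still need a copy.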


> 
> >  	ok = (sizeof(p) ==
> > -		drbd_send(mdev, mdev->data.socket, &p, sizeof(p), MSG_MORE));
> > +		drbd_send(mdev, mdev->data.socket, &p, sizeof(p), 0));
> 
> > @@ -2234,7 +2241,7 @@
> > -			ok = _drbd_send_zc_bio(mdev, req->master_bio);
> > +			ok = _drbd_send_zc_bio(mdev, req->master_bio, MSG_MORE);
> 
> Ok, I see where you are going.
> Maybe rather not have the flags in _drbd_send_zc_bio, but have the
> _drbd_send_zc_bio itself add MSG_MORE to all but the last sendpage?

Sure, that should work too.

> 
> > @@ -2281,7 +2288,7 @@
> > -		ok = _drbd_send_zc_bio(mdev, e->private_bio);
> > +		ok = _drbd_send_zc_bio(mdev, e->private_bio, 0);
> 
> no MSG_MORE here?
> 

Maybe. I have not played with "remote reads" performance.

> 
> > +
> > +     /* give a worker thread a chance to pick up the request */
> > +	if (remote) {
> > +            if (!in_atomic())
> > +                    schedule();
> 
> You may well drop the if (!in_atomic()),
> it cannot possibly be in atomic context there.

if (!in_atomic()) is paranoia ;-)


> Also, the immediately preceding spin_unlock_irq() is a pre-emption
> point.  So actually this should not even be necessary.

It is necessary in our case - our kernel is compiled without
CONFIG_PREEMPT so threads are not preemptable in the kernel. So maybe
another drbd configuration option would be useful here.

> > diff -aur src.orig/drbd/drbd_worker.c src/drbd/drbd_worker.c
> > --- src.orig/drbd/drbd_worker.c	2010-04-01 15:47:54.000000000 -0400
> > +++ src/drbd/drbd_worker.c	2010-04-26 18:25:17.000000000 -0400
> > @@ -1237,6 +1237,9 @@
> >
> >  	sprintf(current->comm, "drbd%d_worker", mdev_to_minor(mdev));
> >
> > +	current->policy = SCHED_RR;  /* Make this a realtime task! */
> > +	current->rt_priority = 2;    /* more important than all other
> > tasks */
> > +
> 
> Not sure about this.
> I don't really want to do crypto operations
> from a real time kernel thread.

Sure, I agree. Though in our case we do not use crypto stuff. So how
about one more drbd configuration option? ;-)

Thanks again,

-Ed





* Re: [Drbd-dev] DRBD small synchronous writes performanceimprovements
  2010-04-30 22:11 ` [Drbd-dev] DRBD small synchronous writes performanceimprovements Guzovsky, Eduard
@ 2010-05-03  7:08   ` Lars Ellenberg
  0 siblings, 0 replies; 6+ messages in thread
From: Lars Ellenberg @ 2010-05-03  7:08 UTC (permalink / raw)
  To: drbd-dev

On Fri, Apr 30, 2010 at 06:11:13PM -0400, Guzovsky, Eduard wrote:
> > > 3. We noticed that on the primary node it takes about 20us to
> > > schedule DRBD worker thread that packages and sends write request
> > > to the secondary node. We think it would be better to send request
> > > to the secondary ASAP and only then continue with primary node
> > > processing.  So I added "schedule()"  hack to
> > > drbd_make_request_common() and raised the priority of the worker
> > > thread. That reduced worker thread scheduling delay to about 7us.
> > > I am not 100% sure if this hack is safe - would be very interested in
> > > your opinion on it.

> > What is your "cpu-mask" for the drbd threads?
> 
> We do not specify affinity - any cpu is up for grabs.

If you do not set cpu-mask with drbdsetup,
DRBD kernel threads of one specific minor
will pin themselves on the same single cpu.

So maybe try: drbdsetup 0 syncer --cpu-mask ff
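
For reference, a sketch of how such a mask reads, assuming the value is a
hexadecimal bitmask with bit 0 standing for CPU 0. The helper names are made
up for this illustration, and the parser only handles masks that fit into an
unsigned long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Parse a hexadecimal cpu-mask string such as "ff".  Returns the bitmask,
 * or 0 if the string is empty or contains non-hex characters. */
static unsigned long parse_cpu_mask(const char *hex)
{
	char *end;
	unsigned long mask;

	assert(hex != NULL);
	mask = strtoul(hex, &end, 16);
	return (*hex != '\0' && *end == '\0') ? mask : 0;
}

/* Count the CPUs selected by the mask (number of set bits). */
static int cpus_in_mask(unsigned long mask)
{
	int n = 0;

	while (mask != 0) {
		n += (int)(mask & 1UL);
		mask >>= 1;
	}
	return n;
}
```

Under that reading, "ff" selects CPUs 0 through 7, so the threads may run on
any of the first eight CPUs.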

> > > We were also considering implementation of "zero copy" receive
> > > that should improve performance for 10Gig links - that is not part
> > > of the patch. The basic idea is to intercept incoming data before
> > > they get queued to the socket in the same way NFS and iSCSI code
> > > do it through tcp_read_sock() api.
> > 
> > Yes. I recently suggested that as an enhancement for IET on the
> > iscsit-target mailing list myself, though to get rid of an additional
> > memcopy they do for their "fileio" mode.
> > 
> > I don't think it is that easy to adapt for DRBD (or their "block io"
> > mode), because:
> > 
> > > Then drbd could convert sk_buff chain to bio, if alignment is right,
> > 
> > that is a big iff.
> > I'm not at all sure how you want to achieve this.
> > Usually the alignment will be just wrong.
> > 
> > > and avoid expensive data copy. Do you plan to add something like
> > > this into your future DRBD release so we do not have to do it
> > > ourselves? ;-)
> > 
> > Something like this should be very beneficial, but I don't see how we
> > can achieve the proper alignment of the data pages in the sk_buff.
> > 
> > "native RDMA mode" for DRBD would be a nice thing to have, and possibly
> > solve this as well.  Maybe we find a feature sponsor for that ;-)
> > 
> 
> Here is a plan for getting alignment right. I will assume usage of the
> Intel 82599 10Gig chip and the corresponding ixgbe driver. 
> 
> The nice thing about this chip and the driver is that by default they
> support packet splitting. That means that the Ethernet/TCP/IP header of
> the incoming packet is received in one memory buffer, while the data
> portion is received into another memory buffer. This second buffer is
> half-page (2KB) aligned. I guess they did not make it the whole page
> aligned to reduce memory waste. Still, AFAIK that should more than
> satisfy bio alignment requirements. Is it 512 bytes?
> 
> We set interface mtu to 9000. Let's say DRBD does a 32KB write. DRBD
> can control (or at least give a hint to TCP) how the whole request
> should be "packetized" using MSG_MORE flag. DRBD Request Header is
> sent as one packet (no MSG_MORE flag). Then each even (counting from
> 0) data page is sent with MSG_MORE flag, each odd data page is sent
> without MSG_MORE flag. This should result in two data pages per
> transmitted packet.
> 
> I instrumented DRBD to do just that. I also instrumented ixgbe driver
> to dump skb with the received data on the secondary node. Here is what
> I got for a 32KB write.
> 
>  skb 0xebc04c80 len 84 data_len 32 frags 1     <-- 52 bytes TCP/IP header
>   frag_page 0xc146ba40 offset 0 size 32        <-- 32 bytes Drbd_Data_Packet
> 
>  skb 0xe362b0c0 len 8244 data_len 8192 frags 4 <-- 52 bytes TCP/IP header
>   frag_page 0xc146ba60 offset 2048 size 2048   <-- 8KB of data
>   frag_page 0xc146baa0 offset 2048 size 2048
>   frag_page 0xc146bac0 offset 0 size 2048
>   frag_page 0xc146bae0 offset 2048 size 2048
> 
>  skb 0xe35a9440 len 8244 data_len 8192 frags 4 <-- 52 bytes TCP/IP header
>   frag_page 0xc146bb00 offset 2048 size 2048   <-- 8KB of data
>   frag_page 0xc146bb40 offset 2048 size 2048
>   frag_page 0xc146bb60 offset 0 size 2048
>   frag_page 0xc146bb80 offset 2048 size 2048
> 
>  skb 0xe99ada80 len 8244 data_len 8192 frags 4 <-- 52 bytes TCP/IP header
>   frag_page 0xc146bbc0 offset 0 size 2048      <-- 8KB of data
>   frag_page 0xc146bc00 offset 2048 size 2048
>   frag_page 0xc146bc20 offset 0 size 2048
>   frag_page 0xc146bc40 offset 2048 size 2048
> 
>  skb 0xebc4c300 len 8244 data_len 8192 frags 4 <-- 52 bytes TCP/IP header
>   frag_page 0xc146bc60 offset 0 size 2048      <-- 8KB of data
>   frag_page 0xc146bca0 offset 0 size 2048
>   frag_page 0xc146bcc0 offset 0 size 2048
>   frag_page 0xc146bce0 offset 2048 size 2048
> 
> As you can see the data is 2KB aligned.

So you suggest we could "sometimes" (maybe even "most of the time")
get_page, assign to bvec, submit, and on completion adjust skb for the
"recvmsg" that never happens.
We'd still need the "slowpath" memcpy code for those fragments that
happen to be not aligned.
And we'd need to convert DRBD's currently blocking network IO
into something that uses the sk_*_callbacks directly.

But yes, this seems to be possible.
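
The MSG_MORE packetization hint described in the quoted plan can be sketched in user space as follows. This is only an illustration of the flag alternation, not DRBD code: `SKETCH_MSG_MORE`, `sketch_page_flags()` and `sketch_packets_hinted()` are hypothetical names, and the numeric flag value is an assumption made so the sketch compiles outside the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's MSG_MORE bit; the value is an assumption
 * chosen only so that this sketch builds in user space. */
#define SKETCH_MSG_MORE 0x8000u

/*
 * Hypothetical helper: fill out[i] with the send flags for data page i.
 * If the caller starts with MSG_MORE set, even pages (counting from 0)
 * keep it and odd pages drop it, so TCP is hinted to put two pages into
 * each packet. If the caller starts without MSG_MORE, nothing toggles.
 */
static void sketch_page_flags(unsigned int initial_flags,
                              unsigned int *out, size_t n_pages)
{
    unsigned int toggle = initial_flags & SKETCH_MSG_MORE;
    unsigned int flags = initial_flags;
    size_t i;

    assert(out != NULL || n_pages == 0);

    for (i = 0; i < n_pages; i++) {
        out[i] = flags;
        flags ^= toggle;        /* flip MSG_MORE for the next page */
    }
}

/* Count the packet boundaries the hint asks for: one per page that is
 * sent without MSG_MORE. */
static size_t sketch_packets_hinted(const unsigned int *flags, size_t n_pages)
{
    size_t i, packets = 0;

    for (i = 0; i < n_pages; i++)
        if (!(flags[i] & SKETCH_MSG_MORE))
            packets++;
    return packets;
}
```

For a 32KB write split into eight 4KB pages, starting with MSG_MORE set, this yields four hinted packet boundaries, which is consistent with the four 8KB data skbs in the dump above. Whether TCP actually honors the hint depends on the stack and the MTU.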

> > > +     /* give a worker thread a chance to pick up the request */
> > > +	if (remote) {
> > > +            if (!in_atomic())
> > > +                    schedule();
> > 
> > You may well drop the if (!in_atomic()),
> > it cannot possibly be in atomic context there.
> 
> if (!in_atomic()) is paranoia ;-)

There are several potentially sleeping functions called earlier from
this context, so it would probably have BUG()ed before if it were atomic.

> > Also, the immediately preceding spin_unlock_irq() is a pre-emption
> > point.  So actually this should not even be necessary.
> 
> It is necessary in our case - our kernel is compiled without
> CONFIG_PREEMPT so threads are not preemptable in the kernel. So maybe
> another drbd configuration option would be useful here.
> 
> > > diff -aur src.orig/drbd/drbd_worker.c src/drbd/drbd_worker.c
> > > --- src.orig/drbd/drbd_worker.c	2010-04-01 15:47:54.000000000 -0400
> > > +++ src/drbd/drbd_worker.c	2010-04-26 18:25:17.000000000 -0400
> > > @@ -1237,6 +1237,9 @@
> > >
> > >  	sprintf(current->comm, "drbd%d_worker", mdev_to_minor(mdev));
> > >
> > > +	current->policy = SCHED_RR;  /* Make this a realtime task! */
> > > +	current->rt_priority = 2;    /* more important than all other tasks */
> > > +
> > 
> > Not sure about this.
> > I don't really want to do crypto operations
> > from a real time kernel thread.
> 
> Sure, I agree. Though in our case we do not use crypto stuff. So how
> about one more drbd configuration option? ;-)

We'll talk this through.  But please try the mentioned drbdsetup
cpu-mask stuff, that should make the drbd worker thread send the request
from another CPU even while this context is still submitting it locally.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


* Re: [Drbd-dev] DRBD small synchronous writes performanceimprovements
       [not found] ` <BC2F8964429F14468EA0E6D6CF00E8C99483A1@EXHQ.corp.strat us.com>
@ 2010-05-12  3:50   ` Guzovsky, Eduard
  2010-05-12  8:16     ` Lars Ellenberg
  0 siblings, 1 reply; 6+ messages in thread
From: Guzovsky, Eduard @ 2010-05-12  3:50 UTC (permalink / raw)
  To: Lars Ellenberg, drbd-dev

Hi Lars,

My comments and a zero copy receive patch are inline.

Thanks,

-Ed

> -----Original Message-----
> From: drbd-dev-bounces@lists.linbit.com [mailto:drbd-dev-
> bounces@lists.linbit.com] On Behalf Of Lars Ellenberg
> Sent: Monday, May 03, 2010 3:09 AM
> To: drbd-dev@lists.linbit.com
> Subject: Re: [Drbd-dev] DRBD small synchronous writes
> performanceimprovements
> 
> On Fri, Apr 30, 2010 at 06:11:13PM -0400, Guzovsky, Eduard wrote:
> > > > 3. We noticed that on the primary node it takes about 20us to
> > > > schedule DRBD worker thread that packages and sends write request
> > > > to the secondary node. We think it would be better to send request
> > > > to the secondary ASAP and only then continue with primary node
> > > > processing.  So I added "schedule()"  hack to
> > > > drbd_make_request_common() and raised the priority of the worker
> > > > thread. That reduced worker thread scheduling delay to about 7us.
> > > > I am not 100% sure if this hack is safe - would be very interested in
> > > > your opinion on it.
> 
> > > What is your "cpu-mask" for the drbd threads?
> >
> > We do not specify affinity - any cpu is up for grabs.
> 
> If you do not set cpu-mask with drbdsetup,
> DRBD kernel threads of one specific minor
> will pin themselves on the same single cpu.
> 
> So maybe try: drbdsetup 0 syncer --cpu-mask ff

Ok, I did not realize that setting cpu affinity is the default behavior.


> 
> > > > We were also considering implementation of "zero copy" receive
> > > > that should improve performance for 10Gig links - that is not part
> > > > of the patch. The basic idea is to intercept incoming data before
> > > > they get queued to the socket in the same way NFS and i-scsi code
> > > > do it through tcp_read_sock() api.
> > >
> > > Yes. I recently suggested that as an enhancement for IET on the
> > > iscsit-target mailing list myself, though to get rid of an additional
> > > memcopy they do for their "fileio" mode.
> > >
> > > I don't think it is that easy to adapt for DRBD (or their "block io"
> > > mode), because:
> > >
> > > > Then drbd could convert sk_buff chain to bio, if alignment is right,
> > >
> > > that is a big iff.
> > > I'm not at all sure how you want to achieve this.
> > > Usually the alignment will be just wrong.
> > >
> > > > and avoid expensive data copy. Do you plan to add something like
> > > > this into your future DRBD release so we do not have to do it
> > > > ourselves? ;-)
> > >
> > > Something like this should be very beneficial, but I don't see how we
> > > can achieve the proper alignment of the data pages in the sk_buff.
> > >
> > > "native RDMA mode" for DRBD would be a nice thing to have, and possibly
> > > solve this as well.  Maybe we find a feature sponsor for that ;-)
> > >
> >
> > Here is a plan for getting alignment right. I will assume usage of the
> > Intel 82599 10Gig chip and the corresponding ixgbe driver.
> >
> > The nice thing about this chip and the driver is that by default they
> > support packet splitting. That means that Ethernet/TCP/IP header of
> > the incoming packet is received in one memory buffer, while the data
> > portion is received into another memory buffer. This second buffer is
> > half-page (2KB) aligned. I guess they did not make it the whole page
> > aligned to reduce memory waste. Still, AFAIK that should more than
> > satisfy bio alignment requirements. Is it 512 bytes?
> >
> > We set interface mtu to 9000. Let's say DRBD does a 32KB write. DRBD
> > can control (or at least give a hint to TCP) how the whole request
> > should be "packetized" using MSG_MORE flag. DRBD Request Header is
> > sent as one packet (no MSG_MORE flag). Then each even (counting from
> > 0) data page is sent with MSG_MORE flag, each odd data page is sent
> > without MSG_MORE flag. This should result in two data pages per
> > transmitted packet.
> >
> > I instrumented DRBD to do just that. I also instrumented ixgbe driver
> > to dump skb with the received data on the secondary node. Here is what
> > I got for a 32KB write.
> >
> >  skb 0xebc04c80 len 84 data_len 32 frags 1     <-- 52 bytes TCP/IP header
> >   frag_page 0xc146ba40 offset 0 size 32        <-- 32 bytes Drbd_Data_Packet
> >
> >  skb 0xe362b0c0 len 8244 data_len 8192 frags 4 <-- 52 bytes TCP/IP header
> >   frag_page 0xc146ba60 offset 2048 size 2048   <-- 8KB of data
> >   frag_page 0xc146baa0 offset 2048 size 2048
> >   frag_page 0xc146bac0 offset 0 size 2048
> >   frag_page 0xc146bae0 offset 2048 size 2048
> >
> >  skb 0xe35a9440 len 8244 data_len 8192 frags 4 <-- 52 bytes TCP/IP header
> >   frag_page 0xc146bb00 offset 2048 size 2048   <-- 8KB of data
> >   frag_page 0xc146bb40 offset 2048 size 2048
> >   frag_page 0xc146bb60 offset 0 size 2048
> >   frag_page 0xc146bb80 offset 2048 size 2048
> >
> >  skb 0xe99ada80 len 8244 data_len 8192 frags 4 <-- 52 bytes TCP/IP header
> >   frag_page 0xc146bbc0 offset 0 size 2048      <-- 8KB of data
> >   frag_page 0xc146bc00 offset 2048 size 2048
> >   frag_page 0xc146bc20 offset 0 size 2048
> >   frag_page 0xc146bc40 offset 2048 size 2048
> >
> >  skb 0xebc4c300 len 8244 data_len 8192 frags 4 <-- 52 bytes TCP/IP header
> >   frag_page 0xc146bc60 offset 0 size 2048      <-- 8KB of data
> >   frag_page 0xc146bca0 offset 0 size 2048
> >   frag_page 0xc146bcc0 offset 0 size 2048
> >   frag_page 0xc146bce0 offset 2048 size 2048
> >
> > As you can see the data is 2KB aligned.
> 
> So you suggest we could "sometimes" (maybe even "most of the time")
> get_page, assign to bvec, submit, and on completion adjust skb for the
> "recvmsg" that never happens.
> We'd still need the "slowpath" memcpy code for those fragments that
> happen to be not aligned.
> And we'd need to convert DRBD's currently blocking network IO
> into something that uses the sk_*_callbacks directly.
> 
> But yes, this seems to be possible.

After reading the code more closely, it looks to me that it is not
necessary to use the callbacks - tcp_read_sock() can be used
"synchronously" and can be mixed with regular sock_recvmsg(). I put
together a prototype implementation that integrates fastpath/slowpath
(see the patch below). It worked fine in my limited testing and pretty
much completely eliminated receive data copying for /dev/drbd0. This
prototype is just a proof of concept and not a "commercial strength"
code by any stretch (at the very least it needs to include signal
handling and use drbd memory pools for page allocation). But hopefully
it would be useful if you decide to include zero copy receive feature in
DRBD.

I also modified the ixgbe 10Gig driver to use 4KB aligned buffers instead
of 2KB. This further reduced the overhead, but it is not necessary - the
patch works with the original alignment. 
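
As a reading aid for the fastpath/slowpath split, here is a simplified user-space sketch of the eligibility test the patch applies to each fragment (page offset and size must both be multiples of 512) and of the merge rule for fragments that continue the previous one on the same page. The names `sketch_frag`, `sketch_fastpath_bytes()` and `sketch_bvec_count()` are hypothetical, and the sketch ignores the starting offset within the skb and the FIN/RST checks that the real callback performs.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_ALIGN 512u   /* same value as DRBD_SKB_DATA_ALLIGNMENT */

/* Simplified stand-in for one received skb fragment. */
struct sketch_frag {
    unsigned long page;     /* page identity, e.g. 0xc146ba60 in the dump */
    unsigned int  offset;   /* offset of the data within that page */
    unsigned int  size;     /* bytes of data in the fragment */
};

static int sketch_aligned(unsigned int x)
{
    return (x & (SKETCH_ALIGN - 1)) == 0;
}

/*
 * Bytes that could be handed to a bio without copying, counted from the
 * first fragment up to (not including) the first misaligned one. The rest
 * would have to go through the copying slow path.
 */
static size_t sketch_fastpath_bytes(const struct sketch_frag *frags, size_t n)
{
    size_t i, bytes = 0;

    assert(frags != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (!sketch_aligned(frags[i].offset) ||
            !sketch_aligned(frags[i].size))
            break;
        bytes += frags[i].size;
    }
    return bytes;
}

/*
 * Number of bio_vec entries needed when a fragment that directly follows
 * the previous one on the same page is merged into it.
 */
static size_t sketch_bvec_count(const struct sketch_frag *frags, size_t n)
{
    size_t i, vecs = 0;

    for (i = 0; i < n; i++) {
        if (i > 0 && frags[i].page == frags[i - 1].page &&
            frags[i - 1].offset + frags[i - 1].size == frags[i].offset)
            continue;       /* contiguous with the previous fragment */
        vecs++;
    }
    return vecs;
}
```

Fed with the four fragments of one 8KB skb from the dump (2KB each, on four different pages), the sketch reports 8192 fast-path bytes in four vectors. Two 2KB fragments sharing one page at offsets 0 and 2048 would merge into a single vector, which is the case a page-aligned receive buffer makes possible.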

> 
> > > > +     /* give a worker thread a chance to pick up the request */
> > > > +	if (remote) {
> > > > +            if (!in_atomic())
> > > > +                    schedule();
> > >
> > > You may well drop the if (!in_atomic()),
> > > it cannot possibly be in atomic context there.
> >
> > if (!in_atomic()) is paranoia ;-)
> 
> There are several potentially sleeping functions called earlier from
> this context, so it would probably have BUG()ed before if it were atomic.

Agree.

> 
> > > Also, the immediately preceding spin_unlock_irq() is a pre-emption
> > > point.  So actually this should not even be necessary.
> >
> > It is necessary in our case - our kernel is compiled without
> > CONFIG_PREEMPT so threads are not preemptable in the kernel. So maybe
> > another drbd configuration option would be useful here.
> >
> > > > diff -aur src.orig/drbd/drbd_worker.c src/drbd/drbd_worker.c
> > > > --- src.orig/drbd/drbd_worker.c	2010-04-01 15:47:54.000000000 -0400
> > > > +++ src/drbd/drbd_worker.c	2010-04-26 18:25:17.000000000 -0400
> > > > @@ -1237,6 +1237,9 @@
> > > >
> > > >  	sprintf(current->comm, "drbd%d_worker", mdev_to_minor(mdev));
> > > >
> > > > +	current->policy = SCHED_RR;  /* Make this a realtime task! */
> > > > +	current->rt_priority = 2;    /* more important than all other tasks */
> > > > +
> > >
> > > Not sure about this.
> > > I don't really want to do crypto operations
> > > from a real time kernel thread.
> >
> > Sure, I agree. Though in our case we do not use crypto stuff. So how
> > about one more drbd configuration option? ;-)
> 
> We'll talk this through.  But please try the mentioned drbdsetup
> cpu-mask stuff, that should make the drbd worker thread send the request
> from another CPU even while this context is still submitting it locally.
> 

Thanks, Lars. I tried your suggestion. It did not help on its own. But
when I combined it with making the worker thread run at real time
priority 2, scheduling delay was almost as small as with the schedule()
hack - one or two usecs worse. This hack might still be useful on
single-CPU systems.

---------------- zero copy receive patch

diff -awur src-a/drbd/drbd_int.h src-b/drbd/drbd_int.h
--- src-a/drbd/drbd_int.h	2010-05-05 13:59:18.000000000 -0400
+++ src-b/drbd/drbd_int.h	2010-05-05 14:13:04.000000000 -0400
@@ -752,11 +752,13 @@
 	__EE_CONFLICT_PENDING,
 	__EE_MAY_SET_IN_SYNC,
 	__EE_IS_BARRIER,
+        __EE_NON_DRBD_PAGE,
 };
 #define EE_CALL_AL_COMPLETE_IO (1<<__EE_CALL_AL_COMPLETE_IO)
 #define EE_CONFLICT_PENDING    (1<<__EE_CONFLICT_PENDING)
 #define EE_MAY_SET_IN_SYNC     (1<<__EE_MAY_SET_IN_SYNC)
 #define EE_IS_BARRIER          (1<<__EE_IS_BARRIER)
+#define EE_NON_DRBD_PAGE       (1<<__EE_NON_DRBD_PAGE)
 
 /* global flag bits */
 enum {
diff -awur src-a/drbd/drbd_main.c src-b/drbd/drbd_main.c
--- src-a/drbd/drbd_main.c	2010-05-11 21:58:36.000000000 -0400
+++ src-b/drbd/drbd_main.c	2010-05-11 13:25:49.000000000 -0400
@@ -2174,7 +2174,8 @@
 {
 	struct bio_vec *bvec;
 	int i;
-        unsigned int len;
+        unsigned int togle = (flags & MSG_MORE) ? (flags & MSG_MORE) : 0;
+        unsigned int len = 0;
 
 	__bio_for_each_segment(bvec, bio, i, 0) {
 
@@ -2186,6 +2187,8 @@
 			return 0;
 
 		blk_add_trace_generic(mdev->rq_queue, NULL, 0, BLK_TA_GETRQ);
+                if (togle)
+                   flags ^= togle;
 	}
 
 	return 1;
diff -awur src-a/drbd/drbd_receiver.c src-b/drbd/drbd_receiver.c
--- src-a/drbd/drbd_receiver.c	2010-05-05 13:59:18.000000000 -0400
+++ src-b/drbd/drbd_receiver.c	2010-05-11 21:54:01.000000000 -0400
@@ -314,6 +314,53 @@
 	return NULL;
 }
 
+struct Tl_epoch_entry *drbd_alloc_zc_ee(struct drbd_conf *mdev,
+                                        u64 id,
+                                        sector_t sector,
+                                        unsigned int data_size,
+                                        gfp_t gfp_mask) __must_hold(local)
+{
+	struct Tl_epoch_entry *e;
+	struct bio *bio;
+
+	e = mempool_alloc(drbd_ee_mempool, gfp_mask);
+	if (!e) {
+		ERR("alloc_zc_ee: Allocation of an EE failed\n");
+		return NULL;
+	}
+
+	bio = bio_alloc(GFP_KERNEL, div_ceil(data_size, 512));
+	if (!bio) {
+		ERR("alloc_zc_ee: Allocation of a bio failed\n");
+		goto fail1;
+	}
+
+	bio->bi_bdev = mdev->bc->backing_bdev;
+	bio->bi_sector = sector;
+	bio->bi_private = e;
+	e->mdev = mdev;
+	e->sector = sector;
+	/* e->size = bio->bi_size; */
+
+	e->private_bio = bio;
+	e->block_id = id;
+	INIT_HLIST_NODE(&e->colision);
+	e->epoch = NULL;
+	e->flags = EE_NON_DRBD_PAGE;
+
+	MTRACE(TraceTypeEE, TraceLvlAll,
+	       INFO("allocated EE sec=%llus size=%u ee=%p\n",
+		    (unsigned long long)sector, data_size, e);
+	       );
+
+	return e;
+
+ fail1:
+	mempool_free(e, drbd_ee_mempool);
+
+	return NULL;
+}
+
 void drbd_free_ee(struct drbd_conf *mdev, struct Tl_epoch_entry *e)
 {
 	struct bio *bio = e->private_bio;
@@ -326,6 +373,9 @@
 	       );
 
 	__bio_for_each_segment(bvec, bio, i, 0) {
+                if (e->flags & EE_NON_DRBD_PAGE) 
+                    __free_page(bvec->bv_page);
+                else 
 		drbd_pp_free(mdev, bvec->bv_page);
 	}
 
@@ -598,6 +648,184 @@
 	return rv;
 }
 
+typedef struct _drbd_read_descriptor {
+        read_descriptor_t rd_desc;
+        int               rd_vcnt;
+        struct bio_vec    *rd_vec;
+} drbd_read_descriptor_t; 
+
+#define DRBD_SKB_DATA_ALLIGNMENT (512)
+#define DRBD_DATA_ALIGNED(_x) (((_x) & (DRBD_SKB_DATA_ALLIGNMENT-1)) == 0)
+
+static int drbd_tcp_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
+                              unsigned int offset, size_t skb_len)
+{
+        drbd_read_descriptor_t *drbd_rd_desc = 
+           (drbd_read_descriptor_t *)rd_desc;
+
+        struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[0];
+        struct bio_vec  *bvec = &drbd_rd_desc->rd_vec[drbd_rd_desc->rd_vcnt];
+        size_t ds = rd_desc->count - rd_desc->written;
+        size_t len =  min_t(size_t, ds, skb_len);
+        int    i, consumed = 0;
+
+        BUG_ON(offset >= skb->len);
+        BUG_ON(rd_desc->count <= rd_desc->written);
+
+        if (skb->h.th->fin || skb->h.th->rst || offset < skb_headlen(skb))
+            goto done;
+
+        offset -= skb_headlen(skb);
+
+        if (!DRBD_DATA_ALIGNED(offset))
+            goto done;
+
+        /* find the fragment */
+        for (i = 0; i < skb_shinfo(skb)->nr_frags; i++, frag++) {
+
+            if (offset < frag->size) 
+                break;
+            offset -= frag->size;
+        }
+
+        /* we'll optimize only "whole fragment consumption" case */
+        if (offset != 0)
+            goto done;
+
+        for (;i < skb_shinfo(skb)->nr_frags && len > 0; i++, frag++) {
+
+            /* make sure the fragment is aligned */
+            if (!DRBD_DATA_ALIGNED(frag->size) || 
+                !DRBD_DATA_ALIGNED(frag->page_offset))
+                goto done;
+
+            ds = min_t(size_t, len, frag->size);
+
+            BUG_ON((consumed + ds) > skb_len);
+
+            /* check if the fragment is mergeable with the previous one */
+            if (drbd_rd_desc->rd_vcnt > 0) {
+
+                struct bio_vec *bvec_prev = bvec - 1;
+
+                if (bvec_prev->bv_page == frag->page &&
+                    (bvec_prev->bv_offset + bvec_prev->bv_len) == 
+                    frag->page_offset) {
+
+                    bvec_prev->bv_len += ds;
+                    consumed += ds;
+
+                    continue;
+                }
+            }
+
+            /* consume the fragment */
+            get_page(frag->page);
+            bvec->bv_page = frag->page;
+            bvec->bv_len = ds;
+            bvec->bv_offset = frag->page_offset;
+
+            bvec++;
+            drbd_rd_desc->rd_vcnt++;
+            consumed += ds;
+            len -= ds;
+        }
+        
+ done:
+        drbd_rd_desc->rd_desc.written += consumed;
+
+        /* mark misalignment case */
+        if (consumed < len || rd_desc->count == rd_desc->written)
+            drbd_rd_desc->rd_desc.count = 0;
+
+        return consumed;
+}
+
+STATIC int drbd_zc_recv(struct drbd_conf *mdev, 
+                        drbd_read_descriptor_t *drbd_rd_desc, size_t size)
+{
+	   int i, rc = 0, rr, ds = (int)size, rv = (int) size;
+
+        struct sock *sk = mdev->data.socket->sk;
+        struct page *page;
+
+	   drbd_rd_desc->rd_desc.count = size;
+	   drbd_rd_desc->rd_desc.arg.data = mdev;
+
+        /* try to get the data through tcp_read_sock first */
+
+        lock_sock(sk);
+
+        while (ds > 0) {
+
+            long timeo = sock_rcvtimeo(sk, 0);
+
+            if ((rc = sk_wait_data(sk, &timeo)) == 0)
+                break;
+
+            read_lock(&sk->sk_callback_lock);
+            rc = tcp_read_sock(sk, &drbd_rd_desc->rd_desc, drbd_tcp_data_recv);
+            read_unlock(&sk->sk_callback_lock);
+                
+            if (rc < 0) {
+                release_sock(sk);
+                drbd_force_state(mdev, NS(conn, BrokenPipe));
+                rv = -EIO;
+                goto done;
+            }
+
+            D_ASSERT(drbd_rd_desc->rd_desc.written <= size);
+
+            ds = size - drbd_rd_desc->rd_desc.written;
+            
+            if (ds > 0 && drbd_rd_desc->rd_desc.count == 0) 
+                break;
+        }
+
+        release_sock(sk);
+
+        /* get the rest of the data, if any, through regular sock_recvmsg */
+        while(ds > 0) {
+
+            struct bio_vec *bvec = 
+                &drbd_rd_desc->rd_vec[drbd_rd_desc->rd_vcnt];
+
+            if ((page = alloc_page(GFP_KERNEL)) == NULL) {
+                rv = -ENOMEM;
+                goto done;
+            }
+
+            bvec->bv_page = page;
+            bvec->bv_len = min_t(int, ds, PAGE_SIZE);
+            bvec->bv_offset = 0;
+            drbd_rd_desc->rd_vcnt++;
+
+            rr = drbd_recv(mdev, kmap(page), bvec->bv_len);
+            kunmap(page);
+
+            if (rr != min_t(int, ds, PAGE_SIZE)) {
+                    
+                drbd_WARN("short read receiving data: read %d expected %d\n",
+                          rr, min_t(int, ds, PAGE_SIZE));
+                rv = -EIO;
+                goto done;
+            }
+
+            ds -= rr;
+        }
+
+ done:
+        if (rv < 0) {
+
+            /* free the pages */
+            for (i = 0; i < drbd_rd_desc->rd_vcnt; i++) 
+                __free_page(drbd_rd_desc->rd_vec[i].bv_page); 
+        }
+
+	return rv;
+}
+
+
 STATIC struct socket *drbd_try_connect(struct drbd_conf *mdev)
 {
 	const char *what;
@@ -1352,6 +1580,7 @@
 	int dgs, ds, i, rr;
 	void *dig_in = mdev->int_dig_in;
 	void *dig_vv = mdev->int_dig_vv;
+        int use_zc = (mdev->minor == 0) ? 1 : 0;
 
 	dgs = (mdev->agreed_pro_version >= 87 && mdev->integrity_r_tfm) ?
 		crypto_hash_digestsize(mdev->integrity_r_tfm) : 0;
@@ -1370,11 +1599,62 @@
 	ERR_IF(data_size &  0x1ff) return NULL;
 	ERR_IF(data_size >  DRBD_MAX_SEGMENT_SIZE) return NULL;
 
+        if (use_zc) 
+            e = drbd_alloc_zc_ee(mdev, id, sector, data_size, GFP_KERNEL);
+        else
 	e = drbd_alloc_ee(mdev, id, sector, data_size, GFP_KERNEL);
 	if (!e)
 		return NULL;
 	bio = e->private_bio;
 	ds = data_size;
+
+        if (use_zc) {
+            
+            drbd_read_descriptor_t drbd_rd_desc;
+            int err = 0;
+
+            memset(&drbd_rd_desc, 0, sizeof(drbd_rd_desc));
+            drbd_rd_desc.rd_vec = bio->bi_io_vec;
+
+            if ((rr = drbd_zc_recv(mdev, &drbd_rd_desc, ds)) != ds) {
+                drbd_free_ee(mdev, e);
+                drbd_WARN("short read receiving data: read %d expected %d\n",
+                          rr, ds);
+                return NULL;
+            }
+
+            for (i = 0; i < drbd_rd_desc.rd_vcnt; i++) {
+                
+                struct bio_vec *bvec = &bio->bi_io_vec[i];
+                page = bvec->bv_page;
+
+                if (err) {
+                    __free_page(page);
+                }
+                else if (!bio_add_page(bio, page, bvec->bv_len, 
+                                       bvec->bv_offset)) {
+
+                    ERR("alloc_ee: bio_add_page(s=%llu,"
+                        "data_size=%u,ds=%u) failed\n",
+                        (unsigned long long)sector, data_size, ds);
+                    __free_page(page);
+                    err = -1;
+                }
+                else {
+                    ds -= bvec->bv_len;
+                }
+            }
+
+            if (err) {
+                drbd_free_ee(mdev, e);
+                return NULL;
+            }
+
+            D_ASSERT(data_size == bio->bi_size);
+
+            e->size = bio->bi_size;
+        }
+        else {
 	bio_for_each_segment(bvec, bio, i) {
 		page = bvec->bv_page;
 		rr = drbd_recv(mdev, kmap(page), min_t(int, ds, PAGE_SIZE));
@@ -1387,6 +1667,7 @@
 		}
 		ds -= rr;
 	}
+        }
 
 	if (dgs) {
 		drbd_csum(mdev, mdev->integrity_r_tfm, bio, dig_vv);


* Re: [Drbd-dev] DRBD small synchronous writes performanceimprovements
  2010-05-12  3:50   ` Guzovsky, Eduard
@ 2010-05-12  8:16     ` Lars Ellenberg
  0 siblings, 0 replies; 6+ messages in thread
From: Lars Ellenberg @ 2010-05-12  8:16 UTC (permalink / raw)
  To: drbd-dev

On Tue, May 11, 2010 at 11:50:54PM -0400, Guzovsky, Eduard wrote:
> Hi Lars,
> 
> My comments and a zero copy receive patch are inline.

Good stuff.
Review will take some time, I want to experiment with this a bit,
but am currently deep into something else.

Thanks, 

	Lars

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
