netdev.vger.kernel.org archive mirror
* [Changelog] - Potential performance bottleneck for Linux TCP
       [not found] <HNEBLGGMEGLPMPPDOPMGKEAJCGAA.wenji@fnal.gov>
@ 2006-11-29 23:27 ` Wenji Wu
  2006-11-29 23:28   ` [patch 1/4] " Wenji Wu
  2006-11-29 23:36   ` [Changelog] " Martin Bligh
  2006-11-29 23:42 ` Bug 7596 " Andrew Morton
  2006-11-30  1:01 ` David Miller
  2 siblings, 2 replies; 18+ messages in thread
From: Wenji Wu @ 2006-11-29 23:27 UTC (permalink / raw)
  To: netdev, davem, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 846 bytes --]


From: Wenji Wu <wenji@fnal.gov>

Greetings,

For Linux TCP, when a network application makes a system call to move
data from the socket's receive buffer to user space by calling
tcp_recvmsg(), the socket is locked. During that period, all incoming
packets for the TCP socket go to the backlog queue without being
TCP-processed. Since Linux 2.6 can be preempted mid-task, if the
receiving application's timeslice expires and it is moved to the
expired array while the socket is locked, the packets in the backlog
queue are not TCP-processed until the application resumes execution.
If the system is heavily loaded, TCP can easily RTO on the sender
side.

Attached is the Changelog for the patch.

best regards,

wenji

Wenji Wu
Network Researcher
Fermilab, MS-368
P.O. Box 500
Batavia, IL, 60510
(Email): wenji@fnal.gov
(O): 001-630-840-4541

[-- Attachment #2: Changelog.txt --]
[-- Type: text/plain, Size: 2988 bytes --]


From: Wenji Wu <wenji@fnal.gov>

- Subject

Potential performance bottleneck for Linux TCP (2.6 Desktop, Low-latency Desktop)


- Why the kernel needed patching

For Linux TCP, when a network application makes a system call to move data
from the socket's receive buffer to user space by calling tcp_recvmsg(), the
socket is locked. During that period, all incoming packets for the TCP socket
go to the backlog queue without being TCP-processed. Since Linux 2.6 can be
preempted mid-task, if the receiving application's timeslice expires and it is
moved to the expired array while the socket is locked, the packets in the
backlog queue are not TCP-processed until the application resumes execution.
If the system is heavily loaded, TCP can easily RTO on the sender side.

- The overall design approach in the patch

The underlying idea is that when there are packets waiting on the prequeue
or backlog queue, the data-receiving process should not be off the CPU for long.

- Implementation details

We have modified the Linux process scheduling policy and tcp_recvmsg().

To summarize, the solution works as follows: 

An expired data-receiving process with packets waiting on the backlog queue or
prequeue is moved to the active array, instead of to the expired array as usual.
More often than not, the expired data-receiving process will simply continue to run.
Even when it doesn't, the wait before it resumes execution is greatly reduced.
However, this gives the process extra runs compared to other processes in the runqueue.

For the sake of fairness, such a process is labeled with the extrarun_flag.

Also consider two facts:

(1) the resumed process will continue its execution within tcp_recvmsg();
(2) tcp_recvmsg() does not return to user space until the prequeue and backlog queue are drained.

For the sake of fairness, tcp_recvmsg() is modified as follows: after the prequeue
and backlog queue are drained, and before tcp_recvmsg() returns to user space, any
process labeled with the extrarun_flag calls yield() to explicitly give the CPU to
other processes in the runqueue. yield() works by removing the process from the
active array (where it currently is, because it is running) and inserting it into
the expired array.

Also, to prevent processes in the expired array from starving, a special rule is
added to the process scheduler (the same rule already used for interactive processes):
an expired process is moved to the expired array regardless of its backlog status
whenever processes in the expired array are starved.
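
Below is a compact user-space model of this enqueue decision (illustration only,
not part of the patch set; "interactive" and "expired_starving" stand in for the
kernel's TASK_INTERACTIVE() and EXPIRED_STARVING() macros, and the flag names
follow the patches):

#include <stdbool.h>
#include <stdio.h>

struct task {
	bool backlog_flag;	/* packets are waiting on this task's TCP backlog/prequeue */
	bool extrarun_flag;	/* task got an extra run; tcp_recvmsg() will yield() later */
	bool interactive;	/* stand-in for TASK_INTERACTIVE(p) */
};

enum array { ACTIVE, EXPIRED };

/* Which priority array does a task whose timeslice just expired go back to? */
static enum array requeue_expired(struct task *p, bool expired_starving)
{
	if (!p->backlog_flag) {
		/* stock 2.6 O(1) scheduler rule */
		return (!p->interactive || expired_starving) ? EXPIRED : ACTIVE;
	}
	/* task holds pending TCP work: keep it runnable unless the
	 * expired array is already starving, and remember the extra run */
	if (expired_starving)
		return EXPIRED;
	if (!p->interactive)
		p->extrarun_flag = true;
	return ACTIVE;
}

int main(void)
{
	struct task p = { .backlog_flag = true, .interactive = false };

	printf("array=%s extrarun=%d\n",
	       requeue_expired(&p, false) == ACTIVE ? "active" : "expired",
	       (int)p.extrarun_flag);
	return 0;
}

Patch 3/4 below applies the same branch structure inside kernel/sched.c.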

Changed files:

kernel/sched.c
kernel/fork.c
include/linux/sched.h
net/ipv4/tcp.c

- Testing results

The proposed solution trades off a small amount of fairness to resolve the TCP
performance bottleneck; it does not cause a serious fairness problem.

The patch is against Linux kernel 2.6.14, Desktop and Low-latency Desktop configurations.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [patch 1/4] - Potential performance bottleneck for Linux TCP
  2006-11-29 23:27 ` [Changelog] - Potential performance bottleneck for Linux TCP Wenji Wu
@ 2006-11-29 23:28   ` Wenji Wu
  2006-11-29 23:29     ` [patch 2/4] " Wenji Wu
  2006-11-30  0:53     ` [patch 1/4] " David Miller
  2006-11-29 23:36   ` [Changelog] " Martin Bligh
  1 sibling, 2 replies; 18+ messages in thread
From: Wenji Wu @ 2006-11-29 23:28 UTC (permalink / raw)
  To: wenji, netdev, davem, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 832 bytes --]


From: Wenji Wu <wenji@fnal.gov>

Greetings,

For Linux TCP, when a network application makes a system call to move
data from the socket's receive buffer to user space by calling
tcp_recvmsg(), the socket is locked. During that period, all incoming
packets for the TCP socket go to the backlog queue without being
TCP-processed. Since Linux 2.6 can be preempted mid-task, if the
receiving application's timeslice expires and it is moved to the
expired array while the socket is locked, the packets in the backlog
queue are not TCP-processed until the application resumes execution.
If the system is heavily loaded, TCP can easily RTO on the sender
side.

Attached is patch 1/4.

best regards,

wenji

Wenji Wu
Network Researcher
Fermilab, MS-368
P.O. Box 500
Batavia, IL, 60510
(Email): wenji@fnal.gov
(O): 001-630-840-4541

[-- Attachment #2: tcp.c.patch --]
[-- Type: application/octet-stream, Size: 553 bytes --]

--- linux-2.6.14-old/net/ipv4/tcp.c	2006-11-29 16:24:56.000000000 -0600
+++ linux-2.6.14/net/ipv4/tcp.c	2006-11-29 11:25:57.000000000 -0600
@@ -1109,6 +1109,8 @@
 	int target;		/* Read at least this many bytes */
 	long timeo;
 	struct task_struct *user_recv = NULL;
+	
+	current->backlog_flag = 1;
 
 	lock_sock(sk);
 
@@ -1394,6 +1396,13 @@
 
 	TCP_CHECK_TIMER(sk);
 	release_sock(sk);
+
+	current->backlog_flag = 0;
+	if(current->extrarun_flag == 1){
+		current->extrarun_flag = 0;
+		yield();
+	}
+
 	return copied;
 
 out:

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [patch 2/4] - Potential performance bottleneck for Linux TCP
  2006-11-29 23:28   ` [patch 1/4] " Wenji Wu
@ 2006-11-29 23:29     ` Wenji Wu
  2006-11-29 23:30       ` [patch 3/4] " Wenji Wu
  2006-11-30  0:53     ` [patch 1/4] " David Miller
  1 sibling, 1 reply; 18+ messages in thread
From: Wenji Wu @ 2006-11-29 23:29 UTC (permalink / raw)
  To: netdev, davem, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 832 bytes --]


From: Wenji Wu <wenji@fnal.gov>

Greetings,

For Linux TCP, when a network application makes a system call to move
data from the socket's receive buffer to user space by calling
tcp_recvmsg(), the socket is locked. During that period, all incoming
packets for the TCP socket go to the backlog queue without being
TCP-processed. Since Linux 2.6 can be preempted mid-task, if the
receiving application's timeslice expires and it is moved to the
expired array while the socket is locked, the packets in the backlog
queue are not TCP-processed until the application resumes execution.
If the system is heavily loaded, TCP can easily RTO on the sender
side.

Attached is patch 2/4.

best regards,

wenji

Wenji Wu
Network Researcher
Fermilab, MS-368
P.O. Box 500
Batavia, IL, 60510
(Email): wenji@fnal.gov
(O): 001-630-840-4541

[-- Attachment #2: sched.h.patch --]
[-- Type: application/octet-stream, Size: 477 bytes --]

--- linux-2.6.14-old/include/linux/sched.h	2006-11-29 16:25:42.000000000 -0600
+++ linux-2.6.14/include/linux/sched.h	2006-11-29 10:32:55.000000000 -0600
@@ -813,6 +813,9 @@
 	int cpuset_mems_generation;
 #endif
 	atomic_t fs_excl;	/* holding fs exclusive resources */
+	int backlog_flag; 	/* packets wait in tcp backlog queue flag */
+	int extrarun_flag;	/* extra run flag for TCP performance */
+
 };
 
 static inline pid_t process_group(struct task_struct *tsk)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [patch 3/4] - Potential performance bottleneck for Linux TCP
  2006-11-29 23:29     ` [patch 2/4] " Wenji Wu
@ 2006-11-29 23:30       ` Wenji Wu
  2006-11-29 23:31         ` [patch 4/4] " Wenji Wu
  0 siblings, 1 reply; 18+ messages in thread
From: Wenji Wu @ 2006-11-29 23:30 UTC (permalink / raw)
  To: netdev, davem, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 832 bytes --]


From: Wenji Wu <wenji@fnal.gov>

Greetings,

For Linux TCP, when a network application makes a system call to move
data from the socket's receive buffer to user space by calling
tcp_recvmsg(), the socket is locked. During that period, all incoming
packets for the TCP socket go to the backlog queue without being
TCP-processed. Since Linux 2.6 can be preempted mid-task, if the
receiving application's timeslice expires and it is moved to the
expired array while the socket is locked, the packets in the backlog
queue are not TCP-processed until the application resumes execution.
If the system is heavily loaded, TCP can easily RTO on the sender
side.

Attached is patch 3/4.

best regards,

wenji

Wenji Wu
Network Researcher
Fermilab, MS-368
P.O. Box 500
Batavia, IL, 60510
(Email): wenji@fnal.gov
(O): 001-630-840-4541

[-- Attachment #2: sched.c.patch --]
[-- Type: application/octet-stream, Size: 1075 bytes --]

--- linux-2.6.14-old/kernel/sched.c	2006-11-29 16:22:22.000000000 -0600
+++ linux-2.6.14/kernel/sched.c	2006-11-29 11:29:34.000000000 -0600
@@ -2598,12 +2598,24 @@
 
 		if (!rq->expired_timestamp)
 			rq->expired_timestamp = jiffies;
-		if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
-			enqueue_task(p, rq->expired);
-			if (p->static_prio < rq->best_expired_prio)
+		if(p->backlog_flag == 0){
+			if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
+				enqueue_task(p, rq->expired);
+				if (p->static_prio < rq->best_expired_prio)
+					rq->best_expired_prio = p->static_prio;
+			} else
+				enqueue_task(p, rq->active);
+		} else {
+			if(EXPIRED_STARVING(rq)) {
+				enqueue_task(p,rq->expired);
+				if (p->static_prio < rq->best_expired_prio)
 				rq->best_expired_prio = p->static_prio;
-		} else
-			enqueue_task(p, rq->active);
+			} else {
+				if(!TASK_INTERACTIVE(p))
+					p->extrarun_flag = 1;
+				enqueue_task(p,rq->active);
+			}	
+		}
 	} else {
 		/*
 		 * Prevent a too long timeslice allowing a task to monopolize

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [patch 4/4] - Potential performance bottleneck for Linux TCP
  2006-11-29 23:30       ` [patch 3/4] " Wenji Wu
@ 2006-11-29 23:31         ` Wenji Wu
  0 siblings, 0 replies; 18+ messages in thread
From: Wenji Wu @ 2006-11-29 23:31 UTC (permalink / raw)
  To: netdev, davem, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 831 bytes --]

From: Wenji Wu <wenji@fnal.gov>

Greetings,

For Linux TCP, when a network application makes a system call to move
data from the socket's receive buffer to user space by calling
tcp_recvmsg(), the socket is locked. During that period, all incoming
packets for the TCP socket go to the backlog queue without being
TCP-processed. Since Linux 2.6 can be preempted mid-task, if the
receiving application's timeslice expires and it is moved to the
expired array while the socket is locked, the packets in the backlog
queue are not TCP-processed until the application resumes execution.
If the system is heavily loaded, TCP can easily RTO on the sender
side.

Attached is patch 4/4.

best regards,

wenji

Wenji Wu
Network Researcher
Fermilab, MS-368
P.O. Box 500
Batavia, IL, 60510
(Email): wenji@fnal.gov
(O): 001-630-840-4541

[-- Attachment #2: fork.c.patch --]
[-- Type: application/octet-stream, Size: 682 bytes --]

--- linux-2.6.14-old/kernel/fork.c	2006-11-29 16:22:25.000000000 -0600
+++ linux-2.6.14/kernel/fork.c	2006-11-29 11:23:20.000000000 -0600
@@ -868,7 +868,7 @@
  *
  * It copies the registers, and all the appropriate
  * parts of the process environment (as per the clone
- * flags). The actual kick-off is left to the caller.
+ * flags). The actual kick-off is left to the caller.copy_process
  */
 static task_t *copy_process(unsigned long clone_flags,
 				 unsigned long stack_start,
@@ -1154,6 +1154,9 @@
 	write_unlock_irq(&tasklist_lock);
 	retval = 0;
 
+	p->backlog_flag = 0;
+	p->extrarun_flag = 0;
+
 fork_out:
 	if (retval)
 		return ERR_PTR(retval);

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Changelog] - Potential performance bottleneck for Linux TCP
  2006-11-29 23:27 ` [Changelog] - Potential performance bottleneck for Linux TCP Wenji Wu
  2006-11-29 23:28   ` [patch 1/4] " Wenji Wu
@ 2006-11-29 23:36   ` Martin Bligh
  1 sibling, 0 replies; 18+ messages in thread
From: Martin Bligh @ 2006-11-29 23:36 UTC (permalink / raw)
  To: wenji; +Cc: netdev, davem, akpm, linux-kernel

Wenji Wu wrote:
> From: Wenji Wu <wenji@fnal.gov>
> 
> Greetings,
> 
> For Linux TCP, when a network application makes a system call to move
> data from the socket's receive buffer to user space by calling
> tcp_recvmsg(), the socket is locked. During that period, all incoming
> packets for the TCP socket go to the backlog queue without being
> TCP-processed. Since Linux 2.6 can be preempted mid-task, if the
> receiving application's timeslice expires and it is moved to the
> expired array while the socket is locked, the packets in the backlog
> queue are not TCP-processed until the application resumes execution.
> If the system is heavily loaded, TCP can easily RTO on the sender
> side.


So how much difference did this patch actually make, and to what
benchmark?

> The patch is for Linux kernel 2.6.14 Deskop and Low-latency Desktop

The patch doesn't seem to be attached? Also, it would be better to make
it against the latest kernel version (2.6.19) ... 2.6.14 is rather old ;-)

M

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Bug 7596 - Potential performance bottleneck for Linux TCP
       [not found] <HNEBLGGMEGLPMPPDOPMGKEAJCGAA.wenji@fnal.gov>
  2006-11-29 23:27 ` [Changelog] - Potential performance bottleneck for Linux TCP Wenji Wu
@ 2006-11-29 23:42 ` Andrew Morton
  2006-11-30  6:32   ` Ingo Molnar
  2006-11-30  1:01 ` David Miller
  2 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2006-11-29 23:42 UTC (permalink / raw)
  To: wenji; +Cc: netdev, davem, linux-kernel

On Wed, 29 Nov 2006 17:22:10 -0600
Wenji Wu <wenji@fnal.gov> wrote:

> From: Wenji Wu <wenji@fnal.gov>
> 
> Greetings,
> 
> For Linux TCP, when a network application makes a system call to move
> data from the socket's receive buffer to user space by calling
> tcp_recvmsg(), the socket is locked. During that period, all incoming
> packets for the TCP socket go to the backlog queue without being
> TCP-processed. Since Linux 2.6 can be preempted mid-task, if the
> receiving application's timeslice expires and it is moved to the
> expired array while the socket is locked, the packets in the backlog
> queue are not TCP-processed until the application resumes execution.
> If the system is heavily loaded, TCP can easily RTO on the sender
> side.
> 
> Attached is the detailed description of the problem and one possible
> solution.

Thanks.  The attachment will be too large for the mailing-list servers so I
uploaded a copy to
http://userweb.kernel.org/~akpm/Linux-TCP-Bottleneck-Analysis-Report.pdf

From a quick peek it appears that you're getting around 10% improvement in
TCP throughput, best case.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
  2006-11-29 23:28   ` [patch 1/4] " Wenji Wu
  2006-11-29 23:29     ` [patch 2/4] " Wenji Wu
@ 2006-11-30  0:53     ` David Miller
  2006-11-30  1:08       ` Andrew Morton
  1 sibling, 1 reply; 18+ messages in thread
From: David Miller @ 2006-11-30  0:53 UTC (permalink / raw)
  To: wenji; +Cc: netdev, akpm, linux-kernel


Please, it is very difficult to review your work the way you have
submitted this patch as a set of 4 patches.  These patches have not
been split up "logically", but rather they have been split up "per
file" with the same exact changelog message in each patch posting.
This is very clumsy, and impossible to review, and wastes a lot of
mailing list bandwidth.

We have an excellent file, called Documentation/SubmittingPatches, in
the kernel source tree, which explains exactly how to do this
correctly.

By splitting your patch into 4 patches, one for each file touched,
it is impossible to review your patch as a logical whole.

Please also provide your patch inline so people can just hit reply
in their mail reader client to quote your patch and comment on it.
This is impossible with the attachments you've used.

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Bug 7596 - Potential performance bottleneck for Linux TCP
       [not found] <HNEBLGGMEGLPMPPDOPMGKEAJCGAA.wenji@fnal.gov>
  2006-11-29 23:27 ` [Changelog] - Potential performance bottleneck for Linux TCP Wenji Wu
  2006-11-29 23:42 ` Bug 7596 " Andrew Morton
@ 2006-11-30  1:01 ` David Miller
  2 siblings, 0 replies; 18+ messages in thread
From: David Miller @ 2006-11-30  1:01 UTC (permalink / raw)
  To: wenji; +Cc: netdev, akpm, linux-kernel


The delays dealt with in your paper might actually help a highly
loaded server with lots of sockets and threads trying to communicate.

The packet-processing delays caused by the scheduling delay pace the
TCP sender by controlling the rate at which ACKs go back to that
sender.  Those ACKs go out paced to the rate at which the sleeping
TCP receiver gets back onto the CPU, and this causes the TCP sender
to naturally adjust to the overall processing rate of the receiver
system, on a per-connection basis.

Perhaps try a system with hundreds of processes and potentially
hundreds of thousands of TCP sockets, with thousands of unique sender
sites, and see what happens.

This is a topic similar to TSO, where we are trying to balance the
gains from batching work against the losses from gaps in the
communication stream.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
  2006-11-30  0:53     ` [patch 1/4] " David Miller
@ 2006-11-30  1:08       ` Andrew Morton
  2006-11-30  1:13         ` David Miller
  2006-11-30  6:04         ` Mike Galbraith
  0 siblings, 2 replies; 18+ messages in thread
From: Andrew Morton @ 2006-11-30  1:08 UTC (permalink / raw)
  To: David Miller; +Cc: wenji, netdev, linux-kernel

On Wed, 29 Nov 2006 16:53:11 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> 
> Please, it is very difficult to review your work the way you have
> submitted this patch as a set of 4 patches.  These patches have not
> been split up "logically", but rather they have been split up "per
> file" with the same exact changelog message in each patch posting.
> This is very clumsy, and impossible to review, and wastes a lot of
> mailing list bandwidth.
> 
> We have an excellent file, called Documentation/SubmittingPatches, in
> the kernel source tree, which explains exactly how to do this
> correctly.
> 
> By splitting your patch into 4 patches, one for each file touched,
> it is impossible to review your patch as a logical whole.
> 
> Please also provide your patch inline so people can just hit reply
> in their mail reader client to quote your patch and comment on it.
> This is impossible with the attachments you've used.
> 

Here you go - joined up, cleaned up, ported to mainline and test-compiled.

That yield() will need to be removed - yield()'s behaviour is truly awful
if the system is otherwise busy.  What is it there for?



From: Wenji Wu <wenji@fnal.gov>

For Linux TCP, when a network application makes a system call to move data
from the socket's receive buffer to user space by calling tcp_recvmsg(), the
socket is locked.  During this period, all incoming packets for the TCP
socket go to the backlog queue without being TCP-processed.

Since Linux 2.6 can be preempted mid-task, if the receiving application's
timeslice expires and it is moved to the expired array while the socket is
locked, the packets in the backlog queue are not TCP-processed until the
application resumes execution.  If the system is heavily loaded, TCP can
easily RTO on the sender side.



 include/linux/sched.h |    2 ++
 kernel/fork.c         |    3 +++
 kernel/sched.c        |   24 ++++++++++++++++++------
 net/ipv4/tcp.c        |    9 +++++++++
 4 files changed, 32 insertions(+), 6 deletions(-)

diff -puN net/ipv4/tcp.c~tcp-speedup net/ipv4/tcp.c
--- a/net/ipv4/tcp.c~tcp-speedup
+++ a/net/ipv4/tcp.c
@@ -1109,6 +1109,8 @@ int tcp_recvmsg(struct kiocb *iocb, stru
 	struct task_struct *user_recv = NULL;
 	int copied_early = 0;
 
+	current->backlog_flag = 1;
+
 	lock_sock(sk);
 
 	TCP_CHECK_TIMER(sk);
@@ -1468,6 +1470,13 @@ skip_copy:
 
 	TCP_CHECK_TIMER(sk);
 	release_sock(sk);
+
+	current->backlog_flag = 0;
+	if (current->extrarun_flag == 1){
+		current->extrarun_flag = 0;
+		yield();
+	}
+
 	return copied;
 
 out:
diff -puN include/linux/sched.h~tcp-speedup include/linux/sched.h
--- a/include/linux/sched.h~tcp-speedup
+++ a/include/linux/sched.h
@@ -1023,6 +1023,8 @@ struct task_struct {
 #ifdef	CONFIG_TASK_DELAY_ACCT
 	struct task_delay_info *delays;
 #endif
+	int backlog_flag; 	/* packets wait in tcp backlog queue flag */
+	int extrarun_flag;	/* extra run flag for TCP performance */
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
diff -puN kernel/sched.c~tcp-speedup kernel/sched.c
--- a/kernel/sched.c~tcp-speedup
+++ a/kernel/sched.c
@@ -3099,12 +3099,24 @@ void scheduler_tick(void)
 
 		if (!rq->expired_timestamp)
 			rq->expired_timestamp = jiffies;
-		if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
-			enqueue_task(p, rq->expired);
-			if (p->static_prio < rq->best_expired_prio)
-				rq->best_expired_prio = p->static_prio;
-		} else
-			enqueue_task(p, rq->active);
+		if (p->backlog_flag == 0) {
+			if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
+				enqueue_task(p, rq->expired);
+				if (p->static_prio < rq->best_expired_prio)
+					rq->best_expired_prio = p->static_prio;
+			} else
+				enqueue_task(p, rq->active);
+		} else {
+			if (expired_starving(rq)) {
+				enqueue_task(p,rq->expired);
+				if (p->static_prio < rq->best_expired_prio)
+					rq->best_expired_prio = p->static_prio;
+			} else {
+				if (!TASK_INTERACTIVE(p))
+					p->extrarun_flag = 1;
+				enqueue_task(p,rq->active);
+			}
+		}
 	} else {
 		/*
 		 * Prevent a too long timeslice allowing a task to monopolize
diff -puN kernel/fork.c~tcp-speedup kernel/fork.c
--- a/kernel/fork.c~tcp-speedup
+++ a/kernel/fork.c
@@ -1032,6 +1032,9 @@ static struct task_struct *copy_process(
 	clear_tsk_thread_flag(p, TIF_SIGPENDING);
 	init_sigpending(&p->pending);
 
+	p->backlog_flag = 0;
+	p->extrarun_flag = 0;
+
 	p->utime = cputime_zero;
 	p->stime = cputime_zero;
  	p->sched_time = 0;
_


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
  2006-11-30  1:08       ` Andrew Morton
@ 2006-11-30  1:13         ` David Miller
  2006-11-30  6:04         ` Mike Galbraith
  1 sibling, 0 replies; 18+ messages in thread
From: David Miller @ 2006-11-30  1:13 UTC (permalink / raw)
  To: akpm; +Cc: wenji, netdev, linux-kernel

From: Andrew Morton <akpm@osdl.org>
Date: Wed, 29 Nov 2006 17:08:35 -0800

> On Wed, 29 Nov 2006 16:53:11 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
> 
> > 
> > Please, it is very difficult to review your work the way you have
> > submitted this patch as a set of 4 patches.  These patches have not
> > been split up "logically", but rather they have been split up "per
> > file" with the same exact changelog message in each patch posting.
> > This is very clumsy, and impossible to review, and wastes a lot of
> > mailing list bandwidth.
> > 
> > We have an excellent file, called Documentation/SubmittingPatches, in
> > the kernel source tree, which explains exactly how to do this
> > correctly.
> > 
> > By splitting your patch into 4 patches, one for each file touched,
> > it is impossible to review your patch as a logical whole.
> > 
> > Please also provide your patch inline so people can just hit reply
> > in their mail reader client to quote your patch and comment on it.
> > This is impossible with the attachments you've used.
> > 
> 
> Here you go - joined up, cleaned up, ported to mainline and test-compiled.
> 
> That yield() will need to be removed - yield()'s behaviour is truly awful
> if the system is otherwise busy.  What is it there for?

What about simply turning off CONFIG_PREEMPT to fix this "problem"?

We otherwise always properly run the backlog (by doing a release_sock())
before going to sleep, except for the specific case of taking a page
fault during the copy to userspace.  It is only CONFIG_PREEMPT that
can cause this situation to occur in other circumstances, as far as I
can see.

We could also pepper tcp_recvmsg() with some very carefully placed
preemption disable/enable calls to deal with this even with
CONFIG_PREEMPT enabled.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
  2006-11-30  1:08       ` Andrew Morton
  2006-11-30  1:13         ` David Miller
@ 2006-11-30  6:04         ` Mike Galbraith
  1 sibling, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2006-11-30  6:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David Miller, wenji, netdev, linux-kernel

On Wed, 2006-11-29 at 17:08 -0800, Andrew Morton wrote:
> +		if (p->backlog_flag == 0) {
> +			if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
> +				enqueue_task(p, rq->expired);
> +				if (p->static_prio < rq->best_expired_prio)
> +					rq->best_expired_prio = p->static_prio;
> +			} else
> +				enqueue_task(p, rq->active);
> +		} else {
> +			if (expired_starving(rq)) {
> +				enqueue_task(p,rq->expired);
> +				if (p->static_prio < rq->best_expired_prio)
> +					rq->best_expired_prio = p->static_prio;
> +			} else {
> +				if (!TASK_INTERACTIVE(p))
> +					p->extrarun_flag = 1;
> +				enqueue_task(p,rq->active);
> +			}
> +		}

(oh my, doing that to the scheduler upsets my tummy, but that aside...)

I don't see how that can really solve anything.  "Interactive" tasks
that start using the CPU heavily can still preempt and keep the
special-cased CPU hog off the CPU for ages.  It also only takes one
task in the expired array to trigger the forced array switch on a
fully loaded CPU, and once any task hits the expired array, a stream
of wakeups can prevent the switch from completing for as long as you
can keep wakeups happening.

	-Mike

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Bug 7596 - Potential performance bottleneck for Linux TCP
  2006-11-29 23:42 ` Bug 7596 " Andrew Morton
@ 2006-11-30  6:32   ` Ingo Molnar
  2006-12-19 18:37     ` Stephen Hemminger
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2006-11-30  6:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: wenji, netdev, davem, linux-kernel


* Andrew Morton <akpm@osdl.org> wrote:

> > Attached is the detailed description of the problem and one possible 
> > solution.
> 
> Thanks.  The attachment will be too large for the mailing-list servers 
> so I uploaded a copy to 
> http://userweb.kernel.org/~akpm/Linux-TCP-Bottleneck-Analysis-Report.pdf
> 
> From a quick peek it appears that you're getting around 10% 
> improvement in TCP throughput, best case.

Wenji, have you tried renicing the receiving task (to, say, nice -20)
to see how much TCP throughput you get with a "background load of 10.0"?
(Similarly, you could also renice the background-load tasks to nice +19
and/or set their scheduling policy to SCHED_BATCH.)

as far as i can see, the numbers in the paper and the patch prove the 
following two points:

 - a task doing TCP receive with 10 other tasks running on the CPU will
   see lower TCP throughput than if it had the CPU for itself alone.

 - a patch that tweaks the scheduler to give the receiving task more
   timeslices (i.e. raises its nice level in essence) results in ...
   more timeslices, which results in higher receive numbers ...

so the most important thing to check would be, before any scheduler and 
TCP code change is considered: if you give the task higher priority 
/explicitly/, via nice -20, do the numbers improve? Similarly, if all 
the other "background load" tasks are reniced to nice +19 (or their 
policy is set to SCHED_BATCH), do you get a similar improvement?
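
A minimal wrapper for setting up that experiment might look like the sketch
below (illustrative only, not from the thread; the negative nice value needs
root or CAP_SYS_NICE, and SCHED_BATCH is defined by hand in case the libc
headers predate it):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <unistd.h>

#ifndef SCHED_BATCH
#define SCHED_BATCH 3	/* Linux value, if the libc headers do not know it yet */
#endif

/*
 * usage:  prio recv  <cmd> [args...]   run the receiver at nice -20
 *         prio batch <cmd> [args...]   run a background-load task under SCHED_BATCH
 */
int main(int argc, char **argv)
{
	struct sched_param sp = { .sched_priority = 0 };

	if (argc < 3) {
		fprintf(stderr, "usage: %s recv|batch cmd [args...]\n", argv[0]);
		return 1;
	}
	if (!strcmp(argv[1], "recv") && setpriority(PRIO_PROCESS, 0, -20))
		perror("setpriority");
	if (!strcmp(argv[1], "batch") && sched_setscheduler(0, SCHED_BATCH, &sp))
		perror("sched_setscheduler");
	execvp(argv[2], &argv[2]);
	perror("execvp");
	return 1;
}

Comparing throughput with the receiver at nice -20 on an unmodified kernel
would separate the pure scheduling effect from the proposed TCP changes.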

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Bug 7596 - Potential performance bottleneck for Linux TCP
  2006-11-30  6:32   ` Ingo Molnar
@ 2006-12-19 18:37     ` Stephen Hemminger
  2006-12-19 23:52       ` Herbert Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Stephen Hemminger @ 2006-12-19 18:37 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, wenji, netdev, davem, linux-kernel

I noticed this bit of discussion in tcp_recvmsg().  It implies that a
better queuing policy would be good, but the English is confusing
(Alexey?), so I'm not sure where to start.


> 		if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
> 			/* Install new reader */
> 			if (!user_recv && !(flags & (MSG_TRUNC | MSG_PEEK))) {
> 				user_recv = current;
> 				tp->ucopy.task = user_recv;
> 				tp->ucopy.iov = msg->msg_iov;
> 			}
> 
> 			tp->ucopy.len = len;
> 
> 			BUG_TRAP(tp->copied_seq == tp->rcv_nxt ||
> 				 (flags & (MSG_PEEK | MSG_TRUNC)));
> 
> 			/* Ugly... If prequeue is not empty, we have to
> 			 * process it before releasing socket, otherwise
> 			 * order will be broken at second iteration.
> 			 * More elegant solution is required!!!
> 			 *
> 			 * Look: we have the following (pseudo)queues:
> 			 *
> 			 * 1. packets in flight
> 			 * 2. backlog
> 			 * 3. prequeue
> 			 * 4. receive_queue
> 			 *
> 			 * Each queue can be processed only if the next ones
> 			 * are empty. At this point we have empty receive_queue.
> 			 * But prequeue _can_ be not empty after 2nd iteration,
> 			 * when we jumped to start of loop because backlog
> 			 * processing added something to receive_queue.
> 			 * We cannot release_sock(), because backlog contains
> 			 * packets arrived _after_ prequeued ones.
> 			 *
> 			 * Shortly, algorithm is clear --- to process all
> 			 * the queues in order. We could make it more directly,
> 			 * requeueing packets from backlog to prequeue, if
> 			 * is not empty. It is more elegant, but eats cycles,
> 			 * unfortunately.
> 			 */
> 			if (!skb_queue_empty(&tp->ucopy.prequeue))
> 				goto do_prequeue;
> 
> 			/* __ Set realtime policy in scheduler __ */
> 		}
> 
> 		if (copied >= target) {
> 			/* Do not sleep, just process backlog. */
> 			release_sock(sk);
> 			lock_sock(sk);
> 		} else
> 		

-- 
Stephen Hemminger <shemminger@osdl.org>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Bug 7596 - Potential performance bottleneck for Linux TCP
  2006-12-19 18:37     ` Stephen Hemminger
@ 2006-12-19 23:52       ` Herbert Xu
  2006-12-20  2:55         ` David Miller
  0 siblings, 1 reply; 18+ messages in thread
From: Herbert Xu @ 2006-12-19 23:52 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: mingo, akpm, wenji, netdev, davem, linux-kernel

Stephen Hemminger <shemminger@osdl.org> wrote:
> I noticed this bit of discussion in tcp_recvmsg. It implies that a better
> queuing policy would be good. But it is confusing English (Alexey?) so
> not sure where to start.

Actually I think the comment says that the current code isn't the
most elegant but is more efficient.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Bug 7596 - Potential performance bottleneck for Linux TCP
  2006-12-19 23:52       ` Herbert Xu
@ 2006-12-20  2:55         ` David Miller
  2006-12-20  5:11           ` Stephen Hemminger
  0 siblings, 1 reply; 18+ messages in thread
From: David Miller @ 2006-12-20  2:55 UTC (permalink / raw)
  To: herbert; +Cc: shemminger, mingo, akpm, wenji, netdev, linux-kernel

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 20 Dec 2006 10:52:19 +1100

> Stephen Hemminger <shemminger@osdl.org> wrote:
> > I noticed this bit of discussion in tcp_recvmsg. It implies that a better
> > queuing policy would be good. But it is confusing English (Alexey?) so
> > not sure where to start.
> 
> Actually I think the comment says that the current code isn't the
> most elegant but is more efficient.

It's just explaining the hierarchy of queues that need to
be purged, and in what order, for correctness.

Alexey added that code when I mentioned to him, right after
we added the prequeue, that it was possible to process the
normal backlog before the prequeue, which is illegal.
In fixing that bug, he added the comment we are discussing.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Bug 7596 - Potential performance bottleneck for Linux TCP
  2006-12-20  2:55         ` David Miller
@ 2006-12-20  5:11           ` Stephen Hemminger
  2006-12-20  5:15             ` David Miller
  0 siblings, 1 reply; 18+ messages in thread
From: Stephen Hemminger @ 2006-12-20  5:11 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, mingo, akpm, wenji, netdev, linux-kernel

On Tue, 19 Dec 2006 18:55:25 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Wed, 20 Dec 2006 10:52:19 +1100
> 
> > Stephen Hemminger <shemminger@osdl.org> wrote:
> > > I noticed this bit of discussion in tcp_recvmsg. It implies that a better
> > > queuing policy would be good. But it is confusing English (Alexey?) so
> > > not sure where to start.
> > 
> > Actually I think the comment says that the current code isn't the
> > most elegant but is more efficient.
> 
> It's just explaining the hierarchy of queues that need to
> be purged, and in what order, for correctness.
> 
> Alexey added that code when I mentioned to him, right after
> we added the prequeue, that it was possible process the
> normal backlog before the prequeue, which is illegal.
> In fixing that bug, he added the comment we are discussing.

It was the realtime/normal comments that piqued my interest.
Perhaps we should either tweak process priority or remove
the comments.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Bug 7596 - Potential performance bottleneck for Linux TCP
  2006-12-20  5:11           ` Stephen Hemminger
@ 2006-12-20  5:15             ` David Miller
  0 siblings, 0 replies; 18+ messages in thread
From: David Miller @ 2006-12-20  5:15 UTC (permalink / raw)
  To: shemminger; +Cc: herbert, mingo, akpm, wenji, netdev, linux-kernel

From: Stephen Hemminger <shemminger@osdl.org>
Date: Tue, 19 Dec 2006 21:11:24 -0800

> It was the realtime/normal comments that piqued my interest.
> Perhaps we should either tweak process priority or remove
> the comments.

I mentioned that to Linus once and he said the entire
idea was bogus.

With the recent tcp_recvmsg() preemption issue thread,
I agree with his sentiments even more than I did previously.

What needs to happen is to liberate the locking so that
input packet processing can occur in parallel with
tcp_recvmsg(), instead of doing this bogus backlog thing
which can wedge TCP ACK processing for an entire quantum
if we take a kernel preemption while the process has the
socket lock held.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2006-12-20  5:16 UTC | newest]

Thread overview: 18+ messages
-- links below jump to the message on this page --
     [not found] <HNEBLGGMEGLPMPPDOPMGKEAJCGAA.wenji@fnal.gov>
2006-11-29 23:27 ` [Changelog] - Potential performance bottleneck for Linux TCP Wenji Wu
2006-11-29 23:28   ` [patch 1/4] " Wenji Wu
2006-11-29 23:29     ` [patch 2/4] " Wenji Wu
2006-11-29 23:30       ` [patch 3/4] " Wenji Wu
2006-11-29 23:31         ` [patch 4/4] " Wenji Wu
2006-11-30  0:53     ` [patch 1/4] " David Miller
2006-11-30  1:08       ` Andrew Morton
2006-11-30  1:13         ` David Miller
2006-11-30  6:04         ` Mike Galbraith
2006-11-29 23:36   ` [Changelog] " Martin Bligh
2006-11-29 23:42 ` Bug 7596 " Andrew Morton
2006-11-30  6:32   ` Ingo Molnar
2006-12-19 18:37     ` Stephen Hemminger
2006-12-19 23:52       ` Herbert Xu
2006-12-20  2:55         ` David Miller
2006-12-20  5:11           ` Stephen Hemminger
2006-12-20  5:15             ` David Miller
2006-11-30  1:01 ` David Miller
