netdev.vger.kernel.org archive mirror
* Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
@ 2006-11-30  1:56 Wenji Wu
  2006-11-30  2:19 ` David Miller
  2006-11-30  9:33 ` Christoph Hellwig
  0 siblings, 2 replies; 36+ messages in thread
From: Wenji Wu @ 2006-11-30  1:56 UTC (permalink / raw)
  To: David Miller; +Cc: akpm, netdev, linux-kernel

Yes, when CONFIG_PREEMPT is disabled, the "problem" won't happen. That is why I put "for 2.6 desktop, low-latency desktop" in the uploaded paper. This "problem" happens only in the 2.6 Desktop and Low-latency Desktop configurations.

>We could also pepper tcp_recvmsg() with some very carefully placed preemption disable/enable calls to deal with this even with CONFIG_PREEMPT enabled.

I have also thought about this approach. But since the "problem" happens in the 2.6 Desktop and Low-latency Desktop (not Server) configurations, where system responsiveness is a key feature, simply placing preemption disable/enable calls might not work. If you want to place preemption disable/enable calls within tcp_recvmsg(), you have to put them at the very beginning and end of the call. Disabling preemption for that long would degrade system responsiveness.
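For reference, the naive whole-function placement being objected to would look roughly like this (a hypothetical, non-runnable kernel sketch, not code from the patch; the signature is abbreviated):

```
/* Hypothetical sketch only -- not the actual patch.  The naive placement
 * brackets the entire socket-locked region: */
int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
		size_t len, int nonblock, int flags, int *addr_len)
{
	int copied = 0;

	preempt_disable();		/* no involuntary preemption ...   */
	lock_sock(sk);
	/*
	 * ... copy data to user space; incoming packets pile up on the
	 * backlog queue while the socket is locked.  Note the copy to
	 * user space can fault and sleep, which is not allowed with
	 * preemption disabled -- hence the calls would have to be "very
	 * carefully placed" around only the non-sleeping sections.
	 */
	release_sock(sk);		/* backlog is processed here        */
	preempt_enable();		/* ... until here                   */

	return copied;
}
```

The comment in the middle is the crux of the responsiveness objection: the socket stays locked for the whole copy, so a non-preemptible window covering it spans the entire receive path.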

wenji



----- Original Message -----
From: David Miller <davem@davemloft.net>
Date: Wednesday, November 29, 2006 7:13 pm
Subject: Re: [patch 1/4] - Potential performance bottleneck for Linux TCP

> From: Andrew Morton <akpm@osdl.org>
> Date: Wed, 29 Nov 2006 17:08:35 -0800
> 
> > On Wed, 29 Nov 2006 16:53:11 -0800 (PST)
> > David Miller <davem@davemloft.net> wrote:
> > 
> > > 
> > > Please, it is very difficult to review your work the way you have
> > > submitted this patch as a set of 4 patches.  These patches have not
> > > been split up "logically", but rather they have been split up "per
> > > file" with the same exact changelog message in each patch posting.
> > > This is very clumsy, and impossible to review, and wastes a lot of
> > > mailing list bandwidth.
> > > 
> > > We have an excellent file, called Documentation/SubmittingPatches, in
> > > the kernel source tree, which explains exactly how to do this
> > > correctly.
> > > 
> > > By splitting your patch into 4 patches, one for each file touched,
> > > it is impossible to review your patch as a logical whole.
> > > 
> > > Please also provide your patch inline so people can just hit reply
> > > in their mail reader client to quote your patch and comment on it.
> > > This is impossible with the attachments you've used.
> > > 
> > 
> > Here you go - joined up, cleaned up, ported to mainline and test-compiled.
> > 
> > That yield() will need to be removed - yield()'s behaviour is truly
> > awful if the system is otherwise busy.  What is it there for?
> 
> What about simply turning off CONFIG_PREEMPT to fix this "problem"?
> 
> We always properly run the backlog (by doing a release_sock()) before
> going to sleep otherwise except for the specific case of taking a page
> fault during the copy to userspace.  It is only CONFIG_PREEMPT that
> can cause this situation to occur in other circumstances as far as I
> can see.
> 
> We could also pepper tcp_recvmsg() with some very carefully placed
> preemption disable/enable calls to deal with this even with
> CONFIG_PREEMPT enabled.
> 


* Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
@ 2006-11-30  2:02 Wenji Wu
  2006-11-30  6:19 ` Ingo Molnar
  0 siblings, 1 reply; 36+ messages in thread
From: Wenji Wu @ 2006-11-30  2:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David Miller, netdev, linux-kernel

> That yield() will need to be removed - yield()'s behaviour is truly
> awful if the system is otherwise busy.  What is it there for?

Please read the uploaded paper, which has a detailed description.

thanks,

wenji

----- Original Message -----
From: Andrew Morton <akpm@osdl.org>
Date: Wednesday, November 29, 2006 7:08 pm
Subject: Re: [patch 1/4] - Potential performance bottleneck for Linux TCP

> On Wed, 29 Nov 2006 16:53:11 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
> 
> > 
> > Please, it is very difficult to review your work the way you have
> > submitted this patch as a set of 4 patches.  These patches have not
> > been split up "logically", but rather they have been split up "per
> > file" with the same exact changelog message in each patch posting.
> > This is very clumsy, and impossible to review, and wastes a lot of
> > mailing list bandwidth.
> > 
> > We have an excellent file, called Documentation/SubmittingPatches, in
> > the kernel source tree, which explains exactly how to do this
> > correctly.
> > 
> > By splitting your patch into 4 patches, one for each file touched,
> > it is impossible to review your patch as a logical whole.
> > 
> > Please also provide your patch inline so people can just hit reply
> > in their mail reader client to quote your patch and comment on it.
> > This is impossible with the attachments you've used.
> > 
> 
> Here you go - joined up, cleaned up, ported to mainline and test-compiled.
> 
> That yield() will need to be removed - yield()'s behaviour is truly
> awful if the system is otherwise busy.  What is it there for?
> 
> 
> 
> From: Wenji Wu <wenji@fnal.gov>
> 
> For Linux TCP, when the network application makes a system call to move
> data from the socket's receive buffer to user space by calling
> tcp_recvmsg(), the socket will be locked.  During this period, all
> incoming packets for the TCP socket will go to the backlog queue without
> being TCP-processed.
> 
> Since Linux 2.6 can be interrupted mid-task, if the network application's
> time slice expires and it is moved to the expired array with the socket
> locked, all the packets within the backlog queue will not be
> TCP-processed until the network application resumes its execution.  If
> the system is heavily loaded, TCP can easily RTO on the sender side.
> 
> 
> 
> include/linux/sched.h |    2 ++
> kernel/fork.c         |    3 +++
> kernel/sched.c        |   24 ++++++++++++++++++------
> net/ipv4/tcp.c        |    9 +++++++++
> 4 files changed, 32 insertions(+), 6 deletions(-)
> 
> diff -puN net/ipv4/tcp.c~tcp-speedup net/ipv4/tcp.c
> --- a/net/ipv4/tcp.c~tcp-speedup
> +++ a/net/ipv4/tcp.c
> @@ -1109,6 +1109,8 @@ int tcp_recvmsg(struct kiocb *iocb, stru
> 	struct task_struct *user_recv = NULL;
> 	int copied_early = 0;
> 
> +	current->backlog_flag = 1;
> +
> 	lock_sock(sk);
> 
> 	TCP_CHECK_TIMER(sk);
> @@ -1468,6 +1470,13 @@ skip_copy:
> 
> 	TCP_CHECK_TIMER(sk);
> 	release_sock(sk);
> +
> +	current->backlog_flag = 0;
> +	if (current->extrarun_flag == 1){
> +        	current->extrarun_flag = 0;
> +        	yield();
> +	}
> +
> 	return copied;
> 
> out:
> diff -puN include/linux/sched.h~tcp-speedup include/linux/sched.h
> --- a/include/linux/sched.h~tcp-speedup
> +++ a/include/linux/sched.h
> @@ -1023,6 +1023,8 @@ struct task_struct {
> #ifdef	CONFIG_TASK_DELAY_ACCT
> 	struct task_delay_info *delays;
> #endif
> +	int backlog_flag; 	/* packets wait in tcp backlog queue flag */
> +	int extrarun_flag;	/* extra run flag for TCP performance */
> };
> 
> static inline pid_t process_group(struct task_struct *tsk)
> diff -puN kernel/sched.c~tcp-speedup kernel/sched.c
> --- a/kernel/sched.c~tcp-speedup
> +++ a/kernel/sched.c
> @@ -3099,12 +3099,24 @@ void scheduler_tick(void)
> 
>         	if (!rq->expired_timestamp)
>                 	rq->expired_timestamp = jiffies;
> -        	if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
> -                	enqueue_task(p, rq->expired);
> -                	if (p->static_prio < rq->best_expired_prio)
> -                        	rq->best_expired_prio = p->static_prio;
> -        	} else
> -                	enqueue_task(p, rq->active);
> +        	if (p->backlog_flag == 0) {
> +                	if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
> +                        	enqueue_task(p, rq->expired);
> +                        	if (p->static_prio < rq->best_expired_prio)
> +                                	rq->best_expired_prio = p->static_prio;
> +                	} else
> +                        	enqueue_task(p, rq->active);
> +        	} else {
> +                	if (expired_starving(rq)) {
> +                        	enqueue_task(p, rq->expired);
> +                        	if (p->static_prio < rq->best_expired_prio)
> +                                	rq->best_expired_prio = p->static_prio;
> +                	} else {
> +                        	if (!TASK_INTERACTIVE(p))
> +                                	p->extrarun_flag = 1;
> +                        	enqueue_task(p, rq->active);
> +                	}
> +        	}
> 	} else {
>         	/*
>                  * Prevent a too long timeslice allowing a task to monopolize
> diff -puN kernel/fork.c~tcp-speedup kernel/fork.c
> --- a/kernel/fork.c~tcp-speedup
> +++ a/kernel/fork.c
> @@ -1032,6 +1032,9 @@ static struct task_struct *copy_process(
> 	clear_tsk_thread_flag(p, TIF_SIGPENDING);
> 	init_sigpending(&p->pending);
> 
> +	p->backlog_flag = 0;
> +	p->extrarun_flag = 0;
> +
> 	p->utime = cputime_zero;
> 	p->stime = cputime_zero;
>  	p->sched_time = 0;
> _
> 
> 


* [Changelog] - Potential performance bottleneck for Linux TCP
@ 2006-11-29 23:27 Wenji Wu
  2006-11-29 23:28 ` [patch 1/4] " Wenji Wu
  0 siblings, 1 reply; 36+ messages in thread
From: Wenji Wu @ 2006-11-29 23:27 UTC (permalink / raw)
  To: netdev, davem, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 846 bytes --]


From: Wenji Wu <wenji@fnal.gov>

Greetings,

For Linux TCP, when the network application makes a system call to move data
from the socket's receive buffer to user space by calling tcp_recvmsg(), the
socket will be locked.  During this period, all incoming packets for the TCP
socket will go to the backlog queue without being TCP-processed.  Since Linux
2.6 can be interrupted mid-task, if the network application's time slice
expires and it is moved to the expired array with the socket locked, all the
packets within the backlog queue will not be TCP-processed until the network
application resumes its execution.  If the system is heavily loaded, TCP can
easily RTO on the sender side.

Attached is the Changelog for the patch.

best regards,

wenji

Wenji Wu
Network Researcher
Fermilab, MS-368
P.O. Box 500
Batavia, IL, 60510
(Email): wenji@fnal.gov
(O): 001-630-840-4541

[-- Attachment #2: Changelog.txt --]
[-- Type: text/plain, Size: 2988 bytes --]


From: Wenji Wu <wenji@fnal.gov>

- Subject

Potential performance bottleneck for Linux TCP (2.6 Desktop, Low-latency Desktop)


- Why the kernel needed patching

For Linux TCP, when the network application makes a system call to move data
from the socket's receive buffer to user space by calling tcp_recvmsg(), the
socket will be locked.  During this period, all incoming packets for the TCP
socket will go to the backlog queue without being TCP-processed.  Since Linux
2.6 can be interrupted mid-task, if the network application's time slice
expires and it is moved to the expired array with the socket locked, all the
packets within the backlog queue will not be TCP-processed until the network
application resumes its execution.  If the system is heavily loaded, TCP can
easily RTO on the sender side.

- The overall design approach in the patch

The underlying idea is that when there are packets waiting on the prequeue
or backlog queue, we do not allow the data-receiving process to release the
CPU for long.

- Implementation details

We have modified the Linux process scheduling policy and tcp_recvmsg().

To summarize, the solution works as follows:

An expired data-receiving process with packets waiting on the backlog queue or
prequeue is moved to the active array, instead of the expired array as usual.
More often than not, the expired data-receiving process will continue to run.
Even if it doesn't, the wait time before it resumes its execution will be
greatly reduced.  However, this gives the process extra runs compared to other
processes in the runqueue.  For the sake of fairness, the process is labeled
with the extrarun_flag.

Also consider the facts that:

(1) the resumed process will continue its execution within tcp_recvmsg();
(2) tcp_recvmsg() does not return to user space until the prequeue and
backlog queue are drained.

For the sake of fairness, we modified tcp_recvmsg() as follows: after the
prequeue and backlog queue are drained, and before tcp_recvmsg() returns to
user space, any process labeled with the extrarun_flag calls yield() to
explicitly yield the CPU to other processes in the runqueue.  yield() works by
removing the process from the active array (where it currently is, because it
is running) and inserting it into the expired array.

Also, to prevent processes in the expired array from starving, a special rule
has been provided for Linux process scheduling (the same rule already used for
interactive processes): an expired process is moved to the expired array
regardless of its status if processes in the expired array are starved.

Changed files:

/kernel/sched.c
/kernel/fork.c
/include/linux/sched.h
/net/ipv4/tcp.c

- Testing results

The proposed solution trades off a small amount of fairness to resolve the TCP
performance bottleneck.  It will not cause a serious fairness issue.

The patch is for the Linux kernel 2.6.14 Desktop and Low-latency Desktop.



end of thread, other threads:[~2006-12-01 23:18 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-11-30  1:56 [patch 1/4] - Potential performance bottleneck for Linux TCP Wenji Wu
2006-11-30  2:19 ` David Miller
2006-11-30  6:17   ` Ingo Molnar
2006-11-30  6:30     ` David Miller
2006-11-30  6:47       ` Ingo Molnar
2006-11-30  7:12         ` David Miller
2006-11-30  7:35           ` Ingo Molnar
2006-11-30  9:52             ` Evgeniy Polyakov
2006-11-30 10:07               ` Nick Piggin
2006-11-30 10:22                 ` Evgeniy Polyakov
2006-11-30 10:32                   ` Ingo Molnar
2006-11-30 17:04                     ` Wenji Wu
2006-11-30 20:20                       ` Ingo Molnar
2006-11-30 20:58                         ` Wenji Wu
2006-11-30 20:22                     ` David Miller
2006-11-30 20:30                       ` Ingo Molnar
2006-11-30 20:38                         ` David Miller
2006-11-30 20:49                           ` Ingo Molnar
2006-11-30 20:54                             ` Ingo Molnar
2006-11-30 20:55                             ` David Miller
2006-11-30 20:14                   ` David Miller
2006-11-30 20:42                     ` Wenji Wu
2006-12-01  9:53                     ` Evgeniy Polyakov
2006-12-01 23:18                       ` David Miller
2006-11-30  6:56       ` Ingo Molnar
2006-11-30 16:08   ` Wenji Wu
2006-11-30 20:06     ` David Miller
2006-11-30  9:33 ` Christoph Hellwig
2006-11-30 16:51   ` Lee Revell
  -- strict thread matches above, loose matches on Subject: below --
2006-11-30  2:02 Wenji Wu
2006-11-30  6:19 ` Ingo Molnar
2006-11-29 23:27 [Changelog] " Wenji Wu
2006-11-29 23:28 ` [patch 1/4] " Wenji Wu
2006-11-30  0:53   ` David Miller
2006-11-30  1:08     ` Andrew Morton
2006-11-30  1:13       ` David Miller
2006-11-30  6:04       ` Mike Galbraith
