* [Changelog] - Potential performance bottleneck for Linux TCP
       [not found] <HNEBLGGMEGLPMPPDOPMGKEAJCGAA.wenji@fnal.gov>
@ 2006-11-29 23:27 ` Wenji Wu
  2006-11-29 23:28 ` [patch 1/4] " Wenji Wu
  2006-11-29 23:36 ` [Changelog] " Martin Bligh
  2006-11-29 23:42 ` Bug 7596 " Andrew Morton
  2006-11-30  1:01 ` David Miller
  2 siblings, 2 replies; 18+ messages in thread
From: Wenji Wu @ 2006-11-29 23:27 UTC (permalink / raw)
  To: netdev, davem, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 846 bytes --]

From: Wenji Wu <wenji@fnal.gov>

Greetings,

For Linux TCP, when a network application makes a system call to move data
from a socket's receive buffer to user space by calling tcp_recvmsg(), the
socket is locked. During that period, all incoming packets for the TCP
socket go to the backlog queue without being TCP-processed. Since Linux 2.6
can be preempted mid-task, if the network application's timeslice expires
and it is moved to the expired array with the socket locked, none of the
packets in the backlog queue will be TCP-processed until the network
application resumes execution. If the system is heavily loaded, TCP can
easily RTO on the sender side.

Attached is the Changelog for the patch.

best regards,

wenji

Wenji Wu
Network Researcher
Fermilab, MS-368
P.O. Box 500
Batavia, IL, 60510
(Email): wenji@fnal.gov
(O): 001-630-840-4541

[-- Attachment #2: Changelog.txt --]
[-- Type: text/plain, Size: 2988 bytes --]

From: Wenji Wu <wenji@fnal.gov>

- Subject

Potential performance bottleneck for Linux TCP (2.6 Desktop, Low-latency
Desktop)

- Why the kernel needed patching

For Linux TCP, when a network application makes a system call to move data
from a socket's receive buffer to user space by calling tcp_recvmsg(), the
socket is locked. During that period, all incoming packets for the TCP
socket go to the backlog queue without being TCP-processed.
Since Linux 2.6 can be preempted mid-task, if the network application's
timeslice expires and it is moved to the expired array with the socket
locked, none of the packets in the backlog queue will be TCP-processed
until the network application resumes execution. If the system is heavily
loaded, TCP can easily RTO on the sender side.

- The overall design approach in the patch

The underlying idea here is that when there are packets waiting on the
prequeue or backlog queue, the data-receiving process should not be allowed
to release the CPU for long.

- Implementation details

We have modified the Linux process scheduling policy and tcp_recvmsg(). To
summarize, the solution works as follows: an expired data-receiving process
with packets waiting on the backlog queue or prequeue is moved to the
active array, instead of the expired array as usual. More often than not,
the expired data-receiving process will continue to run. Even if it does
not, the wait time before it resumes execution will be greatly reduced.
However, this gives the process extra runs compared to the other processes
in the runqueue. For the sake of fairness, the process is labeled with the
extra_run_flag. Also consider the facts that: (1) the resumed process will
continue its execution within tcp_recvmsg(); (2) tcp_recvmsg() does not
return to user space until the prequeue and backlog queue are drained.
For the sake of fairness, we modified tcp_recvmsg() as follows: after the
prequeue and backlog queue are drained, and before tcp_recvmsg() returns
to user space, any process labeled with the extra_run_flag calls yield()
to explicitly yield the CPU to the other processes in the runqueue.
yield() works by removing the process from the active array (where it
currently is, because it is running) and inserting it into the expired
array.
Also, to prevent processes in the expired array from starving, a special
rule has been provided for Linux process scheduling (the same rule used for
interactive processes): an expired process is moved to the expired array
regardless of its status if processes in the expired array are starved.

Changed files:
  kernel/sched.c
  kernel/fork.c
  include/linux/sched.h
  net/ipv4/tcp.c

- Testing results

The proposed solution trades off a small amount of fairness to resolve the
TCP performance bottleneck; it does not cause a serious fairness issue.

The patch is for Linux kernel 2.6.14 Desktop and Low-latency Desktop.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* [patch 1/4] - Potential performance bottleneck for Linux TCP
  2006-11-29 23:27 ` [Changelog] - Potential performance bottleneck for Linux TCP Wenji Wu
@ 2006-11-29 23:28 ` Wenji Wu
  2006-11-29 23:29 ` [patch 2/4] " Wenji Wu
  2006-11-30  0:53 ` [patch 1/4] " David Miller
  2006-11-29 23:36 ` [Changelog] " Martin Bligh
  1 sibling, 2 replies; 18+ messages in thread
From: Wenji Wu @ 2006-11-29 23:28 UTC (permalink / raw)
  To: wenji, netdev, davem, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 832 bytes --]

From: Wenji Wu <wenji@fnal.gov>

Greetings,

For Linux TCP, when a network application makes a system call to move data
from a socket's receive buffer to user space by calling tcp_recvmsg(), the
socket is locked. During that period, all incoming packets for the TCP
socket go to the backlog queue without being TCP-processed. Since Linux 2.6
can be preempted mid-task, if the network application's timeslice expires
and it is moved to the expired array with the socket locked, none of the
packets in the backlog queue will be TCP-processed until the network
application resumes execution. If the system is heavily loaded, TCP can
easily RTO on the sender side.

Attached is patch 1/4.

best regards,

wenji

Wenji Wu
Network Researcher
Fermilab, MS-368
P.O. Box 500
Batavia, IL, 60510
(Email): wenji@fnal.gov
(O): 001-630-840-4541

[-- Attachment #2: tcp.c.patch --]
[-- Type: application/octet-stream, Size: 553 bytes --]

--- linux-2.6.14-old/net/ipv4/tcp.c	2006-11-29 16:24:56.000000000 -0600
+++ linux-2.6.14/net/ipv4/tcp.c	2006-11-29 11:25:57.000000000 -0600
@@ -1109,6 +1109,8 @@
 	int target;		/* Read at least this many bytes */
 	long timeo;
 	struct task_struct *user_recv = NULL;
+
+	current->backlog_flag = 1;
 
 	lock_sock(sk);
@@ -1394,6 +1396,13 @@
 	TCP_CHECK_TIMER(sk);
 	release_sock(sk);
+
+	current->backlog_flag = 0;
+	if (current->extrarun_flag == 1) {
+		current->extrarun_flag = 0;
+		yield();
+	}
+
 	return copied;
 
 out:

^ permalink raw reply	[flat|nested] 18+ messages in thread
* [patch 2/4] - Potential performance bottleneck for Linux TCP
  2006-11-29 23:28 ` [patch 1/4] " Wenji Wu
@ 2006-11-29 23:29 ` Wenji Wu
  2006-11-29 23:30 ` [patch 3/4] " Wenji Wu
  2006-11-30  0:53 ` [patch 1/4] " David Miller
  1 sibling, 1 reply; 18+ messages in thread
From: Wenji Wu @ 2006-11-29 23:29 UTC (permalink / raw)
  To: netdev, davem, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 832 bytes --]

From: Wenji Wu <wenji@fnal.gov>

Greetings,

For Linux TCP, when a network application makes a system call to move data
from a socket's receive buffer to user space by calling tcp_recvmsg(), the
socket is locked. During that period, all incoming packets for the TCP
socket go to the backlog queue without being TCP-processed. Since Linux 2.6
can be preempted mid-task, if the network application's timeslice expires
and it is moved to the expired array with the socket locked, none of the
packets in the backlog queue will be TCP-processed until the network
application resumes execution. If the system is heavily loaded, TCP can
easily RTO on the sender side.

Attached is patch 2/4.

best regards,

wenji

Wenji Wu
Network Researcher
Fermilab, MS-368
P.O. Box 500
Batavia, IL, 60510
(Email): wenji@fnal.gov
(O): 001-630-840-4541

[-- Attachment #2: sched.h.patch --]
[-- Type: application/octet-stream, Size: 477 bytes --]

--- linux-2.6.14-old/include/linux/sched.h	2006-11-29 16:25:42.000000000 -0600
+++ linux-2.6.14/include/linux/sched.h	2006-11-29 10:32:55.000000000 -0600
@@ -813,6 +813,9 @@
 	int cpuset_mems_generation;
 #endif
 	atomic_t fs_excl;	/* holding fs exclusive resources */
+	int backlog_flag;	/* packets wait in tcp backlog queue flag */
+	int extrarun_flag;	/* extra run flag for TCP performance */
+
 };
 
 static inline pid_t process_group(struct task_struct *tsk)

^ permalink raw reply	[flat|nested] 18+ messages in thread
* [patch 3/4] - Potential performance bottleneck for Linux TCP
  2006-11-29 23:29 ` [patch 2/4] " Wenji Wu
@ 2006-11-29 23:30 ` Wenji Wu
  2006-11-29 23:31 ` [patch 4/4] " Wenji Wu
  0 siblings, 1 reply; 18+ messages in thread
From: Wenji Wu @ 2006-11-29 23:30 UTC (permalink / raw)
  To: netdev, davem, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 832 bytes --]

From: Wenji Wu <wenji@fnal.gov>

Greetings,

For Linux TCP, when a network application makes a system call to move data
from a socket's receive buffer to user space by calling tcp_recvmsg(), the
socket is locked. During that period, all incoming packets for the TCP
socket go to the backlog queue without being TCP-processed. Since Linux 2.6
can be preempted mid-task, if the network application's timeslice expires
and it is moved to the expired array with the socket locked, none of the
packets in the backlog queue will be TCP-processed until the network
application resumes execution. If the system is heavily loaded, TCP can
easily RTO on the sender side.

Attached is patch 3/4.

best regards,

wenji

Wenji Wu
Network Researcher
Fermilab, MS-368
P.O. Box 500
Batavia, IL, 60510
(Email): wenji@fnal.gov
(O): 001-630-840-4541

[-- Attachment #2: sched.c.patch --]
[-- Type: application/octet-stream, Size: 1075 bytes --]

--- linux-2.6.14-old/kernel/sched.c	2006-11-29 16:22:22.000000000 -0600
+++ linux-2.6.14/kernel/sched.c	2006-11-29 11:29:34.000000000 -0600
@@ -2598,12 +2598,24 @@
 		if (!rq->expired_timestamp)
 			rq->expired_timestamp = jiffies;
-		if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
-			enqueue_task(p, rq->expired);
-			if (p->static_prio < rq->best_expired_prio)
-				rq->best_expired_prio = p->static_prio;
-		} else
-			enqueue_task(p, rq->active);
+		if (p->backlog_flag == 0) {
+			if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
+				enqueue_task(p, rq->expired);
+				if (p->static_prio < rq->best_expired_prio)
+					rq->best_expired_prio = p->static_prio;
+			} else
+				enqueue_task(p, rq->active);
+		} else {
+			if (EXPIRED_STARVING(rq)) {
+				enqueue_task(p, rq->expired);
+				if (p->static_prio < rq->best_expired_prio)
+					rq->best_expired_prio = p->static_prio;
+			} else {
+				if (!TASK_INTERACTIVE(p))
+					p->extrarun_flag = 1;
+				enqueue_task(p, rq->active);
+			}
+		}
 	} else {
 		/*
 		 * Prevent a too long timeslice allowing a task to monopolize

^ permalink raw reply	[flat|nested] 18+ messages in thread
* [patch 4/4] - Potential performance bottleneck for Linux TCP
  2006-11-29 23:30 ` [patch 3/4] " Wenji Wu
@ 2006-11-29 23:31 ` Wenji Wu
  0 siblings, 0 replies; 18+ messages in thread
From: Wenji Wu @ 2006-11-29 23:31 UTC (permalink / raw)
  To: netdev, davem, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 831 bytes --]

From: Wenji Wu <wenji@fnal.gov>

Greetings,

For Linux TCP, when a network application makes a system call to move data
from a socket's receive buffer to user space by calling tcp_recvmsg(), the
socket is locked. During that period, all incoming packets for the TCP
socket go to the backlog queue without being TCP-processed. Since Linux 2.6
can be preempted mid-task, if the network application's timeslice expires
and it is moved to the expired array with the socket locked, none of the
packets in the backlog queue will be TCP-processed until the network
application resumes execution. If the system is heavily loaded, TCP can
easily RTO on the sender side.

Attached is patch 4/4.

best regards,

wenji

Wenji Wu
Network Researcher
Fermilab, MS-368
P.O. Box 500
Batavia, IL, 60510
(Email): wenji@fnal.gov
(O): 001-630-840-4541

[-- Attachment #2: fork.c.patch --]
[-- Type: application/octet-stream, Size: 682 bytes --]

--- linux-2.6.14-old/kernel/fork.c	2006-11-29 16:22:25.000000000 -0600
+++ linux-2.6.14/kernel/fork.c	2006-11-29 11:23:20.000000000 -0600
@@ -868,7 +868,7 @@
  *
  * It copies the registers, and all the appropriate
  * parts of the process environment (as per the clone
- * flags). The actual kick-off is left to the caller.
+ * flags). The actual kick-off is left to the caller.copy_process
  */
 static task_t *copy_process(unsigned long clone_flags,
 			    unsigned long stack_start,
@@ -1154,6 +1154,9 @@
 	write_unlock_irq(&tasklist_lock);
 	retval = 0;
+
+	p->backlog_flag = 0;
+	p->extrarun_flag = 0;
+
 fork_out:
 	if (retval)
 		return ERR_PTR(retval);

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
  2006-11-29 23:28 ` [patch 1/4] " Wenji Wu
  2006-11-29 23:29 ` [patch 2/4] " Wenji Wu
@ 2006-11-30  0:53 ` David Miller
  2006-11-30  1:08 ` Andrew Morton
  1 sibling, 1 reply; 18+ messages in thread
From: David Miller @ 2006-11-30  0:53 UTC (permalink / raw)
  To: wenji; +Cc: netdev, akpm, linux-kernel

Please, it is very difficult to review your work the way you have
submitted this patch as a set of 4 patches. These patches have not
been split up "logically", but rather they have been split up "per
file" with the same exact changelog message in each patch posting.
This is very clumsy, and impossible to review, and wastes a lot of
mailing list bandwidth.

We have an excellent file, called Documentation/SubmittingPatches, in
the kernel source tree, which explains exactly how to do this
correctly.

By splitting your patch into 4 patches, one for each file touched,
it is impossible to review your patch as a logical whole.

Please also provide your patch inline so people can just hit reply
in their mail reader client to quote your patch and comment on it.
This is impossible with the attachments you've used.

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
  2006-11-30  0:53 ` [patch 1/4] " David Miller
@ 2006-11-30  1:08 ` Andrew Morton
  2006-11-30  1:13 ` David Miller
  2006-11-30  6:04 ` Mike Galbraith
  0 siblings, 2 replies; 18+ messages in thread
From: Andrew Morton @ 2006-11-30  1:08 UTC (permalink / raw)
  To: David Miller; +Cc: wenji, netdev, linux-kernel

On Wed, 29 Nov 2006 16:53:11 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> Please, it is very difficult to review your work the way you have
> submitted this patch as a set of 4 patches. These patches have not
> been split up "logically", but rather they have been split up "per
> file" with the same exact changelog message in each patch posting.
> This is very clumsy, and impossible to review, and wastes a lot of
> mailing list bandwidth.
>
> We have an excellent file, called Documentation/SubmittingPatches, in
> the kernel source tree, which explains exactly how to do this
> correctly.
>
> By splitting your patch into 4 patches, one for each file touched,
> it is impossible to review your patch as a logical whole.
>
> Please also provide your patch inline so people can just hit reply
> in their mail reader client to quote your patch and comment on it.
> This is impossible with the attachments you've used.

Here you go - joined up, cleaned up, ported to mainline and test-compiled.

That yield() will need to be removed - yield()'s behaviour is truly awful
if the system is otherwise busy. What is it there for?

From: Wenji Wu <wenji@fnal.gov>

For Linux TCP, when a network application makes a system call to move data
from a socket's receive buffer to user space by calling tcp_recvmsg(), the
socket is locked. During this period, all incoming packets for the TCP
socket go to the backlog queue without being TCP-processed. Since Linux 2.6
can be preempted mid-task, if the network application's timeslice expires
and it is moved to the expired array with the socket locked, none of the
packets in the backlog queue will be TCP-processed until the network
application resumes execution. If the system is heavily loaded, TCP can
easily RTO on the sender side.

 include/linux/sched.h |    2 ++
 kernel/fork.c         |    3 +++
 kernel/sched.c        |   24 ++++++++++++++++++------
 net/ipv4/tcp.c        |    9 +++++++++
 4 files changed, 32 insertions(+), 6 deletions(-)

diff -puN net/ipv4/tcp.c~tcp-speedup net/ipv4/tcp.c
--- a/net/ipv4/tcp.c~tcp-speedup
+++ a/net/ipv4/tcp.c
@@ -1109,6 +1109,8 @@ int tcp_recvmsg(struct kiocb *iocb, stru
 	struct task_struct *user_recv = NULL;
 	int copied_early = 0;
 
+	current->backlog_flag = 1;
+
 	lock_sock(sk);
 
 	TCP_CHECK_TIMER(sk);
@@ -1468,6 +1470,13 @@ skip_copy:
 	TCP_CHECK_TIMER(sk);
 	release_sock(sk);
+
+	current->backlog_flag = 0;
+	if (current->extrarun_flag == 1) {
+		current->extrarun_flag = 0;
+		yield();
+	}
+
 	return copied;
 
 out:
diff -puN include/linux/sched.h~tcp-speedup include/linux/sched.h
--- a/include/linux/sched.h~tcp-speedup
+++ a/include/linux/sched.h
@@ -1023,6 +1023,8 @@ struct task_struct {
 #ifdef CONFIG_TASK_DELAY_ACCT
 	struct task_delay_info *delays;
 #endif
+	int backlog_flag;	/* packets wait in tcp backlog queue flag */
+	int extrarun_flag;	/* extra run flag for TCP performance */
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
diff -puN kernel/sched.c~tcp-speedup kernel/sched.c
--- a/kernel/sched.c~tcp-speedup
+++ a/kernel/sched.c
@@ -3099,12 +3099,24 @@ void scheduler_tick(void)
 		if (!rq->expired_timestamp)
 			rq->expired_timestamp = jiffies;
-		if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
-			enqueue_task(p, rq->expired);
-			if (p->static_prio < rq->best_expired_prio)
-				rq->best_expired_prio = p->static_prio;
-		} else
-			enqueue_task(p, rq->active);
+		if (p->backlog_flag == 0) {
+			if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
+				enqueue_task(p, rq->expired);
+				if (p->static_prio < rq->best_expired_prio)
+					rq->best_expired_prio = p->static_prio;
+			} else
+				enqueue_task(p, rq->active);
+		} else {
+			if (expired_starving(rq)) {
+				enqueue_task(p, rq->expired);
+				if (p->static_prio < rq->best_expired_prio)
+					rq->best_expired_prio = p->static_prio;
+			} else {
+				if (!TASK_INTERACTIVE(p))
+					p->extrarun_flag = 1;
+				enqueue_task(p, rq->active);
+			}
+		}
 	} else {
 		/*
 		 * Prevent a too long timeslice allowing a task to monopolize
diff -puN kernel/fork.c~tcp-speedup kernel/fork.c
--- a/kernel/fork.c~tcp-speedup
+++ a/kernel/fork.c
@@ -1032,6 +1032,9 @@ static struct task_struct *copy_process(
 	clear_tsk_thread_flag(p, TIF_SIGPENDING);
 	init_sigpending(&p->pending);
 
+	p->backlog_flag = 0;
+	p->extrarun_flag = 0;
+
 	p->utime = cputime_zero;
 	p->stime = cputime_zero;
 	p->sched_time = 0;
_

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
  2006-11-30  1:08 ` Andrew Morton
@ 2006-11-30  1:13 ` David Miller
  2006-11-30  6:04 ` Mike Galbraith
  1 sibling, 0 replies; 18+ messages in thread
From: David Miller @ 2006-11-30  1:13 UTC (permalink / raw)
  To: akpm; +Cc: wenji, netdev, linux-kernel

From: Andrew Morton <akpm@osdl.org>
Date: Wed, 29 Nov 2006 17:08:35 -0800

> On Wed, 29 Nov 2006 16:53:11 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
>
> > Please, it is very difficult to review your work the way you have
> > submitted this patch as a set of 4 patches. These patches have not
> > been split up "logically", but rather they have been split up "per
> > file" with the same exact changelog message in each patch posting.
> > This is very clumsy, and impossible to review, and wastes a lot of
> > mailing list bandwidth.
> >
> > We have an excellent file, called Documentation/SubmittingPatches, in
> > the kernel source tree, which explains exactly how to do this
> > correctly.
> >
> > By splitting your patch into 4 patches, one for each file touched,
> > it is impossible to review your patch as a logical whole.
> >
> > Please also provide your patch inline so people can just hit reply
> > in their mail reader client to quote your patch and comment on it.
> > This is impossible with the attachments you've used.
>
> Here you go - joined up, cleaned up, ported to mainline and test-compiled.
>
> That yield() will need to be removed - yield()'s behaviour is truly awful
> if the system is otherwise busy. What is it there for?

What about simply turning off CONFIG_PREEMPT to fix this "problem"?

We always properly run the backlog (by doing a release_sock()) before
going to sleep otherwise, except for the specific case of taking a page
fault during the copy to userspace. It is only CONFIG_PREEMPT that can
cause this situation to occur in other circumstances as far as I can see.

We could also pepper tcp_recvmsg() with some very carefully placed
preemption disable/enable calls to deal with this even with
CONFIG_PREEMPT enabled.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
  2006-11-30  1:08 ` Andrew Morton
  2006-11-30  1:13 ` David Miller
@ 2006-11-30  6:04 ` Mike Galbraith
  1 sibling, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2006-11-30  6:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David Miller, wenji, netdev, linux-kernel

On Wed, 2006-11-29 at 17:08 -0800, Andrew Morton wrote:

> +		if (p->backlog_flag == 0) {
> +			if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
> +				enqueue_task(p, rq->expired);
> +				if (p->static_prio < rq->best_expired_prio)
> +					rq->best_expired_prio = p->static_prio;
> +			} else
> +				enqueue_task(p, rq->active);
> +		} else {
> +			if (expired_starving(rq)) {
> +				enqueue_task(p, rq->expired);
> +				if (p->static_prio < rq->best_expired_prio)
> +					rq->best_expired_prio = p->static_prio;
> +			} else {
> +				if (!TASK_INTERACTIVE(p))
> +					p->extrarun_flag = 1;
> +				enqueue_task(p, rq->active);
> +			}
> +		}

(oh my, doing that to the scheduler upsets my tummy, but that aside...)

I don't see how that can really solve anything. "Interactive" tasks
starting to use cpu heftily can still preempt and keep the special cased
cpu hog off the cpu for ages. It also only takes one task in the expired
array to trigger the forced array switch with a fully loaded cpu, and
once any task hits the expired array, a stream of wakeups can prevent
the switch from completing for as long as you can keep wakeups happening.

	-Mike

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [Changelog] - Potential performance bottleneck for Linux TCP
  2006-11-29 23:27 ` [Changelog] - Potential performance bottleneck for Linux TCP Wenji Wu
  2006-11-29 23:28 ` [patch 1/4] " Wenji Wu
@ 2006-11-29 23:36 ` Martin Bligh
  1 sibling, 0 replies; 18+ messages in thread
From: Martin Bligh @ 2006-11-29 23:36 UTC (permalink / raw)
  To: wenji; +Cc: netdev, davem, akpm, linux-kernel

Wenji Wu wrote:
> From: Wenji Wu <wenji@fnal.gov>
>
> Greetings,
>
> For Linux TCP, when a network application makes a system call to move
> data from a socket's receive buffer to user space by calling
> tcp_recvmsg(), the socket is locked. During that period, all incoming
> packets for the TCP socket go to the backlog queue without being
> TCP-processed. Since Linux 2.6 can be preempted mid-task, if the network
> application's timeslice expires and it is moved to the expired array with
> the socket locked, none of the packets in the backlog queue will be
> TCP-processed until the network application resumes execution. If the
> system is heavily loaded, TCP can easily RTO on the sender side.

So how much difference did this patch actually make, and to what
benchmark?

> The patch is for Linux kernel 2.6.14 Desktop and Low-latency Desktop

The patch doesn't seem to be attached? Also, it would be better to make it
for the latest kernel version (2.6.19) ... 2.6.14 is rather old ;-)

M

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Bug 7596 - Potential performance bottleneck for Linux TCP
       [not found] <HNEBLGGMEGLPMPPDOPMGKEAJCGAA.wenji@fnal.gov>
  2006-11-29 23:27 ` [Changelog] - Potential performance bottleneck for Linux TCP Wenji Wu
@ 2006-11-29 23:42 ` Andrew Morton
  2006-11-30  6:32 ` Ingo Molnar
  2006-11-30  1:01 ` David Miller
  2 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2006-11-29 23:42 UTC (permalink / raw)
  To: wenji; +Cc: netdev, davem, linux-kernel

On Wed, 29 Nov 2006 17:22:10 -0600
Wenji Wu <wenji@fnal.gov> wrote:

> From: Wenji Wu <wenji@fnal.gov>
>
> Greetings,
>
> For Linux TCP, when a network application makes a system call to move
> data from a socket's receive buffer to user space by calling
> tcp_recvmsg(), the socket is locked. During that period, all incoming
> packets for the TCP socket go to the backlog queue without being
> TCP-processed. Since Linux 2.6 can be preempted mid-task, if the network
> application's timeslice expires and it is moved to the expired array with
> the socket locked, none of the packets in the backlog queue will be
> TCP-processed until the network application resumes execution. If the
> system is heavily loaded, TCP can easily RTO on the sender side.
>
> Attached is the detailed description of the problem and one possible
> solution.

Thanks. The attachment will be too large for the mailing-list servers so
I uploaded a copy to
http://userweb.kernel.org/~akpm/Linux-TCP-Bottleneck-Analysis-Report.pdf

From a quick peek it appears that you're getting around 10% improvement
in TCP throughput, best case.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Bug 7596 - Potential performance bottleneck for Linux TCP
  2006-11-29 23:42 ` Bug 7596 " Andrew Morton
@ 2006-11-30  6:32 ` Ingo Molnar
  2006-12-19 18:37 ` Stephen Hemminger
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2006-11-30  6:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: wenji, netdev, davem, linux-kernel

* Andrew Morton <akpm@osdl.org> wrote:

> > Attached is the detailed description of the problem and one possible
> > solution.
>
> Thanks. The attachment will be too large for the mailing-list servers
> so I uploaded a copy to
> http://userweb.kernel.org/~akpm/Linux-TCP-Bottleneck-Analysis-Report.pdf
>
> From a quick peek it appears that you're getting around 10%
> improvement in TCP throughput, best case.

Wenji, have you tried to renice the receiving task (to say nice -20)
and see how much TCP throughput you get with a "background load of
10.0"? (similarly, you could also renice the background load tasks to
nice +19 and/or set their scheduling policy to SCHED_BATCH)

as far as i can see, the numbers in the paper and the patch prove the
following two points:

 - a task doing TCP receive with 10 other tasks running on the CPU will
   see lower TCP throughput than if it had the CPU for itself alone.

 - a patch that tweaks the scheduler to give the receiving task more
   timeslices (i.e. raises its nice level in essence) results in ...
   more timeslices, which results in higher receive numbers ...

so the most important thing to check would be, before any scheduler and
TCP code change is considered: if you give the task higher priority
/explicitly/, via nice -20, do the numbers improve? Similarly, if all
the other "background load" tasks are reniced to nice +19 (or their
policy is set to SCHED_BATCH), do you get a similar improvement?

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Bug 7596 - Potential performance bottleneck for Linux TCP
  2006-11-30  6:32 ` Ingo Molnar
@ 2006-12-19 18:37 ` Stephen Hemminger
  2006-12-19 23:52 ` Herbert Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Stephen Hemminger @ 2006-12-19 18:37 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, wenji, netdev, davem, linux-kernel

I noticed this bit of discussion in tcp_recvmsg. It implies that a better
queuing policy would be good. But it is confusing English (Alexey?) so
not sure where to start.

> 	if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
> 		/* Install new reader */
> 		if (!user_recv && !(flags & (MSG_TRUNC | MSG_PEEK))) {
> 			user_recv = current;
> 			tp->ucopy.task = user_recv;
> 			tp->ucopy.iov = msg->msg_iov;
> 		}
>
> 		tp->ucopy.len = len;
>
> 		BUG_TRAP(tp->copied_seq == tp->rcv_nxt ||
> 			 (flags & (MSG_PEEK | MSG_TRUNC)));
>
> 		/* Ugly... If prequeue is not empty, we have to
> 		 * process it before releasing socket, otherwise
> 		 * order will be broken at second iteration.
> 		 * More elegant solution is required!!!
> 		 *
> 		 * Look: we have the following (pseudo)queues:
> 		 *
> 		 * 1. packets in flight
> 		 * 2. backlog
> 		 * 3. prequeue
> 		 * 4. receive_queue
> 		 *
> 		 * Each queue can be processed only if the next ones
> 		 * are empty. At this point we have empty receive_queue.
> 		 * But prequeue _can_ be not empty after 2nd iteration,
> 		 * when we jumped to start of loop because backlog
> 		 * processing added something to receive_queue.
> 		 * We cannot release_sock(), because backlog contains
> 		 * packets arrived _after_ prequeued ones.
> 		 *
> 		 * Shortly, algorithm is clear --- to process all
> 		 * the queues in order. We could make it more directly,
> 		 * requeueing packets from backlog to prequeue, if
> 		 * is not empty. It is more elegant, but eats cycles,
> 		 * unfortunately.
> 		 */
> 		if (!skb_queue_empty(&tp->ucopy.prequeue))
> 			goto do_prequeue;
>
> 		/* __ Set realtime policy in scheduler __ */
> 	}
>
> 	if (copied >= target) {
> 		/* Do not sleep, just process backlog. */
> 		release_sock(sk);
> 		lock_sock(sk);
> 	} else

-- 
Stephen Hemminger <shemminger@osdl.org>

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Bug 7596 - Potential performance bottleneck for Linux TCP
  2006-12-19 18:37 ` Stephen Hemminger
@ 2006-12-19 23:52 ` Herbert Xu
  2006-12-20  2:55 ` David Miller
  0 siblings, 1 reply; 18+ messages in thread
From: Herbert Xu @ 2006-12-19 23:52 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: mingo, akpm, wenji, netdev, davem, linux-kernel

Stephen Hemminger <shemminger@osdl.org> wrote:
> I noticed this bit of discussion in tcp_recvmsg. It implies that a better
> queuing policy would be good. But it is confusing English (Alexey?) so
> not sure where to start.

Actually I think the comment says that the current code isn't the most
elegant but is more efficient.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Bug 7596 - Potential performance bottleneck for Linux TCP
  2006-12-19 23:52 ` Herbert Xu
@ 2006-12-20  2:55 ` David Miller
  2006-12-20  5:11 ` Stephen Hemminger
  0 siblings, 1 reply; 18+ messages in thread
From: David Miller @ 2006-12-20  2:55 UTC (permalink / raw)
  To: herbert; +Cc: shemminger, mingo, akpm, wenji, netdev, linux-kernel

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 20 Dec 2006 10:52:19 +1100

> Stephen Hemminger <shemminger@osdl.org> wrote:
> > I noticed this bit of discussion in tcp_recvmsg. It implies that a
> > better queuing policy would be good. But it is confusing English
> > (Alexey?) so not sure where to start.
>
> Actually I think the comment says that the current code isn't the
> most elegant but is more efficient.

It's just explaining the hierarchy of queues that need to be purged,
and in what order, for correctness.

Alexey added that code when I mentioned to him, right after we added
the prequeue, that it was possible to process the normal backlog before
the prequeue, which is illegal. In fixing that bug, he added the comment
we are discussing.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Bug 7596 - Potential performance bottleneck for Linux TCP
  2006-12-20  2:55 ` David Miller
@ 2006-12-20  5:11 ` Stephen Hemminger
  2006-12-20  5:15 ` David Miller
  0 siblings, 1 reply; 18+ messages in thread
From: Stephen Hemminger @ 2006-12-20  5:11 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, mingo, akpm, wenji, netdev, linux-kernel

On Tue, 19 Dec 2006 18:55:25 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Wed, 20 Dec 2006 10:52:19 +1100
>
> > Stephen Hemminger <shemminger@osdl.org> wrote:
> > > I noticed this bit of discussion in tcp_recvmsg. It implies that a
> > > better queuing policy would be good. But it is confusing English
> > > (Alexey?) so not sure where to start.
> >
> > Actually I think the comment says that the current code isn't the
> > most elegant but is more efficient.
>
> It's just explaining the hierarchy of queues that need to be purged,
> and in what order, for correctness.
>
> Alexey added that code when I mentioned to him, right after we added
> the prequeue, that it was possible to process the normal backlog
> before the prequeue, which is illegal. In fixing that bug, he added
> the comment we are discussing.

It was the realtime/normal comments that piqued my interest. Perhaps we
should either tweak process priority or remove the comments.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Bug 7596 - Potential performance bottleneck for Linux TCP

From: David Miller @ 2006-12-20 5:15 UTC
To: shemminger; +Cc: herbert, mingo, akpm, wenji, netdev, linux-kernel

From: Stephen Hemminger <shemminger@osdl.org>
Date: Tue, 19 Dec 2006 21:11:24 -0800

> It was the realtime/normal comments that piqued my interest.
> Perhaps we should either tweak process priority or remove
> the comments.

I mentioned that to Linus once and he said the entire idea was bogus.
With the recent tcp_recvmsg() preemption issue thread, I agree with his
sentiments even more than I did previously.

What needs to happen is to liberate the locking so that input packet
processing can occur in parallel with tcp_recvmsg(), instead of doing
this bogus backlog thing, which can wedge TCP ACK processing for an
entire quantum if we take a kernel preemption while the process holds
the socket lock.
* Re: Bug 7596 - Potential performance bottleneck for Linux TCP

From: David Miller @ 2006-11-30 1:01 UTC
To: wenji; +Cc: netdev, akpm, linux-kernel

The delays dealt with in your paper might actually help a highly loaded
server with lots of sockets and threads trying to communicate. The
packet processing delays caused by the scheduling delay pace the TCP
sender by controlling the rate at which ACKs go back to that sender.

Those ACKs go out paced to the rate at which the sleeping TCP receiver
gets back onto the CPU, and this causes the TCP sender to naturally
adjust to the overall processing rate of the receiver system, on a
per-connection basis.

Perhaps try a system with hundreds of processes and potentially
hundreds of thousands of TCP sockets, with thousands of unique sender
sites, and see what happens.

This is a similar trade-off to TSO, where we balance the gains from
batching work against the losses from gaps in the communication stream.
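[Editor's note: the self-pacing David describes is the classic ACK clock. A minimal sketch, under illustrative assumptions (fixed initial window, receiver that processes one segment per fixed interval): a sender that transmits one new segment per returning ACK ends up delivering at exactly the receiver's processing rate, no matter how fast it could otherwise send.]

```python
def ack_clocked_rate(recv_process_ms, sim_ms):
    """Segments delivered per second when the sender is limited only by
    the ACK clock: it sends one new segment per returning ACK, and the
    receiver processes (and ACKs) one segment every recv_process_ms."""
    in_flight = 4    # initial window: segments already in the network
    delivered = 0
    t = 0
    while t < sim_ms:
        t += recv_process_ms  # receiver finishes the next segment...
        in_flight -= 1        # ...consuming it from the pipe
        delivered += 1        # ...and sends an ACK
        in_flight += 1        # ACK clock: the ACK releases one new segment
    return delivered * 1000 // sim_ms

# A receiver that needs 20 ms per segment clocks the sender down to
# 50 segments/s; halve the processing time and the rate doubles.
print(ack_clocked_rate(recv_process_ms=20, sim_ms=1000))  # 50
print(ack_clocked_rate(recv_process_ms=10, sim_ms=1000))  # 100
```

The in-flight count never changes, which is the point: once the startup window drains into the pipe, the receiver's processing rate sets the sending rate per connection.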
Thread overview: 18+ messages
2006-11-29 23:27 ` [Changelog] - Potential performance bottleneck for Linux TCP Wenji Wu
2006-11-29 23:28 ` [patch 1/4] " Wenji Wu
2006-11-29 23:29 ` [patch 2/4] " Wenji Wu
2006-11-29 23:30 ` [patch 3/4] " Wenji Wu
2006-11-29 23:31 ` [patch 4/4] " Wenji Wu
2006-11-30 0:53 ` [patch 1/4] " David Miller
2006-11-30 1:08 ` Andrew Morton
2006-11-30 1:13 ` David Miller
2006-11-30 6:04 ` Mike Galbraith
2006-11-29 23:36 ` [Changelog] " Martin Bligh
2006-11-29 23:42 ` Bug 7596 " Andrew Morton
2006-11-30 6:32 ` Ingo Molnar
2006-12-19 18:37 ` Stephen Hemminger
2006-12-19 23:52 ` Herbert Xu
2006-12-20 2:55 ` David Miller
2006-12-20 5:11 ` Stephen Hemminger
2006-12-20 5:15 ` David Miller
2006-11-30 1:01 ` David Miller