Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
@ 2006-11-30  1:56 Wenji Wu
  2006-11-30  2:19 ` David Miller
  2006-11-30  9:33 ` Christoph Hellwig
  0 siblings, 2 replies; 36+ messages in thread
From: Wenji Wu @ 2006-11-30  1:56 UTC (permalink / raw)
  To: David Miller; +Cc: akpm, netdev, linux-kernel

Yes, when CONFIG_PREEMPT is disabled, the "problem" won't happen. That is why I put "for 2.6 desktop, low-latency desktop" in the uploaded paper. This "problem" happens in the 2.6 Desktop and Low-latency Desktop.

>We could also pepper tcp_recvmsg() with some very carefully placed preemption disable/enable calls to deal with this even with CONFIG_PREEMPT enabled.

I also think about this approach. But since the "problem" happens in the 2.6 Desktop and Low-latency Desktop (not server), system responsiveness is a key feature, simply placing preemption disabled/enable call might not work.  If you want to place preemption disable/enable calls within tcp_recvmsg, you have to put them in the very beginning and end of the call. Disabling preemption would degrade system responsiveness.

wenji



----- Original Message -----
From: David Miller <davem@davemloft.net>
Date: Wednesday, November 29, 2006 7:13 pm
Subject: Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP

> From: Andrew Morton <akpm@osdl.org>
> Date: Wed, 29 Nov 2006 17:08:35 -0800
> 
> > On Wed, 29 Nov 2006 16:53:11 -0800 (PST)
> > David Miller <davem@davemloft.net> wrote:
> > 
> > > 
> > > Please, it is very difficult to review your work the way you have
> > > submitted this patch as a set of 4 patches.  These patches have 
> not> > been split up "logically", but rather they have been split 
> up "per
> > > file" with the same exact changelog message in each patch posting.
> > > This is very clumsy, and impossible to review, and wastes a lot of
> > > mailing list bandwith.
> > > 
> > > We have an excellent file, called 
> Documentation/SubmittingPatches, in
> > > the kernel source tree, which explains exactly how to do this
> > > correctly.
> > > 
> > > By splitting your patch into 4 patches, one for each file touched,
> > > it is impossible to review your patch as a logical whole.
> > > 
> > > Please also provide your patch inline so people can just hit reply
> > > in their mail reader client to quote your patch and comment on it.
> > > This is impossible with the attachments you've used.
> > > 
> > 
> > Here you go - joined up, cleaned up, ported to mainline and test-
> compiled.> 
> > That yield() will need to be removed - yield()'s behaviour is 
> truly awful
> > if the system is otherwise busy.  What is it there for?
> 
> What about simply turning off CONFIG_PREEMPT to fix this "problem"?
> 
> We always properly run the backlog (by doing a release_sock()) before
> going to sleep otherwise except for the specific case of taking a page
> fault during the copy to userspace.  It is only CONFIG_PREEMPT that
> can cause this situation to occur in other circumstances as far as I
> can see.
> 
> We could also pepper tcp_recvmsg() with some very carefully placed
> preemption disable/enable calls to deal with this even with
> CONFIG_PREEMPT enabled.
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30  1:56 [patch 1/4] - Potential performance bottleneck for Linxu TCP Wenji Wu
@ 2006-11-30  2:19 ` David Miller
  2006-11-30  6:17   ` Ingo Molnar
  2006-11-30 16:08   ` Wenji Wu
  2006-11-30  9:33 ` Christoph Hellwig
  1 sibling, 2 replies; 36+ messages in thread
From: David Miller @ 2006-11-30  2:19 UTC (permalink / raw)
  To: wenji; +Cc: akpm, netdev, linux-kernel

From: Wenji Wu <wenji@fnal.gov>
Date: Wed, 29 Nov 2006 19:56:58 -0600

> >We could also pepper tcp_recvmsg() with some very carefully placed
> >preemption disable/enable calls to deal with this even with
> >CONFIG_PREEMPT enabled.
>
> I also think about this approach. But since the "problem" happens in
> the 2.6 Desktop and Low-latency Desktop (not server), system
> responsiveness is a key feature, simply placing preemption
> disabled/enable call might not work.  If you want to place
> preemption disable/enable calls within tcp_recvmsg, you have to put
> them in the very beginning and end of the call. Disabling preemption
> would degrade system responsiveness.

We can make explicitl preemption checks in the main loop of
tcp_recvmsg(), and release the socket and run the backlog if
need_resched() is TRUE.

This is the simplest and most elegant solution to this problem.

The one suggested in your patch and paper are way overkill, there is
no reason to solve a TCP specific problem inside of the generic
scheduler.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30  2:19 ` David Miller
@ 2006-11-30  6:17   ` Ingo Molnar
  2006-11-30  6:30     ` David Miller
  2006-11-30 16:08   ` Wenji Wu
  1 sibling, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2006-11-30  6:17 UTC (permalink / raw)
  To: David Miller; +Cc: wenji, akpm, netdev, linux-kernel

* David Miller <davem@davemloft.net> wrote:

> We can make explicitl preemption checks in the main loop of 
> tcp_recvmsg(), and release the socket and run the backlog if 
> need_resched() is TRUE.
> 
> This is the simplest and most elegant solution to this problem.

yeah, i like this one. If the problem is "too long locked section", then
the most natural solution is to "break up the lock", not to "boost the 
priority of the lock-holding task" (which is what the proposed patch 
does).

[ Also note that "sprinkle the code with preempt_disable()" kind of
  solutions, besides hurting interactivity, are also a pain to resolve 
  in something like PREEMPT_RT. (unlike say a spinlock, 
  preempt_disable() is quite opaque in what data structure it protects, 
  etc., making it hard to convert it to a preemptible primitive) ]

> The one suggested in your patch and paper are way overkill, there is 
> no reason to solve a TCP specific problem inside of the generic 
> scheduler.

agreed.

What we could also add is a /reverse/ mechanism to the scheduler: a task 
could query whether it has just a small amount of time left in its 
timeslice, and could in that case voluntarily drop its current lock and 
yield, and thus give up its current timeslice and wait for a new, full 
timeslice, instead of being forcibly preempted due to lack of timeslices 
with a possibly critical lock still held.

But the suggested solution here, to "prolong the running of this task 
just a little bit longer" only starts a perpetual arms race between 
users of such a facility and other kernel subsystems. (besides not being 
adequate anyway, there can always be /so/ long lock-hold times that the 
scheduler would have no other option but to preempt the task)

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30  6:17   ` Ingo Molnar
@ 2006-11-30  6:30     ` David Miller
  2006-11-30  6:47       ` Ingo Molnar
  2006-11-30  6:56       ` Ingo Molnar
  0 siblings, 2 replies; 36+ messages in thread
From: David Miller @ 2006-11-30  6:30 UTC (permalink / raw)
  To: mingo; +Cc: wenji, akpm, netdev, linux-kernel

From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 30 Nov 2006 07:17:58 +0100

> 
> * David Miller <davem@davemloft.net> wrote:
> 
> > We can make explicitl preemption checks in the main loop of 
> > tcp_recvmsg(), and release the socket and run the backlog if 
> > need_resched() is TRUE.
> > 
> > This is the simplest and most elegant solution to this problem.
> 
> yeah, i like this one. If the problem is "too long locked section", then
> the most natural solution is to "break up the lock", not to "boost the 
> priority of the lock-holding task" (which is what the proposed patch 
> does).

Ingo you're mis-read the problem :-)

The issue is that we actually don't hold any locks that prevent
preemption, so we can take preemption points which the TCP code
wasn't designed with in-mind.

Normally, we control the sleep point very carefully in the TCP
sendmsg/recvmsg code, such that when we sleep we drop the socket
lock and process the backlog packets that accumulated while the
socket was locked.

With pre-emption we can't control that properly.

The problem is that we really do need to run the backlog any time
we give up the cpu in the sendmsg/recvmsg path, or things get real
erratic.  ACKs don't go out as early as we'd like them to, etc.

It isn't easy to do generically, perhaps, because we can only
drop the socket lock at certain points and we need to do that to
run the backlog.

This is why my suggestion is to preempt_disable() as soon as we
grab the socket lock, and explicitly test need_resched() at places
where it is absolutely safe, like this:

	if (need_resched()) {
		/* Run packet backlog... */
		release_sock(sk);
		schedule();
		lock_sock(sk);
	}

The socket lock is just a by-hand binary semaphore, so it doesn't
block pre-emption.  We have to be able to sleep while holding it.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30  6:30     ` David Miller
@ 2006-11-30  6:47       ` Ingo Molnar
  2006-11-30  7:12         ` David Miller
  2006-11-30  6:56       ` Ingo Molnar
  1 sibling, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2006-11-30  6:47 UTC (permalink / raw)
  To: David Miller; +Cc: wenji, akpm, netdev, linux-kernel

* David Miller <davem@davemloft.net> wrote:

> > yeah, i like this one. If the problem is "too long locked section", 
> > then the most natural solution is to "break up the lock", not to 
> > "boost the priority of the lock-holding task" (which is what the 
> > proposed patch does).
> 
> Ingo you're mis-read the problem :-)

yeah, the problem isnt too long locked section but "too much time spent 
holding a lock" and hence opening up ourselves to possible negative 
side-effects of the scheduler's fairness algorithm when it forces a 
preemption of that process context with that lock held (and forcing all 
subsequent packets to be backlogged).

but please read my last mail - i think i'm slowly starting to wake up 
;-) I dont think there is any real problem: a tweak to the scheduler 
that in essence gives TCP-using tasks a preference changes the balance 
of workloads. Such an explicit tweak is possible already.

furthermore, the tweak allows the shifting of processing from a 
prioritized process context into a highest-priority softirq context. 
(it's not proven that there is any significant /net win/ of performance: 
all that was proven is that if we shift TCP processing from process 
context into softirq context then TCP throughput of that otherwise 
penalized process context increases.)

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30  6:47       ` Ingo Molnar
@ 2006-11-30  7:12         ` David Miller
  2006-11-30  7:35           ` Ingo Molnar
  0 siblings, 1 reply; 36+ messages in thread
From: David Miller @ 2006-11-30  7:12 UTC (permalink / raw)
  To: mingo; +Cc: wenji, akpm, netdev, linux-kernel

From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 30 Nov 2006 07:47:58 +0100

> furthermore, the tweak allows the shifting of processing from a 
> prioritized process context into a highest-priority softirq context. 
> (it's not proven that there is any significant /net win/ of performance: 
> all that was proven is that if we shift TCP processing from process 
> context into softirq context then TCP throughput of that otherwise 
> penalized process context increases.)

If we preempt with any packets in the backlog, we send no ACKs and the
sender cannot send thus the pipe empties.  That's the problem, this
has nothing to do with scheduler priorities or stuff like that IMHO.
The argument goes that if the reschedule is delayed long enough, the
ACKs will exceed the round trip time and trigger retransmits which
will absolutely kill performance.

The only reason we block input packet processing while we hold this
lock is because we don't want the receive queue changing from
underneath us while we're copying data to userspace.

Furthermore once you preempt in this particular way, no input
packet processing occurs in that socket still, exacerbating the
situation.

Anyways, even if we somehow unlocked the socket and ran the backlog at
preemption points, by hand, since we've thus deferred the whole work
of processing whatever is in the backlog until the preemption point,
we've lost our quantum already, so it's perhaps not legal to do the
deferred processing as the preemption signalling point from a fairness
perspective.

It would be different if we really did the packet processing at the
original moment (where we had to queue to the socket backlog because
it was locked, in softirq) because then we'd return from the softirq
and hit the preemption point earlier or whatever.

Therefore, perhaps the best would be to see if there is a way we can
still allow input packet processing even while running the majority of
TCP's recvmsg().  It won't be easy :)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30  7:12         ` David Miller
@ 2006-11-30  7:35           ` Ingo Molnar
  2006-11-30  9:52             ` Evgeniy Polyakov
  0 siblings, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2006-11-30  7:35 UTC (permalink / raw)
  To: David Miller; +Cc: wenji, akpm, netdev, linux-kernel

* David Miller <davem@davemloft.net> wrote:

> > furthermore, the tweak allows the shifting of processing from a 
> > prioritized process context into a highest-priority softirq context. 
> > (it's not proven that there is any significant /net win/ of 
> > performance: all that was proven is that if we shift TCP processing 
> > from process context into softirq context then TCP throughput of 
> > that otherwise penalized process context increases.)
> 
> If we preempt with any packets in the backlog, we send no ACKs and the 
> sender cannot send thus the pipe empties.  That's the problem, this 
> has nothing to do with scheduler priorities or stuff like that IMHO. 
> The argument goes that if the reschedule is delayed long enough, the 
> ACKs will exceed the round trip time and trigger retransmits which 
> will absolutely kill performance.

yes, but i disagree a bit about the characterisation of the problem. The 
question in my opinion is: how is TCP processing prioritized for this 
particular socket, which is attached to the process context which was 
preempted.

normally, normally quite a bit of TCP processing happens in a softirq 
context (in fact most of it happens there), and softirq contexts have no 
fairness whatsoever - they preempt whatever processing is going on, 
regardless of any priority preferences of the user!

what was observed here were the effects of completely throttling TCP 
processing for a given socket. I think such throttling can in fact be 
desirable: there is a /reason/ why the process context was preempted: in 
that load scenario there was 10 times more processing requested from the 
CPU than it can possibly service. It's a serious overload situation and 
it's the scheduler's task to prioritize between workloads!

normally such kind of "throttling" of the TCP stack for this particular 
socket does not happen. Note that there's no performance lost: we dont 
do TCP processing because there are /9 other tasks for this CPU to run/, 
and the scheduler has a tough choice.

Now i agree that there are more intelligent ways to throttle and less 
intelligent ways to throttle, but the notion to allow a given workload 
'steal' CPU time from other workloads by allowing it to push its 
processing into a softirq is i think unfair. (and this issue is 
partially addressed by my softirq threading patches in -rt :-)

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30  7:35           ` Ingo Molnar
@ 2006-11-30  9:52             ` Evgeniy Polyakov
  2006-11-30 10:07               ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Evgeniy Polyakov @ 2006-11-30  9:52 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: David Miller, wenji, akpm, netdev, linux-kernel

On Thu, Nov 30, 2006 at 08:35:04AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> what was observed here were the effects of completely throttling TCP 
> processing for a given socket. I think such throttling can in fact be 
> desirable: there is a /reason/ why the process context was preempted: in 
> that load scenario there was 10 times more processing requested from the 
> CPU than it can possibly service. It's a serious overload situation and 
> it's the scheduler's task to prioritize between workloads!
> 
> normally such kind of "throttling" of the TCP stack for this particular 
> socket does not happen. Note that there's no performance lost: we dont 
> do TCP processing because there are /9 other tasks for this CPU to run/, 
> and the scheduler has a tough choice.
> 
> Now i agree that there are more intelligent ways to throttle and less 
> intelligent ways to throttle, but the notion to allow a given workload 
> 'steal' CPU time from other workloads by allowing it to push its 
> processing into a softirq is i think unfair. (and this issue is 
> partially addressed by my softirq threading patches in -rt :-)

Doesn't the provided solution is just a in-kernel variant of the
SCHED_FIFO set from userspace? Why kernel should be able to mark some
users as having higher priority?
What if workload of the system is targeted to not the maximum TCP
performance, but maximum other-task performance, which will be broken
with provided patch.

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30  9:52             ` Evgeniy Polyakov
@ 2006-11-30 10:07               ` Nick Piggin
  2006-11-30 10:22                 ` Evgeniy Polyakov
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2006-11-30 10:07 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, David Miller, wenji, akpm, netdev, linux-kernel

Evgeniy Polyakov wrote:
> On Thu, Nov 30, 2006 at 08:35:04AM +0100, Ingo Molnar (mingo@elte.hu) wrote:

> Doesn't the provided solution is just a in-kernel variant of the
> SCHED_FIFO set from userspace? Why kernel should be able to mark some
> users as having higher priority?
> What if workload of the system is targeted to not the maximum TCP
> performance, but maximum other-task performance, which will be broken
> with provided patch.

David's line of thinking for a solution sounds better to me. This patch
does not prevent the process from being preempted (for potentially a long
time), by any means.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30 10:07               ` Nick Piggin
@ 2006-11-30 10:22                 ` Evgeniy Polyakov
  2006-11-30 10:32                   ` Ingo Molnar
  2006-11-30 20:14                   ` David Miller
  0 siblings, 2 replies; 36+ messages in thread
From: Evgeniy Polyakov @ 2006-11-30 10:22 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Ingo Molnar, David Miller, wenji, akpm, netdev, linux-kernel

On Thu, Nov 30, 2006 at 09:07:42PM +1100, Nick Piggin (nickpiggin@yahoo.com.au) wrote:
> >Doesn't the provided solution is just a in-kernel variant of the
> >SCHED_FIFO set from userspace? Why kernel should be able to mark some
> >users as having higher priority?
> >What if workload of the system is targeted to not the maximum TCP
> >performance, but maximum other-task performance, which will be broken
> >with provided patch.
> 
> David's line of thinking for a solution sounds better to me. This patch
> does not prevent the process from being preempted (for potentially a long
> time), by any means.

It steals timeslices from other processes to complete tcp_recvmsg()
task, and only when it does it for too long, it will be preempted.
Processing backlog queue on behalf of need_resched() will break fairness
too - processing itself can take a lot of time, so process can be
scheduled away in that part too.

> -- 
> SUSE Labs, Novell Inc.
> Send instant messages to your online friends http://au.messenger.yahoo.com 

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30 10:22                 ` Evgeniy Polyakov
@ 2006-11-30 10:32                   ` Ingo Molnar
  2006-11-30 17:04                     ` Wenji Wu
  2006-11-30 20:22                     ` David Miller
  2006-11-30 20:14                   ` David Miller
  1 sibling, 2 replies; 36+ messages in thread
From: Ingo Molnar @ 2006-11-30 10:32 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Nick Piggin, David Miller, wenji, akpm, netdev, linux-kernel

* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> > David's line of thinking for a solution sounds better to me. This 
> > patch does not prevent the process from being preempted (for 
> > potentially a long time), by any means.
> 
> It steals timeslices from other processes to complete tcp_recvmsg() 
> task, and only when it does it for too long, it will be preempted. 
> Processing backlog queue on behalf of need_resched() will break 
> fairness too - processing itself can take a lot of time, so process 
> can be scheduled away in that part too.

correct - it's just the wrong thing to do. The '10% performance win' 
that was measured was against _9 other tasks who contended for the same 
CPU resource_. I.e. it's /not/ an absolute 'performance win' AFAICS, 
it's a simple shift in CPU cycles away from the other 9 tasks and 
towards the task that does TCP receive.

Note that even without the change the TCP receiving task is already 
getting a disproportionate share of cycles due to softirq processing! 
Under a load of 10.0 it went from 500 mbits to 74 mbits, while the 
'fair' share would be 50 mbits. So the TCP receiver /already/ has an 
unfair advantage. The patch only deepends that unfairness.

The solution is really simple and needs no kernel change at all: if you 
want the TCP receiver to get a larger share of timeslices then either 
renice it to -20 or renice the other tasks to +19.

The other disadvantage, even ignoring that it's the wrong thing to do, 
is the crudeness of preempt_disable() that i mentioned in the other 
post:

---------->

independently of the issue at hand, in general the explicit use of 
preempt_disable() in non-infrastructure code is quite a heavy tool. Its 
effects are heavy and global: it disables /all/ preemption (even on 
PREEMPT_RT). Furthermore, when preempt_disable() is used for per-CPU 
data structures then [unlike for example to a spin-lock] the connection 
between the 'data' and the 'lock' is not explicit - causing all kinds of 
grief when trying to convert such code to a different preemption model. 
(such as PREEMPT_RT :-)

So my plan is to remove all "open-coded" use of preempt_disable() [and 
raw use of local_irq_save/restore] from the kernel and replace it with 
some facility that connects data and lock. (Note that this will not 
result in any actual changes on the instruction level because internally 
every such facility still maps to preempt_disable() on non-PREEMPT_RT 
kernels, so on non-PREEMPT_RT kernels such code will still be the same 
as before.)

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30 10:32                   ` Ingo Molnar
@ 2006-11-30 17:04                     ` Wenji Wu
  2006-11-30 20:20                       ` Ingo Molnar
  2006-11-30 20:22                     ` David Miller
  1 sibling, 1 reply; 36+ messages in thread
From: Wenji Wu @ 2006-11-30 17:04 UTC (permalink / raw)
  To: Ingo Molnar, Evgeniy Polyakov
  Cc: Nick Piggin, David Miller, akpm, netdev, linux-kernel


>The solution is really simple and needs no kernel change at all: if you
>want the TCP receiver to get a larger share of timeslices then either
>renice it to -20 or renice the other tasks to +19.

Simply give a larger share of timeslices to the TCP receiver won't solve the
problem.  No matter what the timeslice is, if the TCP receiving process has
packets within backlog, and the process is expired and moved to the expired
array, RTO might happen in the TCP sender.

The solution does not look like that simple.

wenji





^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30 17:04                     ` Wenji Wu
@ 2006-11-30 20:20                       ` Ingo Molnar
  2006-11-30 20:58                         ` Wenji Wu
  0 siblings, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2006-11-30 20:20 UTC (permalink / raw)
  To: Wenji Wu
  Cc: Evgeniy Polyakov, Nick Piggin, David Miller, akpm, netdev,
	linux-kernel


* Wenji Wu <wenji@fnal.gov> wrote:

> >The solution is really simple and needs no kernel change at all: if 
> >you want the TCP receiver to get a larger share of timeslices then 
> >either renice it to -20 or renice the other tasks to +19.
> 
> Simply give a larger share of timeslices to the TCP receiver won't 
> solve the problem.  No matter what the timeslice is, if the TCP 
> receiving process has packets within backlog, and the process is 
> expired and moved to the expired array, RTO might happen in the TCP 
> sender.

if you still have the test-setup, could you nevertheless try setting the 
priority of the receiving TCP task to nice -20 and see what kind of 
performance you get?

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30 20:20                       ` Ingo Molnar
@ 2006-11-30 20:58                         ` Wenji Wu
  0 siblings, 0 replies; 36+ messages in thread
From: Wenji Wu @ 2006-11-30 20:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Evgeniy Polyakov, Nick Piggin, David Miller, akpm, netdev,
	linux-kernel

>if you still have the test-setup, could you nevertheless try setting the
>priority of the receiving TCP task to nice -20 and see what kind of
>performance you get?

A process with nice of -20 can easily get the interactivity status. When it
expires, it still go back to the active array. It just hide the TCP problem,
instead of solving it.

For a process with nice value of -20, it will have the following advantages
over other processes:
(1) its timeslice is 800ms, the timeslice of a process with a nice value of
0 is 100ms
(2) it has higher priority than other processes
(3) it is easier to gain the interactivity status.

The chances that the process expires and moves to the expired array with
packets within backlog is much reduces, but still has the chance.

wenji

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30 10:32                   ` Ingo Molnar
  2006-11-30 17:04                     ` Wenji Wu
@ 2006-11-30 20:22                     ` David Miller
  2006-11-30 20:30                       ` Ingo Molnar
  1 sibling, 1 reply; 36+ messages in thread
From: David Miller @ 2006-11-30 20:22 UTC (permalink / raw)
  To: mingo; +Cc: johnpol, nickpiggin, wenji, akpm, netdev, linux-kernel

From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 30 Nov 2006 11:32:40 +0100

> Note that even without the change the TCP receiving task is already 
> getting a disproportionate share of cycles due to softirq processing! 
> Under a load of 10.0 it went from 500 mbits to 74 mbits, while the 
> 'fair' share would be 50 mbits. So the TCP receiver /already/ has an 
> unfair advantage. The patch only deepends that unfairness.

I want to point out something which is slightly misleading about this
kind of analysis.

Your disk I/O speed doesn't go down by a factor of 10 just because 9
other non disk I/O tasks are running, yet for TCP that's seemingly OK
:-)

Not looking at input TCP packets enough to send out the ACKs is the
same as "forgetting" to queue some I/O requests that can go to the
controller right now.

That's the problem, TCP performance is intimately tied to ACK
feedback.  So we should find a way to make sure ACK feedback goes
out, in preference to other tcp_recvmsg() processing.

What really should pace the TCP sender in this kind of situation is
the advertised window, not the lack of ACKs.  Lack of an ACK mean the
packet didn't get there, which is the wrong signal in this kind of
situation, whereas a closing window means "application can't keep
up with the data rate, hold on..." and is the proper flow control
signal in this high load scenerio.

If you don't send ACKs, packets are retransmitted when there is no
reason for it, and that borders on illegal. :-)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30 20:22                     ` David Miller
@ 2006-11-30 20:30                       ` Ingo Molnar
  2006-11-30 20:38                         ` David Miller
  0 siblings, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2006-11-30 20:30 UTC (permalink / raw)
  To: David Miller; +Cc: johnpol, nickpiggin, wenji, akpm, netdev, linux-kernel


* David Miller <davem@davemloft.net> wrote:

> I want to point out something which is slightly misleading about this 
> kind of analysis.
> 
> Your disk I/O speed doesn't go down by a factor of 10 just because 9 
> other non disk I/O tasks are running, yet for TCP that's seemingly OK
> :-)

disk I/O is typically not CPU bound, and i believe these TCP tests /are/ 
CPU-bound. Otherwise there would be no expiry of the timeslice to begin 
with and the TCP receiver task would always be boosted to 'interactive' 
status by the scheduler and would happily chug along at 500 mbits ...

(and i grant you, if a disk IO test is 20% CPU bound in process context 
and system load is 10, then the scheduler will throttle that task quite 
effectively.)

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30 20:30                       ` Ingo Molnar
@ 2006-11-30 20:38                         ` David Miller
  2006-11-30 20:49                           ` Ingo Molnar
  0 siblings, 1 reply; 36+ messages in thread
From: David Miller @ 2006-11-30 20:38 UTC (permalink / raw)
  To: mingo; +Cc: johnpol, nickpiggin, wenji, akpm, netdev, linux-kernel

From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 30 Nov 2006 21:30:26 +0100

> disk I/O is typically not CPU bound, and i believe these TCP tests /are/ 
> CPU-bound. Otherwise there would be no expiry of the timeslice to begin 
> with and the TCP receiver task would always be boosted to 'interactive' 
> status by the scheduler and would happily chug along at 500 mbits ...

It's about the prioritization of the work.

If all disk I/O were shut off and frozen while we copy file
data into userspace, you'd see the same problem for disk I/O.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30 20:38                         ` David Miller
@ 2006-11-30 20:49                           ` Ingo Molnar
  2006-11-30 20:54                             ` Ingo Molnar
  2006-11-30 20:55                             ` David Miller
  0 siblings, 2 replies; 36+ messages in thread
From: Ingo Molnar @ 2006-11-30 20:49 UTC (permalink / raw)
  To: David Miller; +Cc: johnpol, nickpiggin, wenji, akpm, netdev, linux-kernel

* David Miller <davem@davemloft.net> wrote:

> > disk I/O is typically not CPU bound, and i believe these TCP tests 
> > /are/ CPU-bound. Otherwise there would be no expiry of the timeslice 
> > to begin with and the TCP receiver task would always be boosted to 
> > 'interactive' status by the scheduler and would happily chug along 
> > at 500 mbits ...
> 
> It's about the prioritization of the work.
> 
> If all disk I/O were shut off and frozen while we copy file data into 
> userspace, you'd see the same problem for disk I/O.

well, it's an issue of how much processing is done in non-prioritized 
contexts. TCP is a bit more sensitive to process context being throttled 
- but disk I/O is not immune either: if nothing submits new IO, or if 
the task does shorts reads+writes then any process level throttling 
immediately shows up in IO throughput.

but in the general sense it is /unfair/ that certain processing such as 
disk and network IO can get a disproportionate amount of CPU time from 
the system - just because they happen to have some of their processing 
in IRQ and softirq context (which is essentially prioritized to 
SCHED_FIFO 100). A system can easily spend 80% CPU time in softirq 
context. (and that is easily visible in something like an -rt kernel 
where various softirq contexts are separate threads and you can see 30% 
net-rx and 20% net-tx CPU utilization in 'top'). How is this kind of 
processing different from purely process-context based subsystems?

so i agree with you that by tweaking the TCP stack to be less sensitive 
to process throttling you /will/ improve the relative performance of the 
TCP receiver task - but in general system design and scheduler design 
terms it's not a win.

i'd also agree with the notion that the current 'throttling' of process 
contexts can be abrupt and uncooperative, and hence the TCP stack could 
get more out of the same amount of CPU time if it used it in a smarter 
way. As i pointed it out in the first mail i'd support the TCP stack 
getting the ability to query how much timeslices it has - or even the 
scheduler notifying the TCP stack via some downcall if 
current->timeslice reaches 1 (or something like that).

So i dont support the scheme proposed here, the blatant bending of the 
priority scale towards the TCP workload. Instead what i'd like to see is 
more TCP performance (and a nicer over-the-wire behavior - no 
retransmits for example) /with the same 10% CPU time used/. Are we in
rough agreement?

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30 20:49                           ` Ingo Molnar
@ 2006-11-30 20:54                             ` Ingo Molnar
  2006-11-30 20:55                             ` David Miller
  1 sibling, 0 replies; 36+ messages in thread
From: Ingo Molnar @ 2006-11-30 20:54 UTC (permalink / raw)
  To: David Miller; +Cc: johnpol, nickpiggin, wenji, akpm, netdev, linux-kernel

* Ingo Molnar <mingo@elte.hu> wrote:

> [...] Instead what i'd like to see is more TCP performance (and a 
> nicer over-the-wire behavior - no retransmits for example) /with the 
> same 10% CPU time used/. Are we in rough agreement?

put in another way: i'd like to see the "TCP bytes transferred per CPU 
time spent by the TCP stack" ratio to be maximized in a load-independent 
way (part of which is the sender host too: to not cause unnecessary 
retransmits is important as well). In a high-load scenario this means 
that any measure that purely improves TCP throughput by giving it more 
cycles is not a real improvement. So the focus should be on throttling 
intelligently and without causing extra work on the sender side either - 
not on trying to circumvent throttling measures.

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30 20:49                           ` Ingo Molnar
  2006-11-30 20:54                             ` Ingo Molnar
@ 2006-11-30 20:55                             ` David Miller
  1 sibling, 0 replies; 36+ messages in thread
From: David Miller @ 2006-11-30 20:55 UTC (permalink / raw)
  To: mingo; +Cc: johnpol, nickpiggin, wenji, akpm, netdev, linux-kernel

From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 30 Nov 2006 21:49:08 +0100

> So i dont support the scheme proposed here, the blatant bending of the 
> priority scale towards the TCP workload.

I don't support this scheme either ;-)

That's why my proposal is to find a way to allow input packet
processing even during tcp_recvmsg() work.  It is a solution that
would give the TCP task exactly it's time slice, no more, no less,
without the erroneous behavior of sleeping with packets held in the
socket backlog.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30 10:22                 ` Evgeniy Polyakov
  2006-11-30 10:32                   ` Ingo Molnar
@ 2006-11-30 20:14                   ` David Miller
  2006-11-30 20:42                     ` Wenji Wu
  2006-12-01  9:53                     ` Evgeniy Polyakov
  1 sibling, 2 replies; 36+ messages in thread
From: David Miller @ 2006-11-30 20:14 UTC (permalink / raw)
  To: johnpol; +Cc: nickpiggin, mingo, wenji, akpm, netdev, linux-kernel

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Thu, 30 Nov 2006 13:22:06 +0300

> It steals timeslices from other processes to complete tcp_recvmsg()
> task, and only when it does it for too long, it will be preempted.
> Processing backlog queue on behalf of need_resched() will break
> fairness too - processing itself can take a lot of time, so process
> can be scheduled away in that part too.

Yes, at this point I agree with this analysis.

Currently I am therefore advocating some way to allow
full input packet handling even amidst tcp_recvmsg()
processing.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30 20:14                   ` David Miller
@ 2006-11-30 20:42                     ` Wenji Wu
  2006-12-01  9:53                     ` Evgeniy Polyakov
  1 sibling, 0 replies; 36+ messages in thread
From: Wenji Wu @ 2006-11-30 20:42 UTC (permalink / raw)
  To: David Miller, johnpol; +Cc: nickpiggin, mingo, akpm, netdev, linux-kernel

> It steals timeslices from other processes to complete tcp_recvmsg()
> task, and only when it does it for too long, it will be preempted.
> Processing backlog queue on behalf of need_resched() will break
> fairness too - processing itself can take a lot of time, so process
> can be scheduled away in that part too.

It does steal timeslices from other processes to complete tcp_recvmsg()
task. But I do not think it will  take long. When processing backlog, the
processed packets will go to the receive buffer, the TCP flow control will
take effect to slow down the sender.

The data receiving process might be preempted by higher priority processes.
Only the data recieving process stays in the active array, the problem is
not that bad because the process might resume its execution soon. The worst
case is that it expires and is moved to the active array with packets within
the backlog queue.

wenji

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30 20:14                   ` David Miller
  2006-11-30 20:42                     ` Wenji Wu
@ 2006-12-01  9:53                     ` Evgeniy Polyakov
  2006-12-01 23:18                       ` David Miller
  1 sibling, 1 reply; 36+ messages in thread
From: Evgeniy Polyakov @ 2006-12-01  9:53 UTC (permalink / raw)
  To: David Miller; +Cc: nickpiggin, mingo, wenji, akpm, netdev, linux-kernel

On Thu, Nov 30, 2006 at 12:14:43PM -0800, David Miller (davem@davemloft.net) wrote:
> > It steals timeslices from other processes to complete tcp_recvmsg()
> > task, and only when it does it for too long, it will be preempted.
> > Processing backlog queue on behalf of need_resched() will break
> > fairness too - processing itself can take a lot of time, so process
> > can be scheduled away in that part too.
> 
> Yes, at this point I agree with this analysis.
> 
> Currently I am therefore advocating some way to allow
> full input packet handling even amidst tcp_recvmsg()
> processing.

Isn't it a step in direction of full tcp processing bound to process
context? :)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-12-01  9:53                     ` Evgeniy Polyakov
@ 2006-12-01 23:18                       ` David Miller
  0 siblings, 0 replies; 36+ messages in thread
From: David Miller @ 2006-12-01 23:18 UTC (permalink / raw)
  To: johnpol; +Cc: nickpiggin, mingo, wenji, akpm, netdev, linux-kernel

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Fri, 1 Dec 2006 12:53:07 +0300

> Isn't it a step in direction of full tcp processing bound to process
> context? :)

:-)

Rather, it is just finer grained locking.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30  6:30     ` David Miller
  2006-11-30  6:47       ` Ingo Molnar
@ 2006-11-30  6:56       ` Ingo Molnar
  1 sibling, 0 replies; 36+ messages in thread
From: Ingo Molnar @ 2006-11-30  6:56 UTC (permalink / raw)
  To: David Miller; +Cc: wenji, akpm, netdev, linux-kernel

* David Miller <davem@davemloft.net> wrote:

> This is why my suggestion is to preempt_disable() as soon as we grab 
> the socket lock, [...]

independently of the issue at hand, in general the explicit use of 
preempt_disable() in non-infrastructure code is quite a heavy tool. Its 
effects are heavy and global: it disables /all/ preemption (even on 
PREEMPT_RT). Furthermore, when preempt_disable() is used for per-CPU 
data structures then [unlike for example to a spin-lock] the connection 
between the 'data' and the 'lock' is not explicit - causing all kinds of 
grief when trying to convert such code to a different preemption model. 
(such as PREEMPT_RT :-)

So my plan is to remove all "open-coded" use of preempt_disable() [and 
raw use of local_irq_save/restore] from the kernel and replace it with 
some facility that connects data and lock. (Note that this will not 
result in any actual changes on the instruction level because internally 
every such facility still maps to preempt_disable() on non-PREEMPT_RT 
kernels, so on non-PREEMPT_RT kernels such code will still be the same 
as before.)

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30  2:19 ` David Miller
  2006-11-30  6:17   ` Ingo Molnar
@ 2006-11-30 16:08   ` Wenji Wu
  2006-11-30 20:06     ` David Miller
  1 sibling, 1 reply; 36+ messages in thread
From: Wenji Wu @ 2006-11-30 16:08 UTC (permalink / raw)
  To: David Miller; +Cc: akpm, netdev, linux-kernel

>We can make explicitl preemption checks in the main loop of
>tcp_recvmsg(), and release the socket and run the backlog if
>need_resched() is TRUE.

>This is the simplest and most elegant solution to this problem.

I am not sure whether this approach will work. How can you make the explicit
preemption checks?

For Desktop case, yes, you can make the explicit preemption checks at some
points whether need_resched() is true. But when need_resched() is true, you
can not decide whether it is triggered by higher priority processes becoming
runnable, or the process within tcp_recvmsg being expiring.

If the higher prioirty processes become runnable (e.g., interactive
process), you better yield the CPU, instead of continuing this process. If
it is the case that the process within tcp_recvmsg() is expriring, then, you
can continue the process to go ahead to process backlog.

For Low-latency Desktop case, I believe it is very hard to make the checks.
We do not know when the process is going to expire, or when higher priority
process will become runnable. The process could expire at any moment, or
higher priority process could become runnnable at any moment. If we do not
want to tradeoff system responsiveness, where do you want to make the check?
If you just make the chekc, then need_resched() become TRUE, what are you
going to do in this case?

wenji

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30 16:08   ` Wenji Wu
@ 2006-11-30 20:06     ` David Miller
  0 siblings, 0 replies; 36+ messages in thread
From: David Miller @ 2006-11-30 20:06 UTC (permalink / raw)
  To: wenji; +Cc: akpm, netdev, linux-kernel

From: Wenji Wu <wenji@fnal.gov>
Date: Thu, 30 Nov 2006 10:08:22 -0600

> If the higher prioirty processes become runnable (e.g., interactive
> process), you better yield the CPU, instead of continuing this process. If
> it is the case that the process within tcp_recvmsg() is expriring, then, you
> can continue the process to go ahead to process backlog.

Yes, I understand this, and I made that point in one of my
replies to Ingo Molnar last night.

The only seemingly remaining possibility is to find a way to allow
input packet processing, at least enough to emit ACKs, during
tcp_recvmsg() processing.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30  1:56 [patch 1/4] - Potential performance bottleneck for Linxu TCP Wenji Wu
  2006-11-30  2:19 ` David Miller
@ 2006-11-30  9:33 ` Christoph Hellwig
  2006-11-30 16:51   ` Lee Revell
  1 sibling, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2006-11-30  9:33 UTC (permalink / raw)
  To: Wenji Wu; +Cc: David Miller, akpm, netdev, linux-kernel

On Wed, Nov 29, 2006 at 07:56:58PM -0600, Wenji Wu wrote:
> Yes, when CONFIG_PREEMPT is disabled, the "problem" won't happen. That is why I put "for 2.6 desktop, low-latency desktop" in the uploaded paper. This "problem" happens in the 2.6 Desktop and Low-latency Desktop.

CONFIG_PREEMPT is only for people that are in for the feeling.  There is no
real world advtantage to it and we should probably remove it again.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30  9:33 ` Christoph Hellwig
@ 2006-11-30 16:51   ` Lee Revell
  0 siblings, 0 replies; 36+ messages in thread
From: Lee Revell @ 2006-11-30 16:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Wenji Wu, David Miller, akpm, netdev, linux-kernel

On Thu, 2006-11-30 at 09:33 +0000, Christoph Hellwig wrote:
> On Wed, Nov 29, 2006 at 07:56:58PM -0600, Wenji Wu wrote:
> > Yes, when CONFIG_PREEMPT is disabled, the "problem" won't happen. That is why I put "for 2.6 desktop, low-latency desktop" in the uploaded paper. This "problem" happens in the 2.6 Desktop and Low-latency Desktop.
> 
> CONFIG_PREEMPT is only for people that are in for the feeling.  There is no
> real world advtantage to it and we should probably remove it again.

There certainly is a real world advantage for many applications.  Of
course it would be better if the latency requirements could be met
without kernel preemption but that's not the case now.

Lee


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
@ 2006-11-30  2:02 Wenji Wu
  2006-11-30  6:19 ` Ingo Molnar
  0 siblings, 1 reply; 36+ messages in thread
From: Wenji Wu @ 2006-11-30  2:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David Miller, netdev, linux-kernel

> That yield() will need to be removed - yield()'s behaviour is truly 
> awfulif the system is otherwise busy.  What is it there for?

Please read the uploaded paper, which has detailed description.

thanks,

wenji

----- Original Message -----
From: Andrew Morton <akpm@osdl.org>
Date: Wednesday, November 29, 2006 7:08 pm
Subject: Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP

> On Wed, 29 Nov 2006 16:53:11 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
> 
> > 
> > Please, it is very difficult to review your work the way you have
> > submitted this patch as a set of 4 patches.  These patches have not
> > been split up "logically", but rather they have been split up "per
> > file" with the same exact changelog message in each patch posting.
> > This is very clumsy, and impossible to review, and wastes a lot of
> > mailing list bandwith.
> > 
> > We have an excellent file, called 
> Documentation/SubmittingPatches, in
> > the kernel source tree, which explains exactly how to do this
> > correctly.
> > 
> > By splitting your patch into 4 patches, one for each file touched,
> > it is impossible to review your patch as a logical whole.
> > 
> > Please also provide your patch inline so people can just hit reply
> > in their mail reader client to quote your patch and comment on it.
> > This is impossible with the attachments you've used.
> > 
> 
> Here you go - joined up, cleaned up, ported to mainline and test-
> compiled.
> That yield() will need to be removed - yield()'s behaviour is truly 
> awfulif the system is otherwise busy.  What is it there for?
> 
> 
> 
> From: Wenji Wu <wenji@fnal.gov>
> 
> For Linux TCP, when the network applcaiton make system call to move 
> data from
> socket's receive buffer to user space by calling tcp_recvmsg().  
> The socket
> will be locked.  During this period, all the incoming packet for 
> the TCP
> socket will go to the backlog queue without being TCP processed
> 
> Since Linux 2.6 can be inerrupted mid-task, if the network application
> expires, and moved to the expired array with the socket locked, all 
> thepackets within the backlog queue will not be TCP processed till 
> the network
> applicaton resume its execution.  If the system is heavily loaded, 
> TCP can
> easily RTO in the Sender Side.
> 
> 
> 
> include/linux/sched.h |    2 ++
> kernel/fork.c         |    3 +++
> kernel/sched.c        |   24 ++++++++++++++++++------
> net/ipv4/tcp.c        |    9 +++++++++
> 4 files changed, 32 insertions(+), 6 deletions(-)
> 
> diff -puN net/ipv4/tcp.c~tcp-speedup net/ipv4/tcp.c
> --- a/net/ipv4/tcp.c~tcp-speedup
> +++ a/net/ipv4/tcp.c
> @@ -1109,6 +1109,8 @@ int tcp_recvmsg(struct kiocb *iocb, stru
> 	struct task_struct *user_recv = NULL;
> 	int copied_early = 0;
> 
> +	current->backlog_flag = 1;
> +
> 	lock_sock(sk);
> 
> 	TCP_CHECK_TIMER(sk);
> @@ -1468,6 +1470,13 @@ skip_copy:
> 
> 	TCP_CHECK_TIMER(sk);
> 	release_sock(sk);
> +
> +	current->backlog_flag = 0;
> +	if (current->extrarun_flag == 1){
> +        	current->extrarun_flag = 0;
> +        	yield();
> +	}
> +
> 	return copied;
> 
> out:
> diff -puN include/linux/sched.h~tcp-speedup include/linux/sched.h
> --- a/include/linux/sched.h~tcp-speedup
> +++ a/include/linux/sched.h
> @@ -1023,6 +1023,8 @@ struct task_struct {
> #ifdef	CONFIG_TASK_DELAY_ACCT
> 	struct task_delay_info *delays;
> #endif
> +	int backlog_flag; 	/* packets wait in tcp backlog queue flag */
> +	int extrarun_flag;	/* extra run flag for TCP performance */
> };
> 
> static inline pid_t process_group(struct task_struct *tsk)
> diff -puN kernel/sched.c~tcp-speedup kernel/sched.c
> --- a/kernel/sched.c~tcp-speedup
> +++ a/kernel/sched.c
> @@ -3099,12 +3099,24 @@ void scheduler_tick(void)
> 
>         	if (!rq->expired_timestamp)
>                 	rq->expired_timestamp = jiffies;
> -        	if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
> -                	enqueue_task(p, rq->expired);
> -                	if (p->static_prio < rq->best_expired_prio)
> -                        	rq->best_expired_prio = p->static_prio;
> -        	} else
> -                	enqueue_task(p, rq->active);
> +        	if (p->backlog_flag == 0) {
> +                	if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
> +                        	enqueue_task(p, rq->expired);
> +                        	if (p->static_prio < rq->best_expired_prio)
> +                                	rq->best_expired_prio = p-
> >static_prio;+                	} else
> +                        	enqueue_task(p, rq->active);
> +        	} else {
> +                	if (expired_starving(rq)) {
> +                        	enqueue_task(p,rq->expired);
> +                        	if (p->static_prio < rq->best_expired_prio)
> +                                	rq->best_expired_prio = p-
> >static_prio;+                	} else {
> +                        	if (!TASK_INTERACTIVE(p))
> +                                	p->extrarun_flag = 1;
> +                        	enqueue_task(p,rq->active);
> +                	}
> +        	}
> 	} else {
>         	/*
>                  * Prevent a too long timeslice allowing a task to 
> monopolizediff -puN kernel/fork.c~tcp-speedup kernel/fork.c
> --- a/kernel/fork.c~tcp-speedup
> +++ a/kernel/fork.c
> @@ -1032,6 +1032,9 @@ static struct task_struct *copy_process(
> 	clear_tsk_thread_flag(p, TIF_SIGPENDING);
> 	init_sigpending(&p->pending);
> 
> +	p->backlog_flag = 0;
> +	p->extrarun_flag = 0;
> +
> 	p->utime = cputime_zero;
> 	p->stime = cputime_zero;
>  	p->sched_time = 0;
> _
> 
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30  2:02 Wenji Wu
@ 2006-11-30  6:19 ` Ingo Molnar
  0 siblings, 0 replies; 36+ messages in thread
From: Ingo Molnar @ 2006-11-30  6:19 UTC (permalink / raw)
  To: Wenji Wu; +Cc: Andrew Morton, David Miller, netdev, linux-kernel


* Wenji Wu <wenji@fnal.gov> wrote:

> > That yield() will need to be removed - yield()'s behaviour is truly 
> > awfulif the system is otherwise busy.  What is it there for?
> 
> Please read the uploaded paper, which has detailed description.

do you have any URL for that?

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [Changelog] - Potential performance bottleneck for Linxu TCP
@ 2006-11-29 23:27 Wenji Wu
  2006-11-29 23:28 ` [patch 1/4] " Wenji Wu
  0 siblings, 1 reply; 36+ messages in thread
From: Wenji Wu @ 2006-11-29 23:27 UTC (permalink / raw)
  To: netdev, davem, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 846 bytes --]

From: Wenji Wu <wenji@fnal.gov>

Greetings,

For Linux TCP, when the network applcaiton make system call to move data
from
socket's receive buffer to user space by calling tcp_recvmsg(). The socket
will
be locked. During the period, all the incoming packet for the TCP socket
will go
to the backlog queue without being TCP processed. Since Linux 2.6 can be
inerrupted mid-task, if the network application expires, and moved to the
expired array with the socket locked, all the packets within the backlog
queue
will not be TCP processed till the network applicaton resume its execution.
If
the system is heavily loaded, TCP can easily RTO in the Sender Side.

Attached is the Changelog for the patch

best regards,

wenji

Wenji Wu
Network Researcher
Fermilab, MS-368
P.O. Box 500
Batavia, IL, 60510
(Email): wenji@fnal.gov
(O): 001-630-840-4541

[-- Attachment #2: Changelog.txt --]
[-- Type: text/plain, Size: 2988 bytes --]

From: Wenji Wu <wenji@fnal.gov>

- Subject

Potential performance bottleneck for Linux TCP (2.6 Desktop, Low-latency Desktop)

- Why the kernel needed patching

For Linux TCP, when the network applcaiton make system call to move data from
socket's receive buffer to user space by calling tcp_recvmsg(). The socket will
be locked. During the period, all the incoming packet for the TCP socket will go
to the backlog queue without being TCP processed. Since Linux 2.6 can be
inerrupted mid-task, if the network application expires, and moved to the
expired array with the socket locked, all the packets within the backlog queue
will not be TCP processed till the network applicaton resume its execution. If
the system is heavily loaded, TCP can easily RTO in the Sender Side.

- The overall design apparoch in the patch

the underlying idea here is that when there are packets waiting on the prequeue 
or backlog queue, do not allow the data receiving process to release the CPU for long. 

- Implementation details

We have modified the Linux process scheduling policy and tcp_recvmsg().

To summarize, the solution works as follows: 

an expired data receiving process with packets waiting on backlog queue or 
prequeue is moved to the active array, instead of expired array as usual. 
More often than not, the expired data receiving process will continue to run. 
Even it doesn’t, the wait time before it resumes its execution will be greatly reduced. 
However, this gives the process extra runs compared to other processes in the runqueue. 

For the sake of fairness, the process would be labeled with the extra_run_flag. 

Also considering the facts that: 

(1) the resumed process will continue its execution within tcp_recvmsg(); 
(2) tcp_recvmsg() does not return to user space until the prequeue and backlog queue are drained. 

For the sake of fairness, we modified tcp_recvmsg() as such: after prequeue and backlog 
queue are drained and before tcp_recvmsg() returns to user space, any process labeled with 
the extra_run_flag will call yield() to explicitly yield the CPU to other proc-esses in the runqueue. 
yield() works by removing the process from the active array (where it current is, because it is running), 
and inserting it into the expired array. Also, to prevent processes in the expired array from starving, 

A special rule has been provided for Linux process scheduling (the same rule used for interactive processes): 
an expired process is moved to the expired array without respect to its status if processes in the expired array are starved.

Changed files:

/kernel/sched.c
/kernel/fork.c
/include/linux/sched.h
/net/ipv4/tcp.c

- Testing results

The proposed solution tradeoffs a small amount of fairness performance to resolve the TCP performance bottleneck. 
The proposed solution won’t cause serious fairness issue.

The patch is for Linux kernel 2.6.14 Deskop and Low-latency Desktop

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-29 23:27 [Changelog] " Wenji Wu
@ 2006-11-29 23:28 ` Wenji Wu
  2006-11-30  0:53   ` David Miller
  0 siblings, 1 reply; 36+ messages in thread
From: Wenji Wu @ 2006-11-29 23:28 UTC (permalink / raw)
  To: wenji, netdev, davem, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 832 bytes --]

From: Wenji Wu <wenji@fnal.gov>

Greetings,

For Linux TCP, when the network applcaiton make system call to move data
from
socket's receive buffer to user space by calling tcp_recvmsg(). The socket
will
be locked. During the period, all the incoming packet for the TCP socket
will go
to the backlog queue without being TCP processed. Since Linux 2.6 can be
inerrupted mid-task, if the network application expires, and moved to the
expired array with the socket locked, all the packets within the backlog
queue
will not be TCP processed till the network applicaton resume its execution.
If
the system is heavily loaded, TCP can easily RTO in the Sender Side.

Attached is the patch 1/4

best regards,

wenji

Wenji Wu
Network Researcher
Fermilab, MS-368
P.O. Box 500
Batavia, IL, 60510
(Email): wenji@fnal.gov
(O): 001-630-840-4541

[-- Attachment #2: tcp.c.patch --]
[-- Type: application/octet-stream, Size: 553 bytes --]

--- linux-2.6.14-old/net/ipv4/tcp.c	2006-11-29 16:24:56.000000000 -0600
+++ linux-2.6.14/net/ipv4/tcp.c	2006-11-29 11:25:57.000000000 -0600
@@ -1109,6 +1109,8 @@
 	int target;		/* Read at least this many bytes */
 	long timeo;
 	struct task_struct *user_recv = NULL;
+	
+	current->backlog_flag = 1;

 	lock_sock(sk);

@@ -1394,6 +1396,13 @@

 	TCP_CHECK_TIMER(sk);
 	release_sock(sk);
+
+	current->backlog_flag = 0;
+	if(current->extrarun_flag == 1){
+		current->extrarun_flag = 0;
+		yield();
+	}
+
 	return copied;

 out:

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-29 23:28 ` [patch 1/4] " Wenji Wu
@ 2006-11-30  0:53   ` David Miller
  2006-11-30  1:08     ` Andrew Morton
  0 siblings, 1 reply; 36+ messages in thread
From: David Miller @ 2006-11-30  0:53 UTC (permalink / raw)
  To: wenji; +Cc: netdev, akpm, linux-kernel

Please, it is very difficult to review your work the way you have
submitted this patch as a set of 4 patches.  These patches have not
been split up "logically", but rather they have been split up "per
file" with the same exact changelog message in each patch posting.
This is very clumsy, and impossible to review, and wastes a lot of
mailing list bandwith.

We have an excellent file, called Documentation/SubmittingPatches, in
the kernel source tree, which explains exactly how to do this
correctly.

By splitting your patch into 4 patches, one for each file touched,
it is impossible to review your patch as a logical whole.

Please also provide your patch inline so people can just hit reply
in their mail reader client to quote your patch and comment on it.
This is impossible with the attachments you've used.

Thanks.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30  0:53   ` David Miller
@ 2006-11-30  1:08     ` Andrew Morton
  2006-11-30  1:13       ` David Miller
  2006-11-30  6:04       ` Mike Galbraith
  0 siblings, 2 replies; 36+ messages in thread
From: Andrew Morton @ 2006-11-30  1:08 UTC (permalink / raw)
  To: David Miller; +Cc: wenji, netdev, linux-kernel

On Wed, 29 Nov 2006 16:53:11 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> 
> Please, it is very difficult to review your work the way you have
> submitted this patch as a set of 4 patches.  These patches have not
> been split up "logically", but rather they have been split up "per
> file" with the same exact changelog message in each patch posting.
> This is very clumsy, and impossible to review, and wastes a lot of
> mailing list bandwith.
> 
> We have an excellent file, called Documentation/SubmittingPatches, in
> the kernel source tree, which explains exactly how to do this
> correctly.
> 
> By splitting your patch into 4 patches, one for each file touched,
> it is impossible to review your patch as a logical whole.
> 
> Please also provide your patch inline so people can just hit reply
> in their mail reader client to quote your patch and comment on it.
> This is impossible with the attachments you've used.
> 

Here you go - joined up, cleaned up, ported to mainline and test-compiled.

That yield() will need to be removed - yield()'s behaviour is truly awful
if the system is otherwise busy.  What is it there for?



From: Wenji Wu <wenji@fnal.gov>

For Linux TCP, when the network applcaiton make system call to move data from
socket's receive buffer to user space by calling tcp_recvmsg().  The socket
will be locked.  During this period, all the incoming packet for the TCP
socket will go to the backlog queue without being TCP processed

Since Linux 2.6 can be inerrupted mid-task, if the network application
expires, and moved to the expired array with the socket locked, all the
packets within the backlog queue will not be TCP processed till the network
applicaton resume its execution.  If the system is heavily loaded, TCP can
easily RTO in the Sender Side.



 include/linux/sched.h |    2 ++
 kernel/fork.c         |    3 +++
 kernel/sched.c        |   24 ++++++++++++++++++------
 net/ipv4/tcp.c        |    9 +++++++++
 4 files changed, 32 insertions(+), 6 deletions(-)

diff -puN net/ipv4/tcp.c~tcp-speedup net/ipv4/tcp.c
--- a/net/ipv4/tcp.c~tcp-speedup
+++ a/net/ipv4/tcp.c
@@ -1109,6 +1109,8 @@ int tcp_recvmsg(struct kiocb *iocb, stru
 	struct task_struct *user_recv = NULL;
 	int copied_early = 0;
 
+	current->backlog_flag = 1;
+
 	lock_sock(sk);
 
 	TCP_CHECK_TIMER(sk);
@@ -1468,6 +1470,13 @@ skip_copy:
 
 	TCP_CHECK_TIMER(sk);
 	release_sock(sk);
+
+	current->backlog_flag = 0;
+	if (current->extrarun_flag == 1){
+		current->extrarun_flag = 0;
+		yield();
+	}
+
 	return copied;
 
 out:
diff -puN include/linux/sched.h~tcp-speedup include/linux/sched.h
--- a/include/linux/sched.h~tcp-speedup
+++ a/include/linux/sched.h
@@ -1023,6 +1023,8 @@ struct task_struct {
 #ifdef	CONFIG_TASK_DELAY_ACCT
 	struct task_delay_info *delays;
 #endif
+	int backlog_flag; 	/* packets wait in tcp backlog queue flag */
+	int extrarun_flag;	/* extra run flag for TCP performance */
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
diff -puN kernel/sched.c~tcp-speedup kernel/sched.c
--- a/kernel/sched.c~tcp-speedup
+++ a/kernel/sched.c
@@ -3099,12 +3099,24 @@ void scheduler_tick(void)
 
 		if (!rq->expired_timestamp)
 			rq->expired_timestamp = jiffies;
-		if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
-			enqueue_task(p, rq->expired);
-			if (p->static_prio < rq->best_expired_prio)
-				rq->best_expired_prio = p->static_prio;
-		} else
-			enqueue_task(p, rq->active);
+		if (p->backlog_flag == 0) {
+			if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
+				enqueue_task(p, rq->expired);
+				if (p->static_prio < rq->best_expired_prio)
+					rq->best_expired_prio = p->static_prio;
+			} else
+				enqueue_task(p, rq->active);
+		} else {
+			if (expired_starving(rq)) {
+				enqueue_task(p,rq->expired);
+				if (p->static_prio < rq->best_expired_prio)
+					rq->best_expired_prio = p->static_prio;
+			} else {
+				if (!TASK_INTERACTIVE(p))
+					p->extrarun_flag = 1;
+				enqueue_task(p,rq->active);
+			}
+		}
 	} else {
 		/*
 		 * Prevent a too long timeslice allowing a task to monopolize
diff -puN kernel/fork.c~tcp-speedup kernel/fork.c
--- a/kernel/fork.c~tcp-speedup
+++ a/kernel/fork.c
@@ -1032,6 +1032,9 @@ static struct task_struct *copy_process(
 	clear_tsk_thread_flag(p, TIF_SIGPENDING);
 	init_sigpending(&p->pending);
 
+	p->backlog_flag = 0;
+	p->extrarun_flag = 0;
+
 	p->utime = cputime_zero;
 	p->stime = cputime_zero;
  	p->sched_time = 0;
_


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30  1:08     ` Andrew Morton
@ 2006-11-30  1:13       ` David Miller
  2006-11-30  6:04       ` Mike Galbraith
  1 sibling, 0 replies; 36+ messages in thread
From: David Miller @ 2006-11-30  1:13 UTC (permalink / raw)
  To: akpm; +Cc: wenji, netdev, linux-kernel

From: Andrew Morton <akpm@osdl.org>
Date: Wed, 29 Nov 2006 17:08:35 -0800

> On Wed, 29 Nov 2006 16:53:11 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
> 
> > 
> > Please, it is very difficult to review your work the way you have
> > submitted this patch as a set of 4 patches.  These patches have not
> > been split up "logically", but rather they have been split up "per
> > file" with the same exact changelog message in each patch posting.
> > This is very clumsy, and impossible to review, and wastes a lot of
> > mailing list bandwith.
> > 
> > We have an excellent file, called Documentation/SubmittingPatches, in
> > the kernel source tree, which explains exactly how to do this
> > correctly.
> > 
> > By splitting your patch into 4 patches, one for each file touched,
> > it is impossible to review your patch as a logical whole.
> > 
> > Please also provide your patch inline so people can just hit reply
> > in their mail reader client to quote your patch and comment on it.
> > This is impossible with the attachments you've used.
> > 
> 
> Here you go - joined up, cleaned up, ported to mainline and test-compiled.
> 
> That yield() will need to be removed - yield()'s behaviour is truly awful
> if the system is otherwise busy.  What is it there for?

What about simply turning off CONFIG_PREEMPT to fix this "problem"?

We always properly run the backlog (by doing a release_sock()) before
going to sleep otherwise except for the specific case of taking a page
fault during the copy to userspace.  It is only CONFIG_PREEMPT that
can cause this situation to occur in other circumstances as far as I
can see.

We could also pepper tcp_recvmsg() with some very carefully placed
preemption disable/enable calls to deal with this even with
CONFIG_PREEMPT enabled.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
  2006-11-30  1:08     ` Andrew Morton
  2006-11-30  1:13       ` David Miller
@ 2006-11-30  6:04       ` Mike Galbraith
  1 sibling, 0 replies; 36+ messages in thread
From: Mike Galbraith @ 2006-11-30  6:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David Miller, wenji, netdev, linux-kernel

On Wed, 2006-11-29 at 17:08 -0800, Andrew Morton wrote:
> +		if (p->backlog_flag == 0) {
> +			if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
> +				enqueue_task(p, rq->expired);
> +				if (p->static_prio < rq->best_expired_prio)
> +					rq->best_expired_prio = p->static_prio;
> +			} else
> +				enqueue_task(p, rq->active);
> +		} else {
> +			if (expired_starving(rq)) {
> +				enqueue_task(p,rq->expired);
> +				if (p->static_prio < rq->best_expired_prio)
> +					rq->best_expired_prio = p->static_prio;
> +			} else {
> +				if (!TASK_INTERACTIVE(p))
> +					p->extrarun_flag = 1;
> +				enqueue_task(p,rq->active);
> +			}
> +		}

(oh my, doing that to the scheduler upsets my tummy, but that aside...)

I don't see how that can really solve anything.  "Interactive" tasks
starting to use cpu heftily can still preempt and keep the special cased
cpu hog off the cpu for ages.  It also only takes one task in the
expired array to trigger the forced array switch with a fully loaded
cpu, and once any task hits the expired array, a stream of wakeups can
prevent the switch from completing for as long as you can keep wakeups
happening.

	-Mike

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2006-12-01 23:18 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-11-30  1:56 [patch 1/4] - Potential performance bottleneck for Linxu TCP Wenji Wu
2006-11-30  2:19 ` David Miller
2006-11-30  6:17   ` Ingo Molnar
2006-11-30  6:30     ` David Miller
2006-11-30  6:47       ` Ingo Molnar
2006-11-30  7:12         ` David Miller
2006-11-30  7:35           ` Ingo Molnar
2006-11-30  9:52             ` Evgeniy Polyakov
2006-11-30 10:07               ` Nick Piggin
2006-11-30 10:22                 ` Evgeniy Polyakov
2006-11-30 10:32                   ` Ingo Molnar
2006-11-30 17:04                     ` Wenji Wu
2006-11-30 20:20                       ` Ingo Molnar
2006-11-30 20:58                         ` Wenji Wu
2006-11-30 20:22                     ` David Miller
2006-11-30 20:30                       ` Ingo Molnar
2006-11-30 20:38                         ` David Miller
2006-11-30 20:49                           ` Ingo Molnar
2006-11-30 20:54                             ` Ingo Molnar
2006-11-30 20:55                             ` David Miller
2006-11-30 20:14                   ` David Miller
2006-11-30 20:42                     ` Wenji Wu
2006-12-01  9:53                     ` Evgeniy Polyakov
2006-12-01 23:18                       ` David Miller
2006-11-30  6:56       ` Ingo Molnar
2006-11-30 16:08   ` Wenji Wu
2006-11-30 20:06     ` David Miller
2006-11-30  9:33 ` Christoph Hellwig
2006-11-30 16:51   ` Lee Revell
  -- strict thread matches above, loose matches on Subject: below --
2006-11-30  2:02 Wenji Wu
2006-11-30  6:19 ` Ingo Molnar
2006-11-29 23:27 [Changelog] " Wenji Wu
2006-11-29 23:28 ` [patch 1/4] " Wenji Wu
2006-11-30  0:53   ` David Miller
2006-11-30  1:08     ` Andrew Morton
2006-11-30  1:13       ` David Miller
2006-11-30  6:04       ` Mike Galbraith

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).