netdev.vger.kernel.org archive mirror
* [RFC] Introduce batch variants of the accept() and epoll_ctl() syscalls
From: Li Yu @ 2012-06-15  4:13 UTC
  To: Linux Netdev List; +Cc: Linux Kernel Mailing List, davidel

Hi,

  We have encountered a performance problem in a large-scale compute
cluster that needs to handle a large number of concurrent incoming TCP
connection requests.

  top shows that the kernel is the main CPU hog. The test is simple,
just an accept() -> epoll_ctl(ADD) loop; the ratio of CPU utilization
sys% to si% is about 2:5.

  I also asked some experienced web server/proxy developers on my team
for suggestions; it seems that many userland programs already call
accept() multiple times after being woken up by epoll_wait(), and the
common pattern is then to add each fd returned by accept() to the epoll
set with an epoll_ctl() syscall.
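
  For reference, a minimal sketch of that pattern (assuming listen_fd
is a non-blocking listening socket and epfd an existing epoll instance;
needs <sys/socket.h>, <sys/epoll.h>, <errno.h>, <stdio.h>, <unistd.h>):

  /* Drain the accept queue after epoll_wait() wakes us up, adding each
   * new fd to the epoll set: one accept4() plus one epoll_ctl() syscall
   * per accepted connection. */
  for (;;) {
      struct sockaddr_storage addr;
      socklen_t addr_len = sizeof(addr);
      int fd = accept4(listen_fd, (struct sockaddr *)&addr,
                       &addr_len, SOCK_NONBLOCK);

      if (fd < 0) {
          if (errno != EAGAIN && errno != EWOULDBLOCK)
              perror("accept4");  /* real error */
          break;                  /* queue drained, or error */
      }

      struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
      if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
          close(fd);
  }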

  Therefore, I think we should introduce batch variants of the accept()
and epoll_ctl() syscalls, just like sendmmsg() and recvmmsg().

  For accept(), we would need a new syscall; it might look like this:

  struct accept_result {
      int fd;
      struct sockaddr addr;
      socklen_t addr_len;
  };

  int maccept4(int fd, int flags, int nr_accept_result,
               struct accept_result *results);

  For epoll_ctl(), there are two ways to extend it; I prefer extending
the current interface over introducing a new syscall. We could add a new
flag, EPOLL_CTL_BATCH. If userland calls epoll_ctl() with this flag set,
the meaning of the last two arguments of epoll_ctl() changes, e.g.:

  struct batch_epoll_event batch_events[] = {
         {
              .fd = a_newsock_fd,
              .epoll_event = { ... },
         },
         ...
  };

  ret = epoll_ctl(epfd, EPOLL_CTL_ADD|EPOLL_CTL_BATCH, nr_batch_events,
                  batch_events);
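
  Together, the two proposals would collapse the per-connection loop
above into two syscalls per wakeup. A purely hypothetical sketch
(neither maccept4() nor EPOLL_CTL_BATCH exists; the names and semantics
are only what is proposed above, and the cast of the last argument is
omitted for brevity):

  struct accept_result results[64];
  struct batch_epoll_event batch_events[64];
  int i, n, ret;

  /* Hypothetical: accept up to 64 pending connections in one syscall. */
  n = maccept4(listen_fd, SOCK_NONBLOCK, 64, results);
  for (i = 0; i < n; i++) {
      batch_events[i].fd = results[i].fd;
      batch_events[i].epoll_event.events = EPOLLIN;
      batch_events[i].epoll_event.data.fd = results[i].fd;
  }
  /* Hypothetical: register all the new fds with epoll in one syscall. */
  if (n > 0)
      ret = epoll_ctl(epfd, EPOLL_CTL_ADD|EPOLL_CTL_BATCH, n,
                      batch_events);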

  Thanks.

Yu


* Re: [RFC] Introduce batch variants of the accept() and epoll_ctl() syscalls
From: Changli Gao @ 2012-06-15  4:29 UTC
  To: Li Yu; +Cc: Linux Netdev List, Linux Kernel Mailing List, davidel

On Fri, Jun 15, 2012 at 12:13 PM, Li Yu <raise.sail@gmail.com> wrote:
> ...
>
>  Therefore, I think we should introduce batch variants of the accept()
> and epoll_ctl() syscalls, just like sendmmsg() and recvmmsg().
>
> ...

I think it is a good idea. Would you please implement a prototype and
give some numbers? That kind of data may help sell this idea.
Thanks.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)


* Re: [RFC] Introduce batch variants of the accept() and epoll_ctl() syscalls
From: Li Yu @ 2012-06-15  5:37 UTC
  To: Changli Gao; +Cc: Linux Netdev List, Linux Kernel Mailing List, davidel

On 2012-06-15 12:29, Changli Gao wrote:
> On Fri, Jun 15, 2012 at 12:13 PM, Li Yu <raise.sail@gmail.com> wrote:
>> ...
>
> I think it is a good idea. Would you please implement a prototype and
> give some numbers? That kind of data may help sell this idea.
> Thanks.

Of course; I think implementing them should not be hard work :)

Hmm, I really do not know whether it is necessary to introduce a new
syscall here. An alternative is to add a new socket option to handle
such batch requests, so applications can also detect whether the kernel
has this extended ability with an easy getsockopt() call.
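
For illustration, detection could look something like this
(SO_BATCH_ACCEPT is only a placeholder name for the proposed option,
not an existing one):

  int supported = 0;
  socklen_t len = sizeof(supported);

  /* An old kernel does not know the option and fails with
   * ENOPROTOOPT, so the application falls back to plain accept(). */
  if (getsockopt(listen_fd, SOL_SOCKET, SO_BATCH_ACCEPT,
                 &supported, &len) == 0 && supported)
      use_batch_accept = 1;  /* hypothetical application flag */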

Anyway, I am going to try to write a prototype first.

Thanks

Yu


* RE: [RFC] Introduce batch variants of the accept() and epoll_ctl() syscalls
From: David Laight @ 2012-06-15  8:35 UTC
  To: Li Yu, Linux Netdev List; +Cc: Linux Kernel Mailing List, davidel

 
>   We have encountered a performance problem in a large-scale compute
> cluster that needs to handle a large number of concurrent incoming TCP
> connection requests.
>
>   top shows that the kernel is the main CPU hog. The test is simple,
> just an accept() -> epoll_ctl(ADD) loop; the ratio of CPU utilization
> sys% to si% is about 2:5.
>
>   I also asked some experienced web server/proxy developers on my team
> for suggestions; it seems that many userland programs already call
> accept() multiple times after being woken up by epoll_wait(), and the
> common pattern is then to add each fd returned by accept() to the epoll
> set with an epoll_ctl() syscall.
>
>   Therefore, I think we should introduce batch variants of the accept()
> and epoll_ctl() syscalls, just like sendmmsg() and recvmmsg().
...

Having seen the support added to NetBSD for sendmmsg() and
recvmmsg() (and I'm told the Linux code is much the same),
I'm surprised that just cutting out a system call entry/exit
and an fd lookup is significant compared with the rest of the
costs involved in sending a message (which I presume is UDP here).
I'd be even more surprised if it were significant for an
incoming connection.

	David


* Re: [RFC] Introduce batch variants of the accept() and epoll_ctl() syscalls
From: Eric Dumazet @ 2012-06-15  8:51 UTC
  To: Li Yu; +Cc: Changli Gao, Linux Netdev List, Linux Kernel Mailing List, davidel

On Fri, 2012-06-15 at 13:37 +0800, Li Yu wrote:

> Of course; I think implementing them should not be hard work :)
>
> Hmm, I really do not know whether it is necessary to introduce a new
> syscall here. An alternative is to add a new socket option to handle
> such batch requests, so applications can also detect whether the kernel
> has this extended ability with an easy getsockopt() call.
>
> Anyway, I am going to try to write a prototype first.

Before that, could you post the output of "perf top", or "perf
record ...; perf report"?

>   top shows that the kernel is the main CPU hog. The test is simple,
> just an accept() -> epoll_ctl(ADD) loop; the ratio of CPU utilization
> sys% to si% is about 2:5.

This ratio is not meaningful if we don't know where the time is spent.


I doubt epoll_ctl(ADD) is the problem here...

If it is, batching the fds won't speed things up anyway...

I believe accept() is the problem here, because it contends with the
softirq processing the TCP session handshake.


* Re: [RFC] Introduce batch variants of the accept() and epoll_ctl() syscalls
From: Andi Kleen @ 2012-06-18 23:27 UTC
  To: Eric Dumazet
  Cc: Li Yu, Changli Gao, Linux Netdev List, Linux Kernel Mailing List, davidel

Eric Dumazet <eric.dumazet@gmail.com> writes:
>
> I believe accept() is the problem here, because it contends with the
> softirq processing the TCP session handshake.

The MOSBENCH people did a per-CPU accept queue some time ago. That is
probably overkill, but there are clearly some scaling problems here
with enough cores.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [RFC] Introduce batch variants of the accept() and epoll_ctl() syscalls
From: Li Yu @ 2012-07-06  9:38 UTC
  To: Eric Dumazet
  Cc: Changli Gao, Linux Netdev List, Linux Kernel Mailing List, davidel

On 2012-06-15 16:51, Eric Dumazet wrote:
> Before that, could you post the output of "perf top", or "perf
> record ...; perf report"?

Sorry, I have only just found time to write a benchmark to reproduce
this problem on my test bed; below are results of "perf record -g -C 0".
The kernel is 3.4.0:

Events: 7K cycles
+  54.87%  swapper  [kernel.kallsyms]  [k] poll_idle
-   3.10%   :22984  [kernel.kallsyms]  [k] _raw_spin_lock
    - _raw_spin_lock
       - 64.62% sch_direct_xmit
            dev_queue_xmit
            ip_finish_output
            ip_output
          - ip_local_out
             + 49.48% ip_queue_xmit
             + 37.48% ip_build_and_send_pkt
             + 13.04% ip_send_skb

I cannot reproduce exactly the same high CPU usage in my test
environment, but top shows a similar ratio of sys% to si% on one CPU:

Tasks: 125 total,   2 running, 123 sleeping,   0 stopped,   0 zombie
Cpu0  :  1.0%us, 30.7%sy,  0.0%ni, 18.8%id,  0.0%wa,  0.0%hi, 49.5%si,  0.0%st

Well, it seems I must acknowledge that I was wrong here. However, I
recall that I did encounter this once before, in another benchmark of
small-packet performance.

I guess this is because the TX softirq and the syscall context contend
for the same lock in sch_direct_xmit(); is that right?

thanks

Yu



* Re: [RFC] Introduce batch variants of the accept() and epoll_ctl() syscalls
From: Li Yu @ 2012-07-09  3:36 UTC
  To: Eric Dumazet
  Cc: Changli Gao, Linux Netdev List, Linux Kernel Mailing List, davidel

On 2012-07-06 17:38, Li Yu wrote:
> ...
>
> I guess this is because the TX softirq and the syscall context contend
> for the same lock in sch_direct_xmit(); is that right?

Hmm, do we have some means of reducing the lock contention here?


