* Re: surprising memory request
2013-01-18 17:46 ` Eric Dumazet
@ 2013-01-18 17:52 ` Stephen Hemminger
2013-01-21 5:44 ` Jason Wang
2013-01-18 17:54 ` Eric Dumazet
` (2 subsequent siblings)
3 siblings, 1 reply; 12+ messages in thread
From: Stephen Hemminger @ 2013-01-18 17:52 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Dirk Hohndel, Jason Wang, netdev, David Woodhouse
On Fri, 18 Jan 2013 09:46:30 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2013-01-18 at 08:58 -0800, Dirk Hohndel wrote:
> > Running openconnect on a very recent 3.8 (a few commits before Linus cut
> > RC4) I get this allocation failure. I'm unclear why we would need 128
> > contiguous pages here...
> >
> > /D
> >
> > [66015.673818] openconnect: page allocation failure: order:7, mode:0x10c0d0
> > [66015.673827] Pid: 3292, comm: openconnect Tainted: G W 3.8.0-rc3-00352-gdfdebc2 #94
> > [66015.673830] Call Trace:
> > [66015.673841] [<ffffffff810e9c29>] warn_alloc_failed+0xe9/0x140
> > [66015.673849] [<ffffffff81093967>] ? on_each_cpu_mask+0x87/0xa0
> > [66015.673854] [<ffffffff810ec349>] __alloc_pages_nodemask+0x579/0x720
> > [66015.673859] [<ffffffff810ec507>] __get_free_pages+0x17/0x50
> > [66015.673866] [<ffffffff81123979>] kmalloc_order_trace+0x39/0xf0
> > [66015.673874] [<ffffffff81666178>] ? __hw_addr_add_ex+0x78/0xc0
> > [66015.673879] [<ffffffff811260d8>] __kmalloc+0xc8/0x180
> > [66015.673883] [<ffffffff81666616>] ? dev_addr_init+0x66/0x90
> > [66015.673889] [<ffffffff81660985>] alloc_netdev_mqs+0x145/0x300
> > [66015.673896] [<ffffffff81513830>] ? tun_net_fix_features+0x20/0x20
> > [66015.673902] [<ffffffff815168aa>] __tun_chr_ioctl+0xd0a/0xec0
> > [66015.673908] [<ffffffff81516a93>] tun_chr_ioctl+0x13/0x20
> > [66015.673913] [<ffffffff8113b197>] do_vfs_ioctl+0x97/0x530
> > [66015.673917] [<ffffffff811256f3>] ? kmem_cache_free+0x33/0x170
> > [66015.673923] [<ffffffff81134896>] ? final_putname+0x26/0x50
> > [66015.673927] [<ffffffff8113b6c1>] sys_ioctl+0x91/0xb0
> > [66015.673935] [<ffffffff8180e3d2>] system_call_fastpath+0x16/0x1b
> > [66015.673938] Mem-Info:
>
> That's because Jason thought that the tun device had to have an insane
> number of queues to get good performance.
>
> #define MAX_TAP_QUEUES 1024
>
> That's crazy if your machine has, say, 8 CPUs.
>
> And Jason didn't care to adapt the memory allocations done in
> alloc_netdev_mqs() in order to switch to vmalloc() when kmalloc()
> fails.
>
> commit c8d68e6be1c3b242f1c598595830890b65cea64a
> Author: Jason Wang <jasowang@redhat.com>
> Date: Wed Oct 31 19:46:00 2012 +0000
>
> tuntap: multiqueue support
>
> This patch converts tun/tap to a multiqueue device and exposes the
> queues as multiple file descriptors to userspace. Internally, each
> tun_file is abstracted as a queue, and an array of pointers to tun_file
> structs is stored in the tun_struct device, so multiple tun_files can be
> attached to the device as multiple queues.
>
> When choosing the txq, we first try to identify a flow through its rxhash;
> if it does not have one, we try the recorded rxq and then use them to
> choose the transmit queue. This policy may be changed in the future.
>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
Also, the tuntap device now has its own flow cache, which is also a bad idea.
Why not just 128 queues and a hash like SFQ?
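Something like this would be all the per-packet work needed (untested
sketch, not a patch; skb_get_rxhash() is the existing 3.8-era helper, and
the multiply-shift just maps the hash onto the queue range without a
modulo):

/* Untested sketch: stateless SFQ-style queue selection -- no per-flow
 * cache at all, just fold the flow hash over the number of real queues. */
static u16 tun_select_queue_stateless(struct net_device *dev,
				      struct sk_buff *skb)
{
	u32 rxhash = skb_get_rxhash(skb);	/* computes the hash if not set */

	return (u16)(((u64)rxhash * dev->real_num_tx_queues) >> 32);
}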
* Re: surprising memory request
2013-01-18 17:52 ` Stephen Hemminger
@ 2013-01-21 5:44 ` Jason Wang
0 siblings, 0 replies; 12+ messages in thread
From: Jason Wang @ 2013-01-21 5:44 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Eric Dumazet, Dirk Hohndel, netdev, David Woodhouse
On 01/19/2013 01:52 AM, Stephen Hemminger wrote:
> On Fri, 18 Jan 2013 09:46:30 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> On Fri, 2013-01-18 at 08:58 -0800, Dirk Hohndel wrote:
>>> Running openconnect on a very recent 3.8 (a few commits before Linus cut
>>> RC4) I get this allocation failure. I'm unclear why we would need 128
>>> contiguous pages here...
>>>
>>> /D
>>>
>>> [66015.673818] openconnect: page allocation failure: order:7, mode:0x10c0d0
>>> [66015.673827] Pid: 3292, comm: openconnect Tainted: G W 3.8.0-rc3-00352-gdfdebc2 #94
>>> [66015.673830] Call Trace:
>>> [66015.673841] [<ffffffff810e9c29>] warn_alloc_failed+0xe9/0x140
>>> [66015.673849] [<ffffffff81093967>] ? on_each_cpu_mask+0x87/0xa0
>>> [66015.673854] [<ffffffff810ec349>] __alloc_pages_nodemask+0x579/0x720
>>> [66015.673859] [<ffffffff810ec507>] __get_free_pages+0x17/0x50
>>> [66015.673866] [<ffffffff81123979>] kmalloc_order_trace+0x39/0xf0
>>> [66015.673874] [<ffffffff81666178>] ? __hw_addr_add_ex+0x78/0xc0
>>> [66015.673879] [<ffffffff811260d8>] __kmalloc+0xc8/0x180
>>> [66015.673883] [<ffffffff81666616>] ? dev_addr_init+0x66/0x90
>>> [66015.673889] [<ffffffff81660985>] alloc_netdev_mqs+0x145/0x300
>>> [66015.673896] [<ffffffff81513830>] ? tun_net_fix_features+0x20/0x20
>>> [66015.673902] [<ffffffff815168aa>] __tun_chr_ioctl+0xd0a/0xec0
>>> [66015.673908] [<ffffffff81516a93>] tun_chr_ioctl+0x13/0x20
>>> [66015.673913] [<ffffffff8113b197>] do_vfs_ioctl+0x97/0x530
>>> [66015.673917] [<ffffffff811256f3>] ? kmem_cache_free+0x33/0x170
>>> [66015.673923] [<ffffffff81134896>] ? final_putname+0x26/0x50
>>> [66015.673927] [<ffffffff8113b6c1>] sys_ioctl+0x91/0xb0
>>> [66015.673935] [<ffffffff8180e3d2>] system_call_fastpath+0x16/0x1b
>>> [66015.673938] Mem-Info:
>> That's because Jason thought that the tun device had to have an insane
>> number of queues to get good performance.
>>
>> #define MAX_TAP_QUEUES 1024
>>
>> That's crazy if your machine has, say, 8 CPUs.
>>
>> And Jason didn't care to adapt the memory allocations done in
>> alloc_netdev_mqs() in order to switch to vmalloc() when kmalloc()
>> fails.
>>
>> commit c8d68e6be1c3b242f1c598595830890b65cea64a
>> Author: Jason Wang <jasowang@redhat.com>
>> Date: Wed Oct 31 19:46:00 2012 +0000
>>
>> tuntap: multiqueue support
>>
>> This patch converts tun/tap to a multiqueue device and exposes the
>> queues as multiple file descriptors to userspace. Internally, each
>> tun_file is abstracted as a queue, and an array of pointers to tun_file
>> structs is stored in the tun_struct device, so multiple tun_files can be
>> attached to the device as multiple queues.
>>
>> When choosing the txq, we first try to identify a flow through its rxhash;
>> if it does not have one, we try the recorded rxq and then use them to
>> choose the transmit queue. This policy may be changed in the future.
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> Signed-off-by: David S. Miller <davem@davemloft.net>
> Also, the tuntap device now has its own flow cache, which is also a bad idea.
> Why not just 128 queues and a hash like SFQ?
Hi Stephen:
I understand your concern. I think we can solve it by limiting the number
of flow cache entries to some value (say 4096). With 1024 hash buckets,
that gives an average worst-case search depth of 4, which handles the case
of lots of short-lived connections.
The issue with just an array of 128 entries is that the matching is not
accurate: with an array of limited size, two different flows can easily
collide on the same index, which may cause the packets of a flow to move
back and forth between queues. Ideally we would need a perfect filter that
compares the full n-tuple, which may be very expensive for a software
device such as tun, so I chose to store the rxhash in the flow cache and
use a hash list to do the matching.
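Roughly what I have in mind (sketch only, not compiled; TUN_MAX_FLOWS is
the hypothetical cap, and the bucket count matches the driver's existing
TUN_NUM_FLOW_ENTRIES):

#define TUN_NUM_FLOW_ENTRIES 1024	/* hash buckets already in tun.c */
#define TUN_MAX_FLOWS        4096	/* hypothetical cap: ~4 entries per bucket */

struct tun_flow_entry {
	struct hlist_node hash_link;
	u32 rxhash;		/* the full rxhash is stored, so two flows
				 * that collide on the bucket index are
				 * still told apart */
	u16 queue_index;
};

static struct tun_flow_entry *tun_flow_find(struct hlist_head *head,
					    u32 rxhash)
{
	struct tun_flow_entry *e;
	struct hlist_node *n;

	hlist_for_each_entry_rcu(e, n, head, hash_link) {
		if (e->rxhash == rxhash)	/* exact hash match */
			return e;
	}
	return NULL;
}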
Thanks
* Re: surprising memory request
2013-01-18 17:46 ` Eric Dumazet
2013-01-18 17:52 ` Stephen Hemminger
@ 2013-01-18 17:54 ` Eric Dumazet
2013-01-21 5:21 ` Jason Wang
2013-01-18 17:59 ` David Woodhouse
2013-01-21 5:13 ` Jason Wang
3 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2013-01-18 17:54 UTC (permalink / raw)
To: Dirk Hohndel; +Cc: Jason Wang, netdev, David Woodhouse
On Fri, 2013-01-18 at 09:46 -0800, Eric Dumazet wrote:
> That's because Jason thought that the tun device had to have an insane
> number of queues to get good performance.
>
> #define MAX_TAP_QUEUES 1024
>
> That's crazy if your machine has, say, 8 CPUs.
>
> And Jason didn't care to adapt the memory allocations done in
> alloc_netdev_mqs() in order to switch to vmalloc() when kmalloc()
> fails.
I suggest using the more reasonable:
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index c81680d..ec18fbf 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -113,7 +113,7 @@ struct tap_filter {
* the order of 100-200 CPUs so this leaves us some breathing space if we want
* to match a queue per guest CPU.
*/
-#define MAX_TAP_QUEUES 1024
+#define MAX_TAP_QUEUES DEFAULT_MAX_NUM_RSS_QUEUES
#define TUN_FLOW_EXPIRE (3 * HZ)
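The vmalloc() fallback mentioned above would be something like this
(sketch only, helper names made up; the queue arrays don't need
physically contiguous memory):

/* Try kmalloc() first, silently fall back to vmalloc(). */
static void *netdev_queues_zalloc(size_t size)
{
	void *p = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);

	if (!p)
		p = vzalloc(size);
	return p;
}

static void netdev_queues_free(void *p)
{
	if (is_vmalloc_addr(p))	/* must free with the matching allocator */
		vfree(p);
	else
		kfree(p);
}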
* Re: surprising memory request
2013-01-18 17:54 ` Eric Dumazet
@ 2013-01-21 5:21 ` Jason Wang
0 siblings, 0 replies; 12+ messages in thread
From: Jason Wang @ 2013-01-21 5:21 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Dirk Hohndel, netdev, David Woodhouse
On 01/19/2013 01:54 AM, Eric Dumazet wrote:
> On Fri, 2013-01-18 at 09:46 -0800, Eric Dumazet wrote:
>
>> That's because Jason thought that the tun device had to have an insane
>> number of queues to get good performance.
>>
>> #define MAX_TAP_QUEUES 1024
>>
>> That's crazy if your machine has, say, 8 CPUs.
>>
>> And Jason didn't care to adapt the memory allocations done in
>> alloc_netdev_mqs() in order to switch to vmalloc() when kmalloc()
>> fails.
> I suggest using the more reasonable:
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index c81680d..ec18fbf 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -113,7 +113,7 @@ struct tap_filter {
> * the order of 100-200 CPUs so this leaves us some breathing space if we want
> * to match a queue per guest CPU.
> */
> -#define MAX_TAP_QUEUES 1024
> +#define MAX_TAP_QUEUES DEFAULT_MAX_NUM_RSS_QUEUES
>
> #define TUN_FLOW_EXPIRE (3 * HZ)
>
But its default value of 8 is a little too small; we can easily have a KVM
guest with more than 8 vCPUs, and a host multiqueue card with more than 8
queues. Maybe we can use num_possible_cpus(), or just an arbitrary number
such as 256, which seems large enough.
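E.g. something like (untested; note that tfiles[] in struct tun_struct is
a fixed-size array, so the cap has to stay a compile-time constant, which
rules out num_possible_cpus() without further surgery):

-#define MAX_TAP_QUEUES 1024
+#define MAX_TAP_QUEUES 256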
* Re: surprising memory request
2013-01-18 17:46 ` Eric Dumazet
2013-01-18 17:52 ` Stephen Hemminger
2013-01-18 17:54 ` Eric Dumazet
@ 2013-01-18 17:59 ` David Woodhouse
2013-01-20 13:06 ` Ben Hutchings
2013-01-21 5:23 ` Jason Wang
2013-01-21 5:13 ` Jason Wang
3 siblings, 2 replies; 12+ messages in thread
From: David Woodhouse @ 2013-01-18 17:59 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Dirk Hohndel, Jason Wang, netdev
On Fri, 2013-01-18 at 09:46 -0800, Eric Dumazet wrote:
>
> #define MAX_TAP_QUEUES 1024
>
> That's crazy if your machine has, say, 8 CPUs.
Even crazier if your userspace is never going to *use* MQ. Can't we
default to one queue unless userspace explicitly requests more?
--
dwmw2
* Re: surprising memory request
2013-01-18 17:59 ` David Woodhouse
@ 2013-01-20 13:06 ` Ben Hutchings
2013-01-21 5:23 ` Jason Wang
1 sibling, 0 replies; 12+ messages in thread
From: Ben Hutchings @ 2013-01-20 13:06 UTC (permalink / raw)
To: David Woodhouse; +Cc: Eric Dumazet, Dirk Hohndel, Jason Wang, netdev
On Fri, 2013-01-18 at 17:59 +0000, David Woodhouse wrote:
> On Fri, 2013-01-18 at 09:46 -0800, Eric Dumazet wrote:
> >
> > #define MAX_TAP_QUEUES 1024
> >
> > That's crazy if your machine has, say, 8 CPUs.
>
> Even crazier if your userspace is never going to *use* MQ. Can't we
> default to one queue unless userspace explicitly requests more?
I don't know about tun's internal TX queue structures, but the core TX
queue structures immediately follow struct net_device, and you have to
set a maximum when allocating it.
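For reference, the up-front allocations on 3.8's alloc_netdev_mqs() path
look roughly like this (paraphrased from net/core/dev.c, not verbatim):

/* one entry per *possible* queue, in a single contiguous block */
dev->_tx = kcalloc(txqs, sizeof(struct netdev_queue), GFP_KERNEL);
dev->_rx = kcalloc(rxqs, sizeof(struct netdev_rx_queue), GFP_KERNEL);

With txqs = rxqs = 1024, those blocks are a few hundred kilobytes each,
which is the order-7 request in Dirk's trace.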
Ben.
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
* Re: surprising memory request
2013-01-18 17:59 ` David Woodhouse
2013-01-20 13:06 ` Ben Hutchings
@ 2013-01-21 5:23 ` Jason Wang
1 sibling, 0 replies; 12+ messages in thread
From: Jason Wang @ 2013-01-21 5:23 UTC (permalink / raw)
To: David Woodhouse; +Cc: Eric Dumazet, Dirk Hohndel, netdev
On 01/19/2013 01:59 AM, David Woodhouse wrote:
> On Fri, 2013-01-18 at 09:46 -0800, Eric Dumazet wrote:
>> #define MAX_TAP_QUEUES 1024
>>
>> That's crazy if your machine has, say, 8 CPUs.
> Even crazier if your userspace is never going to *use* MQ. Can't we
> default to one queue unless userspace explicitly requests more?
>
Yes, if userspace never uses MQ, we can just use a single-queue netdevice.
I've just drafted a patch for Dirk to test.
Thanks
* Re: surprising memory request
2013-01-18 17:46 ` Eric Dumazet
` (2 preceding siblings ...)
2013-01-18 17:59 ` David Woodhouse
@ 2013-01-21 5:13 ` Jason Wang
3 siblings, 0 replies; 12+ messages in thread
From: Jason Wang @ 2013-01-21 5:13 UTC (permalink / raw)
To: Eric Dumazet, Dirk Hohndel; +Cc: netdev, David Woodhouse
On Friday, January 18, 2013 09:46:30 AM Eric Dumazet wrote:
> On Fri, 2013-01-18 at 08:58 -0800, Dirk Hohndel wrote:
> > Running openconnect on a very recent 3.8 (a few commits before Linus cut
> > RC4) I get this allocation failure. I'm unclear why we would need 128
> > contiguous pages here...
> >
> > /D
> >
> > [66015.673818] openconnect: page allocation failure: order:7, mode:0x10c0d0
> > [66015.673827] Pid: 3292, comm: openconnect Tainted: G W 3.8.0-rc3-00352-gdfdebc2 #94
> > [66015.673830] Call Trace:
> > [66015.673841] [<ffffffff810e9c29>] warn_alloc_failed+0xe9/0x140
> > [66015.673849] [<ffffffff81093967>] ? on_each_cpu_mask+0x87/0xa0
> > [66015.673854] [<ffffffff810ec349>] __alloc_pages_nodemask+0x579/0x720
> > [66015.673859] [<ffffffff810ec507>] __get_free_pages+0x17/0x50
> > [66015.673866] [<ffffffff81123979>] kmalloc_order_trace+0x39/0xf0
> > [66015.673874] [<ffffffff81666178>] ? __hw_addr_add_ex+0x78/0xc0
> > [66015.673879] [<ffffffff811260d8>] __kmalloc+0xc8/0x180
> > [66015.673883] [<ffffffff81666616>] ? dev_addr_init+0x66/0x90
> > [66015.673889] [<ffffffff81660985>] alloc_netdev_mqs+0x145/0x300
> > [66015.673896] [<ffffffff81513830>] ? tun_net_fix_features+0x20/0x20
> > [66015.673902] [<ffffffff815168aa>] __tun_chr_ioctl+0xd0a/0xec0
> > [66015.673908] [<ffffffff81516a93>] tun_chr_ioctl+0x13/0x20
> > [66015.673913] [<ffffffff8113b197>] do_vfs_ioctl+0x97/0x530
> > [66015.673917] [<ffffffff811256f3>] ? kmem_cache_free+0x33/0x170
> > [66015.673923] [<ffffffff81134896>] ? final_putname+0x26/0x50
> > [66015.673927] [<ffffffff8113b6c1>] sys_ioctl+0x91/0xb0
> > [66015.673935] [<ffffffff8180e3d2>] system_call_fastpath+0x16/0x1b
>
> > [66015.673938] Mem-Info:
> That's because Jason thought that the tun device had to have an insane
> number of queues to get good performance.
>
> #define MAX_TAP_QUEUES 1024
>
> That's crazy if your machine has, say, 8 CPUs.
>
> And Jason didn't care to adapt the memory allocations done in
> alloc_netdev_mqs() in order to switch to vmalloc() when kmalloc()
> fails.
>
Right, it looks like alloc_netdev_mqs() uses kmalloc() and so requires a
huge number of contiguous pages for the netdev_queue and netdev_rx_queue
arrays, most of which are not needed when MQ is not enabled. I wonder
whether we can avoid the contiguous page allocation in alloc_netdev_mqs()
by using a flex array instead of kmalloc() to allocate the rx/tx queues.
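The flex array version would be something like this (rough sketch, not
compiled; lib/flex_array backs the elements with individual pages, so
nothing needs to be physically contiguous):

#include <linux/flex_array.h>

struct flex_array *tx;

tx = flex_array_alloc(sizeof(struct netdev_queue), txqs, GFP_KERNEL);
if (!tx || flex_array_prealloc(tx, 0, txqs, GFP_KERNEL))
	goto free_all;	/* hypothetical error path */
/* callers then use flex_array_get(tx, i) instead of indexing dev->_tx[i] */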
Dirk, could you please try the following patch, which allocates only one
queue when IFF_MULTI_QUEUE is not specified:
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 3f011e0..734085e 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1589,6 +1589,8 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 	else {
 		char *name;
 		unsigned long flags = 0;
+		int queues = ifr->ifr_flags & IFF_MULTI_QUEUE ?
+			     MAX_TAP_QUEUES : 1;

 		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
@@ -1612,8 +1614,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 		name = ifr->ifr_name;

 		dev = alloc_netdev_mqs(sizeof(struct tun_struct), name,
-				       tun_setup,
-				       MAX_TAP_QUEUES, MAX_TAP_QUEUES);
+				       tun_setup, queues, queues);
 		if (!dev)
 			return -ENOMEM;
> commit c8d68e6be1c3b242f1c598595830890b65cea64a
> Author: Jason Wang <jasowang@redhat.com>
> Date: Wed Oct 31 19:46:00 2012 +0000
>
> tuntap: multiqueue support
>
> This patch converts tun/tap to a multiqueue device and exposes the
> queues as multiple file descriptors to userspace. Internally, each
> tun_file is abstracted as a queue, and an array of pointers to tun_file
> structs is stored in the tun_struct device, so multiple tun_files can be
> attached to the device as multiple queues.
>
> When choosing the txq, we first try to identify a flow through its rxhash;
> if it does not have one, we try the recorded rxq and then use them to
> choose the transmit queue. This policy may be changed in the future.
>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>