Hi all,

Part of the performance problem we have with netfilter is that memory allocation is not NUMA aware, but 'only' SMP aware (i.e. each CPU normally touches separate cache lines).

Even with small iptables rule sets, the cost of this misplacement can be high on common workloads.

Instead of using one vmalloc() area (located in the node of the iptables process), we now vmalloc() an area for each possible CPU, using a NUMA policy (MPOL_PREFERRED) so that memory is allocated in the CPU's node if possible.

If the size of ipt_table is small enough (less than one page), we use kmalloc_node() instead of vmalloc(), to use less memory (and fewer TLB entries) in small setups.

This patch tries to use local node memory in the expensive translate_table() function (and others), but doesn't bother to bind the task to the current CPU.

Note: I also optimized get_counters(). It now uses SET_COUNTER() for the first CPU, which avoids a memset(), and ADD_COUNTER() for the other CPUs on SMP.

Thank you

Signed-off-by: Eric Dumazet
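
For readers who want the get_counters() idea in isolation, here is a minimal userspace sketch, not the patch itself. It assumes a flat array holding one block of counters per CPU; the struct layout, the sum_counters() name and its parameters are hypothetical, and the two macros are simplified stand-ins modelled on the kernel ones. The first CPU's values overwrite the output, so the caller does not need to zero it first, and the remaining CPUs are added on top.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a per-rule counter (assumed layout). */
struct counter {
    uint64_t pcnt;  /* packets */
    uint64_t bcnt;  /* bytes */
};

/* Stand-ins for the kernel macros named in the note above. */
#define SET_COUNTER(c, b, p) do { (c).bcnt = (b); (c).pcnt = (p); } while (0)
#define ADD_COUNTER(c, b, p) do { (c).bcnt += (b); (c).pcnt += (p); } while (0)

/*
 * Sum per-CPU counters into total[0..nrules-1].
 * per_cpu[cpu * nrules + i] is the counter of rule i on that CPU.
 * Precondition: ncpus >= 1. total[] may hold garbage on entry.
 */
static void sum_counters(const struct counter *per_cpu, size_t ncpus,
                         size_t nrules, struct counter *total)
{
    size_t cpu, i;

    /* First CPU: overwrite the output, so no memset() is needed. */
    for (i = 0; i < nrules; i++)
        SET_COUNTER(total[i], per_cpu[i].bcnt, per_cpu[i].pcnt);

    /* Other CPUs (only present on SMP): accumulate. */
    for (cpu = 1; cpu < ncpus; cpu++) {
        const struct counter *src = per_cpu + cpu * nrules;

        for (i = 0; i < nrules; i++)
            ADD_COUNTER(total[i], src[i].bcnt, src[i].pcnt);
    }
}
```

On a single-CPU build the second loop never runs, which is why the ADD_COUNTER() pass only costs anything on SMP.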