From mboxrd@z Thu Jan 1 00:00:00 1970
From: Neil Horman
Subject: Introduce FCLONE_SCRATCH skbs to reduce stack memory usage and napi jitter
Date: Thu, 27 Oct 2011 15:53:36 -0400
Message-ID: <1319745221-30880-1-git-send-email-nhorman@tuxdriver.com>
Cc: Neil Horman, "David S. Miller"
To: netdev@vger.kernel.org

I had this idea a while ago while I was looking at the receive path for
multicast frames.  The top of the mcast receive path (in
__udp4_lib_mcast_deliver) has a loop in which we traverse a hash list
linearly, looking for sockets that are listening to a given multicast
group.  For each matching socket we clone the skb to enqueue it to the
corresponding socket (a simplified sketch of this loop is appended below).
This creates two problems:

1) Application-driven jitter in the receive path
   As you add processes that listen to the same multicast group, you
   increase the number of iterations you have to perform in this loop,
   which can increase the amount of time you spend processing each frame
   in softirq context, especially if you are memory constrained and the
   skb_clone operation has to call all the way back into the buddy
   allocator for more RAM.  This can lead to needlessly dropped frames as
   rx latency increases in the stack.

2) Increased memory usage
   As you increase the number of listeners to a multicast group, you
   directly increase the number of times you clone an skb, putting
   increased memory pressure on the system.

While neither of these problems is a huge concern, I thought it would be
nice if we could mitigate the effects of increased application instances
on performance in this area.  As such I came up with this patch set.

I created a new skb fclone type called FCLONE_SCRATCH.  When available, it
commandeers the internally fragmented space of an skb data buffer and uses
that to allocate additional skbs during the clone operation.  Since the
skb->data area is allocated with a kmalloc operation (and is therefore
nominally a power of 2 in size), and network interfaces typically have an
MTU of around 1500 bytes, we can usually reclaim several hundred bytes of
space at the end of an skb (more if the incoming packet is not a full MTU
in size).  This space, being exclusively accessible to the softirq doing
the reclaim, can be accessed quickly without the need for additional
locking, potentially providing lower per-frame jitter in napi context
during a receive operation, as well as some memory savings (a rough sketch
of the space calculation is also appended below).

I'm still collecting stats on its performance, but I thought I would post
now to get some early reviews and feedback on it.

Thanks & Regards
Neil

Signed-off-by: Neil Horman
CC: "David S. Miller"
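
For reference, here is a simplified, illustrative sketch of the
clone-per-listener pattern in the multicast receive path described above.
It is not the actual code in __udp4_lib_mcast_deliver: the lookup helpers
first_mcast_listener()/next_mcast_listener() and the wrapper function are
placeholders I made up for illustration; only skb_clone() and
sock_queue_rcv_skb() are real kernel interfaces here.

#include <linux/skbuff.h>
#include <net/sock.h>
#include <net/udp.h>

/*
 * Illustrative only: every socket joined to the group gets its own clone
 * of the frame, so N listeners cost roughly N skb_clone() calls per
 * received packet, all performed in softirq context.
 */
static void deliver_to_mcast_listeners(struct sk_buff *skb,
				       struct udp_hslot *hslot,
				       __be32 daddr, __be16 dport)
{
	struct sock *sk;
	struct sk_buff *skb1;

	for (sk = first_mcast_listener(hslot, daddr, dport); /* placeholder lookup */
	     sk;
	     sk = next_mcast_listener(sk, daddr, dport)) {   /* placeholder lookup */
		skb1 = skb_clone(skb, GFP_ATOMIC); /* may dip into the allocator */
		if (skb1)
			sock_queue_rcv_skb(sk, skb1); /* enqueue to this listener */
	}
}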
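
And a rough sketch of the space calculation the FCLONE_SCRATCH idea relies
on.  Again this is illustrative rather than the patch itself (the helper
name skb_scratch_space() is made up), but ksize(), skb_end_pointer() and
struct skb_shared_info are the real interfaces involved.

#include <linux/skbuff.h>
#include <linux/slab.h>

/*
 * Illustrative helper (not part of the patch): how much internally
 * fragmented space is left at the tail of the kmalloc'd data buffer,
 * i.e. between the end of the skb_shared_info and the true end of the
 * slab object.  As described above, for a typical MTU-sized receive this
 * slack is often a few hundred bytes.
 */
static size_t skb_scratch_space(struct sk_buff *skb)
{
	size_t true_size = ksize(skb->head); /* bytes the slab really gave us */
	size_t used = (skb_end_pointer(skb) - skb->head) +
		      sizeof(struct skb_shared_info);

	return true_size > used ? true_size - used : 0;
}

The number of scratch clones available for a given frame would then be
roughly skb_scratch_space(skb) / sizeof(struct sk_buff).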