Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 1/3] net: stmmac: change GMAC control register for SGMII
From: Giuseppe CAVALLARO @ 2012-11-23  9:31 UTC (permalink / raw)
  To: Byungho An; +Cc: davem, jeffrey.t.kirsher, netdev, kgene.kim, linux-kernel
In-Reply-To: <004b01cdc959$80af8030$820e8090$%an@samsung.com>

Hello An

On 11/23/2012 10:04 AM, Byungho An wrote:
>
> This patch changes GMAC control register (TC(Transmit
> Configuration) and PS(Port Selection) bit for SGMII.
> In case of SGMII, TC bit is '1' and PS bit is 0.

I was looking at this too. In particular, I was working on the rgmii 
interrupt so I guess we could improve this part together.

First my note is that I would like to have this kind of code never 
placed in the stmmac_main. It should stay in the core part.
Also I 'd like to avoid the Kconfig option where possible.

At any rate, I'll come back with further details soon.

BR,
Peppe

>
> Signed-off-by: Byungho An <bh74.an@samsung.com>
> ---
>   drivers/net/ethernet/stmicro/stmmac/stmmac_main.c |   10 ++++++++++
>   1 file changed, 10 insertions(+)
>
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index c6cdbc4..a719c87 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -1037,6 +1037,7 @@ static int stmmac_open(struct net_device *dev)
>   {
>   	struct stmmac_priv *priv = netdev_priv(dev);
>   	int ret;
> +	u32 value;
>
>   #ifdef CONFIG_STMMAC_TIMER
>   	priv->tm = kzalloc(sizeof(struct stmmac_timer *), GFP_KERNEL);
> @@ -1088,6 +1089,15 @@ static int stmmac_open(struct net_device *dev)
>   	/* Initialize the MAC Core */
>   	priv->hw->mac->core_init(priv->ioaddr);
>
> +	if (priv->phydev->interface == PHY_INTERFACE_MODE_SGMII) {
> +		value = readl(priv->ioaddr);
> +		/* GMAC_CONTROL_TC : transmit config in RGMII/SGMII */
> +		value |= 0x1000000;
> +		/* GMAC_CONTROL_PS : Port Selection for GMII */
> +		value &= ~(0x8000);
> +		writel(value, priv->ioaddr);
> +	}
> +
>   	/* Request the IRQ lines */
>   	ret = request_irq(dev->irq, stmmac_interrupt,
>   			 IRQF_SHARED, dev->name, dev);
>

^ permalink raw reply

* Re: [Qemu-devel] tap devices not receiving packets from a bridge
From: Peter Lieven @ 2012-11-23  9:41 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel, netdev, mst
In-Reply-To: <20121123070211.GC22787@stefanha-thinkpad.hitronhub.home>


Am 23.11.2012 um 08:02 schrieb Stefan Hajnoczi:

> On Thu, Nov 22, 2012 at 03:29:52PM +0100, Peter Lieven wrote:
>> is anyone aware of a problem with the linux network bridge that in very rare circumstances stops
>> a bridge from sending pakets to a tap device?
>> 
>> My problem occurs in conjunction with vanilla qemu-kvm-1.2.0 and Ubuntu Kernel 3.2.0-34.53
>> which is based on Linux 3.2.33.
>> 
>> I was not yet able to reproduce the issue, it happens in really rare cases. The symptom is that
>> the tap does not have any TX packets. RX is working fine. I see the packets coming in at
>> the physical interface on the host, but they are not forwarded to the tap interface.
>> The bridge itself has learnt the mac address of the vServer that is connected to the tap interface.
>> It does not help to toggle the bridge link status,  the tap interface status or the interface in the vServer.
>> It seems that problem occurs if a tap interface that has previously been used, but set to nonpersistent
>> is set persistent again and then is by chance assigned to the same vServer (=same mac address on same
>> bridge) again. Unfortunately it seems not to be reproducible.
> 
> Not sure but this patch from Michael Tsirkin may help - it solves an
> issue with persistent tap devices:
> 
> http://patchwork.ozlabs.org/patch/198598/

Hi Stefan,

thanks for the pointer. I have seen this patch, but I have neglected it because it was dealing
with persistent taps. But maybe the taps in the kernel are not deleted directly. 
Can you remember what the syptomps of the above issue have been? Sorry for
being vague, but I currently have no clue whats going on.

Can someone who has more internal knowledge of the bridging/tap code say if qemu can
be responsible at all if the tap device is not receiving packets from the bridge.

If I have the following config. Lets say packets coming in via physical interface eth1.123,
and a bridge called br123.I further have a virtual machine with tap0. Both eth1.123
and tap0 are member of br123. 

If the issue occurs the vServer has no network connectivity inbound. If I sent a ping
from the vServer I see it on tap0 and leaving on eth1.123. I see further the arp reply coming
in via eth1.123, but the reply can't be seen on tap0.

Peter

> 
> Stefan

^ permalink raw reply

* [TCP] Flags returned by poll on connection request to closed peer?
From: Yi Li @ 2012-11-23 10:07 UTC (permalink / raw)
  To: netdev

Hi List,
When I issues a non-blocking connection request to a closed peer, and 
call select() to get the status
of the socket. But When I issues many threads, and I got the statistic 
as follow:

  POLLIN_SET POLLOUT_SET	9980000
  !POLLIN_SET !POLLOUT_SET	0
  POLLIN_SET !POLLOUT_SET0
  !POLLIN_SET POLLOUT_SET20000

as POLLIN_SET&& POLLOUT_SET means connection error.(of course, we are attempting to connect
to a closed peer). But what the meaning of !POLLIN_SET POLLOUT_SET ?

Here is my test program, and my test command is :
./client -d $SERVERS -s $max_range_start -e $max_range_end -t 20000

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/select.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <pthread.h>
#include <stdlib.h>
#include <arpa/inet.h>
#include <netdb.h>

#define BUFSIZE 255
#define POOL_SIZE 1000

unsigned short port_start, port_end;
uint32_t server_ip;
int total_count;
unsigned short port_pool[POOL_SIZE];

void usage(const char *name){
     printf("%s: -d SERVER_IP -s SERVER_PORT_S -e SERVER_PORT_E  -t TOTAL_COUNT -h\n", name);
     printf("-d SERVER_IP: server ip is SERVER_IP\n");
     printf("-s SERVER_PORT_S: closed server port range start, one thread per port\n");
     printf("-s SERVER_PORT_E: closed server port range end, one thread per port\n");
     printf("-t TOTAL_COUNT: per thread try TOTAL_COUNT tmies connection requests\n");
     printf("-h: print this help message\n");
     return;
}

void* talk_to_server(void* arg){
     int sockfd, flags, ret, i = 0;
     struct timeval timeout;
     fd_set rset, wset;
     struct sockaddr_in server_addr;
     int index = *((int *)&arg);
	
     server_addr.sin_family = AF_INET;
     server_addr.sin_addr.s_addr = server_ip;
     server_addr.sin_port = htons(port_pool[index]);

     while(i++ < total_count){
	if ((sockfd = socket(AF_INET, SOCK_STREAM, 0)) < 0){
	    perror("client: socket create error");
	    goto exit;
	}

	FD_ZERO(&rset);
	FD_SET(sockfd, &rset);
	wset = rset;
	timeout.tv_sec = 10;
	timeout.tv_usec = 0;
     
	flags = fcntl(sockfd, F_GETFL, 0);
	fcntl(sockfd, F_SETFL, flags | O_NONBLOCK);
     
	if ((ret = connect(sockfd, (struct sockaddr *)&server_addr, sizeof(server_addr))) <= 0){
	    if(ret == 0){
		fprintf(stderr, "client: connect established.\n");
		goto sockfd_exit;
	    }
	    if(errno != EINPROGRESS){
		perror("client: connect failed.");
		goto sockfd_exit;
	    }
	}
     
	if( (ret = select(sockfd+1, &rset, &wset, NULL, &timeout)) < 0){
	    perror("client: select failed.");
	    goto sockfd_exit;
	}

	if(FD_ISSET(sockfd, &rset) && FD_ISSET(sockfd, &wset)){
	    fprintf(stdout, "client: sockfd=%d, with POLLIN_SET POLLOUT_SET\n", sockfd);
	}else if(FD_ISSET(sockfd, &rset) && !FD_ISSET(sockfd, &wset)){
	    fprintf(stdout, "client: sockfd=%d, with POLLIN_SET !POLLOUT_SET\n", sockfd);
	}else if(!FD_ISSET(sockfd, &rset) && FD_ISSET(sockfd, &wset)){
	    fprintf(stdout, "client: sockfd=%d, with !POLLIN_SET POLLOUT_SET\n", sockfd);
	}else{
	    fprintf(stdout, "client: sockfd=%d, with !POLLIN_SET !POLLOUT_SET\n", sockfd);
	}
	
     sockfd_exit:
	close(sockfd);
     }
  exit:
     pthread_exit(NULL);
}

int parse_options(int argc, char *argv[]){
     int ret;
     struct hostent *hptr;
     char buf[BUFSIZE];

     if(argc < 6){
	usage(argv[0]);
	return -1;
     }
     
     while((ret = getopt(argc, argv, "d:s:e:t:h")) != -1){
	switch(ret){
	case 'd':
	    if( (hptr = gethostbyname(optarg)) == NULL){
		fprintf(stderr, "client: gethostbyname error: %s\n", hstrerror(h_errno));
		return -1;
	    }
	    switch(hptr->h_addrtype){
	    case AF_INET:
		server_ip =((struct in_addr*)hptr->h_addr)->s_addr;
		break;
	    default:
		fprintf(stderr, "client: unknow address type\n");
		return -1;
	    }
	    break;
	case 's':
	    port_start = atoi(optarg);
	    break;
	case 'e':
	    port_end = atoi(optarg);
	    break;
	case 't':
	    total_count = atoi(optarg);
	    break;
	case 'h':
	    usage(argv[0]);
	    return -1;
	case '?':
	default:
	    fprintf(stderr, "unknow option %c\n", optopt);
	    return -1;
	}
     }
     return 0;
}

int main(int argc, char *argv[]){
     int i;
     pthread_t tid;
     pthread_attr_t child_thread_attr;
     
     if(parse_options(argc, argv) < 0)
	return 0;

     if( port_end - port_start+1 > POOL_SIZE)
	port_end = port_start + POOL_SIZE -1;
     
     /*initialize port pool*/
     for(i = 0; port_start + i <= port_end; i++){
	port_pool[i] = port_start + i;
     }
     
     /*create threads, one thread per server port*/
     pthread_attr_init(&child_thread_attr);
     pthread_attr_setdetachstate(&child_thread_attr, PTHREAD_CREATE_DETACHED);
     for( i = 0; port_start + i <= port_end ; i++){
	    if( pthread_create(&tid, &child_thread_attr, talk_to_server, (void *)i) != 0 )
		fprintf(stderr, "client: pthread create failed thread %d port %d\n",
			i, port_start+i);
     }
     pthread_exit(NULL);
}

^ permalink raw reply

* Re: [PATCHv4] virtio-spec: virtio network device RFS support
From: Michael S. Tsirkin @ 2012-11-23 10:10 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, kvm, virtualization
In-Reply-To: <50AF06F0.1030604@redhat.com>

On Fri, Nov 23, 2012 at 01:17:36PM +0800, Jason Wang wrote:
...

> "specifying the number of the last transmit and receive queue that
> is going to be used; thus out of transmitq0..transmitqn and
> receiveq0..receiveqn where n=virtqueue_pairs will be used."
> 
> In this description, looks like n+1 virtqueue pairs (include
> receiveq0 and transmitq0) could be used in RFS mode.

The intent was not to reserve any virt queue pairs.
I hope I clarified this below.



Thanks for the comments. Here's an incremental patch to address
them.


diff --git a/virtio-spec.lyx b/virtio-spec.lyx
index e562335..53ddeec 100644
--- a/virtio-spec.lyx
+++ b/virtio-spec.lyx
@@ -4384,7 +4384,7 @@ VIRTIO_NET_F_GUEST_ANNOUNCE(21) Guest can send gratuitous packets.
 \begin_layout Description
 
 \change_inserted 1986246365 1352742808
-VIRTIO_NET_F_RFS(2) Device supports Receive Flow Steering.
+VIRTIO_NET_F_RFS(22) Device supports Receive Flow Steering.
 \change_unchanged
 
 \end_layout
@@ -4432,9 +4432,10 @@ N
 \emph default
 =
 \emph on
-max_virtqueue_pairs
+max_virtqueue_pairs - 1
 \emph default
 ) that can be configured once VIRTIO_NET_F_RFS is negotiated.
+Legal values for this field are 1 to 8000h.
 
 \change_unchanged
  
@@ -4496,7 +4497,7 @@ The initialization routine should identify the receive and transmission
 \change_inserted 1986246365 1352743942
  If VIRTIO_NET_F_RFS feature bit is negotiated, 
 \emph on
-N=max_virtqueue_pairs
+N=max_virtqueue_pairs-1
 \emph default
 , otherwise identify 
 \emph on
@@ -5464,7 +5465,7 @@ struct virtio_net_ctrl_rfs {
 
 \change_inserted 1986246365 1353594263
 
-#define VIRTIO_NET_CTRL_RFC    1
+#define VIRTIO_NET_CTRL_RFS    1
 \end_layout
 
 \begin_layout Plain Layout
@@ -5474,6 +5475,19 @@ struct virtio_net_ctrl_rfs {
  #define VIRTIO_NET_CTRL_RFS_VQ_PAIRS_SET        0 
 \end_layout
 
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594273
+
+ #define VIRTIO_NET_CTRL_RFS_VQ_PAIRS_MIN        1 
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594273
+
+ #define VIRTIO_NET_CTRL_RFS_VQ_PAIRS_MAX        0x8000
+\end_layout
 \end_inset
 
 
@@ -5484,14 +5498,19 @@ struct virtio_net_ctrl_rfs {
 \change_inserted 1986246365 1353594884
 RFS acceleration is disabled by default.
  Driver enables RFS by executing the VIRTIO_NET_CTRL_RFS_VQ_PAIRS_SET command,
- specifying the number of the last transmit and receive queue that is going
+ specifying the number of the transmit and receive queues that is going
  to be used; thus out of transmitq0..transmitqn and receiveq0..receiveqn where
  
 \emph on
-n=virtqueue
+n=virtqueue_pairs-1
 \emph default
-_pairs will be used.
+ will be used.
  All these virtqueues must have been pre-configured in advance.
+ The range of legal values for the
+\emph on
+ virtqueue_pairs
+\emph off
+ field is between 1 and 8000h.
 \end_layout
 
 \begin_layout Standard
@@ -5512,7 +5531,7 @@ Programming of the receive flow classificator is implicit.
 \change_inserted 1986246365 1353595040
 RFS acceleration is disabled by setting 
 \emph on
-virtqueue_pairs = 0
+virtqueue_pairs = 1
 \emph default
  (this is the default).
  Following this, driver should not transmit new packets on virtqueues other

^ permalink raw reply related

* [PATCH net-next] be2net: fix a possible events_get() race on BE2
From: Sathya Perla @ 2012-11-23 10:27 UTC (permalink / raw)
  To: netdev; +Cc: Sathya Perla

On BE2 chip, an interrupt being raised even when EQ is in un-armed state has
been observed a few times.  This is not expected and has never been
observed on BE3/Lancer chips.

As a consequence, be_msix()::events_get() and be_poll()::events_get()
can race and notify an EQ wrongly causing a CEV UE. The other possible
side-effect would be traffic stalling because after notifying EQ,
napi_schedule() is ignored as NAPI is already running.

This patch fixes this issue by counting events only in be_poll().

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
---
 drivers/net/ethernet/emulex/benet/be_main.c |   11 +++++++----
 1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index c365722..adef536 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -2029,7 +2029,8 @@ static irqreturn_t be_msix(int irq, void *dev)
 {
 	struct be_eq_obj *eqo = dev;
 
-	event_handle(eqo);
+	be_eq_notify(eqo->adapter, eqo->q.id, false, true, 0);
+	napi_schedule(&eqo->napi);
 	return IRQ_HANDLED;
 }
 
@@ -2125,9 +2126,11 @@ int be_poll(struct napi_struct *napi, int budget)
 {
 	struct be_eq_obj *eqo = container_of(napi, struct be_eq_obj, napi);
 	struct be_adapter *adapter = eqo->adapter;
-	int max_work = 0, work, i;
+	int max_work = 0, work, i, num_evts;
 	bool tx_done;
 
+	num_evts = events_get(eqo);
+
 	/* Process all TXQs serviced by this EQ */
 	for (i = eqo->idx; i < adapter->num_tx_qs; i += adapter->num_evt_qs) {
 		tx_done = be_process_tx(adapter, &adapter->tx_obj[i],
@@ -2150,10 +2153,10 @@ int be_poll(struct napi_struct *napi, int budget)
 
 	if (max_work < budget) {
 		napi_complete(napi);
-		be_eq_notify(adapter, eqo->q.id, true, false, 0);
+		be_eq_notify(adapter, eqo->q.id, true, false, num_evts);
 	} else {
 		/* As we'll continue in polling mode, count and clear events */
-		be_eq_notify(adapter, eqo->q.id, false, false, events_get(eqo));
+		be_eq_notify(adapter, eqo->q.id, false, false, num_evts);
 	}
 	return max_work;
 }
-- 
1.7.1

^ permalink raw reply related

* [PATCH] net: sched: enable CAN Identifier to be build into kernel
From: Marc Kleine-Budde @ 2012-11-23 10:44 UTC (permalink / raw)
  To: netdev; +Cc: linux-can, Marc Kleine-Budde

This patch makes it possible to build the CAN Identifier into the kernel, even
if the CAN support is build as a module.

Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
---
Hello,

is there a nicer solution to this problem? Or remove the "&& CAN" at all?

Marc

 net/sched/Kconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 62fb51f..235e01a 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -509,7 +509,7 @@ config NET_EMATCH_TEXT
 
 config NET_EMATCH_CANID
 	tristate "CAN Identifier"
-	depends on NET_EMATCH && CAN
+	depends on NET_EMATCH && (CAN=y || CAN=m)
 	---help---
 	  Say Y here if you want to be able to classify CAN frames based
 	  on CAN Identifier.
-- 
1.7.10.4


^ permalink raw reply related

* Re: [Qemu-devel] tap devices not receiving packets from a bridge
From: Michael S. Tsirkin @ 2012-11-23 11:01 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Stefan Hajnoczi, qemu-devel, netdev
In-Reply-To: <E85C6011-548D-4507-A776-1028DD3E3515@dlhnet.de>

On Fri, Nov 23, 2012 at 10:41:21AM +0100, Peter Lieven wrote:
> 
> Am 23.11.2012 um 08:02 schrieb Stefan Hajnoczi:
> 
> > On Thu, Nov 22, 2012 at 03:29:52PM +0100, Peter Lieven wrote:
> >> is anyone aware of a problem with the linux network bridge that in very rare circumstances stops
> >> a bridge from sending pakets to a tap device?
> >> 
> >> My problem occurs in conjunction with vanilla qemu-kvm-1.2.0 and Ubuntu Kernel 3.2.0-34.53
> >> which is based on Linux 3.2.33.
> >> 
> >> I was not yet able to reproduce the issue, it happens in really rare cases. The symptom is that
> >> the tap does not have any TX packets. RX is working fine. I see the packets coming in at
> >> the physical interface on the host, but they are not forwarded to the tap interface.
> >> The bridge itself has learnt the mac address of the vServer that is connected to the tap interface.
> >> It does not help to toggle the bridge link status,  the tap interface status or the interface in the vServer.
> >> It seems that problem occurs if a tap interface that has previously been used, but set to nonpersistent
> >> is set persistent again and then is by chance assigned to the same vServer (=same mac address on same
> >> bridge) again. Unfortunately it seems not to be reproducible.
> > 
> > Not sure but this patch from Michael Tsirkin may help - it solves an
> > issue with persistent tap devices:
> > 
> > http://patchwork.ozlabs.org/patch/198598/
> 
> Hi Stefan,
> 
> thanks for the pointer. I have seen this patch, but I have neglected it because it was dealing
> with persistent taps. But maybe the taps in the kernel are not deleted directly. 
> Can you remember what the syptomps of the above issue have been? Sorry for
> being vague, but I currently have no clue whats going on.
> 
> Can someone who has more internal knowledge of the bridging/tap code say if qemu can
> be responsible at all if the tap device is not receiving packets from the bridge.
> 
> If I have the following config. Lets say packets coming in via physical interface eth1.123,
> and a bridge called br123.I further have a virtual machine with tap0. Both eth1.123
> and tap0 are member of br123. 
> 
> If the issue occurs the vServer has no network connectivity inbound. If I sent a ping
> from the vServer I see it on tap0 and leaving on eth1.123. I see further the arp reply coming
> in via eth1.123, but the reply can't be seen on tap0.
> 
> Peter

If guest is not consuming packets, a TX queue in tap device
will with time overrun (there's space for 1000 packets there).
This is code from tun:

        if (skb_queue_len(&tfile->socket.sk->sk_receive_queue)
                          >= dev->tx_queue_len / tun->numqueues){
                if (!(tun->flags & TUN_ONE_QUEUE)) {
                        /* Normal queueing mode. */
                        /* Packet scheduler handles dropping of further
 * packets. */
                        netif_stop_subqueue(dev, txq);

                        /* We won't see all dropped packets
 * individually, so overrun
                         * error is more appropriate. */
                        dev->stats.tx_fifo_errors++;


So you can detect that this triggered by looking at fifo errors counter in device.

Once this happens TX queue is stopped, then you hit this path:

                        if (!netif_xmit_stopped(txq)) {
                                __this_cpu_inc(xmit_recursion);
                                rc = dev_hard_start_xmit(skb, dev, txq);
                                __this_cpu_dec(xmit_recursion);
                                if (dev_xmit_complete(rc)) {
                                        HARD_TX_UNLOCK(dev, txq);
                                        goto out;
                                }
                        }

so packets are not passed to device anymore.
It will stay this way until guest consumes some packets and
queue is restarted.

> > 
> > Stefan

^ permalink raw reply

* Re: [Qemu-devel] tap devices not receiving packets from a bridge
From: Peter Lieven @ 2012-11-23 11:02 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Stefan Hajnoczi, qemu-devel, netdev
In-Reply-To: <20121123110146.GC7051@redhat.com>


Am 23.11.2012 um 12:01 schrieb Michael S. Tsirkin:

> On Fri, Nov 23, 2012 at 10:41:21AM +0100, Peter Lieven wrote:
>> 
>> Am 23.11.2012 um 08:02 schrieb Stefan Hajnoczi:
>> 
>>> On Thu, Nov 22, 2012 at 03:29:52PM +0100, Peter Lieven wrote:
>>>> is anyone aware of a problem with the linux network bridge that in very rare circumstances stops
>>>> a bridge from sending pakets to a tap device?
>>>> 
>>>> My problem occurs in conjunction with vanilla qemu-kvm-1.2.0 and Ubuntu Kernel 3.2.0-34.53
>>>> which is based on Linux 3.2.33.
>>>> 
>>>> I was not yet able to reproduce the issue, it happens in really rare cases. The symptom is that
>>>> the tap does not have any TX packets. RX is working fine. I see the packets coming in at
>>>> the physical interface on the host, but they are not forwarded to the tap interface.
>>>> The bridge itself has learnt the mac address of the vServer that is connected to the tap interface.
>>>> It does not help to toggle the bridge link status,  the tap interface status or the interface in the vServer.
>>>> It seems that problem occurs if a tap interface that has previously been used, but set to nonpersistent
>>>> is set persistent again and then is by chance assigned to the same vServer (=same mac address on same
>>>> bridge) again. Unfortunately it seems not to be reproducible.
>>> 
>>> Not sure but this patch from Michael Tsirkin may help - it solves an
>>> issue with persistent tap devices:
>>> 
>>> http://patchwork.ozlabs.org/patch/198598/
>> 
>> Hi Stefan,
>> 
>> thanks for the pointer. I have seen this patch, but I have neglected it because it was dealing
>> with persistent taps. But maybe the taps in the kernel are not deleted directly. 
>> Can you remember what the syptomps of the above issue have been? Sorry for
>> being vague, but I currently have no clue whats going on.
>> 
>> Can someone who has more internal knowledge of the bridging/tap code say if qemu can
>> be responsible at all if the tap device is not receiving packets from the bridge.
>> 
>> If I have the following config. Lets say packets coming in via physical interface eth1.123,
>> and a bridge called br123.I further have a virtual machine with tap0. Both eth1.123
>> and tap0 are member of br123. 
>> 
>> If the issue occurs the vServer has no network connectivity inbound. If I sent a ping
>> from the vServer I see it on tap0 and leaving on eth1.123. I see further the arp reply coming
>> in via eth1.123, but the reply can't be seen on tap0.
>> 
>> Peter
> 
> If guest is not consuming packets, a TX queue in tap device
> will with time overrun (there's space for 1000 packets there).
> This is code from tun:

>From what I remember there where zero TX packets and no TX errors
on the device.

Might it be that this queue is somehow not cleared correctly when
the device is reassigned (although it was nonpersistant in between).

Thank you,
Peter

> 
>        if (skb_queue_len(&tfile->socket.sk->sk_receive_queue)
>> = dev->tx_queue_len / tun->numqueues){
>                if (!(tun->flags & TUN_ONE_QUEUE)) {
>                        /* Normal queueing mode. */
>                        /* Packet scheduler handles dropping of further
> * packets. */
>                        netif_stop_subqueue(dev, txq);
> 
>                        /* We won't see all dropped packets
> * individually, so overrun
>                         * error is more appropriate. */
>                        dev->stats.tx_fifo_errors++;
> 
> 
> So you can detect that this triggered by looking at fifo errors counter in device.
> 
> Once this happens TX queue is stopped, then you hit this path:
> 
>                        if (!netif_xmit_stopped(txq)) {
>                                __this_cpu_inc(xmit_recursion);
>                                rc = dev_hard_start_xmit(skb, dev, txq);
>                                __this_cpu_dec(xmit_recursion);
>                                if (dev_xmit_complete(rc)) {
>                                        HARD_TX_UNLOCK(dev, txq);
>                                        goto out;
>                                }
>                        }
> 
> so packets are not passed to device anymore.
> It will stay this way until guest consumes some packets and
> queue is restarted.
> 
>>> 
>>> Stefan

^ permalink raw reply

* [PATCHv5] virtio-spec: virtio network device RFS support
From: Michael S. Tsirkin @ 2012-11-23 12:05 UTC (permalink / raw)
  To: Jason Wang; +Cc: rusty, virtualization, netdev, kvm

Add RFS support to virtio network device.
Add a new feature flag VIRTIO_NET_F_RFS for this feature, a new
configuration field max_virtqueue_pairs to detect supported number of
virtqueues as well as a new command VIRTIO_NET_CTRL_RFS to program
packet steering for unidirectional protocols.

---

Changes from v4:
- address Jason's comments
- have configuration specify the number of VQ pairs and not pairs - 1

Changes from v3:
- rename multiqueue -> rfs this is what we support
- Be more explicit about what driver should do.
- Simplify layout making VQs functionality depend on feature.
- Remove unused commands, only leave in programming # of queues

Changes from v2:
Address Jason's comments on v2:
- Changed STEERING_HOST to STEERING_RX_FOLLOWS_TX:
   this is both clearer and easier to support.
   It does not look like we need a separate steering command
   since host can just watch tx packets as they go.
- Moved RX and TX steering sections near each other.
- Add motivation for other changes in v2

Changes from Jason's rfc:
- reserved vq 3: this makes all rx vqs even and tx vqs odd, which
   looks nicer to me.
- documented packet steering, added a generalized steering programming
   command. Current modes are single queue and host driven multiqueue,
   but I envision support for guest driven multiqueue in the future.
- make default vqs unused when in mq mode - this wastes some memory
   but makes it more efficient to switch between modes as
   we can avoid this causing packet reordering.

Rusty, could you please take a look and comment soon?
If this looks OK to everyone, we can proceed with finalizing the
implementation. Would be nice to try and put it in 3.8.

diff --git a/virtio-spec.lyx b/virtio-spec.lyx
index d2f0da9..53ddeec 100644
--- a/virtio-spec.lyx
+++ b/virtio-spec.lyx
@@ -59,6 +59,7 @@
 \author -608949062 "Rusty Russell,,," 
 \author -385801441 "Cornelia Huck" cornelia.huck@de.ibm.com
 \author 1531152142 "Paolo Bonzini,,," 
+\author 1986246365 "Michael S. Tsirkin"
 \end_header
 
 \begin_body
@@ -4170,9 +4171,42 @@ ID 1
 \end_layout
 
 \begin_layout Description
-Virtqueues 0:receiveq.
- 1:transmitq.
- 2:controlq
+Virtqueues 0:receiveq
+\change_inserted 1986246365 1352742829
+0
+\change_unchanged
+.
+ 1:transmitq
+\change_inserted 1986246365 1352742832
+0
+\change_deleted 1986246365 1352742947
+.
+
+\change_inserted 1986246365 1352742952
+.
+ ....
+ 2N
+\begin_inset Foot
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1352743187
+N=0 if VIRTIO_NET_F_RFS is not negotiated, otherwise N is indicated by max_
+\emph on
+virtqueue_pairs control
+\emph default
+ field.
+
+\end_layout
+
+\end_inset
+
+: receivqN.
+ 2N+1: transmitqN.
+ 2N+
+\change_unchanged
+2:controlq
 \begin_inset Foot
 status open
 
@@ -4343,6 +4377,16 @@ VIRTIO_NET_F_CTRL_VLAN
 
 \begin_layout Description
 VIRTIO_NET_F_GUEST_ANNOUNCE(21) Guest can send gratuitous packets.
+\change_inserted 1986246365 1352742767
+
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1986246365 1352742808
+VIRTIO_NET_F_RFS(22) Device supports Receive Flow Steering.
+\change_unchanged
+
 \end_layout
 
 \end_deeper
@@ -4355,11 +4399,45 @@ configuration
 \begin_inset space ~
 \end_inset
 
-layout Two configuration fields are currently defined.
+layout
+\change_deleted 1986246365 1352743300
+Two
+\change_inserted 1986246365 1352743301
+Four
+\change_unchanged
+ configuration fields are currently defined.
  The mac address field always exists (though is only valid if VIRTIO_NET_F_MAC
  is set), and the status field only exists if VIRTIO_NET_F_STATUS is set.
  Two read-only bits are currently defined for the status field: VIRTIO_NET_S_LIN
 K_UP and VIRTIO_NET_S_ANNOUNCE.
+
+\change_inserted 1986246365 1353595219
+ The following read-only field,
+\emph on
+max_virtqueue_pairs
+\emph default
+ only exists if VIRTIO_NET_F_RFS is set.
+ This field specifies the maximum number of each of transmit and receive
+ virtqueues (receiveq0..receiveq
+\emph on
+N
+\emph default
+ and transmitq0..transmitq
+\emph on
+N
+\emph default
+ respectively;
+\emph on
+N
+\emph default
+=
+\emph on
+max_virtqueue_pairs - 1
+\emph default
+) that can be configured once VIRTIO_NET_F_RFS is negotiated.
+Legal values for this field are 1 to 8000h.
+
+\change_unchanged
  
 \begin_inset listings
 inline false
@@ -4410,7 +4488,24 @@ Device Initialization
 
 \begin_layout Enumerate
 The initialization routine should identify the receive and transmission
- virtqueues.
+ virtqueues
+\change_inserted 1986246365 1352744077
+, up to N+1 of each kind
+\change_unchanged
+.
+
+\change_inserted 1986246365 1352743942
+ If VIRTIO_NET_F_RFS feature bit is negotiated, 
+\emph on
+N=max_virtqueue_pairs-1
+\emph default
+, otherwise identify 
+\emph on
+N=0
+\emph default
+.
+\change_unchanged
+
 \end_layout
 
 \begin_layout Enumerate
@@ -4455,7 +4550,11 @@ status
 \end_layout
 
 \begin_layout Enumerate
-The receive virtqueue should be filled with receive buffers.
+The receive virtqueue
+\change_inserted 1986246365 1352743953
+s
+\change_unchanged
+ should be filled with receive buffers.
  This is described in detail below in 
 \begin_inset Quotes eld
 \end_inset
@@ -4550,8 +4649,15 @@ Device Operation
 \end_layout
 
 \begin_layout Standard
-Packets are transmitted by placing them in the transmitq, and buffers for
- incoming packets are placed in the receiveq.
+Packets are transmitted by placing them in the transmitq
+\change_inserted 1986246365 1353593685
+0..transmitqN
+\change_unchanged
+, and buffers for incoming packets are placed in the receiveq
+\change_inserted 1986246365 1353593692
+0..receiveqN
+\change_unchanged
+.
  In each case, the packet itself is preceeded by a header:
 \end_layout
 
@@ -4861,6 +4967,17 @@ If VIRTIO_NET_F_MRG_RXBUF is negotiated, each buffer must be at least the
 struct virtio_net_hdr
 \family default
 .
+\change_inserted 1986246365 1353594518
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1986246365 1353594638
+If VIRTIO_NET_F_RFS is negotiated, each of the receiveq0...receiveqN that will
+ be used should be populated with receive buffers.
+\change_unchanged
+
 \end_layout
 
 \begin_layout Subsection*
@@ -5293,8 +5410,143 @@ Sending VIRTIO_NET_CTRL_ANNOUNCE_ACK command through control vq.
  
 \end_layout
 
-\begin_layout Enumerate
+\begin_layout Subsection*
+
+\change_inserted 1986246365 1353593879
+Packet Receive Flow Steering
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1986246365 1353594403
+If the driver negotiates the VIRTIO_NET_F_RFS (depends on VIRTIO_NET_F_CTRL_VQ),
+ it can transmit outgoing packets on one of the multiple transmitq0..transmitqN
+ and ask the device to queue incoming packets into one the multiple receiveq0..rec
+eiveqN depending on the packet flow.
+\change_unchanged
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1986246365 1353594292
+\begin_inset listings
+inline false
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594178
+
+struct virtio_net_ctrl_rfs {
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594212
+
+	u16 virtqueue_pairs;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594172
+
+};
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594172
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594263
+
+#define VIRTIO_NET_CTRL_RFS    1
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594273
+
+ #define VIRTIO_NET_CTRL_RFS_VQ_PAIRS_SET        0 
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594273
+
+ #define VIRTIO_NET_CTRL_RFS_VQ_PAIRS_MIN        1 
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594273
+
+ #define VIRTIO_NET_CTRL_RFS_VQ_PAIRS_MAX        0x8000
+\end_layout
+\end_inset
+
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1986246365 1353594884
+RFS acceleration is disabled by default.
+ Driver enables RFS by executing the VIRTIO_NET_CTRL_RFS_VQ_PAIRS_SET command,
+ specifying the number of the transmit and receive queues that is going
+ to be used; thus out of transmitq0..transmitqn and receiveq0..receiveqn where
+ 
+\emph on
+n=virtqueue_pairs-1
+\emph default
+ will be used.
+ All these virtqueues must have been pre-configured in advance.
+ The range of legal values for the
+\emph on
+ virtqueue_pairs
+\emph off
+ field is between 1 and 8000h.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1986246365 1353595328
+Programming of the receive flow classificator is implicit.
+ Transmitting a packet of a specific flow on transmitqX will cause incoming
+ packets for this flow to be steered to receiveqX.
+ For uni-directional protocols, or where no packets have been transmitted
+ yet, device will steer a packet to a random queue out of the specified
+ receiveq0..receiveqn.
+\change_unchanged
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1986246365 1353595040
+RFS acceleration is disabled by setting 
+\emph on
+virtqueue_pairs = 1
+\emph default
+ (this is the default).
+ Following this, driver should not transmit new packets on virtqueues other
+ than transmitq0 and device will not steer new packets on virtqueues other
+ than receiveq0.
+\change_unchanged
+
+\end_layout
+
+\begin_layout Standard
+
+\change_deleted 1986246365 1353593873
 .
+
+\change_unchanged
  
 \end_layout
 

^ permalink raw reply related

* Re: [PATCH] 8139cp: set ring address after enabling C+ mode
From: Gilboa Davara @ 2012-11-23 12:37 UTC (permalink / raw)
  To: Francois Romieu
  Cc: Jeff Garzik, David Miller, dwmw2, jasowang, netdev, slacky,
	rggjan, Hayes Wang
In-Reply-To: <20121122213950.GA8873@electric-eye.fr.zoreil.com>

On Thu, Nov 22, 2012 at 11:39 PM, Francois Romieu <romieu@fr.zoreil.com> wrote:
> It would be nice if gilboad could give it a try (users Cced).
>

Applied it against 3.6.6.
Seems to be working just fine.

> --
> Ueimor

- Gilboa

^ permalink raw reply

* Re: [PATCH][RESEND] bonding: delete migrated IP addresses from the rlb hash table
From: Jiri Bohac @ 2012-11-23 12:44 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Jiri Bohac, Andy Gospodarek, netdev
In-Reply-To: <2699.1340319919@death.nxdomain>

Hi, 

This is another resend of the patch discussed in June. The only
changes over the previous version are improved comments.

Bonding with balance_rlb keeps poisoning other machines' ARP
caches and I whink we need to fix this.

On Thu, Jun 21, 2012 at 04:05:19PM -0700, Jay Vosburgh wrote:
> Jiri Bohac <jbohac@suse.cz> wrote:
> 
> >Hi, this is a resend of the patch discussed here:
> >	http://thread.gmane.org/gmane.linux.network/228076
> >It has been updated to apply to the lastest net-next.
> [...]
> >The hash table is hashed by ip_dst. To be able to do the above
> >check efficiently (not walking the whole hash table), we need a
> >reverse mapping (by ip_src).
> 
> 	Just a note that I'm doing some testing with this patch.  Seems
> to be ok for the "direct" case (wherein the IP in question is assigned
> to the local system); I haven't tried the "bridge" case yet.  I've
> extended some of the debugfs stuff to dump the new information, and I'm
> trying some of the corner cases (e.g., breaking the linkages in the
> middle) to see if it all hangs together.

Were there any results of your testing?  Good or bad?

> 	I am thinking that the layout of the "hash"-ish table is now
> sufficiently complicated that there should be a comment block somewhere
> describing what's going on (because I didn't really quite get it until I
> dumped the whole thing and looked at it).  With this patch, there is one
> "used" linkage for all of the elements in use, plus some number of "src"
> linkages, one for each active source hash.  The "src" linkages are also
> notable in that they are separate from the "assigned" state.

I updated the comments in drivers/net/bonding/bond_alb.h to
describe the structure.

> >+	 * have a dirrerent mac_src.
> 
> 	Typo here; should be "different."

Fixed. 
Any chance we could finally get this merged?:






Bonding in balance-alb mode records information from ARP packets
passing through the bond in a hash table (rx_hashtbl).

At certain situations (e.g. link change of a slave),
rlb_update_rx_clients() will send out ARP packets to update ARP
caches of other hosts on the network to achieve RX load
balancing.

The problem is that once an IP address is recorded in the hash
table, it stays there indefinitely. If this IP address is
migrated to a different host in the network, bonding still sends
out ARP packets that poison other systems' ARP caches with
invalid information.

This patch solves this by looking at all incoming ARP packets,
and checking if the source IP address is one of the source
addresses stored in the rx_hashtbl. If it is, but the MAC
addresses differ, the corresponding hash table entries are
removed. Thus, when an IP address is migrated, the first ARP
broadcast by its new owner will purge the offending entries of
rx_hashtbl.

The hash table is hashed by ip_dst. To be able to do the above
check efficiently (not walking the whole hash table), we need a
reverse mapping (by ip_src).

I added three new members in struct rlb_client_info:
   rx_hashtbl[x].src_first will point to the start of a list of
      entries for which hash(ip_src) == x.
   The list is linked with src_next and src_prev.

When an incoming ARP packet arrives at rlb_arp_recv()
rlb_purge_src_ip() can quickly walk only the entries on the
corresponding lists, i.e. the entries that are likely to contain
the offending IP address.

To avoid confusion, I renamed these existing fields of struct 
rlb_client_info:
	next -> used_next
	prev -> used_prev
	rx_hashtbl_head -> rx_hashtbl_used_head

(The current linked list is _not_ a list of hash table
entries with colliding ip_dst. It's a list of entries that are
being used; its purpose is to avoid walking the whole hash table
when looking for used entries.)

Signed-off-by: Jiri Bohac <jbohac@suse.cz>

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index e15cc11..8505a24 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -84,6 +84,9 @@ static inline struct arp_pkt *arp_pkt(const struct sk_buff *skb)
 
 /* Forward declaration */
 static void alb_send_learning_packets(struct slave *slave, u8 mac_addr[]);
+static void rlb_purge_src_ip(struct bonding *bond, struct arp_pkt *arp);
+static void rlb_src_unlink(struct bonding *bond, u32 index);
+static void rlb_src_link(struct bonding *bond, u32 ip_src_hash, u32 ip_dst_hash);
 
 static inline u8 _simple_hash(const u8 *hash_start, int hash_size)
 {
@@ -354,6 +357,17 @@ static int rlb_arp_recv(const struct sk_buff *skb, struct bonding *bond,
 	if (!arp)
 		goto out;
 
+	/* We received an ARP from arp->ip_src.
+	 * We might have used this IP address previously (on the bonding host
+	 * itself or on a system that is bridged together with the bond).
+	 * However, if arp->mac_src is different than what is stored in
+	 * rx_hashtbl, some other host is now using the IP and we must prevent
+	 * sending out client updates with this IP address and the old MAC address.
+	 * Clean up all hash table entries that have this address as ip_src but
+	 * have a different mac_src.
+	 */
+	rlb_purge_src_ip(bond, arp);
+
 	if (arp->op_code == htons(ARPOP_REPLY)) {
 		/* update rx hash table for this ARP */
 		rlb_update_entry_from_arp(bond, arp);
@@ -432,9 +446,9 @@ static void rlb_clear_slave(struct bonding *bond, struct slave *slave)
 	_lock_rx_hashtbl_bh(bond);
 
 	rx_hash_table = bond_info->rx_hashtbl;
-	index = bond_info->rx_hashtbl_head;
+	index = bond_info->rx_hashtbl_used_head;
 	for (; index != RLB_NULL_INDEX; index = next_index) {
-		next_index = rx_hash_table[index].next;
+		next_index = rx_hash_table[index].used_next;
 		if (rx_hash_table[index].slave == slave) {
 			struct slave *assigned_slave = rlb_next_rx_slave(bond);
 
@@ -519,8 +533,8 @@ static void rlb_update_rx_clients(struct bonding *bond)
 
 	_lock_rx_hashtbl_bh(bond);
 
-	hash_index = bond_info->rx_hashtbl_head;
-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
+	hash_index = bond_info->rx_hashtbl_used_head;
+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
 		client_info = &(bond_info->rx_hashtbl[hash_index]);
 		if (client_info->ntt) {
 			rlb_update_client(client_info);
@@ -548,8 +562,8 @@ static void rlb_req_update_slave_clients(struct bonding *bond, struct slave *sla
 
 	_lock_rx_hashtbl_bh(bond);
 
-	hash_index = bond_info->rx_hashtbl_head;
-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
+	hash_index = bond_info->rx_hashtbl_used_head;
+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
 		client_info = &(bond_info->rx_hashtbl[hash_index]);
 
 		if ((client_info->slave == slave) &&
@@ -578,8 +592,8 @@ static void rlb_req_update_subnet_clients(struct bonding *bond, __be32 src_ip)
 
 	_lock_rx_hashtbl(bond);
 
-	hash_index = bond_info->rx_hashtbl_head;
-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
+	hash_index = bond_info->rx_hashtbl_used_head;
+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
 		client_info = &(bond_info->rx_hashtbl[hash_index]);
 
 		if (!client_info->slave) {
@@ -625,6 +639,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
 				/* update mac address from arp */
 				memcpy(client_info->mac_dst, arp->mac_dst, ETH_ALEN);
 			}
+			memcpy(client_info->mac_src, arp->mac_src, ETH_ALEN);
 
 			assigned_slave = client_info->slave;
 			if (assigned_slave) {
@@ -647,6 +662,13 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
 	assigned_slave = rlb_next_rx_slave(bond);
 
 	if (assigned_slave) {
+		if (!(client_info->assigned && client_info->ip_src == arp->ip_src)) {
+			/* ip_src is going to be updated, fix the src hash list */
+			u32 hash_src = _simple_hash((u8 *)&arp->ip_src, sizeof(arp->ip_src));
+			rlb_src_unlink(bond, hash_index);
+			rlb_src_link(bond, hash_src, hash_index);
+		}
+
 		client_info->ip_src = arp->ip_src;
 		client_info->ip_dst = arp->ip_dst;
 		/* arp->mac_dst is broadcast for arp reqeusts.
@@ -654,6 +676,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
 		 * upon receiving an arp reply.
 		 */
 		memcpy(client_info->mac_dst, arp->mac_dst, ETH_ALEN);
+		memcpy(client_info->mac_src, arp->mac_src, ETH_ALEN);
 		client_info->slave = assigned_slave;
 
 		if (!ether_addr_equal_64bits(client_info->mac_dst, mac_bcast)) {
@@ -669,11 +692,11 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
 		}
 
 		if (!client_info->assigned) {
-			u32 prev_tbl_head = bond_info->rx_hashtbl_head;
-			bond_info->rx_hashtbl_head = hash_index;
-			client_info->next = prev_tbl_head;
+			u32 prev_tbl_head = bond_info->rx_hashtbl_used_head;
+			bond_info->rx_hashtbl_used_head = hash_index;
+			client_info->used_next = prev_tbl_head;
 			if (prev_tbl_head != RLB_NULL_INDEX) {
-				bond_info->rx_hashtbl[prev_tbl_head].prev =
+				bond_info->rx_hashtbl[prev_tbl_head].used_prev =
 					hash_index;
 			}
 			client_info->assigned = 1;
@@ -740,8 +763,8 @@ static void rlb_rebalance(struct bonding *bond)
 	_lock_rx_hashtbl_bh(bond);
 
 	ntt = 0;
-	hash_index = bond_info->rx_hashtbl_head;
-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
+	hash_index = bond_info->rx_hashtbl_used_head;
+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
 		client_info = &(bond_info->rx_hashtbl[hash_index]);
 		assigned_slave = rlb_next_rx_slave(bond);
 		if (assigned_slave && (client_info->slave != assigned_slave)) {
@@ -759,11 +782,113 @@ static void rlb_rebalance(struct bonding *bond)
 }
 
 /* Caller must hold rx_hashtbl lock */
+static void rlb_init_table_entry_dst(struct rlb_client_info *entry)
+{
+	entry->used_next = RLB_NULL_INDEX;
+	entry->used_prev = RLB_NULL_INDEX;
+	entry->assigned = 0;
+	entry->slave = NULL;
+	entry->tag = 0;
+}
+static void rlb_init_table_entry_src(struct rlb_client_info *entry)
+{
+	entry->src_first = RLB_NULL_INDEX;
+	entry->src_prev = RLB_NULL_INDEX;
+	entry->src_next = RLB_NULL_INDEX;
+}
+
 static void rlb_init_table_entry(struct rlb_client_info *entry)
 {
 	memset(entry, 0, sizeof(struct rlb_client_info));
-	entry->next = RLB_NULL_INDEX;
-	entry->prev = RLB_NULL_INDEX;
+	rlb_init_table_entry_dst(entry);
+	rlb_init_table_entry_src(entry);
+}
+
+static void rlb_delete_table_entry_dst(struct bonding *bond, u32 index)
+{
+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
+	u32 next_index = bond_info->rx_hashtbl[index].used_next;
+	u32 prev_index = bond_info->rx_hashtbl[index].used_prev;
+
+	if (index == bond_info->rx_hashtbl_used_head)
+		bond_info->rx_hashtbl_used_head = next_index;
+	if (prev_index != RLB_NULL_INDEX)
+		bond_info->rx_hashtbl[prev_index].used_next = next_index;
+	if (next_index != RLB_NULL_INDEX)
+		bond_info->rx_hashtbl[next_index].used_prev = prev_index;
+}
+
+/* unlink a rlb hash table entry from the src list */
+static void rlb_src_unlink(struct bonding *bond, u32 index)
+{
+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
+	u32 next_index = bond_info->rx_hashtbl[index].src_next;
+	u32 prev_index = bond_info->rx_hashtbl[index].src_prev;
+
+	bond_info->rx_hashtbl[index].src_next = RLB_NULL_INDEX;
+	bond_info->rx_hashtbl[index].src_prev = RLB_NULL_INDEX;
+
+	if (next_index != RLB_NULL_INDEX)
+		bond_info->rx_hashtbl[next_index].src_prev = prev_index;
+
+	if (prev_index == RLB_NULL_INDEX)
+		return;
+
+	/* is prev_index pointing to the head of this list? */
+	if (bond_info->rx_hashtbl[prev_index].src_first == index)
+		bond_info->rx_hashtbl[prev_index].src_first = next_index;
+	else
+		bond_info->rx_hashtbl[prev_index].src_next = next_index;
+
+}
+
+static void rlb_delete_table_entry(struct bonding *bond, u32 index)
+{
+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
+	struct rlb_client_info *entry = &(bond_info->rx_hashtbl[index]);
+
+	rlb_delete_table_entry_dst(bond, index);
+	rlb_init_table_entry_dst(entry);
+
+	rlb_src_unlink(bond, index);
+}
+
+/* add the rx_hashtbl[ip_dst_hash] entry to the list
+ * of entries with identical ip_src_hash
+ */
+static void rlb_src_link(struct bonding *bond, u32 ip_src_hash, u32 ip_dst_hash)
+{
+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
+	u32 next;
+
+	bond_info->rx_hashtbl[ip_dst_hash].src_prev = ip_src_hash;
+	next = bond_info->rx_hashtbl[ip_src_hash].src_first;
+	bond_info->rx_hashtbl[ip_dst_hash].src_next = next;
+	if (next != RLB_NULL_INDEX)
+		bond_info->rx_hashtbl[next].src_prev = ip_dst_hash;
+	bond_info->rx_hashtbl[ip_src_hash].src_first = ip_dst_hash;
+}
+
+/* deletes all rx_hashtbl entries with  arp->ip_src if their mac_src does
+ * not match arp->mac_src */
+static void rlb_purge_src_ip(struct bonding *bond, struct arp_pkt *arp)
+{
+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
+	u32 ip_src_hash = _simple_hash((u8*)&(arp->ip_src), sizeof(arp->ip_src));
+	u32 index;
+
+	_lock_rx_hashtbl_bh(bond);
+
+	index = bond_info->rx_hashtbl[ip_src_hash].src_first;
+	while (index != RLB_NULL_INDEX) {
+		struct rlb_client_info *entry = &(bond_info->rx_hashtbl[index]);
+		u32 next_index = entry->src_next;
+		if (entry->ip_src == arp->ip_src &&
+		    !ether_addr_equal_64bits(arp->mac_src, entry->mac_src))
+				rlb_delete_table_entry(bond, index);
+		index = next_index;
+	}
+	_unlock_rx_hashtbl_bh(bond);
 }
 
 static int rlb_initialize(struct bonding *bond)
@@ -781,7 +906,7 @@ static int rlb_initialize(struct bonding *bond)
 
 	bond_info->rx_hashtbl = new_hashtbl;
 
-	bond_info->rx_hashtbl_head = RLB_NULL_INDEX;
+	bond_info->rx_hashtbl_used_head = RLB_NULL_INDEX;
 
 	for (i = 0; i < RLB_HASH_TABLE_SIZE; i++) {
 		rlb_init_table_entry(bond_info->rx_hashtbl + i);
@@ -803,7 +928,7 @@ static void rlb_deinitialize(struct bonding *bond)
 
 	kfree(bond_info->rx_hashtbl);
 	bond_info->rx_hashtbl = NULL;
-	bond_info->rx_hashtbl_head = RLB_NULL_INDEX;
+	bond_info->rx_hashtbl_used_head = RLB_NULL_INDEX;
 
 	_unlock_rx_hashtbl_bh(bond);
 }
@@ -815,25 +940,13 @@ static void rlb_clear_vlan(struct bonding *bond, unsigned short vlan_id)
 
 	_lock_rx_hashtbl_bh(bond);
 
-	curr_index = bond_info->rx_hashtbl_head;
+	curr_index = bond_info->rx_hashtbl_used_head;
 	while (curr_index != RLB_NULL_INDEX) {
 		struct rlb_client_info *curr = &(bond_info->rx_hashtbl[curr_index]);
-		u32 next_index = bond_info->rx_hashtbl[curr_index].next;
-		u32 prev_index = bond_info->rx_hashtbl[curr_index].prev;
-
-		if (curr->tag && (curr->vlan_id == vlan_id)) {
-			if (curr_index == bond_info->rx_hashtbl_head) {
-				bond_info->rx_hashtbl_head = next_index;
-			}
-			if (prev_index != RLB_NULL_INDEX) {
-				bond_info->rx_hashtbl[prev_index].next = next_index;
-			}
-			if (next_index != RLB_NULL_INDEX) {
-				bond_info->rx_hashtbl[next_index].prev = prev_index;
-			}
+		u32 next_index = bond_info->rx_hashtbl[curr_index].used_next;
 
-			rlb_init_table_entry(curr);
-		}
+		if (curr->tag && (curr->vlan_id == vlan_id))
+			rlb_delete_table_entry(bond, curr_index);
 
 		curr_index = next_index;
 	}
diff --git a/drivers/net/bonding/bond_alb.h b/drivers/net/bonding/bond_alb.h
index 90f140a..de831ba 100644
--- a/drivers/net/bonding/bond_alb.h
+++ b/drivers/net/bonding/bond_alb.h
@@ -94,15 +94,35 @@ struct tlb_client_info {
 
 /* -------------------------------------------------------------------------
  * struct rlb_client_info contains all info related to a specific rx client
- * connection. This is the Clients Hash Table entry struct
+ * connection. This is the Clients Hash Table entry struct.
+ * Note that this is not a proper hash table; if a new client's IP address
+ * hash collides with an existing client entry, the old entry is replaced.
+ *
+ * There is a linked list (linked by the used_next and used_prev members)
+ * linking all the used entries of the hash table. This allows updating
+ * all the clients without walking over all the unused elements of the table.
+ *
+ * There are also linked lists of entries with identical hash(ip_src). These
+ * allow cleaning up the table from ip_src<->mac_src associatins that have
+ * become outdated and would cause sending out invalid ARP updates to the
+ * network. These are linked by the (src_next and src_prev members).
  * -------------------------------------------------------------------------
  */
 struct rlb_client_info {
 	__be32 ip_src;		/* the server IP address */
 	__be32 ip_dst;		/* the client IP address */
+	u8  mac_src[ETH_ALEN];	/* the server MAC address */
 	u8  mac_dst[ETH_ALEN];	/* the client MAC address */
-	u32 next;		/* The next Hash table entry index */
-	u32 prev;		/* The previous Hash table entry index */
+
+	/* list of used hash table entries, starting at rx_hashtbl_used_head */
+	u32 used_next;
+	u32 used_prev;
+
+	/* ip_src based hashing */
+	u32 src_next;	/* next entry with same hash(ip_src) */
+	u32 src_prev;	/* prev entry with same hash(ip_src) */
+	u32 src_first;	/* first entry with hash(ip_src) == this entry's index */
+
 	u8  assigned;		/* checking whether this entry is assigned */
 	u8  ntt;		/* flag - need to transmit client info */
 	struct slave *slave;	/* the slave assigned to this client */
@@ -131,7 +151,7 @@ struct alb_bond_info {
 	int rlb_enabled;
 	struct rlb_client_info	*rx_hashtbl;	/* Receive hash table */
 	spinlock_t		rx_hashtbl_lock;
-	u32			rx_hashtbl_head;
+	u32			rx_hashtbl_used_head;
 	u8			rx_ntt;	/* flag - need to transmit
 					 * to all rx clients
 					 */
diff --git a/drivers/net/bonding/bond_debugfs.c b/drivers/net/bonding/bond_debugfs.c
index 2cf084e..6ac855f 100644
--- a/drivers/net/bonding/bond_debugfs.c
+++ b/drivers/net/bonding/bond_debugfs.c
@@ -31,8 +31,8 @@ static int bond_debug_rlb_hash_show(struct seq_file *m, void *v)
 
 	spin_lock_bh(&(BOND_ALB_INFO(bond).rx_hashtbl_lock));
 
-	hash_index = bond_info->rx_hashtbl_head;
-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
+	hash_index = bond_info->rx_hashtbl_used_head;
+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
 		client_info = &(bond_info->rx_hashtbl[hash_index]);
 		seq_printf(m, "%-15pI4 %-15pI4 %-17pM %s\n",
 			&client_info->ip_src,

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ

^ permalink raw reply related

* Re: [PATCH 1/5] smsc95xx: fix error checking of usbnet_resume
From: Steve Glendinning @ 2012-11-23 12:55 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20121122.153203.449492215100737957.davem@davemloft.net>

On 22 November 2012 20:32, David Miller <davem@davemloft.net> wrote:
>
> I only see patches #1 and #2, where are the other 3?

That's odd, they seem to have all made it to patchwork so they must have sent:

http://patchwork.ozlabs.org/project/netdev/list/

Spam filtering problem maybe?  Let me know if you need me to re-send.

^ permalink raw reply

* [RFC net-next PATCH V1 7/9] net: frag queue locking per hash bucket
From: Jesper Dangaard Brouer @ 2012-11-23 13:08 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller, Florian Westphal
  Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
	Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu
In-Reply-To: <20121123130749.18764.25962.stgit@dragon>

DO NOT apply - patch not finished, can cause on OOPS/PANIC during hash rebuild

This patch implements per hash bucket locking for the frag queue
hash.  This removes two write locks, and the only remaining write
lock is for protecting hash rebuild.  This essentially reduce the
readers-writer lock to a rebuild lock.

NOT-Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 include/net/inet_frag.h  |   10 +++++++-
 net/ipv4/inet_fragment.c |   56 +++++++++++++++++++++++++++++++++++-----------
 2 files changed, 51 insertions(+), 15 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 9938ea4..1efec6b 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -51,9 +51,15 @@ struct inet_frag_queue {
 
 #define INETFRAGS_HASHSZ		64
 
+
+struct inet_frag_bucket {
+	struct hlist_head	chain;
+	spinlock_t		chain_lock;
+};
+
 struct inet_frags {
-	struct hlist_head	hash[INETFRAGS_HASHSZ];
-	rwlock_t		lock;
+	struct inet_frag_bucket	hash[INETFRAGS_HASHSZ];
+	rwlock_t		lock; /* Rebuild lock */
 	u32			rnd;
 	int			qsize;
 	int			secret_interval;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 1620a21..447423f 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -35,20 +35,27 @@ static void inet_frag_secret_rebuild(unsigned long dummy)
 	unsigned long now = jiffies;
 	int i;
 
+	/* Per bucket lock NOT needed here, due to write lock protection */
 	write_lock(&f->lock);
+
 	get_random_bytes(&f->rnd, sizeof(u32));
 	for (i = 0; i < INETFRAGS_HASHSZ; i++) {
+		struct inet_frag_bucket *hb;
 		struct inet_frag_queue *q;
 		struct hlist_node *p, *n;
 
-		hlist_for_each_entry_safe(q, p, n, &f->hash[i], list) {
+		hb = &f->hash[i];
+		hlist_for_each_entry_safe(q, p, n, &hb->chain, list) {
 			unsigned int hval = f->hashfn(q);
 
 			if (hval != i) {
+				struct inet_frag_bucket *hb_dest;
+
 				hlist_del(&q->list);
 
 				/* Relink to new hash chain. */
-				hlist_add_head(&q->list, &f->hash[hval]);
+				hb_dest = &f->hash[hval];
+				hlist_add_head(&q->list, &hb->chain);
 			}
 		}
 	}
@@ -61,9 +68,12 @@ void inet_frags_init(struct inet_frags *f)
 {
 	int i;
 
-	for (i = 0; i < INETFRAGS_HASHSZ; i++)
-		INIT_HLIST_HEAD(&f->hash[i]);
+	for (i = 0; i < INETFRAGS_HASHSZ; i++) {
+		struct inet_frag_bucket *hb = &f->hash[i];
 
+		spin_lock_init(&hb->chain_lock);
+		INIT_HLIST_HEAD(&hb->chain);
+	}
 	rwlock_init(&f->lock);
 
 	f->rnd = (u32) ((num_physpages ^ (num_physpages>>7)) ^
@@ -102,9 +112,17 @@ EXPORT_SYMBOL(inet_frags_exit_net);
 
 static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f)
 {
-	write_lock(&f->lock);
+	struct inet_frag_bucket *hb;
+	unsigned int hash;
+
+	read_lock(&f->lock);
+	hash = f->hashfn(fq);
+	hb = &f->hash[hash];
+
+	spin_lock_bh(&hb->chain_lock);
 	hlist_del(&fq->list);
-	write_unlock(&f->lock);
+	spin_unlock_bh(&hb->chain_lock);
+	read_unlock(&f->lock);
 	inet_frag_lru_del(fq);
 }
 
@@ -224,28 +242,33 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
 		struct inet_frag_queue *qp_in, struct inet_frags *f,
 		void *arg)
 {
+	struct inet_frag_bucket *hb;
 	struct inet_frag_queue *qp;
 #ifdef CONFIG_SMP
 	struct hlist_node *n;
 #endif
 	unsigned int hash;
 
-	write_lock(&f->lock);
+	read_lock(&f->lock); /* Protects against hash rebuild */
 	/*
 	 * While we stayed w/o the lock other CPU could update
 	 * the rnd seed, so we need to re-calculate the hash
 	 * chain. Fortunatelly the qp_in can be used to get one.
 	 */
 	hash = f->hashfn(qp_in);
+	hb = &f->hash[hash];
+	spin_lock_bh(&hb->chain_lock);
+
 #ifdef CONFIG_SMP
 	/* With SMP race we have to recheck hash table, because
 	 * such entry could be created on other cpu, while we
-	 * promoted read lock to write lock.
+	 * promoted read lock to write lock. ***Comment FIXME***
 	 */
-	hlist_for_each_entry(qp, n, &f->hash[hash], list) {
+	hlist_for_each_entry(qp, n, &hb->chain, list) {
 		if (qp->net == nf && f->match(qp, arg)) {
 			atomic_inc(&qp->refcnt);
-			write_unlock(&f->lock);
+			spin_unlock_bh(&hb->chain_lock);
+			read_unlock(&f->lock);
 			qp_in->last_in |= INET_FRAG_COMPLETE;
 			inet_frag_put(qp_in, f);
 			return qp;
@@ -257,8 +280,9 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
 		atomic_inc(&qp->refcnt);
 
 	atomic_inc(&qp->refcnt);
-	hlist_add_head(&qp->list, &f->hash[hash]);
-	write_unlock(&f->lock);
+	hlist_add_head(&qp->list, &hb->chain);
+	spin_unlock_bh(&hb->chain_lock);
+	read_unlock(&f->lock);
 	inet_frag_lru_add(nf, qp);
 	return qp;
 }
@@ -307,16 +331,22 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 		struct inet_frags *f, void *key, unsigned int hash)
 	__releases(&f->lock)
 {
+	struct inet_frag_bucket *hb;
 	struct inet_frag_queue *q;
 	struct hlist_node *n;
 
-	hlist_for_each_entry(q, n, &f->hash[hash], list) {
+	hb = &f->hash[hash];
+
+	spin_lock_bh(&hb->chain_lock);
+	hlist_for_each_entry(q, n, &hb->chain, list) {
 		if (q->net == nf && f->match(q, key)) {
 			atomic_inc(&q->refcnt);
+			spin_unlock_bh(&hb->chain_lock);
 			read_unlock(&f->lock);
 			return q;
 		}
 	}
+	spin_unlock_bh(&hb->chain_lock);
 	read_unlock(&f->lock);
 
 	return inet_frag_create(nf, f, key);

^ permalink raw reply related

* [RFC net-next PATCH V1 8/9] net: increase frag queue hash size and cache-line
From: Jesper Dangaard Brouer @ 2012-11-23 13:08 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller, Florian Westphal
  Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
	Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu
In-Reply-To: <20121123130749.18764.25962.stgit@dragon>

Increase frag queue hash size and assure cache-line alignment to
avoid false sharing.  Hash size is set to 256, because I have
observed 206 frag queues in use at 4x10G with packet size 4416 bytes.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 include/net/inet_frag.h |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 1efec6b..5054228 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -49,13 +49,12 @@ struct inet_frag_queue {
 	u16			max_size;
 };
 
-#define INETFRAGS_HASHSZ		64
-
+#define INETFRAGS_HASHSZ	256
 
 struct inet_frag_bucket {
 	struct hlist_head	chain;
 	spinlock_t		chain_lock;
-};
+} ____cacheline_aligned_in_smp;
 
 struct inet_frags {
 	struct inet_frag_bucket	hash[INETFRAGS_HASHSZ];

^ permalink raw reply related

* [RFC net-next PATCH V1 9/9] net: frag remove readers-writer lock (hack)
From: Jesper Dangaard Brouer @ 2012-11-23 13:08 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller, Florian Westphal
  Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
	Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu
In-Reply-To: <20121123130749.18764.25962.stgit@dragon>

Do NOT APPLY this patch.

After all the other patches, the rw_lock is now the contention point.

This is a quick hack, that remove the readers-writer lock, by
disabling/breaking hash rebuilding.  Just to see how big the
performance gain would be.

  2x10G size(4416) result: 6481+6764 = 13245 Mbit/s (gen: 7652+8077 Mbit/s)

  4x10G size(4416) result:(5610+6283+5735+5238)=22866 Mbit/s
                     (gen: 6530+7860+5967+5238 =25595 Mbit/s)

And the results show, that its a big win. With 4x10G size(4416)
before: 17923 Mbit/s -> now: 22866 Mbit/s increase 4943 Mbit/s.
With 2x10G size(4416) before 10689 Mbit/s -> 13245 Mbit/s
increase 2556 Mbit/s.

I'll work on a real solution for removing the rw_lock while still
supporting hash rebuilding.  Suggestions and ideas are welcome.

NOT-signed-off
---

 include/net/inet_frag.h                 |    2 +-
 net/ipv4/inet_fragment.c                |   23 +++++++++++++----------
 net/ipv4/ip_fragment.c                  |    2 +-
 net/ipv6/netfilter/nf_conntrack_reasm.c |    2 +-
 net/ipv6/reassembly.c                   |    2 +-
 5 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 5054228..2fb8578 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -58,7 +58,7 @@ struct inet_frag_bucket {
 
 struct inet_frags {
 	struct inet_frag_bucket	hash[INETFRAGS_HASHSZ];
-	rwlock_t		lock; /* Rebuild lock */
+//	rwlock_t		lock; /* Rebuild lock */
 	u32			rnd;
 	int			qsize;
 	int			secret_interval;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 447423f..63227d6 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -35,8 +35,11 @@ static void inet_frag_secret_rebuild(unsigned long dummy)
 	unsigned long now = jiffies;
 	int i;
 
+	//HACK don't rebuild
+	return;
+
 	/* Per bucket lock NOT needed here, due to write lock protection */
-	write_lock(&f->lock);
+//	write_lock(&f->lock);
 
 	get_random_bytes(&f->rnd, sizeof(u32));
 	for (i = 0; i < INETFRAGS_HASHSZ; i++) {
@@ -59,7 +62,7 @@ static void inet_frag_secret_rebuild(unsigned long dummy)
 			}
 		}
 	}
-	write_unlock(&f->lock);
+//	write_unlock(&f->lock);
 
 	mod_timer(&f->secret_timer, now + f->secret_interval);
 }
@@ -74,7 +77,7 @@ void inet_frags_init(struct inet_frags *f)
 		spin_lock_init(&hb->chain_lock);
 		INIT_HLIST_HEAD(&hb->chain);
 	}
-	rwlock_init(&f->lock);
+//	rwlock_init(&f->lock);
 
 	f->rnd = (u32) ((num_physpages ^ (num_physpages>>7)) ^
 				   (jiffies ^ (jiffies >> 6)));
@@ -115,14 +118,14 @@ static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f)
 	struct inet_frag_bucket *hb;
 	unsigned int hash;
 
-	read_lock(&f->lock);
+	//read_lock(&f->lock);
 	hash = f->hashfn(fq);
 	hb = &f->hash[hash];
 
 	spin_lock_bh(&hb->chain_lock);
 	hlist_del(&fq->list);
 	spin_unlock_bh(&hb->chain_lock);
-	read_unlock(&f->lock);
+	//read_unlock(&f->lock);
 	inet_frag_lru_del(fq);
 }
 
@@ -249,7 +252,7 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
 #endif
 	unsigned int hash;
 
-	read_lock(&f->lock); /* Protects against hash rebuild */
+	//read_lock(&f->lock); /* Protects against hash rebuild */
 	/*
 	 * While we stayed w/o the lock other CPU could update
 	 * the rnd seed, so we need to re-calculate the hash
@@ -268,7 +271,7 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
 		if (qp->net == nf && f->match(qp, arg)) {
 			atomic_inc(&qp->refcnt);
 			spin_unlock_bh(&hb->chain_lock);
-			read_unlock(&f->lock);
+			//read_unlock(&f->lock);
 			qp_in->last_in |= INET_FRAG_COMPLETE;
 			inet_frag_put(qp_in, f);
 			return qp;
@@ -282,7 +285,7 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
 	atomic_inc(&qp->refcnt);
 	hlist_add_head(&qp->list, &hb->chain);
 	spin_unlock_bh(&hb->chain_lock);
-	read_unlock(&f->lock);
+	//read_unlock(&f->lock);
 	inet_frag_lru_add(nf, qp);
 	return qp;
 }
@@ -342,12 +345,12 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 		if (q->net == nf && f->match(q, key)) {
 			atomic_inc(&q->refcnt);
 			spin_unlock_bh(&hb->chain_lock);
-			read_unlock(&f->lock);
+			//read_unlock(&f->lock);
 			return q;
 		}
 	}
 	spin_unlock_bh(&hb->chain_lock);
-	read_unlock(&f->lock);
+	//read_unlock(&f->lock);
 
 	return inet_frag_create(nf, f, key);
 }
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 7b1cf51..b2cb05f 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -289,7 +289,7 @@ static inline struct ipq *ip_find(struct net *net, struct iphdr *iph, u32 user)
 	arg.iph = iph;
 	arg.user = user;
 
-	read_lock(&ip4_frags.lock);
+	//read_lock(&ip4_frags.lock);
 	hash = ipqhashfn(iph->id, iph->saddr, iph->daddr, iph->protocol);
 
 	q = inet_frag_find(&net->ipv4.frags, &ip4_frags, &arg, hash);
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index c088831..5b57e03 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -175,7 +175,7 @@ static inline struct frag_queue *fq_find(struct net *net, __be32 id,
 	arg.src = src;
 	arg.dst = dst;
 
-	read_lock_bh(&nf_frags.lock);
+	//read_lock_bh(&nf_frags.lock);
 	hash = inet6_hash_frag(id, src, dst, nf_frags.rnd);
 
 	q = inet_frag_find(&net->nf_frag.frags, &nf_frags, &arg, hash);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 9cfe047..2c74394 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -193,7 +193,7 @@ fq_find(struct net *net, __be32 id, const struct in6_addr *src, const struct in6
 	arg.src = src;
 	arg.dst = dst;
 
-	read_lock(&ip6_frags.lock);
+	//read_lock(&ip6_frags.lock);
 	hash = inet6_hash_frag(id, src, dst, ip6_frags.rnd);
 
 	q = inet_frag_find(&net->ipv6.frags, &ip6_frags, &arg, hash);

^ permalink raw reply related

* [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
From: Jesper Dangaard Brouer @ 2012-11-23 13:08 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller, Florian Westphal
  Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
	Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu

This patchset implements significant performance improvements for
fragmentation handling in the kernel, with a focus on NUMA and SMP
based systems.

Review:

 Please review these patches.  I have on purpose added comments in the
 code with the "//" comments style.  These comments are to be removed
 before applying.  They serve as a questions to, you, the reviewer.

The fragmentation code today:

 The fragmentation code "protects" kernel resources, by implementing
 some memory resource limitation code.  This is centered around a
 global readers-writer lock, and (per network namespace) an atomic mem
 counter and a LRU (Least-Recently-Used) list.  (Although separate
 global variables and namespace resources, are kept for IPv4, IPv6
 and Netfilter reassembly.)

 The code tries to keep the memory usage between a high and low
 threshold (see: /proc/sys/net/ipv4/ipfrag_{high,low}_thresh).  The
 "evictor" code cleans up fragments, when the high threshold is
 exceeded, and stops only, when the low threshold is reached.

The scalability problem:

 Having a global/central variable for a resource limit is obviously a
 scalability issue on SMP systems, and even amplified on a NUMA based
 system.

 When profiling the code, the scalability problems appeared to be the
 readers-writer lock.  But, surprise, the primary scalability issue
 was caused by the global atomic mem limit counter, which, especially
 on NUMA systems, would prolong the time spend inside the
 readers-writer lock sections.  It is not trivial to remove the
 readers-writer lock, but it is possible to reduce the number of
 writer lock sections.

Testlab:

 My original big-testlab were based on four Intel based 10Gbit/s NICs
 on two identical Sandy-Bridge-E NUMA system.  The testlab
 used/available, while rebasing to net-next, were not as powerful.
 Its based on a single Sandy-Bridge-E NUMA system with the same Intel
 10G NICs, but the generator machine was an old Core-i7 920 with some
 older NICs. This means that I have not been able to generate full 4x
 10G wirespeed.  I have chosen (mostly) to include 2x 10G test results
 due to the generator machine (although the 4x 10G results from the
 big system looks more impressive).

 The tests are performed with netperf -t UDP_STREAM (which default
 send UDP packets with size 65507 bytes, which gets fragmented).  The
 netserver's get numactl pinned and the CPU sockets get smp_affinity
 aligned to the physical NIC connected to its own NUMA node.

Performance results:

 For the impressive 4x 10Gbit/s big-testlab results, performance goes
  from (a collective) 496 Mbit/s to 38463 Mbit/s (per stream 9615 Mbit/s)
  (at packet size 65507 bytes)

 For the results to be fair/meaningful, I'll report the used packet
 size, as (after the fixes) bigger UDP packets scale better, because
 smaller packets will require/create more frag queues to handle.

 I'll report packet size 65507 and three fragments 1472*3=4416 bytes.

 Disabled Ethernet Flow Control (via ethtool -A).  To show the real
 effect of the patches, the system needs to be in an "overload"
 situation.  When Ethernet Flow Control is enabled, the system will
 make the generator back-off, and the code path will be less stressed.
 Thus, I have disabled Ethernet Flow Control.

No patches:
 -------
 Results without any patches, and no flow control:

  2x10G size(65507) result:(7+50)     =57   Mbit/s (gen:9613+9473 Mbit/s)
  2x10G size(4416)  result:(3619+3772)=7391 Mbit/s (gen:8339+9105 Mbit/s)

 The very pure result with large frames is a result of the "evictor"
 code, which gets fixed in patch-01.

Patch-01: net: frag evictor, avoid killing warm frag queues
 -------
 The fragmentation evictor system have a very unfortunate eviction
 system for killing fragment, when the system is put under pressure.
 The evictor code basically kills "warm" fragments too quickly.
 Resulting in a massive, DoS like, performance drop, as seen above
 (no-patch) results with large packets.

 The solution is to avoid killing "warm" fragments, and rather block
 new incoming in case mem limit is exceeded. This is solved by
 introducing a creation time-stamp, which set to "jiffies" in
 inet_frag_alloc().

  2x10G size(65507) result:(3011+2568)=5579 Mbit/s (gen:9613+9553 Mbit/s)
  2x10G size(4416)  result:(3716+3518)=7234 Mbit/s (gen:9037+8614 Mbit/s)

Patch-02: cache line adjust inet_frag_queue.net (netns)
 -------
 Avoid possible cache-line bounces in struct inet_frag_queue.  By
 moving the net pointer (struct netns_frags) because its placed on the
 same write-often cache-line as e.g. refcnt and lock.

  2x10G size(65507) result:(2960+2613)=5573 Mbit/s (gen:9614+9465 Mbit/s)
  2x10G size(4416)  result:(3858+3650)=7508 Mbit/s (gen:8076+7633 Mbit/s)

 The performance benefit looks small. We can discuss if this patch is
 needed or not.

Patch-03: move LRU list maintenance outside of rwlock
 -------
 Updating the fragmentation queues LRU (Least-Recently-Used) list,
 required taking the hash writer lock.  However, the LRU list isn't
 tied to the hash at all, so we can use a separate lock for it.

 This patch looks like a performance loss for big packets, but the LRU
 locking changes are needed, by later patches.

  2x10G size(65507) result:(2533+2138)=4671 Mbit/s (gen:9612+9461 Mbit/s)
  2x10G size(4416)  result:(3952+3713)=7665 Mbit/s (gen:9168+8415 Mbit/s)

Patch-04: frag helper functions for mem limit tracking
 -------
 This patch is only meant as a preparation patch, towards the next
 patch.  The performance improvement comes from reduce the number
 atomic operation, during freeing of a frag queue, by summing the mem
 accounting before and doing a single atomic dec.

  2x10G size(65507) result:(2475+3101)=5576 Mbit/s (gen:9614+9439 Mbit/s)
  2x10G size(4416)  result:(3928+4129)=8057 Mbit/s (gen:7259+8131 Mbit/s)

Patch-05: per CPU mem limit and LRU list accounting
 -------
 The major performance bottleneck on NUMA systems, is the mem limit
 counter, which is based on an atomic counter.  This patch removes the
 cache-bouncing of the atomic counter, by moving this accounting to be
 bound to each CPU.  The LRU list also need to be done per CPU,
 in-order to keep the accounting straight.

  2x10G size(65507) result:(9603+9458)=19061 Mbit/s (gen:9614+9458 Mbit/s)
  2x10G size(4416)  result:(4871+4848)=9719 Mbit/s (gen:9107+8378 Mbit/s)

 To compare the benefit of the next patches, its necessary to increase
 the stress on the code, but doing 4x 10Gbit/s tests.

  4x10G size(65507) result:(8631+9337+7534+6928)=32430 Mbit/s
                       (gen:8646+9613+7547+6937 =32743 Mbit/s)
  4x10G size(4416)  result:(2870+2990+2993+3016)=11869 Mbit/s
                       (gen:4819+7767+6893+5043 =24522 Mbit/s)

Patch-06: nqueues_under_LRU_lock
 -------
 This patch just moves the nqueues counter under the LRU lock (and
 per CPU), instead of the write lock, to prepare for next patch.  No
 need for performance testing this part.

Patch-07: hash_bucket_locking
 -------
 This patch implements per hash bucket locking for the frag queue
 hash.  This removes two write locks, and the only remaining write
 lock is for protecting hash rebuild.  This essentially reduces the
 readers-writer lock to a rebuild lock.

 UPDATE: This patch can result in a OOPS during hash rebuilding.
 Needs more work before its safe to apply.

  2x10G size(65507) result:(9602+9466)=19068 Mbit/s (gen:9613+9472 Mbit/s)
  2x10G size(4416)  result:(5024+4925)= 9949 Mbit/s (gen:8581+8957 Mbit/s)

 To see the real benefit of this patch, we need to crank up the load
 and stress on the code, with 4x 10Gbit/s at small packets,
 improvement at size(4416): before 11869 Mbit/s now 17155 Mbit/s. Also
 note the regression at size(65507) 32430 -> 31021.

  4x10G size(65507) result:(7618+8708+7381+7314)=31021 Mbit/s
                       (gen:7628+9501+8728+7321 =33178 Mbit/s)
  4x10G size(4416)  result:(4156+4714+4300+3985)=17155 Mbit/s
                       (gen:6614+5330+7745+5366 =25055 Mbit/s)

 At 4x10G size(4416) I have seen 206 frag queues in use, and hash size is 64.

Patch-08: cache_align_hash_bucket
 -------
 Increase frag queue hash size and assure cache-line alignment to
 avoid false sharing.  Hash size is set to 256, because I have
 observed 206 frag queues in use at 4x10G with packet size 4416 bytes.

  2x10G size(65507) result:(9601+9414)=19015 Mbit/s (gen:9614+9434 Mbit/s)
  2x10G size(4416)  result:(5421+5268)=10689 Mbit/s (gen:8028+7457 Mbit/s)

 This does introduce an improvement (although not as big as I
 expected), but most importantly the regression seen in patch-07 4x10G
 at size(65507) is gone (patch-05:32430 Mbits/s -> 32676 Mbit).

  4x10G size(65507) result:(7604+8307+9593+7172)=32676 Mbit/s
                       (gen:7615+8713+9606+7184 =33118 Mbit/s)
  4x10G size(4416)  result:(4890+4364+4139+4530)=17923 Mbit/s
                       (gen:5170+6873+5215+7632 =24890 Mbit/s)

 After this patch it looks like the read lock is now the new
 contention point.

Patch-09: Hack disable rebuild and remove rw_lock
 -------
 I've done a quick hack patch, that remove the readers-writer lock, by
 disabling/breaking hash rebuilding.  Just to see how big the
 performance gain would be.

  2x10G size(4416) result: 6481+6764 = 13245 Mbit/s (gen: 7652+8077 Mbit/s)

  4x10G size(4416) result:(5610+6283+5735+5238)=22866 Mbit/s
                     (gen: 6530+7860+5967+5238 =25595 Mbit/s)

 And the results show, that its a big win. With 4x10G size(4416)
 before: 17923 Mbit/s -> now: 22866 Mbit/s increase 4943 Mbit/s.
 With 2x10G size(4416) before 10689 Mbit/s -> 13245 Mbit/s
 increase 2556 Mbit/s.

 I'll work on a real solution for removing the rw_lock while still
 supporting hash rebuilding.  Suggestions and ideas are welcome.


This patchset is based upon:
  Davem's net-next tree:
    git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git
  On top of:
    commit ff33c0e1885cda44dd14c79f70df4706f83582a0
    (net: Remove bogus dependencies on INET)

---

Jesper Dangaard Brouer (9):
      net: frag remove readers-writer lock (hack)
      net: increase frag queue hash size and cache-line
      net: frag queue locking per hash bucket
      net: frag, move nqueues counter under LRU lock protection
      net: frag per CPU mem limit and LRU list accounting
      net: frag helper functions for mem limit tracking
      net: frag, move LRU list maintenance outside of rwlock
      net: frag cache line adjust inet_frag_queue.net
      net: frag evictor, avoid killing warm frag queues


 include/net/inet_frag.h                 |  120 +++++++++++++++++++++++--
 include/net/ipv6.h                      |    4 -
 net/ipv4/inet_fragment.c                |  150 ++++++++++++++++++++++---------
 net/ipv4/ip_fragment.c                  |   43 +++++----
 net/ipv6/netfilter/nf_conntrack_reasm.c |   13 +--
 net/ipv6/reassembly.c                   |   16 ++-
 6 files changed, 259 insertions(+), 87 deletions(-)


--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [RFC net-next PATCH V1 2/9] net: frag cache line adjust inet_frag_queue.net
From: Jesper Dangaard Brouer @ 2012-11-23 13:08 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller, Florian Westphal
  Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
	Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu
In-Reply-To: <20121123130749.18764.25962.stgit@dragon>

In inet_frag_find() unfortunate cache-line bounces can occur with
struct inet_frag_queue.  As the net pointer (struct netns_frags)
is placed on the same write-often cache-line as e.g. refcnt and
lock. As the hash match check always check (q->net == nf).

This (of-cause) only happens on hash bucket collisions, but as current
hash size is only 64 this makes collisions more likely.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 include/net/inet_frag.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 7b897b2..1f75316 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -14,7 +14,6 @@ struct netns_frags {
 
 struct inet_frag_queue {
 	struct hlist_node	list;
-	struct netns_frags	*net;
 	struct list_head	lru_list;   /* lru list member */
 	spinlock_t		lock;
 	atomic_t		refcnt;
@@ -24,6 +23,7 @@ struct inet_frag_queue {
 	ktime_t			stamp;
 	int			len;        /* total length of orig datagram */
 	int			meat;
+	struct netns_frags	*net;
 	u32			creation_ts;/* jiffies when queue was created*/
 	__u8			last_in;    /* first/last segment arrived? */
 

^ permalink raw reply related

* [RFC net-next PATCH V1 4/9] net: frag helper functions for mem limit tracking
From: Jesper Dangaard Brouer @ 2012-11-23 13:08 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller, Florian Westphal
  Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
	Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu
In-Reply-To: <20121123130749.18764.25962.stgit@dragon>

This change is primarily a preparation to ease the extension of memory
limit tracking.

The change does reduce the number atomic operation, during freeing of
a frag queue.  This does introduce a fairly good performance improvement, as
these atomic operations are at the core of the performance problems
seen on NUMA systems.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 include/net/inet_frag.h                 |   28 ++++++++++++++++++++++++++++
 include/net/ipv6.h                      |    2 +-
 net/ipv4/inet_fragment.c                |   27 +++++++++++++--------------
 net/ipv4/ip_fragment.c                  |   24 +++++++++++-------------
 net/ipv6/netfilter/nf_conntrack_reasm.c |    6 +++---
 net/ipv6/reassembly.c                   |    6 +++---
 6 files changed, 59 insertions(+), 34 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 312a3fa..9bbef17 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -95,4 +95,32 @@ static inline void inet_frag_lru_add(struct netns_frags *nf,
 	list_add_tail(&q->lru_list, &nf->lru_list);
 	spin_unlock(&nf->lru_lock);
 }
+
+/* Memory Tracking Functions. */
+
+static inline int frag_mem_limit(struct netns_frags *nf)
+{
+	return atomic_read(&nf->mem);
+}
+
+static inline void sub_frag_mem_limit(struct inet_frag_queue *q, int i)
+{
+	atomic_sub(i, &q->net->mem);
+}
+
+static inline void add_frag_mem_limit(struct inet_frag_queue *q, int i)
+{
+	atomic_add(i, &q->net->mem);
+}
+
+static inline void init_frag_mem_limit(struct netns_frags *nf)
+{
+	atomic_set(&nf->mem, 0);
+}
+
+static inline int sum_frag_mem_limit(struct netns_frags *nf)
+{
+	return atomic_read(&nf->mem);
+}
+
 #endif
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 979bf6c..a5c1cf1 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -279,7 +279,7 @@ static inline int ip6_frag_nqueues(struct net *net)
 
 static inline int ip6_frag_mem(struct net *net)
 {
-	return atomic_read(&net->ipv6.frags.mem);
+	return sum_frag_mem_limit(&net->ipv6.frags);
 }
 #endif
 
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 47141ab..4f1ab8a 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -73,7 +73,7 @@ EXPORT_SYMBOL(inet_frags_init);
 void inet_frags_init_net(struct netns_frags *nf)
 {
 	nf->nqueues = 0;
-	atomic_set(&nf->mem, 0);
+	init_frag_mem_limit(nf);
 	INIT_LIST_HEAD(&nf->lru_list);
 	spin_lock_init(&nf->lru_lock);
 }
@@ -118,12 +118,8 @@ void inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f)
 EXPORT_SYMBOL(inet_frag_kill);
 
 static inline void frag_kfree_skb(struct netns_frags *nf, struct inet_frags *f,
-		struct sk_buff *skb, int *work)
+		struct sk_buff *skb)
 {
-	if (work)
-		*work -= skb->truesize;
-
-	atomic_sub(skb->truesize, &nf->mem);
 	if (f->skb_free)
 		f->skb_free(skb);
 	kfree_skb(skb);
@@ -134,6 +130,7 @@ void inet_frag_destroy(struct inet_frag_queue *q, struct inet_frags *f,
 {
 	struct sk_buff *fp;
 	struct netns_frags *nf;
+	unsigned int sum, sum_truesize = 0;
 
 	WARN_ON(!(q->last_in & INET_FRAG_COMPLETE));
 	WARN_ON(del_timer(&q->timer) != 0);
@@ -144,13 +141,14 @@ void inet_frag_destroy(struct inet_frag_queue *q, struct inet_frags *f,
 	while (fp) {
 		struct sk_buff *xp = fp->next;
 
-		frag_kfree_skb(nf, f, fp, work);
+		sum_truesize += fp->truesize;
+		frag_kfree_skb(nf, f, fp);
 		fp = xp;
 	}
-
+	sum = sum_truesize + f->qsize;
 	if (work)
-		*work -= f->qsize;
-	atomic_sub(f->qsize, &nf->mem);
+		*work -= sum;
+	sub_frag_mem_limit(q, sum);
 
 	if (f->destructor)
 		f->destructor(q);
@@ -165,11 +163,11 @@ int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f, bool force)
 	int work, evicted = 0;
 
 	if (!force) {
-		if (atomic_read(&nf->mem) <= nf->high_thresh)
+		if (frag_mem_limit(nf) <= nf->high_thresh)
 			return 0;
 	}
 
-	work = atomic_read(&nf->mem) - nf->low_thresh;
+	work = frag_mem_limit(nf) - nf->low_thresh;
 	while (work > 0) {
 		spin_lock(&nf->lru_lock);
 
@@ -269,7 +267,7 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
 	/* Guard creations of new frag queues if mem limit reached, as
 	 * we allow warm/recent elements to survive in inet_frag_evictor()
 	 */
-	if (atomic_read(&nf->mem) > nf->high_thresh)
+	if (frag_mem_limit(nf) > nf->high_thresh)
 		return NULL;
 
 	q = kzalloc(f->qsize, GFP_ATOMIC);
@@ -279,7 +277,8 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
 	q->creation_ts = (u32) jiffies;
 	q->net = nf;
 	f->constructor(q, arg);
-	atomic_add(f->qsize, &nf->mem);
+	add_frag_mem_limit(q, f->qsize);
+
 	setup_timer(&q->timer, f->frag_expire, (unsigned long)q);
 	spin_lock_init(&q->lock);
 	atomic_set(&q->refcnt, 1);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 04c9e53..cc36a0b 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -122,7 +122,7 @@ int ip_frag_nqueues(struct net *net)
 
 int ip_frag_mem(struct net *net)
 {
-	return atomic_read(&net->ipv4.frags.mem);
+	return sum_frag_mem_limit(&net->ipv4.frags);
 }
 
 static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
@@ -161,13 +161,6 @@ static bool ip4_frag_match(struct inet_frag_queue *q, void *a)
 		qp->user == arg->user;
 }
 
-/* Memory Tracking Functions. */
-static void frag_kfree_skb(struct netns_frags *nf, struct sk_buff *skb)
-{
-	atomic_sub(skb->truesize, &nf->mem);
-	kfree_skb(skb);
-}
-
 static void ip4_frag_init(struct inet_frag_queue *q, void *a)
 {
 	struct ipq *qp = container_of(q, struct ipq, q);
@@ -340,6 +333,7 @@ static inline int ip_frag_too_far(struct ipq *qp)
 static int ip_frag_reinit(struct ipq *qp)
 {
 	struct sk_buff *fp;
+	unsigned int sum_truesize = 0;
 
 	if (!mod_timer(&qp->q.timer, jiffies + qp->q.net->timeout)) {
 		atomic_inc(&qp->q.refcnt);
@@ -349,9 +343,12 @@ static int ip_frag_reinit(struct ipq *qp)
 	fp = qp->q.fragments;
 	do {
 		struct sk_buff *xp = fp->next;
-		frag_kfree_skb(qp->q.net, fp);
+
+		sum_truesize += fp->truesize;
+		kfree_skb(fp);
 		fp = xp;
 	} while (fp);
+	sub_frag_mem_limit(&qp->q, sum_truesize);
 
 	qp->q.last_in = 0;
 	qp->q.len = 0;
@@ -496,7 +493,8 @@ found:
 				qp->q.fragments = next;
 
 			qp->q.meat -= free_it->len;
-			frag_kfree_skb(qp->q.net, free_it);
+			sub_frag_mem_limit(&qp->q, free_it->truesize);
+			kfree_skb(free_it);
 		}
 	}
 
@@ -519,7 +517,7 @@ found:
 	qp->q.stamp = skb->tstamp;
 	qp->q.meat += skb->len;
 	qp->ecn |= ecn;
-	atomic_add(skb->truesize, &qp->q.net->mem);
+	add_frag_mem_limit(&qp->q, skb->truesize);
 	if (offset == 0)
 		qp->q.last_in |= INET_FRAG_FIRST_IN;
 
@@ -614,7 +612,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 		head->len -= clone->len;
 		clone->csum = 0;
 		clone->ip_summed = head->ip_summed;
-		atomic_add(clone->truesize, &qp->q.net->mem);
+		add_frag_mem_limit(&qp->q, clone->truesize);
 	}
 
 	skb_push(head, head->data - skb_network_header(head));
@@ -642,7 +640,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 		}
 		fp = next;
 	}
-	atomic_sub(sum_truesize, &qp->q.net->mem);
+	sub_frag_mem_limit(&qp->q, sum_truesize);
 
 	head->next = NULL;
 	head->dev = dev;
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index b0a1c96..c088831 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -316,7 +316,7 @@ found:
 	fq->q.meat += skb->len;
 	if (payload_len > fq->q.max_size)
 		fq->q.max_size = payload_len;
-	atomic_add(skb->truesize, &fq->q.net->mem);
+	add_frag_mem_limit(&fq->q, skb->truesize);
 
 	/* The first fragment.
 	 * nhoffset is obtained from the first fragment, of course.
@@ -394,7 +394,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct net_device *dev)
 		clone->ip_summed = head->ip_summed;
 
 		NFCT_FRAG6_CB(clone)->orig = NULL;
-		atomic_add(clone->truesize, &fq->q.net->mem);
+		add_frag_mem_limit(&fq->q, clone->truesize);
 	}
 
 	/* We have to remove fragment header from datagram and to relocate
@@ -418,7 +418,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct net_device *dev)
 			head->csum = csum_add(head->csum, fp->csum);
 		head->truesize += fp->truesize;
 	}
-	atomic_sub(head->truesize, &fq->q.net->mem);
+	sub_frag_mem_limit(&fq->q, head->truesize);
 
 	head->local_df = 1;
 	head->next = NULL;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index b38f290..9cfe047 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -327,7 +327,7 @@ found:
 	}
 	fq->q.stamp = skb->tstamp;
 	fq->q.meat += skb->len;
-	atomic_add(skb->truesize, &fq->q.net->mem);
+	add_frag_mem_limit(&fq->q, skb->truesize);
 
 	/* The first fragment.
 	 * nhoffset is obtained from the first fragment, of course.
@@ -426,7 +426,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
 		head->len -= clone->len;
 		clone->csum = 0;
 		clone->ip_summed = head->ip_summed;
-		atomic_add(clone->truesize, &fq->q.net->mem);
+		add_frag_mem_limit(&fq->q, clone->truesize);
 	}
 
 	/* We have to remove fragment header from datagram and to relocate
@@ -464,7 +464,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
 		}
 		fp = next;
 	}
-	atomic_sub(sum_truesize, &fq->q.net->mem);
+	sub_frag_mem_limit(&fq->q, sum_truesize);
 
 	head->next = NULL;
 	head->dev = dev;

^ permalink raw reply related

* [PATCH] kvaser_usb: fix "dma on the stack" errors
From: Olivier Sobrie @ 2012-11-23 13:30 UTC (permalink / raw)
  To: Marc Kleine-Budde, linux-can
  Cc: Greg KH, Wolfgang Grandegge, netdev, linux-usb, Daniel Berglund,
	Olivier Sobrie
In-Reply-To: <50AF3856.8060808@pengutronix.de>

The dma buffer given to usb_bulk_msg() must be allocated and not on
the stack.
See Documentation/DMA-API-HOWTO.txt section "What memory is DMA'able?"

Signed-off-by: Olivier Sobrie <olivier@sobrie.be>
---
Here is the incremental patch.
Thank you Greg !

Olivier

 drivers/net/can/usb/kvaser_usb.c |  110 ++++++++++++++++++++++++--------------
 1 file changed, 69 insertions(+), 41 deletions(-)

diff --git a/drivers/net/can/usb/kvaser_usb.c b/drivers/net/can/usb/kvaser_usb.c
index 8807bf8..7ac6e82 100644
--- a/drivers/net/can/usb/kvaser_usb.c
+++ b/drivers/net/can/usb/kvaser_usb.c
@@ -421,14 +421,21 @@ end:
 static int kvaser_usb_send_simple_msg(const struct kvaser_usb *dev,
 				      u8 msg_id, int channel)
 {
-	struct kvaser_msg msg = {
-		.len = MSG_HEADER_LEN + sizeof(struct kvaser_msg_simple),
-		.id = msg_id,
-		.u.simple.channel = channel,
-		.u.simple.tid = 0xff,
-	};
-
-	return kvaser_usb_send_msg(dev, &msg);
+	struct kvaser_msg *msg;
+	int rc;
+
+	msg = kmalloc(sizeof(*msg), GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	msg->len = MSG_HEADER_LEN + sizeof(struct kvaser_msg_simple);
+	msg->u.simple.channel = channel;
+	msg->u.simple.tid = 0xff;
+
+	rc = kvaser_usb_send_msg(dev, msg);
+
+	kfree(msg);
+	return rc;
 }
 
 static int kvaser_usb_get_software_info(struct kvaser_usb *dev)
@@ -1057,20 +1064,27 @@ static int kvaser_usb_setup_rx_urbs(struct kvaser_usb *dev)
 
 static int kvaser_usb_set_opt_mode(const struct kvaser_usb_net_priv *priv)
 {
-	struct kvaser_msg msg = {
-		.id = CMD_SET_CTRL_MODE,
-		.len = MSG_HEADER_LEN +
-		       sizeof(struct kvaser_msg_ctrl_mode),
-		.u.ctrl_mode.tid = 0xff,
-		.u.ctrl_mode.channel = priv->channel,
-	};
+	struct kvaser_msg *msg;
+	int rc;
+
+	msg = kmalloc(sizeof(*msg), GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	msg->id = CMD_SET_CTRL_MODE;
+	msg->len = MSG_HEADER_LEN + sizeof(struct kvaser_msg_ctrl_mode);
+	msg->u.ctrl_mode.tid = 0xff;
+	msg->u.ctrl_mode.channel = priv->channel;
 
 	if (priv->can.ctrlmode & CAN_CTRLMODE_LISTENONLY)
-		msg.u.ctrl_mode.ctrl_mode = KVASER_CTRL_MODE_SILENT;
+		msg->u.ctrl_mode.ctrl_mode = KVASER_CTRL_MODE_SILENT;
 	else
-		msg.u.ctrl_mode.ctrl_mode = KVASER_CTRL_MODE_NORMAL;
+		msg->u.ctrl_mode.ctrl_mode = KVASER_CTRL_MODE_NORMAL;
+
+	rc = kvaser_usb_send_msg(priv->dev, msg);
 
-	return kvaser_usb_send_msg(priv->dev, &msg);
+	kfree(msg);
+	return rc;
 }
 
 static int kvaser_usb_start_chip(struct kvaser_usb_net_priv *priv)
@@ -1163,15 +1177,22 @@ static int kvaser_usb_stop_chip(struct kvaser_usb_net_priv *priv)
 
 static int kvaser_usb_flush_queue(struct kvaser_usb_net_priv *priv)
 {
-	struct kvaser_msg msg = {
-		.id = CMD_FLUSH_QUEUE,
-		.len = MSG_HEADER_LEN +
-		       sizeof(struct kvaser_msg_flush_queue),
-		.u.flush_queue.channel = priv->channel,
-		.u.flush_queue.flags = 0x00,
-	};
-
-	return kvaser_usb_send_msg(priv->dev, &msg);
+	struct kvaser_msg *msg;
+	int rc;
+
+	msg = kmalloc(sizeof(*msg), GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	msg->id = CMD_FLUSH_QUEUE;
+	msg->len = MSG_HEADER_LEN + sizeof(struct kvaser_msg_flush_queue);
+	msg->u.flush_queue.channel = priv->channel;
+	msg->u.flush_queue.flags = 0x00;
+
+	rc = kvaser_usb_send_msg(priv->dev, msg);
+
+	kfree(msg);
+	return rc;
 }
 
 static int kvaser_usb_close(struct net_device *netdev)
@@ -1364,24 +1385,31 @@ static int kvaser_usb_set_bittiming(struct net_device *netdev)
 	struct kvaser_usb_net_priv *priv = netdev_priv(netdev);
 	struct can_bittiming *bt = &priv->can.bittiming;
 	struct kvaser_usb *dev = priv->dev;
-	struct kvaser_msg msg = {
-		.id = CMD_SET_BUS_PARAMS,
-		.len = MSG_HEADER_LEN +
-		       sizeof(struct kvaser_msg_busparams),
-		.u.busparams.channel = priv->channel,
-		.u.busparams.tid = 0xff,
-		.u.busparams.bitrate = cpu_to_le32(bt->bitrate),
-		.u.busparams.sjw = bt->sjw,
-		.u.busparams.tseg1 = bt->prop_seg + bt->phase_seg1,
-		.u.busparams.tseg2 = bt->phase_seg2,
-	};
+	struct kvaser_msg *msg;
+	int rc;
+
+	msg = kmalloc(sizeof(*msg), GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	msg->id = CMD_SET_BUS_PARAMS;
+	msg->len = MSG_HEADER_LEN + sizeof(struct kvaser_msg_busparams);
+	msg->u.busparams.channel = priv->channel;
+	msg->u.busparams.tid = 0xff;
+	msg->u.busparams.bitrate = cpu_to_le32(bt->bitrate);
+	msg->u.busparams.sjw = bt->sjw;
+	msg->u.busparams.tseg1 = bt->prop_seg + bt->phase_seg1;
+	msg->u.busparams.tseg2 = bt->phase_seg2;
 
 	if (priv->can.ctrlmode & CAN_CTRLMODE_3_SAMPLES)
-		msg.u.busparams.no_samp = 3;
+		msg->u.busparams.no_samp = 3;
 	else
-		msg.u.busparams.no_samp = 1;
+		msg->u.busparams.no_samp = 1;
+
+	rc = kvaser_usb_send_msg(dev, msg);
 
-	return kvaser_usb_send_msg(dev, &msg);
+	kfree(msg);
+	return rc;
 }
 
 static int kvaser_usb_set_mode(struct net_device *netdev,
-- 
1.7.9.5

^ permalink raw reply related

* Re: [PATCH] kvaser_usb: fix "dma on the stack" errors
From: Olivier Sobrie @ 2012-11-23 13:40 UTC (permalink / raw)
  To: Marc Kleine-Budde, linux-can
  Cc: Greg KH, Wolfgang Grandegge, netdev, linux-usb, Daniel Berglund
In-Reply-To: <1353677428-15805-1-git-send-email-olivier@sobrie.be>

On Fri, Nov 23, 2012 at 02:30:28PM +0100, Olivier Sobrie wrote:
> The dma buffer given to usb_bulk_msg() must be allocated and not on
> the stack.
> See Documentation/DMA-API-HOWTO.txt section "What memory is DMA'able?"
> 
> Signed-off-by: Olivier Sobrie <olivier@sobrie.be>
> ---
> Here is the incremental patch.
> Thank you Greg !
> 
> Olivier
> 
>  drivers/net/can/usb/kvaser_usb.c |  110 ++++++++++++++++++++++++--------------
>  1 file changed, 69 insertions(+), 41 deletions(-)
> 
> diff --git a/drivers/net/can/usb/kvaser_usb.c b/drivers/net/can/usb/kvaser_usb.c
> index 8807bf8..7ac6e82 100644
> --- a/drivers/net/can/usb/kvaser_usb.c
> +++ b/drivers/net/can/usb/kvaser_usb.c
> @@ -421,14 +421,21 @@ end:
>  static int kvaser_usb_send_simple_msg(const struct kvaser_usb *dev,
>  				      u8 msg_id, int channel)
>  {
> -	struct kvaser_msg msg = {
> -		.len = MSG_HEADER_LEN + sizeof(struct kvaser_msg_simple),
> -		.id = msg_id,
> -		.u.simple.channel = channel,
> -		.u.simple.tid = 0xff,
> -	};
> -
> -	return kvaser_usb_send_msg(dev, &msg);
> +	struct kvaser_msg *msg;
> +	int rc;
> +
> +	msg = kmalloc(sizeof(*msg), GFP_KERNEL);
> +	if (!msg)
> +		return -ENOMEM;
> +

Doh! I removed by mistake the line "msg->id = msg_id"... grr

> +	msg->len = MSG_HEADER_LEN + sizeof(struct kvaser_msg_simple);
> +	msg->u.simple.channel = channel;
> +	msg->u.simple.tid = 0xff;
> +
> +	rc = kvaser_usb_send_msg(dev, msg);
> +
> +	kfree(msg);
> +	return rc;
>  }
>  
>  static int kvaser_usb_get_software_info(struct kvaser_usb *dev)
> @@ -1057,20 +1064,27 @@ static int kvaser_usb_setup_rx_urbs(struct kvaser_usb *dev)
>  
>  static int kvaser_usb_set_opt_mode(const struct kvaser_usb_net_priv *priv)
>  {
> -	struct kvaser_msg msg = {
> -		.id = CMD_SET_CTRL_MODE,
> -		.len = MSG_HEADER_LEN +
> -		       sizeof(struct kvaser_msg_ctrl_mode),
> -		.u.ctrl_mode.tid = 0xff,
> -		.u.ctrl_mode.channel = priv->channel,
> -	};
> +	struct kvaser_msg *msg;
> +	int rc;
> +
> +	msg = kmalloc(sizeof(*msg), GFP_KERNEL);
> +	if (!msg)
> +		return -ENOMEM;
> +
> +	msg->id = CMD_SET_CTRL_MODE;
> +	msg->len = MSG_HEADER_LEN + sizeof(struct kvaser_msg_ctrl_mode);
> +	msg->u.ctrl_mode.tid = 0xff;
> +	msg->u.ctrl_mode.channel = priv->channel;
>  
>  	if (priv->can.ctrlmode & CAN_CTRLMODE_LISTENONLY)
> -		msg.u.ctrl_mode.ctrl_mode = KVASER_CTRL_MODE_SILENT;
> +		msg->u.ctrl_mode.ctrl_mode = KVASER_CTRL_MODE_SILENT;
>  	else
> -		msg.u.ctrl_mode.ctrl_mode = KVASER_CTRL_MODE_NORMAL;
> +		msg->u.ctrl_mode.ctrl_mode = KVASER_CTRL_MODE_NORMAL;
> +
> +	rc = kvaser_usb_send_msg(priv->dev, msg);
>  
> -	return kvaser_usb_send_msg(priv->dev, &msg);
> +	kfree(msg);
> +	return rc;
>  }
>  
>  static int kvaser_usb_start_chip(struct kvaser_usb_net_priv *priv)
> @@ -1163,15 +1177,22 @@ static int kvaser_usb_stop_chip(struct kvaser_usb_net_priv *priv)
>  
>  static int kvaser_usb_flush_queue(struct kvaser_usb_net_priv *priv)
>  {
> -	struct kvaser_msg msg = {
> -		.id = CMD_FLUSH_QUEUE,
> -		.len = MSG_HEADER_LEN +
> -		       sizeof(struct kvaser_msg_flush_queue),
> -		.u.flush_queue.channel = priv->channel,
> -		.u.flush_queue.flags = 0x00,
> -	};
> -
> -	return kvaser_usb_send_msg(priv->dev, &msg);
> +	struct kvaser_msg *msg;
> +	int rc;
> +
> +	msg = kmalloc(sizeof(*msg), GFP_KERNEL);
> +	if (!msg)
> +		return -ENOMEM;
> +
> +	msg->id = CMD_FLUSH_QUEUE;
> +	msg->len = MSG_HEADER_LEN + sizeof(struct kvaser_msg_flush_queue);
> +	msg->u.flush_queue.channel = priv->channel;
> +	msg->u.flush_queue.flags = 0x00;
> +
> +	rc = kvaser_usb_send_msg(priv->dev, msg);
> +
> +	kfree(msg);
> +	return rc;
>  }
>  
>  static int kvaser_usb_close(struct net_device *netdev)
> @@ -1364,24 +1385,31 @@ static int kvaser_usb_set_bittiming(struct net_device *netdev)
>  	struct kvaser_usb_net_priv *priv = netdev_priv(netdev);
>  	struct can_bittiming *bt = &priv->can.bittiming;
>  	struct kvaser_usb *dev = priv->dev;
> -	struct kvaser_msg msg = {
> -		.id = CMD_SET_BUS_PARAMS,
> -		.len = MSG_HEADER_LEN +
> -		       sizeof(struct kvaser_msg_busparams),
> -		.u.busparams.channel = priv->channel,
> -		.u.busparams.tid = 0xff,
> -		.u.busparams.bitrate = cpu_to_le32(bt->bitrate),
> -		.u.busparams.sjw = bt->sjw,
> -		.u.busparams.tseg1 = bt->prop_seg + bt->phase_seg1,
> -		.u.busparams.tseg2 = bt->phase_seg2,
> -	};
> +	struct kvaser_msg *msg;
> +	int rc;
> +
> +	msg = kmalloc(sizeof(*msg), GFP_KERNEL);
> +	if (!msg)
> +		return -ENOMEM;
> +
> +	msg->id = CMD_SET_BUS_PARAMS;
> +	msg->len = MSG_HEADER_LEN + sizeof(struct kvaser_msg_busparams);
> +	msg->u.busparams.channel = priv->channel;
> +	msg->u.busparams.tid = 0xff;
> +	msg->u.busparams.bitrate = cpu_to_le32(bt->bitrate);
> +	msg->u.busparams.sjw = bt->sjw;
> +	msg->u.busparams.tseg1 = bt->prop_seg + bt->phase_seg1;
> +	msg->u.busparams.tseg2 = bt->phase_seg2;
>  
>  	if (priv->can.ctrlmode & CAN_CTRLMODE_3_SAMPLES)
> -		msg.u.busparams.no_samp = 3;
> +		msg->u.busparams.no_samp = 3;
>  	else
> -		msg.u.busparams.no_samp = 1;
> +		msg->u.busparams.no_samp = 1;
> +
> +	rc = kvaser_usb_send_msg(dev, msg);
>  
> -	return kvaser_usb_send_msg(dev, &msg);
> +	kfree(msg);
> +	return rc;
>  }
>  
>  static int kvaser_usb_set_mode(struct net_device *netdev,
> -- 
> 1.7.9.5
> 

-- 
Olivier

^ permalink raw reply

* Re: [PATCH] kvaser_usb: fix "dma on the stack" errors
From: Marc Kleine-Budde @ 2012-11-23 13:48 UTC (permalink / raw)
  To: Olivier Sobrie; +Cc: linux-can, Wolfgang Grandegge, netdev, linux-usb
In-Reply-To: <20121123134027.GA32098@hposo>

[-- Attachment #1: Type: text/plain, Size: 1725 bytes --]

On 11/23/2012 02:40 PM, Olivier Sobrie wrote:
> On Fri, Nov 23, 2012 at 02:30:28PM +0100, Olivier Sobrie wrote:
>> The dma buffer given to usb_bulk_msg() must be allocated and not on
>> the stack.
>> See Documentation/DMA-API-HOWTO.txt section "What memory is DMA'able?"
>>
>> Signed-off-by: Olivier Sobrie <olivier@sobrie.be>
>> ---
>> Here is the incremental patch.
>> Thank you Greg !
>>
>> Olivier
>>
>>  drivers/net/can/usb/kvaser_usb.c |  110 ++++++++++++++++++++++++--------------
>>  1 file changed, 69 insertions(+), 41 deletions(-)
>>
>> diff --git a/drivers/net/can/usb/kvaser_usb.c b/drivers/net/can/usb/kvaser_usb.c
>> index 8807bf8..7ac6e82 100644
>> --- a/drivers/net/can/usb/kvaser_usb.c
>> +++ b/drivers/net/can/usb/kvaser_usb.c
>> @@ -421,14 +421,21 @@ end:
>>  static int kvaser_usb_send_simple_msg(const struct kvaser_usb *dev,
>>  				      u8 msg_id, int channel)
>>  {
>> -	struct kvaser_msg msg = {
>> -		.len = MSG_HEADER_LEN + sizeof(struct kvaser_msg_simple),
>> -		.id = msg_id,
>> -		.u.simple.channel = channel,
>> -		.u.simple.tid = 0xff,
>> -	};
>> -
>> -	return kvaser_usb_send_msg(dev, &msg);
>> +	struct kvaser_msg *msg;
>> +	int rc;
>> +
>> +	msg = kmalloc(sizeof(*msg), GFP_KERNEL);
>> +	if (!msg)
>> +		return -ENOMEM;
>> +
> 
> Doh! I removed by mistake the line "msg->id = msg_id"... grr

Please send a v2 version of this patch with this problem fixed.

MarcMarc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]

^ permalink raw reply

* [PATCH] net/macb: GEM DMA configuration register update
From: Nicolas Ferre @ 2012-11-23 13:49 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: linux-arm-kernel, linux-kernel, Joachim Eastwood,
	Jean-Christophe PLAGNIOL-VILLARD, Nicolas Ferre

Add information to the DMA Configuration Register to
maximize system performance:
- rx/tx packet buffer full memory size
- allow possibility to use INCR16 if supported

Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
---
 drivers/net/ethernet/cadence/macb.c | 10 ++++++++--
 drivers/net/ethernet/cadence/macb.h | 11 +++++++++++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c
index cc6e593..6a59bce 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -1033,8 +1033,12 @@ static u32 macb_dbw(struct macb *bp)
 }
 
 /*
- * Configure the receive DMA engine to use the correct receive buffer size.
- * This is a configurable parameter for GEM.
+ * Configure the receive DMA engine
+ * - use the correct receive buffer size
+ * - set the possibility to use INCR16 bursts
+ *   (if not supported by FIFO, it will fallback to default)
+ * - set both rx/tx packet buffers to full memory size
+ * These are configurable parameters for GEM.
  */
 static void macb_configure_dma(struct macb *bp)
 {
@@ -1043,6 +1047,8 @@ static void macb_configure_dma(struct macb *bp)
 	if (macb_is_gem(bp)) {
 		dmacfg = gem_readl(bp, DMACFG) & ~GEM_BF(RXBS, -1L);
 		dmacfg |= GEM_BF(RXBS, RX_BUFFER_SIZE / 64);
+		dmacfg |= GEM_BF(FBLDO, 16);
+		dmacfg |= GEM_BIT(TXPBMS) | GEM_BF(RXBMS, -1L);
 		gem_writel(bp, DMACFG, dmacfg);
 	}
 }
diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index 4414421..570908b 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -171,8 +171,19 @@
 #define GEM_DBW128				2
 
 /* Bitfields in DMACFG. */
+#define GEM_FBLDO_OFFSET			0
+#define GEM_FBLDO_SIZE				5
+#define GEM_RXBMS_OFFSET			8
+#define GEM_RXBMS_SIZE				2
+#define GEM_TXPBMS_OFFSET			10
+#define GEM_TXPBMS_SIZE				1
+#define GEM_TXCOEN_OFFSET			11
+#define GEM_TXCOEN_SIZE				1
 #define GEM_RXBS_OFFSET				16
 #define GEM_RXBS_SIZE				8
+#define GEM_DDRP_OFFSET				24
+#define GEM_DDRP_SIZE				1
+
 
 /* Bitfields in NSR */
 #define MACB_NSR_LINK_OFFSET			0
-- 
1.8.0

^ permalink raw reply related

* [PATCH] net/macb: Use non-coherent memory for rx buffers
From: Nicolas Ferre @ 2012-11-23 13:50 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: linux-arm-kernel, linux-kernel, Joachim Eastwood,
	Jean-Christophe PLAGNIOL-VILLARD, Havard Skinnemoen,
	Nicolas Ferre

From: Havard Skinnemoen <havard@skinnemoen.net>

Allocate regular pages to use as backing for the RX ring and use the
DMA API to sync the caches. This should give a bit better performance
since it allows the CPU to do burst transfers from memory. It is also
a necessary step on the way to reduce the amount of copying done by
the driver.

Signed-off-by: Havard Skinnemoen <havard@skinnemoen.net>
[nicolas.ferre@atmel.com: adapt to newer kernel]
Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
---
 drivers/net/ethernet/cadence/macb.c | 206 +++++++++++++++++++++++-------------
 drivers/net/ethernet/cadence/macb.h |  20 +++-
 2 files changed, 148 insertions(+), 78 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c
index 6a59bce..c2955da 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -10,6 +10,7 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include <linux/clk.h>
+#include <linux/highmem.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/kernel.h>
@@ -35,6 +36,8 @@
 #define RX_BUFFER_SIZE		128
 #define RX_RING_SIZE		512 /* must be power of 2 */
 #define RX_RING_BYTES		(sizeof(struct macb_dma_desc) * RX_RING_SIZE)
+#define RX_BUFFERS_PER_PAGE	(PAGE_SIZE / RX_BUFFER_SIZE)
+#define RX_RING_PAGES		(RX_RING_SIZE / RX_BUFFERS_PER_PAGE)
 
 #define TX_RING_SIZE		128 /* must be power of 2 */
 #define TX_RING_BYTES		(sizeof(struct macb_dma_desc) * TX_RING_SIZE)
@@ -90,9 +93,16 @@ static struct macb_dma_desc *macb_rx_desc(struct macb *bp, unsigned int index)
 	return &bp->rx_ring[macb_rx_ring_wrap(index)];
 }
 
-static void *macb_rx_buffer(struct macb *bp, unsigned int index)
+static struct macb_rx_page *macb_rx_page(struct macb *bp, unsigned int index)
 {
-	return bp->rx_buffers + RX_BUFFER_SIZE * macb_rx_ring_wrap(index);
+	unsigned int entry = macb_rx_ring_wrap(index);
+
+	return &bp->rx_page[entry / RX_BUFFERS_PER_PAGE];
+}
+
+static unsigned int macb_rx_page_offset(struct macb *bp, unsigned int index)
+{
+	return (index % RX_BUFFERS_PER_PAGE) * RX_BUFFER_SIZE;
 }
 
 void macb_set_hwaddr(struct macb *bp)
@@ -528,11 +538,15 @@ static void macb_tx_interrupt(struct macb *bp)
 static int macb_rx_frame(struct macb *bp, unsigned int first_frag,
 			 unsigned int last_frag)
 {
-	unsigned int len;
-	unsigned int frag;
-	unsigned int offset;
-	struct sk_buff *skb;
-	struct macb_dma_desc *desc;
+	unsigned int		len;
+	unsigned int		frag;
+	unsigned int		skb_offset;
+	unsigned int		pg_offset;
+	struct macb_rx_page	*rx_page;
+	dma_addr_t		phys;
+	void			*buf;
+	struct sk_buff		*skb;
+	struct macb_dma_desc	*desc;
 
 	desc = macb_rx_desc(bp, last_frag);
 	len = MACB_BFEXT(RX_FRMLEN, desc->ctrl);
@@ -566,7 +580,7 @@ static int macb_rx_frame(struct macb *bp, unsigned int first_frag,
 		return 1;
 	}
 
-	offset = 0;
+	skb_offset = 0;
 	len += NET_IP_ALIGN;
 	skb_checksum_none_assert(skb);
 	skb_put(skb, len);
@@ -574,13 +588,28 @@ static int macb_rx_frame(struct macb *bp, unsigned int first_frag,
 	for (frag = first_frag; ; frag++) {
 		unsigned int frag_len = RX_BUFFER_SIZE;
 
-		if (offset + frag_len > len) {
+		if (skb_offset + frag_len > len) {
 			BUG_ON(frag != last_frag);
-			frag_len = len - offset;
+			frag_len = len - skb_offset;
 		}
-		skb_copy_to_linear_data_offset(skb, offset,
-				macb_rx_buffer(bp, frag), frag_len);
-		offset += RX_BUFFER_SIZE;
+
+		rx_page = macb_rx_page(bp, frag);
+		pg_offset = macb_rx_page_offset(bp, frag);
+		phys = rx_page->phys;
+
+		dma_sync_single_range_for_cpu(&bp->pdev->dev, phys,
+				pg_offset, frag_len, DMA_FROM_DEVICE);
+
+		buf = kmap_atomic(rx_page->page);
+		skb_copy_to_linear_data_offset(skb, skb_offset,
+				buf + pg_offset, frag_len);
+		kunmap_atomic(buf);
+
+		skb_offset += frag_len;
+
+		dma_sync_single_range_for_device(&bp->pdev->dev, phys,
+				pg_offset, frag_len, DMA_FROM_DEVICE);
+
 		desc = macb_rx_desc(bp, frag);
 		desc->addr &= ~MACB_BIT(RX_USED);
 
@@ -860,86 +889,90 @@ static int macb_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	return NETDEV_TX_OK;
 }
 
-static void macb_free_consistent(struct macb *bp)
+static void macb_free_rings(struct macb *bp)
 {
-	if (bp->tx_skb) {
-		kfree(bp->tx_skb);
-		bp->tx_skb = NULL;
-	}
-	if (bp->rx_ring) {
-		dma_free_coherent(&bp->pdev->dev, RX_RING_BYTES,
-				  bp->rx_ring, bp->rx_ring_dma);
-		bp->rx_ring = NULL;
-	}
-	if (bp->tx_ring) {
-		dma_free_coherent(&bp->pdev->dev, TX_RING_BYTES,
-				  bp->tx_ring, bp->tx_ring_dma);
-		bp->tx_ring = NULL;
-	}
-	if (bp->rx_buffers) {
-		dma_free_coherent(&bp->pdev->dev,
-				  RX_RING_SIZE * RX_BUFFER_SIZE,
-				  bp->rx_buffers, bp->rx_buffers_dma);
-		bp->rx_buffers = NULL;
+	int i;
+
+	for (i = 0; i < RX_RING_PAGES; i++) {
+		struct macb_rx_page *rx_page = &bp->rx_page[i];
+
+		if (!rx_page->page)
+			continue;
+
+		dma_unmap_page(&bp->pdev->dev, rx_page->phys,
+			       PAGE_SIZE, DMA_FROM_DEVICE);
+		put_page(rx_page->page);
+		rx_page->page = NULL;
 	}
+
+	kfree(bp->tx_skb);
+	kfree(bp->rx_page);
+	dma_free_coherent(&bp->pdev->dev, TX_RING_BYTES, bp->tx_ring,
+			  bp->tx_ring_dma);
+	dma_free_coherent(&bp->pdev->dev, RX_RING_BYTES, bp->rx_ring,
+			  bp->rx_ring_dma);
 }
 
-static int macb_alloc_consistent(struct macb *bp)
+static int macb_init_rings(struct macb *bp)
 {
-	int size;
+	struct page	*page;
+	dma_addr_t	phys;
+	unsigned int	page_idx;
+	unsigned int	ring_idx;
+	unsigned int	i;
 
-	size = TX_RING_SIZE * sizeof(struct macb_tx_skb);
-	bp->tx_skb = kmalloc(size, GFP_KERNEL);
-	if (!bp->tx_skb)
-		goto out_err;
-
-	size = RX_RING_BYTES;
-	bp->rx_ring = dma_alloc_coherent(&bp->pdev->dev, size,
+	bp->rx_ring = dma_alloc_coherent(&bp->pdev->dev, RX_RING_BYTES,
 					 &bp->rx_ring_dma, GFP_KERNEL);
 	if (!bp->rx_ring)
-		goto out_err;
+		goto err_alloc_rx_ring;
+
 	netdev_dbg(bp->dev,
 		   "Allocated RX ring of %d bytes at %08lx (mapped %p)\n",
-		   size, (unsigned long)bp->rx_ring_dma, bp->rx_ring);
+		   RX_RING_BYTES, (unsigned long)bp->rx_ring_dma, bp->rx_ring);
 
-	size = TX_RING_BYTES;
-	bp->tx_ring = dma_alloc_coherent(&bp->pdev->dev, size,
+	bp->tx_ring = dma_alloc_coherent(&bp->pdev->dev, TX_RING_BYTES,
 					 &bp->tx_ring_dma, GFP_KERNEL);
 	if (!bp->tx_ring)
-		goto out_err;
-	netdev_dbg(bp->dev,
-		   "Allocated TX ring of %d bytes at %08lx (mapped %p)\n",
-		   size, (unsigned long)bp->tx_ring_dma, bp->tx_ring);
-
-	size = RX_RING_SIZE * RX_BUFFER_SIZE;
-	bp->rx_buffers = dma_alloc_coherent(&bp->pdev->dev, size,
-					    &bp->rx_buffers_dma, GFP_KERNEL);
-	if (!bp->rx_buffers)
-		goto out_err;
+		goto err_alloc_tx_ring;
+
 	netdev_dbg(bp->dev,
-		   "Allocated RX buffers of %d bytes at %08lx (mapped %p)\n",
-		   size, (unsigned long)bp->rx_buffers_dma, bp->rx_buffers);
+		   "Allocated TX ring of %d bytes at 0x%08lx (mapped %p)\n",
+		   TX_RING_BYTES, (unsigned long)bp->tx_ring_dma, bp->tx_ring);
 
-	return 0;
+	bp->rx_page = kcalloc(RX_RING_PAGES, sizeof(struct macb_rx_page),
+			      GFP_KERNEL);
+	if (!bp->rx_page)
+		goto err_alloc_rx_page;
 
-out_err:
-	macb_free_consistent(bp);
-	return -ENOMEM;
-}
+	bp->tx_skb = kcalloc(TX_RING_SIZE, sizeof(struct macb_tx_skb),
+			     GFP_KERNEL);
+	if (!bp->tx_skb)
+		goto err_alloc_tx_skb;
 
-static void macb_init_rings(struct macb *bp)
-{
-	int i;
-	dma_addr_t addr;
+	for (page_idx = 0, ring_idx = 0; page_idx < RX_RING_PAGES; page_idx++) {
+		page = alloc_page(GFP_KERNEL);
+		if (!page)
+			goto err_alloc_page;
+
+		phys = dma_map_page(&bp->pdev->dev, page, 0, PAGE_SIZE,
+				    DMA_FROM_DEVICE);
+		if (dma_mapping_error(&bp->pdev->dev, phys))
+			goto err_map_page;
+
+		bp->rx_page[page_idx].page = page;
+		bp->rx_page[page_idx].phys = phys;
 
-	addr = bp->rx_buffers_dma;
-	for (i = 0; i < RX_RING_SIZE; i++) {
-		bp->rx_ring[i].addr = addr;
-		bp->rx_ring[i].ctrl = 0;
-		addr += RX_BUFFER_SIZE;
+		for (i = 0; i < RX_BUFFERS_PER_PAGE; i++, ring_idx++) {
+			bp->rx_ring[ring_idx].addr = phys;
+			bp->rx_ring[ring_idx].ctrl = 0;
+			phys += RX_BUFFER_SIZE;
+		}
 	}
 	bp->rx_ring[RX_RING_SIZE - 1].addr |= MACB_BIT(RX_WRAP);
 
+	netdev_dbg(bp->dev, "Allocated %u RX buffers (%lu pages)\n",
+		   RX_RING_SIZE, RX_RING_PAGES);
+
 	for (i = 0; i < TX_RING_SIZE; i++) {
 		bp->tx_ring[i].addr = 0;
 		bp->tx_ring[i].ctrl = MACB_BIT(TX_USED);
@@ -947,6 +980,28 @@ static void macb_init_rings(struct macb *bp)
 	bp->tx_ring[TX_RING_SIZE - 1].ctrl |= MACB_BIT(TX_WRAP);
 
 	bp->rx_tail = bp->tx_head = bp->tx_tail = 0;
+
+	return 0;
+
+err_map_page:
+	__free_page(page);
+err_alloc_page:
+	while (page_idx--) {
+		dma_unmap_page(&bp->pdev->dev, bp->rx_page[page_idx].phys,
+			       PAGE_SIZE, DMA_FROM_DEVICE);
+		__free_page(bp->rx_page[page_idx].page);
+	}
+	kfree(bp->tx_skb);
+err_alloc_tx_skb:
+	kfree(bp->rx_page);
+err_alloc_rx_page:
+	dma_free_coherent(&bp->pdev->dev, TX_RING_BYTES, bp->tx_ring,
+			  bp->tx_ring_dma);
+err_alloc_tx_ring:
+	dma_free_coherent(&bp->pdev->dev, RX_RING_BYTES, bp->rx_ring,
+			  bp->rx_ring_dma);
+err_alloc_rx_ring:
+	return -ENOMEM;
 }
 
 static void macb_reset_hw(struct macb *bp)
@@ -1221,16 +1276,15 @@ static int macb_open(struct net_device *dev)
 	if (!bp->phy_dev)
 		return -EAGAIN;
 
-	err = macb_alloc_consistent(bp);
+	err = macb_init_rings(bp);
 	if (err) {
-		netdev_err(dev, "Unable to allocate DMA memory (error %d)\n",
+		netdev_err(dev, "Unable to allocate DMA rings (error %d)\n",
 			   err);
 		return err;
 	}
 
 	napi_enable(&bp->napi);
 
-	macb_init_rings(bp);
 	macb_init_hw(bp);
 
 	/* schedule a link state check */
@@ -1257,7 +1311,7 @@ static int macb_close(struct net_device *dev)
 	netif_carrier_off(dev);
 	spin_unlock_irqrestore(&bp->lock, flags);
 
-	macb_free_consistent(bp);
+	macb_free_rings(bp);
 
 	return 0;
 }
diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index 570908b..74e68a3 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -453,6 +453,23 @@ struct macb_dma_desc {
 #define MACB_TX_USED_SIZE			1
 
 /**
+ * struct macb_rx_page - data associated with a page used as RX buffers
+ * @page: Physical page used as storage for the buffers
+ * @phys: DMA address of the page
+ *
+ * Each page is used to provide %MACB_RX_BUFFERS_PER_PAGE RX buffers.
+ * The page gets an initial reference when it is inserted into the
+ * ring, and an additional reference each time it is passed up the
+ * stack as a fragment. When all the buffers have been used, we drop
+ * the initial reference and allocate a new page. Any additional
+ * references are dropped when the higher layers free the skb.
+ */
+struct macb_rx_page {
+	struct page		*page;
+	dma_addr_t		phys;
+};
+
+/**
  * struct macb_tx_skb - data about an skb which is being transmitted
  * @skb: skb currently being transmitted
  * @mapping: DMA address of the skb's data buffer
@@ -543,7 +560,7 @@ struct macb {
 
 	unsigned int		rx_tail;
 	struct macb_dma_desc	*rx_ring;
-	void			*rx_buffers;
+	struct macb_rx_page	*rx_page;
 
 	unsigned int		tx_head, tx_tail;
 	struct macb_dma_desc	*tx_ring;
@@ -564,7 +581,6 @@ struct macb {
 
 	dma_addr_t		rx_ring_dma;
 	dma_addr_t		tx_ring_dma;
-	dma_addr_t		rx_buffers_dma;
 
 	struct mii_bus		*mii_bus;
 	struct phy_device	*phy_dev;
-- 
1.8.0

^ permalink raw reply related

* [PATCH v2] kvaser_usb: fix "dma on the stack" errors
From: Olivier Sobrie @ 2012-11-23 13:54 UTC (permalink / raw)
  To: Marc Kleine-Budde, linux-can
  Cc: Greg KH, Wolfgang Grandegge, netdev, linux-usb, Daniel Berglund,
	Olivier Sobrie
In-Reply-To: <20121123134027.GA32098@hposo>

The dma buffer given to usb_bulk_msg() must be allocated and not on
the stack.
See Documentation/DMA-API-HOWTO.txt section "What memory is DMA'able?"

Signed-off-by: Olivier Sobrie <olivier@sobrie.be>
---
 drivers/net/can/usb/kvaser_usb.c |  111 ++++++++++++++++++++++++--------------
 1 file changed, 70 insertions(+), 41 deletions(-)

diff --git a/drivers/net/can/usb/kvaser_usb.c b/drivers/net/can/usb/kvaser_usb.c
index 8807bf8..1b159e7 100644
--- a/drivers/net/can/usb/kvaser_usb.c
+++ b/drivers/net/can/usb/kvaser_usb.c
@@ -421,14 +421,22 @@ end:
 static int kvaser_usb_send_simple_msg(const struct kvaser_usb *dev,
 				      u8 msg_id, int channel)
 {
-	struct kvaser_msg msg = {
-		.len = MSG_HEADER_LEN + sizeof(struct kvaser_msg_simple),
-		.id = msg_id,
-		.u.simple.channel = channel,
-		.u.simple.tid = 0xff,
-	};
-
-	return kvaser_usb_send_msg(dev, &msg);
+	struct kvaser_msg *msg;
+	int rc;
+
+	msg = kmalloc(sizeof(*msg), GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	msg->id = msg_id;
+	msg->len = MSG_HEADER_LEN + sizeof(struct kvaser_msg_simple);
+	msg->u.simple.channel = channel;
+	msg->u.simple.tid = 0xff;
+
+	rc = kvaser_usb_send_msg(dev, msg);
+
+	kfree(msg);
+	return rc;
 }
 
 static int kvaser_usb_get_software_info(struct kvaser_usb *dev)
@@ -1057,20 +1065,27 @@ static int kvaser_usb_setup_rx_urbs(struct kvaser_usb *dev)
 
 static int kvaser_usb_set_opt_mode(const struct kvaser_usb_net_priv *priv)
 {
-	struct kvaser_msg msg = {
-		.id = CMD_SET_CTRL_MODE,
-		.len = MSG_HEADER_LEN +
-		       sizeof(struct kvaser_msg_ctrl_mode),
-		.u.ctrl_mode.tid = 0xff,
-		.u.ctrl_mode.channel = priv->channel,
-	};
+	struct kvaser_msg *msg;
+	int rc;
+
+	msg = kmalloc(sizeof(*msg), GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	msg->id = CMD_SET_CTRL_MODE;
+	msg->len = MSG_HEADER_LEN + sizeof(struct kvaser_msg_ctrl_mode);
+	msg->u.ctrl_mode.tid = 0xff;
+	msg->u.ctrl_mode.channel = priv->channel;
 
 	if (priv->can.ctrlmode & CAN_CTRLMODE_LISTENONLY)
-		msg.u.ctrl_mode.ctrl_mode = KVASER_CTRL_MODE_SILENT;
+		msg->u.ctrl_mode.ctrl_mode = KVASER_CTRL_MODE_SILENT;
 	else
-		msg.u.ctrl_mode.ctrl_mode = KVASER_CTRL_MODE_NORMAL;
+		msg->u.ctrl_mode.ctrl_mode = KVASER_CTRL_MODE_NORMAL;
 
-	return kvaser_usb_send_msg(priv->dev, &msg);
+	rc = kvaser_usb_send_msg(priv->dev, msg);
+
+	kfree(msg);
+	return rc;
 }
 
 static int kvaser_usb_start_chip(struct kvaser_usb_net_priv *priv)
@@ -1163,15 +1178,22 @@ static int kvaser_usb_stop_chip(struct kvaser_usb_net_priv *priv)
 
 static int kvaser_usb_flush_queue(struct kvaser_usb_net_priv *priv)
 {
-	struct kvaser_msg msg = {
-		.id = CMD_FLUSH_QUEUE,
-		.len = MSG_HEADER_LEN +
-		       sizeof(struct kvaser_msg_flush_queue),
-		.u.flush_queue.channel = priv->channel,
-		.u.flush_queue.flags = 0x00,
-	};
-
-	return kvaser_usb_send_msg(priv->dev, &msg);
+	struct kvaser_msg *msg;
+	int rc;
+
+	msg = kmalloc(sizeof(*msg), GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	msg->id = CMD_FLUSH_QUEUE;
+	msg->len = MSG_HEADER_LEN + sizeof(struct kvaser_msg_flush_queue);
+	msg->u.flush_queue.channel = priv->channel;
+	msg->u.flush_queue.flags = 0x00;
+
+	rc = kvaser_usb_send_msg(priv->dev, msg);
+
+	kfree(msg);
+	return rc;
 }
 
 static int kvaser_usb_close(struct net_device *netdev)
@@ -1364,24 +1386,31 @@ static int kvaser_usb_set_bittiming(struct net_device *netdev)
 	struct kvaser_usb_net_priv *priv = netdev_priv(netdev);
 	struct can_bittiming *bt = &priv->can.bittiming;
 	struct kvaser_usb *dev = priv->dev;
-	struct kvaser_msg msg = {
-		.id = CMD_SET_BUS_PARAMS,
-		.len = MSG_HEADER_LEN +
-		       sizeof(struct kvaser_msg_busparams),
-		.u.busparams.channel = priv->channel,
-		.u.busparams.tid = 0xff,
-		.u.busparams.bitrate = cpu_to_le32(bt->bitrate),
-		.u.busparams.sjw = bt->sjw,
-		.u.busparams.tseg1 = bt->prop_seg + bt->phase_seg1,
-		.u.busparams.tseg2 = bt->phase_seg2,
-	};
+	struct kvaser_msg *msg;
+	int rc;
+
+	msg = kmalloc(sizeof(*msg), GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	msg->id = CMD_SET_BUS_PARAMS;
+	msg->len = MSG_HEADER_LEN + sizeof(struct kvaser_msg_busparams);
+	msg->u.busparams.channel = priv->channel;
+	msg->u.busparams.tid = 0xff;
+	msg->u.busparams.bitrate = cpu_to_le32(bt->bitrate);
+	msg->u.busparams.sjw = bt->sjw;
+	msg->u.busparams.tseg1 = bt->prop_seg + bt->phase_seg1;
+	msg->u.busparams.tseg2 = bt->phase_seg2;
 
 	if (priv->can.ctrlmode & CAN_CTRLMODE_3_SAMPLES)
-		msg.u.busparams.no_samp = 3;
+		msg->u.busparams.no_samp = 3;
 	else
-		msg.u.busparams.no_samp = 1;
+		msg->u.busparams.no_samp = 1;
+
+	rc = kvaser_usb_send_msg(dev, msg);
 
-	return kvaser_usb_send_msg(dev, &msg);
+	kfree(msg);
+	return rc;
 }
 
 static int kvaser_usb_set_mode(struct net_device *netdev,
-- 
1.7.9.5

^ permalink raw reply related

* [net-next patch] tun: change tun_get_iff() prototype.
From: Rami Rosen @ 2012-11-23 13:58 UTC (permalink / raw)
  To: davem; +Cc: maxk, netdev, Rami Rosen

This patch changes tun_get_iff() prototype to return void as it never fails.

Signed-off-by: Rami Rosen <ramirose@gmail.com>
---
 drivers/net/tun.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index b01e8c0..3bd9932 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1662,7 +1662,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 	return err;
 }
 
-static int tun_get_iff(struct net *net, struct tun_struct *tun,
+static void tun_get_iff(struct net *net, struct tun_struct *tun,
 		       struct ifreq *ifr)
 {
 	tun_debug(KERN_INFO, tun, "tun_get_iff\n");
@@ -1671,7 +1671,6 @@ static int tun_get_iff(struct net *net, struct tun_struct *tun,
 
 	ifr->ifr_flags = tun_flags(tun);
 
-	return 0;
 }
 
 /* This is like a cut-down ethtool ops, except done via tun fd so no
@@ -1847,9 +1846,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 	ret = 0;
 	switch (cmd) {
 	case TUNGETIFF:
-		ret = tun_get_iff(current->nsproxy->net_ns, tun, &ifr);
-		if (ret)
-			break;
+		tun_get_iff(current->nsproxy->net_ns, tun, &ifr);
 
 		if (copy_to_user(argp, &ifr, ifreq_len))
 			ret = -EFAULT;
-- 
1.7.11.7

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox