Netdev List
 help / color / mirror / Atom feed
* Re: genetlink misinterprets NEW as GET
From: Ben Pfaff @ 2011-01-06 17:23 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Jan Engelhardt, Netfilter Developer Mailing List,
	Linux Networking Developer Mailing List
In-Reply-To: <4D25C82F.4010306@netfilter.org>

Pablo Neira Ayuso <pablo@netfilter.org> writes:

> On 04/01/11 03:14, Jan Engelhardt wrote:
>> 	/* Modifiers to GET request */
>> 	#define NLM_F_ROOT      0x100
>> 	#define NLM_F_MATCH     0x200
>> 	#define NLM_F_ATOMIC    0x400
>> 	#define NLM_F_DUMP      (NLM_F_ROOT|NLM_F_MATCH)
[...]
>> [N.B.: I am also wondering whether
>> 	(nlh->nlmsg_flags & NLM_F_DUMP) == NLM_F_DUMP
>> may have been desired, because NLM_F_DUMP is composed of two bits.]
>
> Someone may include NLM_F_ATOMIC to a dump operation, in that case the
> checking that you propose is not valid.

Are you saying that NLM_F_MATCH and NLM_F_ATOMIC are mutually
exclusive, and that NLM_F_ROOT|NLM_F_ATOMIC would also signal a
dump operation?  Otherwise the test that Jan proposes looks valid
to me.

^ permalink raw reply

* Re: Use ioctl() to list all network interfaces.
From: Alexander Clouter @ 2011-01-06 17:07 UTC (permalink / raw)
  To: netdev
In-Reply-To: <AANLkTi=Xb4TuKcCd-DgN5ZnYzRHV1RjDScyjRkqyb4P+@mail.gmail.com>

Chin Shi Hong <cshong87@gmail.com> wrote:
>
> The following codes are just part of my application's source codes:
> 
> --begin of some codes--
> ifconf ifc;
> ret = ioctl(n, SIOCGIFCONF,&ifc);
> --End of some codes--
> 
> My application's role is to display all network interfaces, including
> the network interface which are down.
> 
> However, with the above codes, my application only display the network
> interface which are already up. The network interface which are down
> will not be displayed.
> 
> So, what can I do to make my application display all network
> interfaces, including the network interface which are down?
>
getifaddrs(), you get the IPv6 ones too this way.

Cheers

-- 
Alexander Clouter
.sigmonster says: I've only got 12 cards.


^ permalink raw reply

* Re: POLLPRI/poll() behavior change since 2.6.31
From: Eric Dumazet @ 2011-01-06 16:55 UTC (permalink / raw)
  To: Leonardo Chiquitto; +Cc: netdev, David S. Miller
In-Reply-To: <20110106155040.GA27769@libre.l.ngdn.org>

Le jeudi 06 janvier 2011 à 13:50 -0200, Leonardo Chiquitto a écrit :
> Hello,
> 
> Since 2.6.31, poll() no longer returns when waiting exclusively for a
> POLLPRI event. If we wait for POLLPRI | POLLIN, though, it will
> correctly return POLLPRI as a received event.
> 
> The reproducer (code below) will print the following when running on
> 2.6.30:
> 
> $ ./pollpri-oob 
> main: starting
> main: setup_pipe ok (sfd[0] = 5, sfd[1] = 4)
> parent: child <pid 3790> started
> child: polling...
> parent: sending the message
> parent: waiting for child to exit
> child: poll(): some data <1,2> has arrived!
> child: done
> parent: done
> 
> ... and will block when running on 2.6.37-rc7:
> 
> $ ./pollpri-oob 
> main: starting
> main: setup_pipe ok (sfd[0] = 5, sfd[1] = 4)
> parent: child <pid 14148> started
> child: polling...
> parent: sending the message
> parent: waiting for child to exit
> [hangs here]
> 
> I've bisected this behavior change to the following commit:
> 
> commit 4938d7e0233a455f04507bac81d0886c71529537
> Author: Eric Dumazet <dada1@cosmosbay.com>
> Date:   Tue Jun 16 15:33:36 2009 -0700
> 
>   poll: avoid extra wakeups in select/poll
> 
>   After introduction of keyed wakeups Davide Libenzi did on epoll, we are
>   able to avoid spurious wakeups in poll()/select() code too.
> 
>   For example, typical use of poll()/select() is to wait for incoming
>   network frames on many sockets.  But TX completion for UDP/TCP frames call
>   sock_wfree() which in turn schedules thread.
> 
>   When scheduled, thread does a full scan of all polled fds and can sleep
>   again, because nothing is really available.  If number of fds is large,
>   this cause significant load.
> 
>   This patch makes select()/poll() aware of keyed wakeups and useless
>   wakeups are avoided.  This reduces number of context switches by about 50%
>   on some setups, and work performed by sofirq handlers.
> 
> 
> I don't know if the new behavior is correct, but we've got one report of
> an application that broke due to the change.

Hi Leonardo

Hmm, this is because 	sock_def_readable() uses :

wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLRDNORM |
POLLRDBAND);

So POLLPRI bit is not signaled. 

I would just add POLLPRI flag in sock_def_readable()

(Alternatively, define a tcp_def_readable() function to pass POLLPRI
only if TCP_URG is set, but is it worth the pain for a seldom used
feature ?)

David, do you have an opinion on this ?

Thanks

diff --git a/net/core/sock.c b/net/core/sock.c
index e5af8d5..7fd3541 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1907,7 +1907,7 @@ static void sock_def_readable(struct sock *sk, int len)
 	rcu_read_lock();
 	wq = rcu_dereference(sk->sk_wq);
 	if (wq_has_sleeper(wq))
-		wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
+		wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
 						POLLRDNORM | POLLRDBAND);
 	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
 	rcu_read_unlock();



^ permalink raw reply related

* Re: Bad TCP timestamps on non-PC platforms
From: David Miller @ 2011-01-06 16:44 UTC (permalink / raw)
  To: oakad; +Cc: netdev
In-Reply-To: <219643.83181.qm@web37604.mail.mud.yahoo.com>

From: Alex Dubov <oakad@yahoo.com>
Date: Wed, 5 Jan 2011 22:55:43 -0800 (PST)

> On embedded platforms, where real time clock is initialized lately or
> absent outright, this causes TSval field of outgoing TCP packets to be
> set to some garbage value, in my case in the vicinity of 0xffffffff.

The TCP timestamp is set from "jiffies", not from the RTC clock.

^ permalink raw reply

* Re: [PATCH net-next-2.6 v2 1/2] can: add driver for Softing card
From: Wolfgang Grandegger @ 2011-01-06 16:27 UTC (permalink / raw)
  To: Kurt Van Dijck
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20110106150525.GB324-MxZ6Iy/zr/UdbCeoMzGj59i2O/JbrIOy@public.gmane.org>

Hi Kurt,

On 01/06/2011 04:05 PM, Kurt Van Dijck wrote:
> Wolfgang,
> 
> On Wed, Jan 05, 2011 at 09:57:16PM +0100, Wolfgang Grandegger wrote:
>>
>> here comes my review... First some general remarks. As Mark already
>> pointed out, there are still some coding style issues:
> Oops, I tried to eliminate those.
>>
>> - Please use the following style for multi line comments:
> shame on me. I should have done them all after Mark pointed me to it.
>>
>> - Please avoid alignment of expressions and structure members.
> I see:
> diff --git a/include/linux/can.h b/include/linux/can.h
> index d183333..6b1e5a6 100644
> --- a/include/linux/can.h
> +++ b/include/linux/can.h
> @@ -56,18 +56,18 @@ typedef __u32 can_err_mask_t;
>   */
>  struct can_frame {
>  	canid_t can_id;  /* 32 bit CAN_ID + EFF/RTR/ERR flags */
> -	__u8    can_dlc; /* data length code: 0 .. 8 */
> -	__u8    data[8] __attribute__((aligned(8)));
> +	__u8 can_dlc; /* data length code: 0 .. 8 */
> +	__u8 data[8] __attribute__((aligned(8)));
>  }  
>  /* particular protocols of the protocol family PF_CAN */
> -#define CAN_RAW		1 /* RAW sockets */
> -#define CAN_BCM		2 /* Broadcast Manager */
> -#define CAN_TP16	3 /* VAG Transport Protocol v1.6 */
> -#define CAN_TP20	4 /* VAG Transport Protocol v2.0 */
> -#define CAN_MCNET	5 /* Bosch MCNet */
> -#define CAN_ISOTP	6 /* ISO 15765-2 Transport Protocol */
> -#define CAN_NPROTO	7
> +#define CAN_RAW 1 /* RAW sockets */
> +#define CAN_BCM 2 /* Broadcast Manager */
> +#define CAN_TP16 3 /* VAG Transport Protocol v1.6 */
> +#define CAN_TP20 4 /* VAG Transport Protocol v2.0 */
> +#define CAN_MCNET 5 /* Bosch MCNet */
> +#define CAN_ISOTP 6 /* ISO 15765-2 Transport Protocol */
> +#define CAN_NPROTO 7
>  
>  #define SOL_CAN_BASE 100
>  
> @@ -79,7 +79,7 @@ struct can_frame {
>   */
>  struct sockaddr_can {
>  	sa_family_t can_family;
> -	int         can_ifindex;
> +	int can_ifindex;
>  	union {
>  		/* transport protocol class address information (e.g. ISOTP) */
>  		struct { canid_t rx_id, tx_id; } tp;

That's not my code ;-)

> I applied a search pattern on this, since I seem incapable of finding
> alignment problems in my own code :-).
> I assume alignment is ok for definitions, but not within functions?

You mean s/functions/structures/. Yes, that's what most people prefer, I
think.

> I consulted the Documentation/Coding-style, but I did not find
> the exact answer.

AFAIK, there is just one coding style rule about alignment:

http://lxr.linux.no/#linux+v2.6.37/Documentation/CodingStyle#L208

So feel free to choose the style you like but use if consequently.
I complain because I'm browsing the code anyway. It's definitely not
something worth to reject the patches.

Wolfgang.

^ permalink raw reply

* Re: genetlink misinterprets NEW as GET
From: Jan Engelhardt @ 2011-01-06 16:12 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Netfilter Developer Mailing List,
	Linux Networking Developer Mailing List
In-Reply-To: <4D25D7CD.6070006@netfilter.org>

On Thursday 2011-01-06 15:55, Pablo Neira Ayuso wrote:

>On 06/01/11 15:25, Jan Engelhardt wrote:
>> On Thursday 2011-01-06 14:48, Pablo Neira Ayuso wrote:
>>>>
>>>> 	/* Modifiers to GET request */
>>>> 	#define NLM_F_ROOT      0x100
>>>> 	#define NLM_F_MATCH     0x200
>>>> 	#define NLM_F_ATOMIC    0x400
>>>> 	#define NLM_F_DUMP      (NLM_F_ROOT|NLM_F_MATCH)
>>>> 	
>>>> 	/* Modifiers to NEW request */
>>>> 	#define NLM_F_REPLACE   0x100
>>>> 	#define NLM_F_EXCL      0x200
>>>> 	#define NLM_F_CREATE    0x400
>>>> 	#define NLM_F_APPEND    0x800
>>>>
>
>i getting confused, so ipset is also setting NLM_F_REPLACE to match the
>NLM_F_DUMP bitmask?

Any userspace program sending a (ge)netlink message with
NLM_F_CREATE|NLM_F_EXCL -- with the intent of creating an entry with
excl semantics --, will be misunderstood by genetlink.c to be a dump
request.

The problem is of general nature and not limited to ipset. I only
noticed it while making the ipset-genl patch, because ipset sends all
IPSET_CMD_CREATE requests with
NLM_F_REQUEST|NLM_F_ACK|NLM_F_CREATE|NLM_F_EXCL (see
ipset/lib/mnl.c).

^ permalink raw reply

* Re: Use ioctl() to list all network interfaces.
From: Rémi Denis-Courmont @ 2011-01-06 16:11 UTC (permalink / raw)
  To: Chin Shi Hong; +Cc: netdev
In-Reply-To: <AANLkTi=Xb4TuKcCd-DgN5ZnYzRHV1RjDScyjRkqyb4P+@mail.gmail.com>

Le jeudi 6 janvier 2011 17:33:39 Chin Shi Hong, vous avez écrit :
> My application's role is to display all network interfaces, including
> the network interface which are down.

You should use if_nameindex() then.

-- 
Rémi Denis-Courmont
http://www.remlab.net/
http://fi.linkedin.com/in/remidenis

^ permalink raw reply

* POLLPRI/poll() behavior change since 2.6.31
From: Leonardo Chiquitto @ 2011-01-06 15:50 UTC (permalink / raw)
  To: netdev; +Cc: Eric Dumazet, David S. Miller

Hello,

Since 2.6.31, poll() no longer returns when waiting exclusively for a
POLLPRI event. If we wait for POLLPRI | POLLIN, though, it will
correctly return POLLPRI as a received event.

The reproducer (code below) will print the following when running on
2.6.30:

$ ./pollpri-oob 
main: starting
main: setup_pipe ok (sfd[0] = 5, sfd[1] = 4)
parent: child <pid 3790> started
child: polling...
parent: sending the message
parent: waiting for child to exit
child: poll(): some data <1,2> has arrived!
child: done
parent: done

... and will block when running on 2.6.37-rc7:

$ ./pollpri-oob 
main: starting
main: setup_pipe ok (sfd[0] = 5, sfd[1] = 4)
parent: child <pid 14148> started
child: polling...
parent: sending the message
parent: waiting for child to exit
[hangs here]

I've bisected this behavior change to the following commit:

commit 4938d7e0233a455f04507bac81d0886c71529537
Author: Eric Dumazet <dada1@cosmosbay.com>
Date:   Tue Jun 16 15:33:36 2009 -0700

  poll: avoid extra wakeups in select/poll

  After introduction of keyed wakeups Davide Libenzi did on epoll, we are
  able to avoid spurious wakeups in poll()/select() code too.

  For example, typical use of poll()/select() is to wait for incoming
  network frames on many sockets.  But TX completion for UDP/TCP frames call
  sock_wfree() which in turn schedules thread.

  When scheduled, thread does a full scan of all polled fds and can sleep
  again, because nothing is really available.  If number of fds is large,
  this cause significant load.

  This patch makes select()/poll() aware of keyed wakeups and useless
  wakeups are avoided.  This reduces number of context switches by about 50%
  on some setups, and work performed by sofirq handlers.


I don't know if the new behavior is correct, but we've got one report of
an application that broke due to the change.

Thanks,
Leonardo

#define _BSD_SOURCE
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <sys/wait.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>

int setup_pipe(int f[2])
{
	int ret, server, client, client_ofl;
	struct sockaddr_in own_sa;
	struct sockaddr a_sa;
	socklen_t a_len;

	/* server side */
	if ((server = socket(PF_INET, SOCK_STREAM, 0)) < 0)
		return -1;
	own_sa.sin_family = AF_INET;
	own_sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	own_sa.sin_port = htons(10789);
	if (bind(server, (struct sockaddr *)&own_sa, sizeof(own_sa)) != 0)
		return -1;
	if (listen(server, 1) < 0)
		return -1;

	/* client side */
	if ((client = socket(PF_INET, SOCK_STREAM, 0)) < 0)
		return -1;
	if ((client_ofl = fcntl(client, F_GETFL)) < 0)
		return -1;
	if (fcntl(client, F_SETFL, client_ofl | O_NONBLOCK) < 0)
		return -1;
	ret = connect (client, (struct sockaddr *) &own_sa, sizeof(own_sa));
	if (ret != 0 && errno != EINPROGRESS)
		return -1;

	if ((f[0] = accept(server, &a_sa, &a_len)) < 0)
		return -1;
	f[1] = client;

	return 0;
}

int child_proc(int fd)
{
	struct pollfd fds;
	int ret;

	fds.fd = fd;
	fds.events = POLLPRI;

	printf("child: polling...\n");
	ret = poll(&fds, 1, -1);
	if (ret > 0)
		printf("child: poll(): some data <%d,%d> has arrived!\n",
				ret, fds.revents);
	printf("child: done\n");

	return 0;
}

int main(int argc, char *argv[])
{
	int sfd[2] = { -1, -1 };
	pid_t child_pid;

	printf("main: starting\n");

	if (setup_pipe(sfd) == -1) {
		fprintf(stderr, "main: error in setup_pipe()\n");
		return -1;
	}
	printf("main: setup_pipe ok (sfd[0] = %d, sfd[1] = %d)\n",
			sfd[0], sfd[1]);

	switch (child_pid = fork()) {
	case -1:
		fprintf(stderr, "main: fork() error\n");
		exit(1);
	case 0:
		return child_proc(sfd[0]);
	default:
		printf("parent: child <pid %d> started\n", child_pid);
		sleep(1);
		printf("parent: sending the message\n");
		send(sfd[1], "a", 1, MSG_OOB);
		printf("parent: waiting for child to exit\n");
		waitpid(child_pid, NULL, 0);
		printf("parent: done\n");
	}
	return 0;
}

^ permalink raw reply

* [PATCH 2/2] netconsole: clarify stopping message
From: Ferenc Wagner @ 2011-01-06 15:11 UTC (permalink / raw)
  To: netdev; +Cc: Ferenc Wagner
In-Reply-To: <cover.1294326001.git.wferi@niif.hu>


Signed-off-by: Ferenc Wagner <wferi@niif.hu>
---
 drivers/net/netconsole.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index b2ad998..dfb67eb 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -699,8 +699,8 @@ static int netconsole_netdev_event(struct notifier_block *this,
 	}
 	spin_unlock_irqrestore(&target_list_lock, flags);
 	if (stopped && (event == NETDEV_UNREGISTER || event == NETDEV_BONDING_DESLAVE))
-		printk(KERN_INFO "netconsole: network logging stopped, "
-			"interface %s %s\n",  dev->name,
+		printk(KERN_INFO "netconsole: network logging stopped on "
+			"interface %s as it %s\n",  dev->name,
 			event == NETDEV_UNREGISTER ? "unregistered" : "released slaves");
 
 done:
-- 
1.6.5


^ permalink raw reply related

* [PATCH 1/2] netconsole: don't announce stopping if nothing happened
From: Ferenc Wagner @ 2011-01-06 15:11 UTC (permalink / raw)
  To: netdev; +Cc: Ferenc Wagner
In-Reply-To: <cover.1294326001.git.wferi@niif.hu>


Signed-off-by: Ferenc Wagner <wferi@niif.hu>
---
 drivers/net/netconsole.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index 94255f0..b2ad998 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -664,6 +664,7 @@ static int netconsole_netdev_event(struct notifier_block *this,
 	unsigned long flags;
 	struct netconsole_target *nt;
 	struct net_device *dev = ptr;
+	bool stopped = false;
 
 	if (!(event == NETDEV_CHANGENAME || event == NETDEV_UNREGISTER ||
 	      event == NETDEV_BONDING_DESLAVE || event == NETDEV_GOING_DOWN))
@@ -690,13 +691,14 @@ static int netconsole_netdev_event(struct notifier_block *this,
 			case NETDEV_GOING_DOWN:
 			case NETDEV_BONDING_DESLAVE:
 				nt->enabled = 0;
+				stopped = true;
 				break;
 			}
 		}
 		netconsole_target_put(nt);
 	}
 	spin_unlock_irqrestore(&target_list_lock, flags);
-	if (event == NETDEV_UNREGISTER || event == NETDEV_BONDING_DESLAVE)
+	if (stopped && (event == NETDEV_UNREGISTER || event == NETDEV_BONDING_DESLAVE))
 		printk(KERN_INFO "netconsole: network logging stopped, "
 			"interface %s %s\n",  dev->name,
 			event == NETDEV_UNREGISTER ? "unregistered" : "released slaves");
-- 
1.6.5


^ permalink raw reply related

* [PATCH 0/2] Re: spurious netconsole: network logging stopped messages
From: Ferenc Wagner @ 2011-01-06 15:11 UTC (permalink / raw)
  To: netdev; +Cc: Ferenc Wagner
In-Reply-To: <87bp3yq60w.fsf@tac.ki.iif.hu>

Hi,

I've been running with these two patches for a couple of days now.
I don't understand the bridge case: why should netconsole stop logging
on a bridge device which lost a slave?  Or does NETDEV_BONDING_DESLAVE
mean something else?

Anyway, I think the first patch fixes a real bug and the second makes
the kernel message clearer, independently of its cause.

Regards,
Feri.

Ferenc Wagner (2):
  netconsole: don't announce stopping if nothing happened
  netconsole: clarify stopping message

 drivers/net/netconsole.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)


^ permalink raw reply

* Re: [net-next-2.6 PATCH v5 2/2] net_sched: implement a root container qdisc sch_mqprio
From: Jarek Poplawski @ 2011-01-06 15:41 UTC (permalink / raw)
  To: John Fastabend
  Cc: davem@davemloft.net, hadi@cyberus.ca, shemminger@vyatta.com,
	tgraf@infradead.org, eric.dumazet@gmail.com,
	bhutchings@solarflare.com, nhorman@tuxdriver.com,
	netdev@vger.kernel.org
In-Reply-To: <4D25296D.8070902@intel.com>

On Wed, Jan 05, 2011 at 06:31:09PM -0800, John Fastabend wrote:
> On 1/5/2011 4:32 PM, Jarek Poplawski wrote:
> > On Wed, Jan 05, 2011 at 09:38:44AM -0800, John Fastabend wrote:
> >> On 1/4/2011 2:59 PM, Jarek Poplawski wrote:
> >>> On Tue, Jan 04, 2011 at 10:56:46AM -0800, John Fastabend wrote:
> > ...
> >>>> +
> >>>> +     /* Always use supplied priority mappings */
> >>>> +     for (i = 0; i < TC_BITMASK + 1; i++) {
> >>>> +             if (netdev_set_prio_tc_map(dev, i, qopt->prio_tc_map[i])) {
> >>>> +                     err = -EINVAL;
> >>>
> >>> This would probably trigger if we try qopt->num_tc == 0. Is it expected?
> >>
> >> netdev_set_prio_tc_map() returns 0 on sucess. This if(..) is a bit strange though.
> >>
> >>         err = netdev_set_prio_tc_map(dev, i, qopt->prio_tc_map[i])
> >>         if (err)
> >>                 ...
> >>
> >> Is cleaner IMHO.
> > 
> > Maybe. But I still wonder if qopt->num_tc == 0 is valid (by design)?
> > (netdev_set_prio_tc_map() returns -EINVAL if dev->num_tc == 0)
> > 
> 
> I guess I don't see how it returns -EINVAL. Although setting
> qopt->num_tc = 0 is equivalent to disabling traffic classes so
> it is a strange thing to do but I don't expect it should be a
> problem.
> 
> static inline
> int netdev_set_prio_tc_map(struct net_device *dev, u8 prio, u8 tc)
> {
>         if (tc >= dev->num_tc)

If dev->num_tc was set to 0 this check is always true and we exit
the init with error. It's not a problem if it's documented but
checking it earlier seems more readable.

>                 return -EINVAL;
> 
>         dev->prio_tc_map[prio & TC_BITMASK] = tc & TC_BITMASK;
>         return 0;
> }
> 
> >>>> +static unsigned long mqprio_get(struct Qdisc *sch, u32 classid)
> >>>> +{
> >>>> +     unsigned int ntx = TC_H_MIN(classid);
> >>>
> >>> We need to 'get' tc classes too, eg for individual dumps. Then we
> >>> should omit them in .leaf, .graft etc.
> >>>
> >>
> >> OK missed this. Looks like iproute2 always sets NLM_F_DUMP which works because it uses cl_ops->walk
> >>
> >> # tc -s class show dev eth3 classid 800b:1
> >> class mqprio 800b:1 root
> >>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> >>  backlog 0b 0p requeues 0
> > 
> > OK, then it might be only of theoretical value (for some other tools).
> >>> Why dev->num_tc above, netdev_get_num_tc(dev) here, and dev->num_tc
> >>> below?
> >>
> >> No reason just inconsistant I will use dev->num_tc.
> > 
> > Well, this would suggest netdev_get_num_tc() is redundant.
> > 
> 
> Guess it is I can remove it.

If you definitely don't expect any #ifdef CONFIG_XXX for this code,
both get and set num_tc could be removed. If you're not sure use the
function everywhere.

> 
> static inline
> u8 netdev_get_num_tc(const struct net_device *dev)
> {
>         return dev->num_tc;
> }
> 
> >>>> +static void mqprio_walk(struct Qdisc *sch, struct qdisc_walker *arg)
> >>>> +{
> >>>> +     struct net_device *dev = qdisc_dev(sch);
> >>>> +     unsigned long ntx;
> >>>> +     u8 num_tc = netdev_get_num_tc(dev);
> >>>> +
> >>>> +     if (arg->stop)
> >>>> +             return;
> >>>> +
> >>>> +     /* Walk hierarchy with a virtual class per tc */
> >>>> +     arg->count = arg->skip;
> >>>> +     for (ntx = arg->skip; ntx < dev->num_tx_queues + num_tc; ntx++) {
> >>>
> >>> Should we report possibly unused/unconfigured tx_queues?
> >>
> >> I think it may be OK select_queue() could push skbs onto these queues and we may still want to see the statistics in this case. Although (real_num_tx_queues + num_tc) may make sense I see no reason to show queues above real_num_tx_queues.
> > 
> > IMHO, pushing skbs to unconfigured/unwanted/disabled queues is a bug,
> > and select should handle this, but probably it's a matter of taste.
> 
> certainly a bug for queues >= real_num_tx_queues.
> 
> > Btw, I wonder how dev->real_num_tx_queues is obeyed here.
> > 
> 
> Its not. real_num_tx_queues is not handled correctly here also
> real_num_tx_queues can be changed so this need to be handled as well.
> This means an additional check in the options parsing.

Well, let's do it. Maybe similarly to sch_multiq, with the .change
option?

Jarek P.

^ permalink raw reply

* Use ioctl() to list all network interfaces.
From: Chin Shi Hong @ 2011-01-06 15:33 UTC (permalink / raw)
  To: netdev

Dear all,

The following codes are just part of my application's source codes:

--begin of some codes--
ifconf ifc;
ret = ioctl(n, SIOCGIFCONF,&ifc);
--End of some codes--

My application's role is to display all network interfaces, including
the network interface which are down.

However, with the above codes, my application only display the network
interface which are already up. The network interface which are down
will not be displayed.

So, what can I do to make my application display all network
interfaces, including the network interface which are down?

regards,

^ permalink raw reply

* Re: [PATCH net-next-2.6 v2 1/2] can: add driver for Softing card
From: Kurt Van Dijck @ 2011-01-06 15:05 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: netdev, socketcan-core
In-Reply-To: <4D24DB2C.9040104@grandegger.com>

Wolfgang,

On Wed, Jan 05, 2011 at 09:57:16PM +0100, Wolfgang Grandegger wrote:
> 
> here comes my review... First some general remarks. As Mark already
> pointed out, there are still some coding style issues:
Oops, I tried to eliminate those.
> 
> - Please use the following style for multi line comments:
shame on me. I should have done them all after Mark pointed me to it.
> 
> - Please avoid alignment of expressions and structure members.
I see:
diff --git a/include/linux/can.h b/include/linux/can.h
index d183333..6b1e5a6 100644
--- a/include/linux/can.h
+++ b/include/linux/can.h
@@ -56,18 +56,18 @@ typedef __u32 can_err_mask_t;
  */
 struct can_frame {
 	canid_t can_id;  /* 32 bit CAN_ID + EFF/RTR/ERR flags */
-	__u8    can_dlc; /* data length code: 0 .. 8 */
-	__u8    data[8] __attribute__((aligned(8)));
+	__u8 can_dlc; /* data length code: 0 .. 8 */
+	__u8 data[8] __attribute__((aligned(8)));
 };
 
 /* particular protocols of the protocol family PF_CAN */
-#define CAN_RAW		1 /* RAW sockets */
-#define CAN_BCM		2 /* Broadcast Manager */
-#define CAN_TP16	3 /* VAG Transport Protocol v1.6 */
-#define CAN_TP20	4 /* VAG Transport Protocol v2.0 */
-#define CAN_MCNET	5 /* Bosch MCNet */
-#define CAN_ISOTP	6 /* ISO 15765-2 Transport Protocol */
-#define CAN_NPROTO	7
+#define CAN_RAW 1 /* RAW sockets */
+#define CAN_BCM 2 /* Broadcast Manager */
+#define CAN_TP16 3 /* VAG Transport Protocol v1.6 */
+#define CAN_TP20 4 /* VAG Transport Protocol v2.0 */
+#define CAN_MCNET 5 /* Bosch MCNet */
+#define CAN_ISOTP 6 /* ISO 15765-2 Transport Protocol */
+#define CAN_NPROTO 7
 
 #define SOL_CAN_BASE 100
 
@@ -79,7 +79,7 @@ struct can_frame {
  */
 struct sockaddr_can {
 	sa_family_t can_family;
-	int         can_ifindex;
+	int can_ifindex;
 	union {
 		/* transport protocol class address information (e.g. ISOTP) */
 		struct { canid_t rx_id, tx_id; } tp;

I applied a search pattern on this, since I seem incapable of finding
alignment problems in my own code :-).
I assume alignment is ok for definitions, but not within functions?
I consulted the Documentation/Coding-style, but I did not find
the exact answer.
> 
> More comments inline...
More to process for me too ...

Kurt

^ permalink raw reply related

* Re: [PATCH]netdev: add driver for enc424j600 ethernet chip on SPI bus
From: Michał Mirosław @ 2011-01-06 14:56 UTC (permalink / raw)
  To: Balaji Venkatachalam
  Cc: netdev, mohan, blue.cube, lanconelli.claudio, Sriram Subramanian
In-Reply-To: <AANLkTi=cJuDX7RZCFD4ovgGQi7gBXchoMLx8BeuiGz8G@mail.gmail.com>

A quick look follows.

2011/1/6 Balaji Venkatachalam <balaji.v@thotakaa.com>:
> From: Balaji Venkatachalam <balaji.v@thotakaa.com>
>
> These patches add support for Microchip enc424j600 ethernet chip
> controlled via SPI.
>
> I tested it on my custom board with ARM9 (Freescale i.MX233) with
> Kernel 2.6.31.14.
> Any comments are welcome.
>
> Signed-off-by: Balaji Venkatachalam <balaji.v@thotakaa.com>
> ---
>
> diff -uprN -X a/Documentation/dontdiff a/drivers/net/enc424j600.c
> b/drivers/net/enc424j600.c
> --- a/drivers/net/enc424j600.c  1970-01-01 05:30:00.000000000 +0530
> +++ b/drivers/net/enc424j600.c  2011-01-06 09:39:17.000000000 +0530
> @@ -0,0 +1,1694 @@
[...]
> +static int enc424j600_spi_trans(struct enc424j600_net *priv, int len)
> +{
> +       /*modified to suit half duplexed spi */
> +       struct spi_transfer tt = {
> +               .tx_buf = priv->spi_tx_buf,
> +               .len = SPI_OPLEN,
> +       };
> +       struct spi_transfer tr = {
> +               .rx_buf = priv->spi_rx_buf,
> +               .len = len,
> +       };
> +       struct spi_message m;
> +       int ret;
> +
> +       spi_message_init(&m);
> +
> +       spi_message_add_tail(&tt, &m);
> +       spi_message_add_tail(&tr, &m);
> +
> +       ret = spi_sync(priv->spi, &m);
> +
> +       if (ret == 0)
> +               memcpy(priv->spi_rx_buf, tr.rx_buf, len);
> +
> +       if (ret)
> +               dev_err(&priv->spi->dev,
> +                       "spi transfer failed: ret = %d\n", ret);
> +       return ret;
> +}
> +
> +/*
> + * Read data from chip SRAM.
> + * window = 0 for Receive Buffer
> + *               =     1 for User Defined area
> + *               =     2 for General Purpose area
> + */

This could use an enum. And the comment does not match the code, BTW.

> +static int enc424j600_read_sram(struct enc424j600_net *priv,
> +                               u8 *dst, int len, u16 srcaddr, int window)
> +{
> +       int ret;
> +
> +       if (len > SPI_TRANSFER_BUF_LEN - 1 || len <= 0)
> +               return -EINVAL;
> +
> +       /* First set the write pointer as per selected window */
> +       if (window == 1)
> +               priv->spi_tx_buf[0] = WRXRDPT;
> +       else if (window == 2)
> +               priv->spi_tx_buf[0] = WUDARDPT;
> +       else if (window == 3)
> +               priv->spi_tx_buf[0] = WGPRDPT;
> +
> +       priv->spi_tx_buf[1] = srcaddr & 0xFF;
> +       priv->spi_tx_buf[2] = srcaddr >> 8;
> +       ret = spi_write(priv->spi, priv->spi_tx_buf, 3);
> +
> +       /* Transfer the data */
> +       if (window == 1)
> +               priv->spi_tx_buf[0] = RRXDATA;
> +       else if (window == 2)
> +               priv->spi_tx_buf[0] = RUDADATA;
> +       else if (window == 3)
> +               priv->spi_tx_buf[0] = RGPDATA;
> +
> +       ret = enc424j600_spi_trans(priv, len + 1);      /*READ*/
> +
> +       /* Copy the data to the tx buffer */
> +       memcpy(dst, &priv->spi_rx_buf[0], len);

You could avoid the copy if enc424j600_spi_trans() would get the
buffer as its arguments.

> +
> +       return ret;
> +}
> +
> +/*
> + * Write data to chip SRAM.
> + * window = 1 for RX
> + * window = 2 for User Data
> + * window = 3 for GP
> + */

Enum.

> +static int enc424j600_write_sram(struct enc424j600_net *priv,
> +                                const u8 *src, int len, u16 dstaddr,
> +                                int window)
> +{
> +       int ret;
> +
> +       if (len > SPI_TRANSFER_BUF_LEN - 1 || len <= 0)
> +               return -EINVAL;
> +
> +       /* First set the general purpose write pointer */
> +       if (window == 1)
> +               priv->spi_tx_buf[0] = WRXWRPT;
> +       else if (window == 2)
> +               priv->spi_tx_buf[0] = WUDAWRPT;
> +       else if (window == 3)
> +               priv->spi_tx_buf[0] = WGPWRPT;
> +
> +       priv->spi_tx_buf[1] = dstaddr & 0xFF;
> +       priv->spi_tx_buf[2] = dstaddr >> 8;
> +       ret = spi_write(priv->spi, priv->spi_tx_buf, 3);
> +
> +       /* Copy the data to the tx buffer */
> +       memcpy(&priv->spi_tx_buf[1], src, len);
> +
> +       /* Transfer the data */
> +       if (window == 1)
> +               priv->spi_tx_buf[0] = WRXDATA;
> +       else if (window == 2)
> +               priv->spi_tx_buf[0] = WUDADATA;
> +       else if (window == 3)
> +               priv->spi_tx_buf[0] = WGPDATA;
> +
> +       ret = spi_write(priv->spi, priv->spi_tx_buf, len + 1);
> +
> +       return ret;
> +}
> +
[...]
> +/*
> + * Write a 16bit special function register.
> + * The @sfr parameters takes address of the low byte of the register.
> + * Takes care of the endiannes & buffers.
> + * Uses banked write instruction.
> + */
> +
> +static int enc424j600_write_16b_sfr(struct enc424j600_net *priv,
> +                                   u8 sfr, u16 data)
> +{
> +       int ret;
> +       enc424j600_set_bank(priv, sfr);
> +
> +       priv->spi_tx_buf[0] = WCR(sfr & ADDR_MASK);
> +       priv->spi_tx_buf[1] = data & 0xFF;
> +       priv->spi_tx_buf[2] = data >> 8;
> +       ret = spi_write(priv->spi, priv->spi_tx_buf, 3);
> +       if (ret && netif_msg_drv(priv))
> +               printk(KERN_DEBUG DRV_NAME ": %s() failed: ret = %d\n",
> +                      __func__, ret);

netif_dbg()

It's in 2.6.35 or later, though.

> +
> +       return ret;
> +}
> +
> +/*
> + * Read a 16bit special function register.
> + * The @sfr parameters takes address of the low byte of the register.
> + * Takes care of the endiannes & buffers.
> + * Uses banked read instruction.
> + */
> +static int enc424j600_read_16b_sfr(struct enc424j600_net *priv,
> +                                  u8 sfr, u16 *data)
> +{
> +       int ret;
> +       enc424j600_set_bank(priv, sfr);
> +
> +       priv->spi_tx_buf[0] = RCR(sfr & ADDR_MASK);
> +       priv->spi_tx_buf[1] = 0;
> +       priv->spi_tx_buf[2] = 0;
> +       priv->spi_tx_buf[3] = 0;
> +       ret = enc424j600_spi_trans(priv, 3);    /*READ*/
> +
> +       *data = priv->spi_rx_buf[0] | priv->spi_rx_buf[1] << (u16) 8;
> +
> +       return ret;
> +}
> +
> +/*
> + * Reset the enc424j600.
> + * TODO: What if we get stuck on non-working spi with the initial
> + * test access to EUDAST ?
> + */
> +static void enc424j600_soft_reset(struct enc424j600_net *priv)
> +{
> +       u8 estath;
> +       u16 eudast;
> +       if (netif_msg_hw(priv))
> +               printk(KERN_DEBUG DRV_NAME ": %s() enter\n", __func__);
> +
> +       do {
> +               enc424j600_write_16b_sfr(priv, EUDASTL, EUDAST_TEST_VAL);
> +               enc424j600_read_16b_sfr(priv, EUDASTL, &eudast);
> +       } while (eudast != EUDAST_TEST_VAL);
> +
> +       do {
> +               enc424j600_read_8b_sfr(priv, ESTATH, &estath);
> +       } while (!estath & CLKRDY);

Those loops need a timeout or you risk lockup with broken hardware.

> +
> +       priv->spi_tx_buf[0] = SETETHRST;
> +       enc424j600_spi_trans(priv, 1);
> +
> +       udelay(50);
> +
> +       enc424j600_read_16b_sfr(priv, EUDASTL, &eudast);
> +       if (netif_msg_hw(priv) && eudast != 0)
> +               printk(KERN_DEBUG DRV_NAME
> +                      ": %s() EUDASTL is not zero!\n", __func__);
> +
> +       udelay(500);

Long enough for usleep() if there are no locks held.

> +}
> +
> +static unsigned long msec20_to_jiffies;

== HZ/50. It's constant AFAIR.

> +
> +/*
> + * Wait for bits in register to become equal to @readyMask, but at most 20ms.
> + */
> +static int poll_ready(struct enc424j600_net *priv,
> +                     u8 reg, u8 mask, u8 readyMask)
> +{
> +       unsigned long timeout = jiffies + msec20_to_jiffies;
> +       u8 value;
> +       /* 20 msec timeout read */
> +       enc424j600_read_8b_sfr(priv, reg, &value);
> +       while ((value & mask) != readyMask) {
> +               if (time_after(jiffies, timeout)) {
> +                       if (netif_msg_drv(priv))
> +                               dev_dbg(&priv->spi->dev,
> +                                       "reg %02x ready timeout!\n", reg);
> +                       return -ETIMEDOUT;
> +               }
> +               cpu_relax();
> +               enc424j600_read_8b_sfr(priv, reg, &value);
> +       }
> +
> +       return 0;
> +}
> +
[...]
> +/* Waits for autonegotiation to complete and sets FULDPX bit in macon2. */
> +static void enc424j600_wait_for_autoneg(struct enc424j600_net *priv)
> +{
> +       u16 phstat1;
> +       do {
> +               enc424j600_phy_read(priv, PHSTAT1, &phstat1);
> +       } while (!(phstat1 & ANDONE));

Timeout?

[...]
> +static int
> +enc424j600_setlink(struct net_device *ndev, u8 autoneg, u16 speed, u8 duplex)
> +{
> +       struct enc424j600_net *priv = netdev_priv(ndev);
> +       int ret = 0;
> +       if (!priv->hw_enable) {
> +               /* link is in low power mode now; duplex setting
> +                * will take effect on next enc424j600_hw_init().
> +                */
> +               if (speed == SPEED_10 || speed == SPEED_100) {
> +                       priv->autoneg = (autoneg == AUTONEG_ENABLE);
> +                       priv->full_duplex = (duplex == DUPLEX_FULL);
> +                       priv->speed100 = (speed == SPEED_100);
> +               } else {
> +                       if (netif_msg_link(priv))
> +                               dev_warn(&ndev->dev,
> +                                        "unsupported link setting\n");
> +                       ret = -EOPNOTSUPP;

EINVAL

> +               }
> +       } else {
> +               if (netif_msg_link(priv))
> +                       dev_warn(&ndev->dev, "Warning: hw must be disabled "
> +                                "to set link mode\n");
> +               ret = -EBUSY;
> +       }
> +       return ret;
> +}
[...]
> +
> +/*
> + * Hardware receive function.
> + * Read the buffer memory, update the FIFO pointer to free the buffer,
> + * check the status vector and decrement the packet counter.
> + */
> +static void enc424j600_hw_rx(struct net_device *ndev)
> +{
> +       struct enc424j600_net *priv = netdev_priv(ndev);
> +       struct sk_buff *skb = NULL;
> +       u16 erxrdpt, next_packet, rxstat;
> +       u8 pkcnt;
> +       u16 head, tail;
> +       u8 rsv[RSV_SIZE];
> +       u16 newrxtail;
> +       int len;
> +
> +       if (netif_msg_rx_status(priv))
> +               printk(KERN_DEBUG DRV_NAME ": RX pk_addr:0x%04x\n",
> +                      priv->next_pk_ptr);
> +       if (unlikely(priv->next_pk_ptr > RXEND_INIT)) {
> +               if (netif_msg_rx_err(priv))
> +                       dev_err(&ndev->dev,
> +                               "%s() Invalid packet address!! 0x%04x\n",
> +                               __func__, priv->next_pk_ptr);
> +               mutex_lock(&priv->lock);
> +               enc424j600_clear_bits(priv, ECON1L, RXEN);
> +               enc424j600_set_bits(priv, ECON2L, RXRST);
> +               enc424j600_clear_bits(priv, ECON2L, RXRST);
> +               nolock_rxfifo_init(priv, RXSTART, RXEND_INIT);
> +               enc424j600_clear_bits(priv, EIRL, RXABTIF);
> +               enc424j600_set_bits(priv, ECON1L, RXEN);
> +               mutex_unlock(&priv->lock);
> +               ndev->stats.rx_errors++;
> +               return;
> +       }
> +
> +       /* Read next packet pointer and rx status vector */
> +       enc424j600_read_sram(priv, rsv, sizeof(rsv), priv->next_pk_ptr, 1);
> +
> +       next_packet = rsv[1];
> +       next_packet <<= 8;
> +       next_packet |= rsv[0];
> +
> +       len = rsv[3];
> +       len <<= 8;
> +       len |= rsv[2];
> +
> +       rxstat = rsv[5];
> +       rxstat <<= 8;
> +       rxstat |= rsv[4];
> +
> +       if (netif_msg_rx_status(priv))
> +               enc424j600_dump_rsv(priv, __func__, next_packet, len, rxstat);
> +
> +       if (!RSV_GETBIT(rxstat, RSV_RXOK) || len > MAX_FRAMELEN) {
> +               if (netif_msg_rx_err(priv))
> +                       dev_err(&ndev->dev, "Rx Error (%04x)\n", rxstat);
> +               ndev->stats.rx_errors++;
> +               if (RSV_GETBIT(rxstat, RSV_CRCERROR))
> +                       ndev->stats.rx_crc_errors++;
> +               if (RSV_GETBIT(rxstat, RSV_LENCHECKERR))
> +                       ndev->stats.rx_frame_errors++;
> +               if (len > MAX_FRAMELEN)
> +                       ndev->stats.rx_over_errors++;
> +       } else {
> +               skb = dev_alloc_skb(len + NET_IP_ALIGN);
> +               if (!skb) {
> +                       if (netif_msg_rx_err(priv))
> +                               dev_err(&ndev->dev,
> +                                       "out of memory for Rx'd frame\n");
> +                       ndev->stats.rx_dropped++;
> +               } else {
> +                       skb->dev = ndev;
> +                       skb_reserve(skb, NET_IP_ALIGN);
> +
> +                       /* copy the packet from the receive buffer */
> +                       enc424j600_read_sram(priv, skb_put(skb, len), len,
> +                                            rx_packet_start(priv->next_pk_ptr),
> +                                            1);
> +
> +                       if (netif_msg_pktdata(priv))
> +                               dump_packet(__func__, skb->len, skb->data);
> +                       skb->protocol = eth_type_trans(skb, ndev);
> +                       /* update statistics */
> +                       ndev->stats.rx_packets++;
> +                       ndev->stats.rx_bytes += len;
> +                       netif_rx_ni(skb);

I assume that this is not an IRQ context because of (slow) SPI
accesses. So use netif_rx() or better netif_receive_skb().

> +               }
> +       }
> +       newrxtail = next_packet - 2;
> +       if (next_packet == RXSTART)
> +               newrxtail = SRAMSIZE - 2;
> +
> +       enc424j600_write_16b_sfr(priv, ERXTAILL, newrxtail);
> +       /*
> +        * Move the RX read pointer to the start of the next
> +        * received packet.
> +        * This frees the memory we just read out
> +        */
> +       erxrdpt = erxrdpt_workaround(next_packet, RXSTART, RXEND_INIT);
> +       if (netif_msg_hw(priv))
> +               printk(KERN_DEBUG DRV_NAME ": %s() ERXRDPT:0x%04x\n", __func__,
> +                      erxrdpt);
> +
> +       mutex_lock(&priv->lock);

Why do you need the lock here, but not on previous accesses to the hardware?

> +       enc424j600_write_16b_sfr(priv, ERXRDPTL, erxrdpt);
> +
> +#ifdef CONFIG_ENC28J60_WRITEVERIFY
> +       if (netif_msg_drv(priv)) {
> +               u16 reg;
> +
> +               enc424j600_read_16b_sfr(priv, ERXRDPTL, &reg);
> +               if (reg != erxrdpt)
> +                       printk(KERN_DEBUG DRV_NAME ": %s() ERXRDPT verify "
> +                              "error (0x%04x - 0x%04x)\n", __func__, reg,
> +                              erxrdpt);
> +       }
> +#endif
> +
> +       priv->next_pk_ptr = next_packet;
> +       enc424j600_read_8b_sfr(priv, ESTATL, &pkcnt);
> +       enc424j600_read_16b_sfr(priv, ERXHEADL, &head);
> +       enc424j600_read_16b_sfr(priv, ERXTAILL, &tail);
> +       /* we are done with this packet, decrement the packet counter */
> +       enc424j600_set_bits(priv, ECON1H, PKTDEC);
> +
> +       mutex_unlock(&priv->lock);
> +}
[...]
> +static irqreturn_t enc424j600_irq(int irq, void *dev_id)
> +{
> +       struct enc424j600_net *priv = dev_id;
> +       /*
> +        * Can't do anything in interrupt context because we need to
> +        * block (spi_sync() is blocking) so fire of the interrupt
> +        * handling workqueue.
> +        * Remember that we access enc424j600 registers through SPI bus
> +        * via spi_sync() call.
> +        */
> +       schedule_work(&priv->irq_work);
> +
> +       return IRQ_HANDLED;
> +}

You can use threaded interrupts here and get rid of irq_work. And
definitely disable source interrupt before returning. And after you do
that, you have implemented most of what's necessary for NAPI rx.

[...]
> +static void enc424j600_setrx_work_handler(struct work_struct *work)
> +{
> +       u16 macon1;
> +       struct enc424j600_net *priv =
> +           container_of(work, struct enc424j600_net, setrx_work);
> +
> +       if (priv->rxfilter == RXFILTER_PROMISC) {
> +               if (netif_msg_drv(priv))
> +                       printk(KERN_DEBUG DRV_NAME ": promiscuous mode\n");
> +               enc424j600_read_16b_sfr(priv, MACON1L, &macon1);
> +               macon1 = macon1 | PASSALL;
> +               enc424j600_write_16b_sfr(priv, MACON1L, macon1);
> +               locked_regb_write(priv, ERXFCONL, UCEN | MCEN | NOTMEEN);
> +       } else if (priv->rxfilter == RXFILTER_MULTI) {
> +               if (netif_msg_drv(priv))
> +                       printk(KERN_DEBUG DRV_NAME ": multicast mode\n");
> +               locked_regb_write(priv, ERXFCONL, UCEN | CRCEN | BCEN | MCEN);
> +
> +       } else {
> +               if (netif_msg_drv(priv))
> +                       printk(KERN_DEBUG DRV_NAME ": normal mode\n");
> +               locked_regb_write(priv, ERXFCONL, UCEN | CRCEN | BCEN);
> +
> +       }
> +}
> +
> +static void enc424j600_restart_work_handler(struct work_struct *work)
> +{
> +       struct enc424j600_net *priv =
> +           container_of(work, struct enc424j600_net, restart_work);
> +       struct net_device *ndev = priv->netdev;
> +       int ret;
> +
> +       rtnl_lock();
> +       if (netif_running(ndev)) {
> +               enc424j600_net_close(ndev);
> +               ret = enc424j600_net_open(ndev);
> +               if (unlikely(ret)) {
> +                       dev_info(&ndev->dev, " could not restart %d\n", ret);
> +                       dev_close(ndev);
> +               }
> +       }
> +       rtnl_unlock();
> +}

Are these work handlers guaranteed to not be scheduled concurrently?

[...]
> +       /* If requested, allocate DMA buffers */
> +       if (enc424j600_enable_dma) {
> +               spi->dev.coherent_dma_mask = ~0;
> +
> +               /*
> +                * Minimum coherent DMA allocation is PAGE_SIZE, so allocate
> +                * that much and share it between Tx and Rx DMA buffers.
> +                */
> +#if SPI_TRANSFER_BUF_LEN > PAGE_SIZE / 2
> +#error "A problem in DMA buffer allocation"
> +#endif
> +               priv->spi_tx_buf = dma_alloc_coherent(&spi->dev,
> +                                                     PAGE_SIZE,
> +                                                     &priv->spi_tx_dma,
> +                                                     GFP_DMA);
> +
> +               if (priv->spi_tx_buf) {
> +                       priv->spi_rx_buf = (u8 *) (priv->spi_tx_buf +
> +                                                  (PAGE_SIZE / 2));
> +                       priv->spi_rx_dma = (dma_addr_t) (priv->spi_tx_dma +
> +                                                        (PAGE_SIZE / 2));
> +               } else {
> +                       /* Fall back to non-DMA */
> +                       enc424j600_enable_dma = 0;
> +               }
> +       }

You don't use DMA addresses anywhere. And don't do anything with this
flag except for freeing the buffers. You should use streaming DMA
calls for skb data to avoid copying anyway and let the arch code care
about bounce buffers if needed.

[...]

MAINTAINTERS entry would be nice also.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: genetlink misinterprets NEW as GET
From: Pablo Neira Ayuso @ 2011-01-06 14:55 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Netfilter Developer Mailing List,
	Linux Networking Developer Mailing List
In-Reply-To: <alpine.LNX.2.01.1101061508550.13211@obet.zrqbmnf.qr>

On 06/01/11 15:25, Jan Engelhardt wrote:
> On Thursday 2011-01-06 14:48, Pablo Neira Ayuso wrote:
>>>
>>> 	/* Modifiers to GET request */
>>> 	#define NLM_F_ROOT      0x100
>>> 	#define NLM_F_MATCH     0x200
>>> 	#define NLM_F_ATOMIC    0x400
>>> 	#define NLM_F_DUMP      (NLM_F_ROOT|NLM_F_MATCH)
>>> 	
>>> 	/* Modifiers to NEW request */
>>> 	#define NLM_F_REPLACE   0x100
>>> 	#define NLM_F_EXCL      0x200
>>> 	#define NLM_F_CREATE    0x400
>>> 	#define NLM_F_APPEND    0x800
>>>
>>> Except there is nothing that declares a particular Netlink message
>>> as "GET" or "NEW". Subsequently, genetlink chokes:
>>>
>>> 	if (nlh->nlmsg_flags & NLM_F_DUMP)
>>> 		if (ops->dumpit == NULL)
>>> 			return -EOPNOTSUPP;
>>>
>>> Because NLM_F_CREATE | NLM_F_EXCL == NLM_F_DUMP.
>>> That, of course, is absolutely bogus.
>>
>> Hm, NLM_F_CREATE | NLM_F_EXCL is not equal to NLM_F_DUMP.
>>
>> You must be hitting -EOPNOTSUPP elsewhere.
> 
> No, I am hitting EOPNOTSUPP here; right it's not equal, sorry.
> But nlmsg_flags is tested for NLM_F_MATCH (0x200), which is provided by
> NLM_F_EXCL. ipset does use NLM_F_EXCL and thus ran into this.

i getting confused, so ipset is also setting NLM_F_REPLACE to match the
NLM_F_DUMP bitmask?

^ permalink raw reply

* Re: [PATCH net-next] bnx2x: Add Nic partitioning mode (57712 devices)
From: Eilon Greenstein @ 2011-01-06 14:40 UTC (permalink / raw)
  To: Matt Domsch
  Cc: Dimitris Michailidis, Dmitry Kravkov, davem@davemloft.net,
	netdev@vger.kernel.org, narendra_k@dell.com,
	jordan_hargrave@dell.com
In-Reply-To: <4D0FB233.9050501@chelsio.com>

On Mon, 2010-12-20 at 11:44 -0800, Dimitris Michailidis wrote:
> Matt Domsch wrote:
> >> You can have several interfaces with same device link and different dev_id. 
> >>  While the current driver doesn't do it you could also have several 
> >> interfaces with different device links but same dev_id (NPAR situation, 
> >> notice again that dev_ids are not per PCI function), or interfaces with 
> >> different device and dev_id, or even interfaces with same device and dev_id.
> > 
> > What is the scope of dev_id then?  It's not per PCI device like I
> > thought.
> 
> I don't think it could be that way because for these cards you can't 
> statically tell which ports are controlled by a PCI function.  So knowing 
> that an interface is say port 0 of a function would help little.
> 
> > It sounds like it's per card, but how can I know the card
> > boundary?
> 
> Yes, it's per card and covers the PFs and VFs of the card.
> 
> > If I have 2 cards driven by cxgb4 in the system, each with say 4
> > ports.  I could see a minimum of 8 PCI devices (fine), but the dev_id
> > values would be?  0,1,2,3; 0,1,2,3 ?
> 
> Correct.
> 
> > How can I tell that these are
> > two different cards, with two different sets of dev_id values, rather
> > than one card with 4 ports, 8 (NPAR or SR-IOV) interfaces, with each 2
> > interfaces mapping to the same port?
> 
> Doesn't the information in /sys/devices distinguish them?  For example, 
> something like
> 
> /sys/devices/pci0000:00/0000:00:07.0/0000:04:00.0/net/eth2/dev_id == 0
> /sys/devices/pci0000:00/0000:00:07.0/0000:04:00.1/net/eth3/dev_id == 0
> /sys/devices/pci0000:00/0000:00:07.0/0000:04:01.0/net/eth5/dev_id == 0
> /sys/devices/pci0000:00/0000:00:1c.0/0000:05:00.0/net/eth4/dev_id == 0
> 
> tells me there are two cards, one has eth4 on port 0, the other has eth2, 
> eth3, and eth5 on its port 0 with eth5 being on a VF.
> 
> > dev_id is not system-wide unique.  It's not even slot unique best as I
> > can tell.  If I had a PCI slot extender, with 2 PCI slots, and I put
> > two of the above cards in, I would see 0,1,2,3; 0,1,2,3.  To be fair,
> > my naming scheme doesn't really account for such an extender, though
> > currently it would go pci<slot>#<12345678>.
> 
> Can you give an example of what /sys/devices looks like in the case you're 
> considering?

Matt,

Happy New Year!

Does the dev_id approach suits your needs? Do you want to proceed in
that direction?

Thanks,
Eilon



^ permalink raw reply

* Re: genetlink misinterprets NEW as GET
From: Jan Engelhardt @ 2011-01-06 14:25 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Netfilter Developer Mailing List,
	Linux Networking Developer Mailing List
In-Reply-To: <4D25C82F.4010306@netfilter.org>

On Thursday 2011-01-06 14:48, Pablo Neira Ayuso wrote:
>> 
>> 	/* Modifiers to GET request */
>> 	#define NLM_F_ROOT      0x100
>> 	#define NLM_F_MATCH     0x200
>> 	#define NLM_F_ATOMIC    0x400
>> 	#define NLM_F_DUMP      (NLM_F_ROOT|NLM_F_MATCH)
>> 	
>> 	/* Modifiers to NEW request */
>> 	#define NLM_F_REPLACE   0x100
>> 	#define NLM_F_EXCL      0x200
>> 	#define NLM_F_CREATE    0x400
>> 	#define NLM_F_APPEND    0x800
>> 
>> Except there is nothing that declares a particular Netlink message
>> as "GET" or "NEW". Subsequently, genetlink chokes:
>> 
>> 	if (nlh->nlmsg_flags & NLM_F_DUMP)
>> 		if (ops->dumpit == NULL)
>> 			return -EOPNOTSUPP;
>> 
>> Because NLM_F_CREATE | NLM_F_EXCL == NLM_F_DUMP.
>> That, of course, is absolutely bogus.
>
>Hm, NLM_F_CREATE | NLM_F_EXCL is not equal to NLM_F_DUMP.
>
>You must be hitting -EOPNOTSUPP elsewhere.

No, I am hitting EOPNOTSUPP here; right it's not equal, sorry.
But nlmsg_flags is tested for NLM_F_MATCH (0x200), which is provided by
NLM_F_EXCL. ipset does use NLM_F_EXCL and thus ran into this.

^ permalink raw reply

* Re: genetlink misinterprets NEW as GET
From: Pablo Neira Ayuso @ 2011-01-06 13:48 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Netfilter Developer Mailing List,
	Linux Networking Developer Mailing List
In-Reply-To: <alpine.LNX.2.01.1101040304500.29858@obet.zrqbmnf.qr>

On 04/01/11 03:14, Jan Engelhardt wrote:
> Hey there,
> 
> 
> I can't really say whether it's genetlink or netlink to blame,
> but I noticed that a request with
> 
> 	nlmsg_flags = NLM_F_CREATE | NLM_F_EXCL
> 
> to a genl-registered component can return -EOPNOTSUPP because it does 
> not have a dumpit function defined in struct genl_ops. Make sense?
> Not at first sight at least.
> include/linux/netlink.h has this nice anecdote:
> 
> 	/* Modifiers to GET request */
> 	#define NLM_F_ROOT      0x100
> 	#define NLM_F_MATCH     0x200
> 	#define NLM_F_ATOMIC    0x400
> 	#define NLM_F_DUMP      (NLM_F_ROOT|NLM_F_MATCH)
> 	
> 	/* Modifiers to NEW request */
> 	#define NLM_F_REPLACE   0x100
> 	#define NLM_F_EXCL      0x200
> 	#define NLM_F_CREATE    0x400
> 	#define NLM_F_APPEND    0x800
> 
> Except there is nothing that declares a particular Netlink message
> as "GET" or "NEW". Subsequently, genetlink chokes:
> 
> 	if (nlh->nlmsg_flags & NLM_F_DUMP)
> 		if (ops->dumpit == NULL)
> 			return -EOPNOTSUPP;
> 
> Because NLM_F_CREATE | NLM_F_EXCL == NLM_F_DUMP.
> That, of course, is absolutely bogus.

Hm, NLM_F_CREATE | NLM_F_EXCL is not equal to NLM_F_DUMP.

You must be hitting -EOPNOTSUPP elsewhere.

> [N.B.: I am also wondering whether
> 	(nlh->nlmsg_flags & NLM_F_DUMP) == NLM_F_DUMP
> may have been desired, because NLM_F_DUMP is composed of two bits.]

Someone may include NLM_F_ATOMIC to a dump operation, in that case the
checking that you propose is not valid.

^ permalink raw reply

* Re: Flow Control and Port Mirroring Revisited
From: Eric Dumazet @ 2011-01-06 13:28 UTC (permalink / raw)
  To: Simon Horman
  Cc: Rusty Russell, virtualization, Jesse Gross, dev, virtualization,
	netdev, kvm, Michael S. Tsirkin
In-Reply-To: <20110106124439.GA17004@verge.net.au>

Le jeudi 06 janvier 2011 à 21:44 +0900, Simon Horman a écrit :

> Hi Eric !
> 
> Thanks for the advice. I had thought about the socket buffer but at some
> point it slipped my mind.
> 
> In any case the following patch seems to implement the change that I had in
> mind. However my discussions Michael Tsirkin elsewhere in this thread are
> beginning to make me think that think that perhaps this change isn't the
> best solution.
> 
> diff --git a/datapath/actions.c b/datapath/actions.c
> index 5e16143..505f13f 100644
> --- a/datapath/actions.c
> +++ b/datapath/actions.c
> @@ -384,7 +384,12 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
>  
>  	for (a = actions, rem = actions_len; rem > 0; a = nla_next(a, &rem)) {
>  		if (prev_port != -1) {
> -			do_output(dp, skb_clone(skb, GFP_ATOMIC), prev_port);
> +			struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
> +			if (nskb) {
> +				if (skb->sk)
> +					skb_set_owner_w(nskb, skb->sk);
> +				do_output(dp, nskb, prev_port);
> +			}
>  			prev_port = -1;
>  		}
> 
> I got a rather nasty panic without the if (skb->sk),
> I guess some skbs don't have a socket.

Indeed, some packets are not linked to a socket.

(ARP packets for example)

Sorry, I should have mentioned it :)



^ permalink raw reply

* Re: Flow Control and Port Mirroring Revisited
From: Michael S. Tsirkin @ 2011-01-06 12:47 UTC (permalink / raw)
  To: Simon Horman
  Cc: Rusty Russell, virtualization, Jesse Gross, dev, virtualization,
	netdev, kvm
In-Reply-To: <20110106122859.GA13253@verge.net.au>

On Thu, Jan 06, 2011 at 09:29:02PM +0900, Simon Horman wrote:
> On Thu, Jan 06, 2011 at 02:07:22PM +0200, Michael S. Tsirkin wrote:
> > On Thu, Jan 06, 2011 at 08:30:52PM +0900, Simon Horman wrote:
> > > On Thu, Jan 06, 2011 at 12:27:55PM +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Jan 06, 2011 at 06:33:12PM +0900, Simon Horman wrote:
> > > > > Hi,
> > > > > 
> > > > > Back in October I reported that I noticed a problem whereby flow control
> > > > > breaks down when openvswitch is configured to mirror a port[1].
> > > > 
> > > > Apropos the UDP flow control.  See this
> > > > http://www.spinics.net/lists/netdev/msg150806.html
> > > > for some problems it introduces.
> > > > Unfortunately UDP does not have built-in flow control.
> > > > At some level it's just conceptually broken:
> > > > it's not present in physical networks so why should
> > > > we try and emulate it in a virtual network?
> > > > 
> > > > 
> > > > Specifically, when you do:
> > > > # netperf -c -4 -t UDP_STREAM -H 172.17.60.218 -l 30 -- -m 1472
> > > > You are asking: what happens if I push data faster than it can be received?
> > > > But why is this an interesting question?
> > > > Ask 'what is the maximum rate at which I can send data with %X packet
> > > > loss' or 'what is the packet loss at rate Y Gb/s'. netperf has
> > > > -b and -w flags for this. It needs to be configured
> > > > with --enable-intervals=yes for them to work.
> > > > 
> > > > If you pose the questions this way the problem of pacing
> > > > the execution just goes away.
> > > 
> > > I am aware that UDP inherently lacks flow control.
> > 
> > Everyone's is aware of that, but this is always followed by a 'however'
> > :).
> > 
> > > The aspect of flow control that I am interested in is situations where the
> > > guest can create large amounts of work for the host. However, it seems that
> > > in the case of virtio with vhostnet that the CPU utilisation seems to be
> > > almost entirely attributable to the vhost and qemu-system processes.  And
> > > in the case of virtio without vhost net the CPU is used by the qemu-system
> > > process. In both case I assume that I could use a cgroup or something
> > > similar to limit the guests.
> > 
> > cgroups, yes. the vhost process inherits the cgroups
> > from the qemu process so you can limit them all.
> > 
> > If you are after limiting the max troughput of the guest
> > you can do this with cgroups as well.
> 
> Do you mean a CPU cgroup or something else?

net classifier cgroup

> > > Assuming all of that is true then from a resource control problem point of
> > > view, which is mostly what I am concerned about, the problem goes away.
> > > However, I still think that it would be nice to resolve the situation I
> > > described.
> > 
> > We need to articulate what's wrong here, otherwise we won't
> > be able to resolve the situation. We are sending UDP packets
> > as fast as we can and some receivers can't cope. Is this the problem?
> > We have made attempts to add a pseudo flow control in the past
> > in an attempt to make UDP on the same host work better.
> > Maybe they help some but they also sure introduce problems.
> 
> In the case where port mirroring is not active, which is the
> usual case, to some extent there is flow control in place due to
> (as Eric Dumazet pointed out) the socket buffer.
> 
> When port mirroring is activated the flow control operates based
> only on one port - which can't be controlled by the administrator
> in an obvious way.
> 
> I think that it would be more intuitive if flow control was
> based on sending a packet to all ports rather than just one.
> 
> Though now I think about it some more, perhaps this isn't the best either.
> For instance the case where data was being sent to dummy0 and suddenly
> adding a mirror on eth1 slowed everything down.
> 
> So perhaps there needs to be another knob to tune when setting
> up port-mirroring. Or perhaps the current situation isn't so bad.

To understand whether it's bad, you'd need to measure it.
The netperf manual says:
	5.2.4 UDP_STREAM

		A UDP_STREAM test is similar to a TCP_STREAM test except UDP is used as
	the transport rather than TCP.

		A UDP_STREAM test has no end-to-end flow control - UDP provides none
	and neither does netperf. However, if you wish, you can configure netperf with
	--enable-intervals=yes to enable the global command-line -b and -w options to
	pace bursts of traffic onto the network.

	This has a number of implications.

	...
and one of the implications is that the max throughput
might not be reached when you try to send as much data as possible.
It might be confusing that this is what netperf does by default with UDP_STREAM:
if the endpoint is much faster than the network the issue might not appear.

-- 
MST

^ permalink raw reply

* Re: Flow Control and Port Mirroring Revisited
From: Simon Horman @ 2011-01-06 12:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rusty Russell, virtualization, Jesse Gross, dev, virtualization,
	netdev, kvm, Michael S. Tsirkin
In-Reply-To: <1294309362.3074.11.camel@edumazet-laptop>

On Thu, Jan 06, 2011 at 11:22:42AM +0100, Eric Dumazet wrote:
> Le jeudi 06 janvier 2011 à 18:33 +0900, Simon Horman a écrit :
> > Hi,
> > 
> > Back in October I reported that I noticed a problem whereby flow control
> > breaks down when openvswitch is configured to mirror a port[1].
> > 
> > I have (finally) looked into this further and the problem appears to relate
> > to cloning of skbs, as Jesse Gross originally suspected.
> > 
> > More specifically, in do_execute_actions[2] the first n-1 times that an skb
> > needs to be transmitted it is cloned first and the final time the original
> > skb is used.
> > 
> > In the case that there is only one action, which is the normal case, then
> > the original skb will be used. But in the case of mirroring the cloning
> > comes into effect. And in my case the cloned skb seems to go to the (slow)
> > eth1 interface while the original skb goes to the (fast) dummy0 interface
> > that I set up to be a mirror. The result is that dummy0 "paces" the flow,
> > and its a cracking pace at that.
> > 
> > As an experiment I hacked do_execute_actions() to use the original skb
> > for the first action instead of the last one.  In my case the result was
> > that eth1 "paces" the flow, and things work reasonably nicely.
> > 
> > Well, sort of. Things work well for non-GSO skbs but extremely poorly for
> > GSO skbs where only 3 (yes 3, not 3%) end up at the remote host running
> > netserv. I'm unsure why, but I digress.
> > 
> > It seems to me that my hack illustrates the point that the flow ends up
> > being "paced" by one interface. However I think that what would be
> > desirable is that the flow is "paced" by the slowest link. Unfortunately
> > I'm unsure how to achieve that.
> > 
> 
> Hi Simon !
> 
> "pacing" is done because skb is attached to a socket, and a socket has a
> limited (but configurable) sndbuf. sk->sk_wmem_alloc is the current sum
> of all truesize skbs in flight.
> 
> When you enter something that :
> 
> 1) Get a clone of the skb, queue the clone to device X
> 2) queue the original skb to device Y
> 
> Then :	Socket sndbuf is not affected at all by device X queue.
> 	This is speed on device Y that matters.
> 
> You want to get servo control on both X and Y
> 
> You could try to
> 
> 1) Get a clone of skb
>    Attach it to socket too (so that socket get a feedback of final
> orphaning for the clone) with skb_set_owner_w()
>    queue the clone to device X
> 
> Unfortunatly, stacked skb->destructor() makes this possible only for
> known destructor (aka sock_wfree())

Hi Eric !

Thanks for the advice. I had thought about the socket buffer but at some
point it slipped my mind.

In any case the following patch seems to implement the change that I had in
mind. However my discussions Michael Tsirkin elsewhere in this thread are
beginning to make me think that think that perhaps this change isn't the
best solution.

diff --git a/datapath/actions.c b/datapath/actions.c
index 5e16143..505f13f 100644
--- a/datapath/actions.c
+++ b/datapath/actions.c
@@ -384,7 +384,12 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
 
 	for (a = actions, rem = actions_len; rem > 0; a = nla_next(a, &rem)) {
 		if (prev_port != -1) {
-			do_output(dp, skb_clone(skb, GFP_ATOMIC), prev_port);
+			struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
+			if (nskb) {
+				if (skb->sk)
+					skb_set_owner_w(nskb, skb->sk);
+				do_output(dp, nskb, prev_port);
+			}
 			prev_port = -1;
 		}

I got a rather nasty panic without the if (skb->sk),
I guess some skbs don't have a socket.

^ permalink raw reply related

* Re: [RFC v2 PATCH] m68knommu: added dm9000 support
From: Angelo Dureghello @ 2011-01-06 12:39 UTC (permalink / raw)
  To: Arnaud Lacombe; +Cc: linux-kernel, netdev, m68k
In-Reply-To: <AANLkTi=E0hkTUSd-rXWh5tkct0wwYX_75E-oaSWaa2bn@mail.gmail.com>

thanks for the review, 
- sure, i agree the sw byte swap could be related to a BIG_ENDIAN config option only.
- about HW, can be replaced with something like "dm9000 to cpu bus wiring", i will be more clear in my next patch version.
- i will remove blank lines
- readsX and writesX was completely missing for m68knommu, i have seen that the most appropriate file where to create them was probably "linux/arch/m68k/include/asm/io_no.h" but any comment is appreciated.
- i am adding m68k guys list from now.

thanks
angelo

On 06/01/2011 00:51, Arnaud Lacombe wrote:
> Hi,
> 
> [disclamer: I have no power whatsoever on the dm9k driver, and the
> only dm9000 chip I've access to belongs to hardware whose parent
> company is now defunct.]
> 
> Btw, Linux 68k folks is missing from the CC list.
> 
> On Wed, Jan 5, 2011 at 1:03 PM, Angelo Dureghello <angelo70@gmail.com> wrote:
>> This patch allows to use the dm9000 network chip with a m68knommu
>> big-endian cpu. From the HW point of view,
> what hardware ?
> 
>> the cpu data bus connected to
>> the dm9000 chip should be hardware-byte-swapped, crossing the bytes
>> wires (D0:7 to D24:31, etc.). In anyway, has been also added an option
>> to swap the bytes in the driver, if some cpu has been wired straight
>> D0:D31 to dm9000.
>>
>> Signed-off-by: Angelo Dureghello <angelo70@gmail.com>
>> ---
>>
>> --- linux/drivers/net/Kconfig.orig      2011-01-05 17:11:37.992376124 +0100
>> +++ linux/drivers/net/Kconfig   2011-01-04 22:33:14.132301872 +0100
>> @@ -960,7 +960,7 @@ config TI_DAVINCI_EMAC
>>
>>  config DM9000
>>        tristate "DM9000 support"
>> -       depends on ARM || BLACKFIN || MIPS
>> +       depends on COLDFIRE || ARM || BLACKFIN || MIPS
>>        select CRC32
>>        select MII
>>        ---help---
>> @@ -986,6 +986,14 @@ config DM9000_FORCE_SIMPLE_PHY_POLL
>>          costly MII PHY reads. Note, this will not work if the chip is
>>          operating with an external PHY.
>>
>> +config DM9000_32BIT_SW_SWAP
>> +       bool "Software byte swap for 32 bit data bus"
>> +       depends on DM9000 && COLDFIRE
>> +       ---help---
>> +         This configuration allows to swap data bytes from the dm9000
>> +         driver itself, when the big endian cpu is wired straight to
>> +         the dm9000 32 bit data bus.
>> +
> I'm not sure COLDFIRE is the best dependency. AFAICS, it should
> depends on endianness, and potentially board specific settings.
> 
>>  config ENC28J60
>>        tristate "ENC28J60 support"
>>        depends on EXPERIMENTAL && SPI && NET_ETHERNET
>> @@ -3347,4 +3355,3 @@ config VMXNET3
>>          module will be called vmxnet3.
>>
>>  endif # NETDEVICES
>> -
>>
> useless whitespace change ...
> 
>> --- linux/drivers/net/dm9000.c.orig     2010-12-30 23:19:39.747836070 +0100
>> +++ linux/drivers/net/dm9000.c  2011-01-05 16:30:48.636116500 +0100
>> [...]
>> @@ -981,7 +1032,11 @@ dm9000_rx(struct net_device *dev)
>>                ior(db, DM9000_MRCMDX); /* Dummy read */
>>
>>                /* Get most updated data */
>> -               rxbyte = readb(db->io_data);
>> +#ifdef CONFIG_DM9000_32BIT_SW_SWAP
>> +      rxbyte = (u8)readl(db->io_data);
>> +#else
>> +      rxbyte = readb(db->io_data);
>> +#endif
>>
>>                /* Status check: this byte must be 0 or 1 */
>>                if (rxbyte & DM9000_PKT_ERR) {
>> @@ -996,8 +1051,13 @@ dm9000_rx(struct net_device *dev)
>>
>>                /* A packet ready now  & Get status/length */
>>                GoodPacket = true;
>> -               writeb(DM9000_MRCMD, db->io_addr);
>>
>> +#ifdef CONFIG_DM9000_32BIT_SW_SWAP
>> +               writel(DM9000_MRCMD, db->io_addr);
>> +#else
>> +      writeb(DM9000_MRCMD, db->io_addr);
>> +#endif
>> +
> All these #ifdef make the driver quite horrible to read.
> 
>>                (db->inblk)(db->io_data, &rxhdr, sizeof(rxhdr));
>>
>>                RxLen = le16_to_cpu(rxhdr.RxLen);
>> @@ -1077,7 +1137,7 @@ static irqreturn_t dm9000_interrupt(int
>>        unsigned long flags;
>>        u8 reg_save;
>>
>> -       dm9000_dbg(db, 3, "entering %s\n", __func__);
>> +       //dm9000_dbg(db, 3, "entering %s\n", __func__);
>>
> C++ comment ... Why do you comment this ?
> 
>>        /* Disable all interrupts */
>>        iow(db, DM9000_IMR, IMR_PAR);
>> @@ -1100,7 +1164,7 @@ static irqreturn_t dm9000_interrupt(int
>>        /* Received the coming packet */
>>        if (int_status & ISR_PRS)
>>                dm9000_rx(dev);
>> -
>> +
> whitespace ...
> 
>> @@ -1233,11 +1301,15 @@ dm9000_phy_read(struct net_device *dev,
>>        int ret;
>>
>>        mutex_lock(&db->addr_lock);
>> -
>> +
> whitespace ...
> 
>> @@ -1713,4 +1811,3 @@ MODULE_AUTHOR("Sascha Hauer, Ben Dooks")
>>  MODULE_DESCRIPTION("Davicom DM9000 network driver");
>>  MODULE_LICENSE("GPL");
>>  MODULE_ALIAS("platform:dm9000");
>> -
> whitespace ...
> 
>> --- linux/arch/m68k/include/asm/io_no.h.orig    2011-01-05 16:53:55.904905038 +0100
>> +++ linux/arch/m68k/include/asm/io_no.h 2011-01-04 23:45:08.893049554 +0100
>> @@ -47,6 +47,91 @@ static inline unsigned int _swapl(volati
>>  #define writew(b,addr) (void)((*(volatile unsigned short *) (addr)) = (b))
>>  #define writel(b,addr) (void)((*(volatile unsigned int *) (addr)) = (b))
>>
>> +static inline void writesb (void __iomem *reg, void *data, int count)
>> +{
>> +       unsigned char *p = (unsigned char*) data;
>> +
>> +       while (count--) writeb(*p++, reg);
>> +}
>> +
>> +static inline void writesbsw (void __iomem *reg, void *data, int count)
>> +{
>> +       unsigned char *p = (unsigned char *) data;
>> +
>> +       while (count--) writel((int)(*p++), reg);
>> +}
>> +
>> +static inline void writesw (void __iomem *reg, void *data, int count)
>> +{
>> +   unsigned short *p = (unsigned short*) data;
>> +
>> +   while (count--) writew(*p++, reg);
>> +}
>> +
>> +static inline void writeswsw (void __iomem *reg, void *data, int count)
>> +{
>> +   unsigned short *p = (unsigned short *) data;
>> +
>> +   while (count--) writel((int)(_swapw(*p++)), reg);
>> +}
>> +
>> +static inline void writesl (void __iomem *reg, void *data, int count)
>> +{
>> +   unsigned long *p = (unsigned long*) data;
>> +
>> +   while (count--) writel(*p++, reg);
>> +}
>> +
>> +static inline void writeslsw (void __iomem *reg, void *data, int count)
>> +{
>> +   unsigned long *p = (unsigned long *) data;
>> +
>> +   while (count--) writel((int)(_swapl(*p++)), reg);
>> +}
>> +
>> +static inline void readsb (void __iomem *reg, void *data, int count)
>> +{
>> +   unsigned char *p = (unsigned char *) data;
>> +
>> +   while (count--) *p++ = readb(reg);
>> +}
>> +
>> +static inline void readsbsw (void __iomem *reg, void *data, int count)
>> +{
>> +   unsigned char *p = (unsigned char *) data;
>> +
>> +   while (count--) *p++ = (unsigned char)readl(reg);
>> +}
>> +
>> +static inline void readsw (void __iomem *reg, void *data, int count)
>> +{
>> +   unsigned short *p = (unsigned short *) data;
>> +
>> +   while (count--) *p++ = readb(reg);
>> +}
>> +
>> +static inline void readswsw (void __iomem *reg, void *data, int count)
>> +{
>> +   unsigned short *p = (unsigned short *) data;
>> +
>> +   while (count--) *p++ = _swapw((unsigned short)readw(reg));
>> +}
>> +
>> +static inline void readsl (void __iomem *reg, void *data, int count)
>> +{
>> +   unsigned long *p = (unsigned long *) data;
>> +
>> +   while (count--) *p++ = readb(reg);
>> +}
>> +
>> +static inline void readslsw (void __iomem *reg, void *data, int count)
>> +{
>> +   unsigned long *p = (unsigned long *) data;
>> +
>> +   while (count--) *p++ = _swapl(readl(reg));
>> +}
>> +
>> +
>>  #define __raw_readb readb
>>  #define __raw_readw readw
>>  #define __raw_readl readl
>> @@ -180,4 +265,3 @@ extern void iounmap(void *addr);
>>  #endif /* __KERNEL__ */
>>
>>  #endif /* _M68KNOMMU_IO_H */
> Could not this be moved to a separate patch ? That way arch specific
> changes are separated from the network arch-indep changes.
> 
>  - Arnaud
> 
>> -
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>>

^ permalink raw reply

* Re: Flow Control and Port Mirroring Revisited
From: Simon Horman @ 2011-01-06 12:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rusty Russell, virtualization, Jesse Gross, dev, virtualization,
	netdev, kvm
In-Reply-To: <20110106120722.GD12142@redhat.com>

On Thu, Jan 06, 2011 at 02:07:22PM +0200, Michael S. Tsirkin wrote:
> On Thu, Jan 06, 2011 at 08:30:52PM +0900, Simon Horman wrote:
> > On Thu, Jan 06, 2011 at 12:27:55PM +0200, Michael S. Tsirkin wrote:
> > > On Thu, Jan 06, 2011 at 06:33:12PM +0900, Simon Horman wrote:
> > > > Hi,
> > > > 
> > > > Back in October I reported that I noticed a problem whereby flow control
> > > > breaks down when openvswitch is configured to mirror a port[1].
> > > 
> > > Apropos the UDP flow control.  See this
> > > http://www.spinics.net/lists/netdev/msg150806.html
> > > for some problems it introduces.
> > > Unfortunately UDP does not have built-in flow control.
> > > At some level it's just conceptually broken:
> > > it's not present in physical networks so why should
> > > we try and emulate it in a virtual network?
> > > 
> > > 
> > > Specifically, when you do:
> > > # netperf -c -4 -t UDP_STREAM -H 172.17.60.218 -l 30 -- -m 1472
> > > You are asking: what happens if I push data faster than it can be received?
> > > But why is this an interesting question?
> > > Ask 'what is the maximum rate at which I can send data with %X packet
> > > loss' or 'what is the packet loss at rate Y Gb/s'. netperf has
> > > -b and -w flags for this. It needs to be configured
> > > with --enable-intervals=yes for them to work.
> > > 
> > > If you pose the questions this way the problem of pacing
> > > the execution just goes away.
> > 
> > I am aware that UDP inherently lacks flow control.
> 
> Everyone's is aware of that, but this is always followed by a 'however'
> :).
> 
> > The aspect of flow control that I am interested in is situations where the
> > guest can create large amounts of work for the host. However, it seems that
> > in the case of virtio with vhostnet that the CPU utilisation seems to be
> > almost entirely attributable to the vhost and qemu-system processes.  And
> > in the case of virtio without vhost net the CPU is used by the qemu-system
> > process. In both case I assume that I could use a cgroup or something
> > similar to limit the guests.
> 
> cgroups, yes. the vhost process inherits the cgroups
> from the qemu process so you can limit them all.
> 
> If you are after limiting the max troughput of the guest
> you can do this with cgroups as well.

Do you mean a CPU cgroup or something else?

> > Assuming all of that is true then from a resource control problem point of
> > view, which is mostly what I am concerned about, the problem goes away.
> > However, I still think that it would be nice to resolve the situation I
> > described.
> 
> We need to articulate what's wrong here, otherwise we won't
> be able to resolve the situation. We are sending UDP packets
> as fast as we can and some receivers can't cope. Is this the problem?
> We have made attempts to add a pseudo flow control in the past
> in an attempt to make UDP on the same host work better.
> Maybe they help some but they also sure introduce problems.

In the case where port mirroring is not active, which is the
usual case, to some extent there is flow control in place due to
(as Eric Dumazet pointed out) the socket buffer.

When port mirroring is activated the flow control operates based
only on one port - which can't be controlled by the administrator
in an obvious way.

I think that it would be more intuitive if flow control was
based on sending a packet to all ports rather than just one.

Though now I think about it some more, perhaps this isn't the best either.
For instance the case where data was being sent to dummy0 and suddenly
adding a mirror on eth1 slowed everything down.

So perhaps there needs to be another knob to tune when setting
up port-mirroring. Or perhaps the current situation isn't so bad.

^ permalink raw reply

* Re: Flow Control and Port Mirroring Revisited
From: Michael S. Tsirkin @ 2011-01-06 12:07 UTC (permalink / raw)
  To: Simon Horman
  Cc: Rusty Russell, virtualization, Jesse Gross, dev, virtualization,
	netdev, kvm
In-Reply-To: <20110106113052.GA2541@verge.net.au>

On Thu, Jan 06, 2011 at 08:30:52PM +0900, Simon Horman wrote:
> On Thu, Jan 06, 2011 at 12:27:55PM +0200, Michael S. Tsirkin wrote:
> > On Thu, Jan 06, 2011 at 06:33:12PM +0900, Simon Horman wrote:
> > > Hi,
> > > 
> > > Back in October I reported that I noticed a problem whereby flow control
> > > breaks down when openvswitch is configured to mirror a port[1].
> > 
> > Apropos the UDP flow control.  See this
> > http://www.spinics.net/lists/netdev/msg150806.html
> > for some problems it introduces.
> > Unfortunately UDP does not have built-in flow control.
> > At some level it's just conceptually broken:
> > it's not present in physical networks so why should
> > we try and emulate it in a virtual network?
> > 
> > 
> > Specifically, when you do:
> > # netperf -c -4 -t UDP_STREAM -H 172.17.60.218 -l 30 -- -m 1472
> > You are asking: what happens if I push data faster than it can be received?
> > But why is this an interesting question?
> > Ask 'what is the maximum rate at which I can send data with %X packet
> > loss' or 'what is the packet loss at rate Y Gb/s'. netperf has
> > -b and -w flags for this. It needs to be configured
> > with --enable-intervals=yes for them to work.
> > 
> > If you pose the questions this way the problem of pacing
> > the execution just goes away.
> 
> I am aware that UDP inherently lacks flow control.

Everyone's is aware of that, but this is always followed by a 'however'
:).

> The aspect of flow control that I am interested in is situations where the
> guest can create large amounts of work for the host. However, it seems that
> in the case of virtio with vhostnet that the CPU utilisation seems to be
> almost entirely attributable to the vhost and qemu-system processes.  And
> in the case of virtio without vhost net the CPU is used by the qemu-system
> process. In both case I assume that I could use a cgroup or something
> similar to limit the guests.

cgroups, yes. the vhost process inherits the cgroups
from the qemu process so you can limit them all.

If you are after limiting the max troughput of the guest
you can do this with cgroups as well.

> Assuming all of that is true then from a resource control problem point of
> view, which is mostly what I am concerned about, the problem goes away.
> However, I still think that it would be nice to resolve the situation I
> described.

We need to articulate what's wrong here, otherwise we won't
be able to resolve the situation. We are sending UDP packets
as fast as we can and some receivers can't cope. Is this the problem?
We have made attempts to add a pseudo flow control in the past
in an attempt to make UDP on the same host work better.
Maybe they help some but they also sure introduce problems.

-- 
MST

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox