Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH nf-next-2.6] conntrack: IPS_UNTRACKED bit
From: Patrick McHardy @ 2010-06-08 14:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Changli Gao, Netfilter Developers, netdev
In-Reply-To: <1275668732.2482.201.camel@edumazet-laptop>

On 04.06.2010 18:25, Eric Dumazet wrote:
> [PATCH nf-next-2.6] conntrack: IPS_UNTRACKED bit
> 
> NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
> twice per packet. This is bad for performance.
> __read_mostly annotation is also a bad choice.
> 
> This patch introduces IPS_UNTRACKED bit so that we can use later a
> per_cpu untrack structure more easily.
> 
> A new helper, nf_ct_untracked_get() returns a pointer to
> nf_conntrack_untracked.
> 
> Another one, nf_ct_untracked_status_or() is used by nf_nat_init() to add
> IPS_NAT_DONE_MASK bits to untracked status.
> 
> nf_ct_is_untracked() prototype is changed to work on a nf_conn pointer.

Applied, thanks Eric.

^ permalink raw reply

* Re: RFS seems to have incompatiblities with bridged vlans
From: Eric Dumazet @ 2010-06-08 14:18 UTC (permalink / raw)
  To: John Fastabend
  Cc: Stephen Hemminger, Peter Lieven, davem@davemloft.net,
	netdev@vger.kernel.org
In-Reply-To: <4C0D7312.1020300@intel.com>

Le lundi 07 juin 2010 à 15:30 -0700, John Fastabend a écrit :

> There is always a possibility that the underlying device sets the 
> queue_mapping to be greater then num_cpus.  Also I suspect the same 
> issue exists with bonding devices.  Maybe something like the following 
> is worth while? compile tested only,
> 
> [PATCH] 8021q: vlan reassigns dev without check queue_mapping
> 
> recv path reassigns skb->dev without sanity checking the
> queue_mapping field.  This can result in the queue_mapping
> field being set incorrectly if the new dev supports less
> queues then the underlying device.
> 
> This patch just resets the queue_mapping to 0 which should
> resolve this issue?  Any thoughts?
> 
> The same issue could happen on bonding devices as well.
> 
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
> 
>   net/8021q/vlan_core.c |    6 ++++++
>   1 files changed, 6 insertions(+), 0 deletions(-)
> 
> diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
> index bd537fc..ad309f8 100644
> --- a/net/8021q/vlan_core.c
> +++ b/net/8021q/vlan_core.c
> @@ -21,6 +21,9 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct 
> vlan_group *grp,
>   	if (!skb->dev)
>   		goto drop;
> 
> +	if (unlikely(skb->queue_mapping >= skb->dev->real_num_tx_queues))
> +		skb_set_queue_mapping(skb, 0);
> +
>   	return (polling ? netif_receive_skb(skb) : netif_rx(skb));
> 
>   drop:
> @@ -93,6 +96,9 @@ vlan_gro_common(struct napi_struct *napi, struct 
> vlan_group *grp,
>   	if (!skb->dev)
>   		goto drop;
> 
> +	if (unlikely(skb->queue_mapping >= skb->dev->real_num_tx_queues))
> +		skb_set_queue_mapping(skb, 0);
> +
>   	for (p = napi->gro_list; p; p = p->next) {
>   		NAPI_GRO_CB(p)->same_flow =
>   			p->dev == skb->dev && !compare_ether_header(
> --

Only a workaround, added in hot path in a otherwise 'good' driver
(multiqueue enabled and ready)

eth0  -------> bond / bridge ---------> vlan.id
(nbtxq=8)      (ntxbq=1)        (nbtxq=X)

X is capped to 1 because of bond/bridge, while bond has no "queue"
(LLTX driver)

Solutions :

1) queue_mapping could be silently tested in get_rps_cpu()...

diff --git a/net/core/dev.c b/net/core/dev.c
index 6f330ce..3a3f7f6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2272,14 +2272,11 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 
 	if (skb_rx_queue_recorded(skb)) {
 		u16 index = skb_get_rx_queue(skb);
-		if (unlikely(index >= dev->num_rx_queues)) {
-			if (net_ratelimit()) {
-				pr_warning("%s received packet on queue "
-					"%u, but number of RX queues is %u\n",
-					dev->name, index, dev->num_rx_queues);
-			}
-			goto done;
-		}
+		if (WARN_ONCE(index >= dev->num_rx_queues,
+				KERN_WARNING "%s received packet on queue %u, "
+				"but number of RX queues is %u\n",
+				dev->name, index, dev->num_rx_queues))
+			index %= dev->num_rx_queues;
 		rxqueue = dev->_rx + index;
 	} else
 		rxqueue = dev->_rx;



2) bond/bridge should setup more queues, just in case.
   We probably need to be able to make things more dynamic,
   (propagate nbtxq between layers) but not for 2.6.35



diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 5e12462..ce813dd 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -5012,8 +5012,8 @@ int bond_create(struct net *net, const char *name)
 
 	rtnl_lock();
 
-	bond_dev = alloc_netdev(sizeof(struct bonding), name ? name : "",
-				bond_setup);
+	bond_dev = alloc_netdev_mq(sizeof(struct bonding), name ? name : "",
+				   bond_setup, max(64, nr_cpu_ids));
 	if (!bond_dev) {
 		pr_err("%s: eek! can't alloc netdev!\n", name);
 		rtnl_unlock();



^ permalink raw reply related

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to classes
From: Kay Sievers @ 2010-06-08 14:21 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <1276005970.3706.132.camel@jlt3.sipsolutions.net>

On Tue, Jun 8, 2010 at 16:06, Johannes Berg <johannes@sipsolutions.net> wrote:
> On Tue, 2010-06-08 at 15:54 +0200, Kay Sievers wrote:
>> On Tue, Jun 8, 2010 at 14:03, Johannes Berg <johannes@sipsolutions.net> wrote:
>> > On Tue, 2010-06-08 at 13:55 +0200, Kay Sievers wrote:
>> >
>> >> Ok, this needs testing.
>> >>
>> >> Please let me know, if that works for you so far.
>> >
>> > Will do. Just a question: none of this seems to pin the module? So can I
>> > be sure if I rmmod the module that the release function will still be
>> > around etc.?
>>
>> Oh, sorry, that's something very specific to network devices, that the
>> module can be unloaded while the devices it has created are still in
>> use. I am not sure right now what's needed to make this work in a
>> single module.
>
> Well they will be unregistered and everything, but once all the netdevs
> are gone etc. the devices you create in the patch might stick around
> because somebody has them open in sysfs, and I see nothing that would
> pin the module in that case?

That's what I mean, I have no idea how to solve that with network
devices. I don't think any other subsytem allows to unload the module,
when devices, the module has created, are in use.

With this patch, it is likely to fail, even without sysfs, just when
the netif is still in up, and the module is unloaded.

The current code uses the in-core class_create() logic, which was only
meant for devices with a device node, and which is cleaned up by the
core itself. That's why this issue never appeared before.

But as said, I have no idea how to solve that with a single module. It
might not work at all without moving stuff into the core. That people
use device_create() with no major/minor might indicate that something
else is needed here. :)

Kay

^ permalink raw reply

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to  classes
From: Johannes Berg @ 2010-06-08 14:26 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <AANLkTikB0Amzi61JLxdY5HL-2s_nbYo3JH8vXooXzHdQ@mail.gmail.com>

On Tue, 2010-06-08 at 16:21 +0200, Kay Sievers wrote:

> > Well they will be unregistered and everything, but once all the netdevs
> > are gone etc. the devices you create in the patch might stick around
> > because somebody has them open in sysfs, and I see nothing that would
> > pin the module in that case?
> 
> That's what I mean, I have no idea how to solve that with network
> devices. I don't think any other subsytem allows to unload the module,
> when devices, the module has created, are in use.

You're right ... the would only be unregistered from your release, which
would happen after the module is long gone ...

> The current code uses the in-core class_create() logic, which was only
> meant for devices with a device node, and which is cleaned up by the
> core itself. That's why this issue never appeared before.
> 
> But as said, I have no idea how to solve that with a single module. It
> might not work at all without moving stuff into the core. That people
> use device_create() with no major/minor might indicate that something
> else is needed here. :)

So ... can we apply Eric's patch for now then?

johannes

^ permalink raw reply

* Re: [PATCH nf-next-2.6 2/2] conntrack: per_cpu untracking
From: Patrick McHardy @ 2010-06-08 14:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Changli Gao, Netfilter Developers, netdev
In-Reply-To: <1275682522.2490.6.camel@edumazet-laptop>

On 04.06.2010 22:15, Eric Dumazet wrote:
> NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
> twice per packet, slowing down performance.
> 
> This patch converts it to a per_cpu variable.
> 
> We assume same cpu is used for a given packet, entering and exiting the
> NOTRACK state.

That doesn't seem to be a valid assumption, the conntrack entry is
attached to the skb and processing in the output path might get
preempted and rescheduled to a different CPU.

^ permalink raw reply

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to classes
From: Kay Sievers @ 2010-06-08 14:47 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <1276007174.3706.133.camel@jlt3.sipsolutions.net>

On Tue, Jun 8, 2010 at 16:26, Johannes Berg <johannes@sipsolutions.net> wrote:
> On Tue, 2010-06-08 at 16:21 +0200, Kay Sievers wrote:
>
>> > Well they will be unregistered and everything, but once all the netdevs
>> > are gone etc. the devices you create in the patch might stick around
>> > because somebody has them open in sysfs, and I see nothing that would
>> > pin the module in that case?
>>
>> That's what I mean, I have no idea how to solve that with network
>> devices. I don't think any other subsytem allows to unload the module,
>> when devices, the module has created, are in use.
>
> You're right ... the would only be unregistered from your release, which
> would happen after the module is long gone ...
>
>> The current code uses the in-core class_create() logic, which was only
>> meant for devices with a device node, and which is cleaned up by the
>> core itself. That's why this issue never appeared before.
>>
>> But as said, I have no idea how to solve that with a single module. It
>> might not work at all without moving stuff into the core. That people
>> use device_create() with no major/minor might indicate that something
>> else is needed here. :)
>
> So ... can we apply Eric's patch for now then?

It might break other stuff we don't know about yet. Just like we did
not know about what things hwsim is doing here. :)

Eric's patch change the sysfs layout and insert directories into the
device path, and stuff which hardcodes things, or accesses devices
with ../<name> will break.

I don't mind trying it, but I wouldn't be surprised if that needs to
be reverted when issues caused by this appear.

The hwsim issues are caused by the current hwsim code, by doing things
it should not do. Class devices of different classes must never be
stacked (the core should not allow that in the first place). Class
devices must never have a driver assigned behind its back. Also
device_create() should not be used for devices without a major/minor
(but that seems to be done in several other places too).

To fix the hwsim driver core interaction, core changes will probably
be needed to allow network modules to be removed while their devices
are active. That's something which seems not to work for bus devices
currently.

Kay

^ permalink raw reply

* Re: [PATCH nf-next-2.6 2/2] conntrack: per_cpu untracking
From: Eric Dumazet @ 2010-06-08 14:52 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Changli Gao, Netfilter Developers, netdev
In-Reply-To: <4C0E53C4.7090308@trash.net>

Le mardi 08 juin 2010 à 16:29 +0200, Patrick McHardy a écrit :
> On 04.06.2010 22:15, Eric Dumazet wrote:
> > NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
> > twice per packet, slowing down performance.
> > 
> > This patch converts it to a per_cpu variable.
> > 
> > We assume same cpu is used for a given packet, entering and exiting the
> > NOTRACK state.
> 
> That doesn't seem to be a valid assumption, the conntrack entry is
> attached to the skb and processing in the output path might get
> preempted and rescheduled to a different CPU.

Thats unfortunate.

Ok, only choice then is to not change refcount on the untracked ct, and
keep a shared (read only after setup time) untrack structure.





--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to  classes
From: Johannes Berg @ 2010-06-08 15:06 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <AANLkTikNEzJHPrOUK6DKMoWvcblTBInaIhVbk96OtZxe@mail.gmail.com>

On Tue, 2010-06-08 at 16:47 +0200, Kay Sievers wrote:

> > So ... can we apply Eric's patch for now then?
> 
> It might break other stuff we don't know about yet. Just like we did
> not know about what things hwsim is doing here. :)

True.

> The hwsim issues are caused by the current hwsim code, by doing things
> it should not do. Class devices of different classes must never be
> stacked (the core should not allow that in the first place). Class
> devices must never have a driver assigned behind its back. Also
> device_create() should not be used for devices without a major/minor
> (but that seems to be done in several other places too).

Back when hwsim was written that would have been useful feedback to
whoever did ... now, not so much.

> To fix the hwsim driver core interaction, core changes will probably
> be needed to allow network modules to be removed while their devices
> are active. That's something which seems not to work for bus devices
> currently.

Well it just needs to pin the module refcount from the bus and/or device
struct. That seems not too hard?

johannes

^ permalink raw reply

* Re: [Linux-ATM-General] RX/close vcc race with solos/atmtcp/usbatm/he
From: Chas Williams (CONTRACTOR) @ 2010-06-08 15:05 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-atm-general, netdev
In-Reply-To: <1275943792.17903.5119.camel@macbook.infradead.org>

In message <1275943792.17903.5119.camel@macbook.infradead.org>,David Woodhouse 
writes:
>On Mon, 2010-06-07 at 12:37 -0400, Chas Williams (CONTRACTOR) wrote:
>> i dont understand.  if you do a sock_hold() in find_vcc(), and then call
>> vcc->push() you should be able to call vcc->push() and then sock_put(). 
>
>Holding the reference doesn't stop the problem. The problem is
>
> vcc_release()
> --> vcc_destroy_socket()
>   --> br2684_push(vcc, NULL)
>         sets vcc->user_back = NULL
>         (which it what causes the oops when try try to feed it any
>          subsequent packets).
>
> Only _later_ does vcc_release() call sock_put().

hmm... perhaps this routine needs to take the vcc_sklist_lock because
it is going to modify the vcc.  or we need to use locking on the vcc
itself.  it seems like the lock in vcc_remove_socket() just needs move
up one routine to encompass this whole 'closing'  operation.  the vcc
is going away.  we dont want people to be able to find it and use it.
it is not enough to just flip the flags on the vcc obviously.

you took a reference to an object inside a hashed list and didnt do
anything to prevent the object from leaving the hashed list. that is
stil not correct IMHO.

^ permalink raw reply

* Re: [PATCH nf-next-2.6 2/2] conntrack: per_cpu untracking
From: Eric Dumazet @ 2010-06-08 15:12 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Changli Gao, Netfilter Developers, netdev
In-Reply-To: <1276008733.2486.177.camel@edumazet-laptop>

Le mardi 08 juin 2010 à 16:52 +0200, Eric Dumazet a écrit :
> Le mardi 08 juin 2010 à 16:29 +0200, Patrick McHardy a écrit :
> > On 04.06.2010 22:15, Eric Dumazet wrote:
> > > NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
> > > twice per packet, slowing down performance.
> > > 
> > > This patch converts it to a per_cpu variable.
> > > 
> > > We assume same cpu is used for a given packet, entering and exiting the
> > > NOTRACK state.
> > 
> > That doesn't seem to be a valid assumption, the conntrack entry is
> > attached to the skb and processing in the output path might get
> > preempted and rescheduled to a different CPU.
> 
> Thats unfortunate.
> 
> Ok, only choice then is to not change refcount on the untracked ct, and
> keep a shared (read only after setup time) untrack structure.
> 
> 

Oh well, re-reading my patch, I dont see why I said this in Changelog :)

We lazily select the untrack structure in one cpu, then keep the pointer
to this untrack structure, attached to ct.

The (still atomic) increment / decrement of refcount is done on the
saved pointer, not on actual per_cpu structure.

So if a packet is rescheduled on a different CPU, second cpu will "only"
dirty cache line of other cpu, it probably almost never happens...

Thanks

[PATCH nf-next-2.6 2/2] conntrack: per_cpu untracking

NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
twice per packet, slowing down performance.

This patch converts it to a per_cpu variable.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/netfilter/nf_conntrack.h |    5 +--
 net/netfilter/nf_conntrack_core.c    |   36 ++++++++++++++++++-------
 2 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 3bc38c7..84a4b6f 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -261,11 +261,10 @@ extern s16 (*nf_ct_nat_offset)(const struct nf_conn *ct,
 			       u32 seq);
 
 /* Fake conntrack entry for untracked connections */
+DECLARE_PER_CPU(struct nf_conn, nf_conntrack_untracked);
 static inline struct nf_conn *nf_ct_untracked_get(void)
 {
-	extern struct nf_conn nf_conntrack_untracked;
-
-	return &nf_conntrack_untracked;
+	return &__raw_get_cpu_var(nf_conntrack_untracked);
 }
 extern void nf_ct_untracked_status_or(unsigned long bits);
 
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 6c1da21..9c66141 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -62,8 +62,8 @@ EXPORT_SYMBOL_GPL(nf_conntrack_htable_size);
 unsigned int nf_conntrack_max __read_mostly;
 EXPORT_SYMBOL_GPL(nf_conntrack_max);
 
-struct nf_conn nf_conntrack_untracked;
-EXPORT_SYMBOL_GPL(nf_conntrack_untracked);
+DEFINE_PER_CPU(struct nf_conn, nf_conntrack_untracked);
+EXPORT_PER_CPU_SYMBOL(nf_conntrack_untracked);
 
 static int nf_conntrack_hash_rnd_initted;
 static unsigned int nf_conntrack_hash_rnd;
@@ -1183,10 +1183,21 @@ static void nf_ct_release_dying_list(struct net *net)
 	spin_unlock_bh(&nf_conntrack_lock);
 }
 
+static int untrack_refs(void)
+{
+	int cnt = 0, cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct nf_conn *ct = &per_cpu(nf_conntrack_untracked, cpu);
+
+		cnt += atomic_read(&ct->ct_general.use) - 1;
+	}
+	return cnt;
+}
+
 static void nf_conntrack_cleanup_init_net(void)
 {
-	/* wait until all references to nf_conntrack_untracked are dropped */
-	while (atomic_read(&nf_conntrack_untracked.ct_general.use) > 1)
+	while (untrack_refs() > 0)
 		schedule();
 
 	nf_conntrack_helper_fini();
@@ -1323,14 +1334,17 @@ module_param_call(hashsize, nf_conntrack_set_hashsize, param_get_uint,
 
 void nf_ct_untracked_status_or(unsigned long bits)
 {
-	nf_conntrack_untracked.status |= bits;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		per_cpu(nf_conntrack_untracked, cpu).status |= bits;
 }
 EXPORT_SYMBOL_GPL(nf_ct_untracked_status_or);
 
 static int nf_conntrack_init_init_net(void)
 {
 	int max_factor = 8;
-	int ret;
+	int ret, cpu;
 
 	/* Idea from tcp.c: use 1/16384 of memory.  On i386: 32MB
 	 * machine has 512 buckets. >= 1GB machines have 16384 buckets. */
@@ -1369,10 +1383,12 @@ static int nf_conntrack_init_init_net(void)
 		goto err_extend;
 #endif
 	/* Set up fake conntrack: to never be deleted, not in any hashes */
-#ifdef CONFIG_NET_NS
-	nf_conntrack_untracked.ct_net = &init_net;
-#endif
-	atomic_set(&nf_conntrack_untracked.ct_general.use, 1);
+	for_each_possible_cpu(cpu) {
+		struct nf_conn *ct = &per_cpu(nf_conntrack_untracked, cpu);
+
+		write_pnet(&ct->ct_net, &init_net);
+		atomic_set(&ct->ct_general.use, 1);
+	}
 	/*  - and look it like as a confirmed connection */
 	nf_ct_untracked_status_or(IPS_CONFIRMED | IPS_UNTRACKED);
 	return 0;


--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH][RFC] Infrastructure for out-of-band parameter passing
From: Randy Dunlap @ 2010-06-08 15:31 UTC (permalink / raw)
  To: David VomLehn; +Cc: to, netdev
In-Reply-To: <20100608003049.GA29350@dvomlehn-lnx2.corp.sa.net>

On Mon, 7 Jun 2010 17:30:49 -0700 David VomLehn wrote:

> Infrastructure to support out of band/indirect passing of data to functions.
> 
> It is useful at times to be able to pass data from one function to another
> nested many stack frames more deeply than the passing function. Doing so
> allows the interfaces in the intervening functions to be simpler, though
> this "hidden" information passing risks increased complexity. In cases
> where this is justified, this simple infrastructure provides the
> functionality.
> 
> Out of band data passing is implemented by adding, for each instance,
> an element to the task_struct that serves as the pointer to the top
> of a OOB parameter stack. Data is made available by being pushed
> on the OOB parameter stack by a function, and accessed via the top
> element of the OOB parameter stack.
> 
> Signed-off-by: David VomLehn (dvomlehn@cisco.com)
> ---
>  include/linux/oobparam.h |   89 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 89 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/oobparam.h b/include/linux/oobparam.h
> new file mode 100644
> index 0000000..6eaa04c
> --- /dev/null
> +++ b/include/linux/oobparam.h
> @@ -0,0 +1,89 @@
...
> +
> +/**
> + * oobparam_push - push an out of band parameter frame on the OOB param stack
> + * @head	Pointer to the OOB parameter stack top, which must be in the
> + *		task structure.
> + * @frame	Pointer to the OOB parameter frame, generally embedded in
> + *		another structure

Need a colon ':' after the parameter names for kernel-doc notation:

 * @head:
 * @frame:

See Documentation/kernel-doc-nano-HOWTO.txt for info or ask me if you have
problems or questions.

Thanks.

> + */
> +static inline void oobparam_push(struct oobparam *top, struct oobparam *frame)
> +{
> +	frame->next = top;
> +	/* We need to ensure that the pointer in the frame is set prior to
> +	 * the pointer to the top in case we handle an interrupt in between
> +	 * the two stores. */
> +	wmb();
> +	top->next = frame;
> +}


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply

* Re: [PATCH][RFC] Infrastructure for compact call location representation
From: Randy Dunlap @ 2010-06-08 15:44 UTC (permalink / raw)
  To: David VomLehn; +Cc: to, netdev
In-Reply-To: <20100608003052.GA29377@dvomlehn-lnx2.corp.sa.net>

On Mon, 7 Jun 2010 17:30:52 -0700 David VomLehn wrote:

> Notes
> o	Under simple test conditions, the number of call site IDs allocated
> 	can be quite small, small enough to fit in 6 bits. That would reduce
> 	the sk_buff growth to one byte. This is *not* a recommended
> 	configuration.
> o	This is placed in net/core and linux/net since those are the only
> 	current users, but there is nothing about this that is networking-
> 	specific.

=> this shouldn't end up in net/

>  include/net/callsite-types.h |  160 +++++++++++++++++++
>  include/net/callsite.h       |  208 +++++++++++++++++++++++++
>  net/core/callsite.c          |  354 ++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 722 insertions(+), 0 deletions(-)
> 
> diff --git a/include/net/callsite-types.h b/include/net/callsite-types.h
> new file mode 100644
> index 0000000..796cfb1
> --- /dev/null
> +++ b/include/net/callsite-types.h
> @@ -0,0 +1,160 @@

> +#ifndef	_LINUX_NET_CALLSITE_TYPES_H_
> +#define _LINUX_NET_CALLSITE_TYPES_H_
> +#include <linux/oobparam.h>
> +
> +/* Pre-defined call site IDs */
> +#define	CALLSITE_UNSET		0	/* Never set (default) */
> +#define	CALLSITE_UNKNOWN	1	/* Tried to set, but couldn't */
> +#define CALLSITE_START		2	/* First valid call site ID */
> +
> +#define	CALLSITE_MAX_ID_SIZE	16	/* Max # bits in a call site ID */
> +
> +/**
> + * struct callsite_id - callsite identifier
> + * @id:		Unique value assigned to callsite
> + *
> + * This includes the unique ID assigned to the call site and the information
> + * that defines the location of the call site.
> + */
> +struct callsite_id {
> +	unsigned		id:CALLSITE_MAX_ID_SIZE;
> +};
> +
> +/* The id value must be set to CALLSITE_UNSET. This is conveniently defined
> + * to have the value zero, so we don't need to explicitly set it */
> +#define CALLSITE_ID_INIT() {						\
> +	}
> +
> +/**
> + * struct callsite_where - Location information for loaded callsites
> + * @here:	Address of code doing the calling (if terse reporting)
> + * @file:	Pointer to file name (if not using terse reporting)
> + * @lineno:	Line number in file (if not using terse reporting)
> + */
> +struct callsite_where {
> +#ifdef CONFIG_CALLSITE_TERSE
> +	void			*here;	/* Address */
> +#else
> +	const char		*file;	/* Call location */
> +	unsigned short		lineno;
> +#endif
> +};
> +
> +#ifdef CONFIG_CALLSITE_TERSE
> +#define CALLSITE_WHERE_INIT()			{		\
> +			.here =		NULL,			\
> +		}
> +#else
> +#define CALLSITE_WHERE_INIT()			{		\
> +			.file =		__FILE__,		\
> +			.lineno =	__LINE__,		\
> +		}
> +#endif
> +
> +/**
> + * struct callsite_const - constant per-callsite information
> + * @id:		Pointer to the location of the callsite ID
> + * @where:	Location information
> + * @module:	Pointer to the module om which the callsite exists
> + * @set:	Pointer to information that allies to all callsites of this
> + *		particular set.
> + */
> +struct callsite_static {
> +	struct callsite_id	*id;
> +	struct callsite_where	where;
> +	struct module		*module;
> +	struct callsite_set	*set;
> +};
> +
> +#define	CALLSITE_STATIC_INIT(_id, _set)	{			\
> +			.id =		_id,			\
> +			.where =	CALLSITE_WHERE_INIT(),	\
> +			.module =	THIS_MODULE,		\
> +			.set =		_set,			\
> +		}
> +
> +/*
> + * callsite_set - information about a set of callsite IDs
> + * @name:		Callsite_set name
> + * @width:		Number of bits available for callsite ID
> + * Private members:

You can use
 * private:
above to keep these parameters from being printed in the kernel-doc output
if that's what you prefer to see.

> + * @warned:		Has a warning been printed that no call site ID could
> + *			be assigned for this callsite set?
> + * @max_id:		The maximim value of a callsite ID. This must fit in
> + *			the number of bits allocated to the callsite_id and
> + *			must be at least CALLSITE_START.
> + * @next_id:		Value of the next callsite ID to give out. Will never
> + *			be more than @max_id.
> + * @info:		Pointer to callsite_info array.
> + * @lock:		Lock that protects the @callsites structure member
> + * @callsite_id_sets:	Link to the next callsite_id_set
> + */
> +struct callsite_set {
> +	const char		*name;
> +	unsigned int		width;
> +	/* private */
> +	bool			warned:1;
> +	unsigned int		max_id;
> +	unsigned int		next_id;
> +	struct callsite_info	*info;
> +	spinlock_t		lock;
> +	struct list_head	callsite_id_sets;
> +};
> +
> +#define CALLSITE_SET_INIT(str_name, _varname, _width)	{		\
> +	.name =			str_name,				\
> +	.width =		_width,					\
> +	.lock =			__SPIN_LOCK_UNLOCKED((_varname).lock),	\
> +	.callsite_id_sets =	LIST_HEAD_INIT((_varname).callsite_id_sets), \
> +}
> +
> +/**
> + * struct callsite_frame - data in each "frame" of the callsite stack
> + * @id:			Callsite ID
> + * @callsite_oobparam:	Data for passing out of band parameters
> + *
> + * Data that is stored on the stack each time a call is made. A linked list
> + * of these is constructed on the stack for each task. In effect, these
> + * are "frames" for the stack of call sites
> + */
> +struct callsite_frame {
> +	struct callsite_id	id;
> +	OOBPARAM_FRAME(frame);
> +};
> +#define CALLSITE_FRAME(name)	struct callsite_frame name;
> +
> +/**
> + * struct callsite_top - pointer to the top of the callsite stack
> + * @callsite_top	Pointer to the top of the callsite stack

 * @callsite_top:

Oh, is it just
 * @top:
??

> + */
> +struct callsite_top {
> +	OOBPARAM_TOP(top);
> +};
> +#define CALLSITE_TOP(name)	struct callsite_top name;
> +#endif
> diff --git a/include/net/callsite.h b/include/net/callsite.h
> new file mode 100644
> index 0000000..a355a23
> --- /dev/null
> +++ b/include/net/callsite.h
> @@ -0,0 +1,208 @@

> +#ifndef	_LINUX_NET_CALLSITE_H_
> +#define _LINUX_NET_CALLSITE_H_
> +#include <linux/stringify.h>
> +#include <linux/sched.h>
> +#include <net/callsite-types.h>
> +
> +#ifdef CONFIG_CALLSITE
> +/* CALLSITE_VARS - macro to define all variables local to a call site
> + * @cs_top:	Name of a variable in which to store the value of the top
> + *		of the stack
> + * @cs_id:	Name of the statically allocated variable in which the call
> + *		site ID is stored
> + * @cs_static:	Name of the statically allocated structure in which constant
> + *		data about the call site is stored
> + * @cs_sf:	Name of the &struct callsite_frame variable (allocated
> + *		on the stack)
> + * @set:	Pointer to the &struct callsite_set for this set of call sites
> + * @top:	Pointer to the &struct callsite_top for this thread
> + */
> +#define CALLSITE_VARS(cs_top, cs_id, cs_static, cs_sf, set, top)	\
> +		struct callsite_top *cs_top = (top);			\
> +		static struct callsite_id cs_id;			\
> +		static struct callsite_static cs_static =		\
> +			CALLSITE_STATIC_INIT(&cs_id, (set));		\
> +		struct callsite_frame cs_sf
> +
> +/* Define a macro for declaring the variables */
> +#define CALLSITE_DECL(cs_top, cs_id, cs_static, cs_sf, set, top) \
> +	CALLSITE_VARS(cs_top, cs_id, cs_static, cs_sf, (set), (top))
> +
> +/**
> + * CALLSITE_CALL - Push a callsite stack "frame" and call a function
> + * @top:	Pointer to a pointer to the first in the list of
> + *		&callsite_top "frames
> + * @set:	Pointer to a &struct callsite_set
> + * @fn:		Function returning a non-void value

Is there always an @arg1 ?

> + * @...:	Arguments to fn()
> + *
> + * Push a callsite stack "frame" on the stack, call the given function,
> + * and pop the callsite stack frame. Evaluates to the value returned by
> + * the function.
> + */
> +#define CALLSITE_CALL(top, set, fn, arg1, ...)	({			\
> +		CALLSITE_DECL(_cs_top, _cs_id, _cs_static,		\
> +			_cs_stackframe, (set), (top));			\
> +		typeof((fn)(arg1, ##__VA_ARGS__)) _cs_result;		\
> +		callsite_push(_cs_top, &_cs_stackframe, &_cs_id,	\
> +			&_cs_static);					\
> +		_cs_result = (fn)(arg1, ##__VA_ARGS__);			\
> +		callsite_pop(_cs_top);					\
> +		_cs_result;						\
> +	})
> +
> +/**
> + * CALLSITE_CALL_VOID - Push a callsite stack "frame" and call a void function
> + * @top:	Pointer to a pointer to the first in the list of
> + *		callsite_top "frames
> + * @fn:		Function of type void

 * @arg1:
??

> + * @...:	Arguments to fn()
> + *
> + * Push a callsite stack "frame" on the stack, call the given function,
> + * and pop the callsite stack frame.
> + */
> +#define CALLSITE_CALL_VOID(top, set, fn, arg1, ...)	do {		\
> +		CALLSITE_DECL(_cs_top, _cs_id, _cs_static,		\
> +			_cs_stackframe, (set), (top));			\
> +		callsite_push(_cs_top, &_cs_stackframe, &_cs_id,	\
> +			&_cs_static);					\
> +		(fn)(arg1, ##__VA_ARGS__);				\
> +		callsite_pop(_cs_top);					\
> +	} while (0)
> +
> +#define CALLSITE_CUR(top) \
> +	OOBPARAM_CUR(&top->top, struct callsite_frame, frame)
> +
> +extern void callsite_print_where_by_id(struct callsite_set *cs_set,
> +	unsigned int id);
> +extern void callsite_assign_id(struct callsite_static *cs_static);
> +extern void callsite_remove_module(struct module *module);
> +extern int callsite_set_register(struct callsite_set *cs_set);
> +
> +/**
> + * callsite_set_id - Set the callsite ID if it isn't already set
> + * @id:		Pointer to &callsite_id to check and set
> + * @cs_static:	Pointer to &struct callsite_static data for this callsite
> + */
> +static inline void callsite_set_id(struct callsite_id *id,
> +	struct callsite_static *cs_static)
> +{
> +	if (unlikely(id->id == CALLSITE_UNSET))
> +		callsite_assign_id(cs_static);
> +}
...
> +/**
> + * callsite_task_init - initialize a callsite member of the task structure
> + * @p	Pointer to the member to initialize

 * @p:

> + */
> +static inline void callsite_top_init(struct callsite_top *p)
> +{
> +}
> +#else
> +#define CALLSITE_CALL(top, set, fn, arg1, ...)	({			\
> +		(fn)(arg1, ##__VA_ARGS__);				\
> +	})
...

> diff --git a/net/core/callsite.c b/net/core/callsite.c
> new file mode 100644
> index 0000000..e77d44b
> --- /dev/null
> +++ b/net/core/callsite.c
> @@ -0,0 +1,354 @@

> +/**
> + * struct callsite_info - Per-call site information
> + * @unloaded:	Is this in a module that has been unloaded?
> + * @where:	Union of location information
> + *   @loaded	Location information if callsite is loaded
> + *   @unloaded	Location information if callsite is in an unloaded module
> + * @lock:	Lock for updating information for this call site

<lock> is not a struct member below.

> + */
> +struct callsite_info {
> +	bool					unloaded:1;
> +	union {
> +		const struct callsite_static	*loaded;
> +		const char			*unloaded;
> +	} where;
> +};
...

> +
> +/**
> + * callsite_assign_id - Assign a call site ID
> + * @cs_static:	Pointer to static information about the callsite
> + *
> + * If the ID is @CALLSITE_UNSET in a given &struct callsite, this
> + * function is called to assign a call site ID. The value assigned will
> + * normally * be @CALLSITE_START or above, but if we exceed the maximum
> + * size of an ID, * we assign @CALLSITE_UNKNOWN.

stray asterisk?      ^

> + */
> +extern void callsite_assign_id(struct callsite_static *cs_static)
> +{
...


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply

* Re: [PATCH][RFC] Infrastructure for compact call location representation
From: Stephen Hemminger @ 2010-06-08 15:44 UTC (permalink / raw)
  To: David VomLehn; +Cc: to, netdev
In-Reply-To: <20100608003052.GA29377@dvomlehn-lnx2.corp.sa.net>

On Mon, 7 Jun 2010 17:30:52 -0700
David VomLehn <dvomlehn@cisco.com> wrote:

> This patch allows the location of a call to be recorded as a small integer,
> with each call location ("callsite") assigned a new value the first time
> the code in that location is executed. Locations can be recorded as a
> an address or as a __FILE__/__LINE__ pair. The later is easier to read, but
> requires significantly more space.
> 
> The goal here was to record the location in very few bits but, at the same
> time, to have minimal overhead.  The key observation towards achieving this
> goals is to note that there are are far fewer locations where calls of
> interest are made than there are addresses. If the site of each call of
> interest is assigned a unique ID, and there are fewer than n of them,
> only log2(n) bits are required to identify the call site. If the IDs
> are assigned dynamically and most call sites aren't reached, you can get
> by with even fewer bits.
> 
> This is debugging code and callsite IDs are generally only used when failures
> are detected, so though the mapping from a callsite location to a callsite ID
> must be fast, speed is not as critical for the reverse mapping. Also, an ID
> is assigned to a callsite just once, so it is acceptable to take a while to
> assign an ID, but things should run with minimal delay if an ID is already
> assigned.
> 
> The major implementation pieces are:
> 1.	Two macros are provided for use in wrapping functions that are to
> 	be instrumented. CALLSITE_CALL is for functions that return values,
> 	CALLSITE_CALL_VOID is used for functions that do not.
> 2.	The call site infrastructure consists of three data structures:
> 	a.	A statically allocated struct callsite_id holds the ID for
> 		the call site.
> 	b.	A statically allocated struct callsite_static holds
> 		information which is constant for each callsite. The call site
> 		ID could be combined with this, but by separating them I hope
> 		to avoid polluting the cache with this very cold information.
> 	c.	A struct callsite_frame builds on the oobparam infrastructure
> 		and holds the call site ID. This is assigned at this time
> 		if this had not previously been done. This will be pushed on
> 		the OOB parameter stack before calling the skb_* function
> 		and popped after it returns.
> 3.	A callsite_top structure is added to task_struct. When a call site
> 	is entered, its callsite_frame is pushed on the call site stack.
> 4.	When a function needs to know the call site ID so it can be stored,
> 	it gets it from the callsite_frame at the top of the call site
> 	stack.
> 
> Notes
> o	Under simple test conditions, the number of call site IDs allocated
> 	can be quite small, small enough to fit in 6 bits. That would reduce
> 	the sk_buff growth to one byte. This is *not* a recommended
> 	configuration.
> o	This is placed in net/core and linux/net since those are the only
> 	current users, but there is nothing about this that is networking-
> 	specific.
> 
> Restrictions
> o	Call site IDs are never reused, so it is possible to exceed the
> 	maximum number of IDs by having a large number of call locations.
> 	In addition, it does not recognize that the same module has been
> 	unloaded and reloaded, so calls from the reloaded module will be
> 	assigned new IDs. Detection of incorrect operations on an sk_buff
> 	is not affected by exhaustion of call site IDs, but it may not be
> 	possible to determine the location of the last operation.
> 	CONFIG_DEBUG_SKB_ID_SIZE is set to reduce the sk_buff growth to 16
> 	bits and should handle most cases. It could be made larger to allow
> 	more call site IDs, if necessary.
> o	The callsite structures for a module will be freed when that module
> 	is unloaded, even though sk_buffs may be using IDs corresponding to
> 	those call sites. To allow useful error reporting, the call site
> 	information in a module being unloaded is copied. If
> 	CONFIG_CALLSITE_TERSE is not enabled and the module that last changed
> 	the sk_buff is no longer loaded, the address of the call site
> 	is no longer valid, so only the function name and offset are printed
> 	if the module is unloaded. If it is loaded, the address is also
> 	reported.
> 
> History
> v2	Support small callsite IDs and split out out-of-band parameter
> 	parsing.
> V1	Initial release
> 
> Signed-off-by: David VomLehn <dvomlehn@cisco.com>

This is really Linux Kernel Mailing List material (not just netdev). And it will
be a hard sell to get it accepted, because it is basically an alternative call
tracing mechanism, and there are already several of these in use or under development
(see perf and ftrace).


-- 

^ permalink raw reply

* Re: [PATCHv2 1/2] net: Enable 64-bit net device statistics on 32-bit architectures
From: Ben Hutchings @ 2010-06-08 15:56 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, arnd, netdev, linux-net-drivers
In-Reply-To: <1275996928.14011.83.camel@localhost>

On Tue, 2010-06-08 at 12:35 +0100, Ben Hutchings wrote:
[...]
> > And the whole tree needs to be inspected to make sure there isn't going
> > to be fallout in areas your patch didn't take care of wrt. printf format
> > strings and the like.
> > 
> > What was always "unsigned long" is now a variable type, therefore using
> > a fixed printf format string is impossible unless you always cast these
> > things when passed in as printf arguments.
> 
> Yes, that's true if there are drivers out there printing members of
> net_device_stats.  I admit I haven't checked for that.  (Hmm, might be
> time to try Coccinelle.)

There are a few of those, so I'll keep the declared type as unsigned
long rather than making the sizes explicit.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [Linux-ATM-General] RX/close vcc race with solos/atmtcp/usbatm/he
From: David Woodhouse @ 2010-06-08 16:25 UTC (permalink / raw)
  To: chas3; +Cc: linux-atm-general, netdev
In-Reply-To: <201006081505.o58F5Pt5006703@thirdoffive.cmf.nrl.navy.mil>

On Tue, 2010-06-08 at 11:05 -0400, Chas Williams (CONTRACTOR) wrote:
> In message <1275943792.17903.5119.camel@macbook.infradead.org>,David Woodhouse 
> writes:
> >On Mon, 2010-06-07 at 12:37 -0400, Chas Williams (CONTRACTOR) wrote:
> >> i dont understand.  if you do a sock_hold() in find_vcc(), and then call
> >> vcc->push() you should be able to call vcc->push() and then sock_put(). 
> >
> >Holding the reference doesn't stop the problem. The problem is
> >
> > vcc_release()
> > --> vcc_destroy_socket()
> >   --> br2684_push(vcc, NULL)
> >         sets vcc->user_back = NULL
> >         (which it what causes the oops when try try to feed it any
> >          subsequent packets).
> >
> > Only _later_ does vcc_release() call sock_put().
> 
> hmm... perhaps this routine needs to take the vcc_sklist_lock because
> it is going to modify the vcc.  or we need to use locking on the vcc
> itself. 

Or move the ->push(vcc, NULL) and anything else which destroys the
state, so that it happens later. Use a real socket destructor function
which will be called from sk_free() after the last sock_put().

> you took a reference to an object inside a hashed list and didnt do
> anything to prevent the object from leaving the hashed list. that is
> stil not correct IMHO.

Yeah yeah, but I fixed that already with the RCU-like approach of
synchronising with the tasklet on dev->ops->close(). So I don't _need_
the reference.

-- 
dwmw2


^ permalink raw reply

* Re: 2.6.35-rc2-git1 - lib/idr.c:605 invoked rcu_dereference_check() without protection!
From: Paul E. McKenney @ 2010-06-08 16:25 UTC (permalink / raw)
  To: Miles Lane
  Cc: Vivek Goyal, Eric Paris, David Woodhouse, Lai Jiangshan,
	Ingo Molnar, Peter Zijlstra, LKML, nauman, eric.dumazet, netdev,
	Jens Axboe, Gui Jianfeng, Li Zefan, Johannes Berg
In-Reply-To: <AANLkTikPcIzd_gQ3XVT8zdy2Gyxibx3wx_twlaxJA3cz@mail.gmail.com>

On Tue, Jun 08, 2010 at 12:28:15AM -0400, Miles Lane wrote:
> On Mon, Jun 7, 2010 at 8:12 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Mon, Jun 07, 2010 at 02:23:17PM -0400, Miles Lane wrote:
> >> [    2.677955] [ INFO: suspicious rcu_dereference_check() usage. ]
> >> [    2.679089] ---------------------------------------------------
> >> [    2.680276] lib/idr.c:605 invoked rcu_dereference_check() without protection!
> >> [    2.681499]
> >> [    2.681500] other info that might help us debug this:
> >> [    2.681501]
> >> [    2.685509]
> >> [    2.685510] rcu_scheduler_active = 1, debug_locks = 1
> >> [    2.688221] 1 lock held by swapper/1:
> >> [    2.689587]  #0:  (mtd_table_mutex){+.+...}, at:
> >> [<ffffffff812bea45>] register_mtd_user+0x1a/0x69
> >> [    2.691096]
> >> [    2.691098] stack backtrace:
> >> [    2.694059] Pid: 1, comm: swapper Not tainted 2.6.35-rc2-git1 #8
> >> [    2.695601] Call Trace:
> >> [    2.697243]  [<ffffffff81064e9c>] lockdep_rcu_dereference+0x9d/0xa5
> >> [    2.698868]  [<ffffffff811b9c86>] idr_get_next+0x60/0x124
> >> [    2.700556]  [<ffffffff812be779>] __mtd_next_device+0x1b/0x1d
> >> [    2.702238]  [<ffffffff812bea7c>] register_mtd_user+0x51/0x69
> >> [    2.703964]  [<ffffffff816cca45>] init_mtdchar+0xb3/0xd3
> >> [    2.705686]  [<ffffffff816cc992>] ? init_mtdchar+0x0/0xd3
> >> [    2.707470]  [<ffffffff810001ef>] do_one_initcall+0x59/0x14e
> >> [    2.709255]  [<ffffffff816a768a>] kernel_init+0x144/0x1ce
> >> [    2.711082]  [<ffffffff81003054>] kernel_thread_helper+0x4/0x10
> >> [    2.712862]  [<ffffffff813ca480>] ? restore_args+0x0/0x30
> >> [    2.714647]  [<ffffffff816a7546>] ? kernel_init+0x0/0x1ce
> >> [    2.716415]  [<ffffffff81003050>] ? kernel_thread_helper+0x0/0x10
> >
> > This looks like a new one!  Does the following patch take care of it?
> >
> >                                                        Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > commit 2d54a6c31b72c902b09d365e9c66205a5c07e549
> > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Date:   Mon Jun 7 17:09:45 2010 -0700
> >
> >    idr: fix RCU lockdep splat in idr_get_next()
> >
> >    Convert to rcu_dereference_raw() given that many callers may have many
> >    different locking models.
> >
> >    Located-by: Miles Lane <miles.lane@gmail.com>
> >    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >
> > diff --git a/lib/idr.c b/lib/idr.c
> > index 2eb1dca..f099f25 100644
> > --- a/lib/idr.c
> > +++ b/lib/idr.c
> > @@ -599,7 +599,7 @@ void *idr_get_next(struct idr *idp, int *nextidp)
> >        /* find first ent */
> >        n = idp->layers * IDR_BITS;
> >        max = 1 << n;
> > -       p = rcu_dereference(idp->top);
> > +       p = rcu_dereference_raw(idp->top);
> >        if (!p)
> >                return NULL;
> >
> > @@ -607,7 +607,7 @@ void *idr_get_next(struct idr *idp, int *nextidp)
> >                while (n > 0 && p) {
> >                        n -= IDR_BITS;
> >                        *paa++ = p;
> > -                       p = rcu_dereference(p->ary[(id >> n) & IDR_MASK]);
> > +                       p = rcu_dereference_raw(p->ary[(id >> n) & IDR_MASK]);
> >                }
> >
> >                if (p) {
> >
> 
> Tested.  Looks good!

Thank you very much for both locating this one and for testing the fix!
I have added your Tested-by.

							Thanx, Paul

^ permalink raw reply

* Re: 2.6.35-rc2-git1 - net/mac80211/sta_info.c:125 invoked rcu_dereference_check() without protection!
From: Paul E. McKenney @ 2010-06-08 16:26 UTC (permalink / raw)
  To: Miles Lane
  Cc: Johannes Berg, Vivek Goyal, Eric Paris, Lai Jiangshan,
	Ingo Molnar, Peter Zijlstra, LKML, nauman, eric.dumazet, netdev,
	Jens Axboe, Gui Jianfeng, Li Zefan
In-Reply-To: <AANLkTillxWXGO4nIqb6hdbaw4P40WLHEPpy_GiM_hSW-@mail.gmail.com>

On Tue, Jun 08, 2010 at 07:29:48AM -0400, Miles Lane wrote:
> On Tue, Jun 8, 2010 at 3:26 AM, Johannes Berg <johannes@sipsolutions.net> wrote:
> > On Mon, 2010-06-07 at 16:59 -0700, Paul E. McKenney wrote:
> >> On Mon, Jun 07, 2010 at 02:25:44PM -0400, Miles Lane wrote:
> >> > [   43.478812] [ INFO: suspicious rcu_dereference_check() usage. ]
> >> > [   43.478815] ---------------------------------------------------
> >> > [   43.478820] net/mac80211/sta_info.c:125 invoked
> >> > rcu_dereference_check() without protection!
> >> > [   43.478824]
> >> > [   43.478824] other info that might help us debug this:
> >> > [   43.478826]
> >> > [   43.478829]
> >> > [   43.478830] rcu_scheduler_active = 1, debug_locks = 1
> >> > [   43.478834] no locks held by NetworkManager/4017.
> >>
> >> Hmmm...  Johannes's update has been merged, and it requires that callers
> >> either be in an RCU read-side critical section or hold either the
> >> ->sta_lock or the ->sta_mtx, and this thread does none of this.
> >>
> >> Johannes, any thoughts?
> >>
> >>                                                       Thanx, Paul
> >>
> >> > [   43.478837] stack backtrace:
> >> > [   43.478842] Pid: 4017, comm: NetworkManager Not tainted 2.6.35-rc2-git1 #8
> >> > [   43.478846] Call Trace:
> >> > [   43.478849]  <IRQ>  [<ffffffff81064e9c>] lockdep_rcu_dereference+0x9d/0xa5
> >> > [   43.478876]  [<ffffffffa010cb3c>] sta_info_get_bss+0x71/0x12d [mac80211]
> >> > [   43.478889]  [<ffffffffa010cc0d>] ieee80211_find_sta+0x15/0x2f [mac80211]
> >> > [   43.478902]  [<ffffffffa019ae16>] iwlagn_tx_queue_reclaim+0xe7/0x1bb [iwlagn]
> >
> > iwlwifi wasn't using rcu protection here -- already sent a patch fixing
> > it. My mistake, I  think. Thanks for checking :)
> >
> > johannes
> 
> Yes, your patch works for me.

Thank you both!!!

							Thanx, Paul

^ permalink raw reply

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to classes
From: Kay Sievers @ 2010-06-08 16:26 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <1276009590.3706.135.camel@jlt3.sipsolutions.net>

On Tue, Jun 8, 2010 at 17:06, Johannes Berg <johannes@sipsolutions.net> wrote:
> On Tue, 2010-06-08 at 16:47 +0200, Kay Sievers wrote:
>
>> > So ... can we apply Eric's patch for now then?
>>
>> It might break other stuff we don't know about yet. Just like we did
>> not know about what things hwsim is doing here. :)
>
> True.
>
>> The hwsim issues are caused by the current hwsim code, by doing things
>> it should not do. Class devices of different classes must never be
>> stacked (the core should not allow that in the first place). Class
>> devices must never have a driver assigned behind its back. Also
>> device_create() should not be used for devices without a major/minor
>> (but that seems to be done in several other places too).
>
> Back when hwsim was written that would have been useful feedback to
> whoever did ... now, not so much.

Well, it still needs to be fixed. It will break again, when stuff
changes like now.

>> To fix the hwsim driver core interaction, core changes will probably
>> be needed to allow network modules to be removed while their devices
>> are active. That's something which seems not to work for bus devices
>> currently.
>
> Well it just needs to pin the module refcount from the bus and/or device
> struct. That seems not too hard?

How? If we ever get a reference of the module, it will not be able to
be unloaded while stuff is active, right?

Kay

^ permalink raw reply

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to  classes
From: Johannes Berg @ 2010-06-08 16:33 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <AANLkTimSqF4D4jPj-HzOjl1vxjf2T7YvlGdR6zcJZ9Pp@mail.gmail.com>

On Tue, 2010-06-08 at 18:26 +0200, Kay Sievers wrote:

> >> To fix the hwsim driver core interaction, core changes will probably
> >> be needed to allow network modules to be removed while their devices
> >> are active. That's something which seems not to work for bus devices
> >> currently.
> >
> > Well it just needs to pin the module refcount from the bus and/or device
> > struct. That seems not too hard?
> 
> How? If we ever get a reference of the module, it will not be able to
> be unloaded while stuff is active, right?

Hmm, true, but how does this work with class devices it has now? It
seems to work fine now. It's because we have a generic release function
there I guess?

johannes

^ permalink raw reply

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to classes
From: Kay Sievers @ 2010-06-08 16:39 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <1276014816.3706.137.camel@jlt3.sipsolutions.net>

On Tue, Jun 8, 2010 at 18:33, Johannes Berg <johannes@sipsolutions.net> wrote:
> On Tue, 2010-06-08 at 18:26 +0200, Kay Sievers wrote:
>
>> >> To fix the hwsim driver core interaction, core changes will probably
>> >> be needed to allow network modules to be removed while their devices
>> >> are active. That's something which seems not to work for bus devices
>> >> currently.
>> >
>> > Well it just needs to pin the module refcount from the bus and/or device
>> > struct. That seems not too hard?
>>
>> How? If we ever get a reference of the module, it will not be able to
>> be unloaded while stuff is active, right?
>
> Hmm, true, but how does this work with class devices it has now? It
> seems to work fine now. It's because we have a generic release function
> there I guess?

Because the code to cleanup is always built into the core, yes. Even
the empty release function thing we can not do. :)

That all works if you have two modules, like almost all buses have.
That's what I meant, that we need to add stuff to the core to be able
to cleanup bus devices internally too, if we use everything in a
single module, which is also supposed to cleanup on unload, like the
network devices like to do.

Kay

^ permalink raw reply

* [PATCH -next] enic: fix pci_alloc_consistent argument
From: Randy Dunlap @ 2010-06-08 17:00 UTC (permalink / raw)
  To: netdev; +Cc: davem, Scott Feldman, Vasanthy Kolluri, Roopa Prabhu

From: Randy Dunlap <randy.dunlap@oracle.com>

Fix build warning on i386 (32-bit) with 32-bit dma_addr_t:

drivers/net/enic/vnic_dev.c: In function 'vnic_dev_init_prov':
drivers/net/enic/vnic_dev.c:716: warning: passing argument 3 of 'pci_alloc_consistent' from incompatible pointer type
include/asm-generic/pci-dma-compat.h:16: note: expected 'dma_addr_t *' but argument is of type 'u64 *'

Now builds without warnings on i386 and on x86_64.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc:	Scott Feldman <scofeldm@cisco.com>
Cc:	Vasanthy Kolluri <vkolluri@cisco.com>
Cc:	Roopa Prabhu <roprabhu@cisco.com>
---
 drivers/net/enic/vnic_dev.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-next-20100608.orig/drivers/net/enic/vnic_dev.c
+++ linux-next-20100608/drivers/net/enic/vnic_dev.c
@@ -709,7 +709,7 @@ int vnic_dev_init_prov(struct vnic_dev *
 {
 	u64 a0, a1 = len;
 	int wait = 1000;
-	u64 prov_pa;
+	dma_addr_t prov_pa;
 	void *prov_buf;
 	int ret;
 
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply

* [PATCH] bnx2: Fix compiler warning in bnx2_disable_forced_2g5().
From: Michael Chan @ 2010-06-08 17:21 UTC (permalink / raw)
  To: davem; +Cc: prarit, netdev

drivers/net/bnx2.c: In function 'bnx2_disable_forced_2g5':
drivers/net/bnx2.c:1489: warning: 'bmcr' may be used uninitialized in this function

We fix it by checking return values from all bnx2_read_phy() and proceeding
to do read-modify-write only if the read operation is successful.

The related bnx2_enable_forced_2g5() is also fixed the same way.

Reported-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
---
 drivers/net/bnx2.c |   43 ++++++++++++++++++++++++++++---------------
 1 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 949d7a9..522de9f 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -1446,7 +1446,8 @@ bnx2_test_and_disable_2g5(struct bnx2 *bp)
 static void
 bnx2_enable_forced_2g5(struct bnx2 *bp)
 {
-	u32 bmcr;
+	u32 uninitialized_var(bmcr);
+	int err;
 
 	if (!(bp->phy_flags & BNX2_PHY_FLAG_2_5G_CAPABLE))
 		return;
@@ -1456,22 +1457,28 @@ bnx2_enable_forced_2g5(struct bnx2 *bp)
 
 		bnx2_write_phy(bp, MII_BNX2_BLK_ADDR,
 			       MII_BNX2_BLK_ADDR_SERDES_DIG);
-		bnx2_read_phy(bp, MII_BNX2_SERDES_DIG_MISC1, &val);
-		val &= ~MII_BNX2_SD_MISC1_FORCE_MSK;
-		val |= MII_BNX2_SD_MISC1_FORCE | MII_BNX2_SD_MISC1_FORCE_2_5G;
-		bnx2_write_phy(bp, MII_BNX2_SERDES_DIG_MISC1, val);
+		if (!bnx2_read_phy(bp, MII_BNX2_SERDES_DIG_MISC1, &val)) {
+			val &= ~MII_BNX2_SD_MISC1_FORCE_MSK;
+			val |= MII_BNX2_SD_MISC1_FORCE |
+				MII_BNX2_SD_MISC1_FORCE_2_5G;
+			bnx2_write_phy(bp, MII_BNX2_SERDES_DIG_MISC1, val);
+		}
 
 		bnx2_write_phy(bp, MII_BNX2_BLK_ADDR,
 			       MII_BNX2_BLK_ADDR_COMBO_IEEEB0);
-		bnx2_read_phy(bp, bp->mii_bmcr, &bmcr);
+		err = bnx2_read_phy(bp, bp->mii_bmcr, &bmcr);
 
 	} else if (CHIP_NUM(bp) == CHIP_NUM_5708) {
-		bnx2_read_phy(bp, bp->mii_bmcr, &bmcr);
-		bmcr |= BCM5708S_BMCR_FORCE_2500;
+		err = bnx2_read_phy(bp, bp->mii_bmcr, &bmcr);
+		if (!err)
+			bmcr |= BCM5708S_BMCR_FORCE_2500;
 	} else {
 		return;
 	}
 
+	if (err)
+		return;
+
 	if (bp->autoneg & AUTONEG_SPEED) {
 		bmcr &= ~BMCR_ANENABLE;
 		if (bp->req_duplex == DUPLEX_FULL)
@@ -1483,7 +1490,8 @@ bnx2_enable_forced_2g5(struct bnx2 *bp)
 static void
 bnx2_disable_forced_2g5(struct bnx2 *bp)
 {
-	u32 bmcr;
+	u32 uninitialized_var(bmcr);
+	int err;
 
 	if (!(bp->phy_flags & BNX2_PHY_FLAG_2_5G_CAPABLE))
 		return;
@@ -1493,21 +1501,26 @@ bnx2_disable_forced_2g5(struct bnx2 *bp)
 
 		bnx2_write_phy(bp, MII_BNX2_BLK_ADDR,
 			       MII_BNX2_BLK_ADDR_SERDES_DIG);
-		bnx2_read_phy(bp, MII_BNX2_SERDES_DIG_MISC1, &val);
-		val &= ~MII_BNX2_SD_MISC1_FORCE;
-		bnx2_write_phy(bp, MII_BNX2_SERDES_DIG_MISC1, val);
+		if (!bnx2_read_phy(bp, MII_BNX2_SERDES_DIG_MISC1, &val)) {
+			val &= ~MII_BNX2_SD_MISC1_FORCE;
+			bnx2_write_phy(bp, MII_BNX2_SERDES_DIG_MISC1, val);
+		}
 
 		bnx2_write_phy(bp, MII_BNX2_BLK_ADDR,
 			       MII_BNX2_BLK_ADDR_COMBO_IEEEB0);
-		bnx2_read_phy(bp, bp->mii_bmcr, &bmcr);
+		err = bnx2_read_phy(bp, bp->mii_bmcr, &bmcr);
 
 	} else if (CHIP_NUM(bp) == CHIP_NUM_5708) {
-		bnx2_read_phy(bp, bp->mii_bmcr, &bmcr);
-		bmcr &= ~BCM5708S_BMCR_FORCE_2500;
+		err = bnx2_read_phy(bp, bp->mii_bmcr, &bmcr);
+		if (!err)
+			bmcr &= ~BCM5708S_BMCR_FORCE_2500;
 	} else {
 		return;
 	}
 
+	if (err)
+		return;
+
 	if (bp->autoneg & AUTONEG_SPEED)
 		bmcr |= BMCR_SPEED1000 | BMCR_ANENABLE | BMCR_ANRESTART;
 	bnx2_write_phy(bp, bp->mii_bmcr, bmcr);
-- 
1.6.4.GIT



^ permalink raw reply related

* [PATCHv3 1/2] net: Enable 64-bit net device statistics on 32-bit architectures
From: Ben Hutchings @ 2010-06-08 17:19 UTC (permalink / raw)
  To: David Miller; +Cc: Stephen Hemminger, Arnd Bergmann, netdev, linux-net-drivers

Use struct rtnl_link_stats64 as the statistics structure.

On 32-bit architectures, insert 32 bits of padding after/before each
field of struct net_device_stats to make its layout compatible with
struct rtnl_link_stats64.  Add an anonymous union in net_device; move
stats into the union and add struct rtnl_link_stats64 stats64.

Add net_device_ops::ndo_get_stats64, implementations of which will
return a pointer to struct rtnl_link_stats64.  Drivers that implement
this operation must not update the structure asynchronously.

Change dev_get_stats() to call ndo_get_stats64 if available, and to
return a pointer to struct rtnl_link_stats64.  Change callers of
dev_get_stats() accordingly.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
Changes from v2:
- Keep type of net_device_stats fields as unsigned long rather than
  making it explicitly u32 or u64
- Update bonding driver, which calls dev_get_stats() and must now
  implement net_device_ops::ndo_get_stats64

Changes from v1:
- Remove accessor macros

Compile-tested allmodconfig on i386; this version should introduce no
new warnings.

Tested that 64-bit statistics work on i386 with bonding and sfc drivers
(using the following patch).

Ben.

 drivers/net/bonding/bond_main.c |   13 +++---
 include/linux/if_link.h         |    3 +-
 include/linux/netdevice.h       |   91 +++++++++++++++++++++++----------------
 net/8021q/vlanproc.c            |   13 +++---
 net/core/dev.c                  |   19 +++++---
 net/core/net-sysfs.c            |   12 +++---
 net/core/rtnetlink.c            |    6 +-
 7 files changed, 90 insertions(+), 67 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 5e12462..275f555 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3774,20 +3774,21 @@ static int bond_close(struct net_device *bond_dev)
 	return 0;
 }
 
-static struct net_device_stats *bond_get_stats(struct net_device *bond_dev)
+static struct rtnl_link_stats64 *bond_get_stats(struct net_device *bond_dev)
 {
 	struct bonding *bond = netdev_priv(bond_dev);
-	struct net_device_stats *stats = &bond_dev->stats;
-	struct net_device_stats local_stats;
+	struct rtnl_link_stats64 *stats = &bond_dev->stats64;
+	struct rtnl_link_stats64 local_stats;
 	struct slave *slave;
 	int i;
 
-	memset(&local_stats, 0, sizeof(struct net_device_stats));
+	memset(&local_stats, 0, sizeof(local_stats));
 
 	read_lock_bh(&bond->lock);
 
 	bond_for_each_slave(bond, slave, i) {
-		const struct net_device_stats *sstats = dev_get_stats(slave->dev);
+		const struct rtnl_link_stats64 *sstats =
+			dev_get_stats(slave->dev);
 
 		local_stats.rx_packets += sstats->rx_packets;
 		local_stats.rx_bytes += sstats->rx_bytes;
@@ -4488,7 +4489,7 @@ static const struct net_device_ops bond_netdev_ops = {
 	.ndo_open		= bond_open,
 	.ndo_stop		= bond_close,
 	.ndo_start_xmit		= bond_start_xmit,
-	.ndo_get_stats		= bond_get_stats,
+	.ndo_get_stats64	= bond_get_stats,
 	.ndo_do_ioctl		= bond_do_ioctl,
 	.ndo_set_multicast_list	= bond_set_multicast_list,
 	.ndo_change_mtu		= bond_change_mtu,
diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 85c812d..7fcad2e 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -4,7 +4,7 @@
 #include <linux/types.h>
 #include <linux/netlink.h>
 
-/* The struct should be in sync with struct net_device_stats */
+/* This struct should be in sync with struct rtnl_link_stats64 */
 struct rtnl_link_stats {
 	__u32	rx_packets;		/* total packets received	*/
 	__u32	tx_packets;		/* total packets transmitted	*/
@@ -37,6 +37,7 @@ struct rtnl_link_stats {
 	__u32	tx_compressed;
 };
 
+/* The main device statistics structure */
 struct rtnl_link_stats64 {
 	__u64	rx_packets;		/* total packets received	*/
 	__u64	tx_packets;		/* total packets transmitted	*/
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a249161..d85a38e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -159,45 +159,49 @@ static inline bool dev_xmit_complete(int rc)
 #define MAX_HEADER (LL_MAX_HEADER + 48)
 #endif
 
-#endif  /*  __KERNEL__  */
-
 /*
- *	Network device statistics. Akin to the 2.0 ether stats but
- *	with byte counters.
+ *	Old network device statistics. Fields are native words
+ *	(unsigned long) so they can be read and written atomically.
+ *	Each field is padded to 64 bits for compatibility with
+ *	rtnl_link_stats64.
  */
 
+#if BITS_PER_LONG == 64
+#define NET_DEVICE_STATS_DEFINE(name)	unsigned long name
+#elif defined(__LITTLE_ENDIAN)
+#define NET_DEVICE_STATS_DEFINE(name)	unsigned long name, pad_ ## name
+#else
+#define NET_DEVICE_STATS_DEFINE(name)	unsigned long pad_ ## name, name
+#endif
+
 struct net_device_stats {
-	unsigned long	rx_packets;		/* total packets received	*/
-	unsigned long	tx_packets;		/* total packets transmitted	*/
-	unsigned long	rx_bytes;		/* total bytes received 	*/
-	unsigned long	tx_bytes;		/* total bytes transmitted	*/
-	unsigned long	rx_errors;		/* bad packets received		*/
-	unsigned long	tx_errors;		/* packet transmit problems	*/
-	unsigned long	rx_dropped;		/* no space in linux buffers	*/
-	unsigned long	tx_dropped;		/* no space available in linux	*/
-	unsigned long	multicast;		/* multicast packets received	*/
-	unsigned long	collisions;
-
-	/* detailed rx_errors: */
-	unsigned long	rx_length_errors;
-	unsigned long	rx_over_errors;		/* receiver ring buff overflow	*/
-	unsigned long	rx_crc_errors;		/* recved pkt with crc error	*/
-	unsigned long	rx_frame_errors;	/* recv'd frame alignment error */
-	unsigned long	rx_fifo_errors;		/* recv'r fifo overrun		*/
-	unsigned long	rx_missed_errors;	/* receiver missed packet	*/
-
-	/* detailed tx_errors */
-	unsigned long	tx_aborted_errors;
-	unsigned long	tx_carrier_errors;
-	unsigned long	tx_fifo_errors;
-	unsigned long	tx_heartbeat_errors;
-	unsigned long	tx_window_errors;
-	
-	/* for cslip etc */
-	unsigned long	rx_compressed;
-	unsigned long	tx_compressed;
+	NET_DEVICE_STATS_DEFINE(rx_packets);
+	NET_DEVICE_STATS_DEFINE(tx_packets);
+	NET_DEVICE_STATS_DEFINE(rx_bytes);
+	NET_DEVICE_STATS_DEFINE(tx_bytes);
+	NET_DEVICE_STATS_DEFINE(rx_errors);
+	NET_DEVICE_STATS_DEFINE(tx_errors);
+	NET_DEVICE_STATS_DEFINE(rx_dropped);
+	NET_DEVICE_STATS_DEFINE(tx_dropped);
+	NET_DEVICE_STATS_DEFINE(multicast);
+	NET_DEVICE_STATS_DEFINE(collisions);
+	NET_DEVICE_STATS_DEFINE(rx_length_errors);
+	NET_DEVICE_STATS_DEFINE(rx_over_errors);
+	NET_DEVICE_STATS_DEFINE(rx_crc_errors);
+	NET_DEVICE_STATS_DEFINE(rx_frame_errors);
+	NET_DEVICE_STATS_DEFINE(rx_fifo_errors);
+	NET_DEVICE_STATS_DEFINE(rx_missed_errors);
+	NET_DEVICE_STATS_DEFINE(tx_aborted_errors);
+	NET_DEVICE_STATS_DEFINE(tx_carrier_errors);
+	NET_DEVICE_STATS_DEFINE(tx_fifo_errors);
+	NET_DEVICE_STATS_DEFINE(tx_heartbeat_errors);
+	NET_DEVICE_STATS_DEFINE(tx_window_errors);
+	NET_DEVICE_STATS_DEFINE(rx_compressed);
+	NET_DEVICE_STATS_DEFINE(tx_compressed);
 };
 
+#endif  /*  __KERNEL__  */
+
 
 /* Media selection options. */
 enum {
@@ -660,10 +664,19 @@ struct netdev_rx_queue {
  *	Callback uses when the transmitter has not made any progress
  *	for dev->watchdog ticks.
  *
+ * struct rtnl_link_stats64* (*ndo_get_stats64)(struct net_device *dev);
  * struct net_device_stats* (*ndo_get_stats)(struct net_device *dev);
  *	Called when a user wants to get the network device usage
- *	statistics. If not defined, the counters in dev->stats will
- *	be used.
+ *	statistics. Drivers must do one of the following:
+ *	1. Define @ndo_get_stats64 to update a rtnl_link_stats64 structure
+ *	   (which should normally be dev->stats64) and return a ponter to
+ *	   it. The structure must not be changed asynchronously.
+ *	2. Define @ndo_get_stats to update a net_device_stats64 structure
+ *	   (which should normally be dev->stats) and return a pointer to
+ *	   it. The structure may be changed asynchronously only if each
+ *	   field is written atomically.
+ *	3. Update dev->stats asynchronously and atomically, and define
+ *	   neither operation.
  *
  * void (*ndo_vlan_rx_register)(struct net_device *dev, struct vlan_group *grp);
  *	If device support VLAN receive accleration
@@ -718,6 +731,7 @@ struct net_device_ops {
 						   struct neigh_parms *);
 	void			(*ndo_tx_timeout) (struct net_device *dev);
 
+	struct rtnl_link_stats64* (*ndo_get_stats64)(struct net_device *dev);
 	struct net_device_stats* (*ndo_get_stats)(struct net_device *dev);
 
 	void			(*ndo_vlan_rx_register)(struct net_device *dev,
@@ -867,7 +881,10 @@ struct net_device {
 	int			ifindex;
 	int			iflink;
 
-	struct net_device_stats	stats;
+	union {
+		struct rtnl_link_stats64 stats64;
+		struct net_device_stats stats;
+	};
 
 #ifdef CONFIG_WIRELESS_EXT
 	/* List of functions to handle Wireless Extensions (instead of ioctl).
@@ -2118,7 +2135,7 @@ extern void		netdev_features_change(struct net_device *dev);
 /* Load a device via the kmod */
 extern void		dev_load(struct net *net, const char *name);
 extern void		dev_mcast_init(void);
-extern const struct net_device_stats *dev_get_stats(struct net_device *dev);
+extern const struct rtnl_link_stats64 *dev_get_stats(struct net_device *dev);
 extern void		dev_txq_stats_fold(const struct net_device *dev, struct net_device_stats *stats);
 
 extern int		netdev_max_backlog;
diff --git a/net/8021q/vlanproc.c b/net/8021q/vlanproc.c
index afead35..df56f5c 100644
--- a/net/8021q/vlanproc.c
+++ b/net/8021q/vlanproc.c
@@ -278,8 +278,9 @@ static int vlandev_seq_show(struct seq_file *seq, void *offset)
 {
 	struct net_device *vlandev = (struct net_device *) seq->private;
 	const struct vlan_dev_info *dev_info = vlan_dev_info(vlandev);
-	const struct net_device_stats *stats;
+	const struct rtnl_link_stats64 *stats;
 	static const char fmt[] = "%30s %12lu\n";
+	static const char fmt64[] = "%30s %12llu\n";
 	int i;
 
 	if (!is_vlan_dev(vlandev))
@@ -291,12 +292,12 @@ static int vlandev_seq_show(struct seq_file *seq, void *offset)
 		   vlandev->name, dev_info->vlan_id,
 		   (int)(dev_info->flags & 1), vlandev->priv_flags);
 
-	seq_printf(seq, fmt, "total frames received", stats->rx_packets);
-	seq_printf(seq, fmt, "total bytes received", stats->rx_bytes);
-	seq_printf(seq, fmt, "Broadcast/Multicast Rcvd", stats->multicast);
+	seq_printf(seq, fmt64, "total frames received", stats->rx_packets);
+	seq_printf(seq, fmt64, "total bytes received", stats->rx_bytes);
+	seq_printf(seq, fmt64, "Broadcast/Multicast Rcvd", stats->multicast);
 	seq_puts(seq, "\n");
-	seq_printf(seq, fmt, "total frames transmitted", stats->tx_packets);
-	seq_printf(seq, fmt, "total bytes transmitted", stats->tx_bytes);
+	seq_printf(seq, fmt64, "total frames transmitted", stats->tx_packets);
+	seq_printf(seq, fmt64, "total bytes transmitted", stats->tx_bytes);
 	seq_printf(seq, fmt, "total headroom inc",
 		   dev_info->cnt_inc_headroom_on_tx);
 	seq_printf(seq, fmt, "total encap on xmit",
diff --git a/net/core/dev.c b/net/core/dev.c
index 983a3c1..71a6fd8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3686,10 +3686,10 @@ void dev_seq_stop(struct seq_file *seq, void *v)
 
 static void dev_seq_printf_stats(struct seq_file *seq, struct net_device *dev)
 {
-	const struct net_device_stats *stats = dev_get_stats(dev);
+	const struct rtnl_link_stats64 *stats = dev_get_stats(dev);
 
-	seq_printf(seq, "%6s: %7lu %7lu %4lu %4lu %4lu %5lu %10lu %9lu "
-		   "%8lu %7lu %4lu %4lu %4lu %5lu %7lu %10lu\n",
+	seq_printf(seq, "%6s: %7llu %7llu %4llu %4llu %4llu %5llu %10llu %9llu "
+		   "%8llu %7llu %4llu %4llu %4llu %5llu %7llu %10llu\n",
 		   dev->name, stats->rx_bytes, stats->rx_packets,
 		   stats->rx_errors,
 		   stats->rx_dropped + stats->rx_missed_errors,
@@ -5266,18 +5266,21 @@ EXPORT_SYMBOL(dev_txq_stats_fold);
  *	@dev: device to get statistics from
  *
  *	Get network statistics from device. The device driver may provide
- *	its own method by setting dev->netdev_ops->get_stats; otherwise
- *	the internal statistics structure is used.
+ *	its own method by setting dev->netdev_ops->get_stats64 or
+ *	dev->netdev_ops->get_stats; otherwise the internal statistics
+ *	structure is used.
  */
-const struct net_device_stats *dev_get_stats(struct net_device *dev)
+const struct rtnl_link_stats64 *dev_get_stats(struct net_device *dev)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
 
+	if (ops->ndo_get_stats64)
+		return ops->ndo_get_stats64(dev);
 	if (ops->ndo_get_stats)
-		return ops->ndo_get_stats(dev);
+		return (struct rtnl_link_stats64 *)ops->ndo_get_stats(dev);
 
 	dev_txq_stats_fold(dev, &dev->stats);
-	return &dev->stats;
+	return &dev->stats64;
 }
 EXPORT_SYMBOL(dev_get_stats);
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 99e7052..ea3bb4c 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -29,6 +29,7 @@ static const char fmt_hex[] = "%#x\n";
 static const char fmt_long_hex[] = "%#lx\n";
 static const char fmt_dec[] = "%d\n";
 static const char fmt_ulong[] = "%lu\n";
+static const char fmt_u64[] = "%llu\n";
 
 static inline int dev_isalive(const struct net_device *dev)
 {
@@ -324,14 +325,13 @@ static ssize_t netstat_show(const struct device *d,
 	struct net_device *dev = to_net_dev(d);
 	ssize_t ret = -EINVAL;
 
-	WARN_ON(offset > sizeof(struct net_device_stats) ||
-			offset % sizeof(unsigned long) != 0);
+	WARN_ON(offset > sizeof(struct rtnl_link_stats64) ||
+			offset % sizeof(u64) != 0);
 
 	read_lock(&dev_base_lock);
 	if (dev_isalive(dev)) {
-		const struct net_device_stats *stats = dev_get_stats(dev);
-		ret = sprintf(buf, fmt_ulong,
-			      *(unsigned long *)(((u8 *) stats) + offset));
+		const struct rtnl_link_stats64 *stats = dev_get_stats(dev);
+		ret = sprintf(buf, fmt_u64, *(u64 *)(((u8 *) stats) + offset));
 	}
 	read_unlock(&dev_base_lock);
 	return ret;
@@ -343,7 +343,7 @@ static ssize_t show_##name(struct device *d,				\
 			   struct device_attribute *attr, char *buf) 	\
 {									\
 	return netstat_show(d, attr, buf,				\
-			    offsetof(struct net_device_stats, name));	\
+			    offsetof(struct rtnl_link_stats64, name));	\
 }									\
 static DEVICE_ATTR(name, S_IRUGO, show_##name, NULL)
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 1a2af24..e645778 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -579,7 +579,7 @@ static unsigned int rtnl_dev_combine_flags(const struct net_device *dev,
 }
 
 static void copy_rtnl_link_stats(struct rtnl_link_stats *a,
-				 const struct net_device_stats *b)
+				 const struct rtnl_link_stats64 *b)
 {
 	a->rx_packets = b->rx_packets;
 	a->tx_packets = b->tx_packets;
@@ -610,7 +610,7 @@ static void copy_rtnl_link_stats(struct rtnl_link_stats *a,
 	a->tx_compressed = b->tx_compressed;
 }
 
-static void copy_rtnl_link_stats64(void *v, const struct net_device_stats *b)
+static void copy_rtnl_link_stats64(void *v, const struct rtnl_link_stats64 *b)
 {
 	struct rtnl_link_stats64 a;
 
@@ -791,7 +791,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 {
 	struct ifinfomsg *ifm;
 	struct nlmsghdr *nlh;
-	const struct net_device_stats *stats;
+	const struct rtnl_link_stats64 *stats;
 	struct nlattr *attr;
 
 	nlh = nlmsg_put(skb, pid, seq, type, sizeof(*ifm), flags);
-- 
1.6.2.5

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCHv3 2/2] sfc: Implement 64-bit net device statistics on all architectures
From: Ben Hutchings @ 2010-06-08 17:21 UTC (permalink / raw)
  To: David Miller; +Cc: Stephen Hemminger, Arnd Bergmann, netdev, linux-net-drivers
In-Reply-To: <1276017594.2185.11.camel@achroite.uk.solarflarecom.com>

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
This is unchanged from v1.

Ben.

 drivers/net/sfc/efx.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/sfc/efx.c b/drivers/net/sfc/efx.c
index 26b0cc2..8ad476a 100644
--- a/drivers/net/sfc/efx.c
+++ b/drivers/net/sfc/efx.c
@@ -1492,11 +1492,11 @@ static int efx_net_stop(struct net_device *net_dev)
 }
 
 /* Context: process, dev_base_lock or RTNL held, non-blocking. */
-static struct net_device_stats *efx_net_stats(struct net_device *net_dev)
+static struct rtnl_link_stats64 *efx_net_stats(struct net_device *net_dev)
 {
 	struct efx_nic *efx = netdev_priv(net_dev);
 	struct efx_mac_stats *mac_stats = &efx->mac_stats;
-	struct net_device_stats *stats = &net_dev->stats;
+	struct rtnl_link_stats64 *stats = &net_dev->stats64;
 
 	spin_lock_bh(&efx->stats_lock);
 	efx->type->update_stats(efx);
@@ -1630,7 +1630,7 @@ static void efx_set_multicast_list(struct net_device *net_dev)
 static const struct net_device_ops efx_netdev_ops = {
 	.ndo_open		= efx_net_open,
 	.ndo_stop		= efx_net_stop,
-	.ndo_get_stats		= efx_net_stats,
+	.ndo_get_stats64	= efx_net_stats,
 	.ndo_tx_timeout		= efx_watchdog,
 	.ndo_start_xmit		= efx_hard_start_xmit,
 	.ndo_validate_addr	= eth_validate_addr,
-- 
1.6.2.5

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* Re: e1000e driver, Intel 82567LF-2, link negotiation (and wol) problems
From: David Härdeman @ 2010-06-08 17:36 UTC (permalink / raw)
  To: John Ronciak; +Cc: netdev, jesse.brandeburg
In-Reply-To: <AANLkTikLgAsAQ-LmWOX9CTZHwVl7h7-ii4omnyHH79Al@mail.gmail.com>

On Thu, Jun 03, 2010 at 03:39:24PM -0700, John Ronciak wrote:
> On Thu, Jun 3, 2010 at 3:20 PM, David Härdeman <david@hardeman.nu> wrote:
> > I have an Intel DG45FC motherboard with an integrated gigabit NIC (lspci
> > says it's a "Intel Corporation 82567LF-2 Gigabit Network Connection").
> >
> > When using the in-kernel e1000e driver (tried up to kernel version
> > 2.6.34), the speed is negotiated to 100mbit (most of the time) even
> > though the NIC is connected to a gigabit switch using quality cables
> > (I've tried a few different to be sure).  There seems to be no real
> > pattern to when the link is negotiated to 100mbit or 1000mbit.
> >
> > I've tried Intel's version of the driver (e1000e from sourceforge,
> > version 1.1.19) and it seems to behave in the same way.
> >
> > The output from mii-tool is quite confusing:
> > scott:~# mii-tool -v eth0
> > SIOCGMIIREG on eth0 failed: Input/output error
> > SIOCGMIIREG on eth0 failed: Input/output error
> > eth0: negotiated 100baseTx-FD flow-control, link ok
> >  product info: vendor 00:50:43, model 11 rev 0
> >  basic mode:   autonegotiation enabled
> >  basic status: autonegotiation complete, link ok
> >  capabilities: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD
> > 10baseT-HD
> >  advertising:  100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
> > flow-control
> >  link partner: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD
> > 10baseT-HD flow-control
> >
> > (capabilities and link partner agree on 1000mbit, but only 100mbit is
> > advertised according to mii-tool)
> >
> > ethtool disagrees with mii-tool:
> > scott:~# ethtool eth0
> > Settings for eth0:
> >        Supported ports: [ TP ]
> >        Supported link modes:   10baseT/Half 10baseT/Full 100baseT/Half
> > 100baseT/Full 1000baseT/Full Supports auto-negotiation: Yes
> >        Advertised link modes:  10baseT/Half 10baseT/Full 100baseT/Half
> > 100baseT/Full 1000baseT/Full Advertised pause frame use: No
> >        Advertised auto-negotiation: Yes
> >        Speed: 100Mb/s
> >        Duplex: Full
> >        Port: Twisted Pair
> >        PHYAD: 2
> >        Transceiver: internal
> >        Auto-negotiation: on
> >        MDI-X: on
> >        Supports Wake-on: pumbag
> >        Wake-on: g
> >        Current message level: 0x00000001 (1)
> >        Link detected: yes
> >
> > Manually setting the speed with ethtool doesn't work. Not sure how to
> > proceed...any suggestions?
> >
> > (And while I'm at it, the Intel e1000e driver from sourceforge seems to
> > have a wol init bug, ethtool reports "Wake-on: g" but I can wake a
> > suspended machine using a simple ping. Calling "ethtool -s eth0 wol g"
> > before suspending gets the expected behaviour - i.e. only wake on a
> > magic wol packet. Don't want to register on sourceforge just to report
> > that to the bug tracker though).
> >
> > Not subscribed to netdev, please CC me on any answers.
> >
> > --
> > David Härdeman
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> What is your link partner?  

A netgear gigabit switch

> Do different ones all do the same thing?

Haven't tried different ones yet, would have to lug hardware all over 
the apartment to do so

> Is the link partner configured for auto-neg?  

I guess so, switches should be?

> If you force speed and
> duplex to 1000/full on both sides, is it linked at 1000/full? 

Can't force anything on the switch 


> You have to force both sides of the connection for it to work 
> correctly.
> Same for auto-neg, both sides need to be set to do that.  Maybe the
> link partner is only advertising 100Mb and so that is what it is
> linking to?

The switch *is* advertising 1000Mb, if I reboot the computer *or* 
plug/unplug the cable, the NIC will negotiate 1000Mb about 1/3 of the 
time. Also, the NIC seems to negotiate 1000Mb while the computer is 
booting (during the BIOS phase, before the kernel takes over).

> -- 
> Cheers,
> John

Thanks for the feedback, any other suggestions? 

-- 
David Härdeman

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox