Netdev List

Netdev List
 help / color / mirror / Atom feed

* [RFC PATCH v7 18/19] Add a kconfig entry and make entry for mp device.
From: xiaohui.xin @ 2010-06-05 10:14 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1275732899-5423-17-git-send-email-xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 drivers/vhost/Kconfig  |   10 ++++++++++
 drivers/vhost/Makefile |    2 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index e4e2fd1..a6b8cbf 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,13 @@ config VHOST_NET
 	  To compile this driver as a module, choose M here: the module will
 	  be called vhost_net.
 
+config MEDIATE_PASSTHRU
+	tristate "mediate passthru network driver (EXPERIMENTAL)"
+	depends on VHOST_NET
+	---help---
+	  zerocopy network I/O support, we call it as mediate passthru to
+	  be distiguish with hardare passthru.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called mpassthru.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..c18b9fc 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o
-- 
1.5.4.4

^ permalink raw reply related

* [RFC PATCH v7 00/19] Provide a zero-copy method on KVM virtio-net.
From: xiaohui.xin @ 2010-06-05 10:14 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike
In-Reply-To: <1275732899-5423-19-git-send-email-xiaohui.xin@intel.com>

We provide an zero-copy method which driver side may get external
buffers to DMA. Here external means driver don't use kernel space
to allocate skb buffers. Currently the external buffer can be from
guest virtio-net driver.

The idea is simple, just to pin the guest VM user space and then
let host NIC driver has the chance to directly DMA to it. 
The patches are based on vhost-net backend driver. We add a device
which provides proto_ops as sendmsg/recvmsg to vhost-net to
send/recv directly to/from the NIC driver. KVM guest who use the
vhost-net backend may bind any ethX interface in the host side to
get copyless data transfer thru guest virtio-net frontend.

patch 01-13:  	net core changes.
patch 14-18:  	new device as interface to mantpulate external buffers.
patch 19: 	for vhost-net.

The guest virtio-net driver submits multiple requests thru vhost-net
backend driver to the kernel. And the requests are queued and then
completed after corresponding actions in h/w are done.

For read, user space buffers are dispensed to NIC driver for rx when
a page constructor API is invoked. Means NICs can allocate user buffers
from a page constructor. We add a hook in netif_receive_skb() function
to intercept the incoming packets, and notify the zero-copy device.

For write, the zero-copy deivce may allocates a new host skb and puts
payload on the skb_shinfo(skb)->frags, and copied the header to skb->data.
The request remains pending until the skb is transmitted by h/w.

Here, we have ever considered 2 ways to utilize the page constructor
API to dispense the user buffers.

One:	Modify __alloc_skb() function a bit, it can only allocate a 
	structure of sk_buff, and the data pointer is pointing to a 
	user buffer which is coming from a page constructor API.
	Then the shinfo of the skb is also from guest.
	When packet is received from hardware, the skb->data is filled
	directly by h/w. What we have done is in this way.

	Pros:	We can avoid any copy here.
	Cons:	Guest virtio-net driver needs to allocate skb as almost
		the same method with the host NIC drivers, say the size
		of netdev_alloc_skb() and the same reserved space in the
		head of skb. Many NIC drivers are the same with guest and
		ok for this. But some lastest NIC drivers reserves special
		room in skb head. To deal with it, we suggest to provide
		a method in guest virtio-net driver to ask for parameter
		we interest from the NIC driver when we know which device 
		we have bind to do zero-copy. Then we ask guest to do so.

Two:	Modify driver to get user buffer allocated from a page constructor
	API(to substitute alloc_page()), the user buffer are used as payload
	buffers and filled by h/w directly when packet is received. Driver
	should associate the pages with skb (skb_shinfo(skb)->frags). For 
	the head buffer side, let host allocates skb, and h/w fills it. 
	After that, the data filled in host skb header will be copied into
	guest header buffer which is submitted together with the payload buffer.

	Pros:	We could less care the way how guest or host allocates their
		buffers.
	Cons:	We still need a bit copy here for the skb header.

We are not sure which way is the better here. This is the first thing we want
to get comments from the community. We wish the modification to the network
part will be generic which not used by vhost-net backend only, but a user
application may use it as well when the zero-copy device may provides async
read/write operations later.

We have got comments from Michael. And he said the first method will break
the compatiblity of virtio-net driver and may complicate the qemu live 
migration. Currently, we tried to ignore the skb_reserve() if the device
is doing zero-copy. Then guest virtio-net driver wil not changed. So we now
continue to go with the first way. 
But comments about the two ways are still appreicated.

We provide multiple submits and asynchronous notifiicaton to 
vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage.
Exact performance data will be provided later. But for simple
test with netperf, we found bindwidth up and CPU % up too,
but the bindwidth up ratio is much more than CPU % up ratio.

What we have not done yet:
	packet split support
	To support GRO
	Performance tuning

what we have done in v1:
	polish the RCU usage
	deal with write logging in asynchroush mode in vhost
	add notifier block for mp device
	rename page_ctor to mp_port in netdevice.h to make it looks generic
	add mp_dev_change_flags() for mp device to change NIC state
	add CONIFG_VHOST_MPASSTHRU to limit the usage when module is not load
	a small fix for missing dev_put when fail
	using dynamic minor instead of static minor number
	a __KERNEL__ protect to mp_get_sock()

what we have done in v2:

	remove most of the RCU usage, since the ctor pointer is only
	changed by BIND/UNBIND ioctl, and during that time, NIC will be
	stopped to get good cleanup(all outstanding requests are finished),
	so the ctor pointer cannot be raced into wrong situation.

	Remove the struct vhost_notifier with struct kiocb.
	Let vhost-net backend to alloc/free the kiocb and transfer them
	via sendmsg/recvmsg.

	use get_user_pages_fast() and set_page_dirty_lock() when read.

	Add some comments for netdev_mp_port_prep() and handle_mpassthru().

what we have done in v3:
	the async write logging is rewritten 
	a drafted synchronous write function for qemu live migration
	a limit for locked pages from get_user_pages_fast() to prevent Dos
	by using RLIMIT_MEMLOCK

what we have done in v4:
	add iocb completion callback from vhost-net to queue iocb in mp device
	replace vq->receiver by mp_sock_data_ready()
	remove stuff in mp device which access structures from vhost-net
	modify skb_reserve() to ignore host NIC driver reserved space
	rebase to the latest vhost tree
	split large patches into small pieces, especially for net core part.

what we have done in v5:
	address Arnd Bergmann's comments
		-remove IFF_MPASSTHRU_EXCL flag in mp device
		-Add CONFIG_COMPAT macro
		-remove mp_release ops
	move dev_is_mpassthru() as inline func
	fix a bug in memory relinquish
	Apply to current git (2.6.34-rc6) tree.

what we have done in v6:
	move create_iocb() out of page_dtor which may happen in interrupt context
	-This remove the potential issues which lock called in interrupt context
	make the cache used by mp, vhost as static, and created/destoryed during
	modules init/exit functions.
	-This makes multiple mp guest created at the same time.

what we have done in v7:
	some cleanup prepared to suppprt PS mode

performance:
	using netperf with GSO/TSO disabled, 10G NIC, 
	disabled packet split mode, with raw socket case compared to vhost.

	bindwidth will be from 1.1Gbps to 1.7Gbps
	CPU % from 120%-140% to 140%-160%

	We have retested the performance based on 2.6.34-rc6 in above situtation.
				BW		CPU %
	vhost			1.4Gbps		120% ~ 130%
	vhost + zero-copy	2.7Gbps		160% ~ 180%

^ permalink raw reply

* [PATCH] r8169: fix random mdio_write failures
From: Timo Teräs @ 2010-06-05 10:21 UTC (permalink / raw)
  To: netdev; +Cc: Timo Teräs, françois romieu, Edward Hsu
In-Reply-To: <4C0A1736.9030209@iki.fi>

Some configurations need delay between the "write completed" indication
and new write to work reliably.

Realtek driver seems to use longer delay when polling the "write complete"
bit, so it waits long enough between writes with high probability (but
could probably break too). This patch adds a new udelay to make sure we
wait unconditionally some time after the write complete indication.

This caused a regression with XID 18000000 boards when the board specific
phy configuration writing many mdio registers was added in commit
2e955856ff (r8169: phy init for the 8169scd). Some of the configration
mdio writes would almost always fail, and depending on failure might leave
the PHY in non-working state.

Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Cc: françois romieu <romieu@fr.zoreil.com>
Cc: Edward Hsu <edward_hsu@realtek.com.tw>
---
 drivers/net/r8169.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index 217e709..03a8318 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -559,6 +559,11 @@ static void mdio_write(void __iomem *ioaddr, int reg_addr, int value)
 			break;
 		udelay(25);
 	}
+	/*
+	 * Some configurations require a small delay even after the write
+	 * completed indication or the next write might fail.
+	 */
+	udelay(25);
 }

 static int mdio_read(void __iomem *ioaddr, int reg_addr)
-- 
1.7.0.4

^ permalink raw reply related

* [RFC] act_cpu: redirect skb receiving to a special CPU.
From: Changli Gao @ 2010-06-05 10:56 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller, Tom Herbert, Jamal Hadi Salim
  Cc: Linux Netdev List

I am going to implement a CPU action, which can be used with ingress
qdisc to redirect skb receiving to a special cpu. It is much like RPS,
but more flexible:

* choose the hash function with the help of cls_flow.c
* pin special traffic to a dedicate CPU
* weighted packets distributing

act_cpu will use the function enqueue_to_backlog() supplied by RPS to
redirect skb receiving, and have two kind paramter:

* cpu CPUID - the ID of CPU, which handles this traffic
* map OFFSET - map the mirror class ID to CPUID: CPUID = mirror class ID + CPUID

sch_ingress will be enhanced to support class tree.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: 2.6.32 and Multicast group membership
From: Mark Smith @ 2010-06-05 11:33 UTC (permalink / raw)
  To: Mr. Berkley Shands; +Cc: Net Dev
In-Reply-To: <4C057DF4.3000305@exegy.com>

On Tue, 01 Jun 2010 16:39:00 -0500
"Mr. Berkley Shands" <bshands@exegy.com> wrote:

> starting in 2.6.32, my multicast connections stop getting data after 
> 60-100 seconds.
> The identical user code works fine under 2.6.22 through 2.6.31.
> 
> The NIC (an intel 82586) has two ports on the same subnet
> (eth0 at 172.16.21.55/24 and eth1 at 172.16.21.56/24)
> 
>       if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, (char*)&req, 
> sizeof(req)))
>       {
>          perror("setsockopt IP_ADD_MEMBERSHIP failed");
>          ::exit(-1);
>       }
> 
> If I do ADD_MEMBERSHIP on just one of these interfaces, the sockets 
> still get data.
> But if I join on both interfaces, one or both will stop getting packets
> after 60-100 seconds. Sniffing with tcpdump shows the Cisco layer 3 switch
> is not getting its responses back to keep the multicast group open.

While it could be a kernel change, that sounds like it might also be
related to IGMP snooping on the Cisco switch. Any changes made to that
recently?

> The HP layer 2 switch does not seem to care, it keeps the data flowing 
> regardless
> of which physical port the join is executed on.
> 

This is expected behaviour on a 'dumb' layer 2 switch i.e. one that
doesn't perform IGMP snooping and therefore doesn't suppress multicasts
towards end-nodes that aren't subscribed. That further suggests a
change on the Cisco.

FWIW, I'm having no trouble with IPv4 multicast under 2.6.33, plugged
into Cisco switch, with IGMP snooping, but not multicast suppression,
enabled.

> Changing IP_MULTICAST_ALL has no effect :-(
> Did I miss something? New code that I have to specify to keep the Cisco 
> happy?
> 
> tia
> 
> Berkley
> 
> 
> -- 
> 
> // E. F. Berkley Shands, MSc//
> 
> ** Exegy Inc.**
> 
> 349 Marshall Road, Suite 100
> 
> St. Louis , MO  63119
> 
> Direct:  (314) 218-3600 X450
> 
> Cell:  (314) 303-2546
> 
> Office:  (314) 218-3600
> 
> Fax:  (314) 218-3601
> 
>  
> 
> The Usual Disclaimer follows...
> 
>  
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] r8169: fix random mdio_write failures
From: Francois Romieu @ 2010-06-05 12:41 UTC (permalink / raw)
  To: Timo Teräs; +Cc: netdev, Edward Hsu, Hayes, davem
In-Reply-To: <1275733273-28321-1-git-send-email-timo.teras@iki.fi>

Timo Teräs <timo.teras@iki.fi> :
[...]
> diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
> index 217e709..03a8318 100644
> --- a/drivers/net/r8169.c
> +++ b/drivers/net/r8169.c
> @@ -559,6 +559,11 @@ static void mdio_write(void __iomem *ioaddr, int reg_addr, int value)
>  			break;
>  		udelay(25);
>  	}
> +	/*
> +	 * Some configurations require a small delay even after the write
> +	 * completed indication or the next write might fail.
> +	 */
> +	udelay(25);

Acked-off-by: Francois Romieu <romieu@fr.zoreil.com>

Good work.

I wonder if increasing the in-loop delay as well would help the write
succeed faster (or slower ?).

-- 
Ueimor

^ permalink raw reply

* Re: [PATCH] act_mirred: don't clone skb when skb isn't shared
From: jamal @ 2010-06-05 12:53 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, netdev
In-Reply-To: <1275658990-15838-1-git-send-email-xiaosuo@gmail.com>

On Fri, 2010-06-04 at 21:43 +0800, Changli Gao wrote:
> don't clone skb when skb isn't shared
> 
> When the tcf_action is TC_ACT_STOLEN, and the skb isn't shared, we don't need
> to clone a new skb. As the skb will be freed after this function returns, we
> can use it freely once we get a reference to it.

It looks like a good optimization - but i am not a big fan of one-offs
[because usability goes down and I am forced to explain it longer in the
rules (refer to: Documentation/networking/tc-actions-env-rules.txt)]

How about you update skb_act_clone to take take the action code as well
and do the check the if stolen/queued it does a skb_get otherwise it
calls skb_clone?

cheers,
jamal

^ permalink raw reply

* Re: [RFC] act_cpu: redirect skb receiving to a special CPU.
From: jamal @ 2010-06-05 13:07 UTC (permalink / raw)
  To: Changli Gao; +Cc: Eric Dumazet, David S. Miller, Tom Herbert, Linux Netdev List
In-Reply-To: <AANLkTikskTWJrULj2IKYhebxh9O_XTOMBy3i0wPY8thq@mail.gmail.com>

Changli,

I like the idea..

My preference would be to not change ingress qdisc to have queues.
The cpuid should be sufficient to map to a remote cpu queue, no?
Now, if you could represent each cpu as a netdevice, then we wouldnt
need any change;-> And we could have multiple types of ways to redirect
to cpus instead of just doing IPIs - example, ive always thought of 
sending over something like HT (I think it would be a lot cheaper).

I didnt queit understand the map OFFSET part. is this part of rfs?

cheers,
jamal

On Sat, 2010-06-05 at 18:56 +0800, Changli Gao wrote:
> I am going to implement a CPU action, which can be used with ingress
> qdisc to redirect skb receiving to a special cpu. It is much like RPS,
> but more flexible:
> 
> * choose the hash function with the help of cls_flow.c
> * pin special traffic to a dedicate CPU
> * weighted packets distributing
> 
> act_cpu will use the function enqueue_to_backlog() supplied by RPS to
> redirect skb receiving, and have two kind paramter:
> 
> * cpu CPUID - the ID of CPU, which handles this traffic
> * map OFFSET - map the mirror class ID to CPUID: CPUID = mirror class ID + CPUID
> 
> sch_ingress will be enhanced to support class tree.
> 

^ permalink raw reply

* Re: [PATCH] act_mirred: don't clone skb when skb isn't shared
From: Changli Gao @ 2010-06-05 13:07 UTC (permalink / raw)
  To: hadi; +Cc: David S. Miller, netdev
In-Reply-To: <1275742435.3490.31.camel@bigi>

On Sat, Jun 5, 2010 at 8:53 PM, jamal <hadi@cyberus.ca> wrote:
> On Fri, 2010-06-04 at 21:43 +0800, Changli Gao wrote:
>> don't clone skb when skb isn't shared
>>
>> When the tcf_action is TC_ACT_STOLEN, and the skb isn't shared, we don't need
>> to clone a new skb. As the skb will be freed after this function returns, we
>> can use it freely once we get a reference to it.
>
> It looks like a good optimization - but i am not a big fan of one-offs
> [because usability goes down and I am forced to explain it longer in the
> rules (refer to: Documentation/networking/tc-actions-env-rules.txt)]

Thanks. BTW: act_nat.c doesn't obey the following rule, and you plan
to remove TC_MUNGED and TC_OK2MUNGE?

2) If you munge any packet thou shalt call pskb_expand_head in the case
someone else is referencing the skb. After that you "own" the skb.
You must also tell us if it is ok to munge the packet (TC_OK2MUNGE),
this way any action downstream can stomp on the packet.

>
> How about you update skb_act_clone to take take the action code as well
> and do the check the if stolen/queued it does a skb_get otherwise it
> calls skb_clone?
>

Good idea. Thanks.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCH] act_mirred: don't clone skb when skb isn't shared
From: jamal @ 2010-06-05 13:24 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, netdev
In-Reply-To: <AANLkTikHqzIr-SjPZe-iiMPxVG9PRR9-0A9qeMwzfLYp@mail.gmail.com>

On Sat, 2010-06-05 at 21:07 +0800, Changli Gao wrote:

> Thanks. BTW: act_nat.c doesn't obey the following rule, and you plan
> to remove TC_MUNGED and TC_OK2MUNGE?

> 2) If you munge any packet thou shalt call pskb_expand_head in the case
> someone else is referencing the skb. After that you "own" the skb.
> You must also tell us if it is ok to munge the packet (TC_OK2MUNGE),
> this way any action downstream can stomp on the packet.

That rule still applies but it is upto the discretion of the action.
i.e if the act_nat thinks it is ok for others down the street to trample
on the packet, it should tell us so. Maybe i should change the wording
to use the word "may" in that 3rd sentence.
[I will kill (low prio) TC_OK2MUNGE but not TC_MUNGED.]

cheers,
jamal

^ permalink raw reply

* Re: [RFC] act_cpu: redirect skb receiving to a special CPU.
From: Changli Gao @ 2010-06-05 13:26 UTC (permalink / raw)
  To: hadi; +Cc: Eric Dumazet, David S. Miller, Tom Herbert, Linux Netdev List
In-Reply-To: <1275743224.3490.44.camel@bigi>

On Sat, Jun 5, 2010 at 9:07 PM, jamal <hadi@cyberus.ca> wrote:
> Changli,
>
> I like the idea..
>
> My preference would be to not change ingress qdisc to have queues.

ingress doesn't have any qdisc, but a class tree. The ingress_queue
will be sth. like this:

while (1) {
 result = tc_classify(..., &res);
 cl = ingress_find(res.classid, ...);
 if (!cl->level)
     break;
 ...
}

Then we can classify skbs in tree  manner.

> The cpuid should be sufficient to map to a remote cpu queue, no?

It should be sufficient, but it isn't efficient. With map option, we
can use cls_flow to map traffic to classid, and use act_cpu map to map
classid to cpuid.

> Now, if you could represent each cpu as a netdevice, then we wouldnt
> need any change;-> And we could have multiple types of ways to redirect
> to cpus instead of just doing IPIs - example, ive always thought of
> sending over something like HT (I think it would be a lot cheaper).

I won't implement a new netdevice, but reuse the softnet. Even, I'll
reuse the enqueue_to_backlog() introduced by RPS, and of course, use
IPIs as RPS. Is there another way to trigger an IRQ of the remote CPU?

>
> I didnt queit understand the map OFFSET part. is this part of rfs?
>

No. As class IDs are started from 1, but CPU IDs are started from 0, I
need to minus/add a number from/to class IDs to map class IDs from CPU
IDs.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCH] act_mirred: don't clone skb when skb isn't shared
From: Changli Gao @ 2010-06-05 13:33 UTC (permalink / raw)
  To: hadi; +Cc: David S. Miller, netdev
In-Reply-To: <1275744299.3490.48.camel@bigi>

On Sat, Jun 5, 2010 at 9:24 PM, jamal <hadi@cyberus.ca> wrote:
> On Sat, 2010-06-05 at 21:07 +0800, Changli Gao wrote:
>
>> Thanks. BTW: act_nat.c doesn't obey the following rule, and you plan
>> to remove TC_MUNGED and TC_OK2MUNGE?
>
>> 2) If you munge any packet thou shalt call pskb_expand_head in the case
>> someone else is referencing the skb. After that you "own" the skb.
>> You must also tell us if it is ok to munge the packet (TC_OK2MUNGE),
>> this way any action downstream can stomp on the packet.
>
> That rule still applies but it is upto the discretion of the action.
> i.e if the act_nat thinks it is ok for others down the street to trample
> on the packet, it should tell us so. Maybe i should change the wording
> to use the word "may" in that 3rd sentence.
> [I will kill (low prio) TC_OK2MUNGE but not TC_MUNGED.]
>

If you kill TC_OK2MUNGE, you should kill TC_MUNGED too, as it will
become useless.

localhost linux # grep MUNGE net/sched/ -R
net/sched/act_pedit.c:  if (!(skb->tc_verd & TC_OK2MUNGE)) {
net/sched/act_pedit.c:                  skb->tc_verd =
SET_TC_MUNGED(skb->tc_verd);
net/sched/act_api.c:                    if (TC_MUNGED & skb->tc_verd) {
net/sched/act_api.c:                            skb->tc_verd =
SET_TC_OK2MUNGE(skb->tc_verd);
net/sched/act_api.c:                            skb->tc_verd =
CLR_TC_MUNGED(skb->tc_verd);

int tcf_action_exec(struct sk_buff *skb, struct tc_action *act,
                    struct tcf_result *res)
{
        struct tc_action *a;
        int ret = -1;

        if (skb->tc_verd & TC_NCLS) {
                skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
                ret = TC_ACT_OK;
                goto exec_done;
        }
        while ((a = act) != NULL) {
repeat:
                if (a->ops && a->ops->act) {
                        ret = a->ops->act(skb, a, res);
                        if (TC_MUNGED & skb->tc_verd) {
                                /* copied already, allow trampling */
                                skb->tc_verd = SET_TC_OK2MUNGE(skb->tc_verd);
                                skb->tc_verd = CLR_TC_MUNGED(skb->tc_verd);
                        }
                        if (ret == TC_ACT_REPEAT)
                                goto repeat;    /* we need a ttl - JHS */
                        if (ret != TC_ACT_PIPE)
                                goto exec_done;
                }
                act = a->next;
        }
exec_done:
        return ret;
}

The bit OK2MUNGE relies on MUNGED only.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [RFC] act_cpu: redirect skb receiving to a special CPU.
From: jamal @ 2010-06-05 13:54 UTC (permalink / raw)
  To: Changli Gao; +Cc: Eric Dumazet, David S. Miller, Tom Herbert, Linux Netdev List
In-Reply-To: <AANLkTiltF2APEnqHyHTtHA7JMNp3HX3aCu9UL3nUJ5_u@mail.gmail.com>

On Sat, 2010-06-05 at 21:26 +0800, Changli Gao wrote:

> ingress doesn't have any qdisc, but a class tree. The ingress_queue
> will be sth. like this:
[..]

> Then we can classify skbs in tree  manner.
[..]
> > The cpuid should be sufficient to map to a remote cpu queue, no?
> 
> It should be sufficient, but it isn't efficient. With map option, we
> can use cls_flow to map traffic to classid, and use act_cpu map to map
> classid to cpuid.

I am missing something, I would see the flow as:
-->ethx/lo/etc->ingressqdisc->classify-->action(redirect to cpuidX)
Why/when do you need the tree variant? If you are thinking of maybe
rate limiting to a specific CPU, then would passing it to a policer
first not be sufficient?  IOW, classid is not very useful.

> I won't implement a new netdevice, but reuse the softnet. Even, I'll
> reuse the enqueue_to_backlog() introduced by RPS, and of course, use
> IPIs as RPS. Is there another way to trigger an IRQ of the remote CPU?

I would look at it as "messaging of remote CPU" which may not result
in an IRQ. I am pretty sure if you tried hard you could use HT in AMD
hardware - the remote cpu may have an IRQ triggered but it wont be as
expensive as IPI.

cheers,
jamal

^ permalink raw reply

* Re: [PATCH] phylib: Add support for the LXT973 phy.
From: Richard Cochran @ 2010-06-05 14:00 UTC (permalink / raw)
  To: Andy Fleming; +Cc: David Miller, netdev
In-Reply-To: <AANLkTinR9_VoWX8zZu93MeohHvCbCy_P5XWHH3rxlPuH@mail.gmail.com>

On Wed, Jun 02, 2010 at 02:32:11PM -0500, Andy Fleming wrote:
> Yeah, I was clearly not thinking clearly.  dev_flags will be
> overwritten, and is not meant for this.  I believe, what we should do
> is add a "port" field to the PHY device, and if PCR_FIBER_SELECT is
> set, then set the port field to PORT_FIBRE.  I'm not entirely clear on
> the semantics of that field in the ethtool cmd, but it seems like the
> right idea.

Here is another try. Is that more like it?

Richard


This patch implements a work around for Erratum 5, "3.3 V Fiber Speed
Selection." If the hardware wiring does not respect this erratum, then
fiber optic mode will not work properly.

As part of the fix, the patch introduces a new field 'port_flags' into
the 'struct phy_device'. This field allows phy drivers to describe
fixed attributes of the port. Only phy drivers should write this field.

Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
---
 drivers/net/phy/lxt.c |   52 ++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/phy.h   |    8 +++++++
 2 files changed, 59 insertions(+), 1 deletions(-)

diff --git a/drivers/net/phy/lxt.c b/drivers/net/phy/lxt.c
index 8ee929b..ef4a320 100644
--- a/drivers/net/phy/lxt.c
+++ b/drivers/net/phy/lxt.c
@@ -53,6 +53,9 @@
 
 #define MII_LXT971_ISR		19  /* Interrupt Status Register */
 
+/* register definitions for the 973 */
+#define MII_LXT973_PCR 16 /* Port Configuration Register */
+#define PCR_FIBER_SELECT 1
 
 MODULE_DESCRIPTION("Intel LXT PHY driver");
 MODULE_AUTHOR("Andy Fleming");
@@ -119,6 +122,34 @@ static int lxt971_config_intr(struct phy_device *phydev)
 	return err;
 }
 
+static int lxt973_probe(struct phy_device *phydev)
+{
+	int val = phy_read(phydev, MII_LXT973_PCR);
+
+	if (val & PCR_FIBER_SELECT) {
+		/*
+		 * If fiber is selected, then the only correct setting
+		 * is 100Mbps, full duplex, and auto negotiation off.
+		 */
+		val = phy_read(phydev, MII_BMCR);
+		val |= (BMCR_SPEED100 | BMCR_FULLDPLX);
+		val &= ~BMCR_ANENABLE;
+		phy_write(phydev, MII_BMCR, val);
+		/* Remember that the port is in fiber mode. */
+		phydev->port_flags |= PHY_PORT_FIBER;
+	} else {
+		phydev->port_flags &= ~PHY_PORT_FIBER;
+	}
+	return 0;
+}
+
+static int lxt973_config_aneg(struct phy_device *phydev)
+{
+	/* Do nothing if port is in fiber mode. */
+	return phydev->port_flags & PHY_PORT_FIBER ?
+		0 : genphy_config_aneg(phydev);
+}
+
 static struct phy_driver lxt970_driver = {
 	.phy_id		= 0x78100000,
 	.name		= "LXT970",
@@ -146,6 +177,18 @@ static struct phy_driver lxt971_driver = {
 	.driver 	= { .owner = THIS_MODULE,},
 };
 
+static struct phy_driver lxt973_driver = {
+	.phy_id		= 0x00137a10,
+	.name		= "LXT973",
+	.phy_id_mask	= 0xfffffff0,
+	.features	= PHY_BASIC_FEATURES,
+	.flags		= 0,
+	.probe		= lxt973_probe,
+	.config_aneg	= lxt973_config_aneg,
+	.read_status	= genphy_read_status,
+	.driver		= { .owner = THIS_MODULE,},
+};
+
 static int __init lxt_init(void)
 {
 	int ret;
@@ -157,9 +200,15 @@ static int __init lxt_init(void)
 	ret = phy_driver_register(&lxt971_driver);
 	if (ret)
 		goto err2;
+
+	ret = phy_driver_register(&lxt973_driver);
+	if (ret)
+		goto err3;
 	return 0;
 
- err2:	
+ err3:
+	phy_driver_unregister(&lxt971_driver);
+ err2:
 	phy_driver_unregister(&lxt970_driver);
  err1:
 	return ret;
@@ -169,6 +218,7 @@ static void __exit lxt_exit(void)
 {
 	phy_driver_unregister(&lxt970_driver);
 	phy_driver_unregister(&lxt971_driver);
+	phy_driver_unregister(&lxt973_driver);
 }
 
 module_init(lxt_init);
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 1c75b6b..602228c 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -234,6 +234,11 @@ enum phy_state {
 	PHY_RESUMING
 };
 
+/*
+ * PHY_PORT_xxx: flags to describe the port's fixed attributes.
+ */
+#define PHY_PORT_FIBER 0x00000001 /* Port has a fiber optic transceiver */
+
 /* phy_device: An instance of a PHY
  *
  * drv: Pointer to the driver for this PHY instance
@@ -246,6 +251,7 @@ enum phy_state {
  * link_timeout: The number of timer firings to wait before the
  * giving up on the current attempt at acquiring a link
  * irq: IRQ number of the PHY's interrupt (-1 if none)
+ * port_flags: Bit field of PHY_PORT_xxx flags
  * phy_timer: The timer for handling the state machine
  * phy_queue: A work_queue for the interrupt
  * attached_dev: The attached enet driver's device instance ptr
@@ -314,6 +320,8 @@ struct phy_device {
 	 */
 	int irq;
 
+	int port_flags;
+
 	/* private data pointer */
 	/* For use by PHYs to maintain extra state */
 	void *priv;
-- 
1.6.3.3


^ permalink raw reply related

* Re: [PATCH] act_mirred: don't clone skb when skb isn't shared
From: jamal @ 2010-06-05 13:59 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, netdev
In-Reply-To: <AANLkTinh6OylNU8NLXB6ZSb012AmzkFhzFQWPR-iAsrf@mail.gmail.com>

On Sat, 2010-06-05 at 21:33 +0800, Changli Gao wrote:

> 
> If you kill TC_OK2MUNGE, you should kill TC_MUNGED too, as it will
> become useless.

Those two have different semantics and use different verdict bits:
one is saying "this packet has been munged" another is "I give you ok to
munge this packet".
Thanks for the suggestion - I will think about it some more before
deleting anything.

cheers,
jamal

^ permalink raw reply

* Re: [RFC] act_cpu: redirect skb receiving to a special CPU.
From: Changli Gao @ 2010-06-05 14:15 UTC (permalink / raw)
  To: hadi; +Cc: Eric Dumazet, David S. Miller, Tom Herbert, Linux Netdev List
In-Reply-To: <1275746045.3490.60.camel@bigi>

On Sat, Jun 5, 2010 at 9:54 PM, jamal <hadi@cyberus.ca> wrote:
> On Sat, 2010-06-05 at 21:26 +0800, Changli Gao wrote:
>
>> ingress doesn't have any qdisc, but a class tree. The ingress_queue
>> will be sth. like this:
> [..]
>
>> Then we can classify skbs in tree  manner.
> [..]
>> > The cpuid should be sufficient to map to a remote cpu queue, no?
>>
>> It should be sufficient, but it isn't efficient. With map option, we
>> can use cls_flow to map traffic to classid, and use act_cpu map to map
>> classid to cpuid.
>
> I am missing something, I would see the flow as:
> -->ethx/lo/etc->ingressqdisc->classify-->action(redirect to cpuidX)
> Why/when do you need the tree variant? If you are thinking of maybe
> rate limiting to a specific CPU, then would passing it to a policer
> first not be sufficient?  IOW, classid is not very useful.

For instance: there are 4 CPUs. I want redirect traffic to CPU 1-3
evenly. If the qdisc is linear the rules as

flow classify(flow classid ffff:2-4) | tc_index 2 action cpu 1 |
tc_index 3 action cpu 2 | tc_index 4 action cpu3

a tree variant:

class ffff:1 : flow classify(flow classid ffff:2-4)
class ffff:2 parent ffff:1 : action cpu 1
class ffff:3 parent ffff:1 : action cpu 2
class ffff:4 parent ffff:1 : action cpu 3

ingress_classify: use flow classify to get the subclass ID, then find
the corresponding class and exec action.

When there are lots of CPUs, tree is more efficient.

>
>> I won't implement a new netdevice, but reuse the softnet. Even, I'll
>> reuse the enqueue_to_backlog() introduced by RPS, and of course, use
>> IPIs as RPS. Is there another way to trigger an IRQ of the remote CPU?
>
> I would look at it as "messaging of remote CPU" which may not result
> in an IRQ. I am pretty sure if you tried hard you could use HT in AMD
> hardware - the remote cpu may have an IRQ triggered but it wont be as
> expensive as IPI.
>

It seems AMD specific. Why do the AMD guys use this to implement async
smp_call_function() if it is useful as you said?

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [RFC] act_cpu: redirect skb receiving to a special CPU.
From: jamal @ 2010-06-05 14:26 UTC (permalink / raw)
  To: Changli Gao; +Cc: Eric Dumazet, David S. Miller, Tom Herbert, Linux Netdev List
In-Reply-To: <AANLkTim83nfQDuLNmYAvk6RBqxot0t6cGq-lbic4-DSs@mail.gmail.com>

On Sat, 2010-06-05 at 22:15 +0800, Changli Gao wrote:

> For instance: there are 4 CPUs. I want redirect traffic to CPU 1-3
> evenly. If the qdisc is linear the rules as
> 
> flow classify(flow classid ffff:2-4) | tc_index 2 action cpu 1 |
> tc_index 3 action cpu 2 | tc_index 4 action cpu3
> 
> a tree variant:
> 
> class ffff:1 : flow classify(flow classid ffff:2-4)
> class ffff:2 parent ffff:1 : action cpu 1
> class ffff:3 parent ffff:1 : action cpu 2
> class ffff:4 parent ffff:1 : action cpu 3
> 
> ingress_classify: use flow classify to get the subclass ID, then find
> the corresponding class and exec action.
> 
> When there are lots of CPUs, tree is more efficient.

I still didnt follow .. 
Even if i had a million CPUs, A classifier matches some filter
and an action already bound to filter is executed. So the expensive
part is the classifier lookup.

> It seems AMD specific. Why do the AMD guys use this to implement async
> smp_call_function() if it is useful as you said?

Indeed it is AMD specific - but my view is if i was using AMD that would
be more efficient way of doing it; i.e IPI is the lowest common
denominator which works on all archs. Essentially what i am saying is
this would be a "inter-cpu messaging netdev" and i could replace its
send/recv parts from what we do in the RPS path right now to one that
uses AMD hypertransport etc.

cheers,
jamal

^ permalink raw reply

* [PATCH v2] act_mirred: don't clone skb when skb isn't shared
From: Changli Gao @ 2010-06-05 12:39 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: David S. Miller, netdev, Changli Gao

don't clone skb when skb isn't shared

When the tcf_action is TC_ACT_STOLEN, and the skb isn't shared, we don't need
to clone a new skb. As the skb will be freed after this function returns, we
can use it freely once we get a reference to it.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
 include/net/sch_generic.h |   11 +++++++++--
 net/sched/act_mirred.c    |    6 +++---
 2 files changed, 12 insertions(+), 5 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 03ca5d8..d7d4439 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -571,9 +571,16 @@ static inline u32 qdisc_l2t(struct qdisc_rate_table* rtab, unsigned int pktlen)
 }
 
 #ifdef CONFIG_NET_CLS_ACT
-static inline struct sk_buff *skb_act_clone(struct sk_buff *skb, gfp_t gfp_mask)
+static inline struct sk_buff *skb_act_clone(struct sk_buff *skb, gfp_t gfp_mask,
+					    int action)
 {
-	struct sk_buff *n = skb_clone(skb, gfp_mask);
+	struct sk_buff *n;
+
+	if ((action == TC_ACT_SHOT || action == TC_ACT_STOLEN ||
+	     action == TC_ACT_QUEUED) && !skb_shared(skb))
+		n = skb_get(skb);
+	else
+		n = skb_clone(skb, gfp_mask);
 
 	if (n) {
 		n->tc_verd = SET_TC_VERD(n->tc_verd, 0);
diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index c0b6863..2e9a7b9 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -169,13 +169,13 @@ static int tcf_mirred(struct sk_buff *skb, struct tc_action *a,
 		goto out;
 	}
 
-	skb2 = skb_act_clone(skb, GFP_ATOMIC);
+	at = G_TC_AT(skb->tc_verd);
+	skb2 = skb_act_clone(skb, GFP_ATOMIC, m->tcf_action);
 	if (skb2 == NULL)
 		goto out;
 
 	m->tcf_bstats.bytes += qdisc_pkt_len(skb2);
 	m->tcf_bstats.packets++;
-	at = G_TC_AT(skb->tc_verd);
 	if (!(at & AT_EGRESS)) {
 		if (m->tcfm_ok_push)
 			skb_push(skb2, skb2->dev->hard_header_len);
@@ -185,8 +185,8 @@ static int tcf_mirred(struct sk_buff *skb, struct tc_action *a,
 	if (m->tcfm_eaction != TCA_EGRESS_MIRROR)
 		skb2->tc_verd = SET_TC_FROM(skb2->tc_verd, at);
 
-	skb2->dev = dev;
 	skb2->skb_iif = skb->dev->ifindex;
+	skb2->dev = dev;
 	dev_queue_xmit(skb2);
 	err = 0;
 

^ permalink raw reply related

* Re: [PATCH v2] act_mirred: don't clone skb when skb isn't shared
From: jamal @ 2010-06-05 14:49 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, netdev
In-Reply-To: <1275741553-21463-1-git-send-email-xiaosuo@gmail.com>

On Sat, 2010-06-05 at 20:39 +0800, Changli Gao wrote:

> +	if ((action == TC_ACT_SHOT || action == TC_ACT_STOLEN ||

I am not so sure about SHOT; the other two are fine.

> -	skb2 = skb_act_clone(skb, GFP_ATOMIC);
> +	at = G_TC_AT(skb->tc_verd);

Was there any need to move above line?

> +	skb2 = skb_act_clone(skb, GFP_ATOMIC, m->tcf_action);

> -	skb2->dev = dev;

Or this one?

>  	skb2->skb_iif = skb->dev->ifindex;
> +	skb2->dev = dev;

cheers,
jamal

^ permalink raw reply

* Re: [RFC PATCH v7 03/19] Export 2 func for device to assign/deassign new strucure
From: Eric Dumazet @ 2010-06-05 14:51 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike
In-Reply-To: <1275732899-5423-3-git-send-email-xiaohui.xin@intel.com>

Le samedi 05 juin 2010 à 18:14 +0800, xiaohui.xin@intel.com a écrit :
> From: Xin Xiaohui <xiaohui.xin@intel.com>
> 
> Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
> Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
> Reviewed-by: Jeff Dike <jdike@linux.intel.com>
> ---
>  include/linux/netdevice.h |    3 +++
>  net/core/dev.c            |   28 ++++++++++++++++++++++++++++
>  2 files changed, 31 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index bae725c..efb575a 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1592,6 +1592,9 @@ extern gro_result_t	napi_frags_finish(struct napi_struct *napi,
>  					  gro_result_t ret);
>  extern struct sk_buff *	napi_frags_skb(struct napi_struct *napi);
>  extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
> +extern int netdev_mp_port_attach(struct net_device *dev,
> +				 struct mpassthru_port *port);
> +extern void netdev_mp_port_detach(struct net_device *dev);
>  
>  static inline void napi_free_frags(struct napi_struct *napi)
>  {
> diff --git a/net/core/dev.c b/net/core/dev.c
> index f769098..ecbb6b1 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2469,6 +2469,34 @@ void netif_nit_deliver(struct sk_buff *skb)
>  	rcu_read_unlock();
>  }
>  
> +/* Export two functions to assign/de-assign mp_port pointer
> + * to a net device.
> + */
> +
> +int netdev_mp_port_attach(struct net_device *dev,
> +			struct mpassthru_port *port)
> +{
> +	/* locked by mp_mutex */
> +	if (rcu_dereference(dev->mp_port))
> +		return -EBUSY;
> +

Please... this is bogus...

Try with following config settings :

CONFIG_PROVE_LOCKING=y
CONFIG_PROVE_RCU=y
CONFIG_PROVE_RCU_REPEATEDLY=y




> +	rcu_assign_pointer(dev->mp_port, port);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(netdev_mp_port_attach);
> +
> +void netdev_mp_port_detach(struct net_device *dev)
> +{
> +	/* locked by mp_mutex */
> +	if (!rcu_dereference(dev->mp_port))
> +		return;

same problem here

> +
> +	rcu_assign_pointer(dev->mp_port, NULL);
> +	synchronize_rcu();
> +}
> +EXPORT_SYMBOL(netdev_mp_port_detach);
> +
>  /**
>   *	netif_receive_skb - process receive buffer from network
>   *	@skb: buffer to process

^ permalink raw reply

* Re: [RFC PATCH v7 08/19] Make __alloc_skb() to get external buffer.
From: Eric Dumazet @ 2010-06-05 14:53 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike
In-Reply-To: <1275732899-5423-8-git-send-email-xiaohui.xin@intel.com>

Le samedi 05 juin 2010 à 18:14 +0800, xiaohui.xin@intel.com a écrit :
> From: Xin Xiaohui <xiaohui.xin@intel.com>
> 	child->fclone = SKB_FCLONE_UNAVAILABLE;
>  	}
> +	/* Record the external buffer info in this field. It's not so good,
> +	 * but we cannot find another place easily.
> +	 */
> +	shinfo->destructor_arg = ext_page;
> +

Yes this is a big problem, its basically using a cache line that was not
touched before.

^ permalink raw reply

* Re: [RFC PATCH v7 11/19] Use callback to deal with skb_release_data() specially.
From: Eric Dumazet @ 2010-06-05 14:56 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike
In-Reply-To: <1275732899-5423-11-git-send-email-xiaohui.xin@intel.com>

Le samedi 05 juin 2010 à 18:14 +0800, xiaohui.xin@intel.com a écrit :
> From: Xin Xiaohui <xiaohui.xin@intel.com>
> 
> If buffer is external, then use the callback to destruct
> buffers.
> 
> Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
> Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
> Reviewed-by: Jeff Dike <jdike@linux.intel.com>
> ---
>  net/core/skbuff.c |   11 +++++++++++
>  1 files changed, 11 insertions(+), 0 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 37587f0..418457c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -385,6 +385,11 @@ static void skb_clone_fraglist(struct sk_buff *skb)
>  
>  static void skb_release_data(struct sk_buff *skb)
>  {
> +	/* check if the skb has external buffers, we have use destructor_arg
> +	 * here to indicate
> +	 */
> +	struct skb_external_page *ext_page = skb_shinfo(skb)->destructor_arg;
> +

Oh well. This is v7 of your series, and nobody complained yet ?

This is a new cache miss on a _critical_ path.


>  	if (!skb->cloned ||
>  	    !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
>  			       &skb_shinfo(skb)->dataref)) {
> @@ -397,6 +402,12 @@ static void skb_release_data(struct sk_buff *skb)
>  		if (skb_has_frags(skb))
>  			skb_drop_fraglist(skb);
>  
> +		/* if the skb has external buffers, use destructor here,
> +		 * since after that skb->head will be kfree, in case skb->head
> +		 * from external buffer cannot use kfree to destroy.
> +		 */

Why not deferring here the access to skb_shinfo(skb)->destructor_arg ?

> +		if (dev_is_mpassthru(skb->dev) && ext_page && ext_page->dtor)
> +			ext_page->dtor(ext_page);
>  		kfree(skb->head);
>  	}
>  }

if (dev_is_mpassthru(skb->dev)) {
	struct skb_external_page *ext_page =
				skb_shinfo(skb)->destructor_arg;
	if (ext_page && ext_page->dtor)
		ext_page->dtor(ext_page);
}

destructor_arg should me moved before frags[] if you really want to use it.

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bf243fc..b136d90 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -202,10 +202,11 @@ struct skb_shared_info {
         */
        atomic_t        dataref;
 
-       skb_frag_t      frags[MAX_SKB_FRAGS];
        /* Intermediate layers must ensure that destructor_arg
         * remains valid until skb destructor */
        void *          destructor_arg;
+
+       skb_frag_t      frags[MAX_SKB_FRAGS];
 };
 
 /* We divide dataref into two halves.  The higher 16 bits hold references




^ permalink raw reply related

* Re: [PATCH v2] act_mirred: don't clone skb when skb isn't shared
From: Changli Gao @ 2010-06-05 14:58 UTC (permalink / raw)
  To: hadi; +Cc: David S. Miller, netdev
In-Reply-To: <1275749345.3490.88.camel@bigi>

On Sat, Jun 5, 2010 at 10:49 PM, jamal <hadi@cyberus.ca> wrote:
> On Sat, 2010-06-05 at 20:39 +0800, Changli Gao wrote:
>
>> +     if ((action == TC_ACT_SHOT || action == TC_ACT_STOLEN ||
>
> I am not so sure about SHOT; the other two are fine.

It is unlikely that this function will be called with TC_ACT_SHOT. In
fact, this function has only one user act_mirred. Should we remove
this flag, or keep STOLEN flag only?

>
>> -     skb2 = skb_act_clone(skb, GFP_ATOMIC);
>> +     at = G_TC_AT(skb->tc_verd);
>
> Was there any need to move above line?

skb2 maybe skb, and its tc_verd maybe mangled in skb_act_clone(), so I
move it up.

>
>> +     skb2 = skb_act_clone(skb, GFP_ATOMIC, m->tcf_action);
>
>
>> -     skb2->dev = dev;
>
> Or this one?
>
>>       skb2->skb_iif = skb->dev->ifindex;
>> +     skb2->dev = dev;
>

the same reason as above. skb2 and skb maybe the same. If we don't
move lines, skb->dev maybe over written.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [RFC] act_cpu: redirect skb receiving to a special CPU.
From: Eric Dumazet @ 2010-06-05 15:00 UTC (permalink / raw)
  To: hadi; +Cc: Changli Gao, David S. Miller, Tom Herbert, Linux Netdev List
In-Reply-To: <1275748004.3490.86.camel@bigi>

Le samedi 05 juin 2010 à 10:26 -0400, jamal a écrit :

> Indeed it is AMD specific - but my view is if i was using AMD that would
> be more efficient way of doing it; i.e IPI is the lowest common
> denominator which works on all archs. Essentially what i am saying is
> this would be a "inter-cpu messaging netdev" and i could replace its
> send/recv parts from what we do in the RPS path right now to one that
> uses AMD hypertransport etc.

You do realize this should be discussed on lkml , of course ?

^ permalink raw reply

* [RFC] pm_qos: get rid of the allocation in pm_qos_add_request()
From: James Bottomley @ 2010-06-05 19:20 UTC (permalink / raw)
  To: pm list; +Cc: markgross, netdev, alsa-devel, Takashi Iwai, Jaroslav Kysela

[alsa-devel says it's a moderated list, so feel free to drop before
replying]

Since every caller has to squirrel away the returned pointer anyway,
they might as well supply the memory area.  This fixes a bug in a few of
the call sites where the returned pointer was dereferenced without
checking it for NULL (which gets returned if the kzalloc failed).

I'd like to hear how sound and netdev feels about this: it will add
about two more pointers worth of data to struct netdev and struct
snd_pcm_substream .. but I think it's worth it.  If you're OK, I'll add
your acks and send through the pm tree.

This also looks to me like an android independent clean up (even though
it renders the request_add atomically callable).  I also added include
guards to include/linux/pm_qos_params.h

James

---

 drivers/net/e1000e/netdev.c            |   17 ++++------
 drivers/net/igbvf/netdev.c             |    9 ++---
 drivers/net/wireless/ipw2x00/ipw2100.c |   12 +++---
 include/linux/netdevice.h              |    2 +-
 include/linux/pm_qos_params.h          |   12 +++++--
 include/sound/pcm.h                    |    2 +-
 kernel/pm_qos_params.c                 |   55 ++++++++++++++++---------------
 sound/core/pcm_native.c                |   12 ++-----
 8 files changed, 60 insertions(+), 61 deletions(-)

diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 24507f3..47ea62f 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -2901,10 +2901,10 @@ static void e1000_configure_rx(struct e1000_adapter *adapter)
 			 * dropped transactions.
 			 */
 			pm_qos_update_request(
-				adapter->netdev->pm_qos_req, 55);
+				&adapter->netdev->pm_qos_req, 55);
 		} else {
 			pm_qos_update_request(
-				adapter->netdev->pm_qos_req,
+				&adapter->netdev->pm_qos_req,
 				PM_QOS_DEFAULT_VALUE);
 		}
 	}
@@ -3196,9 +3196,9 @@ int e1000e_up(struct e1000_adapter *adapter)
 
 	/* DMA latency requirement to workaround early-receive/jumbo issue */
 	if (adapter->flags & FLAG_HAS_ERT)
-		adapter->netdev->pm_qos_req =
-			pm_qos_add_request(PM_QOS_CPU_DMA_LATENCY,
-				       PM_QOS_DEFAULT_VALUE);
+		pm_qos_add_request(&adapter->netdev->pm_qos_req,
+				   PM_QOS_CPU_DMA_LATENCY,
+				   PM_QOS_DEFAULT_VALUE);
 
 	/* hardware has been reset, we need to reload some things */
 	e1000_configure(adapter);
@@ -3263,11 +3263,8 @@ void e1000e_down(struct e1000_adapter *adapter)
 	e1000_clean_tx_ring(adapter);
 	e1000_clean_rx_ring(adapter);
 
-	if (adapter->flags & FLAG_HAS_ERT) {
-		pm_qos_remove_request(
-			      adapter->netdev->pm_qos_req);
-		adapter->netdev->pm_qos_req = NULL;
-	}
+	if (adapter->flags & FLAG_HAS_ERT)
+		pm_qos_remove_request(&adapter->netdev->pm_qos_req);
 
 	/*
 	 * TODO: for power management, we could drop the link and
diff --git a/drivers/net/igbvf/netdev.c b/drivers/net/igbvf/netdev.c
index 5e2b2a8..5da569f 100644
--- a/drivers/net/igbvf/netdev.c
+++ b/drivers/net/igbvf/netdev.c
@@ -48,7 +48,7 @@
 #define DRV_VERSION "1.0.0-k0"
 char igbvf_driver_name[] = "igbvf";
 const char igbvf_driver_version[] = DRV_VERSION;
-struct pm_qos_request_list *igbvf_driver_pm_qos_req;
+struct pm_qos_request_list igbvf_driver_pm_qos_req;
 static const char igbvf_driver_string[] =
 				"Intel(R) Virtual Function Network Driver";
 static const char igbvf_copyright[] = "Copyright (c) 2009 Intel Corporation.";
@@ -2902,8 +2902,8 @@ static int __init igbvf_init_module(void)
 	printk(KERN_INFO "%s\n", igbvf_copyright);
 
 	ret = pci_register_driver(&igbvf_driver);
-	igbvf_driver_pm_qos_req = pm_qos_add_request(PM_QOS_CPU_DMA_LATENCY,
-	                       PM_QOS_DEFAULT_VALUE);
+	pm_qos_add_request(igbvf_driver_pm_qos_req, PM_QOS_CPU_DMA_LATENCY,
+			   PM_QOS_DEFAULT_VALUE);
 
 	return ret;
 }
@@ -2918,8 +2918,7 @@ module_init(igbvf_init_module);
 static void __exit igbvf_exit_module(void)
 {
 	pci_unregister_driver(&igbvf_driver);
-	pm_qos_remove_request(igbvf_driver_pm_qos_req);
-	igbvf_driver_pm_qos_req = NULL;
+	pm_qos_remove_request(&igbvf_driver_pm_qos_req);
 }
 module_exit(igbvf_exit_module);
 
diff --git a/drivers/net/wireless/ipw2x00/ipw2100.c b/drivers/net/wireless/ipw2x00/ipw2100.c
index 0bd4dfa..7f0d98b 100644
--- a/drivers/net/wireless/ipw2x00/ipw2100.c
+++ b/drivers/net/wireless/ipw2x00/ipw2100.c
@@ -174,7 +174,7 @@ that only one external action is invoked at a time.
 #define DRV_DESCRIPTION	"Intel(R) PRO/Wireless 2100 Network Driver"
 #define DRV_COPYRIGHT	"Copyright(c) 2003-2006 Intel Corporation"
 
-struct pm_qos_request_list *ipw2100_pm_qos_req;
+struct pm_qos_request_list ipw2100_pm_qos_req;
 
 /* Debugging stuff */
 #ifdef CONFIG_IPW2100_DEBUG
@@ -1741,7 +1741,7 @@ static int ipw2100_up(struct ipw2100_priv *priv, int deferred)
 	/* the ipw2100 hardware really doesn't want power management delays
 	 * longer than 175usec
 	 */
-	pm_qos_update_request(ipw2100_pm_qos_req, 175);
+	pm_qos_update_request(&ipw2100_pm_qos_req, 175);
 
 	/* If the interrupt is enabled, turn it off... */
 	spin_lock_irqsave(&priv->low_lock, flags);
@@ -1889,7 +1889,7 @@ static void ipw2100_down(struct ipw2100_priv *priv)
 	ipw2100_disable_interrupts(priv);
 	spin_unlock_irqrestore(&priv->low_lock, flags);
 
-	pm_qos_update_request(ipw2100_pm_qos_req, PM_QOS_DEFAULT_VALUE);
+	pm_qos_update_request(&ipw2100_pm_qos_req, PM_QOS_DEFAULT_VALUE);
 
 	/* We have to signal any supplicant if we are disassociating */
 	if (associated)
@@ -6669,8 +6669,8 @@ static int __init ipw2100_init(void)
 	if (ret)
 		goto out;
 
-	ipw2100_pm_qos_req = pm_qos_add_request(PM_QOS_CPU_DMA_LATENCY,
-			PM_QOS_DEFAULT_VALUE);
+	pm_qos_add_request(&ipw2100_pm_qos_req, PM_QOS_CPU_DMA_LATENCY,
+			   PM_QOS_DEFAULT_VALUE);
 #ifdef CONFIG_IPW2100_DEBUG
 	ipw2100_debug_level = debug;
 	ret = driver_create_file(&ipw2100_pci_driver.driver,
@@ -6692,7 +6692,7 @@ static void __exit ipw2100_exit(void)
 			   &driver_attr_debug_level);
 #endif
 	pci_unregister_driver(&ipw2100_pci_driver);
-	pm_qos_remove_request(ipw2100_pm_qos_req);
+	pm_qos_remove_request(&ipw2100_pm_qos_req);
 }
 
 module_init(ipw2100_init);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 40291f3..393555a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -779,7 +779,7 @@ struct net_device {
 	 */
 	char			name[IFNAMSIZ];
 
-	struct pm_qos_request_list *pm_qos_req;
+	struct pm_qos_request_list pm_qos_req;
 
 	/* device name hash chain */
 	struct hlist_node	name_hlist;
diff --git a/include/linux/pm_qos_params.h b/include/linux/pm_qos_params.h
index 8ba440e..d823cc0 100644
--- a/include/linux/pm_qos_params.h
+++ b/include/linux/pm_qos_params.h
@@ -1,8 +1,10 @@
+#ifndef _LINUX_PM_QOS_PARAMS_H
+#define _LINUX_PM_QOS_PARAMS_H
 /* interface for the pm_qos_power infrastructure of the linux kernel.
  *
  * Mark Gross <mgross@linux.intel.com>
  */
-#include <linux/list.h>
+#include <linux/plist.h>
 #include <linux/notifier.h>
 #include <linux/miscdevice.h>
 
@@ -14,9 +16,12 @@
 #define PM_QOS_NUM_CLASSES 4
 #define PM_QOS_DEFAULT_VALUE -1
 
-struct pm_qos_request_list;
+struct pm_qos_request_list {
+	struct plist_node list;
+	int pm_qos_class;
+};
 
-struct pm_qos_request_list *pm_qos_add_request(int pm_qos_class, s32 value);
+void pm_qos_add_request(struct pm_qos_request_list *l, int pm_qos_class, s32 value);
 void pm_qos_update_request(struct pm_qos_request_list *pm_qos_req,
 		s32 new_value);
 void pm_qos_remove_request(struct pm_qos_request_list *pm_qos_req);
@@ -25,3 +30,4 @@ int pm_qos_request(int pm_qos_class);
 int pm_qos_add_notifier(int pm_qos_class, struct notifier_block *notifier);
 int pm_qos_remove_notifier(int pm_qos_class, struct notifier_block *notifier);
 
+#endif
diff --git a/include/sound/pcm.h b/include/sound/pcm.h
index dd76cde..6e3a297 100644
--- a/include/sound/pcm.h
+++ b/include/sound/pcm.h
@@ -366,7 +366,7 @@ struct snd_pcm_substream {
 	int number;
 	char name[32];			/* substream name */
 	int stream;			/* stream (direction) */
-	struct pm_qos_request_list *latency_pm_qos_req; /* pm_qos request */
+	struct pm_qos_request_list latency_pm_qos_req; /* pm_qos request */
 	size_t buffer_bytes_max;	/* limit ring buffer size */
 	struct snd_dma_buffer dma_buffer;
 	unsigned int dma_buf_id;
diff --git a/kernel/pm_qos_params.c b/kernel/pm_qos_params.c
index 241fa79..f1d3d23 100644
--- a/kernel/pm_qos_params.c
+++ b/kernel/pm_qos_params.c
@@ -30,7 +30,6 @@
 /*#define DEBUG*/
 
 #include <linux/pm_qos_params.h>
-#include <linux/plist.h>
 #include <linux/sched.h>
 #include <linux/spinlock.h>
 #include <linux/slab.h>
@@ -49,11 +48,6 @@
  * or pm_qos_object list and pm_qos_objects need to happen with pm_qos_lock
  * held, taken with _irqsave.  One lock to rule them all
  */
-struct pm_qos_request_list {
-	struct plist_node list;
-	int pm_qos_class;
-};
-
 enum pm_qos_type {
 	PM_QOS_MAX,		/* return the largest value */
 	PM_QOS_MIN,		/* return the smallest value */
@@ -195,6 +189,11 @@ int pm_qos_request(int pm_qos_class)
 }
 EXPORT_SYMBOL_GPL(pm_qos_request);
 
+static int pm_qos_request_active(struct pm_qos_request_list *req)
+{
+	return req->pm_qos_class != 0;
+}
+
 /**
  * pm_qos_add_request - inserts new qos request into the list
  * @pm_qos_class: identifies which list of qos request to us
@@ -206,25 +205,22 @@ EXPORT_SYMBOL_GPL(pm_qos_request);
  * element as a handle for use in updating and removal.  Call needs to save
  * this handle for later use.
  */
-struct pm_qos_request_list *pm_qos_add_request(int pm_qos_class, s32 value)
+void pm_qos_add_request(struct pm_qos_request_list *dep,
+			int pm_qos_class, s32 value)
 {
-	struct pm_qos_request_list *dep;
-
-	dep = kzalloc(sizeof(struct pm_qos_request_list), GFP_KERNEL);
-	if (dep) {
-		struct pm_qos_object *o =  pm_qos_array[pm_qos_class];
-		int new_value;
-
-		if (value == PM_QOS_DEFAULT_VALUE)
-			new_value = o->default_value;
-		else
-			new_value = value;
-		plist_node_init(&dep->list, new_value);
-		dep->pm_qos_class = pm_qos_class;
-		update_target(o, &dep->list, 0);
-	}
+	struct pm_qos_object *o =  pm_qos_array[pm_qos_class];
+	int new_value;
+
+	if (pm_qos_request_active(dep))
+		return;
 
-	return dep;
+	if (value == PM_QOS_DEFAULT_VALUE)
+		new_value = o->default_value;
+	else
+		new_value = value;
+	plist_node_init(&dep->list, new_value);
+	dep->pm_qos_class = pm_qos_class;
+	update_target(o, &dep->list, 0);
 }
 EXPORT_SYMBOL_GPL(pm_qos_add_request);
 
@@ -286,7 +282,7 @@ void pm_qos_remove_request(struct pm_qos_request_list *pm_qos_req)
 
 	o = pm_qos_array[pm_qos_req->pm_qos_class];
 	update_target(o, &pm_qos_req->list, 1);
-	kfree(pm_qos_req);
+	memset(pm_qos_req, 0, sizeof(*pm_qos_req));
 }
 EXPORT_SYMBOL_GPL(pm_qos_remove_request);
 
@@ -334,8 +330,12 @@ static int pm_qos_power_open(struct inode *inode, struct file *filp)
 
 	pm_qos_class = find_pm_qos_object_by_minor(iminor(inode));
 	if (pm_qos_class >= 0) {
-		filp->private_data = (void *) pm_qos_add_request(pm_qos_class,
-				PM_QOS_DEFAULT_VALUE);
+		struct pm_qos_request_list *req = kzalloc(GFP_KERNEL, sizeof(*req));
+		if (!req)
+			return -ENOMEM;
+
+		pm_qos_add_request(req, pm_qos_class, PM_QOS_DEFAULT_VALUE);
+		filp->private_data = req;
 
 		if (filp->private_data)
 			return 0;
@@ -347,8 +347,9 @@ static int pm_qos_power_release(struct inode *inode, struct file *filp)
 {
 	struct pm_qos_request_list *req;
 
-	req = (struct pm_qos_request_list *)filp->private_data;
+	req = filp->private_data;
 	pm_qos_remove_request(req);
+	kfree(req);
 
 	return 0;
 }
diff --git a/sound/core/pcm_native.c b/sound/core/pcm_native.c
index 303ac04..d3b8b51 100644
--- a/sound/core/pcm_native.c
+++ b/sound/core/pcm_native.c
@@ -451,13 +451,10 @@ static int snd_pcm_hw_params(struct snd_pcm_substream *substream,
 	snd_pcm_timer_resolution_change(substream);
 	runtime->status->state = SNDRV_PCM_STATE_SETUP;
 
-	if (substream->latency_pm_qos_req) {
-		pm_qos_remove_request(substream->latency_pm_qos_req);
-		substream->latency_pm_qos_req = NULL;
-	}
+	pm_qos_remove_request(&substream->latency_pm_qos_req);
 	if ((usecs = period_to_usecs(runtime)) >= 0)
-		substream->latency_pm_qos_req = pm_qos_add_request(
-					PM_QOS_CPU_DMA_LATENCY, usecs);
+		pm_qos_add_request(&substream->latency_pm_qos_req,
+				   PM_QOS_CPU_DMA_LATENCY, usecs);
 	return 0;
  _error:
 	/* hardware might be unuseable from this time,
@@ -512,8 +509,7 @@ static int snd_pcm_hw_free(struct snd_pcm_substream *substream)
 	if (substream->ops->hw_free)
 		result = substream->ops->hw_free(substream);
 	runtime->status->state = SNDRV_PCM_STATE_OPEN;
-	pm_qos_remove_request(substream->latency_pm_qos_req);
-	substream->latency_pm_qos_req = NULL;
+	pm_qos_remove_request(&substream->latency_pm_qos_req);
 	return result;
 }
 



^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox