Netdev List
* Re: [PATCH v3 04/12] l2tp: Add ppp device name to L2TP ppp session data
From: James Chapman @ 2010-04-01  7:19 UTC
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20100331160811.16351eb7@s6510>

Stephen Hemminger wrote:
> On Wed, 31 Mar 2010 08:43:09 +0100
> James Chapman <jchapman@katalix.com> wrote:
> 
>> Stephen Hemminger wrote:
>>> On Tue, 30 Mar 2010 17:17:46 +0100
>>> James Chapman <jchapman@katalix.com> wrote:
>>>
>>>> When dumping L2TP PPP sessions using /proc/net/l2tp, get
>>>> the assigned PPP device name from PPP using ppp_dev_name().
>>>>
>>>> Signed-off-by: James Chapman <jchapman@katalix.com>
>>>> Reviewed-by: Randy Dunlap <randy.dunlap@oracle.com>
>>>>
>>> Why is this a necessary API?
>>> Why not put it in debugfs if just a debugging tool?
>> With the original driver (merged in 2.6.23), some people use horrible
>> hacks in scripts to derive info about their L2TP connections from /proc.
>> So I was reluctant to move it to debugfs in the new driver. If it is ok
>> to move an existing /proc file to debugfs, I'm happy to do so. People
>> should obtain such info from their L2TP userspace daemon, or through
>> netlink anyway.
>>
>>
> Sounds like a good use of sysfs either with attribute or symlink
> back to underlying device
> 
There might be thousands of L2TP sessions in some setups. Populating
sysfs with a link for each of those sessions isn't practical. The
existing /proc file dumps its info as a single text file for this
reason. I'd also like to provide the device name in the session netlink
message, which is the interface used by l2tp userspace, so I need a
kernel API to retrieve the device name from ppp.

I like the suggestion of using debugfs for access to driver debug info
though. I propose leaving the /proc file for L2TPv2 only, removing the
L2TPv3 data that I added to the proc file in this patch series, to
retain compatibility with the existing driver. This would show only
L2TPv2 sessions and tunnels. For new driver functionality (L2TPv3 etc),
use debugfs. The debugfs files would dump lists in a similar form to the
current code, listing all tunnels (L2TPv2 and L2TPv3) in a single file.
Using debugfs gives more flexibility for adding additional info later,
as required. How does that sound?

-- 
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development



* Re: Undefined behaviour of connect(fd, NULL, 0);
From: David Miller @ 2010-04-01  7:23 UTC
  To: xiaosuo; +Cc: neilb, shemminger, netdev
In-Reply-To: <x2j412e6f7f1003312116rd3b3ba96t31267545efe7660f@mail.gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Thu, 1 Apr 2010 12:16:43 +0800

> Someone may use connect() to check whether the connection is established
> or not. But there is no spec for the addr and addr_len values when
> connect(2) is used this way. Since there is no limit on addr and
> addr_len, and we support a NULL addr for checking the status of the socket
> (although it is buggy), I think we should treat it as a feature,
> and the problem Neil reported as a bug.

This seems logical, but I believe it is wrong.

We already know for a fact that it is guaranteed to not work
reliably for every single kernel in existence in the world
right now.

Every system.  Ones that have been deployed for 10 years as
well as those built from GIT 10 seconds ago.

So you tell me, if you put this into an application that you
wish to deploy anywhere, are you not being completely stupid?

Therefore, if it's illogical to use this in an application, what value
is there in starting to support it now in the kernel?

I'll tell you, the value is absolutely zero.

Yes, we need to add the length check, but the behavior we give to this
case as a result is completely arbitrary.  And I would in fact argue
for a hard error in these cases.

Simply mark it as invalid to call connect() this way.
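
For applications that currently call connect(fd, NULL, 0) only to learn
whether a socket is connected, a portable alternative can be sketched as
follows. This is an illustration under assumptions, not something proposed
in this thread: the helper names socket_is_connected() and
socket_pending_error() are hypothetical, while getpeername() and
getsockopt(SO_ERROR) are the standard POSIX calls.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: report whether fd currently has a peer.
 * getpeername() fails (typically with ENOTCONN) on an unconnected socket.
 */
static bool socket_is_connected(int fd)
{
	struct sockaddr_storage peer;
	socklen_t len = sizeof(peer);

	return getpeername(fd, (struct sockaddr *)&peer, &len) == 0;
}

/*
 * Hypothetical helper: fetch (and clear) the pending socket error, e.g.
 * after a non-blocking connect() has signalled writability.
 * Returns 0 when no error is pending, otherwise an errno value.
 */
static int socket_pending_error(int fd)
{
	int err = 0;
	socklen_t len = sizeof(err);

	if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
		return errno;
	return err;
}
```

Neither helper depends on how the kernel treats a NULL address, so the
result does not change with the kernel version.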


* [Patch V3] bonding: fix potential deadlock in bond_uninit()
From: Cong Wang @ 2010-04-01  7:30 UTC
  To: Eric W. Biederman
  Cc: linux-kernel, Jiri Pirko, Stephen Hemminger, netdev,
	David S. Miller, bonding-devel, Jay Vosburgh
In-Reply-To: <m1r5mzsnuh.fsf@fess.ebiederm.org>

Eric W. Biederman wrote:
> Amerigo Wang <amwang@redhat.com> writes:
> 
>> bond_uninit() is invoked with rtnl_lock held, when it does destroy_workqueue()
>> which will potentially flush all works in this workqueue, if we hold rtnl_lock
>> again in the work function, it will deadlock.
>>
>> So move destroy_workqueue() to destructor where rtnl_lock is not held any more,
>> suggested by Eric.
> 
> The error handling on creating a bond device needs to be updated as well.
> 

Done.


[-- Attachment #2: drivers-net-bonding-bond_main_c-fix-destroy_workqueue-deadlock.diff --]
[-- Type: text/x-patch, Size: 2517 bytes --]

V3: fix error handling path of bond_create()

bond_uninit() is invoked with rtnl_lock held. It calls destroy_workqueue(),
which may flush all pending work in the workqueue; if a work function then
tries to take rtnl_lock, it deadlocks.

So move destroy_workqueue() to the destructor, where rtnl_lock is no longer
held, as suggested by Eric.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Jiri Pirko <jpirko@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>

---

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 5b92fbf..61f8c63 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4450,6 +4450,14 @@ static const struct net_device_ops bond_netdev_ops = {
 	.ndo_vlan_rx_kill_vid	= bond_vlan_rx_kill_vid,
 };
 
+static void bond_destructor(struct net_device *bond_dev)
+{
+	struct bonding *bond = netdev_priv(bond_dev);
+	if (bond->wq)
+		destroy_workqueue(bond->wq);
+	free_netdev(bond_dev);
+}
+
 static void bond_setup(struct net_device *bond_dev)
 {
 	struct bonding *bond = netdev_priv(bond_dev);
@@ -4470,7 +4478,7 @@ static void bond_setup(struct net_device *bond_dev)
 	bond_dev->ethtool_ops = &bond_ethtool_ops;
 	bond_set_mode_ops(bond, bond->params.mode);
 
-	bond_dev->destructor = free_netdev;
+	bond_dev->destructor = bond_destructor;
 
 	/* Initialize the device options */
 	bond_dev->tx_queue_len = 0;
@@ -4542,9 +4550,6 @@ static void bond_uninit(struct net_device *bond_dev)
 
 	bond_remove_proc_entry(bond);
 
-	if (bond->wq)
-		destroy_workqueue(bond->wq);
-
 	netif_addr_lock_bh(bond_dev);
 	bond_mc_list_destroy(bond);
 	netif_addr_unlock_bh(bond_dev);
@@ -4956,8 +4961,8 @@ int bond_create(struct net *net, const char *name)
 				bond_setup);
 	if (!bond_dev) {
 		pr_err("%s: eek! can't alloc netdev!\n", name);
-		res = -ENOMEM;
-		goto out;
+		rtnl_unlock();
+		return -ENOMEM;
 	}
 
 	dev_net_set(bond_dev, net);
@@ -4966,19 +4971,16 @@ int bond_create(struct net *net, const char *name)
 	if (!name) {
 		res = dev_alloc_name(bond_dev, "bond%d");
 		if (res < 0)
-			goto out_netdev;
+			goto out;
 	}
 
 	res = register_netdevice(bond_dev);
-	if (res < 0)
-		goto out_netdev;
 
 out:
 	rtnl_unlock();
+	if (res < 0)
+		bond_destructor(bond_dev);
 	return res;
-out_netdev:
-	free_netdev(bond_dev);
-	goto out;
 }
 
 static int __net_init bond_net_init(struct net *net)
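
The lock ordering problem described in the changelog can be illustrated
with a small userspace analogy. This is a sketch under assumptions, not
kernel code: big_lock stands in for rtnl_lock, pthread_join() stands in
for the flush performed by destroy_workqueue(), and all names are
hypothetical.

```c
#include <pthread.h>

/* Stands in for rtnl_lock in this analogy. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static int work_done;

/* Stands in for a work function that needs the big lock. */
static void *work_fn(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&big_lock);
	work_done = 1;
	pthread_mutex_unlock(&big_lock);
	return NULL;
}

/*
 * Safe teardown ordering: release the lock first, then wait for the worker.
 * Calling pthread_join() before the unlock would never return, because the
 * worker blocks on big_lock while the caller blocks on the worker.
 * Returns 1 if the work ran, -1 if the worker could not be started.
 */
static int teardown_safe(void)
{
	pthread_t worker;

	pthread_mutex_lock(&big_lock);
	if (pthread_create(&worker, NULL, work_fn, NULL) != 0) {
		pthread_mutex_unlock(&big_lock);
		return -1;
	}
	pthread_mutex_unlock(&big_lock);	/* analogous to leaving rtnl_lock */
	pthread_join(worker, NULL);		/* analogous to destroy_workqueue() */
	return work_done;
}
```

Moving the wait to a point where the lock is no longer held is the same
idea as moving destroy_workqueue() from bond_uninit() to the destructor.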


* Re: linux-next: build failure after merge of the slabh tree
From: David Miller @ 2010-04-01  7:29 UTC
  To: sfr; +Cc: tj, linux-next, linux-kernel, sjur.brandeland, netdev
In-Reply-To: <20100401164110.80e3b18f.sfr@canb.auug.org.au>

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Thu, 1 Apr 2010 16:41:10 +1100

> I have applied the following patch for today.  Dave, could you apply this
> to the net tree please?

Done.


* Re: [PATCH v3 04/12] l2tp: Add ppp device name to L2TP ppp session data
From: Eric Dumazet @ 2010-04-01  7:30 UTC
  To: James Chapman; +Cc: Stephen Hemminger, netdev
In-Reply-To: <4BB4490F.8090406@katalix.com>

Le jeudi 01 avril 2010 à 08:19 +0100, James Chapman a écrit :

> There might be thousands of L2TP sessions in some setups. Populating
> sysfs with a link for each of those sessions isn't practical. The
> existing /proc file dumps its info as a single text file for this
> reason. I'd also like to provide the device name in the session netlink
> message, which is the interface used by l2tp userspace, so I need a
> kernel API to retrieve the device name from ppp.
> 
> I like the suggestion of using debugfs for access to driver debug info
> though. I propose leaving the /proc file for L2TPv2 only, removing the
> L2TPv3 data that I added to the proc file in this patch series, to
> retain compatibility with the existing driver. This would show only
> L2TPv2 sessions and tunnels. For new driver functionality (L2TPv3 etc),
> use debugfs. The debugfs files would dump lists in a similar form to the
> current code, listing all tunnels (L2TPv2 and L2TPv3) in a single file.
> Using debugfs gives more flexibility for adding additional info later,
> as required. How does that sound?
> 

debugfs? I don't get it, sorry.

Why not use netlink, as most iproute2 utilities do?




* Re: linux-next: build failure after merge of the slabh tree
From: Tejun Heo @ 2010-04-01  7:31 UTC
  To: David Miller; +Cc: sfr, linux-next, linux-kernel, sjur.brandeland, netdev
In-Reply-To: <20100401.002908.267643090.davem@davemloft.net>

On 04/01/2010 04:29 PM, David Miller wrote:
> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Thu, 1 Apr 2010 16:41:10 +1100
> 
>> I have applied the following patch for today.  Dave, could you apply this
>> to the net tree please?
> 
> Done.

Thanks!

-- 
tejun


* Re: [PATCH v3 04/12] l2tp: Add ppp device name to L2TP ppp session data
From: David Miller @ 2010-04-01  7:34 UTC
  To: jchapman; +Cc: shemminger, netdev
In-Reply-To: <4BB4490F.8090406@katalix.com>

From: James Chapman <jchapman@katalix.com>
Date: Thu, 01 Apr 2010 08:19:43 +0100

> There might be thousands of L2TP sessions in some setups. Populating
> sysfs with a link for each of those sessions isn't practical. The
> existing /proc file dumps its info as a single text file for this
> reason. I'd also like to provide the device name in the session netlink
> message, which is the interface used by l2tp userspace, so I need a
> kernel API to retrieve the device name from ppp.

Scalability concerns are also another reason _not_ to use
procfs.

Use netlink or similar, which can dump with filtering and
proper queueing.
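
As a rough illustration of what a netlink dump request looks like from
userspace, the sketch below builds (but does not send) an rtnetlink
RTM_GETLINK dump request, which lists network interfaces, PPP devices
included. It is only an assumption-laden example: an L2TP interface would
define its own message types, and the struct and function names here are
hypothetical.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical request layout: netlink header followed by the link header. */
struct link_dump_req {
	struct nlmsghdr nlh;
	struct ifinfomsg ifi;
};

/*
 * Fill in a "dump all links" request. NLM_F_DUMP asks the kernel to return
 * every object as a multipart reply terminated by NLMSG_DONE, so large
 * tables are delivered in chunks rather than as a single text blob.
 */
static void build_link_dump_req(struct link_dump_req *req, unsigned int seq)
{
	memset(req, 0, sizeof(*req));
	req->nlh.nlmsg_len = NLMSG_LENGTH(sizeof(req->ifi));
	req->nlh.nlmsg_type = RTM_GETLINK;
	req->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
	req->nlh.nlmsg_seq = seq;
	req->ifi.ifi_family = AF_UNSPEC;
}
```

A caller would send this on an AF_NETLINK socket opened with
NETLINK_ROUTE and read replies until NLMSG_DONE arrives.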


* Re: linux-next: build failure after merge of the slabh tree
From: Stephen Rothwell @ 2010-04-01  7:40 UTC
  To: David Miller; +Cc: tj, linux-next, linux-kernel, sjur.brandeland, netdev
In-Reply-To: <20100401.002908.267643090.davem@davemloft.net>

Hi Dave,

On Thu, 01 Apr 2010 00:29:08 -0700 (PDT) David Miller <davem@davemloft.net> wrote:
>
> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Thu, 1 Apr 2010 16:41:10 +1100
> 
> > I have applied the following patch for today.  Dave, could you apply this
> > to the net tree please?
> 
> Done.

Thanks.

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/


* [PATCH] stmmac: fix kconfig for crc32 build error
From: Giuseppe CAVALLARO @ 2010-04-01  7:44 UTC
  To: netdev; +Cc: Carmelo AMOROSO, Giuseppe Cavallaro

From: Carmelo AMOROSO <carmelo.amoroso@st.com>

stmmac uses crc32 functions so it needs to select CRC32.

Fixes build error:
drivers/built-in.o: In function `dwmac1000_set_filter':
dwmac1000_core.c:(.text+0x3c380): undefined reference to `crc32_le'
dwmac1000_core.c:(.text+0x3c384): undefined reference to `bitrev32'

Signed-off-by: Carmelo Amoroso <carmelo.amoroso@st.com>
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
---
 drivers/net/stmmac/Kconfig |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/stmmac/Kconfig b/drivers/net/stmmac/Kconfig
index fb28764..eb63d44 100644
--- a/drivers/net/stmmac/Kconfig
+++ b/drivers/net/stmmac/Kconfig
@@ -2,6 +2,7 @@ config STMMAC_ETH
 	tristate "STMicroelectronics 10/100/1000 Ethernet driver"
 	select MII
 	select PHYLIB
+	select CRC32
 	depends on NETDEVICES && CPU_SUBTYPE_ST40
 	help
 	  This is the driver for the Ethernet IPs are built around a
-- 
1.6.0.4
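
For context on why a MAC driver needs crc32_le() and bitrev32(), here is
a userspace sketch of a multicast hash-bit computation. It assumes the
common pattern of using the upper 6 bits of the bit-reversed, inverted
CRC to index a 64-bit hash table; the *_sketch functions are hypothetical
stand-ins, not the kernel library implementations.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise little-endian CRC-32 (polynomial 0xEDB88320) with no implicit
 * initial value or final inversion; the caller supplies the seed.
 */
static uint32_t crc32_le_sketch(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
	}
	return crc;
}

/* Reverse the bit order of a 32-bit word. */
static uint32_t bitrev32_sketch(uint32_t x)
{
	uint32_t r = 0;

	for (int i = 0; i < 32; i++) {
		r = (r << 1) | (x & 1);
		x >>= 1;
	}
	return r;
}

/* Map a 6-byte MAC address to a hash table bit number in the range 0..63. */
static unsigned int mcast_hash_bit(const uint8_t mac[6])
{
	return bitrev32_sketch(~crc32_le_sketch(~0u, mac, 6)) >> 26;
}
```

In the kernel these helpers come from lib/, which is why the driver's
Kconfig entry has to select CRC32 to avoid the link error quoted above.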



* [PATCH] stmmac: add documentation for the driver.
From: Giuseppe CAVALLARO @ 2010-04-01  7:44 UTC
  To: netdev; +Cc: Giuseppe Cavallaro
In-Reply-To: <1270107844-23457-1-git-send-email-peppe.cavallaro@st.com>

Add Documentation/networking/stmmac.txt for the
stmmac network driver.

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
---
 Documentation/networking/stmmac.txt |  143 +++++++++++++++++++++++++++++++++++
 1 files changed, 143 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/networking/stmmac.txt

diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt
new file mode 100644
index 0000000..7ee770b
--- /dev/null
+++ b/Documentation/networking/stmmac.txt
@@ -0,0 +1,143 @@
+       STMicroelectronics 10/100/1000 Synopsys Ethernet driver
+
+Copyright (C) 2007-2010  STMicroelectronics Ltd
+Author: Giuseppe Cavallaro <peppe.cavallaro@st.com>
+
+This is the driver for the MAC 10/100/1000 on-chip Ethernet controllers
+(Synopsys IP blocks); it has been fully tested on STLinux platforms.
+
+Currently this network device driver is for all STM embedded MAC/GMAC
+(7xxx SoCs).
+
+DWC Ether MAC 10/100/1000 Universal version 3.41a and DWC Ether MAC 10/100
+Universal version 4.0 have been used for developing the first code
+implementation.
+
+For more information, please also visit: www.stlinux.com
+
+1) Kernel Configuration
+The kernel configuration option is STMMAC_ETH:
+ Device Drivers ---> Network device support ---> Ethernet (1000 Mbit) --->
+ STMicroelectronics 10/100/1000 Ethernet driver (STMMAC_ETH)
+
+2) Driver parameters list:
+	debug: message level (0: no output, 16: all);
+	phyaddr: to manually provide the physical address to the PHY device;
+	dma_rxsize: DMA rx ring size;
+	dma_txsize: DMA tx ring size;
+	buf_sz: DMA buffer size;
+	tc: control the HW FIFO threshold;
+	tx_coe: Enable/Disable Tx Checksum Offload engine;
+	watchdog: transmit timeout (in milliseconds);
+	flow_ctrl: Flow control ability [on/off];
+	pause: Flow Control Pause Time;
+	tmrate: timer period (only if timer optimisation is configured).
+
+3) Command line options
+Driver parameters can also be passed on the command line using:
+	stmmaceth=dma_rxsize:128,dma_txsize:512
+
+4) Driver information and notes
+
+4.1) Transmit process
+The xmit method is invoked when the kernel needs to transmit a packet; it sets
+the descriptors in the ring and informs the DMA engine that there is a packet
+ready to be transmitted.
+Once the controller has finished transmitting the packet, an interrupt is
+triggered, so the driver is able to release the socket buffers.
+By default, the driver sets the NETIF_F_SG bit in the features field of the
+net_device structure enabling the scatter/gather feature.
+
+4.2) Receive process
+When one or more packets are received, an interrupt happens. The interrupts
+are not queued so the driver has to scan all the descriptors in the ring during
+the receive process.
+This is based on NAPI, so the interrupt handler only signals if there is work to
+be done, and then exits.
+Then the poll method will be scheduled at some future point.
+The incoming packets are stored, by the DMA, in a list of pre-allocated socket
+buffers in order to avoid the memcpy (Zero-copy).
+
+4.3) Timer-Driven Interrupt
+Instead of having the device asynchronously notify frame reception, the
+driver configures a timer to generate an interrupt at regular intervals.
+Based on the granularity of the timer, the frames that are received by the device
+will experience different levels of latency. Some NICs have a dedicated timer
+device to perform this task. STMMAC can use either the RTC device or the TMU
+channel 2 on STLinux platforms.
+The timer frequency can be passed to the driver as a parameter; when changing it,
+take care of both hardware capability and network stability/performance impact.
+Several performance tests on STM platforms showed that this optimisation spares
+the CPU while sustaining the maximum throughput.
+
+4.4) WOL
+The Wake-on-LAN feature through Magic Frame is only supported for the GMAC
+core.
+
+4.5) DMA descriptors
+The driver handles both normal and enhanced descriptors. The latter have only
+been tested on DWC Ether MAC 10/100/1000 Universal version 3.41a.
+
+4.6) Ethtool support
+Ethtool is supported. Driver statistics and internal errors can be retrieved with
+the ethtool -S ethX command. It is also possible to dump registers etc.
+
+4.7) Jumbo and Segmentation Offloading
+Jumbo frames are supported and tested for the GMAC.
+GSO has also been added, but it is performed in software.
+LRO is not supported.
+
+4.8) Physical
+The driver is compatible with PAL to work with PHY and GPHY devices.
+
+4.9) Platform information
+Several pieces of information come from the platform; please refer to the
+driver's header file in the include/linux directory.
+
+struct plat_stmmacenet_data {
+        int bus_id;
+        int pbl;
+        int has_gmac;
+        void (*fix_mac_speed)(void *priv, unsigned int speed);
+        void (*bus_setup)(unsigned long ioaddr);
+#ifdef CONFIG_STM_DRIVERS
+        struct stm_pad_config *pad_config;
+#endif
+        void *bsp_priv;
+};
+
+Where:
+- pbl (Programmable Burst Length) is the maximum number of
+  beats to be transferred in one DMA transaction.
+  GMAC also enables the 4xPBL by default.
+- fix_mac_speed and bus_setup are used to configure internal target
+  registers (on STM platforms);
+- has_gmac: GMAC core is on board (get it at run-time in the next step);
+- bus_id: bus identifier.
+
+struct plat_stmmacphy_data {
+        int bus_id;
+        int phy_addr;
+        unsigned int phy_mask;
+        int interface;
+        int (*phy_reset)(void *priv);
+        void *priv;
+};
+
+Where:
+- bus_id: bus identifier;
+- phy_addr: physical address used for the attached phy device;
+            set it to -1 to get it at run-time;
+- interface: physical MII interface mode;
+- phy_reset: hook to reset HW function.
+
+TODO:
+- Continue to make the driver more generic and suitable for other Synopsys
+  Ethernet controllers used on other architectures (e.g. ARM).
+- 10G controllers are not supported.
+- MAC uses Normal descriptors and GMAC uses enhanced ones.
+  This is a limit that should be reviewed. MAC could want to
+  use the enhanced structure.
+- Checksumming: Rx/Tx csum is done in HW in case of GMAC only.
+- Review the timer optimisation code to use an embedded device that seems to be
+  available in new chip generations.
-- 
1.6.0.4
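
The option string format shown in section 3 of the document above
(name:value pairs separated by commas) could be parsed roughly as below.
This is a userspace sketch under assumptions, not the driver's parser:
the struct and function names are hypothetical, and only the two
parameters from the documented example are handled.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the two options used in the example string. */
struct stmmac_opts_sketch {
	int dma_rxsize;
	int dma_txsize;
};

/*
 * Parse "name:value,name:value". Unknown names are ignored.
 * Returns 0 on success, -1 if a name has no ':' or a value is not a number.
 */
static int parse_stmmac_opts(const char *str, struct stmmac_opts_sketch *o)
{
	while (*str) {
		size_t klen = strcspn(str, ":,");
		char *end;
		long val;

		if (str[klen] != ':')
			return -1;
		val = strtol(str + klen + 1, &end, 0);
		if (end == str + klen + 1 || (*end != ',' && *end != '\0'))
			return -1;

		if (klen == strlen("dma_rxsize") &&
		    strncmp(str, "dma_rxsize", klen) == 0)
			o->dma_rxsize = (int)val;
		else if (klen == strlen("dma_txsize") &&
			 strncmp(str, "dma_txsize", klen) == 0)
			o->dma_txsize = (int)val;

		/* Skip the separator, if any, and continue with the next pair. */
		str = (*end == ',') ? end + 1 : end;
	}
	return 0;
}
```

With the documented example, parse_stmmac_opts("dma_rxsize:128,dma_txsize:512", &o)
fills in both ring sizes.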



* [PATCH 2/2] acenic: use the dma state API instead of the pci equivalents
From: FUJITA Tomonori @ 2010-04-01  8:13 UTC
  To: linux-acenic; +Cc: netdev, fujita.tomonori
In-Reply-To: <1270109586-9193-1-git-send-email-fujita.tomonori@lab.ntt.co.jp>

The DMA API is preferred.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
 drivers/net/acenic.c |   26 +++++++++++++-------------
 drivers/net/acenic.h |    6 +++---
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/net/acenic.c b/drivers/net/acenic.c
index 02a4746..1328eb9 100644
--- a/drivers/net/acenic.c
+++ b/drivers/net/acenic.c
@@ -661,7 +661,7 @@ static void __devexit acenic_remove_one(struct pci_dev *pdev)
 			dma_addr_t mapping;
 
 			ringp = &ap->skb->rx_std_skbuff[i];
-			mapping = pci_unmap_addr(ringp, mapping);
+			mapping = dma_unmap_addr(ringp, mapping);
 			pci_unmap_page(ap->pdev, mapping,
 				       ACE_STD_BUFSIZE,
 				       PCI_DMA_FROMDEVICE);
@@ -681,7 +681,7 @@ static void __devexit acenic_remove_one(struct pci_dev *pdev)
 				dma_addr_t mapping;
 
 				ringp = &ap->skb->rx_mini_skbuff[i];
-				mapping = pci_unmap_addr(ringp,mapping);
+				mapping = dma_unmap_addr(ringp,mapping);
 				pci_unmap_page(ap->pdev, mapping,
 					       ACE_MINI_BUFSIZE,
 					       PCI_DMA_FROMDEVICE);
@@ -700,7 +700,7 @@ static void __devexit acenic_remove_one(struct pci_dev *pdev)
 			dma_addr_t mapping;
 
 			ringp = &ap->skb->rx_jumbo_skbuff[i];
-			mapping = pci_unmap_addr(ringp, mapping);
+			mapping = dma_unmap_addr(ringp, mapping);
 			pci_unmap_page(ap->pdev, mapping,
 				       ACE_JUMBO_BUFSIZE,
 				       PCI_DMA_FROMDEVICE);
@@ -1683,7 +1683,7 @@ static void ace_load_std_rx_ring(struct ace_private *ap, int nr_bufs)
 				       ACE_STD_BUFSIZE,
 				       PCI_DMA_FROMDEVICE);
 		ap->skb->rx_std_skbuff[idx].skb = skb;
-		pci_unmap_addr_set(&ap->skb->rx_std_skbuff[idx],
+		dma_unmap_addr_set(&ap->skb->rx_std_skbuff[idx],
 				   mapping, mapping);
 
 		rd = &ap->rx_std_ring[idx];
@@ -1744,7 +1744,7 @@ static void ace_load_mini_rx_ring(struct ace_private *ap, int nr_bufs)
 				       ACE_MINI_BUFSIZE,
 				       PCI_DMA_FROMDEVICE);
 		ap->skb->rx_mini_skbuff[idx].skb = skb;
-		pci_unmap_addr_set(&ap->skb->rx_mini_skbuff[idx],
+		dma_unmap_addr_set(&ap->skb->rx_mini_skbuff[idx],
 				   mapping, mapping);
 
 		rd = &ap->rx_mini_ring[idx];
@@ -1800,7 +1800,7 @@ static void ace_load_jumbo_rx_ring(struct ace_private *ap, int nr_bufs)
 				       ACE_JUMBO_BUFSIZE,
 				       PCI_DMA_FROMDEVICE);
 		ap->skb->rx_jumbo_skbuff[idx].skb = skb;
-		pci_unmap_addr_set(&ap->skb->rx_jumbo_skbuff[idx],
+		dma_unmap_addr_set(&ap->skb->rx_jumbo_skbuff[idx],
 				   mapping, mapping);
 
 		rd = &ap->rx_jumbo_ring[idx];
@@ -2013,7 +2013,7 @@ static void ace_rx_int(struct net_device *dev, u32 rxretprd, u32 rxretcsm)
 		skb = rip->skb;
 		rip->skb = NULL;
 		pci_unmap_page(ap->pdev,
-			       pci_unmap_addr(rip, mapping),
+			       dma_unmap_addr(rip, mapping),
 			       mapsize,
 			       PCI_DMA_FROMDEVICE);
 		skb_put(skb, retdesc->size);
@@ -2085,7 +2085,7 @@ static inline void ace_tx_int(struct net_device *dev,
 
 		if (dma_unmap_len(info, maplen)) {
 			pci_unmap_page(ap->pdev, dma_unmap_addr(info, mapping),
-				       pci_unmap_len(info, maplen),
+				       dma_unmap_len(info, maplen),
 				       PCI_DMA_TODEVICE);
 			dma_unmap_len_set(info, maplen, 0);
 		}
@@ -2392,7 +2392,7 @@ static int ace_close(struct net_device *dev)
 				memset(ap->tx_ring + i, 0,
 				       sizeof(struct tx_desc));
 			pci_unmap_page(ap->pdev, dma_unmap_addr(info, mapping),
-				       pci_unmap_len(info, maplen),
+				       dma_unmap_len(info, maplen),
 				       PCI_DMA_TODEVICE);
 			dma_unmap_len_set(info, maplen, 0);
 		}
@@ -2429,8 +2429,8 @@ ace_map_tx_skb(struct ace_private *ap, struct sk_buff *skb,
 
 	info = ap->skb->tx_skbuff + idx;
 	info->skb = tail;
-	pci_unmap_addr_set(info, mapping, mapping);
-	pci_unmap_len_set(info, maplen, skb->len);
+	dma_unmap_addr_set(info, mapping, mapping);
+	dma_unmap_len_set(info, maplen, skb->len);
 	return mapping;
 }
 
@@ -2549,8 +2549,8 @@ restart:
 			} else {
 				info->skb = NULL;
 			}
-			pci_unmap_addr_set(info, mapping, mapping);
-			pci_unmap_len_set(info, maplen, frag->size);
+			dma_unmap_addr_set(info, mapping, mapping);
+			dma_unmap_len_set(info, maplen, frag->size);
 			ace_load_tx_bd(ap, desc, mapping, flagsize, vlan_tag);
 		}
 	}
diff --git a/drivers/net/acenic.h b/drivers/net/acenic.h
index 17079b9..0681da7 100644
--- a/drivers/net/acenic.h
+++ b/drivers/net/acenic.h
@@ -589,7 +589,7 @@ struct ace_info {
 
 struct ring_info {
 	struct sk_buff		*skb;
-	DECLARE_PCI_UNMAP_ADDR(mapping)
+	DEFINE_DMA_UNMAP_ADDR(mapping);
 };
 
 
@@ -600,8 +600,8 @@ struct ring_info {
  */
 struct tx_ring_info {
 	struct sk_buff		*skb;
-	DECLARE_PCI_UNMAP_ADDR(mapping)
-	DECLARE_PCI_UNMAP_LEN(maplen)
+	DEFINE_DMA_UNMAP_ADDR(mapping);
+	DEFINE_DMA_UNMAP_LEN(maplen);
 };
 
 
-- 
1.7.0



* [PATCH 1/2] acenic: fix the misusage of zero dma address
From: FUJITA Tomonori @ 2010-04-01  8:13 UTC
  To: linux-acenic; +Cc: netdev, fujita.tomonori

acenic wrongly assumes that zero is an invalid DMA address (it unmaps a
page only when the stored DMA address is non-zero). Zero is a valid DMA
address on some architectures. Use the stored DMA length to decide whether
a buffer is mapped instead.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
 drivers/net/acenic.c |   16 ++++++----------
 1 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/drivers/net/acenic.c b/drivers/net/acenic.c
index 97a3dfd..02a4746 100644
--- a/drivers/net/acenic.c
+++ b/drivers/net/acenic.c
@@ -2078,18 +2078,16 @@ static inline void ace_tx_int(struct net_device *dev,
 
 	do {
 		struct sk_buff *skb;
-		dma_addr_t mapping;
 		struct tx_ring_info *info;
 
 		info = ap->skb->tx_skbuff + idx;
 		skb = info->skb;
-		mapping = pci_unmap_addr(info, mapping);
 
-		if (mapping) {
-			pci_unmap_page(ap->pdev, mapping,
+		if (dma_unmap_len(info, maplen)) {
+			pci_unmap_page(ap->pdev, dma_unmap_addr(info, mapping),
 				       pci_unmap_len(info, maplen),
 				       PCI_DMA_TODEVICE);
-			pci_unmap_addr_set(info, mapping, 0);
+			dma_unmap_len_set(info, maplen, 0);
 		}
 
 		if (skb) {
@@ -2377,14 +2375,12 @@ static int ace_close(struct net_device *dev)
 
 	for (i = 0; i < ACE_TX_RING_ENTRIES(ap); i++) {
 		struct sk_buff *skb;
-		dma_addr_t mapping;
 		struct tx_ring_info *info;
 
 		info = ap->skb->tx_skbuff + i;
 		skb = info->skb;
-		mapping = pci_unmap_addr(info, mapping);
 
-		if (mapping) {
+		if (dma_unmap_len(info, maplen)) {
 			if (ACE_IS_TIGON_I(ap)) {
 				/* NB: TIGON_1 is special, tx_ring is in io space */
 				struct tx_desc __iomem *tx;
@@ -2395,10 +2391,10 @@ static int ace_close(struct net_device *dev)
 			} else
 				memset(ap->tx_ring + i, 0,
 				       sizeof(struct tx_desc));
-			pci_unmap_page(ap->pdev, mapping,
+			pci_unmap_page(ap->pdev, dma_unmap_addr(info, mapping),
 				       pci_unmap_len(info, maplen),
 				       PCI_DMA_TODEVICE);
-			pci_unmap_addr_set(info, mapping, 0);
+			dma_unmap_len_set(info, maplen, 0);
 		}
 		if (skb) {
 			dev_kfree_skb(skb);
-- 
1.7.0
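
The idea behind this fix can be shown with a small standalone sketch:
track whether a ring slot is mapped through its length, never through the
address, because address 0 can be a legitimate mapping. The struct and
helper names are hypothetical and only mimic the driver's bookkeeping.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-slot bookkeeping, loosely modelled on tx_ring_info. */
struct tx_slot_sketch {
	uint64_t dma_addr;	/* may legitimately be 0 on some platforms */
	uint32_t dma_len;	/* 0 means "nothing mapped in this slot" */
};

/* Record a mapping; len must be non-zero for a real buffer. */
static void slot_set_mapped(struct tx_slot_sketch *slot,
			    uint64_t addr, uint32_t len)
{
	slot->dma_addr = addr;
	slot->dma_len = len;
}

/* A slot is mapped if and only if its recorded length is non-zero. */
static bool slot_is_mapped(const struct tx_slot_sketch *slot)
{
	return slot->dma_len != 0;
}

/* Mark the slot as unmapped by clearing the length, not the address. */
static void slot_clear(struct tx_slot_sketch *slot)
{
	slot->dma_len = 0;
}
```

A check on dma_addr instead would treat a buffer mapped at address 0 as
unmapped and leak the mapping.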



* [PATCH net-next-2.6] ipv6 fib: Make ip6_fib{} more cache-line aware.
From: YOSHIFUJI Hideaki @ 2010-04-01  8:18 UTC
  To: davem; +Cc: yoshfuji, netdev

Because elements at the end of dst_entry{} are frequently
updated, it is not good to put frequently-used static
elements, such as rt6i_idev, rt6i_dst or rt6i_flags in the
same cache line.

On the other hand, fib6_table, rt6i_node or rt6i_gateway are
rarely used, so it is okay to stay in the same cache line.

Let's rearrange ip6_fib{}.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
---
 include/net/ip6_fib.h |   29 ++++++++++++++++-------------
 1 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 86f46c4..4b1dc11 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -88,34 +88,37 @@ struct rt6_info {
 		struct dst_entry	dst;
 	} u;
 
-	struct inet6_dev		*rt6i_idev;
-
 #define rt6i_dev			u.dst.dev
 #define rt6i_nexthop			u.dst.neighbour
 #define rt6i_expires			u.dst.expires
 
+	/*
+	 * Tail elements of dst_entry (__refcnt etc.)
+	 * and these elements (rarely used in hot path) are in
+	 * the same cache line.
+	 */
+	struct fib6_table		*rt6i_table;
 	struct fib6_node		*rt6i_node;
 
 	struct in6_addr			rt6i_gateway;
-	
-	u32				rt6i_flags;
-	u32				rt6i_metric;
-	atomic_t			rt6i_ref;
 
-	/* more non-fragment space at head required */
-	unsigned short			rt6i_nfheader_len;
-
-	u8				rt6i_protocol;
+	atomic_t			rt6i_ref;
 
-	struct fib6_table		*rt6i_table;
+	/* These are in a separate cache line. */
+	struct rt6key			rt6i_dst ____cacheline_aligned_in_smp;
+	u32				rt6i_flags;
+	struct rt6key			rt6i_src;
+	u32				rt6i_metric;
 
-	struct rt6key			rt6i_dst;
+	struct inet6_dev		*rt6i_idev;
 
 #ifdef CONFIG_XFRM
 	u32				rt6i_flow_cache_genid;
 #endif
+	/* more non-fragment space at head required */
+	unsigned short			rt6i_nfheader_len;
 
-	struct rt6key			rt6i_src;
+	u8				rt6i_protocol;
 };
 
 static inline struct inet6_dev *ip6_dst_idev(struct dst_entry *dst)
-- 
1.5.6.5



* [PATCH net-next-2.6] ipv6 fib: Make rt6_info{} more cache-line aware.
From: YOSHIFUJI Hideaki @ 2010-04-01  8:24 UTC
  To: davem; +Cc: yoshfuji, netdev

The head element of rt6_info{} is dst_entry{}, and
IPv6 specific elements follow.

Because elements at the end of dst_entry{} are frequently
updated, it is not good to put frequently-used static
elements, such as rt6i_idev, rt6i_dst or rt6i_flags in the
same cache line.

On the other hand, fib6_table, rt6i_node or rt6i_gateway are
rarely used, so it is okay to stay in the same cache line.

Let's rearrange rt6_info{}.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
---
 include/net/ip6_fib.h |   29 ++++++++++++++++-------------
 1 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 86f46c4..4b1dc11 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -88,34 +88,37 @@ struct rt6_info {
 		struct dst_entry	dst;
 	} u;
 
-	struct inet6_dev		*rt6i_idev;
-
 #define rt6i_dev			u.dst.dev
 #define rt6i_nexthop			u.dst.neighbour
 #define rt6i_expires			u.dst.expires
 
+	/*
+	 * Tail elements of dst_entry (__refcnt etc.)
+	 * and these elements (rarely used in hot path) are in
+	 * the same cache line.
+	 */
+	struct fib6_table		*rt6i_table;
 	struct fib6_node		*rt6i_node;
 
 	struct in6_addr			rt6i_gateway;
-	
-	u32				rt6i_flags;
-	u32				rt6i_metric;
-	atomic_t			rt6i_ref;
 
-	/* more non-fragment space at head required */
-	unsigned short			rt6i_nfheader_len;
-
-	u8				rt6i_protocol;
+	atomic_t			rt6i_ref;
 
-	struct fib6_table		*rt6i_table;
+	/* These are in a separate cache line. */
+	struct rt6key			rt6i_dst ____cacheline_aligned_in_smp;
+	u32				rt6i_flags;
+	struct rt6key			rt6i_src;
+	u32				rt6i_metric;
 
-	struct rt6key			rt6i_dst;
+	struct inet6_dev		*rt6i_idev;
 
 #ifdef CONFIG_XFRM
 	u32				rt6i_flow_cache_genid;
 #endif
+	/* more non-fragment space at head required */
+	unsigned short			rt6i_nfheader_len;
 
-	struct rt6key			rt6i_src;
+	u8				rt6i_protocol;
 };
 
 static inline struct inet6_dev *ip6_dst_idev(struct dst_entry *dst)
-- 
1.5.6.5
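
The layout principle in this patch (keep read-mostly hot fields off the
cache line that holds frequently written ones) can be sketched in plain C.
The example is illustrative only: the struct is hypothetical, a 64-byte
cache line is assumed, and the GCC/Clang aligned attribute stands in for
the kernel's ____cacheline_aligned_in_smp.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64	/* assumed cache line size */

/* Hypothetical route entry, not the real rt6_info layout. */
struct route_sketch {
	/* Frequently written fields (like the tail of dst_entry). */
	uint32_t refcnt;
	uint64_t lastuse;

	/* Rarely used fields: harmless to share the written cache line. */
	void *table;
	void *node;

	/* Read-mostly hot-path fields start on their own cache line. */
	uint8_t dst_key[20] __attribute__((aligned(CACHE_LINE)));
	uint32_t flags;
	uint32_t metric;
};
```

Because dst_key is forced onto a cache line boundary, writes to refcnt do
not invalidate the line that lookups read flags and dst_key from.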


* Re: [PATCH net-next-2.6] ipv6 fib: Make ip6_fib{} more cache-line aware.
From: YOSHIFUJI Hideaki @ 2010-04-01  8:26 UTC
  To: davem; +Cc: YOSHIFUJI Hideaki, netdev
In-Reply-To: <201004010818.o318IZ6f025510@94.43.138.210.xn.2iij.net>

(2010/04/01 17:18), YOSHIFUJI Hideaki wrote:
:
> Let's rearrange ip6_fib{}.

Oops, I should say rt6_info{}.  Resent.

--yoshfuji


* Re: [PATCH 2/5] net: ipv6: add IPSKB_REROUTED exclusion to NF_HOOK/POSTROUTING invocation
From: David Miller @ 2010-04-01  8:34 UTC
  To: jengelh; +Cc: kaber, netfilter-devel, netdev
In-Reply-To: <1270031934-15940-3-git-send-email-jengelh@medozas.de>

From: Jan Engelhardt <jengelh@medozas.de>
Date: Wed, 31 Mar 2010 12:38:50 +0200

> Similar to how IPv4's ip_output.c works, have ip6_output also check
> the IPSKB_REROUTED flag. It will be set from xt_TEE for cloned packets
> since Xtables can currently only deal with a single packet in flight
> at a time.
> 
> Signed-off-by: Jan Engelhardt <jengelh@medozas.de>

I defer to ipv6 experts as to whether this will cause trouble
or not.

If they are fine with it, feel free to add my:

Acked-by: David S. Miller <davem@davemloft.net>

and this can go in via the nf tree along with the other changes in
this set.

Thanks.

^ permalink raw reply

* Re: [PATCH v3 04/12] l2tp: Add ppp device name to L2TP ppp session data
From: James Chapman @ 2010-04-01  8:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Stephen Hemminger, netdev
In-Reply-To: <1270107034.2229.5.camel@edumazet-laptop>

Eric Dumazet wrote:
> Le jeudi 01 avril 2010 à 08:19 +0100, James Chapman a écrit :
> 
>> There might be thousands of L2TP sessions in some setups. Populating
>> sysfs with a link for each of those sessions isn't practical. The
>> existing /proc file dumps its info as a single text file for this
>> reason. I'd also like to provide the device name in the session netlink
>> message, which is the interface used by l2tp userspace, so I need a
>> kernel API to retrieve the device name from ppp.
>>
>> I like the suggestion of using debugfs for access to driver debug info
>> though. I propose leaving the /proc file for L2TPv2 only, removing the
>> L2TPv3 data that I added to the proc file in this patch series, to
>> retain compatibility with the existing driver. This would show only
>> L2TPv2 sessions and tunnels. For new driver functionality (L2TPv3 etc),
>> use debugfs. The debugfs files would dump lists in a similar form to the
>> current code, listing all tunnels (L2TPv2 and L2TPv3) in a single file.
>> Using debugfs gives more flexibility for adding additional info later,
>> as required. How does that sound?
>>
> 
> debugfs? I don't get it, sorry.
> 
> Why not use netlink, as most iproute2 utilities do?

I am using netlink. This is only for providing extra (convenience) debug
info from the kernel drivers. It's something I can easily ask users to
do if they have a problem.


-- 
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development


^ permalink raw reply

* [PATCH] net: check the length of the socket address passed to connect(2)
From: Changli Gao @ 2010-04-01  8:58 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, xiaosuo

Check the length of the socket address passed to connect(2) before
reading any of its fields. If the length is invalid, return -EINVAL.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
---
net/bluetooth/l2cap.c | 3 ++-
net/bluetooth/rfcomm/sock.c | 3 ++-
net/bluetooth/sco.c | 3 ++-
net/can/bcm.c | 3 +++
net/ieee802154/af_ieee802154.c | 3 +++
net/ipv4/af_inet.c | 5 +++++
net/netlink/af_netlink.c | 3 +++
7 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/net/bluetooth/l2cap.c b/net/bluetooth/l2cap.c
index 7794a2e..99d68c3 100644
--- a/net/bluetooth/l2cap.c
+++ b/net/bluetooth/l2cap.c
@@ -1002,7 +1002,8 @@ static int l2cap_sock_connect(struct socket *sock, struct sockaddr *addr, int al
 
 	BT_DBG("sk %p", sk);
 
-	if (!addr || addr->sa_family != AF_BLUETOOTH)
+	if (!addr || alen < sizeof(addr->sa_family) ||
+	    addr->sa_family != AF_BLUETOOTH)
 		return -EINVAL;
 
 	memset(&la, 0, sizeof(la));
diff --git a/net/bluetooth/rfcomm/sock.c b/net/bluetooth/rfcomm/sock.c
index 7f43976..8ed3c37 100644
--- a/net/bluetooth/rfcomm/sock.c
+++ b/net/bluetooth/rfcomm/sock.c
@@ -397,7 +397,8 @@ static int rfcomm_sock_connect(struct socket *sock, struct sockaddr *addr, int a
 
 	BT_DBG("sk %p", sk);
 
-	if (addr->sa_family != AF_BLUETOOTH || alen < sizeof(struct sockaddr_rc))
+	if (alen < sizeof(struct sockaddr_rc) ||
+	    addr->sa_family != AF_BLUETOOTH)
 		return -EINVAL;
 
 	lock_sock(sk);
diff --git a/net/bluetooth/sco.c b/net/bluetooth/sco.c
index e5b16b7..ca6b2ad 100644
--- a/net/bluetooth/sco.c
+++ b/net/bluetooth/sco.c
@@ -499,7 +499,8 @@ static int sco_sock_connect(struct socket *sock, struct sockaddr *addr, int alen
 
 	BT_DBG("sk %p", sk);
 
-	if (addr->sa_family != AF_BLUETOOTH || alen < sizeof(struct sockaddr_sco))
+	if (alen < sizeof(struct sockaddr_sco) ||
+	    addr->sa_family != AF_BLUETOOTH)
 		return -EINVAL;
 
 	if (sk->sk_state != BT_OPEN && sk->sk_state != BT_BOUND)
diff --git a/net/can/bcm.c b/net/can/bcm.c
index a2dee52..907dc87 100644
--- a/net/can/bcm.c
+++ b/net/can/bcm.c
@@ -1479,6 +1479,9 @@ static int bcm_connect(struct socket *sock, struct sockaddr *uaddr, int len,
 	struct sock *sk = sock->sk;
 	struct bcm_sock *bo = bcm_sk(sk);
 
+	if (len < sizeof(*addr))
+		return -EINVAL;
+
 	if (bo->bound)
 		return -EISCONN;
 
diff --git a/net/ieee802154/af_ieee802154.c b/net/ieee802154/af_ieee802154.c
index 79886d5..c7da600 100644
--- a/net/ieee802154/af_ieee802154.c
+++ b/net/ieee802154/af_ieee802154.c
@@ -127,6 +127,9 @@ static int ieee802154_sock_connect(struct socket *sock, struct sockaddr *uaddr,
 {
 	struct sock *sk = sock->sk;
 
+	if (addr_len < sizeof(uaddr->sa_family))
+		return -EINVAL;
+
 	if (uaddr->sa_family == AF_UNSPEC)
 		return sk->sk_prot->disconnect(sk, flags);
 
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index be1a6ac..a0beb32 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -531,6 +531,8 @@ int inet_dgram_connect(struct socket *sock, struct sockaddr * uaddr,
 {
 	struct sock *sk = sock->sk;
 
+	if (addr_len < sizeof(uaddr->sa_family))
+		return -EINVAL;
 	if (uaddr->sa_family == AF_UNSPEC)
 		return sk->sk_prot->disconnect(sk, flags);
 
@@ -574,6 +576,9 @@ int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 	int err;
 	long timeo;
 
+	if (addr_len < sizeof(uaddr->sa_family))
+		return -EINVAL;
+
 	lock_sock(sk);
 
 	if (uaddr->sa_family == AF_UNSPEC) {
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 274d977..6464a19 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -683,6 +683,9 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
 	struct netlink_sock *nlk = nlk_sk(sk);
 	struct sockaddr_nl *nladdr = (struct sockaddr_nl *)addr;
 
+	if (alen < sizeof(addr->sa_family))
+		return -EINVAL;
+
 	if (addr->sa_family == AF_UNSPEC) {
 		sk->sk_state	= NETLINK_UNCONNECTED;
 		nlk->dst_pid	= 0;
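
The hunks above share one pattern: reject an address that is too
short to contain sa_family before reading sa_family. A minimal
user-space sketch of that pattern follows. The helper name
check_connect_addr() is hypothetical and not part of the patch, and
the explicit negative-length test is an addition for user space (in
the kernel the syscall layer is assumed to have rejected negative
lengths already).

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not in the patch): validate a caller-supplied
 * socket address before touching any of its fields.
 * Returns 0 if sa_family can be read safely and equals `family`,
 * -EINVAL otherwise.
 */
static int check_connect_addr(const struct sockaddr *addr, int alen,
			      sa_family_t family)
{
	/* Bytes needed before sa_family is fully inside the buffer. */
	const size_t need = offsetof(struct sockaddr, sa_family) +
			    sizeof(addr->sa_family);

	/* Length first: never read sa_family from a short buffer. */
	if (!addr || alen < 0 || (size_t)alen < need)
		return -EINVAL;
	if (addr->sa_family != family)
		return -EINVAL;
	return 0;
}
```

A caller would run this check at the top of its connect handler and
return the error unchanged. Handlers that need the full
protocol-specific address, as rfcomm and sco do above, would compare
against the size of that structure instead.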



^ permalink raw reply related

* Re: [PATCH v3 04/12] l2tp: Add ppp device name to L2TP ppp session data
From: James Chapman @ 2010-04-01  8:59 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, netdev
In-Reply-To: <20100401.003459.88028238.davem@davemloft.net>

David Miller wrote:
> From: James Chapman <jchapman@katalix.com>
> Date: Thu, 01 Apr 2010 08:19:43 +0100
> 
>> There might be thousands of L2TP sessions in some setups. Populating
>> sysfs with a link for each of those sessions isn't practical. The
>> existing /proc file dumps its info as a single text file for this
>> reason. I'd also like to provide the device name in the session netlink
>> message, which is the interface used by l2tp userspace, so I need a
>> kernel API to retrieve the device name from ppp.
> 
> Scalability concerns are also another reason _not_ to use
> procfs.
> 
> Use netlink or similar, which can dump with filtering and
> proper queueing.

See previous reply to Eric. The original /proc API was originally
intended for debug, but users have been using it in hacky scripts
instead of using proper socket interfaces to get the data they need. So
I'm proposing to retain the existing proc file for compatibility but put
other (new) driver-private debug info into debugfs.


-- 
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development


^ permalink raw reply

* Re: [PATCH v3 04/12] l2tp: Add ppp device name to L2TP ppp session data
From: David Miller @ 2010-04-01  9:04 UTC (permalink / raw)
  To: jchapman; +Cc: shemminger, netdev
In-Reply-To: <4BB46084.5030105@katalix.com>

From: James Chapman <jchapman@katalix.com>
Date: Thu, 01 Apr 2010 09:59:48 +0100

> The original /proc API was originally intended for debug, but users
> have been using it in hacky scripts instead of using proper socket
> interfaces to get the data they need. So I'm proposing to retain the
> existing proc file for compatibility but put other (new)
> driver-private debug info into debugfs.

Fair enough.

^ permalink raw reply

* Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
From: Xin Xiaohui @ 2010-04-01  9:14 UTC (permalink / raw)
  To: mst; +Cc: netdev, kvm, linux-kernel, mingo, jdike, Xin Xiaohui
In-Reply-To: <20100317102710.GB9782@redhat.com>

The vhost-net backend currently supports only synchronous send/recv
operations. This patch adds multiple submits and asynchronous
notifications, which are needed for the zero-copy case.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
---

Michael,
vhost now allocates and destroys the kiocb and passes it through
sendmsg/recvmsg. I did not remove vq->receiver, since what the
callback does is tied to structures owned by the mp device, and I
think isolating them from vhost is good for all of us.
It will not prevent the mp device from becoming independent of vhost
in the future. Later, when the mp device is a real device that
provides asynchronous read/write operations rather than just exposing
proto_ops, it will use another callback function that is not
related to vhost at all.

For write logging, do you have a function at hand that we can use to
recompute the log? If so, I can use it to recompute the log info
when logging is suddenly enabled.
For the outstanding requests, do you mean all the user buffers that
were submitted before the logging ioctl changed? That may be a lot, and
some of them are still in NIC ring descriptors. Waiting for them to
finish may take some time. I think it is also reasonable for logging
to change just after the logging ioctl changes.

Thanks
Xiaohui

 drivers/vhost/net.c   |  189 +++++++++++++++++++++++++++++++++++++++++++++++--
 drivers/vhost/vhost.h |   10 +++
 2 files changed, 192 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 22d5fef..2aafd90 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -17,11 +17,13 @@
 #include <linux/workqueue.h>
 #include <linux/rcupdate.h>
 #include <linux/file.h>
+#include <linux/aio.h>
 
 #include <linux/net.h>
 #include <linux/if_packet.h>
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
+#include <linux/mpassthru.h>
 
 #include <net/sock.h>
 
@@ -47,6 +49,7 @@ struct vhost_net {
 	struct vhost_dev dev;
 	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	struct kmem_cache       *cache;
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
@@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	if (!list_empty(&vq->notifier)) {
+		iocb = list_first_entry(&vq->notifier,
+				struct kiocb, ki_list);
+		list_del(&iocb->ki_list);
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+	return iocb;
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+					struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	struct vhost_log *vq_log = NULL;
+	int rx_total_len = 0;
+	int log, size;
+
+	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+		return;
+
+	if (vq->receiver)
+		vq->receiver(vq);
+
+	vq_log = unlikely(vhost_has_feature(
+				&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, iocb->ki_nbytes);
+		log = (int)iocb->ki_user_data;
+		size = iocb->ki_nbytes;
+		rx_total_len += iocb->ki_nbytes;
+
+		if (iocb->ki_dtor)
+			iocb->ki_dtor(iocb);
+		kmem_cache_free(net->cache, iocb);
+
+		if (unlikely(vq_log))
+			vhost_log_write(vq, vq_log, log, size);
+		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+					struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	int tx_total_len = 0;
+
+	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+		return;
+
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, 0);
+		tx_total_len += iocb->ki_nbytes;
+
+		if (iocb->ki_dtor)
+			iocb->ki_dtor(iocb);
+
+		kmem_cache_free(net->cache, iocb);
+		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+	struct kiocb *iocb = NULL;
 	unsigned head, out, in, s;
 	struct msghdr msg = {
 		.msg_name = NULL,
@@ -124,6 +204,8 @@ static void handle_tx(struct vhost_net *net)
 		tx_poll_stop(net);
 	hdr_size = vq->hdr_size;
 
+	handle_async_tx_events_notify(net, vq);
+
 	for (;;) {
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
@@ -151,6 +233,15 @@ static void handle_tx(struct vhost_net *net)
 		/* Skip header. TODO: support TSO. */
 		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
 		msg.msg_iovlen = out;
+
+		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
+			if (!iocb)
+				break;
+			iocb->ki_pos = head;
+			iocb->private = (void *)vq;
+		}
+
 		len = iov_length(vq->iov, out);
 		/* Sanity check */
 		if (!len) {
@@ -160,12 +251,16 @@ static void handle_tx(struct vhost_net *net)
 			break;
 		}
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
-		err = sock->ops->sendmsg(NULL, sock, &msg, len);
+		err = sock->ops->sendmsg(iocb, sock, &msg, len);
 		if (unlikely(err < 0)) {
 			vhost_discard_vq_desc(vq);
 			tx_poll_start(net, sock);
 			break;
 		}
+
+		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
+			continue;
+
 		if (err != len)
 			pr_err("Truncated TX packet: "
 			       " len %d != %zd\n", err, len);
@@ -177,6 +272,8 @@ static void handle_tx(struct vhost_net *net)
 		}
 	}
 
+	handle_async_tx_events_notify(net, vq);
+
 	mutex_unlock(&vq->mutex);
 	unuse_mm(net->dev.mm);
 }
@@ -186,6 +283,7 @@ static void handle_tx(struct vhost_net *net)
 static void handle_rx(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
+	struct kiocb *iocb = NULL;
 	unsigned head, out, in, log, s;
 	struct vhost_log *vq_log;
 	struct msghdr msg = {
@@ -206,7 +304,8 @@ static void handle_rx(struct vhost_net *net)
 	int err;
 	size_t hdr_size;
 	struct socket *sock = rcu_dereference(vq->private_data);
-	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
+	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
+			vq->link_state == VHOST_VQ_LINK_SYNC))
 		return;
 
 	use_mm(net->dev.mm);
@@ -214,9 +313,18 @@ static void handle_rx(struct vhost_net *net)
 	vhost_disable_notify(vq);
 	hdr_size = vq->hdr_size;
 
-	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
+	/* In the async case, the simple approach to write logging is to
+	 * always collect the log info and decide later whether to log.
+	 * Thus, when logging is enabled the log info is available, and
+	 * when logging is disabled the log info is simply not used.
+	 */
+
+	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
+		(vq->link_state == VHOST_VQ_LINK_ASYNC) ?
 		vq->log : NULL;
 
+	handle_async_rx_events_notify(net, vq);
+
 	for (;;) {
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
@@ -245,6 +353,14 @@ static void handle_rx(struct vhost_net *net)
 		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
 		msg.msg_iovlen = in;
 		len = iov_length(vq->iov, in);
+		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
+			if (!iocb)
+				break;
+			iocb->private = vq;
+			iocb->ki_pos = head;
+			iocb->ki_user_data = log;
+		}
 		/* Sanity check */
 		if (!len) {
 			vq_err(vq, "Unexpected header len for RX: "
@@ -252,13 +368,18 @@ static void handle_rx(struct vhost_net *net)
 			       iov_length(vq->hdr, s), hdr_size);
 			break;
 		}
-		err = sock->ops->recvmsg(NULL, sock, &msg,
+
+		err = sock->ops->recvmsg(iocb, sock, &msg,
 					 len, MSG_DONTWAIT | MSG_TRUNC);
 		/* TODO: Check specific error and bomb out unless EAGAIN? */
 		if (err < 0) {
 			vhost_discard_vq_desc(vq);
 			break;
 		}
+
+		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
+			continue;
+
 		/* TODO: Should check and handle checksum. */
 		if (err > len) {
 			pr_err("Discarded truncated rx packet: "
@@ -284,10 +405,13 @@ static void handle_rx(struct vhost_net *net)
 		}
 	}
 
+	handle_async_rx_events_notify(net, vq);
+
 	mutex_unlock(&vq->mutex);
 	unuse_mm(net->dev.mm);
 }
 
+
 static void handle_tx_kick(struct work_struct *work)
 {
 	struct vhost_virtqueue *vq;
@@ -338,6 +462,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
 	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
 	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+	n->cache = NULL;
 	return 0;
 }
 
@@ -398,6 +523,17 @@ static void vhost_net_flush(struct vhost_net *n)
 	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
 }
 
+static void vhost_notifier_cleanup(struct vhost_net *n)
+{
+	struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_NET_VQ_RX];
+	struct kiocb *iocb = NULL;
+	if (n->cache) {
+		while ((iocb = notify_dequeue(vq)) != NULL)
+			kmem_cache_free(n->cache, iocb);
+		kmem_cache_destroy(n->cache);
+	}
+}
+
 static int vhost_net_release(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = f->private_data;
@@ -414,6 +550,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	/* We do an extra flush before freeing memory,
 	 * since jobs can re-queue themselves. */
 	vhost_net_flush(n);
+	vhost_notifier_cleanup(n);
 	kfree(n);
 	return 0;
 }
@@ -462,7 +599,19 @@ static struct socket *get_tun_socket(int fd)
 	return sock;
 }
 
-static struct socket *get_socket(int fd)
+static struct socket *get_mp_socket(int fd)
+{
+	struct file *file = fget(fd);
+	struct socket *sock;
+	if (!file)
+		return ERR_PTR(-EBADF);
+	sock = mp_get_socket(file);
+	if (IS_ERR(sock))
+		fput(file);
+	return sock;
+}
+
+static struct socket *get_socket(struct vhost_virtqueue *vq, int fd)
 {
 	struct socket *sock;
 	if (fd == -1)
@@ -473,9 +622,31 @@ static struct socket *get_socket(int fd)
 	sock = get_tun_socket(fd);
 	if (!IS_ERR(sock))
 		return sock;
+	sock = get_mp_socket(fd);
+	if (!IS_ERR(sock)) {
+		vq->link_state = VHOST_VQ_LINK_ASYNC;
+		return sock;
+	}
 	return ERR_PTR(-ENOTSOCK);
 }
 
+static void vhost_init_link_state(struct vhost_net *n, int index)
+{
+	struct vhost_virtqueue *vq = n->vqs + index;
+
+	WARN_ON(!mutex_is_locked(&vq->mutex));
+	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+		vq->receiver = NULL;
+		INIT_LIST_HEAD(&vq->notifier);
+		spin_lock_init(&vq->notify_lock);
+		if (!n->cache) {
+			n->cache = kmem_cache_create("vhost_kiocb",
+					sizeof(struct kiocb), 0,
+					SLAB_HWCACHE_ALIGN, NULL);
+		}
+	}
+}
+
 static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 {
 	struct socket *sock, *oldsock;
@@ -493,12 +664,15 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 	}
 	vq = n->vqs + index;
 	mutex_lock(&vq->mutex);
-	sock = get_socket(fd);
+	vq->link_state = VHOST_VQ_LINK_SYNC;
+	sock = get_socket(vq, fd);
 	if (IS_ERR(sock)) {
 		r = PTR_ERR(sock);
 		goto err;
 	}
 
+	vhost_init_link_state(n, index);
+
 	/* start polling new socket */
 	oldsock = vq->private_data;
 	if (sock == oldsock)
@@ -507,8 +681,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 	vhost_net_disable_vq(n, vq);
 	rcu_assign_pointer(vq->private_data, sock);
 	vhost_net_enable_vq(n, vq);
-	mutex_unlock(&vq->mutex);
 done:
+	mutex_unlock(&vq->mutex);
 	mutex_unlock(&n->dev.mutex);
 	if (oldsock) {
 		vhost_net_flush_vq(n, index);
@@ -516,6 +690,7 @@ done:
 	}
 	return r;
 err:
+	mutex_unlock(&vq->mutex);
 	mutex_unlock(&n->dev.mutex);
 	return r;
 }
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d1f0453..cffe39a 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -43,6 +43,11 @@ struct vhost_log {
 	u64 len;
 };
 
+enum vhost_vq_link_state {
+	VHOST_VQ_LINK_SYNC = 	0,
+	VHOST_VQ_LINK_ASYNC = 	1,
+};
+
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -96,6 +101,11 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log log[VHOST_NET_MAX_SG];
+	/* Differentiate async socket for 0-copy from normal */
+	enum vhost_vq_link_state link_state;
+	struct list_head notifier;
+	spinlock_t notify_lock;
+	void (*receiver)(struct vhost_virtqueue *);
 };
 
 struct vhost_dev {
-- 
1.5.4.4
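
The completion path in this patch is a lock-protected FIFO: the
device side appends finished requests, and handle_tx()/handle_rx()
drain them until a byte budget (VHOST_NET_WEIGHT) is reached. A
minimal user-space sketch of that pattern follows. It assumes a
pthread mutex in place of the spinlock and a hand-rolled singly
linked list in place of list_head; struct req and the nq_* names are
hypothetical and do not appear in the patch.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a completed request (a kiocb in the patch). */
struct req {
	struct req *next;
	size_t nbytes;
};

/* FIFO of completions; a mutex replaces the kernel spinlock here. */
struct notify_queue {
	pthread_mutex_t lock;
	struct req *head;
	struct req **tail;
};

static void nq_init(struct notify_queue *q)
{
	pthread_mutex_init(&q->lock, NULL);
	q->head = NULL;
	q->tail = &q->head;
}

/* Producer side: append a completed request. */
static void nq_enqueue(struct notify_queue *q, struct req *r)
{
	r->next = NULL;
	pthread_mutex_lock(&q->lock);
	*q->tail = r;
	q->tail = &r->next;
	pthread_mutex_unlock(&q->lock);
}

/* Consumer side: pop the oldest request, or NULL if the queue is empty. */
static struct req *nq_dequeue(struct notify_queue *q)
{
	struct req *r;

	pthread_mutex_lock(&q->lock);
	r = q->head;
	if (r) {
		q->head = r->next;
		if (!q->head)
			q->tail = &q->head;
	}
	pthread_mutex_unlock(&q->lock);
	return r;
}

/*
 * Drain completions until the queue is empty or `budget` bytes have
 * been processed. Returns the number of requests handled; anything
 * left over stays queued for a later pass.
 */
static size_t nq_drain(struct notify_queue *q, size_t budget)
{
	struct req *r;
	size_t total = 0, handled = 0;

	while ((r = nq_dequeue(q)) != NULL) {
		total += r->nbytes;
		handled++;
		if (total >= budget)
			break;
	}
	return handled;
}
```

Stopping at the budget bounds the work done in one pass; in the patch
the handler requeues itself with vhost_poll_queue() at that point so
the remaining completions are picked up later.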


^ permalink raw reply related

* Re: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.
From: Xin Xiaohui @ 2010-04-01  9:27 UTC (permalink / raw)
  To: mst; +Cc: netdev, kvm, linux-kernel, mingo, jdike, yzhao81, Xin Xiaohui
In-Reply-To: <20100308112849.GI7482@redhat.com>

Add a device to utilize the vhost-net backend driver for
copy-less data transfer between guest FE and host NIC.
It pins the guest user space to the host memory and
provides proto_ops as sendmsg/recvmsg to vhost-net.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81@gmail.com>
Signed-off-by: Jeff Dike <jdike@c2.user-mode-linux.org>
---

Michael,
Sorry, I did not resolve all your comments this time.
I did not move the device out of the vhost directory because I
have not yet implemented real asynchronous read/write operations
for the mp device. We hope to do this after the network
code is checked in.

For the DoS issue, I'm not sure what a reasonable limit on the pages
pinned by get_user_pages() would be. Should we derive it from the
bandwidth?

We now use get_user_pages_fast() and set_page_dirty_lock().
I removed rcu_read_lock()/unlock(), since the ctor pointer is
only changed by the BIND/UNBIND ioctls, and during that time
the NIC is always stopped and all outstanding requests are done,
so the ctor pointer cannot be raced into a wrong state.

Qemu needs a userspace write; is that a synchronous one or an
asynchronous one?

Thanks
Xiaohui

 drivers/vhost/Kconfig     |    5 +
 drivers/vhost/Makefile    |    2 +
 drivers/vhost/mpassthru.c | 1162 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mpassthru.h |   29 ++
 4 files changed, 1198 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c
 create mode 100644 include/linux/mpassthru.h

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 9f409f4..ee32a3b 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,8 @@ config VHOST_NET
 	  To compile this driver as a module, choose M here: the module will
 	  be called vhost_net.
 
+config VHOST_PASSTHRU
+	tristate "Zerocopy network driver (EXPERIMENTAL)"
+	depends on VHOST_NET
+	---help---
+	  zerocopy network I/O support
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..3f79c79 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_VHOST_PASSTHRU) += mpassthru.o
diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 0000000..6e8fc4d
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1162 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME        "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+
+#include "vhost.h"
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG  if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+	u16     offset;
+	u16     size;
+};
+
+struct page_ctor {
+	struct list_head        readq;
+	int 			w_len;
+	int 			r_len;
+	spinlock_t      	read_lock;
+	struct kmem_cache   	*cache;
+	struct net_device   	*dev;
+	struct mpassthru_port	port;
+};
+
+struct page_info {
+	void 			*ctrl;
+	struct list_head    	list;
+	int         		header;
+	/* indicate the actual number of bytes
+	 * sent/received in the user space buffers
+	 */
+	int         		total;
+	int         		offset;
+	struct page     	*pages[MAX_SKB_FRAGS+1];
+	struct skb_frag_struct 	frag[MAX_SKB_FRAGS+1];
+	struct sk_buff      	*skb;
+	struct page_ctor   	*ctor;
+
+	/* The pointer relayed to the skb, to indicate whether
+	 * it is a user space or kernel allocated skb
+	 */
+	struct skb_user_page    user;
+	struct skb_shared_info	ushinfo;
+
+#define INFO_READ      		0
+#define INFO_WRITE     		1
+	unsigned        	flags;
+	unsigned        	pnum;
+
+	/* Meaningful for receive only:
+	 * the maximum length allowed
+	 */
+	size_t          	len;
+
+	/* The fields after this point are for the
+	 * backend driver, currently vhost-net.
+	 */
+
+	struct kiocb		*iocb;
+	unsigned int    	desc_pos;
+	unsigned int 		log;
+	struct iovec 		hdr[VHOST_NET_MAX_SG];
+	struct iovec 		iov[VHOST_NET_MAX_SG];
+};
+
+struct mp_struct {
+	struct mp_file   	*mfile;
+	struct net_device       *dev;
+	struct page_ctor	*ctor;
+	struct socket           socket;
+
+#ifdef MPASSTHRU_DEBUG
+	int debug;
+#endif
+};
+
+struct mp_file {
+	atomic_t count;
+	struct mp_struct *mp;
+	struct net *net;
+};
+
+struct mp_sock {
+	struct sock            	sk;
+	struct mp_struct       	*mp;
+};
+
+static int mp_dev_change_flags(struct net_device *dev, unsigned flags)
+{
+	int ret = 0;
+
+	rtnl_lock();
+	ret = dev_change_flags(dev, flags);
+	rtnl_unlock();
+
+	if (ret < 0)
+		printk(KERN_ERR "failed to change dev state of %s", dev->name);
+
+	return ret;
+}
+
+/* The main function to allocate user space buffers */
+static struct skb_user_page *page_ctor(struct mpassthru_port *port,
+					struct sk_buff *skb, int npages)
+{
+	int i;
+	unsigned long flags;
+	struct page_ctor *ctor;
+	struct page_info *info = NULL;
+
+	ctor = container_of(port, struct page_ctor, port);
+
+	spin_lock_irqsave(&ctor->read_lock, flags);
+	if (!list_empty(&ctor->readq)) {
+		info = list_first_entry(&ctor->readq, struct page_info, list);
+		list_del(&info->list);
+	}
+	spin_unlock_irqrestore(&ctor->read_lock, flags);
+	if (!info)
+		return NULL;
+
+	for (i = 0; i < info->pnum; i++) {
+		get_page(info->pages[i]);
+		info->frag[i].page = info->pages[i];
+		info->frag[i].page_offset = i ? 0 : info->offset;
+		info->frag[i].size = port->npages > 1 ? PAGE_SIZE :
+			port->data_len;
+	}
+	info->skb = skb;
+	info->user.frags = info->frag;
+	info->user.ushinfo = &info->ushinfo;
+	return &info->user;
+}
+
+static void mp_ki_dtor(struct kiocb *iocb)
+{
+	struct page_info *info = (struct page_info *)(iocb->private);
+	int i;
+
+	for (i = 0; i < info->pnum; i++) {
+		if (info->pages[i])
+			put_page(info->pages[i]);
+	}
+
+	if (info->flags == INFO_READ) {
+		skb_shinfo(info->skb)->destructor_arg = &info->user;
+		info->skb->destructor = NULL;
+		kfree_skb(info->skb);
+	}
+
+	kmem_cache_free(info->ctor->cache, info);
+
+	return;
+}
+
+static struct kiocb *create_iocb(struct page_info *info, int size)
+{
+	struct kiocb *iocb = NULL;
+
+	iocb = info->iocb;
+	if (!iocb)
+		return iocb;
+	iocb->ki_flags = 0;
+	iocb->ki_users = 1;
+	iocb->ki_key = 0;
+	iocb->ki_ctx = NULL;
+	iocb->ki_cancel = NULL;
+	iocb->ki_retry = NULL;
+	iocb->ki_iovec = NULL;
+	iocb->ki_eventfd = NULL;
+	iocb->private = (void *)info;
+	iocb->ki_pos = info->desc_pos;
+	iocb->ki_nbytes = size;
+	iocb->ki_user_data = info->log;
+	iocb->ki_dtor = mp_ki_dtor;
+	return iocb;
+}
+
+/* A helper to clean the skb before the kfree_skb() */
+
+static void page_dtor_prepare(struct page_info *info)
+{
+	if (info->flags == INFO_READ)
+		if (info->skb)
+			info->skb->head = NULL;
+}
+
+/* The callback to destruct the user space buffers or skb */
+static void page_dtor(struct skb_user_page *user)
+{
+	struct page_info *info;
+	struct page_ctor *ctor;
+	struct sock *sk;
+	struct sk_buff *skb;
+	struct kiocb *iocb = NULL;
+	struct vhost_virtqueue *vq = NULL;
+	unsigned long flags;
+	int i;
+
+	if (!user)
+		return;
+	info = container_of(user, struct page_info, user);
+	if (!info)
+		return;
+	ctor = info->ctor;
+	skb = info->skb;
+
+	page_dtor_prepare(info);
+
+	/* If info->total is 0, queue the info for reuse */
+	if (!info->total) {
+		spin_lock_irqsave(&ctor->read_lock, flags);
+		list_add(&info->list, &ctor->readq);
+		spin_unlock_irqrestore(&ctor->read_lock, flags);
+		return;
+	}
+
+	if (info->flags == INFO_READ)
+		return;
+
+	/* For transmit, we should wait for the hardware to finish the DMA.
+	 * Queue the notifier to wake up the backend driver
+	 */
+	vq = (struct vhost_virtqueue *)info->ctrl;
+	iocb = create_iocb(info, info->total);
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	list_add_tail(&iocb->ki_list, &vq->notifier);
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+
+	sk = ctor->port.sock->sk;
+	sk->sk_write_space(sk);
+
+	return;
+}
+
+static int page_ctor_attach(struct mp_struct *mp)
+{
+	int rc;
+	struct page_ctor *ctor;
+	struct net_device *dev = mp->dev;
+
+	/* locked by mp_mutex */
+	if (rcu_dereference(mp->ctor))
+		return -EBUSY;
+
+	ctor = kzalloc(sizeof(*ctor), GFP_KERNEL);
+	if (!ctor)
+		return -ENOMEM;
+	rc = netdev_mp_port_prep(dev, &ctor->port);
+	if (rc)
+		goto fail;
+
+	ctor->cache = kmem_cache_create("skb_page_info",
+			sizeof(struct page_info), 0,
+			SLAB_HWCACHE_ALIGN, NULL);
+
+	if (!ctor->cache)
+		goto cache_fail;
+
+	INIT_LIST_HEAD(&ctor->readq);
+	spin_lock_init(&ctor->read_lock);
+
+	ctor->w_len = 0;
+	ctor->r_len = 0;
+
+	dev_hold(dev);
+	ctor->dev = dev;
+	ctor->port.ctor = page_ctor;
+	ctor->port.sock = &mp->socket;
+
+	rc = netdev_mp_port_attach(dev, &ctor->port);
+	if (rc)
+		goto fail;
+
+	/* locked by mp_mutex */
+	rcu_assign_pointer(mp->ctor, ctor);
+
+	/* XXX:Need we do set_offload here ? */
+
+	return 0;
+
+fail:
+	kmem_cache_destroy(ctor->cache);
+cache_fail:
+	kfree(ctor);
+	dev_put(dev);
+
+	return rc;
+}
+
+struct page_info *info_dequeue(struct page_ctor *ctor)
+{
+	unsigned long flags;
+	struct page_info *info = NULL;
+	spin_lock_irqsave(&ctor->read_lock, flags);
+	if (!list_empty(&ctor->readq)) {
+		info = list_first_entry(&ctor->readq,
+				struct page_info, list);
+		list_del(&info->list);
+	}
+	spin_unlock_irqrestore(&ctor->read_lock, flags);
+	return info;
+}
+
+static int page_ctor_detach(struct mp_struct *mp)
+{
+	struct page_ctor *ctor;
+	struct page_info *info;
+	struct vhost_virtqueue *vq = NULL;
+	struct kiocb *iocb = NULL;
+	int i;
+	unsigned long flags;
+
+	/* locked by mp_mutex */
+	ctor = rcu_dereference(mp->ctor);
+	if (!ctor)
+		return -ENODEV;
+
+	while ((info = info_dequeue(ctor))) {
+		for (i = 0; i < info->pnum; i++)
+			if (info->pages[i])
+				put_page(info->pages[i]);
+		vq = (struct vhost_virtqueue *)(info->ctrl);
+		iocb = create_iocb(info, 0);
+
+		spin_lock_irqsave(&vq->notify_lock, flags);
+		list_add_tail(&iocb->ki_list, &vq->notifier);
+		spin_unlock_irqrestore(&vq->notify_lock, flags);
+
+		kmem_cache_free(ctor->cache, info);
+	}
+	kmem_cache_destroy(ctor->cache);
+	netdev_mp_port_detach(ctor->dev);
+	dev_put(ctor->dev);
+
+	/* locked by mp_mutex */
+	rcu_assign_pointer(mp->ctor, NULL);
+	synchronize_rcu();
+
+	kfree(ctor);
+	return 0;
+}
+
+/* When transmitting small user space buffers, we don't need to call
+ * get_user_pages().
+ */
+static struct page_info *alloc_small_page_info(struct page_ctor *ctor,
+						struct kiocb *iocb, int total)
+{
+	struct page_info *info = kmem_cache_zalloc(ctor->cache, GFP_KERNEL);
+
+	if (!info)
+		return NULL;
+	info->total = total;
+	info->user.dtor = page_dtor;
+	info->ctor = ctor;
+	info->flags = INFO_WRITE;
+	info->iocb = iocb;
+	return info;
+}
+
+/* The main function to transform the guest user space address
+ * to host kernel address via get_user_pages(). Thus the hardware
+ * can do DMA directly to the user space address.
+ */
+static struct page_info *alloc_page_info(struct page_ctor *ctor,
+					struct kiocb *iocb, struct iovec *iov,
+					int count, struct frag *frags,
+					int npages, int total)
+{
+	int rc;
+	int i, j, n = 0;
+	int len;
+	unsigned long base;
+	struct page_info *info = kmem_cache_zalloc(ctor->cache, GFP_KERNEL);
+
+	if (!info)
+		return NULL;
+
+	for (i = j = 0; i < count; i++) {
+		base = (unsigned long)iov[i].iov_base;
+		len = iov[i].iov_len;
+
+		if (!len)
+			continue;
+		n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+
+		rc = get_user_pages_fast(base, n, npages ? 1 : 0,
+						&info->pages[j]);
+		if (rc != n)
+			goto failed;
+
+		while (n--) {
+			frags[j].offset = base & ~PAGE_MASK;
+			frags[j].size = min_t(int, len,
+					PAGE_SIZE - frags[j].offset);
+			len -= frags[j].size;
+			base += frags[j].size;
+			j++;
+		}
+	}
+
+#ifdef CONFIG_HIGHMEM
+	if (npages && !(dev->features & NETIF_F_HIGHDMA)) {
+		for (i = 0; i < j; i++) {
+			if (PageHighMem(info->pages[i]))
+				goto failed;
+		}
+	}
+#endif
+
+	info->total = total;
+	info->user.dtor = page_dtor;
+	info->ctor = ctor;
+	info->pnum = j;
+	info->iocb = iocb;
+	if (!npages)
+		info->flags = INFO_WRITE;
+	if (info->flags == INFO_READ) {
+		info->user.start = (u8 *)(((unsigned long)
+				(pfn_to_kaddr(page_to_pfn(info->pages[0]))) +
+				frags[0].offset) - NET_IP_ALIGN - NET_SKB_PAD);
+		info->user.size = iov[0].iov_len + NET_IP_ALIGN + NET_SKB_PAD;
+		for (i = 0; i < j; i++)
+			set_page_dirty_lock(info->pages[i]);
+	}
+	return info;
+
+failed:
+	for (i = 0; i < j; i++)
+		put_page(info->pages[i]);
+
+	kmem_cache_free(ctor->cache, info);
+
+	return NULL;
+}
+
+static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
+			struct msghdr *m, size_t total_len)
+{
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct page_ctor *ctor;
+	struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(iocb->private);
+	struct iovec *iov = m->msg_iov;
+	struct page_info *info = NULL;
+	struct frag frags[MAX_SKB_FRAGS];
+	struct sk_buff *skb;
+	int count = m->msg_iovlen;
+	int total = 0, header, n, i, len, rc;
+	unsigned long base;
+
+	ctor = rcu_dereference(mp->ctor);
+	if (!ctor)
+		return -ENODEV;
+
+	total = iov_length(iov, count);
+
+	if (total < ETH_HLEN)
+		return -EINVAL;
+
+	if (total <= COPY_THRESHOLD)
+		goto copy;
+
+	n = 0;
+	for (i = 0; i < count; i++) {
+		base = (unsigned long)iov[i].iov_base;
+		len = iov[i].iov_len;
+		if (!len)
+			continue;
+		n += ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+		if (n > MAX_SKB_FRAGS)
+			return -EINVAL;
+	}
+
+copy:
+	header = total > COPY_THRESHOLD ? COPY_HDR_LEN : total;
+
+	skb = alloc_skb(header + NET_IP_ALIGN, GFP_ATOMIC);
+	if (!skb)
+		goto drop;
+
+	skb_reserve(skb, NET_IP_ALIGN);
+
+	skb_set_network_header(skb, ETH_HLEN);
+
+	memcpy_fromiovec(skb->data, iov, header);
+	skb_put(skb, header);
+	skb->protocol = *((__be16 *)(skb->data) + ETH_ALEN);
+
+	if (header == total) {
+		rc = total;
+		info = alloc_small_page_info(ctor, iocb, total);
+	} else {
+		info = alloc_page_info(ctor, iocb, iov, count, frags, 0, total);
+		if (info)
+			for (i = 0; info->pages[i]; i++) {
+				skb_add_rx_frag(skb, i, info->pages[i],
+						frags[i].offset, frags[i].size);
+				info->pages[i] = NULL;
+			}
+	}
+	if (info != NULL) {
+		info->desc_pos = iocb->ki_pos;
+		info->ctrl = vq;
+		info->total = total;
+		info->skb = skb;
+		skb_shinfo(skb)->destructor_arg = &info->user;
+		skb->dev = mp->dev;
+		dev_queue_xmit(skb);
+		mp->dev->stats.tx_packets++;
+		mp->dev->stats.tx_bytes += total;
+		return 0;
+	}
+drop:
+	kfree_skb(skb);
+	if (info) {
+		for (i = 0; info->pages[i]; i++)
+			put_page(info->pages[i]);
+		kmem_cache_free(info->ctor->cache, info);
+	}
+	mp->dev->stats.tx_dropped++;
+	return -ENOMEM;
+}
+
+
+static void mp_recvmsg_notify(struct vhost_virtqueue *vq)
+{
+	struct socket *sock = vq->private_data;
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct page_ctor *ctor = NULL;
+	struct sk_buff *skb = NULL;
+	struct page_info *info = NULL;
+	struct ethhdr *eth;
+	struct kiocb *iocb = NULL;
+	int len, i;
+	unsigned long flags;
+
+	struct virtio_net_hdr hdr = {
+		.flags = 0,
+		.gso_type = VIRTIO_NET_HDR_GSO_NONE
+	};
+
+	ctor = rcu_dereference(mp->ctor);
+	if (!ctor)
+		return;
+
+	while ((skb = skb_dequeue(&sock->sk->sk_receive_queue)) != NULL) {
+		if (skb_shinfo(skb)->destructor_arg) {
+			info = container_of(skb_shinfo(skb)->destructor_arg,
+					struct page_info, user);
+			info->skb = skb;
+			if (skb->len > info->len) {
+				mp->dev->stats.rx_dropped++;
+				DBG(KERN_INFO "Discarded truncated rx packet: "
+					" len %d > %zd\n", skb->len, info->len);
+				info->total = skb->len;
+				goto clean;
+			} else {
+				int i;
+				struct skb_shared_info *gshinfo =
+				(struct skb_shared_info *)(&info->ushinfo);
+				struct skb_shared_info *hshinfo =
+						skb_shinfo(skb);
+
+				if (gshinfo->nr_frags < hshinfo->nr_frags)
+					goto clean;
+				eth = eth_hdr(skb);
+				skb_push(skb, ETH_HLEN);
+
+				hdr.hdr_len = skb_headlen(skb);
+				info->total = skb->len;
+
+				for (i = 0; i < gshinfo->nr_frags; i++)
+					gshinfo->frags[i].size = 0;
+				for (i = 0; i < hshinfo->nr_frags; i++)
+					gshinfo->frags[i].size =
+						hshinfo->frags[i].size;
+				memcpy(skb_shinfo(skb), &info->ushinfo,
+						sizeof(struct skb_shared_info));
+			}
+		} else {
+			/* The skb is composed of kernel buffers
+			 * in case user space buffers are not sufficient.
+			 * This case should be rare.
+			 */
+			unsigned long flags;
+			int i;
+			struct skb_shared_info *gshinfo = NULL;
+
+			info = NULL;
+
+			spin_lock_irqsave(&ctor->read_lock, flags);
+			if (!list_empty(&ctor->readq)) {
+				info = list_first_entry(&ctor->readq,
+						struct page_info, list);
+				list_del(&info->list);
+			}
+			spin_unlock_irqrestore(&ctor->read_lock, flags);
+			if (!info) {
+				DBG(KERN_INFO "No user buffer available %p\n",
+									skb);
+				skb_queue_head(&sock->sk->sk_receive_queue,
+									skb);
+				break;
+			}
+			info->skb = skb;
+			/* compute the guest skb frags info */
+			gshinfo = (struct skb_shared_info *)(info->user.start +
+					SKB_DATA_ALIGN(info->user.size));
+
+			if (gshinfo->nr_frags < skb_shinfo(skb)->nr_frags)
+				goto clean;
+
+			eth = eth_hdr(skb);
+			skb_push(skb, ETH_HLEN);
+			info->total = skb->len;
+
+			for (i = 0; i < gshinfo->nr_frags; i++)
+				gshinfo->frags[i].size = 0;
+			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
+				gshinfo->frags[i].size =
+					skb_shinfo(skb)->frags[i].size;
+			hdr.hdr_len = min_t(int, skb->len,
+						info->iov[1].iov_len);
+			skb_copy_datagram_iovec(skb, 0, info->iov, skb->len);
+		}
+
+		len = memcpy_toiovec(info->hdr, (unsigned char *)&hdr,
+								 sizeof hdr);
+		if (len) {
+			DBG(KERN_INFO
+				"Unable to write vnet_hdr at addr %p: %d\n",
+				info->hdr->iov_base, len);
+			goto clean;
+		}
+		iocb = create_iocb(info, skb->len + sizeof(hdr));
+
+		spin_lock_irqsave(&vq->notify_lock, flags);
+		list_add_tail(&iocb->ki_list, &vq->notifier);
+		spin_unlock_irqrestore(&vq->notify_lock, flags);
+		continue;
+
+clean:
+		kfree_skb(skb);
+		for (i = 0; info->pages[i]; i++)
+			put_page(info->pages[i]);
+		kmem_cache_free(ctor->cache, info);
+	}
+	return;
+}
+
+static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
+			struct msghdr *m, size_t total_len,
+			int flags)
+{
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct page_ctor *ctor;
+	struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(iocb->private);
+	struct iovec *iov = m->msg_iov;
+	int count = m->msg_iovlen;
+	int npages, payload;
+	struct page_info *info;
+	struct frag frags[MAX_SKB_FRAGS];
+	unsigned long base;
+	int i, len;
+	unsigned long flag;
+
+	if (!(flags & MSG_DONTWAIT))
+		return -EINVAL;
+
+	ctor = rcu_dereference(mp->ctor);
+	if (!ctor)
+		return -EINVAL;
+
+	/* Error detection in case of an invalid user space buffer */
+	if (count > 2 && iov[1].iov_len < ctor->port.hdr_len &&
+			mp->dev->features & NETIF_F_SG) {
+		return -EINVAL;
+	}
+
+	npages = ctor->port.npages;
+	payload = ctor->port.data_len;
+
+	/* If the KVM guest virtio-net FE driver uses the SG feature */
+	if (count > 2) {
+		for (i = 2; i < count; i++) {
+			base = (unsigned long)iov[i].iov_base & ~PAGE_MASK;
+			len = iov[i].iov_len;
+			if (npages == 1)
+				len = min_t(int, len, PAGE_SIZE - base);
+			else if (base)
+				break;
+			payload -= len;
+			if (payload <= 0)
+				goto proceed;
+			if (npages == 1 || (len & ~PAGE_MASK))
+				break;
+		}
+	}
+
+	if ((((unsigned long)iov[1].iov_base & ~PAGE_MASK)
+				- NET_SKB_PAD - NET_IP_ALIGN) >= 0)
+		goto proceed;
+
+	return -EINVAL;
+
+proceed:
+	/* skip the virtnet header */
+	iov++;
+	count--;
+
+	/* Translate address to kernel */
+	info = alloc_page_info(ctor, iocb, iov, count, frags, npages, 0);
+	if (!info)
+		return -ENOMEM;
+	info->len = total_len;
+	info->hdr[0].iov_base = vq->hdr[0].iov_base;
+	info->hdr[0].iov_len = vq->hdr[0].iov_len;
+	info->offset = frags[0].offset;
+	info->desc_pos = iocb->ki_pos;
+	info->log = iocb->ki_user_data;
+	info->ctrl = vq;
+
+	iov--;
+	count++;
+
+	memcpy(info->iov, vq->iov, sizeof(struct iovec) * count);
+
+	spin_lock_irqsave(&ctor->read_lock, flag);
+	list_add_tail(&info->list, &ctor->readq);
+	spin_unlock_irqrestore(&ctor->read_lock, flag);
+
+	if (!vq->receiver)
+		vq->receiver = mp_recvmsg_notify;
+
+	return 0;
+}
+
+static void __mp_detach(struct mp_struct *mp)
+{
+	mp->mfile = NULL;
+
+	mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
+	page_ctor_detach(mp);
+	mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
+
+	/* Drop the extra count on the net device */
+	dev_put(mp->dev);
+}
+
+static DEFINE_MUTEX(mp_mutex);
+
+static void mp_detach(struct mp_struct *mp)
+{
+	mutex_lock(&mp_mutex);
+	__mp_detach(mp);
+	mutex_unlock(&mp_mutex);
+}
+
+static void mp_put(struct mp_file *mfile)
+{
+	if (atomic_dec_and_test(&mfile->count))
+		mp_detach(mfile->mp);
+}
+
+static int mp_release(struct socket *sock)
+{
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct mp_file *mfile = mp->mfile;
+
+	mp_put(mfile);
+	sock_put(mp->socket.sk);
+	put_net(mfile->net);
+
+	return 0;
+}
+
+/* Ops structure to mimic raw sockets with mp device */
+static const struct proto_ops mp_socket_ops = {
+	.sendmsg = mp_sendmsg,
+	.recvmsg = mp_recvmsg,
+	.release = mp_release,
+};
+
+static struct proto mp_proto = {
+	.name           = "mp",
+	.owner          = THIS_MODULE,
+	.obj_size       = sizeof(struct mp_sock),
+};
+
+static int mp_chr_open(struct inode *inode, struct file * file)
+{
+	struct mp_file *mfile;
+	cycle_kernel_lock();
+	DBG1(KERN_INFO "mp: mp_chr_open\n");
+
+	mfile = kzalloc(sizeof(*mfile), GFP_KERNEL);
+	if (!mfile)
+		return -ENOMEM;
+	atomic_set(&mfile->count, 0);
+	mfile->mp = NULL;
+	mfile->net = get_net(current->nsproxy->net_ns);
+	file->private_data = mfile;
+	return 0;
+}
+
+
+static struct mp_struct *mp_get(struct mp_file *mfile)
+{
+	struct mp_struct *mp = NULL;
+	if (atomic_inc_not_zero(&mfile->count))
+		mp = mfile->mp;
+
+	return mp;
+}
+
+
+static int mp_attach(struct mp_struct *mp, struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+	int err;
+
+	netif_tx_lock_bh(mp->dev);
+
+	err = -EINVAL;
+
+	if (mfile->mp)
+		goto out;
+
+	err = -EBUSY;
+	if (mp->mfile)
+		goto out;
+
+	err = 0;
+	mfile->mp = mp;
+	mp->mfile = mfile;
+	mp->socket.file = file;
+	dev_hold(mp->dev);
+	sock_hold(mp->socket.sk);
+	atomic_inc(&mfile->count);
+
+out:
+	netif_tx_unlock_bh(mp->dev);
+	return err;
+}
+
+static void mp_sock_destruct(struct sock *sk)
+{
+	struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+	kfree(mp);
+}
+
+static int do_unbind(struct mp_file *mfile)
+{
+	struct mp_struct *mp = mp_get(mfile);
+
+	if (!mp)
+		return -EINVAL;
+
+	mp_detach(mp);
+	sock_put(mp->socket.sk);
+	mp_put(mfile);
+	return 0;
+}
+
+static void mp_sock_data_ready(struct sock *sk, int len)
+{
+	if (sk_has_sleeper(sk))
+		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN);
+}
+
+static void mp_sock_write_space(struct sock *sk)
+{
+	if (sk_has_sleeper(sk))
+		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT);
+}
+
+static long mp_chr_ioctl(struct file *file, unsigned int cmd,
+		unsigned long arg)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp;
+	struct net_device *dev;
+	void __user* argp = (void __user *)arg;
+	struct ifreq ifr;
+	struct sock *sk;
+	int ret;
+
+	ret = -EINVAL;
+
+	switch (cmd) {
+	case MPASSTHRU_BINDDEV:
+		ret = -EFAULT;
+		if (copy_from_user(&ifr, argp, sizeof ifr))
+			break;
+
+		ifr.ifr_name[IFNAMSIZ-1] = '\0';
+
+		ret = -EBUSY;
+
+		if (ifr.ifr_flags & IFF_MPASSTHRU_EXCL)
+			break;
+
+		ret = -ENODEV;
+		dev = dev_get_by_name(mfile->net, ifr.ifr_name);
+		if (!dev)
+			break;
+
+		mutex_lock(&mp_mutex);
+
+		ret = -EBUSY;
+		mp = mfile->mp;
+		if (mp)
+			goto err_dev_put;
+
+		mp = kzalloc(sizeof(*mp), GFP_KERNEL);
+		if (!mp) {
+			ret = -ENOMEM;
+			goto err_dev_put;
+		}
+		mp->dev = dev;
+		ret = -ENOMEM;
+
+		sk = sk_alloc(mfile->net, AF_UNSPEC, GFP_KERNEL, &mp_proto);
+		if (!sk)
+			goto err_free_mp;
+
+		init_waitqueue_head(&mp->socket.wait);
+		mp->socket.ops = &mp_socket_ops;
+		sock_init_data(&mp->socket, sk);
+		sk->sk_sndbuf = INT_MAX;
+		container_of(sk, struct mp_sock, sk)->mp = mp;
+
+		sk->sk_destruct = mp_sock_destruct;
+		sk->sk_data_ready = mp_sock_data_ready;
+		sk->sk_write_space = mp_sock_write_space;
+
+		ret = mp_attach(mp, file);
+		if (ret < 0)
+			goto err_free_sk;
+
+		ret = page_ctor_attach(mp);
+		if (ret < 0)
+			goto err_free_sk;
+
+		ifr.ifr_flags |= IFF_MPASSTHRU_EXCL;
+		mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
+out:
+		mutex_unlock(&mp_mutex);
+		break;
+err_free_sk:
+		sk_free(sk);
+err_free_mp:
+		kfree(mp);
+err_dev_put:
+		dev_put(dev);
+		goto out;
+
+	case MPASSTHRU_UNBINDDEV:
+		ret = do_unbind(mfile);
+		break;
+
+	default:
+		break;
+	}
+	return ret;
+}
+
+static unsigned int mp_chr_poll(struct file *file, poll_table * wait)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp = mp_get(mfile);
+	struct sock *sk;
+	unsigned int mask = 0;
+
+	if (!mp)
+		return POLLERR;
+
+	sk = mp->socket.sk;
+
+	poll_wait(file, &mp->socket.wait, wait);
+
+	if (!skb_queue_empty(&sk->sk_receive_queue))
+		mask |= POLLIN | POLLRDNORM;
+
+	if (sock_writeable(sk) ||
+		(!test_and_set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
+			 sock_writeable(sk)))
+		mask |= POLLOUT | POLLWRNORM;
+
+	if (mp->dev->reg_state != NETREG_REGISTERED)
+		mask = POLLERR;
+
+	mp_put(mfile);
+	return mask;
+}
+
+static int mp_chr_close(struct inode *inode, struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+
+	/*
+	 * Ignore return value since an error only means there was nothing to
+	 * do
+	 */
+	do_unbind(mfile);
+
+	put_net(mfile->net);
+	kfree(mfile);
+
+	return 0;
+}
+
+static const struct file_operations mp_fops = {
+	.owner  = THIS_MODULE,
+	.llseek = no_llseek,
+	.poll   = mp_chr_poll,
+	.unlocked_ioctl = mp_chr_ioctl,
+	.open   = mp_chr_open,
+	.release = mp_chr_close,
+};
+
+static struct miscdevice mp_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "mp",
+	.nodename = "net/mp",
+	.fops = &mp_fops,
+};
+
+static int mp_device_event(struct notifier_block *unused,
+		unsigned long event, void *ptr)
+{
+	struct net_device *dev = ptr;
+	struct mpassthru_port *port;
+	struct mp_struct *mp = NULL;
+	struct socket *sock = NULL;
+
+	port = dev->mp_port;
+	if (port == NULL)
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_UNREGISTER:
+			sock = dev->mp_port->sock;
+			mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+			do_unbind(mp->mfile);
+			break;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block mp_notifier_block __read_mostly = {
+	.notifier_call  = mp_device_event,
+};
+
+static int mp_init(void)
+{
+	int ret = 0;
+
+	ret = misc_register(&mp_miscdev);
+	if (ret)
+		printk(KERN_ERR "mp: Can't register misc device\n");
+	else {
+		printk(KERN_INFO "Registering mp misc device - minor = %d\n",
+			mp_miscdev.minor);
+		register_netdevice_notifier(&mp_notifier_block);
+	}
+	return ret;
+}
+
+void mp_cleanup(void)
+{
+	unregister_netdevice_notifier(&mp_notifier_block);
+	misc_deregister(&mp_miscdev);
+}
+
+/* Get an underlying socket object from mp file.  Returns error unless file is
+ * attached to a device.  The returned object works like a packet socket, it
+ * can be used for sock_sendmsg/sock_recvmsg.  The caller is responsible for
+ * holding a reference to the file for as long as the socket is in use. */
+struct socket *mp_get_socket(struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp;
+
+	if (file->f_op != &mp_fops)
+		return ERR_PTR(-EINVAL);
+	mp = mp_get(mfile);
+	if (!mp)
+		return ERR_PTR(-EBADFD);
+	mp_put(mfile);
+	return &mp->socket;
+}
+EXPORT_SYMBOL_GPL(mp_get_socket);
+
+module_init(mp_init);
+module_exit(mp_cleanup);
+MODULE_AUTHOR(DRV_COPYRIGHT);
+MODULE_DESCRIPTION(DRV_DESCRIPTION);
+MODULE_LICENSE("GPL v2");
diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
new file mode 100644
index 0000000..2be21c5
--- /dev/null
+++ b/include/linux/mpassthru.h
@@ -0,0 +1,29 @@
+#ifndef __MPASSTHRU_H
+#define __MPASSTHRU_H
+
+#include <linux/types.h>
+#include <linux/if_ether.h>
+
+/* ioctl defines */
+#define MPASSTHRU_BINDDEV      _IOW('M', 213, int)
+#define MPASSTHRU_UNBINDDEV    _IOW('M', 214, int)
+
+/* MPASSTHRU ifc flags */
+#define IFF_MPASSTHRU		0x0001
+#define IFF_MPASSTHRU_EXCL	0x0002
+
+#ifdef __KERNEL__
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+struct socket *mp_get_socket(struct file *);
+#else
+#include <linux/err.h>
+#include <linux/errno.h>
+struct file;
+struct socket;
+static inline struct socket *mp_get_socket(struct file *f)
+{
+	return ERR_PTR(-EINVAL);
+}
+#endif /* CONFIG_VHOST_PASSTHRU */
+#endif /* __KERNEL__ */
+#endif /* __MPASSTHRU_H */
-- 
1.5.4.4
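
A note on the page arithmetic used in alloc_page_info() and mp_sendmsg()
above: the expression ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT
counts how many pages an iovec entry touches, and the inner while loop then
clips each fragment at the page boundary. The user-space sketch below is
only an illustration of that arithmetic, assuming 4 KiB pages. The
SKETCH_* macros and the helpers pages_spanned() and first_frag_size() are
hypothetical names that do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 KiB pages. The kernel takes these from <asm/page.h>. */
#define SKETCH_PAGE_SHIFT 12UL
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK  (~(SKETCH_PAGE_SIZE - 1))

/* Number of pages touched by the byte range [base, base + len).
 * Same expression as in the patch: offset within the first page, plus
 * the length, rounded up to a whole number of pages.
 */
static unsigned long pages_spanned(unsigned long base, unsigned long len)
{
	if (!len)
		return 0;	/* the patch skips empty iovec entries */
	return ((base & ~SKETCH_PAGE_MASK) + len + ~SKETCH_PAGE_MASK)
		>> SKETCH_PAGE_SHIFT;
}

/* Size of the first fragment: whatever fits between the start offset
 * and the end of that page, mirroring the min_t() in the patch.
 */
static unsigned long first_frag_size(unsigned long base, unsigned long len)
{
	unsigned long offset = base & ~SKETCH_PAGE_MASK;
	unsigned long room = SKETCH_PAGE_SIZE - offset;

	return len < room ? len : room;
}
```

For example, a 32-byte buffer starting 16 bytes before a page boundary
(base 0x1ff0) spans two pages, and its first fragment is 16 bytes long.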


^ permalink raw reply related

* Re: [r8169] WARNING: at net/sched/sch_generic.c
From: Sergey Senozhatsky @ 2010-04-01  9:29 UTC (permalink / raw)
  To: François Romieu
  Cc: Neil Horman, Eric Dumazet, netdev, David S. Miller, linux-kernel
In-Reply-To: <20100331201426.GA3228@electric-eye.fr.zoreil.com>

[-- Attachment #1: Type: text/plain, Size: 554 bytes --]

Hello,

On (03/31/10 22:14), François Romieu wrote:
> -1
> 
> The driver does not support Jumbo frames because the original 8169 will not
> go much beyond 7200 (see r8169.c::SafeMtu and netdev circa 2004 december 7).
> 
> /me checks... Apparently it still works a bit.
> 

> Sergey, can you 'dmesg | grep XID' and send the output of a lspci as
> well as the MTU used during the test ?
> 

Sure. 

dmesg | grep XID
[   11.761633] r8169 0000:02:00.0: eth0: RTL8168b/8111b at 0xfd1da000, 00:1a:92:c9:a0:68, XID 18000000 IRQ 28


	Sergey

[-- Attachment #2: Type: application/pgp-signature, Size: 316 bytes --]

^ permalink raw reply

* Re: [PATCH 1/5] netfilter: ipv6: move POSTROUTING invocation before fragmentation
From: Patrick McHardy @ 2010-04-01 10:23 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: netfilter-devel, netdev
In-Reply-To: <1270031934-15940-2-git-send-email-jengelh@medozas.de>

Jan Engelhardt wrote:
> Patrick McHardy notes: "We used to invoke IPv4 POST_ROUTING after
> fragmentation as well just to defragment the packets in conntrack
> immediately afterwards, but that got changed during the
> netfilter-ipsec integration. Ideally IPv6 would behave like IPv4."
> 
> This patch makes it so. Sending an oversized frame (e.g. `ping6
> -s64000 -c1 ::1`) will now show up in POSTROUTING as a single skb
> rather than multiple ones.

Looks good to me. I'll wait until next week in case anyone
else has comments on this patch.


^ permalink raw reply

* Re: [PATCH 3/5] netfilter: xtables: inclusion of xt_TEE
From: Patrick McHardy @ 2010-04-01 10:34 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: netfilter-devel, netdev
In-Reply-To: <1270031934-15940-4-git-send-email-jengelh@medozas.de>

Jan Engelhardt wrote:
> +static bool
> +tee_tg_route4(struct sk_buff *skb, const struct xt_tee_tginfo *info)
> +{
> +	const struct iphdr *iph = ip_hdr(skb);
> +	struct rtable *rt;
> +	struct flowi fl;
> +	int err;
> +
> +	memset(&fl, 0, sizeof(fl));
> +	fl.iif  = skb->skb_iif;

I'm not sure you really should set iif here. We usually (tunnels, REJECT
etc) treat packets generated locally as new packets.

> +	fl.mark = skb->mark;

The same applies to mark.

> +	fl.nl_u.ip4_u.daddr = info->gw.ip;
> +	fl.nl_u.ip4_u.tos   = RT_TOS(iph->tos);
> +	fl.nl_u.ip4_u.scope = RT_SCOPE_UNIVERSE;
> +
> +	/* Trying to route the packet using the standard routing table. */
> +	err = ip_route_output_key(dev_net(skb->dev), &rt, &fl);
> +	if (err != 0)
> +		return false;
> +
> +	dst_release(skb_dst(skb));
> +	skb_dst_set(skb, &rt->u.dst);
> +	skb->dev      = rt->u.dst.dev;
> +	skb->protocol = htons(ETH_P_IP);
> +	IPCB(skb)->flags |= IPSKB_REROUTED;
> +	return true;
> +}
> +
> +/*
> + * To detect and deter routed packet loopback when using the --tee option, we
> + * take a page out of the raw.patch book: on the copied skb, we set up a fake
> + * ->nfct entry, pointing to the local &route_tee_track. We skip routing
> + * packets when we see they already have that ->nfct.

So without conntrack, people may create loops? If that's the case,
I'd suggest to simply forbid TEE'ing packets to loopback. That
doesn't seem to be very useful anyways.

> + */
> +static unsigned int
> +tee_tg4(struct sk_buff *skb, const struct xt_target_param *par)
> +{
> +	const struct xt_tee_tginfo *info = par->targinfo;
> +	struct iphdr *iph;
> +
> +#ifdef WITH_CONNTRACK
> +	if (skb->nfct == &tee_track.ct_general)
> +		/*
> +		 * Loopback - a packet we already routed, is to be
> +		 * routed another time. Avoid that, now.
> +		 */
> +		return NF_DROP;
> +#endif
> +	/*
> +	 * Copy the skb, and route the copy. Will later return %XT_CONTINUE for
> +	 * the original skb, which should continue on its way as if nothing has
> +	 * happened. The copy should be independently delivered to the TEE
> +	 * --gateway.
> +	 */
> +	skb = skb_copy(skb, GFP_ATOMIC);
> +	if (skb == NULL)
> +		return XT_CONTINUE;
> +	/*
> +	 * If we are in PREROUTING/INPUT, the checksum must be recalculated
> +	 * since the length could have changed as a result of defragmentation.
> +	 *
> +	 * We also decrease the TTL to mitigate potential TEE loops
> +	 * between two hosts.
> +	 *
> +	 * Set %IP_DF so that the original source is notified of a potentially
> +	 * decreased MTU on the clone route. IPv6 does this too.
> +	 */
> +	iph = ip_hdr(skb);
> +	iph->frag_off |= htons(IP_DF);
> +	if (par->hooknum == NF_INET_PRE_ROUTING ||
> +	    par->hooknum == NF_INET_LOCAL_IN)
> +		--iph->ttl;
> +	ip_send_check(iph);

Shouldn't this only be done in PRE_ROUTING/INPUT as stated above?

> +
> +#ifdef WITH_CONNTRACK
> +	nf_conntrack_put(skb->nfct);
> +	skb->nfct     = &tee_track.ct_general;
> +	skb->nfctinfo = IP_CT_NEW;
> +	nf_conntrack_get(skb->nfct);
> +#endif
> +	/*
> +	 * Xtables is not reentrant currently, so a choice has to be made:
> +	 * 1. return absolute verdict for the original and let the cloned
> +	 *    packet travel through the chains
> +	 * 2. let the original continue travelling and not pass the clone
> +	 *    to Xtables.
> +	 * #2 is chosen. Normally, we would use ip_local_out for the clone.
> +	 * Because iph->check is already correct and we don't pass it to
> +	 * Xtables anyway, a shortcut to dst_output [forwards to ip_output] can
> +	 * be taken. %IPSKB_REROUTED needs to be set so that ip_output does not
> +	 * invoke POSTROUTING on the cloned packet.
> +	 */
> +	IPCB(skb)->flags |= IPSKB_REROUTED;
> +	if (tee_tg_route4(skb, info))
> +		ip_output(skb);
> +
> +	return XT_CONTINUE;
> +}
> +

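For context on the ip_send_check() question above: changing any field of
the IPv4 header, the TTL included, invalidates the header checksum, so it
has to be recomputed before the copy is sent. The user-space sketch below
illustrates this with the one's-complement sum from RFC 1071. It is not the
kernel's ip_send_check() implementation, and ipv4_header_checksum() and
decrement_ttl_and_recheck() are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement checksum over an IPv4 header given in network byte
 * order. The checksum field itself (bytes 10 and 11) is treated as zero.
 * The result is returned as a host-order number.
 */
static uint16_t ipv4_header_checksum(const uint8_t *hdr, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		if (i == 10)
			continue;	/* skip the checksum field */
		sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
	}
	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Decrement the TTL (byte 8) and store a fresh checksum in the header. */
static uint16_t decrement_ttl_and_recheck(uint8_t *hdr, size_t len)
{
	uint16_t csum;

	hdr[8]--;
	csum = ipv4_header_checksum(hdr, len);
	hdr[10] = (uint8_t)(csum >> 8);
	hdr[11] = (uint8_t)(csum & 0xff);
	return csum;
}
```

With a 20-byte header whose TTL is 0x40 and whose checksum is 0xb861, the
decrement to 0x3f changes the checksum to 0xb961, which is why the old
value cannot simply be left in place.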
^ permalink raw reply

