Netdev List
 help / color / mirror / Atom feed
* Re: why are IPv6 addresses removed on link down
From: YOSHIFUJI Hideaki @ 2015-01-13 11:58 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Stephen Hemminger
  Cc: hideaki.yoshifuji, David Ahern, netdev@vger.kernel.org
In-Reply-To: <1421145346.13626.12.camel@redhat.com>

Hi,

Hannes Frederic Sowa wrote:
> On Mo, 2015-01-12 at 23:10 -0800, Stephen Hemminger wrote:
>> On Mon, 12 Jan 2015 22:06:44 -0700
>> David Ahern <dsahern@gmail.com> wrote:
>>
>>> We noticed that IPv6 addresses are removed on a link down. e.g.,
>>>     ip link set dev eth1
>>>
>>>
>>> Looking at the code it appears to be this code path in addrconf.c:
>>>
>>>           case NETDEV_DOWN:
>>>           case NETDEV_UNREGISTER:
>>>                   /*
>>>                    *      Remove all addresses from this interface.
>>>                    */
>>>                   addrconf_ifdown(dev, event != NETDEV_DOWN);
>>>                   break;
>>>
>>> IPv4 addresses are NOT removed on a link down. Is there a particular
>>> reason IPv6 addresses are?
>>>
>>> Thanks,
>>> David
>>
>> See RFC's which describes how IPv6 does Duplicate Address Detection.
>> Address is not valid when link is down, since DAD is not possible.
>
> It should be no problem if the kernel would reacquire them on ifup and
> do proper DAD. We simply must not use them while the interface is dead
> (also making sure they don't get used for loopback routing).
>
> The problem the IPv6 addresses get removed is much more a historical
> artifact nowadays, I think. It is part of user space API and scripts
> deal with that already.

We might have another "detached" state which essintially drops
outgoing packets while link is down.  Just after recovering link,
we could start receiving packet from the link and perform optimistic
DAD. And then, after it succeeds, we may start sending packets.

Since "detached" state is like the state just before completing
Optimistic DAD, it is not so difficult to implement this extended
behavior, I guess.

-- 
Hideaki Yoshifuji <hideaki.yoshifuji@miraclelinux.com>
Technical Division, MIRACLE LINUX CORPORATION

^ permalink raw reply

* Re: [RFC PATCH v2 2/2] net: ixgbe: implement af_packet direct queue mappings
From: Hannes Frederic Sowa @ 2015-01-13 12:05 UTC (permalink / raw)
  To: John Fastabend
  Cc: netdev, danny.zhou, nhorman, dborkman, john.ronciak, brouer
In-Reply-To: <20150113043542.29985.15658.stgit@nitbit.x32>

On Mo, 2015-01-12 at 20:35 -0800, John Fastabend wrote:
> +static int
> +ixgbe_ndo_qpair_page_map(struct vm_area_struct *vma, struct net_device *dev)
> +{
> +	struct ixgbe_adapter *adapter = netdev_priv(dev);
> +	phys_addr_t phy_addr = pci_resource_start(adapter->pdev, 0);
> +	unsigned long pfn_rx = (phy_addr + RX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
> +	unsigned long pfn_tx = (phy_addr + TX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
> +	unsigned long dummy_page_phy;
> +	pgprot_t pre_vm_page_prot;
> +	unsigned long start;
> +	unsigned int i;
> +	int err;
> +
> +	if (!dummy_page_buf) {
> +		dummy_page_buf = kzalloc(PAGE_SIZE_4K, GFP_KERNEL);
> +		if (!dummy_page_buf)
> +			return -ENOMEM;
> +
> +		for (i = 0; i < PAGE_SIZE_4K / sizeof(unsigned int); i++)
> +			dummy_page_buf[i] = 0xdeadbeef;
> +	}
> +
> +	dummy_page_phy = virt_to_phys(dummy_page_buf);
> +	pre_vm_page_prot = vma->vm_page_prot;
> +	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +
> +	/* assume the vm_start is 4K aligned address */
> +	for (start = vma->vm_start;
> +	     start < vma->vm_end;
> +	     start += PAGE_SIZE_4K) {
> +		if (start == vma->vm_start + RX_DESC_ADDR_OFFSET) {
> +			err = remap_pfn_range(vma, start, pfn_rx, PAGE_SIZE_4K,
> +					      vma->vm_page_prot);
> +			if (err)
> +				return -EAGAIN;
> +		} else if (start == vma->vm_start + TX_DESC_ADDR_OFFSET) {
> +			err = remap_pfn_range(vma, start, pfn_tx, PAGE_SIZE_4K,
> +					      vma->vm_page_prot);
> +			if (err)
> +				return -EAGAIN;
> +		} else {
> +			unsigned long addr = dummy_page_phy > PAGE_SHIFT;

I guess you have forgotten to delete this line?

> +
> +			err = remap_pfn_range(vma, start, addr, PAGE_SIZE_4K,
> +					      pre_vm_page_prot);
> +			if (err)
> +				return -EAGAIN;
> +		}
> +	}
> +	return 0;
> +}

^ permalink raw reply

* [net PATCH 1/1] drivers: net: cpsw: fix multicast flush in dual emac mode
From: Mugunthan V N @ 2015-01-13 12:05 UTC (permalink / raw)
  To: netdev; +Cc: davem, Mugunthan V N, stable

Since ALE table is a common resource for both the interfaces in Dual EMAC
mode and while bringing up the second interface in cpsw_ndo_set_rx_mode()
all the multicast entries added by the first interface is flushed out and
only second interface multicast addresses are added. Fixing this by
flushing multicast addresses based on dual EMAC port vlans which will not
affect the other emac port multicast addresses.

Fixes: d9ba8f9 (driver: net: ethernet: cpsw: dual emac interface implementation)
Cc: <stable@vger.kernel.org> # v3.9+
Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
---
 drivers/net/ethernet/ti/cpsw.c     | 11 +++++++++--
 drivers/net/ethernet/ti/cpsw_ale.c | 10 +++++++++-
 drivers/net/ethernet/ti/cpsw_ale.h |  2 +-
 3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index e61ee83..64d1cef 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -610,7 +610,7 @@ static void cpsw_set_promiscious(struct net_device *ndev, bool enable)
 
 			/* Clear all mcast from ALE */
 			cpsw_ale_flush_multicast(ale, ALE_ALL_PORTS <<
-						 priv->host_port);
+						 priv->host_port, -1);
 
 			/* Flood All Unicast Packets to Host port */
 			cpsw_ale_control_set(ale, 0, ALE_P0_UNI_FLOOD, 1);
@@ -634,6 +634,12 @@ static void cpsw_set_promiscious(struct net_device *ndev, bool enable)
 static void cpsw_ndo_set_rx_mode(struct net_device *ndev)
 {
 	struct cpsw_priv *priv = netdev_priv(ndev);
+	int vid;
+
+	if (priv->data.dual_emac)
+		vid = priv->slaves[priv->emac_port].port_vlan;
+	else
+		vid = priv->data.default_vlan;
 
 	if (ndev->flags & IFF_PROMISC) {
 		/* Enable promiscuous mode */
@@ -649,7 +655,8 @@ static void cpsw_ndo_set_rx_mode(struct net_device *ndev)
 	cpsw_ale_set_allmulti(priv->ale, priv->ndev->flags & IFF_ALLMULTI);
 
 	/* Clear all mcast from ALE */
-	cpsw_ale_flush_multicast(priv->ale, ALE_ALL_PORTS << priv->host_port);
+	cpsw_ale_flush_multicast(priv->ale, ALE_ALL_PORTS << priv->host_port,
+				 vid);
 
 	if (!netdev_mc_empty(ndev)) {
 		struct netdev_hw_addr *ha;
diff --git a/drivers/net/ethernet/ti/cpsw_ale.c b/drivers/net/ethernet/ti/cpsw_ale.c
index 097ebe7..5246b3a 100644
--- a/drivers/net/ethernet/ti/cpsw_ale.c
+++ b/drivers/net/ethernet/ti/cpsw_ale.c
@@ -234,7 +234,7 @@ static void cpsw_ale_flush_mcast(struct cpsw_ale *ale, u32 *ale_entry,
 		cpsw_ale_set_entry_type(ale_entry, ALE_TYPE_FREE);
 }
 
-int cpsw_ale_flush_multicast(struct cpsw_ale *ale, int port_mask)
+int cpsw_ale_flush_multicast(struct cpsw_ale *ale, int port_mask, int vid)
 {
 	u32 ale_entry[ALE_ENTRY_WORDS];
 	int ret, idx;
@@ -245,6 +245,14 @@ int cpsw_ale_flush_multicast(struct cpsw_ale *ale, int port_mask)
 		if (ret != ALE_TYPE_ADDR && ret != ALE_TYPE_VLAN_ADDR)
 			continue;
 
+		/* if vid passed is -1 then remove all multicast entry from
+		 * the table irrespective of vlan id, if a valid vlan id is
+		 * passed then remove only multicast added to that vlan id.
+		 * if vlan id doesn't match then move on to next entry.
+		 */
+		if (vid != -1 && cpsw_ale_get_vlan_id(ale_entry) != vid)
+			continue;
+
 		if (cpsw_ale_get_mcast(ale_entry)) {
 			u8 addr[6];
 
diff --git a/drivers/net/ethernet/ti/cpsw_ale.h b/drivers/net/ethernet/ti/cpsw_ale.h
index c0d4127..af1e7ec 100644
--- a/drivers/net/ethernet/ti/cpsw_ale.h
+++ b/drivers/net/ethernet/ti/cpsw_ale.h
@@ -92,7 +92,7 @@ void cpsw_ale_stop(struct cpsw_ale *ale);
 
 int cpsw_ale_set_ageout(struct cpsw_ale *ale, int ageout);
 int cpsw_ale_flush(struct cpsw_ale *ale, int port_mask);
-int cpsw_ale_flush_multicast(struct cpsw_ale *ale, int port_mask);
+int cpsw_ale_flush_multicast(struct cpsw_ale *ale, int port_mask, int vid);
 int cpsw_ale_add_ucast(struct cpsw_ale *ale, u8 *addr, int port,
 		       int flags, u16 vid);
 int cpsw_ale_del_ucast(struct cpsw_ale *ale, u8 *addr, int port,
-- 
2.2.1.62.g3f15098

^ permalink raw reply related

* Re: why are IPv6 addresses removed on link down
From: YOSHIFUJI Hideaki @ 2015-01-13 12:15 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Stephen Hemminger
  Cc: hideaki.yoshifuji, David Ahern, netdev@vger.kernel.org
In-Reply-To: <54B50873.4090907@miraclelinux.com>

YOSHIFUJI Hideaki wrote:
> Hi,
>
> Hannes Frederic Sowa wrote:
>> On Mo, 2015-01-12 at 23:10 -0800, Stephen Hemminger wrote:
>>> On Mon, 12 Jan 2015 22:06:44 -0700
>>> David Ahern <dsahern@gmail.com> wrote:
>>>
>>>> We noticed that IPv6 addresses are removed on a link down. e.g.,
>>>>     ip link set dev eth1
>>>>
>>>>
>>>> Looking at the code it appears to be this code path in addrconf.c:
>>>>
>>>>           case NETDEV_DOWN:
>>>>           case NETDEV_UNREGISTER:
>>>>                   /*
>>>>                    *      Remove all addresses from this interface.
>>>>                    */
>>>>                   addrconf_ifdown(dev, event != NETDEV_DOWN);
>>>>                   break;
>>>>
>>>> IPv4 addresses are NOT removed on a link down. Is there a particular
>>>> reason IPv6 addresses are?
>>>>
>>>> Thanks,
>>>> David
>>>
>>> See RFC's which describes how IPv6 does Duplicate Address Detection.
>>> Address is not valid when link is down, since DAD is not possible.
>>
>> It should be no problem if the kernel would reacquire them on ifup and
>> do proper DAD. We simply must not use them while the interface is dead
>> (also making sure they don't get used for loopback routing).
>>
>> The problem the IPv6 addresses get removed is much more a historical
>> artifact nowadays, I think. It is part of user space API and scripts
>> deal with that already.
>
> We might have another "detached" state which essintially drops
> outgoing packets while link is down.  Just after recovering link,
> we could start receiving packet from the link and perform optimistic
> DAD. And then, after it succeeds, we may start sending packets.
>
> Since "detached" state is like the state just before completing
> Optimistic DAD, it is not so difficult to implement this extended
> behavior, I guess.
>

Note that node is allowed to send packets to neighbours or default
routers if the node knows their link-layer addresses during Optimistic
DAD.

-- 
Hideaki Yoshifuji <hideaki.yoshifuji@miraclelinux.com>
Technical Division, MIRACLE LINUX CORPORATION

^ permalink raw reply

* Re: [3.19-rc3] tg3: BUG: sleeping function called from invalid context
From: Peter Hurley @ 2015-01-13 12:30 UTC (permalink / raw)
  To: Prashant Sreedharan; +Cc: Michael Chan, netdev, Linux kernel
In-Reply-To: <1421116255.16485.14.camel@prashant>

On 01/12/2015 09:30 PM, Prashant Sreedharan wrote:
> On Mon, 2015-01-12 at 19:59 -0500, Peter Hurley wrote:
>> On 3.19-rc3, I'm seeing this might_sleep() warning [1] from the tg3_open()
>> call stack. Let me know if I need to bisect this.
>>
>> Regards,
>> Peter Hurley
>>
>> [1]
>>
>> [   17.203009] BUG: sleeping function called from invalid context at /home/peter/src/kernels/mainline/kernel/irq/manage.c:104
>> [   17.203067] in_atomic(): 1, irqs_disabled(): 0, pid: 1106, name: ip
>> [   17.203092] 2 locks held by ip/1106:
>> [   17.205255]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff816adf1f>] rtnetlink_rcv+0x1f/0x40
>> [   17.207445]  #1:  (&(&tp->lock)->rlock){+.....}, at: [<ffffffffa01073e6>] tg3_start+0xc06/0x11f0 [tg3]
>> [   17.209725] CPU: 2 PID: 1106 Comm: ip Not tainted 3.19.0-rc3+wip-xeon+lockdep #rc3+wip
>> [   17.211900] Hardware name: Dell Inc. Precision WorkStation T5400  /0RW203, BIOS A11 04/30/2012
>> [   17.214086]  0000000000000068 ffff8802ac823498 ffffffff817af7e8 0000000000000005
>> [   17.216265]  ffffffff81a9be78 ffff8802ac8234a8 ffffffff810998a5 ffff8802ac8234d8
>> [   17.218446]  ffffffff8109991a ffff8802ac8234c8 ffff8802af0aae00 ffffffffa00ed000
>> [   17.220636] Call Trace:
>> [   17.222743]  [<ffffffff817af7e8>] dump_stack+0x4f/0x7b
>> [   17.224808]  [<ffffffff810998a5>] ___might_sleep+0x105/0x140
>> [   17.226842]  [<ffffffff8109991a>] __might_sleep+0x3a/0xa0
>> [   17.228869]  [<ffffffffa00ed000>] ? 0xffffffffa00ed000
>> [   17.230939]  [<ffffffff810d7d78>] synchronize_irq+0x38/0xa0
>> [   17.232967]  [<ffffffffa00ed000>] ? 0xffffffffa00ed000
>> [   17.234991]  [<ffffffffa010105f>] tg3_chip_reset+0x13f/0x9c0 [tg3]
>> [   17.236988]  [<ffffffffa01020ae>] tg3_reset_hw+0x7e/0x2d20 [tg3]
>> [   17.238996]  [<ffffffff813bfaff>] ? __udelay+0x2f/0x40
>> [   17.241007]  [<ffffffffa00ef2f7>] ? _tw32_flush+0x47/0x80 [tg3]
>> [   17.243066]  [<ffffffffa0104dac>] tg3_init_hw+0x5c/0x70 [tg3]
>> [   17.245438]  [<ffffffffa010740b>] tg3_start+0xc2b/0x11f0 [tg3]
>> [   17.247444]  [<ffffffffa0107ad7>] ? tg3_open+0x107/0x2e0 [tg3]
>> [   17.249556]  [<ffffffff810c338d>] ? trace_hardirqs_on+0xd/0x10
>> [   17.251581]  [<ffffffff8107806f>] ? __local_bh_enable_ip+0x6f/0x100
>> [   17.253710]  [<ffffffffa0107af8>] tg3_open+0x128/0x2e0 [tg3]
>> [   17.255758]  [<ffffffff816ba3f5>] ? netpoll_poll_disable+0x5/0xa0
>> [   17.257932]  [<ffffffff816a14af>] __dev_open+0xbf/0x140
>> [   17.260091]  [<ffffffff816a17c1>] __dev_change_flags+0xa1/0x160
>> [   17.262222]  [<ffffffff816a18a9>] dev_change_flags+0x29/0x60
>> [   17.264360]  [<ffffffff816b0e02>] do_setlink+0x2f2/0xa30
>> [   17.266431]  [<ffffffff816b1b7f>] rtnl_newlink+0x51f/0x750
>> [   17.268485]  [<ffffffff816b1749>] ? rtnl_newlink+0xe9/0x750
>> [   17.270483]  [<ffffffff811869c2>] ? free_pages_prepare+0x1d2/0x270
>> [   17.272507]  [<ffffffff810c32bd>] ? trace_hardirqs_on_caller+0x11d/0x1e0
>> [   17.274531]  [<ffffffff813dd1b2>] ? nla_parse+0x32/0x120
>> [   17.276531]  [<ffffffff81021ab5>] ? native_sched_clock+0x35/0xa0
>> [   17.278514]  [<ffffffff816adfd5>] rtnetlink_rcv_msg+0x95/0x250
>> [   17.280485]  [<ffffffff8109f699>] ? preempt_count_sub+0x49/0x50
>> [   17.282448]  [<ffffffff817b4a02>] ? mutex_lock_nested+0x382/0x530
>> [   17.284402]  [<ffffffff816adf1f>] ? rtnetlink_rcv+0x1f/0x40
>> [   17.286290]  [<ffffffff816adf1f>] ? rtnetlink_rcv+0x1f/0x40
>> [   17.288142]  [<ffffffff816adf40>] ? rtnetlink_rcv+0x40/0x40
>> [   17.290031]  [<ffffffff816cedc1>] netlink_rcv_skb+0xc1/0xe0
>> [   17.291836]  [<ffffffff816adf2e>] rtnetlink_rcv+0x2e/0x40
>> [   17.293615]  [<ffffffff816ce473>] netlink_unicast+0xf3/0x1d0
>> [   17.295420]  [<ffffffff816ce863>] netlink_sendmsg+0x313/0x690
>> [   17.297132]  [<ffffffff811ada4f>] ? might_fault+0x5f/0xb0
>> [   17.298799]  [<ffffffff8168253c>] do_sock_sendmsg+0x8c/0x100
>> [   17.300493]  [<ffffffff81681e3e>] ? copy_msghdr_from_user+0x15e/0x1f0
>> [   17.302173]  [<ffffffff81682aeb>] ___sys_sendmsg+0x30b/0x320
>> [   17.303798]  [<ffffffff81021ab5>] ? native_sched_clock+0x35/0xa0
>> [   17.305431]  [<ffffffff810bdee0>] ? cpuacct_account_field+0x80/0xb0
>> [   17.307085]  [<ffffffff81021ab5>] ? native_sched_clock+0x35/0xa0
>> [   17.308744]  [<ffffffff810a4f35>] ? sched_clock_local+0x25/0x90
>> [   17.310375]  [<ffffffff810a5dc1>] ? vtime_account_user+0x91/0xa0
>> [   17.311948]  [<ffffffff810a5198>] ? sched_clock_cpu+0xb8/0xe0
>> [   17.313509]  [<ffffffff810bf8be>] ? put_lock_stats.isra.26+0xe/0x30
>> [   17.315069]  [<ffffffff810c007e>] ? lock_release_holdtime.part.27+0x12e/0x1b0
>> [   17.316618]  [<ffffffff810a5dc1>] ? vtime_account_user+0x91/0xa0
>> [   17.318162]  [<ffffffff8109f5d1>] ? get_parent_ip+0x11/0x50
>> [   17.319703]  [<ffffffff8109f699>] ? preempt_count_sub+0x49/0x50
>> [   17.321235]  [<ffffffff811807e5>] ? context_tracking_user_exit+0x55/0x130
>> [   17.322732]  [<ffffffff811807e5>] ? context_tracking_user_exit+0x55/0x130
>> [   17.324197]  [<ffffffff816834f2>] __sys_sendmsg+0x42/0x80
>> [   17.325634]  [<ffffffff81683542>] SyS_sendmsg+0x12/0x20
>> [   17.327048]  [<ffffffff817ba12d>] system_call_fastpath+0x16/0x1b
> 
> Please bisect, there hasn't been tg3 code changes in this path that
> might cause this. It would help to know the commit changes that is
> triggering the problem.

Ok, will do.

> Also could you provide the device details, from
> syslog look for "Tigon3 [partno(BCMxxxxx) rev xxxxxxx]". Thanks.

[    1.430884] tg3 0000:08:00.0 eth0: Tigon3 [partno(BCM95754) rev b002] (PCI Express) MAC address xx:xx:xx:xx:xx:xx
[    1.431095] tg3 0000:08:00.0 eth0: attached PHY is 5787 (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[0])
[    1.431295] tg3 0000:08:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
[    1.431488] tg3 0000:08:00.0 eth0: dma_rwctrl[76180000] dma_mask[64-bit]

Regards,
Peter Hurley

^ permalink raw reply

* Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
From: Hannes Frederic Sowa @ 2015-01-13 12:35 UTC (permalink / raw)
  To: John Fastabend
  Cc: netdev, danny.zhou, nhorman, dborkman, john.ronciak, brouer
In-Reply-To: <20150113043509.29985.33515.stgit@nitbit.x32>

On Mo, 2015-01-12 at 20:35 -0800, John Fastabend wrote:
> This patch adds net_device ops to split off a set of driver queues
> from the driver and map the queues into user space via mmap. This
> allows the queues to be directly manipulated from user space. For
> raw packet interface this removes any overhead from the kernel network
> stack.
> 
> With these operations we bypass the network stack and packet_type
> handlers that would typically send traffic to an af_packet socket.
> This means hardware must do the forwarding. To do this ew can use
> the ETHTOOL_SRXCLSRLINS ops in the ethtool command set. It is
> currently supported by multiple drivers including sfc, mlx4, niu,
> ixgbe, and i40e. Supporting some way to steer traffic to a queue
> is the _only_ hardware requirement to support this interface.
> 
> A follow on patch adds support for ixgbe but we expect at least
> the subset of drivers implementing ETHTOOL_SRXCLSRLINS can be
> implemented later.
> 
> The high level flow, leveraging the af_packet control path, looks
> like:
> 
> 	bind(fd, &sockaddr, sizeof(sockaddr));
> 
> 	/* Get the device type and info */
> 	getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
> 		   &optlen);
> 
> 	/* With device info we can look up descriptor format */
> 
> 	/* Get the layout of ring space offset, page_sz, cnt */
> 	getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
> 		   &info, &optlen);
> 
> 	/* request some queues from the driver */
> 	setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> 		   &qpairs_info, sizeof(qpairs_info));
> 
> 	/* if we let the driver pick us queues learn which queues
>          * we were given
>          */
> 	getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> 		   &qpairs_info, sizeof(qpairs_info));
> 
> 	/* And mmap queue pairs to user space */
> 	mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
> 	     MAP_SHARED, fd, 0);
> 
> 	/* Now we have some user space queues to read/write to*/
> 
> There is one critical difference when running with these interfaces
> vs running without them. In the normal case the af_packet module
> uses a standard descriptor format exported by the af_packet user
> space headers. In this model because we are working directly with
> driver queues the descriptor format maps to the descriptor format
> used by the device. User space applications can learn device
> information from the socket option PACKET_DEV_DESC_INFO. These
> are described by giving the vendor/deviceid and a descriptor layout
> in offset/length/width/alignment/byte_ordering.
> 
> To protect against arbitrary DMA writes IOMMU devices put memory
> in a single domain to stop arbitrary DMA to memory. Note it would
> be possible to dma into another sockets pages because most NIC
> devices only support a single domain. This would require being
> able to guess another sockets page layout. However the socket
> operation does require CAP_NET_ADMIN privileges.
> 
> Additionally we have a set of DPDK patches to enable DPDK with this
> interface. DPDK can be downloaded @ dpdk.org although as I hope is
> clear from above DPDK is just our paticular test environment we
> expect other libraries could be built on this interface.
> 
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
>  include/linux/netdevice.h      |   79 ++++++++
>  include/uapi/linux/if_packet.h |   88 +++++++++
>  net/packet/af_packet.c         |  397 ++++++++++++++++++++++++++++++++++++++++
>  net/packet/internal.h          |   10 +
>  4 files changed, 573 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 679e6e9..b71c97d 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -52,6 +52,8 @@
>  #include <linux/neighbour.h>
>  #include <uapi/linux/netdevice.h>
>  
> +#include <linux/if_packet.h>
> +
>  struct netpoll_info;
>  struct device;
>  struct phy_device;
> @@ -1030,6 +1032,54 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
>   * int (*ndo_switch_port_stp_update)(struct net_device *dev, u8 state);
>   *	Called to notify switch device port of bridge port STP
>   *	state change.
> + *
> + * int (*ndo_split_queue_pairs) (struct net_device *dev,
> + *				 unsigned int qpairs_start_from,
> + *				 unsigned int qpairs_num,
> + *				 struct sock *sk)
> + *	Called to request a set of queues from the driver to be handed to the
> + *	callee for management. After this returns the driver will not use the
> + *	queues.
> + *
> + * int (*ndo_get_split_queue_pairs) (struct net_device *dev,
> + *				 unsigned int *qpairs_start_from,
> + *				 unsigned int *qpairs_num,
> + *				 struct sock *sk)
> + *	Called to get the location of queues that have been split for user
> + *	space to use. The socket must have previously requested the queues via
> + *	ndo_split_queue_pairs successfully.
> + *
> + * int (*ndo_return_queue_pairs) (struct net_device *dev,
> + *				  struct sock *sk)
> + *	Called to return a set of queues identified by sock to the driver. The
> + *	socket must have previously requested the queues via
> + *	ndo_split_queue_pairs for this action to be performed.
> + *
> + * int (*ndo_get_device_qpair_map_region_info) (struct net_device *dev,
> + *				struct tpacket_dev_qpair_map_region_info *info)
> + *	Called to return mapping of queue memory region.
> + *
> + * int (*ndo_get_device_desc_info) (struct net_device *dev,
> + *				    struct tpacket_dev_info *dev_info)
> + *	Called to get device specific information. This should uniquely identify
> + *	the hardware so that descriptor formats can be learned by the stack/user
> + *	space.
> + *
> + * int (*ndo_direct_qpair_page_map) (struct vm_area_struct *vma,
> + *				     struct net_device *dev)
> + *	Called to map queue pair range from split_queue_pairs into mmap region.
> + *
> + * int (*ndo_direct_validate_dma_mem_region_map)
> + *					(struct net_device *dev,
> + *					 struct tpacket_dma_mem_region *region,
> + *					 struct sock *sk)
> + *	Called to validate DMA address remaping for userspace memory region
> + *
> + * int (*ndo_get_dma_region_info)
> + *				 (struct net_device *dev,
> + *				  struct tpacket_dma_mem_region *region,
> + *				  struct sock *sk)
> + *	Called to get dma region' information such as iova.
>   */
>  struct net_device_ops {
>  	int			(*ndo_init)(struct net_device *dev);
> @@ -1190,6 +1240,35 @@ struct net_device_ops {
>  	int			(*ndo_switch_port_stp_update)(struct net_device *dev,
>  							      u8 state);
>  #endif
> +	int			(*ndo_split_queue_pairs)(struct net_device *dev,
> +					 unsigned int qpairs_start_from,
> +					 unsigned int qpairs_num,
> +					 struct sock *sk);
> +	int			(*ndo_get_split_queue_pairs)
> +					(struct net_device *dev,
> +					 unsigned int *qpairs_start_from,
> +					 unsigned int *qpairs_num,
> +					 struct sock *sk);
> +	int			(*ndo_return_queue_pairs)
> +					(struct net_device *dev,
> +					 struct sock *sk);
> +	int			(*ndo_get_device_qpair_map_region_info)
> +					(struct net_device *dev,
> +					 struct tpacket_dev_qpair_map_region_info *info);
> +	int			(*ndo_get_device_desc_info)
> +					(struct net_device *dev,
> +					 struct tpacket_dev_info *dev_info);
> +	int			(*ndo_direct_qpair_page_map)
> +					(struct vm_area_struct *vma,
> +					 struct net_device *dev);
> +	int			(*ndo_validate_dma_mem_region_map)
> +					(struct net_device *dev,
> +					 struct tpacket_dma_mem_region *region,
> +					 struct sock *sk);
> +	int			(*ndo_get_dma_region_info)
> +					(struct net_device *dev,
> +					 struct tpacket_dma_mem_region *region,
> +					 struct sock *sk);
>  };
>  
>  /**
> diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
> index da2d668..eb7a727 100644
> --- a/include/uapi/linux/if_packet.h
> +++ b/include/uapi/linux/if_packet.h
> @@ -54,6 +54,13 @@ struct sockaddr_ll {
>  #define PACKET_FANOUT			18
>  #define PACKET_TX_HAS_OFF		19
>  #define PACKET_QDISC_BYPASS		20
> +#define PACKET_RXTX_QPAIRS_SPLIT	21
> +#define PACKET_RXTX_QPAIRS_RETURN	22
> +#define PACKET_DEV_QPAIR_MAP_REGION_INFO	23
> +#define PACKET_DEV_DESC_INFO		24
> +#define PACKET_DMA_MEM_REGION_MAP       25
> +#define PACKET_DMA_MEM_REGION_RELEASE   26
> +
>  
>  #define PACKET_FANOUT_HASH		0
>  #define PACKET_FANOUT_LB		1
> @@ -64,6 +71,87 @@ struct sockaddr_ll {
>  #define PACKET_FANOUT_FLAG_ROLLOVER	0x1000
>  #define PACKET_FANOUT_FLAG_DEFRAG	0x8000
>  
> +#define PACKET_MAX_NUM_MAP_MEMORY_REGIONS 64
> +#define PACKET_MAX_NUM_DESC_FORMATS	  8
> +#define PACKET_MAX_NUM_DESC_FIELDS	  64
> +#define PACKET_NIC_DESC_FIELD(fseq, foffset, fwidth, falign, fbo) \
> +		.seqn = (__u8)fseq,				\
> +		.offset = (__u8)foffset,			\
> +		.width = (__u8)fwidth,				\
> +		.align = (__u8)falign,				\
> +		.byte_order = (__u8)fbo

Are the __u8 necessary? They seem to hide compiler warnings?

> +
> +#define MAX_MAP_MEMORY_REGIONS	64
> +
> +/* setsockopt takes addr, size ,direction parametner, getsockopt takes
> + * iova, size, direction.
> + * */
> +struct tpacket_dma_mem_region {
> +	void *addr;		/* userspace virtual address */
> +	__u64 phys_addr;	/* physical address */
> +	__u64 iova;		/* IO virtual address used for DMA */
> +	unsigned long size;	/* size of region */
> +	int direction;		/* dma data direction */
> +};

Have you tested this with with 32 bit user space and 32 bit kernel, too?
I don't have any problem with only supporting 64 bit kernels for this
feature, but looking through the code I wonder if we handle the __u64
addresses correctly in all situations.

The other question I have, would it make sense to move the

+#ifdef CONFIG_DMA_MEMORY_PROTECTION
+	/* IOVA not equal to physical address means IOMMU takes effect */
+	if (region->phys_addr == region->iova)
+		return -EFAULT;
+#endif

check from the ixgbe driver into the kernel core, so we never expose
memory mapped io which is not protected by its own memory domain?

Thanks,
Hannes

^ permalink raw reply

* Re: why are IPv6 addresses removed on link down
From: Hannes Frederic Sowa @ 2015-01-13 12:36 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki; +Cc: Stephen Hemminger, David Ahern, netdev@vger.kernel.org
In-Reply-To: <54B50C71.7090007@miraclelinux.com>

Hi,

On Di, 2015-01-13 at 21:15 +0900, YOSHIFUJI Hideaki wrote:
> YOSHIFUJI Hideaki wrote:
> > Hi,
> >
> > Hannes Frederic Sowa wrote:
> >> On Mo, 2015-01-12 at 23:10 -0800, Stephen Hemminger wrote:
> >>> On Mon, 12 Jan 2015 22:06:44 -0700
> >>> David Ahern <dsahern@gmail.com> wrote:
> >>>
> >>>> We noticed that IPv6 addresses are removed on a link down. e.g.,
> >>>>     ip link set dev eth1
> >>>>
> >>>>
> >>>> Looking at the code it appears to be this code path in addrconf.c:
> >>>>
> >>>>           case NETDEV_DOWN:
> >>>>           case NETDEV_UNREGISTER:
> >>>>                   /*
> >>>>                    *      Remove all addresses from this interface.
> >>>>                    */
> >>>>                   addrconf_ifdown(dev, event != NETDEV_DOWN);
> >>>>                   break;
> >>>>
> >>>> IPv4 addresses are NOT removed on a link down. Is there a particular
> >>>> reason IPv6 addresses are?
> >>>>
> >>>> Thanks,
> >>>> David
> >>>
> >>> See RFC's which describes how IPv6 does Duplicate Address Detection.
> >>> Address is not valid when link is down, since DAD is not possible.
> >>
> >> It should be no problem if the kernel would reacquire them on ifup and
> >> do proper DAD. We simply must not use them while the interface is dead
> >> (also making sure they don't get used for loopback routing).
> >>
> >> The problem the IPv6 addresses get removed is much more a historical
> >> artifact nowadays, I think. It is part of user space API and scripts
> >> deal with that already.
> >
> > We might have another "detached" state which essintially drops
> > outgoing packets while link is down.  Just after recovering link,
> > we could start receiving packet from the link and perform optimistic
> > DAD. And then, after it succeeds, we may start sending packets.
> >
> > Since "detached" state is like the state just before completing
> > Optimistic DAD, it is not so difficult to implement this extended
> > behavior, I guess.
> >
> 
> Note that node is allowed to send packets to neighbours or default
> routers if the node knows their link-layer addresses during Optimistic
> DAD.
> 

I don't think it should be a problem from internal state handling of the
addresses.

I am much more concerned with scripts expecting the addresses to be
flushed on interface down/up and not reacting appropriate.

Bye,
Hannes

^ permalink raw reply

* Re: [3.19-rc3] tg3: BUG: sleeping function called from invalid context
From: Peter Hurley @ 2015-01-13 12:47 UTC (permalink / raw)
  To: Michael Chan; +Cc: Prashant Sreedharan, netdev, Linux kernel
In-Reply-To: <1421131762.7208.31.camel@LTIRV-MCHAN1.corp.ad.broadcom.com>

On 01/13/2015 01:49 AM, Michael Chan wrote:
> On Mon, 2015-01-12 at 19:59 -0500, Peter Hurley wrote: 
>> [   17.203009] BUG: sleeping function called from invalid context at /home/peter/src/kernels/mainline/kernel/irq/manage.c:104
>> [   17.203067] in_atomic(): 1, irqs_disabled(): 0, pid: 1106, name: ip
>> [   17.203092] 2 locks held by ip/1106:
>> [   17.205255]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff816adf1f>] rtnetlink_rcv+0x1f/0x40
>> [   17.207445]  #1:  (&(&tp->lock)->rlock){+.....}, at: [<ffffffffa01073e6>] tg3_start+0xc06/0x11f0 [tg3]
>> [   17.209725] CPU: 2 PID: 1106 Comm: ip Not tainted 3.19.0-rc3+wip-xeon+lockdep #rc3+wip
>> [   17.211900] Hardware name: Dell Inc. Precision WorkStation T5400  /0RW203, BIOS A11 04/30/2012
>> [   17.214086]  0000000000000068 ffff8802ac823498 ffffffff817af7e8 0000000000000005
>> [   17.216265]  ffffffff81a9be78 ffff8802ac8234a8 ffffffff810998a5 ffff8802ac8234d8
>> [   17.218446]  ffffffff8109991a ffff8802ac8234c8 ffff8802af0aae00 ffffffffa00ed000
>> [   17.220636] Call Trace:
>> [   17.222743]  [<ffffffff817af7e8>] dump_stack+0x4f/0x7b
>> [   17.224808]  [<ffffffff810998a5>] ___might_sleep+0x105/0x140
>> [   17.226842]  [<ffffffff8109991a>] __might_sleep+0x3a/0xa0
>> [   17.228869]  [<ffffffffa00ed000>] ? 0xffffffffa00ed000
>> [   17.230939]  [<ffffffff810d7d78>] synchronize_irq+0x38/0xa0
>> [   17.232967]  [<ffffffffa00ed000>] ? 0xffffffffa00ed000
>> [   17.234991]  [<ffffffffa010105f>] tg3_chip_reset+0x13f/0x9c0 [tg3]
>> [   17.236988]  [<ffffffffa01020ae>] tg3_reset_hw+0x7e/0x2d20 [tg3] 
> 
> tp->lock is held in this code path.  If synchronize_irq() sleeps in
> wait_event(desc->wait_for_threads, ...), we'll get the warning.
> 
> The synchronize_irq() call is to wait for any tg3 irq handler to finish
> so that it is guaranteed that next time it will see the CHIP_RESETTING
> flag and do nothing.
> 
> Not sure if we can drop the tp->lock before we call synchronize_irq()
> and then take it again after synchronize_irq().

Well, this device [1] is using MSI (INTx disabled) so if the synchronize_irq()
is _only_ for the CHIP_RESETTING logic then it would seem ok to skip it (the
synchronize_irq()).

Regards,
Peter Hurley


[1] lspci -vv

08:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5754 Gigabit Ethernet PCI Express (rev 02)
	Subsystem: Dell Precision T5400
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 31
	Region 0: Memory at d3ff0000 (64-bit, non-prefetchable) [size=64K]
	Expansion ROM at <ignored> [disabled]
	Capabilities: [48] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] Vital Product Data
		Product Name: Broadcom NetLink Gigabit Ethernet Controller
		Read-only fields:
			[PN] Part number: BCM95754
			[EC] Engineering changes: 106679-15
			[SN] Serial number: 0123456789
			[MN] Manufacture ID: 31 34 65 34
			[RV] Reserved: checksum good, 30 byte(s) reserved
		Read/write fields:
			[YA] Asset tag: XYZ01234567
			[RW] Read-write area: 107 byte(s) free
		End
	Capabilities: [58] Vendor Specific Information: Len=78 <?>
	Capabilities: [e8] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee0400c  Data: 41a2
	Capabilities: [d0] Express (v1) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s <4us, L1 <64us
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr+ BadTLP- BadDLLP+ Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [13c v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status:	NegoPending- InProgress-
	Capabilities: [160 v1] Device Serial Number xx-xx-xx-xx-xx-xx-xx-xx
	Capabilities: [16c v1] Power Budgeting <?>
	Kernel driver in use: tg3

^ permalink raw reply

* Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
From: Daniel Borkmann @ 2015-01-13 13:21 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: John Fastabend, netdev, danny.zhou, nhorman, john.ronciak, brouer
In-Reply-To: <1421152510.13626.22.camel@stressinduktion.org>

On 01/13/2015 01:35 PM, Hannes Frederic Sowa wrote:
> On Mo, 2015-01-12 at 20:35 -0800, John Fastabend wrote:
...
>> +/* setsockopt takes addr, size ,direction parametner, getsockopt takes
>> + * iova, size, direction.
>> + * */
>> +struct tpacket_dma_mem_region {
>> +	void *addr;		/* userspace virtual address */
>> +	__u64 phys_addr;	/* physical address */
>> +	__u64 iova;		/* IO virtual address used for DMA */
>> +	unsigned long size;	/* size of region */
>> +	int direction;		/* dma data direction */
>> +};
>
> Have you tested this with with 32 bit user space and 32 bit kernel, too?
> I don't have any problem with only supporting 64 bit kernels for this
> feature, but looking through the code I wonder if we handle the __u64
> addresses correctly in all situations.

Given this is placed into uapi and transferred via setsockopt(2), this
would also need some form of compat handling, also for the case of mixed
environments (e.g. 64 bit kernel, 32 bit user space).

^ permalink raw reply

* Re: [PATCH net-next 4/8] net: dsa: cleanup resources upon module removal
From: Sergei Shtylyov @ 2015-01-13 13:31 UTC (permalink / raw)
  To: Florian Fainelli, netdev; +Cc: davem, buytenh
In-Reply-To: <1421099866-3184-5-git-send-email-f.fainelli@gmail.com>

Hello.

On 1/13/2015 12:57 AM, Florian Fainelli wrote:

> We were not doing anything while removing the dsa module, which means
> that we were leaving dangling network devices without any sort of driver
> backing them, leading to all sorts of crashes. Make sure that we do
> cleanup the slave network devices, slave MII bus we created, and
> unassign the master_netdev dsa_ptr to make the packet processing go
> through the regulard Ethernet receive path.

    Regular. :-)

> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
[...]

> diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
> index de77c83cfd9a..df7ec066ac64 100644
> --- a/net/dsa/dsa.c
> +++ b/net/dsa/dsa.c
> @@ -316,10 +316,22 @@ out:
>
>   static void dsa_switch_destroy(struct dsa_switch *ds)
>   {
> +	int i;

    Need empty line here.

>   #ifdef CONFIG_NET_DSA_HWMON
>   	if (ds->hwmon_dev)
>   		hwmon_device_unregister(ds->hwmon_dev);
>   #endif

[...]

WBR, Sergei

^ permalink raw reply

* [RFC PATCHv1 net-next] xen-netback: always fully coalesce guest Rx packets
From: David Vrabel @ 2015-01-13 14:05 UTC (permalink / raw)
  To: netdev; +Cc: David Vrabel, xen-devel, Ian Campbell, Wei Liu

Always fully coalesce guest Rx packets into the minimum number of ring
slots.  Reducing the number of slots per packet has significant
performance benefits (e.g., 7.2 Gbit/s to 11 Gbit/s in an off-host
receive test).

However, this does increase the number of grant ops per packet which
decreases performance with some workloads (intrahost VM to VM)
/unless/ grant copy has been optimized for adjacent ops with the same
source or destination (see "grant-table: defer releasing pages
acquired in a grant copy"[1]).

Do we need to retain the existing path and make the always coalesce
path conditional on a suitable version of Xen?

[1] http://lists.xen.org/archives/html/xen-devel/2015-01/msg01118.html

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 drivers/net/xen-netback/common.h  |    1 -
 drivers/net/xen-netback/netback.c |  106 ++-----------------------------------
 2 files changed, 3 insertions(+), 104 deletions(-)

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 5f1fda4..589fa25 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -251,7 +251,6 @@ struct xenvif {
 struct xenvif_rx_cb {
 	unsigned long expires;
 	int meta_slots_used;
-	bool full_coalesce;
 };
 
 #define XENVIF_RX_CB(skb) ((struct xenvif_rx_cb *)(skb)->cb)
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 908e65e..568238d 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -233,51 +233,6 @@ static void xenvif_rx_queue_drop_expired(struct xenvif_queue *queue)
 	}
 }
 
-/*
- * Returns true if we should start a new receive buffer instead of
- * adding 'size' bytes to a buffer which currently contains 'offset'
- * bytes.
- */
-static bool start_new_rx_buffer(int offset, unsigned long size, int head,
-				bool full_coalesce)
-{
-	/* simple case: we have completely filled the current buffer. */
-	if (offset == MAX_BUFFER_OFFSET)
-		return true;
-
-	/*
-	 * complex case: start a fresh buffer if the current frag
-	 * would overflow the current buffer but only if:
-	 *     (i)   this frag would fit completely in the next buffer
-	 * and (ii)  there is already some data in the current buffer
-	 * and (iii) this is not the head buffer.
-	 * and (iv)  there is no need to fully utilize the buffers
-	 *
-	 * Where:
-	 * - (i) stops us splitting a frag into two copies
-	 *   unless the frag is too large for a single buffer.
-	 * - (ii) stops us from leaving a buffer pointlessly empty.
-	 * - (iii) stops us leaving the first buffer
-	 *   empty. Strictly speaking this is already covered
-	 *   by (ii) but is explicitly checked because
-	 *   netfront relies on the first buffer being
-	 *   non-empty and can crash otherwise.
-	 * - (iv) is needed for skbs which can use up more than MAX_SKB_FRAGS
-	 *   slot
-	 *
-	 * This means we will effectively linearise small
-	 * frags but do not needlessly split large buffers
-	 * into multiple copies tend to give large frags their
-	 * own buffers as before.
-	 */
-	BUG_ON(size > MAX_BUFFER_OFFSET);
-	if ((offset + size > MAX_BUFFER_OFFSET) && offset && !head &&
-	    !full_coalesce)
-		return true;
-
-	return false;
-}
-
 struct netrx_pending_operations {
 	unsigned copy_prod, copy_cons;
 	unsigned meta_prod, meta_cons;
@@ -336,24 +291,13 @@ static void xenvif_gop_frag_copy(struct xenvif_queue *queue, struct sk_buff *skb
 		BUG_ON(offset >= PAGE_SIZE);
 		BUG_ON(npo->copy_off > MAX_BUFFER_OFFSET);
 
-		bytes = PAGE_SIZE - offset;
+		if (npo->copy_off == MAX_BUFFER_OFFSET)
+			meta = get_next_rx_buffer(queue, npo);
 
+		bytes = PAGE_SIZE - offset;
 		if (bytes > size)
 			bytes = size;
 
-		if (start_new_rx_buffer(npo->copy_off,
-					bytes,
-					*head,
-					XENVIF_RX_CB(skb)->full_coalesce)) {
-			/*
-			 * Netfront requires there to be some data in the head
-			 * buffer.
-			 */
-			BUG_ON(*head);
-
-			meta = get_next_rx_buffer(queue, npo);
-		}
-
 		if (npo->copy_off + bytes > MAX_BUFFER_OFFSET)
 			bytes = MAX_BUFFER_OFFSET - npo->copy_off;
 
@@ -652,60 +596,16 @@ static void xenvif_rx_action(struct xenvif_queue *queue)
 
 	while (xenvif_rx_ring_slots_available(queue, XEN_NETBK_RX_SLOTS_MAX)
 	       && (skb = xenvif_rx_dequeue(queue)) != NULL) {
-		RING_IDX max_slots_needed;
 		RING_IDX old_req_cons;
 		RING_IDX ring_slots_used;
 		int i;
 
 		queue->last_rx_time = jiffies;
 
-		/* We need a cheap worse case estimate for the number of
-		 * slots we'll use.
-		 */
-
-		max_slots_needed = DIV_ROUND_UP(offset_in_page(skb->data) +
-						skb_headlen(skb),
-						PAGE_SIZE);
-		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
-			unsigned int size;
-			unsigned int offset;
-
-			size = skb_frag_size(&skb_shinfo(skb)->frags[i]);
-			offset = skb_shinfo(skb)->frags[i].page_offset;
-
-			/* For a worse-case estimate we need to factor in
-			 * the fragment page offset as this will affect the
-			 * number of times xenvif_gop_frag_copy() will
-			 * call start_new_rx_buffer().
-			 */
-			max_slots_needed += DIV_ROUND_UP(offset + size,
-							 PAGE_SIZE);
-		}
-
-		/* To avoid the estimate becoming too pessimal for some
-		 * frontends that limit posted rx requests, cap the estimate
-		 * at MAX_SKB_FRAGS. In this case netback will fully coalesce
-		 * the skb into the provided slots.
-		 */
-		if (max_slots_needed > MAX_SKB_FRAGS) {
-			max_slots_needed = MAX_SKB_FRAGS;
-			XENVIF_RX_CB(skb)->full_coalesce = true;
-		} else {
-			XENVIF_RX_CB(skb)->full_coalesce = false;
-		}
-
-		/* We may need one more slot for GSO metadata */
-		if (skb_is_gso(skb) &&
-		   (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4 ||
-		    skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6))
-			max_slots_needed++;
-
 		old_req_cons = queue->rx.req_cons;
 		XENVIF_RX_CB(skb)->meta_slots_used = xenvif_gop_skb(skb, &npo, queue);
 		ring_slots_used = queue->rx.req_cons - old_req_cons;
 
-		BUG_ON(ring_slots_used > max_slots_needed);
-
 		__skb_queue_tail(&rxq, skb);
 	}
 
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH] rtlwifi/rtl8192de: remove redundant else if check
From: Colin King @ 2015-01-13 14:07 UTC (permalink / raw)
  To: Larry Finger, Chaoming Li, Kalle Valo, John W. Linville,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA

From: Colin Ian King <colin.king-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

The else if check condition checks for the opposite of the
if check, hence the else if check is redundant and can be
replaced with a simple else:

if (rtlpriv->rtlhal.macphymode == SINGLEMAC_SINGLEPHY) {
	..
} else if (rtlpriv->rtlhal.macphymode != SINGLEMAC_SINGLEPHY) {
	..
}

replaced with:

if (rtlpriv->rtlhal.macphymode == SINGLEMAC_SINGLEPHY) {
	..
} else {
	..
}

Signed-off-by: Colin Ian King <colin.king-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
---
 drivers/net/wireless/rtlwifi/rtl8192de/hw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/rtlwifi/rtl8192de/hw.c b/drivers/net/wireless/rtlwifi/rtl8192de/hw.c
index 280c3da..01bcc2d 100644
--- a/drivers/net/wireless/rtlwifi/rtl8192de/hw.c
+++ b/drivers/net/wireless/rtlwifi/rtl8192de/hw.c
@@ -546,7 +546,7 @@ static bool _rtl92de_llt_table_init(struct ieee80211_hw *hw)
 		txpktbuf_bndy = 246;
 		value8 = 0;
 		value32 = 0x80bf0d29;
-	} else if (rtlpriv->rtlhal.macphymode != SINGLEMAC_SINGLEPHY) {
+	} else {
 		maxPage = 127;
 		txpktbuf_bndy = 123;
 		value8 = 0;
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH] bridge: only provide proxy ARP when CONFIG_INET is enabled
From: Arnd Bergmann @ 2015-01-13 14:10 UTC (permalink / raw)
  To: netdev; +Cc: davem, Kyeyoon Park, bridge, Stephen Hemminger

When IPV4 support is disabled, we cannot call arp_send from
the bridge code, which would result in a kernel link error:

net/built-in.o: In function `br_handle_frame_finish':
:(.text+0x59914): undefined reference to `arp_send'
:(.text+0x59a50): undefined reference to `arp_tbl'

This makes the newly added proxy ARP support in the bridge
code depend on the CONFIG_INET symbol and lets the compiler
optimize the code out to avoid the link error.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 958501163ddd ("bridge: Add support for IEEE 802.11 Proxy ARP")
Cc: Kyeyoon Park <kyeyoonp@codeaurora.org>

diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 1f1de715197c..e2aa7be3a847 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -154,7 +154,8 @@ int br_handle_frame_finish(struct sk_buff *skb)
 	dst = NULL;
 
 	if (is_broadcast_ether_addr(dest)) {
-		if (p->flags & BR_PROXYARP &&
+		if (IS_ENABLED(CONFIG_INET) &&
+		    p->flags & BR_PROXYARP &&
 		    skb->protocol == htons(ETH_P_ARP))
 			br_do_proxy_arp(skb, br, vid);
 

^ permalink raw reply related

* Re: [PATCH 8/8] ath10k: fix error return code
From: Kalle Valo @ 2015-01-13 14:16 UTC (permalink / raw)
  To: Julia Lawall
  Cc: linux-wireless, kernel-janitors, linux-kernel, ath10k, netdev
In-Reply-To: <1419872683-32709-9-git-send-email-Julia.Lawall@lip6.fr>

Julia Lawall <Julia.Lawall@lip6.fr> writes:

> Return a negative error code on failure.
>
> A simplified version of the semantic match that finds this problem is as
> follows: (http://coccinelle.lip6.fr/)
>
> // <smpl>
> @@
> identifier ret; expression e1,e2;
> @@
> (
> if (\(ret < 0\|ret != 0\))
>  { ... return ret; }
> |
> ret = 0
> )
> ... when != ret = e1
>     when != &ret
> *if(...)
> {
>   ... when != ret = e2
>       when forall
>  return ret;
> }
> // </smpl>
>
> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>

Thanks, applied to ath.git.

-- 
Kalle Valo

^ permalink raw reply

* Re: [PATCH v3] ath10k: fixup wait_for_completion_timeout return handling
From: Kalle Valo @ 2015-01-13 14:20 UTC (permalink / raw)
  To: Nicholas Mc Guire
  Cc: Chun-Yeow Yeoh, Sergei Shtylyov, netdev, linux-wireless,
	linux-kernel, ath10k, Michal Kazior, Yanbo Li, Ben Greear
In-Reply-To: <1420720054-27870-1-git-send-email-der.herr@hofr.at>

Nicholas Mc Guire <der.herr@hofr.at> writes:

> wait_for_completion_timeout does not return negative values so the tests
> for <= 0 are not needed and the case differentiation in the error handling
> path unnecessary.
>
> Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>

Thanks, applied to ath.git.

-- 
Kalle Valo

^ permalink raw reply

* Re: [net-next 03/15] i40evf: Remove some scary log messages
From: Sergei Shtylyov @ 2015-01-13 14:22 UTC (permalink / raw)
  To: Jeff Kirsher, davem; +Cc: Mitch A Williams, netdev, nhorman, sassmann, jogreene
In-Reply-To: <1421148811-9763-4-git-send-email-jeffrey.t.kirsher@intel.com>

Hello.

On 1/13/2015 2:33 PM, Jeff Kirsher wrote:

> From: Mitch A Williams <mitch.a.williams@intel.com>

> These messages may be triggered during normal init of the driver if the
> PF or FW take a long time to respond. There's nothing really wrong, so
> don't freak people out logging messages.

> If the communication channel really is dead, then we'll retry a few
> times and give up. This will log a different more scary message that
> should cause consternation. This allows the user to more easily detect a
> genuine failure.

> Change-ID: I6e2b758d4234a3a09c1015c82c8f2442a697cbdb
> Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
> Acked-by: Shannon Nelson <shannon.nelson@intel.com>
> Tested-by: Jim Young <james.m.young@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
[...]

> diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
> index ee0db59..f8f1d26 100644
> --- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
> +++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
> @@ -2026,10 +2026,7 @@ static void i40evf_init_task(struct work_struct *work)
>   		/* aq msg sent, awaiting reply */
>   		err = i40evf_verify_api_ver(adapter);
>   		if (err) {
> -			dev_info(&pdev->dev, "Unable to verify API version (%d), retrying\n",
> -				 err);
>   			if (err == I40E_ERR_ADMIN_QUEUE_NO_WORK) {
> -				dev_info(&pdev->dev, "Resending request\n");
>   				err = i40evf_send_api_ver(adapter);
>   			}

    {} not needed anymore, should have removed them.

[...]

WBR, Sergei

^ permalink raw reply

* [PATCH] rocker: fix harmless warning on 32-bit machines
From: Arnd Bergmann @ 2015-01-13 14:23 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Jiri Pirko, Scott Feldman, linux-arm-kernel

The rocker driver tries to assign a pointer to a 64-bit integer
and then back to a pointer. This is safe on all architectures,
but causes a compiler warning when pointers are shorter than
64-bit:

rocker/rocker.c: In function 'rocker_desc_cookie_ptr_get':
rocker/rocker.c:809:9: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
  return (void *) desc_info->desc->cookie;
         ^

This adds another cast to uintptr_t to tell the compiler
that it's safe.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
index 2f398fa4b9e6..cad8cf962cdf 100644
--- a/drivers/net/ethernet/rocker/rocker.c
+++ b/drivers/net/ethernet/rocker/rocker.c
@@ -806,13 +806,13 @@ static bool rocker_desc_gen(struct rocker_desc_info *desc_info)
 
 static void *rocker_desc_cookie_ptr_get(struct rocker_desc_info *desc_info)
 {
-	return (void *) desc_info->desc->cookie;
+	return (void *)(uintptr_t)desc_info->desc->cookie;
 }
 
 static void rocker_desc_cookie_ptr_set(struct rocker_desc_info *desc_info,
 				       void *ptr)
 {
-	desc_info->desc->cookie = (long) ptr;
+	desc_info->desc->cookie = (uintptr_t) ptr;
 }
 
 static struct rocker_desc_info *

^ permalink raw reply related

* Re: [RFC PATCH v2 2/2] net: ixgbe: implement af_packet direct queue mappings
From: Daniel Borkmann @ 2015-01-13 14:26 UTC (permalink / raw)
  To: John Fastabend; +Cc: netdev, danny.zhou, nhorman, john.ronciak, hannes, brouer
In-Reply-To: <20150113043542.29985.15658.stgit@nitbit.x32>

On 01/13/2015 05:35 AM, John Fastabend wrote:
...
> +static int ixgbe_ndo_split_queue_pairs(struct net_device *dev,
> +				       unsigned int start_from,
> +				       unsigned int qpairs_num,
> +				       struct sock *sk)
> +{
> +	struct ixgbe_adapter *adapter = netdev_priv(dev);
> +	unsigned int qpair_index;

We should probably return -EINVAL, still from within the setsockopt
call when qpairs_num is 0?

> +	/* allocate whatever available qpairs */
> +	if (start_from == -1) {

I guess we should define the notion of auto-select into a uapi
define instead of -1, which might not be overly obvious.

Anyway, extending Documentation/networking/packet_mmap.txt with
API details/examples at least for a non-RFC version is encouraged. ;)

> +		unsigned int count = 0;
> +
> +		for (qpair_index = adapter->num_rx_queues;
> +		     qpair_index < MAX_RX_QUEUES;
> +		     qpair_index++) {
> +			if (!adapter->user_queue_info[qpair_index].sk_handle) {
> +				count++;
> +				if (count == qpairs_num) {
> +					start_from = qpair_index - count + 1;
> +					break;
> +				}
> +			} else {
> +				count = 0;
> +			}
> +		}
> +	}
> +
> +	/* otherwise the caller specified exact queues */
> +	if ((start_from > MAX_TX_QUEUES) ||
> +	    (start_from > MAX_RX_QUEUES) ||
> +	    (start_from + qpairs_num > MAX_TX_QUEUES) ||
> +	    (start_from + qpairs_num > MAX_RX_QUEUES))
> +		return -EINVAL;

Shouldn't this be '>=' if I see this correctly?

> +	/* If the qpairs are being used by the driver do not let user space
> +	 * consume the queues. Also if the queue has already been allocated
> +	 * to a socket do fail the request.
> +	 */
> +	for (qpair_index = start_from;
> +	     qpair_index < start_from + qpairs_num;
> +	     qpair_index++) {
> +		if ((qpair_index < adapter->num_tx_queues) ||
> +		    (qpair_index < adapter->num_rx_queues))
> +			return -EINVAL;
> +
> +		if (adapter->user_queue_info[qpair_index].sk_handle)
> +			return -EBUSY;
> +	}
> +
> +	/* remember the sk handle for each queue pair */
> +	for (qpair_index = start_from;
> +	     qpair_index < start_from + qpairs_num;
> +	     qpair_index++) {
> +		adapter->user_queue_info[qpair_index].sk_handle = sk;
> +		adapter->user_queue_info[qpair_index].num_of_regions = 0;
> +	}
> +
> +	return 0;
> +}

I guess many drivers would need to implement similar code, do you see
a chance to move generic parts to the core, at least for some helper
functions?

Thanks,
Daniel

^ permalink raw reply

* Re: [RFC PATCHv1 net-next] xen-netback: always fully coalesce guest Rx packets
From: Wei Liu @ 2015-01-13 14:30 UTC (permalink / raw)
  To: David Vrabel; +Cc: netdev, xen-devel, Ian Campbell, Wei Liu
In-Reply-To: <1421157917-31333-1-git-send-email-david.vrabel@citrix.com>

On Tue, Jan 13, 2015 at 02:05:17PM +0000, David Vrabel wrote:
> Always fully coalesce guest Rx packets into the minimum number of ring
> slots.  Reducing the number of slots per packet has significant
> performance benefits (e.g., 7.2 Gbit/s to 11 Gbit/s in an off-host
> receive test).
> 

Good number.

> However, this does increase the number of grant ops per packet which
> decreases performance with some workloads (intrahost VM to VM)

Do you have figures before and after this change?

> /unless/ grant copy has been optimized for adjacent ops with the same
> source or destination (see "grant-table: defer releasing pages
> acquired in a grant copy"[1]).
> 
> Do we need to retain the existing path and make the always coalesce
> path conditional on a suitable version of Xen?
> 

It the new path improves off-host RX on all Xen versions and doesn't
degrade intrahost VM to VM RX that much, I think we should use it
unconditionally.  Is intrahost VM to VM RX important to XenServer?

I don't consider intrahost VM to VM RX a very important use case, at
least not as important as off-host RX. I would expect in a could
environment users would not count on their VMs reside on the same host.
Plus, some could provider might deliberately route traffic off-host for
various reasons even if VMs are on the same host.  (Verizon for one,
mentioned they do that during last year's Xen Summit IIRC).

Others might disagree. Let's wait for other people to chime in.

> [1] http://lists.xen.org/archives/html/xen-devel/2015-01/msg01118.html
> 
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> ---
>  drivers/net/xen-netback/common.h  |    1 -
>  drivers/net/xen-netback/netback.c |  106 ++-----------------------------------
>  2 files changed, 3 insertions(+), 104 deletions(-)

Love the diffstat!

Wei.

^ permalink raw reply

* Re: [PATCH] rocker: fix harmless warning on 32-bit machines
From: Jiri Pirko @ 2015-01-13 14:36 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: netdev, David Miller, Scott Feldman, linux-arm-kernel
In-Reply-To: <4047824.AYNhYQQ6UI@wuerfel>

Tue, Jan 13, 2015 at 03:23:52PM CET, arnd@arndb.de wrote:
>The rocker driver tries to assign a pointer to a 64-bit integer
>and then back to a pointer. This is safe on all architectures,
>but causes a compiler warning when pointers are shorter than
>64-bit:
>
>rocker/rocker.c: In function 'rocker_desc_cookie_ptr_get':
>rocker/rocker.c:809:9: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
>  return (void *) desc_info->desc->cookie;
>         ^
>
>This adds another cast to uintptr_t to tell the compiler
>that it's safe.
>
>Signed-off-by: Arnd Bergmann <arnd@arndb.de>


Acked-by: Jiri Pirko <jiri@resnulli.us>

>
>diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
>index 2f398fa4b9e6..cad8cf962cdf 100644
>--- a/drivers/net/ethernet/rocker/rocker.c
>+++ b/drivers/net/ethernet/rocker/rocker.c
>@@ -806,13 +806,13 @@ static bool rocker_desc_gen(struct rocker_desc_info *desc_info)
> 
> static void *rocker_desc_cookie_ptr_get(struct rocker_desc_info *desc_info)
> {
>-	return (void *) desc_info->desc->cookie;
>+	return (void *)(uintptr_t)desc_info->desc->cookie;
> }
> 
> static void rocker_desc_cookie_ptr_set(struct rocker_desc_info *desc_info,
> 				       void *ptr)
> {
>-	desc_info->desc->cookie = (long) ptr;
>+	desc_info->desc->cookie = (uintptr_t) ptr;
> }
> 
> static struct rocker_desc_info *
>

^ permalink raw reply

* [PATCH 1/6] virtio/9p: verify device has config space
From: Michael S. Tsirkin @ 2015-01-13 14:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Eric Van Hensbergen, Ron Minnich, Latchesar Ionkov,
	David S. Miller, v9fs-developer, netdev
In-Reply-To: <1421160167-18498-1-git-send-email-mst@redhat.com>

Some devices might not implement config space access
(e.g. remoteproc used not to - before 3.9).
virtio/9p needs config space access so make it
fail gracefully if not there.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 net/9p/trans_virtio.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index daa749c..d8e376a 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -524,6 +524,12 @@ static int p9_virtio_probe(struct virtio_device *vdev)
 	int err;
 	struct virtio_chan *chan;
 
+	if (!vdev->config->get) {
+		dev_err(&vdev->dev, "%s failure: config access disabled\n",
+			__func__);
+		return -EINVAL;
+	}
+
 	chan = kmalloc(sizeof(struct virtio_chan), GFP_KERNEL);
 	if (!chan) {
 		pr_err("Failed to allocate virtio 9P channel\n");
-- 
MST

^ permalink raw reply related

* [PATCH 4/6] virtio/net: verify device has config space
From: Michael S. Tsirkin @ 2015-01-13 14:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, virtualization
In-Reply-To: <1421160167-18498-1-git-send-email-mst@redhat.com>

Some devices might not implement config space access
(e.g. remoteproc used not to - before 3.9).
virtio/net needs config space access so make it
fail gracefully if not there.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/virtio_net.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5ca9771..9bc1072 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1713,6 +1713,12 @@ static int virtnet_probe(struct virtio_device *vdev)
 	struct virtnet_info *vi;
 	u16 max_queue_pairs;
 
+	if (!vdev->config->get) {
+		dev_err(&vdev->dev, "%s failure: config access disabled\n",
+			__func__);
+		return -EINVAL;
+	}
+
 	if (!virtnet_validate_features(vdev))
 		return -EINVAL;
 
-- 
MST

^ permalink raw reply related

* Re: [Xen-devel] [PATCH 08/14] xen-netback: use foreign page information from the pages themselves
From: David Vrabel @ 2015-01-13 14:43 UTC (permalink / raw)
  To: David Vrabel, xen-devel, David S. Miller
  Cc: Boris Ostrovsky, Jenny Herbert, netdev@vger.kernel.org
In-Reply-To: <1421077417-7162-9-git-send-email-david.vrabel@citrix.com>

On 12/01/15 15:43, David Vrabel wrote:
> From: Jenny Herbert <jenny.herbert@citrix.com>
> 
> Use the foreign page flag in netback to get the domid and grant ref
> needed for the grant copy.  This signficiantly simplifies the netback
> code and makes netback work with foreign pages from other backends
> (e.g., blkback).
> 
> This allows blkback to use iSCSI disks provided by domUs running on
> the same host.

Dave,

This depends on several xen changes.  It's been Acked-by: Ian Campbell
<ian.campbell@citrix.com>

Are you happy for me to merge this via the xen tree in 3.20?

David

> Signed-off-by: Jenny Herbert <jennifer.herbert@citrix.com>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> ---
>  drivers/net/xen-netback/netback.c |  100 ++++---------------------------------
>  1 file changed, 9 insertions(+), 91 deletions(-)
> 
> diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
> index 6441318..ae3ab37 100644
> --- a/drivers/net/xen-netback/netback.c
> +++ b/drivers/net/xen-netback/netback.c
> @@ -314,9 +314,7 @@ static struct xenvif_rx_meta *get_next_rx_buffer(struct xenvif_queue *queue,
>  static void xenvif_gop_frag_copy(struct xenvif_queue *queue, struct sk_buff *skb,
>  				 struct netrx_pending_operations *npo,
>  				 struct page *page, unsigned long size,
> -				 unsigned long offset, int *head,
> -				 struct xenvif_queue *foreign_queue,
> -				 grant_ref_t foreign_gref)
> +				 unsigned long offset, int *head)
>  {
>  	struct gnttab_copy *copy_gop;
>  	struct xenvif_rx_meta *meta;
> @@ -333,6 +331,8 @@ static void xenvif_gop_frag_copy(struct xenvif_queue *queue, struct sk_buff *skb
>  	offset &= ~PAGE_MASK;
>  
>  	while (size > 0) {
> +		struct xen_page_foreign *foreign;
> +
>  		BUG_ON(offset >= PAGE_SIZE);
>  		BUG_ON(npo->copy_off > MAX_BUFFER_OFFSET);
>  
> @@ -361,9 +361,10 @@ static void xenvif_gop_frag_copy(struct xenvif_queue *queue, struct sk_buff *skb
>  		copy_gop->flags = GNTCOPY_dest_gref;
>  		copy_gop->len = bytes;
>  
> -		if (foreign_queue) {
> -			copy_gop->source.domid = foreign_queue->vif->domid;
> -			copy_gop->source.u.ref = foreign_gref;
> +		foreign = xen_page_foreign(page);
> +		if (foreign) {
> +			copy_gop->source.domid = foreign->domid;
> +			copy_gop->source.u.ref = foreign->gref;
>  			copy_gop->flags |= GNTCOPY_source_gref;
>  		} else {
>  			copy_gop->source.domid = DOMID_SELF;
> @@ -406,35 +407,6 @@ static void xenvif_gop_frag_copy(struct xenvif_queue *queue, struct sk_buff *skb
>  }
>  
>  /*
> - * Find the grant ref for a given frag in a chain of struct ubuf_info's
> - * skb: the skb itself
> - * i: the frag's number
> - * ubuf: a pointer to an element in the chain. It should not be NULL
> - *
> - * Returns a pointer to the element in the chain where the page were found. If
> - * not found, returns NULL.
> - * See the definition of callback_struct in common.h for more details about
> - * the chain.
> - */
> -static const struct ubuf_info *xenvif_find_gref(const struct sk_buff *const skb,
> -						const int i,
> -						const struct ubuf_info *ubuf)
> -{
> -	struct xenvif_queue *foreign_queue = ubuf_to_queue(ubuf);
> -
> -	do {
> -		u16 pending_idx = ubuf->desc;
> -
> -		if (skb_shinfo(skb)->frags[i].page.p ==
> -		    foreign_queue->mmap_pages[pending_idx])
> -			break;
> -		ubuf = (struct ubuf_info *) ubuf->ctx;
> -	} while (ubuf);
> -
> -	return ubuf;
> -}
> -
> -/*
>   * Prepare an SKB to be transmitted to the frontend.
>   *
>   * This function is responsible for allocating grant operations, meta
> @@ -459,8 +431,6 @@ static int xenvif_gop_skb(struct sk_buff *skb,
>  	int head = 1;
>  	int old_meta_prod;
>  	int gso_type;
> -	const struct ubuf_info *ubuf = skb_shinfo(skb)->destructor_arg;
> -	const struct ubuf_info *const head_ubuf = ubuf;
>  
>  	old_meta_prod = npo->meta_prod;
>  
> @@ -507,68 +477,16 @@ static int xenvif_gop_skb(struct sk_buff *skb,
>  			len = skb_tail_pointer(skb) - data;
>  
>  		xenvif_gop_frag_copy(queue, skb, npo,
> -				     virt_to_page(data), len, offset, &head,
> -				     NULL,
> -				     0);
> +				     virt_to_page(data), len, offset, &head);
>  		data += len;
>  	}
>  
>  	for (i = 0; i < nr_frags; i++) {
> -		/* This variable also signals whether foreign_gref has a real
> -		 * value or not.
> -		 */
> -		struct xenvif_queue *foreign_queue = NULL;
> -		grant_ref_t foreign_gref;
> -
> -		if ((skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) &&
> -			(ubuf->callback == &xenvif_zerocopy_callback)) {
> -			const struct ubuf_info *const startpoint = ubuf;
> -
> -			/* Ideally ubuf points to the chain element which
> -			 * belongs to this frag. Or if frags were removed from
> -			 * the beginning, then shortly before it.
> -			 */
> -			ubuf = xenvif_find_gref(skb, i, ubuf);
> -
> -			/* Try again from the beginning of the list, if we
> -			 * haven't tried from there. This only makes sense in
> -			 * the unlikely event of reordering the original frags.
> -			 * For injected local pages it's an unnecessary second
> -			 * run.
> -			 */
> -			if (unlikely(!ubuf) && startpoint != head_ubuf)
> -				ubuf = xenvif_find_gref(skb, i, head_ubuf);
> -
> -			if (likely(ubuf)) {
> -				u16 pending_idx = ubuf->desc;
> -
> -				foreign_queue = ubuf_to_queue(ubuf);
> -				foreign_gref =
> -					foreign_queue->pending_tx_info[pending_idx].req.gref;
> -				/* Just a safety measure. If this was the last
> -				 * element on the list, the for loop will
> -				 * iterate again if a local page were added to
> -				 * the end. Using head_ubuf here prevents the
> -				 * second search on the chain. Or the original
> -				 * frags changed order, but that's less likely.
> -				 * In any way, ubuf shouldn't be NULL.
> -				 */
> -				ubuf = ubuf->ctx ?
> -					(struct ubuf_info *) ubuf->ctx :
> -					head_ubuf;
> -			} else
> -				/* This frag was a local page, added to the
> -				 * array after the skb left netback.
> -				 */
> -				ubuf = head_ubuf;
> -		}
>  		xenvif_gop_frag_copy(queue, skb, npo,
>  				     skb_frag_page(&skb_shinfo(skb)->frags[i]),
>  				     skb_frag_size(&skb_shinfo(skb)->frags[i]),
>  				     skb_shinfo(skb)->frags[i].page_offset,
> -				     &head,
> -				     foreign_queue,
> -				     foreign_queue ? foreign_gref : UINT_MAX);
> +				     &head);
>  	}
>  
>  	return npo->meta_prod - old_meta_prod;
> 

^ permalink raw reply

* Re: why are IPv6 addresses removed on link down
From: David Ahern @ 2015-01-13 14:53 UTC (permalink / raw)
  To: Hannes Frederic Sowa, YOSHIFUJI Hideaki
  Cc: Stephen Hemminger, netdev@vger.kernel.org
In-Reply-To: <1421152613.13626.24.camel@redhat.com>

On 1/13/15 5:36 AM, Hannes Frederic Sowa wrote:
> Hi,
>
> On Di, 2015-01-13 at 21:15 +0900, YOSHIFUJI Hideaki wrote:
>> YOSHIFUJI Hideaki wrote:
>>> Hi,
>>>
>>> Hannes Frederic Sowa wrote:
>>>> On Mo, 2015-01-12 at 23:10 -0800, Stephen Hemminger wrote:
>>>>> On Mon, 12 Jan 2015 22:06:44 -0700
>>>>> David Ahern <dsahern@gmail.com> wrote:
>>>>>
>>>>>> We noticed that IPv6 addresses are removed on a link down. e.g.,
>>>>>>      ip link set dev eth1
>>>>>>
>>>>>>
>>>>>> Looking at the code it appears to be this code path in addrconf.c:
>>>>>>
>>>>>>            case NETDEV_DOWN:
>>>>>>            case NETDEV_UNREGISTER:
>>>>>>                    /*
>>>>>>                     *      Remove all addresses from this interface.
>>>>>>                     */
>>>>>>                    addrconf_ifdown(dev, event != NETDEV_DOWN);
>>>>>>                    break;
>>>>>>
>>>>>> IPv4 addresses are NOT removed on a link down. Is there a particular
>>>>>> reason IPv6 addresses are?
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>
>>>>> See RFC's which describes how IPv6 does Duplicate Address Detection.
>>>>> Address is not valid when link is down, since DAD is not possible.
>>>>
>>>> It should be no problem if the kernel would reacquire them on ifup and
>>>> do proper DAD. We simply must not use them while the interface is dead
>>>> (also making sure they don't get used for loopback routing).
>>>>
>>>> The problem the IPv6 addresses get removed is much more a historical
>>>> artifact nowadays, I think. It is part of user space API and scripts
>>>> deal with that already.
>>>
>>> We might have another "detached" state which essintially drops
>>> outgoing packets while link is down.  Just after recovering link,
>>> we could start receiving packet from the link and perform optimistic
>>> DAD. And then, after it succeeds, we may start sending packets.
>>>
>>> Since "detached" state is like the state just before completing
>>> Optimistic DAD, it is not so difficult to implement this extended
>>> behavior, I guess.
>>>
>>
>> Note that node is allowed to send packets to neighbours or default
>> routers if the node knows their link-layer addresses during Optimistic
>> DAD.
>>
>
> I don't think it should be a problem from internal state handling of the
> addresses.
>
> I am much more concerned with scripts expecting the addresses to be
> flushed on interface down/up and not reacting appropriate.

The current code seems inconsistent: I can put an IPv6 address on a link 
in the down state. On a link up the address is retained. Only on a 
subsequent link down is it removed. If DAD or anything else is the 
reason for the current logic then why allow an address to be assigned in 
the down state? Similarly that it currently seems to work ok then it 
suggests the right thing is done on a link up in which case a flush is 
not needed.

Bottom line is there a harm in removing the flush? If there is no harm 
will mainline kernel take a patch to do that or is your backward 
compatibility concern enough to block it?

David

^ permalink raw reply

* Re: why are IPv6 addresses removed on link down
From: Hannes Frederic Sowa @ 2015-01-13 15:00 UTC (permalink / raw)
  To: David Ahern; +Cc: YOSHIFUJI Hideaki, Stephen Hemminger, netdev@vger.kernel.org
In-Reply-To: <54B53187.7080306@gmail.com>

On Di, 2015-01-13 at 07:53 -0700, David Ahern wrote:
> On 1/13/15 5:36 AM, Hannes Frederic Sowa wrote:
> > Hi,
> >
> > On Di, 2015-01-13 at 21:15 +0900, YOSHIFUJI Hideaki wrote:
> >> YOSHIFUJI Hideaki wrote:
> >>> Hi,
> >>>
> >>> Hannes Frederic Sowa wrote:
> >>>> On Mo, 2015-01-12 at 23:10 -0800, Stephen Hemminger wrote:
> >>>>> On Mon, 12 Jan 2015 22:06:44 -0700
> >>>>> David Ahern <dsahern@gmail.com> wrote:
> >>>>>
> >>>>>> We noticed that IPv6 addresses are removed on a link down. e.g.,
> >>>>>>      ip link set dev eth1
> >>>>>>
> >>>>>>
> >>>>>> Looking at the code it appears to be this code path in addrconf.c:
> >>>>>>
> >>>>>>            case NETDEV_DOWN:
> >>>>>>            case NETDEV_UNREGISTER:
> >>>>>>                    /*
> >>>>>>                     *      Remove all addresses from this interface.
> >>>>>>                     */
> >>>>>>                    addrconf_ifdown(dev, event != NETDEV_DOWN);
> >>>>>>                    break;
> >>>>>>
> >>>>>> IPv4 addresses are NOT removed on a link down. Is there a particular
> >>>>>> reason IPv6 addresses are?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> David
> >>>>>
> >>>>> See RFC's which describes how IPv6 does Duplicate Address Detection.
> >>>>> Address is not valid when link is down, since DAD is not possible.
> >>>>
> >>>> It should be no problem if the kernel would reacquire them on ifup and
> >>>> do proper DAD. We simply must not use them while the interface is dead
> >>>> (also making sure they don't get used for loopback routing).
> >>>>
> >>>> The problem the IPv6 addresses get removed is much more a historical
> >>>> artifact nowadays, I think. It is part of user space API and scripts
> >>>> deal with that already.
> >>>
> >>> We might have another "detached" state which essintially drops
> >>> outgoing packets while link is down.  Just after recovering link,
> >>> we could start receiving packet from the link and perform optimistic
> >>> DAD. And then, after it succeeds, we may start sending packets.
> >>>
> >>> Since "detached" state is like the state just before completing
> >>> Optimistic DAD, it is not so difficult to implement this extended
> >>> behavior, I guess.
> >>>
> >>
> >> Note that node is allowed to send packets to neighbours or default
> >> routers if the node knows their link-layer addresses during Optimistic
> >> DAD.
> >>
> >
> > I don't think it should be a problem from internal state handling of the
> > addresses.
> >
> > I am much more concerned with scripts expecting the addresses to be
> > flushed on interface down/up and not reacting appropriate.
> 
> The current code seems inconsistent: I can put an IPv6 address on a link 
> in the down state. On a link up the address is retained. Only on a 
> subsequent link down is it removed. If DAD or anything else is the 
> reason for the current logic then why allow an address to be assigned in 
> the down state? Similarly that it currently seems to work ok then it 
> suggests the right thing is done on a link up in which case a flush is 
> not needed.
> 
> Bottom line is there a harm in removing the flush? If there is no harm 
> will mainline kernel take a patch to do that or is your backward 
> compatibility concern enough to block it?

This was already discussed several times here, e.g. one patch I just 
found:

http://lists.openwall.net/netdev/2011/01/24/8
and
http://patchwork.ozlabs.org/patch/17558/

Albeit I hate sysctls for things like this, it might I tend to find it
acceptable because it solves a problem which happened to lots of people.
And I don't like the current behavior neither.

I think this can work, but we should follow up all the old discussions
to not introduce any kind of new undesired behavior this time.

Thanks,
Hannes

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox