Netdev List
 help / color / mirror / Atom feed
* [PATCH] bonding: Don't allow bond devices to change network namespaces.
From: Chen Weilong @ 2014-01-22  9:16 UTC (permalink / raw)
  To: fubar, vfalico, andy, davem; +Cc: netdev

From: Weilong Chen <chenweilong@huawei.com>

Like bridge, bonding as netdevice doesn't cross netns boundaries.

Bonding ports and bonding itself live in same netns.

Signed-off-by: Weilong Chen <chenweilong@huawei.com>
---
 drivers/net/bonding/bond_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index f00dd45..897d153 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3916,6 +3916,9 @@ void bond_setup(struct net_device *bond_dev)
 	 * capable
 	 */
 
+	/* Don't allow bond devices to change network namespaces. */
+	bond_dev->features |= NETIF_F_NETNS_LOCAL;
+
 	bond_dev->hw_features = BOND_VLAN_FEATURES |
 				NETIF_F_HW_VLAN_CTAG_TX |
 				NETIF_F_HW_VLAN_CTAG_RX |
-- 
1.7.12

^ permalink raw reply related

* [PATCH net-next 3/3] bonding: cleanup some redundant code and wrong variables
From: Ding Tianhong @ 2014-01-22  9:22 UTC (permalink / raw)
  To: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, David S. Miller,
	Netdev

The dev_set_mac_address() will check the dev->netdev_ops->ndo_set_mac_address,
so no need to check it in bond_set_mac_address().

Fix the wrong variables for pr_err().

Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
---
 drivers/net/bonding/bond_main.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index ce0f5c0..9d92f46 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3509,15 +3509,7 @@ static int bond_set_mac_address(struct net_device *bond_dev, void *addr)
 	 */
 
 	bond_for_each_slave(bond, slave, iter) {
-		const struct net_device_ops *slave_ops = slave->dev->netdev_ops;
 		pr_debug("slave %p %s\n", slave, slave->dev->name);
-
-		if (slave_ops->ndo_set_mac_address == NULL) {
-			res = -EOPNOTSUPP;
-			pr_debug("EOPNOTSUPP %s\n", slave->dev->name);
-			goto unwind;
-		}
-
 		res = dev_set_mac_address(slave->dev, addr);
 		if (res) {
 			/* TODO: consider downing the slave
@@ -4317,7 +4309,7 @@ static int bond_check_params(struct bond_params *params)
 						      fail_over_mac_tbl);
 		if (fail_over_mac_value == -1) {
 			pr_err("Error: invalid fail_over_mac \"%s\"\n",
-			       arp_validate == NULL ? "NULL" : arp_validate);
+			       fail_over_mac == NULL ? "NULL" : fail_over_mac);
 			return -EINVAL;
 		}
 
-- 
1.8.0

^ permalink raw reply related

* [PATCH net-next 1/3] bonding: Set the correct value to fail_over_mac at enslavement
From: Ding Tianhong @ 2014-01-22  9:22 UTC (permalink / raw)
  To: Jay Vosburgh, Veaceslav Falico, David S. Miller, Netdev,
	Andy Gospodarek

If the new slave don't support setting the MAC address, there are
two ways to handle this situation:

1). If the new slave is the first slave, set bond to the new slave's
    MAC address, if the mode is active-backup, set fail_over_mac to
    active, otherwise set fail_over_mac to none.
2). If the new slave is not the first slave and the fail_over_mac is
    active, it means that the slave could work normally in active-backup
    mode, otherwise if the fail_over_mac is none, the slave could not
    work normally for no active-backup mode, so bond could not ensalve
    the new dev.

Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
---
 drivers/net/bonding/bond_main.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 3220b48..598f100 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1334,9 +1334,17 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 
 	if (slave_ops->ndo_set_mac_address == NULL) {
 		if (!bond_has_slaves(bond)) {
-			pr_warning("%s: Warning: The first slave device specified does not support setting the MAC address. Setting fail_over_mac to active.",
+			pr_warning("%s: Warning: The first slave device specified does not support setting the MAC address.\n",
 				   bond_dev->name);
-			bond->params.fail_over_mac = BOND_FOM_ACTIVE;
+			if (bond->params.mode == BOND_MODE_ACTIVEBACKUP) {
+				bond->params.fail_over_mac = BOND_FOM_ACTIVE;
+				pr_warning("%s: Setting fail_over_mac to active for active-backup mode.\n",
+					   bond_dev->name);
+			} else {
+				bond->params.fail_over_mac = BOND_FOM_NONE;
+				pr_warning("%s: Setting fail_over_mac to none for no active-backup modes",
+					   bond_dev->name);
+			}
 		} else if (bond->params.fail_over_mac != BOND_FOM_ACTIVE) {
 			pr_err("%s: Error: The slave device specified does not support setting the MAC address, but fail_over_mac is not set to active.\n",
 			       bond_dev->name);
-- 
1.8.0

^ permalink raw reply related

* [PATCH net-next 2/3] bonding: only affect active-backup mode when fail_over_mac is not none
From: Ding Tianhong @ 2014-01-22  9:22 UTC (permalink / raw)
  To: Jay Vosburgh, Veaceslav Falico, David S. Miller, Netdev,
	Andy Gospodarek

According bonding.txt, the fail_over_mac only affect active-backup mode,
but now the fail_over_mac could be set to active or follow in every mode,
this will make the new slave could not be set to bond's MAC address at
enslavement for several modes, such as RR or XOR.

So this patch fix two problems:
1). If the fail_over_mac is active or follow and the mode is not active-backup,
    the new slave could be set to the bond's MAC address,

2). The first slave could work normally regardless it could support setting MAC
    address or not, and the bonding always be set to first slave's MAC address,
    so no need to set the first slave to bond's MAC address again when fail_over_mac
    is none at enslavement.

Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
---
 drivers/net/bonding/bond_main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 598f100..ce0f5c0 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1387,7 +1387,9 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 	 */
 	memcpy(new_slave->perm_hwaddr, slave_dev->dev_addr, ETH_ALEN);
 
-	if (!bond->params.fail_over_mac) {
+	if (bond_has_slaves(bond) &&
+	    (!bond->params.fail_over_mac ||
+	     bond->params.mode != BOND_MODE_ACTIVEBACKUP)) {
 		/*
 		 * Set slave to master's mac address.  The application already
 		 * set the master's mac address to that of the first slave
-- 
1.8.0

^ permalink raw reply related

* [PATCH net-next 0/3] bonding: Fix some issues for fail_over_mac
From: Ding Tianhong @ 2014-01-22  9:21 UTC (permalink / raw)
  To: Jay Vosburgh, Veaceslav Falico, David S. Miller, Netdev,
	Andy Gospodarek

The parameter fail_over_mac only affect active-backup mode, if it was
set to active or follow and works with other modes, just like RR or XOR
mode, the bonding could not set all slaves to the master's address, it
will cause the slave could not work well with master.

So set the fail_over_mac to none if the mode is not active-backup and
slight optimization for bond_set_mac_address().

v1->v2: According Jay's suggestion, that we should permit setting an option
	at any time, but only have it take effect in active-backup mode, so
	I add mode checking together with fail_over_mac during enslavement and
	rebuild the patches.

Regards
Ding

Ding Tianhong (3):
  bonding: Set the correct value to fail_over_mac at enslavement
  bonding: only affect active-backup mode when fail_over_mac is not
    none
  bonding: cleanup some redundant code and wrong variables

 drivers/net/bonding/bond_main.c | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

-- 
1.8.0

^ permalink raw reply

* Re: [PATCH] bonding: Don't allow bond devices to change network namespaces.
From: Jiri Pirko @ 2014-01-22  9:41 UTC (permalink / raw)
  To: Chen Weilong; +Cc: fubar, vfalico, andy, davem, netdev
In-Reply-To: <1390382190-1604-1-git-send-email-chenweilong@huawei.com>

Wed, Jan 22, 2014 at 10:16:30AM CET, chenweilong@huawei.com wrote:
>From: Weilong Chen <chenweilong@huawei.com>
>
>Like bridge, bonding as netdevice doesn't cross netns boundaries.
>
>Bonding ports and bonding itself live in same netns.

I think should should be done for team as well. Openvs already
has this. I believe that for vlans it is ok to change ns, right?

>
>Signed-off-by: Weilong Chen <chenweilong@huawei.com>
>---
> drivers/net/bonding/bond_main.c | 3 +++
> 1 file changed, 3 insertions(+)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index f00dd45..897d153 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -3916,6 +3916,9 @@ void bond_setup(struct net_device *bond_dev)
> 	 * capable
> 	 */
> 
>+	/* Don't allow bond devices to change network namespaces. */
>+	bond_dev->features |= NETIF_F_NETNS_LOCAL;
>+
> 	bond_dev->hw_features = BOND_VLAN_FEATURES |
> 				NETIF_F_HW_VLAN_CTAG_TX |
> 				NETIF_F_HW_VLAN_CTAG_RX |
>-- 
>1.7.12
>
>
>--
>To unsubscribe from this list: send the line "unsubscribe netdev" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH RFC 4/6] net: rfkill: gpio: add device tree support
From: Linus Walleij @ 2014-01-22  9:54 UTC (permalink / raw)
  To: Alexandre Courbot
  Cc: Arnd Bergmann,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Heikki Krogerus, netdev, linux-wireless, linux-sunxi,
	linux-kernel, Maxime Ripard, Chen-Yu Tsai, Johannes Berg,
	Mika Westerberg, David S. Miller,
	linux-gpio-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <CAAVeFuKFfAvSPYLmvWV5jjT-peZFJ8sJ2bbh4F=JAYoWLhjZpA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, Jan 21, 2014 at 3:53 PM, Alexandre Courbot <gnurou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Tue, Jan 21, 2014 at 9:35 PM, Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org> wrote:
>> On Tuesday 21 January 2014, Linus Walleij wrote:

>>> As discussed earlier in this thread I'm not sure the con_id is
>>> suitable for labelling GPIOs. It'd be better to have a proper name
>>> specified in DT/ACPI instead.
>>
>> +1
>
> I wonder why you guys prefer to have the name defined in the GPIO
> mapping. Having the driver decide the label makes it easier to look up
> which GPIO does what in debugfs, whereas nothing prevents people to
> name GPIOs whatever inadequate name they want in the device DT node.
> What am I overlooking here?

The proper name of a GPIO does not come from the driver but from the
usecase, i.e. often the name of the rail on the board where it is used.

Remember GPIO are per definition general purpose, they cannot get
any clever names from the driver. They would just be named
"chip-foo-gpio0" thru "chip-foo-gpioN" if the driver was to name them.

Using the rail name on the board is way more useful. A GPIO line
named "HAL_SENSOR" or "MMC_CD" is actually telling us what
that line is used for.

But such names can only come from a DT or ACPI table that has
knowledge of how the GPIO is used on a certain system/board and
not just from the driver.

Yours,
Linus Walleij

^ permalink raw reply

* Re: [PATCH RFC 4/6] net: rfkill: gpio: add device tree support
From: Linus Walleij @ 2014-01-22  9:58 UTC (permalink / raw)
  To: Alexandre Courbot
  Cc: Arnd Bergmann,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Heikki Krogerus, netdev, linux-wireless, linux-sunxi,
	linux-kernel, Maxime Ripard, Chen-Yu Tsai, Johannes Berg,
	Mika Westerberg, David S. Miller,
	linux-gpio-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <CAAVeFuKFfAvSPYLmvWV5jjT-peZFJ8sJ2bbh4F=JAYoWLhjZpA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, Jan 21, 2014 at 3:53 PM, Alexandre Courbot <gnurou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> The good (or bad, rather) thing about DT is that we can do whatever we
> please with the new bindings: decide which name or which index
> (doesn't matter here) a GPIO should have. However we don't have this
> control over ACPI, where nothing guarantees that the same index will
> be used for the same GPIO function.

It's not like ACPI will impose some tables on us and expect us to
use them as-is no matter how crazy they are. Then indeed the whole
idea of unifying ACPI and DT accessors would be moot and it would
only work by mistake or as a good joke.

The situation with ACPI is just like with DT, it is assumed that the
author of these tables also look at the Linux kernel drivers to figure
out what argument goes where and we can influence them.

Yours,
Linus Walleij

^ permalink raw reply

* Re: [PATCH v3 1/4] net_dma: simple removal
From: saeed bishara @ 2014-01-22 10:38 UTC (permalink / raw)
  To: Dan Williams
  Cc: dmaengine@vger.kernel.org, Alexander Duyck, Dave Jiang,
	Vinod Koul, netdev@vger.kernel.org, David Whipple, lkml,
	David S. Miller
In-Reply-To: <CAPcyv4h10vAw3rwNo25+RaZwFsnrjWrpoktj+Yg8eR1QDDW2tw@mail.gmail.com>

On Tue, Jan 21, 2014 at 11:44 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Fri, Jan 17, 2014 at 12:16 PM, saeed bishara <saeed.bishara@gmail.com> wrote:
>> Dan,
>>
>> isn't this issue similar to direct io case?
>> can you please look at the following article
>> http://lwn.net/Articles/322795/
>
> I guess it's similar, but the NET_DMA dma api violation is more
> blatant.  The same thread that requested DMA is also writing to those
> same pages with the cpu.  The fix is either guaranteeing that only the
> dma engine ever touches the gup'd pages or synchronizing dma before
> every cpu fallback.
>
>> regarding performance improvement using NET_DMA, I don't have concrete
>> numbers, but it should be around 15-20%. my system is i/o coherent.
>
> That sounds too high... is that throughput or cpu utilization?  It
that's the throughput improvement, my test is iperf server (no special
flags, 1500 mtu).
the iperf and 10G eth interrupts bound to same cpu which is the
bottleneck in my case.
I ran the following configurations
a. NET_DMA=n
b. NET_DMA=y
c. NET_DMA=y + your dma_debug patch below,
d. same as 3. by with my simple fix path.

results in Gbps:
a. 5.41
b. 6.17 (+14%)
c. 5.93 (+9%)
d. 5.92 (+9%)

 Dan, my simple fix is just to call tcp_service_net_dma(sk, true)
whenever the cpu is going to copy the data. proper fix ofcourse can be
smarter.
do you think this is sufficient?


--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1295,6 +1295,7 @@ static int tcp_recv_urg(struct sock *sk, struct
msghdr *msg, int len, int flags)
         */
        return -EAGAIN;
 }
+static void tcp_service_net_dma(struct sock *sk, bool wait);

 static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
 {
@@ -1302,6 +1303,7 @@ static int tcp_peek_sndq(struct sock *sk, struct
msghdr *msg, int len)
        int copied = 0, err = 0;

        /* XXX -- need to support SO_PEEK_OFF */
+       tcp_service_net_dma(sk, true);  /* Wait for queue to drain */

        skb_queue_walk(&sk->sk_write_queue, skb) {
                err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, skb->len);
@@ -1861,6 +1863,8 @@ do_prequeue:
                        } else
 #endif
                        {
+                               tcp_service_net_dma(sk, true);  /*
Wait for queue to drain */
+
                                err = skb_copy_datagram_iovec(skb, offset,
                                                msg->msg_iov, used);
                                if (err) {




> sounds high because NET_DMA also makes the data cache cold while the
> cpu copy warms the data before handing it to the application.
for iperf case the test doesn't touch the data.
also, for some applications, specially storage, the data can also be
moved using dma.
so this actually can be big advantage.
>
> Can you measure relative numbers and share your testing details?  You
> will need to fix the data corruption and verify that the performance
> advantage is still there before proposing NET_DMA be restored.
see above.
>
> I have a new dma_debug capability in Andrew's tree that can you help
> you identify holes in the implementation.
>
> http://ozlabs.org/~akpm/mmots/broken-out/dma-debug-introduce-debug_dma_assert_idle.patch
>
> --
> Dan
>
>>
>> saeed
>>
>> On Wed, Jan 15, 2014 at 11:33 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>> On Wed, Jan 15, 2014 at 1:31 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>>> On Wed, Jan 15, 2014 at 1:20 PM, saeed bishara <saeed.bishara@gmail.com> wrote:
>>>>> Hi Dan,
>>>>>
>>>>> I'm using net_dma on my system and I achieve meaningful performance
>>>>> boost when running Iperf receive.
>>>>>
>>>>> As far as I know the net_dma is used by many embedded systems out
>>>>> there and might effect their performance.
>>>>> Can you please elaborate on the exact scenario that cause the memory corruption?
>>>>>
>>>>> Is the scenario mentioned here caused by "real life" application or
>>>>> this is more of theoretical issue found through manual testing, I was
>>>>> trying to find the thread describing the failing scenario and couldn't
>>>>> find it, any pointer will be appreciated.
>>>>
>>>> Did you see the referenced commit?
>>>>
>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=77873803363c
>>>>
>>>> This is a real issue in that any app that forks() while receiving data
>>>> can cause the dma data to be lost.  The problem is that the copy
>>>> operation falls back to cpu at many locations.  Any one of those
>>>> instance could touch a mapped page and trigger a copy-on-write event.
>>>> The dma completes to the wrong location.
>>>>
>>>
>>> Btw, do you have benchmark data showing that NET_DMA is beneficial on
>>> these platforms?  I would have expected worse performance on platforms
>>> without i/o coherent caches.

^ permalink raw reply

* [patch net-next 6/6] rtnetlink: remove IFLA_SLAVES definition
From: Jiri Pirko @ 2014-01-22 10:43 UTC (permalink / raw)
  To: netdev
  Cc: davem, fubar, vfalico, andy, sfeldma, stephen, vyasevic,
	nicolas.dichtel, john.r.fastabend
In-Reply-To: <1390377957-31466-1-git-send-email-jiri@resnulli.us>

Not used anymore, and never should be.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 include/uapi/linux/if_link.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index b8fb352..ed7b2f8 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -144,7 +144,6 @@ enum {
 	IFLA_NUM_RX_QUEUES,
 	IFLA_CARRIER,
 	IFLA_PHYS_PORT_ID,
-	IFLA_BOND_SLAVE,
 	__IFLA_MAX
 };
 
-- 
1.8.3.1

^ permalink raw reply related

* Re: [Xen-devel] [PATCH v4] xen/grant-table: Avoid m2p_override during mapping
From: David Vrabel @ 2014-01-22 10:58 UTC (permalink / raw)
  To: Zoltan Kiss
  Cc: ian.campbell, wei.liu2, xen-devel, netdev, linux-kernel,
	jonathan.davies, Stefano Stabellini
In-Reply-To: <1390335755-29328-1-git-send-email-zoltan.kiss@citrix.com>

On 21/01/14 20:22, Zoltan Kiss wrote:
> The grant mapping API does m2p_override unnecessarily: only gntdev needs it,
> for blkback and future netback patches it just cause a lock contention, as
> those pages never go to userspace. Therefore this series does the following:
> - the original functions were renamed to __gnttab_[un]map_refs, with a new
>   parameter m2p_override
> - based on m2p_override either they follow the original behaviour, or just set
>   the private flag and call set_phys_to_machine
> - gnttab_[un]map_refs are now a wrapper to call __gnttab_[un]map_refs with
>   m2p_override false
> - a new function gnttab_[un]map_refs_userspace provides the old behaviour
> 
> v2:
> - move the storing of the old mfn in page->index to gnttab_map_refs
> - move the function header update to a separate patch
> 
> v3:
> - a new approach to retain old behaviour where it needed
> - squash the patches into one
> 
> v4:
> - move out the common bits from m2p* functions, and pass pfn/mfn as parameter
> - clear page->private before doing anything with the page, so m2p_find_override
>   won't race with this

I would prune this version information out of the commit message.

> Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
> Suggested-by: David Vrabel <david.vrabel@citrix.com>

Reviewed-by: David Vrabel <david.vrabel@citrix.com>

But I think this needs Stefano's ack.

It would be useful to get this into 3.14 if possible.

David

^ permalink raw reply

* Re: [PATCH RFC 4/6] net: rfkill: gpio: add device tree support
From: Mika Westerberg @ 2014-01-22 11:00 UTC (permalink / raw)
  To: Linus Walleij
  Cc: Alexandre Courbot, Arnd Bergmann,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Heikki Krogerus, netdev, linux-wireless, linux-sunxi,
	linux-kernel, Maxime Ripard, Chen-Yu Tsai, Johannes Berg,
	David S. Miller,
	linux-gpio-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <CACRpkdZbb=eO8YRtMn6hW0vn97PkHUo88e_o61DEC8=wPY3_PQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Wed, Jan 22, 2014 at 10:58:38AM +0100, Linus Walleij wrote:
> On Tue, Jan 21, 2014 at 3:53 PM, Alexandre Courbot <gnurou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> 
> > The good (or bad, rather) thing about DT is that we can do whatever we
> > please with the new bindings: decide which name or which index
> > (doesn't matter here) a GPIO should have. However we don't have this
> > control over ACPI, where nothing guarantees that the same index will
> > be used for the same GPIO function.
> 
> It's not like ACPI will impose some tables on us and expect us to
> use them as-is no matter how crazy they are. Then indeed the whole
> idea of unifying ACPI and DT accessors would be moot and it would
> only work by mistake or as a good joke.

With the current ACPI version we really don't have much of a choice here.
Only way we can distinguish between a set of GPIOs is to use index and hope
that the vendor puts the GPIOs in the table in same order that the driver
expects :-(

This is going to get a bit better when ACPI gets properties (like Arnd
commented already) as then we can get the correct index based on a name in
a table. However, it still requires the vendor to use naming that is
suitable for Linux driver in question.

> The situation with ACPI is just like with DT, it is assumed that the
> author of these tables also look at the Linux kernel drivers to figure
> out what argument goes where and we can influence them.

I certainly hope, now that Android is becoming more and more important,
that the vendors will actually check what the Linux drivers expect and
populate the tables with such information that is suitable for both ACPI
and DT enabled driver.

^ permalink raw reply

* Re: [PATCH v5] can: add Renesas R-Car CAN driver
From: Ben Dooks @ 2014-01-22 11:52 UTC (permalink / raw)
  To: Marc Kleine-Budde
  Cc: Sergei Shtylyov, David Miller, geert, netdev, wg, linux-can,
	linux-sh, vksavl
In-Reply-To: <52DD927B.2060500@pengutronix.de>

On 20/01/14 21:17, Marc Kleine-Budde wrote:
> On 01/20/2014 11:12 PM, Sergei Shtylyov wrote:
> [...]
>
>>> I truly think using packed here is rediculous, please remove it unless
>>> you can prove that things won't work without it.
>>
>>     They will. But how about the following 'struct rcar_can_regs'?
>
> The strcuts in question are:
>
>> +/* Mailbox registers structure */
>> +struct rcar_can_mbox_regs {
>> +	u32 id;		/* IDE and RTR bits, SID and EID */
>> +	u8 stub;	/* Not used */
>> +	u8 dlc;		/* Data Length Code - bits [0..3] */
>> +	u8 data[8];	/* Data Bytes */
>> +	u8 tsh;		/* Time Stamp Higher Byte */
>> +	u8 tsl;		/* Time Stamp Lower Byte */
>> +} __packed;
>> +
>> +struct rcar_can_regs {
>> +	struct rcar_can_mbox_regs mb[RCAR_CAN_N_MBX]; /* Mailbox registers */
>> +	u32 mkr_2_9[8];	/* Mask Registers 2-9 */
>> +	u32 fidcr[2];	/* FIFO Received ID Compare Register */
>> +	u32 mkivlr1;	/* Mask Invalid Register 1 */
>> +	u32 mier1;	/* Mailbox Interrupt Enable Register 1 */
>> +	u32 mkr_0_1[2];	/* Mask Registers 0-1 */
>> +	u32 mkivlr0;    /* Mask Invalid Register 0*/
>> +	u32 mier0;      /* Mailbox Interrupt Enable Register 0 */
>> +	u8 pad_440[0x3c0];
>> +	u8 mctl[64];	/* Message Control Registers */
>> +	u16 ctlr;	/* Control Register */
>> +	u16 str;	/* Status register */
>> +	u8 bcr[3];	/* Bit Configuration Register */
>> +	u8 clkr;	/* Clock Select Register */
>> +	u8 rfcr;	/* Receive FIFO Control Register */
>> +	u8 rfpcr;	/* Receive FIFO Pointer Control Register */
>> +	u8 tfcr;	/* Transmit FIFO Control Register */
>> +	u8 tfpcr;       /* Transmit FIFO Pointer Control Register */
>> +	u8 eier;	/* Error Interrupt Enable Register */
>> +	u8 eifr;	/* Error Interrupt Factor Judge Register */
>> +	u8 recr;	/* Receive Error Count Register */
>> +	u8 tecr;        /* Transmit Error Count Register */
>> +	u8 ecsr;	/* Error Code Store Register */
>> +	u8 cssr;	/* Channel Search Support Register */
>> +	u8 mssr;	/* Mailbox Search Status Register */
>> +	u8 msmr;	/* Mailbox Search Mode Register */
>> +	u16 tsr;	/* Time Stamp Register */
>> +	u8 afsr;	/* Acceptance Filter Support Register */
>> +	u8 pad_857;
>> +	u8 tcr;		/* Test Control Register */
>> +	u8 pad_859[7];
>> +	u8 ier;		/* Interrupt Enable Register */
>> +	u8 isr;		/* Interrupt Status Register */
>> +	u8 pad_862;
>> +	u8 mbsmr;	/* Mailbox Search Mask Register */
>> +} __packed;
>
> I think they should work without packed, too.

I think this discussion proves why this is not a good idea.

IIRC, a compiler has the right to pad, or re-order structure
elements as it sees fit depending on the architecture and options.


-- 
Ben Dooks				http://www.codethink.co.uk/
Senior Engineer				Codethink - Providing Genius

^ permalink raw reply

* Re: [PATCH v5] can: add Renesas R-Car CAN driver
From: Geert Uytterhoeven @ 2014-01-22 11:54 UTC (permalink / raw)
  To: Ben Dooks
  Cc: Marc Kleine-Budde, Sergei Shtylyov, David Miller,
	netdev@vger.kernel.org, wg, linux-can, Linux-sh list,
	Pavel Kiryukhin
In-Reply-To: <52DFB101.8040907@codethink.co.uk>

On Wed, Jan 22, 2014 at 12:52 PM, Ben Dooks <ben.dooks@codethink.co.uk> wrote:
> IIRC, a compiler has the right to pad, or re-order structure
> elements as it sees fit depending on the architecture and options.

Pad: yes, reorder: no.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* RE: [PATCH v5] can: add Renesas R-Car CAN driver
From: David Laight @ 2014-01-22 11:58 UTC (permalink / raw)
  To: 'Ben Dooks', Marc Kleine-Budde
  Cc: Sergei Shtylyov, David Miller, geert@linux-m68k.org,
	netdev@vger.kernel.org, wg@grandegger.com,
	linux-can@vger.kernel.org, linux-sh@vger.kernel.org,
	vksavl@gmail.com
In-Reply-To: <52DFB101.8040907@codethink.co.uk>

From: Ben Dooks
> On 20/01/14 21:17, Marc Kleine-Budde wrote:
> > On 01/20/2014 11:12 PM, Sergei Shtylyov wrote:
> > [...]
> >
> >>> I truly think using packed here is rediculous, please remove it unless
> >>> you can prove that things won't work without it.
> >>
> >>     They will. But how about the following 'struct rcar_can_regs'?
> >
> > The strcuts in question are:
> >
> >> +/* Mailbox registers structure */
> >> +struct rcar_can_mbox_regs {
> >> +	u32 id;		/* IDE and RTR bits, SID and EID */
> >> +	u8 stub;	/* Not used */
> >> +	u8 dlc;		/* Data Length Code - bits [0..3] */
> >> +	u8 data[8];	/* Data Bytes */
> >> +	u8 tsh;		/* Time Stamp Higher Byte */
> >> +	u8 tsl;		/* Time Stamp Lower Byte */
> >> +} __packed;
> >> +
> >> +struct rcar_can_regs {
> >> +	struct rcar_can_mbox_regs mb[RCAR_CAN_N_MBX]; /* Mailbox registers */
> >> +	u32 mkr_2_9[8];	/* Mask Registers 2-9 */
> >> +	u32 fidcr[2];	/* FIFO Received ID Compare Register */
> >> +	u32 mkivlr1;	/* Mask Invalid Register 1 */
> >> +	u32 mier1;	/* Mailbox Interrupt Enable Register 1 */
> >> +	u32 mkr_0_1[2];	/* Mask Registers 0-1 */
> >> +	u32 mkivlr0;    /* Mask Invalid Register 0*/
> >> +	u32 mier0;      /* Mailbox Interrupt Enable Register 0 */
> >> +	u8 pad_440[0x3c0];
> >> +	u8 mctl[64];	/* Message Control Registers */
> >> +	u16 ctlr;	/* Control Register */
> >> +	u16 str;	/* Status register */
> >> +	u8 bcr[3];	/* Bit Configuration Register */
> >> +	u8 clkr;	/* Clock Select Register */
> >> +	u8 rfcr;	/* Receive FIFO Control Register */
> >> +	u8 rfpcr;	/* Receive FIFO Pointer Control Register */
> >> +	u8 tfcr;	/* Transmit FIFO Control Register */
> >> +	u8 tfpcr;       /* Transmit FIFO Pointer Control Register */
> >> +	u8 eier;	/* Error Interrupt Enable Register */
> >> +	u8 eifr;	/* Error Interrupt Factor Judge Register */
> >> +	u8 recr;	/* Receive Error Count Register */
> >> +	u8 tecr;        /* Transmit Error Count Register */
> >> +	u8 ecsr;	/* Error Code Store Register */
> >> +	u8 cssr;	/* Channel Search Support Register */
> >> +	u8 mssr;	/* Mailbox Search Status Register */
> >> +	u8 msmr;	/* Mailbox Search Mode Register */
> >> +	u16 tsr;	/* Time Stamp Register */
> >> +	u8 afsr;	/* Acceptance Filter Support Register */
> >> +	u8 pad_857;
> >> +	u8 tcr;		/* Test Control Register */
> >> +	u8 pad_859[7];
> >> +	u8 ier;		/* Interrupt Enable Register */
> >> +	u8 isr;		/* Interrupt Status Register */
> >> +	u8 pad_862;
> >> +	u8 mbsmr;	/* Mailbox Search Mask Register */
> >> +} __packed;
> >
> > I think they should work without packed, too.
> 
> I think this discussion proves why this is not a good idea.

You need the second structure to be the correct size, with or without
__packed there could easily be a typo - so add a compile-type assert
on the size.

If this matches some hardware spec, and the hardware spec doesn't
have anything misaligned, then you don't want to specify __packed.
Without the __packed the size check will detect more typos.

> IIRC, a compiler has the right to pad, or re-order structure
> elements as it sees fit depending on the architecture and options.

The language might allow many things, the ABI is much more explicit.
In this case you really care about the ABI.

	David


^ permalink raw reply

* RE: [PATCH 2/4 ethtool] ethtool: Support for configurable RSS hash key.
From: Venkata Duvvuru @ 2014-01-22 12:06 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev@vger.kernel.org
In-Reply-To: <1390225398.3651.86.camel@deadeye.wl.decadent.org.uk>

> -----Original Message-----
> From: Ben Hutchings [mailto:ben@decadent.org.uk]
> Sent: Monday, January 20, 2014 7:13 PM
> To: Venkata Duvvuru
> Cc: netdev@vger.kernel.org
> Subject: Re: [PATCH 2/4 ethtool] ethtool: Support for configurable RSS hash
> key.
> 
> On Mon, 2014-01-20 at 13:28 +0000, Venkata Duvvuru wrote:
> > Ben, Please ignore my previous reply. My reply options were screwed up in
> that.
> >
> > > -----Original Message-----
> > > From: Ben Hutchings [mailto:ben@decadent.org.uk]
> > > Sent: Monday, January 20, 2014 12:06 AM
> > > To: Venkata Duvvuru
> > > Cc: netdev@vger.kernel.org
> > > Subject: Re: [PATCH 2/4 ethtool] ethtool: Support for configurable
> > > RSS hash key.
> > >
> > > On Fri, 2014-01-17 at 13:02 +0000, Venkata Duvvuru wrote:
> > > > This ethtool patch will primarily implement the parser for the
> > > > options
> > > provided by the user for set and get hashkey before invoking the ioctl.
> > > > This patch also has Ethtool man page changes which describes the
> > > > Usage of
> > > set and get hashkey options.
> > >
> > > I'd prefer to have this combined with the -x/-X options (and add new
> > > long options to reflect that they cover the key as well).
> >
> > if we add hashkey options to the existing -x/-X (--show-rxfh-indir/ --set-
> rxfh-indir), I think it won't be appropriate going by the command name.
> > We could change the command name to something like --show-rssconfig /-
> -rss-config but I'm afraid would that be backward compatible?
> [...]
> 
> That's why I said 'add new long options'.  The ethtool argument parser allows
> arbitrarily many aliases for each sub-command.

Just to make sure that we are in sync
        { "-x|--show-rxfh-indir|--show-hashkey", 1, do_getrssconfig,
          "Show RSS configuration" },

        { "-X|--set-rxfh-indir|--hashkey", 1, do_setrssconfig,
          "Set RSS configuration",
          "             equal N | weight W0 W1 ...\n"
          "             hkey %x:%x:%x:%x:%x:....:%x\n" },

And equal/weight will be mutually exclusive with hkey.
Does it makes sense?
> 
> Ben.
> 
> --
> Ben Hutchings
> One of the nice things about standards is that there are so many of them.

^ permalink raw reply

* RE: [PATCH net-next 3/4] ethtool: Support for configurable RSS hash key.
From: Venkata Duvvuru @ 2014-01-22 12:12 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev@vger.kernel.org
In-Reply-To: <1390224028.3651.72.camel@deadeye.wl.decadent.org.uk>



> -----Original Message-----
> From: Ben Hutchings [mailto:ben@decadent.org.uk]
> Sent: Monday, January 20, 2014 6:50 PM
> To: Venkata Duvvuru
> Cc: netdev@vger.kernel.org
> Subject: Re: [PATCH net-next 3/4] ethtool: Support for configurable RSS hash
> key.
> 
> On Mon, 2014-01-20 at 12:23 +0000, Venkata Duvvuru wrote:
> [...]
> > > > +/* RSS Hash key */
> > > > +struct ethtool_rss_hkey {
> > > > +	__u32   cmd;            /* ETHTOOL_SET/GET_RSS_HKEY */
> > > > +	__u8    data[RSS_HASH_KEY_LEN];
> > > > +	__u32	data_len;
> > > > +};
> > > [...]
> > >
> > > How about putting data after the data_len and giving it a length of
> > > 0, so this is extensible to an arbitrary length key?
> > >
> > > If we're extending the RSS configuration interface, there are a few
> > > other things that might be worth doing at the same time:
> > >
> > > - Single commands to get/set both the key and the indirection table
> > > at the same time
> > > - Add a field to distinguish multiple RSS contexts (some hardware
> > > can use RSS contexts together with filters, though RX NFC does not
> > > support that yet)
> > Are you referring to the filter-id that is created at the time of config-nfc?
> Pls clarify.
> 
> No, what I mean is:
> 
> 1. An RX flow steering filter can specify use of RSS, in which case the value
> looked up in the indirection is added to the queue number specified in the
> filter.  This is not yet controllable through RX NFC though there is room for
> extension there.
> 
> 2. Multi-function controllers need multiple RSS contexts (key + indirection
> table) to support independent use of RSS on each function.
> But it may also be possible to allocate multiple contexts to a single function.
> This could be useful in conjunction with 1.  But there would need to be a way
> to allocate and configure extra contexts first.
The proposed changes will be incremental so I think this can be done in a separate patch. Thoughts?
> 
> Ben.
> 
> --
> Ben Hutchings
> One of the nice things about standards is that there are so many of them.

^ permalink raw reply

* Re: [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
From: Neil Horman @ 2014-01-22 12:30 UTC (permalink / raw)
  To: Matija Glavinic Pecotic; +Cc: linux-sctp, Alexander Sverdlin, netdev
In-Reply-To: <52D8D544.5050501@nsn.com>

On Fri, Jan 17, 2014 at 08:01:24AM +0100, Matija Glavinic Pecotic wrote:
> Implementation of (a)rwnd calculation might lead to severe performance issues
> and associations completely stalling. These problems are described and solution
> is proposed which improves lksctp's robustness in congestion state.
> 
> 1) Sudden drop of a_rwnd and incomplete window recovery afterwards
> 
> Data accounted in sctp_assoc_rwnd_decrease takes only payload size (sctp data),
> but size of sk_buff, which is blamed against receiver buffer, is not accounted
> in rwnd. Theoretically, this should not be the problem as actual size of buffer
> is double the amount requested on the socket (SO_RECVBUF). Problem here is
> that this will have bad scaling for data which is less then sizeof sk_buff.
> E.g. in 4G (LTE) networks, link interfacing radio side will have a large portion
> of traffic of this size (less then 100B).
> 
> An example of sudden drop and incomplete window recovery is given below. Node B
> exhibits problematic behavior. Node A initiates association and B is configured
> to advertise rwnd of 10000. A sends messages of size 43B (size of typical sctp
> message in 4G (LTE) network). On B data is left in buffer by not reading socket
> in userspace.
> 
> Lets examine when we will hit pressure state and declare rwnd to be 0 for
> scenario with above stated parameters (rwnd == 10000, chunk size == 43, each
> chunk is sent in separate sctp packet)
> 
> Logic is implemented in sctp_assoc_rwnd_decrease:
> 
> socket_buffer (see below) is maximum size which can be held in socket buffer
> (sk_rcvbuf). current_alloced is amount of data currently allocated (rx_count)
> 
> A simple expression is given for which it will be examined after how many
> packets for above stated parameters we enter pressure state:
> 
> We start by condition which has to be met in order to enter pressure state:
> 
> 	socket_buffer < currently_alloced;
> 
> currently_alloced is represented as size of sctp packets received so far and not
> yet delivered to userspace. x is the number of chunks/packets (since there is no
> bundling, and each chunk is delivered in separate packet, we can observe each
> chunk also as sctp packet, and what is important here, having its own sk_buff):
> 
> 	socket_buffer < x*each_sctp_packet;
> 
> each_sctp_packet is sctp chunk size + sizeof(struct sk_buff). socket_buffer is
> twice the amount of initially requested size of socket buffer, which is in case
> of sctp, twice the a_rwnd requested:
> 
> 	2*rwnd < x*(payload+sizeof(struc sk_buff));
> 
> sizeof(struct sk_buff) is 190 (3.13.0-rc4+). Above is stated that rwnd is 10000
> and each payload size is 43
> 
> 	20000 < x(43+190);
> 
> 	x > 20000/233;
> 
> 	x ~> 84;
> 
> After ~84 messages, pressure state is entered and 0 rwnd is advertised while 
> received 84*43B ~= 3612B sctp data. This is why external observer notices sudden
> drop from 6474 to 0, as it will be now shown in example:
> 
> IP A.34340 > B.12345: sctp (1) [INIT] [init tag: 1875509148] [rwnd: 81920] [OS: 10] [MIS: 65535] [init TSN: 1096057017]
> IP B.12345 > A.34340: sctp (1) [INIT ACK] [init tag: 3198966556] [rwnd: 10000] [OS: 10] [MIS: 10] [init TSN: 902132839]
> IP A.34340 > B.12345: sctp (1) [COOKIE ECHO]
> IP B.12345 > A.34340: sctp (1) [COOKIE ACK]
> IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057017] [SID: 0] [SSEQ 0] [PPID 0x18]
> IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057017] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
> IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057018] [SID: 0] [SSEQ 1] [PPID 0x18]
> IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057018] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
> IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057019] [SID: 0] [SSEQ 2] [PPID 0x18]
> IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057019] [a_rwnd 9914] [#gap acks 0] [#dup tsns 0]
> <...>
> IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057098] [SID: 0] [SSEQ 81] [PPID 0x18]
> IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057098] [a_rwnd 6517] [#gap acks 0] [#dup tsns 0]
> IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057099] [SID: 0] [SSEQ 82] [PPID 0x18]
> IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057099] [a_rwnd 6474] [#gap acks 0] [#dup tsns 0]
> IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057100] [SID: 0] [SSEQ 83] [PPID 0x18]
> 
> --> Sudden drop
> 
> IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
> 
> At this point, rwnd_press stores current rwnd value so it can be later restored
> in sctp_assoc_rwnd_increase. This however doesn't happen as condition to start
> slowly increasing rwnd until rwnd_press is returned to rwnd is never met. This
> condition is not met since rwnd, after it hit 0, must first reach rwnd_press by
> adding amount which is read from userspace. Let us observe values in above
> example. Initial a_rwnd is 10000, pressure was hit when rwnd was ~6500 and the
> amount of actual sctp data currently waiting to be delivered to userspace
> is ~3500. When userspace starts to read, sctp_assoc_rwnd_increase will be blamed
> only for sctp data, which is ~3500. Condition is never met, and when userspace
> reads all data, rwnd stays on 3569.
> 
> IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 1505] [#gap acks 0] [#dup tsns 0]
> IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 3010] [#gap acks 0] [#dup tsns 0]
> IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057101] [SID: 0] [SSEQ 84] [PPID 0x18]
> IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057101] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]
> 
> --> At this point userspace read everything, rwnd recovered only to 3569
> 
> IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057102] [SID: 0] [SSEQ 85] [PPID 0x18]
> IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057102] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]
> 
> Reproduction is straight forward, it is enough for sender to send packets of
> size less then sizeof(struct sk_buff) and receiver keeping them in its buffers.
> 
> 2) Minute size window for associations sharing the same socket buffer
> 
> In case multiple associations share the same socket, and same socket buffer
> (sctp.rcvbuf_policy == 0), different scenarios exist in which congestion on one
> of the associations can permanently drop rwnd of other association(s).
> 
> Situation will be typically observed as one association suddenly having rwnd
> dropped to size of last packet received and never recovering beyond that point.
> Different scenarios will lead to it, but all have in common that one of the
> associations (let it be association from 1)) nearly depleted socket buffer, and
> the other association blames socket buffer just for the amount enough to start
> the pressure. This association will enter pressure state, set rwnd_press and 
> announce 0 rwnd.
> When data is read by userspace, similar situation as in 1) will occur, rwnd will
> increase just for the size read by userspace but rwnd_press will be high enough
> so that association doesn't have enough credit to reach rwnd_press and restore
> to previous state. This case is special case of 1), being worse as there is, in
> the worst case, only one packet in buffer for which size rwnd will be increased.
> Consequence is association which has very low maximum rwnd ('minute size', in
> our case down to 43B - size of packet which caused pressure) and as such
> unusable.
> 
> Scenario happened in the field and labs frequently after congestion state (link
> breaks, different probabilities of packet drop, packet reordering) and with 
> scenario 1) preceding. Here is given a deterministic scenario for reproduction:
> 
> From node A establish two associations on the same socket, with rcvbuf_policy
> being set to share one common buffer (sctp.rcvbuf_policy == 0). On association 1
> repeat scenario from 1), that is, bring it down to 0 and restore up. Observe
> scenario 1). Use small payload size (here we use 43). Once rwnd is 'recovered',
> bring it down close to 0, as in just one more packet would close it. This has as
> a consequence that association number 2 is able to receive (at least) one more
> packet which will bring it in pressure state. E.g. if association 2 had rwnd of
> 10000, packet received was 43, and we enter at this point into pressure,
> rwnd_press will have 9957. Once payload is delivered to userspace, rwnd will
> increase for 43, but conditions to restore rwnd to original state, just as in
> 1), will never be satisfied.
> 
> --> Association 1, between A.y and B.12345
> 
> IP A.55915 > B.12345: sctp (1) [INIT] [init tag: 836880897] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 4032536569]
> IP B.12345 > A.55915: sctp (1) [INIT ACK] [init tag: 2873310749] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3799315613]
> IP A.55915 > B.12345: sctp (1) [COOKIE ECHO]
> IP B.12345 > A.55915: sctp (1) [COOKIE ACK]
> 
> --> Association 2, between A.z and B.12346
> 
> IP A.55915 > B.12346: sctp (1) [INIT] [init tag: 534798321] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 2099285173]
> IP B.12346 > A.55915: sctp (1) [INIT ACK] [init tag: 516668823] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3676403240]
> IP A.55915 > B.12346: sctp (1) [COOKIE ECHO]
> IP B.12346 > A.55915: sctp (1) [COOKIE ACK]
> 
> --> Deplete socket buffer by sending messages of size 43B over association 1
> 
> IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315613] [SID: 0] [SSEQ 0] [PPID 0x18]
> IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315613] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
> 
> <...>
> 
> IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315696] [a_rwnd 6388] [#gap acks 0] [#dup tsns 0]
> IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315697] [SID: 0] [SSEQ 84] [PPID 0x18]
> IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315697] [a_rwnd 6345] [#gap acks 0] [#dup tsns 0]
> 
> --> Sudden drop on 1
>  
> IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315698] [SID: 0] [SSEQ 85] [PPID 0x18]
> IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315698] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
> 
> --> Here userspace read, rwnd 'recovered' to 3698, now deplete again using
>     association 1 so there is place in buffer for only one more packet
> 
> IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315799] [SID: 0] [SSEQ 186] [PPID 0x18]
> IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315799] [a_rwnd 86] [#gap acks 0] [#dup tsns 0]
> IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315800] [SID: 0] [SSEQ 187] [PPID 0x18]
> IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]
> 
> --> Socket buffer is almost depleted, but there is space for one more packet,
>     send them over association 2, size 43B
> 
> IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403240] [SID: 0] [SSEQ 0] [PPID 0x18]
> IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403240] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
> 
> --> Immediate drop
> 
> IP A.60995 > B.12346: sctp (1) [SACK] [cum ack 387491510] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
> 
> --> Read everything from the socket, both association recover up to maximum rwnd
>     they are capable of reaching, note that association 1 recovered up to 3698,
>     and association 2 recovered only to 43
> 
> IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 1548] [#gap acks 0] [#dup tsns 0]
> IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 3053] [#gap acks 0] [#dup tsns 0]
> IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315801] [SID: 0] [SSEQ 188] [PPID 0x18]
> IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315801] [a_rwnd 3698] [#gap acks 0] [#dup tsns 0]
> IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403241] [SID: 0] [SSEQ 1] [PPID 0x18]
> IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403241] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]
> 
> A careful reader might wonder why it is necessary to reproduce 1) prior
> reproduction of 2). It is simply easier to observe when to send packet over
> association 2 which will push association into the pressure state.
> 
> Proposed solution:
> 
> Both problems share the same root cause, and that is improper scaling of socket
> buffer with rwnd. Solution in which sizeof(sk_buff) is taken into concern while
> calculating rwnd is not possible due to fact that there is no linear
> relationship between amount of data blamed in increase/decrease with IP packet
> in which payload arrived. Even in case such solution would be followed,
> complexity of the code would increase. Due to nature of current rwnd handling,
> slow increase (in sctp_assoc_rwnd_increase) of rwnd after pressure state is
> entered is rationale, but it gives false representation to the sender of current
> buffer space. Furthermore, it implements additional congestion control mechanism
> which is defined on implementation, and not on standard basis.
> 
> Proposed solution simplifies whole algorithm having on mind definition from rfc:
> 
> o  Receiver Window (rwnd): This gives the sender an indication of the space
>    available in the receiver's inbound buffer.
> 
> Core of the proposed solution is given with these lines:
> 
> sctp_assoc_rwnd_account:
> 	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
> 		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
> 	else
> 		asoc->rwnd = 0;
> 
> We advertise to sender (half of) actual space we have. Half is in the braces
> depending whether you would like to observe size of socket buffer as SO_RECVBUF
> or twice the amount, i.e. size is the one visible from userspace, that is,
> from kernelspace.
> In this way sender is given with good approximation of our buffer space,
> regardless of the buffer policy - we always advertise what we have. Proposed
> solution fixes described problems and removes necessity for rwnd restoration
> algorithm. Finally, as proposed solution is simplification, some lines of code,
> along with some bytes in struct sctp_association are saved.
> 
> Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
> 


General comment - While this seems to make sense to me generally speaking,
doesn't it currently violate section 6 of the RFC?


A SCTP receiver MUST be able to receive a minimum of 1500 bytes in
   one SCTP packet.  This means that a SCTP endpoint MUST NOT indicate
   less than 1500 bytes in its Initial a_rwnd sent in the INIT or INIT
   ACK.

Since we set the initial rwnd value to the larger of sk->sk_rcvbuf/2 or
SCTP_MIN_RWND (1500 bytes), won't we potentially advertize half that amount?  It
seems we need to double the minimum socket receive buffer here.

Neil


> ---
> 
> --- net-next.orig/net/sctp/associola.c
> +++ net-next/net/sctp/associola.c
> @@ -1367,44 +1367,35 @@ static inline bool sctp_peer_needs_updat
>  	return false;
>  }
>  
> -/* Increase asoc's rwnd by len and send any window update SACK if needed. */
> -void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned int len)
> +/* Account asoc's rwnd for the approximated state in the buffer,
> + * and check whether SACK needs to be sent.
> + */
> +void sctp_assoc_rwnd_account(struct sctp_association *asoc, int check_sack)
>  {
> +	int rx_count;
>  	struct sctp_chunk *sack;
>  	struct timer_list *timer;
>  
> -	if (asoc->rwnd_over) {
> -		if (asoc->rwnd_over >= len) {
> -			asoc->rwnd_over -= len;
> -		} else {
> -			asoc->rwnd += (len - asoc->rwnd_over);
> -			asoc->rwnd_over = 0;
> -		}
> -	} else {
> -		asoc->rwnd += len;
> -	}
> +	if (asoc->ep->rcvbuf_policy)
> +		rx_count = atomic_read(&asoc->rmem_alloc);
> +	else
> +		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
>  
> -	/* If we had window pressure, start recovering it
> -	 * once our rwnd had reached the accumulated pressure
> -	 * threshold.  The idea is to recover slowly, but up
> -	 * to the initial advertised window.
> -	 */
> -	if (asoc->rwnd_press && asoc->rwnd >= asoc->rwnd_press) {
> -		int change = min(asoc->pathmtu, asoc->rwnd_press);
> -		asoc->rwnd += change;
> -		asoc->rwnd_press -= change;
> -	}
> +	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
> +		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
> +	else
> +		asoc->rwnd = 0;
>  
> -	pr_debug("%s: asoc:%p rwnd increased by %d to (%u, %u) - %u\n",
> -		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
> -		 asoc->a_rwnd);
> +	pr_debug("%s: asoc:%p rwnd=%u, rx_count=%d, sk_rcvbuf=%d\n",
> +		 __func__, asoc, asoc->rwnd, rx_count,
> +		 asoc->base.sk->sk_rcvbuf);
>  
>  	/* Send a window update SACK if the rwnd has increased by at least the
>  	 * minimum of the association's PMTU and half of the receive buffer.
>  	 * The algorithm used is similar to the one described in
>  	 * Section 4.2.3.3 of RFC 1122.
>  	 */
> -	if (sctp_peer_needs_update(asoc)) {
> +	if (check_sack && sctp_peer_needs_update(asoc)) {
>  		asoc->a_rwnd = asoc->rwnd;
>  
>  		pr_debug("%s: sending window update SACK- asoc:%p rwnd:%u "
> @@ -1426,45 +1417,6 @@ void sctp_assoc_rwnd_increase(struct sct
>  	}
>  }
>  
> -/* Decrease asoc's rwnd by len. */
> -void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned int len)
> -{
> -	int rx_count;
> -	int over = 0;
> -
> -	if (unlikely(!asoc->rwnd || asoc->rwnd_over))
> -		pr_debug("%s: association:%p has asoc->rwnd:%u, "
> -			 "asoc->rwnd_over:%u!\n", __func__, asoc,
> -			 asoc->rwnd, asoc->rwnd_over);
> -
> -	if (asoc->ep->rcvbuf_policy)
> -		rx_count = atomic_read(&asoc->rmem_alloc);
> -	else
> -		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
> -
> -	/* If we've reached or overflowed our receive buffer, announce
> -	 * a 0 rwnd if rwnd would still be positive.  Store the
> -	 * the potential pressure overflow so that the window can be restored
> -	 * back to original value.
> -	 */
> -	if (rx_count >= asoc->base.sk->sk_rcvbuf)
> -		over = 1;
> -
> -	if (asoc->rwnd >= len) {
> -		asoc->rwnd -= len;
> -		if (over) {
> -			asoc->rwnd_press += asoc->rwnd;
> -			asoc->rwnd = 0;
> -		}
> -	} else {
> -		asoc->rwnd_over = len - asoc->rwnd;
> -		asoc->rwnd = 0;
> -	}
> -
> -	pr_debug("%s: asoc:%p rwnd decreased by %d to (%u, %u, %u)\n",
> -		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
> -		 asoc->rwnd_press);
> -}
>  
>  /* Build the bind address list for the association based on info from the
>   * local endpoint and the remote peer.
> --- net-next.orig/include/net/sctp/structs.h
> +++ net-next/include/net/sctp/structs.h
> @@ -1653,17 +1653,6 @@ struct sctp_association {
>  	/* This is the last advertised value of rwnd over a SACK chunk. */
>  	__u32 a_rwnd;
>  
> -	/* Number of bytes by which the rwnd has slopped.  The rwnd is allowed
> -	 * to slop over a maximum of the association's frag_point.
> -	 */
> -	__u32 rwnd_over;
> -
> -	/* Keeps treack of rwnd pressure.  This happens when we have
> -	 * a window, but not recevie buffer (i.e small packets).  This one
> -	 * is releases slowly (1 PMTU at a time ).
> -	 */
> -	__u32 rwnd_press;
> -
>  	/* This is the sndbuf size in use for the association.
>  	 * This corresponds to the sndbuf size for the association,
>  	 * as specified in the sk->sndbuf.
> @@ -1892,8 +1881,7 @@ void sctp_assoc_update(struct sctp_assoc
>  __u32 sctp_association_get_next_tsn(struct sctp_association *);
>  
>  void sctp_assoc_sync_pmtu(struct sock *, struct sctp_association *);
> -void sctp_assoc_rwnd_increase(struct sctp_association *, unsigned int);
> -void sctp_assoc_rwnd_decrease(struct sctp_association *, unsigned int);
> +void sctp_assoc_rwnd_account(struct sctp_association *, int);
>  void sctp_assoc_set_primary(struct sctp_association *,
>  			    struct sctp_transport *);
>  void sctp_assoc_del_nonprimary_peers(struct sctp_association *,
> --- net-next.orig/net/sctp/sm_statefuns.c
> +++ net-next/net/sctp/sm_statefuns.c
> @@ -6176,7 +6176,7 @@ static int sctp_eat_data(const struct sc
>  	 * PMTU.  In cases, such as loopback, this might be a rather
>  	 * large spill over.
>  	 */
> -	if ((!chunk->data_accepted) && (!asoc->rwnd || asoc->rwnd_over ||
> +	if ((!chunk->data_accepted) && (!asoc->rwnd ||
>  	    (datalen > asoc->rwnd + asoc->frag_point))) {
>  
>  		/* If this is the next TSN, consider reneging to make
> --- net-next.orig/net/sctp/socket.c
> +++ net-next/net/sctp/socket.c
> @@ -2097,7 +2097,7 @@ static int sctp_recvmsg(struct kiocb *io
>  		 * rwnd is updated when the event is freed.
>  		 */
>  		if (!sctp_ulpevent_is_notification(event))
> -			sctp_assoc_rwnd_increase(event->asoc, copied);
> +			sctp_assoc_rwnd_account(event->asoc, 1);
>  		goto out;
>  	} else if ((event->msg_flags & MSG_NOTIFICATION) ||
>  		   (event->msg_flags & MSG_EOR))
> --- net-next.orig/net/sctp/ulpevent.c
> +++ net-next/net/sctp/ulpevent.c
> @@ -989,7 +989,7 @@ static void sctp_ulpevent_receive_data(s
>  	skb = sctp_event2skb(event);
>  	/* Set the owner and charge rwnd for bytes received.  */
>  	sctp_ulpevent_set_owner(event, asoc);
> -	sctp_assoc_rwnd_decrease(asoc, skb_headlen(skb));
> +	sctp_assoc_rwnd_account(asoc, 0);
>  
>  	if (!skb->data_len)
>  		return;
> @@ -1035,7 +1035,7 @@ static void sctp_ulpevent_release_data(s
>  	}
>  
>  done:
> -	sctp_assoc_rwnd_increase(event->asoc, len);
> +	sctp_assoc_rwnd_account(event->asoc, 1);
>  	sctp_ulpevent_release_owner(event);
>  }
>  
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH RFC 4/6] net: rfkill: gpio: add device tree support
From: Mark Brown @ 2014-01-22 12:38 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Alexandre Courbot, Heikki Krogerus, devicetree@vger.kernel.org,
	netdev, Linus Walleij, linux-wireless, linux-kernel,
	David S. Miller, linux-gpio@vger.kernel.org, linux-sunxi,
	Maxime Ripard, Chen-Yu Tsai, Mika Westerberg, Johannes Berg,
	linux-arm-kernel@lists.infradead.org
In-Reply-To: <201401211950.07011.arnd@arndb.de>


[-- Attachment #1.1: Type: text/plain, Size: 725 bytes --]

On Tue, Jan 21, 2014 at 07:50:06PM +0100, Arnd Bergmann wrote:

> I should have another look at the debugfs representation, but isn't
> there a global namespace that gets used for all gpios?  Neither the
> con_id nor the name that the driver picks would be globally unique
> and stable across kernel versions, so they don't make a good user
> interface.

Currently the globally unique name is the GPIO number but that's not
stable since it gets dynamically assigned.  

> I think what we want here is a system-wide unique identifier for
> each gpio line that may get represented in debugfs, and use a new
> DT mechanism to communicate that.

We've mostly been using things based off dev_name() for stability.

[-- Attachment #1.2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

[-- Attachment #2: Type: text/plain, Size: 176 bytes --]

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply

* Re: [PATCH v3] sch_htb: let skb->priority refer to non-leaf class
From: Jamal Hadi Salim @ 2014-01-22 12:41 UTC (permalink / raw)
  To: David Miller; +Cc: harry.mason, sergei.shtylyov, Eric Dumazet, netdev
In-Reply-To: <20140121.143641.295542315649549669.davem@davemloft.net>

Looks reasonable from my view.

cheers,
jamal

On Tue, Jan 21, 2014 at 5:36 PM, David Miller <davem@davemloft.net> wrote:
> From: Harry Mason <harry.mason@smoothwall.net>
> Date: Fri, 17 Jan 2014 13:22:32 +0000
>
>> If the class in skb->priority is not a leaf, apply filters from the
>> selected class, not the qdisc. This lets netfilter or user space
>> partially classify the packet.
>>
>> Signed-off-by: Harry Mason <harry.mason@smoothwall.net>
>
> Can I please get some reviews of this patch?
>
> Thanks.

^ permalink raw reply

* Re: [Patch net-next 3/6] net_sched: act: hide struct tcf_common from API
From: Jamal Hadi Salim @ 2014-01-22 12:44 UTC (permalink / raw)
  To: Cong Wang, netdev; +Cc: David S. Miller
In-Reply-To: <52DD1E15.2040400@mojatatu.com>

Cong,

I applied the patches and run the tests below.
The first one is broken for sure.
The second one has different behavior - but then I rembered
i had an offline discussion with you and i think this is fine.

Please chase whatever these issues are and fix; if the tests pass
you could go ahead and add my signed-off.

Much thanks for your efforts to make this better.

cheers,
jamal

On 01/20/14 08:01, Jamal Hadi Salim wrote:
> On 01/17/14 14:37, Cong Wang wrote:
>> Now we can totally hide it from modules. tcf_hash_*() API's
>> will operate on struct tc_action, modules don't need to care about
>> the details.
>>
>> Cc: Jamal Hadi Salim <jhs@mojatatu.com>
>> Cc: David S. Miller <davem@davemloft.net>
>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>
> Had to stare at this a bit longer - I am afraid
> this and rest look a little suspect.
> Can you run some tests for me after your patches?
> I could do it in about 1 day if you dont have time.
>
> ---
> #add a couple of tests
> sudo tc actions add action drop index 10
> sudo tc actions add action drop index 12
> # now list them - should see both.
> sudo tc actions ls action gact
> #now flush them
> sudo tc actions flush action gact
> # now list them - should see them gone
> sudo tc actions ls action gact
> ---
>
> And for the last patch, in particular
> ---
> #add two actions by index
> sudo tc actions add action drop index 10
> sudo tc actions add action ok index 12
> # we need an ingress qdisc to attach filter to
> sudo tc qdisc del dev lo parent ffff:
> sudo tc qdisc add dev lo ingress
> #use existing action index 10
> sudo tc filter add dev lo parent ffff: protocol ip prio 8 \
> u32 match ip dst 127.0.0.8/32 flowid 1:10 action gact index 10
> #double bind
> sudo tc filter add dev lo parent ffff: protocol ip prio 7 \
> u32 match ip src 127.0.0.10/32 flowid 1:11 action gact index 10
> # now lets see the filters..
> sudo tc filter ls dev lo parent ffff: protocol ip
> #display the actions and pay attention to the bind count
> sudo tc actions ls action gact
> #try to readd an existing action
> sudo tc actions add action ok index 12
> #it should be rejected - now list it and make sure refcnt doesnt go up
> sudo tc actions ls action gact
> #delete action index 12 (which is not bound)
> sudo tc actions del action gact index 12
> #list and make sure index 12 is gone
> sudo tc actions ls action gact
> #delete action index 10 (which is bound)
> sudo tc actions del action gact index 10
> #display to see it is still there ..
> sudo tc actions ls action gact
> #Repeat above two steps several times and make sure action 10 stays
> # action should not be deleted...
> #
> # delete qdisc - which should delete all filters but not
> # action that were not created by filters
> sudo tc qdisc del dev lo parent ffff:
> #ok now that filter is gone, lets see the actions ..
> #pay attention to binds and references
> sudo tc actions ls action gact
> #
> #delete action index 10 (which is no longer bound)
> sudo tc actions del action gact index 10
> #display to see it is gone
> sudo tc actions ls action gact
>
>
> cheers,
> jamal
>
>
>
>
>
>
>
>
>
>
>

^ permalink raw reply

* [PATCH net-next v2] tcp: metrics: Handle v6/v4-mapped sockets in tcp-metrics
From: Christoph Paasch @ 2014-01-22 12:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

A socket may be v6/v4-mapped. In that case sk->sk_family is AF_INET6,
but the IP being used is actually an IPv4-address.
Current's tcp-metrics will thus represent it as an IPv6-address:

root@server:~# ip tcp_metrics
::ffff:10.1.1.2 age 22.920sec rtt 18750us rttvar 15000us cwnd 10
10.1.1.2 age 47.970sec rtt 16250us rttvar 10000us cwnd 10

This patch modifies the tcp-metrics so that they are able to handle the
v6/v4-mapped sockets correctly.

Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
---
v2: Respin to net-next.

 net/ipv4/tcp_metrics.c | 64 +++++++++++++++++++++++++++++++-------------------
 1 file changed, 40 insertions(+), 24 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index fa950941de65..9923cebf5709 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -274,24 +274,32 @@ static struct tcp_metrics_block *__tcp_get_metrics_tw(struct inet_timewait_sock
 	unsigned int hash;
 	struct net *net;
 
-	saddr.family = tw->tw_family;
-	daddr.family = tw->tw_family;
-	switch (daddr.family) {
-	case AF_INET:
+	if (tw->tw_family == AF_INET) {
+		saddr.family = AF_INET;
 		saddr.addr.a4 = tw->tw_rcv_saddr;
+		daddr.family = AF_INET;
 		daddr.addr.a4 = tw->tw_daddr;
 		hash = (__force unsigned int) daddr.addr.a4;
-		break;
+	}
 #if IS_ENABLED(CONFIG_IPV6)
-	case AF_INET6:
-		*(struct in6_addr *)saddr.addr.a6 = tw->tw_v6_rcv_saddr;
-		*(struct in6_addr *)daddr.addr.a6 = tw->tw_v6_daddr;
-		hash = ipv6_addr_hash(&tw->tw_v6_daddr);
-		break;
+	else if (tw->tw_family == AF_INET6) {
+		if (ipv6_addr_v4mapped(&tw->tw_v6_daddr)) {
+			saddr.family = AF_INET;
+			saddr.addr.a4 = tw->tw_rcv_saddr;
+			daddr.family = AF_INET;
+			daddr.addr.a4 = tw->tw_daddr;
+			hash = (__force unsigned int) daddr.addr.a4;
+		} else {
+			saddr.family = AF_INET6;
+			*(struct in6_addr *)saddr.addr.a6 = tw->tw_v6_rcv_saddr;
+			daddr.family = AF_INET6;
+			*(struct in6_addr *)daddr.addr.a6 = tw->tw_v6_daddr;
+			hash = ipv6_addr_hash(&tw->tw_v6_daddr);
+		}
+	}
 #endif
-	default:
+	else
 		return NULL;
-	}
 
 	net = twsk_net(tw);
 	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
@@ -314,24 +322,32 @@ static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk,
 	unsigned int hash;
 	struct net *net;
 
-	saddr.family = sk->sk_family;
-	daddr.family = sk->sk_family;
-	switch (daddr.family) {
-	case AF_INET:
+	if (sk->sk_family == AF_INET) {
+		saddr.family = AF_INET;
 		saddr.addr.a4 = inet_sk(sk)->inet_saddr;
+		daddr.family = AF_INET;
 		daddr.addr.a4 = inet_sk(sk)->inet_daddr;
 		hash = (__force unsigned int) daddr.addr.a4;
-		break;
+	}
 #if IS_ENABLED(CONFIG_IPV6)
-	case AF_INET6:
-		*(struct in6_addr *)saddr.addr.a6 = sk->sk_v6_rcv_saddr;
-		*(struct in6_addr *)daddr.addr.a6 = sk->sk_v6_daddr;
-		hash = ipv6_addr_hash(&sk->sk_v6_daddr);
-		break;
+	else if (sk->sk_family = AF_INET6) {
+		if (ipv6_addr_v4mapped(&sk->sk_v6_daddr)) {
+			saddr.family = AF_INET;
+			saddr.addr.a4 = inet_sk(sk)->inet_saddr;
+			daddr.family = AF_INET;
+			daddr.addr.a4 = inet_sk(sk)->inet_daddr;
+			hash = (__force unsigned int) daddr.addr.a4;
+		} else {
+			saddr.family = AF_INET6;
+			*(struct in6_addr *)saddr.addr.a6 = sk->sk_v6_rcv_saddr;
+			daddr.family = AF_INET6;
+			*(struct in6_addr *)daddr.addr.a6 = sk->sk_v6_daddr;
+			hash = ipv6_addr_hash(&sk->sk_v6_daddr);
+		}
+	}
 #endif
-	default:
+	else
 		return NULL;
-	}
 
 	net = dev_net(dst->dev);
 	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
-- 
1.8.3.2

^ permalink raw reply related

* Re: [PATCH net-next 03/25] bonding: convert packets_per_slave to use the new option API
From: Nikolay Aleksandrov @ 2014-01-22 13:09 UTC (permalink / raw)
  To: netdev; +Cc: hannes, David Miller
In-Reply-To: <20140122072528.GD28225@order.stressinduktion.org>

On 01/22/2014 08:25 AM, Hannes Frederic Sowa wrote:
> Hi Nikolay!
> 
> On Tue, Jan 21, 2014 at 03:54:52PM +0100, Nikolay Aleksandrov wrote:
>> This patch adds the necessary changes so packets_per_slave would use the
>> new bonding option API.
> 
> Just want to warn you that because of the reciproal_divide merge in net-next
> there will be some conflicts with this patch in net-next.
> 
> I actually looked to rebase Daniel and my series on your patchset, but now
> David pulled ours first.
> 
> Greetings,
> 
>   Hannes
> 
Hi Hannes,
Thanks for the heads up, I just synced net-next with your changes and indeed
there're a few conflicts.
I'll rebase on top of them and post a v2, shouldn't be a problem :-)

Cheers,
 Nik

^ permalink raw reply

* [PATCH net-next] net/udp_offload: Handle static checker complaints
From: Or Gerlitz @ 2014-01-22 13:23 UTC (permalink / raw)
  To: davem; +Cc: netdev, Shlomo Pongratz, Or Gerlitz

From: Shlomo Pongratz <shlomop@mellanox.com>

Fixed few issues around using __rcu prefix and rcu_assign_pointer, also
fixed a warning print to use ntohs(port) and not htons(port).

net/ipv4/udp_offload.c:112:9: error: incompatible types in comparison expression (different address spaces)
net/ipv4/udp_offload.c:113:9: error: incompatible types in comparison expression (different address spaces)
net/ipv4/udp_offload.c:176:19: error: incompatible types in comparison expression (different address spaces)

Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 net/ipv4/udp_offload.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index ee853c5..25f5cee 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -15,7 +15,7 @@
 #include <net/protocol.h>
 
 static DEFINE_SPINLOCK(udp_offload_lock);
-static struct udp_offload_priv *udp_offload_base __read_mostly;
+static struct udp_offload_priv __rcu *udp_offload_base __read_mostly;
 
 struct udp_offload_priv {
 	struct udp_offload	*offload;
@@ -100,7 +100,7 @@ out:
 
 int udp_add_offload(struct udp_offload *uo)
 {
-	struct udp_offload_priv **head = &udp_offload_base;
+	struct udp_offload_priv __rcu **head = &udp_offload_base;
 	struct udp_offload_priv *new_offload = kzalloc(sizeof(*new_offload), GFP_KERNEL);
 
 	if (!new_offload)
@@ -110,7 +110,7 @@ int udp_add_offload(struct udp_offload *uo)
 
 	spin_lock(&udp_offload_lock);
 	rcu_assign_pointer(new_offload->next, rcu_dereference(*head));
-	rcu_assign_pointer(*head, rcu_dereference(new_offload));
+	rcu_assign_pointer(*head, new_offload);
 	spin_unlock(&udp_offload_lock);
 
 	return 0;
@@ -140,7 +140,7 @@ void udp_del_offload(struct udp_offload *uo)
 		}
 		head = &uo_priv->next;
 	}
-	pr_warn("udp_del_offload: didn't find offload for port %d\n", htons(uo->port));
+	pr_warn("udp_del_offload: didn't find offload for port %d\n", ntohs(uo->port));
 unlock:
 	spin_unlock(&udp_offload_lock);
 	if (uo_priv != NULL)
-- 
1.7.1

^ permalink raw reply related

* Re: Fwd: [RFC PATCH net-next 0/3] virtio_net: add aRFS support
From: Zhi Yong Wu @ 2014-01-22 13:27 UTC (permalink / raw)
  To: Tom Herbert, Ben Hutchings
  Cc: Stefan Hajnoczi, Linux Netdev List, Eric Dumazet, David S. Miller,
	Zhi Yong Wu, Michael S. Tsirkin, Rusty Russell, Jason Wang
In-Reply-To: <CA+mtBx9PBtYurdnhCKL0MLL8i+_+3yPNWFVj5h6SPJH+YDBCjw@mail.gmail.com>

On Fri, Jan 17, 2014 at 1:12 AM, Tom Herbert <therbert@google.com> wrote:
> On Thu, Jan 16, 2014 at 12:52 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>> On Thu, Jan 16, 2014 at 04:34:10PM +0800, Zhi Yong Wu wrote:
>>> CC: stefanha, MST, Rusty Russel
>>>
>>> ---------- Forwarded message ----------
>>> From: Jason Wang <jasowang@redhat.com>
>>> Date: Thu, Jan 16, 2014 at 12:23 PM
>>> Subject: Re: [RFC PATCH net-next 0/3] virtio_net: add aRFS support
>>> To: Zhi Yong Wu <zwu.kernel@gmail.com>
>>> Cc: netdev@vger.kernel.org, therbert@google.com, edumazet@google.com,
>>> davem@davemloft.net, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>>
>>>
>>> On 01/15/2014 10:20 PM, Zhi Yong Wu wrote:
>>> >
>>> > From: Zhi Yong Wu<wuzhy@linux.vnet.ibm.com>
>>> >
>>> > HI, folks
>>> >
>>> > The patchset is trying to integrate aRFS support to virtio_net. In this case,
>>> > aRFS will be used to select the RX queue. To make sure that it's going ahead
>>> > in the correct direction, although it is still one RFC and isn't tested, it's
>>> > post out ASAP. Any comment are appreciated, thanks.
>>> >
>>> > If anyone is interested in playing with it, you can get this patchset from my
>>> > dev git on github:
>>> >    git://github.com/wuzhy/kernel.git virtnet_rfs
>>> >
>>> > Zhi Yong Wu (3):
>>> >    virtio_pci: Introduce one new config api vp_get_vq_irq()
>>> >    virtio_net: Introduce one dummy function virtnet_filter_rfs()
>>> >    virtio-net: Add accelerated RFS support
>>> >
>>> >   drivers/net/virtio_net.c      |   67 ++++++++++++++++++++++++++++++++++++++++-
>>> >   drivers/virtio/virtio_pci.c   |   11 +++++++
>>> >   include/linux/virtio_config.h |   12 +++++++
>>> >   3 files changed, 89 insertions(+), 1 deletions(-)
>>> >
>>>
>>> Please run get_maintainter.pl before sending the patch. You'd better
>>> at least cc virtio maintainer/list for this.
>>>
>>> The core aRFS method is a noop in this RFC which make this series no
>>> much sense to discuss. You should at least mention the big picture
>>> here in the cover letter. I suggest you should post a RFC which can
>>> run and has expected result or you can just raise a thread for the
>>> design discussion.
>>>
>>> And this method has been discussed before, you can search "[net-next
>>> RFC PATCH 5/5] virtio-net: flow director support" in netdev archive
>>> for a very old prototype implemented by me. It can work and looks like
>>> most of this RFC have already done there.
>>>
>>> A basic question is whether or not we need this, not all the mq cards
>>> use aRFS (see ixgbe ATR). And whether or not it can bring extra
>>> overheads? For virtio, we want to reduce the vmexits as much as
>>> possible but this aRFS seems introduce a lot of more of this. Making a
>>> complex interfaces just for an virtual device may not be good, simple
>>> method may works for most of the cases.
>>>
>>> We really should consider to offload this to real nic. VMDq and L2
>>> forwarding offload may help in this case.
>>
> Adding flow director support would be a good step, Zhi's patches for
> support in tun have been merged, so support in virtio-net would be a
> good follow on. But, flow-director does have some limitations and
> performance issues of it's own (forced pairing between TX and RX
> queues, lookup on every TX packet). In the case of virtualization,
> aRFS, RSS, ntuple filtering, LRO, etc. can be implemented as software
> emulations and so far seems to be wins in most cases. Extending these
> down into the stack so that they can leverage HW mechanisms is a good
> goal for best performance. It's probably generally true that most of
> the offloads commonly available for NICs we'll want in virtualization
> path. Of course, we need to deomonstrate that they provide real
> performance benefit in this use case.
>
> I believe tying in aRFS (or flow director) into a real aRFS is just a
> matter of programming the RFS table properly. This is not the complex
> side of the interface, I believe this already works with the tun
> patches.
>
>> Zhi Yong and I had an IRC chat.  I wanted to post my questions on the
>> list - it's still the same concern I had in the old email thread that
>> Jason mentioned.
>>
>> In order for virtio-net aRFS to make sense there needs to be an overall
>> plan for pushing flow mapping information down to the physical NIC.
>> That's the only way to actually achieve the benefit of steering:
>> processing the packet on the CPU where the application is running.
>>
> I don't think this is necessarily true. Per flow steering amongst
> virtual queues should be beneficial in itself. virtio-net can leverage
> RFS or aRFS where it's available.
>
>> If it's not possible or too hard to implement aRFS down the entire
>> stack, we won't be able to process the packet on the right CPU.
>> Then we might as well not bother with aRFS and just distribute uniformly
>> across the rx virtqueues.
>>
>> Please post an outline of how rx packets will be steered up the stack so
>> we can discuss whether aRFS can bring any benefit.
>>
> 1. The aRFS interface for the guest to specify which virtual queue to
> receive a packet on is fairly straight forward.
> 2. To hook into RFS, we need to match the virtual queue to the real
> CPU it will processed on, and then program the RFS table for that flow
> and CPU.
> 3. NIC aRFS keys off the RFS tables so it can program the HW with the
> correct queue for the CPU.
Does anyone have time to make one conclusion for this discussion? in
particular, how will rx packet be steered up the stack from guest
virtio_net driver, virtio_net NIC, vhost_net, tun driver, host network
stack, to physical NIC with more details?
What is the role of each path units? otherwise this discussion wont
get any meanful result, which is not what we expect.

>
>> Stefan



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox