Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH 2/2] net: spider_net: avoid using signed char for bitops
From: Antoine Tenart @ 2014-10-03 15:01 UTC (permalink / raw)
  To: dan.carpenter, kou.ishizaki, jens
  Cc: Antoine Tenart, netdev, linux-arm-kernel, linux-kernel
In-Reply-To: <1412348517-20352-1-git-send-email-antoine.tenart@free-electrons.com>

Signedness bugs may occur when using signed char for bitops,
depending on if the highest bit is ever used.

Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
---
 drivers/net/ethernet/toshiba/spider_net.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/toshiba/spider_net.c b/drivers/net/ethernet/toshiba/spider_net.c
index 713313e15c68..8e9371a3388a 100644
--- a/drivers/net/ethernet/toshiba/spider_net.c
+++ b/drivers/net/ethernet/toshiba/spider_net.c
@@ -1325,9 +1325,9 @@ spider_net_set_mac(struct net_device *netdev, void *p)
 	spider_net_write_reg(card, SPIDER_NET_GMACOPEMD, regvalue);
 
 	/* write mac */
-	macu = (addr->sa_data[0]<<24) + (addr->sa_data[1]<<16) +
-		(addr->sa_data[2]<<8) + (addr->sa_data[3]);
-	macl = (addr->sa_data[4]<<8) + (addr->sa_data[5]);
+	macu = (netdev->dev_addr[0]<<24) + (netdev->dev_addr[1]<<16) +
+		(netdev->dev_addr[2]<<8) + (netdev->dev_addr[3]);
+	macl = (netdev->dev_addr[4]<<8) + (netdev->dev_addr[5]);
 	spider_net_write_reg(card, SPIDER_NET_GMACUNIMACU, macu);
 	spider_net_write_reg(card, SPIDER_NET_GMACUNIMACL, macl);
 
-- 
1.9.1

^ permalink raw reply related

* [PATCH 1/2] net: spider_net: do not read mac address again after setting it
From: Antoine Tenart @ 2014-10-03 15:01 UTC (permalink / raw)
  To: dan.carpenter, kou.ishizaki, jens
  Cc: Antoine Tenart, netdev, linux-arm-kernel, linux-kernel
In-Reply-To: <1412348517-20352-1-git-send-email-antoine.tenart@free-electrons.com>

This patch removes the spider_net_get_mac_address() call at the end of
the spider_net_set_mac() function. The dev->dev_addr is instead updated
with a memcpy() from sa->sa_data.

Since spider_net_get_mac_address() is not used anywhere else, this patch
also removes the function.

Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
---
 drivers/net/ethernet/toshiba/spider_net.c | 36 ++-----------------------------
 1 file changed, 2 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ethernet/toshiba/spider_net.c b/drivers/net/ethernet/toshiba/spider_net.c
index 3e38f67c6011..713313e15c68 100644
--- a/drivers/net/ethernet/toshiba/spider_net.c
+++ b/drivers/net/ethernet/toshiba/spider_net.c
@@ -267,34 +267,6 @@ spider_net_set_promisc(struct spider_net_card *card)
 }
 
 /**
- * spider_net_get_mac_address - read mac address from spider card
- * @card: device structure
- *
- * reads MAC address from GMACUNIMACU and GMACUNIMACL registers
- */
-static int
-spider_net_get_mac_address(struct net_device *netdev)
-{
-	struct spider_net_card *card = netdev_priv(netdev);
-	u32 macl, macu;
-
-	macl = spider_net_read_reg(card, SPIDER_NET_GMACUNIMACL);
-	macu = spider_net_read_reg(card, SPIDER_NET_GMACUNIMACU);
-
-	netdev->dev_addr[0] = (macu >> 24) & 0xff;
-	netdev->dev_addr[1] = (macu >> 16) & 0xff;
-	netdev->dev_addr[2] = (macu >> 8) & 0xff;
-	netdev->dev_addr[3] = macu & 0xff;
-	netdev->dev_addr[4] = (macl >> 8) & 0xff;
-	netdev->dev_addr[5] = macl & 0xff;
-
-	if (!is_valid_ether_addr(&netdev->dev_addr[0]))
-		return -EINVAL;
-
-	return 0;
-}
-
-/**
  * spider_net_get_descr_status -- returns the status of a descriptor
  * @descr: descriptor to look at
  *
@@ -1345,6 +1317,8 @@ spider_net_set_mac(struct net_device *netdev, void *p)
 	if (!is_valid_ether_addr(addr->sa_data))
 		return -EADDRNOTAVAIL;
 
+	memcpy(netdev->dev_addr, addr->sa_data, ETH_ALEN);
+
 	/* switch off GMACTPE and GMACRPE */
 	regvalue = spider_net_read_reg(card, SPIDER_NET_GMACOPEMD);
 	regvalue &= ~((1 << 5) | (1 << 6));
@@ -1364,12 +1338,6 @@ spider_net_set_mac(struct net_device *netdev, void *p)
 
 	spider_net_set_promisc(card);
 
-	/* look up, whether we have been successful */
-	if (spider_net_get_mac_address(netdev))
-		return -EADDRNOTAVAIL;
-	if (memcmp(netdev->dev_addr,addr->sa_data,netdev->addr_len))
-		return -EADDRNOTAVAIL;
-
 	return 0;
 }
 
-- 
1.9.1

^ permalink raw reply related

* [PATCH 0/2] net: spider_net: fix possible bitops errors
From: Antoine Tenart @ 2014-10-03 15:01 UTC (permalink / raw)
  To: dan.carpenter, kou.ishizaki, jens
  Cc: Antoine Tenart, netdev, linux-arm-kernel, linux-kernel

Hi,

Dan reported a possible signedness issue on the pxa168_eth driver. While
having a look at it, I came across a similar problem in the spider_net
driver.

Here is one proposal to fix it. The first patch rework the
spider_net_set_mac() function by removing the spider_net_get_mac_address()
call and using memcpy() to set netdev->dev_addr (which is what's done in
lots of Ethernet drivers) and the second one fix the actual signedness
issue.

If for any reason you really want to keep a call to
spider_net_get_mac_address() because the memcpy() is somehow not good
enough here, we can also come up with a solution involving a temporary
unsigned char variable.

I couldn't test these changes, so please do.

Thanks,

Antoine

Antoine Tenart (2):
  net: spider_net: do not read mac address again after setting it
  net: spider_net: avoid using signed char for bitops

 drivers/net/ethernet/toshiba/spider_net.c | 42 ++++---------------------------
 1 file changed, 5 insertions(+), 37 deletions(-)

-- 
1.9.1

^ permalink raw reply

* [PATCH] net: pxa168_eth: avoid using signed char for bitops
From: Antoine Tenart @ 2014-10-03 15:01 UTC (permalink / raw)
  To: dan.carpenter
  Cc: thomas.petazzoni, zmxu, netdev, Antoine Tenart, linux-kernel,
	alexandre.belloni, jszhang, linux-arm-kernel,
	sebastian.hesselbarth
In-Reply-To: <1412348517-20352-1-git-send-email-antoine.tenart@free-electrons.com>

Signedness bugs may occur when using signed char for bitops,
depending on if the highest bit is ever used.

Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
---

Dan reported a static checker warning:
http://marc.info/?l=kernel-janitors&m=141218246222535&w=2

This patch fixes it. Instead of using sa->sa_data we now use
dev->dev_addr (of type unsigned char *) to avoid possible
signedness issues.

 drivers/net/ethernet/marvell/pxa168_eth.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/marvell/pxa168_eth.c b/drivers/net/ethernet/marvell/pxa168_eth.c
index 24de41231593..c3b209cd0660 100644
--- a/drivers/net/ethernet/marvell/pxa168_eth.c
+++ b/drivers/net/ethernet/marvell/pxa168_eth.c
@@ -634,12 +634,12 @@ static int pxa168_eth_set_mac_address(struct net_device *dev, void *addr)
 	memcpy(oldMac, dev->dev_addr, ETH_ALEN);
 	memcpy(dev->dev_addr, sa->sa_data, ETH_ALEN);
 
-	mac_h = sa->sa_data[0] << 24;
-	mac_h |= sa->sa_data[1] << 16;
-	mac_h |= sa->sa_data[2] << 8;
-	mac_h |= sa->sa_data[3];
-	mac_l = sa->sa_data[4] << 8;
-	mac_l |= sa->sa_data[5];
+	mac_h = dev->dev_addr[0] << 24;
+	mac_h |= dev->dev_addr[1] << 16;
+	mac_h |= dev->dev_addr[2] << 8;
+	mac_h |= dev->dev_addr[3];
+	mac_l = dev->dev_addr[4] << 8;
+	mac_l |= dev->dev_addr[5];
 	wrl(pep, MAC_ADDR_HIGH, mac_h);
 	wrl(pep, MAC_ADDR_LOW, mac_l);
 
-- 
1.9.1

^ permalink raw reply related

* Re: RFC: ixgbe+build_skb+extra performance experiments
From: Alexander Duyck @ 2014-10-03 14:40 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov
  Cc: David S. Miller, Jeff Kirsher, Alexander Duyck, Ben Hutchings,
	Eric Dumazet, netdev
In-Reply-To: <20141002093636.6f482281@redhat.com>

On 10/02/2014 12:36 AM, Jesper Dangaard Brouer wrote:
> On Wed,  1 Oct 2014 23:00:42 -0700 Alexei Starovoitov <ast@plumgrid.com> wrote:
>
>> I'm trying to speed up single core packet per second.
> Great, welcome to the club ;-)

Yes, but please keep in mind that multi-core is the more common use case
for many systems.

>> I took dual port ixgbe and added both ports to a linux bridge.
>> Only one port is connected to another system running pktgen at 10G rate.
>> I disabled gro to measure pure RX speed of ixgbe.
> It is great that you are attacking the RX side, I planned to look at it
> after finishing the qdisc bulking.  It is really lacking behind,
> especially after we have now almost "fixed" the TX side (driver layer
> can now do 14.8Mpps, if ignoring rest of stack, alloc etc.).

To that end we may want to look to something like GRO to do the
buffering on the Rx side so that we could make use of GRO/GSO to send
blocks of buffers instead of one at a time.

>> Out of the box I see 6.5 Mpps and the following stack:
>>   21.83%    ksoftirqd/0  [kernel.kallsyms]  [k] memcpy
>>   17.58%    ksoftirqd/0  [ixgbe]            [k] ixgbe_clean_rx_irq
>>   10.07%    ksoftirqd/0  [kernel.kallsyms]  [k] build_skb
>>    6.40%    ksoftirqd/0  [kernel.kallsyms]  [k] __netdev_alloc_frag
>>    5.18%    ksoftirqd/0  [kernel.kallsyms]  [k] put_compound_page
>>    4.93%    ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_alloc
>>    4.55%    ksoftirqd/0  [kernel.kallsyms]  [k] __netif_receive_skb_core
>>
>> Obviously driver spends huge amount of time copying data from
>> hw buffers into skb.
>>
>> Then I applied buggy but working in this case patch:
>> http://patchwork.ozlabs.org/patch/236044/
>> that is trying to use build_skb() API in ixgbe.
> I also expected it will be a huge win to use build_skb() API.
> Good to see this confirmed! :-)

>From my past experience this is very platform dependant.  For example
with DDIO or DCA features enabled on a system the memcpy is very cheap
since it is already in the cache.  It is one of the reasons for choosing
that as a means of working around the fact that we cannot use build_skb
and page reuse in the same driver.

> I've been playing with a faster memory pool/allocator (implemented via a
> ring_queue), and my experiments show I could save 52ns when using it
> for the skb->data.  And you basically avoid this skb->data alloc with
> build_skb().

The other big item that I think many are overlooking is the cost of
mapping the buffer.  On systems with an IOMMU or even in some cases just
the swiotlb can add some significant cost to allocating a new buffer.

>> RX speed jumped to 7.6 Mpps with the following stack:
>>   27.02%    ksoftirqd/0  [kernel.kallsyms]  [k] eth_type_trans
>>   16.68%    ksoftirqd/0  [ixgbe]            [k] ixgbe_clean_rx_irq
>>   11.45%    ksoftirqd/0  [kernel.kallsyms]  [k] build_skb
>>    5.20%    ksoftirqd/0  [kernel.kallsyms]  [k] __netif_receive_skb_core
>>    4.72%    ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_alloc
>>    3.96%    ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_free
> My faster memory pool/allocator could save 8ns for the kmem_cache/slub
> calls, which is also high in your perf top.  8ns out of 40ns which is
> the micro benchmarked cost of the kmem_cache_{alloc,free} calls.
>
>
>> packets no longer copied and performance is higher.
>> It's doing the following:
>> - build_skb out of hw buffer and prefetch packet data
>> - eth_type_trans
>> - napi_gro_receive
>>
>> but build_skb() is too fast and cpu doesn't have enough time
>> to prefetch packet data before eth_type_trans() is called,
>> so I added mini skb bursting of 2 skbs (patch below) that does:
>> - build_skb1 out of hw buffer and prefetch packet data
>> - build_skb2 out of hw buffer and prefetch packet data
>> - eth_type_trans(skb1)
>> - napi_gro_receive(skb1)
>> - eth_type_trans(skb2)
>> - napi_gro_receive(skb2)
>> and performance jumped to 9.0 Mpps with stack:
>>   20.54%    ksoftirqd/0  [ixgbe]            [k] ixgbe_clean_rx_irq
>>   13.15%    ksoftirqd/0  [kernel.kallsyms]  [k] build_skb
>>    8.35%    ksoftirqd/0  [kernel.kallsyms]  [k] __netif_receive_skb_core
>>    7.16%    ksoftirqd/0  [kernel.kallsyms]  [k] eth_type_trans
>>    4.73%    ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_free
>>    4.50%    ksoftirqd/0  [kernel.kallsyms]  [k] kmem_cache_alloc
>>
>> with further instruction tunning inside ixgbe_clean_rx_irq()
>> I could push it to 9.4 Mpps.
>>
>> From 6.5 Mpps to 9.4 Mpps via build_skb() and tunning.
> Cool, quite impressive performance boost! - good work! :-)

One thought I had at one point was to try and add a flag to the DMA api
to indicate if the DMA api is trivial resulting in just a call to
virt_to_phys.  It might be worthwhile to look into something like that,
then we could split the receive processing into one of two paths, one
for non-trivial DMA mapping APIs, and one for trivial DMA mapping APIs
such as swiotlb on a device that supports all the memory in the system.

>> Is there a way to fix the issue Ben pointed a year ago?
>> Brute force fix could to be: avoid half-page buffers.
>> We'll be wasting 16Mbyte of memory. Sure, but in some cases
>> extra peformance might be worth it.
>> Other options?
>>
>> I think we need to try harder to switch to build_skb()
>> It will open up a lot of possibilities for further performance
>> improvements.
>> Thoughts?
> Yes, we should really work on converting more drivers to use
> build_skb().

The problem is build_skb usage comes at a certain cost.  Specifically in
the case of small packets it can result in a larger memory footprint
since you cannot just reuse the same region in the buffer.  I suspect we
may need to look into some sort of compromise between build_skb and a
copybreak scheme for best cache performance on Xeon for example.

>> ---
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   34 +++++++++++++++++++++----
>>  1 file changed, 29 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> index 21d1a65..1d1e37f 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> @@ -1590,8 +1590,6 @@ static void ixgbe_process_skb_fields(struct ixgbe_ring *rx_ring,
>>  	}
>>
>>  	skb_record_rx_queue(skb, rx_ring->queue_index);
>> -
>> -	skb->protocol = eth_type_trans(skb, dev);
>>  }
>>
>>  static void ixgbe_rx_skb(struct ixgbe_q_vector *q_vector,
>> @@ -2063,6 +2061,24 @@ dma_sync:
>>  	return skb;
>>  }
>>
>> +#define BURST_SIZE 2
>> +static void ixgbe_rx_skb_burst(struct sk_buff *skbs[BURST_SIZE],
>> +			       unsigned int skb_burst,
>> +			       struct ixgbe_q_vector *q_vector,
>> +			       struct net_device *dev)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < skb_burst; i++) {
>> +		struct sk_buff *skb = skbs[i];
>> +
>> +		skb->protocol = eth_type_trans(skb, dev);
>> +
>> +		skb_mark_napi_id(skb, &q_vector->napi);
>> +		ixgbe_rx_skb(q_vector, skb);
>> +	}
>> +}
>> +
>>  /**
>>   * ixgbe_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
>>   * @q_vector: structure containing interrupt and ring information
>> @@ -2087,6 +2103,8 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
>>  	unsigned int mss = 0;
>>  #endif /* IXGBE_FCOE */
>>  	u16 cleaned_count = ixgbe_desc_unused(rx_ring);
>> +	struct sk_buff *skbs[BURST_SIZE];
>> +	unsigned int skb_burst = 0;
>>
>>  	while (likely(total_rx_packets < budget)) {
>>  		union ixgbe_adv_rx_desc *rx_desc;
>> @@ -2161,13 +2179,19 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
>>  		}
>>  
>>  #endif /* IXGBE_FCOE */
>> -		skb_mark_napi_id(skb, &q_vector->napi);
>> -		ixgbe_rx_skb(q_vector, skb);
>> -
>>  		/* update budget accounting */
>>  		total_rx_packets++;
>> +		skbs[skb_burst++] = skb;
>> +
>> +		if (skb_burst == BURST_SIZE) {
>> +			ixgbe_rx_skb_burst(skbs, skb_burst, q_vector,
>> +					   rx_ring->netdev);
>> +			skb_burst = 0;
>> +		}
>>  	}
>>  
>> +	ixgbe_rx_skb_burst(skbs, skb_burst, q_vector, rx_ring->netdev);
>> +
>>  	u64_stats_update_begin(&rx_ring->syncp);
>>  	rx_ring->stats.packets += total_rx_packets;
>>  	rx_ring->stats.bytes += total_rx_bytes;
>

For the burst size logic you might want to explore handling the
descriptors in 4 descriptor aligned chunks that should give you the best
possible performance since that would mean processing the descriptor
ring one cache-line at a time.

Thanks,

Alex

^ permalink raw reply

* Re: [patch] checkpatch: remove the ether_addr_copy warning
From: Dan Carpenter @ 2014-10-03 14:47 UTC (permalink / raw)
  To: Joe Perches
  Cc: Andy Whitcroft, Andrew Morton, netdev, kernel-janitors, gregkh
In-Reply-To: <1412346147.3247.97.camel@joe-AO725>

On Fri, Oct 03, 2014 at 07:22:27AM -0700, Joe Perches wrote:
> On Fri, 2014-10-03 at 12:35 +0300, Dan Carpenter wrote:
> > Most people sending checkpatch.pl fixes don't know how to verify the
> > alignment.  This checkpatch warning just encourages newbies to try
> > introduce bugs.  Patch submitters tell us that they just sed the code
> > and it's the job for the maintainer to check that it's correct.
> 
> I haven't seen many instances of bad patch submittals
> on netdev.  Is this mostly an issue for staging?

I don't follow netdev so I can't say.

Most of the time data is aligned at a 4 byte mark so probably you are
just getting lucky.  I really doubt that netdev checkpatch newbies know
about alignment...

> 
> Maybe a downgrade to CHK requiring --strict is OK.

I would actually like to turn --strict by default in staging.

Checkpatch is a good concept, but it should only do safe things instead
of telling newbies to send buggy patches.

regards,
dan carpenter


^ permalink raw reply

* Re: [PATCH net-next 0/2] sunvnet: Packet processing in non-interrupt context.
From: Sowmini Varadhan @ 2014-10-03 14:40 UTC (permalink / raw)
  To: David Miller; +Cc: raghuram.kothakota, netdev
In-Reply-To: <20141002.134346.2291749763304558513.davem@davemloft.net>

On (10/02/14 13:43), David Miller wrote:
> For example, you can now move everything into software IRQ context,
> just disable the VIO interrupt and unconditionally go into NAPI
> context from the VIO event.
> No more irqsave/irqrestore.
> Then the TX path even can run mostly lockless, it just needs
> to hold the VIO lock for a minute period of time.  The caller
> holds the xmit_lock of the network device to prevent re-entry
> into the ->ndo_start_xmit() path.
> 

let me check into this and get back. I think the xmit path
may also need to have some kind of locking for the dring access
and ldc_write? I think you are suggesting that I should also
move the control-packet processing to vnet_event_napi(), which
I have not done in my patch. I will examine where that leads.

But there is one thing that I do not understand - why does my hack
to lie to net_rx_action() by always returning "budget" 
make such a difference to throughput? 

Even if I set the budget to be as low as 64 (so I would get
called repeatedly under NAPIs polling infra), I have to 
turn on the "liar" commented code in my patch for the throughput
shoots up to 2 - 2.5  Gbps (whereas it's otherwise around 300 Mbps).
Eyeballing the net_rx_action() code quickly did not make the explanation
for this pop out at me (yet).

Pure polling (i.e., workq) gives me about 1.5 Gbps, and pure-tasklet
(i.e, just setting up a tasklet from vnet_event to handle data) gives me
approx 2 Gbps. So I dont understand why NAPI doesn't give me something
similar, if I adhere strictly to the documentation.

--Sowmini

^ permalink raw reply

* Re: [PATCH v3 net-next 1/3] bridge: Add a default_pvid sysfs attribute
From: Toshiaki Makita @ 2014-10-03 14:31 UTC (permalink / raw)
  To: vyasevic, Toshiaki Makita, Vladislav Yasevich, netdev; +Cc: stephen, bridge
In-Reply-To: <542EA7A8.1070701@redhat.com>

(14/10/03 (金) 22:42), Vlad Yasevich wrote:
> On 10/02/2014 09:08 PM, Toshiaki Makita wrote:
>> On 2014/10/03 8:54, Vladislav Yasevich wrote:
>>> This patch allows the user to set and retrieve default_pvid
>>> value.  A new value can only be stored when vlan filtering
>>> is disabled.
>>>
>>> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
>>> ---
>> ...
>>> +int br_vlan_set_default_pvid(struct net_bridge *br, unsigned long val)
>>> +{
>>> +	u16 pvid = val;
>>> +	int err = 0;
>>> +
>>> +	if (!val || val >= VLAN_VID_MASK)
>>> +		return -EINVAL;
>>> +
>>> +	if (!rtnl_trylock())
>>> +		return restart_syscall();
>>> +
>>> +	if (pvid == br->default_pvid)
>>> +		goto unlock;
>>> +
>>> +	/* Only allow default pvid change when filtering is disabled */
>>> +	if (br->vlan_enabled) {
>>> +		pr_info_once("Please disable vlan filtering to change default_pvid\n");
>>> +		err = -EPERM;
>>> +		goto unlock;
>>> +	}
>>> +
>>> +	br->default_pvid = vid;
>>
>> typo: s/vid/pvid/
> 
> How the hell did this even build then...!  Oh,  the last patch moved it elsewhere and
> fixed it.
> 
> Fixed.

This can break bisect, so I think this patch itself should be fixed.

Thanks,
Toshiaki Makita

^ permalink raw reply

* Re: [patch] checkpatch: remove the ether_addr_copy warning
From: Julia Lawall @ 2014-10-03 14:30 UTC (permalink / raw)
  To: Joe Perches
  Cc: Dan Carpenter, Andy Whitcroft, Andrew Morton, netdev,
	kernel-janitors, gregkh
In-Reply-To: <1412346147.3247.97.camel@joe-AO725>

On Fri, 3 Oct 2014, Joe Perches wrote:

> On Fri, 2014-10-03 at 12:35 +0300, Dan Carpenter wrote:
> > Most people sending checkpatch.pl fixes don't know how to verify the
> > alignment.  This checkpatch warning just encourages newbies to try
> > introduce bugs.  Patch submitters tell us that they just sed the code
> > and it's the job for the maintainer to check that it's correct.
>
> I haven't seen many instances of bad patch submittals
> on netdev.  Is this mostly an issue for staging?
>
> Maybe a downgrade to CHK requiring --strict is OK.
>
> > If you want to work on these then you can get the same information by
> > typing `grep -nw memcpy drivers/net/ -R | grep ETH_ALEN`
>
> That's not much of an argument for or against anything in
> checkpatch as every operation done by it can be reproduced
> by a series of grep operations.

I think it is too bad to have a piece of knowledge that was apparent be
made more obscure.  Why not just change the checkpatch warning to make
more explicit that a lot of expertise is required to make the change?

julia

^ permalink raw reply

* Re: [patch] checkpatch: remove the ether_addr_copy warning
From: Joe Perches @ 2014-10-03 14:22 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Andy Whitcroft, Andrew Morton, netdev, kernel-janitors, gregkh
In-Reply-To: <20141003093505.GA7393@mwanda>

On Fri, 2014-10-03 at 12:35 +0300, Dan Carpenter wrote:
> Most people sending checkpatch.pl fixes don't know how to verify the
> alignment.  This checkpatch warning just encourages newbies to try
> introduce bugs.  Patch submitters tell us that they just sed the code
> and it's the job for the maintainer to check that it's correct.

I haven't seen many instances of bad patch submittals
on netdev.  Is this mostly an issue for staging?

Maybe a downgrade to CHK requiring --strict is OK.

> If you want to work on these then you can get the same information by
> typing `grep -nw memcpy drivers/net/ -R | grep ETH_ALEN`

That's not much of an argument for or against anything in
checkpatch as every operation done by it can be reproduced
by a series of grep operations.



^ permalink raw reply

* Re: HW bridging support using notifiers?
From: Benjamin LaHaise @ 2014-10-03 14:22 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: netdev, davem, jiri, stephen, andy, tgraf, nbd, john.r.fastabend,
	edumazet, vyasevic, buytenh, sfeldma, Jamal Hadi Salim
In-Reply-To: <542E0089.5050005@gmail.com>

Hi Florian et al,

On Thu, Oct 02, 2014 at 06:48:57PM -0700, Florian Fainelli wrote:
> Hi all,
> 
> I am taking a look at adding HW bridging support to DSA, in way that's
> usable outside of DSA.

I've been working on support for the RTL8366S switch, and our work is 
directly overlapping here.  I actually have something that is working at 
configuring port and tag based vlans on the RTL8366S.  I'll try to clean 
up the code to post something for discussion over the next couple of days.

> Lennert's approach in 2008 [1] looks conceptually good to me,as he
> noted, it uses a bunch of new ndo's which is not only limiting to one
> ndo implementer per struct net_device, but also is mostly consuming the
> information from the bridge layer, while the ndo is an action

I think having ndo implementer methods for hardware switch offloads makes 
more sense.  Such a scheme is needed in order to implement the stacking of 
devices that is required in order to transparently handle configuration of 
vlans on switch ports where the 8021q device has to pass on the vlan tag 
to the switch device.  The ndo methods do perform an action of causing the 
switch to be configured to match the bridge config.  Additionally, they 
can be used to veto changes that cannot be offloaded to hardware -- this 
(configurable) behaviour is desired by some users of these APIs who wish 
to be made aware when a particuarly configuration is not supported by the 
underlying hardware.

> So here's what I am up to now:
> 
> - use the NETDEV_JOIN notifier to discover when a bridge port is added
> - use the NETDEV_LEAVE notifier, still need to verify this does not
> break netconsole as indicated in net/bridge/br_if.c
> - use the NETDEV_CHANGEINFODATA notifier to notify about STP state changes

To me, notifiers are the wrong model for join and leave.  Implementing 
stacking on top of notifiers is somewhat more complicated.  Here are the 
ndo methods I've implemented so far which are sufficient for basic config 
of the RTL8366S.  They're fairly similar to those in [1].

+	int			(*ndo_join_bridge)(struct net_bridge *bridge,
+						   struct net_device *dev,
+						   int *switch_nr,
+						   int *switch_port_nr,
+						   int vlan);
+	int			(*ndo_leave_bridge)(struct net_bridge *bridge,
+						    struct net_device *dev,
+						    int switch_nr,
+						    int switch_port_nr,
+						    int vlan);
+	int			(*ndo_flood_xmit)(struct switch_info *dev,
+						  struct sk_buff *skb,
+						  u64 port_mask);

There are a couple of important points here.  In the case of joining and 
leaving a bridge, the bridge needs to be provided with information it can 
use to identify switch ports.  This is needed in order to offload the 
flooding of packets to multiple ports, as otherwise the Linux bridge code 
doesn't have any way to figure out which packets can be merged into a 
single transmit via the ndo_flood_xmit() method.

> Now, this raises a bunch of questions:
> 
> - we would need a getter to return the stp state of a given network
> device when called with NETDEV_CHANGEINFODATA, is that acceptable? This
> would be the first function exported by the bridge layer to expose
> internal data

I have yet to dig into STP, so I'll refrain from commenting on these parts 
for now.

> NB: this also raises the question of the race condition and locking
> within br_set_stp_state() and when the network devices notifier callback
> runs
U
> - or do we need a new network device notifier accepting an opaque
> pointer which could provide us with the data we what, something like
> this: call_netdevices_notifier_data(NETDEV_CHANGEINFODATA, dev, info),
> where info would be a structure/union telling what's this data about
> 
> Let me know what you think, thanks!
> 
> [1]: http://patchwork.ozlabs.org/patch/16578/

Thanks for the pointer to this.  Cheers!

		-ben

> --
> Florian
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
"Thought is the essence of where you are now."

^ permalink raw reply

* [PATCH] team: avoid race condition in scheduling delayed work
From: Joe Lawrence @ 2014-10-03 13:58 UTC (permalink / raw)
  To: netdev; +Cc: Jiri Pirko, Joe Lawrence

When team_notify_peers and team_mcast_rejoin are called, they both reset
their respective .count_pending atomic variable. Then when the actual
worker function is executed, the variable is atomically decremented.
This pattern introduces a potential race condition where the
.count_pending rolls over and the worker function keeps rescheduling
until .count_pending decrements to zero again:

THREAD 1                           THREAD 2
========                           ========
team_notify_peers(teamX)
  atomic_set count_pending = 1
  schedule_delayed_work
                                   team_notify_peers(teamX)
                                   atomic_set count_pending = 1
team_notify_peers_work
  atomic_dec_and_test
    count_pending = 0
  (return)
                                   schedule_delayed_work
                                   team_notify_peers_work
                                   atomic_dec_and_test
                                     count_pending = -1
                                   schedule_delayed_work
                                   (repeat until count_pending = 0)

Instead of assigning a new value to .count_pending, use atomic_add to
tack-on the additional desired worker function invocations.

Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Fixes: fc423ff00df3a19554414ee ("team: add peer notification")
Fixes: 492b200efdd20b8fcfdac87 ("team: add support for sending multicast rejoins")
---
 drivers/net/team/team.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index d46df38..2b87e3f 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -647,7 +647,7 @@ static void team_notify_peers(struct team *team)
 {
 	if (!team->notify_peers.count || !netif_running(team->dev))
 		return;
-	atomic_set(&team->notify_peers.count_pending, team->notify_peers.count);
+	atomic_add(team->notify_peers.count, &team->notify_peers.count_pending);
 	schedule_delayed_work(&team->notify_peers.dw, 0);
 }
 
@@ -687,7 +687,7 @@ static void team_mcast_rejoin(struct team *team)
 {
 	if (!team->mcast_rejoin.count || !netif_running(team->dev))
 		return;
-	atomic_set(&team->mcast_rejoin.count_pending, team->mcast_rejoin.count);
+	atomic_add(team->mcast_rejoin.count, &team->mcast_rejoin.count_pending);
 	schedule_delayed_work(&team->mcast_rejoin.dw, 0);
 }
 
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH net v2 1/1] ematch: Fix early ending of inverted containers.
From: Ignacy Gawędzki @ 2014-10-03 13:44 UTC (permalink / raw)
  To: netdev
In-Reply-To: <542E9505.8010805@cogentembedded.com>

The result of a negated container has to be inverted before checking for
early ending.

This fixes my previous attempt (17c9c8232663a47f074b7452b9b034efda868ca7) to
make inverted containers work correctly.

Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr>
---
 net/sched/ematch.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/sched/ematch.c b/net/sched/ematch.c
index ad57f44..f878fa1 100644
--- a/net/sched/ematch.c
+++ b/net/sched/ematch.c
@@ -526,9 +526,10 @@ pop_stack:
 		match_idx = stack[--stackp];
 		cur_match = tcf_em_get_match(tree, match_idx);
 
+		if (tcf_em_is_inverted(cur_match))
+			res = !res;
+
 		if (tcf_em_early_end(cur_match, res)) {
-			if (tcf_em_is_inverted(cur_match))
-				res = !res;
 			goto pop_stack;
 		} else {
 			match_idx++;
-- 
1.9.1

^ permalink raw reply related

* Re: [net-next 1/6] net: Add Geneve tunneling protocol driver
From: Nicolas Dichtel @ 2014-10-03 13:44 UTC (permalink / raw)
  To: Andy Zhou, davem; +Cc: netdev, Jesse Gross
In-Reply-To: <1412237085-27215-2-git-send-email-azhou@nicira.com>

Le 02/10/2014 10:04, Andy Zhou a écrit :
> This adds a device level support for Geneve -- Generic Network
> Virtualization Encapsulation. The protocol is documented at
> http://tools.ietf.org/html/draft-gross-geneve-01
>
> Only protocol layer Geneve support is provided by this driver.
> Openvswitch can be used for configuring, set up and tear down
> functional Geneve tunnels.
Do you plan too add the full support (ie being able to configure a
geneve netdev interface with iproute2)?

Another small comment below.
>
> Signed-off-by: Jesse Gross <jesse@nicira.com>
> Signed-off-by: Andy Zhou <azhou@nicira.com>
> ---
>   include/net/geneve.h     |   91 +++++++++++
>   include/net/ip_tunnels.h |    2 +
>   net/ipv4/Kconfig         |   14 ++
>   net/ipv4/Makefile        |    1 +
>   net/ipv4/geneve.c        |  373 ++++++++++++++++++++++++++++++++++++++++++++++
>   5 files changed, 481 insertions(+)
>   create mode 100644 include/net/geneve.h
>   create mode 100644 net/ipv4/geneve.c
>
> diff --git a/include/net/geneve.h b/include/net/geneve.h
> new file mode 100644
> index 0000000..ce98865
> --- /dev/null
> +++ b/include/net/geneve.h
> @@ -0,0 +1,91 @@
> +#ifndef __NET_GENEVE_H
> +#define __NET_GENEVE_H  1
> +
> +#include <net/udp_tunnel.h>
> +
> +struct geneve_sock;
> +
> +typedef void (geneve_rcv_t)(struct geneve_sock *gs, struct sk_buff *skb);
> +
> +struct geneve_sock {
> +	struct hlist_node	hlist;
> +	geneve_rcv_t		*rcv;
> +	void			*rcv_data;
> +	struct work_struct	del_work;
> +	struct socket		*sock;
> +	struct rcu_head		rcu;
> +	atomic_t		refcnt;
> +	struct udp_offload	udp_offloads;
> +};
> +
> +/* Geneve Header:
> + *  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> + *  |Ver|  Opt Len  |O|C|    Rsvd.  |          Protocol Type        |
> + *  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> + *  |        Virtual Network Identifier (VNI)       |    Reserved   |
> + *  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> + *  |                    Variable Length Options                    |
> + *  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> + *
> + * Option Header:
> + *  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> + *  |          Option Class         |      Type     |R|R|R| Length  |
> + *  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> + *  |                      Variable Option Data                     |
> + *  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> + */
> +
> +struct geneve_opt {
> +	__be16	opt_class;
> +	u8	type;
> +#ifdef __LITTLE_ENDIAN_BITFIELD
> +	u8	length:5;
> +	u8	r3:1;
> +	u8	r2:1;
> +	u8	r1:1;
> +#else
> +	u8	r1:1;
> +	u8	r2:1;
> +	u8	r3:1;
> +	u8	length:5;
> +#endif
> +	u8	opt_data[];
> +};
> +
> +#define GENEVE_CRIT_OPT_TYPE (1 << 7)
> +
> +struct genevehdr {
> +#ifdef __LITTLE_ENDIAN_BITFIELD
> +	u8 opt_len:6;
> +	u8 ver:2;
> +	u8 rsvd1:6;
> +	u8 critical:1;
> +	u8 oam:1;
> +#else
> +	u8 ver:2;
> +	u8 opt_len:6;
> +	u8 oam:1;
> +	u8 critical:1;
> +	u8 rsvd1:6;
> +#endif
> +	__be16 proto_type;
> +	u8 vni[3];
> +	u8 rsvd2;
> +	struct geneve_opt options[];
> +};
> +
> +#define GENEVE_VER 0
> +#define GENEVE_BASE_HLEN (sizeof(struct udphdr) + sizeof(struct genevehdr))
> +
> +struct geneve_sock *geneve_sock_add(struct net *net, __be16 port,
> +				    geneve_rcv_t *rcv, void *data,
> +				    bool no_share, bool ipv6);
> +
> +void geneve_sock_release(struct geneve_sock *vs);
> +
> +int geneve_xmit_skb(struct geneve_sock *gs, struct rtable *rt,
> +		    struct sk_buff *skb, __be32 src, __be32 dst, __u8 tos,
> +		    __u8 ttl, __be16 df, __be16 src_port, __be16 dst_port,
> +		    __be16 tun_flags, u8 vni[3], u8 opt_len, u8 *opt,
> +		    bool xnet);
> +#endif
> diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
> index 7f538ba..a9ce155 100644
> --- a/include/net/ip_tunnels.h
> +++ b/include/net/ip_tunnels.h
> @@ -95,6 +95,8 @@ struct ip_tunnel {
>   #define TUNNEL_VERSION	__cpu_to_be16(0x40)
>   #define TUNNEL_NO_KEY	__cpu_to_be16(0x80)
>   #define TUNNEL_DONT_FRAGMENT    __cpu_to_be16(0x0100)
> +#define TUNNEL_OAM	__cpu_to_be16(0x0200)
> +#define TUNNEL_CRIT_OPT	__cpu_to_be16(0x0400)
>
>   struct tnl_ptk_info {
>   	__be16 flags;
> diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
> index 69fb378..15ce6b0 100644
> --- a/net/ipv4/Kconfig
> +++ b/net/ipv4/Kconfig
> @@ -453,6 +453,20 @@ config TCP_CONG_BIC
>   	increase provides TCP friendliness.
>   	See http://www.csc.ncsu.edu/faculty/rhee/export/bitcp/
>
> +config GENEVE
> +       tristate "Generic Network Virtualization Encapsulation (Geneve)"
> +       depends on INET
> +       select NET_IP_TUNNEL
> +       select NET_UDP_TUNNEL
> +       ---help---
Use tabs instead of spaces for the baove lines.

^ permalink raw reply

* [PATCH net v2 0/1] Re: ematch: Fix the matching of inverted containers (again).
From: Ignacy Gawędzki @ 2014-10-03 13:44 UTC (permalink / raw)
  To: netdev
In-Reply-To: <542E9505.8010805@cogentembedded.com>

On Fri, Oct 03, 2014 at 04:22:29PM +0400, thus spake Sergei Shtylyov:
>    The kernel style dictates that an *if* statement should have {}
> in all its arms, if it has {} in at least one of the arms.

Hi,

Thanks for your kind remark.  Fixed patch follows.

Ignacy Gawędzki (1):
  ematch: Fix early ending of inverted containers.

 net/sched/ematch.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

-- 
1.9.1

^ permalink raw reply

* Re: [PATCH v3 net-next 1/3] bridge: Add a default_pvid sysfs attribute
From: Vlad Yasevich @ 2014-10-03 13:42 UTC (permalink / raw)
  To: Toshiaki Makita, Vladislav Yasevich, netdev; +Cc: stephen, bridge
In-Reply-To: <542DF6FA.4050107@lab.ntt.co.jp>

On 10/02/2014 09:08 PM, Toshiaki Makita wrote:
> On 2014/10/03 8:54, Vladislav Yasevich wrote:
>> This patch allows the user to set and retrieve default_pvid
>> value.  A new value can only be stored when vlan filtering
>> is disabled.
>>
>> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
>> ---
> ...
>> +int br_vlan_set_default_pvid(struct net_bridge *br, unsigned long val)
>> +{
>> +	u16 pvid = val;
>> +	int err = 0;
>> +
>> +	if (!val || val >= VLAN_VID_MASK)
>> +		return -EINVAL;
>> +
>> +	if (!rtnl_trylock())
>> +		return restart_syscall();
>> +
>> +	if (pvid == br->default_pvid)
>> +		goto unlock;
>> +
>> +	/* Only allow default pvid change when filtering is disabled */
>> +	if (br->vlan_enabled) {
>> +		pr_info_once("Please disable vlan filtering to change default_pvid\n");
>> +		err = -EPERM;
>> +		goto unlock;
>> +	}
>> +
>> +	br->default_pvid = vid;
> 
> typo: s/vid/pvid/

How the hell did this even build then...!  Oh,  the last patch moved it elsewhere and
fixed it.

Fixed.
Thanks
-vlad

> 
> Thanks,
> Toshiaki Makita
> 

^ permalink raw reply

* Re: [PATCH v3 net-next 3/3] bridge: Add filtering support for default_pvid
From: Vlad Yasevich @ 2014-10-03 13:37 UTC (permalink / raw)
  To: Cong Wang, Vladislav Yasevich
  Cc: Stephen Hemminger, netdev, bridge@lists.linux-foundation.org
In-Reply-To: <CAHA+R7OycteYron5K3YJ3qtsEE-5uUxhPkictJDWosXCihR5hA@mail.gmail.com>

On 10/03/2014 12:41 AM, Cong Wang wrote:
> On Thu, Oct 2, 2014 at 4:54 PM, Vladislav Yasevich <vyasevich@gmail.com> wrote:
>> +static void br_vlan_disable_default_pvid(struct net_bridge *br)
>> +{
>> +       struct net_bridge_port *p;
>> +       u16 pvid = br->default_pvid;
>> +
>> +       /* Disable default_pvid on all ports where it is still
>> +        * configured.
>> +        */
>> +
> 
> This empty line is not necessary.
> 
>> +       if (vlan_default_pvid(br_get_vlan_info(br), pvid))
>> +               br_vlan_delete(br, pvid);
>> +
>> +       list_for_each_entry(p, &br->port_list, list) {
>> +               if (vlan_default_pvid(nbp_get_vlan_info(p), pvid))
>> +                       nbp_vlan_delete(p, pvid);
>> +       }
>> +
>> +       br->default_pvid = 0;
>> +}
>> +
>> +static int __br_vlan_set_default_pvid(struct net_bridge *br, u16 pvid)
>> +{
>> +       struct net_bridge_port *p;
>> +       u16 old_pvid;
>> +       int err;
>> +       DECLARE_BITMAP(changed, BR_MAX_PORTS);
> 
> 
> This bitmap will use 128 bytes on stack, why not using heap?
> 

I suppose I wanted to avoid yet another memory allocation failure condition.
Is this really going to cause issues?

Thanks
-vlad


>> +
>> +       bitmap_zero(changed, BR_MAX_PORTS);
>> +
>> +       /* This function runs with filtering turned off so we can
>> +        * remove the old pvid configuration and add the new one after
>> +        * without impacting traffic.
>> +        */
>> +
>> +       old_pvid = br->default_pvid;
> 
> 
> Remove the empty line.
> 
> [...]
> 
>> +int nbp_vlan_init(struct net_bridge_port *p)
>> +{
>> +       int rc = 0;
>> +
>> +       if (p->br->default_pvid) {
>> +               rc = nbp_vlan_add(p, p->br->default_pvid,
>> +                                 BRIDGE_VLAN_INFO_PVID |
>> +                                 BRIDGE_VLAN_INFO_UNTAGGED);
>> +       }
>> +
>> +       return rc;
>> +}
> 
> 'rc' can be removed.
> 

^ permalink raw reply

* [PATCH iproute2] ip address: print stats with -s
From: Jiri Benc @ 2014-10-03 13:25 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, Pavel Simerda

Make ip address show accept the -s option similarly to ip link. This creates
an one command replacement for "ifconfig -a" useful for people who still
stay with ifconfig because of this feature.

Print the stats as the last thing for the interface. This requires some code
shuffling.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
 ip/ipaddress.c | 43 +++++++++++++++++++++++++++++++------------
 1 file changed, 31 insertions(+), 12 deletions(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 245df39179a9..45729d890abd 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -322,7 +322,6 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
 static void print_link_stats64(FILE *fp, const struct rtnl_link_stats64 *s,
                                const struct rtattr *carrier_changes)
 {
-	fprintf(fp, "%s", _SL_);
 	fprintf(fp, "    RX: bytes  packets  errors  dropped overrun mcast   %s%s",
 		s->rx_compressed ? "compressed" : "", _SL_);
 	fprintf(fp, "    %-10"PRIu64" %-8"PRIu64" %-7"PRIu64" %-7"PRIu64" %-7"PRIu64" %-7"PRIu64"",
@@ -375,10 +374,9 @@ static void print_link_stats64(FILE *fp, const struct rtnl_link_stats64 *s,
 	}
 }
 
-static void print_link_stats(FILE *fp, const struct rtnl_link_stats *s,
-			     const struct rtattr *carrier_changes)
+static void print_link_stats32(FILE *fp, const struct rtnl_link_stats *s,
+			       const struct rtattr *carrier_changes)
 {
-	fprintf(fp, "%s", _SL_);
 	fprintf(fp, "    RX: bytes  packets  errors  dropped overrun mcast   %s%s",
 		s->rx_compressed ? "compressed" : "", _SL_);
 	fprintf(fp, "    %-10u %-8u %-7u %-7u %-7u %-7u",
@@ -425,6 +423,27 @@ static void print_link_stats(FILE *fp, const struct rtnl_link_stats *s,
 	}
 }
 
+static void __print_link_stats(FILE *fp, struct rtattr **tb)
+{
+	if (tb[IFLA_STATS64])
+		print_link_stats64(fp, RTA_DATA(tb[IFLA_STATS64]),
+					tb[IFLA_CARRIER_CHANGES]);
+	else if (tb[IFLA_STATS])
+		print_link_stats32(fp, RTA_DATA(tb[IFLA_STATS]),
+					tb[IFLA_CARRIER_CHANGES]);
+}
+
+static void print_link_stats(FILE *fp, struct nlmsghdr *n)
+{
+	struct ifinfomsg *ifi = NLMSG_DATA(n);
+	struct rtattr * tb[IFLA_MAX+1];
+
+	parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi),
+		     n->nlmsg_len - NLMSG_LENGTH(sizeof(*ifi)));
+	__print_link_stats(fp, tb);
+	fprintf(fp, "%s", _SL_);
+}
+
 int print_linkinfo(const struct sockaddr_nl *who,
 		   struct nlmsghdr *n, void *arg)
 {
@@ -550,12 +569,8 @@ int print_linkinfo(const struct sockaddr_nl *who,
 	}
 
 	if (do_link && show_stats) {
-		if (tb[IFLA_STATS64])
-			print_link_stats64(fp, RTA_DATA(tb[IFLA_STATS64]),
-					   tb[IFLA_CARRIER_CHANGES]);
-		else if (tb[IFLA_STATS])
-			print_link_stats(fp, RTA_DATA(tb[IFLA_STATS]),
-					 tb[IFLA_CARRIER_CHANGES]);
+		fprintf(fp, "%s", _SL_);
+		__print_link_stats(fp, tb);
 	}
 
 	if (do_link && tb[IFLA_VFINFO_LIST] && tb[IFLA_NUM_VF]) {
@@ -567,7 +582,7 @@ int print_linkinfo(const struct sockaddr_nl *who,
 
 	fprintf(fp, "\n");
 	fflush(fp);
-	return 0;
+	return 1;
 }
 
 static int flush_update(void)
@@ -1282,11 +1297,15 @@ static int ipaddr_list_flush_or_save(int argc, char **argv, int action)
 	}
 
 	for (l = linfo.head; l; l = l->next) {
-		if (no_link || print_linkinfo(NULL, &l->h, stdout) == 0) {
+		int res = 0;
+
+		if (no_link || (res = print_linkinfo(NULL, &l->h, stdout)) >= 0) {
 			struct ifinfomsg *ifi = NLMSG_DATA(&l->h);
 			if (filter.family != AF_PACKET)
 				print_selected_addrinfo(ifi->ifi_index,
 							ainfo.head, stdout);
+			if (res > 0 && !do_link && show_stats)
+				print_link_stats(stdout, &l->h);
 		}
 	}
 	fflush(stdout);
-- 
1.8.3.1

^ permalink raw reply related

* Re: [RFC PATCH linux 2/2] fs/proc: use a hash table for the directory entries
From: Nicolas Dichtel @ 2014-10-03 13:10 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: netdev, linux-kernel, davem, ebiederm, akpm, adobriyan, rui.xiang,
	viro, oleg, gorcunov, kirill.shutemov, grant.likely, tytso,
	Thierry Herbelot
In-Reply-To: <20141002094657.1e28e027@urahara>

Le 02/10/2014 18:46, Stephen Hemminger a écrit :
> On Thu,  2 Oct 2014 17:25:01 +0200
> Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:
>
>> From: Thierry Herbelot <thierry.herbelot@6wind.com>
>>
>> The current implementation for the directories in /proc is using a single
>> linked list. This is slow when handling directories with large numbers of
>> entries (eg netdevice-related entries when lots of tunnels are opened).
>>
>> This patch enables multiple linked lists. A hash based on the entry name is
>> used to select the linked list for one given entry.
>>
>> The speed creation of netdevices is faster as shorter linked lists must be
>> scanned when adding a new netdevice.
>>
>> Here are some numbers:
>>
>> dummy30000.batch contains 30 000 times 'link add type dummy'.
>>
>> Before the patch:
>> time ip -b dummy30000.batch
>> real    2m32.221s
>> user    0m0.380s
>> sys     2m30.610s
>>
>> After the patch:
>> time ip -b dummy30000.batch
>> real    1m57.190s
>> user    0m0.350s
>> sys     1m56.120s
>>
>> The single 'subdir' list head is replaced by a subdir hash table. The subdir
>> hash buckets are only allocated for directories. The number of hash buckets
>> is a compile-time parameter.
>>
>> For all functions which handle directory entries, an additional check on the
>> directory nature of the dir entry ensures that pde_hash_buckets was allocated.
>> This check was not needed as subdir was present for all dir entries, whether
>> actual directories or simple files.
>>
>> Signed-off-by: Thierry Herbelot <thierry.herbelot@6wind.com>
>> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>
> I think the speed up is a good idea and makes sense.
> It would be better to use exist hlist macros for hash table rather than
> open coding it.
>
Right, I will rework this after looking at rbtree.

^ permalink raw reply

* Re: [RFC PATCH linux 2/2] fs/proc: use a hash table for the directory entries
From: Nicolas Dichtel @ 2014-10-03 13:09 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: netdev, linux-kernel, davem, akpm, adobriyan, rui.xiang, viro,
	oleg, gorcunov, kirill.shutemov, grant.likely, tytso,
	Thierry Herbelot
In-Reply-To: <87h9zmpcz5.fsf@x220.int.ebiederm.org>

Le 02/10/2014 20:01, Eric W. Biederman a écrit :
> Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:
>
>> From: Thierry Herbelot <thierry.herbelot@6wind.com>
>>
>> The current implementation for the directories in /proc is using a single
>> linked list. This is slow when handling directories with large numbers of
>> entries (eg netdevice-related entries when lots of tunnels are opened).
>>
>> This patch enables multiple linked lists. A hash based on the entry name is
>> used to select the linked list for one given entry.
>>
>> The speed creation of netdevices is faster as shorter linked lists must be
>> scanned when adding a new netdevice.
>
> Is the directory of primary concern /proc/net/dev/snmp6 ?
Yes.

>
> Unless I have configured my networking stack weird by mistake that
> is the only directory under /proc/net that grows when we add an
> interface.
>
> I just want to make certain I am seeing the same things that you are
> seeing.
>
> I feel silly for overlooking this directory when the rest of the
> scalability work was done.
>
>> Here are some numbers:
>>
>> dummy30000.batch contains 30 000 times 'link add type dummy'.
>>
>> Before the patch:
>> time ip -b dummy30000.batch
>> real    2m32.221s
>> user    0m0.380s
>> sys     2m30.610s
>>
>> After the patch:
>> time ip -b dummy30000.batch
>> real    1m57.190s
>> user    0m0.350s
>> sys     1m56.120s
>>
>> The single 'subdir' list head is replaced by a subdir hash table. The subdir
>> hash buckets are only allocated for directories. The number of hash buckets
>> is a compile-time parameter.
>
> That looks like a nice speed up.  A couple of things.
>
> With sysfs and sysctl when faced this class of challenge we used an
> rbtree instead of a hash table.  That should use less memory and scale
> better.
>
> I am concerned about a fixed sized hash table moving the location where
> we fall off a cliff but not removing the cliff itself.
>
> I suppose it would be possible to use the new fancy resizable hash
> tables but previous work on sysctl and sysfs suggests that we don't look
> up these entries sufficiently to require a hash table.  We just need a
> data structure that doesn't fall over at scale, and the rbtrees seem to
> do that very nicely.
Ok, I will have a look at it.



Thank you,
Nicolas

^ permalink raw reply

* Re: [RFC PATCH linux 2/2] fs/proc: use a hash table for the directory entries
From: Nicolas Dichtel @ 2014-10-03 13:07 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: netdev, Linux Kernel, David Miller, Eric W. Biederman,
	Andrew Morton, rui.xiang, Al Viro, Oleg Nesterov, Cyrill Gorcunov,
	kirill.shutemov, Grant Likely, Theodore Ts'o,
	Thierry Herbelot
In-Reply-To: <CACVxJT98EJvw1E+o6mpu8ugMz3Ztqz7pJOwZeZW9A41P0AhB_g@mail.gmail.com>

Le 03/10/2014 12:55, Alexey Dobriyan a écrit :
> On Thu, Oct 2, 2014 at 6:25 PM, Nicolas Dichtel
> <nicolas.dichtel@6wind.com> wrote:
>> --- a/fs/proc/generic.c
>> +++ b/fs/proc/generic.c
>> @@ -81,10 +81,13 @@ static int __xlate_proc_name(const char *name, struct proc_dir_entry **ret,
>
>> +       if (!S_ISDIR(de->mode))
>> +               return -EINVAL;
>
> There are way too many S_ISDIR checks.
> In lookup and readdir, it is guaranteed that PDE is directory.
>
> I'd say all of them aren't needed because non-directories have
> ->subdir = NULL and
> directories have ->subdir != NULL which transforms into hashtable or
> rbtree or whatever,
> so you only need to guarantee only directories appear where they are
> expected and
> fearlessly use your new data structure traversing directories.
To be honest, they was put during the debug stage and I hesitated to remove
them at the end.
I will remove them!


Thank you,
Nicolas

^ permalink raw reply

* Re: [RFC PATCH linux 2/2] fs/proc: use a hash table for the directory entries
From: Nicolas Dichtel @ 2014-10-03 13:07 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: netdev, linux-kernel, davem, ebiederm, akpm, rui.xiang, viro,
	oleg, gorcunov, kirill.shutemov, grant.likely, tytso,
	Thierry Herbelot
In-Reply-To: <20141002172814.GA2435@p183.telecom.by>

Le 02/10/2014 19:28, Alexey Dobriyan a écrit :
> On Thu, Oct 02, 2014 at 05:25:01PM +0200, Nicolas Dichtel wrote:
>> +static inline unsigned int proc_pde_name_hash(const unsigned char *name,
>> +					      const unsigned int len)
>> +{
>> +	return full_name_hash(name, len) & PROC_PDE_HASH_MASK;
>> +}
>
> PDE already stands for "proc dir entry" :^)
>
> 	Alexey
>
Oops, will fix it in v2.


Thank you,
Nicolas

^ permalink raw reply

* Re: [PATCH v2 3/6] dtb: Add 10GbE node to APM X-Gene SoC device tree
From: Mark Rutland @ 2014-10-03 12:49 UTC (permalink / raw)
  To: Iyappan Subramanian
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	devicetree@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	patches@apm.com, kchudgar@apm.com
In-Reply-To: <1412294568-26747-4-git-send-email-isubramanian@apm.com>

Hi,

[...]

> +		xgenet: ethernet@1f610000 {
> +			compatible = "apm,xgene-enet";
> +			status = "disabled";
> +			reg = <0x0 0x1f610000 0x0 0xd100>,
> +			      <0x0 0x1f600000 0x0 0X400>,
> +			      <0x0 0x18000000 0x0 0X200>;
> +			reg-names = "enet_csr", "ring_csr", "ring_cmd";
> +			interrupts = <0x0 0x60 0x4>;
> +			dma-coherent;
> +			clocks = <&xge0clk 0>;
> +			local-mac-address = [00 01 73 00 00 04];

Apolgoies, I'd missed your reply on this last time. Please use
all-zeroes for the mac address, and have a comment that this will be
overwritten by the bootloader.

Thanks,
Mark.

^ permalink raw reply

* Re: [PATCH net] ematch: Fix the matching of inverted containers (again).
From: Sergei Shtylyov @ 2014-10-03 12:22 UTC (permalink / raw)
  To: Ignacy Gawędzki, netdev
In-Reply-To: <20141003080600.GA13146@zenon.in.qult.net>

Hello.

On 10/3/2014 12:06 PM, Ignacy Gawędzki wrote:

> The result of a negated container has to be inverted before checking for
> early ending.

> Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr>
> ---
>   net/sched/ematch.c | 9 +++++----
>   1 file changed, 5 insertions(+), 4 deletions(-)

> diff --git a/net/sched/ematch.c b/net/sched/ematch.c
> index ad57f44..300ecf6 100644
> --- a/net/sched/ematch.c
> +++ b/net/sched/ematch.c
> @@ -526,11 +526,12 @@ pop_stack:
>   		match_idx = stack[--stackp];
>   		cur_match = tcf_em_get_match(tree, match_idx);
>
> -		if (tcf_em_early_end(cur_match, res)) {
> -			if (tcf_em_is_inverted(cur_match))
> -				res = !res;
> +		if (tcf_em_is_inverted(cur_match))
> +			res = !res;
> +
> +		if (tcf_em_early_end(cur_match, res))
>   			goto pop_stack;
> -		} else {
> +		else {
>   			match_idx++;
>   			goto proceed;
>   		}

    The kernel style dictates that an *if* statement should have {} in all its 
arms, if it has {} in at least one of the arms.

WBR, Sergei

^ permalink raw reply

* Re: [RFC PATCH net-next v3 1/4] netns: add genl cmd to add and get peer netns ids
From: Nicolas Dichtel @ 2014-10-03 12:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	luto-kltTT9wpgjJwATOyAt5JVQ, cwang-xCSkyg8dI+0RB7SZvlqPiA
In-Reply-To: <87tx3mmflp.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

Le 02/10/2014 21:33, Eric W. Biederman a écrit :
> Nicolas Dichtel <nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org> writes:
>
>> With this patch, a user can define an id for a peer netns by providing a FD or a
>> PID. These ids are local to netns (ie valid only into one netns).
>>
>> This will be useful for netlink messages when a x-netns interface is
>> dumped.
>
> You have a "id -> struct net *" table but you don't have a
> "struct net * -> id" table which looks like it will impact the
> performance of peernet2id at scale.
It is indirectly stores in 'struct idr'. It can be optimized later, with a
proper algorithm to find quickly this 'struct net *' (hash table? something
else?). A basic algorithm will not be more scalable than the current
idr_for_each().

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox