Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] genirq: set initial affinity when hinting
From: Jesse Brandeburg @ 2014-12-19 19:53 UTC (permalink / raw)
  To: Jesse Brandeburg
  Cc: Thomas Gleixner, linux-kernel@vger.kernel.org, NetDEV list
In-Reply-To: <20141219012206.4220.27491.stgit@jbrandeb-cp2.jf.intel.com>

+netdev, as network driver developers might be interested in this functionality.

On Thu, Dec 18, 2014 at 5:22 PM, Jesse Brandeburg
<jesse.brandeburg@intel.com> wrote:
> Problem:
> The default behavior of the kernel is somewhat undesirable as all
> requested interrupts end up on CPU0 after registration.  A user can
> run irqbalance daemon, or can manually configure smp_affinity via the
> proc filesystem, but the default affinity of the interrupts for all
> devices is always CPU zero, this can cause performance problems or
> very heavy cpu use of only one core if not noticed and fixed by the
> user.
>
> Patch:
> This patch enables the setting of the initial affinity directly
> when the driver sets a hint.
>
> This enabling means that kernel drivers can include an initial
> affinity setting for the interrupt, instead of all interrupts starting
> out life on CPU0. Of course if irqbalance is still running then the
> interrupts will get moved as before.
>
> This function is currently called by drivers in block, crypto,
> infiniband, ethernet and scsi trees, but only a handful, so these will
> be the devices affected by this change.
>
> Tested on i40e, and default interrupts were spread across the CPUs
> according to the hint.
>
> drivers/block/mtip32xx/mtip32xx.c:3
> drivers/block/nvme-core.c:2
> drivers/crypto/qat/qat_dh895xcc/adf_isr.c:3
> drivers/infiniband/hw/qib/qib_iba7322.c:2
> drivers/net/ethernet/intel/i40e/i40e_main.c:3
> drivers/net/ethernet/intel/i40evf/i40evf_main.c:3
> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:3
> drivers/net/ethernet/mellanox/mlx4/en_cq.c:2
> drivers/scsi/hpsa.c:3
> drivers/scsi/lpfc/lpfc_init.c:3
> drivers/scsi/megaraid/megaraid_sas_base.c:8
> drivers/soc/ti/knav_qmss_acc.c:1
> drivers/soc/ti/knav_qmss_queue.c:2
> drivers/virtio/virtio_pci_common.c:2
>
> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> CC: Thomas Gleixner <tglx@linutronix.de>
> ---
>  kernel/irq/manage.c |    2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 8069237..f038e58 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -243,6 +243,8 @@ int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
>                 return -EINVAL;
>         desc->affinity_hint = m;
>         irq_put_desc_unlock(desc, flags);
> +       /* set the initial affinity to prevent every interrupt being on CPU0 */
> +       __irq_set_affinity(irq, m, false);
>         return 0;
>  }
>  EXPORT_SYMBOL_GPL(irq_set_affinity_hint);
>

^ permalink raw reply

* Re: [PATCH 01/10] core: Split out UFO6 support
From: Vlad Yasevich @ 2014-12-19 19:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, Vladislav Yasevich, virtualization, stefanha, ben,
	David Miller
In-Reply-To: <20141218173546.GB5425@redhat.com>

On 12/18/2014 12:35 PM, Michael S. Tsirkin wrote:
> On Thu, Dec 18, 2014 at 10:01:35AM -0500, Vlad Yasevich wrote:
>> On 12/18/2014 02:54 AM, Michael S. Tsirkin wrote:
>>> Cc Dave, pls remember to do this next time otherwise
>>> your patches won't get merged :)
>>>
>>> On Wed, Dec 17, 2014 at 06:31:50PM -0500, Vlad Yasevich wrote:
>>>> On 12/17/2014 05:45 PM, Michael S. Tsirkin wrote:
>>>>> On Wed, Dec 17, 2014 at 01:20:46PM -0500, Vladislav Yasevich wrote:
>>>>>> Split IPv6 support for UFO into its own feature similiar to TSO.
>>>>>> This will later allow us to re-enable UFO support for virtio-net
>>>>>> devices.
>>>>>>
>>>>>> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
>>>>>> ---
>>>>>>  include/linux/netdev_features.h |  7 +++++--
>>>>>>  include/linux/netdevice.h       |  1 +
>>>>>>  include/linux/skbuff.h          |  1 +
>>>>>>  net/core/dev.c                  | 35 +++++++++++++++++++----------------
>>>>>>  net/core/ethtool.c              |  2 +-
>>>>>>  5 files changed, 27 insertions(+), 19 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
>>>>>> index dcfdecb..a078945 100644
>>>>>> --- a/include/linux/netdev_features.h
>>>>>> +++ b/include/linux/netdev_features.h
>>>>>> @@ -48,8 +48,9 @@ enum {
>>>>>>  	NETIF_F_GSO_UDP_TUNNEL_BIT,	/* ... UDP TUNNEL with TSO */
>>>>>>  	NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT,/* ... UDP TUNNEL with TSO & CSUM */
>>>>>>  	NETIF_F_GSO_MPLS_BIT,		/* ... MPLS segmentation */
>>>>>> +	NETIF_F_UFO6_BIT,		/* ... UDPv6 fragmentation */
>>>>>>  	/**/NETIF_F_GSO_LAST =		/* last bit, see GSO_MASK */
>>>>>> -		NETIF_F_GSO_MPLS_BIT,
>>>>>> +		NETIF_F_UFO6_BIT,
>>>>>>  
>>>>>>  	NETIF_F_FCOE_CRC_BIT,		/* FCoE CRC32 */
>>>>>>  	NETIF_F_SCTP_CSUM_BIT,		/* SCTP checksum offload */
>>>>>> @@ -109,6 +110,7 @@ enum {
>>>>>>  #define NETIF_F_TSO_ECN		__NETIF_F(TSO_ECN)
>>>>>>  #define NETIF_F_TSO		__NETIF_F(TSO)
>>>>>>  #define NETIF_F_UFO		__NETIF_F(UFO)
>>>>>> +#define NETIF_F_UFO6		__NETIF_F(UFO6)
>>>>>>  #define NETIF_F_VLAN_CHALLENGED	__NETIF_F(VLAN_CHALLENGED)
>>>>>>  #define NETIF_F_RXFCS		__NETIF_F(RXFCS)
>>>>>>  #define NETIF_F_RXALL		__NETIF_F(RXALL)
>>>>>> @@ -141,7 +143,7 @@ enum {
>>>>>>  
>>>>>>  /* List of features with software fallbacks. */
>>>>>>  #define NETIF_F_GSO_SOFTWARE	(NETIF_F_TSO | NETIF_F_TSO_ECN | \
>>>>>> -				 NETIF_F_TSO6 | NETIF_F_UFO)
>>>>>> +				 NETIF_F_TSO6 | NETIF_F_UFO | NETIF_F_UFO6)
>>>>>>  
>>>>>>  #define NETIF_F_GEN_CSUM	NETIF_F_HW_CSUM
>>>>>>  #define NETIF_F_V4_CSUM		(NETIF_F_GEN_CSUM | NETIF_F_IP_CSUM)
>>>>>> @@ -149,6 +151,7 @@ enum {
>>>>>>  #define NETIF_F_ALL_CSUM	(NETIF_F_V4_CSUM | NETIF_F_V6_CSUM)
>>>>>>  
>>>>>>  #define NETIF_F_ALL_TSO 	(NETIF_F_TSO | NETIF_F_TSO6 | NETIF_F_TSO_ECN)
>>>>>> +#define NETIF_F_ALL_UFO		(NETIF_F_UFO | NETIF_F_UFO6)
>>>>>>  
>>>>>>  #define NETIF_F_ALL_FCOE	(NETIF_F_FCOE_CRC | NETIF_F_FCOE_MTU | \
>>>>>>  				 NETIF_F_FSO)
>>>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>>>>> index 74fd5d3..86af10a 100644
>>>>>> --- a/include/linux/netdevice.h
>>>>>> +++ b/include/linux/netdevice.h
>>>>>> @@ -3559,6 +3559,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
>>>>>>  	/* check flags correspondence */
>>>>>>  	BUILD_BUG_ON(SKB_GSO_TCPV4   != (NETIF_F_TSO >> NETIF_F_GSO_SHIFT));
>>>>>>  	BUILD_BUG_ON(SKB_GSO_UDP     != (NETIF_F_UFO >> NETIF_F_GSO_SHIFT));
>>>>>> +	BUILD_BUG_ON(SKB_GSO_UDP6    != (NETIF_F_UFO6 >> NETIF_F_GSO_SHIFT));
>>>>>>  	BUILD_BUG_ON(SKB_GSO_DODGY   != (NETIF_F_GSO_ROBUST >> NETIF_F_GSO_SHIFT));
>>>>>>  	BUILD_BUG_ON(SKB_GSO_TCP_ECN != (NETIF_F_TSO_ECN >> NETIF_F_GSO_SHIFT));
>>>>>>  	BUILD_BUG_ON(SKB_GSO_TCPV6   != (NETIF_F_TSO6 >> NETIF_F_GSO_SHIFT));
>>>>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>>>>> index 6c8b6f6..8538b67 100644
>>>>>> --- a/include/linux/skbuff.h
>>>>>> +++ b/include/linux/skbuff.h
>>>>>> @@ -372,6 +372,7 @@ enum {
>>>>>>  
>>>>>>  	SKB_GSO_MPLS = 1 << 12,
>>>>>>  
>>>>>> +	SKB_GSO_UDP6 = 1 << 13
>>>>>>  };
>>>>>>  
>>>>>>  #if BITS_PER_LONG > 32
>>>>>
>>>>> So this implies anything getting GSO packets e.g.
>>>>> from userspace now needs to check IP version to
>>>>> set GSO type correctly.
>>>>>
>>>>> I think you missed some places that do this, e.g. af_packet
>>>>> sockets.
>>>>>
>>>>
>>>> I looked at af_packet sockets and they set this only in the event
>>>> vnet header has been used with a GSO type.  In this case, the user
>>>> already knows the the type.
>>>
>>> Imagine you are receiving a packet:
>>>
>>>                 if (vnet_hdr.gso_type != VIRTIO_NET_HDR_GSO_NONE) {
>>>                         switch (vnet_hdr.gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
>>>                         case VIRTIO_NET_HDR_GSO_TCPV4:
>>>                                 gso_type = SKB_GSO_TCPV4;
>>>                                 break;
>>>                         case VIRTIO_NET_HDR_GSO_TCPV6:
>>>                                 gso_type = SKB_GSO_TCPV6;
>>>                                 break;
>>>                         case VIRTIO_NET_HDR_GSO_UDP:
>>>                                 gso_type = SKB_GSO_UDP;
>>>                                 break;
>>>                         default:
>>>                                 goto out_unlock;
>>>                         }
>>>
>>>                         if (vnet_hdr.gso_type & VIRTIO_NET_HDR_GSO_ECN)
>>>                                 gso_type |= SKB_GSO_TCP_ECN;
>>>
>>>                         if (vnet_hdr.gso_size == 0)
>>>                                 goto out_unlock;
>>>
>>>                 }
>>>
>>> we used to report UFO6 as SKB_GSO_UDP, we probably
>>> should keep doing this, with your patch we select the
>>> goto out_unlock path.
>>>
>>>
>>
>> No.  The vnet_hdr.gso_type is still going to be VIRTIO_NET_HDR_GSO_UDP
>> since the UDP6 version isn't defined yet.  So, it will be marked as
>> GSO_UDP.
> 
> I pasted the wrong snippet.
> I meant this:
> 
>                         /* This is a hint as to how much should be * linear. */
>                         vnet_hdr.hdr_len = __cpu_to_virtio16(false, skb_headlen(skb));
>                         vnet_hdr.gso_size = __cpu_to_virtio16(false, sinfo->gso_size);
>                         if (sinfo->gso_type & SKB_GSO_TCPV4)
>                                 vnet_hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
>                         else if (sinfo->gso_type & SKB_GSO_TCPV6)
>                                 vnet_hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
>                         else if (sinfo->gso_type & SKB_GSO_UDP)
>                                 vnet_hdr.gso_type = VIRTIO_NET_HDR_GSO_UDP;
>                         else if (sinfo->gso_type & SKB_GSO_FCOE)
>                                 goto out_free;
>                         else
>                                 BUG();
> 
> so if we get SKB_GSO_UDP we'll get BUG().
> 
> 
> 
>> This code most likely needs the same workaround as exists in tap and macvtap,
>> i.e select the proxy fragment id and decide what to do with gso_type.
> 
> And fixup type to GSO_UDP6 while we are at it.
> 
>>>
>>>> It is true that with this series af_packets now can't do IPv6 UFO
>>>> since there is no VIRTIO_NET_HDR_GSO_UDPV6 yet.
>>>
>>> What do you mean by "do".
>>
>> What I mean is that AF_PACKET sockets currently do not do IPv6 UFO
>> correctly, even after Ben's fixes to tap/macvtap.  There is no
>> proxy fragment id selection in af_packet case and we have the
>> same problem Ben was trying address for tap/macvtap.
>>> Are we talking about sending or receiving packets?
>>
>> I am talking about sending, see above.
>>
>>> You seem to conflate the two.
>>>
>>> We always discarded ID on RX.
>>>
>>> For tun, this is xmit, so just by saying "this device can
>>> not do UFO" you will never get short packets.
>>>
>>
>> You must mean long packets.
> 
> Yes.
> 
>> This is actually an issue I've been thinking
>> about. With with your suggestion of switching the GSO type for legacy
>> applications we end up with fragments for IPv6 traffic.   As a result,
>> legacy VMs will see a regression for large IPv6 datagrams.
> 
> I'm not sure what's meant by my suggestion here :)

What I am referring to here is to change the gso_type to UDPV6 in tap/macvtap
when passing packet to the host.

> It seems clear that legacy applications don't want to get IPv6
> fragment IDs in virtio header. Either we pass them plain ethernet
> packets or assume they are ok with discarding the IDs even
> if we set GSO_UDP.

So, I am not try to pass fragment IDs yet.  I am trying to make sure that
older gueest that assume that UFO == UFO4 + UFO6 continue to work and do not
regress.  At the same time, I want to enable UFO(4) for new guests.

If we set gso_type to SKB_GSO_UDPV6 before passing the data to the forwarding
path of the host, then this will cause IPv6 fragmentation.  If we leave gso_type
as SKB_GSO_UDP, then older VMs will continue to communicate using large UDP
data packets.

Later when VMs support UFO6, when UFO6 is enabled, the fragment id can be communicated.
But that's later.

> 
>>>
>>>>
>>>> I suppose we could do something similar there as we do in tun code/macvtap code.
>>>> If that's the case, it currently broken as well.
>>>>
>>>> -vlad
>>>
>>>
>>> Broken is a big word.
>>>
>>> Let's stop conflating two directions.
>>
>> I am not and was talking only about af_packet as that is what you asked about.
>> There is no tun/macvtap in play here.  They are handled separately in their
>> respective drivers.
>>
>>>
>>> Here's the way I look at it:
>>>
>>> 1. Userspace doesn't have a way to get fragment IDs
>>> from tun/macvtap/packet sockets.
>>> Presumably, not all users need these IDs.
>>> E.g. old guests clearly don't.
>>>
>>> We should either give them packets stripping the ID,
>>> like we always did, or make sure they never get these packets.
>>> Second option seems achievable for tun if we
>>> never tell linux it can do UFO, but I don't see
>>> how to do it for macvtap and packet socket.
>>>
>>
>> macvtap is slightly problematic, but doable with some tricks.
>> packet socket is an interesting problem.  The only way
>> an af_packet socket can receive an skb marked SKB_GSO_UDPV6
>> is if someone else on the host sent it (like another guest).
> 
> Or if a NIC set it.

OK, virtio seems to be the only nic doing this... You are right, I
need to cover that scenario.

-vlad

> 
>> Since there is are no feature or vnet header length negotiations,
>> it is impossible to tell if an application using this af_packet
>> socket is capable of processing VIRTIO_NET_HDR_GSO_UDPV6
>> type (yet to be defined).
>>
>> So, we can either use existing VIRTIO_NET_HDR_GSO_UDP on receive
>> path, add some kind of negotiation logic to packet socket (like
>> storing the application expected vnet header size), or perform
>> IPv6 fragmentation somehow.
>>
>> Options 1 or 2 are doable.
> 
> 1 is using VIRTIO_NET_HDR_GSO_UDP and discarding ID,
> 2 is "some kind of negotiation logic"?
> 2 can't be enough, you will need 1 as well.
> 
> So let's start with 1 as a first step.
> 
> 
> 
> 
>>>
>>> 2. Userspace doesn't have a way to set fragment IDs
>>> for tun/macvtap/packet sockets.
>>> Presumably, not all users have these IDs.  E.g. old
>>> guests clearly don't.
>>>
>>> We should either generate our own ID,
>>> like we always did, or make sure we don't accept
>>> these packets.
>>> Second option is almost sure to break userspace,
>>> so it seems we must do the first one.
>>>
>>
>> Right.  This was missing from packet sockets.  I can fix it.
>>
>> -vlad
> 
> Also, this can't be a patch on top, since we don't
> want bisect to give us configurations which
> can BUG().
> 
> 
>>>
>>> 3.
>>> Exactly the same two things if you replace userspace
>>> with hypervisor and tun/macvtap/packet socket with
>>> virtio device.
>>>
>>>
>>> 4. 
>>> As a next step, we should add a way for userspace
>>> to tell us that it has ids and can handle them.
>>>
>>>
>>>
>>>
>>>
>>>>>
>>>>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>>>>> index 945bbd0..fa4d2ee 100644
>>>>>> --- a/net/core/dev.c
>>>>>> +++ b/net/core/dev.c
>>>>>> @@ -5929,6 +5929,12 @@ static netdev_features_t netdev_fix_features(struct net_device *dev,
>>>>>>  		features &= ~NETIF_F_ALL_TSO;
>>>>>>  	}
>>>>>>  
>>>>>> +	/* UFO requires that SG is present as well */
>>>>>> +	if ((features & NETIF_F_ALL_UFO) && !(features & NETIF_F_SG)) {
>>>>>> +		netdev_dbg(dev, "Dropping UFO features since no SG feature.\n");
>>>>>> +		features &= ~NETIF_F_ALL_UFO;
>>>>>> +	}
>>>>>> +
>>>>>>  	if ((features & NETIF_F_TSO) && !(features & NETIF_F_HW_CSUM) &&
>>>>>>  					!(features & NETIF_F_IP_CSUM)) {
>>>>>>  		netdev_dbg(dev, "Dropping TSO features since no CSUM feature.\n");
>>>>>> @@ -5952,24 +5958,21 @@ static netdev_features_t netdev_fix_features(struct net_device *dev,
>>>>>>  		features &= ~NETIF_F_GSO;
>>>>>>  	}
>>>>>>  
>>>>>> -	/* UFO needs SG and checksumming */
>>>>>> -	if (features & NETIF_F_UFO) {
>>>>>> -		/* maybe split UFO into V4 and V6? */
>>>>>> -		if (!((features & NETIF_F_GEN_CSUM) ||
>>>>>> -		    (features & (NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM))
>>>>>> -			    == (NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM))) {
>>>>>> -			netdev_dbg(dev,
>>>>>> -				"Dropping NETIF_F_UFO since no checksum offload features.\n");
>>>>>> -			features &= ~NETIF_F_UFO;
>>>>>> -		}
>>>>>> -
>>>>>> -		if (!(features & NETIF_F_SG)) {
>>>>>> -			netdev_dbg(dev,
>>>>>> -				"Dropping NETIF_F_UFO since no NETIF_F_SG feature.\n");
>>>>>> -			features &= ~NETIF_F_UFO;
>>>>>> -		}
>>>>>> +	/* UFO also needs checksumming */
>>>>>> +	if ((features & NETIF_F_UFO) && !(features & NETIF_F_GEN_CSUM) &&
>>>>>> +					!(features & NETIF_F_IP_CSUM)) {
>>>>>> +		netdev_dbg(dev,
>>>>>> +			   "Dropping NETIF_F_UFO since no checksum offload features.\n");
>>>>>> +		features &= ~NETIF_F_UFO;
>>>>>> +	}
>>>>>> +	if ((features & NETIF_F_UFO6) && !(features & NETIF_F_GEN_CSUM) &&
>>>>>> +					 !(features & NETIF_F_IPV6_CSUM)) {
>>>>>> +		netdev_dbg(dev,
>>>>>> +			   "Dropping NETIF_F_UFO6 since no checksum offload features.\n");
>>>>>> +		features &= ~NETIF_F_UFO6;
>>>>>>  	}
>>>>>>  
>>>>>> +
>>>>>>  #ifdef CONFIG_NET_RX_BUSY_POLL
>>>>>>  	if (dev->netdev_ops->ndo_busy_poll)
>>>>>>  		features |= NETIF_F_BUSY_POLL;
>>>>>> diff --git a/net/core/ethtool.c b/net/core/ethtool.c
>>>>>> index 06dfb29..93eff41 100644
>>>>>> --- a/net/core/ethtool.c
>>>>>> +++ b/net/core/ethtool.c
>>>>>> @@ -223,7 +223,7 @@ static netdev_features_t ethtool_get_feature_mask(u32 eth_cmd)
>>>>>>  		return NETIF_F_ALL_TSO;
>>>>>>  	case ETHTOOL_GUFO:
>>>>>>  	case ETHTOOL_SUFO:
>>>>>> -		return NETIF_F_UFO;
>>>>>> +		return NETIF_F_ALL_UFO;
>>>>>>  	case ETHTOOL_GGSO:
>>>>>>  	case ETHTOOL_SGSO:
>>>>>>  		return NETIF_F_GSO;
>>>>>> -- 
>>>>>> 1.9.3

^ permalink raw reply

* Re: [PATCH 01/10] core: Split out UFO6 support
From: Vlad Yasevich @ 2014-12-19 20:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, Vladislav Yasevich, virtualization, stefanha, ben,
	David Miller
In-Reply-To: <20141218175002.GA5539@redhat.com>

On 12/18/2014 12:50 PM, Michael S. Tsirkin wrote:
> On Thu, Dec 18, 2014 at 07:35:46PM +0200, Michael S. Tsirkin wrote:
>>>> We should either generate our own ID,
>>>> like we always did, or make sure we don't accept
>>>> these packets.
>>>> Second option is almost sure to break userspace,
>>>> so it seems we must do the first one.
>>>>
>>>
>>> Right.  This was missing from packet sockets.  I can fix it.
>>>
>>> -vlad
>>
>> Also, this can't be a patch on top, since we don't
>> want bisect to give us configurations which
>> can BUG().
> 
> So how doing this in stages:
> 
> 1. add helper that checks skb GSO type.
> If it is SKB_GSO_UDP, check for IPv6, and
> generate the fragment ID.
> 
> Call this helper in
> 	- virtio net rx path

Why do we need id on rx path?  Fragment ID should only be generated on tx.

> 	- tun rx path (get user)
> 	- macvtap tx patch (get user)
> 	- packet socket tx patch (get user)

The reset makes sense.
> 
> 2. Put back UFO flag everywhere: virtio,tun,macvtap,packet,
> since we can handle IPv6 now even if it's suboptimal.
> 
> Above two seem like smalling patches, appropriate for stable.

OK.

> 
> Next, work on a new feature to pass
> fragment ID in virtio header:
> 
> 3. split UFO/UFO6 as you did here, but add code
> in above 4 places to convert UDP to UDP6,

Doing so will adversely impact IPv6 UFO performance for legacy
guests.  I know how many times I've seen mail wrt patch xyz caused
%X  regression in my setup and how we've reverted or tried to fixed
things to solve this.  If we go with approach, the only "fix' would be
to upgrade the guest and that's not available to some users.

-vlad

> additionally, add code in
> 	- virtio net tx path
> 	- tun tx path (get user)
> 	- macvtap rx patch (put user)
> 	- packet socket rx patch (put user)
> to convert UDP6 to UDP.
>
> 	step 3 needs to be bisect-clean, somehow.
> 
> 4. add new field in header, new feature bit for virtio net to gate it,
> new ioctls to tun,macvtap,packet socket.
> 
> These two are more like optimizations, so not stable material.
> 
> 

^ permalink raw reply

* Re: [PATCH net-next 2/2] r8152: check the status before submitting rx
From: David Miller @ 2014-12-19 20:44 UTC (permalink / raw)
  To: hayeswang; +Cc: netdev, nic_swsd, linux-kernel, linux-usb
In-Reply-To: <1394712342-15778-109-Taiwan-albertk@realtek.com>

From: Hayes Wang <hayeswang@realtek.com>
Date: Fri, 19 Dec 2014 16:56:00 +0800

> Don't submit the rx if the device is unplugged, linking down,
> or stopped.
 ...
> @@ -1789,6 +1789,11 @@ int r8152_submit_rx(struct r8152 *tp, struct rx_agg *agg, gfp_t mem_flags)
>  {
>  	int ret;
>  
> +	/* The rx would be stopped, so skip submitting */
> +	if (test_bit(RTL8152_UNPLUG, &tp->flags) ||
> +	    !test_bit(WORK_ENABLE, &tp->flags) || !(tp->speed & LINK_STATUS))
> +		return 0;
> +

I think netif_carrier_off() should always be true in all three of those
situations, and would be a much simpler test than what you've coded
here.

^ permalink raw reply

* Re: [PATCH net-next v3] Add support of Cavium Liquidio ethernet adapters
From: Lino Sanfilippo @ 2014-12-19 20:44 UTC (permalink / raw)
  To: Raghu Vatsavayi, davem
  Cc: netdev, Derek Chickles, Satanand Burla, Felix Manlunas,
	Raghu Vatsavayi
In-Reply-To: <1418959519-31681-1-git-send-email-rvatsavayi@caviumnetworks.com>

Hi,

On 19.12.2014 04:25, Raghu Vatsavayi wrote:
> +
> +static int cn6xxx_soft_reset(struct octeon_device *oct)
> +{
> +	octeon_write_csr64(oct, CN66XX_WIN_WR_MASK_REG, 0xFF);
> +
> +	lio_dev_dbg(oct, "BIST enabled for soft reset\n");
> +
> +	OCTEON_PCI_WIN_WRITE(oct, CN66XX_CIU_SOFT_BIST, 1);
> +	octeon_write_csr64(oct, CN66XX_SLI_SCRATCH1, 0x1234ULL);
> +
> +	OCTEON_PCI_WIN_READ(oct, CN66XX_CIU_SOFT_RST);
> +	OCTEON_PCI_WIN_WRITE(oct, CN66XX_CIU_SOFT_RST, 1);
> +
> +	/* Wait for 10ms as Octeon resets. */
> +	mdelay(10);
> +
> +	if (octeon_read_csr64(oct, CN66XX_SLI_SCRATCH1) == 0x1234ULL) {
> +		lio_dev_err(oct, "Soft reset failed\n");
> +		return 1;
> +	}

Before the delay you should probably make sure that the writes are
flushed to avoid pci write posting.

> +
> +static int cn68xx_soft_reset(struct octeon_device *oct)
> +{
> +	octeon_write_csr64(oct, CN68XX_WIN_WR_MASK_REG, 0xFF);
> +
> +	lio_dev_dbg(oct, "BIST enabled for CN68XX soft reset\n");
> +	OCTEON_PCI_WIN_WRITE(oct, CN68XX_CIU_SOFT_BIST, 1);
> +
> +	octeon_write_csr64(oct, CN68XX_SLI_SCRATCH1, 0x1234ULL);
> +
> +	OCTEON_PCI_WIN_READ(oct, CN68XX_CIU_SOFT_RST);
> +	OCTEON_PCI_WIN_WRITE(oct, CN68XX_CIU_SOFT_RST, 1);
> +
> +	/* Wait for 100ms as Octeon resets. */
> +	mdelay(100);
> +
> +	if (octeon_read_csr64(oct, CN68XX_SLI_SCRATCH1) == 0x1234ULL) {
> +		lio_dev_err(oct, "Soft reset failed\n");
> +		return 1;
> +	}

same here.

Regards,
Lino

^ permalink raw reply

* Re: [PATCH net] enic: fix rx skb checksum
From: David Miller @ 2014-12-19 20:45 UTC (permalink / raw)
  To: jbenc; +Cc: _govind, netdev, ssujith, benve, sassmann
In-Reply-To: <20141219121144.5f9899ba@griffin>

From: Jiri Benc <jbenc@redhat.com>
Date: Fri, 19 Dec 2014 12:11:44 +0100

> On Thu, 18 Dec 2014 15:58:42 +0530, Govindarajulu Varadarajan wrote:
>> Hardware always provides compliment of IP pseudo checksum. Stack expects
>> whole packet checksum without pseudo checksum if CHECKSUM_COMPLETE is set.
>> 
>> This causes checksum error in nf & ovs.
>> 
>> [...]
>> 
>> Hardware verifies IP & tcp/udp header checksum but does not provide payload
>> checksum, use CHECKSUM_UNNECESSARY. Set it only if its valid IP tcp/udp packet.
>> 
>> Cc: Jiri Benc <jbenc@redhat.com>
>> Cc: Stefan Assmann <sassmann@redhat.com>
>> Reported-by: Sunil Choudhary <schoudha@redhat.com>
>> Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
> 
> Reviewed-by: Jiri Benc <jbenc@redhat.com>

Applied, thanks.

^ permalink raw reply

* Re: pull request: bluetooth 2014-12-19
From: David Miller @ 2014-12-19 20:47 UTC (permalink / raw)
  To: johan.hedberg; +Cc: netdev, linux-bluetooth
In-Reply-To: <20141219143400.GA20385@t440s.P-661HNU-F1>

From: Johan Hedberg <johan.hedberg@gmail.com>
Date: Fri, 19 Dec 2014 16:34:00 +0200

> Here's one more pull request for 3.19. It contains the socket type
> verification fixes from Al Viro as well as an skb double-free fix for
> 6lowpan from Jukka Rissanen.
> 
> Please let me know if there are any issues pulling. Thanks.

Pulled, thanks Johan.

^ permalink raw reply

* Re: [PATCH net-next v3] Add support of Cavium Liquidio ethernet adapters
From: Stephen Hemminger @ 2014-12-19 20:53 UTC (permalink / raw)
  To: Chickles, Derek
  Cc: Vatsavayi, Raghu, davem@davemloft.net, netdev@vger.kernel.org,
	Burla, Satananda, Manlunas, Felix
In-Reply-To: <CO2PR07MB490962E1E05D146BFD67B66FE6B0@CO2PR07MB490.namprd07.prod.outlook.com>

On Fri, 19 Dec 2014 18:39:44 +0000
"Chickles, Derek" <Derek.Chickles@caviumnetworks.com> wrote:

> > Also, why does this device not have standard ethernet naming?  
> 
> No good reason. We will follow the p<slot_number>p<port_number> convention.

Use alloc_etherdev_mq and let it choose.
Policy about slot/port is done in userspace (udev/systemd).

^ permalink raw reply

* Re: [PATCH net-next v3] Add support of Cavium Liquidio ethernet adapters
From: Stephen Hemminger @ 2014-12-19 20:54 UTC (permalink / raw)
  To: Lino Sanfilippo
  Cc: Raghu Vatsavayi, davem, netdev, Derek Chickles, Satanand Burla,
	Felix Manlunas, Raghu Vatsavayi
In-Reply-To: <54948E1F.2080405@gmx.de>

On Fri, 19 Dec 2014 21:44:15 +0100
Lino Sanfilippo <LinoSanfilippo@gmx.de> wrote:

> > +	OCTEON_PCI_WIN_READ(oct, CN66XX_CIU_SOFT_RST);
> > +	OCTEON_PCI_WIN_WRITE(oct, CN66XX_CIU_SOFT_RST, 1);

Missing a following PCI read to force PCI posting?

^ permalink raw reply

* Re: [PATCH net-next v3] Add support of Cavium Liquidio ethernet adapters
From: Lino Sanfilippo @ 2014-12-19 21:00 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Raghu Vatsavayi, davem, netdev, Derek Chickles, Satanand Burla,
	Felix Manlunas, Raghu Vatsavayi
In-Reply-To: <20141219125434.594f4d46@urahara>

On 19.12.2014 21:54, Stephen Hemminger wrote:
> On Fri, 19 Dec 2014 21:44:15 +0100
> Lino Sanfilippo <LinoSanfilippo@gmx.de> wrote:
> 
>> > +	OCTEON_PCI_WIN_READ(oct, CN66XX_CIU_SOFT_RST);
>> > +	OCTEON_PCI_WIN_WRITE(oct, CN66XX_CIU_SOFT_RST, 1);
> 
> Missing a following PCI read to force PCI posting?
> 

Yes, I think so.

^ permalink raw reply

* Re: [PATCH net-next v3] Add support of Cavium Liquidio ethernet adapters
From: Lino Sanfilippo @ 2014-12-19 21:07 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Raghu Vatsavayi, davem, netdev, Derek Chickles, Satanand Burla,
	Felix Manlunas, Raghu Vatsavayi
In-Reply-To: <549491D2.1090504@gmx.de>

On 19.12.2014 22:00, Lino Sanfilippo wrote:
> On 19.12.2014 21:54, Stephen Hemminger wrote:
>> On Fri, 19 Dec 2014 21:44:15 +0100
>> Lino Sanfilippo <LinoSanfilippo@gmx.de> wrote:
>> 
>>> > +	OCTEON_PCI_WIN_READ(oct, CN66XX_CIU_SOFT_RST);
>>> > +	OCTEON_PCI_WIN_WRITE(oct, CN66XX_CIU_SOFT_RST, 1);
>> 
>> Missing a following PCI read to force PCI posting?
>> 
> 
> Yes, I think so.

I thought posting means "writes are delayed", so I wrote it should be
avoided (instead of forced). However I think there should be a PCI read
before the call to mdelay.

Regards,
Lino

^ permalink raw reply

* Re: [PATCH net-next v3] Add support of Cavium Liquidio ethernet adapters
From: Lino Sanfilippo @ 2014-12-19 21:14 UTC (permalink / raw)
  To: Raghu Vatsavayi, davem
  Cc: netdev, Derek Chickles, Satanand Burla, Felix Manlunas,
	Raghu Vatsavayi
In-Reply-To: <1418959519-31681-1-git-send-email-rvatsavayi@caviumnetworks.com>

On 19.12.2014 04:25, Raghu Vatsavayi wrote:
> 		desc_ring[i].info_ptr =
> +			(uint64_t)pci_map_single(oct->pci_dev,
> +						&droq->info_list[i],
> +						OCT_DROQ_INFO_SIZE,
> +						PCI_DMA_FROMDEVICE);
> +
> +		desc_ring[i].buffer_ptr =
> +			(uint64_t)pci_map_single(oct->pci_dev,
> +						droq->recv_buf_list[i].
> +						data, droq->buffer_size,
> +						PCI_DMA_FROMDEVICE);
> +	}

You should also check if pci mapping was successful...

Regards,
Lino

^ permalink raw reply

* Re: [PATCH net-next v3] Add support of Cavium Liquidio ethernet adapters
From: Lino Sanfilippo @ 2014-12-19 21:48 UTC (permalink / raw)
  To: Raghu Vatsavayi, davem
  Cc: netdev, Derek Chickles, Satanand Burla, Felix Manlunas,
	Raghu Vatsavayi
In-Reply-To: <1418959519-31681-1-git-send-email-rvatsavayi@caviumnetworks.com>

On 19.12.2014 04:25, Raghu Vatsavayi wrote:
> +static int setup_nic_devices(struct octeon_device *octeon_dev)
> +{
> +	struct lio *lio = NULL;
> +	struct net_device *netdev;
> +	uint8_t macaddr[6], i, j;
> +	int octeon_id;
> +	struct octeon_soft_command *sc;
> +	struct liquidio_if_cfg_resp *resp;
> +	int retval;
> +	int num_iqueues;
> +	int num_oqueues;
> +	int q_no;
> +	uint64_t q_mask;
> +	int num_cpus = num_online_cpus();
> +
> +	if (num_cpus & (num_cpus - 1))
> +		/* numcpus is not a power of 2 */
> +		num_cpus = 1U << (fls(num_cpus) - 1);
> +	octeon_id = octeon_dev->octeon_id;
> +
> +	sc = kmalloc(sizeof(*sc), GFP_ATOMIC);
> +	if (!sc)
> +		return -ENOMEM;
> +
> +	resp = kmalloc(sizeof(*resp), GFP_KERNEL);
> +	if (!resp) {
> +		kfree(sc);
> +		return -ENOMEM;
> +	}
> +
> +	for (i = 0; i < octeon_dev->props.ifcount; i++) {
> +		memset(sc, 0, sizeof(struct octeon_soft_command));
> +		memset(resp, 0, sizeof(struct liquidio_if_cfg_resp));
> +		num_iqueues =
> +			CFG_GET_NUM_TXQS_NIC_IF(octeon_get_conf(octeon_dev), i);
> +		num_oqueues =
> +			CFG_GET_NUM_RXQS_NIC_IF(octeon_get_conf(octeon_dev), i);
> +		if (num_iqueues > num_cpus)
> +			num_iqueues = num_cpus;
> +		if (num_oqueues > num_cpus)
> +			num_oqueues = num_cpus;
> +		lio_dev_dbg(octeon_dev,
> +			    "requesting config for interface %d, iqs %d, oqs %d\n",
> +			    i, num_iqueues, num_oqueues);
> +		ACCESS_ONCE(resp->s.cond) = 0;
> +		resp->s.octeon_id = get_octeon_device_id(octeon_dev);
> +		init_waitqueue_head(&resp->s.wc);
> +
> +		octeon_prepare_soft_command(octeon_dev, sc, OPCODE_NIC,
> +					    OPCODE_NIC_IF_CFG, i, num_iqueues,
> +					    num_oqueues, NULL, 0, &resp->rh,
> +					    (sizeof(struct liquidio_if_cfg_resp)
> +					     - sizeof(resp->s)));
> +
> +		sc->callback = if_cfg_callback;
> +		sc->callback_arg = resp;
> +		sc->wait_time = 1000;
> +
> +		retval = octeon_send_soft_command(octeon_dev, sc);
> +		if (retval) {
> +			lio_dev_err(octeon_dev,
> +				    "Link status instruction failed status: %x\n",
> +				    retval);
> +			/* Soft instr is freed by driver in case of failure. */
> +			goto setup_nic_dev_fail;
> +		}
> +
> +		/* Sleep on a wait queue till the cond flag indicates that the
> +		 * response arrived or timed-out.
> +		 */
> +		sleep_cond(&resp->s.wc, &resp->s.cond);
> +		retval = resp->status;
> +		if (retval) {
> +			lio_dev_err(octeon_dev, "Link status failed\n");
> +			goto setup_nic_dev_fail;
> +		}
> +		octeon_swap_8B_data((uint64_t *)(&resp->cfg_info),
> +				    (sizeof(struct liquidio_if_cfg_info)) >> 3);
> +
> +		num_iqueues = hweight64(resp->cfg_info.iqmask);
> +		num_oqueues = hweight64(resp->cfg_info.oqmask);
> +
> +		if (!(num_iqueues) || !(num_oqueues)) {
> +			lio_dev_err(octeon_dev,
> +				    "Got bad iqueues (%016llx) or oqueues (%016llx) from firmware.\n",
> +				    resp->cfg_info.iqmask,
> +				    resp->cfg_info.oqmask);
> +			goto setup_nic_dev_fail;
> +		}
> +		lio_dev_dbg(octeon_dev,
> +			    "interface %d, iqmask %016llx, oqmask %016llx, numiqueues %d, numoqueues %d\n",
> +			    i, resp->cfg_info.iqmask, resp->cfg_info.oqmask,
> +			    num_iqueues, num_oqueues);
> +		netdev = liquidio_alloc_netdev(LIO_SIZE, num_iqueues);
> +
> +		if (!netdev) {
> +			lio_dev_err(octeon_dev, "Device allocation failed\n");
> +			goto setup_nic_dev_fail;
> +		}
> +
> +		octeon_dev->props.netdev[i] = netdev;
> +
> +		if (num_iqueues > 1)
> +			lionetdevops.ndo_select_queue = select_q;
> +
> +		/* Associate the routines that will handle different
> +		 * netdev tasks.
> +		 */
> +		netdev->netdev_ops = &lionetdevops;
> +
> +		lio = GET_LIO(netdev);
> +
> +		memset(lio, 0, sizeof(struct lio));
> +
> +		lio->linfo.ifidx = resp->cfg_info.ifidx;
> +		lio->ifidx = resp->cfg_info.ifidx;
> +
> +		lio->linfo.num_rxpciq = num_oqueues;
> +		lio->linfo.num_txpciq = num_iqueues;
> +		q_mask = resp->cfg_info.oqmask;
> +		/* q_mask is 0-based and already verified mask is nonzero */
> +		for (j = 0; j < num_oqueues; j++) {
> +			q_no = __ffs64(q_mask);
> +			q_mask &= (~(1UL << q_no));
> +			lio->linfo.rxpciq[j] = q_no;
> +		}
> +		q_mask = resp->cfg_info.iqmask;
> +		for (j = 0; j < num_iqueues; j++) {
> +			q_no = __ffs64(q_mask);
> +			q_mask &= (~(1UL << q_no));
> +			lio->linfo.txpciq[j] = q_no;
> +		}
> +		retval = get_inittime_link_status(octeon_dev,
> +						  &octeon_dev->props, i);
> +		if (retval) {
> +			lio_dev_err(octeon_dev, "link status failed\n");
> +			goto setup_nic_dev_fail;
> +		}
> +		lio->linfo.hw_addr = octeon_dev->props.ls->link_info.hw_addr;
> +		lio->linfo.gmxport = octeon_dev->props.ls->link_info.gmxport;
> +
> +		lio->msg_enable = netif_msg_init(debug, DEFAULT_MSG_ENABLE);
> +
> +		lio->dev_capability =
> +			NETIF_F_HW_CSUM | NETIF_F_SG | NETIF_F_RXCSUM;
> +		lio->dev_capability |=
> +			(NETIF_F_TSO | NETIF_F_TSO6 | NETIF_F_LRO);
> +		netif_set_gso_max_size(netdev, GSO_MAX_SIZE - ETH_HLEN - 4);
> +
> +		netdev->features = lio->dev_capability;
> +		netdev->vlan_features = lio->dev_capability;
> +
> +		netdev->hw_features = lio->dev_capability;
> +
> +		/* Point to the  properties for octeon device to which this
> +		 * interface belongs.
> +		 */
> +		lio->oct_dev = get_octeon_device_ptr(octeon_id);
> +		lio->octprops = &octeon_dev->props;
> +		lio->netdev = netdev;
> +		spin_lock_init(&lio->lock);
> +
> +		lio_dev_dbg(octeon_dev, "if%d gmx: %d hw_addr: 0x%llx\n", i,
> +			    lio->linfo.gmxport, CVM_CAST64(lio->linfo.hw_addr));
> +
> +		/* 64-bit swap required on LE machines */
> +		octeon_swap_8B_data(&lio->linfo.hw_addr, 1);
> +		for (j = 0; j < 6; j++)
> +			macaddr[j] =
> +				*((uint8_t *)(((uint8_t *)&lio->linfo.hw_addr) +
> +					     2 + j));
> +
> +		/* Copy MAC Address to OS network device structure */
> +
> +		ether_addr_copy(netdev->dev_addr, macaddr);
> +
> +		lio->linfo.link.u64 = octeon_dev->props.ls->link_info.link.u64;
> +		spin_lock_init(&lio->link_update_lock);
> +
> +		if (setup_io_queues(octeon_dev, netdev)) {
> +			lio_dev_err(octeon_dev, "I/O queues creation failed\n");
> +			goto setup_nic_dev_fail;
> +		}
> +
> +		ifstate_set(lio, LIO_IFSTATE_DROQ_OPS);
> +
> +		/* By default all interfaces on a single Octeon uses the same
> +		 * tx and rx queues
> +		 */
> +		lio->txq = lio->linfo.txpciq[0];
> +		lio->rxq = lio->linfo.rxpciq[0];
> +
> +		lio->tx_qsize = octeon_get_tx_qsize(octeon_id, lio->txq);
> +		lio->rx_qsize = octeon_get_rx_qsize(octeon_id, lio->rxq);
> +
> +		if (setup_glist(lio)) {
> +			lio_dev_err(octeon_dev,
> +				    "Gather list allocation failed\n");
> +			goto setup_nic_dev_fail;
> +		}
> +
> +		/* Register ethtool support */
> +		liquidio_set_ethtool_ops(netdev);
> +
> +		/* Register the network device with the OS */
> +		if (register_netdev(netdev)) {
> +			lio_dev_err(octeon_dev, "Device registration failed\n");
> +			goto setup_nic_dev_fail;
> +		}
> +
> +		lio_dev_dbg(octeon_dev,
> +			    "Setup NIC ifidx:%d mac:%02x%02x%02x%02x%02x%02x\n",
> +			    i, macaddr[0], macaddr[1], macaddr[2], macaddr[3],
> +			    macaddr[4], macaddr[5]);
> +		netif_carrier_off(netdev);
> +
> +		if (lio->linfo.link.s.status) {
> +			netif_carrier_on(netdev);
> +			start_txq(netdev);
> +		} else {
> +			netif_carrier_off(netdev);
> +		}
> +
> +		ifstate_set(lio, LIO_IFSTATE_REGISTERED);
> +
> +		liquidio_set_lro(netdev, OCTNET_CMD_LRO_ENABLE);
> +
> +		print_link_info(netdev);
> +		lio_dev_dbg(octeon_dev, "NIC ifidx:%d Setup successful\n", i);
> +	}
> +	kfree(sc);
> +	kfree(resp);

I wonder if this is correct. AFAICS sc is the same memory that we put on
the octeon_instr_queue (see send_soft_command() ->
send_command()->add_to_nrlist()). Is that queue cleared before sc is freed?

^ permalink raw reply

* Re: [PATCH net-next v3] Add support of Cavium Liquidio ethernet adapters
From: Lino Sanfilippo @ 2014-12-19 22:04 UTC (permalink / raw)
  To: Raghu Vatsavayi, davem
  Cc: netdev, Derek Chickles, Satanand Burla, Felix Manlunas,
	Raghu Vatsavayi, Stephen Hemminger
In-Reply-To: <1418959519-31681-1-git-send-email-rvatsavayi@caviumnetworks.com>

On 19.12.2014 04:25, Raghu Vatsavayi wrote:
> +static void cn6xxx_disable_interrupt(void *chip)
> +{
> +	struct octeon_cn6xxx *cn6xxx = (struct octeon_cn6xxx *)chip;
> +
> +	/* Disable Interrupts */
> +	writeq(0, cn6xxx->intr_enb_reg64);
> +}
> +

This could also be a good candidate for forced write posting. The code
assumes that interrupts are actually deactivated after that.

Regards,
Lino

^ permalink raw reply

* Re: [PATCH net-next v3] Add support of Cavium Liquidio ethernet adapters
From: Lino Sanfilippo @ 2014-12-19 22:24 UTC (permalink / raw)
  To: Raghu Vatsavayi, davem
  Cc: netdev, Derek Chickles, Satanand Burla, Felix Manlunas,
	Raghu Vatsavayi, Stephen Hemminger
In-Reply-To: <1418959519-31681-1-git-send-email-rvatsavayi@caviumnetworks.com>

There seem to be various places where GFP_KERNEL could be used instead
of GFP_ATOMIC, e.g.:

> +static int setup_nic_devices(struct octeon_device *octeon_dev)
> +{

> +
> +	sc = kmalloc(sizeof(*sc), GFP_ATOMIC);
> +	if (!sc)
> +		return -ENOMEM;
> +

and

> +static int liquidio_init_nic_module(int octeon_id, void *octeon_dev)
> +{

> +	/* Allocate a soft command to be used to send link status requests
> +	 * to the core app.
> +	 */
> +	sc = kmalloc(sizeof(*sc), GFP_ATOMIC);
> +	if (!sc) {
> +		kfree(ls);
> +		return -ENOMEM;

also

> +/*  Configure interrupt moderation parameters */
> +static int octnet_set_intrmod_cfg(void *oct, struct oct_intrmod_cfg *intr_cfg)
> +{
> +	struct octeon_soft_command *sc;
> +	struct oct_intrmod_cmd *cmd;
> +	int retval;
> +	struct octeon_device *oct_dev = (struct octeon_device *)oct;
> +	unsigned char *cfg;
> +
> +	/* Alloc soft command */
> +	sc = (struct octeon_soft_command *)
> +	     kmalloc(sizeof(struct octeon_soft_command), GFP_ATOMIC);
> +	if (!sc)
> +		return -ENOMEM;

seems only to be called from process context.

Regards,
Lino

^ permalink raw reply

* RE: [PATCH net-next v3] Add support of Cavium Liquidio ethernet adapters
From: Chickles, Derek @ 2014-12-19 22:20 UTC (permalink / raw)
  To: Lino Sanfilippo, Vatsavayi, Raghu, davem@davemloft.net
  Cc: netdev@vger.kernel.org, Burla, Satananda, Manlunas, Felix,
	Vatsavayi, Raghu, Stephen Hemminger
In-Reply-To: <5494A0DD.1010304@gmx.de>

> On 19.12.2014 04:25, Raghu Vatsavayi wrote:
> > +static void cn6xxx_disable_interrupt(void *chip)
> > +{
> > +	struct octeon_cn6xxx *cn6xxx = (struct octeon_cn6xxx *)chip;
> > +
> > +	/* Disable Interrupts */
> > +	writeq(0, cn6xxx->intr_enb_reg64);
> > +}
> > +
> 
> This could also be a good candidate for forced write posting. The code
> assumes that interrupts are actually deactivated after that.
> 
> Regards,
> Lino
> 
> 
Good catch. We'll examine all the writeq and OCTEON_PCI_WIN_WRITE
calls to make sure there aren't any other critical posted write scenarios that
we've missed.

^ permalink raw reply

* RE: [PATCH net-next v3] Add support of Cavium Liquidio ethernet adapters
From: Chickles, Derek @ 2014-12-19 22:39 UTC (permalink / raw)
  To: Lino Sanfilippo, Vatsavayi, Raghu, davem@davemloft.net
  Cc: netdev@vger.kernel.org, Burla, Satananda, Manlunas, Felix,
	Vatsavayi, Raghu
In-Reply-To: <54949D24.1040707@gmx.de>

> On 19.12.2014 04:25, Raghu Vatsavayi wrote:
> > +static int setup_nic_devices(struct octeon_device *octeon_dev)
> > +{
...
> > +
> > +	sc = kmalloc(sizeof(*sc), GFP_ATOMIC);
> > +	if (!sc)
> > +		return -ENOMEM;
> > +
> > +	resp = kmalloc(sizeof(*resp), GFP_KERNEL);
> > +	if (!resp) {
> > +		kfree(sc);
> > +		return -ENOMEM;
> > +	}
...
> > +		init_waitqueue_head(&resp->s.wc);
...
> > +
> > +		retval = octeon_send_soft_command(octeon_dev, sc);
> > +		if (retval) {
> > +			lio_dev_err(octeon_dev,
> > +				    "Link status instruction failed status: %x\n",
> > +				    retval);
> > +			/* Soft instr is freed by driver in case of failure. */
> > +			goto setup_nic_dev_fail;
> > +		}
> > +
> > +		/* Sleep on a wait queue till the cond flag indicates that the
> > +		 * response arrived or timed-out.
> > +		 */
> > +		sleep_cond(&resp->s.wc, &resp->s.cond);
...
> > +	kfree(sc);
> > +	kfree(resp);
> 
> I wonder if this is correct. AFAICS sc is the same memory that we put on
> the octeon_instr_queue (see send_soft_command() ->
> send_command()->add_to_nrlist()). Is that queue cleared before sc is
> freed?
> 
> 

Yes, it's correct. We wait for the response before exiting this loop with
that sleep_cond() call. This adds an entry to a wait_queue that is
serviced asynchronously and will return once it is complete.

^ permalink raw reply

* Re: [PATCH net-next v3] Add support of Cavium Liquidio ethernet adapters
From: Lino Sanfilippo @ 2014-12-19 22:51 UTC (permalink / raw)
  To: Raghu Vatsavayi, davem
  Cc: netdev, Derek Chickles, Satanand Burla, Felix Manlunas,
	Raghu Vatsavayi, stephen Hemminger
In-Reply-To: <1418959519-31681-1-git-send-email-rvatsavayi@caviumnetworks.com>

On 19.12.2014 04:25, Raghu Vatsavayi wrote:


> static int
> +oct_cfg_rx_intrcnt(struct lio *lio, struct ethtool_coalesce *intr_coal)
> +{

> +	}
> +
> +	spin_lock_irqsave(&cn6xxx->lock_for_droq_int_enb_reg, flags);
> +	octeon_write_csr(oct, OCT_SLI_REGNAME(oct, PKT_CNT_INT_ENB), intr);
> +	spin_unlock_irqrestore(&cn6xxx->lock_for_droq_int_enb_reg, flags);
> +
> +	return 0;

What is the reason that this is locked? If it really has to be
synchronized then there should AFAIK at least be an mmiowb() to make
sure that the write does not leak out of the lock...

Regards,
Lino

^ permalink raw reply

* RE: [PATCH net-next v3] Add support of Cavium Liquidio ethernet adapters
From: Chickles, Derek @ 2014-12-19 22:53 UTC (permalink / raw)
  To: Lino Sanfilippo, Vatsavayi, Raghu, davem@davemloft.net
  Cc: netdev@vger.kernel.org, Burla, Satananda, Manlunas, Felix,
	Vatsavayi, Raghu, Stephen Hemminger
In-Reply-To: <5494A58E.7000000@gmx.de>

> There seem to be various places where GFP_KERNEL could be used instead
> of GFP_ATOMIC, e.g.:
> 
> > +static int setup_nic_devices(struct octeon_device *octeon_dev)
> > +{
> 
> > +
> > +	sc = kmalloc(sizeof(*sc), GFP_ATOMIC);
> > +	if (!sc)
> > +		return -ENOMEM;
> > +
> 
> and
> 
> > +static int liquidio_init_nic_module(int octeon_id, void *octeon_dev)
> > +{
> 
> > +	/* Allocate a soft command to be used to send link status requests
> > +	 * to the core app.
> > +	 */
> > +	sc = kmalloc(sizeof(*sc), GFP_ATOMIC);
> > +	if (!sc) {
> > +		kfree(ls);
> > +		return -ENOMEM;
> 
> also
> 
> > +/*  Configure interrupt moderation parameters */
> > +static int octnet_set_intrmod_cfg(void *oct, struct oct_intrmod_cfg
> *intr_cfg)
> > +{
> > +	struct octeon_soft_command *sc;
> > +	struct oct_intrmod_cmd *cmd;
> > +	int retval;
> > +	struct octeon_device *oct_dev = (struct octeon_device *)oct;
> > +	unsigned char *cfg;
> > +
> > +	/* Alloc soft command */
> > +	sc = (struct octeon_soft_command *)
> > +	     kmalloc(sizeof(struct octeon_soft_command), GFP_ATOMIC);
> > +	if (!sc)
> > +		return -ENOMEM;
> 
> seems only to be called from process context.
> 
> Regards,
> Lino
> 
Thanks, you're right. I think also octnic_alloc_ctrl_pkt_sc() can be GFP_KERNEL.
We'll also examine the other two spots where we use GFP_ATOMIC and make
sure they are correct. 

^ permalink raw reply

* [PATCH net] net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding
From: Jay Vosburgh @ 2014-12-19 23:32 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller


	When using VXLAN tunnels and a sky2 device, I have experienced
checksum failures of the following type:

[ 4297.761899] eth0: hw csum failure
[...]
[ 4297.765223] Call Trace:
[ 4297.765224]  <IRQ>  [<ffffffff8172f026>] dump_stack+0x46/0x58
[ 4297.765235]  [<ffffffff8162ba52>] netdev_rx_csum_fault+0x42/0x50
[ 4297.765238]  [<ffffffff8161c1a0>] ? skb_push+0x40/0x40
[ 4297.765240]  [<ffffffff8162325c>] __skb_checksum_complete+0xbc/0xd0
[ 4297.765243]  [<ffffffff8168c602>] tcp_v4_rcv+0x2e2/0x950
[ 4297.765246]  [<ffffffff81666ca0>] ? ip_rcv_finish+0x360/0x360

	These are reliably reproduced in a network topology of:

container:eth0 == host(OVS VXLAN on VLAN) == bond0 == eth0 (sky2) -> switch

	When VXLAN encapsulated traffic is received from a similarly
configured peer, the above warning is generated in the receive
processing of the encapsulated packet.  Note that the warning is
associated with the container eth0.

        The skbs from sky2 have ip_summed set to CHECKSUM_COMPLETE, and
because the packet is an encapsulated Ethernet frame, the checksum
generated by the hardware includes the inner protocol and Ethernet
headers.

	The receive code is careful to update the skb->csum, except in
__dev_forward_skb, as called by dev_forward_skb.  __dev_forward_skb
calls eth_type_trans, which in turn calls skb_pull_inline(skb, ETH_HLEN)
to skip over the Ethernet header, but does not update skb->csum when
doing so.

	This patch resolves the problem by adding a call to
skb_postpull_rcsum to update the skb->csum after the call to
eth_type_trans.

Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>

---

Please consider for 3.17 -stable; I do not see the warning on 3.14.


 net/core/dev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index f411c28..df755e5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1694,6 +1694,7 @@ int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 
 	skb_scrub_packet(skb, true);
 	skb->protocol = eth_type_trans(skb, dev);
+	skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
 
 	return 0;
 }
-- 
1.9.1

^ permalink raw reply related

* virtio_net: Fix napi poll list corruption
From: Herbert Xu @ 2014-12-20  0:23 UTC (permalink / raw)
  To: David Vrabel
  Cc: netdev, david.vrabel, xen-devel, konrad.wilk, boris.ostrovsky,
	edumazet, David S. Miller
In-Reply-To: <1418756386-6940-1-git-send-email-david.vrabel@citrix.com>

David Vrabel <david.vrabel@citrix.com> wrote:
> After d75b1ade567ffab085e8adbbdacf0092d10cd09c (net: less interrupt
> masking in NAPI) the napi instance is removed from the per-cpu list
> prior to calling the n->poll(), and is only requeued if all of the
> budget was used.  This inadvertently broke netfront because netfront
> does not use NAPI correctly.

A similar bug exists in virtio_net.

-- >8 --
The commit d75b1ade567ffab085e8adbbdacf0092d10cd09c (net: less
interrupt masking in NAPI) breaks virtio_net in an insidious way.

It is now required that if the entire budget is consumed when poll
returns, the napi poll_list must remain empty.  However, like some
other drivers virtio_net tries to do a last-ditch check and if
there is more work it will call napi_schedule and then immediately
process some of this new work.  Should the entire budget be consumed
while processing such new work then we will violate the new caller
contract.

This patch fixes this by not touching any work when we reschedule
in virtio_net.

The worst part of this bug is that the list corruption causes other
napi users to be moved off-list.  In my case I was chasing a stall
in IPsec (IPsec uses netif_rx) and I only belatedly realised that it
was virtio_net which caused the stall even though the virtio_net
poll was still functioning perfectly after IPsec stalled.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b8bd719..5ca9771 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -760,7 +760,6 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
 		container_of(napi, struct receive_queue, napi);
 	unsigned int r, received = 0;

-again:
 	received += virtnet_receive(rq, budget - received);

 	/* Out of packets? */
@@ -771,7 +770,6 @@ again:
 		    napi_schedule_prep(napi)) {
 			virtqueue_disable_cb(rq->vq);
 			__napi_schedule(napi);
-			goto again;
 		}
 	}

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

* Re: [PATCH net-next v3] Add support of Cavium Liquidio ethernet adapters
From: Lino Sanfilippo @ 2014-12-20  0:33 UTC (permalink / raw)
  To: Raghu Vatsavayi, davem
  Cc: netdev, Derek Chickles, Satanand Burla, Felix Manlunas,
	Raghu Vatsavayi, Stephen Hemminger
In-Reply-To: <1418959519-31681-1-git-send-email-rvatsavayi@caviumnetworks.com>

On 19.12.2014 04:25, Raghu Vatsavayi wrote:
> +sleep_cond(wait_queue_head_t *wait_queue, int *condition)
> +{
> +	wait_queue_t we;
> +
> +	init_waitqueue_entry(&we, current);
> +	add_wait_queue(wait_queue, &we);
> +	set_current_state(TASK_INTERRUPTIBLE);
> +	while (!(ACCESS_ONCE(*condition)))
> +		schedule();
> +	set_current_state(TASK_RUNNING);
> +	remove_wait_queue(wait_queue, &we);
> +}

This looks as if we should at least check for pending signals, otherwise
there is no point in waiting interruptible.
Maybe wait_event_interruptible() is what we really want here...

Regards,
Lino

^ permalink raw reply

* net: Detect drivers that reschedule NAPI and exhaust budget
From: Herbert Xu @ 2014-12-20  0:36 UTC (permalink / raw)
  To: David Vrabel
  Cc: netdev, xen-devel, konrad.wilk, boris.ostrovsky, edumazet,
	David S. Miller
In-Reply-To: <20141220002327.GA31975@gondor.apana.org.au>

On Sat, Dec 20, 2014 at 11:23:27AM +1100, Herbert Xu wrote:
>
> A similar bug exists in virtio_net.

In order to detect other drivers doing this we should add something
like this.

-- >8 --
The commit d75b1ade567ffab085e8adbbdacf0092d10cd09c (net: less
interrupt masking in NAPI) required drivers to leave poll_list
empty if the entire budget is consumed.

We have already had two broken drivers so let's add a check for
this.
 
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/core/dev.c b/net/core/dev.c
index f411c28..88f9725 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4620,7 +4620,11 @@ static void net_rx_action(struct softirq_action *h)
 					 */
 					napi_gro_flush(n, HZ >= 1000);
 				}
-				list_add_tail(&n->poll_list, &repoll);
+				/* Some drivers may have called napi_schedule
+				 * prior to exhausting their budget.
+				 */
+				if (!WARN_ON_ONCE(!list_empty(&n->poll_list)))
+					list_add_tail(&n->poll_list, &repoll);
 			}
 		}
 
Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

* RE: [PATCH net-next v3] Add support of Cavium Liquidio ethernet adapters
From: Chickles, Derek @ 2014-12-20  0:47 UTC (permalink / raw)
  To: Lino Sanfilippo, Vatsavayi, Raghu, davem@davemloft.net
  Cc: netdev@vger.kernel.org, Burla, Satananda, Manlunas, Felix,
	Vatsavayi, Raghu, stephen Hemminger
In-Reply-To: <5494ABE3.5070003@gmx.de>

> > static int
> > +oct_cfg_rx_intrcnt(struct lio *lio, struct ethtool_coalesce *intr_coal)
> > +{
> 
> > +	}
> > +
> > +	spin_lock_irqsave(&cn6xxx->lock_for_droq_int_enb_reg, flags);
> > +	octeon_write_csr(oct, OCT_SLI_REGNAME(oct, PKT_CNT_INT_ENB),
> intr);
> > +	spin_unlock_irqrestore(&cn6xxx->lock_for_droq_int_enb_reg,
> flags);
> > +
> > +	return 0;
> 
> What is the reason that this is locked? If it really has to be
> synchronized then there should AFAIK at least be an mmiowb() to make
> sure that the write does not leak out of the lock...
> 
> Regards,
> Lino
> 

Yes, we need this locked because the interrupt handler accesses this as well,
But you are right we need the mmiowb().

Thanks,
Derek

^ permalink raw reply

* Re: [RFC PATCH net-next v2 1/1] net: Support for switch port configuration
From: Williams, Kenneth @ 2014-12-20  0:57 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Jiri Pirko, B Viswanath, Samudrala, Sridhar, John Fastabend,
	Varlese, Marco, netdev@vger.kernel.org, Thomas Graf,
	sfeldma@gmail.com, linux-kernel@vger.kernel.org
In-Reply-To: <549450D0.8030805@cumulusnetworks.com>

On Fri, Dec 19, 2014 at 08:22:40AM -0800, Roopa Prabhu wrote:
> On 12/19/14, 1:55 AM, Jiri Pirko wrote:
> >Fri, Dec 19, 2014 at 10:35:27AM CET, marichika4@gmail.com wrote:
> >>On 19 December 2014 at 14:53, Jiri Pirko <jiri@resnulli.us> wrote:
> >>>Fri, Dec 19, 2014 at 10:01:46AM CET, marichika4@gmail.com wrote:
> >>>>On 19 December 2014 at 13:57, Jiri Pirko <jiri@resnulli.us> wrote:
> >>>>>Fri, Dec 19, 2014 at 06:14:57AM CET, marichika4@gmail.com wrote:
> >>>>>>On 19 December 2014 at 05:18, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
> >>>>>>>On 12/18/14, 3:26 PM, Samudrala, Sridhar wrote:
> >>>><snipped for ease of reading>
> >>>>>>>>
> >>>>>>>>We also need an interface to set per-switch attributes. Can this work?
> >>>>>>>>     bridge link set dev sw0 sw_attr bcast_flooding 1 master
> >>>>>>>>where sw0 is a bridge representing the hardware switch.
> >>>>>>>
> >>>>>>>Not today. We discussed this @ LPC, and one way to do this would be to have
> >>>>>>>a device
> >>>>>>>representing the switch asic. This is in the works.
> >>>>>>
> >>>>>>Can I assume that on  platforms which house more than one asic (say
> >>>>>>two 24 port asics, interconnected via a 10G link or equivalent, to get
> >>>>>>a 48 port 'switch') , the 'rocker' driver (or similar) should expose
> >>>>>>them as a single set of ports, and not as two 'switch ports' ?
> >>>>>Well that really depends on particular implementation and drivers. If you
> >>>>>have 2 pci-e devices, I think you should expose them as 2 entities. For
> >>>>>sure, you can have the driver to do the masking for you. I don't believe
> >>>>>that is correct though.
> >>>>>
> >>>>In a platform that houses two asic chips, IMO, the user is still
> >>>>expected to manage the router as a single entity. The configuration
> >>>>being applied on both asic devices need to be matching if not
> >>>>identical, and may not be conflicting. The FDB is to be synchronized
> >>>>so that (offloaded) switching can happen across the asics. Some of
> >>>>this stuff is asic specific anyway. Another example is that of the
> >>>>learning. The (hardware) learning can't be enabled on one asic, while
> >>>>being disabled on another one. The general use cases I have seen are
> >>>>all involving managing the 'router' as a single entity.  That the
> >>>>'router' is implemented with two asics instead of a single asic (with
> >>>>more ports) is to be treated as an implementation detail.  This is the
> >>>>usual router management method that exists today.
> >>>>
> >>>>I hope I make sense.
> >>>>
> >>>>So I am trying to figure out what this single entity that will be used
> >>>>from a user perspective. It can be a bridge, but our bridges are more
> >>>>802.1q bridges. We can use the 'self' mode, but then it means that it
> >>>>should reflect the entire port count, and not just an asic.
> >>>>
> >>>>So I was trying to deduce that in our switchdevice model, the best bet
> >>>>would be to leave the unification to the driver (i.e., to project the
> >>>>multiple physical asics as a single virtual switch device). Thist
> >>>Is it possible to have the asic as just single one? Or is it possible to
> >>>connect asics being multiple chips maybe from multiple vendors together?
> >>I didn't understand the first question. Some times, it is possible to
> >I ment that there is a design with just a single asic of this type,
> >instead of a pair.
> >
> >>have a single asic replace two, but its a cost factor, and others that
> >>are involved.
> >>
> >>AFAIK, the answer to the second question is a No. Two asics from
> >>different vendors may not be connected together. The interconnect
> >>tends to be proprietary.
> >Okay. In that case, it might make sense to mask it on driver level.
> >
> >
> >>>I believe that answer is "yes" in both cases. Making two separate asics
> >>>to appear as one for user is not correct in my opinion. Driver should
> >>>not do such masking. It is unclean, unextendable.
> >>>
> >>I am only looking for a single management entity. I am not thinking it
> >>needs to be at driver level. I am not sure of  any other option apart
> >>from creating a 'switchdev' that Roopa was mentioning.
> >
> >Well the thing is there is a common desire to make the offloading as
> >transparent as possible. For example, have 4 ports of same switch and
> >put them into br0. Just like that, without need to do anything else
> >than you would do when bridging ordinary NICs. Introducing some
> >"management entity" would break this approach.
> >
> I don't think having a switchdevice breaks this approach. A software bridge
> is not a 1-1 mapping with the asic in all cases.
> When its a vlan filtering bridge, yes, it is (In which case all switch
> global l2 non-port specific attributes can be applied to the bridge).
> 
> The switch asic can do l2 and l3 too. For a bridge, the switch asic is just
> accelerating l2.
> And a switch asic is also capable of l3, acls. A switch device (whether
> accessible to userspace or not)
> may become necessary (as discussed in other threads) where you cannot
> resolve a kernel object to a switch port (Global acl rules, unresolved route
> nexthops etc).
A switch-chip vendor that provides a proprietary mechanism of
bonding two or more switch-chips into a single functional unit, also
typically provides an API that allows operating on this bonded set of
switch chips to be addressed as a single unit. If my understanding is
correct, the question of port uniqueness, etc becomes moot.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox