Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] tg3: Avoid NULL pointer dereference in tg3_get_nstats()
From: Michael Chan @ 2017-01-05 21:53 UTC (permalink / raw)
  To: David Miller
  Cc: wangyufen, Siva Reddy Kallam, prashant.sreedharan@broadcom.com,
	michael.chan@broadcom.com, Netdev
In-Reply-To: <20170105.151705.2187713228201334532.davem@davemloft.net>

On Thu, Jan 5, 2017 at 12:17 PM, David Miller <davem@davemloft.net> wrote:
> From: Michael Chan <michael.chan@broadcom.com>
> Date: Thu, 5 Jan 2017 12:04:13 -0800
>
>> But it looks like ndo_get_stats() can be called without rtnl lock from
>> net-procfs.c.  So it is possible that we'll read tp->hw_stats after it
>> has been freed.  For example, if we are reading /proc/net/dev and
>> closing tg3 at the same time.  David, is not taking rtnl_lock in
>> net-procfs.c by design?
>
> Probably not, that dev_get_stats() call probably should be surrounded
> by RTNL protection.
>
> Doing a quick grep on dev_get_stats() shows other call sites, most of
> which are using it to fetch slave device statistics from the get stats
> method of the parent.  Which should be ok.
>
> It appears that the vlan procfs code in net/8021q/vlanproc.c has a
> similar bug as net/core/net-procfs.c
>
> Maybe net/core/net-sysfs.c has the same issue as well, and perhaps also
> net/openvswitch/vport.c:ovs_vport_get_stats().
>

OK.  I will send a patch later today to add rtnl_lock to these callers.

^ permalink raw reply

* Re: [PATCH v4] net: ethernet: faraday: To support device tree usage.
From: Arnd Bergmann @ 2017-01-05 21:55 UTC (permalink / raw)
  To: Greentime Hu
  Cc: f.fainelli, netdev, devicetree, andrew, linux-kernel, jiri,
	jonas.jensen, davem
In-Reply-To: <20170105102345.GA25774@app09>

On Thursday, January 5, 2017 6:23:53 PM CET Greentime Hu wrote:
> Signed-off-by: Greentime Hu <green.hu@gmail.com>
> ---
> Changes in v4:
>   - Use the same binding document to describe the same faraday ethernet controller and add faraday to vendor-prefixes.txt.
> Changes in v3:
>   - Nothing changed in this patch but I have committed andestech to vendor-prefixes.txt.
> Changes in v2:
>   - Change atmac100_of_ids to ftmac100_of_ids
> 

The patch looks good to me now, but please add a proper commit log
before your Signed-off-by tag.

	Arnd

^ permalink raw reply

* Re: [PATCH net-next 01/10] net: netcp: ethss: add support of subsystem register region regmap
From: Murali Karicheri @ 2017-01-05 22:08 UTC (permalink / raw)
  To: Rob Herring
  Cc: netdev, linux-omap, grygorii.strashko, mugunthanvnm, linux-kernel,
	arnd, davem, devicetree, mark.rutland
In-Reply-To: <586EAFBE.1000500@ti.com>

On 01/05/2017 03:42 PM, Murali Karicheri wrote:
> Rob,
> 
> On 12/22/2016 04:24 PM, Rob Herring wrote:
>> On Tue, Dec 20, 2016 at 05:09:44PM -0500, Murali Karicheri wrote:
>>> From: WingMan Kwok <w-kwok2@ti.com>
>>>
>>> 10gbe phy driver needs to access the 10gbe subsystem control
>>> register during phy initialization. To facilitate the shared
>>> access of the subsystem register region between the 10gbe Ethernet
>>> driver and the phy driver, this patch adds support of the
>>> subsystem register region defined by a syscon node in the dts.
>>>
>>> Although there is no shared access to the gbe subsystem register
>>> region, using syscon for that is for the sake of consistency.
>>>
>>> This change is backward compatible with previously released gbe
>>> devicetree bindings.
>>>
>>> Signed-off-by: WingMan Kwok <w-kwok2@ti.com>
>>> Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
>>> Signed-off-by: Sekhar Nori <nsekhar@ti.com>
>>> ---
>>>  .../devicetree/bindings/net/keystone-netcp.txt     |  16 ++-
>>>  drivers/net/ethernet/ti/netcp_ethss.c              | 140 +++++++++++++++++----
>>>  2 files changed, 127 insertions(+), 29 deletions(-)
>>>
>>> diff --git a/Documentation/devicetree/bindings/net/keystone-netcp.txt b/Documentation/devicetree/bindings/net/keystone-netcp.txt
>>> index 04ba1dc..0854a73 100644
>>> --- a/Documentation/devicetree/bindings/net/keystone-netcp.txt
>>> +++ b/Documentation/devicetree/bindings/net/keystone-netcp.txt
>>> @@ -72,20 +72,24 @@ Required properties:
>>>  		"ti,netcp-gbe-2" for 1GbE N NetCP 1.5 (N=2)
>>>  		"ti,netcp-xgbe" for 10 GbE
>>>  
>>> +- syscon-subsys:	phandle to syscon node of the switch
>>> +			subsystem registers.
>>> +
>>>  - reg:		register location and the size for the following register
>>>  		regions in the specified order.
>>>  		- switch subsystem registers
>>> +		- sgmii module registers
>>
>> This needs to go on the end of the list. Otherwise, it is not backwards 
>> compatible.
> 
> Thanks for your review! I assumed backward compatibility means new kernel
> should work with old DTB. The driver code is adjusted to work with both
> DTBs. Isn't that enough?
Rob,

I will pull out 1/10 and 2/10 from the series as it needs more work and
re-submit rest of the patches in my v1 spin. However please reply to my
above backward compatibility question.

Thanks and Regards,

Murali
> 
> Murali
> 
>>
>>>  		- sgmii port3/4 module registers (only for NetCP 1.4)
>>>  		- switch module registers
>>>  		- serdes registers (only for 10G)
>>>  
>>>  		NetCP 1.4 ethss, here is the order
>>> -			index #0 - switch subsystem registers
>>> +			index #0 - sgmii module registers
>>>  			index #1 - sgmii port3/4 module registers
>>>  			index #2 - switch module registers
>>>  
>>>  		NetCP 1.5 ethss 9 port, 5 port and 2 port
>>> -			index #0 - switch subsystem registers
>>> +			index #0 - sgmii module registers
>>>  			index #1 - switch module registers
>>>  			index #2 - serdes registers
>>>  
>>> @@ -145,6 +149,11 @@ Optional properties:
>>>  
>>>  Example binding:
>>>  
>>> +gbe_subsys: subsys@2090000 {
>>> +	compatible = "syscon";
>>> +	reg = <0x02090000 0x100>;
>>> +};
>>> +
>>>  netcp: netcp@2000000 {
>>>  	reg = <0x2620110 0x8>;
>>>  	reg-names = "efuse";
>>> @@ -163,7 +172,8 @@ netcp: netcp@2000000 {
>>>  		ranges;
>>>  		gbe@90000 {
>>>  			label = "netcp-gbe";
>>> -			reg = <0x90000 0x300>, <0x90400 0x400>, <0x90800 0x700>;
>>> +			syscon-subsys = <&gbe_subsys>;
>>> +			reg = <0x90100 0x200>, <0x90400 0x200>, <0x90800 0x700>;
>>>  			/* enable-ale; */
>>>  			tx-queue = <648>;
>>>  			tx-channel = <8>;
>>
> 
> 


-- 
Murali Karicheri
Linux Kernel, Keystone

^ permalink raw reply

* Re: [PATCH] net: phy: dp83867: fix irq generation
From: Florian Fainelli @ 2017-01-05 22:10 UTC (permalink / raw)
  To: Grygorii Strashko, netdev, Dan Murphy, Mugunthan V N
  Cc: linux-omap, Sekhar Nori, linux-kernel, linux-arm-kernel
In-Reply-To: <20170105204807.25990-1-grygorii.strashko@ti.com>

On 01/05/2017 12:48 PM, Grygorii Strashko wrote:
> For proper IRQ generation by DP83867 phy the INT/PWDN pin has to be
> programmed as an interrupt output instead of a Powerdown input in
> Configuration Register 3 (CFG3), Address 0x001E, bit 7 INT_OE = 1. The
> current driver doesn't do this and as result IRQs will not be generated by
> DP83867 phy even if they are properly configured in DT.
> 
> Hence, fix IRQ generation by properly configuring CFG3.INT_OE bit and
> ensure that Link Status Change (LINK_STATUS_CHNG_INT) and Auto-Negotiation
> Complete (AUTONEG_COMP_INT) interrupt are enabled. After this the DP83867
> driver will work properly in interrupt enabled mode.
> 
> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
> ---
>  drivers/net/phy/dp83867.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/net/phy/dp83867.c b/drivers/net/phy/dp83867.c
> index 1b63924..e84ae08 100644
> --- a/drivers/net/phy/dp83867.c
> +++ b/drivers/net/phy/dp83867.c
> @@ -29,6 +29,7 @@
>  #define MII_DP83867_MICR	0x12
>  #define MII_DP83867_ISR		0x13
>  #define DP83867_CTRL		0x1f
> +#define DP83867_CFG3		0x1e
>  
>  /* Extended Registers */
>  #define DP83867_RGMIICTL	0x0032
> @@ -98,6 +99,8 @@ static int dp83867_config_intr(struct phy_device *phydev)
>  		micr_status |=
>  			(MII_DP83867_MICR_AN_ERR_INT_EN |
>  			MII_DP83867_MICR_SPEED_CHNG_INT_EN |
> +			MII_DP83867_MICR_AUTONEG_COMP_INT_EN |
> +			MII_DP83867_MICR_LINK_STS_CHNG_INT_EN |
>  			MII_DP83867_MICR_DUP_MODE_CHNG_INT_EN |
>  			MII_DP83867_MICR_SLEEP_MODE_CHNG_INT_EN);
>  
> @@ -214,6 +217,13 @@ static int dp83867_config_init(struct phy_device *phydev)
>  		}
>  	}
>  
> +	/* Enable Interrupt output INT_OE in CFG3 register */
> +	if (phy_interrupt_is_valid(phydev)) {
> +		val = phy_read(phydev, DP83867_CFG3);
> +		val |= BIT(7);
> +		phy_write(phydev, DP83867_CFG3, val);
> +	}

Don't you need to clear that bit in the case phy_interrupt_is_valid()
returns false?

Other than that:

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian

^ permalink raw reply

* Re: [net-next PATCH 5/6] i40e: Add TX and RX support in switchdev mode.
From: Samudrala, Sridhar @ 2017-01-05 22:32 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Alexander Duyck, John Fastabend, Anjali Singhai Jain,
	jakub.kicinski, intel-wired-lan, Linux Netdev List
In-Reply-To: <CAJ3xEMgAaXNbvaP0=GcuTsVOOGwPHuSMQf-SeLKpxB8WrbEvng@mail.gmail.com>



On 1/5/2017 3:50 AM, Or Gerlitz wrote:
> On Thu, Jan 5, 2017 at 12:46 AM, Samudrala, Sridhar
> <sridhar.samudrala@intel.com> wrote:
>>
>> On 1/3/2017 3:03 PM, Or Gerlitz wrote:
>>> On Fri, Dec 30, 2016 at 7:04 PM, Samudrala, Sridhar
>>> <sridhar.samudrala@intel.com> wrote:
>>>> On 12/30/2016 7:31 AM, Or Gerlitz wrote:
>>>>> Are you exposing switchdev ops for the representators? didn't see that
>>>>> or maybe it's in the 4th patch which didn't make it to the list?
>>>> Not at this time. In the future patches when we offload fdb/vlan
>>>> functionality, we could use switchdev ops.
>>> but wait, this is the switchdev mode... even before doing any
>>> offloading, you want (need) your representor netdevices to have the
>>> same HW ID marking they are all ports of the same ASIC, this you can
>>> do with the switchdev parent ID attribute.
>> OK. I will add switchdev_port_attr_get() with PORT_PARENT_ID support in v3.
> Good, I made this comment, b/c we want to create a well defined user-experience
> to be taken into account also by upper virtualization layers.
>
> Another piece there to add is have your VF reps implement the
> get_phys_port_name ndo,

It looks like you are returning the VF port number as phys_port_name() 
for a VF rep in en_rep.c.
Is this correct?

By default i am creating VFPR netdev with name as <pf_name>_VF<vf_num>
For ex; if enp5s0f0 is the pf name, VFPR netdev for VF0 will be enp5s0f0_vf0

If we want udev to follow this syntax should i return '_vf0'  as 
get_phys_port_name() for VF rep 0?

> where as we explain in commit cb67b832921cfa20ad79bafdc51f1745339d0557 is used
> as follows:
>
>      Port phys name (ndo_get_phys_port_name) is implemented to allow exporting
>      to user-space the VF vport number and along with the switchdev port parent
>      id (phys_switch_id) enable a udev base consistent naming scheme:
>
>      SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
>              ATTR{phys_port_name}!="", NAME="$PF_NIC$attr{phys_port_name}"
>
>      where phys_switch_id is exposed by the PF (and VF reps) and $PF_NIC is
>      the name of the PF netdevice.
>
> Or.

^ permalink raw reply

* Re: [RFC PATCH] virtio_net: XDP support for adjust_head
From: John Fastabend @ 2017-01-05 22:57 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang
  Cc: john.r.fastabend, netdev, alexei.starovoitov, daniel
In-Reply-To: <20170104001332-mutt-send-email-mst@kernel.org>

On 17-01-03 02:16 PM, Michael S. Tsirkin wrote:
> On Tue, Jan 03, 2017 at 02:01:27PM +0800, Jason Wang wrote:
>>
>>
>> On 2017年01月03日 03:44, John Fastabend wrote:
>>> Add support for XDP adjust head by allocating a 256B header region
>>> that XDP programs can grow into. This is only enabled when a XDP
>>> program is loaded.
>>>
>>> In order to ensure that we do not have to unwind queue headroom push
>>> queue setup below bpf_prog_add. It reads better to do a prog ref
>>> unwind vs another queue setup call.
>>>
>>> : There is a problem with this patch as is. When xdp prog is loaded
>>>    the old buffers without the 256B headers need to be flushed so that
>>>    the bpf prog has the necessary headroom. This patch does this by
>>>    calling the virtqueue_detach_unused_buf() and followed by the
>>>    virtnet_set_queues() call to reinitialize the buffers. However I
>>>    don't believe this is safe per comment in virtio_ring this API
>>>    is not valid on an active queue and the only thing we have done
>>>    here is napi_disable/napi_enable wrappers which doesn't do anything
>>>    to the emulation layer.
>>>
>>>    So the RFC is really to find the best solution to this problem.
>>>    A couple things come to mind, (a) always allocate the necessary
>>>    headroom but this is a bit of a waste (b) add some bit somewhere
>>>    to check if the buffer has headroom but this would mean XDP programs
>>>    would be broke for a cycle through the ring, (c) figure out how
>>>    to deactivate a queue, free the buffers and finally reallocate.
>>>    I think (c) is the best choice for now but I'm not seeing the
>>>    API to do this so virtio/qemu experts anyone know off-hand
>>>    how to make this work? I started looking into the PCI callbacks
>>>    reset() and virtio_device_ready() or possibly hitting the right
>>>    set of bits with vp_set_status() but my first attempt just hung
>>>    the device.
>>
>> Hi John:
>>
>> AFAIK, disabling a specific queue was supported only by virtio 1.0 through
>> queue_enable field in pci common cfg.
> 
> In fact 1.0 only allows enabling queues selectively.
> We can add disabling by a spec enhancement but
> for now reset is the only way.
> 
> 
>> But unfortunately, qemu does not
>> emulate this at all and legacy device does not even support this. So the
>> safe way is probably reset the device and redo the initialization here.
> 
> You will also have to re-apply rx filtering if you do this.
> Probably sending notification uplink.
> 

The following seems to hang the device on the next virtnet_send_command()
I expected this to meet the reset requirements from the spec because I
believe its the same flow coming out of restore(). For a real patch we
don't actually need to kfree all the structs and reallocate them but
I was expecting the below to work. Any ideas/hints?

static int virtnet_xdp_reset(struct virtnet_info *vi)
{
        int i, ret;

        netif_device_detach(vi->dev);
        cancel_delayed_work_sync(&vi->refill);
        if (netif_running(vi->dev)) {
                for (i = 0; i < vi->max_queue_pairs; i++)
                        napi_disable(&vi->rq[i].napi);
        }

        remove_vq_common(vi, false);
        ret = init_vqs(vi);
        if (ret)
                return ret;
        virtio_device_ready(vi->vdev);

        if (netif_running(vi->dev)) {
                for (i = 0; i < vi->curr_queue_pairs; i++)
                        if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
                                schedule_delayed_work(&vi->refill, 0);

                for (i = 0; i < vi->max_queue_pairs; i++)
                        virtnet_napi_enable(&vi->rq[i]);
        }
        netif_device_attach(vi->dev);
        return 0;
}

^ permalink raw reply

* Re: [PATCH net-next v4 0/4] Fix OdroidC2 Gigabit Tx link issue
From: Russell King - ARM Linux @ 2017-01-05 23:25 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: devicetree, Alexandre TORGUE, Neil Armstrong, Martin Blumenstingl,
	netdev, Giuseppe Cavallaro, linux-kernel, Yegor Yefremov,
	Julia Lawall, Andre Roth, Andrew Lunn, Kevin Hilman, Carlo Caione,
	linux-amlogic, Andreas Färber, linux-arm-kernel,
	Jerome Brunet
In-Reply-To: <049b1efc-3bad-92e0-45ef-0563dc5d81de@gmail.com>

On Mon, Nov 28, 2016 at 09:54:28AM -0800, Florian Fainelli wrote:
> If we start supporting generic "enable", "disable" type of properties
> with values that map directly to register definitions of the HW, we
> leave too much room for these properties to be utilized to implement a
> specific policy, and this is not acceptable.

Another concern with this patch is that the existing phylib "set_eee"
code is horribly buggy - it just translates the modes from userspace
into the register value and writes them directly to the register with
no validation.  So it's possible to set modes in the register that the
hardware doesn't support, and have them advertised to the link partner.

I have a patch which fixes that, restricting (as we do elsewhere) the
advert according to the EEE supported capabilities retrieved from the
PCS - maybe the problem here is that the PCS doesn't support support
EEE in 1000baseT mode?

Out of interest, which PHY is used on this platform?

On the SolidRun boards, they're using AR8035, and have suffered this
occasional link drop problem.  What has been found is that it seems to
be to do with the timing parameters, and it seemed to only be 1000bT
that was affected.  I don't remember off hand exactly which or what
the change was they made to stabilise it though, but I can probabily
find out tomorrow.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply

* Re: [net-next PATCH v2 5/6] i40e: Add TX and RX support in switchdev mode.
From: Samudrala, Sridhar @ 2017-01-06  0:27 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: alexander.h.duyck, john.r.fastabend, anjali.singhai,
	jakub.kicinski, davem, intel-wired-lan, netdev
In-Reply-To: <20170105125656.GB2211@nanopsycho>



On 1/5/2017 4:56 AM, Jiri Pirko wrote:
> Tue, Jan 03, 2017 at 07:07:53PM CET, sridhar.samudrala@intel.com wrote:
>> In switchdev mode, broadcast filter is not enabled on VFs. The broadcasts and
>> unknown frames from VFs are received by the PF and passed to corresponding VF
>> port representator netdev.
>> A host based switching entity like a linux bridge or OVS redirects these frames
>> to the right VFs via VFPR netdevs. Any frames sent via VFPR netdevs are sent as
>> directed transmits to the corresponding VFs. To enable directed transmit, skb
>> metadata dst is used to pass the VF id and the frame is requeued to call the PFs
>> transmit routine.
>>
>> Small script to demonstrate inter VF pings in switchdev mode.
>> PF: enp5s0f0, VFs: enp5s2,enp5s2f1 VFPRs:enp5s0f0-vf0, enp5s0f0-vf1
>>
>> # rmmod i40e; modprobe i40e
>> # devlink dev eswitch set pci/0000:05:00.0 mode switchdev
>> # echo 2 > /sys/class/net/enp5s0f0/device/sriov_numvfs
>> # ip link set enp5s0f0 vf 0 mac 00:11:22:33:44:55
>> # ip link set enp5s0f0 vf 1 mac 00:11:22:33:44:56
>> # rmmod i40evf; modprobe i40evf
>>
>> /* Create 2 namespaces and move the VFs to the corresponding ns. */
>> # ip netns add ns0
>> # ip link set enp5s2 netns ns0
>> # ip netns exec ns0 ip addr add 192.168.1.10/24 dev enp5s2
>> # ip netns exec ns0 ip link set enp5s2 up
>> # ip netns add ns1
>> # ip link set enp5s2f1 netns ns1
>> # ip netns exec ns1 ip addr add 192.168.1.11/24 dev enp5s2f1
>> # ip netns exec ns1 ip link set enp5s2f1 up
>>
>> /* bring up pf and vfpr netdevs */
>> # ip link set enp5s0f0 up
>> # ip link set enp5s0f0-vf0 up
>> # ip link set enp5s0f0-vf1 up
>>
>> /* Create a linux bridge and add vfpr netdevs to it. */
>> # ip link add vfpr-br type bridge
>> # ip link set enp5s0f0-vf0 master vfpr-br
>> # ip link set enp5s0f0-vf1 master vfpr-br
>> # ip addr add 192.168.1.1/24 dev vfpr-br
>> # ip link set vfpr-br up
>>
>> # ip netns exec ns0 ping -c3 192.168.1.11
>> # ip netns exec ns1 ping -c3 192.168.1.10
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> [...]
>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>> index 352cf7c..b46ddaa 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>> @@ -1176,16 +1176,37 @@ static bool i40e_alloc_mapped_page(struct i40e_ring *rx_ring,
>>   * @rx_ring:  rx ring in play
>>   * @skb: packet to send up
>>   * @vlan_tag: vlan tag for packet
>> + * @lpbk: is it a loopback frame?
>>   **/
>> static void i40e_receive_skb(struct i40e_ring *rx_ring,
>> -			     struct sk_buff *skb, u16 vlan_tag)
>> +			     struct sk_buff *skb, u16 vlan_tag, bool lpbk)
>> {
>> 	struct i40e_q_vector *q_vector = rx_ring->q_vector;
>> +	struct i40e_pf *pf = rx_ring->vsi->back;
>> +	struct i40e_vf *vf;
>> +	struct ethhdr *eth;
>> +	int vf_id;
>>
>> 	if ((rx_ring->netdev->features & NETIF_F_HW_VLAN_CTAG_RX) &&
>> 	    (vlan_tag & VLAN_VID_MASK))
>> 		__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlan_tag);
>>
>> +	if ((pf->eswitch_mode == DEVLINK_ESWITCH_MODE_LEGACY) || !lpbk)
>> +		goto gro_receive;
>> +
>> +	/* If a loopback packet is received from a VF in switchdev mode, pass the
>> +	 * frame to the corresponding VFPR netdev based on the source MAC in the frame.
>> +	 */
>> +	eth = (struct ethhdr *)skb_mac_header(skb);
>> +	for (vf_id = 0; vf_id < pf->num_alloc_vfs; vf_id++) {
>> +		vf = &pf->vf[vf_id];
>> +		if (ether_addr_equal(eth->h_source, vf->default_lan_addr.addr)) {
>> +			skb->dev = vf->vfpr_netdev;
> This sucks :( Is't there any identification coming from rx ring that
> would tell you what vf this is? To match vrpr according to a single MAC
> address seems a bit awkward. What if there is a macvlan configured
> on the VF?
Unfortunately, with the current HW, RX descriptor only indicates if it 
is a loopback packet from a VF,
but not the specific id of the VF.
Multiple macs on VF are not supported with the current patchset.
At this point we are not making switchdev as the default mode because of 
these limitations.

>
>
>
>> +			break;
>> +		}
>> +	}
>> +
>> +gro_receive:
>> 	napi_gro_receive(&q_vector->napi, skb);
>> }
>>
> [...]
>
>> @@ -2998,3 +3064,19 @@ netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
>>
>> 	return i40e_xmit_frame_ring(skb, tx_ring);
>> }
>> +
>> +netdev_tx_t i40e_vfpr_netdev_start_xmit(struct sk_buff *skb,
>> +				        struct net_device *dev)
>> +{
>> +	struct i40e_vfpr_netdev_priv *priv = netdev_priv(dev);
>> +	struct i40e_vf *vf = priv->vf;
>> +	struct i40e_pf *pf = vf->pf;
>> +	struct i40e_vsi *vsi = pf->vsi[pf->lan_vsi];
>> +
>> +	skb_dst_drop(skb);
>> +	dst_hold(&priv->vfpr_dst->dst);
>> +	skb_dst_set(skb, &priv->vfpr_dst->dst);
>> +	skb->dev = vsi->netdev;
> This dst dance seems a bit odd to me. Why don't you just call
> i40e_xmit_frame_ring with an extra arg holding the needed metadata?

We don't have TX/RX queues associated with VFPR netdevs, so we need to 
set the dev to PF netdev and requeue the skb.

>
>
>> +
>> +	return dev_queue_xmit(skb);
>> +}
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>> index e065321..850723f 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>> @@ -366,6 +366,7 @@ struct i40e_ring_container {
>>
>> bool i40e_alloc_rx_buffers(struct i40e_ring *rxr, u16 cleaned_count);
>> netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
>> +netdev_tx_t i40e_vfpr_netdev_start_xmit(struct sk_buff *skb, struct net_device *netdev);
>> void i40e_clean_tx_ring(struct i40e_ring *tx_ring);
>> void i40e_clean_rx_ring(struct i40e_ring *rx_ring);
>> int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring);
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_type.h b/drivers/net/ethernet/intel/i40e/i40e_type.h
>> index edc0abd..c136cc9 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_type.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_type.h
>> @@ -728,6 +728,9 @@ enum i40e_rx_desc_status_bits {
>> #define I40E_RXD_QW1_STATUS_TSYNVALID_SHIFT  I40E_RX_DESC_STATUS_TSYNVALID_SHIFT
>> #define I40E_RXD_QW1_STATUS_TSYNVALID_MASK \
>> 				    BIT_ULL(I40E_RXD_QW1_STATUS_TSYNVALID_SHIFT)
>> +#define I40E_RXD_QW1_STATUS_LPBK_SHIFT  I40E_RX_DESC_STATUS_LPBK_SHIFT
>> +#define I40E_RXD_QW1_STATUS_LPBK_MASK \
>> +				BIT_ULL(I40E_RXD_QW1_STATUS_LPBK_SHIFT)
>>
>> enum i40e_rx_desc_fltstat_values {
>> 	I40E_RX_DESC_FLTSTAT_NO_DATA	= 0,
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
>> index 0c8687d..f0860ef 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
>> @@ -1062,6 +1062,7 @@ static int i40e_vfpr_netdev_stop(struct net_device *dev)
>> static const struct net_device_ops i40e_vfpr_netdev_ops = {
>> 	.ndo_open       	= i40e_vfpr_netdev_open,
>> 	.ndo_stop       	= i40e_vfpr_netdev_stop,
>> +	.ndo_start_xmit		= i40e_vfpr_netdev_start_xmit,
>> };
>>
>> /**
>> @@ -1121,16 +1122,22 @@ int i40e_alloc_vfpr_netdev(struct i40e_vf *vf, u16 vf_num)
>>
>> 	priv = netdev_priv(vfpr_netdev);
>> 	priv->vf = &(pf->vf[vf_num]);
>> +	priv->vfpr_dst = metadata_dst_alloc(0, METADATA_HW_PORT_MUX, GFP_KERNEL);
> I'm missing a patch here. In net-next, metadata_dst_alloc has 2 args.

Somehow that patch didn't make it to netdev. You can find it here on 
intel-wired-lan archives.
http://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20170102/007723.html
I will CC you when i submit V3.

^ permalink raw reply

* [PATCH iproute2 1/3] ip vrf: Fix run-on error message on mkdir failure
From: David Ahern @ 2017-01-06  0:22 UTC (permalink / raw)
  To: netdev, stephen; +Cc: David Ahern
In-Reply-To: <1483662143-15242-1-git-send-email-dsa@cumulusnetworks.com>

Andy reported a missing newline if a non-root user attempts to run
'ip vrf exec':

$ ./ip/ip vrf exec default /bin/echo asdf
mkdir failed for /var/run/cgroup2: Permission deniedFailed to setup vrf cgroup2 directory

Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 lib/fs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/fs.c b/lib/fs.c
index 39cc96dccca9..644bb486ae8e 100644
--- a/lib/fs.c
+++ b/lib/fs.c
@@ -121,7 +121,7 @@ int make_path(const char *path, mode_t mode)
 
 			if (mkdir(dir, mode) != 0) {
 				fprintf(stderr,
-					"mkdir failed for %s: %s",
+					"mkdir failed for %s: %s\n",
 					dir, strerror(errno));
 				goto out;
 			}
-- 
2.1.4

^ permalink raw reply related

* [PATCH iproute2 3/3] ip vrf: Improve bpf error messages
From: David Ahern @ 2017-01-06  0:22 UTC (permalink / raw)
  To: netdev, stephen; +Cc: David Ahern
In-Reply-To: <1483662143-15242-1-git-send-email-dsa@cumulusnetworks.com>

Next up a non-root user gets various bpf related error messages:

$ ip vrf exec mgmt bash
Failed to load BPF prog: 'Operation not permitted'
Kernel compiled with CGROUP_BPF enabled?

Catch the EPERM error and do not show the kernel config option.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 ip/ipvrf.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/ip/ipvrf.c b/ip/ipvrf.c
index dc8364a43a57..8bd99d6251f2 100644
--- a/ip/ipvrf.c
+++ b/ip/ipvrf.c
@@ -181,7 +181,11 @@ static int vrf_configure_cgroup(const char *path, int ifindex)
 	if (prog_fd < 0) {
 		fprintf(stderr, "Failed to load BPF prog: '%s'\n",
 			strerror(errno));
-		fprintf(stderr, "Kernel compiled with CGROUP_BPF enabled?\n");
+
+		if (errno != EPERM) {
+			fprintf(stderr,
+				"Kernel compiled with CGROUP_BPF enabled?\n");
+		}
 		goto out;
 	}
 
-- 
2.1.4

^ permalink raw reply related

* [PATCH iproute2 0/3] ip vrf: minor error message cleanups
From: David Ahern @ 2017-01-06  0:22 UTC (permalink / raw)
  To: netdev, stephen; +Cc: David Ahern

David Ahern (3):
  ip vrf: Fix error message when running exec as non-root user
  ip vrf: Improve error message for non-root user
  ip vrf: Clean up bpf related error messages

 ip/ipvrf.c |  6 +++++-
 lib/fs.c   | 16 ++++++++++++----
 2 files changed, 17 insertions(+), 5 deletions(-)

-- 
2.1.4

^ permalink raw reply

* [PATCH iproute2 2/3] ip vrf: Improve cgroup2 error messages
From: David Ahern @ 2017-01-06  0:22 UTC (permalink / raw)
  To: netdev, stephen; +Cc: David Ahern
In-Reply-To: <1483662143-15242-1-git-send-email-dsa@cumulusnetworks.com>

Currently, if a non-root user attempts to run ip vrf exec a non-helpful
error is returned:

$ ip vrf exec mgmt bash
Failed to mount cgroup2. Are CGROUPS enabled in your kernel?

Only show the CGROUPS kernel hint for the ENODEV error and for the
rest show the strerror for the errno. So now:

$ ip/ip vrf exec mgmt bash
Failed to mount cgroup2: Operation not permitted

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 lib/fs.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/lib/fs.c b/lib/fs.c
index 644bb486ae8e..12a4657a0bc9 100644
--- a/lib/fs.c
+++ b/lib/fs.c
@@ -80,13 +80,21 @@ char *find_cgroup2_mount(void)
 
 	if (mount("none", mnt, CGROUP2_FS_NAME, 0, NULL)) {
 		/* EBUSY means already mounted */
-		if (errno != EBUSY) {
+		if (errno == EBUSY)
+			goto out;
+
+		if (errno == ENODEV) {
 			fprintf(stderr,
 				"Failed to mount cgroup2. Are CGROUPS enabled in your kernel?\n");
-			free(mnt);
-			return NULL;
+		} else {
+			fprintf(stderr,
+				"Failed to mount cgroup2: %s\n",
+				strerror(errno));
 		}
+		free(mnt);
+		return NULL;
 	}
+out:
 	return mnt;
 }
 
-- 
2.1.4

^ permalink raw reply related

* Re: [RFC PATCH] virtio_net: XDP support for adjust_head
From: Michael S. Tsirkin @ 2017-01-06  0:39 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jason Wang, john.r.fastabend, netdev, alexei.starovoitov, daniel
In-Reply-To: <586ECF53.1070700@gmail.com>

On Thu, Jan 05, 2017 at 02:57:23PM -0800, John Fastabend wrote:
> On 17-01-03 02:16 PM, Michael S. Tsirkin wrote:
> > On Tue, Jan 03, 2017 at 02:01:27PM +0800, Jason Wang wrote:
> >>
> >>
> >> On 2017年01月03日 03:44, John Fastabend wrote:
> >>> Add support for XDP adjust head by allocating a 256B header region
> >>> that XDP programs can grow into. This is only enabled when a XDP
> >>> program is loaded.
> >>>
> >>> In order to ensure that we do not have to unwind queue headroom push
> >>> queue setup below bpf_prog_add. It reads better to do a prog ref
> >>> unwind vs another queue setup call.
> >>>
> >>> : There is a problem with this patch as is. When xdp prog is loaded
> >>>    the old buffers without the 256B headers need to be flushed so that
> >>>    the bpf prog has the necessary headroom. This patch does this by
> >>>    calling the virtqueue_detach_unused_buf() and followed by the
> >>>    virtnet_set_queues() call to reinitialize the buffers. However I
> >>>    don't believe this is safe per comment in virtio_ring this API
> >>>    is not valid on an active queue and the only thing we have done
> >>>    here is napi_disable/napi_enable wrappers which doesn't do anything
> >>>    to the emulation layer.
> >>>
> >>>    So the RFC is really to find the best solution to this problem.
> >>>    A couple things come to mind, (a) always allocate the necessary
> >>>    headroom but this is a bit of a waste (b) add some bit somewhere
> >>>    to check if the buffer has headroom but this would mean XDP programs
> >>>    would be broke for a cycle through the ring, (c) figure out how
> >>>    to deactivate a queue, free the buffers and finally reallocate.
> >>>    I think (c) is the best choice for now but I'm not seeing the
> >>>    API to do this so virtio/qemu experts anyone know off-hand
> >>>    how to make this work? I started looking into the PCI callbacks
> >>>    reset() and virtio_device_ready() or possibly hitting the right
> >>>    set of bits with vp_set_status() but my first attempt just hung
> >>>    the device.
> >>
> >> Hi John:
> >>
> >> AFAIK, disabling a specific queue was supported only by virtio 1.0 through
> >> queue_enable field in pci common cfg.
> > 
> > In fact 1.0 only allows enabling queues selectively.
> > We can add disabling by a spec enhancement but
> > for now reset is the only way.
> > 
> > 
> >> But unfortunately, qemu does not
> >> emulate this at all and legacy device does not even support this. So the
> >> safe way is probably reset the device and redo the initialization here.
> > 
> > You will also have to re-apply rx filtering if you do this.
> > Probably sending notification uplink.
> > 
> 
> The following seems to hang the device on the next virtnet_send_command()
> I expected this to meet the reset requirements from the spec because I
> believe its the same flow coming out of restore(). For a real patch we
> don't actually need to kfree all the structs and reallocate them but
> I was expecting the below to work. Any ideas/hints?

Restore assumes device was previously reset.
You want to combine freeze+restore.

> static int virtnet_xdp_reset(struct virtnet_info *vi)
> {
>         int i, ret;
> 
>         netif_device_detach(vi->dev);
>         cancel_delayed_work_sync(&vi->refill);
>         if (netif_running(vi->dev)) {
>                 for (i = 0; i < vi->max_queue_pairs; i++)
>                         napi_disable(&vi->rq[i].napi);
>         }
> 
>         remove_vq_common(vi, false);
>         ret = init_vqs(vi);
>         if (ret)
>                 return ret;
>         virtio_device_ready(vi->vdev);
> 
>         if (netif_running(vi->dev)) {
>                 for (i = 0; i < vi->curr_queue_pairs; i++)
>                         if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
>                                 schedule_delayed_work(&vi->refill, 0);
> 
>                 for (i = 0; i < vi->max_queue_pairs; i++)
>                         virtnet_napi_enable(&vi->rq[i]);
>         }
>         netif_device_attach(vi->dev);
>         return 0;
> }

^ permalink raw reply

* Re: [Intel-wired-lan] [net-next PATCH v2 2/6] i40e: Introduce VF Port Representator(VFPR) netdevs.
From: Samudrala, Sridhar @ 2017-01-06  0:42 UTC (permalink / raw)
  To: Jeff Kirsher, alexander.h.duyck, john.r.fastabend, anjali.singhai,
	jakub.kicinski, davem, intel-wired-lan, netdev
In-Reply-To: <1483652771.25700.41.camel@intel.com>



On 1/5/2017 1:46 PM, Jeff Kirsher wrote:
> On Tue, 2017-01-03 at 10:07 -0800, Sridhar Samudrala wrote:
>> VF Port Representator netdevs are created for each VF if the switch mode
>> is set to 'switchdev'. These netdevs can be used to control and configure
>> VFs from PFs namespace. They enable exposing VF statistics, configure and
>> monitor link state, mtu, filters, fdb/vlan entries etc. of VFs.
>> Broadcast filters are not enabled in switchdev mode.
>>
>> Sample script to create VF port representors
>> # rmmod i40e; modprobe i40e
>> # devlink dev eswitch set pci/0000:05:00.0 mode switchdev
>> # echo 2 > /sys/class/net/enp5s0f0/device/sriov_numvfs
>> # ip l show
>> 297: enp5s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop portid
>> 6805ca2e7268 state DOWN mode DEFAULT group default qlen 1000
>>       link/ether 68:05:ca:2e:72:68 brd ff:ff:ff:ff:ff:ff
>>       vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto,
>> trust off
>>       vf 1 MAC 00:00:00:00:00:00, spoof checking on, link-state auto,
>> trust off
>> 299: enp5s0f0-vf0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
>> mode DEFAULT group default qlen 1000
>>       link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
>> 300: enp5s0f0-vf1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
>> mode DEFAULT group default qlen 1000
>>       link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
>>   drivers/net/ethernet/intel/i40e/i40e_main.c        |  21 ++-
>>   drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 154
>> ++++++++++++++++++++-
>>   drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.h |  14 ++
>>   3 files changed, 182 insertions(+), 7 deletions(-)
> This does not apply cleanly because it is based on an older version of
> i40e_virtchnl_pf.c file.  It appears that i40e has been updated to use
> "i40e_add_filter()" yet your patch still uses "i40e_add_mac_filter()".
I am not using i40e_add_mac_filter() in my patches. I only i40e_add_filter()
These patches are against davem's net-next kernel

>
> We need to clarify what the "right way" is to add filters and use the
> correct function.
>
> Dropping this series and will await v3, please address the other feedback
> from Or Gerlitz and Jiri Pirko as well in your updated series.

Sure. I will be submitting a v3 soon addressing the review comments.

^ permalink raw reply

* RE: [PATCH 2/2] PCI: lock each enable/disable num_vfs operation in sysfs
From: Tantilov, Emil S @ 2017-01-06  0:55 UTC (permalink / raw)
  To: Gavin Shan
  Cc: linux-pci@vger.kernel.org, intel-wired-lan@lists.osuosl.org,
	Duyck, Alexander H, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20170104231139.GA3726@gwshan>

>-----Original Message-----
>From: Gavin Shan [mailto:gwshan@linux.vnet.ibm.com]
>Sent: Wednesday, January 04, 2017 3:12 PM
>To: Tantilov, Emil S <emil.s.tantilov@intel.com>
>Cc: Gavin Shan <gwshan@linux.vnet.ibm.com>; linux-pci@vger.kernel.org;
>intel-wired-lan@lists.osuosl.org; Duyck, Alexander H
><alexander.h.duyck@intel.com>; netdev@vger.kernel.org; linux-
>kernel@vger.kernel.org
>Subject: Re: [PATCH 2/2] PCI: lock each enable/disable num_vfs operation in
>sysfs
>
>On Wed, Jan 04, 2017 at 04:00:20PM +0000, Tantilov, Emil S wrote:
>>>On Tue, Jan 03, 2017 at 04:48:31PM -0800, Emil Tantilov wrote:
>>>>Enabling/disabling SRIOV via sysfs by echo-ing multiple values
>>>>simultaneously:
>>>>
>>>>echo 63 > /sys/class/net/ethX/device/sriov_numvfs&
>>>>echo 63 > /sys/class/net/ethX/device/sriov_numvfs
>>>>
>>>>sleep 5
>>>>
>>>>echo 0 > /sys/class/net/ethX/device/sriov_numvfs&
>>>>echo 0 > /sys/class/net/ethX/device/sriov_numvfs
>>>>
>>>>Results in the following bug:
>>>>
>>>>kernel BUG at drivers/pci/iov.c:495!
>>>>invalid opcode: 0000 [#1] SMP
>>>>CPU: 1 PID: 8050 Comm: bash Tainted: G   W   4.9.0-rc7-net-next #2092
>>>>RIP: 0010:[<ffffffff813b1647>]
>>>>	  [<ffffffff813b1647>] pci_iov_release+0x57/0x60
>>>>
>>>>Call Trace:
>>>> [<ffffffff81391726>] pci_release_dev+0x26/0x70
>>>> [<ffffffff8155be6e>] device_release+0x3e/0xb0
>>>> [<ffffffff81365ee7>] kobject_cleanup+0x67/0x180
>>>> [<ffffffff81365d9d>] kobject_put+0x2d/0x60
>>>> [<ffffffff8155bc27>] put_device+0x17/0x20
>>>> [<ffffffff8139c08a>] pci_dev_put+0x1a/0x20
>>>> [<ffffffff8139cb6b>] pci_get_dev_by_id+0x5b/0x90
>>>> [<ffffffff8139cca5>] pci_get_subsys+0x35/0x40
>>>> [<ffffffff8139ccc8>] pci_get_device+0x18/0x20
>>>> [<ffffffff8139ccfb>] pci_get_domain_bus_and_slot+0x2b/0x60
>>>> [<ffffffff813b09e7>] pci_iov_remove_virtfn+0x57/0x180
>>>> [<ffffffff813b0b95>] pci_disable_sriov+0x65/0x140
>>>> [<ffffffffa00a1af7>] ixgbe_disable_sriov+0xc7/0x1d0 [ixgbe]
>>>> [<ffffffffa00a1e9d>] ixgbe_pci_sriov_configure+0x3d/0x170 [ixgbe]
>>>> [<ffffffff8139d28c>] sriov_numvfs_store+0xdc/0x130
>>>>...
>>>>RIP  [<ffffffff813b1647>] pci_iov_release+0x57/0x60
>>>>
>>>>Use the existing mutex lock to protect each enable/disable operation.
>>>>
>>>>CC: Alexander Duyck <alexander.h.duyck@intel.com>
>>>>Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
>>>
>>>Emil, It's going to change semantics of pci_enable_sriov() and
>pci_disable_sriov().
>>>They can be invoked when writing to the sysfs entry, or loading PF's
>>>driver. With the change applied, the lock (pf->sriov->lock) isn't acquired and released
>>>in the PF's driver loading path.
>>
>>The enablement of SRIOV on driver load is done via deprecated module parameter.
>>Perhaps we can just remove it, although there are probably still people that use it
>>and may not be happy if we get rid of it.
>>
>
>Yeah, some drivers are still using the interface. So we cannot affect it
>until it can be droped.
>
>>>I think the reasonable way would be adding a flag in "struct sriov", to
>>>indicate someone is accessing the IOV capability through sysfs file. With this, the
>>>code returns with "-EBUSY" immediately for contenders. With it, nothing is going
>>>to be changed in PF's driver loading path.
>>
>>Flag is what I initially had in mind, but did not want to add extra locking if we
>>can make use of the existing.
>>
>
>The problem is sriov->lock wasn't introduced to protect the whole IOV capability.
>Instead, it protects the allocation of virtual bus (if needed). In your patch,
>it will be used to protect the whole IOV capability, ensure accessing the
>IOV capability exclusively. So the usage of this lock is changed.
>
>     code extracted from pci.h:
>
>     struct pci_sriov {
>            :
>            struct mutex lock;      /* lock for VF bus */
>            :
>     }
>
>The lock was introduced by commit d1b054da8 ("PCI: initialize and release
>SR-IOV capability"). If I'm correct enough, I don't think this lock is needed when
>pci_enable_sriov() or pci_disable_sriov() are called in driver because of
>module
>parameters. I don't see the usage case calling pci_disable_sriov() while
>previous pci_enable_sriov() isn't finished yet. Also, it's not needed in EEH
>subsystem.
>So I think the lock can be dropped, then it can be used to protect sysfs path.

That's pretty much what this patch does, except I kept the locking for EEH since 
it is the only driver that calls pci_iov_add/remove_virtfn() directly.

I'll write it up and run some tests, although I have no way to test EEH.
 
>>>Also, there are some minor comments as below and I guess most of them won't
>>>be applied if you take my suggestion eventually. However, I'm trying to make
>>>the comments complete.
>>
>>Thanks a lot for reviewing!
>>
>>>
>>>>---
>>>> drivers/pci/pci-sysfs.c |   24 +++++++++++++++++-------
>>>> 1 file changed, 17 insertions(+), 7 deletions(-)
>>>>
>>>>diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
>>>>index 0666287..5b54cf5 100644
>>>>--- a/drivers/pci/pci-sysfs.c
>>>>+++ b/drivers/pci/pci-sysfs.c
>>>>@@ -472,7 +472,9 @@ static ssize_t sriov_numvfs_store(struct device
>*dev,
>>>> 				  const char *buf, size_t count)
>>>> {
>>>> 	struct pci_dev *pdev = to_pci_dev(dev);
>>>>+	struct pci_sriov *iov = pdev->sriov;
>>>> 	int ret;
>>>>+
>>>
>>>Unnecessary change.
>>>
>>>> 	u16 num_vfs;
>>>>
>>>> 	ret = kstrtou16(buf, 0, &num_vfs);
>>>>@@ -482,38 +484,46 @@ static ssize_t sriov_numvfs_store(struct device
>>>*dev,
>>>> 	if (num_vfs > pci_sriov_get_totalvfs(pdev))
>>>> 		return -ERANGE;
>>>>
>>>>+	mutex_lock(&iov->dev->sriov->lock);
>>>>+
>>>> 	if (num_vfs == pdev->sriov->num_VFs)
>>>>-		return count;		/* no change */
>>>>+		goto exit;
>>>>
>>>> 	/* is PF driver loaded w/callback */
>>>> 	if (!pdev->driver || !pdev->driver->sriov_configure) {
>>>> 		dev_info(&pdev->dev, "Driver doesn't support SRIOV
>>>configuration via sysfs\n");
>>>>-		return -ENOSYS;
>>>>+		ret = -EINVAL;
>>>>+		goto exit;
>>>
>>>Why we need change the error code here?
>>
>>checkpatch was complaining about the use of the ENOSYS error code being specific
>>and even though it was not my patch introducing it I had to change it to shut it up.
>>
>
>Right, it's reserved for attempt to call nonexisting syscall, but I think
>ENOENT might be more indicative than EINVAL in this specific case?

ENOENT is for a missing file, but if we got this far in the code then there must've been 
a sysfs file. This is pretty straightforward "not supported" error, which is why I picked
EINVAL.

Thanks,
Emil 

^ permalink raw reply

* [PATCH net] udp: inuse checks can quit early for reuseport
From: Eric Garver @ 2017-01-06  1:22 UTC (permalink / raw)
  To: davem; +Cc: netdev

UDP lib inuse checks will walk the entire hash bucket to check if the
portaddr is in use. In the case of reuseport we can stop searching when
we find a matching reuseport.

On a 16-core VM a test program that spawns 16 threads that each bind to
1024 sockets (one per 10ms) takes 1m45s. With this change it takes 11s.

Also add a cond_resched() when the port is not specified.

Signed-off-by: Eric Garver <e@erig.me>
---
 net/ipv4/udp.c | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 1307a7c2e544..4318d72e0248 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -153,13 +153,18 @@ static int udp_lib_lport_inuse(struct net *net, __u16 num,
 		    (!sk2->sk_reuse || !sk->sk_reuse) &&
 		    (!sk2->sk_bound_dev_if || !sk->sk_bound_dev_if ||
 		     sk2->sk_bound_dev_if == sk->sk_bound_dev_if) &&
-		    (!sk2->sk_reuseport || !sk->sk_reuseport ||
-		     rcu_access_pointer(sk->sk_reuseport_cb) ||
-		     !uid_eq(uid, sock_i_uid(sk2))) &&
 		    saddr_comp(sk, sk2, true)) {
-			if (!bitmap)
-				return 1;
-			__set_bit(udp_sk(sk2)->udp_port_hash >> log, bitmap);
+			if (sk2->sk_reuseport && sk->sk_reuseport &&
+			    !rcu_access_pointer(sk->sk_reuseport_cb) &&
+			    uid_eq(uid, sock_i_uid(sk2))) {
+				if (!bitmap)
+					return 0;
+			} else {
+				if (!bitmap)
+					return 1;
+				__set_bit(udp_sk(sk2)->udp_port_hash >> log,
+					  bitmap);
+			}
 		}
 	}
 	return 0;
@@ -188,11 +193,14 @@ static int udp_lib_lport_inuse2(struct net *net, __u16 num,
 		    (!sk2->sk_reuse || !sk->sk_reuse) &&
 		    (!sk2->sk_bound_dev_if || !sk->sk_bound_dev_if ||
 		     sk2->sk_bound_dev_if == sk->sk_bound_dev_if) &&
-		    (!sk2->sk_reuseport || !sk->sk_reuseport ||
-		     rcu_access_pointer(sk->sk_reuseport_cb) ||
-		     !uid_eq(uid, sock_i_uid(sk2))) &&
 		    saddr_comp(sk, sk2, true)) {
-			res = 1;
+			if (sk2->sk_reuseport && sk->sk_reuseport &&
+			    !rcu_access_pointer(sk->sk_reuseport_cb) &&
+			    uid_eq(uid, sock_i_uid(sk2))) {
+				res = 0;
+			} else {
+				res = 1;
+			}
 			break;
 		}
 	}
@@ -285,6 +293,7 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
 				snum += rand;
 			} while (snum != first);
 			spin_unlock_bh(&hslot->lock);
+			cond_resched();
 		} while (++first != last);
 		goto fail;
 	} else {
-- 
2.10.0

^ permalink raw reply related

* Re: [PATCH v2] scsi: bfa: Increase requested firmware version to 3.2.5.1
From: Martin K. Petersen @ 2017-01-06  1:48 UTC (permalink / raw)
  To: Benjamin Poirier
  Cc: Martin K. Petersen, Tim Ehlers, Rasesh Mody, Anil Gurumurthy,
	Sudarsana Kalluru, James E.J. Bottomley, linux-scsi, netdev,
	linux-kernel
In-Reply-To: <20161224044019.13797-1-bpoirier@suse.com>

>>>>> "Benjamin" == Benjamin Poirier <bpoirier@suse.com> writes:

Benjamin> bna & bfa firmware version 3.2.5.1 was submitted to
Benjamin> linux-firmware on Feb 17 19:10:20 2015 -0500 in 0ab54ff1dc
Benjamin> ("linux-firmware: Add QLogic BR Series Adapter Firmware").

Benjamin> Increase bfa's requested firmware version. Also increase the
Benjamin> driver version.  I only tested that all of the devices probe
Benjamin> without error.

Applied to 4.10/scsi-fixes.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

* Re: [PATCH 2/2] PCI: lock each enable/disable num_vfs operation in sysfs
From: Gavin Shan @ 2017-01-06  1:49 UTC (permalink / raw)
  To: Tantilov, Emil S
  Cc: Gavin Shan, linux-pci@vger.kernel.org,
	intel-wired-lan@lists.osuosl.org, Duyck, Alexander H,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <87618083B2453E4A8714035B62D679925074474B@FMSMSX105.amr.corp.intel.com>

On Fri, Jan 06, 2017 at 12:55:08AM +0000, Tantilov, Emil S wrote:
>>On Wed, Jan 04, 2017 at 04:00:20PM +0000, Tantilov, Emil S wrote:
>>>>On Tue, Jan 03, 2017 at 04:48:31PM -0800, Emil Tantilov wrote:
>>>>>Enabling/disabling SRIOV via sysfs by echo-ing multiple values
>>>>>simultaneously:
>>>>>
>>>>>echo 63 > /sys/class/net/ethX/device/sriov_numvfs&
>>>>>echo 63 > /sys/class/net/ethX/device/sriov_numvfs
>>>>>
>>>>>sleep 5
>>>>>
>>>>>echo 0 > /sys/class/net/ethX/device/sriov_numvfs&
>>>>>echo 0 > /sys/class/net/ethX/device/sriov_numvfs
>>>>>
>>>>>Results in the following bug:
>>>>>
>>>>>kernel BUG at drivers/pci/iov.c:495!
>>>>>invalid opcode: 0000 [#1] SMP
>>>>>CPU: 1 PID: 8050 Comm: bash Tainted: G   W   4.9.0-rc7-net-next #2092
>>>>>RIP: 0010:[<ffffffff813b1647>]
>>>>>	  [<ffffffff813b1647>] pci_iov_release+0x57/0x60
>>>>>
>>>>>Call Trace:
>>>>> [<ffffffff81391726>] pci_release_dev+0x26/0x70
>>>>> [<ffffffff8155be6e>] device_release+0x3e/0xb0
>>>>> [<ffffffff81365ee7>] kobject_cleanup+0x67/0x180
>>>>> [<ffffffff81365d9d>] kobject_put+0x2d/0x60
>>>>> [<ffffffff8155bc27>] put_device+0x17/0x20
>>>>> [<ffffffff8139c08a>] pci_dev_put+0x1a/0x20
>>>>> [<ffffffff8139cb6b>] pci_get_dev_by_id+0x5b/0x90
>>>>> [<ffffffff8139cca5>] pci_get_subsys+0x35/0x40
>>>>> [<ffffffff8139ccc8>] pci_get_device+0x18/0x20
>>>>> [<ffffffff8139ccfb>] pci_get_domain_bus_and_slot+0x2b/0x60
>>>>> [<ffffffff813b09e7>] pci_iov_remove_virtfn+0x57/0x180
>>>>> [<ffffffff813b0b95>] pci_disable_sriov+0x65/0x140
>>>>> [<ffffffffa00a1af7>] ixgbe_disable_sriov+0xc7/0x1d0 [ixgbe]
>>>>> [<ffffffffa00a1e9d>] ixgbe_pci_sriov_configure+0x3d/0x170 [ixgbe]
>>>>> [<ffffffff8139d28c>] sriov_numvfs_store+0xdc/0x130
>>>>>...
>>>>>RIP  [<ffffffff813b1647>] pci_iov_release+0x57/0x60
>>>>>
>>>>>Use the existing mutex lock to protect each enable/disable operation.
>>>>>
>>>>>CC: Alexander Duyck <alexander.h.duyck@intel.com>
>>>>>Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
>>>>
>>>>Emil, It's going to change semantics of pci_enable_sriov() and
>>pci_disable_sriov().
>>>>They can be invoked when writing to the sysfs entry, or loading PF's
>>>>driver. With the change applied, the lock (pf->sriov->lock) isn't acquired and released
>>>>in the PF's driver loading path.
>>>
>>>The enablement of SRIOV on driver load is done via deprecated module parameter.
>>>Perhaps we can just remove it, although there are probably still people that use it
>>>and may not be happy if we get rid of it.
>>>
>>
>>Yeah, some drivers are still using the interface. So we cannot affect it
>>until it can be droped.
>>
>>>>I think the reasonable way would be adding a flag in "struct sriov", to
>>>>indicate someone is accessing the IOV capability through sysfs file. With this, the
>>>>code returns with "-EBUSY" immediately for contenders. With it, nothing is going
>>>>to be changed in PF's driver loading path.
>>>
>>>Flag is what I initially had in mind, but did not want to add extra locking if we
>>>can make use of the existing.
>>>
>>
>>The problem is sriov->lock wasn't introduced to protect the whole IOV capability.
>>Instead, it protects the allocation of virtual bus (if needed). In your patch,
>>it will be used to protect the whole IOV capability, ensure accessing the
>>IOV capability exclusively. So the usage of this lock is changed.
>>
>>     code extracted from pci.h:
>>
>>     struct pci_sriov {
>>            :
>>            struct mutex lock;      /* lock for VF bus */
>>            :
>>     }
>>
>>The lock was introduced by commit d1b054da8 ("PCI: initialize and release
>>SR-IOV capability"). If I'm correct enough, I don't think this lock is needed when
>>pci_enable_sriov() or pci_disable_sriov() are called in driver because of
>>module
>>parameters. I don't see the usage case calling pci_disable_sriov() while
>>previous pci_enable_sriov() isn't finished yet. Also, it's not needed in EEH
>>subsystem.
>>So I think the lock can be dropped, then it can be used to protect sysfs path.
>
>That's pretty much what this patch does, except I kept the locking for EEH since 
>it is the only driver that calls pci_iov_add/remove_virtfn() directly.
>
>I'll write it up and run some tests, although I have no way to test EEH.
> 

Yes, Your patch is close to what I suggested. I'm pretty sure EEH needn't
this lock. I'll fix it up if EEH is broken because of it.

>>>>Also, there are some minor comments as below and I guess most of them won't
>>>>be applied if you take my suggestion eventually. However, I'm trying to make
>>>>the comments complete.
>>>
>>>Thanks a lot for reviewing!
>>>
>>>>
>>>>>---
>>>>> drivers/pci/pci-sysfs.c |   24 +++++++++++++++++-------
>>>>> 1 file changed, 17 insertions(+), 7 deletions(-)
>>>>>
>>>>>diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
>>>>>index 0666287..5b54cf5 100644
>>>>>--- a/drivers/pci/pci-sysfs.c
>>>>>+++ b/drivers/pci/pci-sysfs.c
>>>>>@@ -472,7 +472,9 @@ static ssize_t sriov_numvfs_store(struct device
>>*dev,
>>>>> 				  const char *buf, size_t count)
>>>>> {
>>>>> 	struct pci_dev *pdev = to_pci_dev(dev);
>>>>>+	struct pci_sriov *iov = pdev->sriov;
>>>>> 	int ret;
>>>>>+
>>>>
>>>>Unnecessary change.
>>>>
>>>>> 	u16 num_vfs;
>>>>>
>>>>> 	ret = kstrtou16(buf, 0, &num_vfs);
>>>>>@@ -482,38 +484,46 @@ static ssize_t sriov_numvfs_store(struct device
>>>>*dev,
>>>>> 	if (num_vfs > pci_sriov_get_totalvfs(pdev))
>>>>> 		return -ERANGE;
>>>>>
>>>>>+	mutex_lock(&iov->dev->sriov->lock);
>>>>>+
>>>>> 	if (num_vfs == pdev->sriov->num_VFs)
>>>>>-		return count;		/* no change */
>>>>>+		goto exit;
>>>>>
>>>>> 	/* is PF driver loaded w/callback */
>>>>> 	if (!pdev->driver || !pdev->driver->sriov_configure) {
>>>>> 		dev_info(&pdev->dev, "Driver doesn't support SRIOV
>>>>configuration via sysfs\n");
>>>>>-		return -ENOSYS;
>>>>>+		ret = -EINVAL;
>>>>>+		goto exit;
>>>>
>>>>Why we need change the error code here?
>>>
>>>checkpatch was complaining about the use of the ENOSYS error code being specific
>>>and even though it was not my patch introducing it I had to change it to shut it up.
>>>
>>
>>Right, it's reserved for attempt to call nonexisting syscall, but I think
>>ENOENT might be more indicative than EINVAL in this specific case?
>
>ENOENT is for a missing file, but if we got this far in the code then there must've been 
>a sysfs file. This is pretty straightforward "not supported" error, which is why I picked
>EINVAL.
>

EINVAL means "invalid arguments" in my mind. The understanding matches with what's
said in https://linux.die.net/man/3/errno  I also treated ENOENT as missed callbacks.
However, it's not a big deal. Just pick the errno you like. What I concerned is user
might has some script to map the errno and exact failure. Precise and unique errno
from error path isn't bad.

Thanks,
Gavin

^ permalink raw reply

* RE: [PATCH] net: stmmac: fix maxmtu assignment to be within valid range
From: Kweh, Hock Leong @ 2017-01-06  1:59 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: David S. Miller, Joao Pinto, Giuseppe CAVALLARO,
	seraphin.bonnaffe@st.com, Jarod Wilson, Alexandre TORGUE,
	Joachim Eastwood, Niklas Cassel, Johan Hovold, Pavel Machek,
	lars.persson@axis.com, netdev, LKML
In-Reply-To: <CAHp75Ve6c4W-eqWdqabX5Q5cX4SpqmxByb6=TKnrr1yodkoPkw@mail.gmail.com>

> -----Original Message-----
> From: Andy Shevchenko [mailto:andy.shevchenko@gmail.com]
> Sent: Friday, January 06, 2017 5:07 AM
> To: Kweh, Hock Leong <hock.leong.kweh@intel.com>
> Cc: David S. Miller <davem@davemloft.net>; Joao Pinto
> <Joao.Pinto@synopsys.com>; Giuseppe CAVALLARO <peppe.cavallaro@st.com>;
> seraphin.bonnaffe@st.com; Jarod Wilson <jarod@redhat.com>; Alexandre
> TORGUE <alexandre.torgue@gmail.com>; Joachim Eastwood
> <manabian@gmail.com>; Niklas Cassel <niklas.cassel@axis.com>; Johan Hovold
> <johan@kernel.org>; Pavel Machek <pavel@ucw.cz>; lars.persson@axis.com;
> netdev <netdev@vger.kernel.org>; LKML <linux-kernel@vger.kernel.org>
> Subject: Re: [PATCH] net: stmmac: fix maxmtu assignment to be within valid
> range
> 
> On Thu, Jan 5, 2017 at 12:47 PM, Kweh, Hock Leong
> <hock.leong.kweh@intel.com> wrote:
> > From: "Kweh, Hock Leong" <hock.leong.kweh@intel.com>
> >
> > There is no checking valid value of maxmtu when getting it from devicetree.
> 
> 'Device Tree' or 'device tree' ?

Noted & Thanks. Submitting V2.

> 
> > This resolution added the checking condition to ensure the assignment
> > is made within a valid range.
> 
> > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > index 39eb7a6..683d59f 100644
> > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > @@ -3319,7 +3319,8 @@ int stmmac_dvr_probe(struct device *device,
> >                 ndev->max_mtu = JUMBO_LEN;
> >         else
> >                 ndev->max_mtu = SKB_MAX_HEAD(NET_SKB_PAD + NET_IP_ALIGN);
> > -       if (priv->plat->maxmtu < ndev->max_mtu)
> > +       if ((priv->plat->maxmtu < ndev->max_mtu) &&
> > +           (priv->plat->maxmtu >= ndev->min_mtu))
> >                 ndev->max_mtu = priv->plat->maxmtu;
> 
> Perhaps add a warning message on else branch?

Noted & Thanks. Submitting V2.

> 
> --
> With Best Regards,
> Andy Shevchenko

^ permalink raw reply

* [PATCH V4 net-next 0/3] vhost_net tx batching
From: Jason Wang @ 2017-01-06  2:13 UTC (permalink / raw)
  To: mst, virtualization, netdev, kvm; +Cc: wexu, stefanha

Hi:

This series tries to implement tx batching support for vhost. This was
done by using MSG_MORE as a hint for under layer socket. The backend
(e.g tap) can then batch the packets temporarily in a list and
submit it all once the number of bacthed exceeds a limitation.

Tests shows obvious improvement on guest pktgen over over
mlx4(noqueue) on host:

                                     Mpps  -+%
        rx-frames = 0                0.91  +0%
        rx-frames = 4                1.00  +9.8%
        rx-frames = 8                1.00  +9.8%
        rx-frames = 16               1.01  +10.9%
        rx-frames = 32               1.07  +17.5%
        rx-frames = 48               1.07  +17.5%
        rx-frames = 64               1.08  +18.6%
        rx-frames = 64 (no MSG_MORE) 0.91  +0%

Changes from V3:
- use ethtool instead of module parameter to control the maximum
  number of batched packets
- avoid overhead when MSG_MORE were not set and no packet queued

Changes from V2:
- remove uselss queue limitation check (and we don't drop any packet now)

Changes from V1:
- drop NAPI handler since we don't use NAPI now
- fix the issues that may exceeds max pending of zerocopy
- more improvement on available buffer detection
- move the limitation of batched pacekts from vhost to tuntap

Please review.

Thanks

Jason Wang (3):
  vhost: better detection of available buffers
  vhost_net: tx batching
  tun: rx batching

 drivers/net/tun.c     | 76 +++++++++++++++++++++++++++++++++++++++++++++++----
 drivers/vhost/net.c   | 23 ++++++++++++++--
 drivers/vhost/vhost.c |  8 ++++--
 3 files changed, 96 insertions(+), 11 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [PATCH V4 net-next 1/3] vhost: better detection of available buffers
From: Jason Wang @ 2017-01-06  2:13 UTC (permalink / raw)
  To: mst, virtualization, netdev, kvm; +Cc: wexu, stefanha
In-Reply-To: <1483668797-24112-1-git-send-email-jasowang@redhat.com>

This patch tries to do several tweaks on vhost_vq_avail_empty() for a
better performance:

- check cached avail index first which could avoid userspace memory access.
- using unlikely() for the failure of userspace access
- check vq->last_avail_idx instead of cached avail index as the last
  step.

This patch is need for batching supports which needs to peek whether
or not there's still available buffers in the ring.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index d643260..9f11838 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2241,11 +2241,15 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	__virtio16 avail_idx;
 	int r;
 
+	if (vq->avail_idx != vq->last_avail_idx)
+		return false;
+
 	r = vhost_get_user(vq, avail_idx, &vq->avail->idx);
-	if (r)
+	if (unlikely(r))
 		return false;
+	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
 
-	return vhost16_to_cpu(vq, avail_idx) == vq->avail_idx;
+	return vq->avail_idx == vq->last_avail_idx;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH V4 net-next 2/3] vhost_net: tx batching
From: Jason Wang @ 2017-01-06  2:13 UTC (permalink / raw)
  To: mst, virtualization, netdev, kvm; +Cc: wexu, stefanha
In-Reply-To: <1483668797-24112-1-git-send-email-jasowang@redhat.com>

This patch tries to utilize tuntap rx batching by peeking the tx
virtqueue during transmission, if there's more available buffers in
the virtqueue, set MSG_MORE flag for a hint for backend (e.g tuntap)
to batch the packets.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 5dc3465..c42e9c3 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -351,6 +351,15 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 	return r;
 }
 
+static bool vhost_exceeds_maxpend(struct vhost_net *net)
+{
+	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
+	struct vhost_virtqueue *vq = &nvq->vq;
+
+	return (nvq->upend_idx + vq->num - VHOST_MAX_PEND) % UIO_MAXIOV
+		== nvq->done_idx;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -394,8 +403,7 @@ static void handle_tx(struct vhost_net *net)
 		/* If more outstanding DMAs, queue the work.
 		 * Handle upend_idx wrap around
 		 */
-		if (unlikely((nvq->upend_idx + vq->num - VHOST_MAX_PEND)
-			      % UIO_MAXIOV == nvq->done_idx))
+		if (unlikely(vhost_exceeds_maxpend(net)))
 			break;
 
 		head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
@@ -454,6 +462,16 @@ static void handle_tx(struct vhost_net *net)
 			msg.msg_control = NULL;
 			ubufs = NULL;
 		}
+
+		total_len += len;
+		if (total_len < VHOST_NET_WEIGHT &&
+		    !vhost_vq_avail_empty(&net->dev, vq) &&
+		    likely(!vhost_exceeds_maxpend(net))) {
+			msg.msg_flags |= MSG_MORE;
+		} else {
+			msg.msg_flags &= ~MSG_MORE;
+		}
+
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
 		err = sock->ops->sendmsg(sock, &msg, len);
 		if (unlikely(err < 0)) {
@@ -472,7 +490,6 @@ static void handle_tx(struct vhost_net *net)
 			vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		else
 			vhost_zerocopy_signal_used(net, vq);
-		total_len += len;
 		vhost_net_tx_packet(net);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
-- 
2.7.4

^ permalink raw reply related

* [PATCH V4 net-next 3/3] tun: rx batching
From: Jason Wang @ 2017-01-06  2:13 UTC (permalink / raw)
  To: mst, virtualization, netdev, kvm; +Cc: wexu, stefanha
In-Reply-To: <1483668797-24112-1-git-send-email-jasowang@redhat.com>

We can only process 1 packet at one time during sendmsg(). This often
lead bad cache utilization under heavy load. So this patch tries to do
some batching during rx before submitting them to host network
stack. This is done through accepting MSG_MORE as a hint from
sendmsg() caller, if it was set, batch the packet temporarily in a
linked list and submit them all once MSG_MORE were cleared.

Tests were done by pktgen (burst=128) in guest over mlx4(noqueue) on host:

                                 Mpps  -+%
    rx-frames = 0                0.91  +0%
    rx-frames = 4                1.00  +9.8%
    rx-frames = 8                1.00  +9.8%
    rx-frames = 16               1.01  +10.9%
    rx-frames = 32               1.07  +17.5%
    rx-frames = 48               1.07  +17.5%
    rx-frames = 64               1.08  +18.6%
    rx-frames = 64 (no MSG_MORE) 0.91  +0%

User were allowed to change per device batched packets through
ethtool -C rx-frames. NAPI_POLL_WEIGHT were used as upper limitation
to prevent bh from being disabled too long.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 70 insertions(+), 6 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index cd8e02c..6c93926 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -218,6 +218,7 @@ struct tun_struct {
 	struct list_head disabled;
 	void *security;
 	u32 flow_count;
+	u32 rx_batched;
 	struct tun_pcpu_stats __percpu *pcpu_stats;
 };
 
@@ -522,6 +523,7 @@ static void tun_queue_purge(struct tun_file *tfile)
 	while ((skb = skb_array_consume(&tfile->tx_array)) != NULL)
 		kfree_skb(skb);
 
+	skb_queue_purge(&tfile->sk.sk_write_queue);
 	skb_queue_purge(&tfile->sk.sk_error_queue);
 }
 
@@ -1140,10 +1142,45 @@ static struct sk_buff *tun_alloc_skb(struct tun_file *tfile,
 	return skb;
 }
 
+static void tun_rx_batched(struct tun_struct *tun, struct tun_file *tfile,
+			   struct sk_buff *skb, int more)
+{
+	struct sk_buff_head *queue = &tfile->sk.sk_write_queue;
+	struct sk_buff_head process_queue;
+	u32 rx_batched = tun->rx_batched;
+	bool rcv = false;
+
+	if (!rx_batched || (!more && skb_queue_empty(queue))) {
+		local_bh_disable();
+		netif_receive_skb(skb);
+		local_bh_enable();
+		return;
+	}
+
+	spin_lock(&queue->lock);
+	if (!more || skb_queue_len(queue) == rx_batched) {
+		__skb_queue_head_init(&process_queue);
+		skb_queue_splice_tail_init(queue, &process_queue);
+		rcv = true;
+	} else {
+		__skb_queue_tail(queue, skb);
+	}
+	spin_unlock(&queue->lock);
+
+	if (rcv) {
+		struct sk_buff *nskb;
+		local_bh_disable();
+		while ((nskb = __skb_dequeue(&process_queue)))
+			netif_receive_skb(nskb);
+		netif_receive_skb(skb);
+		local_bh_enable();
+	}
+}
+
 /* Get packet from user space buffer */
 static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 			    void *msg_control, struct iov_iter *from,
-			    int noblock)
+			    int noblock, bool more)
 {
 	struct tun_pi pi = { 0, cpu_to_be16(ETH_P_IP) };
 	struct sk_buff *skb;
@@ -1283,10 +1320,9 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 	skb_probe_transport_header(skb, 0);
 
 	rxhash = skb_get_hash(skb);
+
 #ifndef CONFIG_4KSTACKS
-	local_bh_disable();
-	netif_receive_skb(skb);
-	local_bh_enable();
+	tun_rx_batched(tun, tfile, skb, more);
 #else
 	netif_rx_ni(skb);
 #endif
@@ -1312,7 +1348,8 @@ static ssize_t tun_chr_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (!tun)
 		return -EBADFD;
 
-	result = tun_get_user(tun, tfile, NULL, from, file->f_flags & O_NONBLOCK);
+	result = tun_get_user(tun, tfile, NULL, from,
+			      file->f_flags & O_NONBLOCK, false);
 
 	tun_put(tun);
 	return result;
@@ -1570,7 +1607,8 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 		return -EBADFD;
 
 	ret = tun_get_user(tun, tfile, m->msg_control, &m->msg_iter,
-			   m->msg_flags & MSG_DONTWAIT);
+			   m->msg_flags & MSG_DONTWAIT,
+			   m->msg_flags & MSG_MORE);
 	tun_put(tun);
 	return ret;
 }
@@ -1771,6 +1809,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 		tun->align = NET_SKB_PAD;
 		tun->filter_attached = false;
 		tun->sndbuf = tfile->socket.sk->sk_sndbuf;
+		tun->rx_batched = 0;
 
 		tun->pcpu_stats = netdev_alloc_pcpu_stats(struct tun_pcpu_stats);
 		if (!tun->pcpu_stats) {
@@ -2439,6 +2478,29 @@ static void tun_set_msglevel(struct net_device *dev, u32 value)
 #endif
 }
 
+static int tun_get_coalesce(struct net_device *dev,
+			    struct ethtool_coalesce *ec)
+{
+	struct tun_struct *tun = netdev_priv(dev);
+
+	ec->rx_max_coalesced_frames = tun->rx_batched;
+
+	return 0;
+}
+
+static int tun_set_coalesce(struct net_device *dev,
+			    struct ethtool_coalesce *ec)
+{
+	struct tun_struct *tun = netdev_priv(dev);
+
+	if (ec->rx_max_coalesced_frames > NAPI_POLL_WEIGHT)
+		return -EINVAL;
+
+	tun->rx_batched = ec->rx_max_coalesced_frames;
+
+	return 0;
+}
+
 static const struct ethtool_ops tun_ethtool_ops = {
 	.get_settings	= tun_get_settings,
 	.get_drvinfo	= tun_get_drvinfo,
@@ -2446,6 +2508,8 @@ static const struct ethtool_ops tun_ethtool_ops = {
 	.set_msglevel	= tun_set_msglevel,
 	.get_link	= ethtool_op_get_link,
 	.get_ts_info	= ethtool_op_get_ts_info,
+	.get_coalesce   = tun_get_coalesce,
+	.set_coalesce   = tun_set_coalesce,
 };
 
 static int tun_queue_resize(struct tun_struct *tun)
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH v4 0/4] vsock: cancel connect packets when failing to connect
From: Peng Tao @ 2017-01-06  2:22 UTC (permalink / raw)
  To: Stefan Hajnoczi, David Miller
  Cc: netdev@vger.kernel.org, kvm, Jorgen Hansen, Stefan Hajnoczi,
	virtualization
In-Reply-To: <20161213095010.GA3103@stefanha-x1.localdomain>

On Tue, Dec 13, 2016 at 5:50 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Mon, Dec 12, 2016 at 08:21:05PM +0800, Peng Tao wrote:
>> Currently, if a connect call fails on a signal or timeout (e.g., guest is still
>> in the process of starting up), we'll just return to caller and leave the connect
>> packet queued and they are sent even though the connection is considered a failure,
>> which can confuse applications with unwanted false connect attempt.
>>
>> The patchset enables vsock (both host and guest) to cancel queued packets when
>> a connect attempt is considered to fail.
>>
>> v4 changelog:
>>   - drop two unnecessary void * cast
>>   - update new callback commnet
>> v3 changelog:
>>   - define cancel_pkt callback in struct vsock_transport rather than struct virtio_transport
>>   - rename virtio_vsock_pkt->vsk to virtio_vsock_pkt->cancel_token
>> v2 changelog:
>>   - fix queued_replies counting and resume tx/rx when necessary
>>
>> Cheers,
>> Tao
>>
>> Peng Tao (4):
>>   vsock: track pkt owner vsock
>>   vhost-vsock: add pkt cancel capability
>>   vsock: add pkt cancel capability
>>   vsock: cancel packets when failing to connect
>>
>>  drivers/vhost/vsock.c                   | 41 ++++++++++++++++++++++++++++++++
>>  include/linux/virtio_vsock.h            |  2 ++
>>  include/net/af_vsock.h                  |  3 +++
>>  net/vmw_vsock/af_vsock.c                | 14 +++++++++++
>>  net/vmw_vsock/virtio_transport.c        | 42 +++++++++++++++++++++++++++++++++
>>  net/vmw_vsock/virtio_transport_common.c |  7 ++++++
>>  6 files changed, 109 insertions(+)
>>
>> --
>> 2.7.4
>>
>> _______________________________________________
>> Virtualization mailing list
>> Virtualization@lists.linux-foundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/virtualization
>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
ping~

It looks like the patchsets are reviewed but not merged. Is there any blocker?

Cheers,
Tao

^ permalink raw reply

* RE: [PATCH] net:phy fix driver reference count error when attach and detach phy device
From: maowenan @ 2017-01-06  2:29 UTC (permalink / raw)
  To: Florian Fainelli, David Laight, netdev@vger.kernel.org,
	Dingtianhong, weiyongjun (A)
In-Reply-To: <c4fce3bf-4105-c7f2-fc67-be4316a2004a@gmail.com>



> -----Original Message-----
> From: Florian Fainelli [mailto:f.fainelli@gmail.com]
> Sent: Tuesday, December 13, 2016 12:33 AM
> To: maowenan; David Laight; netdev@vger.kernel.org; Dingtianhong;
> weiyongjun (A)
> Subject: Re: [PATCH] net:phy fix driver reference count error when attach and
> detach phy device
> 
> On 12/12/2016 12:49 AM, maowenan wrote:
> >
> >
> > On 2016/12/5 16:47, maowenan wrote:
> >>
> >>
> >> On 2016/12/2 17:45, David Laight wrote:
> >>> From: Mao Wenan
> >>>> Sent: 30 November 2016 10:23
> >>>> The nic in my board use the phy dev from marvell, and the system
> >>>> will load the marvell phy driver automatically, but when I remove
> >>>> the phy drivers, the system immediately panic:
> >>>> Call trace:
> >>>> [ 2582.834493] [<ffff800000715384>] phy_state_machine+0x3c/0x438 [
> >>>> 2582.851754] [<ffff8000000db3b8>] process_one_work+0x150/0x428 [
> >>>> 2582.868188] [<ffff8000000db7d4>] worker_thread+0x144/0x4b0 [
> >>>> 2582.883882] [<ffff8000000e1d0c>] kthread+0xfc/0x110
> >>>>
> >>>> there should be proper reference counting in place to avoid that.
> >>>> I found that phy_attach_direct() forgets to add phy device driver
> >>>> reference count, and phy_detach() forgets to subtract reference count.
> >>>> This patch is to fix this bug, after that panic is disappeared when
> >>>> remove marvell.ko
> >>>>
> >>>> Signed-off-by: Mao Wenan <maowenan@huawei.com>
> >>>> ---
> >>>>  drivers/net/phy/phy_device.c | 7 +++++++
> >>>>  1 file changed, 7 insertions(+)
> >>>>
> >>>> diff --git a/drivers/net/phy/phy_device.c
> >>>> b/drivers/net/phy/phy_device.c index 1a4bf8a..a7ec7c2 100644
> >>>> --- a/drivers/net/phy/phy_device.c
> >>>> +++ b/drivers/net/phy/phy_device.c
> >>>> @@ -866,6 +866,11 @@ int phy_attach_direct(struct net_device *dev,
> struct phy_device *phydev,
> >>>>  		return -EIO;
> >>>>  	}
> >>>>
> >>>> +	if (!try_module_get(d->driver->owner)) {
> >>>> +		dev_err(&dev->dev, "failed to get the device driver module\n");
> >>>> +		return -EIO;
> >>>> +	}
> >>>
> >>> If this is the phy code, what stops the phy driver being unloaded
> >>> before the try_module_get() obtains a reference.
> >>> If it isn't the phy driver then there ought to be a reference count
> >>> obtained when the phy driver is located (by whatever decides which phy
> driver to use).
> >>> Even if that code later releases its reference (it probably
> >>> shouldn't on success) then you can't fail to get an extra reference here.
> >>
> >> [Mao Wenan]Yes, this is phy code, in function phy_attach_direct(),
> drivers/net/phy/phy_device.c.
> >> when one NIC driver to do probe behavior, it will attach one matched
> >> phy driver. phy_attach_direct() is to obtain phy driver reference and
> >> bind phy driver, if try_module_get() execute on success, the
> >> reference count is added; if failed, the driver can't be attached to this NIC,
> and it can't added the phy driver reference count. So before try_module_get
> obtains a reference, phy driver can't can't be bound to this NIC.
> >> when the phy driver is attached to NIC, the reference count is added,
> >> if someone remove phy driver directly, it will be failed because reference
> count is not equal to 0.
> >>
> >> An example of call trace when there is NIC driver to attch one phy driver:
> >> hns_nic_dev_probe->hns_nic_try_get_ae->hns_nic_init_phy->of_phy_conne
> >> ct->phy_connect_direct->phy_attach_direct
> >>
> >> Consider the steps of phy driver(marvell.ko) added and removed, and NIC
> driver(hns_enet_drv.ko) added and removed:
> >> 1)insmod marvell       ref=0
> >> 2)insmod hns_enet_drv  ref=1
> >> 3)rmmod marvell        (should not on success, ref=1)
> >> 4)rmmod hns_enet_drv   ref=0
> >> 5)rmmod marvell        (should on success, because ref=0)
> >>
> >> if we don't add the reference count in phy_attach_direct(the second
> >> step ref=0), so the third step rmmod marvell will be panic, because there is
> one user remain use marvell driver and phy_stat_machine use the NULL drv
> pointer.
> >>
> >>>
> >>>> +
> >>>>  	get_device(d);
> >>>>
> >>>>  	/* Assume that if there is no driver, that it doesn't @@ -921,6
> >>>> +926,7 @@ int phy_attach_direct(struct net_device *dev, struct
> >>>> phy_device *phydev,
> >>>>
> >>>>  error:
> >>>>  	put_device(d);
> >>>> +	module_put(d->driver->owner);
> >>>
> >>> Are those two in the wrong order ?
> >>>
> >>>>  	module_put(bus->owner);
> >>>>  	return err;
> >>>>  }
> >>>> @@ -998,6 +1004,7 @@ void phy_detach(struct phy_device *phydev)
> >>>>  	bus = phydev->mdio.bus;
> >>>>
> >>>>  	put_device(&phydev->mdio.dev);
> >>>> +	module_put(phydev->mdio.dev.driver->owner);
> >>>>  	module_put(bus->owner);
> >>>
> >>> Where is this code called from?
> >>> You can't call it from the phy driver because the driver can be
> >>> unloaded as soon as the last reference is removed.
> >>> At that point the code memory is freed.
> >>
> >> [Mao Wenan] it is called by NIC when it is removed, which aims to
> >> disconnect one bound phy driver. If this phy driver is not used for this NIC,
> reference count should be subtracted, and phy driver can be removed if there is
> no user.
> >> hns_nic_dev_remove->phy_disconnect->phy_detach
> >>
> >>
> >>
> >>>
> >>>>  }
> >>>>  EXPORT_SYMBOL(phy_detach);
> >>>> --
> >>>> 2.7.0
> >>>>
> >>>
> >>>
> >>> .
> >>>
> >
> > @Florian Fainelli, what's your comments about this patch?
> 
> I am trying to reproduce what you are seeing, but at first glance is looks like an
> appropriate solution to me. Do you mind giving me a couple more days?
> 
> Thanks!
> --
> Florian

Hi Florian, 
  Do you have any update about this patch?
  Thank you!
  









^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox