* RE: [RFC PATCH] i40e: enable PCIe relax ordering for SPARC
From: maowenan @ 2016-12-26 11:39 UTC (permalink / raw)
To: Tushar Dave, jeffrey.t.kirsher@intel.com,
intel-wired-lan@lists.osuosl.org
Cc: netdev@vger.kernel.org, weiyongjun (A), Dingtianhong
In-Reply-To: <1480957627-16825-1-git-send-email-tushar.n.dave@oracle.com>
> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> On Behalf Of Tushar Dave
> Sent: Tuesday, December 06, 2016 1:07 AM
> To: jeffrey.t.kirsher@intel.com; intel-wired-lan@lists.osuosl.org
> Cc: netdev@vger.kernel.org
> Subject: [RFC PATCH] i40e: enable PCIe relax ordering for SPARC
>
> Unlike previous generation NIC (e.g. ixgbe) i40e doesn't seem to have standard
> CSR where PCIe relaxed ordering can be set. Without PCIe relax ordering
> enabled, i40e performance is significantly low on SPARC.
>
[Mao Wenan]Hi Tushar, you have referred to i40e doesn't seem to have standard CSR
to set PCIe relaxed ordering, this CSR like TX&Rx DCA Control Register in 82599, right?
Is DMA_ATTR_WEAK_ORDERING the same as TX&RX control register in 82599?
And to enable relax ordering mode in 82599 for SPARC using below codes:
s32 ixgbe_start_hw_gen2(struct ixgbe_hw *hw)
{
u32 i;
/* Clear the rate limiters */
for (i = 0; i < hw->mac.max_tx_queues; i++) {
IXGBE_WRITE_REG(hw, IXGBE_RTTDQSEL, i);
IXGBE_WRITE_REG(hw, IXGBE_RTTBCNRC, 0);
}
IXGBE_WRITE_FLUSH(hw);
#ifndef CONFIG_SPARC
/* Disable relaxed ordering */
for (i = 0; i < hw->mac.max_tx_queues; i++) {
u32 regval;
regval = IXGBE_READ_REG(hw, IXGBE_DCA_TXCTRL_82599(i));
regval &= ~IXGBE_DCA_TXCTRL_DESC_WRO_EN;
IXGBE_WRITE_REG(hw, IXGBE_DCA_TXCTRL_82599(i), regval);
}
for (i = 0; i < hw->mac.max_rx_queues; i++) {
u32 regval;
regval = IXGBE_READ_REG(hw, IXGBE_DCA_RXCTRL(i));
regval &= ~(IXGBE_DCA_RXCTRL_DATA_WRO_EN |
IXGBE_DCA_RXCTRL_HEAD_WRO_EN);
IXGBE_WRITE_REG(hw, IXGBE_DCA_RXCTRL(i), regval);
}
#endif
return 0;
}
> This patch sets PCIe relax ordering for SPARC arch by setting dma attr
> DMA_ATTR_WEAK_ORDERING for every tx and rx DMA map/unmap.
> This has shown 10x increase in performance numbers.
>
> e.g.
> iperf TCP test with 10 threads on SPARC S7
>
> Test 1: Without this patch
>
> [root@brm-snt1-03 net]# iperf -s
> ------------------------------------------------------------
> Server listening on TCP port 5001
> TCP window size: 85.3 KByte (default)
> ------------------------------------------------------------
> [ 4] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40926 [ 5] local
> 16.0.0.7 port 5001 connected with 16.0.0.1 port 40934 [ 6] local 16.0.0.7 port
> 5001 connected with 16.0.0.1 port 40930 [ 7] local 16.0.0.7 port 5001
> connected with 16.0.0.1 port 40928 [ 8] local 16.0.0.7 port 5001 connected
> with 16.0.0.1 port 40922 [ 9] local 16.0.0.7 port 5001 connected with 16.0.0.1
> port 40932 [ 10] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40920
> [ 11] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40924 [ 14] local
> 16.0.0.7 port 5001 connected with 16.0.0.1 port 40982 [ 12] local 16.0.0.7 port
> 5001 connected with 16.0.0.1 port 40980
> [ ID] Interval Transfer Bandwidth
> [ 4] 0.0-20.0 sec 566 MBytes 237 Mbits/sec
> [ 5] 0.0-20.0 sec 532 MBytes 223 Mbits/sec
> [ 6] 0.0-20.0 sec 537 MBytes 225 Mbits/sec
> [ 8] 0.0-20.0 sec 546 MBytes 229 Mbits/sec
> [ 11] 0.0-20.0 sec 592 MBytes 248 Mbits/sec
> [ 7] 0.0-20.0 sec 539 MBytes 226 Mbits/sec
> [ 9] 0.0-20.0 sec 572 MBytes 240 Mbits/sec
> [ 10] 0.0-20.0 sec 604 MBytes 253 Mbits/sec
> [ 14] 0.0-20.0 sec 567 MBytes 238 Mbits/sec
> [ 12] 0.0-20.0 sec 511 MBytes 214 Mbits/sec
> [SUM] 0.0-20.0 sec 5.44 GBytes 2.33 Gbits/sec
>
> Test 2: with this patch:
>
> [root@brm-snt1-03 net]# iperf -s
> ------------------------------------------------------------
> Server listening on TCP port 5001
> TCP window size: 85.3 KByte (default)
> ------------------------------------------------------------
> TCP: request_sock_TCP: Possible SYN flooding on port 5001. Sending cookies.
> Check SNMP counters.
> [ 4] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46876 [ 5] local
> 16.0.0.7 port 5001 connected with 16.0.0.1 port 46874 [ 6] local 16.0.0.7 port
> 5001 connected with 16.0.0.1 port 46872 [ 7] local 16.0.0.7 port 5001
> connected with 16.0.0.1 port 46880 [ 8] local 16.0.0.7 port 5001 connected
> with 16.0.0.1 port 46878 [ 9] local 16.0.0.7 port 5001 connected with 16.0.0.1
> port 46884 [ 10] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46886
> [ 11] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46890 [ 12] local
> 16.0.0.7 port 5001 connected with 16.0.0.1 port 46888 [ 13] local 16.0.0.7 port
> 5001 connected with 16.0.0.1 port 46882
> [ ID] Interval Transfer Bandwidth
> [ 4] 0.0-20.0 sec 7.45 GBytes 3.19 Gbits/sec [ 5] 0.0-20.0 sec 7.48
> GBytes 3.21 Gbits/sec [ 7] 0.0-20.0 sec 7.34 GBytes 3.15 Gbits/sec
> [ 8] 0.0-20.0 sec 7.42 GBytes 3.18 Gbits/sec [ 9] 0.0-20.0 sec 7.24
> GBytes 3.11 Gbits/sec [ 10] 0.0-20.0 sec 7.40 GBytes 3.17 Gbits/sec
> [ 12] 0.0-20.0 sec 7.49 GBytes 3.21 Gbits/sec [ 6] 0.0-20.0 sec 7.30
> GBytes 3.13 Gbits/sec [ 11] 0.0-20.0 sec 7.44 GBytes 3.19 Gbits/sec
> [ 13] 0.0-20.0 sec 7.22 GBytes 3.10 Gbits/sec [SUM] 0.0-20.0 sec 73.8
> GBytes 31.6 Gbits/sec
>
> NOTE: In my testing, this patch does _not_ show any harm to i40e performance
> numbers on x86.
>
> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
> ---
> drivers/net/ethernet/intel/i40e/i40e_txrx.c | 69
> ++++++++++++++++++++--------- drivers/net/ethernet/intel/i40e/i40e_txrx.h |
> 1 +
> 2 files changed, 49 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index 6287bf6..800dca7 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -551,15 +551,17 @@ static void
> i40e_unmap_and_free_tx_resource(struct i40e_ring *ring,
> else
> dev_kfree_skb_any(tx_buffer->skb);
> if (dma_unmap_len(tx_buffer, len))
> - dma_unmap_single(ring->dev,
> - dma_unmap_addr(tx_buffer, dma),
> - dma_unmap_len(tx_buffer, len),
> - DMA_TO_DEVICE);
> + dma_unmap_single_attrs(ring->dev,
> + dma_unmap_addr(tx_buffer, dma),
> + dma_unmap_len(tx_buffer, len),
> + DMA_TO_DEVICE,
> + ring->dma_attrs);
> } else if (dma_unmap_len(tx_buffer, len)) {
> - dma_unmap_page(ring->dev,
> - dma_unmap_addr(tx_buffer, dma),
> - dma_unmap_len(tx_buffer, len),
> - DMA_TO_DEVICE);
> + dma_unmap_single_attrs(ring->dev,
> + dma_unmap_addr(tx_buffer, dma),
> + dma_unmap_len(tx_buffer, len),
> + DMA_TO_DEVICE,
> + ring->dma_attrs);
> }
>
> tx_buffer->next_to_watch = NULL;
> @@ -662,6 +664,8 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
> struct i40e_tx_buffer *tx_buf;
> struct i40e_tx_desc *tx_head;
> struct i40e_tx_desc *tx_desc;
> + dma_addr_t addr;
> + size_t size;
> unsigned int total_bytes = 0, total_packets = 0;
> unsigned int budget = vsi->work_limit;
>
> @@ -696,10 +700,11 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
> napi_consume_skb(tx_buf->skb, napi_budget);
>
> /* unmap skb header data */
> - dma_unmap_single(tx_ring->dev,
> - dma_unmap_addr(tx_buf, dma),
> - dma_unmap_len(tx_buf, len),
> - DMA_TO_DEVICE);
> + dma_unmap_single_attrs(tx_ring->dev,
> + dma_unmap_addr(tx_buf, dma),
> + dma_unmap_len(tx_buf, len),
> + DMA_TO_DEVICE,
> + tx_ring->dma_attrs);
>
> /* clear tx_buffer data */
> tx_buf->skb = NULL;
> @@ -717,12 +722,15 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
> tx_desc = I40E_TX_DESC(tx_ring, 0);
> }
>
> + addr = dma_unmap_addr(tx_buf, dma);
> + size = dma_unmap_len(tx_buf, len);
> /* unmap any remaining paged data */
> if (dma_unmap_len(tx_buf, len)) {
> - dma_unmap_page(tx_ring->dev,
> - dma_unmap_addr(tx_buf, dma),
> - dma_unmap_len(tx_buf, len),
> - DMA_TO_DEVICE);
> + dma_unmap_single_attrs(tx_ring->dev,
> + addr,
> + size,
> + DMA_TO_DEVICE,
> + tx_ring->dma_attrs);
> dma_unmap_len_set(tx_buf, len, 0);
> }
> }
> @@ -1010,6 +1018,11 @@ int i40e_setup_tx_descriptors(struct i40e_ring
> *tx_ring)
> */
> tx_ring->size += sizeof(u32);
> tx_ring->size = ALIGN(tx_ring->size, 4096);
> +#ifdef CONFIG_SPARC
> + tx_ring->dma_attrs = DMA_ATTR_WEAK_ORDERING; #else
> + tx_ring->dma_attrs = 0;
> +#endif
> tx_ring->desc = dma_alloc_coherent(dev, tx_ring->size,
> &tx_ring->dma, GFP_KERNEL);
> if (!tx_ring->desc) {
> @@ -1053,7 +1066,11 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
> if (!rx_bi->page)
> continue;
>
> - dma_unmap_page(dev, rx_bi->dma, PAGE_SIZE,
> DMA_FROM_DEVICE);
> + dma_unmap_single_attrs(dev,
> + rx_bi->dma,
> + PAGE_SIZE,
> + DMA_FROM_DEVICE,
> + rx_ring->dma_attrs);
> __free_pages(rx_bi->page, 0);
>
> rx_bi->page = NULL;
> @@ -1113,6 +1130,11 @@ int i40e_setup_rx_descriptors(struct i40e_ring
> *rx_ring)
> /* Round up to nearest 4K */
> rx_ring->size = rx_ring->count * sizeof(union i40e_32byte_rx_desc);
> rx_ring->size = ALIGN(rx_ring->size, 4096);
> +#ifdef CONFIG_SPARC
> + rx_ring->dma_attrs = DMA_ATTR_WEAK_ORDERING; #else
> + rx_ring->dma_attrs = 0;
> +#endif
> rx_ring->desc = dma_alloc_coherent(dev, rx_ring->size,
> &rx_ring->dma, GFP_KERNEL);
>
> @@ -1182,7 +1204,8 @@ static bool i40e_alloc_mapped_page(struct
> i40e_ring *rx_ring,
> }
>
> /* map page for use */
> - dma = dma_map_page(rx_ring->dev, page, 0, PAGE_SIZE,
> DMA_FROM_DEVICE);
> + dma = dma_map_single_attrs(rx_ring->dev, page_address(page),
> PAGE_SIZE,
> + DMA_FROM_DEVICE, rx_ring->dma_attrs);
>
> /* if mapping failed free memory back to system since
> * there isn't much point in holding memory we can't use @@ -1695,8
> +1718,11 @@ struct sk_buff *i40e_fetch_rx_buffer(struct i40e_ring *rx_ring,
> rx_ring->rx_stats.page_reuse_count++;
> } else {
> /* we are not reusing the buffer so unmap it */
> - dma_unmap_page(rx_ring->dev, rx_buffer->dma, PAGE_SIZE,
> - DMA_FROM_DEVICE);
> + dma_unmap_single_attrs(rx_ring->dev,
> + rx_buffer->dma,
> + PAGE_SIZE,
> + DMA_FROM_DEVICE,
> + rx_ring->dma_attrs);
> }
>
> /* clear contents of buffer_info */
> @@ -2737,7 +2763,8 @@ static inline void i40e_tx_map(struct i40e_ring
> *tx_ring, struct sk_buff *skb,
> first->skb = skb;
> first->tx_flags = tx_flags;
>
> - dma = dma_map_single(tx_ring->dev, skb->data, size, DMA_TO_DEVICE);
> + dma = dma_map_single_attrs(tx_ring->dev, skb->data, size,
> + DMA_TO_DEVICE, tx_ring->dma_attrs);
>
> tx_desc = I40E_TX_DESC(tx_ring, i);
> tx_bi = first;
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> index 5088405..9a86212 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> @@ -327,6 +327,7 @@ struct i40e_ring {
>
> unsigned int size; /* length of descriptor ring in bytes */
> dma_addr_t dma; /* physical address of ring */
> + unsigned long dma_attrs; /* DMA attributes */
>
> struct i40e_vsi *vsi; /* Backreference to associated VSI */
> struct i40e_q_vector *q_vector; /* Backreference to associated vector
> */
> --
> 1.9.1
^ permalink raw reply
* RE: [RFC PATCH 0/7] ThunderX Embedded switch support
From: Koteshwar Rao, Satha @ 2016-12-26 14:04 UTC (permalink / raw)
To: Sunil Kovvuri
Cc: Robert Richter, Chickles, Derek, Goutham, Sunil, Daney, David,
Linux Netdev List, LKML, Vatsavayi, Raghu, David S. Miller, LAKML,
Romanov, Philip
In-Reply-To: <CA+sq2Cf3npwzZc2+qQXrNzVBWywn1iwSL_mXjpU1LzC+JFKdUA@mail.gmail.com>
Hi Sunil,
In RFC cover letter we explained the feature details, files organized based on their supporting functionality, let me know if you are interested in any specific details
Thanks,
Satha
-----Original Message-----
From: Sunil Kovvuri [mailto:sunil.kovvuri@gmail.com]
Sent: Wednesday, December 21, 2016 4:03 AM
To: Koteshwar Rao, Satha
Cc: LKML; Goutham, Sunil; Robert Richter; David S. Miller; Daney, David; Vatsavayi, Raghu; Chickles, Derek; Romanov, Philip; Linux Netdev List; LAKML
Subject: Re: [RFC PATCH 0/7] ThunderX Embedded switch support
It would be easier for anyone to review if you prepare patches based on features rather than based on modifications to files.
Thanks,
Sunil.
^ permalink raw reply
* RE: [RFC PATCH 4/7] HW Filter Initialization code and register access APIs
From: Koteshwar Rao, Satha @ 2016-12-26 14:13 UTC (permalink / raw)
To: Sunil Kovvuri
Cc: LKML, Goutham, Sunil, Robert Richter, David S. Miller,
Daney, David, Vatsavayi, Raghu, Chickles, Derek, Romanov, Philip,
Linux Netdev List, LAKML
In-Reply-To: <CA+sq2Ce1JopxeJ=mdvAgE1iUzf_BnkPfPYQ9FeThE07nZK_2HA@mail.gmail.com>
Hi Sunil,
Thanks for review. Answers inline.
Thanks,
Satha.
-----Original Message-----
From: Sunil Kovvuri [mailto:sunil.kovvuri@gmail.com]
Sent: Wednesday, December 21, 2016 4:36 AM
To: Koteshwar Rao, Satha
Cc: LKML; Goutham, Sunil; Robert Richter; David S. Miller; Daney, David; Vatsavayi, Raghu; Chickles, Derek; Romanov, Philip; Linux Netdev List; LAKML
Subject: Re: [RFC PATCH 4/7] HW Filter Initialization code and register access APIs
On Wed, Dec 21, 2016 at 2:16 PM, Satha Koteswara Rao <satha.rao@caviumnetworks.com> wrote:
> ---
> drivers/net/ethernet/cavium/thunder/pf_reg.c | 660
> +++++++++++++++++++++++++++
> 1 file changed, 660 insertions(+)
> create mode 100644 drivers/net/ethernet/cavium/thunder/pf_reg.c
>
> diff --git a/drivers/net/ethernet/cavium/thunder/pf_reg.c
> b/drivers/net/ethernet/cavium/thunder/pf_reg.c
Sunil>>
From the file name 'pf_reg.c', what is PF here ?
TNS is not a SRIOV device right ?
SATHA>>> PF stands for acted Physical Function. PF referred in file name confuses common usage of NIC PF, planning to change file name in next version
Yes this block does not support SRIOV
> new file mode 100644
> index 0000000..1f95c7f
> --- /dev/null
> +++ b/drivers/net/ethernet/cavium/thunder/pf_reg.c
> @@ -0,0 +1,660 @@
> +/*
> + * Copyright (C) 2015 Cavium, Inc.
> + *
> + * This program is free software; you can redistribute it and/or
> +modify it
> + * under the terms of version 2 of the GNU General Public License
> + * as published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/slab.h>
> +#include <linux/fs.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/version.h>
> +#include <linux/proc_fs.h>
> +#include <linux/device.h>
> +#include <linux/mman.h>
> +#include <linux/uaccess.h>
> +#include <linux/delay.h>
> +#include <linux/cdev.h>
> +#include <linux/err.h>
> +#include <linux/device.h>
> +#include <linux/io.h>
> +#include <linux/firmware.h>
> +#include "pf_globals.h"
> +#include "pf_locals.h"
> +#include "tbl_access.h"
> +#include "linux/lz4.h"
> +
> +struct tns_table_s tbl_info[TNS_MAX_TABLE];
> +
> +#define TNS_TDMA_SST_ACC_CMD_ADDR 0x0000842000000270ull
> +
> +#define BAR0_START 0x842000000000
> +#define BAR0_END 0x84200000FFFF
> +#define BAR0_SIZE (64 * 1024)
> +#define BAR2_START 0x842040000000
> +#define BAR2_END 0x84207FFFFFFF
> +#define BAR2_SIZE (1024 * 1024 * 1024)
> +
> +#define NODE1_BAR0_START 0x942000000000
> +#define NODE1_BAR0_END 0x94200000FFFF
> +#define NODE1_BAR0_SIZE (64 * 1024)
> +#define NODE1_BAR2_START 0x942040000000
> +#define NODE1_BAR2_END 0x94207FFFFFFF
> +#define NODE1_BAR2_SIZE (1024 * 1024 * 1024)
Sunil>> This is absurd, why are you using hardcoded HW addresses,
why not use TNS device's PCI BARs.
SATHA>>> Due to various considerations TNS is not treated as PCI device by our driver (probably will be even disabled as such later in the FW), HW addresses mentioned was register base address as per HRM.
Macro name with BAR_X confuses, will change this in my next revision.
> +/* Allow a max of 4 chunks for the Indirect Read/Write */ #define
> +MAX_SIZE (64 * 4) #define CHUNK_SIZE (64)
> +/* To protect register access */
> +spinlock_t pf_reg_lock;
> +
> +u64 iomem0;
> +u64 iomem2;
> +u8 tns_enabled;
> +u64 node1_iomem0;
> +u64 node1_iomem2;
> +u8 node1_tns;
> +int n1_tns;
Sunil>> A simple structure would have nice instead of so many global variables.
SATHA>>> Good suggestion, will do this in next version
> +
> +int tns_write_register_indirect(int node_id, u64 address, u8 size,
> + u8 *kern_buffer) {
> + union tns_tdma_sst_acc_cmd acccmd;
> + union tns_tdma_sst_acc_stat_t accstat;
> + union tns_acc_data data;
> + int i, j, w = 0;
> + int cnt = 0;
> + u32 *dataw = NULL;
> + int temp = 0;
> + int k = 0;
> + int chunks = 0;
> + u64 acccmd_address;
> + u64 lmem2 = 0, lmem0 = 0;
> +
> + if (size == 0 || !kern_buffer) {
> + filter_dbg(FERR, "%s data size cannot be zero\n", __func__);
> + return TNS_ERROR_INVALID_ARG;
> + }
> + if (size > MAX_SIZE) {
> + filter_dbg(FERR, "%s Max allowed size exceeded\n", __func__);
> + return TNS_ERROR_DATA_TOO_LARGE;
> + }
> + if (node_id) {
> + lmem0 = node1_iomem0;
> + lmem2 = node1_iomem2;
> + } else {
> + lmem0 = iomem0;
> + lmem2 = iomem2;
> + }
> +
> + chunks = ((size + (CHUNK_SIZE - 1)) / CHUNK_SIZE);
> + acccmd_address = (address & 0x00000000ffffffff);
> + spin_lock_bh(&pf_reg_lock);
> +
> + for (k = 0; k < chunks; k++) {
Sunil>> Why not use some proper variable names, instead of i,j,k,w,
temp e.t.c e.t.c
SATHA>>> Will do this in next version
> + /* Should never happen */
> + if (size < 0) {
> + filter_dbg(FERR, "%s size mismatch [CHUNK %d]\n",
> + __func__, k);
> + break;
> + }
> + temp = (size > CHUNK_SIZE) ? CHUNK_SIZE : size;
> + dataw = (u32 *)(kern_buffer + (k * CHUNK_SIZE));
> + cnt = ((temp + 3) / 4);
> + data.u = 0ULL;
> + for (j = 0, i = 0; i < cnt; i++) {
> + /* Odd words go in the upper 32 bits of the data
> + * register
> + */
> + if (i & 1) {
> + data.s.upper32 = dataw[i];
> + writeq_relaxed(data.u, (void *)(lmem0 +
> + TNS_TDMA_SST_ACC_WDATX(j)));
> + data.u = 0ULL;
> + j++; /* Advance to the next data word */
> + w = 0;
> + } else {
> + /* Lower 32 bits contain words 0, 2, 4, etc. */
> + data.s.lower32 = dataw[i];
> + w = 1;
> + }
> + }
> +
> + /* If the last word was a partial (< 64 bits) then
> + * see if we need to write it.
> + */
> + if (w)
> + writeq_relaxed(data.u, (void *)(lmem0 +
> + TNS_TDMA_SST_ACC_WDATX(j)));
> +
> + acccmd.u = 0ULL;
> + acccmd.s.go = 1; /* Cleared once the request is serviced */
> + acccmd.s.size = cnt;
> + acccmd.s.addr = (acccmd_address >> 2);
> + writeq_relaxed(acccmd.u, (void *)(lmem0 +
> + TDMA_SST_ACC_CMD));
> + accstat.u = 0ULL;
> +
> + while (!accstat.s.cmd_done && !accstat.s.error)
> + accstat.u = readq_relaxed((void *)(lmem0 +
> + TDMA_SST_ACC_STAT));
> +
> + if (accstat.s.error) {
> + data.u = readq_relaxed((void *)(lmem2 +
> + TDMA_NB_INT_STAT));
> + filter_dbg(FERR, "%s Reading data from ", __func__);
> + filter_dbg(FERR, "0x%0lx chunk %d failed 0x%0lx",
> + (unsigned long)address, k,
> + (unsigned long)data.u);
> + spin_unlock_bh(&pf_reg_lock);
> + kfree(kern_buffer);
> + return TNS_ERROR_INDIRECT_WRITE;
> + }
> + /* Calculate the next offset to write */
> + acccmd_address = acccmd_address + CHUNK_SIZE;
> + size -= CHUNK_SIZE;
> + }
> + spin_unlock_bh(&pf_reg_lock);
> +
> + return 0;
> +}
> +
> +int tns_read_register_indirect(int node_id, u64 address, u8 size,
> + u8 *kern_buffer) {
> + union tns_tdma_sst_acc_cmd acccmd;
> + union tns_tdma_sst_acc_stat_t accstat;
> + union tns_acc_data data;
> + int i, j, dcnt;
> + int cnt = 0;
> + u32 *dataw = NULL;
> + int temp = 0;
> + int k = 0;
> + int chunks = 0;
> + u64 acccmd_address;
> + u64 lmem2 = 0, lmem0 = 0;
> +
> + if (size == 0 || !kern_buffer) {
> + filter_dbg(FERR, "%s data size cannot be zero\n", __func__);
> + return TNS_ERROR_INVALID_ARG;
> + }
> + if (size > MAX_SIZE) {
> + filter_dbg(FERR, "%s Max allowed size exceeded\n", __func__);
> + return TNS_ERROR_DATA_TOO_LARGE;
> + }
> + if (node_id) {
> + lmem0 = node1_iomem0;
> + lmem2 = node1_iomem2;
> + } else {
> + lmem0 = iomem0;
> + lmem2 = iomem2;
> + }
> +
> + chunks = ((size + (CHUNK_SIZE - 1)) / CHUNK_SIZE);
> + acccmd_address = (address & 0x00000000ffffffff);
> + spin_lock_bh(&pf_reg_lock);
> + for (k = 0; k < chunks; k++) {
> + /* This should never happen */
> + if (size < 0) {
> + filter_dbg(FERR, "%s size mismatch [CHUNK:%d]\n",
> + __func__, k);
> + break;
> + }
> + temp = (size > CHUNK_SIZE) ? CHUNK_SIZE : size;
> + dataw = (u32 *)(kern_buffer + (k * CHUNK_SIZE));
> + cnt = ((temp + 3) / 4);
> + acccmd.u = 0ULL;
> + acccmd.s.op = 1; /* Read operation */
> + acccmd.s.size = cnt;
> + acccmd.s.addr = (acccmd_address >> 2);
> + acccmd.s.go = 1; /* Execute */
> + writeq_relaxed(acccmd.u, (void *)(lmem0 +
> + TDMA_SST_ACC_CMD));
> + accstat.u = 0ULL;
> +
> + while (!accstat.s.cmd_done && !accstat.s.error)
> + accstat.u = readq_relaxed((void *)(lmem0 +
> + TDMA_SST_ACC_STAT));
> +
> + if (accstat.s.error) {
> + data.u = readq_relaxed((void *)(lmem2 +
> + TDMA_NB_INT_STAT));
> + filter_dbg(FERR, "%s Reading data from", __func__);
> + filter_dbg(FERR, "0x%0lx chunk %d failed 0x%0lx",
> + (unsigned long)address, k,
> + (unsigned long)data.u);
> + spin_unlock_bh(&pf_reg_lock);
> + kfree(kern_buffer);
> + return TNS_ERROR_INDIRECT_READ;
> + }
> +
> + dcnt = cnt / 2;
> + if (cnt & 1)
> + dcnt++;
> + for (i = 0, j = 0; (j < dcnt) && (i < cnt); j++) {
> + data.u = readq_relaxed((void *)(lmem0 +
> + TNS_TDMA_SST_ACC_RDATX(j)));
> + dataw[i++] = data.s.lower32;
> + if (i < cnt)
> + dataw[i++] = data.s.upper32;
> + }
> + /* Calculate the next offset to read */
> + acccmd_address = acccmd_address + CHUNK_SIZE;
> + size -= CHUNK_SIZE;
> + }
> + spin_unlock_bh(&pf_reg_lock);
> + return 0;
> +}
> +
> +u64 tns_read_register(u64 start, u64 offset) {
> + return readq_relaxed((void *)(start + offset)); }
> +
> +void tns_write_register(u64 start, u64 offset, u64 data) {
> + writeq_relaxed(data, (void *)(start + offset)); }
> +
> +/* Check if TNS is available. If yes return 0 else 1 */ int
> +is_tns_available(void) {
> + union tns_tdma_cap tdma_cap;
> +
> + tdma_cap.u = tns_read_register(iomem0, TNS_TDMA_CAP_OFFSET);
> + tns_enabled = tdma_cap.s.switch_capable;
> + /* In multi-node systems, make sure TNS should be there in
> + both nodes */
Can't node-0 TNS work with node-0 interfaces if node-1 TNS is not detected ?
SATHA>>> Presently we enabled TNS only when two nodes supports TNS (only incase of 2 node system)
> + if (nr_node_ids > 1) {
> + tdma_cap.u = tns_read_register(node1_iomem0,
> + TNS_TDMA_CAP_OFFSET);
> + if (tdma_cap.s.switch_capable)
> + n1_tns = 1;
> + }
> + tns_enabled &= tdma_cap.s.switch_capable;
> + return (!tns_enabled);
> +}
> +
> +int bist_error_check(void)
> +{
> + int fail = 0, i;
> + u64 bist_stat = 0;
> +
> + for (i = 0; i < 12; i++) {
> + bist_stat = tns_read_register(iomem0, (i * 16));
> + if (bist_stat) {
> + filter_dbg(FERR, "TNS BIST%d fail 0x%llx\n",
> + i, bist_stat);
> + fail = 1;
> + }
> + if (!n1_tns)
> + continue;
> + bist_stat = tns_read_register(node1_iomem0, (i * 16));
> + if (bist_stat) {
> + filter_dbg(FERR, "TNS(N1) BIST%d fail 0x%llx\n",
> + i, bist_stat);
> + fail = 1;
> + }
> + }
> +
> + return fail;
> +}
> +
> +int replay_indirect_trace(int node, u64 *buf_ptr, int idx) {
> + union _tns_sst_config cmd = (union _tns_sst_config)(buf_ptr[idx]);
> + int remaining = cmd.cmd.run;
> + u64 io_addr;
> + int word_cnt = cmd.cmd.word_cnt;
> + int size = (word_cnt + 1) / 2;
> + u64 stride = word_cnt;
> + u64 acc_cmd = cmd.copy.do_copy;
> + u64 lmem2 = 0, lmem0 = 0;
> + union tns_tdma_sst_acc_stat_t accstat;
> + union tns_acc_data data;
> +
> + if (node) {
> + lmem0 = node1_iomem0;
> + lmem2 = node1_iomem2;
> + } else {
> + lmem0 = iomem0;
> + lmem2 = iomem2;
> + }
> +
> + if (word_cnt == 0) {
> + word_cnt = 16;
> + stride = 16;
> + size = 8;
> + } else {
> + // make stride next power of 2
Please use proper commenting, have you ran checkpatch ?
SATHA>>> Will change these comments in next version. We ran checkpatch, no errors and warning reported
> + if (cmd.cmd.powerof2stride)
> + while ((stride & (stride - 1)) != 0)
> + stride++;
> + }
> + stride *= 4; //convert stride from 32-bit words to bytes
> +
> + do {
> + int addr_p = 1;
> + /* extract (big endian) data from the config
> + * into the data array
> + */
> + while (size > 0) {
> + io_addr = lmem0 + TDMA_SST_ACC_CMD + addr_p * 16;
> + tns_write_register(io_addr, 0, buf_ptr[idx + size]);
> + addr_p += 1;
> + size--;
> + }
> + tns_write_register((lmem0 + TDMA_SST_ACC_CMD), 0, acc_cmd);
> + /* TNS Block access registers indirectly, ran memory barrier
> + * between two writes
> + */
> + wmb();
> + /* Check for completion */
> + accstat.u = 0ULL;
> + while (!accstat.s.cmd_done && !accstat.s.error)
> + accstat.u = readq_relaxed((void *)(lmem0 +
> +
> + TDMA_SST_ACC_STAT));
> +
> + /* Check for error, and report it */
> + if (accstat.s.error) {
> + filter_dbg(FERR, "%s data from 0x%0llx failed 0x%llx\n",
> + __func__, acc_cmd, accstat.u);
> + data.u = readq_relaxed((void *)(lmem2 +
> + TDMA_NB_INT_STAT));
> + filter_dbg(FERR, "Status 0x%llx\n", data.u);
> + }
> + /* update the address */
> + acc_cmd += stride;
> + size = (word_cnt + 1) / 2;
> + usleep_range(20, 30);
> + } while (remaining-- > 0);
> +
> + return size;
> +}
> +
> +void replay_tns_node(int node, u64 *buf_ptr, int reg_cnt) {
> + int counter = 0;
> + u64 offset = 0;
> + u64 io_address;
> + int datapathmode = 1;
> + u64 lmem2 = 0, lmem0 = 0;
> +
> + if (node) {
> + lmem0 = node1_iomem0;
> + lmem2 = node1_iomem2;
> + } else {
> + lmem0 = iomem0;
> + lmem2 = iomem2;
> + }
> + for (counter = 0; counter < reg_cnt; counter++) {
> + if (buf_ptr[counter] == 0xDADADADADADADADAull) {
> + datapathmode = 1;
> + continue;
> + } else if (buf_ptr[counter] == 0xDEDEDEDEDEDEDEDEull) {
> + datapathmode = 0;
> + continue;
> + }
> + if (datapathmode == 1) {
> + if (buf_ptr[counter] >= BAR0_START &&
> + buf_ptr[counter] <= BAR0_END) {
> + offset = buf_ptr[counter] - BAR0_START;
> + io_address = lmem0 + offset;
> + } else if (buf_ptr[counter] >= BAR2_START &&
> + buf_ptr[counter] <= BAR2_END) {
> + offset = buf_ptr[counter] - BAR2_START;
> + io_address = lmem2 + offset;
> + } else {
> + filter_dbg(FERR, "%s Address 0x%llx invalid\n",
> + __func__, buf_ptr[counter]);
> + return;
> + }
> +
> + tns_write_register(io_address, 0, buf_ptr[counter + 1]);
> + /* TNS Block access registers indirectly, ran memory
> + * barrier between two writes
> + */
> + wmb();
> + counter += 1;
> + usleep_range(20, 30);
> + } else if (datapathmode == 0) {
> + int sz = replay_indirect_trace(node, buf_ptr,
> + counter);
> +
> + counter += sz;
> + }
> + }
> +}
> +
> +int alloc_table_info(int i, struct table_static_s tbl_sdata[]) {
> + tbl_info[i].ddata[0].bitmap = kcalloc(BITS_TO_LONGS(tbl_sdata[i].depth),
> + sizeof(uintptr_t), GFP_KERNEL);
> + if (!tbl_info[i].ddata[0].bitmap)
> + return 1;
> +
> + if (!n1_tns)
> + return 0;
> +
> + tbl_info[i].ddata[1].bitmap = kcalloc(BITS_TO_LONGS(tbl_sdata[i].depth),
> + sizeof(uintptr_t), GFP_KERNEL);
> + if (!tbl_info[i].ddata[1].bitmap) {
> + kfree(tbl_info[i].ddata[0].bitmap);
> + return 1;
> + }
> +
> + return 0;
> +}
> +
> +void tns_replay_register_trace(const struct firmware *fw, struct
> +device *dev) {
> + int i;
> + int node = 0;
> + u8 *buffer = NULL;
> + u64 *buf_ptr = NULL;
> + struct tns_global_st *fw_header = NULL;
> + struct table_static_s tbl_sdata[TNS_MAX_TABLE];
> + size_t src_len;
> + size_t dest_len = TNS_FW_MAX_SIZE;
> + int rc;
> + u8 *fw2_buf = NULL;
> + unsigned char *decomp_dest = NULL;
> +
> + fw2_buf = (u8 *)fw->data;
> + src_len = fw->size - 8;
> +
> + decomp_dest = kcalloc((dest_len * 2), sizeof(char), GFP_KERNEL);
> + if (!decomp_dest)
> + return;
> +
> + memset(decomp_dest, 0, (dest_len * 2));
> + rc = lz4_decompress_unknownoutputsize(&fw2_buf[8], src_len, decomp_dest,
> + &dest_len);
> + if (rc) {
> + filter_dbg(FERR, "Decompress Error %d\n", rc);
> + pr_info("Uncompressed destination length %ld\n", dest_len);
> + kfree(decomp_dest);
> + return;
> + }
> + fw_header = (struct tns_global_st *)decomp_dest;
> + buffer = (u8 *)decomp_dest;
> +
> + filter_dbg(FINFO, "TNS Firmware version: %s Loading...\n",
> + fw_header->version);
> +
> + memset(tbl_info, 0x0, sizeof(tbl_info));
> + buf_ptr = (u64 *)(buffer + sizeof(struct tns_global_st));
> + memcpy(tbl_sdata, fw_header->tbl_info,
> + sizeof(fw_header->tbl_info));
> +
> + for (i = 0; i < TNS_MAX_TABLE; i++) {
> + if (!tbl_sdata[i].valid)
> + continue;
> + memcpy(&tbl_info[i].sdata, &tbl_sdata[i],
> + sizeof(struct table_static_s));
> + if (alloc_table_info(i, tbl_sdata)) {
> + kfree(decomp_dest);
> + return;
> + }
> + }
> +
> + for (node = 0; node < nr_node_ids; node++)
> + replay_tns_node(node, buf_ptr, fw_header->reg_cnt);
> +
> + kfree(decomp_dest);
> + release_firmware(fw);
> +}
> +
> +int tns_init(const struct firmware *fw, struct device *dev) {
> + int result = 0;
> + int i = 0;
> + int temp;
> + union tns_tdma_config tdma_config;
> + union tns_tdma_lmacx_config tdma_lmac_cfg;
> + u64 reg_init_val;
> +
> + spin_lock_init(&pf_reg_lock);
> +
> + /* use two regions insted of a single big mapping to save
> + * the kernel virtual space
> + */
> + iomem0 = (u64)ioremap(BAR0_START, BAR0_SIZE);
> + if (iomem0 == 0ULL) {
> + filter_dbg(FERR, "Node0 ioremap failed for BAR0\n");
> + result = -EAGAIN;
> + goto error;
> + } else {
> + filter_dbg(FINFO, "ioremap success for BAR0\n");
> + }
> +
> + if (nr_node_ids > 1) {
> + node1_iomem0 = (u64)ioremap(NODE1_BAR0_START, NODE1_BAR0_SIZE);
> + if (node1_iomem0 == 0ULL) {
> + filter_dbg(FERR, "Node1 ioremap failed for BAR0\n");
> + result = -EAGAIN;
> + goto error;
> + } else {
> + filter_dbg(FINFO, "ioremap success for BAR0\n");
> + }
> + }
> +
> + if (is_tns_available()) {
> + filter_dbg(FERR, "TNS NOT AVAILABLE\n");
> + goto error;
> + }
> +
> + if (bist_error_check()) {
> + filter_dbg(FERR, "BIST ERROR CHECK FAILED");
> + goto error;
> + }
> +
> + /* NIC0-BGX0 is TNS, NIC1-BGX1 is TNS, DISABLE BACKPRESSURE */
Sunil>> Why disable backpressure, if it's in TNS mode ?
SATHA>>> As part of the init code we disabled BP, later it is enabled
> + reg_init_val = 0ULL;
> + pr_info("NIC Block configured in TNS/TNS mode");
> + tns_write_register(iomem0, TNS_RDMA_CONFIG_OFFSET, reg_init_val);
> + usleep_range(10, 20);
Sunil>> Why sleep after every register write ?
SATHA>>> Consecutive indirect register access needs some delay
> + if (n1_tns) {
> + tns_write_register(node1_iomem0, TNS_RDMA_CONFIG_OFFSET,
> + reg_init_val);
> + usleep_range(10, 20);
> + }
> +
> + // Configure each LMAC with 512 credits in BYPASS mode
> + for (i = TNS_MIN_LMAC; i < (TNS_MIN_LMAC + TNS_MAX_LMAC); i++) {
> + tdma_lmac_cfg.u = 0ULL;
> + tdma_lmac_cfg.s.fifo_cdts = 0x200;
> + tns_write_register(iomem0, TNS_TDMA_LMACX_CONFIG_OFFSET(i),
> + tdma_lmac_cfg.u);
> + usleep_range(10, 20);
> + if (n1_tns) {
> + tns_write_register(node1_iomem0,
> + TNS_TDMA_LMACX_CONFIG_OFFSET(i),
> + tdma_lmac_cfg.u);
> + usleep_range(10, 20);
> + }
> + }
> +
> + //ENABLE TNS CLOCK AND CSR READS
> + temp = tns_read_register(iomem0, TNS_TDMA_CONFIG_OFFSET);
> + tdma_config.u = temp;
> + tdma_config.s.clk_2x_ena = 1;
> + tdma_config.s.clk_ena = 1;
> + tns_write_register(iomem0, TNS_TDMA_CONFIG_OFFSET, tdma_config.u);
> + if (n1_tns)
> + tns_write_register(node1_iomem0, TNS_TDMA_CONFIG_OFFSET,
> + tdma_config.u);
> +
> + temp = tns_read_register(iomem0, TNS_TDMA_CONFIG_OFFSET);
> + tdma_config.u = temp;
> + tdma_config.s.csr_access_ena = 1;
> + tns_write_register(iomem0, TNS_TDMA_CONFIG_OFFSET, tdma_config.u);
> + if (n1_tns)
> + tns_write_register(node1_iomem0, TNS_TDMA_CONFIG_OFFSET,
> + tdma_config.u);
> +
> + reg_init_val = 0ULL;
> + tns_write_register(iomem0, TNS_TDMA_RESET_CTL_OFFSET, reg_init_val);
> + if (n1_tns)
> + tns_write_register(node1_iomem0, TNS_TDMA_RESET_CTL_OFFSET,
> + reg_init_val);
> +
> + iomem2 = (u64)ioremap(BAR2_START, BAR2_SIZE);
> + if (iomem2 == 0ULL) {
> + filter_dbg(FERR, "ioremap failed for BAR2\n");
> + result = -EAGAIN;
> + goto error;
> + } else {
> + filter_dbg(FINFO, "ioremap success for BAR2\n");
> + }
> +
> + if (n1_tns) {
> + node1_iomem2 = (u64)ioremap(NODE1_BAR2_START, NODE1_BAR2_SIZE);
> + if (node1_iomem2 == 0ULL) {
> + filter_dbg(FERR, "Node1 ioremap failed for BAR2\n");
> + result = -EAGAIN;
> + goto error;
> + } else {
> + filter_dbg(FINFO, "Node1 ioremap success for BAR2\n");
> + }
> + }
> + msleep(1000);
> + //We will replay register trace to initialize TNS block
> + tns_replay_register_trace(fw, dev);
> +
> + return 0;
> +error:
> + if (iomem0 != 0)
> + iounmap((void *)iomem0);
> + if (iomem2 != 0)
> + iounmap((void *)iomem2);
> +
> + if (node1_iomem0 != 0)
> + iounmap((void *)node1_iomem0);
> + if (node1_iomem2 != 0)
> + iounmap((void *)node1_iomem2);
> +
> + return result;
> +}
> +
> +void tns_exit(void)
> +{
> + int i;
> +
> + if (iomem0 != 0)
> + iounmap((void *)iomem0);
> + if (iomem2 != 0)
> + iounmap((void *)iomem2);
> +
> + if (node1_iomem0 != 0)
> + iounmap((void *)node1_iomem0);
> + if (node1_iomem2 != 0)
> + iounmap((void *)node1_iomem2);
> +
> + for (i = 0; i < TNS_MAX_TABLE; i++) {
> + if (!tbl_info[i].sdata.valid)
> + continue;
> + kfree(tbl_info[i].ddata[0].bitmap);
> + kfree(tbl_info[i].ddata[n1_tns].bitmap);
> + }
> +}
> --
> 1.8.3.1
>
^ permalink raw reply
* RE: [RFC PATCH 5/7] Multiple VF's grouped together under single physical port called PF group PF Group maintainance API's
From: Koteshwar Rao, Satha @ 2016-12-26 14:16 UTC (permalink / raw)
To: Sunil Kovvuri
Cc: LKML, Goutham, Sunil, Robert Richter, David S. Miller,
Daney, David, Vatsavayi, Raghu, Chickles, Derek, Romanov, Philip,
Linux Netdev List, LAKML
In-Reply-To: <CA+sq2Cf=3-WUaT+iNefYswgGf2ejW6EQp1igcMWELU=-xUnnNA@mail.gmail.com>
Thanks for suggestion. Will clean up code in next revision
Thanks,
Satha
-----Original Message-----
From: Sunil Kovvuri [mailto:sunil.kovvuri@gmail.com]
Sent: Wednesday, December 21, 2016 4:44 AM
To: Koteshwar Rao, Satha
Cc: LKML; Goutham, Sunil; Robert Richter; David S. Miller; Daney, David; Vatsavayi, Raghu; Chickles, Derek; Romanov, Philip; Linux Netdev List; LAKML
Subject: Re: [RFC PATCH 5/7] Multiple VF's grouped together under single physical port called PF group PF Group maintainance API's
On Wed, Dec 21, 2016 at 2:16 PM, Satha Koteswara Rao <satha.rao@caviumnetworks.com> wrote:
> +struct tns_global_st {
> + u64 magic;
> + char version[16];
> + u64 reg_cnt;
> + struct table_static_s tbl_info[TNS_MAX_TABLE]; };
> +
> +#define PF_COUNT 3
> +#define PF_1 0
> +#define PF_2 64
> +#define PF_3 96
> +#define PF_END 128
Some comments please ... what is 0, 64, 96 ??
You can read PCI_SRIOV_TOTAL_VF from PCI config space instead of defining PF_END with 128.
^ permalink raw reply
* RE: [RFC PATCH 1/7] PF driver modified to enable HW filter support, changes works in backward compatibility mode Enable required things in Makefile Enable LZ4 dependecy inside config file
From: Koteshwar Rao, Satha @ 2016-12-26 14:20 UTC (permalink / raw)
To: Sunil Kovvuri
Cc: LKML, Goutham, Sunil, Robert Richter, David S. Miller,
Daney, David, Vatsavayi, Raghu, Chickles, Derek, Romanov, Philip,
Linux Netdev List, LAKML
In-Reply-To: <CA+sq2CfwTN5u6nhxriqWqoFxsYHdEYikbjBgMCL2YwNXpqyQ=Q@mail.gmail.com>
Responses inline
Thanks,
Satha
-----Original Message-----
From: Sunil Kovvuri [mailto:sunil.kovvuri@gmail.com]
Sent: Wednesday, December 21, 2016 5:05 AM
To: Koteshwar Rao, Satha
Cc: LKML; Goutham, Sunil; Robert Richter; David S. Miller; Daney, David; Vatsavayi, Raghu; Chickles, Derek; Romanov, Philip; Linux Netdev List; LAKML
Subject: Re: [RFC PATCH 1/7] PF driver modified to enable HW filter support, changes works in backward compatibility mode Enable required things in Makefile Enable LZ4 dependecy inside config file
>
> #define NIC_MAX_RSS_HASH_BITS 8
> #define NIC_MAX_RSS_IDR_TBL_SIZE (1 << NIC_MAX_RSS_HASH_BITS)
> +#define NIC_TNS_RSS_IDR_TBL_SIZE 5
So you want to use only 5 queues per VF when TNS is enabled, is it ??
There are 4096 RSS indices in total, for each VF you can use max 32.
I guess you wanted to set no of hash bits to 5 instead of table size.
SATHA>>> We enabled 8 queues for VF. Yes Macro name misleads it has to be hash bits, will change this in next version
> #define RSS_HASH_KEY_SIZE 5 /* 320 bit key */
>
> struct nicvf_rss_info {
> @@ -255,74 +258,6 @@ struct nicvf_drv_stats {
> struct u64_stats_sync syncp;
> };
>
> -struct nicvf {
> - struct nicvf *pnicvf;
> - struct net_device *netdev;
> - struct pci_dev *pdev;
> - void __iomem *reg_base;
Didn't get why you moved this structure to the end of file.
Looks like an unnecessary modification.
SATHA>>> Previously we have some dependency, we look into this, and address in next verison
> +static unsigned int num_vfs;
> +module_param(num_vfs, uint, 0644);
> +MODULE_PARM_DESC(num_vfs, "Non zero positive value, specifies number
> +of VF's per physical port");
So what if driver is built-in instead of module, I can't use TNS is it ?
SATHA>>> Still you can enable this special features by passing boot argument "nicpf.num_vfs=X"
>
> +/* Set RBDR Backpressure (RBDR_BP) and CQ backpressure (CQ_BP) of
> +vnic queues
> + * to 129 each
Why 129 ??
RBDR minimum size is 8K buffers, why you want to assert BP when still ~4K buffers are available. Isn't 4K a huge number to start asserting backpressure ?
SATHA>>> As CQ count was 4K entries, I used same BP value for both, will address this in next version
^ permalink raw reply
* RE: [RFC PATCH 3/7] Enable pause frame support
From: Koteshwar Rao, Satha @ 2016-12-26 14:21 UTC (permalink / raw)
To: Goutham, Sunil, linux-kernel@vger.kernel.org
Cc: rric@kernel.org, Chickles, Derek, Daney, David,
netdev@vger.kernel.org, Vatsavayi, Raghu, davem@davemloft.net,
linux-arm-kernel@lists.infradead.org, Romanov, Philip
In-Reply-To: <DM5PR07MB2889471E0668C95BD2A266709E930@DM5PR07MB2889.namprd07.prod.outlook.com>
Thanks Sunil, will fix this in next version
Thanks,
Satha
From: Goutham, Sunil
Sent: Wednesday, December 21, 2016 1:20 AM
To: Koteshwar Rao, Satha; linux-kernel@vger.kernel.org
Cc: rric@kernel.org; davem@davemloft.net; Daney, David; Vatsavayi, Raghu; Chickles, Derek; Romanov, Philip; netdev@vger.kernel.org; linux-arm-kernel@lists.infradead.org
Subject: Re: [RFC PATCH 3/7] Enable pause frame support
>>+#define BGX_SMUX_CBFC_CTL 0x20218
These macros are already defined.
if you check 'net-next ' branch pause frame support has already been
added. You should send patch on top it if you have further changes
to the existing.
Thanks,
Sunil.
________________________________________
From: Koteshwar Rao, Satha
Sent: Wednesday, December 21, 2016 2:16 PM
To: linux-kernel@vger.kernel.org
Cc: Goutham, Sunil; rric@kernel.org; davem@davemloft.net; Daney, David; Vatsavayi, Raghu; Chickles, Derek; Koteshwar Rao, Satha; Romanov, Philip; netdev@vger.kernel.org; linux-arm-kernel@lists.infradead.org
Subject: [RFC PATCH 3/7] Enable pause frame support
---
drivers/net/ethernet/cavium/thunder/thunder_bgx.c | 25 +++++++++++++++++++++++
drivers/net/ethernet/cavium/thunder/thunder_bgx.h | 7 +++++++
2 files changed, 32 insertions(+)
diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
index 050e21f..92d7e04 100644
--- a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
+++ b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
@@ -121,6 +121,31 @@ static int bgx_poll_reg(struct bgx *bgx, u8 lmac, u64 reg, u64 mask, bool zero)
return 1;
}
+void enable_pause_frames(int node, int bgx_idx, int lmac)
+{
+ u64 reg_value = 0;
+ struct bgx *bgx = bgx_vnic[(node * MAX_BGX_PER_NODE) + bgx_idx];
+
+ reg_value = bgx_reg_read(bgx, lmac, BGX_SMUX_TX_CTL);
+ /* Enable BGX()_SMU()_TX_CTL */
+ if (!(reg_value & L2P_BP_CONV))
+ bgx_reg_write(bgx, lmac, BGX_SMUX_TX_CTL,
+ (reg_value | (L2P_BP_CONV)));
+
+ reg_value = bgx_reg_read(bgx, lmac, BGX_SMUX_HG2_CTL);
+ /* Clear if BGX()_SMU()_HG2_CONTROL[HG2TX_EN] is set */
+ if (reg_value & SMUX_HG2_CTL_HG2TX_EN)
+ bgx_reg_write(bgx, lmac, BGX_SMUX_HG2_CTL,
+ (reg_value & (~SMUX_HG2_CTL_HG2TX_EN)));
+
+ reg_value = bgx_reg_read(bgx, lmac, BGX_SMUX_CBFC_CTL);
+ /* Clear if BGX()_SMU()_CBFC_CTL[TX_EN] is set */
+ if (reg_value & CBFC_CTL_TX_EN)
+ bgx_reg_write(bgx, lmac, BGX_SMUX_CBFC_CTL,
+ (reg_value & (~CBFC_CTL_TX_EN)));
+}
+EXPORT_SYMBOL(enable_pause_frames);
+
/* Return number of BGX present in HW */
unsigned bgx_get_map(int node)
{
diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.h b/drivers/net/ethernet/cavium/thunder/thunder_bgx.h
index 01cc7c8..5b57bd1 100644
--- a/drivers/net/ethernet/cavium/thunder/thunder_bgx.h
+++ b/drivers/net/ethernet/cavium/thunder/thunder_bgx.h
@@ -131,6 +131,11 @@
#define BGX_SMUX_TX_CTL 0x20178
#define SMU_TX_CTL_DIC_EN BIT_ULL(0)
#define SMU_TX_CTL_UNI_EN BIT_ULL(1)
+#define L2P_BP_CONV BIT_ULL(7)
+#define BGX_SMUX_CBFC_CTL 0x20218
+#define CBFC_CTL_TX_EN BIT_ULL(1)
+#define BGX_SMUX_HG2_CTL 0x20210
+#define SMUX_HG2_CTL_HG2TX_EN BIT_ULL(18)
#define SMU_TX_CTL_LNK_STATUS (3ull << 4)
#define BGX_SMUX_TX_THRESH 0x20180
#define BGX_SMUX_CTL 0x20200
@@ -212,6 +217,8 @@ void bgx_lmac_internal_loopback(int node, int bgx_idx,
u64 bgx_get_rx_stats(int node, int bgx_idx, int lmac, int idx);
u64 bgx_get_tx_stats(int node, int bgx_idx, int lmac, int idx);
+void enable_pause_frames(int node, int bgx_idx, int lmac);
+
#define BGX_RX_STATS_COUNT 11
#define BGX_TX_STATS_COUNT 18
--
1.8.3.1
^ permalink raw reply related
* Re: [RFC PATCH 0/7] ThunderX Embedded switch support
From: Andrew Lunn @ 2016-12-26 14:55 UTC (permalink / raw)
To: Koteshwar Rao, Satha
Cc: Sunil Kovvuri, LKML, Goutham, Sunil, Robert Richter,
David S. Miller, Daney, David, Vatsavayi, Raghu, Chickles, Derek,
Romanov, Philip, Linux Netdev List, LAKML
In-Reply-To: <DM5PR07MB2842EC647393430BE559D5B78D960@DM5PR07MB2842.namprd07.prod.outlook.com>
On Mon, Dec 26, 2016 at 02:04:27PM +0000, Koteshwar Rao, Satha wrote:
> Hi Sunil,
>
> In RFC cover letter we explained the feature details, files organized based on their supporting functionality, let me know if you are interested in any specific details
Please don't top post. Also, please perform correct quoting of the
email you are replying to.
As for getting patches merged, you will find it easier to get reviews
if you have lots of small patches which are obviously correct, and
each has a good change log entry describing the why as well as what.
Andrew
^ permalink raw reply
* Re: [PATCH 2/6] wl1251: Use request_firmware_prefer_user() for loading NVS calibration data
From: Pavel Machek @ 2016-12-26 15:43 UTC (permalink / raw)
To: Pali Rohár
Cc: Arend Van Spriel, Ming Lei, Luis R. Rodriguez, Greg Kroah-Hartman,
Kalle Valo, David Gnedt, Michal Kazior, Daniel Wagner,
Tony Lindgren, Sebastian Reichel, Ivaylo Dimitrov, Aaro Koskinen,
Grazvydas Ignotas, linux-kernel, linux-wireless, netdev
In-Reply-To: <201612252146.44496@pali>
[-- Attachment #1: Type: text/plain, Size: 2621 bytes --]
Hi!
> > > NVS calibration data for wl1251 are model specific. Every one
> > > device with wl1251 chip has different and calibrated in factory.
> > >
> > > Not all wl1251 chips have own EEPROM where are calibration data
> > > stored. And in that case there is no "standard" place. Every
> > > device has stored them on different place (some in rootfs file,
> > > some in dedicated nand partition, some in another proprietary
> > > structure).
> > >
> > > Kernel wl1251 driver cannot support every one different storage
> > > decided by device manufacture so it will use
> > > request_firmware_prefer_user() call for loading NVS calibration
> > > data and userspace helper will be responsible to prepare correct
> > > data.
> >
> > Responding to this patch as it provides a lot of context to discuss.
> > As you might have gathered from earlier discussions I am not a fan
> > of using user-space helper. I can agree that the kernel driver,
> > wl1251 in this case, should be agnostic to platform specific details
> > regarding storage solutions and the firmware api should hide that.
> > However, it seems your only solution is adding user-space to the mix
> > and changing the api towards that. Can we solve it without
> > user-space help?
>
> Without userspace helper it means that userspace helper code must be
> integrated into kernel.
>
> So what is userspace helper doing?
>
> 1) Read MAC address from CAL
> 2) Read NVS data from CAL
> 3) Modify MAC address in memory NVS data (new for this patch series)
> 4) Modify in memory NVS data if we in FCC country
>
> Checking for country is done via dbus call to either Maemo cellular
> daemon or alternatively via REGDOMAIN in /etc/default/crda. I have plan
> to use ofono (instead Maemo cellular daemon) too...
>
> Currently we are using closed Nokia proprietary CAL library.
>
> Steps 1) and 2) needs closed library, step 4) needs dbus call.
I guess pointer to the source code implementing this would be welcome.
> > But on other devices that use wl1251, but for instance have no
> > userspace helper the request to userspace will fail (after 60 sec?)
> > and try VFS after that. Maybe not so nice.
>
> Currently support for those devices is broken (like for N900) as without
> proper NVS data they do not work correctly...
Is it expected to work at all, perhaps with degraded performance /
range? Because it seems to work for me.
Thanks,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]
^ permalink raw reply
* Re: [PATCH v1] net: dev_weight: TX/RX orthogonality
From: David Miller @ 2016-12-26 15:52 UTC (permalink / raw)
To: matthias.tafelmeier; +Cc: netdev, hagen, fw, edumazet, daniel
In-Reply-To: <1482745763-15082-1-git-send-email-matthias.tafelmeier@gmx.net>
From: Matthias Tafelmeier <matthias.tafelmeier@gmx.net>
Date: Mon, 26 Dec 2016 10:49:23 +0100
> @@ -269,13 +269,21 @@ static struct ctl_table net_core_table[] = {
> .extra1 = &min_rcvbuf,
> },
> {
> - .procname = "dev_weight",
> - .data = &weight_p,
> + .procname = "dev_weight_rx",
> + .data = &weight_p_rx,
...
> {
> + .procname = "dev_weight_tx",
Sysctls are user visible APIs. You cannot change them without
breaking userspace. You particularly cannot change the name of
the sysctl.
^ permalink raw reply
* Re: [PATCH 2/6] wl1251: Use request_firmware_prefer_user() for loading NVS calibration data
From: Pali Rohár @ 2016-12-26 16:04 UTC (permalink / raw)
To: Pavel Machek
Cc: Arend Van Spriel, Ming Lei, Luis R. Rodriguez, Greg Kroah-Hartman,
Kalle Valo, David Gnedt, Michal Kazior, Daniel Wagner,
Tony Lindgren, Sebastian Reichel, Ivaylo Dimitrov, Aaro Koskinen,
Grazvydas Ignotas, linux-kernel, linux-wireless, netdev
In-Reply-To: <20161226154353.GA27087@amd>
[-- Attachment #1: Type: Text/Plain, Size: 2959 bytes --]
On Monday 26 December 2016 16:43:53 Pavel Machek wrote:
> Hi!
>
> > > > NVS calibration data for wl1251 are model specific. Every one
> > > > device with wl1251 chip has different and calibrated in
> > > > factory.
> > > >
> > > > Not all wl1251 chips have own EEPROM where are calibration data
> > > > stored. And in that case there is no "standard" place. Every
> > > > device has stored them on different place (some in rootfs file,
> > > > some in dedicated nand partition, some in another proprietary
> > > > structure).
> > > >
> > > > Kernel wl1251 driver cannot support every one different storage
> > > > decided by device manufacture so it will use
> > > > request_firmware_prefer_user() call for loading NVS calibration
> > > > data and userspace helper will be responsible to prepare
> > > > correct data.
> > >
> > > Responding to this patch as it provides a lot of context to
> > > discuss. As you might have gathered from earlier discussions I
> > > am not a fan of using user-space helper. I can agree that the
> > > kernel driver, wl1251 in this case, should be agnostic to
> > > platform specific details regarding storage solutions and the
> > > firmware api should hide that. However, it seems your only
> > > solution is adding user-space to the mix and changing the api
> > > towards that. Can we solve it without user-space help?
> >
> > Without userspace helper it means that userspace helper code must
> > be integrated into kernel.
> >
> > So what is userspace helper doing?
> >
> > 1) Read MAC address from CAL
> > 2) Read NVS data from CAL
> > 3) Modify MAC address in memory NVS data (new for this patch
> > series) 4) Modify in memory NVS data if we in FCC country
> >
> > Checking for country is done via dbus call to either Maemo cellular
> > daemon or alternatively via REGDOMAIN in /etc/default/crda. I have
> > plan to use ofono (instead Maemo cellular daemon) too...
> >
> > Currently we are using closed Nokia proprietary CAL library.
> >
> > Steps 1) and 2) needs closed library, step 4) needs dbus call.
>
> I guess pointer to the source code implementing this would be
> welcome.
Here is current code: https://github.com/community-ssu/wl1251-cal
(there is implemented also Maemo netlink interface)
> > > But on other devices that use wl1251, but for instance have no
> > > userspace helper the request to userspace will fail (after 60
> > > sec?) and try VFS after that. Maybe not so nice.
> >
> > Currently support for those devices is broken (like for N900) as
> > without proper NVS data they do not work correctly...
>
> Is it expected to work at all, perhaps with degraded performance /
> range? Because it seems to work for me.
Yes, some degraded performance or problems with connecting is expected.
And random MAC address at every boot. Plus some regulatory problems in
FCC countries.
--
Pali Rohár
pali.rohar@gmail.com
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
^ permalink raw reply
* Re: [PATCH net] net, sched: fix soft lockup in tc_classify
From: David Miller @ 2016-12-26 16:24 UTC (permalink / raw)
To: daniel
Cc: shahark, xiyou.wangcong, gerlitz.or, roid, jiri, john.fastabend,
netdev
In-Reply-To: <1954223e8b7aa90ff835357e8a3ef7537be33e43.1482337600.git.daniel@iogearbox.net>
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Wed, 21 Dec 2016 18:04:11 +0100
> Shahar reported a soft lockup in tc_classify(), where we run into an
> endless loop when walking the classifier chain due to tp->next == tp
> which is a state we should never run into. The issue only seems to
> trigger under load in the tc control path.
>
> What happens is that in tc_ctl_tfilter(), thread A allocates a new
> tp, initializes it, sets tp_created to 1, and calls into tp->ops->change()
> with it. In that classifier callback we had to unlock/lock the rtnl
> mutex and returned with -EAGAIN. One reason why we need to drop there
> is, for example, that we need to request an action module to be loaded.
>
> This happens via tcf_exts_validate() -> tcf_action_init/_1() meaning
> after we loaded and found the requested action, we need to redo the
> whole request so we don't race against others. While we had to unlock
> rtnl in that time, thread B's request was processed next on that CPU.
> Thread B added a new tp instance successfully to the classifier chain.
> When thread A returned grabbing the rtnl mutex again, propagating -EAGAIN
> and destroying its tp instance which never got linked, we goto replay
> and redo A's request.
>
> This time when walking the classifier chain in tc_ctl_tfilter() for
> checking for existing tp instances we had a priority match and found
> the tp instance that was created and linked by thread B. Now calling
> again into tp->ops->change() with that tp was successful and returned
> without error.
>
> tp_created was never cleared in the second round, thus kernel thinks
> that we need to link it into the classifier chain (once again). tp and
> *back point to the same object due to the match we had earlier on. Thus
> for thread B's already public tp, we reset tp->next to tp itself and
> link it into the chain, which eventually causes the mentioned endless
> loop in tc_classify() once a packet hits the data path.
>
> Fix is to clear tp_created at the beginning of each request, also when
> we replay it. On the paths that can cause -EAGAIN we already destroy
> the original tp instance we had and on replay we really need to start
> from scratch. It seems that this issue was first introduced in commit
> 12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps chaining
> and avoid kernel panic when we use cls_cgroup").
>
> Fixes: 12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps chaining and avoid kernel panic when we use cls_cgroup")
> Reported-by: Shahar Klein <shahark@mellanox.com>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Applied and queued up for -stable, thanks Daniel.
^ permalink raw reply
* Re: [PATCH net] net: korina: Fix NAPI versus resources freeing
From: David Miller @ 2016-12-26 16:26 UTC (permalink / raw)
To: f.fainelli; +Cc: netdev, alex, phil
In-Reply-To: <20161224035656.11289-1-f.fainelli@gmail.com>
From: Florian Fainelli <f.fainelli@gmail.com>
Date: Fri, 23 Dec 2016 19:56:56 -0800
> Commit beb0babfb77e ("korina: disable napi on close and restart")
> introduced calls to napi_disable() that were missing before,
> unfortunately this leaves a small window during which NAPI has a chance
> to run, yet we just freed resources since korina_free_ring() has been
> called:
>
> Fix this by disabling NAPI first then freeing resource, and make sure
> that we also cancel the restart taks before doing the resource freeing.
>
> Fixes: beb0babfb77e ("korina: disable napi on close and restart")
> Reported-by: Alexandros C. Couloumbis <alex@ozo.com>
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH 2/6] wl1251: Use request_firmware_prefer_user() for loading NVS calibration data
From: Pavel Machek @ 2016-12-26 16:35 UTC (permalink / raw)
To: Arend Van Spriel
Cc: Pali Rohár, Ming Lei, Luis R. Rodriguez, Greg Kroah-Hartman,
Kalle Valo, David Gnedt, Michal Kazior, Daniel Wagner,
Tony Lindgren, Sebastian Reichel, Ivaylo Dimitrov, Aaro Koskinen,
Grazvydas Ignotas, linux-kernel, linux-wireless, netdev
In-Reply-To: <e728586e-80bf-c9e0-0063-71da4c4aba85@broadcom.com>
[-- Attachment #1: Type: text/plain, Size: 2144 bytes --]
On Sun 2016-12-25 21:15:40, Arend Van Spriel wrote:
> On 24-12-2016 17:52, Pali Rohár wrote:
> > NVS calibration data for wl1251 are model specific. Every one device with
> > wl1251 chip has different and calibrated in factory.
> >
> > Not all wl1251 chips have own EEPROM where are calibration data stored. And
> > in that case there is no "standard" place. Every device has stored them on
> > different place (some in rootfs file, some in dedicated nand partition,
> > some in another proprietary structure).
> >
> > Kernel wl1251 driver cannot support every one different storage decided by
> > device manufacture so it will use request_firmware_prefer_user() call for
> > loading NVS calibration data and userspace helper will be responsible to
> > prepare correct data.
>
> Responding to this patch as it provides a lot of context to discuss. As
> you might have gathered from earlier discussions I am not a fan of using
> user-space helper. I can agree that the kernel driver, wl1251 in this
> case, should be agnostic to platform specific details regarding storage
> solutions and the firmware api should hide that. However, it seems your
> only solution is adding user-space to the mix and changing the api
> towards that. Can we solve it without user-space help?
Answer is no, due to licensing. But that's wrong question to ask.
Right question is "should we solve it without user-space help"?
Answer is no, too. Way too complex. Yes, it would be nice if hardware
was designed in such a way that getting calibration data from kernel
is easy, and if you design hardware, please design it like that. But
N900 is not designed like that and getting the calibration through
userspace looks like only reasonable solution.
Now... how exactly to do that is other question. (But this is looks
very reasonable. Maybe I'd add request_firmware_with_flags(, ...int
flags), but.. that's a tiny detail.). But userspace needs to be
involved.
Thanks,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]
^ permalink raw reply
* [PATCH net] openvswitch: upcall: Fix vlan handling.
From: Pravin B Shelar @ 2016-12-26 16:31 UTC (permalink / raw)
To: netdev; +Cc: Pravin B Shelar, Jarno Rajahalme, Jiri Benc
Networking stack accelerate vlan tag handling by
keeping topmost vlan header in skb. This works as
long as packet remains in OVS datapath. But during
OVS upcall vlan header is pushed on to the packet.
When such packet is sent back to OVS datapath, core
networking stack might not handle it correctly. Following
patch avoids this issue by accelerating the vlan tag
during flow key extract. This simplifies datapath by
bringing uniform packet processing for packets from
all code paths.
Fixes: 5108bbaddc ("openvswitch: add processing of L3 packets").
CC: Jarno Rajahalme <jarno@ovn.org>
CC: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
---
net/openvswitch/datapath.c | 1 -
net/openvswitch/flow.c | 54 +++++++++++++++++++++++-----------------------
2 files changed, 27 insertions(+), 28 deletions(-)
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 2d4c4d3..9c62b63 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -606,7 +606,6 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
rcu_assign_pointer(flow->sf_acts, acts);
packet->priority = flow->key.phy.priority;
packet->mark = flow->key.phy.skb_mark;
- packet->protocol = flow->key.eth.type;
rcu_read_lock();
dp = get_dp_rcu(net, ovs_header->dp_ifindex);
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 08aa926..2c0a00f 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -312,7 +312,8 @@ static bool icmp6hdr_ok(struct sk_buff *skb)
* Returns 0 if it encounters a non-vlan or incomplete packet.
* Returns 1 after successfully parsing vlan tag.
*/
-static int parse_vlan_tag(struct sk_buff *skb, struct vlan_head *key_vh)
+static int parse_vlan_tag(struct sk_buff *skb, struct vlan_head *key_vh,
+ bool untag_vlan)
{
struct vlan_head *vh = (struct vlan_head *)skb->data;
@@ -330,7 +331,20 @@ static int parse_vlan_tag(struct sk_buff *skb, struct vlan_head *key_vh)
key_vh->tci = vh->tci | htons(VLAN_TAG_PRESENT);
key_vh->tpid = vh->tpid;
- __skb_pull(skb, sizeof(struct vlan_head));
+ if (unlikely(untag_vlan)) {
+ int offset = skb->data - skb_mac_header(skb);
+ u16 tci;
+ int err;
+
+ __skb_push(skb, offset);
+ err = __skb_vlan_pop(skb, &tci);
+ __skb_pull(skb, offset);
+ if (err)
+ return err;
+ __vlan_hwaccel_put_tag(skb, key_vh->tpid, tci);
+ } else {
+ __skb_pull(skb, sizeof(struct vlan_head));
+ }
return 1;
}
@@ -351,13 +365,13 @@ static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key)
key->eth.vlan.tpid = skb->vlan_proto;
} else {
/* Parse outer vlan tag in the non-accelerated case. */
- res = parse_vlan_tag(skb, &key->eth.vlan);
+ res = parse_vlan_tag(skb, &key->eth.vlan, true);
if (res <= 0)
return res;
}
/* Parse inner vlan tag. */
- res = parse_vlan_tag(skb, &key->eth.cvlan);
+ res = parse_vlan_tag(skb, &key->eth.cvlan, false);
if (res <= 0)
return res;
@@ -800,29 +814,15 @@ int ovs_flow_key_extract_userspace(struct net *net, const struct nlattr *attr,
if (err)
return err;
- if (ovs_key_mac_proto(key) == MAC_PROTO_NONE) {
- /* key_extract assumes that skb->protocol is set-up for
- * layer 3 packets which is the case for other callers,
- * in particular packets recieved from the network stack.
- * Here the correct value can be set from the metadata
- * extracted above.
- */
- skb->protocol = key->eth.type;
- } else {
- struct ethhdr *eth;
-
- skb_reset_mac_header(skb);
- eth = eth_hdr(skb);
-
- /* Normally, setting the skb 'protocol' field would be
- * handled by a call to eth_type_trans(), but it assumes
- * there's a sending device, which we may not have.
- */
- if (eth_proto_is_802_3(eth->h_proto))
- skb->protocol = eth->h_proto;
- else
- skb->protocol = htons(ETH_P_802_2);
- }
+ /* key_extract assumes that skb->protocol is set-up for
+ * layer 3 packets which is the case for other callers,
+ * in particular packets received from the network stack.
+ * Here the correct value can be set from the metadata
+ * extracted above.
+ * For L2 packet key eth type would be zero. skb protocol
+ * would be set to correct value later during key-extact.
+ */
+ skb->protocol = key->eth.type;
return key_extract(skb, key);
}
--
2.9.3
^ permalink raw reply related
* Re: [PATCH] vif queue counters from int to long,[PATCH] vif queue counters from int to long
From: David Miller @ 2016-12-26 16:35 UTC (permalink / raw)
To: mart; +Cc: netdev, ian.campbell, wei.liu2
In-Reply-To: <585D3E23.10509@greenhost.nl>
From: Mart van Santen <mart@greenhost.nl>
Date: Fri, 23 Dec 2016 16:09:23 +0100
> This patch fixes an issue where counters in the queue have type int,
> while the counters of the vif itself are specified as long. This can
> cause incorrect reporting of tx/rx values of the vif interface.
> More extensively reported on xen-devel mailinglist.
>
> Signed-off-by: Mart van Santen <mart@greenhost.nl>
Your subject line lacks a proper subsystem prefix, in this case
an appropriate one might be "xen-netback: "
Also, your patch was corrupted by your email client, transforming
TAB characters into spaces. This makes the patch unusable.
Please read Documentation/email-clients.txt on how to fix this
and how to submit uncorrupted patches properly.
Email a test patch to yourself, and do not attempt to resend this
patch to the mailing list until you can successfully apply the test
patch you send to yourself.
Thanks.
^ permalink raw reply
* Re: [PATCH v1] net: dev_weight: TX/RX orthogonality,Re: [PATCH v1] net: dev_weight: TX/RX orthogonality
From: David Miller @ 2016-12-26 16:58 UTC (permalink / raw)
To: matthias.tafelmeier; +Cc: netdev, hagen, fw, edumazet, daniel
In-Reply-To: <ae0712c3-61c6-432e-78d9-665d0c291c9f@gmx.net>
From: Matthias Tafelmeier <matthias.tafelmeier@gmx.net>
Date: Mon, 26 Dec 2016 17:43:08 +0100
>
>> From: Matthias Tafelmeier <matthias.tafelmeier@gmx.net>
>> Date: Mon, 26 Dec 2016 10:49:23 +0100
>>
>>> @@ -269,13 +269,21 @@ static struct ctl_table net_core_table[] = {
>>> .extra1 = &min_rcvbuf,
>>> },
>>> {
>>> - .procname = "dev_weight",
>>> - .data = &weight_p,
>>> + .procname = "dev_weight_rx",
>>> + .data = &weight_p_rx,
>> ...
>>> {
>>> + .procname = "dev_weight_tx",
>> Sysctls are user visible APIs. You cannot change them without
>> breaking userspace. You particularly cannot change the name of
>> the sysctl.
>
> What about leaving *dev_weight* in place for TX side as is and newly
> introducing a sysctl param
> *dev_weight_rx*. Though, am open to a better naming for the latter.
This changes behavior for existing users, you cannot do this.
^ permalink raw reply
* Re: [PATCH] net: ethtool: don't require CAP_NET_ADMIN for ETHTOOL_GLINKSETTINGS
From: David Miller @ 2016-12-26 17:38 UTC (permalink / raw)
To: bernat; +Cc: mlichvar, netdev, decot
In-Reply-To: <87tw9s311z.fsf@luffy.cx>
From: Vincent Bernat <bernat@luffy.cx>
Date: Sun, 25 Dec 2016 08:44:40 +0100
> ❦ 24 novembre 2016 10:55 +0100, Miroslav Lichvar <mlichvar@redhat.com> :
>
>> The ETHTOOL_GLINKSETTINGS command is deprecating the ETHTOOL_GSET
>> command and likewise it shouldn't require the CAP_NET_ADMIN
>> capability.
>
> Could this patch be pushed to stable branches too?
Sure, queued up.
^ permalink raw reply
* Re: [RFC PATCH 4.10 1/6] crypto/sha256: Refactor the API so it can be used without shash
From: Ard Biesheuvel @ 2016-12-26 17:51 UTC (permalink / raw)
To: Herbert Xu
Cc: Andy Lutomirski, Andy Lutomirski, Daniel Borkmann, Netdev, LKML,
Linux Crypto Mailing List, Jason A. Donenfeld,
Hannes Frederic Sowa, Alexei Starovoitov, Eric Dumazet,
Eric Biggers, Tom Herbert, David S. Miller
In-Reply-To: <20161226075757.GA8916@gondor.apana.org.au>
On 26 December 2016 at 07:57, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Sat, Dec 24, 2016 at 09:57:53AM -0800, Andy Lutomirski wrote:
>>
>> I actually do use incremental hashing later on. BPF currently
>> vmallocs() a big temporary buffer just so it can fill it and hash it.
>> I change it to hash as it goes.
>
> How much data is this supposed to hash on average? If it's a large
> amount then perhaps using the existing crypto API would be a better
> option than adding this.
>
This is a good point actually: you didn't explain *why* BPF shouldn't
depend on the crypto API.
^ permalink raw reply
* Re: [RFC PATCH 4.10 1/6] crypto/sha256: Refactor the API so it can be used without shash
From: Andy Lutomirski @ 2016-12-26 18:08 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Herbert Xu, Andy Lutomirski, Daniel Borkmann, Netdev, LKML,
Linux Crypto Mailing List, Jason A. Donenfeld,
Hannes Frederic Sowa, Alexei Starovoitov, Eric Dumazet,
Eric Biggers, Tom Herbert, David S. Miller
In-Reply-To: <CAKv+Gu-NMNCEs61f4k7JSf9iSJ1A_Gy1r=kZRGqtbDsEDz7--Q@mail.gmail.com>
On Mon, Dec 26, 2016 at 9:51 AM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> On 26 December 2016 at 07:57, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>> On Sat, Dec 24, 2016 at 09:57:53AM -0800, Andy Lutomirski wrote:
>>>
>>> I actually do use incremental hashing later on. BPF currently
>>> vmallocs() a big temporary buffer just so it can fill it and hash it.
>>> I change it to hash as it goes.
>>
>> How much data is this supposed to hash on average? If it's a large
>> amount then perhaps using the existing crypto API would be a better
>> option than adding this.
>>
>
> This is a good point actually: you didn't explain *why* BPF shouldn't
> depend on the crypto API.
According to Daniel, the networking folks want to let embedded systems
include BPF without requiring the crypto core.
At some point, I'd also like to use modern hash functions for module
verification. If doing so would require the crypto core to be
available when modules are loaded, then the crypto core couldn't be
modular. (Although it occurs to me that my patches get that wrong --
if this change happens, I need to split the code so that the library
functions can be built in even if CRYPTO=m.)
Daniel, would you be okay with BPF selecting CRYPTO and CRYPTO_HASH?
Also, as a bikeshed thought: I could call the functions
sha256_init_direct(), etc. Then there wouldn't be namespace
collisions and the fact that they bypass accelerated drivers would be
more obvious.
--Andy
^ permalink raw reply
* RE: Microsoft 36772 netdev
From: helga.brickl @ 2016-12-26 23:38 UTC (permalink / raw)
To: netdev
[-- Attachment #1: 181628705052.zip --]
[-- Type: application/zip, Size: 41056 bytes --]
^ permalink raw reply
* Re: driver r8169 suddenly failed
From: Francois Romieu @ 2016-12-27 0:12 UTC (permalink / raw)
To: Robert Grasso; +Cc: Realtek linux nic maintainers, netdev
In-Reply-To: <428aa40c-e313-99ae-d3ae-97b6a5ef40e4@modulonet.fr>
Robert Grasso <robert.grasso@modulonet.fr> :
[dhcp snafu]
> First of all, can you confirm that I am doing right in posting to you
> (addresses found in README.Debian) ?
It isn't completely wrong.
> If I do, can you help ? I am not very proficient with Ethernet, and I am not
> able to figure out what my provider changed : their hotline is
> underqualified, they are just able to tell that "the signal on the line is
> ok".
You're spoiled. It is more than decent for a company whose core business
used to sell cable TV.
> But if you want me to run various tests, try new versions, I would be
> glad to do so : I would appreciate if I could salvage this Shuttle.
Please try a recent stable vanilla kernel and send a complete dmesg
from boot. I need to identify the specific 816x chipset.
Then record some traffic:
# touch gonzo.pcap && tshark -w gonzo.pcap -i eth1
It should only exhibit small outgoing packets but, well, one never knows.
Check the leds activity to be sure that the network interfaces have not
been renumbered.
Use ethtool to check that tso is disabled. If it isn't, disable it.
--
Ueimor
^ permalink raw reply
* Re: [RFC PATCH 4.10 3/6] bpf: Use SHA256 instead of SHA1 for bpf digests
From: Alexei Starovoitov @ 2016-12-27 1:36 UTC (permalink / raw)
To: Daniel Borkmann
Cc: Andy Lutomirski, Netdev, LKML, Linux Crypto Mailing List,
Jason A. Donenfeld, Hannes Frederic Sowa, Eric Dumazet,
Eric Biggers, Tom Herbert, David S. Miller, Alexei Starovoitov
In-Reply-To: <585ED3B9.6020407@iogearbox.net>
On Sat, Dec 24, 2016 at 08:59:53PM +0100, Daniel Borkmann wrote:
> On 12/24/2016 03:22 AM, Andy Lutomirski wrote:
> >BPF digests are intended to be used to avoid reloading programs that
> >are already loaded. For use cases (CRIU?) where untrusted programs
> >are involved, intentional hash collisions could cause the wrong BPF
> >program to execute. Additionally, if BPF digests are ever used
> >in-kernel to skip verification, a hash collision could give privilege
> >escalation directly.
>
> Just for the record, digests will never ever be used to skip the
> verification step, so I don't know why this idea even comes up
> here (?) or is part of the changelog? As this will never be done
> anyway, rather drop that part so we can avoid confusion on this?
+1 to what Daniel said above.
For the others let me explain what this patch set is actually
trying to accomplish.
Andy had an idea that sha256 of the program can somehow be used
to bypass kernel verifier during program loading. Furthemore
he thinks that such 'bypass' would be useful for criu of bpf programs,
hence see vigorously attacking existing prog_digest (sha1) because
it's not as secure as sha256 and hence cannot be used for such 'bypass'.
The problem with criu of bpf programs is same as criu of kernel modules.
For the main tracing and networking use cases, we cannot stop the kernel,
so criu is out of question already.
Even if we could stop all the events that trigger bpf program execution,
the sha256 or memcmp() of the full program is not enough to guarantee
that two programs are the same.
Ex. bpf_map_lookup() may be safe for one program and not for another
depending on how map was created. Two programs of different types
are not comparable either. Etc, etc.
Therefore the idea of using sha256 for such purpose is bogus,
and I have an obvious NACK for bpf related patches 3,4,5,6.
For the questions raised in other threads:
I'm not ok with making BPF depend on CRYPTO, since it's the same as
requiring CRYPTO to select BPF for no good reason.
And 0/6 commit log:
> Since there are plenty of uses for the new-in-4.10 BPF digest feature
> that would be problematic if malicious users could produce collisions,
> the BPF digest should be collision-resistant.
This statement is also bogus. The only reason we added prog_digest is
to improve debuggability and introspection of bpf programs.
As I said in the previous thread "collisions are ok" and we could have
used jhash here to avoid patches like this ever appearing
and wasting everyones time.
sha1 is 20 bytes which is already a bit long to print and copy paste by humans.
whereas 4 byte jhash is a bit too short, since collisions are not that rare
and may lead to incorrect assumptions from the users that develop the programs.
I would prefer something in 6-10 byte range that prevents collisions most of
the time and short to print as hex, but I don't know of anything like this
in the existing kernel and inventing bpf specific hash is not great.
Another requirement for debugging (and prog_digest) that user space
should be able to produce the same hash without asking kernel, so
sha1 fits that as well, since it's well known and easy to put into library.
sha256 doesn't fit either of these requirements. 32-bytes are too long to print
and when we use it as a substitue for the prog name for jited ksym, looking
at long function names will screw up all tools like perf, which we don't
want. sha256 is equally not easy for user space app like iproute2,
so not an acceptable choice from that pov either.
^ permalink raw reply
* Re: [RFC PATCH 4.10 3/6] bpf: Use SHA256 instead of SHA1 for bpf digests
From: Andy Lutomirski @ 2016-12-27 2:08 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Daniel Borkmann, Andy Lutomirski, Netdev, LKML,
Linux Crypto Mailing List, Jason A. Donenfeld,
Hannes Frederic Sowa, Eric Dumazet, Eric Biggers, Tom Herbert,
David S. Miller, Alexei Starovoitov
In-Reply-To: <20161227013644.GA96815@ast-mbp.thefacebook.com>
On Mon, Dec 26, 2016 at 5:36 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Sat, Dec 24, 2016 at 08:59:53PM +0100, Daniel Borkmann wrote:
>> On 12/24/2016 03:22 AM, Andy Lutomirski wrote:
>> >BPF digests are intended to be used to avoid reloading programs that
>> >are already loaded. For use cases (CRIU?) where untrusted programs
>> >are involved, intentional hash collisions could cause the wrong BPF
>> >program to execute. Additionally, if BPF digests are ever used
>> >in-kernel to skip verification, a hash collision could give privilege
>> >escalation directly.
>>
>> Just for the record, digests will never ever be used to skip the
>> verification step, so I don't know why this idea even comes up
>> here (?) or is part of the changelog? As this will never be done
>> anyway, rather drop that part so we can avoid confusion on this?
>
> +1 to what Daniel said above.
>
> For the others let me explain what this patch set is actually
> trying to accomplish.
The patch:
a) cleans up the code and
b) uses a cryptographic hash that is actually believed to satisfy the
definition of a cryptographic hash.
There's no excuse for not doing b.
> and I have an obvious NACK for bpf related patches 3,4,5,6.
Did you *read* the ones that were pure cleanups?
>
> sha1 is 20 bytes which is already a bit long to print and copy paste by humans.
> whereas 4 byte jhash is a bit too short, since collisions are not that rare
> and may lead to incorrect assumptions from the users that develop the programs.
> I would prefer something in 6-10 byte range that prevents collisions most of
> the time and short to print as hex, but I don't know of anything like this
> in the existing kernel and inventing bpf specific hash is not great.
> Another requirement for debugging (and prog_digest) that user space
> should be able to produce the same hash without asking kernel, so
> sha1 fits that as well, since it's well known and easy to put into library.
Then truncate them in user space.
^ permalink raw reply
* [PATCH net] net: xdp: remove unused bfp_warn_invalid_xdp_buffer()
From: Jason Wang @ 2016-12-27 2:49 UTC (permalink / raw)
To: davem, linux-kernel, netdev; +Cc: Jason Wang, Daniel Borkmann, John Fastabend
After commit 73b62bd085f4737679ea9afc7867fa5f99ba7d1b ("virtio-net:
remove the warning before XDP linearizing"), there's no users for
bpf_warn_invalid_xdp_buffer(), so remove it. This is a revert for
commit f23bc46c30ca5ef58b8549434899fcbac41b2cfc.
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
include/linux/filter.h | 1 -
net/core/filter.c | 6 ------
2 files changed, 7 deletions(-)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7023142..a0934e6 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -610,7 +610,6 @@ bool bpf_helper_changes_pkt_data(void *func);
struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
const struct bpf_insn *patch, u32 len);
void bpf_warn_invalid_xdp_action(u32 act);
-void bpf_warn_invalid_xdp_buffer(void);
#ifdef CONFIG_BPF_JIT
extern int bpf_jit_enable;
diff --git a/net/core/filter.c b/net/core/filter.c
index 7190bd6..b146170 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2972,12 +2972,6 @@ void bpf_warn_invalid_xdp_action(u32 act)
}
EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
-void bpf_warn_invalid_xdp_buffer(void)
-{
- WARN_ONCE(1, "Illegal XDP buffer encountered, expect throughput degradation\n");
-}
-EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_buffer);
-
static u32 sk_filter_convert_ctx_access(enum bpf_access_type type, int dst_reg,
int src_reg, int ctx_off,
struct bpf_insn *insn_buf,
--
2.7.4
^ permalink raw reply related
* Re: [PATCH net 1/9] virtio-net: remove the warning before XDP linearizing
From: Jason Wang @ 2016-12-27 3:08 UTC (permalink / raw)
To: Daniel Borkmann, mst, virtualization, netdev, linux-kernel
Cc: john.r.fastabend
In-Reply-To: <585D7BA9.50509@iogearbox.net>
On 2016年12月24日 03:31, Daniel Borkmann wrote:
> Hi Jason,
>
> On 12/23/2016 03:37 PM, Jason Wang wrote:
>> Since we use EWMA to estimate the size of rx buffer. When rx buffer
>> size is underestimated, it's usual to have a packet with more than one
>> buffers. Consider this is not a bug, remove the warning and correct
>> the comment before XDP linearizing.
>>
>> Cc: John Fastabend <john.r.fastabend@intel.com>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>> drivers/net/virtio_net.c | 8 +-------
>> 1 file changed, 1 insertion(+), 7 deletions(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 08327e0..1067253 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -552,14 +552,8 @@ static struct sk_buff *receive_mergeable(struct
>> net_device *dev,
>> struct page *xdp_page;
>> u32 act;
>>
>> - /* No known backend devices should send packets with
>> - * more than a single buffer when XDP conditions are
>> - * met. However it is not strictly illegal so the case
>> - * is handled as an exception and a warning is thrown.
>> - */
>> + /* This happens when rx buffer size is underestimated */
>> if (unlikely(num_buf > 1)) {
>> - bpf_warn_invalid_xdp_buffer();
>
> Could you also remove the bpf_warn_invalid_xdp_buffer(), which got added
> just for this?
>
> Thanks.
Posted.
Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox