From: Leon Romanovsky <leon@kernel.org>
To: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: jgg@ziepe.ca, Dean Luick <dean.luick@cornelisnetworks.com>,
Breandan Cunningham <brendan.cunningham@cornelisnetworks.com>,
Douglas Miller <doug.miller@cornelisnetworks.com>,
linux-rdma@vger.kernel.org
Subject: Re: [PATCH for-next resend 09/24] RDMA/hfi2: Add initialization and firmware support
Date: Wed, 18 Mar 2026 11:14:09 +0200 [thread overview]
Message-ID: <20260318091409.GG61385@unreal> (raw)
In-Reply-To: <177325167086.57064.11403114326044529507.stgit@awdrv-04.cornelisnetworks.com>
On Wed, Mar 11, 2026 at 01:54:30PM -0400, Dennis Dalessandro wrote:
> Add device initialization, firmware loading and management, and CPU
> affinity support.
>
> Co-developed-by: Dean Luick <dean.luick@cornelisnetworks.com>
> Signed-off-by: Dean Luick <dean.luick@cornelisnetworks.com>
> Co-developed-by: Bendan Cunningham <brendan.cunningham@cornelisnetworks.com>
> Signed-off-by: Breandan Cunningham <brendan.cunningham@cornelisnetworks.com>
> Co-developed-by: Douglas Miller <doug.miller@cornelisnetworks.com>
> Signed-off-by: Douglas Miller <doug.miller@cornelisnetworks.com>
> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
> ---
> drivers/infiniband/hw/hfi2/affinity.c | 1194 +++++++++++++
Let's put affinity code aside. Can we start from bare minimum driver?
Thanks
> + }
> +
> + if (set) {
> + cpumask_andnot(&set->used, &set->used, &msix->mask);
> + _cpu_mask_set_gen_dec(set);
> + }
> +
> + irq_set_affinity_hint(msix->irq, NULL);
> + cpumask_clear(&msix->mask);
> + mutex_unlock(&node_affinity.lock);
> +}
> +
> +int hfi2_get_proc_affinity(int node)
> +{
> + int cpu = -1, ret, i;
> + struct hfi2_affinity_node *entry;
> + cpumask_var_t diff, hw_thread_mask, available_mask, intrs_mask;
> + const struct cpumask *node_mask,
> + *proc_mask = current->cpus_ptr;
> + struct hfi2_affinity_node_list *affinity = &node_affinity;
> + struct cpu_mask_set *set = &affinity->proc;
> + int pruned;
> +
> + /*
> + * check whether process/context affinity has already
> + * been set
> + */
> + if (current->nr_cpus_allowed == 1) {
> + hfi2_cdbg(PROC, "PID %u %s affinity set to CPU %*pbl",
> + current->pid, current->comm,
> + cpumask_pr_args(proc_mask));
> + /*
> + * Mark the pre-set CPU as used. This is atomic so we don't
> + * need the lock
> + */
> + cpu = cpumask_first(proc_mask);
> + cpumask_set_cpu(cpu, &set->used);
> + goto done;
> + } else if (current->nr_cpus_allowed < cpumask_weight(&set->mask)) {
> + hfi2_cdbg(PROC, "PID %u %s affinity set to CPU set(s) %*pbl",
> + current->pid, current->comm,
> + cpumask_pr_args(proc_mask));
> + goto done;
> + }
> +
> + /*
> + * The process does not have a preset CPU affinity so find one to
> + * recommend using the following algorithm:
> + *
> + * For each user process that is opening a context on HFI Y:
> + * a) If all cores are filled, reinitialize the bitmask
> + * b) Fill real cores first, then HT cores (First set of HT
> + * cores on all physical cores, then second set of HT core,
> + * and, so on) in the following order:
> + *
> + * 1. Same NUMA node as HFI Y and not running an IRQ
> + * handler
> + * 2. Same NUMA node as HFI Y and running an IRQ handler
> + * 3. Different NUMA node to HFI Y and not running an IRQ
> + * handler
> + * 4. Different NUMA node to HFI Y and running an IRQ
> + * handler
> + * c) Mark core as filled in the bitmask. As user processes are
> + * done, clear cores from the bitmask.
> + */
> +
> + ret = zalloc_cpumask_var(&diff, GFP_KERNEL);
> + if (!ret)
> + goto done;
> + ret = zalloc_cpumask_var(&hw_thread_mask, GFP_KERNEL);
> + if (!ret)
> + goto free_diff;
> + ret = zalloc_cpumask_var(&available_mask, GFP_KERNEL);
> + if (!ret)
> + goto free_hw_thread_mask;
> + ret = zalloc_cpumask_var(&intrs_mask, GFP_KERNEL);
> + if (!ret)
> + goto free_available_mask;
> +
> + mutex_lock(&affinity->lock);
> + /*
> + * If we've used all available HW threads, clear the mask and start
> + * overloading.
> + */
> + _cpu_mask_set_gen_inc(set);
> +
> + /*
> + * If NUMA node has CPUs used by interrupt handlers, include them in the
> + * interrupt handler mask.
> + */
> + entry = node_affinity_lookup(node);
> + if (entry) {
> + cpumask_copy(intrs_mask, (entry->def_intr.gen ?
> + &entry->def_intr.mask :
> + &entry->def_intr.used));
> + cpumask_or(intrs_mask, intrs_mask, (entry->rcv_intr.gen ?
> + &entry->rcv_intr.mask :
> + &entry->rcv_intr.used));
> + cpumask_or(intrs_mask, intrs_mask, &entry->general_intr_mask);
> + }
> + hfi2_cdbg(PROC, "CPUs used by interrupts: %*pbl",
> + cpumask_pr_args(intrs_mask));
> +
> +
> + /*
> + * If HT cores are enabled, identify which HW threads within the
> + * physical cores should be used.
> + *
> + * Start with affinity mask but prune HT/SMT threads. If all HW threads
> + * are in use, then try again with all threads in mask, but only if
> + * threads were pruned before the first step.
> + */
> + cpumask_copy(hw_thread_mask, &affinity->proc.mask);
> + pruned = clear_ht_siblings(hw_thread_mask);
> + for (i = 0; i < 2; i++) {
> + /*
> + * diff will always be not empty at least once in this
> + * loop as the used mask gets reset when
> + * (set->mask == set->used) before this loop.
> + */
> + cpumask_andnot(diff, hw_thread_mask, &set->used);
> + if (!cpumask_empty(diff) || !pruned)
> + break;
> + cpumask_copy(hw_thread_mask, &affinity->proc.mask);
> + }
> + hfi2_cdbg(PROC, "Same available HW thread on all physical CPUs: %*pbl",
> + cpumask_pr_args(hw_thread_mask));
> +
> + node_mask = cpumask_of_node(node);
> + hfi2_cdbg(PROC, "Device on NUMA %u, CPUs %*pbl", node,
> + cpumask_pr_args(node_mask));
> +
> + /* Get cpumask of available CPUs on preferred NUMA */
> + cpumask_and(available_mask, hw_thread_mask, node_mask);
> + cpumask_andnot(available_mask, available_mask, &set->used);
> + hfi2_cdbg(PROC, "Available CPUs on NUMA %u: %*pbl", node,
> + cpumask_pr_args(available_mask));
> +
> + /*
> + * At first, we don't want to place processes on the same
> + * CPUs as interrupt handlers. Then, CPUs running interrupt
> + * handlers are used.
> + *
> + * 1) If diff is not empty, then there are CPUs not running
> + * non-interrupt handlers available, so diff gets copied
> + * over to available_mask.
> + * 2) If diff is empty, then all CPUs not running interrupt
> + * handlers are taken, so available_mask contains all
> + * available CPUs running interrupt handlers.
> + * 3) If available_mask is empty, then all CPUs on the
> + * preferred NUMA node are taken, so other NUMA nodes are
> + * used for process assignments using the same method as
> + * the preferred NUMA node.
> + */
> + if (cpumask_andnot(diff, available_mask, intrs_mask))
> + cpumask_copy(available_mask, diff);
> +
> + /* If we don't have CPUs on the preferred node, use other NUMA nodes */
> + if (cpumask_empty(available_mask)) {
> + cpumask_andnot(available_mask, hw_thread_mask, &set->used);
> + /* Excluding preferred NUMA cores */
> + cpumask_andnot(available_mask, available_mask, node_mask);
> + hfi2_cdbg(PROC,
> + "Preferred NUMA node cores are taken, cores available in other NUMA nodes: %*pbl",
> + cpumask_pr_args(available_mask));
> +
> + /*
> + * At first, we don't want to place processes on the same
> + * CPUs as interrupt handlers.
> + */
> + if (cpumask_andnot(diff, available_mask, intrs_mask))
> + cpumask_copy(available_mask, diff);
> + }
> + hfi2_cdbg(PROC, "Possible CPUs for process: %*pbl",
> + cpumask_pr_args(available_mask));
> +
> + cpu = cpumask_first(available_mask);
> + if (cpu >= nr_cpu_ids) /* empty */
> + cpu = -1;
> + else
> + cpumask_set_cpu(cpu, &set->used);
> +
> + mutex_unlock(&affinity->lock);
> + hfi2_cdbg(PROC, "Process assigned to CPU %d", cpu);
> +
> + free_cpumask_var(intrs_mask);
> +free_available_mask:
> + free_cpumask_var(available_mask);
> +free_hw_thread_mask:
> + free_cpumask_var(hw_thread_mask);
> +free_diff:
> + free_cpumask_var(diff);
> +done:
> + return cpu;
> +}
> +
> +void hfi2_put_proc_affinity(int cpu)
> +{
> + struct hfi2_affinity_node_list *affinity = &node_affinity;
> + struct cpu_mask_set *set = &affinity->proc;
> +
> + if (cpu < 0)
> + return;
> +
> + mutex_lock(&affinity->lock);
> + cpu_mask_set_put(set, cpu);
> + hfi2_cdbg(PROC, "Returning CPU %d for future process assignment", cpu);
> + mutex_unlock(&affinity->lock);
> +}
> diff --git a/drivers/infiniband/hw/hfi2/chip.c b/drivers/infiniband/hw/hfi2/chip.c
> index c9012ea0b970..7547058acb29 100644
> --- a/drivers/infiniband/hw/hfi2/chip.c
> +++ b/drivers/infiniband/hw/hfi2/chip.c
> @@ -12141,13 +12141,13 @@ int wfr_early_per_chip_init(struct hfi2_devdata *dd)
> if (ret)
> return ret;
>
> - /* call before get_platform_config(), after init_chip_resources() */
> + /* call before hfi2_get_platform_config(), after init_chip_resources() */
> ret = eprom_init(dd);
> if (ret)
> return ret;
>
> /* Needs to be called before hfi2_firmware_init */
> - get_platform_config(&dd->pport[HFI2_PORT_IDX]);
> + hfi2_get_platform_config(&dd->pport[HFI2_PORT_IDX]);
>
> /* read in firmware */
> ret = hfi2_firmware_init(dd);
> diff --git a/drivers/infiniband/hw/hfi2/firmware.c b/drivers/infiniband/hw/hfi2/firmware.c
> new file mode 100644
> index 000000000000..1a94622ba09e
> --- /dev/null
> +++ b/drivers/infiniband/hw/hfi2/firmware.c
> @@ -0,0 +1,2267 @@
> +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
> +/*
> + * Copyright(c) 2015 - 2017 Intel Corporation.
> + * Copyright(c) 2025-2026 Cornelis Networks, Inc.
> + */
> +
> +#include <linux/firmware.h>
> +#include <linux/mutex.h>
> +#include <linux/delay.h>
> +#include <linux/crc32.h>
> +
> +#include "hfi2.h"
> +#include "trace.h"
> +
> +/*
> + * Make it easy to toggle firmware file name and if it gets loaded by
> + * editing the following. This may be something we do while in development
> + * but not necessarily something a user would ever need to use.
> + */
> +#define DEFAULT_FW_8051_NAME_FPGA "hfi_dc8051.bin"
> +#define DEFAULT_FW_8051_NAME_ASIC "hfi2_dc8051.fw"
> +#define DEFAULT_FW_FABRIC_NAME "hfi2_fabric.fw"
> +#define DEFAULT_FW_SBUS_NAME "hfi2_sbus.fw"
> +#define DEFAULT_FW_PCIE_NAME "hfi2_pcie.fw"
> +#define ALT_FW_8051_NAME_ASIC "hfi2_dc8051_d.fw"
> +#define ALT_FW_FABRIC_NAME "hfi2_fabric_d.fw"
> +#define ALT_FW_SBUS_NAME "hfi2_sbus_d.fw"
> +#define ALT_FW_PCIE_NAME "hfi2_pcie_d.fw"
> +
> +MODULE_FIRMWARE(DEFAULT_FW_8051_NAME_ASIC);
> +MODULE_FIRMWARE(DEFAULT_FW_FABRIC_NAME);
> +MODULE_FIRMWARE(DEFAULT_FW_SBUS_NAME);
> +MODULE_FIRMWARE(DEFAULT_FW_PCIE_NAME);
> +
> +static uint fw_8051_load = 1;
> +static uint fw_fabric_serdes_load = 1;
> +static uint fw_pcie_serdes_load = 1;
> +static uint fw_sbus_load = 1;
> +
> +/* Firmware file names get set in hfi2_firmware_init() based on the above */
> +static char *fw_8051_name;
> +static char *fw_fabric_serdes_name;
> +static char *fw_sbus_name;
> +static char *fw_pcie_serdes_name;
> +
> +#define SBUS_MAX_POLL_COUNT 100
> +#define SBUS_COUNTER(reg, name) \
> + (((reg) >> ASIC_STS_SBUS_COUNTERS_##name##_CNT_SHIFT) & \
> + ASIC_STS_SBUS_COUNTERS_##name##_CNT_MASK)
> +
> +/*
> + * Firmware security header.
> + */
> +struct css_header {
> + u32 module_type;
> + u32 header_len;
> + u32 header_version;
> + u32 module_id;
> + u32 module_vendor;
> + u32 date; /* BCD yyyymmdd */
> + u32 size; /* in DWORDs */
> + u32 key_size; /* in DWORDs */
> + u32 modulus_size; /* in DWORDs */
> + u32 exponent_size; /* in DWORDs */
> + u32 reserved[22];
> +};
> +
> +/* expected field values */
> +#define CSS_MODULE_TYPE 0x00000006
> +#define CSS_HEADER_LEN 0x000000a1
> +#define CSS_HEADER_VERSION 0x00010000
> +#define CSS_MODULE_VENDOR 0x00008086
> +
> +#define KEY_SIZE 256
> +#define MU_SIZE 8
> +#define EXPONENT_SIZE 4
> +
> +/* size of platform configuration partition */
> +#define MAX_PLATFORM_CONFIG_FILE_SIZE 4096
> +
> +/* size of file of platform configuration encoded in format version 4 */
> +#define PLATFORM_CONFIG_FORMAT_4_FILE_SIZE 528
> +
> +/* the file itself */
> +struct firmware_file {
> + struct css_header css_header;
> + u8 modulus[KEY_SIZE];
> + u8 exponent[EXPONENT_SIZE];
> + u8 signature[KEY_SIZE];
> + u8 firmware[];
> +};
> +
> +struct augmented_firmware_file {
> + struct css_header css_header;
> + u8 modulus[KEY_SIZE];
> + u8 exponent[EXPONENT_SIZE];
> + u8 signature[KEY_SIZE];
> + u8 r2[KEY_SIZE];
> + u8 mu[MU_SIZE];
> + u8 firmware[];
> +};
> +
> +/* augmented file size difference */
> +#define AUGMENT_SIZE (sizeof(struct augmented_firmware_file) - \
> + sizeof(struct firmware_file))
> +
> +struct firmware_details {
> + /* Linux core piece */
> + const struct firmware *fw;
> +
> + struct css_header *css_header;
> + u8 *firmware_ptr; /* pointer to binary data */
> + u32 firmware_len; /* length in bytes */
> + u8 *modulus; /* pointer to the modulus */
> + u8 *exponent; /* pointer to the exponent */
> + u8 *signature; /* pointer to the signature */
> + u8 *r2; /* pointer to r2 */
> + u8 *mu; /* pointer to mu */
> + struct augmented_firmware_file dummy_header;
> +};
> +
> +/*
> + * The mutex protects fw_state, fw_err, and all of the firmware_details
> + * variables.
> + */
> +static DEFINE_MUTEX(fw_mutex);
> +enum fw_state {
> + FW_EMPTY,
> + FW_TRY,
> + FW_FINAL,
> + FW_ERR
> +};
> +
> +static enum fw_state fw_state = FW_EMPTY;
> +static int fw_err;
> +static struct firmware_details fw_8051;
> +static struct firmware_details fw_fabric;
> +static struct firmware_details fw_pcie;
> +static struct firmware_details fw_sbus;
> +
> +/* flags for turn_off_spicos() */
> +#define SPICO_SBUS 0x1
> +#define SPICO_FABRIC 0x2
> +#define ENABLE_SPICO_SMASK 0x1
> +
> +/* security block commands */
> +#define RSA_CMD_INIT 0x1
> +#define RSA_CMD_START 0x2
> +
> +/* security block status */
> +#define RSA_STATUS_IDLE 0x0
> +#define RSA_STATUS_ACTIVE 0x1
> +#define RSA_STATUS_DONE 0x2
> +#define RSA_STATUS_FAILED 0x3
> +
> +/* RSA engine timeout, in ms */
> +#define RSA_ENGINE_TIMEOUT 100 /* ms */
> +
> +/* hardware mutex timeout, in ms */
> +#define HM_TIMEOUT 10 /* ms */
> +
> +/* 8051 memory access timeout, in us */
> +#define DC8051_ACCESS_TIMEOUT 100 /* us */
> +
> +/* the number of fabric SerDes on the SBus */
> +#define NUM_FABRIC_SERDES 4
> +
> +/* ASIC_STS_SBUS_RESULT.RESULT_CODE value */
> +#define SBUS_READ_COMPLETE 0x4
> +
> +/* SBus fabric SerDes addresses, one set per HFI */
> +static const u8 fabric_serdes_addrs[2][NUM_FABRIC_SERDES] = {
> + { 0x01, 0x02, 0x03, 0x04 },
> + { 0x28, 0x29, 0x2a, 0x2b }
> +};
> +
> +/* SBus PCIe SerDes addresses, one set per HFI */
> +static const u8 pcie_serdes_addrs[2][NUM_PCIE_SERDES] = {
> + { 0x08, 0x0a, 0x0c, 0x0e, 0x10, 0x12, 0x14, 0x16,
> + 0x18, 0x1a, 0x1c, 0x1e, 0x20, 0x22, 0x24, 0x26 },
> + { 0x2f, 0x31, 0x33, 0x35, 0x37, 0x39, 0x3b, 0x3d,
> + 0x3f, 0x41, 0x43, 0x45, 0x47, 0x49, 0x4b, 0x4d }
> +};
> +
> +/* SBus PCIe PCS addresses, one set per HFI */
> +const u8 pcie_pcs_addrs[2][NUM_PCIE_SERDES] = {
> + { 0x09, 0x0b, 0x0d, 0x0f, 0x11, 0x13, 0x15, 0x17,
> + 0x19, 0x1b, 0x1d, 0x1f, 0x21, 0x23, 0x25, 0x27 },
> + { 0x30, 0x32, 0x34, 0x36, 0x38, 0x3a, 0x3c, 0x3e,
> + 0x40, 0x42, 0x44, 0x46, 0x48, 0x4a, 0x4c, 0x4e }
> +};
> +
> +/* SBus fabric SerDes broadcast addresses, one per HFI */
> +static const u8 fabric_serdes_broadcast[2] = { 0xe4, 0xe5 };
> +static const u8 all_fabric_serdes_broadcast = 0xe1;
> +
> +/* SBus PCIe SerDes broadcast addresses, one per HFI */
> +const u8 pcie_serdes_broadcast[2] = { 0xe2, 0xe3 };
> +static const u8 all_pcie_serdes_broadcast = 0xe0;
> +
> +static const u32 platform_config_table_limits[PLATFORM_CONFIG_TABLE_MAX] = {
> + 0,
> + SYSTEM_TABLE_MAX,
> + PORT_TABLE_MAX,
> + RX_PRESET_TABLE_MAX,
> + TX_PRESET_TABLE_MAX,
> + QSFP_ATTEN_TABLE_MAX,
> + VARIABLE_SETTINGS_TABLE_MAX
> +};
> +
> +/* forwards */
> +static void dispose_one_firmware(struct firmware_details *fdet);
> +static int load_fabric_serdes_firmware(struct hfi2_devdata *dd,
> + struct firmware_details *fdet);
> +static void dump_fw_version(struct hfi2_devdata *dd);
> +
> +/*
> + * Read a single 64-bit value from 8051 data memory.
> + *
> + * Expects:
> + * o caller to have already set up data read, no auto increment
> + * o caller to turn off read enable when finished
> + *
> + * The address argument is a byte offset. Bits 0:2 in the address are
> + * ignored - i.e. the hardware will always do aligned 8-byte reads as if
> + * the lower bits are zero.
> + *
> + * Return 0 on success, -ENXIO on a read error (timeout).
> + */
> +static int __read_8051_data(struct hfi2_devdata *dd, u32 addr, u64 *result)
> +{
> + u64 reg;
> + int count;
> +
> + /* step 1: set the address, clear enable */
> + reg = (addr & DC_DC8051_CFG_RAM_ACCESS_CTRL_ADDRESS_MASK)
> + << DC_DC8051_CFG_RAM_ACCESS_CTRL_ADDRESS_SHIFT;
> + write_csr(dd, DC_DC8051_CFG_RAM_ACCESS_CTRL, reg);
> + /* step 2: enable */
> + write_csr(dd, DC_DC8051_CFG_RAM_ACCESS_CTRL,
> + reg | DC_DC8051_CFG_RAM_ACCESS_CTRL_READ_ENA_SMASK);
> +
> + /* wait until ACCESS_COMPLETED is set */
> + count = 0;
> + while ((read_csr(dd, DC_DC8051_CFG_RAM_ACCESS_STATUS)
> + & DC_DC8051_CFG_RAM_ACCESS_STATUS_ACCESS_COMPLETED_SMASK)
> + == 0) {
> + count++;
> + if (count > DC8051_ACCESS_TIMEOUT) {
> + dd_dev_err(dd, "timeout reading 8051 data\n");
> + return -ENXIO;
> + }
> + ndelay(10);
> + }
> +
> + /* gather the data */
> + *result = read_csr(dd, DC_DC8051_CFG_RAM_ACCESS_RD_DATA);
> +
> + return 0;
> +}
> +
> +/*
> + * Read 8051 data starting at addr, for len bytes. Will read in 8-byte chunks.
> + * Return 0 on success, -errno on error.
> + */
> +int read_8051_data(struct hfi2_devdata *dd, u32 addr, u32 len, u64 *result)
> +{
> + unsigned long flags;
> + u32 done;
> + int ret = 0;
> +
> + spin_lock_irqsave(&dd->dc8051_memlock, flags);
> +
> + /* data read set-up, no auto-increment */
> + write_csr(dd, DC_DC8051_CFG_RAM_ACCESS_SETUP, 0);
> +
> + for (done = 0; done < len; addr += 8, done += 8, result++) {
> + ret = __read_8051_data(dd, addr, result);
> + if (ret)
> + break;
> + }
> +
> + /* turn off read enable */
> + write_csr(dd, DC_DC8051_CFG_RAM_ACCESS_CTRL, 0);
> +
> + spin_unlock_irqrestore(&dd->dc8051_memlock, flags);
> +
> + return ret;
> +}
> +
> +/*
> + * Write data or code to the 8051 code or data RAM.
> + */
> +static int write_8051(struct hfi2_devdata *dd, int code, u32 start,
> + const u8 *data, u32 len)
> +{
> + u64 reg;
> + u32 offset;
> + int aligned, count;
> +
> + /* check alignment */
> + aligned = ((unsigned long)data & 0x7) == 0;
> +
> + /* write set-up */
> + reg = (code ? DC_DC8051_CFG_RAM_ACCESS_SETUP_RAM_SEL_SMASK : 0ull)
> + | DC_DC8051_CFG_RAM_ACCESS_SETUP_AUTO_INCR_ADDR_SMASK;
> + write_csr(dd, DC_DC8051_CFG_RAM_ACCESS_SETUP, reg);
> +
> + reg = ((start & DC_DC8051_CFG_RAM_ACCESS_CTRL_ADDRESS_MASK)
> + << DC_DC8051_CFG_RAM_ACCESS_CTRL_ADDRESS_SHIFT)
> + | DC_DC8051_CFG_RAM_ACCESS_CTRL_WRITE_ENA_SMASK;
> + write_csr(dd, DC_DC8051_CFG_RAM_ACCESS_CTRL, reg);
> +
> + /* write */
> + for (offset = 0; offset < len; offset += 8) {
> + int bytes = len - offset;
> +
> + if (bytes < 8) {
> + reg = 0;
> + memcpy(®, &data[offset], bytes);
> + } else if (aligned) {
> + reg = *(u64 *)&data[offset];
> + } else {
> + memcpy(®, &data[offset], 8);
> + }
> + write_csr(dd, DC_DC8051_CFG_RAM_ACCESS_WR_DATA, reg);
> +
> + /* wait until ACCESS_COMPLETED is set */
> + count = 0;
> + while ((read_csr(dd, DC_DC8051_CFG_RAM_ACCESS_STATUS)
> + & DC_DC8051_CFG_RAM_ACCESS_STATUS_ACCESS_COMPLETED_SMASK)
> + == 0) {
> + count++;
> + if (count > DC8051_ACCESS_TIMEOUT) {
> + dd_dev_err(dd, "timeout writing 8051 data\n");
> + return -ENXIO;
> + }
> + udelay(1);
> + }
> + }
> +
> + /* turn off write access, auto increment (also sets to data access) */
> + write_csr(dd, DC_DC8051_CFG_RAM_ACCESS_CTRL, 0);
> + write_csr(dd, DC_DC8051_CFG_RAM_ACCESS_SETUP, 0);
> +
> + return 0;
> +}
> +
> +/* return 0 if values match, non-zero and complain otherwise */
> +static int invalid_header(struct hfi2_devdata *dd, const char *what,
> + u32 actual, u32 expected)
> +{
> + if (actual == expected)
> + return 0;
> +
> + dd_dev_err(dd,
> + "invalid firmware header field %s: expected 0x%x, actual 0x%x\n",
> + what, expected, actual);
> + return 1;
> +}
> +
> +/*
> + * Verify that the static fields in the CSS header match.
> + */
> +static int verify_css_header(struct hfi2_devdata *dd, struct css_header *css)
> +{
> + /* verify CSS header fields (most sizes are in DW, so add /4) */
> + if (invalid_header(dd, "module_type", css->module_type,
> + CSS_MODULE_TYPE) ||
> + invalid_header(dd, "header_len", css->header_len,
> + (sizeof(struct firmware_file) / 4)) ||
> + invalid_header(dd, "header_version", css->header_version,
> + CSS_HEADER_VERSION) ||
> + invalid_header(dd, "module_vendor", css->module_vendor,
> + CSS_MODULE_VENDOR) ||
> + invalid_header(dd, "key_size", css->key_size, KEY_SIZE / 4) ||
> + invalid_header(dd, "modulus_size", css->modulus_size,
> + KEY_SIZE / 4) ||
> + invalid_header(dd, "exponent_size", css->exponent_size,
> + EXPONENT_SIZE / 4)) {
> + return -EINVAL;
> + }
> + return 0;
> +}
> +
> +/*
> + * Make sure there are at least some bytes after the prefix.
> + */
> +static int payload_check(struct hfi2_devdata *dd, const char *name,
> + long file_size, long prefix_size)
> +{
> + /* make sure we have some payload */
> + if (prefix_size >= file_size) {
> + dd_dev_err(dd,
> + "firmware \"%s\", size %ld, must be larger than %ld bytes\n",
> + name, file_size, prefix_size);
> + return -EINVAL;
> + }
> +
> + return 0;
> +}
> +
> +/*
> + * Request the firmware from the system. Extract the pieces and fill in
> + * fdet. If successful, the caller will need to call dispose_one_firmware().
> + * Returns 0 on success, -ERRNO on error.
> + */
> +static int obtain_one_firmware(struct hfi2_devdata *dd, const char *name,
> + struct firmware_details *fdet)
> +{
> + struct css_header *css;
> + int ret;
> +
> + memset(fdet, 0, sizeof(*fdet));
> +
> + ret = request_firmware(&fdet->fw, name, &dd->pcidev->dev);
> + if (ret) {
> + dd_dev_warn(dd, "cannot find firmware \"%s\", err %d\n",
> + name, ret);
> + return ret;
> + }
> +
> + /* verify the firmware */
> + if (fdet->fw->size < sizeof(struct css_header)) {
> + dd_dev_err(dd, "firmware \"%s\" is too small\n", name);
> + ret = -EINVAL;
> + goto done;
> + }
> + css = (struct css_header *)fdet->fw->data;
> +
> + hfi2_cdbg(FIRMWARE, "Firmware %s details:", name);
> + hfi2_cdbg(FIRMWARE, "file size: 0x%lx bytes", fdet->fw->size);
> + hfi2_cdbg(FIRMWARE, "CSS structure:");
> + hfi2_cdbg(FIRMWARE, " module_type 0x%x", css->module_type);
> + hfi2_cdbg(FIRMWARE, " header_len 0x%03x (0x%03x bytes)",
> + css->header_len, 4 * css->header_len);
> + hfi2_cdbg(FIRMWARE, " header_version 0x%x", css->header_version);
> + hfi2_cdbg(FIRMWARE, " module_id 0x%x", css->module_id);
> + hfi2_cdbg(FIRMWARE, " module_vendor 0x%x", css->module_vendor);
> + hfi2_cdbg(FIRMWARE, " date 0x%x", css->date);
> + hfi2_cdbg(FIRMWARE, " size 0x%03x (0x%03x bytes)",
> + css->size, 4 * css->size);
> + hfi2_cdbg(FIRMWARE, " key_size 0x%03x (0x%03x bytes)",
> + css->key_size, 4 * css->key_size);
> + hfi2_cdbg(FIRMWARE, " modulus_size 0x%03x (0x%03x bytes)",
> + css->modulus_size, 4 * css->modulus_size);
> + hfi2_cdbg(FIRMWARE, " exponent_size 0x%03x (0x%03x bytes)",
> + css->exponent_size, 4 * css->exponent_size);
> + hfi2_cdbg(FIRMWARE, "firmware size: 0x%lx bytes",
> + fdet->fw->size - sizeof(struct firmware_file));
> +
> + /*
> + * If the file does not have a valid CSS header, fail.
> + * Otherwise, check the CSS size field for an expected size.
> + * The augmented file has r2 and mu inserted after the header
> + * was generated, so there will be a known difference between
> + * the CSS header size and the actual file size. Use this
> + * difference to identify an augmented file.
> + *
> + * Note: css->size is in DWORDs, multiply by 4 to get bytes.
> + */
> + ret = verify_css_header(dd, css);
> + if (ret) {
> + dd_dev_info(dd, "Invalid CSS header for \"%s\"\n", name);
> + } else if ((css->size * 4) == fdet->fw->size) {
> + /* non-augmented firmware file */
> + struct firmware_file *ff = (struct firmware_file *)
> + fdet->fw->data;
> +
> + /* make sure there are bytes in the payload */
> + ret = payload_check(dd, name, fdet->fw->size,
> + sizeof(struct firmware_file));
> + if (ret == 0) {
> + fdet->css_header = css;
> + fdet->modulus = ff->modulus;
> + fdet->exponent = ff->exponent;
> + fdet->signature = ff->signature;
> + fdet->r2 = fdet->dummy_header.r2; /* use dummy space */
> + fdet->mu = fdet->dummy_header.mu; /* use dummy space */
> + fdet->firmware_ptr = ff->firmware;
> + fdet->firmware_len = fdet->fw->size -
> + sizeof(struct firmware_file);
> + /*
> + * Header does not include r2 and mu - generate here.
> + * For now, fail.
> + */
> + dd_dev_err(dd, "driver is unable to validate firmware without r2 and mu (not in firmware file)\n");
> + ret = -EINVAL;
> + }
> + } else if ((css->size * 4) + AUGMENT_SIZE == fdet->fw->size) {
> + /* augmented firmware file */
> + struct augmented_firmware_file *aff =
> + (struct augmented_firmware_file *)fdet->fw->data;
> +
> + /* make sure there are bytes in the payload */
> + ret = payload_check(dd, name, fdet->fw->size,
> + sizeof(struct augmented_firmware_file));
> + if (ret == 0) {
> + fdet->css_header = css;
> + fdet->modulus = aff->modulus;
> + fdet->exponent = aff->exponent;
> + fdet->signature = aff->signature;
> + fdet->r2 = aff->r2;
> + fdet->mu = aff->mu;
> + fdet->firmware_ptr = aff->firmware;
> + fdet->firmware_len = fdet->fw->size -
> + sizeof(struct augmented_firmware_file);
> + }
> + } else {
> + /* css->size check failed */
> + dd_dev_err(dd,
> + "invalid firmware header field size: expected 0x%lx or 0x%lx, actual 0x%x\n",
> + fdet->fw->size / 4,
> + (fdet->fw->size - AUGMENT_SIZE) / 4,
> + css->size);
> +
> + ret = -EINVAL;
> + }
> +
> +done:
> + /* if returning an error, clean up after ourselves */
> + if (ret)
> + dispose_one_firmware(fdet);
> + return ret;
> +}
> +
> +static void dispose_one_firmware(struct firmware_details *fdet)
> +{
> + release_firmware(fdet->fw);
> + /* erase all previous information */
> + memset(fdet, 0, sizeof(*fdet));
> +}
> +
> +/*
> + * Obtain the 4 firmwares from the OS. All must be obtained at once or not
> + * at all. If called with the firmware state in FW_TRY, use alternate names.
> + * On exit, this routine will have set the firmware state to one of FW_TRY,
> + * FW_FINAL, or FW_ERR.
> + *
> + * Must be holding fw_mutex.
> + */
> +static void __obtain_firmware(struct hfi2_devdata *dd)
> +{
> + int err = 0;
> +
> + if (fw_state == FW_FINAL) /* nothing more to obtain */
> + return;
> + if (fw_state == FW_ERR) /* already in error */
> + return;
> +
> + /* fw_state is FW_EMPTY or FW_TRY */
> +retry:
> + if (fw_state == FW_TRY) {
> + /*
> + * We tried the original and it failed. Move to the
> + * alternate.
> + */
> + dd_dev_warn(dd, "using alternate firmware names\n");
> + /*
> + * Let others run. Some systems, when missing firmware, does
> + * something that holds for 30 seconds. If we do that twice
> + * in a row it triggers task blocked warning.
> + */
> + cond_resched();
> + if (fw_8051_load)
> + dispose_one_firmware(&fw_8051);
> + if (fw_fabric_serdes_load)
> + dispose_one_firmware(&fw_fabric);
> + if (fw_sbus_load)
> + dispose_one_firmware(&fw_sbus);
> + if (fw_pcie_serdes_load)
> + dispose_one_firmware(&fw_pcie);
> + fw_8051_name = ALT_FW_8051_NAME_ASIC;
> + fw_fabric_serdes_name = ALT_FW_FABRIC_NAME;
> + fw_sbus_name = ALT_FW_SBUS_NAME;
> + fw_pcie_serdes_name = ALT_FW_PCIE_NAME;
> +
> + /*
> + * Add a delay before obtaining and loading debug firmware.
> + * Authorization will fail if the delay between firmware
> + * authorization events is shorter than 50us. Add 100us to
> + * make a delay time safe.
> + */
> + usleep_range(100, 120);
> + }
> +
> + if (fw_sbus_load) {
> + err = obtain_one_firmware(dd, fw_sbus_name, &fw_sbus);
> + if (err)
> + goto done;
> + }
> +
> + if (fw_pcie_serdes_load) {
> + err = obtain_one_firmware(dd, fw_pcie_serdes_name, &fw_pcie);
> + if (err)
> + goto done;
> + }
> +
> + if (fw_fabric_serdes_load) {
> + err = obtain_one_firmware(dd, fw_fabric_serdes_name,
> + &fw_fabric);
> + if (err)
> + goto done;
> + }
> +
> + if (fw_8051_load) {
> + err = obtain_one_firmware(dd, fw_8051_name, &fw_8051);
> + if (err)
> + goto done;
> + }
> +
> +done:
> + if (err) {
> + /* oops, had problems obtaining a firmware */
> + if (fw_state == FW_EMPTY && dd->icode == ICODE_RTL_SILICON) {
> + /* retry with alternate (RTL only) */
> + fw_state = FW_TRY;
> + goto retry;
> + }
> + dd_dev_err(dd, "unable to obtain working firmware\n");
> + fw_state = FW_ERR;
> + fw_err = -ENOENT;
> + } else {
> + /* success */
> + if (fw_state == FW_EMPTY &&
> + dd->icode != ICODE_FUNCTIONAL_SIMULATOR)
> + fw_state = FW_TRY; /* may retry later */
> + else
> + fw_state = FW_FINAL; /* cannot try again */
> + }
> +}
> +
> +/*
> + * Called by all HFIs when loading their firmware - i.e. device probe time.
> + * The first one will do the actual firmware load. Use a mutex to resolve
> + * any possible race condition.
> + *
> + * The call to this routine cannot be moved to driver load because the kernel
> + * call request_firmware() requires a device which is only available after
> + * the first device probe.
> + */
> +static int obtain_firmware(struct hfi2_devdata *dd)
> +{
> + unsigned long timeout;
> +
> + mutex_lock(&fw_mutex);
> +
> + /* 40s delay due to long delay on missing firmware on some systems */
> + timeout = jiffies + msecs_to_jiffies(40000);
> + while (fw_state == FW_TRY) {
> + /*
> + * Another device is trying the firmware. Wait until it
> + * decides what works (or not).
> + */
> + if (time_after(jiffies, timeout)) {
> + /* waited too long */
> + dd_dev_err(dd, "Timeout waiting for firmware try");
> + fw_state = FW_ERR;
> + fw_err = -ETIMEDOUT;
> + break;
> + }
> + mutex_unlock(&fw_mutex);
> + msleep(20); /* arbitrary delay */
> + mutex_lock(&fw_mutex);
> + }
> + /* not in FW_TRY state */
> +
> + /* set fw_state to FW_TRY, FW_FINAL, or FW_ERR, and fw_err */
> + if (fw_state == FW_EMPTY)
> + __obtain_firmware(dd);
> +
> + mutex_unlock(&fw_mutex);
> + return fw_err;
> +}
> +
> +/*
> + * Called when the driver unloads. The timing is asymmetric with its
> + * counterpart, obtain_firmware(). If called at device remove time,
> + * then it is conceivable that another device could probe while the
> + * firmware is being disposed. The mutexes can be moved to do that
> + * safely, but then the firmware would be requested from the OS multiple
> + * times.
> + *
> + * No mutex is needed as the driver is unloading and there cannot be any
> + * other callers.
> + */
> +void dispose_firmware(void)
> +{
> + dispose_one_firmware(&fw_8051);
> + dispose_one_firmware(&fw_fabric);
> + dispose_one_firmware(&fw_pcie);
> + dispose_one_firmware(&fw_sbus);
> +
> + /* retain the error state, otherwise revert to empty */
> + if (fw_state != FW_ERR)
> + fw_state = FW_EMPTY;
> +}
> +
> +/*
> + * Called with the result of a firmware download.
> + *
> + * Return 1 to retry loading the firmware, 0 to stop.
> + */
> +static int retry_firmware(struct hfi2_devdata *dd, int load_result)
> +{
> + int retry;
> +
> + mutex_lock(&fw_mutex);
> +
> + if (load_result == 0) {
> + /*
> + * The load succeeded, so expect all others to do the same.
> + * Do not retry again.
> + */
> + if (fw_state == FW_TRY)
> + fw_state = FW_FINAL;
> + retry = 0; /* do NOT retry */
> + } else if (fw_state == FW_TRY) {
> + /* load failed, obtain alternate firmware */
> + __obtain_firmware(dd);
> + retry = (fw_state == FW_FINAL);
> + } else {
> + /* else in FW_FINAL or FW_ERR, no retry in either case */
> + retry = 0;
> + }
> +
> + mutex_unlock(&fw_mutex);
> + return retry;
> +}
> +
> +/*
> + * Write a block of data to a given array CSR. All calls will be in
> + * multiples of 8 bytes.
> + */
> +static void write_rsa_data(struct hfi2_devdata *dd, int what,
> + const u8 *data, int nbytes)
> +{
> + int qw_size = nbytes / 8;
> + int i;
> +
> + if (((unsigned long)data & 0x7) == 0) {
> + /* aligned */
> + u64 *ptr = (u64 *)data;
> +
> + for (i = 0; i < qw_size; i++, ptr++)
> + write_csr(dd, what + (8 * i), *ptr);
> + } else {
> + /* not aligned */
> + for (i = 0; i < qw_size; i++, data += 8) {
> + u64 value;
> +
> + memcpy(&value, data, 8);
> + write_csr(dd, what + (8 * i), value);
> + }
> + }
> +}
> +
> +/*
> + * Write a block of data to a given CSR as a stream of writes. All calls will
> + * be in multiples of 8 bytes.
> + */
> +static void write_streamed_rsa_data(struct hfi2_devdata *dd, int what,
> + const u8 *data, int nbytes)
> +{
> + u64 *ptr = (u64 *)data;
> + int qw_size = nbytes / 8;
> +
> + for (; qw_size > 0; qw_size--, ptr++)
> + write_csr(dd, what, *ptr);
> +}
> +
> +/*
> + * Download the signature and start the RSA mechanism. Wait for
> + * RSA_ENGINE_TIMEOUT before giving up.
> + */
> +static int run_rsa(struct hfi2_devdata *dd, const char *who,
> + const u8 *signature)
> +{
> + unsigned long timeout;
> + u64 reg;
> + u32 status;
> + int ret = 0;
> +
> + /* write the signature */
> + write_rsa_data(dd, MISC_CFG_RSA_SIGNATURE, signature, KEY_SIZE);
> +
> + /* initialize RSA */
> + write_csr(dd, MISC_CFG_RSA_CMD, RSA_CMD_INIT);
> +
> + /*
> + * Make sure the engine is idle and insert a delay between the two
> + * writes to MISC_CFG_RSA_CMD.
> + */
> + status = (read_csr(dd, MISC_CFG_FW_CTRL)
> + & MISC_CFG_FW_CTRL_RSA_STATUS_SMASK)
> + >> MISC_CFG_FW_CTRL_RSA_STATUS_SHIFT;
> + if (status != RSA_STATUS_IDLE) {
> + dd_dev_err(dd, "%s security engine not idle - giving up\n",
> + who);
> + return -EBUSY;
> + }
> +
> + /* start RSA */
> + write_csr(dd, MISC_CFG_RSA_CMD, RSA_CMD_START);
> +
> + /*
> + * Look for the result.
> + *
> + * The RSA engine is hooked up to two MISC errors. The driver
> + * masks these errors as they do not respond to the standard
> + * error "clear down" mechanism. Look for these errors here and
> + * clear them when possible. This routine will exit with the
> + * errors of the current run still set.
> + *
> + * MISC_FW_AUTH_FAILED_ERR
> + * Firmware authorization failed. This can be cleared by
> + * re-initializing the RSA engine, then clearing the status bit.
> + * Do not re-init the RSA angine immediately after a successful
> + * run - this will reset the current authorization.
> + *
> + * MISC_KEY_MISMATCH_ERR
> + * Key does not match. The only way to clear this is to load
> + * a matching key then clear the status bit. If this error
> + * is raised, it will persist outside of this routine until a
> + * matching key is loaded.
> + */
> + timeout = msecs_to_jiffies(RSA_ENGINE_TIMEOUT) + jiffies;
> + while (1) {
> + status = (read_csr(dd, MISC_CFG_FW_CTRL)
> + & MISC_CFG_FW_CTRL_RSA_STATUS_SMASK)
> + >> MISC_CFG_FW_CTRL_RSA_STATUS_SHIFT;
> +
> + if (status == RSA_STATUS_IDLE) {
> + /* should not happen */
> + dd_dev_err(dd, "%s firmware security bad idle state\n",
> + who);
> + ret = -EINVAL;
> + break;
> + } else if (status == RSA_STATUS_DONE) {
> + /* finished successfully */
> + break;
> + } else if (status == RSA_STATUS_FAILED) {
> + /* finished unsuccessfully */
> + ret = -EINVAL;
> + break;
> + }
> + /* else still active */
> +
> + if (time_after(jiffies, timeout)) {
> + /*
> + * Timed out while active. We can't reset the engine
> + * if it is stuck active, but run through the
> + * error code to see what error bits are set.
> + */
> + dd_dev_err(dd, "%s firmware security time out\n", who);
> + ret = -ETIMEDOUT;
> + break;
> + }
> +
> + msleep(20);
> + }
> +
> + /*
> + * Arrive here on success or failure. Clear all RSA engine
> + * errors. All current errors will stick - the RSA logic is keeping
> + * error high. All previous errors will clear - the RSA logic
> + * is not keeping the error high.
> + */
> + write_csr(dd, MISC_ERR_CLEAR,
> + MISC_ERR_STATUS_MISC_FW_AUTH_FAILED_ERR_SMASK |
> + MISC_ERR_STATUS_MISC_KEY_MISMATCH_ERR_SMASK);
> + /*
> + * All that is left are the current errors. Print warnings on
> + * authorization failure details, if any. Firmware authorization
> + * can be retried, so these are only warnings.
> + */
> + reg = read_csr(dd, MISC_ERR_STATUS);
> + if (ret) {
> + if (reg & MISC_ERR_STATUS_MISC_FW_AUTH_FAILED_ERR_SMASK)
> + dd_dev_warn(dd, "%s firmware authorization failed\n",
> + who);
> + if (reg & MISC_ERR_STATUS_MISC_KEY_MISMATCH_ERR_SMASK)
> + dd_dev_warn(dd, "%s firmware key mismatch\n", who);
> + }
> +
> + return ret;
> +}
> +
> +static void load_security_variables(struct hfi2_devdata *dd,
> + struct firmware_details *fdet)
> +{
> + /* Security variables a. Write the modulus */
> + write_rsa_data(dd, MISC_CFG_RSA_MODULUS, fdet->modulus, KEY_SIZE);
> + /* Security variables b. Write the r2 */
> + write_rsa_data(dd, MISC_CFG_RSA_R2, fdet->r2, KEY_SIZE);
> + /* Security variables c. Write the mu */
> + write_rsa_data(dd, MISC_CFG_RSA_MU, fdet->mu, MU_SIZE);
> + /* Security variables d. Write the header */
> + write_streamed_rsa_data(dd, MISC_CFG_SHA_PRELOAD,
> + (u8 *)fdet->css_header,
> + sizeof(struct css_header));
> +}
> +
> +/* return the 8051 firmware state */
> +static inline u32 get_firmware_state(struct hfi2_devdata *dd)
> +{
> + u64 reg = read_csr(dd, DC_DC8051_STS_CUR_STATE);
> +
> + return (reg >> DC_DC8051_STS_CUR_STATE_FIRMWARE_SHIFT)
> + & DC_DC8051_STS_CUR_STATE_FIRMWARE_MASK;
> +}
> +
> +/*
> + * Wait until the firmware is up and ready to take host requests.
> + * Return 0 on success, -ETIMEDOUT on timeout.
> + */
> +int wait_fm_ready(struct hfi2_devdata *dd, u32 mstimeout)
> +{
> + unsigned long timeout;
> +
> + timeout = msecs_to_jiffies(mstimeout) + jiffies;
> + while (1) {
> + if (get_firmware_state(dd) == 0xa0) /* ready */
> + return 0;
> + if (time_after(jiffies, timeout)) /* timed out */
> + return -ETIMEDOUT;
> + usleep_range(1950, 2050); /* sleep 2ms-ish */
> + }
> +}
> +
> +/*
> + * Load the 8051 firmware.
> + */
> +static int load_8051_firmware(struct hfi2_devdata *dd,
> + struct firmware_details *fdet)
> +{
> + u64 reg;
> + int ret;
> + u8 ver_major;
> + u8 ver_minor;
> + u8 ver_patch;
> +
> + /*
> + * DC Reset sequence
> + * Load DC 8051 firmware
> + */
> + /*
> + * DC reset step 1: Reset DC8051
> + */
> + reg = DC_DC8051_CFG_RST_M8051W_SMASK
> + | DC_DC8051_CFG_RST_CRAM_SMASK
> + | DC_DC8051_CFG_RST_DRAM_SMASK
> + | DC_DC8051_CFG_RST_IRAM_SMASK
> + | DC_DC8051_CFG_RST_SFR_SMASK;
> + write_csr(dd, DC_DC8051_CFG_RST, reg);
> +
> + /*
> + * DC reset step 2 (optional): Load 8051 data memory with link
> + * configuration
> + */
> +
> + /*
> + * DC reset step 3: Load DC8051 firmware
> + */
> + /* release all but the core reset */
> + reg = DC_DC8051_CFG_RST_M8051W_SMASK;
> + write_csr(dd, DC_DC8051_CFG_RST, reg);
> +
> + /* Firmware load step 1 */
> + load_security_variables(dd, fdet);
> +
> + /*
> + * Firmware load step 2. Clear MISC_CFG_FW_CTRL.FW_8051_LOADED
> + */
> + write_csr(dd, MISC_CFG_FW_CTRL, 0);
> +
> + /* Firmware load steps 3-5 */
> + ret = write_8051(dd, 1/*code*/, 0, fdet->firmware_ptr,
> + fdet->firmware_len);
> + if (ret)
> + return ret;
> +
> + /*
> + * DC reset step 4. Host starts the DC8051 firmware
> + */
> + /*
> + * Firmware load step 6. Set MISC_CFG_FW_CTRL.FW_8051_LOADED
> + */
> + write_csr(dd, MISC_CFG_FW_CTRL, MISC_CFG_FW_CTRL_FW_8051_LOADED_SMASK);
> +
> + /* Firmware load steps 7-10 */
> + ret = run_rsa(dd, "8051", fdet->signature);
> + if (ret)
> + return ret;
> +
> + /* clear all reset bits, releasing the 8051 */
> + write_csr(dd, DC_DC8051_CFG_RST, 0ull);
> +
> + /*
> + * DC reset step 5. Wait for firmware to be ready to accept host
> + * requests.
> + */
> + ret = wait_fm_ready(dd, TIMEOUT_8051_START);
> + if (ret) { /* timed out */
> + dd_dev_err(dd, "8051 start timeout, current state 0x%x\n",
> + get_firmware_state(dd));
> + return -ETIMEDOUT;
> + }
> +
> + read_misc_status(dd, &ver_major, &ver_minor, &ver_patch);
> + dd_dev_info(dd, "8051 firmware version %d.%d.%d\n",
> + (int)ver_major, (int)ver_minor, (int)ver_patch);
> + dd->dc8051_ver = dc8051_ver(ver_major, ver_minor, ver_patch);
> + ret = write_host_interface_version(dd, HOST_INTERFACE_VERSION);
> + if (ret != HCMD_SUCCESS) {
> + dd_dev_err(dd,
> + "Failed to set host interface version, return 0x%x\n",
> + ret);
> + return -EIO;
> + }
> +
> + return 0;
> +}
> +
> +/*
> + * Write the SBus request register
> + *
> + * No need for masking - the arguments are sized exactly.
> + */
> +void sbus_request(struct hfi2_devdata *dd,
> + u8 receiver_addr, u8 data_addr, u8 command, u32 data_in)
> +{
> + write_csr(dd, ASIC_CFG_SBUS_REQUEST,
> + ((u64)data_in << ASIC_CFG_SBUS_REQUEST_DATA_IN_SHIFT) |
> + ((u64)command << ASIC_CFG_SBUS_REQUEST_COMMAND_SHIFT) |
> + ((u64)data_addr << ASIC_CFG_SBUS_REQUEST_DATA_ADDR_SHIFT) |
> + ((u64)receiver_addr <<
> + ASIC_CFG_SBUS_REQUEST_RECEIVER_ADDR_SHIFT));
> +}
> +
> +/*
> + * Read a value from the SBus.
> + *
> + * Requires the caller to be in fast mode
> + */
> +static u32 sbus_read(struct hfi2_devdata *dd, u8 receiver_addr, u8 data_addr,
> + u32 data_in)
> +{
> + u64 reg;
> + int retries;
> + int success = 0;
> + u32 result = 0;
> + u32 result_code = 0;
> +
> + sbus_request(dd, receiver_addr, data_addr, READ_SBUS_RECEIVER, data_in);
> +
> + for (retries = 0; retries < 100; retries++) {
> + usleep_range(1000, 1200); /* arbitrary */
> + reg = read_csr(dd, ASIC_STS_SBUS_RESULT);
> + result_code = (reg >> ASIC_STS_SBUS_RESULT_RESULT_CODE_SHIFT)
> + & ASIC_STS_SBUS_RESULT_RESULT_CODE_MASK;
> + if (result_code != SBUS_READ_COMPLETE)
> + continue;
> +
> + success = 1;
> + result = (reg >> ASIC_STS_SBUS_RESULT_DATA_OUT_SHIFT)
> + & ASIC_STS_SBUS_RESULT_DATA_OUT_MASK;
> + break;
> + }
> +
> + if (!success) {
> + dd_dev_err(dd, "%s: read failed, result code 0x%x\n", __func__,
> + result_code);
> + }
> +
> + return result;
> +}
> +
> +/*
> + * Turn off the SBus and fabric serdes spicos.
> + *
> + * + Must be called with Sbus fast mode turned on.
> + * + Must be called after fabric serdes broadcast is set up.
> + * + Must be called before the 8051 is loaded - assumes 8051 is not loaded
> + * when using MISC_CFG_FW_CTRL.
> + */
> +static void turn_off_spicos(struct hfi2_devdata *dd, int flags)
> +{
> + /* only needed on A0 */
> + if (!is_ax(dd))
> + return;
> +
> + dd_dev_info(dd, "Turning off spicos:%s%s\n",
> + flags & SPICO_SBUS ? " SBus" : "",
> + flags & SPICO_FABRIC ? " fabric" : "");
> +
> + write_csr(dd, MISC_CFG_FW_CTRL, ENABLE_SPICO_SMASK);
> + /* disable SBus spico */
> + if (flags & SPICO_SBUS)
> + sbus_request(dd, SBUS_MASTER_BROADCAST, 0x01,
> + WRITE_SBUS_RECEIVER, 0x00000040);
> +
> + /* disable the fabric serdes spicos */
> + if (flags & SPICO_FABRIC)
> + sbus_request(dd, fabric_serdes_broadcast[dd->hfi2_id],
> + 0x07, WRITE_SBUS_RECEIVER, 0x00000000);
> + write_csr(dd, MISC_CFG_FW_CTRL, 0);
> +}
> +
> +/*
> + * Reset all of the fabric serdes for this HFI in preparation to take the
> + * link to Polling.
> + *
> + * To do a reset, we need to write to the serdes registers. Unfortunately,
> + * the fabric serdes download to the other HFI on the ASIC will have turned
> + * off the firmware validation on this HFI. This means we can't write to the
> + * registers to reset the serdes. Work around this by performing a complete
> + * re-download and validation of the fabric serdes firmware. This, as a
> + * by-product, will reset the serdes. NOTE: the re-download requires that
> + * the 8051 be in the Offline state. I.e. not actively trying to use the
> + * serdes. This routine is called at the point where the link is Offline and
> + * is getting ready to go to Polling.
> + */
> +void fabric_serdes_reset(struct hfi2_devdata *dd)
> +{
> + int ret;
> +
> + if (!fw_fabric_serdes_load)
> + return;
> +
> + ret = acquire_chip_resource(dd, CR_SBUS, SBUS_TIMEOUT);
> + if (ret) {
> + dd_dev_err(dd,
> + "Cannot acquire SBus resource to reset fabric SerDes - perhaps you should reboot\n");
> + return;
> + }
> + set_sbus_fast_mode(dd);
> +
> + if (is_ax(dd)) {
> + /* A0 serdes do not work with a re-download */
> + u8 ra = fabric_serdes_broadcast[dd->hfi2_id];
> +
> + /* place SerDes in reset and disable SPICO */
> + sbus_request(dd, ra, 0x07, WRITE_SBUS_RECEIVER, 0x00000011);
> + /* wait 100 refclk cycles @ 156.25MHz => 640ns */
> + udelay(1);
> + /* remove SerDes reset */
> + sbus_request(dd, ra, 0x07, WRITE_SBUS_RECEIVER, 0x00000010);
> + /* turn SPICO enable on */
> + sbus_request(dd, ra, 0x07, WRITE_SBUS_RECEIVER, 0x00000002);
> + } else {
> + turn_off_spicos(dd, SPICO_FABRIC);
> + /*
> + * No need for firmware retry - what to download has already
> + * been decided.
> + * No need to pay attention to the load return - the only
> + * failure is a validation failure, which has already been
> + * checked by the initial download.
> + */
> + (void)load_fabric_serdes_firmware(dd, &fw_fabric);
> + }
> +
> + clear_sbus_fast_mode(dd);
> + release_chip_resource(dd, CR_SBUS);
> +}
> +
> +/* Access to the SBus in this routine should probably be serialized */
> +int sbus_request_slow(struct hfi2_devdata *dd,
> + u8 receiver_addr, u8 data_addr, u8 command, u32 data_in)
> +{
> + u64 reg, count = 0;
> +
> + /* make sure fast mode is clear */
> + clear_sbus_fast_mode(dd);
> +
> + sbus_request(dd, receiver_addr, data_addr, command, data_in);
> + write_csr(dd, ASIC_CFG_SBUS_EXECUTE,
> + ASIC_CFG_SBUS_EXECUTE_EXECUTE_SMASK);
> + /* Wait for both DONE and RCV_DATA_VALID to go high */
> + reg = read_csr(dd, ASIC_STS_SBUS_RESULT);
> + while (!((reg & ASIC_STS_SBUS_RESULT_DONE_SMASK) &&
> + (reg & ASIC_STS_SBUS_RESULT_RCV_DATA_VALID_SMASK))) {
> + if (count++ >= SBUS_MAX_POLL_COUNT) {
> + u64 counts = read_csr(dd, ASIC_STS_SBUS_COUNTERS);
> + /*
> + * If the loop has timed out, we are OK if DONE bit
> + * is set and RCV_DATA_VALID and EXECUTE counters
> + * are the same. If not, we cannot proceed.
> + */
> + if ((reg & ASIC_STS_SBUS_RESULT_DONE_SMASK) &&
> + (SBUS_COUNTER(counts, RCV_DATA_VALID) ==
> + SBUS_COUNTER(counts, EXECUTE)))
> + break;
> + return -ETIMEDOUT;
> + }
> + udelay(1);
> + reg = read_csr(dd, ASIC_STS_SBUS_RESULT);
> + }
> + count = 0;
> + write_csr(dd, ASIC_CFG_SBUS_EXECUTE, 0);
> + /* Wait for DONE to clear after EXECUTE is cleared */
> + reg = read_csr(dd, ASIC_STS_SBUS_RESULT);
> + while (reg & ASIC_STS_SBUS_RESULT_DONE_SMASK) {
> + if (count++ >= SBUS_MAX_POLL_COUNT)
> + return -ETIME;
> + udelay(1);
> + reg = read_csr(dd, ASIC_STS_SBUS_RESULT);
> + }
> + return 0;
> +}
> +
> +static int load_fabric_serdes_firmware(struct hfi2_devdata *dd,
> + struct firmware_details *fdet)
> +{
> + int i, err;
> + const u8 ra = fabric_serdes_broadcast[dd->hfi2_id]; /* receiver addr */
> +
> + dd_dev_info(dd, "Downloading fabric firmware\n");
> +
> + /* step 1: load security variables */
> + load_security_variables(dd, fdet);
> + /* step 2: place SerDes in reset and disable SPICO */
> + sbus_request(dd, ra, 0x07, WRITE_SBUS_RECEIVER, 0x00000011);
> + /* wait 100 refclk cycles @ 156.25MHz => 640ns */
> + udelay(1);
> + /* step 3: remove SerDes reset */
> + sbus_request(dd, ra, 0x07, WRITE_SBUS_RECEIVER, 0x00000010);
> + /* step 4: assert IMEM override */
> + sbus_request(dd, ra, 0x00, WRITE_SBUS_RECEIVER, 0x40000000);
> + /* step 5: download SerDes machine code */
> + for (i = 0; i < fdet->firmware_len; i += 4) {
> + sbus_request(dd, ra, 0x0a, WRITE_SBUS_RECEIVER,
> + *(u32 *)&fdet->firmware_ptr[i]);
> + }
> + /* step 6: IMEM override off */
> + sbus_request(dd, ra, 0x00, WRITE_SBUS_RECEIVER, 0x00000000);
> + /* step 7: turn ECC on */
> + sbus_request(dd, ra, 0x0b, WRITE_SBUS_RECEIVER, 0x000c0000);
> +
> + /* steps 8-11: run the RSA engine */
> + err = run_rsa(dd, "fabric serdes", fdet->signature);
> + if (err)
> + return err;
> +
> + /* step 12: turn SPICO enable on */
> + sbus_request(dd, ra, 0x07, WRITE_SBUS_RECEIVER, 0x00000002);
> + /* step 13: enable core hardware interrupts */
> + sbus_request(dd, ra, 0x08, WRITE_SBUS_RECEIVER, 0x00000000);
> +
> + return 0;
> +}
> +
> +static int load_sbus_firmware(struct hfi2_devdata *dd,
> + struct firmware_details *fdet)
> +{
> + int i, err;
> + const u8 ra = SBUS_MASTER_BROADCAST; /* receiver address */
> +
> + dd_dev_info(dd, "Downloading SBus firmware\n");
> +
> + /* step 1: load security variables */
> + load_security_variables(dd, fdet);
> + /* step 2: place SPICO into reset and enable off */
> + sbus_request(dd, ra, 0x01, WRITE_SBUS_RECEIVER, 0x000000c0);
> + /* step 3: remove reset, enable off, IMEM_CNTRL_EN on */
> + sbus_request(dd, ra, 0x01, WRITE_SBUS_RECEIVER, 0x00000240);
> + /* step 4: set starting IMEM address for burst download */
> + sbus_request(dd, ra, 0x03, WRITE_SBUS_RECEIVER, 0x80000000);
> + /* step 5: download the SBus Master machine code */
> + for (i = 0; i < fdet->firmware_len; i += 4) {
> + sbus_request(dd, ra, 0x14, WRITE_SBUS_RECEIVER,
> + *(u32 *)&fdet->firmware_ptr[i]);
> + }
> + /* step 6: set IMEM_CNTL_EN off */
> + sbus_request(dd, ra, 0x01, WRITE_SBUS_RECEIVER, 0x00000040);
> + /* step 7: turn ECC on */
> + sbus_request(dd, ra, 0x16, WRITE_SBUS_RECEIVER, 0x000c0000);
> +
> + /* steps 8-11: run the RSA engine */
> + err = run_rsa(dd, "SBus", fdet->signature);
> + if (err)
> + return err;
> +
> + /* step 12: set SPICO_ENABLE on */
> + sbus_request(dd, ra, 0x01, WRITE_SBUS_RECEIVER, 0x00000140);
> +
> + return 0;
> +}
> +
> +static int load_pcie_serdes_firmware(struct hfi2_devdata *dd,
> + struct firmware_details *fdet)
> +{
> + int i;
> + const u8 ra = SBUS_MASTER_BROADCAST; /* receiver address */
> +
> + dd_dev_info(dd, "Downloading PCIe firmware\n");
> +
> + /* step 1: load security variables */
> + load_security_variables(dd, fdet);
> + /* step 2: assert single step (halts the SBus Master spico) */
> + sbus_request(dd, ra, 0x05, WRITE_SBUS_RECEIVER, 0x00000001);
> + /* step 3: enable XDMEM access */
> + sbus_request(dd, ra, 0x01, WRITE_SBUS_RECEIVER, 0x00000d40);
> + /* step 4: load firmware into SBus Master XDMEM */
> + /*
> + * NOTE: the dmem address, write_en, and wdata are all pre-packed,
> + * we only need to pick up the bytes and write them
> + */
> + for (i = 0; i < fdet->firmware_len; i += 4) {
> + sbus_request(dd, ra, 0x04, WRITE_SBUS_RECEIVER,
> + *(u32 *)&fdet->firmware_ptr[i]);
> + }
> + /* step 5: disable XDMEM access */
> + sbus_request(dd, ra, 0x01, WRITE_SBUS_RECEIVER, 0x00000140);
> + /* step 6: allow SBus Spico to run */
> + sbus_request(dd, ra, 0x05, WRITE_SBUS_RECEIVER, 0x00000000);
> +
> + /*
> + * steps 7-11: run RSA, if it succeeds, firmware is available to
> + * be swapped
> + */
> + return run_rsa(dd, "PCIe serdes", fdet->signature);
> +}
> +
> +/*
> + * Set the given broadcast values on the given list of devices.
> + */
> +static void set_serdes_broadcast(struct hfi2_devdata *dd, u8 bg1, u8 bg2,
> + const u8 *addrs, int count)
> +{
> + while (--count >= 0) {
> + /*
> + * Set BROADCAST_GROUP_1 and BROADCAST_GROUP_2, leave
> + * defaults for everything else. Do not read-modify-write,
> + * per instruction from the manufacturer.
> + *
> + * Register 0xfd:
> + * bits what
> + * ----- ---------------------------------
> + * 0 IGNORE_BROADCAST (default 0)
> + * 11:4 BROADCAST_GROUP_1 (default 0xff)
> + * 23:16 BROADCAST_GROUP_2 (default 0xff)
> + */
> + sbus_request(dd, addrs[count], 0xfd, WRITE_SBUS_RECEIVER,
> + (u32)bg1 << 4 | (u32)bg2 << 16);
> + }
> +}
> +
> +int acquire_hw_mutex(struct hfi2_devdata *dd)
> +{
> + unsigned long timeout;
> + int try = 0;
> + u8 mask = 1 << dd->hfi2_id;
> + u8 user = (u8)read_csr(dd, ASIC_CFG_MUTEX);
> +
> + if (user == mask) {
> + dd_dev_info(dd,
> + "Hardware mutex already acquired, mutex mask %u\n",
> + (u32)mask);
> + return 0;
> + }
> +
> +retry:
> + timeout = msecs_to_jiffies(HM_TIMEOUT) + jiffies;
> + while (1) {
> + write_csr(dd, ASIC_CFG_MUTEX, mask);
> + user = (u8)read_csr(dd, ASIC_CFG_MUTEX);
> + if (user == mask)
> + return 0; /* success */
> + if (time_after(jiffies, timeout))
> + break; /* timed out */
> + msleep(20);
> + }
> +
> + /* timed out */
> + dd_dev_err(dd,
> + "Unable to acquire hardware mutex, mutex mask %u, my mask %u (%s)\n",
> + (u32)user, (u32)mask, (try == 0) ? "retrying" : "giving up");
> +
> + if (try == 0) {
> + /* break mutex and retry */
> + write_csr(dd, ASIC_CFG_MUTEX, 0);
> + try++;
> + goto retry;
> + }
> +
> + return -EBUSY;
> +}
> +
> +void release_hw_mutex(struct hfi2_devdata *dd)
> +{
> + u8 mask = 1 << dd->hfi2_id;
> + u8 user = (u8)read_csr(dd, ASIC_CFG_MUTEX);
> +
> + if (user != mask)
> + dd_dev_warn(dd,
> + "Unable to release hardware mutex, mutex mask %u, my mask %u\n",
> + (u32)user, (u32)mask);
> + else
> + write_csr(dd, ASIC_CFG_MUTEX, 0);
> +}
> +
> +/* return the given resource bit(s) as a mask for the given HFI */
> +static inline u64 resource_mask(u32 hfi2_id, u32 resource)
> +{
> + return ((u64)resource) << (hfi2_id ? CR_DYN_SHIFT : 0);
> +}
> +
> +static void fail_mutex_acquire_message(struct hfi2_devdata *dd,
> + const char *func)
> +{
> + dd_dev_err(dd,
> + "%s: hardware mutex stuck - suggest rebooting the machine\n",
> + func);
> +}
> +
> +/*
> + * Acquire access to a chip resource.
> + *
> + * Return 0 on success, -EBUSY if resource busy, -EIO if mutex acquire failed.
> + */
> +static int __acquire_chip_resource(struct hfi2_devdata *dd, u32 resource)
> +{
> + u64 scratch0, all_bits, my_bit;
> + int ret;
> +
> + if (resource & CR_DYN_MASK) {
> + /* a dynamic resource is in use if either HFI has set the bit */
> + if (dd->pcidev->device == PCI_DEVICE_ID_INTEL0 &&
> + (resource & (CR_I2C1 | CR_I2C2))) {
> + /* discrete devices must serialize across both chains */
> + all_bits = resource_mask(0, CR_I2C1 | CR_I2C2) |
> + resource_mask(1, CR_I2C1 | CR_I2C2);
> + } else {
> + all_bits = resource_mask(0, resource) |
> + resource_mask(1, resource);
> + }
> + my_bit = resource_mask(dd->hfi2_id, resource);
> + } else {
> + /* non-dynamic resources are not split between HFIs */
> + all_bits = resource;
> + my_bit = resource;
> + }
> +
> + /* lock against other callers within the driver wanting a resource */
> + mutex_lock(&dd->asic_data->asic_resource_mutex);
> +
> + ret = acquire_hw_mutex(dd);
> + if (ret) {
> + fail_mutex_acquire_message(dd, __func__);
> + ret = -EIO;
> + goto done;
> + }
> +
> + scratch0 = read_csr(dd, ASIC_CFG_SCRATCH);
> + if (scratch0 & all_bits) {
> + ret = -EBUSY;
> + } else {
> + write_csr(dd, ASIC_CFG_SCRATCH, scratch0 | my_bit);
> + /* force write to be visible to other HFI on another OS */
> + (void)read_csr(dd, ASIC_CFG_SCRATCH);
> + }
> +
> + release_hw_mutex(dd);
> +
> +done:
> + mutex_unlock(&dd->asic_data->asic_resource_mutex);
> + return ret;
> +}
> +
> +/*
> + * Acquire access to a chip resource, wait up to mswait milliseconds for
> + * the resource to become available.
> + *
> + * Return 0 on success, -EBUSY if busy (even after wait), -EIO if mutex
> + * acquire failed, -EINVAL if there is no asic_data.
> + */
> +int acquire_chip_resource(struct hfi2_devdata *dd, u32 resource, u32 mswait)
> +{
> + unsigned long timeout;
> + int ret;
> +
> + if (!dd->asic_data)
> + return -EINVAL;
> +
> + timeout = jiffies + msecs_to_jiffies(mswait);
> + while (1) {
> + ret = __acquire_chip_resource(dd, resource);
> + if (ret != -EBUSY)
> + return ret;
> + /* resource is busy, check our timeout */
> + if (time_after_eq(jiffies, timeout))
> + return -EBUSY;
> + usleep_range(80, 120); /* arbitrary delay */
> + }
> +}
> +
> +/*
> + * Release access to a chip resource
> + */
> +void release_chip_resource(struct hfi2_devdata *dd, u32 resource)
> +{
> + u64 scratch0, bit;
> +
> + if (!dd->asic_data)
> + return;
> +
> + /* only dynamic resources should ever be cleared */
> + if (!(resource & CR_DYN_MASK)) {
> + dd_dev_err(dd, "%s: invalid resource 0x%x\n", __func__,
> + resource);
> + return;
> + }
> + bit = resource_mask(dd->hfi2_id, resource);
> +
> + /* lock against other callers within the driver wanting a resource */
> + mutex_lock(&dd->asic_data->asic_resource_mutex);
> +
> + if (acquire_hw_mutex(dd)) {
> + fail_mutex_acquire_message(dd, __func__);
> + goto done;
> + }
> +
> + scratch0 = read_csr(dd, ASIC_CFG_SCRATCH);
> + if ((scratch0 & bit) != 0) {
> + scratch0 &= ~bit;
> + write_csr(dd, ASIC_CFG_SCRATCH, scratch0);
> + /* force write to be visible to other HFI on another OS */
> + (void)read_csr(dd, ASIC_CFG_SCRATCH);
> + } else {
> + dd_dev_warn(dd, "%s: id %d, resource 0x%x: bit not set\n",
> + __func__, dd->hfi2_id, resource);
> + }
> +
> + release_hw_mutex(dd);
> +
> +done:
> + mutex_unlock(&dd->asic_data->asic_resource_mutex);
> +}
> +
> +/*
> + * Return true if resource is set, false otherwise. Print a warning
> + * if not set and a function is supplied.
> + */
> +bool check_chip_resource(struct hfi2_devdata *dd, u32 resource,
> + const char *func)
> +{
> + u64 scratch0, bit;
> +
> + if (resource & CR_DYN_MASK)
> + bit = resource_mask(dd->hfi2_id, resource);
> + else
> + bit = resource;
> +
> + scratch0 = read_csr(dd, ASIC_CFG_SCRATCH);
> + if ((scratch0 & bit) == 0) {
> + if (func)
> + dd_dev_warn(dd,
> + "%s: id %d, resource 0x%x, not acquired!\n",
> + func, dd->hfi2_id, resource);
> + return false;
> + }
> + return true;
> +}
> +
> +static void clear_chip_resources(struct hfi2_devdata *dd, const char *func)
> +{
> + u64 scratch0;
> +
> + if (!dd->asic_data)
> + return;
> +
> + /* lock against other callers within the driver wanting a resource */
> + mutex_lock(&dd->asic_data->asic_resource_mutex);
> +
> + if (acquire_hw_mutex(dd)) {
> + fail_mutex_acquire_message(dd, func);
> + goto done;
> + }
> +
> + /* clear all dynamic access bits for this HFI */
> + scratch0 = read_csr(dd, ASIC_CFG_SCRATCH);
> + scratch0 &= ~resource_mask(dd->hfi2_id, CR_DYN_MASK);
> + write_csr(dd, ASIC_CFG_SCRATCH, scratch0);
> + /* force write to be visible to other HFI on another OS */
> + (void)read_csr(dd, ASIC_CFG_SCRATCH);
> +
> + release_hw_mutex(dd);
> +
> +done:
> + mutex_unlock(&dd->asic_data->asic_resource_mutex);
> +}
> +
> +void init_chip_resources(struct hfi2_devdata *dd)
> +{
> + /* clear any holds left by us */
> + clear_chip_resources(dd, __func__);
> +}
> +
> +void finish_chip_resources(struct hfi2_devdata *dd)
> +{
> + /* clear any holds left by us */
> + clear_chip_resources(dd, __func__);
> +}
> +
> +void set_sbus_fast_mode(struct hfi2_devdata *dd)
> +{
> + write_csr(dd, ASIC_CFG_SBUS_EXECUTE,
> + ASIC_CFG_SBUS_EXECUTE_FAST_MODE_SMASK);
> +}
> +
> +void clear_sbus_fast_mode(struct hfi2_devdata *dd)
> +{
> + u64 reg, count = 0;
> +
> + reg = read_csr(dd, ASIC_STS_SBUS_COUNTERS);
> + while (SBUS_COUNTER(reg, EXECUTE) !=
> + SBUS_COUNTER(reg, RCV_DATA_VALID)) {
> + if (count++ >= SBUS_MAX_POLL_COUNT)
> + break;
> + udelay(1);
> + reg = read_csr(dd, ASIC_STS_SBUS_COUNTERS);
> + }
> + write_csr(dd, ASIC_CFG_SBUS_EXECUTE, 0);
> +}
> +
> +int load_firmware(struct hfi2_devdata *dd)
> +{
> + int ret;
> +
> + if (fw_fabric_serdes_load) {
> + ret = acquire_chip_resource(dd, CR_SBUS, SBUS_TIMEOUT);
> + if (ret)
> + return ret;
> +
> + set_sbus_fast_mode(dd);
> +
> + set_serdes_broadcast(dd, all_fabric_serdes_broadcast,
> + fabric_serdes_broadcast[dd->hfi2_id],
> + fabric_serdes_addrs[dd->hfi2_id],
> + NUM_FABRIC_SERDES);
> + turn_off_spicos(dd, SPICO_FABRIC);
> + do {
> + ret = load_fabric_serdes_firmware(dd, &fw_fabric);
> + } while (retry_firmware(dd, ret));
> +
> + clear_sbus_fast_mode(dd);
> + release_chip_resource(dd, CR_SBUS);
> + if (ret)
> + return ret;
> + }
> +
> + if (fw_8051_load) {
> + do {
> + ret = load_8051_firmware(dd, &fw_8051);
> + } while (retry_firmware(dd, ret));
> + if (ret)
> + return ret;
> + }
> +
> + dump_fw_version(dd);
> + return 0;
> +}
> +
> +int hfi2_firmware_init(struct hfi2_devdata *dd)
> +{
> + /* only RTL can use these */
> + if (dd->icode != ICODE_RTL_SILICON) {
> + fw_fabric_serdes_load = 0;
> + fw_pcie_serdes_load = 0;
> + fw_sbus_load = 0;
> + }
> +
> + /* no 8051 or QSFP on simulator */
> + if (dd->icode == ICODE_FUNCTIONAL_SIMULATOR) {
> + u8 ver_major, ver_minor, ver_patch;
> +
> + read_misc_status(dd, &ver_major, &ver_minor, &ver_patch);
> + dd_dev_info(dd, "Simulated 8051 firmware version %d.%d.%d\n",
> + (int)ver_major, (int)ver_minor, (int)ver_patch);
> + dd->dc8051_ver = dc8051_ver(ver_major, ver_minor, ver_patch);
> + fw_8051_load = 0;
> + }
> +
> + if (!fw_8051_name) {
> + if (dd->icode == ICODE_RTL_SILICON)
> + fw_8051_name = DEFAULT_FW_8051_NAME_ASIC;
> + else
> + fw_8051_name = DEFAULT_FW_8051_NAME_FPGA;
> + }
> + if (!fw_fabric_serdes_name)
> + fw_fabric_serdes_name = DEFAULT_FW_FABRIC_NAME;
> + if (!fw_sbus_name)
> + fw_sbus_name = DEFAULT_FW_SBUS_NAME;
> + if (!fw_pcie_serdes_name)
> + fw_pcie_serdes_name = DEFAULT_FW_PCIE_NAME;
> +
> + return obtain_firmware(dd);
> +}
> +
> +/*
> + * This function is a helper function for parse_platform_config(...) and
> + * does not check for validity of the platform configuration cache
> + * (because we know it is invalid as we are building up the cache).
> + * As such, this should not be called from anywhere other than
> + * parse_platform_config
> + */
> +static int check_meta_version(struct hfi2_devdata *dd, u32 *system_table)
> +{
> + u32 meta_ver, meta_ver_meta, ver_start, ver_len, mask;
> + struct platform_config_cache *pcfgcache = &dd->pcfg_cache;
> +
> + if (!system_table)
> + return -EINVAL;
> +
> + meta_ver_meta =
> + *(pcfgcache->config_tables[PLATFORM_CONFIG_SYSTEM_TABLE].table_metadata
> + + SYSTEM_TABLE_META_VERSION);
> +
> + mask = ((1 << METADATA_TABLE_FIELD_START_LEN_BITS) - 1);
> + ver_start = meta_ver_meta & mask;
> +
> + meta_ver_meta >>= METADATA_TABLE_FIELD_LEN_SHIFT;
> +
> + mask = ((1 << METADATA_TABLE_FIELD_LEN_LEN_BITS) - 1);
> + ver_len = meta_ver_meta & mask;
> +
> + ver_start /= 8;
> + meta_ver = *((u8 *)system_table + ver_start) & ((1 << ver_len) - 1);
> +
> + if (meta_ver < 4) {
> + dd_dev_info(
> + dd, "%s:Please update platform config\n", __func__);
> + return -EINVAL;
> + }
> + return 0;
> +}
> +
> +int parse_platform_config(struct hfi2_pportdata *ppd)
> +{
> + struct hfi2_devdata *dd = ppd->dd;
> + struct platform_config_cache *pcfgcache = &dd->pcfg_cache;
> + u32 *ptr = NULL;
> + u32 header1 = 0, header2 = 0, magic_num = 0, crc = 0, file_length = 0;
> + u32 record_idx = 0, table_type = 0, table_length_dwords = 0;
> + int ret = -EINVAL; /* assume failure */
> +
> + /*
> + * For integrated devices that did not fall back to the default file,
> + * the SI tuning information for active channels is acquired from the
> + * scratch register bitmap, thus there is no platform config to parse.
> + * Skip parsing in these situations.
> + */
> + if (ppd->config_from_scratch)
> + return 0;
> +
> + if (!dd->platform_config.data) {
> + dd_dev_err(dd, "%s: Missing config file\n", __func__);
> + ret = -EINVAL;
> + goto bail;
> + }
> + ptr = (u32 *)dd->platform_config.data;
> +
> + magic_num = *ptr;
> + ptr++;
> + if (magic_num != PLATFORM_CONFIG_MAGIC_NUM) {
> + dd_dev_err(dd, "%s: Bad config file\n", __func__);
> + ret = -EINVAL;
> + goto bail;
> + }
> +
> + /* Field is file size in DWORDs */
> + file_length = (*ptr) * 4;
> +
> + /*
> + * Length can't be larger than partition size. Assume platform
> + * config format version 4 is being used. Interpret the file size
> + * field as header instead by not moving the pointer.
> + */
> + if (file_length > MAX_PLATFORM_CONFIG_FILE_SIZE) {
> + dd_dev_info(dd,
> + "%s:File length out of bounds, using alternative format\n",
> + __func__);
> + file_length = PLATFORM_CONFIG_FORMAT_4_FILE_SIZE;
> + } else {
> + ptr++;
> + }
> +
> + if (file_length > dd->platform_config.size) {
> + dd_dev_info(dd, "%s:File claims to be larger than read size\n",
> + __func__);
> + ret = -EINVAL;
> + goto bail;
> + } else if (file_length < dd->platform_config.size) {
> + dd_dev_info(dd,
> + "%s:File claims to be smaller than read size, continuing\n",
> + __func__);
> + }
> + /* exactly equal, perfection */
> +
> + /*
> + * In both cases where we proceed, using the self-reported file length
> + * is the safer option. In case of old format a predefined value is
> + * being used.
> + */
> + while (ptr < (u32 *)(dd->platform_config.data + file_length)) {
> + header1 = *ptr;
> + header2 = *(ptr + 1);
> + if (header1 != ~header2) {
> + dd_dev_err(dd, "%s: Failed validation at offset %ld\n",
> + __func__, (ptr - (u32 *)
> + dd->platform_config.data));
> + ret = -EINVAL;
> + goto bail;
> + }
> +
> + record_idx = *ptr &
> + ((1 << PLATFORM_CONFIG_HEADER_RECORD_IDX_LEN_BITS) - 1);
> +
> + table_length_dwords = (*ptr >>
> + PLATFORM_CONFIG_HEADER_TABLE_LENGTH_SHIFT) &
> + ((1 << PLATFORM_CONFIG_HEADER_TABLE_LENGTH_LEN_BITS) - 1);
> +
> + table_type = (*ptr >> PLATFORM_CONFIG_HEADER_TABLE_TYPE_SHIFT) &
> + ((1 << PLATFORM_CONFIG_HEADER_TABLE_TYPE_LEN_BITS) - 1);
> +
> + /* Done with this set of headers */
> + ptr += 2;
> +
> + if (record_idx) {
> + /* data table */
> + switch (table_type) {
> + case PLATFORM_CONFIG_SYSTEM_TABLE:
> + pcfgcache->config_tables[table_type].num_table =
> + 1;
> + ret = check_meta_version(dd, ptr);
> + if (ret)
> + goto bail;
> + break;
> + case PLATFORM_CONFIG_PORT_TABLE:
> + pcfgcache->config_tables[table_type].num_table =
> + 2;
> + break;
> + case PLATFORM_CONFIG_RX_PRESET_TABLE:
> + case PLATFORM_CONFIG_TX_PRESET_TABLE:
> + case PLATFORM_CONFIG_QSFP_ATTEN_TABLE:
> + case PLATFORM_CONFIG_VARIABLE_SETTINGS_TABLE:
> + pcfgcache->config_tables[table_type].num_table =
> + table_length_dwords;
> + break;
> + default:
> + dd_dev_err(dd,
> + "%s: Unknown data table %d, offset %ld\n",
> + __func__, table_type,
> + (ptr - (u32 *)
> + dd->platform_config.data));
> + ret = -EINVAL;
> + goto bail; /* We don't trust this file now */
> + }
> + pcfgcache->config_tables[table_type].table = ptr;
> + } else {
> + /* metadata table */
> + switch (table_type) {
> + case PLATFORM_CONFIG_SYSTEM_TABLE:
> + case PLATFORM_CONFIG_PORT_TABLE:
> + case PLATFORM_CONFIG_RX_PRESET_TABLE:
> + case PLATFORM_CONFIG_TX_PRESET_TABLE:
> + case PLATFORM_CONFIG_QSFP_ATTEN_TABLE:
> + case PLATFORM_CONFIG_VARIABLE_SETTINGS_TABLE:
> + break;
> + default:
> + dd_dev_err(dd,
> + "%s: Unknown meta table %d, offset %ld\n",
> + __func__, table_type,
> + (ptr -
> + (u32 *)dd->platform_config.data));
> + ret = -EINVAL;
> + goto bail; /* We don't trust this file now */
> + }
> + pcfgcache->config_tables[table_type].table_metadata =
> + ptr;
> + }
> +
> + /* Calculate and check table crc */
> + crc = crc32_le(~(u32)0, (unsigned char const *)ptr,
> + (table_length_dwords * 4));
> + crc ^= ~(u32)0;
> +
> + /* Jump the table */
> + ptr += table_length_dwords;
> + if (crc != *ptr) {
> + dd_dev_err(dd, "%s: Failed CRC check at offset %ld\n",
> + __func__, (ptr -
> + (u32 *)dd->platform_config.data));
> + ret = -EINVAL;
> + goto bail;
> + }
> + /* Jump the CRC DWORD */
> + ptr++;
> + }
> +
> + pcfgcache->cache_valid = 1;
> + return 0;
> +bail:
> + memset(pcfgcache, 0, sizeof(struct platform_config_cache));
> + return ret;
> +}
> +
> +static void get_integrated_platform_config_field(
> + struct hfi2_pportdata *ppd,
> + enum platform_config_table_type_encoding table_type,
> + int field_index, u32 *data)
> +{
> + u8 *cache = ppd->qsfp_info.cache;
> + u32 tx_preset = 0;
> +
> + switch (table_type) {
> + case PLATFORM_CONFIG_SYSTEM_TABLE:
> + if (field_index == SYSTEM_TABLE_QSFP_POWER_CLASS_MAX)
> + *data = ppd->max_power_class;
> + else if (field_index == SYSTEM_TABLE_QSFP_ATTENUATION_DEFAULT_25G)
> + *data = ppd->default_atten;
> + break;
> + case PLATFORM_CONFIG_PORT_TABLE:
> + if (field_index == PORT_TABLE_PORT_TYPE)
> + *data = ppd->port_type;
> + else if (field_index == PORT_TABLE_LOCAL_ATTEN_25G)
> + *data = ppd->local_atten;
> + else if (field_index == PORT_TABLE_REMOTE_ATTEN_25G)
> + *data = ppd->remote_atten;
> + break;
> + case PLATFORM_CONFIG_RX_PRESET_TABLE:
> + if (field_index == RX_PRESET_TABLE_QSFP_RX_CDR_APPLY)
> + *data = (ppd->rx_preset & QSFP_RX_CDR_APPLY_SMASK) >>
> + QSFP_RX_CDR_APPLY_SHIFT;
> + else if (field_index == RX_PRESET_TABLE_QSFP_RX_EMP_APPLY)
> + *data = (ppd->rx_preset & QSFP_RX_EMP_APPLY_SMASK) >>
> + QSFP_RX_EMP_APPLY_SHIFT;
> + else if (field_index == RX_PRESET_TABLE_QSFP_RX_AMP_APPLY)
> + *data = (ppd->rx_preset & QSFP_RX_AMP_APPLY_SMASK) >>
> + QSFP_RX_AMP_APPLY_SHIFT;
> + else if (field_index == RX_PRESET_TABLE_QSFP_RX_CDR)
> + *data = (ppd->rx_preset & QSFP_RX_CDR_SMASK) >>
> + QSFP_RX_CDR_SHIFT;
> + else if (field_index == RX_PRESET_TABLE_QSFP_RX_EMP)
> + *data = (ppd->rx_preset & QSFP_RX_EMP_SMASK) >>
> + QSFP_RX_EMP_SHIFT;
> + else if (field_index == RX_PRESET_TABLE_QSFP_RX_AMP)
> + *data = (ppd->rx_preset & QSFP_RX_AMP_SMASK) >>
> + QSFP_RX_AMP_SHIFT;
> + break;
> + case PLATFORM_CONFIG_TX_PRESET_TABLE:
> + if (cache[QSFP_EQ_INFO_OFFS] & 0x4)
> + tx_preset = ppd->tx_preset_eq;
> + else
> + tx_preset = ppd->tx_preset_noeq;
> + if (field_index == TX_PRESET_TABLE_PRECUR)
> + *data = (tx_preset & TX_PRECUR_SMASK) >>
> + TX_PRECUR_SHIFT;
> + else if (field_index == TX_PRESET_TABLE_ATTN)
> + *data = (tx_preset & TX_ATTN_SMASK) >>
> + TX_ATTN_SHIFT;
> + else if (field_index == TX_PRESET_TABLE_POSTCUR)
> + *data = (tx_preset & TX_POSTCUR_SMASK) >>
> + TX_POSTCUR_SHIFT;
> + else if (field_index == TX_PRESET_TABLE_QSFP_TX_CDR_APPLY)
> + *data = (tx_preset & QSFP_TX_CDR_APPLY_SMASK) >>
> + QSFP_TX_CDR_APPLY_SHIFT;
> + else if (field_index == TX_PRESET_TABLE_QSFP_TX_EQ_APPLY)
> + *data = (tx_preset & QSFP_TX_EQ_APPLY_SMASK) >>
> + QSFP_TX_EQ_APPLY_SHIFT;
> + else if (field_index == TX_PRESET_TABLE_QSFP_TX_CDR)
> + *data = (tx_preset & QSFP_TX_CDR_SMASK) >>
> + QSFP_TX_CDR_SHIFT;
> + else if (field_index == TX_PRESET_TABLE_QSFP_TX_EQ)
> + *data = (tx_preset & QSFP_TX_EQ_SMASK) >>
> + QSFP_TX_EQ_SHIFT;
> + break;
> + case PLATFORM_CONFIG_QSFP_ATTEN_TABLE:
> + case PLATFORM_CONFIG_VARIABLE_SETTINGS_TABLE:
> + default:
> + break;
> + }
> +}
> +
> +static int get_platform_fw_field_metadata(struct hfi2_devdata *dd, int table,
> + int field, u32 *field_len_bits,
> + u32 *field_start_bits)
> +{
> + struct platform_config_cache *pcfgcache = &dd->pcfg_cache;
> + u32 *src_ptr = NULL;
> +
> + if (!pcfgcache->cache_valid)
> + return -EINVAL;
> +
> + switch (table) {
> + case PLATFORM_CONFIG_SYSTEM_TABLE:
> + case PLATFORM_CONFIG_PORT_TABLE:
> + case PLATFORM_CONFIG_RX_PRESET_TABLE:
> + case PLATFORM_CONFIG_TX_PRESET_TABLE:
> + case PLATFORM_CONFIG_QSFP_ATTEN_TABLE:
> + case PLATFORM_CONFIG_VARIABLE_SETTINGS_TABLE:
> + if (field && field < platform_config_table_limits[table])
> + src_ptr =
> + pcfgcache->config_tables[table].table_metadata + field;
> + break;
> + default:
> + dd_dev_info(dd, "%s: Unknown table\n", __func__);
> + break;
> + }
> +
> + if (!src_ptr)
> + return -EINVAL;
> +
> + if (field_start_bits)
> + *field_start_bits = *src_ptr &
> + ((1 << METADATA_TABLE_FIELD_START_LEN_BITS) - 1);
> +
> + if (field_len_bits)
> + *field_len_bits = (*src_ptr >> METADATA_TABLE_FIELD_LEN_SHIFT)
> + & ((1 << METADATA_TABLE_FIELD_LEN_LEN_BITS) - 1);
> +
> + return 0;
> +}
> +
> +/* This is the central interface to getting data out of the platform config
> + * file. It depends on parse_platform_config() having populated the
> + * platform_config_cache in hfi2_devdata, and checks the cache_valid member to
> + * validate the sanity of the cache.
> + *
> + * The non-obvious parameters:
> + * @table_index: Acts as a look up key into which instance of the tables the
> + * relevant field is fetched from.
> + *
> + * This applies to the data tables that have multiple instances. The port table
> + * is an exception to this rule as each HFI only has one port and thus the
> + * relevant table can be distinguished by hfi_id.
> + *
> + * @data: pointer to memory that will be populated with the field requested.
> + * @len: length of memory pointed by @data in bytes.
> + */
> +int get_platform_config_field(struct hfi2_pportdata *ppd,
> + enum platform_config_table_type_encoding
> + table_type, int table_index, int field_index,
> + u32 *data, u32 len)
> +{
> + struct hfi2_devdata *dd = ppd->dd;
> + int ret = 0, wlen = 0, seek = 0;
> + u32 field_len_bits = 0, field_start_bits = 0, *src_ptr = NULL;
> + struct platform_config_cache *pcfgcache = &dd->pcfg_cache;
> +
> + if (data)
> + memset(data, 0, len);
> + else
> + return -EINVAL;
> +
> + if (ppd->config_from_scratch) {
> + /*
> + * Use saved configuration from ppd for integrated platforms
> + */
> + get_integrated_platform_config_field(ppd, table_type,
> + field_index, data);
> + return 0;
> + }
> +
> + ret = get_platform_fw_field_metadata(dd, table_type, field_index,
> + &field_len_bits,
> + &field_start_bits);
> + if (ret)
> + return -EINVAL;
> +
> + /* Convert length to bits */
> + len *= 8;
> +
> + /* Our metadata function checked cache_valid and field_index for us */
> + switch (table_type) {
> + case PLATFORM_CONFIG_SYSTEM_TABLE:
> + src_ptr = pcfgcache->config_tables[table_type].table;
> +
> + if (field_index != SYSTEM_TABLE_QSFP_POWER_CLASS_MAX) {
> + if (len < field_len_bits)
> + return -EINVAL;
> +
> + seek = field_start_bits / 8;
> + wlen = field_len_bits / 8;
> +
> + src_ptr = (u32 *)((u8 *)src_ptr + seek);
> +
> + /*
> + * We expect the field to be byte aligned and whole byte
> + * lengths if we are here
> + */
> + memcpy(data, src_ptr, wlen);
> + return 0;
> + }
> + break;
> + case PLATFORM_CONFIG_PORT_TABLE:
> + /* Port table is 4 DWORDS */
> + src_ptr = dd->hfi2_id ?
> + pcfgcache->config_tables[table_type].table + 4 :
> + pcfgcache->config_tables[table_type].table;
> + break;
> + case PLATFORM_CONFIG_RX_PRESET_TABLE:
> + case PLATFORM_CONFIG_TX_PRESET_TABLE:
> + case PLATFORM_CONFIG_QSFP_ATTEN_TABLE:
> + case PLATFORM_CONFIG_VARIABLE_SETTINGS_TABLE:
> + src_ptr = pcfgcache->config_tables[table_type].table;
> +
> + if (table_index <
> + pcfgcache->config_tables[table_type].num_table)
> + src_ptr += table_index;
> + else
> + src_ptr = NULL;
> + break;
> + default:
> + dd_dev_info(dd, "%s: Unknown table\n", __func__);
> + break;
> + }
> +
> + if (!src_ptr || len < field_len_bits)
> + return -EINVAL;
> +
> + src_ptr += (field_start_bits / 32);
> + *data = (*src_ptr >> (field_start_bits % 32)) &
> + ((1 << field_len_bits) - 1);
> +
> + return 0;
> +}
> +
> +/*
> + * Download the firmware needed for the Gen3 PCIe SerDes. An update
> + * to the SBus firmware is needed before updating the PCIe firmware.
> + *
> + * Note: caller must be holding the SBus resource.
> + */
> +int load_pcie_firmware(struct hfi2_devdata *dd)
> +{
> + int ret = 0;
> +
> + /* both firmware loads below use the SBus */
> + set_sbus_fast_mode(dd);
> +
> + if (fw_sbus_load) {
> + turn_off_spicos(dd, SPICO_SBUS);
> + do {
> + ret = load_sbus_firmware(dd, &fw_sbus);
> + } while (retry_firmware(dd, ret));
> + if (ret)
> + goto done;
> + }
> +
> + if (fw_pcie_serdes_load) {
> + dd_dev_info(dd, "Setting PCIe SerDes broadcast\n");
> + set_serdes_broadcast(dd, all_pcie_serdes_broadcast,
> + pcie_serdes_broadcast[dd->hfi2_id],
> + pcie_serdes_addrs[dd->hfi2_id],
> + NUM_PCIE_SERDES);
> + do {
> + ret = load_pcie_serdes_firmware(dd, &fw_pcie);
> + } while (retry_firmware(dd, ret));
> + if (ret)
> + goto done;
> + }
> +
> +done:
> + clear_sbus_fast_mode(dd);
> +
> + return ret;
> +}
> +
> +/*
> + * Read the GUID from the hardware, store it in dd.
> + */
> +void read_guid(struct hfi2_devdata *dd)
> +{
> + /* Take the DC out of reset to get a valid GUID value */
> + write_csr(dd, CCE_DC_CTRL, 0);
> + (void)read_csr(dd, CCE_DC_CTRL);
> +
> + dd->base_guid = read_csr(dd, DC_DC8051_CFG_LOCAL_GUID);
> +}
> +
> +/* read and display firmware version info */
> +static void dump_fw_version(struct hfi2_devdata *dd)
> +{
> + u32 pcie_vers[NUM_PCIE_SERDES];
> + u32 fabric_vers[NUM_FABRIC_SERDES];
> + u32 sbus_vers;
> + int i;
> + int all_same;
> + int ret;
> + u8 rcv_addr;
> +
> + /* no firmware or sbus in simulation, skip */
> + if (dd->icode == ICODE_FUNCTIONAL_SIMULATOR)
> + return;
> +
> + ret = acquire_chip_resource(dd, CR_SBUS, SBUS_TIMEOUT);
> + if (ret) {
> + dd_dev_err(dd, "Unable to acquire SBus to read firmware versions\n");
> + return;
> + }
> +
> + /* set fast mode */
> + set_sbus_fast_mode(dd);
> +
> + /* read version for SBus Master */
> + sbus_request(dd, SBUS_MASTER_BROADCAST, 0x02, WRITE_SBUS_RECEIVER, 0);
> + sbus_request(dd, SBUS_MASTER_BROADCAST, 0x07, WRITE_SBUS_RECEIVER, 0x1);
> + /* wait for interrupt to be processed */
> + usleep_range(10000, 11000);
> + sbus_vers = sbus_read(dd, SBUS_MASTER_BROADCAST, 0x08, 0x1);
> + dd_dev_info(dd, "SBus Master firmware version 0x%08x\n", sbus_vers);
> +
> + /* read version for PCIe SerDes */
> + all_same = 1;
> + pcie_vers[0] = 0;
> + for (i = 0; i < NUM_PCIE_SERDES; i++) {
> + rcv_addr = pcie_serdes_addrs[dd->hfi2_id][i];
> + sbus_request(dd, rcv_addr, 0x03, WRITE_SBUS_RECEIVER, 0);
> + /* wait for interrupt to be processed */
> + usleep_range(10000, 11000);
> + pcie_vers[i] = sbus_read(dd, rcv_addr, 0x04, 0x0);
> + if (i > 0 && pcie_vers[0] != pcie_vers[i])
> + all_same = 0;
> + }
> +
> + if (all_same) {
> + dd_dev_info(dd, "PCIe SerDes firmware version 0x%x\n",
> + pcie_vers[0]);
> + } else {
> + dd_dev_warn(dd, "PCIe SerDes do not have the same firmware version\n");
> + for (i = 0; i < NUM_PCIE_SERDES; i++) {
> + dd_dev_info(dd,
> + "PCIe SerDes lane %d firmware version 0x%x\n",
> + i, pcie_vers[i]);
> + }
> + }
> +
> + /* read version for fabric SerDes */
> + all_same = 1;
> + fabric_vers[0] = 0;
> + for (i = 0; i < NUM_FABRIC_SERDES; i++) {
> + rcv_addr = fabric_serdes_addrs[dd->hfi2_id][i];
> + sbus_request(dd, rcv_addr, 0x03, WRITE_SBUS_RECEIVER, 0);
> + /* wait for interrupt to be processed */
> + usleep_range(10000, 11000);
> + fabric_vers[i] = sbus_read(dd, rcv_addr, 0x04, 0x0);
> + if (i > 0 && fabric_vers[0] != fabric_vers[i])
> + all_same = 0;
> + }
> +
> + if (all_same) {
> + dd_dev_info(dd, "Fabric SerDes firmware version 0x%x\n",
> + fabric_vers[0]);
> + } else {
> + dd_dev_warn(dd, "Fabric SerDes do not have the same firmware version\n");
> + for (i = 0; i < NUM_FABRIC_SERDES; i++) {
> + dd_dev_info(dd,
> + "Fabric SerDes lane %d firmware version 0x%x\n",
> + i, fabric_vers[i]);
> + }
> + }
> +
> + clear_sbus_fast_mode(dd);
> + release_chip_resource(dd, CR_SBUS);
> +}
> diff --git a/drivers/infiniband/hw/hfi2/init.c b/drivers/infiniband/hw/hfi2/init.c
> new file mode 100644
> index 000000000000..70145b643d31
> --- /dev/null
> +++ b/drivers/infiniband/hw/hfi2/init.c
> @@ -0,0 +1,2931 @@
> +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
> +/*
> + * Copyright(c) 2015 - 2020 Intel Corporation.
> + * Copyright(c) 2025-2026 Cornelis Networks, Inc.
> + */
> +
> +#include <linux/pci.h>
> +#include <linux/netdevice.h>
> +#include <linux/vmalloc.h>
> +#include <linux/delay.h>
> +#include <linux/xarray.h>
> +#include <linux/module.h>
> +#include <linux/printk.h>
> +#include <linux/hrtimer.h>
> +#include <linux/bitmap.h>
> +#include <linux/numa.h>
> +#include <rdma/rdma_vt.h>
> +
> +#include "hfi2.h"
> +#include "file_ops.h"
> +#include "common.h"
> +#include "trace.h"
> +#include "mad.h"
> +#include "sdma.h"
> +#include "debugfs.h"
> +#include "verbs.h"
> +#include "aspm.h"
> +#include "affinity.h"
> +#include "exp_rcv.h"
> +#include "netdev.h"
> +#include "chip_jkr.h"
> +#include "chip_gen.h"
> +#include "pinning.h"
> +#include "cport_traps.h"
> +#include "sriov.h"
> +#include "vf2pf.h"
> +
> +#undef pr_fmt
> +#define pr_fmt(fmt) DRIVER_NAME ": " fmt
> +
> +#undef CPORT_TRAP_DEBUG /* all MCTXT TRAP events from CPORT */
> +#define PDEV_SRIOV_DEBUG
> +
> +/*
> + * min buffers we want to have per context, after driver
> + */
> +#define HFI2_MIN_USER_CTXT_BUFCNT 7
> +
> +#define HFI2_MIN_EAGER_BUFFER_SIZE (4 * 1024) /* 4KB */
> +#define HFI2_MAX_EAGER_BUFFER_SIZE (256 * 1024) /* 256KB */
> +
> +static void wfr_start_port(struct hfi2_pportdata *ppd);
> +static void wfr_stop_port(struct hfi2_pportdata *ppd);
> +static void destroy_workqueues(struct hfi2_devdata *dd);
> +
> +/* parameters for the WFR ASIC */
> +static const struct chip_params wfr_params = {
> + .chip_type = CHIP_WFR,
> + .num_ports = 1,
> + .dma_mask_bits = 48,
> +
> + /* BAR0 map: rcv array splits kreg1 and kreg2 */
> + .bar0_size = TXE_PIO_SEND + TXE_PIO_SIZE,
> + .kreg1_size = RCV_ARRAY,
> + .kreg2_offset = RCV_ARRAY + RCV_ARRAY_SIZE,
> + .kreg2_size = TXE_PIO_SEND - (RCV_ARRAY + RCV_ARRAY_SIZE),
> + .rcv_array_offset = RCV_ARRAY,
> + .rcv_array_size = RCV_ARRAY_SIZE,
> +
> + .link_speed_supported = OPA_LINK_SPEED_25G,
> + .link_speed_active = OPA_LINK_SPEED_25G,
> + .asic_cclock_ps = ASIC_CCLOCK_PS,
> + .rsm_rule_size = WFR_RXE_NUM_RSM_INSTANCES,
> + .rsm_rule_offset_shift = WFR_RCV_RSM_CFG_OFFSET_SHIFT,
> + .rsm_map_table_entries = 256,
> + .rsm_map_table_entries_per_csr = 8,
> + .rsm_map_table_entry_mask = 0xff,
> + .rsm_map_table_entry_shift = 8,
> + .qp_map_table_entries = 256,
> + .qp_map_table_entries_per_csr = 8,
> + .qp_map_table_entry_mask = 0xff,
> + .qp_map_table_entry_shift = 8,
> + .pkey_table_size = WFR_MAX_PKEY_VALUES,
> + .generic_boardname = "Cornelis Omni-Path Host Fabric Interface Adapter 100 Series",
> + .max_eager_entries = WFR_MAX_EAGER_ENTRIES,
> + .pio_base_bits = WFR_PIO_BASE_BITS,
> + .pio_base_shift = WFR_SEND_CTXT_CTRL_CTXT_BASE_SHIFT,
> + .egress_err_info_data = &wfr_egress_err_info_data,
> + .send_ctrl_flush = 0, /* no flush flag available */
> + .port_discard_egress_errs = WFR_PORT_DISCARD_EGRESS_ERRS,
> +
> + /* interrupt sources */
> + .num_int_csrs = WFR_CCE_NUM_INT_CSRS,
> + .num_int_map_csrs = WFR_CCE_NUM_INT_MAP_CSRS,
> + .is_rcvavail_start = IS_RCVAVAIL_START,
> + .is_rcvurgent_start = IS_RCVURGENT_START,
> + .is_sdmaeng_err_start = IS_SDMAENG_ERR_START,
> + .is_sdma_idle_start = IS_SDMA_IDLE_START,
> + .is_sdma_progress_start = IS_SDMA_PROGRESS_START,
> + .is_sdma_start = IS_SDMA_START,
> + .is_last_source = IS_LAST_SOURCE,
> + .is_table = is_table,
> + .gi_enable_table = wfr_gi_enable_table,
> +
> + /* cce_interrupt registers */
> + .cce_int_status_reg = WFR_CCE_INT_STATUS,
> + .cce_int_mask_reg = WFR_CCE_INT_MASK,
> + .cce_int_clear_reg = WFR_CCE_INT_CLEAR,
> + .cce_int_force_reg = WFR_CCE_INT_FORCE,
> + .cce_int_blocked_reg = WFR_CCE_INT_BLOCKED,
> +
> + /* counters */
> + .chip_dev_cntrs = wfr_dev_cntrs,
> + .chip_dev_cntr_first = WFR_DEV_CNTR_FIRST,
> + .chip_num_dev_cntrs = WFR_NUM_DEV_CNTRS,
> + .chip_port_cntrs = wfr_port_cntrs,
> + .chip_port_cntr_first = WFR_PORT_CNTR_FIRST,
> + .chip_num_port_cntrs = WFR_NUM_PORT_CNTRS,
> +
> + /* ingress port registers */
> + .rxe_iport_stride = 0,
> + .rcv_iport_ctrl_reg = WFR_RCV_CTRL,
> + .rcv_iport_status_reg = WFR_RCV_STATUS,
> + .rcv_bth_qp_reg = WFR_RCV_BTH_QP,
> + .rcv_multicast_reg = WFR_RCV_MULTICAST,
> + .rcv_bypass_reg = WFR_RCV_BYPASS,
> + .rcv_vl15_reg = WFR_RCV_VL15,
> + .rcv_err_info_reg = WFR_RCV_ERR_INFO,
> + .rcv_err_status_reg = WFR_RCV_ERR_STATUS,
> + .rcv_err_mask_reg = WFR_RCV_ERR_MASK,
> + .rcv_err_clear_reg = WFR_RCV_ERR_CLEAR,
> + .rcv_qp_map_table_reg = WFR_RCV_QP_MAP_TABLE,
> + .rcv_partition_key_reg = WFR_RCV_PARTITION_KEY,
> + .rcv_counter_array32_reg = WFR_RCV_COUNTER_ARRAY32,
> + .rcv_counter_array64_reg = WFR_RCV_COUNTER_ARRAY64,
> +
> + /* ingress port receive context registers */
> + .rxe_iprc_stride = WFR_RXE_IPRC_STRIDE,
> + .rcv_jkey_ctrl_reg = WFR_RCV_KEY_CTRL,
> +
> + /* RXE restricted context registers */
> + .rxe_rctxt_stride = WFR_RXE_RCTXT_STRIDE,
> + .rcv_rctxt_ctrl_reg = WFR_RCV_CTXT_CTRL,
> + .rcv_egr_ctrl_reg = WFR_RCV_EGR_CTRL,
> + .rcv_tid_ctrl_reg = WFR_RCV_TID_CTRL,
> +
> + /* RXE kernel context registers */
> + .rxe_kctxt_stride = WFR_RXE_KCTXT_STRIDE,
> + .rcv_kctxt_ctrl_reg = WFR_RCV_CTXT_CTRL,
> + .rcv_hdr_addr_reg = WFR_RCV_HDR_ADDR,
> + .rcv_hdr_cnt_reg = WFR_RCV_HDR_CNT,
> + .rcv_hdr_ent_size_reg = WFR_RCV_HDR_ENT_SIZE,
> + .rcv_hdr_tail_addr_reg = WFR_RCV_HDR_TAIL_ADDR,
> + .rcv_avail_time_out_reg = WFR_RCV_AVAIL_TIME_OUT,
> + .rcv_hdr_ovfl_cnt_reg = WFR_RCV_HDR_OVFL_CNT,
> +
> + /* RXE kernel/user registers */
> + .rxe_ku_stride = WFR_RXE_KCTXT_STRIDE,
> + .rcv_ctxt_status_reg = WFR_RCV_CTXT_STATUS,
> +
> + /* RXE user registers */
> + .rxe_uctxt_stride = WFR_RXE_UCTXT_STRIDE,
> + .rcv_hdr_tail_reg = WFR_RCV_HDR_TAIL,
> + .rcv_hdr_head_reg = WFR_RCV_HDR_HEAD,
> + .rcv_egr_index_head_reg = WFR_RCV_EGR_INDEX_HEAD,
> + .rcv_tid_flow_table_reg = WFR_RCV_TID_FLOW_TABLE,
> +
> + /* RXE RSM registers */
> + .rcv_rsm_cfg_reg = WFR_RCV_RSM_CFG,
> + .rcv_rsm_select_reg = WFR_RCV_RSM_SELECT,
> + .rcv_rsm_match_reg = WFR_RCV_RSM_MATCH,
> + .rcv_rsm_map_table_reg = WFR_RCV_RSM_MAP_TABLE,
> +
> + /* TXE kernel registers */
> + .send_contexts_reg = SEND_CONTEXTS,
> + .send_dma_engines_reg = WFR_SEND_DMA_ENGINES,
> + .send_pio_mem_size_reg = WFR_SEND_PIO_MEM_SIZE,
> + .send_dma_mem_size_reg = WFR_SEND_DMA_MEM_SIZE,
> + .send_pio_init_ctxt_reg = WFR_SEND_PIO_INIT_CTXT,
> +
> + /* send context_registers */
> + .txe_sctxt_stride = WFR_TXE_SCTXT_STRIDE,
> + .send_ctxt_status_reg = WFR_SEND_CTXT_STATUS,
> + .send_ctxt_credit_ctrl_reg = WFR_SEND_CTXT_CREDIT_CTRL,
> + .send_ctxt_credit_status_reg = WFR_SEND_CTXT_CREDIT_STATUS,
> + .send_ctxt_credit_return_addr_reg = WFR_SEND_CTXT_CREDIT_RETURN_ADDR,
> + .send_ctxt_credit_force_reg = WFR_SEND_CTXT_CREDIT_FORCE,
> + .send_ctxt_err_status_reg = WFR_SEND_CTXT_ERR_STATUS,
> + .send_ctxt_err_mask_reg = WFR_SEND_CTXT_ERR_MASK,
> + .send_ctxt_err_clear_reg = WFR_SEND_CTXT_ERR_CLEAR,
> +
> + /* TXE send context registers */
> + .txe_tctxt_stride = WFR_TXE_TCTXT_STRIDE,
> + .send_ctxt_ctrl_reg = WFR_SEND_CTXT_CTRL,
> +
> + /* SDMA registers */
> + .txe_sdma_stride = WFR_TXE_SDMA_STRIDE,
> + .send_dma_ctrl_reg = WFR_SEND_DMA_CTRL,
> + .send_dma_status_reg = WFR_SEND_DMA_STATUS,
> + .send_dma_base_addr_reg = WFR_SEND_DMA_BASE_ADDR,
> + .send_dma_len_gen_reg = WFR_SEND_DMA_LEN_GEN,
> + .send_dma_tail_reg = WFR_SEND_DMA_TAIL,
> + .send_dma_head_reg = WFR_SEND_DMA_HEAD,
> + .send_dma_head_addr_reg = WFR_SEND_DMA_HEAD_ADDR,
> + .send_dma_priority_thld_reg = WFR_SEND_DMA_PRIORITY_THLD,
> + .send_dma_idle_cnt_reg = WFR_SEND_DMA_IDLE_CNT,
> + .send_dma_reload_cnt_reg = WFR_SEND_DMA_RELOAD_CNT,
> + .send_dma_desc_cnt_reg = WFR_SEND_DMA_DESC_CNT,
> + .send_dma_desc_fetched_cnt_reg = WFR_SEND_DMA_DESC_FETCHED_CNT,
> + .send_dma_eng_err_status_reg = WFR_SEND_DMA_ENG_ERR_STATUS,
> + .send_dma_eng_err_mask_reg = WFR_SEND_DMA_ENG_ERR_MASK,
> + .send_dma_eng_err_clear_reg = WFR_SEND_DMA_ENG_ERR_CLEAR,
> +
> + /* SDMA Config registers */
> + .txe_sdmacfg_stride = WFR_TXE_SDMACFG_STRIDE,
> + .send_dma_cfg_memory_reg = WFR_SEND_DMA_MEMORY,
> +
> + /* egress port registers */
> + .txe_eport_stride = 0,
> + .send_ctrl_reg = SEND_CTRL,
> + .send_high_priority_limit_reg = WFR_SEND_HIGH_PRIORITY_LIMIT,
> + .send_egress_err_status_reg = WFR_SEND_EGRESS_ERR_STATUS,
> + .send_egress_err_mask_reg = WFR_SEND_EGRESS_ERR_MASK,
> + .send_egress_err_clear_reg = WFR_SEND_EGRESS_ERR_CLEAR,
> + .send_bth_qp_reg = WFR_SEND_BTH_QP,
> + .send_static_rate_control_reg = WFR_SEND_STATIC_RATE_CONTROL,
> + .send_sc2vlt0_reg = WFR_SEND_SC2VLT0,
> + .send_sc2vlt1_reg = WFR_SEND_SC2VLT1,
> + .send_sc2vlt2_reg = WFR_SEND_SC2VLT2,
> + .send_sc2vlt3_reg = WFR_SEND_SC2VLT3,
> + .send_len_check0_reg = WFR_SEND_LEN_CHECK0,
> + .send_len_check1_reg = WFR_SEND_LEN_CHECK1,
> + .send_low_priority_list_reg = WFR_SEND_LOW_PRIORITY_LIST,
> + .send_high_priority_list_reg = WFR_SEND_HIGH_PRIORITY_LIST,
> + .send_counter_array32_reg = WFR_SEND_COUNTER_ARRAY32,
> + .send_counter_array64_reg = WFR_SEND_COUNTER_ARRAY64,
> + .send_cm_ctrl_reg = WFR_SEND_CM_CTRL,
> + .send_cm_global_credit_reg = WFR_SEND_CM_GLOBAL_CREDIT,
> + .send_cm_credit_used_status_reg = WFR_SEND_CM_CREDIT_USED_STATUS,
> + .send_cm_timer_ctrl_reg = WFR_SEND_CM_TIMER_CTRL,
> + .send_cm_local_au_table0_to3_reg = WFR_SEND_CM_LOCAL_AU_TABLE0_TO3,
> + .send_cm_local_au_table4_to7_reg = WFR_SEND_CM_LOCAL_AU_TABLE4_TO7,
> + .send_cm_remote_au_table0_to3_reg = WFR_SEND_CM_REMOTE_AU_TABLE0_TO3,
> + .send_cm_remote_au_table4_to7_reg = WFR_SEND_CM_REMOTE_AU_TABLE4_TO7,
> + .send_cm_credit_vl_reg = WFR_SEND_CM_CREDIT_VL,
> + .send_cm_credit_vl15_reg = WFR_SEND_CM_CREDIT_VL15,
> + .send_egress_err_info_reg = WFR_SEND_EGRESS_ERR_INFO,
> + .send_egress_err_source_reg = WFR_SEND_EGRESS_ERR_SOURCE,
> + .send_egress_ctxt_status_reg = WFR_SEND_EGRESS_CTXT_STATUS,
> + .send_egress_send_dma_status_reg = WFR_SEND_EGRESS_SEND_DMA_STATUS,
> +
> + /* egress port send context registers */
> + .txe_epsc_stride = WFR_TXE_EPSC_STRIDE,
> + .send_ctxt_check_enable_reg = WFR_SEND_CTXT_CHECK_ENABLE,
> + .send_ctxt_check_vl_reg = WFR_SEND_CTXT_CHECK_VL,
> + .send_ctxt_check_job_key_reg = WFR_SEND_CTXT_CHECK_JOB_KEY,
> + .send_ctxt_check_partition_key_reg = WFR_SEND_CTXT_CHECK_PARTITION_KEY,
> + .send_ctxt_check_slid_reg = WFR_SEND_CTXT_CHECK_SLID,
> + .send_ctxt_check_opcode_reg = WFR_SEND_CTXT_CHECK_OPCODE,
> +
> + /* SI registers */
> + .cce_msix_int_map_vec_reg = WFR_CCE_INT_MAP,
> + .send_pio_err_status_reg = WFR_SEND_PIO_ERR_STATUS,
> + .send_pio_err_mask_reg = WFR_SEND_PIO_ERR_MASK,
> + .send_pio_err_clear_reg = WFR_SEND_PIO_ERR_CLEAR,
> + .send_dma_err_status_reg = WFR_SEND_DMA_ERR_STATUS,
> + .send_dma_err_mask_reg = WFR_SEND_DMA_ERR_MASK,
> + .send_dma_err_clear_reg = WFR_SEND_DMA_ERR_CLEAR,
> + .csr_err_status_reg = WFR_SEND_ERR_STATUS,
> + .csr_err_mask_reg = WFR_SEND_ERR_MASK,
> + .csr_err_clear_reg = WFR_SEND_ERR_CLEAR,
> +
> + .setextled = setextled,
> + .start_led_override = hfi2_start_led_override,
> + .shutdown_led_override = shutdown_led_override,
> + .read_guid = read_guid,
> + .early_per_chip_init = wfr_early_per_chip_init,
> + .mid_per_chip_init = wfr_mid_per_chip_init,
> + .init_other = init_other,
> + .late_per_chip_init = wfr_late_per_chip_init,
> + .start_port = wfr_start_port,
> + .stop_port = wfr_stop_port,
> + .put_tid = wfr_put_tid,
> + .rcv_array_wc_fill = wfr_rcv_array_wc_fill,
> + .set_port_tid_config = wfr_set_port_tid_config,
> + .set_port_max_mtu = wfr_set_port_max_mtu,
> + .update_rcv_hdr_size = wfr_update_rcv_hdr_size,
> + .check_synth_status = wfr_check_synth_status,
> + .update_synth_status = wfr_update_synth_status,
> + .create_pbc = wfr_create_pbc,
> + .set_pio_integrity = wfr_set_pio_integrity,
> + .find_used_resources = wfr_find_used_resources,
> + .read_link_quality = wfr_read_link_quality,
> + .set_rheq_addr = NULL,
> + .handle_link_bounce = wfr_handle_link_bounce,
> + .enable_rcv_context = wfr_enable_rcv_context,
> +};
> +
> +/* parameters for the JKR ASIC */
> +static const struct chip_params jkr_params = {
> + .chip_type = CHIP_JKR,
> + .num_ports = 2,
> + .dma_mask_bits = 58,
> +
> + /* BAR0 map: see comments where KREG values are defined */
> + .bar0_size = JKR_BAR0_SIZE,
> + .kreg1_size = JKR_KREG1_SIZE,
> + .kreg2_offset = JKR_KREG2_OFFSET,
> + .kreg2_size = JKR_KREG2_SIZE,
> + .rcv_array_offset = JKR_RCV_ARRAY,
> + .rcv_array_size = JKR_RCV_ARRAY_SIZE,
> +
> + .link_speed_supported = OPA_LINK_SPEED_100G | OPA_LINK_SPEED_25G,
> + .link_speed_active = OPA_LINK_SPEED_100G,
> + .asic_cclock_ps = JKR_ASIC_CCLOCK_PS,
> + .rsm_rule_size = JKR_C_RXE_NUM_RSM_INSTANCES,
> + .rsm_rule_offset_shift = JKR_RCV_RSM_CFG_OFFSET_SHIFT,
> + .rsm_map_table_entries = 256,
> + .rsm_map_table_entries_per_csr = 8,
> + .rsm_map_table_entry_mask = 0xff,
> + .rsm_map_table_entry_shift = 8,
> + .qp_map_table_entries = 256,
> + .qp_map_table_entries_per_csr = 8,
> + .qp_map_table_entry_mask = 0xff,
> + .qp_map_table_entry_shift = 8,
> + .pkey_table_size = JKR_MAX_PKEY_VALUES,
> + .generic_boardname = "Cornelis Networks 5000 Host Fabric Interface Adapter",
> + .max_eager_entries = JKR_MAX_EAGER_ENTRIES,
> + .pio_base_bits = JKR_PIO_BASE_BITS,
> + .pio_base_shift = JKR_SEND_CTXT_CTRL_CTXT_BASE_SHIFT,
> + .egress_err_info_data = &jkr_egress_err_info_data,
> + .send_ctrl_flush = JKR_SEND_CTRL_FLUSH_WRONG_LINK_STATE_SMASK,
> + .port_discard_egress_errs = JKR_PORT_DISCARD_EGRESS_ERRS,
> +
> + /* interrupt sources */
> + .num_int_csrs = JKR_C_CCE_NUM_INT_CSRS,
> + .num_int_map_csrs = JKR_C_CCE_NUM_INT_MAP_CSRS,
> + .is_cport_int = JKR_MCTXT_CPORT_TO_PCIE_INT,
> + .is_rcvavail_start = JKR_IS_RCVAVAIL_START,
> + .is_rcvurgent_start = JKR_IS_RCVURGENT_START,
> + .is_sdmaeng_err_start = JKR_IS_SDMAENG_ERR_START,
> + .is_sdma_idle_start = JKR_IS_SDMA_IDLE_START,
> + .is_sdma_progress_start = JKR_IS_SDMA_PROGRESS_START,
> + .is_sdma_start = JKR_IS_SDMA_START,
> + .is_last_source = JKR_IS_LAST_SOURCE,
> + .is_table = jkr_is_table,
> + .gi_enable_table = jkr_gi_enable_table,
> +
> + /* cce_interrupt registers */
> + .cce_int_status_reg = JKR_CCE_INT_STATUS,
> + .cce_int_mask_reg = JKR_CCE_INT_MASK,
> + .cce_int_clear_reg = JKR_CCE_INT_CLEAR,
> + .cce_int_force_reg = JKR_CCE_INT_FORCE,
> + .cce_int_blocked_reg = JKR_CCE_INT_BLOCKED,
> +
> + /* counters */
> + .chip_dev_cntrs = jkr_dev_cntrs,
> + .chip_dev_cntr_first = JKR_DEV_CNTR_FIRST,
> + .chip_num_dev_cntrs = JKR_NUM_DEV_CNTRS,
> + .chip_port_cntrs = jkr_port_cntrs,
> + .chip_port_cntr_first = JKR_PORT_CNTR_FIRST,
> + .chip_num_port_cntrs = JKR_NUM_PORT_CNTRS,
> +
> + /* ingress port registers */
> + .rxe_iport_stride = JKR_C_RXE_IPORT_STRIDE,
> + .rcv_iport_ctrl_reg = JKR_RCV_IPORT_CTRL,
> + .rcv_iport_status_reg = JKR_RCV_IPORT_STATUS,
> + .rcv_bth_qp_reg = JKR_RCV_BTH_QP,
> + .rcv_multicast_reg = JKR_RCV_MULTICAST,
> + .rcv_bypass_reg = JKR_RCV_BYPASS,
> + .rcv_vl15_reg = JKR_RCV_VL15,
> + .rcv_err_info_reg = JKR_RCV_ERR_INFO,
> + .rcv_err_status_reg = JKR_RCV_ERR_STATUS,
> + .rcv_err_mask_reg = JKR_RCV_ERR_MASK,
> + .rcv_err_clear_reg = JKR_RCV_ERR_CLEAR,
> + .rcv_qp_map_table_reg = JKR_RCV_QP_MAP_TABLE,
> + .rcv_partition_key_reg = JKR_RCV_PARTITION_KEY,
> + .rcv_counter_array32_reg = JKR_RCV_COUNTER_ARRAY32,
> + .rcv_counter_array64_reg = JKR_RCV_COUNTER_ARRAY64,
> +
> + /* ingress port receive context registers */
> + .rxe_iprc_stride = JKR_C_RXE_IPRC_STRIDE,
> + .rcv_jkey_ctrl_reg = JKR_RCV_JKEY_CTRL,
> +
> + /* RXE restricted context registers */
> + .rxe_rctxt_stride = JKR_C_RXE_RCTXT_STRIDE,
> + .rcv_rctxt_ctrl_reg = JKR_RCV_RCTXT_CTRL,
> + .rcv_egr_ctrl_reg = JKR_RCV_EGR_CTRL,
> + .rcv_tid_ctrl_reg = JKR_RCV_TID_CTRL,
> +
> + /* RXE kernel context registers */
> + .rxe_kctxt_stride = JKR_C_RXE_KCTXT_STRIDE,
> + .rcv_kctxt_ctrl_reg = JKR_RCV_KCTXT_CTRL,
> + .rcv_hdr_addr_reg = JKR_RCV_HDR_ADDR,
> + .rcv_hdr_cnt_reg = JKR_RCV_HDR_CNT,
> + .rcv_hdr_ent_size_reg = JKR_RCV_HDR_ENT_SIZE,
> + .rcv_hdr_tail_addr_reg = JKR_RCV_HDR_TAIL_ADDR,
> + .rcv_avail_time_out_reg = JKR_RCV_AVAIL_TIME_OUT,
> + .rcv_hdr_ovfl_cnt_reg = JKR_RCV_HDR_OVFL_CNT,
> +
> + /* RXE kernel/user registers */
> + .rxe_ku_stride = JKR_C_RXE_UCTXT_STRIDE,
> + .rcv_ctxt_status_reg = JKR_RCV_CTXT_STATUS,
> +
> + /* RXE user registers */
> + .rxe_uctxt_stride = JKR_C_RXE_UCTXT_STRIDE,
> + .rcv_hdr_tail_reg = JKR_RCV_HDR_TAIL,
> + .rcv_hdr_head_reg = JKR_RCV_HDR_HEAD,
> + .rcv_egr_index_head_reg = JKR_RCV_EGR_INDEX_HEAD,
> + .rcv_tid_flow_table_reg = JKR_RCV_TID_FLOW_TABLE,
> +
> + /* RXE RSM registers */
> + .rcv_rsm_cfg_reg = JKR_RCV_RSM_CFG,
> + .rcv_rsm_select_reg = JKR_RCV_RSM_SELECT,
> + .rcv_rsm_match_reg = JKR_RCV_RSM_MATCH,
> + .rcv_rsm_map_table_reg = JKR_RCV_RSM_MAP_TABLE,
> +
> + /* TXE kernel registers */
> + .send_contexts_reg = JKR_SEND_CONTEXTS,
> + .send_dma_engines_reg = JKR_SEND_DMA_ENGINES,
> + .send_pio_mem_size_reg = JKR_SEND_PIO_MEM_SIZE,
> + .send_dma_mem_size_reg = JKR_SEND_DMA_MEM_SIZE,
> + .send_pio_init_ctxt_reg = JKR_SEND_PIO_INIT_CTXT,
> +
> + /* send context_registers */
> + .txe_sctxt_stride = JKR_C_TXE_SCTXT_STRIDE,
> + .send_ctxt_status_reg = JKR_SEND_CTXT_STATUS,
> + .send_ctxt_credit_ctrl_reg = JKR_SEND_CTXT_CREDIT_CTRL,
> + .send_ctxt_credit_status_reg = JKR_SEND_CTXT_CREDIT_STATUS,
> + .send_ctxt_credit_return_addr_reg = JKR_SEND_CTXT_CREDIT_RETURN_ADDR,
> + .send_ctxt_credit_force_reg = JKR_SEND_CTXT_CREDIT_FORCE,
> + .send_ctxt_err_status_reg = JKR_SEND_CTXT_ERR_STATUS,
> + .send_ctxt_err_mask_reg = JKR_SEND_CTXT_ERR_MASK,
> + .send_ctxt_err_clear_reg = JKR_SEND_CTXT_ERR_CLEAR,
> +
> + /* TXE send context registers */
> + .txe_tctxt_stride = JKR_C_TXE_TCTXT_STRIDE,
> + .send_ctxt_ctrl_reg = JKR_SEND_CTXT_CTRL,
> +
> + /* SDMA registers */
> + .txe_sdma_stride = JKR_C_TXE_SDMA_STRIDE,
> + .send_dma_ctrl_reg = JKR_SEND_DMA_CTRL,
> + .send_dma_status_reg = JKR_SEND_DMA_STATUS,
> + .send_dma_base_addr_reg = JKR_SEND_DMA_BASE_ADDR,
> + .send_dma_len_gen_reg = JKR_SEND_DMA_LEN_GEN,
> + .send_dma_tail_reg = JKR_SEND_DMA_TAIL,
> + .send_dma_head_reg = JKR_SEND_DMA_HEAD,
> + .send_dma_head_addr_reg = JKR_SEND_DMA_HEAD_ADDR,
> + .send_dma_priority_thld_reg = JKR_SEND_DMA_PRIORITY_THLD,
> + .send_dma_idle_cnt_reg = JKR_SEND_DMA_IDLE_CNT,
> + .send_dma_reload_cnt_reg = JKR_SEND_DMA_RELOAD_CNT,
> + .send_dma_desc_cnt_reg = JKR_SEND_DMA_DESC_CNT,
> + .send_dma_desc_fetched_cnt_reg = JKR_SEND_DMA_DESC_FETCHED_CNT,
> + .send_dma_eng_err_status_reg = JKR_SEND_DMA_ENG_ERR_STATUS,
> + .send_dma_eng_err_mask_reg = JKR_SEND_DMA_ENG_ERR_MASK,
> + .send_dma_eng_err_clear_reg = JKR_SEND_DMA_ENG_ERR_CLEAR,
> +
> + /* SDMA Config registers */
> + .txe_sdmacfg_stride = JKR_C_TXE_SDMACFG_STRIDE,
> + .send_dma_cfg_memory_reg = JKR_SEND_DMA_CFG_MEMORY,
> +
> + /* egress port registers */
> + .txe_eport_stride = JKR_C_TXE_EPORT_STRIDE,
> + .send_ctrl_reg = JKR_SEND_CTRL,
> + .send_high_priority_limit_reg = JKR_SEND_HIGH_PRIORITY_LIMIT,
> + .send_egress_err_status_reg = JKR_SEND_EGRESS_ERR_STATUS,
> + .send_egress_err_mask_reg = JKR_SEND_EGRESS_ERR_MASK,
> + .send_egress_err_clear_reg = JKR_SEND_EGRESS_ERR_CLEAR,
> + .send_bth_qp_reg = JKR_SEND_BTH_QP,
> + .send_static_rate_control_reg = JKR_SEND_STATIC_RATE_CONTROL,
> + .send_sc2vlt0_reg = JKR_SEND_SC2VLT0,
> + .send_sc2vlt1_reg = JKR_SEND_SC2VLT1,
> + .send_sc2vlt2_reg = JKR_SEND_SC2VLT2,
> + .send_sc2vlt3_reg = JKR_SEND_SC2VLT3,
> + .send_len_check0_reg = JKR_SEND_LEN_CHECK0,
> + .send_len_check1_reg = JKR_SEND_LEN_CHECK1,
> + .send_low_priority_list_reg = JKR_SEND_LOW_PRIORITY_LIST,
> + .send_high_priority_list_reg = JKR_SEND_HIGH_PRIORITY_LIST,
> + .send_counter_array32_reg = JKR_SEND_COUNTER_ARRAY32,
> + .send_counter_array64_reg = JKR_SEND_COUNTER_ARRAY64,
> + .send_cm_ctrl_reg = JKR_SEND_CM_CTRL,
> + .send_cm_global_credit_reg = JKR_SEND_CM_GLOBAL_CREDIT,
> + .send_cm_credit_used_status_reg = JKR_SEND_CM_CREDIT_USED_STATUS,
> + .send_cm_timer_ctrl_reg = JKR_SEND_CM_TIMER_CTRL,
> + .send_cm_local_au_table0_to3_reg = JKR_SEND_CM_LOCAL_AU_TABLE0_TO3,
> + .send_cm_local_au_table4_to7_reg = JKR_SEND_CM_LOCAL_AU_TABLE4_TO7,
> + .send_cm_remote_au_table0_to3_reg = JKR_SEND_CM_REMOTE_AU_TABLE0_TO3,
> + .send_cm_remote_au_table4_to7_reg = JKR_SEND_CM_REMOTE_AU_TABLE4_TO7,
> + .send_cm_credit_vl_reg = JKR_SEND_CM_CREDIT_VL,
> + .send_cm_credit_vl15_reg = JKR_SEND_CM_CREDIT_VL15,
> + .send_egress_err_info_reg = JKR_SEND_EGRESS_ERR_INFO,
> + .send_egress_err_source_reg = JKR_SEND_EGRESS_ERR_SOURCE,
> + .send_egress_ctxt_status_reg = JKR_SEND_EGRESS_CTXT_STATUS,
> + .send_egress_send_dma_status_reg = JKR_SEND_EGRESS_SEND_DMA_STATUS,
> +
> + /* egress port send context registers */
> + .txe_epsc_stride = JKR_C_TXE_EPSC_STRIDE,
> + .send_ctxt_check_enable_reg = JKR_SEND_CTXT_CHECK_ENABLE,
> + .send_ctxt_check_vl_reg = JKR_SEND_CTXT_CHECK_VL,
> + .send_ctxt_check_job_key_reg = JKR_SEND_CTXT_CHECK_JOB_KEY,
> + .send_ctxt_check_partition_key_reg = JKR_SEND_CTXT_CHECK_PARTITION_KEY,
> + .send_ctxt_check_slid_reg = JKR_SEND_CTXT_CHECK_SLID,
> + .send_ctxt_check_opcode_reg = JKR_SEND_CTXT_CHECK_OPCODE,
> +
> + /* SI registers */
> + .cce_msix_int_map_vec_reg = JKR_CCE_MSIX_INT_MAP_VEC,
> + .send_pio_err_status_reg = JKR_SEND_PIO_ERR_STATUS,
> + .send_pio_err_mask_reg = JKR_SEND_PIO_ERR_MASK,
> + .send_pio_err_clear_reg = JKR_SEND_PIO_ERR_CLEAR,
> + .send_dma_err_status_reg = JKR_SEND_DMA_ERR_STATUS,
> + .send_dma_err_mask_reg = JKR_SEND_DMA_ERR_MASK,
> + .send_dma_err_clear_reg = JKR_SEND_DMA_ERR_CLEAR,
> + .csr_err_status_reg = JKR_CSR_ERR_STATUS,
> + .csr_err_mask_reg = JKR_CSR_ERR_MASK,
> + .csr_err_clear_reg = JKR_CSR_ERR_CLEAR,
> +
> + .setextled = gen_setextled,
> + .start_led_override = gen_start_led_override,
> + .shutdown_led_override = gen_shutdown_led_override,
> + .read_guid = jkr_read_guid,
> + .early_per_chip_init = jkr_early_per_chip_init,
> + .mid_per_chip_init = jkr_mid_per_chip_init,
> + .init_other = jkr_init_other,
> + .late_per_chip_init = gen_late_per_chip_init,
> + .start_port = gen_start_port,
> + .stop_port = gen_stop_port,
> + .put_tid = jkr_put_tid,
> + .rcv_array_wc_fill = jkr_rcv_array_wc_fill,
> + .set_port_tid_config = jkr_set_port_tid_config,
> + .set_port_max_mtu = gen_set_port_max_mtu,
> + .update_rcv_hdr_size = jkr_update_rcv_hdr_size,
> + .check_synth_status = jkr_check_synth_status,
> + .update_synth_status = jkr_update_synth_status,
> + .create_pbc = gen_create_pbc,
> + .set_pio_integrity = jkr_set_pio_integrity,
> + .find_used_resources = jkr_find_used_resources,
> + .read_link_quality = jkr_read_link_quality,
> + .set_rheq_addr = jkr_set_rheq_addr,
> + .handle_link_bounce = jkr_handle_link_bounce,
> + .enable_rcv_context = jkr_enable_rcv_context,
> +};
> +
> +/*
> + * Number of user receive contexts each port configured to use (allow for more
> + * pio buffers per ctxt, etc).
> + */
> +static int num_user_contexts_array[32];
> +static int num_user_contexts_count;
> +module_param_array_named(num_user_contexts, num_user_contexts_array, int,
> + &num_user_contexts_count, 0444);
> +MODULE_PARM_DESC(num_user_contexts, "Set max number of user contexts to use per-hfi, per-port (unset or -1: use the real (non-HT) CPU count)");
> +
> +uint krcvqs[RXE_NUM_DATA_VL];
> +int krcvqsset;
> +module_param_array(krcvqs, uint, &krcvqsset, 0444);
> +MODULE_PARM_DESC(krcvqs, "Array of the number of non-control kernel receive queues by VL");
> +
> +/* computed based on above array */
> +unsigned long n_krcvqs;
> +
> +static unsigned int hfi2_rcvarr_split = 25;
> +module_param_named(rcvarr_split, hfi2_rcvarr_split, uint, 0444);
> +MODULE_PARM_DESC(rcvarr_split, "Percent of context's RcvArray entries used for Eager buffers");
> +
> +static uint eager_buffer_size = (8 << 20); /* 8MB */
> +module_param(eager_buffer_size, uint, 0444);
> +MODULE_PARM_DESC(eager_buffer_size, "Size of the eager buffers, default: 8MB");
> +
> +static uint rcvhdrcnt = 2048; /* 2x the max eager buffer count */
> +module_param_named(rcvhdrcnt, rcvhdrcnt, uint, 0444);
> +MODULE_PARM_DESC(rcvhdrcnt, "Receive header queue count (default 2048)");
> +
> +static uint hfi2_hdrq_entsize = DEFAULT_HDRQ_ENTSIZE;
> +module_param_named(hdrq_entsize, hfi2_hdrq_entsize, uint, 0444);
> +MODULE_PARM_DESC(hdrq_entsize, "Size of header queue entries: 2 - 8B, 16 - 64B, 32 - 128B (default)");
> +
> +unsigned int user_credit_return_threshold = 33; /* default is 33% */
> +module_param(user_credit_return_threshold, uint, 0444);
> +MODULE_PARM_DESC(user_credit_return_threshold, "Credit return threshold for user send contexts, return when unreturned credits passes this many blocks (in percent of allocated blocks, 0 is off)");
> +
> +DEFINE_XARRAY_FLAGS(hfi2_dev_table, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
> +
> +struct cport_trap_reg {
> + u32 mask;
> + cport_trap_handler func;
> +};
> +
> +/* send, or resend, START message */
> +static int cport_start(struct hfi2_devdata *dd, int to_secs)
> +{
> + struct cport_start_payload start = {0};
> + union {
> + struct cport_start_payload pl;
> + u64 qw;
> + } *resp = NULL;
> + int resp_len = 0;
> + int ret;
> +
> + start.opts_ena = dd->cport->opts;
> + start.trap_ena = dd->cport->traps;
> +
> + ret = cport_send_req(dd, CH_OP_START, 0, &start, sizeof(start),
> + (void **)&resp, &resp_len, to_secs * HZ);
> + if (ret == MSG_RSP_STATUS_SEQ_NO_ERROR) {
> + dd_dev_info(dd, "CPORT sequence error, retrying\n");
> + ret = cport_send_req(dd, CH_OP_START, 0, &start, sizeof(start),
> + (void **)&resp, &resp_len, HZ);
> + }
> + if (ret) {
> + dd_dev_err(dd, "CPORT start failed %d\n", ret);
> + } else if (resp_len) {
> + dd_dev_info(dd, "CPORT started %016llx\n", resp->qw);
> + dd->cport->traps_act = resp->pl.trap_ena;
> + } else {
> + dd_dev_info(dd, "CPORT started\n");
> + }
> + kfree(resp);
> + return ret;
> +}
> +
> +int register_cport_trap(struct hfi2_devdata *dd, struct cport_trap_status traps,
> + cport_trap_handler func)
> +{
> + union {
> + struct cport_trap_status traps;
> + u32 dw;
> + } trap_val, cur_traps;
> + struct cport_trap_reg *entry;
> + u32 index;
> + int ret;
> +
> + if (!dd->cport)
> + return 0;
> +
> + trap_val.traps = traps;
> + cur_traps.traps = dd->cport->traps;
> +
> + entry = kzalloc_obj(entry, GFP_KERNEL);
> + if (!entry)
> + return -ENOMEM;
> + entry->mask = trap_val.dw;
> + entry->func = func;
> + ret = xa_alloc_irq(&dd->cport->trap_xa, &index, entry, xa_limit_32b, GFP_KERNEL);
> + if (ret < 0) {
> + kfree(entry);
> + return ret;
> + }
> +
> + trap_val.dw |= cur_traps.dw;
> + if (trap_val.dw != cur_traps.dw) {
> + dd->cport->traps = trap_val.traps;
> + ret = cport_start(dd, cport_adm_to);
> + }
> + return ret;
> +}
> +
> +int deregister_cport_trap(struct hfi2_devdata *dd, cport_trap_handler func)
> +{
> + union {
> + struct cport_trap_status traps;
> + u32 dw;
> + } trap_val, cur_traps;
> + struct cport_trap_reg *entry;
> + unsigned long index;
> +
> + if (!dd->cport)
> + return 0;
> +
> + trap_val.dw = 0;
> + xa_lock_irq(&dd->cport->trap_xa);
> + xa_for_each(&dd->cport->trap_xa, index, entry) {
> + if (entry->func == func) {
> + __xa_erase(&dd->cport->trap_xa, index);
> + kfree(entry);
> + } else {
> + trap_val.dw |= entry->mask;
> + }
> + }
> + xa_unlock_irq(&dd->cport->trap_xa);
> + cur_traps.traps = dd->cport->traps;
> + if (trap_val.dw != cur_traps.dw) {
> + dd->cport->traps = trap_val.traps;
> + cport_start(dd, cport_adm_to);
> + }
> +
> + return 0;
> +}
> +
> +static void clearall_cport_trap(struct hfi2_devdata *dd)
> +{
> + struct cport_trap_reg *entry;
> + unsigned long index;
> + struct cport_trap_status no_traps = {0};
> +
> + if (!dd->cport)
> + return;
> +
> + dd->cport->traps = no_traps;
> + cport_start(dd, cport_adm_to);
> + cport_register_cb(dd, CH_OP_TRAP, CH_OP_TRAP, NULL);
> + xa_lock_irq(&dd->cport->trap_xa);
> + /* there should be none left, but make certain */
> + xa_for_each(&dd->cport->trap_xa, index, entry) {
> + __xa_erase(&dd->cport->trap_xa, index);
> + dd_dev_info(dd, "removing latent TRAP handler %ps\n", entry->func);
> + kfree(entry);
> + }
> + xa_unlock_irq(&dd->cport->trap_xa);
> +}
> +
> +static int handle_cport_trap(struct hfi2_devdata *dd, u8 op, u8 sideband,
> + void *payload, int len, void *handle)
> +{
> + struct cport_trap_payload *traps = payload;
> + struct cport_trap_payload repress = {0};
> + union {
> + struct cport_trap_status traps;
> + u32 dw;
> + } trap_val;
> + struct cport_trap_reg *entry;
> + unsigned long index;
> + int ret;
> +
> + trap_val.traps = traps->trap_sts;
> +
> + /* clear-down the traps we got */
> + repress.trap_sts = traps->trap_sts;
> + ret = cport_send_notif(dd, CH_OP_TRAP_REPRESS, 0, &repress, sizeof(repress),
> + cport_adm_to * HZ);
> + if (ret)
> + dd_dev_warn(dd, "CPORT TRAP_REPRESS failed: %d\n", ret);
> +#ifdef CPORT_TRAP_DEBUG
> + pr_warn("hfi2_%d: %s: CPORT TRAP %08x\n", dd->unit, __func__, trap_val.dw);
> +#endif
> +
> + xa_lock_irq(&dd->cport->trap_xa);
> + xa_for_each(&dd->cport->trap_xa, index, entry) {
> + if (entry->mask & trap_val.dw)
> + entry->func(dd, trap_val.traps);
> + }
> + xa_unlock_irq(&dd->cport->trap_xa);
> +
> + return 0;
> +}
> +
> +static void cport_stop(struct hfi2_devdata *dd)
> +{
> + struct cport_stop_payload stop = {0};
> + u64 *resp = NULL;
> + int resp_len = 0;
> + int ret;
> +
> + if (!dd->cport)
> + return;
> +
> + ret = cport_send_req(dd, CH_OP_STOP, 0, &stop, sizeof(stop),
> + (void **)&resp, &resp_len, cport_adm_to * HZ);
> + if (ret)
> + dd_dev_err(dd, "CPORT stop failed %d\n", ret);
> + else if (resp_len)
> + dd_dev_info(dd, "CPORT stopped %016llx\n", *resp);
> + else
> + dd_dev_info(dd, "CPORT stopped\n");
> + kfree(resp);
> +}
> +
> +int start_cport(struct hfi2_devdata *dd)
> +{
> + int ret;
> +
> + ret = cport_init(dd);
> + if (ret || !dd->cport)
> + return ret;
> +
> + /*
> + * Do a STOP to ensure the device is properly cleaned up.
> + * This may cause firmware to be unresponsive for awhile,
> + * so increase the timeout for the subsequent START.
> + */
> + cport_stop(dd);
> +
> + cport_register_cb(dd, CH_OP_TRAP, CH_OP_TRAP, handle_cport_trap);
> +
> + dd->cport->opts.bare_metal = 1;
> +
> + ret = cport_start(dd, 3 * cport_adm_to);
> + if (ret)
> + cport_exit(dd);
> + return (ret > 0 ? -EIO : ret);
> +}
> +
> +static void stop_cport(struct hfi2_devdata *dd)
> +{
> + if (!dd->cport)
> + return;
> +
> + cport_stop(dd);
> +
> + cport_exit(dd);
> +}
> +
> +static int hfi2_create_kctxt(struct hfi2_pportdata *ppd, u16 ctxt)
> +{
> + struct hfi2_devdata *dd = ppd->dd;
> + struct hfi2_ctxtdata *rcd;
> + int ret;
> +
> + /* Control context has to be always 0 */
> + BUILD_BUG_ON(HFI2_CTRL_CTXT != 0);
> +
> + ret = hfi2_create_ctxtdata(ppd, dd->node, ctxt, &rcd);
> + if (ret < 0) {
> + dd_dev_err(dd, "Kernel receive context allocation failed\n");
> + return ret;
> + }
> +
> + /*
> + * Set up the kernel context flags here and now because they use
> + * default values for all receive side memories. User contexts will
> + * be handled as they are created.
> + */
> + rcd->flags = HFI2_CAP_KGET(MULTI_PKT_EGR) |
> + HFI2_CAP_KGET(NODROP_RHQ_FULL) |
> + HFI2_CAP_KGET(NODROP_EGR_FULL) |
> + HFI2_CAP_KGET(DMA_RTAIL);
> +
> + /* Control context must use DMA_RTAIL */
> + if (is_control_context(rcd))
> + rcd->flags |= HFI2_CAP_DMA_RTAIL;
> + rcd->fast_handler = get_dma_rtail_setting(rcd) ?
> + handle_receive_interrupt_dma_rtail :
> + handle_receive_interrupt_nodma_rtail;
> +
> + hfi2_set_seq_cnt(rcd, 1);
> +
> + rcd->sc = sc_alloc(ppd, SC_ACK, rcd->rcvhdrqentsize, dd->node);
> + if (!rcd->sc) {
> + dd_dev_err(dd, "Kernel send context allocation failed\n");
> + return -ENOMEM;
> + }
> + hfi2_init_ctxt(rcd->sc);
> +
> + return 0;
> +}
> +
> +/*
> + * Create the receive context array and one or more kernel contexts
> + */
> +int hfi2_create_kctxts(struct hfi2_devdata *dd)
> +{
> + struct hfi2_devrsrcs *dr = &dd->rsrcs;
> + u16 i;
> + u16 j;
> + int ret;
> +
> + /*
> + * so this is making dd->rcd much larger than needed. Unfortunately,
> + * current code requires that dd->rcd[x].ctxt == x (h/w context number
> + * must be the same as dd->rcd index number - s/w context number)
> + * and much code needs to change in order to fix this.
> + */
> + dd->num_rcd = chip_rcv_contexts(dd);
> + dd->rcd = kcalloc_node(dd->num_rcd, sizeof(*dd->rcd),
> + GFP_KERNEL, dd->node);
> + if (!dd->rcd) {
> + dd->num_rcd = 0;
> + return -ENOMEM;
> + }
> +
> + for (i = 0; i < dd->num_pports; i++) {
> + struct hfi2_pportdata *ppd = dd->pport + i;
> + struct hfi2_portrsrcs *pr = &dr->ppr[i];
> +
> + for (j = 0; j < pr->n_krcv_queues; j++) {
> + u16 ctxt = pr->rcv_context_base + j;
> +
> + ret = hfi2_create_kctxt(ppd, ctxt);
> + if (ret)
> + goto bail;
> + }
> + }
> +
> + return 0;
> +bail:
> + for (i = 0; i < dd->num_pports; i++) {
> + struct hfi2_portrsrcs *pr = &dr->ppr[i];
> +
> + for (j = 0; j < pr->n_krcv_queues; j++) {
> + u16 ctxt = pr->rcv_context_base + j;
> +
> + hfi2_free_ctxt(dd->rcd[ctxt]);
> + }
> + }
> +
> + /* All the contexts should be freed, free the array */
> + kfree(dd->rcd);
> + dd->rcd = NULL;
> + dd->num_rcd = 0;
> + return ret;
> +}
> +
> +/*
> + * Helper routines for the receive context reference count (rcd and uctxt).
> + */
> +static void hfi2_rcd_init(struct hfi2_ctxtdata *rcd)
> +{
> + kref_init(&rcd->kref);
> +}
> +
> +/**
> + * hfi2_rcd_free - When reference is zero clean up.
> + * @kref: pointer to an initialized rcd data structure
> + *
> + */
> +static void hfi2_rcd_free(struct kref *kref)
> +{
> + unsigned long flags;
> + struct hfi2_ctxtdata *rcd =
> + container_of(kref, struct hfi2_ctxtdata, kref);
> +
> + spin_lock_irqsave(&rcd->dd->uctxt_lock, flags);
> + rcd->dd->rcd[rcd->ctxt] = NULL;
> + spin_unlock_irqrestore(&rcd->dd->uctxt_lock, flags);
> +
> + hfi2_free_ctxtdata(rcd->dd, rcd);
> +
> + kfree(rcd);
> +}
> +
> +/**
> + * hfi2_rcd_put - decrement reference for rcd
> + * @rcd: pointer to an initialized rcd data structure
> + *
> + * Use this to put a reference after the init.
> + */
> +int hfi2_rcd_put(struct hfi2_ctxtdata *rcd)
> +{
> + if (rcd)
> + return kref_put(&rcd->kref, hfi2_rcd_free);
> +
> + return 0;
> +}
> +
> +/**
> + * hfi2_rcd_get - increment reference for rcd
> + * @rcd: pointer to an initialized rcd data structure
> + *
> + * Use this to get a reference after the init.
> + *
> + * Return : reflect kref_get_unless_zero(), which returns non-zero on
> + * increment, otherwise 0.
> + */
> +int hfi2_rcd_get(struct hfi2_ctxtdata *rcd)
> +{
> + return kref_get_unless_zero(&rcd->kref);
> +}
> +
> +/**
> + * allocate_rcd_index - allocate an rcd index from the rcd array
> + * @ppd: pointer to a valid port data structure
> + * @rcd: rcd data structure to assign
> + * @index[in,out]: in, suggested context number; out, selected context number
> + *
> + * Allocate an rcd index, either at the given context number or any within
> + * a dynamic range. If the fixed index is used or the dynamic range is full,
> + * return -EBUSY.
> + */
> +static int allocate_rcd_index(struct hfi2_pportdata *ppd,
> + struct hfi2_ctxtdata *rcd, u16 *index)
> +{
> + struct hfi2_devdata *dd = ppd->dd;
> + struct hfi2_portrsrcs *pr = &dd->rsrcs.ppr[ppd->hw_pidx];
> + unsigned long flags;
> + u16 ctxt = *index;
> + bool found;
> +
> + spin_lock_irqsave(&dd->uctxt_lock, flags);
> + found = false;
> + if (ctxt == DYNAMIC_CONTEXT) {
> + /* look for an unused dynamic context */
> + for (ctxt = pr->first_dyn_alloc_ctxt;
> + ctxt < pr->rcv_context_base + pr->num_rcv_contexts;
> + ctxt++) {
> + if (!dd->rcd[ctxt]) {
> + found = true;
> + break;
> + }
> + }
> + } else {
> + /* use the context number given */
> + if (!dd->rcd[ctxt])
> + found = true;
> + }
> +
> + if (found) {
> + rcd->ctxt = ctxt;
> + dd->rcd[ctxt] = rcd;
> + hfi2_rcd_init(rcd);
> + }
> + spin_unlock_irqrestore(&dd->uctxt_lock, flags);
> +
> + if (!found)
> + return -EBUSY;
> +
> + *index = ctxt;
> +
> + return 0;
> +}
> +
> +/**
> + * hfi2_rcd_get_by_index - get rcd by index
> + * @dd: pointer to a valid devdata structure
> + * @ctxt: the index of a possible rcd
> + *
> + * Hold the protecting spinlock and increment the reference on the selected
> + * rcd element.
> + *
> + * The caller is responsible for calling hfi2_rcd_put() on the returned
> + * pointer.
> + */
> +struct hfi2_ctxtdata *hfi2_rcd_get_by_index(struct hfi2_devdata *dd, u16 ctxt)
> +{
> + unsigned long flags;
> + struct hfi2_ctxtdata *rcd = NULL;
> +
> + spin_lock_irqsave(&dd->uctxt_lock, flags);
> + if (ctxt < dd->num_rcd) {
> + rcd = dd->rcd[ctxt];
> + if (rcd && !hfi2_rcd_get(rcd))
> + rcd = NULL;
> + }
> + spin_unlock_irqrestore(&dd->uctxt_lock, flags);
> +
> + return rcd;
> +}
> +
> +/*
> + * Common code for user and kernel context create and setup.
> + * NOTE: the initial kref is done here (hf1_rcd_init()).
> + */
> +int hfi2_create_ctxtdata(struct hfi2_pportdata *ppd, int numa, u16 ctxt,
> + struct hfi2_ctxtdata **context)
> +{
> + struct hfi2_devdata *dd = ppd->dd;
> + struct hfi2_devrsrcs *dr = &dd->rsrcs;
> + struct hfi2_portrsrcs *pr = &dr->ppr[ppd->hw_pidx];
> + struct hfi2_ctxtdata *rcd;
> +
> + rcd = kzalloc_node(sizeof(*rcd), GFP_KERNEL, numa);
> + if (rcd) {
> + u32 rcvtids, max_entries;
> + int ret;
> +
> + ret = allocate_rcd_index(ppd, rcd, &ctxt);
> + if (ret) {
> + *context = NULL;
> + kfree(rcd);
> + return ret;
> + }
> +
> + INIT_LIST_HEAD(&rcd->qp_wait_list);
> + hfi2_exp_tid_group_init(rcd);
> + rcd->ppd = ppd;
> + rcd->dd = dd;
> + rcd->numa_id = numa;
> + rcd->rcv_array_groups = dd->rcv_entries.ngroups;
> + rcd->rhf_rcv_function_map = normal_rhf_rcv_functions;
> + rcd->slow_handler = handle_receive_interrupt;
> + rcd->do_interrupt = rcd->slow_handler;
> + rcd->msix_intr = CCE_NUM_MSIX_VECTORS;
> +
> + mutex_init(&rcd->exp_mutex);
> + spin_lock_init(&rcd->exp_lock);
> + INIT_LIST_HEAD(&rcd->flow_queue.queue_head);
> + INIT_LIST_HEAD(&rcd->rarr_queue.queue_head);
> +
> + hfi2_cdbg(PROC, "setting up context %u", rcd->ctxt);
> +
> + /* calculate the context's RcvArray entry starting point */
> + rcd->eager_base = pr->rcv_array_base +
> + ((ctxt - pr->rcv_context_base) *
> + dd->rcv_entries.ngroups *
> + dd->rcv_entries.group_size);
> +
> + rcd->rcvhdrq_cnt = rcvhdrcnt;
> + rcd->rcvhdrqentsize = hfi2_hdrq_entsize;
> + rcd->rhf_offset =
> + rcd->rcvhdrqentsize - sizeof(u64) / sizeof(u32);
> + rcd->kdeth_rcv_hdr = DEFAULT_RCVHDRSIZE;
> + /*
> + * Simple Eager buffer allocation: we have already pre-allocated
> + * the number of RcvArray entry groups. Each ctxtdata structure
> + * holds the number of groups for that context.
> + *
> + * To follow CSR requirements and maintain cacheline alignment,
> + * make sure all sizes and bases are multiples of group_size.
> + *
> + * The expected entry count is what is left after assigning
> + * eager.
> + */
> + max_entries = rcd->rcv_array_groups * dd->rcv_entries.group_size;
> + rcvtids = ((max_entries * hfi2_rcvarr_split) / 100);
> + rcd->egrbufs.count = round_down(rcvtids, dd->rcv_entries.group_size);
> + if (rcd->egrbufs.count > dd->params->max_eager_entries) {
> + dd_dev_err(dd, "ctxt%u: requested too many RcvArray entries.\n",
> + rcd->ctxt);
> + rcd->egrbufs.count = dd->params->max_eager_entries;
> + }
> + hfi2_cdbg(PROC,
> + "ctxt%u: max Eager buffer RcvArray entries: %u",
> + rcd->ctxt, rcd->egrbufs.count);
> +
> + /*
> + * Allocate array that will hold the eager buffer accounting
> + * data.
> + * This will allocate the maximum possible buffer count based
> + * on the value of the RcvArray split parameter.
> + * The resulting value will be rounded down to the closest
> + * multiple of dd->rcv_entries.group_size.
> + */
> + rcd->egrbufs.buffers =
> + kcalloc_node(rcd->egrbufs.count,
> + sizeof(*rcd->egrbufs.buffers),
> + GFP_KERNEL, numa);
> + if (!rcd->egrbufs.buffers)
> + goto bail;
> + rcd->egrbufs.rcvtids =
> + kcalloc_node(rcd->egrbufs.count,
> + sizeof(*rcd->egrbufs.rcvtids),
> + GFP_KERNEL, numa);
> + if (!rcd->egrbufs.rcvtids)
> + goto bail;
> + rcd->egrbufs.size = eager_buffer_size;
> + /*
> + * The size of the buffers programmed into the RcvArray
> + * entries needs to be big enough to handle the highest
> + * MTU supported.
> + */
> + if (rcd->egrbufs.size < hfi2_max_mtu) {
> + rcd->egrbufs.size = __roundup_pow_of_two(hfi2_max_mtu);
> + hfi2_cdbg(PROC,
> + "ctxt%u: eager bufs size too small. Adjusting to %u",
> + rcd->ctxt, rcd->egrbufs.size);
> + }
> + rcd->egrbufs.rcvtid_size = HFI2_MAX_EAGER_BUFFER_SIZE;
> +
> + /* Applicable only for statically created kernel contexts */
> + if (ctxt < pr->first_dyn_alloc_ctxt) {
> + rcd->opstats = kzalloc_node(sizeof(*rcd->opstats),
> + GFP_KERNEL, numa);
> + if (!rcd->opstats)
> + goto bail;
> +
> + /* Initialize TID flow generations for the context */
> + hfi2_kern_init_ctxt_generations(rcd);
> + }
> +
> + *context = rcd;
> + return 0;
> + }
> +
> +bail:
> + *context = NULL;
> + hfi2_free_ctxt(rcd);
> + return -ENOMEM;
> +}
> +
> +/**
> + * hfi2_free_ctxt - free context
> + * @rcd: pointer to an initialized rcd data structure
> + *
> + * This wrapper is the free function that matches hfi2_create_ctxtdata().
> + * When a context is done being used (kernel or user), this function is called
> + * for the "final" put to match the kref init from hfi2_create_ctxtdata().
> + * Other users of the context do a get/put sequence to make sure that the
> + * structure isn't removed while in use.
> + */
> +void hfi2_free_ctxt(struct hfi2_ctxtdata *rcd)
> +{
> + hfi2_rcd_put(rcd);
> +}
> +
> +/*
> + * Select the largest ccti value over all SLs to determine the intra-
> + * packet gap for the link.
> + *
> + * called with cca_timer_lock held (to protect access to cca_timer
> + * array), and rcu_read_lock() (to protect access to cc_state).
> + */
> +void set_link_ipg(struct hfi2_pportdata *ppd)
> +{
> + struct hfi2_devdata *dd = ppd->dd;
> + struct cc_state *cc_state;
> + int i;
> + u16 cce, ccti_limit, max_ccti = 0;
> + u16 shift, mult;
> + u64 src;
> + u32 current_egress_rate; /* Mbits /sec */
> + u64 max_pkt_time;
> + /*
> + * max_pkt_time is the maximum packet egress time in units
> + * of the fabric clock period 1/(805 MHz).
> + */
> +
> + cc_state = get_cc_state(ppd);
> +
> + if (!cc_state)
> + /*
> + * This should _never_ happen - rcu_read_lock() is held,
> + * and set_link_ipg() should not be called if cc_state
> + * is NULL.
> + */
> + return;
> +
> + for (i = 0; i < OPA_MAX_SLS; i++) {
> + u16 ccti = ppd->cca_timer[i].ccti;
> +
> + if (ccti > max_ccti)
> + max_ccti = ccti;
> + }
> +
> + ccti_limit = cc_state->cct.ccti_limit;
> + if (max_ccti > ccti_limit)
> + max_ccti = ccti_limit;
> +
> + cce = cc_state->cct.entries[max_ccti].entry;
> + shift = (cce & 0xc000) >> 14;
> + mult = (cce & 0x3fff);
> +
> + current_egress_rate = active_egress_rate(ppd);
> +
> + max_pkt_time = egress_cycles(ppd->ibmaxlen, current_egress_rate);
> +
> + src = (max_pkt_time >> shift) * mult;
> +
> + src &= SEND_STATIC_RATE_CONTROL_CSR_SRC_RELOAD_SMASK;
> + src <<= SEND_STATIC_RATE_CONTROL_CSR_SRC_RELOAD_SHIFT;
> +
> + write_eport_csr(dd, ppd->hw_pidx, dd->params->send_static_rate_control_reg, src);
> +}
> +
> +static enum hrtimer_restart cca_timer_fn(struct hrtimer *t)
> +{
> + struct cca_timer *cca_timer;
> + struct hfi2_pportdata *ppd;
> + int sl;
> + u16 ccti_timer, ccti_min;
> + struct cc_state *cc_state;
> + unsigned long flags;
> + enum hrtimer_restart ret = HRTIMER_NORESTART;
> +
> + cca_timer = container_of(t, struct cca_timer, hrtimer);
> + ppd = cca_timer->ppd;
> + sl = cca_timer->sl;
> +
> + rcu_read_lock();
> +
> + cc_state = get_cc_state(ppd);
> +
> + if (!cc_state) {
> + rcu_read_unlock();
> + return HRTIMER_NORESTART;
> + }
> +
> + /*
> + * 1) decrement ccti for SL
> + * 2) calculate IPG for link (set_link_ipg())
> + * 3) restart timer, unless ccti is at min value
> + */
> +
> + ccti_min = cc_state->cong_setting.entries[sl].ccti_min;
> + ccti_timer = cc_state->cong_setting.entries[sl].ccti_timer;
> +
> + spin_lock_irqsave(&ppd->cca_timer_lock, flags);
> +
> + if (cca_timer->ccti > ccti_min) {
> + cca_timer->ccti--;
> + set_link_ipg(ppd);
> + }
> +
> + if (cca_timer->ccti > ccti_min) {
> + unsigned long nsec = 1024 * ccti_timer;
> + /* ccti_timer is in units of 1.024 usec */
> + hrtimer_forward_now(t, ns_to_ktime(nsec));
> + ret = HRTIMER_RESTART;
> + }
> +
> + spin_unlock_irqrestore(&ppd->cca_timer_lock, flags);
> + rcu_read_unlock();
> + return ret;
> +}
> +
> +/*
> + * Common code for initializing the physical port structure.
> + */
> +void hfi2_init_pportdata(struct pci_dev *pdev, struct hfi2_pportdata *ppd,
> + struct hfi2_devdata *dd, u8 hw_pidx, u32 port)
> +{
> + int i;
> + uint default_pkey_idx;
> + struct cc_state *cc_state;
> +
> + ppd->dd = dd;
> + ppd->hw_pidx = hw_pidx;
> + ppd->port = port; /* IB port number, not index */
> + ppd->prev_link_width = LINK_WIDTH_DEFAULT;
> + /*
> + * There are C_VL_COUNT number of PortVLXmitWait counters.
> + * Adding 1 to C_VL_COUNT to include the PortXmitWait counter.
> + */
> + for (i = 0; i < C_VL_COUNT + 1; i++) {
> + ppd->port_vl_xmit_wait_last[i] = 0;
> + ppd->vl_xmit_flit_cnt[i] = 0;
> + }
> +
> + default_pkey_idx = 1;
> +
> + ppd->pkeys[default_pkey_idx] = DEFAULT_P_KEY;
> + ppd->part_enforce |= HFI2_PART_ENFORCE_IN;
> + ppd->pkeys[0] = 0x8001;
> +
> + INIT_WORK(&ppd->link_vc_work, handle_verify_cap);
> + INIT_WORK(&ppd->link_up_work, handle_link_up);
> + INIT_WORK(&ppd->link_down_work, handle_link_down);
> + INIT_WORK(&ppd->link_downgrade_work, handle_link_downgrade);
> + INIT_WORK(&ppd->sma_message_work, handle_sma_message);
> + INIT_WORK(&ppd->link_bounce_work, dd->params->handle_link_bounce);
> + INIT_DELAYED_WORK(&ppd->start_link_work, handle_start_link);
> + INIT_WORK(&ppd->linkstate_active_work, receive_interrupt_work);
> + INIT_WORK(&ppd->qsfp_info.qsfp_work, qsfp_event);
> +
> + mutex_init(&ppd->hls_lock);
> + spin_lock_init(&ppd->qsfp_info.qsfp_lock);
> + seqlock_init(&ppd->sc2vl_lock);
> +
> + ppd->qsfp_info.ppd = ppd;
> + ppd->sm_trap_qp = 0x0;
> + ppd->sa_qp = 0x1;
> +
> + spin_lock_init(&ppd->cca_timer_lock);
> +
> + for (i = 0; i < OPA_MAX_SLS; i++) {
> + ppd->cca_timer[i].ppd = ppd;
> + ppd->cca_timer[i].sl = i;
> + ppd->cca_timer[i].ccti = 0;
> + hrtimer_setup(&ppd->cca_timer[i].hrtimer, cca_timer_fn, CLOCK_MONOTONIC,
> + HRTIMER_MODE_REL);
> + }
> +
> + ppd->cc_max_table_entries = IB_CC_TABLE_CAP_DEFAULT;
> +
> + spin_lock_init(&ppd->cc_state_lock);
> + spin_lock_init(&ppd->cc_log_lock);
> + cc_state = kzalloc_obj(cc_state, GFP_KERNEL);
> + RCU_INIT_POINTER(ppd->cc_state, cc_state);
> + if (!cc_state)
> + goto bail;
> + atomic_set(&ppd->ipoib_rsm_usr_num, 0);
> + ppd->netdev_rsm_rule = -1;
> + return;
> +
> +bail:
> + dd_dev_err(dd, "Congestion Control Agent disabled for port %d\n", port);
> +}
> +
> +/*
> + * Do initialization for device that is only needed on
> + * first detect, not on resets.
> + */
> +static int loadtime_init(struct hfi2_devdata *dd)
> +{
> + return 0;
> +}
> +
> +/**
> + * init_after_reset - re-initialize after a reset
> + * @dd: the hfi2_ib device
> + *
> + * sanity check at least some of the values after reset, and
> + * ensure no receive or transmit (explicitly, in case reset
> + * failed
> + */
> +static int init_after_reset(struct hfi2_devdata *dd)
> +{
> + struct hfi2_devrsrcs *dr = &dd->rsrcs;
> + int i;
> + int j;
> + struct hfi2_ctxtdata *rcd;
> + /*
> + * Ensure chip does no sends or receives, tail updates, or
> + * pioavail updates while we re-initialize. This is mostly
> + * for the driver data structures, not chip registers.
> + */
> + for (i = 0; i < dd->num_pports; i++) {
> + struct hfi2_portrsrcs *pr = &dr->ppr[i];
> +
> + for (j = 0; j < pr->num_rcv_contexts; j++) {
> + u16 ctxt = pr->rcv_context_base + j;
> +
> + rcd = hfi2_rcd_get_by_index(dd, ctxt);
> + hfi2_rcvctrl(dd, HFI2_RCVCTRL_CTXT_DIS |
> + HFI2_RCVCTRL_INTRAVAIL_DIS |
> + HFI2_RCVCTRL_TAILUPD_DIS, rcd);
> + hfi2_rcd_put(rcd);
> + }
> + }
> + for (i = 0; i < dd->num_pports; i++)
> + pio_send_control(&dd->pport[i], PSC_GLOBAL_DISABLE);
> + for (i = 0; i < dd->num_send_contexts; i++)
> + sc_disable(dd->send_contexts[i].sc);
> +
> + return 0;
> +}
> +
> +static void enable_chip(struct hfi2_devdata *dd)
> +{
> + struct hfi2_devrsrcs *dr = &dd->rsrcs;
> + struct hfi2_ctxtdata *rcd;
> + u32 rcvmask;
> + u16 i;
> + u16 j;
> +
> + /* enable PIO send */
> + for (i = 0; i < dd->num_pports; i++)
> + pio_send_control(&dd->pport[i], PSC_GLOBAL_ENABLE);
> +
> + /*
> + * Enable kernel ctxts' receive and receive interrupt.
> + * Other ctxts done as user opens and initializes them.
> + */
> + for (i = 0; i < dd->num_pports; i++) {
> + struct hfi2_portrsrcs *pr = &dr->ppr[i];
> +
> + for (j = 0; j < pr->n_krcv_queues; j++) {
> + u16 ctxt = pr->rcv_context_base + j;
> +
> + rcd = hfi2_rcd_get_by_index(dd, ctxt);
> + if (!rcd)
> + continue;
> + rcvmask = HFI2_RCVCTRL_CTXT_ENB
> + | HFI2_RCVCTRL_INTRAVAIL_ENB;
> + if (HFI2_CAP_KGET_MASK(rcd->flags, DMA_RTAIL))
> + rcvmask |= HFI2_RCVCTRL_TAILUPD_ENB;
> + else
> + rcvmask |= HFI2_RCVCTRL_TAILUPD_DIS;
> + if (!HFI2_CAP_KGET_MASK(rcd->flags, MULTI_PKT_EGR))
> + rcvmask |= HFI2_RCVCTRL_ONE_PKT_EGR_ENB;
> + if (HFI2_CAP_KGET_MASK(rcd->flags, NODROP_RHQ_FULL))
> + rcvmask |= HFI2_RCVCTRL_NO_RHQ_DROP_ENB;
> + if (HFI2_CAP_KGET_MASK(rcd->flags, NODROP_EGR_FULL))
> + rcvmask |= HFI2_RCVCTRL_NO_EGR_DROP_ENB;
> + if (HFI2_CAP_IS_KSET(TID_RDMA))
> + rcvmask |= HFI2_RCVCTRL_TIDFLOW_ENB;
> + hfi2_rcvctrl(dd, rcvmask, rcd);
> + sc_enable(rcd->sc);
> + hfi2_rcd_put(rcd);
> + }
> + }
> +}
> +
> +/**
> + * create_workqueues - create per port workqueues
> + * @dd: the hfi2_ib device
> + */
> +static int create_workqueues(struct hfi2_devdata *dd)
> +{
> + int pidx;
> + struct hfi2_pportdata *ppd;
> +
> + if (!dd->hfi2_wq) {
> + dd->hfi2_wq = alloc_workqueue("hfi%d",
> + WQ_SYSFS | WQ_HIGHPRI |
> + WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM |
> + WQ_PERCPU,
> + HFI2_MAX_ACTIVE_GEN_WQ_ENTRIES,
> + dd->unit);
> + if (!dd->hfi2_wq)
> + goto wq_error;
> + }
> + for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> + ppd = dd->pport + pidx;
> + if (!ppd->link_wq) {
> + /*
> + * Make the link workqueue single-threaded to enforce
> + * serialization.
> + */
> + ppd->link_wq = alloc_workqueue("hfi_link_%d_%d",
> + WQ_SYSFS |
> + WQ_MEM_RECLAIM |
> + WQ_UNBOUND,
> + 1, /* max_active */
> + dd->unit, pidx);
> + if (!ppd->link_wq) {
> + pr_err("alloc_workqueue failed for port %d\n",
> + pidx + 1);
> + goto wq_error;
> + }
> + }
> + }
> + return 0;
> +
> +wq_error:
> + destroy_workqueues(dd);
> + return -ENOMEM;
> +}
> +
> +/**
> + * destroy_workqueues - destroy per port workqueues
> + * @dd: the hfi2_ib device
> + */
> +static void destroy_workqueues(struct hfi2_devdata *dd)
> +{
> + int pidx;
> + struct hfi2_pportdata *ppd;
> +
> + for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> + ppd = dd->pport + pidx;
> +
> + if (ppd->link_wq) {
> + destroy_workqueue(ppd->link_wq);
> + ppd->link_wq = NULL;
> + }
> + }
> + if (dd->hfi2_wq) {
> + destroy_workqueue(dd->hfi2_wq);
> + dd->hfi2_wq = NULL;
> + }
> +}
> +
> +/**
> + * enable_general_intr() - Enable the IRQs that will be handled by the
> + * general interrupt handler.
> + * @dd: valid devdata
> + *
> + */
> +static void enable_general_intr(struct hfi2_devdata *dd)
> +{
> + const struct gi_enable_entry *entry = dd->params->gi_enable_table;
> +
> + for (; entry->start <= entry->end; entry++)
> + set_intr_bits(dd, entry->start, entry->end, true);
> +}
> +
> +static void wfr_start_port(struct hfi2_pportdata *ppd)
> +{
> + int ret;
> +
> + init_qsfp_int(ppd);
> +
> + /*
> + * start the serdes - must be after interrupts are
> + * enabled so we are notified when the link goes up
> + */
> + ret = bringup_serdes(ppd);
> + if (ret)
> + ppd_dev_info(ppd, "Failed to bring up port\n");
> +}
> +
> +static void wfr_stop_port(struct hfi2_pportdata *ppd)
> +{
> + /*
> + * Clear SerdesEnable.
> + * We can't count on interrupts since we are stopping.
> + */
> + hfi2_quiet_serdes(ppd);
> + if (ppd->link_wq)
> + flush_workqueue(ppd->link_wq);
> +}
> +
> +/**
> + * hfi2_init - do the actual initialization sequence on the chip
> + * @dd: the hfi2_ib device
> + * @reinit: re-initializing, so don't allocate new memory
> + *
> + * Do the actual initialization sequence on the chip. This is done
> + * both from the init routine called from the PCI infrastructure, and
> + * when we reset the chip, or detect that it was reset internally,
> + * or it's administratively re-enabled.
> + *
> + * Memory allocation here and in called routines is only done in
> + * the first case (reinit == 0). We have to be careful, because even
> + * without memory allocation, we need to re-write all the chip registers
> + * TIDs, etc. after the reset or enable has completed.
> + */
> +int hfi2_init(struct hfi2_devdata *dd, int reinit)
> +{
> + struct hfi2_devrsrcs *dr = &dd->rsrcs;
> + int ret = 0, pidx, lastfail = 0;
> + unsigned long len;
> + u16 i;
> + struct hfi2_ctxtdata *rcd;
> + struct hfi2_pportdata *ppd;
> +
> + /* Set up send low level handlers */
> + dd->process_pio_send = hfi2_verbs_send_pio;
> + dd->process_dma_send = hfi2_verbs_send_dma;
> + dd->pio_inline_send = pio_copy;
> +
> + if (is_ax(dd)) {
> + atomic_set(&dd->drop_packet, DROP_PACKET_ON);
> + dd->do_drop = true;
> + } else {
> + atomic_set(&dd->drop_packet, DROP_PACKET_OFF);
> + dd->do_drop = false;
> + }
> +
> + /* make sure the link is not "up" */
> + for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> + ppd = dd->pport + pidx;
> + ppd->linkup = 0;
> + }
> +
> + if (reinit)
> + ret = init_after_reset(dd);
> + else
> + ret = loadtime_init(dd);
> + if (ret)
> + goto done;
> +
> + /* dd->rcd can be NULL if early initialization failed */
> + for (pidx = 0; dd->rcd && pidx < dd->num_pports; pidx++) {
> + struct hfi2_portrsrcs *pr = &dr->ppr[pidx];
> +
> + for (i = 0; i < pr->n_krcv_queues; ++i) {
> + u16 ctxt = pr->rcv_context_base + i;
> + /*
> + * Set up the (kernel) rcvhdr queue and egr TIDs. If
> + * doing re-init, the simplest way to handle this is
> + * to free existing, and re-allocate.
> + * Need to re-create rest of ctxt 0 ctxtdata as well.
> + */
> + rcd = hfi2_rcd_get_by_index(dd, ctxt);
> + if (!rcd)
> + continue;
> +
> + lastfail = hfi2_create_rcvhdrq(dd, rcd);
> + if (!lastfail)
> + lastfail = hfi2_setup_eagerbufs(rcd);
> + if (!lastfail)
> + lastfail = hfi2_kern_exp_rcv_init(rcd, reinit);
> + if (lastfail) {
> + dd_dev_err(dd,
> + "failed to allocate kernel ctxt's rcvhdrq and/or egr bufs\n");
> + ret = lastfail;
> + }
> + /* enable IRQ */
> + hfi2_rcd_put(rcd);
> + }
> + }
> +
> + /*
> + * so this is making dd->events much larger than needed. Unfortunately,
> + * uctxt_offset() uses the h/w context number and so all that would
> + * need to change in order to fix this.
> + */
> + /* Allocate enough memory for user event notification. */
> + len = PAGE_ALIGN(chip_rcv_contexts(dd) * HFI2_MAX_SHARED_CTXTS *
> + sizeof(*dd->events));
> + dd->events = vmalloc_user(len);
> + if (!dd->events)
> + dd_dev_err(dd, "Failed to allocate user events page\n");
> + /*
> + * Allocate a page for device and port status.
> + * Page will be shared amongst all user processes.
> + */
> + dd->status = vmalloc_user(PAGE_SIZE);
> + if (!dd->status)
> + dd_dev_err(dd, "Failed to allocate dev status page\n");
> + for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> + ppd = dd->pport + pidx;
> + if (dd->status)
> + ppd->statusp = &dd->status->ports[pidx];
> +
> + set_mtu(ppd);
> + }
> +
> + /* enable chip even if we have an error, so we can debug cause */
> + enable_chip(dd);
> +
> +done:
> + /*
> + * Set status even if port serdes is not initialized
> + * so that diags will work.
> + */
> + if (dd->status)
> + dd->status->dev |= HFI2_STATUS_CHIP_PRESENT |
> + HFI2_STATUS_INITTED;
> + if (!ret) {
> + /* enable all interrupts from the chip */
> + enable_general_intr(dd);
> +
> + /* chip is OK for user apps; mark it as initialized */
> + for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> + ppd = dd->pport + pidx;
> +
> + dd->params->start_port(ppd);
> +
> + /*
> + * Set status even if port serdes is not initialized
> + * so that diags will work.
> + */
> + if (ppd->statusp)
> + *ppd->statusp |= HFI2_STATUS_CHIP_PRESENT |
> + HFI2_STATUS_INITTED;
> + }
> + }
> +
> + /* if ret is non-zero, we probably should do some cleanup here... */
> + return ret;
> +}
> +
> +struct hfi2_devdata *hfi2_lookup(int unit)
> +{
> + return xa_load(&hfi2_dev_table, unit);
> +}
> +
> +/*
> + * Stop the timers during unit shutdown, or after an error late
> + * in initialization.
> + */
> +static void stop_timers(struct hfi2_devdata *dd)
> +{
> + struct hfi2_pportdata *ppd;
> + int pidx;
> +
> + for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> + ppd = dd->pport + pidx;
> + if (ppd->led_override_timer.function) {
> + timer_delete_sync(&ppd->led_override_timer);
> + atomic_set(&ppd->led_override_timer_active, 0);
> + }
> + if (ppd->ibport_data.rvp.trap_timer.function)
> + timer_delete_sync(&ppd->ibport_data.rvp.trap_timer);
> + }
> +}
> +
> +/**
> + * shutdown_device - shut down a device
> + * @dd: the hfi2_ib device
> + *
> + * This is called to make the device quiet when we are about to
> + * unload the driver, and also when the device is administratively
> + * disabled. It does not free any data structures.
> + * Everything it does has to be setup again by hfi2_init(dd, 1)
> + */
> +static void shutdown_device(struct hfi2_devdata *dd)
> +{
> + struct hfi2_devrsrcs *dr = &dd->rsrcs;
> + struct hfi2_pportdata *ppd;
> + struct hfi2_ctxtdata *rcd;
> + unsigned int pidx;
> + int i;
> +
> + if (dd->flags & HFI2_SHUTDOWN)
> + return;
> + dd->flags |= HFI2_SHUTDOWN;
> +
> + for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> + ppd = dd->pport + pidx;
> +
> + ppd->linkup = 0;
> + if (ppd->statusp)
> + *ppd->statusp &= ~(HFI2_STATUS_IB_CONF |
> + HFI2_STATUS_IB_READY);
> + }
> + dd->flags &= ~HFI2_INITTED;
> +
> + /*
> + * Drop all traps. After this point, there should be no more cport
> + * handlers that depend on driver state.
> + */
> + clearall_cport_trap(dd);
> +
> + /* disable all interrupts except cport response */
> + if (dd->params->chip_type == CHIP_WFR) {
> + /* WFR has no cport */
> + set_intr_bits(dd, 0, dd->params->is_last_source, false);
> + msix_shut_down_interrupts(dd, false);
> + } else {
> + vf2pf_deinit_irq(dd); /* gracefully stop using interrupts */
> + /* mask all but the cport interrupt source */
> + set_intr_bits(dd, 0, dd->params->is_cport_int - 1, false);
> + set_intr_bits(dd, dd->params->is_cport_int + 1,
> + dd->params->is_last_source, false);
> + msix_shut_down_interrupts(dd, true);
> + }
> +
> + for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> + struct hfi2_portrsrcs *pr = &dr->ppr[pidx];
> +
> + ppd = dd->pport + pidx;
> + for (i = 0; i < pr->num_rcv_contexts; i++) {
> + u16 ctxt = pr->rcv_context_base + i;
> +
> + rcd = hfi2_rcd_get_by_index(dd, ctxt);
> + hfi2_rcvctrl(dd, HFI2_RCVCTRL_TAILUPD_DIS |
> + HFI2_RCVCTRL_CTXT_DIS |
> + HFI2_RCVCTRL_INTRAVAIL_DIS |
> + HFI2_RCVCTRL_PKEY_DIS |
> + HFI2_RCVCTRL_ONE_PKT_EGR_DIS, rcd);
> + hfi2_rcd_put(rcd);
> + }
> + }
> + /*
> + * Gracefully stop all sends allowing any in progress to
> + * trickle out first.
> + */
> + for (i = 0; i < dd->num_send_contexts; i++)
> + sc_flush(dd->send_contexts[i].sc);
> +
> + /*
> + * Enough for anything that's going to trickle out to have actually
> + * done so.
> + */
> + udelay(20);
> +
> + /* disable all contexts */
> + for (i = 0; i < dd->num_send_contexts; i++)
> + sc_disable(dd->send_contexts[i].sc);
> +
> + for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> + ppd = dd->pport + pidx;
> +
> + /* disable the send device */
> + pio_send_control(ppd, PSC_GLOBAL_DISABLE);
> +
> + dd->params->shutdown_led_override(ppd);
> +
> + dd->params->stop_port(ppd);
> + }
> + if (dd->hfi2_wq)
> + flush_workqueue(dd->hfi2_wq);
> + sdma_exit(dd);
> +}
> +
> +/*
> + * SRIOV has been disabled. Do any cleanup not handled by
> + * VF remove_one() calls.
> + */
> +void hfi2_pf0_cleanup(struct hfi2_devdata *dd)
> +{
> + restore_qpmap_table(dd);
> +}
> +
> +/**
> + * hfi2_free_ctxtdata - free a context's allocated data
> + * @dd: the hfi2_ib device
> + * @rcd: the ctxtdata structure
> + *
> + * free up any allocated data for a context
> + * It should never change any chip state, or global driver state.
> + */
> +void hfi2_free_ctxtdata(struct hfi2_devdata *dd, struct hfi2_ctxtdata *rcd)
> +{
> + u32 e;
> +
> + if (!rcd)
> + return;
> +
> + if (rcd->rcvhdrq) {
> + dma_free_coherent(&dd->pcidev->dev, rcvhdrq_size(rcd),
> + rcd->rcvhdrq, rcd->rcvhdrq_dma);
> + rcd->rcvhdrq = NULL;
> + if (hfi2_rcvhdrtail_kvaddr(rcd)) {
> + dma_free_coherent(&dd->pcidev->dev, PAGE_SIZE,
> + (void *)hfi2_rcvhdrtail_kvaddr(rcd),
> + rcd->rcvhdrqtailaddr_dma);
> + rcd->rcvhdrtail_kvaddr = NULL;
> + }
> + }
> + if (rcd->rheq) {
> + dma_free_coherent(&dd->pcidev->dev, rheq_size(rcd),
> + rcd->rheq, rcd->rheq_dma);
> + rcd->rheq = NULL;
> + }
> +
> + /* all the RcvArray entries should have been cleared by now */
> + kfree(rcd->egrbufs.rcvtids);
> + rcd->egrbufs.rcvtids = NULL;
> +
> + for (e = 0; e < rcd->egrbufs.alloced; e++) {
> + if (rcd->egrbufs.buffers[e].addr)
> + dma_free_coherent(&dd->pcidev->dev,
> + rcd->egrbufs.buffers[e].len,
> + rcd->egrbufs.buffers[e].addr,
> + rcd->egrbufs.buffers[e].dma);
> + }
> + kfree(rcd->egrbufs.buffers);
> + rcd->egrbufs.alloced = 0;
> + rcd->egrbufs.buffers = NULL;
> +
> + sc_free(rcd->sc);
> + rcd->sc = NULL;
> +
> + vfree(rcd->subctxt_uregbase);
> + vfree(rcd->subctxt_rcvegrbuf);
> + vfree(rcd->subctxt_rcvhdr_base);
> + kfree(rcd->opstats);
> +
> + rcd->subctxt_uregbase = NULL;
> + rcd->subctxt_rcvegrbuf = NULL;
> + rcd->subctxt_rcvhdr_base = NULL;
> + rcd->opstats = NULL;
> +}
> +
> +/*
> + * Release our hold on the shared asic data. If we are the last one,
> + * return the structure to be finalized outside the lock. Must be
> + * holding hfi2_dev_table lock.
> + */
> +static struct hfi2_asic_data *release_asic_data(struct hfi2_devdata *dd)
> +{
> + struct hfi2_asic_data *ad;
> + int other;
> +
> + if (!dd->asic_data)
> + return NULL;
> + dd->asic_data->dds[dd->hfi2_id] = NULL;
> + other = dd->hfi2_id ? 0 : 1;
> + ad = dd->asic_data;
> + dd->asic_data = NULL;
> + /* return NULL if the other dd still has a link */
> + return ad->dds[other] ? NULL : ad;
> +}
> +
> +static void finalize_asic_data(struct hfi2_devdata *dd,
> + struct hfi2_asic_data *ad)
> +{
> + clean_up_i2c(dd, ad);
> + kfree(ad);
> +}
> +
> +/**
> + * hfi2_free_devdata - cleans up and frees per-unit data structure
> + * @dd: pointer to a valid devdata structure
> + *
> + * It cleans up and frees all data structures set up by
> + * hfi2_alloc_devdata().
> + */
> +static void hfi2_free_devdata(struct hfi2_devdata *dd)
> +{
> + struct hfi2_asic_data *ad;
> + unsigned long flags;
> +
> + xa_lock_irqsave(&hfi2_dev_table, flags);
> + __xa_erase(&hfi2_dev_table, dd->unit);
> + ad = release_asic_data(dd);
> + xa_unlock_irqrestore(&hfi2_dev_table, flags);
> +
> + finalize_asic_data(dd, ad);
> + free_platform_config(dd);
> + rcu_barrier(); /* wait for rcu callbacks to complete */
> + free_percpu(dd->int_counter);
> + free_percpu(dd->rcv_limit);
> + free_percpu(dd->send_schedule);
> + free_percpu(dd->tx_opstats);
> + dd->int_counter = NULL;
> + dd->rcv_limit = NULL;
> + dd->send_schedule = NULL;
> + dd->tx_opstats = NULL;
> + kfree(dd->comp_vect);
> + dd->comp_vect = NULL;
> + if (dd->rcvhdrtail_dummy_kvaddr)
> + dma_free_coherent(&dd->pcidev->dev, sizeof(u64),
> + (void *)dd->rcvhdrtail_dummy_kvaddr,
> + dd->rcvhdrtail_dummy_dma);
> + dd->rcvhdrtail_dummy_kvaddr = NULL;
> + sdma_clean(dd);
> + hfi2_sriov_free_cfg(dd);
> + /* dd is freed by the time this returns: */
> + rvt_dealloc_device(&dd->verbs_dev.rdi);
> +}
> +
> +/**
> + * hfi2_alloc_devdata - Allocate our primary per-unit data structure.
> + * @pdev: Valid PCI device
> + * @extra: How many bytes to alloc past the default
> + *
> + * Must be done via verbs allocator, because the verbs cleanup process
> + * both does cleanup and free of the data structure.
> + * "extra" is for chip-specific data.
> + */
> +static struct hfi2_devdata *hfi2_alloc_devdata(struct pci_dev *pdev,
> + const struct chip_params *params)
> +{
> + struct hfi2_devdata *dd;
> + size_t extra;
> + int ret, nports;
> +
> + nports = params->num_ports;
> + extra = nports * sizeof(struct hfi2_pportdata);
> + dd = (struct hfi2_devdata *)rvt_alloc_device(sizeof(*dd) + extra,
> + nports);
> + if (!dd)
> + return ERR_PTR(-ENOMEM);
> + dd->params = params;
> + dd->num_pports = nports;
> + dd->pport = (struct hfi2_pportdata *)(dd + 1);
> + dd->pcidev = pdev;
> + /*
> + * Check for PCI device being a VF in SRIOV.
> + * The VFs do not have a Power Management capability block.
> + */
> + dd->is_vf = (params->chip_type != CHIP_WFR && !pdev->pm_cap);
> + dd->is_sriov = (dd->is_vf || sriov_is_enabled());
> +#if defined(CONFIG_X86)
> + dd->is_vm = boot_cpu_has(X86_FEATURE_HYPERVISOR);
> +#endif
> +#ifdef PDEV_SRIOV_DEBUG
> + dev_warn(&pdev->dev, "is_vm=%d is_vf=%d is_physfn=%d is_virtfn=%d physfn=%p\n",
> + dd->is_vm, dd->is_vf, pdev->is_physfn, pdev->is_virtfn, pdev->physfn);
> +#endif
> + pci_set_drvdata(pdev, dd);
> +
> + /*
> + * Must set DMA mask for device before any dma_map*() or
> + * dma_alloc*() calls referring to pdev->dev. Otherwise
> + * those calls may return DMA addresses that are
> + * incompatible with the HFI.
> + */
> + ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(params->dma_mask_bits));
> + if (ret) {
> + dd_dev_warn(dd, "Failed to set %u-bit DMA mask ret %d; setting 32-bit DMA mask\n",
> + params->dma_mask_bits, ret);
> + ret = dma_set_mask_and_coherent(&pdev->dev,
> + DMA_BIT_MASK(32));
> + if (ret) {
> + dd_dev_err(dd, "Unable to set DMA mask: %d\n",
> + ret);
> + goto bail;
> + }
> + }
> +
> + ret = xa_alloc_irq(&hfi2_dev_table, &dd->unit, dd, xa_limit_32b,
> + GFP_KERNEL);
> + if (ret < 0) {
> + dev_err(&pdev->dev,
> + "Could not allocate unit ID: error %d\n", -ret);
> + goto bail;
> + }
> + rvt_set_ibdev_name(&dd->verbs_dev.rdi, "%s_%d", "hfi2", dd->unit);
> + /*
> + * If the BIOS does not have the NUMA node information set, select
> + * NUMA 0 so we get consistent performance.
> + */
> + dd->node = pcibus_to_node(pdev->bus);
> + if (dd->node == NUMA_NO_NODE) {
> + dd_dev_err(dd, "Invalid PCI NUMA node. Performance may be affected\n");
> + dd->node = 0;
> + }
> +
> + /*
> + * Initialize all locks for the device. This needs to be as early as
> + * possible so locks are usable.
> + */
> + spin_lock_init(&dd->sc_lock);
> + spin_lock_init(&dd->sendctrl_lock);
> + spin_lock_init(&dd->rcvctrl_lock);
> + spin_lock_init(&dd->uctxt_lock);
> + spin_lock_init(&dd->sc_init_lock);
> + spin_lock_init(&dd->dc8051_memlock);
> + spin_lock_init(&dd->sde_map_lock);
> + spin_lock_init(&dd->pio_map_lock);
> + mutex_init(&dd->dc8051_lock);
> + init_waitqueue_head(&dd->event_queue);
> + spin_lock_init(&dd->irq_src_lock);
> + INIT_WORK(&dd->freeze_work, handle_freeze);
> +
> + dd->int_counter = alloc_percpu(u64);
> + if (!dd->int_counter) {
> + ret = -ENOMEM;
> + goto bail;
> + }
> +
> + dd->rcv_limit = alloc_percpu(u64);
> + if (!dd->rcv_limit) {
> + ret = -ENOMEM;
> + goto bail;
> + }
> +
> + dd->send_schedule = alloc_percpu(u64);
> + if (!dd->send_schedule) {
> + ret = -ENOMEM;
> + goto bail;
> + }
> +
> + dd->tx_opstats = alloc_percpu(struct hfi2_opcode_stats_perctx);
> + if (!dd->tx_opstats) {
> + ret = -ENOMEM;
> + goto bail;
> + }
> +
> + dd->comp_vect = kzalloc(sizeof(*dd->comp_vect), GFP_KERNEL);
> + if (!dd->comp_vect) {
> + ret = -ENOMEM;
> + goto bail;
> + }
> +
> + /* allocate dummy tail memory for all receive contexts */
> + dd->rcvhdrtail_dummy_kvaddr =
> + dma_alloc_coherent(&dd->pcidev->dev, sizeof(u64),
> + &dd->rcvhdrtail_dummy_dma, GFP_KERNEL);
> + if (!dd->rcvhdrtail_dummy_kvaddr) {
> + ret = -ENOMEM;
> + goto bail;
> + }
> +
> + return dd;
> +
> +bail:
> + hfi2_free_devdata(dd);
> + return ERR_PTR(ret);
> +}
> +
> +/*
> + * Called from freeze mode handlers, and from PCI error
> + * reporting code. Should be paranoid about state of
> + * system and data structures.
> + */
> +void hfi2_disable_after_error(struct hfi2_devdata *dd)
> +{
> + if (dd->flags & HFI2_INITTED) {
> + u32 pidx;
> +
> + dd->flags &= ~HFI2_INITTED;
> + if (dd->pport)
> + for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> + struct hfi2_pportdata *ppd;
> +
> + ppd = dd->pport + pidx;
> + if (dd->flags & HFI2_PRESENT)
> + set_link_state(ppd, HLS_DN_DISABLE);
> +
> + if (ppd->statusp)
> + *ppd->statusp &= ~HFI2_STATUS_IB_READY;
> + }
> + }
> +
> + /*
> + * Mark as having had an error for driver, and also
> + * for /sys and status word mapped to user programs.
> + * This marks unit as not usable, until reset.
> + */
> + if (dd->status)
> + dd->status->dev |= HFI2_STATUS_HWERROR;
> +}
> +
> +static void remove_one(struct pci_dev *);
> +static int init_one(struct pci_dev *, const struct pci_device_id *);
> +static void shutdown_one(struct pci_dev *);
> +
> +#define DRIVER_LOAD_MSG "Cornelis " DRIVER_NAME " loaded: "
> +#define PFX DRIVER_NAME ": "
> +
> +const struct pci_device_id hfi2_pci_tbl[] = {
> + { PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL0) },
> + { PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL1) },
> + { PCI_DEVICE(PCI_VENDOR_ID_CORNELIS, PCI_DEVICE_ID_CORNELIS_CN5000) },
> + { 0, }
> +};
> +
> +MODULE_DEVICE_TABLE(pci, hfi2_pci_tbl);
> +
> +static struct pci_driver hfi2_pci_driver = {
> + .name = DRIVER_NAME,
> + .probe = init_one,
> + .remove = remove_one,
> + .shutdown = shutdown_one,
> + .id_table = hfi2_pci_tbl,
> + .err_handler = &hfi2_pci_err_handler,
> + .sriov_configure = hfi2_sriov_configure,
> +};
> +
> +static void __init compute_krcvqs(void)
> +{
> + int i;
> +
> + for (i = 0; i < krcvqsset; i++)
> + n_krcvqs += krcvqs[i];
> +}
> +
> +/*
> + * Do all the generic driver unit- and chip-independent memory
> + * allocation and initialization.
> + */
> +static int __init hfi2_mod_init(void)
> +{
> + int ret;
> +
> + register_system_pinning_interface();
> + register_system_tid_ops();
> +
> + ret = node_affinity_init();
> + if (ret)
> + goto bail;
> +
> + /* validate max MTU before any devices start */
> + if (!valid_opa_max_mtu(hfi2_max_mtu)) {
> + pr_err("Invalid max_mtu 0x%x, using 0x%x instead\n",
> + hfi2_max_mtu, HFI2_DEFAULT_MAX_MTU);
> + hfi2_max_mtu = HFI2_DEFAULT_MAX_MTU;
> + }
> + /* valid CUs run from 1-128 in powers of 2 */
> + if (hfi2_cu > 128 || !is_power_of_2(hfi2_cu))
> + hfi2_cu = 1;
> + /* valid credit return threshold is 0-100, variable is unsigned */
> + if (user_credit_return_threshold > 100)
> + user_credit_return_threshold = 100;
> +
> + compute_krcvqs();
> + /*
> + * sanitize receive interrupt count, time must wait until after
> + * the hardware type is known
> + */
> + if (rcv_intr_count > RCV_HDR_HEAD_COUNTER_MASK)
> + rcv_intr_count = RCV_HDR_HEAD_COUNTER_MASK;
> + /* reject invalid combinations */
> + if (rcv_intr_count == 0 && rcv_intr_timeout == 0) {
> + pr_err("Invalid mode: both receive interrupt count and available timeout are zero - setting interrupt count to 1\n");
> + rcv_intr_count = 1;
> + }
> + if (rcv_intr_count > 1 && rcv_intr_timeout == 0) {
> + /*
> + * Avoid indefinite packet delivery by requiring a timeout
> + * if count is > 1.
> + */
> + pr_err("Invalid mode: receive interrupt count greater than 1 and available timeout is zero - setting available timeout to 1\n");
> + rcv_intr_timeout = 1;
> + }
> + if (rcv_intr_dynamic && !(rcv_intr_count > 1 && rcv_intr_timeout > 0)) {
> + /*
> + * The dynamic algorithm expects a non-zero timeout
> + * and a count > 1.
> + */
> + pr_err("Invalid mode: dynamic receive interrupt mitigation with invalid count and timeout - turning dynamic off\n");
> + rcv_intr_dynamic = 0;
> + }
> +
> + /* sanitize link CRC options */
> + link_crc_mask &= SUPPORTED_CRCS;
> +
> + ret = opfn_init();
> + if (ret < 0) {
> + pr_err("Failed to allocate opfn_wq");
> + goto bail_dev;
> + }
> +
> + /*
> + * These must be called before the driver is registered with
> + * the PCI subsystem.
> + */
> + hfi2_dbg_init();
> + /*
> + * This causes devices to be probed, so any initialization
> + * that must happen before that must be above this point.
> + */
> + ret = pci_register_driver(&hfi2_pci_driver);
> + if (ret < 0) {
> + pr_err("Unable to register driver: error %d\n", -ret);
> + goto bail_dev;
> + }
> + goto bail; /* all OK */
> +
> +bail_dev:
> + hfi2_dbg_exit();
> +bail:
> + return ret;
> +}
> +
> +module_init(hfi2_mod_init);
> +
> +/*
> + * Do the non-unit driver cleanup, memory free, etc. at unload.
> + */
> +static void __exit hfi2_mod_cleanup(void)
> +{
> + pci_unregister_driver(&hfi2_pci_driver);
> + opfn_exit();
> + node_affinity_destroy_all();
> + hfi2_dbg_exit();
> +
> + WARN_ON(!xa_empty(&hfi2_dev_table));
> + dispose_firmware(); /* asymmetric with obtain_firmware() */
> +
> + deregister_system_tid_ops();
> + deregister_system_pinning_interface();
> +}
> +
> +module_exit(hfi2_mod_cleanup);
> +
> +/* this can only be called after a successful initialization */
> +static void cleanup_device_data(struct hfi2_devdata *dd)
> +{
> + int ctxt;
> + int pidx;
> +
> + /* users can't do anything more with chip */
> + for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> + struct hfi2_pportdata *ppd = &dd->pport[pidx];
> + struct cc_state *cc_state;
> + int i;
> +
> + if (ppd->statusp)
> + *ppd->statusp &= ~HFI2_STATUS_CHIP_PRESENT;
> +
> + for (i = 0; i < OPA_MAX_SLS; i++)
> + hrtimer_cancel(&ppd->cca_timer[i].hrtimer);
> +
> + spin_lock(&ppd->cc_state_lock);
> + cc_state = get_cc_state_protected(ppd);
> + RCU_INIT_POINTER(ppd->cc_state, NULL);
> + spin_unlock(&ppd->cc_state_lock);
> +
> + if (cc_state)
> + kfree_rcu(cc_state, rcu);
> + }
> +
> + free_credit_return(dd);
> +
> + /*
> + * Free any receive resources still in use (usually just kernel
> + * contexts) at unload.
> + */
> + for (ctxt = 0; dd->rcd && ctxt < dd->num_rcd; ctxt++) {
> + struct hfi2_ctxtdata *rcd = dd->rcd[ctxt];
> +
> + if (rcd) {
> + hfi2_free_ctxt_rcv_groups(rcd);
> + hfi2_free_ctxt(rcd);
> + }
> + }
> +
> + kfree(dd->rcd);
> + dd->rcd = NULL;
> + dd->num_rcd = 0;
> +
> + free_pio_map(dd);
> + /* must follow rcv context free - need to remove rcv's hooks */
> + if (dd->send_contexts) {
> + for (ctxt = 0; ctxt < dd->num_send_contexts; ctxt++)
> + sc_free(dd->send_contexts[ctxt].sc);
> + }
> + dd->num_send_contexts = 0;
> + kfree(dd->send_contexts);
> + dd->send_contexts = NULL;
> + kfree(dd->hw_to_sw);
> + dd->hw_to_sw = NULL;
> + /* free netdev data */
> + hfi2_free_rx(dd);
> + kfree(dd->boardname);
> + vfree(dd->events);
> + vfree(dd->status);
> +
> + vf2pf_deinit(dd); /* still requires CSR access/permissions */
> +
> + /* finalize the cport - CSR perms revoked on PF0 */
> + stop_cport(dd);
> + /* release interrupts */
> + msix_clean_up_interrupts(dd);
> +
> + /* CSR reads and writes are invalid after this call */
> + hfi2_pcie_ddcleanup(dd);
> +}
> +
> +/*
> + * Clean up on unit shutdown, or error during unit load after
> + * successful initialization.
> + */
> +static void postinit_cleanup(struct hfi2_devdata *dd)
> +{
> + hfi2_start_cleanup(dd);
> + hfi2_comp_vectors_clean_up(dd);
> + hfi2_dev_affinity_clean_up(dd);
> + release_rsm_rules(dd);
> +
> + cleanup_device_data(dd);
> +
> + destroy_workqueues(dd);
> + hfi2_pcie_cleanup(dd->pcidev);
> + hfi2_free_devdata(dd);
> +}
> +
> +static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
> +{
> + int ret = 0, pidx, initfail;
> + struct hfi2_devdata *dd;
> + const struct chip_params *params;
> +
> +#ifdef CONFIG_HFI_L8SIM
> + if (!(pdev->bus->bus_flags & PCI_BUS_FLAGS_SIMULATED)) {
> + dev_warn(&pdev->dev, "Ignoring real hardware on simulator driver\n");
> + return -ENODEV;
> + }
> +#endif
> + /* VF in host driver - leave for KVM */
> + if (pdev->is_virtfn) {
> + /*
> + * It is theoretically possible for the host driver to claim
> + * a VF, so the decision whether to claim or not is made by
> + * hfi2_sriov_init(). Returning ENODEV does not fail SRIOV init.
> + */
> + ret = hfi2_sriov_init(pdev); /* may do nothing */
> + if (ret)
> + return ret; /* do not claim device */
> + }
> +
> + /* First, lock the non-writable module parameters */
> + HFI2_CAP_LOCK();
> +
> + /* Validate dev ids */
> + if (ent->vendor == PCI_VENDOR_ID_INTEL &&
> + (ent->device == PCI_DEVICE_ID_INTEL0 ||
> + ent->device == PCI_DEVICE_ID_INTEL1)) {
> + params = &wfr_params;
> + } else if (ent->vendor == PCI_VENDOR_ID_CORNELIS &&
> + ent->device == PCI_DEVICE_ID_CORNELIS_CN5000) {
> + params = &jkr_params;
> + } else {
> + dev_err(&pdev->dev, "Failing on unknown device %04x:%04x\n",
> + ent->vendor, ent->device);
> + return -ENODEV;
> + }
> +
> + /* verify arrays are large enough */
> + if (params->num_int_csrs > LARGEST_NUM_INT_CSRS ||
> + params->num_ports > LARGEST_NUM_PORTS ||
> + params->pkey_table_size > MAX_PKEY_VALUES) {
> + dev_err(&pdev->dev, "Source arrays are compiled too small\n");
> + return -EINVAL;
> + }
> +
> + /* Allocate the dd so we can get to work */
> + dd = hfi2_alloc_devdata(pdev, params);
> + if (IS_ERR(dd))
> + return PTR_ERR(dd);
> +
> + /* Validate some global module parameters */
> + ret = hfi2_validate_rcvhdrcnt(dd, rcvhdrcnt);
> + if (ret)
> + goto free_dd;
> +
> + /* use the encoding function as a sanitization check */
> + if (!encode_rcv_header_entry_size(hfi2_hdrq_entsize)) {
> + dd_dev_err(dd, "Invalid HdrQ Entry size %u\n",
> + hfi2_hdrq_entsize);
> + ret = -EINVAL;
> + goto free_dd;
> + }
> +
> + /* The receive eager buffer size must be set before the receive
> + * contexts are created.
> + *
> + * Set the eager buffer size. Validate that it falls in a range
> + * allowed by the hardware - all powers of 2 between the min and
> + * max. The maximum valid MTU is within the eager buffer range
> + * so we do not need to cap the max_mtu by an eager buffer size
> + * setting.
> + */
> + if (eager_buffer_size) {
> + if (!is_power_of_2(eager_buffer_size))
> + eager_buffer_size =
> + roundup_pow_of_two(eager_buffer_size);
> + eager_buffer_size =
> + clamp_val(eager_buffer_size,
> + MIN_EAGER_BUFFER * 8,
> + MAX_EAGER_BUFFER_TOTAL);
> + dd_dev_info(dd, "Eager buffer size %u\n",
> + eager_buffer_size);
> + } else {
> + dd_dev_err(dd, "Invalid Eager buffer size of 0\n");
> + ret = -EINVAL;
> + goto free_dd;
> + }
> +
> + /* restrict value of hfi2_rcvarr_split */
> + hfi2_rcvarr_split = clamp_val(hfi2_rcvarr_split, 0, 100);
> +
> + ret = hfi2_pcie_init(dd);
> + if (ret)
> + goto free_dd;
> +
> + ret = create_workqueues(dd);
> + if (ret)
> + goto pcie_cleanup;
> +
> + /*
> + * Do device-specific initialization. If hfi2_init_dd() fails, it
> + * cleans up after itself.
> + */
> + ret = hfi2_init_dd(dd);
> + if (ret)
> + goto destroy_wqs; /* error already printed */
> +
> + /* do the generic initialization */
> + if (!ret)
> + initfail = hfi2_init(dd, 0);
> +
> + if (!initfail && !ret)
> + ret = hfi2_mad_init(dd);
> +
> + if (!initfail && !ret)
> + ret = hfi2_register_ib_device(dd);
> +
> + if (!initfail && !ret)
> + ret = init_cport_trap128(dd); /* after IB device register */
> +
> + /*
> + * Now ready for use. this should be cleared whenever we
> + * detect a reset, or initiate one. If earlier failure,
> + * we still create devices, so diags, etc. can be used
> + * to determine cause of problem.
> + */
> + if (!initfail && !ret) {
> + dd->flags |= HFI2_INITTED;
> + /* create debufs files after init and ib register */
> + hfi2_dbg_ibdev_init(&dd->verbs_dev);
> + }
> +
> + if (initfail || ret) {
> + stop_cport(dd);
> + msix_clean_up_interrupts(dd);
> + stop_timers(dd);
> + flush_workqueue(ib_wq);
> + for (pidx = 0; pidx < dd->num_pports; ++pidx)
> + dd->params->stop_port(dd->pport + pidx);
> + if (!ret) {
> + hfi2_unregister_ib_device(dd);
> + hfi2_mad_deinit(dd);
> + }
> + postinit_cleanup(dd);
> + if (initfail)
> + ret = initfail;
> + goto bail; /* everything already cleaned */
> + }
> +
> + sdma_start(dd);
> + init_cport_overtemp(dd);
> +
> + hfi2_sriov_auto_conf(dd);
> + vf2pf_ready(dd);
> + return 0;
> +
> +destroy_wqs:
> + destroy_workqueues(dd);
> +pcie_cleanup:
> + hfi2_pcie_cleanup(pdev);
> +free_dd:
> + hfi2_free_devdata(dd);
> +bail:
> + return ret;
> +}
> +
> +static void wait_for_clients(struct hfi2_devdata *dd)
> +{
> + /*
> + * Remove the device init value and complete the device if there is
> + * no clients or wait for active clients to finish.
> + */
> + if (refcount_dec_and_test(&dd->user_refcount))
> + complete(&dd->user_comp);
> +
> + wait_for_completion(&dd->user_comp);
> +}
> +
> +/*
> + * This is called for rmmod or other driver-device unbinds.
> + * (and now by shutdown_one() if not WFR)
> + */
> +static void remove_one(struct pci_dev *pdev)
> +{
> + struct hfi2_devdata *dd = pci_get_drvdata(pdev);
> +
> + if (pdev->is_virtfn) {
> + /*
> + * Should only reach here if the VF was claimed by the driver,
> + * however, this cannot destroy device functionality.
> + */
> + hfi2_sriov_remove(pdev);
> + }
> +
> + /*
> + * If VFs are still active, must shut them down now,
> + * before PF0 becomes unusable.
> + */
> + if (pdev->is_physfn)
> + hfi2_sriov_disable(dd->pcidev);
> +
> + /* close debugfs files before ib unregister */
> + hfi2_dbg_ibdev_exit(&dd->verbs_dev);
> +
> + /* wait for existing user space clients to finish */
> + wait_for_clients(dd);
> +
> + /* unregister from IB core */
> + hfi2_unregister_ib_device(dd);
> +
> + /* stop handling LOCAL_MAD_ from CPORT */
> + hfi2_mad_deinit(dd);
> +
> + /*
> + * Disable the IB link, disable interrupts on the device,
> + * clear dma engines, etc.
> + */
> + shutdown_device(dd);
> +
> + stop_timers(dd);
> +
> + /* wait until all of our (qsfp) queue_work() calls complete */
> + flush_workqueue(ib_wq);
> +
> + postinit_cleanup(dd);
> +}
> +
> +/*
> + * This is called during system reboot/shutdown/halt.
> + */
> +static void shutdown_one(struct pci_dev *pdev)
> +{
> + struct hfi2_devdata *dd = pci_get_drvdata(pdev);
> +
> + if (dd->params->chip_type == CHIP_WFR)
> + shutdown_device(dd);
> + else
> + remove_one(pdev);
> +}
> +
> +/* The device has reported over-temp and will shutdown soon (~500mS) */
> +void hfi2_overtemp(struct hfi2_devdata *dd)
> +{
> + dd_dev_err(dd, "*** OVER TEMP *** device shutdown imminent!\n");
> + /* take some action to gracefully shut down/quiesce */
> +}
> +
> +/**
> + * hfi2_create_rcvhdrq - create a receive header queue
> + * @dd: the hfi2_ib device
> + * @rcd: the context data
> + *
> + * This must be contiguous memory (from an i/o perspective), and must be
> + * DMA'able (which means for some systems, it will go through an IOMMU,
> + * or be forced into a low address range).
> + */
> +int hfi2_create_rcvhdrq(struct hfi2_devdata *dd, struct hfi2_ctxtdata *rcd)
> +{
> + u32 amt = rcvhdrq_size(rcd);
> +
> + if (!rcd->rcvhdrq) {
> + rcd->rcvhdrq = dma_alloc_coherent(&dd->pcidev->dev, amt,
> + &rcd->rcvhdrq_dma,
> + GFP_KERNEL);
> +
> + if (!rcd->rcvhdrq) {
> + dd_dev_err(dd,
> + "attempt to allocate %d bytes for ctxt %u rcvhdrq failed\n",
> + amt, rcd->ctxt);
> + goto bail;
> + }
> +
> + if (HFI2_CAP_KGET_MASK(rcd->flags, DMA_RTAIL) ||
> + HFI2_CAP_UGET_MASK(rcd->flags, DMA_RTAIL)) {
> + rcd->rcvhdrtail_kvaddr = dma_alloc_coherent(&dd->pcidev->dev,
> + PAGE_SIZE,
> + &rcd->rcvhdrqtailaddr_dma,
> + GFP_KERNEL);
> + if (!rcd->rcvhdrtail_kvaddr) {
> + dd_dev_err(dd,
> + "attempt to allocate 1 page for ctxt %u rcvhdrqtailaddr failed\n",
> + rcd->ctxt);
> + goto rhq_free;
> + }
> + }
> +
> + if (dd->params->chip_type != CHIP_WFR) {
> + u32 rheq_amt = rheq_size(rcd);
> +
> + rcd->rheq = dma_alloc_coherent(&dd->pcidev->dev,
> + rheq_amt,
> + &rcd->rheq_dma,
> + GFP_KERNEL);
> + if (!rcd->rheq) {
> + dd_dev_err(dd,
> + "attempt to allocate %d bytes for ctxt %u rheq failed\n",
> + rheq_amt, rcd->ctxt);
> + goto tail_free;
> + }
> + }
> + }
> +
> + set_hdrq_regs(rcd->ppd, rcd->ctxt, rcd->rcvhdrqentsize,
> + rcd->rcvhdrq_cnt, rcd->kdeth_rcv_hdr);
> +
> + return 0;
> +
> +tail_free:
> + if (rcd->rcvhdrtail_kvaddr) {
> + dma_free_coherent(&dd->pcidev->dev, PAGE_SIZE,
> + (void *)hfi2_rcvhdrtail_kvaddr(rcd),
> + rcd->rcvhdrqtailaddr_dma);
> + rcd->rcvhdrtail_kvaddr = NULL;
> + }
> +rhq_free:
> + dma_free_coherent(&dd->pcidev->dev, amt, rcd->rcvhdrq,
> + rcd->rcvhdrq_dma);
> + rcd->rcvhdrq = NULL;
> +bail:
> + return -ENOMEM;
> +}
> +
> +/**
> + * hfi2_setup_eagerbufs - allocate eager buffers, both kernel and user
> + * contexts.
> + * @rcd: the context we are setting up.
> + *
> + * Allocate the eager TID buffers and program them into the chip.
> + * They are no longer completely contiguous, we do multiple allocation
> + * calls. Otherwise we get the OOM code involved, by asking for too
> + * much per call, with disastrous results on some kernels.
> + */
> +int hfi2_setup_eagerbufs(struct hfi2_ctxtdata *rcd)
> +{
> + struct hfi2_devdata *dd = rcd->dd;
> + u32 max_entries, egrtop, alloced_bytes = 0;
> + u16 order, idx = 0;
> + int ret = 0;
> + u16 round_mtu = roundup_pow_of_two(hfi2_max_mtu);
> +
> + /*
> + * The minimum size of the eager buffers is a groups of MTU-sized
> + * buffers.
> + * The global eager_buffer_size parameter is checked against the
> + * theoretical lower limit of the value. Here, we check against the
> + * MTU.
> + */
> + if (rcd->egrbufs.size < (round_mtu * dd->rcv_entries.group_size))
> + rcd->egrbufs.size = round_mtu * dd->rcv_entries.group_size;
> + /*
> + * If using one-pkt-per-egr-buffer, lower the eager buffer
> + * size to the max MTU (page-aligned).
> + */
> + if (!HFI2_CAP_KGET_MASK(rcd->flags, MULTI_PKT_EGR))
> + rcd->egrbufs.rcvtid_size = round_mtu;
> +
> + /*
> + * Eager buffers sizes of 1MB or less require smaller TID sizes
> + * to satisfy the "multiple of 8 RcvArray entries" requirement.
> + */
> + if (rcd->egrbufs.size <= (1 << 20))
> + rcd->egrbufs.rcvtid_size = max((unsigned long)round_mtu,
> + rounddown_pow_of_two(rcd->egrbufs.size / 8));
> +
> + while (alloced_bytes < rcd->egrbufs.size &&
> + rcd->egrbufs.alloced < rcd->egrbufs.count) {
> + rcd->egrbufs.buffers[idx].addr =
> + dma_alloc_coherent(&dd->pcidev->dev,
> + rcd->egrbufs.rcvtid_size,
> + &rcd->egrbufs.buffers[idx].dma,
> + GFP_KERNEL);
> + if (rcd->egrbufs.buffers[idx].addr) {
> + rcd->egrbufs.buffers[idx].len =
> + rcd->egrbufs.rcvtid_size;
> + rcd->egrbufs.rcvtids[rcd->egrbufs.alloced].addr =
> + rcd->egrbufs.buffers[idx].addr;
> + rcd->egrbufs.rcvtids[rcd->egrbufs.alloced].dma =
> + rcd->egrbufs.buffers[idx].dma;
> + rcd->egrbufs.alloced++;
> + alloced_bytes += rcd->egrbufs.rcvtid_size;
> + idx++;
> + } else {
> + u32 new_size, i, j;
> + u64 offset = 0;
> +
> + /*
> + * Fail the eager buffer allocation if:
> + * - we are already using the lowest acceptable size
> + * - we are using one-pkt-per-egr-buffer (this implies
> + * that we are accepting only one size)
> + */
> + if (rcd->egrbufs.rcvtid_size == round_mtu ||
> + !HFI2_CAP_KGET_MASK(rcd->flags, MULTI_PKT_EGR)) {
> + dd_dev_err(dd, "ctxt%u: Failed to allocate eager buffers\n",
> + rcd->ctxt);
> + ret = -ENOMEM;
> + goto bail_rcvegrbuf_phys;
> + }
> +
> + new_size = rcd->egrbufs.rcvtid_size / 2;
> +
> + /*
> + * If the first attempt to allocate memory failed, don't
> + * fail everything but continue with the next lower
> + * size.
> + */
> + if (idx == 0) {
> + rcd->egrbufs.rcvtid_size = new_size;
> + continue;
> + }
> +
> + /*
> + * Re-partition already allocated buffers to a smaller
> + * size.
> + */
> + rcd->egrbufs.alloced = 0;
> + for (i = 0, j = 0, offset = 0; j < idx; i++) {
> + if (i >= rcd->egrbufs.count)
> + break;
> + rcd->egrbufs.rcvtids[i].dma =
> + rcd->egrbufs.buffers[j].dma + offset;
> + rcd->egrbufs.rcvtids[i].addr =
> + rcd->egrbufs.buffers[j].addr + offset;
> + rcd->egrbufs.alloced++;
> + if ((rcd->egrbufs.buffers[j].dma + offset +
> + new_size) ==
> + (rcd->egrbufs.buffers[j].dma +
> + rcd->egrbufs.buffers[j].len)) {
> + j++;
> + offset = 0;
> + } else {
> + offset += new_size;
> + }
> + }
> + rcd->egrbufs.rcvtid_size = new_size;
> + }
> + }
> + rcd->egrbufs.numbufs = idx;
> + rcd->egrbufs.size = alloced_bytes;
> +
> + hfi2_cdbg(PROC,
> + "ctxt%u: Alloced %u rcv tid entries @ %uKB, total %uKB",
> + rcd->ctxt, rcd->egrbufs.alloced,
> + rcd->egrbufs.rcvtid_size / 1024, rcd->egrbufs.size / 1024);
> +
> + /*
> + * Set the contexts rcv array head update threshold to the closest
> + * power of 2 (so we can use a mask instead of modulo) below half
> + * the allocated entries.
> + */
> + rcd->egrbufs.threshold =
> + rounddown_pow_of_two(rcd->egrbufs.alloced / 2);
> + /*
> + * Compute the expected RcvArray entry base. This is done after
> + * allocating the eager buffers in order to maximize the
> + * expected RcvArray entries for the context.
> + */
> + max_entries = rcd->rcv_array_groups * dd->rcv_entries.group_size;
> + egrtop = roundup(rcd->egrbufs.alloced, dd->rcv_entries.group_size);
> + rcd->expected_count = max_entries - egrtop;
> + if (rcd->expected_count > MAX_TID_PAIR_ENTRIES * 2)
> + rcd->expected_count = MAX_TID_PAIR_ENTRIES * 2;
> +
> + rcd->expected_base = rcd->eager_base + egrtop;
> + hfi2_cdbg(PROC, "ctxt%u: eager:%u, exp:%u, egrbase:%u, expbase:%u",
> + rcd->ctxt, rcd->egrbufs.alloced, rcd->expected_count,
> + rcd->eager_base, rcd->expected_base);
> +
> + if (!hfi2_rcvbuf_validate(rcd->egrbufs.rcvtid_size, PT_EAGER, &order)) {
> + hfi2_cdbg(PROC,
> + "ctxt%u: current Eager buffer size is invalid %u",
> + rcd->ctxt, rcd->egrbufs.rcvtid_size);
> + ret = -EINVAL;
> + goto bail_rcvegrbuf_phys;
> + }
> +
> + /*
> + * Enable RcvArray access on JKR and later by configuring RcvEgrCtrl and
> + * RcvTidCtrl before writing TIDs to the RcvArray.
> + *
> + * Call set_port_tid_config only after eager_base, egrbufs.alloced,
> + * expected_count, and expected_base are initialized in rcd. The last
> + * 3 of the 4 are initialized above in this function.
> + */
> + dd->params->set_port_tid_config(dd, rcd->ppd->hw_pidx, rcd->ctxt,
> + rcd->eager_base, rcd->egrbufs.alloced,
> + rcd->expected_base, rcd->expected_count);
> +
> + for (idx = 0; idx < rcd->egrbufs.alloced; idx++) {
> + dd->params->put_tid(rcd, idx, PT_EAGER,
> + rcd->egrbufs.rcvtids[idx].dma, order,
> + false);
> + cond_resched();
> + }
> +
> + return 0;
> +
> +bail_rcvegrbuf_phys:
> + for (idx = 0; idx < rcd->egrbufs.alloced &&
> + rcd->egrbufs.buffers[idx].addr;
> + idx++) {
> + dma_free_coherent(&dd->pcidev->dev,
> + rcd->egrbufs.buffers[idx].len,
> + rcd->egrbufs.buffers[idx].addr,
> + rcd->egrbufs.buffers[idx].dma);
> + rcd->egrbufs.buffers[idx].addr = NULL;
> + rcd->egrbufs.buffers[idx].dma = 0;
> + rcd->egrbufs.buffers[idx].len = 0;
> + }
> +
> + return ret;
> +}
> +
> +/*
> + * Return number of requested user contexts for the given unit and port based
> + * on information given in the module parameter num_user_contexts.
> + * Return -1 (use non-HT cores) if the corresponding entry is not set.
> + */
> +int get_num_user_contexts(struct hfi2_devdata *dd, int pidx)
> +{
> + struct hfi2_devdata *xdd;
> + int start;
> + int i;
> +
> + /* find the count of ports from earlier units */
> + start = 0;
> + for (i = 0; i < dd->unit; i++) {
> + xdd = hfi2_lookup(i);
> + /* previous units should exist - check anyway */
> + if (!xdd) {
> + dd_dev_err(dd, "%s: unit %d not found?\n", __func__, i);
> + return -1;
> + }
> + start += xdd->num_pports;
> + }
> +
> + /* adjust for the port on this unit */
> + start += pidx;
> +
> + /* check if enough elements are set for this unit's port */
> + if (start >= num_user_contexts_count)
> + return -1;
> +
> + return num_user_contexts_array[start];
> +}
>
>
>
next prev parent reply other threads:[~2026-03-18 9:14 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-11 17:53 [PATCH for-next resend 00/24] Migrate to hfi2 driver Dennis Dalessandro
2026-03-11 17:53 ` [PATCH for-next resend 01/24] RDMA/hfi2: Start hfi2 driver by basing off of hfi1 Dennis Dalessandro
2026-03-16 15:51 ` Leon Romanovsky
2026-03-16 22:00 ` Dennis Dalessandro
2026-03-17 10:07 ` Leon Romanovsky
2026-03-11 17:53 ` [PATCH for-next resend 02/24] RDMA/hfi2: Add in HW register definition files Dennis Dalessandro
2026-03-11 17:54 ` [PATCH for-next resend 03/24] RDMA/hfi2: Add counter accessor functions Dennis Dalessandro
2026-03-11 17:54 ` [PATCH for-next resend 04/24] RDMA/hfi2: Add in HW register access support Dennis Dalessandro
2026-03-11 17:54 ` [PATCH for-next resend 05/24] RDMA/hfi2: Add in trace header files Dennis Dalessandro
2026-03-11 17:54 ` [PATCH for-next resend 06/24] RDMA/hfi2: Add in trace support Dennis Dalessandro
2026-03-11 17:54 ` [PATCH for-next resend 07/24] RDMA/hfi2: Add system core header files Dennis Dalessandro
2026-03-16 15:58 ` Leon Romanovsky
2026-03-16 21:37 ` Dennis Dalessandro
2026-03-17 9:54 ` Leon Romanovsky
2026-03-11 17:54 ` [PATCH for-next resend 08/24] RDMA/hfi2: Add driver and interrupt infrastructure Dennis Dalessandro
2026-03-18 9:11 ` Leon Romanovsky
2026-03-11 17:54 ` [PATCH for-next resend 09/24] RDMA/hfi2: Add initialization and firmware support Dennis Dalessandro
2026-03-18 9:14 ` Leon Romanovsky [this message]
2026-03-11 17:54 ` [PATCH for-next resend 10/24] RDMA/hfi2: Add in MAD handling related headers Dennis Dalessandro
2026-03-11 17:54 ` [PATCH for-next resend 11/24] RDMA/hfi2: Add cport management Dennis Dalessandro
2026-03-11 17:54 ` [PATCH for-next resend 12/24] RDMA/hfi2: Implement MAD handling Dennis Dalessandro
2026-03-11 17:54 ` [PATCH for-next resend 13/24] RDMA/hfi2: Add IO related headers Dennis Dalessandro
2026-03-11 17:54 ` [PATCH for-next resend 14/24] RDMA/hfi2: Add PIO send infrastructure Dennis Dalessandro
2026-03-11 17:55 ` [PATCH for-next resend 15/24] RDMA/hfi2: Add SDMA infrastructure Dennis Dalessandro
2026-03-11 17:55 ` [PATCH for-next resend 16/24] RDMA/hfi2: Implement data moving infrastructure Dennis Dalessandro
2026-03-11 17:55 ` [PATCH for-next resend 17/24] RDMA/hfi2: Add verbs core Dennis Dalessandro
2026-03-11 17:55 ` [PATCH for-next resend 18/24] RDMA/hfi2: Add RC protocol support Dennis Dalessandro
2026-03-11 17:55 ` [PATCH for-next resend 19/24] RDMA/hfi2: Add in support for verbs Dennis Dalessandro
2026-03-11 17:55 ` [PATCH for-next resend 20/24] RDMA/hfi2: Support ipoib Dennis Dalessandro
2026-03-11 17:55 ` [PATCH for-next resend 21/24] RDMA/hfi2: Add misc header files Dennis Dalessandro
2026-03-11 17:55 ` [PATCH for-next resend 22/24] RDMA/hfi2: Add the rest of the driver Dennis Dalessandro
2026-03-11 17:55 ` [PATCH for-next resend 23/24] RDMA/hfi2: Make it build Dennis Dalessandro
2026-03-11 17:55 ` [PATCH for-next resend 24/24] RDMA/hfi2: Modernize mmap to use rdma_user_mmap_entry infrastructure Dennis Dalessandro
2026-03-16 16:02 ` [PATCH for-next resend 00/24] Migrate to hfi2 driver Leon Romanovsky
2026-03-16 21:29 ` Dennis Dalessandro
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260318091409.GG61385@unreal \
--to=leon@kernel.org \
--cc=brendan.cunningham@cornelisnetworks.com \
--cc=dean.luick@cornelisnetworks.com \
--cc=dennis.dalessandro@cornelisnetworks.com \
--cc=doug.miller@cornelisnetworks.com \
--cc=jgg@ziepe.ca \
--cc=linux-rdma@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox