All of lore.kernel.org
 help / color / mirror / Atom feed
From: Leon Romanovsky <leon@kernel.org>
To: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: jgg@ziepe.ca, Dean Luick <dean.luick@cornelisnetworks.com>,
	Breandan Cunningham <brendan.cunningham@cornelisnetworks.com>,
	Douglas Miller <doug.miller@cornelisnetworks.com>,
	linux-rdma@vger.kernel.org
Subject: Re: [PATCH for-next resend 09/24] RDMA/hfi2: Add initialization and firmware support
Date: Wed, 18 Mar 2026 11:14:09 +0200	[thread overview]
Message-ID: <20260318091409.GG61385@unreal> (raw)
In-Reply-To: <177325167086.57064.11403114326044529507.stgit@awdrv-04.cornelisnetworks.com>

On Wed, Mar 11, 2026 at 01:54:30PM -0400, Dennis Dalessandro wrote:
> Add device initialization, firmware loading and management, and CPU
> affinity support.
> 
> Co-developed-by: Dean Luick <dean.luick@cornelisnetworks.com>
> Signed-off-by: Dean Luick <dean.luick@cornelisnetworks.com>
> Co-developed-by: Bendan Cunningham <brendan.cunningham@cornelisnetworks.com>
> Signed-off-by: Breandan Cunningham <brendan.cunningham@cornelisnetworks.com>
> Co-developed-by: Douglas Miller <doug.miller@cornelisnetworks.com>
> Signed-off-by: Douglas Miller <doug.miller@cornelisnetworks.com>
> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
> ---
>  drivers/infiniband/hw/hfi2/affinity.c | 1194 +++++++++++++

Let's put affinity code aside. Can we start from bare minimum driver?

Thanks
> +	}
> +
> +	if (set) {
> +		cpumask_andnot(&set->used, &set->used, &msix->mask);
> +		_cpu_mask_set_gen_dec(set);
> +	}
> +
> +	irq_set_affinity_hint(msix->irq, NULL);
> +	cpumask_clear(&msix->mask);
> +	mutex_unlock(&node_affinity.lock);
> +}
> +
> +int hfi2_get_proc_affinity(int node)
> +{
> +	int cpu = -1, ret, i;
> +	struct hfi2_affinity_node *entry;
> +	cpumask_var_t diff, hw_thread_mask, available_mask, intrs_mask;
> +	const struct cpumask *node_mask,
> +		*proc_mask = current->cpus_ptr;
> +	struct hfi2_affinity_node_list *affinity = &node_affinity;
> +	struct cpu_mask_set *set = &affinity->proc;
> +	int pruned;
> +
> +	/*
> +	 * check whether process/context affinity has already
> +	 * been set
> +	 */
> +	if (current->nr_cpus_allowed == 1) {
> +		hfi2_cdbg(PROC, "PID %u %s affinity set to CPU %*pbl",
> +			  current->pid, current->comm,
> +			  cpumask_pr_args(proc_mask));
> +		/*
> +		 * Mark the pre-set CPU as used. This is atomic so we don't
> +		 * need the lock
> +		 */
> +		cpu = cpumask_first(proc_mask);
> +		cpumask_set_cpu(cpu, &set->used);
> +		goto done;
> +	} else if (current->nr_cpus_allowed < cpumask_weight(&set->mask)) {
> +		hfi2_cdbg(PROC, "PID %u %s affinity set to CPU set(s) %*pbl",
> +			  current->pid, current->comm,
> +			  cpumask_pr_args(proc_mask));
> +		goto done;
> +	}
> +
> +	/*
> +	 * The process does not have a preset CPU affinity so find one to
> +	 * recommend using the following algorithm:
> +	 *
> +	 * For each user process that is opening a context on HFI Y:
> +	 *  a) If all cores are filled, reinitialize the bitmask
> +	 *  b) Fill real cores first, then HT cores (First set of HT
> +	 *     cores on all physical cores, then second set of HT core,
> +	 *     and, so on) in the following order:
> +	 *
> +	 *     1. Same NUMA node as HFI Y and not running an IRQ
> +	 *        handler
> +	 *     2. Same NUMA node as HFI Y and running an IRQ handler
> +	 *     3. Different NUMA node to HFI Y and not running an IRQ
> +	 *        handler
> +	 *     4. Different NUMA node to HFI Y and running an IRQ
> +	 *        handler
> +	 *  c) Mark core as filled in the bitmask. As user processes are
> +	 *     done, clear cores from the bitmask.
> +	 */
> +
> +	ret = zalloc_cpumask_var(&diff, GFP_KERNEL);
> +	if (!ret)
> +		goto done;
> +	ret = zalloc_cpumask_var(&hw_thread_mask, GFP_KERNEL);
> +	if (!ret)
> +		goto free_diff;
> +	ret = zalloc_cpumask_var(&available_mask, GFP_KERNEL);
> +	if (!ret)
> +		goto free_hw_thread_mask;
> +	ret = zalloc_cpumask_var(&intrs_mask, GFP_KERNEL);
> +	if (!ret)
> +		goto free_available_mask;
> +
> +	mutex_lock(&affinity->lock);
> +	/*
> +	 * If we've used all available HW threads, clear the mask and start
> +	 * overloading.
> +	 */
> +	_cpu_mask_set_gen_inc(set);
> +
> +	/*
> +	 * If NUMA node has CPUs used by interrupt handlers, include them in the
> +	 * interrupt handler mask.
> +	 */
> +	entry = node_affinity_lookup(node);
> +	if (entry) {
> +		cpumask_copy(intrs_mask, (entry->def_intr.gen ?
> +					  &entry->def_intr.mask :
> +					  &entry->def_intr.used));
> +		cpumask_or(intrs_mask, intrs_mask, (entry->rcv_intr.gen ?
> +						    &entry->rcv_intr.mask :
> +						    &entry->rcv_intr.used));
> +		cpumask_or(intrs_mask, intrs_mask, &entry->general_intr_mask);
> +	}
> +	hfi2_cdbg(PROC, "CPUs used by interrupts: %*pbl",
> +		  cpumask_pr_args(intrs_mask));
> +
> +
> +	/*
> +	 * If HT cores are enabled, identify which HW threads within the
> +	 * physical cores should be used.
> +	 *
> +	 * Start with affinity mask but prune HT/SMT threads. If all HW threads
> +	 * are in use, then try again with all threads in mask, but only if
> +	 * threads were pruned before the first step.
> +	 */
> +	cpumask_copy(hw_thread_mask, &affinity->proc.mask);
> +	pruned = clear_ht_siblings(hw_thread_mask);
> +	for (i = 0; i < 2; i++) {
> +		/*
> +		 * diff will always be not empty at least once in this
> +		 * loop as the used mask gets reset when
> +		 * (set->mask == set->used) before this loop.
> +		 */
> +		cpumask_andnot(diff, hw_thread_mask, &set->used);
> +		if (!cpumask_empty(diff) || !pruned)
> +			break;
> +		cpumask_copy(hw_thread_mask, &affinity->proc.mask);
> +	}
> +	hfi2_cdbg(PROC, "Same available HW thread on all physical CPUs: %*pbl",
> +		  cpumask_pr_args(hw_thread_mask));
> +
> +	node_mask = cpumask_of_node(node);
> +	hfi2_cdbg(PROC, "Device on NUMA %u, CPUs %*pbl", node,
> +		  cpumask_pr_args(node_mask));
> +
> +	/* Get cpumask of available CPUs on preferred NUMA */
> +	cpumask_and(available_mask, hw_thread_mask, node_mask);
> +	cpumask_andnot(available_mask, available_mask, &set->used);
> +	hfi2_cdbg(PROC, "Available CPUs on NUMA %u: %*pbl", node,
> +		  cpumask_pr_args(available_mask));
> +
> +	/*
> +	 * At first, we don't want to place processes on the same
> +	 * CPUs as interrupt handlers. Then, CPUs running interrupt
> +	 * handlers are used.
> +	 *
> +	 * 1) If diff is not empty, then there are CPUs not running
> +	 *    non-interrupt handlers available, so diff gets copied
> +	 *    over to available_mask.
> +	 * 2) If diff is empty, then all CPUs not running interrupt
> +	 *    handlers are taken, so available_mask contains all
> +	 *    available CPUs running interrupt handlers.
> +	 * 3) If available_mask is empty, then all CPUs on the
> +	 *    preferred NUMA node are taken, so other NUMA nodes are
> +	 *    used for process assignments using the same method as
> +	 *    the preferred NUMA node.
> +	 */
> +	if (cpumask_andnot(diff, available_mask, intrs_mask))
> +		cpumask_copy(available_mask, diff);
> +
> +	/* If we don't have CPUs on the preferred node, use other NUMA nodes */
> +	if (cpumask_empty(available_mask)) {
> +		cpumask_andnot(available_mask, hw_thread_mask, &set->used);
> +		/* Excluding preferred NUMA cores */
> +		cpumask_andnot(available_mask, available_mask, node_mask);
> +		hfi2_cdbg(PROC,
> +			  "Preferred NUMA node cores are taken, cores available in other NUMA nodes: %*pbl",
> +			  cpumask_pr_args(available_mask));
> +
> +		/*
> +		 * At first, we don't want to place processes on the same
> +		 * CPUs as interrupt handlers.
> +		 */
> +		if (cpumask_andnot(diff, available_mask, intrs_mask))
> +			cpumask_copy(available_mask, diff);
> +	}
> +	hfi2_cdbg(PROC, "Possible CPUs for process: %*pbl",
> +		  cpumask_pr_args(available_mask));
> +
> +	cpu = cpumask_first(available_mask);
> +	if (cpu >= nr_cpu_ids) /* empty */
> +		cpu = -1;
> +	else
> +		cpumask_set_cpu(cpu, &set->used);
> +
> +	mutex_unlock(&affinity->lock);
> +	hfi2_cdbg(PROC, "Process assigned to CPU %d", cpu);
> +
> +	free_cpumask_var(intrs_mask);
> +free_available_mask:
> +	free_cpumask_var(available_mask);
> +free_hw_thread_mask:
> +	free_cpumask_var(hw_thread_mask);
> +free_diff:
> +	free_cpumask_var(diff);
> +done:
> +	return cpu;
> +}
> +
> +void hfi2_put_proc_affinity(int cpu)
> +{
> +	struct hfi2_affinity_node_list *affinity = &node_affinity;
> +	struct cpu_mask_set *set = &affinity->proc;
> +
> +	if (cpu < 0)
> +		return;
> +
> +	mutex_lock(&affinity->lock);
> +	cpu_mask_set_put(set, cpu);
> +	hfi2_cdbg(PROC, "Returning CPU %d for future process assignment", cpu);
> +	mutex_unlock(&affinity->lock);
> +}
> diff --git a/drivers/infiniband/hw/hfi2/chip.c b/drivers/infiniband/hw/hfi2/chip.c
> index c9012ea0b970..7547058acb29 100644
> --- a/drivers/infiniband/hw/hfi2/chip.c
> +++ b/drivers/infiniband/hw/hfi2/chip.c
> @@ -12141,13 +12141,13 @@ int wfr_early_per_chip_init(struct hfi2_devdata *dd)
>  	if (ret)
>  		return ret;
>  
> -	/* call before get_platform_config(), after init_chip_resources() */
> +	/* call before hfi2_get_platform_config(), after init_chip_resources() */
>  	ret = eprom_init(dd);
>  	if (ret)
>  		return ret;
>  
>  	/* Needs to be called before hfi2_firmware_init */
> -	get_platform_config(&dd->pport[HFI2_PORT_IDX]);
> +	hfi2_get_platform_config(&dd->pport[HFI2_PORT_IDX]);
>  
>  	/* read in firmware */
>  	ret = hfi2_firmware_init(dd);
> diff --git a/drivers/infiniband/hw/hfi2/firmware.c b/drivers/infiniband/hw/hfi2/firmware.c
> new file mode 100644
> index 000000000000..1a94622ba09e
> --- /dev/null
> +++ b/drivers/infiniband/hw/hfi2/firmware.c
> @@ -0,0 +1,2267 @@
> +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
> +/*
> + * Copyright(c) 2015 - 2017 Intel Corporation.
> + * Copyright(c) 2025-2026 Cornelis Networks, Inc.
> + */
> +
> +#include <linux/firmware.h>
> +#include <linux/mutex.h>
> +#include <linux/delay.h>
> +#include <linux/crc32.h>
> +
> +#include "hfi2.h"
> +#include "trace.h"
> +
> +/*
> + * Make it easy to toggle firmware file name and if it gets loaded by
> + * editing the following. This may be something we do while in development
> + * but not necessarily something a user would ever need to use.
> + */
> +#define DEFAULT_FW_8051_NAME_FPGA "hfi_dc8051.bin"
> +#define DEFAULT_FW_8051_NAME_ASIC "hfi2_dc8051.fw"
> +#define DEFAULT_FW_FABRIC_NAME "hfi2_fabric.fw"
> +#define DEFAULT_FW_SBUS_NAME "hfi2_sbus.fw"
> +#define DEFAULT_FW_PCIE_NAME "hfi2_pcie.fw"
> +#define ALT_FW_8051_NAME_ASIC "hfi2_dc8051_d.fw"
> +#define ALT_FW_FABRIC_NAME "hfi2_fabric_d.fw"
> +#define ALT_FW_SBUS_NAME "hfi2_sbus_d.fw"
> +#define ALT_FW_PCIE_NAME "hfi2_pcie_d.fw"
> +
> +MODULE_FIRMWARE(DEFAULT_FW_8051_NAME_ASIC);
> +MODULE_FIRMWARE(DEFAULT_FW_FABRIC_NAME);
> +MODULE_FIRMWARE(DEFAULT_FW_SBUS_NAME);
> +MODULE_FIRMWARE(DEFAULT_FW_PCIE_NAME);
> +
> +static uint fw_8051_load = 1;
> +static uint fw_fabric_serdes_load = 1;
> +static uint fw_pcie_serdes_load = 1;
> +static uint fw_sbus_load = 1;
> +
> +/* Firmware file names get set in hfi2_firmware_init() based on the above */
> +static char *fw_8051_name;
> +static char *fw_fabric_serdes_name;
> +static char *fw_sbus_name;
> +static char *fw_pcie_serdes_name;
> +
> +#define SBUS_MAX_POLL_COUNT 100
> +#define SBUS_COUNTER(reg, name) \
> +	(((reg) >> ASIC_STS_SBUS_COUNTERS_##name##_CNT_SHIFT) & \
> +	 ASIC_STS_SBUS_COUNTERS_##name##_CNT_MASK)
> +
> +/*
> + * Firmware security header.
> + */
> +struct css_header {
> +	u32 module_type;
> +	u32 header_len;
> +	u32 header_version;
> +	u32 module_id;
> +	u32 module_vendor;
> +	u32 date;		/* BCD yyyymmdd */
> +	u32 size;		/* in DWORDs */
> +	u32 key_size;		/* in DWORDs */
> +	u32 modulus_size;	/* in DWORDs */
> +	u32 exponent_size;	/* in DWORDs */
> +	u32 reserved[22];
> +};
> +
> +/* expected field values */
> +#define CSS_MODULE_TYPE	   0x00000006
> +#define CSS_HEADER_LEN	   0x000000a1
> +#define CSS_HEADER_VERSION 0x00010000
> +#define CSS_MODULE_VENDOR  0x00008086
> +
> +#define KEY_SIZE      256
> +#define MU_SIZE		8
> +#define EXPONENT_SIZE	4
> +
> +/* size of platform configuration partition */
> +#define MAX_PLATFORM_CONFIG_FILE_SIZE 4096
> +
> +/* size of file of platform configuration encoded in format version 4 */
> +#define PLATFORM_CONFIG_FORMAT_4_FILE_SIZE 528
> +
> +/* the file itself */
> +struct firmware_file {
> +	struct css_header css_header;
> +	u8 modulus[KEY_SIZE];
> +	u8 exponent[EXPONENT_SIZE];
> +	u8 signature[KEY_SIZE];
> +	u8 firmware[];
> +};
> +
> +struct augmented_firmware_file {
> +	struct css_header css_header;
> +	u8 modulus[KEY_SIZE];
> +	u8 exponent[EXPONENT_SIZE];
> +	u8 signature[KEY_SIZE];
> +	u8 r2[KEY_SIZE];
> +	u8 mu[MU_SIZE];
> +	u8 firmware[];
> +};
> +
> +/* augmented file size difference */
> +#define AUGMENT_SIZE (sizeof(struct augmented_firmware_file) - \
> +						sizeof(struct firmware_file))
> +
> +struct firmware_details {
> +	/* Linux core piece */
> +	const struct firmware *fw;
> +
> +	struct css_header *css_header;
> +	u8 *firmware_ptr;		/* pointer to binary data */
> +	u32 firmware_len;		/* length in bytes */
> +	u8 *modulus;			/* pointer to the modulus */
> +	u8 *exponent;			/* pointer to the exponent */
> +	u8 *signature;			/* pointer to the signature */
> +	u8 *r2;				/* pointer to r2 */
> +	u8 *mu;				/* pointer to mu */
> +	struct augmented_firmware_file dummy_header;
> +};
> +
> +/*
> + * The mutex protects fw_state, fw_err, and all of the firmware_details
> + * variables.
> + */
> +static DEFINE_MUTEX(fw_mutex);
> +enum fw_state {
> +	FW_EMPTY,
> +	FW_TRY,
> +	FW_FINAL,
> +	FW_ERR
> +};
> +
> +static enum fw_state fw_state = FW_EMPTY;
> +static int fw_err;
> +static struct firmware_details fw_8051;
> +static struct firmware_details fw_fabric;
> +static struct firmware_details fw_pcie;
> +static struct firmware_details fw_sbus;
> +
> +/* flags for turn_off_spicos() */
> +#define SPICO_SBUS   0x1
> +#define SPICO_FABRIC 0x2
> +#define ENABLE_SPICO_SMASK 0x1
> +
> +/* security block commands */
> +#define RSA_CMD_INIT  0x1
> +#define RSA_CMD_START 0x2
> +
> +/* security block status */
> +#define RSA_STATUS_IDLE   0x0
> +#define RSA_STATUS_ACTIVE 0x1
> +#define RSA_STATUS_DONE   0x2
> +#define RSA_STATUS_FAILED 0x3
> +
> +/* RSA engine timeout, in ms */
> +#define RSA_ENGINE_TIMEOUT 100 /* ms */
> +
> +/* hardware mutex timeout, in ms */
> +#define HM_TIMEOUT 10 /* ms */
> +
> +/* 8051 memory access timeout, in us */
> +#define DC8051_ACCESS_TIMEOUT 100 /* us */
> +
> +/* the number of fabric SerDes on the SBus */
> +#define NUM_FABRIC_SERDES 4
> +
> +/* ASIC_STS_SBUS_RESULT.RESULT_CODE value */
> +#define SBUS_READ_COMPLETE 0x4
> +
> +/* SBus fabric SerDes addresses, one set per HFI */
> +static const u8 fabric_serdes_addrs[2][NUM_FABRIC_SERDES] = {
> +	{ 0x01, 0x02, 0x03, 0x04 },
> +	{ 0x28, 0x29, 0x2a, 0x2b }
> +};
> +
> +/* SBus PCIe SerDes addresses, one set per HFI */
> +static const u8 pcie_serdes_addrs[2][NUM_PCIE_SERDES] = {
> +	{ 0x08, 0x0a, 0x0c, 0x0e, 0x10, 0x12, 0x14, 0x16,
> +	  0x18, 0x1a, 0x1c, 0x1e, 0x20, 0x22, 0x24, 0x26 },
> +	{ 0x2f, 0x31, 0x33, 0x35, 0x37, 0x39, 0x3b, 0x3d,
> +	  0x3f, 0x41, 0x43, 0x45, 0x47, 0x49, 0x4b, 0x4d }
> +};
> +
> +/* SBus PCIe PCS addresses, one set per HFI */
> +const u8 pcie_pcs_addrs[2][NUM_PCIE_SERDES] = {
> +	{ 0x09, 0x0b, 0x0d, 0x0f, 0x11, 0x13, 0x15, 0x17,
> +	  0x19, 0x1b, 0x1d, 0x1f, 0x21, 0x23, 0x25, 0x27 },
> +	{ 0x30, 0x32, 0x34, 0x36, 0x38, 0x3a, 0x3c, 0x3e,
> +	  0x40, 0x42, 0x44, 0x46, 0x48, 0x4a, 0x4c, 0x4e }
> +};
> +
> +/* SBus fabric SerDes broadcast addresses, one per HFI */
> +static const u8 fabric_serdes_broadcast[2] = { 0xe4, 0xe5 };
> +static const u8 all_fabric_serdes_broadcast = 0xe1;
> +
> +/* SBus PCIe SerDes broadcast addresses, one per HFI */
> +const u8 pcie_serdes_broadcast[2] = { 0xe2, 0xe3 };
> +static const u8 all_pcie_serdes_broadcast = 0xe0;
> +
> +static const u32 platform_config_table_limits[PLATFORM_CONFIG_TABLE_MAX] = {
> +	0,
> +	SYSTEM_TABLE_MAX,
> +	PORT_TABLE_MAX,
> +	RX_PRESET_TABLE_MAX,
> +	TX_PRESET_TABLE_MAX,
> +	QSFP_ATTEN_TABLE_MAX,
> +	VARIABLE_SETTINGS_TABLE_MAX
> +};
> +
> +/* forwards */
> +static void dispose_one_firmware(struct firmware_details *fdet);
> +static int load_fabric_serdes_firmware(struct hfi2_devdata *dd,
> +				       struct firmware_details *fdet);
> +static void dump_fw_version(struct hfi2_devdata *dd);
> +
> +/*
> + * Read a single 64-bit value from 8051 data memory.
> + *
> + * Expects:
> + * o caller to have already set up data read, no auto increment
> + * o caller to turn off read enable when finished
> + *
> + * The address argument is a byte offset.  Bits 0:2 in the address are
> + * ignored - i.e. the hardware will always do aligned 8-byte reads as if
> + * the lower bits are zero.
> + *
> + * Return 0 on success, -ENXIO on a read error (timeout).
> + */
> +static int __read_8051_data(struct hfi2_devdata *dd, u32 addr, u64 *result)
> +{
> +	u64 reg;
> +	int count;
> +
> +	/* step 1: set the address, clear enable */
> +	reg = (addr & DC_DC8051_CFG_RAM_ACCESS_CTRL_ADDRESS_MASK)
> +			<< DC_DC8051_CFG_RAM_ACCESS_CTRL_ADDRESS_SHIFT;
> +	write_csr(dd, DC_DC8051_CFG_RAM_ACCESS_CTRL, reg);
> +	/* step 2: enable */
> +	write_csr(dd, DC_DC8051_CFG_RAM_ACCESS_CTRL,
> +		  reg | DC_DC8051_CFG_RAM_ACCESS_CTRL_READ_ENA_SMASK);
> +
> +	/* wait until ACCESS_COMPLETED is set */
> +	count = 0;
> +	while ((read_csr(dd, DC_DC8051_CFG_RAM_ACCESS_STATUS)
> +		    & DC_DC8051_CFG_RAM_ACCESS_STATUS_ACCESS_COMPLETED_SMASK)
> +		    == 0) {
> +		count++;
> +		if (count > DC8051_ACCESS_TIMEOUT) {
> +			dd_dev_err(dd, "timeout reading 8051 data\n");
> +			return -ENXIO;
> +		}
> +		ndelay(10);
> +	}
> +
> +	/* gather the data */
> +	*result = read_csr(dd, DC_DC8051_CFG_RAM_ACCESS_RD_DATA);
> +
> +	return 0;
> +}
> +
> +/*
> + * Read 8051 data starting at addr, for len bytes.  Will read in 8-byte chunks.
> + * Return 0 on success, -errno on error.
> + */
> +int read_8051_data(struct hfi2_devdata *dd, u32 addr, u32 len, u64 *result)
> +{
> +	unsigned long flags;
> +	u32 done;
> +	int ret = 0;
> +
> +	spin_lock_irqsave(&dd->dc8051_memlock, flags);
> +
> +	/* data read set-up, no auto-increment */
> +	write_csr(dd, DC_DC8051_CFG_RAM_ACCESS_SETUP, 0);
> +
> +	for (done = 0; done < len; addr += 8, done += 8, result++) {
> +		ret = __read_8051_data(dd, addr, result);
> +		if (ret)
> +			break;
> +	}
> +
> +	/* turn off read enable */
> +	write_csr(dd, DC_DC8051_CFG_RAM_ACCESS_CTRL, 0);
> +
> +	spin_unlock_irqrestore(&dd->dc8051_memlock, flags);
> +
> +	return ret;
> +}
> +
> +/*
> + * Write data or code to the 8051 code or data RAM.
> + */
> +static int write_8051(struct hfi2_devdata *dd, int code, u32 start,
> +		      const u8 *data, u32 len)
> +{
> +	u64 reg;
> +	u32 offset;
> +	int aligned, count;
> +
> +	/* check alignment */
> +	aligned = ((unsigned long)data & 0x7) == 0;
> +
> +	/* write set-up */
> +	reg = (code ? DC_DC8051_CFG_RAM_ACCESS_SETUP_RAM_SEL_SMASK : 0ull)
> +		| DC_DC8051_CFG_RAM_ACCESS_SETUP_AUTO_INCR_ADDR_SMASK;
> +	write_csr(dd, DC_DC8051_CFG_RAM_ACCESS_SETUP, reg);
> +
> +	reg = ((start & DC_DC8051_CFG_RAM_ACCESS_CTRL_ADDRESS_MASK)
> +			<< DC_DC8051_CFG_RAM_ACCESS_CTRL_ADDRESS_SHIFT)
> +		| DC_DC8051_CFG_RAM_ACCESS_CTRL_WRITE_ENA_SMASK;
> +	write_csr(dd, DC_DC8051_CFG_RAM_ACCESS_CTRL, reg);
> +
> +	/* write */
> +	for (offset = 0; offset < len; offset += 8) {
> +		int bytes = len - offset;
> +
> +		if (bytes < 8) {
> +			reg = 0;
> +			memcpy(&reg, &data[offset], bytes);
> +		} else if (aligned) {
> +			reg = *(u64 *)&data[offset];
> +		} else {
> +			memcpy(&reg, &data[offset], 8);
> +		}
> +		write_csr(dd, DC_DC8051_CFG_RAM_ACCESS_WR_DATA, reg);
> +
> +		/* wait until ACCESS_COMPLETED is set */
> +		count = 0;
> +		while ((read_csr(dd, DC_DC8051_CFG_RAM_ACCESS_STATUS)
> +		    & DC_DC8051_CFG_RAM_ACCESS_STATUS_ACCESS_COMPLETED_SMASK)
> +		    == 0) {
> +			count++;
> +			if (count > DC8051_ACCESS_TIMEOUT) {
> +				dd_dev_err(dd, "timeout writing 8051 data\n");
> +				return -ENXIO;
> +			}
> +			udelay(1);
> +		}
> +	}
> +
> +	/* turn off write access, auto increment (also sets to data access) */
> +	write_csr(dd, DC_DC8051_CFG_RAM_ACCESS_CTRL, 0);
> +	write_csr(dd, DC_DC8051_CFG_RAM_ACCESS_SETUP, 0);
> +
> +	return 0;
> +}
> +
> +/* return 0 if values match, non-zero and complain otherwise */
> +static int invalid_header(struct hfi2_devdata *dd, const char *what,
> +			  u32 actual, u32 expected)
> +{
> +	if (actual == expected)
> +		return 0;
> +
> +	dd_dev_err(dd,
> +		   "invalid firmware header field %s: expected 0x%x, actual 0x%x\n",
> +		   what, expected, actual);
> +	return 1;
> +}
> +
> +/*
> + * Verify that the static fields in the CSS header match.
> + */
> +static int verify_css_header(struct hfi2_devdata *dd, struct css_header *css)
> +{
> +	/* verify CSS header fields (most sizes are in DW, so add /4) */
> +	if (invalid_header(dd, "module_type", css->module_type,
> +			   CSS_MODULE_TYPE) ||
> +	    invalid_header(dd, "header_len", css->header_len,
> +			   (sizeof(struct firmware_file) / 4)) ||
> +	    invalid_header(dd, "header_version", css->header_version,
> +			   CSS_HEADER_VERSION) ||
> +	    invalid_header(dd, "module_vendor", css->module_vendor,
> +			   CSS_MODULE_VENDOR) ||
> +	    invalid_header(dd, "key_size", css->key_size, KEY_SIZE / 4) ||
> +	    invalid_header(dd, "modulus_size", css->modulus_size,
> +			   KEY_SIZE / 4) ||
> +	    invalid_header(dd, "exponent_size", css->exponent_size,
> +			   EXPONENT_SIZE / 4)) {
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Make sure there are at least some bytes after the prefix.
> + */
> +static int payload_check(struct hfi2_devdata *dd, const char *name,
> +			 long file_size, long prefix_size)
> +{
> +	/* make sure we have some payload */
> +	if (prefix_size >= file_size) {
> +		dd_dev_err(dd,
> +			   "firmware \"%s\", size %ld, must be larger than %ld bytes\n",
> +			   name, file_size, prefix_size);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Request the firmware from the system.  Extract the pieces and fill in
> + * fdet.  If successful, the caller will need to call dispose_one_firmware().
> + * Returns 0 on success, -ERRNO on error.
> + */
> +static int obtain_one_firmware(struct hfi2_devdata *dd, const char *name,
> +			       struct firmware_details *fdet)
> +{
> +	struct css_header *css;
> +	int ret;
> +
> +	memset(fdet, 0, sizeof(*fdet));
> +
> +	ret = request_firmware(&fdet->fw, name, &dd->pcidev->dev);
> +	if (ret) {
> +		dd_dev_warn(dd, "cannot find firmware \"%s\", err %d\n",
> +			    name, ret);
> +		return ret;
> +	}
> +
> +	/* verify the firmware */
> +	if (fdet->fw->size < sizeof(struct css_header)) {
> +		dd_dev_err(dd, "firmware \"%s\" is too small\n", name);
> +		ret = -EINVAL;
> +		goto done;
> +	}
> +	css = (struct css_header *)fdet->fw->data;
> +
> +	hfi2_cdbg(FIRMWARE, "Firmware %s details:", name);
> +	hfi2_cdbg(FIRMWARE, "file size: 0x%lx bytes", fdet->fw->size);
> +	hfi2_cdbg(FIRMWARE, "CSS structure:");
> +	hfi2_cdbg(FIRMWARE, "  module_type    0x%x", css->module_type);
> +	hfi2_cdbg(FIRMWARE, "  header_len     0x%03x (0x%03x bytes)",
> +		  css->header_len, 4 * css->header_len);
> +	hfi2_cdbg(FIRMWARE, "  header_version 0x%x", css->header_version);
> +	hfi2_cdbg(FIRMWARE, "  module_id      0x%x", css->module_id);
> +	hfi2_cdbg(FIRMWARE, "  module_vendor  0x%x", css->module_vendor);
> +	hfi2_cdbg(FIRMWARE, "  date           0x%x", css->date);
> +	hfi2_cdbg(FIRMWARE, "  size           0x%03x (0x%03x bytes)",
> +		  css->size, 4 * css->size);
> +	hfi2_cdbg(FIRMWARE, "  key_size       0x%03x (0x%03x bytes)",
> +		  css->key_size, 4 * css->key_size);
> +	hfi2_cdbg(FIRMWARE, "  modulus_size   0x%03x (0x%03x bytes)",
> +		  css->modulus_size, 4 * css->modulus_size);
> +	hfi2_cdbg(FIRMWARE, "  exponent_size  0x%03x (0x%03x bytes)",
> +		  css->exponent_size, 4 * css->exponent_size);
> +	hfi2_cdbg(FIRMWARE, "firmware size: 0x%lx bytes",
> +		  fdet->fw->size - sizeof(struct firmware_file));
> +
> +	/*
> +	 * If the file does not have a valid CSS header, fail.
> +	 * Otherwise, check the CSS size field for an expected size.
> +	 * The augmented file has r2 and mu inserted after the header
> +	 * was generated, so there will be a known difference between
> +	 * the CSS header size and the actual file size.  Use this
> +	 * difference to identify an augmented file.
> +	 *
> +	 * Note: css->size is in DWORDs, multiply by 4 to get bytes.
> +	 */
> +	ret = verify_css_header(dd, css);
> +	if (ret) {
> +		dd_dev_info(dd, "Invalid CSS header for \"%s\"\n", name);
> +	} else if ((css->size * 4) == fdet->fw->size) {
> +		/* non-augmented firmware file */
> +		struct firmware_file *ff = (struct firmware_file *)
> +							fdet->fw->data;
> +
> +		/* make sure there are bytes in the payload */
> +		ret = payload_check(dd, name, fdet->fw->size,
> +				    sizeof(struct firmware_file));
> +		if (ret == 0) {
> +			fdet->css_header = css;
> +			fdet->modulus = ff->modulus;
> +			fdet->exponent = ff->exponent;
> +			fdet->signature = ff->signature;
> +			fdet->r2 = fdet->dummy_header.r2; /* use dummy space */
> +			fdet->mu = fdet->dummy_header.mu; /* use dummy space */
> +			fdet->firmware_ptr = ff->firmware;
> +			fdet->firmware_len = fdet->fw->size -
> +						sizeof(struct firmware_file);
> +			/*
> +			 * Header does not include r2 and mu - generate here.
> +			 * For now, fail.
> +			 */
> +			dd_dev_err(dd, "driver is unable to validate firmware without r2 and mu (not in firmware file)\n");
> +			ret = -EINVAL;
> +		}
> +	} else if ((css->size * 4) + AUGMENT_SIZE == fdet->fw->size) {
> +		/* augmented firmware file */
> +		struct augmented_firmware_file *aff =
> +			(struct augmented_firmware_file *)fdet->fw->data;
> +
> +		/* make sure there are bytes in the payload */
> +		ret = payload_check(dd, name, fdet->fw->size,
> +				    sizeof(struct augmented_firmware_file));
> +		if (ret == 0) {
> +			fdet->css_header = css;
> +			fdet->modulus = aff->modulus;
> +			fdet->exponent = aff->exponent;
> +			fdet->signature = aff->signature;
> +			fdet->r2 = aff->r2;
> +			fdet->mu = aff->mu;
> +			fdet->firmware_ptr = aff->firmware;
> +			fdet->firmware_len = fdet->fw->size -
> +					sizeof(struct augmented_firmware_file);
> +		}
> +	} else {
> +		/* css->size check failed */
> +		dd_dev_err(dd,
> +			   "invalid firmware header field size: expected 0x%lx or 0x%lx, actual 0x%x\n",
> +			   fdet->fw->size / 4,
> +			   (fdet->fw->size - AUGMENT_SIZE) / 4,
> +			   css->size);
> +
> +		ret = -EINVAL;
> +	}
> +
> +done:
> +	/* if returning an error, clean up after ourselves */
> +	if (ret)
> +		dispose_one_firmware(fdet);
> +	return ret;
> +}
> +
> +static void dispose_one_firmware(struct firmware_details *fdet)
> +{
> +	release_firmware(fdet->fw);
> +	/* erase all previous information */
> +	memset(fdet, 0, sizeof(*fdet));
> +}
> +
> +/*
> + * Obtain the 4 firmwares from the OS.  All must be obtained at once or not
> + * at all.  If called with the firmware state in FW_TRY, use alternate names.
> + * On exit, this routine will have set the firmware state to one of FW_TRY,
> + * FW_FINAL, or FW_ERR.
> + *
> + * Must be holding fw_mutex.
> + */
> +static void __obtain_firmware(struct hfi2_devdata *dd)
> +{
> +	int err = 0;
> +
> +	if (fw_state == FW_FINAL)	/* nothing more to obtain */
> +		return;
> +	if (fw_state == FW_ERR)		/* already in error */
> +		return;
> +
> +	/* fw_state is FW_EMPTY or FW_TRY */
> +retry:
> +	if (fw_state == FW_TRY) {
> +		/*
> +		 * We tried the original and it failed.  Move to the
> +		 * alternate.
> +		 */
> +		dd_dev_warn(dd, "using alternate firmware names\n");
> +		/*
> +		 * Let others run.  Some systems, when missing firmware, does
> +		 * something that holds for 30 seconds.  If we do that twice
> +		 * in a row it triggers task blocked warning.
> +		 */
> +		cond_resched();
> +		if (fw_8051_load)
> +			dispose_one_firmware(&fw_8051);
> +		if (fw_fabric_serdes_load)
> +			dispose_one_firmware(&fw_fabric);
> +		if (fw_sbus_load)
> +			dispose_one_firmware(&fw_sbus);
> +		if (fw_pcie_serdes_load)
> +			dispose_one_firmware(&fw_pcie);
> +		fw_8051_name = ALT_FW_8051_NAME_ASIC;
> +		fw_fabric_serdes_name = ALT_FW_FABRIC_NAME;
> +		fw_sbus_name = ALT_FW_SBUS_NAME;
> +		fw_pcie_serdes_name = ALT_FW_PCIE_NAME;
> +
> +		/*
> +		 * Add a delay before obtaining and loading debug firmware.
> +		 * Authorization will fail if the delay between firmware
> +		 * authorization events is shorter than 50us. Add 100us to
> +		 * make a delay time safe.
> +		 */
> +		usleep_range(100, 120);
> +	}
> +
> +	if (fw_sbus_load) {
> +		err = obtain_one_firmware(dd, fw_sbus_name, &fw_sbus);
> +		if (err)
> +			goto done;
> +	}
> +
> +	if (fw_pcie_serdes_load) {
> +		err = obtain_one_firmware(dd, fw_pcie_serdes_name, &fw_pcie);
> +		if (err)
> +			goto done;
> +	}
> +
> +	if (fw_fabric_serdes_load) {
> +		err = obtain_one_firmware(dd, fw_fabric_serdes_name,
> +					  &fw_fabric);
> +		if (err)
> +			goto done;
> +	}
> +
> +	if (fw_8051_load) {
> +		err = obtain_one_firmware(dd, fw_8051_name, &fw_8051);
> +		if (err)
> +			goto done;
> +	}
> +
> +done:
> +	if (err) {
> +		/* oops, had problems obtaining a firmware */
> +		if (fw_state == FW_EMPTY && dd->icode == ICODE_RTL_SILICON) {
> +			/* retry with alternate (RTL only) */
> +			fw_state = FW_TRY;
> +			goto retry;
> +		}
> +		dd_dev_err(dd, "unable to obtain working firmware\n");
> +		fw_state = FW_ERR;
> +		fw_err = -ENOENT;
> +	} else {
> +		/* success */
> +		if (fw_state == FW_EMPTY &&
> +		    dd->icode != ICODE_FUNCTIONAL_SIMULATOR)
> +			fw_state = FW_TRY;	/* may retry later */
> +		else
> +			fw_state = FW_FINAL;	/* cannot try again */
> +	}
> +}
> +
> +/*
> + * Called by all HFIs when loading their firmware - i.e. device probe time.
> + * The first one will do the actual firmware load.  Use a mutex to resolve
> + * any possible race condition.
> + *
> + * The call to this routine cannot be moved to driver load because the kernel
> + * call request_firmware() requires a device which is only available after
> + * the first device probe.
> + */
> +static int obtain_firmware(struct hfi2_devdata *dd)
> +{
> +	unsigned long timeout;
> +
> +	mutex_lock(&fw_mutex);
> +
> +	/* 40s delay due to long delay on missing firmware on some systems */
> +	timeout = jiffies + msecs_to_jiffies(40000);
> +	while (fw_state == FW_TRY) {
> +		/*
> +		 * Another device is trying the firmware.  Wait until it
> +		 * decides what works (or not).
> +		 */
> +		if (time_after(jiffies, timeout)) {
> +			/* waited too long */
> +			dd_dev_err(dd, "Timeout waiting for firmware try");
> +			fw_state = FW_ERR;
> +			fw_err = -ETIMEDOUT;
> +			break;
> +		}
> +		mutex_unlock(&fw_mutex);
> +		msleep(20);	/* arbitrary delay */
> +		mutex_lock(&fw_mutex);
> +	}
> +	/* not in FW_TRY state */
> +
> +	/* set fw_state to FW_TRY, FW_FINAL, or FW_ERR, and fw_err */
> +	if (fw_state == FW_EMPTY)
> +		__obtain_firmware(dd);
> +
> +	mutex_unlock(&fw_mutex);
> +	return fw_err;
> +}
> +
> +/*
> + * Called when the driver unloads.  The timing is asymmetric with its
> + * counterpart, obtain_firmware().  If called at device remove time,
> + * then it is conceivable that another device could probe while the
> + * firmware is being disposed.  The mutexes can be moved to do that
> + * safely, but then the firmware would be requested from the OS multiple
> + * times.
> + *
> + * No mutex is needed as the driver is unloading and there cannot be any
> + * other callers.
> + */
> +void dispose_firmware(void)
> +{
> +	dispose_one_firmware(&fw_8051);
> +	dispose_one_firmware(&fw_fabric);
> +	dispose_one_firmware(&fw_pcie);
> +	dispose_one_firmware(&fw_sbus);
> +
> +	/* retain the error state, otherwise revert to empty */
> +	if (fw_state != FW_ERR)
> +		fw_state = FW_EMPTY;
> +}
> +
> +/*
> + * Called with the result of a firmware download.
> + *
> + * Return 1 to retry loading the firmware, 0 to stop.
> + */
> +static int retry_firmware(struct hfi2_devdata *dd, int load_result)
> +{
> +	int retry;
> +
> +	mutex_lock(&fw_mutex);
> +
> +	if (load_result == 0) {
> +		/*
> +		 * The load succeeded, so expect all others to do the same.
> +		 * Do not retry again.
> +		 */
> +		if (fw_state == FW_TRY)
> +			fw_state = FW_FINAL;
> +		retry = 0;	/* do NOT retry */
> +	} else if (fw_state == FW_TRY) {
> +		/* load failed, obtain alternate firmware */
> +		__obtain_firmware(dd);
> +		retry = (fw_state == FW_FINAL);
> +	} else {
> +		/* else in FW_FINAL or FW_ERR, no retry in either case */
> +		retry = 0;
> +	}
> +
> +	mutex_unlock(&fw_mutex);
> +	return retry;
> +}
> +
> +/*
> + * Write a block of data to a given array CSR.  All calls will be in
> + * multiples of 8 bytes.
> + */
> +static void write_rsa_data(struct hfi2_devdata *dd, int what,
> +			   const u8 *data, int nbytes)
> +{
> +	int qw_size = nbytes / 8;
> +	int i;
> +
> +	if (((unsigned long)data & 0x7) == 0) {
> +		/* aligned */
> +		u64 *ptr = (u64 *)data;
> +
> +		for (i = 0; i < qw_size; i++, ptr++)
> +			write_csr(dd, what + (8 * i), *ptr);
> +	} else {
> +		/* not aligned */
> +		for (i = 0; i < qw_size; i++, data += 8) {
> +			u64 value;
> +
> +			memcpy(&value, data, 8);
> +			write_csr(dd, what + (8 * i), value);
> +		}
> +	}
> +}
> +
> +/*
> + * Write a block of data to a given CSR as a stream of writes.  All calls will
> + * be in multiples of 8 bytes.
> + */
> +static void write_streamed_rsa_data(struct hfi2_devdata *dd, int what,
> +				    const u8 *data, int nbytes)
> +{
> +	u64 *ptr = (u64 *)data;
> +	int qw_size = nbytes / 8;
> +
> +	for (; qw_size > 0; qw_size--, ptr++)
> +		write_csr(dd, what, *ptr);
> +}
> +
> +/*
> + * Download the signature and start the RSA mechanism.  Wait for
> + * RSA_ENGINE_TIMEOUT before giving up.
> + */
> +static int run_rsa(struct hfi2_devdata *dd, const char *who,
> +		   const u8 *signature)
> +{
> +	unsigned long timeout;
> +	u64 reg;
> +	u32 status;
> +	int ret = 0;
> +
> +	/* write the signature */
> +	write_rsa_data(dd, MISC_CFG_RSA_SIGNATURE, signature, KEY_SIZE);
> +
> +	/* initialize RSA */
> +	write_csr(dd, MISC_CFG_RSA_CMD, RSA_CMD_INIT);
> +
> +	/*
> +	 * Make sure the engine is idle and insert a delay between the two
> +	 * writes to MISC_CFG_RSA_CMD.
> +	 */
> +	status = (read_csr(dd, MISC_CFG_FW_CTRL)
> +			   & MISC_CFG_FW_CTRL_RSA_STATUS_SMASK)
> +			     >> MISC_CFG_FW_CTRL_RSA_STATUS_SHIFT;
> +	if (status != RSA_STATUS_IDLE) {
> +		dd_dev_err(dd, "%s security engine not idle - giving up\n",
> +			   who);
> +		return -EBUSY;
> +	}
> +
> +	/* start RSA */
> +	write_csr(dd, MISC_CFG_RSA_CMD, RSA_CMD_START);
> +
> +	/*
> +	 * Look for the result.
> +	 *
> +	 * The RSA engine is hooked up to two MISC errors.  The driver
> +	 * masks these errors as they do not respond to the standard
> +	 * error "clear down" mechanism.  Look for these errors here and
> +	 * clear them when possible.  This routine will exit with the
> +	 * errors of the current run still set.
> +	 *
> +	 * MISC_FW_AUTH_FAILED_ERR
> +	 *	Firmware authorization failed.  This can be cleared by
> +	 *	re-initializing the RSA engine, then clearing the status bit.
> +	 *	Do not re-init the RSA angine immediately after a successful
> +	 *	run - this will reset the current authorization.
> +	 *
> +	 * MISC_KEY_MISMATCH_ERR
> +	 *	Key does not match.  The only way to clear this is to load
> +	 *	a matching key then clear the status bit.  If this error
> +	 *	is raised, it will persist outside of this routine until a
> +	 *	matching key is loaded.
> +	 */
> +	timeout = msecs_to_jiffies(RSA_ENGINE_TIMEOUT) + jiffies;
> +	while (1) {
> +		status = (read_csr(dd, MISC_CFG_FW_CTRL)
> +			   & MISC_CFG_FW_CTRL_RSA_STATUS_SMASK)
> +			     >> MISC_CFG_FW_CTRL_RSA_STATUS_SHIFT;
> +
> +		if (status == RSA_STATUS_IDLE) {
> +			/* should not happen */
> +			dd_dev_err(dd, "%s firmware security bad idle state\n",
> +				   who);
> +			ret = -EINVAL;
> +			break;
> +		} else if (status == RSA_STATUS_DONE) {
> +			/* finished successfully */
> +			break;
> +		} else if (status == RSA_STATUS_FAILED) {
> +			/* finished unsuccessfully */
> +			ret = -EINVAL;
> +			break;
> +		}
> +		/* else still active */
> +
> +		if (time_after(jiffies, timeout)) {
> +			/*
> +			 * Timed out while active.  We can't reset the engine
> +			 * if it is stuck active, but run through the
> +			 * error code to see what error bits are set.
> +			 */
> +			dd_dev_err(dd, "%s firmware security time out\n", who);
> +			ret = -ETIMEDOUT;
> +			break;
> +		}
> +
> +		msleep(20);
> +	}
> +
> +	/*
> +	 * Arrive here on success or failure.  Clear all RSA engine
> +	 * errors.  All current errors will stick - the RSA logic is keeping
> +	 * error high.  All previous errors will clear - the RSA logic
> +	 * is not keeping the error high.
> +	 */
> +	write_csr(dd, MISC_ERR_CLEAR,
> +		  MISC_ERR_STATUS_MISC_FW_AUTH_FAILED_ERR_SMASK |
> +		  MISC_ERR_STATUS_MISC_KEY_MISMATCH_ERR_SMASK);
> +	/*
> +	 * All that is left are the current errors.  Print warnings on
> +	 * authorization failure details, if any.  Firmware authorization
> +	 * can be retried, so these are only warnings.
> +	 */
> +	reg = read_csr(dd, MISC_ERR_STATUS);
> +	if (ret) {
> +		if (reg & MISC_ERR_STATUS_MISC_FW_AUTH_FAILED_ERR_SMASK)
> +			dd_dev_warn(dd, "%s firmware authorization failed\n",
> +				    who);
> +		if (reg & MISC_ERR_STATUS_MISC_KEY_MISMATCH_ERR_SMASK)
> +			dd_dev_warn(dd, "%s firmware key mismatch\n", who);
> +	}
> +
> +	return ret;
> +}
> +
> +static void load_security_variables(struct hfi2_devdata *dd,
> +				    struct firmware_details *fdet)
> +{
> +	/* Security variables a.  Write the modulus */
> +	write_rsa_data(dd, MISC_CFG_RSA_MODULUS, fdet->modulus, KEY_SIZE);
> +	/* Security variables b.  Write the r2 */
> +	write_rsa_data(dd, MISC_CFG_RSA_R2, fdet->r2, KEY_SIZE);
> +	/* Security variables c.  Write the mu */
> +	write_rsa_data(dd, MISC_CFG_RSA_MU, fdet->mu, MU_SIZE);
> +	/* Security variables d.  Write the header */
> +	write_streamed_rsa_data(dd, MISC_CFG_SHA_PRELOAD,
> +				(u8 *)fdet->css_header,
> +				sizeof(struct css_header));
> +}
> +
> +/* return the 8051 firmware state */
> +static inline u32 get_firmware_state(struct hfi2_devdata *dd)
> +{
> +	u64 reg = read_csr(dd, DC_DC8051_STS_CUR_STATE);
> +
> +	return (reg >> DC_DC8051_STS_CUR_STATE_FIRMWARE_SHIFT)
> +				& DC_DC8051_STS_CUR_STATE_FIRMWARE_MASK;
> +}
> +
> +/*
> + * Wait until the firmware is up and ready to take host requests.
> + * Return 0 on success, -ETIMEDOUT on timeout.
> + */
> +int wait_fm_ready(struct hfi2_devdata *dd, u32 mstimeout)
> +{
> +	unsigned long timeout;
> +
> +	timeout = msecs_to_jiffies(mstimeout) + jiffies;
> +	while (1) {
> +		if (get_firmware_state(dd) == 0xa0)	/* ready */
> +			return 0;
> +		if (time_after(jiffies, timeout))	/* timed out */
> +			return -ETIMEDOUT;
> +		usleep_range(1950, 2050); /* sleep 2ms-ish */
> +	}
> +}
> +
> +/*
> + * Load the 8051 firmware.
> + */
> +static int load_8051_firmware(struct hfi2_devdata *dd,
> +			      struct firmware_details *fdet)
> +{
> +	u64 reg;
> +	int ret;
> +	u8 ver_major;
> +	u8 ver_minor;
> +	u8 ver_patch;
> +
> +	/*
> +	 * DC Reset sequence
> +	 * Load DC 8051 firmware
> +	 */
> +	/*
> +	 * DC reset step 1: Reset DC8051
> +	 */
> +	reg = DC_DC8051_CFG_RST_M8051W_SMASK
> +		| DC_DC8051_CFG_RST_CRAM_SMASK
> +		| DC_DC8051_CFG_RST_DRAM_SMASK
> +		| DC_DC8051_CFG_RST_IRAM_SMASK
> +		| DC_DC8051_CFG_RST_SFR_SMASK;
> +	write_csr(dd, DC_DC8051_CFG_RST, reg);
> +
> +	/*
> +	 * DC reset step 2 (optional): Load 8051 data memory with link
> +	 * configuration
> +	 */
> +
> +	/*
> +	 * DC reset step 3: Load DC8051 firmware
> +	 */
> +	/* release all but the core reset */
> +	reg = DC_DC8051_CFG_RST_M8051W_SMASK;
> +	write_csr(dd, DC_DC8051_CFG_RST, reg);
> +
> +	/* Firmware load step 1 */
> +	load_security_variables(dd, fdet);
> +
> +	/*
> +	 * Firmware load step 2.  Clear MISC_CFG_FW_CTRL.FW_8051_LOADED
> +	 */
> +	write_csr(dd, MISC_CFG_FW_CTRL, 0);
> +
> +	/* Firmware load steps 3-5 */
> +	ret = write_8051(dd, 1/*code*/, 0, fdet->firmware_ptr,
> +			 fdet->firmware_len);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * DC reset step 4. Host starts the DC8051 firmware
> +	 */
> +	/*
> +	 * Firmware load step 6.  Set MISC_CFG_FW_CTRL.FW_8051_LOADED
> +	 */
> +	write_csr(dd, MISC_CFG_FW_CTRL, MISC_CFG_FW_CTRL_FW_8051_LOADED_SMASK);
> +
> +	/* Firmware load steps 7-10 */
> +	ret = run_rsa(dd, "8051", fdet->signature);
> +	if (ret)
> +		return ret;
> +
> +	/* clear all reset bits, releasing the 8051 */
> +	write_csr(dd, DC_DC8051_CFG_RST, 0ull);
> +
> +	/*
> +	 * DC reset step 5. Wait for firmware to be ready to accept host
> +	 * requests.
> +	 */
> +	ret = wait_fm_ready(dd, TIMEOUT_8051_START);
> +	if (ret) { /* timed out */
> +		dd_dev_err(dd, "8051 start timeout, current state 0x%x\n",
> +			   get_firmware_state(dd));
> +		return -ETIMEDOUT;
> +	}
> +
> +	read_misc_status(dd, &ver_major, &ver_minor, &ver_patch);
> +	dd_dev_info(dd, "8051 firmware version %d.%d.%d\n",
> +		    (int)ver_major, (int)ver_minor, (int)ver_patch);
> +	dd->dc8051_ver = dc8051_ver(ver_major, ver_minor, ver_patch);
> +	ret = write_host_interface_version(dd, HOST_INTERFACE_VERSION);
> +	if (ret != HCMD_SUCCESS) {
> +		dd_dev_err(dd,
> +			   "Failed to set host interface version, return 0x%x\n",
> +			   ret);
> +		return -EIO;
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Write the SBus request register
> + *
> + * No need for masking - the arguments are sized exactly.
> + */
> +void sbus_request(struct hfi2_devdata *dd,
> +		  u8 receiver_addr, u8 data_addr, u8 command, u32 data_in)
> +{
> +	write_csr(dd, ASIC_CFG_SBUS_REQUEST,
> +		  ((u64)data_in << ASIC_CFG_SBUS_REQUEST_DATA_IN_SHIFT) |
> +		  ((u64)command << ASIC_CFG_SBUS_REQUEST_COMMAND_SHIFT) |
> +		  ((u64)data_addr << ASIC_CFG_SBUS_REQUEST_DATA_ADDR_SHIFT) |
> +		  ((u64)receiver_addr <<
> +		   ASIC_CFG_SBUS_REQUEST_RECEIVER_ADDR_SHIFT));
> +}
> +
> +/*
> + * Read a value from the SBus.
> + *
> + * Requires the caller to be in fast mode
> + */
> +static u32 sbus_read(struct hfi2_devdata *dd, u8 receiver_addr, u8 data_addr,
> +		     u32 data_in)
> +{
> +	u64 reg;
> +	int retries;
> +	int success = 0;
> +	u32 result = 0;
> +	u32 result_code = 0;
> +
> +	sbus_request(dd, receiver_addr, data_addr, READ_SBUS_RECEIVER, data_in);
> +
> +	for (retries = 0; retries < 100; retries++) {
> +		usleep_range(1000, 1200); /* arbitrary */
> +		reg = read_csr(dd, ASIC_STS_SBUS_RESULT);
> +		result_code = (reg >> ASIC_STS_SBUS_RESULT_RESULT_CODE_SHIFT)
> +				& ASIC_STS_SBUS_RESULT_RESULT_CODE_MASK;
> +		if (result_code != SBUS_READ_COMPLETE)
> +			continue;
> +
> +		success = 1;
> +		result = (reg >> ASIC_STS_SBUS_RESULT_DATA_OUT_SHIFT)
> +			   & ASIC_STS_SBUS_RESULT_DATA_OUT_MASK;
> +		break;
> +	}
> +
> +	if (!success) {
> +		dd_dev_err(dd, "%s: read failed, result code 0x%x\n", __func__,
> +			   result_code);
> +	}
> +
> +	return result;
> +}
> +
> +/*
> + * Turn off the SBus and fabric serdes spicos.
> + *
> + * + Must be called with Sbus fast mode turned on.
> + * + Must be called after fabric serdes broadcast is set up.
> + * + Must be called before the 8051 is loaded - assumes 8051 is not loaded
> + *   when using MISC_CFG_FW_CTRL.
> + */
> +static void turn_off_spicos(struct hfi2_devdata *dd, int flags)
> +{
> +	/* only needed on A0 */
> +	if (!is_ax(dd))
> +		return;
> +
> +	dd_dev_info(dd, "Turning off spicos:%s%s\n",
> +		    flags & SPICO_SBUS ? " SBus" : "",
> +		    flags & SPICO_FABRIC ? " fabric" : "");
> +
> +	write_csr(dd, MISC_CFG_FW_CTRL, ENABLE_SPICO_SMASK);
> +	/* disable SBus spico */
> +	if (flags & SPICO_SBUS)
> +		sbus_request(dd, SBUS_MASTER_BROADCAST, 0x01,
> +			     WRITE_SBUS_RECEIVER, 0x00000040);
> +
> +	/* disable the fabric serdes spicos */
> +	if (flags & SPICO_FABRIC)
> +		sbus_request(dd, fabric_serdes_broadcast[dd->hfi2_id],
> +			     0x07, WRITE_SBUS_RECEIVER, 0x00000000);
> +	write_csr(dd, MISC_CFG_FW_CTRL, 0);
> +}
> +
> +/*
> + * Reset all of the fabric serdes for this HFI in preparation to take the
> + * link to Polling.
> + *
> + * To do a reset, we need to write to the serdes registers.  Unfortunately,
> + * the fabric serdes download to the other HFI on the ASIC will have turned
> + * off the firmware validation on this HFI.  This means we can't write to the
> + * registers to reset the serdes.  Work around this by performing a complete
> + * re-download and validation of the fabric serdes firmware.  This, as a
> + * by-product, will reset the serdes.  NOTE: the re-download requires that
> + * the 8051 be in the Offline state.  I.e. not actively trying to use the
> + * serdes.  This routine is called at the point where the link is Offline and
> + * is getting ready to go to Polling.
> + */
> +void fabric_serdes_reset(struct hfi2_devdata *dd)
> +{
> +	int ret;
> +
> +	if (!fw_fabric_serdes_load)
> +		return;
> +
> +	ret = acquire_chip_resource(dd, CR_SBUS, SBUS_TIMEOUT);
> +	if (ret) {
> +		dd_dev_err(dd,
> +			   "Cannot acquire SBus resource to reset fabric SerDes - perhaps you should reboot\n");
> +		return;
> +	}
> +	set_sbus_fast_mode(dd);
> +
> +	if (is_ax(dd)) {
> +		/* A0 serdes do not work with a re-download */
> +		u8 ra = fabric_serdes_broadcast[dd->hfi2_id];
> +
> +		/* place SerDes in reset and disable SPICO */
> +		sbus_request(dd, ra, 0x07, WRITE_SBUS_RECEIVER, 0x00000011);
> +		/* wait 100 refclk cycles @ 156.25MHz => 640ns */
> +		udelay(1);
> +		/* remove SerDes reset */
> +		sbus_request(dd, ra, 0x07, WRITE_SBUS_RECEIVER, 0x00000010);
> +		/* turn SPICO enable on */
> +		sbus_request(dd, ra, 0x07, WRITE_SBUS_RECEIVER, 0x00000002);
> +	} else {
> +		turn_off_spicos(dd, SPICO_FABRIC);
> +		/*
> +		 * No need for firmware retry - what to download has already
> +		 * been decided.
> +		 * No need to pay attention to the load return - the only
> +		 * failure is a validation failure, which has already been
> +		 * checked by the initial download.
> +		 */
> +		(void)load_fabric_serdes_firmware(dd, &fw_fabric);
> +	}
> +
> +	clear_sbus_fast_mode(dd);
> +	release_chip_resource(dd, CR_SBUS);
> +}
> +
> +/* Access to the SBus in this routine should probably be serialized */
> +int sbus_request_slow(struct hfi2_devdata *dd,
> +		      u8 receiver_addr, u8 data_addr, u8 command, u32 data_in)
> +{
> +	u64 reg, count = 0;
> +
> +	/* make sure fast mode is clear */
> +	clear_sbus_fast_mode(dd);
> +
> +	sbus_request(dd, receiver_addr, data_addr, command, data_in);
> +	write_csr(dd, ASIC_CFG_SBUS_EXECUTE,
> +		  ASIC_CFG_SBUS_EXECUTE_EXECUTE_SMASK);
> +	/* Wait for both DONE and RCV_DATA_VALID to go high */
> +	reg = read_csr(dd, ASIC_STS_SBUS_RESULT);
> +	while (!((reg & ASIC_STS_SBUS_RESULT_DONE_SMASK) &&
> +		 (reg & ASIC_STS_SBUS_RESULT_RCV_DATA_VALID_SMASK))) {
> +		if (count++ >= SBUS_MAX_POLL_COUNT) {
> +			u64 counts = read_csr(dd, ASIC_STS_SBUS_COUNTERS);
> +			/*
> +			 * If the loop has timed out, we are OK if DONE bit
> +			 * is set and RCV_DATA_VALID and EXECUTE counters
> +			 * are the same. If not, we cannot proceed.
> +			 */
> +			if ((reg & ASIC_STS_SBUS_RESULT_DONE_SMASK) &&
> +			    (SBUS_COUNTER(counts, RCV_DATA_VALID) ==
> +			     SBUS_COUNTER(counts, EXECUTE)))
> +				break;
> +			return -ETIMEDOUT;
> +		}
> +		udelay(1);
> +		reg = read_csr(dd, ASIC_STS_SBUS_RESULT);
> +	}
> +	count = 0;
> +	write_csr(dd, ASIC_CFG_SBUS_EXECUTE, 0);
> +	/* Wait for DONE to clear after EXECUTE is cleared */
> +	reg = read_csr(dd, ASIC_STS_SBUS_RESULT);
> +	while (reg & ASIC_STS_SBUS_RESULT_DONE_SMASK) {
> +		if (count++ >= SBUS_MAX_POLL_COUNT)
> +			return -ETIME;
> +		udelay(1);
> +		reg = read_csr(dd, ASIC_STS_SBUS_RESULT);
> +	}
> +	return 0;
> +}
> +
> +static int load_fabric_serdes_firmware(struct hfi2_devdata *dd,
> +				       struct firmware_details *fdet)
> +{
> +	int i, err;
> +	const u8 ra = fabric_serdes_broadcast[dd->hfi2_id]; /* receiver addr */
> +
> +	dd_dev_info(dd, "Downloading fabric firmware\n");
> +
> +	/* step 1: load security variables */
> +	load_security_variables(dd, fdet);
> +	/* step 2: place SerDes in reset and disable SPICO */
> +	sbus_request(dd, ra, 0x07, WRITE_SBUS_RECEIVER, 0x00000011);
> +	/* wait 100 refclk cycles @ 156.25MHz => 640ns */
> +	udelay(1);
> +	/* step 3:  remove SerDes reset */
> +	sbus_request(dd, ra, 0x07, WRITE_SBUS_RECEIVER, 0x00000010);
> +	/* step 4: assert IMEM override */
> +	sbus_request(dd, ra, 0x00, WRITE_SBUS_RECEIVER, 0x40000000);
> +	/* step 5: download SerDes machine code */
> +	for (i = 0; i < fdet->firmware_len; i += 4) {
> +		sbus_request(dd, ra, 0x0a, WRITE_SBUS_RECEIVER,
> +			     *(u32 *)&fdet->firmware_ptr[i]);
> +	}
> +	/* step 6: IMEM override off */
> +	sbus_request(dd, ra, 0x00, WRITE_SBUS_RECEIVER, 0x00000000);
> +	/* step 7: turn ECC on */
> +	sbus_request(dd, ra, 0x0b, WRITE_SBUS_RECEIVER, 0x000c0000);
> +
> +	/* steps 8-11: run the RSA engine */
> +	err = run_rsa(dd, "fabric serdes", fdet->signature);
> +	if (err)
> +		return err;
> +
> +	/* step 12: turn SPICO enable on */
> +	sbus_request(dd, ra, 0x07, WRITE_SBUS_RECEIVER, 0x00000002);
> +	/* step 13: enable core hardware interrupts */
> +	sbus_request(dd, ra, 0x08, WRITE_SBUS_RECEIVER, 0x00000000);
> +
> +	return 0;
> +}
> +
> +static int load_sbus_firmware(struct hfi2_devdata *dd,
> +			      struct firmware_details *fdet)
> +{
> +	int i, err;
> +	const u8 ra = SBUS_MASTER_BROADCAST; /* receiver address */
> +
> +	dd_dev_info(dd, "Downloading SBus firmware\n");
> +
> +	/* step 1: load security variables */
> +	load_security_variables(dd, fdet);
> +	/* step 2: place SPICO into reset and enable off */
> +	sbus_request(dd, ra, 0x01, WRITE_SBUS_RECEIVER, 0x000000c0);
> +	/* step 3: remove reset, enable off, IMEM_CNTRL_EN on */
> +	sbus_request(dd, ra, 0x01, WRITE_SBUS_RECEIVER, 0x00000240);
> +	/* step 4: set starting IMEM address for burst download */
> +	sbus_request(dd, ra, 0x03, WRITE_SBUS_RECEIVER, 0x80000000);
> +	/* step 5: download the SBus Master machine code */
> +	for (i = 0; i < fdet->firmware_len; i += 4) {
> +		sbus_request(dd, ra, 0x14, WRITE_SBUS_RECEIVER,
> +			     *(u32 *)&fdet->firmware_ptr[i]);
> +	}
> +	/* step 6: set IMEM_CNTL_EN off */
> +	sbus_request(dd, ra, 0x01, WRITE_SBUS_RECEIVER, 0x00000040);
> +	/* step 7: turn ECC on */
> +	sbus_request(dd, ra, 0x16, WRITE_SBUS_RECEIVER, 0x000c0000);
> +
> +	/* steps 8-11: run the RSA engine */
> +	err = run_rsa(dd, "SBus", fdet->signature);
> +	if (err)
> +		return err;
> +
> +	/* step 12: set SPICO_ENABLE on */
> +	sbus_request(dd, ra, 0x01, WRITE_SBUS_RECEIVER, 0x00000140);
> +
> +	return 0;
> +}
> +
> +static int load_pcie_serdes_firmware(struct hfi2_devdata *dd,
> +				     struct firmware_details *fdet)
> +{
> +	int i;
> +	const u8 ra = SBUS_MASTER_BROADCAST; /* receiver address */
> +
> +	dd_dev_info(dd, "Downloading PCIe firmware\n");
> +
> +	/* step 1: load security variables */
> +	load_security_variables(dd, fdet);
> +	/* step 2: assert single step (halts the SBus Master spico) */
> +	sbus_request(dd, ra, 0x05, WRITE_SBUS_RECEIVER, 0x00000001);
> +	/* step 3: enable XDMEM access */
> +	sbus_request(dd, ra, 0x01, WRITE_SBUS_RECEIVER, 0x00000d40);
> +	/* step 4: load firmware into SBus Master XDMEM */
> +	/*
> +	 * NOTE: the dmem address, write_en, and wdata are all pre-packed,
> +	 * we only need to pick up the bytes and write them
> +	 */
> +	for (i = 0; i < fdet->firmware_len; i += 4) {
> +		sbus_request(dd, ra, 0x04, WRITE_SBUS_RECEIVER,
> +			     *(u32 *)&fdet->firmware_ptr[i]);
> +	}
> +	/* step 5: disable XDMEM access */
> +	sbus_request(dd, ra, 0x01, WRITE_SBUS_RECEIVER, 0x00000140);
> +	/* step 6: allow SBus Spico to run */
> +	sbus_request(dd, ra, 0x05, WRITE_SBUS_RECEIVER, 0x00000000);
> +
> +	/*
> +	 * steps 7-11: run RSA, if it succeeds, firmware is available to
> +	 * be swapped
> +	 */
> +	return run_rsa(dd, "PCIe serdes", fdet->signature);
> +}
> +
> +/*
> + * Set the given broadcast values on the given list of devices.
> + */
> +static void set_serdes_broadcast(struct hfi2_devdata *dd, u8 bg1, u8 bg2,
> +				 const u8 *addrs, int count)
> +{
> +	while (--count >= 0) {
> +		/*
> +		 * Set BROADCAST_GROUP_1 and BROADCAST_GROUP_2, leave
> +		 * defaults for everything else.  Do not read-modify-write,
> +		 * per instruction from the manufacturer.
> +		 *
> +		 * Register 0xfd:
> +		 *	bits    what
> +		 *	-----	---------------------------------
> +		 *	  0	IGNORE_BROADCAST  (default 0)
> +		 *	11:4	BROADCAST_GROUP_1 (default 0xff)
> +		 *	23:16	BROADCAST_GROUP_2 (default 0xff)
> +		 */
> +		sbus_request(dd, addrs[count], 0xfd, WRITE_SBUS_RECEIVER,
> +			     (u32)bg1 << 4 | (u32)bg2 << 16);
> +	}
> +}
> +
> +int acquire_hw_mutex(struct hfi2_devdata *dd)
> +{
> +	unsigned long timeout;
> +	int try = 0;
> +	u8 mask = 1 << dd->hfi2_id;
> +	u8 user = (u8)read_csr(dd, ASIC_CFG_MUTEX);
> +
> +	if (user == mask) {
> +		dd_dev_info(dd,
> +			    "Hardware mutex already acquired, mutex mask %u\n",
> +			    (u32)mask);
> +		return 0;
> +	}
> +
> +retry:
> +	timeout = msecs_to_jiffies(HM_TIMEOUT) + jiffies;
> +	while (1) {
> +		write_csr(dd, ASIC_CFG_MUTEX, mask);
> +		user = (u8)read_csr(dd, ASIC_CFG_MUTEX);
> +		if (user == mask)
> +			return 0; /* success */
> +		if (time_after(jiffies, timeout))
> +			break; /* timed out */
> +		msleep(20);
> +	}
> +
> +	/* timed out */
> +	dd_dev_err(dd,
> +		   "Unable to acquire hardware mutex, mutex mask %u, my mask %u (%s)\n",
> +		   (u32)user, (u32)mask, (try == 0) ? "retrying" : "giving up");
> +
> +	if (try == 0) {
> +		/* break mutex and retry */
> +		write_csr(dd, ASIC_CFG_MUTEX, 0);
> +		try++;
> +		goto retry;
> +	}
> +
> +	return -EBUSY;
> +}
> +
> +void release_hw_mutex(struct hfi2_devdata *dd)
> +{
> +	u8 mask = 1 << dd->hfi2_id;
> +	u8 user = (u8)read_csr(dd, ASIC_CFG_MUTEX);
> +
> +	if (user != mask)
> +		dd_dev_warn(dd,
> +			    "Unable to release hardware mutex, mutex mask %u, my mask %u\n",
> +			    (u32)user, (u32)mask);
> +	else
> +		write_csr(dd, ASIC_CFG_MUTEX, 0);
> +}
> +
> +/* return the given resource bit(s) as a mask for the given HFI */
> +static inline u64 resource_mask(u32 hfi2_id, u32 resource)
> +{
> +	return ((u64)resource) << (hfi2_id ? CR_DYN_SHIFT : 0);
> +}
> +
> +static void fail_mutex_acquire_message(struct hfi2_devdata *dd,
> +				       const char *func)
> +{
> +	dd_dev_err(dd,
> +		   "%s: hardware mutex stuck - suggest rebooting the machine\n",
> +		   func);
> +}
> +
> +/*
> + * Acquire access to a chip resource.
> + *
> + * Return 0 on success, -EBUSY if resource busy, -EIO if mutex acquire failed.
> + */
> +static int __acquire_chip_resource(struct hfi2_devdata *dd, u32 resource)
> +{
> +	u64 scratch0, all_bits, my_bit;
> +	int ret;
> +
> +	if (resource & CR_DYN_MASK) {
> +		/* a dynamic resource is in use if either HFI has set the bit */
> +		if (dd->pcidev->device == PCI_DEVICE_ID_INTEL0 &&
> +		    (resource & (CR_I2C1 | CR_I2C2))) {
> +			/* discrete devices must serialize across both chains */
> +			all_bits = resource_mask(0, CR_I2C1 | CR_I2C2) |
> +					resource_mask(1, CR_I2C1 | CR_I2C2);
> +		} else {
> +			all_bits = resource_mask(0, resource) |
> +						resource_mask(1, resource);
> +		}
> +		my_bit = resource_mask(dd->hfi2_id, resource);
> +	} else {
> +		/* non-dynamic resources are not split between HFIs */
> +		all_bits = resource;
> +		my_bit = resource;
> +	}
> +
> +	/* lock against other callers within the driver wanting a resource */
> +	mutex_lock(&dd->asic_data->asic_resource_mutex);
> +
> +	ret = acquire_hw_mutex(dd);
> +	if (ret) {
> +		fail_mutex_acquire_message(dd, __func__);
> +		ret = -EIO;
> +		goto done;
> +	}
> +
> +	scratch0 = read_csr(dd, ASIC_CFG_SCRATCH);
> +	if (scratch0 & all_bits) {
> +		ret = -EBUSY;
> +	} else {
> +		write_csr(dd, ASIC_CFG_SCRATCH, scratch0 | my_bit);
> +		/* force write to be visible to other HFI on another OS */
> +		(void)read_csr(dd, ASIC_CFG_SCRATCH);
> +	}
> +
> +	release_hw_mutex(dd);
> +
> +done:
> +	mutex_unlock(&dd->asic_data->asic_resource_mutex);
> +	return ret;
> +}
> +
> +/*
> + * Acquire access to a chip resource, wait up to mswait milliseconds for
> + * the resource to become available.
> + *
> + * Return 0 on success, -EBUSY if busy (even after wait), -EIO if mutex
> + * acquire failed, -EINVAL if there is no asic_data.
> + */
> +int acquire_chip_resource(struct hfi2_devdata *dd, u32 resource, u32 mswait)
> +{
> +	unsigned long timeout;
> +	int ret;
> +
> +	if (!dd->asic_data)
> +		return -EINVAL;
> +
> +	timeout = jiffies + msecs_to_jiffies(mswait);
> +	while (1) {
> +		ret = __acquire_chip_resource(dd, resource);
> +		if (ret != -EBUSY)
> +			return ret;
> +		/* resource is busy, check our timeout */
> +		if (time_after_eq(jiffies, timeout))
> +			return -EBUSY;
> +		usleep_range(80, 120);	/* arbitrary delay */
> +	}
> +}
> +
> +/*
> + * Release access to a chip resource
> + */
> +void release_chip_resource(struct hfi2_devdata *dd, u32 resource)
> +{
> +	u64 scratch0, bit;
> +
> +	if (!dd->asic_data)
> +		return;
> +
> +	/* only dynamic resources should ever be cleared */
> +	if (!(resource & CR_DYN_MASK)) {
> +		dd_dev_err(dd, "%s: invalid resource 0x%x\n", __func__,
> +			   resource);
> +		return;
> +	}
> +	bit = resource_mask(dd->hfi2_id, resource);
> +
> +	/* lock against other callers within the driver wanting a resource */
> +	mutex_lock(&dd->asic_data->asic_resource_mutex);
> +
> +	if (acquire_hw_mutex(dd)) {
> +		fail_mutex_acquire_message(dd, __func__);
> +		goto done;
> +	}
> +
> +	scratch0 = read_csr(dd, ASIC_CFG_SCRATCH);
> +	if ((scratch0 & bit) != 0) {
> +		scratch0 &= ~bit;
> +		write_csr(dd, ASIC_CFG_SCRATCH, scratch0);
> +		/* force write to be visible to other HFI on another OS */
> +		(void)read_csr(dd, ASIC_CFG_SCRATCH);
> +	} else {
> +		dd_dev_warn(dd, "%s: id %d, resource 0x%x: bit not set\n",
> +			    __func__, dd->hfi2_id, resource);
> +	}
> +
> +	release_hw_mutex(dd);
> +
> +done:
> +	mutex_unlock(&dd->asic_data->asic_resource_mutex);
> +}
> +
> +/*
> + * Return true if resource is set, false otherwise.  Print a warning
> + * if not set and a function is supplied.
> + */
> +bool check_chip_resource(struct hfi2_devdata *dd, u32 resource,
> +			 const char *func)
> +{
> +	u64 scratch0, bit;
> +
> +	if (resource & CR_DYN_MASK)
> +		bit = resource_mask(dd->hfi2_id, resource);
> +	else
> +		bit = resource;
> +
> +	scratch0 = read_csr(dd, ASIC_CFG_SCRATCH);
> +	if ((scratch0 & bit) == 0) {
> +		if (func)
> +			dd_dev_warn(dd,
> +				    "%s: id %d, resource 0x%x, not acquired!\n",
> +				    func, dd->hfi2_id, resource);
> +		return false;
> +	}
> +	return true;
> +}
> +
> +static void clear_chip_resources(struct hfi2_devdata *dd, const char *func)
> +{
> +	u64 scratch0;
> +
> +	if (!dd->asic_data)
> +		return;
> +
> +	/* lock against other callers within the driver wanting a resource */
> +	mutex_lock(&dd->asic_data->asic_resource_mutex);
> +
> +	if (acquire_hw_mutex(dd)) {
> +		fail_mutex_acquire_message(dd, func);
> +		goto done;
> +	}
> +
> +	/* clear all dynamic access bits for this HFI */
> +	scratch0 = read_csr(dd, ASIC_CFG_SCRATCH);
> +	scratch0 &= ~resource_mask(dd->hfi2_id, CR_DYN_MASK);
> +	write_csr(dd, ASIC_CFG_SCRATCH, scratch0);
> +	/* force write to be visible to other HFI on another OS */
> +	(void)read_csr(dd, ASIC_CFG_SCRATCH);
> +
> +	release_hw_mutex(dd);
> +
> +done:
> +	mutex_unlock(&dd->asic_data->asic_resource_mutex);
> +}
> +
> +void init_chip_resources(struct hfi2_devdata *dd)
> +{
> +	/* clear any holds left by us */
> +	clear_chip_resources(dd, __func__);
> +}
> +
> +void finish_chip_resources(struct hfi2_devdata *dd)
> +{
> +	/* clear any holds left by us */
> +	clear_chip_resources(dd, __func__);
> +}
> +
> +void set_sbus_fast_mode(struct hfi2_devdata *dd)
> +{
> +	write_csr(dd, ASIC_CFG_SBUS_EXECUTE,
> +		  ASIC_CFG_SBUS_EXECUTE_FAST_MODE_SMASK);
> +}
> +
> +void clear_sbus_fast_mode(struct hfi2_devdata *dd)
> +{
> +	u64 reg, count = 0;
> +
> +	reg = read_csr(dd, ASIC_STS_SBUS_COUNTERS);
> +	while (SBUS_COUNTER(reg, EXECUTE) !=
> +	       SBUS_COUNTER(reg, RCV_DATA_VALID)) {
> +		if (count++ >= SBUS_MAX_POLL_COUNT)
> +			break;
> +		udelay(1);
> +		reg = read_csr(dd, ASIC_STS_SBUS_COUNTERS);
> +	}
> +	write_csr(dd, ASIC_CFG_SBUS_EXECUTE, 0);
> +}
> +
> +int load_firmware(struct hfi2_devdata *dd)
> +{
> +	int ret;
> +
> +	if (fw_fabric_serdes_load) {
> +		ret = acquire_chip_resource(dd, CR_SBUS, SBUS_TIMEOUT);
> +		if (ret)
> +			return ret;
> +
> +		set_sbus_fast_mode(dd);
> +
> +		set_serdes_broadcast(dd, all_fabric_serdes_broadcast,
> +				     fabric_serdes_broadcast[dd->hfi2_id],
> +				     fabric_serdes_addrs[dd->hfi2_id],
> +				     NUM_FABRIC_SERDES);
> +		turn_off_spicos(dd, SPICO_FABRIC);
> +		do {
> +			ret = load_fabric_serdes_firmware(dd, &fw_fabric);
> +		} while (retry_firmware(dd, ret));
> +
> +		clear_sbus_fast_mode(dd);
> +		release_chip_resource(dd, CR_SBUS);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	if (fw_8051_load) {
> +		do {
> +			ret = load_8051_firmware(dd, &fw_8051);
> +		} while (retry_firmware(dd, ret));
> +		if (ret)
> +			return ret;
> +	}
> +
> +	dump_fw_version(dd);
> +	return 0;
> +}
> +
> +int hfi2_firmware_init(struct hfi2_devdata *dd)
> +{
> +	/* only RTL can use these */
> +	if (dd->icode != ICODE_RTL_SILICON) {
> +		fw_fabric_serdes_load = 0;
> +		fw_pcie_serdes_load = 0;
> +		fw_sbus_load = 0;
> +	}
> +
> +	/* no 8051 or QSFP on simulator */
> +	if (dd->icode == ICODE_FUNCTIONAL_SIMULATOR) {
> +		u8 ver_major, ver_minor, ver_patch;
> +
> +		read_misc_status(dd, &ver_major, &ver_minor, &ver_patch);
> +		dd_dev_info(dd, "Simulated 8051 firmware version %d.%d.%d\n",
> +			    (int)ver_major, (int)ver_minor, (int)ver_patch);
> +		dd->dc8051_ver = dc8051_ver(ver_major, ver_minor, ver_patch);
> +		fw_8051_load = 0;
> +	}
> +
> +	if (!fw_8051_name) {
> +		if (dd->icode == ICODE_RTL_SILICON)
> +			fw_8051_name = DEFAULT_FW_8051_NAME_ASIC;
> +		else
> +			fw_8051_name = DEFAULT_FW_8051_NAME_FPGA;
> +	}
> +	if (!fw_fabric_serdes_name)
> +		fw_fabric_serdes_name = DEFAULT_FW_FABRIC_NAME;
> +	if (!fw_sbus_name)
> +		fw_sbus_name = DEFAULT_FW_SBUS_NAME;
> +	if (!fw_pcie_serdes_name)
> +		fw_pcie_serdes_name = DEFAULT_FW_PCIE_NAME;
> +
> +	return obtain_firmware(dd);
> +}
> +
> +/*
> + * This function is a helper function for parse_platform_config(...) and
> + * does not check for validity of the platform configuration cache
> + * (because we know it is invalid as we are building up the cache).
> + * As such, this should not be called from anywhere other than
> + * parse_platform_config
> + */
> +static int check_meta_version(struct hfi2_devdata *dd, u32 *system_table)
> +{
> +	u32 meta_ver, meta_ver_meta, ver_start, ver_len, mask;
> +	struct platform_config_cache *pcfgcache = &dd->pcfg_cache;
> +
> +	if (!system_table)
> +		return -EINVAL;
> +
> +	meta_ver_meta =
> +	*(pcfgcache->config_tables[PLATFORM_CONFIG_SYSTEM_TABLE].table_metadata
> +	+ SYSTEM_TABLE_META_VERSION);
> +
> +	mask = ((1 << METADATA_TABLE_FIELD_START_LEN_BITS) - 1);
> +	ver_start = meta_ver_meta & mask;
> +
> +	meta_ver_meta >>= METADATA_TABLE_FIELD_LEN_SHIFT;
> +
> +	mask = ((1 << METADATA_TABLE_FIELD_LEN_LEN_BITS) - 1);
> +	ver_len = meta_ver_meta & mask;
> +
> +	ver_start /= 8;
> +	meta_ver = *((u8 *)system_table + ver_start) & ((1 << ver_len) - 1);
> +
> +	if (meta_ver < 4) {
> +		dd_dev_info(
> +			dd, "%s:Please update platform config\n", __func__);
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +int parse_platform_config(struct hfi2_pportdata *ppd)
> +{
> +	struct hfi2_devdata *dd = ppd->dd;
> +	struct platform_config_cache *pcfgcache = &dd->pcfg_cache;
> +	u32 *ptr = NULL;
> +	u32 header1 = 0, header2 = 0, magic_num = 0, crc = 0, file_length = 0;
> +	u32 record_idx = 0, table_type = 0, table_length_dwords = 0;
> +	int ret = -EINVAL; /* assume failure */
> +
> +	/*
> +	 * For integrated devices that did not fall back to the default file,
> +	 * the SI tuning information for active channels is acquired from the
> +	 * scratch register bitmap, thus there is no platform config to parse.
> +	 * Skip parsing in these situations.
> +	 */
> +	if (ppd->config_from_scratch)
> +		return 0;
> +
> +	if (!dd->platform_config.data) {
> +		dd_dev_err(dd, "%s: Missing config file\n", __func__);
> +		ret = -EINVAL;
> +		goto bail;
> +	}
> +	ptr = (u32 *)dd->platform_config.data;
> +
> +	magic_num = *ptr;
> +	ptr++;
> +	if (magic_num != PLATFORM_CONFIG_MAGIC_NUM) {
> +		dd_dev_err(dd, "%s: Bad config file\n", __func__);
> +		ret = -EINVAL;
> +		goto bail;
> +	}
> +
> +	/* Field is file size in DWORDs */
> +	file_length = (*ptr) * 4;
> +
> +	/*
> +	 * Length can't be larger than partition size. Assume platform
> +	 * config format version 4 is being used. Interpret the file size
> +	 * field as header instead by not moving the pointer.
> +	 */
> +	if (file_length > MAX_PLATFORM_CONFIG_FILE_SIZE) {
> +		dd_dev_info(dd,
> +			    "%s:File length out of bounds, using alternative format\n",
> +			    __func__);
> +		file_length = PLATFORM_CONFIG_FORMAT_4_FILE_SIZE;
> +	} else {
> +		ptr++;
> +	}
> +
> +	if (file_length > dd->platform_config.size) {
> +		dd_dev_info(dd, "%s:File claims to be larger than read size\n",
> +			    __func__);
> +		ret = -EINVAL;
> +		goto bail;
> +	} else if (file_length < dd->platform_config.size) {
> +		dd_dev_info(dd,
> +			    "%s:File claims to be smaller than read size, continuing\n",
> +			    __func__);
> +	}
> +	/* exactly equal, perfection */
> +
> +	/*
> +	 * In both cases where we proceed, using the self-reported file length
> +	 * is the safer option. In case of old format a predefined value is
> +	 * being used.
> +	 */
> +	while (ptr < (u32 *)(dd->platform_config.data + file_length)) {
> +		header1 = *ptr;
> +		header2 = *(ptr + 1);
> +		if (header1 != ~header2) {
> +			dd_dev_err(dd, "%s: Failed validation at offset %ld\n",
> +				   __func__, (ptr - (u32 *)
> +					      dd->platform_config.data));
> +			ret = -EINVAL;
> +			goto bail;
> +		}
> +
> +		record_idx = *ptr &
> +			((1 << PLATFORM_CONFIG_HEADER_RECORD_IDX_LEN_BITS) - 1);
> +
> +		table_length_dwords = (*ptr >>
> +				PLATFORM_CONFIG_HEADER_TABLE_LENGTH_SHIFT) &
> +		      ((1 << PLATFORM_CONFIG_HEADER_TABLE_LENGTH_LEN_BITS) - 1);
> +
> +		table_type = (*ptr >> PLATFORM_CONFIG_HEADER_TABLE_TYPE_SHIFT) &
> +			((1 << PLATFORM_CONFIG_HEADER_TABLE_TYPE_LEN_BITS) - 1);
> +
> +		/* Done with this set of headers */
> +		ptr += 2;
> +
> +		if (record_idx) {
> +			/* data table */
> +			switch (table_type) {
> +			case PLATFORM_CONFIG_SYSTEM_TABLE:
> +				pcfgcache->config_tables[table_type].num_table =
> +									1;
> +				ret = check_meta_version(dd, ptr);
> +				if (ret)
> +					goto bail;
> +				break;
> +			case PLATFORM_CONFIG_PORT_TABLE:
> +				pcfgcache->config_tables[table_type].num_table =
> +									2;
> +				break;
> +			case PLATFORM_CONFIG_RX_PRESET_TABLE:
> +			case PLATFORM_CONFIG_TX_PRESET_TABLE:
> +			case PLATFORM_CONFIG_QSFP_ATTEN_TABLE:
> +			case PLATFORM_CONFIG_VARIABLE_SETTINGS_TABLE:
> +				pcfgcache->config_tables[table_type].num_table =
> +							table_length_dwords;
> +				break;
> +			default:
> +				dd_dev_err(dd,
> +					   "%s: Unknown data table %d, offset %ld\n",
> +					   __func__, table_type,
> +					   (ptr - (u32 *)
> +					    dd->platform_config.data));
> +				ret = -EINVAL;
> +				goto bail; /* We don't trust this file now */
> +			}
> +			pcfgcache->config_tables[table_type].table = ptr;
> +		} else {
> +			/* metadata table */
> +			switch (table_type) {
> +			case PLATFORM_CONFIG_SYSTEM_TABLE:
> +			case PLATFORM_CONFIG_PORT_TABLE:
> +			case PLATFORM_CONFIG_RX_PRESET_TABLE:
> +			case PLATFORM_CONFIG_TX_PRESET_TABLE:
> +			case PLATFORM_CONFIG_QSFP_ATTEN_TABLE:
> +			case PLATFORM_CONFIG_VARIABLE_SETTINGS_TABLE:
> +				break;
> +			default:
> +				dd_dev_err(dd,
> +					   "%s: Unknown meta table %d, offset %ld\n",
> +					   __func__, table_type,
> +					   (ptr -
> +					    (u32 *)dd->platform_config.data));
> +				ret = -EINVAL;
> +				goto bail; /* We don't trust this file now */
> +			}
> +			pcfgcache->config_tables[table_type].table_metadata =
> +									ptr;
> +		}
> +
> +		/* Calculate and check table crc */
> +		crc = crc32_le(~(u32)0, (unsigned char const *)ptr,
> +			       (table_length_dwords * 4));
> +		crc ^= ~(u32)0;
> +
> +		/* Jump the table */
> +		ptr += table_length_dwords;
> +		if (crc != *ptr) {
> +			dd_dev_err(dd, "%s: Failed CRC check at offset %ld\n",
> +				   __func__, (ptr -
> +				   (u32 *)dd->platform_config.data));
> +			ret = -EINVAL;
> +			goto bail;
> +		}
> +		/* Jump the CRC DWORD */
> +		ptr++;
> +	}
> +
> +	pcfgcache->cache_valid = 1;
> +	return 0;
> +bail:
> +	memset(pcfgcache, 0, sizeof(struct platform_config_cache));
> +	return ret;
> +}
> +
> +static void get_integrated_platform_config_field(
> +		struct hfi2_pportdata *ppd,
> +		enum platform_config_table_type_encoding table_type,
> +		int field_index, u32 *data)
> +{
> +	u8 *cache = ppd->qsfp_info.cache;
> +	u32 tx_preset = 0;
> +
> +	switch (table_type) {
> +	case PLATFORM_CONFIG_SYSTEM_TABLE:
> +		if (field_index == SYSTEM_TABLE_QSFP_POWER_CLASS_MAX)
> +			*data = ppd->max_power_class;
> +		else if (field_index == SYSTEM_TABLE_QSFP_ATTENUATION_DEFAULT_25G)
> +			*data = ppd->default_atten;
> +		break;
> +	case PLATFORM_CONFIG_PORT_TABLE:
> +		if (field_index == PORT_TABLE_PORT_TYPE)
> +			*data = ppd->port_type;
> +		else if (field_index == PORT_TABLE_LOCAL_ATTEN_25G)
> +			*data = ppd->local_atten;
> +		else if (field_index == PORT_TABLE_REMOTE_ATTEN_25G)
> +			*data = ppd->remote_atten;
> +		break;
> +	case PLATFORM_CONFIG_RX_PRESET_TABLE:
> +		if (field_index == RX_PRESET_TABLE_QSFP_RX_CDR_APPLY)
> +			*data = (ppd->rx_preset & QSFP_RX_CDR_APPLY_SMASK) >>
> +				QSFP_RX_CDR_APPLY_SHIFT;
> +		else if (field_index == RX_PRESET_TABLE_QSFP_RX_EMP_APPLY)
> +			*data = (ppd->rx_preset & QSFP_RX_EMP_APPLY_SMASK) >>
> +				QSFP_RX_EMP_APPLY_SHIFT;
> +		else if (field_index == RX_PRESET_TABLE_QSFP_RX_AMP_APPLY)
> +			*data = (ppd->rx_preset & QSFP_RX_AMP_APPLY_SMASK) >>
> +				QSFP_RX_AMP_APPLY_SHIFT;
> +		else if (field_index == RX_PRESET_TABLE_QSFP_RX_CDR)
> +			*data = (ppd->rx_preset & QSFP_RX_CDR_SMASK) >>
> +				QSFP_RX_CDR_SHIFT;
> +		else if (field_index == RX_PRESET_TABLE_QSFP_RX_EMP)
> +			*data = (ppd->rx_preset & QSFP_RX_EMP_SMASK) >>
> +				QSFP_RX_EMP_SHIFT;
> +		else if (field_index == RX_PRESET_TABLE_QSFP_RX_AMP)
> +			*data = (ppd->rx_preset & QSFP_RX_AMP_SMASK) >>
> +				QSFP_RX_AMP_SHIFT;
> +		break;
> +	case PLATFORM_CONFIG_TX_PRESET_TABLE:
> +		if (cache[QSFP_EQ_INFO_OFFS] & 0x4)
> +			tx_preset = ppd->tx_preset_eq;
> +		else
> +			tx_preset = ppd->tx_preset_noeq;
> +		if (field_index == TX_PRESET_TABLE_PRECUR)
> +			*data = (tx_preset & TX_PRECUR_SMASK) >>
> +				TX_PRECUR_SHIFT;
> +		else if (field_index == TX_PRESET_TABLE_ATTN)
> +			*data = (tx_preset & TX_ATTN_SMASK) >>
> +				TX_ATTN_SHIFT;
> +		else if (field_index == TX_PRESET_TABLE_POSTCUR)
> +			*data = (tx_preset & TX_POSTCUR_SMASK) >>
> +				TX_POSTCUR_SHIFT;
> +		else if (field_index == TX_PRESET_TABLE_QSFP_TX_CDR_APPLY)
> +			*data = (tx_preset & QSFP_TX_CDR_APPLY_SMASK) >>
> +				QSFP_TX_CDR_APPLY_SHIFT;
> +		else if (field_index == TX_PRESET_TABLE_QSFP_TX_EQ_APPLY)
> +			*data = (tx_preset & QSFP_TX_EQ_APPLY_SMASK) >>
> +				QSFP_TX_EQ_APPLY_SHIFT;
> +		else if (field_index == TX_PRESET_TABLE_QSFP_TX_CDR)
> +			*data = (tx_preset & QSFP_TX_CDR_SMASK) >>
> +				QSFP_TX_CDR_SHIFT;
> +		else if (field_index == TX_PRESET_TABLE_QSFP_TX_EQ)
> +			*data = (tx_preset & QSFP_TX_EQ_SMASK) >>
> +				QSFP_TX_EQ_SHIFT;
> +		break;
> +	case PLATFORM_CONFIG_QSFP_ATTEN_TABLE:
> +	case PLATFORM_CONFIG_VARIABLE_SETTINGS_TABLE:
> +	default:
> +		break;
> +	}
> +}
> +
> +static int get_platform_fw_field_metadata(struct hfi2_devdata *dd, int table,
> +					  int field, u32 *field_len_bits,
> +					  u32 *field_start_bits)
> +{
> +	struct platform_config_cache *pcfgcache = &dd->pcfg_cache;
> +	u32 *src_ptr = NULL;
> +
> +	if (!pcfgcache->cache_valid)
> +		return -EINVAL;
> +
> +	switch (table) {
> +	case PLATFORM_CONFIG_SYSTEM_TABLE:
> +	case PLATFORM_CONFIG_PORT_TABLE:
> +	case PLATFORM_CONFIG_RX_PRESET_TABLE:
> +	case PLATFORM_CONFIG_TX_PRESET_TABLE:
> +	case PLATFORM_CONFIG_QSFP_ATTEN_TABLE:
> +	case PLATFORM_CONFIG_VARIABLE_SETTINGS_TABLE:
> +		if (field && field < platform_config_table_limits[table])
> +			src_ptr =
> +			pcfgcache->config_tables[table].table_metadata + field;
> +		break;
> +	default:
> +		dd_dev_info(dd, "%s: Unknown table\n", __func__);
> +		break;
> +	}
> +
> +	if (!src_ptr)
> +		return -EINVAL;
> +
> +	if (field_start_bits)
> +		*field_start_bits = *src_ptr &
> +		      ((1 << METADATA_TABLE_FIELD_START_LEN_BITS) - 1);
> +
> +	if (field_len_bits)
> +		*field_len_bits = (*src_ptr >> METADATA_TABLE_FIELD_LEN_SHIFT)
> +		       & ((1 << METADATA_TABLE_FIELD_LEN_LEN_BITS) - 1);
> +
> +	return 0;
> +}
> +
> +/* This is the central interface to getting data out of the platform config
> + * file. It depends on parse_platform_config() having populated the
> + * platform_config_cache in hfi2_devdata, and checks the cache_valid member to
> + * validate the sanity of the cache.
> + *
> + * The non-obvious parameters:
> + * @table_index: Acts as a look up key into which instance of the tables the
> + * relevant field is fetched from.
> + *
> + * This applies to the data tables that have multiple instances. The port table
> + * is an exception to this rule as each HFI only has one port and thus the
> + * relevant table can be distinguished by hfi_id.
> + *
> + * @data: pointer to memory that will be populated with the field requested.
> + * @len: length of memory pointed by @data in bytes.
> + */
> +int get_platform_config_field(struct hfi2_pportdata *ppd,
> +			      enum platform_config_table_type_encoding
> +			      table_type, int table_index, int field_index,
> +			      u32 *data, u32 len)
> +{
> +	struct hfi2_devdata *dd = ppd->dd;
> +	int ret = 0, wlen = 0, seek = 0;
> +	u32 field_len_bits = 0, field_start_bits = 0, *src_ptr = NULL;
> +	struct platform_config_cache *pcfgcache = &dd->pcfg_cache;
> +
> +	if (data)
> +		memset(data, 0, len);
> +	else
> +		return -EINVAL;
> +
> +	if (ppd->config_from_scratch) {
> +		/*
> +		 * Use saved configuration from ppd for integrated platforms
> +		 */
> +		get_integrated_platform_config_field(ppd, table_type,
> +						     field_index, data);
> +		return 0;
> +	}
> +
> +	ret = get_platform_fw_field_metadata(dd, table_type, field_index,
> +					     &field_len_bits,
> +					     &field_start_bits);
> +	if (ret)
> +		return -EINVAL;
> +
> +	/* Convert length to bits */
> +	len *= 8;
> +
> +	/* Our metadata function checked cache_valid and field_index for us */
> +	switch (table_type) {
> +	case PLATFORM_CONFIG_SYSTEM_TABLE:
> +		src_ptr = pcfgcache->config_tables[table_type].table;
> +
> +		if (field_index != SYSTEM_TABLE_QSFP_POWER_CLASS_MAX) {
> +			if (len < field_len_bits)
> +				return -EINVAL;
> +
> +			seek = field_start_bits / 8;
> +			wlen = field_len_bits / 8;
> +
> +			src_ptr = (u32 *)((u8 *)src_ptr + seek);
> +
> +			/*
> +			 * We expect the field to be byte aligned and whole byte
> +			 * lengths if we are here
> +			 */
> +			memcpy(data, src_ptr, wlen);
> +			return 0;
> +		}
> +		break;
> +	case PLATFORM_CONFIG_PORT_TABLE:
> +		/* Port table is 4 DWORDS */
> +		src_ptr = dd->hfi2_id ?
> +			pcfgcache->config_tables[table_type].table + 4 :
> +			pcfgcache->config_tables[table_type].table;
> +		break;
> +	case PLATFORM_CONFIG_RX_PRESET_TABLE:
> +	case PLATFORM_CONFIG_TX_PRESET_TABLE:
> +	case PLATFORM_CONFIG_QSFP_ATTEN_TABLE:
> +	case PLATFORM_CONFIG_VARIABLE_SETTINGS_TABLE:
> +		src_ptr = pcfgcache->config_tables[table_type].table;
> +
> +		if (table_index <
> +			pcfgcache->config_tables[table_type].num_table)
> +			src_ptr += table_index;
> +		else
> +			src_ptr = NULL;
> +		break;
> +	default:
> +		dd_dev_info(dd, "%s: Unknown table\n", __func__);
> +		break;
> +	}
> +
> +	if (!src_ptr || len < field_len_bits)
> +		return -EINVAL;
> +
> +	src_ptr += (field_start_bits / 32);
> +	*data = (*src_ptr >> (field_start_bits % 32)) &
> +			((1 << field_len_bits) - 1);
> +
> +	return 0;
> +}
> +
> +/*
> + * Download the firmware needed for the Gen3 PCIe SerDes.  An update
> + * to the SBus firmware is needed before updating the PCIe firmware.
> + *
> + * Note: caller must be holding the SBus resource.
> + */
> +int load_pcie_firmware(struct hfi2_devdata *dd)
> +{
> +	int ret = 0;
> +
> +	/* both firmware loads below use the SBus */
> +	set_sbus_fast_mode(dd);
> +
> +	if (fw_sbus_load) {
> +		turn_off_spicos(dd, SPICO_SBUS);
> +		do {
> +			ret = load_sbus_firmware(dd, &fw_sbus);
> +		} while (retry_firmware(dd, ret));
> +		if (ret)
> +			goto done;
> +	}
> +
> +	if (fw_pcie_serdes_load) {
> +		dd_dev_info(dd, "Setting PCIe SerDes broadcast\n");
> +		set_serdes_broadcast(dd, all_pcie_serdes_broadcast,
> +				     pcie_serdes_broadcast[dd->hfi2_id],
> +				     pcie_serdes_addrs[dd->hfi2_id],
> +				     NUM_PCIE_SERDES);
> +		do {
> +			ret = load_pcie_serdes_firmware(dd, &fw_pcie);
> +		} while (retry_firmware(dd, ret));
> +		if (ret)
> +			goto done;
> +	}
> +
> +done:
> +	clear_sbus_fast_mode(dd);
> +
> +	return ret;
> +}
> +
> +/*
> + * Read the GUID from the hardware, store it in dd.
> + */
> +void read_guid(struct hfi2_devdata *dd)
> +{
> +	/* Take the DC out of reset to get a valid GUID value */
> +	write_csr(dd, CCE_DC_CTRL, 0);
> +	(void)read_csr(dd, CCE_DC_CTRL);
> +
> +	dd->base_guid = read_csr(dd, DC_DC8051_CFG_LOCAL_GUID);
> +}
> +
> +/* read and display firmware version info */
> +static void dump_fw_version(struct hfi2_devdata *dd)
> +{
> +	u32 pcie_vers[NUM_PCIE_SERDES];
> +	u32 fabric_vers[NUM_FABRIC_SERDES];
> +	u32 sbus_vers;
> +	int i;
> +	int all_same;
> +	int ret;
> +	u8 rcv_addr;
> +
> +	/* no firmware or sbus in simulation, skip */
> +	if (dd->icode == ICODE_FUNCTIONAL_SIMULATOR)
> +		return;
> +
> +	ret = acquire_chip_resource(dd, CR_SBUS, SBUS_TIMEOUT);
> +	if (ret) {
> +		dd_dev_err(dd, "Unable to acquire SBus to read firmware versions\n");
> +		return;
> +	}
> +
> +	/* set fast mode */
> +	set_sbus_fast_mode(dd);
> +
> +	/* read version for SBus Master */
> +	sbus_request(dd, SBUS_MASTER_BROADCAST, 0x02, WRITE_SBUS_RECEIVER, 0);
> +	sbus_request(dd, SBUS_MASTER_BROADCAST, 0x07, WRITE_SBUS_RECEIVER, 0x1);
> +	/* wait for interrupt to be processed */
> +	usleep_range(10000, 11000);
> +	sbus_vers = sbus_read(dd, SBUS_MASTER_BROADCAST, 0x08, 0x1);
> +	dd_dev_info(dd, "SBus Master firmware version 0x%08x\n", sbus_vers);
> +
> +	/* read version for PCIe SerDes */
> +	all_same = 1;
> +	pcie_vers[0] = 0;
> +	for (i = 0; i < NUM_PCIE_SERDES; i++) {
> +		rcv_addr = pcie_serdes_addrs[dd->hfi2_id][i];
> +		sbus_request(dd, rcv_addr, 0x03, WRITE_SBUS_RECEIVER, 0);
> +		/* wait for interrupt to be processed */
> +		usleep_range(10000, 11000);
> +		pcie_vers[i] = sbus_read(dd, rcv_addr, 0x04, 0x0);
> +		if (i > 0 && pcie_vers[0] != pcie_vers[i])
> +			all_same = 0;
> +	}
> +
> +	if (all_same) {
> +		dd_dev_info(dd, "PCIe SerDes firmware version 0x%x\n",
> +			    pcie_vers[0]);
> +	} else {
> +		dd_dev_warn(dd, "PCIe SerDes do not have the same firmware version\n");
> +		for (i = 0; i < NUM_PCIE_SERDES; i++) {
> +			dd_dev_info(dd,
> +				    "PCIe SerDes lane %d firmware version 0x%x\n",
> +				    i, pcie_vers[i]);
> +		}
> +	}
> +
> +	/* read version for fabric SerDes */
> +	all_same = 1;
> +	fabric_vers[0] = 0;
> +	for (i = 0; i < NUM_FABRIC_SERDES; i++) {
> +		rcv_addr = fabric_serdes_addrs[dd->hfi2_id][i];
> +		sbus_request(dd, rcv_addr, 0x03, WRITE_SBUS_RECEIVER, 0);
> +		/* wait for interrupt to be processed */
> +		usleep_range(10000, 11000);
> +		fabric_vers[i] = sbus_read(dd, rcv_addr, 0x04, 0x0);
> +		if (i > 0 && fabric_vers[0] != fabric_vers[i])
> +			all_same = 0;
> +	}
> +
> +	if (all_same) {
> +		dd_dev_info(dd, "Fabric SerDes firmware version 0x%x\n",
> +			    fabric_vers[0]);
> +	} else {
> +		dd_dev_warn(dd, "Fabric SerDes do not have the same firmware version\n");
> +		for (i = 0; i < NUM_FABRIC_SERDES; i++) {
> +			dd_dev_info(dd,
> +				    "Fabric SerDes lane %d firmware version 0x%x\n",
> +				    i, fabric_vers[i]);
> +		}
> +	}
> +
> +	clear_sbus_fast_mode(dd);
> +	release_chip_resource(dd, CR_SBUS);
> +}
> diff --git a/drivers/infiniband/hw/hfi2/init.c b/drivers/infiniband/hw/hfi2/init.c
> new file mode 100644
> index 000000000000..70145b643d31
> --- /dev/null
> +++ b/drivers/infiniband/hw/hfi2/init.c
> @@ -0,0 +1,2931 @@
> +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
> +/*
> + * Copyright(c) 2015 - 2020 Intel Corporation.
> + * Copyright(c) 2025-2026 Cornelis Networks, Inc.
> + */
> +
> +#include <linux/pci.h>
> +#include <linux/netdevice.h>
> +#include <linux/vmalloc.h>
> +#include <linux/delay.h>
> +#include <linux/xarray.h>
> +#include <linux/module.h>
> +#include <linux/printk.h>
> +#include <linux/hrtimer.h>
> +#include <linux/bitmap.h>
> +#include <linux/numa.h>
> +#include <rdma/rdma_vt.h>
> +
> +#include "hfi2.h"
> +#include "file_ops.h"
> +#include "common.h"
> +#include "trace.h"
> +#include "mad.h"
> +#include "sdma.h"
> +#include "debugfs.h"
> +#include "verbs.h"
> +#include "aspm.h"
> +#include "affinity.h"
> +#include "exp_rcv.h"
> +#include "netdev.h"
> +#include "chip_jkr.h"
> +#include "chip_gen.h"
> +#include "pinning.h"
> +#include "cport_traps.h"
> +#include "sriov.h"
> +#include "vf2pf.h"
> +
> +#undef pr_fmt
> +#define pr_fmt(fmt) DRIVER_NAME ": " fmt
> +
> +#undef CPORT_TRAP_DEBUG	/* all MCTXT TRAP events from CPORT */
> +#define PDEV_SRIOV_DEBUG
> +
> +/*
> + * min buffers we want to have per context, after driver
> + */
> +#define HFI2_MIN_USER_CTXT_BUFCNT 7
> +
> +#define HFI2_MIN_EAGER_BUFFER_SIZE (4 * 1024) /* 4KB */
> +#define HFI2_MAX_EAGER_BUFFER_SIZE (256 * 1024) /* 256KB */
> +
> +static void wfr_start_port(struct hfi2_pportdata *ppd);
> +static void wfr_stop_port(struct hfi2_pportdata *ppd);
> +static void destroy_workqueues(struct hfi2_devdata *dd);
> +
> +/* parameters for the WFR ASIC */
> +static const struct chip_params wfr_params = {
> +	.chip_type = CHIP_WFR,
> +	.num_ports = 1,
> +	.dma_mask_bits = 48,
> +
> +	/* BAR0 map: rcv array splits kreg1 and kreg2 */
> +	.bar0_size = TXE_PIO_SEND + TXE_PIO_SIZE,
> +	.kreg1_size = RCV_ARRAY,
> +	.kreg2_offset = RCV_ARRAY + RCV_ARRAY_SIZE,
> +	.kreg2_size = TXE_PIO_SEND - (RCV_ARRAY + RCV_ARRAY_SIZE),
> +	.rcv_array_offset = RCV_ARRAY,
> +	.rcv_array_size = RCV_ARRAY_SIZE,
> +
> +	.link_speed_supported = OPA_LINK_SPEED_25G,
> +	.link_speed_active = OPA_LINK_SPEED_25G,
> +	.asic_cclock_ps = ASIC_CCLOCK_PS,
> +	.rsm_rule_size = WFR_RXE_NUM_RSM_INSTANCES,
> +	.rsm_rule_offset_shift = WFR_RCV_RSM_CFG_OFFSET_SHIFT,
> +	.rsm_map_table_entries = 256,
> +	.rsm_map_table_entries_per_csr = 8,
> +	.rsm_map_table_entry_mask = 0xff,
> +	.rsm_map_table_entry_shift = 8,
> +	.qp_map_table_entries = 256,
> +	.qp_map_table_entries_per_csr = 8,
> +	.qp_map_table_entry_mask = 0xff,
> +	.qp_map_table_entry_shift = 8,
> +	.pkey_table_size = WFR_MAX_PKEY_VALUES,
> +	.generic_boardname = "Cornelis Omni-Path Host Fabric Interface Adapter 100 Series",
> +	.max_eager_entries = WFR_MAX_EAGER_ENTRIES,
> +	.pio_base_bits = WFR_PIO_BASE_BITS,
> +	.pio_base_shift = WFR_SEND_CTXT_CTRL_CTXT_BASE_SHIFT,
> +	.egress_err_info_data = &wfr_egress_err_info_data,
> +	.send_ctrl_flush = 0, /* no flush flag available */
> +	.port_discard_egress_errs = WFR_PORT_DISCARD_EGRESS_ERRS,
> +
> +	/* interrupt sources */
> +	.num_int_csrs = WFR_CCE_NUM_INT_CSRS,
> +	.num_int_map_csrs = WFR_CCE_NUM_INT_MAP_CSRS,
> +	.is_rcvavail_start = IS_RCVAVAIL_START,
> +	.is_rcvurgent_start = IS_RCVURGENT_START,
> +	.is_sdmaeng_err_start = IS_SDMAENG_ERR_START,
> +	.is_sdma_idle_start = IS_SDMA_IDLE_START,
> +	.is_sdma_progress_start = IS_SDMA_PROGRESS_START,
> +	.is_sdma_start = IS_SDMA_START,
> +	.is_last_source = IS_LAST_SOURCE,
> +	.is_table = is_table,
> +	.gi_enable_table = wfr_gi_enable_table,
> +
> +	/* cce_interrupt registers */
> +	.cce_int_status_reg = WFR_CCE_INT_STATUS,
> +	.cce_int_mask_reg = WFR_CCE_INT_MASK,
> +	.cce_int_clear_reg = WFR_CCE_INT_CLEAR,
> +	.cce_int_force_reg = WFR_CCE_INT_FORCE,
> +	.cce_int_blocked_reg = WFR_CCE_INT_BLOCKED,
> +
> +	/* counters */
> +	.chip_dev_cntrs = wfr_dev_cntrs,
> +	.chip_dev_cntr_first = WFR_DEV_CNTR_FIRST,
> +	.chip_num_dev_cntrs = WFR_NUM_DEV_CNTRS,
> +	.chip_port_cntrs = wfr_port_cntrs,
> +	.chip_port_cntr_first = WFR_PORT_CNTR_FIRST,
> +	.chip_num_port_cntrs = WFR_NUM_PORT_CNTRS,
> +
> +	/* ingress port registers */
> +	.rxe_iport_stride = 0,
> +	.rcv_iport_ctrl_reg = WFR_RCV_CTRL,
> +	.rcv_iport_status_reg = WFR_RCV_STATUS,
> +	.rcv_bth_qp_reg = WFR_RCV_BTH_QP,
> +	.rcv_multicast_reg = WFR_RCV_MULTICAST,
> +	.rcv_bypass_reg = WFR_RCV_BYPASS,
> +	.rcv_vl15_reg = WFR_RCV_VL15,
> +	.rcv_err_info_reg = WFR_RCV_ERR_INFO,
> +	.rcv_err_status_reg = WFR_RCV_ERR_STATUS,
> +	.rcv_err_mask_reg = WFR_RCV_ERR_MASK,
> +	.rcv_err_clear_reg = WFR_RCV_ERR_CLEAR,
> +	.rcv_qp_map_table_reg = WFR_RCV_QP_MAP_TABLE,
> +	.rcv_partition_key_reg = WFR_RCV_PARTITION_KEY,
> +	.rcv_counter_array32_reg = WFR_RCV_COUNTER_ARRAY32,
> +	.rcv_counter_array64_reg = WFR_RCV_COUNTER_ARRAY64,
> +
> +	/* ingress port receive context registers */
> +	.rxe_iprc_stride = WFR_RXE_IPRC_STRIDE,
> +	.rcv_jkey_ctrl_reg = WFR_RCV_KEY_CTRL,
> +
> +	/* RXE restricted context registers */
> +	.rxe_rctxt_stride = WFR_RXE_RCTXT_STRIDE,
> +	.rcv_rctxt_ctrl_reg = WFR_RCV_CTXT_CTRL,
> +	.rcv_egr_ctrl_reg = WFR_RCV_EGR_CTRL,
> +	.rcv_tid_ctrl_reg = WFR_RCV_TID_CTRL,
> +
> +	/* RXE kernel context registers */
> +	.rxe_kctxt_stride = WFR_RXE_KCTXT_STRIDE,
> +	.rcv_kctxt_ctrl_reg = WFR_RCV_CTXT_CTRL,
> +	.rcv_hdr_addr_reg = WFR_RCV_HDR_ADDR,
> +	.rcv_hdr_cnt_reg = WFR_RCV_HDR_CNT,
> +	.rcv_hdr_ent_size_reg = WFR_RCV_HDR_ENT_SIZE,
> +	.rcv_hdr_tail_addr_reg = WFR_RCV_HDR_TAIL_ADDR,
> +	.rcv_avail_time_out_reg = WFR_RCV_AVAIL_TIME_OUT,
> +	.rcv_hdr_ovfl_cnt_reg = WFR_RCV_HDR_OVFL_CNT,
> +
> +	/* RXE kernel/user registers */
> +	.rxe_ku_stride = WFR_RXE_KCTXT_STRIDE,
> +	.rcv_ctxt_status_reg = WFR_RCV_CTXT_STATUS,
> +
> +	/* RXE user registers */
> +	.rxe_uctxt_stride = WFR_RXE_UCTXT_STRIDE,
> +	.rcv_hdr_tail_reg = WFR_RCV_HDR_TAIL,
> +	.rcv_hdr_head_reg = WFR_RCV_HDR_HEAD,
> +	.rcv_egr_index_head_reg = WFR_RCV_EGR_INDEX_HEAD,
> +	.rcv_tid_flow_table_reg = WFR_RCV_TID_FLOW_TABLE,
> +
> +	/* RXE RSM registers */
> +	.rcv_rsm_cfg_reg = WFR_RCV_RSM_CFG,
> +	.rcv_rsm_select_reg = WFR_RCV_RSM_SELECT,
> +	.rcv_rsm_match_reg = WFR_RCV_RSM_MATCH,
> +	.rcv_rsm_map_table_reg = WFR_RCV_RSM_MAP_TABLE,
> +
> +	/* TXE kernel registers */
> +	.send_contexts_reg = SEND_CONTEXTS,
> +	.send_dma_engines_reg = WFR_SEND_DMA_ENGINES,
> +	.send_pio_mem_size_reg = WFR_SEND_PIO_MEM_SIZE,
> +	.send_dma_mem_size_reg = WFR_SEND_DMA_MEM_SIZE,
> +	.send_pio_init_ctxt_reg = WFR_SEND_PIO_INIT_CTXT,
> +
> +	/* send context_registers */
> +	.txe_sctxt_stride = WFR_TXE_SCTXT_STRIDE,
> +	.send_ctxt_status_reg = WFR_SEND_CTXT_STATUS,
> +	.send_ctxt_credit_ctrl_reg = WFR_SEND_CTXT_CREDIT_CTRL,
> +	.send_ctxt_credit_status_reg = WFR_SEND_CTXT_CREDIT_STATUS,
> +	.send_ctxt_credit_return_addr_reg = WFR_SEND_CTXT_CREDIT_RETURN_ADDR,
> +	.send_ctxt_credit_force_reg = WFR_SEND_CTXT_CREDIT_FORCE,
> +	.send_ctxt_err_status_reg = WFR_SEND_CTXT_ERR_STATUS,
> +	.send_ctxt_err_mask_reg = WFR_SEND_CTXT_ERR_MASK,
> +	.send_ctxt_err_clear_reg = WFR_SEND_CTXT_ERR_CLEAR,
> +
> +	/* TXE send context registers */
> +	.txe_tctxt_stride = WFR_TXE_TCTXT_STRIDE,
> +	.send_ctxt_ctrl_reg = WFR_SEND_CTXT_CTRL,
> +
> +	/* SDMA registers */
> +	.txe_sdma_stride = WFR_TXE_SDMA_STRIDE,
> +	.send_dma_ctrl_reg = WFR_SEND_DMA_CTRL,
> +	.send_dma_status_reg = WFR_SEND_DMA_STATUS,
> +	.send_dma_base_addr_reg = WFR_SEND_DMA_BASE_ADDR,
> +	.send_dma_len_gen_reg = WFR_SEND_DMA_LEN_GEN,
> +	.send_dma_tail_reg = WFR_SEND_DMA_TAIL,
> +	.send_dma_head_reg = WFR_SEND_DMA_HEAD,
> +	.send_dma_head_addr_reg = WFR_SEND_DMA_HEAD_ADDR,
> +	.send_dma_priority_thld_reg = WFR_SEND_DMA_PRIORITY_THLD,
> +	.send_dma_idle_cnt_reg = WFR_SEND_DMA_IDLE_CNT,
> +	.send_dma_reload_cnt_reg = WFR_SEND_DMA_RELOAD_CNT,
> +	.send_dma_desc_cnt_reg = WFR_SEND_DMA_DESC_CNT,
> +	.send_dma_desc_fetched_cnt_reg = WFR_SEND_DMA_DESC_FETCHED_CNT,
> +	.send_dma_eng_err_status_reg = WFR_SEND_DMA_ENG_ERR_STATUS,
> +	.send_dma_eng_err_mask_reg = WFR_SEND_DMA_ENG_ERR_MASK,
> +	.send_dma_eng_err_clear_reg = WFR_SEND_DMA_ENG_ERR_CLEAR,
> +
> +	/* SDMA Config registers */
> +	.txe_sdmacfg_stride = WFR_TXE_SDMACFG_STRIDE,
> +	.send_dma_cfg_memory_reg = WFR_SEND_DMA_MEMORY,
> +
> +	/* egress port registers */
> +	.txe_eport_stride = 0,
> +	.send_ctrl_reg = SEND_CTRL,
> +	.send_high_priority_limit_reg = WFR_SEND_HIGH_PRIORITY_LIMIT,
> +	.send_egress_err_status_reg = WFR_SEND_EGRESS_ERR_STATUS,
> +	.send_egress_err_mask_reg = WFR_SEND_EGRESS_ERR_MASK,
> +	.send_egress_err_clear_reg = WFR_SEND_EGRESS_ERR_CLEAR,
> +	.send_bth_qp_reg = WFR_SEND_BTH_QP,
> +	.send_static_rate_control_reg = WFR_SEND_STATIC_RATE_CONTROL,
> +	.send_sc2vlt0_reg = WFR_SEND_SC2VLT0,
> +	.send_sc2vlt1_reg = WFR_SEND_SC2VLT1,
> +	.send_sc2vlt2_reg = WFR_SEND_SC2VLT2,
> +	.send_sc2vlt3_reg = WFR_SEND_SC2VLT3,
> +	.send_len_check0_reg = WFR_SEND_LEN_CHECK0,
> +	.send_len_check1_reg = WFR_SEND_LEN_CHECK1,
> +	.send_low_priority_list_reg = WFR_SEND_LOW_PRIORITY_LIST,
> +	.send_high_priority_list_reg = WFR_SEND_HIGH_PRIORITY_LIST,
> +	.send_counter_array32_reg = WFR_SEND_COUNTER_ARRAY32,
> +	.send_counter_array64_reg = WFR_SEND_COUNTER_ARRAY64,
> +	.send_cm_ctrl_reg = WFR_SEND_CM_CTRL,
> +	.send_cm_global_credit_reg = WFR_SEND_CM_GLOBAL_CREDIT,
> +	.send_cm_credit_used_status_reg = WFR_SEND_CM_CREDIT_USED_STATUS,
> +	.send_cm_timer_ctrl_reg = WFR_SEND_CM_TIMER_CTRL,
> +	.send_cm_local_au_table0_to3_reg = WFR_SEND_CM_LOCAL_AU_TABLE0_TO3,
> +	.send_cm_local_au_table4_to7_reg = WFR_SEND_CM_LOCAL_AU_TABLE4_TO7,
> +	.send_cm_remote_au_table0_to3_reg = WFR_SEND_CM_REMOTE_AU_TABLE0_TO3,
> +	.send_cm_remote_au_table4_to7_reg = WFR_SEND_CM_REMOTE_AU_TABLE4_TO7,
> +	.send_cm_credit_vl_reg = WFR_SEND_CM_CREDIT_VL,
> +	.send_cm_credit_vl15_reg = WFR_SEND_CM_CREDIT_VL15,
> +	.send_egress_err_info_reg = WFR_SEND_EGRESS_ERR_INFO,
> +	.send_egress_err_source_reg = WFR_SEND_EGRESS_ERR_SOURCE,
> +	.send_egress_ctxt_status_reg = WFR_SEND_EGRESS_CTXT_STATUS,
> +	.send_egress_send_dma_status_reg = WFR_SEND_EGRESS_SEND_DMA_STATUS,
> +
> +	/* egress port send context registers */
> +	.txe_epsc_stride = WFR_TXE_EPSC_STRIDE,
> +	.send_ctxt_check_enable_reg = WFR_SEND_CTXT_CHECK_ENABLE,
> +	.send_ctxt_check_vl_reg = WFR_SEND_CTXT_CHECK_VL,
> +	.send_ctxt_check_job_key_reg = WFR_SEND_CTXT_CHECK_JOB_KEY,
> +	.send_ctxt_check_partition_key_reg = WFR_SEND_CTXT_CHECK_PARTITION_KEY,
> +	.send_ctxt_check_slid_reg = WFR_SEND_CTXT_CHECK_SLID,
> +	.send_ctxt_check_opcode_reg = WFR_SEND_CTXT_CHECK_OPCODE,
> +
> +	/* SI registers */
> +	.cce_msix_int_map_vec_reg = WFR_CCE_INT_MAP,
> +	.send_pio_err_status_reg = WFR_SEND_PIO_ERR_STATUS,
> +	.send_pio_err_mask_reg = WFR_SEND_PIO_ERR_MASK,
> +	.send_pio_err_clear_reg = WFR_SEND_PIO_ERR_CLEAR,
> +	.send_dma_err_status_reg = WFR_SEND_DMA_ERR_STATUS,
> +	.send_dma_err_mask_reg = WFR_SEND_DMA_ERR_MASK,
> +	.send_dma_err_clear_reg = WFR_SEND_DMA_ERR_CLEAR,
> +	.csr_err_status_reg = WFR_SEND_ERR_STATUS,
> +	.csr_err_mask_reg = WFR_SEND_ERR_MASK,
> +	.csr_err_clear_reg = WFR_SEND_ERR_CLEAR,
> +
> +	.setextled = setextled,
> +	.start_led_override = hfi2_start_led_override,
> +	.shutdown_led_override = shutdown_led_override,
> +	.read_guid = read_guid,
> +	.early_per_chip_init = wfr_early_per_chip_init,
> +	.mid_per_chip_init = wfr_mid_per_chip_init,
> +	.init_other = init_other,
> +	.late_per_chip_init = wfr_late_per_chip_init,
> +	.start_port = wfr_start_port,
> +	.stop_port = wfr_stop_port,
> +	.put_tid = wfr_put_tid,
> +	.rcv_array_wc_fill = wfr_rcv_array_wc_fill,
> +	.set_port_tid_config = wfr_set_port_tid_config,
> +	.set_port_max_mtu = wfr_set_port_max_mtu,
> +	.update_rcv_hdr_size = wfr_update_rcv_hdr_size,
> +	.check_synth_status = wfr_check_synth_status,
> +	.update_synth_status = wfr_update_synth_status,
> +	.create_pbc = wfr_create_pbc,
> +	.set_pio_integrity = wfr_set_pio_integrity,
> +	.find_used_resources = wfr_find_used_resources,
> +	.read_link_quality = wfr_read_link_quality,
> +	.set_rheq_addr = NULL,
> +	.handle_link_bounce = wfr_handle_link_bounce,
> +	.enable_rcv_context = wfr_enable_rcv_context,
> +};
> +
> +/* parameters for the JKR ASIC */
> +static const struct chip_params jkr_params = {
> +	.chip_type = CHIP_JKR,
> +	.num_ports = 2,
> +	.dma_mask_bits = 58,
> +
> +	/* BAR0 map: see comments where KREG values are defined */
> +	.bar0_size = JKR_BAR0_SIZE,
> +	.kreg1_size = JKR_KREG1_SIZE,
> +	.kreg2_offset = JKR_KREG2_OFFSET,
> +	.kreg2_size = JKR_KREG2_SIZE,
> +	.rcv_array_offset = JKR_RCV_ARRAY,
> +	.rcv_array_size = JKR_RCV_ARRAY_SIZE,
> +
> +	.link_speed_supported = OPA_LINK_SPEED_100G | OPA_LINK_SPEED_25G,
> +	.link_speed_active = OPA_LINK_SPEED_100G,
> +	.asic_cclock_ps = JKR_ASIC_CCLOCK_PS,
> +	.rsm_rule_size = JKR_C_RXE_NUM_RSM_INSTANCES,
> +	.rsm_rule_offset_shift = JKR_RCV_RSM_CFG_OFFSET_SHIFT,
> +	.rsm_map_table_entries = 256,
> +	.rsm_map_table_entries_per_csr = 8,
> +	.rsm_map_table_entry_mask = 0xff,
> +	.rsm_map_table_entry_shift = 8,
> +	.qp_map_table_entries = 256,
> +	.qp_map_table_entries_per_csr = 8,
> +	.qp_map_table_entry_mask = 0xff,
> +	.qp_map_table_entry_shift = 8,
> +	.pkey_table_size = JKR_MAX_PKEY_VALUES,
> +	.generic_boardname = "Cornelis Networks 5000 Host Fabric Interface Adapter",
> +	.max_eager_entries = JKR_MAX_EAGER_ENTRIES,
> +	.pio_base_bits = JKR_PIO_BASE_BITS,
> +	.pio_base_shift = JKR_SEND_CTXT_CTRL_CTXT_BASE_SHIFT,
> +	.egress_err_info_data = &jkr_egress_err_info_data,
> +	.send_ctrl_flush = JKR_SEND_CTRL_FLUSH_WRONG_LINK_STATE_SMASK,
> +	.port_discard_egress_errs = JKR_PORT_DISCARD_EGRESS_ERRS,
> +
> +	/* interrupt sources */
> +	.num_int_csrs = JKR_C_CCE_NUM_INT_CSRS,
> +	.num_int_map_csrs = JKR_C_CCE_NUM_INT_MAP_CSRS,
> +	.is_cport_int = JKR_MCTXT_CPORT_TO_PCIE_INT,
> +	.is_rcvavail_start = JKR_IS_RCVAVAIL_START,
> +	.is_rcvurgent_start = JKR_IS_RCVURGENT_START,
> +	.is_sdmaeng_err_start = JKR_IS_SDMAENG_ERR_START,
> +	.is_sdma_idle_start = JKR_IS_SDMA_IDLE_START,
> +	.is_sdma_progress_start = JKR_IS_SDMA_PROGRESS_START,
> +	.is_sdma_start = JKR_IS_SDMA_START,
> +	.is_last_source = JKR_IS_LAST_SOURCE,
> +	.is_table = jkr_is_table,
> +	.gi_enable_table = jkr_gi_enable_table,
> +
> +	/* cce_interrupt registers */
> +	.cce_int_status_reg = JKR_CCE_INT_STATUS,
> +	.cce_int_mask_reg = JKR_CCE_INT_MASK,
> +	.cce_int_clear_reg = JKR_CCE_INT_CLEAR,
> +	.cce_int_force_reg = JKR_CCE_INT_FORCE,
> +	.cce_int_blocked_reg = JKR_CCE_INT_BLOCKED,
> +
> +	/* counters */
> +	.chip_dev_cntrs = jkr_dev_cntrs,
> +	.chip_dev_cntr_first = JKR_DEV_CNTR_FIRST,
> +	.chip_num_dev_cntrs = JKR_NUM_DEV_CNTRS,
> +	.chip_port_cntrs = jkr_port_cntrs,
> +	.chip_port_cntr_first = JKR_PORT_CNTR_FIRST,
> +	.chip_num_port_cntrs = JKR_NUM_PORT_CNTRS,
> +
> +	/* ingress port registers */
> +	.rxe_iport_stride = JKR_C_RXE_IPORT_STRIDE,
> +	.rcv_iport_ctrl_reg = JKR_RCV_IPORT_CTRL,
> +	.rcv_iport_status_reg = JKR_RCV_IPORT_STATUS,
> +	.rcv_bth_qp_reg = JKR_RCV_BTH_QP,
> +	.rcv_multicast_reg = JKR_RCV_MULTICAST,
> +	.rcv_bypass_reg = JKR_RCV_BYPASS,
> +	.rcv_vl15_reg = JKR_RCV_VL15,
> +	.rcv_err_info_reg = JKR_RCV_ERR_INFO,
> +	.rcv_err_status_reg = JKR_RCV_ERR_STATUS,
> +	.rcv_err_mask_reg = JKR_RCV_ERR_MASK,
> +	.rcv_err_clear_reg = JKR_RCV_ERR_CLEAR,
> +	.rcv_qp_map_table_reg = JKR_RCV_QP_MAP_TABLE,
> +	.rcv_partition_key_reg = JKR_RCV_PARTITION_KEY,
> +	.rcv_counter_array32_reg = JKR_RCV_COUNTER_ARRAY32,
> +	.rcv_counter_array64_reg = JKR_RCV_COUNTER_ARRAY64,
> +
> +	/* ingress port receive context registers */
> +	.rxe_iprc_stride = JKR_C_RXE_IPRC_STRIDE,
> +	.rcv_jkey_ctrl_reg = JKR_RCV_JKEY_CTRL,
> +
> +	/* RXE restricted context registers */
> +	.rxe_rctxt_stride = JKR_C_RXE_RCTXT_STRIDE,
> +	.rcv_rctxt_ctrl_reg = JKR_RCV_RCTXT_CTRL,
> +	.rcv_egr_ctrl_reg = JKR_RCV_EGR_CTRL,
> +	.rcv_tid_ctrl_reg = JKR_RCV_TID_CTRL,
> +
> +	/* RXE kernel context registers */
> +	.rxe_kctxt_stride = JKR_C_RXE_KCTXT_STRIDE,
> +	.rcv_kctxt_ctrl_reg = JKR_RCV_KCTXT_CTRL,
> +	.rcv_hdr_addr_reg = JKR_RCV_HDR_ADDR,
> +	.rcv_hdr_cnt_reg = JKR_RCV_HDR_CNT,
> +	.rcv_hdr_ent_size_reg = JKR_RCV_HDR_ENT_SIZE,
> +	.rcv_hdr_tail_addr_reg = JKR_RCV_HDR_TAIL_ADDR,
> +	.rcv_avail_time_out_reg = JKR_RCV_AVAIL_TIME_OUT,
> +	.rcv_hdr_ovfl_cnt_reg = JKR_RCV_HDR_OVFL_CNT,
> +
> +	/* RXE kernel/user registers */
> +	.rxe_ku_stride = JKR_C_RXE_UCTXT_STRIDE,
> +	.rcv_ctxt_status_reg = JKR_RCV_CTXT_STATUS,
> +
> +	/* RXE user registers */
> +	.rxe_uctxt_stride = JKR_C_RXE_UCTXT_STRIDE,
> +	.rcv_hdr_tail_reg = JKR_RCV_HDR_TAIL,
> +	.rcv_hdr_head_reg = JKR_RCV_HDR_HEAD,
> +	.rcv_egr_index_head_reg = JKR_RCV_EGR_INDEX_HEAD,
> +	.rcv_tid_flow_table_reg = JKR_RCV_TID_FLOW_TABLE,
> +
> +	/* RXE RSM registers */
> +	.rcv_rsm_cfg_reg = JKR_RCV_RSM_CFG,
> +	.rcv_rsm_select_reg = JKR_RCV_RSM_SELECT,
> +	.rcv_rsm_match_reg = JKR_RCV_RSM_MATCH,
> +	.rcv_rsm_map_table_reg = JKR_RCV_RSM_MAP_TABLE,
> +
> +	/* TXE kernel registers */
> +	.send_contexts_reg = JKR_SEND_CONTEXTS,
> +	.send_dma_engines_reg = JKR_SEND_DMA_ENGINES,
> +	.send_pio_mem_size_reg = JKR_SEND_PIO_MEM_SIZE,
> +	.send_dma_mem_size_reg = JKR_SEND_DMA_MEM_SIZE,
> +	.send_pio_init_ctxt_reg = JKR_SEND_PIO_INIT_CTXT,
> +
> +	/* send context_registers */
> +	.txe_sctxt_stride = JKR_C_TXE_SCTXT_STRIDE,
> +	.send_ctxt_status_reg = JKR_SEND_CTXT_STATUS,
> +	.send_ctxt_credit_ctrl_reg = JKR_SEND_CTXT_CREDIT_CTRL,
> +	.send_ctxt_credit_status_reg = JKR_SEND_CTXT_CREDIT_STATUS,
> +	.send_ctxt_credit_return_addr_reg = JKR_SEND_CTXT_CREDIT_RETURN_ADDR,
> +	.send_ctxt_credit_force_reg = JKR_SEND_CTXT_CREDIT_FORCE,
> +	.send_ctxt_err_status_reg = JKR_SEND_CTXT_ERR_STATUS,
> +	.send_ctxt_err_mask_reg = JKR_SEND_CTXT_ERR_MASK,
> +	.send_ctxt_err_clear_reg = JKR_SEND_CTXT_ERR_CLEAR,
> +
> +	/* TXE send context registers */
> +	.txe_tctxt_stride = JKR_C_TXE_TCTXT_STRIDE,
> +	.send_ctxt_ctrl_reg = JKR_SEND_CTXT_CTRL,
> +
> +	/* SDMA registers */
> +	.txe_sdma_stride = JKR_C_TXE_SDMA_STRIDE,
> +	.send_dma_ctrl_reg = JKR_SEND_DMA_CTRL,
> +	.send_dma_status_reg = JKR_SEND_DMA_STATUS,
> +	.send_dma_base_addr_reg = JKR_SEND_DMA_BASE_ADDR,
> +	.send_dma_len_gen_reg = JKR_SEND_DMA_LEN_GEN,
> +	.send_dma_tail_reg = JKR_SEND_DMA_TAIL,
> +	.send_dma_head_reg = JKR_SEND_DMA_HEAD,
> +	.send_dma_head_addr_reg = JKR_SEND_DMA_HEAD_ADDR,
> +	.send_dma_priority_thld_reg = JKR_SEND_DMA_PRIORITY_THLD,
> +	.send_dma_idle_cnt_reg = JKR_SEND_DMA_IDLE_CNT,
> +	.send_dma_reload_cnt_reg = JKR_SEND_DMA_RELOAD_CNT,
> +	.send_dma_desc_cnt_reg = JKR_SEND_DMA_DESC_CNT,
> +	.send_dma_desc_fetched_cnt_reg = JKR_SEND_DMA_DESC_FETCHED_CNT,
> +	.send_dma_eng_err_status_reg = JKR_SEND_DMA_ENG_ERR_STATUS,
> +	.send_dma_eng_err_mask_reg = JKR_SEND_DMA_ENG_ERR_MASK,
> +	.send_dma_eng_err_clear_reg = JKR_SEND_DMA_ENG_ERR_CLEAR,
> +
> +	/* SDMA Config registers */
> +	.txe_sdmacfg_stride = JKR_C_TXE_SDMACFG_STRIDE,
> +	.send_dma_cfg_memory_reg = JKR_SEND_DMA_CFG_MEMORY,
> +
> +	/* egress port registers */
> +	.txe_eport_stride = JKR_C_TXE_EPORT_STRIDE,
> +	.send_ctrl_reg = JKR_SEND_CTRL,
> +	.send_high_priority_limit_reg = JKR_SEND_HIGH_PRIORITY_LIMIT,
> +	.send_egress_err_status_reg = JKR_SEND_EGRESS_ERR_STATUS,
> +	.send_egress_err_mask_reg = JKR_SEND_EGRESS_ERR_MASK,
> +	.send_egress_err_clear_reg = JKR_SEND_EGRESS_ERR_CLEAR,
> +	.send_bth_qp_reg = JKR_SEND_BTH_QP,
> +	.send_static_rate_control_reg = JKR_SEND_STATIC_RATE_CONTROL,
> +	.send_sc2vlt0_reg = JKR_SEND_SC2VLT0,
> +	.send_sc2vlt1_reg = JKR_SEND_SC2VLT1,
> +	.send_sc2vlt2_reg = JKR_SEND_SC2VLT2,
> +	.send_sc2vlt3_reg = JKR_SEND_SC2VLT3,
> +	.send_len_check0_reg = JKR_SEND_LEN_CHECK0,
> +	.send_len_check1_reg = JKR_SEND_LEN_CHECK1,
> +	.send_low_priority_list_reg = JKR_SEND_LOW_PRIORITY_LIST,
> +	.send_high_priority_list_reg = JKR_SEND_HIGH_PRIORITY_LIST,
> +	.send_counter_array32_reg = JKR_SEND_COUNTER_ARRAY32,
> +	.send_counter_array64_reg = JKR_SEND_COUNTER_ARRAY64,
> +	.send_cm_ctrl_reg = JKR_SEND_CM_CTRL,
> +	.send_cm_global_credit_reg = JKR_SEND_CM_GLOBAL_CREDIT,
> +	.send_cm_credit_used_status_reg = JKR_SEND_CM_CREDIT_USED_STATUS,
> +	.send_cm_timer_ctrl_reg = JKR_SEND_CM_TIMER_CTRL,
> +	.send_cm_local_au_table0_to3_reg = JKR_SEND_CM_LOCAL_AU_TABLE0_TO3,
> +	.send_cm_local_au_table4_to7_reg = JKR_SEND_CM_LOCAL_AU_TABLE4_TO7,
> +	.send_cm_remote_au_table0_to3_reg = JKR_SEND_CM_REMOTE_AU_TABLE0_TO3,
> +	.send_cm_remote_au_table4_to7_reg = JKR_SEND_CM_REMOTE_AU_TABLE4_TO7,
> +	.send_cm_credit_vl_reg = JKR_SEND_CM_CREDIT_VL,
> +	.send_cm_credit_vl15_reg = JKR_SEND_CM_CREDIT_VL15,
> +	.send_egress_err_info_reg = JKR_SEND_EGRESS_ERR_INFO,
> +	.send_egress_err_source_reg = JKR_SEND_EGRESS_ERR_SOURCE,
> +	.send_egress_ctxt_status_reg = JKR_SEND_EGRESS_CTXT_STATUS,
> +	.send_egress_send_dma_status_reg = JKR_SEND_EGRESS_SEND_DMA_STATUS,
> +
> +	/* egress port send context registers */
> +	.txe_epsc_stride = JKR_C_TXE_EPSC_STRIDE,
> +	.send_ctxt_check_enable_reg = JKR_SEND_CTXT_CHECK_ENABLE,
> +	.send_ctxt_check_vl_reg = JKR_SEND_CTXT_CHECK_VL,
> +	.send_ctxt_check_job_key_reg = JKR_SEND_CTXT_CHECK_JOB_KEY,
> +	.send_ctxt_check_partition_key_reg = JKR_SEND_CTXT_CHECK_PARTITION_KEY,
> +	.send_ctxt_check_slid_reg = JKR_SEND_CTXT_CHECK_SLID,
> +	.send_ctxt_check_opcode_reg = JKR_SEND_CTXT_CHECK_OPCODE,
> +
> +	/* SI registers */
> +	.cce_msix_int_map_vec_reg = JKR_CCE_MSIX_INT_MAP_VEC,
> +	.send_pio_err_status_reg = JKR_SEND_PIO_ERR_STATUS,
> +	.send_pio_err_mask_reg = JKR_SEND_PIO_ERR_MASK,
> +	.send_pio_err_clear_reg = JKR_SEND_PIO_ERR_CLEAR,
> +	.send_dma_err_status_reg = JKR_SEND_DMA_ERR_STATUS,
> +	.send_dma_err_mask_reg = JKR_SEND_DMA_ERR_MASK,
> +	.send_dma_err_clear_reg = JKR_SEND_DMA_ERR_CLEAR,
> +	.csr_err_status_reg = JKR_CSR_ERR_STATUS,
> +	.csr_err_mask_reg = JKR_CSR_ERR_MASK,
> +	.csr_err_clear_reg = JKR_CSR_ERR_CLEAR,
> +
> +	.setextled = gen_setextled,
> +	.start_led_override = gen_start_led_override,
> +	.shutdown_led_override = gen_shutdown_led_override,
> +	.read_guid = jkr_read_guid,
> +	.early_per_chip_init = jkr_early_per_chip_init,
> +	.mid_per_chip_init = jkr_mid_per_chip_init,
> +	.init_other = jkr_init_other,
> +	.late_per_chip_init = gen_late_per_chip_init,
> +	.start_port = gen_start_port,
> +	.stop_port = gen_stop_port,
> +	.put_tid = jkr_put_tid,
> +	.rcv_array_wc_fill = jkr_rcv_array_wc_fill,
> +	.set_port_tid_config = jkr_set_port_tid_config,
> +	.set_port_max_mtu = gen_set_port_max_mtu,
> +	.update_rcv_hdr_size = jkr_update_rcv_hdr_size,
> +	.check_synth_status = jkr_check_synth_status,
> +	.update_synth_status = jkr_update_synth_status,
> +	.create_pbc = gen_create_pbc,
> +	.set_pio_integrity = jkr_set_pio_integrity,
> +	.find_used_resources = jkr_find_used_resources,
> +	.read_link_quality = jkr_read_link_quality,
> +	.set_rheq_addr = jkr_set_rheq_addr,
> +	.handle_link_bounce = jkr_handle_link_bounce,
> +	.enable_rcv_context = jkr_enable_rcv_context,
> +};
> +
> +/*
> + * Number of user receive contexts each port configured to use (allow for more
> + * pio buffers per ctxt, etc).
> + */
> +static int num_user_contexts_array[32];
> +static int num_user_contexts_count;
> +module_param_array_named(num_user_contexts, num_user_contexts_array, int,
> +			 &num_user_contexts_count, 0444);
> +MODULE_PARM_DESC(num_user_contexts, "Set max number of user contexts to use per-hfi, per-port (unset or -1: use the real (non-HT) CPU count)");
> +
> +uint krcvqs[RXE_NUM_DATA_VL];
> +int krcvqsset;
> +module_param_array(krcvqs, uint, &krcvqsset, 0444);
> +MODULE_PARM_DESC(krcvqs, "Array of the number of non-control kernel receive queues by VL");
> +
> +/* computed based on above array */
> +unsigned long n_krcvqs;
> +
> +static unsigned int hfi2_rcvarr_split = 25;
> +module_param_named(rcvarr_split, hfi2_rcvarr_split, uint, 0444);
> +MODULE_PARM_DESC(rcvarr_split, "Percent of context's RcvArray entries used for Eager buffers");
> +
> +static uint eager_buffer_size = (8 << 20); /* 8MB */
> +module_param(eager_buffer_size, uint, 0444);
> +MODULE_PARM_DESC(eager_buffer_size, "Size of the eager buffers, default: 8MB");
> +
> +static uint rcvhdrcnt = 2048; /* 2x the max eager buffer count */
> +module_param_named(rcvhdrcnt, rcvhdrcnt, uint, 0444);
> +MODULE_PARM_DESC(rcvhdrcnt, "Receive header queue count (default 2048)");
> +
> +static uint hfi2_hdrq_entsize = DEFAULT_HDRQ_ENTSIZE;
> +module_param_named(hdrq_entsize, hfi2_hdrq_entsize, uint, 0444);
> +MODULE_PARM_DESC(hdrq_entsize, "Size of header queue entries: 2 - 8B, 16 - 64B, 32 - 128B (default)");
> +
> +unsigned int user_credit_return_threshold = 33;	/* default is 33% */
> +module_param(user_credit_return_threshold, uint, 0444);
> +MODULE_PARM_DESC(user_credit_return_threshold, "Credit return threshold for user send contexts, return when unreturned credits passes this many blocks (in percent of allocated blocks, 0 is off)");
> +
> +DEFINE_XARRAY_FLAGS(hfi2_dev_table, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
> +
> +struct cport_trap_reg {
> +	u32 mask;
> +	cport_trap_handler func;
> +};
> +
> +/* send, or resend, START message */
> +static int cport_start(struct hfi2_devdata *dd, int to_secs)
> +{
> +	struct cport_start_payload start = {0};
> +	union {
> +		struct cport_start_payload pl;
> +		u64 qw;
> +	} *resp = NULL;
> +	int resp_len = 0;
> +	int ret;
> +
> +	start.opts_ena = dd->cport->opts;
> +	start.trap_ena = dd->cport->traps;
> +
> +	ret = cport_send_req(dd, CH_OP_START, 0, &start, sizeof(start),
> +			     (void **)&resp, &resp_len, to_secs * HZ);
> +	if (ret == MSG_RSP_STATUS_SEQ_NO_ERROR) {
> +		dd_dev_info(dd, "CPORT sequence error, retrying\n");
> +		ret = cport_send_req(dd, CH_OP_START, 0, &start, sizeof(start),
> +				     (void **)&resp, &resp_len, HZ);
> +	}
> +	if (ret) {
> +		dd_dev_err(dd, "CPORT start failed %d\n", ret);
> +	} else if (resp_len) {
> +		dd_dev_info(dd, "CPORT started %016llx\n", resp->qw);
> +		dd->cport->traps_act = resp->pl.trap_ena;
> +	} else {
> +		dd_dev_info(dd, "CPORT started\n");
> +	}
> +	kfree(resp);
> +	return ret;
> +}
> +
> +int register_cport_trap(struct hfi2_devdata *dd, struct cport_trap_status traps,
> +			cport_trap_handler func)
> +{
> +	union {
> +		struct cport_trap_status traps;
> +		u32 dw;
> +	} trap_val, cur_traps;
> +	struct cport_trap_reg *entry;
> +	u32 index;
> +	int ret;
> +
> +	if (!dd->cport)
> +		return 0;
> +
> +	trap_val.traps = traps;
> +	cur_traps.traps = dd->cport->traps;
> +
> +	entry = kzalloc_obj(entry, GFP_KERNEL);
> +	if (!entry)
> +		return -ENOMEM;
> +	entry->mask = trap_val.dw;
> +	entry->func = func;
> +	ret = xa_alloc_irq(&dd->cport->trap_xa, &index, entry, xa_limit_32b, GFP_KERNEL);
> +	if (ret < 0) {
> +		kfree(entry);
> +		return ret;
> +	}
> +
> +	trap_val.dw |= cur_traps.dw;
> +	if (trap_val.dw != cur_traps.dw) {
> +		dd->cport->traps = trap_val.traps;
> +		ret = cport_start(dd, cport_adm_to);
> +	}
> +	return ret;
> +}
> +
> +int deregister_cport_trap(struct hfi2_devdata *dd, cport_trap_handler func)
> +{
> +	union {
> +		struct cport_trap_status traps;
> +		u32 dw;
> +	} trap_val, cur_traps;
> +	struct cport_trap_reg *entry;
> +	unsigned long index;
> +
> +	if (!dd->cport)
> +		return 0;
> +
> +	trap_val.dw = 0;
> +	xa_lock_irq(&dd->cport->trap_xa);
> +	xa_for_each(&dd->cport->trap_xa, index, entry) {
> +		if (entry->func == func) {
> +			__xa_erase(&dd->cport->trap_xa, index);
> +			kfree(entry);
> +		} else {
> +			trap_val.dw |= entry->mask;
> +		}
> +	}
> +	xa_unlock_irq(&dd->cport->trap_xa);
> +	cur_traps.traps = dd->cport->traps;
> +	if (trap_val.dw != cur_traps.dw) {
> +		dd->cport->traps = trap_val.traps;
> +		cport_start(dd, cport_adm_to);
> +	}
> +
> +	return 0;
> +}
> +
> +static void clearall_cport_trap(struct hfi2_devdata *dd)
> +{
> +	struct cport_trap_reg *entry;
> +	unsigned long index;
> +	struct cport_trap_status no_traps = {0};
> +
> +	if (!dd->cport)
> +		return;
> +
> +	dd->cport->traps = no_traps;
> +	cport_start(dd, cport_adm_to);
> +	cport_register_cb(dd, CH_OP_TRAP, CH_OP_TRAP, NULL);
> +	xa_lock_irq(&dd->cport->trap_xa);
> +	/* there should be none left, but make certain */
> +	xa_for_each(&dd->cport->trap_xa, index, entry) {
> +		__xa_erase(&dd->cport->trap_xa, index);
> +		dd_dev_info(dd, "removing latent TRAP handler %ps\n", entry->func);
> +		kfree(entry);
> +	}
> +	xa_unlock_irq(&dd->cport->trap_xa);
> +}
> +
> +static int handle_cport_trap(struct hfi2_devdata *dd, u8 op, u8 sideband,
> +			     void *payload, int len, void *handle)
> +{
> +	struct cport_trap_payload *traps = payload;
> +	struct cport_trap_payload repress = {0};
> +	union {
> +		struct cport_trap_status traps;
> +		u32 dw;
> +	} trap_val;
> +	struct cport_trap_reg *entry;
> +	unsigned long index;
> +	int ret;
> +
> +	trap_val.traps = traps->trap_sts;
> +
> +	/* clear-down the traps we got */
> +	repress.trap_sts = traps->trap_sts;
> +	ret = cport_send_notif(dd, CH_OP_TRAP_REPRESS, 0, &repress, sizeof(repress),
> +			       cport_adm_to * HZ);
> +	if (ret)
> +		dd_dev_warn(dd, "CPORT TRAP_REPRESS failed: %d\n", ret);
> +#ifdef CPORT_TRAP_DEBUG
> +	pr_warn("hfi2_%d: %s: CPORT TRAP %08x\n", dd->unit, __func__, trap_val.dw);
> +#endif
> +
> +	xa_lock_irq(&dd->cport->trap_xa);
> +	xa_for_each(&dd->cport->trap_xa, index, entry) {
> +		if (entry->mask & trap_val.dw)
> +			entry->func(dd, trap_val.traps);
> +	}
> +	xa_unlock_irq(&dd->cport->trap_xa);
> +
> +	return 0;
> +}
> +
> +static void cport_stop(struct hfi2_devdata *dd)
> +{
> +	struct cport_stop_payload stop = {0};
> +	u64 *resp = NULL;
> +	int resp_len = 0;
> +	int ret;
> +
> +	if (!dd->cport)
> +		return;
> +
> +	ret = cport_send_req(dd, CH_OP_STOP, 0, &stop, sizeof(stop),
> +			     (void **)&resp, &resp_len, cport_adm_to * HZ);
> +	if (ret)
> +		dd_dev_err(dd, "CPORT stop failed %d\n", ret);
> +	else if (resp_len)
> +		dd_dev_info(dd, "CPORT stopped %016llx\n", *resp);
> +	else
> +		dd_dev_info(dd, "CPORT stopped\n");
> +	kfree(resp);
> +}
> +
> +int start_cport(struct hfi2_devdata *dd)
> +{
> +	int ret;
> +
> +	ret = cport_init(dd);
> +	if (ret || !dd->cport)
> +		return ret;
> +
> +	/*
> +	 * Do a STOP to ensure the device is properly cleaned up.
> +	 * This may cause firmware to be unresponsive for awhile,
> +	 * so increase the timeout for the subsequent START.
> +	 */
> +	cport_stop(dd);
> +
> +	cport_register_cb(dd, CH_OP_TRAP, CH_OP_TRAP, handle_cport_trap);
> +
> +	dd->cport->opts.bare_metal = 1;
> +
> +	ret = cport_start(dd, 3 * cport_adm_to);
> +	if (ret)
> +		cport_exit(dd);
> +	return (ret > 0 ? -EIO : ret);
> +}
> +
> +static void stop_cport(struct hfi2_devdata *dd)
> +{
> +	if (!dd->cport)
> +		return;
> +
> +	cport_stop(dd);
> +
> +	cport_exit(dd);
> +}
> +
> +static int hfi2_create_kctxt(struct hfi2_pportdata *ppd, u16 ctxt)
> +{
> +	struct hfi2_devdata *dd = ppd->dd;
> +	struct hfi2_ctxtdata *rcd;
> +	int ret;
> +
> +	/* Control context has to be always 0 */
> +	BUILD_BUG_ON(HFI2_CTRL_CTXT != 0);
> +
> +	ret = hfi2_create_ctxtdata(ppd, dd->node, ctxt, &rcd);
> +	if (ret < 0) {
> +		dd_dev_err(dd, "Kernel receive context allocation failed\n");
> +		return ret;
> +	}
> +
> +	/*
> +	 * Set up the kernel context flags here and now because they use
> +	 * default values for all receive side memories.  User contexts will
> +	 * be handled as they are created.
> +	 */
> +	rcd->flags = HFI2_CAP_KGET(MULTI_PKT_EGR) |
> +		HFI2_CAP_KGET(NODROP_RHQ_FULL) |
> +		HFI2_CAP_KGET(NODROP_EGR_FULL) |
> +		HFI2_CAP_KGET(DMA_RTAIL);
> +
> +	/* Control context must use DMA_RTAIL */
> +	if (is_control_context(rcd))
> +		rcd->flags |= HFI2_CAP_DMA_RTAIL;
> +	rcd->fast_handler = get_dma_rtail_setting(rcd) ?
> +				handle_receive_interrupt_dma_rtail :
> +				handle_receive_interrupt_nodma_rtail;
> +
> +	hfi2_set_seq_cnt(rcd, 1);
> +
> +	rcd->sc = sc_alloc(ppd, SC_ACK, rcd->rcvhdrqentsize, dd->node);
> +	if (!rcd->sc) {
> +		dd_dev_err(dd, "Kernel send context allocation failed\n");
> +		return -ENOMEM;
> +	}
> +	hfi2_init_ctxt(rcd->sc);
> +
> +	return 0;
> +}
> +
> +/*
> + * Create the receive context array and one or more kernel contexts
> + */
> +int hfi2_create_kctxts(struct hfi2_devdata *dd)
> +{
> +	struct hfi2_devrsrcs *dr = &dd->rsrcs;
> +	u16 i;
> +	u16 j;
> +	int ret;
> +
> +	/*
> +	 * so this is making dd->rcd much larger than needed. Unfortunately,
> +	 * current code requires that dd->rcd[x].ctxt == x (h/w context number
> +	 * must be the same as dd->rcd index number - s/w context number)
> +	 * and much code needs to change in order to fix this.
> +	 */
> +	dd->num_rcd = chip_rcv_contexts(dd);
> +	dd->rcd = kcalloc_node(dd->num_rcd, sizeof(*dd->rcd),
> +			       GFP_KERNEL, dd->node);
> +	if (!dd->rcd) {
> +		dd->num_rcd = 0;
> +		return -ENOMEM;
> +	}
> +
> +	for (i = 0; i < dd->num_pports; i++) {
> +		struct hfi2_pportdata *ppd = dd->pport + i;
> +		struct hfi2_portrsrcs *pr = &dr->ppr[i];
> +
> +		for (j = 0; j < pr->n_krcv_queues; j++) {
> +			u16 ctxt = pr->rcv_context_base + j;
> +
> +			ret = hfi2_create_kctxt(ppd, ctxt);
> +			if (ret)
> +				goto bail;
> +		}
> +	}
> +
> +	return 0;
> +bail:
> +	for (i = 0; i < dd->num_pports; i++) {
> +		struct hfi2_portrsrcs *pr = &dr->ppr[i];
> +
> +		for (j = 0; j < pr->n_krcv_queues; j++) {
> +			u16 ctxt = pr->rcv_context_base + j;
> +
> +			hfi2_free_ctxt(dd->rcd[ctxt]);
> +		}
> +	}
> +
> +	/* All the contexts should be freed, free the array */
> +	kfree(dd->rcd);
> +	dd->rcd = NULL;
> +	dd->num_rcd = 0;
> +	return ret;
> +}
> +
> +/*
> + * Helper routines for the receive context reference count (rcd and uctxt).
> + */
> +static void hfi2_rcd_init(struct hfi2_ctxtdata *rcd)
> +{
> +	kref_init(&rcd->kref);
> +}
> +
> +/**
> + * hfi2_rcd_free - When reference is zero clean up.
> + * @kref: pointer to an initialized rcd data structure
> + *
> + */
> +static void hfi2_rcd_free(struct kref *kref)
> +{
> +	unsigned long flags;
> +	struct hfi2_ctxtdata *rcd =
> +		container_of(kref, struct hfi2_ctxtdata, kref);
> +
> +	spin_lock_irqsave(&rcd->dd->uctxt_lock, flags);
> +	rcd->dd->rcd[rcd->ctxt] = NULL;
> +	spin_unlock_irqrestore(&rcd->dd->uctxt_lock, flags);
> +
> +	hfi2_free_ctxtdata(rcd->dd, rcd);
> +
> +	kfree(rcd);
> +}
> +
> +/**
> + * hfi2_rcd_put - decrement reference for rcd
> + * @rcd: pointer to an initialized rcd data structure
> + *
> + * Use this to put a reference after the init.
> + */
> +int hfi2_rcd_put(struct hfi2_ctxtdata *rcd)
> +{
> +	if (rcd)
> +		return kref_put(&rcd->kref, hfi2_rcd_free);
> +
> +	return 0;
> +}
> +
> +/**
> + * hfi2_rcd_get - increment reference for rcd
> + * @rcd: pointer to an initialized rcd data structure
> + *
> + * Use this to get a reference after the init.
> + *
> + * Return : reflect kref_get_unless_zero(), which returns non-zero on
> + * increment, otherwise 0.
> + */
> +int hfi2_rcd_get(struct hfi2_ctxtdata *rcd)
> +{
> +	return kref_get_unless_zero(&rcd->kref);
> +}
> +
> +/**
> + * allocate_rcd_index - allocate an rcd index from the rcd array
> + * @ppd: pointer to a valid port data structure
> + * @rcd: rcd data structure to assign
> + * @index[in,out]: in, suggested context number; out, selected context number
> + *
> + * Allocate an rcd index, either at the given context number or any within
> + * a dynamic range.  If the fixed index is used or the dynamic range is full,
> + * return -EBUSY.
> + */
> +static int allocate_rcd_index(struct hfi2_pportdata *ppd,
> +			      struct hfi2_ctxtdata *rcd, u16 *index)
> +{
> +	struct hfi2_devdata *dd = ppd->dd;
> +	struct hfi2_portrsrcs *pr = &dd->rsrcs.ppr[ppd->hw_pidx];
> +	unsigned long flags;
> +	u16 ctxt = *index;
> +	bool found;
> +
> +	spin_lock_irqsave(&dd->uctxt_lock, flags);
> +	found = false;
> +	if (ctxt == DYNAMIC_CONTEXT) {
> +		/* look for an unused dynamic context */
> +		for (ctxt = pr->first_dyn_alloc_ctxt;
> +		     ctxt < pr->rcv_context_base + pr->num_rcv_contexts;
> +		     ctxt++) {
> +			if (!dd->rcd[ctxt]) {
> +				found = true;
> +				break;
> +			}
> +		}
> +	} else {
> +		/* use the context number given */
> +		if (!dd->rcd[ctxt])
> +			found = true;
> +	}
> +
> +	if (found) {
> +		rcd->ctxt = ctxt;
> +		dd->rcd[ctxt] = rcd;
> +		hfi2_rcd_init(rcd);
> +	}
> +	spin_unlock_irqrestore(&dd->uctxt_lock, flags);
> +
> +	if (!found)
> +		return -EBUSY;
> +
> +	*index = ctxt;
> +
> +	return 0;
> +}
> +
> +/**
> + * hfi2_rcd_get_by_index - get rcd by index
> + * @dd: pointer to a valid devdata structure
> + * @ctxt: the index of a possible rcd
> + *
> + * Hold the protecting spinlock and increment the reference on the selected
> + * rcd element.
> + *
> + * The caller is responsible for calling hfi2_rcd_put() on the returned
> + * pointer.
> + */
> +struct hfi2_ctxtdata *hfi2_rcd_get_by_index(struct hfi2_devdata *dd, u16 ctxt)
> +{
> +	unsigned long flags;
> +	struct hfi2_ctxtdata *rcd = NULL;
> +
> +	spin_lock_irqsave(&dd->uctxt_lock, flags);
> +	if (ctxt < dd->num_rcd) {
> +		rcd = dd->rcd[ctxt];
> +		if (rcd && !hfi2_rcd_get(rcd))
> +			rcd = NULL;
> +	}
> +	spin_unlock_irqrestore(&dd->uctxt_lock, flags);
> +
> +	return rcd;
> +}
> +
> +/*
> + * Common code for user and kernel context create and setup.
> + * NOTE: the initial kref is done here (hf1_rcd_init()).
> + */
> +int hfi2_create_ctxtdata(struct hfi2_pportdata *ppd, int numa, u16 ctxt,
> +			 struct hfi2_ctxtdata **context)
> +{
> +	struct hfi2_devdata *dd = ppd->dd;
> +	struct hfi2_devrsrcs *dr = &dd->rsrcs;
> +	struct hfi2_portrsrcs *pr = &dr->ppr[ppd->hw_pidx];
> +	struct hfi2_ctxtdata *rcd;
> +
> +	rcd = kzalloc_node(sizeof(*rcd), GFP_KERNEL, numa);
> +	if (rcd) {
> +		u32 rcvtids, max_entries;
> +		int ret;
> +
> +		ret = allocate_rcd_index(ppd, rcd, &ctxt);
> +		if (ret) {
> +			*context = NULL;
> +			kfree(rcd);
> +			return ret;
> +		}
> +
> +		INIT_LIST_HEAD(&rcd->qp_wait_list);
> +		hfi2_exp_tid_group_init(rcd);
> +		rcd->ppd = ppd;
> +		rcd->dd = dd;
> +		rcd->numa_id = numa;
> +		rcd->rcv_array_groups = dd->rcv_entries.ngroups;
> +		rcd->rhf_rcv_function_map = normal_rhf_rcv_functions;
> +		rcd->slow_handler = handle_receive_interrupt;
> +		rcd->do_interrupt = rcd->slow_handler;
> +		rcd->msix_intr = CCE_NUM_MSIX_VECTORS;
> +
> +		mutex_init(&rcd->exp_mutex);
> +		spin_lock_init(&rcd->exp_lock);
> +		INIT_LIST_HEAD(&rcd->flow_queue.queue_head);
> +		INIT_LIST_HEAD(&rcd->rarr_queue.queue_head);
> +
> +		hfi2_cdbg(PROC, "setting up context %u", rcd->ctxt);
> +
> +		/* calculate the context's RcvArray entry starting point */
> +		rcd->eager_base = pr->rcv_array_base +
> +				  ((ctxt - pr->rcv_context_base) *
> +				   dd->rcv_entries.ngroups *
> +				   dd->rcv_entries.group_size);
> +
> +		rcd->rcvhdrq_cnt = rcvhdrcnt;
> +		rcd->rcvhdrqentsize = hfi2_hdrq_entsize;
> +		rcd->rhf_offset =
> +			rcd->rcvhdrqentsize - sizeof(u64) / sizeof(u32);
> +		rcd->kdeth_rcv_hdr = DEFAULT_RCVHDRSIZE;
> +		/*
> +		 * Simple Eager buffer allocation: we have already pre-allocated
> +		 * the number of RcvArray entry groups. Each ctxtdata structure
> +		 * holds the number of groups for that context.
> +		 *
> +		 * To follow CSR requirements and maintain cacheline alignment,
> +		 * make sure all sizes and bases are multiples of group_size.
> +		 *
> +		 * The expected entry count is what is left after assigning
> +		 * eager.
> +		 */
> +		max_entries = rcd->rcv_array_groups * dd->rcv_entries.group_size;
> +		rcvtids = ((max_entries * hfi2_rcvarr_split) / 100);
> +		rcd->egrbufs.count = round_down(rcvtids, dd->rcv_entries.group_size);
> +		if (rcd->egrbufs.count > dd->params->max_eager_entries) {
> +			dd_dev_err(dd, "ctxt%u: requested too many RcvArray entries.\n",
> +				   rcd->ctxt);
> +			rcd->egrbufs.count = dd->params->max_eager_entries;
> +		}
> +		hfi2_cdbg(PROC,
> +			  "ctxt%u: max Eager buffer RcvArray entries: %u",
> +			  rcd->ctxt, rcd->egrbufs.count);
> +
> +		/*
> +		 * Allocate array that will hold the eager buffer accounting
> +		 * data.
> +		 * This will allocate the maximum possible buffer count based
> +		 * on the value of the RcvArray split parameter.
> +		 * The resulting value will be rounded down to the closest
> +		 * multiple of dd->rcv_entries.group_size.
> +		 */
> +		rcd->egrbufs.buffers =
> +			kcalloc_node(rcd->egrbufs.count,
> +				     sizeof(*rcd->egrbufs.buffers),
> +				     GFP_KERNEL, numa);
> +		if (!rcd->egrbufs.buffers)
> +			goto bail;
> +		rcd->egrbufs.rcvtids =
> +			kcalloc_node(rcd->egrbufs.count,
> +				     sizeof(*rcd->egrbufs.rcvtids),
> +				     GFP_KERNEL, numa);
> +		if (!rcd->egrbufs.rcvtids)
> +			goto bail;
> +		rcd->egrbufs.size = eager_buffer_size;
> +		/*
> +		 * The size of the buffers programmed into the RcvArray
> +		 * entries needs to be big enough to handle the highest
> +		 * MTU supported.
> +		 */
> +		if (rcd->egrbufs.size < hfi2_max_mtu) {
> +			rcd->egrbufs.size = __roundup_pow_of_two(hfi2_max_mtu);
> +			hfi2_cdbg(PROC,
> +				  "ctxt%u: eager bufs size too small. Adjusting to %u",
> +				    rcd->ctxt, rcd->egrbufs.size);
> +		}
> +		rcd->egrbufs.rcvtid_size = HFI2_MAX_EAGER_BUFFER_SIZE;
> +
> +		/* Applicable only for statically created kernel contexts */
> +		if (ctxt < pr->first_dyn_alloc_ctxt) {
> +			rcd->opstats = kzalloc_node(sizeof(*rcd->opstats),
> +						    GFP_KERNEL, numa);
> +			if (!rcd->opstats)
> +				goto bail;
> +
> +			/* Initialize TID flow generations for the context */
> +			hfi2_kern_init_ctxt_generations(rcd);
> +		}
> +
> +		*context = rcd;
> +		return 0;
> +	}
> +
> +bail:
> +	*context = NULL;
> +	hfi2_free_ctxt(rcd);
> +	return -ENOMEM;
> +}
> +
> +/**
> + * hfi2_free_ctxt - free context
> + * @rcd: pointer to an initialized rcd data structure
> + *
> + * This wrapper is the free function that matches hfi2_create_ctxtdata().
> + * When a context is done being used (kernel or user), this function is called
> + * for the "final" put to match the kref init from hfi2_create_ctxtdata().
> + * Other users of the context do a get/put sequence to make sure that the
> + * structure isn't removed while in use.
> + */
> +void hfi2_free_ctxt(struct hfi2_ctxtdata *rcd)
> +{
> +	hfi2_rcd_put(rcd);
> +}
> +
> +/*
> + * Select the largest ccti value over all SLs to determine the intra-
> + * packet gap for the link.
> + *
> + * called with cca_timer_lock held (to protect access to cca_timer
> + * array), and rcu_read_lock() (to protect access to cc_state).
> + */
> +void set_link_ipg(struct hfi2_pportdata *ppd)
> +{
> +	struct hfi2_devdata *dd = ppd->dd;
> +	struct cc_state *cc_state;
> +	int i;
> +	u16 cce, ccti_limit, max_ccti = 0;
> +	u16 shift, mult;
> +	u64 src;
> +	u32 current_egress_rate; /* Mbits /sec */
> +	u64 max_pkt_time;
> +	/*
> +	 * max_pkt_time is the maximum packet egress time in units
> +	 * of the fabric clock period 1/(805 MHz).
> +	 */
> +
> +	cc_state = get_cc_state(ppd);
> +
> +	if (!cc_state)
> +		/*
> +		 * This should _never_ happen - rcu_read_lock() is held,
> +		 * and set_link_ipg() should not be called if cc_state
> +		 * is NULL.
> +		 */
> +		return;
> +
> +	for (i = 0; i < OPA_MAX_SLS; i++) {
> +		u16 ccti = ppd->cca_timer[i].ccti;
> +
> +		if (ccti > max_ccti)
> +			max_ccti = ccti;
> +	}
> +
> +	ccti_limit = cc_state->cct.ccti_limit;
> +	if (max_ccti > ccti_limit)
> +		max_ccti = ccti_limit;
> +
> +	cce = cc_state->cct.entries[max_ccti].entry;
> +	shift = (cce & 0xc000) >> 14;
> +	mult = (cce & 0x3fff);
> +
> +	current_egress_rate = active_egress_rate(ppd);
> +
> +	max_pkt_time = egress_cycles(ppd->ibmaxlen, current_egress_rate);
> +
> +	src = (max_pkt_time >> shift) * mult;
> +
> +	src &= SEND_STATIC_RATE_CONTROL_CSR_SRC_RELOAD_SMASK;
> +	src <<= SEND_STATIC_RATE_CONTROL_CSR_SRC_RELOAD_SHIFT;
> +
> +	write_eport_csr(dd, ppd->hw_pidx, dd->params->send_static_rate_control_reg, src);
> +}
> +
> +static enum hrtimer_restart cca_timer_fn(struct hrtimer *t)
> +{
> +	struct cca_timer *cca_timer;
> +	struct hfi2_pportdata *ppd;
> +	int sl;
> +	u16 ccti_timer, ccti_min;
> +	struct cc_state *cc_state;
> +	unsigned long flags;
> +	enum hrtimer_restart ret = HRTIMER_NORESTART;
> +
> +	cca_timer = container_of(t, struct cca_timer, hrtimer);
> +	ppd = cca_timer->ppd;
> +	sl = cca_timer->sl;
> +
> +	rcu_read_lock();
> +
> +	cc_state = get_cc_state(ppd);
> +
> +	if (!cc_state) {
> +		rcu_read_unlock();
> +		return HRTIMER_NORESTART;
> +	}
> +
> +	/*
> +	 * 1) decrement ccti for SL
> +	 * 2) calculate IPG for link (set_link_ipg())
> +	 * 3) restart timer, unless ccti is at min value
> +	 */
> +
> +	ccti_min = cc_state->cong_setting.entries[sl].ccti_min;
> +	ccti_timer = cc_state->cong_setting.entries[sl].ccti_timer;
> +
> +	spin_lock_irqsave(&ppd->cca_timer_lock, flags);
> +
> +	if (cca_timer->ccti > ccti_min) {
> +		cca_timer->ccti--;
> +		set_link_ipg(ppd);
> +	}
> +
> +	if (cca_timer->ccti > ccti_min) {
> +		unsigned long nsec = 1024 * ccti_timer;
> +		/* ccti_timer is in units of 1.024 usec */
> +		hrtimer_forward_now(t, ns_to_ktime(nsec));
> +		ret = HRTIMER_RESTART;
> +	}
> +
> +	spin_unlock_irqrestore(&ppd->cca_timer_lock, flags);
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
> +/*
> + * Common code for initializing the physical port structure.
> + */
> +void hfi2_init_pportdata(struct pci_dev *pdev, struct hfi2_pportdata *ppd,
> +			 struct hfi2_devdata *dd, u8 hw_pidx, u32 port)
> +{
> +	int i;
> +	uint default_pkey_idx;
> +	struct cc_state *cc_state;
> +
> +	ppd->dd = dd;
> +	ppd->hw_pidx = hw_pidx;
> +	ppd->port = port; /* IB port number, not index */
> +	ppd->prev_link_width = LINK_WIDTH_DEFAULT;
> +	/*
> +	 * There are C_VL_COUNT number of PortVLXmitWait counters.
> +	 * Adding 1 to C_VL_COUNT to include the PortXmitWait counter.
> +	 */
> +	for (i = 0; i < C_VL_COUNT + 1; i++) {
> +		ppd->port_vl_xmit_wait_last[i] = 0;
> +		ppd->vl_xmit_flit_cnt[i] = 0;
> +	}
> +
> +	default_pkey_idx = 1;
> +
> +	ppd->pkeys[default_pkey_idx] = DEFAULT_P_KEY;
> +	ppd->part_enforce |= HFI2_PART_ENFORCE_IN;
> +	ppd->pkeys[0] = 0x8001;
> +
> +	INIT_WORK(&ppd->link_vc_work, handle_verify_cap);
> +	INIT_WORK(&ppd->link_up_work, handle_link_up);
> +	INIT_WORK(&ppd->link_down_work, handle_link_down);
> +	INIT_WORK(&ppd->link_downgrade_work, handle_link_downgrade);
> +	INIT_WORK(&ppd->sma_message_work, handle_sma_message);
> +	INIT_WORK(&ppd->link_bounce_work, dd->params->handle_link_bounce);
> +	INIT_DELAYED_WORK(&ppd->start_link_work, handle_start_link);
> +	INIT_WORK(&ppd->linkstate_active_work, receive_interrupt_work);
> +	INIT_WORK(&ppd->qsfp_info.qsfp_work, qsfp_event);
> +
> +	mutex_init(&ppd->hls_lock);
> +	spin_lock_init(&ppd->qsfp_info.qsfp_lock);
> +	seqlock_init(&ppd->sc2vl_lock);
> +
> +	ppd->qsfp_info.ppd = ppd;
> +	ppd->sm_trap_qp = 0x0;
> +	ppd->sa_qp = 0x1;
> +
> +	spin_lock_init(&ppd->cca_timer_lock);
> +
> +	for (i = 0; i < OPA_MAX_SLS; i++) {
> +		ppd->cca_timer[i].ppd = ppd;
> +		ppd->cca_timer[i].sl = i;
> +		ppd->cca_timer[i].ccti = 0;
> +		hrtimer_setup(&ppd->cca_timer[i].hrtimer, cca_timer_fn, CLOCK_MONOTONIC,
> +			      HRTIMER_MODE_REL);
> +	}
> +
> +	ppd->cc_max_table_entries = IB_CC_TABLE_CAP_DEFAULT;
> +
> +	spin_lock_init(&ppd->cc_state_lock);
> +	spin_lock_init(&ppd->cc_log_lock);
> +	cc_state = kzalloc_obj(cc_state, GFP_KERNEL);
> +	RCU_INIT_POINTER(ppd->cc_state, cc_state);
> +	if (!cc_state)
> +		goto bail;
> +	atomic_set(&ppd->ipoib_rsm_usr_num, 0);
> +	ppd->netdev_rsm_rule = -1;
> +	return;
> +
> +bail:
> +	dd_dev_err(dd, "Congestion Control Agent disabled for port %d\n", port);
> +}
> +
> +/*
> + * Do initialization for device that is only needed on
> + * first detect, not on resets.
> + */
> +static int loadtime_init(struct hfi2_devdata *dd)
> +{
> +	return 0;
> +}
> +
> +/**
> + * init_after_reset - re-initialize after a reset
> + * @dd: the hfi2_ib device
> + *
> + * sanity check at least some of the values after reset, and
> + * ensure no receive or transmit (explicitly, in case reset
> + * failed
> + */
> +static int init_after_reset(struct hfi2_devdata *dd)
> +{
> +	struct hfi2_devrsrcs *dr = &dd->rsrcs;
> +	int i;
> +	int j;
> +	struct hfi2_ctxtdata *rcd;
> +	/*
> +	 * Ensure chip does no sends or receives, tail updates, or
> +	 * pioavail updates while we re-initialize.  This is mostly
> +	 * for the driver data structures, not chip registers.
> +	 */
> +	for (i = 0; i < dd->num_pports; i++) {
> +		struct hfi2_portrsrcs *pr = &dr->ppr[i];
> +
> +		for (j = 0; j < pr->num_rcv_contexts; j++) {
> +			u16 ctxt = pr->rcv_context_base + j;
> +
> +			rcd = hfi2_rcd_get_by_index(dd, ctxt);
> +			hfi2_rcvctrl(dd, HFI2_RCVCTRL_CTXT_DIS |
> +				     HFI2_RCVCTRL_INTRAVAIL_DIS |
> +				     HFI2_RCVCTRL_TAILUPD_DIS, rcd);
> +			hfi2_rcd_put(rcd);
> +		}
> +	}
> +	for (i = 0; i < dd->num_pports; i++)
> +		pio_send_control(&dd->pport[i], PSC_GLOBAL_DISABLE);
> +	for (i = 0; i < dd->num_send_contexts; i++)
> +		sc_disable(dd->send_contexts[i].sc);
> +
> +	return 0;
> +}
> +
> +static void enable_chip(struct hfi2_devdata *dd)
> +{
> +	struct hfi2_devrsrcs *dr = &dd->rsrcs;
> +	struct hfi2_ctxtdata *rcd;
> +	u32 rcvmask;
> +	u16 i;
> +	u16 j;
> +
> +	/* enable PIO send */
> +	for (i = 0; i < dd->num_pports; i++)
> +		pio_send_control(&dd->pport[i], PSC_GLOBAL_ENABLE);
> +
> +	/*
> +	 * Enable kernel ctxts' receive and receive interrupt.
> +	 * Other ctxts done as user opens and initializes them.
> +	 */
> +	for (i = 0; i < dd->num_pports; i++) {
> +		struct hfi2_portrsrcs *pr = &dr->ppr[i];
> +
> +		for (j = 0; j < pr->n_krcv_queues; j++) {
> +			u16 ctxt = pr->rcv_context_base + j;
> +
> +			rcd = hfi2_rcd_get_by_index(dd, ctxt);
> +			if (!rcd)
> +				continue;
> +			rcvmask = HFI2_RCVCTRL_CTXT_ENB
> +				  | HFI2_RCVCTRL_INTRAVAIL_ENB;
> +			if (HFI2_CAP_KGET_MASK(rcd->flags, DMA_RTAIL))
> +				rcvmask |= HFI2_RCVCTRL_TAILUPD_ENB;
> +			else
> +				rcvmask |= HFI2_RCVCTRL_TAILUPD_DIS;
> +			if (!HFI2_CAP_KGET_MASK(rcd->flags, MULTI_PKT_EGR))
> +				rcvmask |= HFI2_RCVCTRL_ONE_PKT_EGR_ENB;
> +			if (HFI2_CAP_KGET_MASK(rcd->flags, NODROP_RHQ_FULL))
> +				rcvmask |= HFI2_RCVCTRL_NO_RHQ_DROP_ENB;
> +			if (HFI2_CAP_KGET_MASK(rcd->flags, NODROP_EGR_FULL))
> +				rcvmask |= HFI2_RCVCTRL_NO_EGR_DROP_ENB;
> +			if (HFI2_CAP_IS_KSET(TID_RDMA))
> +				rcvmask |= HFI2_RCVCTRL_TIDFLOW_ENB;
> +			hfi2_rcvctrl(dd, rcvmask, rcd);
> +			sc_enable(rcd->sc);
> +			hfi2_rcd_put(rcd);
> +		}
> +	}
> +}
> +
> +/**
> + * create_workqueues - create per port workqueues
> + * @dd: the hfi2_ib device
> + */
> +static int create_workqueues(struct hfi2_devdata *dd)
> +{
> +	int pidx;
> +	struct hfi2_pportdata *ppd;
> +
> +	if (!dd->hfi2_wq) {
> +		dd->hfi2_wq = alloc_workqueue("hfi%d",
> +					      WQ_SYSFS | WQ_HIGHPRI |
> +					      WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM |
> +					      WQ_PERCPU,
> +					      HFI2_MAX_ACTIVE_GEN_WQ_ENTRIES,
> +					      dd->unit);
> +		if (!dd->hfi2_wq)
> +			goto wq_error;
> +	}
> +	for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> +		ppd = dd->pport + pidx;
> +		if (!ppd->link_wq) {
> +			/*
> +			 * Make the link workqueue single-threaded to enforce
> +			 * serialization.
> +			 */
> +			ppd->link_wq = alloc_workqueue("hfi_link_%d_%d",
> +						       WQ_SYSFS |
> +						       WQ_MEM_RECLAIM |
> +						       WQ_UNBOUND,
> +						       1, /* max_active */
> +						       dd->unit, pidx);
> +			if (!ppd->link_wq) {
> +				pr_err("alloc_workqueue failed for port %d\n",
> +				       pidx + 1);
> +				goto wq_error;
> +			}
> +		}
> +	}
> +	return 0;
> +
> +wq_error:
> +	destroy_workqueues(dd);
> +	return -ENOMEM;
> +}
> +
> +/**
> + * destroy_workqueues - destroy per port workqueues
> + * @dd: the hfi2_ib device
> + */
> +static void destroy_workqueues(struct hfi2_devdata *dd)
> +{
> +	int pidx;
> +	struct hfi2_pportdata *ppd;
> +
> +	for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> +		ppd = dd->pport + pidx;
> +
> +		if (ppd->link_wq) {
> +			destroy_workqueue(ppd->link_wq);
> +			ppd->link_wq = NULL;
> +		}
> +	}
> +	if (dd->hfi2_wq) {
> +		destroy_workqueue(dd->hfi2_wq);
> +		dd->hfi2_wq = NULL;
> +	}
> +}
> +
> +/**
> + * enable_general_intr() - Enable the IRQs that will be handled by the
> + * general interrupt handler.
> + * @dd: valid devdata
> + *
> + */
> +static void enable_general_intr(struct hfi2_devdata *dd)
> +{
> +	const struct gi_enable_entry *entry = dd->params->gi_enable_table;
> +
> +	for (; entry->start <= entry->end; entry++)
> +		set_intr_bits(dd, entry->start, entry->end, true);
> +}
> +
> +static void wfr_start_port(struct hfi2_pportdata *ppd)
> +{
> +	int ret;
> +
> +	init_qsfp_int(ppd);
> +
> +	/*
> +	 * start the serdes - must be after interrupts are
> +	 * enabled so we are notified when the link goes up
> +	 */
> +	ret = bringup_serdes(ppd);
> +	if (ret)
> +		ppd_dev_info(ppd, "Failed to bring up port\n");
> +}
> +
> +static void wfr_stop_port(struct hfi2_pportdata *ppd)
> +{
> +	/*
> +	 * Clear SerdesEnable.
> +	 * We can't count on interrupts since we are stopping.
> +	 */
> +	hfi2_quiet_serdes(ppd);
> +	if (ppd->link_wq)
> +		flush_workqueue(ppd->link_wq);
> +}
> +
> +/**
> + * hfi2_init - do the actual initialization sequence on the chip
> + * @dd: the hfi2_ib device
> + * @reinit: re-initializing, so don't allocate new memory
> + *
> + * Do the actual initialization sequence on the chip.  This is done
> + * both from the init routine called from the PCI infrastructure, and
> + * when we reset the chip, or detect that it was reset internally,
> + * or it's administratively re-enabled.
> + *
> + * Memory allocation here and in called routines is only done in
> + * the first case (reinit == 0).  We have to be careful, because even
> + * without memory allocation, we need to re-write all the chip registers
> + * TIDs, etc. after the reset or enable has completed.
> + */
> +int hfi2_init(struct hfi2_devdata *dd, int reinit)
> +{
> +	struct hfi2_devrsrcs *dr = &dd->rsrcs;
> +	int ret = 0, pidx, lastfail = 0;
> +	unsigned long len;
> +	u16 i;
> +	struct hfi2_ctxtdata *rcd;
> +	struct hfi2_pportdata *ppd;
> +
> +	/* Set up send low level handlers */
> +	dd->process_pio_send = hfi2_verbs_send_pio;
> +	dd->process_dma_send = hfi2_verbs_send_dma;
> +	dd->pio_inline_send = pio_copy;
> +
> +	if (is_ax(dd)) {
> +		atomic_set(&dd->drop_packet, DROP_PACKET_ON);
> +		dd->do_drop = true;
> +	} else {
> +		atomic_set(&dd->drop_packet, DROP_PACKET_OFF);
> +		dd->do_drop = false;
> +	}
> +
> +	/* make sure the link is not "up" */
> +	for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> +		ppd = dd->pport + pidx;
> +		ppd->linkup = 0;
> +	}
> +
> +	if (reinit)
> +		ret = init_after_reset(dd);
> +	else
> +		ret = loadtime_init(dd);
> +	if (ret)
> +		goto done;
> +
> +	/* dd->rcd can be NULL if early initialization failed */
> +	for (pidx = 0; dd->rcd && pidx < dd->num_pports; pidx++) {
> +		struct hfi2_portrsrcs *pr = &dr->ppr[pidx];
> +
> +		for (i = 0; i < pr->n_krcv_queues; ++i) {
> +			u16 ctxt = pr->rcv_context_base + i;
> +			/*
> +			 * Set up the (kernel) rcvhdr queue and egr TIDs.  If
> +			 * doing re-init, the simplest way to handle this is
> +			 * to free existing, and re-allocate.
> +			 * Need to re-create rest of ctxt 0 ctxtdata as well.
> +			 */
> +			rcd = hfi2_rcd_get_by_index(dd, ctxt);
> +			if (!rcd)
> +				continue;
> +
> +			lastfail = hfi2_create_rcvhdrq(dd, rcd);
> +			if (!lastfail)
> +				lastfail = hfi2_setup_eagerbufs(rcd);
> +			if (!lastfail)
> +				lastfail = hfi2_kern_exp_rcv_init(rcd, reinit);
> +			if (lastfail) {
> +				dd_dev_err(dd,
> +					   "failed to allocate kernel ctxt's rcvhdrq and/or egr bufs\n");
> +				ret = lastfail;
> +			}
> +			/* enable IRQ */
> +			hfi2_rcd_put(rcd);
> +		}
> +	}
> +
> +	/*
> +	 * so this is making dd->events much larger than needed. Unfortunately,
> +	 * uctxt_offset() uses the h/w context number and so all that would
> +	 * need to change in order to fix this.
> +	 */
> +	/* Allocate enough memory for user event notification. */
> +	len = PAGE_ALIGN(chip_rcv_contexts(dd) * HFI2_MAX_SHARED_CTXTS *
> +			 sizeof(*dd->events));
> +	dd->events = vmalloc_user(len);
> +	if (!dd->events)
> +		dd_dev_err(dd, "Failed to allocate user events page\n");
> +	/*
> +	 * Allocate a page for device and port status.
> +	 * Page will be shared amongst all user processes.
> +	 */
> +	dd->status = vmalloc_user(PAGE_SIZE);
> +	if (!dd->status)
> +		dd_dev_err(dd, "Failed to allocate dev status page\n");
> +	for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> +		ppd = dd->pport + pidx;
> +		if (dd->status)
> +			ppd->statusp = &dd->status->ports[pidx];
> +
> +		set_mtu(ppd);
> +	}
> +
> +	/* enable chip even if we have an error, so we can debug cause */
> +	enable_chip(dd);
> +
> +done:
> +	/*
> +	 * Set status even if port serdes is not initialized
> +	 * so that diags will work.
> +	 */
> +	if (dd->status)
> +		dd->status->dev |= HFI2_STATUS_CHIP_PRESENT |
> +			HFI2_STATUS_INITTED;
> +	if (!ret) {
> +		/* enable all interrupts from the chip */
> +		enable_general_intr(dd);
> +
> +		/* chip is OK for user apps; mark it as initialized */
> +		for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> +			ppd = dd->pport + pidx;
> +
> +			dd->params->start_port(ppd);
> +
> +			/*
> +			 * Set status even if port serdes is not initialized
> +			 * so that diags will work.
> +			 */
> +			if (ppd->statusp)
> +				*ppd->statusp |= HFI2_STATUS_CHIP_PRESENT |
> +							HFI2_STATUS_INITTED;
> +		}
> +	}
> +
> +	/* if ret is non-zero, we probably should do some cleanup here... */
> +	return ret;
> +}
> +
> +struct hfi2_devdata *hfi2_lookup(int unit)
> +{
> +	return xa_load(&hfi2_dev_table, unit);
> +}
> +
> +/*
> + * Stop the timers during unit shutdown, or after an error late
> + * in initialization.
> + */
> +static void stop_timers(struct hfi2_devdata *dd)
> +{
> +	struct hfi2_pportdata *ppd;
> +	int pidx;
> +
> +	for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> +		ppd = dd->pport + pidx;
> +		if (ppd->led_override_timer.function) {
> +			timer_delete_sync(&ppd->led_override_timer);
> +			atomic_set(&ppd->led_override_timer_active, 0);
> +		}
> +		if (ppd->ibport_data.rvp.trap_timer.function)
> +			timer_delete_sync(&ppd->ibport_data.rvp.trap_timer);
> +	}
> +}
> +
> +/**
> + * shutdown_device - shut down a device
> + * @dd: the hfi2_ib device
> + *
> + * This is called to make the device quiet when we are about to
> + * unload the driver, and also when the device is administratively
> + * disabled.   It does not free any data structures.
> + * Everything it does has to be setup again by hfi2_init(dd, 1)
> + */
> +static void shutdown_device(struct hfi2_devdata *dd)
> +{
> +	struct hfi2_devrsrcs *dr = &dd->rsrcs;
> +	struct hfi2_pportdata *ppd;
> +	struct hfi2_ctxtdata *rcd;
> +	unsigned int pidx;
> +	int i;
> +
> +	if (dd->flags & HFI2_SHUTDOWN)
> +		return;
> +	dd->flags |= HFI2_SHUTDOWN;
> +
> +	for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> +		ppd = dd->pport + pidx;
> +
> +		ppd->linkup = 0;
> +		if (ppd->statusp)
> +			*ppd->statusp &= ~(HFI2_STATUS_IB_CONF |
> +					   HFI2_STATUS_IB_READY);
> +	}
> +	dd->flags &= ~HFI2_INITTED;
> +
> +	/*
> +	 * Drop all traps.  After this point, there should be no more cport
> +	 * handlers that depend on driver state.
> +	 */
> +	clearall_cport_trap(dd);
> +
> +	/* disable all interrupts except cport response */
> +	if (dd->params->chip_type == CHIP_WFR) {
> +		/* WFR has no cport */
> +		set_intr_bits(dd, 0, dd->params->is_last_source, false);
> +		msix_shut_down_interrupts(dd, false);
> +	} else {
> +		vf2pf_deinit_irq(dd); /* gracefully stop using interrupts */
> +		/* mask all but the cport interrupt source */
> +		set_intr_bits(dd, 0, dd->params->is_cport_int - 1, false);
> +		set_intr_bits(dd, dd->params->is_cport_int + 1,
> +			      dd->params->is_last_source, false);
> +		msix_shut_down_interrupts(dd, true);
> +	}
> +
> +	for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> +		struct hfi2_portrsrcs *pr = &dr->ppr[pidx];
> +
> +		ppd = dd->pport + pidx;
> +		for (i = 0; i < pr->num_rcv_contexts; i++) {
> +			u16 ctxt = pr->rcv_context_base + i;
> +
> +			rcd = hfi2_rcd_get_by_index(dd, ctxt);
> +			hfi2_rcvctrl(dd, HFI2_RCVCTRL_TAILUPD_DIS |
> +				     HFI2_RCVCTRL_CTXT_DIS |
> +				     HFI2_RCVCTRL_INTRAVAIL_DIS |
> +				     HFI2_RCVCTRL_PKEY_DIS |
> +				     HFI2_RCVCTRL_ONE_PKT_EGR_DIS, rcd);
> +			hfi2_rcd_put(rcd);
> +		}
> +	}
> +	/*
> +	 * Gracefully stop all sends allowing any in progress to
> +	 * trickle out first.
> +	 */
> +	for (i = 0; i < dd->num_send_contexts; i++)
> +		sc_flush(dd->send_contexts[i].sc);
> +
> +	/*
> +	 * Enough for anything that's going to trickle out to have actually
> +	 * done so.
> +	 */
> +	udelay(20);
> +
> +	/* disable all contexts */
> +	for (i = 0; i < dd->num_send_contexts; i++)
> +		sc_disable(dd->send_contexts[i].sc);
> +
> +	for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> +		ppd = dd->pport + pidx;
> +
> +		/* disable the send device */
> +		pio_send_control(ppd, PSC_GLOBAL_DISABLE);
> +
> +		dd->params->shutdown_led_override(ppd);
> +
> +		dd->params->stop_port(ppd);
> +	}
> +	if (dd->hfi2_wq)
> +		flush_workqueue(dd->hfi2_wq);
> +	sdma_exit(dd);
> +}
> +
> +/*
> + * SRIOV has been disabled. Do any cleanup not handled by
> + * VF remove_one() calls.
> + */
> +void hfi2_pf0_cleanup(struct hfi2_devdata *dd)
> +{
> +	restore_qpmap_table(dd);
> +}
> +
> +/**
> + * hfi2_free_ctxtdata - free a context's allocated data
> + * @dd: the hfi2_ib device
> + * @rcd: the ctxtdata structure
> + *
> + * free up any allocated data for a context
> + * It should never change any chip state, or global driver state.
> + */
> +void hfi2_free_ctxtdata(struct hfi2_devdata *dd, struct hfi2_ctxtdata *rcd)
> +{
> +	u32 e;
> +
> +	if (!rcd)
> +		return;
> +
> +	if (rcd->rcvhdrq) {
> +		dma_free_coherent(&dd->pcidev->dev, rcvhdrq_size(rcd),
> +				  rcd->rcvhdrq, rcd->rcvhdrq_dma);
> +		rcd->rcvhdrq = NULL;
> +		if (hfi2_rcvhdrtail_kvaddr(rcd)) {
> +			dma_free_coherent(&dd->pcidev->dev, PAGE_SIZE,
> +					  (void *)hfi2_rcvhdrtail_kvaddr(rcd),
> +					  rcd->rcvhdrqtailaddr_dma);
> +			rcd->rcvhdrtail_kvaddr = NULL;
> +		}
> +	}
> +	if (rcd->rheq) {
> +		dma_free_coherent(&dd->pcidev->dev, rheq_size(rcd),
> +				  rcd->rheq, rcd->rheq_dma);
> +		rcd->rheq = NULL;
> +	}
> +
> +	/* all the RcvArray entries should have been cleared by now */
> +	kfree(rcd->egrbufs.rcvtids);
> +	rcd->egrbufs.rcvtids = NULL;
> +
> +	for (e = 0; e < rcd->egrbufs.alloced; e++) {
> +		if (rcd->egrbufs.buffers[e].addr)
> +			dma_free_coherent(&dd->pcidev->dev,
> +					  rcd->egrbufs.buffers[e].len,
> +					  rcd->egrbufs.buffers[e].addr,
> +					  rcd->egrbufs.buffers[e].dma);
> +	}
> +	kfree(rcd->egrbufs.buffers);
> +	rcd->egrbufs.alloced = 0;
> +	rcd->egrbufs.buffers = NULL;
> +
> +	sc_free(rcd->sc);
> +	rcd->sc = NULL;
> +
> +	vfree(rcd->subctxt_uregbase);
> +	vfree(rcd->subctxt_rcvegrbuf);
> +	vfree(rcd->subctxt_rcvhdr_base);
> +	kfree(rcd->opstats);
> +
> +	rcd->subctxt_uregbase = NULL;
> +	rcd->subctxt_rcvegrbuf = NULL;
> +	rcd->subctxt_rcvhdr_base = NULL;
> +	rcd->opstats = NULL;
> +}
> +
> +/*
> + * Release our hold on the shared asic data.  If we are the last one,
> + * return the structure to be finalized outside the lock.  Must be
> + * holding hfi2_dev_table lock.
> + */
> +static struct hfi2_asic_data *release_asic_data(struct hfi2_devdata *dd)
> +{
> +	struct hfi2_asic_data *ad;
> +	int other;
> +
> +	if (!dd->asic_data)
> +		return NULL;
> +	dd->asic_data->dds[dd->hfi2_id] = NULL;
> +	other = dd->hfi2_id ? 0 : 1;
> +	ad = dd->asic_data;
> +	dd->asic_data = NULL;
> +	/* return NULL if the other dd still has a link */
> +	return ad->dds[other] ? NULL : ad;
> +}
> +
> +static void finalize_asic_data(struct hfi2_devdata *dd,
> +			       struct hfi2_asic_data *ad)
> +{
> +	clean_up_i2c(dd, ad);
> +	kfree(ad);
> +}
> +
> +/**
> + * hfi2_free_devdata - cleans up and frees per-unit data structure
> + * @dd: pointer to a valid devdata structure
> + *
> + * It cleans up and frees all data structures set up by
> + * hfi2_alloc_devdata().
> + */
> +static void hfi2_free_devdata(struct hfi2_devdata *dd)
> +{
> +	struct hfi2_asic_data *ad;
> +	unsigned long flags;
> +
> +	xa_lock_irqsave(&hfi2_dev_table, flags);
> +	__xa_erase(&hfi2_dev_table, dd->unit);
> +	ad = release_asic_data(dd);
> +	xa_unlock_irqrestore(&hfi2_dev_table, flags);
> +
> +	finalize_asic_data(dd, ad);
> +	free_platform_config(dd);
> +	rcu_barrier(); /* wait for rcu callbacks to complete */
> +	free_percpu(dd->int_counter);
> +	free_percpu(dd->rcv_limit);
> +	free_percpu(dd->send_schedule);
> +	free_percpu(dd->tx_opstats);
> +	dd->int_counter   = NULL;
> +	dd->rcv_limit     = NULL;
> +	dd->send_schedule = NULL;
> +	dd->tx_opstats    = NULL;
> +	kfree(dd->comp_vect);
> +	dd->comp_vect = NULL;
> +	if (dd->rcvhdrtail_dummy_kvaddr)
> +		dma_free_coherent(&dd->pcidev->dev, sizeof(u64),
> +				  (void *)dd->rcvhdrtail_dummy_kvaddr,
> +				  dd->rcvhdrtail_dummy_dma);
> +	dd->rcvhdrtail_dummy_kvaddr = NULL;
> +	sdma_clean(dd);
> +	hfi2_sriov_free_cfg(dd);
> +	/* dd is freed by the time this returns: */
> +	rvt_dealloc_device(&dd->verbs_dev.rdi);
> +}
> +
> +/**
> + * hfi2_alloc_devdata - Allocate our primary per-unit data structure.
> + * @pdev: Valid PCI device
> + * @extra: How many bytes to alloc past the default
> + *
> + * Must be done via verbs allocator, because the verbs cleanup process
> + * both does cleanup and free of the data structure.
> + * "extra" is for chip-specific data.
> + */
> +static struct hfi2_devdata *hfi2_alloc_devdata(struct pci_dev *pdev,
> +					       const struct chip_params *params)
> +{
> +	struct hfi2_devdata *dd;
> +	size_t extra;
> +	int ret, nports;
> +
> +	nports = params->num_ports;
> +	extra = nports * sizeof(struct hfi2_pportdata);
> +	dd = (struct hfi2_devdata *)rvt_alloc_device(sizeof(*dd) + extra,
> +						     nports);
> +	if (!dd)
> +		return ERR_PTR(-ENOMEM);
> +	dd->params = params;
> +	dd->num_pports = nports;
> +	dd->pport = (struct hfi2_pportdata *)(dd + 1);
> +	dd->pcidev = pdev;
> +	/*
> +	 * Check for PCI device being a VF in SRIOV.
> +	 * The VFs do not have a Power Management capability block.
> +	 */
> +	dd->is_vf = (params->chip_type != CHIP_WFR && !pdev->pm_cap);
> +	dd->is_sriov = (dd->is_vf || sriov_is_enabled());
> +#if defined(CONFIG_X86)
> +	dd->is_vm = boot_cpu_has(X86_FEATURE_HYPERVISOR);
> +#endif
> +#ifdef PDEV_SRIOV_DEBUG
> +	dev_warn(&pdev->dev, "is_vm=%d is_vf=%d is_physfn=%d is_virtfn=%d physfn=%p\n",
> +		dd->is_vm, dd->is_vf, pdev->is_physfn, pdev->is_virtfn, pdev->physfn);
> +#endif
> +	pci_set_drvdata(pdev, dd);
> +
> +	/*
> +	 * Must set DMA mask for device before any dma_map*() or
> +	 * dma_alloc*() calls referring to pdev->dev. Otherwise
> +	 * those calls may return DMA addresses that are
> +	 * incompatible with the HFI.
> +	 */
> +	ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(params->dma_mask_bits));
> +	if (ret) {
> +		dd_dev_warn(dd, "Failed to set %u-bit DMA mask ret %d; setting 32-bit DMA mask\n",
> +			    params->dma_mask_bits, ret);
> +		ret = dma_set_mask_and_coherent(&pdev->dev,
> +						DMA_BIT_MASK(32));
> +		if (ret) {
> +			dd_dev_err(dd, "Unable to set DMA mask: %d\n",
> +				   ret);
> +			goto bail;
> +		}
> +	}
> +
> +	ret = xa_alloc_irq(&hfi2_dev_table, &dd->unit, dd, xa_limit_32b,
> +			GFP_KERNEL);
> +	if (ret < 0) {
> +		dev_err(&pdev->dev,
> +			"Could not allocate unit ID: error %d\n", -ret);
> +		goto bail;
> +	}
> +	rvt_set_ibdev_name(&dd->verbs_dev.rdi, "%s_%d", "hfi2", dd->unit);
> +	/*
> +	 * If the BIOS does not have the NUMA node information set, select
> +	 * NUMA 0 so we get consistent performance.
> +	 */
> +	dd->node = pcibus_to_node(pdev->bus);
> +	if (dd->node == NUMA_NO_NODE) {
> +		dd_dev_err(dd, "Invalid PCI NUMA node. Performance may be affected\n");
> +		dd->node = 0;
> +	}
> +
> +	/*
> +	 * Initialize all locks for the device. This needs to be as early as
> +	 * possible so locks are usable.
> +	 */
> +	spin_lock_init(&dd->sc_lock);
> +	spin_lock_init(&dd->sendctrl_lock);
> +	spin_lock_init(&dd->rcvctrl_lock);
> +	spin_lock_init(&dd->uctxt_lock);
> +	spin_lock_init(&dd->sc_init_lock);
> +	spin_lock_init(&dd->dc8051_memlock);
> +	spin_lock_init(&dd->sde_map_lock);
> +	spin_lock_init(&dd->pio_map_lock);
> +	mutex_init(&dd->dc8051_lock);
> +	init_waitqueue_head(&dd->event_queue);
> +	spin_lock_init(&dd->irq_src_lock);
> +	INIT_WORK(&dd->freeze_work, handle_freeze);
> +
> +	dd->int_counter = alloc_percpu(u64);
> +	if (!dd->int_counter) {
> +		ret = -ENOMEM;
> +		goto bail;
> +	}
> +
> +	dd->rcv_limit = alloc_percpu(u64);
> +	if (!dd->rcv_limit) {
> +		ret = -ENOMEM;
> +		goto bail;
> +	}
> +
> +	dd->send_schedule = alloc_percpu(u64);
> +	if (!dd->send_schedule) {
> +		ret = -ENOMEM;
> +		goto bail;
> +	}
> +
> +	dd->tx_opstats = alloc_percpu(struct hfi2_opcode_stats_perctx);
> +	if (!dd->tx_opstats) {
> +		ret = -ENOMEM;
> +		goto bail;
> +	}
> +
> +	dd->comp_vect = kzalloc(sizeof(*dd->comp_vect), GFP_KERNEL);
> +	if (!dd->comp_vect) {
> +		ret = -ENOMEM;
> +		goto bail;
> +	}
> +
> +	/* allocate dummy tail memory for all receive contexts */
> +	dd->rcvhdrtail_dummy_kvaddr =
> +		dma_alloc_coherent(&dd->pcidev->dev, sizeof(u64),
> +				   &dd->rcvhdrtail_dummy_dma, GFP_KERNEL);
> +	if (!dd->rcvhdrtail_dummy_kvaddr) {
> +		ret = -ENOMEM;
> +		goto bail;
> +	}
> +
> +	return dd;
> +
> +bail:
> +	hfi2_free_devdata(dd);
> +	return ERR_PTR(ret);
> +}
> +
> +/*
> + * Called from freeze mode handlers, and from PCI error
> + * reporting code.  Should be paranoid about state of
> + * system and data structures.
> + */
> +void hfi2_disable_after_error(struct hfi2_devdata *dd)
> +{
> +	if (dd->flags & HFI2_INITTED) {
> +		u32 pidx;
> +
> +		dd->flags &= ~HFI2_INITTED;
> +		if (dd->pport)
> +			for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> +				struct hfi2_pportdata *ppd;
> +
> +				ppd = dd->pport + pidx;
> +				if (dd->flags & HFI2_PRESENT)
> +					set_link_state(ppd, HLS_DN_DISABLE);
> +
> +				if (ppd->statusp)
> +					*ppd->statusp &= ~HFI2_STATUS_IB_READY;
> +			}
> +	}
> +
> +	/*
> +	 * Mark as having had an error for driver, and also
> +	 * for /sys and status word mapped to user programs.
> +	 * This marks unit as not usable, until reset.
> +	 */
> +	if (dd->status)
> +		dd->status->dev |= HFI2_STATUS_HWERROR;
> +}
> +
> +static void remove_one(struct pci_dev *);
> +static int init_one(struct pci_dev *, const struct pci_device_id *);
> +static void shutdown_one(struct pci_dev *);
> +
> +#define DRIVER_LOAD_MSG "Cornelis " DRIVER_NAME " loaded: "
> +#define PFX DRIVER_NAME ": "
> +
> +const struct pci_device_id hfi2_pci_tbl[] = {
> +	{ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL0) },
> +	{ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL1) },
> +	{ PCI_DEVICE(PCI_VENDOR_ID_CORNELIS, PCI_DEVICE_ID_CORNELIS_CN5000) },
> +	{ 0, }
> +};
> +
> +MODULE_DEVICE_TABLE(pci, hfi2_pci_tbl);
> +
> +static struct pci_driver hfi2_pci_driver = {
> +	.name = DRIVER_NAME,
> +	.probe = init_one,
> +	.remove = remove_one,
> +	.shutdown = shutdown_one,
> +	.id_table = hfi2_pci_tbl,
> +	.err_handler = &hfi2_pci_err_handler,
> +	.sriov_configure = hfi2_sriov_configure,
> +};
> +
> +static void __init compute_krcvqs(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < krcvqsset; i++)
> +		n_krcvqs += krcvqs[i];
> +}
> +
> +/*
> + * Do all the generic driver unit- and chip-independent memory
> + * allocation and initialization.
> + */
> +static int __init hfi2_mod_init(void)
> +{
> +	int ret;
> +
> +	register_system_pinning_interface();
> +	register_system_tid_ops();
> +
> +	ret = node_affinity_init();
> +	if (ret)
> +		goto bail;
> +
> +	/* validate max MTU before any devices start */
> +	if (!valid_opa_max_mtu(hfi2_max_mtu)) {
> +		pr_err("Invalid max_mtu 0x%x, using 0x%x instead\n",
> +		       hfi2_max_mtu, HFI2_DEFAULT_MAX_MTU);
> +		hfi2_max_mtu = HFI2_DEFAULT_MAX_MTU;
> +	}
> +	/* valid CUs run from 1-128 in powers of 2 */
> +	if (hfi2_cu > 128 || !is_power_of_2(hfi2_cu))
> +		hfi2_cu = 1;
> +	/* valid credit return threshold is 0-100, variable is unsigned */
> +	if (user_credit_return_threshold > 100)
> +		user_credit_return_threshold = 100;
> +
> +	compute_krcvqs();
> +	/*
> +	 * sanitize receive interrupt count, time must wait until after
> +	 * the hardware type is known
> +	 */
> +	if (rcv_intr_count > RCV_HDR_HEAD_COUNTER_MASK)
> +		rcv_intr_count = RCV_HDR_HEAD_COUNTER_MASK;
> +	/* reject invalid combinations */
> +	if (rcv_intr_count == 0 && rcv_intr_timeout == 0) {
> +		pr_err("Invalid mode: both receive interrupt count and available timeout are zero - setting interrupt count to 1\n");
> +		rcv_intr_count = 1;
> +	}
> +	if (rcv_intr_count > 1 && rcv_intr_timeout == 0) {
> +		/*
> +		 * Avoid indefinite packet delivery by requiring a timeout
> +		 * if count is > 1.
> +		 */
> +		pr_err("Invalid mode: receive interrupt count greater than 1 and available timeout is zero - setting available timeout to 1\n");
> +		rcv_intr_timeout = 1;
> +	}
> +	if (rcv_intr_dynamic && !(rcv_intr_count > 1 && rcv_intr_timeout > 0)) {
> +		/*
> +		 * The dynamic algorithm expects a non-zero timeout
> +		 * and a count > 1.
> +		 */
> +		pr_err("Invalid mode: dynamic receive interrupt mitigation with invalid count and timeout - turning dynamic off\n");
> +		rcv_intr_dynamic = 0;
> +	}
> +
> +	/* sanitize link CRC options */
> +	link_crc_mask &= SUPPORTED_CRCS;
> +
> +	ret = opfn_init();
> +	if (ret < 0) {
> +		pr_err("Failed to allocate opfn_wq");
> +		goto bail_dev;
> +	}
> +
> +	/*
> +	 * These must be called before the driver is registered with
> +	 * the PCI subsystem.
> +	 */
> +	hfi2_dbg_init();
> +	/*
> +	 * This causes devices to be probed, so any initialization
> +	 * that must happen before that must be above this point.
> +	 */
> +	ret = pci_register_driver(&hfi2_pci_driver);
> +	if (ret < 0) {
> +		pr_err("Unable to register driver: error %d\n", -ret);
> +		goto bail_dev;
> +	}
> +	goto bail; /* all OK */
> +
> +bail_dev:
> +	hfi2_dbg_exit();
> +bail:
> +	return ret;
> +}
> +
> +module_init(hfi2_mod_init);
> +
> +/*
> + * Do the non-unit driver cleanup, memory free, etc. at unload.
> + */
> +static void __exit hfi2_mod_cleanup(void)
> +{
> +	pci_unregister_driver(&hfi2_pci_driver);
> +	opfn_exit();
> +	node_affinity_destroy_all();
> +	hfi2_dbg_exit();
> +
> +	WARN_ON(!xa_empty(&hfi2_dev_table));
> +	dispose_firmware();	/* asymmetric with obtain_firmware() */
> +
> +	deregister_system_tid_ops();
> +	deregister_system_pinning_interface();
> +}
> +
> +module_exit(hfi2_mod_cleanup);
> +
> +/* this can only be called after a successful initialization */
> +static void cleanup_device_data(struct hfi2_devdata *dd)
> +{
> +	int ctxt;
> +	int pidx;
> +
> +	/* users can't do anything more with chip */
> +	for (pidx = 0; pidx < dd->num_pports; ++pidx) {
> +		struct hfi2_pportdata *ppd = &dd->pport[pidx];
> +		struct cc_state *cc_state;
> +		int i;
> +
> +		if (ppd->statusp)
> +			*ppd->statusp &= ~HFI2_STATUS_CHIP_PRESENT;
> +
> +		for (i = 0; i < OPA_MAX_SLS; i++)
> +			hrtimer_cancel(&ppd->cca_timer[i].hrtimer);
> +
> +		spin_lock(&ppd->cc_state_lock);
> +		cc_state = get_cc_state_protected(ppd);
> +		RCU_INIT_POINTER(ppd->cc_state, NULL);
> +		spin_unlock(&ppd->cc_state_lock);
> +
> +		if (cc_state)
> +			kfree_rcu(cc_state, rcu);
> +	}
> +
> +	free_credit_return(dd);
> +
> +	/*
> +	 * Free any receive resources still in use (usually just kernel
> +	 * contexts) at unload.
> +	 */
> +	for (ctxt = 0; dd->rcd && ctxt < dd->num_rcd; ctxt++) {
> +		struct hfi2_ctxtdata *rcd = dd->rcd[ctxt];
> +
> +		if (rcd) {
> +			hfi2_free_ctxt_rcv_groups(rcd);
> +			hfi2_free_ctxt(rcd);
> +		}
> +	}
> +
> +	kfree(dd->rcd);
> +	dd->rcd = NULL;
> +	dd->num_rcd = 0;
> +
> +	free_pio_map(dd);
> +	/* must follow rcv context free - need to remove rcv's hooks */
> +	if (dd->send_contexts) {
> +		for (ctxt = 0; ctxt < dd->num_send_contexts; ctxt++)
> +			sc_free(dd->send_contexts[ctxt].sc);
> +	}
> +	dd->num_send_contexts = 0;
> +	kfree(dd->send_contexts);
> +	dd->send_contexts = NULL;
> +	kfree(dd->hw_to_sw);
> +	dd->hw_to_sw = NULL;
> +	/* free netdev data */
> +	hfi2_free_rx(dd);
> +	kfree(dd->boardname);
> +	vfree(dd->events);
> +	vfree(dd->status);
> +
> +	vf2pf_deinit(dd); /* still requires CSR access/permissions */
> +
> +	/* finalize the cport - CSR perms revoked on PF0 */
> +	stop_cport(dd);
> +	/* release interrupts */
> +	msix_clean_up_interrupts(dd);
> +
> +	/* CSR reads and writes are invalid after this call */
> +	hfi2_pcie_ddcleanup(dd);
> +}
> +
> +/*
> + * Clean up on unit shutdown, or error during unit load after
> + * successful initialization.
> + */
> +static void postinit_cleanup(struct hfi2_devdata *dd)
> +{
> +	hfi2_start_cleanup(dd);
> +	hfi2_comp_vectors_clean_up(dd);
> +	hfi2_dev_affinity_clean_up(dd);
> +	release_rsm_rules(dd);
> +
> +	cleanup_device_data(dd);
> +
> +	destroy_workqueues(dd);
> +	hfi2_pcie_cleanup(dd->pcidev);
> +	hfi2_free_devdata(dd);
> +}
> +
> +static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
> +{
> +	int ret = 0, pidx, initfail;
> +	struct hfi2_devdata *dd;
> +	const struct chip_params *params;
> +
> +#ifdef CONFIG_HFI_L8SIM
> +	if (!(pdev->bus->bus_flags & PCI_BUS_FLAGS_SIMULATED)) {
> +		dev_warn(&pdev->dev, "Ignoring real hardware on simulator driver\n");
> +		return -ENODEV;
> +	}
> +#endif
> +	/* VF in host driver - leave for KVM */
> +	if (pdev->is_virtfn) {
> +		/*
> +		 * It is theoretically possible for the host driver to claim
> +		 * a VF, so the decision whether to claim or not is made by
> +		 * hfi2_sriov_init(). Returning ENODEV does not fail SRIOV init.
> +		 */
> +		ret = hfi2_sriov_init(pdev); /* may do nothing */
> +		if (ret)
> +			return ret; /* do not claim device */
> +	}
> +
> +	/* First, lock the non-writable module parameters */
> +	HFI2_CAP_LOCK();
> +
> +	/* Validate dev ids */
> +	if (ent->vendor == PCI_VENDOR_ID_INTEL &&
> +	    (ent->device == PCI_DEVICE_ID_INTEL0 ||
> +	      ent->device == PCI_DEVICE_ID_INTEL1)) {
> +		params = &wfr_params;
> +	} else if (ent->vendor == PCI_VENDOR_ID_CORNELIS &&
> +		   ent->device == PCI_DEVICE_ID_CORNELIS_CN5000) {
> +		params = &jkr_params;
> +	} else {
> +		dev_err(&pdev->dev, "Failing on unknown device %04x:%04x\n",
> +			ent->vendor, ent->device);
> +		return -ENODEV;
> +	}
> +
> +	/* verify arrays are large enough */
> +	if (params->num_int_csrs > LARGEST_NUM_INT_CSRS ||
> +	    params->num_ports > LARGEST_NUM_PORTS ||
> +	    params->pkey_table_size > MAX_PKEY_VALUES) {
> +		dev_err(&pdev->dev, "Source arrays are compiled too small\n");
> +		return -EINVAL;
> +	}
> +
> +	/* Allocate the dd so we can get to work */
> +	dd = hfi2_alloc_devdata(pdev, params);
> +	if (IS_ERR(dd))
> +		return PTR_ERR(dd);
> +
> +	/* Validate some global module parameters */
> +	ret = hfi2_validate_rcvhdrcnt(dd, rcvhdrcnt);
> +	if (ret)
> +		goto free_dd;
> +
> +	/* use the encoding function as a sanitization check */
> +	if (!encode_rcv_header_entry_size(hfi2_hdrq_entsize)) {
> +		dd_dev_err(dd, "Invalid HdrQ Entry size %u\n",
> +			   hfi2_hdrq_entsize);
> +		ret = -EINVAL;
> +		goto free_dd;
> +	}
> +
> +	/* The receive eager buffer size must be set before the receive
> +	 * contexts are created.
> +	 *
> +	 * Set the eager buffer size.  Validate that it falls in a range
> +	 * allowed by the hardware - all powers of 2 between the min and
> +	 * max.  The maximum valid MTU is within the eager buffer range
> +	 * so we do not need to cap the max_mtu by an eager buffer size
> +	 * setting.
> +	 */
> +	if (eager_buffer_size) {
> +		if (!is_power_of_2(eager_buffer_size))
> +			eager_buffer_size =
> +				roundup_pow_of_two(eager_buffer_size);
> +		eager_buffer_size =
> +			clamp_val(eager_buffer_size,
> +				  MIN_EAGER_BUFFER * 8,
> +				  MAX_EAGER_BUFFER_TOTAL);
> +		dd_dev_info(dd, "Eager buffer size %u\n",
> +			    eager_buffer_size);
> +	} else {
> +		dd_dev_err(dd, "Invalid Eager buffer size of 0\n");
> +		ret = -EINVAL;
> +		goto free_dd;
> +	}
> +
> +	/* restrict value of hfi2_rcvarr_split */
> +	hfi2_rcvarr_split = clamp_val(hfi2_rcvarr_split, 0, 100);
> +
> +	ret = hfi2_pcie_init(dd);
> +	if (ret)
> +		goto free_dd;
> +
> +	ret = create_workqueues(dd);
> +	if (ret)
> +		goto pcie_cleanup;
> +
> +	/*
> +	 * Do device-specific initialization.  If hfi2_init_dd() fails, it
> +	 * cleans up after itself.
> +	 */
> +	ret = hfi2_init_dd(dd);
> +	if (ret)
> +		goto destroy_wqs; /* error already printed */
> +
> +	/* do the generic initialization */
> +	if (!ret)
> +		initfail = hfi2_init(dd, 0);
> +
> +	if (!initfail && !ret)
> +		ret = hfi2_mad_init(dd);
> +
> +	if (!initfail && !ret)
> +		ret = hfi2_register_ib_device(dd);
> +
> +	if (!initfail && !ret)
> +		ret = init_cport_trap128(dd); /* after IB device register */
> +
> +	/*
> +	 * Now ready for use.  this should be cleared whenever we
> +	 * detect a reset, or initiate one.  If earlier failure,
> +	 * we still create devices, so diags, etc. can be used
> +	 * to determine cause of problem.
> +	 */
> +	if (!initfail && !ret) {
> +		dd->flags |= HFI2_INITTED;
> +		/* create debufs files after init and ib register */
> +		hfi2_dbg_ibdev_init(&dd->verbs_dev);
> +	}
> +
> +	if (initfail || ret) {
> +		stop_cport(dd);
> +		msix_clean_up_interrupts(dd);
> +		stop_timers(dd);
> +		flush_workqueue(ib_wq);
> +		for (pidx = 0; pidx < dd->num_pports; ++pidx)
> +			dd->params->stop_port(dd->pport + pidx);
> +		if (!ret) {
> +			hfi2_unregister_ib_device(dd);
> +			hfi2_mad_deinit(dd);
> +		}
> +		postinit_cleanup(dd);
> +		if (initfail)
> +			ret = initfail;
> +		goto bail;	/* everything already cleaned */
> +	}
> +
> +	sdma_start(dd);
> +	init_cport_overtemp(dd);
> +
> +	hfi2_sriov_auto_conf(dd);
> +	vf2pf_ready(dd);
> +	return 0;
> +
> +destroy_wqs:
> +	destroy_workqueues(dd);
> +pcie_cleanup:
> +	hfi2_pcie_cleanup(pdev);
> +free_dd:
> +	hfi2_free_devdata(dd);
> +bail:
> +	return ret;
> +}
> +
> +static void wait_for_clients(struct hfi2_devdata *dd)
> +{
> +	/*
> +	 * Remove the device init value and complete the device if there is
> +	 * no clients or wait for active clients to finish.
> +	 */
> +	if (refcount_dec_and_test(&dd->user_refcount))
> +		complete(&dd->user_comp);
> +
> +	wait_for_completion(&dd->user_comp);
> +}
> +
> +/*
> + * This is called for rmmod or other driver-device unbinds.
> + * (and now by shutdown_one() if not WFR)
> + */
> +static void remove_one(struct pci_dev *pdev)
> +{
> +	struct hfi2_devdata *dd = pci_get_drvdata(pdev);
> +
> +	if (pdev->is_virtfn) {
> +		/*
> +		 * Should only reach here if the VF was claimed by the driver,
> +		 * however, this cannot destroy device functionality.
> +		 */
> +		hfi2_sriov_remove(pdev);
> +	}
> +
> +	/*
> +	 * If VFs are still active, must shut them down now,
> +	 * before PF0 becomes unusable.
> +	 */
> +	if (pdev->is_physfn)
> +		hfi2_sriov_disable(dd->pcidev);
> +
> +	/* close debugfs files before ib unregister */
> +	hfi2_dbg_ibdev_exit(&dd->verbs_dev);
> +
> +	/* wait for existing user space clients to finish */
> +	wait_for_clients(dd);
> +
> +	/* unregister from IB core */
> +	hfi2_unregister_ib_device(dd);
> +
> +	/* stop handling LOCAL_MAD_ from CPORT */
> +	hfi2_mad_deinit(dd);
> +
> +	/*
> +	 * Disable the IB link, disable interrupts on the device,
> +	 * clear dma engines, etc.
> +	 */
> +	shutdown_device(dd);
> +
> +	stop_timers(dd);
> +
> +	/* wait until all of our (qsfp) queue_work() calls complete */
> +	flush_workqueue(ib_wq);
> +
> +	postinit_cleanup(dd);
> +}
> +
> +/*
> + * This is called during system reboot/shutdown/halt.
> + */
> +static void shutdown_one(struct pci_dev *pdev)
> +{
> +	struct hfi2_devdata *dd = pci_get_drvdata(pdev);
> +
> +	if (dd->params->chip_type == CHIP_WFR)
> +		shutdown_device(dd);
> +	else
> +		remove_one(pdev);
> +}
> +
> +/* The device has reported over-temp and will shutdown soon (~500mS) */
> +void hfi2_overtemp(struct hfi2_devdata *dd)
> +{
> +	dd_dev_err(dd, "*** OVER TEMP *** device shutdown imminent!\n");
> +	/* take some action to gracefully shut down/quiesce */
> +}
> +
> +/**
> + * hfi2_create_rcvhdrq - create a receive header queue
> + * @dd: the hfi2_ib device
> + * @rcd: the context data
> + *
> + * This must be contiguous memory (from an i/o perspective), and must be
> + * DMA'able (which means for some systems, it will go through an IOMMU,
> + * or be forced into a low address range).
> + */
> +int hfi2_create_rcvhdrq(struct hfi2_devdata *dd, struct hfi2_ctxtdata *rcd)
> +{
> +	u32 amt = rcvhdrq_size(rcd);
> +
> +	if (!rcd->rcvhdrq) {
> +		rcd->rcvhdrq = dma_alloc_coherent(&dd->pcidev->dev, amt,
> +						  &rcd->rcvhdrq_dma,
> +						  GFP_KERNEL);
> +
> +		if (!rcd->rcvhdrq) {
> +			dd_dev_err(dd,
> +				   "attempt to allocate %d bytes for ctxt %u rcvhdrq failed\n",
> +				   amt, rcd->ctxt);
> +			goto bail;
> +		}
> +
> +		if (HFI2_CAP_KGET_MASK(rcd->flags, DMA_RTAIL) ||
> +		    HFI2_CAP_UGET_MASK(rcd->flags, DMA_RTAIL)) {
> +			rcd->rcvhdrtail_kvaddr = dma_alloc_coherent(&dd->pcidev->dev,
> +								    PAGE_SIZE,
> +								    &rcd->rcvhdrqtailaddr_dma,
> +								    GFP_KERNEL);
> +			if (!rcd->rcvhdrtail_kvaddr) {
> +				dd_dev_err(dd,
> +					   "attempt to allocate 1 page for ctxt %u rcvhdrqtailaddr failed\n",
> +					   rcd->ctxt);
> +				goto rhq_free;
> +			}
> +		}
> +
> +		if (dd->params->chip_type != CHIP_WFR) {
> +			u32 rheq_amt = rheq_size(rcd);
> +
> +			rcd->rheq = dma_alloc_coherent(&dd->pcidev->dev,
> +						       rheq_amt,
> +						       &rcd->rheq_dma,
> +						       GFP_KERNEL);
> +			if (!rcd->rheq) {
> +				dd_dev_err(dd,
> +					   "attempt to allocate %d bytes for ctxt %u rheq failed\n",
> +					   rheq_amt, rcd->ctxt);
> +				goto tail_free;
> +			}
> +		}
> +	}
> +
> +	set_hdrq_regs(rcd->ppd, rcd->ctxt, rcd->rcvhdrqentsize,
> +		      rcd->rcvhdrq_cnt, rcd->kdeth_rcv_hdr);
> +
> +	return 0;
> +
> +tail_free:
> +	if (rcd->rcvhdrtail_kvaddr) {
> +		dma_free_coherent(&dd->pcidev->dev, PAGE_SIZE,
> +				  (void *)hfi2_rcvhdrtail_kvaddr(rcd),
> +				  rcd->rcvhdrqtailaddr_dma);
> +		rcd->rcvhdrtail_kvaddr = NULL;
> +	}
> +rhq_free:
> +	dma_free_coherent(&dd->pcidev->dev, amt, rcd->rcvhdrq,
> +			  rcd->rcvhdrq_dma);
> +	rcd->rcvhdrq = NULL;
> +bail:
> +	return -ENOMEM;
> +}
> +
> +/**
> + * hfi2_setup_eagerbufs - allocate eager buffers, both kernel and user
> + * contexts.
> + * @rcd: the context we are setting up.
> + *
> + * Allocate the eager TID buffers and program them into the chip.
> + * They are no longer completely contiguous, we do multiple allocation
> + * calls.  Otherwise we get the OOM code involved, by asking for too
> + * much per call, with disastrous results on some kernels.
> + */
> +int hfi2_setup_eagerbufs(struct hfi2_ctxtdata *rcd)
> +{
> +	struct hfi2_devdata *dd = rcd->dd;
> +	u32 max_entries, egrtop, alloced_bytes = 0;
> +	u16 order, idx = 0;
> +	int ret = 0;
> +	u16 round_mtu = roundup_pow_of_two(hfi2_max_mtu);
> +
> +	/*
> +	 * The minimum size of the eager buffers is a groups of MTU-sized
> +	 * buffers.
> +	 * The global eager_buffer_size parameter is checked against the
> +	 * theoretical lower limit of the value. Here, we check against the
> +	 * MTU.
> +	 */
> +	if (rcd->egrbufs.size < (round_mtu * dd->rcv_entries.group_size))
> +		rcd->egrbufs.size = round_mtu * dd->rcv_entries.group_size;
> +	/*
> +	 * If using one-pkt-per-egr-buffer, lower the eager buffer
> +	 * size to the max MTU (page-aligned).
> +	 */
> +	if (!HFI2_CAP_KGET_MASK(rcd->flags, MULTI_PKT_EGR))
> +		rcd->egrbufs.rcvtid_size = round_mtu;
> +
> +	/*
> +	 * Eager buffers sizes of 1MB or less require smaller TID sizes
> +	 * to satisfy the "multiple of 8 RcvArray entries" requirement.
> +	 */
> +	if (rcd->egrbufs.size <= (1 << 20))
> +		rcd->egrbufs.rcvtid_size = max((unsigned long)round_mtu,
> +			rounddown_pow_of_two(rcd->egrbufs.size / 8));
> +
> +	while (alloced_bytes < rcd->egrbufs.size &&
> +	       rcd->egrbufs.alloced < rcd->egrbufs.count) {
> +		rcd->egrbufs.buffers[idx].addr =
> +			dma_alloc_coherent(&dd->pcidev->dev,
> +					   rcd->egrbufs.rcvtid_size,
> +					   &rcd->egrbufs.buffers[idx].dma,
> +					   GFP_KERNEL);
> +		if (rcd->egrbufs.buffers[idx].addr) {
> +			rcd->egrbufs.buffers[idx].len =
> +				rcd->egrbufs.rcvtid_size;
> +			rcd->egrbufs.rcvtids[rcd->egrbufs.alloced].addr =
> +				rcd->egrbufs.buffers[idx].addr;
> +			rcd->egrbufs.rcvtids[rcd->egrbufs.alloced].dma =
> +				rcd->egrbufs.buffers[idx].dma;
> +			rcd->egrbufs.alloced++;
> +			alloced_bytes += rcd->egrbufs.rcvtid_size;
> +			idx++;
> +		} else {
> +			u32 new_size, i, j;
> +			u64 offset = 0;
> +
> +			/*
> +			 * Fail the eager buffer allocation if:
> +			 *   - we are already using the lowest acceptable size
> +			 *   - we are using one-pkt-per-egr-buffer (this implies
> +			 *     that we are accepting only one size)
> +			 */
> +			if (rcd->egrbufs.rcvtid_size == round_mtu ||
> +			    !HFI2_CAP_KGET_MASK(rcd->flags, MULTI_PKT_EGR)) {
> +				dd_dev_err(dd, "ctxt%u: Failed to allocate eager buffers\n",
> +					   rcd->ctxt);
> +				ret = -ENOMEM;
> +				goto bail_rcvegrbuf_phys;
> +			}
> +
> +			new_size = rcd->egrbufs.rcvtid_size / 2;
> +
> +			/*
> +			 * If the first attempt to allocate memory failed, don't
> +			 * fail everything but continue with the next lower
> +			 * size.
> +			 */
> +			if (idx == 0) {
> +				rcd->egrbufs.rcvtid_size = new_size;
> +				continue;
> +			}
> +
> +			/*
> +			 * Re-partition already allocated buffers to a smaller
> +			 * size.
> +			 */
> +			rcd->egrbufs.alloced = 0;
> +			for (i = 0, j = 0, offset = 0; j < idx; i++) {
> +				if (i >= rcd->egrbufs.count)
> +					break;
> +				rcd->egrbufs.rcvtids[i].dma =
> +					rcd->egrbufs.buffers[j].dma + offset;
> +				rcd->egrbufs.rcvtids[i].addr =
> +					rcd->egrbufs.buffers[j].addr + offset;
> +				rcd->egrbufs.alloced++;
> +				if ((rcd->egrbufs.buffers[j].dma + offset +
> +				     new_size) ==
> +				    (rcd->egrbufs.buffers[j].dma +
> +				     rcd->egrbufs.buffers[j].len)) {
> +					j++;
> +					offset = 0;
> +				} else {
> +					offset += new_size;
> +				}
> +			}
> +			rcd->egrbufs.rcvtid_size = new_size;
> +		}
> +	}
> +	rcd->egrbufs.numbufs = idx;
> +	rcd->egrbufs.size = alloced_bytes;
> +
> +	hfi2_cdbg(PROC,
> +		  "ctxt%u: Alloced %u rcv tid entries @ %uKB, total %uKB",
> +		  rcd->ctxt, rcd->egrbufs.alloced,
> +		  rcd->egrbufs.rcvtid_size / 1024, rcd->egrbufs.size / 1024);
> +
> +	/*
> +	 * Set the contexts rcv array head update threshold to the closest
> +	 * power of 2 (so we can use a mask instead of modulo) below half
> +	 * the allocated entries.
> +	 */
> +	rcd->egrbufs.threshold =
> +		rounddown_pow_of_two(rcd->egrbufs.alloced / 2);
> +	/*
> +	 * Compute the expected RcvArray entry base. This is done after
> +	 * allocating the eager buffers in order to maximize the
> +	 * expected RcvArray entries for the context.
> +	 */
> +	max_entries = rcd->rcv_array_groups * dd->rcv_entries.group_size;
> +	egrtop = roundup(rcd->egrbufs.alloced, dd->rcv_entries.group_size);
> +	rcd->expected_count = max_entries - egrtop;
> +	if (rcd->expected_count > MAX_TID_PAIR_ENTRIES * 2)
> +		rcd->expected_count = MAX_TID_PAIR_ENTRIES * 2;
> +
> +	rcd->expected_base = rcd->eager_base + egrtop;
> +	hfi2_cdbg(PROC, "ctxt%u: eager:%u, exp:%u, egrbase:%u, expbase:%u",
> +		  rcd->ctxt, rcd->egrbufs.alloced, rcd->expected_count,
> +		  rcd->eager_base, rcd->expected_base);
> +
> +	if (!hfi2_rcvbuf_validate(rcd->egrbufs.rcvtid_size, PT_EAGER, &order)) {
> +		hfi2_cdbg(PROC,
> +			  "ctxt%u: current Eager buffer size is invalid %u",
> +			  rcd->ctxt, rcd->egrbufs.rcvtid_size);
> +		ret = -EINVAL;
> +		goto bail_rcvegrbuf_phys;
> +	}
> +
> +	/*
> +	 * Enable RcvArray access on JKR and later by configuring RcvEgrCtrl and
> +	 * RcvTidCtrl before writing TIDs to the RcvArray.
> +	 *
> +	 * Call set_port_tid_config only after eager_base, egrbufs.alloced,
> +	 * expected_count, and expected_base are initialized in rcd.  The last
> +	 * 3 of the 4 are initialized above in this function.
> +	 */
> +	dd->params->set_port_tid_config(dd, rcd->ppd->hw_pidx, rcd->ctxt,
> +			rcd->eager_base, rcd->egrbufs.alloced,
> +			rcd->expected_base, rcd->expected_count);
> +
> +	for (idx = 0; idx < rcd->egrbufs.alloced; idx++) {
> +		dd->params->put_tid(rcd, idx, PT_EAGER,
> +				    rcd->egrbufs.rcvtids[idx].dma, order,
> +				    false);
> +		cond_resched();
> +	}
> +
> +	return 0;
> +
> +bail_rcvegrbuf_phys:
> +	for (idx = 0; idx < rcd->egrbufs.alloced &&
> +	     rcd->egrbufs.buffers[idx].addr;
> +	     idx++) {
> +		dma_free_coherent(&dd->pcidev->dev,
> +				  rcd->egrbufs.buffers[idx].len,
> +				  rcd->egrbufs.buffers[idx].addr,
> +				  rcd->egrbufs.buffers[idx].dma);
> +		rcd->egrbufs.buffers[idx].addr = NULL;
> +		rcd->egrbufs.buffers[idx].dma = 0;
> +		rcd->egrbufs.buffers[idx].len = 0;
> +	}
> +
> +	return ret;
> +}
> +
> +/*
> + * Return number of requested user contexts for the given unit and port based
> + * on information given in the module parameter num_user_contexts.
> + * Return -1 (use non-HT cores) if the corresponding entry is not set.
> + */
> +int get_num_user_contexts(struct hfi2_devdata *dd, int pidx)
> +{
> +	struct hfi2_devdata *xdd;
> +	int start;
> +	int i;
> +
> +	/* find the count of ports from earlier units */
> +	start = 0;
> +	for (i = 0; i < dd->unit; i++) {
> +		xdd = hfi2_lookup(i);
> +		/* previous units should exist - check anyway */
> +		if (!xdd) {
> +			dd_dev_err(dd, "%s: unit %d not found?\n", __func__, i);
> +			return -1;
> +		}
> +		start += xdd->num_pports;
> +	}
> +
> +	/* adjust for the port on this unit */
> +	start += pidx;
> +
> +	/* check if enough elements are set for this unit's port */
> +	if (start >= num_user_contexts_count)
> +		return -1;
> +
> +	return num_user_contexts_array[start];
> +}
> 
> 
> 

  reply	other threads:[~2026-03-18  9:14 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-11 17:53 [PATCH for-next resend 00/24] Migrate to hfi2 driver Dennis Dalessandro
2026-03-11 17:53 ` [PATCH for-next resend 01/24] RDMA/hfi2: Start hfi2 driver by basing off of hfi1 Dennis Dalessandro
2026-03-16 15:51   ` Leon Romanovsky
2026-03-16 22:00     ` Dennis Dalessandro
2026-03-17 10:07       ` Leon Romanovsky
2026-03-11 17:53 ` [PATCH for-next resend 02/24] RDMA/hfi2: Add in HW register definition files Dennis Dalessandro
2026-03-11 17:54 ` [PATCH for-next resend 03/24] RDMA/hfi2: Add counter accessor functions Dennis Dalessandro
2026-03-11 17:54 ` [PATCH for-next resend 04/24] RDMA/hfi2: Add in HW register access support Dennis Dalessandro
2026-03-11 17:54 ` [PATCH for-next resend 05/24] RDMA/hfi2: Add in trace header files Dennis Dalessandro
2026-03-11 17:54 ` [PATCH for-next resend 06/24] RDMA/hfi2: Add in trace support Dennis Dalessandro
2026-03-11 17:54 ` [PATCH for-next resend 07/24] RDMA/hfi2: Add system core header files Dennis Dalessandro
2026-03-16 15:58   ` Leon Romanovsky
2026-03-16 21:37     ` Dennis Dalessandro
2026-03-17  9:54       ` Leon Romanovsky
2026-03-11 17:54 ` [PATCH for-next resend 08/24] RDMA/hfi2: Add driver and interrupt infrastructure Dennis Dalessandro
2026-03-18  9:11   ` Leon Romanovsky
2026-03-11 17:54 ` [PATCH for-next resend 09/24] RDMA/hfi2: Add initialization and firmware support Dennis Dalessandro
2026-03-18  9:14   ` Leon Romanovsky [this message]
2026-03-11 17:54 ` [PATCH for-next resend 10/24] RDMA/hfi2: Add in MAD handling related headers Dennis Dalessandro
2026-03-11 17:54 ` [PATCH for-next resend 11/24] RDMA/hfi2: Add cport management Dennis Dalessandro
2026-03-11 17:54 ` [PATCH for-next resend 12/24] RDMA/hfi2: Implement MAD handling Dennis Dalessandro
2026-03-11 17:54 ` [PATCH for-next resend 13/24] RDMA/hfi2: Add IO related headers Dennis Dalessandro
2026-03-11 17:54 ` [PATCH for-next resend 14/24] RDMA/hfi2: Add PIO send infrastructure Dennis Dalessandro
2026-03-11 17:55 ` [PATCH for-next resend 15/24] RDMA/hfi2: Add SDMA infrastructure Dennis Dalessandro
2026-03-11 17:55 ` [PATCH for-next resend 16/24] RDMA/hfi2: Implement data moving infrastructure Dennis Dalessandro
2026-03-11 17:55 ` [PATCH for-next resend 17/24] RDMA/hfi2: Add verbs core Dennis Dalessandro
2026-03-11 17:55 ` [PATCH for-next resend 18/24] RDMA/hfi2: Add RC protocol support Dennis Dalessandro
2026-03-11 17:55 ` [PATCH for-next resend 19/24] RDMA/hfi2: Add in support for verbs Dennis Dalessandro
2026-03-11 17:55 ` [PATCH for-next resend 20/24] RDMA/hfi2: Support ipoib Dennis Dalessandro
2026-03-11 17:55 ` [PATCH for-next resend 21/24] RDMA/hfi2: Add misc header files Dennis Dalessandro
2026-03-11 17:55 ` [PATCH for-next resend 22/24] RDMA/hfi2: Add the rest of the driver Dennis Dalessandro
2026-03-11 17:55 ` [PATCH for-next resend 23/24] RDMA/hfi2: Make it build Dennis Dalessandro
2026-03-11 17:55 ` [PATCH for-next resend 24/24] RDMA/hfi2: Modernize mmap to use rdma_user_mmap_entry infrastructure Dennis Dalessandro
2026-03-16 16:02 ` [PATCH for-next resend 00/24] Migrate to hfi2 driver Leon Romanovsky
2026-03-16 21:29   ` Dennis Dalessandro

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260318091409.GG61385@unreal \
    --to=leon@kernel.org \
    --cc=brendan.cunningham@cornelisnetworks.com \
    --cc=dean.luick@cornelisnetworks.com \
    --cc=dennis.dalessandro@cornelisnetworks.com \
    --cc=doug.miller@cornelisnetworks.com \
    --cc=jgg@ziepe.ca \
    --cc=linux-rdma@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.