Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v3 2/2][BNX2]: Add iSCSI support to BNX2 devices.
From: Mike Christie @ 2007-09-05 18:34 UTC (permalink / raw)
  To: Michael Chan
  Cc: davem, netdev, open-iscsi, anilgv, talm, lusinsky, uri,
	SCSI Mailing List
In-Reply-To: <1188599815.5176.12.camel@dell>

Michael Chan wrote:
> diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
> index 6f2c71e..adcfbbc 100644
> --- a/drivers/scsi/Kconfig
> +++ b/drivers/scsi/Kconfig
> @@ -1800,6 +1800,8 @@ config ZFCP
>            called zfcp. If you want to compile it as a module, say M here
>            and read <file:Documentation/kbuild/modules.txt>.
>  
> +source "drivers/scsi/bnx2i/Kconfig"
> +
>  config SCSI_SRP
>  	tristate "SCSI RDMA Protocol helper library"
>  	depends on SCSI && PCI
> diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
> index 86a7ba7..65fe633 100644
> --- a/drivers/scsi/Makefile
> +++ b/drivers/scsi/Makefile
> @@ -133,6 +133,7 @@ obj-$(CONFIG_SCSI_IBMVSCSIS)	+= ibmvscsi/
>  obj-$(CONFIG_SCSI_HPTIOP)	+= hptiop.o
>  obj-$(CONFIG_SCSI_STEX)		+= stex.o
>  obj-$(CONFIG_PS3_ROM)		+= ps3rom.o
> +obj-$(CONFIG_SCSI_BNX2_ISCSI)	+= bnx2i/
>  
>  obj-$(CONFIG_ARM)		+= arm/
>  
> diff --git a/drivers/scsi/bnx2i/57xx_iscsi_constants.h b/drivers/scsi/bnx2i/57xx_iscsi_constants.h
> new file mode 100644
> index 0000000..61f07f9
> --- /dev/null
> +++ b/drivers/scsi/bnx2i/57xx_iscsi_constants.h
> @@ -0,0 +1,212 @@
> +#ifndef __57XX_ISCSI_CONSTANTS_H_
> +#define __57XX_ISCSI_CONSTANTS_H_
> +
> +/**
> +* This file defines HSI constants for the iSCSI flows
> +*/
> +
> +/* iSCSI request op codes */
> +#define ISCSI_OPCODE_NOP_OUT 			(0 | 0x40)
> +#define ISCSI_OPCODE_SCSI_CMD 			(1)
> +#define ISCSI_OPCODE_TMF_REQUEST 		(2 | 0x40)
> +#define ISCSI_OPCODE_LOGIN_REQUEST 		(3 | 0x40)
> +#define ISCSI_OPCODE_TEXT_REQUEST 		(4 | 0x40)
> +#define ISCSI_OPCODE_DATA_OUT 			(5)
> +#define ISCSI_OPCODE_LOGOUT_REQUEST 		(6 | 0x00)
> +#define ISCSI_OPCODE_CLEANUP_REQUEST		(7)


What is ISCSI_OPCODE_CLEANUP_REQUEST?


> +
> +/* iSCSI response/messages op codes */
> +#define ISCSI_OPCODE_NOP_IN 			(0x20)
> +#define ISCSI_OPCODE_SCSI_RESPONSE 		(0x21)
> +#define ISCSI_OPCODE_TMF_RESPONSE 		(0x22)
> +#define ISCSI_OPCODE_LOGIN_RESPONSE 		(0x23)
> +#define ISCSI_OPCODE_TEXT_RESPONSE 		(0x24)
> +#define ISCSI_OPCODE_DATA_IN			(0x25)
> +#define ISCSI_OPCODE_LOGOUT_RESPONSE 		(0x26)
> +#define ISCSI_OPCODE_CLEANUP_RESPONSE 		(0x27)
> +#define ISCSI_OPCODE_R2T			(0x31)
> +#define ISCSI_OPCODE_ASYNC_MSG 			(0x32)
> +#define ISCSI_OPCODE_REJECT 			(0x3f)
> +#define ISCSI_OPCODE_NOPOUT_LOCAL_COMPLETION	(0)


What does the LOCAL_COMPLETION on the nopout mean?


> +
> +/* iSCSI stages */
> +#define ISCSI_STAGE_SECURITY_NEGOTIATION (0)
> +#define ISCSI_STAGE_LOGIN_OPERATIONAL_NEGOTIATION (1)
> +#define ISCSI_STAGE_FULL_FEATURE_PHASE (3)
> +
> +/* iSCSI parameter defaults */
> +#define ISCSI_DEFAULT_HEADER_DIGEST 		(0)
> +#define ISCSI_DEFAULT_DATA_DIGEST 		(0)
> +#define ISCSI_DEFAULT_INITIAL_R2T 		(1)
> +#define ISCSI_DEFAULT_IMMEDIATE_DATA 		(1)
> +#define ISCSI_DEFAULT_MAX_PDU_LENGTH 		(0x2000)
> +#define ISCSI_DEFAULT_FIRST_BURST_LENGTH 	(0x10000)
> +#define ISCSI_DEFAULT_MAX_BURST_LENGTH 		(0x40000)
> +#define ISCSI_DEFAULT_MAX_OUTSTANDING_R2T 	(1)
> +
> +/* iSCSI parameter limits */
> +#define ISCSI_MIN_VAL_MAX_PDU_LENGTH (0x200)
> +#define ISCSI_MAX_VAL_MAX_PDU_LENGTH (0xffffff)
> +#define ISCSI_MIN_VAL_BURST_LENGTH (0x200)
> +#define ISCSI_MAX_VAL_BURST_LENGTH (0xffffff)
> +#define ISCSI_MIN_VAL_MAX_OUTSTANDING_R2T (1)
> +#define ISCSI_MAX_VAL_MAX_OUTSTANDING_R2T (0xff) /* 0x10000 according to RFC */
> +
> +/* SCSI command response codes */
> +#define ISCSI_SCSI_CMD_RESPONSE_CMD_COMPLETED	(0x00)
> +#define ISCSI_SCSI_CMD_RESPONSE_TARGET_FAILURE	(0x01)
> +
> +/* SCSI command status codes */
> +#define ISCSI_SCSI_CMD_STATUS_GOOD 		(0x00)
> +#define ISCSI_SCSI_CMD_STATUS_CHECK_CONDITION 	(0x02)
> +#define ISCSI_SCSI_CMD_STATUS_INTERMIDIATE 	(0x10)
> +
> +/* TMF codes */
> +#define ISCSI_TMF_ABORT_TASK 			(1)
> +#define ISCSI_TMF_LOGICAL_UNIT_RESET 		(5)
> +
> +/* TMF response codes */
> +#define ISCSI_TMF_RESPONSE_FUNCTION_COMPLETE			(0x00)
> +#define ISCSI_TMF_RESPONSE_TASK_DOESNT_EXIST			(0x01)
> +#define ISCSI_TMF_RESPONSE_LUN_DOESNT_EXIST			(0x02)
> +#define ISCSI_TMF_RESPONSE_TASK_STILL_ALLEGIANT			(0x03)
> +#define ISCSI_TMF_RESPONSE_FUNCTION_NOT_SUPPORTED		(0x05)
> +#define ISCSI_TMF_RESPONSE_FUNCTION_AUTHORIZATION_FAILED	(0x06)
> +#define ISCSI_TMF_RESPONSE_FUNCTION_REJECTED			(0xff)
> +
> +/* Logout reason codes */
> +/*#define ISCSI_LOGOUT_REASON_CLOSE_CONNECTION (1) */
> +
> +/* Logout response codes */
> +#define ISCSI_LOGOUT_RESPONSE_CONNECTION_CLOSED (0)
> +#define ISCSI_LOGOUT_RESPONSE_CID_NOT_FOUND (1)
> +#define ISCSI_LOGOUT_RESPONSE_CLEANUP_FAILED (3)
> +
> +/* iSCSI task types */
> +#define ISCSI_TASK_TYPE_READ    (0)
> +#define ISCSI_TASK_TYPE_WRITE   (1)
> +#define ISCSI_TASK_TYPE_MPATH   (2)




All of these iscsi code shoulds be in iscsi_proto.h or should be added 
there.


> +
> +#endif
> diff --git a/drivers/scsi/bnx2i/bnx2i_hwi.c b/drivers/scsi/bnx2i/bnx2i_hwi.c
> new file mode 100644
> index 0000000..b74a93c
> --- /dev/null
> +++ b/drivers/scsi/bnx2i/bnx2i_hwi.c
> @@ -0,0 +1,1977 @@



> + * process iSCSI TMF Response CQE & wake up driver EH thread.
> + */
> +static int bnx2i_process_tmf_resp(struct bnx2i_conn *conn, struct cqe *cqe)
> +{
> +	u32 itt;
> +	struct bnx2i_cmd *cmd;
> +	struct bnx2i_cmd *aborted_cmd;
> +	struct iscsi_tmf_response *tmf_cqe;
> +
> +	tmf_cqe = (struct iscsi_tmf_response *) cqe;
> +	itt = (tmf_cqe->itt & ISCSI_TMF_RESPONSE_INDEX);
> +	cmd = get_cmnd(conn->sess, itt);
> +	if (!cmd) {
> +		printk(KERN_ALERT "tmf_resp: ITT 0x%x is not active\n",
> +				  tmf_cqe->itt);
> +		return 0;
> +	}
> +
> +	bnx2i_update_cmd_sequence(conn->sess, tmf_cqe->exp_cmd_sn,
> +				  tmf_cqe->max_cmd_sn);
> +
> +	aborted_cmd = cmd->tmf_ref_cmd;
> +
> +	if (tmf_cqe->response == ISCSI_TMF_RSP_COMPLETE) {
> +		if (aborted_cmd->scsi_cmd) {
> +			aborted_cmd->scsi_cmd->result = DID_ABORT << 16;
> +			aborted_cmd->scsi_cmd->resid =

should use resid wrappers throughout driver.

> +				aborted_cmd->scsi_cmd->request_bufflen;

Should use bufflen wrappers throught driver.


> +}
> diff --git a/drivers/scsi/bnx2i/bnx2i_init.c b/drivers/scsi/bnx2i/bnx2i_init.c
> new file mode 100644
> index 0000000..837bd95
> --- /dev/null
> +++ b/drivers/scsi/bnx2i/bnx2i_init.c
>
> + * handles SCSI command queued by SCSI-ML, allocates a command structure,
> + *	assigning CMDSN, mapping SG buffers and handing over request to CNIC.
> + */
> +int bnx2i_queuecommand(struct scsi_cmnd *sc,
> +		       void (*done) (struct scsi_cmnd *))
> +{
> +	struct bnx2i_sess *sess;
> +	struct bnx2i_conn *conn;
> +	struct bnx2i_cmd *cmd;
> +	static int old_recovery_state = 0;
> +
> +	sc->scsi_done = done;
> +	sc->result = 0;
> +	sess = iscsi_hostdata(sc->device->host->hostdata);
> +
> +#define iscsi_cmd_win_closed(_sess)	\
> +	((int) (_sess->max_cmdsn - _sess->cmdsn) < 0)
> +
> +	if (iscsi_cmd_win_closed(sess) ||
> +	    test_bit(ADAPTER_STATE_LINK_DOWN, &sess->hba->adapter_state))
> +		goto iscsi_win_closed;
> +
> +	if ((sess->state & BNX2I_SESS_IN_SHUTDOWN) ||
> +		(sess->state & BNX2I_SESS_IN_LOGOUT))
> +		goto dev_not_found;
> +
> +	if (sess->recovery_state) {
> +		if (old_recovery_state != sess->recovery_state)
> +			old_recovery_state = sess->recovery_state;
> +
> +		if (sess->recovery_state & ISCSI_SESS_RECOVERY_FAILED)
> +			goto dev_not_found;
> +		else if (!(sess->recovery_state & ISCSI_SESS_RECOVERY_COMPLETE))
> +			goto iscsi_win_closed;
> +		else
> +			sess->recovery_state = 0;
> +	}
> +
> +	cmd = bnx2i_alloc_cmd(sess);
> +	if (cmd == NULL)
> +		/* should never happen as cmd list size == SHT->can_queue */
> +		goto cmd_not_accepted;
> +
> +	cmd->conn = conn = sess->lead_conn;
> +	cmd->scsi_cmd = sc;
> +	cmd->req.total_data_transfer_length = sc->request_bufflen;
> +	cmd->iscsi_opcode = ISCSI_OPCODE_SCSI_CMD;
> +	cmd->req.cmd_sn = sess->cmdsn++;
> +
> +	bnx2i_iscsi_map_sg_list(cmd);
> +	bnx2i_cpy_scsi_cdb(sc, cmd);
> +
> +	if (sc->sc_data_direction == DMA_TO_DEVICE) {
> +		cmd->req.op_attr = ISCSI_CMD_REQUEST_WRITE;
> +		cmd->req.itt |= (ISCSI_TASK_TYPE_WRITE <<
> +				 ISCSI_CMD_REQUEST_TYPE_SHIFT);
> +		bnx2i_setup_write_cmd_bd_info(cmd);
> +	} else {
> +		cmd->req.op_attr = ISCSI_CMD_REQUEST_READ;
> +		cmd->req.itt |= (ISCSI_TASK_TYPE_READ <<
> +				 ISCSI_CMD_REQUEST_TYPE_SHIFT);
> +	}
> +	cmd->req.num_bds = cmd->bd_tbl->bd_valid;
> +	if (!cmd->bd_tbl->bd_valid) {
> +		cmd->req.bd_list_addr_lo = (u32) sess->hba->mp_bd_dma;
> +		cmd->req.bd_list_addr_hi =
> +			(u32) ((u64) sess->hba->mp_bd_dma >> 32);
> +		cmd->req.num_bds = 1;
> +	}
> +
> +	cmd->cmd_state = ISCSI_CMD_STATE_INITIATED;
> +	sc->SCp.ptr = (char *) cmd;
> +
> +	if (cmd->req.itt != ITT_INVALID_SIGNATURE) {
> +		bnx2i_send_iscsi_scsicmd(conn, cmd);
> +		list_add_tail(&cmd->link, &sess->active_cmds);
> +		sess->num_active_cmds++;
> +	}
> +	return 0;
> +
> +iscsi_win_closed:
> +cmd_not_accepted:
> +	return SCSI_MLQUEUE_HOST_BUSY;
> +
> +dev_not_found:
> +	sc->sense_buffer[0] = 0x70;
> +	sc->sense_buffer[2] = NOT_READY;
> +	sc->sense_buffer[7] = 0x6;
> +	sc->sense_buffer[12] = 0x08;
> +	sc->sense_buffer[13] = 0x00;


do not fake sense and do not face sense and set a host byte of 
DID_NO_CONNECT. DID_NO_CONNECT is fine.


> +	sc->result = (DID_NO_CONNECT << 16);
> +	sc->resid = sc->request_bufflen;
> +	sc->scsi_done(sc);
> +	return 0;
> +}
> +
> +
> +
> +/*
> + * TMF request timeout handler
> + */
> +static void bnx2i_iscsi_tmf_timer(unsigned long data)
> +{
> +	struct bnx2i_cmd *cmd = (struct bnx2i_cmd *) data;
> +
> +	printk(KERN_ALERT "TMF timer: abort failed, cmd 0x%p\n", cmd);
> +	cmd->cmd_state = ISCSI_CMD_STATE_TMF_TIMEOUT;
> +	cmd->conn->sess->recovery_state = ISCSI_SESS_RECOVERY_OPEN_ISCSI;
> +	iscsi_conn_error(cmd->conn->cls_conn, ISCSI_ERR_CONN_FAILED);
> +}
> +
> +
> +/*
> + * initiate command abort process by requesting CNIC to send
> + *	an iSCSI TMF request to target
> + */
> +static int bnx2i_initiate_abort_cmd(struct scsi_cmnd *sc)
> +{
> +	struct bnx2i_cmd *cmd = (struct bnx2i_cmd *) sc->SCp.ptr;
> +	struct bnx2i_cmd *tmf_cmd;
> +	struct Scsi_Host *shost = cmd->scsi_cmd->device->host;
> +	struct bnx2i_conn *conn = cmd->conn;
> +	struct bnx2i_sess *sess;
> +	struct bnx2i_hba *hba;
> +
> +	shost = cmd->scsi_cmd->device->host;
> +	sess = iscsi_hostdata(shost->hostdata);
> +	BUG_ON(shost != sess->host);
> +
> +	if (sess && (is_sess_active(sess)))
> +		hba = sess->hba;
> +	else
> +		return FAILED;
> +
> +	if (test_bit(ADAPTER_STATE_LINK_DOWN, &hba->adapter_state))
> +		return FAILED;
> +
> +	bnx2i_setup_ictx_dump(hba, conn);
> +
> +	if (cmd->scsi_cmd != sc)
> +		/* command already completed to scsi mid-layer */
> +		goto cmd_not_active;
> +
> +	tmf_cmd = bnx2i_alloc_cmd(sess);
> +	if (tmf_cmd == NULL)
> +		goto lack_of_resc;
> +
> +	tmf_cmd->conn = conn = sess->lead_conn;
> +	tmf_cmd->scsi_cmd = NULL;
> +	tmf_cmd->iscsi_opcode = ISCSI_OPCODE_TMF_REQUEST;
> +	tmf_cmd->req.cmd_sn = sess->cmdsn;
> +	tmf_cmd->tmf_ref_itt = cmd->req.itt;
> +	tmf_cmd->tmf_ref_cmd = cmd;
> +	tmf_cmd->tmf_ref_sc = cmd->scsi_cmd;
> +	cmd->cmd_state = ISCSI_CMD_STATE_ABORT_PEND;
> +	tmf_cmd->cmd_state = ISCSI_CMD_STATE_INITIATED;
> +
> +#define BNX2I_TMF_TIMEOUT	10 * HZ
> +	sess->abort_timer.expires = BNX2I_TMF_TIMEOUT + jiffies;
> +	sess->abort_timer.function = bnx2i_iscsi_tmf_timer;
> +	sess->abort_timer.data = (unsigned long)tmf_cmd;
> +	add_timer(&sess->abort_timer);
> +
> +	bnx2i_send_iscsi_tmf(conn, tmf_cmd);
> +
> +	/* update iSCSI context for this conn, wait for CNIC to complete */
> +	wait_event_interruptible(sess->er_wait, (!conn->ep ||
> +		 			 (tmf_cmd->cmd_state ==
> +					  ISCSI_CMD_STATE_FAILED) ||
> +					 (tmf_cmd->cmd_state ==
> +					  ISCSI_CMD_STATE_COMPLETED)));
> +
> +	if (signal_pending(current))
> +		flush_signals(current);
> +
> +	del_timer_sync(&sess->abort_timer);
> +
> +	if (tmf_cmd->cmd_state == ISCSI_CMD_STATE_TMF_TIMEOUT) {
> +		printk(KERN_ALERT "abort: abort failed, cmd 0x%p\n", tmf_cmd);
> +		/* TMF timed out, return error status and let SCSI-ML do
> +		 * session recovery.
> +		 */
> +		list_del_init(&tmf_cmd->link);
> +		bnx2i_free_cmd(sess, tmf_cmd);
> +		return SUCCESS;
> +	}
> +
> +	list_del_init(&tmf_cmd->link);
> +	bnx2i_free_cmd(sess, tmf_cmd);
> +
> +	if ((cmd->scsi_cmd->result & 0xFF0000) == (DID_ABORT << 16)) {
> +		cmd->cmd_state = ISCSI_CMD_STATE_CLEANUP_PEND;
> +		bnx2i_send_cmd_cleanup_req(hba, cmd);
> +		wait_event_interruptible_timeout(sess->er_wait,
> +						 (cmd->cmd_state ==
> +						  ISCSI_CMD_STATE_CLEANUP_CMPL),
> +						 msecs_to_jiffies(
> +						  ISCSI_CMD_CLEANUP_TIMEOUT));
> +
> +		if (signal_pending(current))
> +			flush_signals(current);
> +	} else
> +		cmd->scsi_cmd->result = (DID_ABORT << 16);
> +
> +	cmd->conn->sess->num_active_cmds--;
> +	list_del_init(&cmd->link);
> +	bnx2i_return_failed_command(sess, cmd, DID_ABORT);
> +	bnx2i_free_cmd(sess, cmd);
> +
> +cmd_not_active:
> +	return SUCCESS;
> +
> +lack_of_resc:
> +	return FAILED;
> +}
> +
> +
> +/*
> + * SCSI abort request handler.
> + */
> +int bnx2i_abort(struct scsi_cmnd *sc)
> +{
> +	int reason;
> +	struct bnx2i_cmd *cmd = (struct bnx2i_cmd *) sc->SCp.ptr;
> +
> +	if (unlikely(!cmd)) {
> +		/* command already completed to scsi mid-layer */
> +		printk(KERN_INFO "bnx2i_abort: sc 0x%p, not active\n", sc);
> +		return SUCCESS;
> +	}
> +
> +	reason = bnx2i_initiate_abort_cmd(sc);
> +	return reason;
> +}
> +
> +
> +
> +/*
> + * hardware reset
> + */
> +int bnx2i_reset(struct scsi_cmnd *sc)
> +{
> +	return 0;
> +}


So what is up with this one? It seems like if there is a way to reset 
hardware then you would want it as the scsi eh host reset callout 
instead of dropping the session. We could add some transport level 
recovery callouts for the iscsi specifics.


> +
> +
> +void bnx2i_return_failed_command(struct bnx2i_sess *sess,
> +				 struct bnx2i_cmd *cmd, int err_code)
> +{
> +	struct scsi_cmnd *sc = cmd->scsi_cmd;
> +	sc->result = err_code << 16;
> +	sc->resid = cmd->scsi_cmd->request_bufflen;
> +	cmd->scsi_cmd = NULL;
> +	sess->num_active_cmds--;
> +	sc->SCp.ptr = NULL;
> +	sc->scsi_done(sc);
> +}
> +
> +
> +
> +/*
> + * SCSI host reset handler - iSCSI session recovery
> + */
> +int bnx2i_host_reset(struct scsi_cmnd *sc)
> +{
> +	struct Scsi_Host *shost;
> +	struct bnx2i_sess *sess;
> +	int rc = 0;
> +
> +	shost = sc->device->host;
> +	sess = iscsi_hostdata(shost->hostdata);
> +	printk(KERN_INFO "bnx2i: attempting to reset host, #%d\n",
> +			  sess->host->host_no);
> +
> +	BUG_ON(shost != sess->host);
> +	rc = bnx2i_do_iscsi_sess_recovery(sess, DID_RESET);
> +
> +	return rc;
> +}
> +
> +
> +
> +/**********************************************************************
> + *		open-iscsi interface
> + **********************************************************************/
> +
> +
> +#define get_bnx2_device(_hba, _devc) 	do {				\
> +		if ((_hba->pci_did == PCI_DEVICE_ID_NX2_5706) ||	\
> +			(_hba->pci_did == PCI_DEVICE_ID_NX2_5706S)) {	\
> +			_devc = '6';					\
> +		} else if ((_hba->pci_did == PCI_DEVICE_ID_NX2_5708) ||	\
> +			(_hba->pci_did == PCI_DEVICE_ID_NX2_5708S)) {	\
> +			_devc = '8';					\
> +		} else if ((_hba->pci_did == PCI_DEVICE_ID_NX2_5709) ||	\
> +			(_hba->pci_did == PCI_DEVICE_ID_NX2_5709S)) {	\
> +			_devc = '9';					\
> +		}							\
> +	} while (0)
> +
> +/* from open-iscsi project */
> +/*
> + * iSCSI Session's hostdata organization:
> + *
> + *    *------------------* <== hostdata_session(host->hostdata)
> + *    | ptr to class sess|
> + *    |------------------| <== iscsi_hostdata(host->hostdata)
> + *    | iscsi_session    |
> + *    *------------------*
> + */



I think it is going to be nicer to not allocate the host with the 
session for bnx2i like was done for qla4xxx. If you are going to 
allocate the host with the session then it seems like you should be able 
to intergrate into libiscsi like was done with iscsi_tcp and ib_iser. 
And even if you do not allocate a host per session it seems like you 
should be able to better integreate with libiscsi like was done with 
iscsi_tcp and ib_iser. Right now, it is sort of a mix of qla4xxx's model 
and iscsi_tcp/ib_iser's model which is not going to work well in the 
long run.

You can change libiscsi so that you do not run in the host's workq and 
so data structs are better optimized for you and whatever else is 
needed. I mean feel free to change that code so it better suits your 
model. Your code is probably closer to ib_iser since that module does 
not have to handle operations like responding to r2ts and has to make 
its own connections.

There seem to be some bugs from when you might have modeled bnx2i's 
callouts after a older iscsi_tcp or ib_iser. Those bugs got fixed for 
both modules for free when libiscsi got fixed.


> +
> +#define hostdata_privsize(_sz)	(sizeof(unsigned long) + _sz + \
> +				 _sz % sizeof(unsigned long))
> +
> +#define hostdata_session(_hostdata) (iscsi_ptr(*(unsigned long *)_hostdata))
> +
> +#define session_to_cls(_sess) 	hostdata_session(_sess->host->hostdata)
> +
> +
> +
> +
> +/*
> + * Function: bnx2i_register_xport
> + * Description: this routine will allocate memory for SCSI host template,
> + * 		iSCSI template and registers one instance of NX2 device with
> + *		iSCSI Transport Kernel module.
> + */
> +int bnx2i_register_xport(struct bnx2i_hba *hba)
> +{
> +	void *mem_ptr;
> +	char dev_id = '8';
> +
> +	get_bnx2_device(hba, dev_id);
> +
> +	mem_ptr = kmalloc(sizeof(struct scsi_host_template), GFP_KERNEL);
> +	hba->scsi_template = mem_ptr;
> +	if (hba->scsi_template == NULL) {
> +		printk(KERN_ALERT "bnx2i: failed to alloc memory for sht\n");
> +		return -ENOMEM;
> +	}
> +
> +	mem_ptr = kmalloc(sizeof(struct iscsi_transport), GFP_KERNEL);
> +	hba->iscsi_transport = mem_ptr;
> +	if (hba->iscsi_transport == NULL) {
> +		printk(KERN_ALERT "mem error for iscsi_transport template\n");
> +		goto iscsi_xport_err;
> +	}
> +
> +	mem_ptr = kmalloc(BRCM_ISCSI_XPORT_NAME_SIZE_MAX, GFP_KERNEL);
> +	if (mem_ptr == NULL) {
> +		printk(KERN_ALERT "failed to alloc memory for xport name\n");
> +		goto scsi_name_mem_err;
> +	}
> +
> +	memcpy(hba->scsi_template, (const void *) &bnx2i_host_template,
> +	       sizeof(struct scsi_host_template));
> +	hba->scsi_template->name = mem_ptr;
> +	memcpy((void *) hba->scsi_template->name,
> +	       (const void *) bnx2i_host_template.name,
> +	       strlen(bnx2i_host_template.name) + 1);
> +
> +	mem_ptr = kmalloc(BRCM_ISCSI_XPORT_NAME_SIZE_MAX, GFP_KERNEL);
> +	if (mem_ptr == NULL) {
> +		printk(KERN_ALERT "failed to alloc proc name mem\n");
> +		goto scsi_proc_name_mem_err;
> +	}
> +	hba->scsi_template->proc_name = mem_ptr;
> +
> +	memcpy((void *) hba->iscsi_transport,
> +	       (const void *) &bnx2i_iscsi_transport,
> +	       sizeof(struct iscsi_transport));
> +
> +	hba->iscsi_transport->host_template = hba->scsi_template;
> +
> +	mem_ptr = kmalloc(BRCM_ISCSI_XPORT_NAME_SIZE_MAX, GFP_KERNEL);
> +	if (mem_ptr == NULL) {
> +		printk(KERN_ALERT "mem alloc error, iscsi xport name\n");
> +		goto xport_name_mem_err;
> +	}
> +	hba->iscsi_transport->name = mem_ptr;
> +	sprintf(mem_ptr, "%s%c-%.2x%.2x%.2x", BRCM_ISCSI_XPORT_NAME_PREFIX,
> +			 dev_id, (u8)hba->pci_dev->bus->number,
> +			 hba->pci_devno, (u8)hba->pci_func);
> +
> +	memcpy((void *) hba->scsi_template->proc_name, (const void *) mem_ptr,
> +	       strlen(mem_ptr) + 1);
> +
> +	hba->shost_template = iscsi_register_transport(hba->iscsi_transport);
> +	if (!hba->shost_template) {
> +		printk(KERN_ALERT "bnx2i: xport reg failed, hba 0x%p\n", hba);
> +		goto failed_registration;
> +	}


The transport should be allocated like qla4xxx or iscsi_tcp/ib_iser. 
Userspace then sets everything up based on usersettings like with 
qla4xxx and iscsi_tcp and ib_iser.

I started that patch that I sent to you guys to modify 
scsi_transport_iscsi abd libiscsi for you guys, but have not finished it 
yet. I am trying to finish merging Olafs iscsi_tcp patches. Feel free to 
change iscsid, scsi_transport_iscsi and libiscsi for me.


> +}
> +
> +int bnx2i_sysfs_setup(void)
> +{
> +	int ret;
> +	ret = class_register(&bnx2i_class);
> +
> +	bnx2i_register_port_class_dev(&port_class_dev);
> +	return ret;
> +}
> +
> +void bnx2i_sysfs_cleanup(void)
> +{
> +	class_device_unregister(&port_class_dev);
> +	class_unregister(&bnx2i_class);
> +}

The sysfs bits related to the hba should be use one of the scsi sysfs 
facilities or if they are related to iscsi bits and are generic then 
through the iscsi hba

^ permalink raw reply

* Re: [PATCH] make _minimum_ TCP retransmission timeout configurable
From: Ilpo Järvinen @ 2007-09-05 19:04 UTC (permalink / raw)
  To: David Miller; +Cc: rick.jones2, ian.mcdonald, Netdev
In-Reply-To: <20070829.161528.38309258.davem@davemloft.net>

On Wed, 29 Aug 2007, David Miller wrote:

> From: Rick Jones <rick.jones2@hp.com>
> Date: Wed, 29 Aug 2007 16:06:27 -0700
> 
> > I belive the biggest component comes from link-layer retransmissions. 
> > There can also be some short outtages thanks to signal blocking, 
> > tunnels, people with big hats and whatnot that the link-layer 
> > retransmissions are trying to address.  The three seconds seems to be a 
> > value that gives the certainty that 99 times out of 10 the segment was 
> > indeed lost.
> > 
> > The trace I've been sent shows clean RTTs ranging from ~200 milliseconds 
> > to ~7000 milliseconds.
> 
> Thanks for the info.
> 
> It's pretty easy to generate examples where we might have some sockets
> talking over interfaces on such a network and others which are not.
> Therefore, if we do this, a per-route metric is probably the best bet.
> 
> Ilpo, I'm also very interested to see what you think of all of this :-)

...Haven't been too actively reading mails for a while until now, so I'm a 
bit late in response... I'll try to quickly summarize FRTO here.

It's true that FRTO cannot prevent the first retransmission, yet I suspect 
that it won't cost that much even if you have to pay for each bit, won't 
be that high percentage out of all packets after all :-). However, usually 
when you have a spurious RTO, not only the first segment unnecessarily 
retransmitted but the *whole window*. It goes like this: all cumulative 
ACKs got delayed due to in-order delivery, then TCP will actually send
1.5*original cwnd worth of data in the RTO's slow-start when the delayed 
ACKs arrive (basically the original cwnd worth of it unnecessarily). In 
case one is interested in minimizing unnecessary retransmissions e.g. due 
to cost, those rexmissions must never see daylight. Besides, in the worst 
case the generated burst overloads the bottleneck buffers which is likely 
to significantly delay the further progress of the flow. In case of ll 
rexmissions, ACK compression often occurs at the same time making the 
burst very "sharp edged" (in that case TCP often loses most of the 
segments above high_seq => very bad performance too). When FRTO is 
enabled, those unnecessary retransmissions are fully avoided except for 
the first segment and the cwnd behavior after detected spurious RTO is 
determined by the response (one can tune that by sysctl). Basic version 
(non-SACK enhanced one), FRTO can fail to detect spurious RTO as spurious 
and falls back to conservative behavior. ACK lossage is much less 
significant than reordering, usually the FRTO can detect spurious RTO if 
at least 2 cumulative ACKs from original window are preserved (excluding 
the ACK that advances to high_seq). With SACK-enhanced version, the 
detection is quite robust. Of course one could jump to min_rto bandwagon 
instead, but it often ends up being more or less black magic and can still 
produce unwanted behavior unless one goes to ridicilously high minimum RTOs.

Main obstacle to FRTO use is its deployment as it has to be on the sender 
side where as wireless link is often the receiver's access link but if one 
can tune tcp_min_rto (or equal) on the sender side, one could enable
FRTO at will as well. Anyway, anything older than 2.6.22 is not going to 
give very good results with FRTO. FRTO code's maturity point of view, IMHO 
currently just unconditional clearing of undo_marker (in 
tcp_enter_frto_loss) is on the way of enabling FRTO in future kernels by 
default as it basically disables DSACK undoing, I'll try to solve that 
soon, has been on my todo list for too long already (don't currently have 
much time to devote to that though so 2.6.24-rc1 might come too early for 
me :-(). After that, it might be a good move to enable it in mainline by 
default if you agree... ...Uninteresting enough, even IETF seems to 
interested in advancing FRTO from experimental [1].

Another important thing to consider in cellular besides ll rexmissions is 
bandwidth allocation delay... We actually a week ago ran some measurements 
in a real umts network to determine buffer, one-way delay, etc. behavior 
(though YMMV depending on operators configuration etc.). Basically we saw 
1 s delay spike when allocation delay occurs (it's very hard to predict 
when that happens due to other network users role). One-way propagation 
delay was around 50 ms, so 1500 bytes takes about 80 ms+ to transmit, so 
it's it order of magnitude larger than RTT but queue delay is probably 
large enough to prevent spurious RTOs due to allocation delay. Besides 
that, we saw some long latencies, up to 8-12 s, they could be due to ll 
retransmissions but their source is not yet verified to be the WWAN link 
as we had the phone connected through bluetooth (could interfere). A funny 
sidenote about the experiment, we found out what Linux cannot do (from 
userspace only): it seems to be unable to receive the same packet it has 
sent out to itself as we forced the packet out from eth0 by binding 
sending dev to eth0 and received from ppp0 => the packet gots always 
discard as martian and there seems to be no knob to that, so had to 
hack it :-).


-- 
 i.


[1] http://www1.ietf.org/mail-archive/web/tcpm/current/msg02862.html

^ permalink raw reply

* [TG3]: Workaround MSI bug on 5714/5780.
From: Michael Chan @ 2007-09-05 20:07 UTC (permalink / raw)
  To: davem; +Cc: netdev, andy

[TG3]: Workaround MSI bug on 5714/5780.

A hardware bug was revealed after a recent PCI MSI patch was made to
always disable legacy INTX when enabling MSI.  The 5714/5780 chips
will not generate MSI when INTX is disabled, causing MSI failure
messages to be reported, and another patch was made to workaround the
problem by disabling MSI on ServerWorks HT1000 bridge chips commonly
found with the 5714.

We workaround this chip bug by enabling INTX after we enable MSI and
after we resume from suspend.

Update version to 3.81.

This problem was discovered by David Miller.

Signed-off-by: Michael Chan <mchan@broadcom.com>

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 5874042..9034a05 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -64,8 +64,8 @@
 
 #define DRV_MODULE_NAME		"tg3"
 #define PFX DRV_MODULE_NAME	": "
-#define DRV_MODULE_VERSION	"3.80"
-#define DRV_MODULE_RELDATE	"August 2, 2007"
+#define DRV_MODULE_VERSION	"3.81"
+#define DRV_MODULE_RELDATE	"September 5, 2007"
 
 #define TG3_DEF_MAC_MODE	0
 #define TG3_DEF_RX_MODE		0
@@ -7127,6 +7127,10 @@ static int tg3_open(struct net_device *dev)
 		} else if (pci_enable_msi(tp->pdev) == 0) {
 			u32 msi_mode;
 
+			/* Hardware bug - MSI won't work if INTX disabled. */
+			if (tp->tg3_flags2 & TG3_FLG2_5780_CLASS)
+				pci_intx(tp->pdev, 1);
+
 			msi_mode = tr32(MSGINT_MODE);
 			tw32(MSGINT_MODE, msi_mode | MSGINT_MODE_ENABLE);
 			tp->tg3_flags2 |= TG3_FLG2_USING_MSI;
@@ -12172,6 +12176,11 @@ static int tg3_resume(struct pci_dev *pdev)
 	if (err)
 		return err;
 
+	/* Hardware bug - MSI won't work if INTX disabled. */
+	if ((tp->tg3_flags2 & TG3_FLG2_5780_CLASS) &&
+	    (tp->tg3_flags2 & TG3_FLG2_USING_MSI))
+		pci_intx(tp->pdev, 1);
+
 	netif_device_attach(dev);
 
 	tg3_full_lock(tp, 0);



^ permalink raw reply related

* Re: [PATCH] 3c59x: sparse warning fix
From: Francois Romieu @ 2007-09-05 19:22 UTC (permalink / raw)
  To: Joe Perches; +Cc: Stephen Hemminger, klassert, Jeff Garzik, netdev
In-Reply-To: <1189011425.13308.30.camel@localhost>

Joe Perches <joe@perches.com> :
[...]
> Shouldn't this be:
> 
> 	if (pkt_len < rx_copybreak) {
> 		skb = dev_alloc_skb(pkt_len + 2);
> 		if (!skb) {
> 			bad_news! (like the refill rx ring buffers block)
> 		}

It is not that bad a news: the rx_copybreak optimization is lost but it
is still possible to push the data up the stack.

-- 
Ueimor

^ permalink raw reply

* Re: [TG3]: Workaround MSI bug on 5714/5780.
From: Andy Gospodarek @ 2007-09-05 19:38 UTC (permalink / raw)
  To: Michael Chan; +Cc: davem, netdev
In-Reply-To: <1189022828.4960.2.camel@dell>

On 9/5/07, Michael Chan <mchan@broadcom.com> wrote:
> [TG3]: Workaround MSI bug on 5714/5780.
>
> A hardware bug was revealed after a recent PCI MSI patch was made to
> always disable legacy INTX when enabling MSI.  The 5714/5780 chips
> will not generate MSI when INTX is disabled, causing MSI failure
> messages to be reported, and another patch was made to workaround the
> problem by disabling MSI on ServerWorks HT1000 bridge chips commonly
> found with the 5714.
>
> We workaround this chip bug by enabling INTX after we enable MSI and
> after we resume from suspend.
>
> Update version to 3.81.
>
> This problem was discovered by David Miller.
>
> Signed-off-by: Michael Chan <mchan@broadcom.com>
>

Seems reasonable.

Acked-by: Andy Gospodarek <andy@greyhouse.net>

^ permalink raw reply

* Re: Oops in 2.6.22.1: skb_copy_and_csum_datagram_iovec()
From: Herbert Xu @ 2007-09-05 20:55 UTC (permalink / raw)
  To: Neil Brown, David S. Miller; +Cc: Chuck Ebbert, netdev
In-Reply-To: <18142.42509.447265.575323@notabene.brown>

Hi Neil:

On Wed, Sep 05, 2007 at 01:50:21PM +0100, Neil Brown wrote:
>
> > iov == NULL used to work.

Well it used to work mostly.  In fact, it still does work
mostly too.

> > I think it stopped working at 
> >    commit 759e5d006462d53fb708daa8284b4ad909415da1
> > 
> > Previously, as len==0, MSG_TRUNC would get set, so copy_only would get
> > set, so skb_copy_datagram_iovec would get called, and that handles a
> > len of 0.
> > 
> > Now, skb_copy_and_csum_datagram_iovec gets called unless
> > skb_csum_unnecessary(skb), which now kills us.

Not quite.  If MSG_TRUNC is set, then we'll just call
udp_lib_checksum_complete.  If it fails we'll bail, if
it succeeds then skb_csum_unnecessary(skb) is guaranteed
to be true.

So in this respect the new code behaves identically with
the old code.

> However earlier there is:
> 	ulen = skb->len - sizeof(struct udphdr);
> 	copied = len;
> 	if (copied > ulen)
> 		copied = ulen;
> 
> so if the 'len' (of the iovec) is too small, we end up with "copied == ulen", 

Nope, if len (== copied) is too small, we simply set MSG_TRUNC
as we did before.

Where this all falls apart is when the UDP payload is
empty (ulen == 0).  In that case MSG_TRUNC is not set
and we end up calling skb_copy_and_csum_datagram_iovec
which oopses.  I've looked up the original crash and
indeed %edi there which corresponds to the UDP payload
length is zero.

However, I can see where you're coming from in that we
shouldn't dereference msg_iov at all if msg_iovlen is 0.
This never happens with sys_recvmsg so we never bothered
checking it.  The following patch should fix the problem.

[NET]: Do not dereference iov if length is zero

When msg_iovlen is zero we shouldn't try to dereference
msg_iov.  Right now the only thing that tries to do so
is skb_copy_and_csum_datagram_iovec.  Since the total
length should also be zero if msg_iovlen is zero, it's
sufficient to check the total length there and simply
return if it's zero.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/net/core/datagram.c b/net/core/datagram.c
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -450,6 +450,9 @@ int skb_copy_and_csum_datagram_iovec(str
 	__wsum csum;
 	int chunk = skb->len - hlen;
 
+	if (!chunk)
+		return 0;
+
 	/* Skip filled elements.
 	 * Pretty silly, look at memcpy_toiovec, though 8)
 	 */

^ permalink raw reply

* Re: [Lksctp-developers] SCTP: Fix dead loop while received unexpected chunk with length set to zero
From: Vlad Yasevich @ 2007-09-05 20:57 UTC (permalink / raw)
  To: Wei Yongjun; +Cc: lksctp-developers, netdev
In-Reply-To: <46D7EBA7.5080807@cn.fujitsu.com>

Wei Yongjun wrote:

> Packet changed:
> 1. Used sctp_sf_ootb() to handle OOTB packet
> 2. Remove length check from sctp_sf_tabort_8_4_8() in last patch
> 3. Add length check to sctp_sf_ootb()
> 4. Changed validity check order in sctp_sf_do_5_1B_init() and other
> functions to fix possible attack.

Can you explain a little what the attack vector on this order.

See below...

> 
> This patch may be correct.
> 
> 
> Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
> 
> diff -Nurp a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
> --- a/net/sctp/sm_statefuns.c    2007-08-17 06:17:14.000000000 -0400
> +++ b/net/sctp/sm_statefuns.c    2007-08-19 07:52:17.000000000 -0400
> @@ -98,6 +98,7 @@ static sctp_disposition_t sctp_stop_t1_a
>                        struct sctp_transport *transport);
> 
> static sctp_disposition_t sctp_sf_abort_violation(
> +                     const struct sctp_endpoint *ep,
>                      const struct sctp_association *asoc,
>                      void *arg,
>                      sctp_cmd_seq_t *commands,
> @@ -181,6 +182,14 @@ sctp_disposition_t sctp_sf_do_4_C(const     struct
> sctp_chunk *chunk = arg;
>     struct sctp_ulpevent *ev;
> 
> +    if (!sctp_vtag_verify_either(chunk, asoc))
> +        return sctp_sf_pdiscard(ep, asoc, type, arg, commands);
> +
> +    /* Make sure that the SHUTDOWN_COMPLETE chunk has a valid length. */
> +    if (!sctp_chunk_length_valid(chunk, sizeof(sctp_chunkhdr_t)))
> +        return sctp_sf_violation_chunklen(ep, asoc, type, arg,
> +                          commands);
> +
>     /* RFC 2960 6.10 Bundling
>      *
>      * An endpoint MUST NOT bundle INIT, INIT ACK or
> @@ -189,9 +198,6 @@ sctp_disposition_t sctp_sf_do_4_C(const     if
> (!chunk->singleton)
>         return SCTP_DISPOSITION_VIOLATION;
> 
> -    if (!sctp_vtag_verify_either(chunk, asoc))
> -        return sctp_sf_pdiscard(ep, asoc, type, arg, commands);
> -
>     /* RFC 2960 10.2 SCTP-to-ULP
>      *
>      * H) SHUTDOWN COMPLETE notification

OK, I can see how moving this resolves the attack on SHUTDOWN COMPLETE,
but that was because we reported a rather silly violation message and
not really skipping the chunk properly.

I looking at other handling of SHUDOWN COMPLETE, one could make an
argument for simply discarding the packet packet in both cases.

On the other hand, if one protocol violation deserves an abort, then
why not the other.  They are both very blatant.


> @@ -267,6 +273,20 @@ sctp_disposition_t sctp_sf_do_5_1B_init(
>     struct sock *sk;
>     int len;
> 
> +    /* Make sure that the INIT chunk has a valid length.
> +     * Normally, this would cause an ABORT with a Protocol Violation
> +     * error, but since we don't have an association, we'll
> +     * just discard the packet.
> +     */
> +    if (!sctp_chunk_length_valid(chunk, sizeof(sctp_init_chunk_t)))
> +        return sctp_sf_pdiscard(ep, asoc, type, arg, commands);
> +
> +    /* 3.1 A packet containing an INIT chunk MUST have a zero Verification
> +     * Tag.
> +     */
> +    if (chunk->sctp_hdr->vtag != 0)
> +        return sctp_sf_tabort_8_4_8(ep, asoc, type, arg, commands);
> +
>     /* 6.10 Bundling
>      * An endpoint MUST NOT bundle INIT, INIT ACK or
>      * SHUTDOWN COMPLETE with any other chunks.
> @@ -295,20 +315,6 @@ sctp_disposition_t sctp_sf_do_5_1B_init(
>          sk_acceptq_is_full(sk)))
>         return sctp_sf_tabort_8_4_8(ep, asoc, type, arg, commands);
> 
> -    /* 3.1 A packet containing an INIT chunk MUST have a zero Verification
> -     * Tag.
> -     */
> -    if (chunk->sctp_hdr->vtag != 0)
> -        return sctp_sf_tabort_8_4_8(ep, asoc, type, arg, commands);
> -
> -    /* Make sure that the INIT chunk has a valid length.
> -     * Normally, this would cause an ABORT with a Protocol Violation
> -     * error, but since we don't have an association, we'll
> -     * just discard the packet.
> -     */
> -    if (!sctp_chunk_length_valid(chunk, sizeof(sctp_init_chunk_t)))
> -        return sctp_sf_pdiscard(ep, asoc, type, arg, commands);
> -
>     /* Verify the INIT chunk before processing it. */
>     err_chunk = NULL;
>     if (!sctp_verify_init(asoc, chunk->chunk_hdr->type,

Why re-order here?  Is it because sctp_sf_tabort_8_4_8() doesn't discard
the packet?

An interesting problem with reordering this is that we now respond with
an abort when previously we were silently discarding.  This is the bundled
INIT case.  

I spent some time looking at sctp_sf_tabort_8_4_8().  Based on the spec,
that function should discard the packet.  I finally had time to really
look at all the callers and I looks like they will be ok with that.
This was something I asked before, but never got an answer to.

> @@ -591,12 +597,6 @@ sctp_disposition_t sctp_sf_do_5_1D_ce(co
>     int error = 0;
>     struct sctp_chunk *err_chk_p;
> 
> -    /* If the packet is an OOTB packet which is temporarily on the
> -     * control endpoint, respond with an ABORT.
> -     */
> -    if (ep == sctp_sk((sctp_get_ctl_sock()))->ep)
> -        return sctp_sf_ootb(ep, asoc, type, arg, commands);
> -
>     /* Make sure that the COOKIE_ECHO chunk has a valid length.
>      * In this case, we check that we have enough for at least a
>      * chunk header.  More detailed verification is done
> @@ -605,6 +605,12 @@ sctp_disposition_t sctp_sf_do_5_1D_ce(co
>     if (!sctp_chunk_length_valid(chunk, sizeof(sctp_chunkhdr_t)))
>         return sctp_sf_pdiscard(ep, asoc, type, arg, commands);
> 
> +    /* If the packet is an OOTB packet which is temporarily on the
> +     * control endpoint, respond with an ABORT.
> +     */
> +    if (ep == sctp_sk((sctp_get_ctl_sock()))->ep)
> +        return sctp_sf_tabort_8_4_8(ep, asoc, type, arg, commands);
> +
>     /* "Decode" the chunk.  We have no optional parameters so we
>      * are in good shape.
>      */
> @@ -1281,6 +1287,20 @@ static sctp_disposition_t sctp_sf_do_une
>     sctp_unrecognized_param_t *unk_param;
>     int len;
> 
> +    /* Make sure that the INIT chunk has a valid length.
> +     * In this case, we generate a protocol violation since we have
> +     * an association established.
> +     */
> +    if (!sctp_chunk_length_valid(chunk, sizeof(sctp_init_chunk_t)))
> +        return sctp_sf_violation_chunklen(ep, asoc, type, arg,
> +                          commands);
> +
> +    /* 3.1 A packet containing an INIT chunk MUST have a zero Verification
> +     * Tag.
> +     */
> +    if (chunk->sctp_hdr->vtag != 0)
> +        return sctp_sf_tabort_8_4_8(ep, asoc, type, arg, commands);
> +
>     /* 6.10 Bundling
>      * An endpoint MUST NOT bundle INIT, INIT ACK or
>      * SHUTDOWN COMPLETE with any other chunks.
> @@ -1293,19 +1313,6 @@ static sctp_disposition_t sctp_sf_do_une
>     if (!chunk->singleton)
>         return sctp_sf_pdiscard(ep, asoc, type, arg, commands);
> 
> -    /* 3.1 A packet containing an INIT chunk MUST have a zero Verification
> -     * Tag.
> -     */
> -    if (chunk->sctp_hdr->vtag != 0)
> -        return sctp_sf_tabort_8_4_8(ep, asoc, type, arg, commands);
> -
> -    /* Make sure that the INIT chunk has a valid length.
> -     * In this case, we generate a protocol violation since we have
> -     * an association established.
> -     */
> -    if (!sctp_chunk_length_valid(chunk, sizeof(sctp_init_chunk_t)))
> -        return sctp_sf_violation_chunklen(ep, asoc, type, arg,
> -                          commands);
>     /* Grab the INIT header.  */
>     chunk->subh.init_hdr = (sctp_inithdr_t *) chunk->skb->data;
> 

The same question as for the INIT processing above applies here as
well.

> @@ -2495,6 +2502,11 @@ sctp_disposition_t sctp_sf_do_9_2_reshut
>     struct sctp_chunk *chunk = (struct sctp_chunk *) arg;
>     struct sctp_chunk *reply;
> 
> +    /* Make sure that the chunk has a valid length */
> +    if (!sctp_chunk_length_valid(chunk, sizeof(sctp_chunkhdr_t)))
> +        return sctp_sf_violation_chunklen(ep, asoc, type, arg,
> +                          commands);
> +
>     /* Since we are not going to really process this INIT, there
>      * is no point in verifying chunk boundries.  Just generate
>      * the SHUTDOWN ACK.
> @@ -3146,6 +3158,11 @@ sctp_disposition_t sctp_sf_ootb(const st
>         ch = (sctp_chunkhdr_t *) ch_end;
>     } while (ch_end < skb_tail_pointer(skb));
> 
> +    /* Make sure that the chunk has a valid length. */
> +    if (!sctp_chunk_length_valid(chunk, sizeof(sctp_chunkhdr_t)))
> +        return sctp_sf_violation_chunklen(ep, asoc, type, arg,
> +                          commands);
> +
>     if (ootb_shut_ack)
>         sctp_sf_shut_8_4_5(ep, asoc, type, arg, commands);
>     else

Hm..  I think you can just replace the 'break' lines in the loop
with a call to sctp_sf_violation_chunklen() can get rid of yet
another conditional.


The rest looks good
Thanks
-vlad

> @@ -3240,6 +3257,13 @@ sctp_disposition_t sctp_sf_do_8_5_1_E_sa
>                       void *arg,
>                       sctp_cmd_seq_t *commands)
> {
> +    struct sctp_chunk *chunk = arg;
> +
> +    /* Make sure that the SHUTDOWN_ACK chunk has a valid length. */
> +    if (!sctp_chunk_length_valid(chunk, sizeof(sctp_chunkhdr_t)))
> +        return sctp_sf_violation_chunklen(ep, asoc, type, arg,
> +                          commands);
> +
>     /* Although we do have an association in this case, it corresponds
>      * to a restarted association. So the packet is treated as an OOTB
>      * packet and the state function that handles OOTB SHUTDOWN_ACK is
> @@ -3654,6 +3678,16 @@ sctp_disposition_t sctp_sf_discard_chunk
>                      void *arg,
>                      sctp_cmd_seq_t *commands)
> {
> +    struct sctp_chunk *chunk = arg;
> +
> +    /* Make sure that the chunk has a valid length.
> +     * Since we don't know the chunk type, we use a general
> +     * chunkhdr structure to make a comparison.
> +     */
> +    if (!sctp_chunk_length_valid(chunk, sizeof(sctp_chunkhdr_t)))
> +        return sctp_sf_violation_chunklen(ep, asoc, type, arg,
> +                          commands);
> +
>     SCTP_DEBUG_PRINTK("Chunk %d is discarded\n", type.chunk);
>     return SCTP_DISPOSITION_DISCARD;
> }
> @@ -3709,6 +3743,13 @@ sctp_disposition_t sctp_sf_violation(con
>                      void *arg,
>                      sctp_cmd_seq_t *commands)
> {
> +    struct sctp_chunk *chunk = arg;
> +
> +    /* Make sure that the chunk has a valid length. */
> +    if (!sctp_chunk_length_valid(chunk, sizeof(sctp_chunkhdr_t)))
> +        return sctp_sf_violation_chunklen(ep, asoc, type, arg,
> +                          commands);
> +
>     return SCTP_DISPOSITION_VIOLATION;
> }
> 
> @@ -3716,12 +3757,14 @@ sctp_disposition_t sctp_sf_violation(con
>  * Common function to handle a protocol violation.
>  */
> static sctp_disposition_t sctp_sf_abort_violation(
> +                     const struct sctp_endpoint *ep,
>                      const struct sctp_association *asoc,
>                      void *arg,
>                      sctp_cmd_seq_t *commands,
>                      const __u8 *payload,
>                      const size_t paylen)
> {
> +    struct sctp_packet *packet = NULL;
>     struct sctp_chunk *chunk =  arg;
>     struct sctp_chunk *abort = NULL;
> 
> @@ -3730,22 +3773,41 @@ static sctp_disposition_t sctp_sf_abort_
>     if (!abort)
>         goto nomem;
> 
> -    sctp_add_cmd_sf(commands, SCTP_CMD_REPLY, SCTP_CHUNK(abort));
> -    SCTP_INC_STATS(SCTP_MIB_OUTCTRLCHUNKS);
> +    if (asoc) {
> +        sctp_add_cmd_sf(commands, SCTP_CMD_REPLY, SCTP_CHUNK(abort));
> +        SCTP_INC_STATS(SCTP_MIB_OUTCTRLCHUNKS);
> 
> -    if (asoc->state <= SCTP_STATE_COOKIE_ECHOED) {
> -        sctp_add_cmd_sf(commands, SCTP_CMD_TIMER_STOP,
> -                SCTP_TO(SCTP_EVENT_TIMEOUT_T1_INIT));
> -        sctp_add_cmd_sf(commands, SCTP_CMD_SET_SK_ERR,
> -                SCTP_ERROR(ECONNREFUSED));
> -        sctp_add_cmd_sf(commands, SCTP_CMD_INIT_FAILED,
> -                SCTP_PERR(SCTP_ERROR_PROTO_VIOLATION));
> +        if (asoc->state <= SCTP_STATE_COOKIE_ECHOED) {
> +            sctp_add_cmd_sf(commands, SCTP_CMD_TIMER_STOP,
> +                    SCTP_TO(SCTP_EVENT_TIMEOUT_T1_INIT));
> +            sctp_add_cmd_sf(commands, SCTP_CMD_SET_SK_ERR,
> +                    SCTP_ERROR(ECONNREFUSED));
> +            sctp_add_cmd_sf(commands, SCTP_CMD_INIT_FAILED,
> +                    SCTP_PERR(SCTP_ERROR_PROTO_VIOLATION));
> +        } else {
> +            sctp_add_cmd_sf(commands, SCTP_CMD_SET_SK_ERR,
> +                    SCTP_ERROR(ECONNABORTED));
> +            sctp_add_cmd_sf(commands, SCTP_CMD_ASSOC_FAILED,
> +                    SCTP_PERR(SCTP_ERROR_PROTO_VIOLATION));
> +            SCTP_DEC_STATS(SCTP_MIB_CURRESTAB);
> +        }
>     } else {
> -        sctp_add_cmd_sf(commands, SCTP_CMD_SET_SK_ERR,
> -                SCTP_ERROR(ECONNABORTED));
> -        sctp_add_cmd_sf(commands, SCTP_CMD_ASSOC_FAILED,
> -                SCTP_PERR(SCTP_ERROR_PROTO_VIOLATION));
> -        SCTP_DEC_STATS(SCTP_MIB_CURRESTAB);
> +        packet = sctp_ootb_pkt_new(asoc, chunk);
> +
> +        if (!packet)
> +            goto nomem;
> +
> +        if (sctp_test_T_bit(abort))
> +            packet->vtag = ntohl(chunk->sctp_hdr->vtag);
> +
> +        abort->skb->sk = ep->base.sk;
> +
> +        sctp_packet_append_chunk(packet, abort);
> +
> +        sctp_add_cmd_sf(commands, SCTP_CMD_SEND_PKT, +           
> SCTP_PACKET(packet));
> +
> +        SCTP_INC_STATS(SCTP_MIB_OUTCTRLCHUNKS);
>     }
> 
>     sctp_add_cmd_sf(commands, SCTP_CMD_DISCARD_PACKET, SCTP_NULL());
> @@ -3786,7 +3848,7 @@ static sctp_disposition_t sctp_sf_violat
> {
>     char err_str[]="The following chunk had invalid length:";
> 
> -    return sctp_sf_abort_violation(asoc, arg, commands, err_str,
> +    return sctp_sf_abort_violation(ep, asoc, arg, commands, err_str,
>                     sizeof(err_str));
> }
> 
> @@ -3805,7 +3867,7 @@ static sctp_disposition_t sctp_sf_violat
> {
>     char err_str[]="The cumulative tsn ack beyond the max tsn currently
> sent:";
> 
> -    return sctp_sf_abort_violation(asoc, arg, commands, err_str,
> +    return sctp_sf_abort_violation(ep, asoc, arg, commands, err_str,
>                     sizeof(err_str));
> }
> 
> diff -Nurp a/net/sctp/sm_statetable.c b/net/sctp/sm_statetable.c
> --- a/net/sctp/sm_statetable.c    2007-08-09 11:58:11.000000000 -0400
> +++ b/net/sctp/sm_statetable.c    2007-08-19 05:44:29.000000000 -0400
> @@ -110,7 +110,7 @@ const sctp_sm_table_entry_t *sctp_sm_loo
>     /* SCTP_STATE_EMPTY */ \
>     TYPE_SCTP_FUNC(sctp_sf_ootb), \
>     /* SCTP_STATE_CLOSED */ \
> -    TYPE_SCTP_FUNC(sctp_sf_tabort_8_4_8), \
> +    TYPE_SCTP_FUNC(sctp_sf_ootb), \
>     /* SCTP_STATE_COOKIE_WAIT */ \
>     TYPE_SCTP_FUNC(sctp_sf_discard_chunk), \
>     /* SCTP_STATE_COOKIE_ECHOED */ \
> @@ -173,7 +173,7 @@ const sctp_sm_table_entry_t *sctp_sm_loo
>     /*  SCTP_STATE_EMPTY */ \
>     TYPE_SCTP_FUNC(sctp_sf_ootb), \
>     /* SCTP_STATE_CLOSED */ \
> -    TYPE_SCTP_FUNC(sctp_sf_tabort_8_4_8), \
> +    TYPE_SCTP_FUNC(sctp_sf_ootb), \
>     /* SCTP_STATE_COOKIE_WAIT */ \
>     TYPE_SCTP_FUNC(sctp_sf_discard_chunk), \
>     /* SCTP_STATE_COOKIE_ECHOED */ \
> @@ -194,7 +194,7 @@ const sctp_sm_table_entry_t *sctp_sm_loo
>     /*  SCTP_STATE_EMPTY */ \
>     TYPE_SCTP_FUNC(sctp_sf_ootb), \
>     /* SCTP_STATE_CLOSED */ \
> -    TYPE_SCTP_FUNC(sctp_sf_tabort_8_4_8), \
> +    TYPE_SCTP_FUNC(sctp_sf_ootb), \
>     /* SCTP_STATE_COOKIE_WAIT */ \
>     TYPE_SCTP_FUNC(sctp_sf_discard_chunk), \
>     /* SCTP_STATE_COOKIE_ECHOED */ \
> @@ -216,7 +216,7 @@ const sctp_sm_table_entry_t *sctp_sm_loo
>     /*  SCTP_STATE_EMPTY */ \
>     TYPE_SCTP_FUNC(sctp_sf_ootb), \
>     /* SCTP_STATE_CLOSED */ \
> -    TYPE_SCTP_FUNC(sctp_sf_tabort_8_4_8), \
> +    TYPE_SCTP_FUNC(sctp_sf_ootb), \
>     /* SCTP_STATE_COOKIE_WAIT */ \
>     TYPE_SCTP_FUNC(sctp_sf_violation), \
>     /* SCTP_STATE_COOKIE_ECHOED */ \
> @@ -258,7 +258,7 @@ const sctp_sm_table_entry_t *sctp_sm_loo
>     /* SCTP_STATE_EMPTY */ \
>     TYPE_SCTP_FUNC(sctp_sf_ootb), \
>     /* SCTP_STATE_CLOSED */ \
> -    TYPE_SCTP_FUNC(sctp_sf_tabort_8_4_8), \
> +    TYPE_SCTP_FUNC(sctp_sf_ootb), \
>     /* SCTP_STATE_COOKIE_WAIT */ \
>     TYPE_SCTP_FUNC(sctp_sf_discard_chunk), \
>     /* SCTP_STATE_COOKIE_ECHOED */ \
> @@ -300,7 +300,7 @@ const sctp_sm_table_entry_t *sctp_sm_loo
>     /* SCTP_STATE_EMPTY */ \
>     TYPE_SCTP_FUNC(sctp_sf_ootb), \
>     /* SCTP_STATE_CLOSED */ \
> -    TYPE_SCTP_FUNC(sctp_sf_tabort_8_4_8), \
> +    TYPE_SCTP_FUNC(sctp_sf_ootb), \
>     /* SCTP_STATE_COOKIE_WAIT */ \
>     TYPE_SCTP_FUNC(sctp_sf_discard_chunk), \
>     /* SCTP_STATE_COOKIE_ECHOED */ \
> @@ -499,7 +499,7 @@ static const sctp_sm_table_entry_t addip
>     /* SCTP_STATE_EMPTY */ \
>     TYPE_SCTP_FUNC(sctp_sf_ootb), \
>     /* SCTP_STATE_CLOSED */ \
> -    TYPE_SCTP_FUNC(sctp_sf_tabort_8_4_8), \
> +    TYPE_SCTP_FUNC(sctp_sf_ootb), \
>     /* SCTP_STATE_COOKIE_WAIT */ \
>     TYPE_SCTP_FUNC(sctp_sf_discard_chunk), \
>     /* SCTP_STATE_COOKIE_ECHOED */ \
> @@ -528,7 +528,7 @@ chunk_event_table_unknown[SCTP_STATE_NUM
>     /* SCTP_STATE_EMPTY */
>     TYPE_SCTP_FUNC(sctp_sf_ootb),
>     /* SCTP_STATE_CLOSED */
> -    TYPE_SCTP_FUNC(sctp_sf_tabort_8_4_8),
> +    TYPE_SCTP_FUNC(sctp_sf_ootb),
>     /* SCTP_STATE_COOKIE_WAIT */
>     TYPE_SCTP_FUNC(sctp_sf_unk_chunk),
>     /* SCTP_STATE_COOKIE_ECHOED */
> 
> 


^ permalink raw reply

* Re: [PATCH v3 2/2][BNX2]: Add iSCSI support to BNX2 devices.
From: Anil Veerabhadrappa @ 2007-09-05 21:27 UTC (permalink / raw)
  To: Mike Christie
  Cc: Michael Chan, davem, netdev, open-iscsi, talm, lusinsky, uri,
	SCSI Mailing List
In-Reply-To: <46DEF69D.3090901@redhat.com>

On Wed, 2007-09-05 at 13:34 -0500, Mike Christie wrote:
> Michael Chan wrote:
> > +* This file defines HSI constants for the iSCSI flows
> > +*/
> > +
> > +/* iSCSI request op codes */
> > +#define ISCSI_OPCODE_NOP_OUT 			(0 | 0x40)
> > +#define ISCSI_OPCODE_SCSI_CMD 			(1)
> > +#define ISCSI_OPCODE_TMF_REQUEST 		(2 | 0x40)
> > +#define ISCSI_OPCODE_LOGIN_REQUEST 		(3 | 0x40)
> > +#define ISCSI_OPCODE_TEXT_REQUEST 		(4 | 0x40)
> > +#define ISCSI_OPCODE_DATA_OUT 			(5)
> > +#define ISCSI_OPCODE_LOGOUT_REQUEST 		(6 | 0x00)
> > +#define ISCSI_OPCODE_CLEANUP_REQUEST		(7)
> 
> 
> What is ISCSI_OPCODE_CLEANUP_REQUEST?

This is a local message between iSCSI driver and the firmware to cleanup
iSCSI task resources held by chip. This operation is required to reclaim
SCSI command related resources after the TMF response is received for a
task and before it is completed to SCSI-ML


> 
> 
> > +
> > +/* iSCSI response/messages op codes */
> > +#define ISCSI_OPCODE_NOP_IN 			(0x20)
> > +#define ISCSI_OPCODE_SCSI_RESPONSE 		(0x21)
> > +#define ISCSI_OPCODE_TMF_RESPONSE 		(0x22)
> > +#define ISCSI_OPCODE_LOGIN_RESPONSE 		(0x23)
> > +#define ISCSI_OPCODE_TEXT_RESPONSE 		(0x24)
> > +#define ISCSI_OPCODE_DATA_IN			(0x25)
> > +#define ISCSI_OPCODE_LOGOUT_RESPONSE 		(0x26)
> > +#define ISCSI_OPCODE_CLEANUP_RESPONSE 		(0x27)
> > +#define ISCSI_OPCODE_R2T			(0x31)
> > +#define ISCSI_OPCODE_ASYNC_MSG 			(0x32)
> > +#define ISCSI_OPCODE_REJECT 			(0x3f)
> > +#define ISCSI_OPCODE_NOPOUT_LOCAL_COMPLETION	(0)
> 
> 
> What does the LOCAL_COMPLETION on the nopout mean?

This is the completion notification for NOPOUT pdus which does not
involve response from the target. Following two NOPOUTs are classified
under this -
1. Initiator's proactive nopout with ITT=0xFFFFFFFF
2. Initiator's response to target nopin with TTT != 0xFFFFFFFF


> 
> 
> > +
> > +/* iSCSI stages */
> > +#define ISCSI_STAGE_SECURITY_NEGOTIATION (0)
> > +#define ISCSI_STAGE_LOGIN_OPERATIONAL_NEGOTIATION (1)
> > +#define ISCSI_STAGE_FULL_FEATURE_PHASE (3)
> > +/* Logout response codes */
> > +#define ISCSI_LOGOUT_RESPONSE_CONNECTION_CLOSED (0)
> > +#define ISCSI_LOGOUT_RESPONSE_CID_NOT_FOUND (1)
> > +#define ISCSI_LOGOUT_RESPONSE_CLEANUP_FAILED (3)
> > +
> > +/* iSCSI task types */
> > +#define ISCSI_TASK_TYPE_READ    (0)
> > +#define ISCSI_TASK_TYPE_WRITE   (1)
> > +#define ISCSI_TASK_TYPE_MPATH   (2)
> 
> 
> 
> 
> All of these iscsi code shoulds be in iscsi_proto.h or should be added 
> there.
This is a very tricky proposal as this header file is automatically
generated by a well defined process and is shared between various driver
supporting multiple platform/OS and the firmware. If it is not of a big
issue I would like to keep it the way it is.


> 
> 
> > +
> > +#endif
> > diff --git a/drivers/scsi/bnx2i/bnx2i_hwi.c b/drivers/scsi/bnx2i/bnx2i_hwi.c
> > new file mode 100644
> > index 0000000..b74a93c
> > --- /dev/null
> > +++ b/drivers/scsi/bnx2i/bnx2i_hwi.c
> > @@ -0,0 +1,1977 @@
> > +
> > +	if (tmf_cqe->response == ISCSI_TMF_RSP_COMPLETE) {
> > +		if (aborted_cmd->scsi_cmd) {
> > +			aborted_cmd->scsi_cmd->result = DID_ABORT << 16;
> > +			aborted_cmd->scsi_cmd->resid =
> 
> should use resid wrappers throughout driver.
> 
> > +				aborted_cmd->scsi_cmd->request_bufflen;
> 
> Should use bufflen wrappers throught driver.

Will look into this.


> > +	cmd->cmd_state = ISCSI_CMD_STATE_INITIATED;
> > +	sc->SCp.ptr = (char *) cmd;
> > +
> > +	if (cmd->req.itt != ITT_INVALID_SIGNATURE) {
> > +		bnx2i_send_iscsi_scsicmd(conn, cmd);
> > +		list_add_tail(&cmd->link, &sess->active_cmds);
> > +		sess->num_active_cmds++;
> > +	}
> > +	return 0;
> > +
> > +iscsi_win_closed:
> > +cmd_not_accepted:
> > +	return SCSI_MLQUEUE_HOST_BUSY;
> > +
> > +dev_not_found:
> > +	sc->sense_buffer[0] = 0x70;
> > +	sc->sense_buffer[2] = NOT_READY;
> > +	sc->sense_buffer[7] = 0x6;
> > +	sc->sense_buffer[12] = 0x08;
> > +	sc->sense_buffer[13] = 0x00;
> 
> 
> do not fake sense and do not face sense and set a host byte of 
> DID_NO_CONNECT. DID_NO_CONNECT is fine.

ok, thanks.

> 
> 
> > +	sc->result = (DID_NO_CONNECT << 16);
> > +	sc->resid = sc->request_bufflen;
> > +	sc->scsi_done(sc);
> > +	return 0;
> > +}
> > +
> > +
> > +
> > +/*
> > + * TMF request timeout handler
> > + */
> > +static void bnx2i_iscsi_tmf_timer(unsigned long data)
> > +{
> > +	struct bnx2i_cmd *cmd = (struct bnx2i_cmd *) data;
> > +
> > +	printk(KERN_ALERT "TMF timer: abort failed, cmd 0x%p\n", cmd);
> > +	cmd->cmd_state = ISCSI_CMD_STATE_TMF_TIMEOUT;
> > +	cmd->conn->sess->recovery_state = ISCSI_SESS_RECOVERY_OPEN_ISCSI;
> > +	iscsi_conn_error(cmd->conn->cls_conn, ISCSI_ERR_CONN_FAILED);
> > +}
> > +
> > +
> > +/*
> > + * initiate command abort process by requesting CNIC to send
> > + *	an iSCSI TMF request to target
> > + */
> > +static int bnx2i_initiate_abort_cmd(struct scsi_cmnd *sc)
> > +{
> > +	struct bnx2i_cmd *cmd = (struct bnx2i_cmd *) sc->SCp.ptr;
> > +	struct bnx2i_cmd *tmf_cmd;
> > +	struct Scsi_Host *shost = cmd->scsi_cmd->device->host;
> > +	struct bnx2i_conn *conn = cmd->conn;
> > +	struct bnx2i_sess *sess;
> > +	struct bnx2i_hba *hba;
> > +
> > +	shost = cmd->scsi_cmd->device->host;
> > +	sess = iscsi_hostdata(shost->hostdata);
> > +	BUG_ON(shost != sess->host);
> > +
> > +	if (sess && (is_sess_active(sess)))
> > +		hba = sess->hba;
> > +	else
> > +		return FAILED;
> > +
> > +	if (test_bit(ADAPTER_STATE_LINK_DOWN, &hba->adapter_state))
> > +		return FAILED;
> > +
> > +	bnx2i_setup_ictx_dump(hba, conn);
> > +
> > +	if (cmd->scsi_cmd != sc)
> > +		/* command already completed to scsi mid-layer */
> > +		goto cmd_not_active;
> > +
> > +	tmf_cmd = bnx2i_alloc_cmd(sess);
> > +	if (tmf_cmd == NULL)
> > +		goto lack_of_resc;
> > +
> > +	tmf_cmd->conn = conn = sess->lead_conn;
> > +	tmf_cmd->scsi_cmd = NULL;
> > +	tmf_cmd->iscsi_opcode = ISCSI_OPCODE_TMF_REQUEST;
> > +	tmf_cmd->req.cmd_sn = sess->cmdsn;
> > +	tmf_cmd->tmf_ref_itt = cmd->req.itt;
> > +	tmf_cmd->tmf_ref_cmd = cmd;
> > +	tmf_cmd->tmf_ref_sc = cmd->scsi_cmd;
> > +	cmd->cmd_state = ISCSI_CMD_STATE_ABORT_PEND;
> > +	tmf_cmd->cmd_state = ISCSI_CMD_STATE_INITIATED;
> > +
> > +#define BNX2I_TMF_TIMEOUT	10 * HZ
> > +	sess->abort_timer.expires = BNX2I_TMF_TIMEOUT + jiffies;
> > +	sess->abort_timer.function = bnx2i_iscsi_tmf_timer;
> > +	sess->abort_timer.data = (unsigned long)tmf_cmd;
> > +	add_timer(&sess->abort_timer);
> > +
> > +	bnx2i_send_iscsi_tmf(conn, tmf_cmd);
> > +
> > +	/* update iSCSI context for this conn, wait for CNIC to complete */
> > +	wait_event_interruptible(sess->er_wait, (!conn->ep ||
> > +		 			 (tmf_cmd->cmd_state ==
> > +					  ISCSI_CMD_STATE_FAILED) ||
> > +					 (tmf_cmd->cmd_state ==
> > +					  ISCSI_CMD_STATE_COMPLETED)));
> > +
> > +	if (signal_pending(current))
> > +		flush_signals(current);
> > +
> > +	del_timer_sync(&sess->abort_timer);
> > +
> > +	if (tmf_cmd->cmd_state == ISCSI_CMD_STATE_TMF_TIMEOUT) {
> > +		printk(KERN_ALERT "abort: abort failed, cmd 0x%p\n", tmf_cmd);
> > +		/* TMF timed out, return error status and let SCSI-ML do
> > +		 * session recovery.
> > +		 */
> > +		list_del_init(&tmf_cmd->link);
> > +		bnx2i_free_cmd(sess, tmf_cmd);
> > +		return SUCCESS;
> > +	}
> > +
> > +	list_del_init(&tmf_cmd->link);
> > +	bnx2i_free_cmd(sess, tmf_cmd);
> > +
> > +	if ((cmd->scsi_cmd->result & 0xFF0000) == (DID_ABORT << 16)) {
> > +		cmd->cmd_state = ISCSI_CMD_STATE_CLEANUP_PEND;
> > +		bnx2i_send_cmd_cleanup_req(hba, cmd);
> > +		wait_event_interruptible_timeout(sess->er_wait,
> > +						 (cmd->cmd_state ==
> > +						  ISCSI_CMD_STATE_CLEANUP_CMPL),
> > +						 msecs_to_jiffies(
> > +						  ISCSI_CMD_CLEANUP_TIMEOUT));
> > +
> > +		if (signal_pending(current))
> > +			flush_signals(current);
> > +	} else
> > +		cmd->scsi_cmd->result = (DID_ABORT << 16);
> > +
> > +	cmd->conn->sess->num_active_cmds--;
> > +	list_del_init(&cmd->link);
> > +	bnx2i_return_failed_command(sess, cmd, DID_ABORT);
> > +	bnx2i_free_cmd(sess, cmd);
> > +
> > +cmd_not_active:
> > +	return SUCCESS;
> > +
> > +lack_of_resc:
> > +	return FAILED;
> > +}
> > +
> > +
> > +/*
> > + * SCSI abort request handler.
> > + */
> > +int bnx2i_abort(struct scsi_cmnd *sc)
> > +{
> > +	int reason;
> > +	struct bnx2i_cmd *cmd = (struct bnx2i_cmd *) sc->SCp.ptr;
> > +
> > +	if (unlikely(!cmd)) {
> > +		/* command already completed to scsi mid-layer */
> > +		printk(KERN_INFO "bnx2i_abort: sc 0x%p, not active\n", sc);
> > +		return SUCCESS;
> > +	}
> > +
> > +	reason = bnx2i_initiate_abort_cmd(sc);
> > +	return reason;
> > +}
> > +
> > +
> > +
> > +/*
> > + * hardware reset
> > + */
> > +int bnx2i_reset(struct scsi_cmnd *sc)
> > +{
> > +	return 0;
> > +}
> 
> 
> So what is up with this one? It seems like if there is a way to reset 
> hardware then you would want it as the scsi eh host reset callout 
> instead of dropping the session. We could add some transport level 
> recovery callouts for the iscsi specifics.

We may not be able to support HBA cold reset as bnx2 driver is the
primary owner of chip reset and initialization. This is the drawback of
sharing network interface with the NIC driver. If there is a need for
administrator to reset the iSCSI port same can be achieved by running
'ifdown eth#' and 'ifup eth#'.
Current driver even allows ethernet interface reset when there are
active iSCSI connection, all active iscsi sessions will be reinstated
when the network link comes back live
 



> 
> 
> > +
> > +
> > +void bnx2i_return_failed_command(struct bnx2i_sess *sess,
> > +				 struct bnx2i_cmd *cmd, int err_code)
> > +{
> > +	struct scsi_cmnd *sc = cmd->scsi_cmd;
> > +	sc->result = err_code << 16;
> > +	sc->resid = cmd->scsi_cmd->request_bufflen;
> > +	cmd->scsi_cmd = NULL;
> > +	sess->num_active_cmds--;
> > +	sc->SCp.ptr = NULL;
> > +	sc->scsi_done(sc);
> > +}
> > +
> > +
> > +
> > +/*
> > + * SCSI host reset handler - iSCSI session recovery
> > + */
> > +int bnx2i_host_reset(struct scsi_cmnd *sc)
> > +{
> > +	struct Scsi_Host *shost;
> > +	struct bnx2i_sess *sess;
> > +	int rc = 0;
> > +
> > +	shost = sc->device->host;
> > +	sess = iscsi_hostdata(shost->hostdata);
> > +	printk(KERN_INFO "bnx2i: attempting to reset host, #%d\n",
> > +			  sess->host->host_no);
> > +
> > +	BUG_ON(shost != sess->host);
> > +	rc = bnx2i_do_iscsi_sess_recovery(sess, DID_RESET);
> > +
> > +	return rc;
> > +}
> > +
> > +
> > +
> > +/**********************************************************************
> > + *		open-iscsi interface
> > + **********************************************************************/
> > +
> > +
> > +#define get_bnx2_device(_hba, _devc) 	do {				\
> > +		if ((_hba->pci_did == PCI_DEVICE_ID_NX2_5706) ||	\
> > +			(_hba->pci_did == PCI_DEVICE_ID_NX2_5706S)) {	\
> > +			_devc = '6';					\
> > +		} else if ((_hba->pci_did == PCI_DEVICE_ID_NX2_5708) ||	\
> > +			(_hba->pci_did == PCI_DEVICE_ID_NX2_5708S)) {	\
> > +			_devc = '8';					\
> > +		} else if ((_hba->pci_did == PCI_DEVICE_ID_NX2_5709) ||	\
> > +			(_hba->pci_did == PCI_DEVICE_ID_NX2_5709S)) {	\
> > +			_devc = '9';					\
> > +		}							\
> > +	} while (0)
> > +
> > +/* from open-iscsi project */
> > +/*
> > + * iSCSI Session's hostdata organization:
> > + *
> > + *    *------------------* <== hostdata_session(host->hostdata)
> > + *    | ptr to class sess|
> > + *    |------------------| <== iscsi_hostdata(host->hostdata)
> > + *    | iscsi_session    |
> > + *    *------------------*
> > + */
> 
> 
> 
> I think it is going to be nicer to not allocate the host with the 
> session for bnx2i like was done for qla4xxx. If you are going to 
> allocate the host with the session then it seems like you should be able 
> to intergrate into libiscsi like was done with iscsi_tcp and ib_iser. 
> And even if you do not allocate a host per session it seems like you 
> should be able to better integreate with libiscsi like was done with 
> iscsi_tcp and ib_iser. Right now, it is sort of a mix of qla4xxx's model 
> and iscsi_tcp/ib_iser's model which is not going to work well in the 
> long run.
Either way will work for us, will investigate more on the code changes
required in bnx2i  


> You can change libiscsi so that you do not run in the host's workq and 
> so data structs are better optimized for you and whatever else is 
> needed. I mean feel free to change that code so it better suits your 
> model. Your code is probably closer to ib_iser since that module does 
> not have to handle operations like responding to r2ts and has to make 
> its own connections.

Right.


> 
> There seem to be some bugs from when you might have modeled bnx2i's 
> callouts after a older iscsi_tcp or ib_iser. Those bugs got fixed for 
> both modules for free when libiscsi got fixed.
> 
> 
> > +
> > +#define hostdata_privsize(_sz)	(sizeof(unsigned long) + _sz + \
> > +				 _sz % sizeof(unsigned long))
> > +
> > +#define hostdata_session(_hostdata) (iscsi_ptr(*(unsigned long *)_hostdata))
> > +
> > +#define session_to_cls(_sess) 	hostdata_session(_sess->host->hostdata)
> > +
> > +
> > +
> > +
> > +/*
> > + * Function: bnx2i_register_xport
> > + * Description: this routine will allocate memory for SCSI host template,
> > + * 		iSCSI template and registers one instance of NX2 device with
> > + *		iSCSI Transport Kernel module.
> > + */
> > +int bnx2i_register_xport(struct bnx2i_hba *hba)
> > +{
> > +	void *mem_ptr;
> > +	char dev_id = '8';
> > +
> > +	get_bnx2_device(hba, dev_id);
> > +
> > +	mem_ptr = kmalloc(sizeof(struct scsi_host_template), GFP_KERNEL);
> > +	hba->scsi_template = mem_ptr;
> > +	if (hba->scsi_template == NULL) {
> > +		printk(KERN_ALERT "bnx2i: failed to alloc memory for sht\n");
> > +		return -ENOMEM;
> > +	}
> > +
> > +	mem_ptr = kmalloc(sizeof(struct iscsi_transport), GFP_KERNEL);
> > +	hba->iscsi_transport = mem_ptr;
> > +	if (hba->iscsi_transport == NULL) {
> > +		printk(KERN_ALERT "mem error for iscsi_transport template\n");
> > +		goto iscsi_xport_err;
> > +	}
> > +
> > +	mem_ptr = kmalloc(BRCM_ISCSI_XPORT_NAME_SIZE_MAX, GFP_KERNEL);
> > +	if (mem_ptr == NULL) {
> > +		printk(KERN_ALERT "failed to alloc memory for xport name\n");
> > +		goto scsi_name_mem_err;
> > +	}
> > +
> > +	memcpy(hba->scsi_template, (const void *) &bnx2i_host_template,
> > +	       sizeof(struct scsi_host_template));
> > +	hba->scsi_template->name = mem_ptr;
> > +	memcpy((void *) hba->scsi_template->name,
> > +	       (const void *) bnx2i_host_template.name,
> > +	       strlen(bnx2i_host_template.name) + 1);
> > +
> > +	mem_ptr = kmalloc(BRCM_ISCSI_XPORT_NAME_SIZE_MAX, GFP_KERNEL);
> > +	if (mem_ptr == NULL) {
> > +		printk(KERN_ALERT "failed to alloc proc name mem\n");
> > +		goto scsi_proc_name_mem_err;
> > +	}
> > +	hba->scsi_template->proc_name = mem_ptr;
> > +
> > +	memcpy((void *) hba->iscsi_transport,
> > +	       (const void *) &bnx2i_iscsi_transport,
> > +	       sizeof(struct iscsi_transport));
> > +
> > +	hba->iscsi_transport->host_template = hba->scsi_template;
> > +
> > +	mem_ptr = kmalloc(BRCM_ISCSI_XPORT_NAME_SIZE_MAX, GFP_KERNEL);
> > +	if (mem_ptr == NULL) {
> > +		printk(KERN_ALERT "mem alloc error, iscsi xport name\n");
> > +		goto xport_name_mem_err;
> > +	}
> > +	hba->iscsi_transport->name = mem_ptr;
> > +	sprintf(mem_ptr, "%s%c-%.2x%.2x%.2x", BRCM_ISCSI_XPORT_NAME_PREFIX,
> > +			 dev_id, (u8)hba->pci_dev->bus->number,
> > +			 hba->pci_devno, (u8)hba->pci_func);
> > +
> > +	memcpy((void *) hba->scsi_template->proc_name, (const void *) mem_ptr,
> > +	       strlen(mem_ptr) + 1);
> > +
> > +	hba->shost_template = iscsi_register_transport(hba->iscsi_transport);
> > +	if (!hba->shost_template) {
> > +		printk(KERN_ALERT "bnx2i: xport reg failed, hba 0x%p\n", hba);
> > +		goto failed_registration;
> > +	}
> 
> 
> The transport should be allocated like qla4xxx or iscsi_tcp/ib_iser. 
> Userspace then sets everything up based on usersettings like with 
> qla4xxx and iscsi_tcp and ib_iser.
> 
> I started that patch that I sent to you guys to modify 
> scsi_transport_iscsi abd libiscsi for you guys, but have not finished it 
> yet. I am trying to finish merging Olafs iscsi_tcp patches. Feel free to 
> change iscsid, scsi_transport_iscsi and libiscsi for me.
> 
> 
> > +}
> > +
> > +int bnx2i_sysfs_setup(void)
> > +{
> > +	int ret;
> > +	ret = class_register(&bnx2i_class);
> > +
> > +	bnx2i_register_port_class_dev(&port_class_dev);
> > +	return ret;
> > +}
> > +
> > +void bnx2i_sysfs_cleanup(void)
> > +{
> > +	class_device_unregister(&port_class_dev);
> > +	class_unregister(&bnx2i_class);
> > +}
> 
> The sysfs bits related to the hba should be use one of the scsi sysfs 
> facilities or if they are related to iscsi bits and are generic then 
> through the iscsi hba

bnx2i needs 2 sysfs entries -
1. QP size info - this is used to size per connection shared data
structures to issue work requests to chip (login, scsi cmd, tmf, nopin)
and get completions from the chip (scsi completions, async messages,
etc'). This is a iSCSI HBA attribute
2. port mapper - we can be more flexible on classifying this as either
iSCSI HBA attribute or bnx2i driver global attribute
Can hooks be added to iSCSI transport class to include these?


^ permalink raw reply

* Re: [PATCH] 3c59x: sparse warning fix
From: Stephen Hemminger @ 2007-09-05 22:46 UTC (permalink / raw)
  To: Joe Perches; +Cc: klassert, Jeff Garzik, netdev
In-Reply-To: <1189011425.13308.30.camel@localhost>

On Wed, 05 Sep 2007 09:57:05 -0700
Joe Perches <joe@perches.com> wrote:

> On Wed, 2007-09-05 at 15:23 +0100, Stephen Hemminger wrote:
> > --- a/drivers/net/3c59x.c	2007-09-05 15:15:16.000000000 +0100
> > +++ b/drivers/net/3c59x.c	2007-09-05 15:16:29.000000000 +0100
> > @@ -1122,7 +1122,7 @@ static int __devinit vortex_probe1(struc
> >  					   + sizeof(struct boom_tx_desc) * TX_RING_SIZE,
> >  					   &vp->rx_ring_dma);
> >  	retval = -ENOMEM;
> > -	if (vp->rx_ring == 0)
> > +	if (!vp->rx_ring)
> >  		goto free_region;
> >  
> >  	vp->tx_ring = (struct boom_tx_desc *)(vp->rx_ring + RX_RING_SIZE);
> > @@ -2476,7 +2476,8 @@ boomerang_rx(struct net_device *dev)
> >  
> >  			/* Check if the packet is long enough to just accept without
> >  			   copying to a properly sized skbuff. */
> > -			if (pkt_len < rx_copybreak && (skb = dev_alloc_skb(pkt_len + 2)) != 0) {
> > +			if (pkt_len < rx_copybreak &&
> > +			    (skb = dev_alloc_skb(pkt_len + 2)) ) {
> >  				skb_reserve(skb, 2);	/* Align IP on 16 byte boundaries */
> >  				pci_dma_sync_single_for_cpu(VORTEX_PCI(vp), dma, PKT_BUF_SZ, PCI_DMA_FROMDEVICE);
> >  				/* 'skb_put()' points to the start of sk_buff data area. */
> 
> Shouldn't this be:
> 
> 	if (pkt_len < rx_copybreak) {
> 		skb = dev_alloc_skb(pkt_len + 2);
> 		if (!skb) {
> 			bad_news! (like the refill rx ring buffers block)
> 		}

No, the existing code path does the right thing, and if skb can't be allocated
it just reuses the existing one. Having a single error recovery path (for out of
memory) is preferable to two paths.

^ permalink raw reply

* Re: [PATCH 9/11] cxgb3 - engine microcode update
From: Divy Le Ray @ 2007-09-05 22:57 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Steve Wise, netdev, linux-kernel
In-Reply-To: <20070905143029.GC19543@havoc.gtf.org>


>
> >
> > I think 9-14 still need to be incorporated.  I don't see them in your
> > upstream branch, and they aren't in linus' tree either.
>
> I am not the blocker here. 
>

Sorry for the delay - again.
I'm resubmitting these patches against net#upstream.

Cheers,
Divy

^ permalink raw reply

* [PATCH 1/7 RESEND] cxgb3 - Firmware update
From: Divy Le Ray @ 2007-09-05 22:58 UTC (permalink / raw)
  To: jeff; +Cc: netdev, linux-kernel, swise

From: Divy Le Ray <divy@chelsio.com>

Update firmware version.
Allow the driver to be up and running with older FW image

Signed-off-by: Divy Le Ray <divy@chelsio.com>
---

 drivers/net/cxgb3/common.h     |    2 +-
 drivers/net/cxgb3/cxgb3_main.c |    9 +++++----
 drivers/net/cxgb3/t3_hw.c      |   20 +++++++++++++++-----
 drivers/net/cxgb3/version.h    |    2 +-
 4 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/drivers/net/cxgb3/common.h b/drivers/net/cxgb3/common.h
index 510e93f..ada5e4b 100644
--- a/drivers/net/cxgb3/common.h
+++ b/drivers/net/cxgb3/common.h
@@ -690,7 +690,7 @@ int t3_read_flash(struct adapter *adapter, unsigned int addr,
 		  unsigned int nwords, u32 *data, int byte_oriented);
 int t3_load_fw(struct adapter *adapter, const u8 * fw_data, unsigned int size);
 int t3_get_fw_version(struct adapter *adapter, u32 *vers);
-int t3_check_fw_version(struct adapter *adapter);
+int t3_check_fw_version(struct adapter *adapter, int *must_load);
 int t3_init_hw(struct adapter *adapter, u32 fw_params);
 void mac_prep(struct cmac *mac, struct adapter *adapter, int index);
 void early_hw_init(struct adapter *adapter, const struct adapter_info *ai);
diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c
index eabb841..ae9c213 100644
--- a/drivers/net/cxgb3/cxgb3_main.c
+++ b/drivers/net/cxgb3/cxgb3_main.c
@@ -832,11 +832,12 @@ static int cxgb_up(struct adapter *adap)
 	int must_load;
 
 	if (!(adap->flags & FULL_INIT_DONE)) {
-		err = t3_check_fw_version(adap);
-		if (err == -EINVAL)
+		err = t3_check_fw_version(adap, &must_load);
+		if (err == -EINVAL) {
 			err = upgrade_fw(adap);
-		if (err)
-			goto out;
+			if (err && must_load)
+				goto out;
+		}
 
 		err = t3_check_tpsram_version(adap, &must_load);
 		if (err == -EINVAL) {
diff --git a/drivers/net/cxgb3/t3_hw.c b/drivers/net/cxgb3/t3_hw.c
index e958bbe..2b49b96 100644
--- a/drivers/net/cxgb3/t3_hw.c
+++ b/drivers/net/cxgb3/t3_hw.c
@@ -960,16 +960,18 @@ int t3_get_fw_version(struct adapter *adapter, u32 *vers)
 /**
  *	t3_check_fw_version - check if the FW is compatible with this driver
  *	@adapter: the adapter
- *
+ *	@must_load: set to 1 if loading a new FW image is required
+
  *	Checks if an adapter's FW is compatible with the driver.  Returns 0
  *	if the versions are compatible, a negative error otherwise.
  */
-int t3_check_fw_version(struct adapter *adapter)
+int t3_check_fw_version(struct adapter *adapter, int *must_load)
 {
 	int ret;
 	u32 vers;
 	unsigned int type, major, minor;
 
+	*must_load = 1;
 	ret = t3_get_fw_version(adapter, &vers);
 	if (ret)
 		return ret;
@@ -982,9 +984,17 @@ int t3_check_fw_version(struct adapter *adapter)
 	    minor == FW_VERSION_MINOR)
 		return 0;
 
-	CH_ERR(adapter, "found wrong FW version(%u.%u), "
-	       "driver needs version %u.%u\n", major, minor,
-	       FW_VERSION_MAJOR, FW_VERSION_MINOR);
+	if (major != FW_VERSION_MAJOR)
+		CH_ERR(adapter, "found wrong FW version(%u.%u), "
+		       "driver needs version %u.%u\n", major, minor,
+		       FW_VERSION_MAJOR, FW_VERSION_MINOR);
+	else {
+		*must_load = 0;
+		CH_WARN(adapter, "found wrong FW minor version(%u.%u), "
+		        "driver compiled for version %u.%u\n", major, minor,
+			FW_VERSION_MAJOR, FW_VERSION_MINOR);
+	}
+
 	return -EINVAL;
 }
 
diff --git a/drivers/net/cxgb3/version.h b/drivers/net/cxgb3/version.h
index eb508bf..ef1c633 100644
--- a/drivers/net/cxgb3/version.h
+++ b/drivers/net/cxgb3/version.h
@@ -39,6 +39,6 @@
 
 /* Firmware version */
 #define FW_VERSION_MAJOR 4
-#define FW_VERSION_MINOR 3
+#define FW_VERSION_MINOR 6
 #define FW_VERSION_MICRO 0
 #endif				/* __CHELSIO_VERSION_H */

^ permalink raw reply related

* [PATCH 2/7 RESEND] cxgb3 - log and clear PEX errors
From: Divy Le Ray @ 2007-09-05 22:58 UTC (permalink / raw)
  To: jeff; +Cc: netdev, linux-kernel, swise

From: Divy Le Ray <divy@chelsio.com>

Clear pciE PEX errors late at module load time.
Log details when PEX errors occur.

Signed-off-by: Divy Le Ray <divy@chelsio.com>
---

 drivers/net/cxgb3/t3_hw.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/drivers/net/cxgb3/t3_hw.c b/drivers/net/cxgb3/t3_hw.c
index 2b49b96..cdcfc13 100644
--- a/drivers/net/cxgb3/t3_hw.c
+++ b/drivers/net/cxgb3/t3_hw.c
@@ -1358,6 +1358,10 @@ static void pcie_intr_handler(struct adapter *adapter)
 		{0}
 	};
 
+	if (t3_read_reg(adapter, A_PCIE_INT_CAUSE) & F_PEXERR)
+		CH_ALERT(adapter, "PEX error code 0x%x\n",
+			 t3_read_reg(adapter, A_PCIE_PEX_ERR));
+
 	if (t3_handle_intr_status(adapter, A_PCIE_INT_CAUSE, PCIE_INTR_MASK,
 				  pcie_intr_info, adapter->irq_stats))
 		t3_fatal_err(adapter);
@@ -1809,6 +1813,8 @@ void t3_intr_clear(struct adapter *adapter)
 	for (i = 0; i < ARRAY_SIZE(cause_reg_addr); ++i)
 		t3_write_reg(adapter, cause_reg_addr[i], 0xffffffff);
 
+	if (is_pcie(adapter))
+		t3_write_reg(adapter, A_PCIE_PEX_ERR, 0xffffffff);
 	t3_write_reg(adapter, A_PL_INT_CAUSE0, 0xffffffff);
 	t3_read_reg(adapter, A_PL_INT_CAUSE0);	/* flush */
 }

^ permalink raw reply related

* [PATCH 3/7 RESEND] cxgb3 - remove false positive in xgmac workaround
From: Divy Le Ray @ 2007-09-05 22:58 UTC (permalink / raw)
  To: jeff; +Cc: netdev, linux-kernel, swise

From: Divy Le Ray <divy@chelsio.com>

Qualify toggling of xgmac tx enable with not getting pause frames, 
we might not make forward progress because the peer is sending 
lots of pause frames.

Signed-off-by: Divy Le Ray <divy@chelsio.com>
---

 drivers/net/cxgb3/common.h |    1 +
 drivers/net/cxgb3/xgmac.c  |    4 +++-
 2 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/drivers/net/cxgb3/common.h b/drivers/net/cxgb3/common.h
index ada5e4b..7e9e43d 100644
--- a/drivers/net/cxgb3/common.h
+++ b/drivers/net/cxgb3/common.h
@@ -513,6 +513,7 @@ struct cmac {
 	u64 rx_mcnt;
 	unsigned int toggle_cnt;
 	unsigned int txen;
+	u64 rx_pause;
 	struct mac_stats stats;
 };
 
diff --git a/drivers/net/cxgb3/xgmac.c b/drivers/net/cxgb3/xgmac.c
index 1d1c391..ff9e9dc 100644
--- a/drivers/net/cxgb3/xgmac.c
+++ b/drivers/net/cxgb3/xgmac.c
@@ -452,6 +452,7 @@ int t3_mac_enable(struct cmac *mac, int which)
 						A_XGM_TX_SPI4_SOP_EOP_CNT +
 						oft)));
 		mac->rx_mcnt = s->rx_frames;
+		mac->rx_pause = s->rx_pause;
 		mac->rx_xcnt = (G_TXSPI4SOPCNT(t3_read_reg(adap,
 						A_XGM_RX_SPI4_SOP_EOP_CNT +
 						oft)));
@@ -504,7 +505,7 @@ int t3b2_mac_watchdog_task(struct cmac *mac)
 	tx_xcnt = 1;		/* By default tx_xcnt is making progress */
 	tx_tcnt = mac->tx_tcnt;	/* If tx_mcnt is progressing ignore tx_tcnt */
 	rx_xcnt = 1;		/* By default rx_xcnt is making progress */
-	if (tx_mcnt == mac->tx_mcnt) {
+	if (tx_mcnt == mac->tx_mcnt && mac->rx_pause == s->rx_pause) {
 		tx_xcnt = (G_TXSPI4SOPCNT(t3_read_reg(adap,
 						A_XGM_TX_SPI4_SOP_EOP_CNT +
 					       	mac->offset)));
@@ -560,6 +561,7 @@ out:
 	mac->tx_mcnt = s->tx_frames;
 	mac->rx_xcnt = rx_xcnt;
 	mac->rx_mcnt = s->rx_frames;
+	mac->rx_pause = s->rx_pause;
 	if (status == 1) {
 		t3_write_reg(adap, A_XGM_TX_CTRL + mac->offset, 0);
 		t3_read_reg(adap, A_XGM_TX_CTRL + mac->offset);  /* flush */

^ permalink raw reply related

* [PATCH 4/7 RESEND] cxgb3 -  Set the CQ_ERR bit in CQ contexts.
From: Divy Le Ray @ 2007-09-05 22:58 UTC (permalink / raw)
  To: jeff; +Cc: netdev, linux-kernel, swise

From: Divy Le Ray <divy@chelsio.com>

The cxgb3 driver is incorrectly configuring the HW CQ context for CQ's
that use overflow-avoidance.  Namely the RDMA control CQ.  This results
in a bad DMA from the device to bus address 0.  The solution is to set
the CQ_ERR bit in the context for these types of CQs.

Signed-off-by: Divy Le Ray <divy@chelsio.com>
---

 drivers/net/cxgb3/sge_defs.h |    4 ++++
 drivers/net/cxgb3/t3_hw.c    |    3 ++-
 2 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/drivers/net/cxgb3/sge_defs.h b/drivers/net/cxgb3/sge_defs.h
index 514869e..29b6c80 100644
--- a/drivers/net/cxgb3/sge_defs.h
+++ b/drivers/net/cxgb3/sge_defs.h
@@ -106,6 +106,10 @@
 #define V_CQ_GEN(x) ((x) << S_CQ_GEN)
 #define F_CQ_GEN    V_CQ_GEN(1U)
 
+#define S_CQ_ERR    30
+#define V_CQ_ERR(x) ((x) << S_CQ_ERR)
+#define F_CQ_ERR    V_CQ_ERR(1U)
+
 #define S_CQ_OVERFLOW_MODE    31
 #define V_CQ_OVERFLOW_MODE(x) ((x) << S_CQ_OVERFLOW_MODE)
 #define F_CQ_OVERFLOW_MODE    V_CQ_OVERFLOW_MODE(1U)
diff --git a/drivers/net/cxgb3/t3_hw.c b/drivers/net/cxgb3/t3_hw.c
index cdcfc13..bff1d02 100644
--- a/drivers/net/cxgb3/t3_hw.c
+++ b/drivers/net/cxgb3/t3_hw.c
@@ -2046,7 +2046,8 @@ int t3_sge_init_cqcntxt(struct adapter *adapter, unsigned int id, u64 base_addr,
 	base_addr >>= 32;
 	t3_write_reg(adapter, A_SG_CONTEXT_DATA2,
 		     V_CQ_BASE_HI((u32) base_addr) | V_CQ_RSPQ(rspq) |
-		     V_CQ_GEN(1) | V_CQ_OVERFLOW_MODE(ovfl_mode));
+		     V_CQ_GEN(1) | V_CQ_OVERFLOW_MODE(ovfl_mode) |
+		     V_CQ_ERR(ovfl_mode));
 	t3_write_reg(adapter, A_SG_CONTEXT_DATA3, V_CQ_CREDITS(credits) |
 		     V_CQ_CREDIT_THRES(credit_thres));
 	return t3_sge_write_context(adapter, id, F_CQ);

^ permalink raw reply related

* [PATCH 6/7] cxgb3 - Add T3C rev
From: Divy Le Ray @ 2007-09-05 22:58 UTC (permalink / raw)
  To: jeff; +Cc: netdev, linux-kernel, swise

From: Divy Le Ray <divy@chelsio.com>

add driver recognition for T3C rev board. 

Signed-off-by: Divy Le Ray <divy@chelsio.com>
---

 drivers/net/cxgb3/common.h     |    1 +
 drivers/net/cxgb3/cxgb3_main.c |    3 +++
 2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/net/cxgb3/common.h b/drivers/net/cxgb3/common.h
index 7e9e43d..d3f276c 100644
--- a/drivers/net/cxgb3/common.h
+++ b/drivers/net/cxgb3/common.h
@@ -438,6 +438,7 @@ enum {					    /* chip revisions */
 	T3_REV_A  = 0,
 	T3_REV_B  = 2,
 	T3_REV_B2 = 3,
+	T3_REV_C  = 4,
 };
 
 struct trace_params {
diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c
index ae9c213..9d360eb 100644
--- a/drivers/net/cxgb3/cxgb3_main.c
+++ b/drivers/net/cxgb3/cxgb3_main.c
@@ -768,6 +768,9 @@ static inline char t3rev2char(struct adapter *adapter)
 	case T3_REV_B2:
 		rev = 'b';
 		break;
+	case T3_REV_C:
+		rev = 'c';
+		break;
 	}
 	return rev;
 }

^ permalink raw reply related

* [PATCH 7/7] cxgb3 - Update engine microcode version
From: Divy Le Ray @ 2007-09-05 22:58 UTC (permalink / raw)
  To: jeff; +Cc: netdev, linux-kernel, swise

From: Divy Le Ray <divy@chelsio.com>

The new microcode engine version is set to 1.1.0

Signed-off-by: Divy Le Ray <divy@chelsio.com>
---

 drivers/net/cxgb3/common.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/cxgb3/common.h b/drivers/net/cxgb3/common.h
index d3f276c..3e5b0db 100644
--- a/drivers/net/cxgb3/common.h
+++ b/drivers/net/cxgb3/common.h
@@ -127,8 +127,8 @@ enum {				/* adapter interrupt-maintained statistics */
 
 enum {
 	TP_VERSION_MAJOR	= 1,
-	TP_VERSION_MINOR	= 0,
-	TP_VERSION_MICRO	= 44
+	TP_VERSION_MINOR	= 1,
+	TP_VERSION_MICRO	= 0
 };
 
 #define S_TP_VERSION_MAJOR		16

^ permalink raw reply related

* [PATCH 5/7 RESEND] cxgb3 - CQ context operations time out too soon.
From: Divy Le Ray @ 2007-09-05 22:58 UTC (permalink / raw)
  To: jeff; +Cc: netdev, linux-kernel, swise

From: Divy Le Ray <divy@chelsio.com>

Currently, the driver only tries up to 5 times (5us) to get the results
of a CQ context operation.  Testing has shown the chip can take as much
as 50us to return the response on SG_CONTEXT_CMD operations.  So we up
the retry count to 100 to cover high loads.

Signed-off-by: Divy Le Ray <divy@chelsio.com>
---

 drivers/net/cxgb3/t3_hw.c |   19 +++++++++++--------
 1 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/net/cxgb3/t3_hw.c b/drivers/net/cxgb3/t3_hw.c
index bff1d02..0583293 100644
--- a/drivers/net/cxgb3/t3_hw.c
+++ b/drivers/net/cxgb3/t3_hw.c
@@ -1870,6 +1870,8 @@ void t3_port_intr_clear(struct adapter *adapter, int idx)
 	phy->ops->intr_clear(phy);
 }
 
+#define SG_CONTEXT_CMD_ATTEMPTS 100
+
 /**
  * 	t3_sge_write_context - write an SGE context
  * 	@adapter: the adapter
@@ -1889,7 +1891,7 @@ static int t3_sge_write_context(struct adapter *adapter, unsigned int id,
 	t3_write_reg(adapter, A_SG_CONTEXT_CMD,
 		     V_CONTEXT_CMD_OPCODE(1) | type | V_CONTEXT(id));
 	return t3_wait_op_done(adapter, A_SG_CONTEXT_CMD, F_CONTEXT_CMD_BUSY,
-			       0, 5, 1);
+			       0, SG_CONTEXT_CMD_ATTEMPTS, 1);
 }
 
 /**
@@ -2075,7 +2077,7 @@ int t3_sge_enable_ecntxt(struct adapter *adapter, unsigned int id, int enable)
 	t3_write_reg(adapter, A_SG_CONTEXT_CMD,
 		     V_CONTEXT_CMD_OPCODE(1) | F_EGRESS | V_CONTEXT(id));
 	return t3_wait_op_done(adapter, A_SG_CONTEXT_CMD, F_CONTEXT_CMD_BUSY,
-			       0, 5, 1);
+			       0, SG_CONTEXT_CMD_ATTEMPTS, 1);
 }
 
 /**
@@ -2099,7 +2101,7 @@ int t3_sge_disable_fl(struct adapter *adapter, unsigned int id)
 	t3_write_reg(adapter, A_SG_CONTEXT_CMD,
 		     V_CONTEXT_CMD_OPCODE(1) | F_FREELIST | V_CONTEXT(id));
 	return t3_wait_op_done(adapter, A_SG_CONTEXT_CMD, F_CONTEXT_CMD_BUSY,
-			       0, 5, 1);
+			       0, SG_CONTEXT_CMD_ATTEMPTS, 1);
 }
 
 /**
@@ -2123,7 +2125,7 @@ int t3_sge_disable_rspcntxt(struct adapter *adapter, unsigned int id)
 	t3_write_reg(adapter, A_SG_CONTEXT_CMD,
 		     V_CONTEXT_CMD_OPCODE(1) | F_RESPONSEQ | V_CONTEXT(id));
 	return t3_wait_op_done(adapter, A_SG_CONTEXT_CMD, F_CONTEXT_CMD_BUSY,
-			       0, 5, 1);
+			       0, SG_CONTEXT_CMD_ATTEMPTS, 1);
 }
 
 /**
@@ -2147,7 +2149,7 @@ int t3_sge_disable_cqcntxt(struct adapter *adapter, unsigned int id)
 	t3_write_reg(adapter, A_SG_CONTEXT_CMD,
 		     V_CONTEXT_CMD_OPCODE(1) | F_CQ | V_CONTEXT(id));
 	return t3_wait_op_done(adapter, A_SG_CONTEXT_CMD, F_CONTEXT_CMD_BUSY,
-			       0, 5, 1);
+			       0, SG_CONTEXT_CMD_ATTEMPTS, 1);
 }
 
 /**
@@ -2172,7 +2174,7 @@ int t3_sge_cqcntxt_op(struct adapter *adapter, unsigned int id, unsigned int op,
 	t3_write_reg(adapter, A_SG_CONTEXT_CMD, V_CONTEXT_CMD_OPCODE(op) |
 		     V_CONTEXT(id) | F_CQ);
 	if (t3_wait_op_done_val(adapter, A_SG_CONTEXT_CMD, F_CONTEXT_CMD_BUSY,
-				0, 5, 1, &val))
+				0, SG_CONTEXT_CMD_ATTEMPTS, 1, &val))
 		return -EIO;
 
 	if (op >= 2 && op < 7) {
@@ -2182,7 +2184,8 @@ int t3_sge_cqcntxt_op(struct adapter *adapter, unsigned int id, unsigned int op,
 		t3_write_reg(adapter, A_SG_CONTEXT_CMD,
 			     V_CONTEXT_CMD_OPCODE(0) | F_CQ | V_CONTEXT(id));
 		if (t3_wait_op_done(adapter, A_SG_CONTEXT_CMD,
-				    F_CONTEXT_CMD_BUSY, 0, 5, 1))
+				    F_CONTEXT_CMD_BUSY, 0,
+				    SG_CONTEXT_CMD_ATTEMPTS, 1))
 			return -EIO;
 		return G_CQ_INDEX(t3_read_reg(adapter, A_SG_CONTEXT_DATA0));
 	}
@@ -2208,7 +2211,7 @@ static int t3_sge_read_context(unsigned int type, struct adapter *adapter,
 	t3_write_reg(adapter, A_SG_CONTEXT_CMD,
 		     V_CONTEXT_CMD_OPCODE(0) | type | V_CONTEXT(id));
 	if (t3_wait_op_done(adapter, A_SG_CONTEXT_CMD, F_CONTEXT_CMD_BUSY, 0,
-			    5, 1))
+			    SG_CONTEXT_CMD_ATTEMPTS, 1))
 		return -EIO;
 	data[0] = t3_read_reg(adapter, A_SG_CONTEXT_DATA0);
 	data[1] = t3_read_reg(adapter, A_SG_CONTEXT_DATA1);

^ permalink raw reply related

* Re: kernel crashes inside MV643xx driver
From: Satyam Sharma @ 2007-09-05 23:40 UTC (permalink / raw)
  To: Dale Farnsworth
  Cc: Andrew Morton, gshan, Linux Kernel Mailing List,
	Linux Netdev Mailing List
In-Reply-To: <20070905165452.GA18396@xyzzy.farnsworth.org>



On Wed, 5 Sep 2007, Dale Farnsworth wrote:

> On Wed, Sep 05, 2007 at 08:24:52AM -0700, Andrew Morton wrote:
> > > On Mon, 20 Aug 2007 14:38:57 +0800 gshan <gshan@alcatel-lucent.com> wrote:
> > > Hi All,
> > > 
> > > After I started the NFS server, it crashed:
> > > 
> > > <3>Badness in local_bh_enable at 
> > > /home/cli4/sandbox/main/TelicaRoot/components/mvlinux/cge/devkit/lsp/7xx/linux/kernel/softirq.c:195
> > > Badness in local_bh_enable at 
> > > /home/cli4/sandbox/main/TelicaRoot/components/mvlinux/cge/devkit/lsp/7xx/linux/kernel/softirq.c:195
> > > Call trace:
> > >  [c0005340] check_bug_trap+0xbc/0x11c
> > >  [c0005604] ProgramCheckException+0x264/0x2bc
> > >  [c0004ac4] ret_from_except_full+0x0/0x4c
> > >  [c0022ae4] local_bh_enable+0x18/0x80
> > >  [c024648c] skb_copy_bits+0x168/0x3b8
> > >  [c024db44] __skb_linearize+0x90/0x150
> > >  [c020e8a4] mv643xx_eth_start_xmit+0x4c0/0x5bc
> > >  [c025c934] qdisc_restart+0xac/0x2bc
> > >  [c024de9c] dev_queue_xmit+0x298/0x34c
> > >  [c0269814] ip_finish_output+0x140/0x2b8
> > >  [c026a3ac] ip_fragment+0x3cc/0x6e0
> > >  [c026bac8] ip_push_pending_frames+0x3dc/0x46c
> > >  [c0289ec4] udp_push_pending_frames+0x10c/0x1cc
> > >  [c028a7c4] udp_sendpage+0x104/0x188
> > >  [c0292fc8] inet_sendpage+0x90/0xb8
> > > 
> > > I searched the webs and found the similar problems:

> > > http://www.mail-archive.com/netdev@vger.kernel.org/msg05199.html

> > > http://oss.sgi.com/archives/netdev/2005-09/msg00025.html
> > > 
> > > Who knew there are fixes for the problem?
> > 
> > Well that got a tremendous response, didn't it?

I thought of replying to this post when I saw it couple weeks back, but
didn't because (1) that's an ancient kernel and (2) the first link that
was mentioned up there in the original post itself pointed to a patch to
solve it (which I verified was since applied to mainline too).


> > What do you mean by "crashed"?  The above is a warning and the system
> > should have survived.
> > 
> > Which kernel version is being used?
> 
> That is the key question.  From the pathnames, I suspect that gshan is
> using a MontaVista version.  I'm still (especially) interested, since
> MontaVista pays me.
> 
> BTW, I never received the original message on netdev or linux-kernel.
> Hmm.  Thanks to Andrew for replying.

^ permalink raw reply

* Re: [PATCH 1/2]: [NET_SCHED]: Make all rate based scheduler work with TSO.
From: Bill Fink @ 2007-09-06  3:59 UTC (permalink / raw)
  To: jdb; +Cc: Jesper Dangaard Brouer, netdev@vger.kernel.org
In-Reply-To: <1188983830.16405.21.camel@localhost.localdomain>

On Wed, 05 Sep 2007, Jesper Dangaard Brouer wrote:

> On Tue, 2007-09-04 at 13:40 -0400, Bill Fink wrote:
> > On Tue, 04 Sep 2007, Patrick McHardy wrote:
> > 
> > > Bill Fink wrote:
> > > > On Sat, 1 Sep 2007, Jesper Dangaard Brouer wrote:
> > > >
> > > Yes, you need to specify the MTU on the command line for
> > > jumbo frames.
> > 
> > Thanks!  Works much better now, although it does slightly exceed
> > the specified rate.
> 
> Thats what happens, with the current rate table system, as we use the
> lower boundry (when doing the packet to time lookups). Especially with a
> high MTU, as the "resolution" of the rate table diminish (mpu=9000 gives
> cell_log=6, 2^6=64 bytes "resolution" buckets).
> 
> > [root@lang4 ~]# tc qdisc add dev eth2 root tbf rate 2gbit buffer 5000000 limit 18000 mtu 9000
> > 
> > [root@lang4 ~]# ./nuttcp-5.5.5 -w10m 192.168.88.14
> >  2465.6729 MB /  10.08 sec = 2051.8241 Mbps 19 %TX 13 %RX

That doesn't seem to account for the magnitude of the rate exceeding.
In the worst case (rough calculation):

	(1+64/9000)*2000 = 2014.2222 Mbps

Now if that were 256 rather than 64:

	(1+256/9000)*2000 = 2056.8888 Mbps

Or maybe the packet overhead is calculated wrong for the 9000 MTU case
(just wild speculation on my part).

						-Bill

^ permalink raw reply

* Re: BUG: scheduling while atomic: ifconfig/0x00000002/4170
From: Satyam Sharma @ 2007-09-06  5:02 UTC (permalink / raw)
  To: Florian Lohoff
  Cc: Linux Kernel Mailing List, Netdev,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA, Michal Piotrowski,
	Herbert Xu, ipw3945-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	yi.zhu-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <6bffcb0e0709021659o3856cd06gabc054c949a84397-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Hi,


> On 02/09/07, Florian Lohoff <flo-BCn6idZOOBwdnm+yROfE0A@public.gmane.org> wrote:
> >
> > Hi,
> > with current git i got this when "ifconfig eth1" down. eth1 had a mac
> > address which looked really like an eth1394 ethernet although the module
> > was not loaded. Something is really broken in 2.6.23-currentgit. I always get the

-currentgit would be -rc5 right?


> > sysfs rename issues which are discussed to be an udev issue. Then i see
> > a eth1394 mac address on an interface which typically shouldn exist
> > (udev should rename the wireless to eth1) and when issueing an
> > ifconfig eth1 down i get a
> >
> >         BUG: scheduling while atomic: ifconfig/0x00000002/4170

Is this reproducible? Also, please send your .config.

BTW the calltrace below shows that eth1 was the wireless interface when
you tried to "down" it.


> > On the next boot i see the eth1394 mac address on the wireless interface
> > wmaster0_rename whereas eth1 is active (the wireless) and has the correct
> > ip address.

I don't think the "scheduling while atomic" bug you saw is related to
the other problem you're seeing ... these are probably a bunch of all
different issues, but I'm not sure if eth1394 is involved at all.


> > I dont get it - this all looks really messed up. udev is
> > debian sid 114-2.

True, messed up it indeed is.


> > eth0      Link encap:Ethernet  HWaddr 00:17:42:13:45:8C
> >           UP BROADCAST MULTICAST  MTU:1500  Metric:1
> >           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:1000
> >           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
> >           Interrupt:19
> >
> > eth1      Link encap:Ethernet  HWaddr 00:18:DE:63:F0:B3
> >           inet addr:195.71.97.208  Bcast:195.71.97.223  Mask:255.255.255.224
> >           inet6 addr: fe80::218:deff:fe63:f0b3/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:2079 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:2220 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:1000
> >           RX bytes:508959 (497.0 KiB)  TX bytes:261123 (255.0 KiB)
> >
> > wmaster0_ Link encap:UNSPEC  HWaddr 00-18-DE-63-F0-B3-30-3A-00-00-00-00-00-00-00-00
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:1000
> >           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
> >
> >
> > [   14.300736] VFS: Mounted root (ext3 filesystem) readonly.
> > [   14.300902] Freeing unused kernel memory: 216k freed
> > [   17.618804] irda_init()
> > [   17.618817] NET: Registered protocol family 23
> > [   17.636399] ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 16 (level, low) -> IRQ 19
> > [   17.636588] PCI: Setting latency timer of device 0000:02:00.0 to 64
> > [   17.636619] sky2 0000:02:00.0: v1.17 addr 0xf0000000 irq 19 Yukon-EC Ultra (0xb4) rev 2
> > [   17.648081] parport_pc 00:0c: reported by Plug and Play ACPI
> > [   17.648206] parport0: PC-style at 0x378, irq 7 [PCSPP,TRISTATE,EPP]
> > [   17.653652] sky2 eth0: addr 00:17:42:13:45:8c
> > [   17.680848] input: Video Bus as /class/input/input6
> > [   17.680961] ACPI: Video Device [GFX0] (multi-head: yes  rom: no  post: no)
> > [   17.757019] usbcore: registered new interface driver usbfs
> > [   17.757139] usbcore: registered new interface driver hub
> > [   17.757264] usbcore: registered new device driver usb
> > [   17.824819] Yenta: CardBus bridge found at 0000:08:03.0 [10cf:131e]
> > [   17.824941] Yenta O2: res at 0x94/0xD4: 00/ea
> > [   17.825034] Yenta O2: enabling read prefetch/write burst
> > [   17.828363] hda: ATAPI 24X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
> > [   17.828838] Uniform CD-ROM driver Revision: 3.20
> > [   17.891481] USB Universal Host Controller Interface driver v3.0
> > [   17.891650] ACPI: PCI Interrupt 0000:00:1d.0[A] -> GSI 23 (level, low) -> IRQ 22
> > [   17.891840] PCI: Setting latency timer of device 0000:00:1d.0 to 64
> > [   17.891844] uhci_hcd 0000:00:1d.0: UHCI Host Controller
> > [   17.892155] uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 1
> > [   17.892327] uhci_hcd 0000:00:1d.0: irq 22, io base 0x00001820
> > [   17.892571] usb usb1: configuration #1 chosen from 1 choice
> > [   17.892689] hub 1-0:1.0: USB hub found
> > [   17.892784] hub 1-0:1.0: 2 ports detected
> > [   17.924265] found SMC SuperIO Chip (devid=0x7a rev=00 base=0x002e): LPC47N227
> > [   17.924390] smsc_superio_flat(): fir: 0x6e8, sir: 0x2e8, dma: 03, irq: 3, mode: 0x0e
> > [   17.924526] smsc_ircc_present: can't get sir_base of 0x2e8
> > [   17.954918] Yenta: ISA IRQ mask 0x0c38, PCI irq 19
> > [   17.955009] Socket status: 30000006
> > [   17.955094] Yenta: Raising subordinate bus# of parent bus (#08) from #09 to #0c
> > [   17.955225] pcmcia: parent PCI bridge I/O window: 0x3000 - 0x3fff
> > [   17.955315] cs: IO port probe 0x3000-0x3fff: clean.
> > [   17.955773] pcmcia: parent PCI bridge Memory window: 0xf0200000 - 0xf02fffff
> > [   17.955864] pcmcia: parent PCI bridge Memory window: 0x30000000 - 0x37ffffff
> > [   17.956497] Yenta: CardBus bridge found at 0000:08:03.1 [10cf:131e]
> > [   17.981605] iwl3945: Intel(R) PRO/Wireless 3945ABG/BG Network Connection driver for Linux, 0.1.14
> > [   17.981752] iwl3945: Copyright(c) 2003-2007 Intel Corporation

Cc'ing ipw3945 ...


> > [   17.983847] sdhci: Secure Digital Host Controller Interface driver
> > [   17.983940] sdhci: Copyright(c) Pierre Ossman
> > [   17.998087] ACPI: PCI Interrupt 0000:00:1d.1[B] -> GSI 20 (level, low) -> IRQ 18
> > [   17.998299] PCI: Setting latency timer of device 0000:00:1d.1 to 64
> > [   17.998303] uhci_hcd 0000:00:1d.1: UHCI Host Controller
> > [   17.998428] uhci_hcd 0000:00:1d.1: new USB bus registered, assigned bus number 2
> > [   17.998601] uhci_hcd 0000:00:1d.1: irq 18, io base 0x00001840
> > [   17.998811] usb usb2: configuration #1 chosen from 1 choice
> > [   17.998933] hub 2-0:1.0: USB hub found
> > [   17.999023] hub 2-0:1.0: 2 ports detected
> > [   18.082862] Yenta: ISA IRQ mask 0x0438, PCI irq 19
> > [   18.082956] Socket status: 30000410
> > [   18.083041] Yenta: Raising subordinate bus# of parent bus (#08) from #0c to #10
> > [   18.083169] pcmcia: parent PCI bridge I/O window: 0x3000 - 0x3fff
> > [   18.083255] cs: IO port probe 0x3000-0x3fff: clean.
> > [   18.083710] pcmcia: parent PCI bridge Memory window: 0xf0200000 - 0xf02fffff
> > [   18.083798] pcmcia: parent PCI bridge Memory window: 0x30000000 - 0x37ffffff
> > [   18.105897] ACPI: PCI Interrupt 0000:00:1d.2[C] -> GSI 18 (level, low) -> IRQ 20
> > [   18.106097] PCI: Setting latency timer of device 0000:00:1d.2 to 64
> > [   18.106101] uhci_hcd 0000:00:1d.2: UHCI Host Controller
> > [   18.106218] uhci_hcd 0000:00:1d.2: new USB bus registered, assigned bus number 3
> > [   18.106385] uhci_hcd 0000:00:1d.2: irq 20, io base 0x00001860
> > [   18.106594] usb usb3: configuration #1 chosen from 1 choice
> > [   18.106709] hub 3-0:1.0: USB hub found
> > [   18.106799] hub 3-0:1.0: 2 ports detected
> > [   18.213730] ACPI: PCI Interrupt 0000:00:1d.3[D] -> GSI 16 (level, low) -> IRQ 19
> > [   18.213932] PCI: Setting latency timer of device 0000:00:1d.3 to 64
> > [   18.213938] uhci_hcd 0000:00:1d.3: UHCI Host Controller
> > [   18.214052] uhci_hcd 0000:00:1d.3: new USB bus registered, assigned bus number 4
> > [   18.214210] uhci_hcd 0000:00:1d.3: irq 19, io base 0x00001880
> > [   18.214414] usb usb4: configuration #1 chosen from 1 choice
> > [   18.214529] hub 4-0:1.0: USB hub found
> > [   18.214620] hub 4-0:1.0: 2 ports detected
> > [   18.318020] ACPI: PCI Interrupt 0000:00:1d.7[A] -> GSI 23 (level, low) -> IRQ 22
> > [   18.324538] PCI: Setting latency timer of device 0000:00:1d.7 to 64
> > [   18.324544] ehci_hcd 0000:00:1d.7: EHCI Host Controller
> > [   18.324687] ehci_hcd 0000:00:1d.7: new USB bus registered, assigned bus number 5
> > [   18.324857] ehci_hcd 0000:00:1d.7: debug port 1
> > [   18.324951] PCI: cache line size of 32 is not supported by device 0000:00:1d.7
> > [   18.324959] ehci_hcd 0000:00:1d.7: irq 22, io mem 0xf0644000
> > [   18.328914] ehci_hcd 0000:00:1d.7: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
> > [   18.329163] usb usb5: configuration #1 chosen from 1 choice
> > [   18.329282] hub 5-0:1.0: USB hub found
> > [   18.329374] hub 5-0:1.0: 8 ports detected
> > [   18.429643] ACPI: PCI Interrupt 0000:08:03.4[A] -> GSI 16 (level, low) -> IRQ 19
> > [   18.480534] ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[19]  MMIO=[f0201000-f02017ff]  Max Packet=[2048]  IR/IT contexts=[8/8]
> > [   18.483902] ACPI: PCI Interrupt 0000:05:00.0[A] -> GSI 18 (level, low) -> IRQ 20
> > [   18.484135] PCI: Setting latency timer of device 0000:05:00.0 to 64
> > [   18.484153] iwl3945: Detected Intel PRO/Wireless 3945ABG Network Connection

> > [   18.595000] sdhci: SDHCI controller found at 0000:08:03.2 [1217:7120] (rev 1)
> > [   18.595114] ACPI: PCI Interrupt 0000:08:03.2[A] -> GSI 16 (level, low) -> IRQ 19
> > [   18.595299] sdhci:slot0: Unknown controller version (16). You may experience problems.
> > [   18.595581] mmc0: SDHCI at 0xf0202800 irq 19 PIO
> > [   18.614800] ACPI: PCI Interrupt 0000:00:1b.0[A] -> GSI 17 (level, low) -> IRQ 23
> > [   18.614990] PCI: Setting latency timer of device 0000:00:1b.0 to 64
> > [   18.700659] iwl3945: Tunable channels: 13 802.11bg, 23 802.11a channels
> > [   18.701906] wmaster0: Selected rate control algorithm 'iwl-3945-rs'

iwl3945 is wmaster0


> > [   18.721016] pccard: PCMCIA card inserted into slot 1
> > [   18.721131] cs: memory probe 0xf0200000-0xf02fffff: excluding 0xf0200000-0xf020ffff
> > [   18.739151] pcmcia: registering new device pcmcia1.0

> > [   18.740929] net eth1: device_rename: sysfs_create_symlink failed (-17)

> > [   18.741075] net wlan0_rename: device_rename: sysfs_create_symlink failed (-17)

These look like interesting bits here ... sysfs_create_symlink() failed
with the infamous -EEXIST. Those were actually two different calls to
dev_change_name()->device_rename()->sysfs_create_symlink() it appears.

If this is reproducible, please rebuild with a kernel with debugging
config options enabled and reproduce.


> > [   18.741625] udev: renamed network interface wmaster0 to eth1

But did this wmaster0 -> eth1 actually succeed, is the question.


> > [   18.779425] cs: IO port probe 0x100-0x3af:<3>si3054: cannot initialize. EXT MID = 0000
> > [   18.783468]  clean.
> > [   18.783592] cs: IO port probe 0x3e0-0x4ff: excluding 0x4d0-0x4d7
> > [   18.784668] cs: IO port probe 0x820-0x8ff: clean.
> > [   18.785521] cs: IO port probe 0xc00-0xcf7: clean.
> > [   18.786456] cs: IO port probe 0xa00-0xaff: clean.
> > [   18.837550] cs: IO port probe 0x100-0x3af: clean.
> > [   18.839641] cs: IO port probe 0x3e0-0x4ff: excluding 0x4d0-0x4d7
> > [   18.840664] cs: IO port probe 0x820-0x8ff: clean.
> > [   18.841501] cs: IO port probe 0xc00-0xcf7: clean.
> > [   18.842426] cs: IO port probe 0xa00-0xaff: clean.
> > [   18.964571] usb 4-2: new full speed USB device using uhci_hcd and address 2
> > [   19.141976] usb 4-2: configuration #1 chosen from 1 choice
> > [   19.214394] Bluetooth: Core ver 2.11
> > [   19.214546] NET: Registered protocol family 31
> > [   19.214640] Bluetooth: HCI device and connection manager initialized
> > [   19.214737] Bluetooth: HCI socket layer initialized
> > [   19.257180] Bluetooth: HCI USB driver ver 2.9
> > [   19.257349] usbcore: registered new interface driver hci_usb
> > [   19.751313] ieee1394: Host added: ID:BUS[0-00:1023]  GUID[00000e10036532e4]
> > [   50.113619] Adding 979924k swap on /dev/sda5.  Priority:-1 extents:1 across:979924k
> > [   50.163229] EXT3 FS on sda6, internal journal
> > [   64.870417] NET: Registered protocol family 10
> > [   64.870613] lo: Disabled Privacy Extensions
> > [   71.028607] Bluetooth: L2CAP ver 2.8
> > [   71.028615] Bluetooth: L2CAP socket layer initialized
> > [   71.168926] Bluetooth: RFCOMM socket layer initialized
> > [   71.169019] Bluetooth: RFCOMM TTY layer initialized
> > [   71.169054] Bluetooth: RFCOMM ver 1.8
> > [   71.539346] ADDRCONF(NETDEV_UP): wlan0_rename: link is not ready
> > [   71.737809] sky2 eth0: enabling interface
> > [   71.740794] sky2 eth0: ram buffer 0K
> > [   71.741733] ADDRCONF(NETDEV_UP): eth0: link is not ready

> > [  142.968449] wlan0_rename: Initial auth_alg=0
> > [  142.968461] wlan0_rename: authenticate with AP 00:0c:41:de:12:e1
> > [  142.976883] wlan0_rename: RX authentication from 00:0c:41:de:12:e1 (alg=0 transaction=2 status=0)
> > [  142.976891] wlan0_rename: authenticated
> > [  142.976895] wlan0_rename: associate with AP 00:0c:41:de:12:e1
> > [  142.979389] wlan0_rename: authentication frame received from 00:0c:41:de:12:e1, but not in authenticate state - ignored
> > [  142.980404] wlan0_rename: RX AssocResp from 00:0c:41:de:12:e1 (capab=0x401 status=0 aid=1)
> > [  142.980412] wlan0_rename: associated
> > [  142.982914] ADDRCONF(NETDEV_CHANGE): wlan0_rename: link becomes ready
> > [  148.291754] ICMPv6 NA: someone advertises our address on wlan0_rename!
> > [  151.386585] wlan0_rename: duplicate address detected!


> > [  382.517007] BUG: scheduling while atomic: ifconfig/0x00000002/4170
> > [  382.517029]  [<c032bf5c>] __sched_text_start+0x84/0x71c
> > [  382.517047]  [<c032c5da>] __sched_text_start+0x702/0x71c
> > [  382.517065]  [<c0174a6c>] link_path_walk+0xa9/0xb3
> > [  382.517079]  [<c032c6ad>] wait_for_completion+0x65/0x9b
> > [  382.517090]  [<c011f0a2>] default_wake_function+0x0/0xc
> > [  382.517103]  [<c01334cf>] synchronize_rcu+0x2a/0x2f
> > [  382.517115]  [<c0133091>] wakeme_after_rcu+0x0/0x8
> > [  382.517127]  [<c02d5f4f>] dev_deactivate+0x89/0x9c
> > [  382.517138]  [<c02c8abc>] dev_close+0x24/0x67
> > [  382.517150]  [<e01f402b>] ieee80211_master_stop+0x4a/0x6d [mac80211]
> > [  382.517176]  [<c02c8ae3>] dev_close+0x4b/0x67
> > [  382.517183]  [<c02c7f47>] dev_change_flags+0x9d/0x14e
> > [  382.517193]  [<c03058c8>] devinet_ioctl+0x224/0x532
> > [  382.517201]  [<c02c9589>] dev_ifsioc+0x113/0x396
> > [  382.517208]  [<c02c8d59>] dev_load+0x24/0x4b
> > [  382.517214]  [<c02be61d>] sock_ioctl+0x0/0x1be
> > [  382.517233]  [<c02be7bc>] sock_ioctl+0x19f/0x1be
> > [  382.517239]  [<c011a993>] do_page_fault+0x269/0x58e
> > [  382.517246]  [<c02be61d>] sock_ioctl+0x0/0x1be
> > [  382.517254]  [<c0176787>] do_ioctl+0x1f/0x62
> > [  382.517263]  [<c0176a01>] vfs_ioctl+0x237/0x249
> > [  382.517270]  [<c016b7d8>] do_sys_open+0xbb/0xc5
> > [  382.517280]  [<c0176a46>] sys_ioctl+0x33/0x4d
> > [  382.517288]  [<c0103e52>] sysenter_past_esp+0x5f/0x85
> > [  382.517303]  =======================


> > [  382.528958] BUG: scheduling while atomic: ifconfig/0x00000002/4170
> > [  382.528972]  [<c032bf5c>] __sched_text_start+0x84/0x71c
> > [  382.528996]  [<c032c6ad>] wait_for_completion+0x65/0x9b
> > [  382.529005]  [<c011f0a2>] default_wake_function+0x0/0xc
> > [  382.529014]  [<c01334cf>] synchronize_rcu+0x2a/0x2f
> > [  382.529021]  [<c0133091>] wakeme_after_rcu+0x0/0x8
> > [  382.529031]  [<c02d5f4f>] dev_deactivate+0x89/0x9c

Probably tangential, but Herbert, is the call to synchronize_rcu() in
dev_deactivate() really correct?

I saw that you put it in commit d4828d85d188dc70 about a year back
and it's supposed to ensure that all outstanding transmissions from
dev_queue_xmit() are over before proceeding. However, I think it may not
serve that purpose -- as the comment before synchronize_rcu() and its
use of call_rcu() [and not of call_rcu_bh()] indicate, synchronize_rcu()
is used to synchronize with all outstanding rcu_read_lock() uses. OTOH,
dev_queue_xmit() uses rcu_read_lock_*bh*() which isn't really the same --
bh / non-bh RCU callbacks have separate lists in the RCU implementation,
so I don't really see anything preventing old transmissions to remain
outstanding even after the synchronize_rcu() call in dev_deactivate (?)

Whether that has anything to do with this particular bug or not,
I have no idea ...


> > [  382.529041]  [<c02c8abc>] dev_close+0x24/0x67
> > [  382.529052]  [<e01f402b>] ieee80211_master_stop+0x4a/0x6d [mac80211]
> > [  382.529077]  [<c02c8ae3>] dev_close+0x4b/0x67
> > [  382.529088]  [<c02c7f47>] dev_change_flags+0x9d/0x14e
> > [  382.529101]  [<c03058c8>] devinet_ioctl+0x224/0x532
> > [  382.529111]  [<c02c9589>] dev_ifsioc+0x113/0x396
> > [  382.529122]  [<c02c8d59>] dev_load+0x24/0x4b
> > [  382.529132]  [<c02be61d>] sock_ioctl+0x0/0x1be
> > [  382.529153]  [<c02be7bc>] sock_ioctl+0x19f/0x1be
> > [  382.529164]  [<c011a993>] do_page_fault+0x269/0x58e
> > [  382.529174]  [<c02be61d>] sock_ioctl+0x0/0x1be
> > [  382.529187]  [<c0176787>] do_ioctl+0x1f/0x62
> > [  382.529201]  [<c0176a01>] vfs_ioctl+0x237/0x249
> > [  382.529211]  [<c016b7d8>] do_sys_open+0xbb/0xc5
> > [  382.529224]  [<c0176a46>] sys_ioctl+0x33/0x4d
> > [  382.529235]  [<c0103e52>] sysenter_past_esp+0x5f/0x85
> > [  382.529253]  =======================
> > [  382.530962] Freeing alive inet6 address cd98f600

Unrelated, but I also wish the "BUG: scheduling while atomic" was the more
instructive show_regs() trace and not a dump_stack() ... I'll make such a
patch today itself.

^ permalink raw reply

* [PATCH 1/1] netcore: minor cleanup for register_netdevice
From: FNST-Wang Chen @ 2007-09-06  5:27 UTC (permalink / raw)
  To: netdev; +Cc: Wang Chen

- minor logic cleanup

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>

diff -Nurp linux-2.6.22.4.org/net/core/dev.c  linux-2.6.22.4/net/core/dev.c
--- linux-2.6.22.4.org/net/core/dev.c    2007-08-22 15:33:40.000000000 
+0800
+++ linux-2.6.22.4/net/core/dev.c    2007-09-06 13:27:35.000000000 +0800
@@ -3075,8 +3075,6 @@ int register_netdevice(struct net_device
    dev->xmit_lock_owner = -1;
    spin_lock_init(&dev->ingress_lock);

-    dev->iflink = -1;
-
    /* Init, if this function is available */
    if (dev->init) {
        ret = dev->init(dev);
@@ -3093,8 +3091,7 @@ int register_netdevice(struct net_device
    }

    dev->ifindex = dev_new_index();
-    if (dev->iflink == -1)
-        dev->iflink = dev->ifindex;
+    dev->iflink = dev->ifindex;

    /* Check for existence of name */
    head = dev_name_hash(dev->name);






^ permalink raw reply

* Re: [PATCH 1/1] netcore: minor cleanup for register_netdevice
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2007-09-06  5:34 UTC (permalink / raw)
  To: wangchen; +Cc: netdev, ellre923, yoshfuji
In-Reply-To: <46DF8FBC.20106@cn.fujitsu.com>

I disagree.  Some of "dev->init()" functions set up dev->iflink;
e.g., ipip_tunnel_init() etc.

In article <46DF8FBC.20106@cn.fujitsu.com> (at Thu, 06 Sep 2007 13:27:24 +0800), FNST-Wang Chen <wangchen@cn.fujitsu.com> says:

> - minor logic cleanup
> 
> Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
> 
> diff -Nurp linux-2.6.22.4.org/net/core/dev.c  linux-2.6.22.4/net/core/dev.c
> --- linux-2.6.22.4.org/net/core/dev.c    2007-08-22 15:33:40.000000000 
> +0800
> +++ linux-2.6.22.4/net/core/dev.c    2007-09-06 13:27:35.000000000 +0800
> @@ -3075,8 +3075,6 @@ int register_netdevice(struct net_device
>     dev->xmit_lock_owner = -1;
>     spin_lock_init(&dev->ingress_lock);
> 
> -    dev->iflink = -1;
> -
>     /* Init, if this function is available */
>     if (dev->init) {
>         ret = dev->init(dev);
> @@ -3093,8 +3091,7 @@ int register_netdevice(struct net_device
>     }
> 
>     dev->ifindex = dev_new_index();
> -    if (dev->iflink == -1)
> -        dev->iflink = dev->ifindex;
> +    dev->iflink = dev->ifindex;
> 
>     /* Check for existence of name */
>     head = dev_name_hash(dev->name);
> 
> 
> 
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 1/1] netcore: minor cleanup for register_netdevice
From: Krishna Kumar2 @ 2007-09-06  5:37 UTC (permalink / raw)
  To: FNST-Wang Chen; +Cc: Wang Chen, netdev, netdev-owner
In-Reply-To: <46DF8FBC.20106@cn.fujitsu.com>

Some init handlers set iflink, and hence it is required. See tunnel init
routines (ipip_tunnel_init), macvlan_init(), etc

netdev-owner@vger.kernel.org wrote on 09/06/2007 10:57:24 AM:

> - minor logic cleanup
>
> Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
>
> diff -Nurp linux-2.6.22.4.org/net/core/dev.c  linux-2.6.22.4
/net/core/dev.c
> --- linux-2.6.22.4.org/net/core/dev.c    2007-08-22 15:33:40.000000000
> +0800
> +++ linux-2.6.22.4/net/core/dev.c    2007-09-06 13:27:35.000000000 +0800
> @@ -3075,8 +3075,6 @@ int register_netdevice(struct net_device
>     dev->xmit_lock_owner = -1;
>     spin_lock_init(&dev->ingress_lock);
>
> -    dev->iflink = -1;
> -
>     /* Init, if this function is available */
>     if (dev->init) {
>         ret = dev->init(dev);
> @@ -3093,8 +3091,7 @@ int register_netdevice(struct net_device
>     }
>
>     dev->ifindex = dev_new_index();
> -    if (dev->iflink == -1)
> -        dev->iflink = dev->ifindex;
> +    dev->iflink = dev->ifindex;
>
>     /* Check for existence of name */
>     head = dev_name_hash(dev->name);
>
>
>
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* Re: [Spam-fortigate] Re: [PATCH 1/1] netcore: minor cleanup for register_netdevice
From: FNST-Wang Chen @ 2007-09-06  5:37 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki / 吉藤英明
  Cc: netdev, ellre923, krkumar2
In-Reply-To: <20070906.143402.07329837.yoshfuji@linux-ipv6.org>

oops
i am wrong

YOSHIFUJI Hideaki / 吉藤英明 said the following on 2007-9-6 13:34:
> I disagree.  Some of "dev->init()" functions set up dev->iflink;
> e.g., ipip_tunnel_init() etc.
>
> In article <46DF8FBC.20106@cn.fujitsu.com> (at Thu, 06 Sep 2007 13:27:24 +0800), FNST-Wang Chen <wangchen@cn.fujitsu.com> says:
>
>   
>> - minor logic cleanup
>>
>> Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
>>
>> diff -Nurp linux-2.6.22.4.org/net/core/dev.c  linux-2.6.22.4/net/core/dev.c
>> --- linux-2.6.22.4.org/net/core/dev.c    2007-08-22 15:33:40.000000000 
>> +0800
>> +++ linux-2.6.22.4/net/core/dev.c    2007-09-06 13:27:35.000000000 +0800
>> @@ -3075,8 +3075,6 @@ int register_netdevice(struct net_device
>>     dev->xmit_lock_owner = -1;
>>     spin_lock_init(&dev->ingress_lock);
>>
>> -    dev->iflink = -1;
>> -
>>     /* Init, if this function is available */
>>     if (dev->init) {
>>         ret = dev->init(dev);
>> @@ -3093,8 +3091,7 @@ int register_netdevice(struct net_device
>>     }
>>
>>     dev->ifindex = dev_new_index();
>> -    if (dev->iflink == -1)
>> -        dev->iflink = dev->ifindex;
>> +    dev->iflink = dev->ifindex;
>>
>>     /* Check for existence of name */
>>     head = dev_name_hash(dev->name);
>>
>>
>>
>>
>>
>> -
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>     
>
>
>
>   


-- 
Best Regards,
A new email address of FJWAN is launched from Apr.1 2007.
The updated address is: wangchen@cn.fujitsu.com
--------------------------------------------------
Wang Chen
Development Dept.I
Nanjing Fujitsu Nanda Software Tech. Co., Ltd.(FNST)
8/F., Civil Defense Building, No.189 Guangzhou Road,
Nanjing, 210029, China
TEL:+86+25-86630566-850
FUJITSU INTERNAL:79955-850
FAX:+86+25-83317685
MAIL:wangchen@cn.fujitsu.com
--------------------------------------------------




^ permalink raw reply

* Re: [PATCH] [sis900] convert to NAPI, WAS Re: pktgen terminating condition
From: Mandeep Singh Baines @ 2007-09-06  5:06 UTC (permalink / raw)
  To: jamal
  Cc: James Chapman, Mandeep Singh Baines, Daniele Venzano, davem,
	rick.jones2, msb, netdev, grundler, robert.olsson, jeff, nhorman
In-Reply-To: <1189002087.4228.26.camel@localhost>

jamal (hadi@cyberus.ca) wrote:
> On Wed, 2007-05-09 at 14:55 +0100, James Chapman wrote:
> 
> > Thanks Jamal. Yes, I'd already read your paper. I think my idea is 
> > different to the ideas described in your paper 
> 
> I am hoping you can pick from the lessons of what has been tried and
> failed and the justification for critiqueing something as "Failed". 
> If you have - i feel i accomplished something useful writting the paper.
> 
> > and I'm in the process of 
> > writing it up as an RFC to post to netdev. 
> 
> Please cc me if you want my feedback - I am backlogged by about 3000
> messages on netdev.
> 
> > Briefly though, the driver's 
> > ->poll() holds off doing netif_rx_complete() for 1-2 jiffies and keeps 
> > itself in poll_on mode for that time, consuming no quota. 
> > The net_rx 
> > softirq processing loop is modified to detect the case when the only 
> > devices in its poll list are doing no work (consuming no quota). The 
> > driver's ->poll() samples jiffies while it is considering when to do the 
> > netif_rx_complete() like your Parked state - no timers are used.
> 
> Ok, so the difference seems to be you actually poll instead for those
> jiffies instead starting a timer in the parked state - is that right?
> 
> > If I missed that this approach has already been tried before and 
> > rejected, please let me know. I see better throughput and latency in my 
> > packet forwarding and LAN setups using it.
> 
> If you read the paper: There are no issues with high throughput - NAPI
> kicks in.
> The challenge to be overcome is at low traffic, if you have a real fast
> processor your cpu-cycles-used/bits-processed ratio is high....

I'm not sure cpu-cycles-used/bits-processed is the correct metric to use.
An alternative would be to look at cpu-cycles-used/unit-time (i.e.
CPU utilization or load) for a given bit-rate or packet-rate. This would
make an interesting graph.

At low packet-rate, CPU utilization is low so doing extra work per packet
is probably OK. Utilizing 2% of CPU vesus 1% is negligible. But at higher
rate, when there is more CPU utilization, using 40% of CPU versus 60% is
significant. I think the absolute different in CPU utilization is more
important than the relative difference. iow, 2% versus 1%, even though 
a 2x difference in cpu-cycles/packet, is negligible compared to 40% 
versus 60%.

> If you are polling (softirqs have high prio and depending on the cpu,
> there could be a few gazillion within those 1-2 jiffies), then isnt the
> end result still a high cpu-cycles used?
> Thats what the timer tries to avoid (do nothing until some deffered
> point).

Using a timer might also behave better in a tick-less (CONFIG_NO_HZ) 
configuration.

> If you waste cpu cycles and poll, I can see that (to emphasize: For
> FastCPU-LowTraffic scenario), you will end up _not_ having latency
> issues i pointed out, but you surely have digressed from the original
> goal which is to address the cpu abuse at low traffic (because you abuse
> more cpu).
> 
> One thing that may be valuable is to show that the timers and polls are
> not much different in terms of cpu abuse (It's theoretically not true,
> but reality may not match).
> The other thing is now hrestimers are on, can you try using a piece of
> hardware that can get those kicked and see if you see any useful results
> on latency.
> 
> cheers,
> jamal
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox