qemu-devel.nongnu.org archive mirror
From: Paolo Bonzini <pbonzini@redhat.com>
To: sfeldma@gmail.com, qemu-devel@nongnu.org, jiri@resnulli.us,
	roopa@cumulusnetworks.com, john.fastabend@gmail.com,
	eblake@redhat.com, stefanha@gmail.com
Subject: Re: [Qemu-devel] [PATCH v3 3/9] rocker: add register programming guide
Date: Mon, 12 Jan 2015 12:40:57 +0100
Message-ID: <54B3B2C9.6040104@redhat.com>
In-Reply-To: <1420948672-50103-4-git-send-email-sfeldma@gmail.com>



On 11/01/2015 04:57, sfeldma@gmail.com wrote:
> From: Scott Feldman <sfeldma@gmail.com>
> 
> This is the register programming guide for the Rocker device.  It's intended
> for driver writers and device writers.  It covers the device's PCI space,
> the register set, DMA interface, and interrupts.
> 
> Signed-off-by: Scott Feldman <sfeldma@gmail.com>
> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
> ---
>  hw/net/rocker/reg_guide.txt |  961 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 961 insertions(+)
>  create mode 100644 hw/net/rocker/reg_guide.txt

This should be docs/specs/rocker.txt

> diff --git a/hw/net/rocker/reg_guide.txt b/hw/net/rocker/reg_guide.txt
> new file mode 100644
> index 0000000..3146708
> --- /dev/null
> +++ b/hw/net/rocker/reg_guide.txt
> @@ -0,0 +1,961 @@
> +Rocker Network Switch Register Programming Guide
> +Copyright (c) Scott Feldman <sfeldma@gmail.com>
> +Copyright (c) Neil Horman <nhorman@tuxdriver.com>
> +Version 0.11, 12/29/2014
> +
> +LICENSE
> +=======
> +
> +This program is free software; you can redistribute it and/or modify
> +it under the terms of the GNU General Public License as published by
> +the Free Software Foundation; either version 2 of the License, or
> +(at your option) any later version.
> +
> +This program is distributed in the hope that it will be useful,
> +but WITHOUT ANY WARRANTY; without even the implied warranty of
> +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +GNU General Public License for more details.
> +
> +SECTION 1: Introduction
> +=======================
> +
> +Overview
> +--------
> +
> +This document describes the hardware/software interface for the Rocker switch
> +device.  The intended audience is authors of OS drivers and device emulation
> +software.
> +
> +Notations and Conventions
> +-------------------------
> +
> +o In register descriptions, [n:m] indicates a range from bit n to bit m,
> +inclusive.
> +o Use of leading 0x indicates a hexadecimal number.
> +o Use of leading 0b indicates a binary number.
> +o The use of RSVD or Reserved indicates that a bit or field is reserved for
> +future use.
> +o Field width is in bytes, unless otherwise noted.
> +o Registers are (R) read-only, (R/W) read/write, (W) write-only, or (COR) clear
> +on read.
> +o TLV values in network-byte-order are designated with (N).
> +
> +
> +SECTION 2: PCI Configuration Registers
> +======================================
> +
> +PCI Configuration Space
> +-----------------------
> +
> +Each switch instance registers as a PCI device with PCI configuration space:
> +
> +	offset	width	description		value
> +	---------------------------------------------
> +	0x0	2	Vendor ID		0x1b36
> +	0x2	2	Device ID		0x0006
> +	0x4	4	Command/Status
> +	0x8	1	Revision ID		0x01
> +	0x9	3	Class code		0x2800
> +	0xC	1	Cache line size
> +	0xD	1	Latency timer
> +	0xE	1	Header type
> +	0xF	1	Built-in self test
> +	0x10	4	Base address low
> +	0x14	4	Base address high
> +	0x18-28		Reserved
> +	0x2C	2	Subsystem vendor ID	0x0000
> +	0x2E	2	Subsystem ID		0x0000

This should not be guaranteed to be 0, should it?

> +	0x30-38		Reserved
> +	0x3C	1	Interrupt line
> +	0x3D	1	Interrupt pin		0x00
> +	0x3E	1	Min grant		0x00
> +	0x3F	1	Max latency		0x00
> +	0x40	1	TRDY timeout
> +	0x41	1	Retry count
> +	0x42	2	Reserved
> +
> +
> +SECTION 3: Memory-Mapped Register Space
> +=======================================
> +
> +There are two memory-mapped BARs.  BAR0 maps device register space and is
> +0x2000 in size.  BAR1 maps MSI-X vector and PBA tables and is also 0x2000 in
> +size, allowing for 256 MSI-X vectors.  The host BIOS will assign the base
> +address location.  The host driver/OS will map the base address to host memory,
> +giving the driver mmio access to the device register space.

No need for the bits after "The host BIOS..." since that's just normal PCI.

> +All registers are 4 or 8 bytes long.  It is assumed host software will access 4
> +byte registers with one 4-byte access, and 8 byte registers with either two
> +4-byte accesses or a single 8-byte access.  In the case of two 4-byte accesses,
> +access must be lower and then upper 4-bytes, in that order.

Double 4-byte accesses are not implemented, are they?

> +Interrupt credits
> +^^^^^^^^^^^^^^^^^
> +
> +MSI-X vectors used for descriptor ring completions use a credit mechanism for
> +efficient device, PCIe bus, OS and driver operations.  Each descriptor ring has
> +a credit count which represents the number of outstanding descriptors to be
> +processed by the driver.  As the device marks descriptors complete, the credit
> +count is incremented.  As the driver processes those outstanding descriptors,
> +it returns credits back to the device.  This way, the device knows the driver's
> +progress and can make decisions about when to fire the next interrupt or not.
> +When the credit count is zero, and the first descriptors are posted for the
> +driver, a single interrupt is fired.  Once the interrupt is fired, the
> +interrupt is disabled (auto-masked).  In response to the interrupt, the driver
> +will process descriptors and PIO write a returned credit value for that
> +descriptor ring.  If the driver returns all credits (the driver caught up with
> +the device and there is no outstanding work), then the interrupt is unmasked,
> +but not fired.  If only partial credits are returned, the interrupt remains
> +masked but the device generates an interrupt, signaling the driver that more
> +outstanding work is available.

Perhaps mention that this masking is unrelated to the MSI-X interrupt
mask register?
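
To show what I mean, here is how I read the credit scheme, as a small
Python model.  The class and method names are made up for illustration and
are not taken from the patch.  The point is only that the auto-masking is
state kept by the device's credit logic, separate from the MSI-X mask bits.

```python
class CreditRing:
    """Toy model of one descriptor ring's interrupt credits (hypothetical names)."""

    def __init__(self):
        self.credits = 0      # completed descriptors not yet returned by the driver
        self.masked = False   # auto-mask state, NOT the MSI-X mask register
        self.interrupts = 0   # interrupts fired so far

    def _fire(self):
        # Firing always leaves the interrupt auto-masked.
        self.interrupts += 1
        self.masked = True

    def complete(self, n=1):
        """Device marks n descriptors complete."""
        was_idle = self.credits == 0
        self.credits += n
        # Only the first completion after an idle period fires.
        if was_idle and not self.masked:
            self._fire()

    def return_credits(self, n):
        """Driver writes back n credits after processing descriptors."""
        if n > self.credits:
            raise ValueError("returned more credits than outstanding")
        self.credits -= n
        if self.credits == 0:
            self.masked = False   # driver caught up: unmask, do not fire
        else:
            self._fire()          # work remains: stay masked, fire again
```

A driver that returns only part of the credits gets a second interrupt; one
that returns all of them does not, and the next completion fires again.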

> +SECTION 5: Test Registers
> +=========================
> +
> +Rocker switch has several test registers to support troubleshooting register

s/Rocker switch/Rocker/

> +access, interrupt generation, and DMA operations:
> +
> +	TEST_REG, offset 0x0010, 32-bit (R/W)
> +	TEST_REG64, offset 0x0018, 64-bit (R/W)
> +	TEST_IRQ, offset 0x0020, 32-bit (R/W)
> +	TEST_DMA_ADDR, offset 0x0028, 64-bit (R/W)
> +	TEST_DMA_SIZE, offset 0x0030, 32-bit (R/W)
> +	TEST_DMA_CTRL, offset 0x0034, 32-bit (R/W)
> +
> +Reads to TEST_REG and TEST_REG64 will read a value 2x the last value written to

s/2x/equal to twice/

> +the register.  The 32-bit and 64-bit versions are for testing 32-bit and 64-bit
> +host accesses.

Right now, as mentioned above, 64-bit registers must be accessed with a
single 64-bit host access.

In the case of 32-bit host accesses, should TEST_REG64's value be
latched until the upper half is written?  If so, please mention it and
describe that this behavior is shared with the other 64-bit Rocker
registers.
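
To illustrate the latching I have in mind, here is a sketch of one possible
behaviour, written as a Python model rather than device code.  It assumes
the low half is held until the high half arrives, and that the doubling
described for TEST_REG64 applies to the full 64-bit value.  Both are
assumptions on my part, and all names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


class LatchedReg64:
    """Model of a 64-bit register written as two 4-byte halves, low first."""

    def __init__(self):
        self.value = 0
        self.low_latch = None   # low half waiting for its upper half

    def write32(self, offset, data):
        if offset == 0:
            # Lower half: latch only, the visible value does not change yet.
            self.low_latch = data & MASK32
        elif offset == 4:
            if self.low_latch is None:
                # The guide does not define this case; the model rejects it.
                raise ValueError("upper half written before lower half")
            self.value = ((data & MASK32) << 32) | self.low_latch
            self.low_latch = None
        else:
            raise ValueError("offset must be 0 or 4")

    def write64(self, data):
        """Single 8-byte access: commits immediately."""
        self.value = data & MASK64
        self.low_latch = None

    def read64(self):
        # TEST_REG64 reads back twice the last value written.
        return (self.value * 2) & MASK64
```

With this model a read between the two halves still returns the old value,
which is the kind of detail the guide would need to spell out.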

> +Bits written to TEST_IRQ will cause same (unmasked) bits to be written to
> +IRQ_STAT and an interrupt generated.  Use IRQ_MASK to mask and unmask
> +particular bits.

It looks like TEST_IRQ will actually generate a single interrupt, not
many of them.  So writing 1 sets bit 1 in the PBA, not bit 0.  Writing
3 sets bit 3, not bits 0 and 1.

Please do not use "IRQ_STAT", call it the PBA instead.  Also remove the
reference to IRQ_MASK, it's uninteresting.
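
Put differently, the written value seems to behave like a vector number
rather than a bit mask.  A small Python sketch of the two readings, with
hypothetical function names, and assuming the 256-vector limit given for
BAR1:

```python
NUM_VECTORS = 256  # from the BAR1 description; assumed to bound TEST_IRQ too


def pba_as_vector_number(value, pba=0):
    """What the code appears to do: 'value' selects one MSI-X vector."""
    if not 0 <= value < NUM_VECTORS:
        raise ValueError("vector out of range")
    return pba | (1 << value)


def pba_as_bit_mask(value, pba=0):
    """What the current text suggests: each set bit raises one interrupt."""
    return pba | value
```

Writing 3 gives 0b1000 under the first reading and 0b11 under the second,
so the text should say which one is intended.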

> +SECTION 7: Switch Control
> +=========================
> +
> +This section covers switch-wide register settings.
> +
> +Control
> +-------
> +
> +This register is used for low level control of the switch.
> +
> +	CONTROL: offset 0x0300, 32-bit, (W)
> +
> +	bit	name		description
> +	------------------------------------------------------------------------
> +	[0]	CONTROL_RESET	If set, device will perform reset (same
> +				as pci reset)

It's not the same as PCI reset, as it will not reset BARs for example.

> +
> +SECTION 8: CPU Packet Processing
> +================================
> +
> +For packets ingressing on switch ports that are not forwarded by the switch but
> +rather directed to the host CPU for further processing are delivered in the
> +DMA RX ring.  Likewise, for host CPU originating packets destined to egress on
> +switch ports onto the network are scheduled by software using the DMA TX ring.

Ingress packets for ports that are not forwarded by the switch are
directed to the host CPU for further processing, and delivered in the
DMA RX ring.  Likewise, the host CPU can use the DMA TX ring to schedule
packets that will egress onto the network.

> +
> +Tx Packet Processing
> +--------------------
> +
> +Software schedules packets for egress on switch ports using the DMA TX ring.  A
> +TX descriptor buffer describes the packet location and size in host DMA-able
> +memory, the destination port, and any hardware-offload functions (such as L3
> +payload checksum offload).  Software then bumps the descriptor head to signal
> +hardware of new Tx work.  In response, hardware will DMA read Tx descriptors up
> +to head, DMA read descriptor buffer and packet data, perform offloading
> +functions, and finally frame packet on wire (network).  Once packet processing
> +is complete, hardware will writeback status to descriptor(s) to signal to
> +software that Tx is complete and software resources (e.g. skb) backing packet
> +can be released.
> +
> +Figure 2 shows an example 3-fragment packet queued with one Tx descriptor.  A
> +TLV is used for each packet fragment.
> +
> +	                                           pkt frag 1
> +	                                           +-------+  +-+
> +	                                       +---+       |    |
> +	                         desc buf      |   |       |    |
> +	                        +--------+     |   |       |    |
> +	        Tx ring     +---+        +-----+   |       |    |
> +	      +---------+   |   |  TLVs  |         +-------+    |
> +	      |         +---+   +--------+         pkt frag 2   |
> +	      | desc 0  |       |        +-----+   +-------+    |
> +	      +---------+       |  TLVs  |     +---+       |    |
> +	head+-+         |       +--------+         |       |    |
> +	      | desc 1  |       |        +-----+   +-------+    |pkt
> +	      +---------+       |  TLVs  |     |                |
> +	      |         |       +--------+     |   pkt frag 3   |
> +	      |         |                      |   +-------+    |
> +	      +---------+                      +---+       |    |
> +	      |         |                          |       |    |
> +	      |         |                          |       |    |
> +	      +---------+                          |       |    |
> +	      |         |                          |       |    |
> +	      |         |                          |       |    |
> +	      +---------+                          |       |    |
> +	      |         |                          +-------+  +-+
> +	      |         |
> +	      +---------+
> +
> +				fig 2.
> +
> +The TLVs for Tx descriptor buffer are:
> +
> +	field			width	description
> +	---------------------------------------------------------------------
> +	PPORT			4	Destination physical port #
> +	TX_OFFLOAD		1	Hardware offload modes:
> +					  0: no offload
> +					  1: insert IP csum (ipv4 only)
> +					  2: insert TCP/UDP csum
> +					  3: L3 csum calc and insert
> +					     into csum offset (TX_L3_CSUM_OFF)
> +					     16-bit 1's complement csum value.
> +					     IPv4 pseudo-header and IP
> +					     already calculated by OS
> +					     and inserted.
> +					  4: TSO (TCP Segmentation Offload)
> +	TX_L3_CSUM_OFF		2	For L3 csum offload mode, the offset,
> +					from the beginning of the packet,
> +					of the csum field in the L3 header
> +	TX_TSO_MSS		2	For TSO offload mode, the
> +					Maximum Segment Size in bytes
> +	TX_TSO_HDR_LEN		2	For TSO offload mode, the
> +					length of ethernet, IP, and
> +					TCP/UDP headers, including IP
> +					and TCP options.
> +	TX_FRAGS		<array>	Packet fragments
> +	  TX_FRAG		<nest>	Packet fragment
> +	    TX_FRAG_ADDR	8	DMA address of packet fragment
> +	    TX_FRAG_LEN		2	Packet fragment length
> +
> +Possible status return codes in descriptor on completion are:
> +
> +	DESC_COMP_ERR	reason
> +	--------------------------------------------------------------------
> +	0		OK
> +	ENXIO		address or data read err on desc buf or packet
> +			fragment

This is more like EFAULT actually.

> +	EINVAL		bad pport or TSO or csum offloading error
> +	ENOMEM		no memory for internal staging tx fragment

QEMU is portable and these values are not, unfortunately.  So please
hardcode them to be 6/22/12 respectively.

Or even better, to avoid the temptation, make them 1/2/3 and create new
constants ROCKER_OK, ROCKER_ERR_FAULT, ROCKER_ERR_INVAL, ROCKER_ERR_NOMEM.

In any case, since you are at it, sort them in either numeric order or
alphabetic order (apart from OK which can remain first).
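
Something along these lines, shown in Python only to keep it short; the
real definitions would be C constants shared by the device and the driver.
The 0 for ROCKER_OK follows the table above, the 1/2/3 assignment is the
alphabetical order suggested here, and the describe() helper is
hypothetical.

```python
from enum import IntEnum


class RockerStatus(IntEnum):
    """Device-defined completion codes, independent of any host errno."""
    ROCKER_OK = 0
    ROCKER_ERR_FAULT = 1
    ROCKER_ERR_INVAL = 2
    ROCKER_ERR_NOMEM = 3


# Reasons copied from the Tx completion table quoted above.
REASONS = {
    RockerStatus.ROCKER_OK: "OK",
    RockerStatus.ROCKER_ERR_FAULT:
        "address or data read err on desc buf or packet fragment",
    RockerStatus.ROCKER_ERR_INVAL:
        "bad pport or TSO or csum offloading error",
    RockerStatus.ROCKER_ERR_NOMEM:
        "no memory for internal staging tx fragment",
}


def describe(code):
    """Map a raw DESC_COMP_ERR value to its reason string."""
    return REASONS[RockerStatus(code)]
```

The values are then fixed by the spec and do not depend on the errno.h of
the host QEMU was built on.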

> +Rx Packet Processing
> +--------------------
> +
> +For packets ingressing on switch ports that are not forwarded by the switch but
> +rather directed to the host CPU for further processing are delivered in the
> +DMA RX ring.  Rx descriptor buffers are allocated by software and placed on the
> +ring.  Hardware will fill Rx descriptor buffers with packet data, write the
> +completion, and signal to software that a new packet is ready.  Since Rx packet
> +size is not known a-priori, the Rx descriptor buffer must be allocated for
> +worst-case packet size.  A single Rx descriptor will contain the entire Rx
> +packet data in one RX_PACKET TLV.  Other Rx TLVs describe any hardware offloads
> +performed on the packet, such as checksum validation.
> +
> +The TLVs for Rx descriptor buffer are:
> +
> +	field		width	description
> +	---------------------------------------------------
> +	PPORT		4	Source physical port #
> +	RX_FLAGS	2	Packet parsing flags:
> +				  (1 << 0): IPv4 packet
> +				  (1 << 1): IPv6 packet
> +				  (1 << 2): csum calculated
> +				  (1 << 3): IPv4 csum good
> +				  (1 << 4): IP fragment
> +				  (1 << 5): TCP packet
> +				  (1 << 6): UDP packet
> +				  (1 << 7): TCP/UDP csum good
> +	RX_CSUM		2	IP calculated checksum:
> +				  IPv4: IP payload csum
> +				  IPv6: header and payload csum
> +				(Only valid if RX_FLAGS:csum calc is set)
> +	RX_PACKET (N)	<var>	Packet data
> +
> +Possible status return codes in descriptor on completion are:
> +
> +	DESC_COMP_ERR	reason
> +	--------------------------------------------------------------------
> +	0		OK
> +	ENXIO		address or data read err on desc buf
> +	ENOMEM		no memory for internal staging desc buf
> +	EMSGSIZE	Rx descriptor buffer wasn't big enough to contain
> +			packet data TLV and other TLVs.

EMSGSIZE in fact doesn't even exist on Windows.  So make this
ROCKER_ERR_MSGSIZE==4.


> +	field			width	description
> +	----------------------------------------------------
> +	OF_DPA_CMD		2	CMD_[ADD|MOD]
> +	OF_DPA_TBL		2	Flow table ID
> +					  0: ingress port
> +					  10: vlan
> +					  20: termination mac
> +					  30: unicast routing
> +					  40: multicast routing
> +					  50: bridging
> +					  60: ACL policy

Decimal, I guess.  Better mention it, if only for completeness.
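
For example, spelled out as decimal constants (Python here just for
brevity; the enum and member names are made up, only the numbers come from
the table above):

```python
from enum import IntEnum


class OfDpaTable(IntEnum):
    """OF_DPA_TBL flow table IDs, read as decimal values."""
    INGRESS_PORT = 0
    VLAN = 10
    TERMINATION_MAC = 20
    UNICAST_ROUTING = 30
    MULTICAST_ROUTING = 40
    BRIDGING = 50
    ACL_POLICY = 60


def is_valid_table(table_id):
    """True if table_id is one of the IDs listed in the guide."""
    return table_id in set(OfDpaTable)
```

Read as hex, "10" would be 16, which is not a valid table under the
decimal interpretation, so it is worth saying explicitly.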

> +Possible status return codes in descriptor on completion are:
> +
> +	DESC_COMP_ERR	command			reason
> +	--------------------------------------------------------------------
> +	0		all			OK
> +	EFAULT		all			head or tail index outside
> +						of ring
> +	ENXIO		all			address or data read err on
> +						desc buf
> +	ENOSPC		GET_STATS		cmd descriptor buffer wasn't
> +						big enough to contain write-back
> +						TLVs
> +	EINVAL		ADD|MOD			invalid parameters passed in
> +	EEXIST		ADD			entry already exists
> +	ENOSPC		ADD			no space left in flow table
> +	ENOENT		MOD|DEL|GET_STATS	group ID invalid
> +	EBUSY		DEL			group reference count non-zero
> +	ENODEV		ADD			next group ID doesn't exist

Same as above, please add decimal values instead of overloading errno.

Paolo
