From: linas@austin.ibm.com (Linas Vepstas)
To: Jeff Garzik <jgarzik@pobox.com>
Cc: cbe-oss-dev@ozlabs.org, netdev@vger.kernel.org,
	Nathan J Lee <njlee@us.ibm.com>, Ling Shao <shaol@cn.ibm.com>,
	Utz Bacher <utz.bacher@de.ibm.com>,
	Zhen Bo Zhu <zhuzb@cn.ibm.com>, Zhu Han <hanzhu@cn.ibm.com>,
	Jens Osterkamp <Jens.Osterkamp@de.ibm.com>,
	Yan Qi Wang <yqwang@cn.ibm.com>
Subject: [PATCH 18/18] spidernet: driver docmentation
Date: Thu, 7 Jun 2007 15:05:03 -0500
Message-ID: <20070607200503.GR16077@austin.ibm.com>
In-Reply-To: <20070607191707.GA7904@austin.ibm.com>


Documentation for the spidernet driver.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>

----
 Documentation/networking/spider_net.txt |  204 ++++++++++++++++++++++++++++++++
 1 file changed, 204 insertions(+)

Index: linux-2.6.22-rc1/Documentation/networking/spider_net.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.22-rc1/Documentation/networking/spider_net.txt	2007-06-07 14:01:52.000000000 -0500
@@ -0,0 +1,204 @@
+
+            The Spidernet Device Driver
+            ===========================
+
+Written by Linas Vepstas <linas@austin.ibm.com>
+
+Version of 7 June 2007
+
+Abstract
+========
+This document sketches the structure of portions of the spidernet
+device driver in the Linux kernel tree. The spidernet is a gigabit
+ethernet device built into the Toshiba southbridge commonly used
+in the SONY Playstation 3 and the IBM QS20 Cell blade.
+
+The Structure of the RX Ring.
+=============================
+The receive (RX) ring is a circular linked list of RX descriptors,
+together with three pointers into the ring that are used to manage its
+contents.
+
+The elements of the ring are called "descriptors" or "descrs"; they
+describe the received data. This includes a pointer to a buffer
+containing the received data, the buffer size, and various status bits.
+
+There are three primary states that a descriptor can be in: "empty",
+"full" and "not-in-use".  An "empty" or "ready" descriptor is ready
+to receive data from the hardware. A "full" descriptor has data in it,
+and is waiting to be emptied and processed by the OS. A "not-in-use"
+descriptor is neither empty nor full; it is simply not ready. It may
+not even have a data buffer attached, or may be otherwise unusable.
+
+During normal operation, on device startup, the OS (specifically, the
+spidernet device driver) allocates a set of RX descriptors and RX
+buffers. These are all marked "empty", ready to receive data. This
+ring is handed off to the hardware, which sequentially fills in the
+buffers, and marks them "full". The OS follows up, taking the full
+buffers, processing them, and re-marking them empty.
+
+This filling and emptying is managed by three pointers: the "head"
+and "tail" pointers, managed by the OS, and a hardware current
+descriptor pointer (GDACTDPA). The GDACTDPA points at the descr
+currently being filled. When this descr is filled, the hardware
+marks it full, and advances the GDACTDPA by one.  Thus, when there is
+flowing RX traffic, every descr behind it should be marked "full",
+and everything in front of it should be "empty".  If the hardware
+discovers that the current descr is not empty, it will signal an
+interrupt, and halt processing.
+
+The tail pointer tails or trails the hardware pointer. When the
+hardware is ahead, the tail pointer will be pointing at a "full"
+descr. The OS will process this descr, and then mark it "not-in-use",
+and advance the tail pointer.  Thus, when there is flowing RX traffic,
+all of the descrs in front of the tail pointer should be "full", and
+all of those behind it should be "not-in-use". When RX traffic is not
+flowing, then the tail pointer can catch up to the hardware pointer.
+The OS will then note that the current tail is "empty", and halt
+processing.
+
+The head pointer (somewhat mis-named) follows after the tail pointer.
+When traffic is flowing, then the head pointer will be pointing at
+a "not-in-use" descr. The OS will perform various housekeeping duties
+on this descr. This includes allocating a new data buffer and
+dma-mapping it so as to make it visible to the hardware. The OS will
+then mark the descr as "empty", ready to receive data. Thus, when there
+is flowing RX traffic, everything in front of the head pointer should
+be "not-in-use", and everything behind it should be "empty". If no
+RX traffic is flowing, then the head pointer can catch up to the tail
+pointer, at which point the OS will notice that the head descr is
+"empty", and it will halt processing.
+
+Thus, in an idle system, the GDACTDPA, tail and head pointers will
+all be pointing at the same descr, which should be "empty". All of the
+other descrs in the ring should be "empty" as well.
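
The pointer choreography described above can be sketched as a small user-space toy model in Python. This is not driver code: the class name, the method names and the use of plain strings for the three states are all assumptions made for illustration. It only shows how the hardware pointer, the tail and the head chase each other around the ring.

```python
EMPTY, FULL, NOT_IN_USE = "empty", "full", "not-in-use"


class RxRing:
    """Toy model of the RX ring; names are hypothetical, not the driver's."""

    def __init__(self, size):
        self.state = [EMPTY] * size
        self.hw = 0    # stands in for the hardware GDACTDPA pointer
        self.tail = 0  # next descr the OS will process
        self.head = 0  # next descr the OS will refill

    def _next(self, i):
        return (i + 1) % len(self.state)

    def hw_receive(self):
        """Hardware fills the current descr; False means it had to halt."""
        if self.state[self.hw] != EMPTY:
            return False
        self.state[self.hw] = FULL
        self.hw = self._next(self.hw)
        return True

    def os_process(self):
        """OS consumes full descrs at the tail; returns how many it took."""
        count = 0
        while self.state[self.tail] == FULL:
            self.state[self.tail] = NOT_IN_USE
            self.tail = self._next(self.tail)
            count += 1
        return count

    def os_refill(self):
        """OS re-arms not-in-use descrs at the head; returns how many."""
        count = 0
        while self.state[self.head] == NOT_IN_USE:
            self.state[self.head] = EMPTY
            self.head = self._next(self.head)
            count += 1
        return count
```

Feeding the model three packets and then calling os_process() and os_refill() returns it to the idle picture: every descr is "empty" and all three pointers sit on the same descr. Filling every descr without servicing the ring makes the next hw_receive() return False, which models the hardware halting on a not-empty descr.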
+
+The show_rx_chain() routine will print out the locations of the
+GDACTDPA, tail and head pointers. It will also summarize the contents
+of the ring, starting at the tail pointer, and listing the status
+of the descrs that follow.
+
+A typical example of the output, for a nearly idle system, might be
+
+net eth1: Total number of descrs=256
+net eth1: Chain tail located at descr=20
+net eth1: Chain head is at 20
+net eth1: HW curr desc (GDACTDPA) is at 21
+net eth1: Have 1 descrs with stat=x40800101
+net eth1: HW next desc (GDACNEXTDA) is at 22
+net eth1: Last 255 descrs with stat=xa0800000
+
+In the above, the hardware has filled in one descr, number 20. Both
+head and tail are pointing at 20, because it has not yet been emptied.
+Meanwhile, hw is pointing at 21, which is free.
+
+The "Have nnn descrs" line refers to the descrs starting at the tail: in this
+case, nnn=1 descr, starting at descr 20. The "Last nnn descrs" refers
+to all of the rest of the descrs, from the last status change. The "nnn"
+is a count of how many descrs have exactly the same status.
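
The run-length counting behind the "Have nnn descrs" and "Last nnn descrs" lines can be illustrated with a short Python sketch. It is not the driver's show_rx_chain(); the function name summarize_ring and the list-of-status-words input are assumptions made for this example. It walks the ring once, starting at the tail, and groups consecutive descrs that share exactly the same status.

```python
from itertools import groupby


def summarize_ring(stats, tail):
    """Return (count, status) runs, walking the ring once from the tail."""
    order = stats[tail:] + stats[:tail]
    return [(len(list(run)), stat) for stat, run in groupby(order)]


# Rebuild the nearly idle example: 256 descrs, only descr 20 is full.
stats = [0xA0800000] * 256
stats[20] = 0x40800101
for count, stat in summarize_ring(stats, tail=20):
    print(f"{count} descrs with stat=x{stat:08x}")
```

For the nearly idle example this yields one run of 1 descr followed by one run of 255 descrs, matching the two counts in the sample output.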
+
+The status x4... corresponds to "full" and status xa... corresponds
+to "empty". The actual value printed is RXCOMST_A.
+
+In the device driver source code, a different set of names is
+used for these same concepts, so that
+
+"empty"      == SPIDER_NET_DESCR_CARDOWNED  == 0xa
+"full"       == SPIDER_NET_DESCR_FRAME_END  == 0x4
+"not-in-use" == SPIDER_NET_DESCR_NOT_IN_USE == 0xf
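
As a reading aid, the mapping above can be turned into a small decoder for the status words printed by show_rx_chain(). This Python sketch assumes, as the sample values x40800101 and xa0800000 suggest, that the state occupies the top four bits of a 32-bit status word. The function name descr_state is hypothetical, and the remaining bits are deliberately left uninterpreted.

```python
# State codes taken from the table above; the bit position is an assumption.
STATE_NAMES = {
    0xA: "empty",       # SPIDER_NET_DESCR_CARDOWNED
    0x4: "full",        # SPIDER_NET_DESCR_FRAME_END
    0xF: "not-in-use",  # SPIDER_NET_DESCR_NOT_IN_USE
}


def descr_state(stat):
    """Map a 32-bit status word to a state name via its top nibble."""
    code = (stat >> 28) & 0xF
    return STATE_NAMES.get(code, f"unknown (0x{code:x})")


print(descr_state(0x40800101))
print(descr_state(0xA0800000))
```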
+
+
+The RX RAM full bug/feature
+===========================
+
+As long as the OS can empty out the RX buffers at a rate faster than
+the hardware can fill them, there is no problem. If, for some reason,
+the OS fails to empty the RX ring fast enough, the hardware GDACTDPA
+pointer will catch up to the head, notice the not-empty condition,
+and stop. However, RX packets may still continue arriving on the wire.
+The spidernet chip can save some limited number of these in local RAM.
+When this local RAM fills up, the spider chip will issue an interrupt
+indicating this (GHIINT0STS will show ERRINT, and the GRMFLLINT bit
+will be set in GHIINT1STS).  When the RX RAM full condition occurs,
+a certain bug/feature is triggered that has to be specially handled.
+This section describes the special handling for this condition.
+
+When the OS finally has a chance to run, it will empty out the RX ring.
+In particular, it will clear the descriptor on which the hardware had
+stopped. However, once the hardware has decided that a certain
+descriptor is invalid, it will not restart at that descriptor; instead
+it will restart at the next descr. This potentially will lead to a
+deadlock condition, as the tail pointer will be pointing at this descr,
+which, from the OS point of view, is empty; the OS will be waiting for
+this descr to be filled. However, the hardware has skipped this descr,
+and is filling the next descrs. Since the OS doesn't see this, there
+is a potential deadlock, with the OS waiting for one descr to fill,
+while the hardware is waiting for a different set of descrs to become
+empty.
+
+A call to show_rx_chain() at this point indicates the nature of the
+problem. A typical print when the network is hung shows the following:
+
+net eth1: Spider RX RAM full, incoming packets might be discarded!
+net eth1: Total number of descrs=256
+net eth1: Chain tail located at descr=255
+net eth1: Chain head is at 255
+net eth1: HW curr desc (GDACTDPA) is at 0
+net eth1: Have 1 descrs with stat=xa0800000
+net eth1: HW next desc (GDACNEXTDA) is at 1
+net eth1: Have 127 descrs with stat=x40800101
+net eth1: Have 1 descrs with stat=x40800001
+net eth1: Have 126 descrs with stat=x40800101
+net eth1: Last 1 descrs with stat=xa0800000
+
+Both the tail and head pointers are pointing at descr 255, which is
+marked xa... which is "empty". Thus, from the OS point of view, there
+is nothing to be done. In particular, there is the implicit assumption
+that everything in front of the "empty" descr must surely also be empty,
+as explained in the last section. The OS is waiting for descr 255 to
+become non-empty, which, in this case, will never happen.
+
+The HW pointer is at descr 0. This descr is marked 0x4.. or "full".
+Since it's already full, the hardware can do nothing more, and thus has
+halted processing. Notice that descrs 0 through 253 are all marked
+"full", while descrs 254 and 255 are empty. (The "Last 1 descrs" is
+descr 254, since tail was at 255.) Thus, the system is deadlocked,
+and there can be no forward progress; the OS thinks there's nothing
+to do, and the hardware has nowhere to put incoming data.
+
+This bug/feature is worked around with the spider_net_resync_head_ptr()
+routine. When the driver receives RX interrupts, but an examination
+of the RX chain seems to show it is empty, then it is probable that
+the hardware has skipped a descr or two (sometimes dozens under heavy
+network conditions). The spider_net_resync_head_ptr() subroutine will
+search the ring for the next full descr, and the driver will resume
+operations there.  Since this will leave "holes" in the ring, there
+is also a spider_net_resync_tail_ptr() that will skip over such holes.
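
The search step can be sketched in Python, based only on the description above. The real spider_net_resync_head_ptr() walks the driver's descriptor chain and may differ in detail; the function name find_next_full and the list-of-state-strings input are assumptions made for this illustration.

```python
def find_next_full(states, start):
    """Return the index of the first "full" descr at or after start,
    wrapping around the ring once, or None if no descr is full."""
    size = len(states)
    for step in range(size):
        index = (start + step) % size
        if states[index] == "full":
            return index
    return None


# The hung ring from the printout: descrs 0..253 full, 254 and 255 empty.
hung = ["full"] * 254 + ["empty", "empty"]
print(find_next_full(hung, start=255))
```

With the tail stuck on the empty descr 255, the search wraps around and lands on descr 0, which is where the hardware pointer is sitting in the hung example.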
+
+As of this writing, the spider_net_resync() strategy seems to work very
+well, even under heavy network loads.
+
+
+The TX ring
+===========
+The TX ring uses a low-watermark interrupt scheme to make sure that
+the TX queue is appropriately serviced for large packet sizes.
+
+For packet sizes greater than about 1 KByte, the kernel can fill
+the TX ring faster than the device can drain it. Once the ring
+is full, the netdev is stopped. When there is room in the ring,
+the netdev needs to be reawakened, so that more TX packets are placed
+in the ring. The hardware can empty the ring about four times per jiffy,
+so it's not appropriate to wait for the poll routine to refill, since
+the poll routine runs only once per jiffy.  The low-watermark mechanism
+marks a descr about 1/4 of the way from the bottom of the queue, so
+that an interrupt is generated when the descr is processed. This
+interrupt wakes up the netdev, which can then refill the queue.
+For large packets, this mechanism generates a relatively small number
+of interrupts, about 1K/sec. For smaller packets, this will drop to zero
+interrupts, as the hardware can empty the queue faster than the kernel
+can fill it.
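
As a rough consistency check on the figures above: if the hardware drains the ring about four times per jiffy, and each drain triggers one low-watermark interrupt, the interrupt rate is roughly four times HZ. Both the one-interrupt-per-drain reading and the value HZ = 250 used below are assumptions; the text states neither. The Python sketch only shows that, under those assumptions, the estimate agrees with the quoted rate of about 1K/sec.

```python
def watermark_irq_rate(hz, drains_per_jiffy=4):
    """Estimated low-watermark interrupts per second,
    assuming one interrupt per ring drain."""
    return hz * drains_per_jiffy


# HZ = 250 is an assumed kernel tick rate, not taken from the document.
print(watermark_irq_rate(250))
```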
+
+
+ ======= END OF DOCUMENT ========
+
