Netdev List
 help / color / mirror / Atom feed
* Re: [RFC 1/2] PCI-Express Non-Transparent Bridge Support
From: Jon Mason @ 2012-07-16 17:55 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, netdev, linux-pci, Dave Jiang
In-Reply-To: <20120716001921.GA19775@kroah.com>

On Sun, Jul 15, 2012 at 05:19:21PM -0700, Greg KH wrote:
> On Sun, Jul 15, 2012 at 04:55:48PM -0700, Jon Mason wrote:
> > On Sat, Jul 14, 2012 at 10:10:15AM -0700, Greg KH wrote:
> > > On Fri, Jul 13, 2012 at 02:44:59PM -0700, Jon Mason wrote:
> > > > +static int max_num_cbs = 2;
> > > > +module_param(max_num_cbs, uint, 0644);
> > > > +MODULE_PARM_DESC(max_num_cbs, "Maximum number of NTB transport connections");
> > > > +
> > > > +static bool no_msix;
> > > > +module_param(no_msix, bool, 0644);
> > > > +MODULE_PARM_DESC(no_msix, "Do not allow MSI-X interrupts to be selected");
> > > 
> > > How would a user, or a distro, know to set these options?  Why are they
> > > even options at all?
> > 
> > Good question.  There is actually a potential benefit to disabling
> > MSI-X.  The NTB device on one of our platforms only has 3 MSI-X
> > vectors.  In the current driver design, that would limit them to 3
> > client/virtual devices.  However, there are 15bits in the ISR that can
> > be used for the same purpose.  So, if you disable MSI-X, you can have
> > 15 instead of 3.  
> 
> But again, how would a user, or a distro, know to set these?  Where is
> the documentation describing it?  Why really have these options at all
> and not just fix the platform issues (only 3 MSI-X vectors?  Really?)

I believe we'll want multiple clients (or have multiqueue Ethernet).  I'm happy to add something to /Documentation to describe it and why it would be useful, or I can remove it and re-introduce it when I add multiqueue Ethernet.

3 MSI-X vectors (plus one for PCI-E link up/down) on Xeon NTB, and 33 for Atom NTB.  Yeah, really.

> 
> thanks,
> 
> greg k-h

^ permalink raw reply

* Re: [RFC 1/2] PCI-Express Non-Transparent Bridge Support
From: chetan loke @ 2012-07-16 18:26 UTC (permalink / raw)
  To: Jon Mason; +Cc: linux-kernel, netdev, linux-pci, Dave Jiang
In-Reply-To: <1342215900-3358-1-git-send-email-jon.mason@intel.com>

Jon,

On Fri, Jul 13, 2012 at 5:44 PM, Jon Mason <jon.mason@intel.com> wrote:

..............

> +/**
> + * ntb_ring_sdb() - Set the doorbell on the secondary/external side
> + * @ndev: pointer to ntb_device instance
> + * @db: doorbell to ring
> + *
> + * This function allows triggering of a doorbell on the secondary/external
> + * side that will initiate an interrupt on the remote host
> + *
> + * RETURNS: An appropriate -ERRNO error value on error, or zero for success.
> + */
> +int ntb_ring_sdb(struct ntb_device *ndev, unsigned int db)
> +{
> +       dev_dbg(&ndev->pdev->dev, "%s: ringing doorbell %d\n", __func__, db);
> +

> +       if (db >= ndev->max_cbs)
> +               return -EINVAL;

How about moving this max_cbs error check in the upper level
callers(example in ntb_process_tx)?
That way you won't have to defer handling some negative cases all the
way till the end.

So ntb_process_tx could now look like:

.....
error=0;
if (entry->len > transport_mtu) {
...
error=1;
}
else if (qp->qp_num >= qp->ndev->max_cbs) {
...
error=1;
}

if (unlikely(error)) {
      ntb_list_add_tail(&qp->txc_lock, &entry->entry, &qp->txc);
      if (qp->tx_handler)
           qp->tx_handler(qp);

      return 0;
}
.................

No further comments below

> +
> +static int ntb_process_tx(struct ntb_transport_qp *qp,
> +                         struct ntb_queue_entry *entry)
> +{
> +       struct ntb_payload_header *hdr;
> +       void *offset;
> +
> +       offset = qp->tx_offset;
> +       hdr = offset;
> +
> +       pr_debug("%lld - offset %p, tx %p, entry len %d flags %x buff %p\n",
> +                qp->tx_pkts, offset, qp->tx_offset, entry->len, entry->flags,
> +                entry->buf);
> +       if (hdr->flags) {
> +               ntb_list_add_head(&qp->txq_lock, &entry->entry, &qp->txq);
> +               qp->tx_ring_full++;
> +               return -EAGAIN;
> +       }
> +
> +       if (entry->len > transport_mtu) {
> +               pr_err("Trying to send pkt size of %d\n", entry->len);
> +               entry->flags = HW_ERROR_FLAG;
> +
> +               ntb_list_add_tail(&qp->txc_lock, &entry->entry, &qp->txc);
> +
> +               if (qp->tx_handler)
> +                       qp->tx_handler(qp);
> +
> +               return 0;
> +       }
> +
> +       ntb_tx_copy_task(qp, entry, offset);
> +
> +       qp->tx_offset =
> +           (qp->tx_offset +
> +            ((transport_mtu + sizeof(struct ntb_payload_header)) * 2) >=
> +            qp->tx_mw_end) ? qp->tx_mw_begin : qp->tx_offset + transport_mtu +
> +           sizeof(struct ntb_payload_header);
> +
> +       qp->tx_pkts++;
> +
> +       return 0;
> +}
> +


Chetan Loke

^ permalink raw reply

* Re: [RFC 1/2] PCI-Express Non-Transparent Bridge Support
From: Greg KH @ 2012-07-16 18:30 UTC (permalink / raw)
  To: Jon Mason; +Cc: linux-kernel, netdev, linux-pci, Dave Jiang
In-Reply-To: <20120716175505.GF9598@jonmason-lab>

On Mon, Jul 16, 2012 at 10:55:06AM -0700, Jon Mason wrote:
> On Sun, Jul 15, 2012 at 05:19:21PM -0700, Greg KH wrote:
> > On Sun, Jul 15, 2012 at 04:55:48PM -0700, Jon Mason wrote:
> > > On Sat, Jul 14, 2012 at 10:10:15AM -0700, Greg KH wrote:
> > > > On Fri, Jul 13, 2012 at 02:44:59PM -0700, Jon Mason wrote:
> > > > > +static int max_num_cbs = 2;
> > > > > +module_param(max_num_cbs, uint, 0644);
> > > > > +MODULE_PARM_DESC(max_num_cbs, "Maximum number of NTB transport connections");
> > > > > +
> > > > > +static bool no_msix;
> > > > > +module_param(no_msix, bool, 0644);
> > > > > +MODULE_PARM_DESC(no_msix, "Do not allow MSI-X interrupts to be selected");
> > > > 
> > > > How would a user, or a distro, know to set these options?  Why are they
> > > > even options at all?
> > > 
> > > Good question.  There is actually a potential benefit to disabling
> > > MSI-X.  The NTB device on one of our platforms only has 3 MSI-X
> > > vectors.  In the current driver design, that would limit them to 3
> > > client/virtual devices.  However, there are 15bits in the ISR that can
> > > be used for the same purpose.  So, if you disable MSI-X, you can have
> > > 15 instead of 3.  
> > 
> > But again, how would a user, or a distro, know to set these?  Where is
> > the documentation describing it?  Why really have these options at all
> > and not just fix the platform issues (only 3 MSI-X vectors?  Really?)
> 
> I believe we'll want multiple clients (or have multiqueue Ethernet).
> I'm happy to add something to /Documentation to describe it and why it
> would be useful, or I can remove it and re-introduce it when I add
> multiqueue Ethernet.

I'd suggest waiting and adding it later if really needed (see previous
comment about not adding code/features before they are actually needed.)

thanks,

greg k-h

^ permalink raw reply

* connecting two or more hotspots at the same time
From: Akhil pm @ 2012-07-16 18:37 UTC (permalink / raw)
  To: netdev



^ permalink raw reply

* Re: [RFC 1/2] PCI-Express Non-Transparent Bridge Support
From: Jon Mason @ 2012-07-16 18:38 UTC (permalink / raw)
  To: chetan loke; +Cc: linux-kernel, netdev, linux-pci, Dave Jiang
In-Reply-To: <CAAsGZS5uMArBVa+Rog1H60FVfKq84eqt5G2dvafDWXTP-RdiwQ@mail.gmail.com>

On Mon, Jul 16, 2012 at 12:49:39PM -0400, chetan loke wrote:
> Hi Jon,
> 
> On Fri, Jul 13, 2012 at 5:44 PM, Jon Mason <jon.mason@intel.com> wrote:
> 
> Just a few minor comments/questions:
> 
> ....
> 
> > +struct ntb_transport_qp {
> > +       struct ntb_device *ndev;
> > +
> > +       bool client_ready;
> > +       bool qp_link;
> > +       u8 qp_num;      /* Only 64 QP's are allowed.  0-63 */
> > +
> > +       void (*tx_handler) (struct ntb_transport_qp *qp);
> > +       struct tasklet_struct tx_work;
> 
> Is it ok to rename the following vars for convenience sake?
> 
> > +       struct list_head txq;
> tx_pend_q - (pending_queue) or tx_out_q - (outstanding_queue) - or
> pick any new string you like - other than a mono-syllable definition
> 
> > +       struct list_head txc;
> tx_compl_q - completion queue
> 
> > +       struct list_head txe;
> tx_avail_e - available entry queue
> 
> 
> > +       spinlock_t txq_lock;
> > +       spinlock_t txc_lock;
> > +       spinlock_t txe_lock;
> 
> then match the variants accordingly.
> 
> > +       struct list_head rxq;
> > +       struct list_head rxc;
> > +       struct list_head rxe;
> > +       spinlock_t rxq_lock;
> > +       spinlock_t rxc_lock;
> > +       spinlock_t rxe_lock;
> 
> similarly the rx-counterpart

Are they difficult to understand?  I can change them, but it seems rather moot since you seemed to immediately understand the logic behind the names.

> 
> 
> ..................
> 
> > +static void ntb_tx_copy_task(struct ntb_transport_qp *qp,
> > +                            struct ntb_queue_entry *entry,
> > +                            void *offset)
> > +{
> > +       struct ntb_payload_header *hdr = offset;
> > +       int rc;
> > +
> > +       offset += sizeof(struct ntb_payload_header);
> > +       memcpy_toio(offset, entry->buf, entry->len);
> > +
> > +       hdr->len = entry->len;
> > +       hdr->ver = qp->tx_pkts;
> > +
> > +       /* Ensure that the data is fully copied out before setting the flag */
> > +       wmb();
> > +       hdr->flags = entry->flags | DESC_DONE_FLAG;
> > +
> > +       rc = ntb_ring_sdb(qp->ndev, qp->qp_num);
> > +       if (rc)
> > +               pr_err("%s: error ringing db %d\n", __func__, qp->qp_num);
> > +
> > +       if (entry->len > 0) {
> 
> how do you interpret this len variable and decide if it's a good/bad completion?

The length of 0 is for messages from the remote system, which currently only consists of a "link down" message notifying the local system to no longer send any messages to the remote side.

> 
> > +               qp->tx_bytes += entry->len;
> > +
> > +               /* Add fully transmitted data to completion queue */
> > +               ntb_list_add_tail(&qp->txc_lock, &entry->entry, &qp->txc);
> > +
> > +               if (qp->tx_handler)
> > +                       qp->tx_handler(qp);
> > +       } else
> > +               ntb_list_add_tail(&qp->txe_lock, &entry->entry, &qp->txe);
> 
> I could be wrong but how is the original skb handled if the code path
> goes in the else clause?

There is no skb is the length is zero.  Since the only client is virtual ethernet, it will always be > 60.  However, I should add a sanity check for 0 length in tx_enqueue.

> Also, in the else clause, how about a ntb_list_add_head rather than a _tail.

Why add to the head, it was just used?

> > +
> > +static int ntb_process_tx(struct ntb_transport_qp *qp,
> > +                         struct ntb_queue_entry *entry)
> > +{
> > +       struct ntb_payload_header *hdr;
> > +       void *offset;
> > +
> > +       offset = qp->tx_offset;
> > +       hdr = offset;
> > +
> > +       pr_debug("%lld - offset %p, tx %p, entry len %d flags %x buff %p\n",
> > +                qp->tx_pkts, offset, qp->tx_offset, entry->len, entry->flags,
> > +                entry->buf);
> > +       if (hdr->flags) {
> > +               ntb_list_add_head(&qp->txq_lock, &entry->entry, &qp->txq);
> > +               qp->tx_ring_full++;
> > +               return -EAGAIN;
> > +       }
> > +
> > +       if (entry->len > transport_mtu) {
> > +               pr_err("Trying to send pkt size of %d\n", entry->len);
> > +               entry->flags = HW_ERROR_FLAG;
> > +
> > +               ntb_list_add_tail(&qp->txc_lock, &entry->entry, &qp->txc);
> > +
> > +               if (qp->tx_handler)
> > +                       qp->tx_handler(qp);
> > +
> > +               return 0;
> > +       }
> > +
> > +       ntb_tx_copy_task(qp, entry, offset);
> 
> what happens when ntb_sdb_ring returns an error? would you still want
> to increment tx_pkts below?

It's not fatal if the remote side isn't notified.  It will hurt latency, since the next packet would be the one that triggers the next interrupt.  Also, the only way it could ever fail would be if it was an invalid interrupt bit being set, which should never happen.

> > +
> > +       qp->tx_offset =
> > +           (qp->tx_offset +
> > +            ((transport_mtu + sizeof(struct ntb_payload_header)) * 2) >=
> > +            qp->tx_mw_end) ? qp->tx_mw_begin : qp->tx_offset + transport_mtu +
> > +           sizeof(struct ntb_payload_header);
> > +
> > +       qp->tx_pkts++;
> > +
> > +       return 0;
> > +}
> > +
> 
> ........................
> 
> 
> > +void *ntb_transport_tx_dequeue(struct ntb_transport_qp *qp, unsigned int *len)
> > +{
> > +       struct ntb_queue_entry *entry;
> > +       void *buf;
> > +
> > +       if (!qp)
> > +               return NULL;
> > +
> > +       entry = ntb_list_rm_head(&qp->txc_lock, &qp->txc);
> > +       if (!entry)
> > +               return NULL;
> > +
> > +       buf = entry->callback_data;
> > +       if (entry->flags != HW_ERROR_FLAG)
> > +               *len = entry->len;
> > +       else
> > +               *len = -EIO;
> > +
> > +       ntb_list_add_tail(&qp->txe_lock, &entry->entry, &qp->txe);
> 
> how about a ntb_list_add_head?

Is there a benefit to adding to the head versus tail?  It makes more sense to me to pull from the head and add to the tail.

> 
> 
> Chetan Loke

^ permalink raw reply

* [PATCH] sctp: Fix list corruption resulting from freeing an association on a list
From: Neil Horman @ 2012-07-16 18:39 UTC (permalink / raw)
  To: netdev
  Cc: Neil Horman, davej, David S. Miller, Vlad Yasevich,
	Sridhar Samudrala, linux-sctp

A few days ago Dave Jones reported this oops:

[22766.294255] general protection fault: 0000 [#1] PREEMPT SMP
[22766.295376] CPU 0
[22766.295384] Modules linked in:
[22766.387137]  ffffffffa169f292 6b6b6b6b6b6b6b6b ffff880147c03a90
ffff880147c03a74
[22766.387135] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000
[22766.387136] Process trinity-watchdo (pid: 10896, threadinfo ffff88013e7d2000,
[22766.387137] Stack:
[22766.387140]  ffff880147c03a10
[22766.387140]  ffffffffa169f2b6
[22766.387140]  ffff88013ed95728
[22766.387143]  0000000000000002
[22766.387143]  0000000000000000
[22766.387143]  ffff880003fad062
[22766.387144]  ffff88013c120000
[22766.387144]
[22766.387145] Call Trace:
[22766.387145]  <IRQ>
[22766.387150]  [<ffffffffa169f292>] ? __sctp_lookup_association+0x62/0xd0
[sctp]
[22766.387154]  [<ffffffffa169f2b6>] __sctp_lookup_association+0x86/0xd0 [sctp]
[22766.387157]  [<ffffffffa169f597>] sctp_rcv+0x207/0xbb0 [sctp]
[22766.387161]  [<ffffffff810d4da8>] ? trace_hardirqs_off_caller+0x28/0xd0
[22766.387163]  [<ffffffff815827e3>] ? nf_hook_slow+0x133/0x210
[22766.387166]  [<ffffffff815902fc>] ? ip_local_deliver_finish+0x4c/0x4c0
[22766.387168]  [<ffffffff8159043d>] ip_local_deliver_finish+0x18d/0x4c0
[22766.387169]  [<ffffffff815902fc>] ? ip_local_deliver_finish+0x4c/0x4c0
[22766.387171]  [<ffffffff81590a07>] ip_local_deliver+0x47/0x80
[22766.387172]  [<ffffffff8158fd80>] ip_rcv_finish+0x150/0x680
[22766.387174]  [<ffffffff81590c54>] ip_rcv+0x214/0x320
[22766.387176]  [<ffffffff81558c07>] __netif_receive_skb+0x7b7/0x910
[22766.387178]  [<ffffffff8155856c>] ? __netif_receive_skb+0x11c/0x910
[22766.387180]  [<ffffffff810d423e>] ? put_lock_stats.isra.25+0xe/0x40
[22766.387182]  [<ffffffff81558f83>] netif_receive_skb+0x23/0x1f0
[22766.387183]  [<ffffffff815596a9>] ? dev_gro_receive+0x139/0x440
[22766.387185]  [<ffffffff81559280>] napi_skb_finish+0x70/0xa0
[22766.387187]  [<ffffffff81559cb5>] napi_gro_receive+0xf5/0x130
[22766.387218]  [<ffffffffa01c4679>] e1000_receive_skb+0x59/0x70 [e1000e]
[22766.387242]  [<ffffffffa01c5aab>] e1000_clean_rx_irq+0x28b/0x460 [e1000e]
[22766.387266]  [<ffffffffa01c9c18>] e1000e_poll+0x78/0x430 [e1000e]
[22766.387268]  [<ffffffff81559fea>] net_rx_action+0x1aa/0x3d0
[22766.387270]  [<ffffffff810a495f>] ? account_system_vtime+0x10f/0x130
[22766.387273]  [<ffffffff810734d0>] __do_softirq+0xe0/0x420
[22766.387275]  [<ffffffff8169826c>] call_softirq+0x1c/0x30
[22766.387278]  [<ffffffff8101db15>] do_softirq+0xd5/0x110
[22766.387279]  [<ffffffff81073bc5>] irq_exit+0xd5/0xe0
[22766.387281]  [<ffffffff81698b03>] do_IRQ+0x63/0xd0
[22766.387283]  [<ffffffff8168ee2f>] common_interrupt+0x6f/0x6f
[22766.387283]  <EOI>
[22766.387284]
[22766.387285]  [<ffffffff8168eed9>] ? retint_swapgs+0x13/0x1b
[22766.387285] Code: c0 90 5d c3 66 0f 1f 44 00 00 4c 89 c8 5d c3 0f 1f 00 55 48
89 e5 48 83
ec 20 48 89 5d e8 4c 89 65 f0 4c 89 6d f8 66 66 66 66 90 <0f> b7 87 98 00 00 00
48 89 fb
49 89 f5 66 c1 c0 08 66 39 46 02
[22766.387307]
[22766.387307] RIP
[22766.387311]  [<ffffffffa168a2c9>] sctp_assoc_is_match+0x19/0x90 [sctp]
[22766.387311]  RSP <ffff880147c039b0>
[22766.387142]  ffffffffa16ab120
[22766.599537] ---[ end trace 3f6dae82e37b17f5 ]---
[22766.601221] Kernel panic - not syncing: Fatal exception in interrupt

It appears from his analysis and some staring at the code that this is likely
occuring because an association is getting freed while still on the
sctp_assoc_hashtable.  As a result, we get a gpf when traversing the hashtable
while a freed node corrupts part of the list.

Nominally I would think that an mibalanced refcount was responsible for this,
but I can't seem to find any obvious imbalance.  What I did note however was
that the two places where we create an association using
sctp_primitive_ASSOCIATE (__sctp_connect and sctp_sendmsg), have failure paths
which free a newly created association after calling sctp_primitive_ASSOCIATE.
sctp_primitive_ASSOCIATE brings us into the sctp_sf_do_prm_asoc path, which
issues a SCTP_CMD_NEW_ASOC side effect, which in turn adds a new association to
the aforementioned hash table.  the sctp command interpreter that process side
effects has not way to unwind previously processed commands, so freeing the
association from the __sctp_connect or sctp_sendmsg error path would lead to a
freed association remaining on this hash table.

I've fixed this but modifying sctp_[un]hash_established to use hlist_del_init,
which allows us to proerly use hlist_unhashed to check if the node is on a
hashlist safely during a delete.  That in turn alows us to safely call
sctp_unhash_established in the __sctp_connect and sctp_sendmsg error paths
before freeing them, regardles of what the associations state is on the hash
list.

I noted, while I was doing this, that the __sctp_unhash_endpoint was using
hlist_unhsashed in a simmilar fashion, but never nullified any removed nodes
pointers to make that function work properly, so I fixed that up in a simmilar
fashion.

I attempted to test this using a virtual guest running the SCTP_RR test from
netperf in a loop while running the trinity fuzzer, both in a loop.  I wasn't
able to recreate the problem prior to this fix, nor was I able to trigger the
failure after (neither of which I suppose is suprising).  Given the trace above
however, I think its likely that this is what we hit.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: davej@redhat.com
CC: davej@redhat.com
CC: "David S. Miller" <davem@davemloft.net>
CC: Vlad Yasevich <vyasevich@gmail.com>
CC: Sridhar Samudrala <sri@us.ibm.com>
CC: linux-sctp@vger.kernel.org
---
 net/sctp/input.c  |    7 ++-----
 net/sctp/socket.c |    8 +++++++-
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/sctp/input.c b/net/sctp/input.c
index f050d45..05994374 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -752,15 +752,12 @@ static void __sctp_unhash_endpoint(struct sctp_endpoint *ep)
 
 	epb = &ep->base;
 
-	if (hlist_unhashed(&epb->node))
-		return;
-
 	epb->hashent = sctp_ep_hashfn(epb->bind_addr.port);
 
 	head = &sctp_ep_hashtable[epb->hashent];
 
 	sctp_write_lock(&head->lock);
-	__hlist_del(&epb->node);
+	hlist_del_init(&epb->node);
 	sctp_write_unlock(&head->lock);
 }
 
@@ -841,7 +838,7 @@ static void __sctp_unhash_established(struct sctp_association *asoc)
 	head = &sctp_assoc_hashtable[epb->hashent];
 
 	sctp_write_lock(&head->lock);
-	__hlist_del(&epb->node);
+	hlist_del_init(&epb->node);
 	sctp_write_unlock(&head->lock);
 }
 
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index b3b8a8d..d740db4 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1231,8 +1231,14 @@ out_free:
 	SCTP_DEBUG_PRINTK("About to exit __sctp_connect() free asoc: %p"
 			  " kaddrs: %p err: %d\n",
 			  asoc, kaddrs, err);
-	if (asoc)
+	if (asoc) {
+		/* sctp_primitive_ASSOCIATE may have added this association
+		 * To the hash table, try to unhash it, just in case, its a noop
+		 * if it wasn't hashed so we're safe
+		 */
+		sctp_unhash_established(asoc);
 		sctp_association_free(asoc);
+	}
 	return err;
 }
 
-- 
1.7.7.6

^ permalink raw reply related

* connecting two or more hotspots at the same time
From: Akhil pm @ 2012-07-16 18:47 UTC (permalink / raw)
  To: netdev

Hi,

I would like to implement a platform to connect two or more wireless
hotspots to a computer.
But when i connected a wired and wireless connection simultaneously on
my laptop either one works at time.what i want to know is

1.is it possible to run two or more connections simultaneously
receiving packets?
2.If not possible then switching the connection between the hotspots
will give more data rate than using a single connection (for example
if i was downloading a file and at the same time i was browsing too,in
that case if i assign two separate  hotspots for downloading and
browsing is it give more output that doing both with a single
connection)?

please do reply

^ permalink raw reply

* Re: [PATCH] sctp: Fix list corruption resulting from freeing an association on a list
From: Neil Horman @ 2012-07-16 18:49 UTC (permalink / raw)
  To: netdev; +Cc: davej, David S. Miller, Vlad Yasevich, Sridhar Samudrala,
	linux-sctp
In-Reply-To: <1342463970-7457-1-git-send-email-nhorman@tuxdriver.com>

On Mon, Jul 16, 2012 at 02:39:30PM -0400, Neil Horman wrote:
> A few days ago Dave Jones reported this oops:
> 
> [22766.294255] general protection fault: 0000 [#1] PREEMPT SMP
> [22766.295376] CPU 0
> [22766.295384] Modules linked in:
> [22766.387137]  ffffffffa169f292 6b6b6b6b6b6b6b6b ffff880147c03a90
> ffff880147c03a74
> [22766.387135] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000
> [22766.387136] Process trinity-watchdo (pid: 10896, threadinfo ffff88013e7d2000,
> [22766.387137] Stack:
> [22766.387140]  ffff880147c03a10
> [22766.387140]  ffffffffa169f2b6
> [22766.387140]  ffff88013ed95728
> [22766.387143]  0000000000000002
> [22766.387143]  0000000000000000
> [22766.387143]  ffff880003fad062
> [22766.387144]  ffff88013c120000
> [22766.387144]
> [22766.387145] Call Trace:
> [22766.387145]  <IRQ>
> [22766.387150]  [<ffffffffa169f292>] ? __sctp_lookup_association+0x62/0xd0
> [sctp]
> [22766.387154]  [<ffffffffa169f2b6>] __sctp_lookup_association+0x86/0xd0 [sctp]
> [22766.387157]  [<ffffffffa169f597>] sctp_rcv+0x207/0xbb0 [sctp]
> [22766.387161]  [<ffffffff810d4da8>] ? trace_hardirqs_off_caller+0x28/0xd0
> [22766.387163]  [<ffffffff815827e3>] ? nf_hook_slow+0x133/0x210
> [22766.387166]  [<ffffffff815902fc>] ? ip_local_deliver_finish+0x4c/0x4c0
> [22766.387168]  [<ffffffff8159043d>] ip_local_deliver_finish+0x18d/0x4c0
> [22766.387169]  [<ffffffff815902fc>] ? ip_local_deliver_finish+0x4c/0x4c0
> [22766.387171]  [<ffffffff81590a07>] ip_local_deliver+0x47/0x80
> [22766.387172]  [<ffffffff8158fd80>] ip_rcv_finish+0x150/0x680
> [22766.387174]  [<ffffffff81590c54>] ip_rcv+0x214/0x320
> [22766.387176]  [<ffffffff81558c07>] __netif_receive_skb+0x7b7/0x910
> [22766.387178]  [<ffffffff8155856c>] ? __netif_receive_skb+0x11c/0x910
> [22766.387180]  [<ffffffff810d423e>] ? put_lock_stats.isra.25+0xe/0x40
> [22766.387182]  [<ffffffff81558f83>] netif_receive_skb+0x23/0x1f0
> [22766.387183]  [<ffffffff815596a9>] ? dev_gro_receive+0x139/0x440
> [22766.387185]  [<ffffffff81559280>] napi_skb_finish+0x70/0xa0
> [22766.387187]  [<ffffffff81559cb5>] napi_gro_receive+0xf5/0x130
> [22766.387218]  [<ffffffffa01c4679>] e1000_receive_skb+0x59/0x70 [e1000e]
> [22766.387242]  [<ffffffffa01c5aab>] e1000_clean_rx_irq+0x28b/0x460 [e1000e]
> [22766.387266]  [<ffffffffa01c9c18>] e1000e_poll+0x78/0x430 [e1000e]
> [22766.387268]  [<ffffffff81559fea>] net_rx_action+0x1aa/0x3d0
> [22766.387270]  [<ffffffff810a495f>] ? account_system_vtime+0x10f/0x130
> [22766.387273]  [<ffffffff810734d0>] __do_softirq+0xe0/0x420
> [22766.387275]  [<ffffffff8169826c>] call_softirq+0x1c/0x30
> [22766.387278]  [<ffffffff8101db15>] do_softirq+0xd5/0x110
> [22766.387279]  [<ffffffff81073bc5>] irq_exit+0xd5/0xe0
> [22766.387281]  [<ffffffff81698b03>] do_IRQ+0x63/0xd0
> [22766.387283]  [<ffffffff8168ee2f>] common_interrupt+0x6f/0x6f
> [22766.387283]  <EOI>
> [22766.387284]
> [22766.387285]  [<ffffffff8168eed9>] ? retint_swapgs+0x13/0x1b
> [22766.387285] Code: c0 90 5d c3 66 0f 1f 44 00 00 4c 89 c8 5d c3 0f 1f 00 55 48
> 89 e5 48 83
> ec 20 48 89 5d e8 4c 89 65 f0 4c 89 6d f8 66 66 66 66 90 <0f> b7 87 98 00 00 00
> 48 89 fb
> 49 89 f5 66 c1 c0 08 66 39 46 02
> [22766.387307]
> [22766.387307] RIP
> [22766.387311]  [<ffffffffa168a2c9>] sctp_assoc_is_match+0x19/0x90 [sctp]
> [22766.387311]  RSP <ffff880147c039b0>
> [22766.387142]  ffffffffa16ab120
> [22766.599537] ---[ end trace 3f6dae82e37b17f5 ]---
> [22766.601221] Kernel panic - not syncing: Fatal exception in interrupt
> 
> It appears from his analysis and some staring at the code that this is likely
> occuring because an association is getting freed while still on the
> sctp_assoc_hashtable.  As a result, we get a gpf when traversing the hashtable
> while a freed node corrupts part of the list.
> 
> Nominally I would think that an mibalanced refcount was responsible for this,
> but I can't seem to find any obvious imbalance.  What I did note however was
> that the two places where we create an association using
> sctp_primitive_ASSOCIATE (__sctp_connect and sctp_sendmsg), have failure paths
> which free a newly created association after calling sctp_primitive_ASSOCIATE.
> sctp_primitive_ASSOCIATE brings us into the sctp_sf_do_prm_asoc path, which
> issues a SCTP_CMD_NEW_ASOC side effect, which in turn adds a new association to
> the aforementioned hash table.  the sctp command interpreter that process side
> effects has not way to unwind previously processed commands, so freeing the
> association from the __sctp_connect or sctp_sendmsg error path would lead to a
> freed association remaining on this hash table.
> 
> I've fixed this but modifying sctp_[un]hash_established to use hlist_del_init,
> which allows us to proerly use hlist_unhashed to check if the node is on a
> hashlist safely during a delete.  That in turn alows us to safely call
> sctp_unhash_established in the __sctp_connect and sctp_sendmsg error paths
> before freeing them, regardles of what the associations state is on the hash
> list.
> 
> I noted, while I was doing this, that the __sctp_unhash_endpoint was using
> hlist_unhsashed in a simmilar fashion, but never nullified any removed nodes
> pointers to make that function work properly, so I fixed that up in a simmilar
> fashion.
> 
> I attempted to test this using a virtual guest running the SCTP_RR test from
> netperf in a loop while running the trinity fuzzer, both in a loop.  I wasn't
> able to recreate the problem prior to this fix, nor was I able to trigger the
> failure after (neither of which I suppose is suprising).  Given the trace above
> however, I think its likely that this is what we hit.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> Reported-by: davej@redhat.com
> CC: davej@redhat.com
> CC: "David S. Miller" <davem@davemloft.net>
> CC: Vlad Yasevich <vyasevich@gmail.com>
> CC: Sridhar Samudrala <sri@us.ibm.com>
> CC: linux-sctp@vger.kernel.org
> ---
>  net/sctp/input.c  |    7 ++-----
>  net/sctp/socket.c |    8 +++++++-
>  2 files changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/net/sctp/input.c b/net/sctp/input.c
> index f050d45..05994374 100644
> --- a/net/sctp/input.c
> +++ b/net/sctp/input.c
> @@ -752,15 +752,12 @@ static void __sctp_unhash_endpoint(struct sctp_endpoint *ep)
>  
>  	epb = &ep->base;
>  
> -	if (hlist_unhashed(&epb->node))
> -		return;
> -
>  	epb->hashent = sctp_ep_hashfn(epb->bind_addr.port);
>  
>  	head = &sctp_ep_hashtable[epb->hashent];
>  
>  	sctp_write_lock(&head->lock);
> -	__hlist_del(&epb->node);
> +	hlist_del_init(&epb->node);
>  	sctp_write_unlock(&head->lock);
>  }
>  
> @@ -841,7 +838,7 @@ static void __sctp_unhash_established(struct sctp_association *asoc)
>  	head = &sctp_assoc_hashtable[epb->hashent];
>  
>  	sctp_write_lock(&head->lock);
> -	__hlist_del(&epb->node);
> +	hlist_del_init(&epb->node);
>  	sctp_write_unlock(&head->lock);
>  }
>  
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index b3b8a8d..d740db4 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -1231,8 +1231,14 @@ out_free:
>  	SCTP_DEBUG_PRINTK("About to exit __sctp_connect() free asoc: %p"
>  			  " kaddrs: %p err: %d\n",
>  			  asoc, kaddrs, err);
> -	if (asoc)
> +	if (asoc) {
> +		/* sctp_primitive_ASSOCIATE may have added this association
> +		 * To the hash table, try to unhash it, just in case, its a noop
> +		 * if it wasn't hashed so we're safe
> +		 */
> +		sctp_unhash_established(asoc);
>  		sctp_association_free(asoc);
> +	}
>  	return err;
>  }
>  
> -- 
> 1.7.7.6
> 
> 


Damn, self, nak, I missed comitting the bits for __sctp_connect.  I'll repost in
a bit. Sorry.
Neil

^ permalink raw reply

* Re: [PATCH] mlx4_en: map entire pages to increase throughput
From: Thadeu Lima de Souza Cascardo @ 2012-07-16 19:06 UTC (permalink / raw)
  To: Rick Jones
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	yevgenyp@mellanox.co.il, ogerlitz@mellanox.com,
	amirv@mellanox.com, brking@linux.vnet.ibm.com,
	leitao@linux.vnet.ibm.com, klebers@linux.vnet.ibm.com,
	linuxppc-dev, anton
In-Reply-To: <50044F1D.6000703@hp.com>

On Mon, Jul 16, 2012 at 10:27:57AM -0700, Rick Jones wrote:
> On 07/16/2012 10:01 AM, Thadeu Lima de Souza Cascardo wrote:
> >In its receive path, mlx4_en driver maps each page chunk that it pushes
> >to the hardware and unmaps it when pushing it up the stack. This limits
> >throughput to about 3Gbps on a Power7 8-core machine.
> 
> That seems rather extraordinarily low - Power7 is supposed to be a
> rather high performance CPU.  The last time I noticed O(3Gbit/s) on
> 10G for bulk transfer was before the advent of LRO/GRO - that was in
> the x86 space though.  Is mapping really that expensive with Power7?
> 

Copying linuxppc-dev and Anton here. But I can tell you that we have
lock contention when doing the mapping on the same adapter (map table
per device). Anton has sent some patches that improves that *a lot*.

However, for 1500 MTU, mlx4_en was doing two unmaps and two maps per
packet. The problem is not the CPU power needed to do the mappings, but
that we find the lock contention and end up with the CPUs more than 30%
of the time spent on spin locking.

> 
> >One solution is to map the entire allocated page at once. However, this
> >requires that we keep track of every page fragment we give to a
> >descriptor. We also need to work with the discipline that all fragments will
> >be released (in the sense that it will not be reused by the driver
> >anymore) in the order they are allocated to the driver.
> >
> >This requires that we don't reuse any fragments, every single one of
> >them must be reallocated. We do that by releasing all the fragments that
> >are processed and only after finished processing the descriptors, we
> >start the refill.
> >
> >We also must somehow guarantee that we either refill all fragments in a
> >descriptor or none at all, without resorting to giving up a page
> >fragment that we would have already given. Otherwise, we would break the
> >discipline of only releasing the fragments in the order they were
> >allocated.
> >
> >This has passed page allocation fault injections (restricted to the
> >driver by using required-start and required-end) and device hotplug
> >while 16 TCP streams were able to deliver more than 9Gbps.
> 
> What is the effect on packet-per-second performance?  (eg aggregate,
> burst-mode netperf TCP_RR with TCP_NODELAY set or perhaps UDP_RR)
> 

I used uperf with TCP_NODELAY and 16 threads sending from another
machine 64000-sized writes for 60 seconds.

I get 5898op/s (3.02Gb/s) without the patch against 18022ops/s
(9.23Gb/s) with the patch.

Best regards.
Cascardo.


> rick jones
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* [PATCH v2] sctp: Fix list corruption resulting from freeing an association on a list
From: Neil Horman @ 2012-07-16 19:13 UTC (permalink / raw)
  To: netdev
  Cc: Neil Horman, davej, David S. Miller, Vlad Yasevich,
	Sridhar Samudrala, linux-sctp
In-Reply-To: <1342463970-7457-1-git-send-email-nhorman@tuxdriver.com>

A few days ago Dave Jones reported this oops:

[22766.294255] general protection fault: 0000 [#1] PREEMPT SMP
[22766.295376] CPU 0
[22766.295384] Modules linked in:
[22766.387137]  ffffffffa169f292 6b6b6b6b6b6b6b6b ffff880147c03a90
ffff880147c03a74
[22766.387135] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000
[22766.387136] Process trinity-watchdo (pid: 10896, threadinfo ffff88013e7d2000,
[22766.387137] Stack:
[22766.387140]  ffff880147c03a10
[22766.387140]  ffffffffa169f2b6
[22766.387140]  ffff88013ed95728
[22766.387143]  0000000000000002
[22766.387143]  0000000000000000
[22766.387143]  ffff880003fad062
[22766.387144]  ffff88013c120000
[22766.387144]
[22766.387145] Call Trace:
[22766.387145]  <IRQ>
[22766.387150]  [<ffffffffa169f292>] ? __sctp_lookup_association+0x62/0xd0
[sctp]
[22766.387154]  [<ffffffffa169f2b6>] __sctp_lookup_association+0x86/0xd0 [sctp]
[22766.387157]  [<ffffffffa169f597>] sctp_rcv+0x207/0xbb0 [sctp]
[22766.387161]  [<ffffffff810d4da8>] ? trace_hardirqs_off_caller+0x28/0xd0
[22766.387163]  [<ffffffff815827e3>] ? nf_hook_slow+0x133/0x210
[22766.387166]  [<ffffffff815902fc>] ? ip_local_deliver_finish+0x4c/0x4c0
[22766.387168]  [<ffffffff8159043d>] ip_local_deliver_finish+0x18d/0x4c0
[22766.387169]  [<ffffffff815902fc>] ? ip_local_deliver_finish+0x4c/0x4c0
[22766.387171]  [<ffffffff81590a07>] ip_local_deliver+0x47/0x80
[22766.387172]  [<ffffffff8158fd80>] ip_rcv_finish+0x150/0x680
[22766.387174]  [<ffffffff81590c54>] ip_rcv+0x214/0x320
[22766.387176]  [<ffffffff81558c07>] __netif_receive_skb+0x7b7/0x910
[22766.387178]  [<ffffffff8155856c>] ? __netif_receive_skb+0x11c/0x910
[22766.387180]  [<ffffffff810d423e>] ? put_lock_stats.isra.25+0xe/0x40
[22766.387182]  [<ffffffff81558f83>] netif_receive_skb+0x23/0x1f0
[22766.387183]  [<ffffffff815596a9>] ? dev_gro_receive+0x139/0x440
[22766.387185]  [<ffffffff81559280>] napi_skb_finish+0x70/0xa0
[22766.387187]  [<ffffffff81559cb5>] napi_gro_receive+0xf5/0x130
[22766.387218]  [<ffffffffa01c4679>] e1000_receive_skb+0x59/0x70 [e1000e]
[22766.387242]  [<ffffffffa01c5aab>] e1000_clean_rx_irq+0x28b/0x460 [e1000e]
[22766.387266]  [<ffffffffa01c9c18>] e1000e_poll+0x78/0x430 [e1000e]
[22766.387268]  [<ffffffff81559fea>] net_rx_action+0x1aa/0x3d0
[22766.387270]  [<ffffffff810a495f>] ? account_system_vtime+0x10f/0x130
[22766.387273]  [<ffffffff810734d0>] __do_softirq+0xe0/0x420
[22766.387275]  [<ffffffff8169826c>] call_softirq+0x1c/0x30
[22766.387278]  [<ffffffff8101db15>] do_softirq+0xd5/0x110
[22766.387279]  [<ffffffff81073bc5>] irq_exit+0xd5/0xe0
[22766.387281]  [<ffffffff81698b03>] do_IRQ+0x63/0xd0
[22766.387283]  [<ffffffff8168ee2f>] common_interrupt+0x6f/0x6f
[22766.387283]  <EOI>
[22766.387284]
[22766.387285]  [<ffffffff8168eed9>] ? retint_swapgs+0x13/0x1b
[22766.387285] Code: c0 90 5d c3 66 0f 1f 44 00 00 4c 89 c8 5d c3 0f 1f 00 55 48
89 e5 48 83
ec 20 48 89 5d e8 4c 89 65 f0 4c 89 6d f8 66 66 66 66 90 <0f> b7 87 98 00 00 00
48 89 fb
49 89 f5 66 c1 c0 08 66 39 46 02
[22766.387307]
[22766.387307] RIP
[22766.387311]  [<ffffffffa168a2c9>] sctp_assoc_is_match+0x19/0x90 [sctp]
[22766.387311]  RSP <ffff880147c039b0>
[22766.387142]  ffffffffa16ab120
[22766.599537] ---[ end trace 3f6dae82e37b17f5 ]---
[22766.601221] Kernel panic - not syncing: Fatal exception in interrupt

It appears from his analysis and some staring at the code that this is likely
occuring because an association is getting freed while still on the
sctp_assoc_hashtable.  As a result, we get a gpf when traversing the hashtable
while a freed node corrupts part of the list.

Nominally I would think that an mibalanced refcount was responsible for this,
but I can't seem to find any obvious imbalance.  What I did note however was
that the two places where we create an association using
sctp_primitive_ASSOCIATE (__sctp_connect and sctp_sendmsg), have failure paths
which free a newly created association after calling sctp_primitive_ASSOCIATE.
sctp_primitive_ASSOCIATE brings us into the sctp_sf_do_prm_asoc path, which
issues a SCTP_CMD_NEW_ASOC side effect, which in turn adds a new association to
the aforementioned hash table.  the sctp command interpreter that process side
effects has not way to unwind previously processed commands, so freeing the
association from the __sctp_connect or sctp_sendmsg error path would lead to a
freed association remaining on this hash table.

I've fixed this but modifying sctp_[un]hash_established to use hlist_del_init,
which allows us to proerly use hlist_unhashed to check if the node is on a
hashlist safely during a delete.  That in turn alows us to safely call
sctp_unhash_established in the __sctp_connect and sctp_sendmsg error paths
before freeing them, regardles of what the associations state is on the hash
list.

I noted, while I was doing this, that the __sctp_unhash_endpoint was using
hlist_unhsashed in a simmilar fashion, but never nullified any removed nodes
pointers to make that function work properly, so I fixed that up in a simmilar
fashion.

I attempted to test this using a virtual guest running the SCTP_RR test from
netperf in a loop while running the trinity fuzzer, both in a loop.  I wasn't
able to recreate the problem prior to this fix, nor was I able to trigger the
failure after (neither of which I suppose is suprising).  Given the trace above
however, I think its likely that this is what we hit.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: davej@redhat.com
CC: davej@redhat.com
CC: "David S. Miller" <davem@davemloft.net>
CC: Vlad Yasevich <vyasevich@gmail.com>
CC: Sridhar Samudrala <sri@us.ibm.com>
CC: linux-sctp@vger.kernel.org

---
Change notes

V2) Added missing bits to sctp_sendmsg that I neglected in my last post
---
 net/sctp/input.c  |    7 ++-----
 net/sctp/socket.c |   12 ++++++++++--
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/net/sctp/input.c b/net/sctp/input.c
index f050d45..05994374 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -752,15 +752,12 @@ static void __sctp_unhash_endpoint(struct sctp_endpoint *ep)
 
 	epb = &ep->base;
 
-	if (hlist_unhashed(&epb->node))
-		return;
-
 	epb->hashent = sctp_ep_hashfn(epb->bind_addr.port);
 
 	head = &sctp_ep_hashtable[epb->hashent];
 
 	sctp_write_lock(&head->lock);
-	__hlist_del(&epb->node);
+	hlist_del_init(&epb->node);
 	sctp_write_unlock(&head->lock);
 }
 
@@ -841,7 +838,7 @@ static void __sctp_unhash_established(struct sctp_association *asoc)
 	head = &sctp_assoc_hashtable[epb->hashent];
 
 	sctp_write_lock(&head->lock);
-	__hlist_del(&epb->node);
+	hlist_del_init(&epb->node);
 	sctp_write_unlock(&head->lock);
 }
 
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index b3b8a8d..31c7bfc 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1231,8 +1231,14 @@ out_free:
 	SCTP_DEBUG_PRINTK("About to exit __sctp_connect() free asoc: %p"
 			  " kaddrs: %p err: %d\n",
 			  asoc, kaddrs, err);
-	if (asoc)
+	if (asoc) {
+		/* sctp_primitive_ASSOCIATE may have added this association
+		 * To the hash table, try to unhash it, just in case, its a noop
+		 * if it wasn't hashed so we're safe
+		 */
+		sctp_unhash_established(asoc);
 		sctp_association_free(asoc);
+	}
 	return err;
 }
 
@@ -1942,8 +1948,10 @@ SCTP_STATIC int sctp_sendmsg(struct kiocb *iocb, struct sock *sk,
 	goto out_unlock;
 
 out_free:
-	if (new_asoc)
+	if (new_asoc) {
+		sctp_unhash_established(asoc);
 		sctp_association_free(asoc);
+	}
 out_unlock:
 	sctp_release_sock(sk);
 
-- 
1.7.7.6

^ permalink raw reply related

* Re: [RFC 1/2] PCI-Express Non-Transparent Bridge Support
From: chetan loke @ 2012-07-16 19:27 UTC (permalink / raw)
  To: Jon Mason; +Cc: linux-kernel, netdev, linux-pci, Dave Jiang
In-Reply-To: <20120716183816.GH9598@jonmason-lab>

On Mon, Jul 16, 2012 at 2:38 PM, Jon Mason <jon.mason@intel.com> wrote:
> On Mon, Jul 16, 2012 at 12:49:39PM -0400, chetan loke wrote:

....

>> Is it ok to rename the following vars for convenience sake?
>>
>> > +       struct list_head txq;
>> tx_pend_q - (pending_queue) or tx_out_q - (outstanding_queue) - or
>> pick any new string you like - other than a mono-syllable definition
>>
>> > +       struct list_head txc;
>> tx_compl_q - completion queue
>>
>> > +       struct list_head txe;
>> tx_avail_e - available entry queue
>>
>>
>> > +       spinlock_t txq_lock;
>> > +       spinlock_t txc_lock;
>> > +       spinlock_t txe_lock;
>>
>> then match the variants accordingly.
>>
>> > +       struct list_head rxq;
>> > +       struct list_head rxc;
>> > +       struct list_head rxe;
>> > +       spinlock_t rxq_lock;
>> > +       spinlock_t rxc_lock;
>> > +       spinlock_t rxe_lock;
>>
>> similarly the rx-counterpart
>
> Are they difficult to understand?  I can change them, but it seems rather moot since you seemed to immediately understand the logic behind the names.
>

Immediately understand? Not at first glance. I had to re-read the
functions. Its really is a minor change and variables will then become
self-explanatory. I can almost feel that a developer who works on this
code for the first time might end up choosing the wrong 'syllable' and
locking an entirely different lock.

Infact add a prefix 'ntb' to all the rx/tx locks. That way grep'ing
output of lockstat also becomes easier.

It now reads: ntb_tx_pend_lock

>>
>>
>> ..................
>>
>> > +static void ntb_tx_copy_task(struct ntb_transport_qp *qp,
>> > +                            struct ntb_queue_entry *entry,
>> > +                            void *offset)
>> > +{
>> > +       struct ntb_payload_header *hdr = offset;
>> > +       int rc;
>> > +
>> > +       offset += sizeof(struct ntb_payload_header);
>> > +       memcpy_toio(offset, entry->buf, entry->len);
>> > +
>> > +       hdr->len = entry->len;
>> > +       hdr->ver = qp->tx_pkts;
>> > +
>> > +       /* Ensure that the data is fully copied out before setting the flag */
>> > +       wmb();
>> > +       hdr->flags = entry->flags | DESC_DONE_FLAG;
>> > +
>> > +       rc = ntb_ring_sdb(qp->ndev, qp->qp_num);
>> > +       if (rc)
>> > +               pr_err("%s: error ringing db %d\n", __func__, qp->qp_num);
>> > +
>> > +       if (entry->len > 0) {
>>
>> how do you interpret this len variable and decide if it's a good/bad completion?
>
> The length of 0 is for messages from the remote system, which currently only consists of a "link down" message notifying the local system to no longer send any messages to the remote side.
>

May be I didn't read the code properly. Is there a length-comment that
explains the above? If not then just by pure code inspection it would
seem that a skb was leaked. So add the above comment that way we can
save time for other netdev folks too.


>>
>> > +               qp->tx_bytes += entry->len;
>> > +
>> > +               /* Add fully transmitted data to completion queue */
>> > +               ntb_list_add_tail(&qp->txc_lock, &entry->entry, &qp->txc);
>> > +
>> > +               if (qp->tx_handler)
>> > +                       qp->tx_handler(qp);
>> > +       } else
>> > +               ntb_list_add_tail(&qp->txe_lock, &entry->entry, &qp->txe);
>>
>> I could be wrong but how is the original skb handled if the code path
>> goes in the else clause?
>
> There is no skb is the length is zero.  Since the only client is virtual ethernet, it will always be > 60.  However, I should add a sanity check for 0 length in tx_enqueue.
>
>> Also, in the else clause, how about a ntb_list_add_head rather than a _tail.
>
> Why add to the head, it was just used?

Yes, just re-use what's hot(best guess).

>
>> > +
>> > +static int ntb_process_tx(struct ntb_transport_qp *qp,
>> > +                         struct ntb_queue_entry *entry)
>> > +{
>> > +       struct ntb_payload_header *hdr;
>> > +       void *offset;
>> > +
>> > +       offset = qp->tx_offset;
>> > +       hdr = offset;
>> > +
>> > +       pr_debug("%lld - offset %p, tx %p, entry len %d flags %x buff %p\n",
>> > +                qp->tx_pkts, offset, qp->tx_offset, entry->len, entry->flags,
>> > +                entry->buf);
>> > +       if (hdr->flags) {
>> > +               ntb_list_add_head(&qp->txq_lock, &entry->entry, &qp->txq);
>> > +               qp->tx_ring_full++;
>> > +               return -EAGAIN;
>> > +       }
>> > +
>> > +       if (entry->len > transport_mtu) {
>> > +               pr_err("Trying to send pkt size of %d\n", entry->len);
>> > +               entry->flags = HW_ERROR_FLAG;
>> > +
>> > +               ntb_list_add_tail(&qp->txc_lock, &entry->entry, &qp->txc);
>> > +
>> > +               if (qp->tx_handler)
>> > +                       qp->tx_handler(qp);
>> > +
>> > +               return 0;
>> > +       }
>> > +
>> > +       ntb_tx_copy_task(qp, entry, offset);
>>
>> what happens when ntb_sdb_ring returns an error? would you still want
>> to increment tx_pkts below?
>
> It's not fatal if the remote side isn't notified.  It will hurt latency, since the next packet would be the one that triggers the next interrupt.  Also, the only way it could ever fail would be if it was an invalid interrupt bit being set, which should never happen.
>

What happens when the 'db >= ndev->max_cbs' check fails? Under what
circumstances will that happen? When it does happen how does the
remote side then gets notified or should it even get notified?

'which should never happen'? FYI - I have seen and debugged(not this
one but doorbells and what not) weirdness while working on CLARiiON +
PCie-interconnects. Board bring-up is a PITA. So you get the idea ...

>>
>> > +void *ntb_transport_tx_dequeue(struct ntb_transport_qp *qp, unsigned int *len)
>> > +{
>> > +       struct ntb_queue_entry *entry;
>> > +       void *buf;
>> > +
>> > +       if (!qp)
>> > +               return NULL;
>> > +
>> > +       entry = ntb_list_rm_head(&qp->txc_lock, &qp->txc);
>> > +       if (!entry)
>> > +               return NULL;
>> > +
>> > +       buf = entry->callback_data;
>> > +       if (entry->flags != HW_ERROR_FLAG)
>> > +               *len = entry->len;
>> > +       else
>> > +               *len = -EIO;
>> > +
>> > +       ntb_list_add_tail(&qp->txe_lock, &entry->entry, &qp->txe);
>>
>> how about a ntb_list_add_head?
>
> Is there a benefit to adding to the head versus tail?  It makes more sense to me to pull from the head and add to the tail.
>

Yes, explained above. Today there are 100(..DEF_NUM...) entries.
Tomorrow there could be more. So why not re-use what's hot? You could
also think of this as yet another way of forcing detection of
double-use corruption/bug. I'm not saying that there's a bug here but
you get the idea.

Chetan Loke

^ permalink raw reply

* [PULL net-next] tipc: kill off struct print_buf/log
From: Paul Gortmaker @ 2012-07-16 19:38 UTC (permalink / raw)
  To: davem; +Cc: netdev, jon.maloy, erik.hugne

Dave,

This is the same eight commits as sent for review last week[1],
with just the incorporation of the pr_fmt change as suggested
by JoeP.  There was no additional change requests, so unless you
can see something else you'd like me to change, please pull.

Thanks,
Paul.

[1] http://www.spinics.net/lists/netdev/msg204448.html

---

The following changes since commit 48ee3569f31d91084dc694fef5517eb782428083:

  ipv6: Move ipv6 twsk accessors outside of CONFIG_IPV6 ifdefs. (2012-07-11 02:39:24 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux.git tipc_net-next

for you to fetch changes up to 869dd4662f90514cb92b44a389e85c737b464e25:

  tipc: remove print_buf and deprecated log buffer code (2012-07-13 19:34:43 -0400)

----------------------------------------------------------------
Erik Hugne (5):
      tipc: use standard printk shortcut macros (pr_err etc.)
      tipc: remove TIPC packet debugging functions and macros
      tipc: simplify print buffer handling in tipc_printf
      tipc: phase out most of the struct print_buf usage
      tipc: remove print_buf and deprecated log buffer code

Paul Gortmaker (3):
      tipc: factor stats struct out of the larger link struct
      tipc: limit error messages relating to memory leak to one line
      tipc: simplify link_print by divorcing it from using tipc_printf

 include/linux/tipc_config.h |    4 +-
 net/tipc/Kconfig            |   25 ----
 net/tipc/bcast.c            |   65 +++++----
 net/tipc/bearer.c           |   62 +++++----
 net/tipc/bearer.h           |    2 +-
 net/tipc/config.c           |   41 +++---
 net/tipc/core.c             |   18 +--
 net/tipc/core.h             |   65 +--------
 net/tipc/discover.c         |   10 +-
 net/tipc/handler.c          |    4 +-
 net/tipc/link.c             |  304 +++++++++++++++++++------------------------
 net/tipc/link.h             |   63 ++++-----
 net/tipc/log.c              |  302 ++----------------------------------------
 net/tipc/log.h              |   66 ----------
 net/tipc/msg.c              |  242 ----------------------------------
 net/tipc/name_distr.c       |   25 ++--
 net/tipc/name_table.c       |  132 ++++++++++---------
 net/tipc/net.c              |    8 +-
 net/tipc/netlink.c          |    2 +-
 net/tipc/node.c             |   22 ++--
 net/tipc/node_subscr.c      |    3 +-
 net/tipc/port.c             |   66 +++++-----
 net/tipc/ref.c              |   10 +-
 net/tipc/socket.c           |   10 +-
 net/tipc/subscr.c           |   14 +-
 25 files changed, 439 insertions(+), 1126 deletions(-)
 delete mode 100644 net/tipc/log.h

^ permalink raw reply

* Re: [PATCH] mlx4_en: map entire pages to increase throughput
From: Rick Jones @ 2012-07-16 19:42 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	yevgenyp@mellanox.co.il, ogerlitz@mellanox.com,
	amirv@mellanox.com, brking@linux.vnet.ibm.com,
	leitao@linux.vnet.ibm.com, klebers@linux.vnet.ibm.com,
	linuxppc-dev@lists.ozlabs.org, anton@samba.org
In-Reply-To: <20120716190611.GA1023@oc1711230544.ibm.com>

On 07/16/2012 12:06 PM, Thadeu Lima de Souza Cascardo wrote:
> On Mon, Jul 16, 2012 at 10:27:57AM -0700, Rick Jones wrote:
>
>> What is the effect on packet-per-second performance?  (eg aggregate,
>> burst-mode netperf TCP_RR with TCP_NODELAY set or perhaps UDP_RR)
>>
> I used uperf with TCP_NODELAY and 16 threads sending from another
> machine 64000-sized writes for 60 seconds.
>
> I get 5898op/s (3.02Gb/s) without the patch against 18022ops/s
> (9.23Gb/s) with the patch.

I was thinking more along the lines of an additional comparison, 
explicitly using netperf TCP_RR or something like it, not just the 
packets per second from a bulk transfer test.

rick

^ permalink raw reply

* Re: [PATCHv1] ethtool: added support for 40G link.
From: Ben Hutchings @ 2012-07-16 19:46 UTC (permalink / raw)
  To: Parav Pandit; +Cc: netdev
In-Reply-To: <847203bd-a11e-41c6-b451-abdeced5c5bf@exht1.ad.emulex.com>

On Wed, 2012-06-27 at 19:26 +0530, Parav Pandit wrote:
> 1. defined values for KR4, CR4, SR4, LR4 PHY.
> 
> Signed-off-by: Parav Pandit <parav.pandit@emulex.com>
[...]

Applied, thanks.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* [PATCH 0/4] pch_gbe: avoiding transmit timeouts (rev 2)
From: Andy Cress @ 2012-07-16 20:02 UTC (permalink / raw)
  To: netdev


When the interface is stressed with 6 VLANs, some transmit timeout stats
were 
observed, which is a potential precursor to the more severe netdev
watchdog 
timeout oops.  Also we saw more than the expected number of
transmit restarts, which impacted performance.   The following patches
were applied and resolved the symptom of the transmit timeout stats, and

reduced the number of transmit restarts.  

This patch set includes the following patches:
0001-pch_gbe-Fix-the-checksum-fill-to-the-error-location.patch
0002-pch_gbe-fix-transmit-watchdog-timeout.patch
0003-pch_gbe-add-extra-clean-tx.patch  (includes bumping the version to
1.01) 
0004-pch_gbe-vlan-skb-len-fix.patch

This rev2 has the following changes:
0001: tested w skb_checksum_start_offset, but cannot use it here, added
comment
0004: delete transmit length error check as unnecessary

The resulting pch_gbe 1.01 driver has been tested on Kontron Tunnel
Creek 
EG20T modules and Intel Crown Bay EG20T modules, so I believe that these
are 
appropriate for consideration in the upstream pch_gbe driver.


Please review and comment.

Thanks,
Andy

^ permalink raw reply

* [PATCH 1/4] pch_gbe: Fix the checksum fill to the error location
From: Andy Cress @ 2012-07-16 20:03 UTC (permalink / raw)
  To: netdev


Author: Zhong Hongbo <hongbo.zhong@windriver.com>

Due to some unknown hardware limitations the pch_gbe hardware cannot
calculate checksums when the length of network package is less
than 64 bytes, where we will surprisingly encounter a problem of
the destination IP incorrectly changed.

When forwarding network packages at the network layer the IP packages
won't be relayed to the upper transport layer and analyzed there,
consequently, skb->transport_header pointer will be mistakenly remained
the same as that of skb->network_header, resulting in TCP checksum
wrongly
filled into the field of destination IP in IP header.

We can fix this issue by manually calculate the offset of the TCP
checksum
 and update it accordingly.

We would normally use the skb_checksum_start_offset(skb) here, but in
this
case it is sometimes -2 (csum_start=0 - skb_headroom=2 => -2), hence the
manual calculation.

Signed-off-by: Zhong Hongbo <hongbo.zhong@windriver.com>
Merged-by: Andy Cress <andy.cress@us.kontron.com>

---
 drivers/net/pch_gbe/pch_gbe_main.c |   14 ++++++++------
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 3787c64..1642bff 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -1178,32 +1178,35 @@ static void pch_gbe_tx_queue(struct
pch_gbe_adapter *adapter,
 	/*
 	 * It is because the hardware accelerator does not support a
checksum,
 	 * when the received data size is less than 64 bytes.
+	 * Note: skb_checksum_start_offset(skb) is sometimes -2 here.
 	 */
 	if (skb->len < PCH_GBE_SHORT_PKT && skb->ip_summed !=
CHECKSUM_NONE) {
+		struct iphdr *iph = ip_hdr(skb);
 		frame_ctrl |= PCH_GBE_TXD_CTRL_APAD |
 			      PCH_GBE_TXD_CTRL_TCPIP_ACC_OFF;
 		if (skb->protocol == htons(ETH_P_IP)) {
-			struct iphdr *iph = ip_hdr(skb);
 			unsigned int offset;
-			offset = skb_transport_offset(skb);
+			offset = (unsigned char *)((u8 *)iph + iph->ihl
* 4) - skb->data;
 			if (iph->protocol == IPPROTO_TCP) {
+				struct tcphdr *tcphdr_point = (struct
tcphdr *)((u8 *)iph + iph->ihl * 4);
 				skb->csum = 0;
-				tcp_hdr(skb)->check = 0;
+				tcphdr_point->check = 0;
 				skb->csum = skb_checksum(skb, offset,
 							 skb->len -
offset, 0);
-				tcp_hdr(skb)->check =
+				tcphdr_point->check = 
 					csum_tcpudp_magic(iph->saddr,
 							  iph->daddr,
 							  skb->len -
offset,
 							  IPPROTO_TCP,
 							  skb->csum);
 			} else if (iph->protocol == IPPROTO_UDP) {
+				struct udphdr *udphdr_point = (struct
udphdr *)((u8 *)iph + iph->ihl * 4);
 				skb->csum = 0;
-				udp_hdr(skb)->check = 0;
+				udphdr_point->check = 0;
 				skb->csum =
 					skb_checksum(skb, offset,
 						     skb->len - offset,
0);
-				udp_hdr(skb)->check =
+				udphdr_point->check = 
 					csum_tcpudp_magic(iph->saddr,
 							  iph->daddr,
 							  skb->len -
offset,

^ permalink raw reply related

* Re: [ethtool PATCH] ethtool: Resolve use of uninitialized memory in rxclass_get_dev_info
From: Ben Hutchings @ 2012-07-16 20:03 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, jeffrey.t.kirsher
In-Reply-To: <20120713165221.28140.92681.stgit@gitlad.jf.intel.com>

On Fri, 2012-07-13 at 09:55 -0700, Alexander Duyck wrote:
> The ethtool function for getting the rule count was not zeroing out the
> data field before passing it to the kernel.  As a result the value started
> uninitialized and was incorrectly returning a result indicating that
> devices supported setting new rule indexes.  In order to correct this I am
> adding a one line fix that sets data to zero before we pass the command to
> the kernel.

Right.  For 'get' commands with no parameters (besides the device) the
data copied back to userland is normally zero-initialised and then
filled out by the driver, and I seem to have worked on that assumption.
But because of the odd multiplexing of RX NFC commands
ETHTOOL_GRXCLSRLCNT doesn't work like that.  And for 'my' driver that
didn't matter.  Sorry about that.

(We should really have some explicit documentation of responsibility for
structure initialisation.)

> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
> 
> I am resending this since I didn't see any notification that it had been seen.
> I also realized that I had not clearly identified that this is an ethtool user
> space patch and not an ethtool kernel space patch.

It was perfectly clear and I had queued it up to review but hadn't yet
done so.

Ben.

>  rxclass.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/rxclass.c b/rxclass.c
> index 4d49aa6..e1633a8 100644
> --- a/rxclass.c
> +++ b/rxclass.c
> @@ -207,6 +207,7 @@ static int rxclass_get_dev_info(struct cmd_context *ctx, __u32 *count,
>  	int err;
>  
>  	nfccmd.cmd = ETHTOOL_GRXCLSRLCNT;
> +	nfccmd.data = 0;
>  	err = send_ioctl(ctx, &nfccmd);
>  	*count = nfccmd.rule_cnt;
>  	if (driver_select)
> 

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* [PATCH 2/4] pch_gbe: fix transmit watchdog timeout
From: Andy Cress @ 2012-07-16 20:04 UTC (permalink / raw)
  To: netdev


Author: Andy Cress <andy.cress@us.kontron.com>

An extended ping test with 6 vlans resulted in a driver oops with a
netdev transmit timeout.
Fix WATCHDOG_TIMEOUT to be more like e1000e at 5 * HZ, to avoid
unnecessary transmit timeouts.

Signed-off-by: Andy Cress <andy.cress@us.kontron.com>

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 4c04843..a746064 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -35,7 +35,7 @@ const char pch_driver_version[] = DRV_VERSION;
 #define DSC_INIT16			0xC000
 #define PCH_GBE_DMA_ALIGN		0
 #define PCH_GBE_DMA_PADDING		2
-#define PCH_GBE_WATCHDOG_PERIOD		(1 * HZ)	/*
watchdog time */
+#define PCH_GBE_WATCHDOG_PERIOD		(5 * HZ)	/*
watchdog time */
 #define PCH_GBE_COPYBREAK_DEFAULT	256
 #define PCH_GBE_PCI_BAR			1
 #define PCH_GBE_RESERVE_MEMORY		0x200000	/* 2MB */

^ permalink raw reply related

* [PATCH 3/4] pch_gbe: add extra clean tx
From: Andy Cress @ 2012-07-16 20:04 UTC (permalink / raw)
  To: netdev

Author: Andy Cress <andy.cress@us.kontron.com>

This adds extra cleaning to the pch_gbe_clean_tx routine to avoid 
transmit timeouts on some BCM PHYs that have different timing.
Also update the DRV_VERSION to 1.01, and show it.

Signed-off-by: Andy Cress <andy.cress@us.kontron.com>

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 42e9874..2ccdca6 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -26,7 +26,7 @@
 #include <linux/ptp_classify.h>
 #endif
 
-#define DRV_VERSION     "1.00"
+#define DRV_VERSION     "1.01"
 const char pch_driver_version[] = DRV_VERSION;
 
 #define PCI_DEVICE_ID_INTEL_IOH1_GBE	0x8802		/* Pci device ID
*/
@@ -1582,7 +1582,8 @@ pch_gbe_clean_tx(struct pch_gbe_adapter *adapter,
 	struct sk_buff *skb;
 	unsigned int i;
 	unsigned int cleaned_count = 0;
-	bool cleaned = true;
+	bool cleaned = false;
+	int unused, thresh;
 
 	pr_debug("next_to_clean : %d\n", tx_ring->next_to_clean);
 
@@ -1591,10 +1592,36 @@ pch_gbe_clean_tx(struct pch_gbe_adapter
*adapter,
 	pr_debug("gbec_status:0x%04x  dma_status:0x%04x\n",
 		 tx_desc->gbec_status, tx_desc->dma_status);
 
+	unused = PCH_GBE_DESC_UNUSED(tx_ring);
+	thresh = tx_ring->count - PCH_GBE_TX_WEIGHT;
+	if ((tx_desc->gbec_status == DSC_INIT16) && (unused < thresh))
+	{  /* current marked clean, tx queue filling up, do extra clean
*/
+		int j, k;
+		if (unused < 8) {  /* tx queue nearly full */
+			pr_debug("clean_tx: transmit queue warning
(%x,%x) unused=%d\n",
+
tx_ring->next_to_clean,tx_ring->next_to_use,unused);
+		}
+	   
+		/* current marked clean, scan for more that need
cleaning. */
+		k = i;
+		for (j = 0; j < PCH_GBE_TX_WEIGHT; j++) 
+		{
+			tx_desc = PCH_GBE_TX_DESC(*tx_ring, k);
+			if (tx_desc->gbec_status != DSC_INIT16) break;
/*found*/
+			if (++k >= tx_ring->count) k = 0;  /*increment,
wrap*/
+		}
+		if (j < PCH_GBE_TX_WEIGHT) {
+			pr_debug("clean_tx: unused=%d loops=%d found
tx_desc[%x,%x:%x].gbec_status=%04x\n",
+				unused,j, i,k, tx_ring->next_to_use,
tx_desc->gbec_status);
+			i = k;  /*found one to clean, usu
gbec_status==2000.*/
+		}
+	}
+
 	while ((tx_desc->gbec_status & DSC_INIT16) == 0x0000) {
 		pr_debug("gbec_status:0x%04x\n", tx_desc->gbec_status);
 		buffer_info = &tx_ring->buffer_info[i];
 		skb = buffer_info->skb;
+		cleaned = true;
 
 		if ((tx_desc->gbec_status & PCH_GBE_TXD_GMAC_STAT_ABT))
{
 			adapter->stats.tx_aborted_errors++;
@@ -1642,18 +1669,21 @@ pch_gbe_clean_tx(struct pch_gbe_adapter
*adapter,
 	}
 	pr_debug("called pch_gbe_unmap_and_free_tx_resource() %d
count\n",
 		 cleaned_count);
-	/* Recover from running out of Tx resources in xmit_frame */
-	spin_lock(&tx_ring->tx_lock);
-	if (unlikely(cleaned && (netif_queue_stopped(adapter->netdev))))
{
-		netif_wake_queue(adapter->netdev);
-		adapter->stats.tx_restart_count++;
-		pr_debug("Tx wake queue\n");
-	}
+	if (cleaned_count > 0)  { /*skip this if nothing cleaned*/
+		/* Recover from running out of Tx resources in
xmit_frame */
+		spin_lock(&tx_ring->tx_lock);
+		if (unlikely(cleaned &&
(netif_queue_stopped(adapter->netdev))))
+		{
+			netif_wake_queue(adapter->netdev);
+			adapter->stats.tx_restart_count++;
+			pr_debug("Tx wake queue\n");
+		}
 
-	tx_ring->next_to_clean = i;
+		tx_ring->next_to_clean = i;
 
-	pr_debug("next_to_clean : %d\n", tx_ring->next_to_clean);
-	spin_unlock(&tx_ring->tx_lock);
+		pr_debug("next_to_clean : %d\n",
tx_ring->next_to_clean);
+		spin_unlock(&tx_ring->tx_lock);
+	}
 	return cleaned;
 }
 
@@ -2390,7 +2420,7 @@ static int pch_gbe_napi_poll(struct napi_struct
*napi, int budget)
 	pch_gbe_clean_rx(adapter, adapter->rx_ring, &work_done, budget);
 	cleaned = pch_gbe_clean_tx(adapter, adapter->tx_ring);
 
-	if (!cleaned)
+	if (cleaned)
 		work_done = budget;
 	/* If no Tx and not enough Rx work done,
 	 * exit the polling mode
@@ -2796,6 +2826,7 @@ static int __init pch_gbe_init_module(void)
 {
 	int ret;
 
+	pr_info("EG20T PCH Gigabit Ethernet Driver - version
%s\n",DRV_VERSION);
 	ret = pci_register_driver(&pch_gbe_driver);
 	if (copybreak != PCH_GBE_COPYBREAK_DEFAULT) {
 		if (copybreak == 0) {

^ permalink raw reply related

* [PATCH 4/4] pch_gbe: vlan skb len fix
From: Andy Cress @ 2012-07-16 20:05 UTC (permalink / raw)
  To: netdev


Author: Veaceslav Falico <vfalico@redhat.com>
Date:   Tue Apr 10 08:14:17 2012 +0200

pch_gbe_xmit_frame skb->len verification was incorrect in vlan case 
causing bogus transfer length errors.  One correction could be:
    offset = skb->protocol == htons(ETH_P_8021Q) ? 0 : 4;
    if (unlikely(skb->len > (adapter->hw.mac.max_frame_size - offset))) 
However, this verification is not necessary, so remove it.

Merged-by: Andy Cress <andy.cress@us.kontron.com>
Signed-off-by: Andy Cress <andy.cress@us.kontron.com>

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 2ccdca6..5eaac7f 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -2160,13 +2160,6 @@ static int pch_gbe_xmit_frame(struct sk_buff
*skb, struct net_device *netdev)
 	struct pch_gbe_tx_ring *tx_ring = adapter->tx_ring;
 	unsigned long flags;
 
-	if (unlikely(skb->len > (adapter->hw.mac.max_frame_size - 4))) {
-		pr_err("Transfer length Error: skb len: %d > max: %d\n",
-		       skb->len, adapter->hw.mac.max_frame_size);
-		dev_kfree_skb_any(skb);
-		adapter->stats.tx_length_errors++;
-		return NETDEV_TX_OK;
-	}
 	if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) {
 		/* Collision - tell upper layer to requeue */
 		return NETDEV_TX_LOCKED;

^ permalink raw reply related

* Re: [PATCH] mlx4_en: map entire pages to increase throughput
From: Or Gerlitz @ 2012-07-16 20:36 UTC (permalink / raw)
  To: Rick Jones
  Cc: netdev@vger.kernel.org, leitao@linux.vnet.ibm.com,
	amirv@mellanox.com, yevgenyp@mellanox.co.il,
	klebers@linux.vnet.ibm.com, Thadeu Lima de Souza Cascardo,
	brking@linux.vnet.ibm.com, ogerlitz@mellanox.com,
	linuxppc-dev@lists.ozlabs.org, davem@davemloft.net,
	anton@samba.org
In-Reply-To: <50046EB1.5040909@hp.com>


[-- Attachment #1.1: Type: text/plain, Size: 312 bytes --]

On Mon, Jul 16, 2012 at 10:42 PM, Rick Jones <rick.jones2@hp.com> wrote:

> I was thinking more along the lines of an additional comparison,
> explicitly using netperf TCP_RR or something like it, not just the packets
> per second from a bulk transfer test.
>

TCP_STREAM would be good to know here as well

Or.

[-- Attachment #1.2: Type: text/html, Size: 636 bytes --]

[-- Attachment #2: Type: text/plain, Size: 150 bytes --]

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

^ permalink raw reply

* Re: [PATCH] mlx4_en: map entire pages to increase throughput
From: Or Gerlitz @ 2012-07-16 20:43 UTC (permalink / raw)
  To: Rick Jones, Thadeu Lima de Souza Cascardo
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	yevgenyp@mellanox.co.il, ogerlitz@mellanox.com,
	amirv@mellanox.com, brking@linux.vnet.ibm.com,
	leitao@linux.vnet.ibm.com, klebers@linux.vnet.ibm.com,
	linuxppc-dev@lists.ozlabs.org, anton@samba.org
In-Reply-To: <50046EB1.5040909@hp.com>

On Mon, Jul 16, 2012 at 10:42 PM, Rick Jones <rick.jones2@hp.com> wrote:

> I was thinking more along the lines of an additional comparison,
> explicitly using netperf TCP_RR or something like it, not just the packets
> per second from a bulk transfer test.


TCP_STREAM from this setup before the patch would be good to know as well

^ permalink raw reply

* Re: [PATCH] mlx4_en: map entire pages to increase throughput
From: Thadeu Lima de Souza Cascardo @ 2012-07-16 20:47 UTC (permalink / raw)
  To: Rick Jones
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	yevgenyp@mellanox.co.il, ogerlitz@mellanox.com,
	amirv@mellanox.com, brking@linux.vnet.ibm.com,
	leitao@linux.vnet.ibm.com, klebers@linux.vnet.ibm.com,
	linuxppc-dev@lists.ozlabs.org, anton@samba.org
In-Reply-To: <50046EB1.5040909@hp.com>

On Mon, Jul 16, 2012 at 12:42:41PM -0700, Rick Jones wrote:
> On 07/16/2012 12:06 PM, Thadeu Lima de Souza Cascardo wrote:
> >On Mon, Jul 16, 2012 at 10:27:57AM -0700, Rick Jones wrote:
> >
> >>What is the effect on packet-per-second performance?  (eg aggregate,
> >>burst-mode netperf TCP_RR with TCP_NODELAY set or perhaps UDP_RR)
> >>
> >I used uperf with TCP_NODELAY and 16 threads sending from another
> >machine 64000-sized writes for 60 seconds.
> >
> >I get 5898op/s (3.02Gb/s) without the patch against 18022ops/s
> >(9.23Gb/s) with the patch.
> 
> I was thinking more along the lines of an additional comparison,
> explicitly using netperf TCP_RR or something like it, not just the
> packets per second from a bulk transfer test.
> 
> rick
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

I used a uperf profile that is similar to TCP_RR. It writes, then reads
some bytes. I kept the TCP_NODELAY flag.

Without the patch, I saw the following:

packet size	ops/s		Gb/s
1		337024		0.0027
90		276620		0.199
900		190455		1.37
4000		68863		2.20
9000		45638		3.29
60000		9409		4.52

With the patch:

packet size	ops/s		Gb/s
1		451738		0.0036
90		345682		0.248
900		272258		1.96
4000		127055		4.07
9000		106614		7.68
60000		30671		14.72

^ permalink raw reply

* Re: [PATCH] mlx4_en: map entire pages to increase throughput
From: Thadeu Lima de Souza Cascardo @ 2012-07-16 20:57 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Rick Jones, davem@davemloft.net, netdev@vger.kernel.org,
	yevgenyp@mellanox.co.il, ogerlitz@mellanox.com,
	amirv@mellanox.com, brking@linux.vnet.ibm.com,
	leitao@linux.vnet.ibm.com, klebers@linux.vnet.ibm.com,
	linuxppc-dev@lists.ozlabs.org, anton@samba.org
In-Reply-To: <CAJZOPZL3F+xdHSFfhg7v9A6DDjT6CPK=kgwyzcE6c0pGYFyupg@mail.gmail.com>

On Mon, Jul 16, 2012 at 11:43:33PM +0300, Or Gerlitz wrote:
> On Mon, Jul 16, 2012 at 10:42 PM, Rick Jones <rick.jones2@hp.com> wrote:
> 
> > I was thinking more along the lines of an additional comparison,
> > explicitly using netperf TCP_RR or something like it, not just the packets
> > per second from a bulk transfer test.
> 
> 
> TCP_STREAM from this setup before the patch would be good to know as well
> 

Hi, Or.

Does the stream test that I did with uperf using messages of 64000 bytes
fit?

TCP_NODELAY does not make a difference in this case. I get something
around 3Gbps before the patch and something around 9Gbps after the
patch.

Before the patch:

# ./uperf-1.0.3-beta/src/uperf -m tcp.xml
Starting 16 threads running profile:tcp_stream ...   0.00 seconds
Txn1          0 /1.00(s) =            0          16op/s
Txn2    20.81GB /59.26(s) =     3.02Gb/s        5914op/s
Txn3          0 /0.00(s) =            0      128295op/s
-------------------------------------------------------------------------------------------------------------------------------
Total   20.81GB /61.37(s) =     2.91Gb/s        5712op/s

Netstat statistics for this run
-------------------------------------------------------------------------------------------------------------------------------
Nic       opkts/s     ipkts/s     obits/s     ibits/s
eth6       252459       31694   3.06Gb/s  16.74Mb/s
eth0            2          18   3.87Kb/s  14.28Kb/s
-------------------------------------------------------------------------------------------------------------------------------

Run Statistics
Hostname           Time        Data   Throughput   Operations
Errors
-------------------------------------------------------------------------------------------------------------------------------
10.0.0.2         61.47s     20.81GB     2.91Gb/s       350528
0.00
master           61.37s     20.81GB     2.91Gb/s       350528
0.00
-------------------------------------------------------------------------------------------------------------------------------
Difference(%)     -0.16%      0.00%        0.16%        0.00%
0.00%


After the patch:

# ./uperf-1.0.3-beta/src/uperf -m tcp.xml
Starting 16 threads running profile:tcp_stream ...   0.00 seconds
Txn1          0 /1.00(s) =            0          16op/s
Txn2    64.50GB /60.27(s) =     9.19Gb/s       17975op/s
Txn3          0 /0.00(s) =            0
-------------------------------------------------------------------------------------------------------------------------------
Total   64.50GB /62.27(s) =     8.90Gb/s       17397op/s

Netstat statistics for this run
-------------------------------------------------------------------------------------------------------------------------------
Nic       opkts/s     ipkts/s     obits/s     ibits/s
eth6       769428       96018   9.31Gb/s  50.72Mb/s
eth0            1          15   2.48Kb/s  13.59Kb/s
-------------------------------------------------------------------------------------------------------------------------------

Run Statistics
Hostname           Time        Data   Throughput   Operations
Errors
-------------------------------------------------------------------------------------------------------------------------------
10.0.0.2         62.27s     64.36GB     8.88Gb/s      1081096
0.00
master           62.27s     64.50GB     8.90Gb/s      1083325
0.00
-------------------------------------------------------------------------------------------------------------------------------
Difference(%)     -0.00%      0.21%        0.21%        0.21%
0.00%


Profile tcp.xml:

<?xml version="1.0"?>
<profile name="TCP_STREAM">
  <group nthreads="16">
        <transaction iterations="1">
            <flowop type="connect" options="remotehost=10.0.0.2 protocol=tcp tcp_nodelay"/>
        </transaction>
        <transaction duration="60">
            <flowop type="write" options="count=160 size=64000"/>
        </transaction>
        <transaction iterations="1">
            <flowop type="disconnect" />
        </transaction>
  </group>
</profile>

^ permalink raw reply

* Re: [PATCH 2/3] ipvs: add missing lock in ip_vs_ftp_init_conn()
From: Pablo Neira Ayuso @ 2012-07-16 21:07 UTC (permalink / raw)
  To: Simon Horman
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Hans Schillstrom, Jesper Dangaard Brouer,
	Xiaotian Feng, Xiaotian Feng, Patrick McHardy, David S. Miller
In-Reply-To: <1341965963-7275-3-git-send-email-horms@verge.net.au>

Hi Simon,

On Wed, Jul 11, 2012 at 09:19:22AM +0900, Simon Horman wrote:
> From: Xiaotian Feng <xtfeng@gmail.com>
> 
> We met a kernel panic in 2.6.32.43 kernel:
[...]
>  net/netfilter/ipvs/ip_vs_ftp.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ftp.c b/net/netfilter/ipvs/ip_vs_ftp.c
> index b20b29c..c2bc264 100644
> --- a/net/netfilter/ipvs/ip_vs_ftp.c
> +++ b/net/netfilter/ipvs/ip_vs_ftp.c
> @@ -65,8 +65,10 @@ static int ip_vs_ftp_pasv;
>  static int
>  ip_vs_ftp_init_conn(struct ip_vs_app *app, struct ip_vs_conn *cp)
>  {
> +	spin_lock(&cp->lock);
>  	/* We use connection tracking for the command connection */
>  	cp->flags |= IP_VS_CONN_F_NFCT;
> +	spin_unlock(&cp->lock);
>  	return 0;

The conntrack support for FTP IPVS helper seems to be there since
2.6.37.

However, the patch description mentions 2.6.32.43.

Something doesn't match here, could you clarify this?

Thanks.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox