Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [ethtool PATCH] ethtool: Resolve use of uninitialized memory in rxclass_get_dev_info
From: Ben Hutchings @ 2012-07-16 20:03 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, jeffrey.t.kirsher
In-Reply-To: <20120713165221.28140.92681.stgit@gitlad.jf.intel.com>

On Fri, 2012-07-13 at 09:55 -0700, Alexander Duyck wrote:
> The ethtool function for getting the rule count was not zeroing out the
> data field before passing it to the kernel.  As a result the value started
> uninitialized and was incorrectly returning a result indicating that
> devices supported setting new rule indexes.  In order to correct this I am
> adding a one line fix that sets data to zero before we pass the command to
> the kernel.

Right.  For 'get' commands with no parameters (besides the device) the
data copied back to userland is normally zero-initialised and then
filled out by the driver, and I seem to have worked on that assumption.
But because of the odd multiplexing of RX NFC commands
ETHTOOL_GRXCLSRLCNT doesn't work like that.  And for 'my' driver that
didn't matter.  Sorry about that.

(We should really have some explicit documentation of responsibility for
structure initialisation.)

> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
> 
> I am resending this since I didn't see any notification that it had been seen.
> I also realized that I had not clearly identified that this is an ethtool user
> space patch and not an ethtool kernel space patch.

It was perfectly clear and I had queued it up to review but hadn't yet
done so.

Ben.

>  rxclass.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/rxclass.c b/rxclass.c
> index 4d49aa6..e1633a8 100644
> --- a/rxclass.c
> +++ b/rxclass.c
> @@ -207,6 +207,7 @@ static int rxclass_get_dev_info(struct cmd_context *ctx, __u32 *count,
>  	int err;
>  
>  	nfccmd.cmd = ETHTOOL_GRXCLSRLCNT;
> +	nfccmd.data = 0;
>  	err = send_ioctl(ctx, &nfccmd);
>  	*count = nfccmd.rule_cnt;
>  	if (driver_select)
> 

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* [PATCH 1/4] pch_gbe: Fix the checksum fill to the error location
From: Andy Cress @ 2012-07-16 20:03 UTC (permalink / raw)
  To: netdev


Author: Zhong Hongbo <hongbo.zhong@windriver.com>

Due to some unknown hardware limitations the pch_gbe hardware cannot
calculate checksums when the length of network package is less
than 64 bytes, where we will surprisingly encounter a problem of
the destination IP incorrectly changed.

When forwarding network packages at the network layer the IP packages
won't be relayed to the upper transport layer and analyzed there,
consequently, skb->transport_header pointer will be mistakenly remained
the same as that of skb->network_header, resulting in TCP checksum
wrongly
filled into the field of destination IP in IP header.

We can fix this issue by manually calculate the offset of the TCP
checksum
 and update it accordingly.

We would normally use the skb_checksum_start_offset(skb) here, but in
this
case it is sometimes -2 (csum_start=0 - skb_headroom=2 => -2), hence the
manual calculation.

Signed-off-by: Zhong Hongbo <hongbo.zhong@windriver.com>
Merged-by: Andy Cress <andy.cress@us.kontron.com>

---
 drivers/net/pch_gbe/pch_gbe_main.c |   14 ++++++++------
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 3787c64..1642bff 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -1178,32 +1178,35 @@ static void pch_gbe_tx_queue(struct
pch_gbe_adapter *adapter,
 	/*
 	 * It is because the hardware accelerator does not support a
checksum,
 	 * when the received data size is less than 64 bytes.
+	 * Note: skb_checksum_start_offset(skb) is sometimes -2 here.
 	 */
 	if (skb->len < PCH_GBE_SHORT_PKT && skb->ip_summed !=
CHECKSUM_NONE) {
+		struct iphdr *iph = ip_hdr(skb);
 		frame_ctrl |= PCH_GBE_TXD_CTRL_APAD |
 			      PCH_GBE_TXD_CTRL_TCPIP_ACC_OFF;
 		if (skb->protocol == htons(ETH_P_IP)) {
-			struct iphdr *iph = ip_hdr(skb);
 			unsigned int offset;
-			offset = skb_transport_offset(skb);
+			offset = (unsigned char *)((u8 *)iph + iph->ihl
* 4) - skb->data;
 			if (iph->protocol == IPPROTO_TCP) {
+				struct tcphdr *tcphdr_point = (struct
tcphdr *)((u8 *)iph + iph->ihl * 4);
 				skb->csum = 0;
-				tcp_hdr(skb)->check = 0;
+				tcphdr_point->check = 0;
 				skb->csum = skb_checksum(skb, offset,
 							 skb->len -
offset, 0);
-				tcp_hdr(skb)->check =
+				tcphdr_point->check = 
 					csum_tcpudp_magic(iph->saddr,
 							  iph->daddr,
 							  skb->len -
offset,
 							  IPPROTO_TCP,
 							  skb->csum);
 			} else if (iph->protocol == IPPROTO_UDP) {
+				struct udphdr *udphdr_point = (struct
udphdr *)((u8 *)iph + iph->ihl * 4);
 				skb->csum = 0;
-				udp_hdr(skb)->check = 0;
+				udphdr_point->check = 0;
 				skb->csum =
 					skb_checksum(skb, offset,
 						     skb->len - offset,
0);
-				udp_hdr(skb)->check =
+				udphdr_point->check = 
 					csum_tcpudp_magic(iph->saddr,
 							  iph->daddr,
 							  skb->len -
offset,

^ permalink raw reply related

* [PATCH 0/4] pch_gbe: avoiding transmit timeouts (rev 2)
From: Andy Cress @ 2012-07-16 20:02 UTC (permalink / raw)
  To: netdev

When the interface is stressed with 6 VLANs, some transmit timeout stats
were 
observed, which is a potential precursor to the more severe netdev
watchdog 
timeout oops.  Also we saw more than the expected number of
transmit restarts, which impacted performance.   The following patches
were applied and resolved the symptom of the transmit timeout stats, and

reduced the number of transmit restarts.  

This patch set includes the following patches:
0001-pch_gbe-Fix-the-checksum-fill-to-the-error-location.patch
0002-pch_gbe-fix-transmit-watchdog-timeout.patch
0003-pch_gbe-add-extra-clean-tx.patch  (includes bumping the version to
1.01) 
0004-pch_gbe-vlan-skb-len-fix.patch

This rev2 has the following changes:
0001: tested w skb_checksum_start_offset, but cannot use it here, added
comment
0004: delete transmit length error check as unnecessary

The resulting pch_gbe 1.01 driver has been tested on Kontron Tunnel
Creek 
EG20T modules and Intel Crown Bay EG20T modules, so I believe that these
are 
appropriate for consideration in the upstream pch_gbe driver.

Please review and comment.

Thanks,
Andy

^ permalink raw reply

* Re: [PATCHv1] ethtool: added support for 40G link.
From: Ben Hutchings @ 2012-07-16 19:46 UTC (permalink / raw)
  To: Parav Pandit; +Cc: netdev
In-Reply-To: <847203bd-a11e-41c6-b451-abdeced5c5bf@exht1.ad.emulex.com>

On Wed, 2012-06-27 at 19:26 +0530, Parav Pandit wrote:
> 1. defined values for KR4, CR4, SR4, LR4 PHY.
> 
> Signed-off-by: Parav Pandit <parav.pandit@emulex.com>
[...]

Applied, thanks.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH] mlx4_en: map entire pages to increase throughput
From: Rick Jones @ 2012-07-16 19:42 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	yevgenyp@mellanox.co.il, ogerlitz@mellanox.com,
	amirv@mellanox.com, brking@linux.vnet.ibm.com,
	leitao@linux.vnet.ibm.com, klebers@linux.vnet.ibm.com,
	linuxppc-dev@lists.ozlabs.org, anton@samba.org
In-Reply-To: <20120716190611.GA1023@oc1711230544.ibm.com>

On 07/16/2012 12:06 PM, Thadeu Lima de Souza Cascardo wrote:
> On Mon, Jul 16, 2012 at 10:27:57AM -0700, Rick Jones wrote:
>
>> What is the effect on packet-per-second performance?  (eg aggregate,
>> burst-mode netperf TCP_RR with TCP_NODELAY set or perhaps UDP_RR)
>>
> I used uperf with TCP_NODELAY and 16 threads sending from another
> machine 64000-sized writes for 60 seconds.
>
> I get 5898op/s (3.02Gb/s) without the patch against 18022ops/s
> (9.23Gb/s) with the patch.

I was thinking more along the lines of an additional comparison, 
explicitly using netperf TCP_RR or something like it, not just the 
packets per second from a bulk transfer test.

rick

^ permalink raw reply

* [PULL net-next] tipc: kill off struct print_buf/log
From: Paul Gortmaker @ 2012-07-16 19:38 UTC (permalink / raw)
  To: davem; +Cc: netdev, jon.maloy, erik.hugne

Dave,

This is the same eight commits as sent for review last week[1],
with just the incorporation of the pr_fmt change as suggested
by JoeP.  There was no additional change requests, so unless you
can see something else you'd like me to change, please pull.

Thanks,
Paul.

[1] http://www.spinics.net/lists/netdev/msg204448.html

---

The following changes since commit 48ee3569f31d91084dc694fef5517eb782428083:

  ipv6: Move ipv6 twsk accessors outside of CONFIG_IPV6 ifdefs. (2012-07-11 02:39:24 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux.git tipc_net-next

for you to fetch changes up to 869dd4662f90514cb92b44a389e85c737b464e25:

  tipc: remove print_buf and deprecated log buffer code (2012-07-13 19:34:43 -0400)

----------------------------------------------------------------
Erik Hugne (5):
      tipc: use standard printk shortcut macros (pr_err etc.)
      tipc: remove TIPC packet debugging functions and macros
      tipc: simplify print buffer handling in tipc_printf
      tipc: phase out most of the struct print_buf usage
      tipc: remove print_buf and deprecated log buffer code

Paul Gortmaker (3):
      tipc: factor stats struct out of the larger link struct
      tipc: limit error messages relating to memory leak to one line
      tipc: simplify link_print by divorcing it from using tipc_printf

 include/linux/tipc_config.h |    4 +-
 net/tipc/Kconfig            |   25 ----
 net/tipc/bcast.c            |   65 +++++----
 net/tipc/bearer.c           |   62 +++++----
 net/tipc/bearer.h           |    2 +-
 net/tipc/config.c           |   41 +++---
 net/tipc/core.c             |   18 +--
 net/tipc/core.h             |   65 +--------
 net/tipc/discover.c         |   10 +-
 net/tipc/handler.c          |    4 +-
 net/tipc/link.c             |  304 +++++++++++++++++++------------------------
 net/tipc/link.h             |   63 ++++-----
 net/tipc/log.c              |  302 ++----------------------------------------
 net/tipc/log.h              |   66 ----------
 net/tipc/msg.c              |  242 ----------------------------------
 net/tipc/name_distr.c       |   25 ++--
 net/tipc/name_table.c       |  132 ++++++++++---------
 net/tipc/net.c              |    8 +-
 net/tipc/netlink.c          |    2 +-
 net/tipc/node.c             |   22 ++--
 net/tipc/node_subscr.c      |    3 +-
 net/tipc/port.c             |   66 +++++-----
 net/tipc/ref.c              |   10 +-
 net/tipc/socket.c           |   10 +-
 net/tipc/subscr.c           |   14 +-
 25 files changed, 439 insertions(+), 1126 deletions(-)
 delete mode 100644 net/tipc/log.h

^ permalink raw reply

* Re: [RFC 1/2] PCI-Express Non-Transparent Bridge Support
From: chetan loke @ 2012-07-16 19:27 UTC (permalink / raw)
  To: Jon Mason; +Cc: linux-kernel, netdev, linux-pci, Dave Jiang
In-Reply-To: <20120716183816.GH9598@jonmason-lab>

On Mon, Jul 16, 2012 at 2:38 PM, Jon Mason <jon.mason@intel.com> wrote:
> On Mon, Jul 16, 2012 at 12:49:39PM -0400, chetan loke wrote:

....

>> Is it ok to rename the following vars for convenience sake?
>>
>> > +       struct list_head txq;
>> tx_pend_q - (pending_queue) or tx_out_q - (outstanding_queue) - or
>> pick any new string you like - other than a mono-syllable definition
>>
>> > +       struct list_head txc;
>> tx_compl_q - completion queue
>>
>> > +       struct list_head txe;
>> tx_avail_e - available entry queue
>>
>>
>> > +       spinlock_t txq_lock;
>> > +       spinlock_t txc_lock;
>> > +       spinlock_t txe_lock;
>>
>> then match the variants accordingly.
>>
>> > +       struct list_head rxq;
>> > +       struct list_head rxc;
>> > +       struct list_head rxe;
>> > +       spinlock_t rxq_lock;
>> > +       spinlock_t rxc_lock;
>> > +       spinlock_t rxe_lock;
>>
>> similarly the rx-counterpart
>
> Are they difficult to understand?  I can change them, but it seems rather moot since you seemed to immediately understand the logic behind the names.
>

Immediately understand? Not at first glance. I had to re-read the
functions. Its really is a minor change and variables will then become
self-explanatory. I can almost feel that a developer who works on this
code for the first time might end up choosing the wrong 'syllable' and
locking an entirely different lock.

Infact add a prefix 'ntb' to all the rx/tx locks. That way grep'ing
output of lockstat also becomes easier.

It now reads: ntb_tx_pend_lock

>>
>>
>> ..................
>>
>> > +static void ntb_tx_copy_task(struct ntb_transport_qp *qp,
>> > +                            struct ntb_queue_entry *entry,
>> > +                            void *offset)
>> > +{
>> > +       struct ntb_payload_header *hdr = offset;
>> > +       int rc;
>> > +
>> > +       offset += sizeof(struct ntb_payload_header);
>> > +       memcpy_toio(offset, entry->buf, entry->len);
>> > +
>> > +       hdr->len = entry->len;
>> > +       hdr->ver = qp->tx_pkts;
>> > +
>> > +       /* Ensure that the data is fully copied out before setting the flag */
>> > +       wmb();
>> > +       hdr->flags = entry->flags | DESC_DONE_FLAG;
>> > +
>> > +       rc = ntb_ring_sdb(qp->ndev, qp->qp_num);
>> > +       if (rc)
>> > +               pr_err("%s: error ringing db %d\n", __func__, qp->qp_num);
>> > +
>> > +       if (entry->len > 0) {
>>
>> how do you interpret this len variable and decide if it's a good/bad completion?
>
> The length of 0 is for messages from the remote system, which currently only consists of a "link down" message notifying the local system to no longer send any messages to the remote side.
>

May be I didn't read the code properly. Is there a length-comment that
explains the above? If not then just by pure code inspection it would
seem that a skb was leaked. So add the above comment that way we can
save time for other netdev folks too.


>>
>> > +               qp->tx_bytes += entry->len;
>> > +
>> > +               /* Add fully transmitted data to completion queue */
>> > +               ntb_list_add_tail(&qp->txc_lock, &entry->entry, &qp->txc);
>> > +
>> > +               if (qp->tx_handler)
>> > +                       qp->tx_handler(qp);
>> > +       } else
>> > +               ntb_list_add_tail(&qp->txe_lock, &entry->entry, &qp->txe);
>>
>> I could be wrong but how is the original skb handled if the code path
>> goes in the else clause?
>
> There is no skb is the length is zero.  Since the only client is virtual ethernet, it will always be > 60.  However, I should add a sanity check for 0 length in tx_enqueue.
>
>> Also, in the else clause, how about a ntb_list_add_head rather than a _tail.
>
> Why add to the head, it was just used?

Yes, just re-use what's hot(best guess).

>
>> > +
>> > +static int ntb_process_tx(struct ntb_transport_qp *qp,
>> > +                         struct ntb_queue_entry *entry)
>> > +{
>> > +       struct ntb_payload_header *hdr;
>> > +       void *offset;
>> > +
>> > +       offset = qp->tx_offset;
>> > +       hdr = offset;
>> > +
>> > +       pr_debug("%lld - offset %p, tx %p, entry len %d flags %x buff %p\n",
>> > +                qp->tx_pkts, offset, qp->tx_offset, entry->len, entry->flags,
>> > +                entry->buf);
>> > +       if (hdr->flags) {
>> > +               ntb_list_add_head(&qp->txq_lock, &entry->entry, &qp->txq);
>> > +               qp->tx_ring_full++;
>> > +               return -EAGAIN;
>> > +       }
>> > +
>> > +       if (entry->len > transport_mtu) {
>> > +               pr_err("Trying to send pkt size of %d\n", entry->len);
>> > +               entry->flags = HW_ERROR_FLAG;
>> > +
>> > +               ntb_list_add_tail(&qp->txc_lock, &entry->entry, &qp->txc);
>> > +
>> > +               if (qp->tx_handler)
>> > +                       qp->tx_handler(qp);
>> > +
>> > +               return 0;
>> > +       }
>> > +
>> > +       ntb_tx_copy_task(qp, entry, offset);
>>
>> what happens when ntb_sdb_ring returns an error? would you still want
>> to increment tx_pkts below?
>
> It's not fatal if the remote side isn't notified.  It will hurt latency, since the next packet would be the one that triggers the next interrupt.  Also, the only way it could ever fail would be if it was an invalid interrupt bit being set, which should never happen.
>

What happens when the 'db >= ndev->max_cbs' check fails? Under what
circumstances will that happen? When it does happen how does the
remote side then gets notified or should it even get notified?

'which should never happen'? FYI - I have seen and debugged(not this
one but doorbells and what not) weirdness while working on CLARiiON +
PCie-interconnects. Board bring-up is a PITA. So you get the idea ...

>>
>> > +void *ntb_transport_tx_dequeue(struct ntb_transport_qp *qp, unsigned int *len)
>> > +{
>> > +       struct ntb_queue_entry *entry;
>> > +       void *buf;
>> > +
>> > +       if (!qp)
>> > +               return NULL;
>> > +
>> > +       entry = ntb_list_rm_head(&qp->txc_lock, &qp->txc);
>> > +       if (!entry)
>> > +               return NULL;
>> > +
>> > +       buf = entry->callback_data;
>> > +       if (entry->flags != HW_ERROR_FLAG)
>> > +               *len = entry->len;
>> > +       else
>> > +               *len = -EIO;
>> > +
>> > +       ntb_list_add_tail(&qp->txe_lock, &entry->entry, &qp->txe);
>>
>> how about a ntb_list_add_head?
>
> Is there a benefit to adding to the head versus tail?  It makes more sense to me to pull from the head and add to the tail.
>

Yes, explained above. Today there are 100(..DEF_NUM...) entries.
Tomorrow there could be more. So why not re-use what's hot? You could
also think of this as yet another way of forcing detection of
double-use corruption/bug. I'm not saying that there's a bug here but
you get the idea.

Chetan Loke

^ permalink raw reply

* [PATCH v2] sctp: Fix list corruption resulting from freeing an association on a list
From: Neil Horman @ 2012-07-16 19:13 UTC (permalink / raw)
  To: netdev
  Cc: Neil Horman, davej, David S. Miller, Vlad Yasevich,
	Sridhar Samudrala, linux-sctp
In-Reply-To: <1342463970-7457-1-git-send-email-nhorman@tuxdriver.com>

A few days ago Dave Jones reported this oops:

[22766.294255] general protection fault: 0000 [#1] PREEMPT SMP
[22766.295376] CPU 0
[22766.295384] Modules linked in:
[22766.387137]  ffffffffa169f292 6b6b6b6b6b6b6b6b ffff880147c03a90
ffff880147c03a74
[22766.387135] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000
[22766.387136] Process trinity-watchdo (pid: 10896, threadinfo ffff88013e7d2000,
[22766.387137] Stack:
[22766.387140]  ffff880147c03a10
[22766.387140]  ffffffffa169f2b6
[22766.387140]  ffff88013ed95728
[22766.387143]  0000000000000002
[22766.387143]  0000000000000000
[22766.387143]  ffff880003fad062
[22766.387144]  ffff88013c120000
[22766.387144]
[22766.387145] Call Trace:
[22766.387145]  <IRQ>
[22766.387150]  [<ffffffffa169f292>] ? __sctp_lookup_association+0x62/0xd0
[sctp]
[22766.387154]  [<ffffffffa169f2b6>] __sctp_lookup_association+0x86/0xd0 [sctp]
[22766.387157]  [<ffffffffa169f597>] sctp_rcv+0x207/0xbb0 [sctp]
[22766.387161]  [<ffffffff810d4da8>] ? trace_hardirqs_off_caller+0x28/0xd0
[22766.387163]  [<ffffffff815827e3>] ? nf_hook_slow+0x133/0x210
[22766.387166]  [<ffffffff815902fc>] ? ip_local_deliver_finish+0x4c/0x4c0
[22766.387168]  [<ffffffff8159043d>] ip_local_deliver_finish+0x18d/0x4c0
[22766.387169]  [<ffffffff815902fc>] ? ip_local_deliver_finish+0x4c/0x4c0
[22766.387171]  [<ffffffff81590a07>] ip_local_deliver+0x47/0x80
[22766.387172]  [<ffffffff8158fd80>] ip_rcv_finish+0x150/0x680
[22766.387174]  [<ffffffff81590c54>] ip_rcv+0x214/0x320
[22766.387176]  [<ffffffff81558c07>] __netif_receive_skb+0x7b7/0x910
[22766.387178]  [<ffffffff8155856c>] ? __netif_receive_skb+0x11c/0x910
[22766.387180]  [<ffffffff810d423e>] ? put_lock_stats.isra.25+0xe/0x40
[22766.387182]  [<ffffffff81558f83>] netif_receive_skb+0x23/0x1f0
[22766.387183]  [<ffffffff815596a9>] ? dev_gro_receive+0x139/0x440
[22766.387185]  [<ffffffff81559280>] napi_skb_finish+0x70/0xa0
[22766.387187]  [<ffffffff81559cb5>] napi_gro_receive+0xf5/0x130
[22766.387218]  [<ffffffffa01c4679>] e1000_receive_skb+0x59/0x70 [e1000e]
[22766.387242]  [<ffffffffa01c5aab>] e1000_clean_rx_irq+0x28b/0x460 [e1000e]
[22766.387266]  [<ffffffffa01c9c18>] e1000e_poll+0x78/0x430 [e1000e]
[22766.387268]  [<ffffffff81559fea>] net_rx_action+0x1aa/0x3d0
[22766.387270]  [<ffffffff810a495f>] ? account_system_vtime+0x10f/0x130
[22766.387273]  [<ffffffff810734d0>] __do_softirq+0xe0/0x420
[22766.387275]  [<ffffffff8169826c>] call_softirq+0x1c/0x30
[22766.387278]  [<ffffffff8101db15>] do_softirq+0xd5/0x110
[22766.387279]  [<ffffffff81073bc5>] irq_exit+0xd5/0xe0
[22766.387281]  [<ffffffff81698b03>] do_IRQ+0x63/0xd0
[22766.387283]  [<ffffffff8168ee2f>] common_interrupt+0x6f/0x6f
[22766.387283]  <EOI>
[22766.387284]
[22766.387285]  [<ffffffff8168eed9>] ? retint_swapgs+0x13/0x1b
[22766.387285] Code: c0 90 5d c3 66 0f 1f 44 00 00 4c 89 c8 5d c3 0f 1f 00 55 48
89 e5 48 83
ec 20 48 89 5d e8 4c 89 65 f0 4c 89 6d f8 66 66 66 66 90 <0f> b7 87 98 00 00 00
48 89 fb
49 89 f5 66 c1 c0 08 66 39 46 02
[22766.387307]
[22766.387307] RIP
[22766.387311]  [<ffffffffa168a2c9>] sctp_assoc_is_match+0x19/0x90 [sctp]
[22766.387311]  RSP <ffff880147c039b0>
[22766.387142]  ffffffffa16ab120
[22766.599537] ---[ end trace 3f6dae82e37b17f5 ]---
[22766.601221] Kernel panic - not syncing: Fatal exception in interrupt

It appears from his analysis and some staring at the code that this is likely
occuring because an association is getting freed while still on the
sctp_assoc_hashtable.  As a result, we get a gpf when traversing the hashtable
while a freed node corrupts part of the list.

Nominally I would think that an mibalanced refcount was responsible for this,
but I can't seem to find any obvious imbalance.  What I did note however was
that the two places where we create an association using
sctp_primitive_ASSOCIATE (__sctp_connect and sctp_sendmsg), have failure paths
which free a newly created association after calling sctp_primitive_ASSOCIATE.
sctp_primitive_ASSOCIATE brings us into the sctp_sf_do_prm_asoc path, which
issues a SCTP_CMD_NEW_ASOC side effect, which in turn adds a new association to
the aforementioned hash table.  the sctp command interpreter that process side
effects has not way to unwind previously processed commands, so freeing the
association from the __sctp_connect or sctp_sendmsg error path would lead to a
freed association remaining on this hash table.

I've fixed this but modifying sctp_[un]hash_established to use hlist_del_init,
which allows us to proerly use hlist_unhashed to check if the node is on a
hashlist safely during a delete.  That in turn alows us to safely call
sctp_unhash_established in the __sctp_connect and sctp_sendmsg error paths
before freeing them, regardles of what the associations state is on the hash
list.

I noted, while I was doing this, that the __sctp_unhash_endpoint was using
hlist_unhsashed in a simmilar fashion, but never nullified any removed nodes
pointers to make that function work properly, so I fixed that up in a simmilar
fashion.

I attempted to test this using a virtual guest running the SCTP_RR test from
netperf in a loop while running the trinity fuzzer, both in a loop.  I wasn't
able to recreate the problem prior to this fix, nor was I able to trigger the
failure after (neither of which I suppose is suprising).  Given the trace above
however, I think its likely that this is what we hit.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: davej@redhat.com
CC: davej@redhat.com
CC: "David S. Miller" <davem@davemloft.net>
CC: Vlad Yasevich <vyasevich@gmail.com>
CC: Sridhar Samudrala <sri@us.ibm.com>
CC: linux-sctp@vger.kernel.org

---
Change notes

V2) Added missing bits to sctp_sendmsg that I neglected in my last post
---
 net/sctp/input.c  |    7 ++-----
 net/sctp/socket.c |   12 ++++++++++--
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/net/sctp/input.c b/net/sctp/input.c
index f050d45..05994374 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -752,15 +752,12 @@ static void __sctp_unhash_endpoint(struct sctp_endpoint *ep)

 	epb = &ep->base;

-	if (hlist_unhashed(&epb->node))
-		return;
-
 	epb->hashent = sctp_ep_hashfn(epb->bind_addr.port);

 	head = &sctp_ep_hashtable[epb->hashent];

 	sctp_write_lock(&head->lock);
-	__hlist_del(&epb->node);
+	hlist_del_init(&epb->node);
 	sctp_write_unlock(&head->lock);
 }

@@ -841,7 +838,7 @@ static void __sctp_unhash_established(struct sctp_association *asoc)
 	head = &sctp_assoc_hashtable[epb->hashent];

 	sctp_write_lock(&head->lock);
-	__hlist_del(&epb->node);
+	hlist_del_init(&epb->node);
 	sctp_write_unlock(&head->lock);
 }

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index b3b8a8d..31c7bfc 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1231,8 +1231,14 @@ out_free:
 	SCTP_DEBUG_PRINTK("About to exit __sctp_connect() free asoc: %p"
 			  " kaddrs: %p err: %d\n",
 			  asoc, kaddrs, err);
-	if (asoc)
+	if (asoc) {
+		/* sctp_primitive_ASSOCIATE may have added this association
+		 * To the hash table, try to unhash it, just in case, its a noop
+		 * if it wasn't hashed so we're safe
+		 */
+		sctp_unhash_established(asoc);
 		sctp_association_free(asoc);
+	}
 	return err;
 }

@@ -1942,8 +1948,10 @@ SCTP_STATIC int sctp_sendmsg(struct kiocb *iocb, struct sock *sk,
 	goto out_unlock;

 out_free:
-	if (new_asoc)
+	if (new_asoc) {
+		sctp_unhash_established(asoc);
 		sctp_association_free(asoc);
+	}
 out_unlock:
 	sctp_release_sock(sk);

-- 
1.7.7.6

^ permalink raw reply related

* Re: [PATCH] mlx4_en: map entire pages to increase throughput
From: Thadeu Lima de Souza Cascardo @ 2012-07-16 19:06 UTC (permalink / raw)
  To: Rick Jones
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	yevgenyp@mellanox.co.il, ogerlitz@mellanox.com,
	amirv@mellanox.com, brking@linux.vnet.ibm.com,
	leitao@linux.vnet.ibm.com, klebers@linux.vnet.ibm.com,
	linuxppc-dev, anton
In-Reply-To: <50044F1D.6000703@hp.com>

On Mon, Jul 16, 2012 at 10:27:57AM -0700, Rick Jones wrote:
> On 07/16/2012 10:01 AM, Thadeu Lima de Souza Cascardo wrote:
> >In its receive path, mlx4_en driver maps each page chunk that it pushes
> >to the hardware and unmaps it when pushing it up the stack. This limits
> >throughput to about 3Gbps on a Power7 8-core machine.
> 
> That seems rather extraordinarily low - Power7 is supposed to be a
> rather high performance CPU.  The last time I noticed O(3Gbit/s) on
> 10G for bulk transfer was before the advent of LRO/GRO - that was in
> the x86 space though.  Is mapping really that expensive with Power7?
> 

Copying linuxppc-dev and Anton here. But I can tell you that we have
lock contention when doing the mapping on the same adapter (map table
per device). Anton has sent some patches that improves that *a lot*.

However, for 1500 MTU, mlx4_en was doing two unmaps and two maps per
packet. The problem is not the CPU power needed to do the mappings, but
that we find the lock contention and end up with the CPUs more than 30%
of the time spent on spin locking.

> 
> >One solution is to map the entire allocated page at once. However, this
> >requires that we keep track of every page fragment we give to a
> >descriptor. We also need to work with the discipline that all fragments will
> >be released (in the sense that it will not be reused by the driver
> >anymore) in the order they are allocated to the driver.
> >
> >This requires that we don't reuse any fragments, every single one of
> >them must be reallocated. We do that by releasing all the fragments that
> >are processed and only after finished processing the descriptors, we
> >start the refill.
> >
> >We also must somehow guarantee that we either refill all fragments in a
> >descriptor or none at all, without resorting to giving up a page
> >fragment that we would have already given. Otherwise, we would break the
> >discipline of only releasing the fragments in the order they were
> >allocated.
> >
> >This has passed page allocation fault injections (restricted to the
> >driver by using required-start and required-end) and device hotplug
> >while 16 TCP streams were able to deliver more than 9Gbps.
> 
> What is the effect on packet-per-second performance?  (eg aggregate,
> burst-mode netperf TCP_RR with TCP_NODELAY set or perhaps UDP_RR)
> 

I used uperf with TCP_NODELAY and 16 threads sending from another
machine 64000-sized writes for 60 seconds.

I get 5898op/s (3.02Gb/s) without the patch against 18022ops/s
(9.23Gb/s) with the patch.

Best regards.
Cascardo.


> rick jones
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH] sctp: Fix list corruption resulting from freeing an association on a list
From: Neil Horman @ 2012-07-16 18:49 UTC (permalink / raw)
  To: netdev; +Cc: davej, David S. Miller, Vlad Yasevich, Sridhar Samudrala,
	linux-sctp
In-Reply-To: <1342463970-7457-1-git-send-email-nhorman@tuxdriver.com>

On Mon, Jul 16, 2012 at 02:39:30PM -0400, Neil Horman wrote:
> A few days ago Dave Jones reported this oops:
> 
> [22766.294255] general protection fault: 0000 [#1] PREEMPT SMP
> [22766.295376] CPU 0
> [22766.295384] Modules linked in:
> [22766.387137]  ffffffffa169f292 6b6b6b6b6b6b6b6b ffff880147c03a90
> ffff880147c03a74
> [22766.387135] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000
> [22766.387136] Process trinity-watchdo (pid: 10896, threadinfo ffff88013e7d2000,
> [22766.387137] Stack:
> [22766.387140]  ffff880147c03a10
> [22766.387140]  ffffffffa169f2b6
> [22766.387140]  ffff88013ed95728
> [22766.387143]  0000000000000002
> [22766.387143]  0000000000000000
> [22766.387143]  ffff880003fad062
> [22766.387144]  ffff88013c120000
> [22766.387144]
> [22766.387145] Call Trace:
> [22766.387145]  <IRQ>
> [22766.387150]  [<ffffffffa169f292>] ? __sctp_lookup_association+0x62/0xd0
> [sctp]
> [22766.387154]  [<ffffffffa169f2b6>] __sctp_lookup_association+0x86/0xd0 [sctp]
> [22766.387157]  [<ffffffffa169f597>] sctp_rcv+0x207/0xbb0 [sctp]
> [22766.387161]  [<ffffffff810d4da8>] ? trace_hardirqs_off_caller+0x28/0xd0
> [22766.387163]  [<ffffffff815827e3>] ? nf_hook_slow+0x133/0x210
> [22766.387166]  [<ffffffff815902fc>] ? ip_local_deliver_finish+0x4c/0x4c0
> [22766.387168]  [<ffffffff8159043d>] ip_local_deliver_finish+0x18d/0x4c0
> [22766.387169]  [<ffffffff815902fc>] ? ip_local_deliver_finish+0x4c/0x4c0
> [22766.387171]  [<ffffffff81590a07>] ip_local_deliver+0x47/0x80
> [22766.387172]  [<ffffffff8158fd80>] ip_rcv_finish+0x150/0x680
> [22766.387174]  [<ffffffff81590c54>] ip_rcv+0x214/0x320
> [22766.387176]  [<ffffffff81558c07>] __netif_receive_skb+0x7b7/0x910
> [22766.387178]  [<ffffffff8155856c>] ? __netif_receive_skb+0x11c/0x910
> [22766.387180]  [<ffffffff810d423e>] ? put_lock_stats.isra.25+0xe/0x40
> [22766.387182]  [<ffffffff81558f83>] netif_receive_skb+0x23/0x1f0
> [22766.387183]  [<ffffffff815596a9>] ? dev_gro_receive+0x139/0x440
> [22766.387185]  [<ffffffff81559280>] napi_skb_finish+0x70/0xa0
> [22766.387187]  [<ffffffff81559cb5>] napi_gro_receive+0xf5/0x130
> [22766.387218]  [<ffffffffa01c4679>] e1000_receive_skb+0x59/0x70 [e1000e]
> [22766.387242]  [<ffffffffa01c5aab>] e1000_clean_rx_irq+0x28b/0x460 [e1000e]
> [22766.387266]  [<ffffffffa01c9c18>] e1000e_poll+0x78/0x430 [e1000e]
> [22766.387268]  [<ffffffff81559fea>] net_rx_action+0x1aa/0x3d0
> [22766.387270]  [<ffffffff810a495f>] ? account_system_vtime+0x10f/0x130
> [22766.387273]  [<ffffffff810734d0>] __do_softirq+0xe0/0x420
> [22766.387275]  [<ffffffff8169826c>] call_softirq+0x1c/0x30
> [22766.387278]  [<ffffffff8101db15>] do_softirq+0xd5/0x110
> [22766.387279]  [<ffffffff81073bc5>] irq_exit+0xd5/0xe0
> [22766.387281]  [<ffffffff81698b03>] do_IRQ+0x63/0xd0
> [22766.387283]  [<ffffffff8168ee2f>] common_interrupt+0x6f/0x6f
> [22766.387283]  <EOI>
> [22766.387284]
> [22766.387285]  [<ffffffff8168eed9>] ? retint_swapgs+0x13/0x1b
> [22766.387285] Code: c0 90 5d c3 66 0f 1f 44 00 00 4c 89 c8 5d c3 0f 1f 00 55 48
> 89 e5 48 83
> ec 20 48 89 5d e8 4c 89 65 f0 4c 89 6d f8 66 66 66 66 90 <0f> b7 87 98 00 00 00
> 48 89 fb
> 49 89 f5 66 c1 c0 08 66 39 46 02
> [22766.387307]
> [22766.387307] RIP
> [22766.387311]  [<ffffffffa168a2c9>] sctp_assoc_is_match+0x19/0x90 [sctp]
> [22766.387311]  RSP <ffff880147c039b0>
> [22766.387142]  ffffffffa16ab120
> [22766.599537] ---[ end trace 3f6dae82e37b17f5 ]---
> [22766.601221] Kernel panic - not syncing: Fatal exception in interrupt
> 
> It appears from his analysis and some staring at the code that this is likely
> occuring because an association is getting freed while still on the
> sctp_assoc_hashtable.  As a result, we get a gpf when traversing the hashtable
> while a freed node corrupts part of the list.
> 
> Nominally I would think that an mibalanced refcount was responsible for this,
> but I can't seem to find any obvious imbalance.  What I did note however was
> that the two places where we create an association using
> sctp_primitive_ASSOCIATE (__sctp_connect and sctp_sendmsg), have failure paths
> which free a newly created association after calling sctp_primitive_ASSOCIATE.
> sctp_primitive_ASSOCIATE brings us into the sctp_sf_do_prm_asoc path, which
> issues a SCTP_CMD_NEW_ASOC side effect, which in turn adds a new association to
> the aforementioned hash table.  the sctp command interpreter that process side
> effects has not way to unwind previously processed commands, so freeing the
> association from the __sctp_connect or sctp_sendmsg error path would lead to a
> freed association remaining on this hash table.
> 
> I've fixed this but modifying sctp_[un]hash_established to use hlist_del_init,
> which allows us to proerly use hlist_unhashed to check if the node is on a
> hashlist safely during a delete.  That in turn alows us to safely call
> sctp_unhash_established in the __sctp_connect and sctp_sendmsg error paths
> before freeing them, regardles of what the associations state is on the hash
> list.
> 
> I noted, while I was doing this, that the __sctp_unhash_endpoint was using
> hlist_unhsashed in a simmilar fashion, but never nullified any removed nodes
> pointers to make that function work properly, so I fixed that up in a simmilar
> fashion.
> 
> I attempted to test this using a virtual guest running the SCTP_RR test from
> netperf in a loop while running the trinity fuzzer, both in a loop.  I wasn't
> able to recreate the problem prior to this fix, nor was I able to trigger the
> failure after (neither of which I suppose is suprising).  Given the trace above
> however, I think its likely that this is what we hit.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> Reported-by: davej@redhat.com
> CC: davej@redhat.com
> CC: "David S. Miller" <davem@davemloft.net>
> CC: Vlad Yasevich <vyasevich@gmail.com>
> CC: Sridhar Samudrala <sri@us.ibm.com>
> CC: linux-sctp@vger.kernel.org
> ---
>  net/sctp/input.c  |    7 ++-----
>  net/sctp/socket.c |    8 +++++++-
>  2 files changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/net/sctp/input.c b/net/sctp/input.c
> index f050d45..05994374 100644
> --- a/net/sctp/input.c
> +++ b/net/sctp/input.c
> @@ -752,15 +752,12 @@ static void __sctp_unhash_endpoint(struct sctp_endpoint *ep)
>  
>  	epb = &ep->base;
>  
> -	if (hlist_unhashed(&epb->node))
> -		return;
> -
>  	epb->hashent = sctp_ep_hashfn(epb->bind_addr.port);
>  
>  	head = &sctp_ep_hashtable[epb->hashent];
>  
>  	sctp_write_lock(&head->lock);
> -	__hlist_del(&epb->node);
> +	hlist_del_init(&epb->node);
>  	sctp_write_unlock(&head->lock);
>  }
>  
> @@ -841,7 +838,7 @@ static void __sctp_unhash_established(struct sctp_association *asoc)
>  	head = &sctp_assoc_hashtable[epb->hashent];
>  
>  	sctp_write_lock(&head->lock);
> -	__hlist_del(&epb->node);
> +	hlist_del_init(&epb->node);
>  	sctp_write_unlock(&head->lock);
>  }
>  
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index b3b8a8d..d740db4 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -1231,8 +1231,14 @@ out_free:
>  	SCTP_DEBUG_PRINTK("About to exit __sctp_connect() free asoc: %p"
>  			  " kaddrs: %p err: %d\n",
>  			  asoc, kaddrs, err);
> -	if (asoc)
> +	if (asoc) {
> +		/* sctp_primitive_ASSOCIATE may have added this association
> +		 * To the hash table, try to unhash it, just in case, its a noop
> +		 * if it wasn't hashed so we're safe
> +		 */
> +		sctp_unhash_established(asoc);
>  		sctp_association_free(asoc);
> +	}
>  	return err;
>  }
>  
> -- 
> 1.7.7.6
> 
> 


Damn, self, nak, I missed comitting the bits for __sctp_connect.  I'll repost in
a bit. Sorry.
Neil

^ permalink raw reply

* connecting two or more hotspots at the same time
From: Akhil pm @ 2012-07-16 18:47 UTC (permalink / raw)
  To: netdev

Hi,

I would like to implement a platform to connect two or more wireless
hotspots to a computer.
But when i connected a wired and wireless connection simultaneously on
my laptop either one works at time.what i want to know is

1.is it possible to run two or more connections simultaneously
receiving packets?
2.If not possible then switching the connection between the hotspots
will give more data rate than using a single connection (for example
if i was downloading a file and at the same time i was browsing too,in
that case if i assign two separate  hotspots for downloading and
browsing is it give more output that doing both with a single
connection)?

please do reply

^ permalink raw reply

* [PATCH] sctp: Fix list corruption resulting from freeing an association on a list
From: Neil Horman @ 2012-07-16 18:39 UTC (permalink / raw)
  To: netdev
  Cc: Neil Horman, davej, David S. Miller, Vlad Yasevich,
	Sridhar Samudrala, linux-sctp

A few days ago Dave Jones reported this oops:

[22766.294255] general protection fault: 0000 [#1] PREEMPT SMP
[22766.295376] CPU 0
[22766.295384] Modules linked in:
[22766.387137]  ffffffffa169f292 6b6b6b6b6b6b6b6b ffff880147c03a90
ffff880147c03a74
[22766.387135] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000
[22766.387136] Process trinity-watchdo (pid: 10896, threadinfo ffff88013e7d2000,
[22766.387137] Stack:
[22766.387140]  ffff880147c03a10
[22766.387140]  ffffffffa169f2b6
[22766.387140]  ffff88013ed95728
[22766.387143]  0000000000000002
[22766.387143]  0000000000000000
[22766.387143]  ffff880003fad062
[22766.387144]  ffff88013c120000
[22766.387144]
[22766.387145] Call Trace:
[22766.387145]  <IRQ>
[22766.387150]  [<ffffffffa169f292>] ? __sctp_lookup_association+0x62/0xd0
[sctp]
[22766.387154]  [<ffffffffa169f2b6>] __sctp_lookup_association+0x86/0xd0 [sctp]
[22766.387157]  [<ffffffffa169f597>] sctp_rcv+0x207/0xbb0 [sctp]
[22766.387161]  [<ffffffff810d4da8>] ? trace_hardirqs_off_caller+0x28/0xd0
[22766.387163]  [<ffffffff815827e3>] ? nf_hook_slow+0x133/0x210
[22766.387166]  [<ffffffff815902fc>] ? ip_local_deliver_finish+0x4c/0x4c0
[22766.387168]  [<ffffffff8159043d>] ip_local_deliver_finish+0x18d/0x4c0
[22766.387169]  [<ffffffff815902fc>] ? ip_local_deliver_finish+0x4c/0x4c0
[22766.387171]  [<ffffffff81590a07>] ip_local_deliver+0x47/0x80
[22766.387172]  [<ffffffff8158fd80>] ip_rcv_finish+0x150/0x680
[22766.387174]  [<ffffffff81590c54>] ip_rcv+0x214/0x320
[22766.387176]  [<ffffffff81558c07>] __netif_receive_skb+0x7b7/0x910
[22766.387178]  [<ffffffff8155856c>] ? __netif_receive_skb+0x11c/0x910
[22766.387180]  [<ffffffff810d423e>] ? put_lock_stats.isra.25+0xe/0x40
[22766.387182]  [<ffffffff81558f83>] netif_receive_skb+0x23/0x1f0
[22766.387183]  [<ffffffff815596a9>] ? dev_gro_receive+0x139/0x440
[22766.387185]  [<ffffffff81559280>] napi_skb_finish+0x70/0xa0
[22766.387187]  [<ffffffff81559cb5>] napi_gro_receive+0xf5/0x130
[22766.387218]  [<ffffffffa01c4679>] e1000_receive_skb+0x59/0x70 [e1000e]
[22766.387242]  [<ffffffffa01c5aab>] e1000_clean_rx_irq+0x28b/0x460 [e1000e]
[22766.387266]  [<ffffffffa01c9c18>] e1000e_poll+0x78/0x430 [e1000e]
[22766.387268]  [<ffffffff81559fea>] net_rx_action+0x1aa/0x3d0
[22766.387270]  [<ffffffff810a495f>] ? account_system_vtime+0x10f/0x130
[22766.387273]  [<ffffffff810734d0>] __do_softirq+0xe0/0x420
[22766.387275]  [<ffffffff8169826c>] call_softirq+0x1c/0x30
[22766.387278]  [<ffffffff8101db15>] do_softirq+0xd5/0x110
[22766.387279]  [<ffffffff81073bc5>] irq_exit+0xd5/0xe0
[22766.387281]  [<ffffffff81698b03>] do_IRQ+0x63/0xd0
[22766.387283]  [<ffffffff8168ee2f>] common_interrupt+0x6f/0x6f
[22766.387283]  <EOI>
[22766.387284]
[22766.387285]  [<ffffffff8168eed9>] ? retint_swapgs+0x13/0x1b
[22766.387285] Code: c0 90 5d c3 66 0f 1f 44 00 00 4c 89 c8 5d c3 0f 1f 00 55 48
89 e5 48 83
ec 20 48 89 5d e8 4c 89 65 f0 4c 89 6d f8 66 66 66 66 90 <0f> b7 87 98 00 00 00
48 89 fb
49 89 f5 66 c1 c0 08 66 39 46 02
[22766.387307]
[22766.387307] RIP
[22766.387311]  [<ffffffffa168a2c9>] sctp_assoc_is_match+0x19/0x90 [sctp]
[22766.387311]  RSP <ffff880147c039b0>
[22766.387142]  ffffffffa16ab120
[22766.599537] ---[ end trace 3f6dae82e37b17f5 ]---
[22766.601221] Kernel panic - not syncing: Fatal exception in interrupt

It appears from his analysis and some staring at the code that this is likely
occuring because an association is getting freed while still on the
sctp_assoc_hashtable.  As a result, we get a gpf when traversing the hashtable
while a freed node corrupts part of the list.

Nominally I would think that an mibalanced refcount was responsible for this,
but I can't seem to find any obvious imbalance.  What I did note however was
that the two places where we create an association using
sctp_primitive_ASSOCIATE (__sctp_connect and sctp_sendmsg), have failure paths
which free a newly created association after calling sctp_primitive_ASSOCIATE.
sctp_primitive_ASSOCIATE brings us into the sctp_sf_do_prm_asoc path, which
issues a SCTP_CMD_NEW_ASOC side effect, which in turn adds a new association to
the aforementioned hash table.  the sctp command interpreter that process side
effects has not way to unwind previously processed commands, so freeing the
association from the __sctp_connect or sctp_sendmsg error path would lead to a
freed association remaining on this hash table.

I've fixed this but modifying sctp_[un]hash_established to use hlist_del_init,
which allows us to proerly use hlist_unhashed to check if the node is on a
hashlist safely during a delete.  That in turn alows us to safely call
sctp_unhash_established in the __sctp_connect and sctp_sendmsg error paths
before freeing them, regardles of what the associations state is on the hash
list.

I noted, while I was doing this, that the __sctp_unhash_endpoint was using
hlist_unhsashed in a simmilar fashion, but never nullified any removed nodes
pointers to make that function work properly, so I fixed that up in a simmilar
fashion.

I attempted to test this using a virtual guest running the SCTP_RR test from
netperf in a loop while running the trinity fuzzer, both in a loop.  I wasn't
able to recreate the problem prior to this fix, nor was I able to trigger the
failure after (neither of which I suppose is suprising).  Given the trace above
however, I think its likely that this is what we hit.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: davej@redhat.com
CC: davej@redhat.com
CC: "David S. Miller" <davem@davemloft.net>
CC: Vlad Yasevich <vyasevich@gmail.com>
CC: Sridhar Samudrala <sri@us.ibm.com>
CC: linux-sctp@vger.kernel.org
---
 net/sctp/input.c  |    7 ++-----
 net/sctp/socket.c |    8 +++++++-
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/sctp/input.c b/net/sctp/input.c
index f050d45..05994374 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -752,15 +752,12 @@ static void __sctp_unhash_endpoint(struct sctp_endpoint *ep)
 
 	epb = &ep->base;
 
-	if (hlist_unhashed(&epb->node))
-		return;
-
 	epb->hashent = sctp_ep_hashfn(epb->bind_addr.port);
 
 	head = &sctp_ep_hashtable[epb->hashent];
 
 	sctp_write_lock(&head->lock);
-	__hlist_del(&epb->node);
+	hlist_del_init(&epb->node);
 	sctp_write_unlock(&head->lock);
 }
 
@@ -841,7 +838,7 @@ static void __sctp_unhash_established(struct sctp_association *asoc)
 	head = &sctp_assoc_hashtable[epb->hashent];
 
 	sctp_write_lock(&head->lock);
-	__hlist_del(&epb->node);
+	hlist_del_init(&epb->node);
 	sctp_write_unlock(&head->lock);
 }
 
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index b3b8a8d..d740db4 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1231,8 +1231,14 @@ out_free:
 	SCTP_DEBUG_PRINTK("About to exit __sctp_connect() free asoc: %p"
 			  " kaddrs: %p err: %d\n",
 			  asoc, kaddrs, err);
-	if (asoc)
+	if (asoc) {
+		/* sctp_primitive_ASSOCIATE may have added this association
+		 * To the hash table, try to unhash it, just in case, its a noop
+		 * if it wasn't hashed so we're safe
+		 */
+		sctp_unhash_established(asoc);
 		sctp_association_free(asoc);
+	}
 	return err;
 }
 
-- 
1.7.7.6

^ permalink raw reply related

* Re: [RFC 1/2] PCI-Express Non-Transparent Bridge Support
From: Jon Mason @ 2012-07-16 18:38 UTC (permalink / raw)
  To: chetan loke; +Cc: linux-kernel, netdev, linux-pci, Dave Jiang
In-Reply-To: <CAAsGZS5uMArBVa+Rog1H60FVfKq84eqt5G2dvafDWXTP-RdiwQ@mail.gmail.com>

On Mon, Jul 16, 2012 at 12:49:39PM -0400, chetan loke wrote:
> Hi Jon,
> 
> On Fri, Jul 13, 2012 at 5:44 PM, Jon Mason <jon.mason@intel.com> wrote:
> 
> Just a few minor comments/questions:
> 
> ....
> 
> > +struct ntb_transport_qp {
> > +       struct ntb_device *ndev;
> > +
> > +       bool client_ready;
> > +       bool qp_link;
> > +       u8 qp_num;      /* Only 64 QP's are allowed.  0-63 */
> > +
> > +       void (*tx_handler) (struct ntb_transport_qp *qp);
> > +       struct tasklet_struct tx_work;
> 
> Is it ok to rename the following vars for convenience sake?
> 
> > +       struct list_head txq;
> tx_pend_q - (pending_queue) or tx_out_q - (outstanding_queue) - or
> pick any new string you like - other than a mono-syllable definition
> 
> > +       struct list_head txc;
> tx_compl_q - completion queue
> 
> > +       struct list_head txe;
> tx_avail_e - available entry queue
> 
> 
> > +       spinlock_t txq_lock;
> > +       spinlock_t txc_lock;
> > +       spinlock_t txe_lock;
> 
> then match the variants accordingly.
> 
> > +       struct list_head rxq;
> > +       struct list_head rxc;
> > +       struct list_head rxe;
> > +       spinlock_t rxq_lock;
> > +       spinlock_t rxc_lock;
> > +       spinlock_t rxe_lock;
> 
> similarly the rx-counterpart

Are they difficult to understand?  I can change them, but it seems rather moot since you seemed to immediately understand the logic behind the names.

> 
> 
> ..................
> 
> > +static void ntb_tx_copy_task(struct ntb_transport_qp *qp,
> > +                            struct ntb_queue_entry *entry,
> > +                            void *offset)
> > +{
> > +       struct ntb_payload_header *hdr = offset;
> > +       int rc;
> > +
> > +       offset += sizeof(struct ntb_payload_header);
> > +       memcpy_toio(offset, entry->buf, entry->len);
> > +
> > +       hdr->len = entry->len;
> > +       hdr->ver = qp->tx_pkts;
> > +
> > +       /* Ensure that the data is fully copied out before setting the flag */
> > +       wmb();
> > +       hdr->flags = entry->flags | DESC_DONE_FLAG;
> > +
> > +       rc = ntb_ring_sdb(qp->ndev, qp->qp_num);
> > +       if (rc)
> > +               pr_err("%s: error ringing db %d\n", __func__, qp->qp_num);
> > +
> > +       if (entry->len > 0) {
> 
> how do you interpret this len variable and decide if it's a good/bad completion?

The length of 0 is for messages from the remote system, which currently only consists of a "link down" message notifying the local system to no longer send any messages to the remote side.

> 
> > +               qp->tx_bytes += entry->len;
> > +
> > +               /* Add fully transmitted data to completion queue */
> > +               ntb_list_add_tail(&qp->txc_lock, &entry->entry, &qp->txc);
> > +
> > +               if (qp->tx_handler)
> > +                       qp->tx_handler(qp);
> > +       } else
> > +               ntb_list_add_tail(&qp->txe_lock, &entry->entry, &qp->txe);
> 
> I could be wrong but how is the original skb handled if the code path
> goes in the else clause?

There is no skb is the length is zero.  Since the only client is virtual ethernet, it will always be > 60.  However, I should add a sanity check for 0 length in tx_enqueue.

> Also, in the else clause, how about a ntb_list_add_head rather than a _tail.

Why add to the head, it was just used?

> > +
> > +static int ntb_process_tx(struct ntb_transport_qp *qp,
> > +                         struct ntb_queue_entry *entry)
> > +{
> > +       struct ntb_payload_header *hdr;
> > +       void *offset;
> > +
> > +       offset = qp->tx_offset;
> > +       hdr = offset;
> > +
> > +       pr_debug("%lld - offset %p, tx %p, entry len %d flags %x buff %p\n",
> > +                qp->tx_pkts, offset, qp->tx_offset, entry->len, entry->flags,
> > +                entry->buf);
> > +       if (hdr->flags) {
> > +               ntb_list_add_head(&qp->txq_lock, &entry->entry, &qp->txq);
> > +               qp->tx_ring_full++;
> > +               return -EAGAIN;
> > +       }
> > +
> > +       if (entry->len > transport_mtu) {
> > +               pr_err("Trying to send pkt size of %d\n", entry->len);
> > +               entry->flags = HW_ERROR_FLAG;
> > +
> > +               ntb_list_add_tail(&qp->txc_lock, &entry->entry, &qp->txc);
> > +
> > +               if (qp->tx_handler)
> > +                       qp->tx_handler(qp);
> > +
> > +               return 0;
> > +       }
> > +
> > +       ntb_tx_copy_task(qp, entry, offset);
> 
> what happens when ntb_sdb_ring returns an error? would you still want
> to increment tx_pkts below?

It's not fatal if the remote side isn't notified.  It will hurt latency, since the next packet would be the one that triggers the next interrupt.  Also, the only way it could ever fail would be if it was an invalid interrupt bit being set, which should never happen.

> > +
> > +       qp->tx_offset =
> > +           (qp->tx_offset +
> > +            ((transport_mtu + sizeof(struct ntb_payload_header)) * 2) >=
> > +            qp->tx_mw_end) ? qp->tx_mw_begin : qp->tx_offset + transport_mtu +
> > +           sizeof(struct ntb_payload_header);
> > +
> > +       qp->tx_pkts++;
> > +
> > +       return 0;
> > +}
> > +
> 
> ........................
> 
> 
> > +void *ntb_transport_tx_dequeue(struct ntb_transport_qp *qp, unsigned int *len)
> > +{
> > +       struct ntb_queue_entry *entry;
> > +       void *buf;
> > +
> > +       if (!qp)
> > +               return NULL;
> > +
> > +       entry = ntb_list_rm_head(&qp->txc_lock, &qp->txc);
> > +       if (!entry)
> > +               return NULL;
> > +
> > +       buf = entry->callback_data;
> > +       if (entry->flags != HW_ERROR_FLAG)
> > +               *len = entry->len;
> > +       else
> > +               *len = -EIO;
> > +
> > +       ntb_list_add_tail(&qp->txe_lock, &entry->entry, &qp->txe);
> 
> how about a ntb_list_add_head?

Is there a benefit to adding to the head versus tail?  It makes more sense to me to pull from the head and add to the tail.

> 
> 
> Chetan Loke

^ permalink raw reply

* connecting two or more hotspots at the same time
From: Akhil pm @ 2012-07-16 18:37 UTC (permalink / raw)
  To: netdev



^ permalink raw reply

* Re: [RFC 1/2] PCI-Express Non-Transparent Bridge Support
From: Greg KH @ 2012-07-16 18:30 UTC (permalink / raw)
  To: Jon Mason; +Cc: linux-kernel, netdev, linux-pci, Dave Jiang
In-Reply-To: <20120716175505.GF9598@jonmason-lab>

On Mon, Jul 16, 2012 at 10:55:06AM -0700, Jon Mason wrote:
> On Sun, Jul 15, 2012 at 05:19:21PM -0700, Greg KH wrote:
> > On Sun, Jul 15, 2012 at 04:55:48PM -0700, Jon Mason wrote:
> > > On Sat, Jul 14, 2012 at 10:10:15AM -0700, Greg KH wrote:
> > > > On Fri, Jul 13, 2012 at 02:44:59PM -0700, Jon Mason wrote:
> > > > > +static int max_num_cbs = 2;
> > > > > +module_param(max_num_cbs, uint, 0644);
> > > > > +MODULE_PARM_DESC(max_num_cbs, "Maximum number of NTB transport connections");
> > > > > +
> > > > > +static bool no_msix;
> > > > > +module_param(no_msix, bool, 0644);
> > > > > +MODULE_PARM_DESC(no_msix, "Do not allow MSI-X interrupts to be selected");
> > > > 
> > > > How would a user, or a distro, know to set these options?  Why are they
> > > > even options at all?
> > > 
> > > Good question.  There is actually a potential benefit to disabling
> > > MSI-X.  The NTB device on one of our platforms only has 3 MSI-X
> > > vectors.  In the current driver design, that would limit them to 3
> > > client/virtual devices.  However, there are 15bits in the ISR that can
> > > be used for the same purpose.  So, if you disable MSI-X, you can have
> > > 15 instead of 3.  
> > 
> > But again, how would a user, or a distro, know to set these?  Where is
> > the documentation describing it?  Why really have these options at all
> > and not just fix the platform issues (only 3 MSI-X vectors?  Really?)
> 
> I believe we'll want multiple clients (or have multiqueue Ethernet).
> I'm happy to add something to /Documentation to describe it and why it
> would be useful, or I can remove it and re-introduce it when I add
> multiqueue Ethernet.

I'd suggest waiting and adding it later if really needed (see previous
comment about not adding code/features before they are actually needed.)

thanks,

greg k-h

^ permalink raw reply

* Re: [RFC 1/2] PCI-Express Non-Transparent Bridge Support
From: chetan loke @ 2012-07-16 18:26 UTC (permalink / raw)
  To: Jon Mason; +Cc: linux-kernel, netdev, linux-pci, Dave Jiang
In-Reply-To: <1342215900-3358-1-git-send-email-jon.mason@intel.com>

Jon,

On Fri, Jul 13, 2012 at 5:44 PM, Jon Mason <jon.mason@intel.com> wrote:

..............

> +/**
> + * ntb_ring_sdb() - Set the doorbell on the secondary/external side
> + * @ndev: pointer to ntb_device instance
> + * @db: doorbell to ring
> + *
> + * This function allows triggering of a doorbell on the secondary/external
> + * side that will initiate an interrupt on the remote host
> + *
> + * RETURNS: An appropriate -ERRNO error value on error, or zero for success.
> + */
> +int ntb_ring_sdb(struct ntb_device *ndev, unsigned int db)
> +{
> +       dev_dbg(&ndev->pdev->dev, "%s: ringing doorbell %d\n", __func__, db);
> +

> +       if (db >= ndev->max_cbs)
> +               return -EINVAL;

How about moving this max_cbs error check in the upper level
callers(example in ntb_process_tx)?
That way you won't have to defer handling some negative cases all the
way till the end.

So ntb_process_tx could now look like:

.....
error=0;
if (entry->len > transport_mtu) {
...
error=1;
}
else if (qp->qp_num >= qp->ndev->max_cbs) {
...
error=1;
}

if (unlikely(error)) {
      ntb_list_add_tail(&qp->txc_lock, &entry->entry, &qp->txc);
      if (qp->tx_handler)
           qp->tx_handler(qp);

      return 0;
}
.................

No further comments below

> +
> +static int ntb_process_tx(struct ntb_transport_qp *qp,
> +                         struct ntb_queue_entry *entry)
> +{
> +       struct ntb_payload_header *hdr;
> +       void *offset;
> +
> +       offset = qp->tx_offset;
> +       hdr = offset;
> +
> +       pr_debug("%lld - offset %p, tx %p, entry len %d flags %x buff %p\n",
> +                qp->tx_pkts, offset, qp->tx_offset, entry->len, entry->flags,
> +                entry->buf);
> +       if (hdr->flags) {
> +               ntb_list_add_head(&qp->txq_lock, &entry->entry, &qp->txq);
> +               qp->tx_ring_full++;
> +               return -EAGAIN;
> +       }
> +
> +       if (entry->len > transport_mtu) {
> +               pr_err("Trying to send pkt size of %d\n", entry->len);
> +               entry->flags = HW_ERROR_FLAG;
> +
> +               ntb_list_add_tail(&qp->txc_lock, &entry->entry, &qp->txc);
> +
> +               if (qp->tx_handler)
> +                       qp->tx_handler(qp);
> +
> +               return 0;
> +       }
> +
> +       ntb_tx_copy_task(qp, entry, offset);
> +
> +       qp->tx_offset =
> +           (qp->tx_offset +
> +            ((transport_mtu + sizeof(struct ntb_payload_header)) * 2) >=
> +            qp->tx_mw_end) ? qp->tx_mw_begin : qp->tx_offset + transport_mtu +
> +           sizeof(struct ntb_payload_header);
> +
> +       qp->tx_pkts++;
> +
> +       return 0;
> +}
> +


Chetan Loke

^ permalink raw reply

* Re: [RFC 1/2] PCI-Express Non-Transparent Bridge Support
From: Jon Mason @ 2012-07-16 17:55 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, netdev, linux-pci, Dave Jiang
In-Reply-To: <20120716001921.GA19775@kroah.com>

On Sun, Jul 15, 2012 at 05:19:21PM -0700, Greg KH wrote:
> On Sun, Jul 15, 2012 at 04:55:48PM -0700, Jon Mason wrote:
> > On Sat, Jul 14, 2012 at 10:10:15AM -0700, Greg KH wrote:
> > > On Fri, Jul 13, 2012 at 02:44:59PM -0700, Jon Mason wrote:
> > > > +static int max_num_cbs = 2;
> > > > +module_param(max_num_cbs, uint, 0644);
> > > > +MODULE_PARM_DESC(max_num_cbs, "Maximum number of NTB transport connections");
> > > > +
> > > > +static bool no_msix;
> > > > +module_param(no_msix, bool, 0644);
> > > > +MODULE_PARM_DESC(no_msix, "Do not allow MSI-X interrupts to be selected");
> > > 
> > > How would a user, or a distro, know to set these options?  Why are they
> > > even options at all?
> > 
> > Good question.  There is actually a potential benefit to disabling
> > MSI-X.  The NTB device on one of our platforms only has 3 MSI-X
> > vectors.  In the current driver design, that would limit them to 3
> > client/virtual devices.  However, there are 15bits in the ISR that can
> > be used for the same purpose.  So, if you disable MSI-X, you can have
> > 15 instead of 3.  
> 
> But again, how would a user, or a distro, know to set these?  Where is
> the documentation describing it?  Why really have these options at all
> and not just fix the platform issues (only 3 MSI-X vectors?  Really?)

I believe we'll want multiple clients (or have multiqueue Ethernet).  I'm happy to add something to /Documentation to describe it and why it would be useful, or I can remove it and re-introduce it when I add multiqueue Ethernet.

3 MSI-X vectors (plus one for PCI-E link up/down) on Xeon NTB, and 33 for Atom NTB.  Yeah, really.

> 
> thanks,
> 
> greg k-h

^ permalink raw reply

* Re: [PATCH v6 4/7] net, ethernet, davinci_emac: add OF support
From: Sekhar Nori @ 2012-07-16 17:27 UTC (permalink / raw)
  To: David Miller
  Cc: Heiko Schocher, davinci-linux-open-source, linux-arm-kernel,
	devicetree-discuss, netdev, Grant Likely, Wolfgang Denk,
	Anatoly Sivov
In-Reply-To: <1341823456-32297-1-git-send-email-hs@denx.de>

Hi Dave,

On 7/9/2012 2:14 PM, Heiko Schocher wrote:
> add of support for the davinci_emac driver.
> 
> Signed-off-by: Heiko Schocher <hs@denx.de>
> Acked-by: Sekhar Nori <nsekhar@ti.com>
> Cc: davinci-linux-open-source@linux.davincidsp.com
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: devicetree-discuss@lists.ozlabs.org
> Cc: netdev@vger.kernel.org
> Cc: Grant Likely <grant.likely@secretlab.ca>
> Cc: Sekhar Nori <nsekhar@ti.com>
> Cc: Wolfgang Denk <wd@denx.de>
> Cc: Anatoly Sivov <mm05@mail.ru>
> Cc: David Miller <davem@davemloft.net>

Can you please consider this patch for v3.6? I tested it on DaVinci
AM18x EVM with and without CONFIG_OF using NFS root.

This patch can be independently queued and does not have any dependencies.

Thanks,
Sekhar

^ permalink raw reply

* Re: [PATCH] mlx4_en: map entire pages to increase throughput
From: Rick Jones @ 2012-07-16 17:27 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	yevgenyp@mellanox.co.il, ogerlitz@mellanox.com,
	amirv@mellanox.com, brking@linux.vnet.ibm.com,
	leitao@linux.vnet.ibm.com, klebers@linux.vnet.ibm.com
In-Reply-To: <1342458113-10384-1-git-send-email-cascardo@linux.vnet.ibm.com>

On 07/16/2012 10:01 AM, Thadeu Lima de Souza Cascardo wrote:
> In its receive path, mlx4_en driver maps each page chunk that it pushes
> to the hardware and unmaps it when pushing it up the stack. This limits
> throughput to about 3Gbps on a Power7 8-core machine.

That seems rather extraordinarily low - Power7 is supposed to be a 
rather high performance CPU.  The last time I noticed O(3Gbit/s) on 10G 
for bulk transfer was before the advent of LRO/GRO - that was in the x86 
space though.  Is mapping really that expensive with Power7?


> One solution is to map the entire allocated page at once. However, this
> requires that we keep track of every page fragment we give to a
> descriptor. We also need to work with the discipline that all fragments will
> be released (in the sense that it will not be reused by the driver
> anymore) in the order they are allocated to the driver.
>
> This requires that we don't reuse any fragments, every single one of
> them must be reallocated. We do that by releasing all the fragments that
> are processed and only after finished processing the descriptors, we
> start the refill.
>
> We also must somehow guarantee that we either refill all fragments in a
> descriptor or none at all, without resorting to giving up a page
> fragment that we would have already given. Otherwise, we would break the
> discipline of only releasing the fragments in the order they were
> allocated.
>
> This has passed page allocation fault injections (restricted to the
> driver by using required-start and required-end) and device hotplug
> while 16 TCP streams were able to deliver more than 9Gbps.

What is the effect on packet-per-second performance?  (eg aggregate, 
burst-mode netperf TCP_RR with TCP_NODELAY set or perhaps UDP_RR)

rick jones

^ permalink raw reply

* Re: [PATCH] net-next: make sock diag per-namespace
From: Jan Ceuleers @ 2012-07-16 17:26 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: Eric Dumazet, Andrew Vagin, David S. Miller, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy, Pavel Emelianov,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20120716123815.GA1178@avaginn.sw.ru>

On 07/16/2012 02:38 PM, Andrew Vagin wrote:
> You are right. Sorry for this stupid fault. I will send a new patch.

Before doing so:

Could you put the "net-next" inside the square brackets (being the tree you are aiming your patch at, which should not end up in the commit log), and mention the subsystem in its place? Perhaps simply "net"?

Just a suggestion.

Thanks, Jan

^ permalink raw reply

* Re: 3.4.4/amd64 full interrupt hangs under big nfs copies
From: Marc MERLIN @ 2012-07-16 17:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Larry.Finger, bhutchings, linux-wireless, netdev
In-Reply-To: <1342455717.2830.14.camel@edumazet-glaptop>

On Mon, Jul 16, 2012 at 06:21:57PM +0200, Eric Dumazet wrote:
> > No, it's atually when I'm 'uploading' from my laptop to my server.
> > One interesting thing is that my server is running lvm2 with snapshots,
> > which makes writes slower than my laptop can push data over the network, so
> > it's definitely causing buffers to fill up.
> > I just did a download test and got 4.5MB/s sustained without problems.
> 
> Hmm, nfs apparently is able to push lot of data, try to reduce
> rsize/wsize to sane values, like 32K instead of 512K ?
> 
> gargamel:/mnt/dshelf2/ /net/gargamel/mnt/dshelf2 nfs4
> rw,nosuid,nodev,relatime,vers=4.0,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.205.7,local_lock=none,addr=192.168.205.3 0 0

Nice catch. That seems like an excessive default from autofs5 5.0.4-3.2+b1

So, it helped. I still got hangs, but this time they were VFS hangs. I
couldn't do anything filesystem related durign the 'hangs', but the
interrupts weren't hung anymore, so I could move my mouse cursor.

Having NFS hang all of VFS and local disk is obviously still a problem, but
at this point it may not be a networking (or wireless) related problem.

I'll attach the relevant logs during that attempt. Does that help?

Thanks,
Marc

[76903.011101] SysRq : Show Blocked State
[76903.011110]   task                        PC stack   pid father
[76903.011306] mc              D ffff88021e2d3680     0  9383   9270 0x00000080
[76903.011314]  ffff880111094100 0000000000000082 000000000000000e ffff880213549140
[76903.011322]  0000000000013680 ffff8800140e3fd8 ffff8800140e3fd8 ffff880111094100
[76903.011328]  ffff88021e5c5258 0000000000000000 ffff880111094100 ffff8800140e3e40
[76903.011335] Call Trace:
[76903.011362]  [<ffffffffa06dcdf2>] ? nfs_find_actor+0x66/0x66 [nfs]
[76903.011376]  [<ffffffffa06dce4d>] ? nfs_wait_bit_killable+0x5b/0x6e [nfs]
[76903.011384]  [<ffffffff81360f55>] ? __wait_on_bit_lock+0x3c/0x85
[76903.011391]  [<ffffffff810bb793>] ? filemap_fdatawait_range+0x11b/0x139
[76903.011397]  [<ffffffff8136100d>] ? out_of_line_wait_on_bit_lock+0x6f/0x78
[76903.011410]  [<ffffffffa06dcdf2>] ? nfs_find_actor+0x66/0x66 [nfs]
[76903.011417]  [<ffffffff81052e69>] ? autoremove_wake_function+0x2a/0x2a
[76903.011435]  [<ffffffffa06e8ca2>] ? nfs_commit_inode+0x66/0x27a [nfs]
[76903.011448]  [<ffffffffa06db56e>] ? nfs_file_fsync+0x95/0xf3 [nfs]
[76903.011455]  [<ffffffff811015a9>] ? filp_close+0x3b/0x6a
[76903.011461]  [<ffffffff8110165e>] ? sys_close+0x86/0xc7
[76903.011467]  [<ffffffff8136723d>] ? system_call_fastpath+0x1a/0x1f
[76903.011482] kworker/0:0     D ffff88021e213680     0 13850      2 0x00000080
[76903.011489]  ffff8801fac7d850 0000000000000046 ffff8802117cb848 ffff880140773750
[76903.011495]  0000000000013680 ffff88004c4e7fd8 ffff88004c4e7fd8 ffff8801fac7d850
[76903.011502]  ffff88021e5df9a0 0000000000000000 ffff8801fac7d850 ffffffffa069be59
[76903.011508] Call Trace:
[76903.011524]  [<ffffffffa069be59>] ? rpc_make_runnable+0x6a/0x6a [sunrpc]
[76903.011535]  [<ffffffffa069beb2>] ? rpc_wait_bit_killable+0x59/0x6c [sunrpc]
[76903.011541]  [<ffffffff81361054>] ? __wait_on_bit+0x3e/0x71
[76903.011547]  [<ffffffff81362b73>] ? _raw_spin_unlock_irqrestore+0x30/0x3e
[76903.011553]  [<ffffffff813610f6>] ? out_of_line_wait_on_bit+0x6f/0x78
[76903.011565]  [<ffffffffa069be59>] ? rpc_make_runnable+0x6a/0x6a [sunrpc]
[76903.011570]  [<ffffffff81052e69>] ? autoremove_wake_function+0x2a/0x2a
[76903.011587]  [<ffffffffa06e7bdf>] ? nfs_initiate_commit+0xf4/0x105 [nfs]
[76903.011604]  [<ffffffffa06e8e30>] ? nfs_commit_inode+0x1f4/0x27a [nfs]
[76903.011617]  [<ffffffffa06db97c>] ? nfs_release_page+0x56/0x73 [nfs]
[76903.011626]  [<ffffffff810ca356>] ? shrink_page_list+0x556/0x739
[76903.011635]  [<ffffffff8105dd51>] ? get_parent_ip+0x9/0x1b
[76903.011640]  [<ffffffff8136583e>] ? sub_preempt_count+0x83/0x94
[76903.011646]  [<ffffffff810c91eb>] ? update_isolated_counts.isra.44+0x148/0x16e
[76903.011653]  [<ffffffff810ca9a3>] ? shrink_inactive_list+0x2b1/0x446
[76903.011661]  [<ffffffff810cb182>] ? shrink_mem_cgroup_zone+0x371/0x480
[76903.011668]  [<ffffffff810cb2f3>] ? shrink_zone+0x62/0x9b
[76903.011675]  [<ffffffff810cb73c>] ? do_try_to_free_pages+0x1e4/0x434
[76903.011682]  [<ffffffff810cbc11>] ? try_to_free_pages+0xb3/0xf9
[76903.011688]  [<ffffffff8105931b>] ? should_resched+0x5/0x23
[76903.011695]  [<ffffffff810c24a2>] ? __alloc_pages_nodemask+0x4ef/0x7df
[76903.011702]  [<ffffffff8105dd51>] ? get_parent_ip+0x9/0x1b
[76903.011711]  [<ffffffff810ecf10>] ? alloc_pages_current+0xc7/0xe4
[76903.011723]  [<ffffffffa04ca247>] ? iwlagn_rx_allocate+0x97/0x24d [iwlwifi]
[76903.011734]  [<ffffffffa04ca81e>] ? iwlagn_rx_replenish+0x3a/0x3a [iwlwifi]
[76903.011744]  [<ffffffffa04ca7fc>] ? iwlagn_rx_replenish+0x18/0x3a [iwlwifi]
[76903.011750]  [<ffffffff8104ea7d>] ? process_one_work+0x16d/0x298
[76903.011757]  [<ffffffff8104f4d9>] ? worker_thread+0xc2/0x145
[76903.011763]  [<ffffffff8104f417>] ? manage_workers.isra.23+0x15b/0x15b
[76903.011768]  [<ffffffff81052788>] ? kthread+0x7d/0x85
[76903.011774]  [<ffffffff813686a4>] ? kernel_thread_helper+0x4/0x10
[76903.011780]  [<ffffffff8105270b>] ? kthread_freezable_should_stop+0x37/0x37
[76903.011786]  [<ffffffff813686a0>] ? gs_change+0x13/0x13
[76903.011797] Sched Debug Version: v0.10, 3.4.4-amd64-preempt-noide-20120410 #1

and

[76843.153742] 
[76873.080978] SysRq : Show Blocked State
[76873.080987]   task                        PC stack   pid father
[76873.081200] mc              D ffff88021e293680     0  9383   9270 0x00000080
[76873.081208]  ffff880111094100 0000000000000082 0000000000000001 ffff8802135107d0
[76873.081216]  0000000000013680 ffff8800140e3fd8 ffff8800140e3fd8 ffff880111094100
[76873.081222]  ffff88010c9033d0 ffff88021e293680 ffff880111094100 ffffffff810bb429
[76873.081229] Call Trace:
[76873.081241]  [<ffffffff810bb429>] ? __lock_page+0x66/0x66
[76873.081249]  [<ffffffff81362059>] ? io_schedule+0x55/0x6b
[76873.081254]  [<ffffffff810bb42f>] ? sleep_on_page+0x6/0xa
[76873.081260]  [<ffffffff81361054>] ? __wait_on_bit+0x3e/0x71
[76873.081265]  [<ffffffff810bb577>] ? wait_on_page_bit+0x6e/0x73
[76873.081272]  [<ffffffff81052e69>] ? autoremove_wake_function+0x2a/0x2a
[76873.081278]  [<ffffffff810bb6ec>] ? filemap_fdatawait_range+0x74/0x139
[76873.081285]  [<ffffffff810bc2e8>] ? filemap_write_and_wait_range+0x3b/0x4d
[76873.081308]  [<ffffffffa06db536>] ? nfs_file_fsync+0x5d/0xf3 [nfs]
[76873.081317]  [<ffffffff811015a9>] ? filp_close+0x3b/0x6a
[76873.081323]  [<ffffffff8110165e>] ? sys_close+0x86/0xc7
[76873.081330]  [<ffffffff8136723d>] ? system_call_fastpath+0x1a/0x1f
[76873.081346] kworker/0:0     D ffff88021e213680     0 13850      2 0x00000080
[76873.081352]  ffff8801fac7d850 0000000000000046 ffff880186753ce8 ffff880126d7f040
[76873.081358]  0000000000013680 ffff88004c4e7fd8 ffff88004c4e7fd8 ffff8801fac7d850
[76873.081365]  ffff8801c5ae1d70 ffff88021e213680 ffff8801fac7d850 ffffffff810bb429
[76873.081371] Call Trace:
[76873.081376]  [<ffffffff810bb429>] ? __lock_page+0x66/0x66
[76873.081381]  [<ffffffff81362059>] ? io_schedule+0x55/0x6b
[76873.081386]  [<ffffffff810bb42f>] ? sleep_on_page+0x6/0xa
[76873.081391]  [<ffffffff81361054>] ? __wait_on_bit+0x3e/0x71
[76873.081396]  [<ffffffff810bb577>] ? wait_on_page_bit+0x6e/0x73
[76873.081402]  [<ffffffff81052e69>] ? autoremove_wake_function+0x2a/0x2a
[76873.081411]  [<ffffffff810c9f66>] ? shrink_page_list+0x166/0x739
[76873.081420]  [<ffffffff8105dd51>] ? get_parent_ip+0x9/0x1b
[76873.081425]  [<ffffffff8136583e>] ? sub_preempt_count+0x83/0x94
[76873.081431]  [<ffffffff810c91eb>] ? update_isolated_counts.isra.44+0x148/0x16e
[76873.081438]  [<ffffffff810ca9a3>] ? shrink_inactive_list+0x2b1/0x446
[76873.081446]  [<ffffffff810cb182>] ? shrink_mem_cgroup_zone+0x371/0x480
[76873.081454]  [<ffffffff810cb2f3>] ? shrink_zone+0x62/0x9b
[76873.081460]  [<ffffffff810cb73c>] ? do_try_to_free_pages+0x1e4/0x434
[76873.081467]  [<ffffffff810cbc11>] ? try_to_free_pages+0xb3/0xf9
[76873.081473]  [<ffffffff8105931b>] ? should_resched+0x5/0x23
[76873.081481]  [<ffffffff810c24a2>] ? __alloc_pages_nodemask+0x4ef/0x7df
[76873.081487]  [<ffffffff8105dd51>] ? get_parent_ip+0x9/0x1b
[76873.081497]  [<ffffffff810ecf10>] ? alloc_pages_current+0xc7/0xe4
[76873.081510]  [<ffffffffa04ca247>] ? iwlagn_rx_allocate+0x97/0x24d [iwlwifi]
[76873.081521]  [<ffffffffa04ca81e>] ? iwlagn_rx_replenish+0x3a/0x3a [iwlwifi]
[76873.081530]  [<ffffffffa04ca7fc>] ? iwlagn_rx_replenish+0x18/0x3a [iwlwifi]
[76873.081538]  [<ffffffff8104ea7d>] ? process_one_work+0x16d/0x298
[76873.081545]  [<ffffffff8104f4d9>] ? worker_thread+0xc2/0x145
[76873.081551]  [<ffffffff8104f417>] ? manage_workers.isra.23+0x15b/0x15b
[76873.081556]  [<ffffffff81052788>] ? kthread+0x7d/0x85
[76873.081562]  [<ffffffff813686a4>] ? kernel_thread_helper+0x4/0x10
[76873.081568]  [<ffffffff8105270b>] ? kthread_freezable_should_stop+0x37/0x37
[76873.081574]  [<ffffffff813686a0>] ? gs_change+0x13/0x13
[76873.081585] 192.168.205.3-m D ffff88021e293680     0 14532      2 0x00000080
[76873.081590]  ffff880206d600c0 0000000000000046 ffff880186733e60 ffff88004b4230c0
[76873.081597]  0000000000013680 ffff880022305fd8 ffff880022305fd8 ffff880206d600c0
[76873.081603]  ffff88021e5bb778 0000000000000000 ffff880206d600c0 ffffffffa069be59
[76873.081609] Call Trace:
[76873.081625]  [<ffffffffa069be59>] ? rpc_make_runnable+0x6a/0x6a [sunrpc]
[76873.081637]  [<ffffffffa069beb2>] ? rpc_wait_bit_killable+0x59/0x6c [sunrpc]
[76873.081642]  [<ffffffff81361054>] ? __wait_on_bit+0x3e/0x71
[76873.081648]  [<ffffffff81362b73>] ? _raw_spin_unlock_irqrestore+0x30/0x3e
[76873.081654]  [<ffffffff813610f6>] ? out_of_line_wait_on_bit+0x6f/0x78
[76873.081665]  [<ffffffffa069be59>] ? rpc_make_runnable+0x6a/0x6a [sunrpc]
[76873.081671]  [<ffffffff81052e69>] ? autoremove_wake_function+0x2a/0x2a
[76873.081690]  [<ffffffffa06efb13>] ? nfs4_run_open_task+0x101/0x12e [nfs]
[76873.081709]  [<ffffffffa06f12fb>] ? nfs4_open_recover_helper+0xbd/0x13f [nfs]
[76873.081724]  [<ffffffffa06f13e1>] ? nfs4_open_recover+0x64/0x113 [nfs]
[76873.081740]  [<ffffffffa06f36a2>] ? nfs4_open_expired+0x69/0xc4 [nfs]
[76873.081761]  [<ffffffffa06ff5b8>] ? nfs4_do_reclaim+0x109/0x4a0 [nfs]
[76873.081779]  [<ffffffffa06fe7cb>] ? nfs4_state_clear_reclaim_reboot.part.7+0xf6/0x10a [nfs]
[76873.081797]  [<ffffffffa06ffcb2>] ? nfs4_run_state_manager+0x363/0x52e [nfs]
[76873.081814]  [<ffffffffa06ff94f>] ? nfs4_do_reclaim+0x4a0/0x4a0 [nfs]
[76873.081819]  [<ffffffff81052788>] ? kthread+0x7d/0x85
[76873.081825]  [<ffffffff813686a4>] ? kernel_thread_helper+0x4/0x10
[76873.081830]  [<ffffffff8105270b>] ? kthread_freezable_should_stop+0x37/0x37
[76873.081836]  [<ffffffff813686a0>] ? gs_change+0x13/0x13
[76873.081842] Sched Debug Version: v0.10, 3.4.4-amd64-preempt-noide-20120410 #1
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply

* [PATCH] mlx4_en: map entire pages to increase throughput
From: Thadeu Lima de Souza Cascardo @ 2012-07-16 17:01 UTC (permalink / raw)
  To: davem
  Cc: netdev, yevgenyp, ogerlitz, amirv, brking, leitao, klebers,
	Thadeu Lima de Souza Cascardo

In its receive path, mlx4_en driver maps each page chunk that it pushes
to the hardware and unmaps it when pushing it up the stack. This limits
throughput to about 3Gbps on a Power7 8-core machine.

One solution is to map the entire allocated page at once. However, this
requires that we keep track of every page fragment we give to a
descriptor. We also need to work with the discipline that all fragments will
be released (in the sense that it will not be reused by the driver
anymore) in the order they are allocated to the driver.

This requires that we don't reuse any fragments, every single one of
them must be reallocated. We do that by releasing all the fragments that
are processed and only after finished processing the descriptors, we
start the refill.

We also must somehow guarantee that we either refill all fragments in a
descriptor or none at all, without resorting to giving up a page
fragment that we would have already given. Otherwise, we would break the
discipline of only releasing the fragments in the order they were
allocated.

This has passed page allocation fault injections (restricted to the
driver by using required-start and required-end) and device hotplug
while 16 TCP streams were able to deliver more than 9Gbps.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   |  237 ++++++++++++++------------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    3 +-
 2 files changed, 131 insertions(+), 109 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index a04cbf7..37ac073 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -41,41 +41,75 @@
 
 #include "mlx4_en.h"
 
-
-static int mlx4_en_alloc_frag(struct mlx4_en_priv *priv,
-			      struct mlx4_en_rx_desc *rx_desc,
-			      struct page_frag *skb_frags,
-			      struct mlx4_en_rx_alloc *ring_alloc,
-			      int i)
+static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
+			       struct mlx4_en_rx_desc *rx_desc,
+			       struct mlx4_en_rx_alloc *frags,
+			       struct mlx4_en_rx_alloc *ring_alloc)
 {
-	struct mlx4_en_frag_info *frag_info = &priv->frag_info[i];
-	struct mlx4_en_rx_alloc *page_alloc = &ring_alloc[i];
+	struct mlx4_en_rx_alloc page_alloc[MLX4_EN_MAX_RX_FRAGS];
+	struct mlx4_en_frag_info *frag_info;
 	struct page *page;
 	dma_addr_t dma;
+	int i;
 
-	if (page_alloc->offset == frag_info->last_offset) {
-		/* Allocate new page */
-		page = alloc_pages(GFP_ATOMIC | __GFP_COMP, MLX4_EN_ALLOC_ORDER);
-		if (!page)
-			return -ENOMEM;
-
-		skb_frags[i].page = page_alloc->page;
-		skb_frags[i].offset = page_alloc->offset;
-		page_alloc->page = page;
-		page_alloc->offset = frag_info->frag_align;
-	} else {
-		page = page_alloc->page;
-		get_page(page);
+	for (i = 0; i < priv->num_frags; i++) {
+		frag_info = &priv->frag_info[i];
+		if (ring_alloc[i].offset == frag_info->last_offset) {
+			page = alloc_pages(GFP_ATOMIC | __GFP_COMP,
+					MLX4_EN_ALLOC_ORDER);
+			if (!page)
+				goto out;
+			dma = dma_map_page(priv->ddev, page, 0,
+				MLX4_EN_ALLOC_SIZE, PCI_DMA_FROMDEVICE);
+			if (dma_mapping_error(priv->ddev, dma)) {
+				put_page(page);
+				goto out;
+			}
+			page_alloc[i].page = page;
+			page_alloc[i].dma = dma;
+			page_alloc[i].offset = frag_info->frag_align;
+		} else {
+			page_alloc[i].page = ring_alloc[i].page;
+			get_page(ring_alloc[i].page);
+			page_alloc[i].dma = ring_alloc[i].dma;
+			page_alloc[i].offset = ring_alloc[i].offset +
+						frag_info->frag_stride;
+		}
+	}
 
-		skb_frags[i].page = page;
-		skb_frags[i].offset = page_alloc->offset;
-		page_alloc->offset += frag_info->frag_stride;
+	for (i = 0; i < priv->num_frags; i++) {
+		frags[i] = ring_alloc[i];
+		dma = ring_alloc[i].dma + ring_alloc[i].offset;
+		ring_alloc[i] = page_alloc[i];
+		rx_desc->data[i].addr = cpu_to_be64(dma);
 	}
-	dma = dma_map_single(priv->ddev, page_address(skb_frags[i].page) +
-			     skb_frags[i].offset, frag_info->frag_size,
-			     PCI_DMA_FROMDEVICE);
-	rx_desc->data[i].addr = cpu_to_be64(dma);
+
 	return 0;
+
+
+out:
+	while (i--) {
+		frag_info = &priv->frag_info[i];
+		if (ring_alloc[i].offset == frag_info->last_offset)
+			dma_unmap_page(priv->ddev, page_alloc[i].dma,
+				MLX4_EN_ALLOC_SIZE, PCI_DMA_FROMDEVICE);
+		put_page(page_alloc[i].page);
+	}
+	return -ENOMEM;
+}
+
+static void mlx4_en_free_frag(struct mlx4_en_priv *priv,
+			      struct mlx4_en_rx_alloc *frags,
+			      int i)
+{
+	struct mlx4_en_frag_info *frag_info = &priv->frag_info[i];
+
+	if (frags[i].offset == frag_info->last_offset) {
+		dma_unmap_page(priv->ddev, frags[i].dma, MLX4_EN_ALLOC_SIZE,
+					 PCI_DMA_FROMDEVICE);
+	}
+	if (frags[i].page)
+		put_page(frags[i].page);
 }
 
 static int mlx4_en_init_allocator(struct mlx4_en_priv *priv,
@@ -91,6 +125,13 @@ static int mlx4_en_init_allocator(struct mlx4_en_priv *priv,
 		if (!page_alloc->page)
 			goto out;
 
+		page_alloc->dma = dma_map_page(priv->ddev, page_alloc->page, 0,
+					MLX4_EN_ALLOC_SIZE, PCI_DMA_FROMDEVICE);
+		if (dma_mapping_error(priv->ddev, page_alloc->dma)) {
+			put_page(page_alloc->page);
+			page_alloc->page = NULL;
+			goto out;
+		}
 		page_alloc->offset = priv->frag_info[i].frag_align;
 		en_dbg(DRV, priv, "Initialized allocator:%d with page:%p\n",
 		       i, page_alloc->page);
@@ -100,6 +141,8 @@ static int mlx4_en_init_allocator(struct mlx4_en_priv *priv,
 out:
 	while (i--) {
 		page_alloc = &ring->page_alloc[i];
+		dma_unmap_page(priv->ddev, page_alloc->dma,
+				MLX4_EN_ALLOC_SIZE, PCI_DMA_FROMDEVICE);
 		put_page(page_alloc->page);
 		page_alloc->page = NULL;
 	}
@@ -117,24 +160,22 @@ static void mlx4_en_destroy_allocator(struct mlx4_en_priv *priv,
 		en_dbg(DRV, priv, "Freeing allocator:%d count:%d\n",
 		       i, page_count(page_alloc->page));
 
+		dma_unmap_page(priv->ddev, page_alloc->dma,
+				MLX4_EN_ALLOC_SIZE, PCI_DMA_FROMDEVICE);
 		put_page(page_alloc->page);
 		page_alloc->page = NULL;
 	}
 }
 
-
 static void mlx4_en_init_rx_desc(struct mlx4_en_priv *priv,
 				 struct mlx4_en_rx_ring *ring, int index)
 {
 	struct mlx4_en_rx_desc *rx_desc = ring->buf + ring->stride * index;
-	struct skb_frag_struct *skb_frags = ring->rx_info +
-					    (index << priv->log_rx_info);
 	int possible_frags;
 	int i;
 
 	/* Set size and memtype fields */
 	for (i = 0; i < priv->num_frags; i++) {
-		skb_frag_size_set(&skb_frags[i], priv->frag_info[i].frag_size);
 		rx_desc->data[i].byte_count =
 			cpu_to_be32(priv->frag_info[i].frag_size);
 		rx_desc->data[i].lkey = cpu_to_be32(priv->mdev->mr.key);
@@ -151,29 +192,14 @@ static void mlx4_en_init_rx_desc(struct mlx4_en_priv *priv,
 	}
 }
 
-
 static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
 				   struct mlx4_en_rx_ring *ring, int index)
 {
 	struct mlx4_en_rx_desc *rx_desc = ring->buf + (index * ring->stride);
-	struct page_frag *skb_frags = ring->rx_info +
-				      (index << priv->log_rx_info);
-	int i;
+	struct mlx4_en_rx_alloc *frags = ring->rx_info +
+					(index << priv->log_rx_info);
 
-	for (i = 0; i < priv->num_frags; i++)
-		if (mlx4_en_alloc_frag(priv, rx_desc, skb_frags, ring->page_alloc, i))
-			goto err;
-
-	return 0;
-
-err:
-	while (i--) {
-		dma_addr_t dma = be64_to_cpu(rx_desc->data[i].addr);
-		pci_unmap_single(priv->mdev->pdev, dma, skb_frags[i].size,
-				 PCI_DMA_FROMDEVICE);
-		put_page(skb_frags[i].page);
-	}
-	return -ENOMEM;
+	return mlx4_en_alloc_frags(priv, rx_desc, frags, ring->page_alloc);
 }
 
 static inline void mlx4_en_update_rx_prod_db(struct mlx4_en_rx_ring *ring)
@@ -185,20 +211,13 @@ static void mlx4_en_free_rx_desc(struct mlx4_en_priv *priv,
 				 struct mlx4_en_rx_ring *ring,
 				 int index)
 {
-	struct page_frag *skb_frags;
-	struct mlx4_en_rx_desc *rx_desc = ring->buf + (index << ring->log_stride);
-	dma_addr_t dma;
+	struct mlx4_en_rx_alloc *frags;
 	int nr;
 
-	skb_frags = ring->rx_info + (index << priv->log_rx_info);
+	frags = ring->rx_info + (index << priv->log_rx_info);
 	for (nr = 0; nr < priv->num_frags; nr++) {
 		en_dbg(DRV, priv, "Freeing fragment:%d\n", nr);
-		dma = be64_to_cpu(rx_desc->data[nr].addr);
-
-		en_dbg(DRV, priv, "Unmapping buffer at dma:0x%llx\n", (u64) dma);
-		dma_unmap_single(priv->ddev, dma, skb_frags[nr].size,
-				 PCI_DMA_FROMDEVICE);
-		put_page(skb_frags[nr].page);
+		mlx4_en_free_frag(priv, frags, nr);
 	}
 }
 
@@ -268,10 +287,9 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
 			   struct mlx4_en_rx_ring *ring, u32 size, u16 stride)
 {
 	struct mlx4_en_dev *mdev = priv->mdev;
-	int err;
+	int err = -ENOMEM;
 	int tmp;
 
-
 	ring->prod = 0;
 	ring->cons = 0;
 	ring->size = size;
@@ -281,7 +299,7 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
 	ring->buf_size = ring->size * ring->stride + TXBB_SIZE;
 
 	tmp = size * roundup_pow_of_two(MLX4_EN_MAX_RX_FRAGS *
-					sizeof(struct skb_frag_struct));
+					sizeof(struct mlx4_en_rx_alloc));
 	ring->rx_info = vmalloc(tmp);
 	if (!ring->rx_info)
 		return -ENOMEM;
@@ -338,7 +356,7 @@ int mlx4_en_activate_rx_rings(struct mlx4_en_priv *priv)
 		memset(ring->buf, 0, ring->buf_size);
 		mlx4_en_update_rx_prod_db(ring);
 
-		/* Initailize all descriptors */
+		/* Initialize all descriptors */
 		for (i = 0; i < ring->size; i++)
 			mlx4_en_init_rx_desc(priv, ring, i);
 
@@ -401,12 +419,10 @@ void mlx4_en_deactivate_rx_ring(struct mlx4_en_priv *priv,
 }
 
 
-/* Unmap a completed descriptor and free unused pages */
 static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
 				    struct mlx4_en_rx_desc *rx_desc,
-				    struct page_frag *skb_frags,
+				    struct mlx4_en_rx_alloc *frags,
 				    struct sk_buff *skb,
-				    struct mlx4_en_rx_alloc *page_alloc,
 				    int length)
 {
 	struct skb_frag_struct *skb_frags_rx = skb_shinfo(skb)->frags;
@@ -414,26 +430,24 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
 	int nr;
 	dma_addr_t dma;
 
-	/* Collect used fragments while replacing them in the HW descirptors */
+	/* Collect used fragments while replacing them in the HW descriptors */
 	for (nr = 0; nr < priv->num_frags; nr++) {
 		frag_info = &priv->frag_info[nr];
 		if (length <= frag_info->frag_prefix_size)
 			break;
+		if (!frags[nr].page)
+			goto fail;
 
-		/* Save page reference in skb */
-		__skb_frag_set_page(&skb_frags_rx[nr], skb_frags[nr].page);
-		skb_frag_size_set(&skb_frags_rx[nr], skb_frags[nr].size);
-		skb_frags_rx[nr].page_offset = skb_frags[nr].offset;
-		skb->truesize += frag_info->frag_stride;
 		dma = be64_to_cpu(rx_desc->data[nr].addr);
+		dma_sync_single_for_cpu(priv->ddev, dma, frag_info->frag_size,
+					DMA_FROM_DEVICE);
 
-		/* Allocate a replacement page */
-		if (mlx4_en_alloc_frag(priv, rx_desc, skb_frags, page_alloc, nr))
-			goto fail;
-
-		/* Unmap buffer */
-		dma_unmap_single(priv->ddev, dma, skb_frag_size(&skb_frags_rx[nr]),
-				 PCI_DMA_FROMDEVICE);
+		/* Save page reference in skb */
+		get_page(frags[nr].page);
+		__skb_frag_set_page(&skb_frags_rx[nr], frags[nr].page);
+		skb_frag_size_set(&skb_frags_rx[nr], frag_info->frag_size);
+		skb_frags_rx[nr].page_offset = frags[nr].offset;
+		skb->truesize += frag_info->frag_stride;
 	}
 	/* Adjust size of last fragment to match actual length */
 	if (nr > 0)
@@ -442,8 +456,6 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
 	return nr;
 
 fail:
-	/* Drop all accumulated fragments (which have already been replaced in
-	 * the descriptor) of this packet; remaining fragments are reused... */
 	while (nr > 0) {
 		nr--;
 		__skb_frag_unref(&skb_frags_rx[nr]);
@@ -454,8 +466,7 @@ fail:
 
 static struct sk_buff *mlx4_en_rx_skb(struct mlx4_en_priv *priv,
 				      struct mlx4_en_rx_desc *rx_desc,
-				      struct page_frag *skb_frags,
-				      struct mlx4_en_rx_alloc *page_alloc,
+				      struct mlx4_en_rx_alloc *frags,
 				      unsigned int length)
 {
 	struct sk_buff *skb;
@@ -473,23 +484,20 @@ static struct sk_buff *mlx4_en_rx_skb(struct mlx4_en_priv *priv,
 
 	/* Get pointer to first fragment so we could copy the headers into the
 	 * (linear part of the) skb */
-	va = page_address(skb_frags[0].page) + skb_frags[0].offset;
+	va = page_address(frags[0].page) + frags[0].offset;
 
 	if (length <= SMALL_PACKET_SIZE) {
 		/* We are copying all relevant data to the skb - temporarily
-		 * synch buffers for the copy */
+		 * sync buffers for the copy */
 		dma = be64_to_cpu(rx_desc->data[0].addr);
 		dma_sync_single_for_cpu(priv->ddev, dma, length,
 					DMA_FROM_DEVICE);
 		skb_copy_to_linear_data(skb, va, length);
-		dma_sync_single_for_device(priv->ddev, dma, length,
-					   DMA_FROM_DEVICE);
 		skb->tail += length;
 	} else {
-
 		/* Move relevant fragments to skb */
-		used_frags = mlx4_en_complete_rx_desc(priv, rx_desc, skb_frags,
-						      skb, page_alloc, length);
+		used_frags = mlx4_en_complete_rx_desc(priv, rx_desc, frags,
+							skb, length);
 		if (unlikely(!used_frags)) {
 			kfree_skb(skb);
 			return NULL;
@@ -526,12 +534,25 @@ out_loopback:
 	dev_kfree_skb_any(skb);
 }
 
+static void mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
+				     struct mlx4_en_rx_ring *ring)
+{
+	int index = ring->prod & ring->size_mask;
+
+	while ((u32) (ring->prod - ring->cons) < ring->actual_size) {
+		if (mlx4_en_prepare_rx_desc(priv, ring, index))
+			break;
+		ring->prod++;
+		index = ring->prod & ring->size_mask;
+	}
+}
+
 int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	struct mlx4_cqe *cqe;
 	struct mlx4_en_rx_ring *ring = &priv->rx_ring[cq->ring];
-	struct page_frag *skb_frags;
+	struct mlx4_en_rx_alloc *frags;
 	struct mlx4_en_rx_desc *rx_desc;
 	struct sk_buff *skb;
 	int index;
@@ -540,6 +561,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	int polled = 0;
 	int ip_summed;
 	struct ethhdr *ethh;
+	dma_addr_t dma;
 	u64 s_mac;
 
 	if (!priv->port_up)
@@ -555,7 +577,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	while (XNOR(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK,
 		    cq->mcq.cons_index & cq->size)) {
 
-		skb_frags = ring->rx_info + (index << priv->log_rx_info);
+		frags = ring->rx_info + (index << priv->log_rx_info);
 		rx_desc = ring->buf + (index << ring->log_stride);
 
 		/*
@@ -579,8 +601,11 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 
 		/* Get pointer to first fragment since we haven't skb yet and
 		 * cast it to ethhdr struct */
-		ethh = (struct ethhdr *)(page_address(skb_frags[0].page) +
-					 skb_frags[0].offset);
+		dma = be64_to_cpu(rx_desc->data[0].addr);
+		dma_sync_single_for_cpu(priv->ddev, dma, sizeof(*ethh),
+					DMA_FROM_DEVICE);
+		ethh = (struct ethhdr *)(page_address(frags[0].page) +
+					 frags[0].offset);
 		s_mac = mlx4_en_mac_to_u64(ethh->h_source);
 
 		/* If source MAC is equal to our own MAC and not performing
@@ -612,10 +637,9 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 					if (!gro_skb)
 						goto next;
 
-					nr = mlx4_en_complete_rx_desc(
-						priv, rx_desc,
-						skb_frags, gro_skb,
-						ring->page_alloc, length);
+					nr = mlx4_en_complete_rx_desc(priv,
+						rx_desc, frags, gro_skb,
+						length);
 					if (!nr)
 						goto next;
 
@@ -651,8 +675,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			ring->csum_none++;
 		}
 
-		skb = mlx4_en_rx_skb(priv, rx_desc, skb_frags,
-				     ring->page_alloc, length);
+		skb = mlx4_en_rx_skb(priv, rx_desc, frags, length);
 		if (!skb) {
 			priv->stats.rx_dropped++;
 			goto next;
@@ -678,6 +701,9 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		netif_receive_skb(skb);
 
 next:
+		for (nr = 0; nr < priv->num_frags; nr++)
+			mlx4_en_free_frag(priv, frags, nr);
+
 		++cq->mcq.cons_index;
 		index = (cq->mcq.cons_index) & ring->size_mask;
 		cqe = &cq->buf[index];
@@ -693,7 +719,7 @@ out:
 	mlx4_cq_set_ci(&cq->mcq);
 	wmb(); /* ensure HW sees CQ consumer before we post new buffers */
 	ring->cons = cq->mcq.cons_index;
-	ring->prod += polled; /* Polled descriptors were realocated in place */
+	mlx4_en_refill_rx_buffers(priv, ring);
 	mlx4_en_update_rx_prod_db(ring);
 	return polled;
 }
@@ -782,7 +808,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 
 	priv->num_frags = i;
 	priv->rx_skb_size = eff_mtu;
-	priv->log_rx_info = ROUNDUP_LOG2(i * sizeof(struct skb_frag_struct));
+	priv->log_rx_info = ROUNDUP_LOG2(i * sizeof(struct mlx4_en_rx_alloc));
 
 	en_dbg(DRV, priv, "Rx buffer scatter-list (effective-mtu:%d "
 		  "num_frags:%d):\n", eff_mtu, priv->num_frags);
@@ -984,8 +1010,3 @@ void mlx4_en_release_rss_steer(struct mlx4_en_priv *priv)
 	}
 	mlx4_qp_release_range(mdev->dev, rss_map->base_qpn, priv->rx_ring_num);
 }
-
-
-
-
-
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index a126321..f2fc90d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -107,7 +107,7 @@ enum {
 #define MLX4_EN_MAX_TX_SIZE	8192
 #define MLX4_EN_MAX_RX_SIZE	8192
 
-/* Minimum ring size for our page-allocation sceme to work */
+/* Minimum ring size for our page-allocation scheme to work */
 #define MLX4_EN_MIN_RX_SIZE	(MLX4_EN_ALLOC_SIZE / SMP_CACHE_BYTES)
 #define MLX4_EN_MIN_TX_SIZE	(4096 / TXBB_SIZE)
 
@@ -228,6 +228,7 @@ struct mlx4_en_tx_desc {
 
 struct mlx4_en_rx_alloc {
 	struct page *page;
+	dma_addr_t dma;
 	u16 offset;
 };
 
-- 
1.7.4.4

^ permalink raw reply related

* Re: [RFC 1/2] PCI-Express Non-Transparent Bridge Support
From: chetan loke @ 2012-07-16 16:49 UTC (permalink / raw)
  To: Jon Mason; +Cc: linux-kernel, netdev, linux-pci, Dave Jiang
In-Reply-To: <1342215900-3358-1-git-send-email-jon.mason@intel.com>

Hi Jon,

On Fri, Jul 13, 2012 at 5:44 PM, Jon Mason <jon.mason@intel.com> wrote:

Just a few minor comments/questions:

....

> +struct ntb_transport_qp {
> +       struct ntb_device *ndev;
> +
> +       bool client_ready;
> +       bool qp_link;
> +       u8 qp_num;      /* Only 64 QP's are allowed.  0-63 */
> +
> +       void (*tx_handler) (struct ntb_transport_qp *qp);
> +       struct tasklet_struct tx_work;

Is it ok to rename the following vars for convenience sake?

> +       struct list_head txq;
tx_pend_q - (pending_queue) or tx_out_q - (outstanding_queue) - or
pick any new string you like - other than a mono-syllable definition

> +       struct list_head txc;
tx_compl_q - completion queue

> +       struct list_head txe;
tx_avail_e - available entry queue


> +       spinlock_t txq_lock;
> +       spinlock_t txc_lock;
> +       spinlock_t txe_lock;

then match the variants accordingly.

> +       struct list_head rxq;
> +       struct list_head rxc;
> +       struct list_head rxe;
> +       spinlock_t rxq_lock;
> +       spinlock_t rxc_lock;
> +       spinlock_t rxe_lock;

similarly the rx-counterpart


..................

> +static void ntb_tx_copy_task(struct ntb_transport_qp *qp,
> +                            struct ntb_queue_entry *entry,
> +                            void *offset)
> +{
> +       struct ntb_payload_header *hdr = offset;
> +       int rc;
> +
> +       offset += sizeof(struct ntb_payload_header);
> +       memcpy_toio(offset, entry->buf, entry->len);
> +
> +       hdr->len = entry->len;
> +       hdr->ver = qp->tx_pkts;
> +
> +       /* Ensure that the data is fully copied out before setting the flag */
> +       wmb();
> +       hdr->flags = entry->flags | DESC_DONE_FLAG;
> +
> +       rc = ntb_ring_sdb(qp->ndev, qp->qp_num);
> +       if (rc)
> +               pr_err("%s: error ringing db %d\n", __func__, qp->qp_num);
> +
> +       if (entry->len > 0) {

how do you interpret this len variable and decide if it's a good/bad completion?

> +               qp->tx_bytes += entry->len;
> +
> +               /* Add fully transmitted data to completion queue */
> +               ntb_list_add_tail(&qp->txc_lock, &entry->entry, &qp->txc);
> +
> +               if (qp->tx_handler)
> +                       qp->tx_handler(qp);
> +       } else
> +               ntb_list_add_tail(&qp->txe_lock, &entry->entry, &qp->txe);

I could be wrong but how is the original skb handled if the code path
goes in the else clause?
Also, in the else clause, how about a ntb_list_add_head rather than a _tail.

> +
> +static int ntb_process_tx(struct ntb_transport_qp *qp,
> +                         struct ntb_queue_entry *entry)
> +{
> +       struct ntb_payload_header *hdr;
> +       void *offset;
> +
> +       offset = qp->tx_offset;
> +       hdr = offset;
> +
> +       pr_debug("%lld - offset %p, tx %p, entry len %d flags %x buff %p\n",
> +                qp->tx_pkts, offset, qp->tx_offset, entry->len, entry->flags,
> +                entry->buf);
> +       if (hdr->flags) {
> +               ntb_list_add_head(&qp->txq_lock, &entry->entry, &qp->txq);
> +               qp->tx_ring_full++;
> +               return -EAGAIN;
> +       }
> +
> +       if (entry->len > transport_mtu) {
> +               pr_err("Trying to send pkt size of %d\n", entry->len);
> +               entry->flags = HW_ERROR_FLAG;
> +
> +               ntb_list_add_tail(&qp->txc_lock, &entry->entry, &qp->txc);
> +
> +               if (qp->tx_handler)
> +                       qp->tx_handler(qp);
> +
> +               return 0;
> +       }
> +
> +       ntb_tx_copy_task(qp, entry, offset);

what happens when ntb_sdb_ring returns an error? would you still want
to increment tx_pkts below?

> +
> +       qp->tx_offset =
> +           (qp->tx_offset +
> +            ((transport_mtu + sizeof(struct ntb_payload_header)) * 2) >=
> +            qp->tx_mw_end) ? qp->tx_mw_begin : qp->tx_offset + transport_mtu +
> +           sizeof(struct ntb_payload_header);
> +
> +       qp->tx_pkts++;
> +
> +       return 0;
> +}
> +

........................


> +void *ntb_transport_tx_dequeue(struct ntb_transport_qp *qp, unsigned int *len)
> +{
> +       struct ntb_queue_entry *entry;
> +       void *buf;
> +
> +       if (!qp)
> +               return NULL;
> +
> +       entry = ntb_list_rm_head(&qp->txc_lock, &qp->txc);
> +       if (!entry)
> +               return NULL;
> +
> +       buf = entry->callback_data;
> +       if (entry->flags != HW_ERROR_FLAG)
> +               *len = entry->len;
> +       else
> +               *len = -EIO;
> +
> +       ntb_list_add_tail(&qp->txe_lock, &entry->entry, &qp->txe);

how about a ntb_list_add_head?


Chetan Loke

^ permalink raw reply

* Re: [PATCH] ipv6: fix incorrect route 'expires' value passed to userspace.
From: Stephen Hemminger @ 2012-07-16 16:41 UTC (permalink / raw)
  To: Li Wei; +Cc: David S. Miller, netdev
In-Reply-To: <5003CC41.9080204@cn.fujitsu.com>

On Mon, 16 Jul 2012 16:09:37 +0800
Li Wei <lw@cn.fujitsu.com> wrote:

> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index becb048..a7fec9d 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -2516,7 +2516,7 @@ static int rt6_fill_node(struct net *net,
>  		goto nla_put_failure;
>  	if (!(rt->rt6i_flags & RTF_EXPIRES))
>  		expires = 0;
> -	else if (rt->dst.expires - jiffies < INT_MAX)
> +	else if ((int)(rt->dst.expires - jiffies) < INT_MAX)
>  		expires = rt->dst.expires - jiffies;
>  	else
>  		expires = INT_MAX;

Why not use time_is_after_jiffies() macro?

^ permalink raw reply

* Re: 3.4.4/amd64 full interrupt hangs under big nfs copies
From: Eric Dumazet @ 2012-07-16 16:21 UTC (permalink / raw)
  To: Marc MERLIN
  Cc: David Miller, Larry.Finger, bhutchings, linux-wireless, netdev
In-Reply-To: <20120716151826.GA10586@merlins.org>

On Mon, 2012-07-16 at 08:18 -0700, Marc MERLIN wrote:
> On Mon, Jul 16, 2012 at 08:18:49AM +0200, Eric Dumazet wrote:
> > > My understanding is that user space calling drivers that shut off all
> > > interrupts for extended periods of time (as least I think so since my mouse
> > > cursor would not move), is still a kernel bug.
> > > 
> > > For what it's worth, copying 1GB of data in lots of small files does not
> > > cause problems, it seems that it's big files that cause a problem since they
> > > likely fill a buffer somewhere while interrupts are disabled.
> > > 
> > > Do you have an idea of how I can find out where my mc process is stuck in
> > > the kernel?
> > > Should I reproduce with specific sysrq output?
> > 
> > Just to clarify, you get this freeze when transferring a big file from a
> > remote NFS server to your PC, (aka a download), not the reverse way ?
>  
> No, it's atually when I'm 'uploading' from my laptop to my server.
> One interesting thing is that my server is running lvm2 with snapshots,
> which makes writes slower than my laptop can push data over the network, so
> it's definitely causing buffers to fill up.
> I just did a download test and got 4.5MB/s sustained without problems.

Hmm, nfs apparently is able to push lot of data, try to reduce
rsize/wsize to sane values, like 32K instead of 512K ?

gargamel:/mnt/dshelf2/ /net/gargamel/mnt/dshelf2 nfs4
rw,nosuid,nodev,relatime,vers=4.0,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.205.7,local_lock=none,addr=192.168.205.3 0 0

You could trace svc_sock_setbufsize() and check how large is set
sk_sndbuf

(iwlwifi is unable to use sendpage anyway, since SG is not enabled)

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox