Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] sky2: Use deferrable timer for watchdog
From: Parag Warudkar @ 2007-12-20 20:36 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Kok, Auke, Stephen Hemminger, netdev, akpm, linux-kernel
In-Reply-To: <476ACABC.4010503@linux.intel.com>

On Dec 20, 2007 3:04 PM, Arjan van de Ven <arjan@linux.intel.com> wrote:
> > I think it is reasonable for Network driver watchdogs to use a
> > deferrable timer - if the machine is 100% IDLE there is no one needing
> > the network to be up. If there is something running even on the other
> > CPU - that is going to cause an IPI, reschedule, TLB invalidation etc.
> > which will make it very likely in practice that each CPU will be
> > interrupted in reasonable amount of time.
>
> this is not correct; many machines are idle waiting for network data. Think of webservers...

Yes, I forgot the receive case. So if a server was 100% IDLE and a web
server was listening for network data and we reach 0 wakeups per
second on the CPU where the network watchdog timer is scheduled to run
deferred _and_ the network link went down, it would cause the watchdog
to not run and redo the link until some one else wakes up that CPU
later.
So as long as we make sure we don't convert every timer to deferrable
we should be ok - may be this can be resolved easily by having a
non-deferrable "dont-allow-deferring-for-too-long" timer on each CPU
that just causes at least one wake up in some reasonable time delta
from the previous wakeup (whoever caused that one.) It is still
beneficial in that all deferrable timers would run at once without
needing to have separate wakeup for each.

>
> >
> > Of course there are theoretical cases where we could land into a
> > situation where a CPU in a multiprocessor machine is IDLE infinitely
> > and that causes the watchdog that happens to be bound to run on the
> > same CPU to not run. To take care of these unlikely cases I think the
> > timer mechanism should have a reasonable limit on how long a CPU can
> > go IDLE if there are deferrable timers.
>
> how about something else instead: a timer mechanism that takes a range instead..
> that at least has defined semantics; the deferrable semantics really are "indefinite".
> Lets keep at least the semantics clear and clean.
>

Would not the simpler solution of installing a non-deferrable timer
per cpu which will not allow the CPU to go IDLE for more than x units
of time at once  (or something to that effect) work? Range would
complicate the thing and I am not sure how many cases will know
reasonably correct range for their normal operation. In this instance
of the e1000 watchdog what range could it give and be successful at
what it wants to do - bring up the link in reasonable amount of time,
while also realizing the power savings?

Perhaps depending on Server/Laptop/Desktop machine (may be based on
Preemption) we could have normal or deferrable timers but that'll
exclude Servers from power savings and I am not sure Data center folks
will like that :) .

Parag

^ permalink raw reply

* [PATCH v2 2/4] [POWERPC][NET] ucc_geth_mii and users: get rid of device_type
From: Anton Vorontsov @ 2007-12-20 20:33 UTC (permalink / raw)
  To: Kumar Gala; +Cc: linuxppc-dev, netdev
In-Reply-To: <20071220202714.GA18471@localhost.localdomain>

device_type property is bogus, thus use proper compatible.

Also change compatible property to "fsl,ucc-mdio".

Per http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048388.html

Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
---
 arch/powerpc/boot/dts/mpc832x_mds.dts |    3 +--
 arch/powerpc/boot/dts/mpc832x_rdb.dts |    3 +--
 arch/powerpc/boot/dts/mpc836x_mds.dts |    3 +--
 arch/powerpc/boot/dts/mpc8568mds.dts  |    2 +-
 drivers/net/ucc_geth_mii.c            |    3 +++
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/boot/dts/mpc832x_mds.dts b/arch/powerpc/boot/dts/mpc832x_mds.dts
index 588d658..8844d30 100644
--- a/arch/powerpc/boot/dts/mpc832x_mds.dts
+++ b/arch/powerpc/boot/dts/mpc832x_mds.dts
@@ -255,8 +255,7 @@
 			#address-cells = <1>;
 			#size-cells = <0>;
 			reg = <2320 18>;
-			device_type = "mdio";
-			compatible = "ucc_geth_phy";
+			compatible = "fsl,ucc-mdio";
 
 			phy3: ethernet-phy@03 {
 				interrupt-parent = < &ipic >;
diff --git a/arch/powerpc/boot/dts/mpc832x_rdb.dts b/arch/powerpc/boot/dts/mpc832x_rdb.dts
index 719f375..a7a2e45 100644
--- a/arch/powerpc/boot/dts/mpc832x_rdb.dts
+++ b/arch/powerpc/boot/dts/mpc832x_rdb.dts
@@ -236,8 +236,7 @@
 			#address-cells = <1>;
 			#size-cells = <0>;
 			reg = <3120 18>;
-			device_type = "mdio";
-			compatible = "ucc_geth_phy";
+			compatible = "fsl,ucc-mdio";
 
 			phy00:ethernet-phy@00 {
 				interrupt-parent = <&pic>;
diff --git a/arch/powerpc/boot/dts/mpc836x_mds.dts b/arch/powerpc/boot/dts/mpc836x_mds.dts
index 8d7124e..5f0b427 100644
--- a/arch/powerpc/boot/dts/mpc836x_mds.dts
+++ b/arch/powerpc/boot/dts/mpc836x_mds.dts
@@ -288,8 +288,7 @@
 			#address-cells = <1>;
 			#size-cells = <0>;
 			reg = <2120 18>;
-			device_type = "mdio";
-			compatible = "ucc_geth_phy";
+			compatible = "fsl,ucc-mdio";
 
 			phy0: ethernet-phy@00 {
 				interrupt-parent = < &ipic >;
diff --git a/arch/powerpc/boot/dts/mpc8568mds.dts b/arch/powerpc/boot/dts/mpc8568mds.dts
index 89add8d..ea70010 100644
--- a/arch/powerpc/boot/dts/mpc8568mds.dts
+++ b/arch/powerpc/boot/dts/mpc8568mds.dts
@@ -356,7 +356,7 @@
 			#address-cells = <1>;
 			#size-cells = <0>;
 			reg = <2120 18>;
-			compatible = "ucc_geth_phy";
+			compatible = "fsl,ucc-mdio";
 
 			/* These are the same PHYs as on
 			 * gianfar's MDIO bus */
diff --git a/drivers/net/ucc_geth_mii.c b/drivers/net/ucc_geth_mii.c
index df884f0..e3ba14a 100644
--- a/drivers/net/ucc_geth_mii.c
+++ b/drivers/net/ucc_geth_mii.c
@@ -256,6 +256,9 @@ static struct of_device_id uec_mdio_match[] = {
 		.type = "mdio",
 		.compatible = "ucc_geth_phy",
 	},
+	{
+		.compatible = "fsl,ucc-mdio",
+	},
 	{},
 };
 
-- 
1.5.2.2


^ permalink raw reply related

* Re: [PATCH] [IPROUTE]: A workaround to make larger rto_min printed correctly
From: Jarek Poplawski @ 2007-12-20 20:31 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Satoru SATOH, netdev
In-Reply-To: <476ACF8A.7060201@gmail.com>

Jarek Poplawski wrote, On 12/20/2007 09:24 PM:
...

> but since it's your patch, I hope you do some additional checking
> if it's always like this...


...or maybe only changing this all a little bit will make it look safer!

Jarek P.

^ permalink raw reply

* Re: [PATCH] [IPROUTE]: A workaround to make larger rto_min printed correctly
From: Jarek Poplawski @ 2007-12-20 20:24 UTC (permalink / raw)
  To: Satoru SATOH; +Cc: netdev
In-Reply-To: <d72b7ace0712200821k7cde0834rf1187b30c68cd1ce@mail.gmail.com>

Satoru SATOH wrote, On 12/20/2007 05:21 PM:

> i see. HZ can be < 1000.. i should be wrong.
> 
> however, i got the following,
> 
> [root iproute2.org]# ./ip/ip route change 192.168.140.0/24 dev eth1 rto_min 4s
> [root iproute2.org]# gdb -q ./ip/ip 

...

> (gdb) p hz
> $1 = 1000000000

That's why I had some doubts! I didn't study this enough, but my
(older) version definitely showed hz == 100. Maybe I'm wrong, but
looking into lib/util.c it seems this could be set differently
depending on system's configuration (or even kernel version).

So, probably this patch could sometimes work even for HZ < 1000,
but since it's your patch, I hope you do some additional checking
if it's always like this...

Cheers,
Jarek P.

^ permalink raw reply

* Re: [PATCH] sky2: Use deferrable timer for watchdog
From: Krzysztof Oledzki @ 2007-12-20 20:17 UTC (permalink / raw)
  To: Parag Warudkar
  Cc: Kok, Auke, Arjan van de Ven, Stephen Hemminger, netdev, akpm,
	linux-kernel
In-Reply-To: <82e4877d0712201200h7b994175u841d1efa047cefff@mail.gmail.com>



On Thu, 20 Dec 2007, Parag Warudkar wrote:

> On Dec 20, 2007 2:22 PM, Kok, Auke <auke-jan.h.kok@intel.com> wrote:
>> ok, that's just bad and if there's no user-defineable limit to the deferral I
>> definately don't like this change.
>>
>> Can I safely assume that any irq will cause all deferred timers to run?
>
> I think even other causes for wakeup like process related ones will
> cause the CPU to go busy and run the timers.
> This, coupled with the fact that no one is yet able to reach 0 wakeups
> per second makes it pretty unlikely that deferrable timers will be
> deferred indefinitely.
>
>>
>> If this is the case then for e1000 this patch is still OK since the watchdog needs
>> to run (1) after a link up/down interrupt or (2) to update statistics. Those
>> statistics won't increase if there is no traffic of course...
>>
>
> I think it is reasonable for Network driver watchdogs to use a
> deferrable timer - if the machine is 100% IDLE there is no one needing
> the network to be up.

Please note tha being connected to a network does not only mean to send 
but also to receive.

Best regards,

 				Krzysztof Oledzki

^ permalink raw reply

* Re: [PATCH] sky2: Use deferrable timer for watchdog
From: Arjan van de Ven @ 2007-12-20 20:04 UTC (permalink / raw)
  To: Parag Warudkar; +Cc: Kok, Auke, Stephen Hemminger, netdev, akpm, linux-kernel
In-Reply-To: <82e4877d0712201200h7b994175u841d1efa047cefff@mail.gmail.com>

Parag Warudkar wrote:
> On Dec 20, 2007 2:22 PM, Kok, Auke <auke-jan.h.kok@intel.com> wrote:
>> ok, that's just bad and if there's no user-defineable limit to the deferral I
>> definately don't like this change.
>>
>> Can I safely assume that any irq will cause all deferred timers to run?
> 
> I think even other causes for wakeup like process related ones will
> cause the CPU to go busy and run the timers.
> This, coupled with the fact that no one is yet able to reach 0 wakeups
> per second makes it pretty unlikely that deferrable timers will be
> deferred indefinitely.

0.8 is easy on single core today.
multicore just increases how idle you can be for a given core.

> 
>> If this is the case then for e1000 this patch is still OK since the watchdog needs
>> to run (1) after a link up/down interrupt or (2) to update statistics. Those
>> statistics won't increase if there is no traffic of course...
>>
> 
> I think it is reasonable for Network driver watchdogs to use a
> deferrable timer - if the machine is 100% IDLE there is no one needing
> the network to be up. If there is something running even on the other
> CPU - that is going to cause an IPI, reschedule, TLB invalidation etc.
> which will make it very likely in practice that each CPU will be
> interrupted in reasonable amount of time.

this is not correct; many machines are idle waiting for network data. Think of webservers...

> 
> Of course there are theoretical cases where we could land into a
> situation where a CPU in a multiprocessor machine is IDLE infinitely
> and that causes the watchdog that happens to be bound to run on the
> same CPU to not run. To take care of these unlikely cases I think the
> timer mechanism should have a reasonable limit on how long a CPU can
> go IDLE if there are deferrable timers.

how about something else instead: a timer mechanism that takes a range instead..
that at least has defined semantics; the deferrable semantics really are "indefinite".
Lets keep at least the semantics clear and clean.


^ permalink raw reply

* Re: [PATCH] sky2: Use deferrable timer for watchdog
From: Arjan van de Ven @ 2007-12-20 19:42 UTC (permalink / raw)
  To: Kok, Auke; +Cc: Stephen Hemminger, parag.warudkar, netdev, akpm, linux-kernel
In-Reply-To: <476AC105.9090206@intel.com>

Kok, Auke wrote:

> ok, that's just bad and if there's no user-defineable limit to the deferral I
> definately don't like this change.
> 
> Can I safely assume that any irq will cause all deferred timers to run?

*on that cpu*. Timers are per cpu, as are interrupts. Just not per se the same one ...


^ permalink raw reply

* Re: [PATCH] sky2: Use deferrable timer for watchdog
From: Parag Warudkar @ 2007-12-20 20:00 UTC (permalink / raw)
  To: Kok, Auke; +Cc: Arjan van de Ven, Stephen Hemminger, netdev, akpm, linux-kernel
In-Reply-To: <476AC105.9090206@intel.com>

On Dec 20, 2007 2:22 PM, Kok, Auke <auke-jan.h.kok@intel.com> wrote:
> ok, that's just bad and if there's no user-defineable limit to the deferral I
> definately don't like this change.
>
> Can I safely assume that any irq will cause all deferred timers to run?

I think even other causes for wakeup like process related ones will
cause the CPU to go busy and run the timers.
This, coupled with the fact that no one is yet able to reach 0 wakeups
per second makes it pretty unlikely that deferrable timers will be
deferred indefinitely.

>
> If this is the case then for e1000 this patch is still OK since the watchdog needs
> to run (1) after a link up/down interrupt or (2) to update statistics. Those
> statistics won't increase if there is no traffic of course...
>

I think it is reasonable for Network driver watchdogs to use a
deferrable timer - if the machine is 100% IDLE there is no one needing
the network to be up. If there is something running even on the other
CPU - that is going to cause an IPI, reschedule, TLB invalidation etc.
which will make it very likely in practice that each CPU will be
interrupted in reasonable amount of time.

Of course there are theoretical cases where we could land into a
situation where a CPU in a multiprocessor machine is IDLE infinitely
and that causes the watchdog that happens to be bound to run on the
same CPU to not run. To take care of these unlikely cases I think the
timer mechanism should have a reasonable limit on how long a CPU can
go IDLE if there are deferrable timers.

Parag

^ permalink raw reply

* [PATCH 9/9][BNX2]: Update version to 1.7.1.
From: Michael Chan @ 2007-12-20 20:33 UTC (permalink / raw)
  To: davem, netdev

[BNX2]: Update version to 1.7.1.

Signed-off-by: Michael Chan <mchan@broadcom.com>

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 08b0349..69a3ce3 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -56,8 +56,8 @@
 
 #define DRV_MODULE_NAME		"bnx2"
 #define PFX DRV_MODULE_NAME	": "
-#define DRV_MODULE_VERSION	"1.7.0"
-#define DRV_MODULE_RELDATE	"December 11, 2007"
+#define DRV_MODULE_VERSION	"1.7.1"
+#define DRV_MODULE_RELDATE	"December 19, 2007"
 
 #define RUN_AT(x) (jiffies + (x))
 



^ permalink raw reply related

* [PATCH 8/9][BNX2]: Enable new tx ring.
From: Michael Chan @ 2007-12-20 20:33 UTC (permalink / raw)
  To: davem, netdev

[BNX2]: Enable new tx ring.

Enable new tx ring and add new MSIX handler and NAPI poll function
for the new tx ring.  Enable MSIX when the hardware supports it.

Signed-off-by: Michael Chan <mchan@broadcom.com>

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index a4ed6ca..3745fc8 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -598,7 +598,7 @@ bnx2_alloc_mem(struct bnx2 *bp)
 		for (i = 1; i < BNX2_MAX_MSIX_VEC; i++) {
 			struct bnx2_napi *bnapi = &bp->bnx2_napi[i];
 
-			bnapi->status_blk = (void *)
+			bnapi->status_blk_msix = (void *)
 				((unsigned long) bp->status_blk +
 				 BNX2_SBLK_MSIX_ALIGN_SIZE * i);
 			bnapi->int_num = i << 24;
@@ -2388,10 +2388,11 @@ bnx2_get_hw_tx_cons(struct bnx2_napi *bnapi)
 	return cons;
 }
 
-static void
-bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
+static int
+bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi, int budget)
 {
 	u16 hw_cons, sw_cons, sw_ring_cons;
+	int tx_pkt = 0;
 
 	hw_cons = bnx2_get_hw_tx_cons(bnapi);
 	sw_cons = bnapi->tx_cons;
@@ -2442,6 +2443,9 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 		sw_cons = NEXT_TX_BD(sw_cons);
 
 		dev_kfree_skb(skb);
+		tx_pkt++;
+		if (tx_pkt == budget)
+			break;
 
 		hw_cons = bnx2_get_hw_tx_cons(bnapi);
 	}
@@ -2463,6 +2467,7 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 			netif_wake_queue(bp->dev);
 		netif_tx_unlock(bp->dev);
 	}
+	return tx_pkt;
 }
 
 static void
@@ -2875,6 +2880,23 @@ bnx2_interrupt(int irq, void *dev_instance)
 	return IRQ_HANDLED;
 }
 
+static irqreturn_t
+bnx2_tx_msix(int irq, void *dev_instance)
+{
+	struct net_device *dev = dev_instance;
+	struct bnx2 *bp = netdev_priv(dev);
+	struct bnx2_napi *bnapi = &bp->bnx2_napi[BNX2_TX_VEC];
+
+	prefetch(bnapi->status_blk_msix);
+
+	/* Return here if interrupt is disabled. */
+	if (unlikely(atomic_read(&bp->intr_sem) != 0))
+		return IRQ_HANDLED;
+
+	netif_rx_schedule(dev, &bnapi->napi);
+	return IRQ_HANDLED;
+}
+
 #define STATUS_ATTN_EVENTS	(STATUS_ATTN_BITS_LINK_STATE | \
 				 STATUS_ATTN_BITS_TIMER_ABORT)
 
@@ -2895,6 +2917,29 @@ bnx2_has_work(struct bnx2_napi *bnapi)
 	return 0;
 }
 
+static int bnx2_tx_poll(struct napi_struct *napi, int budget)
+{
+	struct bnx2_napi *bnapi = container_of(napi, struct bnx2_napi, napi);
+	struct bnx2 *bp = bnapi->bp;
+	int work_done = 0;
+	struct status_block_msix *sblk = bnapi->status_blk_msix;
+
+	do {
+		work_done += bnx2_tx_int(bp, bnapi, budget - work_done);
+		if (unlikely(work_done >= budget))
+			return work_done;
+
+		bnapi->last_status_idx = sblk->status_idx;
+		rmb();
+	} while (bnx2_get_hw_tx_cons(bnapi) != bnapi->hw_tx_cons);
+
+	netif_rx_complete(bp->dev, napi);
+	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, bnapi->int_num |
+	       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
+	       bnapi->last_status_idx);
+	return work_done;
+}
+
 static int bnx2_poll_work(struct bnx2 *bp, struct bnx2_napi *bnapi,
 			  int work_done, int budget)
 {
@@ -2916,7 +2961,7 @@ static int bnx2_poll_work(struct bnx2 *bp, struct bnx2_napi *bnapi,
 	}
 
 	if (bnx2_get_hw_tx_cons(bnapi) != bnapi->hw_tx_cons)
-		bnx2_tx_int(bp, bnapi);
+		bnx2_tx_int(bp, bnapi, 0);
 
 	if (bnx2_get_hw_rx_cons(bnapi) != bnapi->rx_cons)
 		work_done += bnx2_rx_int(bp, bnapi, budget - work_done);
@@ -5399,10 +5444,35 @@ bnx2_free_irq(struct bnx2 *bp)
 static void
 bnx2_enable_msix(struct bnx2 *bp)
 {
+	int i, rc;
+	struct msix_entry msix_ent[BNX2_MAX_MSIX_VEC];
+
 	bnx2_setup_msix_tbl(bp);
 	REG_WR(bp, BNX2_PCI_MSIX_CONTROL, BNX2_MAX_MSIX_HW_VEC - 1);
 	REG_WR(bp, BNX2_PCI_MSIX_TBL_OFF_BIR, BNX2_PCI_GRC_WINDOW2_BASE);
 	REG_WR(bp, BNX2_PCI_MSIX_PBA_OFF_BIT, BNX2_PCI_GRC_WINDOW3_BASE);
+
+	for (i = 0; i < BNX2_MAX_MSIX_VEC; i++) {
+		msix_ent[i].entry = i;
+		msix_ent[i].vector = 0;
+	}
+
+	rc = pci_enable_msix(bp->pdev, msix_ent, BNX2_MAX_MSIX_VEC);
+	if (rc != 0)
+		return;
+
+	bp->irq_tbl[BNX2_BASE_VEC].handler = bnx2_msi_1shot;
+	bp->irq_tbl[BNX2_TX_VEC].handler = bnx2_tx_msix;
+
+	strcpy(bp->irq_tbl[BNX2_BASE_VEC].name, bp->dev->name);
+	strcat(bp->irq_tbl[BNX2_BASE_VEC].name, "-base");
+	strcpy(bp->irq_tbl[BNX2_TX_VEC].name, bp->dev->name);
+	strcat(bp->irq_tbl[BNX2_TX_VEC].name, "-tx");
+
+	bp->irq_nvecs = BNX2_MAX_MSIX_VEC;
+	bp->flags |= USING_MSIX_FLAG | ONE_SHOT_MSI_FLAG;
+	for (i = 0; i < BNX2_MAX_MSIX_VEC; i++)
+		bp->irq_tbl[i].vector = msix_ent[i].vector;
 }
 
 static void
@@ -5504,9 +5574,10 @@ bnx2_open(struct net_device *dev)
 			bnx2_enable_int(bp);
 		}
 	}
-	if (bp->flags & USING_MSI_FLAG) {
+	if (bp->flags & USING_MSI_FLAG)
 		printk(KERN_INFO PFX "%s: using MSI\n", dev->name);
-	}
+	else if (bp->flags & USING_MSIX_FLAG)
+		printk(KERN_INFO PFX "%s: using MSIX\n", dev->name);
 
 	netif_start_queue(dev);
 
@@ -5570,7 +5641,7 @@ bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	u32 len, vlan_tag_flags, last_frag, mss;
 	u16 prod, ring_prod;
 	int i;
-	struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
+	struct bnx2_napi *bnapi = &bp->bnx2_napi[bp->tx_vec];
 
 	if (unlikely(bnx2_tx_avail(bp, bnapi) <
 	    (skb_shinfo(skb)->nr_frags + 1))) {
@@ -7253,6 +7324,8 @@ bnx2_init_napi(struct bnx2 *bp)
 		bnapi->bp = bp;
 	}
 	netif_napi_add(bp->dev, &bp->bnx2_napi[0].napi, bnx2_poll, 64);
+	netif_napi_add(bp->dev, &bp->bnx2_napi[BNX2_TX_VEC].napi, bnx2_tx_poll,
+		       64);
 }
 
 static int __devinit
diff --git a/drivers/net/bnx2.h b/drivers/net/bnx2.h
index 68fb590..09bd665 100644
--- a/drivers/net/bnx2.h
+++ b/drivers/net/bnx2.h
@@ -6528,7 +6528,7 @@ struct flash_spec {
 };
 
 #define BNX2_MAX_MSIX_HW_VEC	9
-#define BNX2_MAX_MSIX_VEC	1
+#define BNX2_MAX_MSIX_VEC	2
 #define BNX2_BASE_VEC		0
 #define BNX2_TX_VEC		1
 #define BNX2_TX_INT_NUM	(BNX2_TX_VEC << BNX2_PCICFG_INT_ACK_CMD_INT_NUM_SHIFT)



^ permalink raw reply related

* [PATCH 7/9][BNX2]: Add support for a new tx ring.
From: Michael Chan @ 2007-12-20 20:33 UTC (permalink / raw)
  To: davem, netdev

[BNX2]: Add support for a new tx ring.

To separate TX IRQs into a different MSIX vector, we need to
support a new tx ring.  The original tx ring will still be used
when not using MSIX.

Signed-off-by: Michael Chan <mchan@broadcom.com>

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 196d053..a4ed6ca 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -2378,7 +2378,10 @@ bnx2_get_hw_tx_cons(struct bnx2_napi *bnapi)
 {
 	u16 cons;
 
-	cons = bnapi->status_blk->status_tx_quick_consumer_index0;
+	if (bnapi->int_num == 0)
+		cons = bnapi->status_blk->status_tx_quick_consumer_index0;
+	else
+		cons = bnapi->status_blk_msix->status_tx_quick_consumer_index;
 
 	if (unlikely((cons & MAX_TX_DESC_CNT) == MAX_TX_DESC_CNT))
 		cons++;
@@ -2389,7 +2392,6 @@ static void
 bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 {
 	u16 hw_cons, sw_cons, sw_ring_cons;
-	int tx_free_bd = 0;
 
 	hw_cons = bnx2_get_hw_tx_cons(bnapi);
 	sw_cons = bnapi->tx_cons;
@@ -2439,8 +2441,6 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 
 		sw_cons = NEXT_TX_BD(sw_cons);
 
-		tx_free_bd += last + 1;
-
 		dev_kfree_skb(skb);
 
 		hw_cons = bnx2_get_hw_tx_cons(bnapi);
@@ -4369,6 +4369,24 @@ bnx2_init_chip(struct bnx2 *bp)
 		      BNX2_HC_CONFIG_COLLECT_STATS;
 	}
 
+	if (bp->flags & USING_MSIX_FLAG) {
+		REG_WR(bp, BNX2_HC_MSIX_BIT_VECTOR,
+		       BNX2_HC_MSIX_BIT_VECTOR_VAL);
+
+		REG_WR(bp, BNX2_HC_SB_CONFIG_1,
+			BNX2_HC_SB_CONFIG_1_TX_TMR_MODE |
+			BNX2_HC_SB_CONFIG_1_ONE_SHOT);
+
+		REG_WR(bp, BNX2_HC_TX_QUICK_CONS_TRIP_1,
+			(bp->tx_quick_cons_trip_int << 16) |
+			 bp->tx_quick_cons_trip);
+
+		REG_WR(bp, BNX2_HC_TX_TICKS_1,
+			(bp->tx_ticks_int << 16) | bp->tx_ticks);
+
+		val |= BNX2_HC_CONFIG_SB_ADDR_INC_128B;
+	}
+
 	if (bp->flags & ONE_SHOT_MSI_FLAG)
 		val |= BNX2_HC_CONFIG_ONE_SHOT;
 
@@ -4401,6 +4419,25 @@ bnx2_init_chip(struct bnx2 *bp)
 }
 
 static void
+bnx2_clear_ring_states(struct bnx2 *bp)
+{
+	struct bnx2_napi *bnapi;
+	int i;
+
+	for (i = 0; i < BNX2_MAX_MSIX_VEC; i++) {
+		bnapi = &bp->bnx2_napi[i];
+
+		bnapi->tx_cons = 0;
+		bnapi->hw_tx_cons = 0;
+		bnapi->rx_prod_bseq = 0;
+		bnapi->rx_prod = 0;
+		bnapi->rx_cons = 0;
+		bnapi->rx_pg_prod = 0;
+		bnapi->rx_pg_cons = 0;
+	}
+}
+
+static void
 bnx2_init_tx_context(struct bnx2 *bp, u32 cid)
 {
 	u32 val, offset0, offset1, offset2, offset3;
@@ -4433,8 +4470,17 @@ static void
 bnx2_init_tx_ring(struct bnx2 *bp)
 {
 	struct tx_bd *txbd;
-	u32 cid;
-	struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
+	u32 cid = TX_CID;
+	struct bnx2_napi *bnapi;
+
+	bp->tx_vec = 0;
+	if (bp->flags & USING_MSIX_FLAG) {
+		cid = TX_TSS_CID;
+		bp->tx_vec = BNX2_TX_VEC;
+		REG_WR(bp, BNX2_TSCH_TSS_CFG, BNX2_TX_INT_NUM |
+		       (TX_TSS_CID << 7));
+	}
+	bnapi = &bp->bnx2_napi[bp->tx_vec];
 
 	bp->tx_wake_thresh = bp->tx_ring_size / 2;
 
@@ -4444,11 +4490,8 @@ bnx2_init_tx_ring(struct bnx2 *bp)
 	txbd->tx_bd_haddr_lo = (u64) bp->tx_desc_mapping & 0xffffffff;
 
 	bp->tx_prod = 0;
-	bnapi->tx_cons = 0;
-	bnapi->hw_tx_cons = 0;
 	bp->tx_prod_bseq = 0;
 
-	cid = TX_CID;
 	bp->tx_bidx_addr = MB_GET_CID_ADDR(cid) + BNX2_L2CTX_TX_HOST_BIDX;
 	bp->tx_bseq_addr = MB_GET_CID_ADDR(cid) + BNX2_L2CTX_TX_HOST_BSEQ;
 
@@ -4487,12 +4530,6 @@ bnx2_init_rx_ring(struct bnx2 *bp)
 	u32 val, rx_cid_addr = GET_CID_ADDR(RX_CID);
 	struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
 
-	bnapi->rx_prod = 0;
-	bnapi->rx_cons = 0;
-	bnapi->rx_prod_bseq = 0;
-	bnapi->rx_pg_prod = 0;
-	bnapi->rx_pg_cons = 0;
-
 	bnx2_init_rxbd_rings(bp->rx_desc_ring, bp->rx_desc_mapping,
 			     bp->rx_buf_use_size, bp->rx_max_ring);
 
@@ -4694,6 +4731,7 @@ bnx2_reset_nic(struct bnx2 *bp, u32 reset_code)
 	if ((rc = bnx2_init_chip(bp)) != 0)
 		return rc;
 
+	bnx2_clear_ring_states(bp);
 	bnx2_init_tx_ring(bp);
 	bnx2_init_rx_ring(bp);
 	return 0;
@@ -4965,7 +5003,11 @@ bnx2_run_loopback(struct bnx2 *bp, int loopback_mode)
 	struct sw_bd *rx_buf;
 	struct l2_fhdr *rx_hdr;
 	int ret = -ENODEV;
-	struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
+	struct bnx2_napi *bnapi = &bp->bnx2_napi[0], *tx_napi;
+
+	tx_napi = bnapi;
+	if (bp->flags & USING_MSIX_FLAG)
+		tx_napi = &bp->bnx2_napi[BNX2_TX_VEC];
 
 	if (loopback_mode == BNX2_MAC_LOOPBACK) {
 		bp->loopback = MAC_LOOPBACK;
@@ -5030,7 +5072,7 @@ bnx2_run_loopback(struct bnx2 *bp, int loopback_mode)
 	pci_unmap_single(bp->pdev, map, pkt_size, PCI_DMA_TODEVICE);
 	dev_kfree_skb(skb);
 
-	if (bnx2_get_hw_tx_cons(bnapi) != bp->tx_prod)
+	if (bnx2_get_hw_tx_cons(tx_napi) != bp->tx_prod)
 		goto loopback_test_done;
 
 	rx_idx = bnx2_get_hw_rx_cons(bnapi);
@@ -5324,7 +5366,7 @@ bnx2_request_irq(struct bnx2 *bp)
 	
 	for (i = 0; i < bp->irq_nvecs; i++) {
 		irq = &bp->irq_tbl[i];
-		rc = request_irq(irq->vector, irq->handler, flags, dev->name,
+		rc = request_irq(irq->vector, irq->handler, flags, irq->name,
 				 dev);
 		if (rc)
 			break;
diff --git a/drivers/net/bnx2.h b/drivers/net/bnx2.h
index d71ceb6..68fb590 100644
--- a/drivers/net/bnx2.h
+++ b/drivers/net/bnx2.h
@@ -6529,6 +6529,9 @@ struct flash_spec {
 
 #define BNX2_MAX_MSIX_HW_VEC	9
 #define BNX2_MAX_MSIX_VEC	1
+#define BNX2_BASE_VEC		0
+#define BNX2_TX_VEC		1
+#define BNX2_TX_INT_NUM	(BNX2_TX_VEC << BNX2_PCICFG_INT_ACK_CMD_INT_NUM_SHIFT)
 
 struct bnx2_irq {
 	irq_handler_t	handler;
@@ -6541,6 +6544,7 @@ struct bnx2_napi {
 	struct napi_struct	napi		____cacheline_aligned;
 	struct bnx2		*bp;
 	struct status_block	*status_blk;
+	struct status_block_msix	*status_blk_msix;
 	u32 			last_status_idx;
 	u32			int_num;
 
@@ -6583,6 +6587,7 @@ struct bnx2 {
 
 	u32		tx_prod_bseq __attribute__((aligned(L1_CACHE_BYTES)));
 	u16		tx_prod;
+	u8		tx_vec;
 	u32		tx_bidx_addr;
 	u32		tx_bseq_addr;
 



^ permalink raw reply related

* [PATCH 6/9][BNX2]: Support multiple MSIX IRQs.
From: Michael Chan @ 2007-12-20 20:32 UTC (permalink / raw)
  To: davem, netdev

[BNX2]: Support multiple MSIX IRQs.

Change bnx2_napi struct into an array and add code to manage multiple
IRQs.  MSIX hardware structures and new registers are also added.

Signed-off-by: Michael Chan <mchan@broadcom.com>

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index ecfaad1..196d053 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -399,44 +399,65 @@ bnx2_write_phy(struct bnx2 *bp, u32 reg, u32 val)
 static void
 bnx2_disable_int(struct bnx2 *bp)
 {
-	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
-	       BNX2_PCICFG_INT_ACK_CMD_MASK_INT);
+	int i;
+	struct bnx2_napi *bnapi;
+
+	for (i = 0; i < bp->irq_nvecs; i++) {
+		bnapi = &bp->bnx2_napi[i];
+		REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, bnapi->int_num |
+		       BNX2_PCICFG_INT_ACK_CMD_MASK_INT);
+	}
 	REG_RD(bp, BNX2_PCICFG_INT_ACK_CMD);
 }
 
 static void
 bnx2_enable_int(struct bnx2 *bp)
 {
-	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+	int i;
+	struct bnx2_napi *bnapi;
 
-	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
-	       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
-	       BNX2_PCICFG_INT_ACK_CMD_MASK_INT | bnapi->last_status_idx);
+	for (i = 0; i < bp->irq_nvecs; i++) {
+		bnapi = &bp->bnx2_napi[i];
 
-	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
-	       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | bnapi->last_status_idx);
+		REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, bnapi->int_num |
+		       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
+		       BNX2_PCICFG_INT_ACK_CMD_MASK_INT |
+		       bnapi->last_status_idx);
 
+		REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, bnapi->int_num |
+		       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
+		       bnapi->last_status_idx);
+	}
 	REG_WR(bp, BNX2_HC_COMMAND, bp->hc_cmd | BNX2_HC_COMMAND_COAL_NOW);
 }
 
 static void
 bnx2_disable_int_sync(struct bnx2 *bp)
 {
+	int i;
+
 	atomic_inc(&bp->intr_sem);
 	bnx2_disable_int(bp);
-	synchronize_irq(bp->pdev->irq);
+	for (i = 0; i < bp->irq_nvecs; i++)
+		synchronize_irq(bp->irq_tbl[i].vector);
 }
 
 static void
 bnx2_napi_disable(struct bnx2 *bp)
 {
-	napi_disable(&bp->bnx2_napi.napi);
+	int i;
+
+	for (i = 0; i < bp->irq_nvecs; i++)
+		napi_disable(&bp->bnx2_napi[i].napi);
 }
 
 static void
 bnx2_napi_enable(struct bnx2 *bp)
 {
-	napi_enable(&bp->bnx2_napi.napi);
+	int i;
+
+	for (i = 0; i < bp->irq_nvecs; i++)
+		napi_enable(&bp->bnx2_napi[i].napi);
 }
 
 static void
@@ -559,6 +580,9 @@ bnx2_alloc_mem(struct bnx2 *bp)
 
 	/* Combine status and statistics blocks into one allocation. */
 	status_blk_size = L1_CACHE_ALIGN(sizeof(struct status_block));
+	if (bp->flags & MSIX_CAP_FLAG)
+		status_blk_size = L1_CACHE_ALIGN(BNX2_MAX_MSIX_HW_VEC *
+						 BNX2_SBLK_MSIX_ALIGN_SIZE);
 	bp->status_stats_size = status_blk_size +
 				sizeof(struct statistics_block);
 
@@ -569,7 +593,17 @@ bnx2_alloc_mem(struct bnx2 *bp)
 
 	memset(bp->status_blk, 0, bp->status_stats_size);
 
-	bp->bnx2_napi.status_blk = bp->status_blk;
+	bp->bnx2_napi[0].status_blk = bp->status_blk;
+	if (bp->flags & MSIX_CAP_FLAG) {
+		for (i = 1; i < BNX2_MAX_MSIX_VEC; i++) {
+			struct bnx2_napi *bnapi = &bp->bnx2_napi[i];
+
+			bnapi->status_blk = (void *)
+				((unsigned long) bp->status_blk +
+				 BNX2_SBLK_MSIX_ALIGN_SIZE * i);
+			bnapi->int_num = i << 24;
+		}
+	}
 
 	bp->stats_blk = (void *) ((unsigned long) bp->status_blk +
 				  status_blk_size);
@@ -2767,7 +2801,7 @@ bnx2_msi(int irq, void *dev_instance)
 {
 	struct net_device *dev = dev_instance;
 	struct bnx2 *bp = netdev_priv(dev);
-	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+	struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
 
 	prefetch(bnapi->status_blk);
 	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
@@ -2788,7 +2822,7 @@ bnx2_msi_1shot(int irq, void *dev_instance)
 {
 	struct net_device *dev = dev_instance;
 	struct bnx2 *bp = netdev_priv(dev);
-	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+	struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
 
 	prefetch(bnapi->status_blk);
 
@@ -2806,7 +2840,7 @@ bnx2_interrupt(int irq, void *dev_instance)
 {
 	struct net_device *dev = dev_instance;
 	struct bnx2 *bp = netdev_priv(dev);
-	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+	struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
 	struct status_block *sblk = bnapi->status_blk;
 
 	/* When using INTx, it is possible for the interrupt to arrive
@@ -2911,7 +2945,7 @@ static int bnx2_poll(struct napi_struct *napi, int budget)
 		rmb();
 		if (likely(!bnx2_has_work(bnapi))) {
 			netif_rx_complete(bp->dev, napi);
-			if (likely(bp->flags & USING_MSI_FLAG)) {
+			if (likely(bp->flags & USING_MSI_OR_MSIX_FLAG)) {
 				REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
 				       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
 				       bnapi->last_status_idx);
@@ -4072,6 +4106,15 @@ bnx2_init_remote_phy(struct bnx2 *bp)
 	}
 }
 
+static void
+bnx2_setup_msix_tbl(struct bnx2 *bp)
+{
+	REG_WR(bp, BNX2_PCI_GRC_WINDOW_ADDR, BNX2_PCI_GRC_WINDOW_ADDR_SEP_WIN);
+
+	REG_WR(bp, BNX2_PCI_GRC_WINDOW2_ADDR, BNX2_MSIX_TABLE_ADDR);
+	REG_WR(bp, BNX2_PCI_GRC_WINDOW3_ADDR, BNX2_MSIX_PBA_ADDR);
+}
+
 static int
 bnx2_reset_chip(struct bnx2 *bp, u32 reset_code)
 {
@@ -4171,6 +4214,9 @@ bnx2_reset_chip(struct bnx2 *bp, u32 reset_code)
 		rc = bnx2_alloc_bad_rbuf(bp);
 	}
 
+	if (bp->flags & USING_MSIX_FLAG)
+		bnx2_setup_msix_tbl(bp);
+
 	return rc;
 }
 
@@ -4178,7 +4224,7 @@ static int
 bnx2_init_chip(struct bnx2 *bp)
 {
 	u32 val;
-	int rc;
+	int rc, i;
 
 	/* Make sure the interrupt is not active. */
 	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, BNX2_PCICFG_INT_ACK_CMD_MASK_INT);
@@ -4274,7 +4320,9 @@ bnx2_init_chip(struct bnx2 *bp)
 		val |= BNX2_EMAC_RX_MTU_SIZE_JUMBO_ENA;
 	REG_WR(bp, BNX2_EMAC_RX_MTU_SIZE, val);
 
-	bp->bnx2_napi.last_status_idx = 0;
+	for (i = 0; i < BNX2_MAX_MSIX_VEC; i++)
+		bp->bnx2_napi[i].last_status_idx = 0;
+
 	bp->rx_mode = BNX2_EMAC_RX_MODE_SORT_MODE;
 
 	/* Set up how to generate a link change interrupt. */
@@ -4386,7 +4434,7 @@ bnx2_init_tx_ring(struct bnx2 *bp)
 {
 	struct tx_bd *txbd;
 	u32 cid;
-	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+	struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
 
 	bp->tx_wake_thresh = bp->tx_ring_size / 2;
 
@@ -4437,7 +4485,7 @@ bnx2_init_rx_ring(struct bnx2 *bp)
 	int i;
 	u16 prod, ring_prod;
 	u32 val, rx_cid_addr = GET_CID_ADDR(RX_CID);
-	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+	struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
 
 	bnapi->rx_prod = 0;
 	bnapi->rx_cons = 0;
@@ -4917,7 +4965,7 @@ bnx2_run_loopback(struct bnx2 *bp, int loopback_mode)
 	struct sw_bd *rx_buf;
 	struct l2_fhdr *rx_hdr;
 	int ret = -ENODEV;
-	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+	struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
 
 	if (loopback_mode == BNX2_MAC_LOOPBACK) {
 		bp->loopback = MAC_LOOPBACK;
@@ -5266,14 +5314,22 @@ bnx2_request_irq(struct bnx2 *bp)
 {
 	struct net_device *dev = bp->dev;
 	unsigned long flags;
-	struct bnx2_irq *irq = &bp->irq_tbl[0];
-	int rc;
+	struct bnx2_irq *irq;
+	int rc = 0, i;
 
-	if (bp->flags & USING_MSI_FLAG)
+	if (bp->flags & USING_MSI_OR_MSIX_FLAG)
 		flags = 0;
 	else
 		flags = IRQF_SHARED;
-	rc = request_irq(irq->vector, irq->handler, flags, dev->name, dev);
+	
+	for (i = 0; i < bp->irq_nvecs; i++) {
+		irq = &bp->irq_tbl[i];
+		rc = request_irq(irq->vector, irq->handler, flags, dev->name,
+				 dev);
+		if (rc)
+			break;
+		irq->requested = 1;
+	}
 	return rc;
 }
 
@@ -5281,12 +5337,30 @@ static void
 bnx2_free_irq(struct bnx2 *bp)
 {
 	struct net_device *dev = bp->dev;
+	struct bnx2_irq *irq;
+	int i;
 
-	free_irq(bp->irq_tbl[0].vector, dev);
-	if (bp->flags & USING_MSI_FLAG) {
-		pci_disable_msi(bp->pdev);
-		bp->flags &= ~(USING_MSI_FLAG | ONE_SHOT_MSI_FLAG);
+	for (i = 0; i < bp->irq_nvecs; i++) {
+		irq = &bp->irq_tbl[i];
+		if (irq->requested)
+			free_irq(irq->vector, dev);
+		irq->requested = 0;
 	}
+	if (bp->flags & USING_MSI_FLAG)
+		pci_disable_msi(bp->pdev);
+	else if (bp->flags & USING_MSIX_FLAG)
+		pci_disable_msix(bp->pdev);
+
+	bp->flags &= ~(USING_MSI_OR_MSIX_FLAG | ONE_SHOT_MSI_FLAG);
+}
+
+static void
+bnx2_enable_msix(struct bnx2 *bp)
+{
+	bnx2_setup_msix_tbl(bp);
+	REG_WR(bp, BNX2_PCI_MSIX_CONTROL, BNX2_MAX_MSIX_HW_VEC - 1);
+	REG_WR(bp, BNX2_PCI_MSIX_TBL_OFF_BIR, BNX2_PCI_GRC_WINDOW2_BASE);
+	REG_WR(bp, BNX2_PCI_MSIX_PBA_OFF_BIT, BNX2_PCI_GRC_WINDOW3_BASE);
 }
 
 static void
@@ -5294,8 +5368,14 @@ bnx2_setup_int_mode(struct bnx2 *bp, int dis_msi)
 {
 	bp->irq_tbl[0].handler = bnx2_interrupt;
 	strcpy(bp->irq_tbl[0].name, bp->dev->name);
+	bp->irq_nvecs = 1;
+	bp->irq_tbl[0].vector = bp->pdev->irq;
+
+	if ((bp->flags & MSIX_CAP_FLAG) && !dis_msi)
+		bnx2_enable_msix(bp);
 
-	if ((bp->flags & MSI_CAP_FLAG) && !dis_msi) {
+	if ((bp->flags & MSI_CAP_FLAG) && !dis_msi &&
+	    !(bp->flags & USING_MSIX_FLAG)) {
 		if (pci_enable_msi(bp->pdev) == 0) {
 			bp->flags |= USING_MSI_FLAG;
 			if (CHIP_NUM(bp) == CHIP_NUM_5709) {
@@ -5303,10 +5383,10 @@ bnx2_setup_int_mode(struct bnx2 *bp, int dis_msi)
 				bp->irq_tbl[0].handler = bnx2_msi_1shot;
 			} else
 				bp->irq_tbl[0].handler = bnx2_msi;
+
+			bp->irq_tbl[0].vector = bp->pdev->irq;
 		}
 	}
-
-	bp->irq_tbl[0].vector = bp->pdev->irq;
 }
 
 /* Called with rtnl_lock */
@@ -5448,7 +5528,7 @@ bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	u32 len, vlan_tag_flags, last_frag, mss;
 	u16 prod, ring_prod;
 	int i;
-	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+	struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
 
 	if (unlikely(bnx2_tx_avail(bp, bnapi) <
 	    (skb_shinfo(skb)->nr_frags + 1))) {
@@ -6848,6 +6928,11 @@ bnx2_init_board(struct pci_dev *pdev, struct net_device *dev)
 		}
 	}
 
+	if (CHIP_NUM(bp) == CHIP_NUM_5709 && CHIP_REV(bp) != CHIP_REV_Ax) {
+		if (pci_find_capability(pdev, PCI_CAP_ID_MSIX))
+			bp->flags |= MSIX_CAP_FLAG;
+	}
+
 	if (CHIP_ID(bp) != CHIP_ID_5706_A0 && CHIP_ID(bp) != CHIP_ID_5706_A1) {
 		if (pci_find_capability(pdev, PCI_CAP_ID_MSI))
 			bp->flags |= MSI_CAP_FLAG;
@@ -7118,10 +7203,14 @@ bnx2_bus_string(struct bnx2 *bp, char *str)
 static int __devinit
 bnx2_init_napi(struct bnx2 *bp)
 {
-	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+	int i;
+	struct bnx2_napi *bnapi;
 
-	bnapi->bp = bp;
-	netif_napi_add(bp->dev, &bnapi->napi, bnx2_poll, 64);
+	for (i = 0; i < BNX2_MAX_MSIX_VEC; i++) {
+		bnapi = &bp->bnx2_napi[i];
+		bnapi->bp = bp;
+	}
+	netif_napi_add(bp->dev, &bp->bnx2_napi[0].napi, bnx2_poll, 64);
 }
 
 static int __devinit
diff --git a/drivers/net/bnx2.h b/drivers/net/bnx2.h
index b75795b..d71ceb6 100644
--- a/drivers/net/bnx2.h
+++ b/drivers/net/bnx2.h
@@ -154,6 +154,33 @@ struct status_block {
 #endif
 };
 
+/*
+ *  status_block definition
+ */
+struct status_block_msix {
+#if defined(__BIG_ENDIAN)
+	u16 status_tx_quick_consumer_index;
+	u16 status_rx_quick_consumer_index;
+	u16 status_completion_producer_index;
+	u16 status_cmd_consumer_index;
+	u32 status_unused;
+	u16 status_idx;
+	u8 status_unused2;
+	u8 status_blk_num;
+#elif defined(__LITTLE_ENDIAN)
+	u16 status_rx_quick_consumer_index;
+	u16 status_tx_quick_consumer_index;
+	u16 status_cmd_consumer_index;
+	u16 status_completion_producer_index;
+	u32 status_unused;
+	u8 status_blk_num;
+	u8 status_unused2;
+	u16 status_idx;
+#endif
+};
+
+#define BNX2_SBLK_MSIX_ALIGN_SIZE	128
+
 
 /*
  *  statistics_block definition
@@ -413,6 +440,7 @@ struct l2_fhdr {
 #define BNX2_PCICFG_INT_ACK_CMD_USE_INT_HC_PARAM	 (1L<<17)
 #define BNX2_PCICFG_INT_ACK_CMD_MASK_INT		 (1L<<18)
 #define BNX2_PCICFG_INT_ACK_CMD_INTERRUPT_NUM		 (0xfL<<24)
+#define BNX2_PCICFG_INT_ACK_CMD_INT_NUM_SHIFT		 24
 
 #define BNX2_PCICFG_STATUS_BIT_SET_CMD			0x00000088
 #define BNX2_PCICFG_STATUS_BIT_CLEAR_CMD		0x0000008c
@@ -428,6 +456,9 @@ struct l2_fhdr {
 #define BNX2_PCI_GRC_WINDOW_ADDR_VALUE			 (0x1ffL<<13)
 #define BNX2_PCI_GRC_WINDOW_ADDR_SEP_WIN		 (1L<<31)
 
+#define BNX2_PCI_GRC_WINDOW2_BASE		 	 0xc000
+#define BNX2_PCI_GRC_WINDOW3_BASE		 	 0xe000
+
 #define BNX2_PCI_CONFIG_1				0x00000404
 #define BNX2_PCI_CONFIG_1_RESERVED0			 (0xffL<<0)
 #define BNX2_PCI_CONFIG_1_READ_BOUNDARY			 (0x7L<<8)
@@ -700,6 +731,8 @@ struct l2_fhdr {
 #define BNX2_PCI_GRC_WINDOW3_ADDR			0x00000618
 #define BNX2_PCI_GRC_WINDOW3_ADDR_VALUE			 (0x1ffL<<13)
 
+#define BNX2_MSIX_TABLE_ADDR				 0x318000
+#define BNX2_MSIX_PBA_ADDR				 0x31c000
 
 /*
  *  misc_reg definition
@@ -6500,6 +6533,7 @@ struct flash_spec {
 struct bnx2_irq {
 	irq_handler_t	handler;
 	u16		vector;
+	u8		requested;
 	char		name[16];
 };
 
@@ -6535,13 +6569,15 @@ struct bnx2 {
 	u32			flags;
 #define PCIX_FLAG			0x00000001
 #define PCI_32BIT_FLAG			0x00000002
-#define ONE_TDMA_FLAG			0x00000004	/* no longer used */
+#define MSIX_CAP_FLAG			0x00000004
 #define NO_WOL_FLAG			0x00000008
 #define USING_MSI_FLAG			0x00000020
 #define ASF_ENABLE_FLAG			0x00000040
 #define MSI_CAP_FLAG			0x00000080
 #define ONE_SHOT_MSI_FLAG		0x00000100
 #define PCIE_FLAG			0x00000200
+#define USING_MSIX_FLAG			0x00000400
+#define USING_MSI_OR_MSIX_FLAG		(USING_MSI_FLAG | USING_MSIX_FLAG)
 
 	/* Put tx producer and consumer fields in separate cache lines. */
 
@@ -6550,7 +6586,7 @@ struct bnx2 {
 	u32		tx_bidx_addr;
 	u32		tx_bseq_addr;
 
-	struct bnx2_napi	bnx2_napi;
+	struct bnx2_napi	bnx2_napi[BNX2_MAX_MSIX_VEC];
 
 #ifdef BCM_VLAN
 	struct			vlan_group *vlgrp;



^ permalink raw reply related

* [PATCH 5/9][BNX2]: Move rx indexes into bnx2_napi struct.
From: Michael Chan @ 2007-12-20 20:32 UTC (permalink / raw)
  To: davem, netdev

[BNX2]: Move rx indexes into bnx2_napi struct.

Rx related fields used in NAPI polling are moved from the main
bnx2 struct to the bnx2_napi struct.

Signed-off-by: Michael Chan <mchan@broadcom.com>

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 0300a75..ecfaad1 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -2276,7 +2276,7 @@ bnx2_free_rx_page(struct bnx2 *bp, u16 index)
 }
 
 static inline int
-bnx2_alloc_rx_skb(struct bnx2 *bp, u16 index)
+bnx2_alloc_rx_skb(struct bnx2 *bp, struct bnx2_napi *bnapi, u16 index)
 {
 	struct sk_buff *skb;
 	struct sw_bd *rx_buf = &bp->rx_buf_ring[index];
@@ -2301,7 +2301,7 @@ bnx2_alloc_rx_skb(struct bnx2 *bp, u16 index)
 	rxbd->rx_bd_haddr_hi = (u64) mapping >> 32;
 	rxbd->rx_bd_haddr_lo = (u64) mapping & 0xffffffff;
 
-	bp->rx_prod_bseq += bp->rx_buf_use_size;
+	bnapi->rx_prod_bseq += bp->rx_buf_use_size;
 
 	return 0;
 }
@@ -2432,14 +2432,15 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 }
 
 static void
-bnx2_reuse_rx_skb_pages(struct bnx2 *bp, struct sk_buff *skb, int count)
+bnx2_reuse_rx_skb_pages(struct bnx2 *bp, struct bnx2_napi *bnapi,
+			struct sk_buff *skb, int count)
 {
 	struct sw_pg *cons_rx_pg, *prod_rx_pg;
 	struct rx_bd *cons_bd, *prod_bd;
 	dma_addr_t mapping;
 	int i;
-	u16 hw_prod = bp->rx_pg_prod, prod;
-	u16 cons = bp->rx_pg_cons;
+	u16 hw_prod = bnapi->rx_pg_prod, prod;
+	u16 cons = bnapi->rx_pg_cons;
 
 	for (i = 0; i < count; i++) {
 		prod = RX_PG_RING_IDX(hw_prod);
@@ -2476,12 +2477,12 @@ bnx2_reuse_rx_skb_pages(struct bnx2 *bp, struct sk_buff *skb, int count)
 		cons = RX_PG_RING_IDX(NEXT_RX_BD(cons));
 		hw_prod = NEXT_RX_BD(hw_prod);
 	}
-	bp->rx_pg_prod = hw_prod;
-	bp->rx_pg_cons = cons;
+	bnapi->rx_pg_prod = hw_prod;
+	bnapi->rx_pg_cons = cons;
 }
 
 static inline void
-bnx2_reuse_rx_skb(struct bnx2 *bp, struct sk_buff *skb,
+bnx2_reuse_rx_skb(struct bnx2 *bp, struct bnx2_napi *bnapi, struct sk_buff *skb,
 	u16 cons, u16 prod)
 {
 	struct sw_bd *cons_rx_buf, *prod_rx_buf;
@@ -2494,7 +2495,7 @@ bnx2_reuse_rx_skb(struct bnx2 *bp, struct sk_buff *skb,
 		pci_unmap_addr(cons_rx_buf, mapping),
 		bp->rx_offset + RX_COPY_THRESH, PCI_DMA_FROMDEVICE);
 
-	bp->rx_prod_bseq += bp->rx_buf_use_size;
+	bnapi->rx_prod_bseq += bp->rx_buf_use_size;
 
 	prod_rx_buf->skb = skb;
 
@@ -2511,20 +2512,21 @@ bnx2_reuse_rx_skb(struct bnx2 *bp, struct sk_buff *skb,
 }
 
 static int
-bnx2_rx_skb(struct bnx2 *bp, struct sk_buff *skb, unsigned int len,
-	    unsigned int hdr_len, dma_addr_t dma_addr, u32 ring_idx)
+bnx2_rx_skb(struct bnx2 *bp, struct bnx2_napi *bnapi, struct sk_buff *skb,
+	    unsigned int len, unsigned int hdr_len, dma_addr_t dma_addr,
+	    u32 ring_idx)
 {
 	int err;
 	u16 prod = ring_idx & 0xffff;
 
-	err = bnx2_alloc_rx_skb(bp, prod);
+	err = bnx2_alloc_rx_skb(bp, bnapi, prod);
 	if (unlikely(err)) {
-		bnx2_reuse_rx_skb(bp, skb, (u16) (ring_idx >> 16), prod);
+		bnx2_reuse_rx_skb(bp, bnapi, skb, (u16) (ring_idx >> 16), prod);
 		if (hdr_len) {
 			unsigned int raw_len = len + 4;
 			int pages = PAGE_ALIGN(raw_len - hdr_len) >> PAGE_SHIFT;
 
-			bnx2_reuse_rx_skb_pages(bp, NULL, pages);
+			bnx2_reuse_rx_skb_pages(bp, bnapi, NULL, pages);
 		}
 		return err;
 	}
@@ -2539,8 +2541,8 @@ bnx2_rx_skb(struct bnx2 *bp, struct sk_buff *skb, unsigned int len,
 	} else {
 		unsigned int i, frag_len, frag_size, pages;
 		struct sw_pg *rx_pg;
-		u16 pg_cons = bp->rx_pg_cons;
-		u16 pg_prod = bp->rx_pg_prod;
+		u16 pg_cons = bnapi->rx_pg_cons;
+		u16 pg_prod = bnapi->rx_pg_prod;
 
 		frag_size = len + 4 - hdr_len;
 		pages = PAGE_ALIGN(frag_size) >> PAGE_SHIFT;
@@ -2551,9 +2553,10 @@ bnx2_rx_skb(struct bnx2 *bp, struct sk_buff *skb, unsigned int len,
 			if (unlikely(frag_len <= 4)) {
 				unsigned int tail = 4 - frag_len;
 
-				bp->rx_pg_cons = pg_cons;
-				bp->rx_pg_prod = pg_prod;
-				bnx2_reuse_rx_skb_pages(bp, NULL, pages - i);
+				bnapi->rx_pg_cons = pg_cons;
+				bnapi->rx_pg_prod = pg_prod;
+				bnx2_reuse_rx_skb_pages(bp, bnapi, NULL,
+							pages - i);
 				skb->len -= tail;
 				if (i == 0) {
 					skb->tail -= tail;
@@ -2579,9 +2582,10 @@ bnx2_rx_skb(struct bnx2 *bp, struct sk_buff *skb, unsigned int len,
 
 			err = bnx2_alloc_rx_page(bp, RX_PG_RING_IDX(pg_prod));
 			if (unlikely(err)) {
-				bp->rx_pg_cons = pg_cons;
-				bp->rx_pg_prod = pg_prod;
-				bnx2_reuse_rx_skb_pages(bp, skb, pages - i);
+				bnapi->rx_pg_cons = pg_cons;
+				bnapi->rx_pg_prod = pg_prod;
+				bnx2_reuse_rx_skb_pages(bp, bnapi, skb,
+							pages - i);
 				return err;
 			}
 
@@ -2593,8 +2597,8 @@ bnx2_rx_skb(struct bnx2 *bp, struct sk_buff *skb, unsigned int len,
 			pg_prod = NEXT_RX_BD(pg_prod);
 			pg_cons = RX_PG_RING_IDX(NEXT_RX_BD(pg_cons));
 		}
-		bp->rx_pg_prod = pg_prod;
-		bp->rx_pg_cons = pg_cons;
+		bnapi->rx_pg_prod = pg_prod;
+		bnapi->rx_pg_cons = pg_cons;
 	}
 	return 0;
 }
@@ -2617,8 +2621,8 @@ bnx2_rx_int(struct bnx2 *bp, struct bnx2_napi *bnapi, int budget)
 	int rx_pkt = 0, pg_ring_used = 0;
 
 	hw_cons = bnx2_get_hw_rx_cons(bnapi);
-	sw_cons = bp->rx_cons;
-	sw_prod = bp->rx_prod;
+	sw_cons = bnapi->rx_cons;
+	sw_prod = bnapi->rx_prod;
 
 	/* Memory barrier necessary as speculative reads of the rx
 	 * buffer can be ahead of the index in the status block
@@ -2654,7 +2658,8 @@ bnx2_rx_int(struct bnx2 *bp, struct bnx2_napi *bnapi, int budget)
 			L2_FHDR_ERRORS_TOO_SHORT |
 			L2_FHDR_ERRORS_GIANT_FRAME)) {
 
-			bnx2_reuse_rx_skb(bp, skb, sw_ring_cons, sw_ring_prod);
+			bnx2_reuse_rx_skb(bp, bnapi, skb, sw_ring_cons,
+					  sw_ring_prod);
 			goto next_rx;
 		}
 		hdr_len = 0;
@@ -2673,7 +2678,7 @@ bnx2_rx_int(struct bnx2 *bp, struct bnx2_napi *bnapi, int budget)
 
 			new_skb = netdev_alloc_skb(bp->dev, len + 2);
 			if (new_skb == NULL) {
-				bnx2_reuse_rx_skb(bp, skb, sw_ring_cons,
+				bnx2_reuse_rx_skb(bp, bnapi, skb, sw_ring_cons,
 						  sw_ring_prod);
 				goto next_rx;
 			}
@@ -2684,12 +2689,12 @@ bnx2_rx_int(struct bnx2 *bp, struct bnx2_napi *bnapi, int budget)
 			skb_reserve(new_skb, 2);
 			skb_put(new_skb, len);
 
-			bnx2_reuse_rx_skb(bp, skb,
+			bnx2_reuse_rx_skb(bp, bnapi, skb,
 				sw_ring_cons, sw_ring_prod);
 
 			skb = new_skb;
-		} else if (unlikely(bnx2_rx_skb(bp, skb, len, hdr_len, dma_addr,
-				    (sw_ring_cons << 16) | sw_ring_prod)))
+		} else if (unlikely(bnx2_rx_skb(bp, bnapi, skb, len, hdr_len,
+			   dma_addr, (sw_ring_cons << 16) | sw_ring_prod)))
 			goto next_rx;
 
 		skb->protocol = eth_type_trans(skb, bp->dev);
@@ -2737,16 +2742,16 @@ next_rx:
 			rmb();
 		}
 	}
-	bp->rx_cons = sw_cons;
-	bp->rx_prod = sw_prod;
+	bnapi->rx_cons = sw_cons;
+	bnapi->rx_prod = sw_prod;
 
 	if (pg_ring_used)
 		REG_WR16(bp, MB_RX_CID_ADDR + BNX2_L2CTX_HOST_PG_BDIDX,
-			 bp->rx_pg_prod);
+			 bnapi->rx_pg_prod);
 
 	REG_WR16(bp, MB_RX_CID_ADDR + BNX2_L2CTX_HOST_BDIDX, sw_prod);
 
-	REG_WR(bp, MB_RX_CID_ADDR + BNX2_L2CTX_HOST_BSEQ, bp->rx_prod_bseq);
+	REG_WR(bp, MB_RX_CID_ADDR + BNX2_L2CTX_HOST_BSEQ, bnapi->rx_prod_bseq);
 
 	mmiowb();
 
@@ -2845,7 +2850,7 @@ bnx2_has_work(struct bnx2_napi *bnapi)
 	struct bnx2 *bp = bnapi->bp;
 	struct status_block *sblk = bp->status_blk;
 
-	if ((bnx2_get_hw_rx_cons(bnapi) != bp->rx_cons) ||
+	if ((bnx2_get_hw_rx_cons(bnapi) != bnapi->rx_cons) ||
 	    (bnx2_get_hw_tx_cons(bnapi) != bnapi->hw_tx_cons))
 		return 1;
 
@@ -2879,7 +2884,7 @@ static int bnx2_poll_work(struct bnx2 *bp, struct bnx2_napi *bnapi,
 	if (bnx2_get_hw_tx_cons(bnapi) != bnapi->hw_tx_cons)
 		bnx2_tx_int(bp, bnapi);
 
-	if (bnx2_get_hw_rx_cons(bnapi) != bp->rx_cons)
+	if (bnx2_get_hw_rx_cons(bnapi) != bnapi->rx_cons)
 		work_done += bnx2_rx_int(bp, bnapi, budget - work_done);
 
 	return work_done;
@@ -4432,12 +4437,13 @@ bnx2_init_rx_ring(struct bnx2 *bp)
 	int i;
 	u16 prod, ring_prod;
 	u32 val, rx_cid_addr = GET_CID_ADDR(RX_CID);
+	struct bnx2_napi *bnapi = &bp->bnx2_napi;
 
-	bp->rx_prod = 0;
-	bp->rx_cons = 0;
-	bp->rx_prod_bseq = 0;
-	bp->rx_pg_prod = 0;
-	bp->rx_pg_cons = 0;
+	bnapi->rx_prod = 0;
+	bnapi->rx_cons = 0;
+	bnapi->rx_prod_bseq = 0;
+	bnapi->rx_pg_prod = 0;
+	bnapi->rx_pg_cons = 0;
 
 	bnx2_init_rxbd_rings(bp->rx_desc_ring, bp->rx_desc_mapping,
 			     bp->rx_buf_use_size, bp->rx_max_ring);
@@ -4473,29 +4479,30 @@ bnx2_init_rx_ring(struct bnx2 *bp)
 	val = (u64) bp->rx_desc_mapping[0] & 0xffffffff;
 	CTX_WR(bp, rx_cid_addr, BNX2_L2CTX_NX_BDHADDR_LO, val);
 
-	ring_prod = prod = bp->rx_pg_prod;
+	ring_prod = prod = bnapi->rx_pg_prod;
 	for (i = 0; i < bp->rx_pg_ring_size; i++) {
 		if (bnx2_alloc_rx_page(bp, ring_prod) < 0)
 			break;
 		prod = NEXT_RX_BD(prod);
 		ring_prod = RX_PG_RING_IDX(prod);
 	}
-	bp->rx_pg_prod = prod;
+	bnapi->rx_pg_prod = prod;
 
-	ring_prod = prod = bp->rx_prod;
+	ring_prod = prod = bnapi->rx_prod;
 	for (i = 0; i < bp->rx_ring_size; i++) {
-		if (bnx2_alloc_rx_skb(bp, ring_prod) < 0) {
+		if (bnx2_alloc_rx_skb(bp, bnapi, ring_prod) < 0) {
 			break;
 		}
 		prod = NEXT_RX_BD(prod);
 		ring_prod = RX_RING_IDX(prod);
 	}
-	bp->rx_prod = prod;
+	bnapi->rx_prod = prod;
 
-	REG_WR16(bp, MB_RX_CID_ADDR + BNX2_L2CTX_HOST_PG_BDIDX, bp->rx_pg_prod);
+	REG_WR16(bp, MB_RX_CID_ADDR + BNX2_L2CTX_HOST_PG_BDIDX,
+		 bnapi->rx_pg_prod);
 	REG_WR16(bp, MB_RX_CID_ADDR + BNX2_L2CTX_HOST_BDIDX, prod);
 
-	REG_WR(bp, MB_RX_CID_ADDR + BNX2_L2CTX_HOST_BSEQ, bp->rx_prod_bseq);
+	REG_WR(bp, MB_RX_CID_ADDR + BNX2_L2CTX_HOST_BSEQ, bnapi->rx_prod_bseq);
 }
 
 static u32 bnx2_find_max_ring(u32 ring_size, u32 max_size)
diff --git a/drivers/net/bnx2.h b/drivers/net/bnx2.h
index 958fdda..b75795b 100644
--- a/drivers/net/bnx2.h
+++ b/drivers/net/bnx2.h
@@ -6512,6 +6512,14 @@ struct bnx2_napi {
 
 	u16			tx_cons;
 	u16			hw_tx_cons;
+
+	u32			rx_prod_bseq;
+	u16			rx_prod;
+	u16			rx_cons;
+
+	u16			rx_pg_prod;
+	u16			rx_pg_cons;
+
 };
 
 struct bnx2 {
@@ -6556,13 +6564,6 @@ struct bnx2 {
 	u32			rx_max_ring_idx;
 	u32			rx_max_pg_ring_idx;
 
-	u32			rx_prod_bseq;
-	u16			rx_prod;
-	u16			rx_cons;
-
-	u16			rx_pg_prod;
-	u16			rx_pg_cons;
-
 	u32			rx_csum;
 
 	struct sw_bd		*rx_buf_ring;



^ permalink raw reply related

* [PATCH 4/9][BNX2]: Move tx indexes into bnx2_napi struct.
From: Michael Chan @ 2007-12-20 20:32 UTC (permalink / raw)
  To: davem, netdev

[BNX2]: Move tx indexes into bnx2_napi struct.

Tx related fields used in NAPI polling are moved from the main
bnx2 struct to the bnx2_napi struct.

Signed-off-by: Michael Chan <mchan@broadcom.com>

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 3f754e6..0300a75 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -226,7 +226,7 @@ static struct flash_spec flash_5709 = {
 
 MODULE_DEVICE_TABLE(pci, bnx2_pci_tbl);
 
-static inline u32 bnx2_tx_avail(struct bnx2 *bp)
+static inline u32 bnx2_tx_avail(struct bnx2 *bp, struct bnx2_napi *bnapi)
 {
 	u32 diff;
 
@@ -235,7 +235,7 @@ static inline u32 bnx2_tx_avail(struct bnx2 *bp)
 	/* The ring uses 256 indices for 255 entries, one of them
 	 * needs to be skipped.
 	 */
-	diff = bp->tx_prod - bp->tx_cons;
+	diff = bp->tx_prod - bnapi->tx_cons;
 	if (unlikely(diff >= TX_DESC_CNT)) {
 		diff &= 0xffff;
 		if (diff == TX_DESC_CNT)
@@ -2358,7 +2358,7 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 	int tx_free_bd = 0;
 
 	hw_cons = bnx2_get_hw_tx_cons(bnapi);
-	sw_cons = bp->tx_cons;
+	sw_cons = bnapi->tx_cons;
 
 	while (sw_cons != hw_cons) {
 		struct sw_bd *tx_buf;
@@ -2412,8 +2412,8 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 		hw_cons = bnx2_get_hw_tx_cons(bnapi);
 	}
 
-	bp->hw_tx_cons = hw_cons;
-	bp->tx_cons = sw_cons;
+	bnapi->hw_tx_cons = hw_cons;
+	bnapi->tx_cons = sw_cons;
 	/* Need to make the tx_cons update visible to bnx2_start_xmit()
 	 * before checking for netif_queue_stopped().  Without the
 	 * memory barrier, there is a small possibility that bnx2_start_xmit()
@@ -2422,10 +2422,10 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 	smp_mb();
 
 	if (unlikely(netif_queue_stopped(bp->dev)) &&
-		     (bnx2_tx_avail(bp) > bp->tx_wake_thresh)) {
+		     (bnx2_tx_avail(bp, bnapi) > bp->tx_wake_thresh)) {
 		netif_tx_lock(bp->dev);
 		if ((netif_queue_stopped(bp->dev)) &&
-		    (bnx2_tx_avail(bp) > bp->tx_wake_thresh))
+		    (bnx2_tx_avail(bp, bnapi) > bp->tx_wake_thresh))
 			netif_wake_queue(bp->dev);
 		netif_tx_unlock(bp->dev);
 	}
@@ -2846,7 +2846,7 @@ bnx2_has_work(struct bnx2_napi *bnapi)
 	struct status_block *sblk = bp->status_blk;
 
 	if ((bnx2_get_hw_rx_cons(bnapi) != bp->rx_cons) ||
-	    (bnx2_get_hw_tx_cons(bnapi) != bp->hw_tx_cons))
+	    (bnx2_get_hw_tx_cons(bnapi) != bnapi->hw_tx_cons))
 		return 1;
 
 	if ((sblk->status_attn_bits & STATUS_ATTN_EVENTS) !=
@@ -2876,7 +2876,7 @@ static int bnx2_poll_work(struct bnx2 *bp, struct bnx2_napi *bnapi,
 		REG_RD(bp, BNX2_HC_COMMAND);
 	}
 
-	if (bnx2_get_hw_tx_cons(bnapi) != bp->hw_tx_cons)
+	if (bnx2_get_hw_tx_cons(bnapi) != bnapi->hw_tx_cons)
 		bnx2_tx_int(bp, bnapi);
 
 	if (bnx2_get_hw_rx_cons(bnapi) != bp->rx_cons)
@@ -4381,6 +4381,7 @@ bnx2_init_tx_ring(struct bnx2 *bp)
 {
 	struct tx_bd *txbd;
 	u32 cid;
+	struct bnx2_napi *bnapi = &bp->bnx2_napi;
 
 	bp->tx_wake_thresh = bp->tx_ring_size / 2;
 
@@ -4390,8 +4391,8 @@ bnx2_init_tx_ring(struct bnx2 *bp)
 	txbd->tx_bd_haddr_lo = (u64) bp->tx_desc_mapping & 0xffffffff;
 
 	bp->tx_prod = 0;
-	bp->tx_cons = 0;
-	bp->hw_tx_cons = 0;
+	bnapi->tx_cons = 0;
+	bnapi->hw_tx_cons = 0;
 	bp->tx_prod_bseq = 0;
 
 	cid = TX_CID;
@@ -5440,8 +5441,10 @@ bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	u32 len, vlan_tag_flags, last_frag, mss;
 	u16 prod, ring_prod;
 	int i;
+	struct bnx2_napi *bnapi = &bp->bnx2_napi;
 
-	if (unlikely(bnx2_tx_avail(bp) < (skb_shinfo(skb)->nr_frags + 1))) {
+	if (unlikely(bnx2_tx_avail(bp, bnapi) <
+	    (skb_shinfo(skb)->nr_frags + 1))) {
 		netif_stop_queue(dev);
 		printk(KERN_ERR PFX "%s: BUG! Tx ring full when queue awake!\n",
 			dev->name);
@@ -5556,9 +5559,9 @@ bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	bp->tx_prod = prod;
 	dev->trans_start = jiffies;
 
-	if (unlikely(bnx2_tx_avail(bp) <= MAX_SKB_FRAGS)) {
+	if (unlikely(bnx2_tx_avail(bp, bnapi) <= MAX_SKB_FRAGS)) {
 		netif_stop_queue(dev);
-		if (bnx2_tx_avail(bp) > bp->tx_wake_thresh)
+		if (bnx2_tx_avail(bp, bnapi) > bp->tx_wake_thresh)
 			netif_wake_queue(dev);
 	}
 
diff --git a/drivers/net/bnx2.h b/drivers/net/bnx2.h
index 345b6db..958fdda 100644
--- a/drivers/net/bnx2.h
+++ b/drivers/net/bnx2.h
@@ -6509,6 +6509,9 @@ struct bnx2_napi {
 	struct status_block	*status_blk;
 	u32 			last_status_idx;
 	u32			int_num;
+
+	u16			tx_cons;
+	u16			hw_tx_cons;
 };
 
 struct bnx2 {
@@ -6539,9 +6542,6 @@ struct bnx2 {
 	u32		tx_bidx_addr;
 	u32		tx_bseq_addr;
 
-	u16		tx_cons __attribute__((aligned(L1_CACHE_BYTES)));
-	u16		hw_tx_cons;
-
 	struct bnx2_napi	bnx2_napi;
 
 #ifdef BCM_VLAN



^ permalink raw reply related

* [PATCH 3/9][BNX2]: Introduce new bnx2_napi structure.
From: Michael Chan @ 2007-12-20 20:31 UTC (permalink / raw)
  To: davem, netdev

[BNX2]: Introduce new bnx2_napi structure.

Introduce a bnx2_napi structure that will hold a napi_struct and
other fields to handle NAPI polling for the napi_struct.  Various tx
and rx indexes and status block pointers will be moved from the main
bnx2 structure to this bnx2_napi structure.

Most NAPI path functions are modified to be passed this bnx2_napi
struct pointer.

Signed-off-by: Michael Chan <mchan@broadcom.com>

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 83cdbde..3f754e6 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -407,12 +407,14 @@ bnx2_disable_int(struct bnx2 *bp)
 static void
 bnx2_enable_int(struct bnx2 *bp)
 {
+	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+
 	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
 	       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
-	       BNX2_PCICFG_INT_ACK_CMD_MASK_INT | bp->last_status_idx);
+	       BNX2_PCICFG_INT_ACK_CMD_MASK_INT | bnapi->last_status_idx);
 
 	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
-	       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | bp->last_status_idx);
+	       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | bnapi->last_status_idx);
 
 	REG_WR(bp, BNX2_HC_COMMAND, bp->hc_cmd | BNX2_HC_COMMAND_COAL_NOW);
 }
@@ -426,11 +428,23 @@ bnx2_disable_int_sync(struct bnx2 *bp)
 }
 
 static void
+bnx2_napi_disable(struct bnx2 *bp)
+{
+	napi_disable(&bp->bnx2_napi.napi);
+}
+
+static void
+bnx2_napi_enable(struct bnx2 *bp)
+{
+	napi_enable(&bp->bnx2_napi.napi);
+}
+
+static void
 bnx2_netif_stop(struct bnx2 *bp)
 {
 	bnx2_disable_int_sync(bp);
 	if (netif_running(bp->dev)) {
-		napi_disable(&bp->napi);
+		bnx2_napi_disable(bp);
 		netif_tx_disable(bp->dev);
 		bp->dev->trans_start = jiffies;	/* prevent tx timeout */
 	}
@@ -442,7 +456,7 @@ bnx2_netif_start(struct bnx2 *bp)
 	if (atomic_dec_and_test(&bp->intr_sem)) {
 		if (netif_running(bp->dev)) {
 			netif_wake_queue(bp->dev);
-			napi_enable(&bp->napi);
+			bnx2_napi_enable(bp);
 			bnx2_enable_int(bp);
 		}
 	}
@@ -555,6 +569,8 @@ bnx2_alloc_mem(struct bnx2 *bp)
 
 	memset(bp->status_blk, 0, bp->status_stats_size);
 
+	bp->bnx2_napi.status_blk = bp->status_blk;
+
 	bp->stats_blk = (void *) ((unsigned long) bp->status_blk +
 				  status_blk_size);
 
@@ -2291,9 +2307,9 @@ bnx2_alloc_rx_skb(struct bnx2 *bp, u16 index)
 }
 
 static int
-bnx2_phy_event_is_set(struct bnx2 *bp, u32 event)
+bnx2_phy_event_is_set(struct bnx2 *bp, struct bnx2_napi *bnapi, u32 event)
 {
-	struct status_block *sblk = bp->status_blk;
+	struct status_block *sblk = bnapi->status_blk;
 	u32 new_link_state, old_link_state;
 	int is_set = 1;
 
@@ -2311,24 +2327,24 @@ bnx2_phy_event_is_set(struct bnx2 *bp, u32 event)
 }
 
 static void
-bnx2_phy_int(struct bnx2 *bp)
+bnx2_phy_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 {
-	if (bnx2_phy_event_is_set(bp, STATUS_ATTN_BITS_LINK_STATE)) {
+	if (bnx2_phy_event_is_set(bp, bnapi, STATUS_ATTN_BITS_LINK_STATE)) {
 		spin_lock(&bp->phy_lock);
 		bnx2_set_link(bp);
 		spin_unlock(&bp->phy_lock);
 	}
-	if (bnx2_phy_event_is_set(bp, STATUS_ATTN_BITS_TIMER_ABORT))
+	if (bnx2_phy_event_is_set(bp, bnapi, STATUS_ATTN_BITS_TIMER_ABORT))
 		bnx2_set_remote_link(bp);
 
 }
 
 static inline u16
-bnx2_get_hw_tx_cons(struct bnx2 *bp)
+bnx2_get_hw_tx_cons(struct bnx2_napi *bnapi)
 {
 	u16 cons;
 
-	cons = bp->status_blk->status_tx_quick_consumer_index0;
+	cons = bnapi->status_blk->status_tx_quick_consumer_index0;
 
 	if (unlikely((cons & MAX_TX_DESC_CNT) == MAX_TX_DESC_CNT))
 		cons++;
@@ -2336,12 +2352,12 @@ bnx2_get_hw_tx_cons(struct bnx2 *bp)
 }
 
 static void
-bnx2_tx_int(struct bnx2 *bp)
+bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 {
 	u16 hw_cons, sw_cons, sw_ring_cons;
 	int tx_free_bd = 0;
 
-	hw_cons = bnx2_get_hw_tx_cons(bp);
+	hw_cons = bnx2_get_hw_tx_cons(bnapi);
 	sw_cons = bp->tx_cons;
 
 	while (sw_cons != hw_cons) {
@@ -2393,7 +2409,7 @@ bnx2_tx_int(struct bnx2 *bp)
 
 		dev_kfree_skb(skb);
 
-		hw_cons = bnx2_get_hw_tx_cons(bp);
+		hw_cons = bnx2_get_hw_tx_cons(bnapi);
 	}
 
 	bp->hw_tx_cons = hw_cons;
@@ -2584,9 +2600,9 @@ bnx2_rx_skb(struct bnx2 *bp, struct sk_buff *skb, unsigned int len,
 }
 
 static inline u16
-bnx2_get_hw_rx_cons(struct bnx2 *bp)
+bnx2_get_hw_rx_cons(struct bnx2_napi *bnapi)
 {
-	u16 cons = bp->status_blk->status_rx_quick_consumer_index0;
+	u16 cons = bnapi->status_blk->status_rx_quick_consumer_index0;
 
 	if (unlikely((cons & MAX_RX_DESC_CNT) == MAX_RX_DESC_CNT))
 		cons++;
@@ -2594,13 +2610,13 @@ bnx2_get_hw_rx_cons(struct bnx2 *bp)
 }
 
 static int
-bnx2_rx_int(struct bnx2 *bp, int budget)
+bnx2_rx_int(struct bnx2 *bp, struct bnx2_napi *bnapi, int budget)
 {
 	u16 hw_cons, sw_cons, sw_ring_cons, sw_prod, sw_ring_prod;
 	struct l2_fhdr *rx_hdr;
 	int rx_pkt = 0, pg_ring_used = 0;
 
-	hw_cons = bnx2_get_hw_rx_cons(bp);
+	hw_cons = bnx2_get_hw_rx_cons(bnapi);
 	sw_cons = bp->rx_cons;
 	sw_prod = bp->rx_prod;
 
@@ -2717,7 +2733,7 @@ next_rx:
 
 		/* Refresh hw_cons to see if there is new work */
 		if (sw_cons == hw_cons) {
-			hw_cons = bnx2_get_hw_rx_cons(bp);
+			hw_cons = bnx2_get_hw_rx_cons(bnapi);
 			rmb();
 		}
 	}
@@ -2746,8 +2762,9 @@ bnx2_msi(int irq, void *dev_instance)
 {
 	struct net_device *dev = dev_instance;
 	struct bnx2 *bp = netdev_priv(dev);
+	struct bnx2_napi *bnapi = &bp->bnx2_napi;
 
-	prefetch(bp->status_blk);
+	prefetch(bnapi->status_blk);
 	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
 		BNX2_PCICFG_INT_ACK_CMD_USE_INT_HC_PARAM |
 		BNX2_PCICFG_INT_ACK_CMD_MASK_INT);
@@ -2756,7 +2773,7 @@ bnx2_msi(int irq, void *dev_instance)
 	if (unlikely(atomic_read(&bp->intr_sem) != 0))
 		return IRQ_HANDLED;
 
-	netif_rx_schedule(dev, &bp->napi);
+	netif_rx_schedule(dev, &bnapi->napi);
 
 	return IRQ_HANDLED;
 }
@@ -2766,14 +2783,15 @@ bnx2_msi_1shot(int irq, void *dev_instance)
 {
 	struct net_device *dev = dev_instance;
 	struct bnx2 *bp = netdev_priv(dev);
+	struct bnx2_napi *bnapi = &bp->bnx2_napi;
 
-	prefetch(bp->status_blk);
+	prefetch(bnapi->status_blk);
 
 	/* Return here if interrupt is disabled. */
 	if (unlikely(atomic_read(&bp->intr_sem) != 0))
 		return IRQ_HANDLED;
 
-	netif_rx_schedule(dev, &bp->napi);
+	netif_rx_schedule(dev, &bnapi->napi);
 
 	return IRQ_HANDLED;
 }
@@ -2783,7 +2801,8 @@ bnx2_interrupt(int irq, void *dev_instance)
 {
 	struct net_device *dev = dev_instance;
 	struct bnx2 *bp = netdev_priv(dev);
-	struct status_block *sblk = bp->status_blk;
+	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+	struct status_block *sblk = bnapi->status_blk;
 
 	/* When using INTx, it is possible for the interrupt to arrive
 	 * at the CPU before the status block posted prior to the
@@ -2791,7 +2810,7 @@ bnx2_interrupt(int irq, void *dev_instance)
 	 * When using MSI, the MSI message will always complete after
 	 * the status block write.
 	 */
-	if ((sblk->status_idx == bp->last_status_idx) &&
+	if ((sblk->status_idx == bnapi->last_status_idx) &&
 	    (REG_RD(bp, BNX2_PCICFG_MISC_STATUS) &
 	     BNX2_PCICFG_MISC_STATUS_INTA_VALUE))
 		return IRQ_NONE;
@@ -2809,9 +2828,9 @@ bnx2_interrupt(int irq, void *dev_instance)
 	if (unlikely(atomic_read(&bp->intr_sem) != 0))
 		return IRQ_HANDLED;
 
-	if (netif_rx_schedule_prep(dev, &bp->napi)) {
-		bp->last_status_idx = sblk->status_idx;
-		__netif_rx_schedule(dev, &bp->napi);
+	if (netif_rx_schedule_prep(dev, &bnapi->napi)) {
+		bnapi->last_status_idx = sblk->status_idx;
+		__netif_rx_schedule(dev, &bnapi->napi);
 	}
 
 	return IRQ_HANDLED;
@@ -2821,12 +2840,13 @@ bnx2_interrupt(int irq, void *dev_instance)
 				 STATUS_ATTN_BITS_TIMER_ABORT)
 
 static inline int
-bnx2_has_work(struct bnx2 *bp)
+bnx2_has_work(struct bnx2_napi *bnapi)
 {
+	struct bnx2 *bp = bnapi->bp;
 	struct status_block *sblk = bp->status_blk;
 
-	if ((bnx2_get_hw_rx_cons(bp) != bp->rx_cons) ||
-	    (bnx2_get_hw_tx_cons(bp) != bp->hw_tx_cons))
+	if ((bnx2_get_hw_rx_cons(bnapi) != bp->rx_cons) ||
+	    (bnx2_get_hw_tx_cons(bnapi) != bp->hw_tx_cons))
 		return 1;
 
 	if ((sblk->status_attn_bits & STATUS_ATTN_EVENTS) !=
@@ -2836,16 +2856,17 @@ bnx2_has_work(struct bnx2 *bp)
 	return 0;
 }
 
-static int bnx2_poll_work(struct bnx2 *bp, int work_done, int budget)
+static int bnx2_poll_work(struct bnx2 *bp, struct bnx2_napi *bnapi,
+			  int work_done, int budget)
 {
-	struct status_block *sblk = bp->status_blk;
+	struct status_block *sblk = bnapi->status_blk;
 	u32 status_attn_bits = sblk->status_attn_bits;
 	u32 status_attn_bits_ack = sblk->status_attn_bits_ack;
 
 	if ((status_attn_bits & STATUS_ATTN_EVENTS) !=
 	    (status_attn_bits_ack & STATUS_ATTN_EVENTS)) {
 
-		bnx2_phy_int(bp);
+		bnx2_phy_int(bp, bnapi);
 
 		/* This is needed to take care of transient status
 		 * during link changes.
@@ -2855,49 +2876,50 @@ static int bnx2_poll_work(struct bnx2 *bp, int work_done, int budget)
 		REG_RD(bp, BNX2_HC_COMMAND);
 	}
 
-	if (bnx2_get_hw_tx_cons(bp) != bp->hw_tx_cons)
-		bnx2_tx_int(bp);
+	if (bnx2_get_hw_tx_cons(bnapi) != bp->hw_tx_cons)
+		bnx2_tx_int(bp, bnapi);
 
-	if (bnx2_get_hw_rx_cons(bp) != bp->rx_cons)
-		work_done += bnx2_rx_int(bp, budget - work_done);
+	if (bnx2_get_hw_rx_cons(bnapi) != bp->rx_cons)
+		work_done += bnx2_rx_int(bp, bnapi, budget - work_done);
 
 	return work_done;
 }
 
 static int bnx2_poll(struct napi_struct *napi, int budget)
 {
-	struct bnx2 *bp = container_of(napi, struct bnx2, napi);
+	struct bnx2_napi *bnapi = container_of(napi, struct bnx2_napi, napi);
+	struct bnx2 *bp = bnapi->bp;
 	int work_done = 0;
-	struct status_block *sblk = bp->status_blk;
+	struct status_block *sblk = bnapi->status_blk;
 
 	while (1) {
-		work_done = bnx2_poll_work(bp, work_done, budget);
+		work_done = bnx2_poll_work(bp, bnapi, work_done, budget);
 
 		if (unlikely(work_done >= budget))
 			break;
 
-		/* bp->last_status_idx is used below to tell the hw how
+		/* bnapi->last_status_idx is used below to tell the hw how
 		 * much work has been processed, so we must read it before
 		 * checking for more work.
 		 */
-		bp->last_status_idx = sblk->status_idx;
+		bnapi->last_status_idx = sblk->status_idx;
 		rmb();
-		if (likely(!bnx2_has_work(bp))) {
+		if (likely(!bnx2_has_work(bnapi))) {
 			netif_rx_complete(bp->dev, napi);
 			if (likely(bp->flags & USING_MSI_FLAG)) {
 				REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
 				       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
-				       bp->last_status_idx);
+				       bnapi->last_status_idx);
 				break;
 			}
 			REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
 			       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
 			       BNX2_PCICFG_INT_ACK_CMD_MASK_INT |
-			       bp->last_status_idx);
+			       bnapi->last_status_idx);
 
 			REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
 			       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
-			       bp->last_status_idx);
+			       bnapi->last_status_idx);
 			break;
 		}
 	}
@@ -4247,7 +4269,7 @@ bnx2_init_chip(struct bnx2 *bp)
 		val |= BNX2_EMAC_RX_MTU_SIZE_JUMBO_ENA;
 	REG_WR(bp, BNX2_EMAC_RX_MTU_SIZE, val);
 
-	bp->last_status_idx = 0;
+	bp->bnx2_napi.last_status_idx = 0;
 	bp->rx_mode = BNX2_EMAC_RX_MODE_SORT_MODE;
 
 	/* Set up how to generate a link change interrupt. */
@@ -4887,6 +4909,7 @@ bnx2_run_loopback(struct bnx2 *bp, int loopback_mode)
 	struct sw_bd *rx_buf;
 	struct l2_fhdr *rx_hdr;
 	int ret = -ENODEV;
+	struct bnx2_napi *bnapi = &bp->bnx2_napi;
 
 	if (loopback_mode == BNX2_MAC_LOOPBACK) {
 		bp->loopback = MAC_LOOPBACK;
@@ -4921,7 +4944,7 @@ bnx2_run_loopback(struct bnx2 *bp, int loopback_mode)
 	REG_RD(bp, BNX2_HC_COMMAND);
 
 	udelay(5);
-	rx_start_idx = bnx2_get_hw_rx_cons(bp);
+	rx_start_idx = bnx2_get_hw_rx_cons(bnapi);
 
 	num_pkts = 0;
 
@@ -4951,10 +4974,10 @@ bnx2_run_loopback(struct bnx2 *bp, int loopback_mode)
 	pci_unmap_single(bp->pdev, map, pkt_size, PCI_DMA_TODEVICE);
 	dev_kfree_skb(skb);
 
-	if (bnx2_get_hw_tx_cons(bp) != bp->tx_prod)
+	if (bnx2_get_hw_tx_cons(bnapi) != bp->tx_prod)
 		goto loopback_test_done;
 
-	rx_idx = bnx2_get_hw_rx_cons(bp);
+	rx_idx = bnx2_get_hw_rx_cons(bnapi);
 	if (rx_idx != rx_start_idx + num_pkts) {
 		goto loopback_test_done;
 	}
@@ -5295,11 +5318,11 @@ bnx2_open(struct net_device *dev)
 		return rc;
 
 	bnx2_setup_int_mode(bp, disable_msi);
-	napi_enable(&bp->napi);
+	bnx2_napi_enable(bp);
 	rc = bnx2_request_irq(bp);
 
 	if (rc) {
-		napi_disable(&bp->napi);
+		bnx2_napi_disable(bp);
 		bnx2_free_mem(bp);
 		return rc;
 	}
@@ -5307,7 +5330,7 @@ bnx2_open(struct net_device *dev)
 	rc = bnx2_init_nic(bp);
 
 	if (rc) {
-		napi_disable(&bp->napi);
+		bnx2_napi_disable(bp);
 		bnx2_free_irq(bp);
 		bnx2_free_skbs(bp);
 		bnx2_free_mem(bp);
@@ -5342,7 +5365,7 @@ bnx2_open(struct net_device *dev)
 				rc = bnx2_request_irq(bp);
 
 			if (rc) {
-				napi_disable(&bp->napi);
+				bnx2_napi_disable(bp);
 				bnx2_free_skbs(bp);
 				bnx2_free_mem(bp);
 				del_timer_sync(&bp->timer);
@@ -5557,7 +5580,7 @@ bnx2_close(struct net_device *dev)
 		msleep(1);
 
 	bnx2_disable_int_sync(bp);
-	napi_disable(&bp->napi);
+	bnx2_napi_disable(bp);
 	del_timer_sync(&bp->timer);
 	if (bp->flags & NO_WOL_FLAG)
 		reset_code = BNX2_DRV_MSG_CODE_UNLOAD_LNK_DN;
@@ -7083,6 +7106,15 @@ bnx2_bus_string(struct bnx2 *bp, char *str)
 }
 
 static int __devinit
+bnx2_init_napi(struct bnx2 *bp)
+{
+	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+
+	bnapi->bp = bp;
+	netif_napi_add(bp->dev, &bnapi->napi, bnx2_poll, 64);
+}
+
+static int __devinit
 bnx2_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 {
 	static int version_printed = 0;
@@ -7123,7 +7155,7 @@ bnx2_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	dev->ethtool_ops = &bnx2_ethtool_ops;
 
 	bp = netdev_priv(dev);
-	netif_napi_add(dev, &bp->napi, bnx2_poll, 64);
+	bnx2_init_napi(bp);
 
 #if defined(HAVE_POLL_CONTROLLER) || defined(CONFIG_NET_POLL_CONTROLLER)
 	dev->poll_controller = poll_bnx2;
diff --git a/drivers/net/bnx2.h b/drivers/net/bnx2.h
index 1accf00..345b6db 100644
--- a/drivers/net/bnx2.h
+++ b/drivers/net/bnx2.h
@@ -6503,6 +6503,14 @@ struct bnx2_irq {
 	char		name[16];
 };
 
+struct bnx2_napi {
+	struct napi_struct	napi		____cacheline_aligned;
+	struct bnx2		*bp;
+	struct status_block	*status_blk;
+	u32 			last_status_idx;
+	u32			int_num;
+};
+
 struct bnx2 {
 	/* Fields used in the tx and intr/napi performance paths are grouped */
 	/* together in the beginning of the structure. */
@@ -6511,13 +6519,8 @@ struct bnx2 {
 	struct net_device	*dev;
 	struct pci_dev		*pdev;
 
-	struct napi_struct	napi;
-
 	atomic_t		intr_sem;
 
-	struct status_block	*status_blk;
-	u32 			last_status_idx;
-
 	u32			flags;
 #define PCIX_FLAG			0x00000001
 #define PCI_32BIT_FLAG			0x00000002
@@ -6539,6 +6542,8 @@ struct bnx2 {
 	u16		tx_cons __attribute__((aligned(L1_CACHE_BYTES)));
 	u16		hw_tx_cons;
 
+	struct bnx2_napi	bnx2_napi;
+
 #ifdef BCM_VLAN
 	struct			vlan_group *vlgrp;
 #endif
@@ -6672,6 +6677,7 @@ struct bnx2 {
 
 	u32			stats_ticks;
 
+	struct status_block	*status_blk;
 	dma_addr_t		status_blk_mapping;
 
 	struct statistics_block	*stats_blk;



^ permalink raw reply related

* [PATCH 2/9][BNX2]: Restructure IRQ datastructures.
From: Michael Chan @ 2007-12-20 20:31 UTC (permalink / raw)
  To: davem, netdev

[BNX2]: Restructure IRQ datastructures.

Add a table to keep track of multiple IRQs and restructure the IRQ
request and free functions so that they can be easily expanded to
handle multiple IRQs.

Signed-off-by: Michael Chan <mchan@broadcom.com>

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index f19a1e9..83cdbde 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -5234,18 +5234,15 @@ static int
 bnx2_request_irq(struct bnx2 *bp)
 {
 	struct net_device *dev = bp->dev;
-	int rc = 0;
-
-	if (bp->flags & USING_MSI_FLAG) {
-		irq_handler_t	fn = bnx2_msi;
-
-		if (bp->flags & ONE_SHOT_MSI_FLAG)
-			fn = bnx2_msi_1shot;
+	unsigned long flags;
+	struct bnx2_irq *irq = &bp->irq_tbl[0];
+	int rc;
 
-		rc = request_irq(bp->pdev->irq, fn, 0, dev->name, dev);
-	} else
-		rc = request_irq(bp->pdev->irq, bnx2_interrupt,
-				 IRQF_SHARED, dev->name, dev);
+	if (bp->flags & USING_MSI_FLAG)
+		flags = 0;
+	else
+		flags = IRQF_SHARED;
+	rc = request_irq(irq->vector, irq->handler, flags, dev->name, dev);
 	return rc;
 }
 
@@ -5254,12 +5251,31 @@ bnx2_free_irq(struct bnx2 *bp)
 {
 	struct net_device *dev = bp->dev;
 
+	free_irq(bp->irq_tbl[0].vector, dev);
 	if (bp->flags & USING_MSI_FLAG) {
-		free_irq(bp->pdev->irq, dev);
 		pci_disable_msi(bp->pdev);
 		bp->flags &= ~(USING_MSI_FLAG | ONE_SHOT_MSI_FLAG);
-	} else
-		free_irq(bp->pdev->irq, dev);
+	}
+}
+
+static void
+bnx2_setup_int_mode(struct bnx2 *bp, int dis_msi)
+{
+	bp->irq_tbl[0].handler = bnx2_interrupt;
+	strcpy(bp->irq_tbl[0].name, bp->dev->name);
+
+	if ((bp->flags & MSI_CAP_FLAG) && !dis_msi) {
+		if (pci_enable_msi(bp->pdev) == 0) {
+			bp->flags |= USING_MSI_FLAG;
+			if (CHIP_NUM(bp) == CHIP_NUM_5709) {
+				bp->flags |= ONE_SHOT_MSI_FLAG;
+				bp->irq_tbl[0].handler = bnx2_msi_1shot;
+			} else
+				bp->irq_tbl[0].handler = bnx2_msi;
+		}
+	}
+
+	bp->irq_tbl[0].vector = bp->pdev->irq;
 }
 
 /* Called with rtnl_lock */
@@ -5278,15 +5294,8 @@ bnx2_open(struct net_device *dev)
 	if (rc)
 		return rc;
 
+	bnx2_setup_int_mode(bp, disable_msi);
 	napi_enable(&bp->napi);
-
-	if ((bp->flags & MSI_CAP_FLAG) && !disable_msi) {
-		if (pci_enable_msi(bp->pdev) == 0) {
-			bp->flags |= USING_MSI_FLAG;
-			if (CHIP_NUM(bp) == CHIP_NUM_5709)
-				bp->flags |= ONE_SHOT_MSI_FLAG;
-		}
-	}
 	rc = bnx2_request_irq(bp);
 
 	if (rc) {
@@ -5325,6 +5334,8 @@ bnx2_open(struct net_device *dev)
 			bnx2_disable_int(bp);
 			bnx2_free_irq(bp);
 
+			bnx2_setup_int_mode(bp, 1);
+
 			rc = bnx2_init_nic(bp);
 
 			if (!rc)
diff --git a/drivers/net/bnx2.h b/drivers/net/bnx2.h
index 1f244fa..1accf00 100644
--- a/drivers/net/bnx2.h
+++ b/drivers/net/bnx2.h
@@ -6494,6 +6494,15 @@ struct flash_spec {
 	u8  *name;
 };
 
+#define BNX2_MAX_MSIX_HW_VEC	9
+#define BNX2_MAX_MSIX_VEC	1
+
+struct bnx2_irq {
+	irq_handler_t	handler;
+	u16		vector;
+	char		name[16];
+};
+
 struct bnx2 {
 	/* Fields used in the tx and intr/napi performance paths are grouped */
 	/* together in the beginning of the structure. */
@@ -6721,6 +6730,9 @@ struct bnx2 {
 	u32			flash_size;
 
 	int			status_stats_size;
+
+	struct bnx2_irq		irq_tbl[BNX2_MAX_MSIX_VEC];
+	int			irq_nvecs;
 };
 
 static u32 bnx2_reg_rd_ind(struct bnx2 *bp, u32 offset);



^ permalink raw reply related

* [PATCH 1/9][BNX2]: Add function to fetch hardware tx index.
From: Michael Chan @ 2007-12-20 20:31 UTC (permalink / raw)
  To: davem, netdev

[BNX2]: Add function to fetch hardware tx index.

This makes the code cleaner and easier to support different tx rings.

Signed-off-by: Michael Chan <mchan@broadcom.com>

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 469d259..f19a1e9 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -2323,17 +2323,25 @@ bnx2_phy_int(struct bnx2 *bp)
 
 }
 
+static inline u16
+bnx2_get_hw_tx_cons(struct bnx2 *bp)
+{
+	u16 cons;
+
+	cons = bp->status_blk->status_tx_quick_consumer_index0;
+
+	if (unlikely((cons & MAX_TX_DESC_CNT) == MAX_TX_DESC_CNT))
+		cons++;
+	return cons;
+}
+
 static void
 bnx2_tx_int(struct bnx2 *bp)
 {
-	struct status_block *sblk = bp->status_blk;
 	u16 hw_cons, sw_cons, sw_ring_cons;
 	int tx_free_bd = 0;
 
-	hw_cons = bp->hw_tx_cons = sblk->status_tx_quick_consumer_index0;
-	if ((hw_cons & MAX_TX_DESC_CNT) == MAX_TX_DESC_CNT) {
-		hw_cons++;
-	}
+	hw_cons = bnx2_get_hw_tx_cons(bp);
 	sw_cons = bp->tx_cons;
 
 	while (sw_cons != hw_cons) {
@@ -2385,14 +2393,10 @@ bnx2_tx_int(struct bnx2 *bp)
 
 		dev_kfree_skb(skb);
 
-		hw_cons = bp->hw_tx_cons =
-			sblk->status_tx_quick_consumer_index0;
-
-		if ((hw_cons & MAX_TX_DESC_CNT) == MAX_TX_DESC_CNT) {
-			hw_cons++;
-		}
+		hw_cons = bnx2_get_hw_tx_cons(bp);
 	}
 
+	bp->hw_tx_cons = hw_cons;
 	bp->tx_cons = sw_cons;
 	/* Need to make the tx_cons update visible to bnx2_start_xmit()
 	 * before checking for netif_queue_stopped().  Without the
@@ -2822,7 +2826,7 @@ bnx2_has_work(struct bnx2 *bp)
 	struct status_block *sblk = bp->status_blk;
 
 	if ((bnx2_get_hw_rx_cons(bp) != bp->rx_cons) ||
-	    (sblk->status_tx_quick_consumer_index0 != bp->hw_tx_cons))
+	    (bnx2_get_hw_tx_cons(bp) != bp->hw_tx_cons))
 		return 1;
 
 	if ((sblk->status_attn_bits & STATUS_ATTN_EVENTS) !=
@@ -2851,7 +2855,7 @@ static int bnx2_poll_work(struct bnx2 *bp, int work_done, int budget)
 		REG_RD(bp, BNX2_HC_COMMAND);
 	}
 
-	if (sblk->status_tx_quick_consumer_index0 != bp->hw_tx_cons)
+	if (bnx2_get_hw_tx_cons(bp) != bp->hw_tx_cons)
 		bnx2_tx_int(bp);
 
 	if (bnx2_get_hw_rx_cons(bp) != bp->rx_cons)
@@ -4917,7 +4921,7 @@ bnx2_run_loopback(struct bnx2 *bp, int loopback_mode)
 	REG_RD(bp, BNX2_HC_COMMAND);
 
 	udelay(5);
-	rx_start_idx = bp->status_blk->status_rx_quick_consumer_index0;
+	rx_start_idx = bnx2_get_hw_rx_cons(bp);
 
 	num_pkts = 0;
 
@@ -4947,11 +4951,10 @@ bnx2_run_loopback(struct bnx2 *bp, int loopback_mode)
 	pci_unmap_single(bp->pdev, map, pkt_size, PCI_DMA_TODEVICE);
 	dev_kfree_skb(skb);
 
-	if (bp->status_blk->status_tx_quick_consumer_index0 != bp->tx_prod) {
+	if (bnx2_get_hw_tx_cons(bp) != bp->tx_prod)
 		goto loopback_test_done;
-	}
 
-	rx_idx = bp->status_blk->status_rx_quick_consumer_index0;
+	rx_idx = bnx2_get_hw_rx_cons(bp);
 	if (rx_idx != rx_start_idx + num_pkts) {
 		goto loopback_test_done;
 	}



^ permalink raw reply related

* [PATCH 0/9][BNX2]: Add MSIX support.
From: Michael Chan @ 2007-12-20 20:30 UTC (permalink / raw)
  To: davem, netdev

David, this patchset lays the foundation for supporting multiple MSIX
IRQs.  Only 1 additional MSIX is added to handle TX separately from RX
at the moment.  Multiple TX and RX rings will be added in the future.
Please review for 2.6.25.  Thanks.


^ permalink raw reply

* Re: [PATCH] sky2: Use deferrable timer for watchdog
From: Kok, Auke @ 2007-12-20 19:22 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Stephen Hemminger, parag.warudkar, netdev, akpm, linux-kernel
In-Reply-To: <476ABE7D.60901@linux.intel.com>

Arjan van de Ven wrote:
> 
>>>> My interpretation of the api is:
>>>>    * round_jiffies()  - timer wants to wakeup but isn't precise
>>>> about when so schedule
>>>>                         on next second when system will wake up anyway;
>>>>                         e.g why meetings are usually scheduled on
>>>> the hour
>>>>
>>>>    * deferrable       - timer doesn't have to really wakeup but
>>>> wants to happen near
>>>>                         a particular time. e.g. "I'll meet you at
>>>> the pub around 8pm"
> 
> this is not correct.
> 
> deferrable means "if you're busy wake me up at this time. But if not,
> don't bother waking up for me, get to it
> later".
> 
> The "later" can be a LONG time later, several seconds easily, if not more.
> (timers are on a per cpu bases, and you may end up with a several-core
> system where the common timers are all on another cpu
> than this one)
> 
> 
> 
>>> If this is the case then the whole usage of round_jiffies() is bogus.
>>> All users of round_jiffies()
>>> should just be converted to deferrable??  I am a bit concerned that
>>> if deferrable gets used everywhere
>>> then a strange situation would occur where all timers were waiting
>>> for some other timer to finally
>>> happen, kind of a wierd timelock situation. Like the old chip/dale
>>> cartoon:
>>>  "you first, no you first, after you mister chip, no after you mister
>>> dale,..."
>>
>>
>>
>> that's a dangerous situation indeed and I'd really like to know what
>> the limits
>> are for deferring deferrable timers.... Arjan, do you know? Anyone?
> 
> there is NO limit to deferring a timer. Do NOT use a deferrable timer if
> you can't afford the timer to not happen
> within.. 10 to 100 seconds! (or more)
> They are really meant for things where you CAN afford for it to not
> happen when you're idle....

ok, that's just bad and if there's no user-defineable limit to the deferral I
definately don't like this change.

Can I safely assume that any irq will cause all deferred timers to run?

If this is the case then for e1000 this patch is still OK since the watchdog needs
to run (1) after a link up/down interrupt or (2) to update statistics. Those
statistics won't increase if there is no traffic of course...

Auke

^ permalink raw reply

* Re: [PATCH] sky2: Use deferrable timer for watchdog
From: Arjan van de Ven @ 2007-12-20 19:11 UTC (permalink / raw)
  To: Kok, Auke; +Cc: Stephen Hemminger, parag.warudkar, netdev, akpm, linux-kernel
In-Reply-To: <476ABDDF.8080607@intel.com>


>>> My interpretation of the api is:
>>>    * round_jiffies()  - timer wants to wakeup but isn't precise about when so schedule
>>>                         on next second when system will wake up anyway;
>>>                         e.g why meetings are usually scheduled on the hour
>>>
>>>    * deferrable       - timer doesn't have to really wakeup but wants to happen near
>>>                         a particular time. e.g. "I'll meet you at the pub around 8pm"

this is not correct.

deferrable means "if you're busy wake me up at this time. But if not, don't bother waking up for me, get to it
later".

The "later" can be a LONG time later, several seconds easily, if not more.
(timers are on a per cpu bases, and you may end up with a several-core system where the common timers are all on another cpu
than this one)



>> If this is the case then the whole usage of round_jiffies() is bogus. All users of round_jiffies()
>> should just be converted to deferrable??  I am a bit concerned that if deferrable gets used everywhere
>> then a strange situation would occur where all timers were waiting for some other timer to finally
>> happen, kind of a wierd timelock situation. Like the old chip/dale cartoon:
>>  "you first, no you first, after you mister chip, no after you mister dale,..."
> 
> 
> 
> that's a dangerous situation indeed and I'd really like to know what the limits
> are for deferring deferrable timers.... Arjan, do you know? Anyone?

there is NO limit to deferring a timer. Do NOT use a deferrable timer if you can't afford the timer to not happen
within.. 10 to 100 seconds! (or more)
They are really meant for things where you CAN afford for it to not happen when you're idle....


> 
> I don't see a danger just yet on normal systems - I get something like 10 wakeups
> per second from just the kernel (acpi, ahci, usb) on most my systems which
> guarantees that the watchdog runs often enough, but for embedded systems and
> critical timers in other drivers this may be an issue quickly

on my work desktop test box the average time between cpu wakeups is 1.4 seconds
(and that's single core). It would be higher if it wasn't for some hpet limit issues.



^ permalink raw reply

* [PATCH 1/2] e1000e: Use deferrable timer for watchdog
From: Auke Kok @ 2007-12-20 18:51 UTC (permalink / raw)
  To: jeff; +Cc: netdev, parag.warudkar

From: Parag Warudkar <parag.warudkar@gmail.com>

Reduce wakeups from idle per second.

Signed-off-by: Parag Warudkar <parag.warudkar@gmail.com>
Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
---

 drivers/net/e1000e/netdev.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 2422d16..59960d2 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -3931,7 +3931,7 @@ static int __devinit e1000_probe(struct pci_dev *pdev,
 		goto err_eeprom;
 	}
 
-	init_timer(&adapter->watchdog_timer);
+	init_timer_deferrable(&adapter->watchdog_timer);
 	adapter->watchdog_timer.function = &e1000_watchdog;
 	adapter->watchdog_timer.data = (unsigned long) adapter;
 


^ permalink raw reply related

* Re: [PATCH] e1000e: Use deferrable timer for watchdog
From: Kok, Auke @ 2007-12-20 18:49 UTC (permalink / raw)
  To: Parag Warudkar; +Cc: Kok, Auke, netdev, linux-kernel, akpm
In-Reply-To: <82e4877d0712200956l4126e221qc0cf4720e59ca2a1@mail.gmail.com>

Parag Warudkar wrote:
> On Dec 20, 2007 12:05 PM, Kok, Auke <auke-jan.h.kok@intel.com> wrote:
>> I can't even apply this patch and the e1000 one... not only is it whitespace
>> damaged it is also not properly formatted as patch at all. If you want me to take
>> these patches seriously, then please fix the formatting issues.
> 
> Sigh - I use Pine, follow Documents/email-clients.txt for the
> recommended settings and obviously the pathces are not generated with
> whitespace damage at my end as I test those before sending out.
> 
> So although I hate to see this happen there is nothing at this moment
> that I can do - except for attaching the patch instead of inlining it.
> Since they have already been reviewed inline, please see if the
> attached patches work for you.

here's what the files in my Maildir spool look like in vim (my vim displays a '»'
char for tabs and a "¶" for EOL):

 76 --- linux-2.6/drivers/net/e1000e/netdev.c»      2007-12-07 10:04:39.00000000
 77 +++ linux-2.6-work/drivers/net/e1000e/netdev.c» 2007-12-18 20:45:59.00000000
 78 @@ -3899,7 +3899,7 @@¶
 79   »     »       goto err_eeprom;¶
 80   »     }¶
 81 ¶
 82 -»      init_timer(&adapter->watchdog_timer);¶
 83 +»      init_timer_deferrable(&adapter->watchdog_timer);¶
 84   »     adapter->watchdog_timer.function = &e1000_watchdog;¶
 85   »     adapter->watchdog_timer.data = (unsigned long) adapter;¶
 86 ¶
 87 --¶

notice that there are two spaces instead of 1. Also there's no line heading the
diff with 'diff a/foo b/foo' which is what throws of stg. And the -p option is
missing.


as for content, the patch looks OK with me. I ran the numbers and allthough there
was a slight average delay in the link up detection time it is negligeable (less
than 0.2sec difference over a bunch of measurements), and I confirmed your
powertop numbers are correct. As for the timer interval, the watchdog may already
be delayed up to 3 seconds safely, this doesn't change that.

I'll forward the patch, Care to make one for e100? plenty of laptops with those
still around! The embedded guys would love it I think.

Thanks,

Auke


^ permalink raw reply

* Re: [PATCH] sky2: Use deferrable timer for watchdog
From: Kok, Auke @ 2007-12-20 19:09 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: parag.warudkar, netdev, akpm, linux-kernel, Arjan van de Ven
In-Reply-To: <20071220095121.7859c023@deepthought>

Stephen Hemminger wrote:
> On Thu, 20 Dec 2007 17:29:23 +0000
> 
>> -----Original Message-----
>> From: Stephen Hemminger <shemminger@linux-foundation.org>
>>
>> Date: Thu, 20 Dec 2007 09:16:03 
>> To:parag.warudkar@gmail.com
>> Cc:netdev@vger.kernel.org, akpm@linux-foundation.org,       linux-kernel@vger.kernel.org
>> Subject: Re: [PATCH] sky2: Use deferrable timer for watchdog
>>
>>
>> On Tue, 18 Dec 2007 20:13:28 -0500 (EST)
>> Parag Warudkar <parag.warudkar@gmail.com> wrote:
>>
>>> sky2 can use deferrable timer for watchdog - reduces wakeups from idle per 
>>> second.
>>>
>>> Signed-off-by: Parag Warudkar <parag.warudkar@gmail.com>
>>>
>>> --- linux-2.6/drivers/net/sky2.c	2007-12-07 10:04:39.000000000 -0500
>>> +++ linux-2.6-work/drivers/net/sky2.c	2007-12-18 20:07:58.000000000 -0500
>>> @@ -4230,7 +4230,10 @@
>>>   			sky2_show_addr(dev1);
>>>   	}
>>>
>>> -	setup_timer(&hw->watchdog_timer, sky2_watchdog, (unsigned long) hw);
>>> +	hw->watchdog_timer.function = sky2_watchdog;
>>> +	hw->watchdog_timer.data = (unsigned long) hw;
>>> +	init_timer_deferrable(&hw->watchdog_timer);
>>> +
>>>   	INIT_WORK(&hw->restart_work, sky2_restart);
>>>
>>>   	pci_set_drvdata(pdev, hw);
>> Does it really reduce the wakeup's or only change who gets charged by powertop?
>> The system is going to wakeup once a second anyway. Looks to me that if the
>> timer is using round_jiffies(), that setting deferrable just changes the accounting.
>>
>> My interpretation of the api is:
>>    * round_jiffies()  - timer wants to wakeup but isn't precise about when so schedule
>>                         on next second when system will wake up anyway;
>>                         e.g why meetings are usually scheduled on the hour
>>
>>    * deferrable       - timer doesn't have to really wakeup but wants to happen near
>>                         a particular time. e.g. "I'll meet you at the pub around 8pm"
>>
>> Therefore doing deferrable is unnecessary for timers using round_jiffies unless system
>> is so good at doing timers that it is going to skip doing timer once per second.
>>
> 
> parag.warudkar@gmail.com wrote:
> 
>> NO_HZ kernels don't do timers every second - if you do round_jiffies() the kernel will wakeup and run the timer at that time no matter what. 
>>
>> The reason deferrable was introduced is to avoid waking up the kernel just for this one timer that can be called when the CPU is not idle for some reason other than this timer.
>>
>> In other words let's say there were two timers - one non-deferrable expiring in 3 seconds and other deferrable, expiring in 1.5 seconds. The kernel will not wake up twice - once for 1.5 second and other for 3 second - it will wake up once at expiry of 3 second timer and execute both the 1.5 second and 3 second timers.
>>
>> And this is not just powertop accounting thing - like I said the total num of wakeups per second go down with this patch.
>>
>> Parag
>>
>> Sent via BlackBerry from T-Mobile
> 
> 
> Quit top-posting!
> 
> If this is the case then the whole usage of round_jiffies() is bogus. All users of round_jiffies()
> should just be converted to deferrable??  I am a bit concerned that if deferrable gets used everywhere
> then a strange situation would occur where all timers were waiting for some other timer to finally
> happen, kind of a wierd timelock situation. Like the old chip/dale cartoon:
>  "you first, no you first, after you mister chip, no after you mister dale,..."



that's a dangerous situation indeed and I'd really like to know what the limits
are for deferring deferrable timers.... Arjan, do you know? Anyone?

I don't see a danger just yet on normal systems - I get something like 10 wakeups
per second from just the kernel (acpi, ahci, usb) on most my systems which
guarantees that the watchdog runs often enough, but for embedded systems and
critical timers in other drivers this may be an issue quickly


Auke

^ permalink raw reply

* Re: [PATCH net-2.6.25 3/3] Uninline the inet_twsk_put function
From: Ingo Oeser @ 2007-12-20 18:32 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: David Miller, Linux Netdev List, devel
In-Reply-To: <4768F8CD.2050209@openvz.org>

Pavel Emelyanov schrieb:
> This one is not that big, but is widely used: saves 1200 bytes 
> from net/ipv4/built-in.o
> +void inet_twsk_put(struct inet_timewait_sock *tw)
> +{
> +	if (atomic_dec_and_test(&tw->tw_refcnt)) {
> +		struct module *owner = tw->tw_prot->owner;
> +		twsk_destructor((struct sock *)tw);
> +#ifdef SOCK_REFCNT_DEBUG
> +		printk(KERN_DEBUG "%s timewait_sock %p released\n",
> +		       tw->tw_prot->name, tw);
> +#endif
> +		kmem_cache_free(tw->tw_prot->twsk_prot->twsk_slab, tw);
> +		module_put(owner);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(inet_twsk_put);

More correct fix seems to be conversion to kref.

Just create out of line inet_twsk_release() containing
sth. similiar to the code inside these braces and modify 
inet_twsk_put() to sth. like this:

static inline inet_twsk_put(struct inet_timewait_sock *tw)
{
	kref_put(&tw->kref, inet_twsk_release);
}

David, can you see any reason (e.g. some crazy lock stuff) NOT to do this?


Best Regards

Ingo Oeser

^ permalink raw reply

* [PATCH 2/2] e1000: Use deferrable timer for watchdog
From: Auke Kok @ 2007-12-20 18:51 UTC (permalink / raw)
  To: jeff; +Cc: netdev, parag.warudkar
In-Reply-To: <20071220185125.21903.44752.stgit@localhost.localdomain>

From: Parag Warudkar <parag.warudkar@gmail.com>

Reduces wakeups from idle per second.

Signed-off-by: Parag Warudkar <parag.warudkar@gmail.com>
Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
---

 drivers/net/e1000/e1000_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 599153d..6af86fa 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -1066,7 +1066,7 @@ e1000_probe(struct pci_dev *pdev,
 	adapter->tx_fifo_stall_timer.function = &e1000_82547_tx_fifo_stall;
 	adapter->tx_fifo_stall_timer.data = (unsigned long) adapter;
 
-	init_timer(&adapter->watchdog_timer);
+	init_timer_deferrable(&adapter->watchdog_timer);
 	adapter->watchdog_timer.function = &e1000_watchdog;
 	adapter->watchdog_timer.data = (unsigned long) adapter;
 


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox