* Re: [PATCH net-next-2.6 1/2] qlcnic: NIC Partitioning - Add basic infrastructure support
From: David Miller @ 2010-06-02 9:23 UTC (permalink / raw)
To: anirban.chakraborty; +Cc: netdev, amit.salecha, ameen.rahman
In-Reply-To: <alpine.OSX.2.00.1006011422230.45836@macintosh-2.qlogic.org>
From: Anirban Chakraborty <anirban.chakraborty@qlogic.com>
Date: Tue, 1 Jun 2010 14:28:51 -0700
> Following changes have been added to enable the adapter to work in
> NIC partitioning mode where multiple PCI functions of an adapter port can
> be configured to work as NIC functions. The first function that is enumerated on
> the PCI bus assumes the role of management function which, besides being able
> to do all the NIC functionality, can configure other NIC partitions. Other NIC
> functions can be configured as privileged or non privileged functions.
> Privileged function can not configure other NIC functions but can do all the
> NIC functionality including any firmware initialization, chip reset etc. Non
> privileged functions can do only basic IO. For chip reset etc, it depends on the
> privilege or management function.
>
> 1. Added code to determine PCI function number independent of kernel API.
> 2. Added Driver - FW version 2.0 support.
> 3. Changed producer and consumer register offset calculation.
> 4. Added management and privileged operation modes for npar functions. A module
> parameter has been added to control it.
> 5. Added support for configuring the eswitch in the adapter.
>
> Signed-off-by: Anirban Chakraborty <anirban.chakraborty@qlogic.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next-2.6 2/2] qlcnic: NIC Partitioning - Add non privileged mode support
From: David Miller @ 2010-06-02 9:23 UTC (permalink / raw)
To: anirban.chakraborty; +Cc: netdev, amit.salecha, ameen.rahman
In-Reply-To: <alpine.OSX.2.00.1006011428590.45836@macintosh-2.qlogic.org>
From: Anirban Chakraborty <anirban.chakraborty@qlogic.com>
Date: Tue, 1 Jun 2010 14:33:09 -0700
> Added support for NIC functions that work in non privileged mode where these
> functions are privileged to do IO only, the control operations are handled via
> privileged functions.
> Bumped up version number to 5.0.3.
>
> Signed-off-by: Anirban Chakraborty <anirban.chakraborty@qlogic.com>
Applied.
^ permalink raw reply
* Re: [PATCH] xfrm: force a dst reference in __xfrm_route_forward()
From: David Miller @ 2010-06-02 9:27 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
In-Reply-To: <1275422689.2638.14.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 01 Jun 2010 22:04:49 +0200
> Packets going through __xfrm_route_forward() have a not refcounted dst
> entry, since we enabled a noref forwarding path.
>
> xfrm_lookup() might incorrectly release this dst entry.
>
> It's a bit late to make invasive changes in xfrm_lookup(), so lets force
> a refcount in this path.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied, thanks for fixing this bug Eric.
^ permalink raw reply
* Re: [net-2.6 PATCH] enic: bug fix: make the set/get netlink VF_PORT support symmetrical
From: David Miller @ 2010-06-02 9:27 UTC (permalink / raw)
To: scofeldm; +Cc: chrisw, netdev
In-Reply-To: <20100601185933.4323.2212.stgit@localhost.localdomain>
From: Scott Feldman <scofeldm@cisco.com>
Date: Tue, 01 Jun 2010 11:59:33 -0700
> From: Scott Feldman <scofeldm@cisco.com>
>
> To make get/set netlink VF_PORT truly symmetrical, we need to keep track
> of what items are set and only return those items on get. Previously, the
> driver wasn't differentiating between a set of attr with a NULL string,
> for example, and not setting the attr at all. We only want to return
> the NULL string if the attr was actually set with a NULL string. Otherwise,
> don't return the attr.
>
> Signed-off-by: Scott Feldman <scofeldm@cisco.com>
Applied, thanks Scott.
^ permalink raw reply
* Re: [PATCH net-2.6] bnx2: Fix hang during rmmod bnx2.
From: David Miller @ 2010-06-02 9:27 UTC (permalink / raw)
To: mchan; +Cc: netdev
In-Reply-To: <1275440736-18058-1-git-send-email-mchan@broadcom.com>
From: "Michael Chan" <mchan@broadcom.com>
Date: Tue, 1 Jun 2010 18:05:36 -0700
> The regression is caused by:
>
> commit 4327ba435a56ada13eedf3eb332e583c7a0586a9
> bnx2: Fix netpoll crash.
>
> If ->open() and ->close() are called multiple times, the same napi structs
> will be added to dev->napi_list multiple times, corrupting the dev->napi_list.
> This causes free_netdev() to hang during rmmod.
>
> We fix this by calling netif_napi_del() during ->close().
>
> Also, bnx2_init_napi() must not be in the __devinit section since it is
> called by ->open().
>
> Signed-off-by: Michael Chan <mchan@broadcom.com>
> Signed-off-by: Benjamin Li <benli@broadcom.com>
Applied, thanks!
^ permalink raw reply
* Re: [RFC PATCH] net: add additional lock to qdisc to increase enqueue/dequeue fairness
From: Eric Dumazet @ 2010-06-02 9:48 UTC (permalink / raw)
To: Stephen Hemminger, David Miller; +Cc: alexander.h.duyck, netdev
In-Reply-To: <1274463643.2439.473.camel@edumazet-laptop>
Le vendredi 21 mai 2010 à 19:40 +0200, Eric Dumazet a écrit :
> Le vendredi 21 mai 2010 à 18:44 +0200, Eric Dumazet a écrit :
>
> > We could use cmpxchg() and manipulate several bits at once in fast path.
> > ( __QDISC_STATE_RUNNING, __QDISC_STATE_LOCKED ... ) but making the crowd
> > of cpus spin on the same bits/cacheline than dequeue worker would
> > definitely slowdown the worker.
> >
> >
>
> Maybe I am missing something, but __QDISC_STATE_RUNNING is always
> manipulated with the lock held...
>
> We might avoid two atomic ops when changing this state (if moved to a
> separate container) in fast path (when a cpu sends only one packet and
> returns)
>
>
Here are two patches to implement this idea.
First patch to abstract QDISC_STATE_RUNNING access.
Second patch to add a __qstate container and remove the atomic ops.
^ permalink raw reply
* [PATCH 1/2 net-next-2.6] net: Define accessors to manipulate QDISC_STATE_RUNNING
From: Eric Dumazet @ 2010-06-02 9:49 UTC (permalink / raw)
To: Stephen Hemminger, David Miller; +Cc: alexander.h.duyck, netdev
In-Reply-To: <1274463643.2439.473.camel@edumazet-laptop>
Define three helpers to manipulate QDISC_STATE_RUNNIG flag, that a
second patch will move on another location.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
include/net/pkt_sched.h | 2 +-
include/net/sch_generic.h | 15 +++++++++++++++
net/core/dev.c | 4 ++--
net/sched/sch_generic.c | 4 ++--
4 files changed, 20 insertions(+), 5 deletions(-)
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 9d4d87c..d9549af 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -95,7 +95,7 @@ extern void __qdisc_run(struct Qdisc *q);
static inline void qdisc_run(struct Qdisc *q)
{
- if (!test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
+ if (qdisc_run_begin(q))
__qdisc_run(q);
}
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 03ca5d8..9707dae 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -76,6 +76,21 @@ struct Qdisc {
struct rcu_head rcu_head;
};
+static inline bool qdisc_is_running(struct Qdisc *qdisc)
+{
+ return test_bit(__QDISC_STATE_RUNNING, &qdisc->state);
+}
+
+static inline bool qdisc_run_begin(struct Qdisc *qdisc)
+{
+ return !test_and_set_bit(__QDISC_STATE_RUNNING, &qdisc->state);
+}
+
+static inline void qdisc_run_end(struct Qdisc *qdisc)
+{
+ clear_bit(__QDISC_STATE_RUNNING, &qdisc->state);
+}
+
struct Qdisc_class_ops {
/* Child qdisc manipulation */
struct netdev_queue * (*select_queue)(struct Qdisc *, struct tcmsg *);
diff --git a/net/core/dev.c b/net/core/dev.c
index 983a3c1..2733226 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2047,7 +2047,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
kfree_skb(skb);
rc = NET_XMIT_DROP;
} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
- !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state)) {
+ qdisc_run_begin(q)) {
/*
* This is a work-conserving queue; there are no old skbs
* waiting to be sent out; and the qdisc is not running -
@@ -2059,7 +2059,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
if (sch_direct_xmit(skb, q, dev, txq, root_lock))
__qdisc_run(q);
else
- clear_bit(__QDISC_STATE_RUNNING, &q->state);
+ qdisc_run_end(q);
rc = NET_XMIT_SUCCESS;
} else {
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index bd1892f..37b86ea 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -205,7 +205,7 @@ void __qdisc_run(struct Qdisc *q)
}
}
- clear_bit(__QDISC_STATE_RUNNING, &q->state);
+ qdisc_run_end(q);
}
unsigned long dev_trans_start(struct net_device *dev)
@@ -797,7 +797,7 @@ static bool some_qdisc_is_busy(struct net_device *dev)
spin_lock_bh(root_lock);
- val = (test_bit(__QDISC_STATE_RUNNING, &q->state) ||
+ val = (qdisc_is_running(q) ||
test_bit(__QDISC_STATE_SCHED, &q->state));
spin_unlock_bh(root_lock);
^ permalink raw reply related
* [PATCH 2/2 net-next-2.6] net: QDISC_STATE_RUNNING dont need atomic bit ops
From: Eric Dumazet @ 2010-06-02 9:50 UTC (permalink / raw)
To: Stephen Hemminger, David Miller; +Cc: alexander.h.duyck, netdev
In-Reply-To: <1274463643.2439.473.camel@edumazet-laptop>
__QDISC_STATE_RUNNING is always changed while qdisc lock is held.
We can avoid two atomic operations in xmit path, if we move this bit in
a new __state container.
Location of this __state container is carefully chosen so that fast path
only dirties one qdisc cache line.
THROTTLED bit could later be moved into this __state location too, to
avoid dirtying first qdisc cache line.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
include/net/sch_generic.h | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 9707dae..b3591e4 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -23,11 +23,17 @@ struct qdisc_rate_table {
};
enum qdisc_state_t {
- __QDISC_STATE_RUNNING,
__QDISC_STATE_SCHED,
__QDISC_STATE_DEACTIVATED,
};
+/*
+ * following bits are only changed while qdisc lock is held
+ */
+enum qdisc___state_t {
+ __QDISC___STATE_RUNNING,
+};
+
struct qdisc_size_table {
struct list_head list;
struct tc_sizespec szopts;
@@ -72,23 +78,24 @@ struct Qdisc {
unsigned long state;
struct sk_buff_head q;
struct gnet_stats_basic_packed bstats;
+ unsigned long __state;
struct gnet_stats_queue qstats;
struct rcu_head rcu_head;
};
static inline bool qdisc_is_running(struct Qdisc *qdisc)
{
- return test_bit(__QDISC_STATE_RUNNING, &qdisc->state);
+ return test_bit(__QDISC___STATE_RUNNING, &qdisc->__state);
}
static inline bool qdisc_run_begin(struct Qdisc *qdisc)
{
- return !test_and_set_bit(__QDISC_STATE_RUNNING, &qdisc->state);
+ return !__test_and_set_bit(__QDISC___STATE_RUNNING, &qdisc->__state);
}
static inline void qdisc_run_end(struct Qdisc *qdisc)
{
- clear_bit(__QDISC_STATE_RUNNING, &qdisc->state);
+ __clear_bit(__QDISC___STATE_RUNNING, &qdisc->__state);
}
struct Qdisc_class_ops {
^ permalink raw reply related
* Re: [v5 Patch 1/3] netpoll: add generic support for bridge and bonding devices
From: Cong Wang @ 2010-06-02 10:04 UTC (permalink / raw)
To: Jay Vosburgh
Cc: Flavio Leitner, linux-kernel, Matt Mackall, netdev, bridge,
Andy Gospodarek, Neil Horman, Jeff Moyer, Stephen Hemminger,
bonding-devel, David Miller
In-Reply-To: <24059.1275417767@death.nxdomain.ibm.com>
On 06/02/10 02:42, Jay Vosburgh wrote:
> Cong Wang<amwang@redhat.com> wrote:
>
>> On 06/01/10 03:08, Flavio Leitner wrote:
>>> On Mon, May 31, 2010 at 01:56:52PM +0800, Cong Wang wrote:
>>>> Hi, Flavio,
>>>>
>>>> Please use the attached patch instead, try to see if it solves
>>>> all your problems.
>>>
>>> I tried and it hangs. No backtraces this time.
>>> The bond_change_active_slave() prints before NETDEV_BONDING_FAILOVER
>>> notification, so I think it won't work.
>>
>> Ah, I thought the same.
>>
>>>
>>> Please, correct if I'm wrong, but when a failover happens with your
>>> patch applied, the netconsole would be disabled forever even with
>>> another healthy slave, right?
>>>
>>
>> Yes, this is an easy solution, because bonding has several modes,
>> it is complex to make netpoll work in different modes.
>
> If I understand correctly, the root cause of the problem with
> netconsole and bonding is that bonding is, ultimately, performing
> printks with a write lock held, and when netconsole recursively calls
> into bonding to send the printk over the netconsole, there is a deadlock
> (when the bonding xmit function attempts to acquire the same lock for
> read).
Yes.
>
> You're trying to avoid the deadlock by shutting off netconsole
> (permanently, it looks like) for one problem case: a failover, which
> does some printks with a write lock held.
>
> This doesn't look to me like a complete solution, there are
> other cases in bonding that will do printk with write locks held. I
> suspect those will also hang netconsole as things exist today, and won't
> be affected by your patch below.
I can expect that, bonding modes are complex.
>
> For example:
>
> The sysfs functions to set the primary (bonding_store_primary)
> or active (bonding_store_active_slave) options: a pr_info is called to
> provide a log message of the results. These could be tested by setting
> the primary or active options via sysfs, e.g.,
>
> echo eth0> /sys/class/net/bond0/bonding/primary
> echo eth0> /sys/class/net/bond0/bonding/active
>
> If the kernel is defined with DEBUG, there are a few pr_debug
> calls within write_locks (bond_del_vlan, for example).
>
> If the slave's underlying device driver's ndo_vlan_rx_register
> or ndo_vlan_rx_kill_vid functions call printk (and it looks like some do
> for error cases, e.g., igbvf, ehea, enic), those would also presumably
> deadlock (because bonding holds its write_lock when calling the ndo_
> vlan functions).
>
> It also appears that (with the patch below) some nominally
> normal usage patterns will immediately disable netconsole. The one that
> comes to mind is if the primary= option is set (to "eth1" for this
> example), but that slave not enslaved first (the slaves are added, say,
> eth0 then eth1). In that situation, when the primary slave (eth1 here)
> is added, the first thing that will happen is a failover, and that will
> disable netconsole.
>
Thanks for your detailed explanation!
This is why I said bonding is complex. I guess we would have to adjust
netpoll code for different bonding cases, one solution seems not fix all.
I am not sure how much work to do, since I am not familiar with bonding
code. Maybe Andy can help?
For the previous patch, it at least can make Flavio happy. :)
Thanks!
^ permalink raw reply
* Re: [PATCH 1/2 net-next-2.6] net: Define accessors to manipulate QDISC_STATE_RUNNING
From: David Miller @ 2010-06-02 10:24 UTC (permalink / raw)
To: eric.dumazet; +Cc: shemminger, alexander.h.duyck, netdev
In-Reply-To: <1275472154.2725.143.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 02 Jun 2010 11:49:14 +0200
> Define three helpers to manipulate QDISC_STATE_RUNNIG flag, that a
> second patch will move on another location.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Looks good, applied.
^ permalink raw reply
* Re: [PATCH 2/2 net-next-2.6] net: QDISC_STATE_RUNNING dont need atomic bit ops
From: David Miller @ 2010-06-02 10:25 UTC (permalink / raw)
To: eric.dumazet; +Cc: shemminger, alexander.h.duyck, netdev
In-Reply-To: <1275472233.2725.146.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 02 Jun 2010 11:50:33 +0200
> __QDISC_STATE_RUNNING is always changed while qdisc lock is held.
>
> We can avoid two atomic operations in xmit path, if we move this bit in
> a new __state container.
>
> Location of this __state container is carefully chosen so that fast path
> only dirties one qdisc cache line.
>
> THROTTLED bit could later be moved into this __state location too, to
> avoid dirtying first qdisc cache line.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Also looks good, applied.
One thing about naming. Here, even though we name the type and the
state member with two leading underscores, the things that actually
modify these state members are helper functions with names that lack
double underscores.
So really, reading the code, you don't see that special considerations
for these state changes might be necessary.
^ permalink raw reply
* Re: [PATCH 1/3] net: init_vlan should not copy slave or master flags
From: David Miller @ 2010-06-02 10:35 UTC (permalink / raw)
To: john.r.fastabend; +Cc: andy, fubar, nhorman, bonding-devel, netdev
In-Reply-To: <20100513073106.3528.45412.stgit@jf-dev2-dcblab>
From: John Fastabend <john.r.fastabend@intel.com>
Date: Thu, 13 May 2010 00:31:06 -0700
> The vlan device should not copy the slave or master flags from
> the real device. It is not in the bond until added nor is it
> a master.
>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Applied, thanks John.
^ permalink raw reply
* Re: [PATCH 2/3] net: fix conflict between null_or_orig and null_or_bond
From: David Miller @ 2010-06-02 10:35 UTC (permalink / raw)
To: john.r.fastabend; +Cc: andy, fubar, nhorman, bonding-devel, netdev
In-Reply-To: <20100513073111.3528.19949.stgit@jf-dev2-dcblab>
From: John Fastabend <john.r.fastabend@intel.com>
Date: Thu, 13 May 2010 00:31:11 -0700
> If a skb is received on an inactive bond that does not meet
> the special cases checked for by skb_bond_should_drop it should
> only be delivered to exact matches as the comment in
> netif_receive_skb() says.
>
> However because null_or_bond could also be null this is not
> always true. This patch renames null_or_bond to orig_or_bond
> and initializes it to orig_dev. This keeps the intent of
> null_or_bond to pass frames received on VLAN interfaces stacked
> on bonding interfaces without invalidating the statement for
> null_or_orig.
>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Also applied.
^ permalink raw reply
* Re: [PATCH 3/3] net: deliver skbs on inactive slaves to exact matches
From: David Miller @ 2010-06-02 10:36 UTC (permalink / raw)
To: john.r.fastabend; +Cc: andy, fubar, nhorman, bonding-devel, netdev
In-Reply-To: <4BFDFAAA.8070701@intel.com>
From: John Fastabend <john.r.fastabend@intel.com>
Date: Wed, 26 May 2010 21:52:58 -0700
> I know you were looking over this series at one point. Did you have
> any comments? Sorry for the ping just wanted to keep this on my
> radar.
I'll hold this last one until Jay has a chance to comment on it.
^ permalink raw reply
* Re: [PATCH net-next-2.6] bonding: remove unused variable "found"
From: David Miller @ 2010-06-02 10:40 UTC (permalink / raw)
To: jpirko; +Cc: netdev
In-Reply-To: <20100517134953.GD2878@psychotron.lab.eng.brq.redhat.com>
From: Jiri Pirko <jpirko@redhat.com>
Date: Mon, 17 May 2010 15:49:54 +0200
>
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next-2.6] bonding: move slave MTU handling from sysfs V2
From: David Miller @ 2010-06-02 10:40 UTC (permalink / raw)
To: jpirko; +Cc: netdev, fubar, bonding-devel, monis
In-Reply-To: <20100518154016.GD2878@psychotron.lab.eng.brq.redhat.com>
From: Jiri Pirko <jpirko@redhat.com>
Date: Tue, 18 May 2010 17:42:40 +0200
> V1->V2: corrected res/ret use
>
> For some reason, MTU handling (storing, and restoring) is taking place in
> bond_sysfs. The correct place for this code is in bond_enslave, bond_release.
> So move it there.
>
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next-2.6] bonding: remove redundant checks from bonding_store_slaves V2
From: David Miller @ 2010-06-02 10:40 UTC (permalink / raw)
To: jpirko; +Cc: netdev, fubar, bonding-devel
In-Reply-To: <20100518154452.GE2878@psychotron.lab.eng.brq.redhat.com>
From: Jiri Pirko <jpirko@redhat.com>
Date: Tue, 18 May 2010 17:44:53 +0200
> (it's actually the same as v1)
>
> Remove checks that duplicates similar checks in bond_enslave.
>
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next-2.6] bonding: make bonding_store_slaves simpler
From: David Miller @ 2010-06-02 10:40 UTC (permalink / raw)
To: jpirko; +Cc: netdev, fubar, bonding-devel
In-Reply-To: <20100518154638.GF2878@psychotron.lab.eng.brq.redhat.com>
From: Jiri Pirko <jpirko@redhat.com>
Date: Tue, 18 May 2010 17:46:39 +0200
> This patch makes bonding_store_slaves function nicer and easier to understand.
>
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Applied.
^ permalink raw reply
* Re: [PATCH] net/mpc52xx_phy: Various code cleanups
From: David Miller @ 2010-06-02 10:45 UTC (permalink / raw)
To: w.sang; +Cc: linuxppc-dev, netdev, grant.likely
In-Reply-To: <1274264470-2905-1-git-send-email-w.sang@pengutronix.de>
From: Wolfram Sang <w.sang@pengutronix.de>
Date: Wed, 19 May 2010 12:21:10 +0200
> - don't free bus->irq (obsoleted by ca816d98170942371535b3e862813b0aba9b7d90)
> - don't dispose irqs (should be done in of_mdiobus_register())
> - use fec-pointer consistently in transfer()
> - use resource_size()
> - cosmetic fixes
>
> Signed-off-by: Wolfram Sang <w.sang@pengutronix.de>
Applied, thanks.
^ permalink raw reply
* Re: [RFC] tcp: delack_timer expiration changes for every frame
From: David Miller @ 2010-06-02 11:11 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev, ilpo.jarvinen
In-Reply-To: <1274388439.2508.27.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 20 May 2010 22:47:19 +0200
> So we change the delack_timer by a positive delta (~ HZ/10) and a
> negative delta (~HZ/10), on the typical netperf TCP_RR workload.
Yes, this is silly.
> Here, the incoming frame is handled by netperf, doing a recvmsg().
> tcp_send_delayed_ack() sets the delack_timer to jiffies + HZ/25
>
> [ 392.207721] timer->expires=23045, new expires=23019(10) diff=26 timer=e5ecb754
...
> [ 392.209221] [<c1279ce7>] ? tcp_send_delayed_ack+0xb5/0xc1
> [ 392.209282] [<c1276d26>] ? tcp_rcv_established+0x39f/0x4f7
HZ/25 is TCP_DELACK_MIN and TCP_ATO_MIN, but the actual value we use
here involves incorporation of various measurements made on the
connection (RTO, etc.)
...
> tcp_v4_rcv() sets the delack timer to 37 ticks, so mod_timer() optimizations is not
> working at all.
>
> [ 392.217454] timer->expires=23019, new expires=23049(37) diff=-30 timer=e5ecb754
...
> [ 392.218765] [<c127cf56>] ? tcp_v4_rcv+0x41c/0x6b7
> [ 392.218826] [<c1265832>] ? ip_local_deliver_finish+0xe9/0x178
There must be a tail-call here at tcp_v4_rcv() or something missed in
the backtrace stack scanning logic, because tcp_v4_rcv() and it's main
inline tcp_v4_do_rcv() do not modify established state socket timers,
and in particular do not modify the delack timer, that I can see.
It must be in via tcp_rcv_established() or similar.
...
Nevermind, it's the inlined prequeue stuff.
It uses a seperate calculation of the delack timer offset, independant
of the one made by tcp_send_delayed_ack(), it's timer offset formula is:
(3 * tcp_rto_min(sk)) / 4
with a MAX of:
TCP_RTO_MAX
So every time we go in and out of recvmsg() we'll hop between these
two different delayed ACK settings.
The prequeue logic is trying to stretch the delayed ACK to 3/4 of a
window of data. It's set a bit high, intentionally, in the hopes that
we'll get the process into recvmsg() and have it emit it's response
packet from a subsequent sendmsg() (that the ACK can ride on) before
this timer fires.
But when we drop the socket lock to sleep or return to userspace, one
of the next packets is just going to reset this timer differently.
While the intentions of the prequeue code look legit, the use of two
different delayed ACK timeout schemes has bad implications elsewhere.
For example, if the delack timer does actually fire, there is this
ATO fixup code here:
if (!icsk->icsk_ack.pingpong) {
/* Delayed ACK missed: inflate ATO. */
icsk->icsk_ack.ato = min(icsk->icsk_ack.ato << 1, icsk->icsk_rto);
} else {
/* Delayed ACK missed: leave pingpong mode and
* deflate ATO.
*/
icsk->icsk_ack.pingpong = 0;
icsk->icsk_ack.ato = TCP_ATO_MIN;
}
which is totally wrong if the delack timer offset is the one
calculated by the prequeue code. Doubling the ATO in that case
is completely the wrong thing to do.
So yes we have all kinds of inconsistencies here and we should
probably unify things so that the timer gets kicked less often.
^ permalink raw reply
* Re: [PATCH net-next-2.6] bonding: move dev_addr cpy to bond_enslave
From: David Miller @ 2010-06-02 11:17 UTC (permalink / raw)
To: jpirko; +Cc: netdev, fubar, bonding-devel
In-Reply-To: <20100519111428.GA2788@psychotron.lab.eng.brq.redhat.com>
From: Jiri Pirko <jpirko@redhat.com>
Date: Wed, 19 May 2010 13:14:29 +0200
> Move the code that copies slave's mac address in case that's the first slave into
> bond_enslave. Ifenslave app does this also but that's not a problem. This is
> something that should be done in bond_enslave, and it shound not matter from
> where is it called.
>
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
(Jiri, please number your patches in a set, even if they should apply
properly independantly, thanks)
Applied.
^ permalink raw reply
* Re: [PATCH net-next-2.6] bonding: remove unused original_flags struct slave member
From: David Miller @ 2010-06-02 11:17 UTC (permalink / raw)
To: jpirko; +Cc: netdev, fubar, bonding-devel
In-Reply-To: <20100519111740.GB2788@psychotron.lab.eng.brq.redhat.com>
From: Jiri Pirko <jpirko@redhat.com>
Date: Wed, 19 May 2010 13:17:41 +0200
> This is stored but never restored. So remove this as it is useless.
>
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next-2.6] bonding: optimize tlb_get_least_loaded_slave
From: David Miller @ 2010-06-02 11:17 UTC (permalink / raw)
To: jpirko; +Cc: netdev, fubar, bonding-devel
In-Reply-To: <20100519132638.GC2788@psychotron.lab.eng.brq.redhat.com>
From: Jiri Pirko <jpirko@redhat.com>
Date: Wed, 19 May 2010 15:26:39 +0200
> In the worst case, when the first loop breaks an the end of the slave list,
> the slave list is iterated through twice. This patch reduces this
> function only to one loop. Also makes it simpler.
>
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next-2.6] bonding: move dev_addr cpy to bond_enslave
From: Jiri Pirko @ 2010-06-02 11:20 UTC (permalink / raw)
To: David Miller; +Cc: netdev, fubar, bonding-devel
In-Reply-To: <20100602.041714.236244214.davem@davemloft.net>
Wed, Jun 02, 2010 at 01:17:14PM CEST, davem@davemloft.net wrote:
>From: Jiri Pirko <jpirko@redhat.com>
>Date: Wed, 19 May 2010 13:14:29 +0200
>
>> Move the code that copies slave's mac address in case that's the first slave into
>> bond_enslave. Ifenslave app does this also but that's not a problem. This is
>> something that should be done in bond_enslave, and it shound not matter from
>> where is it called.
>>
>> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>
>(Jiri, please number your patches in a set, even if they should apply
> properly independantly, thanks)
Sorry, this was not ment to be a set. I was just posting patches as I went thru
the code. Will try to "buffer" it next time.
Thanks.
>
>Applied.
^ permalink raw reply
* [PATCH 1/2] pktgen: increasing transmission granularity
From: Daniel Turull @ 2010-06-02 11:49 UTC (permalink / raw)
To: netdev; +Cc: robert, jens.laas
This patch increases the granularity of the rate generated by pktgen.
The previous version of pktgen uses micro seconds (udelay) resolution when it
was delayed causing gaps in the rates. It is changed to nanosecond (ndelay).
Now any rate is possible.
Also it allows to set, the desired rate in Mb/s or packets per second.
The documentation has been updated.
Signed-off-by: Daniel Turull <daniel.turull@gmail.com>
---
diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt
index 61bb645..ac0e4ff 100644
--- a/Documentation/networking/pktgen.txt
+++ b/Documentation/networking/pktgen.txt
@@ -78,6 +78,9 @@ Examples:
pgset "delay 5000" adds delay to hard_start_xmit(). nanoseconds
+ pgset "rate 300M" set rate to 300 Mb/s
+ pgset "ratep 1000000" set rate to 1Mpps
+
pgset "dst 10.0.0.1" sets IP destination address
(BEWARE! This generator is very aggressive!)
@@ -200,6 +203,9 @@ debug
frags
delay
+rate
+ratep
+
src_mac_count
dst_mac_count
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 2ad68da..6428653 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -169,7 +169,7 @@
#include <asm/dma.h>
#include <asm/div64.h> /* do_div */
-#define VERSION "2.73"
+#define VERSION "2.74"
#define IP_NAME_SZ 32
#define MAX_MPLS_LABELS 16 /* This is the max label stack depth */
#define MPLS_STACK_BOTTOM htonl(0x00000100)
@@ -980,6 +980,40 @@ static ssize_t pktgen_if_write(struct file *file,
(unsigned long long) pkt_dev->delay);
return count;
}
+ if (!strcmp(name, "rate")) {
+ len = num_arg(&user_buffer[i], 10, &value);
+ if (len < 0)
+ return len;
+
+ i += len;
+ if (!value)
+ return len;
+ pkt_dev->delay = pkt_dev->min_pkt_size*8*NSEC_PER_USEC/value;
+ if (debug)
+ printk(KERN_INFO
+ "pktgen: Delay set at: %llu ns\n",
+ pkt_dev->delay);
+
+ sprintf(pg_result, "OK: rate=%lu", value);
+ return count;
+ }
+ if (!strcmp(name, "ratep")) {
+ len = num_arg(&user_buffer[i], 10, &value);
+ if (len < 0)
+ return len;
+
+ i += len;
+ if (!value)
+ return len;
+ pkt_dev->delay = NSEC_PER_SEC/value;
+ if (debug)
+ printk(KERN_INFO
+ "pktgen: Delay set at: %llu ns\n",
+ pkt_dev->delay);
+
+ sprintf(pg_result, "OK: rate=%lu", value);
+ return count;
+ }
if (!strcmp(name, "udp_src_min")) {
len = num_arg(&user_buffer[i], 10, &value);
if (len < 0)
@@ -2142,15 +2176,15 @@ static void spin(struct pktgen_dev *pkt_dev, ktime_t spin_until)
hrtimer_init_on_stack(&t.timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
hrtimer_set_expires(&t.timer, spin_until);
- remaining = ktime_to_us(hrtimer_expires_remaining(&t.timer));
+ remaining = ktime_to_ns(hrtimer_expires_remaining(&t.timer));
if (remaining <= 0) {
pkt_dev->next_tx = ktime_add_ns(spin_until, pkt_dev->delay);
return;
}
start_time = ktime_now();
- if (remaining < 100)
- udelay(remaining); /* really small just spin */
+ if (remaining < 100000)
+ ndelay(remaining); /* really small just spin */
else {
/* see do_nanosleep */
hrtimer_init_sleeper(&t, current);
@@ -2170,7 +2204,7 @@ static void spin(struct pktgen_dev *pkt_dev, ktime_t spin_until)
end_time = ktime_now();
pkt_dev->idle_acc += ktime_to_ns(ktime_sub(end_time, start_time));
- pkt_dev->next_tx = ktime_add_ns(end_time, pkt_dev->delay);
+ pkt_dev->next_tx = ktime_add_ns(spin_until, pkt_dev->delay);
}
static inline void set_pkt_overhead(struct pktgen_dev *pkt_dev)
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox