* Re: [PATCHv5 3/4] sparc64: Avoid irqsave/restore on vio.lock if in_softirq()
From: David Miller @ 2014-10-25 18:28 UTC (permalink / raw)
To: sowmini.varadhan; +Cc: netdev, sparclinux
In-Reply-To: <20141022221245.GE17252@oracle.com>
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Wed, 22 Oct 2014 18:12:45 -0400
>
> For NAPIfied drivers , there is no need to
> synchronize by doing irqsave/restore on vio.lock in the I/O
> path.
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Conditional locking is broken locking, and the bug you introduce
here is a good example of why that is.
If the vio->timer has to be triggered when vio_port_up() is
invoked, it will next run fron in_softirq() context regardless
of whether the user is sunvnet or sunvdc. So it will elide
the locking regardless of who is using this vio context.
Never, ever, use conditional locking.
This locking is harmless overhead in a slow path, just leave
it alone for now.
^ permalink raw reply
* Re: localed stuck in recent 3.18 git in copy_net_ns?
From: Jay Vosburgh @ 2014-10-25 4:33 UTC (permalink / raw)
To: paulmck
Cc: Yanko Kaneti, Josh Boyer, Eric W. Biederman, Cong Wang,
Kevin Fenzi, netdev, Linux-Kernel@Vger. Kernel. Org, mroos, tj
In-Reply-To: <20141025020324.GA28247@linux.vnet.ibm.com>
Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
>On Fri, Oct 24, 2014 at 05:20:48PM -0700, Jay Vosburgh wrote:
>> Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
>>
>> >On Fri, Oct 24, 2014 at 03:59:31PM -0700, Paul E. McKenney wrote:
>> [...]
>> >> Hmmm... It sure looks like we have some callbacks stuck here. I clearly
>> >> need to take a hard look at the sleep/wakeup code.
>> >>
>> >> Thank you for running this!!!
>> >
>> >Could you please try the following patch? If no joy, could you please
>> >add rcu:rcu_nocb_wake to the list of ftrace events?
>>
>> I tried the patch, it did not change the behavior.
>>
>> I enabled the rcu:rcu_barrier and rcu:rcu_nocb_wake tracepoints
>> and ran it again (with this patch and the first patch from earlier
>> today); the trace output is a bit on the large side so I put it and the
>> dmesg log at:
>>
>> http://people.canonical.com/~jvosburgh/nocb-wake-dmesg.txt
>>
>> http://people.canonical.com/~jvosburgh/nocb-wake-trace.txt
>
>Thank you again!
>
>Very strange part of the trace. The only sign of CPU 2 and 3 are:
>
> ovs-vswitchd-902 [000] .... 109.896840: rcu_barrier: rcu_sched Begin cpu -1 remaining 0 # 0
> ovs-vswitchd-902 [000] .... 109.896840: rcu_barrier: rcu_sched Check cpu -1 remaining 0 # 0
> ovs-vswitchd-902 [000] .... 109.896841: rcu_barrier: rcu_sched Inc1 cpu -1 remaining 0 # 1
> ovs-vswitchd-902 [000] .... 109.896841: rcu_barrier: rcu_sched OnlineNoCB cpu 0 remaining 1 # 1
> ovs-vswitchd-902 [000] d... 109.896841: rcu_nocb_wake: rcu_sched 0 WakeNot
> ovs-vswitchd-902 [000] .... 109.896841: rcu_barrier: rcu_sched OnlineNoCB cpu 1 remaining 2 # 1
> ovs-vswitchd-902 [000] d... 109.896841: rcu_nocb_wake: rcu_sched 1 WakeNot
> ovs-vswitchd-902 [000] .... 109.896842: rcu_barrier: rcu_sched OnlineNoCB cpu 2 remaining 3 # 1
> ovs-vswitchd-902 [000] d... 109.896842: rcu_nocb_wake: rcu_sched 2 WakeNotPoll
> ovs-vswitchd-902 [000] .... 109.896842: rcu_barrier: rcu_sched OnlineNoCB cpu 3 remaining 4 # 1
> ovs-vswitchd-902 [000] d... 109.896842: rcu_nocb_wake: rcu_sched 3 WakeNotPoll
> ovs-vswitchd-902 [000] .... 109.896843: rcu_barrier: rcu_sched Inc2 cpu -1 remaining 4 # 2
>
>The pair of WakeNotPoll trace entries says that at that point, RCU believed
>that the CPU 2's and CPU 3's rcuo kthreads did not exist. :-/
On the test system I'm using, CPUs 2 and 3 really do not exist;
it is a 2 CPU system (Intel Core 2 Duo E8400). I mentioned this in an
earlier message, but perhaps you missed it in the flurry.
Looking at the dmesg, the early boot messages seem to be
confused as to how many CPUs there are, e.g.,
[ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[ 0.000000] Hierarchical RCU implementation.
[ 0.000000] RCU debugfs-based tracing is enabled.
[ 0.000000] RCU dyntick-idle grace-period acceleration is enabled.
[ 0.000000] RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
[ 0.000000] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
[ 0.000000] NR_IRQS:16640 nr_irqs:456 0
[ 0.000000] Offload RCU callbacks from all CPUs
[ 0.000000] Offload RCU callbacks from CPUs: 0-3.
but later shows 2:
[ 0.233703] x86: Booting SMP configuration:
[ 0.236003] .... node #0, CPUs: #1
[ 0.255528] x86: Booted up 1 node, 2 CPUs
In any event, the E8400 is a 2 core CPU with no hyperthreading.
-J
---
-Jay Vosburgh, jay.vosburgh@canonical.com
^ permalink raw reply
* Re: [PATCH 1/1 net-next] libceph: remove unused variable in handle_reply()
From: Fabian Frederick @ 2014-10-25 9:12 UTC (permalink / raw)
To: Ilya Dryomov
Cc: netdev, Linux Kernel Mailing List, Ceph Development, Sage Weil,
David S. Miller
In-Reply-To: <CALFYKtD222g32njwF8nHz0Fbjj4p2512ka9bkADmB6S2QibgYA@mail.gmail.com>
> On 23 October 2014 at 18:25 Ilya Dryomov <ilya.dryomov@inktank.com> wrote:
>
>
> On Thu, Oct 23, 2014 at 8:15 PM, Fabian Frederick <fabf@skynet.be> wrote:
> > osdmap_epoch is redundant with reassert_epoch and unused.
> >
> > Signed-off-by: Fabian Frederick <fabf@skynet.be>
> > ---
> > net/ceph/osd_client.c | 2 --
> > 1 file changed, 2 deletions(-)
> >
> > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> > index f3fc54e..432bd75 100644
> > --- a/net/ceph/osd_client.c
> > +++ b/net/ceph/osd_client.c
> > @@ -1700,7 +1700,6 @@ static void handle_reply(struct ceph_osd_client *osdc,
> > struct ceph_msg *msg,
> > int err;
> > u32 reassert_epoch;
> > u64 reassert_version;
> > - u32 osdmap_epoch;
> > int already_completed;
> > u32 bytes;
> > unsigned int i;
> > @@ -1725,7 +1724,6 @@ static void handle_reply(struct ceph_osd_client *osdc,
> > struct ceph_msg *msg,
> > result = ceph_decode_32(&p);
> > reassert_epoch = ceph_decode_32(&p);
> > reassert_version = ceph_decode_64(&p);
> > - osdmap_epoch = ceph_decode_32(&p);
> >
> > /* lookup */
> > down_read(&osdc->map_sem);
>
> Hi Fabian,
>
> osdmap_epoch is useful for debugging, but this is wrong anyway -
> ceph_decode_32() has side effects. Removing it and not adjusting *p
> would make the whole thing blow up pretty fast..
>
> Thanks,
>
> Ilya
Hi Ilya,
osdmap_epoch generates a warning with -Wunused-but-set-variable.
Maybe we could just do ceph_decode_32(&p) and remove it
(it doesn't seem to add a warning) and/or add some comment ?
Regards,
Fabian
^ permalink raw reply
* Re: [PATCH 1/1 net-next] libceph: remove unused variable in handle_reply()
From: Ilya Dryomov @ 2014-10-25 15:05 UTC (permalink / raw)
To: Fabian Frederick
Cc: netdev, Linux Kernel Mailing List, Ceph Development, Sage Weil,
David S. Miller
In-Reply-To: <1805780944.182558.1414228336940.open-xchange@webmail.nmp.skynet.be>
On Sat, Oct 25, 2014 at 1:12 PM, Fabian Frederick <fabf@skynet.be> wrote:
>
>
>> On 23 October 2014 at 18:25 Ilya Dryomov <ilya.dryomov@inktank.com> wrote:
>>
>>
>> On Thu, Oct 23, 2014 at 8:15 PM, Fabian Frederick <fabf@skynet.be> wrote:
>> > osdmap_epoch is redundant with reassert_epoch and unused.
>> >
>> > Signed-off-by: Fabian Frederick <fabf@skynet.be>
>> > ---
>> > net/ceph/osd_client.c | 2 --
>> > 1 file changed, 2 deletions(-)
>> >
>> > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
>> > index f3fc54e..432bd75 100644
>> > --- a/net/ceph/osd_client.c
>> > +++ b/net/ceph/osd_client.c
>> > @@ -1700,7 +1700,6 @@ static void handle_reply(struct ceph_osd_client *osdc,
>> > struct ceph_msg *msg,
>> > int err;
>> > u32 reassert_epoch;
>> > u64 reassert_version;
>> > - u32 osdmap_epoch;
>> > int already_completed;
>> > u32 bytes;
>> > unsigned int i;
>> > @@ -1725,7 +1724,6 @@ static void handle_reply(struct ceph_osd_client *osdc,
>> > struct ceph_msg *msg,
>> > result = ceph_decode_32(&p);
>> > reassert_epoch = ceph_decode_32(&p);
>> > reassert_version = ceph_decode_64(&p);
>> > - osdmap_epoch = ceph_decode_32(&p);
>> >
>> > /* lookup */
>> > down_read(&osdc->map_sem);
>>
>> Hi Fabian,
>>
>> osdmap_epoch is useful for debugging, but this is wrong anyway -
>> ceph_decode_32() has side effects. Removing it and not adjusting *p
>> would make the whole thing blow up pretty fast..
>>
>> Thanks,
>>
>> Ilya
>
> Hi Ilya,
>
> osdmap_epoch generates a warning with -Wunused-but-set-variable.
> Maybe we could just do ceph_decode_32(&p) and remove it
> (it doesn't seem to add a warning) and/or add some comment ?
I see, I'll fix it up.
Thanks,
Ilya
^ permalink raw reply
* [PATCHv6 net-next 0/3] sunvnet: NAPIfy sunvnet
From: Sowmini Varadhan @ 2014-10-25 19:11 UTC (permalink / raw)
To: davem, bob.picco, sowmini.varadhan, dwight.engen,
raghuram.kothakota; +Cc: netdev
This patchset converts the sunvnet driver to use the NAPI framework.
Changes since v4 to Patch1:
vnet_event accumulates LDC_EVENT_* bits into rx_event.
vnet_event_napi() unrolls send_events() logic to process all rx_event bits.
Changes since v5:
Patch 1: use net_device.h definition for NAPI_POLL_WEIGHT.
Drop sparclinux changes (patch3) per David Miller feedback
Patch 1 in the series addresses the packet-receive path- all
the vnet_event() processing is moved into NAPI context.
This patch is dependant on the sparc-next commit:
"sparc64: Add vio_set_intr() to enable/disable Rx interrupts"
(sparc commit id ca605b7dd740c8909408d67911d8ddd272c2b320)
Patch 2 uses RCU to fix race conditions between vnet_port_remove and
paths that access/modify port-related state, such as vnet_start_xmit.
Patch 3 leverages from the NAPIfied Rx path,
dropping superfluous usage of the irqsave/irqrestores on the vio.lock
where possible.
Sowmini Varadhan (4):
NAPIfy sunvnet
Use RCU to synchronize port usage with vnet_port_remove()
Avoid irqsave/restore on vio.lock if in_softirq()
Remove irqsave/irqrestore on vio.lock
arch/sparc/kernel/viohs.c | 8 +-
drivers/net/ethernet/sun/sunvnet.c | 263 +++++++++++++++++++++++--------------
drivers/net/ethernet/sun/sunvnet.h | 6 +-
3 files changed, 177 insertions(+), 100 deletions(-)
--
1.8.4.2
^ permalink raw reply
* [PATCHv6 net-next 1/3] sunvnet: NAPIfy sunvnet
From: Sowmini Varadhan @ 2014-10-25 19:12 UTC (permalink / raw)
To: davem, bob.picco, sowmini.varadhan, dwight.engen,
raghuram.kothakota, david.stevens
Cc: netdev
Move Rx packet procssing to the NAPI poll callback.
Disable VIO interrupt and unconditioanlly go into NAPI
context from vnet_event.
Note that we want to minimize the number of LDC
STOP/START messages sent. Specifically, do not send a STOP
message if vnet_walk_rx does not read all the available descriptors
because of the NAPI budget limitation. Instead, note the end index
as part of port state, and resume from this index when the
next poll callback is triggered.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Raghuram Kothakota <raghuram.kothakota@oracle.com>
Acked-by: Dwight Engen <dwight.engen@oracle.com>
---
changes since v2: use NAPI.
changes since v3: David Stevens comments.
Changes since v4: vnet_event() must accumulate LDC_EVENT_* bits into rx_event
and all these bits should be processed in vnet_event_api()
in the same order as send_events()
changes since v5: use netdevice.h definition for NAPI_POLL_WEIGHT
drivers/net/ethernet/sun/sunvnet.c | 173 ++++++++++++++++++++++++++++---------
drivers/net/ethernet/sun/sunvnet.h | 6 +-
2 files changed, 135 insertions(+), 44 deletions(-)
diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
index 3652afd..9e048f5 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -311,9 +311,7 @@ static int vnet_rx_one(struct vnet_port *port, unsigned int len,
dev->stats.rx_packets++;
dev->stats.rx_bytes += len;
-
- netif_rx(skb);
-
+ napi_gro_receive(&port->napi, skb);
return 0;
out_free_skb:
@@ -430,6 +428,7 @@ static int vnet_walk_rx_one(struct vnet_port *port,
struct vio_driver_state *vio = &port->vio;
int err;
+ BUG_ON(desc == NULL);
if (IS_ERR(desc))
return PTR_ERR(desc);
@@ -456,10 +455,11 @@ static int vnet_walk_rx_one(struct vnet_port *port,
}
static int vnet_walk_rx(struct vnet_port *port, struct vio_dring_state *dr,
- u32 start, u32 end)
+ u32 start, u32 end, int *npkts, int budget)
{
struct vio_driver_state *vio = &port->vio;
int ack_start = -1, ack_end = -1;
+ bool send_ack = true;
end = (end == (u32) -1) ? prev_idx(start, dr) : next_idx(end, dr);
@@ -471,6 +471,7 @@ static int vnet_walk_rx(struct vnet_port *port, struct vio_dring_state *dr,
return err;
if (err != 0)
break;
+ (*npkts)++;
if (ack_start == -1)
ack_start = start;
ack_end = start;
@@ -482,13 +483,26 @@ static int vnet_walk_rx(struct vnet_port *port, struct vio_dring_state *dr,
return err;
ack_start = -1;
}
+ if ((*npkts) >= budget) {
+ send_ack = false;
+ break;
+ }
}
if (unlikely(ack_start == -1))
ack_start = ack_end = prev_idx(start, dr);
- return vnet_send_ack(port, dr, ack_start, ack_end, VIO_DRING_STOPPED);
+ if (send_ack) {
+ port->napi_resume = false;
+ return vnet_send_ack(port, dr, ack_start, ack_end,
+ VIO_DRING_STOPPED);
+ } else {
+ port->napi_resume = true;
+ port->napi_stop_idx = ack_end;
+ return 1;
+ }
}
-static int vnet_rx(struct vnet_port *port, void *msgbuf)
+static int vnet_rx(struct vnet_port *port, void *msgbuf, int *npkts,
+ int budget)
{
struct vio_dring_data *pkt = msgbuf;
struct vio_dring_state *dr = &port->vio.drings[VIO_DRIVER_RX_RING];
@@ -505,11 +519,13 @@ static int vnet_rx(struct vnet_port *port, void *msgbuf)
return 0;
}
- dr->rcv_nxt++;
+ if (!port->napi_resume)
+ dr->rcv_nxt++;
/* XXX Validate pkt->start_idx and pkt->end_idx XXX */
- return vnet_walk_rx(port, dr, pkt->start_idx, pkt->end_idx);
+ return vnet_walk_rx(port, dr, pkt->start_idx, pkt->end_idx,
+ npkts, budget);
}
static int idx_is_pending(struct vio_dring_state *dr, u32 end)
@@ -542,9 +558,12 @@ static int vnet_ack(struct vnet_port *port, void *msgbuf)
if (unlikely(!idx_is_pending(dr, end)))
return 0;
+ vp = port->vp;
+ dev = vp->dev;
/* sync for race conditions with vnet_start_xmit() and tell xmit it
* is time to send a trigger.
*/
+ netif_tx_lock(dev);
dr->cons = next_idx(end, dr);
desc = vio_dring_entry(dr, dr->cons);
if (desc->hdr.state == VIO_DESC_READY && port->start_cons) {
@@ -559,10 +578,8 @@ static int vnet_ack(struct vnet_port *port, void *msgbuf)
} else {
port->start_cons = true;
}
+ netif_tx_unlock(dev);
-
- vp = port->vp;
- dev = vp->dev;
if (unlikely(netif_queue_stopped(dev) &&
vnet_tx_dring_avail(dr) >= VNET_TX_WAKEUP_THRESH(dr)))
return 1;
@@ -591,9 +608,8 @@ static int handle_mcast(struct vnet_port *port, void *msgbuf)
return 0;
}
-static void maybe_tx_wakeup(unsigned long param)
+static void maybe_tx_wakeup(struct vnet *vp)
{
- struct vnet *vp = (struct vnet *)param;
struct net_device *dev = vp->dev;
netif_tx_lock(dev);
@@ -617,32 +633,43 @@ static void maybe_tx_wakeup(unsigned long param)
netif_tx_unlock(dev);
}
-static void vnet_event(void *arg, int event)
+static inline bool port_is_up(struct vnet_port *vnet)
+{
+ struct vio_driver_state *vio = &vnet->vio;
+
+ return !!(vio->hs_state & VIO_HS_COMPLETE);
+}
+
+static int vnet_event_napi(struct vnet_port *port, int budget)
{
- struct vnet_port *port = arg;
struct vio_driver_state *vio = &port->vio;
- unsigned long flags;
int tx_wakeup, err;
+ int npkts = 0;
+ int event = (port->rx_event & LDC_EVENT_RESET);
- spin_lock_irqsave(&vio->lock, flags);
-
+ldc_ctrl:
if (unlikely(event == LDC_EVENT_RESET ||
event == LDC_EVENT_UP)) {
vio_link_state_change(vio, event);
- spin_unlock_irqrestore(&vio->lock, flags);
if (event == LDC_EVENT_RESET) {
port->rmtu = 0;
vio_port_up(vio);
}
- return;
+ port->rx_event = 0;
+ return 0;
}
+ /* We may have multiple LDC events in rx_event. Unroll send_events() */
+ event = (port->rx_event & LDC_EVENT_UP);
+ port->rx_event &= ~(LDC_EVENT_RESET|LDC_EVENT_UP);
+ if (event == LDC_EVENT_UP)
+ goto ldc_ctrl;
+ event = port->rx_event;
+ if (!(event & LDC_EVENT_DATA_READY))
+ return 0;
- if (unlikely(event != LDC_EVENT_DATA_READY)) {
- pr_warn("Unexpected LDC event %d\n", event);
- spin_unlock_irqrestore(&vio->lock, flags);
- return;
- }
+ /* we dont expect any other bits than RESET, UP, DATA_READY */
+ BUG_ON(event != LDC_EVENT_DATA_READY);
tx_wakeup = err = 0;
while (1) {
@@ -651,6 +678,21 @@ static void vnet_event(void *arg, int event)
u64 raw[8];
} msgbuf;
+ if (port->napi_resume) {
+ struct vio_dring_data *pkt =
+ (struct vio_dring_data *)&msgbuf;
+ struct vio_dring_state *dr =
+ &port->vio.drings[VIO_DRIVER_RX_RING];
+
+ pkt->tag.type = VIO_TYPE_DATA;
+ pkt->tag.stype = VIO_SUBTYPE_INFO;
+ pkt->tag.stype_env = VIO_DRING_DATA;
+ pkt->seq = dr->rcv_nxt;
+ pkt->start_idx = next_idx(port->napi_stop_idx, dr);
+ pkt->end_idx = -1;
+ goto napi_resume;
+ }
+ldc_read:
err = ldc_read(vio->lp, &msgbuf, sizeof(msgbuf));
if (unlikely(err < 0)) {
if (err == -ECONNRESET)
@@ -667,10 +709,22 @@ static void vnet_event(void *arg, int event)
err = vio_validate_sid(vio, &msgbuf.tag);
if (err < 0)
break;
-
+napi_resume:
if (likely(msgbuf.tag.type == VIO_TYPE_DATA)) {
if (msgbuf.tag.stype == VIO_SUBTYPE_INFO) {
- err = vnet_rx(port, &msgbuf);
+ if (!port_is_up(port)) {
+ /* failures like handshake_failure()
+ * may have cleaned up dring, but
+ * NAPI polling may bring us here.
+ */
+ err = -ECONNRESET;
+ break;
+ }
+ err = vnet_rx(port, &msgbuf, &npkts, budget);
+ if (npkts >= budget)
+ break;
+ if (npkts == 0 && err != -ECONNRESET)
+ goto ldc_read;
} else if (msgbuf.tag.stype == VIO_SUBTYPE_ACK) {
err = vnet_ack(port, &msgbuf);
if (err > 0)
@@ -691,15 +745,33 @@ static void vnet_event(void *arg, int event)
if (err == -ECONNRESET)
break;
}
- spin_unlock(&vio->lock);
- /* Kick off a tasklet to wake the queue. We cannot call
- * maybe_tx_wakeup directly here because we could deadlock on
- * netif_tx_lock() with dev_watchdog()
- */
if (unlikely(tx_wakeup && err != -ECONNRESET))
- tasklet_schedule(&port->vp->vnet_tx_wakeup);
+ maybe_tx_wakeup(port->vp);
+ return npkts;
+}
+
+static int vnet_poll(struct napi_struct *napi, int budget)
+{
+ struct vnet_port *port = container_of(napi, struct vnet_port, napi);
+ struct vio_driver_state *vio = &port->vio;
+ int processed = vnet_event_napi(port, budget);
+
+ if (processed < budget) {
+ napi_complete(napi);
+ vio_set_intr(vio->vdev->rx_ino, HV_INTR_ENABLED);
+ }
+ return processed;
+}
+
+static void vnet_event(void *arg, int event)
+{
+ struct vnet_port *port = arg;
+ struct vio_driver_state *vio = &port->vio;
+
+ port->rx_event |= event;
+ vio_set_intr(vio->vdev->rx_ino, HV_INTR_DISABLED);
+ napi_schedule(&port->napi);
- local_irq_restore(flags);
}
static int __vnet_tx_trigger(struct vnet_port *port, u32 start)
@@ -746,13 +818,6 @@ static int __vnet_tx_trigger(struct vnet_port *port, u32 start)
return err;
}
-static inline bool port_is_up(struct vnet_port *vnet)
-{
- struct vio_driver_state *vio = &vnet->vio;
-
- return !!(vio->hs_state & VIO_HS_COMPLETE);
-}
-
struct vnet_port *__tx_port_find(struct vnet *vp, struct sk_buff *skb)
{
unsigned int hash = vnet_hashfn(skb->data);
@@ -1342,6 +1407,21 @@ err_out:
return err;
}
+#ifdef CONFIG_NET_POLL_CONTROLLER
+static void vnet_poll_controller(struct net_device *dev)
+{
+ struct vnet *vp = netdev_priv(dev);
+ struct vnet_port *port;
+ unsigned long flags;
+
+ spin_lock_irqsave(&vp->lock, flags);
+ if (!list_empty(&vp->port_list)) {
+ port = list_entry(vp->port_list.next, struct vnet_port, list);
+ napi_schedule(&port->napi);
+ }
+ spin_unlock_irqrestore(&vp->lock, flags);
+}
+#endif
static LIST_HEAD(vnet_list);
static DEFINE_MUTEX(vnet_list_mutex);
@@ -1354,6 +1434,9 @@ static const struct net_device_ops vnet_ops = {
.ndo_tx_timeout = vnet_tx_timeout,
.ndo_change_mtu = vnet_change_mtu,
.ndo_start_xmit = vnet_start_xmit,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+ .ndo_poll_controller = vnet_poll_controller,
+#endif
};
static struct vnet *vnet_new(const u64 *local_mac)
@@ -1374,7 +1457,6 @@ static struct vnet *vnet_new(const u64 *local_mac)
vp = netdev_priv(dev);
spin_lock_init(&vp->lock);
- tasklet_init(&vp->vnet_tx_wakeup, maybe_tx_wakeup, (unsigned long)vp);
vp->dev = dev;
INIT_LIST_HEAD(&vp->port_list);
@@ -1434,7 +1516,6 @@ static void vnet_cleanup(void)
vp = list_first_entry(&vnet_list, struct vnet, list);
list_del(&vp->list);
dev = vp->dev;
- tasklet_kill(&vp->vnet_tx_wakeup);
/* vio_unregister_driver() should have cleaned up port_list */
BUG_ON(!list_empty(&vp->port_list));
unregister_netdev(dev);
@@ -1536,6 +1617,8 @@ static int vnet_port_probe(struct vio_dev *vdev, const struct vio_device_id *id)
if (err)
goto err_out_free_port;
+ netif_napi_add(port->vp->dev, &port->napi, vnet_poll, NAPI_POLL_WEIGHT);
+
err = vnet_port_alloc_tx_bufs(port);
if (err)
goto err_out_free_ldc;
@@ -1564,6 +1647,7 @@ static int vnet_port_probe(struct vio_dev *vdev, const struct vio_device_id *id)
setup_timer(&port->clean_timer, vnet_clean_timer_expire,
(unsigned long)port);
+ napi_enable(&port->napi);
vio_port_up(&port->vio);
mdesc_release(hp);
@@ -1571,6 +1655,7 @@ static int vnet_port_probe(struct vio_dev *vdev, const struct vio_device_id *id)
return 0;
err_out_free_ldc:
+ netif_napi_del(&port->napi);
vio_ldc_free(&port->vio);
err_out_free_port:
@@ -1592,11 +1677,13 @@ static int vnet_port_remove(struct vio_dev *vdev)
del_timer_sync(&port->vio.timer);
del_timer_sync(&port->clean_timer);
+ napi_disable(&port->napi);
spin_lock_irqsave(&vp->lock, flags);
list_del(&port->list);
hlist_del(&port->hash);
spin_unlock_irqrestore(&vp->lock, flags);
+ netif_napi_del(&port->napi);
vnet_port_free_tx_bufs(port);
vio_ldc_free(&port->vio);
diff --git a/drivers/net/ethernet/sun/sunvnet.h b/drivers/net/ethernet/sun/sunvnet.h
index c911045..c8a862e 100644
--- a/drivers/net/ethernet/sun/sunvnet.h
+++ b/drivers/net/ethernet/sun/sunvnet.h
@@ -56,6 +56,11 @@ struct vnet_port {
struct timer_list clean_timer;
u64 rmtu;
+
+ struct napi_struct napi;
+ u32 napi_stop_idx;
+ bool napi_resume;
+ int rx_event;
};
static inline struct vnet_port *to_vnet_port(struct vio_driver_state *vio)
@@ -97,7 +102,6 @@ struct vnet {
struct list_head list;
u64 local_mac;
- struct tasklet_struct vnet_tx_wakeup;
};
#endif /* _SUNVNET_H */
--
1.8.4.2
^ permalink raw reply related
* [PATCHv6 net-next 3/3] sunvnet: Remove irqsave/irqrestore on vio.lock
From: Sowmini Varadhan @ 2014-10-25 19:12 UTC (permalink / raw)
To: davem, sowmini.varadhan; +Cc: netdev
After the NAPIfication of sunvnet, we no longer need to
synchronize by doing irqsave/restore on vio.lock in the
I/O fastpath.
NAPI ->poll() is non-reentrant, so all RX processing occurs
strictly in a serialized environment. TX reclaim is done in NAPI
context, so the netif_tx_lock can be used to serialize
critical sections between Tx and Rx paths.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
drivers/net/ethernet/sun/sunvnet.c | 30 +++++-------------------------
1 file changed, 5 insertions(+), 25 deletions(-)
diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
index 966c252..c390a27 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -842,18 +842,6 @@ struct vnet_port *__tx_port_find(struct vnet *vp, struct sk_buff *skb)
return NULL;
}
-struct vnet_port *tx_port_find(struct vnet *vp, struct sk_buff *skb)
-{
- struct vnet_port *ret;
- unsigned long flags;
-
- spin_lock_irqsave(&vp->lock, flags);
- ret = __tx_port_find(vp, skb);
- spin_unlock_irqrestore(&vp->lock, flags);
-
- return ret;
-}
-
static struct sk_buff *vnet_clean_tx_ring(struct vnet_port *port,
unsigned *pending)
{
@@ -914,11 +902,10 @@ static void vnet_clean_timer_expire(unsigned long port0)
struct vnet_port *port = (struct vnet_port *)port0;
struct sk_buff *freeskbs;
unsigned pending;
- unsigned long flags;
- spin_lock_irqsave(&port->vio.lock, flags);
+ netif_tx_lock(port->vp->dev);
freeskbs = vnet_clean_tx_ring(port, &pending);
- spin_unlock_irqrestore(&port->vio.lock, flags);
+ netif_tx_unlock(port->vp->dev);
vnet_free_skbs(freeskbs);
@@ -971,7 +958,6 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
struct vnet_port *port = NULL;
struct vio_dring_state *dr;
struct vio_net_desc *d;
- unsigned long flags;
unsigned int len;
struct sk_buff *freeskbs = NULL;
int i, err, txi;
@@ -984,7 +970,7 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
goto out_dropped;
rcu_read_lock();
- port = tx_port_find(vp, skb);
+ port = __tx_port_find(vp, skb);
if (unlikely(!port))
goto out_dropped;
@@ -1020,8 +1006,6 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
goto out_dropped;
}
- spin_lock_irqsave(&port->vio.lock, flags);
-
dr = &port->vio.drings[VIO_DRIVER_TX_RING];
if (unlikely(vnet_tx_dring_avail(dr) < 1)) {
if (!netif_queue_stopped(dev)) {
@@ -1055,7 +1039,7 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
(LDC_MAP_SHADOW | LDC_MAP_DIRECT | LDC_MAP_RW));
if (err < 0) {
netdev_info(dev, "tx buffer map error %d\n", err);
- goto out_dropped_unlock;
+ goto out_dropped;
}
port->tx_bufs[txi].ncookies = err;
@@ -1108,7 +1092,7 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
netdev_info(dev, "TX trigger error %d\n", err);
d->hdr.state = VIO_DESC_FREE;
dev->stats.tx_carrier_errors++;
- goto out_dropped_unlock;
+ goto out_dropped;
}
ldc_start_done:
@@ -1124,7 +1108,6 @@ ldc_start_done:
netif_wake_queue(dev);
}
- spin_unlock_irqrestore(&port->vio.lock, flags);
(void)mod_timer(&port->clean_timer, jiffies + VNET_CLEAN_TIMEOUT);
rcu_read_unlock();
@@ -1132,9 +1115,6 @@ ldc_start_done:
return NETDEV_TX_OK;
-out_dropped_unlock:
- spin_unlock_irqrestore(&port->vio.lock, flags);
-
out_dropped:
if (pending)
(void)mod_timer(&port->clean_timer,
--
1.8.4.2
^ permalink raw reply related
* [PATCHv6 net-next 2/3] sunvnet: Use RCU to synchronize port usage with vnet_port_remove()
From: Sowmini Varadhan @ 2014-10-25 19:12 UTC (permalink / raw)
To: davem, bob.picco, sowmini.varadhan, dwight.engen, david.stevens; +Cc: netdev
A vnet_port_remove could be triggered as a result of an ldm-unbind
operation by the peer, module unload, or other changes to the
inter-vnet-link configuration. When this is concurrent with
vnet_start_xmit(), there are several race sequences possible,
such as
thread 1 thread 2
vnet_start_xmit
-> tx_port_find
spin_lock_irqsave(&vp->lock..)
ret = __tx_port_find(..)
spin_lock_irqrestore(&vp->lock..)
vio_remove -> ..
->vnet_port_remove
spin_lock_irqsave(&vp->lock..)
cleanup
spin_lock_irqrestore(&vp->lock..)
kfree(port)
/* attempt to use ret will bomb */
This patch adds RCU locking for port access so that vnet_port_remove
will correctly clean up port-related state.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Dwight Engen <dwight.engen@oracle.com>
Acked-by: Bob Picco <bob.picco@oracle.com>
---
changes since v2: use RCU.
changes since v3: incorporate David Stevens feedback
drivers/net/ethernet/sun/sunvnet.c | 62 ++++++++++++++++++++------------------
1 file changed, 33 insertions(+), 29 deletions(-)
diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
index 9e048f5..966c252 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -617,7 +617,8 @@ static void maybe_tx_wakeup(struct vnet *vp)
struct vnet_port *port;
int wake = 1;
- list_for_each_entry(port, &vp->port_list, list) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(port, &vp->port_list, list) {
struct vio_dring_state *dr;
dr = &port->vio.drings[VIO_DRIVER_TX_RING];
@@ -627,6 +628,7 @@ static void maybe_tx_wakeup(struct vnet *vp)
break;
}
}
+ rcu_read_unlock();
if (wake)
netif_wake_queue(dev);
}
@@ -824,13 +826,13 @@ struct vnet_port *__tx_port_find(struct vnet *vp, struct sk_buff *skb)
struct hlist_head *hp = &vp->port_hash[hash];
struct vnet_port *port;
- hlist_for_each_entry(port, hp, hash) {
+ hlist_for_each_entry_rcu(port, hp, hash) {
if (!port_is_up(port))
continue;
if (ether_addr_equal(port->raddr, skb->data))
return port;
}
- list_for_each_entry(port, &vp->port_list, list) {
+ list_for_each_entry_rcu(port, &vp->port_list, list) {
if (!port->switch_port)
continue;
if (!port_is_up(port))
@@ -966,7 +968,7 @@ static inline struct sk_buff *vnet_skb_shape(struct sk_buff *skb, void **pstart,
static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct vnet *vp = netdev_priv(dev);
- struct vnet_port *port = tx_port_find(vp, skb);
+ struct vnet_port *port = NULL;
struct vio_dring_state *dr;
struct vio_net_desc *d;
unsigned long flags;
@@ -977,14 +979,15 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
int nlen = 0;
unsigned pending = 0;
- if (unlikely(!port))
- goto out_dropped;
-
skb = vnet_skb_shape(skb, &start, &nlen);
-
if (unlikely(!skb))
goto out_dropped;
+ rcu_read_lock();
+ port = tx_port_find(vp, skb);
+ if (unlikely(!port))
+ goto out_dropped;
+
if (skb->len > port->rmtu) {
unsigned long localmtu = port->rmtu - ETH_HLEN;
@@ -1002,6 +1005,7 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
fl4.saddr = ip_hdr(skb)->saddr;
rt = ip_route_output_key(dev_net(dev), &fl4);
+ rcu_read_unlock();
if (!IS_ERR(rt)) {
skb_dst_set(skb, &rt->dst);
icmp_send(skb, ICMP_DEST_UNREACH,
@@ -1027,7 +1031,7 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
netdev_err(dev, "BUG! Tx Ring full when queue awake!\n");
dev->stats.tx_errors++;
}
- spin_unlock_irqrestore(&port->vio.lock, flags);
+ rcu_read_unlock();
return NETDEV_TX_BUSY;
}
@@ -1121,25 +1125,27 @@ ldc_start_done:
}
spin_unlock_irqrestore(&port->vio.lock, flags);
+ (void)mod_timer(&port->clean_timer, jiffies + VNET_CLEAN_TIMEOUT);
+ rcu_read_unlock();
vnet_free_skbs(freeskbs);
- (void)mod_timer(&port->clean_timer, jiffies + VNET_CLEAN_TIMEOUT);
-
return NETDEV_TX_OK;
out_dropped_unlock:
spin_unlock_irqrestore(&port->vio.lock, flags);
out_dropped:
- if (skb)
- dev_kfree_skb(skb);
- vnet_free_skbs(freeskbs);
if (pending)
(void)mod_timer(&port->clean_timer,
jiffies + VNET_CLEAN_TIMEOUT);
else if (port)
del_timer(&port->clean_timer);
+ if (port)
+ rcu_read_unlock();
+ if (skb)
+ dev_kfree_skb(skb);
+ vnet_free_skbs(freeskbs);
dev->stats.tx_dropped++;
return NETDEV_TX_OK;
}
@@ -1269,18 +1275,17 @@ static void vnet_set_rx_mode(struct net_device *dev)
{
struct vnet *vp = netdev_priv(dev);
struct vnet_port *port;
- unsigned long flags;
- spin_lock_irqsave(&vp->lock, flags);
- if (!list_empty(&vp->port_list)) {
- port = list_entry(vp->port_list.next, struct vnet_port, list);
+ rcu_read_lock();
+ list_for_each_entry_rcu(port, &vp->port_list, list) {
if (port->switch_port) {
__update_mc_list(vp, dev);
__send_mc_list(vp, port);
+ break;
}
}
- spin_unlock_irqrestore(&vp->lock, flags);
+ rcu_read_unlock();
}
static int vnet_change_mtu(struct net_device *dev, int new_mtu)
@@ -1633,10 +1638,11 @@ static int vnet_port_probe(struct vio_dev *vdev, const struct vio_device_id *id)
spin_lock_irqsave(&vp->lock, flags);
if (switch_port)
- list_add(&port->list, &vp->port_list);
+ list_add_rcu(&port->list, &vp->port_list);
else
- list_add_tail(&port->list, &vp->port_list);
- hlist_add_head(&port->hash, &vp->port_hash[vnet_hashfn(port->raddr)]);
+ list_add_tail_rcu(&port->list, &vp->port_list);
+ hlist_add_head_rcu(&port->hash,
+ &vp->port_hash[vnet_hashfn(port->raddr)]);
spin_unlock_irqrestore(&vp->lock, flags);
dev_set_drvdata(&vdev->dev, port);
@@ -1671,18 +1677,16 @@ static int vnet_port_remove(struct vio_dev *vdev)
struct vnet_port *port = dev_get_drvdata(&vdev->dev);
if (port) {
- struct vnet *vp = port->vp;
- unsigned long flags;
del_timer_sync(&port->vio.timer);
- del_timer_sync(&port->clean_timer);
napi_disable(&port->napi);
- spin_lock_irqsave(&vp->lock, flags);
- list_del(&port->list);
- hlist_del(&port->hash);
- spin_unlock_irqrestore(&vp->lock, flags);
+ list_del_rcu(&port->list);
+ hlist_del_rcu(&port->hash);
+
+ synchronize_rcu();
+ del_timer_sync(&port->clean_timer);
netif_napi_del(&port->napi);
vnet_port_free_tx_bufs(port);
vio_ldc_free(&port->vio);
--
1.8.4.2
^ permalink raw reply related
* Re: [PATCH 06/14] net: dsa: Add support for hardware monitoring
From: Andrew Lunn @ 2014-10-25 18:00 UTC (permalink / raw)
To: Florian Fainelli
Cc: Andrew Lunn, Guenter Roeck, netdev, David S. Miller,
linux-kernel@vger.kernel.org
In-Reply-To: <544BDC8F.2030308@gmail.com>
On Sat, Oct 25, 2014 at 10:23:27AM -0700, Florian Fainelli wrote:
> On 10/25/14 07:01, Andrew Lunn wrote:
> >> Here is another naming option:
> >>
> >> em1dsa0-virtual-0
> >
> > I prefer this over isa.
> >
> > However, i think there should be some sort of separator between the
> > network device name and dsa.
>
> Considering that network devices can be renamed, do we want it to be
> included in the sensor name at all?
Well, we need something which identifies the "DSA collection". In
that, you could have two or more collections, connected to different
network devices.
I once came across a PCI board with two Intel ethernet chipsets and
two 10 port Marvell switches. Each switch had one host cpu port, 3
ports interconnected to the other switch, and 6 going to the back
panel. You would want to describe that as two separate DSA entities,
not one DSA with two switches. So the name needs something to indicate
the DSA collection.
Andrew
^ permalink raw reply
* Re: [PATCH 06/14] net: dsa: Add support for hardware monitoring
From: Guenter Roeck @ 2014-10-25 19:44 UTC (permalink / raw)
To: Andrew Lunn
Cc: Florian Fainelli, netdev, David S. Miller,
linux-kernel@vger.kernel.org
In-Reply-To: <20141025140116.GA12051@lunn.ch>
On 10/25/2014 07:01 AM, Andrew Lunn wrote:
>> Here is another naming option:
>>
>> em1dsa0-virtual-0
>
> I prefer this over isa.
>
> However, i think there should be some sort of separator between the
> network device name and dsa.
>
Fine with me. Any preference ? Note that '-' is not permitted.
Guenter
^ permalink raw reply
* Fw: [Bug 86851] New: Reproducible panic on heavy UDP traffic
From: Stephen Hemminger @ 2014-10-25 8:40 UTC (permalink / raw)
To: netdev
Begin forwarded message:
Date: Fri, 24 Oct 2014 11:34:08 -0700
From: "bugzilla-daemon@bugzilla.kernel.org" <bugzilla-daemon@bugzilla.kernel.org>
To: "stephen@networkplumber.org" <stephen@networkplumber.org>
Subject: [Bug 86851] New: Reproducible panic on heavy UDP traffic
https://bugzilla.kernel.org/show_bug.cgi?id=86851
Bug ID: 86851
Summary: Reproducible panic on heavy UDP traffic
Product: Networking
Version: 2.5
Kernel Version: 3.18-rc1
Hardware: x86-64
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: IPV4
Assignee: shemminger@linux-foundation.org
Reporter: chutzpah@gentoo.org
Regression: No
Created attachment 154861
--> https://bugzilla.kernel.org/attachment.cgi?id=154861&action=edit
Panic message captured over serial console
This bug is being encountered on a machine with 6 ixgbe interfaces configured
as 3 bonded pairs. Currently using netperf for testing UDP throughput from ~136
client machines connected via 1G interfaces.
This is the command being run on the clients:
netperf -H XX.XX.XX.XX -t OMNI -l 60 -- -d send -T UDP -O throughput,direction
-m 16384
This is the command being run on the server:
netserver -4
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply
* Re: [PATCH v2 net] tcp: md5: do not use alloc_percpu()
From: David Miller @ 2014-10-25 20:11 UTC (permalink / raw)
To: eric.dumazet; +Cc: dsahern, cdleonard, netdev, jtoppins
In-Reply-To: <1414094338.20845.30.camel@edumazet-glaptop2.roam.corp.google.com>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 23 Oct 2014 12:58:58 -0700
> From: Eric Dumazet <edumazet@google.com>
>
> percpu tcp_md5sig_pool contains memory blobs that ultimately
> go through sg_set_buf().
>
> -> sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
>
> This requires that whole area is in a physically contiguous portion
> of memory. And that @buf is not backed by vmalloc().
>
> Given that alloc_percpu() can use vmalloc() areas, this does not
> fit the requirements.
>
> Replace alloc_percpu() by a static DEFINE_PER_CPU() as tcp_md5sig_pool
> is small anyway, there is no gain to dynamically allocate it.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Fixes: 765cf9976e93 ("tcp: md5: remove one indirection level in tcp_md5sig_pool")
> Reported-by: Crestez Dan Leonard <cdleonard@gmail.com>
Applied and queued up for -stable, thanks Eric.
Longer term we should do a proper shash/ahash conversion of the tcp md5
code, using the rules provided by Herbert Xu.
Thanks.
^ permalink raw reply
* Re: [PATCHv6 net-next 0/3] sunvnet: NAPIfy sunvnet
From: David Miller @ 2014-10-25 20:21 UTC (permalink / raw)
To: sowmini.varadhan; +Cc: bob.picco, dwight.engen, raghuram.kothakota, netdev
In-Reply-To: <20141025191152.GA31334@oracle.com>
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Sat, 25 Oct 2014 15:11:52 -0400
> This patchset converts the sunvnet driver to use the NAPI framework.
> Changes since v4 to Patch1:
> vnet_event accumulates LDC_EVENT_* bits into rx_event.
> vnet_event_napi() unrolls send_events() logic to process all rx_event bits.
> Changes since v5:
> Patch 1: use net_device.h definition for NAPI_POLL_WEIGHT.
> Drop sparclinux changes (patch3) per David Miller feedback
>
> Patch 1 in the series addresses the packet-receive path- all
> the vnet_event() processing is moved into NAPI context.
> This patch is dependant on the sparc-next commit:
> "sparc64: Add vio_set_intr() to enable/disable Rx interrupts"
> (sparc commit id ca605b7dd740c8909408d67911d8ddd272c2b320)
>
> Patch 2 uses RCU to fix race conditions between vnet_port_remove and
> paths that access/modify port-related state, such as vnet_start_xmit.
>
> Patch 3 leverages from the NAPIfied Rx path,
> dropping superfluous usage of the irqsave/irqrestores on the vio.lock
> where possible.
Series applied, thanks.
^ permalink raw reply
* Re: [PATCH net] macvlan: fix a race on port dismantle and possible skb leaks
From: David Miller @ 2014-10-25 20:24 UTC (permalink / raw)
To: herbert; +Cc: eric.dumazet, netdev
In-Reply-To: <20141023032858.GA1836@gondor.apana.org.au>
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 23 Oct 2014 11:28:58 +0800
> On Wed, Oct 22, 2014 at 07:43:46PM -0700, Eric Dumazet wrote:
>> From: Eric Dumazet <edumazet@google.com>
>>
>> We need to cancel the work queue after rcu grace period,
>> otherwise it can be rescheduled by incoming packets.
>>
>> We need to purge queue if some skbs are still in it.
>>
>> We can use __skb_queue_head_init() variant in
>> macvlan_process_broadcast()
>>
>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>> Fixes: 412ca1550cbec ("macvlan: Move broadcasts into a work queue")
>> Cc: Herbert Xu <herbert@gondor.apana.org.au>
>
> Good catch! Your fix looks good to me.
Applied and queued up for -stable, thanks everyone.
^ permalink raw reply
* Re: [net] i40e: _MASK vs _SHIFT typo in i40e_handle_mdd_event()
From: David Miller @ 2014-10-25 20:51 UTC (permalink / raw)
To: jeffrey.t.kirsher; +Cc: dan.carpenter, netdev, nhorman, sassmann, jogreene
In-Reply-To: <1414033589-7544-1-git-send-email-jeffrey.t.kirsher@intel.com>
From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Wed, 22 Oct 2014 20:06:29 -0700
> From: Dan Carpenter <dan.carpenter@oracle.com>
>
> We accidentally mask by the _SHIFT variable. It means that "event" is
> always zero.
>
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> Tested-by: Jim Young <jamesx.m.young@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH v3] drivers: net: xgene: Rewrite buggy loop in xgene_enet_ecc_init()
From: David Miller @ 2014-10-25 21:05 UTC (permalink / raw)
To: geert; +Cc: isubramanian, kchudgar, netdev, linux-kernel
In-Reply-To: <1414052753-10183-1-git-send-email-geert@linux-m68k.org>
From: Geert Uytterhoeven <geert@linux-m68k.org>
Date: Thu, 23 Oct 2014 10:25:53 +0200
> drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c: In function ‘xgene_enet_ecc_init’:
> drivers/net/ethernet/apm/xgene/xgene_enet_sgmac.c:126: warning: ‘data’ may be used uninitialized in this function
>
> Depending on the arbitrary value on the stack, the loop may terminate
> too early, and cause a bogus -ENODEV failure.
>
> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
> ---
> v3:
> - Use "do { ... } while (...);" instead of "for (...) { ... }",
> v2:
> - Rewrite the loop instead of pre-initializing data.
Applied, thank you.
^ permalink raw reply
* Re: [PATCH net-next] tcp: Add TCP_FREEZE socket option
From: Kristian Evensen @ 2014-10-25 21:21 UTC (permalink / raw)
To: Yuchung Cheng; +Cc: Hagen Paul Pfeifer, David Miller, Network Development
In-Reply-To: <CAK6E8=fiHrgicTXy2hBGh0GUcZRWWxpYxw_bNVmRD+ueN9mUdw@mail.gmail.com>
Hi,
Sorry for my late reply.
On Fri, Oct 24, 2014 at 4:58 PM, Yuchung Cheng <ycheng@google.com> wrote:
> Do packets actually get dropped during the handover period? if not
> then the sender can detect spurious RTOs and undo the cwnd reductions
> with timestamps or DSACKs (Eifel). Eifel did not exist when the
> freeze-TCP was published at 2000. Even if the connection does not
> support these options as major TCP stacks do, slow-start on a path
> with new BDP isn't that bad of an idea.
Yes, there is 100% is packet loss when several of these handovers
occur. In addition to looking at the pcap-files, I have made some
tests were I send UDP traffic at the same time as TCP. The UDP traffic
does, as expected, resume as soon as the handover is over. The TCP
traffic, on the other hand, recovers when the RTO expires.
Thanks for reminding me about Eifel btw.
-Kristian
^ permalink raw reply
* Re: [PATCH net-next] tcp: Add TCP_FREEZE socket option
From: Kristian Evensen @ 2014-10-25 21:29 UTC (permalink / raw)
To: Hagen Paul Pfeifer; +Cc: Yuchung Cheng, David Miller, Network Development
In-Reply-To: <CAPh34mdeYib4VSDT7Fj9i-oNc4c8HmUGT9KHpU2gS4scpODdpA@mail.gmail.com>
Hi,
On Fri, Oct 24, 2014 at 6:26 PM, Hagen Paul Pfeifer <hagen@jauu.net> wrote:
> Yes, starting with fresh values for a new links is a good thing to do.
> But what I think what Kristian want to address is to reduce larger
> idle period due to backoff'ing timeouts followed by larger idle
> periods? E.g. like a temporary cork for an exact period of time.
Yes, that is exactly what I want to achieve. To temporarily stop
traffic while the handover occurs, to prevent the penalty of the
potentially large idle period caused by the sender backing off. With
regards to which values to use when the handover is done, I agree that
it would make sense to start with for example a lower cwnd. The
network characteristics (for example throughput and latency) will most
likely have changed. However, at least the way I have thought about
this, such a change would also need changes on the remote machine.
> I thought that freeze-TCP was *also* designed to bridge a larger
> disconnection? The downside with this approach (compared to SplitTCP)
> is that you only send one instance into sleep. The other peer (sender)
> may run into timeouts too.
Yes, that is true. But I guess with the alternate design (netlink
module), the probability of this will be reduced. As long as the
application sets the frozen socket option and assuming zero window
will arrive, sender should stop as well.
-Kristian
^ permalink raw reply
* Re: Fw: [Bug 86851] New: Reproducible panic on heavy UDP traffic
From: Florian Westphal @ 2014-10-25 21:44 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev, nikolay
In-Reply-To: <20141025141038.26fa5ac2@uryu.home.lan>
Stephen Hemminger <stephen@networkplumber.org> wrote:
[ CC Nik ]
> Date: Fri, 24 Oct 2014 11:34:08 -0700
> From: "bugzilla-daemon@bugzilla.kernel.org" <bugzilla-daemon@bugzilla.kernel.org>
> To: "stephen@networkplumber.org" <stephen@networkplumber.org>
> Subject: [Bug 86851] New: Reproducible panic on heavy UDP traffic
>
>
> https://bugzilla.kernel.org/show_bug.cgi?id=86851
>
> Bug ID: 86851
> Summary: Reproducible panic on heavy UDP traffic
> Product: Networking
> Version: 2.5
> Kernel Version: 3.18-rc1
> Hardware: x86-64
> OS: Linux
> Tree: Mainline
> Status: NEW
> Severity: normal
> Priority: P1
> Component: IPV4
> Assignee: shemminger@linux-foundation.org
> Reporter: chutzpah@gentoo.org
> Regression: No
>
> Created attachment 154861
> --> https://bugzilla.kernel.org/attachment.cgi?id=154861&action=edit
> Panic message captured over serial console
general protection fault: 0000 [#1] SMP
Modules linked in: nfs [..]
CPU: 7 PID: 257 Comm: kworker/7:1 Tainted: G W 3.18.0-rc1-base-7+ #2
asked reporter to check if there is a warning before the oops.
Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.03.0003.041920141333 04/19/2014
Workqueue: events inet_frag_worker
task: ffff882fd32e70e0 ti: ffff882fd0adc000 task.ti: ffff882fd0adc000
RIP: 0010:[<ffffffff81592ab4>] [<ffffffff81592ab4>] inet_evict_bucket+0xf4/0x180
RSP: 0018:ffff882fd0adfd58 EFLAGS: 00010286
RAX: ffff8817c7230701 RBX: dead000000100100 RCX: 0000000180300013
Hello LIST_POISON!
RDX: 0000000180300014 RSI: 0000000000000001 RDI: dead0000001000c0
RBP: 0000000000000002 R08: 0000000000000202 R09: ffff88303fc39ab0
R10: ffffffff81592ac0 R11: ffffea005f1c8c00 R12: ffffffff81aa2820
R13: ffff882fd0adfd70 R14: ffff8817c72307e0 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88303fc20000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
device rack0a left promiscuous mode
CR2: 00007f054c7ba034 CR3: 0000002fc4986000 CR4: 00000000001407e0
Stack:
ffffffff81aa3298 ffffffff81aa3290 ffff8817d0820a08 0000000000000000
0000000000000000 00000000000000a8 0000000000000008 ffff88303fc32780
ffffffff81aa6820 0000000000000059[ 2415.026338] device rack1a left promiscuous mode
0000000000000000 ffffffff81592ba2
Call Trace:
[<ffffffff81592ba2>] ? inet_frag_worker+0x62/0x210
[<ffffffff8112c312>] ? process_one_work+0x132/0x360
[..]
crash is in hlist_for_each_entry_safe() at the end of inet_evict_bucket(), looks like
we encounter an already-list_del'd element while iterating.
Will look at this tomorrow.
^ permalink raw reply
* [PATCH -next 0/2] net: allow setting ecn via routing table
From: Florian Westphal @ 2014-10-25 22:38 UTC (permalink / raw)
To: netdev
These two patches allow turing on explicit congestion notification
based on the destination network.
For example, assuming the default tcp_ecn sysctl '2', the following will
enable ecn (tcp_ecn=1 behaviour, i.e. request ecn to be enabled for a
tcp connection) for all connections to hosts inside the 192.168.2/24 network:
ip route change 192.168.2.0/24 dev eth0 features ecn
Having a more fine-grained per-route setting can be beneficial for
various reasons, for example 1) within data centers, or 2) local ISPs
may deploy ECN support for their own video/streaming services [1], etc.
Joint work with Daniel Borkmann, feature suggested by Hannes Frederic Sowa.
The patch to enable this in iproute2 will be posted shortly, it is currently
also available here:
http://git.breakpoint.cc/cgit/fw/iproute2.git/commit/?h=iproute_features&id=8843d2d8973fb81c78a7efe6d42e3a17d739003e
[1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15
^ permalink raw reply
* [PATCH -next 1/2] syncookies: remove ecn_ok validation when decoding option timestamp
From: Florian Westphal @ 2014-10-25 22:38 UTC (permalink / raw)
To: netdev; +Cc: Florian Westphal
In-Reply-To: <1414276729-17871-1-git-send-email-fw@strlen.de>
Won't work anymore when tcp_ecn=0 and RTAX_FEATURES route attribute did
allow ecn.
While its possible to extend the test to also perform route lookup and
check RTAX_FEATURES, it doesn't seem worth it.
Thus, just turn on ecn if the client ts indicates so.
This means that while syn cookies are in use clients can turn on ecn
even if it is off. However, there seems to be no harm in permitting
this.
Signed-off-by: Florian Westphal <fw@strlen.de>
---
include/net/tcp.h | 3 +--
net/ipv4/syncookies.c | 6 ++----
net/ipv6/syncookies.c | 2 +-
3 files changed, 4 insertions(+), 7 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index c73fc14..7c85167 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -495,8 +495,7 @@ __u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
#endif
__u32 cookie_init_timestamp(struct request_sock *req);
-bool cookie_check_timestamp(struct tcp_options_received *opt, struct net *net,
- bool *ecn_ok);
+bool cookie_check_timestamp(struct tcp_options_received *opt, bool *ecn_ok);
/* From net/ipv6/syncookies.c */
int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th,
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 32b98d0..b84cc12 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -225,7 +225,7 @@ static inline struct sock *get_cookie_sock(struct sock *sk, struct sk_buff *skb,
* return false if we decode an option that should not be.
*/
bool cookie_check_timestamp(struct tcp_options_received *tcp_opt,
- struct net *net, bool *ecn_ok)
+ bool *ecn_ok)
{
/* echoed timestamp, lowest bits contain options */
u32 options = tcp_opt->rcv_tsecr & TSMASK;
@@ -240,8 +240,6 @@ bool cookie_check_timestamp(struct tcp_options_received *tcp_opt,
tcp_opt->sack_ok = (options & (1 << 4)) ? TCP_SACK_SEEN : 0;
*ecn_ok = (options >> 5) & 1;
- if (*ecn_ok && !net->ipv4.sysctl_tcp_ecn)
- return false;
if (tcp_opt->sack_ok && !sysctl_tcp_sack)
return false;
@@ -287,7 +285,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
memset(&tcp_opt, 0, sizeof(tcp_opt));
tcp_parse_options(skb, &tcp_opt, 0, NULL);
- if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
+ if (!cookie_check_timestamp(&tcp_opt, &ecn_ok))
goto out;
ret = NULL;
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 0e26e79..4df0258 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -183,7 +183,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
memset(&tcp_opt, 0, sizeof(tcp_opt));
tcp_parse_options(skb, &tcp_opt, 0, NULL);
- if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
+ if (!cookie_check_timestamp(&tcp_opt, &ecn_ok))
goto out;
ret = NULL;
--
2.0.4
^ permalink raw reply related
* [PATCH -next 2/2] net: allow setting ecn via routing table
From: Florian Westphal @ 2014-10-25 22:38 UTC (permalink / raw)
To: netdev; +Cc: Florian Westphal, Daniel Borkmann
In-Reply-To: <1414276729-17871-1-git-send-email-fw@strlen.de>
Allows to set ECN on a per-route basis in case the sysctl tcp_ecn is
not 1. IOW, when ECN is set for specific routes, it provides a
tcp_ecn=1 behaviour for that route while the rest of the stack acts
according to the global settings.
One can use 'ip route change dev $dev $net features ecn' to toggle this.
Joint work with Daniel Borkmann.
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
net/ipv4/tcp_input.c | 25 +++++++++++++++----------
net/ipv4/tcp_output.c | 12 ++++++++++--
2 files changed, 25 insertions(+), 12 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a12b455..72d7510 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5875,20 +5875,22 @@ static inline void pr_drop_req(struct request_sock *req, __u16 port, int family)
*/
static void tcp_ecn_create_request(struct request_sock *req,
const struct sk_buff *skb,
- const struct sock *listen_sk)
+ const struct sock *listen_sk,
+ struct dst_entry *dst)
{
const struct tcphdr *th = tcp_hdr(skb);
const struct net *net = sock_net(listen_sk);
bool th_ecn = th->ece && th->cwr;
- bool ect, need_ecn;
+ bool ect, need_ecn, ecn_ok;
if (!th_ecn)
return;
ect = !INET_ECN_is_not_ect(TCP_SKB_CB(skb)->ip_dsfield);
need_ecn = tcp_ca_needs_ecn(listen_sk);
+ ecn_ok = net->ipv4.sysctl_tcp_ecn || dst_feature(dst, RTAX_FEATURE_ECN);
- if (!ect && !need_ecn && net->ipv4.sysctl_tcp_ecn)
+ if (!ect && !need_ecn && ecn_ok)
inet_rsk(req)->ecn_ok = 1;
else if (ect && need_ecn)
inet_rsk(req)->ecn_ok = 1;
@@ -5953,13 +5955,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
if (security_inet_conn_request(sk, skb, req))
goto drop_and_free;
- if (!want_cookie || tmp_opt.tstamp_ok)
- tcp_ecn_create_request(req, skb, sk);
-
- if (want_cookie) {
- isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
- req->cookie_ts = tmp_opt.tstamp_ok;
- } else if (!isn) {
+ if (!want_cookie && !isn) {
/* VJ's idea. We save last timestamp seen
* from the destination in peer table, when entering
* state TIME-WAIT, and check against it before
@@ -6007,6 +6003,15 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
goto drop_and_free;
}
+ tcp_ecn_create_request(req, skb, sk, dst);
+
+ if (want_cookie) {
+ isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
+ req->cookie_ts = tmp_opt.tstamp_ok;
+ if (!tmp_opt.tstamp_ok)
+ inet_rsk(req)->ecn_ok = 0;
+ }
+
tcp_rsk(req)->snt_isn = isn;
tcp_openreq_init_rwin(req, sk, dst);
fastopen = !want_cookie &&
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 3af2129..b1c6296 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -333,10 +333,18 @@ static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb)
static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
+ bool use_ecn = sock_net(sk)->ipv4.sysctl_tcp_ecn == 1 ||
+ tcp_ca_needs_ecn(sk);
+
+ if (!use_ecn) {
+ const struct dst_entry *dst = __sk_dst_get(sk);
+ if (dst && dst_feature(dst, RTAX_FEATURE_ECN))
+ use_ecn = true;
+ }
tp->ecn_flags = 0;
- if (sock_net(sk)->ipv4.sysctl_tcp_ecn == 1 ||
- tcp_ca_needs_ecn(sk)) {
+
+ if (use_ecn) {
TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
tp->ecn_flags = TCP_ECN_OK;
if (tcp_ca_needs_ecn(sk))
--
2.0.4
^ permalink raw reply related
* RE: Problem with 10Gbit Broadcom NetXtreme II 5771x/578xx 10/20-Gigabit Ethernet Driver
From: Yuval Mintz @ 2014-10-26 9:07 UTC (permalink / raw)
To: Stefan Bottelier | Ocius.nl; +Cc: netdev
In-Reply-To: <544A110C.4030402@ocius.nl>
> We are using Dell Blade Centers, but we get a error on the 10Gbit
> Broadcom adapter bnx2x
> bnx2x 0000:01:00.0: part number 394D4342-31383735-30315430-473030
> WARNING: at drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c:9410
Hi Stefan,
Which distro & linux kernel are you using?
[Obviously not net-next]
Thanks,
Yuval
________________________________
This message and any attached documents contain information from the sending company or its parent company(s), subsidiaries, divisions or branch offices that may be confidential. If you are not the intended recipient, you may not read, copy, distribute, or use this information. If you have received this transmission in error, please notify the sender immediately by reply e-mail and then delete this message.
^ permalink raw reply
* [PATCH iproute2] doc make: Add *.pdf files to the 'clean' target
From: Vadim Kochan @ 2014-10-26 9:52 UTC (permalink / raw)
To: netdev; +Cc: Vadim Kochan
Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
---
doc/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/doc/Makefile b/doc/Makefile
index b92957e..e9c0ff7 100644
--- a/doc/Makefile
+++ b/doc/Makefile
@@ -70,4 +70,4 @@ install:
install -m 0644 $(shell echo *.sgml) $(DESTDIR)$(DOCDIR)
clean:
- rm -f *.aux *.log *.toc $(PSFILES) $(DVIFILES) *.html
+ rm -f *.aux *.log *.toc $(PSFILES) $(DVIFILES) *.html *.pdf
--
2.1.0
^ permalink raw reply related
* [PATCH iproute2] gitignore: Ignore 'doc' files generated at runtime
From: Vadim Kochan @ 2014-10-26 10:18 UTC (permalink / raw)
To: netdev; +Cc: Vadim Kochan
The list is based on doc/Makefile 'clean' target
Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
---
.gitignore | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/.gitignore b/.gitignore
index 5d1e189..98d83c5 100644
--- a/.gitignore
+++ b/.gitignore
@@ -35,3 +35,12 @@ series
# tests
testsuite/results
testsuite/iproute2/iproute2-this
+
+# doc files generated at runtime
+doc/*.aux
+doc/*.log
+doc/*.toc
+doc/*.ps
+doc/*.dvi
+doc/*.html
+doc/*.pdf
--
2.1.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox