* Re: [0/8] netpoll/bridge fixes
From: Eric Dumazet @ 2010-06-16 6:16 UTC (permalink / raw)
To: Herbert Xu
Cc: David Miller, Paul E. McKenney, shemminger, mst, frzhang, netdev,
amwang, mpm
In-Reply-To: <20100616033336.GA17440@gondor.apana.org.au>
Le mercredi 16 juin 2010 à 13:33 +1000, Herbert Xu a écrit :
> On Wed, Jun 16, 2010 at 05:03:20AM +0200, Eric Dumazet wrote:
> >
> > I wonder how these patches were tested, Herbert ?
>
> You know, not everyone enables RCU debugging...
>
Hmm, this was not an attack Herbert, just a suggestion, I now turn on
RCU lockdep debugging when doing RCU changes, it can helps ;)
> Anyway, this patch should fix the problems you've spotted.
>
Thanks
^ permalink raw reply
* Re: [PATCH 05/12] phylib: Allow reading and writing a mii bus from atomic context.
From: Richard Cochran @ 2010-06-16 6:20 UTC (permalink / raw)
To: Grant Likely
Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ, Thomas Gleixner,
linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ,
linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
Krzysztof Halasa
In-Reply-To: <AANLkTin8RI4UKzA58k_qER_urdwnD037u5GZrVQS8IoS-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
> That's right, and I fully agree with that change. To me, going back
> to allowing spin locks is a regression because it adds a new source of
> scheduling latency.
I think that the change was not about reducing scheduling
latency. Rather, the idea was simply to allow mdio bus drivers that
sleep. Here is the change log message:
commit 35b5f6b1a82b5c586e0b24c711dc6ba944e88ef1
PHYLIB: Locking fixes for PHY I/O potentially sleeping
PHY read/write functions can potentially sleep (e.g., a PHY accessed
via I2C). The following changes were made to account for this:
* Change spin locks to mutex locks
* Add a BUG_ON() to phy_read() phy_write() to warn against
calling them from an interrupt context.
* Use work queue for PHY state machine handling since
it can potentially sleep
* Change phydev lock from spinlock to mutex
The fundamental issue is this: Fro the SO_TIMESTAMPING API, receive
timestamps must appear in a control message along with the packet
data. Only the MAC driver (or the PHY driver) knows how to get the
timestamp. The stack calls the MAC driver via its napi poll
function. During the call, the driver must provide the skb with Rx
timestamp.
The only reasonable way to do this is to have the driver fetch the
timestamp durng the napi poll function. For MAC drivers with fast
register access, the performance penalty is small. For PHY drivers
with must go via the MDIO bus, the performance penalty is obviously
larger, and the user must be willing to live with it.
You might suggest the alternate that the driver would defer the
netif_receive_skb() callback until a work queue completes, providing
the Rx timestamp. The driver would then call netif_receive_skb() at
some later time.
However, there are a number of problems with this idea:
1. It is really icky for the drivers to be creating new skb queues for
this purpose. MAC drivers would have to maintain such queues on
behalf of the PHY drivers, but only when the PHYs support
timestamping. Yuck.
2. There is a (soft) real time constraint on the delivery of the PTP
packets to the user space application. Basicly, delays and jitter
in the time to receive the packet negatively affect the clock servo
loop.
3. It cannot work for many kinds of PTP timestamping hardware. Some of
hardware only timestamps PTP packets. That means that not every
received packet will have a timestamp. Such hardware provides some
key data from the packet (like PTP UUID and sequence number) with
the timestamp. Software must match this information to a particular
packet. In order to defer a skb, the driver must first obtain the
timestamp information. This is a catch-22.
Having said all that, I am still open to suggestions...
Richard
^ permalink raw reply
* Re: [0/8] netpoll/bridge fixes
From: Eric Dumazet @ 2010-06-16 6:21 UTC (permalink / raw)
To: paulmck
Cc: David Miller, herbert, shemminger, mst, frzhang, netdev, amwang,
mpm
In-Reply-To: <20100616050808.GD2911@linux.vnet.ibm.com>
Le mardi 15 juin 2010 à 22:08 -0700, Paul E. McKenney a écrit :
> On Wed, Jun 16, 2010 at 04:58:59AM +0200, Eric Dumazet wrote:
> >
> > Paul, could you please explain if current lockdep rules are correct, or could be relaxed ?
> >
> > I thought :
> >
> > rcu_read_lock_bh();
> >
> > was a shorthand to
> >
> > local_disable_bh();
> > rcu_read_lock();
>
> In CONFIG_TREE_RCU and CONFIG_TINY_RCU, rcu_read_lock_bh() is actually
> shorthand for only local_disable_bh(). Therefore, rcu_dereference()
> will scream if only rcu_read_lock_bh() is held.
>
> However, in CONFIG_PREEMPT_TREE_RCU, rcu_read_lock_bh() is its own
> mechanism that does local_disable_bh() but has its own set of grace
> periods, independent of those of rcu_read_lock().
>
> > Why lockdep is not able to make a correct diagnostic ?
>
> Here is the situation I am concerned about:
>
> o Task 0 does rcu_read_lock(), then p=rcu_dereference_bh().
> If we make the change you are asking for, rcu_dereference_bh()
> is OK with this.
>
> o Task 0 now is preempted before finishing its RCU read-side
> critical section.
>
> o Task 1 removes the data element referenced by pointer p,
> then invokes synchronize_rcu_bh().
>
> o Task 0 does not block synchronize_rcu_bh(), so the grace
> period completes.
>
> o Task 1 frees up the data element referenced by pointer p,
> which might be reallocated as some other type, unmapped,
> or whatever else.
>
> o Task 0 resumes, and is sadly disappointed when the data
> element referenced by pointer p has been swept out from
> under it.
>
> Or am I missing something here?
>
Nice thing with RCU is that I learn new things every day ;)
Thanks Paul, I'll try to remember all the details ! ;)
^ permalink raw reply
* Re: [PATCH 04/12] phylib: add a way to make PHY time stamps possible.
From: Richard Cochran @ 2010-06-16 6:29 UTC (permalink / raw)
To: netdev-u79uwXL29TY76Z2rM5mHXA
Cc: devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ,
linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ,
linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
Krzysztof Halasa
In-Reply-To: <27c0ad283f025c2bb71e7ceb71be07f969939429.1276615626.git.richard.cochran-3mrvs1K0uXizZXS1Dc/lvw@public.gmane.org>
On Tue, Jun 15, 2010 at 06:08:20PM +0200, Richard Cochran wrote:
> +static inline void skb_tx_timetamp(struct phy_device *phy, struct sk_buff *skb)
> +{
> + union skb_shared_tx *shtx = skb_tx(skb);
> +
> + if (shtx->hardware && phy && phy->drv->txtstamp)
> + phy->drv->txtstamp(phy, skb);
> +
> + if (shtx->software && !shtx->in_progress)
> + skb_tstamp_tx(skb, NULL);
> +}
I forgot to mention this patch also provides a way to fix the broken
software timestamp fallback mode of the SO_TIMESTAMPING API.
We would have to add this inline call to every MAC driver in an
appropriate spot within the hard_xmit function. It is not too pretty,
but providing this as a compile time option will promote
standardization of the SO_TIMESTAMPING API for applications.
Richard
^ permalink raw reply
* Re: linux-next: build warning after merge of the final tree (net tree related)
From: Stephen Rothwell @ 2010-06-16 6:30 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev, linux-next, linux-kernel
In-Reply-To: <20100615.214406.45904170.davem@davemloft.net>
[-- Attachment #1: Type: text/plain, Size: 561 bytes --]
On Tue, 15 Jun 2010 21:44:06 -0700 (PDT) David Miller <davem@davemloft.net> wrote:
>
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Wed, 16 Jun 2010 13:56:30 +1000
>
> > bridge: Add const to dummy br_netpoll_send_skb
> >
> > The version of br_netpoll_send_skb used when netpoll is off is
> > missing a const thus causing a warning.
> >
> > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
>
> Applied.
Thanks guys.
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]
^ permalink raw reply
* Re: linux-next: build warnings after merge of the net tree
From: Stephen Rothwell @ 2010-06-16 6:30 UTC (permalink / raw)
To: David Miller; +Cc: netdev, linux-next, linux-kernel, bhutchings
In-Reply-To: <20100615.215212.214211994.davem@davemloft.net>
[-- Attachment #1: Type: text/plain, Size: 470 bytes --]
Hi Dave,
On Tue, 15 Jun 2010 21:52:12 -0700 (PDT) David Miller <davem@davemloft.net> wrote:
>
> I've commited the patch below to deal with this, thanks for the report.
Thanks.
> There's some pre-existing warnings someone will need to deal with at
> some point:
>
> drivers/usb/gadget/rndis.c: whole file: warning: coding style is bolixed
:-)
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]
^ permalink raw reply
* Re: Proposed linux kernel changes : scaling tcp/ip stack : 3rd part
From: Eric Dumazet @ 2010-06-16 6:37 UTC (permalink / raw)
To: Mitchell Erblich; +Cc: netdev
In-Reply-To: <F5330DD9-0486-4831-B2CD-83B3DF111456@earthlink.net>
Le mardi 15 juin 2010 à 23:09 -0700, Mitchell Erblich a écrit :
> On Jun 15, 2010, at 8:30 PM, Mitchell Erblich wrote:
>
> >
> > On Jun 15, 2010, at 8:11 PM, Mitchell Erblich wrote:
> >
> >>
> >> On Jun 3, 2010, at 2:14 AM, Eric Dumazet wrote:
> >>
> >>> Le jeudi 03 juin 2010 à 01:16 -0700, Mitchell Erblich a écrit :
> >>>> To whom it may concern,
> >>>>
> >>>> First, my assumption is to keep this discussion local to just a few tcp/ip
> >>>> developers to see if there is any consensus that the below is a logical
> >>>> approach. Please also pass this email if there is a "owner(s)" of this stack
> >>>> to identify if a case exists for the below possible changes.
> >>>>
> >>>> I am not currently on the linux kernel mail group.
> >>>>
> >>>> I have experience with modifications of the Linux tcp/ip stack, and have
> >>>> merged the changes into the company's local tree and left the possible
> >>>> global integration to others.
> >>>>
> >>>> I have been approached by a number of companies about scaling the
> >>>> stack with the assumption of a number of cpu cores. At present, I find extra
> >>>> time on my hands and am considering looking into this area on my own.
> >>>>
> >>>> The first assumption is that if extra cores are available, that a single
> >>>> received homogeneous flow of a large number of packets/segments per
> >>>> second (pps) can be split into non-equal flows. This split can in effect
> >>>> allow a larger recv'd pps rate at the same core load while splitting off
> >>>> other workloads, such as xmit'ing pure ACKs.
> >>>>
> >>>> Simply, again assuming Amdahl's law (and not looking to equalize the load
> >>>> between cores), and creating logical separations where in a many core
> >>>> system, different cores could have new kernel threads that operate in
> >>>> parallel within the tcp/ip stack. The initial separation points would be at
> >>>> the ip/tcp layer boundry and where any recv'd sk/pkt would generate some
> >>>> form of output.
> >>>>
> >>>> The ip/tcp layer would be split like the vintage AT&T STREAMs protocol,
> >>>> with some form of queuing & scheduling, would be needed. In addition,
> >>>> the queuing/schedullng of other kernel threads would occur within ip & tcp
> >>>> to separate the I/O.
> >>>>
> >>>> A possible validation test is to identify the max recv'd pps rate within the
> >>>> tcp/ip modules within normal flow TCP established state with normal order
> >>>> of say 64byte non fragmented segments, before and after each
> >>>> incremental change. Or the same rate with fewer core/cpu cycles.
> >>>>
> >>>> I am willing to have a private git Linux.org tree that concentrates proposed
> >>>> changes into this tree and if there is willingness, a seen want/need then identify
> >>>> how to implement the merge.
> >>>
> >>> Hi Mitchell
> >>>
> >>> We work everyday to improve network stack, and standard linux tree is
> >>> pretty scalable, you dont need to setup a separate git tree for that.
> >>>
> >>> Our beloved maintainer David S. Miller handles two trees, net-2.6 and
> >>> net-next-2.6 where we put all our changes.
> >>>
> >>> http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git
> >>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
> >>>
> >>> I suggest you read the last patches (say .. about 10.000 of them), to
> >>> have an idea of things we did during last years.
> >>>
> >>> keywords : RCU, multiqueue, RPS, percpu data, lockless algos, cache line
> >>> placement...
> >>>
> >>> Its nice to see another man joining the team !
> >>>
> >>> Thanks
> >>>
> >>
> >>
> >> Lets start with a two part Linux kernel change and a tcp input/output change:
> >>
> >> 2 Parts: 2nd part TBD
> >>
> >> Summary: Don't use last free pages for TCP ACKs with GFP_ATOMIC for our
> >> sk buf allocs. 1 line change in tcp_output.c with a new gfp.h arg, and a change
> >> in the generic kernel. TBD.
> >>
> >> This change should have no effect with normal available kernel mem allocs.
> >>
> >> Assuming memory pressure ( WAITING for clean memory) we should be allocating
> >> our last pages for input skbufs and not for xmit allocs.
> >>
> >> By delaying skbuf allocations when we have low kmem, we secondarily slow down the
> >> tcp flow : if in slow start (SS) we are almost doing a DELACK, else CA should/could
> >> decrease the number of in-flight ACKs and the peer should do burst avoidance
> >> if our later ack increases the window in a larger chunk..
> >>
> >> And use the last pages to decrease the chance of dropping a input pkt or
> >> running out of recv descriptors, because of mem back pressure.
> >>
> >> The change could check for some form of mem pressure before the alloc,
> >> but the alloc in itself should suffice. We could also do a ECN type check before
> >> the alloc.
> >>
> >> Now the kicker. I want a GFP_KERNEL with NO_SLEEP OR a GFP_ATOMIC and
> >> NOT use emergency pools, thus CAN FAIL, to have 0 other secondary effects
> >> and change just the 1 arg.
> >>
> >> code : tcp_output.c : tcp_send_ack()
> >> line : buff = alloc_skb(MAX_TCP_HDR, GFP_KERNEL_NSLEEP); /* with a NO SLEEP */
> >>
> >> Suggestions, feedback??
> >>
> >> Mitchell Erblich
> >>
> >>
> >>
> >>
> >
> > Sorry :),
> >
> > 2nd part:
> >
> > use GFP_NOWAIT as 2nd arg to alloc_skb()
> >
> > Mitchell Erblich
>
> Going in the same direction,
>
>
> If tcp_out_of_resources() and the number of orphaned sockets is above
> a configured number (maybe because of DoS attack), SHOULD we consume
> our last available resources and most likely effect skbufs that we aren't
> reset-ing because NOW the recv sk allocs are failing.
>
> thus,
> file tcp_timer.c : tcp_out_of_resources()
> suggestion change 2nd arg GFP_ATOMIC: tcp_send_active_reset(sk, GFP_NOWAIT);
>
> Please note that even if we believed that the GFP_ATOMIC would have a higher
> probability to send a TCP pkt/seg, that gives us no guarantee that the peer
> will recv it or will process it.
>
> We COULD also do some form of ECN in this function to inform the peer that our
> system is in distress if tcp_send_active_reset() did not return void and informed
> us of a mem alloc failure with the GFP_NOWAIT.
>
> Since the ECN would benefit the our node/system, this ECN sending event COULD
> be argued to have a higher priority and mem argument then sent with a GFP_ATOMIC.
>
>
> Suggestions, opinions...
1) Acks are about the smallest chunks that are ever allocated in network
stack.
2) Their lifetime is close to 0 us. They are not cloned (queued on a
socket queue), only given to device xmit. Unless you play with trafic
shaping and insane queue lengths, acks should not use more than 0.0001 %
of your ram.
3) Under attack, adding complex algos to try to resist only delay a bit
the moment where nothing can be done to stop the attack. Being clever or
not. Dropping packets is very fine.
4) Maybe all the work you think about is the balance between ATOMIC and
non ATOMIC (GFP_KERNEL) memory allocations ? Some tuning maybe ?
input path always use ATOMIC ops, being run from sofirq, and cannot
wait.
^ permalink raw reply
* Re: [PATCH 10/12] ptp: Added a clock that uses the eTSEC found on the MPC85xx.
From: Richard Cochran @ 2010-06-16 6:45 UTC (permalink / raw)
To: Grant Likely
Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ,
linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ,
linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
Krzysztof Halasa
In-Reply-To: <AANLkTikFc15j-Qhw9M2CnKLKq58Wi1xD7E2dtVNsVgeA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Tue, Jun 15, 2010 at 11:20:41AM -0600, Grant Likely wrote:
>
> Is this header file used by anything other than gianfar_ptp.c? If
> not, then roll the two files together.
I anticipate that it might be necessary to share the header's contents
with gianfar.c one day.
> Use dash ('-') not underscore ('_') in property names.
Okay, can do.
> If you encode this value as a string, then it will be friendly for humans too.
Okay.
> I could use more explication here. Is this a divider value?
> Computers are good at making calculations, and the driver can obtain
> the clock frequency supplied to the device. It may be more useful to
> specify here the desired frequency rather than the divider. Certainly
> more human-friendly too.
It is not that simple. The basic algorithm is described in the text,
and anyone wanting to use the eTSEC for PTP will have to consider the
issue themselves, since it is a design issue with a few tradeoffs
related to the board layout. It is really too thorny to do in the
driver automatically.
I can post a tcltk calculator script for finding appropriate values,
in anyone would like to see it.
> > +static struct etsects the_clock;
>
> Will there ever be multiple instances of this device?
No, never. If you consider how PTP works, there can only be one clock
per system.
> Consider of_iomap(), it will simplify the code a bit.
> Move ptp_gianfar_exit() definition here so it is immediately before
> the module_exit() line.
Okay.
Thanks for the review,
Richard
^ permalink raw reply
* Re: [PATCH 11/12] ptp: Added a clock driver for the IXP46x.
From: Richard Cochran @ 2010-06-16 6:54 UTC (permalink / raw)
To: Grant Likely
Cc: netdev, devicetree-discuss, linuxppc-dev, linux-arm-kernel,
Krzysztof Halasa
In-Reply-To: <AANLkTinPDaWNvUG8q8RdvRUga-qrPlSV2H5TkzVVMaFn@mail.gmail.com>
On Tue, Jun 15, 2010 at 12:41:56PM -0600, Grant Likely wrote:
> Nitpick. We use all lower case names for structures in Linux.
Yes, I know, but in this case an exception makes sense.
I prefer to use the exact same register mnemonics as in the hardware
documentation, whenever possible. That way, anyone later working on
the driver with hardware manual in hand (and they should be doing that
way) will immediately see the connection.
> You want to get stuff as fast as possible, but there is a udelay()
> that just chews up CPU time. Would cpu_relax() be sufficient with a
> time-based exit condition in the loop?
I am not sure. What does cpu_relax() do exactly, and when is it safe
to call?
Thanks,
Richard
^ permalink raw reply
* Re: Proposed linux kernel changes : scaling tcp/ip stack : 3rd part
From: Mitchell Erblich @ 2010-06-16 7:46 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1276670223.19249.77.camel@edumazet-laptop>
On Jun 15, 2010, at 11:37 PM, Eric Dumazet wrote:
> Le mardi 15 juin 2010 à 23:09 -0700, Mitchell Erblich a écrit :
>> On Jun 15, 2010, at 8:30 PM, Mitchell Erblich wrote:
>>
>>>
>>> On Jun 15, 2010, at 8:11 PM, Mitchell Erblich wrote:
>>>
>>>>
>>>> On Jun 3, 2010, at 2:14 AM, Eric Dumazet wrote:
>>>>
>>>>> Le jeudi 03 juin 2010 à 01:16 -0700, Mitchell Erblich a écrit :
>>>>>> To whom it may concern,
>>>>>>
>>>>>> First, my assumption is to keep this discussion local to just a few tcp/ip
>>>>>> developers to see if there is any consensus that the below is a logical
>>>>>> approach. Please also pass this email if there is a "owner(s)" of this stack
>>>>>> to identify if a case exists for the below possible changes.
>>>>>>
>>>>>> I am not currently on the linux kernel mail group.
>>>>>>
>>>>>> I have experience with modifications of the Linux tcp/ip stack, and have
>>>>>> merged the changes into the company's local tree and left the possible
>>>>>> global integration to others.
>>>>>>
>>>>>> I have been approached by a number of companies about scaling the
>>>>>> stack with the assumption of a number of cpu cores. At present, I find extra
>>>>>> time on my hands and am considering looking into this area on my own.
>>>>>>
>>>>>> The first assumption is that if extra cores are available, that a single
>>>>>> received homogeneous flow of a large number of packets/segments per
>>>>>> second (pps) can be split into non-equal flows. This split can in effect
>>>>>> allow a larger recv'd pps rate at the same core load while splitting off
>>>>>> other workloads, such as xmit'ing pure ACKs.
>>>>>>
>>>>>> Simply, again assuming Amdahl's law (and not looking to equalize the load
>>>>>> between cores), and creating logical separations where in a many core
>>>>>> system, different cores could have new kernel threads that operate in
>>>>>> parallel within the tcp/ip stack. The initial separation points would be at
>>>>>> the ip/tcp layer boundry and where any recv'd sk/pkt would generate some
>>>>>> form of output.
>>>>>>
>>>>>> The ip/tcp layer would be split like the vintage AT&T STREAMs protocol,
>>>>>> with some form of queuing & scheduling, would be needed. In addition,
>>>>>> the queuing/schedullng of other kernel threads would occur within ip & tcp
>>>>>> to separate the I/O.
>>>>>>
>>>>>> A possible validation test is to identify the max recv'd pps rate within the
>>>>>> tcp/ip modules within normal flow TCP established state with normal order
>>>>>> of say 64byte non fragmented segments, before and after each
>>>>>> incremental change. Or the same rate with fewer core/cpu cycles.
>>>>>>
>>>>>> I am willing to have a private git Linux.org tree that concentrates proposed
>>>>>> changes into this tree and if there is willingness, a seen want/need then identify
>>>>>> how to implement the merge.
>>>>>
>>>>> Hi Mitchell
>>>>>
>>>>> We work everyday to improve network stack, and standard linux tree is
>>>>> pretty scalable, you dont need to setup a separate git tree for that.
>>>>>
>>>>> Our beloved maintainer David S. Miller handles two trees, net-2.6 and
>>>>> net-next-2.6 where we put all our changes.
>>>>>
>>>>> http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git
>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
>>>>>
>>>>> I suggest you read the last patches (say .. about 10.000 of them), to
>>>>> have an idea of things we did during last years.
>>>>>
>>>>> keywords : RCU, multiqueue, RPS, percpu data, lockless algos, cache line
>>>>> placement...
>>>>>
>>>>> Its nice to see another man joining the team !
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>>
>>>> Lets start with a two part Linux kernel change and a tcp input/output change:
>>>>
>>>> 2 Parts: 2nd part TBD
>>>>
>>>> Summary: Don't use last free pages for TCP ACKs with GFP_ATOMIC for our
>>>> sk buf allocs. 1 line change in tcp_output.c with a new gfp.h arg, and a change
>>>> in the generic kernel. TBD.
>>>>
>>>> This change should have no effect with normal available kernel mem allocs.
>>>>
>>>> Assuming memory pressure ( WAITING for clean memory) we should be allocating
>>>> our last pages for input skbufs and not for xmit allocs.
>>>>
>>>> By delaying skbuf allocations when we have low kmem, we secondarily slow down the
>>>> tcp flow : if in slow start (SS) we are almost doing a DELACK, else CA should/could
>>>> decrease the number of in-flight ACKs and the peer should do burst avoidance
>>>> if our later ack increases the window in a larger chunk..
>>>>
>>>> And use the last pages to decrease the chance of dropping a input pkt or
>>>> running out of recv descriptors, because of mem back pressure.
>>>>
>>>> The change could check for some form of mem pressure before the alloc,
>>>> but the alloc in itself should suffice. We could also do a ECN type check before
>>>> the alloc.
>>>>
>>>> Now the kicker. I want a GFP_KERNEL with NO_SLEEP OR a GFP_ATOMIC and
>>>> NOT use emergency pools, thus CAN FAIL, to have 0 other secondary effects
>>>> and change just the 1 arg.
>>>>
>>>> code : tcp_output.c : tcp_send_ack()
>>>> line : buff = alloc_skb(MAX_TCP_HDR, GFP_KERNEL_NSLEEP); /* with a NO SLEEP */
>>>>
>>>> Suggestions, feedback??
>>>>
>>>> Mitchell Erblich
>>>>
>>>>
>>>>
>>>>
>>>
>>> Sorry :),
>>>
>>> 2nd part:
>>>
>>> use GFP_NOWAIT as 2nd arg to alloc_skb()
>>>
>>> Mitchell Erblich
>>
>> Going in the same direction,
>>
>>
>> If tcp_out_of_resources() and the number of orphaned sockets is above
>> a configured number (maybe because of DoS attack), SHOULD we consume
>> our last available resources and most likely effect skbufs that we aren't
>> reset-ing because NOW the recv sk allocs are failing.
>>
>> thus,
>> file tcp_timer.c : tcp_out_of_resources()
>> suggestion change 2nd arg GFP_ATOMIC: tcp_send_active_reset(sk, GFP_NOWAIT);
>>
>> Please note that even if we believed that the GFP_ATOMIC would have a higher
>> probability to send a TCP pkt/seg, that gives us no guarantee that the peer
>> will recv it or will process it.
>>
>> We COULD also do some form of ECN in this function to inform the peer that our
>> system is in distress if tcp_send_active_reset() did not return void and informed
>> us of a mem alloc failure with the GFP_NOWAIT.
>>
>> Since the ECN would benefit the our node/system, this ECN sending event COULD
>> be argued to have a higher priority and mem argument then sent with a GFP_ATOMIC.
>>
>>
>> Suggestions, opinions...
>
>
> 1) Acks are about the smallest chunks that are ever allocated in network
> stack.
>
> 2) Their lifetime is close to 0 us. They are not cloned (queued on a
> socket queue), only given to device xmit. Unless you play with trafic
> shaping and insane queue lengths, acks should not use more than 0.0001 %
> of your ram.
>
> 3) Under attack, adding complex algos to try to resist only delay a bit
> the moment where nothing can be done to stop the attack. Being clever or
> not. Dropping packets is very fine.
>
> 4) Maybe all the work you think about is the balance between ATOMIC and
> non ATOMIC (GFP_KERNEL) memory allocations ? Some tuning maybe ?
> input path always use ATOMIC ops, being run from sofirq, and cannot
> wait.
>
I am not suggesting a pure GFP_KERNEL /sleep allocs on either side.
I suggested it with a NO-SLEEP and then looked and saw GFP_NOWAIT.
However, when under mem pressure, it might make sense to delay/slow
the opening of the snd and recv TCP windows, to allow mem pages
to be cleaned and re-used to relieve the mem pressure.
If queues are already existing, then they are the first delay-points
to be reviewed / used, however only under abnormal circumstances.
Maybe even short-lived flows have expired and those resources can now
be used.
>
> --
Eric & group,
I am starting simple with a different look at what COULD/SHOULD IT be done?
The question that I am asking is if GFP_NOWAIT WOULD fail AND GFP_ATOMIC
would succeed, then only the last percentage of kernel memory allocations are
being done. Do you want to use this last few percentage of memory with ACKs/xmits?
Maybe if we failed a few non-necessary / delayable items, then enough time
may occur to clean pages, and not execute any OOM like code.
The effect of generating a short DELACK SHOULD reduce memory pressure from
the peer in 1 RTT. Also, failing and executing the non-buff code, MAY ALSO slow-down
the ramp-up based on the TCP ACK clock.
The number of ACTIVE (recv at the NIC to xmit at the NIC) tcp flows may account
for a non-neglible number of later allocs and decrease mem pressure.
If I understand the diff between GFP_NOWAIT and GFP_ATOMIC, both
don't sleep, and only GFP_NOWAIT doesn't grab for the last available pages. So,
they are both atomic/no-sleeps.
To conclude with part 4: tcp_timer.c has 2 additional calls to tcp_send_active_reset()
use GFP_NOWAIT instead of GFP_ATOMIC.
Now, with your statement about "input path always uses atomic". SHOULD IT?
SAY under a DoS attack? Why not reject some if no-memory via GFP_NOWAIT.
If a new flow is to be started and we are under memory pressure, should it not ALSO
use GFP_NOWAIT? If we ALSO do GFP_NOWAIT on ESTABLISHED flows and (shoot me)
drop a seg/pkt, that should drop them to 1/2 bandwidth. This is in effect TCP fair-ness.
Thus, delays in ACKs are much more preferred.
Again, we will only minimally effect flows with the NEW suggested changes if their is
mem pressure and going to do more aggressive things like reset sockets/flows.
Since, my suggested changes are NOT in the form of a patch, then someone ELSE
needs to agree with me and then the maintainer must see that it does no harm and
the changes slowly moves the code in the right direction.
Mitchell Erblich
==================
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH net-next-2.6] syncookies: check decoded options against sysctl settings
From: Florian Westphal @ 2010-06-16 8:03 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20100615.180947.241927235.davem@davemloft.net>
David Miller <davem@davemloft.net> wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Sun, 13 Jun 2010 23:34:35 +0200
>
> > - if (tcp_opt->sack_ok)
> > - tcp_sack_reset(tcp_opt);
> > + if (tcp_opt->sack_ok && !sysctl_tcp_sack)
> > + return false;
> >
>
> If you remove the tcp_sack_reset() call here, who is going to
> do it?
Right, I should have mentioned that in the changelog, sorry about that.
Bottom line is that I failed to find out why its needed.
Both call sites of this function (cookie_v4_check, cookie_v6_check)
allocate the "struct tcp_options_received" argument on the stack, zero it,
hand it to tcp_parse_options() and then call cookie_check_timestamp().
I did not find any place in tcp_parse_options that would cause
tcp_opt->num_sacks/dsack to become nonzero.
Even if it can turn nonzero, I do not see any ill effects that might
happen then. The structure is on the stack and after tcp_parse_options()
returns only a few selected members are copied to the inet_request_sock.
^ permalink raw reply
* Re: [PATCH] vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)
From: Pedro Garcia @ 2010-06-16 8:49 UTC (permalink / raw)
To: netdev; +Cc: Patrick McHardy, Ben Hutchings, Eric Dumazet
In-Reply-To: <1276542772.2444.13.camel@edumazet-laptop>
On Mon, 14 Jun 2010 21:12:52 +0200, Eric Dumazet <eric.dumazet@gmail.com>
wrote:
> Le lundi 14 juin 2010 à 19:11 +0200, Patrick McHardy a écrit :
>> Ben Hutchings wrote:
>> > On Mon, 2010-06-14 at 18:49 +0200, Pedro Garcia wrote:
>> >
>> >> On Sun, 13 Jun 2010 22:56:30 +0100, Ben Hutchings
>> >> <bhutchings@solarflare.com> wrote:
>> >>
>> >>> I have no particular opinion on this change, but you need to read and
>> >>> follow Documentation/SubmittingPatches.
>> >>>
>> >>> Ben.
>> >>>
>> >> Sorry, first kernel patch, and I did not know about it. I resubmit
>> >> with
>> >> the correct style / format:
>> >>
>> > [...]
>> >
>> > Sorry, no you haven't.
>> >
>> > - Networking changes go through David Miller's net-next-2.6 tree so you
>> > need to use that as the baseline, not 2.6.26
>> > - Patches should be applicable with -p1, not -p0 (so if you use diff,
>> > you should run it from one directory level up)
>> > - The patch was word-wrapped
>>
>> Additionally:
>>
>> - please use the proper comment style, meaning each line begins
>> with a '*'
>>
>> - the pr_debug statements may be useful for debugging, but are
>> a bit excessive for the final version
>>
>> - + /* 2010-06-13: Pedro Garcia
>>
>> We have changelogs for this, simply explaining what the code
>> does is enough.
>>
>> - Please CC the maintainer (which is me)
>> --
>
> Pedro, we have two kind of vlan setups :
>
> accelerated and non accelerated ones.
>
> Your patch address non accelated ones only, since you only touch
> vlan_skb_recv()
>
> Accelerated vlan can follow these paths :
>
> 1) NAPI devices
>
> vlan_gro_receive() -> vlan_gro_common()
>
> 2) non NAPI devices
>
> __vlan_hwaccel_rx()
>
> So you might also patch __vlan_hwaccel_rx() and vlan_gro_common()
>
> Please merge following bits to your patch submission :
>
> http://kerneltrap.org/mailarchive/linux-netdev/2010/5/23/6277868
>
>
> Good luck for your first patch !
Here it is again. I added the modifications in http://kerneltrap.org/mailarchive/linux-netdev/2010/5/23/6277868 for HW accelerated incoming packets (it did not apply cleanly on the last version of
the kernel, so I applied manually). Now, if the VLAN 0 is not explicitly created by the user, VLAN 0 packets will be treated as no VLAN (802.1p packets), instead of dropping them.
The patch is now for two files: vlan_core (accel) and vlan_dev (non accel)
I can not test on HW accelerated devices, so if someone can check it I will appreciate (even though in the thread above it looked like yes). For non accel I tessted in 2.6.26. Now the patch is for
net-next-2.6, and it compiles OK, but I a have to setup a test environment to check it is still OK (should, but better to test).
Signed-off-by: Pedro Garcia <pedro.netdev@dondevamos.com>
--
diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index 50f58f5..daaca31 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -8,6 +8,9 @@
int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
u16 vlan_tci, int polling)
{
+ struct net_device *vlan_dev;
+ u16 vlan_id;
+
if (netpoll_rx(skb))
return NET_RX_DROP;
@@ -16,10 +19,14 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
skb->skb_iif = skb->dev->ifindex;
__vlan_hwaccel_put_tag(skb, vlan_tci);
- skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
+ vlan_id = vlan_tci & VLAN_VID_MASK;
+ vlan_dev = vlan_group_get_device(grp, vlan_id);
- if (!skb->dev)
- goto drop;
+ if (vlan_dev)
+ skb->dev = vlan_dev;
+ else
+ if (vlan_id)
+ goto drop;
return (polling ? netif_receive_skb(skb) : netif_rx(skb));
@@ -82,16 +89,22 @@ vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp,
unsigned int vlan_tci, struct sk_buff *skb)
{
struct sk_buff *p;
+ struct net_device *vlan_dev;
+ u16 vlan_id;
if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
skb->deliver_no_wcard = 1;
skb->skb_iif = skb->dev->ifindex;
__vlan_hwaccel_put_tag(skb, vlan_tci);
- skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
-
- if (!skb->dev)
- goto drop;
+ vlan_id = vlan_tci & VLAN_VID_MASK;
+ vlan_dev = vlan_group_get_device(grp, vlan_id);
+
+ if (vlan_dev)
+ skb->dev = vlan_dev;
+ else
+ if (vlan_id)
+ goto drop;
for (p = napi->gro_list; p; p = p->next) {
NAPI_GRO_CB(p)->same_flow =
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 5298426..65512c3 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -142,6 +142,7 @@ int vlan_skb_recv(struct sk_buff *skb, struct net_device *dev,
{
struct vlan_hdr *vhdr;
struct vlan_rx_stats *rx_stats;
+ struct net_device *vlan_dev;
u16 vlan_id;
u16 vlan_tci;
@@ -157,8 +158,18 @@ int vlan_skb_recv(struct sk_buff *skb, struct net_device *dev,
vlan_id = vlan_tci & VLAN_VID_MASK;
rcu_read_lock();
- skb->dev = __find_vlan_dev(dev, vlan_id);
- if (!skb->dev) {
+ vlan_dev = __find_vlan_dev(dev, vlan_id);
+
+ /* If the VLAN device is defined, we use it.
+ * If not, and the VID is 0, it is a 802.1p packet (not
+ * really a VLAN), so we will just netif_rx it later to the
+ * original interface, but with the skb->proto set to the
+ * wrapped proto: we do nothing here.
+ */
+
+ if (vlan_dev) {
+ skb->dev = vlan_dev;
+ } else if (vlan_id) {
pr_debug("%s: ERROR: No net_device for VID: %u on dev: %s\n",
__func__, vlan_id, dev->name);
goto err_unlock;
^ permalink raw reply related
* [PATCH] Clear IFF_XMIT_DST_RELEASE for teql interfaces
From: Tom Hughes @ 2010-06-16 8:24 UTC (permalink / raw)
To: netdev
Cc: akpm, Tom Hughes, Jamal Hadi Salim, David S. Miller,
Stephen Hemminger, Patrick McHardy, Tejun Heo, linux-kernel
The sch_teql module, which can be used to load balance over a set of
underlying interfaces, stopped working after 2.6.30 and has been
broken in all kernels since then for any underlying interface which
requires the addition of link level headers.
The problem is that the transmit routine relies on being able to
access the destination address in the skb in order to do address
resolution once it has decided which underlying interface it is going
to transmit through.
In 2.6.31 the IFF_XMIT_DST_RELEASE flag was introduced, and set by
default for all interfaces, which causes the destination address to be
released before the transmit routine for the interface is called.
The solution is to clear that flag for teql interfaces.
Signed-off-by: Tom Hughes <tom@compton.nu>
---
net/sched/sch_teql.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c
index 3415b6c..807643b 100644
--- a/net/sched/sch_teql.c
+++ b/net/sched/sch_teql.c
@@ -449,6 +449,7 @@ static __init void teql_master_setup(struct net_device *dev)
dev->tx_queue_len = 100;
dev->flags = IFF_NOARP;
dev->hard_header_len = LL_MAX_HEADER;
+ dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
}
static LIST_HEAD(master_dev_list);
--
1.7.0.1
^ permalink raw reply related
* Re: [PATCH net-next-2.6] inetpeer: do not use zero refcnt for freed entries
From: Eric Dumazet @ 2010-06-16 8:56 UTC (permalink / raw)
To: David Miller; +Cc: netdev, paulmck
In-Reply-To: <20100615.214754.42801686.davem@davemloft.net>
Le mardi 15 juin 2010 à 21:47 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 16 Jun 2010 04:45:24 +0200
>
> > [PATCH net-next-2.6] inetpeer: do not use zero refcnt for freed entries
> >
> > Followup of commit aa1039e73cc2 (inetpeer: RCU conversion)
> >
> > Unused inet_peer entries have a null refcnt.
> >
> > Using atomic_inc_not_zero() in rcu lookups is not going to work for
> > them, and slow path is taken.
> >
> > Fix this using -1 marker instead of 0 for deleted entries.
> >
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>
> Applied, thanks Eric.
Thanks
With 65537 peers and a DDOS frag attack, I now get following profiling
results :
-----------------------------------------------------------------------------------------
PerfTop: 1024 irqs/sec kernel:100.0% exact: 0.0% [1000Hz
cycles], (all, cpu: 0)
-----------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ _________________________
7722.00 65.6% inet_frag_find
1355.00 11.5% ip4_frag_match
494.00 4.2% __lock_acquire
260.00 2.2% inet_getpeer
243.00 2.1% ip_route_input_common
151.00 1.3% lock_release
142.00 1.2% mark_lock
126.00 1.1% lock_acquire
104.00 0.9% __kmalloc
86.00 0.7% skb_put
Just to show what could be the next steps ;)
^ permalink raw reply
* Re: [PATCH] vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)
From: Eric Dumazet @ 2010-06-16 9:08 UTC (permalink / raw)
To: Pedro Garcia; +Cc: netdev, Patrick McHardy, Ben Hutchings
In-Reply-To: <311b59aee7d648c6124a84b5ca06ac60@dondevamos.com>
Le mercredi 16 juin 2010 à 10:49 +0200, Pedro Garcia a écrit :
> On Mon, 14 Jun 2010 21:12:52 +0200, Eric Dumazet <eric.dumazet@gmail.com>
> > Good luck for your first patch !
>
> Here it is again. I added the modifications in http://kerneltrap.org/mailarchive/linux-netdev/2010/5/23/6277868 for HW accelerated incoming packets (it did not apply cleanly on the last version of
> the kernel, so I applied manually). Now, if the VLAN 0 is not explicitly created by the user, VLAN 0 packets will be treated as no VLAN (802.1p packets), instead of dropping them.
>
> The patch is now for two files: vlan_core (accel) and vlan_dev (non accel)
>
> I can not test on HW accelerated devices, so if someone can check it I will appreciate (even though in the thread above it looked like yes). For non accel I tessted in 2.6.26. Now the patch is for
> net-next-2.6, and it compiles OK, but I a have to setup a test environment to check it is still OK (should, but better to test).
>
> Signed-off-by: Pedro Garcia <pedro.netdev@dondevamos.com>
OK, the patch itself is correct.
Now, could you please send it again with a proper changelog ?
In this changelog, please explain why patch is needed, and
keep lines short (< 72 chars), like the one you did in your first mail.
I'll then add my Signed-off-by, since I wrote the accelerated part ;)
Note : I wonder if another patch is needed, in case 8021q module is
_not_ loaded. We probably should accept vlan 0 frames in this case ?
^ permalink raw reply
* Re: Proposed linux kernel changes : scaling tcp/ip stack
From: Andi Kleen @ 2010-06-16 9:10 UTC (permalink / raw)
To: Mitchell Erblich; +Cc: Eric Dumazet, netdev
In-Reply-To: <97746864-ED54-4A12-AFE7-752AA6E41CDD@earthlink.net>
Mitchell Erblich <erblichs@earthlink.net> writes:
>
> Summary: Don't use last free pages for TCP ACKs with GFP_ATOMIC for our
> sk buf allocs. 1 line change in tcp_output.c with a new gfp.h arg, and a change
> in the generic kernel. TBD.
>
> This change should have no effect with normal available kernel mem allocs.
>
> Assuming memory pressure ( WAITING for clean memory) we should be allocating
> our last pages for input skbufs and not for xmit allocs.
How about you instrument a kernel and measure if this really happens
frequently under reasonable loads? That is you can probably
use the existing dropped page counters in netstat
Stephen added some time ago.
Since soft irqs cannot really wait exhausted GFP_ATOMIC would normally
lead to dropped packets. FWIW I am not aware of any serious dropped
packets problem on normal loads.
Running a kernel with nearly zero free memory is dangerous anyways
-- pretty much any kernel service can fail arbitarily --
if this happened frequently I suspect we would need generic
VM solution for it.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply
* Re: mpd client timeouts (bisected) 2.6.35-rc3
From: Christian Kujau @ 2010-06-16 10:01 UTC (permalink / raw)
To: markus@trippelsdorf.de
Cc: John Fastabend, David Miller, linux-kernel@vger.kernel.org,
netdev@vger.kernel.org, yanmin_zhang, alex.shi, tim.c.chen
In-Reply-To: <20100613205922.GA1806@arch.tripp.de>
On Sun, 13 Jun 2010 at 22:59, markus@trippelsdorf.de wrote:
> This solves the problem here. Thanks.
Not sure if this is related, but I've noticed connection timeouts and
connections going in FIN_WAIT2 state (most of them SSH tunnels) with
2.6.35. Going back to 2.6.34 or applying John's (and Eric's) patch
does seem to fix this issue.
Thanks,
Christian.
--
BOFH excuse #168:
le0: no carrier: transceiver cable problem?
^ permalink raw reply
* Re: [PATCH 12/12] ptp: Added a clock driver for the National Semiconductor PHYTER.
From: Richard Cochran @ 2010-06-16 10:05 UTC (permalink / raw)
To: Grant Likely
Cc: netdev, devicetree-discuss, linuxppc-dev, linux-arm-kernel,
Krzysztof Halasa
In-Reply-To: <AANLkTilqd-wxqMWDxssC-XMDJqZ8iIiSrvp2t8xARAQV@mail.gmail.com>
On Tue, Jun 15, 2010 at 12:49:13PM -0600, Grant Likely wrote:
> Won't this break things for existing DP83640 users?
Nope, the driver was only added five patches ago, and it only offers
the timestamping stuff. The standard PHY functions just call the
generic functions, so the PHY works fine even without this driver.
> > +static struct ptp_clock *dp83640_clock;
> > +DEFINE_SPINLOCK(clock_lock); /* protects the one and only dp83640_clock */
>
> Why only one? Is it not possible to have 2 of these PHYs in a system?
Yes, you can have multiple PHYs, but only one PTP clock.
If you do use multiple PHYs, then you must wire their clocks together
and adjust the PTP clock on only one of the PHYs.
Thanks for your other comments,
Richard
^ permalink raw reply
* Re: BUG: unable to handle kernel paging request at 000041ed00000001
From: Arturas @ 2010-06-16 10:25 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
In-Reply-To: <96E62960-9EF8-4F05-92DD-2D7477D0D78B@res.lt>
Your new patch doesn't show any warnings about tx queue
and everything is working as expected.
On Jun 14, 2010, at 12:27 PM, Arturas wrote:
>
> On Jun 14, 2010, at 11:31 AM, Eric Dumazet wrote:
>> But your problem is about bridge, not bonding (see trace).
> I want it for performance reason, not because of this bug.
> Bridge isn't a bottleneck for me, but bonding may be and not to me only,
> but for many people. I believe that performance gain would be more
> than 1% on cpu? :-)
>
>>
>> And 2.6.34 wont accept such changes, its already released.
> It can be as a separate patch or I can test 2.3.35 if it would accept
> such change. I just need a stable kernel with good performance :-)
>
>>
>>> I also have another issue with NMI. On older machine with 5500 xeons i
>>> have almost no overhead with nmi_watchdog enabled, but on this it is about twice.
>>> without nmi enabled cpu peak average is 30%, and with nmi enabled i have 53%.
>>> When traffic is not passing all cpus are idling at 100%.
>>> Maybe overhead could be a little bit smaller? :-)
>>>
>>
>> I am a bit lost here, NMI have litle to do with network stack ;)
> May this be related to very recent cpu? As i understand NMI depends on CPU.
>
>>
>>
>> Could you please test another patch ?
> Applied, it's working correctly for now. If i'll get a warning i'll write you or maybe I
> shouldn't get it if a patch is correct?
>
>>
>> Before calling sk_tx_queue_set(sk, queue_index); we should check if dst
>> dev is current device.
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* rt hash table / rt hash locks question
From: Nick Piggin @ 2010-06-16 10:46 UTC (permalink / raw)
To: netdev
I'm just converting this scalable dentry/inode hash table to a more
compact form. I was previously using a dumb spinlock per bucket,
but this doubles the size of the tables so isn't production quality.
What I've done at the moment is to use a bit_spinlock in bit 0 of each
list pointer of the table. Bit spinlocks are now pretty nice because
we can do __bit_spin_unlock() which gives non-atomic store with release
ordering, so it should be almost as fast as spinlock.
But I look at rt hash and it seems you use a small hash on the side
for spinlocks. So I wonder, pros for each:
- bitlocks have effectively zero storage
- bitlocks hit the same cacheline that the hash walk hits.
- in RCU list, locked hash walks usually followed by hash modification,
bitlock should have brought in the line for exclusive.
- bitlock number of locks scales with hash size
- spinlocks may be slightly better at the cacheline level (bitops
sometimes require explicit load which may not acquire exclusive
line on some archs). On x86 ll/sc architectures, this shouldn't
be a problem.
- spinlocks better debugging (could be overcome with a LOCKDEP
option to revert to spinlocks, but a bit ugly).
- in practice, contention due to aliasing in buckets to lock mapping
is probably fairly minor.
Net code is obviously tested and tuned well, but instinctively I would
have tought bitlocks are the better way to go. Any comments on this?
Thanks,
Nick
^ permalink raw reply
* DMFE Driver in current kernel
From: Heiko Gerstung @ 2010-06-16 10:52 UTC (permalink / raw)
To: netdev
Dear netdev members,
I apologize for contacting this list, I was pointed here by Tobias
Ringstroem who is listed as the current maintainer for the dmfe network
driver but he told me that this is not true anymore. I believe that
probably no one is looking at this driver anymore, and he suggested that
I should contact you guys because he cannot help me.
Here is my long worded story: We are a manufacturer of special network
appliances used for time synchronization and in our LANTIME time server
appliances we are currently using GNU Linux on an embedded i386 CPU
board with a Davicom network chip.
I am desperately missing an important feature in the dmfe driver (which
lists you as the current maintainer): I would want to be able to change
network speed and duplex settings without having to unload/reload the
kernel module, preferably by using ethtool.
Would anyone here interested in adding support for this into the driver?
If yes, we would want to pay for this (and of course we would love
to see that these changes are included in the next kernel version, if
possible). If nobody on this list is interested in this job, I would
appreciate recommendations regarding where to look for someone. If I do
not manage to find anyone for this task, I will try it on my own.
Pointers to suitable documentation or hints would be appreciated!
I know that this is a pretty old chip, but it seems that it is still
used in a number of embedded systems and since we want to be able to
upgrade our existing units in the field with our new firmware, I really
would like to get this functionality into the driver.
Another point I am looking for is how to get the asix driver to play
nice when used in combination with active-backup bonding. As it seems,
the current driver does not work unless I put the bonding interface into
promiscuous mode (which I find out by accident when I used tcpdump for
debugging). If anyone has an idea how to improve this, I would be very
grateful, too.
Thanks for your time,
Heiko
--
Heiko Gerstung
*MEINBERG Funkuhren* GmbH & Co. KG
Lange Wand 9
D-31812 Bad Pyrmont, Germany
Phone: +49 (0)5281 9309-25
Fax: +49 (0)5281 9309-30
Amtsgericht Hannover 17HRA 100322
Geschäftsführer/Managing Directors: Günter Meinberg, Werner Meinberg,
Andre Hartmann, Heiko Gerstung
Email: heiko.gerstung@meinberg.de <mailto:heiko.gerstung@meinberg.de>
Web: www.meinberg.de <http://www.meinberg.de>
------------------------------------------------------------------------
*MEINBERG - Accurate Time Worldwide*
^ permalink raw reply
* Re: [PATCH] vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)
From: Patrick McHardy @ 2010-06-16 11:42 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Pedro Garcia, netdev, Ben Hutchings
In-Reply-To: <1276679284.2632.22.camel@edumazet-laptop>
Eric Dumazet wrote:
> Le mercredi 16 juin 2010 à 10:49 +0200, Pedro Garcia a écrit :
>> Here it is again. I added the modifications in http://kerneltrap.org/mailarchive/linux-netdev/2010/5/23/6277868 for HW accelerated incoming packets (it did not apply cleanly on the last version of
>> the kernel, so I applied manually). Now, if the VLAN 0 is not explicitly created by the user, VLAN 0 packets will be treated as no VLAN (802.1p packets), instead of dropping them.
>>
>> The patch is now for two files: vlan_core (accel) and vlan_dev (non accel)
>>
>> I can not test on HW accelerated devices, so if someone can check it I will appreciate (even though in the thread above it looked like yes). For non accel I tessted in 2.6.26. Now the patch is for
>> net-next-2.6, and it compiles OK, but I a have to setup a test environment to check it is still OK (should, but better to test).
>>
>> Signed-off-by: Pedro Garcia <pedro.netdev@dondevamos.com>
>>
>
> OK, the patch itself is correct.
>
Yes, looks fine to me as well.
> Now, could you please send it again with a proper changelog ?
>
> In this changelog, please explain why patch is needed, and
> keep lines short (< 72 chars), like the one you did in your first mail.
>
> I'll then add my Signed-off-by, since I wrote the accelerated part ;)
>
> Note : I wonder if another patch is needed, in case 8021q module is
> _not_ loaded. We probably should accept vlan 0 frames in this case ?
>
I agree that this would be best for consistency, but that would mean
adding more special cases to __netif_receive_skb().
^ permalink raw reply
* Re: rt hash table / rt hash locks question
From: Eric Dumazet @ 2010-06-16 12:27 UTC (permalink / raw)
To: Nick Piggin; +Cc: netdev
In-Reply-To: <20100616104633.GW6138@laptop>
Le mercredi 16 juin 2010 à 20:46 +1000, Nick Piggin a écrit :
> I'm just converting this scalable dentry/inode hash table to a more
> compact form. I was previously using a dumb spinlock per bucket,
> but this doubles the size of the tables so isn't production quality.
>
Yes, we had this in the past (one rwlock or spinlock per hash chain),
and it was not very good with LOCKDEP on.
> What I've done at the moment is to use a bit_spinlock in bit 0 of each
> list pointer of the table. Bit spinlocks are now pretty nice because
> we can do __bit_spin_unlock() which gives non-atomic store with release
> ordering, so it should be almost as fast as spinlock.
>
> But I look at rt hash and it seems you use a small hash on the side
> for spinlocks. So I wonder, pros for each:
>
> - bitlocks have effectively zero storage
yes but a mask is needed to get head pointer. Special care also must
be taken when insert/delete a node in chain, keeping this bit set.
> - bitlocks hit the same cacheline that the hash walk hits.
yes
> - in RCU list, locked hash walks usually followed by hash modification,
> bitlock should have brought in the line for exclusive.
But we usually perform a read only lookup, _then_ take the lock, to
perform a new lookup before insert. So at time we would take the
bitlock, cache line is in shared state. With spinlocks, we always use
the exclusive mode, but on a separate cache line...
> - bitlock number of locks scales with hash size
Yes, but concurrency is more a function of online cpus, given we use
jhash.
> - spinlocks may be slightly better at the cacheline level (bitops
> sometimes require explicit load which may not acquire exclusive
> line on some archs). On x86 ll/sc architectures, this shouldn't
> be a problem.
Yes, you can add fairness (if ticket spinlocks variant used), but on
route cache I really doubt it can make a difference.
> - spinlocks better debugging (could be overcome with a LOCKDEP
> option to revert to spinlocks, but a bit ugly).
Definitely a good thing.
> - in practice, contention due to aliasing in buckets to lock mapping
> is probably fairly minor.
Agreed
>
> Net code is obviously tested and tuned well, but instinctively I would
> have tought bitlocks are the better way to go. Any comments on this?
Well, to be honest, this code is rather old, and at time I wrote it,
bitlocks were probably not available.
You can add :
- One downside of the hashed spinlocks is the X86_INTERNODE_CACHE_SHIFT
being 12 on X86_VSMP : All locks are probably in same internode block :(
- Another downside is all locks are currently on a single NUMA node,
since we kmalloc() them in one contiguous chunk.
So I guess it would be worth to try :)
^ permalink raw reply
* Re: rt hash table / rt hash locks question
From: Nick Piggin @ 2010-06-16 12:49 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1276691258.2632.55.camel@edumazet-laptop>
On Wed, Jun 16, 2010 at 02:27:38PM +0200, Eric Dumazet wrote:
> Le mercredi 16 juin 2010 à 20:46 +1000, Nick Piggin a écrit :
> > I'm just converting this scalable dentry/inode hash table to a more
> > compact form. I was previously using a dumb spinlock per bucket,
> > but this doubles the size of the tables so isn't production quality.
> >
>
> Yes, we had this in the past (one rwlock or spinlock per hash chain),
> and it was not very good with LOCKDEP on.
Sure :) And it halves the size of your hash even with lockdep off.
> > What I've done at the moment is to use a bit_spinlock in bit 0 of each
> > list pointer of the table. Bit spinlocks are now pretty nice because
> > we can do __bit_spin_unlock() which gives non-atomic store with release
> > ordering, so it should be almost as fast as spinlock.
> >
> > But I look at rt hash and it seems you use a small hash on the side
> > for spinlocks. So I wonder, pros for each:
> >
> > - bitlocks have effectively zero storage
> yes but a mask is needed to get head pointer. Special care also must
> be taken when insert/delete a node in chain, keeping this bit set.
That is true. Overall, I don't know what would be better for straight
line cycles, all in L1 cache. Probably the spinlocks, although there
is some small overhead from loading the 2nd hash.
> > - bitlocks hit the same cacheline that the hash walk hits.
> yes
> > - in RCU list, locked hash walks usually followed by hash modification,
> > bitlock should have brought in the line for exclusive.
> But we usually perform a read only lookup, _then_ take the lock, to
> perform a new lookup before insert. So at time we would take the
> bitlock, cache line is in shared state. With spinlocks, we always use
> the exclusive mode, but on a separate cache line...
Hmm, OK. This is usually true of the dcache and icache as well
actually. But you still have the same problem with spinlocks (with I
presume the common case of 0 or 1 entry in the hash) when inserting
into the table.
So we're still often avoiding one cacheline transition, and avoiding
hitting one cacheline.
> > - bitlock number of locks scales with hash size
> Yes, but concurrency is more a function of online cpus, given we use
> jhash.
Oh yeah but it has a maximum upper bound on the number of buckets in
the hash, and just having it scale nicely avoids the ifdef heuristics
in the existing code.
> > - spinlocks may be slightly better at the cacheline level (bitops
> > sometimes require explicit load which may not acquire exclusive
> > line on some archs). On x86 ll/sc architectures, this shouldn't
> > be a problem.
> Yes, you can add fairness (if ticket spinlocks variant used), but on
> route cache I really doubt it can make a difference.
Yes if the critical sections are very short and uncontended, I don't
think it's a large factor.
> > - spinlocks better debugging (could be overcome with a LOCKDEP
> > option to revert to spinlocks, but a bit ugly).
> Definitely a good thing.
>
> > - in practice, contention due to aliasing in buckets to lock mapping
> > is probably fairly minor.
> Agreed
> >
> > Net code is obviously tested and tuned well, but instinctively I would
> > have tought bitlocks are the better way to go. Any comments on this?
>
> Well, to be honest, this code is rather old, and at time I wrote it,
> bitlocks were probably not available.
>
> You can add :
>
> - One downside of the hashed spinlocks is the X86_INTERNODE_CACHE_SHIFT
> being 12 on X86_VSMP : All locks are probably in same internode block :(
Oh yeah that's true, very special case though.
> - Another downside is all locks are currently on a single NUMA node,
> since we kmalloc() them in one contiguous chunk.
>
> So I guess it would be worth to try :)
OK, this is what I'm working with for the icache/dcache to hide the
details of masking out the low bit. But it looks like rt hash is a bit
more highly tuned (eg without pprev pointer as you don't delete items
without walking the hash). So it might not be appropriate for you.
You might be able to derive some macros to hide some of the pain though.
Index: linux-2.6/include/linux/list_bl.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/list_bl.h
@@ -0,0 +1,97 @@
+#ifndef _LINUX_LIST_BL_H
+#define _LINUX_LIST_BL_H
+
+#include <linux/list.h>
+
+/*
+ * Special version of lists, where head of the list has a bit spinlock
+ * in the lowest bit. This is useful for scalable hash tables without
+ * increasing memory footprint overhead.
+ */
+
+struct hlist_bl_head {
+ struct hlist_bl_node *first;
+};
+
+struct hlist_bl_node {
+ struct hlist_bl_node *next, **pprev;
+};
+#define INIT_HLIST_BL_HEAD(ptr) \
+ ((ptr)->first = NULL)
+
+static inline void INIT_HLIST_BL_NODE(struct hlist_bl_node *h)
+{
+ h->next = NULL;
+ h->pprev = NULL;
+}
+
+#define hlist_bl_entry(ptr, type, member) container_of(ptr,type,member)
+
+static inline int hlist_bl_unhashed(const struct hlist_bl_node *h)
+{
+ return !h->pprev;
+}
+
+static inline struct hlist_bl_node *hlist_bl_first(struct hlist_bl_head *h)
+{
+ return (struct hlist_bl_node *)((unsigned long)h->first & ~1UL);
+}
+
+static inline void hlist_bl_set_first(struct hlist_bl_head *h, struct hlist_bl_node *n)
+{
+#ifdef CONFIG_DEBUG_LIST
+ BUG_ON(!((unsigned long)&h->first & 1UL));
+#endif
+ h->first = (struct hlist_bl_node *)((unsigned long)n | 1UL);
+}
+
+static inline int hlist_bl_empty(const struct hlist_bl_head *ptr)
+{
+ return !((unsigned long)ptr & ~1UL);
+}
+
+static inline void hlist_bl_add_head(struct hlist_bl_node *n,
+ struct hlist_bl_head *h)
+{
+ struct hlist_bl_node *first = hlist_bl_first(h);
+
+#ifdef CONFIG_DEBUG_LIST
+ BUG_ON(!((unsigned long)&h->first & 1UL));
+#endif
+ n->next = first;
+ if (first)
+ first->pprev = &n->next;
+ n->pprev = &h->first;
+ hlist_bl_set_first(h, n);
+}
+
+static inline void __hlist_bl_del(struct hlist_bl_node *n)
+{
+ struct hlist_bl_node *next = n->next;
+ struct hlist_bl_node **pprev = n->pprev;
+ *pprev = (struct hlist_bl_node *)((unsigned long)next | ((unsigned long)*pprev & 1UL));
+ if (next)
+ next->pprev = pprev;
+}
+
+static inline void hlist_bl_del(struct hlist_bl_node *n)
+{
+ __hlist_bl_del(n);
+ n->pprev = LIST_POISON2;
+}
+
+/**
+ * hlist_bl_for_each_entry - iterate over list of given type
+ * @tpos: the type * to use as a loop cursor.
+ * @pos: the &struct hlist_node to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the hlist_node within the struct.
+ *
+ */
+#define hlist_bl_for_each_entry(tpos, pos, head, member) \
+ for (pos = hlist_bl_first(head); \
+ pos && \
+ ({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1;}); \
+ pos = pos->next)
+
+#endif
Index: linux-2.6/include/linux/rculist_bl.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/rculist_bl.h
@@ -0,0 +1,123 @@
+#ifndef _LINUX_RCULIST_BL_H
+#define _LINUX_RCULIST_BL_H
+
+#ifdef __KERNEL__
+
+/*
+ * RCU-protected list version
+ */
+#include <linux/list_bl.h>
+#include <linux/rcupdate.h>
+
+static inline void hlist_bl_set_first_rcu(struct hlist_bl_head *h, struct hlist_bl_node *n)
+{
+#ifdef CONFIG_DEBUG_LIST
+ BUG_ON(!((unsigned long)h->first & 1UL));
+#endif
+ rcu_assign_pointer(h->first, (struct hlist_bl_node *)((unsigned long)n | 1UL));
+}
+
+static inline struct hlist_bl_node *hlist_bl_first_rcu(struct hlist_bl_head *h)
+{
+ return (struct hlist_bl_node *)((unsigned long)rcu_dereference(h->first) & ~1UL);
+}
+
+/**
+ * hlist_bl_del_init_rcu - deletes entry from hash list with re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: hlist_bl_unhashed() on the node return true after this. It is
+ * useful for RCU based read lockfree traversal if the writer side
+ * must know if the list entry is still hashed or already unhashed.
+ *
+ * In particular, it means that we can not poison the forward pointers
+ * that may still be used for walking the hash list and we can only
+ * zero the pprev pointer so list_unhashed() will return true after
+ * this.
+ *
+ * The caller must take whatever precautions are necessary (such as
+ * holding appropriate locks) to avoid racing with another
+ * list-mutation primitive, such as hlist_bl_add_head_rcu() or
+ * hlist_bl_del_rcu(), running on this same list. However, it is
+ * perfectly legal to run concurrently with the _rcu list-traversal
+ * primitives, such as hlist_bl_for_each_entry_rcu().
+ */
+static inline void hlist_bl_del_init_rcu(struct hlist_bl_node *n)
+{
+ if (!hlist_bl_unhashed(n)) {
+ __hlist_bl_del(n);
+ n->pprev = NULL;
+ }
+}
+
+/**
+ * hlist_bl_del_rcu - deletes entry from hash list without re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: hlist_bl_unhashed() on entry does not return true after this,
+ * the entry is in an undefined state. It is useful for RCU based
+ * lockfree traversal.
+ *
+ * In particular, it means that we can not poison the forward
+ * pointers that may still be used for walking the hash list.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_bl_add_head_rcu()
+ * or hlist_bl_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_bl_for_each_entry().
+ */
+static inline void hlist_bl_del_rcu(struct hlist_bl_node *n)
+{
+ __hlist_bl_del(n);
+ n->pprev = LIST_POISON2;
+}
+
+/**
+ * hlist_bl_add_head_rcu
+ * @n: the element to add to the hash list.
+ * @h: the list to add to.
+ *
+ * Description:
+ * Adds the specified element to the specified hlist_bl,
+ * while permitting racing traversals.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_bl_add_head_rcu()
+ * or hlist_bl_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_bl_for_each_entry_rcu(), used to prevent memory-consistency
+ * problems on Alpha CPUs. Regardless of the type of CPU, the
+ * list-traversal primitive must be guarded by rcu_read_lock().
+ */
+static inline void hlist_bl_add_head_rcu(struct hlist_bl_node *n,
+ struct hlist_bl_head *h)
+{
+ struct hlist_bl_node *first = hlist_bl_first(h);
+
+ n->next = first;
+ if (first)
+ first->pprev = &n->next;
+ n->pprev = &h->first;
+ hlist_bl_set_first_rcu(h, n);
+}
+/**
+ * hlist_bl_for_each_entry_rcu - iterate over rcu list of given type
+ * @tpos: the type * to use as a loop cursor.
+ * @pos: the &struct hlist_bl_node to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the hlist_bl_node within the struct.
+ *
+ */
+#define hlist_bl_for_each_entry_rcu(tpos, pos, head, member) \
+ for (pos = hlist_bl_first_rcu(head); \
+ pos && \
+ ({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1; }); \
+ pos = rcu_dereference_raw(pos->next))
+
+#endif
+#endif
^ permalink raw reply
* Re: [PATCH] Clear IFF_XMIT_DST_RELEASE for teql interfaces
From: jamal @ 2010-06-16 13:25 UTC (permalink / raw)
To: Tom Hughes
Cc: netdev, akpm, David S. Miller, Eric Dumazet, Stephen Hemminger,
Patrick McHardy, Tejun Heo, linux-kernel
In-Reply-To: <1276676668-10256-1-git-send-email-tom@compton.nu>
On Wed, 2010-06-16 at 09:24 +0100, Tom Hughes wrote:
> The sch_teql module, which can be used to load balance over a set of
> underlying interfaces, stopped working after 2.6.30 and has been
> broken in all kernels since then for any underlying interface which
> requires the addition of link level headers.
>
> The problem is that the transmit routine relies on being able to
> access the destination address in the skb in order to do address
> resolution once it has decided which underlying interface it is going
> to transmit through.
>
> In 2.6.31 the IFF_XMIT_DST_RELEASE flag was introduced, and set by
> default for all interfaces, which causes the destination address to be
> released before the transmit routine for the interface is called.
>
> The solution is to clear that flag for teql interfaces.
>
> Signed-off-by: Tom Hughes <tom@compton.nu>
Sounds reasonable. Lets CC Eric and get his ACK.
cheers,
jamal
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox