* [PATCH net-next] udp: be less conservative with sock rmem accounting
From: Paolo Abeni @ 2016-12-02 16:35 UTC (permalink / raw)
To: netdev; +Cc: David S. Miller, Jesper Dangaard Brouer
Before commit 850cbaddb52d ("udp: use it's own memory accounting
schema"), the udp protocol allowed sk_rmem_alloc to grow beyond
the rcvbuf by the whole current packet's truesize. After said commit
we allow sk_rmem_alloc to exceed the rcvbuf only if the receive queue
is empty. As reported by Jesper this cause a performance regression
for some (small) values of rcvbuf.
This commit is intended to fix the regression restoring the old
handling of the rcvbuf limit.
Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Fixes: 850cbaddb52d ("udp: use it's own memory accounting schema")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
net/ipv4/udp.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e1d0bf8..16d88ba 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1205,14 +1205,14 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
* queue is full; always allow at least a packet
*/
rmem = atomic_read(&sk->sk_rmem_alloc);
- if (rmem && (rmem + size > sk->sk_rcvbuf))
+ if (rmem > sk->sk_rcvbuf)
goto drop;
/* we drop only if the receive buf is full and the receive
* queue contains some other skb
*/
rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
- if ((rmem > sk->sk_rcvbuf) && (rmem > size))
+ if (rmem > (size + sk->sk_rcvbuf))
goto uncharge_drop;
spin_lock(&list->lock);
--
1.8.3.1
^ permalink raw reply related
* Re: [PATCH net v2] igb: re-assign hw address pointer on reset after PCI error
From: Keller, Jacob E @ 2016-12-02 16:41 UTC (permalink / raw)
To: gpiccoli@linux.vnet.ibm.com, intel-wired-lan@lists.osuosl.org,
Kirsher, Jeffrey T, alexander.duyck@gmail.com
Cc: netdev@vger.kernel.org, Fujinaka, Todd
In-Reply-To: <58416F5C.20602@linux.vnet.ibm.com>
On Fri, 2016-12-02 at 10:55 -0200, Guilherme G. Piccoli wrote:
> On 11/10/2016 04:46 PM, Guilherme G. Piccoli wrote:
> > Whenever the igb driver detects the result of a read operation
> > returns
> > a value composed only by F's (like 0xFFFFFFFF), it will detach the
> > net_device, clear the hw_addr pointer and warn to the user that
> > adapter's
> > link is lost - those steps happen on igb_rd32().
> >
> > In case a PCI error happens on Power architecture, there's a
> > recovery
> > mechanism called EEH, that will reset the PCI slot and call
> > driver's
> > handlers to reset the adapter and network functionality as well.
> >
> > We observed that once hw_addr is NULL after the error is detected
> > on
> > igb_rd32(), it's never assigned back, so in the process of
> > resetting
> > the network functionality we got a NULL pointer dereference in both
> > igb_configure_tx_ring() and igb_configure_rx_ring(). In order to
> > avoid
> > such bug, this patch re-assigns the hw_addr value in the slot_reset
> > handler.
> >
> > Reported-by: Anthony H. Thai <ahthai@us.ibm.com>
> > Reported-by: Harsha Thyagaraja <hathyaga@in.ibm.com>
> > Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
> > ---
> > drivers/net/ethernet/intel/igb/igb_main.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c
> > b/drivers/net/ethernet/intel/igb/igb_main.c
> > index edc9a6a..136ee9e 100644
> > --- a/drivers/net/ethernet/intel/igb/igb_main.c
> > +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> > @@ -7878,6 +7878,11 @@ static pci_ers_result_t
> > igb_io_slot_reset(struct pci_dev *pdev)
> > pci_enable_wake(pdev, PCI_D3hot, 0);
> > pci_enable_wake(pdev, PCI_D3cold, 0);
> >
> > + /* In case of PCI error, adapter lose its HW
> > address
> > + * so we should re-assign it here.
> > + */
> > + hw->hw_addr = adapter->io_addr;
> > +
> > igb_reset(adapter);
> > wr32(E1000_WUS, ~0);
> > result = PCI_ERS_RESULT_RECOVERED;
> >
>
> Ping?
>
> Sorry to annoy, any news on this?
> Thanks in advance!
>
> Cheers,
>
>
> Guilherme
>
This seems reasonable. It's similar to what fm10k driver does under
this circumstance.
Thanks,
Jake
^ permalink raw reply
* Re: [PATCH 2/7] net: ethernet: ti: cpdma: fix desc re-queuing
From: Grygorii Strashko @ 2016-12-02 16:45 UTC (permalink / raw)
To: Ivan Khoronzhuk
Cc: David S. Miller, netdev, Mugunthan V N, Sekhar Nori, linux-kernel,
linux-omap
In-Reply-To: <20161202110321.GA1213@khorivan>
On 12/02/2016 05:03 AM, Ivan Khoronzhuk wrote:
> On Thu, Dec 01, 2016 at 05:34:27PM -0600, Grygorii Strashko wrote:
>> The currently processing cpdma descriptor with EOQ flag set may
>> contain two values in Next Descriptor Pointer field:
>> - valid pointer: means CPDMA missed addition of new desc in queue;
> It shouldn't happen in normal circumstances, right?
it might happen, because desc push compete with desc pop.
You can check stats values:
chan->stats.misqueued
chan->stats.requeue
under different types of net-loads.
TRM:
"
If the pNext pointer is initially NULL, and more packets need to be queued for transmit, the software
application may alter this pointer to point to a newly appended descriptor. The EMAC will use the new
pointer value and proceed to the next descriptor unless the pNext value has already been read. In this
latter case, the transmitter will halt on the transmit channel in question, and the software application may
restart it at that time. The software can detect this case by checking for an end of queue (EOQ) condition
flag on the updated packet descriptor when it is returned by the EMAC.
"
> So, why it happens only for egress channels? And Does that mean
> there is some resynchronization between submit and process function,
> or this is h/w issue?
no hw issues. this patch just removes one unnecessary I/O access
>
>> - null: no more descriptors in queue.
>> In the later case, it's not required to write to HDP register, but now
>> CPDMA does it.
>>
>> Hence, add additional check for Next Descriptor Pointer != null in
>> cpdma_chan_process() function before writing in HDP register.
>>
>> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
>> ---
>> drivers/net/ethernet/ti/davinci_cpdma.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
>> index 0924014..379314f 100644
>> --- a/drivers/net/ethernet/ti/davinci_cpdma.c
>> +++ b/drivers/net/ethernet/ti/davinci_cpdma.c
>> @@ -1152,7 +1152,7 @@ static int __cpdma_chan_process(struct cpdma_chan *chan)
>> chan->count--;
>> chan->stats.good_dequeue++;
>>
>> - if (status & CPDMA_DESC_EOQ) {
>> + if ((status & CPDMA_DESC_EOQ) && chan->head) {
>> chan->stats.requeue++;
>> chan_write(chan, hdp, desc_phys(pool, chan->head));
>> }
>> --
>> 2.10.1
>>
--
regards,
-grygorii
^ permalink raw reply
* Re: arp_filter and IPv6 ND
From: Hannes Frederic Sowa @ 2016-12-02 16:45 UTC (permalink / raw)
To: Saku Ytti; +Cc: netdev
In-Reply-To: <CAAeewD_2tfNHhWmtwXb_28DicmipReimRvGqv1Q6PQ0AjkBPGw@mail.gmail.com>
Hello,
On 02.12.2016 16:42, Saku Ytti wrote:
> On 2 December 2016 at 16:08, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>
> Hey,
>
>> May I ask why you want to turn it off?
>
> Certainly. I don't want device to answer with link address for L3
> address it does not have on the link. In my case it triggers this bug
> https://supportforums.cisco.com/document/12098096/cscse46790-cef-prefers-arp-adjacency-over-rib-next-hop
Okay, that should not happen.
Redirects and neighbor advertisements are the only way how you can
announce prefixes on-link. Unfortunately historically we automatically
add device routes for prefixes, too. We can't change this anymore but
this is wrong.
> In this particular case, for one reason or another my Cisco device
> would have ND entry for Linux loopback pointing to an interface with
> completely different network. Which itself would be just weird, but
> combined with weird behaviour of Cisco it actually causes the loopback
> route advertised by BGP not to be installed. If the ND entry didn't
> exist, the BGP route would be installed.
Hmmmm... Loopback route advertised by BGP? Do you use filter to get rid
of that on your AS-border? So you probably don't use an IGP? Do you use
next-hop-self attribute on your neighbor in that direction? BGP in
general doesn't lead to ND entry installs, protocols like IS-IS afair
can short circuit here.
Hmm, I would keep the Loopback announcements out of the BGP.
> I don't really even know why the ND entry exists, all I can think of
> is that Linux must have sent gratuitous reply, because I don't se why
> Cisco would have tried to discover it.
>
> Expected behaviour is that the loopback/128 BGP route resolves to
> on-link next-hop, and on-link next hop is then ND'd. Observed
> behaviour is that loopback/128 BGP route also appears in ND cache.
Yep, exactly.
>> In IPv6 this depends on the scope. In IPv4 this concept doesn't really
>> exist.
>>
>> Please notice that in IPv4 arp_filter does not necessarily mean that the
>> system is operating in strong end system mode but you end up in an
>> hybrid clone where arp is acting strong but routing not and thus you
>> also have to add fib rules to simulate that.
>
> It's just very peculiar behaviour to have ARP or ND entries on a
> interface where given subnet does not exist, it rudimentarily causes
> difficult to troubleshoot problems and is surprising/unexpected
> behaviour.
For enterprise and cloud stuff it is certainly very surprising, as some
isolations don't work as expected. OTOH it is really easy to build up
home networks and things are more plug and play.
> Of course well behaving device wouldn't accept such replies, because
> it itself could be attack vector (imagine me telling you 8.8.8.8 is on
> the link, or worse, your bank).
>
> I'm curious, why does this behaviour exist? When is this desirable?
> I've never seen any other device than Linux behave like this, and when
> ever I've heard about the problem, I've only seen surprised faces that
> it does behave like this.
I don't feel comfortable to answer that, just some thoughts...
Some RFCs require that for some router implementations (CPE), on the
other hand weak end model in Linux was probably inherited by IPv4. The
addition of duplicate address detection (which of course only makes
sense in strong end systems) to IPv6, basically shows that IPv6 is more
or less designed to be a strong end system model.
Anyway, a patch to suppress ndisc requests on those interfaces will
probably be accepted.
For unicast reverse filtering e.g. there is actually no sysctl available
anymore, instead you are supposed to install a netfilter rule to handle
this, which automatically takes care of this.
Bye,
Hannes
^ permalink raw reply
* Re: [PATCH net-next] net/mlx5e: skip loopback selftest with !CONFIG_INET
From: David Miller @ 2016-12-02 16:56 UTC (permalink / raw)
To: arnd; +Cc: saeedm, matanb, leonro, kamalh, netdev, linux-rdma, linux-kernel
In-Reply-To: <20161130210552.1756085-1-arnd@arndb.de>
From: Arnd Bergmann <arnd@arndb.de>
Date: Wed, 30 Nov 2016 22:05:39 +0100
> When CONFIG_INET is disabled, the new selftest results in a link
> error:
>
> drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.o: In function `mlx5e_test_loopback':
> en_selftest.c:(.text.mlx5e_test_loopback+0x2ec): undefined reference to `ip_send_check'
> en_selftest.c:(.text.mlx5e_test_loopback+0x34c): undefined reference to `udp4_hwcsum'
>
> This hides the specific test in that configuration.
>
> Fixes: 0952da791c97 ("net/mlx5e: Add support for loopback selftest")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Applied, thanks Arnd.
^ permalink raw reply
* Re: [PATCH net-next v4 0/4] bpf: BPF for lightweight tunnel encapsulation
From: David Miller @ 2016-12-02 16:57 UTC (permalink / raw)
To: tgraf; +Cc: netdev, alexei.starovoitov, daniel, tom, roopa, hannes
In-Reply-To: <cover.1480522144.git.tgraf@suug.ch>
From: Thomas Graf <tgraf@suug.ch>
Date: Wed, 30 Nov 2016 17:10:07 +0100
> This series implements BPF program invocation from dst entries via the
> lightweight tunnels infrastructure.
Nice work, applied, thanks Thomas.
^ permalink raw reply
* Re: [flamebait] xdp, well meaning but pointless
From: Tom Herbert @ 2016-12-02 16:59 UTC (permalink / raw)
To: Hannes Frederic Sowa
Cc: Jesper Dangaard Brouer, Thomas Graf, Florian Westphal,
Linux Kernel Network Developers
In-Reply-To: <163220cc-a91b-5627-ea93-cd43dc0079c6@stressinduktion.org>
On Fri, Dec 2, 2016 at 3:54 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On 02.12.2016 11:24, Jesper Dangaard Brouer wrote:
>> On Thu, 1 Dec 2016 13:51:32 -0800
>> Tom Herbert <tom@herbertland.com> wrote:
>>
>>>>> The technical plenary at last IETF on Seoul a couple of weeks ago was
>>>>> exclusively focussed on DDOS in light of the recent attack against
>>>>> Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare
>>>>> presentation by Nick Sullivan
>>>>> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
>>>>> alluded to some implementation of DDOS mitigation. In particular, on
>>>>> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
>>
>> slide 14
>>
>>>>> numbers he gave we're based in iptables+BPF and that was a whole
>>>>> 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic
>>>>> and that's also when I introduced XDP to whole IETF :-) ). If that's
>>>>> the best we can do the Internet is in a world hurt. DDOS mitigation
>>>>> alone is probably a sufficient motivation to look at XDP. We need
>>>>> something that drops bad packets as quickly as possible when under
>>>>> attack, we need this to be integrated into the stack, we need it to be
>>>>> programmable to deal with the increasing savvy of attackers, and we
>>>>> don't want to be forced to be dependent on HW solutions. This is why
>>>>> we created XDP!
>>
>> The 1.2Mpps number is a bit low, but we are unfortunately in that
>> ballpark.
>>
>>>> I totally understand that. But in my reply to David in this thread I
>>>> mentioned DNS apex processing as being problematic which is actually
>>>> being referred in your linked slide deck on page 9 ("What do floods look
>>>> like") and the problematic of parsing DNS packets in XDP due to string
>>>> processing and looping inside eBPF.
>>
>> That is a weak argument. You do realize CloudFlare actually use eBPF to
>> do this exact filtering, and (so-far) eBPF for parsing DNS have been
>> sufficient for them.
>
> You are talking about this code on the following slides (I actually
> transcribed it for you here and disassembled):
>
> l0: ld #0x14
> l1: ldxb 4*([0]&0xf)
> l2: add x
> l3: tax
> l4: ld [x+0]
> l5: jeq #0x7657861, l6, l13
> l6: ld [x+4]
> l7: jeq #0x6d706c65, l8, l13
> l8: ld [x+8]
> l9: jeq #0x3636f6d, l10, l13
> l10: ldb [x+12]
> l11: jeq #0, l12, l13
> l12: ret #0x1
> l13: ret #0
>
> You can offload this to u32 in hardware if that is what you want.
>
> The reason this works is because of netfilter, which allows them to
> dynamically generate BPF programs and insert and delete them from
> chains, do intersection or unions of them.
>
> If you have a freestanding program like in XDP the complexity space is a
> different one and not comparable to this at all.
>
I don't understand this comment about complexity especially in regards
to the idea of offloading u32 to hardware. Relying on hardware to do
anything always leads to more complexity than an equivalent SW
implementation for the same functionality. The only reason we ever use
a hardware mechanisms is if it gives *significantly* better
performance. If the performance difference isn't there then doing
things in SW is going to be the better path (as we see in XDP).
Tom
> Bye,
> Hannes
>
^ permalink raw reply
* Re: [Intel-wired-lan] [PATCH v2] e1000e: free IRQ regardless of __E1000_DOWN
From: Baicar, Tyler @ 2016-12-02 17:02 UTC (permalink / raw)
To: Neftin, Sasha, jeffrey.t.kirsher, intel-wired-lan, netdev,
linux-kernel, okaya, timur
In-Reply-To: <20c191fe-e8fe-b2ca-0628-7d0188d1f28e@codeaurora.org>
Hello Sasha,
Were you able to reproduce this issue?
Do you have a patch fixing the close function inconsistencies that you
mentioned which I could try out?
Thanks,
Tyler
On 11/21/2016 1:40 PM, Baicar, Tyler wrote:
> On 11/17/2016 6:31 AM, Neftin, Sasha wrote:
>> On 11/13/2016 10:34 AM, Neftin, Sasha wrote:
>>> On 11/11/2016 12:35 AM, Baicar, Tyler wrote:
>>>> Hello Sasha,
>>>>
>>>> On 11/9/2016 11:19 PM, Neftin, Sasha wrote:
>>>>> On 11/9/2016 11:41 PM, Tyler Baicar wrote:
>>>>>> Move IRQ free code so that it will happen regardless of the
>>>>>> __E1000_DOWN bit. Currently the e1000e driver only releases its IRQ
>>>>>> if the __E1000_DOWN bit is cleared. This is not sufficient because
>>>>>> it is possible for __E1000_DOWN to be set without releasing the IRQ.
>>>>>> In such a situation, we will hit a kernel bug later in e1000_remove
>>>>>> because the IRQ still has action since it was never freed. A
>>>>>> secondary bus reset can cause this case to happen.
>>>>>>
>>>>>> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
>>>>>> ---
>>>>>> drivers/net/ethernet/intel/e1000e/netdev.c | 3 ++-
>>>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c
>>>>>> b/drivers/net/ethernet/intel/e1000e/netdev.c
>>>>>> index 7017281..36cfcb0 100644
>>>>>> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
>>>>>> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
>>>>>> @@ -4679,12 +4679,13 @@ int e1000e_close(struct net_device *netdev)
>>>>>> if (!test_bit(__E1000_DOWN, &adapter->state)) {
>>>>>> e1000e_down(adapter, true);
>>>>>> - e1000_free_irq(adapter);
>>>>>> /* Link status message must follow this format */
>>>>>> pr_info("%s NIC Link is Down\n", adapter->netdev->name);
>>>>>> }
>>>>>> + e1000_free_irq(adapter);
>>>>>> +
>>>>>> napi_disable(&adapter->napi);
>>>>>> e1000e_free_tx_resources(adapter->tx_ring);
>>>>>>
>>>>> I would like not recommend insert this change. This change related
>>>>> driver state machine, we afraid from lot of synchronization
>>>>> problem and
>>>>> issues.
>>>>> We need keep e1000_free_irq in loop and check for 'test_bit' ready.
>>>> What do you mean here? There is no loop. If __E1000_DOWN is set
>>>> then we
>>>> will never free the IRQ.
>>>>
>>>>> Another point, does before execute secondary bus reset your SW
>>>>> back up
>>>>> pcie configuration space as properly?
>>>> After a secondary bus reset, the link needs to recover and go back
>>>> to a
>>>> working state after 1 second.
>>>>
>>>> From the callstack, the issue is happening while removing the
>>>> endpoint
>>>> from the system, before applying the secondary bus reset.
>>>>
>>>> The order of events is
>>>> 1. remove the drivers
>>>> 2. cause a secondary bus reset
>>>> 3. wait 1 second
>>> Actually, this is too much, usually link up in less than 100ms.You can
>>> check Data Link Layer indication.
>>>> 4. recover the link
>>>>
>>>> callstack:
>>>> free_msi_irqs+0x6c/0x1a8
>>>> pci_disable_msi+0xb0/0x148
>>>> e1000e_reset_interrupt_capability+0x60/0x78
>>>> e1000_remove+0xc8/0x180
>>>> pci_device_remove+0x48/0x118
>>>> __device_release_driver+0x80/0x108
>>>> device_release_driver+0x2c/0x40
>>>> pci_stop_bus_device+0xa0/0xb0
>>>> pci_stop_bus_device+0x3c/0xb0
>>>> pci_stop_root_bus+0x54/0x80
>>>> acpi_pci_root_remove+0x28/0x64
>>>> acpi_bus_trim+0x6c/0xa4
>>>> acpi_device_hotplug+0x19c/0x3f4
>>>> acpi_hotplug_work_fn+0x28/0x3c
>>>> process_one_work+0x150/0x460
>>>> worker_thread+0x50/0x4b8
>>>> kthread+0xd4/0xe8
>>>> ret_from_fork+0x10/0x50
>>>>
>>>> Thanks,
>>>> Tyler
>>>>
>>> Hello Tyler,
>>> Okay, we need consult more about this suggestion.
>>> May I ask what is setup you run? Is there NIC or on board LAN? I would
>>> like try reproduce this issue in our lab's too.
>>> Also, is same issue observed with same scenario and others NIC's too?
>>> Sasha
>>> _______________________________________________
>>> Intel-wired-lan mailing list
>>> Intel-wired-lan@lists.osuosl.org
>>> http://lists.osuosl.org/mailman/listinfo/intel-wired-lan
>>>
>> Hello Tyler,
>> I see some in consistent implementation of __*_close methods in our
>> drivers. Do you have any igb NIC to check if same problem persist there?
>> Thanks,
>> Sasha
> Hello Sasha,
>
> I couldn't find an igb NIC to test with, but I did find another e1000e
> card that does not cause the same issue. That card is:
>
> 0004:01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit
> Network Connection
> Subsystem: Intel Corporation Gigabit CT Desktop Adapter
> Physical Slot: 5-1
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr- Stepping- SERR+ FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0, Cache Line Size: 64 bytes
> Interrupt: pin A routed to IRQ 299
> Region 0: Memory at c01001c0000 (32-bit, non-prefetchable)
> [size=128K]
> Region 1: Memory at c0100100000 (32-bit, non-prefetchable)
> [size=512K]
> Region 2: I/O ports at 1000 [size=32]
> Region 3: Memory at c01001e0000 (32-bit, non-prefetchable)
> [size=16K]
> Expansion ROM at c0100180000 [disabled] [size=256K]
> Capabilities: [c8] Power Management version 2
> Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
> Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
> Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
> Address: 00000000397f0040 Data: 0000
> Capabilities: [e0] Express (v1) Endpoint, MSI 00
> DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
> <512ns, L1 <64us
> ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
> DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+
> Unsupported+
> RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
> MaxPayload 256 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq-
> AuxPwr+ TransPend-
> LnkCap: Port #8, Speed 2.5GT/s, Width x1, ASPM L0s L1,
> Exit Latency L0s <128ns, L1 <64us
> ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
> LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train-
> SlotClk+ DLActive- BWMgmt- ABWMgmt-
> Capabilities: [a0] MSI-X: Enable- Count=5 Masked-
> Vector table: BAR=3 offset=00000000
> PBA: BAR=3 offset=00002000
> Capabilities: [100 v1] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
> UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
> UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt-
> UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
> NonFatalErr-
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
> NonFatalErr+
> AERCap: First Error Pointer: 00, GenCap- CGenEn-
> ChkCap- ChkEn-
> Capabilities: [140 v1] Device Serial Number
> 68-05-ca-ff-ff-29-47-34
> Kernel driver in use: e1000e
>
> Here are the kernel logs from first running the test on this card and
> then running the test on the Intel 82571EB card that I originally
> found the issue with (you can see the issue doesn't happen on this
> card but does on the other):
>
> [ 44.146749] ACPI: \_SB_.PCI0: Device has suffered a power fault
> [ 44.155238] pcieport 0000:00:00.0: PCIe Bus Error:
> severity=Uncorrected (Non-Fatal), type=Transaction Layer,
> id=0000(Requester ID)
> [ 44.166111] pcieport 0000:00:00.0: device [17cb:0400] error
> status/mask=00004000/00400000
> [ 44.174420] pcieport 0000:00:00.0: [14] Completion Timeout (First)
> [ 44.401943] e1000e 0000:01:00.0 eth0: Timesync Tx Control register
> not set as expected
> [ 82.445586] pcieport 0002:00:00.0: PCIe Bus Error:
> severity=Corrected, type=Physical Layer, id=0000(Receiver ID)
> [ 82.454851] pcieport 0002:00:00.0: device [17cb:0400] error
> status/mask=00000001/00006000
> [ 82.463209] pcieport 0002:00:00.0: [ 0] Receiver Error
> [ 82.469355] pcieport 0002:00:00.0: PCIe Bus Error:
> severity=Uncorrected (Non-Fatal), type=Transaction Layer,
> id=0000(Requester ID)
> [ 82.481026] pcieport 0002:00:00.0: device [17cb:0400] error
> status/mask=00004000/00400000
> [ 82.489343] pcieport 0002:00:00.0: [14] Completion Timeout (First)
> [ 82.504573] ACPI: \_SB_.PCI2: Device has suffered a power fault
> [ 84.528036] kernel BUG at drivers/pci/msi.c:369!
>
> I'm not sure why it reproduces on the 82571EB card and not the 82574L
> card. The only obvious difference is there is no Reciever Error on the
> 82574L card.
>
> If you have a patch fixing the inconsistencies you mentioned with the
> __*_close methods I would certainly be willing to test it out!
>
> Thanks,
> Tyler
>
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.
^ permalink raw reply
* Re: [PATCH] net: atarilance: use %8ph for printing hex string
From: David Miller @ 2016-12-02 17:03 UTC (permalink / raw)
To: linux; +Cc: felipe.balbi, netdev, linux-kernel
In-Reply-To: <1480543375-28904-1-git-send-email-linux@rasmusvillemoes.dk>
From: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Date: Wed, 30 Nov 2016 23:02:54 +0100
> This is already using the %pM printf extension; might as well also use
> %ph to make the code smaller.
>
> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Applied, thank you.
^ permalink raw reply
* Re: net/can: warning in raw_setsockopt/__alloc_pages_slowpath
From: Oliver Hartkopp @ 2016-12-02 17:05 UTC (permalink / raw)
To: Marc Kleine-Budde, Andrey Konovalov, David S. Miller, linux-can,
netdev, LKML
Cc: Dmitry Vyukov, Kostya Serebryany, syzkaller
In-Reply-To: <74e2aed8-ba38-ee91-59a3-49131ea18d60@pengutronix.de>
On 12/02/2016 04:42 PM, Marc Kleine-Budde wrote:
> On 12/02/2016 04:11 PM, Oliver Hartkopp wrote:
>>
>>
>> On 12/02/2016 02:24 PM, Marc Kleine-Budde wrote:
>>> On 12/02/2016 01:43 PM, Andrey Konovalov wrote:
>>
>>
>>>> [<ffffffff8369e0de>] raw_setsockopt+0x1be/0x9f0 net/can/raw.c:506
>>>
>>> We should add a check for a sensible optlen....
>>>
>>>> static int raw_setsockopt(struct socket *sock, int level, int optname,
>>>> char __user *optval, unsigned int optlen)
>>>> {
>>>> struct sock *sk = sock->sk;
>>>> struct raw_sock *ro = raw_sk(sk);
>>>> struct can_filter *filter = NULL; /* dyn. alloc'ed filters */
>>>> struct can_filter sfilter; /* single filter */
>>>> struct net_device *dev = NULL;
>>>> can_err_mask_t err_mask = 0;
>>>> int count = 0;
>>>> int err = 0;
>>>>
>>>> if (level != SOL_CAN_RAW)
>>>> return -EINVAL;
>>>>
>>>> switch (optname) {
>>>>
>>>> case CAN_RAW_FILTER:
>>>> if (optlen % sizeof(struct can_filter) != 0)
>>>> return -EINVAL;
>>>
>>> here...
>>>
>>> if (optlen > 64 * sizeof(struct can_filter))
>>> return -EINVAL;
>>>
>>
>> Agreed.
>>
>> But what is sensible here?
>> 64 filters is way to small IMO.
>>
>> When thinking about picking a bunch of single CAN IDs I would tend to
>> something like 512 filters.
>
> Ok - 64 was just an arbitrary chosen value for demonstration purposes :)
>
:-)
Would you like to send a patch?
Regards,
Oliver
^ permalink raw reply
* Re: [PATCH/RFC iproute2/net-next 2/3] tc: flower: introduce enum flower_endpoint
From: Jiri Pirko @ 2016-12-02 17:08 UTC (permalink / raw)
To: Simon Horman; +Cc: netdev, Stephen Hemminger, Jamal Hadi Salim, Jiri Pirko
In-Reply-To: <1480672785-14570-3-git-send-email-simon.horman@netronome.com>
Fri, Dec 02, 2016 at 10:59:44AM CET, simon.horman@netronome.com wrote:
>Introduce enum flower_endpoint and use it instead of a bool
>as the type for paramatising source and destination.
>
>This is intended to improve read-ability and provide some type
>checking of endpoint parameters.
>
>Signed-off-by: Simon Horman <simon.horman@netronome.com>
>---
> tc/f_flower.c | 22 ++++++++++++++--------
> 1 file changed, 14 insertions(+), 8 deletions(-)
>
>diff --git a/tc/f_flower.c b/tc/f_flower.c
>index 615e8f27bed2..42253067b43d 100644
>--- a/tc/f_flower.c
>+++ b/tc/f_flower.c
>@@ -23,6 +23,11 @@
> #include "tc_util.h"
> #include "rt_names.h"
>
>+enum flower_endpoint {
>+ flower_src,
>+ flower_dst
FLOWER_ENDPOINT_SRC,
FLOWER_ENDPOINT_DST,
>+};
>+
> static void explain(void)
> {
> fprintf(stderr,
>@@ -160,29 +165,30 @@ static int flower_parse_ip_addr(char *str, __be16 eth_type,
> return 0;
> }
>
>-static int flower_port_attr_type(__u8 ip_proto, bool is_src)
>+static int flower_port_attr_type(__u8 ip_proto, enum flower_endpoint endpoint)
> {
> if (ip_proto == IPPROTO_TCP)
>- return is_src ? TCA_FLOWER_KEY_TCP_SRC :
>+ return endpoint == flower_src ? TCA_FLOWER_KEY_TCP_SRC :
> TCA_FLOWER_KEY_TCP_DST;
> else if (ip_proto == IPPROTO_UDP)
>- return is_src ? TCA_FLOWER_KEY_UDP_SRC :
>+ return endpoint == flower_src ? TCA_FLOWER_KEY_UDP_SRC :
> TCA_FLOWER_KEY_UDP_DST;
> else if (ip_proto == IPPROTO_SCTP)
>- return is_src ? TCA_FLOWER_KEY_SCTP_SRC :
>+ return endpoint == flower_src ? TCA_FLOWER_KEY_SCTP_SRC :
> TCA_FLOWER_KEY_SCTP_DST;
> else
> return -1;
> }
>
>-static int flower_parse_port(char *str, __u8 ip_proto, bool is_src,
>+static int flower_parse_port(char *str, __u8 ip_proto,
>+ enum flower_endpoint endpoint,
> struct nlmsghdr *n)
> {
> int ret;
> int type;
> __be16 port;
>
>- type = flower_port_attr_type(ip_proto, is_src);
>+ type = flower_port_attr_type(ip_proto, endpoint);
> if (type < 0)
> return -1;
>
>@@ -340,14 +346,14 @@ static int flower_parse_opt(struct filter_util *qu, char *handle,
> }
> } else if (matches(*argv, "dst_port") == 0) {
> NEXT_ARG();
>- ret = flower_parse_port(*argv, ip_proto, false, n);
>+ ret = flower_parse_port(*argv, ip_proto, flower_dst, n);
> if (ret < 0) {
> fprintf(stderr, "Illegal \"dst_port\"\n");
> return -1;
> }
> } else if (matches(*argv, "src_port") == 0) {
> NEXT_ARG();
>- ret = flower_parse_port(*argv, ip_proto, true, n);
>+ ret = flower_parse_port(*argv, ip_proto, flower_src, n);
> if (ret < 0) {
> fprintf(stderr, "Illegal \"src_port\"\n");
> return -1;
>--
>2.7.0.rc3.207.g0ac5344
>
^ permalink raw reply
* Re: [PATCH/RFC iproute2/net-next 3/3] tc: flower: support matching on ICMP type and code
From: Jiri Pirko @ 2016-12-02 17:09 UTC (permalink / raw)
To: Simon Horman, g; +Cc: netdev, Stephen Hemminger, Jamal Hadi Salim, Jiri Pirko
In-Reply-To: <1480672785-14570-4-git-send-email-simon.horman@netronome.com>
Fri, Dec 02, 2016 at 10:59:45AM CET, simon.horman@netronome.com wrote:
>Support matching on ICMP type and code.
>
>Example usage:
>
>tc qdisc add dev eth0 ingress
>
>tc filter add dev eth0 protocol ip parent ffff: flower \
> indev eth0 ip_proto icmp type 8 code 0 action drop
>
>tc filter add dev eth0 protocol ipv6 parent ffff: flower \
> indev eth0 ip_proto icmpv6 type 128 code 0 action drop
>
>Signed-off-by: Simon Horman <simon.horman@netronome.com>
>---
> man/man8/tc-flower.8 | 20 ++++++++---
> tc/f_flower.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++----
> 2 files changed, 105 insertions(+), 11 deletions(-)
>
>diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
>index a401293fed50..c01ace6249dd 100644
>--- a/man/man8/tc-flower.8
>+++ b/man/man8/tc-flower.8
>@@ -29,7 +29,7 @@ flower \- flow based traffic control filter
> .IR PRIORITY " | "
> .BR vlan_eth_type " { " ipv4 " | " ipv6 " | "
> .IR ETH_TYPE " } | "
>-.BR ip_proto " { " tcp " | " udp " | " sctp " | "
>+.BR ip_proto " { " tcp " | " udp " | " sctp " | " icmp " | " icmpv6 " | "
> .IR IP_PROTO " } | { "
> .BR dst_ip " | " src_ip " } { "
> .IR ipv4_address " | " ipv6_address " } | { "
>@@ -94,7 +94,7 @@ or an unsigned 16bit value in hexadecimal format.
> Match on layer four protocol.
> .I IP_PROTO
> may be
>-.BR tcp ", " udp ", " sctp
>+.BR tcp ", " udp ", " sctp ", " icmp ", " icmpv6
> or an unsigned 8bit value in hexadecimal format.
> .TP
> .BI dst_ip " ADDRESS"
>@@ -112,6 +112,13 @@ option of tc filter.
> Match on layer 4 protocol source or destination port number. Only available for
> .BR ip_proto " values " udp ", " tcp " and " sctp
> which have to be specified in beforehand.
>+.TP
>+.BI type " NUMBER"
>+.TQ
>+.BI code " NUMBER"
>+Match on ICMP type or code. Only available for
>+.BR ip_proto " values " icmp " and " icmpv6
>+which have to be specified in beforehand.
> .SH NOTES
> As stated above where applicable, matches of a certain layer implicitly depend
> on the matches of the next lower layer. Precisely, layer one and two matches
>@@ -120,13 +127,16 @@ have no dependency, layer three matches
> (\fBip_proto\fR, \fBdst_ip\fR and \fBsrc_ip\fR)
> depend on the
> .B protocol
>-option of tc filter
>-and finally layer four matches
>+option of tc filter, layer four port matches
> (\fBdst_port\fR and \fBsrc_port\fR)
> depend on
> .B ip_proto
> being set to
>-.BR tcp ", " udp " or " sctp.
>+.BR tcp ", " udp " or " sctp,
>+and finally ICMP matches (\fBcode\fR and \fBtype\fR) depend on
>+.B ip_proto
>+being set to
>+.BR icmp " or " icmpv6.
> .P
> There can be only used one mask per one prio. If user needs to specify different
> mask, he has to use different prio.
>diff --git a/tc/f_flower.c b/tc/f_flower.c
>index 42253067b43d..59f6f1ea26e6 100644
>--- a/tc/f_flower.c
>+++ b/tc/f_flower.c
>@@ -28,6 +28,11 @@ enum flower_endpoint {
> flower_dst
> };
>
>+enum flower_icmp_field {
>+ flower_icmp_type,
>+ flower_icmp_code
FLOWER_ICMP_FIELD_TYPE,
FLOWER_ICMP_FIELD_CODE,
>+};
>+
> static void explain(void)
> {
> fprintf(stderr,
>@@ -42,11 +47,13 @@ static void explain(void)
> " vlan_ethtype [ ipv4 | ipv6 | ETH-TYPE ] |\n"
> " dst_mac MAC-ADDR |\n"
> " src_mac MAC-ADDR |\n"
>- " ip_proto [tcp | udp | sctp | IP-PROTO ] |\n"
>+ " ip_proto [tcp | udp | sctp | icmp | icmpv6 | IP-PROTO ] |\n"
> " dst_ip [ IPV4-ADDR | IPV6-ADDR ] |\n"
> " src_ip [ IPV4-ADDR | IPV6-ADDR ] |\n"
> " dst_port PORT-NUMBER |\n"
>- " src_port PORT-NUMBER }\n"
>+ " src_port PORT-NUMBER |\n"
>+ " type ICMP-TYPE |\n"
>+ " code ICMP-CODE }\n"
> " FILTERID := X:Y:Z\n"
> " ACTION-SPEC := ... look at individual actions\n"
> "\n"
>@@ -95,16 +102,23 @@ static int flower_parse_ip_proto(char *str, __be16 eth_type, int type,
> int ret;
> __u8 ip_proto;
>
>- if (eth_type != htons(ETH_P_IP) && eth_type != htons(ETH_P_IPV6)) {
>- fprintf(stderr, "Illegal \"eth_type\" for ip proto\n");
>- return -1;
>- }
>+ if (eth_type != htons(ETH_P_IP) && eth_type != htons(ETH_P_IPV6))
>+ goto err;
>+
> if (matches(str, "tcp") == 0) {
> ip_proto = IPPROTO_TCP;
> } else if (matches(str, "udp") == 0) {
> ip_proto = IPPROTO_UDP;
> } else if (matches(str, "sctp") == 0) {
> ip_proto = IPPROTO_SCTP;
>+ } else if (matches(str, "icmp") == 0) {
>+ if (eth_type != htons(ETH_P_IP))
>+ goto err;
>+ ip_proto = IPPROTO_ICMP;
>+ } else if (matches(str, "icmpv6") == 0) {
>+ if (eth_type != htons(ETH_P_IPV6))
>+ goto err;
>+ ip_proto = IPPROTO_ICMPV6;
> } else {
> ret = get_u8(&ip_proto, str, 16);
> if (ret)
>@@ -113,6 +127,10 @@ static int flower_parse_ip_proto(char *str, __be16 eth_type, int type,
> addattr8(n, MAX_MSG, type, ip_proto);
> *p_ip_proto = ip_proto;
> return 0;
>+
>+err:
>+ fprintf(stderr, "Illegal \"eth_type\" for ip proto\n");
>+ return -1;
> }
>
> static int flower_parse_ip_addr(char *str, __be16 eth_type,
>@@ -165,6 +183,39 @@ static int flower_parse_ip_addr(char *str, __be16 eth_type,
> return 0;
> }
>
>+static int flower_icmp_attr_type(__be16 eth_type, __u8 ip_proto,
>+ enum flower_icmp_field field)
>+{
>+ if (eth_type == htons(ETH_P_IP) && ip_proto == IPPROTO_ICMP)
>+ return field == flower_icmp_code ? TCA_FLOWER_KEY_ICMPV4_CODE :
>+ TCA_FLOWER_KEY_ICMPV4_TYPE;
>+ else if (eth_type == htons(ETH_P_IPV6) &&ip_proto == IPPROTO_ICMPV6)
>+ return field == flower_icmp_code ? TCA_FLOWER_KEY_ICMPV6_CODE :
>+ TCA_FLOWER_KEY_ICMPV6_TYPE;
>+
>+ return -1;
>+}
>+
>+static int flower_parse_icmp(char *str, __u16 eth_type, __u8 ip_proto,
>+ bool is_code, struct nlmsghdr *n)
>+{
>+ int ret;
>+ int type;
>+ uint8_t value;
>+
>+ type = flower_icmp_attr_type(eth_type, ip_proto, is_code);
>+ if (type < 0)
>+ return -1;
>+
>+ ret = get_u8(&value, str, 10);
>+ if (ret)
>+ return -1;
>+
>+ addattr8(n, MAX_MSG, type, value);
>+
>+ return 0;
>+}
>+
> static int flower_port_attr_type(__u8 ip_proto, enum flower_endpoint endpoint)
> {
> if (ip_proto == IPPROTO_TCP)
>@@ -358,6 +409,22 @@ static int flower_parse_opt(struct filter_util *qu, char *handle,
> fprintf(stderr, "Illegal \"src_port\"\n");
> return -1;
> }
>+ } else if (matches(*argv, "type") == 0) {
>+ NEXT_ARG();
>+ ret = flower_parse_icmp(*argv, eth_type, ip_proto,
>+ false, n);
>+ if (ret < 0) {
>+ fprintf(stderr, "Illegal \"icmp type\"\n");
>+ return -1;
>+ }
>+ } else if (matches(*argv, "code") == 0) {
>+ NEXT_ARG();
>+ ret = flower_parse_icmp(*argv, eth_type, ip_proto,
>+ true, n);
>+ if (ret < 0) {
>+ fprintf(stderr, "Illegal \"icmp code\"\n");
>+ return -1;
>+ }
> } else if (matches(*argv, "action") == 0) {
> NEXT_ARG();
> ret = parse_action(&argc, &argv, TCA_FLOWER_ACT, n);
>@@ -471,6 +538,10 @@ static void flower_print_ip_proto(FILE *f, __u8 *p_ip_proto,
> fprintf(f, "udp");
> else if (ip_proto == IPPROTO_SCTP)
> fprintf(f, "sctp");
>+ else if (ip_proto == IPPROTO_ICMP)
>+ fprintf(f, "icmp");
>+ else if (ip_proto == IPPROTO_ICMPV6)
>+ fprintf(f, "icmpv6");
> else
> fprintf(f, "%02x", ip_proto);
> *p_ip_proto = ip_proto;
>@@ -519,6 +590,12 @@ static void flower_print_port(FILE *f, char *name, struct rtattr *attr)
> fprintf(f, "\n %s %d", name, ntohs(rta_getattr_u16(attr)));
> }
>
>+static void flower_print_icmp(FILE *f, char *name, struct rtattr *attr)
>+{
>+ if (attr)
>+ fprintf(f, "\n %s %d", name, ntohs(rta_getattr_u8(attr)));
>+}
>+
> static int flower_print_opt(struct filter_util *qu, FILE *f,
> struct rtattr *opt, __u32 handle)
> {
>@@ -587,6 +664,13 @@ static int flower_print_opt(struct filter_util *qu, FILE *f,
> if (nl_type >= 0)
> flower_print_port(f, "src_port", tb[nl_type]);
>
>+ nl_type = flower_icmp_attr_type(eth_type, ip_proto, false);
>+ if (nl_type >= 0)
>+ flower_print_icmp(f, "icmp_type", tb[nl_type]);
>+ nl_type = flower_icmp_attr_type(eth_type, ip_proto, true);
>+ if (nl_type >= 0)
>+ flower_print_icmp(f, "icmp_code", tb[nl_type]);
>+
> if (tb[TCA_FLOWER_FLAGS]) {
> __u32 flags = rta_getattr_u32(tb[TCA_FLOWER_FLAGS]);
>
>--
>2.7.0.rc3.207.g0ac5344
>
^ permalink raw reply
* Re: [PATCH/RFC iproute2/net-next 0/3] tc: flower: Support matching on ICMP
From: Jiri Pirko @ 2016-12-02 17:10 UTC (permalink / raw)
To: Simon Horman; +Cc: netdev, Stephen Hemminger, Jamal Hadi Salim, Jiri Pirko
In-Reply-To: <1480672785-14570-1-git-send-email-simon.horman@netronome.com>
Fri, Dec 02, 2016 at 10:59:42AM CET, simon.horman@netronome.com wrote:
>Add support for matching on ICMP type and code to flower. This is modeled
>on existing support for matching on L4 ports.
>
>The second patch provided a minor cleanup which is in keeping with
>they style used in the last patch.
>
>This is marked as an RFC to match the same designation given to the
>corresponding kernel patches.
Looks nice, I only have those 2 enum nitpicks.
Thanks.
>
>Based on iproute2/net-next with the following applied:
>* [[PATCH iproute2/net-next v2] 0/4] tc: flower: SCTP and other port fixes
>
>Simon Horman (3):
> tc: flower: update headers for TCA_FLOWER_KEY_ICMP*
> tc: flower: introduce enum flower_endpoint
> tc: flower: support matching on ICMP type and code
>
> include/linux/pkt_cls.h | 10 ++++
> man/man8/tc-flower.8 | 20 ++++++--
> tc/f_flower.c | 118 ++++++++++++++++++++++++++++++++++++++++++------
> 3 files changed, 129 insertions(+), 19 deletions(-)
>
>--
>2.7.0.rc3.207.g0ac5344
>
^ permalink raw reply
* Re: [PATCH/RFC net-next 0/2] net/sched: cls_flower: Support matching on ICMP
From: Jiri Pirko @ 2016-12-02 17:10 UTC (permalink / raw)
To: Simon Horman
Cc: netdev, David S. Miller, Jay Vosburgh, Veaceslav Falico,
Andy Gospodarek, Jamal Hadi Salim, Jiri Pirko
In-Reply-To: <1480672352-13291-1-git-send-email-simon.horman@netronome.com>
Fri, Dec 02, 2016 at 10:52:30AM CET, simon.horman@netronome.com wrote:
>Hi,
>
>this series add supports for matching on ICMP type and code to cls_flower.
>This is modeled on existing support for matching on L4 ports. The updates
>to the dissector are intended to allow for code and storage re-use.
Looks fine to me. Thanks!
>
>Simon Horman (2):
> flow dissector: ICMP support
> net/sched: cls_flower: Support matching on ICMP type and code
>
> drivers/net/bonding/bond_main.c | 6 +++--
> include/linux/skbuff.h | 5 +++++
> include/net/flow_dissector.h | 50 ++++++++++++++++++++++++++++++++++++++---
> include/uapi/linux/pkt_cls.h | 10 +++++++++
> net/core/flow_dissector.c | 34 +++++++++++++++++++++++++---
> net/sched/cls_flow.c | 4 ++--
> net/sched/cls_flower.c | 42 ++++++++++++++++++++++++++++++++++
> 7 files changed, 141 insertions(+), 10 deletions(-)
>
>--
>2.7.0.rc3.207.g0ac5344
>
^ permalink raw reply
* Re: [PATCH 1/2] net: ethernet: altera: TSE: Remove unneeded dma sync for tx buffers
From: David Miller @ 2016-12-02 17:11 UTC (permalink / raw)
To: LinoSanfilippo; +Cc: vbridger, nios2-dev, linux-kernel, netdev
In-Reply-To: <1480546112-3099-1-git-send-email-LinoSanfilippo@gmx.de>
From: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Date: Wed, 30 Nov 2016 23:48:31 +0100
> An explicit dma sync for device directly after mapping as well as an
> explicit dma sync for cpu directly before unmapping is unnecessary and
> costly on the hotpath. So remove these calls.
>
> Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Applied.
^ permalink raw reply
* Re: [PATCH 2/2] net: ethernet: altera: TSE: do not use tx queue lock in tx completion handler
From: David Miller @ 2016-12-02 17:11 UTC (permalink / raw)
To: LinoSanfilippo; +Cc: vbridger, nios2-dev, linux-kernel, netdev
In-Reply-To: <1480546112-3099-2-git-send-email-LinoSanfilippo@gmx.de>
From: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Date: Wed, 30 Nov 2016 23:48:32 +0100
> The driver already uses its private lock for synchronization between xmit
> and xmit completion handler making the additional use of the xmit_lock
> unnecessary.
> Furthermore the driver does not set NETIF_F_LLTX resulting in xmit to be
> called with the xmit_lock held and then taking the private lock while xmit
> completion handler does the reverse, first take the private lock, then the
> xmit_lock.
> Fix these issues by not taking the xmit_lock in the tx completion handler.
>
> Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Yeah that could be a nasty deadlock, in fact.
Applied, thanks.
^ permalink raw reply
* Re: Initial thoughts on TXDP
From: Tom Herbert @ 2016-12-02 17:12 UTC (permalink / raw)
To: Edward Cree
Cc: Hannes Frederic Sowa, Florian Westphal,
Linux Kernel Network Developers, Jesper Dangaard Brouer
In-Reply-To: <cb2e6263-d981-eccf-cea7-39392ceb67b5@solarflare.com>
On Fri, Dec 2, 2016 at 6:36 AM, Edward Cree <ecree@solarflare.com> wrote:
> On 01/12/16 23:46, Tom Herbert wrote:
>> The only time we
>> _really_ to allocate an skbuf is when we need to put the packet onto a
>> queue. All the other use cases are really just to pass a structure
>> containing a packet from function to function. For that purpose we
>> should be able to just pass a much smaller structure in a stack
>> argument and only allocate an skbuff when we need to enqueue. In cases
>> where we don't ever queue a packet we might never need to allocate any
>> skbuff
> Now this intrigues me, because one of the objections to bundling (vs GRO)
> was the memory usage of all those SKBs. IIRC we already do a 'GRO-like'
> coalescing when packets reach a TCP socket anyway (or at least in some
> cases, not sure if all the different ways we can enqueue a TCP packet for
> RX do it), but if we could carry the packets from NIC to socket without
> SKBs, doing so in lists rather than one-at-a-time wouldn't cost any extra
> memory (the packet-pages are all already allocated on the NIC RX ring).
> Possibly combine the two, so that rather than having potentially four
> versions of each function (skb, skbundle, void*, void* bundle) you just
> have the two 'ends'.
>
Yep, seems like a good idea to incorporate bundling into TXDP from the get-go.
Tom
> -Ed
^ permalink raw reply
* Re: [PATCH net] packet: fix race condition in packet_set_ring
From: David Miller @ 2016-12-02 17:17 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev, philip.pettersson
In-Reply-To: <1480546536.18162.216.camel@edumazet-glaptop3.roam.corp.google.com>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 30 Nov 2016 14:55:36 -0800
> From: Philip Pettersson <philip.pettersson@gmail.com>
>
> When packet_set_ring creates a ring buffer it will initialize a
> struct timer_list if the packet version is TPACKET_V3. This value
> can then be raced by a different thread calling setsockopt to
> set the version to TPACKET_V1 before packet_set_ring has finished.
>
> This leads to a use-after-free on a function pointer in the
> struct timer_list when the socket is closed as the previously
> initialized timer will not be deleted.
>
> The bug is fixed by taking lock_sock(sk) in packet_setsockopt when
> changing the packet version while also taking the lock at the start
> of packet_set_ring.
>
> Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
> Signed-off-by: Philip Pettersson <philip.pettersson@gmail.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Applied and queued up for -stable, thanks.
^ permalink raw reply
* Re: [PATCH 5/7] Documentation: DT: net: cpsw: allow to specify descriptors pool size
From: Grygorii Strashko @ 2016-12-02 17:21 UTC (permalink / raw)
To: Ivan Khoronzhuk
Cc: David S. Miller, netdev, Mugunthan V N, Sekhar Nori, linux-kernel,
linux-omap
In-Reply-To: <20161202112832.GB1213@khorivan>
On 12/02/2016 05:28 AM, Ivan Khoronzhuk wrote:
> On Thu, Dec 01, 2016 at 05:34:30PM -0600, Grygorii Strashko wrote:
>> Add optional property "descs_pool_size" to specify buffer descriptor's
>> pool size. The "descs_pool_size" should define total number of CPDMA
>> CPPI descriptors to be used for both ingress/egress packets
>> processing. If not specified - the default value 256 will be used
>> which will allow to place descriptor's pool into the internal CPPI
>> RAM on most of TI SoC.
>>
>> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
>> ---
>> Documentation/devicetree/bindings/net/cpsw.txt | 5 +++++
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/Documentation/devicetree/bindings/net/cpsw.txt b/Documentation/devicetree/bindings/net/cpsw.txt
>> index 5ad439f..b99d196 100644
>> --- a/Documentation/devicetree/bindings/net/cpsw.txt
>> +++ b/Documentation/devicetree/bindings/net/cpsw.txt
>> @@ -35,6 +35,11 @@ Optional properties:
>> For example in dra72x-evm, pcf gpio has to be
>> driven low so that cpsw slave 0 and phy data
>> lines are connected via mux.
>> +- descs_pool_size : total number of CPDMA CPPI descriptors to be used for
>> + both ingress/egress packets processing. if not
>> + specified the default value 256 will be used which
>> + will allow to place descriptors pool into the
>> + internal CPPI RAM.
> Does it describe h/w? Why now module parameter? or even smth like ethtool num
> ring entries?
>
It can be module parameter too. for the use cases i'm aware of -
this is one-time boot setting only.
----- OR
So, do you propose to use
ethtool -g ethX
ethtool -G ethX [rx N] [tx N]
?
Now cpdma has one pool for all RX/TX channels, so changing this settings
by ethtool will require: pause interfaces, reallocate cpdma pool,
re-arrange buffers between channels, resume interface. Correct?
How do you think - we can move forward with one pool or better to have two (Rx and Tx)?
Wouldn't it be reasonable to still have DT (or module) parameter to avoid
cpdma reconfiguration on system startup (pause/resume interfaces) (faster boot)?
How about cpdma re-allocation policy (with expectation that is shouldn't happen too often)?
- increasing of Rx, Tx will grow total number of physically allocated buffers (total_desc_num)
- decreasing of Rx, Tx will just change number of available buffers (no memory re-allocation)
----- OR ----
Can we move forward with current patch (total number of CPDMA CPPI descriptors defined in DT)
and add ethtool -G ethX [rx N] [tx N] which will allow to re-split descs between RX and TX?
--
regards,
-grygorii
^ permalink raw reply
* Re: [PATCH 5/7] Documentation: DT: net: cpsw: allow to specify descriptors pool size
From: Grygorii Strashko @ 2016-12-02 17:22 UTC (permalink / raw)
To: Ivan Khoronzhuk
Cc: David S. Miller, netdev, Mugunthan V N, Sekhar Nori, linux-kernel,
linux-omap
In-Reply-To: <20161202112832.GB1213@khorivan>
On 12/02/2016 05:28 AM, Ivan Khoronzhuk wrote:
> On Thu, Dec 01, 2016 at 05:34:30PM -0600, Grygorii Strashko wrote:
>> Add optional property "descs_pool_size" to specify buffer descriptor's
>> pool size. The "descs_pool_size" should define total number of CPDMA
>> CPPI descriptors to be used for both ingress/egress packets
>> processing. If not specified - the default value 256 will be used
>> which will allow to place descriptor's pool into the internal CPPI
>> RAM on most of TI SoC.
>>
>> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
>> ---
>> Documentation/devicetree/bindings/net/cpsw.txt | 5 +++++
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/Documentation/devicetree/bindings/net/cpsw.txt b/Documentation/devicetree/bindings/net/cpsw.txt
>> index 5ad439f..b99d196 100644
>> --- a/Documentation/devicetree/bindings/net/cpsw.txt
>> +++ b/Documentation/devicetree/bindings/net/cpsw.txt
>> @@ -35,6 +35,11 @@ Optional properties:
>> For example in dra72x-evm, pcf gpio has to be
>> driven low so that cpsw slave 0 and phy data
>> lines are connected via mux.
>> +- descs_pool_size : total number of CPDMA CPPI descriptors to be used for
>> + both ingress/egress packets processing. if not
>> + specified the default value 256 will be used which
>> + will allow to place descriptors pool into the
>> + internal CPPI RAM.
> Does it describe h/w? Why now module parameter? or even smth like ethtool num
> ring entries?
>
It can be module parameter too. in general this is expected to be
one-time boot setting only.
----- OR
So, do you propose to use
ethtool -g ethX
ethtool -G ethX [rx N] [tx N]
?
Now cpdma has one pool for all RX/TX channels, so changing this settings
by ethtool will require: pause interfaces, reallocate cpdma pool,
re-arrange buffers between channels, resume interface. Correct?
How do you think - we can move forward with one pool or better to have two (Rx and Tx)?
Wouldn't it be reasonable to still have DT (or module) parameter to avoid
cpdma reconfiguration on system startup (pause/resume interfaces) (faster boot)?
How about cpdma re-allocation policy (with expectation that is shouldn't happen too often)?
- increasing of Rx, Tx will grow total number of physically allocated buffers (total_desc_num)
- decreasing of Rx, Tx will just change number of available buffers (no memory re-allocation)
----- OR ----
Can we move forward with current patch (total number of CPDMA CPPI descriptors defined in DT)
and add ethtool -G ethX [rx N] [tx N] which will allow to re-split descs between RX and TX?
--
regards,
-grygorii
^ permalink raw reply
* Re: [flamebait] xdp, well meaning but pointless
From: Jesper Dangaard Brouer @ 2016-12-02 17:22 UTC (permalink / raw)
To: Florian Westphal; +Cc: brouer, netdev
In-Reply-To: <20161201091108.GF26507@breakpoint.cc>
On Thu, 1 Dec 2016 10:11:08 +0100 Florian Westphal <fw@strlen.de> wrote:
> In light of DPDKs existence it make a lot more sense to me to provide
> a). a faster mmap based interface (possibly AF_PACKET based) that allows
> to map nic directly into userspace, detaching tx/rx queue from kernel.
>
> John Fastabend sent something like this last year as a proof of
> concept, iirc it was rejected because register space got exposed directly
> to userspace. I think we should re-consider merging netmap
> (or something conceptually close to its design).
I'm actually working in this direction, of zero-copy RX mapping packets
into userspace. This work is mostly related to page_pool, and I only
plan to use XDP as a filter for selecting packets going to userspace,
as this choice need to be taken very early.
My design is here:
https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
This is mostly about changing the memory model in the drivers, to allow
for safely mapping pages to userspace. (An efficient queue mechanism is
not covered). People often overlook that netmap's efficiency *also* comes
from introducing pre-mapping memory/pages to userspace.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply
* Re: [PATCH net-next 3/6] net: dsa: mv88e6xxx: add a software reset op
From: David Miller @ 2016-12-02 17:23 UTC (permalink / raw)
To: vivien.didelot; +Cc: netdev, linux-kernel, kernel, f.fainelli, andrew
In-Reply-To: <20161130225930.25510-4-vivien.didelot@savoirfairelinux.com>
From: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Date: Wed, 30 Nov 2016 17:59:27 -0500
> @@ -765,6 +765,9 @@ struct mv88e6xxx_ops {
> int (*phy_write)(struct mv88e6xxx_chip *chip, int addr, int reg,
> u16 val);
>
> + /* Switch Software Reset */
> + int (*reset)(struct mv88e6xxx_chip *chip);
> +
I think Andrew's request to name this method "g1_reset" is reasonable, please
respin with that change.
Thanks.
^ permalink raw reply
* Re: [PATCH net-next v3] ipv6 addrconf: Implemented enhanced DAD (RFC7527)
From: David Miller @ 2016-12-02 17:27 UTC (permalink / raw)
To: nordmark; +Cc: netdev, hannes, gilligan
In-Reply-To: <1480549159-8142-1-git-send-email-nordmark@arista.com>
From: Erik Nordmark <nordmark@arista.com>
Date: Wed, 30 Nov 2016 15:39:19 -0800
> @@ -794,6 +808,17 @@ static void ndisc_recv_ns(struct sk_buff *skb)
> have_ifp:
> if (ifp->flags & (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)) {
> if (dad) {
> + if (nonce != 0 && ifp->dad_nonce == nonce) {
> + u8 *np = (u8 *)&nonce;
> + /* Matching nonce if looped back */
> + ND_PRINTK(2, notice,
> + "%s: IPv6 DAD loopback for address %pI6c nonce %02x:%02x:%02x:%02x:%02x:%02x ignored\n",
> + ifp->idev->dev->name,
> + &ifp->addr,
> + np[0], np[1], np[2], np[3],
> + np[4], np[5]);
I know you said you'd leave this, but I'd actually like to ask that you
use %pM here to save some kernel size.
Thank you.
^ permalink raw reply
* Re: [PATCH net-next 3/6] net: dsa: mv88e6xxx: add a software reset op
From: Vivien Didelot @ 2016-12-02 17:30 UTC (permalink / raw)
To: Andrew Lunn
Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli
In-Reply-To: <20161202154342.GL21887@lunn.ch>
Hi Andrew,
Andrew Lunn <andrew@lunn.ch> writes:
> + /* Switch Software Reset */
> + int (*g1_reset)(struct mv88e6xxx_chip *chip);
> +
>
> We have a collection of function pointers with port_ prefix, another
> collection with stats_, and a third with ppu_, etc. And then we have
> some which do not fit a specific category. Those i have prefixed with
> g1_ or g2_. I think we should have some prefix, and that is my
> suggestion.
I disagree. There's only one entry point to issue a switch software
reset, so .reset is enough.
I use this opportunity to give a bit of details about mv88e6xxx/ so that
things get written down at least once somewhere:
global1.c implements accessors to "Global 1 Registers" features and are
prefixed with mv88e6xxx_g1_; port.c implements accessors to "Port
Registers" features and are prefixed with mv88e6xxx_port_, and so
on. (where xxx can be a model if there's conflict due to a redefinition
of the same register)
If a feature is not present or if there's more than one way to access
it, these accessors are bundled in the per-chip mv88e6xxx_ops structure
for disambiguation.
chip.c implements support for a single chip by aggregating and nicely
wrapping these operations. It provides a generic API for Marvell
switches, used to implement net/dsa routines.
Here's a couple of example. Setting a switch MAC can be done in Global
1, or Global 2 depending on the model. Thus .set_switch_mac can be
mv88e6xxx_g1_set_switch_mac or mv88e6xxx_g2_set_switch_mac.
Setting the port's speed is always in the same Port register, but its
layout varies with the model. Thus .port_set_speed can be
mv88e6185_port_set_speed or mv88e6352_port_set_speed.
Thanks,
Vivien
^ permalink raw reply
* Re: [PATCH net v2 1/3] ipv4: Set skb->protocol properly for local output
From: David Miller @ 2016-12-02 17:34 UTC (permalink / raw)
To: elicooper; +Cc: netdev, eric.dumazet
In-Reply-To: <20161201020512.21661-1-elicooper@gmx.com>
From: Eli Cooper <elicooper@gmx.com>
Date: Thu, 1 Dec 2016 10:05:10 +0800
> When xfrm is applied to TSO/GSO packets, it follows this path:
>
> xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()
>
> where skb_gso_segment() relies on skb->protocol to function properly.
>
> This patch sets skb->protocol to ETH_P_IP before dst_output() is called,
> fixing a bug where GSO packets sent through a sit tunnel are dropped
> when xfrm is involved.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Eli Cooper <elicooper@gmx.com>
Applied.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox