Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v3] lib/test_rhashtable: Make test_insert_dup() allocate its hash table dynamically
From: Herbert Xu @ 2019-01-31 12:08 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: David Miller, tgraf, netdev, linux-kernel
In-Reply-To: <20190130184230.225631-1-bvanassche@acm.org>

On Wed, Jan 30, 2019 at 10:42:30AM -0800, Bart Van Assche wrote:
> The test_insert_dup() function from lib/test_rhashtable.c passes a
> pointer to a stack object to rhltable_init(). Allocate the hash table
> dynamically to avoid that the following is reported with object
> debugging enabled:
> 
> ODEBUG: object (ptrval) is on stack (ptrval), but NOT annotated.
> WARNING: CPU: 0 PID: 1 at lib/debugobjects.c:368 __debug_object_init+0x312/0x480
> Modules linked in:
> EIP: __debug_object_init+0x312/0x480
> Call Trace:
>  ? debug_object_init+0x1a/0x20
>  ? __init_work+0x16/0x30
>  ? rhashtable_init+0x1e1/0x460
>  ? sched_clock_cpu+0x57/0xe0
>  ? rhltable_init+0xb/0x20
>  ? test_insert_dup+0x32/0x20f
>  ? trace_hardirqs_on+0x38/0xf0
>  ? ida_dump+0x10/0x10
>  ? jhash+0x130/0x130
>  ? my_hashfn+0x30/0x30
>  ? test_rht_init+0x6aa/0xab4
>  ? ida_dump+0x10/0x10
>  ? test_rhltable+0xc5c/0xc5c
>  ? do_one_initcall+0x67/0x28e
>  ? trace_hardirqs_off+0x22/0xe0
>  ? restore_all_kernel+0xf/0x70
>  ? trace_hardirqs_on_thunk+0xc/0x10
>  ? restore_all_kernel+0xf/0x70
>  ? kernel_init_freeable+0x142/0x213
>  ? rest_init+0x230/0x230
>  ? kernel_init+0x10/0x110
>  ? schedule_tail_wrapper+0x9/0xc
>  ? ret_from_fork+0x19/0x24
> 
> Cc: Thomas Graf <tgraf@suug.ch>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: netdev@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> 
> Changes compared to v2: fixed build error.
> 
> Changes compared to v1: instead of modifying rhashtable_init(), modify its
>    caller.
> 
> lib/test_rhashtable.c | 23 +++++++++++++++--------
>  1 file changed, 15 insertions(+), 8 deletions(-)

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: r8169 Driver - Poor Network Performance Since Kernel 4.19
From: Peter Ceiley @ 2019-01-31 12:09 UTC (permalink / raw)
  To: David Chang; +Cc: Heiner Kallweit, Realtek linux nic maintainers, netdev
In-Reply-To: <20190131072306.GG25745@linux-kyyb.suse>

Hi Heiner,

A quick update on my testing with different pcie_aspm settings:

pcie_aspm=off | no change
pcie_aspm.policy=default | no change
pcie_aspm.policy=performance | issue resolved
pcie_aspm.policy=powersave | issue resolved
pcie_aspm.policy=powersupersave | issue resolved

It seems the new driver does not play nicely with the default ASPM policy.

As requested, I've included an output of ethtool below when experiencing
the issue - note that no errors are recorded.

# ethtool -S enp3s0
NIC statistics:
     tx_packets: 2749
     rx_packets: 4089
     tx_errors: 0
     rx_errors: 0
     rx_missed: 0
     align_errors: 0
     tx_single_collisions: 0
     tx_multi_collisions: 0
     unicast: 4078
     broadcast: 9
     multicast: 2
     tx_aborted: 0
     tx_underrun: 0

David, I hope this helps for your user as well. I appreciate you sharing
the bug ticket - thanks.

Heiner, thanks very much for your help to date.

Regards,

Peter.

On Thu, 31 Jan 2019 at 18:23, David Chang <dchang@suse.com> wrote:
>
> Hi Heiner,
>
> On Jan 31, 2019 at 07:35:30 +0100, Heiner Kallweit wrote:
> > Hi David, two more things:
> >
> > 1. Could you please test a recent linux-next kernel?
> > 2. Please get a register dump (ethtool -d <if>) from 4.18 and 4.19
> >    and compare them.
>
> I'm sorry that I do not have the issue machine handy. I would ask
> our user to do the test. Thanks!
>
> Regards,
> David
>
> >
> > Heiner
> >
> >
> > On 31.01.2019 07:21, Heiner Kallweit wrote:
> > > David, thanks for the link to the bug ticket.
> > > I think only a proper bisect can help to find the offending commit.
> > >
> > > Heiner
> > >
> > >
> > > On 31.01.2019 03:32, David Chang wrote:
> > >> Hi,
> > >>
> > >> We had a similr case here.
> > >> - Realtek r8169 receive performance regression in kernel 4.19
> > >>   https://bugzilla.suse.com/show_bug.cgi?id=1119649
> > >>
> > >> kernel: r8169 0000:01:00.0 eth0: RTL8168h/8111h, XID 54100880
> > >> The major symptom is there are many rx_missed count.
> > >>
> > >>
> > >> On Jan 30, 2019 at 20:15:45 +0100, Heiner Kallweit wrote:
> > >>> Hi Peter,
> > >>>
> > >>> recently I had somebody where pcie_aspm=off for whatever reason didn't
> > >>> do the trick, can you also check with pcie_aspm.policy=performance.
> > >>
> > >> We will give it a try later.
> > >>
> > >>> And please check with "ethtool -S <if>" whether the chip statistics
> > >>> show a significant number of errors.
> > >>>
> > >>> If this doesn't help you may have to bisect to find the offending commit.
> > >>
> > >> We had tried fallback driver to a few previous commits as following,
> > >> but with no luck.
> > >>
> > >> 9675931e6b65 r8169: re-enable MSI-X on RTL8168g (v4.19)
> > >> 098b01ad9837 r8169: don't include asm headers directly (v4.19-rc1)
> > >> a2965f12fde6 r8169: remove rtl8169_set_speed_xmii (v4.19-rc1)
> > >> 6fcf9b1d4d6c r8169: fix runtime suspend (v4.19-rc1)
> > >> e397286b8e89 r8169: remove TBI 1000BaseX support (v4.19-rc1)
> > >>
> > >> Thanks,
> > >> David Chang
> > >>
> > >>>
> > >>> Heiner
> > >>>
> > >>>
> > >>> On 30.01.2019 10:59, Peter Ceiley wrote:
> > >>>> Hi Heiner,
> > >>>>
> > >>>> I tried disabling the ASPM using the pcie_aspm=off kernel parameter
> > >>>> and this made no difference.
> > >>>>
> > >>>> I tried compiling the 4.18.16 r8169.c with the 4.19.18 source and
> > >>>> subsequently loaded the module in the running 4.19.18 kernel. I can
> > >>>> confirm that this immediately resolved the issue and access to the NFS
> > >>>> shares operated as expected.
> > >>>>
> > >>>> I presume this means it is an issue with the r8169 driver included in
> > >>>> 4.19 onwards?
> > >>>>
> > >>>> To answer your last questions:
> > >>>>
> > >>>> Base Board Information
> > >>>>     Manufacturer: Alienware
> > >>>>     Product Name: 0PGRP5
> > >>>>     Version: A02
> > >>>>
> > >>>> ... and yes, the RTL8168 is the onboard network chip.
> > >>>>
> > >>>> Regards,
> > >>>>
> > >>>> Peter.
> > >>>>
> > >>>> On Tue, 29 Jan 2019 at 17:44, Heiner Kallweit <hkallweit1@gmail.com> wrote:
> > >>>>>
> > >>>>> Hi Peter,
> > >>>>>
> > >>>>> I think the vendor driver doesn't enable ASPM per default.
> > >>>>> So it's worth a try to disable ASPM in the BIOS or via sysfs.
> > >>>>> Few older systems seem to have issues with ASPM, what kind of
> > >>>>> system / mainboard are you using? The RTL8168 is the onboard
> > >>>>> network chip?
> > >>>>>
> > >>>>> Rgds, Heiner
> > >>>>>
> > >>>>>
> > >>>>> On 29.01.2019 07:20, Peter Ceiley wrote:
> > >>>>>> Hi Heiner,
> > >>>>>>
> > >>>>>> Thanks, I'll do some more testing. It might not be the driver - I
> > >>>>>> assumed it was due to the fact that using the r8168 driver 'resolves'
> > >>>>>> the issue. I'll see if I can test the r8169.c on top of 4.19 - this is
> > >>>>>> a good idea.
> > >>>>>>
> > >>>>>> Cheers,
> > >>>>>>
> > >>>>>> Peter.
> > >>>>>>
> > >>>>>> On Tue, 29 Jan 2019 at 17:16, Heiner Kallweit <hkallweit1@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>> Hi Peter,
> > >>>>>>>
> > >>>>>>> at a first glance it doesn't look like a typical driver issue.
> > >>>>>>> What you could do:
> > >>>>>>>
> > >>>>>>> - Test the r8169.c from 4.18 on top of 4.19.
> > >>>>>>>
> > >>>>>>> - Check whether disabling ASPM (/sys/module/pcie_aspm) has an effect.
> > >>>>>>>
> > >>>>>>> - Bisect between 4.18 and 4.19 to find the offending commit.
> > >>>>>>>
> > >>>>>>> Any specific reason why you think root cause is in the driver and not
> > >>>>>>> elsewhere in the network subsystem?
> > >>>>>>>
> > >>>>>>> Heiner
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On 28.01.2019 23:10, Peter Ceiley wrote:
> > >>>>>>>> Hi Heiner,
> > >>>>>>>>
> > >>>>>>>> Thanks for getting back to me.
> > >>>>>>>>
> > >>>>>>>> No, I don't use jumbo packets.
> > >>>>>>>>
> > >>>>>>>> Bandwidth is *generally* good, and iperf results to my NAS provide
> > >>>>>>>> over 900 Mbits/s in both circumstances. The issue seems to appear when
> > >>>>>>>> establishing a connection and is most notable, for example, on my
> > >>>>>>>> mounted NFS shares where it takes seconds (up to 10's of seconds on
> > >>>>>>>> larger directories) to list the contents of each directory. Once a
> > >>>>>>>> transfer begins on a file, I appear to get good bandwidth.
> > >>>>>>>>
> > >>>>>>>> I'm unsure of the best scientific data to provide you in order to
> > >>>>>>>> troubleshoot this issue. Running the following
> > >>>>>>>>
> > >>>>>>>>     netstat -s |grep retransmitted
> > >>>>>>>>
> > >>>>>>>> shows a steady increase in retransmitted segments each time I list the
> > >>>>>>>> contents of a remote directory, for example, running 'ls' on a
> > >>>>>>>> directory containing 345 media files did the following using kernel
> > >>>>>>>> 4.19.18:
> > >>>>>>>>
> > >>>>>>>> increased retransmitted segments by 21 and the 'time' command showed
> > >>>>>>>> the following:
> > >>>>>>>>     real    0m19.867s
> > >>>>>>>>     user    0m0.012s
> > >>>>>>>>     sys    0m0.036s
> > >>>>>>>>
> > >>>>>>>> The same command shows no retransmitted segments running kernel
> > >>>>>>>> 4.18.16 and 'time' showed:
> > >>>>>>>>     real    0m0.300s
> > >>>>>>>>     user    0m0.004s
> > >>>>>>>>     sys    0m0.007s
> > >>>>>>>>
> > >>>>>>>> ifconfig does not show any RX/TX errors nor dropped packets in either case.
> > >>>>>>>>
> > >>>>>>>> dmesg XID:
> > >>>>>>>> [    2.979984] r8169 0000:03:00.0 eth0: RTL8168g/8111g,
> > >>>>>>>> f8:b1:56:fe:67:e0, XID 4c000800, IRQ 32
> > >>>>>>>>
> > >>>>>>>> # lspci -vv
> > >>>>>>>> 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
> > >>>>>>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
> > >>>>>>>>     Subsystem: Dell RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
> > >>>>>>>>     Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> > >>>>>>>> ParErr- Stepping- SERR- FastB2B- DisINTx+
> > >>>>>>>>     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> > >>>>>>>> <TAbort- <MAbort- >SERR- <PERR- INTx-
> > >>>>>>>>     Latency: 0, Cache Line Size: 64 bytes
> > >>>>>>>>     Interrupt: pin A routed to IRQ 19
> > >>>>>>>>     Region 0: I/O ports at d000 [size=256]
> > >>>>>>>>     Region 2: Memory at f7b00000 (64-bit, non-prefetchable) [size=4K]
> > >>>>>>>>     Region 4: Memory at f2100000 (64-bit, prefetchable) [size=16K]
> > >>>>>>>>     Capabilities: [40] Power Management version 3
> > >>>>>>>>         Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA
> > >>>>>>>> PME(D0+,D1+,D2+,D3hot+,D3cold+)
> > >>>>>>>>         Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> > >>>>>>>>     Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
> > >>>>>>>>         Address: 0000000000000000  Data: 0000
> > >>>>>>>>     Capabilities: [70] Express (v2) Endpoint, MSI 01
> > >>>>>>>>         DevCap:    MaxPayload 128 bytes, PhantFunc 0, Latency L0s
> > >>>>>>>> <512ns, L1 <64us
> > >>>>>>>>             ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
> > >>>>>>>> SlotPowerLimit 10.000W
> > >>>>>>>>         DevCtl:    CorrErr- NonFatalErr- FatalErr- UnsupReq-
> > >>>>>>>>             RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> > >>>>>>>>             MaxPayload 128 bytes, MaxReadReq 4096 bytes
> > >>>>>>>>         DevSta:    CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
> > >>>>>>>>         LnkCap:    Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit
> > >>>>>>>> Latency L0s unlimited, L1 <64us
> > >>>>>>>>             ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
> > >>>>>>>>         LnkCtl:    ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
> > >>>>>>>>             ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
> > >>>>>>>>         LnkSta:    Speed 2.5GT/s (ok), Width x1 (ok)
> > >>>>>>>>             TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > >>>>>>>>         DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+,
> > >>>>>>>> OBFF Via message/WAKE#
> > >>>>>>>>              AtomicOpsCap: 32bit- 64bit- 128bitCAS-
> > >>>>>>>>         DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+,
> > >>>>>>>> OBFF Disabled
> > >>>>>>>>              AtomicOpsCtl: ReqEn-
> > >>>>>>>>         LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
> > >>>>>>>>              Transmit Margin: Normal Operating Range,
> > >>>>>>>> EnterModifiedCompliance- ComplianceSOS-
> > >>>>>>>>              Compliance De-emphasis: -6dB
> > >>>>>>>>         LnkSta2: Current De-emphasis Level: -6dB,
> > >>>>>>>> EqualizationComplete-, EqualizationPhase1-
> > >>>>>>>>              EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> > >>>>>>>>     Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
> > >>>>>>>>         Vector table: BAR=4 offset=00000000
> > >>>>>>>>         PBA: BAR=4 offset=00000800
> > >>>>>>>>     Capabilities: [d0] Vital Product Data
> > >>>>>>>> pcilib: sysfs_read_vpd: read failed: Input/output error
> > >>>>>>>>         Not readable
> > >>>>>>>>     Capabilities: [100 v1] Advanced Error Reporting
> > >>>>>>>>         UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> > >>>>>>>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> > >>>>>>>>         UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> > >>>>>>>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> > >>>>>>>>         UESvrt:    DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
> > >>>>>>>> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> > >>>>>>>>         CESta:    RxErr+ BadTLP+ BadDLLP+ Rollover- Timeout+ AdvNonFatalErr-
> > >>>>>>>>         CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
> > >>>>>>>>         AERCap:    First Error Pointer: 00, ECRCGenCap+ ECRCGenEn-
> > >>>>>>>> ECRCChkCap+ ECRCChkEn-
> > >>>>>>>>             MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
> > >>>>>>>>         HeaderLog: 00000000 00000000 00000000 00000000
> > >>>>>>>>     Capabilities: [140 v1] Virtual Channel
> > >>>>>>>>         Caps:    LPEVC=0 RefClk=100ns PATEntryBits=1
> > >>>>>>>>         Arb:    Fixed- WRR32- WRR64- WRR128-
> > >>>>>>>>         Ctrl:    ArbSelect=Fixed
> > >>>>>>>>         Status:    InProgress-
> > >>>>>>>>         VC0:    Caps:    PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
> > >>>>>>>>             Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
> > >>>>>>>>             Ctrl:    Enable+ ID=0 ArbSelect=Fixed TC/VC=01
> > >>>>>>>>             Status:    NegoPending- InProgress-
> > >>>>>>>>     Capabilities: [160 v1] Device Serial Number 01-00-00-00-68-4c-e0-00
> > >>>>>>>>     Capabilities: [170 v1] Latency Tolerance Reporting
> > >>>>>>>>         Max snoop latency: 71680ns
> > >>>>>>>>         Max no snoop latency: 71680ns
> > >>>>>>>>     Kernel driver in use: r8169
> > >>>>>>>>     Kernel modules: r8169
> > >>>>>>>>
> > >>>>>>>> Please let me know if you have any other ideas in terms of testing.
> > >>>>>>>>
> > >>>>>>>> Thanks!
> > >>>>>>>>
> > >>>>>>>> Peter.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Tue, 29 Jan 2019 at 05:28, Heiner Kallweit <hkallweit1@gmail.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>> On 28.01.2019 12:13, Peter Ceiley wrote:
> > >>>>>>>>>> Hi,
> > >>>>>>>>>>
> > >>>>>>>>>> I have been experiencing very poor network performance since Kernel
> > >>>>>>>>>> 4.19 and I'm confident it's related to the r8169 driver.
> > >>>>>>>>>>
> > >>>>>>>>>> I have no issue with kernel versions 4.18 and prior. I am experiencing
> > >>>>>>>>>> this issue in kernels 4.19 and 4.20 (currently running/testing with
> > >>>>>>>>>> 4.20.4 & 4.19.18).
> > >>>>>>>>>>
> > >>>>>>>>>> If someone could guide me in the right direction, I'm happy to help
> > >>>>>>>>>> troubleshoot this issue. Note that I have been keeping an eye on one
> > >>>>>>>>>> issue related to loading of the PHY driver, however, my symptoms
> > >>>>>>>>>> differ in that I still have a network connection. I have attempted to
> > >>>>>>>>>> reload the driver on a running system, but this does not improve the
> > >>>>>>>>>> situation.
> > >>>>>>>>>>
> > >>>>>>>>>> Using the proprietary r8168 driver returns my device to proper working order.
> > >>>>>>>>>>
> > >>>>>>>>>> lshw shows:
> > >>>>>>>>>>        description: Ethernet interface
> > >>>>>>>>>>        product: RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
> > >>>>>>>>>>        vendor: Realtek Semiconductor Co., Ltd.
> > >>>>>>>>>>        physical id: 0
> > >>>>>>>>>>        bus info: pci@0000:03:00.0
> > >>>>>>>>>>        logical name: enp3s0
> > >>>>>>>>>>        version: 0c
> > >>>>>>>>>>        serial:
> > >>>>>>>>>>        size: 1Gbit/s
> > >>>>>>>>>>        capacity: 1Gbit/s
> > >>>>>>>>>>        width: 64 bits
> > >>>>>>>>>>        clock: 33MHz
> > >>>>>>>>>>        capabilities: pm msi pciexpress msix vpd bus_master cap_list
> > >>>>>>>>>> ethernet physical tp aui bnc mii fibre 10bt 10bt-fd 100bt 100bt-fd
> > >>>>>>>>>> 1000bt-fd autonegotiation
> > >>>>>>>>>>        configuration: autonegotiation=on broadcast=yes driver=r8169
> > >>>>>>>>>> duplex=full firmware=rtl8168g-2_0.0.1 02/06/13 ip=192.168.1.25
> > >>>>>>>>>> latency=0 link=yes multicast=yes port=MII speed=1Gbit/s
> > >>>>>>>>>>        resources: irq:19 ioport:d000(size=256)
> > >>>>>>>>>> memory:f7b00000-f7b00fff memory:f2100000-f2103fff
> > >>>>>>>>>>
> > >>>>>>>>>> Kind Regards,
> > >>>>>>>>>>
> > >>>>>>>>>> Peter.
> > >>>>>>>>>>
> > >>>>>>>>> Hi Peter,
> > >>>>>>>>>
> > >>>>>>>>> the description "poor network performance" is quite vague, therefore:
> > >>>>>>>>>
> > >>>>>>>>> - Can you provide any measurements?
> > >>>>>>>>> - iperf results before and after
> > >>>>>>>>> - statistics about dropped packets (rx and/or tx)
> > >>>>>>>>> - Do you use jumbo packets?
> > >>>>>>>>>
> > >>>>>>>>> Also help would be a "lspci -vv" output for the network card and
> > >>>>>>>>> the dmesg output line with the chip XID.
> > >>>>>>>>>
> > >>>>>>>>> Heiner
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> > >
> >
> >

^ permalink raw reply

* Re: [PATCH iproute2-next] Introduce ip-brctl shell script
From: Stefano Brivio @ 2019-01-31 12:46 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Nikolay Aleksandrov, David Ahern, Phil Sutter, Eric Garver,
	Tomas Dolezal, Stephen Hemminger, Lennert Buytenhek, netdev
In-Reply-To: <CAJieiUir5WVaQmWF3d46mo1J4+C_RdVmD=iAButhuEFsd7001A@mail.gmail.com>

On Wed, 30 Jan 2019 14:30:59 -0800
Roopa Prabhu <roopa@cumulusnetworks.com> wrote:

> On Sun, Jan 27, 2019 at 11:57 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > They can't replace brctl not because they are badly designed or
> > unusable, but simply because they are different tools with different
> > purposes (see also my comments to Nikolay).  
> 
> I don't think i understand that they are different tools. The new netlink tools
> are supposed to deprecate the old tools that use ioctls. this is the same reason
> we don't have a ip-ifconfig today

It's not just ioctl vs. netlink: brctl operates on bridges, whereas
ip-link and 'bridge' operate on generic links. If you look at
cmd_showstp() and cmd_show() from my script (I wouldn't recommend that
right after breakfast ;)) that should be apparent.

If you want to get basic information about a bridge, with ip-link or
'bridge' you need multiple commands. This is very different from
ifconfig: there's no such ifconfig command that can't be implemented as
a single ip-link command.

> > > So, I think its better to fix the first two instead of introducing
> > > another one.  
> >
> > This is really not the same thing: I'm not introducing a new tool, I'm
> > effectively replacing a 1794-LoC, non-trivial, ioctl-based
> > implementation with a trivial, 572-lines shell script, with half
> > the binary size.  
> 
> you are replacing a ioctl-based tool from another package into
> iproute2..and maybe there-by deprecating the netlink based tool ? :).

Oh, I see your point here. Well, it's not deprecating anything because
I'm using the netlink tool in the wrapper (which allows it to be
rather dumb and simple), but sure, I understand what you mean.

Still, we could very easily get the wrapper to print equivalent ip-link
and bridge commands before executing them -- and this would actually
ease and encourage migration, compared to the current situation.

> We are in opposite camps, I strongly want people to move to bridge and
> ip link which has all the latest support.

I wouldn't say we are in opposite camps, I just think that whatever has
been done so far to get people to switch hasn't worked.

> > > Can we extend 'bridge' tool with extra options to provide a summary
> > > view of all bridges like brctl ?  
> >
> > We could, and I initially thought of that approach instead, but that
> > has a number of fundamental downsides:
> >
> > - we can't provide a brctl-compatible syntax, unless we want to
> >   substantially rewrite the 'bridge' interface, and I think it's a
> >   bad idea to break 'bridge' syntax for users, while we won't be able to
> >   replace brctl if we don't provide a similar syntax, history showed  
> 
> I am certainly not suggesting we break existing bridge users. I am
> talking about new options.

Let's take 'brctl showstp br0'. I wouldn't even know where to start
adding options to 'bridge' to have a similar functionality (any idea?),
and still, that brctl command is very convenient.

> I understand some people are finding it hard to move away from brctl
> output, but in my experience,
> these are also the people who want new things in brctl like json
> output etc. which is already available in the bridge command

Not in my experience: the output of brctl showmacs and showstp commands
is human-readable, that's what a large category of users need. JSON is
well suited for other purposes.

> > - the fdb implementation has a long-dated comment by Stephen in its
> >   header,
> >         * TODO: merge/replace this with ip neighbour
> >   and this is actually the only part of 'bridge' I'm using in ip-brctl.
> >   Code is conceptually duplicated there, and I think we should actually
> >   get rid of that -- but then 'bridge' wouldn't even give information
> >   about the FDB, one would need to use ip neighbour instead.  
> 
> This could be comment from initial days. Today bridge has support for
> fdb, vlans and vlan tunnels which you
> cannot get from brctl and any brctl compat tool.

This is unrelated. I'm specifically referring to the fdb information,
which is similar to what ip neighbour displays, and which you can
indeed get with brctl, see cmd_showmacs() from my script. 

> > - 'bridge' doesn't implement settings for basic bridge features (say,
> >   STP), which are convenient for users, especially if they are used to
> >   brctl. To get that, even at an interface/syntax level, we would need
> >   to duplicate some parts of ip-link, which looks like a bad idea per
> >   se.  
> 
> thats fine IMO. Today ip link set extended bridge attribute support is
> only for convenience.
>
> You can set most attributes both from ip link set and bridge link
> command. We can see if they can share code.

*This* is exactly what looks confusing to me.

> You can set a vlan on a bridge today via the bridge command. I dont
> see why we should hesitate about STP here.
> And you will get the json output for free.

I'm not particularly interested in non-human users here, not in the
context of this discussion at least.

> > > Its supposed to be the netlink based tool for all bridging and hence
> > > could be a good replacement for all brctl users.  
> >
> > I still think the best replacement for users is the one that changes
> > absolutely nothing, and if that's easily achievable, I'd rather go for
> > it.  
> 
> That would also mean we add ip-ifconfig and ip-ethtool (if we
> deprecate ethtool tomorrow. i am not saying its going away....,
> but just giving you an example of ioctl to netlink based tools).

Again, it's not the same as ifconfig, for the reasons I mentioned
above. And by the way I think ifconfig doesn't even vaguely have the
same amount of users of brctl nowadays, and for a reason, but I would
only have anecdotal experience to support this.

-- 
Stefano

^ permalink raw reply

* Re: [PATCH iproute2-next] Introduce ip-brctl shell script
From: Stefano Brivio @ 2019-01-31 12:46 UTC (permalink / raw)
  To: David Ahern
  Cc: Phil Sutter, Eric Garver, Tomas Dolezal, Stephen Hemminger,
	Lennert Buytenhek, netdev
In-Reply-To: <d2cd931e-5348-a427-5935-b259e047e14a@gmail.com>

On Wed, 30 Jan 2019 22:12:45 -0700
David Ahern <dsahern@gmail.com> wrote:

> This is a convenience wrapper around commands packaged in iproute2. If
> iproute2 adds this wrapper, it will have to carry it and maintain it
> forever. Distributions (Fedora, RHEL, Debian, etc) may see it
> differently and decide to add this patch onto iproute2 that they
> distribute as a means for dropping bridge-utils. That's a reasonable
> migration choice. It is just not something upstream iproute2 should carry.

I see what you mean now, I didn't think of that. Sure, that also
sounds reasonable.

I still think this wrapper would require basically zero maintenance,
and carrying it would outweigh the burden at large -- and we could also
use it as a tool to get users familiar with ip-link and 'bridge' by
printing the equivalent syntax before executing the commands.

Anyway, I'll inform downstream maintainers about this option.

-- 
Stefano

^ permalink raw reply

* Re: BUG: KASAN: double-free or invalid-free in ip_defrag after upgrade from 4.19.13
From: Greg Kroah-Hartman @ 2019-01-31 12:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ivan Babrou, Michal Kubecek, Linux Kernel Network Developers,
	David S. Miller, Ignat Korchagin, Shawn Bohrer, Jakub Sitnicki
In-Reply-To: <CANn89iJaTAHvKesjibA91hP-1cqHUzBRVmxBHrw0G=9K0twsfw@mail.gmail.com>

On Wed, Jan 30, 2019 at 03:16:56PM -0800, Eric Dumazet wrote:
> On Wed, Jan 30, 2019 at 3:13 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Jan 30, 2019 at 3:09 PM Ivan Babrou <ivan@cloudflare.com> wrote:
> > >
> > > Eric,
> > >
> > > Are you going to propose the change then?
> > >
> > > I'm happy to test it out.
> > >
> >
> > This is indeed a bug in linux stable tree only.
> >
> > The err=-EINVAL move was part of a patch that was not backported
> > (since it was not a bug fix)
> >
> > commit 0ff89efb524631ac9901b81446b453c29711c376
> > Author: Peter Oskolkov <posk@google.com>
> > Date:   Tue Aug 28 11:36:19 2018 -0700
> >
> >     ip: fail fast on IP defrag errors
> >
> >
> 
> Greg, the fix for 4.19 (and maybe other stable trees ?) would be :
> 
> diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
> index f8bbd693c19c247e41839c2d0b5318ca51b23ee8..d95b32af4a0e3f552405c9e61cc372729834160c
> 100644
> --- a/net/ipv4/ip_fragment.c
> +++ b/net/ipv4/ip_fragment.c
> @@ -425,6 +425,7 @@ static int ip_frag_queue(struct ipq *qp, struct
> sk_buff *skb)
>          * fragment.
>          */
> 
> +       err = -EINVAL;
>         /* Find out where to put this fragment.  */
>         prev_tail = qp->q.fragments_tail;
>         if (!prev_tail)
> @@ -501,7 +502,6 @@ static int ip_frag_queue(struct ipq *qp, struct
> sk_buff *skb)
> 
>  discard_qp:
>         inet_frag_kill(&qp->q);
> -       err = -EINVAL;
>         __IP_INC_STATS(net, IPSTATS_MIB_REASM_OVERLAPS);
>  err:
>         kfree_skb(skb);
> 

Thanks for this, I'll turn this into a real patch and backport it to
where it is needed.

greg k-h

^ permalink raw reply

* Re: [PATCH iproute2-next] Introduce ip-brctl shell script
From: Stefano Brivio @ 2019-01-31 12:49 UTC (permalink / raw)
  To: Phil Sutter, Alexander Wirt, Luca Boccassi,
	Santiago Garcia Mantinan, Steve Langasek, Christian Hesse,
	Ronald van Haren, Terry Chvatal
  Cc: David Ahern, Eric Garver, Tomas Dolezal, Stephen Hemminger,
	Lennert Buytenhek, netdev
In-Reply-To: <ed6b04eab48a70d6416a6b021f04f9901f7e9f01.1547830302.git.sbrivio@redhat.com>

Hi,

For your information: I recently submitted a patch that implements
brctl as a wrapper using ip-link and 'bridge' from iproute2, that
allows to drop bridge-utils while still providing brctl functionality:

	https://patchwork.ozlabs.org/patch/1027627/

This wasn't exactly met with enthusiasm upstream (full discussion at:
https://marc.info/?l=linux-netdev&m=154783087917432), but feel free to
carry it downstream if you're interested.

Thanks,

-- 
Stefano

On Fri, 18 Jan 2019 18:00:45 +0100
Stefano Brivio <sbrivio@redhat.com> wrote:

> This script wraps 'ip' and 'bridge' tools to provide a drop-in replacement
> of the standalone 'brctl' utility.
> 
> It's bug-to-bug compatible with brctl as of bridge-utils version 1.6,
> has no dependencies other than a POSIX shell, and it's less than half
> the binary size of brctl on x86_64.
> 
> As many users (including myself) seem to find brctl usage vastly more
> intuitive than ip-link, possibly due to habit, this might be a lightweight
> approach to provide brctl syntax without the need to maintain bridge-utils
> any longer.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> Acked-by: Phil Sutter <phil@nwl.cc>
> ---
>  man/man8/Makefile   |   5 +-
>  man/man8/ip-brctl.8 | 187 +++++++++++++++
>  misc/Makefile       |   9 +-
>  misc/ip-brctl.in    | 572 ++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 770 insertions(+), 3 deletions(-)
>  create mode 100644 man/man8/ip-brctl.8
>  create mode 100755 misc/ip-brctl.in
> 
> diff --git a/man/man8/Makefile b/man/man8/Makefile
> index 932ba1f3c488..26d2370a9cbe 100644
> --- a/man/man8/Makefile
> +++ b/man/man8/Makefile
> @@ -1,5 +1,5 @@
>  # SPDX-License-Identifier: GPL-2.0
> -TARGETS = ip-address.8 ip-link.8 ip-route.8
> +TARGETS = ip-address.8 ip-link.8 ip-route.8 brctl.8
>  
>  MAN8PAGES = $(TARGETS) $(filter-out $(TARGETS),$(wildcard *.8))
>  
> @@ -14,6 +14,9 @@ ip-link.8: ip-link.8.in
>  ip-route.8: ip-route.8.in
>  	sed "s|@SYSCONFDIR@|$(CONFDIR)|g" $< > $@
>  
> +brctl.8: ip-brctl.8
> +	echo '.so man8/ip-brctl.8' >$@
> +
>  distclean: clean
>  
>  clean:
> diff --git a/man/man8/ip-brctl.8 b/man/man8/ip-brctl.8
> new file mode 100644
> index 000000000000..63c1bcdebe9f
> --- /dev/null
> +++ b/man/man8/ip-brctl.8
> @@ -0,0 +1,187 @@
> +.\"
> +.\"	This program is free software; you can redistribute it and/or modify
> +.\"	it under the terms of the GNU General Public License as published by
> +.\"	the Free Software Foundation; either version 2 of the License, or
> +.\"	(at your option) any later version.
> +.\"
> +.\"	This program is distributed in the hope that it will be useful,
> +.\"	but WITHOUT ANY WARRANTY; without even the implied warranty of
> +.\"	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +.\"	GNU General Public License for more details.
> +.\"
> +.\"	You should have received a copy of the GNU General Public License
> +.\"	along with this program; if not, write to the Free Software
> +.\"	Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
> +.\"
> +.\"
> +.TH IP-BRCTL 8 "January 18, 2019" "" ""
> +.SH NAME
> +ip-brctl \- ethernet bridge administration
> +.SH SYNOPSIS
> +.BR "brctl [command]"
> +.SH DESCRIPTION
> +.B ip-brctl
> +is a reimplementation of the traditional
> +.B brctl
> +utility in shell making use of
> +.BR ip " and " bridge
> +tools internally. It is supposed to behave identically to
> +.BR brctl ,
> +hence the remainder of this document will use that name instead of
> +.BR ip-brctl .
> +
> +.B brctl
> +is used to set up, maintain, and inspect the ethernet bridge
> +configuration in the Linux kernel.
> +
> +An ethernet bridge is a device commonly used to connect different
> +networks of ethernets together, so that these ethernets will appear as
> +one ethernet to the participants.
> +
> +Each of the ethernets being connected corresponds to one physical
> +interface in the bridge. These individual ethernets are bundled into
> +one bigger ('logical') ethernet, this bigger ethernet corresponds to
> +the bridge network interface.
> +
> +
> +.SH INSTANCES
> +The command
> +.B brctl addbr <name>
> +creates a new instance of the ethernet bridge. The network interface
> +corresponding to the bridge will be called <name>.
> +
> +The command
> +.B brctl delbr <name>
> +deletes the instance <name> of the ethernet bridge. The network
> +interface corresponding to the bridge must be down before it can be
> +deleted!
> +
> +The command
> +.B brctl show
> +shows all current instances of the ethernet bridge.
> +
> +
> +.SH PORTS
> +Each bridge has a number of ports attached to it. Network traffic
> +coming in on any of these ports will be forwarded to the other ports
> +transparently, so that the bridge is invisible to the rest of the
> +network (i.e. it will not show up in
> +.IR traceroute(8)
> +).
> +
> +The command
> +.B brctl addif <brname> <ifname>
> +will make the interface <ifname> a port of the bridge <brname>. This
> +means that all frames received on <ifname> will be processed as if
> +destined for the bridge. Also, when sending frames on <brname>,
> +<ifname> will be considered as a potential output interface.
> +
> +The command
> +.B brctl delif <brname> <ifname>
> +will detach the interface <ifname> from the bridge <brname>.
> +
> +The command
> +.B brctl show <brname>
> +will show some information on the bridge and its attached ports.
> +
> +
> +.SH AGEING
> +The bridge keeps track of ethernet addresses seen on each port. When
> +it needs to forward a frame, and it happens to know on which port the
> +destination ethernet address (specified in the frame) is located, it
> +can 'cheat' by forwarding the frame to that port only, thus saving a
> +lot of redundant copies and transmits.
> +
> +However, the ethernet address location data is not static
> +data. Machines can move to other ports, network cards can be replaced
> +(which changes the machine's ethernet address), etc.
> +
> +.B brctl showmacs <brname>
> +shows a list of learned MAC addresses for this bridge.
> +
> +.B brctl setageing <brname> <time>
> +sets the ethernet (MAC) address ageing time, in seconds. After <time>
> +seconds of not having seen a frame coming from a certain address, the
> +bridge will time out (delete) that address from the Forwarding
> +DataBase (fdb).
> +
> +.B brctl setgcint <brname> <time>
> +sets the garbage collection interval for the bridge <brname> to <time>
> +seconds. This means that the bridge will check the forwarding database
> +for timed out entries every <time> seconds.
> +
> +
> +.SH SPANNING TREE PROTOCOL
> +Multiple ethernet bridges can work together to create even larger
> +networks of ethernets using the IEEE 802.1d spanning tree
> +protocol. This protocol is used for finding the shortest path between
> +two ethernets, and for eliminating loops from the topology. As this
> +protocol is a standard, Linux bridges will interwork properly with
> +other third party bridge products. Bridges communicate with each other
> +by sending and receiving BPDUs (Bridge Protocol Data Units). These
> +BPDUs can be recognised by an ethernet destination address of
> +01:80:c2:00:00:00.
> +
> +The spanning tree protocol can also be turned off (for those
> +situations where it just doesn't make sense, for example when this
> +Linux box is the only bridge on the LAN, or when you know that there
> +are no loops in the topology.)
> +
> +.IR brctl(8)
> +can be used for configuring certain spanning tree protocol
> +parameters. For an explanation of these parameters, see the IEEE
> +802.1d specification (or send me an email). The default values should
> +be just fine. If you don't know what these parameters mean, you
> +probably won't feel the desire to tweak them.
> +
> +.B brctl stp <bridge> <state>
> +controls this bridge instance's participation in the spanning tree
> +protocol. If <state> is "on" or "yes" the STP will be turned on,
> +otherwise it will be turned off.  When turned off, the bridge will not
> +send or receive BPDUs, and will thus not participate in the spanning
> +tree protocol. If your bridge isn't the only bridge on the LAN, or if
> +there are loops in the LAN's topology, DO NOT turn this option off. If
> +you turn this option off, please know what you are doing.
> +
> +
> +.B brctl setbridgeprio <bridge> <priority>
> +sets the bridge's priority to <priority>. The priority value is an
> +unsigned 16-bit quantity (a number between 0 and 65535), and has no
> +dimension. Lower priority values are 'better'. The bridge with the
> +lowest priority will be elected 'root bridge'.
> +
> +.B brctl setfd <bridge> <time>
> +sets the bridge's 'bridge forward delay' to <time> seconds.
> +
> +.B brctl sethello <bridge> <time>
> +sets the bridge's 'bridge hello time' to <time> seconds.
> +
> +.B brctl setmaxage <bridge> <time>
> +sets the bridge's 'maximum message age' to <time> seconds.
> +
> +.B brctl setpathcost <bridge> <port> <cost>
> +sets the port cost of the port <port> to <cost>. This is a
> +dimensionless metric.
> +
> +.B brctl setportprio <bridge> <port> <priority>
> +sets the port <port>'s priority to <priority>. The priority value is
> +an unsigned 8-bit quantity (a number between 0 and 255), and has no
> +dimension. This metric is used in the designated port and root port
> +selection algorithms.
> +
> +.SH NOTES
> +.BR brctl(8)
> +is obsolete. Some features such as STP guard, harpin mode, fastleave and root
> +block are intentionally not implemented in this command.
> +Instead use
> +.B bridge
> +command from
> +.B iproute2
> +package for a more full set of features.
> +
> +.SH SEE ALSO
> +.BR iptables(8)
> +
> +.SH AUTHOR
> +Lennert Buytenhek <buytenh@gnu.org>
> +Stephen Hemminger <stephen@networkplumber.org>
> diff --git a/misc/Makefile b/misc/Makefile
> index 6a849af4be22..79c6be31b874 100644
> --- a/misc/Makefile
> +++ b/misc/Makefile
> @@ -3,6 +3,7 @@ SSOBJ=ss.o ssfilter.o
>  LNSTATOBJ=lnstat.o lnstat_util.o
>  
>  TARGETS=ss nstat ifstat rtacct lnstat
> +SCRIPTS=ip-brctl
>  
>  include ../config.mk
>  
> @@ -10,7 +11,7 @@ ifeq ($(HAVE_BERKELEY_DB),y)
>  	TARGETS += arpd
>  endif
>  
> -all: $(TARGETS)
> +all: $(TARGETS) $(SCRIPTS)
>  
>  ss: $(SSOBJ)
>  	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
> @@ -33,10 +34,14 @@ ssfilter.c: ssfilter.y
>  lnstat: $(LNSTATOBJ)
>  	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
>  
> +ip-brctl: ip-brctl.in
> +	@sed -e "s|^\(IP=\"\).*\"|\1$(DESTDIR)$(SBINDIR)\/ip\"|" -e "s|^\(BRIDGE=\"\).*\"|\1$(DESTDIR)$(SBINDIR)\/bridge\"|" $< > $@
> +
>  install: all
> -	install -m 0755 $(TARGETS) $(DESTDIR)$(SBINDIR)
> +	install -m 0755 $(TARGETS) $(SCRIPTS) $(DESTDIR)$(SBINDIR)
>  	ln -sf lnstat $(DESTDIR)$(SBINDIR)/rtstat
>  	ln -sf lnstat $(DESTDIR)$(SBINDIR)/ctstat
> +	ln -sf ip-brctl $(DESTDIR)$(SBINDIR)/brctl
>  
>  clean:
>  	rm -f *.o $(TARGETS) ssfilter.c
> diff --git a/misc/ip-brctl.in b/misc/ip-brctl.in
> new file mode 100755
> index 000000000000..7e7059aa2c06
> --- /dev/null
> +++ b/misc/ip-brctl.in
> @@ -0,0 +1,572 @@
> +#!/bin/sh
> +#
> +# ip-brctl: Implementation of brctl from bridge-utils as iproute2 wrapper
> +#
> +# Copyright (c) 2019 Red Hat, Inc.
> +# Author: Stefano Brivio <sbrivio@redhat.com>
> +#
> +# SPDX-License-Identifier: GPL-2.0
> +
> +# Global variables #############################################################
> +
> +# These will be replaced by Makefile on install with ip and bridge install paths
> +IP="../ip/ip"
> +BRIDGE="../bridge/bridge"
> +
> +usage_text="Usage: brctl [commands]
> +commands:
> +	addbr     	<bridge>		add bridge
> +	delbr     	<bridge>		delete bridge
> +	addif     	<bridge> <device>	add interface to bridge
> +	delif     	<bridge> <device>	delete interface from bridge
> +	hairpin   	<bridge> <port> {on|off}	turn hairpin on/off
> +	setageing 	<bridge> <time>		set ageing time
> +	setbridgeprio	<bridge> <prio>		set bridge priority
> +	setfd     	<bridge> <time>		set bridge forward delay
> +	sethello  	<bridge> <time>		set hello time
> +	setmaxage 	<bridge> <time>		set max message age
> +	setpathcost	<bridge> <port> <cost>	set path cost
> +	setportprio	<bridge> <port> <prio>	set port priority
> +	show      	[ <bridge> ]		show a list of bridges
> +	showmacs  	<bridge>		show a list of mac addrs
> +	showstp   	<bridge>		show bridge stp info
> +	stp       	<bridge> {on|off}	turn stp on/off"
> +
> +# List of commands prefixed by minimum number of arguments
> +commands="1 addbr 1 delbr 2 addif 2 delif 3 hairpin 2 setageing 2 setbridgeprio
> +	  2 setfd 2 sethello 2 setmaxage 3 setpathcost 3 setportprio 0 show
> +	  1 showmacs 1 showstp 2 stp"
> +
> +
> +# Helper functions #############################################################
> +
> +usage() {
> +	echo "${usage_text}"
> +	exit "${1:-1}"
> +}
> +
> +# Print a single line from usage for given command
> +usage_one() {
> +	command="${1}"
> +
> +	ifs="${IFS}"
> +	IFS='
> +'
> +	for line in ${usage_text}; do
> +		case ${line} in
> +		"	${command}"*)
> +			line="${line#	${command}*}"
> +			while [ "${line}" != "${line#[[:blank:]]*}" ]; do
> +				line="${line#[[:blank:]]*}"
> +			done
> +			echo "Usage: brctl ${command} ${line}"
> +			break
> +			;;
> +		esac
> +	done
> +	IFS="${ifs}"
> +
> +	exit 1
> +}
> +
> +# Print to standard error and exit
> +err() {
> +	echo "${@}" > /dev/stderr
> +	exit 1
> +}
> +
> +# Output token following the given one, from a space-separated string
> +parse_next() {
> +	needle=${1}
> +	str=${2}
> +
> +	next=0
> +	ifs="${IFS}"
> +	IFS=' '
> +	for token in ${str}; do
> +		[ ${next} -eq 1 ] && echo "${token}" && break
> +		[ "${token}" = "${needle}" ] && next=1
> +	done
> +	IFS="${ifs}"
> +}
> +
> +# Output value for given ip-link attribute, for the given device
> +parse_iplink() {
> +	attr="${1}"
> +	dev="${2}"
> +
> +	out="$(${IP} -d link show dev "${dev}" 2>/dev/null)"
> +	parse_next "${attr}" "${out}"
> +}
> +
> +# Once starting token is matched, for each token x with index n = 2 * k, assign
> +# value of token with index n + 1 to a variable named by the value of x, using
> +# state variables parse_assign_start and parse_assign_prev. Pass one token at
> +# a time.
> +parse_assign() {
> +	token="${1}"
> +	start_token="${2}"
> +
> +	if [ "${token}" = "${start_token}" ]; then
> +		parse_assign_start=1
> +		parse_assign_prev=
> +	elif [ ${parse_assign_start} -eq 0 ]; then
> +		:
> +	elif [ -z "${parse_assign_prev}" ]; then
> +		parse_assign_prev="${token}"
> +	else
> +		eval "${parse_assign_prev}"="${token}"
> +		parse_assign_prev=
> +	fi
> +}
> +
> +# Execute ip-link command with given arguments. On failure, print returned error
> +# message prefixed by given string and exit.
> +exec_iplink() {
> +	args="${1}"
> +	err_prefix="${2}"
> +
> +	err_out="$(${IP} link "${args}" 2>&1)" || err "${err_prefix}: ${err_out#*:}"
> +}
> +
> +# Print passed error string and exit if the given device does or does not exist,
> +# depending on the condition given. Device type matching is optional.
> +err_dev_exists() {
> +	cond="${1}"
> +	dev="${2}"
> +	err_string="${3}"
> +	opt_type="${4}"
> +
> +	[ -n "${opt_type}" ] && opt_type="type ${opt_type}" || opt_type=
> +
> +	if [ -n "$(${IP} link show dev "${dev}" ${opt_type} 2>/dev/null)" ]; then
> +		[ "${cond}" = "y" ] && err "${err_string}"
> +	else
> +		[ "${cond}" = "n" ] && err "${err_string}"
> +	fi
> +}
> +
> +
> +# Type checks and conversions ##################################################
> +
> +# Don't attempt operations on values exceeding limits for a signed long. From
> +# POSIX.1-2017, XCU, par. 2.6.4 Arithmetic Expansion:
> +#
> +#	Only signed long integer arithmetic is required.
> +#
> +# On failure, print returned error message prefixed by given string and exit.
> +check_long() {
> +	in="${1}"
> +	err_prefix="${2}"
> +
> +	# Comparing x >= 2^31 may fail, allow up to one digit shorter than that
> +	long_max=2147483647
> +	if [ ${#in} -ge ${#long_max} ]; then
> +		err "${err_prefix}: Numerical result out of range"
> +	fi
> +}
> +
> +# On failure, print passed error string and exit
> +check_float() {
> +	in="${1}"
> +	err_string="${2}"
> +
> +	case ${in} in
> +	[0-9.]*) ;;
> +	*) err "${err_string}" ;;
> +	esac
> +}
> +
> +# Convert values allowed as boolean to given strings for false and true values
> +make_bool() {
> +	in="${1}"
> +	str_false="${2}"
> +	str_true="${2}"
> +
> +	case ${in} in
> +	off|no|0) echo "${str_false}" ;;
> +	on|yes|1) echo "${str_true}" ;;
> +	*) err "expect on/off for argument" ;;
> +	esac
> +}
> +
> +# Divide by 100, output number with two fractional digits
> +make_float() {
> +	in="${1}"
> +
> +	int=$((in / 100))
> +	frac=$((in % 100))
> +	[ ${#frac} -eq 1 ] && frac="0${frac}"
> +	echo "${int}.${frac}"
> +}
> +
> +# Multiply given float by 100. Multiple decimal separators are allowed, anything
> +# after the second one is discarded.
> +brctl_timeval() {
> +	in="${1}"
> +
> +	# Drop second decimal separator and following digits, if any
> +	drop=${in#[0-9]*.[0-9]*.}
> +	time=${in%%.${drop}}
> +
> +	int=${time%.*}
> +
> +	# Take up to two digits from fractional part
> +	frac=${time#*.}
> +	drop=${frac#[0-9][0-9]*}
> +	frac=${frac%%${drop}}
> +	frac=${frac:-0}
> +	[ "${frac}" -lt 10 ] && frac=$((frac * 10))
> +
> +	echo "$((int * 100 + frac))"
> +}
> +
> +# Strip colons from MAC address in bridge identifiers, pad bytes with 0
> +fixup_id() {
> +	in="${1}"
> +
> +	ifs="${IFS}"
> +	IFS=':'
> +	out=
> +	for byte in ${in}; do
> +		[ ${#byte} -eq 1 ] && byte="0${byte}"
> +		out="${out}${byte}"
> +	done
> +	IFS="${ifs}"
> +	echo "${out}"
> +}
> +
> +
> +# Display functions ############################################################
> +
> +# Pad string to given width -- if not given, width is 7 if float, 4 otherwise
> +pad_width() {
> +	str="${1}"
> +	[ "${str}" != "${str%.*}" ] && width=${2:-7} || width=${2:-4}
> +
> +	len=${#str}
> +	while [ "${len}" -lt "${width}" ]; do
> +		str=" ${str}"
> +		len=$((len + 1))
> +	done
> +	echo "${str}"
> +}
> +
> +# Dump bridge information from set variables in 'brctl showstp' format
> +#
> +# shellcheck disable=SC2154 # shellcheck won't see variables we set under eval
> +dump_bridge() {
> +	name="${1}"
> +
> +	for i in bridge_id designated_root; do
> +		eval ${i}="\$(fixup_id \$${i})"
> +	done
> +	for i in max_age hello_time forward_delay ageing_time; do
> +		eval ${i}="\$(make_float \$${i})"
> +	done
> +	for i in root_port root_path_cost max_age hello_time forward_delay \
> +		 ageing_time hello_timer tcn_timer topology_change_timer \
> +		 gc_timer; do
> +		eval ${i}=\""\$(pad_width \$${i})"\"
> +	done
> +	flags=
> +	[ "${topology_change}" != "0" ] && flags="TOPOLOGY_CHANGE "
> +	[ "${topology_change_detected}" -ne "0" ] && flags="${flags}TOPOLOGY_CHANGE_DETECTED "
> +
> +	echo "${name}"
> +	echo " bridge id		${bridge_id}"
> +	echo " designated root	${designated_root}"
> +	echo " root port		${root_port}			path cost		${root_path_cost}"
> +	echo " max age		${max_age}			bridge max age		${max_age}"
> +	echo " hello time		${hello_time}			bridge hello time	${hello_time}"
> +	echo " forward delay		${forward_delay}			bridge forward delay	${forward_delay}"
> +	echo " ageing time		${ageing_time}"
> +	echo " hello timer		${hello_timer}			tcn timer		${tcn_timer}"
> +	echo " topology change timer	${topology_change_timer}			gc timer		${gc_timer}"
> +	echo " flags			${flags}"
> +	echo
> +	echo
> +}
> +
> +# Dump port information from set variables in 'brctl showstp' format
> +#
> +# shellcheck disable=SC2154 # shellcheck won't see variables we set under eval
> +dump_port() {
> +	for i in designated_root designated_bridge; do
> +		eval ${i}="\$(fixup_id \$${i})"
> +	done
> +
> +	for i in message_age_timer cost message_age_timer forward_delay_timer \
> +		 designated_cost hold_timer; do
> +		eval ${i}=\""\$(pad_width \$${i})"\"
> +	done
> +
> +	flags=
> +	[ "${config_pending}" != "0" ] && flags="CONFIG_PENDING "
> +	[ "${topology_change_ack}" -ne "0" ] && flags="${flags}TOPOLOGY_CHANGE_ACK "
> +
> +	echo "${port_name%*:} (${port_no#0x*})"
> +	echo " port id		${port_id#0x*}			state		$(pad_width "${state}" 15)"
> +	echo " designated root	${designated_root}	path cost		${cost}"
> +	echo " designated bridge	${designated_bridge}	message age timer	${message_age_timer}"
> +	echo " designated port	$((8000 + designated_port % 32768))			forward delay timer	${forward_delay_timer}"
> +	echo " designated cost	${designated_cost}			hold timer		${hold_timer}"
> +	echo " flags ${flags}"
> +	[ "${hairpin}" != "off" ] && echo " hairpin mode		$(pad_width 1)"
> +	echo
> +}
> +
> +
> +# Commands #####################################################################
> +
> +cmd_addbr() {
> +	err_dev_exists y "${1}" "device ${1} already exists; can't create bridge with the same name"
> +
> +	exec_iplink "add ${1} type bridge" "add bridge failed"
> +}
> +
> +cmd_delbr() {
> +	err_dev_exists n "${1}" "bridge ${1} doesn't exist; can't delete it"
> +	if [ "$(parse_iplink state "${1}")" != "DOWN" ]; then
> +		err "bridge ${1} is still up; can't delete it"
> +	fi
> +
> +	exec_iplink "del ${1}" "can't delete bridge ${1}"
> +}
> +
> +cmd_addif() {
> +	err_dev_exists n "${1}" "bridge ${1} does not exist!"
> +	err_dev_exists n "${2}" "interface ${2} does not exist!"
> +	if [ -n "$(parse_iplink master "${1}")" ]; then
> +		err "device ${2} is already a member of a bridge; can't enslave it to bridge ${1}."
> +	fi
> +	err_dev_exists y "${1}" "device ${1} is a bridge device itself; can't enslave a bridge device to a bridge device." bridge
> +
> +	exec_iplink "set ${2} master ${1}" "can't add ${2} to bridge ${1}"
> +}
> +
> +cmd_delif() {
> +	err_dev_exists n "${1}" "bridge ${1} does not exist!"
> +	err_dev_exists n "${2}" "interface ${2} does not exist!"
> +	if [ "$(parse_iplink master "${2}")" != "${1}" ]; then
> +		err "device ${1} is not a slave of ${2}"
> +	fi
> +
> +	exec_iplink "set ${2} nomaster" "can't delete ${2} from ${1}"
> +}
> +
> +cmd_hairpin() {
> +	hairpin="$(make_bool "${3}" off on)"
> +	[ -z "${hairpin}" ] && exit 1
> +	err_dev_exists n "${2}" "interface ${2} does not exist!"
> +	err_dev_exists n "${1}" "bridge ${1} does not exist!"
> +
> +	exec_iplink "set ${2} type bridge_slave ${hairpin}" "can't set ${2} to hairpin on bridge ${1}"
> +}
> +
> +cmd_setageing() {
> +	check_long "${2}" "set ageing time failed"
> +	check_float "${2}" "bad ageing time value"
> +	err_dev_exists n "${1}" "set ageing time failed: No such device"
> +
> +	exec_iplink "set ${1} type bridge ageing_time $(brctl_timeval "${2}")" "set ageing time failed"
> +}
> +
> +cmd_setbridgeprio() {
> +	check_long "${2}" "set bridge priority failed"
> +	check_float "${2}" "bad priority"
> +	err_dev_exists n "${1}" "set bridge priority failed: No such device"
> +
> +	prio=${2%%.*}
> +	prio=$((prio % 65536))
> +
> +	exec_iplink "set ${1} type bridge priority ${prio}" "set bridge priority failed"
> +}
> +
> +cmd_setfd() {
> +	check_long "${2}" "set forward delay failed"
> +	check_float "${2}" "bad ageing time value"
> +	err_dev_exists n "${1}" "set forward delay failed: No such device"
> +
> +	exec_iplink "set ${1} type bridge forward_delay $(brctl_timeval "${2}")" "set forward delay failed"
> +}
> +
> +cmd_sethello() {
> +	check_long "${2}" "set ageing time failed"
> +	check_float "${2}" "bad hello timer value"
> +	err_dev_exists n "${1}" "set hello timer failed: No such device"
> +
> +	exec_iplink "set ${1} type bridge hello_time $(brctl_timeval "${2}")" "set hello timer failed"
> +}
> +
> +cmd_setmaxage() {
> +	check_long "${2}" "set max age failed"
> +	check_float "${2}" "bad max age value"
> +	err_dev_exists n "${1}" "set max age failed: No such device"
> +
> +	exec_iplink "set ${1} max_age $(brctl_timeval "${2}")" "set max age failed"
> +}
> +
> +cmd_setpathcost() {
> +	check_long "${3}" "set path cost failed"
> +	check_float "${3}" "bad path cost value"
> +	err_dev_exists n "${2}" "set path cost failed: No such device"
> +
> +	cost=${3%%.*}
> +
> +	exec_iplink "set ${2} type bridge_slave cost ${cost}" "set path cost failed"
> +}
> +
> +cmd_setportprio() {
> +	check_long "${3}" "set port priority failed"
> +	check_float "${3}" "bad path priority value"
> +	err_dev_exists n "${2}" "set port priority failed: No such device"
> +
> +	prio=${3%%.*}
> +	[ "${prio}" -ge 64 ] && err "set port priority failed: Numerical result out of range"
> +
> +	exec_iplink "set ${2} type bridge_slave priority ${prio}" "set port priority failed"
> +}
> +
> +cmd_stp() {
> +	stp="$(make_bool "${2}" 0 1)"
> +	[ -z "${stp}" ] && exit 1
> +	err_dev_exists n "${1}" "set stp status failed: No such device"
> +
> +	exec_iplink "set ${1} type bridge stp_state ${stp}" "set stp status failed"
> +}
> +
> +cmd_showstp() {
> +	parse_assign_start=0
> +	for t in $(${IP} -d link show "${1}" 2>/dev/null); do
> +		parse_assign "${t}" bridge
> +	done
> +	[ ${parse_assign_start} -eq 0 ] && err "${1}: can't get info No such device"
> +
> +	dump_bridge "${1}"
> +
> +	parse_assign_start=0
> +	for t in $(${IP} -d link show type bridge_slave master "${1}" 2>/dev/null); do
> +		case ${t} in
> +		[0-9]*:)
> +			[ -n "${port_name}" ] && dump_port
> +			port_name=
> +			parse_assign_start=0
> +			continue
> +			;;
> +		esac
> +		[ -z "${port_name}" ] && port_name="${t}"
> +
> +		parse_assign "${t}" bridge_slave
> +	done
> +
> +	[ -n "${port_name}" ] && dump_port
> +}
> +
> +cmd_show() {
> +	dev="${1}"
> +
> +	if [ -n "${dev}" ]; then
> +		err_dev_exists n "${dev}" "bridge ${dev} does not exist!"
> +		err_dev_exists n "${dev}" "device ${dev} is not a bridge!" bridge
> +	fi
> +
> +	echo "bridge name	bridge id		STP enabled	interfaces"
> +	ifs="${IFS}"
> +	IFS='
> +'
> +	for line in $(${IP} -br link show type bridge "${dev}" 2>/dev/null); do
> +		name="${line%% *}"
> +		id="$(fixup_id "$(parse_iplink bridge_id "${name}")")"
> +		[ "$(parse_iplink stp_state "${name}")" = "0" ] && stp="no" || stp="yes"
> +		first=1
> +		for s in $(${IP} -br link show type bridge_slave master "${name}"); do
> +			if [ ${first} -eq 1 ]; then
> +				echo "${name}		${id}	${stp}		${s%% *}"
> +				first=0
> +			else
> +				echo "							${s%% *}"
> +			fi
> +		done
> +		[ ${first} -eq 1 ] && echo "${name}		${id}	${stp}"
> +	done
> +	IFS="${ifs}"
> +}
> +
> +cmd_showmacs() {
> +	err_dev_exists n "${1}" "read of forward table failed: No such device"
> +
> +	echo "port no	mac addr		is local?	ageing timer"
> +	ifs="${IFS}"
> +	IFS='
> +'
> +	for s in $(${IP} -br link show type bridge_slave master "${1}"); do
> +		for fdb_entry in $(${BRIDGE} -s fdb show dev "${s%% *}"); do
> +			case ${fdb_entry} in
> +			*" ${1} "*) ;;
> +			*) continue ;;
> +			esac
> +
> +			case ${fdb_entry} in
> +			*" permanent"*)
> +				is_local="yes"
> +				timer="$(pad_width "0.00")"
> +				;;
> +			*)
> +				is_local="no"
> +				timer="$(parse_next used "${fdb_entry}")"
> +				timer="$(pad_width "${timer#*/}.00")"
> +				;;
> +			esac
> +
> +			port_no="$(parse_iplink port_no "${s%% *}")"
> +			port_no="${port_no#0x*}"
> +
> +			echo "$(pad_width "${port_no}" 3)	${fdb_entry%% *}	${is_local}		${timer}"
> +		done
> +	done
> +	IFS="${ifs}"
> +}
> +
> +
> +# Start here ###################################################################
> +
> +cmdname="${1}"
> +[ -z "${cmdname}" ] && usage
> +
> +for arg do
> +	case ${arg} in
> +	"-V"|"--v"*)
> +		echo "ip-link wrapper, compatible with bridge-utils, 1.6"
> +		exit 0
> +		;;
> +	"-h"|"--h"*)
> +		usage 0
> +		;;
> +	"--")
> +		usage
> +		;;
> +	"-")
> +		break
> +		;;
> +	"-"*)
> +		echo "${0}: unrecognized option '${arg}'" >/dev/stderr
> +		echo "Unknown option '?'" >/dev/stderr
> +		usage
> +		;;
> +	esac
> +done
> +
> +found=0
> +min_arg=
> +for cmd in ${commands}; do
> +	[ -n "${min_arg}" ] && [ "${cmd}" = "${cmdname}" ] && found=1 && break
> +	min_arg="${cmd}"
> +done
> +[ ${found} -eq 0 ] && echo "never heard of command [${cmdname}]" && usage
> +[ ${#} -le "${min_arg}" ] && echo "Incorrect number of arguments for command" && usage_one "${cmd}"
> +
> +shift
> +eval "cmd_${cmdname}" "${@}"
> +
> +exit 0


^ permalink raw reply

* [PATCH net] net/mlx4_en: Force CHECKSUM_NONE for short ethernet frames
From: Tariq Toukan @ 2019-01-31 13:02 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, Eric Dumazet,
	Tariq Toukan

From: Saeed Mahameed <saeedm@mellanox.com>

When an ethernet frame is padded to meet the minimum ethernet frame
size, the padding octets are not covered by the hardware checksum.
Fortunately the padding octets are usually zero's, which don't affect
checksum. However, it is not guaranteed. For example, switches might
choose to make other use of these octets.
This repeatedly causes kernel hardware checksum fault.

Prior to the cited commit below, skb checksum was forced to be
CHECKSUM_NONE when padding is detected. After it, we need to keep
skb->csum updated. However, fixing up CHECKSUM_COMPLETE requires to
verify and parse IP headers, it does not worth the effort as the packets
are so small that CHECKSUM_COMPLETE has no significant advantage.

Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

Hi Dave,
Please queue for -stable >= v4.18.

Thanks,
Tariq

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 9a0881cb7f51..fc8a11444e1a 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -617,6 +617,8 @@ static int get_fixed_ipv6_csum(__wsum hw_checksum, struct sk_buff *skb,
 }
 #endif
 
+#define short_frame(size) ((size) <= ETH_ZLEN + ETH_FCS_LEN)
+
 /* We reach this function only after checking that any of
  * the (IPv4 | IPv6) bits are set in cqe->status.
  */
@@ -624,9 +626,22 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
 		      netdev_features_t dev_features)
 {
 	__wsum hw_checksum = 0;
+	void *hdr;
+
+	/* CQE csum doesn't cover padding octets in short ethernet
+	 * frames. And the pad field is appended prior to calculating
+	 * and appending the FCS field.
+	 *
+	 * Detecting these padded frames requires to verify and parse
+	 * IP headers, so we simply force all those small frames to be
+	 * CHECKSUM_NONE even if they are not padded.
+	 * TODO: better if we report CHECKSUM_UNNECESSARY but this
+	 * demands some refactroing.
+	 */
+	if (short_frame(skb->len))
+		return -EINVAL;
 
-	void *hdr = (u8 *)va + sizeof(struct ethhdr);
-
+	hdr = (u8 *)va + sizeof(struct ethhdr);
 	hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
 
 	if (cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_CVLAN_PRESENT_MASK) &&
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH] ath10k: do not return invalid pointers as a *dentry
From: Greg Kroah-Hartman @ 2019-01-31 13:15 UTC (permalink / raw)
  To: Kalle Valo; +Cc: ath10k, linux-wireless, netdev

When calling debugfs functions, they can now return error values if
something went wrong.  If that happens, return a NULL as a *dentry to
the relay core instead of passing it an illegal pointer.

The relay core should be able to handle an illegal pointer, but add this
check to be safe.

Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: ath10k@lists.infradead.org
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 drivers/net/wireless/ath/ath10k/spectral.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/wireless/ath/ath10k/spectral.c b/drivers/net/wireless/ath/ath10k/spectral.c
index 653b6d013207..4a7fa3e4e99f 100644
--- a/drivers/net/wireless/ath/ath10k/spectral.c
+++ b/drivers/net/wireless/ath/ath10k/spectral.c
@@ -494,6 +494,9 @@ static struct dentry *create_buf_file_handler(const char *filename,
 
 	buf_file = debugfs_create_file(filename, mode, parent, buf,
 				       &relay_file_operations);
+	if (IS_ERR(buf_file))
+		return NULL;
+
 	*is_global = 1;
 	return buf_file;
 }
-- 
2.20.1


^ permalink raw reply related

* [PATCH] ath9k: do not return invalid pointers as a *dentry
From: Greg Kroah-Hartman @ 2019-01-31 13:16 UTC (permalink / raw)
  To: linux-wireless, netdev; +Cc: Kalle Valo, QCA ath9k Development

When calling debugfs functions, they can now return error values if
something went wrong.  If that happens, return a NULL as a *dentry to
the relay core instead of passing it an illegal pointer.

The relay core should be able to handle an illegal pointer, but add this
check to be safe.

Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: QCA ath9k Development <ath9k-devel@qca.qualcomm.com>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 drivers/net/wireless/ath/ath9k/common-spectral.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/wireless/ath/ath9k/common-spectral.c b/drivers/net/wireless/ath/ath9k/common-spectral.c
index 6aa3ec024ffa..21191955a7c1 100644
--- a/drivers/net/wireless/ath/ath9k/common-spectral.c
+++ b/drivers/net/wireless/ath/ath9k/common-spectral.c
@@ -1039,6 +1039,9 @@ static struct dentry *create_buf_file_handler(const char *filename,
 
 	buf_file = debugfs_create_file(filename, mode, parent, buf,
 				       &relay_file_operations);
+	if (IS_ERR(buf_file))
+		return NULL;
+
 	*is_global = 1;
 	return buf_file;
 }
-- 
2.20.1


^ permalink raw reply related

* Re: Kernel memory corruption in CIPSO labeled TCP packets processing.
From: Nazarov Sergey @ 2019-01-31 13:20 UTC (permalink / raw)
  To: Paul Moore
  Cc: linux-security-module@vger.kernel.org, selinux@vger.kernel.org,
	netdev@vger.kernel.org, Casey Schaufler
In-Reply-To: <CAHC9VhQcrnXmjcVw_ppwM-fGcwhJq9SafTnZKjgpQpxK0ie+Jw@mail.gmail.com>

31.01.2019, 05:10, "Paul Moore" <paul@paul-moore.com>:
> This isn't how the rest of the stack works, look at
> ip_local_deliver_finish() for one example. Perhaps the behavior you
> are proposing is correct, but please show me where in the various RFC
> specs it is defined so that I can better understand why it should work
> this way.
> --
> paul moore
> www.paul-moore.com

Sorry, I was inattentive. ip_options_compile modifies srr option data, only if
skb is NULL. My last message could be ignored.

^ permalink raw reply

* [PATCH] net: check negative value for signed refcnt
From: alexandre.besnard @ 2019-01-31 13:20 UTC (permalink / raw)
  To: davem, ktkhai, ecree, jiri, petrm, alexander.h.duyck,
	amritha.nambiar, lirongqing
  Cc: netdev, linux-kernel, Alexandre Besnard

From: Alexandre Besnard <alexandre.besnard@softathome.com>

Device remaining references counter is get as a signed integer.

When unregistering network devices, the loop waiting for this counter
to decrement tests the 0 strict equality. Thus if an error occurs and
two references are given back by a protocol, we are stuck in the loop
forever, with a -1 value.

Robustness is added by checking a negative value: the device is then
considered free of references, and a warning is issued (it should not
happen, one should check that behavior)

Signed-off-by: Alexandre Besnard <alexandre.besnard@softathome.com>
---
 net/core/dev.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index ddc551f..e4190ae 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8687,6 +8687,11 @@ static void netdev_wait_allrefs(struct net_device *dev)
 	refcnt = netdev_refcnt_read(dev);

 	while (refcnt != 0) {
+		if (refcnt < 0) {
+			pr_warn("Device %s refcnt negative: device considered free, but it should not happen\n",
+				dev->name);
+			break;
+		}
 		if (time_after(jiffies, rebroadcast_time + 1 * HZ)) {
 			rtnl_lock();

--
2.7.4


^ permalink raw reply related

* Re: [PATCH] net: check negative value for signed refcnt
From: Kirill Tkhai @ 2019-01-31 13:49 UTC (permalink / raw)
  To: alexandre.besnard, davem, ecree, jiri, petrm, alexander.h.duyck,
	amritha.nambiar, lirongqing
  Cc: netdev, linux-kernel
In-Reply-To: <20190131132008.23161-1-alexandre.besnard@softathome.com>

Hi, Alexandre,

On 31.01.2019 16:20, alexandre.besnard@softathome.com wrote:
> From: Alexandre Besnard <alexandre.besnard@softathome.com>
> 
> Device remaining references counter is get as a signed integer.
> 
> When unregistering network devices, the loop waiting for this counter
> to decrement tests the 0 strict equality. Thus if an error occurs and
> two references are given back by a protocol, we are stuck in the loop
> forever, with a -1 value.
> 
> Robustness is added by checking a negative value: the device is then
> considered free of references, and a warning is issued (it should not
> happen, one should check that behavior)
> 
> Signed-off-by: Alexandre Besnard <alexandre.besnard@softathome.com>
> ---
>  net/core/dev.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index ddc551f..e4190ae 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -8687,6 +8687,11 @@ static void netdev_wait_allrefs(struct net_device *dev)
>  	refcnt = netdev_refcnt_read(dev);
> 
>  	while (refcnt != 0) {
> +		if (refcnt < 0) {
> +			pr_warn("Device %s refcnt negative: device considered free, but it should not happen\n",
> +				dev->name);
> +			break;
> +		}

1)I don't think this is a good approach. Negative value does not guarantee
there is just a double put of device reference. Negative value is an indicator
something goes wrong, and we definitely should not free device memory in
this case.

2)Not related to your patch -- it looks like we have problem in existing
code with this netdev_refcnt_read(). It does not imply a memory ordering
or some guarantees about reading percpu values. For example, in generic
code struct percpu_ref switches a counter into atomic mode before it checks
for the last reference. But there is nothing in netdev_refcnt_read().



^ permalink raw reply

* Re: [PATCH bpf-next v3 2/3] libbpf: add support for using AF_XDP sockets
From: Magnus Karlsson @ 2019-01-31 13:55 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Magnus Karlsson, Björn Töpel, ast, Network Development,
	Jakub Kicinski, Björn Töpel, Zhang, Qi Z,
	Jesper Dangaard Brouer
In-Reply-To: <1d42cd5e-4842-1c6f-6fa9-f1cb16fdd74f@iogearbox.net>

On Thu, Jan 31, 2019 at 11:31 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 01/29/2019 04:12 PM, Magnus Karlsson wrote:
> > This commit adds AF_XDP support to libbpf. The main reason for this is
> > to facilitate writing applications that use AF_XDP by offering
> > higher-level APIs that hide many of the details of the AF_XDP
> > uapi. This is in the same vein as libbpf facilitates XDP adoption by
> > offering easy-to-use higher level interfaces of XDP
> > functionality. Hopefully this will facilitate adoption of AF_XDP, make
> > applications using it simpler and smaller, and finally also make it
> > possible for applications to benefit from optimizations in the AF_XDP
> > user space access code. Previously, people just copied and pasted the
> > code from the sample application into their application, which is not
> > desirable.
> >
> > The interface is composed of two parts:
> >
> > * Low-level access interface to the four rings and the packet
> > * High-level control plane interface for creating and setting
> >   up umems and af_xdp sockets as well as a simple XDP program.
> >
> > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> [...]
> > +
> > +static __u32 xsk_prod_nb_free(struct xsk_ring_prod *r, __u32 nb)
> > +{
> > +     __u32 free_entries = r->cached_cons - r->cached_prod;
> > +
> > +     if (free_entries >= nb)
> > +             return free_entries;
> > +
> > +     /* Refresh the local tail pointer.
> > +      * cached_cons is r->size bigger than the real consumer pointer so
> > +      * that this addition can be avoided in the more frequently
> > +      * executed code that computs free_entries in the beginning of
> > +      * this function. Without this optimization it whould have been
> > +      * free_entries = r->cached_prod - r->cached_cons + r->size.
> > +      */
> > +     r->cached_cons = *r->consumer + r->size;
> > +
> > +     return r->cached_cons - r->cached_prod;
> > +}
> > +
> > +static __u32 xsk_cons_nb_avail(struct xsk_ring_cons *r, __u32 nb)
> > +{
> > +     __u32 entries = r->cached_prod - r->cached_cons;
> > +
> > +     if (entries == 0) {
> > +             r->cached_prod = *r->producer;
> > +             entries = r->cached_prod - r->cached_cons;
> > +     }
> > +
> > +     return (entries > nb) ? nb : entries;
> > +}
> > +
> > +size_t xsk_ring_prod__reserve(struct xsk_ring_prod *prod, size_t nb, __u32 *idx)
> > +{
> > +     if (unlikely(xsk_prod_nb_free(prod, nb) < nb))
> > +             return 0;
> > +
> > +     *idx = prod->cached_prod;
> > +     prod->cached_prod += nb;
> > +
> > +     return nb;
> > +}
> > +
> > +void xsk_ring_prod__submit(struct xsk_ring_prod *prod, size_t nb)
> > +{
> > +     /* Make sure everything has been written to the ring before signalling
> > +      * this to the kernel.
> > +      */
> > +     smp_wmb();
> > +
> > +     *prod->producer += nb;
> > +}
> > +
> > +size_t xsk_ring_cons__peek(struct xsk_ring_cons *cons, size_t nb, __u32 *idx)
> > +{
> > +     size_t entries = xsk_cons_nb_avail(cons, nb);
> > +
> > +     if (likely(entries > 0)) {
> > +             /* Make sure we do not speculatively read the data before
> > +              * we have received the packet buffers from the ring.
> > +              */
> > +             smp_rmb();
> > +
> > +             *idx = cons->cached_cons;
> > +             cons->cached_cons += entries;
> > +     }
> > +
> > +     return entries;
> > +}
> > +
> > +void xsk_ring_cons__release(struct xsk_ring_cons *cons, size_t nb)
> > +{
> > +     *cons->consumer += nb;
> > +}
> > +
> > +void *xsk_umem__get_data_raw(void *umem_area, __u64 addr)
> > +{
> > +     return &((char *)umem_area)[addr];
> > +}
> > +
> > +void *xsk_umem__get_data(struct xsk_umem *umem, __u64 addr)
> > +{
> > +     return &((char *)(umem->umem_area))[addr];
> > +}
>
> Shouldn't some of the above helpers for critical path be exposed as static
> inline functions instead?

Could be. I will make some experiments to see how much performance
would improve. Kept them as non-static so that they could be changed
if someone thought of a faster way of doing these operations. But
might be unnecessary, since we can change parts of it even if they are
static. I will measure and come back to you.

> [...]
> > +static int xsk_parse_nl(void *cookie, void *msg, struct nlattr **tb)
> > +{
> > +     struct nlattr *tb_parsed[IFLA_XDP_MAX + 1];
> > +     struct xsk_nl_info *nl_info = cookie;
> > +     unsigned char mode;
> > +     int err;
> > +
> > +     (void)msg;
>
> Unused?

Yes.

> > +     nl_info->xdp_prog_attached = false;
> > +     if (!tb[IFLA_XDP])
> > +             return 0;
> > +
> > +     err = libbpf_nla_parse_nested(tb_parsed, IFLA_XDP_MAX, tb[IFLA_XDP],
> > +                                   NULL);
> > +     if (err)
> > +             return err;
> > +
> > +     if (!tb_parsed[IFLA_XDP_ATTACHED] || !tb_parsed[IFLA_XDP_FD])
> > +             return 0;
> > +
> > +     mode = libbpf_nla_getattr_u8(tb_parsed[IFLA_XDP_ATTACHED]);
> > +     if (mode == XDP_ATTACHED_NONE)
> > +             return 0;
>
> Probably good to memset and/or init the passed struct xsk_nl_info (e.g.
> fd to -1 etc) such that if some user did not we won't end up with garbage
> values on return 0.

Will do.

> > +     nl_info->xdp_prog_attached = true;
> > +     nl_info->fd = libbpf_nla_getattr_u32(tb_parsed[IFLA_XDP_FD]);
> > +     return 0;
> > +}
> > +
> > +static bool xsk_xdp_prog_attached(struct xsk_socket *xsk)
> > +{
> > +     struct xsk_nl_info nl_info;
> > +     unsigned int nl_pid;
> > +     char err_buf[256];
> > +     int sock, err;
> > +
> > +     sock = libbpf_netlink_open(&nl_pid);
> > +     if (sock < 0)
> > +             return false;
> > +
> > +     nl_info.xdp_prog_attached = false;
> > +     nl_info.fd = 0;
> > +
> > +     err = libbpf_nl_get_link(sock, nl_pid, xsk_parse_nl, &nl_info);
> > +     if (err) {
> > +             libbpf_strerror(err, err_buf, sizeof(err_buf));
> > +             pr_warning("Error:\n%s\n", err_buf);
> > +             return false;
> > +     }
> > +
> > +     xsk->prog_fd = nl_info.fd;
> > +     return nl_info.xdp_prog_attached;
>
> Don't we leak sock here and in error above?

Will fix the fd leaks that you have pointed out here and below. And go
through the code to see if there are more fd leaks. Do I ever learn
;-).

Thanks for your review. Appreciated.

/Magnus

> > +}
> > +
> > +static int xsk_load_xdp_prog(struct xsk_socket *xsk)
> > +{
> > +     char bpf_log_buf[BPF_LOG_BUF_SIZE];
> > +     int err, prog_fd;
> > +
> > +     /* This is the C-program:
> > +      * SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
> > +      * {
> > +      *     int *qidconf, index = ctx->rx_queue_index;
> > +      *
> > +      *     // A set entry here means that the correspnding queue_id
> > +      *     // has an active AF_XDP socket bound to it.
> > +      *     qidconf = bpf_map_lookup_elem(&qidconf_map, &index);
> > +      *     if (!qidconf)
> > +      *         return XDP_ABORTED;
> > +      *
> > +      *     if (*qidconf)
> > +      *         return bpf_redirect_map(&xsks_map, index, 0);
> > +      *
> > +      *     return XDP_PASS;
> > +      * }
> > +      */
> > +     struct bpf_insn prog[] = {
> > +             /* r1 = *(u32 *)(r1 + 16) */
> > +             BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 16),
> > +             /* *(u32 *)(r10 - 4) = r1 */
> > +             BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_1, -4),
> > +             BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
> > +             BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
> > +             BPF_LD_MAP_FD(BPF_REG_1, xsk->qidconf_map_fd),
> > +             BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
> > +             BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
> > +             BPF_MOV32_IMM(BPF_REG_0, 0),
> > +             /* if r1 == 0 goto +8 */
> > +             BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 8),
> > +             BPF_MOV32_IMM(BPF_REG_0, 2),
> > +             /* r1 = *(u32 *)(r1 + 0) */
> > +             BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 0),
> > +             /* if r1 == 0 goto +5 */
> > +             BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 5),
> > +             /* r2 = *(u32 *)(r10 - 4) */
> > +             BPF_LD_MAP_FD(BPF_REG_1, xsk->xsks_map_fd),
> > +             BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_10, -4),
> > +             BPF_MOV32_IMM(BPF_REG_3, 0),
> > +             BPF_EMIT_CALL(BPF_FUNC_redirect_map),
> > +             /* The jumps are to this instruction */
> > +             BPF_EXIT_INSN(),
> > +     };
> > +     size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);
> > +
> > +     prog_fd = bpf_load_program(BPF_PROG_TYPE_XDP, prog, insns_cnt,
> > +                                "LGPL-2.1 or BSD-2-Clause", 0, bpf_log_buf,
> > +                                BPF_LOG_BUF_SIZE);
> > +     if (prog_fd < 0) {
> > +             pr_warning("BPF log buffer:\n%s", bpf_log_buf);
> > +             return prog_fd;
> > +     }
> > +
> > +     err = bpf_set_link_xdp_fd(xsk->ifindex, prog_fd, xsk->config.xdp_flags);
> > +     if (err)
> > +             return err;
>
> Leaks prog_fd on error.
>
> > +     xsk->prog_fd = prog_fd;
> > +     return 0;
> > +}
> > +
> > +static int xsk_create_bpf_maps(struct xsk_socket *xsk)
> > +{
> > +     int fd;
> > +
> > +     fd = bpf_create_map_name(BPF_MAP_TYPE_ARRAY, "qidconf_map",
> > +                              sizeof(int), sizeof(int), MAX_QUEUES, 0);
> > +     if (fd < 0)
> > +             return fd;
> > +     xsk->qidconf_map_fd = fd;
> > +
> > +     fd = bpf_create_map_name(BPF_MAP_TYPE_XSKMAP, "xsks_map",
> > +                              sizeof(int), sizeof(int), MAX_QUEUES, 0);
> > +     if (fd < 0)
> > +             return fd;
>
> Leaks first map fd on error.
>
> > +     xsk->xsks_map_fd = fd;
> > +
> > +     return 0;
> > +}
> > +
> > +static int xsk_update_bpf_maps(struct xsk_socket *xsk, int qidconf_value,
> > +                            int xsks_value)
> > +{
> > +     bool qidconf_map_updated = false, xsks_map_updated = false;
> > +     struct bpf_prog_info prog_info = {};
> > +     __u32 prog_len = sizeof(prog_info);
> > +     struct bpf_map_info map_info;
> > +     __u32 map_len = sizeof(map_info);
> > +     __u32 *map_ids;
> > +     int reset_value = 0;
> > +     __u32 num_maps;
> > +     unsigned int i;
> > +     int err;
> > +
> > +     err = bpf_obj_get_info_by_fd(xsk->prog_fd, &prog_info, &prog_len);
> > +     if (err)
> > +             return err;
> > +
> > +     num_maps = prog_info.nr_map_ids;
> > +
> > +     map_ids = malloc(prog_info.nr_map_ids * sizeof(*map_ids));
>
> calloc()?
>
> > +     if (!map_ids)
> > +             return -ENOMEM;
> > +
> > +     memset(&prog_info, 0, prog_len);
> > +     prog_info.nr_map_ids = num_maps;
> > +     prog_info.map_ids = (__u64)(unsigned long)map_ids;
> > +
> > +     err = bpf_obj_get_info_by_fd(xsk->prog_fd, &prog_info, &prog_len);
> > +     if (err)
> > +             return err;
>
> Leaks map_ids on error.
>
> > +
> > +     for (i = 0; i < prog_info.nr_map_ids; i++) {
> > +             int fd;
> > +
> > +             fd = bpf_map_get_fd_by_id(map_ids[i]);
> > +             if (fd < 0) {
> > +                     err = -errno;
> > +                     goto out;
> > +             }
> > +
> > +             err = bpf_obj_get_info_by_fd(fd, &map_info, &map_len);
> > +             if (err)
> > +                     goto out;
>
> close(fd) on error, also the case for below, please double check everything
> so that no fd leaks by accident into the app.
>
> > +
> > +             if (!strcmp(map_info.name, "qidconf_map")) {
> > +                     err = bpf_map_update_elem(fd, &xsk->queue_id,
> > +                                               &qidconf_value, 0);
> > +                     if (err)
> > +                             goto out;
> > +                     qidconf_map_updated = true;
> > +                     xsk->qidconf_map_fd = fd;
> > +             } else if (!strcmp(map_info.name, "xsks_map")) {
> > +                     err = bpf_map_update_elem(fd, &xsk->queue_id,
> > +                                               &xsks_value, 0);
> > +                     if (err)
> > +                             goto out;> +                    xsks_map_updated = true;
> > +                     xsk->xsks_map_fd = fd;
> > +             }
> > +
> > +             if (qidconf_map_updated && xsks_map_updated)
> > +                     break;
> > +     }
> > +
> > +     if (!(qidconf_map_updated && xsks_map_updated)) {
> > +             err = -ENOENT;
> > +             goto out;
> > +     }
> > +
> > +     return 0;
> > +
> > +out:
> > +     if (qidconf_map_updated)
> > +             (void)bpf_map_update_elem(xsk->qidconf_map_fd, &xsk->queue_id,
> > +                                       &reset_value, 0);
> > +     if (xsks_map_updated)
> > +             (void)bpf_map_update_elem(xsk->xsks_map_fd, &xsk->queue_id,
> > +                                       &reset_value, 0);
> > +     return err;
> > +}
> > +
> > +static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
> > +{
> > +     bool prog_loaded = false;
> > +     int err;
> > +
> > +     if (!xsk_xdp_prog_attached(xsk)) {
> > +             err = xsk_create_bpf_maps(xsk);
> > +             if (err)
> > +                     goto out_load;
> > +
> > +             err = xsk_load_xdp_prog(xsk);
> > +             if (err)
> > +                     return err;
>
> Needs to undo created maps on error.
>
> > +             prog_loaded = true;
> > +     }
> > +
> > +     err = xsk_update_bpf_maps(xsk, true, xsk->fd);
> > +     if (err)
> > +             goto out_load;
> > +
> > +     return 0;
> > +
> > +out_load:
> > +     if (prog_loaded)
> > +             close(xsk->prog_fd);
>
> Ditto
>
> > +     return err;
> > +}
> > +
> > +int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname,
> > +                    __u32 queue_id, struct xsk_umem *umem,
> > +                    struct xsk_ring_cons *rx, struct xsk_ring_prod *tx,
> > +                    const struct xsk_socket_config *usr_config)
> > +{
> > +     struct sockaddr_xdp sxdp = {};
> > +     struct xdp_mmap_offsets off;
> > +     struct xsk_socket *xsk;
> > +     socklen_t optlen;
> > +     void *map;

^ permalink raw reply

* Re: bpf memory model. Was: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock
From: Paul E. McKenney @ 2019-01-31 14:01 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Will Deacon, Peter Zijlstra, Alexei Starovoitov, davem, daniel,
	jakub.kicinski, netdev, kernel-team, mingo, jannh
In-Reply-To: <20190130225741.3rgbw2c6nh5v2kdj@ast-mbp.dhcp.thefacebook.com>

On Wed, Jan 30, 2019 at 02:57:43PM -0800, Alexei Starovoitov wrote:
> On Wed, Jan 30, 2019 at 01:05:36PM -0800, Paul E. McKenney wrote:
> > On Wed, Jan 30, 2019 at 11:51:14AM -0800, Alexei Starovoitov wrote:
> > > On Wed, Jan 30, 2019 at 10:36:18AM -0800, Paul E. McKenney wrote:
> > > > On Wed, Jan 30, 2019 at 06:11:00PM +0000, Will Deacon wrote:
> > > > > Hi Alexei,
> > > > > 
> > > > > On Mon, Jan 28, 2019 at 01:56:24PM -0800, Alexei Starovoitov wrote:
> > > > > > On Mon, Jan 28, 2019 at 10:24:08AM +0100, Peter Zijlstra wrote:
> > > > > > > On Fri, Jan 25, 2019 at 04:17:26PM -0800, Alexei Starovoitov wrote:
> > > > > > > > What I want to avoid is to define the whole execution ordering model upfront.
> > > > > > > > We cannot say that BPF ISA is weakly ordered like alpha.
> > > > > > > > Most of the bpf progs are written and running on x86. We shouldn't
> > > > > > > > twist bpf developer's arm by artificially relaxing memory model.
> > > > > > > > BPF memory model is equal to memory model of underlying architecture.
> > > > > > > > What we can do is to make it bpf progs a bit more portable with
> > > > > > > > smp_rmb instructions, but we must not force weak execution on the developer.
> > > > > > > 
> > > > > > > Well, I agree with only introducing bits you actually need, and my
> > > > > > > smp_rmb() example might have been poorly chosen, smp_load_acquire() /
> > > > > > > smp_store_release() might have been a far more useful example.
> > > > > > > 
> > > > > > > But I disagree with the last part; we have to pick a model now;
> > > > > > > otherwise you'll pain yourself into a corner.
> > > > > > > 
> > > > > > > Also; Alpha isn't very relevant these days; however ARM64 does seem to
> > > > > > > be gaining a lot of attention and that is very much a weak architecture.
> > > > > > > Adding strongly ordered assumptions to BPF now, will penalize them in
> > > > > > > the long run.
> > > > > > 
> > > > > > arm64 is gaining attention just like riscV is gaining it too.
> > > > > > BPF jit for arm64 is very solid, while BPF jit for riscV is being worked on.
> > > > > > BPF is not picking sides in CPU HW and ISA battles.
> > > > > 
> > > > > It's not about picking a side, it's about providing an abstraction of the
> > > > > various CPU architectures out there so that the programmer doesn't need to
> > > > > worry about where their program may run. Hell, even if you just said "eBPF
> > > > > follows x86 semantics" that would be better than saying nothing (and then we
> > > > > could have a discussion about whether x86 semantics are really what you
> > > > > want).
> > > > 
> > > > To reinforce this point, the Linux-kernel memory model (tools/memory-model)
> > > > is that abstraction for the Linux kernel.  Why not just use that for BPF?
> > > 
> > > I already answered this earlier in the thread.
> > > tldr: not going to sacrifice performance.
> > 
> > Understood.
> > 
> > But can we at least say that where there are no performance consequences,
> > BPF should follow LKMM?  You already mentioned smp_load_acquire()
> > and smp_store_release(), but the void atomics (e.g., atomic_inc())
> > should also work because they don't provide any ordering guarantees.
> > The _relaxed(), _release(), and _acquire() variants of the value-returning
> > atomics should be just fine as well.
> > 
> > The other value-returning atomics have strong ordering, which is fine
> > on many systems, but potentially suboptimal for the weakly ordered ones.
> > Though you have to have pretty good locality of reference to be able to
> > see the difference, because otherwise cache-miss overhead dominates.
> > 
> > Things like cmpxchg() don't seem to fit BPF because they are normally
> > used in spin loops, though there are some non-spinning use cases.
> > 
> > You correctly pointed out that READ_ONCE() and WRITE_ONCE() are suboptimal
> > on systems that don't support all sizes of loads, but I bet that there
> > are some sizes for which they are just fine across systems, for example,
> > pointer size and int size.
> > 
> > Does that help?  Or am I missing additional cases where performance
> > could be degraded?
> 
> bpf doesn't have smp_load_acquire, atomic_fetch_add, xchg, fence instructions.
> They can be added step by step. That's easy.
> I believe folks already started working on adding atomic_fetch_add.
> What I have problem with is making a statement today that bpf's end
> goal is LKMM. Even after adding all sorts of instructions it may
> not be practical.
> Only when real use case requires adding new instruction we do it.
> Do you have a bpf program that needs smp_load_acquire ?

We seem to be talking past each other.  Let me try again...

I believe that if BPF adds a given concurrency feature, it should follow
LKMM unless there is some specific problem with its doing so.

My paragraphs in my previous email list the concurrency features BPF
could follow LKMM without penalty, should BPF choose to add them.

Does that help?

							Thanx, Paul


^ permalink raw reply

* Re: [PATCH bpf-next v3 2/3] libbpf: add support for using AF_XDP sockets
From: Maciej Fijalkowski @ 2019-01-31 14:05 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Daniel Borkmann, Magnus Karlsson, Björn Töpel, ast,
	Network Development, Jakub Kicinski, Björn Töpel,
	Zhang, Qi Z, Jesper Dangaard Brouer
In-Reply-To: <CAJ8uoz04xv_yBxX7uzsu7ftkgs8ufC97Jn8MbzXtgfN9fZUdGg@mail.gmail.com>

On Thu, 31 Jan 2019 14:55:06 +0100
Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> On Thu, Jan 31, 2019 at 11:31 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >
> > On 01/29/2019 04:12 PM, Magnus Karlsson wrote:  
> > > This commit adds AF_XDP support to libbpf. The main reason for this is
> > > to facilitate writing applications that use AF_XDP by offering
> > > higher-level APIs that hide many of the details of the AF_XDP
> > > uapi. This is in the same vein as libbpf facilitates XDP adoption by
> > > offering easy-to-use higher level interfaces of XDP
> > > functionality. Hopefully this will facilitate adoption of AF_XDP, make
> > > applications using it simpler and smaller, and finally also make it
> > > possible for applications to benefit from optimizations in the AF_XDP
> > > user space access code. Previously, people just copied and pasted the
> > > code from the sample application into their application, which is not
> > > desirable.
> > >
> > > The interface is composed of two parts:
> > >
> > > * Low-level access interface to the four rings and the packet
> > > * High-level control plane interface for creating and setting
> > >   up umems and af_xdp sockets as well as a simple XDP program.
> > >
> > > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>  
> > [...]  
> > > +
> > > +static __u32 xsk_prod_nb_free(struct xsk_ring_prod *r, __u32 nb)
> > > +{
> > > +     __u32 free_entries = r->cached_cons - r->cached_prod;
> > > +
> > > +     if (free_entries >= nb)
> > > +             return free_entries;
> > > +
> > > +     /* Refresh the local tail pointer.
> > > +      * cached_cons is r->size bigger than the real consumer pointer so
> > > +      * that this addition can be avoided in the more frequently
> > > +      * executed code that computs free_entries in the beginning of
> > > +      * this function. Without this optimization it whould have been
> > > +      * free_entries = r->cached_prod - r->cached_cons + r->size.
> > > +      */
> > > +     r->cached_cons = *r->consumer + r->size;
> > > +
> > > +     return r->cached_cons - r->cached_prod;
> > > +}
> > > +
> > > +static __u32 xsk_cons_nb_avail(struct xsk_ring_cons *r, __u32 nb)
> > > +{
> > > +     __u32 entries = r->cached_prod - r->cached_cons;
> > > +
> > > +     if (entries == 0) {
> > > +             r->cached_prod = *r->producer;
> > > +             entries = r->cached_prod - r->cached_cons;
> > > +     }
> > > +
> > > +     return (entries > nb) ? nb : entries;
> > > +}
> > > +
> > > +size_t xsk_ring_prod__reserve(struct xsk_ring_prod *prod, size_t nb, __u32 *idx)
> > > +{
> > > +     if (unlikely(xsk_prod_nb_free(prod, nb) < nb))
> > > +             return 0;
> > > +
> > > +     *idx = prod->cached_prod;
> > > +     prod->cached_prod += nb;
> > > +
> > > +     return nb;
> > > +}
> > > +
> > > +void xsk_ring_prod__submit(struct xsk_ring_prod *prod, size_t nb)
> > > +{
> > > +     /* Make sure everything has been written to the ring before signalling
> > > +      * this to the kernel.
> > > +      */
> > > +     smp_wmb();
> > > +
> > > +     *prod->producer += nb;
> > > +}
> > > +
> > > +size_t xsk_ring_cons__peek(struct xsk_ring_cons *cons, size_t nb, __u32 *idx)
> > > +{
> > > +     size_t entries = xsk_cons_nb_avail(cons, nb);
> > > +
> > > +     if (likely(entries > 0)) {
> > > +             /* Make sure we do not speculatively read the data before
> > > +              * we have received the packet buffers from the ring.
> > > +              */
> > > +             smp_rmb();
> > > +
> > > +             *idx = cons->cached_cons;
> > > +             cons->cached_cons += entries;
> > > +     }
> > > +
> > > +     return entries;
> > > +}
> > > +
> > > +void xsk_ring_cons__release(struct xsk_ring_cons *cons, size_t nb)
> > > +{
> > > +     *cons->consumer += nb;
> > > +}
> > > +
> > > +void *xsk_umem__get_data_raw(void *umem_area, __u64 addr)
> > > +{
> > > +     return &((char *)umem_area)[addr];
> > > +}
> > > +
> > > +void *xsk_umem__get_data(struct xsk_umem *umem, __u64 addr)
> > > +{
> > > +     return &((char *)(umem->umem_area))[addr];
> > > +}  
> >
> > Shouldn't some of the above helpers for critical path be exposed as static
> > inline functions instead?  
> 
> Could be. I will make some experiments to see how much performance
> would improve. Kept them as non-static so that they could be changed
> if someone thought of a faster way of doing these operations. But
> might be unnecessary, since we can change parts of it even if they are
> static. I will measure and come back to you.
> 
> > [...]  
> > > +static int xsk_parse_nl(void *cookie, void *msg, struct nlattr **tb)
> > > +{
> > > +     struct nlattr *tb_parsed[IFLA_XDP_MAX + 1];
> > > +     struct xsk_nl_info *nl_info = cookie;
> > > +     unsigned char mode;
> > > +     int err;
> > > +
> > > +     (void)msg;  
> >
> > Unused?  
> 
> Yes.
>
IMO we should make sure here that we actually do the parsing for xsk->ifindex
only. So nl_info should have an ifindex field which will be set to xsk->ifindex
and here we will do:

struct ifinfomsg *ifinfo = msg;
if (nl_info->ifindex && nl_info->ifindex != ifinfo->ifi_index)
	return 0;

RTM_GETLINK with NLM_F_DUMP flag will give you information from all interfaces
so let's suppose that you are running the xdpsock on eth0 but on eth1 there's a
xdp prog already running. IIUC here you might have a situation that
xsk_xdp_prog_attached will return true because you did the parsing for every
interface, eth1 has xdp prog attached so the nl_info will be filled. Then, in
xsk_setup_xdp_prog you won't create maps.

Check against the xsk->ifindex at the beginning of xsk_parse_nl will prevent
this kind of behavior.

> > > +     nl_info->xdp_prog_attached = false;
> > > +     if (!tb[IFLA_XDP])
> > > +             return 0;
> > > +
> > > +     err = libbpf_nla_parse_nested(tb_parsed, IFLA_XDP_MAX, tb[IFLA_XDP],
> > > +                                   NULL);
> > > +     if (err)
> > > +             return err;
> > > +
> > > +     if (!tb_parsed[IFLA_XDP_ATTACHED] || !tb_parsed[IFLA_XDP_FD])
> > > +             return 0;
> > > +
> > > +     mode = libbpf_nla_getattr_u8(tb_parsed[IFLA_XDP_ATTACHED]);
> > > +     if (mode == XDP_ATTACHED_NONE)
> > > +             return 0;  
> >
> > Probably good to memset and/or init the passed struct xsk_nl_info (e.g.
> > fd to -1 etc) such that if some user did not we won't end up with garbage
> > values on return 0.  
> 
> Will do.
> 
> > > +     nl_info->xdp_prog_attached = true;
> > > +     nl_info->fd = libbpf_nla_getattr_u32(tb_parsed[IFLA_XDP_FD]);
> > > +     return 0;
> > > +}
> > > +
> > > +static bool xsk_xdp_prog_attached(struct xsk_socket *xsk)
> > > +{
> > > +     struct xsk_nl_info nl_info;
> > > +     unsigned int nl_pid;
> > > +     char err_buf[256];
> > > +     int sock, err;
> > > +
> > > +     sock = libbpf_netlink_open(&nl_pid);
> > > +     if (sock < 0)
> > > +             return false;
> > > +
> > > +     nl_info.xdp_prog_attached = false;
> > > +     nl_info.fd = 0;
> > > +
> > > +     err = libbpf_nl_get_link(sock, nl_pid, xsk_parse_nl, &nl_info);
> > > +     if (err) {
> > > +             libbpf_strerror(err, err_buf, sizeof(err_buf));
> > > +             pr_warning("Error:\n%s\n", err_buf);
> > > +             return false;
> > > +     }
> > > +
> > > +     xsk->prog_fd = nl_info.fd;
> > > +     return nl_info.xdp_prog_attached;  
> >
> > Don't we leak sock here and in error above?  
> 
> Will fix the fd leaks that you have pointed out here and below. And go
> through the code to see if there are more fd leaks. Do I ever learn
> ;-).
> 
> Thanks for your review. Appreciated.
> 
> /Magnus
> 
> > > +}
> > > +
> > > +static int xsk_load_xdp_prog(struct xsk_socket *xsk)
> > > +{
> > > +     char bpf_log_buf[BPF_LOG_BUF_SIZE];
> > > +     int err, prog_fd;
> > > +
> > > +     /* This is the C-program:
> > > +      * SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
> > > +      * {
> > > +      *     int *qidconf, index = ctx->rx_queue_index;
> > > +      *
> > > +      *     // A set entry here means that the correspnding queue_id
> > > +      *     // has an active AF_XDP socket bound to it.
> > > +      *     qidconf = bpf_map_lookup_elem(&qidconf_map, &index);
> > > +      *     if (!qidconf)
> > > +      *         return XDP_ABORTED;
> > > +      *
> > > +      *     if (*qidconf)
> > > +      *         return bpf_redirect_map(&xsks_map, index, 0);
> > > +      *
> > > +      *     return XDP_PASS;
> > > +      * }
> > > +      */
> > > +     struct bpf_insn prog[] = {
> > > +             /* r1 = *(u32 *)(r1 + 16) */
> > > +             BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 16),
> > > +             /* *(u32 *)(r10 - 4) = r1 */
> > > +             BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_1, -4),
> > > +             BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
> > > +             BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
> > > +             BPF_LD_MAP_FD(BPF_REG_1, xsk->qidconf_map_fd),
> > > +             BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
> > > +             BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
> > > +             BPF_MOV32_IMM(BPF_REG_0, 0),
> > > +             /* if r1 == 0 goto +8 */
> > > +             BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 8),
> > > +             BPF_MOV32_IMM(BPF_REG_0, 2),
> > > +             /* r1 = *(u32 *)(r1 + 0) */
> > > +             BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 0),
> > > +             /* if r1 == 0 goto +5 */
> > > +             BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 5),
> > > +             /* r2 = *(u32 *)(r10 - 4) */
> > > +             BPF_LD_MAP_FD(BPF_REG_1, xsk->xsks_map_fd),
> > > +             BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_10, -4),
> > > +             BPF_MOV32_IMM(BPF_REG_3, 0),
> > > +             BPF_EMIT_CALL(BPF_FUNC_redirect_map),
> > > +             /* The jumps are to this instruction */
> > > +             BPF_EXIT_INSN(),
> > > +     };
> > > +     size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);
> > > +
> > > +     prog_fd = bpf_load_program(BPF_PROG_TYPE_XDP, prog, insns_cnt,
> > > +                                "LGPL-2.1 or BSD-2-Clause", 0, bpf_log_buf,
> > > +                                BPF_LOG_BUF_SIZE);
> > > +     if (prog_fd < 0) {
> > > +             pr_warning("BPF log buffer:\n%s", bpf_log_buf);
> > > +             return prog_fd;
> > > +     }
> > > +
> > > +     err = bpf_set_link_xdp_fd(xsk->ifindex, prog_fd, xsk->config.xdp_flags);
> > > +     if (err)
> > > +             return err;  
> >
> > Leaks prog_fd on error.
> >  
> > > +     xsk->prog_fd = prog_fd;
> > > +     return 0;
> > > +}
> > > +
> > > +static int xsk_create_bpf_maps(struct xsk_socket *xsk)
> > > +{
> > > +     int fd;
> > > +
> > > +     fd = bpf_create_map_name(BPF_MAP_TYPE_ARRAY, "qidconf_map",
> > > +                              sizeof(int), sizeof(int), MAX_QUEUES, 0);
> > > +     if (fd < 0)
> > > +             return fd;
> > > +     xsk->qidconf_map_fd = fd;
> > > +
> > > +     fd = bpf_create_map_name(BPF_MAP_TYPE_XSKMAP, "xsks_map",
> > > +                              sizeof(int), sizeof(int), MAX_QUEUES, 0);
> > > +     if (fd < 0)
> > > +             return fd;  
> >
> > Leaks first map fd on error.
> >  
> > > +     xsk->xsks_map_fd = fd;
> > > +
> > > +     return 0;
> > > +}
> > > +
> > > +static int xsk_update_bpf_maps(struct xsk_socket *xsk, int qidconf_value,
> > > +                            int xsks_value)
> > > +{
> > > +     bool qidconf_map_updated = false, xsks_map_updated = false;
> > > +     struct bpf_prog_info prog_info = {};
> > > +     __u32 prog_len = sizeof(prog_info);
> > > +     struct bpf_map_info map_info;
> > > +     __u32 map_len = sizeof(map_info);
> > > +     __u32 *map_ids;
> > > +     int reset_value = 0;
> > > +     __u32 num_maps;
> > > +     unsigned int i;
> > > +     int err;
> > > +
> > > +     err = bpf_obj_get_info_by_fd(xsk->prog_fd, &prog_info, &prog_len);
> > > +     if (err)
> > > +             return err;
> > > +
> > > +     num_maps = prog_info.nr_map_ids;
> > > +
> > > +     map_ids = malloc(prog_info.nr_map_ids * sizeof(*map_ids));  
> >
> > calloc()?
> >  
> > > +     if (!map_ids)
> > > +             return -ENOMEM;
> > > +
> > > +     memset(&prog_info, 0, prog_len);
> > > +     prog_info.nr_map_ids = num_maps;
> > > +     prog_info.map_ids = (__u64)(unsigned long)map_ids;
> > > +
> > > +     err = bpf_obj_get_info_by_fd(xsk->prog_fd, &prog_info, &prog_len);
> > > +     if (err)
> > > +             return err;  
> >
> > Leaks map_ids on error.
> >  
> > > +
> > > +     for (i = 0; i < prog_info.nr_map_ids; i++) {
> > > +             int fd;
> > > +
> > > +             fd = bpf_map_get_fd_by_id(map_ids[i]);
> > > +             if (fd < 0) {
> > > +                     err = -errno;
> > > +                     goto out;
> > > +             }
> > > +
> > > +             err = bpf_obj_get_info_by_fd(fd, &map_info, &map_len);
> > > +             if (err)
> > > +                     goto out;  
> >
> > close(fd) on error, also the case for below, please double check everything
> > so that no fd leaks by accident into the app.
> >  
> > > +
> > > +             if (!strcmp(map_info.name, "qidconf_map")) {
> > > +                     err = bpf_map_update_elem(fd, &xsk->queue_id,
> > > +                                               &qidconf_value, 0);
> > > +                     if (err)
> > > +                             goto out;
> > > +                     qidconf_map_updated = true;
> > > +                     xsk->qidconf_map_fd = fd;
> > > +             } else if (!strcmp(map_info.name, "xsks_map")) {
> > > +                     err = bpf_map_update_elem(fd, &xsk->queue_id,
> > > +                                               &xsks_value, 0);
> > > +                     if (err)
> > > +                             goto out;> +                    xsks_map_updated = true;
> > > +                     xsk->xsks_map_fd = fd;
> > > +             }
> > > +
> > > +             if (qidconf_map_updated && xsks_map_updated)
> > > +                     break;
> > > +     }
> > > +
> > > +     if (!(qidconf_map_updated && xsks_map_updated)) {
> > > +             err = -ENOENT;
> > > +             goto out;
> > > +     }
> > > +
> > > +     return 0;
> > > +
> > > +out:
> > > +     if (qidconf_map_updated)
> > > +             (void)bpf_map_update_elem(xsk->qidconf_map_fd, &xsk->queue_id,
> > > +                                       &reset_value, 0);
> > > +     if (xsks_map_updated)
> > > +             (void)bpf_map_update_elem(xsk->xsks_map_fd, &xsk->queue_id,
> > > +                                       &reset_value, 0);
> > > +     return err;
> > > +}
> > > +
> > > +static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
> > > +{
> > > +     bool prog_loaded = false;
> > > +     int err;
> > > +
> > > +     if (!xsk_xdp_prog_attached(xsk)) {
> > > +             err = xsk_create_bpf_maps(xsk);
> > > +             if (err)
> > > +                     goto out_load;
> > > +
> > > +             err = xsk_load_xdp_prog(xsk);
> > > +             if (err)
> > > +                     return err;  
> >
> > Needs to undo created maps on error.
> >  
> > > +             prog_loaded = true;
> > > +     }
> > > +
> > > +     err = xsk_update_bpf_maps(xsk, true, xsk->fd);
> > > +     if (err)
> > > +             goto out_load;
> > > +
> > > +     return 0;
> > > +
> > > +out_load:
> > > +     if (prog_loaded)
> > > +             close(xsk->prog_fd);  
> >
> > Ditto
> >  
> > > +     return err;
> > > +}
> > > +
> > > +int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname,
> > > +                    __u32 queue_id, struct xsk_umem *umem,
> > > +                    struct xsk_ring_cons *rx, struct xsk_ring_prod *tx,
> > > +                    const struct xsk_socket_config *usr_config)
> > > +{
> > > +     struct sockaddr_xdp sxdp = {};
> > > +     struct xdp_mmap_offsets off;
> > > +     struct xsk_socket *xsk;
> > > +     socklen_t optlen;
> > > +     void *map;  


^ permalink raw reply

* [PATCH net-next] net/mlx5e: Fix code style issue in mlx driver
From: xiangxia.m.yue @ 2019-01-29 20:31 UTC (permalink / raw)
  To: saeedm, gerlitz.or; +Cc: netdev, Tonghao Zhang

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>

Add the tab before '}' and keep the code style consistent.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 79f122b..9c5aac1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -618,7 +618,8 @@ static struct mlx5_flow_group *alloc_flow_group(struct mlx5_flow_steering *steer
 	if (ret) {
 		kmem_cache_free(steering->fgs_cache, fg);
 		return ERR_PTR(ret);
-}
+	}
+
 	ida_init(&fg->fte_allocator);
 	fg->mask.match_criteria_enable = match_criteria_enable;
 	memcpy(&fg->mask.match_criteria, match_criteria,
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH net-next] xprtrdma: Use struct_size() in kzalloc()
From: Chuck Lever @ 2019-01-31 14:11 UTC (permalink / raw)
  To: Gustavo A. R. Silva
  Cc: Trond Myklebust, Anna Schumaker, Bruce Fields, Jeff Layton,
	David S. Miller, Linux NFS Mailing List, netdev, linux-kernel
In-Reply-To: <20190131004622.GA30261@embeddedor>



> On Jan 30, 2019, at 7:46 PM, Gustavo A. R. Silva <gustavo@embeddedor.com> wrote:
> 
> One of the more common cases of allocation size calculations is finding
> the size of a structure that has a zero-sized array at the end, along
> with memory for some number of elements for that array. For example:
> 
> struct foo {
>    int stuff;
>    struct boo entry[];
> };
> 
> instance = kzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL);
> 
> Instead of leaving these open-coded and prone to type mistakes, we can
> now use the new struct_size() helper:
> 
> instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);
> 
> This code was detected with the help of Coccinelle.
> 
> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>


> ---
> net/sunrpc/xprtrdma/verbs.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 4994e75945b8..9e8cf7456840 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -811,8 +811,7 @@ static struct rpcrdma_sendctx *rpcrdma_sendctx_create(struct rpcrdma_ia *ia)
> {
> 	struct rpcrdma_sendctx *sc;
> 
> -	sc = kzalloc(sizeof(*sc) +
> -		     ia->ri_max_send_sges * sizeof(struct ib_sge),
> +	sc = kzalloc(struct_size(sc, sc_sges, ia->ri_max_send_sges),
> 		     GFP_KERNEL);
> 	if (!sc)
> 		return NULL;
> -- 
> 2.20.1
> 

--
Chuck Lever




^ permalink raw reply

* Re: [PATCH bpf-next v3 2/3] libbpf: add support for using AF_XDP sockets
From: Magnus Karlsson @ 2019-01-31 14:23 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Daniel Borkmann, Magnus Karlsson, Björn Töpel, ast,
	Network Development, Jakub Kicinski, Björn Töpel,
	Zhang, Qi Z, Jesper Dangaard Brouer
In-Reply-To: <20190131150555.00002898@gmail.com>

On Thu, Jan 31, 2019 at 3:06 PM Maciej Fijalkowski
<maciejromanfijalkowski@gmail.com> wrote:
>
> On Thu, 31 Jan 2019 14:55:06 +0100
> Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
>
> > On Thu, Jan 31, 2019 at 11:31 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > >
> > > On 01/29/2019 04:12 PM, Magnus Karlsson wrote:
> > > > This commit adds AF_XDP support to libbpf. The main reason for this is
> > > > to facilitate writing applications that use AF_XDP by offering
> > > > higher-level APIs that hide many of the details of the AF_XDP
> > > > uapi. This is in the same vein as libbpf facilitates XDP adoption by
> > > > offering easy-to-use higher level interfaces of XDP
> > > > functionality. Hopefully this will facilitate adoption of AF_XDP, make
> > > > applications using it simpler and smaller, and finally also make it
> > > > possible for applications to benefit from optimizations in the AF_XDP
> > > > user space access code. Previously, people just copied and pasted the
> > > > code from the sample application into their application, which is not
> > > > desirable.
> > > >
> > > > The interface is composed of two parts:
> > > >
> > > > * Low-level access interface to the four rings and the packet
> > > > * High-level control plane interface for creating and setting
> > > >   up umems and af_xdp sockets as well as a simple XDP program.
> > > >
> > > > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> > > [...]
> > > > +
> > > > +static __u32 xsk_prod_nb_free(struct xsk_ring_prod *r, __u32 nb)
> > > > +{
> > > > +     __u32 free_entries = r->cached_cons - r->cached_prod;
> > > > +
> > > > +     if (free_entries >= nb)
> > > > +             return free_entries;
> > > > +
> > > > +     /* Refresh the local tail pointer.
> > > > +      * cached_cons is r->size bigger than the real consumer pointer so
> > > > +      * that this addition can be avoided in the more frequently
> > > > +      * executed code that computs free_entries in the beginning of
> > > > +      * this function. Without this optimization it whould have been
> > > > +      * free_entries = r->cached_prod - r->cached_cons + r->size.
> > > > +      */
> > > > +     r->cached_cons = *r->consumer + r->size;
> > > > +
> > > > +     return r->cached_cons - r->cached_prod;
> > > > +}
> > > > +
> > > > +static __u32 xsk_cons_nb_avail(struct xsk_ring_cons *r, __u32 nb)
> > > > +{
> > > > +     __u32 entries = r->cached_prod - r->cached_cons;
> > > > +
> > > > +     if (entries == 0) {
> > > > +             r->cached_prod = *r->producer;
> > > > +             entries = r->cached_prod - r->cached_cons;
> > > > +     }
> > > > +
> > > > +     return (entries > nb) ? nb : entries;
> > > > +}
> > > > +
> > > > +size_t xsk_ring_prod__reserve(struct xsk_ring_prod *prod, size_t nb, __u32 *idx)
> > > > +{
> > > > +     if (unlikely(xsk_prod_nb_free(prod, nb) < nb))
> > > > +             return 0;
> > > > +
> > > > +     *idx = prod->cached_prod;
> > > > +     prod->cached_prod += nb;
> > > > +
> > > > +     return nb;
> > > > +}
> > > > +
> > > > +void xsk_ring_prod__submit(struct xsk_ring_prod *prod, size_t nb)
> > > > +{
> > > > +     /* Make sure everything has been written to the ring before signalling
> > > > +      * this to the kernel.
> > > > +      */
> > > > +     smp_wmb();
> > > > +
> > > > +     *prod->producer += nb;
> > > > +}
> > > > +
> > > > +size_t xsk_ring_cons__peek(struct xsk_ring_cons *cons, size_t nb, __u32 *idx)
> > > > +{
> > > > +     size_t entries = xsk_cons_nb_avail(cons, nb);
> > > > +
> > > > +     if (likely(entries > 0)) {
> > > > +             /* Make sure we do not speculatively read the data before
> > > > +              * we have received the packet buffers from the ring.
> > > > +              */
> > > > +             smp_rmb();
> > > > +
> > > > +             *idx = cons->cached_cons;
> > > > +             cons->cached_cons += entries;
> > > > +     }
> > > > +
> > > > +     return entries;
> > > > +}
> > > > +
> > > > +void xsk_ring_cons__release(struct xsk_ring_cons *cons, size_t nb)
> > > > +{
> > > > +     *cons->consumer += nb;
> > > > +}
> > > > +
> > > > +void *xsk_umem__get_data_raw(void *umem_area, __u64 addr)
> > > > +{
> > > > +     return &((char *)umem_area)[addr];
> > > > +}
> > > > +
> > > > +void *xsk_umem__get_data(struct xsk_umem *umem, __u64 addr)
> > > > +{
> > > > +     return &((char *)(umem->umem_area))[addr];
> > > > +}
> > >
> > > Shouldn't some of the above helpers for critical path be exposed as static
> > > inline functions instead?
> >
> > Could be. I will make some experiments to see how much performance
> > would improve. Kept them as non-static so that they could be changed
> > if someone thought of a faster way of doing these operations. But
> > might be unnecessary, since we can change parts of it even if they are
> > static. I will measure and come back to you.
> >
> > > [...]
> > > > +static int xsk_parse_nl(void *cookie, void *msg, struct nlattr **tb)
> > > > +{
> > > > +     struct nlattr *tb_parsed[IFLA_XDP_MAX + 1];
> > > > +     struct xsk_nl_info *nl_info = cookie;
> > > > +     unsigned char mode;
> > > > +     int err;
> > > > +
> > > > +     (void)msg;
> > >
> > > Unused?
> >
> > Yes.
> >
> IMO we should make sure here that we actually do the parsing for xsk->ifindex
> only. So nl_info should have an ifindex field which will be set to xsk->ifindex
> and here we will do:
>
> struct ifinfomsg *ifinfo = msg;
> if (nl_info->ifindex && nl_info->ifindex != ifinfo->ifi_index)
>         return 0;
>
> RTM_GETLINK with NLM_F_DUMP flag will give you information from all interfaces
> so let's suppose that you are running the xdpsock on eth0 but on eth1 there's a
> xdp prog already running. IIUC here you might have a situation that
> xsk_xdp_prog_attached will return true because you did the parsing for every
> interface, eth1 has xdp prog attached so the nl_info will be filled. Then, in
> xsk_setup_xdp_prog you won't create maps.
>
> Check against the xsk->ifindex at the beginning of xsk_parse_nl will prevent
> this kind of behavior.

Thanks for catching this Maciej. Will fix.

/Magnus

> > > > +     nl_info->xdp_prog_attached = false;
> > > > +     if (!tb[IFLA_XDP])
> > > > +             return 0;
> > > > +
> > > > +     err = libbpf_nla_parse_nested(tb_parsed, IFLA_XDP_MAX, tb[IFLA_XDP],
> > > > +                                   NULL);
> > > > +     if (err)
> > > > +             return err;
> > > > +
> > > > +     if (!tb_parsed[IFLA_XDP_ATTACHED] || !tb_parsed[IFLA_XDP_FD])
> > > > +             return 0;
> > > > +
> > > > +     mode = libbpf_nla_getattr_u8(tb_parsed[IFLA_XDP_ATTACHED]);
> > > > +     if (mode == XDP_ATTACHED_NONE)
> > > > +             return 0;
> > >
> > > Probably good to memset and/or init the passed struct xsk_nl_info (e.g.
> > > fd to -1 etc) such that if some user did not we won't end up with garbage
> > > values on return 0.
> >
> > Will do.
> >
> > > > +     nl_info->xdp_prog_attached = true;
> > > > +     nl_info->fd = libbpf_nla_getattr_u32(tb_parsed[IFLA_XDP_FD]);
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +static bool xsk_xdp_prog_attached(struct xsk_socket *xsk)
> > > > +{
> > > > +     struct xsk_nl_info nl_info;
> > > > +     unsigned int nl_pid;
> > > > +     char err_buf[256];
> > > > +     int sock, err;
> > > > +
> > > > +     sock = libbpf_netlink_open(&nl_pid);
> > > > +     if (sock < 0)
> > > > +             return false;
> > > > +
> > > > +     nl_info.xdp_prog_attached = false;
> > > > +     nl_info.fd = 0;
> > > > +
> > > > +     err = libbpf_nl_get_link(sock, nl_pid, xsk_parse_nl, &nl_info);
> > > > +     if (err) {
> > > > +             libbpf_strerror(err, err_buf, sizeof(err_buf));
> > > > +             pr_warning("Error:\n%s\n", err_buf);
> > > > +             return false;
> > > > +     }
> > > > +
> > > > +     xsk->prog_fd = nl_info.fd;
> > > > +     return nl_info.xdp_prog_attached;
> > >
> > > Don't we leak sock here and in error above?
> >
> > Will fix the fd leaks that you have pointed out here and below. And go
> > through the code to see if there are more fd leaks. Do I ever learn
> > ;-).
> >
> > Thanks for your review. Appreciated.
> >
> > /Magnus
> >
> > > > +}
> > > > +
> > > > +static int xsk_load_xdp_prog(struct xsk_socket *xsk)
> > > > +{
> > > > +     char bpf_log_buf[BPF_LOG_BUF_SIZE];
> > > > +     int err, prog_fd;
> > > > +
> > > > +     /* This is the C-program:
> > > > +      * SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
> > > > +      * {
> > > > +      *     int *qidconf, index = ctx->rx_queue_index;
> > > > +      *
> > > > +      *     // A set entry here means that the correspnding queue_id
> > > > +      *     // has an active AF_XDP socket bound to it.
> > > > +      *     qidconf = bpf_map_lookup_elem(&qidconf_map, &index);
> > > > +      *     if (!qidconf)
> > > > +      *         return XDP_ABORTED;
> > > > +      *
> > > > +      *     if (*qidconf)
> > > > +      *         return bpf_redirect_map(&xsks_map, index, 0);
> > > > +      *
> > > > +      *     return XDP_PASS;
> > > > +      * }
> > > > +      */
> > > > +     struct bpf_insn prog[] = {
> > > > +             /* r1 = *(u32 *)(r1 + 16) */
> > > > +             BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 16),
> > > > +             /* *(u32 *)(r10 - 4) = r1 */
> > > > +             BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_1, -4),
> > > > +             BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
> > > > +             BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
> > > > +             BPF_LD_MAP_FD(BPF_REG_1, xsk->qidconf_map_fd),
> > > > +             BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
> > > > +             BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
> > > > +             BPF_MOV32_IMM(BPF_REG_0, 0),
> > > > +             /* if r1 == 0 goto +8 */
> > > > +             BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 8),
> > > > +             BPF_MOV32_IMM(BPF_REG_0, 2),
> > > > +             /* r1 = *(u32 *)(r1 + 0) */
> > > > +             BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 0),
> > > > +             /* if r1 == 0 goto +5 */
> > > > +             BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 5),
> > > > +             /* r2 = *(u32 *)(r10 - 4) */
> > > > +             BPF_LD_MAP_FD(BPF_REG_1, xsk->xsks_map_fd),
> > > > +             BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_10, -4),
> > > > +             BPF_MOV32_IMM(BPF_REG_3, 0),
> > > > +             BPF_EMIT_CALL(BPF_FUNC_redirect_map),
> > > > +             /* The jumps are to this instruction */
> > > > +             BPF_EXIT_INSN(),
> > > > +     };
> > > > +     size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);
> > > > +
> > > > +     prog_fd = bpf_load_program(BPF_PROG_TYPE_XDP, prog, insns_cnt,
> > > > +                                "LGPL-2.1 or BSD-2-Clause", 0, bpf_log_buf,
> > > > +                                BPF_LOG_BUF_SIZE);
> > > > +     if (prog_fd < 0) {
> > > > +             pr_warning("BPF log buffer:\n%s", bpf_log_buf);
> > > > +             return prog_fd;
> > > > +     }
> > > > +
> > > > +     err = bpf_set_link_xdp_fd(xsk->ifindex, prog_fd, xsk->config.xdp_flags);
> > > > +     if (err)
> > > > +             return err;
> > >
> > > Leaks prog_fd on error.
> > >
> > > > +     xsk->prog_fd = prog_fd;
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +static int xsk_create_bpf_maps(struct xsk_socket *xsk)
> > > > +{
> > > > +     int fd;
> > > > +
> > > > +     fd = bpf_create_map_name(BPF_MAP_TYPE_ARRAY, "qidconf_map",
> > > > +                              sizeof(int), sizeof(int), MAX_QUEUES, 0);
> > > > +     if (fd < 0)
> > > > +             return fd;
> > > > +     xsk->qidconf_map_fd = fd;
> > > > +
> > > > +     fd = bpf_create_map_name(BPF_MAP_TYPE_XSKMAP, "xsks_map",
> > > > +                              sizeof(int), sizeof(int), MAX_QUEUES, 0);
> > > > +     if (fd < 0)
> > > > +             return fd;
> > >
> > > Leaks first map fd on error.
> > >
> > > > +     xsk->xsks_map_fd = fd;
> > > > +
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +static int xsk_update_bpf_maps(struct xsk_socket *xsk, int qidconf_value,
> > > > +                            int xsks_value)
> > > > +{
> > > > +     bool qidconf_map_updated = false, xsks_map_updated = false;
> > > > +     struct bpf_prog_info prog_info = {};
> > > > +     __u32 prog_len = sizeof(prog_info);
> > > > +     struct bpf_map_info map_info;
> > > > +     __u32 map_len = sizeof(map_info);
> > > > +     __u32 *map_ids;
> > > > +     int reset_value = 0;
> > > > +     __u32 num_maps;
> > > > +     unsigned int i;
> > > > +     int err;
> > > > +
> > > > +     err = bpf_obj_get_info_by_fd(xsk->prog_fd, &prog_info, &prog_len);
> > > > +     if (err)
> > > > +             return err;
> > > > +
> > > > +     num_maps = prog_info.nr_map_ids;
> > > > +
> > > > +     map_ids = malloc(prog_info.nr_map_ids * sizeof(*map_ids));
> > >
> > > calloc()?
> > >
> > > > +     if (!map_ids)
> > > > +             return -ENOMEM;
> > > > +
> > > > +     memset(&prog_info, 0, prog_len);
> > > > +     prog_info.nr_map_ids = num_maps;
> > > > +     prog_info.map_ids = (__u64)(unsigned long)map_ids;
> > > > +
> > > > +     err = bpf_obj_get_info_by_fd(xsk->prog_fd, &prog_info, &prog_len);
> > > > +     if (err)
> > > > +             return err;
> > >
> > > Leaks map_ids on error.
> > >
> > > > +
> > > > +     for (i = 0; i < prog_info.nr_map_ids; i++) {
> > > > +             int fd;
> > > > +
> > > > +             fd = bpf_map_get_fd_by_id(map_ids[i]);
> > > > +             if (fd < 0) {
> > > > +                     err = -errno;
> > > > +                     goto out;
> > > > +             }
> > > > +
> > > > +             err = bpf_obj_get_info_by_fd(fd, &map_info, &map_len);
> > > > +             if (err)
> > > > +                     goto out;
> > >
> > > close(fd) on error, also the case for below, please double check everything
> > > so that no fd leaks by accident into the app.
> > >
> > > > +
> > > > +             if (!strcmp(map_info.name, "qidconf_map")) {
> > > > +                     err = bpf_map_update_elem(fd, &xsk->queue_id,
> > > > +                                               &qidconf_value, 0);
> > > > +                     if (err)
> > > > +                             goto out;
> > > > +                     qidconf_map_updated = true;
> > > > +                     xsk->qidconf_map_fd = fd;
> > > > +             } else if (!strcmp(map_info.name, "xsks_map")) {
> > > > +                     err = bpf_map_update_elem(fd, &xsk->queue_id,
> > > > +                                               &xsks_value, 0);
> > > > +                     if (err)
> > > > +                             goto out;> +                    xsks_map_updated = true;
> > > > +                     xsk->xsks_map_fd = fd;
> > > > +             }
> > > > +
> > > > +             if (qidconf_map_updated && xsks_map_updated)
> > > > +                     break;
> > > > +     }
> > > > +
> > > > +     if (!(qidconf_map_updated && xsks_map_updated)) {
> > > > +             err = -ENOENT;
> > > > +             goto out;
> > > > +     }
> > > > +
> > > > +     return 0;
> > > > +
> > > > +out:
> > > > +     if (qidconf_map_updated)
> > > > +             (void)bpf_map_update_elem(xsk->qidconf_map_fd, &xsk->queue_id,
> > > > +                                       &reset_value, 0);
> > > > +     if (xsks_map_updated)
> > > > +             (void)bpf_map_update_elem(xsk->xsks_map_fd, &xsk->queue_id,
> > > > +                                       &reset_value, 0);
> > > > +     return err;
> > > > +}
> > > > +
> > > > +static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
> > > > +{
> > > > +     bool prog_loaded = false;
> > > > +     int err;
> > > > +
> > > > +     if (!xsk_xdp_prog_attached(xsk)) {
> > > > +             err = xsk_create_bpf_maps(xsk);
> > > > +             if (err)
> > > > +                     goto out_load;
> > > > +
> > > > +             err = xsk_load_xdp_prog(xsk);
> > > > +             if (err)
> > > > +                     return err;
> > >
> > > Needs to undo created maps on error.
> > >
> > > > +             prog_loaded = true;
> > > > +     }
> > > > +
> > > > +     err = xsk_update_bpf_maps(xsk, true, xsk->fd);
> > > > +     if (err)
> > > > +             goto out_load;
> > > > +
> > > > +     return 0;
> > > > +
> > > > +out_load:
> > > > +     if (prog_loaded)
> > > > +             close(xsk->prog_fd);
> > >
> > > Ditto
> > >
> > > > +     return err;
> > > > +}
> > > > +
> > > > +int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname,
> > > > +                    __u32 queue_id, struct xsk_umem *umem,
> > > > +                    struct xsk_ring_cons *rx, struct xsk_ring_prod *tx,
> > > > +                    const struct xsk_socket_config *usr_config)
> > > > +{
> > > > +     struct sockaddr_xdp sxdp = {};
> > > > +     struct xdp_mmap_offsets off;
> > > > +     struct xsk_socket *xsk;
> > > > +     socklen_t optlen;
> > > > +     void *map;
>

^ permalink raw reply

* Re: Question on ptr_ring linux header
From: Michael S. Tsirkin @ 2019-01-31 14:39 UTC (permalink / raw)
  To: fei phung; +Cc: feiphung, netdev
In-Reply-To: <CAF8QhUhSWauyg2a1Y_MjP0gk-W9oydbpuBvtZyGZEbfmm8+mQA@mail.gmail.com>

On Thu, Jan 31, 2019 at 01:16:31PM +0800, fei phung wrote:
> Hi,
> 
> /*
>  * Filename: circ_ring.c
>  * Version: 1.0
>  * Description: A circular buffer using API from
>  * https://github.com/torvalds/linux/blob/master/include/linux/ptr_ring.h
>  */
> 
> ptr_ring's void** queue is just giving data race problem, running
> consume() together with [assignment of pointers+produce()] will
> definitely give rise to data race
> 
> mutex or lock cannot help in this case. Please correct me if wrong
> 
> Regards,
> Phung

I am not sure what does assignment of pointers mean in this context.
ptr_ring is designed for a single producer and a single consumer.  For
why it works see explanation about data dependencies in
Documentation/memory-barriers.txt.  You will have to be more specific
about the data race that you see if you expect more specific answers.

Thanks,

-- 
MST

^ permalink raw reply

* Re: [PATCH net] net/mlx4_en: Force CHECKSUM_NONE for short ethernet frames
From: Eric Dumazet @ 2019-01-31 15:02 UTC (permalink / raw)
  To: Tariq Toukan; +Cc: David S. Miller, netdev, Eran Ben Elisha, Saeed Mahameed
In-Reply-To: <1548939763-20074-1-git-send-email-tariqt@mellanox.com>

On Thu, Jan 31, 2019 at 5:04 AM Tariq Toukan <tariqt@mellanox.com> wrote:
>
> From: Saeed Mahameed <saeedm@mellanox.com>
>
> When an ethernet frame is padded to meet the minimum ethernet frame
> size, the padding octets are not covered by the hardware checksum.
> Fortunately the padding octets are usually zero's, which don't affect
> checksum. However, it is not guaranteed. For example, switches might
> choose to make other use of these octets.
> This repeatedly causes kernel hardware checksum fault.
>
> Prior to the cited commit below, skb checksum was forced to be
> CHECKSUM_NONE when padding is detected. After it, we need to keep
> skb->csum updated. However, fixing up CHECKSUM_COMPLETE requires to
> verify and parse IP headers, it does not worth the effort as the packets
> are so small that CHECKSUM_COMPLETE has no significant advantage.
>
> Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
>
> Hi Dave,
> Please queue for -stable >= v4.18.
>
> Thanks,
> Tariq
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 9a0881cb7f51..fc8a11444e1a 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -617,6 +617,8 @@ static int get_fixed_ipv6_csum(__wsum hw_checksum, struct sk_buff *skb,
>  }
>  #endif
>
> +#define short_frame(size) ((size) <= ETH_ZLEN + ETH_FCS_LEN)
> +
>  /* We reach this function only after checking that any of
>   * the (IPv4 | IPv6) bits are set in cqe->status.
>   */
> @@ -624,9 +626,22 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
>                       netdev_features_t dev_features)
>  {
>         __wsum hw_checksum = 0;
> +       void *hdr;
> +
> +       /* CQE csum doesn't cover padding octets in short ethernet
> +        * frames. And the pad field is appended prior to calculating
> +        * and appending the FCS field.
> +        *
> +        * Detecting these padded frames requires to verify and parse
> +        * IP headers, so we simply force all those small frames to be
> +        * CHECKSUM_NONE even if they are not padded.
> +        * TODO: better if we report CHECKSUM_UNNECESSARY but this
> +        * demands some refactroing.

This TODO: looks bogus to me.

mlx4 driver first tries to use CHECKSUM_UNNECESSARY.

if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
                                                  MLX4_CQE_STATUS_UDP)) &&
     (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
      cqe->checksum == cpu_to_be16(0xffff)) {
     ...
      ip_summed = CHECKSUM_UNNECESSARY;
      ...
}

Then if this attempt failed, it tries to use CHECKSUM_COMPLETE (and
calls this check_sum() function)

So there is no refactoring to be done for mlx4 : short
IPv[46]+{UDP|TCP} frames get CHECKSUM_UNNECESSARY already

> +        */
> +       if (short_frame(skb->len))
> +               return -EINVAL;
>
> -       void *hdr = (u8 *)va + sizeof(struct ethhdr);
> -
> +       hdr = (u8 *)va + sizeof(struct ethhdr);
>         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
>
>         if (cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_CVLAN_PRESENT_MASK) &&
> --
> 1.8.3.1
>

^ permalink raw reply

* Re: BUG: KASAN: double-free or invalid-free in ip_defrag after upgrade from 4.19.13
From: Eric Dumazet @ 2019-01-31 15:05 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Ivan Babrou, Michal Kubecek, Linux Kernel Network Developers,
	David S. Miller, Ignat Korchagin, Shawn Bohrer, Jakub Sitnicki
In-Reply-To: <20190131124816.GA8031@kroah.com>

On Thu, Jan 31, 2019 at 4:48 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:

> Thanks for this, I'll turn this into a real patch and backport it to
> where it is needed.
>
> greg k-h

Thanks a lot Greg !

^ permalink raw reply

* Re: [PATCH] net: check negative value for signed refcnt
From: Alexandre BESNARD @ 2019-01-31 15:14 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: davem, amritha nambiar, lirongqing, netdev, linux-kernel,
	alexander h duyck, jiri, petrm, ecree
In-Reply-To: <8b654b17-f8b4-0c89-26b5-311aeb703f6d@virtuozzo.com>

Hi Kirill, and thanks for your time,

On 31 Jan 19 14:49, Kirill Tkhai ktkhai@virtuozzo.com wrote :

> Hi, Alexandre,

> On 31.01.2019 16:20, alexandre.besnard@softathome.com wrote:
> > From: Alexandre Besnard <alexandre.besnard@softathome.com>

> > Device remaining references counter is get as a signed integer.

> > When unregistering network devices, the loop waiting for this counter
> > to decrement tests the 0 strict equality. Thus if an error occurs and
> > two references are given back by a protocol, we are stuck in the loop
> > forever, with a -1 value.

> > Robustness is added by checking a negative value: the device is then
> > considered free of references, and a warning is issued (it should not
> > happen, one should check that behavior)

> > Signed-off-by: Alexandre Besnard <alexandre.besnard@softathome.com>
> > ---
> > net/core/dev.c | 5 +++++
> > 1 file changed, 5 insertions(+)

> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index ddc551f..e4190ae 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -8687,6 +8687,11 @@ static void netdev_wait_allrefs(struct net_device *dev)
> > refcnt = netdev_refcnt_read(dev);

> > while (refcnt != 0) {
> > + if (refcnt < 0) {
> > + pr_warn("Device %s refcnt negative: device considered free, but it should not
> > happen\n",
> > + dev->name);
> > + break;
> > + }

> 1)I don't think this is a good approach. Negative value does not guarantee
> there is just a double put of device reference. Negative value is an indicator
> something goes wrong, and we definitely should not free device memory in
> this case.

> 2)Not related to your patch -- it looks like we have problem in existing
> code with this netdev_refcnt_read(). It does not imply a memory ordering
> or some guarantees about reading percpu values. For example, in generic
> code struct percpu_ref switches a counter into atomic mode before it checks
> for the last reference. But there is nothing in netdev_refcnt_read().

I agree with you, as it is not a full fix for a bad behavior of the refcnt: many
wrong things could happen here, and that's why I added a warning (short of a
more critical flag I could think of).

However, I think this is a good approach as a global workaround for any critical
situation caused by a negative refcnt, acting as a failsafe. What I try to avoid
here is not the bug, but a situation such as a deadlock keeping a system from
powering off, or way worse in the system life.
On the other hand, I can't think of a critical situation caused by freeing
the device memory. Processes or even systems may crash in some cases, but it
should be an expected behavior in such a case IMHO.

Actually, I think that with the current implementation, most of the systems
locked in the problem are powered off.

Do you think of any issue beyond this behavior ? 

^ permalink raw reply

* Re: [PATCH] net: check negative value for signed refcnt
From: Eric Dumazet @ 2019-01-31 15:15 UTC (permalink / raw)
  To: Kirill Tkhai, alexandre.besnard, davem, ecree, jiri, petrm,
	alexander.h.duyck, amritha.nambiar, lirongqing
  Cc: netdev, linux-kernel
In-Reply-To: <8b654b17-f8b4-0c89-26b5-311aeb703f6d@virtuozzo.com>



On 01/31/2019 05:49 AM, Kirill Tkhai wrote:
> 
> 2)Not related to your patch -- it looks like we have problem in existing
> code with this netdev_refcnt_read(). It does not imply a memory ordering
> or some guarantees about reading percpu values. For example, in generic
> code struct percpu_ref switches a counter into atomic mode before it checks
> for the last reference. But there is nothing in netdev_refcnt_read().

Well, if we read an old value here, after a full and expensive synchronize_net(),
then we would have lot more problems than simply having a second round in
netdev_wait_allrefs()
 

^ permalink raw reply

* Re: [PATCH] tty: Fix WARNING in tty_set_termios
From: Marcel Holtmann @ 2019-01-31 15:18 UTC (permalink / raw)
  To: Johan Hovold
  Cc: shuah, Al Viro, open list:NFC SUBSYSTEM, chris, devel,
	Rob Herring, Samuel Ortiz, open list:SERIAL DRIVERS, Jiri Slaby,
	santhameena13, kirk, Johan Hedberg, Arnd Bergmann,
	samuel.thibault, m.maya.nakamura, zhongjiang, Greg KH, speakup,
	Linux Kernel Mailing List, Bluez mailing list, netdev,
	nishka.dasgupta_ug18, David S. Miller
In-Reply-To: <20190130103227.GR3691@localhost>

Hi Johan,

>>> On Fri, Jan 25, 2019 at 04:29:05PM -0700, Shuah Khan wrote:
>>>> tty_set_termios() has the following WARMN_ON which can be triggered with a
>>>> syscall to invoke TIOCGETD __NR_ioctl.
> 
> You meant TIOCSETD here, and in fact its the call which sets the uart
> protocol that triggers the warning.
> 
>>>> WARN_ON(tty->driver->type == TTY_DRIVER_TYPE_PTY &&
>>>>                 tty->driver->subtype == PTY_TYPE_MASTER);
>>>> Reference: https://syzkaller.appspot.com/bug?id=2410d22f1d8e5984217329dd0884b01d99e3e48d
>>>> 
>>>> A simple change would have been to print error message instead of WARN_ON.
>>>> However, the callers assume that tty_set_termios() always returns 0 and
>>>> don't check return value. The complete solution is fixing all the callers
>>>> to check error and bail out to fix the WARN_ON.
>>>> 
>>>> This fix changes tty_set_termios() to return error and all the callers
>>>> to check error and bail out. The reproducer is used to reproduce the
>>>> problem and verify the fix.
>>> 
>>>> --- a/drivers/bluetooth/hci_ldisc.c
>>>> +++ b/drivers/bluetooth/hci_ldisc.c
>>>> @@ -321,6 +321,8 @@ void hci_uart_set_flow_control(struct hci_uart *hu, bool enable)
>>>>  		status = tty_set_termios(tty, &ktermios);
>>>>  		BT_DBG("Disabling hardware flow control: %s",
>>>>  		       status ? "failed" : "success");
>>>> +		if (status)
>>>> +			return;
>>> 
>>> Can that ldisc end up set on pty master?  And does it make any sense there?
>> 
>> The initial objective of the patch is to prevent the WARN_ON by making
>> the change to return error instead of WARN_ON. However, without changes
>> to places that don't check the return and keep making progress, there
>> will be secondary problems.
>> 
>> Without this change to return here, instead of WARN_ON, it will fail
>> with the following NULL pointer dereference at the next thing 
>> hci_uart_set_flow_control() attempts.
>> 
>> status = tty->driver->ops->tiocmget(tty);
>> 
>> kernel: [10140.649783] BUG: unable to handle kernel NULL pointer 
> 
> That's a separate issue, which is being fixed:
> 
> 	https://lkml.kernel.org/r/20190130095938.GP3691@localhost
> 
>>> IOW, I don't believe that this patch makes any sense.  If anything,
>>> we need to prevent unconditional tty_set_termios() on the path
>>> that *does* lead to calling it for pty.
>> 
>> I don't think preventing unconditional tty_set_termios() is enough to
>> prevent secondary problems such as the one above.
>> 
>> For example, the following call chain leads to the WARN_ON that was
>> reported. Even if void hci_uart_set_baudrate() prevents the very first
>> tty_set_termios() call, its caller hci_uart_setup() continues with
>> more tty setup. It goes ahead to call driver setup callback. The
>> driver callback goes on to do more setup calling tty_set_termios().
>> 
>> WARN_ON call path:
>>  hci_uart_set_baudrate+0x1cc/0x250 drivers/bluetooth/hci_ldisc.c:378
>>  hci_uart_setup+0xa2/0x490 drivers/bluetooth/hci_ldisc.c:401
>>  hci_dev_do_open+0x6b1/0x1920 net/bluetooth/hci_core.c:1423
>> 
>> Once this WARN_ON is changed to return error, the following
>> happens, when hci_uart_setup() does driver setup callback.
>> 
>> kernel: [10140.649836]  mrvl_setup+0x17/0x80 [hci_uart]
>> kernel: [10140.649840]  hci_uart_setup+0x56/0x160 [hci_uart]
>> kernel: [10140.649850]  hci_dev_do_open+0xe6/0x630 [bluetooth]
>> kernel: [10140.649860]  hci_power_on+0x52/0x220 [bluetooth]
>> 
>> I think continuing to catch the invalid condition in tty_set_termios()
>> and preventing progress by checking return value is a straight forward
>> change to avoid secondary problems, and it might be difficult to catch
>> all the cases where it could fail.
> 
> I agree with Al that this change doesn't make much sense. The WARN_ON
> is there to catch any bugs leading to the termios being changed for a
> master side pty. Those should bugs should be fixed, and not worked
> around in order to silence a WARN_ON.
> 
> The problem started with 7721383f4199 ("Bluetooth: hci_uart: Support
> operational speed during setup") which introduced a new way for how
> tty_set_termios() could end up being called for a master pty.
> 
> As Al hinted at, setting these ldiscs for a master pty really makes no
> sense and perhaps that is what we should prevent unless simply making
> sure they do not call tty_set_termios() is sufficient for the time
> being.
> 
> Finally, note that serdev never operates on a pty, and that this is only
> an issue for (the three) line disciplines.

I think for PTYs we should just fail setting the HCI line discipline. Fail early and just move on with life.

Regards

Marcel


^ permalink raw reply

* Re: [PATCH] net: check negative value for signed refcnt
From: Eric Dumazet @ 2019-01-31 15:21 UTC (permalink / raw)
  To: Kirill Tkhai, alexandre.besnard, davem, ecree, jiri, petrm,
	alexander.h.duyck, amritha.nambiar, lirongqing
  Cc: netdev, linux-kernel
In-Reply-To: <7bc31b61-8852-88a7-12bf-494c0e54574b@gmail.com>



On 01/31/2019 07:15 AM, Eric Dumazet wrote:
> 
> 
> On 01/31/2019 05:49 AM, Kirill Tkhai wrote:
>>
>> 2)Not related to your patch -- it looks like we have problem in existing
>> code with this netdev_refcnt_read(). It does not imply a memory ordering
>> or some guarantees about reading percpu values. For example, in generic
>> code struct percpu_ref switches a counter into atomic mode before it checks
>> for the last reference. But there is nothing in netdev_refcnt_read().
> 
> Well, if we read an old value here, after a full and expensive synchronize_net(),
> then we would have lot more problems than simply having a second round in
> netdev_wait_allrefs()
>  
> 

percpu_ref was added more recently than the netdev_refcnt stuff, and is
interesting for users wanting a synchronous wait for the refcnt reaching 0.

netdev_wait_allrefs() was designed to be asynchronous, so that we at least release
RTNL (and current cpu) when something is wrong and a device can not be dismantled.


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox