Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: r8169 Driver - Poor Network Performance Since Kernel 4.19
From: David Chang @ 2019-02-01  4:29 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: Peter Ceiley, Realtek linux nic maintainers, netdev,
	Martti Laaksonen
In-Reply-To: <4d832c16-8830-b746-a818-6026c2e6725c@gmail.com>

On Jan 31, 2019 at 07:35:30 +0100, Heiner Kallweit wrote:
> Hi David, two more things:
> 
> 1. Could you please test a recent linux-next kernel?

Not tested yet. Will do if possible.

> 2. Please get a register dump (ethtool -d <if>) from 4.18 and 4.19
>    and compare them.

For your informaiton.

[with pcie_aspm=off]
--- v4.18.15	2019-02-01 12:11:56.019051828 +0800
+++ v4.9.11	2019-02-01 12:12:26.827439645 +0800
@@ -3,18 +3,19 @@
 Offset          Values
 ------          ------
 0x0000:         ec 8e b5 5a 2c f5 00 00 48 00 40 00 80 00 80 00
-0x0010:         00 10 38 0e 04 00 00 00 78 00 06 00 00 00 00 00
-0x0020:         00 f0 9b f6 03 00 00 00 00 00 00 00 00 00 00 00
+0x0010:         00 f0 ba 0d 04 00 00 00 78 00 06 00 00 00 00 00
+0x0020:         00 d0 35 f7 03 00 00 00 00 00 00 00 00 00 00 00
 0x0030:         00 00 00 00 00 00 00 0c 00 00 00 00 3f 80 00 00
-0x0040:         80 0f 10 57 0e cf 02 00 00 cf ba 34 00 00 00 00
-0x0050:         10 00 cf 18 60 11 00 01 11 11 11 00 00 00 00 00
+0x0040:         80 0f 10 57 0e cf 02 00 00 d8 c7 50 00 00 00 00
+0x0050:         10 00 cf 98 60 11 01 01 11 11 11 00 00 00 00 00
 0x0060:         00 00 00 00 3c 10 00 81 2c f0 00 80 93 00 80 f0
-0x0070:         00 6f 00 c4 b0 31 00 00 07 00 00 00 00 00 00 00
+0x0070:         00 6f 00 c4 b0 31 00 00 07 00 00 00 00 00 76 d0
 0x0080:         8b 06 01 00 00 00 00 00 00 00 00 00 00 00 00 00
 0x0090:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 0x00a0:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
-0x00b0:         7f 04 00 00 00 00 00 00 e1 c1 05 d2 00 00 00 00
+0x00b0:         7f 04 00 00 00 00 00 00 ad 79 01 d2 00 00 00 00
 0x00c0:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 0x00d0:         21 00 00 32 0e 00 00 00 00 00 00 40 06 11 fd 00
-0x00e0:         e1 20 51 51 00 30 94 f6 03 00 00 00 27 00 00 00
+0x00e0:         e1 20 51 51 00 e0 35 f7 03 00 00 00 27 00 00 00
 0x00f0:         3f 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00

[pcie_aspm.policy=performance]
--- v4.18.15-p	2019-02-01 12:18:46.919221060 +0800
+++ v4.9.11-p	2019-02-01 12:19:09.207474824 +0800
@@ -3,18 +3,19 @@
 Offset          Values
 ------          ------
 0x0000:         ec 8e b5 5a 2c f5 00 00 48 00 40 00 80 00 80 00
-0x0010:         00 f0 bc 0d 04 00 00 00 78 00 06 00 00 00 00 00
-0x0020:         00 60 2e f7 03 00 00 00 00 00 00 00 00 00 00 00
+0x0010:         00 c0 22 09 04 00 00 00 78 00 06 00 00 00 00 00
+0x0020:         00 f0 e5 f4 03 00 00 00 00 00 00 00 00 00 00 00
 0x0030:         00 00 00 00 00 00 00 0c 00 00 00 00 3f 80 00 00
-0x0040:         80 0f 10 57 0e cf 02 00 00 53 50 1a 00 00 00 00
-0x0050:         10 00 cf 18 60 11 00 01 11 11 11 00 00 00 00 00
+0x0040:         80 0f 10 57 0e cf 02 00 00 d2 35 7b 00 00 00 00
+0x0050:         10 00 cf 98 60 11 01 01 11 11 11 00 00 00 00 00
 0x0060:         00 00 00 00 3c 10 00 81 2c f0 00 80 93 00 80 f0
-0x0070:         00 6f 00 c4 b0 31 00 00 07 00 00 00 00 00 00 00
+0x0070:         00 6f 00 c4 b0 31 00 00 07 00 00 00 00 00 a4 a0
 0x0080:         8b 06 01 00 00 00 00 00 00 00 00 00 00 00 00 00
 0x0090:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 0x00a0:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
-0x00b0:         7f 04 00 00 00 00 00 00 e1 c1 05 d2 00 00 00 00
+0x00b0:         7f 04 00 00 00 00 00 00 ad 79 01 d2 00 00 00 00
 0x00c0:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 0x00d0:         21 00 00 32 0e 00 00 00 00 00 00 40 06 11 fd 00
-0x00e0:         e1 20 51 51 00 70 2e f7 03 00 00 00 27 00 00 00
+0x00e0:         e1 20 51 51 00 00 e6 f4 03 00 00 00 27 00 00 00
 0x00f0:         3f 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00

Thanks,
David

> Heiner
> 
> 
> On 31.01.2019 07:21, Heiner Kallweit wrote:
> > David, thanks for the link to the bug ticket.
> > I think only a proper bisect can help to find the offending commit.
> > 
> > Heiner
> > 
> > 
> > On 31.01.2019 03:32, David Chang wrote:
> >> Hi,
> >>
> >> We had a similr case here.
> >> - Realtek r8169 receive performance regression in kernel 4.19
> >>   https://bugzilla.suse.com/show_bug.cgi?id=1119649
> >>
> >> kernel: r8169 0000:01:00.0 eth0: RTL8168h/8111h, XID 54100880
> >> The major symptom is there are many rx_missed count.
> >>
> >>
> >> On Jan 30, 2019 at 20:15:45 +0100, Heiner Kallweit wrote:
> >>> Hi Peter,
> >>>
> >>> recently I had somebody where pcie_aspm=off for whatever reason didn't
> >>> do the trick, can you also check with pcie_aspm.policy=performance.
> >>
> >> We will give it a try later.
> >>
> >>> And please check with "ethtool -S <if>" whether the chip statistics
> >>> show a significant number of errors.
> >>>
> >>> If this doesn't help you may have to bisect to find the offending commit.
> >>
> >> We had tried fallback driver to a few previous commits as following,
> >> but with no luck.
> >>
> >> 9675931e6b65 r8169: re-enable MSI-X on RTL8168g (v4.19)
> >> 098b01ad9837 r8169: don't include asm headers directly (v4.19-rc1)
> >> a2965f12fde6 r8169: remove rtl8169_set_speed_xmii (v4.19-rc1)
> >> 6fcf9b1d4d6c r8169: fix runtime suspend (v4.19-rc1)
> >> e397286b8e89 r8169: remove TBI 1000BaseX support (v4.19-rc1)
> >>
> >> Thanks,
> >> David Chang
> >>
> >>>
> >>> Heiner
> >>>
> >>>
> >>> On 30.01.2019 10:59, Peter Ceiley wrote:
> >>>> Hi Heiner,
> >>>>
> >>>> I tried disabling the ASPM using the pcie_aspm=off kernel parameter
> >>>> and this made no difference.
> >>>>
> >>>> I tried compiling the 4.18.16 r8169.c with the 4.19.18 source and
> >>>> subsequently loaded the module in the running 4.19.18 kernel. I can
> >>>> confirm that this immediately resolved the issue and access to the NFS
> >>>> shares operated as expected.
> >>>>
> >>>> I presume this means it is an issue with the r8169 driver included in
> >>>> 4.19 onwards?
> >>>>
> >>>> To answer your last questions:
> >>>>
> >>>> Base Board Information
> >>>>     Manufacturer: Alienware
> >>>>     Product Name: 0PGRP5
> >>>>     Version: A02
> >>>>
> >>>> ... and yes, the RTL8168 is the onboard network chip.
> >>>>
> >>>> Regards,
> >>>>
> >>>> Peter.
> >>>>
> >>>> On Tue, 29 Jan 2019 at 17:44, Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>>>>
> >>>>> Hi Peter,
> >>>>>
> >>>>> I think the vendor driver doesn't enable ASPM per default.
> >>>>> So it's worth a try to disable ASPM in the BIOS or via sysfs.
> >>>>> Few older systems seem to have issues with ASPM, what kind of
> >>>>> system / mainboard are you using? The RTL8168 is the onboard
> >>>>> network chip?
> >>>>>
> >>>>> Rgds, Heiner
> >>>>>
> >>>>>
> >>>>> On 29.01.2019 07:20, Peter Ceiley wrote:
> >>>>>> Hi Heiner,
> >>>>>>
> >>>>>> Thanks, I'll do some more testing. It might not be the driver - I
> >>>>>> assumed it was due to the fact that using the r8168 driver 'resolves'
> >>>>>> the issue. I'll see if I can test the r8169.c on top of 4.19 - this is
> >>>>>> a good idea.
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Peter.
> >>>>>>
> >>>>>> On Tue, 29 Jan 2019 at 17:16, Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi Peter,
> >>>>>>>
> >>>>>>> at a first glance it doesn't look like a typical driver issue.
> >>>>>>> What you could do:
> >>>>>>>
> >>>>>>> - Test the r8169.c from 4.18 on top of 4.19.
> >>>>>>>
> >>>>>>> - Check whether disabling ASPM (/sys/module/pcie_aspm) has an effect.
> >>>>>>>
> >>>>>>> - Bisect between 4.18 and 4.19 to find the offending commit.
> >>>>>>>
> >>>>>>> Any specific reason why you think root cause is in the driver and not
> >>>>>>> elsewhere in the network subsystem?
> >>>>>>>
> >>>>>>> Heiner
> >>>>>>>
> >>>>>>>
> >>>>>>> On 28.01.2019 23:10, Peter Ceiley wrote:
> >>>>>>>> Hi Heiner,
> >>>>>>>>
> >>>>>>>> Thanks for getting back to me.
> >>>>>>>>
> >>>>>>>> No, I don't use jumbo packets.
> >>>>>>>>
> >>>>>>>> Bandwidth is *generally* good, and iperf results to my NAS provide
> >>>>>>>> over 900 Mbits/s in both circumstances. The issue seems to appear when
> >>>>>>>> establishing a connection and is most notable, for example, on my
> >>>>>>>> mounted NFS shares where it takes seconds (up to 10's of seconds on
> >>>>>>>> larger directories) to list the contents of each directory. Once a
> >>>>>>>> transfer begins on a file, I appear to get good bandwidth.
> >>>>>>>>
> >>>>>>>> I'm unsure of the best scientific data to provide you in order to
> >>>>>>>> troubleshoot this issue. Running the following
> >>>>>>>>
> >>>>>>>>     netstat -s |grep retransmitted
> >>>>>>>>
> >>>>>>>> shows a steady increase in retransmitted segments each time I list the
> >>>>>>>> contents of a remote directory, for example, running 'ls' on a
> >>>>>>>> directory containing 345 media files did the following using kernel
> >>>>>>>> 4.19.18:
> >>>>>>>>
> >>>>>>>> increased retransmitted segments by 21 and the 'time' command showed
> >>>>>>>> the following:
> >>>>>>>>     real    0m19.867s
> >>>>>>>>     user    0m0.012s
> >>>>>>>>     sys    0m0.036s
> >>>>>>>>
> >>>>>>>> The same command shows no retransmitted segments running kernel
> >>>>>>>> 4.18.16 and 'time' showed:
> >>>>>>>>     real    0m0.300s
> >>>>>>>>     user    0m0.004s
> >>>>>>>>     sys    0m0.007s
> >>>>>>>>
> >>>>>>>> ifconfig does not show any RX/TX errors nor dropped packets in either case.
> >>>>>>>>
> >>>>>>>> dmesg XID:
> >>>>>>>> [    2.979984] r8169 0000:03:00.0 eth0: RTL8168g/8111g,
> >>>>>>>> f8:b1:56:fe:67:e0, XID 4c000800, IRQ 32
> >>>>>>>>
> >>>>>>>> # lspci -vv
> >>>>>>>> 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
> >>>>>>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
> >>>>>>>>     Subsystem: Dell RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
> >>>>>>>>     Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> >>>>>>>> ParErr- Stepping- SERR- FastB2B- DisINTx+
> >>>>>>>>     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> >>>>>>>> <TAbort- <MAbort- >SERR- <PERR- INTx-
> >>>>>>>>     Latency: 0, Cache Line Size: 64 bytes
> >>>>>>>>     Interrupt: pin A routed to IRQ 19
> >>>>>>>>     Region 0: I/O ports at d000 [size=256]
> >>>>>>>>     Region 2: Memory at f7b00000 (64-bit, non-prefetchable) [size=4K]
> >>>>>>>>     Region 4: Memory at f2100000 (64-bit, prefetchable) [size=16K]
> >>>>>>>>     Capabilities: [40] Power Management version 3
> >>>>>>>>         Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA
> >>>>>>>> PME(D0+,D1+,D2+,D3hot+,D3cold+)
> >>>>>>>>         Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> >>>>>>>>     Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
> >>>>>>>>         Address: 0000000000000000  Data: 0000
> >>>>>>>>     Capabilities: [70] Express (v2) Endpoint, MSI 01
> >>>>>>>>         DevCap:    MaxPayload 128 bytes, PhantFunc 0, Latency L0s
> >>>>>>>> <512ns, L1 <64us
> >>>>>>>>             ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
> >>>>>>>> SlotPowerLimit 10.000W
> >>>>>>>>         DevCtl:    CorrErr- NonFatalErr- FatalErr- UnsupReq-
> >>>>>>>>             RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> >>>>>>>>             MaxPayload 128 bytes, MaxReadReq 4096 bytes
> >>>>>>>>         DevSta:    CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
> >>>>>>>>         LnkCap:    Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit
> >>>>>>>> Latency L0s unlimited, L1 <64us
> >>>>>>>>             ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
> >>>>>>>>         LnkCtl:    ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
> >>>>>>>>             ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
> >>>>>>>>         LnkSta:    Speed 2.5GT/s (ok), Width x1 (ok)
> >>>>>>>>             TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> >>>>>>>>         DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+,
> >>>>>>>> OBFF Via message/WAKE#
> >>>>>>>>              AtomicOpsCap: 32bit- 64bit- 128bitCAS-
> >>>>>>>>         DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+,
> >>>>>>>> OBFF Disabled
> >>>>>>>>              AtomicOpsCtl: ReqEn-
> >>>>>>>>         LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
> >>>>>>>>              Transmit Margin: Normal Operating Range,
> >>>>>>>> EnterModifiedCompliance- ComplianceSOS-
> >>>>>>>>              Compliance De-emphasis: -6dB
> >>>>>>>>         LnkSta2: Current De-emphasis Level: -6dB,
> >>>>>>>> EqualizationComplete-, EqualizationPhase1-
> >>>>>>>>              EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> >>>>>>>>     Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
> >>>>>>>>         Vector table: BAR=4 offset=00000000
> >>>>>>>>         PBA: BAR=4 offset=00000800
> >>>>>>>>     Capabilities: [d0] Vital Product Data
> >>>>>>>> pcilib: sysfs_read_vpd: read failed: Input/output error
> >>>>>>>>         Not readable
> >>>>>>>>     Capabilities: [100 v1] Advanced Error Reporting
> >>>>>>>>         UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> >>>>>>>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> >>>>>>>>         UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> >>>>>>>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> >>>>>>>>         UESvrt:    DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
> >>>>>>>> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> >>>>>>>>         CESta:    RxErr+ BadTLP+ BadDLLP+ Rollover- Timeout+ AdvNonFatalErr-
> >>>>>>>>         CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
> >>>>>>>>         AERCap:    First Error Pointer: 00, ECRCGenCap+ ECRCGenEn-
> >>>>>>>> ECRCChkCap+ ECRCChkEn-
> >>>>>>>>             MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
> >>>>>>>>         HeaderLog: 00000000 00000000 00000000 00000000
> >>>>>>>>     Capabilities: [140 v1] Virtual Channel
> >>>>>>>>         Caps:    LPEVC=0 RefClk=100ns PATEntryBits=1
> >>>>>>>>         Arb:    Fixed- WRR32- WRR64- WRR128-
> >>>>>>>>         Ctrl:    ArbSelect=Fixed
> >>>>>>>>         Status:    InProgress-
> >>>>>>>>         VC0:    Caps:    PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
> >>>>>>>>             Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
> >>>>>>>>             Ctrl:    Enable+ ID=0 ArbSelect=Fixed TC/VC=01
> >>>>>>>>             Status:    NegoPending- InProgress-
> >>>>>>>>     Capabilities: [160 v1] Device Serial Number 01-00-00-00-68-4c-e0-00
> >>>>>>>>     Capabilities: [170 v1] Latency Tolerance Reporting
> >>>>>>>>         Max snoop latency: 71680ns
> >>>>>>>>         Max no snoop latency: 71680ns
> >>>>>>>>     Kernel driver in use: r8169
> >>>>>>>>     Kernel modules: r8169
> >>>>>>>>
> >>>>>>>> Please let me know if you have any other ideas in terms of testing.
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>>
> >>>>>>>> Peter.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, 29 Jan 2019 at 05:28, Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On 28.01.2019 12:13, Peter Ceiley wrote:
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I have been experiencing very poor network performance since Kernel
> >>>>>>>>>> 4.19 and I'm confident it's related to the r8169 driver.
> >>>>>>>>>>
> >>>>>>>>>> I have no issue with kernel versions 4.18 and prior. I am experiencing
> >>>>>>>>>> this issue in kernels 4.19 and 4.20 (currently running/testing with
> >>>>>>>>>> 4.20.4 & 4.19.18).
> >>>>>>>>>>
> >>>>>>>>>> If someone could guide me in the right direction, I'm happy to help
> >>>>>>>>>> troubleshoot this issue. Note that I have been keeping an eye on one
> >>>>>>>>>> issue related to loading of the PHY driver, however, my symptoms
> >>>>>>>>>> differ in that I still have a network connection. I have attempted to
> >>>>>>>>>> reload the driver on a running system, but this does not improve the
> >>>>>>>>>> situation.
> >>>>>>>>>>
> >>>>>>>>>> Using the proprietary r8168 driver returns my device to proper working order.
> >>>>>>>>>>
> >>>>>>>>>> lshw shows:
> >>>>>>>>>>        description: Ethernet interface
> >>>>>>>>>>        product: RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
> >>>>>>>>>>        vendor: Realtek Semiconductor Co., Ltd.
> >>>>>>>>>>        physical id: 0
> >>>>>>>>>>        bus info: pci@0000:03:00.0
> >>>>>>>>>>        logical name: enp3s0
> >>>>>>>>>>        version: 0c
> >>>>>>>>>>        serial:
> >>>>>>>>>>        size: 1Gbit/s
> >>>>>>>>>>        capacity: 1Gbit/s
> >>>>>>>>>>        width: 64 bits
> >>>>>>>>>>        clock: 33MHz
> >>>>>>>>>>        capabilities: pm msi pciexpress msix vpd bus_master cap_list
> >>>>>>>>>> ethernet physical tp aui bnc mii fibre 10bt 10bt-fd 100bt 100bt-fd
> >>>>>>>>>> 1000bt-fd autonegotiation
> >>>>>>>>>>        configuration: autonegotiation=on broadcast=yes driver=r8169
> >>>>>>>>>> duplex=full firmware=rtl8168g-2_0.0.1 02/06/13 ip=192.168.1.25
> >>>>>>>>>> latency=0 link=yes multicast=yes port=MII speed=1Gbit/s
> >>>>>>>>>>        resources: irq:19 ioport:d000(size=256)
> >>>>>>>>>> memory:f7b00000-f7b00fff memory:f2100000-f2103fff
> >>>>>>>>>>
> >>>>>>>>>> Kind Regards,
> >>>>>>>>>>
> >>>>>>>>>> Peter.
> >>>>>>>>>>
> >>>>>>>>> Hi Peter,
> >>>>>>>>>
> >>>>>>>>> the description "poor network performance" is quite vague, therefore:
> >>>>>>>>>
> >>>>>>>>> - Can you provide any measurements?
> >>>>>>>>> - iperf results before and after
> >>>>>>>>> - statistics about dropped packets (rx and/or tx)
> >>>>>>>>> - Do you use jumbo packets?
> >>>>>>>>>
> >>>>>>>>> Also help would be a "lspci -vv" output for the network card and
> >>>>>>>>> the dmesg output line with the chip XID.
> >>>>>>>>>
> >>>>>>>>> Heiner
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> > 
> 
> 

^ permalink raw reply

* [PATCH net] add snmp counter document
From: yupeng @ 2019-02-01  5:45 UTC (permalink / raw)
  To: netdev, davem; +Cc: xiyou.wangcong, rdunlap

add document for tcp retransmission, tcp fast open, syn cookies,
challenge ack, prune and several general counters

Signed-off-by: yupeng <yupeng0921@gmail.com>
---
 Documentation/networking/snmp_counter.rst | 143 ++++++++++++++++++++++
 1 file changed, 143 insertions(+)

diff --git a/Documentation/networking/snmp_counter.rst b/Documentation/networking/snmp_counter.rst
index fe8f741193be..8520c500d4e5 100644
--- a/Documentation/networking/snmp_counter.rst
+++ b/Documentation/networking/snmp_counter.rst
@@ -354,6 +354,26 @@ process might be failed due to some errors (e.g. memory alloc failed).
 
 .. _RFC1213 tcpOutRsts: https://tools.ietf.org/html/rfc1213#page-52
 
+* TcpExtTCPSpuriousRtxHostQueues
+When the TCP stack wants to retransmit a packet, and finds that packet
+is not lost in the network, but the packet is not sent yet, the TCP
+stack would give up the retransmission and update this counter. It
+might happen if a packet stays too long time in a qdisc or driver
+queue.
+
+* TcpEstabResets
+The socket receives a RST packet in Establish or CloseWait state.
+
+* TcpExtTCPKeepAlive
+This counter indicates many keepalive packets were sent. The keepalive
+won't be enabled by default. A userspace program could enable it by
+setting the SO_KEEPALIVE socket option.
+
+* TcpExtTCPSpuriousRTOs
+The spurious retransmission timeout detected by the `F-RTO`_
+algorithm.
+
+.. _F-RTO: https://tools.ietf.org/html/rfc5682
 
 TCP Fast Path
 ============
@@ -562,6 +582,24 @@ packet yet, the sender would know packet 4 is out of order. The TCP
 stack of kernel will increase TcpExtTCPSACKReorder for both of the
 above scenarios.
 
+* TcpExtTCPSlowStartRetrans
+The TCP stack wants to retransmit a packet and the congestion control
+state is 'Loss'.
+
+* TcpExtTCPFastRetrans
+The TCP stack wants to retransmit a packet and the congestion control
+state is not 'Loss'.
+
+* TcpExtTCPLostRetransmit
+A SACK points out that a retransmission packet is lost again.
+
+* TcpExtTCPRetransFail
+The TCP stack tries to deliver a retransmission packet to lower layers
+but the lower layers return an error.
+
+* TcpExtTCPSynRetrans
+The TCP stack retransmits a SYN packet.
+
 DSACK
 =====
 The DSACK is defined in `RFC2883`_. The receiver uses DSACK to report
@@ -783,6 +821,111 @@ A TLP probe packet is sent.
 * TcpExtTCPLossProbeRecovery
 A packet loss is detected and recovered by TLP.
 
+TCP Fast Open
+============
+TCP Fast Open is a technology which allows data transfer before the
+3-way handshake complete. Please refer the `TCP Fast Open wiki`_ for a
+general description.
+
+.. _TCP Fast Open wiki: https://en.wikipedia.org/wiki/TCP_Fast_Open
+
+* TcpExtTCPFastOpenActive
+When the TCP stack receives an ACK packet in the SYN-SENT status, and
+the ACK packet acknowledges the data in the SYN packet, the TCP stack
+understand the TFO cookie is accepted by the other side, then it
+updates this counter.
+
+* TcpExtTCPFastOpenActiveFail
+This counter indicates that the TCP stack initiated a TCP Fast Open,
+but it failed. This counter would be updated in three scenarios: (1)
+the other side doesn't acknowledge the data in the SYN packet. (2) The
+SYN packet which has the TFO cookie is timeout at least once. (3)
+after the 3-way handshake, the retransmission timeout happens
+net.ipv4.tcp_retries1 times, because some middle-boxes may black-hole
+fast open after the handshake.
+
+* TcpExtTCPFastOpenPassive
+This counter indicates how many times the TCP stack accepts the fast
+open request.
+
+* TcpExtTCPFastOpenPassiveFail
+This counter indicates how many times the TCP stack rejects the fast
+open request. It is caused by either the TFO cookie is invalid or the
+TCP stack finds an error during the socket creating process.
+
+* TcpExtTCPFastOpenListenOverflow
+When the pending fast open request number is larger than
+fastopenq->max_qlen, the TCP stack will reject the fast open request
+and update this counter. When this counter is updated, the TCP stack
+won't update TcpExtTCPFastOpenPassive or
+TcpExtTCPFastOpenPassiveFail. The fastopenq->max_qlen is set by the
+TCP_FASTOPEN socket operation and it could not be larger than
+net.core.somaxconn. For example:
+
+setsockopt(sfd, SOL_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen));
+
+* TcpExtTCPFastOpenCookieReqd
+This counter indicates how many times a client wants to request a TFO
+cookie.
+
+SYN cookies
+==========
+SYN cookies are used to mitigate SYN flood, for details, please refer
+the `SYN cookies wiki`_.
+
+.. _SYN cookies wiki: https://en.wikipedia.org/wiki/SYN_cookies
+
+* TcpExtSyncookiesSent
+It indicates how many SYN cookies are sent.
+
+* TcpExtSyncookiesRecv
+How many reply packets of the SYN cookies the TCP stack receives.
+
+* TcpExtSyncookiesFailed
+The MSS decoded from the SYN cookie is invalid. When this counter is
+updated, the received packet won't be treated as a SYN cookie and the
+TcpExtSyncookiesRecv counter wont be updated.
+
+Challenge ACK
+============
+For details of challenge ACK, please refer the explaination of
+TcpExtTCPACKSkippedChallenge.
+
+* TcpExtTCPChallengeACK
+The number of challenge acks sent.
+
+* TcpExtTCPSYNChallenge
+The number of challenge acks sent in response to SYN packets. After
+updates this counter, the TCP stack might send a challenge ACK and
+update the TcpExtTCPChallengeACK counter, or it might also skip to
+send the challenge and update the TcpExtTCPACKSkippedChallenge.
+
+prune
+=====
+When a socket is under memory pressure, the TCP stack will try to
+reclaim memory from the receiving queue and out of order queue. One of
+the reclaiming method is 'collapse', which means allocate a big sbk,
+copy the contiguous skbs to the single big skb, and free these
+contiguous skbs.
+
+* TcpExtPruneCalled
+The TCP stack tries to reclaim memory for a socket. After updates this
+counter, the TCP stack will try to collapse the out of order queue and
+the receiving queue. If the memory is still not enough, the TCP stack
+will try to discard packets from the out of order queue (and update the
+TcpExtOfoPruned counter)
+
+* TcpExtOfoPruned
+The TCP stack tries to discard packet on the out of order queue.
+
+* TcpExtRcvPruned
+After 'collapse' and discard packets from the out of order queue, if
+the actually used memory is still larger than the max allowed memory,
+this counter will be updated. It means the 'prune' fails.
+
+* TcpExtTCPRcvCollapsed
+This counter indicates how many skbs are freed during 'collapse'.
+
 examples
 =======
 
-- 
2.17.1


^ permalink raw reply related

* Re: KASAN: use-after-free Read in selinux_netlbl_socket_setsockopt
From: Dmitry Vyukov @ 2019-02-01  6:26 UTC (permalink / raw)
  To: Paul Moore, Ralf Baechle, David Miller, linux-hams, netdev
  Cc: syzbot, Eric Paris, LKML, Stephen Smalley, selinux,
	syzkaller-bugs
In-Reply-To: <CAHC9VhR9wwqVcnmAVcqfwmRuX02F8oMVzNNNQym-TeceHFLqOQ@mail.gmail.com>

On Wed, Jan 30, 2019 at 10:30 PM Paul Moore <paul@paul-moore.com> wrote:
>
> On Wed, Jan 30, 2019 at 4:01 PM syzbot
> <syzbot+1bfc00ca3aabe5bcd4cb@syzkaller.appspotmail.com> wrote:
> >
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit:    62967898789d Merge git://git.kernel.org/pub/scm/linux/kern..
> > git tree:       upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=167fdef8c00000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=4fceea9e2d99ac20
> > dashboard link: https://syzkaller.appspot.com/bug?extid=1bfc00ca3aabe5bcd4cb
> > compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> >
> > Unfortunately, I don't have any reproducer for this crash yet.
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+1bfc00ca3aabe5bcd4cb@syzkaller.appspotmail.com
> >
> > 8021q: adding VLAN 0 to HW filter on device team0
> > ==================================================================
> > BUG: KASAN: use-after-free in selinux_netlbl_socket_setsockopt+0x49b/0x510
> > security/selinux/netlabel.c:525
> > Read of size 8 at addr ffff8880a63cf078 by task syz-executor3/18477
> >
> > CPU: 1 PID: 18477 Comm: syz-executor3 Not tainted 5.0.0-rc4+ #51
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > Call Trace:
> >   __dump_stack lib/dump_stack.c:77 [inline]
> >   dump_stack+0x1db/0x2d0 lib/dump_stack.c:113
> >   print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
> >   kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
> >   __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:135
> >   selinux_netlbl_socket_setsockopt+0x49b/0x510
> > security/selinux/netlabel.c:525
>
> At first glance this seems odd.  The line above is simply
> dereferencing sock->sk_security (getting the "sksec"), which we also
> do higher up selinux_socket_setsockopt() via sock_has_perm().  Unless
> somehow the socket is being released/freed in the middle of a
> setsockopt() syscall this looks like maybe it's something else?

Hi Paul,

Searching for af_netrom across other syzbot bugs:
https://groups.google.com/forum/#!searchin/syzkaller-bugs/af_netrom%7Csort:date

I see at least:
https://syzkaller.appspot.com/bug?extid=b0b1952f5864b4009b09
https://syzkaller.appspot.com/bug?extid=febf3c50d4262e578b1c
https://syzkaller.appspot.com/bug?extid=defa700d16f1bd1b9a05

Which suggests there are some serious lifetime problems in netrom
sockets. That would probably explain this crash as well.

+netrom maintainers, does this explanation look plausible to you?
should we dup this crash onto one of the other netrom bugs? and
perhaps these netrom bugs across themselves too?



> >   selinux_socket_setsockopt+0x67/0x90 security/selinux/hooks.c:4693
> >   security_socket_setsockopt+0x7d/0xc0 security/security.c:1467
> >   __sys_setsockopt+0xe4/0x3a0 net/socket.c:1892
> >   __do_sys_setsockopt net/socket.c:1913 [inline]
> >   __se_sys_setsockopt net/socket.c:1910 [inline]
> >   __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1910
> >   do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
> >   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > RIP: 0033:0x458089
> > Code: 6d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
> > 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
> > ff 0f 83 3b b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
> > RSP: 002b:00007fb988b9cc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
> > RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000458089
> > RDX: 0000000000000078 RSI: 0000000000000084 RDI: 000000000000000a
> > RBP: 000000000073c040 R08: 0000000000000000 R09: 0000000000000000
> > R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb988b9d6d4
> > R13: 00000000004cc9e8 R14: 00000000004da6a8 R15: 00000000ffffffff
> >
> > Allocated by task 18471:
> >   save_stack+0x45/0xd0 mm/kasan/common.c:73
> >   set_track mm/kasan/common.c:85 [inline]
> >   __kasan_kmalloc mm/kasan/common.c:496 [inline]
> >   __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:469
> >   kasan_kmalloc+0x9/0x10 mm/kasan/common.c:504
> >   __do_kmalloc mm/slab.c:3711 [inline]
> >   __kmalloc+0x15c/0x740 mm/slab.c:3720
> >   kmalloc include/linux/slab.h:550 [inline]
> >   sk_prot_alloc+0x19c/0x2e0 net/core/sock.c:1477
> >   sk_alloc+0xd7/0x1690 net/core/sock.c:1531
> >   nr_create+0xb9/0x5e0 net/netrom/af_netrom.c:436
> >   __sock_create+0x532/0x930 net/socket.c:1277
> >   sock_create net/socket.c:1317 [inline]
> >   __sys_socket+0x106/0x260 net/socket.c:1347
> >   __do_sys_socket net/socket.c:1356 [inline]
> >   __se_sys_socket net/socket.c:1354 [inline]
> >   __x64_sys_socket+0x73/0xb0 net/socket.c:1354
> >   do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
> >   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
> > Freed by task 18466:
> >   save_stack+0x45/0xd0 mm/kasan/common.c:73
> >   set_track mm/kasan/common.c:85 [inline]
> >   __kasan_slab_free+0x102/0x150 mm/kasan/common.c:458
> >   kasan_slab_free+0xe/0x10 mm/kasan/common.c:466
> >   __cache_free mm/slab.c:3487 [inline]
> >   kfree+0xcf/0x230 mm/slab.c:3806
> >   sk_prot_free net/core/sock.c:1514 [inline]
> >   __sk_destruct+0x76d/0xa60 net/core/sock.c:1596
> >   sk_destruct+0x7b/0x90 net/core/sock.c:1604
> >   __sk_free+0xce/0x300 net/core/sock.c:1615
> >   sk_free+0x42/0x50 net/core/sock.c:1626
> >   sock_put include/net/sock.h:1707 [inline]
> >   nr_release+0x337/0x3c0 net/netrom/af_netrom.c:557
> >   __sock_release+0xd3/0x250 net/socket.c:579
> >   sock_close+0x1b/0x30 net/socket.c:1141
> >   __fput+0x3c5/0xb10 fs/file_table.c:278
> >   ____fput+0x16/0x20 fs/file_table.c:309
> >   task_work_run+0x1f4/0x2b0 kernel/task_work.c:113
> >   tracehook_notify_resume include/linux/tracehook.h:188 [inline]
> >   exit_to_usermode_loop+0x32a/0x3b0 arch/x86/entry/common.c:166
> >   prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
> >   syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
> >   do_syscall_64+0x696/0x800 arch/x86/entry/common.c:293
> >   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
> > The buggy address belongs to the object at ffff8880a63cec80
> >   which belongs to the cache kmalloc-2k of size 2048
> > The buggy address is located 1016 bytes inside of
> >   2048-byte region [ffff8880a63cec80, ffff8880a63cf480)
> > The buggy address belongs to the page:
> > page:ffffea000298f380 count:1 mapcount:0 mapping:ffff88812c3f0c40 index:0x0
> > compound_mapcount: 0
> > flags: 0x1fffc0000010200(slab|head)
> > raw: 01fffc0000010200 ffffea000127f888 ffffea00013a3188 ffff88812c3f0c40
> > raw: 0000000000000000 ffff8880a63ce400 0000000100000003 0000000000000000
> > page dumped because: kasan: bad access detected
> >
> > Memory state around the buggy address:
> >   ffff8880a63cef00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >   ffff8880a63cef80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > ffff8880a63cf000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >                                                                  ^
> >   ffff8880a63cf080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >   ffff8880a63cf100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > ==================================================================
> >
> >
> > ---
> > This bug is generated by a bot. It may contain errors.
> > See https://goo.gl/tpsmEJ for more information about syzbot.
> > syzbot engineers can be reached at syzkaller@googlegroups.com.
> >
> > syzbot will keep track of this bug report. See:
> > https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
> > syzbot.
>
>
>
> --
> paul moore
> www.paul-moore.com

^ permalink raw reply

* Re: r8169 Driver - Poor Network Performance Since Kernel 4.19
From: Heiner Kallweit @ 2019-02-01  6:32 UTC (permalink / raw)
  To: David Chang
  Cc: Peter Ceiley, Realtek linux nic maintainers, netdev,
	Martti Laaksonen
In-Reply-To: <20190201042907.GB7193@linux-kyyb.suse>

Thanks, however the register diff is a little hard to read.
Usually ethtool -d outputs something like this:

RealTek RTL8168g/8111g registers:
--------------------------------------------------------
0x00: MAC Address                      00:01:2e:83:90:11
0x08: Multicast Address Filter     0x100000c0 0x00000084
0x10: Dump Tally Counter Command   0x78c43000 0x00000001
0x20: Tx Normal Priority Ring Addr 0x77cc4000 0x00000001
0x28: Tx High Priority Ring Addr   0x00000000 0x00000000
0x30: Flash memory read/write                 0x00000000
0x34: Early Rx Byte Count                              0
0x36: Early Rx Status                               0x00
0x37: Command                                       0x0c
      Rx on, Tx on
0x3C: Interrupt Mask                              0x003f
      LinkChg RxNoBuf TxErr TxOK RxErr RxOK
0x3E: Interrupt Status                            0x0000

0x40: Tx Configuration                        0x4f000f80
0x44: Rx Configuration                        0x0002cf0e
0x48: Timer count                             0x00000000
0x4C: Missed packet counter                     0x000000
0x50: EEPROM Command                                0x10
0x51: Config 0                                      0x00
0x52: Config 1                                      0xcf
0x53: Config 2                                      0x9c
0x54: Config 3                                      0x60
0x55: Config 4                                      0x50
0x56: Config 5                                      0x01
0x58: Timer interrupt                         0x00000000
0x5C: Multiple Interrupt Select                   0x0000
0x60: PHY access                              0x00000000
0x64: TBI control and status                  0x00000000
0x68: TBI Autonegotiation advertisement (ANAR)    0x0000
0x6A: TBI Link partner ability (LPAR)             0x0000
0x6C: PHY status                                    0xf3
..


On 01.02.2019 05:29, David Chang wrote:
> On Jan 31, 2019 at 07:35:30 +0100, Heiner Kallweit wrote:
>> Hi David, two more things:
>>
>> 1. Could you please test a recent linux-next kernel?
> 
> Not tested yet. Will do if possible.
> 
>> 2. Please get a register dump (ethtool -d <if>) from 4.18 and 4.19
>>    and compare them.
> 
> For your informaiton.
> 
> [with pcie_aspm=off]
> --- v4.18.15	2019-02-01 12:11:56.019051828 +0800
> +++ v4.9.11	2019-02-01 12:12:26.827439645 +0800
> @@ -3,18 +3,19 @@
>  Offset          Values
>  ------          ------
>  0x0000:         ec 8e b5 5a 2c f5 00 00 48 00 40 00 80 00 80 00
> -0x0010:         00 10 38 0e 04 00 00 00 78 00 06 00 00 00 00 00
> -0x0020:         00 f0 9b f6 03 00 00 00 00 00 00 00 00 00 00 00
> +0x0010:         00 f0 ba 0d 04 00 00 00 78 00 06 00 00 00 00 00
> +0x0020:         00 d0 35 f7 03 00 00 00 00 00 00 00 00 00 00 00
>  0x0030:         00 00 00 00 00 00 00 0c 00 00 00 00 3f 80 00 00
> -0x0040:         80 0f 10 57 0e cf 02 00 00 cf ba 34 00 00 00 00
> -0x0050:         10 00 cf 18 60 11 00 01 11 11 11 00 00 00 00 00
> +0x0040:         80 0f 10 57 0e cf 02 00 00 d8 c7 50 00 00 00 00
> +0x0050:         10 00 cf 98 60 11 01 01 11 11 11 00 00 00 00 00
>  0x0060:         00 00 00 00 3c 10 00 81 2c f0 00 80 93 00 80 f0
> -0x0070:         00 6f 00 c4 b0 31 00 00 07 00 00 00 00 00 00 00
> +0x0070:         00 6f 00 c4 b0 31 00 00 07 00 00 00 00 00 76 d0
>  0x0080:         8b 06 01 00 00 00 00 00 00 00 00 00 00 00 00 00
>  0x0090:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  0x00a0:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> -0x00b0:         7f 04 00 00 00 00 00 00 e1 c1 05 d2 00 00 00 00
> +0x00b0:         7f 04 00 00 00 00 00 00 ad 79 01 d2 00 00 00 00
>  0x00c0:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  0x00d0:         21 00 00 32 0e 00 00 00 00 00 00 40 06 11 fd 00
> -0x00e0:         e1 20 51 51 00 30 94 f6 03 00 00 00 27 00 00 00
> +0x00e0:         e1 20 51 51 00 e0 35 f7 03 00 00 00 27 00 00 00
>  0x00f0:         3f 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00
> 
> [pcie_aspm.policy=performance]
> --- v4.18.15-p	2019-02-01 12:18:46.919221060 +0800
> +++ v4.9.11-p	2019-02-01 12:19:09.207474824 +0800
> @@ -3,18 +3,19 @@
>  Offset          Values
>  ------          ------
>  0x0000:         ec 8e b5 5a 2c f5 00 00 48 00 40 00 80 00 80 00
> -0x0010:         00 f0 bc 0d 04 00 00 00 78 00 06 00 00 00 00 00
> -0x0020:         00 60 2e f7 03 00 00 00 00 00 00 00 00 00 00 00
> +0x0010:         00 c0 22 09 04 00 00 00 78 00 06 00 00 00 00 00
> +0x0020:         00 f0 e5 f4 03 00 00 00 00 00 00 00 00 00 00 00
>  0x0030:         00 00 00 00 00 00 00 0c 00 00 00 00 3f 80 00 00
> -0x0040:         80 0f 10 57 0e cf 02 00 00 53 50 1a 00 00 00 00
> -0x0050:         10 00 cf 18 60 11 00 01 11 11 11 00 00 00 00 00
> +0x0040:         80 0f 10 57 0e cf 02 00 00 d2 35 7b 00 00 00 00
> +0x0050:         10 00 cf 98 60 11 01 01 11 11 11 00 00 00 00 00
>  0x0060:         00 00 00 00 3c 10 00 81 2c f0 00 80 93 00 80 f0
> -0x0070:         00 6f 00 c4 b0 31 00 00 07 00 00 00 00 00 00 00
> +0x0070:         00 6f 00 c4 b0 31 00 00 07 00 00 00 00 00 a4 a0
>  0x0080:         8b 06 01 00 00 00 00 00 00 00 00 00 00 00 00 00
>  0x0090:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  0x00a0:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> -0x00b0:         7f 04 00 00 00 00 00 00 e1 c1 05 d2 00 00 00 00
> +0x00b0:         7f 04 00 00 00 00 00 00 ad 79 01 d2 00 00 00 00
>  0x00c0:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  0x00d0:         21 00 00 32 0e 00 00 00 00 00 00 40 06 11 fd 00
> -0x00e0:         e1 20 51 51 00 70 2e f7 03 00 00 00 27 00 00 00
> +0x00e0:         e1 20 51 51 00 00 e6 f4 03 00 00 00 27 00 00 00
>  0x00f0:         3f 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00
> 
> Thanks,
> David
> 
>> Heiner
>>
>>
>> On 31.01.2019 07:21, Heiner Kallweit wrote:
>>> David, thanks for the link to the bug ticket.
>>> I think only a proper bisect can help to find the offending commit.
>>>
>>> Heiner
>>>
>>>
>>> On 31.01.2019 03:32, David Chang wrote:
>>>> Hi,
>>>>
>>>> We had a similr case here.
>>>> - Realtek r8169 receive performance regression in kernel 4.19
>>>>   https://bugzilla.suse.com/show_bug.cgi?id=1119649
>>>>
>>>> kernel: r8169 0000:01:00.0 eth0: RTL8168h/8111h, XID 54100880
>>>> The major symptom is there are many rx_missed count.
>>>>
>>>>
>>>> On Jan 30, 2019 at 20:15:45 +0100, Heiner Kallweit wrote:
>>>>> Hi Peter,
>>>>>
>>>>> recently I had somebody where pcie_aspm=off for whatever reason didn't
>>>>> do the trick, can you also check with pcie_aspm.policy=performance.
>>>>
>>>> We will give it a try later.
>>>>
>>>>> And please check with "ethtool -S <if>" whether the chip statistics
>>>>> show a significant number of errors.
>>>>>
>>>>> If this doesn't help you may have to bisect to find the offending commit.
>>>>
>>>> We had tried fallback driver to a few previous commits as following,
>>>> but with no luck.
>>>>
>>>> 9675931e6b65 r8169: re-enable MSI-X on RTL8168g (v4.19)
>>>> 098b01ad9837 r8169: don't include asm headers directly (v4.19-rc1)
>>>> a2965f12fde6 r8169: remove rtl8169_set_speed_xmii (v4.19-rc1)
>>>> 6fcf9b1d4d6c r8169: fix runtime suspend (v4.19-rc1)
>>>> e397286b8e89 r8169: remove TBI 1000BaseX support (v4.19-rc1)
>>>>
>>>> Thanks,
>>>> David Chang
>>>>
>>>>>
>>>>> Heiner
>>>>>
>>>>>
>>>>> On 30.01.2019 10:59, Peter Ceiley wrote:
>>>>>> Hi Heiner,
>>>>>>
>>>>>> I tried disabling the ASPM using the pcie_aspm=off kernel parameter
>>>>>> and this made no difference.
>>>>>>
>>>>>> I tried compiling the 4.18.16 r8169.c with the 4.19.18 source and
>>>>>> subsequently loaded the module in the running 4.19.18 kernel. I can
>>>>>> confirm that this immediately resolved the issue and access to the NFS
>>>>>> shares operated as expected.
>>>>>>
>>>>>> I presume this means it is an issue with the r8169 driver included in
>>>>>> 4.19 onwards?
>>>>>>
>>>>>> To answer your last questions:
>>>>>>
>>>>>> Base Board Information
>>>>>>     Manufacturer: Alienware
>>>>>>     Product Name: 0PGRP5
>>>>>>     Version: A02
>>>>>>
>>>>>> ... and yes, the RTL8168 is the onboard network chip.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Peter.
>>>>>>
>>>>>> On Tue, 29 Jan 2019 at 17:44, Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> I think the vendor driver doesn't enable ASPM per default.
>>>>>>> So it's worth a try to disable ASPM in the BIOS or via sysfs.
>>>>>>> Few older systems seem to have issues with ASPM, what kind of
>>>>>>> system / mainboard are you using? The RTL8168 is the onboard
>>>>>>> network chip?
>>>>>>>
>>>>>>> Rgds, Heiner
>>>>>>>
>>>>>>>
>>>>>>> On 29.01.2019 07:20, Peter Ceiley wrote:
>>>>>>>> Hi Heiner,
>>>>>>>>
>>>>>>>> Thanks, I'll do some more testing. It might not be the driver - I
>>>>>>>> assumed it was due to the fact that using the r8168 driver 'resolves'
>>>>>>>> the issue. I'll see if I can test the r8169.c on top of 4.19 - this is
>>>>>>>> a good idea.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Peter.
>>>>>>>>
>>>>>>>> On Tue, 29 Jan 2019 at 17:16, Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Peter,
>>>>>>>>>
>>>>>>>>> at a first glance it doesn't look like a typical driver issue.
>>>>>>>>> What you could do:
>>>>>>>>>
>>>>>>>>> - Test the r8169.c from 4.18 on top of 4.19.
>>>>>>>>>
>>>>>>>>> - Check whether disabling ASPM (/sys/module/pcie_aspm) has an effect.
>>>>>>>>>
>>>>>>>>> - Bisect between 4.18 and 4.19 to find the offending commit.
>>>>>>>>>
>>>>>>>>> Any specific reason why you think root cause is in the driver and not
>>>>>>>>> elsewhere in the network subsystem?
>>>>>>>>>
>>>>>>>>> Heiner
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 28.01.2019 23:10, Peter Ceiley wrote:
>>>>>>>>>> Hi Heiner,
>>>>>>>>>>
>>>>>>>>>> Thanks for getting back to me.
>>>>>>>>>>
>>>>>>>>>> No, I don't use jumbo packets.
>>>>>>>>>>
>>>>>>>>>> Bandwidth is *generally* good, and iperf results to my NAS provide
>>>>>>>>>> over 900 Mbits/s in both circumstances. The issue seems to appear when
>>>>>>>>>> establishing a connection and is most notable, for example, on my
>>>>>>>>>> mounted NFS shares where it takes seconds (up to 10's of seconds on
>>>>>>>>>> larger directories) to list the contents of each directory. Once a
>>>>>>>>>> transfer begins on a file, I appear to get good bandwidth.
>>>>>>>>>>
>>>>>>>>>> I'm unsure of the best scientific data to provide you in order to
>>>>>>>>>> troubleshoot this issue. Running the following
>>>>>>>>>>
>>>>>>>>>>     netstat -s |grep retransmitted
>>>>>>>>>>
>>>>>>>>>> shows a steady increase in retransmitted segments each time I list the
>>>>>>>>>> contents of a remote directory, for example, running 'ls' on a
>>>>>>>>>> directory containing 345 media files did the following using kernel
>>>>>>>>>> 4.19.18:
>>>>>>>>>>
>>>>>>>>>> increased retransmitted segments by 21 and the 'time' command showed
>>>>>>>>>> the following:
>>>>>>>>>>     real    0m19.867s
>>>>>>>>>>     user    0m0.012s
>>>>>>>>>>     sys    0m0.036s
>>>>>>>>>>
>>>>>>>>>> The same command shows no retransmitted segments running kernel
>>>>>>>>>> 4.18.16 and 'time' showed:
>>>>>>>>>>     real    0m0.300s
>>>>>>>>>>     user    0m0.004s
>>>>>>>>>>     sys    0m0.007s
>>>>>>>>>>
>>>>>>>>>> ifconfig does not show any RX/TX errors nor dropped packets in either case.
>>>>>>>>>>
>>>>>>>>>> dmesg XID:
>>>>>>>>>> [    2.979984] r8169 0000:03:00.0 eth0: RTL8168g/8111g,
>>>>>>>>>> f8:b1:56:fe:67:e0, XID 4c000800, IRQ 32
>>>>>>>>>>
>>>>>>>>>> # lspci -vv
>>>>>>>>>> 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
>>>>>>>>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
>>>>>>>>>>     Subsystem: Dell RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
>>>>>>>>>>     Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
>>>>>>>>>> ParErr- Stepping- SERR- FastB2B- DisINTx+
>>>>>>>>>>     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
>>>>>>>>>> <TAbort- <MAbort- >SERR- <PERR- INTx-
>>>>>>>>>>     Latency: 0, Cache Line Size: 64 bytes
>>>>>>>>>>     Interrupt: pin A routed to IRQ 19
>>>>>>>>>>     Region 0: I/O ports at d000 [size=256]
>>>>>>>>>>     Region 2: Memory at f7b00000 (64-bit, non-prefetchable) [size=4K]
>>>>>>>>>>     Region 4: Memory at f2100000 (64-bit, prefetchable) [size=16K]
>>>>>>>>>>     Capabilities: [40] Power Management version 3
>>>>>>>>>>         Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA
>>>>>>>>>> PME(D0+,D1+,D2+,D3hot+,D3cold+)
>>>>>>>>>>         Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>>>>>>>>>>     Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
>>>>>>>>>>         Address: 0000000000000000  Data: 0000
>>>>>>>>>>     Capabilities: [70] Express (v2) Endpoint, MSI 01
>>>>>>>>>>         DevCap:    MaxPayload 128 bytes, PhantFunc 0, Latency L0s
>>>>>>>>>> <512ns, L1 <64us
>>>>>>>>>>             ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>>>>>>>>>> SlotPowerLimit 10.000W
>>>>>>>>>>         DevCtl:    CorrErr- NonFatalErr- FatalErr- UnsupReq-
>>>>>>>>>>             RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>>>>>>>>>>             MaxPayload 128 bytes, MaxReadReq 4096 bytes
>>>>>>>>>>         DevSta:    CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
>>>>>>>>>>         LnkCap:    Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit
>>>>>>>>>> Latency L0s unlimited, L1 <64us
>>>>>>>>>>             ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
>>>>>>>>>>         LnkCtl:    ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
>>>>>>>>>>             ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
>>>>>>>>>>         LnkSta:    Speed 2.5GT/s (ok), Width x1 (ok)
>>>>>>>>>>             TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>>>>>>>>>>         DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+,
>>>>>>>>>> OBFF Via message/WAKE#
>>>>>>>>>>              AtomicOpsCap: 32bit- 64bit- 128bitCAS-
>>>>>>>>>>         DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+,
>>>>>>>>>> OBFF Disabled
>>>>>>>>>>              AtomicOpsCtl: ReqEn-
>>>>>>>>>>         LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
>>>>>>>>>>              Transmit Margin: Normal Operating Range,
>>>>>>>>>> EnterModifiedCompliance- ComplianceSOS-
>>>>>>>>>>              Compliance De-emphasis: -6dB
>>>>>>>>>>         LnkSta2: Current De-emphasis Level: -6dB,
>>>>>>>>>> EqualizationComplete-, EqualizationPhase1-
>>>>>>>>>>              EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>>>>>>>>>>     Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
>>>>>>>>>>         Vector table: BAR=4 offset=00000000
>>>>>>>>>>         PBA: BAR=4 offset=00000800
>>>>>>>>>>     Capabilities: [d0] Vital Product Data
>>>>>>>>>> pcilib: sysfs_read_vpd: read failed: Input/output error
>>>>>>>>>>         Not readable
>>>>>>>>>>     Capabilities: [100 v1] Advanced Error Reporting
>>>>>>>>>>         UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>>>>>>>>>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>>>>>>>>>         UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>>>>>>>>>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>>>>>>>>>         UESvrt:    DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
>>>>>>>>>> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>>>>>>>>>>         CESta:    RxErr+ BadTLP+ BadDLLP+ Rollover- Timeout+ AdvNonFatalErr-
>>>>>>>>>>         CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>>>>>>>>>>         AERCap:    First Error Pointer: 00, ECRCGenCap+ ECRCGenEn-
>>>>>>>>>> ECRCChkCap+ ECRCChkEn-
>>>>>>>>>>             MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
>>>>>>>>>>         HeaderLog: 00000000 00000000 00000000 00000000
>>>>>>>>>>     Capabilities: [140 v1] Virtual Channel
>>>>>>>>>>         Caps:    LPEVC=0 RefClk=100ns PATEntryBits=1
>>>>>>>>>>         Arb:    Fixed- WRR32- WRR64- WRR128-
>>>>>>>>>>         Ctrl:    ArbSelect=Fixed
>>>>>>>>>>         Status:    InProgress-
>>>>>>>>>>         VC0:    Caps:    PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
>>>>>>>>>>             Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
>>>>>>>>>>             Ctrl:    Enable+ ID=0 ArbSelect=Fixed TC/VC=01
>>>>>>>>>>             Status:    NegoPending- InProgress-
>>>>>>>>>>     Capabilities: [160 v1] Device Serial Number 01-00-00-00-68-4c-e0-00
>>>>>>>>>>     Capabilities: [170 v1] Latency Tolerance Reporting
>>>>>>>>>>         Max snoop latency: 71680ns
>>>>>>>>>>         Max no snoop latency: 71680ns
>>>>>>>>>>     Kernel driver in use: r8169
>>>>>>>>>>     Kernel modules: r8169
>>>>>>>>>>
>>>>>>>>>> Please let me know if you have any other ideas in terms of testing.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> Peter.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, 29 Jan 2019 at 05:28, Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 28.01.2019 12:13, Peter Ceiley wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I have been experiencing very poor network performance since Kernel
>>>>>>>>>>>> 4.19 and I'm confident it's related to the r8169 driver.
>>>>>>>>>>>>
>>>>>>>>>>>> I have no issue with kernel versions 4.18 and prior. I am experiencing
>>>>>>>>>>>> this issue in kernels 4.19 and 4.20 (currently running/testing with
>>>>>>>>>>>> 4.20.4 & 4.19.18).
>>>>>>>>>>>>
>>>>>>>>>>>> If someone could guide me in the right direction, I'm happy to help
>>>>>>>>>>>> troubleshoot this issue. Note that I have been keeping an eye on one
>>>>>>>>>>>> issue related to loading of the PHY driver, however, my symptoms
>>>>>>>>>>>> differ in that I still have a network connection. I have attempted to
>>>>>>>>>>>> reload the driver on a running system, but this does not improve the
>>>>>>>>>>>> situation.
>>>>>>>>>>>>
>>>>>>>>>>>> Using the proprietary r8168 driver returns my device to proper working order.
>>>>>>>>>>>>
>>>>>>>>>>>> lshw shows:
>>>>>>>>>>>>        description: Ethernet interface
>>>>>>>>>>>>        product: RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
>>>>>>>>>>>>        vendor: Realtek Semiconductor Co., Ltd.
>>>>>>>>>>>>        physical id: 0
>>>>>>>>>>>>        bus info: pci@0000:03:00.0
>>>>>>>>>>>>        logical name: enp3s0
>>>>>>>>>>>>        version: 0c
>>>>>>>>>>>>        serial:
>>>>>>>>>>>>        size: 1Gbit/s
>>>>>>>>>>>>        capacity: 1Gbit/s
>>>>>>>>>>>>        width: 64 bits
>>>>>>>>>>>>        clock: 33MHz
>>>>>>>>>>>>        capabilities: pm msi pciexpress msix vpd bus_master cap_list
>>>>>>>>>>>> ethernet physical tp aui bnc mii fibre 10bt 10bt-fd 100bt 100bt-fd
>>>>>>>>>>>> 1000bt-fd autonegotiation
>>>>>>>>>>>>        configuration: autonegotiation=on broadcast=yes driver=r8169
>>>>>>>>>>>> duplex=full firmware=rtl8168g-2_0.0.1 02/06/13 ip=192.168.1.25
>>>>>>>>>>>> latency=0 link=yes multicast=yes port=MII speed=1Gbit/s
>>>>>>>>>>>>        resources: irq:19 ioport:d000(size=256)
>>>>>>>>>>>> memory:f7b00000-f7b00fff memory:f2100000-f2103fff
>>>>>>>>>>>>
>>>>>>>>>>>> Kind Regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Peter.
>>>>>>>>>>>>
>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>
>>>>>>>>>>> the description "poor network performance" is quite vague, therefore:
>>>>>>>>>>>
>>>>>>>>>>> - Can you provide any measurements?
>>>>>>>>>>> - iperf results before and after
>>>>>>>>>>> - statistics about dropped packets (rx and/or tx)
>>>>>>>>>>> - Do you use jumbo packets?
>>>>>>>>>>>
>>>>>>>>>>> Also help would be a "lspci -vv" output for the network card and
>>>>>>>>>>> the dmesg output line with the chip XID.
>>>>>>>>>>>
>>>>>>>>>>> Heiner
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
> 


^ permalink raw reply

* Re: WoL broken in r8169.c since kernel 4.19
From: Heiner Kallweit @ 2019-02-01  6:49 UTC (permalink / raw)
  To: Marc Haber; +Cc: netdev@vger.kernel.org
In-Reply-To: <20190130153735.GN27062@torres.zugschlus.de>

Great, thanks. This patch has been applied and is part of latest linux-next kernel already:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/?h=next-20190201
Would be much appreciated if you could build and test this kernel version.

Heiner

On 30.01.2019 16:37, Marc Haber wrote:
> On Tue, Jan 29, 2019 at 10:20:48PM +0100, Heiner Kallweit wrote:
>> one more attempt, could you please test the following with 4.19 or 4.20
>> (w/o the other debug patches) ?
> 
> With the following patch, the machine wakes up fine on a WoL magic
> packet:
> 
> nux-4.20.5/drivers/net/ethernet/realtek/r8169.c   2019-01-30 16:03:00.090841076 +0100
> +++ orig/linux-4.20.5/drivers/net/ethernet/realtek/r8169.c      2019-01-26 09:20:52.000000000 +0100
> @@ -1418,7 +1418,6 @@
> 
>  #define WAKE_ANY (WAKE_PHY | WAKE_MAGIC | WAKE_UCAST | WAKE_BCAST | WAKE_MCAST)
> 
> -#if 0
>  static u32 __rtl8169_get_wol(struct rtl8169_private *tp)
>  {
>         u8 options;
> @@ -1453,7 +1452,6 @@
> 
>         return wolopts;
>  }
> -#endif
> 
>  static void rtl8169_get_wol(struct net_device *dev, struct ethtool_wolinfo *wol)
>  {
> @@ -1463,8 +1461,6 @@
>         wol->supported = WAKE_ANY;
>         wol->wolopts = tp->saved_wolopts;
>         rtl_unlock_work(tp);
> -
> -       pr_info("get_wol: 0x%08x\n", wol->wolopts);
>  }
> 
>  static void __rtl8169_set_wol(struct rtl8169_private *tp, u32 wolopts)
> @@ -1540,8 +1536,6 @@
>         struct rtl8169_private *tp = netdev_priv(dev);
>         struct device *d = tp_to_dev(tp);
> 
> -       pr_info("set_wol: 0x%08x\n", wol->wolopts);
> -
>         if (wol->wolopts & ~WAKE_ANY)
>                 return -EINVAL;
> 
> @@ -4174,7 +4168,7 @@
>  {
>         struct phy_device *phydev;
> 
> -       if (!device_may_wakeup(tp_to_dev(tp)))
> +       if (!__rtl8169_get_wol(tp))
>                 return false;
> 
>         /* phydev may not be attached to netdevice */
> @@ -7372,6 +7366,8 @@
>                 return rc;
>         }
> 
> +       tp->saved_wolopts = __rtl8169_get_wol(tp);
> +
>         mutex_init(&tp->wk.mutex);
>         u64_stats_init(&tp->rx_stats.syncp);
>         u64_stats_init(&tp->tx_stats.syncp);
> 1 [14/5006]mh@fan:~/linux/4.20.5 $
> 
> I'll send the dmesg output to you in private e-mail
> 
> Greetings
> Marc
> 


^ permalink raw reply

* Co-existing XDP generic and native mode? (Re: [PATCH bpf-next v5 5/8] xdp: Provide extack messages when prog attachment failed)
From: Jesper Dangaard Brouer @ 2019-02-01  7:02 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: daniel, ast, David Miller, Maciej Fijalkowski, netdev,
	john.fastabend, David Ahern, brouer, Saeed Mahameed
In-Reply-To: <20190131191101.2e9dc9f6@cakuba.hsd1.ca.comcast.net>

On Thu, 31 Jan 2019 19:11:01 -0800
Jakub Kicinski <jakub.kicinski@netronome.com> wrote:

> On Fri,  1 Feb 2019 01:19:51 +0100, Maciej Fijalkowski wrote:
> >  		if (__dev_xdp_query(dev, bpf_chk, XDP_QUERY_PROG) ||
> > -		    __dev_xdp_query(dev, bpf_chk, XDP_QUERY_PROG_HW))
> > +		    __dev_xdp_query(dev, bpf_chk, XDP_QUERY_PROG_HW)) {
> > +			NL_SET_ERR_MSG(extack, "native and generic XDP can't be active at the same time");
> >  			return -EEXIST;
> > +		}  
> 
> This reminds me, since we allowed native/driver and offloaded XDP
> programs to coexist in a25717d2b604 ("xdp: support simultaneous 
> driver and hw XDP attachment") I got an internal feature request 
> to also allow generic and native mode.  Would anyone object to that?

Well, I will object ;-)

I have two refactor ideas [1] and [2], that depend on not allowing
XDP-native and XDP-generic to co-exist.   The general idea is to let
XDP-native use the same fields in net_device->rx[] as XDP-generic given
they (currently) cannot co-exist. 
 The goal is (1) to move stuff out of driver code, and (2) hopefully
make it easier to implement per RXq XDP progs.

These are only refactor ideas, so if you can argue why your internal
feature request for simultaneous generic and native make more sense,
then I'm open for allowing this ?

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_per_rxq01.org#refactor-idea-move-xdp_rxq_info-to-net_devicenetdev_rx_queue

[2] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_per_rxq01.org#refactor-idea-xdpbpf_prog-into-netdev_rx_queuenet_device

> Apart from a touch up to test_offload.py I don't think anything 
> would care.  netlink can already carry multiple IDs, iproute2
> understands it, too..

And we did notice you added support for HW+native:
 [3] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_per_rxq01.org
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [PATCH bpf-next 0/6] Add __sk_buff->sk, bpf_tcp_sock, BPF_FUNC_tcp_sock and BPF_FUNC_sk_fullsock
From: Martin KaFai Lau @ 2019-02-01  7:03 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo

This series adds __sk_buff->sk, "struct bpf_tcp_sock",
BPF_FUNC_sk_fullsock and BPF_FUNC_tcp_sock.  Together, they provide
a common way to expose the members of "struct tcp_sock" and
"struct bpf_sock" for the bpf_prog to access.

The patch series first adds a bpf_sock pointer to __sk_buff
and a new helper BPF_FUNC_sk_fullsock.

It then adds BPF_FUNC_tcp_sock to get a bpf_tcp_sock
pointer from a bpf_sock pointer.

The current use case is to allow a cg_skb_bpf_prog to provide
per cgroup traffic policing/shaping.

Please see individual patch for details.

Martin KaFai Lau (6):
  bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper
  bpf: Refactor sock_ops_convert_ctx_access
  bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock
  bpf: Sync bpf.h to tools/
  bpf: Add skb->sk, bpf_sk_fullsock and bpf_tcp_sock tests to
    test_verifer
  bpf: Add test_sock_fields for skb->sk and bpf_tcp_sock

 include/linux/bpf.h                           |  42 ++
 include/uapi/linux/bpf.h                      |  61 ++-
 kernel/bpf/verifier.c                         | 156 +++++--
 net/core/filter.c                             | 407 +++++++++++-------
 tools/include/uapi/linux/bpf.h                |  61 ++-
 tools/testing/selftests/bpf/Makefile          |   5 +-
 tools/testing/selftests/bpf/bpf_helpers.h     |   4 +
 .../testing/selftests/bpf/test_sock_fields.c  | 314 ++++++++++++++
 .../selftests/bpf/test_sock_fields_kern.c     | 145 +++++++
 .../selftests/bpf/verifier/ref_tracking.c     |   4 +-
 tools/testing/selftests/bpf/verifier/sock.c   | 238 ++++++++++
 tools/testing/selftests/bpf/verifier/unpriv.c |   2 +-
 12 files changed, 1233 insertions(+), 206 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sock_fields.c
 create mode 100644 tools/testing/selftests/bpf/test_sock_fields_kern.c
 create mode 100644 tools/testing/selftests/bpf/verifier/sock.c

-- 
2.17.1


^ permalink raw reply

* [PATCH bpf-next 3/6] bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock
From: Martin KaFai Lau @ 2019-02-01  7:03 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo
In-Reply-To: <20190201070356.4148189-1-kafai@fb.com>

This patch adds a helper function BPF_FUNC_tcp_sock and it
is currently available for cg_skb and sched_(cls|act):

struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk);

int cg_skb_foo(struct __sk_buff *skb) {
	struct bpf_tcp_sock *tp;
	struct bpf_sock *sk;
	__u32 snd_cwnd;

	sk = skb->sk;
	if (!sk)
		return 1;

	tp = bpf_tcp_sock(sk);
	if (!tp)
		return 1;

	snd_cwnd = tp->snd_cwnd;
	/* ... */

	return 1;
}

A 'struct bpf_tcp_sock' is also added to the uapi bpf.h to provide
read-only access.  bpf_tcp_sock has all the existing tcp_sock's fields
that has already been exposed by the bpf_sock_ops.
i.e. no new tcp_sock's fields are exposed in bpf.h.

This helper ensures the sk is a tcp_sock.  If it is not
a tcp_sock, it returns NULL.  Hence, the caller needs to
check for NULL before accessing bpf_tcp_sock.

The current use case is to expose members from tcp_sock
to allow a cg_skb_bpf_prog to provide per cgroup traffic
policing/shaping.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf.h      | 30 ++++++++++++++++
 include/uapi/linux/bpf.h | 51 +++++++++++++++++++++++++-
 kernel/bpf/verifier.c    | 31 ++++++++++++++--
 net/core/filter.c        | 77 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 186 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 72022a0c442d..cc3667f8c003 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -172,6 +172,7 @@ enum bpf_return_type {
 	RET_PTR_TO_MAP_VALUE,		/* returns a pointer to map elem value */
 	RET_PTR_TO_MAP_VALUE_OR_NULL,	/* returns a pointer to map elem value or NULL */
 	RET_PTR_TO_SOCKET_OR_NULL,	/* returns a pointer to a socket or NULL */
+	RET_PTR_TO_TCP_SOCK_OR_NULL,	/* returns a pointer to a tcp_sock or NULL */
 };
 
 /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
@@ -227,6 +228,8 @@ enum bpf_reg_type {
 	PTR_TO_SOCKET_OR_NULL,	 /* reg points to struct bpf_sock or NULL */
 	PTR_TO_SOCK_COMMON,	 /* reg points to sock_common */
 	PTR_TO_SOCK_COMMON_OR_NULL, /* reg points to sock_common or NULL */
+	PTR_TO_TCP_SOCK,	 /* reg points to struct bpf_tcp_sock */
+	PTR_TO_TCP_SOCK_OR_NULL, /* reg points to struct bpf_tcp_sock or NULL */
 };
 
 /* The information passed from prog-specific *_is_valid_access
@@ -923,4 +926,31 @@ static inline u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
 }
 #endif
 
+#ifdef CONFIG_INET
+bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type,
+				  struct bpf_insn_access_aux *info);
+
+u32 bpf_tcp_sock_convert_ctx_access(enum bpf_access_type type,
+				    const struct bpf_insn *si,
+				    struct bpf_insn *insn_buf,
+				    struct bpf_prog *prog,
+				    u32 *target_size);
+#else
+static inline bool bpf_tcp_sock_is_valid_access(int off, int size,
+						enum bpf_access_type type,
+						struct bpf_insn_access_aux *info)
+{
+	return false;
+}
+
+static inline u32 bpf_tcp_sock_convert_ctx_access(enum bpf_access_type type,
+						  const struct bpf_insn *si,
+						  struct bpf_insn *insn_buf,
+						  struct bpf_prog *prog,
+						  u32 *target_size)
+{
+	return 0;
+}
+#endif /* CONFIG_INET */
+
 #endif /* _LINUX_BPF_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b7a5312c4ccf..5f6c947abdd8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2336,6 +2336,15 @@ union bpf_attr {
  *	Return
  *		The same pointer if *sk* is a fullsock.  Otherwise,
  *		NULL is returned.
+ *
+ * struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk)
+ *	Description
+ *		Obtain a **struct bpf_tcp_sock** pointer from a
+ *              **struct bpf_sock** pointer.
+ *
+ *	Return
+ *		Pointer to **struct bpf_tcp_sock**, or **NULL** in
+ *		case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2431,7 +2440,8 @@ union bpf_attr {
 	FN(msg_push_data),		\
 	FN(msg_pop_data),		\
 	FN(rc_pointer_rel),		\
-	FN(sk_fullsock),
+	FN(sk_fullsock),		\
+	FN(tcp_sock),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -2614,6 +2624,45 @@ struct bpf_sock {
 				 */
 };
 
+struct bpf_tcp_sock {
+	__u32 snd_cwnd;		/* Sending congestion window		*/
+	__u32 srtt_us;		/* smoothed round trip time << 3 in usecs */
+	__u32 rtt_min;
+	__u32 snd_ssthresh;	/* Slow start size threshold		*/
+	__u32 rcv_nxt;		/* What we want to receive next		*/
+	__u32 snd_nxt;		/* Next sequence we send		*/
+	__u32 snd_una;		/* First byte we want an ack for	*/
+	__u32 mss_cache;	/* Cached effective mss, not including SACKS */
+	__u32 ecn_flags;	/* ECN status bits.			*/
+	__u32 rate_delivered;	/* saved rate sample: packets delivered */
+	__u32 rate_interval_us;	/* saved rate sample: time elapsed */
+	__u32 packets_out;	/* Packets which are "in flight"	*/
+	__u32 retrans_out;	/* Retransmitted packets out		*/
+	__u32 total_retrans;	/* Total retransmits for entire connection */
+	__u32 segs_in;		/* RFC4898 tcpEStatsPerfSegsIn
+				 * total number of segments in.
+				 */
+	__u32 data_segs_in;	/* RFC4898 tcpEStatsPerfDataSegsIn
+				 * total number of data segments in.
+				 */
+	__u32 segs_out;		/* RFC4898 tcpEStatsPerfSegsOut
+				 * The total number of segments sent.
+				 */
+	__u32 data_segs_out;	/* RFC4898 tcpEStatsPerfDataSegsOut
+				 * total number of data segments sent.
+				 */
+	__u32 lost_out;		/* Lost packets			*/
+	__u32 sacked_out;	/* SACK'd packets			*/
+	__u64 bytes_received;	/* RFC4898 tcpEStatsAppHCThruOctetsReceived
+				 * sum(delta(rcv_nxt)), or how many bytes
+				 * were acked.
+				 */
+	__u64 bytes_acked;	/* RFC4898 tcpEStatsAppHCThruOctetsAcked
+				 * sum(delta(snd_una)), or how many bytes
+				 * were acked.
+				 */
+};
+
 struct bpf_sock_tuple {
 	union {
 		struct {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b4500f29dff1..f8ba690dad80 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -333,14 +333,16 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type)
 static bool type_is_sk_pointer(enum bpf_reg_type type)
 {
 	return type == PTR_TO_SOCKET ||
-		type == PTR_TO_SOCK_COMMON;
+		type == PTR_TO_SOCK_COMMON ||
+		type == PTR_TO_TCP_SOCK;
 }
 
 static bool reg_type_may_be_null(enum bpf_reg_type type)
 {
 	return type == PTR_TO_MAP_VALUE_OR_NULL ||
 	       type == PTR_TO_SOCKET_OR_NULL ||
-	       type == PTR_TO_SOCK_COMMON_OR_NULL;
+	       type == PTR_TO_SOCK_COMMON_OR_NULL ||
+	       type == PTR_TO_TCP_SOCK_OR_NULL;
 }
 
 static bool type_is_refcounted(enum bpf_reg_type type)
@@ -400,6 +402,8 @@ static const char * const reg_type_str[] = {
 	[PTR_TO_SOCKET_OR_NULL] = "sock_or_null",
 	[PTR_TO_SOCK_COMMON]	= "sock_common",
 	[PTR_TO_SOCK_COMMON_OR_NULL] = "sock_common_or_null",
+	[PTR_TO_TCP_SOCK]	= "tcp_sock",
+	[PTR_TO_TCP_SOCK_OR_NULL] = "tcp_sock_or_null",
 };
 
 static char slot_type_char[] = {
@@ -1201,6 +1205,8 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
 	case PTR_TO_SOCKET_OR_NULL:
 	case PTR_TO_SOCK_COMMON:
 	case PTR_TO_SOCK_COMMON_OR_NULL:
+	case PTR_TO_TCP_SOCK:
+	case PTR_TO_TCP_SOCK_OR_NULL:
 		return true;
 	default:
 		return false;
@@ -1638,6 +1644,9 @@ static int check_sock_access(struct bpf_verifier_env *env, u32 regno, int off,
 	case PTR_TO_SOCKET:
 		valid = bpf_sock_is_valid_access(off, size, t, &info);
 		break;
+	case PTR_TO_TCP_SOCK:
+		valid = bpf_tcp_sock_is_valid_access(off, size, t, &info);
+		break;
 	default:
 		valid = false;
 	}
@@ -1795,6 +1804,9 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
 	case PTR_TO_SOCK_COMMON:
 		pointer_desc = "sock_common ";
 		break;
+	case PTR_TO_TCP_SOCK:
+		pointer_desc = "tcp_sock ";
+		break;
 	default:
 		break;
 	}
@@ -3021,6 +3033,10 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 			/* For mark_ptr_or_null_reg() */
 			regs[BPF_REG_0].id = ++env->id_gen;
 		}
+	} else if (fn->ret_type == RET_PTR_TO_TCP_SOCK_OR_NULL) {
+		mark_reg_known_zero(env, regs, BPF_REG_0);
+		regs[BPF_REG_0].type = PTR_TO_TCP_SOCK_OR_NULL;
+		regs[BPF_REG_0].id = ++env->id_gen;
 	} else {
 		verbose(env, "unknown return type %d of func %s#%d\n",
 			fn->ret_type, func_id_name(func_id), func_id);
@@ -3282,6 +3298,8 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 	case PTR_TO_SOCKET_OR_NULL:
 	case PTR_TO_SOCK_COMMON:
 	case PTR_TO_SOCK_COMMON_OR_NULL:
+	case PTR_TO_TCP_SOCK:
+	case PTR_TO_TCP_SOCK_OR_NULL:
 		verbose(env, "R%d pointer arithmetic on %s prohibited\n",
 			dst, reg_type_str[ptr_reg->type]);
 		return -EACCES;
@@ -4517,6 +4535,8 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state,
 			reg->type = PTR_TO_SOCKET;
 		} else if (reg->type == PTR_TO_SOCK_COMMON_OR_NULL) {
 			reg->type = PTR_TO_SOCK_COMMON;
+		} else if (reg->type == PTR_TO_TCP_SOCK_OR_NULL) {
+			reg->type = PTR_TO_TCP_SOCK;
 		}
 
 		if (is_null || !reg_is_refcounted(reg)) {
@@ -5704,6 +5724,8 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur,
 	case PTR_TO_SOCKET_OR_NULL:
 	case PTR_TO_SOCK_COMMON:
 	case PTR_TO_SOCK_COMMON_OR_NULL:
+	case PTR_TO_TCP_SOCK:
+	case PTR_TO_TCP_SOCK_OR_NULL:
 		/* Only valid matches are exact, which memcmp() above
 		 * would have accepted
 		 */
@@ -6023,6 +6045,8 @@ static bool reg_type_mismatch_ok(enum bpf_reg_type type)
 	case PTR_TO_SOCKET_OR_NULL:
 	case PTR_TO_SOCK_COMMON:
 	case PTR_TO_SOCK_COMMON_OR_NULL:
+	case PTR_TO_TCP_SOCK:
+	case PTR_TO_TCP_SOCK_OR_NULL:
 		return false;
 	default:
 		return true;
@@ -6997,6 +7021,9 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 		case PTR_TO_SOCK_COMMON:
 			convert_ctx_access = bpf_sock_convert_ctx_access;
 			break;
+		case PTR_TO_TCP_SOCK:
+			convert_ctx_access = bpf_tcp_sock_convert_ctx_access;
+			break;
 		default:
 			continue;
 		}
diff --git a/net/core/filter.c b/net/core/filter.c
index b3b4b5faede4..bdc39c3081ff 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5305,6 +5305,77 @@ static const struct bpf_func_proto bpf_sock_addr_sk_lookup_udp_proto = {
 	.arg5_type	= ARG_ANYTHING,
 };
 
+bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type,
+				  struct bpf_insn_access_aux *info)
+{
+	if (off < 0 || off >= offsetofend(struct bpf_tcp_sock, bytes_acked))
+		return false;
+
+	if (off % size != 0)
+		return false;
+
+	switch (off) {
+	case offsetof(struct bpf_tcp_sock, bytes_received):
+	case offsetof(struct bpf_tcp_sock, bytes_acked):
+		return size == sizeof(__u64);
+	default:
+		return size == sizeof(__u32);
+	}
+}
+
+u32 bpf_tcp_sock_convert_ctx_access(enum bpf_access_type type,
+				    const struct bpf_insn *si,
+				    struct bpf_insn *insn_buf,
+				    struct bpf_prog *prog, u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+
+#define BPF_TCP_SOCK_GET_COMMON(FIELD)					\
+	do {								\
+		BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD) >	\
+			     FIELD_SIZEOF(struct bpf_tcp_sock, FIELD));	\
+		*insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock, FIELD), \
+				      si->dst_reg, si->src_reg,		\
+				      offsetof(struct tcp_sock, FIELD)); \
+	} while (0)
+
+	CONVERT_COMMON_TCP_SOCK_FIELDS(struct bpf_tcp_sock,
+				       BPF_TCP_SOCK_GET_COMMON);
+
+	if (insn > insn_buf)
+		return insn - insn_buf;
+
+	switch (si->off) {
+	case offsetof(struct bpf_tcp_sock, rtt_min):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, rtt_min) !=
+			     sizeof(struct minmax));
+		BUILD_BUG_ON(sizeof(struct minmax) <
+			     sizeof(struct minmax_sample));
+
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
+				      offsetof(struct tcp_sock, rtt_min) +
+				      offsetof(struct minmax_sample, v));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
+BPF_CALL_1(bpf_tcp_sock, const struct sock *, sk)
+{
+	if (sk_fullsock(sk) && sk->sk_protocol == IPPROTO_TCP)
+		return (unsigned long)sk;
+
+	return (unsigned long)NULL;
+}
+
+static const struct bpf_func_proto bpf_tcp_sock_proto = {
+	.func		= bpf_tcp_sock,
+	.gpl_only	= false,
+	.ret_type	= RET_PTR_TO_TCP_SOCK_OR_NULL,
+	.arg1_type	= ARG_PTR_TO_SOCK_COMMON,
+};
+
 #endif /* CONFIG_INET */
 
 bool bpf_helper_changes_pkt_data(void *func)
@@ -5450,6 +5521,10 @@ cg_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_local_storage_proto;
 	case BPF_FUNC_sk_fullsock:
 		return &bpf_sk_fullsock_proto;
+#ifdef CONFIG_INET
+	case BPF_FUNC_tcp_sock:
+		return &bpf_tcp_sock_proto;
+#endif
 	default:
 		return sk_filter_func_proto(func_id, prog);
 	}
@@ -5540,6 +5615,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_sk_lookup_udp_proto;
 	case BPF_FUNC_sk_release:
 		return &bpf_sk_release_proto;
+	case BPF_FUNC_tcp_sock:
+		return &bpf_tcp_sock_proto;
 #endif
 	default:
 		return bpf_base_func_proto(func_id);
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next 1/6] bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper
From: Martin KaFai Lau @ 2019-02-01  7:03 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo
In-Reply-To: <20190201070356.4148189-1-kafai@fb.com>

In kernel, it is common to check "!skb->sk && sk_fullsock(skb->sk)"
before accessing the fields in sock.  For example, in __netdev_pick_tx:

static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
			    struct net_device *sb_dev)
{
	/* ... */

	struct sock *sk = skb->sk;

		if (queue_index != new_index && sk &&
		    sk_fullsock(sk) &&
		    rcu_access_pointer(sk->sk_dst_cache))
			sk_tx_queue_set(sk, new_index);

	/* ... */

	return queue_index;
}

This patch adds a "struct bpf_sock *sk" pointer to the "struct __sk_buff"
where a few of the convert_ctx_access() in filter.c has already been
accessing the skb->sk sock_common's fields,
e.g. sock_ops_convert_ctx_access().

"__sk_buff->sk" is a PTR_TO_SOCK_COMMON_OR_NULL in the verifier.
Some of the fileds in "bpf_sock" will not be directly
accessible through the "__sk_buff->sk" pointer.  It is limited
by the new "bpf_sock_common_is_valid_access()".
e.g. The existing "type", "protocol", "mark" and "priority" in bpf_sock
     are not allowed.

The newly added "struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)"
can be used to get a sk with all accessible fields in "bpf_sock".
This helper is added to both cg_skb and sched_(cls|act).

int cg_skb_foo(struct __sk_buff *skb) {
	struct bpf_sock *sk;
	__u32 family;

	sk = skb->sk;
	if (!sk)
		return 1;

	sk = bpf_sk_fullsock(sk);
	if (!sk)
		return 1;

	if (sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP)
		return 1;

	/* some_traffic_shaping(); */

	return 1;
}

(1) The sk is read only

(2) There is no new "struct bpf_sock_common" introduced.

(3) Future kernel sock's members could be added to bpf_sock only
    instead of repeatedly adding at multiple places like currently
    in bpf_sock_ops_md, bpf_sock_addr_md, sk_reuseport_md...etc.

(4) After "sk = skb->sk", the reg holding sk is in type
    PTR_TO_SOCK_COMMON_OR_NULL.

(5) After bpf_sk_fullsock(), the return type will be in type
    PTR_TO_SOCKET_OR_NULL which is the same as the return type of
    bpf_sk_lookup_xxx().

    However, bpf_sk_fullsock() does not take refcnt.  The
    acquire_reference_state() is only depending on the return type now.
    To avoid it, a new is_acquire_function() is checked before calling
    acquire_reference_state().

(6) The WARN_ON in "release_reference_state()" is no longer an
    internal verifier bug.

    When reg->id is not found in state->refs[], it means the
    bpf_prog does something wrong like
    "bpf_sk_release(bpf_sk_fullsock(skb->sk))" where reference has
    never been acquired by calling "bpf_sk_fullsock(skb->sk)".

    A -EINVAL and a verbose are done instead of WARN_ON.  A test is
    added to the test_verifier in a later patch.

    Since the WARN_ON in "release_reference_state()" is no longer
    needed, "__release_reference_state()" is folded into
    "release_reference_state()" also.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf.h      |  12 ++++
 include/uapi/linux/bpf.h |  12 +++-
 kernel/bpf/verifier.c    | 129 +++++++++++++++++++++++++++------------
 net/core/filter.c        |  43 +++++++++++++
 4 files changed, 156 insertions(+), 40 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0394f1f9213b..72022a0c442d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -162,6 +162,7 @@ enum bpf_arg_type {
 	ARG_PTR_TO_CTX,		/* pointer to context */
 	ARG_ANYTHING,		/* any (initialized) argument is ok */
 	ARG_PTR_TO_SOCKET,	/* pointer to bpf_sock */
+	ARG_PTR_TO_SOCK_COMMON,	/* pointer to __sk_buff->sk */
 };
 
 /* type of values returned from helper functions */
@@ -224,6 +225,8 @@ enum bpf_reg_type {
 	PTR_TO_FLOW_KEYS,	 /* reg points to bpf_flow_keys */
 	PTR_TO_SOCKET,		 /* reg points to struct bpf_sock */
 	PTR_TO_SOCKET_OR_NULL,	 /* reg points to struct bpf_sock or NULL */
+	PTR_TO_SOCK_COMMON,	 /* reg points to sock_common */
+	PTR_TO_SOCK_COMMON_OR_NULL, /* reg points to sock_common or NULL */
 };
 
 /* The information passed from prog-specific *_is_valid_access
@@ -887,6 +890,9 @@ void bpf_user_rnd_init_once(void);
 u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
 #if defined(CONFIG_NET)
+bool bpf_sock_common_is_valid_access(int off, int size,
+				     enum bpf_access_type type,
+				     struct bpf_insn_access_aux *info);
 bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type,
 			      struct bpf_insn_access_aux *info);
 u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
@@ -895,6 +901,12 @@ u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
 				struct bpf_prog *prog,
 				u32 *target_size);
 #else
+static inline bool bpf_sock_common_is_valid_access(int off, int size,
+						   enum bpf_access_type type,
+						   struct bpf_insn_access_aux *info)
+{
+	return false;
+}
 static inline bool bpf_sock_is_valid_access(int off, int size,
 					    enum bpf_access_type type,
 					    struct bpf_insn_access_aux *info)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 60b99b730a41..b7a5312c4ccf 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2328,6 +2328,14 @@ union bpf_attr {
  *		"**y**".
  *	Return
  *		0
+ *
+ * struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)
+ *	Description
+ *		This helper tests if a bpf_sock is a fullsock such that
+ *		all the fields in bpf_sock can be accessed.
+ *	Return
+ *		The same pointer if *sk* is a fullsock.  Otherwise,
+ *		NULL is returned.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2422,7 +2430,8 @@ union bpf_attr {
 	FN(map_peek_elem),		\
 	FN(msg_push_data),		\
 	FN(msg_pop_data),		\
-	FN(rc_pointer_rel),
+	FN(rc_pointer_rel),		\
+	FN(sk_fullsock),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -2542,6 +2551,7 @@ struct __sk_buff {
 	__u64 tstamp;
 	__u32 wire_len;
 	__u32 gso_segs;
+	__bpf_md_ptr(struct bpf_sock *, sk);
 };
 
 struct bpf_tunnel_key {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8c1c21cd50b4..b4500f29dff1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -330,10 +330,17 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type)
 	       type == PTR_TO_PACKET_META;
 }
 
+static bool type_is_sk_pointer(enum bpf_reg_type type)
+{
+	return type == PTR_TO_SOCKET ||
+		type == PTR_TO_SOCK_COMMON;
+}
+
 static bool reg_type_may_be_null(enum bpf_reg_type type)
 {
 	return type == PTR_TO_MAP_VALUE_OR_NULL ||
-	       type == PTR_TO_SOCKET_OR_NULL;
+	       type == PTR_TO_SOCKET_OR_NULL ||
+	       type == PTR_TO_SOCK_COMMON_OR_NULL;
 }
 
 static bool type_is_refcounted(enum bpf_reg_type type)
@@ -370,6 +377,12 @@ static bool is_release_function(enum bpf_func_id func_id)
 	return func_id == BPF_FUNC_sk_release;
 }
 
+static bool is_acquire_function(enum bpf_func_id func_id)
+{
+	return func_id == BPF_FUNC_sk_lookup_tcp ||
+		func_id == BPF_FUNC_sk_lookup_udp;
+}
+
 /* string representation of 'enum bpf_reg_type' */
 static const char * const reg_type_str[] = {
 	[NOT_INIT]		= "?",
@@ -385,6 +398,8 @@ static const char * const reg_type_str[] = {
 	[PTR_TO_FLOW_KEYS]	= "flow_keys",
 	[PTR_TO_SOCKET]		= "sock",
 	[PTR_TO_SOCKET_OR_NULL] = "sock_or_null",
+	[PTR_TO_SOCK_COMMON]	= "sock_common",
+	[PTR_TO_SOCK_COMMON_OR_NULL] = "sock_common_or_null",
 };
 
 static char slot_type_char[] = {
@@ -611,13 +626,10 @@ static int acquire_reference_state(struct bpf_verifier_env *env, int insn_idx)
 }
 
 /* release function corresponding to acquire_reference_state(). Idempotent. */
-static int __release_reference_state(struct bpf_func_state *state, int ptr_id)
+static int release_reference_state(struct bpf_func_state *state, int ptr_id)
 {
 	int i, last_idx;
 
-	if (!ptr_id)
-		return -EFAULT;
-
 	last_idx = state->acquired_refs - 1;
 	for (i = 0; i < state->acquired_refs; i++) {
 		if (state->refs[i].id == ptr_id) {
@@ -629,21 +641,7 @@ static int __release_reference_state(struct bpf_func_state *state, int ptr_id)
 			return 0;
 		}
 	}
-	return -EFAULT;
-}
-
-/* variation on the above for cases where we expect that there must be an
- * outstanding reference for the specified ptr_id.
- */
-static int release_reference_state(struct bpf_verifier_env *env, int ptr_id)
-{
-	struct bpf_func_state *state = cur_func(env);
-	int err;
-
-	err = __release_reference_state(state, ptr_id);
-	if (WARN_ON_ONCE(err != 0))
-		verbose(env, "verifier internal error: can't release reference\n");
-	return err;
+	return -EINVAL;
 }
 
 static int transfer_reference_state(struct bpf_func_state *dst,
@@ -1201,6 +1199,8 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
 	case CONST_PTR_TO_MAP:
 	case PTR_TO_SOCKET:
 	case PTR_TO_SOCKET_OR_NULL:
+	case PTR_TO_SOCK_COMMON:
+	case PTR_TO_SOCK_COMMON_OR_NULL:
 		return true;
 	default:
 		return false;
@@ -1623,6 +1623,7 @@ static int check_sock_access(struct bpf_verifier_env *env, u32 regno, int off,
 	struct bpf_reg_state *regs = cur_regs(env);
 	struct bpf_reg_state *reg = &regs[regno];
 	struct bpf_insn_access_aux info;
+	bool valid;
 
 	if (reg->smin_value < 0) {
 		verbose(env, "R%d min value is negative, either use unsigned index or do a if (index >=0) check.\n",
@@ -1630,13 +1631,24 @@ static int check_sock_access(struct bpf_verifier_env *env, u32 regno, int off,
 		return -EACCES;
 	}
 
-	if (!bpf_sock_is_valid_access(off, size, t, &info)) {
-		verbose(env, "invalid bpf_sock access off=%d size=%d\n",
-			off, size);
-		return -EACCES;
+	switch (reg->type) {
+	case PTR_TO_SOCK_COMMON:
+		valid = bpf_sock_common_is_valid_access(off, size, t, &info);
+		break;
+	case PTR_TO_SOCKET:
+		valid = bpf_sock_is_valid_access(off, size, t, &info);
+		break;
+	default:
+		valid = false;
 	}
 
-	return 0;
+	if (valid)
+		return 0;
+
+	verbose(env, "R%d invalid %s access off=%d size=%d\n",
+		regno, reg_type_str[reg->type], off, size);
+
+	return -EACCES;
 }
 
 static bool __is_pointer_value(bool allow_ptr_leaks,
@@ -1662,8 +1674,14 @@ static bool is_ctx_reg(struct bpf_verifier_env *env, int regno)
 {
 	const struct bpf_reg_state *reg = reg_state(env, regno);
 
-	return reg->type == PTR_TO_CTX ||
-	       reg->type == PTR_TO_SOCKET;
+	return reg->type == PTR_TO_CTX;
+}
+
+static bool is_sk_reg(struct bpf_verifier_env *env, int regno)
+{
+	const struct bpf_reg_state *reg = reg_state(env, regno);
+
+	return type_is_sk_pointer(reg->type);
 }
 
 static bool is_pkt_reg(struct bpf_verifier_env *env, int regno)
@@ -1774,6 +1792,9 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
 	case PTR_TO_SOCKET:
 		pointer_desc = "sock ";
 		break;
+	case PTR_TO_SOCK_COMMON:
+		pointer_desc = "sock_common ";
+		break;
 	default:
 		break;
 	}
@@ -1977,11 +1998,14 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 			 * PTR_TO_PACKET[_META,_END]. In the latter
 			 * case, we know the offset is zero.
 			 */
-			if (reg_type == SCALAR_VALUE)
+			if (reg_type == SCALAR_VALUE) {
 				mark_reg_unknown(env, regs, value_regno);
-			else
+			} else {
 				mark_reg_known_zero(env, regs,
 						    value_regno);
+				if (reg_type_may_be_null(reg_type))
+					regs[value_regno].id = ++env->id_gen;
+			}
 			regs[value_regno].type = reg_type;
 		}
 
@@ -2027,9 +2051,10 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 		err = check_flow_keys_access(env, off, size);
 		if (!err && t == BPF_READ && value_regno >= 0)
 			mark_reg_unknown(env, regs, value_regno);
-	} else if (reg->type == PTR_TO_SOCKET) {
+	} else if (type_is_sk_pointer(reg->type)) {
 		if (t == BPF_WRITE) {
-			verbose(env, "cannot write into socket\n");
+			verbose(env, "R%d cannot write into %s\n",
+				regno, reg_type_str[reg->type]);
 			return -EACCES;
 		}
 		err = check_sock_access(env, regno, off, size, t);
@@ -2076,7 +2101,8 @@ static int check_xadd(struct bpf_verifier_env *env, int insn_idx, struct bpf_ins
 
 	if (is_ctx_reg(env, insn->dst_reg) ||
 	    is_pkt_reg(env, insn->dst_reg) ||
-	    is_flow_key_reg(env, insn->dst_reg)) {
+	    is_flow_key_reg(env, insn->dst_reg) ||
+	    is_sk_reg(env, insn->dst_reg)) {
 		verbose(env, "BPF_XADD stores into R%d %s is not allowed\n",
 			insn->dst_reg,
 			reg_type_str[reg_state(env, insn->dst_reg)->type]);
@@ -2258,6 +2284,11 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 		err = check_ctx_reg(env, reg, regno);
 		if (err < 0)
 			return err;
+	} else if (arg_type == ARG_PTR_TO_SOCK_COMMON) {
+		expected_type = PTR_TO_SOCK_COMMON;
+		/* Any sk pointer can be ARG_PTR_TO_SOCK_COMMON */
+		if (!type_is_sk_pointer(type))
+			goto err_type;
 	} else if (arg_type == ARG_PTR_TO_SOCKET) {
 		expected_type = PTR_TO_SOCKET;
 		if (type != expected_type)
@@ -2661,7 +2692,7 @@ static int release_reference(struct bpf_verifier_env *env,
 	for (i = 0; i <= vstate->curframe; i++)
 		release_reg_references(env, vstate->frame[i], meta->ptr_id);
 
-	return release_reference_state(env, meta->ptr_id);
+	return release_reference_state(cur_func(env), meta->ptr_id);
 }
 
 static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
@@ -2926,8 +2957,11 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 		}
 	} else if (is_release_function(func_id)) {
 		err = release_reference(env, &meta);
-		if (err)
+		if (err) {
+			verbose(env, "func %s#%d reference has not been acquired before\n",
+				func_id_name(func_id), func_id);
 			return err;
+		}
 	}
 
 	regs = cur_regs(env);
@@ -2974,12 +3008,19 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 			regs[BPF_REG_0].id = ++env->id_gen;
 		}
 	} else if (fn->ret_type == RET_PTR_TO_SOCKET_OR_NULL) {
-		int id = acquire_reference_state(env, insn_idx);
-		if (id < 0)
-			return id;
 		mark_reg_known_zero(env, regs, BPF_REG_0);
 		regs[BPF_REG_0].type = PTR_TO_SOCKET_OR_NULL;
-		regs[BPF_REG_0].id = id;
+		if (is_acquire_function(func_id)) {
+			int id = acquire_reference_state(env, insn_idx);
+
+			if (id < 0)
+				return id;
+			/* For release_reference() */
+			regs[BPF_REG_0].id = id;
+		} else {
+			/* For mark_ptr_or_null_reg() */
+			regs[BPF_REG_0].id = ++env->id_gen;
+		}
 	} else {
 		verbose(env, "unknown return type %d of func %s#%d\n",
 			fn->ret_type, func_id_name(func_id), func_id);
@@ -3239,6 +3280,8 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 	case PTR_TO_PACKET_END:
 	case PTR_TO_SOCKET:
 	case PTR_TO_SOCKET_OR_NULL:
+	case PTR_TO_SOCK_COMMON:
+	case PTR_TO_SOCK_COMMON_OR_NULL:
 		verbose(env, "R%d pointer arithmetic on %s prohibited\n",
 			dst, reg_type_str[ptr_reg->type]);
 		return -EACCES;
@@ -4472,7 +4515,10 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state,
 			}
 		} else if (reg->type == PTR_TO_SOCKET_OR_NULL) {
 			reg->type = PTR_TO_SOCKET;
+		} else if (reg->type == PTR_TO_SOCK_COMMON_OR_NULL) {
+			reg->type = PTR_TO_SOCK_COMMON;
 		}
+
 		if (is_null || !reg_is_refcounted(reg)) {
 			/* We don't need id from this point onwards anymore,
 			 * thus we should better reset it, so that state
@@ -4495,7 +4541,7 @@ static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno,
 	int i, j;
 
 	if (reg_is_refcounted_or_null(&regs[regno]) && is_null)
-		__release_reference_state(state, id);
+		release_reference_state(state, id);
 
 	for (i = 0; i < MAX_BPF_REG; i++)
 		mark_ptr_or_null_reg(state, &regs[i], id, is_null);
@@ -5656,6 +5702,8 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur,
 	case PTR_TO_FLOW_KEYS:
 	case PTR_TO_SOCKET:
 	case PTR_TO_SOCKET_OR_NULL:
+	case PTR_TO_SOCK_COMMON:
+	case PTR_TO_SOCK_COMMON_OR_NULL:
 		/* Only valid matches are exact, which memcmp() above
 		 * would have accepted
 		 */
@@ -5973,6 +6021,8 @@ static bool reg_type_mismatch_ok(enum bpf_reg_type type)
 	case PTR_TO_CTX:
 	case PTR_TO_SOCKET:
 	case PTR_TO_SOCKET_OR_NULL:
+	case PTR_TO_SOCK_COMMON:
+	case PTR_TO_SOCK_COMMON_OR_NULL:
 		return false;
 	default:
 		return true;
@@ -6944,6 +6994,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 			convert_ctx_access = ops->convert_ctx_access;
 			break;
 		case PTR_TO_SOCKET:
+		case PTR_TO_SOCK_COMMON:
 			convert_ctx_access = bpf_sock_convert_ctx_access;
 			break;
 		default:
diff --git a/net/core/filter.c b/net/core/filter.c
index 8ce421796ac6..7f9ff0517b67 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1793,6 +1793,18 @@ static const struct bpf_func_proto bpf_skb_pull_data_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
+{
+	return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
+}
+
+static const struct bpf_func_proto bpf_sk_fullsock_proto = {
+	.func		= bpf_sk_fullsock,
+	.gpl_only	= false,
+	.ret_type	= RET_PTR_TO_SOCKET_OR_NULL,
+	.arg1_type	= ARG_PTR_TO_SOCK_COMMON,
+};
+
 static inline int sk_skb_try_make_writable(struct sk_buff *skb,
 					   unsigned int write_len)
 {
@@ -5388,6 +5400,8 @@ cg_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	switch (func_id) {
 	case BPF_FUNC_get_local_storage:
 		return &bpf_get_local_storage_proto;
+	case BPF_FUNC_sk_fullsock:
+		return &bpf_sk_fullsock_proto;
 	default:
 		return sk_filter_func_proto(func_id, prog);
 	}
@@ -5459,6 +5473,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_socket_uid_proto;
 	case BPF_FUNC_fib_lookup:
 		return &bpf_skb_fib_lookup_proto;
+	case BPF_FUNC_sk_fullsock:
+		return &bpf_sk_fullsock_proto;
 #ifdef CONFIG_XFRM
 	case BPF_FUNC_skb_get_xfrm_state:
 		return &bpf_skb_get_xfrm_state_proto;
@@ -5746,6 +5762,11 @@ static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type
 		if (size != sizeof(__u64))
 			return false;
 		break;
+	case offsetof(struct __sk_buff, sk):
+		if (type == BPF_WRITE || size != sizeof(__u64))
+			return false;
+		info->reg_type = PTR_TO_SOCK_COMMON_OR_NULL;
+		break;
 	default:
 		/* Only narrow read access allowed for now. */
 		if (type == BPF_WRITE) {
@@ -5932,6 +5953,21 @@ static bool __sock_filter_check_size(int off, int size,
 	return size == size_default;
 }
 
+bool bpf_sock_common_is_valid_access(int off, int size,
+				     enum bpf_access_type type,
+				     struct bpf_insn_access_aux *info)
+{
+	switch (off) {
+	case offsetof(struct bpf_sock, type):
+	case offsetof(struct bpf_sock, protocol):
+	case offsetof(struct bpf_sock, mark):
+	case offsetof(struct bpf_sock, priority):
+		return false;
+	default:
+		return bpf_sock_is_valid_access(off, size, type, info);
+	}
+}
+
 bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type,
 			      struct bpf_insn_access_aux *info)
 {
@@ -6730,6 +6766,13 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 		off += offsetof(struct qdisc_skb_cb, pkt_len);
 		*target_size = 4;
 		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg, off);
+		break;
+
+	case offsetof(struct __sk_buff, sk):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, sk),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct sk_buff, sk));
+		break;
 	}
 
 	return insn - insn_buf;
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next 2/6] bpf: Refactor sock_ops_convert_ctx_access
From: Martin KaFai Lau @ 2019-02-01  7:03 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo
In-Reply-To: <20190201070356.4148189-1-kafai@fb.com>

The next patch will introduce a new "struct bpf_tcp_sock" which
exposes the same tcp_sock's fields already exposed in
"struct bpf_sock_ops".

This patch refactor the existing convert_ctx_access() codes for
"struct bpf_sock_ops" to get them ready to be reused for
"struct bpf_tcp_sock".  The "rtt_min" is not refactored
in this patch because its handling is different from other
fields.

The SOCK_OPS_GET_TCP_SOCK_FIELD is new. All other SOCK_OPS_XXX_FIELD
changes are code move only.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/core/filter.c | 287 ++++++++++++++++++++--------------------------
 1 file changed, 127 insertions(+), 160 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 7f9ff0517b67..b3b4b5faede4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5020,6 +5020,54 @@ static const struct bpf_func_proto bpf_lwt_seg6_adjust_srh_proto = {
 };
 #endif /* CONFIG_IPV6_SEG6_BPF */
 
+#define CONVERT_COMMON_TCP_SOCK_FIELDS(md_type, CONVERT)		\
+do {									\
+	switch (si->off) {						\
+	case offsetof(md_type, snd_cwnd):				\
+		CONVERT(snd_cwnd); break;				\
+	case offsetof(md_type, srtt_us):				\
+		CONVERT(srtt_us); break;				\
+	case offsetof(md_type, snd_ssthresh):				\
+		CONVERT(snd_ssthresh); break;				\
+	case offsetof(md_type, rcv_nxt):				\
+		CONVERT(rcv_nxt); break;				\
+	case offsetof(md_type, snd_nxt):				\
+		CONVERT(snd_nxt); break;				\
+	case offsetof(md_type, snd_una):				\
+		CONVERT(snd_una); break;				\
+	case offsetof(md_type, mss_cache):				\
+		CONVERT(mss_cache); break;				\
+	case offsetof(md_type, ecn_flags):				\
+		CONVERT(ecn_flags); break;				\
+	case offsetof(md_type, rate_delivered):				\
+		CONVERT(rate_delivered); break;				\
+	case offsetof(md_type, rate_interval_us):			\
+		CONVERT(rate_interval_us); break;			\
+	case offsetof(md_type, packets_out):				\
+		CONVERT(packets_out); break;				\
+	case offsetof(md_type, retrans_out):				\
+		CONVERT(retrans_out); break;				\
+	case offsetof(md_type, total_retrans):				\
+		CONVERT(total_retrans); break;				\
+	case offsetof(md_type, segs_in):				\
+		CONVERT(segs_in); break;				\
+	case offsetof(md_type, data_segs_in):				\
+		CONVERT(data_segs_in); break;				\
+	case offsetof(md_type, segs_out):				\
+		CONVERT(segs_out); break;				\
+	case offsetof(md_type, data_segs_out):				\
+		CONVERT(data_segs_out); break;				\
+	case offsetof(md_type, lost_out):				\
+		CONVERT(lost_out); break;				\
+	case offsetof(md_type, sacked_out):				\
+		CONVERT(sacked_out); break;				\
+	case offsetof(md_type, bytes_received):				\
+		CONVERT(bytes_received); break;				\
+	case offsetof(md_type, bytes_acked):				\
+		CONVERT(bytes_acked); break;				\
+	}								\
+} while (0)
+
 #ifdef CONFIG_INET
 static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
 			      int dif, int sdif, u8 family, u8 proto)
@@ -7124,6 +7172,85 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 	struct bpf_insn *insn = insn_buf;
 	int off;
 
+/* Helper macro for adding read access to tcp_sock or sock fields. */
+#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
+	do {								      \
+		BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >		      \
+			     FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
+						struct bpf_sock_ops_kern,     \
+						is_fullsock),		      \
+				      si->dst_reg, si->src_reg,		      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       is_fullsock));		      \
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 2);	      \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
+						struct bpf_sock_ops_kern, sk),\
+				      si->dst_reg, si->src_reg,		      \
+				      offsetof(struct bpf_sock_ops_kern, sk));\
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ,		      \
+						       OBJ_FIELD),	      \
+				      si->dst_reg, si->dst_reg,		      \
+				      offsetof(OBJ, OBJ_FIELD));	      \
+	} while (0)
+
+#define SOCK_OPS_GET_TCP_SOCK_FIELD(FIELD) \
+		SOCK_OPS_GET_FIELD(FIELD, FIELD, struct tcp_sock)
+
+/* Helper macro for adding write access to tcp_sock or sock fields.
+ * The macro is called with two registers, dst_reg which contains a pointer
+ * to ctx (context) and src_reg which contains the value that should be
+ * stored. However, we need an additional register since we cannot overwrite
+ * dst_reg because it may be used later in the program.
+ * Instead we "borrow" one of the other register. We first save its value
+ * into a new (temp) field in bpf_sock_ops_kern, use it, and then restore
+ * it at the end of the macro.
+ */
+#define SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
+	do {								      \
+		int reg = BPF_REG_9;					      \
+		BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >		      \
+			     FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
+		if (si->dst_reg == reg || si->src_reg == reg)		      \
+			reg--;						      \
+		if (si->dst_reg == reg || si->src_reg == reg)		      \
+			reg--;						      \
+		*insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, reg,		      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       temp));			      \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
+						struct bpf_sock_ops_kern,     \
+						is_fullsock),		      \
+				      reg, si->dst_reg,			      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       is_fullsock));		      \
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);		      \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
+						struct bpf_sock_ops_kern, sk),\
+				      reg, si->dst_reg,			      \
+				      offsetof(struct bpf_sock_ops_kern, sk));\
+		*insn++ = BPF_STX_MEM(BPF_FIELD_SIZEOF(OBJ, OBJ_FIELD),	      \
+				      reg, si->src_reg,			      \
+				      offsetof(OBJ, OBJ_FIELD));	      \
+		*insn++ = BPF_LDX_MEM(BPF_DW, reg, si->dst_reg,		      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       temp));			      \
+	} while (0)
+
+#define SOCK_OPS_GET_OR_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ, TYPE)	      \
+	do {								      \
+		if (TYPE == BPF_WRITE)					      \
+			SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);	      \
+		else							      \
+			SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);	      \
+	} while (0)
+
+	CONVERT_COMMON_TCP_SOCK_FIELDS(struct bpf_sock_ops,
+				       SOCK_OPS_GET_TCP_SOCK_FIELD);
+
+	if (insn > insn_buf)
+		return insn - insn_buf;
+
 	switch (si->off) {
 	case offsetof(struct bpf_sock_ops, op) ...
 	     offsetof(struct bpf_sock_ops, replylong[3]):
@@ -7281,175 +7408,15 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 				      FIELD_SIZEOF(struct minmax_sample, t));
 		break;
 
-/* Helper macro for adding read access to tcp_sock or sock fields. */
-#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
-	do {								      \
-		BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >		      \
-			     FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
-						struct bpf_sock_ops_kern,     \
-						is_fullsock),		      \
-				      si->dst_reg, si->src_reg,		      \
-				      offsetof(struct bpf_sock_ops_kern,      \
-					       is_fullsock));		      \
-		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 2);	      \
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
-						struct bpf_sock_ops_kern, sk),\
-				      si->dst_reg, si->src_reg,		      \
-				      offsetof(struct bpf_sock_ops_kern, sk));\
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ,		      \
-						       OBJ_FIELD),	      \
-				      si->dst_reg, si->dst_reg,		      \
-				      offsetof(OBJ, OBJ_FIELD));	      \
-	} while (0)
-
-/* Helper macro for adding write access to tcp_sock or sock fields.
- * The macro is called with two registers, dst_reg which contains a pointer
- * to ctx (context) and src_reg which contains the value that should be
- * stored. However, we need an additional register since we cannot overwrite
- * dst_reg because it may be used later in the program.
- * Instead we "borrow" one of the other register. We first save its value
- * into a new (temp) field in bpf_sock_ops_kern, use it, and then restore
- * it at the end of the macro.
- */
-#define SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
-	do {								      \
-		int reg = BPF_REG_9;					      \
-		BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >		      \
-			     FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
-		if (si->dst_reg == reg || si->src_reg == reg)		      \
-			reg--;						      \
-		if (si->dst_reg == reg || si->src_reg == reg)		      \
-			reg--;						      \
-		*insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, reg,		      \
-				      offsetof(struct bpf_sock_ops_kern,      \
-					       temp));			      \
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
-						struct bpf_sock_ops_kern,     \
-						is_fullsock),		      \
-				      reg, si->dst_reg,			      \
-				      offsetof(struct bpf_sock_ops_kern,      \
-					       is_fullsock));		      \
-		*insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);		      \
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
-						struct bpf_sock_ops_kern, sk),\
-				      reg, si->dst_reg,			      \
-				      offsetof(struct bpf_sock_ops_kern, sk));\
-		*insn++ = BPF_STX_MEM(BPF_FIELD_SIZEOF(OBJ, OBJ_FIELD),	      \
-				      reg, si->src_reg,			      \
-				      offsetof(OBJ, OBJ_FIELD));	      \
-		*insn++ = BPF_LDX_MEM(BPF_DW, reg, si->dst_reg,		      \
-				      offsetof(struct bpf_sock_ops_kern,      \
-					       temp));			      \
-	} while (0)
-
-#define SOCK_OPS_GET_OR_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ, TYPE)	      \
-	do {								      \
-		if (TYPE == BPF_WRITE)					      \
-			SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);	      \
-		else							      \
-			SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);	      \
-	} while (0)
-
-	case offsetof(struct bpf_sock_ops, snd_cwnd):
-		SOCK_OPS_GET_FIELD(snd_cwnd, snd_cwnd, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, srtt_us):
-		SOCK_OPS_GET_FIELD(srtt_us, srtt_us, struct tcp_sock);
-		break;
-
 	case offsetof(struct bpf_sock_ops, bpf_sock_ops_cb_flags):
 		SOCK_OPS_GET_FIELD(bpf_sock_ops_cb_flags, bpf_sock_ops_cb_flags,
 				   struct tcp_sock);
 		break;
 
-	case offsetof(struct bpf_sock_ops, snd_ssthresh):
-		SOCK_OPS_GET_FIELD(snd_ssthresh, snd_ssthresh, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, rcv_nxt):
-		SOCK_OPS_GET_FIELD(rcv_nxt, rcv_nxt, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, snd_nxt):
-		SOCK_OPS_GET_FIELD(snd_nxt, snd_nxt, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, snd_una):
-		SOCK_OPS_GET_FIELD(snd_una, snd_una, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, mss_cache):
-		SOCK_OPS_GET_FIELD(mss_cache, mss_cache, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, ecn_flags):
-		SOCK_OPS_GET_FIELD(ecn_flags, ecn_flags, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, rate_delivered):
-		SOCK_OPS_GET_FIELD(rate_delivered, rate_delivered,
-				   struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, rate_interval_us):
-		SOCK_OPS_GET_FIELD(rate_interval_us, rate_interval_us,
-				   struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, packets_out):
-		SOCK_OPS_GET_FIELD(packets_out, packets_out, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, retrans_out):
-		SOCK_OPS_GET_FIELD(retrans_out, retrans_out, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, total_retrans):
-		SOCK_OPS_GET_FIELD(total_retrans, total_retrans,
-				   struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, segs_in):
-		SOCK_OPS_GET_FIELD(segs_in, segs_in, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, data_segs_in):
-		SOCK_OPS_GET_FIELD(data_segs_in, data_segs_in, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, segs_out):
-		SOCK_OPS_GET_FIELD(segs_out, segs_out, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, data_segs_out):
-		SOCK_OPS_GET_FIELD(data_segs_out, data_segs_out,
-				   struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, lost_out):
-		SOCK_OPS_GET_FIELD(lost_out, lost_out, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, sacked_out):
-		SOCK_OPS_GET_FIELD(sacked_out, sacked_out, struct tcp_sock);
-		break;
-
 	case offsetof(struct bpf_sock_ops, sk_txhash):
 		SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, sk_txhash,
 					  struct sock, type);
 		break;
-
-	case offsetof(struct bpf_sock_ops, bytes_received):
-		SOCK_OPS_GET_FIELD(bytes_received, bytes_received,
-				   struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, bytes_acked):
-		SOCK_OPS_GET_FIELD(bytes_acked, bytes_acked, struct tcp_sock);
-		break;
-
 	}
 	return insn - insn_buf;
 }
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next 6/6] bpf: Add test_sock_fields for skb->sk and bpf_tcp_sock
From: Martin KaFai Lau @ 2019-02-01  7:04 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo
In-Reply-To: <20190201070356.4148189-1-kafai@fb.com>

This patch adds a C program to show the usage on
skb->sk and bpf_tcp_sock.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/testing/selftests/bpf/Makefile          |   5 +-
 tools/testing/selftests/bpf/bpf_helpers.h     |   4 +
 .../testing/selftests/bpf/test_sock_fields.c  | 314 ++++++++++++++++++
 .../selftests/bpf/test_sock_fields_kern.c     | 145 ++++++++
 4 files changed, 466 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sock_fields.c
 create mode 100644 tools/testing/selftests/bpf/test_sock_fields_kern.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index c566090e5657..6c00f5e39b63 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -23,7 +23,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
 	test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
 	test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user \
 	test_socket_cookie test_cgroup_storage test_select_reuseport test_section_names \
-	test_netcnt test_tcpnotify_user
+	test_netcnt test_tcpnotify_user test_sock_fields
 
 BPF_OBJ_FILES = \
 	test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o \
@@ -35,7 +35,7 @@ BPF_OBJ_FILES = \
 	sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
 	get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
 	test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o test_xdp_vlan.o \
-	xdp_dummy.o test_map_in_map.o
+	xdp_dummy.o test_map_in_map.o test_sock_fields_kern.o
 
 # Objects are built with default compilation flags and with sub-register
 # code-gen enabled.
@@ -110,6 +110,7 @@ $(OUTPUT)/test_progs: trace_helpers.c
 $(OUTPUT)/get_cgroup_id_user: cgroup_helpers.c
 $(OUTPUT)/test_cgroup_storage: cgroup_helpers.c
 $(OUTPUT)/test_netcnt: cgroup_helpers.c
+$(OUTPUT)/test_sock_fields: cgroup_helpers.c
 
 .PHONY: force
 
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 6c77cf7bedce..a5bb09f5e6fc 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -172,6 +172,10 @@ static int (*bpf_skb_vlan_pop)(void *ctx) =
 	(void *) BPF_FUNC_skb_vlan_pop;
 static int (*bpf_rc_pointer_rel)(void *ctx, int rel_x, int rel_y) =
 	(void *) BPF_FUNC_rc_pointer_rel;
+static struct bpf_sock *(*bpf_sk_fullsock)(struct bpf_sock *sk) =
+	(void *) BPF_FUNC_sk_fullsock;
+static struct bpf_tcp_sock *(*bpf_tcp_sock)(struct bpf_sock *sk) =
+	(void *) BPF_FUNC_tcp_sock;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/test_sock_fields.c b/tools/testing/selftests/bpf/test_sock_fields.c
new file mode 100644
index 000000000000..1f8ff8e6caf4
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sock_fields.c
@@ -0,0 +1,314 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook */
+
+#include <sys/socket.h>
+#include <sys/epoll.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include "cgroup_helpers.h"
+
+enum bpf_array_idx {
+	SRV_IDX,
+	CLI_IDX,
+	__NR_BPF_ARRAY_IDX,
+};
+
+#define CHECK(condition, tag, format...) ({				\
+	int __ret = !!(condition);					\
+	if (__ret) {							\
+		printf("%s(%d):FAIL:%s ", __func__, __LINE__, tag);	\
+		printf(format);						\
+		printf("\n");						\
+		exit(-1);						\
+	}								\
+})
+
+#define TEST_CGROUP "/test-bpf-sock-fields"
+#define DATA "Hello BPF!"
+#define DATA_LEN sizeof(DATA)
+
+static struct sockaddr_in6 srv_sa6, cli_sa6;
+static int linum_map_fd;
+static int addr_map_fd;
+static int tp_map_fd;
+static int sk_map_fd;
+static __u32 srv_idx = SRV_IDX;
+static __u32 cli_idx = CLI_IDX;
+
+static void init_loopback6(struct sockaddr_in6 *sa6)
+{
+	memset(sa6, 0, sizeof(*sa6));
+	sa6->sin6_family = AF_INET6;
+	sa6->sin6_addr = in6addr_loopback;
+}
+
+static void print_sk(const struct bpf_sock *sk)
+{
+	char ip4[64];
+	char ip6[64];
+
+	inet_ntop(AF_INET, &sk->src_ip4, ip4, sizeof(ip4));
+	inet_ntop(AF_INET6, &sk->src_ip6, ip6, sizeof(ip6));
+
+	printf("bound_dev_if:%u family:%u type:%u protocol:%u mark:%u priority:%u "
+	       "src_ip4:%x(%s) src_ip6:%x:%x:%x:%x(%s) src_port:%u\n",
+	       sk->bound_dev_if, sk->family, sk->type, sk->protocol,
+	       sk->mark, sk->priority, sk->src_ip4, ip4,
+	       sk->src_ip6[0], sk->src_ip6[1], sk->src_ip6[2], sk->src_ip6[3], ip6,
+	       sk->src_port);
+}
+
+static void print_tp(const struct bpf_tcp_sock *tp)
+{
+	printf("snd_cwnd:%u srtt_us:%u rtt_min:%u snd_ssthresh:%u rcv_nxt:%u "
+	       "snd_nxt:%u snd:una:%u mss_cache:%u ecn_flags:%u "
+	       "rate_delivered:%u rate_interval_us:%u packets_out:%u "
+	       "retrans_out:%u total_retrans:%u segs_in:%u data_segs_in:%u "
+	       "segs_out:%u data_segs_out:%u lost_out:%u sacked_out:%u "
+	       "bytes_received:%llu bytes_acked:%llu\n",
+	       tp->snd_cwnd, tp->srtt_us, tp->rtt_min, tp->snd_ssthresh,
+	       tp->rcv_nxt, tp->snd_nxt, tp->snd_una, tp->mss_cache,
+	       tp->ecn_flags, tp->rate_delivered, tp->rate_interval_us,
+	       tp->packets_out, tp->retrans_out, tp->total_retrans,
+	       tp->segs_in, tp->data_segs_in, tp->segs_out,
+	       tp->data_segs_out, tp->lost_out, tp->sacked_out,
+	       tp->bytes_received, tp->bytes_acked);
+}
+
+static void check_result(void)
+{
+	struct bpf_tcp_sock srv_tp, cli_tp;
+	struct bpf_sock srv_sk, cli_sk;
+	__u32 linum, idx0 = 0;
+	int err;
+
+	err = bpf_map_lookup_elem(linum_map_fd, &idx0, &linum);
+	CHECK(err == -1, "bpf_map_lookup_elem(linum_map_fd)",
+	      "err:%d errno:%d", err, errno);
+
+	err = bpf_map_lookup_elem(sk_map_fd, &srv_idx, &srv_sk);
+	CHECK(err == -1, "bpf_map_lookup_elem(sk_map_fd, &srv_idx)",
+	      "err:%d errno:%d", err, errno);
+	err = bpf_map_lookup_elem(tp_map_fd, &srv_idx, &srv_tp);
+	CHECK(err == -1, "bpf_map_lookup_elem(tp_map_fd, &srv_idx)",
+	      "err:%d errno:%d", err, errno);
+
+	err = bpf_map_lookup_elem(sk_map_fd, &cli_idx, &cli_sk);
+	CHECK(err == -1, "bpf_map_lookup_elem(sk_map_fd, &cli_idx)",
+	      "err:%d errno:%d", err, errno);
+	err = bpf_map_lookup_elem(tp_map_fd, &cli_idx, &cli_tp);
+	CHECK(err == -1, "bpf_map_lookup_elem(tp_map_fd, &cli_idx)",
+	      "err:%d errno:%d", err, errno);
+
+	printf("srv_sk: ");
+	print_sk(&srv_sk);
+	printf("\n");
+
+	printf("cli_sk: ");
+	print_sk(&cli_sk);
+	printf("\n");
+
+	printf("srv_tp: ");
+	print_tp(&srv_tp);
+	printf("\n");
+
+	printf("cli_tp: ");
+	print_tp(&cli_tp);
+	printf("\n");
+
+	CHECK(srv_sk.family != AF_INET6 ||
+	      srv_sk.protocol != IPPROTO_TCP ||
+	      memcmp(srv_sk.src_ip6, &in6addr_loopback,
+		     sizeof(srv_sk.src_ip6)) ||
+	      srv_sk.src_port != ntohs(srv_sa6.sin6_port),
+	      "Unexpected srv_sk", "Check srv_sk output. linum:%u",
+	      linum);
+
+	CHECK(cli_sk.family != AF_INET6 ||
+	      cli_sk.protocol != IPPROTO_TCP ||
+	      memcmp(cli_sk.src_ip6, &in6addr_loopback,
+		     sizeof(cli_sk.src_ip6)) ||
+	      cli_sk.src_port != ntohs(cli_sa6.sin6_port),
+	      "Unexpected cli_sk", "Check cli_sk output. linum:%u",
+	      linum);
+
+	CHECK(srv_tp.data_segs_out != 1 ||
+	      srv_tp.data_segs_in ||
+	      srv_tp.snd_cwnd != 10 ||
+	      srv_tp.total_retrans ||
+	      srv_tp.bytes_acked != DATA_LEN,
+	      "Unexpected srv_tp", "Check srv_tp output. linum:%u",
+	      linum);
+
+	CHECK(cli_tp.data_segs_out ||
+	      cli_tp.data_segs_in != 1 ||
+	      cli_tp.snd_cwnd != 10 ||
+	      cli_tp.total_retrans ||
+	      cli_tp.bytes_received != DATA_LEN,
+	      "Unexpected cli_tp", "Check cli_tp output. linum:%u",
+	      linum);
+}
+
+static void test(void)
+{
+	int listen_fd, cli_fd, accept_fd, epfd, err;
+	struct epoll_event ev;
+	socklen_t addrlen;
+
+	addrlen = sizeof(struct sockaddr_in6);
+	ev.events = EPOLLIN;
+
+	epfd = epoll_create(1);
+	CHECK(epfd == -1, "epoll_create()", "epfd:%d errno:%d", epfd, errno);
+
+	/* Prepare listen_fd */
+	listen_fd = socket(AF_INET6, SOCK_STREAM | SOCK_NONBLOCK, 0);
+	CHECK(listen_fd == -1, "socket()", "listen_fd:%d errno:%d",
+	      listen_fd, errno);
+
+	init_loopback6(&srv_sa6);
+	err = bind(listen_fd, (struct sockaddr *)&srv_sa6, sizeof(srv_sa6));
+	CHECK(err, "bind(listen_fd)", "err:%d errno:%d", err, errno);
+
+	err = getsockname(listen_fd, (struct sockaddr *)&srv_sa6, &addrlen);
+	CHECK(err, "getsockname(listen_fd)", "err:%d errno:%d", err, errno);
+
+	err = listen(listen_fd, 0);
+	CHECK(err, "listen(listen_fd)", "err:%d errno:%d", err, errno);
+
+	/* Prepare cli_fd */
+	cli_fd = socket(AF_INET6, SOCK_STREAM | SOCK_NONBLOCK, 0);
+	CHECK(cli_fd == -1, "socket()", "cli_fd:%d errno:%d", cli_fd, errno);
+
+	init_loopback6(&cli_sa6);
+	err = bind(cli_fd, (struct sockaddr *)&cli_sa6, sizeof(cli_sa6));
+	CHECK(err, "bind(cli_fd)", "err:%d errno:%d", err, errno);
+
+	err = getsockname(cli_fd, (struct sockaddr *)&cli_sa6, &addrlen);
+	CHECK(err, "getsockname(cli_fd)", "err:%d errno:%d",
+	      err, errno);
+
+	/* Update addr_map with srv_sa6 and cli_sa6 */
+	err = bpf_map_update_elem(addr_map_fd, &srv_idx, &srv_sa6, 0);
+	CHECK(err, "map_update", "err:%d errno:%d", err, errno);
+
+	err = bpf_map_update_elem(addr_map_fd, &cli_idx, &cli_sa6, 0);
+	CHECK(err, "map_update", "err:%d errno:%d", err, errno);
+
+	/* Connect from cli_sa6 to srv_sa6 */
+	err = connect(cli_fd, (struct sockaddr *)&srv_sa6, addrlen);
+	printf("srv_sa6.sin6_port:%u cli_sa6.sin6_port:%u\n\n",
+	       ntohs(srv_sa6.sin6_port), ntohs(cli_sa6.sin6_port));
+	CHECK(err && errno != EINPROGRESS,
+	      "connect(cli_fd)", "err:%d errno:%d", err, errno);
+
+	ev.data.fd = listen_fd;
+	err = epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
+	CHECK(err, "epoll_ctl(EPOLL_CTL_ADD, listen_fd)", "err:%d errno:%d",
+	      err, errno);
+
+	/* Accept the connection */
+	/* Have some timeout in accept(listen_fd). Just in case. */
+	err = epoll_wait(epfd, &ev, 1, 1000);
+	CHECK(err != 1 || ev.data.fd != listen_fd,
+	      "epoll_wait(listen_fd)",
+	      "err:%d errno:%d ev.data.fd:%d listen_fd:%d",
+	      err, errno, ev.data.fd, listen_fd);
+
+	accept_fd = accept(listen_fd, NULL, NULL);
+	CHECK(accept_fd == -1, "accept(listen_fd)", "accept_fd:%d errno:%d",
+	      accept_fd, errno);
+	close(listen_fd);
+
+	/* Send some data from accept_fd to cli_fd */
+	err = send(accept_fd, DATA, DATA_LEN, 0);
+	CHECK(err != DATA_LEN, "send(accept_fd)", "err:%d errno:%d",
+	      err, errno);
+
+	/* Have some timeout in recv(cli_fd). Just in case. */
+	ev.data.fd = cli_fd;
+	err = epoll_ctl(epfd, EPOLL_CTL_ADD, cli_fd, &ev);
+	CHECK(err, "epoll_ctl(EPOLL_CTL_ADD, cli_fd)", "err:%d errno:%d",
+	      err, errno);
+
+	err = epoll_wait(epfd, &ev, 1, 1000);
+	CHECK(err != 1 || ev.data.fd != cli_fd,
+	      "epoll_wait(cli_fd)", "err:%d errno:%d ev.data.fd:%d cli_fd:%d",
+	      err, errno, ev.data.fd, cli_fd);
+
+	err = recv(cli_fd, NULL, 0, MSG_TRUNC);
+	CHECK(err, "recv(cli_fd)", "err:%d errno:%d", err, errno);
+
+	close(epfd);
+	close(accept_fd);
+	close(cli_fd);
+
+	check_result();
+}
+
+int main(int argc, char **argv)
+{
+	struct bpf_prog_load_attr attr = {
+		.file = "test_sock_fields_kern.o",
+		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+		.expected_attach_type = BPF_CGROUP_INET_EGRESS,
+	};
+	int cgroup_fd, prog_fd, err;
+	struct bpf_object *obj;
+	struct bpf_map *map;
+
+	err = setup_cgroup_environment();
+	CHECK(err, "setup_cgroup_environment()", "err:%d errno:%d",
+	      err, errno);
+
+	atexit(cleanup_cgroup_environment);
+
+	/* Create a cgroup, get fd, and join it */
+	cgroup_fd = create_and_get_cgroup(TEST_CGROUP);
+	CHECK(cgroup_fd == -1, "create_and_get_cgroup()",
+	      "cgroup_fd:%d errno:%d", cgroup_fd, errno);
+
+	err = join_cgroup(TEST_CGROUP);
+	CHECK(err, "join_cgroup", "err:%d errno:%d", err, errno);
+
+	err = bpf_prog_load_xattr(&attr, &obj, &prog_fd);
+	CHECK(err, "bpf_prog_load_xattr()", "err:%d", err);
+
+	err = bpf_prog_attach(prog_fd, cgroup_fd, BPF_CGROUP_INET_EGRESS, 0);
+	CHECK(err == -1, "bpf_prog_attach(CPF_CGROUP_INET_EGRESS)",
+	      "err:%d errno%d", err, errno);
+	close(cgroup_fd);
+
+	map = bpf_object__find_map_by_name(obj, "addr_map");
+	CHECK(!map, "cannot find addr_map", "(null)");
+	addr_map_fd = bpf_map__fd(map);
+
+	map = bpf_object__find_map_by_name(obj, "sock_result_map");
+	CHECK(!map, "cannot find sock_result_map", "(null)");
+	sk_map_fd = bpf_map__fd(map);
+
+	map = bpf_object__find_map_by_name(obj, "tcp_sock_result_map");
+	CHECK(!map, "cannot find tcp_sock_result_map", "(null)");
+	tp_map_fd = bpf_map__fd(map);
+
+	map = bpf_object__find_map_by_name(obj, "linum_map");
+	CHECK(!map, "cannot find linum_map", "(null)");
+	linum_map_fd = bpf_map__fd(map);
+
+	test();
+
+	bpf_object__close(obj);
+	cleanup_cgroup_environment();
+
+	printf("PASS\n");
+
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/test_sock_fields_kern.c b/tools/testing/selftests/bpf/test_sock_fields_kern.c
new file mode 100644
index 000000000000..09d4d8d34588
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sock_fields_kern.c
@@ -0,0 +1,145 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook */
+
+#include <linux/bpf.h>
+#include <netinet/in.h>
+#include <stdbool.h>
+
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+enum bpf_array_idx {
+	SRV_IDX,
+	CLI_IDX,
+	__NR_BPF_ARRAY_IDX,
+};
+
+struct bpf_map_def SEC("maps") addr_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(struct sockaddr_in6),
+	.max_entries = __NR_BPF_ARRAY_IDX,
+};
+
+struct bpf_map_def SEC("maps") sock_result_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(struct bpf_sock),
+	.max_entries = __NR_BPF_ARRAY_IDX,
+};
+
+struct bpf_map_def SEC("maps") tcp_sock_result_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(struct bpf_tcp_sock),
+	.max_entries = __NR_BPF_ARRAY_IDX,
+};
+
+struct bpf_map_def SEC("maps") linum_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 1,
+};
+
+static bool is_loopback6(__u32 *a6)
+{
+	return !a6[0] && !a6[1] && !a6[2] && a6[3] == bpf_htonl(1);
+}
+
+static void skcpy(struct bpf_sock *dst,
+		  const struct bpf_sock *src)
+{
+	dst->bound_dev_if = src->bound_dev_if;
+	dst->family = src->family;
+	dst->type = src->type;
+	dst->protocol = src->protocol;
+	dst->mark = src->mark;
+	dst->priority = src->priority;
+	dst->src_ip4 = src->src_ip4;
+	dst->src_ip6[0] = src->src_ip6[0];
+	dst->src_ip6[1] = src->src_ip6[1];
+	dst->src_ip6[2] = src->src_ip6[2];
+	dst->src_ip6[3] = src->src_ip6[3];
+	dst->src_port = src->src_port;
+}
+
+static void tpcpy(struct bpf_tcp_sock *dst,
+		  const struct bpf_tcp_sock *src)
+{
+	dst->snd_cwnd = src->snd_cwnd;
+	dst->srtt_us = src->srtt_us;
+	dst->rtt_min = src->rtt_min;
+	dst->snd_ssthresh = src->snd_ssthresh;
+	dst->rcv_nxt = src->rcv_nxt;
+	dst->snd_nxt = src->snd_nxt;
+	dst->snd_una = src->snd_una;
+	dst->mss_cache = src->mss_cache;
+	dst->ecn_flags = src->ecn_flags;
+	dst->rate_delivered = src->rate_delivered;
+	dst->rate_interval_us = src->rate_interval_us;
+	dst->packets_out = src->packets_out;
+	dst->retrans_out = src->retrans_out;
+	dst->total_retrans = src->total_retrans;
+	dst->segs_in = src->segs_in;
+	dst->data_segs_in = src->data_segs_in;
+	dst->segs_out = src->segs_out;
+	dst->data_segs_out = src->data_segs_out;
+	dst->lost_out = src->lost_out;
+	dst->sacked_out = src->sacked_out;
+	dst->bytes_received = src->bytes_received;
+	dst->bytes_acked = src->bytes_acked;
+}
+
+#define RETURN {						\
+	linum = __LINE__;					\
+	bpf_map_update_elem(&linum_map, &idx0, &linum, 0);	\
+	return 1;						\
+}
+
+SEC("cgroup_skb/egress")
+int read_sock_fields(struct __sk_buff *skb)
+{
+	__u32 srv_idx = SRV_IDX, cli_idx = CLI_IDX, idx;
+	struct sockaddr_in6 *srv_sa6, *cli_sa6;
+	struct bpf_tcp_sock *tp, *tp_ret;
+	struct bpf_sock *sk, *sk_ret;
+	__u32 linum, idx0 = 0;
+
+	sk = skb->sk;
+	if (!sk)
+		RETURN;
+
+	sk = bpf_sk_fullsock(sk);
+	if (!sk || sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP ||
+	    !is_loopback6(sk->src_ip6))
+		RETURN;
+
+	tp = bpf_tcp_sock(sk);
+	if (!tp)
+		RETURN;
+
+	srv_sa6 = bpf_map_lookup_elem(&addr_map, &srv_idx);
+	cli_sa6 = bpf_map_lookup_elem(&addr_map, &cli_idx);
+	if (!srv_sa6 || !cli_sa6)
+		RETURN;
+
+	if (sk->src_port == bpf_ntohs(srv_sa6->sin6_port))
+		idx = srv_idx;
+	else if (sk->src_port == bpf_ntohs(cli_sa6->sin6_port))
+		idx = cli_idx;
+	else
+		RETURN;
+
+	sk_ret = bpf_map_lookup_elem(&sock_result_map, &idx);
+	tp_ret = bpf_map_lookup_elem(&tcp_sock_result_map, &idx);
+	if (!sk_ret || !tp_ret)
+		RETURN;
+
+	skcpy(sk_ret, sk);
+	tpcpy(tp_ret, tp);
+
+	RETURN;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next 4/6] bpf: Sync bpf.h to tools/
From: Martin KaFai Lau @ 2019-02-01  7:04 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo
In-Reply-To: <20190201070356.4148189-1-kafai@fb.com>

This patch sync the uapi bpf.h to tools/.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/include/uapi/linux/bpf.h | 61 +++++++++++++++++++++++++++++++++-
 1 file changed, 60 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 60b99b730a41..5f6c947abdd8 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2328,6 +2328,23 @@ union bpf_attr {
  *		"**y**".
  *	Return
  *		0
+ *
+ * struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)
+ *	Description
+ *		This helper tests if a bpf_sock is a fullsock such that
+ *		all the fields in bpf_sock can be accessed.
+ *	Return
+ *		The same pointer if *sk* is a fullsock.  Otherwise,
+ *		NULL is returned.
+ *
+ * struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk)
+ *	Description
+ *		Obtain a **struct bpf_tcp_sock** pointer from a
+ *              **struct bpf_sock** pointer.
+ *
+ *	Return
+ *		Pointer to **struct bpf_tcp_sock**, or **NULL** in
+ *		case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2422,7 +2439,9 @@ union bpf_attr {
 	FN(map_peek_elem),		\
 	FN(msg_push_data),		\
 	FN(msg_pop_data),		\
-	FN(rc_pointer_rel),
+	FN(rc_pointer_rel),		\
+	FN(sk_fullsock),		\
+	FN(tcp_sock),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -2542,6 +2561,7 @@ struct __sk_buff {
 	__u64 tstamp;
 	__u32 wire_len;
 	__u32 gso_segs;
+	__bpf_md_ptr(struct bpf_sock *, sk);
 };
 
 struct bpf_tunnel_key {
@@ -2604,6 +2624,45 @@ struct bpf_sock {
 				 */
 };
 
+struct bpf_tcp_sock {
+	__u32 snd_cwnd;		/* Sending congestion window		*/
+	__u32 srtt_us;		/* smoothed round trip time << 3 in usecs */
+	__u32 rtt_min;
+	__u32 snd_ssthresh;	/* Slow start size threshold		*/
+	__u32 rcv_nxt;		/* What we want to receive next		*/
+	__u32 snd_nxt;		/* Next sequence we send		*/
+	__u32 snd_una;		/* First byte we want an ack for	*/
+	__u32 mss_cache;	/* Cached effective mss, not including SACKS */
+	__u32 ecn_flags;	/* ECN status bits.			*/
+	__u32 rate_delivered;	/* saved rate sample: packets delivered */
+	__u32 rate_interval_us;	/* saved rate sample: time elapsed */
+	__u32 packets_out;	/* Packets which are "in flight"	*/
+	__u32 retrans_out;	/* Retransmitted packets out		*/
+	__u32 total_retrans;	/* Total retransmits for entire connection */
+	__u32 segs_in;		/* RFC4898 tcpEStatsPerfSegsIn
+				 * total number of segments in.
+				 */
+	__u32 data_segs_in;	/* RFC4898 tcpEStatsPerfDataSegsIn
+				 * total number of data segments in.
+				 */
+	__u32 segs_out;		/* RFC4898 tcpEStatsPerfSegsOut
+				 * The total number of segments sent.
+				 */
+	__u32 data_segs_out;	/* RFC4898 tcpEStatsPerfDataSegsOut
+				 * total number of data segments sent.
+				 */
+	__u32 lost_out;		/* Lost packets			*/
+	__u32 sacked_out;	/* SACK'd packets			*/
+	__u64 bytes_received;	/* RFC4898 tcpEStatsAppHCThruOctetsReceived
+				 * sum(delta(rcv_nxt)), or how many bytes
+				 * were acked.
+				 */
+	__u64 bytes_acked;	/* RFC4898 tcpEStatsAppHCThruOctetsAcked
+				 * sum(delta(snd_una)), or how many bytes
+				 * were acked.
+				 */
+};
+
 struct bpf_sock_tuple {
 	union {
 		struct {
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next 5/6] bpf: Add skb->sk, bpf_sk_fullsock and bpf_tcp_sock tests to test_verifer
From: Martin KaFai Lau @ 2019-02-01  7:04 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo
In-Reply-To: <20190201070356.4148189-1-kafai@fb.com>

This patch tests accessing the skb->sk and the new helpers,
bpf_sk_fullsock and bpf_tcp_sock.

The errstr of some existing "reference tracking" tests is changed
with s/bpf_sock/sock/ and s/socket/sock/ where "sock" is from the
verifier's reg_type_str[].

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 .../selftests/bpf/verifier/ref_tracking.c     |   4 +-
 tools/testing/selftests/bpf/verifier/sock.c   | 238 ++++++++++++++++++
 tools/testing/selftests/bpf/verifier/unpriv.c |   2 +-
 3 files changed, 241 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/verifier/sock.c

diff --git a/tools/testing/selftests/bpf/verifier/ref_tracking.c b/tools/testing/selftests/bpf/verifier/ref_tracking.c
index dc2cc823df2b..3ed3593bd8b6 100644
--- a/tools/testing/selftests/bpf/verifier/ref_tracking.c
+++ b/tools/testing/selftests/bpf/verifier/ref_tracking.c
@@ -547,7 +547,7 @@
 	BPF_EXIT_INSN(),
 	},
 	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
-	.errstr = "cannot write into socket",
+	.errstr = "cannot write into sock",
 	.result = REJECT,
 },
 {
@@ -562,7 +562,7 @@
 	BPF_EXIT_INSN(),
 	},
 	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
-	.errstr = "invalid bpf_sock access off=0 size=8",
+	.errstr = "invalid sock access off=0 size=8",
 	.result = REJECT,
 },
 {
diff --git a/tools/testing/selftests/bpf/verifier/sock.c b/tools/testing/selftests/bpf/verifier/sock.c
new file mode 100644
index 000000000000..a01ea3bc1c54
--- /dev/null
+++ b/tools/testing/selftests/bpf/verifier/sock.c
@@ -0,0 +1,238 @@
+{
+	"skb->sk: no NULL check",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, 0),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid mem access 'sock_common_or_null'",
+},
+{
+	"skb->sk: sk->family [non fullsock field]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, offsetof(struct bpf_sock, family)),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"skb->sk: sk->type [fullsock field]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, offsetof(struct bpf_sock, type)),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid sock_common access",
+},
+{
+	"bpf_sk_fullsock(skb->sk): no !skb->sk check",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "type=sock_common_or_null expected=sock_common",
+},
+{
+	"sk_fullsock(skb->sk): no NULL check on ret",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, type)),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid mem access 'sock_or_null'",
+},
+{
+	"sk_fullsock(skb->sk): sk->type [fullsock field]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, type)),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"sk_fullsock(skb->sk): sk->family [non fullsock field]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, family)),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"bpf_tcp_sock(skb->sk): no !skb->sk check",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "type=sock_common_or_null expected=sock_common",
+},
+{
+	"bpf_tcp_sock(skb->sk): no NULL check on ret",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_tcp_sock, snd_cwnd)),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid mem access 'tcp_sock_or_null'",
+},
+{
+	"bpf_tcp_sock(skb->sk): sk->snd_cwnd",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_tcp_sock, snd_cwnd)),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"bpf_tcp_sock(skb->sk): sk->bytes_acked",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_tcp_sock, bytes_acked)),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"bpf_tcp_sock(bpf_sk_fullsock(skb->sk)): tp->snd_cwnd",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_tcp_sock, snd_cwnd)),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"bpf_sk_release(skb->sk)",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 1),
+	BPF_EMIT_CALL(BPF_FUNC_sk_release),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.result = REJECT,
+	.errstr = "type=sock_common expected=sock",
+},
+{
+	"bpf_sk_release(bpf_sk_fullsock(skb->sk))",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_EMIT_CALL(BPF_FUNC_sk_release),
+	BPF_MOV64_IMM(BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.result = REJECT,
+	.errstr = "reference has not been acquired before",
+},
+{
+	"bpf_sk_release(bpf_tcp_sock(skb->sk))",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_EMIT_CALL(BPF_FUNC_sk_release),
+	BPF_MOV64_IMM(BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.result = REJECT,
+	.errstr = "type=tcp_sock expected=sock",
+},
diff --git a/tools/testing/selftests/bpf/verifier/unpriv.c b/tools/testing/selftests/bpf/verifier/unpriv.c
index dca58cf1a4ab..0013847993a8 100644
--- a/tools/testing/selftests/bpf/verifier/unpriv.c
+++ b/tools/testing/selftests/bpf/verifier/unpriv.c
@@ -364,7 +364,7 @@
 	},
 	.result = REJECT,
 	//.errstr = "same insn cannot be used with different pointers",
-	.errstr = "cannot write into socket",
+	.errstr = "cannot write into sock",
 	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
 },
 {
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH bpf-next v5 5/8] xdp: Provide extack messages when prog attachment failed
From: Jesper Dangaard Brouer @ 2019-02-01  7:04 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: daniel, ast, netdev, jakub.kicinski, john.fastabend, brouer
In-Reply-To: <20190201001954.4130-6-maciej.fijalkowski@intel.com>

On Fri,  1 Feb 2019 01:19:51 +0100
Maciej Fijalkowski <maciejromanfijalkowski@gmail.com> wrote:

> In order to provide more meaningful messages to user when the process of
> loading xdp program onto network interface failed, let's add extack
> messages within dev_change_xdp_fd.
> 
> Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH net-next v3 8/8] ethtool: add compat for devlink info
From: Jiri Pirko @ 2019-02-01  7:57 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Miller, lkp, kbuild-all, netdev, oss-drivers, andrew,
	f.fainelli, mkubecek, eugenem, jonathan.lemon
In-Reply-To: <20190131092925.2106513e@cakuba.hsd1.ca.comcast.net>

Thu, Jan 31, 2019 at 06:29:25PM CET, jakub.kicinski@netronome.com wrote:
>On Thu, 31 Jan 2019 09:25:08 -0800, Jakub Kicinski wrote:
>> On Thu, 31 Jan 2019 09:23:03 -0800 (PST), David Miller wrote:
>> > From: kbuild test robot <lkp@intel.com>
>> > Date: Fri, 1 Feb 2019 00:19:33 +0800
>> >   
>> > > All errors (new ones prefixed by >>):
>> > > 
>> > >    m68k-linux-gnu-ld: drivers/rtc/proc.o: in function `is_rtc_hctosys.isra.0':
>> > >    proc.c:(.text+0x178): undefined reference to `strcmp'
>> > >    m68k-linux-gnu-ld: net/core/ethtool.o: in function `ethtool_get_drvinfo':    
>> > >>> ethtool.c:(.text+0xc08): undefined reference to `devlink_compat_running_version'    
>> > 
>> > Missing string.h include perhaps?  
>> 
>> Yeah, that one looks like existing m68k bug, but also I think I need to
>> cater to the DEVLINK=m case since ethtool code is always built in we
>> can't use the MAY_USE_DEVLINK trick :S
>
>I think we have to do this:
>
>#if IS_REACHABLE(CONFIG_NET_DEVLINK)
>void devlink_compat_running_version(struct net_device *dev,
>				    char *buf, size_t len);
>#else
>static inline void
>devlink_compat_running_version(struct net_device *dev, char *buf, size_t len)
>{
>}
>#endif
>
>Jiri, any objections?

Np.

^ permalink raw reply

* Re: [PATCH net-next v4 8/8] ethtool: add compat for devlink info
From: Jiri Pirko @ 2019-02-01  7:58 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, oss-drivers, andrew, f.fainelli, mkubecek, eugenem,
	jonathan.lemon
In-Reply-To: <20190131185047.27685-9-jakub.kicinski@netronome.com>

Thu, Jan 31, 2019 at 07:50:47PM CET, jakub.kicinski@netronome.com wrote:
>If driver did not fill the fw_version field, try to call into
>the new devlink get_info op and collect the versions that way.
>We assume ethtool was always reporting running versions.
>
>v4:
> - use IS_REACHABLE() to avoid problems with DEVLINK=m (kbuildbot).
>v3 (Jiri):
> - do a dump and then parse it instead of special handling;
> - concatenate all versions (well, all that fit :)).
>
>Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* Re: [PATCH v2] Revert "net: phy: marvell: avoid pause mode on SGMII-to-Copper for 88e151x"
From: liuyonglong @ 2019-02-01  8:10 UTC (permalink / raw)
  To: Russell King, Andrew Lunn, Florian Fainelli, Heiner Kallweit
  Cc: David S. Miller, netdev
In-Reply-To: <E1gpFgo-0000gO-A6@rmk-PC.armlinux.org.uk>

On 2019/2/1 0:59, Russell King wrote:
> This reverts commit 6623c0fba10ef45b64ca213ad5dec926f37fa9a0.
> 
> The original diagnosis was incorrect: it appears that the NIC had
> PHY polling mode enabled, which meant that it overwrote the PHYs
> advertisement register during negotiation.
> 
> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>

Tested-by: Yonglong Liu <liuyonglong@huawei.com>

Had done boot testing, and the networking is fine too.

> ---
> v2: respun on top of net-next, but I've not yet been able to boot
>  test this - which is unlikely to be for a while yet.
> 
> The extraneous whitespace change comes from the reversion of the
> original patch - I added a "u32 pause" and blank line there.  The
> patch converting to link modes removed the "u32 pause" but left
> the blank line, causing a coding style issue.  It seems only right
> that a reversion of the original commit fixes that too.
> 
>  drivers/net/phy/marvell.c | 16 ----------------
>  1 file changed, 16 deletions(-)
> 
> diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
> index 90f44ba8aca7..3ccba37bd6dd 100644
> --- a/drivers/net/phy/marvell.c
> +++ b/drivers/net/phy/marvell.c
> @@ -842,7 +842,6 @@ static int m88e1510_config_init(struct phy_device *phydev)
>  
>  	/* SGMII-to-Copper mode initialization */
>  	if (phydev->interface == PHY_INTERFACE_MODE_SGMII) {
> -
>  		/* Select page 18 */
>  		err = marvell_set_page(phydev, 18);
>  		if (err < 0)
> @@ -865,21 +864,6 @@ static int m88e1510_config_init(struct phy_device *phydev)
>  		err = marvell_set_page(phydev, MII_MARVELL_COPPER_PAGE);
>  		if (err < 0)
>  			return err;
> -
> -		/* There appears to be a bug in the 88e1512 when used in
> -		 * SGMII to copper mode, where the AN advertisement register
> -		 * clears the pause bits each time a negotiation occurs.
> -		 * This means we can never be truely sure what was advertised,
> -		 * so disable Pause support.
> -		 */
> -		linkmode_clear_bit(ETHTOOL_LINK_MODE_Asym_Pause_BIT,
> -				   phydev->supported);
> -		linkmode_clear_bit(ETHTOOL_LINK_MODE_Pause_BIT,
> -				   phydev->supported);
> -		linkmode_clear_bit(ETHTOOL_LINK_MODE_Asym_Pause_BIT,
> -				   phydev->advertising);
> -		linkmode_clear_bit(ETHTOOL_LINK_MODE_Pause_BIT,
> -				   phydev->advertising);
>  	}
>  
>  	return m88e1318_config_init(phydev);
> 


^ permalink raw reply

* Re: Question on ptr_ring linux header
From: fei phung @ 2019-02-01  8:12 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: feiphung, netdev
In-Reply-To: <20190131093342-mutt-send-email-mst@kernel.org>

> I am not sure what does assignment of pointers mean in this context.
> ptr_ring is designed for a single producer and a single consumer.  For
> why it works see explanation about data dependencies in
> Documentation/memory-barriers.txt.  You will have to be more specific
> about the data race that you see if you expect more specific answers.

Hi,

ptr_ring_produce_any(sc->recv[chnl]->msgs,
&item_recv_push[item_recv_push_index])  needs to have
non-NULL pointer assigned for item_recv_push[item_recv_push_index] , right ?

Note:
ptr_ring_produce_any() occurs in interrupt handler, while
ptr_ring_consume_any() occurs in thread.

https://gist.github.com/promach/7716ee8addcaa33fda140d74d1ad94d6/cdc6599b8313e265bdfb073a65a124e1ba3303a2#file-riffa_driver_ptr_ring-c-L306-L320

        // TX (PC receive) scatter gather buffer is read.
        if (vect & (1<<((5*i)+1))) {
            recv = 1;

            item_recv_push[item_recv_push_index].val1 = EVENT_SG_BUF_READ;
            item_recv_push[item_recv_push_index].val2 = 0;

            // Keep track so the thread can handle this.
            if (ptr_ring_produce_any(sc->recv[chnl]->msgs,
&item_recv_push[item_recv_push_index])) {
                printk(KERN_ERR "riffa: fpga:%d chnl:%d, recv sg buf
read msg queue full\n", sc->id, chnl);
            }
            DEBUG_MSG(KERN_INFO "riffa: fpga:%d chnl:%d, recv sg buf
read\n", sc->id, chnl);

            item_recv_push_index++;
        }

The kernel log points me to ptr_ring_consume_any(). So, this is
definitely data race issue
with my own ptr_ring interfacing code.

besides, I am also getting a reference to zero-length ring for the
kernel dmesg log.
I am not sure how this is related to the data race though.

The kernel log points to
https://github.com/torvalds/linux/blame/master/include/linux/ptr_ring.h#L175
(click open git blame on this line)

Regards,
Phung

^ permalink raw reply

* Re: [PATCH RFC v2 0/2] vsock/virtio: fix issues on device hot-unplug
From: Stefan Hajnoczi @ 2019-02-01  8:19 UTC (permalink / raw)
  To: Stefano Garzarella; +Cc: davem, netdev
In-Reply-To: <CAGxU2F6-Pb0BjnRBrBatLuC-Xg7f_aJuidW-jiQR59ZTQ-j2=A@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 286 bytes --]

On Tue, Jan 29, 2019 at 04:33:58PM +0100, Stefano Garzarella wrote:
> Kindly ping :)

Hi Stefano,
It probably didn't get picked up due to the "RFC" (Request for
Comments).

I suggest rebasing, retesting, and resending to be sure it will be
noticed and merged without conflicts.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply

* [PATCH net] ethtool: remove unnecessary check in ethtool_get_regs()
From: Dan Carpenter @ 2019-02-01  8:24 UTC (permalink / raw)
  To: David S. Miller, Yunsheng Lin
  Cc: Florian Fainelli, Wenwen Wang, Boris Pismenny, Ilya Lesokhin,
	Edward Cree, Michal Kubecek, Yury Norov, Jakub Kicinski, netdev,
	kernel-janitors

We recently changed this function in commit f9fc54d313fa ("ethtool:
check the return value of get_regs_len") such that if "reglen" is zero
we return directly.  That means we can remove this condition as well.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
---
 net/core/ethtool.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 158264f7cfaf..3fe6e9da3579 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1348,12 +1348,9 @@ static int ethtool_get_regs(struct net_device *dev, char __user *useraddr)
 	if (regs.len > reglen)
 		regs.len = reglen;
 
-	regbuf = NULL;
-	if (reglen) {
-		regbuf = vzalloc(reglen);
-		if (!regbuf)
-			return -ENOMEM;
-	}
+	regbuf = vzalloc(reglen);
+	if (!regbuf)
+		return -ENOMEM;
 
 	ops->get_regs(dev, &regs, regbuf);
 
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next 0/2] tools/bpf: changes of libbpf debug interfaces
From: Yonghong Song @ 2019-02-01  8:24 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Magnus Karlsson, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Yonghong Song

These are concrete patches responding to my comments for
Magnus's patch. Specifically, Patch #1 used global functions
to facilitate pr_* macros in the header files so they
are available in different C files.
Patch #2 simplified libbpf_set_print() takes only one print
function and the print function has a "level" argument.

Yonghong Song (2):
  tools/bpf: move libbpf pr_* debug print functions to headers
  tools/bpf: simplify libbpf API function libbpf_set_print()

 tools/lib/bpf/btf.c                           | 97 +++++++++----------
 tools/lib/bpf/btf.h                           |  7 +-
 tools/lib/bpf/libbpf.c                        | 45 +++++----
 tools/lib/bpf/libbpf.h                        | 20 ++--
 tools/lib/bpf/test_libbpf.cpp                 |  4 +-
 tools/lib/bpf/util.h                          | 32 ++++++
 tools/perf/util/bpf-loader.c                  | 32 +++---
 tools/testing/selftests/bpf/test_btf.c        |  7 +-
 .../testing/selftests/bpf/test_libbpf_open.c  | 36 ++++---
 tools/testing/selftests/bpf/test_progs.c      | 20 +++-
 10 files changed, 166 insertions(+), 134 deletions(-)
 create mode 100644 tools/lib/bpf/util.h

-- 
2.17.1


^ permalink raw reply

* [PATCH bpf-next 2/2] tools/bpf: simplify libbpf API function libbpf_set_print()
From: Yonghong Song @ 2019-02-01  8:24 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Magnus Karlsson, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Yonghong Song
In-Reply-To: <20190201082442.2770052-1-yhs@fb.com>

Currently, the libbpf API function libbpf_set_print()
takes three function pointer parameters for warning, info
and debug printout respectively.

This patch changes the API to have just one function pointer
parameter and the function pointer has one additional
parameter "debugging level". So if in the future, if
the debug level is increased, the function signature
won't change.

The libbpf internal function libbpf_dprint_level_available()
previous is able to return whether a particular debug level
(warning, info or debug) is available or not.
With this change, there is only one function for debug
printout, so no easy way to test whether a particular
level is supported or not. This function is only used
for btf.c to allocate for log buffers. Such a compromise
should be acceptable. The function is changed to
libbpf_dprint_available().

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/lib/bpf/btf.c                           |  2 +-
 tools/lib/bpf/libbpf.c                        | 37 +++++--------------
 tools/lib/bpf/libbpf.h                        | 14 ++-----
 tools/lib/bpf/test_libbpf.cpp                 |  2 +-
 tools/lib/bpf/util.h                          |  2 +-
 tools/perf/util/bpf-loader.c                  | 32 ++++++----------
 tools/testing/selftests/bpf/test_btf.c        |  7 ++--
 .../testing/selftests/bpf/test_libbpf_open.c  | 36 +++++++++---------
 tools/testing/selftests/bpf/test_progs.c      | 20 +++++++++-
 9 files changed, 67 insertions(+), 85 deletions(-)

diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index 159104ec31dc..4bbf7f363c50 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -377,7 +377,7 @@ struct btf *btf__new(__u8 *data, __u32 size)
 
 	btf->fd = -1;
 
-	if (libbpf_dprint_level_available(LIBBPF_DEBUG)) {
+	if (libbpf_dprint_available()) {
 		log_buf = malloc(BPF_LOG_BUF_SIZE);
 		if (!log_buf) {
 			err = -ENOMEM;
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 1c7e25ebbe85..38a2e71ea369 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -54,8 +54,8 @@
 
 #define __printf(a, b)	__attribute__((format(printf, a, b)))
 
-__printf(1, 2)
-static int __base_pr(const char *format, ...)
+__printf(2, 3)
+static int __base_pr(enum libbpf_print_level level, const char *format, ...)
 {
 	va_list args;
 	int err;
@@ -66,17 +66,11 @@ static int __base_pr(const char *format, ...)
 	return err;
 }
 
-static __printf(1, 2) libbpf_print_fn_t __pr_warning = __base_pr;
-static __printf(1, 2) libbpf_print_fn_t __pr_info = __base_pr;
-static __printf(1, 2) libbpf_print_fn_t __pr_debug;
+static __printf(2, 3) libbpf_print_fn_t __libbpf_pr = __base_pr;
 
-void libbpf_set_print(libbpf_print_fn_t warn,
-		      libbpf_print_fn_t info,
-		      libbpf_print_fn_t debug)
+void libbpf_set_print(libbpf_print_fn_t fn)
 {
-	__pr_warning = warn;
-	__pr_info = info;
-	__pr_debug = debug;
+	__libbpf_pr = fn;
 }
 
 __printf(2, 3)
@@ -85,27 +79,14 @@ void libbpf_dprint(enum libbpf_print_level level, const char *format, ...)
 	va_list args;
 
 	va_start(args, format);
-	if (level == LIBBPF_WARN) {
-		if (__pr_warning)
-			__pr_warning(format, args);
-	} else if (level == LIBBPF_INFO) {
-		if (__pr_info)
-			__pr_info(format, args);
-	} else {
-		if (__pr_debug)
-			__pr_debug(format, args);
-	}
+	if (__libbpf_pr)
+		__libbpf_pr(level, format, args);
 	va_end(args);
 }
 
-bool libbpf_dprint_level_available(enum libbpf_print_level level)
+bool libbpf_dprint_available(void)
 {
-	if (level == LIBBPF_WARN)
-		return !!__pr_warning;
-	else if (level == LIBBPF_INFO)
-		return !!__pr_info;
-	else
-		return !!__pr_debug;
+	return !!__libbpf_pr;
 }
 
 #define STRERR_BUFSIZE  128
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 4e21971101c9..f8f27f1bb6bf 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -53,17 +53,11 @@ enum libbpf_print_level {
         LIBBPF_DEBUG,
 };
 
-/*
- * __printf is defined in include/linux/compiler-gcc.h. However,
- * it would be better if libbpf.h didn't depend on Linux header files.
- * So instead of __printf, here we use gcc attribute directly.
- */
-typedef int (*libbpf_print_fn_t)(const char *, ...)
-	__attribute__((format(printf, 1, 2)));
+typedef int (*libbpf_print_fn_t)(enum libbpf_print_level level,
+				 const char *, ...)
+	__attribute__((format(printf, 2, 3)));
 
-LIBBPF_API void libbpf_set_print(libbpf_print_fn_t warn,
-				 libbpf_print_fn_t info,
-				 libbpf_print_fn_t debug);
+LIBBPF_API void libbpf_set_print(libbpf_print_fn_t fn);
 
 /* Hide internal to user */
 struct bpf_object;
diff --git a/tools/lib/bpf/test_libbpf.cpp b/tools/lib/bpf/test_libbpf.cpp
index be67f5ea2c19..fc134873bb6d 100644
--- a/tools/lib/bpf/test_libbpf.cpp
+++ b/tools/lib/bpf/test_libbpf.cpp
@@ -8,7 +8,7 @@
 int main(int argc, char *argv[])
 {
     /* libbpf.h */
-    libbpf_set_print(NULL, NULL, NULL);
+    libbpf_set_print(NULL);
 
     /* bpf.h */
     bpf_prog_get_fd_by_id(0);
diff --git a/tools/lib/bpf/util.h b/tools/lib/bpf/util.h
index b07657600f7c..6b93ebf61ea6 100644
--- a/tools/lib/bpf/util.h
+++ b/tools/lib/bpf/util.h
@@ -14,7 +14,7 @@ extern void libbpf_dprint(enum libbpf_print_level level,
 			  const char *format, ...)
 	__attribute__((format(printf, 2, 3)));
 
-extern bool libbpf_dprint_level_available(enum libbpf_print_level level);
+extern bool libbpf_dprint_available(void);
 
 #define __pr(level, fmt, ...)	\
 do {				\
diff --git a/tools/perf/util/bpf-loader.c b/tools/perf/util/bpf-loader.c
index 2f3eb6d293ee..c492f3a2acdc 100644
--- a/tools/perf/util/bpf-loader.c
+++ b/tools/perf/util/bpf-loader.c
@@ -24,21 +24,17 @@
 #include "llvm-utils.h"
 #include "c++/clang-c.h"
 
-#define DEFINE_PRINT_FN(name, level) \
-static int libbpf_##name(const char *fmt, ...)	\
-{						\
-	va_list args;				\
-	int ret;				\
-						\
-	va_start(args, fmt);			\
-	ret = veprintf(level, verbose, pr_fmt(fmt), args);\
-	va_end(args);				\
-	return ret;				\
-}
+static int libbpf_perf_dprint(enum libbpf_print_level level __attribute__((unused)),
+			      const char *fmt, ...)
+{
+	va_list args;
+	int ret;
 
-DEFINE_PRINT_FN(warning, 1)
-DEFINE_PRINT_FN(info, 1)
-DEFINE_PRINT_FN(debug, 1)
+	va_start(args, fmt);
+	ret = veprintf(1, verbose, pr_fmt(fmt), args);
+	va_end(args);
+	return ret;
+}
 
 struct bpf_prog_priv {
 	bool is_tp;
@@ -59,9 +55,7 @@ bpf__prepare_load_buffer(void *obj_buf, size_t obj_buf_sz, const char *name)
 	struct bpf_object *obj;
 
 	if (!libbpf_initialized) {
-		libbpf_set_print(libbpf_warning,
-				 libbpf_info,
-				 libbpf_debug);
+		libbpf_set_print(libbpf_perf_dprint);
 		libbpf_initialized = true;
 	}
 
@@ -79,9 +73,7 @@ struct bpf_object *bpf__prepare_load(const char *filename, bool source)
 	struct bpf_object *obj;
 
 	if (!libbpf_initialized) {
-		libbpf_set_print(libbpf_warning,
-				 libbpf_info,
-				 libbpf_debug);
+		libbpf_set_print(libbpf_perf_dprint);
 		libbpf_initialized = true;
 	}
 
diff --git a/tools/testing/selftests/bpf/test_btf.c b/tools/testing/selftests/bpf/test_btf.c
index 179f1d8ec5bf..aebaeff5a5a0 100644
--- a/tools/testing/selftests/bpf/test_btf.c
+++ b/tools/testing/selftests/bpf/test_btf.c
@@ -54,8 +54,9 @@ static int count_result(int err)
 
 #define __printf(a, b)	__attribute__((format(printf, a, b)))
 
-__printf(1, 2)
-static int __base_pr(const char *format, ...)
+__printf(2, 3)
+static int __base_pr(enum libbpf_print_level level __attribute__((unused)),
+		     const char *format, ...)
 {
 	va_list args;
 	int err;
@@ -5650,7 +5651,7 @@ int main(int argc, char **argv)
 		return err;
 
 	if (args.always_log)
-		libbpf_set_print(__base_pr, __base_pr, __base_pr);
+		libbpf_set_print(__base_pr);
 
 	if (args.raw_test)
 		err |= test_raw();
diff --git a/tools/testing/selftests/bpf/test_libbpf_open.c b/tools/testing/selftests/bpf/test_libbpf_open.c
index 8fcd1c076add..b9ff3bf76544 100644
--- a/tools/testing/selftests/bpf/test_libbpf_open.c
+++ b/tools/testing/selftests/bpf/test_libbpf_open.c
@@ -34,23 +34,22 @@ static void usage(char *argv[])
 	printf("\n");
 }
 
-#define DEFINE_PRINT_FN(name, enabled) \
-static int libbpf_##name(const char *fmt, ...)  	\
-{							\
-        va_list args;					\
-        int ret;					\
-							\
-        va_start(args, fmt);				\
-	if (enabled) {					\
-		fprintf(stderr, "[" #name "] ");	\
-		ret = vfprintf(stderr, fmt, args);	\
-	}						\
-        va_end(args);					\
-        return ret;					\
+static bool debug = 0;
+static int libbpf_debug_print(enum libbpf_print_level level,
+			      const char *fmt, ...)
+{
+	va_list args;
+	int ret;
+
+	if (level == LIBBPF_DEBUG && !debug)
+		return 0;
+
+	va_start(args, fmt);
+	fprintf(stderr, "[%d] ", level);
+	ret = vfprintf(stderr, fmt, args);
+	va_end(args);
+	return ret;
 }
-DEFINE_PRINT_FN(warning, 1)
-DEFINE_PRINT_FN(info, 1)
-DEFINE_PRINT_FN(debug, 1)
 
 #define EXIT_FAIL_LIBBPF EXIT_FAILURE
 #define EXIT_FAIL_OPTION 2
@@ -120,15 +119,14 @@ int main(int argc, char **argv)
 	int longindex = 0;
 	int opt;
 
-	libbpf_set_print(libbpf_warning, libbpf_info, NULL);
+	libbpf_set_print(libbpf_debug_print);
 
 	/* Parse commands line args */
 	while ((opt = getopt_long(argc, argv, "hDq",
 				  long_options, &longindex)) != -1) {
 		switch (opt) {
 		case 'D':
-			libbpf_set_print(libbpf_warning, libbpf_info,
-					 libbpf_debug);
+			debug = 1;
 			break;
 		case 'q': /* Use in scripting mode */
 			verbose = 0;
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index d8940b8b2f8d..294beff3374e 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -10,6 +10,7 @@
 #include <string.h>
 #include <assert.h>
 #include <stdlib.h>
+#include <stdarg.h>
 #include <time.h>
 
 #include <linux/types.h>
@@ -1783,6 +1784,21 @@ static void test_task_fd_query_tp(void)
 				   "sys_enter_read");
 }
 
+static int libbpf_debug_print(enum libbpf_print_level level,
+			      const char *format, ...)
+{
+	va_list args;
+	int ret;
+
+	if (level == LIBBPF_DEBUG)
+		return 0;
+
+	va_start(args, format);
+	ret = vfprintf(stderr, format, args);
+	va_end(args);
+	return ret;
+}
+
 static void test_reference_tracking()
 {
 	const char *file = "./test_sk_lookup_kern.o";
@@ -1809,9 +1825,9 @@ static void test_reference_tracking()
 
 		/* Expect verifier failure if test name has 'fail' */
 		if (strstr(title, "fail") != NULL) {
-			libbpf_set_print(NULL, NULL, NULL);
+			libbpf_set_print(NULL);
 			err = !bpf_program__load(prog, "GPL", 0);
-			libbpf_set_print(printf, printf, NULL);
+			libbpf_set_print(libbpf_debug_print);
 		} else {
 			err = bpf_program__load(prog, "GPL", 0);
 		}
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next 1/2] tools/bpf: move libbpf pr_* debug print functions to headers
From: Yonghong Song @ 2019-02-01  8:24 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Magnus Karlsson, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Yonghong Song
In-Reply-To: <20190201082442.2770052-1-yhs@fb.com>

A global function libbpf_dprint, which is invisible
outside the shared library, is defined to print based
on levels. The pr_warning, pr_info and pr_debug
macros are moved into the newly created header
common.h. So any .c file including common.h can
use these macros directly.

Currently btf__new and btf_ext__new API has an argument getting
__pr_debug function pointer into btf.c so the debugging information
can be printed there. This patch removed this parameter
from btf__new and btf_ext__new and directly using pr_debug in btf.c.

Another global function libbpf_dprint_level_available, also
invisible outside the shared library, can test
whether a particular level debug printing is
available or not. It is used in btf.c to
test whether DEBUG level debug printing is availabl or not,
based on which the log buffer will be allocated when loading
btf to the kernel.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/lib/bpf/btf.c           | 97 +++++++++++++++++------------------
 tools/lib/bpf/btf.h           |  7 +--
 tools/lib/bpf/libbpf.c        | 46 ++++++++++++-----
 tools/lib/bpf/libbpf.h        |  6 +++
 tools/lib/bpf/test_libbpf.cpp |  2 +-
 tools/lib/bpf/util.h          | 32 ++++++++++++
 6 files changed, 120 insertions(+), 70 deletions(-)
 create mode 100644 tools/lib/bpf/util.h

diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index d682d3b8f7b9..159104ec31dc 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -9,8 +9,9 @@
 #include <linux/btf.h>
 #include "btf.h"
 #include "bpf.h"
+#include "libbpf.h"
+#include "util.h"
 
-#define elog(fmt, ...) { if (err_log) err_log(fmt, ##__VA_ARGS__); }
 #define max(a, b) ((a) > (b) ? (a) : (b))
 #define min(a, b) ((a) < (b) ? (a) : (b))
 
@@ -107,54 +108,54 @@ static int btf_add_type(struct btf *btf, struct btf_type *t)
 	return 0;
 }
 
-static int btf_parse_hdr(struct btf *btf, btf_print_fn_t err_log)
+static int btf_parse_hdr(struct btf *btf)
 {
 	const struct btf_header *hdr = btf->hdr;
 	__u32 meta_left;
 
 	if (btf->data_size < sizeof(struct btf_header)) {
-		elog("BTF header not found\n");
+		pr_debug("BTF header not found\n");
 		return -EINVAL;
 	}
 
 	if (hdr->magic != BTF_MAGIC) {
-		elog("Invalid BTF magic:%x\n", hdr->magic);
+		pr_debug("Invalid BTF magic:%x\n", hdr->magic);
 		return -EINVAL;
 	}
 
 	if (hdr->version != BTF_VERSION) {
-		elog("Unsupported BTF version:%u\n", hdr->version);
+		pr_debug("Unsupported BTF version:%u\n", hdr->version);
 		return -ENOTSUP;
 	}
 
 	if (hdr->flags) {
-		elog("Unsupported BTF flags:%x\n", hdr->flags);
+		pr_debug("Unsupported BTF flags:%x\n", hdr->flags);
 		return -ENOTSUP;
 	}
 
 	meta_left = btf->data_size - sizeof(*hdr);
 	if (!meta_left) {
-		elog("BTF has no data\n");
+		pr_debug("BTF has no data\n");
 		return -EINVAL;
 	}
 
 	if (meta_left < hdr->type_off) {
-		elog("Invalid BTF type section offset:%u\n", hdr->type_off);
+		pr_debug("Invalid BTF type section offset:%u\n", hdr->type_off);
 		return -EINVAL;
 	}
 
 	if (meta_left < hdr->str_off) {
-		elog("Invalid BTF string section offset:%u\n", hdr->str_off);
+		pr_debug("Invalid BTF string section offset:%u\n", hdr->str_off);
 		return -EINVAL;
 	}
 
 	if (hdr->type_off >= hdr->str_off) {
-		elog("BTF type section offset >= string section offset. No type?\n");
+		pr_debug("BTF type section offset >= string section offset. No type?\n");
 		return -EINVAL;
 	}
 
 	if (hdr->type_off & 0x02) {
-		elog("BTF type section is not aligned to 4 bytes\n");
+		pr_debug("BTF type section is not aligned to 4 bytes\n");
 		return -EINVAL;
 	}
 
@@ -163,7 +164,7 @@ static int btf_parse_hdr(struct btf *btf, btf_print_fn_t err_log)
 	return 0;
 }
 
-static int btf_parse_str_sec(struct btf *btf, btf_print_fn_t err_log)
+static int btf_parse_str_sec(struct btf *btf)
 {
 	const struct btf_header *hdr = btf->hdr;
 	const char *start = btf->nohdr_data + hdr->str_off;
@@ -171,7 +172,7 @@ static int btf_parse_str_sec(struct btf *btf, btf_print_fn_t err_log)
 
 	if (!hdr->str_len || hdr->str_len - 1 > BTF_MAX_NAME_OFFSET ||
 	    start[0] || end[-1]) {
-		elog("Invalid BTF string section\n");
+		pr_debug("Invalid BTF string section\n");
 		return -EINVAL;
 	}
 
@@ -180,7 +181,7 @@ static int btf_parse_str_sec(struct btf *btf, btf_print_fn_t err_log)
 	return 0;
 }
 
-static int btf_parse_type_sec(struct btf *btf, btf_print_fn_t err_log)
+static int btf_parse_type_sec(struct btf *btf)
 {
 	struct btf_header *hdr = btf->hdr;
 	void *nohdr_data = btf->nohdr_data;
@@ -219,7 +220,7 @@ static int btf_parse_type_sec(struct btf *btf, btf_print_fn_t err_log)
 		case BTF_KIND_RESTRICT:
 			break;
 		default:
-			elog("Unsupported BTF_KIND:%u\n",
+			pr_debug("Unsupported BTF_KIND:%u\n",
 			     BTF_INFO_KIND(t->info));
 			return -EINVAL;
 		}
@@ -363,7 +364,7 @@ void btf__free(struct btf *btf)
 	free(btf);
 }
 
-struct btf *btf__new(__u8 *data, __u32 size, btf_print_fn_t err_log)
+struct btf *btf__new(__u8 *data, __u32 size)
 {
 	__u32 log_buf_size = 0;
 	char *log_buf = NULL;
@@ -376,7 +377,7 @@ struct btf *btf__new(__u8 *data, __u32 size, btf_print_fn_t err_log)
 
 	btf->fd = -1;
 
-	if (err_log) {
+	if (libbpf_dprint_level_available(LIBBPF_DEBUG)) {
 		log_buf = malloc(BPF_LOG_BUF_SIZE);
 		if (!log_buf) {
 			err = -ENOMEM;
@@ -400,21 +401,21 @@ struct btf *btf__new(__u8 *data, __u32 size, btf_print_fn_t err_log)
 
 	if (btf->fd == -1) {
 		err = -errno;
-		elog("Error loading BTF: %s(%d)\n", strerror(errno), errno);
+		pr_debug("Error loading BTF: %s(%d)\n", strerror(errno), errno);
 		if (log_buf && *log_buf)
-			elog("%s\n", log_buf);
+			pr_debug("%s\n", log_buf);
 		goto done;
 	}
 
-	err = btf_parse_hdr(btf, err_log);
+	err = btf_parse_hdr(btf);
 	if (err)
 		goto done;
 
-	err = btf_parse_str_sec(btf, err_log);
+	err = btf_parse_str_sec(btf);
 	if (err)
 		goto done;
 
-	err = btf_parse_type_sec(btf, err_log);
+	err = btf_parse_type_sec(btf);
 
 done:
 	free(log_buf);
@@ -491,7 +492,7 @@ int btf__get_from_id(__u32 id, struct btf **btf)
 		goto exit_free;
 	}
 
-	*btf = btf__new((__u8 *)(long)btf_info.btf, btf_info.btf_size, NULL);
+	*btf = btf__new((__u8 *)(long)btf_info.btf, btf_info.btf_size);
 	if (IS_ERR(*btf)) {
 		err = PTR_ERR(*btf);
 		*btf = NULL;
@@ -514,8 +515,7 @@ struct btf_ext_sec_copy_param {
 
 static int btf_ext_copy_info(struct btf_ext *btf_ext,
 			     __u8 *data, __u32 data_size,
-			     struct btf_ext_sec_copy_param *ext_sec,
-			     btf_print_fn_t err_log)
+			     struct btf_ext_sec_copy_param *ext_sec)
 {
 	const struct btf_ext_header *hdr = (struct btf_ext_header *)data;
 	const struct btf_ext_info_sec *sinfo;
@@ -529,14 +529,14 @@ static int btf_ext_copy_info(struct btf_ext *btf_ext,
 	data_size -= hdr->hdr_len;
 
 	if (ext_sec->off & 0x03) {
-		elog(".BTF.ext %s section is not aligned to 4 bytes\n",
+		pr_debug(".BTF.ext %s section is not aligned to 4 bytes\n",
 		     ext_sec->desc);
 		return -EINVAL;
 	}
 
 	if (data_size < ext_sec->off ||
 	    ext_sec->len > data_size - ext_sec->off) {
-		elog("%s section (off:%u len:%u) is beyond the end of the ELF section .BTF.ext\n",
+		pr_debug("%s section (off:%u len:%u) is beyond the end of the ELF section .BTF.ext\n",
 		     ext_sec->desc, ext_sec->off, ext_sec->len);
 		return -EINVAL;
 	}
@@ -546,7 +546,7 @@ static int btf_ext_copy_info(struct btf_ext *btf_ext,
 
 	/* At least a record size */
 	if (info_left < sizeof(__u32)) {
-		elog(".BTF.ext %s record size not found\n", ext_sec->desc);
+		pr_debug(".BTF.ext %s record size not found\n", ext_sec->desc);
 		return -EINVAL;
 	}
 
@@ -554,7 +554,7 @@ static int btf_ext_copy_info(struct btf_ext *btf_ext,
 	record_size = *(__u32 *)info;
 	if (record_size < ext_sec->min_rec_size ||
 	    record_size & 0x03) {
-		elog("%s section in .BTF.ext has invalid record size %u\n",
+		pr_debug("%s section in .BTF.ext has invalid record size %u\n",
 		     ext_sec->desc, record_size);
 		return -EINVAL;
 	}
@@ -564,7 +564,7 @@ static int btf_ext_copy_info(struct btf_ext *btf_ext,
 
 	/* If no records, return failure now so .BTF.ext won't be used. */
 	if (!info_left) {
-		elog("%s section in .BTF.ext has no records", ext_sec->desc);
+		pr_debug("%s section in .BTF.ext has no records", ext_sec->desc);
 		return -EINVAL;
 	}
 
@@ -574,14 +574,14 @@ static int btf_ext_copy_info(struct btf_ext *btf_ext,
 		__u32 num_records;
 
 		if (info_left < sec_hdrlen) {
-			elog("%s section header is not found in .BTF.ext\n",
+			pr_debug("%s section header is not found in .BTF.ext\n",
 			     ext_sec->desc);
 			return -EINVAL;
 		}
 
 		num_records = sinfo->num_info;
 		if (num_records == 0) {
-			elog("%s section has incorrect num_records in .BTF.ext\n",
+			pr_debug("%s section has incorrect num_records in .BTF.ext\n",
 			     ext_sec->desc);
 			return -EINVAL;
 		}
@@ -589,7 +589,7 @@ static int btf_ext_copy_info(struct btf_ext *btf_ext,
 		total_record_size = sec_hdrlen +
 				    (__u64)num_records * record_size;
 		if (info_left < total_record_size) {
-			elog("%s section has incorrect num_records in .BTF.ext\n",
+			pr_debug("%s section has incorrect num_records in .BTF.ext\n",
 			     ext_sec->desc);
 			return -EINVAL;
 		}
@@ -610,8 +610,7 @@ static int btf_ext_copy_info(struct btf_ext *btf_ext,
 }
 
 static int btf_ext_copy_func_info(struct btf_ext *btf_ext,
-				  __u8 *data, __u32 data_size,
-				  btf_print_fn_t err_log)
+				  __u8 *data, __u32 data_size)
 {
 	const struct btf_ext_header *hdr = (struct btf_ext_header *)data;
 	struct btf_ext_sec_copy_param param = {
@@ -622,12 +621,11 @@ static int btf_ext_copy_func_info(struct btf_ext *btf_ext,
 		.desc = "func_info"
 	};
 
-	return btf_ext_copy_info(btf_ext, data, data_size, &param, err_log);
+	return btf_ext_copy_info(btf_ext, data, data_size, &param);
 }
 
 static int btf_ext_copy_line_info(struct btf_ext *btf_ext,
-				  __u8 *data, __u32 data_size,
-				  btf_print_fn_t err_log)
+				  __u8 *data, __u32 data_size)
 {
 	const struct btf_ext_header *hdr = (struct btf_ext_header *)data;
 	struct btf_ext_sec_copy_param param = {
@@ -638,37 +636,36 @@ static int btf_ext_copy_line_info(struct btf_ext *btf_ext,
 		.desc = "line_info",
 	};
 
-	return btf_ext_copy_info(btf_ext, data, data_size, &param, err_log);
+	return btf_ext_copy_info(btf_ext, data, data_size, &param);
 }
 
-static int btf_ext_parse_hdr(__u8 *data, __u32 data_size,
-			     btf_print_fn_t err_log)
+static int btf_ext_parse_hdr(__u8 *data, __u32 data_size)
 {
 	const struct btf_ext_header *hdr = (struct btf_ext_header *)data;
 
 	if (data_size < offsetof(struct btf_ext_header, func_info_off) ||
 	    data_size < hdr->hdr_len) {
-		elog("BTF.ext header not found");
+		pr_debug("BTF.ext header not found");
 		return -EINVAL;
 	}
 
 	if (hdr->magic != BTF_MAGIC) {
-		elog("Invalid BTF.ext magic:%x\n", hdr->magic);
+		pr_debug("Invalid BTF.ext magic:%x\n", hdr->magic);
 		return -EINVAL;
 	}
 
 	if (hdr->version != BTF_VERSION) {
-		elog("Unsupported BTF.ext version:%u\n", hdr->version);
+		pr_debug("Unsupported BTF.ext version:%u\n", hdr->version);
 		return -ENOTSUP;
 	}
 
 	if (hdr->flags) {
-		elog("Unsupported BTF.ext flags:%x\n", hdr->flags);
+		pr_debug("Unsupported BTF.ext flags:%x\n", hdr->flags);
 		return -ENOTSUP;
 	}
 
 	if (data_size == hdr->hdr_len) {
-		elog("BTF.ext has no data\n");
+		pr_debug("BTF.ext has no data\n");
 		return -EINVAL;
 	}
 
@@ -685,12 +682,12 @@ void btf_ext__free(struct btf_ext *btf_ext)
 	free(btf_ext);
 }
 
-struct btf_ext *btf_ext__new(__u8 *data, __u32 size, btf_print_fn_t err_log)
+struct btf_ext *btf_ext__new(__u8 *data, __u32 size)
 {
 	struct btf_ext *btf_ext;
 	int err;
 
-	err = btf_ext_parse_hdr(data, size, err_log);
+	err = btf_ext_parse_hdr(data, size);
 	if (err)
 		return ERR_PTR(err);
 
@@ -698,13 +695,13 @@ struct btf_ext *btf_ext__new(__u8 *data, __u32 size, btf_print_fn_t err_log)
 	if (!btf_ext)
 		return ERR_PTR(-ENOMEM);
 
-	err = btf_ext_copy_func_info(btf_ext, data, size, err_log);
+	err = btf_ext_copy_func_info(btf_ext, data, size);
 	if (err) {
 		btf_ext__free(btf_ext);
 		return ERR_PTR(err);
 	}
 
-	err = btf_ext_copy_line_info(btf_ext, data, size, err_log);
+	err = btf_ext_copy_line_info(btf_ext, data, size);
 	if (err) {
 		btf_ext__free(btf_ext);
 		return ERR_PTR(err);
diff --git a/tools/lib/bpf/btf.h b/tools/lib/bpf/btf.h
index b0610dcdae6b..b1e8e54cc21d 100644
--- a/tools/lib/bpf/btf.h
+++ b/tools/lib/bpf/btf.h
@@ -55,11 +55,8 @@ struct btf_ext_header {
 	__u32	line_info_len;
 };
 
-typedef int (*btf_print_fn_t)(const char *, ...)
-	__attribute__((format(printf, 1, 2)));
-
 LIBBPF_API void btf__free(struct btf *btf);
-LIBBPF_API struct btf *btf__new(__u8 *data, __u32 size, btf_print_fn_t err_log);
+LIBBPF_API struct btf *btf__new(__u8 *data, __u32 size);
 LIBBPF_API __s32 btf__find_by_name(const struct btf *btf,
 				   const char *type_name);
 LIBBPF_API const struct btf_type *btf__type_by_id(const struct btf *btf,
@@ -70,7 +67,7 @@ LIBBPF_API int btf__fd(const struct btf *btf);
 LIBBPF_API const char *btf__name_by_offset(const struct btf *btf, __u32 offset);
 LIBBPF_API int btf__get_from_id(__u32 id, struct btf **btf);
 
-struct btf_ext *btf_ext__new(__u8 *data, __u32 size, btf_print_fn_t err_log);
+struct btf_ext *btf_ext__new(__u8 *data, __u32 size);
 void btf_ext__free(struct btf_ext *btf_ext);
 int btf_ext__reloc_func_info(const struct btf *btf,
 			     const struct btf_ext *btf_ext,
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2ccde17957e6..1c7e25ebbe85 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -42,6 +42,7 @@
 #include "bpf.h"
 #include "btf.h"
 #include "str_error.h"
+#include "util.h"
 
 #ifndef EM_BPF
 #define EM_BPF 247
@@ -69,16 +70,6 @@ static __printf(1, 2) libbpf_print_fn_t __pr_warning = __base_pr;
 static __printf(1, 2) libbpf_print_fn_t __pr_info = __base_pr;
 static __printf(1, 2) libbpf_print_fn_t __pr_debug;
 
-#define __pr(func, fmt, ...)	\
-do {				\
-	if ((func))		\
-		(func)("libbpf: " fmt, ##__VA_ARGS__); \
-} while (0)
-
-#define pr_warning(fmt, ...)	__pr(__pr_warning, fmt, ##__VA_ARGS__)
-#define pr_info(fmt, ...)	__pr(__pr_info, fmt, ##__VA_ARGS__)
-#define pr_debug(fmt, ...)	__pr(__pr_debug, fmt, ##__VA_ARGS__)
-
 void libbpf_set_print(libbpf_print_fn_t warn,
 		      libbpf_print_fn_t info,
 		      libbpf_print_fn_t debug)
@@ -88,6 +79,35 @@ void libbpf_set_print(libbpf_print_fn_t warn,
 	__pr_debug = debug;
 }
 
+__printf(2, 3)
+void libbpf_dprint(enum libbpf_print_level level, const char *format, ...)
+{
+	va_list args;
+
+	va_start(args, format);
+	if (level == LIBBPF_WARN) {
+		if (__pr_warning)
+			__pr_warning(format, args);
+	} else if (level == LIBBPF_INFO) {
+		if (__pr_info)
+			__pr_info(format, args);
+	} else {
+		if (__pr_debug)
+			__pr_debug(format, args);
+	}
+	va_end(args);
+}
+
+bool libbpf_dprint_level_available(enum libbpf_print_level level)
+{
+	if (level == LIBBPF_WARN)
+		return !!__pr_warning;
+	else if (level == LIBBPF_INFO)
+		return !!__pr_info;
+	else
+		return !!__pr_debug;
+}
+
 #define STRERR_BUFSIZE  128
 
 #define CHECK_ERR(action, err, out) do {	\
@@ -839,8 +859,7 @@ static int bpf_object__elf_collect(struct bpf_object *obj, int flags)
 		else if (strcmp(name, "maps") == 0)
 			obj->efile.maps_shndx = idx;
 		else if (strcmp(name, BTF_ELF_SEC) == 0) {
-			obj->btf = btf__new(data->d_buf, data->d_size,
-					    __pr_debug);
+			obj->btf = btf__new(data->d_buf, data->d_size);
 			if (IS_ERR(obj->btf)) {
 				pr_warning("Error loading ELF section %s: %ld. Ignored and continue.\n",
 					   BTF_ELF_SEC, PTR_ERR(obj->btf));
@@ -915,8 +934,7 @@ static int bpf_object__elf_collect(struct bpf_object *obj, int flags)
 				 BTF_EXT_ELF_SEC, BTF_ELF_SEC);
 		} else {
 			obj->btf_ext = btf_ext__new(btf_ext_data->d_buf,
-						    btf_ext_data->d_size,
-						    __pr_debug);
+						    btf_ext_data->d_size);
 			if (IS_ERR(obj->btf_ext)) {
 				pr_warning("Error loading ELF section %s: %ld. Ignored and continue.\n",
 					   BTF_EXT_ELF_SEC,
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 62ae6cb93da1..4e21971101c9 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -47,6 +47,12 @@ enum libbpf_errno {
 
 LIBBPF_API int libbpf_strerror(int err, char *buf, size_t size);
 
+enum libbpf_print_level {
+        LIBBPF_WARN,
+        LIBBPF_INFO,
+        LIBBPF_DEBUG,
+};
+
 /*
  * __printf is defined in include/linux/compiler-gcc.h. However,
  * it would be better if libbpf.h didn't depend on Linux header files.
diff --git a/tools/lib/bpf/test_libbpf.cpp b/tools/lib/bpf/test_libbpf.cpp
index abf3fc25c9fa..be67f5ea2c19 100644
--- a/tools/lib/bpf/test_libbpf.cpp
+++ b/tools/lib/bpf/test_libbpf.cpp
@@ -14,5 +14,5 @@ int main(int argc, char *argv[])
     bpf_prog_get_fd_by_id(0);
 
     /* btf.h */
-    btf__new(NULL, 0, NULL);
+    btf__new(NULL, 0);
 }
diff --git a/tools/lib/bpf/util.h b/tools/lib/bpf/util.h
new file mode 100644
index 000000000000..b07657600f7c
--- /dev/null
+++ b/tools/lib/bpf/util.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
+/* Copyright (c) 2019 Facebook */
+
+#ifndef __LIBBPF_COMMON_H
+#define __LIBBPF_COMMON_H
+
+#include <stdbool.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+extern void libbpf_dprint(enum libbpf_print_level level,
+			  const char *format, ...)
+	__attribute__((format(printf, 2, 3)));
+
+extern bool libbpf_dprint_level_available(enum libbpf_print_level level);
+
+#define __pr(level, fmt, ...)	\
+do {				\
+	libbpf_dprint(level, "libbpf: " fmt, ##__VA_ARGS__);	\
+} while (0)
+
+#define pr_warning(fmt, ...)	__pr(LIBBPF_WARN, fmt, ##__VA_ARGS__)
+#define pr_info(fmt, ...)	__pr(LIBBPF_INFO, fmt, ##__VA_ARGS__)
+#define pr_debug(fmt, ...)	__pr(LIBBPF_DEBUG, fmt, ##__VA_ARGS__)
+
+#ifdef __cplusplus
+} /* extern "C" */
+#endif
+
+#endif
-- 
2.17.1


^ permalink raw reply related

* [PATCH net] skge: potential memory corruption in skge_get_regs()
From: Dan Carpenter @ 2019-02-01  8:28 UTC (permalink / raw)
  To: Mirko Lindner, Stephen Hemminger; +Cc: David S. Miller, netdev, kernel-janitors

The "p" buffer is 0x4000 bytes long.  B3_RI_WTO_R1 is 0x190.  The value
of "regs->len" is in the 1-0x4000 range.  The bug here is that
"regs->len - B3_RI_WTO_R1" can be a negative value which would lead to
memory corruption and an abrupt crash.

Fixes: c3f8be961808 ("[PATCH] skge: expand ethtool debug register dump")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
---
 drivers/net/ethernet/marvell/skge.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/marvell/skge.c b/drivers/net/ethernet/marvell/skge.c
index 04fd1f135011..654ac534b10e 100644
--- a/drivers/net/ethernet/marvell/skge.c
+++ b/drivers/net/ethernet/marvell/skge.c
@@ -152,8 +152,10 @@ static void skge_get_regs(struct net_device *dev, struct ethtool_regs *regs,
 	memset(p, 0, regs->len);
 	memcpy_fromio(p, io, B3_RAM_ADDR);

-	memcpy_fromio(p + B3_RI_WTO_R1, io + B3_RI_WTO_R1,
-		      regs->len - B3_RI_WTO_R1);
+	if (regs->len > B3_RI_WTO_R1) {
+		memcpy_fromio(p + B3_RI_WTO_R1, io + B3_RI_WTO_R1,
+			      regs->len - B3_RI_WTO_R1);
+	}
 }

 /* Wake on Lan only supported on Yukon chips with rev 1 or above */
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next] net: hns3: Check for allocation failure
From: Dan Carpenter @ 2019-02-01  8:32 UTC (permalink / raw)
  To: Yisen Zhuang, liuzhongzhu
  Cc: Salil Mehta, David S. Miller, Peng Li, Jian Shen, Huazhong Tan,
	Yunsheng Lin, Fuyun Liang, netdev, kernel-janitors

We should return -ENOMEM if the kcalloc() fails.

Fixes: d174ea75c96a ("net: hns3: add statistics for PFC frames and MAC control frame")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index ae8336c18264..afd9a0176923 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -343,6 +343,8 @@ static int hclge_mac_update_stats_complete(struct hclge_dev *hdev, u32 desc_num)
 	int ret;
 
 	desc = kcalloc(desc_num, sizeof(struct hclge_desc), GFP_KERNEL);
+	if (!desc)
+		return -ENOMEM;
 	hclge_cmd_setup_basic_desc(&desc[0], HCLGE_OPC_STATS_MAC_ALL, true);
 	ret = hclge_cmd_send(&hdev->hw, desc, desc_num);
 	if (ret) {
-- 
2.17.1


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox