Netdev List
 help / color / mirror / Atom feed
* Re: r8169 Driver - Poor Network Performance Since Kernel 4.19
From: David Chang @ 2019-01-31  2:32 UTC (permalink / raw)
  To: Heiner Kallweit; +Cc: Peter Ceiley, Realtek linux nic maintainers, netdev
In-Reply-To: <aead4da3-e1b0-ab6c-2842-634e175b33ab@gmail.com>

Hi,

We had a similr case here.
- Realtek r8169 receive performance regression in kernel 4.19
  https://bugzilla.suse.com/show_bug.cgi?id=1119649

kernel: r8169 0000:01:00.0 eth0: RTL8168h/8111h, XID 54100880
The major symptom is there are many rx_missed count.


On Jan 30, 2019 at 20:15:45 +0100, Heiner Kallweit wrote:
> Hi Peter,
> 
> recently I had somebody where pcie_aspm=off for whatever reason didn't
> do the trick, can you also check with pcie_aspm.policy=performance.

We will give it a try later.

> And please check with "ethtool -S <if>" whether the chip statistics
> show a significant number of errors.
> 
> If this doesn't help you may have to bisect to find the offending commit.

We had tried fallback driver to a few previous commits as following,
but with no luck.

9675931e6b65 r8169: re-enable MSI-X on RTL8168g (v4.19)
098b01ad9837 r8169: don't include asm headers directly (v4.19-rc1)
a2965f12fde6 r8169: remove rtl8169_set_speed_xmii (v4.19-rc1)
6fcf9b1d4d6c r8169: fix runtime suspend (v4.19-rc1)
e397286b8e89 r8169: remove TBI 1000BaseX support (v4.19-rc1)

Thanks,
David Chang

> 
> Heiner
> 
> 
> On 30.01.2019 10:59, Peter Ceiley wrote:
> > Hi Heiner,
> > 
> > I tried disabling the ASPM using the pcie_aspm=off kernel parameter
> > and this made no difference.
> > 
> > I tried compiling the 4.18.16 r8169.c with the 4.19.18 source and
> > subsequently loaded the module in the running 4.19.18 kernel. I can
> > confirm that this immediately resolved the issue and access to the NFS
> > shares operated as expected.
> > 
> > I presume this means it is an issue with the r8169 driver included in
> > 4.19 onwards?
> > 
> > To answer your last questions:
> > 
> > Base Board Information
> >     Manufacturer: Alienware
> >     Product Name: 0PGRP5
> >     Version: A02
> > 
> > ... and yes, the RTL8168 is the onboard network chip.
> > 
> > Regards,
> > 
> > Peter.
> > 
> > On Tue, 29 Jan 2019 at 17:44, Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>
> >> Hi Peter,
> >>
> >> I think the vendor driver doesn't enable ASPM per default.
> >> So it's worth a try to disable ASPM in the BIOS or via sysfs.
> >> Few older systems seem to have issues with ASPM, what kind of
> >> system / mainboard are you using? The RTL8168 is the onboard
> >> network chip?
> >>
> >> Rgds, Heiner
> >>
> >>
> >> On 29.01.2019 07:20, Peter Ceiley wrote:
> >>> Hi Heiner,
> >>>
> >>> Thanks, I'll do some more testing. It might not be the driver - I
> >>> assumed it was due to the fact that using the r8168 driver 'resolves'
> >>> the issue. I'll see if I can test the r8169.c on top of 4.19 - this is
> >>> a good idea.
> >>>
> >>> Cheers,
> >>>
> >>> Peter.
> >>>
> >>> On Tue, 29 Jan 2019 at 17:16, Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>>>
> >>>> Hi Peter,
> >>>>
> >>>> at a first glance it doesn't look like a typical driver issue.
> >>>> What you could do:
> >>>>
> >>>> - Test the r8169.c from 4.18 on top of 4.19.
> >>>>
> >>>> - Check whether disabling ASPM (/sys/module/pcie_aspm) has an effect.
> >>>>
> >>>> - Bisect between 4.18 and 4.19 to find the offending commit.
> >>>>
> >>>> Any specific reason why you think root cause is in the driver and not
> >>>> elsewhere in the network subsystem?
> >>>>
> >>>> Heiner
> >>>>
> >>>>
> >>>> On 28.01.2019 23:10, Peter Ceiley wrote:
> >>>>> Hi Heiner,
> >>>>>
> >>>>> Thanks for getting back to me.
> >>>>>
> >>>>> No, I don't use jumbo packets.
> >>>>>
> >>>>> Bandwidth is *generally* good, and iperf results to my NAS provide
> >>>>> over 900 Mbits/s in both circumstances. The issue seems to appear when
> >>>>> establishing a connection and is most notable, for example, on my
> >>>>> mounted NFS shares where it takes seconds (up to 10's of seconds on
> >>>>> larger directories) to list the contents of each directory. Once a
> >>>>> transfer begins on a file, I appear to get good bandwidth.
> >>>>>
> >>>>> I'm unsure of the best scientific data to provide you in order to
> >>>>> troubleshoot this issue. Running the following
> >>>>>
> >>>>>     netstat -s |grep retransmitted
> >>>>>
> >>>>> shows a steady increase in retransmitted segments each time I list the
> >>>>> contents of a remote directory, for example, running 'ls' on a
> >>>>> directory containing 345 media files did the following using kernel
> >>>>> 4.19.18:
> >>>>>
> >>>>> increased retransmitted segments by 21 and the 'time' command showed
> >>>>> the following:
> >>>>>     real    0m19.867s
> >>>>>     user    0m0.012s
> >>>>>     sys    0m0.036s
> >>>>>
> >>>>> The same command shows no retransmitted segments running kernel
> >>>>> 4.18.16 and 'time' showed:
> >>>>>     real    0m0.300s
> >>>>>     user    0m0.004s
> >>>>>     sys    0m0.007s
> >>>>>
> >>>>> ifconfig does not show any RX/TX errors nor dropped packets in either case.
> >>>>>
> >>>>> dmesg XID:
> >>>>> [    2.979984] r8169 0000:03:00.0 eth0: RTL8168g/8111g,
> >>>>> f8:b1:56:fe:67:e0, XID 4c000800, IRQ 32
> >>>>>
> >>>>> # lspci -vv
> >>>>> 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
> >>>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
> >>>>>     Subsystem: Dell RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
> >>>>>     Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> >>>>> ParErr- Stepping- SERR- FastB2B- DisINTx+
> >>>>>     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> >>>>> <TAbort- <MAbort- >SERR- <PERR- INTx-
> >>>>>     Latency: 0, Cache Line Size: 64 bytes
> >>>>>     Interrupt: pin A routed to IRQ 19
> >>>>>     Region 0: I/O ports at d000 [size=256]
> >>>>>     Region 2: Memory at f7b00000 (64-bit, non-prefetchable) [size=4K]
> >>>>>     Region 4: Memory at f2100000 (64-bit, prefetchable) [size=16K]
> >>>>>     Capabilities: [40] Power Management version 3
> >>>>>         Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA
> >>>>> PME(D0+,D1+,D2+,D3hot+,D3cold+)
> >>>>>         Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> >>>>>     Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
> >>>>>         Address: 0000000000000000  Data: 0000
> >>>>>     Capabilities: [70] Express (v2) Endpoint, MSI 01
> >>>>>         DevCap:    MaxPayload 128 bytes, PhantFunc 0, Latency L0s
> >>>>> <512ns, L1 <64us
> >>>>>             ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
> >>>>> SlotPowerLimit 10.000W
> >>>>>         DevCtl:    CorrErr- NonFatalErr- FatalErr- UnsupReq-
> >>>>>             RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> >>>>>             MaxPayload 128 bytes, MaxReadReq 4096 bytes
> >>>>>         DevSta:    CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
> >>>>>         LnkCap:    Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit
> >>>>> Latency L0s unlimited, L1 <64us
> >>>>>             ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
> >>>>>         LnkCtl:    ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
> >>>>>             ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
> >>>>>         LnkSta:    Speed 2.5GT/s (ok), Width x1 (ok)
> >>>>>             TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> >>>>>         DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+,
> >>>>> OBFF Via message/WAKE#
> >>>>>              AtomicOpsCap: 32bit- 64bit- 128bitCAS-
> >>>>>         DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+,
> >>>>> OBFF Disabled
> >>>>>              AtomicOpsCtl: ReqEn-
> >>>>>         LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
> >>>>>              Transmit Margin: Normal Operating Range,
> >>>>> EnterModifiedCompliance- ComplianceSOS-
> >>>>>              Compliance De-emphasis: -6dB
> >>>>>         LnkSta2: Current De-emphasis Level: -6dB,
> >>>>> EqualizationComplete-, EqualizationPhase1-
> >>>>>              EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> >>>>>     Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
> >>>>>         Vector table: BAR=4 offset=00000000
> >>>>>         PBA: BAR=4 offset=00000800
> >>>>>     Capabilities: [d0] Vital Product Data
> >>>>> pcilib: sysfs_read_vpd: read failed: Input/output error
> >>>>>         Not readable
> >>>>>     Capabilities: [100 v1] Advanced Error Reporting
> >>>>>         UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> >>>>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> >>>>>         UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> >>>>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> >>>>>         UESvrt:    DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
> >>>>> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> >>>>>         CESta:    RxErr+ BadTLP+ BadDLLP+ Rollover- Timeout+ AdvNonFatalErr-
> >>>>>         CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
> >>>>>         AERCap:    First Error Pointer: 00, ECRCGenCap+ ECRCGenEn-
> >>>>> ECRCChkCap+ ECRCChkEn-
> >>>>>             MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
> >>>>>         HeaderLog: 00000000 00000000 00000000 00000000
> >>>>>     Capabilities: [140 v1] Virtual Channel
> >>>>>         Caps:    LPEVC=0 RefClk=100ns PATEntryBits=1
> >>>>>         Arb:    Fixed- WRR32- WRR64- WRR128-
> >>>>>         Ctrl:    ArbSelect=Fixed
> >>>>>         Status:    InProgress-
> >>>>>         VC0:    Caps:    PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
> >>>>>             Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
> >>>>>             Ctrl:    Enable+ ID=0 ArbSelect=Fixed TC/VC=01
> >>>>>             Status:    NegoPending- InProgress-
> >>>>>     Capabilities: [160 v1] Device Serial Number 01-00-00-00-68-4c-e0-00
> >>>>>     Capabilities: [170 v1] Latency Tolerance Reporting
> >>>>>         Max snoop latency: 71680ns
> >>>>>         Max no snoop latency: 71680ns
> >>>>>     Kernel driver in use: r8169
> >>>>>     Kernel modules: r8169
> >>>>>
> >>>>> Please let me know if you have any other ideas in terms of testing.
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> Peter.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Tue, 29 Jan 2019 at 05:28, Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>>>>>
> >>>>>> On 28.01.2019 12:13, Peter Ceiley wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I have been experiencing very poor network performance since Kernel
> >>>>>>> 4.19 and I'm confident it's related to the r8169 driver.
> >>>>>>>
> >>>>>>> I have no issue with kernel versions 4.18 and prior. I am experiencing
> >>>>>>> this issue in kernels 4.19 and 4.20 (currently running/testing with
> >>>>>>> 4.20.4 & 4.19.18).
> >>>>>>>
> >>>>>>> If someone could guide me in the right direction, I'm happy to help
> >>>>>>> troubleshoot this issue. Note that I have been keeping an eye on one
> >>>>>>> issue related to loading of the PHY driver, however, my symptoms
> >>>>>>> differ in that I still have a network connection. I have attempted to
> >>>>>>> reload the driver on a running system, but this does not improve the
> >>>>>>> situation.
> >>>>>>>
> >>>>>>> Using the proprietary r8168 driver returns my device to proper working order.
> >>>>>>>
> >>>>>>> lshw shows:
> >>>>>>>        description: Ethernet interface
> >>>>>>>        product: RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
> >>>>>>>        vendor: Realtek Semiconductor Co., Ltd.
> >>>>>>>        physical id: 0
> >>>>>>>        bus info: pci@0000:03:00.0
> >>>>>>>        logical name: enp3s0
> >>>>>>>        version: 0c
> >>>>>>>        serial:
> >>>>>>>        size: 1Gbit/s
> >>>>>>>        capacity: 1Gbit/s
> >>>>>>>        width: 64 bits
> >>>>>>>        clock: 33MHz
> >>>>>>>        capabilities: pm msi pciexpress msix vpd bus_master cap_list
> >>>>>>> ethernet physical tp aui bnc mii fibre 10bt 10bt-fd 100bt 100bt-fd
> >>>>>>> 1000bt-fd autonegotiation
> >>>>>>>        configuration: autonegotiation=on broadcast=yes driver=r8169
> >>>>>>> duplex=full firmware=rtl8168g-2_0.0.1 02/06/13 ip=192.168.1.25
> >>>>>>> latency=0 link=yes multicast=yes port=MII speed=1Gbit/s
> >>>>>>>        resources: irq:19 ioport:d000(size=256)
> >>>>>>> memory:f7b00000-f7b00fff memory:f2100000-f2103fff
> >>>>>>>
> >>>>>>> Kind Regards,
> >>>>>>>
> >>>>>>> Peter.
> >>>>>>>
> >>>>>> Hi Peter,
> >>>>>>
> >>>>>> the description "poor network performance" is quite vague, therefore:
> >>>>>>
> >>>>>> - Can you provide any measurements?
> >>>>>> - iperf results before and after
> >>>>>> - statistics about dropped packets (rx and/or tx)
> >>>>>> - Do you use jumbo packets?
> >>>>>>
> >>>>>> Also help would be a "lspci -vv" output for the network card and
> >>>>>> the dmesg output line with the chip XID.
> >>>>>>
> >>>>>> Heiner
> >>>>>
> >>>>
> >>>
> >>
> > 
> 

^ permalink raw reply

* [PATCH 0/9] Add ENETC PTP clock driver
From: Yangbo Lu @ 2019-01-31  2:33 UTC (permalink / raw)
  To: netdev, devicetree
  Cc: David S . Miller, Richard Cochran, Rob Herring, Claudiu Manoil,
	Yangbo Lu

There is same QorIQ 1588 timer IP block on the new ENETC Ethernet
controller with eTSEC/DPAA Ethernet controllers. However it's
different endianness (little-endian) and using PCI driver.

To support ENETC PTP driver, ptp_qoriq driver needed to be
reworked to make functions global for reusing, to add little-
endian support, to add ENETC memory map support, and to add
ENETC dependency for ptp_qoriq driver.

In addition, although ENETC PTP driver is a PCI driver, the dts
node still could be used. Currently the ls1028a dtsi which is
the only platform by now using ENETC is not complete, so there
is still dependency for ENETC PTP node upstreaming. This will
be done in the near future. The hardware timestamping support
for ENETC is done but needs to be reworked with new method in
internal git tree, and will be sent out soon.

Yangbo Lu (9):
  ptp_qoriq: make structure/function names more consistent
  ptp_qoriq: make ptp operations global
  ptp_qoriq: convert to use ptp_qoriq_init()
  ptp_qoriq: add little enadian support
  dt-binding: ptp_qoriq: add little-endian support
  ptp_qoriq: fix register memory map
  ptp: add QorIQ PTP support for ENETC
  enetc: add PTP clock driver
  MAINTAINERS: add enetc_ptp driver into QorIQ PTP list

 .../devicetree/bindings/ptp/ptp-qoriq.txt          |    3 +
 MAINTAINERS                                        |    1 +
 drivers/net/ethernet/freescale/enetc/Kconfig       |   12 +
 drivers/net/ethernet/freescale/enetc/Makefile      |    3 +
 drivers/net/ethernet/freescale/enetc/enetc_hw.h    |    5 +-
 drivers/net/ethernet/freescale/enetc/enetc_ptp.c   |  151 +++++++
 drivers/ptp/Kconfig                                |    2 +-
 drivers/ptp/ptp_qoriq.c                            |  427 ++++++++++----------
 drivers/ptp/ptp_qoriq_debugfs.c                    |   48 ++--
 include/linux/fsl/ptp_qoriq.h                      |   62 ++-
 10 files changed, 462 insertions(+), 252 deletions(-)
 create mode 100644 drivers/net/ethernet/freescale/enetc/enetc_ptp.c


^ permalink raw reply

* [PATCH 2/9] ptp_qoriq: make ptp operations global
From: Yangbo Lu @ 2019-01-31  2:33 UTC (permalink / raw)
  To: netdev, devicetree
  Cc: David S . Miller, Richard Cochran, Rob Herring, Claudiu Manoil,
	Yangbo Lu
In-Reply-To: <20190131023310.9563-1-yangbo.lu@nxp.com>

This patch is to make functions of ptp operations global,
so that ENETC PTP driver which is a PCI driver for same
1588 timer IP block could reuse them.

Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
---
 drivers/ptp/ptp_qoriq.c       |   27 ++++++++++++++++-----------
 include/linux/fsl/ptp_qoriq.h |    9 +++++++++
 2 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/drivers/ptp/ptp_qoriq.c b/drivers/ptp/ptp_qoriq.c
index 8c10d0f..1f3e73e 100644
--- a/drivers/ptp/ptp_qoriq.c
+++ b/drivers/ptp/ptp_qoriq.c
@@ -22,7 +22,6 @@
 
 #include <linux/device.h>
 #include <linux/hrtimer.h>
-#include <linux/interrupt.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/of.h>
@@ -135,7 +134,7 @@ static int extts_clean_up(struct ptp_qoriq *ptp_qoriq, int index,
  * Interrupt service routine
  */
 
-static irqreturn_t isr(int irq, void *priv)
+irqreturn_t ptp_qoriq_isr(int irq, void *priv)
 {
 	struct ptp_qoriq *ptp_qoriq = priv;
 	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
@@ -200,12 +199,13 @@ static irqreturn_t isr(int irq, void *priv)
 	} else
 		return IRQ_NONE;
 }
+EXPORT_SYMBOL_GPL(ptp_qoriq_isr);
 
 /*
  * PTP clock operations
  */
 
-static int ptp_qoriq_adjfine(struct ptp_clock_info *ptp, long scaled_ppm)
+int ptp_qoriq_adjfine(struct ptp_clock_info *ptp, long scaled_ppm)
 {
 	u64 adj, diff;
 	u32 tmr_add;
@@ -233,8 +233,9 @@ static int ptp_qoriq_adjfine(struct ptp_clock_info *ptp, long scaled_ppm)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(ptp_qoriq_adjfine);
 
-static int ptp_qoriq_adjtime(struct ptp_clock_info *ptp, s64 delta)
+int ptp_qoriq_adjtime(struct ptp_clock_info *ptp, s64 delta)
 {
 	s64 now;
 	unsigned long flags;
@@ -251,9 +252,9 @@ static int ptp_qoriq_adjtime(struct ptp_clock_info *ptp, s64 delta)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(ptp_qoriq_adjtime);
 
-static int ptp_qoriq_gettime(struct ptp_clock_info *ptp,
-			       struct timespec64 *ts)
+int ptp_qoriq_gettime(struct ptp_clock_info *ptp, struct timespec64 *ts)
 {
 	u64 ns;
 	unsigned long flags;
@@ -269,9 +270,10 @@ static int ptp_qoriq_gettime(struct ptp_clock_info *ptp,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(ptp_qoriq_gettime);
 
-static int ptp_qoriq_settime(struct ptp_clock_info *ptp,
-			       const struct timespec64 *ts)
+int ptp_qoriq_settime(struct ptp_clock_info *ptp,
+		      const struct timespec64 *ts)
 {
 	u64 ns;
 	unsigned long flags;
@@ -288,9 +290,10 @@ static int ptp_qoriq_settime(struct ptp_clock_info *ptp,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(ptp_qoriq_settime);
 
-static int ptp_qoriq_enable(struct ptp_clock_info *ptp,
-			      struct ptp_clock_request *rq, int on)
+int ptp_qoriq_enable(struct ptp_clock_info *ptp,
+		     struct ptp_clock_request *rq, int on)
 {
 	struct ptp_qoriq *ptp_qoriq = container_of(ptp, struct ptp_qoriq, caps);
 	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
@@ -336,6 +339,7 @@ static int ptp_qoriq_enable(struct ptp_clock_info *ptp,
 	spin_unlock_irqrestore(&ptp_qoriq->lock, flags);
 	return 0;
 }
+EXPORT_SYMBOL_GPL(ptp_qoriq_enable);
 
 static const struct ptp_clock_info ptp_qoriq_caps = {
 	.owner		= THIS_MODULE,
@@ -508,7 +512,8 @@ static int ptp_qoriq_probe(struct platform_device *dev)
 		pr_err("irq not in device tree\n");
 		goto no_node;
 	}
-	if (request_irq(ptp_qoriq->irq, isr, IRQF_SHARED, DRIVER, ptp_qoriq)) {
+	if (request_irq(ptp_qoriq->irq, ptp_qoriq_isr, IRQF_SHARED,
+			DRIVER, ptp_qoriq)) {
 		pr_err("request_irq failed\n");
 		goto no_node;
 	}
diff --git a/include/linux/fsl/ptp_qoriq.h b/include/linux/fsl/ptp_qoriq.h
index c2a32d9..75e6f05 100644
--- a/include/linux/fsl/ptp_qoriq.h
+++ b/include/linux/fsl/ptp_qoriq.h
@@ -7,6 +7,7 @@
 #define __PTP_QORIQ_H__
 
 #include <linux/io.h>
+#include <linux/interrupt.h>
 #include <linux/ptp_clock_kernel.h>
 
 /*
@@ -171,6 +172,14 @@ static inline void qoriq_write(unsigned __iomem *addr, u32 val)
 	iowrite32be(val, addr);
 }
 
+irqreturn_t ptp_qoriq_isr(int irq, void *priv);
+int ptp_qoriq_adjfine(struct ptp_clock_info *ptp, long scaled_ppm);
+int ptp_qoriq_adjtime(struct ptp_clock_info *ptp, s64 delta);
+int ptp_qoriq_gettime(struct ptp_clock_info *ptp, struct timespec64 *ts);
+int ptp_qoriq_settime(struct ptp_clock_info *ptp,
+		      const struct timespec64 *ts);
+int ptp_qoriq_enable(struct ptp_clock_info *ptp,
+		     struct ptp_clock_request *rq, int on);
 #ifdef CONFIG_DEBUG_FS
 void ptp_qoriq_create_debugfs(struct ptp_qoriq *ptp_qoriq);
 void ptp_qoriq_remove_debugfs(struct ptp_qoriq *ptp_qoriq);
-- 
1.7.1


^ permalink raw reply related

* [PATCH 1/9] ptp_qoriq: make structure/function names more consistent
From: Yangbo Lu @ 2019-01-31  2:33 UTC (permalink / raw)
  To: netdev, devicetree
  Cc: David S . Miller, Richard Cochran, Rob Herring, Claudiu Manoil,
	Yangbo Lu
In-Reply-To: <20190131023310.9563-1-yangbo.lu@nxp.com>

Strings containing "ptp_qoriq" or "qoriq_ptp" which were used for
structure/function names were complained by users. Let's just use
the unique "ptp_qoriq" to make these names more consistent.
This patch is just to unify the names using "ptp_qoriq". It hasn't
changed any functions.

Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
---
 drivers/ptp/ptp_qoriq.c         |  288 +++++++++++++++++++-------------------
 drivers/ptp/ptp_qoriq_debugfs.c |   36 +++---
 include/linux/fsl/ptp_qoriq.h   |   14 +-
 3 files changed, 169 insertions(+), 169 deletions(-)

diff --git a/drivers/ptp/ptp_qoriq.c b/drivers/ptp/ptp_qoriq.c
index 43416b2..8c10d0f 100644
--- a/drivers/ptp/ptp_qoriq.c
+++ b/drivers/ptp/ptp_qoriq.c
@@ -37,10 +37,10 @@
  * Register access functions
  */
 
-/* Caller must hold qoriq_ptp->lock. */
-static u64 tmr_cnt_read(struct qoriq_ptp *qoriq_ptp)
+/* Caller must hold ptp_qoriq->lock. */
+static u64 tmr_cnt_read(struct ptp_qoriq *ptp_qoriq)
 {
-	struct qoriq_ptp_registers *regs = &qoriq_ptp->regs;
+	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
 	u64 ns;
 	u32 lo, hi;
 
@@ -51,10 +51,10 @@ static u64 tmr_cnt_read(struct qoriq_ptp *qoriq_ptp)
 	return ns;
 }
 
-/* Caller must hold qoriq_ptp->lock. */
-static void tmr_cnt_write(struct qoriq_ptp *qoriq_ptp, u64 ns)
+/* Caller must hold ptp_qoriq->lock. */
+static void tmr_cnt_write(struct ptp_qoriq *ptp_qoriq, u64 ns)
 {
-	struct qoriq_ptp_registers *regs = &qoriq_ptp->regs;
+	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
 	u32 hi = ns >> 32;
 	u32 lo = ns & 0xffffffff;
 
@@ -62,36 +62,36 @@ static void tmr_cnt_write(struct qoriq_ptp *qoriq_ptp, u64 ns)
 	qoriq_write(&regs->ctrl_regs->tmr_cnt_h, hi);
 }
 
-/* Caller must hold qoriq_ptp->lock. */
-static void set_alarm(struct qoriq_ptp *qoriq_ptp)
+/* Caller must hold ptp_qoriq->lock. */
+static void set_alarm(struct ptp_qoriq *ptp_qoriq)
 {
-	struct qoriq_ptp_registers *regs = &qoriq_ptp->regs;
+	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
 	u64 ns;
 	u32 lo, hi;
 
-	ns = tmr_cnt_read(qoriq_ptp) + 1500000000ULL;
+	ns = tmr_cnt_read(ptp_qoriq) + 1500000000ULL;
 	ns = div_u64(ns, 1000000000UL) * 1000000000ULL;
-	ns -= qoriq_ptp->tclk_period;
+	ns -= ptp_qoriq->tclk_period;
 	hi = ns >> 32;
 	lo = ns & 0xffffffff;
 	qoriq_write(&regs->alarm_regs->tmr_alarm1_l, lo);
 	qoriq_write(&regs->alarm_regs->tmr_alarm1_h, hi);
 }
 
-/* Caller must hold qoriq_ptp->lock. */
-static void set_fipers(struct qoriq_ptp *qoriq_ptp)
+/* Caller must hold ptp_qoriq->lock. */
+static void set_fipers(struct ptp_qoriq *ptp_qoriq)
 {
-	struct qoriq_ptp_registers *regs = &qoriq_ptp->regs;
+	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
 
-	set_alarm(qoriq_ptp);
-	qoriq_write(&regs->fiper_regs->tmr_fiper1, qoriq_ptp->tmr_fiper1);
-	qoriq_write(&regs->fiper_regs->tmr_fiper2, qoriq_ptp->tmr_fiper2);
+	set_alarm(ptp_qoriq);
+	qoriq_write(&regs->fiper_regs->tmr_fiper1, ptp_qoriq->tmr_fiper1);
+	qoriq_write(&regs->fiper_regs->tmr_fiper2, ptp_qoriq->tmr_fiper2);
 }
 
-static int extts_clean_up(struct qoriq_ptp *qoriq_ptp, int index,
+static int extts_clean_up(struct ptp_qoriq *ptp_qoriq, int index,
 			  bool update_event)
 {
-	struct qoriq_ptp_registers *regs = &qoriq_ptp->regs;
+	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
 	struct ptp_clock_event event;
 	void __iomem *reg_etts_l;
 	void __iomem *reg_etts_h;
@@ -122,11 +122,11 @@ static int extts_clean_up(struct qoriq_ptp *qoriq_ptp, int index,
 		if (update_event) {
 			event.timestamp = ((u64) hi) << 32;
 			event.timestamp |= lo;
-			ptp_clock_event(qoriq_ptp->clock, &event);
+			ptp_clock_event(ptp_qoriq->clock, &event);
 		}
 
 		stat = qoriq_read(&regs->ctrl_regs->tmr_stat);
-	} while (qoriq_ptp->extts_fifo_support && (stat & valid));
+	} while (ptp_qoriq->extts_fifo_support && (stat & valid));
 
 	return 0;
 }
@@ -137,61 +137,61 @@ static int extts_clean_up(struct qoriq_ptp *qoriq_ptp, int index,
 
 static irqreturn_t isr(int irq, void *priv)
 {
-	struct qoriq_ptp *qoriq_ptp = priv;
-	struct qoriq_ptp_registers *regs = &qoriq_ptp->regs;
+	struct ptp_qoriq *ptp_qoriq = priv;
+	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
 	struct ptp_clock_event event;
 	u64 ns;
 	u32 ack = 0, lo, hi, mask, val, irqs;
 
-	spin_lock(&qoriq_ptp->lock);
+	spin_lock(&ptp_qoriq->lock);
 
 	val = qoriq_read(&regs->ctrl_regs->tmr_tevent);
 	mask = qoriq_read(&regs->ctrl_regs->tmr_temask);
 
-	spin_unlock(&qoriq_ptp->lock);
+	spin_unlock(&ptp_qoriq->lock);
 
 	irqs = val & mask;
 
 	if (irqs & ETS1) {
 		ack |= ETS1;
-		extts_clean_up(qoriq_ptp, 0, true);
+		extts_clean_up(ptp_qoriq, 0, true);
 	}
 
 	if (irqs & ETS2) {
 		ack |= ETS2;
-		extts_clean_up(qoriq_ptp, 1, true);
+		extts_clean_up(ptp_qoriq, 1, true);
 	}
 
 	if (irqs & ALM2) {
 		ack |= ALM2;
-		if (qoriq_ptp->alarm_value) {
+		if (ptp_qoriq->alarm_value) {
 			event.type = PTP_CLOCK_ALARM;
 			event.index = 0;
-			event.timestamp = qoriq_ptp->alarm_value;
-			ptp_clock_event(qoriq_ptp->clock, &event);
+			event.timestamp = ptp_qoriq->alarm_value;
+			ptp_clock_event(ptp_qoriq->clock, &event);
 		}
-		if (qoriq_ptp->alarm_interval) {
-			ns = qoriq_ptp->alarm_value + qoriq_ptp->alarm_interval;
+		if (ptp_qoriq->alarm_interval) {
+			ns = ptp_qoriq->alarm_value + ptp_qoriq->alarm_interval;
 			hi = ns >> 32;
 			lo = ns & 0xffffffff;
 			qoriq_write(&regs->alarm_regs->tmr_alarm2_l, lo);
 			qoriq_write(&regs->alarm_regs->tmr_alarm2_h, hi);
-			qoriq_ptp->alarm_value = ns;
+			ptp_qoriq->alarm_value = ns;
 		} else {
-			spin_lock(&qoriq_ptp->lock);
+			spin_lock(&ptp_qoriq->lock);
 			mask = qoriq_read(&regs->ctrl_regs->tmr_temask);
 			mask &= ~ALM2EN;
 			qoriq_write(&regs->ctrl_regs->tmr_temask, mask);
-			spin_unlock(&qoriq_ptp->lock);
-			qoriq_ptp->alarm_value = 0;
-			qoriq_ptp->alarm_interval = 0;
+			spin_unlock(&ptp_qoriq->lock);
+			ptp_qoriq->alarm_value = 0;
+			ptp_qoriq->alarm_interval = 0;
 		}
 	}
 
 	if (irqs & PP1) {
 		ack |= PP1;
 		event.type = PTP_CLOCK_PPS;
-		ptp_clock_event(qoriq_ptp->clock, &event);
+		ptp_clock_event(ptp_qoriq->clock, &event);
 	}
 
 	if (ack) {
@@ -210,14 +210,14 @@ static int ptp_qoriq_adjfine(struct ptp_clock_info *ptp, long scaled_ppm)
 	u64 adj, diff;
 	u32 tmr_add;
 	int neg_adj = 0;
-	struct qoriq_ptp *qoriq_ptp = container_of(ptp, struct qoriq_ptp, caps);
-	struct qoriq_ptp_registers *regs = &qoriq_ptp->regs;
+	struct ptp_qoriq *ptp_qoriq = container_of(ptp, struct ptp_qoriq, caps);
+	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
 
 	if (scaled_ppm < 0) {
 		neg_adj = 1;
 		scaled_ppm = -scaled_ppm;
 	}
-	tmr_add = qoriq_ptp->tmr_add;
+	tmr_add = ptp_qoriq->tmr_add;
 	adj = tmr_add;
 
 	/* calculate diff as adj*(scaled_ppm/65536)/1000000
@@ -238,16 +238,16 @@ static int ptp_qoriq_adjtime(struct ptp_clock_info *ptp, s64 delta)
 {
 	s64 now;
 	unsigned long flags;
-	struct qoriq_ptp *qoriq_ptp = container_of(ptp, struct qoriq_ptp, caps);
+	struct ptp_qoriq *ptp_qoriq = container_of(ptp, struct ptp_qoriq, caps);
 
-	spin_lock_irqsave(&qoriq_ptp->lock, flags);
+	spin_lock_irqsave(&ptp_qoriq->lock, flags);
 
-	now = tmr_cnt_read(qoriq_ptp);
+	now = tmr_cnt_read(ptp_qoriq);
 	now += delta;
-	tmr_cnt_write(qoriq_ptp, now);
-	set_fipers(qoriq_ptp);
+	tmr_cnt_write(ptp_qoriq, now);
+	set_fipers(ptp_qoriq);
 
-	spin_unlock_irqrestore(&qoriq_ptp->lock, flags);
+	spin_unlock_irqrestore(&ptp_qoriq->lock, flags);
 
 	return 0;
 }
@@ -257,13 +257,13 @@ static int ptp_qoriq_gettime(struct ptp_clock_info *ptp,
 {
 	u64 ns;
 	unsigned long flags;
-	struct qoriq_ptp *qoriq_ptp = container_of(ptp, struct qoriq_ptp, caps);
+	struct ptp_qoriq *ptp_qoriq = container_of(ptp, struct ptp_qoriq, caps);
 
-	spin_lock_irqsave(&qoriq_ptp->lock, flags);
+	spin_lock_irqsave(&ptp_qoriq->lock, flags);
 
-	ns = tmr_cnt_read(qoriq_ptp);
+	ns = tmr_cnt_read(ptp_qoriq);
 
-	spin_unlock_irqrestore(&qoriq_ptp->lock, flags);
+	spin_unlock_irqrestore(&ptp_qoriq->lock, flags);
 
 	*ts = ns_to_timespec64(ns);
 
@@ -275,16 +275,16 @@ static int ptp_qoriq_settime(struct ptp_clock_info *ptp,
 {
 	u64 ns;
 	unsigned long flags;
-	struct qoriq_ptp *qoriq_ptp = container_of(ptp, struct qoriq_ptp, caps);
+	struct ptp_qoriq *ptp_qoriq = container_of(ptp, struct ptp_qoriq, caps);
 
 	ns = timespec64_to_ns(ts);
 
-	spin_lock_irqsave(&qoriq_ptp->lock, flags);
+	spin_lock_irqsave(&ptp_qoriq->lock, flags);
 
-	tmr_cnt_write(qoriq_ptp, ns);
-	set_fipers(qoriq_ptp);
+	tmr_cnt_write(ptp_qoriq, ns);
+	set_fipers(ptp_qoriq);
 
-	spin_unlock_irqrestore(&qoriq_ptp->lock, flags);
+	spin_unlock_irqrestore(&ptp_qoriq->lock, flags);
 
 	return 0;
 }
@@ -292,8 +292,8 @@ static int ptp_qoriq_settime(struct ptp_clock_info *ptp,
 static int ptp_qoriq_enable(struct ptp_clock_info *ptp,
 			      struct ptp_clock_request *rq, int on)
 {
-	struct qoriq_ptp *qoriq_ptp = container_of(ptp, struct qoriq_ptp, caps);
-	struct qoriq_ptp_registers *regs = &qoriq_ptp->regs;
+	struct ptp_qoriq *ptp_qoriq = container_of(ptp, struct ptp_qoriq, caps);
+	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
 	unsigned long flags;
 	u32 bit, mask = 0;
 
@@ -311,7 +311,7 @@ static int ptp_qoriq_enable(struct ptp_clock_info *ptp,
 		}
 
 		if (on)
-			extts_clean_up(qoriq_ptp, rq->extts.index, false);
+			extts_clean_up(ptp_qoriq, rq->extts.index, false);
 
 		break;
 	case PTP_CLK_REQ_PPS:
@@ -321,7 +321,7 @@ static int ptp_qoriq_enable(struct ptp_clock_info *ptp,
 		return -EOPNOTSUPP;
 	}
 
-	spin_lock_irqsave(&qoriq_ptp->lock, flags);
+	spin_lock_irqsave(&ptp_qoriq->lock, flags);
 
 	mask = qoriq_read(&regs->ctrl_regs->tmr_temask);
 	if (on) {
@@ -333,7 +333,7 @@ static int ptp_qoriq_enable(struct ptp_clock_info *ptp,
 
 	qoriq_write(&regs->ctrl_regs->tmr_temask, mask);
 
-	spin_unlock_irqrestore(&qoriq_ptp->lock, flags);
+	spin_unlock_irqrestore(&ptp_qoriq->lock, flags);
 	return 0;
 }
 
@@ -354,7 +354,7 @@ static int ptp_qoriq_enable(struct ptp_clock_info *ptp,
 };
 
 /**
- * qoriq_ptp_nominal_freq - calculate nominal frequency according to
+ * ptp_qoriq_nominal_freq - calculate nominal frequency according to
  *			    reference clock frequency
  *
  * @clk_src: reference clock frequency
@@ -365,7 +365,7 @@ static int ptp_qoriq_enable(struct ptp_clock_info *ptp,
  *
  * Return the nominal frequency
  */
-static u32 qoriq_ptp_nominal_freq(u32 clk_src)
+static u32 ptp_qoriq_nominal_freq(u32 clk_src)
 {
 	u32 remainder = 0;
 
@@ -385,9 +385,9 @@ static u32 qoriq_ptp_nominal_freq(u32 clk_src)
 }
 
 /**
- * qoriq_ptp_auto_config - calculate a set of default configurations
+ * ptp_qoriq_auto_config - calculate a set of default configurations
  *
- * @qoriq_ptp: pointer to qoriq_ptp
+ * @ptp_qoriq: pointer to ptp_qoriq
  * @node: pointer to device_node
  *
  * If below dts properties are not provided, this function will be
@@ -401,7 +401,7 @@ static u32 qoriq_ptp_nominal_freq(u32 clk_src)
  *
  * Return 0 if success
  */
-static int qoriq_ptp_auto_config(struct qoriq_ptp *qoriq_ptp,
+static int ptp_qoriq_auto_config(struct ptp_qoriq *ptp_qoriq,
 				 struct device_node *node)
 {
 	struct clk *clk;
@@ -411,7 +411,7 @@ static int qoriq_ptp_auto_config(struct qoriq_ptp *qoriq_ptp,
 	u32 remainder = 0;
 	u32 clk_src = 0;
 
-	qoriq_ptp->cksel = DEFAULT_CKSEL;
+	ptp_qoriq->cksel = DEFAULT_CKSEL;
 
 	clk = of_clk_get(node, 0);
 	if (!IS_ERR(clk)) {
@@ -424,12 +424,12 @@ static int qoriq_ptp_auto_config(struct qoriq_ptp *qoriq_ptp,
 		return -EINVAL;
 	}
 
-	nominal_freq = qoriq_ptp_nominal_freq(clk_src);
+	nominal_freq = ptp_qoriq_nominal_freq(clk_src);
 	if (!nominal_freq)
 		return -EINVAL;
 
-	qoriq_ptp->tclk_period = 1000000000UL / nominal_freq;
-	qoriq_ptp->tmr_prsc = DEFAULT_TMR_PRSC;
+	ptp_qoriq->tclk_period = 1000000000UL / nominal_freq;
+	ptp_qoriq->tmr_prsc = DEFAULT_TMR_PRSC;
 
 	/* Calculate initial frequency compensation value for TMR_ADD register.
 	 * freq_comp = ceil(2^32 / freq_ratio)
@@ -440,171 +440,171 @@ static int qoriq_ptp_auto_config(struct qoriq_ptp *qoriq_ptp,
 	if (remainder)
 		freq_comp++;
 
-	qoriq_ptp->tmr_add = freq_comp;
-	qoriq_ptp->tmr_fiper1 = DEFAULT_FIPER1_PERIOD - qoriq_ptp->tclk_period;
-	qoriq_ptp->tmr_fiper2 = DEFAULT_FIPER2_PERIOD - qoriq_ptp->tclk_period;
+	ptp_qoriq->tmr_add = freq_comp;
+	ptp_qoriq->tmr_fiper1 = DEFAULT_FIPER1_PERIOD - ptp_qoriq->tclk_period;
+	ptp_qoriq->tmr_fiper2 = DEFAULT_FIPER2_PERIOD - ptp_qoriq->tclk_period;
 
 	/* max_adj = 1000000000 * (freq_ratio - 1.0) - 1
 	 * freq_ratio = reference_clock_freq / nominal_freq
 	 */
 	max_adj = 1000000000ULL * (clk_src - nominal_freq);
 	max_adj = div_u64(max_adj, nominal_freq) - 1;
-	qoriq_ptp->caps.max_adj = max_adj;
+	ptp_qoriq->caps.max_adj = max_adj;
 
 	return 0;
 }
 
-static int qoriq_ptp_probe(struct platform_device *dev)
+static int ptp_qoriq_probe(struct platform_device *dev)
 {
 	struct device_node *node = dev->dev.of_node;
-	struct qoriq_ptp *qoriq_ptp;
-	struct qoriq_ptp_registers *regs;
+	struct ptp_qoriq *ptp_qoriq;
+	struct ptp_qoriq_registers *regs;
 	struct timespec64 now;
 	int err = -ENOMEM;
 	u32 tmr_ctrl;
 	unsigned long flags;
 	void __iomem *base;
 
-	qoriq_ptp = kzalloc(sizeof(*qoriq_ptp), GFP_KERNEL);
-	if (!qoriq_ptp)
+	ptp_qoriq = kzalloc(sizeof(*ptp_qoriq), GFP_KERNEL);
+	if (!ptp_qoriq)
 		goto no_memory;
 
 	err = -EINVAL;
 
-	qoriq_ptp->dev = &dev->dev;
-	qoriq_ptp->caps = ptp_qoriq_caps;
+	ptp_qoriq->dev = &dev->dev;
+	ptp_qoriq->caps = ptp_qoriq_caps;
 
-	if (of_property_read_u32(node, "fsl,cksel", &qoriq_ptp->cksel))
-		qoriq_ptp->cksel = DEFAULT_CKSEL;
+	if (of_property_read_u32(node, "fsl,cksel", &ptp_qoriq->cksel))
+		ptp_qoriq->cksel = DEFAULT_CKSEL;
 
 	if (of_property_read_bool(node, "fsl,extts-fifo"))
-		qoriq_ptp->extts_fifo_support = true;
+		ptp_qoriq->extts_fifo_support = true;
 	else
-		qoriq_ptp->extts_fifo_support = false;
+		ptp_qoriq->extts_fifo_support = false;
 
 	if (of_property_read_u32(node,
-				 "fsl,tclk-period", &qoriq_ptp->tclk_period) ||
+				 "fsl,tclk-period", &ptp_qoriq->tclk_period) ||
 	    of_property_read_u32(node,
-				 "fsl,tmr-prsc", &qoriq_ptp->tmr_prsc) ||
+				 "fsl,tmr-prsc", &ptp_qoriq->tmr_prsc) ||
 	    of_property_read_u32(node,
-				 "fsl,tmr-add", &qoriq_ptp->tmr_add) ||
+				 "fsl,tmr-add", &ptp_qoriq->tmr_add) ||
 	    of_property_read_u32(node,
-				 "fsl,tmr-fiper1", &qoriq_ptp->tmr_fiper1) ||
+				 "fsl,tmr-fiper1", &ptp_qoriq->tmr_fiper1) ||
 	    of_property_read_u32(node,
-				 "fsl,tmr-fiper2", &qoriq_ptp->tmr_fiper2) ||
+				 "fsl,tmr-fiper2", &ptp_qoriq->tmr_fiper2) ||
 	    of_property_read_u32(node,
-				 "fsl,max-adj", &qoriq_ptp->caps.max_adj)) {
+				 "fsl,max-adj", &ptp_qoriq->caps.max_adj)) {
 		pr_warn("device tree node missing required elements, try automatic configuration\n");
 
-		if (qoriq_ptp_auto_config(qoriq_ptp, node))
+		if (ptp_qoriq_auto_config(ptp_qoriq, node))
 			goto no_config;
 	}
 
 	err = -ENODEV;
 
-	qoriq_ptp->irq = platform_get_irq(dev, 0);
+	ptp_qoriq->irq = platform_get_irq(dev, 0);
 
-	if (qoriq_ptp->irq < 0) {
+	if (ptp_qoriq->irq < 0) {
 		pr_err("irq not in device tree\n");
 		goto no_node;
 	}
-	if (request_irq(qoriq_ptp->irq, isr, IRQF_SHARED, DRIVER, qoriq_ptp)) {
+	if (request_irq(ptp_qoriq->irq, isr, IRQF_SHARED, DRIVER, ptp_qoriq)) {
 		pr_err("request_irq failed\n");
 		goto no_node;
 	}
 
-	qoriq_ptp->rsrc = platform_get_resource(dev, IORESOURCE_MEM, 0);
-	if (!qoriq_ptp->rsrc) {
+	ptp_qoriq->rsrc = platform_get_resource(dev, IORESOURCE_MEM, 0);
+	if (!ptp_qoriq->rsrc) {
 		pr_err("no resource\n");
 		goto no_resource;
 	}
-	if (request_resource(&iomem_resource, qoriq_ptp->rsrc)) {
+	if (request_resource(&iomem_resource, ptp_qoriq->rsrc)) {
 		pr_err("resource busy\n");
 		goto no_resource;
 	}
 
-	spin_lock_init(&qoriq_ptp->lock);
+	spin_lock_init(&ptp_qoriq->lock);
 
-	base = ioremap(qoriq_ptp->rsrc->start,
-		       resource_size(qoriq_ptp->rsrc));
+	base = ioremap(ptp_qoriq->rsrc->start,
+		       resource_size(ptp_qoriq->rsrc));
 	if (!base) {
 		pr_err("ioremap ptp registers failed\n");
 		goto no_ioremap;
 	}
 
-	qoriq_ptp->base = base;
+	ptp_qoriq->base = base;
 
 	if (of_device_is_compatible(node, "fsl,fman-ptp-timer")) {
-		qoriq_ptp->regs.ctrl_regs = base + FMAN_CTRL_REGS_OFFSET;
-		qoriq_ptp->regs.alarm_regs = base + FMAN_ALARM_REGS_OFFSET;
-		qoriq_ptp->regs.fiper_regs = base + FMAN_FIPER_REGS_OFFSET;
-		qoriq_ptp->regs.etts_regs = base + FMAN_ETTS_REGS_OFFSET;
+		ptp_qoriq->regs.ctrl_regs = base + FMAN_CTRL_REGS_OFFSET;
+		ptp_qoriq->regs.alarm_regs = base + FMAN_ALARM_REGS_OFFSET;
+		ptp_qoriq->regs.fiper_regs = base + FMAN_FIPER_REGS_OFFSET;
+		ptp_qoriq->regs.etts_regs = base + FMAN_ETTS_REGS_OFFSET;
 	} else {
-		qoriq_ptp->regs.ctrl_regs = base + CTRL_REGS_OFFSET;
-		qoriq_ptp->regs.alarm_regs = base + ALARM_REGS_OFFSET;
-		qoriq_ptp->regs.fiper_regs = base + FIPER_REGS_OFFSET;
-		qoriq_ptp->regs.etts_regs = base + ETTS_REGS_OFFSET;
+		ptp_qoriq->regs.ctrl_regs = base + CTRL_REGS_OFFSET;
+		ptp_qoriq->regs.alarm_regs = base + ALARM_REGS_OFFSET;
+		ptp_qoriq->regs.fiper_regs = base + FIPER_REGS_OFFSET;
+		ptp_qoriq->regs.etts_regs = base + ETTS_REGS_OFFSET;
 	}
 
 	ktime_get_real_ts64(&now);
-	ptp_qoriq_settime(&qoriq_ptp->caps, &now);
+	ptp_qoriq_settime(&ptp_qoriq->caps, &now);
 
 	tmr_ctrl =
-	  (qoriq_ptp->tclk_period & TCLK_PERIOD_MASK) << TCLK_PERIOD_SHIFT |
-	  (qoriq_ptp->cksel & CKSEL_MASK) << CKSEL_SHIFT;
+	  (ptp_qoriq->tclk_period & TCLK_PERIOD_MASK) << TCLK_PERIOD_SHIFT |
+	  (ptp_qoriq->cksel & CKSEL_MASK) << CKSEL_SHIFT;
 
-	spin_lock_irqsave(&qoriq_ptp->lock, flags);
+	spin_lock_irqsave(&ptp_qoriq->lock, flags);
 
-	regs = &qoriq_ptp->regs;
+	regs = &ptp_qoriq->regs;
 	qoriq_write(&regs->ctrl_regs->tmr_ctrl,   tmr_ctrl);
-	qoriq_write(&regs->ctrl_regs->tmr_add,    qoriq_ptp->tmr_add);
-	qoriq_write(&regs->ctrl_regs->tmr_prsc,   qoriq_ptp->tmr_prsc);
-	qoriq_write(&regs->fiper_regs->tmr_fiper1, qoriq_ptp->tmr_fiper1);
-	qoriq_write(&regs->fiper_regs->tmr_fiper2, qoriq_ptp->tmr_fiper2);
-	set_alarm(qoriq_ptp);
+	qoriq_write(&regs->ctrl_regs->tmr_add,    ptp_qoriq->tmr_add);
+	qoriq_write(&regs->ctrl_regs->tmr_prsc,   ptp_qoriq->tmr_prsc);
+	qoriq_write(&regs->fiper_regs->tmr_fiper1, ptp_qoriq->tmr_fiper1);
+	qoriq_write(&regs->fiper_regs->tmr_fiper2, ptp_qoriq->tmr_fiper2);
+	set_alarm(ptp_qoriq);
 	qoriq_write(&regs->ctrl_regs->tmr_ctrl,   tmr_ctrl|FIPERST|RTPE|TE|FRD);
 
-	spin_unlock_irqrestore(&qoriq_ptp->lock, flags);
+	spin_unlock_irqrestore(&ptp_qoriq->lock, flags);
 
-	qoriq_ptp->clock = ptp_clock_register(&qoriq_ptp->caps, &dev->dev);
-	if (IS_ERR(qoriq_ptp->clock)) {
-		err = PTR_ERR(qoriq_ptp->clock);
+	ptp_qoriq->clock = ptp_clock_register(&ptp_qoriq->caps, &dev->dev);
+	if (IS_ERR(ptp_qoriq->clock)) {
+		err = PTR_ERR(ptp_qoriq->clock);
 		goto no_clock;
 	}
-	qoriq_ptp->phc_index = ptp_clock_index(qoriq_ptp->clock);
+	ptp_qoriq->phc_index = ptp_clock_index(ptp_qoriq->clock);
 
-	ptp_qoriq_create_debugfs(qoriq_ptp);
-	platform_set_drvdata(dev, qoriq_ptp);
+	ptp_qoriq_create_debugfs(ptp_qoriq);
+	platform_set_drvdata(dev, ptp_qoriq);
 
 	return 0;
 
 no_clock:
-	iounmap(qoriq_ptp->base);
+	iounmap(ptp_qoriq->base);
 no_ioremap:
-	release_resource(qoriq_ptp->rsrc);
+	release_resource(ptp_qoriq->rsrc);
 no_resource:
-	free_irq(qoriq_ptp->irq, qoriq_ptp);
+	free_irq(ptp_qoriq->irq, ptp_qoriq);
 no_config:
 no_node:
-	kfree(qoriq_ptp);
+	kfree(ptp_qoriq);
 no_memory:
 	return err;
 }
 
-static int qoriq_ptp_remove(struct platform_device *dev)
+static int ptp_qoriq_remove(struct platform_device *dev)
 {
-	struct qoriq_ptp *qoriq_ptp = platform_get_drvdata(dev);
-	struct qoriq_ptp_registers *regs = &qoriq_ptp->regs;
+	struct ptp_qoriq *ptp_qoriq = platform_get_drvdata(dev);
+	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
 
 	qoriq_write(&regs->ctrl_regs->tmr_temask, 0);
 	qoriq_write(&regs->ctrl_regs->tmr_ctrl,   0);
 
-	ptp_qoriq_remove_debugfs(qoriq_ptp);
-	ptp_clock_unregister(qoriq_ptp->clock);
-	iounmap(qoriq_ptp->base);
-	release_resource(qoriq_ptp->rsrc);
-	free_irq(qoriq_ptp->irq, qoriq_ptp);
-	kfree(qoriq_ptp);
+	ptp_qoriq_remove_debugfs(ptp_qoriq);
+	ptp_clock_unregister(ptp_qoriq->clock);
+	iounmap(ptp_qoriq->base);
+	release_resource(ptp_qoriq->rsrc);
+	free_irq(ptp_qoriq->irq, ptp_qoriq);
+	kfree(ptp_qoriq);
 
 	return 0;
 }
@@ -616,16 +616,16 @@ static int qoriq_ptp_remove(struct platform_device *dev)
 };
 MODULE_DEVICE_TABLE(of, match_table);
 
-static struct platform_driver qoriq_ptp_driver = {
+static struct platform_driver ptp_qoriq_driver = {
 	.driver = {
 		.name		= "ptp_qoriq",
 		.of_match_table	= match_table,
 	},
-	.probe       = qoriq_ptp_probe,
-	.remove      = qoriq_ptp_remove,
+	.probe       = ptp_qoriq_probe,
+	.remove      = ptp_qoriq_remove,
 };
 
-module_platform_driver(qoriq_ptp_driver);
+module_platform_driver(ptp_qoriq_driver);
 
 MODULE_AUTHOR("Richard Cochran <richardcochran@gmail.com>");
 MODULE_DESCRIPTION("PTP clock for Freescale QorIQ 1588 timer");
diff --git a/drivers/ptp/ptp_qoriq_debugfs.c b/drivers/ptp/ptp_qoriq_debugfs.c
index 9705950..3a70daf 100644
--- a/drivers/ptp/ptp_qoriq_debugfs.c
+++ b/drivers/ptp/ptp_qoriq_debugfs.c
@@ -7,8 +7,8 @@
 
 static int ptp_qoriq_fiper1_lpbk_get(void *data, u64 *val)
 {
-	struct qoriq_ptp *qoriq_ptp = data;
-	struct qoriq_ptp_registers *regs = &qoriq_ptp->regs;
+	struct ptp_qoriq *ptp_qoriq = data;
+	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
 	u32 ctrl;
 
 	ctrl = qoriq_read(&regs->ctrl_regs->tmr_ctrl);
@@ -19,8 +19,8 @@ static int ptp_qoriq_fiper1_lpbk_get(void *data, u64 *val)
 
 static int ptp_qoriq_fiper1_lpbk_set(void *data, u64 val)
 {
-	struct qoriq_ptp *qoriq_ptp = data;
-	struct qoriq_ptp_registers *regs = &qoriq_ptp->regs;
+	struct ptp_qoriq *ptp_qoriq = data;
+	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
 	u32 ctrl;
 
 	ctrl = qoriq_read(&regs->ctrl_regs->tmr_ctrl);
@@ -38,8 +38,8 @@ static int ptp_qoriq_fiper1_lpbk_set(void *data, u64 val)
 
 static int ptp_qoriq_fiper2_lpbk_get(void *data, u64 *val)
 {
-	struct qoriq_ptp *qoriq_ptp = data;
-	struct qoriq_ptp_registers *regs = &qoriq_ptp->regs;
+	struct ptp_qoriq *ptp_qoriq = data;
+	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
 	u32 ctrl;
 
 	ctrl = qoriq_read(&regs->ctrl_regs->tmr_ctrl);
@@ -50,8 +50,8 @@ static int ptp_qoriq_fiper2_lpbk_get(void *data, u64 *val)
 
 static int ptp_qoriq_fiper2_lpbk_set(void *data, u64 val)
 {
-	struct qoriq_ptp *qoriq_ptp = data;
-	struct qoriq_ptp_registers *regs = &qoriq_ptp->regs;
+	struct ptp_qoriq *ptp_qoriq = data;
+	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
 	u32 ctrl;
 
 	ctrl = qoriq_read(&regs->ctrl_regs->tmr_ctrl);
@@ -67,35 +67,35 @@ static int ptp_qoriq_fiper2_lpbk_set(void *data, u64 val)
 DEFINE_DEBUGFS_ATTRIBUTE(ptp_qoriq_fiper2_fops, ptp_qoriq_fiper2_lpbk_get,
 			 ptp_qoriq_fiper2_lpbk_set, "%llu\n");
 
-void ptp_qoriq_create_debugfs(struct qoriq_ptp *qoriq_ptp)
+void ptp_qoriq_create_debugfs(struct ptp_qoriq *ptp_qoriq)
 {
 	struct dentry *root;
 
-	root = debugfs_create_dir(dev_name(qoriq_ptp->dev), NULL);
+	root = debugfs_create_dir(dev_name(ptp_qoriq->dev), NULL);
 	if (IS_ERR(root))
 		return;
 	if (!root)
 		goto err_root;
 
-	qoriq_ptp->debugfs_root = root;
+	ptp_qoriq->debugfs_root = root;
 
 	if (!debugfs_create_file_unsafe("fiper1-loopback", 0600, root,
-					qoriq_ptp, &ptp_qoriq_fiper1_fops))
+					ptp_qoriq, &ptp_qoriq_fiper1_fops))
 		goto err_node;
 	if (!debugfs_create_file_unsafe("fiper2-loopback", 0600, root,
-					qoriq_ptp, &ptp_qoriq_fiper2_fops))
+					ptp_qoriq, &ptp_qoriq_fiper2_fops))
 		goto err_node;
 	return;
 
 err_node:
 	debugfs_remove_recursive(root);
-	qoriq_ptp->debugfs_root = NULL;
+	ptp_qoriq->debugfs_root = NULL;
 err_root:
-	dev_err(qoriq_ptp->dev, "failed to initialize debugfs\n");
+	dev_err(ptp_qoriq->dev, "failed to initialize debugfs\n");
 }
 
-void ptp_qoriq_remove_debugfs(struct qoriq_ptp *qoriq_ptp)
+void ptp_qoriq_remove_debugfs(struct ptp_qoriq *ptp_qoriq)
 {
-	debugfs_remove_recursive(qoriq_ptp->debugfs_root);
-	qoriq_ptp->debugfs_root = NULL;
+	debugfs_remove_recursive(ptp_qoriq->debugfs_root);
+	ptp_qoriq->debugfs_root = NULL;
 }
diff --git a/include/linux/fsl/ptp_qoriq.h b/include/linux/fsl/ptp_qoriq.h
index 94e9797..c2a32d9 100644
--- a/include/linux/fsl/ptp_qoriq.h
+++ b/include/linux/fsl/ptp_qoriq.h
@@ -49,7 +49,7 @@ struct etts_regs {
 	u32 tmr_etts2_l;  /* Timestamp of general purpose external trigger */
 };
 
-struct qoriq_ptp_registers {
+struct ptp_qoriq_registers {
 	struct ctrl_regs __iomem *ctrl_regs;
 	struct alarm_regs __iomem *alarm_regs;
 	struct fiper_regs __iomem *fiper_regs;
@@ -136,9 +136,9 @@ struct qoriq_ptp_registers {
 #define DEFAULT_FIPER1_PERIOD	1000000000
 #define DEFAULT_FIPER2_PERIOD	100000
 
-struct qoriq_ptp {
+struct ptp_qoriq {
 	void __iomem *base;
-	struct qoriq_ptp_registers regs;
+	struct ptp_qoriq_registers regs;
 	spinlock_t lock; /* protects regs */
 	struct ptp_clock *clock;
 	struct ptp_clock_info caps;
@@ -172,12 +172,12 @@ static inline void qoriq_write(unsigned __iomem *addr, u32 val)
 }
 
 #ifdef CONFIG_DEBUG_FS
-void ptp_qoriq_create_debugfs(struct qoriq_ptp *qoriq_ptp);
-void ptp_qoriq_remove_debugfs(struct qoriq_ptp *qoriq_ptp);
+void ptp_qoriq_create_debugfs(struct ptp_qoriq *ptp_qoriq);
+void ptp_qoriq_remove_debugfs(struct ptp_qoriq *ptp_qoriq);
 #else
-static inline void ptp_qoriq_create_debugfs(struct qoriq_ptp *qoriq_ptp)
+static inline void ptp_qoriq_create_debugfs(struct ptp_qoriq *ptp_qoriq)
 { }
-static inline void ptp_qoriq_remove_debugfs(struct qoriq_ptp *qoriq_ptp)
+static inline void ptp_qoriq_remove_debugfs(struct ptp_qoriq *ptp_qoriq)
 { }
 #endif
 
-- 
1.7.1


^ permalink raw reply related

* [PATCH 6/9] ptp_qoriq: fix register memory map
From: Yangbo Lu @ 2019-01-31  2:33 UTC (permalink / raw)
  To: netdev, devicetree
  Cc: David S . Miller, Richard Cochran, Rob Herring, Claudiu Manoil,
	Yangbo Lu
In-Reply-To: <20190131023310.9563-1-yangbo.lu@nxp.com>

The 1588 timer on eTSEC Ethernet controller uses different
register memory map with DPAA Ethernet controller.
Now the new ENETC Ethernet controller uses same reigster
memory map with DPAA. To support ENETC, let's use register
memory map of DPAA/ENETC in default.

Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
---
 drivers/ptp/ptp_qoriq.c       |   11 ++++++-----
 include/linux/fsl/ptp_qoriq.h |   16 ++++++++--------
 2 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/drivers/ptp/ptp_qoriq.c b/drivers/ptp/ptp_qoriq.c
index 4308e25..af7a70e 100644
--- a/drivers/ptp/ptp_qoriq.c
+++ b/drivers/ptp/ptp_qoriq.c
@@ -504,11 +504,12 @@ int ptp_qoriq_init(struct ptp_qoriq *ptp_qoriq, void __iomem *base,
 		ptp_qoriq->write = qoriq_write_be;
 	}
 
-	if (of_device_is_compatible(node, "fsl,fman-ptp-timer")) {
-		ptp_qoriq->regs.ctrl_regs = base + FMAN_CTRL_REGS_OFFSET;
-		ptp_qoriq->regs.alarm_regs = base + FMAN_ALARM_REGS_OFFSET;
-		ptp_qoriq->regs.fiper_regs = base + FMAN_FIPER_REGS_OFFSET;
-		ptp_qoriq->regs.etts_regs = base + FMAN_ETTS_REGS_OFFSET;
+	/* The eTSEC uses differnt memory map with DPAA/ENETC */
+	if (of_device_is_compatible(node, "fsl,etsec-ptp")) {
+		ptp_qoriq->regs.ctrl_regs = base + ETSEC_CTRL_REGS_OFFSET;
+		ptp_qoriq->regs.alarm_regs = base + ETSEC_ALARM_REGS_OFFSET;
+		ptp_qoriq->regs.fiper_regs = base + ETSEC_FIPER_REGS_OFFSET;
+		ptp_qoriq->regs.etts_regs = base + ETSEC_ETTS_REGS_OFFSET;
 	} else {
 		ptp_qoriq->regs.ctrl_regs = base + CTRL_REGS_OFFSET;
 		ptp_qoriq->regs.alarm_regs = base + ALARM_REGS_OFFSET;
diff --git a/include/linux/fsl/ptp_qoriq.h b/include/linux/fsl/ptp_qoriq.h
index b44b466..902d3b1 100644
--- a/include/linux/fsl/ptp_qoriq.h
+++ b/include/linux/fsl/ptp_qoriq.h
@@ -58,15 +58,15 @@ struct ptp_qoriq_registers {
 };
 
 /* Offset definitions for the four register groups */
-#define CTRL_REGS_OFFSET	0x0
-#define ALARM_REGS_OFFSET	0x40
-#define FIPER_REGS_OFFSET	0x80
-#define ETTS_REGS_OFFSET	0xa0
-
-#define FMAN_CTRL_REGS_OFFSET	0x80
-#define FMAN_ALARM_REGS_OFFSET	0xb8
-#define FMAN_FIPER_REGS_OFFSET	0xd0
-#define FMAN_ETTS_REGS_OFFSET	0xe0
+#define ETSEC_CTRL_REGS_OFFSET	0x0
+#define ETSEC_ALARM_REGS_OFFSET	0x40
+#define ETSEC_FIPER_REGS_OFFSET	0x80
+#define ETSEC_ETTS_REGS_OFFSET	0xa0
+
+#define CTRL_REGS_OFFSET	0x80
+#define ALARM_REGS_OFFSET	0xb8
+#define FIPER_REGS_OFFSET	0xd0
+#define ETTS_REGS_OFFSET	0xe0
 
 
 /* Bit definitions for the TMR_CTRL register */
-- 
1.7.1


^ permalink raw reply related

* [PATCH 8/9] enetc: add PTP clock driver
From: Yangbo Lu @ 2019-01-31  2:33 UTC (permalink / raw)
  To: netdev, devicetree
  Cc: David S . Miller, Richard Cochran, Rob Herring, Claudiu Manoil,
	Yangbo Lu
In-Reply-To: <20190131023310.9563-1-yangbo.lu@nxp.com>

This patch is to add PTP clock driver for ENETC.
The driver reused QorIQ PTP clock driver.

Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
---
 drivers/net/ethernet/freescale/enetc/Kconfig     |   12 ++
 drivers/net/ethernet/freescale/enetc/Makefile    |    3 +
 drivers/net/ethernet/freescale/enetc/enetc_hw.h  |    5 +-
 drivers/net/ethernet/freescale/enetc/enetc_ptp.c |  151 ++++++++++++++++++++++
 4 files changed, 169 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/ethernet/freescale/enetc/enetc_ptp.c

diff --git a/drivers/net/ethernet/freescale/enetc/Kconfig b/drivers/net/ethernet/freescale/enetc/Kconfig
index f9dd26f..8429f5c 100644
--- a/drivers/net/ethernet/freescale/enetc/Kconfig
+++ b/drivers/net/ethernet/freescale/enetc/Kconfig
@@ -17,3 +17,15 @@ config FSL_ENETC_VF
 	  virtual function (VF) devices enabled by the ENETC PF driver.
 
 	  If compiled as module (M), the module name is fsl-enetc-vf.
+
+config FSL_ENETC_PTP_CLOCK
+	tristate "ENETC PTP clock driver"
+	depends on PTP_1588_CLOCK_QORIQ && (FSL_ENETC || FSL_ENETC_VF)
+	default y
+	help
+	  This driver adds support for using the ENETC 1588 timer
+	  as a PTP clock. This clock is only useful if your PTP
+	  programs are getting hardware time stamps on the PTP Ethernet
+	  packets using the SO_TIMESTAMPING API.
+
+	  If compiled as module (M), the module name is fsl-enetc-ptp.
diff --git a/drivers/net/ethernet/freescale/enetc/Makefile b/drivers/net/ethernet/freescale/enetc/Makefile
index 9529b01..6976602 100644
--- a/drivers/net/ethernet/freescale/enetc/Makefile
+++ b/drivers/net/ethernet/freescale/enetc/Makefile
@@ -13,3 +13,6 @@ fsl-enetc-vf-$(CONFIG_FSL_ENETC_VF) += enetc.o enetc_cbdr.o \
 				       enetc_ethtool.o
 fsl-enetc-vf-objs := enetc_vf.o $(fsl-enetc-vf-y)
 endif
+
+obj-$(CONFIG_FSL_ENETC_PTP_CLOCK) += fsl-enetc-ptp.o
+fsl-enetc-ptp-$(CONFIG_FSL_ENETC_PTP_CLOCK) += enetc_ptp.o
diff --git a/drivers/net/ethernet/freescale/enetc/enetc_hw.h b/drivers/net/ethernet/freescale/enetc/enetc_hw.h
index efa0b1a..df8eb88 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc_hw.h
+++ b/drivers/net/ethernet/freescale/enetc/enetc_hw.h
@@ -4,8 +4,9 @@
 #include <linux/bitops.h>
 
 /* ENETC device IDs */
-#define ENETC_DEV_ID_PF	0xe100
-#define ENETC_DEV_ID_VF	0xef00
+#define ENETC_DEV_ID_PF		0xe100
+#define ENETC_DEV_ID_VF		0xef00
+#define ENETC_DEV_ID_PTP	0xee02
 
 /* ENETC register block BAR */
 #define ENETC_BAR_REGS	0
diff --git a/drivers/net/ethernet/freescale/enetc/enetc_ptp.c b/drivers/net/ethernet/freescale/enetc/enetc_ptp.c
new file mode 100644
index 0000000..1cebb4c
--- /dev/null
+++ b/drivers/net/ethernet/freescale/enetc/enetc_ptp.c
@@ -0,0 +1,151 @@
+// SPDX-License-Identifier: (GPL-2.0+ OR BSD-3-Clause)
+/* Copyright 2019 NXP */
+
+#include <linux/module.h>
+#include <linux/of.h>
+#include <linux/fsl/ptp_qoriq.h>
+
+#include "enetc.h"
+
+static struct ptp_clock_info enetc_ptp_caps = {
+	.owner		= THIS_MODULE,
+	.name		= "ENETC PTP clock",
+	.max_adj	= 512000,
+	.n_alarm	= 0,
+	.n_ext_ts	= 2,
+	.n_per_out	= 0,
+	.n_pins		= 0,
+	.pps		= 1,
+	.adjfine	= ptp_qoriq_adjfine,
+	.adjtime	= ptp_qoriq_adjtime,
+	.gettime64	= ptp_qoriq_gettime,
+	.settime64	= ptp_qoriq_settime,
+	.enable		= ptp_qoriq_enable,
+};
+
+static int enetc_ptp_probe(struct pci_dev *pdev,
+			   const struct pci_device_id *ent)
+{
+	struct ptp_qoriq *ptp_qoriq;
+	void __iomem *base;
+	int err, len, n;
+
+	if (pdev->dev.of_node && !of_device_is_available(pdev->dev.of_node)) {
+		dev_info(&pdev->dev, "device is disabled, skipping\n");
+		return -ENODEV;
+	}
+
+	err = pci_enable_device_mem(pdev);
+	if (err) {
+		dev_err(&pdev->dev, "device enable failed\n");
+		return err;
+	}
+
+	/* set up for high or low dma */
+	err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
+	if (err) {
+		err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
+		if (err) {
+			dev_err(&pdev->dev,
+				"DMA configuration failed: 0x%x\n", err);
+			goto err_dma;
+		}
+	}
+
+	err = pci_request_mem_regions(pdev, KBUILD_MODNAME);
+	if (err) {
+		dev_err(&pdev->dev, "pci_request_regions failed err=%d\n", err);
+		goto err_pci_mem_reg;
+	}
+
+	pci_set_master(pdev);
+
+	ptp_qoriq = kzalloc(sizeof(*ptp_qoriq), GFP_KERNEL);
+	if (!ptp_qoriq) {
+		err = -ENOMEM;
+		goto err_alloc_ptp;
+	}
+
+	len = pci_resource_len(pdev, ENETC_BAR_REGS);
+
+	base = ioremap(pci_resource_start(pdev, ENETC_BAR_REGS), len);
+	if (!base) {
+		err = -ENXIO;
+		dev_err(&pdev->dev, "ioremap() failed\n");
+		goto err_ioremap;
+	}
+
+	/* Allocate 1 interrupt */
+	n = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSIX);
+	if (n != 1) {
+		err = -EPERM;
+		goto err_irq;
+	}
+
+	ptp_qoriq->irq = pci_irq_vector(pdev, 0);
+
+	err = request_irq(ptp_qoriq->irq, ptp_qoriq_isr, 0, DRIVER, ptp_qoriq);
+	if (err) {
+		dev_err(&pdev->dev, "request_irq() failed!\n");
+		goto err_irq;
+	}
+
+	ptp_qoriq->dev = &pdev->dev;
+
+	err = ptp_qoriq_init(ptp_qoriq, base, enetc_ptp_caps);
+	if (err)
+		goto err_no_clock;
+
+	pci_set_drvdata(pdev, ptp_qoriq);
+
+	return 0;
+
+err_no_clock:
+	free_irq(ptp_qoriq->irq, ptp_qoriq);
+err_irq:
+	iounmap(base);
+err_ioremap:
+	kfree(ptp_qoriq);
+err_alloc_ptp:
+	pci_release_mem_regions(pdev);
+err_pci_mem_reg:
+err_dma:
+	pci_disable_device(pdev);
+
+	return err;
+}
+
+static void enetc_ptp_remove(struct pci_dev *pdev)
+{
+	struct ptp_qoriq *ptp_qoriq = pci_get_drvdata(pdev);
+	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
+
+	ptp_qoriq->write(&regs->ctrl_regs->tmr_temask, 0);
+	ptp_qoriq->write(&regs->ctrl_regs->tmr_ctrl,   0);
+
+	ptp_qoriq_remove_debugfs(ptp_qoriq);
+	ptp_clock_unregister(ptp_qoriq->clock);
+	free_irq(ptp_qoriq->irq, ptp_qoriq);
+	iounmap(ptp_qoriq->base);
+	kfree(ptp_qoriq);
+
+	pci_release_mem_regions(pdev);
+	pci_disable_device(pdev);
+}
+
+static const struct pci_device_id enetc_ptp_id_table[] = {
+	{ PCI_DEVICE(PCI_VENDOR_ID_FREESCALE, ENETC_DEV_ID_PTP) },
+	{ 0, } /* End of table. */
+};
+MODULE_DEVICE_TABLE(pci, enetc_ptp_id_table);
+
+static struct pci_driver enetc_ptp_driver = {
+	.name = KBUILD_MODNAME,
+	.id_table = enetc_ptp_id_table,
+	.probe = enetc_ptp_probe,
+	.remove = enetc_ptp_remove,
+};
+module_pci_driver(enetc_ptp_driver);
+
+MODULE_DESCRIPTION("ENETC PTP clock driver");
+MODULE_LICENSE("Dual BSD/GPL");
-- 
1.7.1


^ permalink raw reply related

* [PATCH 5/9] dt-binding: ptp_qoriq: add little-endian support
From: Yangbo Lu @ 2019-01-31  2:33 UTC (permalink / raw)
  To: netdev, devicetree
  Cc: David S . Miller, Richard Cochran, Rob Herring, Claudiu Manoil,
	Yangbo Lu
In-Reply-To: <20190131023310.9563-1-yangbo.lu@nxp.com>

Specify "little-endian" property if the 1588 timer IP block
is little-endian mode. The default endian mode is big-endian.

Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
---
 .../devicetree/bindings/ptp/ptp-qoriq.txt          |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt b/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt
index 8e7f855..454c937 100644
--- a/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt
+++ b/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt
@@ -19,6 +19,9 @@ Clock Properties:
   - fsl,max-adj      Maximum frequency adjustment in parts per billion.
   - fsl,extts-fifo   The presence of this property indicates hardware
 		     support for the external trigger stamp FIFO.
+  - little-endian    The presence of this property indicates the 1588 timer
+		     IP block is little-endian mode. The default endian mode
+		     is big-endian.
 
   These properties set the operational parameters for the PTP
   clock. You must choose these carefully for the clock to work right.
-- 
1.7.1


^ permalink raw reply related

* [PATCH 4/9] ptp_qoriq: add little enadian support
From: Yangbo Lu @ 2019-01-31  2:33 UTC (permalink / raw)
  To: netdev, devicetree
  Cc: David S . Miller, Richard Cochran, Rob Herring, Claudiu Manoil,
	Yangbo Lu
In-Reply-To: <20190131023310.9563-1-yangbo.lu@nxp.com>

There is QorIQ 1588 timer IP block on the new ENETC Ethernet
controller. However it uses little endian mode which is different
with before. This patch is to add little endian support for the
driver by using "little-endian" dts node property.

Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
---
 drivers/ptp/ptp_qoriq.c         |   69 ++++++++++++++++++++++-----------------
 drivers/ptp/ptp_qoriq_debugfs.c |   12 +++---
 include/linux/fsl/ptp_qoriq.h   |   21 ++++++++---
 3 files changed, 60 insertions(+), 42 deletions(-)

diff --git a/drivers/ptp/ptp_qoriq.c b/drivers/ptp/ptp_qoriq.c
index 63896b5..4308e25 100644
--- a/drivers/ptp/ptp_qoriq.c
+++ b/drivers/ptp/ptp_qoriq.c
@@ -43,8 +43,8 @@ static u64 tmr_cnt_read(struct ptp_qoriq *ptp_qoriq)
 	u64 ns;
 	u32 lo, hi;
 
-	lo = qoriq_read(&regs->ctrl_regs->tmr_cnt_l);
-	hi = qoriq_read(&regs->ctrl_regs->tmr_cnt_h);
+	lo = ptp_qoriq->read(&regs->ctrl_regs->tmr_cnt_l);
+	hi = ptp_qoriq->read(&regs->ctrl_regs->tmr_cnt_h);
 	ns = ((u64) hi) << 32;
 	ns |= lo;
 	return ns;
@@ -57,8 +57,8 @@ static void tmr_cnt_write(struct ptp_qoriq *ptp_qoriq, u64 ns)
 	u32 hi = ns >> 32;
 	u32 lo = ns & 0xffffffff;
 
-	qoriq_write(&regs->ctrl_regs->tmr_cnt_l, lo);
-	qoriq_write(&regs->ctrl_regs->tmr_cnt_h, hi);
+	ptp_qoriq->write(&regs->ctrl_regs->tmr_cnt_l, lo);
+	ptp_qoriq->write(&regs->ctrl_regs->tmr_cnt_h, hi);
 }
 
 /* Caller must hold ptp_qoriq->lock. */
@@ -73,8 +73,8 @@ static void set_alarm(struct ptp_qoriq *ptp_qoriq)
 	ns -= ptp_qoriq->tclk_period;
 	hi = ns >> 32;
 	lo = ns & 0xffffffff;
-	qoriq_write(&regs->alarm_regs->tmr_alarm1_l, lo);
-	qoriq_write(&regs->alarm_regs->tmr_alarm1_h, hi);
+	ptp_qoriq->write(&regs->alarm_regs->tmr_alarm1_l, lo);
+	ptp_qoriq->write(&regs->alarm_regs->tmr_alarm1_h, hi);
 }
 
 /* Caller must hold ptp_qoriq->lock. */
@@ -83,8 +83,8 @@ static void set_fipers(struct ptp_qoriq *ptp_qoriq)
 	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
 
 	set_alarm(ptp_qoriq);
-	qoriq_write(&regs->fiper_regs->tmr_fiper1, ptp_qoriq->tmr_fiper1);
-	qoriq_write(&regs->fiper_regs->tmr_fiper2, ptp_qoriq->tmr_fiper2);
+	ptp_qoriq->write(&regs->fiper_regs->tmr_fiper1, ptp_qoriq->tmr_fiper1);
+	ptp_qoriq->write(&regs->fiper_regs->tmr_fiper2, ptp_qoriq->tmr_fiper2);
 }
 
 static int extts_clean_up(struct ptp_qoriq *ptp_qoriq, int index,
@@ -115,8 +115,8 @@ static int extts_clean_up(struct ptp_qoriq *ptp_qoriq, int index,
 	event.index = index;
 
 	do {
-		lo = qoriq_read(reg_etts_l);
-		hi = qoriq_read(reg_etts_h);
+		lo = ptp_qoriq->read(reg_etts_l);
+		hi = ptp_qoriq->read(reg_etts_h);
 
 		if (update_event) {
 			event.timestamp = ((u64) hi) << 32;
@@ -124,7 +124,7 @@ static int extts_clean_up(struct ptp_qoriq *ptp_qoriq, int index,
 			ptp_clock_event(ptp_qoriq->clock, &event);
 		}
 
-		stat = qoriq_read(&regs->ctrl_regs->tmr_stat);
+		stat = ptp_qoriq->read(&regs->ctrl_regs->tmr_stat);
 	} while (ptp_qoriq->extts_fifo_support && (stat & valid));
 
 	return 0;
@@ -144,8 +144,8 @@ irqreturn_t ptp_qoriq_isr(int irq, void *priv)
 
 	spin_lock(&ptp_qoriq->lock);
 
-	val = qoriq_read(&regs->ctrl_regs->tmr_tevent);
-	mask = qoriq_read(&regs->ctrl_regs->tmr_temask);
+	val = ptp_qoriq->read(&regs->ctrl_regs->tmr_tevent);
+	mask = ptp_qoriq->read(&regs->ctrl_regs->tmr_temask);
 
 	spin_unlock(&ptp_qoriq->lock);
 
@@ -173,14 +173,14 @@ irqreturn_t ptp_qoriq_isr(int irq, void *priv)
 			ns = ptp_qoriq->alarm_value + ptp_qoriq->alarm_interval;
 			hi = ns >> 32;
 			lo = ns & 0xffffffff;
-			qoriq_write(&regs->alarm_regs->tmr_alarm2_l, lo);
-			qoriq_write(&regs->alarm_regs->tmr_alarm2_h, hi);
+			ptp_qoriq->write(&regs->alarm_regs->tmr_alarm2_l, lo);
+			ptp_qoriq->write(&regs->alarm_regs->tmr_alarm2_h, hi);
 			ptp_qoriq->alarm_value = ns;
 		} else {
 			spin_lock(&ptp_qoriq->lock);
-			mask = qoriq_read(&regs->ctrl_regs->tmr_temask);
+			mask = ptp_qoriq->read(&regs->ctrl_regs->tmr_temask);
 			mask &= ~ALM2EN;
-			qoriq_write(&regs->ctrl_regs->tmr_temask, mask);
+			ptp_qoriq->write(&regs->ctrl_regs->tmr_temask, mask);
 			spin_unlock(&ptp_qoriq->lock);
 			ptp_qoriq->alarm_value = 0;
 			ptp_qoriq->alarm_interval = 0;
@@ -194,7 +194,7 @@ irqreturn_t ptp_qoriq_isr(int irq, void *priv)
 	}
 
 	if (ack) {
-		qoriq_write(&regs->ctrl_regs->tmr_tevent, ack);
+		ptp_qoriq->write(&regs->ctrl_regs->tmr_tevent, ack);
 		return IRQ_HANDLED;
 	} else
 		return IRQ_NONE;
@@ -229,7 +229,7 @@ int ptp_qoriq_adjfine(struct ptp_clock_info *ptp, long scaled_ppm)
 
 	tmr_add = neg_adj ? tmr_add - diff : tmr_add + diff;
 
-	qoriq_write(&regs->ctrl_regs->tmr_add, tmr_add);
+	ptp_qoriq->write(&regs->ctrl_regs->tmr_add, tmr_add);
 
 	return 0;
 }
@@ -326,15 +326,15 @@ int ptp_qoriq_enable(struct ptp_clock_info *ptp,
 
 	spin_lock_irqsave(&ptp_qoriq->lock, flags);
 
-	mask = qoriq_read(&regs->ctrl_regs->tmr_temask);
+	mask = ptp_qoriq->read(&regs->ctrl_regs->tmr_temask);
 	if (on) {
 		mask |= bit;
-		qoriq_write(&regs->ctrl_regs->tmr_tevent, bit);
+		ptp_qoriq->write(&regs->ctrl_regs->tmr_tevent, bit);
 	} else {
 		mask &= ~bit;
 	}
 
-	qoriq_write(&regs->ctrl_regs->tmr_temask, mask);
+	ptp_qoriq->write(&regs->ctrl_regs->tmr_temask, mask);
 
 	spin_unlock_irqrestore(&ptp_qoriq->lock, flags);
 	return 0;
@@ -496,6 +496,14 @@ int ptp_qoriq_init(struct ptp_qoriq *ptp_qoriq, void __iomem *base,
 			return -ENODEV;
 	}
 
+	if (of_property_read_bool(node, "little-endian")) {
+		ptp_qoriq->read = qoriq_read_le;
+		ptp_qoriq->write = qoriq_write_le;
+	} else {
+		ptp_qoriq->read = qoriq_read_be;
+		ptp_qoriq->write = qoriq_write_be;
+	}
+
 	if (of_device_is_compatible(node, "fsl,fman-ptp-timer")) {
 		ptp_qoriq->regs.ctrl_regs = base + FMAN_CTRL_REGS_OFFSET;
 		ptp_qoriq->regs.alarm_regs = base + FMAN_ALARM_REGS_OFFSET;
@@ -519,13 +527,14 @@ int ptp_qoriq_init(struct ptp_qoriq *ptp_qoriq, void __iomem *base,
 	spin_lock_irqsave(&ptp_qoriq->lock, flags);
 
 	regs = &ptp_qoriq->regs;
-	qoriq_write(&regs->ctrl_regs->tmr_ctrl,   tmr_ctrl);
-	qoriq_write(&regs->ctrl_regs->tmr_add,    ptp_qoriq->tmr_add);
-	qoriq_write(&regs->ctrl_regs->tmr_prsc,   ptp_qoriq->tmr_prsc);
-	qoriq_write(&regs->fiper_regs->tmr_fiper1, ptp_qoriq->tmr_fiper1);
-	qoriq_write(&regs->fiper_regs->tmr_fiper2, ptp_qoriq->tmr_fiper2);
+	ptp_qoriq->write(&regs->ctrl_regs->tmr_ctrl, tmr_ctrl);
+	ptp_qoriq->write(&regs->ctrl_regs->tmr_add, ptp_qoriq->tmr_add);
+	ptp_qoriq->write(&regs->ctrl_regs->tmr_prsc, ptp_qoriq->tmr_prsc);
+	ptp_qoriq->write(&regs->fiper_regs->tmr_fiper1, ptp_qoriq->tmr_fiper1);
+	ptp_qoriq->write(&regs->fiper_regs->tmr_fiper2, ptp_qoriq->tmr_fiper2);
 	set_alarm(ptp_qoriq);
-	qoriq_write(&regs->ctrl_regs->tmr_ctrl,   tmr_ctrl|FIPERST|RTPE|TE|FRD);
+	ptp_qoriq->write(&regs->ctrl_regs->tmr_ctrl,
+			 tmr_ctrl|FIPERST|RTPE|TE|FRD);
 
 	spin_unlock_irqrestore(&ptp_qoriq->lock, flags);
 
@@ -605,8 +614,8 @@ static int ptp_qoriq_remove(struct platform_device *dev)
 	struct ptp_qoriq *ptp_qoriq = platform_get_drvdata(dev);
 	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
 
-	qoriq_write(&regs->ctrl_regs->tmr_temask, 0);
-	qoriq_write(&regs->ctrl_regs->tmr_ctrl,   0);
+	ptp_qoriq->write(&regs->ctrl_regs->tmr_temask, 0);
+	ptp_qoriq->write(&regs->ctrl_regs->tmr_ctrl,   0);
 
 	ptp_qoriq_remove_debugfs(ptp_qoriq);
 	ptp_clock_unregister(ptp_qoriq->clock);
diff --git a/drivers/ptp/ptp_qoriq_debugfs.c b/drivers/ptp/ptp_qoriq_debugfs.c
index 3a70daf..e8dddce 100644
--- a/drivers/ptp/ptp_qoriq_debugfs.c
+++ b/drivers/ptp/ptp_qoriq_debugfs.c
@@ -11,7 +11,7 @@ static int ptp_qoriq_fiper1_lpbk_get(void *data, u64 *val)
 	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
 	u32 ctrl;
 
-	ctrl = qoriq_read(&regs->ctrl_regs->tmr_ctrl);
+	ctrl = ptp_qoriq->read(&regs->ctrl_regs->tmr_ctrl);
 	*val = ctrl & PP1L ? 1 : 0;
 
 	return 0;
@@ -23,13 +23,13 @@ static int ptp_qoriq_fiper1_lpbk_set(void *data, u64 val)
 	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
 	u32 ctrl;
 
-	ctrl = qoriq_read(&regs->ctrl_regs->tmr_ctrl);
+	ctrl = ptp_qoriq->read(&regs->ctrl_regs->tmr_ctrl);
 	if (val == 0)
 		ctrl &= ~PP1L;
 	else
 		ctrl |= PP1L;
 
-	qoriq_write(&regs->ctrl_regs->tmr_ctrl, ctrl);
+	ptp_qoriq->write(&regs->ctrl_regs->tmr_ctrl, ctrl);
 	return 0;
 }
 
@@ -42,7 +42,7 @@ static int ptp_qoriq_fiper2_lpbk_get(void *data, u64 *val)
 	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
 	u32 ctrl;
 
-	ctrl = qoriq_read(&regs->ctrl_regs->tmr_ctrl);
+	ctrl = ptp_qoriq->read(&regs->ctrl_regs->tmr_ctrl);
 	*val = ctrl & PP2L ? 1 : 0;
 
 	return 0;
@@ -54,13 +54,13 @@ static int ptp_qoriq_fiper2_lpbk_set(void *data, u64 val)
 	struct ptp_qoriq_registers *regs = &ptp_qoriq->regs;
 	u32 ctrl;
 
-	ctrl = qoriq_read(&regs->ctrl_regs->tmr_ctrl);
+	ctrl = ptp_qoriq->read(&regs->ctrl_regs->tmr_ctrl);
 	if (val == 0)
 		ctrl &= ~PP2L;
 	else
 		ctrl |= PP2L;
 
-	qoriq_write(&regs->ctrl_regs->tmr_ctrl, ctrl);
+	ptp_qoriq->write(&regs->ctrl_regs->tmr_ctrl, ctrl);
 	return 0;
 }
 
diff --git a/include/linux/fsl/ptp_qoriq.h b/include/linux/fsl/ptp_qoriq.h
index d0c1e42..b44b466 100644
--- a/include/linux/fsl/ptp_qoriq.h
+++ b/include/linux/fsl/ptp_qoriq.h
@@ -157,21 +157,30 @@ struct ptp_qoriq {
 	u32 cksel;
 	u32 tmr_fiper1;
 	u32 tmr_fiper2;
+	u32 (*read)(unsigned __iomem *addr);
+	void (*write)(unsigned __iomem *addr, u32 val);
 };
 
-static inline u32 qoriq_read(unsigned __iomem *addr)
+static inline u32 qoriq_read_be(unsigned __iomem *addr)
 {
-	u32 val;
-
-	val = ioread32be(addr);
-	return val;
+	return ioread32be(addr);
 }
 
-static inline void qoriq_write(unsigned __iomem *addr, u32 val)
+static inline void qoriq_write_be(unsigned __iomem *addr, u32 val)
 {
 	iowrite32be(val, addr);
 }
 
+static inline u32 qoriq_read_le(unsigned __iomem *addr)
+{
+	return ioread32(addr);
+}
+
+static inline void qoriq_write_le(unsigned __iomem *addr, u32 val)
+{
+	iowrite32(val, addr);
+}
+
 irqreturn_t ptp_qoriq_isr(int irq, void *priv);
 int ptp_qoriq_init(struct ptp_qoriq *ptp_qoriq, void __iomem *base,
 		   const struct ptp_clock_info caps);
-- 
1.7.1


^ permalink raw reply related

* [PATCH 9/9] MAINTAINERS: add enetc_ptp driver into QorIQ PTP list
From: Yangbo Lu @ 2019-01-31  2:33 UTC (permalink / raw)
  To: netdev, devicetree
  Cc: David S . Miller, Richard Cochran, Rob Herring, Claudiu Manoil,
	Yangbo Lu
In-Reply-To: <20190131023310.9563-1-yangbo.lu@nxp.com>

This patch to add enetc_ptp driver into QorIQ PTP list
for maintaining.

Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
---
 MAINTAINERS |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 019a2bc..f0b50f8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6093,6 +6093,7 @@ FREESCALE QORIQ PTP CLOCK DRIVER
 M:	Yangbo Lu <yangbo.lu@nxp.com>
 L:	netdev@vger.kernel.org
 S:	Maintained
+F:	drivers/net/ethernet/freescale/enetc/enetc_ptp.c
 F:	drivers/ptp/ptp_qoriq.c
 F:	drivers/ptp/ptp_qoriq_debugfs.c
 F:	include/linux/fsl/ptp_qoriq.h
-- 
1.7.1


^ permalink raw reply related

* [PATCH 3/9] ptp_qoriq: convert to use ptp_qoriq_init()
From: Yangbo Lu @ 2019-01-31  2:33 UTC (permalink / raw)
  To: netdev, devicetree
  Cc: David S . Miller, Richard Cochran, Rob Herring, Claudiu Manoil,
	Yangbo Lu
In-Reply-To: <20190131023310.9563-1-yangbo.lu@nxp.com>

Moved QorIQ PTP clock initialization into new function
qoriq_ptp_init(). This function could also be reused
by ENETC PTP drvier which is a PCI driver for same 1588
timer IP block.

Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
---
 drivers/ptp/ptp_qoriq.c       |  120 +++++++++++++++++++++--------------------
 include/linux/fsl/ptp_qoriq.h |    2 +
 2 files changed, 64 insertions(+), 58 deletions(-)

diff --git a/drivers/ptp/ptp_qoriq.c b/drivers/ptp/ptp_qoriq.c
index 1f3e73e..63896b5 100644
--- a/drivers/ptp/ptp_qoriq.c
+++ b/drivers/ptp/ptp_qoriq.c
@@ -458,25 +458,17 @@ static int ptp_qoriq_auto_config(struct ptp_qoriq *ptp_qoriq,
 	return 0;
 }
 
-static int ptp_qoriq_probe(struct platform_device *dev)
+int ptp_qoriq_init(struct ptp_qoriq *ptp_qoriq, void __iomem *base,
+		   const struct ptp_clock_info caps)
 {
-	struct device_node *node = dev->dev.of_node;
-	struct ptp_qoriq *ptp_qoriq;
+	struct device_node *node = ptp_qoriq->dev->of_node;
 	struct ptp_qoriq_registers *regs;
 	struct timespec64 now;
-	int err = -ENOMEM;
-	u32 tmr_ctrl;
 	unsigned long flags;
-	void __iomem *base;
-
-	ptp_qoriq = kzalloc(sizeof(*ptp_qoriq), GFP_KERNEL);
-	if (!ptp_qoriq)
-		goto no_memory;
-
-	err = -EINVAL;
+	u32 tmr_ctrl;
 
-	ptp_qoriq->dev = &dev->dev;
-	ptp_qoriq->caps = ptp_qoriq_caps;
+	ptp_qoriq->base = base;
+	ptp_qoriq->caps = caps;
 
 	if (of_property_read_u32(node, "fsl,cksel", &ptp_qoriq->cksel))
 		ptp_qoriq->cksel = DEFAULT_CKSEL;
@@ -501,44 +493,9 @@ static int ptp_qoriq_probe(struct platform_device *dev)
 		pr_warn("device tree node missing required elements, try automatic configuration\n");
 
 		if (ptp_qoriq_auto_config(ptp_qoriq, node))
-			goto no_config;
+			return -ENODEV;
 	}
 
-	err = -ENODEV;
-
-	ptp_qoriq->irq = platform_get_irq(dev, 0);
-
-	if (ptp_qoriq->irq < 0) {
-		pr_err("irq not in device tree\n");
-		goto no_node;
-	}
-	if (request_irq(ptp_qoriq->irq, ptp_qoriq_isr, IRQF_SHARED,
-			DRIVER, ptp_qoriq)) {
-		pr_err("request_irq failed\n");
-		goto no_node;
-	}
-
-	ptp_qoriq->rsrc = platform_get_resource(dev, IORESOURCE_MEM, 0);
-	if (!ptp_qoriq->rsrc) {
-		pr_err("no resource\n");
-		goto no_resource;
-	}
-	if (request_resource(&iomem_resource, ptp_qoriq->rsrc)) {
-		pr_err("resource busy\n");
-		goto no_resource;
-	}
-
-	spin_lock_init(&ptp_qoriq->lock);
-
-	base = ioremap(ptp_qoriq->rsrc->start,
-		       resource_size(ptp_qoriq->rsrc));
-	if (!base) {
-		pr_err("ioremap ptp registers failed\n");
-		goto no_ioremap;
-	}
-
-	ptp_qoriq->base = base;
-
 	if (of_device_is_compatible(node, "fsl,fman-ptp-timer")) {
 		ptp_qoriq->regs.ctrl_regs = base + FMAN_CTRL_REGS_OFFSET;
 		ptp_qoriq->regs.alarm_regs = base + FMAN_ALARM_REGS_OFFSET;
@@ -558,6 +515,7 @@ static int ptp_qoriq_probe(struct platform_device *dev)
 	  (ptp_qoriq->tclk_period & TCLK_PERIOD_MASK) << TCLK_PERIOD_SHIFT |
 	  (ptp_qoriq->cksel & CKSEL_MASK) << CKSEL_SHIFT;
 
+	spin_lock_init(&ptp_qoriq->lock);
 	spin_lock_irqsave(&ptp_qoriq->lock, flags);
 
 	regs = &ptp_qoriq->regs;
@@ -571,16 +529,63 @@ static int ptp_qoriq_probe(struct platform_device *dev)
 
 	spin_unlock_irqrestore(&ptp_qoriq->lock, flags);
 
-	ptp_qoriq->clock = ptp_clock_register(&ptp_qoriq->caps, &dev->dev);
-	if (IS_ERR(ptp_qoriq->clock)) {
-		err = PTR_ERR(ptp_qoriq->clock);
-		goto no_clock;
-	}
-	ptp_qoriq->phc_index = ptp_clock_index(ptp_qoriq->clock);
+	ptp_qoriq->clock = ptp_clock_register(&ptp_qoriq->caps, ptp_qoriq->dev);
+	if (IS_ERR(ptp_qoriq->clock))
+		return PTR_ERR(ptp_qoriq->clock);
 
+	ptp_qoriq->phc_index = ptp_clock_index(ptp_qoriq->clock);
 	ptp_qoriq_create_debugfs(ptp_qoriq);
-	platform_set_drvdata(dev, ptp_qoriq);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ptp_qoriq_init);
+
+static int ptp_qoriq_probe(struct platform_device *dev)
+{
+	struct ptp_qoriq *ptp_qoriq;
+	int err = -ENOMEM;
+	void __iomem *base;
+
+	ptp_qoriq = kzalloc(sizeof(*ptp_qoriq), GFP_KERNEL);
+	if (!ptp_qoriq)
+		goto no_memory;
+
+	ptp_qoriq->dev = &dev->dev;
+
+	err = -ENODEV;
+
+	ptp_qoriq->irq = platform_get_irq(dev, 0);
+	if (ptp_qoriq->irq < 0) {
+		pr_err("irq not in device tree\n");
+		goto no_node;
+	}
+	if (request_irq(ptp_qoriq->irq, ptp_qoriq_isr, IRQF_SHARED,
+			DRIVER, ptp_qoriq)) {
+		pr_err("request_irq failed\n");
+		goto no_node;
+	}
 
+	ptp_qoriq->rsrc = platform_get_resource(dev, IORESOURCE_MEM, 0);
+	if (!ptp_qoriq->rsrc) {
+		pr_err("no resource\n");
+		goto no_resource;
+	}
+	if (request_resource(&iomem_resource, ptp_qoriq->rsrc)) {
+		pr_err("resource busy\n");
+		goto no_resource;
+	}
+
+	base = ioremap(ptp_qoriq->rsrc->start,
+		       resource_size(ptp_qoriq->rsrc));
+	if (!base) {
+		pr_err("ioremap ptp registers failed\n");
+		goto no_ioremap;
+	}
+
+	err = ptp_qoriq_init(ptp_qoriq, base, ptp_qoriq_caps);
+	if (err)
+		goto no_clock;
+
+	platform_set_drvdata(dev, ptp_qoriq);
 	return 0;
 
 no_clock:
@@ -589,7 +594,6 @@ static int ptp_qoriq_probe(struct platform_device *dev)
 	release_resource(ptp_qoriq->rsrc);
 no_resource:
 	free_irq(ptp_qoriq->irq, ptp_qoriq);
-no_config:
 no_node:
 	kfree(ptp_qoriq);
 no_memory:
diff --git a/include/linux/fsl/ptp_qoriq.h b/include/linux/fsl/ptp_qoriq.h
index 75e6f05..d0c1e42 100644
--- a/include/linux/fsl/ptp_qoriq.h
+++ b/include/linux/fsl/ptp_qoriq.h
@@ -173,6 +173,8 @@ static inline void qoriq_write(unsigned __iomem *addr, u32 val)
 }
 
 irqreturn_t ptp_qoriq_isr(int irq, void *priv);
+int ptp_qoriq_init(struct ptp_qoriq *ptp_qoriq, void __iomem *base,
+		   const struct ptp_clock_info caps);
 int ptp_qoriq_adjfine(struct ptp_clock_info *ptp, long scaled_ppm);
 int ptp_qoriq_adjtime(struct ptp_clock_info *ptp, s64 delta);
 int ptp_qoriq_gettime(struct ptp_clock_info *ptp, struct timespec64 *ts);
-- 
1.7.1


^ permalink raw reply related

* [PATCH 7/9] ptp: add QorIQ PTP support for ENETC
From: Yangbo Lu @ 2019-01-31  2:33 UTC (permalink / raw)
  To: netdev, devicetree
  Cc: David S . Miller, Richard Cochran, Rob Herring, Claudiu Manoil,
	Yangbo Lu
In-Reply-To: <20190131023310.9563-1-yangbo.lu@nxp.com>

This patch is to add QorIQ PTP support for ENETC.
ENETC PTP driver which is a PCI driver for same
1588 timer IP block will reuse QorIQ PTP driver.

Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
---
 drivers/ptp/Kconfig |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index aeb4a8b..7fe1863 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -43,7 +43,7 @@ config PTP_1588_CLOCK_DTE
 
 config PTP_1588_CLOCK_QORIQ
 	tristate "Freescale QorIQ 1588 timer as PTP clock"
-	depends on GIANFAR || FSL_DPAA_ETH
+	depends on GIANFAR || FSL_DPAA_ETH || FSL_ENETC || FSL_ENETC_VF
 	depends on PTP_1588_CLOCK
 	default y
 	help
-- 
1.7.1


^ permalink raw reply related

* Re: [PATCH net-next] net: udp Allow CHECKSUM_UNNECESSARY packets to do GRO.
From: Tom Herbert @ 2019-01-31  2:43 UTC (permalink / raw)
  To: maowenan; +Cc: Linux Kernel Network Developers, David S. Miller, Eric Dumazet
In-Reply-To: <17c94d54-0a0e-21e7-1a8d-5e84b2f7a07d@huawei.com>

On Wed, Jan 30, 2019 at 5:58 PM maowenan <maowenan@huawei.com> wrote:
>
>
>
> On 2019/1/30 4:24, Tom Herbert wrote:
> > On Tue, Jan 29, 2019 at 12:08 AM maowenan <maowenan@huawei.com> wrote:
> >>
> >>
> >>
> >> On 2019/1/29 14:24, Tom Herbert wrote:
> >>> On Mon, Jan 28, 2019 at 10:04 PM maowenan <maowenan@huawei.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2019/1/29 12:01, Tom Herbert wrote:
> >>>>> On Mon, Jan 28, 2019 at 7:00 PM maowenan <maowenan@huawei.com> wrote:
> >>>>>>
> >>>>>> Hi all,
> >>>>>> Do you have any comments about this change?
> >>>>>>
> >>>>>>
> >>>>>> On 2019/1/23 11:33, Mao Wenan wrote:
> >>>>>>> When udp4_gro_receive() get one packet that uh->check=0,
> >>>>>>> skb_gro_checksum_validate_zero_check() will set the
> >>>>>>> skb->ip_summed = CHECKSUM_UNNECESSARY;
> >>>>>>> skb->csum_level = 0;
> >>>>>>> Then udp_gro_receive() will flush the packet which is not CHECKSUM_PARTIAL,
> >>>>>>> It is not our expect,  because check=0 in udp header indicates this
> >>>>>>> packet is no need to caculate checksum, we should go further to do GRO.
> >>>>>>>
> >>>>>>> This patch changes the value of csum_cnt according to skb->csum_level.
> >>>>>>> ---
> >>>>>>>  include/linux/netdevice.h | 1 +
> >>>>>>>  1 file changed, 1 insertion(+)
> >>>>>>>
> >>>>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >>>>>>> index 1377d08..9c819f1 100644
> >>>>>>> --- a/include/linux/netdevice.h
> >>>>>>> +++ b/include/linux/netdevice.h
> >>>>>>> @@ -2764,6 +2764,7 @@ static inline void skb_gro_incr_csum_unnecessary(struct sk_buff *skb)
> >>>>>>>                * during GRO. This saves work if we fallback to normal path.
> >>>>>>>                */
> >>>>>>>               __skb_incr_checksum_unnecessary(skb);
> >>>>>>> +             NAPI_GRO_CB(skb)->csum_cnt = skb->csum_level + 1;
> >>>>>
> >>>>> That doesn't look right. This would be reinitializing the GRO
> >>>>> checksums from the beginning.
> >>>>>
> >>>>>>>       }
> >>>>>>>  }
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>> I assume the code is bailing on this conditional:
> >>>>>
> >>>>> if (NAPI_GRO_CB(skb)->encap_mark ||
> >>>>>             (skb->ip_summed != CHECKSUM_PARTIAL &&
> >>>>>              NAPI_GRO_CB(skb)->csum_cnt == 0 &&
> >>>>>              !NAPI_GRO_CB(skb)->csum_valid) ||
> >>>>>             !udp_sk(sk)->gro_receive)
> >>>>>                 goto out_unlock;
> >>>>>
> >>>>> I am trying to remember why this needs to check csum_cnt. If there was
> >>>>> a csum_cnt for the UDP csum being zero from checksum-unnecessary, it
> >>>>> was consumed by skb_gro_checksum_validate_zero_check in UDP4 GRO
> >>>>> received.
> >>>>
> >>>> We have met the scene about two VMs in different host with vxlan packets, when udp4_gro_receive receives
> >>>> one packet with ip_summed=CHECKSUM_NONE,csum_cnt=0,csum_valid=0,and udp->check=0, then skb_gro_checksum_validate_zero_check()->
> >>>> skb_gro_incr_csum_unnecessary() validate it and set ip_summed=CHECKSUM_UNNECESSARY,csum_level=0, but csum_cnt and csum_valid
> >>>> keep zero value. Then it will be flushed in udp_gro_receive(), the codes as you have showed.
> >>>>
> >>>> so I think it forgets to modify csum_cnt since csum_level is changed in skb_gro_incr_csum_unnecessary()->__skb_incr_checksum_unnecessary().
> >>>>
> >>> Yes, but the csum_level is changing since we've gone beyond the
> >>> checksums initially reported inc checksum-unnecessary. GRO csum_cnt is
> >>> initialized to skb->csum_level + 1 at the start of GRO processing.
> >>>
> >>> If I recall, the rule is that UDP GRO requires at least one non-zero
> >>> checksum to be verified. The idea is that if we end up computing
> >>> packet checksums on the host for inner checksums like TCP during GRO,
> >>> then that's negating the performance benefits of GRO. Had UDP check
> >>> not been zero then we would do checksum unnecessary conversion and so
> >>> csum_valid would be set for the remainded of GRO processing. The
> >>> existing code is following the rule I believe, so this may be working
> >>> as intended.
> >>
> >> Do you have any suggestion if I need do GRO as udp->check is zero?
> >> My previous modification which works fine as below:
> >>         if (NAPI_GRO_CB(skb)->encap_mark ||
> >>             (skb->ip_summed != CHECKSUM_PARTIAL &&
> >> +            skb->ip_summed != CHECKSUM_UNNECESSARY &&
> >
> > That's effectively disabling the rule that we need a real checksum
> > calculation to proceed with GRO. Besides that, the device returning
> > one checksum-unnecessary level because UDP csum is zero is pretty
> > pointelss; we can just as easily deduce get to same state just by
> > looking at the field with CHECKSUM_NONE. What we really want to see
> > for GRO is a real checksum computation being done on the packet.
> >
> > A few questions:
> >
> > What type of packets are being GROed? Are these TCP? What performance
> > difference do you see with our patch? Can you try enabling UDP
> > checksums, and even RCO with VXLAN? With UDP encapsulation we
> > generally see better performance with checksum enabled since UDP
> > checksum offload is ubiquitous and we can easily convert
> > checksum-unnecessary (with non-zero csum) to checksum-complete.
>
> We use the physical network card calculate the checksum of the inner packet with checksum offload.
> Set the udp checksum of the vxlan header is 0.
>
I see. It sounds like the device is really verifying two checksums in
the packet, the outer UDP checksum (which is zero for UDP) and an
inner checksum, but only reporting one checksum was verified. The
driver needs to set csum_level to 1 in this case (meaning two
checksums have been verified for checksum-unnecessary). What NIC are
you using?

Tom


> With this patch, the bandwidth of TCP between two VMs increase from 2Gbit/s to 6Gbit/s.
>
> >
> > Tom
> >
> >
> >
> >
> >>              NAPI_GRO_CB(skb)->csum_cnt == 0 &&
> >>              !NAPI_GRO_CB(skb)->csum_valid) ||
> >>             !udp_sk(sk)->gro_receive)
> >>                 goto out_unlock;
> >>
> >>
> >>>
> >>> Tom
> >>>
> >>>>>
> >>>>> .
> >>>>>
> >>>>
> >>>
> >>> .
> >>>
> >>
> >
> > .
> >
>

^ permalink raw reply

* Re: [PATCH bpf-next 2/4] bpf: fix lockdep false positive in stackmap
From: Waiman Long @ 2019-01-31  2:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peter Zijlstra, Alexei Starovoitov, davem, daniel, edumazet,
	jannh, netdev, kernel-team
In-Reply-To: <20190131020150.zbys2r6me7em2jnl@ast-mbp.dhcp.thefacebook.com>

On 01/30/2019 09:01 PM, Alexei Starovoitov wrote:
> On Wed, Jan 30, 2019 at 04:32:12PM -0500, Waiman Long wrote:
>> On 01/30/2019 04:11 PM, Waiman Long wrote:
>>> On 01/30/2019 03:10 PM, Alexei Starovoitov wrote:
>>>> On Wed, Jan 30, 2019 at 02:42:23PM -0500, Waiman Long wrote:
>>>>> On 01/30/2019 02:30 PM, Alexei Starovoitov wrote:
>>>>>> On Wed, Jan 30, 2019 at 11:15:30AM +0100, Peter Zijlstra wrote:
>>>>>>> On Tue, Jan 29, 2019 at 08:04:56PM -0800, Alexei Starovoitov wrote:
>>>>>>>> Lockdep warns about false positive:
>>>>>>> This is not a false positive, and you probably also need to use
>>>>>>> down_read_non_owner() to match this up_read_non_owner().
>>>>>>>
>>>>>>> {up,down}_read() and {up,down}_read_non_owner() are not only different
>>>>>>> in the lockdep annotation; there is also optimistic spin stuff that
>>>>>>> relies on 'owner' tracking.
>>>>>> Can you point out in the code the spin bit?
>>>>>> As far as I can see sem->owner is debug only feature.
>>>>>> All owner checks are done under CONFIG_DEBUG_RWSEMS.
>>>>> No, sem->owner is mainly for performing optimistic spinning which is a
>>>>> performance feature to make rwsem writer-lock performs similar to mutex.
>>>>> The debugging part is just an add-on. It is not the reason for the
>>>>> presence of sem->owner.
>>>> I see. Got it.
>>>>
>>>>>> Also there is no down_read_trylock_non_owner() at the moment.
>>>>>> We can argue about it for -next, but I'd rather silence lockdep
>>>>>> with this patch today.
>>>>>>
>>>>> We can add down_read_trylock_non_owner() if there is a need for it. It
>>>>> should be easy to do.
>>>> Yes, but looking through the code it's not clear to me that it's safe
>>>> to mix non_owner() versions with regular.
>>>> bpf/stackmap.c does down_read_trylock + up_read.
>>>> If we add new down_read_trylock_non_owner that set the owner to
>>>> NULL | RWSEM_* bits is this safe with conccurent read/write
>>>> that do regular versions?
>>>> rwsem_can_spin_on_owner() does:
>>>>         if (owner) {
>>>>                 ret = is_rwsem_owner_spinnable(owner) &&
>>>>                       owner_on_cpu(owner);
>>>> that looks correct.
>>>> For a second I thought there could be fault here due to non_owner.
>>>> But there could be other places where it's assumed that owner
>>>> is never null?
>>> The content of owner is not the cause of the lockdep warning. The
>>> lockdep code assumes that the task that acquires the lock will release
>>> it some time later. That is not the case when you need to acquire the
>>> lock by one task and released by another. In this case, you have to use
>>> the non_owner version of down/up_read which disable the lockdep
>>> acquire/release tracking. That will be the only difference between the
>>> two set of APIs.
>>>
>>> Cheers,
>>> Longman
>> BTW, you may want to do something like that to make sure that the lock
>> ownership is probably tracked.
>>
>> -Longman
>>
>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>> index d43b145..79eef9d 100644
>> --- a/kernel/bpf/stackmap.c
>> +++ b/kernel/bpf/stackmap.c
>> @@ -338,6 +338,13 @@ static void stack_map_get_build_id_offset(struct
>> bpf_stack_
>>         } else {
>>                 work->sem = &current->mm->mmap_sem;
>>                 irq_work_queue(&work->irq_work);
>> +
>> +               /*
>> +                * The irq_work will release the mmap_sem with
>> +                * up_read_non_owner(). The rwsem_release() is called
>> +                * here to release the lock from lockdep's perspective.
>> +                */
>> +               rwsem_release(&current->mm->mmap_sem.dep_map, 1, _RET_IP_);
> are you saying the above is enough coupled with up_read_non_owner?
> Looking at how amdgpu is using this api... I think they just use non_owner
> version when doing things from different task.
> So I don't think pairing non_owner with non_owner is strictly necessary.
> It seems fine to use down_read_trylock() with above rwsem_release() hack
> plus up_read_non_owner().
> Am I missing something?
>
The statement above is to clear the state for the lockdep so that it
knows that the task no longer owns the lock. Without doing that, there
is a possibility of some other kind of incorrect lockdep warning may be
produced because the code will still think the task hold a read-lock on
the mmap_sem. It is also possible no warning will be reported.

The bpf code is kind of special that it acquires the mmap_sem. Later on,
it either releases it itself (non-NMI) or request irq_work to release it
(NMI). So either, you use the _non_owner versions for both acquire and
release or fake the release like the code segment above.

Peter, please chime in if you have other suggestion.

Cheers,
Longman



^ permalink raw reply

* Re: [PATCH net-next] net: udp Allow CHECKSUM_UNNECESSARY packets to do GRO.
From: maowenan @ 2019-01-31  2:58 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Linux Kernel Network Developers, David S. Miller, Eric Dumazet
In-Reply-To: <CALx6S373utRW0csPy-8xewB=F4NfRD_X2jvGMgUBQxNxpM8AAQ@mail.gmail.com>



On 2019/1/31 10:43, Tom Herbert wrote:
> On Wed, Jan 30, 2019 at 5:58 PM maowenan <maowenan@huawei.com> wrote:
>>
>>
>>
>> On 2019/1/30 4:24, Tom Herbert wrote:
>>> On Tue, Jan 29, 2019 at 12:08 AM maowenan <maowenan@huawei.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2019/1/29 14:24, Tom Herbert wrote:
>>>>> On Mon, Jan 28, 2019 at 10:04 PM maowenan <maowenan@huawei.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2019/1/29 12:01, Tom Herbert wrote:
>>>>>>> On Mon, Jan 28, 2019 at 7:00 PM maowenan <maowenan@huawei.com> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> Do you have any comments about this change?
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2019/1/23 11:33, Mao Wenan wrote:
>>>>>>>>> When udp4_gro_receive() get one packet that uh->check=0,
>>>>>>>>> skb_gro_checksum_validate_zero_check() will set the
>>>>>>>>> skb->ip_summed = CHECKSUM_UNNECESSARY;
>>>>>>>>> skb->csum_level = 0;
>>>>>>>>> Then udp_gro_receive() will flush the packet which is not CHECKSUM_PARTIAL,
>>>>>>>>> It is not our expect,  because check=0 in udp header indicates this
>>>>>>>>> packet is no need to caculate checksum, we should go further to do GRO.
>>>>>>>>>
>>>>>>>>> This patch changes the value of csum_cnt according to skb->csum_level.
>>>>>>>>> ---
>>>>>>>>>  include/linux/netdevice.h | 1 +
>>>>>>>>>  1 file changed, 1 insertion(+)
>>>>>>>>>
>>>>>>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>>>>>>>> index 1377d08..9c819f1 100644
>>>>>>>>> --- a/include/linux/netdevice.h
>>>>>>>>> +++ b/include/linux/netdevice.h
>>>>>>>>> @@ -2764,6 +2764,7 @@ static inline void skb_gro_incr_csum_unnecessary(struct sk_buff *skb)
>>>>>>>>>                * during GRO. This saves work if we fallback to normal path.
>>>>>>>>>                */
>>>>>>>>>               __skb_incr_checksum_unnecessary(skb);
>>>>>>>>> +             NAPI_GRO_CB(skb)->csum_cnt = skb->csum_level + 1;
>>>>>>>
>>>>>>> That doesn't look right. This would be reinitializing the GRO
>>>>>>> checksums from the beginning.
>>>>>>>
>>>>>>>>>       }
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>> I assume the code is bailing on this conditional:
>>>>>>>
>>>>>>> if (NAPI_GRO_CB(skb)->encap_mark ||
>>>>>>>             (skb->ip_summed != CHECKSUM_PARTIAL &&
>>>>>>>              NAPI_GRO_CB(skb)->csum_cnt == 0 &&
>>>>>>>              !NAPI_GRO_CB(skb)->csum_valid) ||
>>>>>>>             !udp_sk(sk)->gro_receive)
>>>>>>>                 goto out_unlock;
>>>>>>>
>>>>>>> I am trying to remember why this needs to check csum_cnt. If there was
>>>>>>> a csum_cnt for the UDP csum being zero from checksum-unnecessary, it
>>>>>>> was consumed by skb_gro_checksum_validate_zero_check in UDP4 GRO
>>>>>>> received.
>>>>>>
>>>>>> We have met the scene about two VMs in different host with vxlan packets, when udp4_gro_receive receives
>>>>>> one packet with ip_summed=CHECKSUM_NONE,csum_cnt=0,csum_valid=0,and udp->check=0, then skb_gro_checksum_validate_zero_check()->
>>>>>> skb_gro_incr_csum_unnecessary() validate it and set ip_summed=CHECKSUM_UNNECESSARY,csum_level=0, but csum_cnt and csum_valid
>>>>>> keep zero value. Then it will be flushed in udp_gro_receive(), the codes as you have showed.
>>>>>>
>>>>>> so I think it forgets to modify csum_cnt since csum_level is changed in skb_gro_incr_csum_unnecessary()->__skb_incr_checksum_unnecessary().
>>>>>>
>>>>> Yes, but the csum_level is changing since we've gone beyond the
>>>>> checksums initially reported inc checksum-unnecessary. GRO csum_cnt is
>>>>> initialized to skb->csum_level + 1 at the start of GRO processing.
>>>>>
>>>>> If I recall, the rule is that UDP GRO requires at least one non-zero
>>>>> checksum to be verified. The idea is that if we end up computing
>>>>> packet checksums on the host for inner checksums like TCP during GRO,
>>>>> then that's negating the performance benefits of GRO. Had UDP check
>>>>> not been zero then we would do checksum unnecessary conversion and so
>>>>> csum_valid would be set for the remainded of GRO processing. The
>>>>> existing code is following the rule I believe, so this may be working
>>>>> as intended.
>>>>
>>>> Do you have any suggestion if I need do GRO as udp->check is zero?
>>>> My previous modification which works fine as below:
>>>>         if (NAPI_GRO_CB(skb)->encap_mark ||
>>>>             (skb->ip_summed != CHECKSUM_PARTIAL &&
>>>> +            skb->ip_summed != CHECKSUM_UNNECESSARY &&
>>>
>>> That's effectively disabling the rule that we need a real checksum
>>> calculation to proceed with GRO. Besides that, the device returning
>>> one checksum-unnecessary level because UDP csum is zero is pretty
>>> pointelss; we can just as easily deduce get to same state just by
>>> looking at the field with CHECKSUM_NONE. What we really want to see
>>> for GRO is a real checksum computation being done on the packet.
>>>
>>> A few questions:
>>>
>>> What type of packets are being GROed? Are these TCP? What performance
>>> difference do you see with our patch? Can you try enabling UDP
>>> checksums, and even RCO with VXLAN? With UDP encapsulation we
>>> generally see better performance with checksum enabled since UDP
>>> checksum offload is ubiquitous and we can easily convert
>>> checksum-unnecessary (with non-zero csum) to checksum-complete.
>>
>> We use the physical network card calculate the checksum of the inner packet with checksum offload.
>> Set the udp checksum of the vxlan header is 0.
>>
> I see. It sounds like the device is really verifying two checksums in
> the packet, the outer UDP checksum (which is zero for UDP) and an
> inner checksum, but only reporting one checksum was verified. The
> driver needs to set csum_level to 1 in this case (meaning two
> checksums have been verified for checksum-unnecessary). What NIC are
> you using?

Currently it is 82599, whose driver can't recognize the vxlan packet.
I guess so many NICs can't do this checking in it's driver, so I think this is a
common case, will we fix it in stack?

> 
> Tom
> 
> 
>> With this patch, the bandwidth of TCP between two VMs increase from 2Gbit/s to 6Gbit/s.
>>
>>>
>>> Tom
>>>
>>>
>>>
>>>
>>>>              NAPI_GRO_CB(skb)->csum_cnt == 0 &&
>>>>              !NAPI_GRO_CB(skb)->csum_valid) ||
>>>>             !udp_sk(sk)->gro_receive)
>>>>                 goto out_unlock;
>>>>
>>>>
>>>>>
>>>>> Tom
>>>>>
>>>>>>>
>>>>>>> .
>>>>>>>
>>>>>>
>>>>>
>>>>> .
>>>>>
>>>>
>>>
>>> .
>>>
>>
> 
> .
> 


^ permalink raw reply

* [PATCH] bpfilter: remove extra header search paths for bpfilter_umh
From: Masahiro Yamada @ 2019-01-31  3:15 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Masahiro Yamada, Matteo Croce, David S. Miller, linux-kernel,
	Jakub Kicinski, Laura Abbott, Alexei Starovoitov, YueHaibing

Currently, the header search paths -Itools/include and
-Itools/include/uapi are not used. Let's drop the unused code.

We can remove -I. too by fixing up one C file.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
---

Perhaps, are these extra header search paths for
more upstreaming in the future?

If this patch is rejected, I will send an alternative one.

To clean up the Kbuild core,
I want to drop as many unused header search paths as possible.


 net/bpfilter/Makefile | 1 -
 net/bpfilter/main.c   | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index 0947ee7..5d6c776 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -5,7 +5,6 @@
 
 hostprogs-y := bpfilter_umh
 bpfilter_umh-objs := main.o
-KBUILD_HOSTCFLAGS += -I. -Itools/include/ -Itools/include/uapi
 HOSTCC := $(CC)
 
 ifeq ($(CONFIG_BPFILTER_UMH), y)
diff --git a/net/bpfilter/main.c b/net/bpfilter/main.c
index 1317f10..61ce845 100644
--- a/net/bpfilter/main.c
+++ b/net/bpfilter/main.c
@@ -6,7 +6,7 @@
 #include <sys/socket.h>
 #include <fcntl.h>
 #include <unistd.h>
-#include "include/uapi/linux/bpf.h"
+#include "../../include/uapi/linux/bpf.h"
 #include <asm/unistd.h>
 #include "msgfmt.h"
 
-- 
2.7.4


^ permalink raw reply related

* Re: [RFC] net: dp83640: expire old TX-skb
From: Richard Cochran @ 2019-01-31  4:26 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Andrew Lunn, Florian Fainelli, Heiner Kallweit, netdev
In-Reply-To: <20190128174547.twhwv64y2k5xx5x7@linutronix.de>

On Mon, Jan 28, 2019 at 06:45:47PM +0100, Sebastian Andrzej Siewior wrote:
> During sendmsg() a cloned skb is saved via dp83640_txtstamp() in
> ->tx_queue. After the NIC sends this packet, the PHY will reply with a
> timestamp for that TX packet. If the cable is pulled at the right time I
> don't see that packet. It might gets flushed as part of queue shutdown
> on NIC's side.
> Once the link is up again then after the next sendmsg() we enqueue
> another skb in dp83640_txtstamp() and have two on the list. Then the PHY
> will send a reply and decode_txts() attaches it to the first skb on the
> list.
> No crash occurs since refcounting works but we are one packet behind.
> linuxptp/ptp4l usually closes the socket and opens a new one (in such a
> timeout case) so those "stale" replies never get there. However it does
> not resume normal operation anymore.

Thanks for the detailed explanation.  This sounds like a really rare
bug, but maybe you guys were able to trigger it reliably?
 
> Purge old skbs in decode_txts().

It is too bad that the Tx timestamp from the HW doesn't provide
matching fields.  Using the timeout is probably the best that we can
do.
 
> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Order: signed-off goes before reviewed-by.

> ---
>  drivers/net/phy/dp83640.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
> index 1dc043d0bc875..a50e0680a0322 100644
> --- a/drivers/net/phy/dp83640.c
> +++ b/drivers/net/phy/dp83640.c
> @@ -920,13 +920,13 @@ static void decode_txts(struct dp83640_private *dp83640,
>  {
>  	struct skb_shared_hwtstamps shhwtstamps;
>  	struct sk_buff *skb;
> +	struct dp83640_skb_info *skb_info;

Reverse Christmas tree please,

>  	u64 ns;
>  	u8 overflow;

and fix ^^^ while you are at it.

>  
>  	/* We must already have the skb that triggered this. */
> -
> +again:
>  	skb = skb_dequeue(&dp83640->tx_queue);
> -
>  	if (!skb) {
>  		pr_debug("have timestamp but tx_queue empty\n");
>  		return;
> @@ -941,6 +941,11 @@ static void decode_txts(struct dp83640_private *dp83640,
>  		}
>  		return;
>  	}
> +	skb_info = (struct dp83640_skb_info *)skb->cb;
> +	if (time_after(jiffies, skb_info->tmo)) {
> +		kfree_skb(skb);
> +		goto again;
> +	}
>  
>  	ns = phy2txts(phy_txts);
>  	memset(&shhwtstamps, 0, sizeof(shhwtstamps));
> @@ -1489,6 +1494,7 @@ static void dp83640_txtstamp(struct phy_device *phydev,
>  			     struct sk_buff *skb, int type)
>  {
>  	struct dp83640_private *dp83640 = phydev->priv;
> +	struct dp83640_skb_info *skb_info = (struct dp83640_skb_info *)skb->cb;

Reverse Christmas tree.

Thanks,
Richard

>  
>  	switch (dp83640->hwts_tx_en) {
>  
> @@ -1500,6 +1506,7 @@ static void dp83640_txtstamp(struct phy_device *phydev,
>  		/* fall through */
>  	case HWTSTAMP_TX_ON:
>  		skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
> +		skb_info->tmo = jiffies + SKB_TIMESTAMP_TIMEOUT;
>  		skb_queue_tail(&dp83640->tx_queue, skb);
>  		break;
>  
> -- 
> 2.20.1
> 

^ permalink raw reply

* Re: [PATCH net-next] net: udp Allow CHECKSUM_UNNECESSARY packets to do GRO.
From: Tom Herbert @ 2019-01-31  4:33 UTC (permalink / raw)
  To: maowenan; +Cc: Linux Kernel Network Developers, David S. Miller, Eric Dumazet
In-Reply-To: <57903a5f-5033-4da9-8807-0b6f894827a9@huawei.com>

On Wed, Jan 30, 2019 at 6:59 PM maowenan <maowenan@huawei.com> wrote:
>
>
>
> On 2019/1/31 10:43, Tom Herbert wrote:
> > On Wed, Jan 30, 2019 at 5:58 PM maowenan <maowenan@huawei.com> wrote:
> >>
> >>
> >>
> >> On 2019/1/30 4:24, Tom Herbert wrote:
> >>> On Tue, Jan 29, 2019 at 12:08 AM maowenan <maowenan@huawei.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2019/1/29 14:24, Tom Herbert wrote:
> >>>>> On Mon, Jan 28, 2019 at 10:04 PM maowenan <maowenan@huawei.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 2019/1/29 12:01, Tom Herbert wrote:
> >>>>>>> On Mon, Jan 28, 2019 at 7:00 PM maowenan <maowenan@huawei.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi all,
> >>>>>>>> Do you have any comments about this change?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 2019/1/23 11:33, Mao Wenan wrote:
> >>>>>>>>> When udp4_gro_receive() get one packet that uh->check=0,
> >>>>>>>>> skb_gro_checksum_validate_zero_check() will set the
> >>>>>>>>> skb->ip_summed = CHECKSUM_UNNECESSARY;
> >>>>>>>>> skb->csum_level = 0;
> >>>>>>>>> Then udp_gro_receive() will flush the packet which is not CHECKSUM_PARTIAL,
> >>>>>>>>> It is not our expect,  because check=0 in udp header indicates this
> >>>>>>>>> packet is no need to caculate checksum, we should go further to do GRO.
> >>>>>>>>>
> >>>>>>>>> This patch changes the value of csum_cnt according to skb->csum_level.
> >>>>>>>>> ---
> >>>>>>>>>  include/linux/netdevice.h | 1 +
> >>>>>>>>>  1 file changed, 1 insertion(+)
> >>>>>>>>>
> >>>>>>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >>>>>>>>> index 1377d08..9c819f1 100644
> >>>>>>>>> --- a/include/linux/netdevice.h
> >>>>>>>>> +++ b/include/linux/netdevice.h
> >>>>>>>>> @@ -2764,6 +2764,7 @@ static inline void skb_gro_incr_csum_unnecessary(struct sk_buff *skb)
> >>>>>>>>>                * during GRO. This saves work if we fallback to normal path.
> >>>>>>>>>                */
> >>>>>>>>>               __skb_incr_checksum_unnecessary(skb);
> >>>>>>>>> +             NAPI_GRO_CB(skb)->csum_cnt = skb->csum_level + 1;
> >>>>>>>
> >>>>>>> That doesn't look right. This would be reinitializing the GRO
> >>>>>>> checksums from the beginning.
> >>>>>>>
> >>>>>>>>>       }
> >>>>>>>>>  }
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>> I assume the code is bailing on this conditional:
> >>>>>>>
> >>>>>>> if (NAPI_GRO_CB(skb)->encap_mark ||
> >>>>>>>             (skb->ip_summed != CHECKSUM_PARTIAL &&
> >>>>>>>              NAPI_GRO_CB(skb)->csum_cnt == 0 &&
> >>>>>>>              !NAPI_GRO_CB(skb)->csum_valid) ||
> >>>>>>>             !udp_sk(sk)->gro_receive)
> >>>>>>>                 goto out_unlock;
> >>>>>>>
> >>>>>>> I am trying to remember why this needs to check csum_cnt. If there was
> >>>>>>> a csum_cnt for the UDP csum being zero from checksum-unnecessary, it
> >>>>>>> was consumed by skb_gro_checksum_validate_zero_check in UDP4 GRO
> >>>>>>> received.
> >>>>>>
> >>>>>> We have met the scene about two VMs in different host with vxlan packets, when udp4_gro_receive receives
> >>>>>> one packet with ip_summed=CHECKSUM_NONE,csum_cnt=0,csum_valid=0,and udp->check=0, then skb_gro_checksum_validate_zero_check()->
> >>>>>> skb_gro_incr_csum_unnecessary() validate it and set ip_summed=CHECKSUM_UNNECESSARY,csum_level=0, but csum_cnt and csum_valid
> >>>>>> keep zero value. Then it will be flushed in udp_gro_receive(), the codes as you have showed.
> >>>>>>
> >>>>>> so I think it forgets to modify csum_cnt since csum_level is changed in skb_gro_incr_csum_unnecessary()->__skb_incr_checksum_unnecessary().
> >>>>>>
> >>>>> Yes, but the csum_level is changing since we've gone beyond the
> >>>>> checksums initially reported inc checksum-unnecessary. GRO csum_cnt is
> >>>>> initialized to skb->csum_level + 1 at the start of GRO processing.
> >>>>>
> >>>>> If I recall, the rule is that UDP GRO requires at least one non-zero
> >>>>> checksum to be verified. The idea is that if we end up computing
> >>>>> packet checksums on the host for inner checksums like TCP during GRO,
> >>>>> then that's negating the performance benefits of GRO. Had UDP check
> >>>>> not been zero then we would do checksum unnecessary conversion and so
> >>>>> csum_valid would be set for the remainded of GRO processing. The
> >>>>> existing code is following the rule I believe, so this may be working
> >>>>> as intended.
> >>>>
> >>>> Do you have any suggestion if I need do GRO as udp->check is zero?
> >>>> My previous modification which works fine as below:
> >>>>         if (NAPI_GRO_CB(skb)->encap_mark ||
> >>>>             (skb->ip_summed != CHECKSUM_PARTIAL &&
> >>>> +            skb->ip_summed != CHECKSUM_UNNECESSARY &&
> >>>
> >>> That's effectively disabling the rule that we need a real checksum
> >>> calculation to proceed with GRO. Besides that, the device returning
> >>> one checksum-unnecessary level because UDP csum is zero is pretty
> >>> pointelss; we can just as easily deduce get to same state just by
> >>> looking at the field with CHECKSUM_NONE. What we really want to see
> >>> for GRO is a real checksum computation being done on the packet.
> >>>
> >>> A few questions:
> >>>
> >>> What type of packets are being GROed? Are these TCP? What performance
> >>> difference do you see with our patch? Can you try enabling UDP
> >>> checksums, and even RCO with VXLAN? With UDP encapsulation we
> >>> generally see better performance with checksum enabled since UDP
> >>> checksum offload is ubiquitous and we can easily convert
> >>> checksum-unnecessary (with non-zero csum) to checksum-complete.
> >>
> >> We use the physical network card calculate the checksum of the inner packet with checksum offload.
> >> Set the udp checksum of the vxlan header is 0.
> >>
> > I see. It sounds like the device is really verifying two checksums in
> > the packet, the outer UDP checksum (which is zero for UDP) and an
> > inner checksum, but only reporting one checksum was verified. The
> > driver needs to set csum_level to 1 in this case (meaning two
> > checksums have been verified for checksum-unnecessary). What NIC are
> > you using?
>
> Currently it is 82599, whose driver can't recognize the vxlan packet.
> I guess so many NICs can't do this checking in it's driver, so I think this is a
> common case, will we fix it in stack?
>
Mao,

The problem isn't in the stack, it seems to be in the driver. If the
device reports a verified checksum for an encpasulated packet then the
driver needs to set csum_level to 1. Otherwise, the stack can't just
assume that the inner checksum was verified. *How* a driver deduces
that the device is reporting about an encapsulated checksum is
specific to the device and its driver. I'm not sure which driver your
running, but if you search the code there should be something like
"skb->csum_level =1" that would be a clue about support. A good
example is ixgbe, if the device reports checksum verified and that
packet was VXLAN, it deduces that the inner checksum was verified and
so the driver sets CHECKSUM_UNNECESSARY and skb->csum_level. Of course
all this complexity goes away when devices just provide
checksum-complete.

Tom

> >
> > Tom
> >
> >
> >> With this patch, the bandwidth of TCP between two VMs increase from 2Gbit/s to 6Gbit/s.
> >>
> >>>
> >>> Tom
> >>>
> >>>
> >>>
> >>>
> >>>>              NAPI_GRO_CB(skb)->csum_cnt == 0 &&
> >>>>              !NAPI_GRO_CB(skb)->csum_valid) ||
> >>>>             !udp_sk(sk)->gro_receive)
> >>>>                 goto out_unlock;
> >>>>
> >>>>
> >>>>>
> >>>>> Tom
> >>>>>
> >>>>>>>
> >>>>>>> .
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>> .
> >>>>>
> >>>>
> >>>
> >>> .
> >>>
> >>
> >
> > .
> >
>

^ permalink raw reply

* Re: [PATCH iproute2-next v2] ss: add AF_XDP support
From: David Ahern @ 2019-01-31  5:01 UTC (permalink / raw)
  To: bjorn.topel, netdev, stephen
  Cc: Björn Töpel, magnus.karlsson, magnus.karlsson
In-Reply-To: <20190130065732.22910-1-bjorn.topel@gmail.com>

On 1/29/19 11:57 PM, bjorn.topel@gmail.com wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> AF_XDP is an address family that is optimized for high performance
> packet processing.
> 
> This patch adds AF_XDP support to ss(8) so that sockets can be queried
> and monitored.
> 
> Example:
> $ sudo ss --xdp -e -p -m
> Recv-Q      Send-Q           Local Address:Port             Peer Address:Port
> 
> 0           0                   enp134s0f0:q20                          *
>  users:(("xdpsock",pid=17787,fd=3)) ino:39424 sk:4
>         rx(entries:2048)
>         tx(entries:2048)
>         umem(id:1,size:8388608,num_pages:2048,chunk_size:2048,headroom:0,ifindex:7,
> qid:20,zc:0,refs:1)
>         fr(entries:2048)
>         cr(entries:2048) skmem:(r0,rb212992,t0,tb212992,f0,w0,o0,bl0,d0)
> 0           0                    enp24s0f0:q0                           *
>  users:(("xdpsock",pid=17780,fd=3)) ino:37384 sk:5
>         rx(entries:2048)
>         tx(entries:2048)
>         umem(id:0,size:8388608,num_pages:2048,chunk_size:2048,headroom:0,ifindex:6,
> qid:0,zc:1,refs:1)
>         fr(entries:2048)
>         cr(entries:2048) skmem:(r0,rb212992,t0,tb212992,f0,w0,o0,bl0,d0)
> 
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
> v1->v2:
>   * Define/redefine AF_XDP/AF_MAX if missing. (David)
>   * Add example to the commit log. (David)
> ---
>  include/utils.h |   8 +++
>  man/man8/ss.8   |   9 ++-
>  misc/ss.c       | 168 +++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 180 insertions(+), 5 deletions(-)
> 

applied to iproute2-next. Thanks



^ permalink raw reply

* [PATCH v6 bpf-next 0/9] introduce bpf_spin_lock
From: Alexei Starovoitov @ 2019-01-31  5:05 UTC (permalink / raw)
  To: davem; +Cc: daniel, peterz, jannh, netdev, kernel-team

Many algorithms need to read and modify several variables atomically.
Until now it was hard to impossible to implement such algorithms in BPF.
Hence introduce support for bpf_spin_lock.

The api consists of 'struct bpf_spin_lock' that should be placed
inside hash/array/cgroup_local_storage element
and bpf_spin_lock/unlock() helper function.

Example:
struct hash_elem {
    int cnt;
    struct bpf_spin_lock lock;
};
struct hash_elem * val = bpf_map_lookup_elem(&hash_map, &key);
if (val) {
    bpf_spin_lock(&val->lock);
    val->cnt++;
    bpf_spin_unlock(&val->lock);
}

and BPF_F_LOCK flag for lookup/update bpf syscall commands that
allows user space to read/write map elements under lock.

Together these primitives allow race free access to map elements
from bpf programs and from user space.

Key restriction: root only.
Key requirement: maps must be annotated with BTF.

This concept was discussed at Linux Plumbers Conference 2018.
Thank you everyone who participated and helped to iron out details
of api and implementation.

Patch 1: bpf_spin_lock support in the verifier, BTF, hash, array.
Patch 2: bpf_spin_lock in cgroup local storage.
Patches 3,4,5: tests
Patch 6: BPF_F_LOCK flag to lookup/update
Patches 7,8,9: tests

v5->v6:
- adopted arch_spinlock approach suggested by Peter
- switched to spin_lock_irqsave equivalent as the simplest way
  to avoid deadlocks in rare case of nested networking progs
  (cgroup-bpf prog in preempt_disable vs clsbpf in softirq sharing
  the same map with bpf_spin_lock)
  bpf_spin_lock is only allowed in networking progs that don't
  have arbitrary entry points unlike tracing progs.
- rebase and split test_verifier tests

v4->v5:
- disallow bpf_spin_lock for tracing progs due to insufficient preemption checks
- socket filter progs cannot use bpf_spin_lock due to missing preempt_disable
- fix atomic_set_release. Spotted by Peter.
- fixed hash_of_maps
  
v3->v4:
- fix BPF_EXIST | BPF_NOEXIST check patch 6. Spotted by Jakub. Thanks!
- rebase

v2->v3:
- fixed build on ia64 and archs where qspinlock is not supported
- fixed missing lock init during lookup w/o BPF_F_LOCK. Spotted by Martin

v1->v2:
- addressed several issues spotted by Daniel and Martin in patch 1
- added test11 to patch 4 as suggested by Daniel

Alexei Starovoitov (9):
  bpf: introduce bpf_spin_lock
  bpf: add support for bpf_spin_lock to cgroup local storage
  tools/bpf: sync include/uapi/linux/bpf.h
  selftests/bpf: add bpf_spin_lock verifier tests
  selftests/bpf: add bpf_spin_lock C test
  bpf: introduce BPF_F_LOCK flag
  tools/bpf: sync uapi/bpf.h
  libbpf: introduce bpf_map_lookup_elem_flags()
  selftests/bpf: test for BPF_F_LOCK

 include/linux/bpf.h                           |  39 ++-
 include/linux/bpf_verifier.h                  |   1 +
 include/linux/btf.h                           |   1 +
 include/uapi/linux/bpf.h                      |   8 +-
 kernel/Kconfig.locks                          |   3 +
 kernel/bpf/arraymap.c                         |  23 +-
 kernel/bpf/btf.c                              |  42 +++
 kernel/bpf/core.c                             |   2 +
 kernel/bpf/hashtab.c                          |  63 +++-
 kernel/bpf/helpers.c                          |  96 +++++
 kernel/bpf/local_storage.c                    |  16 +-
 kernel/bpf/map_in_map.c                       |   5 +
 kernel/bpf/syscall.c                          |  45 ++-
 kernel/bpf/verifier.c                         | 171 ++++++++-
 net/core/filter.c                             |  16 +-
 tools/include/uapi/linux/bpf.h                |   8 +-
 tools/lib/bpf/bpf.c                           |  13 +
 tools/lib/bpf/bpf.h                           |   2 +
 tools/lib/bpf/libbpf.map                      |   1 +
 tools/testing/selftests/bpf/Makefile          |   2 +-
 tools/testing/selftests/bpf/bpf_helpers.h     |   4 +
 tools/testing/selftests/bpf/test_map_lock.c   |  66 ++++
 tools/testing/selftests/bpf/test_progs.c      | 117 ++++++-
 tools/testing/selftests/bpf/test_spin_lock.c  | 108 ++++++
 tools/testing/selftests/bpf/test_verifier.c   | 104 +++++-
 .../selftests/bpf/verifier/spin_lock.c        | 331 ++++++++++++++++++
 26 files changed, 1248 insertions(+), 39 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_map_lock.c
 create mode 100644 tools/testing/selftests/bpf/test_spin_lock.c
 create mode 100644 tools/testing/selftests/bpf/verifier/spin_lock.c

-- 
2.20.0


^ permalink raw reply

* [PATCH v6 bpf-next 3/9] tools/bpf: sync include/uapi/linux/bpf.h
From: Alexei Starovoitov @ 2019-01-31  5:05 UTC (permalink / raw)
  To: davem; +Cc: daniel, peterz, jannh, netdev, kernel-team
In-Reply-To: <20190131050546.3634009-1-ast@kernel.org>

sync bpf.h

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/include/uapi/linux/bpf.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 60b99b730a41..86f7c438d40f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2422,7 +2422,9 @@ union bpf_attr {
 	FN(map_peek_elem),		\
 	FN(msg_push_data),		\
 	FN(msg_pop_data),		\
-	FN(rc_pointer_rel),
+	FN(rc_pointer_rel),		\
+	FN(spin_lock),			\
+	FN(spin_unlock),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3056,4 +3058,7 @@ struct bpf_line_info {
 	__u32	line_col;
 };
 
+struct bpf_spin_lock {
+	__u32	val;
+};
 #endif /* _UAPI__LINUX_BPF_H__ */
-- 
2.20.0


^ permalink raw reply related

* [PATCH v6 bpf-next 1/9] bpf: introduce bpf_spin_lock
From: Alexei Starovoitov @ 2019-01-31  5:05 UTC (permalink / raw)
  To: davem; +Cc: daniel, peterz, jannh, netdev, kernel-team
In-Reply-To: <20190131050546.3634009-1-ast@kernel.org>

Introduce 'struct bpf_spin_lock' and bpf_spin_lock/unlock() helpers to let
bpf program serialize access to other variables.

Example:
struct hash_elem {
    int cnt;
    struct bpf_spin_lock lock;
};
struct hash_elem * val = bpf_map_lookup_elem(&hash_map, &key);
if (val) {
    bpf_spin_lock(&val->lock);
    val->cnt++;
    bpf_spin_unlock(&val->lock);
}

Restrictions and safety checks:
- bpf_spin_lock is only allowed inside HASH and ARRAY maps.
- BTF description of the map is mandatory for safety analysis.
- bpf program can take one bpf_spin_lock at a time, since two or more can
  cause dead locks.
- only one 'struct bpf_spin_lock' is allowed per map element.
  It drastically simplifies implementation yet allows bpf program to use
  any number of bpf_spin_locks.
- when bpf_spin_lock is taken the calls (either bpf2bpf or helpers) are not allowed.
- bpf program must bpf_spin_unlock() before return.
- bpf program can access 'struct bpf_spin_lock' only via
  bpf_spin_lock()/bpf_spin_unlock() helpers.
- load/store into 'struct bpf_spin_lock lock;' field is not allowed.
- to use bpf_spin_lock() helper the BTF description of map value must be
  a struct and have 'struct bpf_spin_lock anyname;' field at the top level.
  Nested lock inside another struct is not allowed.
- syscall map_lookup doesn't copy bpf_spin_lock field to user space.
- syscall map_update and program map_update do not update bpf_spin_lock field.
- bpf_spin_lock cannot be on the stack or inside networking packet.
  bpf_spin_lock can only be inside HASH or ARRAY map value.
- bpf_spin_lock is available to root only and to all program types.
- bpf_spin_lock is not allowed in inner maps of map-in-map.
- ld_abs is not allowed inside spin_lock-ed region.
- tracing progs and socket filter progs cannot use bpf_spin_lock due to
  insufficient preemption checks

Implementation details:
- cgroup-bpf class of programs can nest with xdp/tc programs.
  Hence bpf_spin_lock is equivalent to spin_lock_irqsave.
  Other solutions to avoid nested bpf_spin_lock are possible.
  Like making sure that all networking progs run with softirq disabled.
  spin_lock_irqsave is the simplest and doesn't add overhead to the
  programs that don't use it.
- arch_spinlock_t is used when its implemented as queued_spin_lock
- archs can force their own arch_spinlock_t
- on architectures where queued_spin_lock is not available and
  sizeof(arch_spinlock_t) != sizeof(__u32) trivial lock is used.
- presence of bpf_spin_lock inside map value could have been indicated via
  extra flag during map_create, but specifying it via BTF is cleaner.
  It provides introspection for map key/value and reduces user mistakes.

Next steps:
- allow bpf_spin_lock in other map types (like cgroup local storage)
- introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper
  to request kernel to grab bpf_spin_lock before rewriting the value.
  That will serialize access to map elements.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h          |  37 +++++++-
 include/linux/bpf_verifier.h |   1 +
 include/linux/btf.h          |   1 +
 include/uapi/linux/bpf.h     |   7 +-
 kernel/Kconfig.locks         |   3 +
 kernel/bpf/arraymap.c        |   7 +-
 kernel/bpf/btf.c             |  42 +++++++++
 kernel/bpf/core.c            |   2 +
 kernel/bpf/hashtab.c         |  21 +++--
 kernel/bpf/helpers.c         |  80 +++++++++++++++++
 kernel/bpf/map_in_map.c      |   5 ++
 kernel/bpf/syscall.c         |  21 ++++-
 kernel/bpf/verifier.c        | 169 ++++++++++++++++++++++++++++++++++-
 net/core/filter.c            |  16 +++-
 14 files changed, 386 insertions(+), 26 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0394f1f9213b..2ae615b48bb8 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -72,14 +72,15 @@ struct bpf_map {
 	u32 value_size;
 	u32 max_entries;
 	u32 map_flags;
-	u32 pages;
+	int spin_lock_off; /* >=0 valid offset, <0 error */
 	u32 id;
 	int numa_node;
 	u32 btf_key_type_id;
 	u32 btf_value_type_id;
 	struct btf *btf;
+	u32 pages;
 	bool unpriv_array;
-	/* 55 bytes hole */
+	/* 51 bytes hole */
 
 	/* The 3rd and 4th cacheline with misc members to avoid false sharing
 	 * particularly with refcounting.
@@ -91,6 +92,34 @@ struct bpf_map {
 	char name[BPF_OBJ_NAME_LEN];
 };
 
+static inline bool map_value_has_spin_lock(const struct bpf_map *map)
+{
+	return map->spin_lock_off >= 0;
+}
+
+static inline void check_and_init_map_lock(struct bpf_map *map, void *dst)
+{
+	if (likely(!map_value_has_spin_lock(map)))
+		return;
+	*(struct bpf_spin_lock *)(dst + map->spin_lock_off) =
+		(struct bpf_spin_lock){};
+}
+
+/* copy everything but bpf_spin_lock */
+static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
+{
+	if (unlikely(map_value_has_spin_lock(map))) {
+		u32 off = map->spin_lock_off;
+
+		memcpy(dst, src, off);
+		memcpy(dst + off + sizeof(struct bpf_spin_lock),
+		       src + off + sizeof(struct bpf_spin_lock),
+		       map->value_size - off - sizeof(struct bpf_spin_lock));
+	} else {
+		memcpy(dst, src, map->value_size);
+	}
+}
+
 struct bpf_offload_dev;
 struct bpf_offloaded_map;
 
@@ -162,6 +191,7 @@ enum bpf_arg_type {
 	ARG_PTR_TO_CTX,		/* pointer to context */
 	ARG_ANYTHING,		/* any (initialized) argument is ok */
 	ARG_PTR_TO_SOCKET,	/* pointer to bpf_sock */
+	ARG_PTR_TO_SPIN_LOCK,	/* pointer to bpf_spin_lock */
 };
 
 /* type of values returned from helper functions */
@@ -879,7 +909,8 @@ extern const struct bpf_func_proto bpf_msg_redirect_hash_proto;
 extern const struct bpf_func_proto bpf_msg_redirect_map_proto;
 extern const struct bpf_func_proto bpf_sk_redirect_hash_proto;
 extern const struct bpf_func_proto bpf_sk_redirect_map_proto;
-
+extern const struct bpf_func_proto bpf_spin_lock_proto;
+extern const struct bpf_func_proto bpf_spin_unlock_proto;
 extern const struct bpf_func_proto bpf_get_local_storage_proto;
 
 /* Shared helpers among cBPF and eBPF. */
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 0620e418dde5..69f7a3449eda 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -148,6 +148,7 @@ struct bpf_verifier_state {
 	/* call stack tracking */
 	struct bpf_func_state *frame[MAX_CALL_FRAMES];
 	u32 curframe;
+	u32 active_spin_lock;
 	bool speculative;
 };
 
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 12502e25e767..455d31b55828 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -50,6 +50,7 @@ u32 btf_id(const struct btf *btf);
 bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
 			   const struct btf_member *m,
 			   u32 expected_offset, u32 expected_size);
+int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
 
 #ifdef CONFIG_BPF_SYSCALL
 const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 60b99b730a41..86f7c438d40f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2422,7 +2422,9 @@ union bpf_attr {
 	FN(map_peek_elem),		\
 	FN(msg_push_data),		\
 	FN(msg_pop_data),		\
-	FN(rc_pointer_rel),
+	FN(rc_pointer_rel),		\
+	FN(spin_lock),			\
+	FN(spin_unlock),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3056,4 +3058,7 @@ struct bpf_line_info {
 	__u32	line_col;
 };
 
+struct bpf_spin_lock {
+	__u32	val;
+};
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 84d882f3e299..fbba478ae522 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -242,6 +242,9 @@ config QUEUED_SPINLOCKS
 	def_bool y if ARCH_USE_QUEUED_SPINLOCKS
 	depends on SMP
 
+config BPF_ARCH_SPINLOCK
+	bool
+
 config ARCH_USE_QUEUED_RWLOCKS
 	bool
 
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 25632a75d630..d6d979910a2a 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -270,9 +270,10 @@ static int array_map_update_elem(struct bpf_map *map, void *key, void *value,
 		memcpy(this_cpu_ptr(array->pptrs[index & array->index_mask]),
 		       value, map->value_size);
 	else
-		memcpy(array->value +
-		       array->elem_size * (index & array->index_mask),
-		       value, map->value_size);
+		copy_map_value(map,
+			       array->value +
+			       array->elem_size * (index & array->index_mask),
+			       value);
 	return 0;
 }
 
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 3d661f0606fe..7019c1f05cab 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -355,6 +355,11 @@ static bool btf_type_is_struct(const struct btf_type *t)
 	return kind == BTF_KIND_STRUCT || kind == BTF_KIND_UNION;
 }
 
+static bool __btf_type_is_struct(const struct btf_type *t)
+{
+	return BTF_INFO_KIND(t->info) == BTF_KIND_STRUCT;
+}
+
 static bool btf_type_is_array(const struct btf_type *t)
 {
 	return BTF_INFO_KIND(t->info) == BTF_KIND_ARRAY;
@@ -2045,6 +2050,43 @@ static void btf_struct_log(struct btf_verifier_env *env,
 	btf_verifier_log(env, "size=%u vlen=%u", t->size, btf_type_vlen(t));
 }
 
+/* find 'struct bpf_spin_lock' in map value.
+ * return >= 0 offset if found
+ * and < 0 in case of error
+ */
+int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
+{
+	const struct btf_member *member;
+	u32 i, off = -ENOENT;
+
+	if (!__btf_type_is_struct(t))
+		return -EINVAL;
+
+	for_each_member(i, t, member) {
+		const struct btf_type *member_type = btf_type_by_id(btf,
+								    member->type);
+		if (!__btf_type_is_struct(member_type))
+			continue;
+		if (member_type->size != sizeof(struct bpf_spin_lock))
+			continue;
+		if (strcmp(__btf_name_by_offset(btf, member_type->name_off),
+			   "bpf_spin_lock"))
+			continue;
+		if (off != -ENOENT)
+			/* only one 'struct bpf_spin_lock' is allowed */
+			return -E2BIG;
+		off = btf_member_bit_offset(t, member);
+		if (off % 8)
+			/* valid C code cannot generate such BTF */
+			return -EINVAL;
+		off /= 8;
+		if (off % __alignof__(struct bpf_spin_lock))
+			/* valid struct bpf_spin_lock will be 4 byte aligned */
+			return -EINVAL;
+	}
+	return off;
+}
+
 static void btf_struct_seq_show(const struct btf *btf, const struct btf_type *t,
 				u32 type_id, void *data, u8 bits_offset,
 				struct seq_file *m)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index a7bcb23bee84..550d9f9298f3 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2001,6 +2001,8 @@ const struct bpf_func_proto bpf_map_delete_elem_proto __weak;
 const struct bpf_func_proto bpf_map_push_elem_proto __weak;
 const struct bpf_func_proto bpf_map_pop_elem_proto __weak;
 const struct bpf_func_proto bpf_map_peek_elem_proto __weak;
+const struct bpf_func_proto bpf_spin_lock_proto __weak;
+const struct bpf_func_proto bpf_spin_unlock_proto __weak;
 
 const struct bpf_func_proto bpf_get_prandom_u32_proto __weak;
 const struct bpf_func_proto bpf_get_smp_processor_id_proto __weak;
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 4b7c76765d9d..6d3b22c5ad68 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -718,21 +718,12 @@ static bool fd_htab_map_needs_adjust(const struct bpf_htab *htab)
 	       BITS_PER_LONG == 64;
 }
 
-static u32 htab_size_value(const struct bpf_htab *htab, bool percpu)
-{
-	u32 size = htab->map.value_size;
-
-	if (percpu || fd_htab_map_needs_adjust(htab))
-		size = round_up(size, 8);
-	return size;
-}
-
 static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 					 void *value, u32 key_size, u32 hash,
 					 bool percpu, bool onallcpus,
 					 struct htab_elem *old_elem)
 {
-	u32 size = htab_size_value(htab, percpu);
+	u32 size = htab->map.value_size;
 	bool prealloc = htab_is_prealloc(htab);
 	struct htab_elem *l_new, **pl_new;
 	void __percpu *pptr;
@@ -770,10 +761,13 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 			l_new = ERR_PTR(-ENOMEM);
 			goto dec_count;
 		}
+		check_and_init_map_lock(&htab->map,
+					l_new->key + round_up(key_size, 8));
 	}
 
 	memcpy(l_new->key, key, key_size);
 	if (percpu) {
+		size = round_up(size, 8);
 		if (prealloc) {
 			pptr = htab_elem_get_ptr(l_new, key_size);
 		} else {
@@ -791,8 +785,13 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 
 		if (!prealloc)
 			htab_elem_set_ptr(l_new, key_size, pptr);
-	} else {
+	} else if (fd_htab_map_needs_adjust(htab)) {
+		size = round_up(size, 8);
 		memcpy(l_new->key + round_up(key_size, 8), value, size);
+	} else {
+		copy_map_value(&htab->map,
+			       l_new->key + round_up(key_size, 8),
+			       value);
 	}
 
 	l_new->hash = hash;
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index a74972b07e74..29a8a4e62b8a 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -221,6 +221,86 @@ const struct bpf_func_proto bpf_get_current_comm_proto = {
 	.arg2_type	= ARG_CONST_SIZE,
 };
 
+#if defined(CONFIG_QUEUED_SPINLOCKS) || defined(CONFIG_BPF_ARCH_SPINLOCK)
+
+static inline void __bpf_spin_lock(struct bpf_spin_lock *lock)
+{
+	arch_spinlock_t *l = (void *)lock;
+	union {
+		__u32 val;
+		arch_spinlock_t lock;
+	} u = { .lock = __ARCH_SPIN_LOCK_UNLOCKED };
+
+	compiletime_assert(u.val == 0, "__ARCH_SPIN_LOCK_UNLOCKED not 0");
+	BUILD_BUG_ON(sizeof(*l) != sizeof(__u32));
+	BUILD_BUG_ON(sizeof(*lock) != sizeof(__u32));
+	arch_spin_lock(l);
+}
+
+static inline void __bpf_spin_unlock(struct bpf_spin_lock *lock)
+{
+	arch_spinlock_t *l = (void *)lock;
+
+	arch_spin_unlock(l);
+}
+
+#else
+
+static inline void __bpf_spin_lock(struct bpf_spin_lock *lock)
+{
+	atomic_t *l = (void *)lock;
+
+	BUILD_BUG_ON(sizeof(*l) != sizeof(*lock));
+	do {
+		atomic_cond_read_relaxed(l, !VAL);
+	} while (atomic_xchg(l, 1));
+}
+
+static inline void __bpf_spin_unlock(struct bpf_spin_lock *lock)
+{
+	atomic_t *l = (void *)lock;
+
+	atomic_set_release(l, 0);
+}
+
+#endif
+
+static DEFINE_PER_CPU(unsigned long, irqsave_flags);
+
+notrace BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock *, lock)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__bpf_spin_lock(lock);
+	this_cpu_write(irqsave_flags, flags);
+	return 0;
+}
+
+const struct bpf_func_proto bpf_spin_lock_proto = {
+	.func		= bpf_spin_lock,
+	.gpl_only	= false,
+	.ret_type	= RET_VOID,
+	.arg1_type	= ARG_PTR_TO_SPIN_LOCK,
+};
+
+notrace BPF_CALL_1(bpf_spin_unlock, struct bpf_spin_lock *, lock)
+{
+	unsigned long flags;
+
+	flags = this_cpu_read(irqsave_flags);
+	__bpf_spin_unlock(lock);
+	local_irq_restore(flags);
+	return 0;
+}
+
+const struct bpf_func_proto bpf_spin_unlock_proto = {
+	.func		= bpf_spin_unlock,
+	.gpl_only	= false,
+	.ret_type	= RET_VOID,
+	.arg1_type	= ARG_PTR_TO_SPIN_LOCK,
+};
+
 #ifdef CONFIG_CGROUPS
 BPF_CALL_0(bpf_get_current_cgroup_id)
 {
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 52378d3e34b3..583346a0ab29 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -37,6 +37,11 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 		return ERR_PTR(-EINVAL);
 	}
 
+	if (map_value_has_spin_lock(inner_map)) {
+		fdput(f);
+		return ERR_PTR(-ENOTSUPP);
+	}
+
 	inner_map_meta_size = sizeof(*inner_map_meta);
 	/* In some cases verifier needs to access beyond just base map. */
 	if (inner_map->ops == &array_map_ops)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b155cd17c1bd..ebf0a673cb83 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -463,7 +463,7 @@ int map_check_no_btf(const struct bpf_map *map,
 	return -ENOTSUPP;
 }
 
-static int map_check_btf(const struct bpf_map *map, const struct btf *btf,
+static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 			 u32 btf_key_id, u32 btf_value_id)
 {
 	const struct btf_type *key_type, *value_type;
@@ -478,6 +478,21 @@ static int map_check_btf(const struct bpf_map *map, const struct btf *btf,
 	if (!value_type || value_size != map->value_size)
 		return -EINVAL;
 
+	map->spin_lock_off = btf_find_spin_lock(btf, value_type);
+
+	if (map_value_has_spin_lock(map)) {
+		if (map->map_type != BPF_MAP_TYPE_HASH &&
+		    map->map_type != BPF_MAP_TYPE_ARRAY)
+			return -ENOTSUPP;
+		if (map->spin_lock_off + sizeof(struct bpf_spin_lock) >
+		    map->value_size) {
+			WARN_ONCE(1,
+				  "verifier bug spin_lock_off %d value_size %d\n",
+				  map->spin_lock_off, map->value_size);
+			return -EFAULT;
+		}
+	}
+
 	if (map->ops->map_check_btf)
 		ret = map->ops->map_check_btf(map, btf, key_type, value_type);
 
@@ -542,6 +557,8 @@ static int map_create(union bpf_attr *attr)
 		map->btf = btf;
 		map->btf_key_type_id = attr->btf_key_type_id;
 		map->btf_value_type_id = attr->btf_value_type_id;
+	} else {
+		map->spin_lock_off = -EINVAL;
 	}
 
 	err = security_bpf_map_alloc(map);
@@ -740,7 +757,7 @@ static int map_lookup_elem(union bpf_attr *attr)
 			err = -ENOENT;
 		} else {
 			err = 0;
-			memcpy(value, ptr, value_size);
+			copy_map_value(map, value, ptr);
 		}
 		rcu_read_unlock();
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8c1c21cd50b4..b6af9d2cb236 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -213,6 +213,7 @@ struct bpf_call_arg_meta {
 	s64 msize_smax_value;
 	u64 msize_umax_value;
 	int ptr_id;
+	int func_id;
 };
 
 static DEFINE_MUTEX(bpf_verifier_lock);
@@ -351,6 +352,12 @@ static bool reg_is_refcounted(const struct bpf_reg_state *reg)
 	return type_is_refcounted(reg->type);
 }
 
+static bool reg_may_point_to_spin_lock(const struct bpf_reg_state *reg)
+{
+	return reg->type == PTR_TO_MAP_VALUE &&
+		map_value_has_spin_lock(reg->map_ptr);
+}
+
 static bool reg_is_refcounted_or_null(const struct bpf_reg_state *reg)
 {
 	return type_is_refcounted_or_null(reg->type);
@@ -712,6 +719,7 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state,
 	}
 	dst_state->speculative = src->speculative;
 	dst_state->curframe = src->curframe;
+	dst_state->active_spin_lock = src->active_spin_lock;
 	for (i = 0; i <= src->curframe; i++) {
 		dst = dst_state->frame[i];
 		if (!dst) {
@@ -1483,6 +1491,21 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
 	if (err)
 		verbose(env, "R%d max value is outside of the array range\n",
 			regno);
+
+	if (map_value_has_spin_lock(reg->map_ptr)) {
+		u32 lock = reg->map_ptr->spin_lock_off;
+
+		/* if any part of struct bpf_spin_lock can be touched by
+		 * load/store reject this program
+		 */
+		if ((reg->smin_value + off <= lock &&
+		     lock < reg->umax_value + off + size) ||
+		    (reg->smin_value + off < lock + sizeof(struct bpf_spin_lock) &&
+		     lock + sizeof(struct bpf_spin_lock) <= reg->umax_value + off + size)) {
+			verbose(env, "bpf_spin_lock cannot be accessed directly by load/store\n");
+			return -EACCES;
+		}
+	}
 	return err;
 }
 
@@ -2192,6 +2215,91 @@ static int check_helper_mem_access(struct bpf_verifier_env *env, int regno,
 	}
 }
 
+/* Implementation details:
+ * bpf_map_lookup returns PTR_TO_MAP_VALUE_OR_NULL
+ * Two bpf_map_lookups (even with the same key) will have different reg->id.
+ * For traditional PTR_TO_MAP_VALUE the verifier clears reg->id after
+ * value_or_null->value transition, since the verifier only cares about
+ * the range of access to valid map value pointer and doesn't care about actual
+ * address of the map element.
+ * For maps with 'struct bpf_spin_lock' inside map value the verifier keeps
+ * reg->id > 0 after value_or_null->value transition. By doing so
+ * two bpf_map_lookups will be considered two different pointers that
+ * point to different bpf_spin_locks.
+ * The verifier allows taking only one bpf_spin_lock at a time to avoid
+ * dead-locks.
+ * Since only one bpf_spin_lock is allowed the checks are simpler than
+ * reg_is_refcounted() logic. The verifier needs to remember only
+ * one spin_lock instead of array of acquired_refs.
+ * cur_state->active_spin_lock remembers which map value element got locked
+ * and clears it after bpf_spin_unlock.
+ */
+static int process_spin_lock(struct bpf_verifier_env *env, int regno,
+			     bool is_lock)
+{
+	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
+	struct bpf_verifier_state *cur = env->cur_state;
+	bool is_const = tnum_is_const(reg->var_off);
+	struct bpf_map *map = reg->map_ptr;
+	u64 val = reg->var_off.value;
+
+	if (reg->type != PTR_TO_MAP_VALUE) {
+		verbose(env, "R%d is not a pointer to map_value\n", regno);
+		return -EINVAL;
+	}
+	if (!is_const) {
+		verbose(env,
+			"R%d doesn't have constant offset. bpf_spin_lock has to be at the constant offset\n",
+			regno);
+		return -EINVAL;
+	}
+	if (!map->btf) {
+		verbose(env,
+			"map '%s' has to have BTF in order to use bpf_spin_lock\n",
+			map->name);
+		return -EINVAL;
+	}
+	if (!map_value_has_spin_lock(map)) {
+		if (map->spin_lock_off == -E2BIG)
+			verbose(env,
+				"map '%s' has more than one 'struct bpf_spin_lock'\n",
+				map->name);
+		else if (map->spin_lock_off == -ENOENT)
+			verbose(env,
+				"map '%s' doesn't have 'struct bpf_spin_lock'\n",
+				map->name);
+		else
+			verbose(env,
+				"map '%s' is not a struct type or bpf_spin_lock is mangled\n",
+				map->name);
+		return -EINVAL;
+	}
+	if (map->spin_lock_off != val + reg->off) {
+		verbose(env, "off %lld doesn't point to 'struct bpf_spin_lock'\n",
+			val + reg->off);
+		return -EINVAL;
+	}
+	if (is_lock) {
+		if (cur->active_spin_lock) {
+			verbose(env,
+				"Locking two bpf_spin_locks are not allowed\n");
+			return -EINVAL;
+		}
+		cur->active_spin_lock = reg->id;
+	} else {
+		if (!cur->active_spin_lock) {
+			verbose(env, "bpf_spin_unlock without taking a lock\n");
+			return -EINVAL;
+		}
+		if (cur->active_spin_lock != reg->id) {
+			verbose(env, "bpf_spin_unlock of different lock\n");
+			return -EINVAL;
+		}
+		cur->active_spin_lock = 0;
+	}
+	return 0;
+}
+
 static bool arg_type_is_mem_ptr(enum bpf_arg_type type)
 {
 	return type == ARG_PTR_TO_MEM ||
@@ -2268,6 +2376,17 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 			return -EFAULT;
 		}
 		meta->ptr_id = reg->id;
+	} else if (arg_type == ARG_PTR_TO_SPIN_LOCK) {
+		if (meta->func_id == BPF_FUNC_spin_lock) {
+			if (process_spin_lock(env, regno, true))
+				return -EACCES;
+		} else if (meta->func_id == BPF_FUNC_spin_unlock) {
+			if (process_spin_lock(env, regno, false))
+				return -EACCES;
+		} else {
+			verbose(env, "verifier internal error\n");
+			return -EFAULT;
+		}
 	} else if (arg_type_is_mem_ptr(arg_type)) {
 		expected_type = PTR_TO_STACK;
 		/* One exception here. In case function allows for NULL to be
@@ -2887,6 +3006,7 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 		return err;
 	}
 
+	meta.func_id = func_id;
 	/* check args */
 	err = check_func_arg(env, BPF_REG_1, fn->arg1_type, &meta);
 	if (err)
@@ -4473,7 +4593,8 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state,
 		} else if (reg->type == PTR_TO_SOCKET_OR_NULL) {
 			reg->type = PTR_TO_SOCKET;
 		}
-		if (is_null || !reg_is_refcounted(reg)) {
+		if (is_null || !(reg_is_refcounted(reg) ||
+				 reg_may_point_to_spin_lock(reg))) {
 			/* We don't need id from this point onwards anymore,
 			 * thus we should better reset it, so that state
 			 * pruning has chances to take effect.
@@ -4871,6 +4992,11 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn)
 		return err;
 	}
 
+	if (env->cur_state->active_spin_lock) {
+		verbose(env, "BPF_LD_[ABS|IND] cannot be used inside bpf_spin_lock-ed region\n");
+		return -EINVAL;
+	}
+
 	if (regs[BPF_REG_6].type != PTR_TO_CTX) {
 		verbose(env,
 			"at the time of BPF_LD_ABS|IND R6 != pointer to skb\n");
@@ -5607,8 +5733,11 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur,
 	case PTR_TO_MAP_VALUE:
 		/* If the new min/max/var_off satisfy the old ones and
 		 * everything else matches, we are OK.
-		 * We don't care about the 'id' value, because nothing
-		 * uses it for PTR_TO_MAP_VALUE (only for ..._OR_NULL)
+		 * 'id' is not compared, since it's only used for maps with
+		 * bpf_spin_lock inside map element and in such cases if
+		 * the rest of the prog is valid for one map element then
+		 * it's valid for all map elements regardless of the key
+		 * used in bpf_map_lookup()
 		 */
 		return memcmp(rold, rcur, offsetof(struct bpf_reg_state, id)) == 0 &&
 		       range_within(rold, rcur) &&
@@ -5811,6 +5940,9 @@ static bool states_equal(struct bpf_verifier_env *env,
 	if (old->speculative && !cur->speculative)
 		return false;
 
+	if (old->active_spin_lock != cur->active_spin_lock)
+		return false;
+
 	/* for states to be equal callsites have to be the same
 	 * and all frame states need to be equivalent
 	 */
@@ -6229,6 +6361,12 @@ static int do_check(struct bpf_verifier_env *env)
 					return -EINVAL;
 				}
 
+				if (env->cur_state->active_spin_lock &&
+				    (insn->src_reg == BPF_PSEUDO_CALL ||
+				     insn->imm != BPF_FUNC_spin_unlock)) {
+					verbose(env, "function calls are not allowed while holding a lock\n");
+					return -EINVAL;
+				}
 				if (insn->src_reg == BPF_PSEUDO_CALL)
 					err = check_func_call(env, insn, &env->insn_idx);
 				else
@@ -6259,6 +6397,11 @@ static int do_check(struct bpf_verifier_env *env)
 					return -EINVAL;
 				}
 
+				if (env->cur_state->active_spin_lock) {
+					verbose(env, "bpf_spin_unlock is missing\n");
+					return -EINVAL;
+				}
+
 				if (state->curframe) {
 					/* exit from nested function */
 					env->prev_insn_idx = env->insn_idx;
@@ -6356,6 +6499,19 @@ static int check_map_prealloc(struct bpf_map *map)
 		!(map->map_flags & BPF_F_NO_PREALLOC);
 }
 
+static bool is_tracing_prog_type(enum bpf_prog_type type)
+{
+	switch (type) {
+	case BPF_PROG_TYPE_KPROBE:
+	case BPF_PROG_TYPE_TRACEPOINT:
+	case BPF_PROG_TYPE_PERF_EVENT:
+	case BPF_PROG_TYPE_RAW_TRACEPOINT:
+		return true;
+	default:
+		return false;
+	}
+}
+
 static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 					struct bpf_map *map,
 					struct bpf_prog *prog)
@@ -6378,6 +6534,13 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 		}
 	}
 
+	if ((is_tracing_prog_type(prog->type) ||
+	     prog->type == BPF_PROG_TYPE_SOCKET_FILTER) &&
+	    map_value_has_spin_lock(map)) {
+		verbose(env, "tracing progs cannot use bpf_spin_lock yet\n");
+		return -EINVAL;
+	}
+
 	if ((bpf_prog_is_dev_bound(prog->aux) || bpf_map_is_dev_bound(map)) &&
 	    !bpf_offload_prog_map_match(prog, map)) {
 		verbose(env, "offload device mismatch between prog and map\n");
diff --git a/net/core/filter.c b/net/core/filter.c
index 41984ad4b9b4..3a49f68eda10 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5314,10 +5314,20 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_tail_call_proto;
 	case BPF_FUNC_ktime_get_ns:
 		return &bpf_ktime_get_ns_proto;
+	default:
+		break;
+	}
+
+	if (!capable(CAP_SYS_ADMIN))
+		return NULL;
+
+	switch (func_id) {
+	case BPF_FUNC_spin_lock:
+		return &bpf_spin_lock_proto;
+	case BPF_FUNC_spin_unlock:
+		return &bpf_spin_unlock_proto;
 	case BPF_FUNC_trace_printk:
-		if (capable(CAP_SYS_ADMIN))
-			return bpf_get_trace_printk_proto();
-		/* else, fall through */
+		return bpf_get_trace_printk_proto();
 	default:
 		return NULL;
 	}
-- 
2.20.0


^ permalink raw reply related

* [PATCH v6 bpf-next 4/9] selftests/bpf: add bpf_spin_lock verifier tests
From: Alexei Starovoitov @ 2019-01-31  5:05 UTC (permalink / raw)
  To: davem; +Cc: daniel, peterz, jannh, netdev, kernel-team
In-Reply-To: <20190131050546.3634009-1-ast@kernel.org>

add bpf_spin_lock tests to test_verifier.c that don't require
latest llvm with BTF support

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/test_verifier.c   | 104 +++++-
 .../selftests/bpf/verifier/spin_lock.c        | 331 ++++++++++++++++++
 2 files changed, 434 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/verifier/spin_lock.c

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index c5e22422a852..ae01aedb65a0 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -32,6 +32,7 @@
 #include <linux/bpf_perf_event.h>
 #include <linux/bpf.h>
 #include <linux/if_ether.h>
+#include <linux/btf.h>
 
 #include <bpf/bpf.h>
 
@@ -49,7 +50,7 @@
 
 #define MAX_INSNS	BPF_MAXINSNS
 #define MAX_FIXUPS	8
-#define MAX_NR_MAPS	13
+#define MAX_NR_MAPS	14
 #define MAX_TEST_RUNS	8
 #define POINTER_VALUE	0xcafe4all
 #define TEST_DATA_LEN	64
@@ -76,6 +77,7 @@ struct bpf_test {
 	int fixup_map_in_map[MAX_FIXUPS];
 	int fixup_cgroup_storage[MAX_FIXUPS];
 	int fixup_percpu_cgroup_storage[MAX_FIXUPS];
+	int fixup_map_spin_lock[MAX_FIXUPS];
 	const char *errstr;
 	const char *errstr_unpriv;
 	uint32_t retval, retval_unpriv, insn_processed;
@@ -381,6 +383,98 @@ static int create_cgroup_storage(bool percpu)
 	return fd;
 }
 
+#define BTF_INFO_ENC(kind, kind_flag, vlen) \
+	((!!(kind_flag) << 31) | ((kind) << 24) | ((vlen) & BTF_MAX_VLEN))
+#define BTF_TYPE_ENC(name, info, size_or_type) \
+	(name), (info), (size_or_type)
+#define BTF_INT_ENC(encoding, bits_offset, nr_bits) \
+	((encoding) << 24 | (bits_offset) << 16 | (nr_bits))
+#define BTF_TYPE_INT_ENC(name, encoding, bits_offset, bits, sz) \
+	BTF_TYPE_ENC(name, BTF_INFO_ENC(BTF_KIND_INT, 0, 0), sz), \
+	BTF_INT_ENC(encoding, bits_offset, bits)
+#define BTF_MEMBER_ENC(name, type, bits_offset) \
+	(name), (type), (bits_offset)
+
+struct btf_raw_data {
+	__u32 raw_types[64];
+	const char *str_sec;
+	__u32 str_sec_size;
+};
+
+/* struct bpf_spin_lock {
+ *   int val;
+ * };
+ * struct val {
+ *   int cnt;
+ *   struct bpf_spin_lock l;
+ * };
+ */
+static const char btf_str_sec[] = "\0bpf_spin_lock\0val\0cnt\0l";
+static __u32 btf_raw_types[] = {
+	/* int */
+	BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4),  /* [1] */
+	/* struct bpf_spin_lock */                      /* [2] */
+	BTF_TYPE_ENC(1, BTF_INFO_ENC(BTF_KIND_STRUCT, 0, 1), 4),
+	BTF_MEMBER_ENC(15, 1, 0), /* int val; */
+	/* struct val */                                /* [3] */
+	BTF_TYPE_ENC(15, BTF_INFO_ENC(BTF_KIND_STRUCT, 0, 2), 8),
+	BTF_MEMBER_ENC(19, 1, 0), /* int cnt; */
+	BTF_MEMBER_ENC(23, 2, 32),/* struct bpf_spin_lock l; */
+};
+
+static int load_btf(void)
+{
+	struct btf_header hdr = {
+		.magic = BTF_MAGIC,
+		.version = BTF_VERSION,
+		.hdr_len = sizeof(struct btf_header),
+		.type_len = sizeof(btf_raw_types),
+		.str_off = sizeof(btf_raw_types),
+		.str_len = sizeof(btf_str_sec),
+	};
+	void *ptr, *raw_btf;
+	int btf_fd;
+
+	ptr = raw_btf = malloc(sizeof(hdr) + sizeof(btf_raw_types) +
+			       sizeof(btf_str_sec));
+
+	memcpy(ptr, &hdr, sizeof(hdr));
+	ptr += sizeof(hdr);
+	memcpy(ptr, btf_raw_types, hdr.type_len);
+	ptr += hdr.type_len;
+	memcpy(ptr, btf_str_sec, hdr.str_len);
+	ptr += hdr.str_len;
+
+	btf_fd = bpf_load_btf(raw_btf, ptr - raw_btf, 0, 0, 0);
+	free(raw_btf);
+	if (btf_fd < 0)
+		return -1;
+	return btf_fd;
+}
+
+static int create_map_spin_lock(void)
+{
+	struct bpf_create_map_attr attr = {
+		.name = "test_map",
+		.map_type = BPF_MAP_TYPE_ARRAY,
+		.key_size = 4,
+		.value_size = 8,
+		.max_entries = 1,
+		.btf_key_type_id = 1,
+		.btf_value_type_id = 3,
+	};
+	int fd, btf_fd;
+
+	btf_fd = load_btf();
+	if (btf_fd < 0)
+		return -1;
+	attr.btf_fd = btf_fd;
+	fd = bpf_create_map_xattr(&attr);
+	if (fd < 0)
+		printf("Failed to create map with spin_lock\n");
+	return fd;
+}
+
 static char bpf_vlog[UINT_MAX >> 8];
 
 static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
@@ -399,6 +493,7 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
 	int *fixup_map_in_map = test->fixup_map_in_map;
 	int *fixup_cgroup_storage = test->fixup_cgroup_storage;
 	int *fixup_percpu_cgroup_storage = test->fixup_percpu_cgroup_storage;
+	int *fixup_map_spin_lock = test->fixup_map_spin_lock;
 
 	if (test->fill_helper)
 		test->fill_helper(test);
@@ -515,6 +610,13 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
 			fixup_map_stacktrace++;
 		} while (*fixup_map_stacktrace);
 	}
+	if (*fixup_map_spin_lock) {
+		map_fds[13] = create_map_spin_lock();
+		do {
+			prog[*fixup_map_spin_lock].imm = map_fds[13];
+			fixup_map_spin_lock++;
+		} while (*fixup_map_spin_lock);
+	}
 }
 
 static int set_admin(bool admin)
diff --git a/tools/testing/selftests/bpf/verifier/spin_lock.c b/tools/testing/selftests/bpf/verifier/spin_lock.c
new file mode 100644
index 000000000000..d829eef372a4
--- /dev/null
+++ b/tools/testing/selftests/bpf/verifier/spin_lock.c
@@ -0,0 +1,331 @@
+{
+	"spin_lock: test1 success",
+	.insns = {
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_6, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 3 },
+	.result = ACCEPT,
+	.result_unpriv = REJECT,
+	.errstr_unpriv = "",
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+},
+{
+	"spin_lock: test2 direct ld/st",
+	.insns = {
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 3 },
+	.result = REJECT,
+	.errstr = "cannot be accessed directly",
+	.result_unpriv = REJECT,
+	.errstr_unpriv = "",
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+},
+{
+	"spin_lock: test3 direct ld/st",
+	.insns = {
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_6, 1),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 3 },
+	.result = REJECT,
+	.errstr = "cannot be accessed directly",
+	.result_unpriv = REJECT,
+	.errstr_unpriv = "",
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+},
+{
+	"spin_lock: test4 direct ld/st",
+	.insns = {
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_6, 3),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 3 },
+	.result = REJECT,
+	.errstr = "cannot be accessed directly",
+	.result_unpriv = REJECT,
+	.errstr_unpriv = "",
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+},
+{
+	"spin_lock: test5 call within a locked region",
+	.insns = {
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_get_prandom_u32),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 3 },
+	.result = REJECT,
+	.errstr = "calls are not allowed",
+	.result_unpriv = REJECT,
+	.errstr_unpriv = "",
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+},
+{
+	"spin_lock: test6 missing unlock",
+	.insns = {
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_6, 0),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 3 },
+	.result = REJECT,
+	.errstr = "unlock is missing",
+	.result_unpriv = REJECT,
+	.errstr_unpriv = "",
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+},
+{
+	"spin_lock: test7 unlock without lock",
+	.insns = {
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 1),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_6, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 3 },
+	.result = REJECT,
+	.errstr = "without taking a lock",
+	.result_unpriv = REJECT,
+	.errstr_unpriv = "",
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+},
+{
+	"spin_lock: test8 double lock",
+	.insns = {
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_6, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 3 },
+	.result = REJECT,
+	.errstr = "calls are not allowed",
+	.result_unpriv = REJECT,
+	.errstr_unpriv = "",
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+},
+{
+	"spin_lock: test9 different lock",
+	.insns = {
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_7),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 3, 11 },
+	.result = REJECT,
+	.errstr = "unlock of different lock",
+	.result_unpriv = REJECT,
+	.errstr_unpriv = "",
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+},
+{
+	"spin_lock: test10 lock in subprog without unlock",
+	.insns = {
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 1, 0, 5),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 3 },
+	.result = REJECT,
+	.errstr = "unlock is missing",
+	.result_unpriv = REJECT,
+	.errstr_unpriv = "",
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+},
+{
+	"spin_lock: test11 ld_abs under lock",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_LD_ABS(BPF_B, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_7),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 4 },
+	.result = REJECT,
+	.errstr = "inside bpf_spin_lock",
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+},
-- 
2.20.0


^ permalink raw reply related

* [PATCH v6 bpf-next 2/9] bpf: add support for bpf_spin_lock to cgroup local storage
From: Alexei Starovoitov @ 2019-01-31  5:05 UTC (permalink / raw)
  To: davem; +Cc: daniel, peterz, jannh, netdev, kernel-team
In-Reply-To: <20190131050546.3634009-1-ast@kernel.org>

Allow 'struct bpf_spin_lock' to reside inside cgroup local storage.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/local_storage.c | 2 ++
 kernel/bpf/syscall.c       | 3 ++-
 kernel/bpf/verifier.c      | 2 ++
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index 07a34ef562a0..0295427f06e2 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -147,6 +147,7 @@ static int cgroup_storage_update_elem(struct bpf_map *map, void *_key,
 		return -ENOMEM;
 
 	memcpy(&new->data[0], value, map->value_size);
+	check_and_init_map_lock(map, new->data);
 
 	new = xchg(&storage->buf, new);
 	kfree_rcu(new, rcu);
@@ -483,6 +484,7 @@ struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(struct bpf_prog *prog,
 		storage->buf = kmalloc_node(size, flags, map->numa_node);
 		if (!storage->buf)
 			goto enomem;
+		check_and_init_map_lock(map, storage->buf->data);
 	} else {
 		storage->percpu_buf = __alloc_percpu_gfp(size, 8, flags);
 		if (!storage->percpu_buf)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ebf0a673cb83..b29e6dc44650 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -482,7 +482,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 
 	if (map_value_has_spin_lock(map)) {
 		if (map->map_type != BPF_MAP_TYPE_HASH &&
-		    map->map_type != BPF_MAP_TYPE_ARRAY)
+		    map->map_type != BPF_MAP_TYPE_ARRAY &&
+		    map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE)
 			return -ENOTSUPP;
 		if (map->spin_lock_off + sizeof(struct bpf_spin_lock) >
 		    map->value_size) {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b6af9d2cb236..dd1bd7485a57 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3089,6 +3089,8 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 		regs[BPF_REG_0].map_ptr = meta.map_ptr;
 		if (fn->ret_type == RET_PTR_TO_MAP_VALUE) {
 			regs[BPF_REG_0].type = PTR_TO_MAP_VALUE;
+			if (map_value_has_spin_lock(meta.map_ptr))
+				regs[BPF_REG_0].id = ++env->id_gen;
 		} else {
 			regs[BPF_REG_0].type = PTR_TO_MAP_VALUE_OR_NULL;
 			regs[BPF_REG_0].id = ++env->id_gen;
-- 
2.20.0


^ permalink raw reply related

* [PATCH v6 bpf-next 7/9] tools/bpf: sync uapi/bpf.h
From: Alexei Starovoitov @ 2019-01-31  5:05 UTC (permalink / raw)
  To: davem; +Cc: daniel, peterz, jannh, netdev, kernel-team
In-Reply-To: <20190131050546.3634009-1-ast@kernel.org>

add BPF_F_LOCK definition to tools/include/uapi/linux/bpf.h

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/include/uapi/linux/bpf.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 86f7c438d40f..1777fa0c61e4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -267,6 +267,7 @@ enum bpf_attach_type {
 #define BPF_ANY		0 /* create new element or update existing */
 #define BPF_NOEXIST	1 /* create new element if it didn't exist */
 #define BPF_EXIST	2 /* update existing element */
+#define BPF_F_LOCK	4 /* spin_lock-ed map_lookup/map_update */
 
 /* flags for BPF_MAP_CREATE command */
 #define BPF_F_NO_PREALLOC	(1U << 0)
-- 
2.20.0


^ permalink raw reply related

* [PATCH v6 bpf-next 6/9] bpf: introduce BPF_F_LOCK flag
From: Alexei Starovoitov @ 2019-01-31  5:05 UTC (permalink / raw)
  To: davem; +Cc: daniel, peterz, jannh, netdev, kernel-team
In-Reply-To: <20190131050546.3634009-1-ast@kernel.org>

Introduce BPF_F_LOCK flag for map_lookup and map_update syscall commands
and for map_update() helper function.
In all these cases take a lock of existing element (which was provided
in BTF description) before copying (in or out) the rest of map value.

Implementation details that are part of uapi:

Array:
The array map takes the element lock for lookup/update.

Hash:
hash map also takes the lock for lookup/update and tries to avoid the bucket lock.
If old element exists it takes the element lock and updates the element in place.
If element doesn't exist it allocates new one and inserts into hash table
while holding the bucket lock.
In rare case the hashmap has to take both the bucket lock and the element lock
to update old value in place.

Cgroup local storage:
It is similar to array. update in place and lookup are done with lock taken.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h        |  2 ++
 include/uapi/linux/bpf.h   |  1 +
 kernel/bpf/arraymap.c      | 24 ++++++++++++++--------
 kernel/bpf/hashtab.c       | 42 +++++++++++++++++++++++++++++++++++---
 kernel/bpf/helpers.c       | 16 +++++++++++++++
 kernel/bpf/local_storage.c | 14 ++++++++++++-
 kernel/bpf/syscall.c       | 25 +++++++++++++++++++++--
 7 files changed, 110 insertions(+), 14 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 2ae615b48bb8..bd169a7bcc93 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -119,6 +119,8 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
 		memcpy(dst, src, map->value_size);
 	}
 }
+void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
+			   bool lock_src);
 
 struct bpf_offload_dev;
 struct bpf_offloaded_map;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 86f7c438d40f..1777fa0c61e4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -267,6 +267,7 @@ enum bpf_attach_type {
 #define BPF_ANY		0 /* create new element or update existing */
 #define BPF_NOEXIST	1 /* create new element if it didn't exist */
 #define BPF_EXIST	2 /* update existing element */
+#define BPF_F_LOCK	4 /* spin_lock-ed map_lookup/map_update */
 
 /* flags for BPF_MAP_CREATE command */
 #define BPF_F_NO_PREALLOC	(1U << 0)
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index d6d979910a2a..c72e0d8e1e65 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -253,8 +253,9 @@ static int array_map_update_elem(struct bpf_map *map, void *key, void *value,
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	u32 index = *(u32 *)key;
+	char *val;
 
-	if (unlikely(map_flags > BPF_EXIST))
+	if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST))
 		/* unknown flags */
 		return -EINVAL;
 
@@ -262,18 +263,25 @@ static int array_map_update_elem(struct bpf_map *map, void *key, void *value,
 		/* all elements were pre-allocated, cannot insert a new one */
 		return -E2BIG;
 
-	if (unlikely(map_flags == BPF_NOEXIST))
+	if (unlikely(map_flags & BPF_NOEXIST))
 		/* all elements already exist */
 		return -EEXIST;
 
-	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
+	if (unlikely((map_flags & BPF_F_LOCK) &&
+		     !map_value_has_spin_lock(map)))
+		return -EINVAL;
+
+	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
 		memcpy(this_cpu_ptr(array->pptrs[index & array->index_mask]),
 		       value, map->value_size);
-	else
-		copy_map_value(map,
-			       array->value +
-			       array->elem_size * (index & array->index_mask),
-			       value);
+	} else {
+		val = array->value +
+			array->elem_size * (index & array->index_mask);
+		if (map_flags & BPF_F_LOCK)
+			copy_map_value_locked(map, val, value, false);
+		else
+			copy_map_value(map, val, value);
+	}
 	return 0;
 }
 
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 6d3b22c5ad68..937776531998 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -804,11 +804,11 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 static int check_flags(struct bpf_htab *htab, struct htab_elem *l_old,
 		       u64 map_flags)
 {
-	if (l_old && map_flags == BPF_NOEXIST)
+	if (l_old && (map_flags & ~BPF_F_LOCK) == BPF_NOEXIST)
 		/* elem already exists */
 		return -EEXIST;
 
-	if (!l_old && map_flags == BPF_EXIST)
+	if (!l_old && (map_flags & ~BPF_F_LOCK) == BPF_EXIST)
 		/* elem doesn't exist, cannot update it */
 		return -ENOENT;
 
@@ -827,7 +827,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	u32 key_size, hash;
 	int ret;
 
-	if (unlikely(map_flags > BPF_EXIST))
+	if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST))
 		/* unknown flags */
 		return -EINVAL;
 
@@ -840,6 +840,28 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
+	if (unlikely(map_flags & BPF_F_LOCK)) {
+		if (unlikely(!map_value_has_spin_lock(map)))
+			return -EINVAL;
+		/* find an element without taking the bucket lock */
+		l_old = lookup_nulls_elem_raw(head, hash, key, key_size,
+					      htab->n_buckets);
+		ret = check_flags(htab, l_old, map_flags);
+		if (ret)
+			return ret;
+		if (l_old) {
+			/* grab the element lock and update value in place */
+			copy_map_value_locked(map,
+					      l_old->key + round_up(key_size, 8),
+					      value, false);
+			return 0;
+		}
+		/* fall through, grab the bucket lock and lookup again.
+		 * 99.9% chance that the element won't be found,
+		 * but second lookup under lock has to be done.
+		 */
+	}
+
 	/* bpf_map_update_elem() can be called in_irq() */
 	raw_spin_lock_irqsave(&b->lock, flags);
 
@@ -849,6 +871,20 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	if (ret)
 		goto err;
 
+	if (unlikely(l_old && (map_flags & BPF_F_LOCK))) {
+		/* first lookup without the bucket lock didn't find the element,
+		 * but second lookup with the bucket lock found it.
+		 * This case is highly unlikely, but has to be dealt with:
+		 * grab the element lock in addition to the bucket lock
+		 * and update element in place
+		 */
+		copy_map_value_locked(map,
+				      l_old->key + round_up(key_size, 8),
+				      value, false);
+		ret = 0;
+		goto err;
+	}
+
 	l_new = alloc_htab_elem(htab, key, value, key_size, hash, false, false,
 				l_old);
 	if (IS_ERR(l_new)) {
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 29a8a4e62b8a..8218be1ee6ef 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -301,6 +301,22 @@ const struct bpf_func_proto bpf_spin_unlock_proto = {
 	.arg1_type	= ARG_PTR_TO_SPIN_LOCK,
 };
 
+void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
+			   bool lock_src)
+{
+	struct bpf_spin_lock *lock;
+
+	if (lock_src)
+		lock = src + map->spin_lock_off;
+	else
+		lock = dst + map->spin_lock_off;
+	preempt_disable();
+	____bpf_spin_lock(lock);
+	copy_map_value(map, dst, src);
+	____bpf_spin_unlock(lock);
+	preempt_enable();
+}
+
 #ifdef CONFIG_CGROUPS
 BPF_CALL_0(bpf_get_current_cgroup_id)
 {
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index 0295427f06e2..6b572e2de7fb 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -131,7 +131,14 @@ static int cgroup_storage_update_elem(struct bpf_map *map, void *_key,
 	struct bpf_cgroup_storage *storage;
 	struct bpf_storage_buffer *new;
 
-	if (flags != BPF_ANY && flags != BPF_EXIST)
+	if (unlikely(flags & ~(BPF_F_LOCK | BPF_EXIST | BPF_NOEXIST)))
+		return -EINVAL;
+
+	if (unlikely(flags & BPF_NOEXIST))
+		return -EINVAL;
+
+	if (unlikely((flags & BPF_F_LOCK) &&
+		     !map_value_has_spin_lock(map)))
 		return -EINVAL;
 
 	storage = cgroup_storage_lookup((struct bpf_cgroup_storage_map *)map,
@@ -139,6 +146,11 @@ static int cgroup_storage_update_elem(struct bpf_map *map, void *_key,
 	if (!storage)
 		return -ENOENT;
 
+	if (flags & BPF_F_LOCK) {
+		copy_map_value_locked(map, storage->buf->data, value, false);
+		return 0;
+	}
+
 	new = kmalloc_node(sizeof(struct bpf_storage_buffer) +
 			   map->value_size,
 			   __GFP_ZERO | GFP_ATOMIC | __GFP_NOWARN,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b29e6dc44650..0834958f1dc4 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -682,7 +682,7 @@ static void *__bpf_copy_key(void __user *ukey, u64 key_size)
 }
 
 /* last field in 'union bpf_attr' used by this command */
-#define BPF_MAP_LOOKUP_ELEM_LAST_FIELD value
+#define BPF_MAP_LOOKUP_ELEM_LAST_FIELD flags
 
 static int map_lookup_elem(union bpf_attr *attr)
 {
@@ -698,6 +698,9 @@ static int map_lookup_elem(union bpf_attr *attr)
 	if (CHECK_ATTR(BPF_MAP_LOOKUP_ELEM))
 		return -EINVAL;
 
+	if (attr->flags & ~BPF_F_LOCK)
+		return -EINVAL;
+
 	f = fdget(ufd);
 	map = __bpf_map_get(f);
 	if (IS_ERR(map))
@@ -708,6 +711,12 @@ static int map_lookup_elem(union bpf_attr *attr)
 		goto err_put;
 	}
 
+	if ((attr->flags & BPF_F_LOCK) &&
+	    !map_value_has_spin_lock(map)) {
+		err = -EINVAL;
+		goto err_put;
+	}
+
 	key = __bpf_copy_key(ukey, map->key_size);
 	if (IS_ERR(key)) {
 		err = PTR_ERR(key);
@@ -758,7 +767,13 @@ static int map_lookup_elem(union bpf_attr *attr)
 			err = -ENOENT;
 		} else {
 			err = 0;
-			copy_map_value(map, value, ptr);
+			if (attr->flags & BPF_F_LOCK)
+				/* lock 'ptr' and copy everything but lock */
+				copy_map_value_locked(map, value, ptr, true);
+			else
+				copy_map_value(map, value, ptr);
+			/* mask lock, since value wasn't zero inited */
+			check_and_init_map_lock(map, value);
 		}
 		rcu_read_unlock();
 	}
@@ -818,6 +833,12 @@ static int map_update_elem(union bpf_attr *attr)
 		goto err_put;
 	}
 
+	if ((attr->flags & BPF_F_LOCK) &&
+	    !map_value_has_spin_lock(map)) {
+		err = -EINVAL;
+		goto err_put;
+	}
+
 	key = __bpf_copy_key(ukey, map->key_size);
 	if (IS_ERR(key)) {
 		err = PTR_ERR(key);
-- 
2.20.0


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox