* Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
From: Eric W. Biederman @ 2018-04-19 14:53 UTC (permalink / raw)
To: Rahul Lakkireddy
Cc: Dave Young, netdev@vger.kernel.org, kexec@lists.infradead.org,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
Indranil Choudhury, Nirranjan Kirubaharan,
stephen@networkplumber.org, Ganesh GR, akpm@linux-foundation.org,
torvalds@linux-foundation.org, davem@davemloft.net,
viro@zeniv.linux.org.uk
In-Reply-To: <20180419142747.GA30274@chelsio.com>
Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:
> On Thursday, April 04/19/18, 2018 at 07:10:30 +0530, Dave Young wrote:
>> On 04/18/18 at 06:01pm, Rahul Lakkireddy wrote:
>> > On Wednesday, April 04/18/18, 2018 at 11:45:46 +0530, Dave Young wrote:
>> > > Hi Rahul,
>> > > On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
>> > > > On production servers running variety of workloads over time, kernel
>> > > > panic can happen sporadically after days or even months. It is
>> > > > important to collect as much debug logs as possible to root cause
>> > > > and fix the problem, that may not be easy to reproduce. Snapshot of
>> > > > underlying hardware/firmware state (like register dump, firmware
>> > > > logs, adapter memory, etc.), at the time of kernel panic will be very
>> > > > helpful while debugging the culprit device driver.
>> > > >
>> > > > This series of patches add new generic framework that enable device
>> > > > drivers to collect device specific snapshot of the hardware/firmware
>> > > > state of the underlying device in the crash recovery kernel. In crash
>> > > > recovery kernel, the collected logs are added as elf notes to
>> > > > /proc/vmcore, which is copied by user space scripts for post-analysis.
>> > > >
>> > > > The sequence of actions done by device drivers to append their device
>> > > > specific hardware/firmware logs to /proc/vmcore are as follows:
>> > > >
>> > > > 1. During probe (before hardware is initialized), device drivers
>> > > > register to the vmcore module (via vmcore_add_device_dump()), with
>> > > > callback function, along with buffer size and log name needed for
>> > > > firmware/hardware log collection.
>> > >
>> > > I assumed the elf notes info should be prepared while kexec_[file_]load
>> > > phase. But I did not read the old comment, not sure if it has been discussed
>> > > or not.
>> > >
>> >
>> > We must not collect dumps in crashing kernel. Adding more things in
>> > crash dump path risks not collecting vmcore at all. Eric had
>> > discussed this in more detail at:
>> >
>> > https://lkml.org/lkml/2018/3/24/319
>> >
>> > We are safe to collect dumps in the second kernel. Each device dump
>> > will be exported as an elf note in /proc/vmcore.
>>
>> I understand that we should avoid adding anything in crash path. And I also
>> agree to collect device dump in second kernel. I just assumed device
>> dump use some memory area to store the debug info and the memory
>> is persistent so that this can be done in 2 steps, first register the
>> address in elf header in kexec_load, then collect the dump in 2nd
>> kernel. But it seems the driver is doing some other logic to collect
>> the info instead of just that simple like I thought.
>>
>
> It seems simpler, but I'm concerned with waste of memory area, if
> there are no device dumps being collected in second kernel. In
> approach proposed in these series, we dynamically allocate memory
> for the device dumps from second kernel's available memory.
Don't count on that kernel having more than about 128MiB.
For that reason, if for no other, it would be nice if it were possible to
have the driver not initialize the device and just stand there
handing out the data a piece at a time as it is read from /proc/vmcore.
The 2GiB number I read earlier concerns me for working in a limited
environment.
It might even make sense to separate this into a completely separate
module (dependent upon the main driver if it makes sense to share
the functionality) so that people performing crash dumps would not
hesitate to include the code in their initramfs images.
I can see splitting a device up into a portion only to be used in case
of a crash dump and a normal portion like we do for main memory but I
doubt that makes sense in practice.
>> > > If do this in 2nd kernel a question is driver can be loaded later than vmcore init.
>> >
>> > Yes, drivers will add their device dumps after vmcore init.
>> >
>> > > How to guarantee the function works if vmcore reading happens before
>> > > the driver is loaded?
>> > >
>> > > Also it is possible that kdump initramfs does not contains the driver
>> > > module.
>> > >
>> > > Am I missing something?
>> > >
>> >
>> > Yes, driver must be in initramfs if it wants to collect and add device
>> > dump to /proc/vmcore in second kernel.
>>
>> In RH/Fedora kdump scripts we only add the things are required to
>> bring up the dump target, so that we can use as less memory as we can.
>>
>> For example, if a net driver panicked, and the dump target is rootfs
>> which is a scsi disk, then no network related stuff will be added in
>> initramfs.
>>
>> In this case the device dump info will be not collected..
>
> Correct. If the driver is not present in initramfs, it can't collect
> its underlying device's dump. Administrator is expected to add the
> driver to initramfs, if device dump needs to be collected.
That makes sense, as most people won't have that need. Still, if we can
find something that can work automatically and safely without the need
for manual configuration, people are more likely to use it.
Eric
* Re: [PATCH v2 net 2/3] virtio_net: fix adding vids on big-endian
From: Michael S. Tsirkin @ 2018-04-19 14:56 UTC (permalink / raw)
To: Cornelia Huck
Cc: linux-kernel, Mikulas Patocka, Eric Dumazet, David Miller,
Thomas Huth, Jason Wang, virtualization, netdev
In-Reply-To: <20180419152641.092865f3.cohuck@redhat.com>
On Thu, Apr 19, 2018 at 03:26:41PM +0200, Cornelia Huck wrote:
> On Thu, 19 Apr 2018 08:30:49 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> > Programming vids (adding or removing them) still passes
> > guest-endian values in the DMA buffer. That's wrong
> > if guest is big-endian and when virtio 1 is enabled.
> >
> > Note: this is on top of a previous patch:
> > virtio_net: split out ctrl buffer
> >
> > Fixes: 9465a7a6f ("virtio_net: enable v1.0 support")
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> > drivers/net/virtio_net.c | 6 +++---
> > 1 file changed, 3 insertions(+), 3 deletions(-)
>
> Ouch. Have you seen any bug reports for that?
No, but then vlans within VMs aren't used too often
(as opposed to attaching vlans by the HV).
--
MST
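For readers following along: with virtio 1 the values placed in the
control buffer are little-endian regardless of guest byte order, while
legacy devices take guest-native order. A minimal user-space sketch of
that rule follows. The helper put_virtio16() and its "modern" flag are
hypothetical names used only for illustration; the driver itself would
presumably rely on the kernel's cpu_to_virtio16() helper instead.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): store a 16-bit value such as
 * a VLAN id in the byte order the device expects.
 */
static void put_virtio16(uint8_t out[2], uint16_t val, int modern)
{
	if (modern) {
		/* virtio 1: always little-endian on the wire */
		out[0] = (uint8_t)(val & 0xff);
		out[1] = (uint8_t)(val >> 8);
	} else {
		/* legacy: guest-native byte order */
		memcpy(out, &val, sizeof(val));
	}
}
```

On a little-endian guest both branches give the same bytes, which would
explain why the bug only shows up on big-endian guests with virtio 1.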
* Re: [PATCH net-next v4 1/3] vmcore: add API to collect hardware dump in second kernel
From: Rahul Lakkireddy @ 2018-04-19 14:56 UTC (permalink / raw)
To: Greg KH
Cc: netdev@vger.kernel.org, kexec@lists.infradead.org,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
davem@davemloft.net, viro@zeniv.linux.org.uk,
ebiederm@xmission.com, stephen@networkplumber.org,
akpm@linux-foundation.org, torvalds@linux-foundation.org,
Ganesh GR, Nirranjan Kirubaharan, Indranil Choudhury
In-Reply-To: <20180419082456.GA8617@kroah.com>
On Thursday, April 04/19/18, 2018 at 13:54:56 +0530, Greg KH wrote:
> On Tue, Apr 17, 2018 at 01:14:17PM +0530, Rahul Lakkireddy wrote:
> > +config PROC_VMCORE_DEVICE_DUMP
> > + bool "Device Hardware/Firmware Log Collection"
> > + depends on PROC_VMCORE
> > + default y
>
> Only things that require the machine to keep working should be 'default
> y', please remove this, it's an option.
>
Ok. Will fix this.
> > + help
> > + Device drivers can collect the device specific snapshot of
> > + their hardware or firmware before they are initialized in
> > + crash recovery kernel. If you say Y here, the device dumps
> > + will be added as ELF notes to /proc/vmcore
>
> Which exact "device drivers" are you referring to here?
>
The API is generic enough to collect any type of device's dump. Any
driver that wants to collect its underlying hardware/firmware dump
can use the API. In our case, cxgb4 driver collects dumps of the
underlying Chelsio network devices.
Thanks,
Rahul
* Re: [PATCH 35/61] net: ethernet: ti: simplify getting .drvdata
From: Grygorii Strashko @ 2018-04-19 15:14 UTC (permalink / raw)
To: Wolfram Sang, linux-kernel
Cc: linux-renesas-soc, kernel-janitors, linux-omap, netdev
In-Reply-To: <20180419140641.27926-36-wsa+renesas@sang-engineering.com>
On 04/19/2018 09:06 AM, Wolfram Sang wrote:
> We should get drvdata from struct device directly. Going via
> platform_device is an unneeded step back and forth.
>
> Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
> ---
>
> Build tested only. buildbot is happy. Please apply individually.
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
--
regards,
-grygorii
* [PATCH resend v3 0/3] lan78xx: Read configuration from Device Tree
From: Phil Elwell @ 2018-04-19 15:16 UTC (permalink / raw)
To: Woojung Huh, Microchip Linux Driver Support, Rob Herring,
Mark Rutland, Andrew Lunn, Florian Fainelli, David S. Miller,
Mauro Carvalho Chehab, Greg Kroah-Hartman, Linus Walleij,
Andrew Morton, Randy Dunlap, Phil Elwell, netdev, devicetree,
linux-kernel, linux-usb
[ Resending due to incorrect distribution list ]
The Microchip LAN78XX family of devices are Ethernet controllers with
a USB interface. Despite being discoverable devices it can be useful to
be able to configure them from Device Tree, particularly in low-cost
applications without an EEPROM or programmed OTP.
This patch set adds support for reading the MAC address and LED modes from
Device Tree.
v3:
- Move LED setting into PHY driver.
v2:
- Use eth_platform_get_mac_address.
- Support up to 4 LEDs, and move LED mode constants into dt-bindings header.
- Improve bindings document.
- Remove EEE support.
Phil Elwell (3):
lan78xx: Read MAC address from DT if present
lan78xx: Read LED states from Device Tree
dt-bindings: Document the DT bindings for lan78xx
.../devicetree/bindings/net/microchip,lan78xx.txt | 54 ++++++++++++++++
MAINTAINERS | 2 +
drivers/net/phy/microchip.c | 25 ++++++++
drivers/net/usb/lan78xx.c | 74 +++++++++++++++-------
include/dt-bindings/net/microchip-lan78xx.h | 21 ++++++
include/linux/microchipphy.h | 3 +
6 files changed, 156 insertions(+), 23 deletions(-)
create mode 100644 Documentation/devicetree/bindings/net/microchip,lan78xx.txt
create mode 100644 include/dt-bindings/net/microchip-lan78xx.h
--
2.7.4
* [PATCH resend v3 1/3] lan78xx: Read MAC address from DT if present
From: Phil Elwell @ 2018-04-19 15:16 UTC (permalink / raw)
To: Woojung Huh, Microchip Linux Driver Support, Rob Herring,
Mark Rutland, Andrew Lunn, Florian Fainelli, David S. Miller,
Mauro Carvalho Chehab, Greg Kroah-Hartman, Linus Walleij,
Andrew Morton, Randy Dunlap, Phil Elwell, netdev, devicetree,
linux-kernel, linux-usb
In-Reply-To: <1524151019-82823-1-git-send-email-phil@raspberrypi.org>
There is a standard mechanism for locating and using a MAC address from
the Device Tree. Use this facility in the lan78xx driver to support
applications without programmed EEPROM or OTP. At the same time,
regularise the handling of the different address sources.
Signed-off-by: Phil Elwell <phil@raspberrypi.org>
---
drivers/net/usb/lan78xx.c | 42 ++++++++++++++++++++----------------------
1 file changed, 20 insertions(+), 22 deletions(-)
diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index 0867f72..a823f01 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -37,6 +37,7 @@
#include <linux/irqchip/chained_irq.h>
#include <linux/microchipphy.h>
#include <linux/phy.h>
+#include <linux/of_net.h>
#include "lan78xx.h"
#define DRIVER_AUTHOR "WOOJUNG HUH <woojung.huh@microchip.com>"
@@ -1652,34 +1653,31 @@ static void lan78xx_init_mac_address(struct lan78xx_net *dev)
addr[5] = (addr_hi >> 8) & 0xFF;
if (!is_valid_ether_addr(addr)) {
- /* reading mac address from EEPROM or OTP */
- if ((lan78xx_read_eeprom(dev, EEPROM_MAC_OFFSET, ETH_ALEN,
- addr) == 0) ||
- (lan78xx_read_otp(dev, EEPROM_MAC_OFFSET, ETH_ALEN,
- addr) == 0)) {
- if (is_valid_ether_addr(addr)) {
- /* eeprom values are valid so use them */
- netif_dbg(dev, ifup, dev->net,
- "MAC address read from EEPROM");
- } else {
- /* generate random MAC */
- random_ether_addr(addr);
- netif_dbg(dev, ifup, dev->net,
- "MAC address set to random addr");
- }
-
- addr_lo = addr[0] | (addr[1] << 8) |
- (addr[2] << 16) | (addr[3] << 24);
- addr_hi = addr[4] | (addr[5] << 8);
-
- ret = lan78xx_write_reg(dev, RX_ADDRL, addr_lo);
- ret = lan78xx_write_reg(dev, RX_ADDRH, addr_hi);
+ if (!eth_platform_get_mac_address(&dev->udev->dev, addr)) {
+ /* valid address present in Device Tree */
+ netif_dbg(dev, ifup, dev->net,
+ "MAC address read from Device Tree");
+ } else if (((lan78xx_read_eeprom(dev, EEPROM_MAC_OFFSET,
+ ETH_ALEN, addr) == 0) ||
+ (lan78xx_read_otp(dev, EEPROM_MAC_OFFSET,
+ ETH_ALEN, addr) == 0)) &&
+ is_valid_ether_addr(addr)) {
+ /* eeprom values are valid so use them */
+ netif_dbg(dev, ifup, dev->net,
+ "MAC address read from EEPROM");
} else {
/* generate random MAC */
random_ether_addr(addr);
netif_dbg(dev, ifup, dev->net,
"MAC address set to random addr");
}
+
+ addr_lo = addr[0] | (addr[1] << 8) |
+ (addr[2] << 16) | (addr[3] << 24);
+ addr_hi = addr[4] | (addr[5] << 8);
+
+ ret = lan78xx_write_reg(dev, RX_ADDRL, addr_lo);
+ ret = lan78xx_write_reg(dev, RX_ADDRH, addr_hi);
}
ret = lan78xx_write_reg(dev, MAF_LO(0), addr_lo);
--
2.7.4
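The address-source priority in the patch above (Device Tree first, then
EEPROM/OTP, then a random address) can be sketched in user space as
below. This is only an illustration under stated assumptions:
pick_mac(), mac_is_valid() and mac_to_regs() are hypothetical names, a
NULL pointer stands in for "source not available", and the caller
supplies the random fallback address.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

enum mac_source { MAC_FROM_DT, MAC_FROM_EEPROM, MAC_FROM_RANDOM };

/* Same rule as is_valid_ether_addr(): not multicast and not all-zero */
static bool mac_is_valid(const uint8_t *a)
{
	static const uint8_t zero[ETH_ALEN];

	return !(a[0] & 1) && memcmp(a, zero, ETH_ALEN) != 0;
}

/* Hypothetical helper: pick the first usable source in priority order */
static enum mac_source pick_mac(const uint8_t *dt, const uint8_t *eeprom,
				const uint8_t *fallback, uint8_t *out)
{
	if (dt && mac_is_valid(dt)) {
		memcpy(out, dt, ETH_ALEN);
		return MAC_FROM_DT;
	}
	if (eeprom && mac_is_valid(eeprom)) {
		memcpy(out, eeprom, ETH_ALEN);
		return MAC_FROM_EEPROM;
	}
	memcpy(out, fallback, ETH_ALEN);
	return MAC_FROM_RANDOM;
}

/* Pack the address the way the patch builds addr_lo and addr_hi */
static void mac_to_regs(const uint8_t *a, uint32_t *lo, uint32_t *hi)
{
	*lo = (uint32_t)a[0] | ((uint32_t)a[1] << 8) |
	      ((uint32_t)a[2] << 16) | ((uint32_t)a[3] << 24);
	*hi = (uint32_t)a[4] | ((uint32_t)a[5] << 8);
}
```

The explicit uint32_t casts are a choice made for this sketch so that
shifting a byte with its top bit set by 24 stays well defined in
portable C.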
* [PATCH resend v3 2/3] lan78xx: Read LED states from Device Tree
From: Phil Elwell @ 2018-04-19 15:16 UTC (permalink / raw)
To: Woojung Huh, Microchip Linux Driver Support, Rob Herring,
Mark Rutland, Andrew Lunn, Florian Fainelli, David S. Miller,
Mauro Carvalho Chehab, Greg Kroah-Hartman, Linus Walleij,
Andrew Morton, Randy Dunlap, Phil Elwell, netdev, devicetree,
linux-kernel, linux-usb
In-Reply-To: <1524151019-82823-1-git-send-email-phil@raspberrypi.org>
Add support for DT property "microchip,led-modes", a vector of zero
to four cells (u32s) in the range 0-15, each of which sets the mode
for one of the LEDs. Some possible values are:
0=link/activity 1=link1000/activity
2=link100/activity 3=link10/activity
4=link100/1000/activity 5=link10/1000/activity
6=link10/100/activity 14=off 15=on
These values are given symbolic constants in a dt-bindings header.
Also use the presence of the DT property to indicate that the
LEDs should be enabled - necessary in the event that no valid OTP
or EEPROM is available.
Signed-off-by: Phil Elwell <phil@raspberrypi.org>
---
MAINTAINERS | 1 +
drivers/net/phy/microchip.c | 25 ++++++++++++++++++++++
drivers/net/usb/lan78xx.c | 32 ++++++++++++++++++++++++++++-
include/dt-bindings/net/microchip-lan78xx.h | 21 +++++++++++++++++++
include/linux/microchipphy.h | 3 +++
5 files changed, 81 insertions(+), 1 deletion(-)
create mode 100644 include/dt-bindings/net/microchip-lan78xx.h
diff --git a/MAINTAINERS b/MAINTAINERS
index b60179d..23735d9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14573,6 +14573,7 @@ M: Microchip Linux Driver Support <UNGLinuxDriver@microchip.com>
L: netdev@vger.kernel.org
S: Maintained
F: drivers/net/usb/lan78xx.*
+F: include/dt-bindings/net/microchip-lan78xx.h
USB MASS STORAGE DRIVER
M: Alan Stern <stern@rowland.harvard.edu>
diff --git a/drivers/net/phy/microchip.c b/drivers/net/phy/microchip.c
index 0f293ef..ef5e160 100644
--- a/drivers/net/phy/microchip.c
+++ b/drivers/net/phy/microchip.c
@@ -20,6 +20,8 @@
#include <linux/ethtool.h>
#include <linux/phy.h>
#include <linux/microchipphy.h>
+#include <linux/of.h>
+#include <dt-bindings/net/microchip-lan78xx.h>
#define DRIVER_AUTHOR "WOOJUNG HUH <woojung.huh@microchip.com>"
#define DRIVER_DESC "Microchip LAN88XX PHY driver"
@@ -70,6 +72,8 @@ static int lan88xx_probe(struct phy_device *phydev)
{
struct device *dev = &phydev->mdio.dev;
struct lan88xx_priv *priv;
+ u32 led_modes[4];
+ int len;
priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
if (!priv)
@@ -77,6 +81,27 @@ static int lan88xx_probe(struct phy_device *phydev)
priv->wolopts = 0;
+ len = of_property_read_variable_u32_array(dev->of_node,
+ "microchip,led-modes",
+ led_modes,
+ 0,
+ ARRAY_SIZE(led_modes));
+ if (len >= 0) {
+ u32 reg = 0;
+ int i;
+
+ for (i = 0; i < len; i++) {
+ if (led_modes[i] > 15)
+ return -EINVAL;
+ reg |= led_modes[i] << (i * 4);
+ }
+ for (; i < ARRAY_SIZE(led_modes); i++)
+ reg |= LAN78XX_FORCE_LED_OFF << (i * 4);
+ (void)phy_write(phydev, LAN78XX_PHY_LED_MODE_SELECT, reg);
+ } else if (len == -EOVERFLOW) {
+ return -EINVAL;
+ }
+
/* these values can be used to identify internal PHY */
priv->chip_id = phy_read_mmd(phydev, 3, LAN88XX_MMD3_CHIP_ID);
priv->chip_rev = phy_read_mmd(phydev, 3, LAN88XX_MMD3_CHIP_REV);
diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index a823f01..6b03b97 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -37,6 +37,7 @@
#include <linux/irqchip/chained_irq.h>
#include <linux/microchipphy.h>
#include <linux/phy.h>
+#include <linux/of_mdio.h>
#include <linux/of_net.h>
#include "lan78xx.h"
@@ -1760,6 +1761,7 @@ static int lan78xx_mdiobus_write(struct mii_bus *bus, int phy_id, int idx,
static int lan78xx_mdio_init(struct lan78xx_net *dev)
{
+ struct device_node *node;
int ret;
dev->mdiobus = mdiobus_alloc();
@@ -1788,7 +1790,13 @@ static int lan78xx_mdio_init(struct lan78xx_net *dev)
break;
}
- ret = mdiobus_register(dev->mdiobus);
+ node = of_get_child_by_name(dev->udev->dev.of_node, "mdio");
+ if (node) {
+ ret = of_mdiobus_register(dev->mdiobus, node);
+ of_node_put(node);
+ } else {
+ ret = mdiobus_register(dev->mdiobus);
+ }
if (ret) {
netdev_err(dev->net, "can't register MDIO bus\n");
goto exit1;
@@ -2077,6 +2085,28 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
mii_adv = (u32)mii_advertise_flowctrl(dev->fc_request_control);
phydev->advertising |= mii_adv_to_ethtool_adv_t(mii_adv);
+ if (phydev->mdio.dev.of_node) {
+ u32 reg;
+ int len;
+
+ len = of_property_count_elems_of_size(phydev->mdio.dev.of_node,
+ "microchip,led-modes",
+ sizeof(u32));
+ if (len >= 0) {
+ /* Ensure the appropriate LEDs are enabled */
+ lan78xx_read_reg(dev, HW_CFG, &reg);
+ reg &= ~(HW_CFG_LED0_EN_ |
+ HW_CFG_LED1_EN_ |
+ HW_CFG_LED2_EN_ |
+ HW_CFG_LED3_EN_);
+ reg |= (len > 0) * HW_CFG_LED0_EN_ |
+ (len > 1) * HW_CFG_LED1_EN_ |
+ (len > 2) * HW_CFG_LED2_EN_ |
+ (len > 3) * HW_CFG_LED3_EN_;
+ lan78xx_write_reg(dev, HW_CFG, reg);
+ }
+ }
+
genphy_config_aneg(phydev);
dev->fc_autoneg = phydev->autoneg;
diff --git a/include/dt-bindings/net/microchip-lan78xx.h b/include/dt-bindings/net/microchip-lan78xx.h
new file mode 100644
index 0000000..0742ff0
--- /dev/null
+++ b/include/dt-bindings/net/microchip-lan78xx.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _DT_BINDINGS_MICROCHIP_LAN78XX_H
+#define _DT_BINDINGS_MICROCHIP_LAN78XX_H
+
+/* LED modes for LAN7800/LAN7850 embedded PHY */
+
+#define LAN78XX_LINK_ACTIVITY 0
+#define LAN78XX_LINK_1000_ACTIVITY 1
+#define LAN78XX_LINK_100_ACTIVITY 2
+#define LAN78XX_LINK_10_ACTIVITY 3
+#define LAN78XX_LINK_100_1000_ACTIVITY 4
+#define LAN78XX_LINK_10_1000_ACTIVITY 5
+#define LAN78XX_LINK_10_100_ACTIVITY 6
+#define LAN78XX_DUPLEX_COLLISION 8
+#define LAN78XX_COLLISION 9
+#define LAN78XX_ACTIVITY 10
+#define LAN78XX_AUTONEG_FAULT 12
+#define LAN78XX_FORCE_LED_OFF 14
+#define LAN78XX_FORCE_LED_ON 15
+
+#endif
diff --git a/include/linux/microchipphy.h b/include/linux/microchipphy.h
index eb492d4..8e4015e 100644
--- a/include/linux/microchipphy.h
+++ b/include/linux/microchipphy.h
@@ -70,4 +70,7 @@
#define LAN88XX_MMD3_CHIP_ID (32877)
#define LAN88XX_MMD3_CHIP_REV (32878)
+/* Registers specific to the LAN7800/LAN7850 embedded phy */
+#define LAN78XX_PHY_LED_MODE_SELECT (0x1D)
+
#endif /* _MICROCHIPPHY_H */
--
2.7.4
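The register packing done in lan88xx_probe() above (one 4-bit mode per
LED, omitted LEDs forced off) can be tried out in user space with a
sketch like the following. pack_led_modes() and the LED_* macros are
hypothetical names for illustration; only the value 14 for "off" and the
0-15 range come from the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define LED_COUNT    4
#define LED_MODE_OFF 14	/* LAN78XX_FORCE_LED_OFF in the patch */

/*
 * Hypothetical helper: build the LED mode select value.
 * Returns 0 and fills *reg, or -1 when there are too many entries or a
 * mode is outside 0..15.
 */
static int pack_led_modes(const uint32_t *modes, size_t len, uint16_t *reg)
{
	uint32_t val = 0;
	size_t i;

	if (len > LED_COUNT)
		return -1;

	for (i = 0; i < len; i++) {
		if (modes[i] > 15)
			return -1;
		val |= modes[i] << (i * 4);
	}
	/* LEDs not listed in the property are switched off */
	for (; i < LED_COUNT; i++)
		val |= (uint32_t)LED_MODE_OFF << (i * 4);

	*reg = (uint16_t)val;
	return 0;
}
```

With the two modes used in the bindings example (link1000/activity = 1
and link10/100/activity = 6) this yields 0xEE61.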
* [PATCH resend v3 3/3] dt-bindings: Document the DT bindings for lan78xx
From: Phil Elwell @ 2018-04-19 15:16 UTC (permalink / raw)
To: Woojung Huh, Microchip Linux Driver Support, Rob Herring,
Mark Rutland, Andrew Lunn, Florian Fainelli, David S. Miller,
Mauro Carvalho Chehab, Greg Kroah-Hartman, Linus Walleij,
Andrew Morton, Randy Dunlap, Phil Elwell, netdev, devicetree,
linux-kernel, linux-usb
In-Reply-To: <1524151019-82823-1-git-send-email-phil@raspberrypi.org>
The Microchip LAN78XX family of devices are Ethernet controllers with
a USB interface. Despite being discoverable devices it can be useful to
be able to configure them from Device Tree, particularly in low-cost
applications without an EEPROM or programmed OTP.
Document the supported properties in a bindings file.
Signed-off-by: Phil Elwell <phil@raspberrypi.org>
---
.../devicetree/bindings/net/microchip,lan78xx.txt | 54 ++++++++++++++++++++++
MAINTAINERS | 1 +
2 files changed, 55 insertions(+)
create mode 100644 Documentation/devicetree/bindings/net/microchip,lan78xx.txt
diff --git a/Documentation/devicetree/bindings/net/microchip,lan78xx.txt b/Documentation/devicetree/bindings/net/microchip,lan78xx.txt
new file mode 100644
index 0000000..a5d701b
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/microchip,lan78xx.txt
@@ -0,0 +1,54 @@
+Microchip LAN78xx Gigabit Ethernet controller
+
+The LAN78XX devices are usually configured by programming their OTP or with
+an external EEPROM, but some platforms (e.g. Raspberry Pi 3 B+) have neither.
+The Device Tree properties, if present, override the OTP and EEPROM.
+
+Required properties:
+- compatible: Should be one of "usb424,7800", "usb424,7801" or "usb424,7850".
+
+Optional properties:
+- local-mac-address: see ethernet.txt
+- mac-address: see ethernet.txt
+
+Optional properties of the embedded PHY:
+- microchip,led-modes: a 0..4 element vector, with each element configuring
+ the operating mode of an LED. Omitted LEDs are turned off. Allowed values
+ are defined in "include/dt-bindings/net/microchip-lan78xx.h".
+
+Example:
+
+/* Based on the configuration for a Raspberry Pi 3 B+ */
+&usb {
+ usb1@1 {
+ compatible = "usb424,2514";
+ reg = <1>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ usb1_1@1 {
+ compatible = "usb424,2514";
+ reg = <1>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ethernet: usbether@1 {
+ compatible = "usb424,7800";
+ reg = <1>;
+ local-mac-address = [ 00 11 22 33 44 55 ];
+
+ mdio {
+ #address-cells = <0x1>;
+ #size-cells = <0x0>;
+ eth_phy: ethernet-phy@1 {
+ reg = <1>;
+ microchip,led-modes = <
+ LAN78XX_LINK_1000_ACTIVITY
+ LAN78XX_LINK_10_100_ACTIVITY
+ >;
+ };
+ };
+ };
+ };
+ };
+};
diff --git a/MAINTAINERS b/MAINTAINERS
index 23735d9..91cb961 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14572,6 +14572,7 @@ M: Woojung Huh <woojung.huh@microchip.com>
M: Microchip Linux Driver Support <UNGLinuxDriver@microchip.com>
L: netdev@vger.kernel.org
S: Maintained
+F: Documentation/devicetree/bindings/net/microchip,lan78xx.txt
F: drivers/net/usb/lan78xx.*
F: include/dt-bindings/net/microchip-lan78xx.h
--
2.7.4
* Re: [PATCH v3 2/3] lan78xx: Read LED states from Device Tree
From: Andrew Lunn @ 2018-04-19 15:19 UTC (permalink / raw)
To: Phil Elwell
Cc: Woojung Huh, Microchip Linux Driver Support, Rob Herring,
Mark Rutland, David S. Miller, Mauro Carvalho Chehab,
Greg Kroah-Hartman, Linus Walleij, Andrew Morton, Randy Dunlap,
netdev, devicetree, linux-kernel, linux-usb
In-Reply-To: <1524148325-78945-4-git-send-email-phil@raspberrypi.org>
> @@ -2077,6 +2085,28 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
> mii_adv = (u32)mii_advertise_flowctrl(dev->fc_request_control);
> phydev->advertising |= mii_adv_to_ethtool_adv_t(mii_adv);
>
> + if (phydev->mdio.dev.of_node) {
> + u32 reg;
> + int len;
> +
> + len = of_property_count_elems_of_size(phydev->mdio.dev.of_node,
> + "microchip,led-modes",
> + sizeof(u32));
> + if (len >= 0) {
> + /* Ensure the appropriate LEDs are enabled */
> + lan78xx_read_reg(dev, HW_CFG, &reg);
> + reg &= ~(HW_CFG_LED0_EN_ |
> + HW_CFG_LED1_EN_ |
> + HW_CFG_LED2_EN_ |
> + HW_CFG_LED3_EN_);
> + reg |= (len > 0) * HW_CFG_LED0_EN_ |
> + (len > 1) * HW_CFG_LED1_EN_ |
> + (len > 2) * HW_CFG_LED2_EN_ |
> + (len > 3) * HW_CFG_LED3_EN_;
> + lan78xx_write_reg(dev, HW_CFG, reg);
> + }
> + }
> +
Humm. Not nice. But I cannot think of a cleaner way of doing this.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Andrew
* Re: [PATCH v3 3/3] dt-bindings: Document the DT bindings for lan78xx
From: Andrew Lunn @ 2018-04-19 15:20 UTC (permalink / raw)
To: Phil Elwell
Cc: Woojung Huh, Microchip Linux Driver Support, Rob Herring,
Mark Rutland, David S. Miller, Mauro Carvalho Chehab,
Greg Kroah-Hartman, Linus Walleij, Andrew Morton, Randy Dunlap,
netdev, devicetree, linux-kernel, linux-usb
In-Reply-To: <1524148325-78945-5-git-send-email-phil@raspberrypi.org>
On Thu, Apr 19, 2018 at 03:32:05PM +0100, Phil Elwell wrote:
> The Microchip LAN78XX family of devices are Ethernet controllers with
> a USB interface. Despite being discoverable devices it can be useful to
> be able to configure them from Device Tree, particularly in low-cost
> applications without an EEPROM or programmed OTP.
>
> Document the supported properties in a bindings file.
>
> Signed-off-by: Phil Elwell <phil@raspberrypi.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Andrew
* [PATCH net-next] net: stmmac: Implement logic to automatically select HW Interface
From: Jose Abreu @ 2018-04-19 15:24 UTC (permalink / raw)
To: netdev
Cc: Jose Abreu, David S. Miller, Joao Pinto, Vitor Soares,
Giuseppe Cavallaro, Alexandre Torgue
Move all the core version detection to a common place ("hwif.c") and
implement a table which can be used to lookup the correct callbacks for
each IP version.
This simplifies the initialization flow of each IP version and eases
future implementation of new IP versions.
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Vitor Soares <soares@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
---
Hi,
I tested this in DWMAC5. It would be great if anyone with older versions of
the core could also test this (I only have the setup for DWMAC5).
Thanks.
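For anyone who wants to picture the table-driven selection before
reading hwif.c, here is a rough user-space sketch. It is not the code in
this patch: struct hwif_entry, hwif_table[] and hwif_lookup() are
hypothetical, the name strings are illustrative, and the sketch assumes
selection on the Synopsys ID alone, ignoring any platform flags. The ID
decode mirrors the stmmac_get_synopsys_id() helper removed from common.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Low byte of the version register is the Synopsys ID */
static uint32_t synopsys_id(uint32_t hwid)
{
	return hwid & 0x000000ff;
}

struct hwif_entry {
	uint32_t min_id;	/* lowest Synopsys ID this entry handles */
	const char *name;	/* illustrative label only */
};

/* Sorted by descending min_id so the first match is the most specific */
static const struct hwif_entry hwif_table[] = {
	{ 0x50, "dwmac5" },		/* DWMAC_CORE_5_00 and later */
	{ 0x40, "dwmac4" },		/* DWMAC_CORE_4_00 and later */
	{ 0x00, "gmac (pre-4.00)" },	/* everything older */
};

static const struct hwif_entry *hwif_lookup(uint32_t hwid)
{
	uint32_t id = synopsys_id(hwid);
	size_t i;

	for (i = 0; i < sizeof(hwif_table) / sizeof(hwif_table[0]); i++) {
		if (id >= hwif_table[i].min_id)
			return &hwif_table[i];
	}
	return NULL;
}
```

Adding support for a new core version would then amount to adding one
table row, which is the simplification the commit message describes.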
---
drivers/net/ethernet/stmicro/stmmac/Makefile | 3 +-
drivers/net/ethernet/stmicro/stmmac/common.h | 30 +---
drivers/net/ethernet/stmicro/stmmac/dwmac1000.h | 1 -
.../net/ethernet/stmicro/stmmac/dwmac1000_core.c | 29 +--
.../net/ethernet/stmicro/stmmac/dwmac100_core.c | 23 +--
drivers/net/ethernet/stmicro/stmmac/dwmac4.h | 1 -
drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c | 41 +---
drivers/net/ethernet/stmicro/stmmac/hwif.c | 216 ++++++++++++++++++++
drivers/net/ethernet/stmicro/stmmac/hwif.h | 17 ++
drivers/net/ethernet/stmicro/stmmac/stmmac.h | 1 +
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 77 +-------
11 files changed, 275 insertions(+), 164 deletions(-)
create mode 100644 drivers/net/ethernet/stmicro/stmmac/hwif.c
diff --git a/drivers/net/ethernet/stmicro/stmmac/Makefile b/drivers/net/ethernet/stmicro/stmmac/Makefile
index 972e4ef..e3b578b 100644
--- a/drivers/net/ethernet/stmicro/stmmac/Makefile
+++ b/drivers/net/ethernet/stmicro/stmmac/Makefile
@@ -4,7 +4,8 @@ stmmac-objs:= stmmac_main.o stmmac_ethtool.o stmmac_mdio.o ring_mode.o \
chain_mode.o dwmac_lib.o dwmac1000_core.o dwmac1000_dma.o \
dwmac100_core.o dwmac100_dma.o enh_desc.o norm_desc.o \
mmc_core.o stmmac_hwtstamp.o stmmac_ptp.o dwmac4_descs.o \
- dwmac4_dma.o dwmac4_lib.o dwmac4_core.o dwmac5.o $(stmmac-y)
+ dwmac4_dma.o dwmac4_lib.o dwmac4_core.o dwmac5.o hwif.o \
+ $(stmmac-y)
# Ordering matters. Generic driver must be last.
obj-$(CONFIG_STMMAC_PLATFORM) += stmmac-platform.o
diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
index 59673c6..627e905 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -39,6 +39,7 @@
#define DWMAC_CORE_3_40 0x34
#define DWMAC_CORE_3_50 0x35
#define DWMAC_CORE_4_00 0x40
+#define DWMAC_CORE_4_10 0x41
#define DWMAC_CORE_5_00 0x50
#define DWMAC_CORE_5_10 0x51
#define STMMAC_CHAN0 0 /* Always supported and default for all chips */
@@ -428,12 +429,9 @@ struct stmmac_rx_routing {
u32 reg_shift;
};
-struct mac_device_info *dwmac1000_setup(void __iomem *ioaddr, int mcbins,
- int perfect_uc_entries,
- int *synopsys_id);
-struct mac_device_info *dwmac100_setup(void __iomem *ioaddr, int *synopsys_id);
-struct mac_device_info *dwmac4_setup(void __iomem *ioaddr, int mcbins,
- int perfect_uc_entries, int *synopsys_id);
+int dwmac100_setup(struct stmmac_priv *priv);
+int dwmac1000_setup(struct stmmac_priv *priv);
+int dwmac4_setup(struct stmmac_priv *priv);
void stmmac_set_mac_addr(void __iomem *ioaddr, u8 addr[6],
unsigned int high, unsigned int low);
@@ -453,24 +451,4 @@ void stmmac_dwmac4_get_mac_addr(void __iomem *ioaddr, unsigned char *addr,
extern const struct stmmac_mode_ops chain_mode_ops;
extern const struct stmmac_desc_ops dwmac4_desc_ops;
-/**
- * stmmac_get_synopsys_id - return the SYINID.
- * @priv: driver private structure
- * Description: this simple function is to decode and return the SYINID
- * starting from the HW core register.
- */
-static inline u32 stmmac_get_synopsys_id(u32 hwid)
-{
- /* Check Synopsys Id (not available on old chips) */
- if (likely(hwid)) {
- u32 uid = ((hwid & 0x0000ff00) >> 8);
- u32 synid = (hwid & 0x000000ff);
-
- pr_info("stmmac - user ID: 0x%x, Synopsys ID: 0x%x\n",
- uid, synid);
-
- return synid;
- }
- return 0;
-}
#endif /* __COMMON_H__ */
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h b/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h
index c02d366..184ca13 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h
@@ -29,7 +29,6 @@
#define GMAC_MII_DATA 0x00000014 /* MII Data */
#define GMAC_FLOW_CTRL 0x00000018 /* Flow Control */
#define GMAC_VLAN_TAG 0x0000001c /* VLAN Tag */
-#define GMAC_VERSION 0x00000020 /* GMAC CORE Version */
#define GMAC_DEBUG 0x00000024 /* GMAC debug register */
#define GMAC_WAKEUP_FILTER 0x00000028 /* Wake-up Frame Filter */
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
index ef10baf..0877bde 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
@@ -27,6 +27,7 @@
#include <linux/ethtool.h>
#include <net/dsa.h>
#include <asm/io.h>
+#include "stmmac.h"
#include "stmmac_pcs.h"
#include "dwmac1000.h"
@@ -498,7 +499,7 @@ static void dwmac1000_debug(void __iomem *ioaddr, struct stmmac_extra_stats *x,
x->mac_gmii_rx_proto_engine++;
}
-static const struct stmmac_ops dwmac1000_ops = {
+const struct stmmac_ops dwmac1000_ops = {
.core_init = dwmac1000_core_init,
.set_mac = stmmac_set_mac,
.rx_ipc = dwmac1000_rx_ipc_enable,
@@ -519,28 +520,21 @@ static void dwmac1000_debug(void __iomem *ioaddr, struct stmmac_extra_stats *x,
.pcs_get_adv_lp = dwmac1000_get_adv_lp,
};
-struct mac_device_info *dwmac1000_setup(void __iomem *ioaddr, int mcbins,
- int perfect_uc_entries,
- int *synopsys_id)
+int dwmac1000_setup(struct stmmac_priv *priv)
{
- struct mac_device_info *mac;
- u32 hwid = readl(ioaddr + GMAC_VERSION);
+ struct mac_device_info *mac = priv->hw;
- mac = kzalloc(sizeof(const struct mac_device_info), GFP_KERNEL);
- if (!mac)
- return NULL;
+ dev_info(priv->device, "\tDWMAC1000\n");
- mac->pcsr = ioaddr;
- mac->multicast_filter_bins = mcbins;
- mac->unicast_filter_entries = perfect_uc_entries;
+ priv->dev->priv_flags |= IFF_UNICAST_FLT;
+ mac->pcsr = priv->ioaddr;
+ mac->multicast_filter_bins = priv->plat->multicast_filter_bins;
+ mac->unicast_filter_entries = priv->plat->unicast_filter_entries;
mac->mcast_bits_log2 = 0;
if (mac->multicast_filter_bins)
mac->mcast_bits_log2 = ilog2(mac->multicast_filter_bins);
- mac->mac = &dwmac1000_ops;
- mac->dma = &dwmac1000_dma_ops;
-
mac->link.duplex = GMAC_CONTROL_DM;
mac->link.speed10 = GMAC_CONTROL_PS;
mac->link.speed100 = GMAC_CONTROL_PS | GMAC_CONTROL_FES;
@@ -555,8 +549,5 @@ struct mac_device_info *dwmac1000_setup(void __iomem *ioaddr, int mcbins,
mac->mii.clk_csr_shift = 2;
mac->mii.clk_csr_mask = GENMASK(5, 2);
- /* Get and dump the chip ID */
- *synopsys_id = stmmac_get_synopsys_id(hwid);
-
- return mac;
+ return 0;
}
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c b/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c
index 91b23f9..b735143 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c
@@ -27,6 +27,7 @@
#include <linux/crc32.h>
#include <net/dsa.h>
#include <asm/io.h>
+#include "stmmac.h"
#include "dwmac100.h"
static void dwmac100_core_init(struct mac_device_info *hw,
@@ -159,7 +160,7 @@ static void dwmac100_pmt(struct mac_device_info *hw, unsigned long mode)
return;
}
-static const struct stmmac_ops dwmac100_ops = {
+const struct stmmac_ops dwmac100_ops = {
.core_init = dwmac100_core_init,
.set_mac = stmmac_set_mac,
.rx_ipc = dwmac100_rx_ipc_enable,
@@ -172,20 +173,13 @@ static void dwmac100_pmt(struct mac_device_info *hw, unsigned long mode)
.get_umac_addr = dwmac100_get_umac_addr,
};
-struct mac_device_info *dwmac100_setup(void __iomem *ioaddr, int *synopsys_id)
+int dwmac100_setup(struct stmmac_priv *priv)
{
- struct mac_device_info *mac;
+ struct mac_device_info *mac = priv->hw;
- mac = kzalloc(sizeof(const struct mac_device_info), GFP_KERNEL);
- if (!mac)
- return NULL;
-
- pr_info("\tDWMAC100\n");
-
- mac->pcsr = ioaddr;
- mac->mac = &dwmac100_ops;
- mac->dma = &dwmac100_dma_ops;
+ dev_info(priv->device, "\tDWMAC100\n");
+ mac->pcsr = priv->ioaddr;
mac->link.duplex = MAC_CONTROL_F;
mac->link.speed10 = 0;
mac->link.speed100 = 0;
@@ -200,8 +194,5 @@ struct mac_device_info *dwmac100_setup(void __iomem *ioaddr, int *synopsys_id)
mac->mii.clk_csr_shift = 2;
mac->mii.clk_csr_mask = GENMASK(5, 2);
- /* Synopsys Id is not available on old chips */
- *synopsys_id = 0;
-
- return mac;
+ return 0;
}
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
index c7bff59..ec32fd7 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
@@ -34,7 +34,6 @@
#define GMAC_PCS_BASE 0x000000e0
#define GMAC_PHYIF_CONTROL_STATUS 0x000000f8
#define GMAC_PMT 0x000000c0
-#define GMAC_VERSION 0x00000110
#define GMAC_DEBUG 0x00000114
#define GMAC_HW_FEATURE0 0x0000011c
#define GMAC_HW_FEATURE1 0x00000120
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
index a3af92e..cdda0bf 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
@@ -18,6 +18,7 @@
#include <linux/ethtool.h>
#include <linux/io.h>
#include <net/dsa.h>
+#include "stmmac.h"
#include "stmmac_pcs.h"
#include "dwmac4.h"
#include "dwmac5.h"
@@ -707,7 +708,7 @@ static void dwmac4_debug(void __iomem *ioaddr, struct stmmac_extra_stats *x,
x->mac_gmii_rx_proto_engine++;
}
-static const struct stmmac_ops dwmac4_ops = {
+const struct stmmac_ops dwmac4_ops = {
.core_init = dwmac4_core_init,
.set_mac = stmmac_set_mac,
.rx_ipc = dwmac4_rx_ipc_enable,
@@ -738,7 +739,7 @@ static void dwmac4_debug(void __iomem *ioaddr, struct stmmac_extra_stats *x,
.set_filter = dwmac4_set_filter,
};
-static const struct stmmac_ops dwmac410_ops = {
+const struct stmmac_ops dwmac410_ops = {
.core_init = dwmac4_core_init,
.set_mac = stmmac_dwmac4_set_mac,
.rx_ipc = dwmac4_rx_ipc_enable,
@@ -769,7 +770,7 @@ static void dwmac4_debug(void __iomem *ioaddr, struct stmmac_extra_stats *x,
.set_filter = dwmac4_set_filter,
};
-static const struct stmmac_ops dwmac510_ops = {
+const struct stmmac_ops dwmac510_ops = {
.core_init = dwmac4_core_init,
.set_mac = stmmac_dwmac4_set_mac,
.rx_ipc = dwmac4_rx_ipc_enable,
@@ -803,19 +804,16 @@ static void dwmac4_debug(void __iomem *ioaddr, struct stmmac_extra_stats *x,
.safety_feat_dump = dwmac5_safety_feat_dump,
};
-struct mac_device_info *dwmac4_setup(void __iomem *ioaddr, int mcbins,
- int perfect_uc_entries, int *synopsys_id)
+int dwmac4_setup(struct stmmac_priv *priv)
{
- struct mac_device_info *mac;
- u32 hwid = readl(ioaddr + GMAC_VERSION);
+ struct mac_device_info *mac = priv->hw;
- mac = kzalloc(sizeof(const struct mac_device_info), GFP_KERNEL);
- if (!mac)
- return NULL;
+ dev_info(priv->device, "\tDWMAC4/5\n");
- mac->pcsr = ioaddr;
- mac->multicast_filter_bins = mcbins;
- mac->unicast_filter_entries = perfect_uc_entries;
+ priv->dev->priv_flags |= IFF_UNICAST_FLT;
+ mac->pcsr = priv->ioaddr;
+ mac->multicast_filter_bins = priv->plat->multicast_filter_bins;
+ mac->unicast_filter_entries = priv->plat->unicast_filter_entries;
mac->mcast_bits_log2 = 0;
if (mac->multicast_filter_bins)
@@ -835,20 +833,5 @@ struct mac_device_info *dwmac4_setup(void __iomem *ioaddr, int mcbins,
mac->mii.clk_csr_shift = 8;
mac->mii.clk_csr_mask = GENMASK(11, 8);
- /* Get and dump the chip ID */
- *synopsys_id = stmmac_get_synopsys_id(hwid);
-
- if (*synopsys_id > DWMAC_CORE_4_00)
- mac->dma = &dwmac410_dma_ops;
- else
- mac->dma = &dwmac4_dma_ops;
-
- if (*synopsys_id >= DWMAC_CORE_5_10)
- mac->mac = &dwmac510_ops;
- else if (*synopsys_id >= DWMAC_CORE_4_00)
- mac->mac = &dwmac410_ops;
- else
- mac->mac = &dwmac4_ops;
-
- return mac;
+ return 0;
}
diff --git a/drivers/net/ethernet/stmicro/stmmac/hwif.c b/drivers/net/ethernet/stmicro/stmmac/hwif.c
new file mode 100644
index 0000000..45f6300
--- /dev/null
+++ b/drivers/net/ethernet/stmicro/stmmac/hwif.c
@@ -0,0 +1,216 @@
+// SPDX-License-Identifier: (GPL-2.0 OR MIT)
+// Copyright (c) 2018 Synopsys, Inc. and/or its affiliates.
+// stmmac HW Interface Handling
+
+#include "common.h"
+#include "stmmac.h"
+
+static u32 stmmac_get_id(struct stmmac_priv *priv, u32 id_reg)
+{
+ u32 reg = readl(priv->ioaddr + id_reg);
+
+ if (!reg) {
+ dev_info(priv->device, "Version ID not available\n");
+ return 0x0;
+ }
+
+ dev_info(priv->device, "User ID: 0x%x, Synopsys ID: 0x%x\n",
+ (unsigned int)(reg & GENMASK(15, 8)) >> 8,
+ (unsigned int)(reg & GENMASK(7, 0)));
+ return reg & GENMASK(7, 0);
+}
+
+static void stmmac_dwmac_mode_quirk(struct stmmac_priv *priv)
+{
+ struct mac_device_info *mac = priv->hw;
+
+ if (priv->chain_mode) {
+ dev_info(priv->device, "Chain mode enabled\n");
+ priv->mode = STMMAC_CHAIN_MODE;
+ mac->mode = &chain_mode_ops;
+ } else {
+ dev_info(priv->device, "Ring mode enabled\n");
+ priv->mode = STMMAC_RING_MODE;
+ mac->mode = &ring_mode_ops;
+ }
+}
+
+static int stmmac_dwmac1_quirks(struct stmmac_priv *priv)
+{
+ struct mac_device_info *mac = priv->hw;
+
+ if (priv->plat->enh_desc) {
+ dev_info(priv->device, "Enhanced/Alternate descriptors\n");
+
+ /* GMAC older than 3.50 has no extended descriptors */
+ if (priv->synopsys_id >= DWMAC_CORE_3_50) {
+ dev_info(priv->device, "Enabled extended descriptors\n");
+ priv->extend_desc = 1;
+ } else {
+ dev_warn(priv->device, "Extended descriptors not supported\n");
+ }
+
+ mac->desc = &enh_desc_ops;
+ } else {
+ dev_info(priv->device, "Normal descriptors\n");
+ mac->desc = &ndesc_ops;
+ }
+
+ stmmac_dwmac_mode_quirk(priv);
+ return 0;
+}
+
+static int stmmac_dwmac4_quirks(struct stmmac_priv *priv)
+{
+ stmmac_dwmac_mode_quirk(priv);
+ return 0;
+}
+
+static const struct stmmac_hwif_entry {
+ bool gmac;
+ bool gmac4;
+ u32 min_id;
+ const void *desc;
+ const void *dma;
+ const void *mac;
+ const void *hwtimestamp;
+ const void *mode;
+ int (*setup)(struct stmmac_priv *priv);
+ int (*quirks)(struct stmmac_priv *priv);
+} stmmac_hw[] = {
+ /* NOTE: New HW versions shall go to the end of this table */
+ {
+ .gmac = false,
+ .gmac4 = false,
+ .min_id = 0,
+ .desc = NULL,
+ .dma = &dwmac100_dma_ops,
+ .mac = &dwmac100_ops,
+ .hwtimestamp = &stmmac_ptp,
+ .mode = NULL,
+ .setup = dwmac100_setup,
+ .quirks = stmmac_dwmac1_quirks,
+ }, {
+ .gmac = true,
+ .gmac4 = false,
+ .min_id = 0,
+ .desc = NULL,
+ .dma = &dwmac1000_dma_ops,
+ .mac = &dwmac1000_ops,
+ .hwtimestamp = &stmmac_ptp,
+ .mode = NULL,
+ .setup = dwmac1000_setup,
+ .quirks = stmmac_dwmac1_quirks,
+ }, {
+ .gmac = false,
+ .gmac4 = true,
+ .min_id = 0,
+ .desc = &dwmac4_desc_ops,
+ .dma = &dwmac4_dma_ops,
+ .mac = &dwmac4_ops,
+ .hwtimestamp = &stmmac_ptp,
+ .mode = NULL,
+ .setup = dwmac4_setup,
+ .quirks = stmmac_dwmac4_quirks,
+ }, {
+ .gmac = false,
+ .gmac4 = true,
+ .min_id = DWMAC_CORE_4_00,
+ .desc = &dwmac4_desc_ops,
+ .dma = &dwmac4_dma_ops,
+ .mac = &dwmac410_ops,
+ .hwtimestamp = &stmmac_ptp,
+ .mode = &dwmac4_ring_mode_ops,
+ .setup = dwmac4_setup,
+ .quirks = NULL,
+ }, {
+ .gmac = false,
+ .gmac4 = true,
+ .min_id = DWMAC_CORE_4_10,
+ .desc = &dwmac4_desc_ops,
+ .dma = &dwmac410_dma_ops,
+ .mac = &dwmac410_ops,
+ .hwtimestamp = &stmmac_ptp,
+ .mode = &dwmac4_ring_mode_ops,
+ .setup = dwmac4_setup,
+ .quirks = NULL,
+ }, {
+ .gmac = false,
+ .gmac4 = true,
+ .min_id = DWMAC_CORE_5_10,
+ .desc = &dwmac4_desc_ops,
+ .dma = &dwmac410_dma_ops,
+ .mac = &dwmac510_ops,
+ .hwtimestamp = &stmmac_ptp,
+ .mode = &dwmac4_ring_mode_ops,
+ .setup = dwmac4_setup,
+ .quirks = NULL,
+ }
+};
+
+int stmmac_hwif_init(struct stmmac_priv *priv)
+{
+ bool needs_gmac4 = priv->plat->has_gmac4;
+ bool needs_gmac = priv->plat->has_gmac;
+ const struct stmmac_hwif_entry *entry;
+ struct mac_device_info *mac;
+ int i, ret;
+ u32 id;
+
+ if (needs_gmac) {
+ id = stmmac_get_id(priv, GMAC_VERSION);
+ } else {
+ id = stmmac_get_id(priv, GMAC4_VERSION);
+ }
+
+ /* Save ID for later use */
+ priv->synopsys_id = id;
+
+ /* Check for HW specific setup first */
+ if (priv->plat->setup) {
+ priv->hw = priv->plat->setup(priv);
+ if (!priv->hw)
+ return -ENOMEM;
+ return 0;
+ }
+
+ mac = devm_kzalloc(priv->device, sizeof(*mac), GFP_KERNEL);
+ if (!mac)
+ return -ENOMEM;
+
+ /* Fallback to generic HW */
+ for (i = ARRAY_SIZE(stmmac_hw) - 1; i >= 0; i--) {
+ entry = &stmmac_hw[i];
+
+ if (needs_gmac ^ entry->gmac)
+ continue;
+ if (needs_gmac4 ^ entry->gmac4)
+ continue;
+ if (id < entry->min_id)
+ continue;
+
+ mac->desc = entry->desc;
+ mac->dma = entry->dma;
+ mac->mac = entry->mac;
+ mac->ptp = entry->hwtimestamp;
+ mac->mode = entry->mode;
+
+ priv->hw = mac;
+
+ /* Entry found */
+ ret = entry->setup(priv);
+ if (ret)
+ return ret;
+
+ /* Run quirks, if needed */
+ if (entry->quirks) {
+ ret = entry->quirks(priv);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+ }
+
+ return -EINVAL;
+}
diff --git a/drivers/net/ethernet/stmicro/stmmac/hwif.h b/drivers/net/ethernet/stmicro/stmmac/hwif.h
index f81ded4..bfad616 100644
--- a/drivers/net/ethernet/stmicro/stmmac/hwif.h
+++ b/drivers/net/ethernet/stmicro/stmmac/hwif.h
@@ -418,4 +418,21 @@ struct stmmac_mode_ops {
#define stmmac_clean_desc3(__priv, __args...) \
stmmac_do_void_callback(__priv, mode, clean_desc3, __args)
+struct stmmac_priv;
+
+extern const struct stmmac_ops dwmac100_ops;
+extern const struct stmmac_dma_ops dwmac100_dma_ops;
+extern const struct stmmac_ops dwmac1000_ops;
+extern const struct stmmac_dma_ops dwmac1000_dma_ops;
+extern const struct stmmac_ops dwmac4_ops;
+extern const struct stmmac_dma_ops dwmac4_dma_ops;
+extern const struct stmmac_ops dwmac410_ops;
+extern const struct stmmac_dma_ops dwmac410_dma_ops;
+extern const struct stmmac_ops dwmac510_ops;
+
+#define GMAC_VERSION 0x00000020 /* GMAC CORE Version */
+#define GMAC4_VERSION 0x00000110 /* GMAC4+ CORE Version */
+
+int stmmac_hwif_init(struct stmmac_priv *priv);
+
#endif /* __STMMAC_HWIF_H__ */
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index da50451..2443f20 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -130,6 +130,7 @@ struct stmmac_priv {
int eee_active;
int tx_lpi_timer;
unsigned int mode;
+ unsigned int chain_mode;
int extend_desc;
struct ptp_clock *ptp_clock;
struct ptp_clock_info ptp_clock_ops;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 90363a8..574924a 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -769,7 +769,6 @@ static int stmmac_init_ptp(struct stmmac_priv *priv)
netdev_info(priv->dev,
"IEEE 1588-2008 Advanced Timestamp supported\n");
- priv->hw->ptp = &stmmac_ptp;
priv->hwts_tx_en = 0;
priv->hwts_rx_en = 0;
@@ -2122,32 +2121,6 @@ static void stmmac_mmc_setup(struct stmmac_priv *priv)
}
/**
- * stmmac_selec_desc_mode - to select among: normal/alternate/extend descriptors
- * @priv: driver private structure
- * Description: select the Enhanced/Alternate or Normal descriptors.
- * In case of Enhanced/Alternate, it checks if the extended descriptors are
- * supported by the HW capability register.
- */
-static void stmmac_selec_desc_mode(struct stmmac_priv *priv)
-{
- if (priv->plat->enh_desc) {
- dev_info(priv->device, "Enhanced/Alternate descriptors\n");
-
- /* GMAC older than 3.50 has no extended descriptors */
- if (priv->synopsys_id >= DWMAC_CORE_3_50) {
- dev_info(priv->device, "Enabled extended descriptors\n");
- priv->extend_desc = 1;
- } else
- dev_warn(priv->device, "Extended descriptors not supported\n");
-
- priv->hw->desc = &enh_desc_ops;
- } else {
- dev_info(priv->device, "Normal descriptors\n");
- priv->hw->desc = &ndesc_ops;
- }
-}
-
-/**
* stmmac_get_hw_features - get MAC capabilities from the HW cap. register.
* @priv: driver private structure
* Description:
@@ -4093,49 +4066,17 @@ static void stmmac_service_task(struct work_struct *work)
*/
static int stmmac_hw_init(struct stmmac_priv *priv)
{
- struct mac_device_info *mac;
-
- /* Identify the MAC HW device */
- if (priv->plat->setup) {
- mac = priv->plat->setup(priv);
- } else if (priv->plat->has_gmac) {
- priv->dev->priv_flags |= IFF_UNICAST_FLT;
- mac = dwmac1000_setup(priv->ioaddr,
- priv->plat->multicast_filter_bins,
- priv->plat->unicast_filter_entries,
- &priv->synopsys_id);
- } else if (priv->plat->has_gmac4) {
- priv->dev->priv_flags |= IFF_UNICAST_FLT;
- mac = dwmac4_setup(priv->ioaddr,
- priv->plat->multicast_filter_bins,
- priv->plat->unicast_filter_entries,
- &priv->synopsys_id);
- } else {
- mac = dwmac100_setup(priv->ioaddr, &priv->synopsys_id);
- }
- if (!mac)
- return -ENOMEM;
-
- priv->hw = mac;
+ int ret;
/* dwmac-sun8i only work in chain mode */
if (priv->plat->has_sun8i)
chain_mode = 1;
+ priv->chain_mode = chain_mode;
- /* To use the chained or ring mode */
- if (priv->synopsys_id >= DWMAC_CORE_4_00) {
- priv->hw->mode = &dwmac4_ring_mode_ops;
- } else {
- if (chain_mode) {
- priv->hw->mode = &chain_mode_ops;
- dev_info(priv->device, "Chain mode enabled\n");
- priv->mode = STMMAC_CHAIN_MODE;
- } else {
- priv->hw->mode = &ring_mode_ops;
- dev_info(priv->device, "Ring mode enabled\n");
- priv->mode = STMMAC_RING_MODE;
- }
- }
+ /* Initialize HW Interface */
+ ret = stmmac_hwif_init(priv);
+ if (ret)
+ return ret;
/* Get the HW capability (new GMAC newer than 3.50a) */
priv->hw_cap_support = stmmac_get_hw_features(priv);
@@ -4169,12 +4110,6 @@ static int stmmac_hw_init(struct stmmac_priv *priv)
dev_info(priv->device, "No HW DMA feature register supported\n");
}
- /* To use alternate (extended), normal or GMAC4 descriptor structures */
- if (priv->synopsys_id >= DWMAC_CORE_4_00)
- priv->hw->desc = &dwmac4_desc_ops;
- else
- stmmac_selec_desc_mode(priv);
-
if (priv->plat->rx_coe) {
priv->hw->rx_csum = priv->plat->rx_coe;
dev_info(priv->device, "RX Checksum Offload Engine supported\n");
--
1.7.1
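
The table-driven selection in stmmac_hwif_init() above can be sketched in user space as follows. This is a simplified illustration under stated assumptions, not the driver code: the struct hwif_entry layout, the hwif_lookup() helper, the name strings and the CORE_* values are all hypothetical stand-ins. It only shows why new hardware versions go at the end of the table: the scan runs from the last entry backwards and stops at the first entry whose core type matches and whose min_id is not above the detected ID.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical version IDs, chosen for illustration only. */
#define CORE_4_00 0x40
#define CORE_4_10 0x41
#define CORE_5_10 0x51

/* Simplified stand-in for one row of the hardware interface table. */
struct hwif_entry {
	bool gmac;		/* entry applies to GMAC cores */
	bool gmac4;		/* entry applies to GMAC4+ cores */
	uint32_t min_id;	/* lowest core ID this entry supports */
	const char *name;	/* label standing in for the ops pointers */
};

/* Newer hardware versions are appended at the end of the table. */
static const struct hwif_entry hw_table[] = {
	{ false, false, 0,         "dwmac100"  },
	{ true,  false, 0,         "dwmac1000" },
	{ false, true,  0,         "dwmac4"    },
	{ false, true,  CORE_4_00, "dwmac4.00" },
	{ false, true,  CORE_4_10, "dwmac4.10" },
	{ false, true,  CORE_5_10, "dwmac5.10" },
};

/*
 * Walk the table from the last entry to the first and return the first
 * entry whose core type matches and whose min_id is <= id, or NULL if
 * no entry matches.
 */
static const struct hwif_entry *hwif_lookup(bool gmac, bool gmac4, uint32_t id)
{
	size_t i = sizeof(hw_table) / sizeof(hw_table[0]);

	while (i-- > 0) {
		const struct hwif_entry *e = &hw_table[i];

		if (e->gmac != gmac || e->gmac4 != gmac4)
			continue;
		if (id < e->min_id)
			continue;
		return e;
	}

	return NULL;
}
```

With this table, hwif_lookup(false, true, 0x45) skips the 5.10 row and returns the 4.10 row, while an ID below CORE_4_00 falls through to the generic "dwmac4" row.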
* Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.
From: Ben Greear @ 2018-04-19 15:25 UTC (permalink / raw)
To: Johannes Berg, netdev; +Cc: linux-wireless, ath10k
In-Reply-To: <1524119910.3024.12.camel@sipsolutions.net>
On 04/18/2018 11:38 PM, Johannes Berg wrote:
> On Wed, 2018-04-18 at 14:51 -0700, Ben Greear wrote:
>
>>> It'd be pretty hard to know which flags are firmware stats?
>>
>> Yes, it is, but ethtool stats are difficult to understand in a generic
>> manner anyway, so someone using them is already likely aware of low-level
>> details of the driver(s) they are using.
>
> Right. Come to think of it though,
>
>> + * @get_ethtool_stats2: Return extended statistics about the device.
>> + * This is only useful if the device maintains statistics not
>> + * included in &struct rtnl_link_stats64.
>> + * Takes a flags argument: 0 means all (same as get_ethtool_stats),
>> + * 0x1 (ETHTOOL_GS2_SKIP_FW) means skip firmware stats.
>> + * Other flags are reserved for now.
>> + * Same number of stats will be returned, but some of them might
>> + * not be as accurate/refreshed. This is to allow not querying
>> + * firmware or other expensive-to-read stats, for instance.
>
> "skip" vs. "don't refresh" is a bit ambiguous - I'd argue better to
> either really skip and not return the non-refreshed ones (also helps
> with the identifying), or rename the flag.

In order to efficiently parse lots of stats over and over again, I probe
the stat names once on startup, map them to the variable I am trying to use
(since different drivers may have different names for the same basic stat),
and then I store the stat index.

On subsequent stat reads, I just grab stats and go right to the index to
store the stat.

If the stats indexes change, that will complicate my logic quite a bit.

Maybe the flag could be called: ETHTOOL_GS2_NO_REFRESH_FW ?
>
> Also, wrt. the rest of the patch, I'd argue that it'd be worthwhile to
> write the spatch and just add the flags argument to "get_ethtool_stats"
> instead of adding a separate method - internally to the kernel it's not
> that hard to change.
Maybe this could be in followup patches? It's going to touch a lot of files,
and might be hell to get merged all at once, and I've never used spatch, so
just maybe someone else will volunteer that part :)
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
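
The index-caching approach described in this message can be sketched roughly as below. This is only an illustration under assumptions: the stat names ("rx_octets", "tx_octets" and so on), struct stat_map and the helper functions are hypothetical and do not come from any particular driver or from the ethtool sources. The string lookup is paid once at startup, and later reads index straight into the value array, which is why a stable stat order matters.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Return the index of the first entry in names[] that equals any of the
 * given aliases, or -1 if none of them is present.
 */
static int find_stat_index(const char *const *names, size_t n_names,
			   const char *const *aliases, size_t n_aliases)
{
	for (size_t i = 0; i < n_names; i++)
		for (size_t a = 0; a < n_aliases; a++)
			if (strcmp(names[i], aliases[a]) == 0)
				return (int)i;
	return -1;
}

/* Indexes resolved once at startup (hypothetical selection of stats). */
struct stat_map {
	int rx_bytes_idx;
	int tx_bytes_idx;
};

/* Probe the name table once and remember where each stat lives. */
static void stat_map_init(struct stat_map *m,
			  const char *const *names, size_t n_names)
{
	/* Different drivers may use different names for the same stat. */
	static const char *const rx_aliases[] = { "rx_bytes", "rx_octets" };
	static const char *const tx_aliases[] = { "tx_bytes", "tx_octets" };

	m->rx_bytes_idx = find_stat_index(names, n_names, rx_aliases, 2);
	m->tx_bytes_idx = find_stat_index(names, n_names, tx_aliases, 2);
}

/* Later reads skip all string matching and use the cached index. */
static uint64_t stat_get(const uint64_t *values, int idx)
{
	return idx >= 0 ? values[idx] : 0;
}
```

If a flag caused some stats to be left out of the returned array, every cached index after the first omitted stat would point at the wrong value, so the map would have to be rebuilt per flag combination.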
* Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.
From: Johannes Berg @ 2018-04-19 15:26 UTC (permalink / raw)
To: Ben Greear, netdev; +Cc: linux-wireless, ath10k
In-Reply-To: <173c5f98-36bc-2e52-1e64-3a5f89008d46-my8/4N5VtI7c+919tysfdA@public.gmane.org>
On Thu, 2018-04-19 at 08:25 -0700, Ben Greear wrote:
>
> In order to efficiently parse lots of stats over and over again, I probe
> the stat names once on startup, map them to the variable I am trying to use
> (since different drivers may have different names for the same basic stat),
> and then I store the stat index.
>
> On subsequent stat reads, I just grab stats and go right to the index to
> store the stat.
>
> If the stats indexes change, that will complicate my logic quite a bit.
That's a good point.
> Maybe the flag could be called: ETHTOOL_GS2_NO_REFRESH_FW ?
Sounds more to the point to me, yeah.
> >
> > Also, wrt. the rest of the patch, I'd argue that it'd be worthwhile to
> > write the spatch and just add the flags argument to "get_ethtool_stats"
> > instead of adding a separate method - internally to the kernel it's not
> > that hard to change.
>
> Maybe this could be in followup patches? It's going to touch a lot of files,
> and might be hell to get merged all at once, and I've never used spatch, so
> just maybe someone else will volunteer that part :)
I guess you'll have to ask davem. :)
johannes
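
The "no refresh" semantics agreed on above can be sketched as follows. This is a user-space illustration under assumptions, not the proposed kernel patch: struct dev, fw_refresh(), get_stats2() and the GS2_NO_REFRESH_FW value are hypothetical stand-ins. The point it shows is that the flag leaves the number and order of returned stats unchanged and only suppresses the expensive firmware query, so the firmware-backed values may be stale.

```c
#include <stdint.h>

/* Hypothetical stand-in for the flag name proposed in the thread. */
#define GS2_NO_REFRESH_FW 0x1u
#define N_STATS 3

struct dev {
	uint64_t host_rx_packets;	/* cheap: maintained by the driver */
	uint64_t fw_cache[2];		/* last values fetched from firmware */
	unsigned int fw_queries;	/* number of expensive firmware reads */
};

/* Stand-in for an expensive firmware query that updates the cache. */
static void fw_refresh(struct dev *d)
{
	d->fw_queries++;
	d->fw_cache[0] += 5;
	d->fw_cache[1] += 1;
}

/*
 * Always fill all N_STATS slots in the same order. With
 * GS2_NO_REFRESH_FW set, the firmware-backed slots carry the cached
 * values instead of triggering a new firmware query.
 */
static void get_stats2(struct dev *d, uint32_t flags, uint64_t out[N_STATS])
{
	if (!(flags & GS2_NO_REFRESH_FW))
		fw_refresh(d);

	out[0] = d->host_rx_packets;
	out[1] = d->fw_cache[0];
	out[2] = d->fw_cache[1];
}
```

Because every slot is always present, a caller that cached stat indexes at startup keeps working regardless of which flags it passes.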
* Re: Reply: [PATCH][net-next] net: ip tos cgroup
From: David Miller @ 2018-04-19 15:27 UTC (permalink / raw)
To: lirongqing; +Cc: daniel, netdev, tj, ast, brakmo
In-Reply-To: <2AD939572F25A448A3AE3CAEA61328C2375E35F6@BC-MAIL-M28.internal.baidu.com>
From: "Li,Rongqing" <lirongqing@baidu.com>
Date: Thu, 19 Apr 2018 04:56:09 +0000
> I think this method is easier to use than BPF, and more efficient
Ease of use is arguable, but for anything bpf this is going to
steadily improve over time, so it is not much of an argument
against using bpf.
And I disagree on the efficiency aspect as well.
* Re: [PATCH 16/39] ipmi: simplify procfs code
From: Corey Minyard @ 2018-04-19 15:29 UTC (permalink / raw)
To: Christoph Hellwig, Andrew Morton, Alexander Viro
Cc: linux-rtc, Alessandro Zummo, Alexandre Belloni, devel,
linux-kernel, linux-scsi, linux-ide, Greg Kroah-Hartman,
jfs-discussion, linux-afs, linux-acpi, netdev, netfilter-devel,
Jiri Slaby, linux-ext4, Alexey Dobriyan, megaraidlinux.pdl,
drbd-dev
In-Reply-To: <20180419124140.9309-17-hch@lst.de>
On 04/19/2018 07:41 AM, Christoph Hellwig wrote:
> Use remove_proc_subtree to remove the whole subtree on cleanup instead
> of a hand rolled list of proc entries, unwind the registration loop into
> individual calls. Switch to use proc_create_single to further simplify
> the code.
I'm yanking all the proc code out of the IPMI driver in 3.18. So this
is probably not necessary.
Thanks,
-corey
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> drivers/char/ipmi/ipmi_msghandler.c | 150 +++++-----------------------
> drivers/char/ipmi/ipmi_si_intf.c | 47 +--------
> drivers/char/ipmi/ipmi_ssif.c | 34 +------
> include/linux/ipmi_smi.h | 8 +-
> 4 files changed, 33 insertions(+), 206 deletions(-)
>
> diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c
> index 361148938801..c18db313e4c4 100644
> --- a/drivers/char/ipmi/ipmi_msghandler.c
> +++ b/drivers/char/ipmi/ipmi_msghandler.c
> @@ -247,13 +247,6 @@ struct ipmi_my_addrinfo {
> unsigned char lun;
> };
>
> -#ifdef CONFIG_IPMI_PROC_INTERFACE
> -struct ipmi_proc_entry {
> - char *name;
> - struct ipmi_proc_entry *next;
> -};
> -#endif
> -
> /*
> * Note that the product id, manufacturer id, guid, and device id are
> * immutable in this structure, so dyn_mutex is not required for
> @@ -430,10 +423,6 @@ struct ipmi_smi {
> void *send_info;
>
> #ifdef CONFIG_IPMI_PROC_INTERFACE
> - /* A list of proc entries for this interface. */
> - struct mutex proc_entry_lock;
> - struct ipmi_proc_entry *proc_entries;
> -
> struct proc_dir_entry *proc_dir;
> char proc_dir_name[10];
> #endif
> @@ -2358,18 +2347,6 @@ static int smi_ipmb_proc_show(struct seq_file *m, void *v)
> return 0;
> }
>
> -static int smi_ipmb_proc_open(struct inode *inode, struct file *file)
> -{
> - return single_open(file, smi_ipmb_proc_show, PDE_DATA(inode));
> -}
> -
> -static const struct file_operations smi_ipmb_proc_ops = {
> - .open = smi_ipmb_proc_open,
> - .read = seq_read,
> - .llseek = seq_lseek,
> - .release = single_release,
> -};
> -
> static int smi_version_proc_show(struct seq_file *m, void *v)
> {
> ipmi_smi_t intf = m->private;
> @@ -2387,18 +2364,6 @@ static int smi_version_proc_show(struct seq_file *m, void *v)
> return 0;
> }
>
> -static int smi_version_proc_open(struct inode *inode, struct file *file)
> -{
> - return single_open(file, smi_version_proc_show, PDE_DATA(inode));
> -}
> -
> -static const struct file_operations smi_version_proc_ops = {
> - .open = smi_version_proc_open,
> - .read = seq_read,
> - .llseek = seq_lseek,
> - .release = single_release,
> -};
> -
> static int smi_stats_proc_show(struct seq_file *m, void *v)
> {
> ipmi_smi_t intf = m->private;
> @@ -2462,95 +2427,45 @@ static int smi_stats_proc_show(struct seq_file *m, void *v)
> return 0;
> }
>
> -static int smi_stats_proc_open(struct inode *inode, struct file *file)
> -{
> - return single_open(file, smi_stats_proc_show, PDE_DATA(inode));
> -}
> -
> -static const struct file_operations smi_stats_proc_ops = {
> - .open = smi_stats_proc_open,
> - .read = seq_read,
> - .llseek = seq_lseek,
> - .release = single_release,
> -};
> -
> int ipmi_smi_add_proc_entry(ipmi_smi_t smi, char *name,
> - const struct file_operations *proc_ops,
> - void *data)
> + int (*show)(struct seq_file *, void *), void *data)
> {
> - int rv = 0;
> - struct proc_dir_entry *file;
> - struct ipmi_proc_entry *entry;
> -
> - /* Create a list element. */
> - entry = kmalloc(sizeof(*entry), GFP_KERNEL);
> - if (!entry)
> + if (!proc_create_single_data(name, 0, smi->proc_dir, show, data))
> return -ENOMEM;
> - entry->name = kstrdup(name, GFP_KERNEL);
> - if (!entry->name) {
> - kfree(entry);
> - return -ENOMEM;
> - }
> -
> - file = proc_create_data(name, 0, smi->proc_dir, proc_ops, data);
> - if (!file) {
> - kfree(entry->name);
> - kfree(entry);
> - rv = -ENOMEM;
> - } else {
> - mutex_lock(&smi->proc_entry_lock);
> - /* Stick it on the list. */
> - entry->next = smi->proc_entries;
> - smi->proc_entries = entry;
> - mutex_unlock(&smi->proc_entry_lock);
> - }
> -
> - return rv;
> + return 0;
> }
> EXPORT_SYMBOL(ipmi_smi_add_proc_entry);
>
> static int add_proc_entries(ipmi_smi_t smi, int num)
> {
> - int rv = 0;
> -
> sprintf(smi->proc_dir_name, "%d", num);
> smi->proc_dir = proc_mkdir(smi->proc_dir_name, proc_ipmi_root);
> if (!smi->proc_dir)
> - rv = -ENOMEM;
> -
> - if (rv == 0)
> - rv = ipmi_smi_add_proc_entry(smi, "stats",
> - &smi_stats_proc_ops,
> - smi);
> -
> - if (rv == 0)
> - rv = ipmi_smi_add_proc_entry(smi, "ipmb",
> - &smi_ipmb_proc_ops,
> - smi);
> -
> - if (rv == 0)
> - rv = ipmi_smi_add_proc_entry(smi, "version",
> - &smi_version_proc_ops,
> - smi);
> -
> - return rv;
> + return -ENOMEM;
> + if (!proc_create_single_data("stats", 0, smi->proc_dir,
> + smi_stats_proc_show, smi))
> + return -ENOMEM;
> + if (!proc_create_single_data("ipmb", 0, smi->proc_dir,
> + smi_ipmb_proc_show, smi))
> + return -ENOMEM;
> + if (!proc_create_single_data("version", 0, smi->proc_dir,
> + smi_version_proc_show, smi))
> + return -ENOMEM;
> + return 0;
> }
>
> static void remove_proc_entries(ipmi_smi_t smi)
> {
> - struct ipmi_proc_entry *entry;
> -
> - mutex_lock(&smi->proc_entry_lock);
> - while (smi->proc_entries) {
> - entry = smi->proc_entries;
> - smi->proc_entries = entry->next;
> -
> - remove_proc_entry(entry->name, smi->proc_dir);
> - kfree(entry->name);
> - kfree(entry);
> - }
> - mutex_unlock(&smi->proc_entry_lock);
> - remove_proc_entry(smi->proc_dir_name, proc_ipmi_root);
> + if (smi->proc_dir)
> + remove_proc_subtree(smi->proc_dir_name, proc_ipmi_root);
> +}
> +#else
> +static int add_proc_entries(ipmi_smi_t smi, int num)
> +{
> + return 0;
> +}
> +static void remove_proc_entries(ipmi_smi_t smi)
> +{
> }
> #endif /* CONFIG_IPMI_PROC_INTERFACE */
>
> @@ -3386,9 +3301,6 @@ int ipmi_register_smi(const struct ipmi_smi_handlers *handlers,
> intf->seq_table[j].seqid = 0;
> }
> intf->curr_seq = 0;
> -#ifdef CONFIG_IPMI_PROC_INTERFACE
> - mutex_init(&intf->proc_entry_lock);
> -#endif
> spin_lock_init(&intf->waiting_rcv_msgs_lock);
> INIT_LIST_HEAD(&intf->waiting_rcv_msgs);
> tasklet_init(&intf->recv_tasklet,
> @@ -3410,10 +3322,6 @@ int ipmi_register_smi(const struct ipmi_smi_handlers *handlers,
> for (i = 0; i < IPMI_NUM_STATS; i++)
> atomic_set(&intf->stats[i], 0);
>
> -#ifdef CONFIG_IPMI_PROC_INTERFACE
> - intf->proc_dir = NULL;
> -#endif
> -
> mutex_lock(&smi_watchers_mutex);
> mutex_lock(&ipmi_interfaces_mutex);
> /* Look for a hole in the numbers. */
> @@ -3448,17 +3356,11 @@ int ipmi_register_smi(const struct ipmi_smi_handlers *handlers,
> if (rv)
> goto out;
>
> -#ifdef CONFIG_IPMI_PROC_INTERFACE
> rv = add_proc_entries(intf, i);
> -#endif
> -
> out:
> if (rv) {
> ipmi_bmc_unregister(intf);
> -#ifdef CONFIG_IPMI_PROC_INTERFACE
> - if (intf->proc_dir)
> - remove_proc_entries(intf);
> -#endif
> + remove_proc_entries(intf);
> intf->handlers = NULL;
> list_del_rcu(&intf->link);
> mutex_unlock(&ipmi_interfaces_mutex);
> @@ -3563,9 +3465,7 @@ int ipmi_unregister_smi(ipmi_smi_t intf)
> intf->handlers = NULL;
> mutex_unlock(&ipmi_interfaces_mutex);
>
> -#ifdef CONFIG_IPMI_PROC_INTERFACE
> remove_proc_entries(intf);
> -#endif
> ipmi_bmc_unregister(intf);
>
> /*
> diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
> index ff870aa91cfe..7acc8cbf8d6f 100644
> --- a/drivers/char/ipmi/ipmi_si_intf.c
> +++ b/drivers/char/ipmi/ipmi_si_intf.c
> @@ -1602,18 +1602,6 @@ static int smi_type_proc_show(struct seq_file *m, void *v)
> return 0;
> }
>
> -static int smi_type_proc_open(struct inode *inode, struct file *file)
> -{
> - return single_open(file, smi_type_proc_show, PDE_DATA(inode));
> -}
> -
> -static const struct file_operations smi_type_proc_ops = {
> - .open = smi_type_proc_open,
> - .read = seq_read,
> - .llseek = seq_lseek,
> - .release = single_release,
> -};
> -
> static int smi_si_stats_proc_show(struct seq_file *m, void *v)
> {
> struct smi_info *smi = m->private;
> @@ -1645,18 +1633,6 @@ static int smi_si_stats_proc_show(struct seq_file *m, void *v)
> return 0;
> }
>
> -static int smi_si_stats_proc_open(struct inode *inode, struct file *file)
> -{
> - return single_open(file, smi_si_stats_proc_show, PDE_DATA(inode));
> -}
> -
> -static const struct file_operations smi_si_stats_proc_ops = {
> - .open = smi_si_stats_proc_open,
> - .read = seq_read,
> - .llseek = seq_lseek,
> - .release = single_release,
> -};
> -
> static int smi_params_proc_show(struct seq_file *m, void *v)
> {
> struct smi_info *smi = m->private;
> @@ -1674,18 +1650,6 @@ static int smi_params_proc_show(struct seq_file *m, void *v)
>
> return 0;
> }
> -
> -static int smi_params_proc_open(struct inode *inode, struct file *file)
> -{
> - return single_open(file, smi_params_proc_show, PDE_DATA(inode));
> -}
> -
> -static const struct file_operations smi_params_proc_ops = {
> - .open = smi_params_proc_open,
> - .read = seq_read,
> - .llseek = seq_lseek,
> - .release = single_release,
> -};
> #endif
>
> #define IPMI_SI_ATTR(name) \
> @@ -2182,10 +2146,8 @@ static int try_smi_init(struct smi_info *new_smi)
> goto out_err;
> }
>
> -#ifdef CONFIG_IPMI_PROC_INTERFACE
> rv = ipmi_smi_add_proc_entry(new_smi->intf, "type",
> - &smi_type_proc_ops,
> - new_smi);
> + smi_type_proc_show, new_smi);
> if (rv) {
> dev_err(new_smi->io.dev,
> "Unable to create proc entry: %d\n", rv);
> @@ -2193,8 +2155,7 @@ static int try_smi_init(struct smi_info *new_smi)
> }
>
> rv = ipmi_smi_add_proc_entry(new_smi->intf, "si_stats",
> - &smi_si_stats_proc_ops,
> - new_smi);
> + smi_si_stats_proc_show, new_smi);
> if (rv) {
> dev_err(new_smi->io.dev,
> "Unable to create proc entry: %d\n", rv);
> @@ -2202,14 +2163,12 @@ static int try_smi_init(struct smi_info *new_smi)
> }
>
> rv = ipmi_smi_add_proc_entry(new_smi->intf, "params",
> - &smi_params_proc_ops,
> - new_smi);
> + smi_params_proc_show, new_smi);
> if (rv) {
> dev_err(new_smi->io.dev,
> "Unable to create proc entry: %d\n", rv);
> goto out_err;
> }
> -#endif
>
> /* Don't increment till we know we have succeeded. */
> smi_num++;
> diff --git a/drivers/char/ipmi/ipmi_ssif.c b/drivers/char/ipmi/ipmi_ssif.c
> index 35a82f4bfd78..57f6116a17b2 100644
> --- a/drivers/char/ipmi/ipmi_ssif.c
> +++ b/drivers/char/ipmi/ipmi_ssif.c
> @@ -1349,18 +1349,6 @@ static int smi_type_proc_show(struct seq_file *m, void *v)
> return 0;
> }
>
> -static int smi_type_proc_open(struct inode *inode, struct file *file)
> -{
> - return single_open(file, smi_type_proc_show, inode->i_private);
> -}
> -
> -static const struct file_operations smi_type_proc_ops = {
> - .open = smi_type_proc_open,
> - .read = seq_read,
> - .llseek = seq_lseek,
> - .release = single_release,
> -};
> -
> static int smi_stats_proc_show(struct seq_file *m, void *v)
> {
> struct ssif_info *ssif_info = m->private;
> @@ -1393,18 +1381,6 @@ static int smi_stats_proc_show(struct seq_file *m, void *v)
> ssif_get_stat(ssif_info, alerts));
> return 0;
> }
> -
> -static int smi_stats_proc_open(struct inode *inode, struct file *file)
> -{
> - return single_open(file, smi_stats_proc_show, PDE_DATA(inode));
> -}
> -
> -static const struct file_operations smi_stats_proc_ops = {
> - .open = smi_stats_proc_open,
> - .read = seq_read,
> - .llseek = seq_lseek,
> - .release = single_release,
> -};
> #endif
>
> static int strcmp_nospace(char *s1, char *s2)
> @@ -1740,23 +1716,19 @@ static int ssif_probe(struct i2c_client *client, const struct i2c_device_id *id)
> goto out_remove_attr;
> }
>
> -#ifdef CONFIG_IPMI_PROC_INTERFACE
> rv = ipmi_smi_add_proc_entry(ssif_info->intf, "type",
> - &smi_type_proc_ops,
> - ssif_info);
> + smi_type_proc_show, ssif_info);
> if (rv) {
> pr_err(PFX "Unable to create proc entry: %d\n", rv);
> goto out_err_unreg;
> }
>
> rv = ipmi_smi_add_proc_entry(ssif_info->intf, "ssif_stats",
> - &smi_stats_proc_ops,
> - ssif_info);
> + smi_stats_proc_show, ssif_info);
> if (rv) {
> pr_err(PFX "Unable to create proc entry: %d\n", rv);
> goto out_err_unreg;
> }
> -#endif
>
> out:
> if (rv) {
> @@ -1775,10 +1747,8 @@ static int ssif_probe(struct i2c_client *client, const struct i2c_device_id *id)
> kfree(resp);
> return rv;
>
> -#ifdef CONFIG_IPMI_PROC_INTERFACE
> out_err_unreg:
> ipmi_unregister_smi(ssif_info->intf);
> -#endif
>
> out_remove_attr:
> device_remove_group(&ssif_info->client->dev, &ipmi_ssif_dev_attr_group);
> diff --git a/include/linux/ipmi_smi.h b/include/linux/ipmi_smi.h
> index af457b5a689e..78d9fd480fe8 100644
> --- a/include/linux/ipmi_smi.h
> +++ b/include/linux/ipmi_smi.h
> @@ -223,12 +223,10 @@ static inline void ipmi_free_smi_msg(struct ipmi_smi_msg *msg)
> }
>
> #ifdef CONFIG_IPMI_PROC_INTERFACE
> -/* Allow the lower layer to add things to the proc filesystem
> - directory for this interface. Note that the entry will
> - automatically be dstroyed when the interface is destroyed. */
> int ipmi_smi_add_proc_entry(ipmi_smi_t smi, char *name,
> - const struct file_operations *proc_ops,
> - void *data);
> + int (*show)(struct seq_file *, void *), void *data);
> +#else
> +#define ipmi_smi_add_proc_entry(smi, name, show, data) 0
> #endif
>
> #endif /* __LINUX_IPMI_SMI_H */
^ permalink raw reply
* [PATCH] Bluetooth: use wait_event API instead of open-coding it
From: John Keeping @ 2018-04-19 15:29 UTC (permalink / raw)
To: linux-bluetooth
Cc: netdev, linux-kernel, Marcel Holtmann, Johan Hedberg,
John Keeping
I've seen timeout errors from HCI commands where it looks like
schedule_timeout() has returned immediately; additional logging for the
error case gives:
req_status=1 req_result=0 remaining=10000 jiffies
so the device is still in state HCI_REQ_PEND and the value returned by
schedule_timeout() is the same as the original timeout (HCI_INIT_TIMEOUT
on a system with HZ=1000).
Use wait_event_interruptible_timeout() instead of open-coding similar
behaviour which is subject to the spurious failure described above.
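As an illustration only (not part of the patch), the difference between
sleeping once and re-checking a condition after every wakeup can be modelled
in plain userspace C. The names below (REQ_PEND, REQ_DONE, open_coded_wait,
predicate_wait) are hypothetical stand-ins, and the array merely simulates the
value of req_status observed at each wakeup; this is a sketch of the idea, not
the kernel implementation:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the HCI request states. */
enum { REQ_DONE = 0, REQ_PEND = 1 };

/*
 * Model of the open-coded pattern: sleep once, then inspect the status.
 * status_at_wakeup[i] is the status observed at the i-th wakeup.
 */
static int open_coded_wait(const int *status_at_wakeup, size_t nwakeups)
{
	assert(status_at_wakeup != NULL);
	if (nwakeups == 0 || status_at_wakeup[0] == REQ_PEND)
		return -ETIMEDOUT;	/* an early wakeup looks like a timeout */
	return 0;
}

/*
 * Model of a wait_event-style loop: the condition is re-checked after
 * every wakeup, and a timeout is reported only when the wakeups run out.
 */
static int predicate_wait(const int *status_at_wakeup, size_t nwakeups)
{
	size_t i;

	assert(status_at_wakeup != NULL);
	for (i = 0; i < nwakeups; i++)
		if (status_at_wakeup[i] != REQ_PEND)
			return 0;
	return -ETIMEDOUT;
}
```

With a wakeup sequence of { REQ_PEND, REQ_PEND, REQ_DONE }, the first model
reports a timeout while the second one returns success, which is the
behaviour the conversion is after.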
Signed-off-by: John Keeping <john@metanate.com>
---
I saw problems with the -rt patchset on 4.9 and I'm not convinced that
it's Bluetooth at fault for the problem described above, but I think
this is a nice cleanup even if it's not a bug fix.
net/bluetooth/hci_request.c | 30 +++++++-----------------------
1 file changed, 7 insertions(+), 23 deletions(-)
diff --git a/net/bluetooth/hci_request.c b/net/bluetooth/hci_request.c
index 66c0781773df..e44d34734834 100644
--- a/net/bluetooth/hci_request.c
+++ b/net/bluetooth/hci_request.c
@@ -122,7 +122,6 @@ void hci_req_sync_cancel(struct hci_dev *hdev, int err)
struct sk_buff *__hci_cmd_sync_ev(struct hci_dev *hdev, u16 opcode, u32 plen,
const void *param, u8 event, u32 timeout)
{
- DECLARE_WAITQUEUE(wait, current);
struct hci_request req;
struct sk_buff *skb;
int err = 0;
@@ -135,21 +134,14 @@ struct sk_buff *__hci_cmd_sync_ev(struct hci_dev *hdev, u16 opcode, u32 plen,
hdev->req_status = HCI_REQ_PEND;
- add_wait_queue(&hdev->req_wait_q, &wait);
- set_current_state(TASK_INTERRUPTIBLE);
-
err = hci_req_run_skb(&req, hci_req_sync_complete);
- if (err < 0) {
- remove_wait_queue(&hdev->req_wait_q, &wait);
- set_current_state(TASK_RUNNING);
+ if (err < 0)
return ERR_PTR(err);
- }
- schedule_timeout(timeout);
+ err = wait_event_interruptible_timeout(hdev->req_wait_q,
+ hdev->req_status != HCI_REQ_PEND, timeout);
- remove_wait_queue(&hdev->req_wait_q, &wait);
-
- if (signal_pending(current))
+ if (err == -ERESTARTSYS)
return ERR_PTR(-EINTR);
switch (hdev->req_status) {
@@ -197,7 +189,6 @@ int __hci_req_sync(struct hci_dev *hdev, int (*func)(struct hci_request *req,
unsigned long opt, u32 timeout, u8 *hci_status)
{
struct hci_request req;
- DECLARE_WAITQUEUE(wait, current);
int err = 0;
BT_DBG("%s start", hdev->name);
@@ -213,16 +204,10 @@ int __hci_req_sync(struct hci_dev *hdev, int (*func)(struct hci_request *req,
return err;
}
- add_wait_queue(&hdev->req_wait_q, &wait);
- set_current_state(TASK_INTERRUPTIBLE);
-
err = hci_req_run_skb(&req, hci_req_sync_complete);
if (err < 0) {
hdev->req_status = 0;
- remove_wait_queue(&hdev->req_wait_q, &wait);
- set_current_state(TASK_RUNNING);
-
/* ENODATA means the HCI request command queue is empty.
* This can happen when a request with conditionals doesn't
* trigger any commands to be sent. This is normal behavior
@@ -240,11 +225,10 @@ int __hci_req_sync(struct hci_dev *hdev, int (*func)(struct hci_request *req,
return err;
}
- schedule_timeout(timeout);
-
- remove_wait_queue(&hdev->req_wait_q, &wait);
+ err = wait_event_interruptible_timeout(hdev->req_wait_q,
+ hdev->req_status != HCI_REQ_PEND, timeout);
- if (signal_pending(current))
+ if (err == -ERESTARTSYS)
return -EINTR;
switch (hdev->req_status) {
--
2.17.0
^ permalink raw reply related
* Re: [bisected] Stack overflow after fs: "switch the IO-triggering parts of umount to fs_pin" (was net namespaces kernel stack overflow)
From: Al Viro @ 2018-04-19 15:34 UTC (permalink / raw)
To: Kirill Tkhai; +Cc: Alexander Aring, linux-kernel, netdev, Jamal Hadi Salim
In-Reply-To: <d6e6f694-1bd9-7f3e-eaa8-1947c47f523f@virtuozzo.com>
On Thu, Apr 19, 2018 at 03:50:25PM +0300, Kirill Tkhai wrote:
> Hi, Al,
>
> commit 87b95ce0964c016ede92763be9c164e49f1019e9 is the first after which the below test crashes the kernel:
>
> Author: Al Viro <viro@zeniv.linux.org.uk>
> Date: Sat Jan 10 19:01:08 2015 -0500
>
> switch the IO-triggering parts of umount to fs_pin
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
>
> $modprobe dummy
>
> $while true
> do
> mkdir /var/run/netns
> touch /var/run/netns/init_net
> mount --bind /proc/1/ns/net /var/run/netns/init_net
>
> ip netns add foo
> ip netns exec foo ip link add dummy0 type dummy
> ip netns delete foo
> done
I can reproduce that, all right, and with a stack chain that
looks like this:
[77132.414912] pin_kill+0x81/0x150
[77132.424362] ? finish_wait+0x80/0x80
[77132.433917] mnt_pin_kill+0x1e/0x30
[77132.443829] cleanup_mnt+0x6b/0x70
[77132.453477] pin_kill+0x81/0x150
[77132.463064] ? finish_wait+0x80/0x80
[77132.472553] group_pin_kill+0x1a/0x30
[77132.481973] namespace_unlock+0x6f/0x80
[77132.491801] put_mnt_ns+0x1d/0x30
[77132.501258] free_nsproxy+0x17/0x90
[77132.510604] do_exit+0x2dc/0xb40
[77132.520146] ? handle_mm_fault+0xaa/0x1e0
[77132.529725] do_group_exit+0x3a/0xa0
[77132.539506] SyS_exit_group+0x10/0x10
with the top 4 entries repeated a lot. Those cleanup_mnt()
could be called from __cleanup_mnt(), delayed_mntput() or
mntput_no_expire().
__cleanup_mnt() is only fed to task_work_add(); no way in hell
would you get the call stack similar to that; it would be
called by task_work_run() from exit_task_work() from
do_exit(). Not in the evidence.
delayed_mntput() is only fed to schedule_delayed_work();
again, not a chance of having the call chain like that.
The one in mntput_no_expire() is a tail-call, with
mntput_no_expire() called from umount(2) and mntput()
(tail-calls both of them). The former is never called
from exit(2), so that call chain reads
pin_kill -> mntput or something tail-calling mntput -> mntput_no_expire ->
cleanup_mnt -> mnt_pin_kill -> pin_kill
Now, the thing called by pin_kill must be something passed to
init_fs_pin(), i.e. acct_pin_kill() or drop_mountpoint().
acct_pin_kill() ends with
pin_remove(pin);
acct_put(acct);
}
with
static void acct_put(struct bsd_acct_struct *p)
{
if (atomic_long_dec_and_test(&p->count))
kfree_rcu(p, rcu);
}
IOW, no tail-call of mntput() in there. OTOH,
static void drop_mountpoint(struct fs_pin *p)
{
struct mount *m = container_of(p, struct mount, mnt_umount);
dput(m->mnt_ex_mountpoint);
pin_remove(p);
mntput(&m->mnt);
}
*does* have the tail-call, so this call chain must be
pin_kill -> drop_mountpoint -> mntput -> mntput_no_expire ->
cleanup_mnt -> mnt_pin_kill -> pin_kill
So far, so good, but if you look into mntput_no_expire() you see
if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
struct task_struct *task = current;
if (likely(!(task->flags & PF_KTHREAD))) {
init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
if (!task_work_add(task, &mnt->mnt_rcu, true))
return;
}
if (llist_add(&mnt->mnt_llist, &delayed_mntput_list))
schedule_delayed_work(&delayed_mntput_work, 1);
return;
}
cleanup_mnt(mnt);
IOW, we only get there if our vfsmount was an MNT_INTERNAL one.
So we have mnt->mnt_umount of some MNT_INTERNAL mount found in
->mnt_pins of some other mount. Which, AFAICS, means that
it used to be mounted on that other mount. How the hell can
that happen?
It looks like you somehow get a long chain of MNT_INTERNAL mounts
stacked on top of each other, which ought to be prevented by
mnt_flags &= ~MNT_INTERNAL_FLAGS;
in do_add_mount(). Nuts...
^ permalink raw reply
* [PATCH net-next] net/mlx4_en: optimizes get_fixed_ipv6_csum()
From: Eric Dumazet @ 2018-04-19 15:49 UTC (permalink / raw)
To: David S . Miller
Cc: netdev, Eric Dumazet, Eric Dumazet, Saeed Mahameed, Tariq Toukan
While trying to support CHECKSUM_COMPLETE for IPv6 fragments,
I had to experiment with various hacks in get_fixed_ipv6_csum().
I must admit I could not find how to implement this :/
However, get_fixed_ipv6_csum() does a lot of redundant operations,
calling csum_partial() twice.
The first csum_partial() computes the checksum of saddr and daddr,
stored in @csum_pseudo_hdr. This is undone later by the second
csum_partial(), computed on the whole IPv6 header.
Then nexthdr is added once, added a second time, then subtracted.
payload_len is added once, then subtracted.
Really, all this can be reduced to two csum_add() calls that add back the
6 bytes removed by mlx4 when providing hw_checksum in the RX descriptor.
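The cancellation can be checked with a small userspace model. This is only a
sketch under stated assumptions: csum_add32(), csum_sub32(), partial() and the
two *_fixup() helpers are hypothetical simplifications written for this
illustration, not the kernel's csum_add()/csum_sub()/csum_partial(), and only
the folded 16-bit results are compared:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 32-bit ones' complement add with end-around carry. */
static uint32_t csum_add32(uint32_t a, uint32_t b)
{
	uint32_t res = a + b;

	return res + (res < b);
}

/* Ones' complement subtract: add the complement. */
static uint32_t csum_sub32(uint32_t a, uint32_t b)
{
	return csum_add32(a, ~b);
}

/* Fold a 32-bit sum down to 16 bits. */
static uint16_t fold16(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Load one native-endian 16-bit word from memory. */
static uint32_t word16(const unsigned char *p)
{
	uint16_t w;

	memcpy(&w, p, sizeof(w));
	return w;
}

/* Simplified partial checksum: sum 16-bit words; len must be even. */
static uint32_t partial(const unsigned char *p, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(len % 2 == 0);
	for (i = 0; i < len; i += 2)
		sum = csum_add32(sum, word16(p + i));
	return sum;
}

/* Old sequence: pseudo header subtracted, whole 40-byte header added. */
static uint16_t old_fixup(uint32_t hw, const unsigned char hdr[40])
{
	const unsigned char nh[2] = { 0, hdr[6] };	/* htons(nexthdr) */
	uint32_t pseudo = partial(hdr + 8, 32);		/* saddr + daddr */
	uint32_t csum;

	hw = csum_add32(hw, word16(nh));
	pseudo = csum_add32(pseudo, word16(hdr + 4));	/* payload_len */
	pseudo = csum_add32(pseudo, word16(nh));
	csum = csum_sub32(hw, pseudo);
	return fold16(csum_add32(csum, partial(hdr, 40)));
}

/* New sequence: add back the 6 bytes missing from hw_checksum. */
static uint16_t new_fixup(uint32_t hw, const unsigned char hdr[40])
{
	uint32_t first4;

	memcpy(&first4, hdr, sizeof(first4));	/* priority, version, flow_lbl */
	hw = csum_add32(hw, first4);
	return fold16(csum_add32(hw, word16(hdr + 6)));	/* nexthdr, hop_limit */
}
```

For any 40-byte header and a nonzero hw value, both helpers should fold to the
same 16-bit checksum, since the address, payload_len and htons(nexthdr) terms
cancel in ones' complement arithmetic.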
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
---
Note: This patch, like other mlx4 patches, can definitely wait for
Tariq's approval, thanks!
drivers/net/ethernet/mellanox/mlx4/en_rx.c | 21 ++++++++-------------
1 file changed, 8 insertions(+), 13 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 5c613c6663da51a4ae792eeb4d8956b54655786b..38c56fb6e5f5970f245dd56c38e1fc63a9349a07 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -593,30 +593,25 @@ static int get_fixed_ipv4_csum(__wsum hw_checksum, struct sk_buff *skb,
}
#if IS_ENABLED(CONFIG_IPV6)
-/* In IPv6 packets, besides subtracting the pseudo header checksum,
- * we also compute/add the IP header checksum which
- * is not added by the HW.
+/* In IPv6 packets, hw_checksum lacks 6 bytes from IPv6 header:
+ * 4 first bytes : priority, version, flow_lbl
+ * and 2 additional bytes : nexthdr, hop_limit.
*/
static int get_fixed_ipv6_csum(__wsum hw_checksum, struct sk_buff *skb,
struct ipv6hdr *ipv6h)
{
__u8 nexthdr = ipv6h->nexthdr;
- __wsum csum_pseudo_hdr = 0;
+ __wsum temp;
if (unlikely(nexthdr == IPPROTO_FRAGMENT ||
nexthdr == IPPROTO_HOPOPTS ||
nexthdr == IPPROTO_SCTP))
return -1;
- hw_checksum = csum_add(hw_checksum, (__force __wsum)htons(nexthdr));
- csum_pseudo_hdr = csum_partial(&ipv6h->saddr,
- sizeof(ipv6h->saddr) + sizeof(ipv6h->daddr), 0);
- csum_pseudo_hdr = csum_add(csum_pseudo_hdr, (__force __wsum)ipv6h->payload_len);
- csum_pseudo_hdr = csum_add(csum_pseudo_hdr,
- (__force __wsum)htons(nexthdr));
-
- skb->csum = csum_sub(hw_checksum, csum_pseudo_hdr);
- skb->csum = csum_add(skb->csum, csum_partial(ipv6h, sizeof(struct ipv6hdr), 0));
+ /* priority, version, flow_lbl */
+ temp = csum_add(hw_checksum, *(__wsum *)ipv6h);
+ /* nexthdr and hop_limit */
+ skb->csum = csum_add(temp, (__force __wsum)*(__be16 *)&ipv6h->nexthdr);
return 0;
}
#endif
--
2.17.0.484.g0c8726318c-goog
^ permalink raw reply related
* [PATCH v2] bpf, x86_32: add eBPF JIT compiler for ia32
From: Wang YanQing @ 2018-04-19 15:54 UTC (permalink / raw)
To: daniel
Cc: ast, illusionist.neo, tglx, mingo, hpa, davem, x86, netdev,
linux-kernel
The JIT compiler emits ia32 (32-bit x86) instructions. Currently, it supports
eBPF only; classic BPF is supported through the conversion done by the BPF core.
Almost all instructions from the eBPF ISA are supported, except the following:
BPF_ALU64 | BPF_DIV | BPF_K
BPF_ALU64 | BPF_DIV | BPF_X
BPF_ALU64 | BPF_MOD | BPF_K
BPF_ALU64 | BPF_MOD | BPF_X
BPF_STX | BPF_XADD | BPF_W
BPF_STX | BPF_XADD | BPF_DW
It doesn't support BPF_JMP|BPF_CALL with BPF_PSEUDO_CALL either.
ia32 has few general purpose registers (EAX, EDX, ECX, EBX, ESI, EDI),
and even these six can't all be treated as real general purpose
registers in the JIT:
MUL instructions need EAX:EDX, and shift instructions need ECX.
So I decided to use the stack to emulate all eBPF 64-bit registers. This
simplifies the implementation a lot, because we don't need to deal with the
flexible memory addressing modes on ia32. For example, we don't need to
write the code pattern below for a single BPF_ADD instruction:
if (src is a register && dst is a register)
{
//one instruction encoding for ADD instruction
} else if (only src is a register)
{
//another different instruction encoding for ADD instruction
} else if (only dst is a register)
{
//another different instruction encoding for ADD instruction
} else
{
//src and dst are all on stack.
//move src or dst to temporary registers
}
If the if-else chain above doesn't look painful enough, think about
it for the BPF_ALU64|BPF_*SHIFT* instructions, which need many
native instructions to emulate.
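To illustrate the stack emulation (a userspace sketch only, with hypothetical
names that do not appear in the patch): each eBPF register is kept as a pair
of 32-bit slots, a 64-bit add becomes an add on the low half followed by an
add-with-carry on the high half, and a 32-bit ALU operation zeroes the high
half of the destination:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One emulated eBPF register: two 32-bit stack slots (lo, hi). */
struct reg64 {
	uint32_t lo;
	uint32_t hi;
};

/* dst += src, done as a 32-bit add followed by an add-with-carry. */
static void add64(struct reg64 *dst, const struct reg64 *src)
{
	uint32_t lo = dst->lo + src->lo;	/* like "add" on the low half */
	uint32_t carry = lo < src->lo;		/* carry flag of that add */

	dst->lo = lo;
	dst->hi = dst->hi + src->hi + carry;	/* like "adc" on the high half */
}

/* 32-bit ALU result: the high half of the destination is zeroed. */
static void add32(struct reg64 *dst, const struct reg64 *src)
{
	dst->lo += src->lo;
	dst->hi = 0;
}

/* Helper for checking results against native 64-bit arithmetic. */
static uint64_t reg64_value(const struct reg64 *r)
{
	assert(r != NULL);
	return ((uint64_t)r->hi << 32) | r->lo;
}
```

Because both operands always live in memory slots, there is a single code
path per operation, which is the simplification argued for above.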
Tested on my PC (Intel(R) Core(TM) i5-5200U CPU) and virtualbox.
Testing results on i5-5200U:
1) test_bpf: Summary: 349 PASSED, 0 FAILED, [319/341 JIT'ed]
2) test_progs: Summary: 81 PASSED, 2 FAILED.
test_progs reports "libbpf: incorrect bpf_call opcode" for
test_l4lb_noinline and test_xdp_noinline, because there is
no llvm-6.0 on my machine and the current implementation doesn't
support BPF_PSEUDO_CALL, so I think we can ignore these two failed
testcases.
3) test_lpm: OK
4) test_lru_map: OK
5) test_verifier: Summary: 823 PASSED, 5 FAILED
test_verifier reports "invalid bpf_context access off=68 size=1/2/4/8"
for all 5 FAILED testcases, with or without the JIT, so the failing
testcases themselves need to be fixed rather than this JIT.
All of the above tests were run separately with each of the following flag settings:
1:bpf_jit_enable=1 and bpf_jit_harden=0
2:bpf_jit_enable=1 and bpf_jit_harden=2
Below are some numbers for this JIT implementation:
Note:
I ran test_progs in kselftest 100 times in a row for every testcase;
the numbers are in the format total/times=avg.
The numbers that test_bpf reports show almost the same relation.
a:jit_enable=0 and jit_harden=0
test_pkt_access:PASS:ipv4:15622/100=156
test_pkt_access:PASS:ipv6:9130/100=91
test_xdp:PASS:ipv4:240198/100=2401
test_xdp:PASS:ipv6:137326/100=1373
test_l4lb:PASS:ipv4:61100/100=611
test_l4lb:PASS:ipv6:101000/100=1010
b:jit_enable=1 and jit_harden=0
test_pkt_access:PASS:ipv4:10057/100=100
test_pkt_access:PASS:ipv6:5055/100=50
test_xdp:PASS:ipv4:145945/100=1459
test_xdp:PASS:ipv6:67337/100=673
test_l4lb:PASS:ipv4:38137/100=381
test_l4lb:PASS:ipv6:57779/100=577
c:jit_enable=1 and jit_harden=2
test_pkt_access:PASS:ipv4:12650/100=126
test_pkt_access:PASS:ipv6:7074/100=70
test_xdp:PASS:ipv4:147211/100=1472
test_xdp:PASS:ipv6:85783/100=857
test_l4lb:PASS:ipv4:53222/100=532
test_l4lb:PASS:ipv6:76322/100=763
Yes, the numbers look good when jit_harden is turned off. If we want to speed
up the jit_harden case, we need to move BPF_REG_AX to a *real* register instead
of stack emulation, but then we have to face all the pain I described above. We
can do that as a next step.
See Documentation/networking/filter.txt for more information.
Signed-off-by: Wang YanQing <udknight@gmail.com>
---
Changes v1-v2:
1:Fix bug in emit_ia32_neg64.
2:Fix bug in emit_ia32_arsh_r64.
3:Delete filename in top level comment, suggested by Thomas Gleixner.
4:Delete unnecessary boilerplate text, suggested by Thomas Gleixner.
5:Rewrite some words in changelog.
6:Coding style improvements and a few more comments.
Thanks.
arch/x86/Kconfig | 2 +-
arch/x86/include/asm/nospec-branch.h | 26 +-
arch/x86/net/Makefile | 10 +-
arch/x86/net/bpf_jit32.S | 143 +++
arch/x86/net/bpf_jit_comp32.c | 2258 ++++++++++++++++++++++++++++++++++
5 files changed, 2434 insertions(+), 5 deletions(-)
create mode 100644 arch/x86/net/bpf_jit32.S
create mode 100644 arch/x86/net/bpf_jit_comp32.c
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 00fcf81..1f5fa2f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -137,7 +137,7 @@ config X86
select HAVE_DMA_CONTIGUOUS
select HAVE_DYNAMIC_FTRACE
select HAVE_DYNAMIC_FTRACE_WITH_REGS
- select HAVE_EBPF_JIT if X86_64
+ select HAVE_EBPF_JIT
select HAVE_EFFICIENT_UNALIGNED_ACCESS
select HAVE_EXIT_THREAD
select HAVE_FENTRY if X86_64 || DYNAMIC_FTRACE
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index f928ad9..a4c7ca4 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -291,14 +291,17 @@ static inline void indirect_branch_prediction_barrier(void)
* lfence
* jmp spec_trap
* do_rop:
- * mov %rax,(%rsp)
+ * mov %rax,(%rsp) for x86_64
+ * mov %edx,(%esp) for x86_32
* retq
*
* Without retpolines configured:
*
- * jmp *%rax
+ * jmp *%rax for x86_64
+ * jmp *%edx for x86_32
*/
#ifdef CONFIG_RETPOLINE
+#ifdef CONFIG_X86_64
# define RETPOLINE_RAX_BPF_JIT_SIZE 17
# define RETPOLINE_RAX_BPF_JIT() \
EMIT1_off32(0xE8, 7); /* callq do_rop */ \
@@ -310,9 +313,28 @@ static inline void indirect_branch_prediction_barrier(void)
EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */ \
EMIT1(0xC3); /* retq */
#else
+# define RETPOLINE_EDX_BPF_JIT() \
+do { \
+ EMIT1_off32(0xE8, 7); /* call do_rop */ \
+ /* spec_trap: */ \
+ EMIT2(0xF3, 0x90); /* pause */ \
+ EMIT3(0x0F, 0xAE, 0xE8); /* lfence */ \
+ EMIT2(0xEB, 0xF9); /* jmp spec_trap */ \
+ /* do_rop: */ \
+ EMIT3(0x89, 0x14, 0x24); /* mov %edx,(%esp) */ \
+ EMIT1(0xC3); /* ret */ \
+} while (0)
+#endif
+#else /* !CONFIG_RETPOLINE */
+
+#ifdef CONFIG_X86_64
# define RETPOLINE_RAX_BPF_JIT_SIZE 2
# define RETPOLINE_RAX_BPF_JIT() \
EMIT2(0xFF, 0xE0); /* jmp *%rax */
+#else
+# define RETPOLINE_EDX_BPF_JIT() \
+ EMIT2(0xFF, 0xE2) /* jmp *%edx */
+#endif
#endif
#endif /* _ASM_X86_NOSPEC_BRANCH_H_ */
diff --git a/arch/x86/net/Makefile b/arch/x86/net/Makefile
index fefb4b6..adcadc6 100644
--- a/arch/x86/net/Makefile
+++ b/arch/x86/net/Makefile
@@ -1,6 +1,12 @@
#
# Arch-specific network modules
#
-OBJECT_FILES_NON_STANDARD_bpf_jit.o += y
-obj-$(CONFIG_BPF_JIT) += bpf_jit.o bpf_jit_comp.o
+
+ifeq ($(CONFIG_X86_32),y)
+ OBJECT_FILES_NON_STANDARD_bpf_jit32.o += y
+ obj-$(CONFIG_BPF_JIT) += bpf_jit32.o bpf_jit_comp32.o
+else
+ OBJECT_FILES_NON_STANDARD_bpf_jit.o += y
+ obj-$(CONFIG_BPF_JIT) += bpf_jit.o bpf_jit_comp.o
+endif
diff --git a/arch/x86/net/bpf_jit32.S b/arch/x86/net/bpf_jit32.S
new file mode 100644
index 0000000..6525262
--- /dev/null
+++ b/arch/x86/net/bpf_jit32.S
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* BPF JIT helper functions
+ *
+ * Author: Wang YanQing (udknight@gmail.com)
+ * The code based on code and ideas from:
+ * Eric Dumazet (eric.dumazet@gmail.com)
+ */
+#include <linux/linkage.h>
+#include <asm/frame.h>
+
+/*
+ * Calling convention :
+ * eax : skb pointer (caller-saved)
+ * edx : offset of byte(s) to fetch in skb (caller-saved)
+ * esi : copy of skb->data (callee-saved)
+ * edi : hlen = skb->len - skb->data_len (callee-saved)
+ *
+ * We don't need to push/pop eax,edx,ecx before calling kernel function,
+ * because jit always prepare eax,edx before calling helper functions,
+ * and jit uses ecx as a temporary register.
+ */
+#define SKBDATA %esi
+#define SKF_MAX_NEG_OFF $(-0x200000) /* SKF_LL_OFF from filter.h */
+
+#define FUNC(name) \
+ .globl name; \
+ .type name, @function; \
+ name:
+
+FUNC(sk_load_word)
+ test %edx,%edx
+ js bpf_slow_path_word_neg
+
+FUNC(sk_load_word_positive_offset)
+ mov %edi,%ecx # hlen
+ sub %edx,%ecx # hlen - offset
+ cmp $3,%ecx
+ jle bpf_slow_path_word
+ mov (SKBDATA,%edx),%eax
+ bswap %eax /* ntohl() */
+ ret
+
+FUNC(sk_load_half)
+ test %edx,%edx
+ js bpf_slow_path_half_neg
+
+FUNC(sk_load_half_positive_offset)
+ mov %edi,%ecx
+ sub %edx,%ecx # hlen - offset
+ cmp $1,%ecx
+ jle bpf_slow_path_half
+ movzwl (SKBDATA,%edx),%eax
+ rol $8,%ax # ntohs()
+ ret
+
+FUNC(sk_load_byte)
+ test %edx,%edx
+ js bpf_slow_path_byte_neg
+
+FUNC(sk_load_byte_positive_offset)
+ cmp %edx,%edi /* if (offset >= hlen) goto bpf_slow_path_byte */
+ jle bpf_slow_path_byte
+ movzbl (SKBDATA,%edx),%eax
+ ret
+
+#define bpf_slow_path_common(LEN) \
+ lea 104(%ebp), %ecx; \
+ FRAME_BEGIN; \
+ push $LEN; \
+ call skb_copy_bits; \
+ add $4,%esp; \
+ test %eax,%eax; \
+ FRAME_END
+
+
+bpf_slow_path_word:
+ bpf_slow_path_common(4)
+ js bpf_error
+ mov 104(%ebp),%eax
+ bswap %eax
+ ret
+
+bpf_slow_path_half:
+ bpf_slow_path_common(2)
+ js bpf_error
+ mov 104(%ebp),%ax
+ rol $8,%ax
+ movzwl %ax,%eax
+ ret
+
+bpf_slow_path_byte:
+ bpf_slow_path_common(1)
+ js bpf_error
+ movzbl 104(%ebp),%eax
+ ret
+
+#define sk_negative_common(SIZE) \
+ FRAME_BEGIN; \
+ mov $SIZE,%ecx; /* size */ \
+ call bpf_internal_load_pointer_neg_helper; \
+ test %eax,%eax; \
+ FRAME_END; \
+ jz bpf_error
+
+bpf_slow_path_word_neg:
+ cmp SKF_MAX_NEG_OFF, %edx /* test range */
+ jl bpf_error /* offset lower -> error */
+
+FUNC(sk_load_word_negative_offset)
+ sk_negative_common(4)
+ mov (%eax), %eax
+ bswap %eax
+ ret
+
+bpf_slow_path_half_neg:
+ cmp SKF_MAX_NEG_OFF, %edx
+ jl bpf_error
+
+FUNC(sk_load_half_negative_offset)
+ sk_negative_common(2)
+ mov (%eax),%ax
+ rol $8,%ax
+ movzwl %ax,%eax
+ ret
+
+bpf_slow_path_byte_neg:
+ cmp SKF_MAX_NEG_OFF, %edx
+ jl bpf_error
+
+FUNC(sk_load_byte_negative_offset)
+ sk_negative_common(1)
+ movzbl (%eax), %eax
+ ret
+
+bpf_error:
+# force a return 0 from jit handler
+ xor %eax,%eax
+ mov 108(%ebp),%ebx
+ mov 112(%ebp),%esi
+ mov 116(%ebp),%edi
+ add $120, %ebp
+ leave
+ ret
diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
new file mode 100644
index 0000000..a02e9a8
--- /dev/null
+++ b/arch/x86/net/bpf_jit_comp32.c
@@ -0,0 +1,2258 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Just-In-Time compiler for eBPF filters on IA32 (32bit x86)
+ *
+ * Author: Wang YanQing (udknight@gmail.com)
+ * The code based on code and ideas from:
+ * Eric Dumazet (eric.dumazet@gmail.com)
+ * and from:
+ * Shubham Bansal <illusionist.neo@gmail.com>
+ */
+
+#include <linux/netdevice.h>
+#include <linux/filter.h>
+#include <linux/if_vlan.h>
+#include <asm/cacheflush.h>
+#include <asm/set_memory.h>
+#include <asm/nospec-branch.h>
+#include <linux/bpf.h>
+
+/*
+ * eBPF prog stack layout:
+ *
+ * high
+ * original ESP => +-----+
+ * | | callee saved registers
+ * +-----+
+ * | ... | eBPF JIT scratch space
+ * BPF_FP,IA32_EBP => +-----+
+ * | ... | eBPF prog stack
+ * +-----+
+ * |RSVD | JIT scratchpad
+ * current ESP => +-----+
+ * | |
+ * | ... | Function call stack
+ * | |
+ * +-----+
+ * low
+ *
+ * The callee saved registers:
+ *
+ * high
+ * original ESP => +------------------+ \
+ * | ebp | |
+ * current EBP => +------------------+ } callee saved registers
+ * | ebx,esi,edi | |
+ * +------------------+ /
+ * low
+ */
+
+/*
+ * assembly code in arch/x86/net/bpf_jit32.S
+ */
+extern u8 sk_load_word[], sk_load_half[], sk_load_byte[];
+extern u8 sk_load_word_positive_offset[], sk_load_half_positive_offset[];
+extern u8 sk_load_byte_positive_offset[];
+extern u8 sk_load_word_negative_offset[], sk_load_half_negative_offset[];
+extern u8 sk_load_byte_negative_offset[];
+
+static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
+{
+ if (len == 1)
+ *ptr = bytes;
+ else if (len == 2)
+ *(u16 *)ptr = bytes;
+ else {
+ *(u32 *)ptr = bytes;
+ barrier();
+ }
+ return ptr + len;
+}
+
+#define EMIT(bytes, len) \
+ do { prog = emit_code(prog, bytes, len); cnt += len; } while (0)
+
+#define EMIT1(b1) EMIT(b1, 1)
+#define EMIT2(b1, b2) EMIT((b1) + ((b2) << 8), 2)
+#define EMIT3(b1, b2, b3) EMIT((b1) + ((b2) << 8) + ((b3) << 16), 3)
+#define EMIT4(b1, b2, b3, b4) \
+ EMIT((b1) + ((b2) << 8) + ((b3) << 16) + ((b4) << 24), 4)
+
+#define EMIT1_off32(b1, off) \
+ do {EMIT1(b1); EMIT(off, 4); } while (0)
+#define EMIT2_off32(b1, b2, off) \
+ do {EMIT2(b1, b2); EMIT(off, 4); } while (0)
+#define EMIT3_off32(b1, b2, b3, off) \
+ do {EMIT3(b1, b2, b3); EMIT(off, 4); } while (0)
+#define EMIT4_off32(b1, b2, b3, b4, off) \
+ do {EMIT4(b1, b2, b3, b4); EMIT(off, 4); } while (0)
+
+#define jmp_label(label, jmp_insn_len) (label - cnt - jmp_insn_len)
+
+static bool is_imm8(int value)
+{
+ return value <= 127 && value >= -128;
+}
+
+static bool is_simm32(s64 value)
+{
+ return value == (s64) (s32) value;
+}
+
+#define STACK_OFFSET(k) (k)
+#define TCALL_CNT (MAX_BPF_JIT_REG + 0) /* Tail Call Count */
+#define TMP_REG_1 (MAX_BPF_JIT_REG + 1) /* TEMP Register 1 */
+#define TMP_REG_2 (MAX_BPF_JIT_REG + 2) /* TEMP Register 2 */
+
+#define IA32_EAX (0x0)
+#define IA32_EBX (0x3)
+#define IA32_ECX (0x1)
+#define IA32_EDX (0x2)
+#define IA32_ESI (0x6)
+#define IA32_EDI (0x7)
+#define IA32_EBP (0x5)
+#define IA32_ESP (0x4)
+
+/* list of x86 cond jumps opcodes (. + s8)
+ * Add 0x10 (and an extra 0x0f) to generate far jumps (. + s32)
+ */
+#define IA32_JB 0x72
+#define IA32_JAE 0x73
+#define IA32_JE 0x74
+#define IA32_JNE 0x75
+#define IA32_JBE 0x76
+#define IA32_JA 0x77
+#define IA32_JL 0x7C
+#define IA32_JGE 0x7D
+#define IA32_JLE 0x7E
+#define IA32_JG 0x7F
+
+/*
+ * Map eBPF registers to x86_32 32bit registers or stack scratch space.
+ *
+ * 1. All the registers, R0-R10, are mapped to scratch space on stack.
+ * 2. We need two 64 bit temp registers to do complex operations on eBPF
+ * registers.
+ *
+ * As the eBPF registers are all 64 bit registers and x86_32 has only 32 bit
+ * registers, we have to map each eBPF registers with two x86_32 32 bit regs
+ * or scratch memory space and we have to build eBPF 64 bit register from those.
+ *
+ */
+static const u8 bpf2ia32[][2] = {
+ /* return value from in-kernel function, and exit value from eBPF */
+ [BPF_REG_0] = {STACK_OFFSET(0), STACK_OFFSET(4)},
+
+ /* arguments from eBPF program to in-kernel function */
+ /* Stored on stack scratch space */
+ [BPF_REG_1] = {STACK_OFFSET(8), STACK_OFFSET(12)},
+ [BPF_REG_2] = {STACK_OFFSET(16), STACK_OFFSET(20)},
+ [BPF_REG_3] = {STACK_OFFSET(24), STACK_OFFSET(28)},
+ [BPF_REG_4] = {STACK_OFFSET(32), STACK_OFFSET(36)},
+ [BPF_REG_5] = {STACK_OFFSET(40), STACK_OFFSET(44)},
+
+ /* callee saved registers that in-kernel function will preserve */
+ /* Stored on stack scratch space */
+ [BPF_REG_6] = {STACK_OFFSET(48), STACK_OFFSET(52)},
+ [BPF_REG_7] = {STACK_OFFSET(56), STACK_OFFSET(60)},
+ [BPF_REG_8] = {STACK_OFFSET(64), STACK_OFFSET(68)},
+ [BPF_REG_9] = {STACK_OFFSET(72), STACK_OFFSET(76)},
+
+ /* Read only Frame Pointer to access Stack */
+ [BPF_REG_FP] = {STACK_OFFSET(80), STACK_OFFSET(84)},
+
+ /* temporary register for blinding constants.
+ * Stored on stack scratch space.
+ */
+ [BPF_REG_AX] = {STACK_OFFSET(88), STACK_OFFSET(92)},
+
+ /* Tail call count. Stored on stack scratch space. */
+ [TCALL_CNT] = {STACK_OFFSET(96), STACK_OFFSET(100)},
+
+ /* Temporary Register for internal BPF JIT, can be used
+ * as temporary storage in operations.
+ */
+ [TMP_REG_1] = {IA32_ESI, IA32_EDI},
+ [TMP_REG_2] = {IA32_EAX, IA32_EDX},
+};
+
+#define dst_lo dst[0]
+#define dst_hi dst[1]
+#define src_lo src[0]
+#define src_hi src[1]
+
+#define STACK_ALIGNMENT 8
+/* Stack space for BPF_REG_1, BPF_REG_2, BPF_REG_3, BPF_REG_4,
+ * BPF_REG_5, BPF_REG_6, BPF_REG_7, BPF_REG_8, BPF_REG_9,
+ * BPF_REG_FP, BPF_REG_AX and Tail call counts.
+ */
+#define SCRATCH_SIZE 104
+
+/* total stack size used in JITed code */
+#define _STACK_SIZE \
+ (stack_depth + \
+ + SCRATCH_SIZE + \
+ + 4 /* extra for skb_copy_bits buffer */)
+
+#define STACK_SIZE ALIGN(_STACK_SIZE, STACK_ALIGNMENT)
+
+/* Get the offset of eBPF REGISTERs stored on scratch space. */
+#define STACK_VAR(off) (off)
+
+/* Offset of skb_copy_bits buffer */
+#define SKB_BUFFER STACK_VAR(SCRATCH_SIZE)
+
+/* encode 'dst_reg' register into x86_32 opcode 'byte' */
+static u8 add_1reg(u8 byte, u32 dst_reg)
+{
+ return byte + dst_reg;
+}
+
+/* encode 'dst_reg' and 'src_reg' registers into x86_32 opcode 'byte' */
+static u8 add_2reg(u8 byte, u32 dst_reg, u32 src_reg)
+{
+ return byte + dst_reg + (src_reg << 3);
+}
+
+static void jit_fill_hole(void *area, unsigned int size)
+{
+ /* fill whole space with int3 instructions */
+ memset(area, 0xcc, size);
+}
+
+static inline void emit_ia32_mov_i(const u8 dst, const u32 val,
+ u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+
+ EMIT3_off32(0xC7, add_1reg(0x40, IA32_EBP), STACK_VAR(dst), val);
+
+ *pprog = prog;
+}
+
+/* dst = src (4 bytes) */
+static inline void emit_ia32_mov_r(const u8 dst, const u8 src, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ const u8 *tmp = bpf2ia32[TMP_REG_1];
+
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(src));
+ /* mov dword ptr [ebp+off],esi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(dst));
+
+ *pprog = prog;
+}
+
+/* dst = src */
+static inline void emit_ia32_mov_r64(const bool is64, const u8 dst[],
+ const u8 src[], u8 **pprog)
+{
+ emit_ia32_mov_r(dst_lo, src_lo, pprog);
+ if (is64)
+ /* complete 8 byte move */
+ emit_ia32_mov_r(dst_hi, src_hi, pprog);
+ else
+ /* Zero out high 4 bytes */
+ emit_ia32_mov_i(dst_hi, 0, pprog);
+}
+
+/* Sign extended move */
+static inline void emit_ia32_mov_i64(const bool is64, const u8 dst[],
+ const u32 val, u8 **pprog)
+{
+ u32 hi = 0;
+
+ if (is64 && (val & (1<<31)))
+ hi = (u32)~0;
+
+ emit_ia32_mov_i(dst_lo, val, pprog);
+ emit_ia32_mov_i(dst_hi, hi, pprog);
+}
+
+/* ALU operation (32 bit)
+ * dst = dst (op) src
+ */
+static inline void emit_ia32_alu_r(const bool is64, const bool hi, const u8 op,
+ const u8 dst, const u8 src, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ const u8 *tmp = bpf2ia32[TMP_REG_1];
+
+ switch (BPF_OP(op)) {
+ /* dst = dst + src */
+ case BPF_ADD: {
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(src));
+
+ if (hi && is64)
+ EMIT3(0x11, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst));
+ else
+ EMIT3(0x01, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst));
+ break;
+ }
+ /* dst = dst - src */
+ case BPF_SUB: {
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(src));
+
+ if (hi && is64)
+ EMIT3(0x19, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst));
+ else
+ EMIT3(0x29, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst));
+ break;
+ }
+ /* dst = dst | src */
+ case BPF_OR: {
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(src));
+ EMIT3(0x09, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(dst));
+ break;
+ }
+ /* dst = dst & src */
+ case BPF_AND: {
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(src));
+ EMIT3(0x21, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(dst));
+ break;
+ }
+ /* dst = dst ^ src */
+ case BPF_XOR: {
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(src));
+ EMIT3(0x31, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(dst));
+ break;
+ }
+ /* dst = dst * src */
+ case BPF_MUL: {
+ const u8 *tmp2 = bpf2ia32[TMP_REG_2];
+
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp2[0]), STACK_VAR(dst));
+ EMIT3(0xF7, add_1reg(0x60, IA32_EBP), STACK_VAR(src));
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp2[0]), STACK_VAR(dst));
+ break;
+ }
+ /* dst = dst << src */
+ case BPF_LSH: {
+ /* mov ecx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), STACK_VAR(src));
+ EMIT3(0xD3, add_1reg(0x60, IA32_EBP), STACK_VAR(dst));
+ break;
+ }
+ /* dst = dst >> src */
+ case BPF_RSH: {
+ /* mov ecx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), STACK_VAR(src));
+ EMIT3(0xD3, add_1reg(0x68, IA32_EBP), STACK_VAR(dst));
+ break;
+ }
+ /* dst = dst >> src (signed)*/
+ case BPF_ARSH:
+ /* mov ecx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), STACK_VAR(src));
+ EMIT3(0xD3, add_1reg(0x78, IA32_EBP), STACK_VAR(dst));
+ break;
+ }
+ *pprog = prog;
+}
+
+/* ALU operation (64 bit) */
+static inline void emit_ia32_alu_r64(const bool is64, const u8 op,
+ const u8 dst[], const u8 src[],
+ u8 **pprog)
+{
+ u8 *prog = *pprog;
+
+ emit_ia32_alu_r(is64, false, op, dst_lo, src_lo, &prog);
+ if (is64)
+ emit_ia32_alu_r(is64, true, op, dst_hi, src_hi, &prog);
+ else
+ emit_ia32_mov_i(dst_hi, 0, &prog);
+ *pprog = prog;
+}
+
+/* ALU operation (32 bit)
+ * dst = dst (op) val
+ */
+static inline void emit_ia32_alu_i(const bool is64, const bool hi, const u8 op,
+ const u8 dst, const s32 val, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ const u8 *tmp = bpf2ia32[TMP_REG_1];
+
+ switch (op) {
+ /* dst = dst + val */
+ case BPF_ADD: {
+ if (hi && is64) {
+ if (is_imm8(val)) {
+ EMIT3(0x83, add_1reg(0x50, IA32_EBP),
+ STACK_VAR(dst));
+ EMIT(val, 1);
+ } else {
+ /* mov esi,imm32*/
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), val);
+ EMIT3(0x11, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst));
+ }
+ } else {
+ if (is_imm8(val)) {
+ EMIT4(0x83, add_1reg(0x40, IA32_EBP),
+ STACK_VAR(dst), val);
+ } else {
+ /* mov esi,imm32*/
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), val);
+ EMIT3(0x01, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst));
+ }
+ }
+ break;
+ }
+ /* dst = dst - val */
+ case BPF_SUB: {
+ if (hi && is64) {
+ if (is_imm8(val)) {
+ EMIT4(0x83, add_1reg(0x58, IA32_EBP),
+ STACK_VAR(dst), val);
+ } else {
+ /* mov esi,imm32*/
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), val);
+ EMIT3(0x19, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst));
+ }
+ } else {
+ if (is_imm8(val)) {
+ EMIT4(0x83, add_1reg(0x68, IA32_EBP),
+ STACK_VAR(dst), val);
+ } else {
+ /* mov esi,imm32*/
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), val);
+ EMIT3(0x29, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst));
+ }
+ }
+ break;
+ }
+ /* dst = dst | val */
+ case BPF_OR: {
+ if (is_imm8(val)) {
+ EMIT4(0x83, add_1reg(0x48, IA32_EBP), STACK_VAR(dst),
+ val);
+ } else {
+ /* mov esi,imm32*/
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), val);
+ EMIT3(0x09, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst));
+ }
+ break;
+ }
+ /* dst = dst & val */
+ case BPF_AND: {
+ if (is_imm8(val)) {
+ EMIT4(0x83, add_1reg(0x60, IA32_EBP), STACK_VAR(dst),
+ val);
+ } else {
+ /* mov esi,imm32*/
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), val);
+ EMIT3(0x21, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst));
+ }
+ break;
+ }
+ /* dst = dst ^ val */
+ case BPF_XOR: {
+ if (is_imm8(val)) {
+ EMIT4(0x83, add_1reg(0x70, IA32_EBP),
+ STACK_VAR(dst), val);
+ } else {
+ /* mov esi,imm32*/
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), val);
+ EMIT3(0x31, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst));
+ }
+ break;
+ }
+ /* dst = dst * val */
+ case BPF_MUL: {
+ const u8 *tmp2 = bpf2ia32[TMP_REG_2];
+
+ /* mov eax,val */
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp2[0]), val);
+ EMIT3(0xF7, add_1reg(0x60, IA32_EBP), STACK_VAR(dst));
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp2[0]), STACK_VAR(dst));
+ break;
+ }
+ /* dst = dst << val */
+ case BPF_LSH: {
+ if (is_imm8(val)) {
+ EMIT4(0xC1, add_1reg(0x60, IA32_EBP), STACK_VAR(dst),
+ val);
+ } else {
+ /* mov esi,imm32*/
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), val);
+ EMIT3(0xD3, add_1reg(0x60, IA32_EBP), STACK_VAR(dst));
+ }
+ break;
+ }
+ /* dst = dst >> val */
+ case BPF_RSH: {
+ if (is_imm8(val)) {
+ EMIT4(0xC1, add_1reg(0x68, IA32_EBP), STACK_VAR(dst),
+ val);
+ } else {
+ /* mov esi,imm32*/
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), val);
+ EMIT3(0xD3, add_1reg(0x68, IA32_EBP), STACK_VAR(dst));
+ }
+ break;
+ }
+ /* dst = dst >> val (signed)*/
+ case BPF_ARSH:
+ if (is_imm8(val)) {
+ EMIT4(0xC1, add_1reg(0x78, IA32_EBP), STACK_VAR(dst),
+ val);
+ } else {
+ /* mov esi,imm32*/
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), val);
+ EMIT3(0xD3, add_1reg(0x78, IA32_EBP), STACK_VAR(dst));
+ }
+ break;
+ case BPF_NEG:
+ /* xor esi,esi */
+ EMIT2(0x31, add_2reg(0xC0, tmp[0], tmp[0]));
+ EMIT3(0x2B, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(dst));
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(dst));
+ break;
+ }
+
+ *pprog = prog;
+}
+
+/* ALU operation (64 bit) */
+static inline void emit_ia32_alu_i64(const bool is64, const u8 op,
+ const u8 dst[], const u32 val,
+ u8 **pprog)
+{
+ u8 *prog = *pprog;
+ u32 hi = 0;
+
+ if (is64 && (val & (1<<31)))
+ hi = (u32)~0;
+
+ emit_ia32_alu_i(is64, false, op, dst_lo, val, &prog);
+ if (is64)
+ emit_ia32_alu_i(is64, true, op, dst_hi, hi, &prog);
+ else
+ emit_ia32_mov_i(dst_hi, 0, &prog);
+
+ *pprog = prog;
+}
+
+/* dst = -dst (64 bit) */
+static inline void emit_ia32_neg64(const u8 dst[], u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ const u8 *tmp = bpf2ia32[TMP_REG_1];
+
+ /* xor esi,esi */
+ EMIT2(0x31, add_2reg(0xC0, tmp[0], tmp[0]));
+ /* sub esi,dword ptr [ebp+off] */
+ EMIT3(0x2B, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(dst_lo));
+ /* mov dword ptr [ebp+off],esi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(dst_lo));
+
+ /* mov esi,0 (xor would clear CF and lose the borrow from the low half) */
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), 0);
+ /* sbb esi,dword ptr [ebp+off] */
+ EMIT3(0x1B, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(dst_hi));
+ /* mov dword ptr [ebp+off],esi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(dst_hi));
+
+ *pprog = prog;
+}
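As a cross-check for the two-half negation, the intended arithmetic can be written as a small C reference. This is a sketch only; neg64_halves() is a hypothetical helper, not something the patch defines. The explicit borrow variable stands in for the carry flag that the sub/sbb pair relies on.

```c
#include <stdint.h>

/* Hypothetical reference: negate a 64-bit value held as two 32-bit halves. */
static void neg64_halves(uint32_t *lo, uint32_t *hi)
{
	/* Computing 0 - lo borrows from the high half unless lo is zero. */
	uint32_t borrow = (*lo != 0);

	*lo = 0u - *lo;
	*hi = 0u - *hi - borrow;
}
```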
+
+/* dst = dst << src */
+static inline void emit_ia32_lsh_r64(const u8 dst[], const u8 src[], u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ const u8 *tmp = bpf2ia32[TMP_REG_1];
+ static int jmp_label1 = -1;
+ static int jmp_label2 = -1;
+ static int jmp_label3 = -1;
+
+ /* mov ecx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), STACK_VAR(src_lo));
+
+ /* cmp ecx,32 */
+ EMIT3(0x83, add_1reg(0xF8, IA32_ECX), 32);
+ /* jumps when >= 32 */
+ if (is_imm8(jmp_label(jmp_label1, 2)))
+ EMIT2(IA32_JAE, jmp_label(jmp_label1, 2));
+ else
+ EMIT2_off32(0x0F, IA32_JAE + 0x10, jmp_label(jmp_label1, 6));
+
+ /* < 32 */
+ /* mov esi,32 */
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), 32);
+ /* sub esi,ecx */
+ EMIT2(0x29, add_2reg(0xC0, tmp[0], IA32_ECX));
+
+ /* shl dword ptr [ebp+off],cl */
+ EMIT3(0xD3, add_1reg(0x60, IA32_EBP), STACK_VAR(dst_hi));
+ /* mov edi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[1]), STACK_VAR(dst_lo));
+ /* shl dword ptr [ebp+off],cl */
+ EMIT3(0xD3, add_1reg(0x60, IA32_EBP), STACK_VAR(dst_lo));
+
+ /* mov ecx,esi */
+ EMIT2(0x89, add_2reg(0xC0, IA32_ECX, tmp[0]));
+ /* shr edi,cl */
+ EMIT2(0xD3, add_1reg(0xE8, tmp[1]));
+ /* or dword ptr [ebp+off],edi */
+ EMIT3(0x09, add_2reg(0x40, IA32_EBP, tmp[1]), STACK_VAR(dst_hi));
+
+ /* goto out; */
+ if (is_imm8(jmp_label(jmp_label3, 2)))
+ EMIT2(0xEB, jmp_label(jmp_label3, 2));
+ else
+ EMIT1_off32(0xE9, jmp_label(jmp_label3, 5));
+
+ /* >= 32 */
+ if (jmp_label1 == -1)
+ jmp_label1 = cnt;
+
+ /* cmp ecx,64 */
+ EMIT3(0x83, add_1reg(0xF8, IA32_ECX), 64);
+ /* jumps when >= 64 */
+ if (is_imm8(jmp_label(jmp_label2, 2)))
+ EMIT2(IA32_JAE, jmp_label(jmp_label2, 2));
+ else
+ EMIT2_off32(0x0F, IA32_JAE + 0x10, jmp_label(jmp_label2, 6));
+
+ /* >= 32 && < 64 */
+ /* sub ecx,32 */
+ EMIT3(0x83, add_1reg(0xE8, IA32_ECX), 32);
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(dst_lo));
+ /* shl esi,cl */
+ EMIT2(0xD3, add_1reg(0xE0, tmp[0]));
+ /* mov dword ptr [ebp+off],esi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(dst_hi));
+
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP), STACK_VAR(dst_lo));
+ EMIT(0x0, 4);
+
+ /* goto out; */
+ if (is_imm8(jmp_label(jmp_label3, 2)))
+ EMIT2(0xEB, jmp_label(jmp_label3, 2));
+ else
+ EMIT1_off32(0xE9, jmp_label(jmp_label3, 5));
+
+ /* >= 64 */
+ if (jmp_label2 == -1)
+ jmp_label2 = cnt;
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP), STACK_VAR(dst_hi));
+ EMIT(0x0, 4);
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP), STACK_VAR(dst_lo));
+ EMIT(0x0, 4);
+
+ if (jmp_label3 == -1)
+ jmp_label3 = cnt;
+
+ /* out: */
+ *pprog = prog;
+}
+
+/* dst = dst >> src (signed)*/
+static inline void emit_ia32_arsh_r64(const u8 dst[], const u8 src[],
+ u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ const u8 *tmp = bpf2ia32[TMP_REG_1];
+ static int jmp_label1 = -1;
+ static int jmp_label2 = -1;
+ static int jmp_label3 = -1;
+
+ /* mov ecx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), STACK_VAR(src_lo));
+
+ /* cmp ecx,32 */
+ EMIT3(0x83, add_1reg(0xF8, IA32_ECX), 32);
+ /* jumps when >= 32 */
+ if (is_imm8(jmp_label(jmp_label1, 2)))
+ EMIT2(IA32_JAE, jmp_label(jmp_label1, 2));
+ else
+ EMIT2_off32(0x0F, IA32_JAE + 0x10, jmp_label(jmp_label1, 6));
+
+ /* < 32 */
+ /* mov esi,32 */
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), 32);
+ /* sub esi,ecx */
+ EMIT2(0x29, add_2reg(0xC0, tmp[0], IA32_ECX));
+
+ /* shr dword ptr [ebp+off],cl */
+ EMIT3(0xD3, add_1reg(0x68, IA32_EBP), STACK_VAR(dst_lo));
+ /* mov edi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[1]), STACK_VAR(dst_hi));
+ /* sar dword ptr [ebp+off],cl */
+ EMIT3(0xD3, add_1reg(0x78, IA32_EBP), STACK_VAR(dst_hi));
+
+ /* mov ecx,esi */
+ EMIT2(0x89, add_2reg(0xC0, IA32_ECX, tmp[0]));
+ /* shl edi,cl */
+ EMIT2(0xD3, add_1reg(0xE0, tmp[1]));
+ /* or dword ptr [ebp+off],edi */
+ EMIT3(0x09, add_2reg(0x40, IA32_EBP, tmp[1]),
+ STACK_VAR(dst_lo));
+
+ /* goto out; */
+ if (is_imm8(jmp_label(jmp_label3, 2)))
+ EMIT2(0xEB, jmp_label(jmp_label3, 2));
+ else
+ EMIT1_off32(0xE9, jmp_label(jmp_label3, 5));
+
+ /* >= 32 */
+ if (jmp_label1 == -1)
+ jmp_label1 = cnt;
+
+ /* cmp ecx,64 */
+ EMIT3(0x83, add_1reg(0xF8, IA32_ECX), 64);
+ /* jumps when >= 64 */
+ if (is_imm8(jmp_label(jmp_label2, 2)))
+ EMIT2(IA32_JAE, jmp_label(jmp_label2, 2));
+ else
+ EMIT2_off32(0x0F, IA32_JAE + 0x10, jmp_label(jmp_label2, 6));
+
+ /* >= 32 && < 64 */
+ /* sub ecx,32 */
+ EMIT3(0x83, add_1reg(0xE8, IA32_ECX), 32);
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(dst_hi));
+ /* sar esi,cl */
+ EMIT2(0xD3, add_1reg(0xF8, tmp[0]));
+ /* mov dword ptr [ebp+off],esi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(dst_lo));
+
+ /* sar dword ptr [ebp+off],imm8 */
+ EMIT3(0xC1, add_1reg(0x78, IA32_EBP), STACK_VAR(dst_hi));
+ EMIT(31, 1);
+
+ /* goto out; */
+ if (is_imm8(jmp_label(jmp_label3, 2)))
+ EMIT2(0xEB, jmp_label(jmp_label3, 2));
+ else
+ EMIT1_off32(0xE9, jmp_label(jmp_label3, 5));
+
+ /* >= 64 */
+ if (jmp_label2 == -1)
+ jmp_label2 = cnt;
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(dst_hi));
+ /* sar esi,imm8 */
+ EMIT3(0xC1, add_1reg(0xF8, tmp[0]), 31);
+ /* mov dword ptr [ebp+off],esi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(dst_hi));
+ /* mov dword ptr [ebp+off],esi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(dst_lo));
+
+ if (jmp_label3 == -1)
+ jmp_label3 = cnt;
+
+ /* out: */
+ *pprog = prog;
+}
+
+/* dst = dst >> src */
+static inline void emit_ia32_rsh_r64(const u8 dst[], const u8 src[], u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ const u8 *tmp = bpf2ia32[TMP_REG_1];
+ static int jmp_label1 = -1;
+ static int jmp_label2 = -1;
+ static int jmp_label3 = -1;
+
+ /* mov ecx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), STACK_VAR(src_lo));
+
+ /* cmp ecx,32 */
+ EMIT3(0x83, add_1reg(0xF8, IA32_ECX), 32);
+ /* jumps when >= 32 */
+ if (is_imm8(jmp_label(jmp_label1, 2)))
+ EMIT2(IA32_JAE, jmp_label(jmp_label1, 2));
+ else
+ EMIT2_off32(0x0F, IA32_JAE + 0x10, jmp_label(jmp_label1, 6));
+
+ /* < 32 */
+ /* mov esi,32 */
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), 32);
+ /* sub esi,ecx */
+ EMIT2(0x29, add_2reg(0xC0, tmp[0], IA32_ECX));
+
+ /* shr dword ptr [ebp+off],cl */
+ EMIT3(0xD3, add_1reg(0x68, IA32_EBP), STACK_VAR(dst_lo));
+ /* mov edi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[1]), STACK_VAR(dst_hi));
+ /* shr dword ptr [ebp+off],cl */
+ EMIT3(0xD3, add_1reg(0x68, IA32_EBP), STACK_VAR(dst_hi));
+
+ /* mov ecx, esi */
+ EMIT2(0x89, add_2reg(0xC0, IA32_ECX, tmp[0]));
+ /* shl edi,cl */
+ EMIT2(0xD3, add_1reg(0xE0, tmp[1]));
+ /* or dword ptr [ebp+off],edi */
+ EMIT3(0x09, add_2reg(0x40, IA32_EBP, tmp[1]), STACK_VAR(dst_lo));
+
+ /* goto out; */
+ if (is_imm8(jmp_label(jmp_label3, 2)))
+ EMIT2(0xEB, jmp_label(jmp_label3, 2));
+ else
+ EMIT1_off32(0xE9, jmp_label(jmp_label3, 5));
+
+ /* >= 32 */
+ if (jmp_label1 == -1)
+ jmp_label1 = cnt;
+ /* cmp ecx,64 */
+ EMIT3(0x83, add_1reg(0xF8, IA32_ECX), 64);
+ /* jumps when >= 64 */
+ if (is_imm8(jmp_label(jmp_label2, 2)))
+ EMIT2(IA32_JAE, jmp_label(jmp_label2, 2));
+ else
+ EMIT2_off32(0x0F, IA32_JAE + 0x10, jmp_label(jmp_label2, 6));
+
+ /* >= 32 && < 64 */
+ /* sub ecx,32 */
+ EMIT3(0x83, add_1reg(0xE8, IA32_ECX), 32);
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(dst_hi));
+ /* shr esi,cl */
+ EMIT2(0xD3, add_1reg(0xE8, tmp[0]));
+ /* mov dword ptr [ebp+off],esi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(dst_lo));
+ /* mov dword ptr[ebp+off],imm32 */
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP), STACK_VAR(dst_hi));
+ EMIT(0x0, 4);
+
+ /* goto out; */
+ if (is_imm8(jmp_label(jmp_label3, 2)))
+ EMIT2(0xEB, jmp_label(jmp_label3, 2));
+ else
+ EMIT1_off32(0xE9, jmp_label(jmp_label3, 5));
+
+ /* >= 64 */
+ if (jmp_label2 == -1)
+ jmp_label2 = cnt;
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP), STACK_VAR(dst_hi));
+ EMIT(0x0, 4);
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP), STACK_VAR(dst_lo));
+ EMIT(0x0, 4);
+
+ if (jmp_label3 == -1)
+ jmp_label3 = cnt;
+
+ /* out: */
+ *pprog = prog;
+}
+
+/* dst = dst << val */
+static inline void emit_ia32_lsh_i64(const u8 dst[], const u32 val, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ const u8 *tmp = bpf2ia32[TMP_REG_1];
+
+ /* Do LSH operation */
+ if (val < 32) {
+ /* mov esi,32 */
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), 32);
+ /* sub esi,imm8 */
+ EMIT3(0x83, add_1reg(0xE8, tmp[0]), val);
+
+ /* shl dword ptr [ebp+off],imm8 */
+ EMIT3(0xC1, add_1reg(0x60, IA32_EBP), STACK_VAR(dst_hi));
+ EMIT(val, 1);
+ /* mov edi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[1]),
+ STACK_VAR(dst_lo));
+ /* shl dword ptr [ebp+off],imm8 */
+ EMIT3(0xC1, add_1reg(0x60, IA32_EBP), STACK_VAR(dst_lo));
+ EMIT(val, 1);
+
+ /* mov ecx,esi */
+ EMIT2(0x89, add_2reg(0xC0, IA32_ECX, tmp[0]));
+ /* shr edi,cl */
+ EMIT2(0xD3, add_1reg(0xE8, tmp[1]));
+ /* or dword ptr [ebp+off],edi */
+ EMIT3(0x09, add_2reg(0x40, IA32_EBP, tmp[1]),
+ STACK_VAR(dst_hi));
+ } else if (val >= 32 && val < 64) {
+ u32 value = val - 32;
+
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_lo));
+ /* shl esi,imm8 */
+ EMIT3(0xC1, add_1reg(0xE0, tmp[0]), value);
+ /* mov dword ptr [ebp+off],esi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_hi));
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP), STACK_VAR(dst_lo));
+ EMIT(0x0, 4);
+ } else {
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP), STACK_VAR(dst_hi));
+ EMIT(0x0, 4);
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP), STACK_VAR(dst_lo));
+ EMIT(0x0, 4);
+ }
+
+ *pprog = prog;
+}
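The three cases handled by the 64-bit left shift (count below 32, count from 32 to 63, count of 64 or more) can be compared against a plain C model. This is a sketch under assumptions: shl64_halves() is a hypothetical name, and the explicit n == 0 case is there because shifting a 32-bit value by 32 is undefined in C. On x86 the shift count is masked to 5 bits, so a count of 32 behaves like 0 there; a reviewer may want to check how the emitted "32 - val" sequence behaves when val is 0.

```c
#include <stdint.h>

/* Hypothetical reference: dst <<= n for a value held as two 32-bit halves. */
static void shl64_halves(uint32_t *lo, uint32_t *hi, uint32_t n)
{
	if (n == 0)
		return;			/* nothing moves between the halves */

	if (n < 32) {
		/* The top n bits of lo spill into the bottom of hi. */
		*hi = (*hi << n) | (*lo >> (32 - n));
		*lo <<= n;
	} else if (n < 64) {
		/* Everything left in the result comes from lo. */
		*hi = *lo << (n - 32);
		*lo = 0;
	} else {
		*hi = 0;
		*lo = 0;
	}
}
```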
+
+/* dst = dst >> val */
+static inline void emit_ia32_rsh_i64(const u8 dst[], const u32 val, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ const u8 *tmp = bpf2ia32[TMP_REG_1];
+
+ /* Do RSH operation */
+ if (val < 32) {
+ /* mov esi,32 */
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), 32);
+ /* sub esi,imm8 */
+ EMIT3(0x83, add_1reg(0xE8, tmp[0]), val);
+
+ /* shr dword ptr [ebp+off],imm8 */
+ EMIT3(0xC1, add_1reg(0x68, IA32_EBP), STACK_VAR(dst_lo));
+ EMIT(val, 1);
+ /* mov edi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[1]),
+ STACK_VAR(dst_hi));
+ /* shr dword ptr [ebp+off],imm8 */
+ EMIT3(0xC1, add_1reg(0x68, IA32_EBP), STACK_VAR(dst_hi));
+ EMIT(val, 1);
+
+ /* mov ecx,esi */
+ EMIT2(0x89, add_2reg(0xC0, IA32_ECX, tmp[0]));
+ /* shl edi,cl */
+ EMIT2(0xD3, add_1reg(0xE0, tmp[1]));
+ /* or dword ptr [ebp+off],edi */
+ EMIT3(0x09, add_2reg(0x40, IA32_EBP, tmp[1]),
+ STACK_VAR(dst_lo));
+ } else if (val >= 32 && val < 64) {
+ u32 value = val - 32;
+
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_hi));
+ /* shr esi,imm8 */
+ EMIT3(0xC1, add_1reg(0xE8, tmp[0]), value);
+ /* mov dword ptr [ebp+off],esi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_lo));
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP), STACK_VAR(dst_hi));
+ EMIT(0x0, 4);
+ } else {
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP), STACK_VAR(dst_hi));
+ EMIT(0x0, 4);
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP), STACK_VAR(dst_lo));
+ EMIT(0x0, 4);
+ }
+
+ *pprog = prog;
+}
+
+/* dst = dst >> val (signed) */
+static inline void emit_ia32_arsh_i64(const u8 dst[], const u32 val, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ const u8 *tmp = bpf2ia32[TMP_REG_1];
+
+ /* Do ARSH operation */
+ if (val < 32) {
+ /* mov esi,32 */
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), 32);
+ /* sub esi,imm8 */
+ EMIT3(0x83, add_1reg(0xE8, tmp[0]), val);
+
+ /* shr dword ptr [ebp+off],imm8 */
+ EMIT3(0xC1, add_1reg(0x68, IA32_EBP), STACK_VAR(dst_lo));
+ EMIT(val, 1);
+ /* mov edi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[1]),
+ STACK_VAR(dst_hi));
+ /* sar dword ptr [ebp+off],imm8 */
+ EMIT3(0xC1, add_1reg(0x78, IA32_EBP), STACK_VAR(dst_hi));
+ EMIT(val, 1);
+
+ /* mov ecx,esi */
+ EMIT2(0x89, add_2reg(0xC0, IA32_ECX, tmp[0]));
+ /* shl edi,cl */
+ EMIT2(0xD3, add_1reg(0xE0, tmp[1]));
+ /* or dword ptr [ebp+off],edi */
+ EMIT3(0x09, add_2reg(0x40, IA32_EBP, tmp[1]),
+ STACK_VAR(dst_lo));
+ } else if (val >= 32 && val < 64) {
+ u32 value = val - 32;
+
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_hi));
+ /* sar esi,imm8 */
+ EMIT3(0xC1, add_1reg(0xF8, tmp[0]), value);
+ /* mov dword ptr [ebp+off],esi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_lo));
+
+ /* sar dword ptr [ebp+off],imm8 */
+ EMIT3(0xC1, add_1reg(0x78, IA32_EBP), STACK_VAR(dst_hi));
+ EMIT(31, 1);
+ } else {
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_hi));
+ /* sar esi,imm8 */
+ EMIT3(0xC1, add_1reg(0xF8, tmp[0]), 31);
+ /* mov dword ptr [ebp+off],esi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_hi));
+ /* mov dword ptr [ebp+off],esi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_lo));
+ }
+
+ *pprog = prog;
+}
+
+static inline void emit_ia32_mul_r64(const u8 dst[], const u8 src[], u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ const u8 *tmp = bpf2ia32[TMP_REG_1];
+ const u8 *tmp2 = bpf2ia32[TMP_REG_2];
+
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp2[0]), STACK_VAR(dst_hi));
+ /* mul dword ptr [ebp+off] */
+ EMIT3(0xF7, add_1reg(0x60, IA32_EBP), STACK_VAR(src_lo));
+ /* mov esi,eax */
+ EMIT2(0x89, add_2reg(0xC0, tmp[0], tmp2[0]));
+
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp2[0]), STACK_VAR(dst_lo));
+ /* mul dword ptr [ebp+off] */
+ EMIT3(0xF7, add_1reg(0x60, IA32_EBP), STACK_VAR(src_hi));
+ /* mov edi,eax */
+ EMIT2(0x89, add_2reg(0xC0, tmp[1], tmp2[0]));
+
+ /* add esi,edi */
+ EMIT2(0x01, add_2reg(0xC0, tmp[0], tmp[1]));
+
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp2[0]), STACK_VAR(dst_lo));
+ /* mul dword ptr [ebp+off] */
+ EMIT3(0xF7, add_1reg(0x60, IA32_EBP), STACK_VAR(src_lo));
+
+ /* add esi,edx */
+ EMIT2(0x01, add_2reg(0xC0, tmp[0], tmp2[1]));
+
+ /* mov dword ptr [ebp+off],eax */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp2[0]), STACK_VAR(dst_lo));
+
+ /* mov dword ptr [ebp+off],esi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(dst_hi));
+
+ *pprog = prog;
+}
+
+static inline void emit_ia32_mul_i64(const u8 dst[], const u32 val, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ const u8 *tmp = bpf2ia32[TMP_REG_1];
+ const u8 *tmp2 = bpf2ia32[TMP_REG_2];
+ u32 hi;
+
+ hi = val & (1<<31) ? (u32)~0 : 0;
+ /* movl eax,imm32 */
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp2[0]), val);
+ /* mul dword ptr [ebp+off] */
+ EMIT3(0xF7, add_1reg(0x60, IA32_EBP), STACK_VAR(dst_hi));
+ /* mov esi,eax */
+ EMIT2(0x89, add_2reg(0xC0, tmp[0], tmp2[0]));
+
+ /* movl eax,imm32 */
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp2[0]), hi);
+ /* mul dword ptr [ebp+off] */
+ EMIT3(0xF7, add_1reg(0x60, IA32_EBP), STACK_VAR(dst_lo));
+ /* mov edi,eax */
+ EMIT2(0x89, add_2reg(0xC0, tmp[1], tmp2[0]));
+
+ /* add esi,edi */
+ EMIT2(0x01, add_2reg(0xC0, tmp[0], tmp[1]));
+
+ /* movl eax,imm32 */
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp2[0]), val);
+ /* mul dword ptr [ebp+off] */
+ EMIT3(0xF7, add_1reg(0x60, IA32_EBP), STACK_VAR(dst_lo));
+
+ /* add esi,edx */
+ EMIT2(0x01, add_2reg(0xC0, tmp[0], tmp2[1]));
+
+ /* mov dword ptr [ebp+off],eax */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp2[0]), STACK_VAR(dst_lo));
+
+ /* mov dword ptr [ebp+off],esi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(dst_hi));
+
+ *pprog = prog;
+}
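The multiply helpers build a 64-bit product from 32-bit pieces: two cross products whose low halves are summed, plus the full low-by-low product. A minimal C model of that decomposition follows. It is a sketch, not the patch's code; mul64_halves() is a hypothetical name.

```c
#include <stdint.h>

/*
 * Hypothetical reference: dst *= src (mod 2^64), with both operands held
 * as two 32-bit halves.
 */
static void mul64_halves(uint32_t *dst_lo, uint32_t *dst_hi,
			 uint32_t src_lo, uint32_t src_hi)
{
	/* Full 32x32->64 product of the low halves. */
	uint64_t low = (uint64_t)*dst_lo * src_lo;
	/* Cross products only contribute to the high half (mod 2^32). */
	uint32_t cross = *dst_hi * src_lo + *dst_lo * src_hi;

	*dst_hi = cross + (uint32_t)(low >> 32);
	*dst_lo = (uint32_t)low;
}
```

The product of the two high halves is never computed because it only affects bits 64 and above.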
+
+static int bpf_size_to_x86_bytes(int bpf_size)
+{
+ if (bpf_size == BPF_W)
+ return 4;
+ else if (bpf_size == BPF_H)
+ return 2;
+ else if (bpf_size == BPF_B)
+ return 1;
+ else if (bpf_size == BPF_DW)
+ return 4; /* imm32 */
+ else
+ return 0;
+}
+
+#define CHOOSE_LOAD_FUNC(K, func) \
+ ((int)K < 0 ? ((int)K >= SKF_LL_OFF ? func##_negative_offset : func) : \
+ func##_positive_offset)
+
+struct jit_context {
+ int cleanup_addr; /* epilogue code offset */
+};
+
+/* maximum number of bytes emitted while JITing one eBPF insn */
+#define BPF_MAX_INSN_SIZE 128
+#define BPF_INSN_SAFETY 64
+
+#define PROLOGUE_SIZE 35
+
+/* emit prologue code for BPF program and check its size.
+ * bpf_tail_call helper will skip it while jumping into another program
+ */
+static void emit_prologue(u8 **pprog, u32 stack_depth)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ const u8 *r1 = bpf2ia32[BPF_REG_1];
+ const u8 *tmp = bpf2ia32[TMP_REG_1];
+ const u8 *tmp2 = bpf2ia32[TMP_REG_2];
+ const u8 fplo = bpf2ia32[BPF_REG_FP][0];
+ const u8 fphi = bpf2ia32[BPF_REG_FP][1];
+ const u8 *tcc = bpf2ia32[TCALL_CNT];
+
+ /* push ebp */
+ EMIT1(0x55);
+ /* mov ebp,esp */
+ EMIT2(0x89, 0xE5);
+ /* push edi */
+ EMIT1(0x57);
+ /* push esi */
+ EMIT1(0x56);
+ /* push ebx */
+ EMIT1(0x53);
+
+ /* sub esp,STACK_SIZE */
+ EMIT2_off32(0x81, 0xEC, STACK_SIZE);
+ /* sub ebp,SCRATCH_SIZE+4+12*/
+ EMIT3(0x83, add_1reg(0xE8, IA32_EBP), SCRATCH_SIZE + 16);
+ /* xor esi,esi */
+ EMIT2(0x31, add_2reg(0xC0, tmp[0], tmp[0]));
+
+ /* Set up BPF prog stack base register */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EBP), STACK_VAR(fplo));
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(fphi));
+
+ /* Move BPF_CTX (EAX) to BPF_REG_R1 */
+ /* mov dword ptr [ebp+off],eax */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp2[0]), STACK_VAR(r1[0]));
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(r1[1]));
+
+ /* Initialize Tail Count */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(tcc[0]));
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(tcc[1]));
+
+ BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
+ *pprog = prog;
+}
+
+/* Emit epilogue code for BPF program */
+static void emit_epilogue(u8 **pprog, u32 stack_depth)
+{
+ u8 *prog = *pprog;
+ const u8 *r0 = bpf2ia32[BPF_REG_0];
+ int cnt = 0;
+
+ /* mov eax,dword ptr [ebp+off]*/
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), STACK_VAR(r0[0]));
+ /* mov edx,dword ptr [ebp+off]*/
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), STACK_VAR(r0[1]));
+
+ /* add ebp,SCRATCH_SIZE+4+12*/
+ EMIT3(0x83, add_1reg(0xC0, IA32_EBP), SCRATCH_SIZE + 16);
+
+ /* mov ebx,dword ptr [ebp-12]*/
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EBX), -12);
+ /* mov esi,dword ptr [ebp-8]*/
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ESI), -8);
+ /* mov edi,dword ptr [ebp-4]*/
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDI), -4);
+
+ EMIT1(0xC9); /* leave */
+ EMIT1(0xC3); /* ret */
+ *pprog = prog;
+}
+
+/* generate the following code:
+ * ... bpf_tail_call(void *ctx, struct bpf_array *array, u64 index) ...
+ * if (index >= array->map.max_entries)
+ * goto out;
+ * if (++tail_call_cnt > MAX_TAIL_CALL_CNT)
+ * goto out;
+ * prog = array->ptrs[index];
+ * if (prog == NULL)
+ * goto out;
+ * goto *(prog->bpf_func + prologue_size);
+ * out:
+ */
+static void emit_bpf_tail_call(u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ const u8 *r1 = bpf2ia32[BPF_REG_1];
+ const u8 *r2 = bpf2ia32[BPF_REG_2];
+ const u8 *r3 = bpf2ia32[BPF_REG_3];
+ const u8 *tmp = bpf2ia32[TMP_REG_1];
+ const u8 *tmp2 = bpf2ia32[TMP_REG_2];
+ const u8 *tcc = bpf2ia32[TCALL_CNT];
+ u32 lo, hi;
+ static int jmp_label1 = -1;
+
+ /* if (index >= array->map.max_entries)
+ * goto out;
+ */
+
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(r2[0]));
+ /* mov edi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[1]), STACK_VAR(r3[0]));
+
+ /* cmp dword ptr [esi + 16], edi */
+ EMIT3(0x39, add_2reg(0x40, tmp[0], tmp[1]),
+ offsetof(struct bpf_array, map.max_entries));
+ /* jbe out */
+ EMIT2(IA32_JBE, jmp_label(jmp_label1, 2));
+
+ /* if (tail_call_cnt > MAX_TAIL_CALL_CNT)
+ * goto out;
+ */
+ lo = (u32)MAX_TAIL_CALL_CNT;
+ hi = (u32)((u64)MAX_TAIL_CALL_CNT >> 32);
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp2[0]), STACK_VAR(tcc[0]));
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp2[1]), STACK_VAR(tcc[1]));
+
+ EMIT3(0x83, add_1reg(0xF8, tmp2[1]), hi); /* cmp edx, hi */
+ EMIT2(IA32_JNE, 3);
+ EMIT3(0x83, add_1reg(0xF8, tmp2[0]), lo); /* cmp eax, lo */
+
+ EMIT2(IA32_JAE, jmp_label(jmp_label1, 2)); /* jae out */
+
+ EMIT3(0x83, add_1reg(0xC0, tmp2[0]), 0x01); /* add eax, 0x1 */
+ EMIT3(0x83, add_1reg(0xD0, tmp2[1]), 0x00); /* adc edx, 0x0 */
+
+ /* mov dword ptr [ebp + off], eax */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp2[0]), STACK_VAR(tcc[0]));
+ /* mov dword ptr [ebp + off], edx */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp2[1]), STACK_VAR(tcc[1]));
+
+ /* prog = array->ptrs[index]; */
+ /* mov edx, [esi + edi * 4 + offsetof(...)] */
+ EMIT3_off32(0x8B, 0x94, 0xBE, offsetof(struct bpf_array, ptrs));
+
+ /* if (prog == NULL)
+ * goto out;
+ */
+ EMIT2(0x85, add_2reg(0xC0, tmp2[1], tmp2[1])); /* test edx,edx */
+ EMIT2(IA32_JE, jmp_label(jmp_label1, 2)); /* je out */
+
+ /* goto *(prog->bpf_func + prologue_size); */
+ /* mov edx, dword ptr [edx + 32] */
+ EMIT3(0x8B, add_2reg(0x40, tmp2[1], tmp2[1]),
+ offsetof(struct bpf_prog, bpf_func));
+ /* add edx, prologue_size */
+ EMIT3(0x83, add_1reg(0xC0, tmp2[1]), PROLOGUE_SIZE);
+
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp2[0]), STACK_VAR(r1[0]));
+
+ /* now we're ready to jump into next BPF program
+ * eax == ctx (1st arg)
+ * edx == prog->bpf_func + prologue_size
+ */
+ RETPOLINE_EDX_BPF_JIT();
+
+ if (jmp_label1 == -1)
+ jmp_label1 = cnt;
+
+ /* out: */
+ *pprog = prog;
+}
+
+static void emit_load_skb_data_hlen(u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ const u8 *reg6 = bpf2ia32[BPF_REG_6];
+ const u8 *tmp = bpf2ia32[TMP_REG_1];
+ const u8 *tmp2 = bpf2ia32[TMP_REG_2];
+
+ /*
+ * eax : skb pointer
+ * esi : copy of skb->data
+ * edi : hlen = skb->len - skb->data_len
+ */
+
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp2[0]), STACK_VAR(reg6[0]));
+
+ /* mov edi,dword ptr [eax+off] */
+ EMIT2_off32(0x8B, add_2reg(0x80, tmp2[0], tmp[1]),
+ offsetof(struct sk_buff, len));
+
+ /* sub edi,dword ptr [eax+off] */
+ EMIT2_off32(0x2B, add_2reg(0x80, tmp2[0], tmp[1]),
+ offsetof(struct sk_buff, data_len));
+
+ /* mov esi,dword ptr [eax+off] */
+ EMIT2_off32(0x8B, add_2reg(0x80, tmp2[0], tmp[0]),
+ offsetof(struct sk_buff, data));
+
+ *pprog = prog;
+}
+
+/* Push the scratch stack register on top of the stack */
+static inline void emit_push_r64(const u8 src[], u8 **pprog)
+{
+ const u8 *tmp = bpf2ia32[TMP_REG_1];
+ u8 *prog = *pprog;
+ int cnt = 0;
+
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(src_hi));
+ /* push esi */
+ EMIT1(0x56);
+
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]), STACK_VAR(src_lo));
+ /* push esi */
+ EMIT1(0x56);
+
+ *pprog = prog;
+}
+
+static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
+ int oldproglen, struct jit_context *ctx)
+{
+ struct bpf_insn *insn = bpf_prog->insnsi;
+ int insn_cnt = bpf_prog->len;
+ bool seen_exit = false;
+ u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
+ int i, cnt = 0;
+ int proglen = 0;
+ u8 *prog = temp;
+
+ emit_prologue(&prog, bpf_prog->aux->stack_depth);
+
+ for (i = 0; i < insn_cnt; i++, insn++) {
+ const s32 imm32 = insn->imm;
+ const bool is64 = BPF_CLASS(insn->code) == BPF_ALU64;
+ const u8 code = insn->code;
+ const u8 *dst = bpf2ia32[insn->dst_reg];
+ const u8 *src = bpf2ia32[insn->src_reg];
+ const u8 *tmp = bpf2ia32[TMP_REG_1];
+ const u8 *tmp2 = bpf2ia32[TMP_REG_2];
+ const u8 *r0 = bpf2ia32[BPF_REG_0];
+ s64 jmp_offset;
+ u8 jmp_cond;
+ int ilen;
+ u8 *func;
+
+ switch (code) {
+ /* ALU operations */
+ /* dst = src */
+ case BPF_ALU | BPF_MOV | BPF_K:
+ case BPF_ALU | BPF_MOV | BPF_X:
+ case BPF_ALU64 | BPF_MOV | BPF_K:
+ case BPF_ALU64 | BPF_MOV | BPF_X:
+ switch (BPF_SRC(code)) {
+ case BPF_X:
+ emit_ia32_mov_r64(is64, dst, src, &prog);
+ break;
+ case BPF_K:
+ /* Sign-extend immediate value to dst reg */
+ emit_ia32_mov_i64(is64, dst, imm32, &prog);
+ break;
+ }
+ break;
+ /* dst = dst + src/imm */
+ /* dst = dst - src/imm */
+ /* dst = dst | src/imm */
+ /* dst = dst & src/imm */
+ /* dst = dst ^ src/imm */
+ /* dst = dst * src/imm */
+ /* dst = dst << src */
+ /* dst = dst >> src */
+ case BPF_ALU | BPF_ADD | BPF_K:
+ case BPF_ALU | BPF_ADD | BPF_X:
+ case BPF_ALU | BPF_SUB | BPF_K:
+ case BPF_ALU | BPF_SUB | BPF_X:
+ case BPF_ALU | BPF_OR | BPF_K:
+ case BPF_ALU | BPF_OR | BPF_X:
+ case BPF_ALU | BPF_AND | BPF_K:
+ case BPF_ALU | BPF_AND | BPF_X:
+ case BPF_ALU | BPF_XOR | BPF_K:
+ case BPF_ALU | BPF_XOR | BPF_X:
+ case BPF_ALU | BPF_MUL | BPF_K:
+ case BPF_ALU | BPF_MUL | BPF_X:
+ case BPF_ALU | BPF_LSH | BPF_X:
+ case BPF_ALU | BPF_RSH | BPF_X:
+ case BPF_ALU | BPF_ARSH | BPF_K:
+ case BPF_ALU | BPF_ARSH | BPF_X:
+ case BPF_ALU64 | BPF_ADD | BPF_K:
+ case BPF_ALU64 | BPF_ADD | BPF_X:
+ case BPF_ALU64 | BPF_SUB | BPF_K:
+ case BPF_ALU64 | BPF_SUB | BPF_X:
+ case BPF_ALU64 | BPF_OR | BPF_K:
+ case BPF_ALU64 | BPF_OR | BPF_X:
+ case BPF_ALU64 | BPF_AND | BPF_K:
+ case BPF_ALU64 | BPF_AND | BPF_X:
+ case BPF_ALU64 | BPF_XOR | BPF_K:
+ case BPF_ALU64 | BPF_XOR | BPF_X:
+ switch (BPF_SRC(code)) {
+ case BPF_X:
+ emit_ia32_alu_r64(is64, BPF_OP(code), dst, src,
+ &prog);
+ break;
+ case BPF_K:
+ emit_ia32_alu_i64(is64, BPF_OP(code), dst, imm32,
+ &prog);
+ break;
+ }
+ break;
+ /* dst = dst / src(imm) */
+ /* dst = dst % src(imm) */
+ case BPF_ALU | BPF_DIV | BPF_K:
+ case BPF_ALU | BPF_DIV | BPF_X:
+ case BPF_ALU | BPF_MOD | BPF_K:
+ case BPF_ALU | BPF_MOD | BPF_X:
+ if (BPF_SRC(code) == BPF_X)
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(src_lo));
+ else
+ /* mov esi,imm32*/
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]),
+ imm32);
+
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp2[0]),
+ STACK_VAR(dst_lo));
+ /* xor edx,edx
+ * clear edx so that 'div' divides the 64-bit value edx:eax
+ */
+ EMIT2(0x31, add_2reg(0xC0, tmp2[1], tmp2[1]));
+
+ /* div esi */
+ EMIT2(0xF7, 0xF6);
+
+ if (BPF_OP(code) == BPF_MOD)
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp2[1]),
+ STACK_VAR(dst_lo));
+ else
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp2[0]),
+ STACK_VAR(dst_lo));
+
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP),
+ STACK_VAR(dst_hi));
+ EMIT(0x0, 4);
+ break;
+ case BPF_ALU64 | BPF_DIV | BPF_K:
+ case BPF_ALU64 | BPF_DIV | BPF_X:
+ case BPF_ALU64 | BPF_MOD | BPF_K:
+ case BPF_ALU64 | BPF_MOD | BPF_X:
+ goto notyet;
+ /* dst = dst >> imm */
+ /* dst = dst << imm */
+ case BPF_ALU | BPF_RSH | BPF_K:
+ case BPF_ALU | BPF_LSH | BPF_K:
+ if (unlikely(imm32 > 31))
+ return -EINVAL;
+ if (imm32)
+ emit_ia32_alu_i(false, false, BPF_OP(code),
+ dst_lo, imm32, &prog);
+ emit_ia32_mov_i(dst_hi, 0, &prog);
+ break;
+ /* dst = dst << imm */
+ case BPF_ALU64 | BPF_LSH | BPF_K:
+ if (unlikely(imm32 > 63))
+ return -EINVAL;
+ emit_ia32_lsh_i64(dst, imm32, &prog);
+ break;
+ /* dst = dst >> imm */
+ case BPF_ALU64 | BPF_RSH | BPF_K:
+ if (unlikely(imm32 > 63))
+ return -EINVAL;
+ emit_ia32_rsh_i64(dst, imm32, &prog);
+ break;
+ /* dst = dst << src */
+ case BPF_ALU64 | BPF_LSH | BPF_X:
+ emit_ia32_lsh_r64(dst, src, &prog);
+ break;
+ /* dst = dst >> src */
+ case BPF_ALU64 | BPF_RSH | BPF_X:
+ emit_ia32_rsh_r64(dst, src, &prog);
+ break;
+ /* dst = dst >> src (signed) */
+ case BPF_ALU64 | BPF_ARSH | BPF_X:
+ emit_ia32_arsh_r64(dst, src, &prog);
+ break;
+ /* dst = dst >> imm (signed) */
+ case BPF_ALU64 | BPF_ARSH | BPF_K:
+ if (unlikely(imm32 > 63))
+ return -EINVAL;
+ emit_ia32_arsh_i64(dst, imm32, &prog);
+ break;
+ /* dst = -dst */
+ case BPF_ALU | BPF_NEG:
+ emit_ia32_alu_i(is64, false, BPF_OP(code),
+ dst_lo, 0, &prog);
+ emit_ia32_mov_i(dst_hi, 0, &prog);
+ break;
+ /* dst = -dst (64 bit) */
+ case BPF_ALU64 | BPF_NEG:
+ emit_ia32_neg64(dst, &prog);
+ break;
+ /* dst = dst * src/imm */
+ case BPF_ALU64 | BPF_MUL | BPF_X:
+ case BPF_ALU64 | BPF_MUL | BPF_K:
+ switch (BPF_SRC(code)) {
+ case BPF_X:
+ emit_ia32_mul_r64(dst, src, &prog);
+ break;
+ case BPF_K:
+ emit_ia32_mul_i64(dst, imm32, &prog);
+ break;
+ }
+ break;
+ /* dst = htole(dst) */
+ case BPF_ALU | BPF_END | BPF_FROM_LE:
+ switch (imm32) {
+ case 16:
+ /* emit 'movzx esi,si' to zero extend 16-bit
+ * into 64 bit
+ */
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_lo));
+ EMIT2(0x0F, 0xB7);
+ EMIT1(add_2reg(0xC0, tmp[0], tmp[0]));
+
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_lo));
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP),
+ STACK_VAR(dst_hi));
+ EMIT(0x0, 4);
+ break;
+ case 32:
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP),
+ STACK_VAR(dst_hi));
+ EMIT(0x0, 4);
+ break;
+ case 64:
+ /* nop */
+ break;
+ }
+ break;
+ /* dst = htobe(dst) */
+ case BPF_ALU | BPF_END | BPF_FROM_BE:
+ switch (imm32) {
+ case 16:
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_lo));
+
+ /* emit 'ror %si, 8' to swap lower 2 bytes */
+ EMIT1(0x66);
+ EMIT3(0xC1, add_1reg(0xC8, tmp[0]), 8);
+
+ EMIT2(0x0F, 0xB7);
+ EMIT1(add_2reg(0xC0, tmp[0], tmp[0]));
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_lo));
+
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP),
+ STACK_VAR(dst_hi));
+ EMIT(0x0, 4);
+ break;
+ case 32:
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_lo));
+
+ /* emit 'bswap esi' to swap lower 4 bytes */
+ EMIT1(0x0F);
+ EMIT1(add_1reg(0xC8, tmp[0]));
+
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_lo));
+
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP),
+ STACK_VAR(dst_hi));
+ EMIT(0x0, 4);
+ break;
+ case 64:
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_lo));
+ /* emit 'bswap esi' to swap lower 4 bytes */
+ EMIT1(0x0F);
+ EMIT1(add_1reg(0xC8, tmp[0]));
+
+ /* mov edi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[1]),
+ STACK_VAR(dst_hi));
+ /* emit 'bswap edi' to swap upper 4 bytes */
+ EMIT1(0x0F);
+ EMIT1(add_1reg(0xC8, tmp[1]));
+
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[1]),
+ STACK_VAR(dst_lo));
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_hi));
+ break;
+ }
+ break;
+ /* dst = imm64 */
+ case BPF_LD | BPF_IMM | BPF_DW:
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP),
+ STACK_VAR(dst_lo));
+ EMIT(insn[0].imm, 4);
+
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP),
+ STACK_VAR(dst_hi));
+ EMIT(insn[1].imm, 4);
+
+ insn++;
+ i++;
+ break;
+ /* ST: *(u8*)(dst_reg + off) = imm */
+ case BPF_ST | BPF_MEM | BPF_B:
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_lo));
+ EMIT1(0xC6);
+ goto st;
+ case BPF_ST | BPF_MEM | BPF_H:
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_lo));
+ EMIT2(0x66, 0xC7);
+ goto st;
+ case BPF_ST | BPF_MEM | BPF_W:
+ case BPF_ST | BPF_MEM | BPF_DW:
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_lo));
+ EMIT1(0xC7);
+
+st:
+ if (is_imm8(insn->off))
+ EMIT2(add_1reg(0x40, tmp[0]), insn->off);
+ else
+ EMIT1_off32(add_1reg(0x80, tmp[0]), insn->off);
+ EMIT(imm32, bpf_size_to_x86_bytes(BPF_SIZE(code)));
+
+ if (BPF_SIZE(code) == BPF_DW) {
+ u32 hi;
+
+ hi = imm32 & (1<<31) ? (u32)~0 : 0;
+ EMIT2_off32(0xC7, add_1reg(0x80, tmp[0]),
+ insn->off + 4);
+ EMIT(hi, 4);
+ }
+ break;
+
+ /* STX: *(u8*)(dst_reg + off) = src_reg */
+ case BPF_STX | BPF_MEM | BPF_B:
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp2[0]),
+ STACK_VAR(dst_lo));
+ /* mov edx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp2[1]),
+ STACK_VAR(src_lo));
+ /* emit 'mov byte ptr [eax + off], dl' */
+ EMIT1(0x88);
+ goto stx;
+ case BPF_STX | BPF_MEM | BPF_H:
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp2[0]),
+ STACK_VAR(dst_lo));
+ /* mov edx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp2[1]),
+ STACK_VAR(src_lo));
+ EMIT2(0x66, 0x89);
+ goto stx;
+ case BPF_STX | BPF_MEM | BPF_W:
+ case BPF_STX | BPF_MEM | BPF_DW:
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp2[0]),
+ STACK_VAR(dst_lo));
+ /* mov edx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp2[1]),
+ STACK_VAR(src_lo));
+ EMIT1(0x89);
+
+stx:
+ if (is_imm8(insn->off))
+ EMIT2(add_2reg(0x40, tmp2[0], tmp2[1]),
+ insn->off);
+ else
+ EMIT1_off32(add_2reg(0x80, tmp2[0], tmp2[1]),
+ insn->off);
+
+ if (BPF_SIZE(code) == BPF_DW) {
+ /* mov edx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp2[1]),
+ STACK_VAR(src_hi));
+ EMIT1(0x89);
+
+ if (is_imm8(insn->off + 4)) {
+ EMIT2(add_2reg(0x40, tmp2[0], tmp2[1]),
+ insn->off + 4);
+ } else {
+ EMIT1(add_2reg(0x80, tmp2[0], tmp2[1]));
+ EMIT(insn->off + 4, 4);
+ }
+ }
+ break;
+
+ /* LDX: dst_reg = *(u8*)(src_reg + off) */
+ case BPF_LDX | BPF_MEM | BPF_B:
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(src_lo));
+ /* emit 'movzx edi, byte ptr [esi+off]' */
+ EMIT2(0x0F, 0xB6);
+ goto ldx;
+ case BPF_LDX | BPF_MEM | BPF_H:
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(src_lo));
+ /* emit 'movzx edi, word ptr [esi+off]' */
+ EMIT2(0x0F, 0xB7);
+ goto ldx;
+ case BPF_LDX | BPF_MEM | BPF_W:
+ case BPF_LDX | BPF_MEM | BPF_DW:
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(src_lo));
+ /* emit 'mov edi, dword ptr [esi+off]' */
+ EMIT1(0x8B);
+ldx:
+
+ if (is_imm8(insn->off))
+ EMIT2(add_2reg(0x40, tmp[0], tmp[1]),
+ insn->off);
+ else
+ EMIT1_off32(add_2reg(0x80, tmp[0], tmp[1]),
+ insn->off);
+
+ /* mov dword ptr [ebp+off],edi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[1]),
+ STACK_VAR(dst_lo));
+ switch (BPF_SIZE(code)) {
+ case BPF_B:
+ case BPF_H:
+ case BPF_W:
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP),
+ STACK_VAR(dst_hi));
+ EMIT(0x0, 4);
+ break;
+ case BPF_DW:
+ EMIT2_off32(0x8B,
+ add_2reg(0x80, tmp[0], tmp[1]),
+ insn->off + 4);
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp[1]),
+ STACK_VAR(dst_hi));
+ break;
+ default:
+ break;
+ }
+ break;
+ /* call */
+ case BPF_JMP | BPF_CALL:
+ {
+ const u8 *r1 = bpf2ia32[BPF_REG_1];
+ const u8 *r2 = bpf2ia32[BPF_REG_2];
+ const u8 *r3 = bpf2ia32[BPF_REG_3];
+ const u8 *r4 = bpf2ia32[BPF_REG_4];
+ const u8 *r5 = bpf2ia32[BPF_REG_5];
+
+ if (insn->src_reg == BPF_PSEUDO_CALL)
+ goto notyet;
+
+ func = (u8 *) __bpf_call_base + imm32;
+ jmp_offset = func - (image + addrs[i]);
+
+ if (!imm32 || !is_simm32(jmp_offset)) {
+ pr_err("unsupported bpf func %d addr %p image %p\n",
+ imm32, func, image);
+ return -EINVAL;
+ }
+
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp2[0]),
+ STACK_VAR(r1[0]));
+ /* mov edx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp2[1]),
+ STACK_VAR(r1[1]));
+
+ emit_push_r64(r5, &prog);
+ emit_push_r64(r4, &prog);
+ emit_push_r64(r3, &prog);
+ emit_push_r64(r2, &prog);
+
+ EMIT1_off32(0xE8, jmp_offset + 9);
+
+ /* mov dword ptr [ebp+off],eax */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp2[0]),
+ STACK_VAR(r0[0]));
+ /* mov dword ptr [ebp+off],edx */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp2[1]),
+ STACK_VAR(r0[1]));
+
+ /* add esp,32 */
+ EMIT3(0x83, add_1reg(0xC0, IA32_ESP), 32);
+ break;
+ }
+ case BPF_JMP | BPF_TAIL_CALL:
+ emit_bpf_tail_call(&prog);
+ break;
+
+ /* cond jump */
+ case BPF_JMP | BPF_JEQ | BPF_X:
+ case BPF_JMP | BPF_JNE | BPF_X:
+ case BPF_JMP | BPF_JGT | BPF_X:
+ case BPF_JMP | BPF_JLT | BPF_X:
+ case BPF_JMP | BPF_JGE | BPF_X:
+ case BPF_JMP | BPF_JLE | BPF_X:
+ case BPF_JMP | BPF_JSGT | BPF_X:
+ case BPF_JMP | BPF_JSLE | BPF_X:
+ case BPF_JMP | BPF_JSLT | BPF_X:
+ case BPF_JMP | BPF_JSGE | BPF_X:
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(src_hi));
+ /* cmp dword ptr [ebp+off], esi */
+ EMIT3(0x39, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_hi));
+
+ EMIT2(IA32_JNE, 6);
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(src_lo));
+ /* cmp dword ptr [ebp+off], esi */
+ EMIT3(0x39, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_lo));
+ goto emit_cond_jmp;
+
+ case BPF_JMP | BPF_JSET | BPF_X:
+ /* mov esi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_lo));
+ /* and esi,dword ptr [ebp+off]*/
+ EMIT3(0x23, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(src_lo));
+
+ /* mov edi,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[1]),
+ STACK_VAR(dst_hi));
+ /* and edi,dword ptr [ebp+off] */
+ EMIT3(0x23, add_2reg(0x40, IA32_EBP, tmp[1]),
+ STACK_VAR(src_hi));
+ /* or esi,edi */
+ EMIT2(0x09, add_2reg(0xC0, tmp[0], tmp[1]));
+ goto emit_cond_jmp;
+
+ case BPF_JMP | BPF_JSET | BPF_K: {
+ u32 hi;
+
+ hi = imm32 & (1<<31) ? (u32)~0 : 0;
+ /* mov esi,imm32 */
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), imm32);
+ /* and esi,dword ptr [ebp+off]*/
+ EMIT3(0x23, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_lo));
+
+ /* mov edi,imm32 */
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[1]), hi);
+ /* and edi,dword ptr [ebp+off] */
+ EMIT3(0x23, add_2reg(0x40, IA32_EBP, tmp[1]),
+ STACK_VAR(dst_hi));
+ /* or esi,edi */
+ EMIT2(0x09, add_2reg(0xC0, tmp[0], tmp[1]));
+ goto emit_cond_jmp;
+ }
+ case BPF_JMP | BPF_JEQ | BPF_K:
+ case BPF_JMP | BPF_JNE | BPF_K:
+ case BPF_JMP | BPF_JGT | BPF_K:
+ case BPF_JMP | BPF_JLT | BPF_K:
+ case BPF_JMP | BPF_JGE | BPF_K:
+ case BPF_JMP | BPF_JLE | BPF_K:
+ case BPF_JMP | BPF_JSGT | BPF_K:
+ case BPF_JMP | BPF_JSLE | BPF_K:
+ case BPF_JMP | BPF_JSLT | BPF_K:
+ case BPF_JMP | BPF_JSGE | BPF_K: {
+ u32 hi;
+
+ hi = imm32 & (1<<31) ? (u32)~0 : 0;
+ /* mov esi,imm32 */
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), hi);
+ /* cmp dword ptr [ebp+off],esi */
+ EMIT3(0x39, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_hi));
+
+ EMIT2(IA32_JNE, 6);
+ /* mov esi,imm32 */
+ EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), imm32);
+ /* cmp dword ptr [ebp+off],esi */
+ EMIT3(0x39, add_2reg(0x40, IA32_EBP, tmp[0]),
+ STACK_VAR(dst_lo));
+
+emit_cond_jmp: /* convert BPF opcode to x86 */
+ switch (BPF_OP(code)) {
+ case BPF_JEQ:
+ jmp_cond = IA32_JE;
+ break;
+ case BPF_JSET:
+ case BPF_JNE:
+ jmp_cond = IA32_JNE;
+ break;
+ case BPF_JGT:
+ /* GT is unsigned '>', JA in x86 */
+ jmp_cond = IA32_JA;
+ break;
+ case BPF_JLT:
+ /* LT is unsigned '<', JB in x86 */
+ jmp_cond = IA32_JB;
+ break;
+ case BPF_JGE:
+ /* GE is unsigned '>=', JAE in x86 */
+ jmp_cond = IA32_JAE;
+ break;
+ case BPF_JLE:
+ /* LE is unsigned '<=', JBE in x86 */
+ jmp_cond = IA32_JBE;
+ break;
+ case BPF_JSGT:
+ /* signed '>', JG in x86 */
+ jmp_cond = IA32_JG;
+ break;
+ case BPF_JSLT:
+ /* signed '<', JL in x86 */
+ jmp_cond = IA32_JL;
+ break;
+ case BPF_JSGE:
+ /* signed '>=', JGE in x86 */
+ jmp_cond = IA32_JGE;
+ break;
+ case BPF_JSLE:
+ /* signed '<=', JLE in x86 */
+ jmp_cond = IA32_JLE;
+ break;
+ default: /* to silence gcc warning */
+ return -EFAULT;
+ }
+ jmp_offset = addrs[i + insn->off] - addrs[i];
+ if (is_imm8(jmp_offset)) {
+ EMIT2(jmp_cond, jmp_offset);
+ } else if (is_simm32(jmp_offset)) {
+ EMIT2_off32(0x0F, jmp_cond + 0x10, jmp_offset);
+ } else {
+ pr_err("cond_jmp gen bug %llx\n", jmp_offset);
+ return -EFAULT;
+ }
+
+ break;
+ }
+ case BPF_JMP | BPF_JA:
+ jmp_offset = addrs[i + insn->off] - addrs[i];
+ if (!jmp_offset)
+ /* optimize out nop jumps */
+ break;
+emit_jmp:
+ if (is_imm8(jmp_offset)) {
+ EMIT2(0xEB, jmp_offset);
+ } else if (is_simm32(jmp_offset)) {
+ EMIT1_off32(0xE9, jmp_offset);
+ } else {
+ pr_err("jmp gen bug %llx\n", jmp_offset);
+ return -EFAULT;
+ }
+ break;
+
+ case BPF_LD | BPF_IND | BPF_W:
+ func = sk_load_word;
+ goto common_load;
+ case BPF_LD | BPF_ABS | BPF_W:
+ func = CHOOSE_LOAD_FUNC(imm32, sk_load_word);
+common_load:
+ jmp_offset = func - (image + addrs[i]);
+ if (!func || !is_simm32(jmp_offset)) {
+ pr_err("unsupported bpf func %d addr %p image %p\n",
+ imm32, func, image);
+ return -EINVAL;
+ }
+ if (BPF_MODE(code) == BPF_ABS) {
+ /* mov %edx, imm32 */
+ EMIT1_off32(0xBA, imm32);
+ } else {
+ /* mov edx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX),
+ STACK_VAR(src_lo));
+ if (imm32) {
+ if (is_imm8(imm32))
+ /* add %edx, imm8 */
+ EMIT3(0x83, 0xC2, imm32);
+ else
+ /* add %edx, imm32 */
+ EMIT2_off32(0x81, 0xC2, imm32);
+ }
+ }
+ emit_load_skb_data_hlen(&prog);
+ EMIT1_off32(0xE8, jmp_offset + 10); /* call */
+
+ /* mov dword ptr [ebp+off],eax */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(r0[0]));
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP), STACK_VAR(r0[1]));
+ EMIT(0x0, 4);
+ break;
+
+ case BPF_LD | BPF_IND | BPF_H:
+ func = sk_load_half;
+ goto common_load;
+ case BPF_LD | BPF_ABS | BPF_H:
+ func = CHOOSE_LOAD_FUNC(imm32, sk_load_half);
+ goto common_load;
+ case BPF_LD | BPF_IND | BPF_B:
+ func = sk_load_byte;
+ goto common_load;
+ case BPF_LD | BPF_ABS | BPF_B:
+ func = CHOOSE_LOAD_FUNC(imm32, sk_load_byte);
+ goto common_load;
+ /* STX XADD: lock *(u32 *)(dst + off) += src */
+ case BPF_STX | BPF_XADD | BPF_W:
+ /* STX XADD: lock *(u64 *)(dst + off) += src */
+ case BPF_STX | BPF_XADD | BPF_DW:
+ goto notyet;
+ case BPF_JMP | BPF_EXIT:
+ if (seen_exit) {
+ jmp_offset = ctx->cleanup_addr - addrs[i];
+ goto emit_jmp;
+ }
+ seen_exit = true;
+ /* update cleanup_addr */
+ ctx->cleanup_addr = proglen;
+ emit_epilogue(&prog, bpf_prog->aux->stack_depth);
+ break;
+notyet:
+ pr_info_once("*** NOT YET: opcode %02x ***\n", code);
+ return -EFAULT;
+ default:
+ /* This error will be seen if a new instruction was added
+ * to the interpreter but not to the JIT,
+ * or if there is junk in bpf_prog
+ */
+ pr_err("bpf_jit: unknown opcode %02x\n", code);
+ return -EINVAL;
+ }
+
+ ilen = prog - temp;
+ if (ilen > BPF_MAX_INSN_SIZE) {
+ pr_err("bpf_jit: fatal insn size error\n");
+ return -EFAULT;
+ }
+
+ if (image) {
+ if (unlikely(proglen + ilen > oldproglen)) {
+ pr_err("bpf_jit: fatal error\n");
+ return -EFAULT;
+ }
+ memcpy(image + proglen, temp, ilen);
+ }
+ proglen += ilen;
+ addrs[i] = proglen;
+ prog = temp;
+ }
+ return proglen;
+}
+
+struct ia32_jit_data {
+ struct bpf_binary_header *header;
+ int *addrs;
+ u8 *image;
+ int proglen;
+ struct jit_context ctx;
+};
+
+struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
+{
+ struct bpf_binary_header *header = NULL;
+ struct bpf_prog *tmp, *orig_prog = prog;
+ struct ia32_jit_data *jit_data;
+ int proglen, oldproglen = 0;
+ struct jit_context ctx = {};
+ bool tmp_blinded = false;
+ bool extra_pass = false;
+ u8 *image = NULL;
+ int *addrs;
+ int pass;
+ int i;
+
+ if (!prog->jit_requested)
+ return orig_prog;
+
+ tmp = bpf_jit_blind_constants(prog);
+ /* If blinding was requested and we failed during blinding,
+ * we must fall back to the interpreter.
+ */
+ if (IS_ERR(tmp))
+ return orig_prog;
+ if (tmp != prog) {
+ tmp_blinded = true;
+ prog = tmp;
+ }
+
+ jit_data = prog->aux->jit_data;
+ if (!jit_data) {
+ jit_data = kzalloc(sizeof(*jit_data), GFP_KERNEL);
+ if (!jit_data) {
+ prog = orig_prog;
+ goto out;
+ }
+ prog->aux->jit_data = jit_data;
+ }
+ addrs = jit_data->addrs;
+ if (addrs) {
+ ctx = jit_data->ctx;
+ oldproglen = jit_data->proglen;
+ image = jit_data->image;
+ header = jit_data->header;
+ extra_pass = true;
+ goto skip_init_addrs;
+ }
+ addrs = kmalloc(prog->len * sizeof(*addrs), GFP_KERNEL);
+ if (!addrs) {
+ prog = orig_prog;
+ goto out_addrs;
+ }
+
+ /* Before first pass, make a rough estimation of addrs[];
+ * each bpf instruction is translated to less than 64 bytes
+ */
+ for (proglen = 0, i = 0; i < prog->len; i++) {
+ proglen += 64;
+ addrs[i] = proglen;
+ }
+ ctx.cleanup_addr = proglen;
+skip_init_addrs:
+
+ /* JITed image shrinks with every pass and the loop iterates
+ * until the image stops shrinking. Very large bpf programs
+ * may converge on the last pass. In such case do one more
+ * pass to emit the final image
+ */
+ for (pass = 0; pass < 20 || image; pass++) {
+ proglen = do_jit(prog, addrs, image, oldproglen, &ctx);
+ if (proglen <= 0) {
+ image = NULL;
+ if (header)
+ bpf_jit_binary_free(header);
+ prog = orig_prog;
+ goto out_addrs;
+ }
+ if (image) {
+ if (proglen != oldproglen) {
+ pr_err("bpf_jit: proglen=%d != oldproglen=%d\n",
+ proglen, oldproglen);
+ prog = orig_prog;
+ goto out_addrs;
+ }
+ break;
+ }
+ if (proglen == oldproglen) {
+ header = bpf_jit_binary_alloc(proglen, &image,
+ 1, jit_fill_hole);
+ if (!header) {
+ prog = orig_prog;
+ goto out_addrs;
+ }
+ }
+ oldproglen = proglen;
+ cond_resched();
+ }
+
+ if (bpf_jit_enable > 1)
+ bpf_jit_dump(prog->len, proglen, pass + 1, image);
+
+ if (image) {
+ if (!prog->is_func || extra_pass) {
+ bpf_jit_binary_lock_ro(header);
+ } else {
+ jit_data->addrs = addrs;
+ jit_data->ctx = ctx;
+ jit_data->proglen = proglen;
+ jit_data->image = image;
+ jit_data->header = header;
+ }
+ prog->bpf_func = (void *)image;
+ prog->jited = 1;
+ prog->jited_len = proglen;
+ } else {
+ prog = orig_prog;
+ }
+
+ if (!prog->is_func || extra_pass) {
+out_addrs:
+ kfree(addrs);
+ kfree(jit_data);
+ prog->aux->jit_data = NULL;
+ }
+out:
+ if (tmp_blinded)
+ bpf_jit_prog_release_other(prog, prog == orig_prog ?
+ tmp : orig_prog);
+ return prog;
+}
--
1.8.5.6.2.g3d8a54e.dirty
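
For readers following the pass loop in bpf_int_jit_compile() above, here is a
rough userspace sketch of why the image shrinks and then stabilises. It is not
the patch's code: toy_pass(), toy_converge(), the 3-byte plain instruction and
the 2/5-byte jump sizes are all assumptions made for illustration. Only the
64-byte initial estimate, the 20-pass cap and the "stop when proglen equals
oldproglen" rule are taken from the patch.

```c
#include <assert.h>

#define NINSN 40

/* One sizing pass. off[i] != 0 marks instruction i as a jump whose
 * displacement is addrs[i + off[i]] - addrs[i], as in do_jit().
 * addrs[i] holds the end offset of instruction i.
 */
static int toy_pass(const int *off, int *addrs)
{
	int proglen = 0;
	int i;

	for (i = 0; i < NINSN; i++) {
		int ilen = 3;	/* assumed size of a non-jump instruction */

		if (off[i]) {
			int disp;

			assert(i + off[i] >= 0 && i + off[i] < NINSN);
			disp = addrs[i + off[i]] - addrs[i];
			/* short form if the displacement fits in a signed byte */
			ilen = (disp >= -128 && disp <= 127) ? 2 : 5;
		}
		proglen += ilen;
		addrs[i] = proglen;
	}
	return proglen;
}

/* Repeat passes until the program length stops changing. */
static int toy_converge(const int *off, int *addrs, int *passes)
{
	int proglen = 0, oldproglen = 0;
	int pass, i;

	/* pessimistic initial estimate: 64 bytes per instruction */
	for (i = 0; i < NINSN; i++) {
		proglen += 64;
		addrs[i] = proglen;
	}

	for (pass = 0; pass < 20; pass++) {
		proglen = toy_pass(off, addrs);
		if (proglen == oldproglen)
			break;
		oldproglen = proglen;
	}
	*passes = pass + 1;
	return proglen;
}
```

With a single forward jump at instruction 0 (off[0] = 38), the first pass sees
the inflated estimate and picks the long jump, the second pass sees real
offsets and switches to the short form, and the third pass confirms the length
no longer changes.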
* Re: [net-next PATCH] bpf: reserve xdp_frame size in xdp headroom
From: Daniel Borkmann @ 2018-04-19 15:55 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Daniel Borkmann, Alexei Starovoitov; +Cc: netdev
In-Reply-To: <152414743253.1777.13128952001748907524.stgit@firesoul>
On 04/19/2018 04:17 PM, Jesper Dangaard Brouer wrote:
> Commit 6dfb970d3dbd ("xdp: avoid leaking info stored in frame data on
> page reuse") tried to allow user/bpf_prog to (re)use area used by
> xdp_frame (stored in frame headroom), by memset clearing area when
> bpf_xdp_adjust_head gives bpf_prog access to headroom area.
>
> The mentioned commit had two bugs. (1) Didn't take bpf_xdp_adjust_meta
> into account. (2) a combination of bpf_xdp_adjust_head calls, where
> xdp->data is moved into xdp_frame section, can cause clearing
> xdp_frame area again for area previously granted to bpf_prog.
>
> After discussions with Daniel, we chose to implement a simpler
> solution to the problem, which is to reserve the headroom used by
> xdp_frame info.
>
> This also avoids the situation where bpf_prog is allowed to adjust/add
> headers, and then XDP_REDIRECT later drops the packet due to lack of
> headroom for the xdp_frame. This would likely confuse the end-user.
>
> Fixes: 6dfb970d3dbd ("xdp: avoid leaking info stored in frame data on page reuse")
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Applied to bpf-next, thanks Jesper!
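
As a rough illustration of what "reserve the headroom used by xdp_frame info"
means for a head adjustment, here is a userspace sketch. It is not the applied
patch: struct toy_xdp_buff, toy_xdp_adjust_head() and the 32-byte
TOY_XDP_FRAME_SIZE are hypothetical stand-ins, and offsets are plain integers
measured from the start of the buffer rather than kernel pointers.

```c
#include <assert.h>
#include <errno.h>

#define TOY_XDP_FRAME_SIZE 32	/* assumed stand-in for sizeof(struct xdp_frame) */
#define TOY_ETH_HLEN 14		/* minimum packet that must remain */

/* Offsets are measured from the start of the buffer (data_hard_start). */
struct toy_xdp_buff {
	long data;	/* first packet byte */
	long data_end;	/* one past the last packet byte */
};

/* Move the packet start by 'offset' bytes, refusing to enter the area
 * reserved at the front of the buffer for the frame bookkeeping.
 */
static int toy_xdp_adjust_head(struct toy_xdp_buff *xdp, long offset)
{
	long data = xdp->data + offset;

	assert(xdp->data <= xdp->data_end);

	if (data < TOY_XDP_FRAME_SIZE || data > xdp->data_end - TOY_ETH_HLEN)
		return -EINVAL;

	xdp->data = data;
	return 0;
}
```

Because the reserved area is never handed to the program, nothing needs to be
cleared later, and a subsequent redirect always has room for its bookkeeping.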
* Re: ath6kl: fix spelling mistake: "chache" -> "cache"
From: Kalle Valo @ 2018-04-19 15:57 UTC (permalink / raw)
To: Colin Ian King
Cc: Kalle Valo, linux-wireless, netdev, kernel-janitors, linux-kernel
In-Reply-To: <20180329165304.26504-1-colin.king@canonical.com>
Colin Ian King <colin.king@canonical.com> wrote:
> Trivial fix to spelling mistake in message text
>
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Patch applied to ath-next branch of ath.git, thanks.
5072d87426bb ath6kl: fix spelling mistake: "chache" -> "cache"
--
https://patchwork.kernel.org/patch/10315713/
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
* KASAN: use-after-free Read in llc_conn_tmr_common_cb
From: syzbot @ 2018-04-19 16:06 UTC (permalink / raw)
To: davem, keescook, linux-kernel, netdev, syzkaller-bugs,
xiyou.wangcong
Hello,
syzbot hit the following crash on upstream commit
a27fc14219f2e3c4a46ba9177b04d9b52c875532 (Mon Apr 16 21:07:39 2018 +0000)
Merge branch 'parisc-4.17-3' of
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
syzbot dashboard link:
https://syzkaller.appspot.com/bug?extid=f922284c18ea23a8e457
Unfortunately, I don't have any reproducer for this crash yet.
Raw console output:
https://syzkaller.appspot.com/x/log.txt?id=6056927826018304
Kernel config:
https://syzkaller.appspot.com/x/.config?id=-5914490758943236750
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+f922284c18ea23a8e457@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for details.
If you forward the report, please keep this part and the footer.
binder: 10195:10196 transaction failed 29189/-3, size 0-0 line 2963
binder: undelivered TRANSACTION_ERROR: 29189
binder: undelivered TRANSACTION_ERROR: 29189
binder: undelivered TRANSACTION_ERROR: 29189
==================================================================
BUG: KASAN: use-after-free in __lock_acquire+0x3888/0x5140 kernel/locking/lockdep.c:3310
Read of size 8 at addr ffff8801a8c862e0 by task swapper/0/0
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.17.0-rc1+ #6
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
print_address_description+0x6c/0x20b mm/kasan/report.c:256
kasan_report_error mm/kasan/report.c:354 [inline]
kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
__asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
__lock_acquire+0x3888/0x5140 kernel/locking/lockdep.c:3310
lock_acquire+0x1dc/0x520 kernel/locking/lockdep.c:3920
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:144
spin_lock include/linux/spinlock.h:310 [inline]
llc_conn_tmr_common_cb+0x8d/0x9e0 net/llc/llc_c_ac.c:1328
llc_conn_ack_tmr_cb+0x1e/0x30 net/llc/llc_c_ac.c:1357
call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
expire_timers kernel/time/timer.c:1363 [inline]
__run_timers+0x79e/0xc50 kernel/time/timer.c:1666
run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
__do_softirq+0x2e0/0xaf5 kernel/softirq.c:285
invoke_softirq kernel/softirq.c:365 [inline]
irq_exit+0x1d1/0x200 kernel/softirq.c:405
exiting_irq arch/x86/include/asm/apic.h:525 [inline]
smp_apic_timer_interrupt+0x17e/0x710 arch/x86/kernel/apic/apic.c:1052
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:863
</IRQ>
RIP: 0010:native_safe_halt+0x6/0x10 arch/x86/include/asm/irqflags.h:54
RSP: 0018:ffffffff88a07bc0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
RAX: dffffc0000000000 RBX: 1ffffffff1140f7b RCX: 0000000000000000
RDX: 1ffffffff1163130 RSI: 0000000000000001 RDI: ffffffff88b18980
RBP: ffffffff88a07bc0 R08: ffffed003b6046c3 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffffffff88a07c78 R14: ffffffff89591560 R15: 0000000000000000
arch_safe_halt arch/x86/include/asm/paravirt.h:94 [inline]
default_idle+0xc2/0x440 arch/x86/kernel/process.c:354
arch_cpu_idle+0x10/0x20 arch/x86/kernel/process.c:345
default_idle_call+0x6d/0x90 kernel/sched/idle.c:93
cpuidle_idle_call kernel/sched/idle.c:153 [inline]
do_idle+0x395/0x560 kernel/sched/idle.c:262
cpu_startup_entry+0x104/0x120 kernel/sched/idle.c:368
rest_init+0xe1/0xe4 init/main.c:441
start_kernel+0x906/0x92d init/main.c:737
x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:445
x86_64_start_kernel+0x76/0x79 arch/x86/kernel/head64.c:426
secondary_startup_64+0xa5/0xb0 arch/x86/kernel/head_64.S:242
Allocated by task 10136:
save_stack+0x43/0xd0 mm/kasan/kasan.c:448
set_track mm/kasan/kasan.c:460 [inline]
kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
__do_kmalloc mm/slab.c:3718 [inline]
__kmalloc+0x14e/0x760 mm/slab.c:3727
kmalloc include/linux/slab.h:517 [inline]
sk_prot_alloc+0x1ae/0x2e0 net/core/sock.c:1474
sk_alloc+0x104/0x17b0 net/core/sock.c:1528
llc_sk_alloc+0x35/0x4b0 net/llc/llc_conn.c:949
llc_ui_create+0xf3/0x3e0 net/llc/af_llc.c:173
__sock_create+0x526/0x920 net/socket.c:1285
sock_create net/socket.c:1325 [inline]
__sys_socket+0x100/0x250 net/socket.c:1355
__do_sys_socket net/socket.c:1364 [inline]
__se_sys_socket net/socket.c:1362 [inline]
__x64_sys_socket+0x73/0xb0 net/socket.c:1362
do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Freed by task 10215:
save_stack+0x43/0xd0 mm/kasan/kasan.c:448
set_track mm/kasan/kasan.c:460 [inline]
__kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
__cache_free mm/slab.c:3498 [inline]
kfree+0xd9/0x260 mm/slab.c:3813
sk_prot_free net/core/sock.c:1511 [inline]
__sk_destruct+0x772/0xa40 net/core/sock.c:1593
sk_destruct+0x78/0x90 net/core/sock.c:1601
__sk_free+0x22e/0x340 net/core/sock.c:1612
sk_free+0x42/0x50 net/core/sock.c:1623
sock_put include/net/sock.h:1664 [inline]
llc_sk_free+0x9a/0xb0 net/llc/llc_conn.c:997
llc_ui_release+0x154/0x220 net/llc/af_llc.c:208
sock_release+0x96/0x1b0 net/socket.c:594
sock_close+0x16/0x20 net/socket.c:1149
__fput+0x34d/0x890 fs/file_table.c:209
____fput+0x15/0x20 fs/file_table.c:243
task_work_run+0x1e4/0x290 kernel/task_work.c:113
exit_task_work include/linux/task_work.h:22 [inline]
do_exit+0x1aee/0x2730 kernel/exit.c:865
do_group_exit+0x16f/0x430 kernel/exit.c:968
get_signal+0x886/0x1960 kernel/signal.c:2469
do_signal+0x98/0x2040 arch/x86/kernel/signal.c:810
exit_to_usermode_loop+0x28a/0x310 arch/x86/entry/common.c:162
prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
do_syscall_64+0x6ac/0x800 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
The buggy address belongs to the object at ffff8801a8c86240
which belongs to the cache kmalloc-2048 of size 2048
The buggy address is located 160 bytes inside of
2048-byte region [ffff8801a8c86240, ffff8801a8c86a40)
The buggy address belongs to the page:
page:ffffea0006a32180 count:1 mapcount:0 mapping:ffff8801a8c86240 index:0x0
compound_mapcount: 0
flags: 0x2fffc0000008100(slab|head)
raw: 02fffc0000008100 ffff8801a8c86240 0000000000000000 0000000100000003
raw: ffffea0006c17920 ffffea0006bf50a0 ffff8801dac00c40 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8801a8c86180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff8801a8c86200: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
> ffff8801a8c86280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8801a8c86300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8801a8c86380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
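
The addresses in the report above can be cross-checked with a little
arithmetic. The sketch below is only a reading aid, not part of KASAN: the
helper names are made up, the 8-bytes-per-shadow-byte granule is the generic
KASAN scale, and the 16 shadow bytes per row is taken from the dump format
shown in this report.

```c
#include <assert.h>
#include <stdint.h>

#define KASAN_GRANULE 8		/* bytes of memory covered by one shadow byte */
#define KASAN_ROW_SHADOW_BYTES 16	/* shadow bytes printed per row in the dump */

/* Offset of the faulting address inside the slab object. */
static uint64_t kasan_offset_in_object(uint64_t addr, uint64_t object)
{
	return addr - object;
}

/* First memory address covered by the dump row that contains addr. */
static uint64_t kasan_row_start(uint64_t addr)
{
	uint64_t row_span = (uint64_t)KASAN_GRANULE * KASAN_ROW_SHADOW_BYTES;

	return addr & ~(row_span - 1);
}

/* Zero-based column of the shadow byte for addr within its dump row. */
static unsigned int kasan_row_column(uint64_t addr)
{
	unsigned int col = (unsigned int)((addr - kasan_row_start(addr)) /
					  KASAN_GRANULE);

	assert(col < KASAN_ROW_SHADOW_BYTES);
	return col;
}
```

For the access at ffff8801a8c862e0 and the object at ffff8801a8c86240 this
gives an offset of 160 bytes, matching the report, and places the faulting
shadow byte in the row starting at ffff8801a8c86280, which is the row marked
with '>'.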
---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkaller@googlegroups.com.
syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug report.
Note: all commands must start from beginning of the line in the email body.
* Re: [PATCH resend v3 3/3] dt-bindings: Document the DT bindings for lan78xx
From: Rob Herring @ 2018-04-19 16:07 UTC (permalink / raw)
To: Phil Elwell
Cc: Woojung Huh, Microchip Linux Driver Support, Mark Rutland,
Andrew Lunn, Florian Fainelli, David S. Miller,
Mauro Carvalho Chehab, Greg Kroah-Hartman, Linus Walleij,
Andrew Morton, Randy Dunlap, netdev, devicetree,
linux-kernel@vger.kernel.org, Linux USB List
In-Reply-To: <1524151019-82823-4-git-send-email-phil@raspberrypi.org>
On Thu, Apr 19, 2018 at 10:16 AM, Phil Elwell <phil@raspberrypi.org> wrote:
> The Microchip LAN78XX family of devices are Ethernet controllers with
> a USB interface. Despite being discoverable devices it can be useful to
> be able to configure them from Device Tree, particularly in low-cost
> applications without an EEPROM or programmed OTP.
>
> Document the supported properties in a bindings file.
>
> Signed-off-by: Phil Elwell <phil@raspberrypi.org>
> ---
> .../devicetree/bindings/net/microchip,lan78xx.txt | 54 ++++++++++++++++++++++
> MAINTAINERS | 1 +
> 2 files changed, 55 insertions(+)
> create mode 100644 Documentation/devicetree/bindings/net/microchip,lan78xx.txt
>
> diff --git a/Documentation/devicetree/bindings/net/microchip,lan78xx.txt b/Documentation/devicetree/bindings/net/microchip,lan78xx.txt
> new file mode 100644
> index 0000000..a5d701b
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/microchip,lan78xx.txt
> @@ -0,0 +1,54 @@
> +Microchip LAN78xx Gigabit Ethernet controller
> +
> +The LAN78XX devices are usually configured by programming their OTP or with
> +an external EEPROM, but some platforms (e.g. Raspberry Pi 3 B+) have neither.
> +The Device Tree properties, if present, override the OTP and EEPROM.
> +
> +Required properties:
> +- compatible: Should be one of "usb424,7800", "usb424,7801" or "usb424,7850".
> +
> +Optional properties:
> +- local-mac-address: see ethernet.txt
> +- mac-address: see ethernet.txt
> +
> +Optional properties of the embedded PHY:
> +- microchip,led-modes: a 0..4 element vector, with each element configuring
> + the operating mode of an LED. Omitted LEDs are turned off. Allowed values
> + are defined in "include/dt-bindings/net/microchip-lan78xx.h".
> +
> +Example:
> +
> +/* Based on the configuration for a Raspberry Pi 3 B+ */
> +&usb {
> + usb1@1 {
Same comments as in the dts file:
usb-port@1
> + compatible = "usb424,2514";
> + reg = <1>;
> + #address-cells = <1>;
> + #size-cells = <0>;
> +
> + usb1_1@1 {
usb-port@1
> + compatible = "usb424,2514";
> + reg = <1>;
> + #address-cells = <1>;
> + #size-cells = <0>;
> +
> + ethernet: usbether@1 {
ethernet@1
> + compatible = "usb424,7800";
> + reg = <1>;
> + local-mac-address = [ 00 11 22 33 44 55 ];
> +
> + mdio {
> + #address-cells = <0x1>;
> + #size-cells = <0x0>;
> + eth_phy: ethernet-phy@1 {
> + reg = <1>;
> + microchip,led-modes = <
> + LAN78XX_LINK_1000_ACTIVITY
> + LAN78XX_LINK_10_100_ACTIVITY
> + >;
> + };
> + };
> + };
> + };
> + };
> +};
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 23735d9..91cb961 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -14572,6 +14572,7 @@ M: Woojung Huh <woojung.huh@microchip.com>
> M: Microchip Linux Driver Support <UNGLinuxDriver@microchip.com>
> L: netdev@vger.kernel.org
> S: Maintained
> +F: Documentation/devicetree/bindings/net/microchip,lan78xx.txt
> F: drivers/net/usb/lan78xx.*
> F: include/dt-bindings/net/microchip-lan78xx.h
>
> --
> 2.7.4
>
^ permalink raw reply
* [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Mikulas Patocka @ 2018-04-19 16:12 UTC (permalink / raw)
To: David Miller, Andrew Morton, linux-mm
Cc: eric.dumazet, edumazet, bhutchings, netdev, linux-kernel, mst,
jasowang, virtualization, dm-devel, Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804181350050.17942@file01.intranet.prod.int.rdu2.redhat.com>
On Wed, 18 Apr 2018, Mikulas Patocka wrote:
>
>
> On Wed, 18 Apr 2018, David Miller wrote:
>
> > From: Mikulas Patocka <mpatocka@redhat.com>
> > Date: Wed, 18 Apr 2018 12:44:25 -0400 (EDT)
> >
> > > The structure net_device is followed by arbitrary driver-specific data
> > > (accessible with the function netdev_priv). And for virtio-net, these
> > > driver-specific data must be in DMA memory.
> >
> > And we are saying that this assumption is wrong and needs to be
> > corrected.
>
> So, try to find all the networking drivers that do DMA to the private
> area.
>
> The problem here is that kvzalloc usually returns DMA-able area, but it
> may return non-DMA area rarely, if the memory is too fragmented. So, we
> are in a situation, where some networking drivers will randomly fail. Go
> and find them.
>
> Mikulas
Here I submit a patch that makes kvmalloc always use vmalloc if
CONFIG_DEBUG_VM is defined.
From: Mikulas Patocka <mpatocka@redhat.com>
Subject: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
The kvmalloc function tries to use kmalloc and falls back to vmalloc if
kmalloc fails.
Unfortunately, some kernel code has bugs - it uses kvmalloc and then
uses DMA-API on the returned memory or frees it with kfree. Such bugs were
found in the virtio-net driver, dm-integrity and RHEL7 powerpc-specific
code.
These bugs are hard to reproduce because kvmalloc falls back to vmalloc
only if memory is fragmented.
In order to detect these bugs reliably I submit this patch that changes
kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
mm/util.c | 2 ++
1 file changed, 2 insertions(+)
Index: linux-2.6/mm/util.c
===================================================================
--- linux-2.6.orig/mm/util.c 2018-04-18 15:46:23.000000000 +0200
+++ linux-2.6/mm/util.c 2018-04-18 16:00:43.000000000 +0200
@@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
*/
void *kvmalloc_node(size_t size, gfp_t flags, int node)
{
+#ifndef CONFIG_DEBUG_VM
gfp_t kmalloc_flags = flags;
void *ret;
@@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
*/
if (ret || size <= PAGE_SIZE)
return ret;
+#endif
return __vmalloc_node_flags_caller(size, node, flags,
__builtin_return_address(0));
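
For illustration only (not part of the patch): the control flow that the
two hunks change can be modelled in user space. This is a minimal sketch
under stated assumptions. Every name in it (model_kmalloc, model_vmalloc,
model_kvmalloc, MODEL_PAGE_SIZE, the "fragmented" and "debug_vm" flags) is
hypothetical and stands in for the kernel counterpart; malloc() is used for
both allocators, so only the decision logic is meaningful.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
enum alloc_kind { ALLOC_NONE, ALLOC_CONTIGUOUS, ALLOC_VIRTUAL };

struct model_buf {
	void *ptr;
	enum alloc_kind kind;
};

#define MODEL_PAGE_SIZE 4096u

/* Stand-in for kmalloc: fails when memory is "fragmented". */
static void *model_kmalloc(size_t size, bool fragmented)
{
	return fragmented ? NULL : malloc(size);
}

/* Stand-in for vmalloc: assumed to succeed whenever malloc does. */
static void *model_vmalloc(size_t size)
{
	return malloc(size);
}

/*
 * Model of kvmalloc. When debug_vm is set (standing in for
 * CONFIG_DEBUG_VM), the contiguous attempt is skipped entirely, so a
 * caller that wrongly assumes kmalloc memory misbehaves on every call
 * instead of only under fragmentation.
 */
static struct model_buf model_kvmalloc(size_t size, bool fragmented,
				       bool debug_vm)
{
	struct model_buf buf = { NULL, ALLOC_NONE };

	if (!debug_vm) {
		buf.ptr = model_kmalloc(size, fragmented);
		if (buf.ptr) {
			buf.kind = ALLOC_CONTIGUOUS;
			return buf;
		}
		/* Small requests do not fall back, as in the quoted hunk. */
		if (size <= MODEL_PAGE_SIZE)
			return buf;
	}

	buf.ptr = model_vmalloc(size);
	if (buf.ptr)
		buf.kind = ALLOC_VIRTUAL;
	return buf;
}
```

In this model, a large request without fragmentation yields
ALLOC_CONTIGUOUS when debug_vm is false and ALLOC_VIRTUAL when it is true,
which is the property the patch relies on to expose the buggy callers.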
^ permalink raw reply
* [PATCH net-next] net/ipv6: Fix ip6_convert_metrics() bug
From: Eric Dumazet @ 2018-04-19 16:14 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet, David Ahern
If ip6_convert_metrics() fails to allocate memory, it should not
overwrite rt->fib6_metrics or we risk a crash later as syzbot found.
BUG: KASAN: null-ptr-deref in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
BUG: KASAN: null-ptr-deref in refcount_sub_and_test+0x92/0x330 lib/refcount.c:179
Read of size 4 at addr 0000000000000044 by task syzkaller832429/4487
CPU: 1 PID: 4487 Comm: syzkaller832429 Not tainted 4.16.0+ #6
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
kasan_report_error mm/kasan/report.c:352 [inline]
kasan_report.cold.7+0x6d/0x2fe mm/kasan/report.c:412
check_memory_region_inline mm/kasan/kasan.c:260 [inline]
check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
refcount_sub_and_test+0x92/0x330 lib/refcount.c:179
refcount_dec_and_test+0x1a/0x20 lib/refcount.c:212
fib6_info_destroy+0x2d0/0x3c0 net/ipv6/ip6_fib.c:206
fib6_info_release include/net/ip6_fib.h:304 [inline]
ip6_route_info_create+0x677/0x3240 net/ipv6/route.c:3020
ip6_route_add+0x23/0xb0 net/ipv6/route.c:3030
inet6_rtm_newroute+0x142/0x160 net/ipv6/route.c:4406
rtnetlink_rcv_msg+0x466/0xc10 net/core/rtnetlink.c:4648
netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4666
netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
sock_sendmsg_nosec net/socket.c:629 [inline]
sock_sendmsg+0xd5/0x120 net/socket.c:639
___sys_sendmsg+0x805/0x940 net/socket.c:2117
__sys_sendmsg+0x115/0x270 net/socket.c:2155
SYSC_sendmsg net/socket.c:2164 [inline]
SyS_sendmsg+0x29/0x30 net/socket.c:2162
do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x42/0xb7
Fixes: d4ead6b34b67 ("net/ipv6: move metrics from dst to rt6_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsa@cumulusnetworks.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
---
net/ipv6/route.c | 20 +++++++++-----------
1 file changed, 9 insertions(+), 11 deletions(-)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f9c363327d620322dda2269d8398a5dc5de7aa4e..9279f4ec84b6b885357390d3de0826a8a4c54daf 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2600,21 +2600,19 @@ static int ip6_dst_gc(struct dst_ops *ops)
static int ip6_convert_metrics(struct net *net, struct fib6_info *rt,
struct fib6_config *cfg)
{
- int err = 0;
+ struct dst_metrics *p;
- if (cfg->fc_mx) {
- rt->fib6_metrics = kzalloc(sizeof(*rt->fib6_metrics),
- GFP_KERNEL);
- if (unlikely(!rt->fib6_metrics))
- return -ENOMEM;
+ if (!cfg->fc_mx)
+ return 0;
- refcount_set(&rt->fib6_metrics->refcnt, 1);
+ p = kzalloc(sizeof(*rt->fib6_metrics), GFP_KERNEL);
+ if (unlikely(!p))
+ return -ENOMEM;
- err = ip_metrics_convert(net, cfg->fc_mx, cfg->fc_mx_len,
- rt->fib6_metrics->metrics);
- }
+ refcount_set(&p->refcnt, 1);
+ rt->fib6_metrics = p;
- return err;
+ return ip_metrics_convert(net, cfg->fc_mx, cfg->fc_mx_len, p->metrics);
}
static struct rt6_info *ip6_nh_lookup_table(struct net *net,
--
2.17.0.484.g0c8726318c-goog
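
For illustration only (not part of the patch): the fix follows the usual
"allocate into a local, publish only on success" pattern. Below is a
minimal user-space sketch of that pattern. All names (metrics_model,
route_model, default_metrics, alloc_fn, route_set_metrics) are
hypothetical, and the sketch assumes the route's metrics pointer starts
out referring to a shared default object that later cleanup code may
dereference.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel structures. */
struct metrics_model {
	int refcnt;
	unsigned int values[4];
};

struct route_model {
	struct metrics_model *metrics;
};

/* Assumed shared default that a freshly created route points at. */
static struct metrics_model default_metrics = { .refcnt = 1 };

/* Allocator hook so that an allocation failure can be simulated. */
typedef void *(*alloc_fn)(size_t size);

static void *zeroing_alloc(size_t size)
{
	return calloc(1, size);
}

static void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}

/*
 * Allocate into a local pointer and assign rt->metrics only once the
 * allocation has succeeded. On failure rt->metrics keeps its previous,
 * valid value, so cleanup code never sees a NULL pointer.
 */
static int route_set_metrics(struct route_model *rt, alloc_fn alloc)
{
	struct metrics_model *p = alloc(sizeof(*p));

	if (!p)
		return -ENOMEM;

	p->refcnt = 1;
	rt->metrics = p;
	return 0;
}
```

Calling route_set_metrics() with failing_alloc returns -ENOMEM and leaves
rt->metrics pointing at default_metrics. The pre-patch code corresponds to
assigning the allocation result straight into rt->metrics, which would
leave NULL there on failure.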
^ permalink raw reply related