* Re: WARNING: refcount bug in should_fail
From: Eric W. Biederman @ 2018-04-02 20:30 UTC (permalink / raw)
To: Tetsuo Handa
Cc: syzbot+, syzkaller-bugs, dvyukov, linux-fsdevel, linux-mm, netdev,
viro
In-Reply-To: <201804011941.IAE69652.OHMVJLFtSOFFQO@I-love.SAKURA.ne.jp>
Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> writes:
> syzbot wrote:
>> > On Sun, Mar 4, 2018 at 6:57 AM, Tetsuo Handa
>> > <penguin-kernel@i-love.sakura.ne.jp> wrote:
>> >> Switching from mm to fsdevel, for this report says that put_net(net) in
>> >> rpc_kill_sb() made net->count < 0 when mount_ns() failed due to
>> >> register_shrinker() failure.
>>
>> >> Relevant commits will be
>> >> commit 9ee332d99e4d5a97 ("sget(): handle failures of
>> >> register_shrinker()") and
>> >> commit d91ee87d8d85a080 ("vfs: Pass data, ns, and ns->userns to
>> >> mount_ns.").
>>
>> >> When sget_userns() in mount_ns() failed, mount_ns() returns an error
>> >> code to
>> >> the caller without calling fill_super(). That is, get_net(sb->s_fs_info)
>> >> was
>> >> not called by rpc_fill_super() (via fill_super callback passed to
>> >> mount_ns())
>> >> but put_net(sb->s_fs_info) is called by rpc_kill_sb() (via fs->kill_sb()
>> >> from
>> >> deactivate_locked_super()).
>>
>> >> ----------
>> >> static struct dentry *
>> >> rpc_mount(struct file_system_type *fs_type,
>> >> int flags, const char *dev_name, void *data)
>> >> {
>> >> struct net *net = current->nsproxy->net_ns;
>> >> return mount_ns(fs_type, flags, data, net, net->user_ns,
>> >> rpc_fill_super);
>> >> }
>> >> ----------
>>
>> > Messed kernel output, this is definitely not in should_fail.
>>
>> > #syz dup: WARNING: refcount bug in sk_alloc
>>
>> Can't find the corresponding bug.
>>
> I don't think this is a dup of existing bug.
> We need to fix either 9ee332d99e4d5a97 or d91ee87d8d85a080.
Even if expanding mount_ns to more filesystems was magically fixed,
proc would still have this issue with the pid namespace rather than
the net namespace.
This is a mess. I will take a look and see if I can see a a fix.
Eric
^ permalink raw reply
* Re: [PATCH v5 12/14] fm10k: Report PCIe link properties with pcie_print_link_status()
From: Bjorn Helgaas @ 2018-04-02 20:31 UTC (permalink / raw)
To: Keller, Jacob E
Cc: Tal Gilboa, Tariq Toukan, Ariel Elior, Ganesh Goudar,
Kirsher, Jeffrey T, everest-linux-l2@cavium.com,
intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
In-Reply-To: <02874ECE860811409154E81DA85FBB5882D497EE@ORSMSX115.amr.corp.intel.com>
On Mon, Apr 02, 2018 at 03:56:06PM +0000, Keller, Jacob E wrote:
> > -----Original Message-----
> > From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> > Sent: Friday, March 30, 2018 2:06 PM
> > To: Tal Gilboa <talgi@mellanox.com>
> > Cc: Tariq Toukan <tariqt@mellanox.com>; Keller, Jacob E
> > <jacob.e.keller@intel.com>; Ariel Elior <ariel.elior@cavium.com>; Ganesh
> > Goudar <ganeshgr@chelsio.com>; Kirsher, Jeffrey T
> > <jeffrey.t.kirsher@intel.com>; everest-linux-l2@cavium.com; intel-wired-
> > lan@lists.osuosl.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> > linux-pci@vger.kernel.org
> > Subject: [PATCH v5 12/14] fm10k: Report PCIe link properties with
> > pcie_print_link_status()
> >
> > From: Bjorn Helgaas <bhelgaas@google.com>
> >
> > Use pcie_print_link_status() to report PCIe link speed and possible
> > limitations instead of implementing this in the driver itself.
> >
> > Note that pcie_get_minimum_link() can return misleading information because
> > it finds the slowest link and the narrowest link without considering the
> > total bandwidth of the link. If the path contains a 16 GT/s x1 link and a
> > 2.5 GT/s x16 link, pcie_get_minimum_link() returns 2.5 GT/s x1, which
> > corresponds to 250 MB/s of bandwidth, not the actual available bandwidth of
> > about 2000 MB/s for a 16 GT/s x1 link.
>
> This comment is about what's being fixed, so it would have been easier to
> parse if it were written to more clearly indicate that we're removing
> (and not adding) this behavior.
Good point. Is this any better?
fm10k: Report PCIe link properties with pcie_print_link_status()
Previously the driver used pcie_get_minimum_link() to warn when the NIC
is in a slot that can't supply as much bandwidth as the NIC could use.
pcie_get_minimum_link() can be misleading because it finds the slowest link
and the narrowest link (which may be different links) without considering
the total bandwidth of each link. For a path with a 16 GT/s x1 link and a
2.5 GT/s x16 link, it returns 2.5 GT/s x1, which corresponds to 250 MB/s of
bandwidth, not the true available bandwidth of about 1969 MB/s for a
16 GT/s x1 link.
Use pcie_print_link_status() to report PCIe link speed and possible
limitations instead of implementing this in the driver itself. This finds
the slowest link in the path to the device by computing the total bandwidth
of each link and compares that with the capabilities of the device.
Note that the driver previously used dev_warn() to suggest using a
different slot, but pcie_print_link_status() uses dev_info() because if the
platform has no faster slot available, the user can't do anything about the
warning and may not want to be bothered with it.
^ permalink raw reply
* [PATCH net-next] nfp: add a separate counter for packets with CHECKSUM_COMPLETE
From: Jakub Kicinski @ 2018-04-02 20:31 UTC (permalink / raw)
To: davem; +Cc: netdev, oss-drivers, Jakub Kicinski
We are currently counting packets with CHECKSUM_COMPLETE as
"hw_rx_csum_ok". This is confusing. Add a new counter.
To make sure it fits in the same cacheline move the less used
error counter to a different location.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
---
drivers/net/ethernet/netronome/nfp/nfp_net.h | 4 +++-
drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 2 +-
drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c | 16 +++++++++-------
3 files changed, 13 insertions(+), 9 deletions(-)
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h
index 787df47ec430..bd7d8ae31e17 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
@@ -391,6 +391,7 @@ struct nfp_net_rx_ring {
* @rx_drops: Number of packets dropped on RX due to lack of resources
* @hw_csum_rx_ok: Counter of packets where the HW checksum was OK
* @hw_csum_rx_inner_ok: Counter of packets where the inner HW checksum was OK
+ * @hw_csum_rx_complete: Counter of packets with CHECKSUM_COMPLETE reported
* @hw_csum_rx_error: Counter of packets with bad checksums
* @tx_sync: Seqlock for atomic updates of TX stats
* @tx_pkts: Number of Transmitted packets
@@ -434,7 +435,7 @@ struct nfp_net_r_vector {
u64 rx_drops;
u64 hw_csum_rx_ok;
u64 hw_csum_rx_inner_ok;
- u64 hw_csum_rx_error;
+ u64 hw_csum_rx_complete;
struct nfp_net_tx_ring *xdp_ring;
@@ -446,6 +447,7 @@ struct nfp_net_r_vector {
u64 tx_gather;
u64 tx_lso;
+ u64 hw_csum_rx_error;
u64 rx_replace_buf_alloc_fail;
u64 tx_errors;
u64 tx_busy;
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 43a9c207a049..1eb6549f2a54 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -1406,7 +1406,7 @@ static void nfp_net_rx_csum(struct nfp_net_dp *dp,
skb->ip_summed = meta->csum_type;
skb->csum = meta->csum;
u64_stats_update_begin(&r_vec->rx_sync);
- r_vec->hw_csum_rx_ok++;
+ r_vec->hw_csum_rx_complete++;
u64_stats_update_end(&r_vec->rx_sync);
return;
}
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c
index e1dae0616f52..c9016419bfa0 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c
@@ -179,7 +179,7 @@ static const struct nfp_et_stat nfp_mac_et_stats[] = {
#define NN_ET_GLOBAL_STATS_LEN ARRAY_SIZE(nfp_net_et_stats)
#define NN_ET_SWITCH_STATS_LEN 9
-#define NN_RVEC_GATHER_STATS 8
+#define NN_RVEC_GATHER_STATS 9
#define NN_RVEC_PER_Q_STATS 3
static void nfp_net_get_nspinfo(struct nfp_app *app, char *version)
@@ -468,6 +468,7 @@ static u8 *nfp_vnic_get_sw_stats_strings(struct net_device *netdev, u8 *data)
data = nfp_pr_et(data, "hw_rx_csum_ok");
data = nfp_pr_et(data, "hw_rx_csum_inner_ok");
+ data = nfp_pr_et(data, "hw_rx_csum_complete");
data = nfp_pr_et(data, "hw_rx_csum_err");
data = nfp_pr_et(data, "rx_replace_buf_alloc_fail");
data = nfp_pr_et(data, "hw_tx_csum");
@@ -493,18 +494,19 @@ static u64 *nfp_vnic_get_sw_stats(struct net_device *netdev, u64 *data)
data[0] = nn->r_vecs[i].rx_pkts;
tmp[0] = nn->r_vecs[i].hw_csum_rx_ok;
tmp[1] = nn->r_vecs[i].hw_csum_rx_inner_ok;
- tmp[2] = nn->r_vecs[i].hw_csum_rx_error;
- tmp[3] = nn->r_vecs[i].rx_replace_buf_alloc_fail;
+ tmp[2] = nn->r_vecs[i].hw_csum_rx_complete;
+ tmp[3] = nn->r_vecs[i].hw_csum_rx_error;
+ tmp[4] = nn->r_vecs[i].rx_replace_buf_alloc_fail;
} while (u64_stats_fetch_retry(&nn->r_vecs[i].rx_sync, start));
do {
start = u64_stats_fetch_begin(&nn->r_vecs[i].tx_sync);
data[1] = nn->r_vecs[i].tx_pkts;
data[2] = nn->r_vecs[i].tx_busy;
- tmp[4] = nn->r_vecs[i].hw_csum_tx;
- tmp[5] = nn->r_vecs[i].hw_csum_tx_inner;
- tmp[6] = nn->r_vecs[i].tx_gather;
- tmp[7] = nn->r_vecs[i].tx_lso;
+ tmp[5] = nn->r_vecs[i].hw_csum_tx;
+ tmp[6] = nn->r_vecs[i].hw_csum_tx_inner;
+ tmp[7] = nn->r_vecs[i].tx_gather;
+ tmp[8] = nn->r_vecs[i].tx_lso;
} while (u64_stats_fetch_retry(&nn->r_vecs[i].tx_sync, start));
data += NN_RVEC_PER_Q_STATS;
--
2.16.2
^ permalink raw reply related
* Re: [PATCH v3 2/2] net: usb: asix88179_178a: de-duplicate code
From: Marc Zyngier @ 2018-04-02 20:25 UTC (permalink / raw)
To: Alexander Kurz; +Cc: David S . Miller, Andrew F . Davis, linux-usb, netdev
In-Reply-To: <20180402074349.12010-2-akurz@blala.de>
On Mon, 2 Apr 2018 07:43:49 +0000
Alexander Kurz <akurz@blala.de> wrote:
> Remove the duplicated code for asix88179_178a bind and reset methods.
>
> Signed-off-by: Alexander Kurz <akurz@blala.de>
> ---
> drivers/net/usb/ax88179_178a.c | 137 ++++++++++-------------------------------
> 1 file changed, 31 insertions(+), 106 deletions(-)
The good news is that this patch doesn't seem to break anything this
time. A few remarks below:
>
> diff --git a/drivers/net/usb/ax88179_178a.c b/drivers/net/usb/ax88179_178a.c
> index a6ef75907ae9..fea4c7b877cc 100644
> --- a/drivers/net/usb/ax88179_178a.c
> +++ b/drivers/net/usb/ax88179_178a.c
> @@ -1223,7 +1223,7 @@ static int ax88179_led_setting(struct usbnet *dev)
> return 0;
> }
>
> -static int ax88179_bind(struct usbnet *dev, struct usb_interface *intf)
> +static int ax88179_bind_or_reset(struct usbnet *dev, bool do_reset)
> {
> u8 buf[5];
> u16 *tmp16;
> @@ -1231,12 +1231,11 @@ static int ax88179_bind(struct usbnet *dev, struct usb_interface *intf)
> struct ax88179_data *ax179_data = (struct ax88179_data *)dev->data;
> struct ethtool_eee eee_data;
>
> - usbnet_get_endpoints(dev, intf);
> -
So you move this to "bind"...
> tmp16 = (u16 *)buf;
> tmp = (u8 *)buf;
>
> - memset(ax179_data, 0, sizeof(*ax179_data));
> + if (!do_reset)
> + memset(ax179_data, 0, sizeof(*ax179_data));
... but not that. Why? They both are exclusive to the bind operation.
>
> /* Power up ethernet PHY */
> *tmp16 = 0;
> @@ -1249,9 +1248,13 @@ static int ax88179_bind(struct usbnet *dev, struct usb_interface *intf)
> ax88179_write_cmd(dev, AX_ACCESS_MAC, AX_CLK_SELECT, 1, 1, tmp);
> msleep(100);
>
> + if (do_reset)
> + ax88179_auto_detach(dev, 0);
> +
> ax88179_read_cmd(dev, AX_ACCESS_MAC, AX_NODE_ID, ETH_ALEN,
> ETH_ALEN, dev->net->dev_addr);
> - memcpy(dev->net->perm_addr, dev->net->dev_addr, ETH_ALEN);
> + if (!do_reset)
> + memcpy(dev->net->perm_addr, dev->net->dev_addr, ETH_ALEN);
>
> /* RX bulk configuration */
> memcpy(tmp, &AX88179_BULKIN_SIZE[0], 5);
> @@ -1266,19 +1269,21 @@ static int ax88179_bind(struct usbnet *dev, struct usb_interface *intf)
> ax88179_write_cmd(dev, AX_ACCESS_MAC, AX_PAUSE_WATERLVL_HIGH,
> 1, 1, tmp);
>
> - dev->net->netdev_ops = &ax88179_netdev_ops;
> - dev->net->ethtool_ops = &ax88179_ethtool_ops;
> - dev->net->needed_headroom = 8;
> - dev->net->max_mtu = 4088;
> -
> - /* Initialize MII structure */
> - dev->mii.dev = dev->net;
> - dev->mii.mdio_read = ax88179_mdio_read;
> - dev->mii.mdio_write = ax88179_mdio_write;
> - dev->mii.phy_id_mask = 0xff;
> - dev->mii.reg_num_mask = 0xff;
> - dev->mii.phy_id = 0x03;
> - dev->mii.supports_gmii = 1;
> + if (!do_reset) {
> + dev->net->netdev_ops = &ax88179_netdev_ops;
> + dev->net->ethtool_ops = &ax88179_ethtool_ops;
> + dev->net->needed_headroom = 8;
> + dev->net->max_mtu = 4088;
> +
> + /* Initialize MII structure */
> + dev->mii.dev = dev->net;
> + dev->mii.mdio_read = ax88179_mdio_read;
> + dev->mii.mdio_write = ax88179_mdio_write;
> + dev->mii.phy_id_mask = 0xff;
> + dev->mii.reg_num_mask = 0xff;
> + dev->mii.phy_id = 0x03;
> + dev->mii.supports_gmii = 1;
> + }
>
> dev->net->features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
> NETIF_F_RXCSUM;
> @@ -1330,6 +1335,13 @@ static int ax88179_bind(struct usbnet *dev, struct usb_interface *intf)
> return 0;
> }
>
> +static int ax88179_bind(struct usbnet *dev, struct usb_interface *intf)
> +{
> + usbnet_get_endpoints(dev, intf);
> +
> + return ax88179_bind_or_reset(dev, false);
> +}
I find the whole construct extremely messy.
You shouldn't need to "bind or reset" the adapter. You first reset it
(that's the HW part), and you then bind it (that's the SW part). I
understand that the code is quite messy already, and that you're trying
to improve it. I'm just not convinced that having a single function
containing everything and the kitchen sink is the most sensible
approach.
Instead, you probably want to extract stuff that needs to be done at
reset time from what can be done at bind time, and keep the two quite
separate. The fact that bind is mostly a superset of reset is a bit of
an odd thing, and it'd be good to get to the bottom of that (I fully
admit that my understanding of USB networking is close to zero).
I came up with the below hack on top of your patch, and things seem to
still behave.
Thanks,
M.
diff --git a/drivers/net/usb/ax88179_178a.c b/drivers/net/usb/ax88179_178a.c
index fea4c7b877cc..aed98d400d7a 100644
--- a/drivers/net/usb/ax88179_178a.c
+++ b/drivers/net/usb/ax88179_178a.c
@@ -1223,7 +1223,7 @@ static int ax88179_led_setting(struct usbnet *dev)
return 0;
}
-static int ax88179_bind_or_reset(struct usbnet *dev, bool do_reset)
+static int ax88179_reset(struct usbnet *dev)
{
u8 buf[5];
u16 *tmp16;
@@ -1234,9 +1234,6 @@ static int ax88179_bind_or_reset(struct usbnet *dev, bool do_reset)
tmp16 = (u16 *)buf;
tmp = (u8 *)buf;
- if (!do_reset)
- memset(ax179_data, 0, sizeof(*ax179_data));
-
/* Power up ethernet PHY */
*tmp16 = 0;
ax88179_write_cmd(dev, AX_ACCESS_MAC, AX_PHYPWR_RSTCTL, 2, 2, tmp16);
@@ -1248,13 +1245,10 @@ static int ax88179_bind_or_reset(struct usbnet *dev, bool do_reset)
ax88179_write_cmd(dev, AX_ACCESS_MAC, AX_CLK_SELECT, 1, 1, tmp);
msleep(100);
- if (do_reset)
- ax88179_auto_detach(dev, 0);
+ ax88179_auto_detach(dev, 0);
ax88179_read_cmd(dev, AX_ACCESS_MAC, AX_NODE_ID, ETH_ALEN,
ETH_ALEN, dev->net->dev_addr);
- if (!do_reset)
- memcpy(dev->net->perm_addr, dev->net->dev_addr, ETH_ALEN);
/* RX bulk configuration */
memcpy(tmp, &AX88179_BULKIN_SIZE[0], 5);
@@ -1269,28 +1263,6 @@ static int ax88179_bind_or_reset(struct usbnet *dev, bool do_reset)
ax88179_write_cmd(dev, AX_ACCESS_MAC, AX_PAUSE_WATERLVL_HIGH,
1, 1, tmp);
- if (!do_reset) {
- dev->net->netdev_ops = &ax88179_netdev_ops;
- dev->net->ethtool_ops = &ax88179_ethtool_ops;
- dev->net->needed_headroom = 8;
- dev->net->max_mtu = 4088;
-
- /* Initialize MII structure */
- dev->mii.dev = dev->net;
- dev->mii.mdio_read = ax88179_mdio_read;
- dev->mii.mdio_write = ax88179_mdio_write;
- dev->mii.phy_id_mask = 0xff;
- dev->mii.reg_num_mask = 0xff;
- dev->mii.phy_id = 0x03;
- dev->mii.supports_gmii = 1;
- }
-
- dev->net->features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
- NETIF_F_RXCSUM;
-
- dev->net->hw_features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
- NETIF_F_RXCSUM;
-
/* Enable checksum offload */
*tmp = AX_RXCOE_IP | AX_RXCOE_TCP | AX_RXCOE_UDP |
AX_RXCOE_TCPV6 | AX_RXCOE_UDPV6;
@@ -1337,9 +1309,36 @@ static int ax88179_bind_or_reset(struct usbnet *dev, bool do_reset)
static int ax88179_bind(struct usbnet *dev, struct usb_interface *intf)
{
+ struct ax88179_data *ax179_data = (struct ax88179_data *)dev->data;
+
usbnet_get_endpoints(dev, intf);
- return ax88179_bind_or_reset(dev, false);
+ memset(ax179_data, 0, sizeof(*ax179_data));
+
+ dev->net->netdev_ops = &ax88179_netdev_ops;
+ dev->net->ethtool_ops = &ax88179_ethtool_ops;
+ dev->net->needed_headroom = 8;
+ dev->net->max_mtu = 4088;
+
+ dev->net->features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
+ NETIF_F_RXCSUM;
+
+ dev->net->hw_features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
+ NETIF_F_RXCSUM;
+
+ /* Initialize MII structure */
+ dev->mii.dev = dev->net;
+ dev->mii.mdio_read = ax88179_mdio_read;
+ dev->mii.mdio_write = ax88179_mdio_write;
+ dev->mii.phy_id_mask = 0xff;
+ dev->mii.reg_num_mask = 0xff;
+ dev->mii.phy_id = 0x03;
+ dev->mii.supports_gmii = 1;
+
+ ax88179_reset(dev);
+ memcpy(dev->net->perm_addr, dev->net->dev_addr, ETH_ALEN);
+
+ return 0;
}
static void ax88179_unbind(struct usbnet *dev, struct usb_interface *intf)
@@ -1540,11 +1539,6 @@ static int ax88179_link_reset(struct usbnet *dev)
return 0;
}
-static int ax88179_reset(struct usbnet *dev)
-{
- return ax88179_bind_or_reset(dev, true);
-}
-
static int ax88179_stop(struct usbnet *dev)
{
u16 tmp16;
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply related
* RE: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited
From: Keller, Jacob E @ 2018-04-02 20:25 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Tal Gilboa, Tariq Toukan, Ariel Elior, Ganesh Goudar,
Kirsher, Jeffrey T, everest-linux-l2@cavium.com,
intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
In-Reply-To: <20180402195820.GL9322@bhelgaas-glaptop.roam.corp.google.com>
> -----Original Message-----
> From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> Sent: Monday, April 02, 2018 12:58 PM
> To: Keller, Jacob E <jacob.e.keller@intel.com>
> Cc: Tal Gilboa <talgi@mellanox.com>; Tariq Toukan <tariqt@mellanox.com>; Ariel
> Elior <ariel.elior@cavium.com>; Ganesh Goudar <ganeshgr@chelsio.com>;
> Kirsher, Jeffrey T <jeffrey.t.kirsher@intel.com>; everest-linux-l2@cavium.com;
> intel-wired-lan@lists.osuosl.org; netdev@vger.kernel.org; linux-
> kernel@vger.kernel.org; linux-pci@vger.kernel.org
> Subject: Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed
> and whether it's limited
>
> On Mon, Apr 02, 2018 at 04:25:17PM +0000, Keller, Jacob E wrote:
> > > -----Original Message-----
> > > From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> > > Sent: Friday, March 30, 2018 2:05 PM
> > > To: Tal Gilboa <talgi@mellanox.com>
> > > Cc: Tariq Toukan <tariqt@mellanox.com>; Keller, Jacob E
> > > <jacob.e.keller@intel.com>; Ariel Elior <ariel.elior@cavium.com>; Ganesh
> > > Goudar <ganeshgr@chelsio.com>; Kirsher, Jeffrey T
> > > <jeffrey.t.kirsher@intel.com>; everest-linux-l2@cavium.com; intel-wired-
> > > lan@lists.osuosl.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > linux-pci@vger.kernel.org
> > > Subject: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed
> and
> > > whether it's limited
> > >
> > > From: Tal Gilboa <talgi@mellanox.com>
> > >
> > > Add pcie_print_link_status(). This logs the current settings of the link
> > > (speed, width, and total available bandwidth).
> > >
> > > If the device is capable of more bandwidth but is limited by a slower
> > > upstream link, we include information about the link that limits the
> > > device's performance.
> > >
> > > The user may be able to move the device to a different slot for better
> > > performance.
> > >
> > > This provides a unified method for all PCI devices to report status and
> > > issues, instead of each device reporting in a different way, using
> > > different code.
> > >
> > > Signed-off-by: Tal Gilboa <talgi@mellanox.com>
> > > [bhelgaas: changelog, reword log messages, print device capabilities when
> > > not limited]
> > > Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> > > ---
> > > drivers/pci/pci.c | 29 +++++++++++++++++++++++++++++
> > > include/linux/pci.h | 1 +
> > > 2 files changed, 30 insertions(+)
> > >
> > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > index e00d56b12747..cec7aed09f6b 100644
> > > --- a/drivers/pci/pci.c
> > > +++ b/drivers/pci/pci.c
> > > @@ -5283,6 +5283,35 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev,
> > > enum pci_bus_speed *speed,
> > > return *width * PCIE_SPEED2MBS_ENC(*speed);
> > > }
> > >
> > > +/**
> > > + * pcie_print_link_status - Report the PCI device's link speed and width
> > > + * @dev: PCI device to query
> > > + *
> > > + * Report the available bandwidth at the device. If this is less than the
> > > + * device is capable of, report the device's maximum possible bandwidth and
> > > + * the upstream link that limits its performance to less than that.
> > > + */
> > > +void pcie_print_link_status(struct pci_dev *dev)
> > > +{
> > > + enum pcie_link_width width, width_cap;
> > > + enum pci_bus_speed speed, speed_cap;
> > > + struct pci_dev *limiting_dev = NULL;
> > > + u32 bw_avail, bw_cap;
> > > +
> > > + bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
> > > + bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed,
> > > &width);
> > > +
> > > + if (bw_avail >= bw_cap)
> > > + pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n",
> > > + bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> > > + else
> > > + pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d
> > > link at %s (capable of %d Mb/s with %s x%d link)\n",
> > > + bw_avail, PCIE_SPEED2STR(speed), width,
> > > + limiting_dev ? pci_name(limiting_dev) : "<unknown>",
> > > + bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> > > +}
> >
> > Personally, I would make thic last one a pci_warn() to indicate it at a
> > higher log level, but I'm ok with the wording, and if consensus is that
> > this should be at info, I'm ok with that.
>
> Tal's original patch did have a pci_warn() here, and we went back and
> forth a bit. They get bug reports when a device doesn't perform as
> expected, which argues for pci_warn(). But they also got feedback
> saying warnings are a bit too much, which argues for pci_info() [1]
>
> I don't have a really strong opinion either way. I have a slight
> preference for info because the user may not be able to do anything
> about it (there may not be a faster slot available), and I think
> distros are usually configured so a warning interrupts the smooth
> graphical boot.
>
> It looks like mlx4, fm10k, and ixgbe currently use warnings, while
> bnx2x, bnxt_en, and cxgb4 use info. It's a tie so far :)
>
> [1] https://lkml.kernel.org/r/e47f3628-b56c-4d0a-f18b-
> 5ffaf261ccc0@mellanox.com
>
With that information, I'm fine with the proposal to display this as only an info. The message is still printed and can be used for debugging purposes, and I think that's really enough.
> Here's a proposal for printing the bandwidth as "x.xxx Gb/s":
Nice, I like that also.
Regards,
Jake
^ permalink raw reply
* Re: [bpf-next PATCH 1/2] bpf: sockmap, free memory on sock close with cork data
From: John Fastabend @ 2018-04-02 19:58 UTC (permalink / raw)
To: ast, daniel; +Cc: netdev, davem
In-Reply-To: <20180402195046.10055.11686.stgit@john-Precision-Tower-5810>
On 04/02/2018 12:50 PM, John Fastabend wrote:
> If a socket with pending cork data is closed we do not return the
> memory to the socket until the garbage collector free's the psock
> structure. The garbage collector though can run after the sock has
> completed its close operation. If this ordering happens the sock code
> will through a WARN_ON because there is still outstanding memory
> accounted to the sock.
>
> To resolve this ensure we return memory to the sock when a socket
> is closed.
>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> Fixes: 91843d540a13 ("bpf: sockmap, add msg_cork_bytes() helper")
> ---
Hi Alexei, Daniel,
These two fixes apply against current bpf-next or bpf after
bpf-next is merged. I could resend later I suppose but I think
it makes sense to get these in sooner rather than later.
Thanks,
John
^ permalink raw reply
* Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited
From: Bjorn Helgaas @ 2018-04-02 19:58 UTC (permalink / raw)
To: Keller, Jacob E
Cc: Tal Gilboa, Tariq Toukan, Ariel Elior, Ganesh Goudar,
Kirsher, Jeffrey T, everest-linux-l2@cavium.com,
intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
In-Reply-To: <02874ECE860811409154E81DA85FBB5882D49870@ORSMSX115.amr.corp.intel.com>
On Mon, Apr 02, 2018 at 04:25:17PM +0000, Keller, Jacob E wrote:
> > -----Original Message-----
> > From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> > Sent: Friday, March 30, 2018 2:05 PM
> > To: Tal Gilboa <talgi@mellanox.com>
> > Cc: Tariq Toukan <tariqt@mellanox.com>; Keller, Jacob E
> > <jacob.e.keller@intel.com>; Ariel Elior <ariel.elior@cavium.com>; Ganesh
> > Goudar <ganeshgr@chelsio.com>; Kirsher, Jeffrey T
> > <jeffrey.t.kirsher@intel.com>; everest-linux-l2@cavium.com; intel-wired-
> > lan@lists.osuosl.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> > linux-pci@vger.kernel.org
> > Subject: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and
> > whether it's limited
> >
> > From: Tal Gilboa <talgi@mellanox.com>
> >
> > Add pcie_print_link_status(). This logs the current settings of the link
> > (speed, width, and total available bandwidth).
> >
> > If the device is capable of more bandwidth but is limited by a slower
> > upstream link, we include information about the link that limits the
> > device's performance.
> >
> > The user may be able to move the device to a different slot for better
> > performance.
> >
> > This provides a unified method for all PCI devices to report status and
> > issues, instead of each device reporting in a different way, using
> > different code.
> >
> > Signed-off-by: Tal Gilboa <talgi@mellanox.com>
> > [bhelgaas: changelog, reword log messages, print device capabilities when
> > not limited]
> > Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> > ---
> > drivers/pci/pci.c | 29 +++++++++++++++++++++++++++++
> > include/linux/pci.h | 1 +
> > 2 files changed, 30 insertions(+)
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index e00d56b12747..cec7aed09f6b 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -5283,6 +5283,35 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev,
> > enum pci_bus_speed *speed,
> > return *width * PCIE_SPEED2MBS_ENC(*speed);
> > }
> >
> > +/**
> > + * pcie_print_link_status - Report the PCI device's link speed and width
> > + * @dev: PCI device to query
> > + *
> > + * Report the available bandwidth at the device. If this is less than the
> > + * device is capable of, report the device's maximum possible bandwidth and
> > + * the upstream link that limits its performance to less than that.
> > + */
> > +void pcie_print_link_status(struct pci_dev *dev)
> > +{
> > + enum pcie_link_width width, width_cap;
> > + enum pci_bus_speed speed, speed_cap;
> > + struct pci_dev *limiting_dev = NULL;
> > + u32 bw_avail, bw_cap;
> > +
> > + bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
> > + bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed,
> > &width);
> > +
> > + if (bw_avail >= bw_cap)
> > + pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n",
> > + bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> > + else
> > + pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d
> > link at %s (capable of %d Mb/s with %s x%d link)\n",
> > + bw_avail, PCIE_SPEED2STR(speed), width,
> > + limiting_dev ? pci_name(limiting_dev) : "<unknown>",
> > + bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> > +}
>
> Personally, I would make thic last one a pci_warn() to indicate it at a
> higher log level, but I'm ok with the wording, and if consensus is that
> this should be at info, I'm ok with that.
Tal's original patch did have a pci_warn() here, and we went back and
forth a bit. They get bug reports when a device doesn't perform as
expected, which argues for pci_warn(). But they also got feedback
saying warnings are a bit too much, which argues for pci_info() [1]
I don't have a really strong opinion either way. I have a slight
preference for info because the user may not be able to do anything
about it (there may not be a faster slot available), and I think
distros are usually configured so a warning interrupts the smooth
graphical boot.
It looks like mlx4, fm10k, and ixgbe currently use warnings, while
bnx2x, bnxt_en, and cxgb4 use info. It's a tie so far :)
[1] https://lkml.kernel.org/r/e47f3628-b56c-4d0a-f18b-5ffaf261ccc0@mellanox.com
Here's a proposal for printing the bandwidth as "x.xxx Gb/s":
commit ad370f38c1b5e9b8bb941eaed84ebb676c4bdaa4
Author: Tal Gilboa <talgi@mellanox.com>
Date: Fri Mar 30 08:56:47 2018 -0500
PCI: Add pcie_print_link_status() to log link speed and whether it's limited
Add pcie_print_link_status(). This logs the current settings of the link
(speed, width, and total available bandwidth).
If the device is capable of more bandwidth but is limited by a slower
upstream link, we include information about the link that limits the
device's performance.
The user may be able to move the device to a different slot for better
performance.
This provides a unified method for all PCI devices to report status and
issues, instead of each device reporting in a different way, using
different code.
Signed-off-by: Tal Gilboa <talgi@mellanox.com>
[bhelgaas: changelog, reword log messages, print device capabilities when
not limited, print bandwidth in Gb/s]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index c6e3c0524699..ab2346041fa4 100644
--- a/drivers/pci/pci.c
\x06++ b/drivers/pci/pci.c
@@ -5287,6 +5287,38 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
return *width * PCIE_SPEED2MBS_ENC(*speed);
}
+/**
+ * pcie_print_link_status - Report the PCI device's link speed and width
+ * @dev: PCI device to query
+ *
+ * Report the available bandwidth at the device. If this is less than the
+ * device is capable of, report the device's maximum possible bandwidth and
+ * the upstream link that limits its performance to less than that.
+ */
+void pcie_print_link_status(struct pci_dev *dev)
+{
+ enum pcie_link_width width, width_cap;
+ enum pci_bus_speed speed, speed_cap;
+ struct pci_dev *limiting_dev = NULL;
+ u32 bw_avail, bw_cap;
+
+ bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
+ bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width);
+
+ if (bw_avail >= bw_cap)
+ pci_info(dev, "%u.%03u Gb/s available bandwidth (%s x%d link)\n",
+ bw_cap / 1000, bw_cap % 1000,
+ PCIE_SPEED2STR(speed_cap), width_cap);
+ else
+ pci_info(dev, "%u.%03u Gb/s available bandwidth, limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n",
+ bw_avail / 1000, bw_avail % 1000,
+ PCIE_SPEED2STR(speed), width,
+ limiting_dev ? pci_name(limiting_dev) : "<unknown>",
+ bw_cap / 1000, bw_cap % 1000,
+ PCIE_SPEED2STR(speed_cap), width_cap);
+}
+EXPORT_SYMBOL(pcie_print_link_status);
+
/**
* pci_select_bars - Make BAR mask from the type of resource
* @dev: the PCI device for which BAR mask is made
diff --git a/include/linux/pci.h b/include/linux/pci.h
index f2bf2b7a66c7..38f7957121ef 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1086,6 +1086,7 @@ int pcie_get_minimum_link(struct pci_dev *dev, enum pci_bus_speed *speed,
u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev,
enum pci_bus_speed *speed,
enum pcie_link_width *width);
+void pcie_print_link_status(struct pci_dev *dev);
void pcie_flr(struct pci_dev *dev);
int __pci_reset_function_locked(struct pci_dev *dev);
int pci_reset_function(struct pci_dev *dev);
^ permalink raw reply related
* [bpf-next PATCH 2/2] bpf: sockmap, duplicates release calls may NULL sk_prot
From: John Fastabend @ 2018-04-02 19:50 UTC (permalink / raw)
To: ast, daniel; +Cc: netdev, davem
In-Reply-To: <20180402195046.10055.11686.stgit@john-Precision-Tower-5810>
It is possible to have multiple ULP tcp_release call paths in flight
if a sock is closed and simultaneously being removed from the sockmap
control path. The result would be setting the sk_prot to the saved
values on the first iteration and then on the second iteration setting
the value to NULL.
This patch resolves this by ensuring we only reset the sk_prot pointer
if we have a valid saved state to set.
Fixes: 4f738adba30a7 ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
kernel/bpf/sockmap.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 8ddf326..8dd9210 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -182,8 +182,10 @@ static void bpf_tcp_release(struct sock *sk)
psock->cork = NULL;
}
- sk->sk_prot = psock->sk_proto;
- psock->sk_proto = NULL;
+ if (psock->sk_proto) {
+ sk->sk_prot = psock->sk_proto;
+ psock->sk_proto = NULL;
+ }
out:
rcu_read_unlock();
}
^ permalink raw reply related
* [bpf-next PATCH 1/2] bpf: sockmap, free memory on sock close with cork data
From: John Fastabend @ 2018-04-02 19:50 UTC (permalink / raw)
To: ast, daniel; +Cc: netdev, davem
If a socket with pending cork data is closed we do not return the
memory to the socket until the garbage collector free's the psock
structure. The garbage collector though can run after the sock has
completed its close operation. If this ordering happens the sock code
will through a WARN_ON because there is still outstanding memory
accounted to the sock.
To resolve this ensure we return memory to the sock when a socket
is closed.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Fixes: 91843d540a13 ("bpf: sockmap, add msg_cork_bytes() helper")
---
kernel/bpf/sockmap.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index d2bda5a..8ddf326 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -211,6 +211,12 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
close_fun = psock->save_close;
write_lock_bh(&sk->sk_callback_lock);
+ if (psock->cork) {
+ free_start_sg(psock->sock, psock->cork);
+ kfree(psock->cork);
+ psock->cork = NULL;
+ }
+
list_for_each_entry_safe(md, mtmp, &psock->ingress, list) {
list_del(&md->list);
free_start_sg(psock->sock, md);
^ permalink raw reply related
* Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth
From: Bjorn Helgaas @ 2018-04-02 19:37 UTC (permalink / raw)
To: Keller, Jacob E
Cc: Tal Gilboa, Tariq Toukan, Ariel Elior, Ganesh Goudar,
Kirsher, Jeffrey T, everest-linux-l2@cavium.com,
intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
In-Reply-To: <02874ECE860811409154E81DA85FBB5882D4980A@ORSMSX115.amr.corp.intel.com>
On Mon, Apr 02, 2018 at 04:00:16PM +0000, Keller, Jacob E wrote:
> > -----Original Message-----
> > From: Tal Gilboa [mailto:talgi@mellanox.com]
> > Sent: Monday, April 02, 2018 7:34 AM
> > To: Bjorn Helgaas <helgaas@kernel.org>
> > Cc: Tariq Toukan <tariqt@mellanox.com>; Keller, Jacob E
> > <jacob.e.keller@intel.com>; Ariel Elior <ariel.elior@cavium.com>; Ganesh
> > Goudar <ganeshgr@chelsio.com>; Kirsher, Jeffrey T
> > <jeffrey.t.kirsher@intel.com>; everest-linux-l2@cavium.com; intel-wired-
> > lan@lists.osuosl.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> > linux-pci@vger.kernel.org
> > Subject: Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute
> > max supported link bandwidth
> >
> > On 4/2/2018 5:05 PM, Bjorn Helgaas wrote:
> > > On Mon, Apr 02, 2018 at 10:34:58AM +0300, Tal Gilboa wrote:
> > >> On 4/2/2018 3:40 AM, Bjorn Helgaas wrote:
> > >>> On Sun, Apr 01, 2018 at 11:38:53PM +0300, Tal Gilboa wrote:
> > >>>> On 3/31/2018 12:05 AM, Bjorn Helgaas wrote:
> > >>>>> From: Tal Gilboa <talgi@mellanox.com>
> > >>>>>
> > >>>>> Add pcie_bandwidth_capable() to compute the max link bandwidth
> > supported by
> > >>>>> a device, based on the max link speed and width, adjusted by the
> > encoding
> > >>>>> overhead.
> > >>>>>
> > >>>>> The maximum bandwidth of the link is computed as:
> > >>>>>
> > >>>>> max_link_speed * max_link_width * (1 - encoding_overhead)
> > >>>>>
> > >>>>> The encoding overhead is about 20% for 2.5 and 5.0 GT/s links using
> > 8b/10b
> > >>>>> encoding, and about 1.5% for 8 GT/s or higher speed links using 128b/130b
> > >>>>> encoding.
> > >>>>>
> > >>>>> Signed-off-by: Tal Gilboa <talgi@mellanox.com>
> > >>>>> [bhelgaas: adjust for pcie_get_speed_cap() and pcie_get_width_cap()
> > >>>>> signatures, don't export outside drivers/pci]
> > >>>>> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> > >>>>> Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
> > >>>>> ---
> > >>>>> drivers/pci/pci.c | 21 +++++++++++++++++++++
> > >>>>> drivers/pci/pci.h | 9 +++++++++
> > >>>>> 2 files changed, 30 insertions(+)
> > >>>>>
> > >>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > >>>>> index 43075be79388..9ce89e254197 100644
> > >>>>> --- a/drivers/pci/pci.c
> > >>>>> +++ b/drivers/pci/pci.c
> > >>>>> @@ -5208,6 +5208,27 @@ enum pcie_link_width
> > pcie_get_width_cap(struct pci_dev *dev)
> > >>>>> return PCIE_LNK_WIDTH_UNKNOWN;
> > >>>>> }
> > >>>>> +/**
> > >>>>> + * pcie_bandwidth_capable - calculates a PCI device's link bandwidth
> > capability
> > >>>>> + * @dev: PCI device
> > >>>>> + * @speed: storage for link speed
> > >>>>> + * @width: storage for link width
> > >>>>> + *
> > >>>>> + * Calculate a PCI device's link bandwidth by querying for its link speed
> > >>>>> + * and width, multiplying them, and applying encoding overhead.
> > >>>>> + */
> > >>>>> +u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed
> > *speed,
> > >>>>> + enum pcie_link_width *width)
> > >>>>> +{
> > >>>>> + *speed = pcie_get_speed_cap(dev);
> > >>>>> + *width = pcie_get_width_cap(dev);
> > >>>>> +
> > >>>>> + if (*speed == PCI_SPEED_UNKNOWN || *width ==
> > PCIE_LNK_WIDTH_UNKNOWN)
> > >>>>> + return 0;
> > >>>>> +
> > >>>>> + return *width * PCIE_SPEED2MBS_ENC(*speed);
> > >>>>> +}
> > >>>>> +
> > >>>>> /**
> > >>>>> * pci_select_bars - Make BAR mask from the type of resource
> > >>>>> * @dev: the PCI device for which BAR mask is made
> > >>>>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > >>>>> index 66738f1050c0..2a50172b9803 100644
> > >>>>> --- a/drivers/pci/pci.h
> > >>>>> +++ b/drivers/pci/pci.h
> > >>>>> @@ -261,8 +261,17 @@ void pci_disable_bridge_window(struct pci_dev
> > *dev);
> > >>>>> (speed) == PCIE_SPEED_2_5GT ? "2.5 GT/s" : \
> > >>>>> "Unknown speed")
> > >>>>> +/* PCIe speed to Mb/s with encoding overhead: 20% for gen2, ~1.5% for
> > gen3 */
> > >>>>> +#define PCIE_SPEED2MBS_ENC(speed) \
> > >>>>
> > >>>> Missing gen4.
> > >>>
> > >>> I made it "gen3+". I think that's accurate, isn't it? The spec
> > >>> doesn't seem to actually use "gen3" as a specific term, but sec 4.2.2
> > >>> says rates of 8 GT/s or higher (which I think includes gen3 and gen4)
> > >>> use 128b/130b encoding.
> > >>>
> > >>
> > >> I meant that PCIE_SPEED_16_0GT will return 0 from this macro since it wasn't
> > >> added. Need to return 15754.
> > >
> > > Oh, duh, of course! Sorry for being dense. What about the following?
> > > I included the calculation as opposed to just the magic numbers to try
> > > to make it clear how they're derived. This has the disadvantage of
> > > truncating the result instead of rounding, but I doubt that's
> > > significant in this context. If it is, we could use the magic numbers
> > > and put the computation in a comment.
> >
> > We can always use DIV_ROUND_UP((speed * enc_nominator),
> > enc_denominator). I think this is confusing and since this introduces a
> > bandwidth limit I would prefer to give a wider limit than a wrong one,
> > even it is by less than 1Mb/s. My vote is for leaving it as you wrote below.
> >
> > > Another question: we currently deal in Mb/s, not MB/s. Mb/s has the
> > > advantage of sort of corresponding to the GT/s numbers, but using MB/s
> > > would have the advantage of smaller numbers that match the table here:
> > > https://en.wikipedia.org/wiki/PCI_Express#History_and_revisions,
> > > but I don't know what's most typical in user-facing situations.
> > > What's better?
> >
> > I don't know what's better but for network devices we measure bandwidth
> > in Gb/s, so presenting bandwidth in MB/s would mean additional
> > calculations. The truth is I would have prefer to use Gb/s instead of
> > Mb/s, but again, don't want to loss up to 1Gb/s.
>
> I prefer this version with the calculation in line since it makes
> the derivation clear. Keeping them in Mb/s makes it easier to
> convert to Gb/s, which is what most people would expect.
OK, let's keep this patch as-is since returning Mb/s means we
don't have to worry about floating point, and it sounds like we
agree the truncation isn't a big deal.
I'll post a proposal to convert to Gb/s when printing.
> > > commit 946435491b35b7782157e9a4d1bd73071fba7709
> > > Author: Tal Gilboa <talgi@mellanox.com>
> > > Date: Fri Mar 30 08:32:03 2018 -0500
> > >
> > > PCI: Add pcie_bandwidth_capable() to compute max supported link
> > bandwidth
> > >
> > > Add pcie_bandwidth_capable() to compute the max link bandwidth
> > supported by
> > > a device, based on the max link speed and width, adjusted by the encoding
> > > overhead.
> > >
> > > The maximum bandwidth of the link is computed as:
> > >
> > > max_link_width * max_link_speed * (1 - encoding_overhead)
> > >
> > > 2.5 and 5.0 GT/s links use 8b/10b encoding, which reduces the raw
> > bandwidth
> > > available by 20%; 8.0 GT/s and faster links use 128b/130b encoding, which
> > > reduces it by about 1.5%.
> > >
> > > The result is in Mb/s, i.e., megabits/second, of raw bandwidth.
> > >
> > > Signed-off-by: Tal Gilboa <talgi@mellanox.com>
> > > [bhelgaas: add 16 GT/s, adjust for pcie_get_speed_cap() and
> > > pcie_get_width_cap() signatures, don't export outside drivers/pci]
> > > Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> > > Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
> > >
> > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > index 43075be79388..ff1e72060952 100644
> > > --- a/drivers/pci/pci.c
> > > +++ b/drivers/pci/pci.c
> > > @@ -5208,6 +5208,28 @@ enum pcie_link_width pcie_get_width_cap(struct
> > pci_dev *dev)
> > > return PCIE_LNK_WIDTH_UNKNOWN;
> > > }
> > >
> > > +/**
> > > + * pcie_bandwidth_capable - calculate a PCI device's link bandwidth capability
> > > + * @dev: PCI device
> > > + * @speed: storage for link speed
> > > + * @width: storage for link width
> > > + *
> > > + * Calculate a PCI device's link bandwidth by querying for its link speed
> > > + * and width, multiplying them, and applying encoding overhead. The result
> > > + * is in Mb/s, i.e., megabits/second of raw bandwidth.
> > > + */
> > > +u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed
> > *speed,
> > > + enum pcie_link_width *width)
> > > +{
> > > + *speed = pcie_get_speed_cap(dev);
> > > + *width = pcie_get_width_cap(dev);
> > > +
> > > + if (*speed == PCI_SPEED_UNKNOWN || *width ==
> > PCIE_LNK_WIDTH_UNKNOWN)
> > > + return 0;
> > > +
> > > + return *width * PCIE_SPEED2MBS_ENC(*speed);
> > > +}
> > > +
> > > /**
> > > * pci_select_bars - Make BAR mask from the type of resource
> > > * @dev: the PCI device for which BAR mask is made
> > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > > index 66738f1050c0..37f9299ed623 100644
> > > --- a/drivers/pci/pci.h
> > > +++ b/drivers/pci/pci.h
> > > @@ -261,8 +261,18 @@ void pci_disable_bridge_window(struct pci_dev *dev);
> > > (speed) == PCIE_SPEED_2_5GT ? "2.5 GT/s" : \
> > > "Unknown speed")
> > >
> > > +/* PCIe speed to Mb/s reduced by encoding overhead */
> > > +#define PCIE_SPEED2MBS_ENC(speed) \
> > > + ((speed) == PCIE_SPEED_16_0GT ? (16000*(128/130)) : \
> > > + (speed) == PCIE_SPEED_8_0GT ? (8000*(128/130)) : \
> > > + (speed) == PCIE_SPEED_5_0GT ? (5000*(8/10)) : \
> > > + (speed) == PCIE_SPEED_2_5GT ? (2500*(8/10)) : \
> > > + 0)
> > > +
> > > enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev);
> > > enum pcie_link_width pcie_get_width_cap(struct pci_dev *dev);
> > > +u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed
> > *speed,
> > > + enum pcie_link_width *width);
> > >
> > > /* Single Root I/O Virtualization */
> > > struct pci_sriov {
> > >
^ permalink raw reply
* Re: [PATCH v2] KEYS: DNS: limit the length of option strings
From: Eric Biggers @ 2018-04-02 19:20 UTC (permalink / raw)
To: David Howells; +Cc: keyrings, netdev, Mark Rutland, Eric Biggers
In-Reply-To: <20180323202122.GB16350@gmail.com>
On Fri, Mar 23, 2018 at 01:21:22PM -0700, Eric Biggers wrote:
> On Mon, Mar 12, 2018 at 10:57:07AM -0700, Eric Biggers wrote:
> > On Wed, Mar 07, 2018 at 03:54:37PM +0000, David Howells wrote:
> > > Eric Biggers <ebiggers3@gmail.com> wrote:
> > >
> > > > Fix it by limiting option strings (combined name + value) to a much more
> > > > reasonable 128 bytes. The exact limit is arbitrary, but currently the
> > > > only recognized option is formatted as "dnserror=%lu" which fits well
> > > > within this limit.
> > >
> > > There will be more options coming ("ipv4", "ipv6") but they shouldn't overrun
> > > this limit and we can always extend the limit if need be.
> > >
> > > David
> >
> > David (Howells) do you want to take this patch through the keyrings tree or
> > should I ask David Miller to take it through net-next?
> >
> > Eric
>
> Ping.
Ping again.
^ permalink raw reply
* Re: [PATCH net-next v1] ipvs: add consistent source hashing scheduling
From: Vincent Bernat @ 2018-04-02 19:12 UTC (permalink / raw)
To: Julian Anastasov
Cc: Wensong Zhang, Simon Horman, David S. Miller, netdev, lvs-devel,
Inju Song
In-Reply-To: <alpine.LFD.2.20.1804022145370.1734@ja.home.ssi.bg>
❦ 2 avril 2018 22:05 +0300, Julian Anastasov <ja@ssi.bg> :
> Sorry to say it but may be you missed the discussion
> on lvs-devel about the new MH scheduler implemented by Inju Song:
>
> https://www.spinics.net/lists/lvs-devel/msg04928.html
> http://archive.linuxvirtualserver.org/html/lvs-devel/2018-03/msg00023.html
>
> In the last 6 months we fixed all issues and I acked
> v4 just yesterday:
>
> http://archive.linuxvirtualserver.org/html/lvs-devel/2018-04/msg00003.html
For some reason, "ipvs maglev" in Google doesn't yield those
results. But having code already reviewed is great news for me: I can
use it with confidence. :)
--
Make sure your code "does nothing" gracefully.
- The Elements of Programming Style (Kernighan & Plauger)
^ permalink raw reply
* Re: [PATCH v3 2/2] net: usb: asix88179_178a: de-duplicate code
From: Marc Zyngier @ 2018-04-02 19:06 UTC (permalink / raw)
To: Alexander Kurz; +Cc: David Miller, afd, linux-usb, netdev, freddy
In-Reply-To: <alpine.DEB.2.20.1804021513340.14675@blala.de>
[dropping Freddy as I'm getting bounces from asix.com.tw]
On Mon, 2 Apr 2018 15:21:08 +0000 Alexander Kurz <akurz@blala.de> wrote:
Alexander,
> Hi Marc, David,
> with the v2 patch ("net: usb: asix88179_178a: de-duplicate code")
> I made an embarrasly stupid mistake of removing the wrong function.
> The v2 patch accidentially changed ax88179_link_reset() instead of
> ax88179_reset(). Hunk 6 of v2 ("net: usb: asix88179_178a: de-duplicate
> code") is just utterly wrong.
OK, that explains what I saw here.
> ax88179_bind() and ax88179_reset() were the correct targets to be
> de-duplicated, as done in the v3 patch.
>
> Sorry for this, Alexander
We all make mistakes, so let's try to learn from them.
You can improve your process by testing what you're about to send (a
very basic requirement), and documenting the changes you make on each
version you send (a cover letter is a good place to put it).
Answering reviewer questions helps building trust between the
contributor and the maintainer on the receiving end of the patch, and
you probably want to answer them before sending a new version so that a
proper discussion can take place, specially if the reviewer doesn't
quite see what you're aiming for.
I'll comment on the patch separately.
Hope this helps,
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply
* [PATCH v8 0/7] netdev: intel: Eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-04-02 19:06 UTC (permalink / raw)
To: jeffrey.t.kirsher
Cc: netdev, timur, sulrich, linux-arm-msm, linux-arm-kernel,
Sinan Kaya
Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.
This ends up CPU observing two barriers back to back before executing the
register write.
Since code already has an explicit barrier call, changing writel() to
writel_relaxed().
I did a regex search for wmb() followed by writel() in each drivers
directory. I scrubbed the ones I care about in this series.
I considered "ease of change", "popular usage" and "performance critical
path" as the determining criteria for my filtering.
We used relaxed API heavily on ARM for a long time but
it did not exist on other architectures. For this reason, relaxed
architectures have been paying double penalty in order to use the common
drivers.
Now that relaxed API is present on all architectures, we can go and scrub
all drivers to see what needs to change and what can remain.
We start with mostly used ones and hope to increase the coverage over time.
It will take a while to cover all drivers.
Feel free to apply patches individually.
Change since 7:
* API clarification with Linus and several architecture folks regarding
writel()
https://www.mail-archive.com/netdev@vger.kernel.org/msg225806.html
* removed wmb() in front of writel() at several places.
* remove old IA64 comments regarding ordering.
Sinan Kaya (7):
i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs
ixgbe: eliminate duplicate barriers on weakly-ordered archs
igbvf: eliminate duplicate barriers on weakly-ordered archs
igb: eliminate duplicate barriers on weakly-ordered archs
fm10k: Eliminate duplicate barriers on weakly-ordered archs
ixgbevf: keep writel() closer to wmb()
ixgbevf: eliminate duplicate barriers on weakly-ordered archs
drivers/net/ethernet/intel/fm10k/fm10k_main.c | 15 ++-----------
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 22 ++++++++----------
drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 8 +------
drivers/net/ethernet/intel/igb/igb_main.c | 14 ++----------
drivers/net/ethernet/intel/igbvf/netdev.c | 16 +++++---------
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 23 ++-----------------
drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 5 -----
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 27 ++++++++---------------
8 files changed, 30 insertions(+), 100 deletions(-)
--
2.7.4
^ permalink raw reply
* [PATCH v8 7/7] ixgbevf: eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-04-02 19:06 UTC (permalink / raw)
To: jeffrey.t.kirsher
Cc: netdev, timur, sulrich, linux-arm-msm, linux-arm-kernel,
Sinan Kaya, intel-wired-lan, linux-kernel
In-Reply-To: <1522695990-31082-1-git-send-email-okaya@codeaurora.org>
memory-barriers.txt has been updated as follows:
"When using writel(), a prior wmb() is not needed to guarantee that the
cache coherent memory writes have completed before writing to the MMIO
region."
Remove old IA-64 comments in the code along with unneeded wmb() in front
of writel().
There are places in the code where wmb() has been used as a double barrier
for CPU and IO in place of smp_wmb() and wmb() as an optimization. For
such places, keep the wmb() but replace the following writel() with
writel_relaxed() to have a sequence as
wmb()
writel_relaxed()
mmio_wb()
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 23 +++++++----------------
1 file changed, 7 insertions(+), 16 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 757dac6..29b71a7 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -719,12 +719,6 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring *rx_ring,
/* update next to alloc since we have filled the ring */
rx_ring->next_to_alloc = i;
- /* Force memory writes to complete before letting h/w
- * know there are new descriptors to fetch. (Only
- * applicable for weak-ordered memory model archs,
- * such as IA-64).
- */
- wmb();
writel(i, rx_ring->tail);
}
}
@@ -1228,10 +1222,6 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector,
struct ixgbevf_ring *xdp_ring =
adapter->xdp_ring[rx_ring->queue_index];
- /* Force memory writes to complete before letting h/w
- * know there are new descriptors to fetch.
- */
- wmb();
writel(xdp_ring->next_to_use, xdp_ring->tail);
}
@@ -3985,11 +3975,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
/* set the timestamp */
first->time_stamp = jiffies;
- /* Force memory writes to complete before letting h/w know there
- * are new descriptors to fetch. (Only applicable for weak-ordered
- * memory model archs, such as IA-64).
- *
- * We also need this memory barrier (wmb) to make certain all of the
+ /* We also need this memory barrier (wmb) to make certain all of the
* status bits have been updated before next_to_watch is written.
*/
wmb();
@@ -4004,7 +3990,12 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
tx_ring->next_to_use = i;
/* notify HW of packet */
- writel(i, tx_ring->tail);
+ writel_relaxed(i, tx_ring->tail);
+
+ /* We need this if more than one processor can write to our tail
+ * at a time, it synchronizes IO on IA64/Altix systems
+ */
+ mmiowb();
return;
dma_error:
--
2.7.4
^ permalink raw reply related
* [PATCH v8 6/7] ixgbevf: keep writel() closer to wmb()
From: Sinan Kaya @ 2018-04-02 19:06 UTC (permalink / raw)
To: jeffrey.t.kirsher
Cc: netdev, timur, sulrich, linux-arm-msm, linux-arm-kernel,
Sinan Kaya, intel-wired-lan, linux-kernel
In-Reply-To: <1522695990-31082-1-git-send-email-okaya@codeaurora.org>
Remove ixgbevf_write_tail() in favor of moving writel() close to
wmb().
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 5 -----
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 6 +++---
2 files changed, 3 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 447ce1d..c75ea1f 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -312,11 +312,6 @@ static inline u16 ixgbevf_desc_unused(struct ixgbevf_ring *ring)
return ((ntc > ntu) ? 0 : ring->count) + ntc - ntu - 1;
}
-static inline void ixgbevf_write_tail(struct ixgbevf_ring *ring, u32 value)
-{
- writel(value, ring->tail);
-}
-
#define IXGBEVF_RX_DESC(R, i) \
(&(((union ixgbe_adv_rx_desc *)((R)->desc))[i]))
#define IXGBEVF_TX_DESC(R, i) \
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index e3d04f2..757dac6 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -725,7 +725,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring *rx_ring,
* such as IA-64).
*/
wmb();
- ixgbevf_write_tail(rx_ring, i);
+ writel(i, rx_ring->tail);
}
}
@@ -1232,7 +1232,7 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector,
* know there are new descriptors to fetch.
*/
wmb();
- ixgbevf_write_tail(xdp_ring, xdp_ring->next_to_use);
+ writel(xdp_ring->next_to_use, xdp_ring->tail);
}
u64_stats_update_begin(&rx_ring->syncp);
@@ -4004,7 +4004,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
tx_ring->next_to_use = i;
/* notify HW of packet */
- ixgbevf_write_tail(tx_ring, i);
+ writel(i, tx_ring->tail);
return;
dma_error:
--
2.7.4
^ permalink raw reply related
* [PATCH v8 5/7] fm10k: Eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-04-02 19:06 UTC (permalink / raw)
To: jeffrey.t.kirsher
Cc: netdev, timur, sulrich, linux-arm-msm, linux-arm-kernel,
Sinan Kaya, intel-wired-lan, linux-kernel
In-Reply-To: <1522695990-31082-1-git-send-email-okaya@codeaurora.org>
memory-barriers.txt has been updated as follows:
"When using writel(), a prior wmb() is not needed to guarantee that the
cache coherent memory writes have completed before writing to the MMIO
region."
Remove old IA-64 comments in the code along with unneeded wmb() in front
of writel().
There are places in the code where wmb() has been used as a double barrier
for CPU and IO in place of smp_wmb() and wmb() as an optimization. For
such places, keep the wmb() but replace the following writel() with
writel_relaxed() to have a sequence as
wmb()
writel_relaxed()
mmio_wb()
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
drivers/net/ethernet/intel/fm10k/fm10k_main.c | 15 ++-------------
1 file changed, 2 insertions(+), 13 deletions(-)
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index df86070..41e3aaa 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -172,13 +172,6 @@ void fm10k_alloc_rx_buffers(struct fm10k_ring *rx_ring, u16 cleaned_count)
/* update next to alloc since we have filled the ring */
rx_ring->next_to_alloc = i;
- /* Force memory writes to complete before letting h/w
- * know there are new descriptors to fetch. (Only
- * applicable for weak-ordered memory model archs,
- * such as IA-64).
- */
- wmb();
-
/* notify hardware of new descriptors */
writel(i, rx_ring->tail);
}
@@ -1036,11 +1029,7 @@ static void fm10k_tx_map(struct fm10k_ring *tx_ring,
/* record SW timestamp if HW timestamp is not available */
skb_tx_timestamp(first->skb);
- /* Force memory writes to complete before letting h/w know there
- * are new descriptors to fetch. (Only applicable for weak-ordered
- * memory model archs, such as IA-64).
- *
- * We also need this memory barrier to make certain all of the
+ /* We need this memory barrier to make certain all of the
* status bits have been updated before next_to_watch is written.
*/
wmb();
@@ -1055,7 +1044,7 @@ static void fm10k_tx_map(struct fm10k_ring *tx_ring,
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
- writel(i, tx_ring->tail);
+ writel_relaxed(i, tx_ring->tail);
/* we need this if more than one processor can write to our tail
* at a time, it synchronizes IO on IA64/Altix systems
--
2.7.4
^ permalink raw reply related
* [PATCH v8 4/7] igb: eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-04-02 19:06 UTC (permalink / raw)
To: jeffrey.t.kirsher
Cc: sulrich, netdev, timur, linux-kernel, Sinan Kaya, intel-wired-lan,
linux-arm-msm, linux-arm-kernel
In-Reply-To: <1522695990-31082-1-git-send-email-okaya@codeaurora.org>
memory-barriers.txt has been updated as follows:
"When using writel(), a prior wmb() is not needed to guarantee that the
cache coherent memory writes have completed before writing to the MMIO
region."
Remove old IA-64 comments in the code along with unneeded wmb() in front
of writel().
There are places in the code where wmb() has been used as a double barrier
for CPU and IO in place of smp_wmb() and wmb() as an optimization. For
such places, keep the wmb() but replace the following writel() with
writel_relaxed() to have a sequence as
wmb()
writel_relaxed()
mmio_wb()
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
drivers/net/ethernet/intel/igb/igb_main.c | 14 ++------------
1 file changed, 2 insertions(+), 12 deletions(-)
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index c1c0bc3..c3f7130 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -5652,11 +5652,7 @@ static int igb_tx_map(struct igb_ring *tx_ring,
/* set the timestamp */
first->time_stamp = jiffies;
- /* Force memory writes to complete before letting h/w know there
- * are new descriptors to fetch. (Only applicable for weak-ordered
- * memory model archs, such as IA-64).
- *
- * We also need this memory barrier to make certain all of the
+ /* We need this memory barrier to make certain all of the
* status bits have been updated before next_to_watch is written.
*/
wmb();
@@ -5674,7 +5670,7 @@ static int igb_tx_map(struct igb_ring *tx_ring,
igb_maybe_stop_tx(tx_ring, DESC_NEEDED);
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
- writel(i, tx_ring->tail);
+ writel_relaxed(i, tx_ring->tail);
/* we need this if more than one processor can write to our tail
* at a time, it synchronizes IO on IA64/Altix systems
@@ -8073,12 +8069,6 @@ void igb_alloc_rx_buffers(struct igb_ring *rx_ring, u16 cleaned_count)
/* update next to alloc since we have filled the ring */
rx_ring->next_to_alloc = i;
- /* Force memory writes to complete before letting h/w
- * know there are new descriptors to fetch. (Only
- * applicable for weak-ordered memory model archs,
- * such as IA-64).
- */
- wmb();
writel(i, rx_ring->tail);
}
}
--
2.7.4
^ permalink raw reply related
* [PATCH v8 3/7] igbvf: eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-04-02 19:06 UTC (permalink / raw)
To: jeffrey.t.kirsher
Cc: netdev, timur, sulrich, linux-arm-msm, linux-arm-kernel,
Sinan Kaya, intel-wired-lan, linux-kernel
In-Reply-To: <1522695990-31082-1-git-send-email-okaya@codeaurora.org>
memory-barriers.txt has been updated as follows:
"When using writel(), a prior wmb() is not needed to guarantee that the
cache coherent memory writes have completed before writing to the MMIO
region."
Remove old IA-64 comments in the code along with unneeded wmb() in front
of writel().
There are places in the code where wmb() has been used as a double barrier
for CPU and IO in place of smp_wmb() and wmb() as an optimization. For
such places, keep the wmb() but replace the following writel() with
writel_relaxed() to have a sequence as
wmb()
writel_relaxed()
mmio_wb()
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
drivers/net/ethernet/intel/igbvf/netdev.c | 16 +++++-----------
1 file changed, 5 insertions(+), 11 deletions(-)
diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c b/drivers/net/ethernet/intel/igbvf/netdev.c
index e2b7502..d9f186a 100644
--- a/drivers/net/ethernet/intel/igbvf/netdev.c
+++ b/drivers/net/ethernet/intel/igbvf/netdev.c
@@ -246,12 +246,6 @@ static void igbvf_alloc_rx_buffers(struct igbvf_ring *rx_ring,
else
i--;
- /* Force memory writes to complete before letting h/w
- * know there are new descriptors to fetch. (Only
- * applicable for weak-ordered memory model archs,
- * such as IA-64).
- */
- wmb();
writel(i, adapter->hw.hw_addr + rx_ring->tail);
}
}
@@ -2289,16 +2283,16 @@ static inline void igbvf_tx_queue_adv(struct igbvf_adapter *adapter,
}
tx_desc->read.cmd_type_len |= cpu_to_le32(adapter->txd_cmd);
- /* Force memory writes to complete before letting h/w
- * know there are new descriptors to fetch. (Only
- * applicable for weak-ordered memory model archs,
- * such as IA-64).
+
+ /* We use this memory barrier to make certain all of the
+ * status bits have been updated before next_to_watch is
+ * written.
*/
wmb();
tx_ring->buffer_info[first].next_to_watch = tx_desc;
tx_ring->next_to_use = i;
- writel(i, adapter->hw.hw_addr + tx_ring->tail);
+ writel_relaxed(i, adapter->hw.hw_addr + tx_ring->tail);
/* we need this if more than one processor can write to our tail
* at a time, it synchronizes IO on IA64/Altix systems
*/
--
2.7.4
^ permalink raw reply related
* [PATCH v8 2/7] ixgbe: eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-04-02 19:06 UTC (permalink / raw)
To: jeffrey.t.kirsher
Cc: sulrich, netdev, timur, linux-kernel, Sinan Kaya, intel-wired-lan,
linux-arm-msm, linux-arm-kernel
In-Reply-To: <1522695990-31082-1-git-send-email-okaya@codeaurora.org>
memory-barriers.txt has been updated as follows:
"When using writel(), a prior wmb() is not needed to guarantee that the
cache coherent memory writes have completed before writing to the MMIO
region."
Remove old IA-64 comments in the code along with unneeded wmb() in front
of writel().
There are places in the code where wmb() has been used as a double barrier
for CPU and IO in place of smp_wmb() and wmb() as an optimization. For
such places, keep the wmb() but replace the following writel() with
writel_relaxed() to have a sequence as
wmb()
writel_relaxed()
mmio_wb()
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 23 ++---------------------
1 file changed, 2 insertions(+), 21 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index afadba9..c17924b 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1696,12 +1696,6 @@ void ixgbe_alloc_rx_buffers(struct ixgbe_ring *rx_ring, u16 cleaned_count)
/* update next to alloc since we have filled the ring */
rx_ring->next_to_alloc = i;
- /* Force memory writes to complete before letting h/w
- * know there are new descriptors to fetch. (Only
- * applicable for weak-ordered memory model archs,
- * such as IA-64).
- */
- wmb();
writel(i, rx_ring->tail);
}
}
@@ -2467,10 +2461,6 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
if (xdp_xmit) {
struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
- /* Force memory writes to complete before letting h/w
- * know there are new descriptors to fetch.
- */
- wmb();
writel(ring->next_to_use, ring->tail);
xdp_do_flush_map();
@@ -8080,12 +8070,7 @@ static int ixgbe_tx_map(struct ixgbe_ring *tx_ring,
/* set the timestamp */
first->time_stamp = jiffies;
- /*
- * Force memory writes to complete before letting h/w know there
- * are new descriptors to fetch. (Only applicable for weak-ordered
- * memory model archs, such as IA-64).
- *
- * We also need this memory barrier to make certain all of the
+ /* We need this memory barrier to make certain all of the
* status bits have been updated before next_to_watch is written.
*/
wmb();
@@ -8102,7 +8087,7 @@ static int ixgbe_tx_map(struct ixgbe_ring *tx_ring,
ixgbe_maybe_stop_tx(tx_ring, DESC_NEEDED);
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
- writel(i, tx_ring->tail);
+ writel_relaxed(i, tx_ring->tail);
/* we need this if more than one processor can write to our tail
* at a time, it synchronizes IO on IA64/Altix systems
@@ -10034,10 +10019,6 @@ static void ixgbe_xdp_flush(struct net_device *dev)
if (unlikely(!ring))
return;
- /* Force memory writes to complete before letting h/w know there
- * are new descriptors to fetch.
- */
- wmb();
writel(ring->next_to_use, ring->tail);
return;
--
2.7.4
^ permalink raw reply related
* [PATCH v8 1/7] i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-04-02 19:06 UTC (permalink / raw)
To: jeffrey.t.kirsher
Cc: netdev, timur, sulrich, linux-arm-msm, linux-arm-kernel,
Sinan Kaya, intel-wired-lan, linux-kernel
In-Reply-To: <1522695990-31082-1-git-send-email-okaya@codeaurora.org>
memory-barriers.txt has been updated as follows:
"When using writel(), a prior wmb() is not needed to guarantee that the
cache coherent memory writes have completed before writing to the MMIO
region."
Remove old IA-64 comments in the code along with unneeded wmb() in front
of writel().
There are places in the code where wmb() has been used as a double barrier
for CPU and IO in place of smp_wmb() and wmb() as an optimization. For
such places, keep the wmb() but replace the following writel() with
writel_relaxed() to have a sequence as
wmb()
writel_relaxed()
mmio_wb()
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 22 +++++++++-------------
drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 8 +-------
2 files changed, 10 insertions(+), 20 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index f174c72..1b9fa7a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -186,7 +186,13 @@ static int i40e_program_fdir_filter(struct i40e_fdir_filter *fdir_data,
/* Mark the data descriptor to be watched */
first->next_to_watch = tx_desc;
- writel(tx_ring->next_to_use, tx_ring->tail);
+ writel_relaxed(tx_ring->next_to_use, tx_ring->tail);
+
+ /* We need this if more than one processor can write to our tail
+ * at a time, it synchronizes IO on IA64/Altix systems
+ */
+ mmiowb();
+
return 0;
dma_fail:
@@ -1523,12 +1529,6 @@ static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
/* update next to alloc since we have filled the ring */
rx_ring->next_to_alloc = val;
- /* Force memory writes to complete before letting h/w
- * know there are new descriptors to fetch. (Only
- * applicable for weak-ordered memory model archs,
- * such as IA-64).
- */
- wmb();
writel(val, rx_ring->tail);
}
@@ -2274,11 +2274,7 @@ static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring,
static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
{
- /* Force memory writes to complete before letting h/w
- * know there are new descriptors to fetch.
- */
- wmb();
- writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
+ writel(xdp_ring->next_to_use, xdp_ring->tail);
}
/**
@@ -3444,7 +3440,7 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
- writel(i, tx_ring->tail);
+ writel_relaxed(i, tx_ring->tail);
/* we need this if more than one processor can write to our tail
* at a time, it synchronizes IO on IA64/Altix systems
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 12bd937..eb5556e 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -804,12 +804,6 @@ static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
/* update next to alloc since we have filled the ring */
rx_ring->next_to_alloc = val;
- /* Force memory writes to complete before letting h/w
- * know there are new descriptors to fetch. (Only
- * applicable for weak-ordered memory model archs,
- * such as IA-64).
- */
- wmb();
writel(val, rx_ring->tail);
}
@@ -2379,7 +2373,7 @@ static inline void i40evf_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
- writel(i, tx_ring->tail);
+ writel_relaxed(i, tx_ring->tail);
/* we need this if more than one processor can write to our tail
* at a time, it synchronizes IO on IA64/Altix systems
--
2.7.4
^ permalink raw reply related
* Re: [PATCH net-next v1] ipvs: add consistent source hashing scheduling
From: Julian Anastasov @ 2018-04-02 19:05 UTC (permalink / raw)
To: Vincent Bernat
Cc: Wensong Zhang, Simon Horman, David S. Miller, netdev, lvs-devel,
Inju Song
In-Reply-To: <20180402172025.7380-1-vincent@bernat.im>
Hello,
On Mon, 2 Apr 2018, Vincent Bernat wrote:
> Based on Google's Maglev algorithm [1][2], this scheduler builds a
> lookup table in a way disruption is minimized when a change
> occurs. This helps in case of active/active setup without
> synchronization. Like for classic source hashing, this lookup table is
> used to assign connections to a real server.
>
> Both source address and port are used to compute the hash (unlike sh
> where this is optional).
>
> Weights are correctly handled. Unlike sh, servers with a weight of 0
> are considered as absent. Also, unlike sh, when a server becomes
> unavailable due to a threshold, no fallback is possible: doing so
> would seriously impair the the usefulness of using a consistent hash.
>
> There is a small hack to detect when all real servers have a weight of
> 0. It relies on the fact it is not possible for the weight of a real
> server to change during the execution of the assignment. I believe
> this is the case as modifications through netlink are subject to a
> mutex, but the use of atomic_read() is unsettling.
>
> The value of 65537 for the hash table size is currently not modifiable
> at compile-time. This is the value suggested in the Maglev
> paper. Another possible value is 257 (for small tests) and 655373 (for
> very large setups).
>
> [1]: https://research.google.com/pubs/pub44824.html
> [2]: https://blog.acolyer.org/2016/03/21/maglev-a-fast-and-reliable-software-network-load-balancer/
Sorry to say it but may be you missed the discussion
on lvs-devel about the new MH scheduler implemented by Inju Song:
https://www.spinics.net/lists/lvs-devel/msg04928.html
http://archive.linuxvirtualserver.org/html/lvs-devel/2018-03/msg00023.html
In the last 6 months we fixed all issues and I acked
v4 just yesterday:
http://archive.linuxvirtualserver.org/html/lvs-devel/2018-04/msg00003.html
This scheduler supports:
- tables with different size (prime): IP_VS_MH_TAB_INDEX
- gcd of weights: ip_vs_mh_gcd_weight
- shifted weights: ip_vs_mh_shift_weight
- weight can be changed any time
> Signed-off-by: Vincent Bernat <vincent@bernat.im>
> ---
> include/net/ip_vs.h | 27 ++++
> net/netfilter/ipvs/Kconfig | 13 ++
> net/netfilter/ipvs/Makefile | 1 +
> net/netfilter/ipvs/ip_vs_csh.c | 339 +++++++++++++++++++++++++++++++++++++++++
> net/netfilter/ipvs/ip_vs_sh.c | 32 +---
> 5 files changed, 381 insertions(+), 31 deletions(-)
> create mode 100644 net/netfilter/ipvs/ip_vs_csh.c
Regards
^ permalink raw reply
* [GIT PULL] remove in-kernel calls to syscalls
From: Dominik Brodowski @ 2018-04-02 19:04 UTC (permalink / raw)
To: torvalds
Cc: linux-kernel, viro, arnd, linux-arch, hmclauchlan, tautschn,
Amir Goldstein, Andi Kleen, Andrew Morton, Christoph Hellwig,
Darren Hart, David S. Miller, Eric W. Biederman, H. Peter Anvin,
Ingo Molnar, Jaswinder Singh, Jeff Dike, Jiri Slaby, kexec,
linux-fsdevel, linux-mm, linux-s390, Luis R. Rodriguez, netdev,
Peter Zijlstra <pete
Linus,
please pull the following changes since commit 0c8efd610b58cb23cefdfa12015799079aef94ae:
Linux 4.16-rc5 (2018-03-11 17:25:09 -0700)
which are available in the Git repository at:
https://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux.git syscalls-next
up to commit c9a211951c7c79cfb5de888d7d9550872868b086:
bpf: whitelist all syscalls for error injection (2018-04-02 20:16:21 +0200)
to remove all in-kernel calls to syscalls except from arch/ .
Since the last time I sent the patches out for review,[*] I have solely
added a few more ACKs. Jon Corbet raised the question whether the
documentation really should go to Documentation/process/adding-syscalls.rst
and not to Documentation/process/coding-style.rst (even though, as he said,
that isn't quite right either). As most of the existing instances where
syscalls were called in the kernel were (1) common codepaths for old
and new syscalls, (2) common codepaths for native and compat syscalls, and
(3) syscall multiplexers like sys_ipc(), I have kept it at the former
location for the time being, but will be happy to submit a follow-up patch
to move the documentation bits to a different file.
[*] lkml.kernel.org/r/20180329112426.23043-1-linux@dominikbrodowski.net
All these patches have been in -next, but got rebased a few minutes ago to
include another ACK in patch 2/109 (no code changes). There were/are a few
trivial conflicts against the net, sparc and vfs trees, but not (yet) against
what is in your tree up to commit 86bbbebac1933e6e95e8234c4f7d220c5ddd38bc.
Thanks,
Dominik
----------------------------------------------------------------
System calls are interaction points between userspace and the kernel.
Therefore, system call functions such as sys_xyzzy() or compat_sys_xyzzy()
should only be called from userspace via the syscall table, but not from
elsewhere in the kernel.
At least on 64-bit x86, it will likely be a hard requirement from v4.17
onwards to not call system call functions in the kernel: It is better to
use use a different calling convention for system calls there, where
struct pt_regs is decoded on-the-fly in a syscall wrapper which then hands
processing over to the actual syscall function. This means that only those
parameters which are actually needed for a specific syscall are passed on
during syscall entry, instead of filling in six CPU registers with random
user space content all the time (which may cause serious trouble down the
call chain). Those x86-specific patches will be pushed through the x86
tree in the near future.
Moreover, rules on how data may be accessed may differ between kernel data
and user data. This is another reason why calling sys_xyzzy() is
generally a bad idea, and -- at most -- acceptable in arch-specific code.
This patchset removes all in-kernel calls to syscall functions in the
kernel with the exception of arch/. On top of this, it cleans up the
three places where many syscalls are referenced or prototyped, namely
kernel/sys_ni.c, include/linux/syscalls.h and include/linux/compat.h.
First goes a patch which defines the goal and explains the rationale:
syscalls: define and explain goal to not call syscalls in the kernel
A few codepaths can trivially be converted to existing in-kernel interfaces:
kernel: use kernel_wait4() instead of sys_wait4()
kernel: open-code sys_rt_sigpending() in sys_sigpending()
kexec: call do_kexec_load() in compat syscall directly
mm: use do_futex() instead of sys_futex() in mm_release()
x86: use _do_fork() in compat_sys_x86_clone()
x86: remove compat_sys_x86_waitpid()
Then follow many patches which only affect specfic subsystems each, and
replace sys_*() with internal helpers named __sys_*() or do_sys_*(). Let's
start with net/:
net: socket: add __sys_recvfrom() helper; remove in-kernel call to syscall
net: socket: add __sys_sendto() helper; remove in-kernel call to syscall
net: socket: add __sys_accept4() helper; remove in-kernel call to syscall
net: socket: add __sys_socket() helper; remove in-kernel call to syscall
net: socket: add __sys_bind() helper; remove in-kernel call to syscall
net: socket: add __sys_connect() helper; remove in-kernel call to syscall
net: socket: add __sys_listen() helper; remove in-kernel call to syscall
net: socket: add __sys_getsockname() helper; remove in-kernel call to syscall
net: socket: add __sys_getpeername() helper; remove in-kernel call to syscall
net: socket: add __sys_socketpair() helper; remove in-kernel call to syscall
net: socket: add __sys_shutdown() helper; remove in-kernel call to syscall
net: socket: add __sys_setsockopt() helper; remove in-kernel call to syscall
net: socket: add __sys_getsockopt() helper; remove in-kernel call to syscall
net: socket: add do_sys_recvmmsg() helper; remove in-kernel call to syscall
net: socket: move check for forbid_cmsg_compat to __sys_...msg()
net: socket: replace calls to sys_send() with __sys_sendto()
net: socket: replace call to sys_recv() with __sys_recvfrom()
net: socket: add __compat_sys_recvfrom() helper; remove in-kernel call to compat syscall
net: socket: add __compat_sys_setsockopt() helper; remove in-kernel call to compat syscall
net: socket: add __compat_sys_getsockopt() helper; remove in-kernel call to compat syscall
net: socket: add __compat_sys_recvmmsg() helper; remove in-kernel call to compat syscall
net: socket: add __compat_sys_...msg() helpers; remove in-kernel calls to compat syscalls
The changes in ipc/ are limited to this specific subsystem. The wrappers are
named ksys_*() to denote that these functions are meant as a drop-in replacement
for the syscalls.
ipc: add semtimedop syscall/compat_syscall wrappers
ipc: add semget syscall wrapper
ipc: add semctl syscall/compat_syscall wrappers
ipc: add msgget syscall wrapper
ipc: add shmget syscall wrapper
ipc: add shmdt syscall wrapper
ipc: add shmctl syscall/compat_syscall wrappers
ipc: add msgctl syscall/compat_syscall wrappers
ipc: add msgrcv syscall/compat_syscall wrappers
ipc: add msgsnd syscall/compat_syscall wrappers
A few mindless conversions in kernel/ and mm/:
kernel: add do_getpgid() helper; remove internal call to sys_getpgid()
kernel: add do_compat_sigaltstack() helper; remove in-kernel call to compat syscall
kernel: provide ksys_*() wrappers for syscalls called by kernel/uid16.c
sched: add do_sched_yield() helper; remove in-kernel call to sched_yield()
mm: add kernel_migrate_pages() helper, move compat syscall to mm/mempolicy.c
mm: add kernel_move_pages() helper, move compat syscall to mm/migrate.c
mm: add kernel_mbind() helper; remove in-kernel call to syscall
mm: add kernel_[sg]et_mempolicy() helpers; remove in-kernel calls to syscalls
Then, let's handle those instances internal to fs/ which call syscalls:
fs: add do_readlinkat() helper; remove internal call to sys_readlinkat()
fs: add do_pipe2() helper; remove internal call to sys_pipe2()
fs: add do_renameat2() helper; remove internal call to sys_renameat2()
fs: add do_futimesat() helper; remove internal call to sys_futimesat()
fs: add do_epoll_*() helpers; remove internal calls to sys_epoll_*()
fs: add do_signalfd4() helper; remove internal calls to sys_signalfd4()
fs: add do_eventfd() helper; remove internal call to sys_eventfd()
fs: add do_lookup_dcookie() helper; remove in-kernel call to syscall
fs: add do_vmsplice() helper; remove in-kernel call to syscall
fs: add kern_select() helper; remove in-kernel call to sys_select()
fs: add do_compat_fcntl64() helper; remove in-kernel call to compat syscall
fs: add do_compat_select() helper; remove in-kernel call to compat syscall
fs: add do_compat_signalfd4() helper; remove in-kernel call to compat syscall
fs: add do_compat_futimesat() helper; remove in-kernel call to compat syscall
inotify: add do_inotify_init() helper; remove in-kernel call to syscall
fanotify: add do_fanotify_mark() helper; remove in-kernel call to syscall
fs/quota: add kernel_quotactl() helper; remove in-kernel call to syscall
fs/quota: use COMPAT_SYSCALL_DEFINE for sys32_quotactl()
Several fs- and some mm-related syscalls are called in initramfs, initrd and
init, devtmpfs, and pm code. While at least many of these instances should be
converted to use proper in-kernel VFS interfaces in future, convert them
mindlessly to ksys_*() helpers or wrappers for now.
fs: add ksys_mount() helper; remove in-kernel calls to sys_mount()
fs: add ksys_umount() helper; remove in-kernel call to sys_umount()
fs: add ksys_dup{,3}() helper; remove in-kernel calls to sys_dup{,3}()
fs: add ksys_chroot() helper; remove-in kernel calls to sys_chroot()
fs: add ksys_write() helper; remove in-kernel calls to sys_write()
fs: add ksys_chdir() helper; remove in-kernel calls to sys_chdir()
fs: add ksys_unlink() wrapper; remove in-kernel calls to sys_unlink()
hostfs: rename do_rmdir() to hostfs_do_rmdir()
fs: add ksys_rmdir() wrapper; remove in-kernel calls to sys_rmdir()
fs: add do_mkdirat() helper and ksys_mkdir() wrapper; remove in-kernel calls to syscall
fs: add do_symlinkat() helper and ksys_symlink() wrapper; remove in-kernel calls to syscall
fs: add do_mknodat() helper and ksys_mknod() wrapper; remove in-kernel calls to syscall
fs: add do_linkat() helper and ksys_link() wrapper; remove in-kernel calls to syscall
fs: add ksys_fchmod() and do_fchmodat() helpers and ksys_chmod() wrapper; remove in-kernel calls to syscall
fs: add do_faccessat() helper and ksys_access() wrapper; remove in-kernel calls to syscall
fs: add do_fchownat(), ksys_fchown() helpers and ksys_{,l}chown() wrappers
fs: add ksys_ftruncate() wrapper; remove in-kernel calls to sys_ftruncate()
fs: add ksys_close() wrapper; remove in-kernel calls to sys_close()
fs: add ksys_open() wrapper; remove in-kernel calls to sys_open()
fs: add ksys_getdents64() helper; remove in-kernel calls to sys_getdents64()
fs: add ksys_ioctl() helper; remove in-kernel calls to sys_ioctl()
fs: add ksys_lseek() helper; remove in-kernel calls to sys_lseek()
fs: add ksys_read() helper; remove in-kernel calls to sys_read()
fs: add ksys_sync() helper; remove in-kernel calls to sys_sync()
kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare()
kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid()
To reach the goal to get rid of all in-kernel calls to syscalls for x86, we
need to handle a few further syscalls called from compat syscalls in x86 and
(mostly) from other architectures. Those could be made generic making use of
Al Viro's macro trickery. For v4.17, I'd suggest to keep it simple:
fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall
fs: add ksys_truncate() wrapper; remove in-kernel calls to sys_truncate()
fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls
fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate()
mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64()
mm: add ksys_mmap_pgoff() helper; remove in-kernel calls to sys_mmap_pgoff()
mm: add ksys_readahead() helper; remove in-kernel calls to sys_readahead()
x86/ioport: add ksys_ioperm() helper; remove in-kernel calls to sys_ioperm()
Then, throw in two fixes for x86:
x86: fix sys_sigreturn() return type to be long, not unsigned long
x86/sigreturn: use SYSCALL_DEFINE0 (by Michael Tautschnig)
... and clean up the three places where many syscalls are referenced or
prototyped (kernel/sys_ni.c, include/linux/syscalls.h and
include/linux/compat.h):
kexec: move sys_kexec_load() prototype to syscalls.h
syscalls: sort syscall prototypes in include/linux/syscalls.h
net: remove compat_sys_*() prototypes from net/compat.h
syscalls: sort syscall prototypes in include/linux/compat.h
syscalls/x86: auto-create compat_sys_*() prototypes
kernel/sys_ni: sort cond_syscall() entries
kernel/sys_ni: remove {sys_,sys_compat} from cond_syscall definitions
Last but not least, add a patch by Howard McLauchlan to whitelist all syscalls
for error injection:
bpf: whitelist all syscalls for error injection (by Howard McLauchlan)
----------------------------------------------------------------
Dominik Brodowski (107):
syscalls: define and explain goal to not call syscalls in the kernel
kernel: use kernel_wait4() instead of sys_wait4()
kernel: open-code sys_rt_sigpending() in sys_sigpending()
kexec: call do_kexec_load() in compat syscall directly
mm: use do_futex() instead of sys_futex() in mm_release()
x86: use _do_fork() in compat_sys_x86_clone()
x86: remove compat_sys_x86_waitpid()
net: socket: add __sys_recvfrom() helper; remove in-kernel call to syscall
net: socket: add __sys_sendto() helper; remove in-kernel call to syscall
net: socket: add __sys_accept4() helper; remove in-kernel call to syscall
net: socket: add __sys_socket() helper; remove in-kernel call to syscall
net: socket: add __sys_bind() helper; remove in-kernel call to syscall
net: socket: add __sys_connect() helper; remove in-kernel call to syscall
net: socket: add __sys_listen() helper; remove in-kernel call to syscall
net: socket: add __sys_getsockname() helper; remove in-kernel call to syscall
net: socket: add __sys_getpeername() helper; remove in-kernel call to syscall
net: socket: add __sys_socketpair() helper; remove in-kernel call to syscall
net: socket: add __sys_shutdown() helper; remove in-kernel call to syscall
net: socket: add __sys_setsockopt() helper; remove in-kernel call to syscall
net: socket: add __sys_getsockopt() helper; remove in-kernel call to syscall
net: socket: add do_sys_recvmmsg() helper; remove in-kernel call to syscall
net: socket: move check for forbid_cmsg_compat to __sys_...msg()
net: socket: replace calls to sys_send() with __sys_sendto()
net: socket: replace call to sys_recv() with __sys_recvfrom()
net: socket: add __compat_sys_recvfrom() helper; remove in-kernel call to compat syscall
net: socket: add __compat_sys_setsockopt() helper; remove in-kernel call to compat syscall
net: socket: add __compat_sys_getsockopt() helper; remove in-kernel call to compat syscall
net: socket: add __compat_sys_recvmmsg() helper; remove in-kernel call to compat syscall
net: socket: add __compat_sys_...msg() helpers; remove in-kernel calls to compat syscalls
ipc: add semtimedop syscall/compat_syscall wrappers
ipc: add semget syscall wrapper
ipc: add semctl syscall/compat_syscall wrappers
ipc: add msgget syscall wrapper
ipc: add shmget syscall wrapper
ipc: add shmdt syscall wrapper
ipc: add shmctl syscall/compat_syscall wrappers
ipc: add msgctl syscall/compat_syscall wrappers
ipc: add msgrcv syscall/compat_syscall wrappers
ipc: add msgsnd syscall/compat_syscall wrappers
kernel: add do_getpgid() helper; remove internal call to sys_getpgid()
kernel: add do_compat_sigaltstack() helper; remove in-kernel call to compat syscall
kernel: provide ksys_*() wrappers for syscalls called by kernel/uid16.c
sched: add do_sched_yield() helper; remove in-kernel call to sched_yield()
mm: add kernel_migrate_pages() helper, move compat syscall to mm/mempolicy.c
mm: add kernel_move_pages() helper, move compat syscall to mm/migrate.c
mm: add kernel_mbind() helper; remove in-kernel call to syscall
mm: add kernel_[sg]et_mempolicy() helpers; remove in-kernel calls to syscalls
fs: add do_readlinkat() helper; remove internal call to sys_readlinkat()
fs: add do_pipe2() helper; remove internal call to sys_pipe2()
fs: add do_renameat2() helper; remove internal call to sys_renameat2()
fs: add do_futimesat() helper; remove internal call to sys_futimesat()
fs: add do_epoll_*() helpers; remove internal calls to sys_epoll_*()
fs: add do_signalfd4() helper; remove internal calls to sys_signalfd4()
fs: add do_eventfd() helper; remove internal call to sys_eventfd()
fs: add do_lookup_dcookie() helper; remove in-kernel call to syscall
fs: add do_vmsplice() helper; remove in-kernel call to syscall
fs: add kern_select() helper; remove in-kernel call to sys_select()
fs: add do_compat_fcntl64() helper; remove in-kernel call to compat syscall
fs: add do_compat_select() helper; remove in-kernel call to compat syscall
fs: add do_compat_signalfd4() helper; remove in-kernel call to compat syscall
fs: add do_compat_futimesat() helper; remove in-kernel call to compat syscall
inotify: add do_inotify_init() helper; remove in-kernel call to syscall
fanotify: add do_fanotify_mark() helper; remove in-kernel call to syscall
fs/quota: add kernel_quotactl() helper; remove in-kernel call to syscall
fs/quota: use COMPAT_SYSCALL_DEFINE for sys32_quotactl()
fs: add ksys_mount() helper; remove in-kernel calls to sys_mount()
fs: add ksys_umount() helper; remove in-kernel call to sys_umount()
fs: add ksys_dup{,3}() helper; remove in-kernel calls to sys_dup{,3}()
fs: add ksys_chroot() helper; remove-in kernel calls to sys_chroot()
fs: add ksys_write() helper; remove in-kernel calls to sys_write()
fs: add ksys_chdir() helper; remove in-kernel calls to sys_chdir()
fs: add ksys_unlink() wrapper; remove in-kernel calls to sys_unlink()
hostfs: rename do_rmdir() to hostfs_do_rmdir()
fs: add ksys_rmdir() wrapper; remove in-kernel calls to sys_rmdir()
fs: add do_mkdirat() helper and ksys_mkdir() wrapper; remove in-kernel calls to syscall
fs: add do_symlinkat() helper and ksys_symlink() wrapper; remove in-kernel calls to syscall
fs: add do_mknodat() helper and ksys_mknod() wrapper; remove in-kernel calls to syscall
fs: add do_linkat() helper and ksys_link() wrapper; remove in-kernel calls to syscall
fs: add ksys_fchmod() and do_fchmodat() helpers and ksys_chmod() wrapper; remove in-kernel calls to syscall
fs: add do_faccessat() helper and ksys_access() wrapper; remove in-kernel calls to syscall
fs: add do_fchownat(), ksys_fchown() helpers and ksys_{,l}chown() wrappers
fs: add ksys_ftruncate() wrapper; remove in-kernel calls to sys_ftruncate()
fs: add ksys_close() wrapper; remove in-kernel calls to sys_close()
fs: add ksys_open() wrapper; remove in-kernel calls to sys_open()
fs: add ksys_getdents64() helper; remove in-kernel calls to sys_getdents64()
fs: add ksys_ioctl() helper; remove in-kernel calls to sys_ioctl()
fs: add ksys_lseek() helper; remove in-kernel calls to sys_lseek()
fs: add ksys_read() helper; remove in-kernel calls to sys_read()
fs: add ksys_sync() helper; remove in-kernel calls to sys_sync()
kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare()
kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid()
fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall
fs: add ksys_truncate() wrapper; remove in-kernel calls to sys_truncate()
fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls
fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate()
mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64()
mm: add ksys_mmap_pgoff() helper; remove in-kernel calls to sys_mmap_pgoff()
mm: add ksys_readahead() helper; remove in-kernel calls to sys_readahead()
x86/ioport: add ksys_ioperm() helper; remove in-kernel calls to sys_ioperm()
x86: fix sys_sigreturn() return type to be long, not unsigned long
kexec: move sys_kexec_load() prototype to syscalls.h
syscalls: sort syscall prototypes in include/linux/syscalls.h
net: remove compat_sys_*() prototypes from net/compat.h
syscalls: sort syscall prototypes in include/linux/compat.h
syscalls/x86: auto-create compat_sys_*() prototypes
kernel/sys_ni: sort cond_syscall() entries
kernel/sys_ni: remove {sys_,sys_compat} from cond_syscall definitions
Howard McLauchlan (1):
bpf: whitelist all syscalls for error injection
Tautschnig, Michael (1):
x86/sigreturn: use SYSCALL_DEFINE0
Documentation/process/adding-syscalls.rst | 34 +-
arch/alpha/kernel/osf_sys.c | 2 +-
arch/arm/kernel/sys_arm.c | 2 +-
arch/arm64/kernel/sys.c | 2 +-
arch/ia64/kernel/sys_ia64.c | 4 +-
arch/m68k/kernel/sys_m68k.c | 2 +-
arch/microblaze/kernel/sys_microblaze.c | 6 +-
arch/mips/kernel/linux32.c | 22 +-
arch/mips/kernel/syscall.c | 6 +-
arch/parisc/kernel/sys_parisc.c | 30 +-
arch/powerpc/kernel/sys_ppc32.c | 18 +-
arch/powerpc/kernel/syscalls.c | 6 +-
arch/riscv/kernel/sys_riscv.c | 4 +-
arch/s390/kernel/compat_linux.c | 37 +-
arch/s390/kernel/sys_s390.c | 2 +-
arch/sh/kernel/sys_sh.c | 4 +-
arch/sh/kernel/sys_sh32.c | 12 +-
arch/sparc/kernel/setup_32.c | 2 +-
arch/sparc/kernel/sys_sparc32.c | 26 +-
arch/sparc/kernel/sys_sparc_32.c | 6 +-
arch/sparc/kernel/sys_sparc_64.c | 2 +-
arch/um/kernel/syscall.c | 2 +-
arch/x86/entry/syscalls/syscall_32.tbl | 4 +-
arch/x86/ia32/ia32_signal.c | 1 -
arch/x86/ia32/sys_ia32.c | 50 +-
arch/x86/include/asm/sys_ia32.h | 67 --
arch/x86/include/asm/syscalls.h | 3 +-
arch/x86/kernel/ioport.c | 7 +-
arch/x86/kernel/signal.c | 5 +-
arch/x86/kernel/sys_x86_64.c | 2 +-
arch/xtensa/kernel/syscall.c | 2 +-
drivers/base/devtmpfs.c | 11 +-
drivers/tty/sysrq.c | 2 +-
drivers/tty/vt/vt_ioctl.c | 6 +-
fs/autofs4/dev-ioctl.c | 2 +-
fs/binfmt_misc.c | 2 +-
fs/dcookies.c | 11 +-
fs/eventfd.c | 9 +-
fs/eventpoll.c | 23 +-
fs/fcntl.c | 12 +-
fs/file.c | 17 +-
fs/hostfs/hostfs.h | 2 +-
fs/hostfs/hostfs_kern.c | 2 +-
fs/hostfs/hostfs_user.c | 2 +-
fs/internal.h | 14 +
fs/ioctl.c | 7 +-
fs/namei.c | 61 +-
fs/namespace.c | 19 +-
fs/notify/fanotify/fanotify_user.c | 14 +-
fs/notify/inotify/inotify_user.c | 9 +-
fs/open.c | 77 +-
fs/pipe.c | 9 +-
fs/quota/compat.c | 13 +-
fs/quota/quota.c | 10 +-
fs/read_write.c | 45 +-
fs/readdir.c | 11 +-
fs/select.c | 29 +-
fs/signalfd.c | 31 +-
fs/splice.c | 12 +-
fs/stat.c | 12 +-
fs/sync.c | 19 +-
fs/utimes.c | 25 +-
include/linux/compat.h | 644 ++++++------
include/linux/futex.h | 13 +-
include/linux/kexec.h | 4 -
include/linux/quotaops.h | 3 +
include/linux/socket.h | 37 +-
include/linux/syscalls.h | 1511 +++++++++++++++++------------
include/net/compat.h | 11 -
init/do_mounts.c | 26 +-
init/do_mounts.h | 4 +-
init/do_mounts_initrd.c | 42 +-
init/do_mounts_md.c | 29 +-
init/do_mounts_rd.c | 40 +-
init/initramfs.c | 52 +-
init/main.c | 9 +-
init/noinitramfs.c | 6 +-
ipc/msg.c | 60 +-
ipc/sem.c | 44 +-
ipc/shm.c | 28 +-
ipc/syscall.c | 58 +-
ipc/util.h | 31 +
kernel/compat.c | 55 --
kernel/exit.c | 2 +-
kernel/fork.c | 11 +-
kernel/kexec.c | 52 +-
kernel/pid_namespace.c | 6 +-
kernel/power/hibernate.c | 2 +-
kernel/power/suspend.c | 2 +-
kernel/power/user.c | 2 +-
kernel/sched/core.c | 8 +-
kernel/signal.c | 29 +-
kernel/sys.c | 74 +-
kernel/sys_ni.c | 617 +++++++-----
kernel/uid16.c | 25 +-
kernel/uid16.h | 14 +
kernel/umh.c | 4 +-
mm/fadvise.c | 10 +-
mm/mempolicy.c | 92 +-
mm/migrate.c | 39 +-
mm/mmap.c | 17 +-
mm/nommu.c | 17 +-
mm/readahead.c | 7 +-
net/compat.c | 136 ++-
net/socket.c | 234 +++--
105 files changed, 3129 insertions(+), 1868 deletions(-)
delete mode 100644 arch/x86/include/asm/sys_ia32.h
create mode 100644 kernel/uid16.h
^ permalink raw reply
* Re: [PATCH 15/15] ARM: pxa: change SSP DMA channels allocation
From: kbuild test robot @ 2018-04-02 18:46 UTC (permalink / raw)
To: Robert Jarzmik
Cc: Ulf Hansson, alsa-devel, Takashi Iwai, linux-ide, netdev,
linux-mtd, Robert Jarzmik, devel, Boris Brezillon, Vinod Koul,
Richard Weinberger, Russell King, Marek Vasut, Ezequiel Garcia,
linux-media, Samuel Ortiz, Arnd Bergmann,
Bartlomiej Zolnierkiewicz, Haojian Zhuang, dmaengine, Mark Brown,
Jaroslav Kysela, Mauro Carvalho Chehab, linux-arm-kernel,
Nicolas
In-Reply-To: <20180402142656.26815-16-robert.jarzmik@free.fr>
[-- Attachment #1: Type: text/plain, Size: 1706 bytes --]
Hi Robert,
I love your patch! Yet something to improve:
[auto build test ERROR on linus/master]
[also build test ERROR on v4.16]
[cannot apply to arm-soc/for-next next-20180329]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Robert-Jarzmik/ARM-pxa-switch-to-DMA-slave-maps/20180402-233029
config: arm-allmodconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm
All errors (new ones prefixed by >>):
>> arch/arm/plat-pxa/ssp.c:19:10: fatal error: mach/audio.h: No such file or directory
#include <mach/audio.h>
^~~~~~~~~~~~~~
compilation terminated.
vim +19 arch/arm/plat-pxa/ssp.c
> 19 #include <mach/audio.h>
20 #include <linux/module.h>
21 #include <linux/kernel.h>
22 #include <linux/sched.h>
23 #include <linux/slab.h>
24 #include <linux/errno.h>
25 #include <linux/interrupt.h>
26 #include <linux/ioport.h>
27 #include <linux/init.h>
28 #include <linux/mutex.h>
29 #include <linux/clk.h>
30 #include <linux/err.h>
31 #include <linux/platform_device.h>
32 #include <linux/spi/pxa2xx_spi.h>
33 #include <linux/io.h>
34 #include <linux/of.h>
35 #include <linux/of_device.h>
36
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 65120 bytes --]
[-- Attachment #3: Type: text/plain, Size: 169 bytes --]
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
^ permalink raw reply
* Re: [net-next 0/2] Add promiscous mode support in k2g network driver
From: Murali Karicheri @ 2018-04-02 18:42 UTC (permalink / raw)
To: David Miller; +Cc: w-kwok2, linux-kernel, netdev, nsekhar, grygorii.strashko
In-Reply-To: <20180402.122831.2175483097067229440.davem@davemloft.net>
On 04/02/2018 12:28 PM, David Miller wrote:
> From: Murali Karicheri <m-karicheri2@ti.com>
> Date: Mon, 2 Apr 2018 12:17:17 -0400
>
>> This patch adds support for promiscuous mode in network driver for K2G
>> SoC. This depends on v3 of my series at
>> https://www.spinics.net/lists/kernel/msg2765942.html
>
> The net-next tree is closed, please resubmit this series after the merge
> window when the net-next tree is openned back up.
>
> Thank you.
>
Sure! As I have indicated in my cover letter, I will fold this
to the other series and re-submit it when the merge window opens.
Thanks
--
Murali Karicheri
Linux Kernel, Keystone
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox