* Re: [PATCH v6 4/4] net: phy: realtek: Add LED configuration support for RTL8211E
From: Matthias Kaehlcke @ 2019-08-26 18:40 UTC (permalink / raw)
To: Florian Fainelli
Cc: Doug Anderson, Pavel Machek, David S . Miller, Rob Herring,
Mark Rutland, Andrew Lunn, Heiner Kallweit, netdev, devicetree,
LKML
In-Reply-To: <f1fd7aba-b36f-8cc4-9ed7-9977c0912b9d@gmail.com>
On Fri, Aug 23, 2019 at 12:58:09PM -0700, Florian Fainelli wrote:
> On 8/16/19 3:39 PM, Doug Anderson wrote:
> > Hi,
> >
> > On Fri, Aug 16, 2019 at 3:12 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
> >>
> >> On 8/16/19 2:27 PM, Matthias Kaehlcke wrote:
> >>> On Fri, Aug 16, 2019 at 10:13:42PM +0200, Pavel Machek wrote:
> >>>> On Tue 2019-08-13 12:11:47, Matthias Kaehlcke wrote:
> >>>>> Add a .config_led hook which is called by the PHY core when
> >>>>> configuration data for a PHY LED is available. Each LED can be
> >>>>> configured to be solid 'off, solid 'on' for certain (or all)
> >>>>> link speeds or to blink on RX/TX activity.
> >>>>>
> >>>>> Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
> >>>>
> >>>> THis really needs to go through the LED subsystem,
> >>>
> >>> Sorry, I used what get_maintainers.pl threw at me, I should have
> >>> manually cc-ed the LED list.
> >>>
> >>>> and use the same userland interfaces as the rest of the system.
> >>>
> >>> With the PHY maintainers we discussed to define a binding that is
> >>> compatible with that of the LED one, to have the option to integrate
> >>> it with the LED subsystem later. The integration itself is beyond the
> >>> scope of this patchset.
> >>>
> >>> The PHY LED configuration is a low priority for the project I'm
> >>> working on. I wanted to make an attempt to upstream it and spent
> >>> already significantly more time on it than planned, if integration
> >>> with the LED framework now is a requirement please consider this
> >>> series abandonded.
> >>
> >> While I have an appreciation for how hard it can be to work in a
> >> corporate environment while doing upstream first and working with
> >> virtually unbounded goals (in time or scope) due to maintainers and
> >> reviewers, that kind of statement can hinder your ability to establish
> >> trust with peers in the community as it can be read as take it or leave it.
> >
> > You think so? I feel like Matthias is simply expressing the reality
> > of the situation here and I'd rather see a statement like this posted
> > than the series just silently dropped. Communication is good.
> >
> > In general on Chrome OS we don't spent lots of time tweaking with
> > Ethernet and even less time tweaking with Ethernet on ARM boards where
> > you might need a binding like this, so it's pretty hard to justify up
> > the management chain spending massive amounts of resources on it. In
> > this case we have two existing ARM boards which we're trying to uprev
> > from 3.14 to 4.19 which were tweaking the Ethernet driver in some
> > downstream code. We thought it would be nice to try to come up with a
> > solution that could land upstream, which is usually what we try to do
> > in these cases.
> >
> > Normally if there is some major architecture needed that can't fit in
> > the scope of a project, we would do a downstream solution for the
> > project and then fork off the task (maybe by a different Engineer or a
> > contractor) to get a solution that can land upstream. ...but in this
> > case it seems hard to justify because it's unlikely we would need it
> > again anytime remotely soon.
> >
> > So I guess the alternatives to what Matthias did would have been:
> >
> > A) Don't even try to upstream. Seems worse. At least this way
> > there's something a future person can start from and the discussion is
> > rolling.
> >
> > B) Keep spending tons of time on something even though management
> > doesn't want him to. Seems worse.
> >
> > C) Spend his nights and weekends working on this. Seems worse.
> >
> > D) Silently stop working on it without saying "I'm going to stop". Seems worse.
> >
> > ...unless you have a brilliant "E)" I think what Matthias did here is
> > exactly right.
>
> I must apologize for making that statement since it was not fair to
> Matthias, and he has been clear about how much time he can spend on that
> specific, please accept my apologies for that.
>
> Having had many recent encounters with various people not driving
> projects to completion lately (not specifically within Linux), it looks
> like I am overly sensitive about flagging words and patch status that
> may fall within that lexicon. The choice of word is what triggered me.
No worries, I understand that it can be frustrating if you repeatedly
experience that projects remain unfinished.
Hopefully this series can be revived eventually when somebody finds
the time to work on the integration with the LED framework.
^ permalink raw reply
* Re: [PATCH net 1/2] openvswitch: Properly set L4 keys on "later" IP fragments.
From: Justin Pettit @ 2019-08-26 18:42 UTC (permalink / raw)
To: Pravin Shelar, David Miller
Cc: Linux Kernel Network Developers, Joe Stringer, Greg Rose
In-Reply-To: <CAOrHB_BKd=QKR_ScO+r7qAyZaniEbFur+iup1iXbtiycaFawjw@mail.gmail.com>
> On Aug 25, 2019, at 1:40 PM, Pravin Shelar <pshelar@ovn.org> wrote:
>
> Actually I am not sure about this change. caller of this function
> (ovs_ct_execute()) does skb-pull and push of L2 header, calling
> ovs_flow_key_update() is not safe here, it expect skb data to point to
> L2 header.
Thanks for the feedback, Pravin and David. Greg Rose has a revised version that will address the issues you raised and also make it so that we don't bother re-extracting the L2 headers. He'll hopefully get that out today once we've done some internal testing on it.
--Justin
^ permalink raw reply
* Re: [PATCH v1 net-next] net: stmmac: Add support for MDIO interrupts
From: Andrew Lunn @ 2019-08-26 18:47 UTC (permalink / raw)
To: Voon Weifeng
Cc: David S. Miller, Maxime Coquelin, netdev, linux-kernel,
Jose Abreu, Giuseppe Cavallaro, Alexandre Torgue, Ong Boon Leong
In-Reply-To: <1566870320-9825-1-git-send-email-weifeng.voon@intel.com>
On Tue, Aug 27, 2019 at 09:45:20AM +0800, Voon Weifeng wrote:
> From: "Chuah, Kim Tatt" <kim.tatt.chuah@intel.com>
>
> DW EQoS v5.xx controllers added capability for interrupt generation
> when MDIO interface is done (GMII Busy bit is cleared).
> This patch adds support for this interrupt on supported HW to avoid
> polling on GMII Busy bit.
>
> stmmac_mdio_read() & stmmac_mdio_write() will sleep until wake_up() is
> called by the interrupt handler.
Hi Voon
I _think_ there are some order of operation issues here. The mdiobus
is registered in the probe function. As soon as of_mdiobus_register()
is called, the MDIO bus must work. At that point MDIO read/writes can
start to happen.
As far as i can see, the interrupt handler is only requested in
stmmac_open(). So it seems like any MDIO operations after probe, but
before open are going to fail?
Thanks
Andrew
^ permalink raw reply
* Re: [PATCH RFC] net: dsa: mv88e6xxx: fully support SERDES on Topaz family
From: Vivien Didelot @ 2019-08-26 18:53 UTC (permalink / raw)
To: Marek Behun; +Cc: netdev, Andrew Lunn, Florian Fainelli, Vladimir Oltean
In-Reply-To: <20190826203614.6f9f6a8d@nic.cz>
On Mon, 26 Aug 2019 20:36:14 +0200, Marek Behun <marek.behun@nic.cz> wrote:
> > Ask yourself what is the single task achieved by this function, and name this
> > operation accordingly. It seems to change the CMODE to be writable, only
> > supported by certain switch models right? So in addition to port_get_cmode
> > and port_set_cmode, you can add port_set_cmode_writable, and call it right
> > before or after port_set_cmode in mv88e6xxx_port_setup_mac.
>
> Andrew's complaint was also about this function being called every time
> cmode is to be changed. The cmode does need to be made writable only
> once. In this sense it does make sense to put into into
> mv88e6xxx_setup_port.
mv88e6xxx_port_setup_mac is called by mv88e6xxx_setup_port as expected and also
.phylink_mac_config. I don't think they are called that often and both deal
with configuration, so I'd prefer to keep this consistent and group the two
operations together in mv88e6xxx_port_setup_mac, if that's good for Andrew too.
Thanks,
Vivien
^ permalink raw reply
* Re: [PATCH v1 net-next] net: phy: mdio_bus: make mdiobus_scan also cover PHY that only talks C45
From: Andrew Lunn @ 2019-08-26 18:54 UTC (permalink / raw)
To: Florian Fainelli
Cc: Voon Weifeng, David S. Miller, Maxime Coquelin, netdev,
linux-kernel, Jose Abreu, Heiner Kallweit, Ong Boon Leong
In-Reply-To: <e9ece5ad-a669-6d6b-d050-c633cad15476@gmail.com>
On Mon, Aug 26, 2019 at 11:27:53AM -0700, Florian Fainelli wrote:
> On 8/26/19 6:52 PM, Voon Weifeng wrote:
> > From: Ong Boon Leong <boon.leong.ong@intel.com>
> >
> > Make mdiobus_scan() to try harder to look for any PHY that only talks C45.
> If you are not using Device Tree or ACPI, and you are letting the MDIO
> bus be scanned, it sounds like there should be a way for you to provide
> a hint as to which addresses should be scanned (that's
> mii_bus::phy_mask) and possibly enhance that with a mask of possible C45
> devices?
Yes, i don't like this unconditional c45 scanning. A lot of MDIO bus
drivers don't look for the MII_ADDR_C45. They are going to do a C22
transfer, and maybe not mask out the MII_ADDR_C45 from reg, causing an
invalid register write. Bad things can then happen.
With DT and ACPI, we have an explicit indication that C45 should be
used, so we know on this platform C45 is safe to use. We need
something similar when not using DT or ACPI.
Andrew
^ permalink raw reply
* [PATCH] net/hamradio/6pack: Fix the size of a sk_buff used in 'sp_bump()'
From: Christophe JAILLET @ 2019-08-26 19:02 UTC (permalink / raw)
To: ajk, davem
Cc: linux-hams, netdev, linux-kernel, kernel-janitors,
Christophe JAILLET
We 'allocate' 'count' bytes here. In fact, 'dev_alloc_skb' already add some
extra space for padding, so a bit more is allocated.
However, we use 1 byte for the KISS command, then copy 'count' bytes, so
count+1 bytes.
Explicitly allocate and use 1 more byte to be safe.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
---
This patch should be safe, be however may no be the correct way to fix the
"buffer overflow". Maybe, the allocated size is correct and we should have:
memcpy(ptr, sp->cooked_buf + 1, count - 1);
or
memcpy(ptr, sp->cooked_buf + 1, count - 1sp->rcount);
I've not dig deep enough to understand the link betwwen 'rcount' and
how 'cooked_buf' is used.
---
drivers/net/hamradio/6pack.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/hamradio/6pack.c b/drivers/net/hamradio/6pack.c
index 331c16d30d5d..23281aeeb222 100644
--- a/drivers/net/hamradio/6pack.c
+++ b/drivers/net/hamradio/6pack.c
@@ -344,10 +344,10 @@ static void sp_bump(struct sixpack *sp, char cmd)
sp->dev->stats.rx_bytes += count;
- if ((skb = dev_alloc_skb(count)) == NULL)
+ if ((skb = dev_alloc_skb(count + 1)) == NULL)
goto out_mem;
- ptr = skb_put(skb, count);
+ ptr = skb_put(skb, count + 1);
*ptr++ = cmd; /* KISS command */
memcpy(ptr, sp->cooked_buf + 1, count);
--
2.20.1
^ permalink raw reply related
* [PATCH 04/10] mm/oom_debug: Add ARP and ND Table Summary usage
From: Edward Chron @ 2019-08-26 19:36 UTC (permalink / raw)
To: Andrew Morton
Cc: Michal Hocko, Roman Gushchin, Johannes Weiner, David Rientjes,
Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel, colona,
Edward Chron, David S. Miller, netdev
In-Reply-To: <20190826193638.6638-1-echron@arista.com>
Adds config options and code to support printing ARP Table usage and or
Neighbour Discovery Table usage when an OOM event occurs. This summarized
information provides the memory usage for each table when configured.
Configuring these two OOM Debug Options
---------------------------------------
Two OOM debug options: CONFIG_DEBUG_OOM_ARP_TBL, CONFIG_DEBUG_OOM_ND_TBL
To get the output for both tables they both must be configured.
The ARP Table uses the CONFIG_DEBUG_OOM_ARP_TBL kernel config option
and the ND Table uses the CONFIG_DEBUG_OOM_ND_TBL kernel config option
both of which are found in the kernel config under the entries:
Kernel hacking, Memory Debugging, OOM Debugging entry. The ARP Table and
ND Table are configured there with the options: DEBUG_OOM_ARP_TBL and
DEBUG_OOM_ND_TBL respectively.
Dynamic disable or re-enable this OOM Debug option
--------------------------------------------------
The oom debugfs base directory is found at: /sys/kernel/debug/oom.
The oom debugfs for this option are: arp_table_summary_ and
nd_table_summary_ and there is just one enable file for each.
Either option may be disabled or re-enabled using the debugfs entry for
the OOM debug option. The debugfs file to enable the ARP Table option
is found at: /sys/kernel/debug/oom/arp_table_summary_enabled
Similarly, the debugfs file to enable the ND Table option is found at:
/sys/kernel/debug/oom/nd_table_summary_enabled
For either option their enabled file's value determines whether the
facility is enabled or disabled for that option. A value of 1 is enabled
(default) and a value of 0 is disabled. When configured the default
setting is set to enabled. Each option will produce 1 line of output.
Content and format of ARP and Neighbour Discovery Tables Summary Output
-----------------------------------------------------------------------
One line of output each for ARP and ND that includes:
- Table name
- Table size (max # entries)
- Key Length
- Entry Size
- Number of Entries
- Last Flush (in seconds)
- hash grows
- entry allocations
- entry destroys
- Number lookups
- Number of lookup hits
- Resolution failures
- Garbage Collection Forced Runs
- Table Full
- Proxy Queue Length
Sample Output:
-------------
Here is sample output for both the ARP table and ND table:
Jul 23 23:26:34 yuorsystem kernel: neighbour: Table: arp_tbl size: 256
keyLen: 4 entrySize: 360 entries: 9 lastFlush: 1721s
hGrows: 1 allocs: 9 destroys: 0 lookups: 204 hits: 199
resFailed: 38 gcRuns/Forced: 111 / 0 tblFull: 0 proxyQlen: 0
Jul 23 23:26:34 yuorsystem kernel: neighbour: Table: nd_tbl size: 128
keyLen: 16 entrySize: 368 entries: 6 lastFlush: 1720s
hGrows: 0 allocs: 7 destroys: 1 lookups: 0 hits: 0
resFailed: 0 gcRuns/Forced: 110 / 0 tblFull: 0 proxyQlen: 0
Signed-off-by: Edward Chron <echron@arista.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
---
include/net/neighbour.h | 12 +++++++
mm/Kconfig.debug | 26 ++++++++++++++
mm/oom_kill_debug.c | 38 ++++++++++++++++++++
net/core/neighbour.c | 78 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 154 insertions(+)
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 50a67bd6a434..35fdecff2724 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -569,4 +569,16 @@ static inline void neigh_update_is_router(struct neighbour *neigh, u32 flags,
*notify = 1;
}
}
+
+#if defined(CONFIG_DEBUG_OOM_ARP_TBL) || defined(CONFIG_DEBUG_OOM_ND_TBL)
+/**
+ * Routine used to print arp table and neighbour table statistics.
+ * Output goes to dmesg along with all the other OOM related messages
+ * when the config options DEBUG_OOM_ARP_TBL and DEBUG_ND_TBL are set to
+ * yes, for the ARP table and Neighbour discovery table respectively.
+ */
+extern void neightbl_print_stats(const char * const tblname,
+ struct neigh_table * const neightable);
+#endif /* CONFIG_DEBUG_OOM_ARP_TBL || CONFIG_DEBUG_OOM_ND_TBL */
+
#endif
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index fcbc5f9aa146..fe4bb5ce0a6d 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -163,3 +163,29 @@ config DEBUG_OOM_TASKS_SUMMARY
A value of 1 is enabled (default) and a value of 0 is disabled.
If unsure, say N.
+
+config DEBUG_OOM_ARP_TBL
+ bool "Debug OOM ARP Table"
+ depends on DEBUG_OOM
+ help
+ When enabled, documents kernel memory usage by the ARP Table
+ entries at the time of an OOM event. Output is one line of
+ summarzied ARP Table usage. If configured it is enabled/disabled
+ by setting the enabled file entry in the debugfs OOM interface
+ at: /sys/kernel/debug/oom/arp_table_summary_enabled
+ A value of 1 is enabled (default) and a value of 0 is disabled.
+
+ If unsure, say N.
+
+config DEBUG_OOM_ND_TBL
+ bool "Debug OOM ND Table"
+ depends on DEBUG_OOM
+ help
+ When enabled, documents kernel memory usage by the ND Table
+ entries at the time of an OOM event. Output is one line of
+ summarzied ND Table usage. If configured it is enabled/disabled
+ by setting the enabled file entry in the debugfs OOM interface
+ at: /sys/kernel/debug/oom/nd_table_summary_enabled
+ A value of 1 is enabled (default) and a value of 0 is disabled.
+
+ If unsure, say N.
diff --git a/mm/oom_kill_debug.c b/mm/oom_kill_debug.c
index 395b3307f822..c4a9117633fd 100644
--- a/mm/oom_kill_debug.c
+++ b/mm/oom_kill_debug.c
@@ -156,6 +156,16 @@
#include <linux/sched/stat.h>
#endif
+#if defined(CONFIG_INET) && defined(CONFIG_DEBUG_OOM_ARP_TBL)
+#include <net/arp.h>
+#endif
+#if defined(CONFIG_IPV6) && defined(CONFIG_DEBUG_OOM_ND_TBL)
+#include <net/ndisc.h>
+#endif
+#if defined(CONFIG_DEBUG_OOM_ARP_TBL) || defined(CONFIG_DEBUG_OOM_ND_TBL)
+#include <net/neighbour.h>
+#endif
+
#define OOMD_MAX_FNAME 48
#define OOMD_MAX_OPTNAME 32
@@ -192,6 +202,18 @@ static struct oom_debug_option oom_debug_options_table[] = {
.option_name = "tasks_summary_",
.support_tpercent = false,
},
+#endif
+#ifdef CONFIG_DEBUG_OOM_ARP_TBL
+ {
+ .option_name = "arp_table_summary_",
+ .support_tpercent = false,
+ },
+#endif
+#ifdef CONFIG_DEBUG_OOM_ND_TBL
+ {
+ .option_name = "nd_table_summary_",
+ .support_tpercent = false,
+ },
#endif
{}
};
@@ -203,6 +225,12 @@ enum oom_debug_options_index {
#endif
#ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
TASKS_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_ARP_TBL
+ ARP_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_ND_TBL
+ ND_STATE,
#endif
OUT_OF_BOUNDS
};
@@ -351,6 +379,16 @@ u32 oom_kill_debug_oom_event_is(void)
oom_kill_debug_system_summary_prt();
#endif
+#if defined(CONFIG_INET) && defined(CONFIG_DEBUG_OOM_ARP_TBL)
+ if (oom_kill_debug_enabled(ARP_STATE))
+ neightbl_print_stats("arp_tbl", &arp_tbl);
+#endif
+
+#if defined(CONFIG_IPV6) && defined(CONFIG_DEBUG_OOM_ND_TBL)
+ if (oom_kill_debug_enabled(ND_STATE))
+ neightbl_print_stats("nd_tbl", &nd_tbl);
+#endif
+
#ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
if (oom_kill_debug_enabled(TASKS_STATE))
oom_kill_debug_tasks_summary_print();
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index f79e61c570ea..9f5a579542a9 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -3735,3 +3735,81 @@ static int __init neigh_init(void)
}
subsys_initcall(neigh_init);
+
+#if defined(CONFIG_DEBUG_OOM_ARP_TBL) || defined(CONFIG_DEBUG_OOM_ND_TBL)
+void neightbl_print_stats(const char * const tblname,
+ struct neigh_table * const tbl)
+{
+ struct neigh_hash_table *nht;
+ struct ndt_stats ndst;
+ u32 now;
+ u32 flush_delta;
+ u32 tblsize;
+ u16 key_len;
+ u16 entry_size;
+ u32 entries;
+ u32 last_flush; /* delta to now in msecs */
+ u32 hash_shift;
+ u32 proxy_qlen;
+ int cpu;
+
+ read_lock_bh(&tbl->lock);
+ now = jiffies;
+ flush_delta = now - tbl->last_flush;
+
+ key_len = tbl->key_len;
+ if (tbl->entry_size)
+ entry_size = tbl->entry_size;
+ else
+ entry_size = ALIGN(offsetof(struct neighbour, primary_key) +
+ key_len, NEIGH_PRIV_ALIGN);
+
+ entries = atomic_read(&tbl->entries);
+ if (entries == 0)
+ goto out_tbl_unlock;
+
+ /* last flush was last_flush seconds ago */
+ last_flush = jiffies_to_msecs(flush_delta) / 1000;
+ proxy_qlen = tbl->proxy_queue.qlen;
+
+ rcu_read_lock_bh();
+ nht = rcu_dereference_bh(tbl->nht);
+ if (nht)
+ hash_shift = nht->hash_shift + 1;
+ rcu_read_unlock_bh();
+ if (!nht)
+ goto out_tbl_unlock;
+
+ memset(&ndst, 0, sizeof(ndst));
+ for_each_possible_cpu(cpu) {
+ struct neigh_statistics *st;
+
+ st = per_cpu_ptr(tbl->stats, cpu);
+ ndst.ndts_allocs += st->allocs;
+ ndst.ndts_destroys += st->destroys;
+ ndst.ndts_hash_grows += st->hash_grows;
+ ndst.ndts_res_failed += st->res_failed;
+ ndst.ndts_lookups += st->lookups;
+ ndst.ndts_hits += st->hits;
+ ndst.ndts_periodic_gc_runs += st->periodic_gc_runs;
+ ndst.ndts_forced_gc_runs += st->forced_gc_runs;
+ ndst.ndts_table_fulls += st->table_fulls;
+ }
+
+ read_unlock_bh(&tbl->lock);
+ tblsize = (1 << hash_shift) * sizeof(struct neighbour *);
+ if (tblsize > PAGE_SIZE)
+ tblsize = get_order(tblsize);
+
+ pr_info("Table:%7s size:%5u keyLen:%2hu entrySize:%3hu entries:%5u lastFlush:%5us hGrows:%5llu allocs:%5llu destroys:%5llu lookups:%5llu hits:%5llu resFailed:%5llu gcRuns/Forced:%3llu / %2llu tblFull:%2llu proxyQlen:%2u\n",
+ tblname, tblsize, key_len, entry_size, entries, last_flush,
+ ndst.ndts_hash_grows, ndst.ndts_allocs, ndst.ndts_destroys,
+ ndst.ndts_lookups, ndst.ndts_hits, ndst.ndts_res_failed,
+ ndst.ndts_periodic_gc_runs, ndst.ndts_forced_gc_runs,
+ ndst.ndts_table_fulls, proxy_qlen);
+ return;
+
+out_tbl_unlock:
+ read_unlock_bh(&tbl->lock);
+}
+#endif /* CONFIG_DEBUG_OOM_ARP_TBL || CONFIG_DEBUG_OOM_ND_TBL */
--
2.20.1
^ permalink raw reply related
* [PATCH 10/10] mm/oom_debug: Add Enhanced Process Print Information
From: Edward Chron @ 2019-08-26 19:36 UTC (permalink / raw)
To: Andrew Morton
Cc: Michal Hocko, Roman Gushchin, Johannes Weiner, David Rientjes,
Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel, colona,
Edward Chron, David S. Miller, netdev
In-Reply-To: <20190826193638.6638-1-echron@arista.com>
Add OOM Debug code that prints additional detailed information about
users processes that were considered for OOM killing for any print
selected processes. The information is displayed for each user process
that OOM prints in the output.
This supplemental per user process information is very helpful for
determing how process memory is used to allow OOM event root cause
identifcation that might not otherwise be possible.
Configuring Enhanced Process Print Information
----------------------------------------------
The DEBUG_OOM_ENHANCED_PROCESS_PRINT is the config entry defined for
this OOM Debug option. This option is dependent on the OOM Debug
option DEBUG_OOM_SELECT_PROCESS which adds code to allow processes
that are considered for OOM kill to be selectively printed, only
printing processes that use a specified minimum amount of memory.
The kernel configuration entry for this option can be found in the
config file at: Kernel hacking, Memory Debugging, Debug OOM,
Debug OOM Process Selection, Debug OOM Enhanced Process Print.
Both Debug OOM Process Selection and Debug OOM Enhanced Process Print
entries must be selected.
Dynamic disable or re-enable this OOM Debug option
--------------------------------------------------
This option may be disabled or re-enabled using the debugfs entry for
this OOM debug option. The debugfs file to enable this entry is found at:
/sys/kernel/debug/oom/process_enhanced_print_enabled where the enabled
file's value determines whether the facility is enabled or disabled.
A value of 1 is enabled (default) and a value of 0 is disabled.
Content and format of process record and Task state headers
-----------------------------------------------------------
Each OOM process entry printed include memory information about the
process. Memory usage is specified in KiB for memory values instead of
pages. Each entry includes the following fields:
pid, ppid, ruid, euid, tgid, State (S), the oom_score_adjust (Adjust),
task comm value (name), and also memory values (all in KB): VmemKiB,
MaxRssKiB, CurRssKiB, PteKiB, SwapKiB, socket pages (SockKiB), LibKiB,
TextPgKiB, HeapPgKiB, StackKiB, FileKiB and shared memory (ShmemKiB).
Counts of page reads (ReadPgs) and page faults (FaultPgs) are included.
Sample Output
-------------
OOM Process select print headers and line of process enhanced output:
Aug 6 09:37:21 egc103 kernel: Tasks state (memory values in KiB):
Aug 6 09:37:21 egc103 kernel: [ pid ] ppid ruid euid
tgid S utimeSec stimeSec VmemKiB MaxRssKiB CurRssKiB
PteKiB SwapKiB SockKiB LibKiB TextKiB HeapKiB
StackKiB FileKiB ShmemKiB ReadPgs FaultPgs LockKiB
PinnedKiB Adjust name
Aug 6 09:37:21 egc103 kernel: [ 7707] 7553 10383 10383
7707 S 0.132 0.350 1056804 1054040 1052796
2092 0 0 1944 684 1052860
136 4 0 0 0 0
0 1000 oomprocs
Signed-off-by: Edward Chron <echron@arista.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
---
mm/Kconfig.debug | 23 +++++
mm/oom_kill.c | 23 ++++-
mm/oom_kill_debug.c | 236 ++++++++++++++++++++++++++++++++++++++++++++
mm/oom_kill_debug.h | 5 +
4 files changed, 285 insertions(+), 2 deletions(-)
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 4414e46f72c6..2bc843727968 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -320,3 +320,26 @@ config DEBUG_OOM_PROCESS_SELECT_PRINT
print limit value of 10 or 1% of memory.
If unsure, say N.
+
+config DEBUG_OOM_ENHANCED_PROCESS_PRINT
+ bool "Debug OOM Enhanced Process Print"
+ depends on DEBUG_OOM_PROCESS_SELECT_PRINT
+ help
+ Each OOM process entry printed include memory information about
+ the process. Memory usage is specified in KiB (KB) for memory
+ values, not pages. Each entry includes the following fields:
+ pid, ppid, ruid, euid, tgid, State (S), utime in seconds,
+ stime in seconds, the number of read pages (ReadPgs), number of
+ page faults (FaultPgs), the number of lock pages (LockPgs), the
+ oom_score_adjust value (Adjust), memory percentage used (MemPct),
+ oom_score (Score), task comm value (name), and also memory values
+ (all in KB): VmemKiB, MaxRssKiB, CurRssKiB, PteKiB, SwapKiB,
+ socket pages (SockKiB), LibKiB, TextPgKiB, HeapPgKiB, StackKiB,
+ FileKiB and shared memory pages (ShmemKiB).
+
+ If the option is configured it is enabled/disabled by setting
+ the value of the file entry in the debugfs OOM interface at:
+ /sys/kernel/debug/oom/process_enhanced_print_enabled
+ A value of 1 is enabled (default) and a value of 0 is disabled.
+
+ If unsure, say N.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index cbea289c6345..cf37caea9c5c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -417,6 +417,13 @@ static int dump_task(struct task_struct *p, void *arg)
}
#endif
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+ if (oom_kill_debug_enhanced_process_print_enabled()) {
+ dump_task_prt(task, rsspgs, swappgs, pgtbl);
+ task_unlock(task);
+ return 1;
+ }
+#endif
pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %s\n",
task->pid, from_kuid(&init_user_ns, task_uid(task)),
task->tgid, task->mm->total_vm, rsspgs, pgtbl, swappgs,
@@ -426,6 +433,19 @@ static int dump_task(struct task_struct *p, void *arg)
return 1;
}
+static void dump_tasks_headers(void)
+{
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+ if (oom_kill_debug_enhanced_process_print_enabled()) {
+ pr_info("Tasks state (memory values in KiB):\n");
+ pr_info("[ pid ] ppid ruid euid tgid S utimeSec stimeSec VmemKiB MaxRssKiB CurRssKiB PteKiB SwapKiB SockKiB LibKiB TextKiB HeapKiB StackKiB FileKiB ShmemKiB ReadPgs FaultPgs LockKiB PinnedKiB Adjust name\n");
+ return;
+ }
+#endif
+ pr_info("Tasks state (memory values in pages):\n");
+ pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name\n");
+}
+
#define K(x) ((x) << (PAGE_SHIFT-10))
/**
@@ -443,8 +463,7 @@ static void dump_tasks(struct oom_control *oc)
u32 total = 0;
u32 prted = 0;
- pr_info("Tasks state (memory values in pages):\n");
- pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name\n");
+ dump_tasks_headers();
#ifdef CONFIG_DEBUG_OOM_PROCESS_SELECT_PRINT
oc->minpgs = oom_kill_debug_min_task_pages(oc->totalpages);
diff --git a/mm/oom_kill_debug.c b/mm/oom_kill_debug.c
index ad937b3d59f3..467f7add4397 100644
--- a/mm/oom_kill_debug.c
+++ b/mm/oom_kill_debug.c
@@ -171,6 +171,12 @@
#ifdef CONFIG_DEBUG_OOM_VMALLOC_SELECT_PRINT
#include <linux/vmalloc.h>
#endif
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+#include <linux/fdtable.h>
+#include <linux/net.h>
+#include <net/sock.h>
+#include <linux/sched/cputime.h>
+#endif
#define OOMD_MAX_FNAME 48
#define OOMD_MAX_OPTNAME 32
@@ -250,6 +256,12 @@ static struct oom_debug_option oom_debug_options_table[] = {
.option_name = "slab_enhanced_print_",
.support_tpercent = false,
},
+#endif
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+ {
+ .option_name = "process_enhanced_print_",
+ .support_tpercent = false,
+ },
#endif
{}
};
@@ -282,6 +294,9 @@ enum oom_debug_options_index {
#endif
#ifdef CONFIG_DEBUG_OOM_ENHANCED_SLAB_PRINT
ENHANCED_SLAB_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+ ENHANCED_PROCESS_STATE,
#endif
OUT_OF_BOUNDS
};
@@ -365,6 +380,12 @@ bool oom_kill_debug_enhanced_slab_print_information_enabled(void)
return oom_kill_debug_enabled(ENHANCED_SLAB_STATE);
}
#endif
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+bool oom_kill_debug_enhanced_process_print_enabled(void)
+{
+ return oom_kill_debug_enabled(ENHANCED_PROCESS_STATE);
+}
+#endif
#ifdef CONFIG_DEBUG_OOM_SYSTEM_STATE
/*
@@ -513,6 +534,221 @@ u32 oom_kill_debug_oom_event_is(void)
return oom_kill_debug_oom_events;
}
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+/*
+ * Account for socket(s) buffer memory in use by a task.
+ * A task may have one or more sockets consuming socket buffer space.
+ * Account for how much socket space each task has in use.
+ */
+static unsigned long account_for_socket_buffers(struct task_struct *task,
+ char *incomplete)
+{
+ unsigned long sockpgs = 0;
+ struct files_struct *files = task->files;
+ struct fdtable *fdt;
+ struct file **fds;
+ int openfilecount;
+ struct inode *inode;
+ struct socket *sock;
+ struct sock *sk;
+ unsigned long bytes;
+ int fdtsize;
+ int i;
+
+ /* Just to make sure the fds don't get closed */
+ atomic_inc(&files->count);
+ /* Make a best effort, but no reason to get hung up here */
+ if (!spin_trylock(&files->file_lock)) {
+ *incomplete = '*';
+ atomic_dec(&files->count);
+ return 0;
+ }
+
+ rcu_read_lock();
+ fdt = files_fdtable(files);
+ fdtsize = fdt->max_fds;
+ /* Determine how many words we need to check for open files */
+ for (i = fdtsize / BITS_PER_LONG; i > 0; ) {
+ if (fdt->open_fds[--i])
+ break;
+ }
+ openfilecount = (i + 1) * BITS_PER_LONG; // Check each fd in the word
+ fds = fdt->fd;
+ for (i = openfilecount; i != 0; i--) {
+ struct file *fp = *fds++;
+
+ if (fp) {
+ /* Any continue case doesn't need to be counted */
+ if (fp->f_path.dentry == NULL)
+ continue;
+ inode = fp->f_path.dentry->d_inode;
+ if (inode == NULL || !S_ISSOCK(inode->i_mode))
+ continue;
+ sock = fp->private_data;
+ if (sock == NULL)
+ continue;
+ sk = sock->sk;
+ if (sk == NULL)
+ continue;
+ bytes = roundup(sk->sk_rcvbuf, PAGE_SIZE);
+ sockpgs = bytes / PAGE_SIZE;
+ bytes = roundup(sk->sk_sndbuf, PAGE_SIZE);
+ sockpgs += bytes / PAGE_SIZE;
+ }
+ }
+ rcu_read_unlock();
+
+ spin_unlock(&files->file_lock);
+ /* We're done looking at the fds */
+ atomic_dec(&files->count);
+
+ return sockpgs;
+}
+
+static u64 power10(u32 index)
+{
+ static u64 pwr10[11] = {1, 10, 100, 1000, 10000, 100000, 1000000,
+ 10000000, 100000000, 1000000000,
+ 10000000000};
+
+ return pwr10[index];
+}
+
+static u32 num_digits(u64 num)
+{
+ u32 i;
+
+ for (i = 1; i < 11; ++i) {
+ if (power10(i) > num)
+ return i;
+ }
+ return i;
+}
+
+static void digits_and_fraction(u64 num, u32 *p_digits, u32 *p_frac, u32 chars)
+{
+ *p_digits = num_digits(num);
+ // Allow for decimal place for fractional output
+ if (chars - 1 > *p_digits)
+ *p_frac = chars - 1 - *p_digits;
+ else
+ *p_frac = 0;
+}
+
+#define MAX_NUM_FIELD_SIZE 10
+/*
+ * Format timespec into seconds and possibly fraction, must fit in 9 bytes.
+ * Linux kernel doesn't support floating point so format as best we can.
+ * With 9 digits in seconds convers 31.7 years and where we can we provide
+ * fractions of a second up to miliseconds.
+ */
+static void timespec_format(u64 nsecs_time, char *p_time, size_t time_size)
+{
+ struct timespec64 tspec = ns_to_timespec64(nsecs_time);
+ u32 digits, fracs, bytes, min;
+ u64 fraction;
+
+ digits_and_fraction(tspec.tv_sec, &digits, &fracs, time_size);
+
+ bytes = sprintf(p_time, "%llu", tspec.tv_sec);
+
+ if (fracs > 0) {
+ u32 frsize = num_digits(tspec.tv_nsec);
+
+ p_time += bytes;
+ if (frsize >= 3) {
+ if (fracs >= 3)
+ min = frsize - 3;
+ else if (fracs >= 2)
+ min = frsize - 2;
+ else
+ min = frsize - 1;
+ } else if (frsize >= 2) {
+ if (fracs >= 2)
+ min = frsize - 2;
+ else
+ min = frsize - 1;
+ } else {
+ min = frsize - 1;
+ }
+ fraction = tspec.tv_nsec / power10(min);
+ sprintf(p_time, ".%llu", fraction);
+ }
+}
+
+/*
+ * Format utime, stime in seconds and possibly fractions, must fit in 9 bytes.
+ */
+static void time_format(struct task_struct *task, char *p_utime, char *p_stime)
+{
+ size_t num_size = MAX_NUM_FIELD_SIZE;
+ u64 utime, stime;
+
+ task_cputime_adjusted(task, &utime, &stime);
+ memset(p_utime, 0, num_size);
+ timespec_format(utime, p_utime, num_size - 1);
+ memset(p_stime, 0, num_size);
+ timespec_format(stime, p_stime, num_size - 1);
+}
+
+/* task_index_to_char kernel function is missing options so use this */
+#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
+static const char task_to_char[] = TASK_STATE_TO_CHAR_STR;
+static const char get_task_state(struct task_struct *p_task, ulong state)
+{
+ int bit = state ? __ffs(state) + 1 : 0;
+
+ if (p_task->tgid == 0)
+ return 'I';
+ return bit < sizeof(task_to_char) - 1 ? task_to_char[bit] : '?';
+}
+
+/*
+ * Code that prints the information about the specified task.
+ * Assumes task lock is held at entry.
+ */
+void dump_task_prt(struct task_struct *task,
+ unsigned long rsspg, unsigned long swappg,
+ unsigned long pgtbl)
+{
+ char c_utime[MAX_NUM_FIELD_SIZE], c_stime[MAX_NUM_FIELD_SIZE];
+ unsigned long vmkb, sockkb, text, maxrsspg, pgtblpg;
+ unsigned long libkb, textkb, pgtblkb;
+ struct mm_struct *mm;
+ char incomp = ' ';
+ kuid_t ruid, euid;
+ char tstate;
+
+ mm = task->mm;
+ maxrsspg = rsspg;
+ pgtblpg = pgtbl >> PAGE_SHIFT;
+ ruid = __task_cred(task)->uid;
+ euid = __task_cred(task)->euid;
+ vmkb = K(mm->total_vm);
+ if (maxrsspg < mm->hiwater_rss)
+ maxrsspg = mm->hiwater_rss;
+ sockkb = K(account_for_socket_buffers(task, &incomp));
+ text = (PAGE_ALIGN(mm->end_code) -
+ (mm->start_code & PAGE_MASK));
+ text = min(text, mm->exec_vm << PAGE_SHIFT);
+ textkb = text >> 10;
+ libkb = ((mm->exec_vm << PAGE_SHIFT) - text) >> 10;
+ pgtblkb = pgtbl >> 10;
+ tstate = get_task_state(task, task->state);
+ time_format(task, c_utime, c_stime);
+
+ pr_info("[%7d] %7d %7d %7d %7d %c %9s %9s %9lu %9lu %9lu %9lu %9ld %9lu%c %9lu %9lu %9lu %9lu %9lu %9lu %11lu %11lu %9lu %9llu %5hd %s\n",
+ task->pid, task_ppid_nr(task), ruid.val, euid.val, task->tgid,
+ tstate, c_utime, c_stime, vmkb, K(maxrsspg), K(rsspg), pgtblkb,
+ K(swappg), sockkb, incomp, libkb, textkb, K(mm->data_vm),
+ K(mm->stack_vm), K(get_mm_counter(mm, MM_FILEPAGES)),
+ K(get_mm_counter(mm, MM_SHMEMPAGES)), task->signal->cmaj_flt,
+ task->signal->cmin_flt,
+ K(mm->locked_vm), K((u64)atomic64_read(&mm->pinned_vm)),
+ task->signal->oom_score_adj, task->comm);
+}
+#endif /* CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT */
+
static void __init oom_debug_init(void)
{
/* Ensure we have a debugfs oom root directory */
diff --git a/mm/oom_kill_debug.h b/mm/oom_kill_debug.h
index a39bc275980e..faebb4c6097c 100644
--- a/mm/oom_kill_debug.h
+++ b/mm/oom_kill_debug.h
@@ -9,6 +9,11 @@
#ifndef __MM_OOM_KILL_DEBUG_H__
#define __MM_OOM_KILL_DEBUG_H__
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+extern bool oom_kill_debug_enhanced_process_print_enabled(void);
+extern void dump_task_prt(struct task_struct *task, unsigned long rsspg,
+ unsigned long swappg, unsigned long pgtbl);
+#endif
#ifdef CONFIG_DEBUG_OOM_PROCESS_SELECT_PRINT
extern unsigned long oom_kill_debug_min_task_pages(unsigned long totalpages);
#endif
--
2.20.1
^ permalink raw reply related
* Re: [PATCH v1 net-next 4/4] net: stmmac: setup higher frequency clk support for EHL & TGL
From: Florian Fainelli @ 2019-08-26 19:55 UTC (permalink / raw)
To: Voon Weifeng, David S. Miller, Maxime Coquelin
Cc: netdev, linux-kernel, Jose Abreu, Giuseppe Cavallaro, Andrew Lunn,
Alexandre Torgue, Ong Boon Leong
In-Reply-To: <1566869891-29239-5-git-send-email-weifeng.voon@intel.com>
On 8/26/19 6:38 PM, Voon Weifeng wrote:
> EHL DW EQOS is running on a 200MHz clock. Setting up stmmac-clk,
> ptp clock and ptp_max_adj to 200MHz.
>
> Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
> Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
> ---
> drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c | 21 +++++++++++++++++++++
> drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c | 3 +++
> include/linux/stmmac.h | 1 +
> 3 files changed, 25 insertions(+)
>
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
> index e969dc9bb9f0..20906287b6d4 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
> @@ -9,6 +9,7 @@
> Author: Giuseppe Cavallaro <peppe.cavallaro@st.com>
> *******************************************************************************/
>
> +#include <linux/clk-provider.h>
> #include <linux/pci.h>
> #include <linux/dmi.h>
>
> @@ -174,6 +175,19 @@ static int intel_mgbe_common_data(struct pci_dev *pdev,
> plat->axi->axi_blen[1] = 8;
> plat->axi->axi_blen[2] = 16;
>
> + plat->ptp_max_adj = plat->clk_ptp_rate;
> +
> + /* Set system clock */
> + plat->stmmac_clk = clk_register_fixed_rate(&pdev->dev,
> + "stmmac-clk", NULL, 0,
> + plat->clk_ptp_rate);
> +
> + if (IS_ERR(plat->stmmac_clk)) {
> + dev_warn(&pdev->dev, "Fail to register stmmac-clk\n");
> + plat->stmmac_clk = NULL;
Don't you need to propagate at least EPROBE_DEFER here?
--
Florian
^ permalink raw reply
* Re: [PATCH net] tcp: remove empty skb from write queue in error cases
From: Jason Baron @ 2019-08-26 19:56 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller
Cc: netdev, Soheil Hassas Yeganeh, Neal Cardwell, Eric Dumazet,
Vladimir Rutsky
In-Reply-To: <20190826161915.81676-1-edumazet@google.com>
On 8/26/19 12:19 PM, Eric Dumazet wrote:
> Vladimir Rutsky reported stuck TCP sessions after memory pressure
> events. Edge Trigger epoll() user would never receive an EPOLLOUT
> notification allowing them to retry a sendmsg().
>
> Jason tested the case of sk_stream_alloc_skb() returning NULL,
> but there are other paths that could lead both sendmsg() and sendpage()
> to return -1 (EAGAIN), with an empty skb queued on the write queue.
>
> This patch makes sure we remove this empty skb so that
> Jason code can detect that the queue is empty, and
> call sk->sk_write_space(sk) accordingly.
>
Makes sense, thanks. I think this check for empty queue could also
benefit from and update to use tcp_write_queue_empty(). I will send a
follow-up for that.
Thanks,
-Jason
> Fixes: ce5ec440994b ("tcp: ensure epoll edge trigger wakeup when write queue is empty")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Jason Baron <jbaron@akamai.com>
> Reported-by: Vladimir Rutsky <rutsky@google.com>
> Cc: Soheil Hassas Yeganeh <soheil@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> ---
> net/ipv4/tcp.c | 30 ++++++++++++++++++++----------
> 1 file changed, 20 insertions(+), 10 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 77b485d60b9d0e00edc4e2f0d6c5bb3a9460b23b..61082065b26a068975c411b74eb46739ab0632ca 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -935,6 +935,22 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
> return mss_now;
> }
>
> +/* In some cases, both sendpage() and sendmsg() could have added
> + * an skb to the write queue, but failed adding payload on it.
> + * We need to remove it to consume less memory, but more
> + * importantly be able to generate EPOLLOUT for Edge Trigger epoll()
> + * users.
> + */
> +static void tcp_remove_empty_skb(struct sock *sk, struct sk_buff *skb)
> +{
> + if (skb && !skb->len) {
> + tcp_unlink_write_queue(skb, sk);
> + if (tcp_write_queue_empty(sk))
> + tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
> + sk_wmem_free_skb(sk, skb);
> + }
> +}
> +
> ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
> size_t size, int flags)
> {
> @@ -1064,6 +1080,7 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
> return copied;
>
> do_error:
> + tcp_remove_empty_skb(sk, tcp_write_queue_tail(sk));
> if (copied)
> goto out;
> out_err:
> @@ -1388,18 +1405,11 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> sock_zerocopy_put(uarg);
> return copied + copied_syn;
>
> +do_error:
> + skb = tcp_write_queue_tail(sk);
> do_fault:
> - if (!skb->len) {
> - tcp_unlink_write_queue(skb, sk);
> - /* It is the one place in all of TCP, except connection
> - * reset, where we can be unlinking the send_head.
> - */
> - if (tcp_write_queue_empty(sk))
> - tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
> - sk_wmem_free_skb(sk, skb);
> - }
> + tcp_remove_empty_skb(sk, skb);
>
> -do_error:
> if (copied + copied_syn)
> goto out;
> out_err:
>
^ permalink raw reply
* Re: [PATCH net] tcp: remove empty skb from write queue in error cases
From: Eric Dumazet @ 2019-08-26 20:04 UTC (permalink / raw)
To: Jason Baron, Eric Dumazet, David S . Miller
Cc: netdev, Soheil Hassas Yeganeh, Neal Cardwell, Vladimir Rutsky
In-Reply-To: <58ae43f8-21e7-f08f-2632-ce567661d301@akamai.com>
On 8/26/19 9:56 PM, Jason Baron wrote:
>
>
> On 8/26/19 12:19 PM, Eric Dumazet wrote:
>> Vladimir Rutsky reported stuck TCP sessions after memory pressure
>> events. Edge Trigger epoll() user would never receive an EPOLLOUT
>> notification allowing them to retry a sendmsg().
>>
>> Jason tested the case of sk_stream_alloc_skb() returning NULL,
>> but there are other paths that could lead both sendmsg() and sendpage()
>> to return -1 (EAGAIN), with an empty skb queued on the write queue.
>>
>> This patch makes sure we remove this empty skb so that
>> Jason code can detect that the queue is empty, and
>> call sk->sk_write_space(sk) accordingly.
>>
>
> Makes sense, thanks. I think this check for empty queue could also
> benefit from and update to use tcp_write_queue_empty(). I will send a
> follow-up for that.
>
My plan (for net-next) is to submit something more like this one instead.
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 051ef10374f6..908dbe91e04b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1068,7 +1068,7 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
goto out;
out_err:
/* make sure we wake any epoll edge trigger waiter */
- if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&
+ if (unlikely(tcp_rtx_and_write_queues_empty(sk) &&
err == -EAGAIN)) {
sk->sk_write_space(sk);
tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED);
@@ -1407,7 +1407,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
sock_zerocopy_put_abort(uarg, true);
err = sk_stream_error(sk, flags, err);
/* make sure we wake any epoll edge trigger waiter */
- if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&
+ if (unlikely(tcp_rtx_and_write_queues_empty(sk) &&
err == -EAGAIN)) {
sk->sk_write_space(sk);
tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED);
^ permalink raw reply related
* Re: [PATCH v1 net-next 4/4] net: stmmac: setup higher frequency clk support for EHL & TGL
From: Andrew Lunn @ 2019-08-26 20:13 UTC (permalink / raw)
To: Florian Fainelli
Cc: Voon Weifeng, David S. Miller, Maxime Coquelin, netdev,
linux-kernel, Jose Abreu, Giuseppe Cavallaro, Alexandre Torgue,
Ong Boon Leong
In-Reply-To: <7d43e0c6-6f51-0d71-0af8-89f22b0234f9@gmail.com>
On Mon, Aug 26, 2019 at 12:55:31PM -0700, Florian Fainelli wrote:
> On 8/26/19 6:38 PM, Voon Weifeng wrote:
> > EHL DW EQOS is running on a 200MHz clock. Setting up stmmac-clk,
> > ptp clock and ptp_max_adj to 200MHz.
> >
> > Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
> > Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
> > ---
> > drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c | 21 +++++++++++++++++++++
> > drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c | 3 +++
> > include/linux/stmmac.h | 1 +
> > 3 files changed, 25 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
> > index e969dc9bb9f0..20906287b6d4 100644
> > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
> > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
> > @@ -9,6 +9,7 @@
> > Author: Giuseppe Cavallaro <peppe.cavallaro@st.com>
> > *******************************************************************************/
> >
> > +#include <linux/clk-provider.h>
> > #include <linux/pci.h>
> > #include <linux/dmi.h>
> >
> > @@ -174,6 +175,19 @@ static int intel_mgbe_common_data(struct pci_dev *pdev,
> > plat->axi->axi_blen[1] = 8;
> > plat->axi->axi_blen[2] = 16;
> >
> > + plat->ptp_max_adj = plat->clk_ptp_rate;
> > +
> > + /* Set system clock */
> > + plat->stmmac_clk = clk_register_fixed_rate(&pdev->dev,
> > + "stmmac-clk", NULL, 0,
> > + plat->clk_ptp_rate);
> > +
> > + if (IS_ERR(plat->stmmac_clk)) {
> > + dev_warn(&pdev->dev, "Fail to register stmmac-clk\n");
> > + plat->stmmac_clk = NULL;
>
> Don't you need to propagate at least EPROBE_DEFER here?
Hi Florian
Isn't a fixed rate clock a complete fake. There is no hardware behind
it. So can it return EPROBE_DEFER?
Andrew
^ permalink raw reply
* Re: [net-next 4/8] net/mlx5e: Add device out of buffer counter
From: Saeed Mahameed @ 2019-08-26 20:14 UTC (permalink / raw)
To: jakub.kicinski@netronome.com
Cc: davem@davemloft.net, netdev@vger.kernel.org, Moshe Shemesh
In-Reply-To: <20190823111601.012fabf4@cakuba.netronome.com>
On Fri, 2019-08-23 at 11:16 -0700, Jakub Kicinski wrote:
> On Fri, 23 Aug 2019 06:00:45 +0000, Saeed Mahameed wrote:
> > On Thu, 2019-08-22 at 18:33 -0700, Jakub Kicinski wrote:
> > > On Thu, 22 Aug 2019 23:35:52 +0000, Saeed Mahameed wrote:
> > > > From: Moshe Shemesh <moshe@mellanox.com>
> > > >
> > > > Added the following packets drop counter:
> > > > Device out of buffer - counts packets which were dropped due to
> > > > full
> > > > device internal receive queue.
> > > > This counter will be shown on ethtool as a new counter called
> > > > dev_out_of_buffer.
> > > > The counter is read from FW by command QUERY_VNIC_ENV.
> > > >
> > > > Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
> > > > Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> > >
> > > Sounds like rx_fifo_errors, no? Doesn't rx_fifo_errors count RX
> > > overruns?
> >
> > No, that is port buffer you are looking for and we got that fully
> > covered in mlx5. this is different.
> >
> > This new counter is deep into the HW data path pipeline and it
> > covers
> > very rare and complex scenarios that got only recently introduced
> > with
> > swichdev mode and "some" lately added tunnels offloads that are
> > routed
> > between VFs/PFs.
> >
> > Normally the HW is lossless once the packet passes port buffers
> > into
> > the data plane pipeline, let's call that "fast lane", BUT for sriov
> > configurations with switchdev mode enabled and some special hand
> > crafted tc tunnel offloads that requires hairpin between VFs/PFs,
> > the
> > hw might decide to send some traffic to a "service lane" which is
> > still
> > fast path but unlike the "fast lane" it handles traffic through "HW
> > internal" receive and send queues (just like we do with hairpin)
> > that
> > might drop packets. the whole thing is transparent to driver and it
> > is
> > HW implementation specific.
>
> I see thanks for the explanation and sorry for the delayed response.
> Would it perhaps make sense to indicate the hairpin in the name?
We had some internal discussion and we couldn't come up with the
perfect name :)
hairpin is just an implementation detail, we don't want to exclusively
bind this counter to hairpin only flows, the problem is not with
hairpin, the actual problem is due to the use of internal RQs, for now
it only happens with "hairpin like" flows, but tomorrow it can happen
with a different scenario but same root cause (the use of internal
RQs), we want to have one counter to count internal drops due to
internal use of internal RQs.
so how about:
dev_internal_rq_oob: Device Internal RQ out of buffer
dev_internal_out_of_res: Device Internal out of resources (more generic
? too generic ?)
Any suggestion that you provide will be more than welcome.
> dev_out_of_buffer is quite a generic name, and there seems to be no
> doc, nor does the commit message explains it as well as you have..
Regarding documentation:
All mlx5 ethool counters are documented here
https://community.mellanox.com/s/article/understanding-mlx5-linux-counters-and-status-parameters
once we decide on the name, will add the new counter to the doc.
^ permalink raw reply
* Re: [PATCH] net/mlx5: fix a -Wstringop-truncation warning
From: Saeed Mahameed @ 2019-08-26 20:18 UTC (permalink / raw)
To: cai@lca.pw, davem@davemloft.net
Cc: Eran Ben Elisha, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, netdev@vger.kernel.org, Feras Daoud,
leon@kernel.org, Moshe Shemesh
In-Reply-To: <20190823.151809.442664848668672070.davem@davemloft.net>
On Fri, 2019-08-23 at 15:18 -0700, David Miller wrote:
> Saeed, I assume I'll get this from you.
Yes, i will handle it.
^ permalink raw reply
* Re: [net-next 4/8] net/mlx5e: Add device out of buffer counter
From: Jakub Kicinski @ 2019-08-26 20:39 UTC (permalink / raw)
To: Saeed Mahameed; +Cc: davem@davemloft.net, netdev@vger.kernel.org, Moshe Shemesh
In-Reply-To: <18abb6456fb4a2fba52f6f77373ac351651a62c6.camel@mellanox.com>
On Mon, 26 Aug 2019 20:14:47 +0000, Saeed Mahameed wrote:
> > I see thanks for the explanation and sorry for the delayed response.
> > Would it perhaps make sense to indicate the hairpin in the name?
>
> We had some internal discussion and we couldn't come up with the
> perfect name :)
>
> hairpin is just an implementation detail, we don't want to exclusively
> bind this counter to hairpin only flows, the problem is not with
> hairpin, the actual problem is due to the use of internal RQs, for now
> it only happens with "hairpin like" flows, but tomorrow it can happen
> with a different scenario but same root cause (the use of internal
> RQs), we want to have one counter to count internal drops due to
> internal use of internal RQs.
>
> so how about:
> dev_internal_rq_oob: Device Internal RQ out of buffer
> dev_internal_out_of_res: Device Internal out of resources (more generic
> ? too generic ?)
Maybe dev_internal_queue_oob? The use of 'internal' is a little
unfortunate, because it may be read as RQ run out of internal buffers.
Rather than special type of queue run out of buffers.
But not knowing the HW I don't really have any great suggestions :(
Either of the above would work as well.
> Any suggestion that you provide will be more than welcome.
>
> > dev_out_of_buffer is quite a generic name, and there seems to be no
> > doc, nor does the commit message explains it as well as you have..
>
> Regarding documentation:
> All mlx5 ethool counters are documented here
> https://community.mellanox.com/s/article/understanding-mlx5-linux-counters-and-status-parameters
>
> once we decide on the name, will add the new counter to the doc.
I see, thanks!
^ permalink raw reply
* [PATCH 1/4] mdev: Introduce sha1 based mdev alias
From: Parav Pandit @ 2019-08-26 20:41 UTC (permalink / raw)
To: alex.williamson, jiri, kwankhede, cohuck, davem
Cc: kvm, linux-kernel, netdev, Parav Pandit
In-Reply-To: <20190826204119.54386-1-parav@mellanox.com>
Whenever a parent requests to generate mdev alias, generate a mdev
alias.
It is an optional attribute that parent can request to generate
for each of its child mdev.
mdev alias is generated using sha1 from the mdev name.
Signed-off-by: Parav Pandit <parav@mellanox.com>
---
drivers/vfio/mdev/mdev_core.c | 98 +++++++++++++++++++++++++++++++-
drivers/vfio/mdev/mdev_private.h | 5 +-
drivers/vfio/mdev/mdev_sysfs.c | 13 +++--
include/linux/mdev.h | 4 ++
4 files changed, 111 insertions(+), 9 deletions(-)
diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index b558d4cfd082..e825ff38b037 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -10,9 +10,11 @@
#include <linux/module.h>
#include <linux/device.h>
#include <linux/slab.h>
+#include <linux/mm.h>
#include <linux/uuid.h>
#include <linux/sysfs.h>
#include <linux/mdev.h>
+#include <crypto/hash.h>
#include "mdev_private.h"
@@ -27,6 +29,8 @@ static struct class_compat *mdev_bus_compat_class;
static LIST_HEAD(mdev_list);
static DEFINE_MUTEX(mdev_list_lock);
+static struct crypto_shash *alias_hash;
+
struct device *mdev_parent_dev(struct mdev_device *mdev)
{
return mdev->parent->dev;
@@ -164,6 +168,18 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
goto add_dev_err;
}
+ if (ops->get_alias_length) {
+ unsigned int digest_size;
+ unsigned int aligned_len;
+
+ aligned_len = roundup(ops->get_alias_length(), 2);
+ digest_size = crypto_shash_digestsize(alias_hash);
+ if (aligned_len / 2 > digest_size) {
+ ret = -EINVAL;
+ goto add_dev_err;
+ }
+ }
+
parent = kzalloc(sizeof(*parent), GFP_KERNEL);
if (!parent) {
ret = -ENOMEM;
@@ -259,6 +275,7 @@ static void mdev_device_free(struct mdev_device *mdev)
mutex_unlock(&mdev_list_lock);
dev_dbg(&mdev->dev, "MDEV: destroying\n");
+ kvfree(mdev->alias);
kfree(mdev);
}
@@ -269,18 +286,86 @@ static void mdev_device_release(struct device *dev)
mdev_device_free(mdev);
}
-int mdev_device_create(struct kobject *kobj,
- struct device *dev, const guid_t *uuid)
+static const char *
+generate_alias(const char *uuid, unsigned int max_alias_len)
+{
+ struct shash_desc *hash_desc;
+ unsigned int digest_size;
+ unsigned char *digest;
+ unsigned int alias_len;
+ char *alias;
+ int ret = 0;
+
+ /* Align to multiple of 2 as bin2hex will generate
+ * even number of bytes.
+ */
+ alias_len = roundup(max_alias_len, 2);
+ alias = kvzalloc(alias_len + 1, GFP_KERNEL);
+ if (!alias)
+ return NULL;
+
+ /* Allocate and init descriptor */
+ hash_desc = kvzalloc(sizeof(*hash_desc) +
+ crypto_shash_descsize(alias_hash),
+ GFP_KERNEL);
+ if (!hash_desc)
+ goto desc_err;
+
+ hash_desc->tfm = alias_hash;
+
+ digest_size = crypto_shash_digestsize(alias_hash);
+
+ digest = kvzalloc(digest_size, GFP_KERNEL);
+ if (!digest) {
+ ret = -ENOMEM;
+ goto digest_err;
+ }
+ crypto_shash_init(hash_desc);
+ crypto_shash_update(hash_desc, uuid, UUID_STRING_LEN);
+ crypto_shash_final(hash_desc, digest);
+ bin2hex(&alias[0], digest,
+ min_t(unsigned int, digest_size, alias_len / 2));
+ /* When alias length is odd, zero out and additional last byte
+ * that bin2hex has copied.
+ */
+ if (max_alias_len % 2)
+ alias[max_alias_len] = 0;
+
+ kvfree(digest);
+ kvfree(hash_desc);
+ return alias;
+
+digest_err:
+ kvfree(hash_desc);
+desc_err:
+ kvfree(alias);
+ return NULL;
+}
+
+int mdev_device_create(struct kobject *kobj, struct device *dev,
+ const char *uuid_str, const guid_t *uuid)
{
int ret;
struct mdev_device *mdev, *tmp;
struct mdev_parent *parent;
struct mdev_type *type = to_mdev_type(kobj);
+ unsigned int alias_len = 0;
+ const char *alias = NULL;
parent = mdev_get_parent(type->parent);
if (!parent)
return -EINVAL;
+ if (parent->ops->get_alias_length)
+ alias_len = parent->ops->get_alias_length();
+ if (alias_len) {
+ alias = generate_alias(uuid_str, alias_len);
+ if (!alias) {
+ ret = -ENOMEM;
+ goto alias_fail;
+ }
+ }
+
mutex_lock(&mdev_list_lock);
/* Check for duplicate */
@@ -300,6 +385,8 @@ int mdev_device_create(struct kobject *kobj,
}
guid_copy(&mdev->uuid, uuid);
+ mdev->alias = alias;
+ alias = NULL;
list_add(&mdev->next, &mdev_list);
mutex_unlock(&mdev_list_lock);
@@ -346,6 +433,8 @@ int mdev_device_create(struct kobject *kobj,
up_read(&parent->unreg_sem);
put_device(&mdev->dev);
mdev_fail:
+ kvfree(alias);
+alias_fail:
mdev_put_parent(parent);
return ret;
}
@@ -406,6 +495,10 @@ EXPORT_SYMBOL(mdev_get_iommu_device);
static int __init mdev_init(void)
{
+ alias_hash = crypto_alloc_shash("sha1", 0, 0);
+ if (!alias_hash)
+ return -ENOMEM;
+
return mdev_bus_register();
}
@@ -415,6 +508,7 @@ static void __exit mdev_exit(void)
class_compat_unregister(mdev_bus_compat_class);
mdev_bus_unregister();
+ crypto_free_shash(alias_hash);
}
module_init(mdev_init)
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
index 7d922950caaf..cf1c0d9842c6 100644
--- a/drivers/vfio/mdev/mdev_private.h
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -33,6 +33,7 @@ struct mdev_device {
struct kobject *type_kobj;
struct device *iommu_device;
bool active;
+ const char *alias;
};
#define to_mdev_device(dev) container_of(dev, struct mdev_device, dev)
@@ -57,8 +58,8 @@ void parent_remove_sysfs_files(struct mdev_parent *parent);
int mdev_create_sysfs_files(struct device *dev, struct mdev_type *type);
void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
-int mdev_device_create(struct kobject *kobj,
- struct device *dev, const guid_t *uuid);
+int mdev_device_create(struct kobject *kobj, struct device *dev,
+ const char *uuid_str, const guid_t *uuid);
int mdev_device_remove(struct device *dev);
#endif /* MDEV_PRIVATE_H */
diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
index 7570c7602ab4..43afe0e80b76 100644
--- a/drivers/vfio/mdev/mdev_sysfs.c
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -63,15 +63,18 @@ static ssize_t create_store(struct kobject *kobj, struct device *dev,
return -ENOMEM;
ret = guid_parse(str, &uuid);
- kfree(str);
if (ret)
- return ret;
+ goto err;
- ret = mdev_device_create(kobj, dev, &uuid);
+ ret = mdev_device_create(kobj, dev, str, &uuid);
if (ret)
- return ret;
+ goto err;
- return count;
+ ret = count;
+
+err:
+ kfree(str);
+ return ret;
}
MDEV_TYPE_ATTR_WO(create);
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
index 0ce30ca78db0..f036fe9854ee 100644
--- a/include/linux/mdev.h
+++ b/include/linux/mdev.h
@@ -72,6 +72,9 @@ struct device *mdev_get_iommu_device(struct device *dev);
* @mmap: mmap callback
* @mdev: mediated device structure
* @vma: vma structure
+ * @get_alias_length: Generate alias for the mdevs of this parent based on the
+ * mdev device name when it returns non zero alias length.
+ * It is optional.
* Parent device that support mediated device should be registered with mdev
* module with mdev_parent_ops structure.
**/
@@ -92,6 +95,7 @@ struct mdev_parent_ops {
long (*ioctl)(struct mdev_device *mdev, unsigned int cmd,
unsigned long arg);
int (*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
+ unsigned int (*get_alias_length)(void);
};
/* interface for exporting mdev supported type attributes */
--
2.19.2
^ permalink raw reply related
* [PATCH 4/4] mtty: Optionally support mtty alias
From: Parav Pandit @ 2019-08-26 20:41 UTC (permalink / raw)
To: alex.williamson, jiri, kwankhede, cohuck, davem
Cc: kvm, linux-kernel, netdev, Parav Pandit
In-Reply-To: <20190826204119.54386-1-parav@mellanox.com>
Provide a module parameter to set alias length to optionally generate
mdev alias.
Example to request mdev alias.
$ modprobe mtty alias_length=12
Signed-off-by: Parav Pandit <parav@mellanox.com>
---
samples/vfio-mdev/mtty.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/samples/vfio-mdev/mtty.c b/samples/vfio-mdev/mtty.c
index 92e770a06ea2..92208245b057 100644
--- a/samples/vfio-mdev/mtty.c
+++ b/samples/vfio-mdev/mtty.c
@@ -1410,6 +1410,15 @@ static struct attribute_group *mdev_type_groups[] = {
NULL,
};
+static unsigned int mtty_alias_length;
+module_param_named(alias_length, mtty_alias_length, uint, 0444);
+MODULE_PARM_DESC(alias_length, "mdev alias length; default=0");
+
+static unsigned int mtty_get_alias_length(void)
+{
+ return mtty_alias_length;
+}
+
static const struct mdev_parent_ops mdev_fops = {
.owner = THIS_MODULE,
.dev_attr_groups = mtty_dev_groups,
@@ -1422,6 +1431,7 @@ static const struct mdev_parent_ops mdev_fops = {
.read = mtty_read,
.write = mtty_write,
.ioctl = mtty_ioctl,
+ .get_alias_length = mtty_get_alias_length
};
static void mtty_device_release(struct device *dev)
--
2.19.2
^ permalink raw reply related
* [PATCH 3/4] mdev: Expose mdev alias in sysfs tree
From: Parav Pandit @ 2019-08-26 20:41 UTC (permalink / raw)
To: alex.williamson, jiri, kwankhede, cohuck, davem
Cc: kvm, linux-kernel, netdev, Parav Pandit
In-Reply-To: <20190826204119.54386-1-parav@mellanox.com>
Expose mdev alias as string in a sysfs tree so that such attribute can
be used to generate netdevice name by systemd/udev or can be used to
match other kernel objects based on the alias of the mdev.
Signed-off-by: Parav Pandit <parav@mellanox.com>
---
drivers/vfio/mdev/mdev_sysfs.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
index 43afe0e80b76..59f4e3cc5233 100644
--- a/drivers/vfio/mdev/mdev_sysfs.c
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -246,7 +246,20 @@ static ssize_t remove_store(struct device *dev, struct device_attribute *attr,
static DEVICE_ATTR_WO(remove);
+static ssize_t alias_show(struct device *device,
+ struct device_attribute *attr, char *buf)
+{
+ struct mdev_device *dev = mdev_from_dev(device);
+
+ if (!dev->alias)
+ return -EOPNOTSUPP;
+
+ return sprintf(buf, "%s\n", dev->alias);
+}
+static DEVICE_ATTR_RO(alias);
+
static const struct attribute *mdev_device_attrs[] = {
+ &dev_attr_alias.attr,
&dev_attr_remove.attr,
NULL,
};
--
2.19.2
^ permalink raw reply related
* [PATCH 0/4] Introduce variable length mdev alias
From: Parav Pandit @ 2019-08-26 20:41 UTC (permalink / raw)
To: alex.williamson, jiri, kwankhede, cohuck, davem
Cc: kvm, linux-kernel, netdev, Parav Pandit
To have consistent naming for the netdevice of a mdev and to have
consistent naming of the devlink port [1] of a mdev, which is formed using
phys_port_name of the devlink port, current UUID is not usable because
UUID is too long.
UUID in string format is 36-characters long and in binary 128-bit.
Both formats are not able to fit within 15 characters limit of netdev
name.
It is desired to have mdev device naming consistent using UUID.
So that widely used user space framework such as ovs [2] can make use
of mdev representor in similar way as PCIe SR-IOV VF and PF representors.
Hence,
(a) mdev alias is created which is derived using sha1 from the mdev name.
(b) Vendor driver describes how long an alias should be for the child mdev
created for a given parent.
(c) Mdev aliases are unique at system level.
(d) alias is created optionally whenever parent requested.
This ensures that non networking mdev parents can function without alias
creation overhead.
This design is discussed at [3].
An example systemd/udev extension will have,
1. netdev name created using mdev alias available in sysfs.
mdev UUID=83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
mdev 12 character alias=cd5b146a80a5
netdev name of this mdev = enmcd5b146a80a5
Here en = Ethernet link
m = mediated device
2. devlink port phys_port_name created using mdev alias.
devlink phys_port_name=pcd5b146a80a5
This patchset enables mdev core to maintain unique alias for a mdev.
Patch-1 Introduces mdev alias using sha1.
Patch-2 Ensures that mdev alias is unique in a system.
Patch-3 Exposes mdev alias in a sysfs hirerchy.
Patch-4 Extends mtty driver to optionally provide alias generation.
This also enables to test UUID based sha1 collision and trigger
error handling for duplicate sha1 results.
In future when networking driver wants to use mdev alias, mdev_alias()
API will be added to derive devlink port name.
[1] http://man7.org/linux/man-pages/man8/devlink-port.8.html
[2] https://docs.openstack.org/os-vif/latest/user/plugins/ovs.html
[3] https://patchwork.kernel.org/cover/11084231/
Parav Pandit (4):
mdev: Introduce sha1 based mdev alias
mdev: Make mdev alias unique among all mdevs
mdev: Expose mdev alias in sysfs tree
mtty: Optionally support mtty alias
drivers/vfio/mdev/mdev_core.c | 103 ++++++++++++++++++++++++++++++-
drivers/vfio/mdev/mdev_private.h | 5 +-
drivers/vfio/mdev/mdev_sysfs.c | 26 ++++++--
include/linux/mdev.h | 4 ++
samples/vfio-mdev/mtty.c | 10 +++
5 files changed, 139 insertions(+), 9 deletions(-)
--
2.19.2
^ permalink raw reply
* [PATCH 2/4] mdev: Make mdev alias unique among all mdevs
From: Parav Pandit @ 2019-08-26 20:41 UTC (permalink / raw)
To: alex.williamson, jiri, kwankhede, cohuck, davem
Cc: kvm, linux-kernel, netdev, Parav Pandit
In-Reply-To: <20190826204119.54386-1-parav@mellanox.com>
Mdev alias should be unique among all the mdevs, so that when such alias
is used by the mdev users to derive other objects, there is no
collision in a given system.
Signed-off-by: Parav Pandit <parav@mellanox.com>
---
drivers/vfio/mdev/mdev_core.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index e825ff38b037..6eb37f0c6369 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -375,6 +375,11 @@ int mdev_device_create(struct kobject *kobj, struct device *dev,
ret = -EEXIST;
goto mdev_fail;
}
+ if (tmp->alias && strcmp(tmp->alias, alias) == 0) {
+ mutex_unlock(&mdev_list_lock);
+ ret = -EEXIST;
+ goto mdev_fail;
+ }
}
mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
--
2.19.2
^ permalink raw reply related
* [PATCH V2 net 1/2] openvswitch: Properly set L4 keys on "later" IP fragments
From: Greg Rose @ 2019-08-26 20:45 UTC (permalink / raw)
To: netdev, pshelar; +Cc: joe, Greg Rose
When IP fragments are reassembled before being sent to conntrack, the
key from the last fragment is used. Unless there are reordering
issues, the last fragment received will not contain the L4 ports, so the
key for the reassembled datagram won't contain them. This patch updates
the key once we have a reassembled datagram.
The handle_fragments() function works on L3 headers so we pull the L3/L4
flow key update code from key_extract into a new function
'key_extract_l3l4'. Then we add a another new function
ovs_flow_key_update_l3l4() and export it so that it is accessible by
handle_fragments() for conntrack packet reassembly.
Co-authored by: Justin Pettit <jpettit@ovn.org>
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
---
net/openvswitch/conntrack.c | 5 ++
net/openvswitch/flow.c | 161 ++++++++++++++++++++++++++------------------
net/openvswitch/flow.h | 1 +
3 files changed, 101 insertions(+), 66 deletions(-)
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index d8da647..05249eb 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -525,6 +525,11 @@ static int handle_fragments(struct net *net, struct sw_flow_key *key,
return -EPFNOSUPPORT;
}
+ /* The key extracted from the fragment that completed this datagram
+ * likely didn't have an L4 header, so regenerate it.
+ */
+ ovs_flow_key_update_l3l4(skb, key);
+
key->ip.frag = OVS_FRAG_TYPE_NONE;
skb_clear_hash(skb);
skb->ignore_df = 1;
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index bc89e16..ea12ee6 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -523,78 +523,15 @@ static int parse_nsh(struct sk_buff *skb, struct sw_flow_key *key)
}
/**
- * key_extract - extracts a flow key from an Ethernet frame.
+ * key_extract_l3l4 - extracts L3/L4 header information.
* @skb: sk_buff that contains the frame, with skb->data pointing to the
- * Ethernet header
+ * L3 header
* @key: output flow key
*
- * The caller must ensure that skb->len >= ETH_HLEN.
- *
- * Returns 0 if successful, otherwise a negative errno value.
- *
- * Initializes @skb header fields as follows:
- *
- * - skb->mac_header: the L2 header.
- *
- * - skb->network_header: just past the L2 header, or just past the
- * VLAN header, to the first byte of the L2 payload.
- *
- * - skb->transport_header: If key->eth.type is ETH_P_IP or ETH_P_IPV6
- * on output, then just past the IP header, if one is present and
- * of a correct length, otherwise the same as skb->network_header.
- * For other key->eth.type values it is left untouched.
- *
- * - skb->protocol: the type of the data starting at skb->network_header.
- * Equals to key->eth.type.
*/
-static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
+static int key_extract_l3l4(struct sk_buff *skb, struct sw_flow_key *key)
{
int error;
- struct ethhdr *eth;
-
- /* Flags are always used as part of stats */
- key->tp.flags = 0;
-
- skb_reset_mac_header(skb);
-
- /* Link layer. */
- clear_vlan(key);
- if (ovs_key_mac_proto(key) == MAC_PROTO_NONE) {
- if (unlikely(eth_type_vlan(skb->protocol)))
- return -EINVAL;
-
- skb_reset_network_header(skb);
- key->eth.type = skb->protocol;
- } else {
- eth = eth_hdr(skb);
- ether_addr_copy(key->eth.src, eth->h_source);
- ether_addr_copy(key->eth.dst, eth->h_dest);
-
- __skb_pull(skb, 2 * ETH_ALEN);
- /* We are going to push all headers that we pull, so no need to
- * update skb->csum here.
- */
-
- if (unlikely(parse_vlan(skb, key)))
- return -ENOMEM;
-
- key->eth.type = parse_ethertype(skb);
- if (unlikely(key->eth.type == htons(0)))
- return -ENOMEM;
-
- /* Multiple tagged packets need to retain TPID to satisfy
- * skb_vlan_pop(), which will later shift the ethertype into
- * skb->protocol.
- */
- if (key->eth.cvlan.tci & htons(VLAN_CFI_MASK))
- skb->protocol = key->eth.cvlan.tpid;
- else
- skb->protocol = key->eth.type;
-
- skb_reset_network_header(skb);
- __skb_push(skb, skb->data - skb_mac_header(skb));
- }
- skb_reset_mac_len(skb);
/* Network layer. */
if (key->eth.type == htons(ETH_P_IP)) {
@@ -788,6 +725,98 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
return 0;
}
+/**
+ * key_extract - extracts a flow key from an Ethernet frame.
+ * @skb: sk_buff that contains the frame, with skb->data pointing to the
+ * Ethernet header
+ * @key: output flow key
+ *
+ * The caller must ensure that skb->len >= ETH_HLEN.
+ *
+ * Returns 0 if successful, otherwise a negative errno value.
+ *
+ * Initializes @skb header fields as follows:
+ *
+ * - skb->mac_header: the L2 header.
+ *
+ * - skb->network_header: just past the L2 header, or just past the
+ * VLAN header, to the first byte of the L2 payload.
+ *
+ * - skb->transport_header: If key->eth.type is ETH_P_IP or ETH_P_IPV6
+ * on output, then just past the IP header, if one is present and
+ * of a correct length, otherwise the same as skb->network_header.
+ * For other key->eth.type values it is left untouched.
+ *
+ * - skb->protocol: the type of the data starting at skb->network_header.
+ * Equals to key->eth.type.
+ */
+static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
+{
+ struct ethhdr *eth;
+
+ /* Flags are always used as part of stats */
+ key->tp.flags = 0;
+
+ skb_reset_mac_header(skb);
+
+ /* Link layer. */
+ clear_vlan(key);
+ if (ovs_key_mac_proto(key) == MAC_PROTO_NONE) {
+ if (unlikely(eth_type_vlan(skb->protocol)))
+ return -EINVAL;
+
+ skb_reset_network_header(skb);
+ key->eth.type = skb->protocol;
+ } else {
+ eth = eth_hdr(skb);
+ ether_addr_copy(key->eth.src, eth->h_source);
+ ether_addr_copy(key->eth.dst, eth->h_dest);
+
+ __skb_pull(skb, 2 * ETH_ALEN);
+ /* We are going to push all headers that we pull, so no need to
+ * update skb->csum here.
+ */
+
+ if (unlikely(parse_vlan(skb, key)))
+ return -ENOMEM;
+
+ key->eth.type = parse_ethertype(skb);
+ if (unlikely(key->eth.type == htons(0)))
+ return -ENOMEM;
+
+ /* Multiple tagged packets need to retain TPID to satisfy
+ * skb_vlan_pop(), which will later shift the ethertype into
+ * skb->protocol.
+ */
+ if (key->eth.cvlan.tci & htons(VLAN_CFI_MASK))
+ skb->protocol = key->eth.cvlan.tpid;
+ else
+ skb->protocol = key->eth.type;
+
+ skb_reset_network_header(skb);
+ __skb_push(skb, skb->data - skb_mac_header(skb));
+ }
+
+ skb_reset_mac_len(skb);
+
+ /* Fill out L3/L4 key info, if any */
+ return key_extract_l3l4(skb, key);
+}
+
+/* In the case of conntrack fragment handling it expects L3 headers,
+ * add a helper.
+ */
+int ovs_flow_key_update_l3l4(struct sk_buff *skb, struct sw_flow_key *key)
+{
+ int res;
+
+ res = key_extract_l3l4(skb, key);
+ if (!res)
+ key->mac_proto &= ~SW_FLOW_KEY_INVALID;
+
+ return res;
+}
+
int ovs_flow_key_update(struct sk_buff *skb, struct sw_flow_key *key)
{
int res;
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index a5506e2..b830d5f 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -270,6 +270,7 @@ void ovs_flow_stats_get(const struct sw_flow *, struct ovs_flow_stats *,
u64 ovs_flow_used_time(unsigned long flow_jiffies);
int ovs_flow_key_update(struct sk_buff *skb, struct sw_flow_key *key);
+int ovs_flow_key_update_l3l4(struct sk_buff *skb, struct sw_flow_key *key);
int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
struct sk_buff *skb,
struct sw_flow_key *key);
--
1.8.3.1
^ permalink raw reply related
* [PATCH V2 net 2/2] openvswitch: Clear the L4 portion of the key for "later" fragments.
From: Greg Rose @ 2019-08-26 20:45 UTC (permalink / raw)
To: netdev, pshelar; +Cc: joe, Justin Pettit
In-Reply-To: <1566852359-8028-1-git-send-email-gvrose8192@gmail.com>
From: Justin Pettit <jpettit@ovn.org>
Only the first fragment in a datagram contains the L4 headers. When the
Open vSwitch module parses a packet, it always sets the IP protocol
field in the key, but can only set the L4 fields on the first fragment.
The original behavior would not clear the L4 portion of the key, so
garbage values would be sent in the key for "later" fragments. This
patch clears the L4 fields in that circumstance to prevent sending those
garbage values as part of the upcall.
Signed-off-by: Justin Pettit <jpettit@ovn.org>
---
net/openvswitch/flow.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index ea12ee6..22fd55f 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -560,6 +560,7 @@ static int key_extract_l3l4(struct sk_buff *skb, struct sw_flow_key *key)
offset = nh->frag_off & htons(IP_OFFSET);
if (offset) {
key->ip.frag = OVS_FRAG_TYPE_LATER;
+ memset(&key->tp, 0, sizeof(key->tp));
return 0;
}
if (nh->frag_off & htons(IP_MF) ||
@@ -677,8 +678,10 @@ static int key_extract_l3l4(struct sk_buff *skb, struct sw_flow_key *key)
return error;
}
- if (key->ip.frag == OVS_FRAG_TYPE_LATER)
+ if (key->ip.frag == OVS_FRAG_TYPE_LATER) {
+ memset(&key->tp, 0, sizeof(key->tp));
return 0;
+ }
if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP)
key->ip.frag = OVS_FRAG_TYPE_FIRST;
--
1.8.3.1
^ permalink raw reply related
* [PATCH net-next] r8169: improve DMA handling in rtl_rx
From: Heiner Kallweit @ 2019-08-26 20:52 UTC (permalink / raw)
To: Realtek linux nic maintainers, David Miller; +Cc: netdev@vger.kernel.org
Move the call to dma_sync_single_for_cpu after calling napi_alloc_skb.
This avoids calling dma_sync_single_for_cpu w/o handing control back
to device if the memory allocation should fail.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
drivers/net/ethernet/realtek/r8169_main.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 6182e7d33..faa4041cf 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -5807,16 +5807,15 @@ static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp, u32 budget
goto release_descriptor;
}
- dma_sync_single_for_cpu(tp_to_dev(tp),
- le64_to_cpu(desc->addr),
- pkt_size, DMA_FROM_DEVICE);
-
skb = napi_alloc_skb(&tp->napi, pkt_size);
if (unlikely(!skb)) {
dev->stats.rx_dropped++;
goto release_descriptor;
}
+ dma_sync_single_for_cpu(tp_to_dev(tp),
+ le64_to_cpu(desc->addr),
+ pkt_size, DMA_FROM_DEVICE);
prefetch(rx_buf);
skb_copy_to_linear_data(skb, rx_buf, pkt_size);
skb->tail += pkt_size;
--
2.23.0
^ permalink raw reply related
* Re: [PATCH net v2 1/2] Revert "r8152: napi hangup fix after disconnect"
From: David Miller @ 2019-08-26 20:55 UTC (permalink / raw)
To: hayeswang; +Cc: jslaby, netdev, nic_swsd, linux-kernel
In-Reply-To: <0835B3720019904CB8F7AA43166CEEB2F18D6733@RTITMBSVM03.realtek.com.tw>
From: Hayes Wang <hayeswang@realtek.com>
Date: Mon, 26 Aug 2019 09:43:32 +0000
> Jiri Slaby [mailto:jslaby@suse.cz]
>> Sent: Monday, August 26, 2019 4:55 PM
> [...]
>> Could you clarify *why* it conflicts? And how is the problem fixed by
>> 0ee1f473496 avoided now?
>
> In rtl8152_disconnect(), the flow would be as following.
>
> static void rtl8152_disconnect(struct usb_interface *intf)
> {
> ...
> - netif_napi_del(&tp->napi);
> - unregister_netdev(tp->netdev);
> - rtl8152_close
> - napi_disable
>
> Therefore you add a checking of RTL8152_UNPLUG to avoid
> calling napi_disable() after netif_napi_del(). However,
> after commit ffa9fec30ca0 ("r8152: set RTL8152_UNPLUG
> only for real disconnection"), RTL8152_UNPLUG is not
> always set when calling rtl8152_disconnect(). That is,
> napi_disable() would be called after netif_napi_del(),
> if RTL8152_UNPLUG is not set.
>
> The best way is to avoid calling netif_napi_del() before
> calling unregister_netdev(). And I has submitted such
> patch following this one.
These details belong in the commit message, always.
^ permalink raw reply
* Re: [PATCH net-next v4 6/6] net: dsa: mv88e6xxx: fully support SERDES on Topaz family
From: David Miller @ 2019-08-26 21:04 UTC (permalink / raw)
To: andrew; +Cc: marek.behun, netdev, vivien.didelot, f.fainelli, olteanv
In-Reply-To: <20190826153830.GE2168@lunn.ch>
From: Andrew Lunn <andrew@lunn.ch>
Date: Mon, 26 Aug 2019 17:38:30 +0200
>> +static int mv88e6xxx_port_set_cmode(struct mv88e6xxx_chip *chip, int port,
>> + phy_interface_t mode, bool allow_over_2500,
>> + bool make_cmode_writable)
>
> I don't like these two parameters. The caller of this function can do
> the check for allow_over_2500 and error out before calling this.
>
> Is make_cmode_writable something that could be done once at probe and
> then forgotten about? Or is it needed before every write? At least
> move it into the specific port_set_cmode() that requires it.
Please respin this series once these issues are sorted out.
Thanks.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox