* [PATCH AUTOSEL 5.2 005/249] wil6210: fix potential out-of-bounds read
From: Sasha Levin @ 2019-07-15 13:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Gustavo A. R. Silva, Maya Erez, Kalle Valo, Sasha Levin,
linux-wireless, wil6210, netdev
In-Reply-To: <20190715133550.1772-1-sashal@kernel.org>
From: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
[ Upstream commit bfabdd6997323adbedccb13a3fed1967fb8cf8f5 ]
Notice that *rc* can evaluate to up to 5, include/linux/netdevice.h:
enum gro_result {
GRO_MERGED,
GRO_MERGED_FREE,
GRO_HELD,
GRO_NORMAL,
GRO_DROP,
GRO_CONSUMED,
};
typedef enum gro_result gro_result_t;
In case *rc* evaluates to 5, we end up having an out-of-bounds read
at drivers/net/wireless/ath/wil6210/txrx.c:821:
wil_dbg_txrx(wil, "Rx complete %d bytes => %s\n",
len, gro_res_str[rc]);
Fix this by adding element "GRO_CONSUMED" to array gro_res_str.
Addresses-Coverity-ID: 1444666 ("Out-of-bounds read")
Fixes: 194b482b5055 ("wil6210: Debug print GRO Rx result")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Maya Erez <merez@codeaurora.org>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/net/wireless/ath/wil6210/txrx.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/wireless/ath/wil6210/txrx.c b/drivers/net/wireless/ath/wil6210/txrx.c
index 4ccfd1404458..d74837cce67f 100644
--- a/drivers/net/wireless/ath/wil6210/txrx.c
+++ b/drivers/net/wireless/ath/wil6210/txrx.c
@@ -750,6 +750,7 @@ void wil_netif_rx_any(struct sk_buff *skb, struct net_device *ndev)
[GRO_HELD] = "GRO_HELD",
[GRO_NORMAL] = "GRO_NORMAL",
[GRO_DROP] = "GRO_DROP",
+ [GRO_CONSUMED] = "GRO_CONSUMED",
};
wil->txrx_ops.get_netif_rx_params(skb, &cid, &security);
--
2.20.1
^ permalink raw reply related
* [PATCH AUTOSEL 5.2 002/249] ath10k: htt: don't use txdone_fifo with SDIO
From: Sasha Levin @ 2019-07-15 13:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Alagu Sankar, Wen Gong, Kalle Valo, Sasha Levin, ath10k,
linux-wireless, netdev
In-Reply-To: <20190715133550.1772-1-sashal@kernel.org>
From: Alagu Sankar <alagusankar@silex-india.com>
[ Upstream commit e2a6b711282a371c5153239e0468a48254f17ca6 ]
HTT High Latency (ATH10K_DEV_TYPE_HL) does not use txdone_fifo at all, we don't
even initialise it by skipping ath10k_htt_tx_alloc_buf() in
ath10k_htt_tx_start(). Because of this using QCA6174 SDIO
ath10k_htt_rx_tx_compl_ind() will crash when it accesses unitialised
txdone_fifo. So skip txdone_fifo when using High Latency mode.
Tested with QCA6174 SDIO with firmware WLAN.RMH.4.4.1-00007-QCARMSWP-1.
Co-developed-by: Wen Gong <wgong@codeaurora.org>
Signed-off-by: Alagu Sankar <alagusankar@silex-india.com>
Signed-off-by: Wen Gong <wgong@codeaurora.org>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/net/wireless/ath/ath10k/htt_rx.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c b/drivers/net/wireless/ath/ath10k/htt_rx.c
index 1acc622d2183..f22840bbc389 100644
--- a/drivers/net/wireless/ath/ath10k/htt_rx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
@@ -2277,7 +2277,9 @@ static void ath10k_htt_rx_tx_compl_ind(struct ath10k *ar,
* Note that with only one concurrent reader and one concurrent
* writer, you don't need extra locking to use these macro.
*/
- if (!kfifo_put(&htt->txdone_fifo, tx_done)) {
+ if (ar->bus_param.dev_type == ATH10K_DEV_TYPE_HL) {
+ ath10k_txrx_tx_unref(htt, &tx_done);
+ } else if (!kfifo_put(&htt->txdone_fifo, tx_done)) {
ath10k_warn(ar, "txdone fifo overrun, msdu_id %d status %d\n",
tx_done.msdu_id, tx_done.status);
ath10k_txrx_tx_unref(htt, &tx_done);
--
2.20.1
^ permalink raw reply related
* [PATCH AUTOSEL 5.2 001/249] ath10k: Check tx_stats before use it
From: Sasha Levin @ 2019-07-15 13:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Yingying Tang, Kalle Valo, Sasha Levin, ath10k, linux-wireless,
netdev
From: Yingying Tang <yintang@codeaurora.org>
[ Upstream commit 9e7251fa38978b85108c44743e1436d48e8d0d76 ]
tx_stats will be freed and set to NULL before debugfs_sta node is
removed in station disconnetion process. So if read the debugfs_sta
node there may be NULL pointer error. Add check for tx_stats before
use it to resove this issue.
Signed-off-by: Yingying Tang <yintang@codeaurora.org>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/net/wireless/ath/ath10k/debugfs_sta.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/net/wireless/ath/ath10k/debugfs_sta.c b/drivers/net/wireless/ath/ath10k/debugfs_sta.c
index c704ae371c4d..42931a669b02 100644
--- a/drivers/net/wireless/ath/ath10k/debugfs_sta.c
+++ b/drivers/net/wireless/ath/ath10k/debugfs_sta.c
@@ -663,6 +663,13 @@ static ssize_t ath10k_dbg_sta_dump_tx_stats(struct file *file,
mutex_lock(&ar->conf_mutex);
+ if (!arsta->tx_stats) {
+ ath10k_warn(ar, "failed to get tx stats");
+ mutex_unlock(&ar->conf_mutex);
+ kfree(buf);
+ return 0;
+ }
+
spin_lock_bh(&ar->data_lock);
for (k = 0; k < ATH10K_STATS_TYPE_MAX; k++) {
for (j = 0; j < ATH10K_COUNTER_TYPE_MAX; j++) {
--
2.20.1
^ permalink raw reply related
* Re: [PATCH v2 1/2] rt2x00usb: fix rx queue hang
From: Soeren Moch @ 2019-07-15 13:33 UTC (permalink / raw)
To: Kalle Valo
Cc: Stanislaw Gruszka, stable, Helmut Schaa, David S. Miller,
linux-wireless, netdev, linux-kernel
In-Reply-To: <874l3nadjf.fsf@kamboji.qca.qualcomm.com>
On 15.07.19 10:48, Kalle Valo wrote:
> Soeren Moch <smoch@web.de> writes:
>
>> Since commit ed194d136769 ("usb: core: remove local_irq_save() around
>> ->complete() handler") the handler rt2x00usb_interrupt_rxdone() is
>> not running with interrupts disabled anymore. So this completion handler
>> is not guaranteed to run completely before workqueue processing starts
>> for the same queue entry.
>> Be sure to set all other flags in the entry correctly before marking
>> this entry ready for workqueue processing. This way we cannot miss error
>> conditions that need to be signalled from the completion handler to the
>> worker thread.
>> Note that rt2x00usb_work_rxdone() processes all available entries, not
>> only such for which queue_work() was called.
>>
>> This patch is similar to what commit df71c9cfceea ("rt2x00: fix order
>> of entry flags modification") did for TX processing.
>>
>> This fixes a regression on a RT5370 based wifi stick in AP mode, which
>> suddenly stopped data transmission after some period of heavy load. Also
>> stopping the hanging hostapd resulted in the error message "ieee80211
>> phy0: rt2x00queue_flush_queue: Warning - Queue 14 failed to flush".
>> Other operation modes are probably affected as well, this just was
>> the used testcase.
>>
>> Fixes: ed194d136769 ("usb: core: remove local_irq_save() around ->complete() handler")
>> Cc: stable@vger.kernel.org # 4.20+
>> Signed-off-by: Soeren Moch <smoch@web.de>
> I'll queue this for v5.3.
>
OK, thanks,
Soeren
^ permalink raw reply
* Re: INFO: rcu detected stall in ext4_write_checks
From: Dmitry Vyukov @ 2019-07-15 13:33 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul E. McKenney, Theodore Ts'o, syzbot, Andreas Dilger,
David Miller, eladr, Ido Schimmel, Jiri Pirko, John Stultz,
linux-ext4, LKML, netdev, syzkaller-bugs, Thomas Gleixner,
Ingo Molnar
In-Reply-To: <20190715132911.GG3419@hirez.programming.kicks-ass.net>
On Mon, Jul 15, 2019 at 3:29 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Jul 14, 2019 at 11:49:15AM -0700, Paul E. McKenney wrote:
> > On Sun, Jul 14, 2019 at 05:48:00PM +0300, Dmitry Vyukov wrote:
> > > But short term I don't see any other solution than stop testing
> > > sched_setattr because it does not check arguments enough to prevent
> > > system misbehavior. Which is a pity because syzkaller has found some
> > > bad misconfigurations that were oversight on checking side.
> > > Any other suggestions?
> >
> > Keep the times down to a few seconds? Of course, that might also
> > fail to find interesting bugs.
>
> Right, if syzcaller can put a limit on the period/deadline parameters
> (and make sure to not write "-1" to
> /proc/sys/kernel/sched_rt_runtime_us) then per the in-kernel
> access-control should not allow these things to happen.
Since we are racing with emails, could you suggest a 100% safe
parameters? Because I only hear people saying "safe", "sane",
"well-behaving" :)
If we move the check to user-space, it does not mean that we can get
away without actually defining what that means.
Now thinking of this, if we come up with some simple criteria, could
we have something like a sysctl that would allow only really "safe"
parameters?
^ permalink raw reply
* Re: INFO: rcu detected stall in ext4_write_checks
From: Dmitry Vyukov @ 2019-07-15 13:29 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Theodore Ts'o, syzbot, Andreas Dilger, David Miller, eladr,
Ido Schimmel, Jiri Pirko, John Stultz, linux-ext4, LKML, netdev,
syzkaller-bugs, Thomas Gleixner, Peter Zijlstra, Ingo Molnar
In-Reply-To: <20190715130101.GA5527@linux.ibm.com>
On Mon, Jul 15, 2019 at 3:01 PM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
>
> On Sun, Jul 14, 2019 at 08:10:27PM -0700, Paul E. McKenney wrote:
> > On Sun, Jul 14, 2019 at 12:29:51PM -0700, Paul E. McKenney wrote:
> > > On Sun, Jul 14, 2019 at 03:05:22PM -0400, Theodore Ts'o wrote:
> > > > On Sun, Jul 14, 2019 at 05:48:00PM +0300, Dmitry Vyukov wrote:
> > > > > But short term I don't see any other solution than stop testing
> > > > > sched_setattr because it does not check arguments enough to prevent
> > > > > system misbehavior. Which is a pity because syzkaller has found some
> > > > > bad misconfigurations that were oversight on checking side.
> > > > > Any other suggestions?
> > > >
> > > > Or maybe syzkaller can put its own limitations on what parameters are
> > > > sent to sched_setattr? In practice, there are any number of ways a
> > > > root user can shoot themselves in the foot when using sched_setattr or
> > > > sched_setaffinity, for that matter. I imagine there must be some such
> > > > constraints already --- or else syzkaller might have set a kernel
> > > > thread to run with priority SCHED_BATCH, with similar catastrophic
> > > > effects --- or do similar configurations to make system threads
> > > > completely unschedulable.
> > > >
> > > > Real time administrators who know what they are doing --- and who know
> > > > that their real-time threads are well behaved --- will always want to
> > > > be able to do things that will be catastrophic if the real-time thread
> > > > is *not* well behaved. I don't it is possible to add safety checks
> > > > which would allow the kernel to automatically detect and reject unsafe
> > > > configurations.
> > > >
> > > > An apt analogy might be civilian versus military aircraft. Most
> > > > airplanes are designed to be "inherently stable"; that way, modulo
> > > > buggy/insane control systems like on the 737 Max, the airplane will
> > > > automatically return to straight and level flight. On the other hand,
> > > > some military planes (for example, the F-16, F-22, F-36, the
> > > > Eurofighter, etc.) are sometimes designed to be unstable, since that
> > > > way they can be more maneuverable.
> > > >
> > > > There are use cases for real-time Linux where this flexibility/power
> > > > vs. stability tradeoff is going to argue for giving root the
> > > > flexibility to crash the system. Some of these systems might
> > > > literally involve using real-time Linux in military applications,
> > > > something for which Paul and I have had some experience. :-)
> > > >
> > > > Speaking of sched_setaffinity, one thing which we can do is have
> > > > syzkaller move all of the system threads to they run on the "system
> > > > CPU's", and then move the syzkaller processes which are testing the
> > > > kernel to be on the "system under test CPU's". Then regardless of
> > > > what priority the syzkaller test programs try to run themselves at,
> > > > they can't crash the system.
> > > >
> > > > Some real-time systems do actually run this way, and it's a
> > > > recommended configuration which is much safer than letting the
> > > > real-time threads take over the whole system:
> > > >
> > > > http://linuxrealtime.org/index.php/Improving_the_Real-Time_Properties#Isolating_the_Application
> > >
> > > Good point! We might still have issues with some per-CPU kthreads,
> > > but perhaps use of nohz_full would help at least reduce these sorts
> > > of problems. (There could still be issues on CPUs with more than
> > > one runnable threads.)
> >
> > I looked at testing limitations in a bit more detail from an RCU
> > viewpoint, and came up with the following rough rule of thumb (which of
> > course might or might not survive actual testing experience, but should at
> > least be a good place to start). I believe that the sched_setaffinity()
> > testing rule should be that the SCHED_DEADLINE cycle be no more than
> > two-thirds of the RCU CPU stall warning timeout, which defaults to 21
> > seconds in mainline and 60 seconds in many distro kernels.
> >
> > That is, the SCHED_DEADLINE cycle should never exceed 14 seconds when
> > testing mainline on the one hand or 40 seconds when testing enterprise
> > distros on the other.
> >
> > This assumes quite a bit, though:
> >
> > o The system has ample memory to spare, and isn't running a
> > callback-hungry workload. For example, if you "only" have 100MB
> > of spare memory and you are also repeatedly and concurrently
> > expanding (say) large source trees from tarballs and then deleting
> > those source trees, the system might OOM. The reason OOM might
> > happen is that each close() of a file generates an RCU callback,
> > and 40 seconds worth of waiting-for-a-grace-period structures
> > takes up a surprisingly large amount of memory.
> >
> > So please be careful when combining tests. ;-)
> >
> > o There are no aggressive real-time workloads on the system.
> > The reason for this is that RCU is going to start sending IPIs
> > halfway to the RCU CPU stall timeout, and, in certain situations
> > on CONFIG_NO_HZ_FULL kernels, much earlier. (These situations
> > constitute abuse of CONFIG_NO_HZ_FULL, but then again carefully
> > calibrated abuse is what stress testing is all about.)
> >
> > o The various RCU kthreads will get a chance to run at least once
> > during the SCHED_DEADLINE cycle. If in real life, they only
> > get a chance to run once per two SCHED_DEADLINE cycles, then of
> > course the 14 seconds becomes 7 and the 40 seconds becomes 20.
>
> And there are configurations and workloads that might require division
> by three, so that (assuming one chance to run per cycle), the 14 seconds
> becomes about 5 and the 40 seconds becomes about 15.
>
> > o The current RCU CPU stall warning defaults remain in
> > place. These are set by the CONFIG_RCU_CPU_STALL_TIMEOUT
> > Kconfig parameter, which may in turn be overridden by the
> > rcupdate.rcu_cpu_stall_timeout kernel boot parameter.
> >
> > o The current SCHED_DEADLINE default for providing spare cycles
> > for other uses remains in place.
> >
> > o Other kthreads might have other constraints, but given that you
> > were seeing RCU CPU stall warnings instead of other failures,
> > the needs of RCU's kthreads seem to be a good place to start.
> >
> > Again, the candidate rough rule of thumb is that the the SCHED_DEADLINE
> > cycle be no more than 14 seconds when testing mainline kernels on the one
> > hand and 40 seconds when testing enterprise distro kernels on the other.
> >
> > Dmitry, does that help?
>
> I checked with the people running the Linux Plumbers Conference Scheduler
> Microconference, and they said that they would welcome a proposal on
> this topic, which I have submitted (please see below). Would anyone
> like to join as co-conspirator?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> Title: Making SCHED_DEADLINE safe for kernel kthreads
>
> Abstract:
>
> Dmitry Vyukov's testing work identified some (ab)uses of sched_setattr()
> that can result in SCHED_DEADLINE tasks starving RCU's kthreads for
> extended time periods, not millisecond, not seconds, not minutes, not even
> hours, but days. Given that RCU CPU stall warnings are issued whenever
> an RCU grace period fails to complete within a few tens of seconds,
> the system did not suffer silently. Although one could argue that people
> should avoid abusing sched_setattr(), people are human and humans make
> mistakes. Responding to simple mistakes with RCU CPU stall warnings is
> all well and good, but a more severe case could OOM the system, which
> is a particularly unhelpful error message.
>
> It would be better if the system were capable of operating reasonably
> despite such abuse. Several approaches have been suggested.
>
> First, sched_setattr() could recognize parameter settings that put
> kthreads at risk and refuse to honor those settings. This approach
> of course requires that we identify precisely what combinations of
> sched_setattr() parameters settings are risky, especially given that there
> are likely to be parameter settings that are both risky and highly useful.
>
> Second, in theory, RCU could detect this situation and take the "dueling
> banjos" approach of increasing its priority as needed to get the CPU time
> that its kthreads need to operate correctly. However, the required amount
> of CPU time can vary greatly depending on the workload. Furthermore,
> non-RCU kthreads also need some amount of CPU time, and replicating
> "dueling banjos" across all such Linux-kernel subsystems seems both
> wasteful and error-prone. Finally, experience has shown that setting
> RCU's kthreads to real-time priorities significantly harms performance
> by increasing context-switch rates.
>
> Third, stress testing could be limited to non-risky regimes, such that
> kthreads get CPU time every 5-40 seconds, depending on configuration
> and experience. People needing risky parameter settings could then test
> the settings that they actually need, and also take responsibility for
> ensuring that kthreads get the CPU time that they need. (This of course
> includes per-CPU kthreads!)
>
> Fourth, bandwidth throttling could treat tasks in other scheduling classes
> as an aggregate group having a reasonable aggregate deadline and CPU
> budget. This has the advantage of allowing "abusive" testing to proceed,
> which allows people requiring risky parameter settings to rely on this
> testing. Additionally, it avoids complex progress checking and priority
> setting on the part of many kthreads throughout the system. However,
> if this was an easy choice, the SCHED_DEADLINE developers would likely
> have selected it. For example, it is necessary to determine what might
> be a "reasonable" aggregate deadline and CPU budget. Reserving 5%
> seems quite generous, and RCU's grace-period kthread would optimally
> like a deadline in the milliseconds, but would do reasonably well with
> many tens of milliseconds, and absolutely needs a few seconds. However,
> for CONFIG_RCU_NOCB_CPU=y, the RCU's callback-offload kthreads might
> well need a full CPU each! (This happens when the CPU being offloaded
> generates a high rate of callbacks.)
>
> The goal of this proposal is therefore to generate face-to-face
> discussion, hopefully resulting in a good and sufficient solution to
> this problem.
I would be happy to attend if this won't conflict with important
things on the testing and fuzzing MC.
If we restrict arguments for sched_attr, what would be the criteria
for 100% safe arguments? Moving the check from kernel to user-space
does not relief us from explicitly stating the condition in
black-and-white way. All of sched_runtime/sched_deadline/sched_period
be not larger than 1 second?
The problem is that syzkaller does not allow 100% reliable enforcement
for indirect arguments in memory. E.g. inputs arguments can overlap,
input/output can overlap, weird races affect what's actually being
passed to kernel, the memory being mapped from a weird device, etc.
And that's also useful as it can discover TOCTOU bugs, deadlocks, etc.
We could try to wrap sched_setattr and do some additional restrictions
by giving up on TOCTOU, device-mapped memory, etc.
I am also thinking about dropping CAP_SYS_NICE, it should still allow
some configurations, but no inherently unsafe ones.
^ permalink raw reply
* Re: INFO: rcu detected stall in ext4_write_checks
From: Peter Zijlstra @ 2019-07-15 13:29 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Dmitry Vyukov, Theodore Ts'o, syzbot, Andreas Dilger,
David Miller, eladr, Ido Schimmel, Jiri Pirko, John Stultz,
linux-ext4, LKML, netdev, syzkaller-bugs, Thomas Gleixner,
Ingo Molnar
In-Reply-To: <20190714184915.GK26519@linux.ibm.com>
On Sun, Jul 14, 2019 at 11:49:15AM -0700, Paul E. McKenney wrote:
> On Sun, Jul 14, 2019 at 05:48:00PM +0300, Dmitry Vyukov wrote:
> > But short term I don't see any other solution than stop testing
> > sched_setattr because it does not check arguments enough to prevent
> > system misbehavior. Which is a pity because syzkaller has found some
> > bad misconfigurations that were oversight on checking side.
> > Any other suggestions?
>
> Keep the times down to a few seconds? Of course, that might also
> fail to find interesting bugs.
Right, if syzcaller can put a limit on the period/deadline parameters
(and make sure to not write "-1" to
/proc/sys/kernel/sched_rt_runtime_us) then per the in-kernel
access-control should not allow these things to happen.
^ permalink raw reply
* Re: INFO: rcu detected stall in ext4_write_checks
From: Peter Zijlstra @ 2019-07-15 13:22 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Theodore Ts'o, Dmitry Vyukov, syzbot, Andreas Dilger,
David Miller, eladr, Ido Schimmel, Jiri Pirko, John Stultz,
linux-ext4, LKML, netdev, syzkaller-bugs, Thomas Gleixner,
Ingo Molnar
In-Reply-To: <20190707011655.GA22081@linux.ibm.com>
On Sat, Jul 06, 2019 at 06:16:55PM -0700, Paul E. McKenney wrote:
> 4. SCHED_DEADLINE treats the other three scheduling classes as each
> having a period, deadline, and a modest CPU consumption budget
> for the members of the class in aggregate. But this has to have
> been discussed before. How did that go?
Yeah; this has been proposed a number of times; and I think everybody
agrees that it is a good idea, but nobody has so far sat down and wrote
the patches.
Or rather; we would've gotten this for 'free' with the rt-cgroup
rewrite, but that's been stuck forever due to affinity being difficult.
^ permalink raw reply
* Re: INFO: rcu detected stall in ext4_write_checks
From: Paul E. McKenney @ 2019-07-15 13:01 UTC (permalink / raw)
To: Theodore Ts'o, Dmitry Vyukov, syzbot, Andreas Dilger,
David Miller, eladr, Ido Schimmel, Jiri Pirko, John Stultz,
linux-ext4, LKML, netdev, syzkaller-bugs, Thomas Gleixner,
Peter Zijlstra, Ingo Molnar
In-Reply-To: <20190715031027.GA3336@linux.ibm.com>
On Sun, Jul 14, 2019 at 08:10:27PM -0700, Paul E. McKenney wrote:
> On Sun, Jul 14, 2019 at 12:29:51PM -0700, Paul E. McKenney wrote:
> > On Sun, Jul 14, 2019 at 03:05:22PM -0400, Theodore Ts'o wrote:
> > > On Sun, Jul 14, 2019 at 05:48:00PM +0300, Dmitry Vyukov wrote:
> > > > But short term I don't see any other solution than stop testing
> > > > sched_setattr because it does not check arguments enough to prevent
> > > > system misbehavior. Which is a pity because syzkaller has found some
> > > > bad misconfigurations that were oversight on checking side.
> > > > Any other suggestions?
> > >
> > > Or maybe syzkaller can put its own limitations on what parameters are
> > > sent to sched_setattr? In practice, there are any number of ways a
> > > root user can shoot themselves in the foot when using sched_setattr or
> > > sched_setaffinity, for that matter. I imagine there must be some such
> > > constraints already --- or else syzkaller might have set a kernel
> > > thread to run with priority SCHED_BATCH, with similar catastrophic
> > > effects --- or do similar configurations to make system threads
> > > completely unschedulable.
> > >
> > > Real time administrators who know what they are doing --- and who know
> > > that their real-time threads are well behaved --- will always want to
> > > be able to do things that will be catastrophic if the real-time thread
> > > is *not* well behaved. I don't it is possible to add safety checks
> > > which would allow the kernel to automatically detect and reject unsafe
> > > configurations.
> > >
> > > An apt analogy might be civilian versus military aircraft. Most
> > > airplanes are designed to be "inherently stable"; that way, modulo
> > > buggy/insane control systems like on the 737 Max, the airplane will
> > > automatically return to straight and level flight. On the other hand,
> > > some military planes (for example, the F-16, F-22, F-36, the
> > > Eurofighter, etc.) are sometimes designed to be unstable, since that
> > > way they can be more maneuverable.
> > >
> > > There are use cases for real-time Linux where this flexibility/power
> > > vs. stability tradeoff is going to argue for giving root the
> > > flexibility to crash the system. Some of these systems might
> > > literally involve using real-time Linux in military applications,
> > > something for which Paul and I have had some experience. :-)
> > >
> > > Speaking of sched_setaffinity, one thing which we can do is have
> > > syzkaller move all of the system threads to they run on the "system
> > > CPU's", and then move the syzkaller processes which are testing the
> > > kernel to be on the "system under test CPU's". Then regardless of
> > > what priority the syzkaller test programs try to run themselves at,
> > > they can't crash the system.
> > >
> > > Some real-time systems do actually run this way, and it's a
> > > recommended configuration which is much safer than letting the
> > > real-time threads take over the whole system:
> > >
> > > http://linuxrealtime.org/index.php/Improving_the_Real-Time_Properties#Isolating_the_Application
> >
> > Good point! We might still have issues with some per-CPU kthreads,
> > but perhaps use of nohz_full would help at least reduce these sorts
> > of problems. (There could still be issues on CPUs with more than
> > one runnable threads.)
>
> I looked at testing limitations in a bit more detail from an RCU
> viewpoint, and came up with the following rough rule of thumb (which of
> course might or might not survive actual testing experience, but should at
> least be a good place to start). I believe that the sched_setaffinity()
> testing rule should be that the SCHED_DEADLINE cycle be no more than
> two-thirds of the RCU CPU stall warning timeout, which defaults to 21
> seconds in mainline and 60 seconds in many distro kernels.
>
> That is, the SCHED_DEADLINE cycle should never exceed 14 seconds when
> testing mainline on the one hand or 40 seconds when testing enterprise
> distros on the other.
>
> This assumes quite a bit, though:
>
> o The system has ample memory to spare, and isn't running a
> callback-hungry workload. For example, if you "only" have 100MB
> of spare memory and you are also repeatedly and concurrently
> expanding (say) large source trees from tarballs and then deleting
> those source trees, the system might OOM. The reason OOM might
> happen is that each close() of a file generates an RCU callback,
> and 40 seconds worth of waiting-for-a-grace-period structures
> takes up a surprisingly large amount of memory.
>
> So please be careful when combining tests. ;-)
>
> o There are no aggressive real-time workloads on the system.
> The reason for this is that RCU is going to start sending IPIs
> halfway to the RCU CPU stall timeout, and, in certain situations
> on CONFIG_NO_HZ_FULL kernels, much earlier. (These situations
> constitute abuse of CONFIG_NO_HZ_FULL, but then again carefully
> calibrated abuse is what stress testing is all about.)
>
> o The various RCU kthreads will get a chance to run at least once
> during the SCHED_DEADLINE cycle. If in real life, they only
> get a chance to run once per two SCHED_DEADLINE cycles, then of
> course the 14 seconds becomes 7 and the 40 seconds becomes 20.
And there are configurations and workloads that might require division
by three, so that (assuming one chance to run per cycle), the 14 seconds
becomes about 5 and the 40 seconds becomes about 15.
> o The current RCU CPU stall warning defaults remain in
> place. These are set by the CONFIG_RCU_CPU_STALL_TIMEOUT
> Kconfig parameter, which may in turn be overridden by the
> rcupdate.rcu_cpu_stall_timeout kernel boot parameter.
>
> o The current SCHED_DEADLINE default for providing spare cycles
> for other uses remains in place.
>
> o Other kthreads might have other constraints, but given that you
> were seeing RCU CPU stall warnings instead of other failures,
> the needs of RCU's kthreads seem to be a good place to start.
>
> Again, the candidate rough rule of thumb is that the the SCHED_DEADLINE
> cycle be no more than 14 seconds when testing mainline kernels on the one
> hand and 40 seconds when testing enterprise distro kernels on the other.
>
> Dmitry, does that help?
I checked with the people running the Linux Plumbers Conference Scheduler
Microconference, and they said that they would welcome a proposal on
this topic, which I have submitted (please see below). Would anyone
like to join as co-conspirator?
Thanx, Paul
------------------------------------------------------------------------
Title: Making SCHED_DEADLINE safe for kernel kthreads
Abstract:
Dmitry Vyukov's testing work identified some (ab)uses of sched_setattr()
that can result in SCHED_DEADLINE tasks starving RCU's kthreads for
extended time periods, not millisecond, not seconds, not minutes, not even
hours, but days. Given that RCU CPU stall warnings are issued whenever
an RCU grace period fails to complete within a few tens of seconds,
the system did not suffer silently. Although one could argue that people
should avoid abusing sched_setattr(), people are human and humans make
mistakes. Responding to simple mistakes with RCU CPU stall warnings is
all well and good, but a more severe case could OOM the system, which
is a particularly unhelpful error message.
It would be better if the system were capable of operating reasonably
despite such abuse. Several approaches have been suggested.
First, sched_setattr() could recognize parameter settings that put
kthreads at risk and refuse to honor those settings. This approach
of course requires that we identify precisely what combinations of
sched_setattr() parameters settings are risky, especially given that there
are likely to be parameter settings that are both risky and highly useful.
Second, in theory, RCU could detect this situation and take the "dueling
banjos" approach of increasing its priority as needed to get the CPU time
that its kthreads need to operate correctly. However, the required amount
of CPU time can vary greatly depending on the workload. Furthermore,
non-RCU kthreads also need some amount of CPU time, and replicating
"dueling banjos" across all such Linux-kernel subsystems seems both
wasteful and error-prone. Finally, experience has shown that setting
RCU's kthreads to real-time priorities significantly harms performance
by increasing context-switch rates.
Third, stress testing could be limited to non-risky regimes, such that
kthreads get CPU time every 5-40 seconds, depending on configuration
and experience. People needing risky parameter settings could then test
the settings that they actually need, and also take responsibility for
ensuring that kthreads get the CPU time that they need. (This of course
includes per-CPU kthreads!)
Fourth, bandwidth throttling could treat tasks in other scheduling classes
as an aggregate group having a reasonable aggregate deadline and CPU
budget. This has the advantage of allowing "abusive" testing to proceed,
which allows people requiring risky parameter settings to rely on this
testing. Additionally, it avoids complex progress checking and priority
setting on the part of many kthreads throughout the system. However,
if this was an easy choice, the SCHED_DEADLINE developers would likely
have selected it. For example, it is necessary to determine what might
be a "reasonable" aggregate deadline and CPU budget. Reserving 5%
seems quite generous, and RCU's grace-period kthread would optimally
like a deadline in the milliseconds, but would do reasonably well with
many tens of milliseconds, and absolutely needs a few seconds. However,
for CONFIG_RCU_NOCB_CPU=y, the RCU's callback-offload kthreads might
well need a full CPU each! (This happens when the CPU being offloaded
generates a high rate of callbacks.)
The goal of this proposal is therefore to generate face-to-face
discussion, hopefully resulting in a good and sufficient solution to
this problem.
^ permalink raw reply
* [PATCH] i40e: reduce stack usage in i40e_set_fc
From: Arnd Bergmann @ 2019-07-15 12:35 UTC (permalink / raw)
To: Jeff Kirsher, David S. Miller
Cc: Arnd Bergmann, Catherine Sullivan, Aleksandr Loktionov,
Doug Dziggel, Patryk Małek, Piotr Azarewicz, Piotr Marczak,
intel-wired-lan, netdev, linux-kernel, clang-built-linux
The functions i40e_aq_get_phy_abilities_resp() and i40e_set_fc() both
have giant structure on the stack, which makes each one use stack frames
larger than 500 bytes.
As clang decides one function into the other, we get a warning for
exceeding the frame size limit on 32-bit architectures:
drivers/net/ethernet/intel/i40e/i40e_common.c:1654:23: error: stack frame size of 1116 bytes in function 'i40e_set_fc' [-Werror,-Wframe-larger-than=]
When building with gcc, the inlining does not happen, but i40e_set_fc()
calls i40e_aq_get_phy_abilities_resp() anyway, so they add up on the
kernel stack just as much.
The parts that actually use large stacks don't overlap, so make sure
each one is a separate function, and mark them as noinline_for_stack to
prevent the compilers from combining them again.
Fixes: 0a862b43acc6 ("i40e/i40evf: Add module_types and update_link_info")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
drivers/net/ethernet/intel/i40e/i40e_common.c | 91 +++++++++++--------
1 file changed, 51 insertions(+), 40 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c b/drivers/net/ethernet/intel/i40e/i40e_common.c
index 906cf68d3453..7af1b7477140 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
@@ -1643,25 +1643,15 @@ enum i40e_status_code i40e_aq_set_phy_config(struct i40e_hw *hw,
return status;
}
-/**
- * i40e_set_fc
- * @hw: pointer to the hw struct
- * @aq_failures: buffer to return AdminQ failure information
- * @atomic_restart: whether to enable atomic link restart
- *
- * Set the requested flow control mode using set_phy_config.
- **/
-enum i40e_status_code i40e_set_fc(struct i40e_hw *hw, u8 *aq_failures,
- bool atomic_restart)
+static noinline_for_stack enum i40e_status_code
+i40e_set_fc_status(struct i40e_hw *hw,
+ struct i40e_aq_get_phy_abilities_resp *abilities,
+ bool atomic_restart)
{
- enum i40e_fc_mode fc_mode = hw->fc.requested_mode;
- struct i40e_aq_get_phy_abilities_resp abilities;
struct i40e_aq_set_phy_config config;
- enum i40e_status_code status;
+ enum i40e_fc_mode fc_mode = hw->fc.requested_mode;
u8 pause_mask = 0x0;
- *aq_failures = 0x0;
-
switch (fc_mode) {
case I40E_FC_FULL:
pause_mask |= I40E_AQ_PHY_FLAG_PAUSE_TX;
@@ -1677,6 +1667,48 @@ enum i40e_status_code i40e_set_fc(struct i40e_hw *hw, u8 *aq_failures,
break;
}
+ memset(&config, 0, sizeof(struct i40e_aq_set_phy_config));
+ /* clear the old pause settings */
+ config.abilities = abilities->abilities & ~(I40E_AQ_PHY_FLAG_PAUSE_TX) &
+ ~(I40E_AQ_PHY_FLAG_PAUSE_RX);
+ /* set the new abilities */
+ config.abilities |= pause_mask;
+ /* If the abilities have changed, then set the new config */
+ if (config.abilities == abilities->abilities)
+ return 0;
+
+ /* Auto restart link so settings take effect */
+ if (atomic_restart)
+ config.abilities |= I40E_AQ_PHY_ENABLE_ATOMIC_LINK;
+ /* Copy over all the old settings */
+ config.phy_type = abilities->phy_type;
+ config.phy_type_ext = abilities->phy_type_ext;
+ config.link_speed = abilities->link_speed;
+ config.eee_capability = abilities->eee_capability;
+ config.eeer = abilities->eeer_val;
+ config.low_power_ctrl = abilities->d3_lpan;
+ config.fec_config = abilities->fec_cfg_curr_mod_ext_info &
+ I40E_AQ_PHY_FEC_CONFIG_MASK;
+
+ return i40e_aq_set_phy_config(hw, &config, NULL);
+}
+
+/**
+ * i40e_set_fc
+ * @hw: pointer to the hw struct
+ * @aq_failures: buffer to return AdminQ failure information
+ * @atomic_restart: whether to enable atomic link restart
+ *
+ * Set the requested flow control mode using set_phy_config.
+ **/
+enum i40e_status_code i40e_set_fc(struct i40e_hw *hw, u8 *aq_failures,
+ bool atomic_restart)
+{
+ struct i40e_aq_get_phy_abilities_resp abilities;
+ enum i40e_status_code status;
+
+ *aq_failures = 0x0;
+
/* Get the current phy config */
status = i40e_aq_get_phy_capabilities(hw, false, false, &abilities,
NULL);
@@ -1685,31 +1717,10 @@ enum i40e_status_code i40e_set_fc(struct i40e_hw *hw, u8 *aq_failures,
return status;
}
- memset(&config, 0, sizeof(struct i40e_aq_set_phy_config));
- /* clear the old pause settings */
- config.abilities = abilities.abilities & ~(I40E_AQ_PHY_FLAG_PAUSE_TX) &
- ~(I40E_AQ_PHY_FLAG_PAUSE_RX);
- /* set the new abilities */
- config.abilities |= pause_mask;
- /* If the abilities have changed, then set the new config */
- if (config.abilities != abilities.abilities) {
- /* Auto restart link so settings take effect */
- if (atomic_restart)
- config.abilities |= I40E_AQ_PHY_ENABLE_ATOMIC_LINK;
- /* Copy over all the old settings */
- config.phy_type = abilities.phy_type;
- config.phy_type_ext = abilities.phy_type_ext;
- config.link_speed = abilities.link_speed;
- config.eee_capability = abilities.eee_capability;
- config.eeer = abilities.eeer_val;
- config.low_power_ctrl = abilities.d3_lpan;
- config.fec_config = abilities.fec_cfg_curr_mod_ext_info &
- I40E_AQ_PHY_FEC_CONFIG_MASK;
- status = i40e_aq_set_phy_config(hw, &config, NULL);
+ status = i40e_set_fc_status(hw, &abilities, atomic_restart);
+ if (status)
+ *aq_failures |= I40E_SET_FC_AQ_FAIL_SET;
- if (status)
- *aq_failures |= I40E_SET_FC_AQ_FAIL_SET;
- }
/* Update the link info */
status = i40e_update_link_info(hw);
if (status) {
@@ -2537,7 +2548,7 @@ i40e_status i40e_get_link_status(struct i40e_hw *hw, bool *link_up)
* i40e_updatelink_status - update status of the HW network link
* @hw: pointer to the hw struct
**/
-i40e_status i40e_update_link_info(struct i40e_hw *hw)
+noinline_for_stack i40e_status i40e_update_link_info(struct i40e_hw *hw)
{
struct i40e_aq_get_phy_abilities_resp abilities;
i40e_status status = 0;
--
2.20.0
^ permalink raw reply related
* [PATCH v2] e1000e: Make speed detection on hotplugging cable more reliable
From: Kai-Heng Feng @ 2019-07-15 12:25 UTC (permalink / raw)
To: jeffrey.t.kirsher; +Cc: intel-wired-lan, netdev, linux-kernel, Kai-Heng Feng
In-Reply-To: <20190715084355.9962-1-kai.heng.feng@canonical.com>
After hotplugging an 1Gbps ethernet cable with 1Gbps link partner, the
MII_BMSR may report 10Mbps, renders the network rather slow.
The issue has much lower fail rate after commit 59653e6497d1 ("e1000e:
Make watchdog use delayed work"), which essentially introduces some
delay before running the watchdog task.
But there's still a chance that the hotplugging event and the queued
watchdog task gets run at the same time, then the original issue can be
observed once again.
So let's use mod_delayed_work() to add a deterministic 1 second delay
before running watchdog task, after an interrupt.
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
drivers/net/ethernet/intel/e1000e/netdev.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index e4baa13b3cda..c83bf5349d53 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1780,8 +1780,8 @@ static irqreturn_t e1000_intr_msi(int __always_unused irq, void *data)
}
/* guard against interrupt when we're going down */
if (!test_bit(__E1000_DOWN, &adapter->state))
- queue_delayed_work(adapter->e1000_workqueue,
- &adapter->watchdog_task, 1);
+ mod_delayed_work(adapter->e1000_workqueue,
+ &adapter->watchdog_task, HZ);
}
/* Reset on uncorrectable ECC error */
@@ -1861,8 +1861,8 @@ static irqreturn_t e1000_intr(int __always_unused irq, void *data)
}
/* guard against interrupt when we're going down */
if (!test_bit(__E1000_DOWN, &adapter->state))
- queue_delayed_work(adapter->e1000_workqueue,
- &adapter->watchdog_task, 1);
+ mod_delayed_work(adapter->e1000_workqueue,
+ &adapter->watchdog_task, HZ);
}
/* Reset on uncorrectable ECC error */
@@ -1907,8 +1907,8 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
hw->mac.get_link_status = true;
/* guard against interrupt when we're going down */
if (!test_bit(__E1000_DOWN, &adapter->state))
- queue_delayed_work(adapter->e1000_workqueue,
- &adapter->watchdog_task, 1);
+ mod_delayed_work(adapter->e1000_workqueue,
+ &adapter->watchdog_task, HZ);
}
if (!test_bit(__E1000_DOWN, &adapter->state))
--
2.17.1
^ permalink raw reply related
* Re: [PATCH] ethtool: igb: dump RR2DCDELAY register
From: Artem Bityutskiy @ 2019-07-15 11:01 UTC (permalink / raw)
To: Michal Kubecek; +Cc: John W . Linville, netdev
In-Reply-To: <20190715091334.GB24551@unicorn.suse.cz>
On Mon, 2019-07-15 at 11:13 +0200, Michal Kubecek wrote:
> > + /*
> > + * Starting from kernel version 5.3 the registers dump buffer grew from
> > + * 739 4-byte words to 740 words, and word 740 contains the RR2DCDLAY
>
> nit: "E" missing here: ^
Fixed and sent out v2.
> > + fprintf(stdout,
> > + "0x05BF4: RR2DCDELAY (Max. DMA read delay) 0x%08X\n",
> > + regs_buff[739]);
>
> Showing a delay as hex number doesn't seem very user friendly but that
> also applies to many existing registers so it's probably better to be
> consistent and perhaps do an overall cleanup later.
Agree.
Thanks!
^ permalink raw reply
* [PATCH v2] ethtool: igb: dump RR2DCDELAY register
From: Artem Bityutskiy @ 2019-07-15 10:59 UTC (permalink / raw)
To: John W . Linville; +Cc: netdev
In-Reply-To: <20190715065228.88377-1-artem.bityutskiy@linux.intel.com>
From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Decode 'RR2DCDELAY' register which Linux kernel provides starting from version
5.3. The corresponding commit in the Linux kernel is:
cd502a7f7c9c igb: add RR2DCDELAY to ethtool registers dump
The RR2DCDELAY register is present in I210 and I211 Intel Gigabit Ethernet
chips and it stands for "Read Request To Data Completion Delay". Here is how
this register is described in the I210 datasheet:
"This field captures the maximum PCIe split time in 16 ns units, which is the
maximum delay between the read request to the first data completion. This is
giving an estimation of the PCIe round trip time."
In practice, this register can be used to measure the time it takes the NIC to
read data from the host memory.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
---
igb.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
Changelog:
v2: Fixed a typo in the commentary.
v1: Initial pach version.
diff --git a/igb.c b/igb.c
index e0ccef9..cb24877 100644
--- a/igb.c
+++ b/igb.c
@@ -859,6 +859,18 @@ igb_dump_regs(struct ethtool_drvinfo *info, struct ethtool_regs *regs)
"0x03430: TDFPC (Tx data FIFO packet count) 0x%08X\n",
regs_buff[550]);
+ /*
+ * Starting from kernel version 5.3 the registers dump buffer grew from
+ * 739 4-byte words to 740 words, and word 740 contains the RR2DCDELAY
+ * register.
+ */
+ if (regs->len < 740)
+ return 0;
+
+ fprintf(stdout,
+ "0x05BF4: RR2DCDELAY (Max. DMA read delay) 0x%08X\n",
+ regs_buff[739]);
+
return 0;
}
--
2.20.1
^ permalink raw reply related
* Re: [RFC] virtio-net: share receive_*() and add_recvbuf_*() with virtio-vsock
From: Stefano Garzarella @ 2019-07-15 10:42 UTC (permalink / raw)
To: Jason Wang; +Cc: Michael S. Tsirkin, Stefan Hajnoczi, virtualization, netdev
In-Reply-To: <880c1ad2-7e02-3d5d-82d7-49076cc8d02b@redhat.com>
On Mon, Jul 15, 2019 at 05:16:09PM +0800, Jason Wang wrote:
>
> > > > > > > > struct sk_buff *virtskb_receive_small(struct virtskb *vs, ...);
> > > > > > > > struct sk_buff *virtskb_receive_big(struct virtskb *vs, ...);
> > > > > > > > struct sk_buff *virtskb_receive_mergeable(struct virtskb *vs, ...);
> > > > > > > >
> > > > > > > > int virtskb_add_recvbuf_small(struct virtskb*vs, ...);
> > > > > > > > int virtskb_add_recvbuf_big(struct virtskb *vs, ...);
> > > > > > > > int virtskb_add_recvbuf_mergeable(struct virtskb *vs, ...);
> > > > > > > >
> > > > > > > > For the Guest->Host path it should be easier, so maybe I can add a
> > > > > > > > "virtskb_send(struct virtskb *vs, struct sk_buff *skb)" with a part of the code
> > > > > > > > of xmit_skb().
> > > > > > > I may miss something, but I don't see any thing that prevents us from using
> > > > > > > xmit_skb() directly.
> > > > > > >
> > > > > > Yes, but my initial idea was to make it more parametric and not related to the
> > > > > > virtio_net_hdr, so the 'hdr_len' could be a parameter and the
> > > > > > 'num_buffers' should be handled by the caller.
> > > > > >
> > > > > > > > Let me know if you have in mind better names or if I should put these function
> > > > > > > > in another place.
> > > > > > > >
> > > > > > > > I would like to leave the control part completely separate, so, for example,
> > > > > > > > the two drivers will negotiate the features independently and they will call
> > > > > > > > the right virtskb_receive_*() function based on the negotiation.
> > > > > > > If it's one the issue of negotiation, we can simply change the
> > > > > > > virtnet_probe() to deal with different devices.
> > > > > > >
> > > > > > >
> > > > > > > > I already started to work on it, but before to do more steps and send an RFC
> > > > > > > > patch, I would like to hear your opinion.
> > > > > > > > Do you think that makes sense?
> > > > > > > > Do you see any issue or a better solution?
> > > > > > > I still think we need to seek a way of adding some codes on virtio-net.c
> > > > > > > directly if there's no huge different in the processing of TX/RX. That would
> > > > > > > save us a lot time.
> > > > > > After the reading of the buffers from the virtqueue I think the process
> > > > > > is slightly different, because virtio-net will interface with the network
> > > > > > stack, while virtio-vsock will interface with the vsock-core (socket).
> > > > > > So the virtio-vsock implements the following:
> > > > > > - control flow mechanism to avoid to loose packets, informing the peer
> > > > > > about the amount of memory available in the receive queue using some
> > > > > > fields in the virtio_vsock_hdr
> > > > > > - de-multiplexing parsing the virtio_vsock_hdr and choosing the right
> > > > > > socket depending on the port
> > > > > > - socket state handling
> > >
> > > I think it's just a branch, for ethernet, go for networking stack. otherwise
> > > go for vsock core?
> > >
> > Yes, that should work.
> >
> > So, I should refactor the functions that can be called also from the vsock
> > core, in order to remove "struct net_device *dev" parameter.
> > Maybe creating some wrappers for the network stack.
> >
> > Otherwise I should create a fake net_device for vsock_core.
> >
> > What do you suggest?
>
>
> I'm not quite sure I get the question. Can you just use the one that created
> by virtio_net?
Sure, sorry but I missed that it is allocated in the virtnet_probe()!
Thanks,
Stefano
^ permalink raw reply
* Re: [oss-drivers] Re: [RFC bpf-next 2/8] bpf: extend list based insn patching infra to verification layer
From: Jiong Wang @ 2019-07-15 10:02 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Jiong Wang, Andrii Nakryiko, Alexei Starovoitov, Daniel Borkmann,
Edward Cree, Naveen N. Rao, Jakub Kicinski, bpf, Networking,
oss-drivers
In-Reply-To: <CAEf4BzZX45+QJLnHdwW-Cmo1uAFd+4zzds0jJoQVWFLrUVABgA@mail.gmail.com>
Andrii Nakryiko writes:
> On Thu, Jul 11, 2019 at 5:20 AM Jiong Wang <jiong.wang@netronome.com> wrote:
>>
>>
>> Jiong Wang writes:
>>
>> > Andrii Nakryiko writes:
>> >
>> >> On Thu, Jul 4, 2019 at 2:32 PM Jiong Wang <jiong.wang@netronome.com> wrote:
>> >>>
>> >>> Verification layer also needs to handle auxiliar info as well as adjusting
>> >>> subprog start.
>> >>>
>> >>> At this layer, insns inside patch buffer could be jump, but they should
>> >>> have been resolved, meaning they shouldn't jump to insn outside of the
>> >>> patch buffer. Lineration function for this layer won't touch insns inside
>> >>> patch buffer.
>> >>>
>> >>> Adjusting subprog is finished along with adjusting jump target when the
>> >>> input will cover bpf to bpf call insn, re-register subprog start is cheap.
>> >>> But adjustment when there is insn deleteion is not considered yet.
>> >>>
>> >>> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
>> >>> ---
>> >>> kernel/bpf/verifier.c | 150 ++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>> 1 file changed, 150 insertions(+)
>> >>>
>> >>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> >>> index a2e7637..2026d64 100644
>> >>> --- a/kernel/bpf/verifier.c
>> >>> +++ b/kernel/bpf/verifier.c
>> >>> @@ -8350,6 +8350,156 @@ static void opt_hard_wire_dead_code_branches(struct bpf_verifier_env *env)
>> >>> }
>> >>> }
>> >>>
>> >>> +/* Linearize bpf list insn to array (verifier layer). */
>> >>> +static struct bpf_verifier_env *
>> >>> +verifier_linearize_list_insn(struct bpf_verifier_env *env,
>> >>> + struct bpf_list_insn *list)
>> >>
>> >> It's unclear why this returns env back? It's not allocating a new env,
>> >> so it's weird and unnecessary. Just return error code.
>> >
>> > The reason is I was thinking we have two layers in BPF, the core and the
>> > verifier.
>> >
>> > For core layer (the relevant file is core.c), when doing patching, the
>> > input is insn list and bpf_prog, the linearization should linearize the
>> > insn list into insn array, and also whatever others affect inside bpf_prog
>> > due to changing on insns, for example line info inside prog->aux. So the
>> > return value is bpf_prog for core layer linearization hook.
>> >
>> > For verifier layer, it is similar, but the context if bpf_verifier_env, the
>> > linearization hook should linearize the insn list, and also those affected
>> > inside env, for example bpf_insn_aux_data, so the return value is
>> > bpf_verifier_env, meaning returning an updated verifier context
>> > (bpf_verifier_env) after insn list linearization.
>>
>> Realized your point is no new env is allocated, so just return error
>> code. Yes, the env pointer is not changed, just internal data is
>> updated. Return bpf_verifier_env mostly is trying to make the hook more
>> clear that it returns an updated "context" where the linearization happens,
>> for verifier layer, it is bpf_verifier_env, and for core layer, it is
>> bpf_prog, so return value was designed to return these two types.
>
> Oh, I missed that core layer returns bpf_prog*. I think this is
> confusing as hell and is very contrary to what one would expect. If
> the function doesn't allocate those objects, it shouldn't return them,
> except for rare cases of some accessor functions. Me reading this,
> I'll always be suprised and will have to go skim code just to check
> whether those functions really return new bpf_prog or
> bpf_verifier_env, respectively.
bpf_prog_realloc do return new bpf_prog, so we will need to return bpf_prog
* for core layer.
>
> Please change them both to just return error code.
>
>>
>> >
>> > Make sense?
>> >
>> > Regards,
>> > Jiong
>> >
>> >>
>> >>> +{
>> >>> + u32 *idx_map, idx, orig_cnt, fini_cnt = 0;
>> >>> + struct bpf_subprog_info *new_subinfo;
>> >>> + struct bpf_insn_aux_data *new_data;
>> >>> + struct bpf_prog *prog = env->prog;
>> >>> + struct bpf_verifier_env *ret_env;
>> >>> + struct bpf_insn *insns, *insn;
>> >>> + struct bpf_list_insn *elem;
>> >>> + int ret;
>> >>> +
>> >>> + /* Calculate final size. */
>> >>> + for (elem = list; elem; elem = elem->next)
>> >>> + if (!(elem->flag & LIST_INSN_FLAG_REMOVED))
>> >>> + fini_cnt++;
>> >>> +
>> >>> + orig_cnt = prog->len;
>> >>> + insns = prog->insnsi;
>> >>> + /* If prog length remains same, nothing else to do. */
>> >>> + if (fini_cnt == orig_cnt) {
>> >>> + for (insn = insns, elem = list; elem; elem = elem->next, insn++)
>> >>> + *insn = elem->insn;
>> >>> + return env;
>> >>> + }
>> >>> + /* Realloc insn buffer when necessary. */
>> >>> + if (fini_cnt > orig_cnt)
>> >>> + prog = bpf_prog_realloc(prog, bpf_prog_size(fini_cnt),
>> >>> + GFP_USER);
>> >>> + if (!prog)
>> >>> + return ERR_PTR(-ENOMEM);
>> >>> + insns = prog->insnsi;
>> >>> + prog->len = fini_cnt;
>> >>> + ret_env = env;
>> >>> +
>> >>> + /* idx_map[OLD_IDX] = NEW_IDX */
>> >>> + idx_map = kvmalloc(orig_cnt * sizeof(u32), GFP_KERNEL);
>> >>> + if (!idx_map)
>> >>> + return ERR_PTR(-ENOMEM);
>> >>> + memset(idx_map, 0xff, orig_cnt * sizeof(u32));
>> >>> +
>> >>> + /* Use the same alloc method used when allocating env->insn_aux_data. */
>> >>> + new_data = vzalloc(array_size(sizeof(*new_data), fini_cnt));
>> >>> + if (!new_data) {
>> >>> + kvfree(idx_map);
>> >>> + return ERR_PTR(-ENOMEM);
>> >>> + }
>> >>> +
>> >>> + /* Copy over insn + calculate idx_map. */
>> >>> + for (idx = 0, elem = list; elem; elem = elem->next) {
>> >>> + int orig_idx = elem->orig_idx - 1;
>> >>> +
>> >>> + if (orig_idx >= 0) {
>> >>> + idx_map[orig_idx] = idx;
>> >>> +
>> >>> + if (elem->flag & LIST_INSN_FLAG_REMOVED)
>> >>> + continue;
>> >>> +
>> >>> + new_data[idx] = env->insn_aux_data[orig_idx];
>> >>> +
>> >>> + if (elem->flag & LIST_INSN_FLAG_PATCHED)
>> >>> + new_data[idx].zext_dst =
>> >>> + insn_has_def32(env, &elem->insn);
>> >>> + } else {
>> >>> + new_data[idx].seen = true;
>> >>> + new_data[idx].zext_dst = insn_has_def32(env,
>> >>> + &elem->insn);
>> >>> + }
>> >>> + insns[idx++] = elem->insn;
>> >>> + }
>> >>> +
>> >>> + new_subinfo = kvzalloc(sizeof(env->subprog_info), GFP_KERNEL);
>> >>> + if (!new_subinfo) {
>> >>> + kvfree(idx_map);
>> >>> + vfree(new_data);
>> >>> + return ERR_PTR(-ENOMEM);
>> >>> + }
>> >>> + memcpy(new_subinfo, env->subprog_info, sizeof(env->subprog_info));
>> >>> + memset(env->subprog_info, 0, sizeof(env->subprog_info));
>> >>> + env->subprog_cnt = 0;
>> >>> + env->prog = prog;
>> >>> + ret = add_subprog(env, 0);
>> >>> + if (ret < 0) {
>> >>> + ret_env = ERR_PTR(ret);
>> >>> + goto free_all_ret;
>> >>> + }
>> >>> + /* Relocate jumps using idx_map.
>> >>> + * old_dst = jmp_insn.old_target + old_pc + 1;
>> >>> + * new_dst = idx_map[old_dst] = jmp_insn.new_target + new_pc + 1;
>> >>> + * jmp_insn.new_target = new_dst - new_pc - 1;
>> >>> + */
>> >>> + for (idx = 0, elem = list; elem; elem = elem->next) {
>> >>> + int orig_idx = elem->orig_idx;
>> >>> +
>> >>> + if (elem->flag & LIST_INSN_FLAG_REMOVED)
>> >>> + continue;
>> >>> + if ((elem->flag & LIST_INSN_FLAG_PATCHED) || !orig_idx) {
>> >>> + idx++;
>> >>> + continue;
>> >>> + }
>> >>> +
>> >>> + ret = bpf_jit_adj_imm_off(&insns[idx], orig_idx - 1, idx,
>> >>> + idx_map);
>> >>> + if (ret < 0) {
>> >>> + ret_env = ERR_PTR(ret);
>> >>> + goto free_all_ret;
>> >>> + }
>> >>> + /* Recalculate subprog start as we are at bpf2bpf call insn. */
>> >>> + if (ret > 0) {
>> >>> + ret = add_subprog(env, idx + insns[idx].imm + 1);
>> >>> + if (ret < 0) {
>> >>> + ret_env = ERR_PTR(ret);
>> >>> + goto free_all_ret;
>> >>> + }
>> >>> + }
>> >>> + idx++;
>> >>> + }
>> >>> + if (ret < 0) {
>> >>> + ret_env = ERR_PTR(ret);
>> >>> + goto free_all_ret;
>> >>> + }
>> >>> +
>> >>> + env->subprog_info[env->subprog_cnt].start = fini_cnt;
>> >>> + for (idx = 0; idx <= env->subprog_cnt; idx++)
>> >>> + new_subinfo[idx].start = env->subprog_info[idx].start;
>> >>> + memcpy(env->subprog_info, new_subinfo, sizeof(env->subprog_info));
>> >>> +
>> >>> + /* Adjust linfo.
>> >>> + * FIXME: no support for insn removal at the moment.
>> >>> + */
>> >>> + if (prog->aux->nr_linfo) {
>> >>> + struct bpf_line_info *linfo = prog->aux->linfo;
>> >>> + u32 nr_linfo = prog->aux->nr_linfo;
>> >>> +
>> >>> + for (idx = 0; idx < nr_linfo; idx++)
>> >>> + linfo[idx].insn_off = idx_map[linfo[idx].insn_off];
>> >>> + }
>> >>> + vfree(env->insn_aux_data);
>> >>> + env->insn_aux_data = new_data;
>> >>> + goto free_mem_list_ret;
>> >>> +free_all_ret:
>> >>> + vfree(new_data);
>> >>> +free_mem_list_ret:
>> >>> + kvfree(new_subinfo);
>> >>> + kvfree(idx_map);
>> >>> + return ret_env;
>> >>> +}
>> >>> +
>> >>> static int opt_remove_dead_code(struct bpf_verifier_env *env)
>> >>> {
>> >>> struct bpf_insn_aux_data *aux_data = env->insn_aux_data;
>> >>> --
>> >>> 2.7.4
>> >>>
>>
^ permalink raw reply
* [PATCH ipsec v2 1/4] xfrm interface: avoid corruption on changelink
From: Nicolas Dichtel @ 2019-07-15 10:00 UTC (permalink / raw)
To: steffen.klassert, davem; +Cc: netdev, Nicolas Dichtel, Julien Floret
In-Reply-To: <20190715100023.8475-1-nicolas.dichtel@6wind.com>
The new parameters must not be stored in the netdev_priv() before
validation, it may corrupt the interface. Note also that if data is NULL,
only a memset() is done.
$ ip link add xfrm1 type xfrm dev lo if_id 1
$ ip link add xfrm2 type xfrm dev lo if_id 2
$ ip link set xfrm1 type xfrm dev lo if_id 2
RTNETLINK answers: File exists
$ ip -d link list dev xfrm1
5: xfrm1@lo: <NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/none 00:00:00:00:00:00 brd 00:00:00:00:00:00 promiscuity 0 minmtu 68 maxmtu 1500
xfrm if_id 0x2 addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
=> "if_id 0x2"
Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Julien Floret <julien.floret@6wind.com>
---
net/xfrm/xfrm_interface.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 74868f9d81fb..2310dc9e35c2 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -671,12 +671,12 @@ static int xfrmi_changelink(struct net_device *dev, struct nlattr *tb[],
struct nlattr *data[],
struct netlink_ext_ack *extack)
{
- struct xfrm_if *xi = netdev_priv(dev);
struct net *net = dev_net(dev);
+ struct xfrm_if_parms p;
+ struct xfrm_if *xi;
- xfrmi_netlink_parms(data, &xi->p);
-
- xi = xfrmi_locate(net, &xi->p);
+ xfrmi_netlink_parms(data, &p);
+ xi = xfrmi_locate(net, &p);
if (!xi) {
xi = netdev_priv(dev);
} else {
@@ -684,7 +684,7 @@ static int xfrmi_changelink(struct net_device *dev, struct nlattr *tb[],
return -EEXIST;
}
- return xfrmi_update(xi, &xi->p);
+ return xfrmi_update(xi, &p);
}
static size_t xfrmi_get_size(const struct net_device *dev)
--
2.21.0
^ permalink raw reply related
* [PATCH ipsec v2 4/4] xfrm interface: fix management of phydev
From: Nicolas Dichtel @ 2019-07-15 10:00 UTC (permalink / raw)
To: steffen.klassert, davem; +Cc: netdev, Nicolas Dichtel, Julien Floret
In-Reply-To: <20190715100023.8475-1-nicolas.dichtel@6wind.com>
With the current implementation, phydev cannot be removed:
$ ip link add dummy type dummy
$ ip link add xfrm1 type xfrm dev dummy if_id 1
$ ip l d dummy
kernel:[77938.465445] unregister_netdevice: waiting for dummy to become free. Usage count = 1
Manage it like in ip tunnels, ie just keep the ifindex. Not that the side
effect, is that the phydev is now optional.
Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Julien Floret <julien.floret@6wind.com>
---
include/net/xfrm.h | 1 -
net/xfrm/xfrm_interface.c | 32 +++++++++++++++++---------------
2 files changed, 17 insertions(+), 16 deletions(-)
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index ad761ef84797..aa08a7a5f6ac 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -990,7 +990,6 @@ struct xfrm_if_parms {
struct xfrm_if {
struct xfrm_if __rcu *next; /* next interface in list */
struct net_device *dev; /* virtual device associated with interface */
- struct net_device *phydev; /* physical device */
struct net *net; /* netns for packet i/o */
struct xfrm_if_parms p; /* interface parms */
diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 53e5e47b2c55..2ab4859df55a 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -175,7 +175,6 @@ static void xfrmi_dev_uninit(struct net_device *dev)
struct xfrmi_net *xfrmn = net_generic(xi->net, xfrmi_net_id);
xfrmi_unlink(xfrmn, xi);
- dev_put(xi->phydev);
dev_put(dev);
}
@@ -362,7 +361,7 @@ static netdev_tx_t xfrmi_xmit(struct sk_buff *skb, struct net_device *dev)
goto tx_err;
}
- fl.flowi_oif = xi->phydev->ifindex;
+ fl.flowi_oif = xi->p.link;
ret = xfrmi_xmit2(skb, dev, &fl);
if (ret < 0)
@@ -548,7 +547,7 @@ static int xfrmi_get_iflink(const struct net_device *dev)
{
struct xfrm_if *xi = netdev_priv(dev);
- return xi->phydev->ifindex;
+ return xi->p.link;
}
@@ -574,12 +573,14 @@ static void xfrmi_dev_setup(struct net_device *dev)
dev->needs_free_netdev = true;
dev->priv_destructor = xfrmi_dev_free;
netif_keep_dst(dev);
+
+ eth_broadcast_addr(dev->broadcast);
}
static int xfrmi_dev_init(struct net_device *dev)
{
struct xfrm_if *xi = netdev_priv(dev);
- struct net_device *phydev = xi->phydev;
+ struct net_device *phydev = __dev_get_by_index(xi->net, xi->p.link);
int err;
dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
@@ -594,13 +595,19 @@ static int xfrmi_dev_init(struct net_device *dev)
dev->features |= NETIF_F_LLTX;
- dev->needed_headroom = phydev->needed_headroom;
- dev->needed_tailroom = phydev->needed_tailroom;
+ if (phydev) {
+ dev->needed_headroom = phydev->needed_headroom;
+ dev->needed_tailroom = phydev->needed_tailroom;
- if (is_zero_ether_addr(dev->dev_addr))
- eth_hw_addr_inherit(dev, phydev);
- if (is_zero_ether_addr(dev->broadcast))
- memcpy(dev->broadcast, phydev->broadcast, dev->addr_len);
+ if (is_zero_ether_addr(dev->dev_addr))
+ eth_hw_addr_inherit(dev, phydev);
+ if (is_zero_ether_addr(dev->broadcast))
+ memcpy(dev->broadcast, phydev->broadcast,
+ dev->addr_len);
+ } else {
+ eth_hw_addr_random(dev);
+ eth_broadcast_addr(dev->broadcast);
+ }
return 0;
}
@@ -644,13 +651,8 @@ static int xfrmi_newlink(struct net *src_net, struct net_device *dev,
xi->p = p;
xi->net = net;
xi->dev = dev;
- xi->phydev = dev_get_by_index(net, p.link);
- if (!xi->phydev)
- return -ENODEV;
err = xfrmi_create(dev);
- if (err < 0)
- dev_put(xi->phydev);
return err;
}
--
2.21.0
^ permalink raw reply related
* [PATCH ipsec v2 2/4] xfrm interface: ifname may be wrong in logs
From: Nicolas Dichtel @ 2019-07-15 10:00 UTC (permalink / raw)
To: steffen.klassert, davem; +Cc: netdev, Nicolas Dichtel
In-Reply-To: <20190715100023.8475-1-nicolas.dichtel@6wind.com>
The ifname is copied when the interface is created, but is never updated
later. In fact, this property is used only in one error message, where the
netdevice pointer is available, thus let's use it.
Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
include/net/xfrm.h | 1 -
net/xfrm/xfrm_interface.c | 10 +---------
2 files changed, 1 insertion(+), 10 deletions(-)
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index b22db30c3d88..ad761ef84797 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -983,7 +983,6 @@ static inline void xfrm_dst_destroy(struct xfrm_dst *xdst)
void xfrm_dst_ifdown(struct dst_entry *dst, struct net_device *dev);
struct xfrm_if_parms {
- char name[IFNAMSIZ]; /* name of XFRM device */
int link; /* ifindex of underlying L2 interface */
u32 if_id; /* interface identifyer */
};
diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 2310dc9e35c2..68336ee00d72 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -145,8 +145,6 @@ static int xfrmi_create(struct net_device *dev)
if (err < 0)
goto out;
- strcpy(xi->p.name, dev->name);
-
dev_hold(dev);
xfrmi_link(xfrmn, xi);
@@ -294,7 +292,7 @@ xfrmi_xmit2(struct sk_buff *skb, struct net_device *dev, struct flowi *fl)
if (tdev == dev) {
stats->collisions++;
net_warn_ratelimited("%s: Local routing loop detected!\n",
- xi->p.name);
+ dev->name);
goto tx_err_dst_release;
}
@@ -638,12 +636,6 @@ static int xfrmi_newlink(struct net *src_net, struct net_device *dev,
int err;
xfrmi_netlink_parms(data, &p);
-
- if (!tb[IFLA_IFNAME])
- return -EINVAL;
-
- nla_strlcpy(p.name, tb[IFLA_IFNAME], IFNAMSIZ);
-
xi = xfrmi_locate(net, &p);
if (xi)
return -EEXIST;
--
2.21.0
^ permalink raw reply related
* [PATCH ipsec v2 3/4] xfrm interface: fix list corruption for x-netns
From: Nicolas Dichtel @ 2019-07-15 10:00 UTC (permalink / raw)
To: steffen.klassert, davem; +Cc: netdev, Nicolas Dichtel, Julien Floret
In-Reply-To: <20190715100023.8475-1-nicolas.dichtel@6wind.com>
dev_net(dev) is the netns of the device and xi->net is the link netns,
where the device has been linked.
changelink() must operate in the link netns to avoid a corruption of
the xfrm lists.
Note that xi->net and dev_net(xi->physdev) are always the same.
Before the patch, the xfrmi lists may be corrupted and can later trigger a
kernel panic.
Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces")
Reported-by: Julien Floret <julien.floret@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Julien Floret <julien.floret@6wind.com>
---
net/xfrm/xfrm_interface.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 68336ee00d72..53e5e47b2c55 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -503,7 +503,7 @@ static int xfrmi_change(struct xfrm_if *xi, const struct xfrm_if_parms *p)
static int xfrmi_update(struct xfrm_if *xi, struct xfrm_if_parms *p)
{
- struct net *net = dev_net(xi->dev);
+ struct net *net = xi->net;
struct xfrmi_net *xfrmn = net_generic(net, xfrmi_net_id);
int err;
@@ -663,9 +663,9 @@ static int xfrmi_changelink(struct net_device *dev, struct nlattr *tb[],
struct nlattr *data[],
struct netlink_ext_ack *extack)
{
- struct net *net = dev_net(dev);
+ struct xfrm_if *xi = netdev_priv(dev);
+ struct net *net = xi->net;
struct xfrm_if_parms p;
- struct xfrm_if *xi;
xfrmi_netlink_parms(data, &p);
xi = xfrmi_locate(net, &p);
@@ -707,7 +707,7 @@ static struct net *xfrmi_get_link_net(const struct net_device *dev)
{
struct xfrm_if *xi = netdev_priv(dev);
- return dev_net(xi->phydev);
+ return xi->net;
}
static const struct nla_policy xfrmi_policy[IFLA_XFRM_MAX + 1] = {
--
2.21.0
^ permalink raw reply related
* [PATCH ipsec v2 0/4] xfrm interface: bugs fixes
From: Nicolas Dichtel @ 2019-07-15 10:00 UTC (permalink / raw)
To: steffen.klassert, davem; +Cc: netdev
In-Reply-To: <df990564-819a-314f-dda6-aab58a2e7b6e@6wind.com>
Here is a bunch of bugs fixes. Some have been seen by code review and some when
playing with x-netns.
The details are in each patch.
v1 -> v2:
- add patch #3 and #4
include/net/xfrm.h | 2 --
net/xfrm/xfrm_interface.c | 56 +++++++++++++++++++++--------------------------
2 files changed, 25 insertions(+), 33 deletions(-)
Regards,
Nicolas
^ permalink raw reply
* Re: [RFC bpf-next 1/8] bpf: introducing list based insn patching infra to core layer
From: Jiong Wang @ 2019-07-15 9:58 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Jiong Wang, Alexei Starovoitov, Daniel Borkmann, Edward Cree,
Naveen N. Rao, Andrii Nakryiko, Jakub Kicinski, bpf, Networking,
oss-drivers
In-Reply-To: <CAEf4BzbR_ieGmaOTjCrN6jQRo=QoEJNz1zVeFizZbzGBGaF=Cg@mail.gmail.com>
Andrii Nakryiko writes:
> On Thu, Jul 11, 2019 at 4:53 AM Jiong Wang <jiong.wang@netronome.com> wrote:
>>
>>
>> Andrii Nakryiko writes:
>>
>> > On Thu, Jul 4, 2019 at 2:32 PM Jiong Wang <jiong.wang@netronome.com> wrote:
>> >>
>> >> This patch introduces list based bpf insn patching infra to bpf core layer
>> >> which is lower than verification layer.
>> >>
>> >> This layer has bpf insn sequence as the solo input, therefore the tasks
>> >> to be finished during list linerization is:
>> >> - copy insn
>> >> - relocate jumps
>> >> - relocation line info.
>> >>
>> >> Suggested-by: Alexei Starovoitov <ast@kernel.org>
>> >> Suggested-by: Edward Cree <ecree@solarflare.com>
>> >> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
>> >> ---
>> >> include/linux/filter.h | 25 +++++
>> >> kernel/bpf/core.c | 268 +++++++++++++++++++++++++++++++++++++++++++++++++
>> >> 2 files changed, 293 insertions(+)
>> >>
>> >> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> >> index 1fe53e7..1fea68c 100644
>> >> --- a/include/linux/filter.h
>> >> +++ b/include/linux/filter.h
>> >> @@ -842,6 +842,31 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
>> >> const struct bpf_insn *patch, u32 len);
>> >> int bpf_remove_insns(struct bpf_prog *prog, u32 off, u32 cnt);
>> >>
>> >> +int bpf_jit_adj_imm_off(struct bpf_insn *insn, int old_idx, int new_idx,
>> >> + int idx_map[]);
>> >> +
>> >> +#define LIST_INSN_FLAG_PATCHED 0x1
>> >> +#define LIST_INSN_FLAG_REMOVED 0x2
>> >> +struct bpf_list_insn {
>> >> + struct bpf_insn insn;
>> >> + struct bpf_list_insn *next;
>> >> + s32 orig_idx;
>> >> + u32 flag;
>> >> +};
>> >> +
>> >> +struct bpf_list_insn *bpf_create_list_insn(struct bpf_prog *prog);
>> >> +void bpf_destroy_list_insn(struct bpf_list_insn *list);
>> >> +/* Replace LIST_INSN with new list insns generated from PATCH. */
>> >> +struct bpf_list_insn *bpf_patch_list_insn(struct bpf_list_insn *list_insn,
>> >> + const struct bpf_insn *patch,
>> >> + u32 len);
>> >> +/* Pre-patch list_insn with insns inside PATCH, meaning LIST_INSN is not
>> >> + * touched. New list insns are inserted before it.
>> >> + */
>> >> +struct bpf_list_insn *bpf_prepatch_list_insn(struct bpf_list_insn *list_insn,
>> >> + const struct bpf_insn *patch,
>> >> + u32 len);
>> >> +
>> >> void bpf_clear_redirect_map(struct bpf_map *map);
>> >>
>> >> static inline bool xdp_return_frame_no_direct(void)
>> >> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> >> index e2c1b43..e60703e 100644
>> >> --- a/kernel/bpf/core.c
>> >> +++ b/kernel/bpf/core.c
>> >> @@ -502,6 +502,274 @@ int bpf_remove_insns(struct bpf_prog *prog, u32 off, u32 cnt)
>> >> return WARN_ON_ONCE(bpf_adj_branches(prog, off, off + cnt, off, false));
>> >> }
>> >>
>> >> +int bpf_jit_adj_imm_off(struct bpf_insn *insn, int old_idx, int new_idx,
>> >> + s32 idx_map[])
>> >> +{
>> >> + u8 code = insn->code;
>> >> + s64 imm;
>> >> + s32 off;
>> >> +
>> >> + if (BPF_CLASS(code) != BPF_JMP && BPF_CLASS(code) != BPF_JMP32)
>> >> + return 0;
>> >> +
>> >> + if (BPF_CLASS(code) == BPF_JMP &&
>> >> + (BPF_OP(code) == BPF_EXIT ||
>> >> + (BPF_OP(code) == BPF_CALL && insn->src_reg != BPF_PSEUDO_CALL)))
>> >> + return 0;
>> >> +
>> >> + /* BPF to BPF call. */
>> >> + if (BPF_OP(code) == BPF_CALL) {
>> >> + imm = idx_map[old_idx + insn->imm + 1] - new_idx - 1;
>> >> + if (imm < S32_MIN || imm > S32_MAX)
>> >> + return -ERANGE;
>> >> + insn->imm = imm;
>> >> + return 1;
>> >> + }
>> >> +
>> >> + /* Jump. */
>> >> + off = idx_map[old_idx + insn->off + 1] - new_idx - 1;
>> >> + if (off < S16_MIN || off > S16_MAX)
>> >> + return -ERANGE;
>> >> + insn->off = off;
>> >> + return 0;
>> >> +}
>> >> +
>> >> +void bpf_destroy_list_insn(struct bpf_list_insn *list)
>> >> +{
>> >> + struct bpf_list_insn *elem, *next;
>> >> +
>> >> + for (elem = list; elem; elem = next) {
>> >> + next = elem->next;
>> >> + kvfree(elem);
>> >> + }
>> >> +}
>> >> +
>> >> +struct bpf_list_insn *bpf_create_list_insn(struct bpf_prog *prog)
>> >> +{
>> >> + unsigned int idx, len = prog->len;
>> >> + struct bpf_list_insn *hdr, *prev;
>> >> + struct bpf_insn *insns;
>> >> +
>> >> + hdr = kvzalloc(sizeof(*hdr), GFP_KERNEL);
>> >> + if (!hdr)
>> >> + return ERR_PTR(-ENOMEM);
>> >> +
>> >> + insns = prog->insnsi;
>> >> + hdr->insn = insns[0];
>> >> + hdr->orig_idx = 1;
>> >> + prev = hdr;
>> >
>> > I'm not sure why you need this "prologue" instead of handling first
>> > instruction uniformly in for loop below?
>>
>> It is because the head of the list doesn't have precessor, so no need of
>> the prev->next assignment, not could do a check inside the loop to rule the
>> head out when doing it.
>
> yeah, prev = NULL initially. Then
>
> if (prev) prev->next = node;
>
> Or see my suggestiong about having patchabel_insns_list wrapper struct
> (in cover letter thread).
>
>>
>> >> +
>> >> + for (idx = 1; idx < len; idx++) {
>> >> + struct bpf_list_insn *node = kvzalloc(sizeof(*node),
>> >> + GFP_KERNEL);
>> >> +
>> >> + if (!node) {
>> >> + /* Destroy what has been allocated. */
>> >> + bpf_destroy_list_insn(hdr);
>> >> + return ERR_PTR(-ENOMEM);
>> >> + }
>> >> + node->insn = insns[idx];
>> >> + node->orig_idx = idx + 1;
>> >
>> > Why orig_idx is 1-based? It's really confusing.
>>
>> orig_idx == 0 means one insn is without original insn, means it is an new
>> insn generated for patching purpose.
>>
>> While the LIST_INSN_FLAG_PATCHED in the RFC means one insn in original prog
>> is patched.
>>
>> I had been trying to differenciate above two cases, but yes, they are
>> confusing and differenciating them might be useless, if an insn in original
>> prog is patched, all its info could be treated as clobbered and needing
>> re-calculating or should do conservative assumption.
>
> Instruction will be new and not patched only in patch_buffer. Once you
> add them to the list, they are patched, no? Not sure what's the
> distinction you are trying to maintain here.
Never mind, the reason I was trying to differenciating them is because I
had some strange preference on the insn patched.
insn 1 insn 1
insn 2 >> insn 2.1
insn 3 insn 2.2
insn 2.3
insn 3
I kind of thinking the it is better to maintain the original info of one
patched insn, that is to say insn 2 above is patched and expanded into insn
2.1/2.2/2.3, then I slightly felt better to copy the aux info of insn to
insn 2.1 and only rebuilt those we are sure needs to be updated, for
example zext because the insn is changed.
>> >
>> >> + prev->next = node;
>> >> + prev = node;
>> >> + }
>> >> +
>> >> + return hdr;
>> >> +}
>> >> +
>
> [...]
>
>> >> +
>> >> + len--;
>> >> + patch++;
>> >> +
>> >> + prev = list_insn;
>> >> + next = list_insn->next;
>> >> + for (idx = 0; idx < len; idx++) {
>> >> + struct bpf_list_insn *node = kvzalloc(sizeof(*node),
>> >> + GFP_KERNEL);
>> >> +
>> >> + if (!node) {
>> >> + /* Link what's allocated, so list destroyer could
>> >> + * free them.
>> >> + */
>> >> + prev->next = next;
>> >
>> > Why this special handling, if you can just insert element so that list
>> > is well-formed after each instruction?
>>
>> Good idea, just always do "node->next = next", the "prev->next = node" in
>> next round will fix it.
>>
>> >
>> >> + return ERR_PTR(-ENOMEM);
>> >> + }
>> >> +
>> >> + node->insn = patch[idx];
>> >> + prev->next = node;
>> >> + prev = node;
>> >
>> > E.g.,
>> >
>> > node->next = next;
>> > prev->next = node;
>> > prev = node;
>> >
>> >> + }
>> >> +
>> >> + prev->next = next;
>> >
>> > And no need for this either.
>> >
>> >> + return prev;
>> >> +}
>> >> +
>> >> +struct bpf_list_insn *bpf_prepatch_list_insn(struct bpf_list_insn *list_insn,
>> >> + const struct bpf_insn *patch,
>> >> + u32 len)
>> >
>> > prepatch and patch functions should share the same logic.
>> >
>> > Prepend is just that - insert all instructions from buffer before current insns.
>> > Patch -> replace current one with first instriction in a buffer, then
>> > prepend remaining ones before the next instruction (so patch should
>> > call info prepend, with adjusted count and array pointer).
>>
>> Ack, there indeed has quite a few things to simplify.
>>
>> >
>> >> +{
>> >> + struct bpf_list_insn *prev, *node, *begin_node;
>> >> + u32 idx;
>> >> +
>> >> + if (!len)
>> >> + return list_insn;
>> >> +
>> >> + node = kvzalloc(sizeof(*node), GFP_KERNEL);
>> >> + if (!node)
>> >> + return ERR_PTR(-ENOMEM);
>> >> + node->insn = patch[0];
>> >> + begin_node = node;
>> >> + prev = node;
>> >> +
>> >> + for (idx = 1; idx < len; idx++) {
>> >> + node = kvzalloc(sizeof(*node), GFP_KERNEL);
>> >> + if (!node) {
>> >> + node = begin_node;
>> >> + /* Release what's has been allocated. */
>> >> + while (node) {
>> >> + struct bpf_list_insn *next = node->next;
>> >> +
>> >> + kvfree(node);
>> >> + node = next;
>> >> + }
>> >> + return ERR_PTR(-ENOMEM);
>> >> + }
>> >> + node->insn = patch[idx];
>> >> + prev->next = node;
>> >> + prev = node;
>> >> + }
>> >> +
>> >> + prev->next = list_insn;
>> >> + return begin_node;
>> >> +}
>> >> +
>> >> void bpf_prog_kallsyms_del_subprogs(struct bpf_prog *fp)
>> >> {
>> >> int i;
>> >> --
>> >> 2.7.4
>> >>
>>
^ permalink raw reply
* Re: [PATCH ipsec] xfrm interface: fix list corruption for x-netns
From: Nicolas Dichtel @ 2019-07-15 9:53 UTC (permalink / raw)
To: steffen.klassert, davem; +Cc: netdev, Julien Floret
In-Reply-To: <20190710131137.4787-1-nicolas.dichtel@6wind.com>
Le 10/07/2019 à 15:11, Nicolas Dichtel a écrit :
> dev_net(dev) is the netns of the device and xi->net is the link netns,
> where the device has been linked.
> changelink() must operate in the link netns to avoid a corruption of
> the xfrm lists.
>
> Note that xi->net and dev_net(xi->physdev) are always the same.
>
> Before the patch, the xfrmi lists may be corrupted and can later trigger a
> kernel panic.
>
> Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces")
> Reported-by: Julien Floret <julien.floret@6wind.com>
> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> Tested-by: Julien Floret <julien.floret@6wind.com>
Please, drop this one too.
Regards,
Nicolas
^ permalink raw reply
* Re: [PATCH ipsec 0/2] xfrm interface: bug fix on changelink
From: Nicolas Dichtel @ 2019-07-15 9:52 UTC (permalink / raw)
To: steffen.klassert, davem; +Cc: netdev
In-Reply-To: <20190710074536.7505-1-nicolas.dichtel@6wind.com>
Le 10/07/2019 à 09:45, Nicolas Dichtel a écrit :
>
> Here are two bug fix seen by code review. The first one avoids a corruption of
> existing xfrm interfaces and the second is a minor fix of an error message.
>
> include/net/xfrm.h | 1 -
> net/xfrm/xfrm_interface.c | 20 ++++++--------------
> 2 files changed, 6 insertions(+), 15 deletions(-)
Please, drop this series, I will resend it.
Regards,
Nicolas
^ permalink raw reply
* Re: [PATCH net-next 01/16] qlge: Remove irq_cnt
From: Greg Kroah-Hartman @ 2019-07-15 9:48 UTC (permalink / raw)
To: Benjamin Poirier; +Cc: David Miller, Manish Chopra, GR-Linux-NIC-Dev, netdev
In-Reply-To: <20190715014016.GA27883@f1>
On Mon, Jul 15, 2019 at 10:40:16AM +0900, Benjamin Poirier wrote:
> On 2019/06/17 16:48, Benjamin Poirier wrote:
> > qlge uses an irq enable/disable refcounting scheme that is:
> > * poorly implemented
> > Uses a spin_lock to protect accesses to the irq_cnt atomic variable
> > * buggy
> > Breaks when there is not a 1:1 sequence of irq - napi_poll, such as
> > when using SO_BUSY_POLL.
> > * unnecessary
> > The purpose or irq_cnt is to reduce irq control writes when
> > multiple work items result from one irq: the irq is re-enabled
> > after all work is done.
> > Analysis of the irq handler shows that there is only one case where
> > there might be two workers scheduled at once, and those have
> > separate irq masking bits.
> >
> > Therefore, remove irq_cnt.
> >
> > Additionally, we get a performance improvement:
> > perf stat -e cycles -a -r5 super_netperf 100 -H 192.168.33.1 -t TCP_RR
> >
> > Before:
> > 628560
> > 628056
> > 622103
> > 622744
> > 627202
> > [...]
> > 268,803,947,669 cycles ( +- 0.09% )
> >
> > After:
> > 636300
> > 634106
> > 634984
> > 638555
> > 634188
> > [...]
> > 259,237,291,449 cycles ( +- 0.19% )
> >
> > Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
>
> David, Greg,
>
> Before I send v2 of this patchset, how about moving this driver out to
> staging? The hardware has been declared EOL by the vendor but what's
> more relevant to the Linux kernel is that the quality of this driver is
> not on par with many other mainline drivers, IMO. Over in staging, the
> code might benefit from the attention of interested parties (drive-by
> fixers) or fade away into obscurity.
>
> Now that I check, the code has >500 basic checkpatch issues. While
> working on the rx processing code for the current patchset, I noticed
> the following more structural issues:
> * commit 7c734359d350 ("qlge: Size RX buffers based on MTU.",
> v2.6.33-rc1) introduced dead code in the receive routines, which
> should be rewritten anyways by the admission of the author himself
> (see the comment above ql_build_rx_skb())
> * truesize accounting is incorrect (ex: a 9000B frame has truesize 10280
> while containing two frags of order-1 allocations, ie. >16K)
> * in some cases the driver allocates skbs with head room but only puts
> data in the frags
> * there is an inordinate amount of disparate debugging code, most of
> which is of questionable value. In particular, qlge_dbg.c has hundreds
> of lines of code bitrotting away in ifdef land (doesn't compile since
> commit 18c49b91777c ("qlge: do vlan cleanup", v3.1-rc1), 8 years ago).
> * triggering an ethtool regdump will instead hexdump a 176k struct to
> dmesg depending on some module parameters
> * many structures are defined full of holes, as we found in this
> thread already; are used inefficiently (struct rx_ring is used for rx
> and tx completions, with some members relevant to one case only); are
> initialized redundantly (ex. memset 0 after alloc_etherdev). The
> driver also has a habit of using runtime checks where compile time
> checks are possible.
> * in terms of namespace, the driver uses either qlge_, ql_ (used by
> other qlogic drivers, with clashes, ex: ql_sem_spinlock) or nothing (with
> clashes, ex: struct ob_mac_iocb_req)
>
> I'm guessing other parts of the driver have more issues of the same
> nature. The driver also has many smaller issues like deprecated apis,
> duplicate or useless comments, weird code structure (some "while" loops
> that could be rewritten with simple "for", ex. ql_wait_reg_rdy()) and
> weird line wrapping (all over, ex. the ql_set_routing_reg() calls in
> qlge_set_multicast_list()).
>
> I'm aware of some precedents for code moving from mainline to staging:
> 1bb8155080c6 ncpfs: move net/ncpfs to drivers/staging/ncpfs (v4.16-rc1)
> 6c391ff758eb irda: move drivers/net/irda to drivers/staging/irda/drivers
> (v4.14-rc1)
>
> What do you think is the best course of action in this case?
Feel free to move it to staging if you want it removed from the tree in
a few releases if no one is willing to do the work to keep it alive. If
someone comes along later, it's a trivial revert to add it back.
So send a patch moving it, with all of the information you listed above
in a TODO file for the directory, and I'll be glad to take it.
thanks,
greg k-h
^ permalink raw reply
* Re: [PATCH] ath10k: work around uninitialized vht_pfr variable
From: Kalle Valo @ 2019-07-15 9:44 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Miaoqing Pan, David S. Miller, Rakesh Pillai, Brian Norris,
Balaji Pothunoori, Wen Gong, Pradeep kumar Chitrapu, Sriram R,
ath10k, linux-wireless, netdev, linux-kernel, clang-built-linux
In-Reply-To: <20190708125050.3689133-1-arnd@arndb.de>
Arnd Bergmann <arnd@arndb.de> writes:
> As clang points out, the vht_pfr is assigned to a struct member
> without being initialized in one case:
>
> drivers/net/wireless/ath/ath10k/mac.c:7528:7: error: variable 'vht_pfr' is used uninitialized whenever 'if' condition
> is false [-Werror,-Wsometimes-uninitialized]
> if (!ath10k_mac_can_set_bitrate_mask(ar, band, mask,
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> drivers/net/wireless/ath/ath10k/mac.c:7551:20: note: uninitialized use occurs here
> arvif->vht_pfr = vht_pfr;
> ^~~~~~~
> drivers/net/wireless/ath/ath10k/mac.c:7528:3: note: remove the 'if' if its condition is always true
> if (!ath10k_mac_can_set_bitrate_mask(ar, band, mask,
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> drivers/net/wireless/ath/ath10k/mac.c:7483:12: note: initialize the variable 'vht_pfr' to silence this warning
> u8 vht_pfr;
>
> Add an explicit but probably incorrect initialization here.
> I suspect we want a better fix here, but chose this approach to
> illustrate the issue.
>
> Fixes: 8b97b055dc9d ("ath10k: fix failure to set multiple fixed rate")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
I'll queue this for v5.3.
--
Kalle Valo
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox