Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames
From: Jakub Kicinski @ 2019-08-26 22:15 UTC (permalink / raw)
  To: David Ahern
  Cc: Jiri Pirko, Roopa Prabhu, netdev, David Miller, Stephen Hemminger,
	dcbw, Michal Kubecek, Andrew Lunn, parav, Saeed Mahameed, mlxsw
In-Reply-To: <5d79fba4-f82e-97a7-7846-fd1de089a95b@gmail.com>

On Mon, 26 Aug 2019 15:46:43 -0600, David Ahern wrote:
> On 8/26/19 10:55 AM, Jakub Kicinski wrote:
> > On Mon, 26 Aug 2019 18:09:16 +0200, Jiri Pirko wrote:  
> >> DaveA, Roopa. Do you insist on doing add/remove of altnames in the
> >> existing setlist command using embedded message op attrs? I'm asking
> >> because after some time thinking about it, it still feels wrong to me :/
> >>
> >> If this would be a generic netlink api, we would just add another couple
> >> of commands. What is so different we can't add commands here?
> >> It is also much simpler code. Easy error handling, no need for
> >> rollback, no possibly inconsistent state, etc.  
> > 
> > +1 the separate op feels like a better uapi to me as well.
> > 
> > Perhaps we could redo the iproute2 command line interface to make the
> > name the primary object? Would that address your concern Dave and Roopa?
> 
> No, my point is exactly that a name is not a primary object. A name is
> an attribute of a link - something that exists for the convenience of
> userspace only. (Like the 'protocol' for routes, rules and neighbors.)
> 
> Currently, names are changed by RTM_NEWLINK/RTM_SETLINK. Aliases are
> added and deleted by RTM_NEWLINK/RTM_SETLINK. Why is an alternative name
> so special that it should have its own API?

My feeling is that it's better to introduce operations for this
sub-object than mux everything via setlink :( The use of netlink 
in more recent APIs like devlink is much more liberal when it comes 
to ops and the result is much more convenient and clean IMHO.

Weren't there multiple problems with the size of the RTM_NEWLINK
notification already? Adding multiple sizeable strings to it won't
help there either :S

Do you foresee a need for the alias to be updated atomically with other
RTM_SETLINK changes?

> If only 1 alt name was allowed, then RTM_NEWLINK/RTM_SETLINK would
> suffice. Management of it would have the same semantics as an alias -
> empty string means delete, non-empty string sets the value.
> 
> So really the push for new RTM commands is to handle an unlimited number
> of alt names with the ability to change / delete any one of them. Has
> the need for multiple alternate ifnames been fully established? (I don't
> recall other than a discussion about parallels to block devices.)

I feel like I already posted this link, but here it is again:

https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/

IMHO the fact that there are multiple naming schemes in systemd today
is proof enough.


To be clear my (probably over-engineered) position is that "user-
-understandable" interface names had became a dead end already long
time ago. We should move away from strings and try to build APIs or at
least user space where device can be selected based on attributes
directly. The names are nothing else than a random grab bag of
attributes sprintf()-ed together, anyway. 

The naming done by udev is inherently racy and over and over again we
run into races and issues because OvS or some other piece of userspace
gets confused and e.g. enslaves ports to wrong bridges.

Longer names I just a band aid. But I'm under no illusion I can convince
people to spend time working on an attribute-based scheme.. ;)

^ permalink raw reply

* Re: [PATCH net-next v5 3/6] net: dsa: mv88e6xxx: create serdes_get_lane chip operation
From: Vivien Didelot @ 2019-08-26 22:17 UTC (permalink / raw)
  To: Marek Behún
  Cc: netdev, Andrew Lunn, Florian Fainelli, Vladimir Oltean,
	Marek Behún
In-Reply-To: <20190826213155.14685-4-marek.behun@nic.cz>

Hi Marek,

On Mon, 26 Aug 2019 23:31:52 +0200, Marek Behún <marek.behun@nic.cz> wrote:
> @@ -635,10 +660,10 @@ static irqreturn_t mv88e6390_serdes_thread_fn(int irq, void *dev_id)
>  	irqreturn_t ret = IRQ_NONE;
>  	u8 cmode = port->cmode;
>  	u16 status;
> -	int lane;
>  	int err;
> +	u8 lane;
>  
> -	lane = mv88e6390x_serdes_get_lane(chip, port->port);
> +	mv88e6xxx_serdes_get_lane(chip, port->port, &lane);

I don't like when errors aren't always checked, but the code was already
like this, so this can be addressed in a follow-up patch:

Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>


Thanks,

	Vivien

^ permalink raw reply

* Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames
From: David Miller @ 2019-08-26 22:18 UTC (permalink / raw)
  To: jakub.kicinski
  Cc: dsahern, jiri, roopa, netdev, sthemmin, dcbw, mkubecek, andrew,
	parav, saeedm, mlxsw
In-Reply-To: <20190826151552.4f1a2ad9@cakuba.netronome.com>

From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Mon, 26 Aug 2019 15:15:52 -0700

> Weren't there multiple problems with the size of the RTM_NEWLINK
> notification already? Adding multiple sizeable strings to it won't
> help there either :S

Indeed.

We even had situations where we had to make the information provided
in a newlink dump opt-in when we added VF info because the buffers
glibc was using at the time were too small and this broke so much
stuff.

I honestly think that the size of link dumps are out of hand as-is.

^ permalink raw reply

* [PATCH] rtl_nic: add firmware rtl8125a-3
From: Heiner Kallweit @ 2019-08-26 22:23 UTC (permalink / raw)
  To: linux-firmware, Chun-Hao Lin, Realtek linux nic maintainers
  Cc: netdev@vger.kernel.org

This adds firmware rtl8125a-3 for Realtek's 2.5Gbps chip RTL8125.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
Firmware file was provided by Realtek and they asked me to submit it.
The related extension to r8169 driver will be submitted in the next days.
---
 WHENCE                |   3 +++
 rtl_nic/rtl8125a-3.fw | Bin 0 -> 3456 bytes
 2 files changed, 3 insertions(+)
 create mode 100644 rtl_nic/rtl8125a-3.fw

diff --git a/WHENCE b/WHENCE
index fb12924..dbec18a 100644
--- a/WHENCE
+++ b/WHENCE
@@ -2906,6 +2906,9 @@ Version: 0.0.2
 File: rtl_nic/rtl8107e-2.fw
 Version: 0.0.2
 
+File: rtl_nic/rtl8125a-3.fw
+Version: 0.0.1
+
 Licence:
  * Copyright © 2011-2013, Realtek Semiconductor Corporation
  *
diff --git a/rtl_nic/rtl8125a-3.fw b/rtl_nic/rtl8125a-3.fw
new file mode 100644
index 0000000000000000000000000000000000000000..fac635263f92e8d9734456b75932b2088edd5ef9
GIT binary patch
literal 3456
zcmb7G4@{Kj8Gr9M&hw<l39l3>p*MPE#webM6qsqQiuC(hXPi?+wDL#V(Z*4#K%FD{
zd#PH+tTJbqZG?a`)*5G*m3F2zmN`7hx;2*IKiVu;;~Hwa_E@^5+Z^ooav$qyWa|!|
z{J!V^^FHtMe%~vE5S!~a<<HMqSUGn=c_2HGJ>M6|pO=$6Z+-!F`d2|JjuT=?tX!6t
zo1eG1ysol-V@-KgU0(T?#@f6Efr9d!!2E*zz=G_mCu@CyK;goi!iD+b1yk6B2%b&6
z7edS;xkzqO0?9-2l9EW0ltM}+rIFG}86+PmljJ95fhB~6Z~~0y3JYX}?Z^&0uqy1t
zny?FN!)}}nC*Ym12kkF<@t&E4_=v=ynF9AnDz2DmaE+uRv6r!*@!^m!6NmhMOzq8r
zcySgS;n{HZ&ViVjO+Em7C<o$9E{w7~#?6OA6vF-CA|!?$MBkPmUNaZNIZ}kPTZ-Wd
z8F2PIf~lcpz}RxchgOhZNnAy~1V!<ssEIFw^W*gx{=)|N&sV@7s=~F-YS=QKK)AC8
z;l`&BHaB5(q!u4F*5UicX4Dw<xZc@_xQwl|*+!ct+H9u{FeB7V|DE*TO<fCht<>$I
zZZG}Y@U*d?z6a>rPW?gZU!wjH^_^&crVIA-hauiRf}*mc@O*L%b>cXFD^8&Io|9br
zFS+(#A<NUl=QsF#3U7Megl)!Y%#NIan9+;JM$V!#)QA3t5H6bia7TWJOXlz4ioA=<
z<^?z-1MK-A9Fa>HXt;u_<`7nle1Pws`y;xZ4b$%$RvXt*Vtj-(#xP2a8yGS^#rwu*
z7&RlXNB)8`;|q+Lf8+C)SZDkL{T(+MYZPk@p$0naYDvhUdK;W-&~&K<5x2TvCa4ES
zJSq_Os=`o`>Ti(hqM4#RkyLfbOj8MwbamOxQ0|CNT`@D2E8<rJ4O!}{IZMSyW~=Pb
z9LCG0O+ej0lA~sw%T-;^=BaOn@)@g8tu_{^65~O&#t5oXW3d`Ciq!i?u^KfEWsf|f
z%8X@d%v{dr6>6QaQuTMNV*C=d)+lAYWhy1Kp7A%(ze4qPRH@`pHTfshkXfVRB2TgY
zO=`+Wt(q39Q=W61)m5XOc8&DkO5CPgp(bTNzg>y9p~8jDs$zJj5<R=s;Oi~Q^>M4(
zG`vTNq`hitUz@`B_Nx)&fWp3Z<>))8?4g&GICDrH_jW38Z<lJiahS3rlpR$<98=@x
z6ix)=-SB%nIOXwTePI%gc_p@upI;Gdo~F;Tmw)5`y%^_39nX1Ki-J6~L1FKz7NSjv
z_`nkPz3?L$w%rohr-(vgBF1<i;qBEnDP75AC6X+Z8dD{_;JbnNV+;3L_)`miW?_eg
z-4-4vre2o#VTHs{P{PCaz|dle45F5=t7KfRt3*ceh=r$#TAy7mkxtwHW#j)EHgmp)
zO)hO_GH>cb5{`=!(@#j)h_-jxgCneiD9Hcd;ix`Q>rX~yy2c{beS3_|m>9m87;BeK
z9^<~->L`kd5sZmZuw?QWNw>vliHU)j7&EQ4-f1m1#+c~6-j5U9k9~bn**#pV?$N|-
zq$Nr8T#$I{J?c3t2VJ-F9=Aj_^{tkE`!wz?XF~Vaag5Xw_4`&lQTP1kQQmM$_|B0(
zOkxdl@0Tb}k#O<ZNq?P7^BP@o5?P$tDMUYUDdm~Ohjk2MA!9p<P0Z~eCa@+uv7NOF
zV)t~$Ac`@GiSytUlb?qfh~`cEKhXXQ`gkQS+oQgJX8gniiK%+szlqBBQMQ+LjIoYA
z7Pea0V&QHJcUss?e1U!-enM;`#@W7FhmW$!&b34|Z>oiU3%_IGY6~}8_<=_875$5K
z-p<>elW^0nO-Xb$v)?-<5?*G%4-kJO5#)Zu+T&X8&rjZA4DRX1phN*@#Ln3Z5`&x>
zM^Cgz;x5{-ci}0VyQ8FT#^zi&IL{9H?xHU!(>!!e${wT4$GqBa3H>xiG*Va3d7hBl
z$-lAV)_T3WsTa}QwwT;{)^yFKtnb^bPxFJExzl=W&oejIewS6dj^AhH>tVh&*5~e`
zja9dg7?{KsZ_$^!dgjpQL9gf03g&nvnzMv6vp=S9DYVsnn`y<1qkR|cAF7hLwo&4_
z$0b%#ug_6NWfb*$Il}q#a(!Ob6=ePMXrpB-v<#HJEP3C!v!5@<?;Dn1MU0<*f8Qxz
zAjU7*@~eyS)8C3a`2}PA^u1Eoi5NfK?_u^^&&MtM&ozI+{=2wFEidM}?R=ite~tg7
zpYGYyoP*`0xugV=o@1RyFwcMDQ>Obe;jicCb=vBh2MhU4<KB$Vk2NcQ<(`-qXixrA
z8^49!$+$sGAily`auZvb-$fkYEIElgD0dJaC)$bQXUsw`@urBL?<RiJ1G^-)6L-BT
z@#-RprR2w-mqwqPKV$B{bA8MiN1HEyEpap~@gZZ_*el!TTw^JF*(K5a0Q2#@rtsUg
zt6SovowDhaJ<ob6YnhO-UOUgSow9UNkB-M#+sN~3dy@Qhi9cEVf73Z`N^D^5uW{WK
zME%~Yvas61t;E=S%Z@SO6V|;&h-h!3cbdD!=(z6g@jH#a_vpS&+;={={DQpi2<!K6
DeEY6F

literal 0
HcmV?d00001

-- 
2.23.0


^ permalink raw reply related

* Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames
From: David Ahern @ 2019-08-26 22:24 UTC (permalink / raw)
  To: David Miller, jakub.kicinski
  Cc: jiri, roopa, netdev, sthemmin, dcbw, mkubecek, andrew, parav,
	saeedm, mlxsw
In-Reply-To: <20190826.151819.804077961408964282.davem@davemloft.net>

On 8/26/19 4:18 PM, David Miller wrote:
> I honestly think that the size of link dumps are out of hand as-is.

so you are suggesting new alternate names should not appear in kernel
generated RTM_NEWLINK messages - be it a link dump or a notification on
a change?

^ permalink raw reply

* Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames
From: David Miller @ 2019-08-26 22:25 UTC (permalink / raw)
  To: dsahern
  Cc: jakub.kicinski, jiri, roopa, netdev, sthemmin, dcbw, mkubecek,
	andrew, parav, saeedm, mlxsw
In-Reply-To: <ddd05712-e8c7-3c08-11c7-9840f5b64226@gmail.com>

From: David Ahern <dsahern@gmail.com>
Date: Mon, 26 Aug 2019 16:24:38 -0600

> On 8/26/19 4:18 PM, David Miller wrote:
>> I honestly think that the size of link dumps are out of hand as-is.
> 
> so you are suggesting new alternate names should not appear in kernel
> generated RTM_NEWLINK messages - be it a link dump or a notification on
> a change?

I counter with the question of how much crap can we keep sticking in there
before we have to do something else to provide that information?

^ permalink raw reply

* [PATCH bpf-next] selftests/bpf: remove wrong nhoff in flow dissector test
From: Stanislav Fomichev @ 2019-08-26 22:27 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, Stanislav Fomichev

.nhoff = 0 is (correctly) reset to ETH_HLEN on the next line so let's
drop it.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/prog_tests/flow_dissector.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
index 6892b88ae065..d2f4a8262200 100644
--- a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
+++ b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
@@ -344,7 +344,6 @@ struct test tests[] = {
 			.tcp.dest = 8080,
 		},
 		.keys = {
-			.nhoff = 0,
 			.nhoff = ETH_HLEN,
 			.thoff = ETH_HLEN + sizeof(struct iphdr) +
 				sizeof(struct iphdr),
-- 
2.23.0.187.g17f5b7556c-goog


^ permalink raw reply related

* [PATCH net-next] nfp: add AMDA0058 boards to firmware list
From: Jakub Kicinski @ 2019-08-26 22:30 UTC (permalink / raw)
  To: davem; +Cc: netdev, oss-drivers, Jakub Kicinski, Dirk van der Merwe

Add MODULE_FIRMWARE entries for AMDA0058 boards.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/nfp_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c b/drivers/net/ethernet/netronome/nfp/nfp_main.c
index 60e57f08de80..81679647e842 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c
@@ -815,6 +815,8 @@ static void __exit nfp_main_exit(void)
 module_init(nfp_main_init);
 module_exit(nfp_main_exit);
 
+MODULE_FIRMWARE("netronome/nic_AMDA0058-0011_2x40.nffw");
+MODULE_FIRMWARE("netronome/nic_AMDA0058-0012_2x40.nffw");
 MODULE_FIRMWARE("netronome/nic_AMDA0081-0001_1x40.nffw");
 MODULE_FIRMWARE("netronome/nic_AMDA0081-0001_4x10.nffw");
 MODULE_FIRMWARE("netronome/nic_AMDA0096-0001_2x10.nffw");
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH v1 net-next] net: stmmac: Add support for MDIO interrupts
From: Florian Fainelli @ 2019-08-26 22:31 UTC (permalink / raw)
  To: Andrew Lunn, Voon Weifeng
  Cc: David S. Miller, Maxime Coquelin, netdev, linux-kernel,
	Jose Abreu, Giuseppe Cavallaro, Alexandre Torgue, Ong Boon Leong
In-Reply-To: <20190826184719.GF2168@lunn.ch>

On 8/26/19 11:47 AM, Andrew Lunn wrote:
> On Tue, Aug 27, 2019 at 09:45:20AM +0800, Voon Weifeng wrote:
>> From: "Chuah, Kim Tatt" <kim.tatt.chuah@intel.com>
>>
>> DW EQoS v5.xx controllers added capability for interrupt generation
>> when MDIO interface is done (GMII Busy bit is cleared).
>> This patch adds support for this interrupt on supported HW to avoid
>> polling on GMII Busy bit.
>>
>> stmmac_mdio_read() & stmmac_mdio_write() will sleep until wake_up() is
>> called by the interrupt handler.
> 
> Hi Voon
> 
> I _think_ there are some order of operation issues here. The mdiobus
> is registered in the probe function. As soon as of_mdiobus_register()
> is called, the MDIO bus must work. At that point MDIO read/writes can
> start to happen.
> 
> As far as i can see, the interrupt handler is only requested in
> stmmac_open(). So it seems like any MDIO operations after probe, but
> before open are going to fail?

AFAIR, wait_event_timeout() will continue to busy loop and wait until
the timeout, but not return an error because the polled condition was
true, at least that is my recollection from having the same issue with
the bcmgenet driver before it was moved to connecting to the PHY in the
ndo_open() function.
-- 
Florian

^ permalink raw reply

* Re: RFC: very rough draft of a bpf permission model
From: Alexei Starovoitov @ 2019-08-26 22:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Borkmann, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List, Chenbo Feng
In-Reply-To: <CALCETrUhXrZaJy8omX_DsH0rAY98YEqR64VuisQSz2Rru8Dqpg@mail.gmail.com>

On Fri, Aug 23, 2019 at 04:09:11PM -0700, Andy Lutomirski wrote:
> On Thu, Aug 22, 2019 at 4:26 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > You're proposing all of the above in addition to CAP_BPF, right?
> > Otherwise I don't see how it addresses the use cases I kept
> > explaining for the last few weeks.
> 
> None of my proposal is intended to exclude changes like CAP_BPF to
> make privileged bpf() operations need less privilege.  But I think
> it's very hard to evaluate CAP_BPF without both a full description of
> exactly what CAP_BPF would do and what at least one full example of a
> user would look like.

the example is previous email and systemd example was not "full" ?

> I also think that users who want CAP_BPF should look at manipulating
> their effective capability set instead.  A daemon that wants to use
> bpf() but otherwise minimize the chance of accidentally causing a
> problem can use capset() to clear its effective and inheritable masks.
> Then, each time it wants to call bpf(), it could re-add CAP_SYS_ADMIN
> or CAP_NET_ADMIN to its effective set, call bpf(), and then clear its
> effective set again.  This works in current kernels and is generally
> good practice.

Such logic means that CAP_NET_ADMIN is not necessary either.
The process could re-add CAP_SYS_ADMIN when it needs to reconfigure
network and then drop it.

> Aside from this, and depending on exactly what CAP_BPF would be, I
> have some further concerns.  Looking at your example in this email:
> 
> > Here is another example of use case that CAP_BPF is solving:
> > The daemon X is started by pid=1 and currently runs as root.
> > It loads a bunch of tracing progs and attaches them to kprobes
> > and tracepoints. It also loads cgroup-bpf progs and attaches them
> > to cgroups. All progs are collecting data about the system and
> > logging it for further analysis.
> 
> This needs more than just bpf().  Creating a perf kprobe event
> requires CAP_SYS_ADMIN, and without a perf kprobe event, you can't
> attach a bpf program.  

that is already solved sysctl_perf_event_paranoid.
CAP_BPF is about BPF part only.

> And the privilege to attach bpf programs to
> cgroups without any DAC or MAC checks (which is what the current API
> does) is an extremely broad privilege that is not that much weaker
> than CAP_SYS_ADMIN or CAP_NET_ADMIN.  Also:

I don't think there is a hierarchy of CAP_SYS_ADMIN vs CAP_NET_ADMIN
vs CAP_BPF.
CAP_BPF and CAP_NET_ADMIN carve different areas of CAP_SYS_ADMIN.
Just like all other caps.

> > This tracing bpf is looking into kernel memory
> > and using bpf_probe_read. Clearly it's not _secure_. But it's _safe_.
> > The system is not going to crash because of BPF,
> > but it can easily crash because of simple coding bugs in the user
> > space bits of that daemon.
> 
> The BPF verifier and interpreter, taken in isolation, may be extremely
> safe, but attaching BPF programs to various hooks can easily take down
> the system, deliberately or by accident.  A handler, especially if it
> can access user memory or otherwise fault, will explode if attached to
> an inappropriate kprobe, hw_breakpoint, or function entry trace event.

absolutely not true.

> (I and the other maintainers consider this to be a bug if it happens,
> and we'll fix it, but these bugs definitely exist.)  A cgroup-bpf hook
> that blocks all network traffic will effectively kill a machine,
> especially if it's a server. 

this permission is granted by CAP_NET_ADMIN. Nothing changes here.

> A bpf program that runs excessively
> slowly attached to a high-frequency hook will kill the system, too.

not true either.

> (I bet a buggy bpf program that calls bpf_probe_read() on an unmapped
> address repeatedly could be make extremely slow.  Page faults take
> thousands to tens of thousands of cycles.) 

kprobe probing and faulting on non-existent address will do
the same 'damage'. So it's not bpf related.
Also it won't make the system "extremely slow".
Nothing to do with CAP_BPF.

> A bpf firewall rule that's
> wrong can cut a machine off from the network -- I've killed machines
> using iptables more than once, and bpf isn't magically safer.

this is CAP_NET_ADMIN permission. It's a different capability.

> 
> I'm wondering if something like CAP_TRACING would make sense.
> CAP_TRACING would allow operations that can reveal kernel memory and
> other secret kernel state but that do not, by design, allow modifying
> system behavior.  So, for example, CAP_TRACING would allow privileged
> perf_event_open() operations and privileged bpf verifier usage.  But
> it would not allow cgroup-bpf unless further restrictions were added,
> and it would not allow the *_BY_ID operations, as those can modify
> other users' bpf programs' behavior.

Makes little sense to me.
I can imagine CAP_TRACING controlling kprobe/uprobe creation
and probe_read() both from bpf side and from vanilla kprobe.
That would be much nicer interface to use than existing
sysctl_perf_event_paranoid, but that is orthogonal to CAP_BPF
which is strictly about BPF.

> Something finer-grained can mitigate some of this.  CAP_BPF as I think
> you're imagining it will not.

I'm afraid this discussion goes nowhere.
We'll post CAP_BPF patches soon so we can discuss code.


^ permalink raw reply

* Re: [PATCH bpf-next] selftests/bpf: remove wrong nhoff in flow dissector test
From: Song Liu @ 2019-08-26 22:41 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Networking, bpf, David S . Miller, Alexei Starovoitov,
	Daniel Borkmann
In-Reply-To: <20190826222712.171177-1-sdf@google.com>

On Mon, Aug 26, 2019 at 3:28 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> .nhoff = 0 is (correctly) reset to ETH_HLEN on the next line so let's
> drop it.
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  tools/testing/selftests/bpf/prog_tests/flow_dissector.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
> index 6892b88ae065..d2f4a8262200 100644
> --- a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
> +++ b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
> @@ -344,7 +344,6 @@ struct test tests[] = {
>                         .tcp.dest = 8080,
>                 },
>                 .keys = {
> -                       .nhoff = 0,
>                         .nhoff = ETH_HLEN,
>                         .thoff = ETH_HLEN + sizeof(struct iphdr) +
>                                 sizeof(struct iphdr),
> --
> 2.23.0.187.g17f5b7556c-goog
>

^ permalink raw reply

* Re: [PATCH bpf-next 3/4] selftests/bpf: verifier precise tests
From: Alexei Starovoitov @ 2019-08-26 22:47 UTC (permalink / raw)
  To: Song Liu
  Cc: Alexei Starovoitov, David S . Miller, Daniel Borkmann, Networking,
	bpf, Kernel Team
In-Reply-To: <CAPhsuW54=MiBfLp+AL2ASqaoGOf+p9D_VXxBYcR5fFpBrdEGSg@mail.gmail.com>

On Sun, Aug 25, 2019 at 10:22:13PM -0700, Song Liu wrote:
> On Fri, Aug 23, 2019 at 2:59 AM Alexei Starovoitov <ast@kernel.org> wrote:
> >
> > Use BPF_F_TEST_STATE_FREQ flag to check that precision
> > tracking works as expected by comparing every step it takes.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> >
> > +static bool cmp_str_seq(const char *log, const char *exp)
> 
> Maybe call it str_str_seq()?

imo cmp*() returns the result of comparison.
Which is either boolean or -1,0,1.
Whereas str*() should return the address, index, or offset.
Hence I used cmp_ prefix here.

> >  static void do_test_single(struct bpf_test *test, bool unpriv,
> >                            int *passes, int *errors)
> >  {
> > @@ -897,14 +929,20 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
> >                 pflags |= BPF_F_STRICT_ALIGNMENT;
> >         if (test->flags & F_NEEDS_EFFICIENT_UNALIGNED_ACCESS)
> >                 pflags |= BPF_F_ANY_ALIGNMENT;
> > +       if (test->flags & ~3)
> > +               pflags |= test->flags;
> ^^^^^^ why do we need these two lines?

To pass flags from test into attr.prog_flags.
Older F_NEEDS_* and F_LOAD_* may use some cleanup and can be removed,
but it would be a different patch.


^ permalink raw reply

* Re: [PATCH bpf-next 3/4] selftests/bpf: verifier precise tests
From: Song Liu @ 2019-08-26 22:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Song Liu, Alexei Starovoitov, David S . Miller, Daniel Borkmann,
	Networking, bpf, Kernel Team
In-Reply-To: <20190826224724.edxfxbkv6r5wkg6o@ast-mbp>



> On Aug 26, 2019, at 3:47 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> On Sun, Aug 25, 2019 at 10:22:13PM -0700, Song Liu wrote:
>> On Fri, Aug 23, 2019 at 2:59 AM Alexei Starovoitov <ast@kernel.org> wrote:
>>> 
>>> Use BPF_F_TEST_STATE_FREQ flag to check that precision
>>> tracking works as expected by comparing every step it takes.
>>> 
>>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>> 
>>> +static bool cmp_str_seq(const char *log, const char *exp)
>> 
>> Maybe call it str_str_seq()?
> 
> imo cmp*() returns the result of comparison.
> Which is either boolean or -1,0,1.
> Whereas str*() should return the address, index, or offset.
> Hence I used cmp_ prefix here.

Good point. I didn't think about this. 

> 
>>> static void do_test_single(struct bpf_test *test, bool unpriv,
>>>                           int *passes, int *errors)
>>> {
>>> @@ -897,14 +929,20 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
>>>                pflags |= BPF_F_STRICT_ALIGNMENT;
>>>        if (test->flags & F_NEEDS_EFFICIENT_UNALIGNED_ACCESS)
>>>                pflags |= BPF_F_ANY_ALIGNMENT;
>>> +       if (test->flags & ~3)
>>> +               pflags |= test->flags;
>> ^^^^^^ why do we need these two lines?
> 
> To pass flags from test into attr.prog_flags.
> Older F_NEEDS_* and F_LOAD_* may use some cleanup and can be removed,
> but it would be a different patch.

Sounds good. 

Acked-by: Song Liu <songliubraving@fb.com>

Thanks!


^ permalink raw reply

* Re: [PATCH 2/4] mdev: Make mdev alias unique among all mdevs
From: Mark Bloch @ 2019-08-26 23:02 UTC (permalink / raw)
  To: Parav Pandit, alex.williamson@redhat.com, Jiri Pirko,
	kwankhede@nvidia.com, cohuck@redhat.com, davem@davemloft.net
  Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <20190826204119.54386-3-parav@mellanox.com>



On 8/26/19 1:41 PM, Parav Pandit wrote:
> Mdev alias should be unique among all the mdevs, so that when such alias
> is used by the mdev users to derive other objects, there is no
> collision in a given system.
> 
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index e825ff38b037..6eb37f0c6369 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -375,6 +375,11 @@ int mdev_device_create(struct kobject *kobj, struct device *dev,
>  			ret = -EEXIST;
>  			goto mdev_fail;
>  		}
> +		if (tmp->alias && strcmp(tmp->alias, alias) == 0) {

alias can be NULL here no?

> +			mutex_unlock(&mdev_list_lock);
> +			ret = -EEXIST;
> +			goto mdev_fail;
> +		}
>  	}
>  
>  	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
> 

Mark

^ permalink raw reply

* Re: [net-next 4/8] net/mlx5e: Add device out of buffer counter
From: Saeed Mahameed @ 2019-08-26 23:06 UTC (permalink / raw)
  To: jakub.kicinski@netronome.com
  Cc: davem@davemloft.net, netdev@vger.kernel.org, Moshe Shemesh
In-Reply-To: <20190826133949.0691660c@cakuba.netronome.com>

On Mon, 2019-08-26 at 13:39 -0700, Jakub Kicinski wrote:
> On Mon, 26 Aug 2019 20:14:47 +0000, Saeed Mahameed wrote:
> > > I see thanks for the explanation and sorry for the delayed
> > > response.
> > > Would it perhaps make sense to indicate the hairpin in the
> > > name?  
> > 
> > We had some internal discussion and we couldn't come up with the
> > perfect name :)
> > 
> > hairpin is just an implementation detail, we don't want to
> > exclusively
> > bind this counter to hairpin only flows, the problem is not with
> > hairpin, the actual problem is due to the use of internal RQs, for
> > now
> > it only happens with "hairpin like" flows, but tomorrow it can
> > happen
> > with a different scenario but same root cause (the use of internal
> > RQs), we want to have one counter to count internal drops due to
> > internal use of internal RQs.
> > 
> > so how about:
> > dev_internal_rq_oob: Device Internal RQ out of buffer
> > dev_internal_out_of_res: Device Internal out of resources (more
> > generic
> > ? too generic ?)
> 
> Maybe dev_internal_queue_oob? The use of 'internal' is a little
> unfortunate, because it may be read as RQ run out of internal
> buffers.
> Rather than special type of queue run out of buffers.
> But not knowing the HW I don't really have any great suggestions :(
> Either of the above would work as well.
> 

True, even our HW architects didn't know how to call it, since sticking
to a name now that might be deprecated in a future HW is what we are
trying to avoid. a generic name is preferable.

I like dev_internal_queue_oob, will take it with the team and send v2tomorrow. 

thanks Jakub for the support.


> > Any suggestion that you provide will be more than welcome.
> > 
> > > dev_out_of_buffer is quite a generic name, and there seems to be
> > > no
> > > doc, nor does the commit message explains it as well as you
> > > have..  
> > 
> > Regarding documentation:
> > All mlx5 ethool counters are documented here
> > https://community.mellanox.com/s/article/understanding-mlx5-linux-counters-and-status-parameters
> > 
> > once we decide on the name, will add the new counter to the doc.
> 
> I see, thanks!

I will add a link to this file in 
https://www.kernel.org/doc/html/latest/networking/device_drivers/mellanox/mlx5.html?highlight=mlx5


^ permalink raw reply

* [PATCH net] net/sched: pfifo_fast: fix wrong dereference when qdisc is reset
From: Davide Caratti @ 2019-08-26 23:15 UTC (permalink / raw)
  To: Cong Wang, Jamal Hadi Salim, Jiri Pirko, David S. Miller, netdev
  Cc: Paolo Abeni, Li Shuang

Now that 'TCQ_F_CPUSTATS' bit can be cleared, depending on the value of
'TCQ_F_NOLOCK' bit in the parent qdisc, we need to be sure that per-cpu
counters are present when 'reset()' is called for pfifo_fast qdiscs.
Otherwise, the following script:

 # tc q a dev lo handle 1: root htb default 100
 # tc c a dev lo parent 1: classid 1:100 htb \
 > rate 95Mbit ceil 100Mbit burst 64k
 [...]
 # tc f a dev lo parent 1: protocol arp basic classid 1:100
 [...]
 # tc q a dev lo parent 1:100 handle 100: pfifo_fast
 [...]
 # tc q d dev lo root

can generate the following splat:

 Unable to handle kernel paging request at virtual address dfff2c01bd148000
 Mem abort info:
   ESR = 0x96000004
   Exception class = DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
 Data abort info:
   ISV = 0, ISS = 0x00000004
   CM = 0, WnR = 0
 [dfff2c01bd148000] address between user and kernel address ranges
 Internal error: Oops: 96000004 [#1] SMP
 [...]
 pstate: 80000005 (Nzcv daif -PAN -UAO)
 pc : pfifo_fast_reset+0x280/0x4d8
 lr : pfifo_fast_reset+0x21c/0x4d8
 sp : ffff800d09676fa0
 x29: ffff800d09676fa0 x28: ffff200012ee22e4
 x27: dfff200000000000 x26: 0000000000000000
 x25: ffff800ca0799958 x24: ffff1001940f332b
 x23: 0000000000000007 x22: ffff200012ee1ab8
 x21: 0000600de8a40000 x20: 0000000000000000
 x19: ffff800ca0799900 x18: 0000000000000000
 x17: 0000000000000002 x16: 0000000000000000
 x15: 0000000000000000 x14: 0000000000000000
 x13: 0000000000000000 x12: ffff1001b922e6e2
 x11: 1ffff001b922e6e1 x10: 0000000000000000
 x9 : 1ffff001b922e6e1 x8 : dfff200000000000
 x7 : 0000000000000000 x6 : 0000000000000000
 x5 : 1fffe400025dc45c x4 : 1fffe400025dc357
 x3 : 00000c01bd148000 x2 : 0000600de8a40000
 x1 : 0000000000000007 x0 : 0000600de8a40004
 Call trace:
  pfifo_fast_reset+0x280/0x4d8
  qdisc_reset+0x6c/0x370
  htb_reset+0x150/0x3b8 [sch_htb]
  qdisc_reset+0x6c/0x370
  dev_deactivate_queue.constprop.5+0xe0/0x1a8
  dev_deactivate_many+0xd8/0x908
  dev_deactivate+0xe4/0x190
  qdisc_graft+0x88c/0xbd0
  tc_get_qdisc+0x418/0x8a8
  rtnetlink_rcv_msg+0x3a8/0xa78
  netlink_rcv_skb+0x18c/0x328
  rtnetlink_rcv+0x28/0x38
  netlink_unicast+0x3c4/0x538
  netlink_sendmsg+0x538/0x9a0
  sock_sendmsg+0xac/0xf8
  ___sys_sendmsg+0x53c/0x658
  __sys_sendmsg+0xc8/0x140
  __arm64_sys_sendmsg+0x74/0xa8
  el0_svc_handler+0x164/0x468
  el0_svc+0x10/0x14
 Code: 910012a0 92400801 d343fc03 11000c21 (38fb6863)

Fix this by testing the value of 'TCQ_F_CPUSTATS' bit in 'qdisc->flags',
before dereferencing 'qdisc->cpu_stats'.

Fixes: 8a53e616de29 ("net: sched: when clearing NOLOCK, clear TCQ_F_CPUSTATS, too")
CC: Paolo Abeni <pabeni@redhat.com>
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 net/sched/sch_generic.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 11c03cf4aa74..c89b787785a1 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -688,12 +688,14 @@ static void pfifo_fast_reset(struct Qdisc *qdisc)
 			kfree_skb(skb);
 	}
 
-	for_each_possible_cpu(i) {
-		struct gnet_stats_queue *q = per_cpu_ptr(qdisc->cpu_qstats, i);
+	if (qdisc_is_percpu_stats(qdisc))
+		for_each_possible_cpu(i) {
+			struct gnet_stats_queue *q =
+				per_cpu_ptr(qdisc->cpu_qstats, i);
 
-		q->backlog = 0;
-		q->qlen = 0;
-	}
+			q->backlog = 0;
+			q->qlen = 0;
+		}
 }
 
 static int pfifo_fast_dump(struct Qdisc *qdisc, struct sk_buff *skb)
-- 
2.20.1


^ permalink raw reply related

* Re: RFC: very rough draft of a bpf permission model
From: Andy Lutomirski @ 2019-08-27  0:05 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Borkmann, Song Liu, Kees Cook, Networking,
	bpf, Alexei Starovoitov, Kernel Team, Lorenz Bauer, Jann Horn,
	Greg KH, Linux API, LSM List, Chenbo Feng
In-Reply-To: <20190826223558.6torq6keplniif6w@ast-mbp>

> On Aug 26, 2019, at 3:36 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
>> On Fri, Aug 23, 2019 at 04:09:11PM -0700, Andy Lutomirski wrote:
>> On Thu, Aug 22, 2019 at 4:26 PM Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>>> You're proposing all of the above in addition to CAP_BPF, right?
>>> Otherwise I don't see how it addresses the use cases I kept
>>> explaining for the last few weeks.
>>
>> None of my proposal is intended to exclude changes like CAP_BPF to
>> make privileged bpf() operations need less privilege.  But I think
>> it's very hard to evaluate CAP_BPF without both a full description of
>> exactly what CAP_BPF would do and what at least one full example of a
>> user would look like.
>
> the example is previous email and systemd example was not "full" ?

Can you give an example of how a real user would want to configure
their system such that a non-root systemd instance has capabilities,
sets up a BPF firewall, and does something useful with it?  You
mentioned systemd, multiple people pointed out that, on a normal
system, systemd —user has no capabilities. That was the end of the
discussion.

A full example is one where peoples’ confusion as to what the example
is gets answered.

>
>> I also think that users who want CAP_BPF should look at manipulating
>> their effective capability set instead.  A daemon that wants to use
>> bpf() but otherwise minimize the chance of accidentally causing a
>> problem can use capset() to clear its effective and inheritable masks.
>> Then, each time it wants to call bpf(), it could re-add CAP_SYS_ADMIN
>> or CAP_NET_ADMIN to its effective set, call bpf(), and then clear its
>> effective set again.  This works in current kernels and is generally
>> good practice.
>
> Such logic means that CAP_NET_ADMIN is not necessary either.
> The process could re-add CAP_SYS_ADMIN when it needs to reconfigure
> network and then drop it.

This isn't really true. By giving a process CAP_NET_ADMIN and not
CAP_SYS_ADMIN, that process can configure the network but can’t load
kernel modules or reconfigure the machine deliberately or by accident.

But that's besides the point.  Can you give an example where this
approach doesn't help and CAP_BPF does?

>
>> Aside from this, and depending on exactly what CAP_BPF would be, I
>> have some further concerns.  Looking at your example in this email:
>>
>>> Here is another example of use case that CAP_BPF is solving:
>>> The daemon X is started by pid=1 and currently runs as root.
>>> It loads a bunch of tracing progs and attaches them to kprobes
>>> and tracepoints. It also loads cgroup-bpf progs and attaches them
>>> to cgroups. All progs are collecting data about the system and
>>> logging it for further analysis.
>>
>> This needs more than just bpf().  Creating a perf kprobe event
>> requires CAP_SYS_ADMIN, and without a perf kprobe event, you can't
>> attach a bpf program.
>
> that is already solved sysctl_perf_event_paranoid.
> CAP_BPF is about BPF part only.

Hence my point: I'd like to see a real example where CAP_BPF helps.
perf_event_paranoid does not appear to grant the ability to add
kprobes.  With perf_event_paranoid set to -1:

$ perf probe --add vfs_mknod
Failed to open kprobe_events: Permission denied
  Error: Failed to add events.
$ sudo perf probe --add vfs_mknod
Added new event:
  probe:vfs_mknod      (on vfs_mknod)

I suppose I could modify permissions on debugfs and set
perf_event_paranoid=-1, but at that point the overall security of the
system is so weak that talking about refining the bpf part seems
pointless.

>
>> And the privilege to attach bpf programs to
>> cgroups without any DAC or MAC checks (which is what the current API
>> does) is an extremely broad privilege that is not that much weaker
>> than CAP_SYS_ADMIN or CAP_NET_ADMIN.  Also:
>
> I don't think there is a hierarchy of CAP_SYS_ADMIN vs CAP_NET_ADMIN
> vs CAP_BPF.
> CAP_BPF and CAP_NET_ADMIN carve different areas of CAP_SYS_ADMIN.
> Just like all other caps.

The whole set of capabilities on Linux us a bit of a mess.  Their
features are mostly disjoint but, on a normal Linux machine, many of
the capabilities can be used to become root with full capabilities.

>
>>> This tracing bpf is looking into kernel memory
>>> and using bpf_probe_read. Clearly it's not _secure_. But it's _safe_.
>>> The system is not going to crash because of BPF,
>>> but it can easily crash because of simple coding bugs in the user
>>> space bits of that daemon.
>>
>> The BPF verifier and interpreter, taken in isolation, may be extremely
>> safe, but attaching BPF programs to various hooks can easily take down
>> the system, deliberately or by accident.  A handler, especially if it
>> can access user memory or otherwise fault, will explode if attached to
>> an inappropriate kprobe, hw_breakpoint, or function entry trace event.
>
> absolutely not true.

This is not a constructive way to have a conversation.  When you get
an email that contains a statement you disagree with, perhaps you
could try to give some argument as to why you disagree rather than
just saying "absolutely not true".  Especially when you are talking to
one of the maintainers of the affected system who has a
not-yet-finished branch that addresses some of the bugs that you claim
absolutely don't exist.  If it's really truly necessary, I can go and
write an example that will crash an x86 kernel, but I feel like it
would be a waste of everyone's time.

Right now, on all kernels, an hw_breakpoint on memory used in a
non-recursion-safe part of any of the x86 IST entry handlers will
corrupt the kernel no later than when the hw_breakpoint handler
returns.  It does not matter in the slightest what the BPF payload is.
The payload doesn't even have to be BPF for this to blow up.

Similarly, until very, very recently, a handler that pagefaulted (due
to generating a stack trace or failing a bpf_probe_read() in the
trace_hardirqs_on path would crash the system due to corrupting cr2 in
the x86 entry code.  PeterZ just fixed this bug recently.  I believe
that there are similar bugs relating to DR6, but they probably don't
kill the system as easily.  I wouldn't rule out a full system crash,
though.  Again, a not-really-done fix for this is part-way done in my
tree.

How confident are you that a BPF program that calls bpf_probe_read()
the maximum allowable number of times on the address
0xffffffffffffffff attached to, say, an network interrupt probe will
actually leave the system in a usable state?  Maybe it will, but I'd
be a bit surprised.

How confident are you that the BPF program that calls bpf_probe_read()
on an MMIO address has well-defined semantics?  How confident are you
that the system will still work if such a program runs?

>
>> (I and the other maintainers consider this to be a bug if it happens,
>> and we'll fix it, but these bugs definitely exist.)  A cgroup-bpf hook
>> that blocks all network traffic will effectively kill a machine,
>> especially if it's a server.
>
> this permission is granted by CAP_NET_ADMIN. Nothing changes here.
>
>> A bpf program that runs excessively
>> slowly attached to a high-frequency hook will kill the system, too.
>
> not true either.

What prevents this from happening?  Is there a specific mitigation in place?

My point here is that the bpf is 'safe' in isolation, but that bpf
tracing is only somewhat 'safe'.

>> A bpf firewall rule that's
>> wrong can cut a machine off from the network -- I've killed machines
>> using iptables more than once, and bpf isn't magically safer.
>
> this is CAP_NET_ADMIN permission. It's a different capability.

Since you haven't fully defined what CAP_BPF would do, I can only
assume that you intend for CAP_BPF to enable installation of a BPF
inet_ingress hook on the root cgroup.  A BPF program that rejects
everything will block all traffic.

>
>>
>> I'm wondering if something like CAP_TRACING would make sense.
>> CAP_TRACING would allow operations that can reveal kernel memory and
>> other secret kernel state but that do not, by design, allow modifying
>> system behavior.  So, for example, CAP_TRACING would allow privileged
>> perf_event_open() operations and privileged bpf verifier usage.  But
>> it would not allow cgroup-bpf unless further restrictions were added,
>> and it would not allow the *_BY_ID operations, as those can modify
>> other users' bpf programs' behavior.
>
> Makes little sense to me.
> I can imagine CAP_TRACING controlling kprobe/uprobe creation
> and probe_read() both from bpf side and from vanilla kprobe.
> That would be much nicer interface to use than existing
> sysctl_perf_event_paranoid, but that is orthogonal to CAP_BPF
> which is strictly about BPF.

I'm suggesting that CAP_TRACING would also enable most of all of the
things in the verifier that are currently CAP_SYS_ADMIN and would
enable loading and attaching BPF programs to perf events.  So it's not
orthogonal.

You're welcome to post CAP_BPF patches, but perhaps you could also
comment on CAP_TRACING and capset?

--Andy

^ permalink raw reply

* Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames
From: David Ahern @ 2019-08-27  0:17 UTC (permalink / raw)
  To: David Miller
  Cc: jakub.kicinski, jiri, roopa, netdev, sthemmin, dcbw, mkubecek,
	andrew, parav, saeedm, mlxsw
In-Reply-To: <20190826.152525.144590581669280532.davem@davemloft.net>

On 8/26/19 4:25 PM, David Miller wrote:
> From: David Ahern <dsahern@gmail.com>
> Date: Mon, 26 Aug 2019 16:24:38 -0600
> 
>> On 8/26/19 4:18 PM, David Miller wrote:
>>> I honestly think that the size of link dumps are out of hand as-is.
>>
>> so you are suggesting new alternate names should not appear in kernel
>> generated RTM_NEWLINK messages - be it a link dump or a notification on
>> a change?
> 
> I counter with the question of how much crap can we keep sticking in there
> before we have to do something else to provide that information?
> 

Something a bit stand alone would be a better choice - like all of the
VF stuff, stats, per-device type configuration. Yes, that ship has
sailed, but as I recall that is where the overhead is.

An attribute as basic as a name is the wrong place for that split.

^ permalink raw reply

* Re: [PATCH 08/16] mips: prefer __section from compiler_attributes.h
From: Nick Desaulniers @ 2019-08-27  0:19 UTC (permalink / raw)
  To: Paul Burton
  Cc: akpm@linux-foundation.org, sedat.dilek@gmail.com,
	jpoimboe@redhat.com, yhs@fb.com, miguel.ojeda.sandonis@gmail.com,
	clang-built-linux@googlegroups.com, Ralf Baechle, James Hogan,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	linux-mips@vger.kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, bpf@vger.kernel.org
In-Reply-To: <20190815093848.tremcmaftzspuzzj@pburton-laptop>

On Thu, Aug 15, 2019 at 2:38 AM Paul Burton <paul.burton@mips.com> wrote:
>
> Hi Nick,
>
> On Mon, Aug 12, 2019 at 02:50:41PM -0700, Nick Desaulniers wrote:
> > Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
> > Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
> > Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
>
> It would be good to add a commit message, even if it's just a line
> repeating the subject & preferably describing the motivation.
>
> > ---
> >  arch/mips/include/asm/cache.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/mips/include/asm/cache.h b/arch/mips/include/asm/cache.h
> > index 8b14c2706aa5..af2d943580ee 100644
> > --- a/arch/mips/include/asm/cache.h
> > +++ b/arch/mips/include/asm/cache.h
> > @@ -14,6 +14,6 @@
> >  #define L1_CACHE_SHIFT               CONFIG_MIPS_L1_CACHE_SHIFT
> >  #define L1_CACHE_BYTES               (1 << L1_CACHE_SHIFT)
> >
> > -#define __read_mostly __attribute__((__section__(".data..read_mostly")))
> > +#define __read_mostly __section(.data..read_mostly)
> >
> >  #endif /* _ASM_CACHE_H */
> > --
> > 2.23.0.rc1.153.gdeed80330f-goog
>
> I'm not copied on the rest of the series so I'm not sure what your
> expectations are about where this should be applied. Let me know if
> you'd prefer this to go through mips-next, otherwise:
>
>     Acked-by: Paul Burton <paul.burton@mips.com>

Thanks Paul, going to send this up via Miguel's tree, if you don't
mind.  Updating my series now.  (Will probably avoid running
get_maintainer.pl on every patch...too hard to cc everyone on the
whole series).
-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply

* Re: RFC: very rough draft of a bpf permission model
From: Alexei Starovoitov @ 2019-08-27  0:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Borkmann, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List, Chenbo Feng
In-Reply-To: <CALCETrUARqcn8EmjcgMc8KP=4O5nZDMh=tcruEYvUgSzMKJUBw@mail.gmail.com>

On Mon, Aug 26, 2019 at 05:05:58PM -0700, Andy Lutomirski wrote:
> >>
> >> The BPF verifier and interpreter, taken in isolation, may be extremely
> >> safe, but attaching BPF programs to various hooks can easily take down
> >> the system, deliberately or by accident.  A handler, especially if it
> >> can access user memory or otherwise fault, will explode if attached to
> >> an inappropriate kprobe, hw_breakpoint, or function entry trace event.
> >
> > absolutely not true.
> 
> This is not a constructive way to have a conversation.  When you get
> an email that contains a statement you disagree with, perhaps you
> could try to give some argument as to why you disagree rather than
> just saying "absolutely not true".  Especially when you are talking to
> one of the maintainers of the affected system who has a
> not-yet-finished branch that addresses some of the bugs that you claim
> absolutely don't exist.  If it's really truly necessary, I can go and
> write an example that will crash an x86 kernel, but I feel like it
> would be a waste of everyone's time.

Please do write an example and prove your earlier sensational statement
that "can _easily_ take down the system" by attaching bpf to kprobe.

Most of the functions where kprobes are not allowed are already
marked by 'nokprobe'. All of them marked? Probably not.
There could be places where kprobe will blow the system, but
1. it's not easy to do. unlike your claim
2. that issue has nothing to do with bpf safety guarantees.

> How confident are you that the BPF program that calls bpf_probe_read()
> on an MMIO address has well-defined semantics?  How confident are you
> that the system will still work if such a program runs?

bpf_probe_read is a wrapper of probe_read. Nothing more.
I'm confident that probe_read maintainers are doing good job.

All of the bpf tracing is relying on existing kernel mechanisms
like kprobe, uprobe, perf, probe_read, etc.
bpf verifier cannot make them safer.
If reading mmio via bpf_probe_read will trigger undesired
hw behavior there is nothing bpf verifier can do about it.

^ permalink raw reply

* [PATCH,RFC] ath10k: Fix skb->len (properly) in ath10k_sdio_mbox_rx_packet
From: Nicolas Boichat @ 2019-08-27  0:33 UTC (permalink / raw)
  To: Kalle Valo
  Cc: David S . Miller, ath10k, linux-wireless, netdev, linux-kernel,
	wgong, Niklas Cassel, Alagu Sankar, briannorris, tientzu

(not a formal patch, take this as a bug report for now, I can clean
up depending on the feedback I get here)

There's at least 3 issues here, and the patch fixes 2/3 only, I'm not sure
how/if 1 should be handled.
 1. ath10k_sdio_mbox_rx_alloc allocating skb of a incorrect size (too
    small)
 2. ath10k_sdio_mbox_rx_packet calling skb_put with that incorrect size.
 3. ath10k_sdio_mbox_rx_process_packet attempts to fixup the size, but
    does not use proper skb_put commands to do so, so we end up with
    a mismatch between skb->head + skb->tail and skb->data + skb->len.

Let's start with 3, this is quite serious as this and causes corruptions
in the TCP stack, as the stack tries to coalesce packets, and relies on
skb->tail being correct (that is, skb_tail_pointer must point to the
first byte _after_ the data): one must never manipulate skb->len
directly.

Instead, we need to use skb_put to allocate more space (which updates
skb->len and skb->tail). But it seems odd to do that in
ath10k_sdio_mbox_rx_process_packet, so I move the code to
ath10k_sdio_mbox_rx_packet (point 2 above).

However, there is still something strange (point 1 above), why is
ath10k_sdio_mbox_rx_alloc allocating packets of the incorrect
(too small?) size? What happens if the packet is bigger than alloc_len?
Does this lead to corruption/lost data?

Fixes: 8530b4e7b22bc3b ("ath10k: sdio: set skb len for all rx packets")
Signed-off-by: Nicolas Boichat <drinkcat@chromium.org>

---

One simple way to test this is this scriplet, that sends a lot of
small packets over SSH:
(for i in `seq 1 300`; do echo $i; sleep 0.1; done) | ssh $IP cat

In my testing it rarely ever reach 300 without failure.

 drivers/net/wireless/ath/ath10k/sdio.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
index 8ed4fbd8d6c3888..a9f5002863ee7bb 100644
--- a/drivers/net/wireless/ath/ath10k/sdio.c
+++ b/drivers/net/wireless/ath/ath10k/sdio.c
@@ -381,16 +381,14 @@ static int ath10k_sdio_mbox_rx_process_packet(struct ath10k *ar,
 	struct ath10k_htc_hdr *htc_hdr = (struct ath10k_htc_hdr *)skb->data;
 	bool trailer_present = htc_hdr->flags & ATH10K_HTC_FLAG_TRAILER_PRESENT;
 	enum ath10k_htc_ep_id eid;
-	u16 payload_len;
 	u8 *trailer;
 	int ret;
 
-	payload_len = le16_to_cpu(htc_hdr->len);
-	skb->len = payload_len + sizeof(struct ath10k_htc_hdr);
+	/* TODO: Remove this? */
+	WARN_ON(skb->len != le16_to_cpu(htc_hdr->len) + sizeof(*htc_hdr));
 
 	if (trailer_present) {
-		trailer = skb->data + sizeof(*htc_hdr) +
-			  payload_len - htc_hdr->trailer_len;
+		trailer = skb->data + skb->len - htc_hdr->trailer_len;
 
 		eid = pipe_id_to_eid(htc_hdr->eid);
 
@@ -637,8 +635,16 @@ static int ath10k_sdio_mbox_rx_packet(struct ath10k *ar,
 	ret = ath10k_sdio_readsb(ar, ar_sdio->mbox_info.htc_addr,
 				 skb->data, pkt->alloc_len);
 	pkt->status = ret;
-	if (!ret)
+	if (!ret) {
+		/* Update actual length. */
+		/* FIXME: This looks quite wrong, why is pkt->act_len not
+		 * correct in the first place?
+		 */
+		struct ath10k_htc_hdr *htc_hdr =
+			(struct ath10k_htc_hdr *)skb->data;
+		pkt->act_len = le16_to_cpu(htc_hdr->len) + sizeof(*htc_hdr);
 		skb_put(skb, pkt->act_len);
+	}
 
 	return ret;
 }
-- 
2.23.0.187.g17f5b7556c-goog


^ permalink raw reply related

* Re: [PATCH net-next] nfp: add AMDA0058 boards to firmware list
From: David Miller @ 2019-08-27  0:21 UTC (permalink / raw)
  To: jakub.kicinski; +Cc: netdev, oss-drivers, dirk.vandermerwe
In-Reply-To: <20190826223041.21100-1-jakub.kicinski@netronome.com>

From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Mon, 26 Aug 2019 15:30:41 -0700

> Add MODULE_FIRMWARE entries for AMDA0058 boards.
> 
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>

Applied.

^ permalink raw reply

* Re: [PATCH 1/4] mdev: Introduce sha1 based mdev alias
From: Alex Williamson @ 2019-08-27  1:44 UTC (permalink / raw)
  To: Parav Pandit; +Cc: jiri, kwankhede, cohuck, davem, kvm, linux-kernel, netdev
In-Reply-To: <20190826204119.54386-2-parav@mellanox.com>

On Mon, 26 Aug 2019 15:41:16 -0500
Parav Pandit <parav@mellanox.com> wrote:

> Whenever a parent requests to generate mdev alias, generate a mdev
> alias.
> It is an optional attribute that parent can request to generate
> for each of its child mdev.
> mdev alias is generated using sha1 from the mdev name.
> 
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c    | 98 +++++++++++++++++++++++++++++++-
>  drivers/vfio/mdev/mdev_private.h |  5 +-
>  drivers/vfio/mdev/mdev_sysfs.c   | 13 +++--
>  include/linux/mdev.h             |  4 ++
>  4 files changed, 111 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index b558d4cfd082..e825ff38b037 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -10,9 +10,11 @@
>  #include <linux/module.h>
>  #include <linux/device.h>
>  #include <linux/slab.h>
> +#include <linux/mm.h>
>  #include <linux/uuid.h>
>  #include <linux/sysfs.h>
>  #include <linux/mdev.h>
> +#include <crypto/hash.h>
>  
>  #include "mdev_private.h"
>  
> @@ -27,6 +29,8 @@ static struct class_compat *mdev_bus_compat_class;
>  static LIST_HEAD(mdev_list);
>  static DEFINE_MUTEX(mdev_list_lock);
>  
> +static struct crypto_shash *alias_hash;
> +
>  struct device *mdev_parent_dev(struct mdev_device *mdev)
>  {
>  	return mdev->parent->dev;
> @@ -164,6 +168,18 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
>  		goto add_dev_err;
>  	}
>  
> +	if (ops->get_alias_length) {
> +		unsigned int digest_size;
> +		unsigned int aligned_len;
> +
> +		aligned_len = roundup(ops->get_alias_length(), 2);
> +		digest_size = crypto_shash_digestsize(alias_hash);
> +		if (aligned_len / 2 > digest_size) {
> +			ret = -EINVAL;
> +			goto add_dev_err;
> +		}
> +	}

This looks like a sanity check, it could be done outside of the
parent_list_lock, even before we get a parent device reference.

I think we're using a callback for get_alias_length() rather than a
fixed field to support the mtty module option added in patch 4, right?
Its utility is rather limited with no args.  I could imagine that if a
parent wanted to generate an alias that could be incorporated into a
string with the parent device name that it would be useful to call this
with the parent device as an arg.  I guess we can save that until a
user comes along though.

There doesn't seem to be anything serializing use of alias_hash.

> +
>  	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
>  	if (!parent) {
>  		ret = -ENOMEM;
> @@ -259,6 +275,7 @@ static void mdev_device_free(struct mdev_device *mdev)
>  	mutex_unlock(&mdev_list_lock);
>  
>  	dev_dbg(&mdev->dev, "MDEV: destroying\n");
> +	kvfree(mdev->alias);
>  	kfree(mdev);
>  }
>  
> @@ -269,18 +286,86 @@ static void mdev_device_release(struct device *dev)
>  	mdev_device_free(mdev);
>  }
>  
> -int mdev_device_create(struct kobject *kobj,
> -		       struct device *dev, const guid_t *uuid)
> +static const char *
> +generate_alias(const char *uuid, unsigned int max_alias_len)
> +{
> +	struct shash_desc *hash_desc;
> +	unsigned int digest_size;
> +	unsigned char *digest;
> +	unsigned int alias_len;
> +	char *alias;
> +	int ret = 0;
> +
> +	/* Align to multiple of 2 as bin2hex will generate
> +	 * even number of bytes.
> +	 */

Comment style for non-networking code please.

> +	alias_len = roundup(max_alias_len, 2);
> +	alias = kvzalloc(alias_len + 1, GFP_KERNEL);

The size we're generating here should be small enough to just use
kzalloc(), probably below too.

> +	if (!alias)
> +		return NULL;
> +
> +	/* Allocate and init descriptor */
> +	hash_desc = kvzalloc(sizeof(*hash_desc) +
> +			     crypto_shash_descsize(alias_hash),
> +			     GFP_KERNEL);
> +	if (!hash_desc)
> +		goto desc_err;
> +
> +	hash_desc->tfm = alias_hash;
> +
> +	digest_size = crypto_shash_digestsize(alias_hash);
> +
> +	digest = kvzalloc(digest_size, GFP_KERNEL);
> +	if (!digest) {
> +		ret = -ENOMEM;
> +		goto digest_err;
> +	}
> +	crypto_shash_init(hash_desc);
> +	crypto_shash_update(hash_desc, uuid, UUID_STRING_LEN);
> +	crypto_shash_final(hash_desc, digest);
> +	bin2hex(&alias[0], digest,

&alias[0], ie. alias

> +		min_t(unsigned int, digest_size, alias_len / 2));
> +	/* When alias length is odd, zero out and additional last byte
> +	 * that bin2hex has copied.
> +	 */
> +	if (max_alias_len % 2)
> +		alias[max_alias_len] = 0;

Doesn't this give us a null terminated string for odd numbers but not
even numbers?  Probably best to define that we always provide a null
terminated string then we could do this unconditionally.

> +
> +	kvfree(digest);
> +	kvfree(hash_desc);
> +	return alias;
> +
> +digest_err:
> +	kvfree(hash_desc);
> +desc_err:
> +	kvfree(alias);
> +	return NULL;
> +}
> +
> +int mdev_device_create(struct kobject *kobj, struct device *dev,
> +		       const char *uuid_str, const guid_t *uuid)
>  {
>  	int ret;
>  	struct mdev_device *mdev, *tmp;
>  	struct mdev_parent *parent;
>  	struct mdev_type *type = to_mdev_type(kobj);
> +	unsigned int alias_len = 0;
> +	const char *alias = NULL;
>  
>  	parent = mdev_get_parent(type->parent);
>  	if (!parent)
>  		return -EINVAL;
>  
> +	if (parent->ops->get_alias_length)
> +		alias_len = parent->ops->get_alias_length();
> +	if (alias_len) {

Why isn't this nested into the branch above?

> +		alias = generate_alias(uuid_str, alias_len);
> +		if (!alias) {
> +			ret = -ENOMEM;

Could use an ERR_PTR and propagate an errno.

> +			goto alias_fail;
> +		}
> +	}
> +
>  	mutex_lock(&mdev_list_lock);
>  
>  	/* Check for duplicate */
> @@ -300,6 +385,8 @@ int mdev_device_create(struct kobject *kobj,
>  	}
>  
>  	guid_copy(&mdev->uuid, uuid);
> +	mdev->alias = alias;
> +	alias = NULL;

A comment justifying this null'ing might help prevent it getting culled
as some point.  It appears arbitrary at first look.  Thanks,

Alex

>  	list_add(&mdev->next, &mdev_list);
>  	mutex_unlock(&mdev_list_lock);
>  
> @@ -346,6 +433,8 @@ int mdev_device_create(struct kobject *kobj,
>  	up_read(&parent->unreg_sem);
>  	put_device(&mdev->dev);
>  mdev_fail:
> +	kvfree(alias);
> +alias_fail:
>  	mdev_put_parent(parent);
>  	return ret;
>  }
> @@ -406,6 +495,10 @@ EXPORT_SYMBOL(mdev_get_iommu_device);
>  
>  static int __init mdev_init(void)
>  {
> +	alias_hash = crypto_alloc_shash("sha1", 0, 0);
> +	if (!alias_hash)
> +		return -ENOMEM;
> +
>  	return mdev_bus_register();
>  }
>  
> @@ -415,6 +508,7 @@ static void __exit mdev_exit(void)
>  		class_compat_unregister(mdev_bus_compat_class);
>  
>  	mdev_bus_unregister();
> +	crypto_free_shash(alias_hash);
>  }
>  
>  module_init(mdev_init)
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> index 7d922950caaf..cf1c0d9842c6 100644
> --- a/drivers/vfio/mdev/mdev_private.h
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -33,6 +33,7 @@ struct mdev_device {
>  	struct kobject *type_kobj;
>  	struct device *iommu_device;
>  	bool active;
> +	const char *alias;
>  };
>  
>  #define to_mdev_device(dev)	container_of(dev, struct mdev_device, dev)
> @@ -57,8 +58,8 @@ void parent_remove_sysfs_files(struct mdev_parent *parent);
>  int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type);
>  void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
>  
> -int  mdev_device_create(struct kobject *kobj,
> -			struct device *dev, const guid_t *uuid);
> +int mdev_device_create(struct kobject *kobj, struct device *dev,
> +		       const char *uuid_str, const guid_t *uuid);
>  int  mdev_device_remove(struct device *dev);
>  
>  #endif /* MDEV_PRIVATE_H */
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> index 7570c7602ab4..43afe0e80b76 100644
> --- a/drivers/vfio/mdev/mdev_sysfs.c
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -63,15 +63,18 @@ static ssize_t create_store(struct kobject *kobj, struct device *dev,
>  		return -ENOMEM;
>  
>  	ret = guid_parse(str, &uuid);
> -	kfree(str);
>  	if (ret)
> -		return ret;
> +		goto err;
>  
> -	ret = mdev_device_create(kobj, dev, &uuid);
> +	ret = mdev_device_create(kobj, dev, str, &uuid);
>  	if (ret)
> -		return ret;
> +		goto err;
>  
> -	return count;
> +	ret = count;
> +
> +err:
> +	kfree(str);
> +	return ret;
>  }
>  
>  MDEV_TYPE_ATTR_WO(create);
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> index 0ce30ca78db0..f036fe9854ee 100644
> --- a/include/linux/mdev.h
> +++ b/include/linux/mdev.h
> @@ -72,6 +72,9 @@ struct device *mdev_get_iommu_device(struct device *dev);
>   * @mmap:		mmap callback
>   *			@mdev: mediated device structure
>   *			@vma: vma structure
> + * @get_alias_length:	Generate alias for the mdevs of this parent based on the
> + *			mdev device name when it returns non zero alias length.
> + *			It is optional.
>   * Parent device that support mediated device should be registered with mdev
>   * module with mdev_parent_ops structure.
>   **/
> @@ -92,6 +95,7 @@ struct mdev_parent_ops {
>  	long	(*ioctl)(struct mdev_device *mdev, unsigned int cmd,
>  			 unsigned long arg);
>  	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
> +	unsigned int (*get_alias_length)(void);
>  };
>  
>  /* interface for exporting mdev supported type attributes */


^ permalink raw reply

* Re: [PATCH 1/4] mdev: Introduce sha1 based mdev alias
From: Alex Williamson @ 2019-08-27  1:51 UTC (permalink / raw)
  To: Parav Pandit; +Cc: jiri, kwankhede, cohuck, davem, kvm, linux-kernel, netdev
In-Reply-To: <20190826194456.6edef7d1@x1.home>

On Mon, 26 Aug 2019 19:44:56 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Mon, 26 Aug 2019 15:41:16 -0500
> Parav Pandit <parav@mellanox.com> wrote:
> 
> > Whenever a parent requests to generate mdev alias, generate a mdev
> > alias.
> > It is an optional attribute that parent can request to generate
> > for each of its child mdev.
> > mdev alias is generated using sha1 from the mdev name.
> > 
> > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > ---
> >  drivers/vfio/mdev/mdev_core.c    | 98 +++++++++++++++++++++++++++++++-
> >  drivers/vfio/mdev/mdev_private.h |  5 +-
> >  drivers/vfio/mdev/mdev_sysfs.c   | 13 +++--
> >  include/linux/mdev.h             |  4 ++
> >  4 files changed, 111 insertions(+), 9 deletions(-)
> > 
> > diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> > index b558d4cfd082..e825ff38b037 100644
> > --- a/drivers/vfio/mdev/mdev_core.c
> > +++ b/drivers/vfio/mdev/mdev_core.c
> > @@ -10,9 +10,11 @@
> >  #include <linux/module.h>
> >  #include <linux/device.h>
> >  #include <linux/slab.h>
> > +#include <linux/mm.h>
> >  #include <linux/uuid.h>
> >  #include <linux/sysfs.h>
> >  #include <linux/mdev.h>
> > +#include <crypto/hash.h>
> >  
> >  #include "mdev_private.h"
> >  
> > @@ -27,6 +29,8 @@ static struct class_compat *mdev_bus_compat_class;
> >  static LIST_HEAD(mdev_list);
> >  static DEFINE_MUTEX(mdev_list_lock);
> >  
> > +static struct crypto_shash *alias_hash;
> > +
> >  struct device *mdev_parent_dev(struct mdev_device *mdev)
> >  {
> >  	return mdev->parent->dev;
> > @@ -164,6 +168,18 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
> >  		goto add_dev_err;
> >  	}
> >  
> > +	if (ops->get_alias_length) {
> > +		unsigned int digest_size;
> > +		unsigned int aligned_len;
> > +
> > +		aligned_len = roundup(ops->get_alias_length(), 2);
> > +		digest_size = crypto_shash_digestsize(alias_hash);
> > +		if (aligned_len / 2 > digest_size) {
> > +			ret = -EINVAL;
> > +			goto add_dev_err;
> > +		}
> > +	}  
> 
> This looks like a sanity check, it could be done outside of the
> parent_list_lock, even before we get a parent device reference.
> 
> I think we're using a callback for get_alias_length() rather than a
> fixed field to support the mtty module option added in patch 4, right?
> Its utility is rather limited with no args.  I could imagine that if a
> parent wanted to generate an alias that could be incorporated into a
> string with the parent device name that it would be useful to call this
> with the parent device as an arg.  I guess we can save that until a
> user comes along though.
> 
> There doesn't seem to be anything serializing use of alias_hash.
> 
> > +
> >  	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
> >  	if (!parent) {
> >  		ret = -ENOMEM;
> > @@ -259,6 +275,7 @@ static void mdev_device_free(struct mdev_device *mdev)
> >  	mutex_unlock(&mdev_list_lock);
> >  
> >  	dev_dbg(&mdev->dev, "MDEV: destroying\n");
> > +	kvfree(mdev->alias);
> >  	kfree(mdev);
> >  }
> >  
> > @@ -269,18 +286,86 @@ static void mdev_device_release(struct device *dev)
> >  	mdev_device_free(mdev);
> >  }
> >  
> > -int mdev_device_create(struct kobject *kobj,
> > -		       struct device *dev, const guid_t *uuid)
> > +static const char *
> > +generate_alias(const char *uuid, unsigned int max_alias_len)
> > +{
> > +	struct shash_desc *hash_desc;
> > +	unsigned int digest_size;
> > +	unsigned char *digest;
> > +	unsigned int alias_len;
> > +	char *alias;
> > +	int ret = 0;
> > +
> > +	/* Align to multiple of 2 as bin2hex will generate
> > +	 * even number of bytes.
> > +	 */  
> 
> Comment style for non-networking code please.
> 
> > +	alias_len = roundup(max_alias_len, 2);
> > +	alias = kvzalloc(alias_len + 1, GFP_KERNEL);  

Oops, here's the null termination of alias for the even case (+ 1),
ignore the comment below about odd/even.  Thanks,

Alex

> 
> The size we're generating here should be small enough to just use
> kzalloc(), probably below too.
> 
> > +	if (!alias)
> > +		return NULL;
> > +
> > +	/* Allocate and init descriptor */
> > +	hash_desc = kvzalloc(sizeof(*hash_desc) +
> > +			     crypto_shash_descsize(alias_hash),
> > +			     GFP_KERNEL);
> > +	if (!hash_desc)
> > +		goto desc_err;
> > +
> > +	hash_desc->tfm = alias_hash;
> > +
> > +	digest_size = crypto_shash_digestsize(alias_hash);
> > +
> > +	digest = kvzalloc(digest_size, GFP_KERNEL);
> > +	if (!digest) {
> > +		ret = -ENOMEM;
> > +		goto digest_err;
> > +	}
> > +	crypto_shash_init(hash_desc);
> > +	crypto_shash_update(hash_desc, uuid, UUID_STRING_LEN);
> > +	crypto_shash_final(hash_desc, digest);
> > +	bin2hex(&alias[0], digest,  
> 
> &alias[0], ie. alias
> 
> > +		min_t(unsigned int, digest_size, alias_len / 2));
> > +	/* When alias length is odd, zero out and additional last byte
> > +	 * that bin2hex has copied.
> > +	 */
> > +	if (max_alias_len % 2)
> > +		alias[max_alias_len] = 0;  
> 
> Doesn't this give us a null terminated string for odd numbers but not
> even numbers?  Probably best to define that we always provide a null
> terminated string then we could do this unconditionally.
> 
> > +
> > +	kvfree(digest);
> > +	kvfree(hash_desc);
> > +	return alias;
> > +
> > +digest_err:
> > +	kvfree(hash_desc);
> > +desc_err:
> > +	kvfree(alias);
> > +	return NULL;
> > +}
> > +
> > +int mdev_device_create(struct kobject *kobj, struct device *dev,
> > +		       const char *uuid_str, const guid_t *uuid)
> >  {
> >  	int ret;
> >  	struct mdev_device *mdev, *tmp;
> >  	struct mdev_parent *parent;
> >  	struct mdev_type *type = to_mdev_type(kobj);
> > +	unsigned int alias_len = 0;
> > +	const char *alias = NULL;
> >  
> >  	parent = mdev_get_parent(type->parent);
> >  	if (!parent)
> >  		return -EINVAL;
> >  
> > +	if (parent->ops->get_alias_length)
> > +		alias_len = parent->ops->get_alias_length();
> > +	if (alias_len) {  
> 
> Why isn't this nested into the branch above?
> 
> > +		alias = generate_alias(uuid_str, alias_len);
> > +		if (!alias) {
> > +			ret = -ENOMEM;  
> 
> Could use an ERR_PTR and propagate an errno.
> 
> > +			goto alias_fail;
> > +		}
> > +	}
> > +
> >  	mutex_lock(&mdev_list_lock);
> >  
> >  	/* Check for duplicate */
> > @@ -300,6 +385,8 @@ int mdev_device_create(struct kobject *kobj,
> >  	}
> >  
> >  	guid_copy(&mdev->uuid, uuid);
> > +	mdev->alias = alias;
> > +	alias = NULL;  
> 
> A comment justifying this null'ing might help prevent it getting culled
> as some point.  It appears arbitrary at first look.  Thanks,
> 
> Alex
> 
> >  	list_add(&mdev->next, &mdev_list);
> >  	mutex_unlock(&mdev_list_lock);
> >  
> > @@ -346,6 +433,8 @@ int mdev_device_create(struct kobject *kobj,
> >  	up_read(&parent->unreg_sem);
> >  	put_device(&mdev->dev);
> >  mdev_fail:
> > +	kvfree(alias);
> > +alias_fail:
> >  	mdev_put_parent(parent);
> >  	return ret;
> >  }
> > @@ -406,6 +495,10 @@ EXPORT_SYMBOL(mdev_get_iommu_device);
> >  
> >  static int __init mdev_init(void)
> >  {
> > +	alias_hash = crypto_alloc_shash("sha1", 0, 0);
> > +	if (!alias_hash)
> > +		return -ENOMEM;
> > +
> >  	return mdev_bus_register();
> >  }
> >  
> > @@ -415,6 +508,7 @@ static void __exit mdev_exit(void)
> >  		class_compat_unregister(mdev_bus_compat_class);
> >  
> >  	mdev_bus_unregister();
> > +	crypto_free_shash(alias_hash);
> >  }
> >  
> >  module_init(mdev_init)
> > diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> > index 7d922950caaf..cf1c0d9842c6 100644
> > --- a/drivers/vfio/mdev/mdev_private.h
> > +++ b/drivers/vfio/mdev/mdev_private.h
> > @@ -33,6 +33,7 @@ struct mdev_device {
> >  	struct kobject *type_kobj;
> >  	struct device *iommu_device;
> >  	bool active;
> > +	const char *alias;
> >  };
> >  
> >  #define to_mdev_device(dev)	container_of(dev, struct mdev_device, dev)
> > @@ -57,8 +58,8 @@ void parent_remove_sysfs_files(struct mdev_parent *parent);
> >  int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type);
> >  void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
> >  
> > -int  mdev_device_create(struct kobject *kobj,
> > -			struct device *dev, const guid_t *uuid);
> > +int mdev_device_create(struct kobject *kobj, struct device *dev,
> > +		       const char *uuid_str, const guid_t *uuid);
> >  int  mdev_device_remove(struct device *dev);
> >  
> >  #endif /* MDEV_PRIVATE_H */
> > diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> > index 7570c7602ab4..43afe0e80b76 100644
> > --- a/drivers/vfio/mdev/mdev_sysfs.c
> > +++ b/drivers/vfio/mdev/mdev_sysfs.c
> > @@ -63,15 +63,18 @@ static ssize_t create_store(struct kobject *kobj, struct device *dev,
> >  		return -ENOMEM;
> >  
> >  	ret = guid_parse(str, &uuid);
> > -	kfree(str);
> >  	if (ret)
> > -		return ret;
> > +		goto err;
> >  
> > -	ret = mdev_device_create(kobj, dev, &uuid);
> > +	ret = mdev_device_create(kobj, dev, str, &uuid);
> >  	if (ret)
> > -		return ret;
> > +		goto err;
> >  
> > -	return count;
> > +	ret = count;
> > +
> > +err:
> > +	kfree(str);
> > +	return ret;
> >  }
> >  
> >  MDEV_TYPE_ATTR_WO(create);
> > diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> > index 0ce30ca78db0..f036fe9854ee 100644
> > --- a/include/linux/mdev.h
> > +++ b/include/linux/mdev.h
> > @@ -72,6 +72,9 @@ struct device *mdev_get_iommu_device(struct device *dev);
> >   * @mmap:		mmap callback
> >   *			@mdev: mediated device structure
> >   *			@vma: vma structure
> > + * @get_alias_length:	Generate alias for the mdevs of this parent based on the
> > + *			mdev device name when it returns non zero alias length.
> > + *			It is optional.
> >   * Parent device that support mediated device should be registered with mdev
> >   * module with mdev_parent_ops structure.
> >   **/
> > @@ -92,6 +95,7 @@ struct mdev_parent_ops {
> >  	long	(*ioctl)(struct mdev_device *mdev, unsigned int cmd,
> >  			 unsigned long arg);
> >  	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
> > +	unsigned int (*get_alias_length)(void);
> >  };
> >  
> >  /* interface for exporting mdev supported type attributes */  
> 


^ permalink raw reply

* Re: [PATCH 3/4] mdev: Expose mdev alias in sysfs tree
From: Alex Williamson @ 2019-08-27  1:53 UTC (permalink / raw)
  To: Parav Pandit; +Cc: jiri, kwankhede, cohuck, davem, kvm, linux-kernel, netdev
In-Reply-To: <20190826204119.54386-4-parav@mellanox.com>

On Mon, 26 Aug 2019 15:41:18 -0500
Parav Pandit <parav@mellanox.com> wrote:

> Expose mdev alias as string in a sysfs tree so that such attribute can
> be used to generate netdevice name by systemd/udev or can be used to
> match other kernel objects based on the alias of the mdev.
> 
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_sysfs.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> index 43afe0e80b76..59f4e3cc5233 100644
> --- a/drivers/vfio/mdev/mdev_sysfs.c
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -246,7 +246,20 @@ static ssize_t remove_store(struct device *dev, struct device_attribute *attr,
>  
>  static DEVICE_ATTR_WO(remove);
>  
> +static ssize_t alias_show(struct device *device,
> +			  struct device_attribute *attr, char *buf)
> +{
> +	struct mdev_device *dev = mdev_from_dev(device);
> +
> +	if (!dev->alias)
> +		return -EOPNOTSUPP;

Wouldn't it be better to not create the alias at all?  Thanks,

Alex

> +
> +	return sprintf(buf, "%s\n", dev->alias);
> +}
> +static DEVICE_ATTR_RO(alias);
> +
>  static const struct attribute *mdev_device_attrs[] = {
> +	&dev_attr_alias.attr,
>  	&dev_attr_remove.attr,
>  	NULL,
>  };


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox