* [PATCH v4 1/4] selftests/powerpc: add test for 32 bits memcmp
From: Christophe Leroy @ 2018-06-08 10:20 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
wei.guo.simon
Cc: linux-kernel, linuxppc-dev
This patch renames memcmp test to memcmp_64 and adds
a memcmp_32 test for testing the 32 bits version of memcmp()
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
v4: new
tools/testing/selftests/powerpc/stringloops/Makefile | 14 +++++++++++---
tools/testing/selftests/powerpc/stringloops/memcmp_32 | Bin 0 -> 9092 bytes
tools/testing/selftests/powerpc/stringloops/memcmp_32.S | 1 +
3 files changed, 12 insertions(+), 3 deletions(-)
create mode 100755 tools/testing/selftests/powerpc/stringloops/memcmp_32
create mode 120000 tools/testing/selftests/powerpc/stringloops/memcmp_32.S
diff --git a/tools/testing/selftests/powerpc/stringloops/Makefile b/tools/testing/selftests/powerpc/stringloops/Makefile
index 1125e489055e..1e7301d4bac9 100644
--- a/tools/testing/selftests/powerpc/stringloops/Makefile
+++ b/tools/testing/selftests/powerpc/stringloops/Makefile
@@ -1,10 +1,18 @@
# SPDX-License-Identifier: GPL-2.0
# The loops are all 64-bit code
-CFLAGS += -m64
CFLAGS += -I$(CURDIR)
-TEST_GEN_PROGS := memcmp
-EXTRA_SOURCES := memcmp_64.S ../harness.c
+EXTRA_SOURCES := ../harness.c
+
+$(OUTPUT)/memcmp_64: memcmp.c
+$(OUTPUT)/memcmp_64: CFLAGS += -m64
+
+$(OUTPUT)/memcmp_32: memcmp.c
+$(OUTPUT)/memcmp_32: CFLAGS += -m32
+
+ASFLAGS = $(CFLAGS)
+
+TEST_GEN_PROGS := memcmp_32 memcmp_64
include ../../lib.mk
diff --git a/tools/testing/selftests/powerpc/stringloops/memcmp_32 b/tools/testing/selftests/powerpc/stringloops/memcmp_32
new file mode 100755
index 0000000000000000000000000000000000000000..413a02181daf59b0bdd7f0e2e05b74c4c702e7ad
GIT binary patch
literal 9092
zcmcIqeQZ?MmA`MsU_(su07D&M+h^I?n%0aBm@r@x#(bK@5E_gW+H{|XnXx^~d}Zc2
z8F$CLfgx?3ZMRfTrINZpw^24(r6`s1N7aTjRg((TM9QY!6;jg3QCcP4@CQxliguIC
z{(kq}YsRl6L6!ExnS0N<=bm%!x#xa9yW9JB7y(1fq=0Bb*)YZ2Hsswxe2O_smZ%r)
zVuiRF*j!Rf1t?D|n1j3y5QK;(QvvV=3Rp*FQX$$4f)I;Vcp&C@HT+$)!ZbVp`|}V<
zyXsYa0QGUyO91_x0@_6#0HY21B>=Y+$hx@myUzi?R$_azGnG8pnTm%B+0bgm#C9~*
z*tuuFA_>?a{a_(q2xwM0GO?IHNb>=o08s8t0Lp0qECsMXOg93S0O+IZP4$Wqeirip
zHvk$HO=t8m$72zIw$t8grEhB@8IksRnb#2IWj=h&had9c>wWlFefZz|@Y6ngl@I^1
z#x1YVqH$&K9v^;#Da=PftHYqsr;q#UBR>4|K0I$4OWHtVdyd^2Uwzq!|BDar^Wi6E
z;fSpmBkNXS@4Cw_IPql09xWu|VrVEeS{M<DV$u=0L_VL*i`=MF5JTDgA(2j`qp9R@
zhNRd?G%unDvw24xN~TgGmrrJ#Az|BSl*!tuY%J;|vl+IDrE>x*vZIc~a$};9aB{=R
zxQM2r`Lrk`hodpcT*yZ=@oZW=5KTHc)Y~=<g?-VyW2d9Z3@F3tY(`ORv9o{ImaX=h
zQ1?u!OZGE>xgYTTW9$VmbNzHiB&qyWOTmI<lJkObeV#PnI(B~rT$sk<7ZL7q?s((0
zNX|9WXljQT&Li+N!Ipq7hapavBN$d)#yHK<<-M4qx_l3o8C@R4bk*e%Of_B3VQTAg
z5%%bE8Piagzl!kF<>Q$Cx{Uc|>hftU9lHEDmJMBg3d@5o{|L*2E<cB*LzmAXJZ1Ue
zmg=QaL*TLFhG?4nnP}K@O3ZQ31s>b{x-fcPU+`Y}lhrGou&no4(a>kyIMF8pEq!9%
zAIjh_H3T24piQ_%n0v~?+Eg~hra@DLk1Z78*!*~(sDGxk@`mC-i?}r`gz$6)sY~33
zW#z%g&4(egUl{*<LEAz7w1YO#Cg{K9KKE|Zk=x6ne7Ib<VWkL;xAyJ*?1Y#b`L1XO
zy`ggg_MH?fXw&%0;A6Axs<TL^ouq49p)>w*I;pRGvMfp!Yj5e;(p>4-jR50Yy+@(<
zAIgoL57p@1b-j8Yx(L0~^ttM5v+8TxWK=Xx9uX${Jo!D*#C}%}m(9sH#EOX?A=uBz
zNpUZ9uc$m*ZmxV#ZgT(q?Y&1f3*nw!n(H}Rt-2F$N5=2nSBy-FcKFKJ{~2@R_`Q2W
z-vrM%Y}`|>Ymxlm>z%4j-&y@&VuK}&6Oa!-wN+q)^dIb~zy|oOxdQ$0o5eAv-SnFr
z8~Am$PaiTr&Zii!w(-_Gi$_{#`qS(mAJ{YkK1+`4v(?{nOlkL>)#`<h(qH<wrT3MF
zmKQNTZKdC1tfz%24TVFcx^O7`mZ&QQMWi=Sy%aerHs!uAg1e6jtGyxuy~o6g($Gfm
zZ6sfyG}!#P-Cq_~KkCu`4BDR|jr|V~3S;*X=GAJAKVF{f70vKTH^$Gx*tH>s+c9n(
zlZG%STQIkx&@r~WSo(Hrv1iZ_tzR&Nd*YQQ_eZZZ^}x<POzfVjFx*oYB8cT@+~56g
z(S7utqFZ^h*k{atqjY$Av6cp3EOE~-8;g{yFS_TKkM$m{K3|G2FLM65Z=vmuUlrXK
z{#Xph%*Wm8D@~B`p!>*`2;yR@^;J>rsaC6-PlNZwyG=c3&5Q1azchVq3i`kQ4)q;y
zr_MzpC##px-}%6Q=w5iY$-M%elQTT)+>?-ZmFMf^>-ZGb4Yyj29IbNRbyv`T8Bg#*
z1wN|44;A<bvDJpyYDa8!Ahx=cZ>ej$(s>)oQ8nH#han8~GW{t1OrJuY-`{=4ytAc$
zqCbpw-3GKLN+n5a;`jl-uLln3IsIv~2flDO{}O$k_x_ZNzIlDR?8aX$hEHRRr2pp~
zK>cq@$sM`zmBz#SS4Kka@0<66cD_{l^BMQpn?*PFYVpqM`+w^AJM<s(5dD8r>6+6M
z`LwBYoT$|y<wK9Bqv20Ir-cQXqu0<ArCjP#vR|vwd8A_*bbemyoPwLWEAWl<`G@&?
zPzddFxn9T2%Y8<}w>*0re!#K7JbszJzs)iY#&OAAUMwe<w9Ujg#;!EgTI6^z-qC&)
z-~{57zMtwmS*;?DFT2nG5x#Ahpqz4Y@g~L`@Rsmt*l+~v{Lf{5OJ@b+fqBq#cHuVn
zuy`5k+&$pyRqI3@@Ygy|VV#4oxdwW1TK~gvS=>b5_Zju?!~QwQUzjbs{skr5=sj6|
ziu@(W2j5HOsA$N=g^Tsz>4DfUZpR$7Jbx_gK>dF3&QrYKqi>{dIgh#TeMk(e_Sh@y
zw|^u*<jo#~g(ljxTyL8VZNeB|S?^$tG_VggE4>4K#@z4HR;&*Lik1WIyrM<WC-{Hv
zpIxkvTvJg#%dy$8a&h>0Qi$UdqE3y!+@rV;mBzhwB^+CL9OKWmt(kjL=~OHHK6h2k
zzCQ(?XF$6dG$VX$aafMUF`N+w#f_j(Fh1b#0r$69qp;Rs&YbA1h!}N{hL|Y==RP;(
z*}n+BoO2$2X=AKN-@_*VzNGs4Yw*6rGG)P*Y2aUH+n(nze=u&8ajkt#wZ91dFJ7mv
zE8tx%d7php+Vx?wg;ANAPioI^S9^Z9+#4`Y<(`k&;+{XVrg7eJ&v>M@So+Cl;j91I
zkA1POf-{LbF}>HFT6!P$x~`rls_*q+?mT##sB;roPqtz0Z~bMF`zH4kj<b7enrrnm
z?EU!lL9NG(17RFZeb}Ck7--bzGn_@lgPW`Gbxz26->mm+P2)ZR-81Jftox0Ke|c{F
zs6AclM%>LhpTZYsDNn`&&URWBR;y{eKbWStsRJ;L8-n3oW~_X|8O>)BaqE_NhgBHK
zj;7+)Nc8?hZH3jI9U8(t$GQcTh2)nKEUcU_#4TNGik1*pFH>Hbmn83&1$;DfD3g64
zW62AcHI&b$Eqx^tEiD!<W3hzAYgaNeY~50TR6+VI3h>xS63XLASbA^_Mhs`7DcC%p
zytC`H!dMiyKp~t&!P%$;-`^cty?)iY@cLE5x329vxOPKg?ZI2uu2~gN<{e(bo#=33
zqcxm#?E4e>0<L@;J%g^^>V%N}Py!2wk~uAVD4I--<`Wtfh0$0nQGhtrCx{m~UQ}yy
zmU;j3?qI}-FKGvK;4KFEBHo)pTSQ#H-&i2=en)6D#hmAnFTfo=j=T<UG~1BRF+~ug
zfI%0=CFJz$ZvcGPFah$OgFFb`)5v*ed=J2TAm2hhVTwB3heSQ{r;s;59qZ-+8ne~v
z2cVacvp)v%OTcH|4M6)_Q_&PRT=-iX|9`Z@d~bp8>06n0Zr!@kYR9Y9N^4!HJJhA=
z{+w@hq~YStae~v0HigS~BQ;K}$A4MRaUxRZ;d1<l^UVqEJ^upd+ZXXNGS<ttE8rX_
z;utGg&+#M9F$2Gh6XJ4?aXnD=j0x1s_+h<#_X7T~s+aF*rdZUhaK5iGCRZq&?{LI*
z|D;^t9jcyhd93eJxO`>;E`17p>b+sHu9tq(a{~41e!h5GwdY$O<jFTB+RHbknFJhm
z8aHb~ov8i+WjO|<DZCN-rTwU&{mXz$`+?Jb&OK>AaN55P{_020i9&y^2mTy#jo-m}
z3)BEV?K!|XjJ!_aDO1#kfouL@;FoLgub84i_FKt63f!rwhZrv@`zK+aMG7it|7krw
zz&So=eD?no>;Bv1SMq<1@$IX@|C41VCI59(%tIU+gNpwxQ_RPB8IHm)dE7Jne^K+c
z&Z`eTit%HA8(<ImWi$Yf0RNi8_4=drX>kfSEZ8r{2Nmd_v6_1rnb00>jZLVR@dF(3
zV{G^F-vfQWuHnzfeGld$_%MFPVb(EO@;9&_{4)N4)80oU89&~esPO|I{!<_RA3pp|
z$eSh&!9;ui+gH!ERwgAM>z+6NfHz{47=tzSffgUW)`#yv{~3Q`VSp+F`*eSRpg(~@
zAAe5cmMjEZjeGV4zTv}9L0&g$kWYC}!rn)LsyE@lPkj6r5D%}R-xnbd{tmpsvj=#d
zRygOROrAai``eN(5@0XpPY|)^C1uY6%(pgB+PwBb#7FI$a&S$Jzp*|9Z&&MElj7ex
zTV8OV#w}S04r|=g7yObb7SR5F)&47x{~~$3{s$*fe-5aY|D><|&opkyLhx0Md-9F%
z%*KNkG2f4q=IIN(qt|QT^F=K06r9nap->F3##;yOy4&vGwQs<-fo+?G-LZA1DxSzE
zhLZ&+k++?+9ZO|1iGqk_)45c_NyJ0l8`i89aVML%lkp<4bs`o`rEJ@dXYJus_Fyz+
zv$|kMM~hz7P(GSY*zwVHdQ6+Kea|*&#g4mscW<{fMk~DXUx|#(i1ZGi=n+q7Hms9{
zSQLjgyvM`X0_x<ex;=uY^i%?G^n7Et)dPDXHMA<3NjlOeEZEU}K00P6GI(s?x2@OS
z(z|avJhbh=p5EQNw!-2XCix66?4A90ZRzc|@7l3r-}V7}pm$3@TX-+^vBIc5geLf`
zFtFRc>dk@OTj|Sz=)qJ%s8{|-M8Dp5sjmlDDc$Nr#8pa{`gTxDu@mvA6IGuWYAf`I
zidvfVLoLDkyirS0XY5*%iWNOb)aQMCke@|nbI$rQ!7nJa)jXTm;s}{SRt=K>Bg|iY
z_NZ-`AI;bt{aTlEkn~5HT3-LxnA$q;i%c!WwhtBxs(12Z3qShIVz}de`))Nqwx*(m
z0w#z0lvB$lV^f7Ati;%>z^<*Z(a?61=>$Iu&D<4g#mErS({WY+$<I!;tWvQ0(j-EK
zv9uFCh}_A0`G_vz6ICLg6QNAjNrZa0>{^A(iwKQG@x?0?AIqSjmpge88qSP{)Qx5q
zVxumfNJZH|m2xRZgya~6oJ0|2IR!%btPIakV#JoKfC$Nn3>BQuBJ__js>wQQq|#_Q
z8AE{qO`-^4Fwi2KPA4*sUihvrUk)|{fDMu7X?doW=TX(1bxf@Qo`cDUeS#n?fKGZp
zSq%cu)8yL*K7iobt?QW+)&qFXBi|+P5z4^XmO9CUtx9YL;K)h7U=Brqyk{YkN!zzm
z-Ic({bG$sucY#Ce#j%M;Gcmxs70cfFzOIZi`D*n76)^zYlJ6-L0rGpfmdCsikOr{4
z1$@}A<eiRZ57Ky!)_f13Oj~g&;5~sL`T;tTk7*pB_;`OH%pZ_7-jC+1T;=1rMxJfD
zuEB?|H&Pz@!TScmICc%bZ(il&Jw)C`v@WviGWS*5hkNbJ-#|2wK_$)iufR258{`4J
zdy|jvi5k8R@BwCYkyXn(Q^SXT%DYZC8C0S!rl$eCztO%p_yF>6EjYeRlt+8d0+f9`
zm&&v06lgl>KK&d7^3{9i6M05y(oD!HkB|qj4f&=a4<KV7B$>#=bPhm1*vIn~`DXnc
zgywq*bm~=OD)|-y*NJrm(l|d<`XpoOzchxNJWR|P|0*rM3^xq|;r!nSpk5QeyKIfT
NWdlI3fToj__kRhC;J5$)
literal 0
HcmV?d00001
diff --git a/tools/testing/selftests/powerpc/stringloops/memcmp_32.S b/tools/testing/selftests/powerpc/stringloops/memcmp_32.S
new file mode 120000
index 000000000000..056f2b3af789
--- /dev/null
+++ b/tools/testing/selftests/powerpc/stringloops/memcmp_32.S
@@ -0,0 +1 @@
+../../../../../arch/powerpc/lib/memcmp_32.S
\ No newline at end of file
--
2.13.3
^ permalink raw reply related
* [PATCH v4 2/4] selftests/powerpc: Add test for strlen()
From: Christophe Leroy @ 2018-06-08 10:20 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
wei.guo.simon
Cc: linux-kernel, linuxppc-dev
In-Reply-To: <a348f4ffda6349467e4270e754d12150fa773462.1528452373.git.christophe.leroy@c-s.fr>
This patch adds a test for strlen()
string.c contains a copy of strlen() from lib/string.c
The test first tests the correctness of strlen() by comparing
the result with libc strlen(). It tests all cases of alignment.
It them tests the duration of an aligned strlen() on a 4 bytes string,
on a 16 bytes string and on a 256 bytes string.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
v4: new
.../testing/selftests/powerpc/stringloops/Makefile | 5 +-
.../testing/selftests/powerpc/stringloops/string.c | 36 ++++++
.../testing/selftests/powerpc/stringloops/strlen.c | 123 +++++++++++++++++++++
3 files changed, 163 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/powerpc/stringloops/string.c
create mode 100644 tools/testing/selftests/powerpc/stringloops/strlen.c
diff --git a/tools/testing/selftests/powerpc/stringloops/Makefile b/tools/testing/selftests/powerpc/stringloops/Makefile
index 1e7301d4bac9..df663ee9ddb3 100644
--- a/tools/testing/selftests/powerpc/stringloops/Makefile
+++ b/tools/testing/selftests/powerpc/stringloops/Makefile
@@ -10,9 +10,12 @@ $(OUTPUT)/memcmp_64: CFLAGS += -m64
$(OUTPUT)/memcmp_32: memcmp.c
$(OUTPUT)/memcmp_32: CFLAGS += -m32
+$(OUTPUT)/strlen: strlen.c string.o
+$(OUTPUT)/string.o: string.c
+
ASFLAGS = $(CFLAGS)
-TEST_GEN_PROGS := memcmp_32 memcmp_64
+TEST_GEN_PROGS := memcmp_32 memcmp_64 strlen
include ../../lib.mk
diff --git a/tools/testing/selftests/powerpc/stringloops/string.c b/tools/testing/selftests/powerpc/stringloops/string.c
new file mode 100644
index 000000000000..d05200481017
--- /dev/null
+++ b/tools/testing/selftests/powerpc/stringloops/string.c
@@ -0,0 +1,36 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * linux/lib/string.c
+ *
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ */
+
+/*
+ * stupid library routines.. The optimized versions should generally be found
+ * as inline code in <asm-xx/string.h>
+ *
+ * These are buggy as well..
+ *
+ * * Fri Jun 25 1999, Ingo Oeser <ioe@informatik.tu-chemnitz.de>
+ * - Added strsep() which will replace strtok() soon (because strsep() is
+ * reentrant and should be faster). Use only strsep() in new code, please.
+ *
+ * * Sat Feb 09 2002, Jason Thomas <jason@topic.com.au>,
+ * Matthew Hawkins <matt@mh.dropbear.id.au>
+ * - Kissed strtok() goodbye
+ */
+
+#include <stddef.h>
+
+/**
+ * strlen - Find the length of a string
+ * @s: The string to be sized
+ */
+size_t test_strlen(const char *s)
+{
+ const char *sc;
+
+ for (sc = s; *sc != '\0'; ++sc)
+ /* nothing */;
+ return sc - s;
+}
diff --git a/tools/testing/selftests/powerpc/stringloops/strlen.c b/tools/testing/selftests/powerpc/stringloops/strlen.c
new file mode 100644
index 000000000000..e87ca65ea156
--- /dev/null
+++ b/tools/testing/selftests/powerpc/stringloops/strlen.c
@@ -0,0 +1,123 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <malloc.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+#include "utils.h"
+
+#define SIZE 256
+#define ITERATIONS 1000
+#define ITERATIONS_BENCH 100000
+
+int test_strlen(const void *s);
+
+/* test all offsets and lengths */
+static void test_one(char *s)
+{
+ unsigned long offset;
+
+ for (offset = 0; offset < SIZE; offset++) {
+ int x, y;
+ unsigned long i;
+
+ y = strlen(s + offset);
+ x = test_strlen(s + offset);
+
+ if (x != y) {
+ printf("strlen() returned %d, should have returned %d (%p offset %ld)\n", x, y, s, offset);
+
+ for (i = offset; i < SIZE; i++)
+ printf("%02x ", s[i]);
+ printf("\n");
+ }
+ }
+}
+
+static int testcase(void)
+{
+ char *s;
+ unsigned long i;
+ struct timespec ts_start, ts_end;
+
+ s = memalign(128, SIZE);
+ if (!s) {
+ perror("memalign");
+ exit(1);
+ }
+
+ srandom(1);
+
+ memset(s, 0, SIZE);
+ for (i = 0; i < SIZE; i++) {
+ char c;
+
+ do {
+ c = random() & 0x7f;
+ } while (!c);
+ s[i] = c;
+ test_one(s);
+ }
+
+ for (i = 0; i < ITERATIONS; i++) {
+ unsigned long j;
+
+ for (j = 0; j < SIZE; j++) {
+ char c;
+
+ do {
+ c = random() & 0x7f;
+ } while (!c);
+ s[j] = c;
+ }
+ for (j = 0; j < sizeof(long); j++) {
+ s[SIZE - 1 - j] = 0;
+ test_one(s);
+ }
+ }
+
+ for (i = 0; i < SIZE; i++) {
+ char c;
+
+ do {
+ c = random() & 0x7f;
+ } while (!c);
+ s[i] = c;
+ }
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);
+
+ s[SIZE - 1] = 0;
+ for (i = 0; i < ITERATIONS_BENCH; i++)
+ test_strlen(s);
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_end);
+
+ printf("len %3.3d : time = %.6f\n", SIZE, ts_end.tv_sec - ts_start.tv_sec + (ts_end.tv_nsec - ts_start.tv_nsec) / 1e9);
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);
+
+ s[16] = 0;
+ for (i = 0; i < ITERATIONS_BENCH; i++)
+ test_strlen(s);
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_end);
+
+ printf("len 16 : time = %.6f\n", ts_end.tv_sec - ts_start.tv_sec + (ts_end.tv_nsec - ts_start.tv_nsec) / 1e9);
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);
+
+ s[4] = 0;
+ for (i = 0; i < ITERATIONS_BENCH; i++)
+ test_strlen(s);
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_end);
+
+ printf("len 4 : time = %.6f\n", ts_end.tv_sec - ts_start.tv_sec + (ts_end.tv_nsec - ts_start.tv_nsec) / 1e9);
+
+ return 0;
+}
+
+int main(void)
+{
+ return test_harness(testcase, "strlen");
+}
--
2.13.3
^ permalink raw reply related
* Re: pkeys on POWER: Access rights not reset on execve
From: Michal Suchánek @ 2018-06-08 10:15 UTC (permalink / raw)
To: Florian Weimer
Cc: Ram Pai, Linux-MM, linuxppc-dev, Andy Lutomirski, Dave Hansen
In-Reply-To: <2858a8eb-c9b5-42ce-5cfc-74a4b3ad6aa9@redhat.com>
On Fri, 8 Jun 2018 07:53:51 +0200
Florian Weimer <fweimer@redhat.com> wrote:
> On 06/08/2018 04:34 AM, Ram Pai wrote:
> >>
> >> So the remaining question at this point is whether the Intel
> >> behavior (default-deny instead of default-allow) is preferable.
> >
> > Florian, remind me what behavior needs to fixed?
>
> See the other thread. The Intel register equivalent to the AMR by
> default disallows access to yet-unallocated keys, so that threads
> which are created before key allocation do not magically gain access
> to a key allocated by another thread.
>
That does not make any sense. The threads share the address space so
they should also share the keys.
Or in other words the keys are supposed to be acceleration of
mprotect() so if mprotect() magically gives access to threads that did
not call it so should pkey functions. If they cannot do that then they
fail the primary purpose.
And in any case what semantic that makes sense will you ever get by
threads not magically getting new keys?
Suppose you will spawn some threads, then allocate a new key, associate
it with a piece of protected data, etc.
Now the old threads do not have access to the protected data at all. So
if they want to access it they have to call into a management thread
that created the access key to give them the data. Which means they
need to call into the kernel to switch to the management thread. Which
completely defeats the purpose of the acceleration of mprotect() which
was to avoid calling into the kernel to access the data. In other words
you can as well spawn a management process and use shared memory.
What's worse, if you wanted to opt out of this feature and give the old
threads the new key that's going to be quite a bit of fiddling.
That said there might be an enhancement that kind of makes sense. For
example if you allocate a key to associate with an area that you want
to read most of the time and update occasionally it might make sense to
tell the kernel: make the read bit on this key process-global, make the
write bit on this key thread-local, and do not allow setting the
execute bit at all. Then a thread calling a function to update the data
will get write access but other threads will continue to have read
access unless they also call a function to update the data.
Thanks
Michal
^ permalink raw reply
* [PATCH] media: fsl-viu: Use proper check for presence of {out, in}_be32()
From: Geert Uytterhoeven @ 2018-06-08 9:48 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
Mauro Carvalho Chehab
Cc: Arnd Bergmann, linuxppc-dev, linux-media, linux-kernel,
Geert Uytterhoeven
When compile-testing on m68k or microblaze:
drivers/media/platform/fsl-viu.c:41:1: warning: "out_be32" redefined
drivers/media/platform/fsl-viu.c:42:1: warning: "in_be32" redefined
Fix this by replacing the check for PowerPC by checks for the presence
of {out,in}_be32().
As PowerPC implements the be32 accessors using inline functions instead
of macros, identity definions are added for all accessors to make the
above checks work.
Fixes: 29d750686331a1a9 ("media: fsl-viu: allow building it with COMPILE_TEST")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
---
Compile-tested on m68k, microblaze, and powerpc.
Assembler output before/after compared for powerpc.
---
arch/powerpc/include/asm/io.h | 14 ++++++++++++++
drivers/media/platform/fsl-viu.c | 4 +++-
2 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
index e0331e7545685c5f..3741183ae09349f1 100644
--- a/arch/powerpc/include/asm/io.h
+++ b/arch/powerpc/include/asm/io.h
@@ -222,6 +222,20 @@ static inline void out_be64(volatile u64 __iomem *addr, u64 val)
#endif
#endif /* __powerpc64__ */
+#define in_be16 in_be16
+#define in_be32 in_be32
+#define in_be64 in_be64
+#define in_le16 in_le16
+#define in_le32 in_le32
+#define in_le64 in_le64
+
+#define out_be16 out_be16
+#define out_be32 out_be32
+#define out_be64 out_be64
+#define out_le16 out_le16
+#define out_le32 out_le32
+#define out_le64 out_le64
+
/*
* Low level IO stream instructions are defined out of line for now
*/
diff --git a/drivers/media/platform/fsl-viu.c b/drivers/media/platform/fsl-viu.c
index e41510ce69a40815..5d5e030c9c980647 100644
--- a/drivers/media/platform/fsl-viu.c
+++ b/drivers/media/platform/fsl-viu.c
@@ -37,8 +37,10 @@
#define VIU_VERSION "0.5.1"
/* Allow building this driver with COMPILE_TEST */
-#ifndef CONFIG_PPC
+#ifndef out_be32
#define out_be32(v, a) iowrite32be(a, (void __iomem *)v)
+#endif
+#ifndef in_be32
#define in_be32(a) ioread32be((void __iomem *)a)
#endif
--
2.7.4
^ permalink raw reply related
* Re: [v3, 07/10] fsl/fman_port: support getting timestamp
From: kbuild test robot @ 2018-06-08 8:39 UTC (permalink / raw)
To: Yangbo Lu
Cc: kbuild-all, netdev, madalin.bucur, Richard Cochran, Rob Herring,
Shawn Guo, David S . Miller, devicetree, linuxppc-dev,
linux-arm-kernel, linux-kernel, Yangbo Lu
In-Reply-To: <20180607092050.46128-8-yangbo.lu@nxp.com>
Hi Yangbo,
I love your patch! Perhaps something to improve:
[auto build test WARNING on net-next/master]
[also build test WARNING on next-20180608]
[cannot apply to v4.17]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Yangbo-Lu/Support-DPAA-PTP-clock-and-timestamping/20180608-131649
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__
sparse warnings: (new ones prefixed by >>)
drivers/net/ethernet/freescale/fman/fman_port.c:879:26: sparse: expression using sizeof(void)
drivers/net/ethernet/freescale/fman/fman_port.c:879:26: sparse: expression using sizeof(void)
drivers/net/ethernet/freescale/fman/fman_port.c:1035:36: sparse: expression using sizeof(void)
drivers/net/ethernet/freescale/fman/fman_port.c:1247:17: sparse: expression using sizeof(void)
drivers/net/ethernet/freescale/fman/fman_port.c:1249:17: sparse: expression using sizeof(void)
drivers/net/ethernet/freescale/fman/fman_port.c:1249:17: sparse: expression using sizeof(void)
>> drivers/net/ethernet/freescale/fman/fman_port.c:1739:19: sparse: cast to restricted __be64
>> drivers/net/ethernet/freescale/fman/fman_port.c:1739:19: sparse: cast to restricted __be64
>> drivers/net/ethernet/freescale/fman/fman_port.c:1739:19: sparse: cast to restricted __be64
>> drivers/net/ethernet/freescale/fman/fman_port.c:1739:19: sparse: cast to restricted __be64
>> drivers/net/ethernet/freescale/fman/fman_port.c:1739:19: sparse: cast to restricted __be64
>> drivers/net/ethernet/freescale/fman/fman_port.c:1739:19: sparse: cast to restricted __be64
>> drivers/net/ethernet/freescale/fman/fman_port.c:1739:19: sparse: cast to restricted __be64
>> drivers/net/ethernet/freescale/fman/fman_port.c:1739:19: sparse: cast to restricted __be64
>> drivers/net/ethernet/freescale/fman/fman_port.c:1739:19: sparse: cast to restricted __be64
>> drivers/net/ethernet/freescale/fman/fman_port.c:1739:19: sparse: cast to restricted __be64
vim +1739 drivers/net/ethernet/freescale/fman/fman_port.c
1733
1734 int fman_port_get_tstamp(struct fman_port *port, const void *data, u64 *tstamp)
1735 {
1736 if (port->buffer_offsets.time_stamp_offset == ILLEGAL_BASE)
1737 return -EINVAL;
1738
> 1739 *tstamp = be64_to_cpu(*(u64 *)(data +
1740 port->buffer_offsets.time_stamp_offset));
1741
1742 return 0;
1743 }
1744 EXPORT_SYMBOL(fman_port_get_tstamp);
1745
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
^ permalink raw reply
* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Christoph Hellwig @ 2018-06-08 6:36 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Christoph Hellwig, Anshuman Khandual, Ram Pai, robh, aik,
jasowang, linux-kernel, virtualization, joe, linuxppc-dev,
elfring, david, cohuck, pawel.moll, Tom Lendacky, Rustad, Mark D
In-Reply-To: <20180607185234-mutt-send-email-mst@kernel.org>
On Thu, Jun 07, 2018 at 07:28:35PM +0300, Michael S. Tsirkin wrote:
> Let me restate it: DMA API has support for a wide range of hardware, and
> hardware based virtio implementations likely won't benefit from all of
> it.
That is completely wrong. All aspects of the DMA API are about the
system they are used in. The CPU, the PCIe root complex, interconnects.
All the issues I mentioned in my previous mail exist in realy life
systems that you can plug virtio PCI or PCIe cards into.
> I'm not really sympathetic to people complaining that they can't even
> set a flag in qemu though. If that's the case the stack in question is
> way too inflexible.
The flag as defined in the spec is the wrong thing to set, because they
are not using an iommu. They probably don't even do any address
translation.
> > Both in the flag naming and the implementation there is an implication
> > of DMA API == IOMMU, which is fundamentally wrong.
>
> Maybe we need to extend the meaning of PLATFORM_IOMMU or rename it.
And the explanation.
>
> It's possible that some setups will benefit from a more
> fine-grained approach where some aspects of the DMA
> API are bypassed, others aren't.
Hell no. DMA API = abstraction for any possble platform wart.
We are not going to make this any more fine grained. It is bad
enough that virtio already has a mode bypassing any of this,
we are not going to make even more of a mess of it.
> This seems to be what was being asked for in this thread,
> with comments claiming IOMMU flag adds too much overhead.
Right now it means implementing a virtual iommu, which I agree is
way too much overhead.
>
> > The DMA API does a few different things:
> >
> > a) address translation
> >
> > This does include IOMMUs. But it also includes random offsets
> > between PCI bars and system memory that we see on various
> > platforms.
>
> I don't think you mean bars. That's unrelated to DMA.
Of course it matters. If the device always needs an offset in
the DMA addresses it is completely related to DMA.
For some examples take a look at:
arch/x86/pci/sta2x11-fixup.c
arch/mips/include/asm/mach-ath25/dma-coherence.h
or anything setting dma_pfn_offset.
> > Worse so some of these offsets might be based on
> > banks, e.g. on the broadcom bmips platform. It also deals
> > with bitmask in physical addresses related to memory encryption
> > like AMD SEV. I'd be really curious how for example the
> > Intel virtio based NIC is going to work on any of those
> > plaforms.
>
> SEV guys report that they just set the iommu flag and then it all works.
> I guess if there's translation we can think of this as a kind of iommu.
> Maybe we should rename PLATFORM_IOMMU to PLARTFORM_TRANSLATION?
VIRTIO_F_BEHAVES_LIKE_A_REAL_PCI_DEVICE_DONT_TRY_TO_OUTSMART_ME
as said it's not just translations, it is cache coherence as well.
> And apparently some people complain that just setting that flag makes
> qemu check translation on each access with an unacceptable performance
> overhead. Forcing same behaviour for everyone on general principles
> even without the flag is unlikely to make them happy.
That sounds like a qemu implementation bug. If qemu knowns that
guest physiscall == guest dma space there is no need to check.
> > b) coherency
> >
> > On many architectures DMA is not cache coherent, and we need
> > to invalidate and/or write back cache lines before doing
> > DMA. Again, I wonder how this is every going to work with
> > hardware based virtio implementations.
>
>
> You mean dma_Xmb and friends?
> There's a new feature VIRTIO_F_IO_BARRIER that's being proposed
> for that.
No. I mean the fact that PCI(e) devices often are not coherent with
the cache. So you need to writeback the cpu cache before transferring
data to the device, and invalidate the cpu cache before transferring
data from the device. Plus additional workarounds for speculation.
Looks at the implementations and comments around the dma_sync_*
calls.
>
> > Even worse I think this
> > is actually broken at least for VIVT event for virtualized
> > implementations. E.g. a KVM guest is going to access memory
> > using different virtual addresses than qemu, vhost might throw
> > in another different address space.
>
> I don't really know what VIVT is. Could you help me please?
Virtually indexed, virtually tagged. In short you must do cache
maintainance based on the virtual address used to fill the cache.
>
> > c) bounce buffering
> >
> > Many DMA implementations can not address all physical memory
> > due to addressing limitations. In such cases we copy the
> > DMA memory into a known addressable bounc buffer and DMA
> > from there.
>
> Don't do it then?
Because for example your PCIe root complex only supports 32-bit
addressing, but the memory buffer is outside the addressing range.
> > d) flushing write combining buffers or similar
> >
> > On some hardware platforms we need workarounds to e.g. read
> > from a certain mmio address to make sure DMA can actually
> > see memory written by the host.
>
> I guess it isn't an issue as long as WC isn't actually used.
> It will become an issue when virtio spec adds some WC capability -
> I suspect we can ignore this for now.
This is write combibining in the SOC with the root complex. Nothing
your can work around in the device or device driver.
^ permalink raw reply
* Re: [PATCH v3] powerpc: Add support for function error injection
From: Samuel Mendoza-Jonas @ 2018-06-08 6:52 UTC (permalink / raw)
To: Naveen N. Rao, Michael Ellerman; +Cc: linuxppc-dev
In-Reply-To: <20180607095202.29189-1-naveen.n.rao@linux.vnet.ibm.com>
On Thu, 2018-06-07 at 15:22 +0530, Naveen N. Rao wrote:
> We implement regs_set_return_value() and override_function_with_return()
> for this purpose.
>
> On powerpc, a return from a function (blr) just branches to the location
> contained in the link register. So, we can just update pt_regs rather
> than redirecting execution to a dummy function that returns.
>
> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
> ---
> The only change is to add a comment in override_function_with_return()
> to clarify that we don't need to worry about 32-bit userspace while
> emulating 'blr'.
Reviewed-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
>
> - Naveen
>
> arch/powerpc/Kconfig | 1 +
> arch/powerpc/include/asm/error-injection.h | 13 +++++++++++++
> arch/powerpc/include/asm/ptrace.h | 5 +++++
> arch/powerpc/lib/Makefile | 2 ++
> arch/powerpc/lib/error-inject.c | 16 ++++++++++++++++
> 5 files changed, 37 insertions(+)
> create mode 100644 arch/powerpc/include/asm/error-injection.h
> create mode 100644 arch/powerpc/lib/error-inject.c
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 076fe3094856..00dad3c759a0 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -187,6 +187,7 @@ config PPC
> select HAVE_EBPF_JIT if PPC64
> select HAVE_EFFICIENT_UNALIGNED_ACCESS if !(CPU_LITTLE_ENDIAN && POWER7_CPU)
> select HAVE_FTRACE_MCOUNT_RECORD
> + select HAVE_FUNCTION_ERROR_INJECTION
> select HAVE_FUNCTION_GRAPH_TRACER
> select HAVE_FUNCTION_TRACER
> select HAVE_GCC_PLUGINS
> diff --git a/arch/powerpc/include/asm/error-injection.h b/arch/powerpc/include/asm/error-injection.h
> new file mode 100644
> index 000000000000..740c3075bdf4
> --- /dev/null
> +++ b/arch/powerpc/include/asm/error-injection.h
> @@ -0,0 +1,13 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +
> +#ifndef _ASM_ERROR_INJECTION_H
> +#define _ASM_ERROR_INJECTION_H
> +
> +#include <linux/compiler.h>
> +#include <linux/linkage.h>
> +#include <asm/ptrace.h>
> +#include <asm-generic/error-injection.h>
> +
> +void override_function_with_return(struct pt_regs *regs);
> +
> +#endif /* _ASM_ERROR_INJECTION_H */
> diff --git a/arch/powerpc/include/asm/ptrace.h b/arch/powerpc/include/asm/ptrace.h
> index e4923686e43a..c0705296c2f0 100644
> --- a/arch/powerpc/include/asm/ptrace.h
> +++ b/arch/powerpc/include/asm/ptrace.h
> @@ -101,6 +101,11 @@ static inline long regs_return_value(struct pt_regs *regs)
> return -regs->gpr[3];
> }
>
> +static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc)
> +{
> + regs->gpr[3] = rc;
> +}
> +
> #ifdef __powerpc64__
> #define user_mode(regs) ((((regs)->msr) >> MSR_PR_LG) & 0x1)
> #else
> diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
> index d0ca13ad8231..dd43c5a53396 100644
> --- a/arch/powerpc/lib/Makefile
> +++ b/arch/powerpc/lib/Makefile
> @@ -14,6 +14,8 @@ obj-y += string.o alloc.o code-patching.o feature-fixups.o
>
> obj-$(CONFIG_PPC32) += div64.o copy_32.o crtsavres.o
>
> +obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
> +
> # See corresponding test in arch/powerpc/Makefile
> # 64-bit linker creates .sfpr on demand for final link (vmlinux),
> # so it is only needed for modules, and only for older linkers which
> diff --git a/arch/powerpc/lib/error-inject.c b/arch/powerpc/lib/error-inject.c
> new file mode 100644
> index 000000000000..407b992fb02f
> --- /dev/null
> +++ b/arch/powerpc/lib/error-inject.c
> @@ -0,0 +1,16 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +
> +#include <linux/error-injection.h>
> +#include <linux/kprobes.h>
> +#include <linux/uaccess.h>
> +
> +void override_function_with_return(struct pt_regs *regs)
> +{
> + /*
> + * Emulate 'blr'. 'regs' represents the state on entry of a predefined
> + * function in the kernel/module, captured on a kprobe. We don't need
> + * to worry about 32-bit userspace on a 64-bit kernel.
> + */
> + regs->nip = regs->link;
> +}
> +NOKPROBE_SYMBOL(override_function_with_return);
^ permalink raw reply
* Re: [v3 PATCH 2/5] powerpc/pseries: Fix endainness while restoring of r3 in MCE handler.
From: Michael Ellerman @ 2018-06-08 6:50 UTC (permalink / raw)
To: Mahesh J Salgaonkar, linuxppc-dev
Cc: stable, Aneesh Kumar K.V, Laurent Dufour, Nicholas Piggin
In-Reply-To: <152839249913.25118.1191250274945665204.stgit@jupiter.in.ibm.com>
Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>
> During Machine Check interrupt on pseries platform, register r3 points
> RTAS extended event log passed by hypervisor. Since hypervisor uses r3
> to pass pointer to rtas log, it stores the original r3 value at the
> start of the memory (first 8 bytes) pointed by r3. Since hypervisor
> stores this info and rtas log is in BE format, linux should make
> sure to restore r3 value in correct endian format.
Can we hit this under KVM? And if so what if the KVM/qemu is running
little endian, does it still write the value BE?
cheers
^ permalink raw reply
* Re: [v3 PATCH 5/5] powerpc/pseries: Display machine check error details.
From: Mahesh Jagannath Salgaonkar @ 2018-06-08 6:28 UTC (permalink / raw)
To: Nicholas Piggin
Cc: linuxppc-dev, Aneesh Kumar K.V, Michael Ellerman, Laurent Dufour
In-Reply-To: <20180608115136.7a6db415@roar.ozlabs.ibm.com>
On 06/08/2018 07:21 AM, Nicholas Piggin wrote:
> On Thu, 07 Jun 2018 22:59:04 +0530
> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:
>
>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>
>> Extract the MCE error details from RTAS extended log and display it to
>> console.
>>
>> With this patch you should now see mce logs like below:
>>
>> [ 142.371818] Severe Machine check interrupt [Recovered]
>> [ 142.371822] NIP [d00000000ca301b8]: init_module+0x1b8/0x338 [bork_kernel]
>> [ 142.371822] Initiator: CPU
>> [ 142.371823] Error type: SLB [Multihit]
>> [ 142.371824] Effective address: d00000000ca70000
>>
>> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>> ---
>> arch/powerpc/include/asm/rtas.h | 5 +
>> arch/powerpc/platforms/pseries/ras.c | 128 +++++++++++++++++++++++++++++++++-
>> 2 files changed, 131 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
>> index 3f2fba7ef23b..8100a95c133a 100644
>> --- a/arch/powerpc/include/asm/rtas.h
>> +++ b/arch/powerpc/include/asm/rtas.h
>> @@ -190,6 +190,11 @@ static inline uint8_t rtas_error_extended(const struct rtas_error_log *elog)
>> return (elog->byte1 & 0x04) >> 2;
>> }
>>
>> +static inline uint8_t rtas_error_initiator(const struct rtas_error_log *elog)
>> +{
>> + return (elog->byte2 & 0xf0) >> 4;
>> +}
>> +
>> #define rtas_error_type(x) ((x)->byte3)
>>
>> static inline
>> diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
>> index e56759d92356..cd9446980092 100644
>> --- a/arch/powerpc/platforms/pseries/ras.c
>> +++ b/arch/powerpc/platforms/pseries/ras.c
>> @@ -422,7 +422,130 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
>> return 0; /* need to perform reset */
>> }
>>
>> -static int mce_handle_error(struct rtas_error_log *errp)
>> +#define VAL_TO_STRING(ar, val) ((val < ARRAY_SIZE(ar)) ? ar[val] : "Unknown")
>> +
>> +static void pseries_print_mce_info(struct pt_regs *regs,
>> + struct rtas_error_log *errp, int disposition)
>> +{
>> + const char *level, *sevstr;
>> + struct pseries_errorlog *pseries_log;
>> + struct pseries_mc_errorlog *mce_log;
>> + uint8_t error_type, err_sub_type;
>> + uint8_t initiator = rtas_error_initiator(errp);
>> + uint64_t addr;
>> +
>> + static const char * const initiators[] = {
>> + "Unknown",
>> + "CPU",
>> + "PCI",
>> + "ISA",
>> + "Memory",
>> + "Power Mgmt",
>> + };
>> + static const char * const mc_err_types[] = {
>> + "UE",
>> + "SLB",
>> + "ERAT",
>> + "TLB",
>> + "D-Cache",
>> + "Unknown",
>> + "I-Cache",
>> + };
>> + static const char * const mc_ue_types[] = {
>> + "Indeterminate",
>> + "Instruction fetch",
>> + "Page table walk ifetch",
>> + "Load/Store",
>> + "Page table walk Load/Store",
>> + };
>> +
>> + /* SLB sub errors valid values are 0x0, 0x1, 0x2 */
>> + static const char * const mc_slb_types[] = {
>> + "Parity",
>> + "Multihit",
>> + "Indeterminate",
>> + };
>> +
>> + /* TLB and ERAT sub errors valid values are 0x1, 0x2, 0x3 */
>> + static const char * const mc_soft_types[] = {
>> + "Unknown",
>> + "Parity",
>> + "Multihit",
>> + "Indeterminate",
>> + };
>> +
>> + pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
>> + if (pseries_log == NULL)
>> + return;
>> +
>> + mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
>> +
>> + error_type = rtas_mc_error_type(mce_log);
>> + err_sub_type = rtas_mc_error_sub_type(mce_log);
>> +
>> + switch (rtas_error_severity(errp)) {
>> + case RTAS_SEVERITY_NO_ERROR:
>> + level = KERN_INFO;
>> + sevstr = "Harmless";
>> + break;
>> + case RTAS_SEVERITY_WARNING:
>> + level = KERN_WARNING;
>> + sevstr = "";
>> + break;
>> + case RTAS_SEVERITY_ERROR:
>> + case RTAS_SEVERITY_ERROR_SYNC:
>> + level = KERN_ERR;
>> + sevstr = "Severe";
>> + break;
>> + case RTAS_SEVERITY_FATAL:
>> + default:
>> + level = KERN_ERR;
>> + sevstr = "Fatal";
>> + break;
>> + }
>> +
>> + printk("%s%s Machine check interrupt [%s]\n", level, sevstr,
>> + disposition == RTAS_DISP_FULLY_RECOVERED ?
>> + "Recovered" : "Not recovered");
>> + if (user_mode(regs)) {
>> + printk("%s NIP: [%016lx] PID: %d Comm: %s\n", level,
>> + regs->nip, current->pid, current->comm);
>> + } else {
>> + printk("%s NIP [%016lx]: %pS\n", level, regs->nip,
>> + (void *)regs->nip);
>> + }
>
> I think it's probably still useful to print pid/comm for kernel mode
> faults if !in_interrupt()... I see you're basically taking kernel/mce.c
> and doing the same thing.
>
> Is there any reasonable way to share code here?
I did think of doing that, but I wanted make this patch series simple
enough to be able to make backport easy for very old kernels. I will
work on consolidating the code as enhancement later.
Thanks,
-Mahesh.
^ permalink raw reply
* Re: [v3 PATCH 4/5] powerpc/pseries: Dump and flush SLB contents on SLB MCE errors.
From: Mahesh Jagannath Salgaonkar @ 2018-06-08 6:19 UTC (permalink / raw)
To: Nicholas Piggin
Cc: linuxppc-dev, Aneesh Kumar K.V, Michael Ellerman, Laurent Dufour
In-Reply-To: <20180608114844.2b38d590@roar.ozlabs.ibm.com>
On 06/08/2018 07:18 AM, Nicholas Piggin wrote:
> On Thu, 07 Jun 2018 22:58:55 +0530
> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:
>
>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>
>> If we get a machine check exceptions due to SLB errors then dump the
>> current SLB contents which will be very much helpful in debugging the
>> root cause of SLB errors. On pseries, as of today system crashes on SLB
>> errors. These are soft errors and can be fixed by flushing the SLBs so
>> the kernel can continue to function instead of system crash. This patch
>> fixes that also.
>
> So pseries never flushed SLB and reloaded in response to multi hit
> errors? This seems like quite a good improvement then. I like
> dumping SLB too.
>
> It's a bit annoying we can't share the same code with xmon really,
> that's okay but I just suggest commenting them both if you take a
> copy like this with a note to keep them in synch if you re-post
> the series.
>
>>
>> With this patch the console will log SLB contents like below on SLB MCE
>> errors:
>>
>> [ 822.711728] slb contents:
>
> Suggest keeping the same format as the xmon dump (in particular
> CPU number, even though it's probably printed elsewhere in the MCE
> message it doesn't hurt.
Sure will do that and repost.
Thanks,
-Mahesh.
>
> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
>
> Thanks,
> Nick
>
>> [ 822.711730] 00 c000000008000000 400ea1b217000500
>> [ 822.711731] 1T ESID= c00000 VSID= ea1b217 LLP:100
>> [ 822.711732] 01 d000000008000000 400d43642f000510
>> [ 822.711733] 1T ESID= d00000 VSID= d43642f LLP:110
>> [ 822.711734] 09 f000000008000000 400a86c85f000500
>> [ 822.711736] 1T ESID= f00000 VSID= a86c85f LLP:100
>> [ 822.711737] 10 00007f0008000000 400d1f26e3000d90
>> [ 822.711738] 1T ESID= 7f VSID= d1f26e3 LLP:110
>> [ 822.711739] 11 0000000018000000 000e3615f520fd90
>> [ 822.711740] 256M ESID= 1 VSID= e3615f520f LLP:110
>> [ 822.711740] 12 d000000008000000 400d43642f000510
>> [ 822.711741] 1T ESID= d00000 VSID= d43642f LLP:110
>> [ 822.711742] 13 d000000008000000 400d43642f000510
>> [ 822.711743] 1T ESID= d00000 VSID= d43642f LLP:110
>>
>>
>> Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
>> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>> ---
>> arch/powerpc/include/asm/book3s/64/mmu-hash.h | 1 +
>> arch/powerpc/mm/slb.c | 35 +++++++++++++++++++++++++
>> arch/powerpc/platforms/pseries/ras.c | 29 ++++++++++++++++++++-
>> 3 files changed, 64 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> index 50ed64fba4ae..c0da68927235 100644
>> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> @@ -487,6 +487,7 @@ extern void hpte_init_native(void);
>>
>> extern void slb_initialize(void);
>> extern void slb_flush_and_rebolt(void);
>> +extern void slb_dump_contents(void);
>>
>> extern void slb_vmalloc_update(void);
>> extern void slb_set_size(u16 size);
>> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
>> index 66577cc66dc9..799aa117cec3 100644
>> --- a/arch/powerpc/mm/slb.c
>> +++ b/arch/powerpc/mm/slb.c
>> @@ -145,6 +145,41 @@ void slb_flush_and_rebolt(void)
>> get_paca()->slb_cache_ptr = 0;
>> }
>>
>> +void slb_dump_contents(void)
>> +{
>> + int i;
>> + unsigned long e, v;
>> + unsigned long llp;
>> +
>> + pr_err("slb contents:\n");
>> + for (i = 0; i < mmu_slb_size; i++) {
>> + asm volatile("slbmfee %0,%1" : "=r" (e) : "r" (i));
>> + asm volatile("slbmfev %0,%1" : "=r" (v) : "r" (i));
>> +
>> + if (!e && !v)
>> + continue;
>> +
>> + pr_err("%02d %016lx %016lx", i, e, v);
>> +
>> + if (!(e & SLB_ESID_V)) {
>> + pr_err("\n");
>> + continue;
>> + }
>> + llp = v & SLB_VSID_LLP;
>> + if (v & SLB_VSID_B_1T) {
>> + pr_err(" 1T ESID=%9lx VSID=%13lx LLP:%3lx\n",
>> + GET_ESID_1T(e),
>> + (v & ~SLB_VSID_B) >> SLB_VSID_SHIFT_1T,
>> + llp);
>> + } else {
>> + pr_err(" 256M ESID=%9lx VSID=%13lx LLP:%3lx\n",
>> + GET_ESID(e),
>> + (v & ~SLB_VSID_B) >> SLB_VSID_SHIFT,
>> + llp);
>> + }
>> + }
>> +}
>> +
>> void slb_vmalloc_update(void)
>> {
>> unsigned long vflags;
>> diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
>> index 2edc673be137..e56759d92356 100644
>> --- a/arch/powerpc/platforms/pseries/ras.c
>> +++ b/arch/powerpc/platforms/pseries/ras.c
>> @@ -422,6 +422,31 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
>> return 0; /* need to perform reset */
>> }
>>
>> +static int mce_handle_error(struct rtas_error_log *errp)
>> +{
>> + struct pseries_errorlog *pseries_log;
>> + struct pseries_mc_errorlog *mce_log;
>> + int disposition = rtas_error_disposition(errp);
>> + uint8_t error_type;
>> +
>> + pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
>> + if (pseries_log == NULL)
>> + goto out;
>> +
>> + mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
>> + error_type = rtas_mc_error_type(mce_log);
>> +
>> + if ((disposition == RTAS_DISP_NOT_RECOVERED) &&
>> + (error_type == PSERIES_MC_ERROR_TYPE_SLB)) {
>> + slb_dump_contents();
>> + slb_flush_and_rebolt();
>> + disposition = RTAS_DISP_FULLY_RECOVERED;
>> + }
>> +
>> +out:
>> + return disposition;
>> +}
>> +
>> /*
>> * See if we can recover from a machine check exception.
>> * This is only called on power4 (or above) and only via
>> @@ -434,7 +459,9 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
>> static int recover_mce(struct pt_regs *regs, struct rtas_error_log *err)
>> {
>> int recovered = 0;
>> - int disposition = rtas_error_disposition(err);
>> + int disposition;
>> +
>> + disposition = mce_handle_error(err);
>>
>> if (!(regs->msr & MSR_RI)) {
>> /* If MSR_RI isn't set, we cannot recover */
>>
>
^ permalink raw reply
* Re: [v3 PATCH 1/5] powerpc/pseries: convert rtas_log_buf to linear allocation.
From: Mahesh Jagannath Salgaonkar @ 2018-06-08 6:16 UTC (permalink / raw)
To: Nicholas Piggin
Cc: linuxppc-dev, stable, Aneesh Kumar K.V, Michael Ellerman,
Laurent Dufour
In-Reply-To: <20180608113137.7d409c6c@roar.ozlabs.ibm.com>
On 06/08/2018 07:01 AM, Nicholas Piggin wrote:
> On Thu, 07 Jun 2018 22:58:11 +0530
> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:
>
>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>
>> rtas_log_buf is a buffer to hold RTAS event data that are communicated
>> to kernel by hypervisor. This buffer is then used to pass RTAS event
>> data to user through proc fs. This buffer is allocated from vmalloc
>> (non-linear mapping) area.
>>
>> On Machine check interrupt, register r3 points to RTAS extended event
>> log passed by hypervisor that contains the MCE event. The pseries
>> machine check handler then logs this error into rtas_log_buf. The
>> rtas_log_buf is a vmalloc-ed (non-linear) buffer we end up taking up a
>> page fault (vector 0x300) while accessing it. Since machine check
>> interrupt handler runs in NMI context we can not afford to take any
>> page fault. Page faults are not honored in NMI context and causes
>> kernel panic. This patch fixes this issue by allocating rtas_log_buf
>> using kmalloc.
>>
>> Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable interrupt")
>> Cc: stable@vger.kernel.org
>> Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>> ---
>> arch/powerpc/kernel/rtasd.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c
>> index f915db93cd42..3957d4ae2ba2 100644
>> --- a/arch/powerpc/kernel/rtasd.c
>> +++ b/arch/powerpc/kernel/rtasd.c
>> @@ -559,7 +559,7 @@ static int __init rtas_event_scan_init(void)
>> rtas_error_log_max = rtas_get_error_log_max();
>> rtas_error_log_buffer_max = rtas_error_log_max + sizeof(int);
>>
>> - rtas_log_buf = vmalloc(rtas_error_log_buffer_max*LOG_NUMBER);
>> + rtas_log_buf = kmalloc(rtas_error_log_buffer_max*LOG_NUMBER, GFP_KERNEL);
>
> Does this have to be in the RMA region if it's to be accessed with
> relocation off in the guest?
Nope not required. It never gets accessed with relocation off.
>
> A comment about it being accessed with relocation off might be helpful
> too.
Sure.
Thanks,
-Mahesh.
^ permalink raw reply
* Re: pkeys on POWER: Access rights not reset on execve
From: Florian Weimer @ 2018-06-08 5:53 UTC (permalink / raw)
To: Ram Pai; +Cc: Linux-MM, linuxppc-dev, Andy Lutomirski, Dave Hansen
In-Reply-To: <20180608023441.GA5573@ram.oc3035372033.ibm.com>
On 06/08/2018 04:34 AM, Ram Pai wrote:
>>
>> So the remaining question at this point is whether the Intel
>> behavior (default-deny instead of default-allow) is preferable.
>
> Florian, remind me what behavior needs to fixed?
See the other thread. The Intel register equivalent to the AMR by
default disallows access to yet-unallocated keys, so that threads which
are created before key allocation do not magically gain access to a key
allocated by another thread.
Thanks,
Florian
^ permalink raw reply
* [PATCH kernel 6/6] powerpc/powernv/ioda: Allocate indirect TCE levels on demand
From: Alexey Kardashevskiy @ 2018-06-08 5:46 UTC (permalink / raw)
To: linuxppc-dev
Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, kvm, Alex Williamson,
Benjamin Herrenschmidt
In-Reply-To: <20180608054633.18659-1-aik@ozlabs.ru>
At the moment we allocate the entire TCE table, twice (hardware part and
userspace translation cache). This normally works as we normally have
contigous memory and the guest will map entire RAM for 64bit DMA.
However if we have sparse RAM (one example is a memory device), then
we will allocate TCEs which will never be used as the guest only maps
actual memory for DMA. If it is a single level TCE table, there is nothing
we can really do but if it a multilevel table, we can skip allocating
TCEs we know we won't need.
This adds ability to allocate only first level, saving memory.
This changes iommu_table::free() to avoid allocating of an extra level;
iommu_table::set() will do this when needed.
This adds @alloc parameter to iommu_table::exchange() to tell the callback
if it can allocate an extra level; the flag is set to "false" for
the realmode KVM handlers of H_PUT_TCE hcalls and the callback returns
H_TOO_HARD.
This still requires the entire table to be counted in mm::locked_vm.
To be conservative, this only does on-demand allocation when
the usespace cache table is requested which is the case of VFIO.
The example math for a system replicating a powernv setup with NVLink2
in a guest:
16GB RAM mapped at 0x0
128GB GPU RAM window (16GB of actual RAM) mapped at 0x244000000000
the table to cover that all with 64K pages takes:
(((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4556MB
If we allocate only necessary TCE levels, we will only need:
(((0x400000000 + 0x400000000) >> 16)*8)>>20 = 4MB (plus some for indirect
levels).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
arch/powerpc/include/asm/iommu.h | 7 ++-
arch/powerpc/platforms/powernv/pci.h | 6 ++-
arch/powerpc/kvm/book3s_64_vio_hv.c | 4 +-
arch/powerpc/platforms/powernv/pci-ioda-tce.c | 69 ++++++++++++++++++++-------
arch/powerpc/platforms/powernv/pci-ioda.c | 8 ++--
drivers/vfio/vfio_iommu_spapr_tce.c | 2 +-
6 files changed, 69 insertions(+), 27 deletions(-)
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 4bdcf22..daa3ee5 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -70,7 +70,7 @@ struct iommu_table_ops {
unsigned long *hpa,
enum dma_data_direction *direction);
- __be64 *(*useraddrptr)(struct iommu_table *tbl, long index);
+ __be64 *(*useraddrptr)(struct iommu_table *tbl, long index, bool alloc);
#endif
void (*clear)(struct iommu_table *tbl,
long index, long npages);
@@ -122,10 +122,13 @@ struct iommu_table {
__be64 *it_userspace; /* userspace view of the table */
struct iommu_table_ops *it_ops;
struct kref it_kref;
+ int it_nid;
};
+#define IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry) \
+ ((tbl)->it_ops->useraddrptr((tbl), (entry), false))
#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
- ((tbl)->it_ops->useraddrptr((tbl), (entry)))
+ ((tbl)->it_ops->useraddrptr((tbl), (entry), true))
/* Pure 2^n version of get_order */
static inline __attribute_const__
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 5e02408..1fa5590 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -267,8 +267,10 @@ extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
unsigned long attrs);
extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
- unsigned long *hpa, enum dma_data_direction *direction);
-extern __be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index);
+ unsigned long *hpa, enum dma_data_direction *direction,
+ bool alloc);
+extern __be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index,
+ bool alloc);
extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
extern long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index db0490c..05b4865 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -200,7 +200,7 @@ static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
{
struct mm_iommu_table_group_mem_t *mem = NULL;
const unsigned long pgsize = 1ULL << tbl->it_page_shift;
- __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+ __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry);
if (!pua)
/* it_userspace allocation might be delayed */
@@ -264,7 +264,7 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
{
long ret;
unsigned long hpa = 0;
- __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+ __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry);
struct mm_iommu_table_group_mem_t *mem;
if (!pua)
diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
index 36c2eb0..a7debfb 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
@@ -48,7 +48,7 @@ static __be64 *pnv_alloc_tce_level(int nid, unsigned int shift)
return addr;
}
-static __be64 *pnv_tce(struct iommu_table *tbl, bool user, long idx)
+static __be64 *pnv_tce(struct iommu_table *tbl, bool user, long idx, bool alloc)
{
__be64 *tmp = user ? tbl->it_userspace : (__be64 *) tbl->it_base;
int level = tbl->it_indirect_levels;
@@ -57,7 +57,20 @@ static __be64 *pnv_tce(struct iommu_table *tbl, bool user, long idx)
while (level) {
int n = (idx & mask) >> (level * shift);
- unsigned long tce = be64_to_cpu(tmp[n]);
+ unsigned long tce;
+
+ if (tmp[n] == 0) {
+ __be64 *tmp2;
+
+ if (!alloc)
+ return NULL;
+
+ tmp2 = pnv_alloc_tce_level(tbl->it_nid,
+ ilog2(tbl->it_level_size) + 3);
+ tmp[n] = cpu_to_be64(__pa(tmp2) |
+ TCE_PCI_READ | TCE_PCI_WRITE);
+ }
+ tce = be64_to_cpu(tmp[n]);
tmp = __va(tce & ~(TCE_PCI_READ | TCE_PCI_WRITE));
idx &= ~mask;
@@ -84,7 +97,7 @@ int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
((rpn + i) << tbl->it_page_shift);
unsigned long idx = index - tbl->it_offset + i;
- *(pnv_tce(tbl, false, idx)) = cpu_to_be64(newtce);
+ *(pnv_tce(tbl, false, idx, true)) = cpu_to_be64(newtce);
}
return 0;
@@ -92,31 +105,45 @@ int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
#ifdef CONFIG_IOMMU_API
int pnv_tce_xchg(struct iommu_table *tbl, long index,
- unsigned long *hpa, enum dma_data_direction *direction)
+ unsigned long *hpa, enum dma_data_direction *direction,
+ bool alloc)
{
u64 proto_tce = iommu_direction_to_tce_perm(*direction);
unsigned long newtce = *hpa | proto_tce, oldtce;
unsigned long idx = index - tbl->it_offset;
+ __be64 *ptce;
BUG_ON(*hpa & ~IOMMU_PAGE_MASK(tbl));
if (newtce & TCE_PCI_WRITE)
newtce |= TCE_PCI_READ;
- oldtce = be64_to_cpu(xchg(pnv_tce(tbl, false, idx),
- cpu_to_be64(newtce)));
+ ptce = pnv_tce(tbl, false, idx, alloc);
+ if (!ptce) {
+ if (*direction == DMA_NONE) {
+ *hpa = 0;
+ return 0;
+ }
+ /* It is likely to be realmode */
+ if (!alloc)
+ return H_TOO_HARD;
+
+ return H_HARDWARE;
+ }
+
+ oldtce = be64_to_cpu(xchg(ptce, cpu_to_be64(newtce)));
*hpa = oldtce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
*direction = iommu_tce_direction(oldtce);
return 0;
}
-__be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index)
+__be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index, bool alloc)
{
if (WARN_ON_ONCE(!tbl->it_userspace))
return NULL;
- return pnv_tce(tbl, true, index - tbl->it_offset);
+ return pnv_tce(tbl, true, index - tbl->it_offset, alloc);
}
#endif
@@ -126,14 +153,19 @@ void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
for (i = 0; i < npages; i++) {
unsigned long idx = index - tbl->it_offset + i;
+ __be64 *ptce = pnv_tce(tbl, false, idx, false);
- *(pnv_tce(tbl, false, idx)) = cpu_to_be64(0);
+ if (ptce)
+ *ptce = cpu_to_be64(0);
}
}
unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
{
- __be64 *ptce = pnv_tce(tbl, false, index - tbl->it_offset);
+ __be64 *ptce = pnv_tce(tbl, false, index - tbl->it_offset, false);
+
+ if (!ptce)
+ return 0;
return be64_to_cpu(*ptce);
}
@@ -224,6 +256,7 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
unsigned int table_shift = max_t(unsigned int, entries_shift + 3,
PAGE_SHIFT);
const unsigned long tce_table_size = 1UL << table_shift;
+ unsigned int tmplevels = levels;
if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS))
return -EINVAL;
@@ -231,6 +264,9 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
if (!is_power_of_2(window_size))
return -EINVAL;
+ if (alloc_userspace_copy && (window_size > (1ULL << 32)))
+ tmplevels = 1;
+
/* Adjust direct table size from window_size and levels */
entries_shift = (entries_shift + levels - 1) / levels;
level_shift = entries_shift + 3;
@@ -241,7 +277,7 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
/* Allocate TCE table */
addr = pnv_pci_ioda2_table_do_alloc_pages(nid, level_shift,
- levels, tce_table_size, &offset, &total_allocated);
+ tmplevels, tce_table_size, &offset, &total_allocated);
/* addr==NULL means that the first level allocation failed */
if (!addr)
@@ -252,7 +288,7 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
* we did not allocate as much as we wanted,
* release partially allocated table.
*/
- if (offset < tce_table_size)
+ if (tmplevels == levels && offset < tce_table_size)
goto free_tces_exit;
/* Allocate userspace view of the TCE table */
@@ -263,8 +299,8 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
&total_allocated_uas);
if (!uas)
goto free_tces_exit;
- if (offset < tce_table_size ||
- total_allocated_uas != total_allocated)
+ if (tmplevels == levels && (offset < tce_table_size ||
+ total_allocated_uas != total_allocated))
goto free_uas_exit;
}
@@ -275,10 +311,11 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
tbl->it_indirect_levels = levels - 1;
tbl->it_allocated_size = total_allocated;
tbl->it_userspace = uas;
+ tbl->it_nid = nid;
- pr_debug("Created TCE table: ws=%08llx ts=%lx @%08llx base=%lx uas=%p levels=%d\n",
+ pr_debug("Created TCE table: ws=%08llx ts=%lx @%08llx base=%lx uas=%p levels=%d/%d\n",
window_size, tce_table_size, bus_offset, tbl->it_base,
- tbl->it_userspace, levels);
+ tbl->it_userspace, tmplevels, levels);
return 0;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index c61c04d..d9df620 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2010,7 +2010,7 @@ static int pnv_ioda1_tce_build(struct iommu_table *tbl, long index,
static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index,
unsigned long *hpa, enum dma_data_direction *direction)
{
- long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+ long ret = pnv_tce_xchg(tbl, index, hpa, direction, true);
if (!ret)
pnv_pci_p7ioc_tce_invalidate(tbl, index, 1, false);
@@ -2021,7 +2021,7 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index,
static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index,
unsigned long *hpa, enum dma_data_direction *direction)
{
- long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+ long ret = pnv_tce_xchg(tbl, index, hpa, direction, false);
if (!ret)
pnv_pci_p7ioc_tce_invalidate(tbl, index, 1, true);
@@ -2175,7 +2175,7 @@ static int pnv_ioda2_tce_build(struct iommu_table *tbl, long index,
static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index,
unsigned long *hpa, enum dma_data_direction *direction)
{
- long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+ long ret = pnv_tce_xchg(tbl, index, hpa, direction, true);
if (!ret)
pnv_pci_ioda2_tce_invalidate(tbl, index, 1, false);
@@ -2186,7 +2186,7 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index,
static int pnv_ioda2_tce_xchg_rm(struct iommu_table *tbl, long index,
unsigned long *hpa, enum dma_data_direction *direction)
{
- long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+ long ret = pnv_tce_xchg(tbl, index, hpa, direction, false);
if (!ret)
pnv_pci_ioda2_tce_invalidate(tbl, index, 1, true);
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 628a948..f040ab1 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -639,7 +639,7 @@ static long tce_iommu_create_table(struct tce_container *container,
page_shift, window_size, levels, ptbl);
WARN_ON(!ret && !(*ptbl)->it_ops->free);
- WARN_ON(!ret && ((*ptbl)->it_allocated_size != table_size));
+ WARN_ON(!ret && ((*ptbl)->it_allocated_size > table_size));
return ret;
}
--
2.11.0
^ permalink raw reply related
* [PATCH kernel 0/6] powerpc/powernv/iommu: Optimize memory use
From: Alexey Kardashevskiy @ 2018-06-08 5:46 UTC (permalink / raw)
To: linuxppc-dev
Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, kvm, Alex Williamson,
Benjamin Herrenschmidt
This patchset aims to reduce actual memory use for guests with
sparse memory. The pseries guest uses dynamic DMA windows to map
the entire guest RAM but it only actually maps onlined memory
which may be not be contiguous. I hit this when tried passing
through NVLink2-connected GPU RAM of NVIDIA V100 and trying to
map this RAM at the same offset as in the real hardware
forced me to rework I handle these windows.
This moves userspace-to-host-physical translation table
(iommu_table::it_userspace) from VFIO TCE IOMMU subdriver to
the platform code and reuses the already existing multilevel
TCE table code which we have for the hardware tables.
At last in 6/6 I switch to on-demand allocation so we do not
allocate huge chunks of the table if we do not have to;
there is some math in 6/6.
Please comment. Thanks.
Alexey Kardashevskiy (6):
powerpc/powernv: Remove useless wrapper
powerpc/powernv: Move TCE manupulation code to its own file
KVM: PPC: Make iommu_table::it_userspace big endian
powerpc/powernv: Add indirect levels to it_userspace
powerpc/powernv: Rework TCE level allocation
powerpc/powernv/ioda: Allocate indirect TCE levels on demand
arch/powerpc/platforms/powernv/Makefile | 2 +-
arch/powerpc/include/asm/iommu.h | 11 +-
arch/powerpc/platforms/powernv/pci.h | 44 ++-
arch/powerpc/kvm/book3s_64_vio.c | 11 +-
arch/powerpc/kvm/book3s_64_vio_hv.c | 18 +-
arch/powerpc/platforms/powernv/pci-ioda-tce.c | 395 ++++++++++++++++++++++++++
arch/powerpc/platforms/powernv/pci-ioda.c | 192 ++-----------
arch/powerpc/platforms/powernv/pci.c | 158 -----------
drivers/vfio/vfio_iommu_spapr_tce.c | 65 +----
9 files changed, 482 insertions(+), 414 deletions(-)
create mode 100644 arch/powerpc/platforms/powernv/pci-ioda-tce.c
--
2.11.0
^ permalink raw reply
* [PATCH kernel 5/6] powerpc/powernv: Rework TCE level allocation
From: Alexey Kardashevskiy @ 2018-06-08 5:46 UTC (permalink / raw)
To: linuxppc-dev
Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, kvm, Alex Williamson,
Benjamin Herrenschmidt
In-Reply-To: <20180608054633.18659-1-aik@ozlabs.ru>
This moves actual pages allocation to a separate function which is going
to be reused later in on-demand TCE allocation.
While we are at it, remove unnecessary level size round up as the caller
does this already.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
arch/powerpc/platforms/powernv/pci-ioda-tce.c | 30 +++++++++++++++++----------
1 file changed, 19 insertions(+), 11 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
index f14b282..36c2eb0 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
@@ -31,6 +31,23 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
tbl->it_type = TCE_PCI;
}
+static __be64 *pnv_alloc_tce_level(int nid, unsigned int shift)
+{
+ struct page *tce_mem = NULL;
+ __be64 *addr;
+
+ tce_mem = alloc_pages_node(nid, GFP_KERNEL, shift - PAGE_SHIFT);
+ if (!tce_mem) {
+ pr_err("Failed to allocate a TCE memory, level shift=%d\n",
+ shift);
+ return NULL;
+ }
+ addr = page_address(tce_mem);
+ memset(addr, 0, 1UL << shift);
+
+ return addr;
+}
+
static __be64 *pnv_tce(struct iommu_table *tbl, bool user, long idx)
{
__be64 *tmp = user ? tbl->it_userspace : (__be64 *) tbl->it_base;
@@ -165,21 +182,12 @@ static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned int shift,
unsigned int levels, unsigned long limit,
unsigned long *current_offset, unsigned long *total_allocated)
{
- struct page *tce_mem = NULL;
__be64 *addr, *tmp;
- unsigned int order = max_t(unsigned int, shift, PAGE_SHIFT) -
- PAGE_SHIFT;
- unsigned long allocated = 1UL << (order + PAGE_SHIFT);
+ unsigned long allocated = 1UL << shift;
unsigned int entries = 1UL << (shift - 3);
long i;
- tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
- if (!tce_mem) {
- pr_err("Failed to allocate a TCE memory, order=%d\n", order);
- return NULL;
- }
- addr = page_address(tce_mem);
- memset(addr, 0, allocated);
+ addr = pnv_alloc_tce_level(nid, shift);
*total_allocated += allocated;
--levels;
--
2.11.0
^ permalink raw reply related
* [PATCH kernel 4/6] powerpc/powernv: Add indirect levels to it_userspace
From: Alexey Kardashevskiy @ 2018-06-08 5:46 UTC (permalink / raw)
To: linuxppc-dev
Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, kvm, Alex Williamson,
Benjamin Herrenschmidt
In-Reply-To: <20180608054633.18659-1-aik@ozlabs.ru>
We want to support sparse memory and therefore huge chunks of DMA windows
do not need to be mapped. If a DMA window big enough to require 2 or more
indirect levels, and a DMA window is used to map all RAM (which is
a default case for 64bit window), we can actually save some memory by
not allocation TCE for regions which we are not going to map anyway.
The hardware tables alreary support indirect levels but we also keep
host-physical-to-userspace translation array which is allocated by
vmalloc() and is a flat array which might use quite some memory.
This converts it_userspace from vmalloc'ed array to a multi level table.
As the format becomes platform dependend, this replaces the direct access
to it_usespace with a iommu_table_ops::useraddrptr hook which returns
a pointer to the userspace copy of a TCE; future extension will return
NULL if the level was not allocated.
This should not change non-KVM handling of TCE tables and it_userspace
will not be allocated for non-KVM tables.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
arch/powerpc/include/asm/iommu.h | 6 +--
arch/powerpc/platforms/powernv/pci.h | 3 +-
arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ----
arch/powerpc/platforms/powernv/pci-ioda-tce.c | 65 +++++++++++++++++++++------
arch/powerpc/platforms/powernv/pci-ioda.c | 31 ++++++++++---
drivers/vfio/vfio_iommu_spapr_tce.c | 46 -------------------
6 files changed, 81 insertions(+), 78 deletions(-)
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 803ac70..4bdcf22 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -69,6 +69,8 @@ struct iommu_table_ops {
long index,
unsigned long *hpa,
enum dma_data_direction *direction);
+
+ __be64 *(*useraddrptr)(struct iommu_table *tbl, long index);
#endif
void (*clear)(struct iommu_table *tbl,
long index, long npages);
@@ -123,9 +125,7 @@ struct iommu_table {
};
#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
- ((tbl)->it_userspace ? \
- &((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : \
- NULL)
+ ((tbl)->it_ops->useraddrptr((tbl), (entry)))
/* Pure 2^n version of get_order */
static inline __attribute_const__
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index f507baf..5e02408 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -268,11 +268,12 @@ extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
unsigned long *hpa, enum dma_data_direction *direction);
+extern __be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index);
extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
extern long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
__u32 page_shift, __u64 window_size, __u32 levels,
- struct iommu_table *tbl);
+ bool alloc_userspace_copy, struct iommu_table *tbl);
extern void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl);
extern long pnv_pci_link_table_and_group(int node, int num,
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 18109f3..db0490c 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -206,10 +206,6 @@ static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
/* it_userspace allocation might be delayed */
return H_TOO_HARD;
- pua = (void *) vmalloc_to_phys(pua);
- if (WARN_ON_ONCE_RM(!pua))
- return H_HARDWARE;
-
mem = mm_iommu_lookup_rm(kvm->mm, be64_to_cpu(*pua), pgsize);
if (!mem)
return H_TOO_HARD;
@@ -282,10 +278,6 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
return H_HARDWARE;
- pua = (void *) vmalloc_to_phys(pua);
- if (WARN_ON_ONCE_RM(!pua))
- return H_HARDWARE;
-
if (WARN_ON_ONCE_RM(mm_iommu_mapped_inc(mem)))
return H_CLOSED;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
index 700ceb1..f14b282 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
@@ -31,9 +31,9 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
tbl->it_type = TCE_PCI;
}
-static __be64 *pnv_tce(struct iommu_table *tbl, long idx)
+static __be64 *pnv_tce(struct iommu_table *tbl, bool user, long idx)
{
- __be64 *tmp = ((__be64 *)tbl->it_base);
+ __be64 *tmp = user ? tbl->it_userspace : (__be64 *) tbl->it_base;
int level = tbl->it_indirect_levels;
const long shift = ilog2(tbl->it_level_size);
unsigned long mask = (tbl->it_level_size - 1) << (level * shift);
@@ -67,7 +67,7 @@ int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
((rpn + i) << tbl->it_page_shift);
unsigned long idx = index - tbl->it_offset + i;
- *(pnv_tce(tbl, idx)) = cpu_to_be64(newtce);
+ *(pnv_tce(tbl, false, idx)) = cpu_to_be64(newtce);
}
return 0;
@@ -86,12 +86,21 @@ int pnv_tce_xchg(struct iommu_table *tbl, long index,
if (newtce & TCE_PCI_WRITE)
newtce |= TCE_PCI_READ;
- oldtce = be64_to_cpu(xchg(pnv_tce(tbl, idx), cpu_to_be64(newtce)));
+ oldtce = be64_to_cpu(xchg(pnv_tce(tbl, false, idx),
+ cpu_to_be64(newtce)));
*hpa = oldtce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
*direction = iommu_tce_direction(oldtce);
return 0;
}
+
+__be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index)
+{
+ if (WARN_ON_ONCE(!tbl->it_userspace))
+ return NULL;
+
+ return pnv_tce(tbl, true, index - tbl->it_offset);
+}
#endif
void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
@@ -101,13 +110,15 @@ void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
for (i = 0; i < npages; i++) {
unsigned long idx = index - tbl->it_offset + i;
- *(pnv_tce(tbl, idx)) = cpu_to_be64(0);
+ *(pnv_tce(tbl, false, idx)) = cpu_to_be64(0);
}
}
unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
{
- return be64_to_cpu(*(pnv_tce(tbl, index - tbl->it_offset)));
+ __be64 *ptce = pnv_tce(tbl, false, index - tbl->it_offset);
+
+ return be64_to_cpu(*ptce);
}
static void pnv_pci_ioda2_table_do_free_pages(__be64 *addr,
@@ -144,6 +155,10 @@ void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl)
pnv_pci_ioda2_table_do_free_pages((__be64 *)tbl->it_base, size,
tbl->it_indirect_levels);
+ if (tbl->it_userspace) {
+ pnv_pci_ioda2_table_do_free_pages(tbl->it_userspace, size,
+ tbl->it_indirect_levels);
+ }
}
static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned int shift,
@@ -191,10 +206,11 @@ static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned int shift,
long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
__u32 page_shift, __u64 window_size, __u32 levels,
- struct iommu_table *tbl)
+ bool alloc_userspace_copy, struct iommu_table *tbl)
{
- void *addr;
+ void *addr, *uas = NULL;
unsigned long offset = 0, level_shift, total_allocated = 0;
+ unsigned long total_allocated_uas = 0;
const unsigned int window_shift = ilog2(window_size);
unsigned int entries_shift = window_shift - page_shift;
unsigned int table_shift = max_t(unsigned int, entries_shift + 3,
@@ -228,10 +244,20 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
* we did not allocate as much as we wanted,
* release partially allocated table.
*/
- if (offset < tce_table_size) {
- pnv_pci_ioda2_table_do_free_pages(addr,
- 1ULL << (level_shift - 3), levels - 1);
- return -ENOMEM;
+ if (offset < tce_table_size)
+ goto free_tces_exit;
+
+ /* Allocate userspace view of the TCE table */
+ if (alloc_userspace_copy) {
+ offset = 0;
+ uas = pnv_pci_ioda2_table_do_alloc_pages(nid, level_shift,
+ levels, tce_table_size, &offset,
+ &total_allocated_uas);
+ if (!uas)
+ goto free_tces_exit;
+ if (offset < tce_table_size ||
+ total_allocated_uas != total_allocated)
+ goto free_uas_exit;
}
/* Setup linux iommu table */
@@ -240,11 +266,22 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
tbl->it_level_size = 1ULL << (level_shift - 3);
tbl->it_indirect_levels = levels - 1;
tbl->it_allocated_size = total_allocated;
+ tbl->it_userspace = uas;
- pr_devel("Created TCE table: ws=%08llx ts=%lx @%08llx\n",
- window_size, tce_table_size, bus_offset);
+ pr_debug("Created TCE table: ws=%08llx ts=%lx @%08llx base=%lx uas=%p levels=%d\n",
+ window_size, tce_table_size, bus_offset, tbl->it_base,
+ tbl->it_userspace, levels);
return 0;
+
+free_uas_exit:
+ pnv_pci_ioda2_table_do_free_pages(uas,
+ 1ULL << (level_shift - 3), levels - 1);
+free_tces_exit:
+ pnv_pci_ioda2_table_do_free_pages(addr,
+ 1ULL << (level_shift - 3), levels - 1);
+
+ return -ENOMEM;
}
static void pnv_iommu_table_group_link_free(struct rcu_head *head)
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 9577059..c61c04d 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2043,6 +2043,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
#ifdef CONFIG_IOMMU_API
.exchange = pnv_ioda1_tce_xchg,
.exchange_rm = pnv_ioda1_tce_xchg_rm,
+ .useraddrptr = pnv_tce_useraddrptr,
#endif
.clear = pnv_ioda1_tce_free,
.get = pnv_tce_get,
@@ -2207,6 +2208,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
#ifdef CONFIG_IOMMU_API
.exchange = pnv_ioda2_tce_xchg,
.exchange_rm = pnv_ioda2_tce_xchg_rm,
+ .useraddrptr = pnv_tce_useraddrptr,
#endif
.clear = pnv_ioda2_tce_free,
.get = pnv_tce_get,
@@ -2460,9 +2462,9 @@ void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
pe->tce_bypass_enabled = enable;
}
-static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
+static long pnv_pci_ioda2_do_create_table(struct iommu_table_group *table_group,
int num, __u32 page_shift, __u64 window_size, __u32 levels,
- struct iommu_table **ptbl)
+ bool alloc_userspace_copy, struct iommu_table **ptbl)
{
struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
table_group);
@@ -2479,7 +2481,7 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
ret = pnv_pci_ioda2_table_alloc_pages(nid,
bus_offset, page_shift, window_size,
- levels, tbl);
+ levels, alloc_userspace_copy, tbl);
if (ret) {
iommu_tce_table_put(tbl);
return ret;
@@ -2599,7 +2601,24 @@ static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
tce_table_size, direct_table_size);
}
- return bytes;
+ return bytes + bytes; /* one for HW table, one for userspace copy */
+}
+
+static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
+ int num, __u32 page_shift, __u64 window_size, __u32 levels,
+ struct iommu_table **ptbl)
+{
+ return pnv_pci_ioda2_do_create_table(table_group,
+ num, page_shift, window_size, levels, false, ptbl);
+}
+
+static long pnv_pci_ioda2_create_table_userspace(
+ struct iommu_table_group *table_group,
+ int num, __u32 page_shift, __u64 window_size, __u32 levels,
+ struct iommu_table **ptbl)
+{
+ return pnv_pci_ioda2_do_create_table(table_group,
+ num, page_shift, window_size, levels, true, ptbl);
}
static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
@@ -2628,7 +2647,7 @@ static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
.get_table_size = pnv_pci_ioda2_get_table_size,
- .create_table = pnv_pci_ioda2_create_table,
+ .create_table = pnv_pci_ioda2_create_table_userspace,
.set_window = pnv_pci_ioda2_set_window,
.unset_window = pnv_pci_ioda2_unset_window,
.take_ownership = pnv_ioda2_take_ownership,
@@ -2733,7 +2752,7 @@ static void pnv_ioda2_npu_take_ownership(struct iommu_table_group *table_group)
static struct iommu_table_group_ops pnv_pci_ioda2_npu_ops = {
.get_table_size = pnv_pci_ioda2_get_table_size,
- .create_table = pnv_pci_ioda2_create_table,
+ .create_table = pnv_pci_ioda2_create_table_userspace,
.set_window = pnv_pci_ioda2_npu_set_window,
.unset_window = pnv_pci_ioda2_npu_unset_window,
.take_ownership = pnv_ioda2_npu_take_ownership,
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 81f48114..628a948 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -212,44 +212,6 @@ static long tce_iommu_register_pages(struct tce_container *container,
return 0;
}
-static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl,
- struct mm_struct *mm)
-{
- unsigned long cb = _ALIGN_UP(sizeof(tbl->it_userspace[0]) *
- tbl->it_size, PAGE_SIZE);
- unsigned long *uas;
- long ret;
-
- BUG_ON(tbl->it_userspace);
-
- ret = try_increment_locked_vm(mm, cb >> PAGE_SHIFT);
- if (ret)
- return ret;
-
- uas = vzalloc(cb);
- if (!uas) {
- decrement_locked_vm(mm, cb >> PAGE_SHIFT);
- return -ENOMEM;
- }
- tbl->it_userspace = (__be64 *) uas;
-
- return 0;
-}
-
-static void tce_iommu_userspace_view_free(struct iommu_table *tbl,
- struct mm_struct *mm)
-{
- unsigned long cb = _ALIGN_UP(sizeof(tbl->it_userspace[0]) *
- tbl->it_size, PAGE_SIZE);
-
- if (!tbl->it_userspace)
- return;
-
- vfree(tbl->it_userspace);
- tbl->it_userspace = NULL;
- decrement_locked_vm(mm, cb >> PAGE_SHIFT);
-}
-
static bool tce_page_is_contained(unsigned long hpa, unsigned page_shift)
{
struct page *page = __va(realmode_pfn_to_page(hpa >> PAGE_SHIFT));
@@ -608,12 +570,6 @@ static long tce_iommu_build_v2(struct tce_container *container,
unsigned long hpa;
enum dma_data_direction dirtmp;
- if (!tbl->it_userspace) {
- ret = tce_iommu_userspace_view_alloc(tbl, container->mm);
- if (ret)
- return ret;
- }
-
for (i = 0; i < pages; ++i) {
struct mm_iommu_table_group_mem_t *mem = NULL;
__be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry + i);
@@ -693,7 +649,6 @@ static void tce_iommu_free_table(struct tce_container *container,
{
unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
- tce_iommu_userspace_view_free(tbl, container->mm);
iommu_tce_table_put(tbl);
decrement_locked_vm(container->mm, pages);
}
@@ -1208,7 +1163,6 @@ static void tce_iommu_release_ownership(struct tce_container *container,
continue;
tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
- tce_iommu_userspace_view_free(tbl, container->mm);
if (tbl->it_map)
iommu_release_ownership(tbl);
--
2.11.0
^ permalink raw reply related
* [PATCH kernel 3/6] KVM: PPC: Make iommu_table::it_userspace big endian
From: Alexey Kardashevskiy @ 2018-06-08 5:46 UTC (permalink / raw)
To: linuxppc-dev
Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, kvm, Alex Williamson,
Benjamin Herrenschmidt
In-Reply-To: <20180608054633.18659-1-aik@ozlabs.ru>
We are going to reuse multilevel TCE code for the userspace copy of
the TCE table and since it is big endian, let's make the copy big endian
too.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
arch/powerpc/include/asm/iommu.h | 2 +-
arch/powerpc/kvm/book3s_64_vio.c | 11 ++++++-----
arch/powerpc/kvm/book3s_64_vio_hv.c | 10 +++++-----
drivers/vfio/vfio_iommu_spapr_tce.c | 19 +++++++++----------
4 files changed, 21 insertions(+), 21 deletions(-)
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 20febe0..803ac70 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -117,7 +117,7 @@ struct iommu_table {
unsigned long *it_map; /* A simple allocation bitmap for now */
unsigned long it_page_shift;/* table iommu page size */
struct list_head it_group_list;/* List of iommu_table_group_link */
- unsigned long *it_userspace; /* userspace view of the table */
+ __be64 *it_userspace; /* userspace view of the table */
struct iommu_table_ops *it_ops;
struct kref it_kref;
};
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 80ead38..1dbca4b 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -378,19 +378,19 @@ static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
{
struct mm_iommu_table_group_mem_t *mem = NULL;
const unsigned long pgsize = 1ULL << tbl->it_page_shift;
- unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+ __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
if (!pua)
/* it_userspace allocation might be delayed */
return H_TOO_HARD;
- mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
+ mem = mm_iommu_lookup(kvm->mm, be64_to_cpu(*pua), pgsize);
if (!mem)
return H_TOO_HARD;
mm_iommu_mapped_dec(mem);
- *pua = 0;
+ *pua = cpu_to_be64(0);
return H_SUCCESS;
}
@@ -437,7 +437,8 @@ long kvmppc_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
enum dma_data_direction dir)
{
long ret;
- unsigned long hpa, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+ unsigned long hpa;
+ __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
struct mm_iommu_table_group_mem_t *mem;
if (!pua)
@@ -464,7 +465,7 @@ long kvmppc_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
if (dir != DMA_NONE)
kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
- *pua = ua;
+ *pua = cpu_to_be64(ua);
return 0;
}
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 635f3ca..18109f3 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -200,7 +200,7 @@ static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
{
struct mm_iommu_table_group_mem_t *mem = NULL;
const unsigned long pgsize = 1ULL << tbl->it_page_shift;
- unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+ __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
if (!pua)
/* it_userspace allocation might be delayed */
@@ -210,13 +210,13 @@ static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
if (WARN_ON_ONCE_RM(!pua))
return H_HARDWARE;
- mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
+ mem = mm_iommu_lookup_rm(kvm->mm, be64_to_cpu(*pua), pgsize);
if (!mem)
return H_TOO_HARD;
mm_iommu_mapped_dec(mem);
- *pua = 0;
+ *pua = cpu_to_be64(0);
return H_SUCCESS;
}
@@ -268,7 +268,7 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
{
long ret;
unsigned long hpa = 0;
- unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+ __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
struct mm_iommu_table_group_mem_t *mem;
if (!pua)
@@ -302,7 +302,7 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
if (dir != DMA_NONE)
kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
- *pua = ua;
+ *pua = cpu_to_be64(ua);
return 0;
}
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 47071f3..81f48114 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -231,7 +231,7 @@ static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl,
decrement_locked_vm(mm, cb >> PAGE_SHIFT);
return -ENOMEM;
}
- tbl->it_userspace = uas;
+ tbl->it_userspace = (__be64 *) uas;
return 0;
}
@@ -494,20 +494,20 @@ static void tce_iommu_unuse_page_v2(struct tce_container *container,
struct mm_iommu_table_group_mem_t *mem = NULL;
int ret;
unsigned long hpa = 0;
- unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+ __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
if (!pua)
return;
- ret = tce_iommu_prereg_ua_to_hpa(container, *pua, IOMMU_PAGE_SIZE(tbl),
- &hpa, &mem);
+ ret = tce_iommu_prereg_ua_to_hpa(container, be64_to_cpu(*pua),
+ IOMMU_PAGE_SIZE(tbl), &hpa, &mem);
if (ret)
- pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
- __func__, *pua, entry, ret);
+ pr_debug("%s: tce %llx at #%lx was not cached, ret=%d\n",
+ __func__, be64_to_cpu(*pua), entry, ret);
if (mem)
mm_iommu_mapped_dec(mem);
- *pua = 0;
+ *pua = cpu_to_be64(0);
}
static int tce_iommu_clear(struct tce_container *container,
@@ -616,8 +616,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
for (i = 0; i < pages; ++i) {
struct mm_iommu_table_group_mem_t *mem = NULL;
- unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl,
- entry + i);
+ __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry + i);
ret = tce_iommu_prereg_ua_to_hpa(container,
tce, IOMMU_PAGE_SIZE(tbl), &hpa, &mem);
@@ -650,7 +649,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
if (dirtmp != DMA_NONE)
tce_iommu_unuse_page_v2(container, tbl, entry + i);
- *pua = tce;
+ *pua = cpu_to_be64(tce);
tce += IOMMU_PAGE_SIZE(tbl);
}
--
2.11.0
^ permalink raw reply related
* [PATCH kernel 2/6] powerpc/powernv: Move TCE manupulation code to its own file
From: Alexey Kardashevskiy @ 2018-06-08 5:46 UTC (permalink / raw)
To: linuxppc-dev
Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, kvm, Alex Williamson,
Benjamin Herrenschmidt
In-Reply-To: <20180608054633.18659-1-aik@ozlabs.ru>
Right now we have allocation code in pci-ioda.c and traversing code in
pci.c, let's keep them toghether. However both files are big enough
already so let's move this business to a new file.
While we at it, move the code which links IOMMU table groups to
IOMMU tables as it is not specific to any PNV PHB model.
These puts exported symbols from the new file together.
This fixes several warnings from checkpatch.pl like this:
"WARNING: Prefer 'unsigned int' to bare use of 'unsigned'".
As this is almost cut-n-paste, there should be no behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
arch/powerpc/platforms/powernv/Makefile | 2 +-
arch/powerpc/platforms/powernv/pci.h | 41 ++--
arch/powerpc/platforms/powernv/pci-ioda-tce.c | 313 ++++++++++++++++++++++++++
arch/powerpc/platforms/powernv/pci-ioda.c | 146 ------------
arch/powerpc/platforms/powernv/pci.c | 158 -------------
5 files changed, 340 insertions(+), 320 deletions(-)
create mode 100644 arch/powerpc/platforms/powernv/pci-ioda-tce.c
diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index 703a350..b540ce8e 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -6,7 +6,7 @@ obj-y += opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o
obj-y += opal-kmsg.o opal-powercap.o opal-psr.o opal-sensor-groups.o
obj-$(CONFIG_SMP) += smp.o subcore.o subcore-asm.o
-obj-$(CONFIG_PCI) += pci.o pci-ioda.o npu-dma.o
+obj-$(CONFIG_PCI) += pci.o pci-ioda.o npu-dma.o pci-ioda-tce.o
obj-$(CONFIG_CXL_BASE) += pci-cxl.o
obj-$(CONFIG_EEH) += eeh-powernv.o
obj-$(CONFIG_PPC_SCOM) += opal-xscom.o
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 1408247..f507baf 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -202,13 +202,6 @@ struct pnv_phb {
};
extern struct pci_ops pnv_pci_ops;
-extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
- unsigned long uaddr, enum dma_data_direction direction,
- unsigned long attrs);
-extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
-extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
- unsigned long *hpa, enum dma_data_direction *direction);
-extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
unsigned char *log_buff);
@@ -218,14 +211,6 @@ int pnv_pci_cfg_write(struct pci_dn *pdn,
int where, int size, u32 val);
extern struct iommu_table *pnv_pci_table_alloc(int nid);
-extern long pnv_pci_link_table_and_group(int node, int num,
- struct iommu_table *tbl,
- struct iommu_table_group *table_group);
-extern void pnv_pci_unlink_table_and_group(struct iommu_table *tbl,
- struct iommu_table_group *table_group);
-extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
- void *tce_mem, u64 tce_size,
- u64 dma_offset, unsigned page_shift);
extern void pnv_pci_init_ioda_hub(struct device_node *np);
extern void pnv_pci_init_ioda2_phb(struct device_node *np);
extern void pnv_pci_init_npu_phb(struct device_node *np);
@@ -273,4 +258,30 @@ extern void pnv_cxl_cx4_teardown_msi_irqs(struct pci_dev *pdev);
/* phb ops (cxl switches these when enabling the kernel api on the phb) */
extern const struct pci_controller_ops pnv_cxl_cx4_ioda_controller_ops;
+/* pci-ioda-tce.c */
+#define POWERNV_IOMMU_DEFAULT_LEVELS 1
+#define POWERNV_IOMMU_MAX_LEVELS 5
+
+extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
+ unsigned long uaddr, enum dma_data_direction direction,
+ unsigned long attrs);
+extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
+extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
+ unsigned long *hpa, enum dma_data_direction *direction);
+extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
+
+extern long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
+ __u32 page_shift, __u64 window_size, __u32 levels,
+ struct iommu_table *tbl);
+extern void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl);
+
+extern long pnv_pci_link_table_and_group(int node, int num,
+ struct iommu_table *tbl,
+ struct iommu_table_group *table_group);
+extern void pnv_pci_unlink_table_and_group(struct iommu_table *tbl,
+ struct iommu_table_group *table_group);
+extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
+ void *tce_mem, u64 tce_size,
+ u64 dma_offset, unsigned int page_shift);
+
#endif /* __POWERNV_PCI_H */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
new file mode 100644
index 0000000..700ceb1
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
@@ -0,0 +1,313 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * TCE helpers for IODA PCI/PCIe on PowerNV platforms
+ *
+ * Copyright 2018 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/iommu.h>
+
+#include <asm/iommu.h>
+#include <asm/tce.h>
+#include "pci.h"
+
+void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
+ void *tce_mem, u64 tce_size,
+ u64 dma_offset, unsigned int page_shift)
+{
+ tbl->it_blocksize = 16;
+ tbl->it_base = (unsigned long)tce_mem;
+ tbl->it_page_shift = page_shift;
+ tbl->it_offset = dma_offset >> tbl->it_page_shift;
+ tbl->it_index = 0;
+ tbl->it_size = tce_size >> 3;
+ tbl->it_busno = 0;
+ tbl->it_type = TCE_PCI;
+}
+
+static __be64 *pnv_tce(struct iommu_table *tbl, long idx)
+{
+ __be64 *tmp = ((__be64 *)tbl->it_base);
+ int level = tbl->it_indirect_levels;
+ const long shift = ilog2(tbl->it_level_size);
+ unsigned long mask = (tbl->it_level_size - 1) << (level * shift);
+
+ while (level) {
+ int n = (idx & mask) >> (level * shift);
+ unsigned long tce = be64_to_cpu(tmp[n]);
+
+ tmp = __va(tce & ~(TCE_PCI_READ | TCE_PCI_WRITE));
+ idx &= ~mask;
+ mask >>= shift;
+ --level;
+ }
+
+ return tmp + idx;
+}
+
+int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
+ unsigned long uaddr, enum dma_data_direction direction,
+ unsigned long attrs)
+{
+ u64 proto_tce = iommu_direction_to_tce_perm(direction);
+ u64 rpn = __pa(uaddr) >> tbl->it_page_shift;
+ long i;
+
+ if (proto_tce & TCE_PCI_WRITE)
+ proto_tce |= TCE_PCI_READ;
+
+ for (i = 0; i < npages; i++) {
+ unsigned long newtce = proto_tce |
+ ((rpn + i) << tbl->it_page_shift);
+ unsigned long idx = index - tbl->it_offset + i;
+
+ *(pnv_tce(tbl, idx)) = cpu_to_be64(newtce);
+ }
+
+ return 0;
+}
+
+#ifdef CONFIG_IOMMU_API
+int pnv_tce_xchg(struct iommu_table *tbl, long index,
+ unsigned long *hpa, enum dma_data_direction *direction)
+{
+ u64 proto_tce = iommu_direction_to_tce_perm(*direction);
+ unsigned long newtce = *hpa | proto_tce, oldtce;
+ unsigned long idx = index - tbl->it_offset;
+
+ BUG_ON(*hpa & ~IOMMU_PAGE_MASK(tbl));
+
+ if (newtce & TCE_PCI_WRITE)
+ newtce |= TCE_PCI_READ;
+
+ oldtce = be64_to_cpu(xchg(pnv_tce(tbl, idx), cpu_to_be64(newtce)));
+ *hpa = oldtce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+ *direction = iommu_tce_direction(oldtce);
+
+ return 0;
+}
+#endif
+
+void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
+{
+ long i;
+
+ for (i = 0; i < npages; i++) {
+ unsigned long idx = index - tbl->it_offset + i;
+
+ *(pnv_tce(tbl, idx)) = cpu_to_be64(0);
+ }
+}
+
+unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
+{
+ return be64_to_cpu(*(pnv_tce(tbl, index - tbl->it_offset)));
+}
+
+static void pnv_pci_ioda2_table_do_free_pages(__be64 *addr,
+ unsigned long size, unsigned int levels)
+{
+ const unsigned long addr_ul = (unsigned long) addr &
+ ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+ if (levels) {
+ long i;
+ u64 *tmp = (u64 *) addr_ul;
+
+ for (i = 0; i < size; ++i) {
+ unsigned long hpa = be64_to_cpu(tmp[i]);
+
+ if (!(hpa & (TCE_PCI_READ | TCE_PCI_WRITE)))
+ continue;
+
+ pnv_pci_ioda2_table_do_free_pages(__va(hpa), size,
+ levels - 1);
+ }
+ }
+
+ free_pages(addr_ul, get_order(size << 3));
+}
+
+void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl)
+{
+ const unsigned long size = tbl->it_indirect_levels ?
+ tbl->it_level_size : tbl->it_size;
+
+ if (!tbl->it_size)
+ return;
+
+ pnv_pci_ioda2_table_do_free_pages((__be64 *)tbl->it_base, size,
+ tbl->it_indirect_levels);
+}
+
+static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned int shift,
+ unsigned int levels, unsigned long limit,
+ unsigned long *current_offset, unsigned long *total_allocated)
+{
+ struct page *tce_mem = NULL;
+ __be64 *addr, *tmp;
+ unsigned int order = max_t(unsigned int, shift, PAGE_SHIFT) -
+ PAGE_SHIFT;
+ unsigned long allocated = 1UL << (order + PAGE_SHIFT);
+ unsigned int entries = 1UL << (shift - 3);
+ long i;
+
+ tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
+ if (!tce_mem) {
+ pr_err("Failed to allocate a TCE memory, order=%d\n", order);
+ return NULL;
+ }
+ addr = page_address(tce_mem);
+ memset(addr, 0, allocated);
+ *total_allocated += allocated;
+
+ --levels;
+ if (!levels) {
+ *current_offset += allocated;
+ return addr;
+ }
+
+ for (i = 0; i < entries; ++i) {
+ tmp = pnv_pci_ioda2_table_do_alloc_pages(nid, shift,
+ levels, limit, current_offset, total_allocated);
+ if (!tmp)
+ break;
+
+ addr[i] = cpu_to_be64(__pa(tmp) |
+ TCE_PCI_READ | TCE_PCI_WRITE);
+
+ if (*current_offset >= limit)
+ break;
+ }
+
+ return addr;
+}
+
+long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
+ __u32 page_shift, __u64 window_size, __u32 levels,
+ struct iommu_table *tbl)
+{
+ void *addr;
+ unsigned long offset = 0, level_shift, total_allocated = 0;
+ const unsigned int window_shift = ilog2(window_size);
+ unsigned int entries_shift = window_shift - page_shift;
+ unsigned int table_shift = max_t(unsigned int, entries_shift + 3,
+ PAGE_SHIFT);
+ const unsigned long tce_table_size = 1UL << table_shift;
+
+ if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS))
+ return -EINVAL;
+
+ if (!is_power_of_2(window_size))
+ return -EINVAL;
+
+ /* Adjust direct table size from window_size and levels */
+ entries_shift = (entries_shift + levels - 1) / levels;
+ level_shift = entries_shift + 3;
+ level_shift = max_t(unsigned int, level_shift, PAGE_SHIFT);
+
+ if ((level_shift - 3) * levels + page_shift >= 55)
+ return -EINVAL;
+
+ /* Allocate TCE table */
+ addr = pnv_pci_ioda2_table_do_alloc_pages(nid, level_shift,
+ levels, tce_table_size, &offset, &total_allocated);
+
+ /* addr==NULL means that the first level allocation failed */
+ if (!addr)
+ return -ENOMEM;
+
+ /*
+ * First level was allocated but some lower level failed as
+ * we did not allocate as much as we wanted,
+ * release partially allocated table.
+ */
+ if (offset < tce_table_size) {
+ pnv_pci_ioda2_table_do_free_pages(addr,
+ 1ULL << (level_shift - 3), levels - 1);
+ return -ENOMEM;
+ }
+
+ /* Setup linux iommu table */
+ pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, bus_offset,
+ page_shift);
+ tbl->it_level_size = 1ULL << (level_shift - 3);
+ tbl->it_indirect_levels = levels - 1;
+ tbl->it_allocated_size = total_allocated;
+
+ pr_devel("Created TCE table: ws=%08llx ts=%lx @%08llx\n",
+ window_size, tce_table_size, bus_offset);
+
+ return 0;
+}
+
+static void pnv_iommu_table_group_link_free(struct rcu_head *head)
+{
+ struct iommu_table_group_link *tgl = container_of(head,
+ struct iommu_table_group_link, rcu);
+
+ kfree(tgl);
+}
+
+void pnv_pci_unlink_table_and_group(struct iommu_table *tbl,
+ struct iommu_table_group *table_group)
+{
+ long i;
+ bool found;
+ struct iommu_table_group_link *tgl;
+
+ if (!tbl || !table_group)
+ return;
+
+ /* Remove link to a group from table's list of attached groups */
+ found = false;
+ list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
+ if (tgl->table_group == table_group) {
+ list_del_rcu(&tgl->next);
+ call_rcu(&tgl->rcu, pnv_iommu_table_group_link_free);
+ found = true;
+ break;
+ }
+ }
+ if (WARN_ON(!found))
+ return;
+
+ /* Clean a pointer to iommu_table in iommu_table_group::tables[] */
+ found = false;
+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+ if (table_group->tables[i] == tbl) {
+ table_group->tables[i] = NULL;
+ found = true;
+ break;
+ }
+ }
+ WARN_ON(!found);
+}
+
+long pnv_pci_link_table_and_group(int node, int num,
+ struct iommu_table *tbl,
+ struct iommu_table_group *table_group)
+{
+ struct iommu_table_group_link *tgl = NULL;
+
+ if (WARN_ON(!tbl || !table_group))
+ return -EINVAL;
+
+ tgl = kzalloc_node(sizeof(struct iommu_table_group_link), GFP_KERNEL,
+ node);
+ if (!tgl)
+ return -ENOMEM;
+
+ tgl->table_group = table_group;
+ list_add_rcu(&tgl->next, &tbl->it_group_list);
+
+ table_group->tables[num] = tbl;
+
+ return 0;
+}
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index d4c60b6..9577059 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -51,12 +51,8 @@
#define PNV_IODA1_M64_SEGS 8 /* Segments per M64 BAR */
#define PNV_IODA1_DMA32_SEGSIZE 0x10000000
-#define POWERNV_IOMMU_DEFAULT_LEVELS 1
-#define POWERNV_IOMMU_MAX_LEVELS 5
-
static const char * const pnv_phb_names[] = { "IODA1", "IODA2", "NPU_NVLINK",
"NPU_OCAPI" };
-static void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl);
void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
const char *fmt, ...)
@@ -2464,10 +2460,6 @@ void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
pe->tce_bypass_enabled = enable;
}
-static long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
- __u32 page_shift, __u64 window_size, __u32 levels,
- struct iommu_table *tbl);
-
static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
int num, __u32 page_shift, __u64 window_size, __u32 levels,
struct iommu_table **ptbl)
@@ -2775,144 +2767,6 @@ static void pnv_pci_ioda_setup_iommu_api(void)
static void pnv_pci_ioda_setup_iommu_api(void) { };
#endif
-static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned shift,
- unsigned levels, unsigned long limit,
- unsigned long *current_offset, unsigned long *total_allocated)
-{
- struct page *tce_mem = NULL;
- __be64 *addr, *tmp;
- unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT;
- unsigned long allocated = 1UL << (order + PAGE_SHIFT);
- unsigned entries = 1UL << (shift - 3);
- long i;
-
- tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
- if (!tce_mem) {
- pr_err("Failed to allocate a TCE memory, order=%d\n", order);
- return NULL;
- }
- addr = page_address(tce_mem);
- memset(addr, 0, allocated);
- *total_allocated += allocated;
-
- --levels;
- if (!levels) {
- *current_offset += allocated;
- return addr;
- }
-
- for (i = 0; i < entries; ++i) {
- tmp = pnv_pci_ioda2_table_do_alloc_pages(nid, shift,
- levels, limit, current_offset, total_allocated);
- if (!tmp)
- break;
-
- addr[i] = cpu_to_be64(__pa(tmp) |
- TCE_PCI_READ | TCE_PCI_WRITE);
-
- if (*current_offset >= limit)
- break;
- }
-
- return addr;
-}
-
-static void pnv_pci_ioda2_table_do_free_pages(__be64 *addr,
- unsigned long size, unsigned level);
-
-static long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
- __u32 page_shift, __u64 window_size, __u32 levels,
- struct iommu_table *tbl)
-{
- void *addr;
- unsigned long offset = 0, level_shift, total_allocated = 0;
- const unsigned window_shift = ilog2(window_size);
- unsigned entries_shift = window_shift - page_shift;
- unsigned table_shift = max_t(unsigned, entries_shift + 3, PAGE_SHIFT);
- const unsigned long tce_table_size = 1UL << table_shift;
-
- if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS))
- return -EINVAL;
-
- if (!is_power_of_2(window_size))
- return -EINVAL;
-
- /* Adjust direct table size from window_size and levels */
- entries_shift = (entries_shift + levels - 1) / levels;
- level_shift = entries_shift + 3;
- level_shift = max_t(unsigned, level_shift, PAGE_SHIFT);
-
- if ((level_shift - 3) * levels + page_shift >= 55)
- return -EINVAL;
-
- /* Allocate TCE table */
- addr = pnv_pci_ioda2_table_do_alloc_pages(nid, level_shift,
- levels, tce_table_size, &offset, &total_allocated);
-
- /* addr==NULL means that the first level allocation failed */
- if (!addr)
- return -ENOMEM;
-
- /*
- * First level was allocated but some lower level failed as
- * we did not allocate as much as we wanted,
- * release partially allocated table.
- */
- if (offset < tce_table_size) {
- pnv_pci_ioda2_table_do_free_pages(addr,
- 1ULL << (level_shift - 3), levels - 1);
- return -ENOMEM;
- }
-
- /* Setup linux iommu table */
- pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, bus_offset,
- page_shift);
- tbl->it_level_size = 1ULL << (level_shift - 3);
- tbl->it_indirect_levels = levels - 1;
- tbl->it_allocated_size = total_allocated;
-
- pr_devel("Created TCE table: ws=%08llx ts=%lx @%08llx\n",
- window_size, tce_table_size, bus_offset);
-
- return 0;
-}
-
-static void pnv_pci_ioda2_table_do_free_pages(__be64 *addr,
- unsigned long size, unsigned level)
-{
- const unsigned long addr_ul = (unsigned long) addr &
- ~(TCE_PCI_READ | TCE_PCI_WRITE);
-
- if (level) {
- long i;
- u64 *tmp = (u64 *) addr_ul;
-
- for (i = 0; i < size; ++i) {
- unsigned long hpa = be64_to_cpu(tmp[i]);
-
- if (!(hpa & (TCE_PCI_READ | TCE_PCI_WRITE)))
- continue;
-
- pnv_pci_ioda2_table_do_free_pages(__va(hpa), size,
- level - 1);
- }
- }
-
- free_pages(addr_ul, get_order(size << 3));
-}
-
-static void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl)
-{
- const unsigned long size = tbl->it_indirect_levels ?
- tbl->it_level_size : tbl->it_size;
-
- if (!tbl->it_size)
- return;
-
- pnv_pci_ioda2_table_do_free_pages((__be64 *)tbl->it_base, size,
- tbl->it_indirect_levels);
-}
-
static unsigned long pnv_ioda_parse_tce_sizes(struct pnv_phb *phb)
{
struct pci_controller *hose = phb->hose;
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index b265ecc..13aef23 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -802,85 +802,6 @@ struct pci_ops pnv_pci_ops = {
.write = pnv_pci_write_config,
};
-static __be64 *pnv_tce(struct iommu_table *tbl, long idx)
-{
- __be64 *tmp = ((__be64 *)tbl->it_base);
- int level = tbl->it_indirect_levels;
- const long shift = ilog2(tbl->it_level_size);
- unsigned long mask = (tbl->it_level_size - 1) << (level * shift);
-
- while (level) {
- int n = (idx & mask) >> (level * shift);
- unsigned long tce = be64_to_cpu(tmp[n]);
-
- tmp = __va(tce & ~(TCE_PCI_READ | TCE_PCI_WRITE));
- idx &= ~mask;
- mask >>= shift;
- --level;
- }
-
- return tmp + idx;
-}
-
-int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
- unsigned long uaddr, enum dma_data_direction direction,
- unsigned long attrs)
-{
- u64 proto_tce = iommu_direction_to_tce_perm(direction);
- u64 rpn = __pa(uaddr) >> tbl->it_page_shift;
- long i;
-
- if (proto_tce & TCE_PCI_WRITE)
- proto_tce |= TCE_PCI_READ;
-
- for (i = 0; i < npages; i++) {
- unsigned long newtce = proto_tce |
- ((rpn + i) << tbl->it_page_shift);
- unsigned long idx = index - tbl->it_offset + i;
-
- *(pnv_tce(tbl, idx)) = cpu_to_be64(newtce);
- }
-
- return 0;
-}
-
-#ifdef CONFIG_IOMMU_API
-int pnv_tce_xchg(struct iommu_table *tbl, long index,
- unsigned long *hpa, enum dma_data_direction *direction)
-{
- u64 proto_tce = iommu_direction_to_tce_perm(*direction);
- unsigned long newtce = *hpa | proto_tce, oldtce;
- unsigned long idx = index - tbl->it_offset;
-
- BUG_ON(*hpa & ~IOMMU_PAGE_MASK(tbl));
-
- if (newtce & TCE_PCI_WRITE)
- newtce |= TCE_PCI_READ;
-
- oldtce = be64_to_cpu(xchg(pnv_tce(tbl, idx), cpu_to_be64(newtce)));
- *hpa = oldtce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
- *direction = iommu_tce_direction(oldtce);
-
- return 0;
-}
-#endif
-
-void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
-{
- long i;
-
- for (i = 0; i < npages; i++) {
- unsigned long idx = index - tbl->it_offset + i;
-
- *(pnv_tce(tbl, idx)) = cpu_to_be64(0);
- }
-}
-
-unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
-{
- return be64_to_cpu(*(pnv_tce(tbl, index - tbl->it_offset)));
-}
-
struct iommu_table *pnv_pci_table_alloc(int nid)
{
struct iommu_table *tbl;
@@ -895,85 +816,6 @@ struct iommu_table *pnv_pci_table_alloc(int nid)
return tbl;
}
-long pnv_pci_link_table_and_group(int node, int num,
- struct iommu_table *tbl,
- struct iommu_table_group *table_group)
-{
- struct iommu_table_group_link *tgl = NULL;
-
- if (WARN_ON(!tbl || !table_group))
- return -EINVAL;
-
- tgl = kzalloc_node(sizeof(struct iommu_table_group_link), GFP_KERNEL,
- node);
- if (!tgl)
- return -ENOMEM;
-
- tgl->table_group = table_group;
- list_add_rcu(&tgl->next, &tbl->it_group_list);
-
- table_group->tables[num] = tbl;
-
- return 0;
-}
-
-static void pnv_iommu_table_group_link_free(struct rcu_head *head)
-{
- struct iommu_table_group_link *tgl = container_of(head,
- struct iommu_table_group_link, rcu);
-
- kfree(tgl);
-}
-
-void pnv_pci_unlink_table_and_group(struct iommu_table *tbl,
- struct iommu_table_group *table_group)
-{
- long i;
- bool found;
- struct iommu_table_group_link *tgl;
-
- if (!tbl || !table_group)
- return;
-
- /* Remove link to a group from table's list of attached groups */
- found = false;
- list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
- if (tgl->table_group == table_group) {
- list_del_rcu(&tgl->next);
- call_rcu(&tgl->rcu, pnv_iommu_table_group_link_free);
- found = true;
- break;
- }
- }
- if (WARN_ON(!found))
- return;
-
- /* Clean a pointer to iommu_table in iommu_table_group::tables[] */
- found = false;
- for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
- if (table_group->tables[i] == tbl) {
- table_group->tables[i] = NULL;
- found = true;
- break;
- }
- }
- WARN_ON(!found);
-}
-
-void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
- void *tce_mem, u64 tce_size,
- u64 dma_offset, unsigned page_shift)
-{
- tbl->it_blocksize = 16;
- tbl->it_base = (unsigned long)tce_mem;
- tbl->it_page_shift = page_shift;
- tbl->it_offset = dma_offset >> tbl->it_page_shift;
- tbl->it_index = 0;
- tbl->it_size = tce_size >> 3;
- tbl->it_busno = 0;
- tbl->it_type = TCE_PCI;
-}
-
void pnv_pci_dma_dev_setup(struct pci_dev *pdev)
{
struct pci_controller *hose = pci_bus_to_host(pdev->bus);
--
2.11.0
^ permalink raw reply related
* [PATCH kernel 1/6] powerpc/powernv: Remove useless wrapper
From: Alexey Kardashevskiy @ 2018-06-08 5:46 UTC (permalink / raw)
To: linuxppc-dev
Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, kvm, Alex Williamson,
Benjamin Herrenschmidt
In-Reply-To: <20180608054633.18659-1-aik@ozlabs.ru>
This gets rid of a useless wrapper around
pnv_pci_ioda2_table_free_pages().
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
arch/powerpc/platforms/powernv/pci-ioda.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 29f798c..d4c60b6 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2206,11 +2206,6 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
}
-static void pnv_ioda2_table_free(struct iommu_table *tbl)
-{
- pnv_pci_ioda2_table_free_pages(tbl);
-}
-
static struct iommu_table_ops pnv_ioda2_iommu_ops = {
.set = pnv_ioda2_tce_build,
#ifdef CONFIG_IOMMU_API
@@ -2219,7 +2214,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
#endif
.clear = pnv_ioda2_tce_free,
.get = pnv_tce_get,
- .free = pnv_ioda2_table_free,
+ .free = pnv_pci_ioda2_table_free_pages,
};
static int pnv_pci_ioda_dev_dma_weight(struct pci_dev *dev, void *data)
--
2.11.0
^ permalink raw reply related
* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
From: Alex Williamson @ 2018-06-08 5:03 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson, kvm-ppc,
Ram Pai, kvm, Alistair Popple
In-Reply-To: <ff6de203-9287-8158-f0fb-ee9146245bea@ozlabs.ru>
On Fri, 8 Jun 2018 14:14:23 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> On 8/6/18 1:44 pm, Alex Williamson wrote:
> > On Fri, 8 Jun 2018 13:08:54 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >
> >> On 8/6/18 8:15 am, Alex Williamson wrote:
> >>> On Fri, 08 Jun 2018 07:54:02 +1000
> >>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> >>>
> >>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> >>>>>
> >>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
> >>>>> connected devices makes sense? AIUI we have a PCI view of these
> >>>>> devices and from that perspective they're isolated. That's the view of
> >>>>> the device used to generate the grouping. However, not visible to us,
> >>>>> these devices are interconnected via NVLink. What isolation properties
> >>>>> does NVLink provide given that its entire purpose for existing seems to
> >>>>> be to provide a high performance link for p2p between devices?
> >>>>
> >>>> Not entire. On POWER chips, we also have an nvlink between the device
> >>>> and the CPU which is running significantly faster than PCIe.
> >>>>
> >>>> But yes, there are cross-links and those should probably be accounted
> >>>> for in the grouping.
> >>>
> >>> Then after we fix the grouping, can we just let the host driver manage
> >>> this coherent memory range and expose vGPUs to guests? The use case of
> >>> assigning all 6 GPUs to one VM seems pretty limited. (Might need to
> >>> convince NVIDIA to support more than a single vGPU per VM though)
> >>
> >> These are physical GPUs, not virtual sriov-alike things they are
> >> implementing as well elsewhere.
> >
> > vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> > either. That's why we have mdev devices now to implement software
> > defined devices. I don't have first hand experience with V-series, but
> > I would absolutely expect a PCIe-based Tesla V100 to support vGPU.
>
> So assuming V100 can do vGPU, you are suggesting ditching this patchset and
> using mediated vGPUs instead, correct?
If it turns out that our PCIe-only-based IOMMU grouping doesn't
account for lack of isolation on the NVLink side and we correct that,
limiting assignment to sets of 3 interconnected GPUs, is that still a
useful feature? OTOH, it's entirely an NVIDIA proprietary decision
whether they choose to support vGPU on these GPUs or whether they can
be convinced to support multiple vGPUs per VM.
> >> My current understanding is that every P9 chip in that box has some NVLink2
> >> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> >> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> >> as well.
> >>
> >> From small bits of information I have it seems that a GPU can perfectly
> >> work alone and if the NVIDIA driver does not see these interconnects
> >> (because we do not pass the rest of the big 3xGPU group to this guest), it
> >> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> >> which simply refuses to work until all 3 GPUs are passed so there is some
> >> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> >> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> >>
> >> So we will either have 6 groups (one per GPU) or 2 groups (one per
> >> interconnected group).
> >
> > I'm not gaining much confidence that we can rely on isolation between
> > NVLink connected GPUs, it sounds like you're simply expecting that
> > proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> > is going to play nice and nobody will figure out how to do bad things
> > because... obfuscation? Thanks,
>
> Well, we already believe that a proprietary firmware of a sriov-capable
> adapter like Mellanox ConnextX is not doing bad things, how is this
> different in principle?
It seems like the scope and hierarchy are different. Here we're
talking about exposing big discrete devices, which are peers of one
another (and have history of being reverse engineered), to userspace
drivers. Once handed to userspace, each of those devices needs to be
considered untrusted. In the case of SR-IOV, we typically have a
trusted host driver for the PF managing untrusted VFs. We do rely on
some sanity in the hardware/firmware in isolating the VFs from each
other and from the PF, but we also often have source code for Linux
drivers for these devices and sometimes even datasheets. Here we have
neither of those and perhaps we won't know the extent of the lack of
isolation between these devices until nouveau (best case) or some
exploit (worst case) exposes it. IOMMU grouping always assumes a lack
of isolation between devices unless the hardware provides some
indication that isolation exists, for example ACS on PCIe. If NVIDIA
wants to expose isolation on NVLink, perhaps they need to document
enough of it that the host kernel can manipulate and test for isolation,
perhaps even enabling virtualization of the NVLink interconnect
interface such that the host can prevent GPUs from interfering with
each other. Thanks,
Alex
^ permalink raw reply
* RE: [v3, 00/10] Support DPAA PTP clock and timestamping
From: Y.b. Lu @ 2018-06-08 4:45 UTC (permalink / raw)
To: Richard Cochran
Cc: netdev@vger.kernel.org, Madalin-cristian Bucur, Rob Herring,
Shawn Guo, David S . Miller, devicetree@vger.kernel.org,
linuxppc-dev@lists.ozlabs.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org
In-Reply-To: <20180608042706.7gfg5p6574ntc2lq@localhost>
PiAtLS0tLU9yaWdpbmFsIE1lc3NhZ2UtLS0tLQ0KPiBGcm9tOiBSaWNoYXJkIENvY2hyYW4gW21h
aWx0bzpyaWNoYXJkY29jaHJhbkBnbWFpbC5jb21dDQo+IFNlbnQ6IEZyaWRheSwgSnVuZSA4LCAy
MDE4IDEyOjI3IFBNDQo+IFRvOiBZLmIuIEx1IDx5YW5nYm8ubHVAbnhwLmNvbT4NCj4gQ2M6IG5l
dGRldkB2Z2VyLmtlcm5lbC5vcmc7IE1hZGFsaW4tY3Jpc3RpYW4gQnVjdXINCj4gPG1hZGFsaW4u
YnVjdXJAbnhwLmNvbT47IFJvYiBIZXJyaW5nIDxyb2JoK2R0QGtlcm5lbC5vcmc+OyBTaGF3biBH
dW8NCj4gPHNoYXduZ3VvQGtlcm5lbC5vcmc+OyBEYXZpZCBTIC4gTWlsbGVyIDxkYXZlbUBkYXZl
bWxvZnQubmV0PjsNCj4gZGV2aWNldHJlZUB2Z2VyLmtlcm5lbC5vcmc7IGxpbnV4cHBjLWRldkBs
aXN0cy5vemxhYnMub3JnOw0KPiBsaW51eC1hcm0ta2VybmVsQGxpc3RzLmluZnJhZGVhZC5vcmc7
IGxpbnV4LWtlcm5lbEB2Z2VyLmtlcm5lbC5vcmcNCj4gU3ViamVjdDogUmU6IFt2MywgMDAvMTBd
IFN1cHBvcnQgRFBBQSBQVFAgY2xvY2sgYW5kIHRpbWVzdGFtcGluZw0KPiANCj4gT24gVGh1LCBK
dW4gMDcsIDIwMTggYXQgMDU6MjA6NDBQTSArMDgwMCwgWWFuZ2JvIEx1IHdyb3RlOg0KPiA+IFRo
aXMgcGF0Y2hzZXQgaXMgdG8gc3VwcG9ydCBEUEFBIEZNQU4gUFRQIGNsb2NrIGFuZCBIVyB0aW1l
c3RhbXBpbmcuDQo+ID4gSXQgaGFkIGJlZW4gdmVyaWZpZWQgb24gYm90aCBBUk0gcGxhdGZvcm0g
YW5kIFBQQyBwbGF0Zm9ybS4NCj4gPiAtIFRoZSBwYXRjaCAjMSB0byBwYXRjaCAjNSBhcmUgdG8g
c3VwcG9ydCBEUEFBIEZNQU4gMTU4OCB0aW1lciBpbg0KPiA+ICAgcHRwX3FvcmlxIGRyaXZlci4N
Cj4gPiAtIFRoZSBwYXRjaCAjNiB0byBwYXRjaCAjMTAgYXJlIHRvIGFkZCBIVyB0aW1lc3RhbXBp
bmcgc3VwcG9ydCBpbg0KPiA+ICAgRFBBQSBldGhlcm5ldCBkcml2ZXIuDQo+IA0KPiBSaWdodCBu
b3csIG5ldC1uZXh0IGlzIGNsb3NlZCBmb3IgbmV3IHN0dWZmLiAgWW91IHdpbGwgaGF2ZSB0byBw
b3N0IHRoZSBzZXJpZXMNCj4gYWdhaW4gYWZ0ZXIgdGhlIG1lcmdlIHdpbmRvdyBjbG9zZXMuICBZ
b3UgY2FuIGNoZWNrIHRoZSBzdGF0dXMgaGVyZToNCj4gDQo+IA0KPiBodHRwczovL2VtZWEwMS5z
YWZlbGlua3MucHJvdGVjdGlvbi5vdXRsb29rLmNvbS8/dXJsPWh0dHA6JTJGJTJGdmdlci5rZXJu
DQo+IGVsLm9yZyUyRn5kYXZlbSUyRm5ldC1uZXh0Lmh0bWwmZGF0YT0wMiU3QzAxJTdDeWFuZ2Jv
Lmx1JTQwbnhwLmNvDQo+IG0lN0NiYWFiMGIyMmU3NDQ0Mzg2YzM3MDA4ZDVjY2Y4MWIzNyU3QzY4
NmVhMWQzYmMyYjRjNmZhOTJjZDk5DQo+IGM1YzMwMTYzNSU3QzAlN0MwJTdDNjM2NjQwMjg4MzQ3
NTYzNzQyJnNkYXRhPWpDbU5sd29lV0E1MFBWNA0KPiB3M2xLWiUyRnM0YWtQancwVlYyT3JKM3Q0
Rml6SjAlM0QmcmVzZXJ2ZWQ9MA0KPiANCj4gV2hlbiB5b3UgZG8gcmUtcG9zdCwgeW91IGNhbiBh
ZGQgbXk6DQo+IA0KPiBBY2tlZC1ieTogUmljaGFyZCBDb2NocmFuIDxyaWNoYXJkY29jaHJhbkBn
bWFpbC5jb20+DQoNCltZLmIuIEx1XSBHZXQgaXQuIEFuZCB0aGFua3MgYSBsb3Qg8J+Yig0K
^ permalink raw reply
* Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
From: Alex Williamson @ 2018-06-08 4:34 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: linuxppc-dev, David Gibson, kvm-ppc, Benjamin Herrenschmidt,
Ram Pai, kvm, Alistair Popple
In-Reply-To: <b1ec37e5-e7a0-5930-edcb-08272ca841b0@ozlabs.ru>
On Fri, 8 Jun 2018 13:52:05 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> On 8/6/18 1:35 pm, Alex Williamson wrote:
> > On Fri, 8 Jun 2018 13:09:13 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >> On 8/6/18 3:04 am, Alex Williamson wrote:
> >>> On Thu, 7 Jun 2018 18:44:20 +1000
> >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >>>> index 7bddf1e..38c9475 100644
> >>>> --- a/drivers/vfio/pci/vfio_pci.c
> >>>> +++ b/drivers/vfio/pci/vfio_pci.c
> >>>> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
> >>>> }
> >>>> }
> >>>>
> >>>> + if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> >>>> + pdev->device == 0x1db1 &&
> >>>> + IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
> >>>
> >>> Can't we do better than check this based on device ID? Perhaps PCIe
> >>> capability hints at this?
> >>
> >> A normal PCI pluggable device looks like this:
> >>
> >> root@fstn3:~# sudo lspci -vs 0000:03:00.0
> >> 0000:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
> >> Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
> >> Flags: fast devsel, IRQ 497
> >> Memory at 3fe000000000 (32-bit, non-prefetchable) [disabled] [size=16M]
> >> Memory at 200000000000 (64-bit, prefetchable) [disabled] [size=16G]
> >> Memory at 200400000000 (64-bit, prefetchable) [disabled] [size=32M]
> >> Capabilities: [60] Power Management version 3
> >> Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >> Capabilities: [78] Express Endpoint, MSI 00
> >> Capabilities: [100] Virtual Channel
> >> Capabilities: [128] Power Budgeting <?>
> >> Capabilities: [420] Advanced Error Reporting
> >> Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> >> Capabilities: [900] #19
> >>
> >>
> >> This is a NVLink v1 machine:
> >>
> >> aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
> >> 000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
> >> Subsystem: NVIDIA Corporation Device 116b
> >> Flags: bus master, fast devsel, latency 0, IRQ 457
> >> Memory at 3fe300000000 (32-bit, non-prefetchable) [size=16M]
> >> Memory at 260000000000 (64-bit, prefetchable) [size=16G]
> >> Memory at 260400000000 (64-bit, prefetchable) [size=32M]
> >> Capabilities: [60] Power Management version 3
> >> Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >> Capabilities: [78] Express Endpoint, MSI 00
> >> Capabilities: [100] Virtual Channel
> >> Capabilities: [250] Latency Tolerance Reporting
> >> Capabilities: [258] L1 PM Substates
> >> Capabilities: [128] Power Budgeting <?>
> >> Capabilities: [420] Advanced Error Reporting
> >> Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> >> Capabilities: [900] #19
> >> Kernel driver in use: nvidia
> >> Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
> >>
> >>
> >> This is the one the patch is for:
> >>
> >> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
> >> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> >> (rev a1)
> >> Subsystem: NVIDIA Corporation Device 1212
> >> Flags: fast devsel, IRQ 82, NUMA node 8
> >> Memory at 620c280000000 (32-bit, non-prefetchable) [disabled] [size=16M]
> >> Memory at 6228000000000 (64-bit, prefetchable) [disabled] [size=16G]
> >> Memory at 6228400000000 (64-bit, prefetchable) [disabled] [size=32M]
> >> Capabilities: [60] Power Management version 3
> >> Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >> Capabilities: [78] Express Endpoint, MSI 00
> >> Capabilities: [100] Virtual Channel
> >> Capabilities: [250] Latency Tolerance Reporting
> >> Capabilities: [258] L1 PM Substates
> >> Capabilities: [128] Power Budgeting <?>
> >> Capabilities: [420] Advanced Error Reporting
> >> Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> >> Capabilities: [900] #19
> >> Capabilities: [ac0] #23
> >> Kernel driver in use: vfio-pci
> >>
> >>
> >> I can only see a new capability #23 which I have no idea about what it
> >> actually does - my latest PCIe spec is
> >> PCI_Express_Base_r3.1a_December7-2015.pdf and that only knows capabilities
> >> till #21, do you have any better spec? Does not seem promising anyway...
> >
> > You could just look in include/uapi/linux/pci_regs.h and see that 23
> > (0x17) is a TPH Requester capability and google for that... It's a TLP
> > processing hint related to cache processing for requests from system
> > specific interconnects. Sounds rather promising. Of course there's
> > also the vendor specific capability that might be probed if NVIDIA will
> > tell you what to look for and the init function you've implemented
> > looks for specific devicetree nodes, that I imagine you could test for
> > in a probe as well.
>
>
> This 23 is in hex:
>
> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> (rev a1)
> Subsystem: NVIDIA Corporation Device 1212
> Flags: fast devsel, IRQ 82, NUMA node 8
> Memory at 620c280000000 (32-bit, non-prefetchable) [disabled] [size=16M]
> Memory at 6228000000000 (64-bit, prefetchable) [disabled] [size=16G]
> Memory at 6228400000000 (64-bit, prefetchable) [disabled] [size=32M]
> Capabilities: [60] Power Management version 3
> Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> Capabilities: [78] Express Endpoint, MSI 00
> Capabilities: [100] Virtual Channel
> Capabilities: [250] Latency Tolerance Reporting
> Capabilities: [258] L1 PM Substates
> Capabilities: [128] Power Budgeting <?>
> Capabilities: [420] Advanced Error Reporting
> Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> Capabilities: [900] #19
> Capabilities: [ac0] #23
> Kernel driver in use: vfio-pci
>
> [aik@yc02goos ~]$ sudo lspci -vvvxxxxs 0035:03:00.0 | grep ac0
> Capabilities: [ac0 v1] #23
> ac0: 23 00 01 00 de 10 c1 00 01 00 10 00 00 00 00 00
Oops, I was thinking lspci printed unknown in decimal. Strange, it's a
shared, vendor specific capability:
https://pcisig.com/sites/default/files/specification_documents/ECN_DVSEC-2015-08-04-clean_0.pdf
We see in your dump that the vendor of this capability is 0x10de
(NVIDIA) and the ID of the capability is 0x0001. Note that NVIDIA
sponsored this ECN.
> Talking to NVIDIA is always an option :)
Really no other choice to figure out how to decode these vendor
specific capabilities, this 0x23 capability at least seems to be meant
for sharing.
> >>> Is it worthwhile to continue with assigning the device in the !ENABLED
> >>> case? For instance, maybe it would be better to provide a weak
> >>> definition of vfio_pci_nvlink2_init() that would cause us to fail here
> >>> if we don't have this device specific support enabled. I realize
> >>> you're following the example set forth for IGD, but those regions are
> >>> optional, for better or worse.
> >>
> >>
> >> The device is supposed to work even without GPU RAM passed through, this
> >> should look like NVLink v1 in this case (there used to be bugs in the
> >> driver, may be still are, have not checked for a while but there is a bug
> >> opened at NVIDIA about this and they were going to fix that), this is why I
> >> chose not to fail here.
> >
> > Ok.
> >
> >>>> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> >>>> index 24ee260..2725bc8 100644
> >>>> --- a/drivers/vfio/pci/Kconfig
> >>>> +++ b/drivers/vfio/pci/Kconfig
> >>>> @@ -30,3 +30,7 @@ config VFIO_PCI_INTX
> >>>> config VFIO_PCI_IGD
> >>>> depends on VFIO_PCI
> >>>> def_bool y if X86
> >>>> +
> >>>> +config VFIO_PCI_NVLINK2
> >>>> + depends on VFIO_PCI
> >>>> + def_bool y if PPC_POWERNV
> >>>
> >>> As written, this also depends on PPC_POWERNV (or at least TCE), it's not
> >>> a portable implementation that we could re-use on X86 or ARM or any
> >>> other platform if hardware appeared for it. Can we improve that as
> >>> well to make this less POWER specific? Thanks,
> >>
> >>
> >> As I said in another mail, every P9 chip in that box has some NVLink2 logic
> >> on it so it is not even common among P9's in general and I am having hard
> >> time seeing these V100s used elsewhere in such way.
> >
> > https://www.redhat.com/archives/vfio-users/2018-May/msg00000.html
> >
> > Not much platform info, but based on the rpm mentioned, looks like an
> > x86_64 box. Thanks,
>
> Wow. Interesting. Thanks for the pointer. No advertising material actually
> says that it is P9 only or even mention P9, wiki does not say it is P9 only
> either. Hmmm...
NVIDIA's own DGX systems are Xeon-based and seem to include NVLink.
The DGX-1 definitely makes use of the SXM2 modules, up to 8 of them.
The DGX Station might be the 4x V100 SXM2 box mentioned in the link.
Thanks,
Alex
^ permalink raw reply
* Re: [v3, 00/10] Support DPAA PTP clock and timestamping
From: Richard Cochran @ 2018-06-08 4:27 UTC (permalink / raw)
To: Yangbo Lu
Cc: netdev, madalin.bucur, Rob Herring, Shawn Guo, David S . Miller,
devicetree, linuxppc-dev, linux-arm-kernel, linux-kernel
In-Reply-To: <20180607092050.46128-1-yangbo.lu@nxp.com>
On Thu, Jun 07, 2018 at 05:20:40PM +0800, Yangbo Lu wrote:
> This patchset is to support DPAA FMAN PTP clock and HW timestamping.
> It had been verified on both ARM platform and PPC platform.
> - The patch #1 to patch #5 are to support DPAA FMAN 1588 timer in
> ptp_qoriq driver.
> - The patch #6 to patch #10 are to add HW timestamping support in
> DPAA ethernet driver.
Right now, net-next is closed for new stuff. You will have to post
the series again after the merge window closes. You can check the
status here:
http://vger.kernel.org/~davem/net-next.html
When you do re-post, you can add my:
Acked-by: Richard Cochran <richardcochran@gmail.com>
^ permalink raw reply
* Re: [RFC PATCH kernel 1/5] vfio/spapr_tce: Simplify page contained test
From: David Gibson @ 2018-06-08 3:32 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: linuxppc-dev, kvm-ppc, Alex Williamson, Benjamin Herrenschmidt,
Ram Pai, kvm, Alistair Popple
In-Reply-To: <20180607084420.29513-2-aik@ozlabs.ru>
[-- Attachment #1: Type: text/plain, Size: 2728 bytes --]
On Thu, Jun 07, 2018 at 06:44:16PM +1000, Alexey Kardashevskiy wrote:
> The test function takes a page struct pointer which is not used by
> either of two callers in any other way, make it simple and just pass
> a physical address there.
>
> This should cause no behavioral change now but later we may start
> supporting host addresses for memory devices which are not backed
> with page structs.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> drivers/vfio/vfio_iommu_spapr_tce.c | 11 ++++-------
> 1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 759a5bd..2c4a048 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -249,8 +249,9 @@ static void tce_iommu_userspace_view_free(struct iommu_table *tbl,
> decrement_locked_vm(mm, cb >> PAGE_SHIFT);
> }
>
> -static bool tce_page_is_contained(struct page *page, unsigned page_shift)
> +static bool tce_page_is_contained(unsigned long hpa, unsigned page_shift)
> {
> + struct page *page = pfn_to_page(hpa >> PAGE_SHIFT);
> /*
> * Check that the TCE table granularity is not bigger than the size of
> * a page we just found. Otherwise the hardware can get access to
> @@ -549,7 +550,6 @@ static long tce_iommu_build(struct tce_container *container,
> enum dma_data_direction direction)
> {
> long i, ret = 0;
> - struct page *page;
> unsigned long hpa;
> enum dma_data_direction dirtmp;
>
> @@ -560,8 +560,7 @@ static long tce_iommu_build(struct tce_container *container,
> if (ret)
> break;
>
> - page = pfn_to_page(hpa >> PAGE_SHIFT);
> - if (!tce_page_is_contained(page, tbl->it_page_shift)) {
> + if (!tce_page_is_contained(hpa, tbl->it_page_shift)) {
> ret = -EPERM;
> break;
> }
> @@ -595,7 +594,6 @@ static long tce_iommu_build_v2(struct tce_container *container,
> enum dma_data_direction direction)
> {
> long i, ret = 0;
> - struct page *page;
> unsigned long hpa;
> enum dma_data_direction dirtmp;
>
> @@ -615,8 +613,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
> if (ret)
> break;
>
> - page = pfn_to_page(hpa >> PAGE_SHIFT);
> - if (!tce_page_is_contained(page, tbl->it_page_shift)) {
> + if (!tce_page_is_contained(hpa, tbl->it_page_shift)) {
> ret = -EPERM;
> break;
> }
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
From: Alexey Kardashevskiy @ 2018-06-08 4:14 UTC (permalink / raw)
To: Alex Williamson
Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson, kvm-ppc,
Ram Pai, kvm, Alistair Popple
In-Reply-To: <20180607214455.51ecfa1a@w520.home>
On 8/6/18 1:44 pm, Alex Williamson wrote:
> On Fri, 8 Jun 2018 13:08:54 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
>> On 8/6/18 8:15 am, Alex Williamson wrote:
>>> On Fri, 08 Jun 2018 07:54:02 +1000
>>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>>>
>>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
>>>>>
>>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
>>>>> connected devices makes sense? AIUI we have a PCI view of these
>>>>> devices and from that perspective they're isolated. That's the view of
>>>>> the device used to generate the grouping. However, not visible to us,
>>>>> these devices are interconnected via NVLink. What isolation properties
>>>>> does NVLink provide given that its entire purpose for existing seems to
>>>>> be to provide a high performance link for p2p between devices?
>>>>
>>>> Not entire. On POWER chips, we also have an nvlink between the device
>>>> and the CPU which is running significantly faster than PCIe.
>>>>
>>>> But yes, there are cross-links and those should probably be accounted
>>>> for in the grouping.
>>>
>>> Then after we fix the grouping, can we just let the host driver manage
>>> this coherent memory range and expose vGPUs to guests? The use case of
>>> assigning all 6 GPUs to one VM seems pretty limited. (Might need to
>>> convince NVIDIA to support more than a single vGPU per VM though)
>>
>> These are physical GPUs, not virtual sriov-alike things they are
>> implementing as well elsewhere.
>
> vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> either. That's why we have mdev devices now to implement software
> defined devices. I don't have first hand experience with V-series, but
> I would absolutely expect a PCIe-based Tesla V100 to support vGPU.
So assuming V100 can do vGPU, you are suggesting ditching this patchset and
using mediated vGPUs instead, correct?
>> My current understanding is that every P9 chip in that box has some NVLink2
>> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
>> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
>> as well.
>>
>> From small bits of information I have it seems that a GPU can perfectly
>> work alone and if the NVIDIA driver does not see these interconnects
>> (because we do not pass the rest of the big 3xGPU group to this guest), it
>> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
>> which simply refuses to work until all 3 GPUs are passed so there is some
>> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
>> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
>>
>> So we will either have 6 groups (one per GPU) or 2 groups (one per
>> interconnected group).
>
> I'm not gaining much confidence that we can rely on isolation between
> NVLink connected GPUs, it sounds like you're simply expecting that
> proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> is going to play nice and nobody will figure out how to do bad things
> because... obfuscation? Thanks,
Well, we already believe that a proprietary firmware of a sriov-capable
adapter like Mellanox ConnextX is not doing bad things, how is this
different in principle?
ps. their obfuscation is funny indeed :)
--
Alexey
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox