Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH bpf-next v7 4/8] bpf: Introduce cgroup iter
From: Hao Luo @ 2022-08-16  6:52 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Yosry Ahmed, Alexei Starovoitov, Linux Kernel Mailing List, bpf,
	Cgroups, Networking, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Tejun Heo, Zefan Li, KP Singh, Johannes Weiner, Michal Hocko,
	Benjamin Tissoires, John Fastabend, Michal Koutny, Roman Gushchin,
	David Rientjes, Stanislav Fomichev, Shakeel Butt
In-Reply-To: <CAEf4BzZTrsBOPpCTFouoWZJG9yXkz8LZgLQrqDREAY-XdGb7ew@mail.gmail.com>

On Mon, Aug 15, 2022 at 9:13 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Aug 11, 2022 at 7:10 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Wed, Aug 10, 2022 at 8:10 PM Hao Luo <haoluo@google.com> wrote:
> > >
> > > On Tue, Aug 9, 2022 at 11:38 AM Hao Luo <haoluo@google.com> wrote:
> > > >
> > > > On Tue, Aug 9, 2022 at 9:23 AM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Mon, Aug 08, 2022 at 05:56:57PM -0700, Hao Luo wrote:
> > > > > > On Mon, Aug 8, 2022 at 5:19 PM Andrii Nakryiko
> > > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > >
> > > > > > > On Fri, Aug 5, 2022 at 2:49 PM Hao Luo <haoluo@google.com> wrote:
> > > > > > > >
> > > > > > > > Cgroup_iter is a type of bpf_iter. It walks over cgroups in four modes:
> > > > > > > >
> > > > > > > >  - walking a cgroup's descendants in pre-order.
> > > > > > > >  - walking a cgroup's descendants in post-order.
> > > > > > > >  - walking a cgroup's ancestors.
> > > > > > > >  - process only the given cgroup.
> > > > > > > >
> > > > [...]
> > > > > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > > > > > index 59a217ca2dfd..4d758b2e70d6 100644
> > > > > > > > --- a/include/uapi/linux/bpf.h
> > > > > > > > +++ b/include/uapi/linux/bpf.h
> > > > > > > > @@ -87,10 +87,37 @@ struct bpf_cgroup_storage_key {
> > > > > > > >         __u32   attach_type;            /* program attach type (enum bpf_attach_type) */
> > > > > > > >  };
> > > > > > > >
> > > > > > > > +enum bpf_iter_order {
> > > > > > > > +       BPF_ITER_ORDER_DEFAULT = 0,     /* default order. */
> > > > > > >
> > > > > > > why is this default order necessary? It just adds confusion (I had to
> > > > > > > look up source code to know what is default order). I might have
> > > > > > > missed some discussion, so if there is some very good reason, then
> > > > > > > please document this in commit message. But I'd rather not do some
> > > > > > > magical default order instead. We can set 0 to mean invalid and error
> > > > > > > out, or just do SELF as the very first value (and if user forgot to
> > > > > > > specify more fancy mode, they hopefully will quickly discover this in
> > > > > > > their testing).
> > > > > > >
> > > > > >
> > > > > > PRE/POST/UP are tree-specific orders. SELF applies on all iters and
> > > > > > yields only a single object. How does task_iter express a non-self
> > > > > > order? By non-self, I mean something like "I don't care about the
> > > > > > order, just scan _all_ the objects". And this "don't care" order, IMO,
> > > > > > may be the common case. I don't think everyone cares about walking
> > > > > > order for tasks. The DEFAULT is intentionally put at the first value,
> > > > > > so that if users don't care about order, they don't have to specify
> > > > > > this field.
> > > > > >
> > > > > > If that sounds valid, maybe using "UNSPEC" instead of "DEFAULT" is better?
> > > > >
> > > > > I agree with Andrii.
> > > > > This:
> > > > > +       if (order == BPF_ITER_ORDER_DEFAULT)
> > > > > +               order = BPF_ITER_DESCENDANTS_PRE;
> > > > >
> > > > > looks like an arbitrary choice.
> > > > > imo
> > > > > BPF_ITER_DESCENDANTS_PRE = 0,
> > > > > would have been more obvious. No need to dig into definition of "default".
> > > > >
> > > > > UNSPEC = 0
> > > > > is fine too if we want user to always be conscious about the order
> > > > > and the kernel will error if that field is not initialized.
> > > > > That would be my preference, since it will match the rest of uapi/bpf.h
> > > > >
> > > >
> > > > Sounds good. In the next version, will use
> > > >
> > > > enum bpf_iter_order {
> > > >         BPF_ITER_ORDER_UNSPEC = 0,
> > > >         BPF_ITER_SELF_ONLY,             /* process only a single object. */
> > > >         BPF_ITER_DESCENDANTS_PRE,       /* walk descendants in pre-order. */
> > > >         BPF_ITER_DESCENDANTS_POST,      /* walk descendants in post-order. */
> > > >         BPF_ITER_ANCESTORS_UP,          /* walk ancestors upward. */
> > > > };
> > > >
> > >
> > > Sigh, I find that having UNSPEC=0 and erroring out when seeing UNSPEC
> > > doesn't work. Basically, if we have a non-iter prog and a cgroup_iter
> > > prog written in the same source file, I can't use
> > > bpf_object__attach_skeleton to attach them. Because the default
> > > prog_attach_fn for iter initializes `order` to 0 (that is, UNSPEC),
> > > which is going to be rejected by the kernel. In order to make
> > > bpf_object__attach_skeleton work on cgroup_iter, I think I need to use
> > > the following
> > >
> > > enum bpf_iter_order {
>
> so first of all, this can't be called "bpf_iter_order" as it doesn't
> apply to BPF iterators in general. I think this should be called
> bpf_iter_cgroup_order (or maybe bpf_cgroup_iter_order) and if/when we
> add ability to iterate tasks within cgroups then we'll just reuse enum
> bpf_iter_cgroup_order as an extra parameter for task iterator.
>
> And with that future case in mind I do think that we should have 0
> being "UNSPEC" case.
>

Ok.

> > >         BPF_ITER_DESCENDANTS_PRE,       /* walk descendants in pre-order. */
> > >         BPF_ITER_DESCENDANTS_POST,      /* walk descendants in post-order. */
> > >         BPF_ITER_ANCESTORS_UP,          /* walk ancestors upward. */
> > >         BPF_ITER_SELF_ONLY,             /* process only a single object. */
> > > };
> > >
> > > So that when calling bpf_object__attach_skeleton() on cgroup_iter, a
> > > link can be generated and the generated link defaults to pre-order
> > > walk on the whole hierarchy. Is there a better solution?
> > >
>
> I was actually surprised that we specify these additional parameters
> at attach (LINK_CREATE) time, and not at bpf_iter_create() call time.
> It seems more appropriate to allow to specify such runtime parameters
> very late, when we create a specific instance of seq_file. But I guess
> this was done because one of the initial motivations for iterators was
> to be pinned in BPFFS and read as a file, so it was more convenient to
> store such parameters upfront at link creation time to keep
> BPF_OBJ_PIN simpler. I guess it makes sense, worst case you'll need to
> create multiple bpf_link files, one for each cgroup hierarchy you'd
> like to query with the same single BPF program.
>

Right. That was the design from the beginning.

> But I digress.
>
> As for not being able to auto-attach cgroup iterator. I think that's
> sort of expected and is in line with not being able to auto-attach
> cgroup programs, as you need cgroup FD at runtime. So even if you had
> some reasonable default order, you still would need to specify target
> cgroup (either through FD or ID).
>
> So... either don't do skeleton auto-attach,

This is not okay IMHO. It would be very inconvenient to use.

> or let's teach libbpf code
> to not auto-attach some iter types?
>

I'm thinking of two options:

1. Maybe we could add libbpf APIs for disabling auto-attach just like
prog autoload. Like:

bpf_program__set_auto_attach()
bpf_program__get_auto_attach(...)

2. In auto-attach, if the program's link is already set, attach will
be skipped. So, we could just manually attach, which specifies the
order, and set the link in skeleton. This way, no change in libbpf is
needed. Does this sound good to you?

> Alternatively, we could teach libbpf to parse some sort of cgroup
> iterator spec, like:
>
> SEC("iter/cgroup:/path/to/cgroup:descendants_pre")
>
> But this approach won't work for a bunch of other parameterized
> iterators (e.g., task iter, or map elem iter), so I'm hesitant about
> adding this to libbpf as a generic functionality.
>

Agree. Let's explore other options first.

> >
> > I think this can be handled by userspace? We can attach the
> > cgroup_iter separately first (and maybe we will need to set prog->link
> > as well) so that bpf_object__attach_skeleton() doesn't try to attach
> > it? I am following this pattern in the selftest in the final patch,
> > although I think I might be missing setting prog->link, so I am
> > wondering why there are no issues in that selftest which has the same
> > scenario that you are talking about.
> >
> > I think such a pattern will need to be used anyway if the users need
> > to set any non-default arguments for the cgroup_iter prog (like the
> > selftest), right? The only case we are discussing here is the case
> > where the user wants to attach the cgroup_iter with all default
> > options (in which case the default order will fail).
> > I agree that it might be inconvenient if the default/uninitialized
> > options don't work for cgroup_iter, but Alexei pointed out that this
> > matches other bpf uapis.
> >
> > My concern is that in the future we try to reuse enum bpf_iter_order
> > to set ordering for other iterators, and then the
> > default/uninitialized value (BPF_ITER_DESCENDANTS_PRE) doesn't make
> > sense for that iterator (e.g. not a tree). In this case, the same
> > problem that we are avoiding for cgroup_iter here will show up for
> > that iterator, and we can't easily change it at this point because
> > it's uapi.
>
> Yep, valid concern, I agree.
>

Andrii, other than auto-attach, do you have any concern for the rest
of this patchset?

> >
> >
> > > > and explicitly list the values acceptable by cgroup_iter, error out if
> > > > UNSPEC is detected.
> > > >
> > > > Also, following Andrii's comments, will change BPF_ITER_SELF to
> > > > BPF_ITER_SELF_ONLY, which does seem a little bit explicit in
> > > > comparison.
> > > >
> > > > > I applied the first 3 patches to ease respin.
> > > >
> > > > Thanks! This helps!
> > > >
> > > > > Thanks!

^ permalink raw reply

* python-eventlet test broken in 5.19 [was: Revert "tcp: change pingpong threshold to 3"]
From: Jiri Slaby @ 2022-08-16  5:48 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Wei Wang, David Miller, Eric Dumazet, Jakub Kicinski, netdev,
	Soheil Hassas Yeganeh, Yuchung Cheng, LemmyHuang, stable,
	torvalds@linux-foundation.org, temotor, jakub
In-Reply-To: <CADVnQykRMcumBjxND9E4nSxqA-s3exR3AzJ6+Nf0g+s5H6dqeQ@mail.gmail.com>

Cc eventlet guys + Linus.

On 15. 08. 22, 15:30, Neal Cardwell wrote:
> On Mon, Aug 15, 2022 at 3:48 AM Jiri Slaby <jirislaby@kernel.org> wrote:
>>
>> On 06. 08. 22, 16:41, Jiri Slaby wrote:
>>> On 06. 08. 22, 13:24, Neal Cardwell wrote:
>>>> On Sat, Aug 6, 2022 at 6:02 AM Jiri Slaby <jirislaby@kernel.org> wrote:
>>>>>
>>>>> On 21. 07. 22, 22:44, Wei Wang wrote:
>>>>>> This reverts commit 4a41f453bedfd5e9cd040bad509d9da49feb3e2c.
>>>>>>
>>>>>> This to-be-reverted commit was meant to apply a stricter rule for the
>>>>>> stack to enter pingpong mode. However, the condition used to check for
>>>>>> interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
>>>>>> jiffy based and might be too coarse, which delays the stack entering
>>>>>> pingpong mode.
>>>>>> We revert this patch so that we no longer use the above condition to
>>>>>> determine interactive session, and also reduce pingpong threshold to 1.
>>>>>>
>>>>>> Fixes: 4a41f453bedf ("tcp: change pingpong threshold to 3")
>>>>>> Reported-by: LemmyHuang <hlm3280@163.com>
>>>>>> Suggested-by: Neal Cardwell <ncardwell@google.com>
>>>>>> Signed-off-by: Wei Wang <weiwan@google.com>
>>>>>
>>>>>
>>>>> This breaks python-eventlet [1] (and was backported to stable trees):
>>>>> ________________ TestHttpd.test_018b_http_10_keepalive_framing
>>>>> _________________
>>>>>
>>>>> self = <tests.wsgi_test.TestHttpd
>>>>> testMethod=test_018b_http_10_keepalive_framing>
>>>>>
>>>>>        def test_018b_http_10_keepalive_framing(self):
>>>>>            # verify that if an http/1.0 client sends connection:
>>>>> keep-alive
>>>>>            # that we don't mangle the request framing if the app doesn't
>>>>> read the request
>>>>>            def app(environ, start_response):
>>>>>                resp_body = {
>>>>>                    '/1': b'first response',
>>>>>                    '/2': b'second response',
>>>>>                    '/3': b'third response',
>>>>>                }.get(environ['PATH_INFO'])
>>>>>                if resp_body is None:
>>>>>                    resp_body = 'Unexpected path: ' + environ['PATH_INFO']
>>>>>                    if six.PY3:
>>>>>                        resp_body = resp_body.encode('latin1')
>>>>>                # Never look at wsgi.input!
>>>>>                start_response('200 OK', [('Content-type', 'text/plain')])
>>>>>                return [resp_body]
>>>>>
>>>>>            self.site.application = app
>>>>>            sock = eventlet.connect(self.server_addr)
>>>>>            req_body = b'GET /tricksy HTTP/1.1\r\n'
>>>>>            body_len = str(len(req_body)).encode('ascii')
>>>>>
>>>>>            sock.sendall(b'PUT /1 HTTP/1.0\r\nHost:
>>>>> localhost\r\nConnection: keep-alive\r\n'
>>>>>                         b'Content-Length: ' + body_len + b'\r\n\r\n' +
>>>>> req_body)
>>>>>            result1 = read_http(sock)
>>>>>            self.assertEqual(b'first response', result1.body)
>>>>>            self.assertEqual(result1.headers_original.get('Connection'),
>>>>> 'keep-alive')
>>>>>
>>>>>            sock.sendall(b'PUT /2 HTTP/1.0\r\nHost:
>>>>> localhost\r\nConnection: keep-alive\r\n'
>>>>>                         b'Content-Length: ' + body_len + b'\r\nExpect:
>>>>> 100-continue\r\n\r\n')
>>>>>            # Client may have a short timeout waiting on that 100 Continue
>>>>>            # and basically immediately send its body
>>>>>            sock.sendall(req_body)
>>>>>            result2 = read_http(sock)
>>>>>            self.assertEqual(b'second response', result2.body)
>>>>>            self.assertEqual(result2.headers_original.get('Connection'),
>>>>> 'close')
>>>>>
>>>>>    >       sock.sendall(b'PUT /3 HTTP/1.0\r\nHost:
>>>>> localhost\r\nConnection: close\r\n\r\n')
>>>>>
>>>>> tests/wsgi_test.py:648:
>>>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>>>> _ _ _ _
>>>>> eventlet/greenio/base.py:407: in sendall
>>>>>        tail = self.send(data, flags)
>>>>> eventlet/greenio/base.py:401: in send
>>>>>        return self._send_loop(self.fd.send, data, flags)
>>>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>>>> _ _ _ _
>>>>>
>>>>> self = <eventlet.greenio.base.GreenSocket object at 0x7f5f2f73c9a0>
>>>>> send_method = <built-in method send of socket object at 0x7f5f2f73d520>
>>>>> data = b'PUT /3 HTTP/1.0\r\nHost: localhost\r\nConnection:
>>>>> close\r\n\r\n'
>>>>> args = (0,), _timeout_exc = timeout('timed out'), eno = 32
>>>>>
>>>>>        def _send_loop(self, send_method, data, *args):
>>>>>            if self.act_non_blocking:
>>>>>                return send_method(data, *args)
>>>>>
>>>>>            _timeout_exc = socket_timeout('timed out')
>>>>>            while True:
>>>>>                try:
>>>>>    >               return send_method(data, *args)
>>>>> E               BrokenPipeError: [Errno 32] Broken pipe
>>>>>
>>>>> eventlet/greenio/base.py:388: BrokenPipeError
>>>>> ====================
>>>>>
>>>>> Reverting this revert on the top of 5.19 solves the issue.
>>>>>
>>>>> Any ideas?
>>>>
>>>> Interesting. This revert should return the kernel back to the delayed
>>>> ACK behavior it had for many years before May 2019 and Linux 5.1,
>>>> which contains the commit it is reverting:
>>>>
>>>>     4a41f453bedfd tcp: change pingpong threshold to 3
>>>>
>>>> It sounds like perhaps this test you mention has an implicit
>>>> dependence on the timing of delayed ACKs.
>>>>
>>>> A few questions:
>>>
>>> Dunno. I am only an openSUSE kernel maintainer and this popped out at
>>> me. Feel free to dig to eventlet's sources on your own :P.
>>
>> Any updates on this or should I send a revert directly?
>>
>> The "before() &&" part of the patch makes the difference. That is this diff:
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -172,9 +172,17 @@ static void tcp_event_data_sent(struct tcp_sock *tp,
>>            * and it is a reply for ato after last received packet,
>>            * increase pingpong count.
>>            */
>> -       if (before(tp->lsndtime, icsk->icsk_ack.lrcvtime) &&
>> -           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
>> +       pr_info("%s: sk=%p (%llx:%x) now=%u lsndtime=%u lrcvtime=%u
>> ping=%u\n",
>> +                       __func__, sk, sk->sk_addrpair, sk->sk_portpair, now,
>> +                       tp->lsndtime, icsk->icsk_ack.lrcvtime,
>> +                       inet_csk(sk)->icsk_ack.pingpong);
>> +       if (//before(tp->lsndtime, icsk->icsk_ack.lrcvtime) &&
>> +           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato) {
>>                   inet_csk_inc_pingpong_cnt(sk);
>> +               pr_info("\tINC ping=%u before=%u\n",
>> +                               inet_csk(sk)->icsk_ack.pingpong,
>> +                               before(tp->lsndtime,
>> icsk->icsk_ack.lrcvtime));
>> +       }
>>
>>           tp->lsndtime = now;
>>    }
>>
>> makes it work again, and outputs this:
>>
>>   > TCP: tcp_event_data_sent: sk=00000000fd67cf8d
>> (100007f0100007f:e858b18b) now=4294902140 lsndtime=4294902140
>> lrcvtime=4294902140 ping=0
>>   > TCP: tcp_event_data_sent: sk=00000000a4becf82
>> (100007f0100007f:8bb158e8) now=4294902143 lsndtime=4294902140
>> lrcvtime=4294902142 ping=0
>>   > TCP:     INC ping=1 before=1
>>   > TCP: tcp_event_data_sent: sk=00000000fd67cf8d
>> (100007f0100007f:e858b18b) now=4294902145 lsndtime=4294902140
>> lrcvtime=4294902144 ping=0
>>   > TCP:     INC ping=1 before=1
>>   > TCP: tcp_event_data_sent: sk=00000000fd67cf8d
>> (100007f0100007f:e858b18b) now=4294902147 lsndtime=4294902145
>> lrcvtime=4294902144 ping=1
>>   > TCP:     INC ping=2 before=0
>>
>> IMO, this "before=0" is the "source" of the problem. But I have no idea
>> what this means at all...
>>
>>   > TCP: tcp_event_data_sent: sk=00000000a4becf82
>> (100007f0100007f:8bb158e8) now=4294902149 lsndtime=4294902143
>> lrcvtime=4294902148 ping=1
>>   > TCP:     INC ping=2 before=1
>>   > TCP: tcp_event_data_sent: sk=00000000fd67cf8d
>> (100007f0100007f:e858b18b) now=4294902151 lsndtime=4294902147
>> lrcvtime=4294902150 ping=3
>>   > TCP:     INC ping=4 before=1
>>   > TCP: tcp_event_data_sent: sk=00000000c7a417e9
>> (100007f0100007f:e85ab18b) now=4294902153 lsndtime=4294902153
>> lrcvtime=4294902153 ping=0
>>   > TCP: tcp_event_data_sent: sk=000000008681183e
>> (100007f0100007f:8bb15ae8) now=4294902155 lsndtime=4294902153
>> lrcvtime=4294902154 ping=0
>>   > TCP:     INC ping=1 before=1
> 
> It sounds like this test has a very specific dependence on the buggy
> delayed ACK timing behavior from the buggy commit
> 4a41f453bedfd5e9cd040bad509d9da49feb3e2c.
> 
> IMHO I don't think we can revert a kernel bug fix based on a test that
> decided to depend on the exact timing of delayed ACKs during a time
> when that delayed ACK behavior was buggy. :-)

Unfortunately despite the test is likely bogus (I am unable to say it is 
or not), it does happen and the patch (the revert -- 4d8f24eeedc) breaks 
userspace. I'd say this is exactly the case where we apply "we do not 
break userspace". But I might be wrong as we might not care about silly 
tests.

In anyway, openSUSE has to have the patch (the revert) reverted, so that 
the distro actually builds/works. (Until this is fixed on the eventlet 
side at least. And more importantly _until_ it propagates to distros or 
is fixed otherwise (like disabling the test).) And I suppose other 
distros would have to do the same. That is quite unfortunate :/.

thanks,
-- 
js
suse labs


^ permalink raw reply

* [PATCH bpf-next v2] selftests/bpf: fix attach point for non-x86 arches in test_progs/lsm
From: Artem Savkov @ 2022-08-16  5:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf, netdev
  Cc: linux-kernel, Artem Savkov
In-Reply-To: <376f20c5-4b1c-efec-4dde-43d91b3d4308@iogearbox.net>

Use SYS_PREFIX macro from bpf_misc.h instead of hard-coded '__x64_'
prefix for sys_setdomainname attach point in lsm test.

Signed-off-by: Artem Savkov <asavkov@redhat.com>
---
 tools/testing/selftests/bpf/DENYLIST.s390x | 2 +-
 tools/testing/selftests/bpf/progs/lsm.c    | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/DENYLIST.s390x b/tools/testing/selftests/bpf/DENYLIST.s390x
index e33cab34d22f..9d8de15e725e 100644
--- a/tools/testing/selftests/bpf/DENYLIST.s390x
+++ b/tools/testing/selftests/bpf/DENYLIST.s390x
@@ -43,7 +43,7 @@ test_bpffs                               # bpffs test  failed 255
 test_bprm_opts                           # failed to auto-attach program 'secure_exec': -524                           (trampoline)
 test_ima                                 # failed to auto-attach program 'ima': -524                                   (trampoline)
 test_local_storage                       # failed to auto-attach program 'unlink_hook': -524                           (trampoline)
-test_lsm                                 # failed to find kernel BTF type ID of '__x64_sys_setdomainname': -3          (?)
+test_lsm                                 # attach unexpected error: -524                                               (trampoline)
 test_overhead                            # attach_fentry unexpected error: -524                                        (trampoline)
 test_profiler                            # unknown func bpf_probe_read_str#45                                          (overlapping)
 timer                                    # failed to auto-attach program 'test1': -524                                 (trampoline)
diff --git a/tools/testing/selftests/bpf/progs/lsm.c b/tools/testing/selftests/bpf/progs/lsm.c
index 33694ef8acfa..d8d8af623bc2 100644
--- a/tools/testing/selftests/bpf/progs/lsm.c
+++ b/tools/testing/selftests/bpf/progs/lsm.c
@@ -4,6 +4,7 @@
  * Copyright 2020 Google LLC.
  */
 
+#include "bpf_misc.h"
 #include "vmlinux.h"
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_tracing.h>
@@ -160,7 +161,7 @@ int BPF_PROG(test_task_free, struct task_struct *task)
 
 int copy_test = 0;
 
-SEC("fentry.s/__x64_sys_setdomainname")
+SEC("fentry.s/" SYS_PREFIX "sys_setdomainname")
 int BPF_PROG(test_sys_setdomainname, struct pt_regs *regs)
 {
 	void *ptr = (void *)PT_REGS_PARM1(regs);
-- 
2.37.1


^ permalink raw reply related

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
From: Oliver Sang @ 2022-08-16  5:52 UTC (permalink / raw)
  To: Feng Tang
  Cc: Roman Gushchin, Shakeel Butt, Eric Dumazet, Linux MM,
	Andrew Morton, Michal Hocko, Johannes Weiner, Muchun Song,
	Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu
In-Reply-To: <20220705050326.GF62281@shbuild999.sh.intel.com>

Hi all,

now we noticed this commit has already merged into mainline, and in our tests
there is still similar regression. [1]

not sure if there is a plan to merge some of the solutions that have been
discussed in this thread? we'll very glad to test patches if there is a request

Thanks a lot!

[1]
=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/ip/runtime/nr_threads/cluster/send_size/test/cpufreq_governor/ucode:
  lkp-icl-2sp4/netperf/debian-11.1-x86_64-20220510.cgz/x86_64-rhel-8.3/gcc-11/ipv4/300s/50%/cs-localhost/10K/SCTP_STREAM_MANY/performance/0xd000363

7c80b038d23e1f4c 4890b686f4088c90432149bd6de
---------------- ---------------------------
         %stddev     %change         %stddev
             \          |                \
      9078           -55.9%       4006        netperf.Throughput_Mbps
    581006           -55.9%     256385        netperf.Throughput_total_Mbps
     36715           -54.6%      16674 ±  4%  netperf.time.involuntary_context_switches
      1885           -50.2%     938.33 ±  3%  netperf.time.percent_of_cpu_this_job_got
      5533           -49.9%       2771 ±  2%  netperf.time.system_time
    152.13           -59.5%      61.61 ±  2%  netperf.time.user_time
    418171 ±  5%     +89.4%     791954 ± 17%  netperf.time.voluntary_context_switches
 2.128e+09           -55.9%  9.389e+08        netperf.workload
     30217           +17.8%      35608        uptime.idle
 2.689e+10           +20.3%  3.234e+10        cpuidle..time
 6.366e+08           -48.1%  3.305e+08        cpuidle..usage
     70.26           +13.5       83.78        mpstat.cpu.all.idle%
      4.46            -1.5        2.92 ±  3%  mpstat.cpu.all.soft%
     23.71           -11.6       12.16 ±  3%  mpstat.cpu.all.sys%
      0.89            -0.5        0.38        mpstat.cpu.all.usr%
 1.392e+09           -57.5%   5.91e+08 ± 12%  numa-numastat.node0.local_node
 1.389e+09           -57.5%  5.906e+08 ± 12%  numa-numastat.node0.numa_hit
 1.369e+09           -54.5%  6.226e+08 ± 12%  numa-numastat.node1.local_node
 1.366e+09           -54.4%  6.222e+08 ± 12%  numa-numastat.node1.numa_hit
     36715           -54.6%      16674 ±  4%  time.involuntary_context_switches
      1885           -50.2%     938.33 ±  3%  time.percent_of_cpu_this_job_got
      5533           -49.9%       2771 ±  2%  time.system_time
    152.13           -59.5%      61.61 ±  2%  time.user_time
    418171 ±  5%     +89.4%     791954 ± 17%  time.voluntary_context_switches


On Tue, Jul 05, 2022 at 01:03:26PM +0800, Feng Tang wrote:
> On Sun, Jul 03, 2022 at 03:55:31PM -0700, Roman Gushchin wrote:
> > On Sun, Jul 03, 2022 at 06:43:53PM +0800, Feng Tang wrote:
> > > Hi Shakeel,
> > > 
> > > On Fri, Jul 01, 2022 at 08:47:29AM -0700, Shakeel Butt wrote:
> > > > On Mon, Jun 27, 2022 at 8:49 PM Feng Tang <feng.tang@intel.com> wrote:
> > > > > I just tested it, it does perform better (the 4th is with your patch),
> > > > > some perf-profile data is also listed.
> > > > >
> > > > >  7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0 e719635902654380b23ffce908d
> > > > > ---------------- --------------------------- --------------------------- ---------------------------
> > > > >      15722           -69.5%       4792           -40.8%       9300           -27.9%      11341        netperf.Throughput_Mbps
> > > > >
> > > > >       0.00            +0.3        0.26 ±  5%      +0.5        0.51            +1.3        1.27 ±  2%pp.self.__sk_mem_raise_allocated
> > > > >       0.00            +0.3        0.32 ± 15%      +1.7        1.74 ±  2%      +0.4        0.40 ±  2%  pp.self.propagate_protected_usage
> > > > >       0.00            +0.8        0.82 ±  7%      +0.9        0.90            +0.8        0.84        pp.self.__mod_memcg_state
> > > > >       0.00            +1.2        1.24 ±  4%      +1.0        1.01            +1.4        1.44        pp.self.try_charge_memcg
> > > > >       0.00            +2.1        2.06            +2.1        2.13            +2.1        2.11        pp.self.page_counter_uncharge
> > > > >       0.00            +2.1        2.14 ±  4%      +2.7        2.71            +2.6        2.60 ±  2%  pp.self.page_counter_try_charge
> > > > >       1.12 ±  4%      +3.1        4.24            +1.1        2.22            +1.4        2.51        pp.self.native_queued_spin_lock_slowpath
> > > > >       0.28 ±  9%      +3.8        4.06 ±  4%      +0.2        0.48            +0.4        0.68        pp.self.sctp_eat_data
> > > > >       0.00            +8.2        8.23            +0.8        0.83            +1.3        1.26        pp.self.__sk_mem_reduce_allocated
> > > > >
> > > > > And the size of 'mem_cgroup' is increased from 4224 Bytes to 4608.
> > > > 
> > > > Hi Feng, can you please try two more configurations? Take Eric's patch
> > > > of adding ____cacheline_aligned_in_smp in page_counter and for first
> > > > increase MEMCG_CHARGE_BATCH to 64 and for second increase it to 128.
> > > > Basically batch increases combined with Eric's patch.
> > > 
> > > With increasing batch to 128, the regression could be reduced to -12.4%.
> > 
> > If we're going to bump it, I wonder if we should scale it dynamically depending
> > on the size of the memory cgroup?
>  
> I think it makes sense, or also make it a configurable parameter? From 
> the test reports of 0Day, these charging/counting play critical role
> in performance (easy to see up to 60% performance effect). If user only
> wants memcg for isolating things or doesn't care charging/stats, these
> seem to be extra taxes.
> 
> For bumping to 64 or 128, universal improvement is expected with the
> only concern of accuracy.
> 
> Thanks,
> Feng
> 
> > Thanks!

^ permalink raw reply

* Re: BUG: corrupted list in insert_work
From: syzbot @ 2022-08-16  5:58 UTC (permalink / raw)
  To: andrii, ast, bpf, brauner, cgroups, daniel, dvyukov, hannes,
	hdanton, john.fastabend, kafai, kpsingh, linux-kernel, linux-mm,
	lizefan.x, lkp, lkp, mkoutny, netdev, oliver.sang, songliubraving,
	stable, syzkaller-android-bugs, syzkaller-bugs, tadeusz.struk, tj,
	yhs
In-Reply-To: <000000000000aaac8505dc135b07@google.com>

syzbot suspects this issue was fixed by commit:

commit d007f49ab789bee8ed76021830b49745d5feaf61
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Wed May 18 06:13:40 2022 +0000

    percpu_ref_init(): clean ->percpu_count_ref on failure

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=10c66b6b080000
start commit:   ebc9fb07d294 ANDROID: random: fix CRC issues with the merge
git tree:       android12-5.10-lts
kernel config:  https://syzkaller.appspot.com/x/.config?x=32c952ff4a8ff8c1
dashboard link: https://syzkaller.appspot.com/bug?extid=e42ae441c3b10acf9e9d
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=172a9074080000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10456caa080000

If the result looks correct, please mark the issue as fixed by replying with:

#syz fix: percpu_ref_init(): clean ->percpu_count_ref on failure

For information about bisection process see: https://goo.gl/tpsmEJ#bisection

^ permalink raw reply

* [PATCH v4 0/3] J7200: CPSW5G: Add support for QSGMII mode to am65-cpsw driver
From: Siddharth Vadapalli @ 2022-08-16  6:01 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, robh+dt, krzysztof.kozlowski+dt,
	linux, vladimir.oltean, grygorii.strashko, vigneshr, nsekhar
  Cc: netdev, devicetree, linux-kernel, kishon, s-vadapalli

Add support for QSGMII mode to am65-cpsw driver.

Change log:

v3 -> v4:
1. Update bindings to disallow ports based on compatible, instead of
   adding a new if/then statement for the new compatible.
2. Add Else-If condition for RMII mode in the set of supported interfaces.
   Support for RMII mode is already present in the driver and I had
   missed out adding a condition for RMII mode in the previous patches.

v2 -> v3:
1. In ti,k3-am654-cpsw-nuss.yaml, restrict if/then statement to port
   nodes.

v1 -> v2:
1. Add new compatible for CPSW5G in ti,k3-am654-cpsw-nuss.yaml and extend
   properties for new compatible.
2. Add extra_modes member to struct am65_cpsw_pdata to be used for QSGMII
   mode by new compatible.
3. Add check for phylink supported modes to ensure that only one phy mode
   is advertised as supported.
4. Check if extra_modes supports QSGMII mode in am65_cpsw_nuss_mac_config()
   for register write.
5. Add check for assigning port->sgmii_base only when extra_modes is valid.

v3: https://lore.kernel.org/r/20220606110443.30362-1-s-vadapalli@ti.com/
v2: https://lore.kernel.org/r/20220602114558.6204-1-s-vadapalli@ti.com/
v1: https://lore.kernel.org/r/20220531113058.23708-1-s-vadapalli@ti.com/

Siddharth Vadapalli (3):
  dt-bindings: net: ti: k3-am654-cpsw-nuss: Update bindings for J7200
    CPSW5G
  net: ethernet: ti: am65-cpsw: Add support for J7200 CPSW5G
  net: ethernet: ti: am65-cpsw: Move phy_set_mode_ext() to correct
    location

 .../bindings/net/ti,k3-am654-cpsw-nuss.yaml   | 17 ++++++-
 drivers/net/ethernet/ti/am65-cpsw-nuss.c      | 44 ++++++++++++++++---
 drivers/net/ethernet/ti/am65-cpsw-nuss.h      |  2 +
 3 files changed, 54 insertions(+), 9 deletions(-)

--
2.25.1


^ permalink raw reply

* [PATCH v4 2/3] net: ethernet: ti: am65-cpsw: Add support for J7200 CPSW5G
From: Siddharth Vadapalli @ 2022-08-16  6:01 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, robh+dt, krzysztof.kozlowski+dt,
	linux, vladimir.oltean, grygorii.strashko, vigneshr, nsekhar
  Cc: netdev, devicetree, linux-kernel, kishon, s-vadapalli
In-Reply-To: <20220816060139.111934-1-s-vadapalli@ti.com>

CPSW5G in J7200 supports additional modes like QSGMII and SGMII.
Add new compatible for J7200 and enable QSGMII mode in am65-cpsw driver.

Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
---
 drivers/net/ethernet/ti/am65-cpsw-nuss.c | 35 ++++++++++++++++++++++--
 drivers/net/ethernet/ti/am65-cpsw-nuss.h |  2 ++
 2 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ti/am65-cpsw-nuss.c b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
index f4a6b590a1e3..033b40649308 100644
--- a/drivers/net/ethernet/ti/am65-cpsw-nuss.c
+++ b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
@@ -74,6 +74,9 @@
 #define AM65_CPSW_PORTN_REG_TS_VLAN_LTYPE_REG	0x318
 #define AM65_CPSW_PORTN_REG_TS_CTL_LTYPE2       0x31C
 
+#define AM65_CPSW_SGMII_CONTROL_REG		0x010
+#define AM65_CPSW_SGMII_CONTROL_MR_AN_ENABLE	BIT(0)
+
 #define AM65_CPSW_CTL_VLAN_AWARE		BIT(1)
 #define AM65_CPSW_CTL_P0_ENABLE			BIT(2)
 #define AM65_CPSW_CTL_P0_TX_CRC_REMOVE		BIT(13)
@@ -1409,7 +1412,14 @@ static const struct net_device_ops am65_cpsw_nuss_netdev_ops = {
 static void am65_cpsw_nuss_mac_config(struct phylink_config *config, unsigned int mode,
 				      const struct phylink_link_state *state)
 {
-	/* Currently not used */
+	struct am65_cpsw_slave_data *slave = container_of(config, struct am65_cpsw_slave_data,
+							  phylink_config);
+	struct am65_cpsw_port *port = container_of(slave, struct am65_cpsw_port, slave);
+	struct am65_cpsw_common *common = port->common;
+
+	if (common->pdata.extra_modes & BIT(state->interface))
+		writel(AM65_CPSW_SGMII_CONTROL_MR_AN_ENABLE,
+		       port->sgmii_base + AM65_CPSW_SGMII_CONTROL_REG);
 }
 
 static void am65_cpsw_nuss_mac_link_down(struct phylink_config *config, unsigned int mode,
@@ -1847,6 +1857,8 @@ static int am65_cpsw_nuss_init_slave_ports(struct am65_cpsw_common *common)
 		port->common = common;
 		port->port_base = common->cpsw_base + AM65_CPSW_NU_PORTS_BASE +
 				  AM65_CPSW_NU_PORTS_OFFSET * (port_id);
+		if (common->pdata.extra_modes)
+			port->sgmii_base = common->ss_base + AM65_CPSW_SGMII_BASE * (port_id);
 		port->stat_base = common->cpsw_base + AM65_CPSW_NU_STATS_BASE +
 				  (AM65_CPSW_NU_STATS_PORT_OFFSET * port_id);
 		port->name = of_get_property(port_np, "label", NULL);
@@ -1981,7 +1993,18 @@ am65_cpsw_nuss_init_port_ndev(struct am65_cpsw_common *common, u32 port_idx)
 	port->slave.phylink_config.type = PHYLINK_NETDEV;
 	port->slave.phylink_config.mac_capabilities = MAC_SYM_PAUSE | MAC_10 | MAC_100 | MAC_1000FD;
 
-	phy_interface_set_rgmii(port->slave.phylink_config.supported_interfaces);
+	if (phy_interface_mode_is_rgmii(port->slave.phy_if)) {
+		phy_interface_set_rgmii(port->slave.phylink_config.supported_interfaces);
+	} else if (port->slave.phy_if == PHY_INTERFACE_MODE_RMII) {
+		__set_bit(PHY_INTERFACE_MODE_RMII,
+			  port->slave.phylink_config.supported_interfaces);
+	} else if (common->pdata.extra_modes & BIT(port->slave.phy_if)) {
+		__set_bit(PHY_INTERFACE_MODE_QSGMII,
+			  port->slave.phylink_config.supported_interfaces);
+	} else {
+		dev_err(dev, "selected phy-mode is not supported\n");
+		return -EOPNOTSUPP;
+	}
 
 	phylink = phylink_create(&port->slave.phylink_config,
 				 of_node_to_fwnode(port->slave.phy_node),
@@ -2611,10 +2634,18 @@ static const struct am65_cpsw_pdata am64x_cpswxg_pdata = {
 	.fdqring_mode = K3_RINGACC_RING_MODE_RING,
 };
 
+static const struct am65_cpsw_pdata j7200_cpswxg_pdata = {
+	.quirks = 0,
+	.ale_dev_id = "am64-cpswxg",
+	.fdqring_mode = K3_RINGACC_RING_MODE_RING,
+	.extra_modes = BIT(PHY_INTERFACE_MODE_QSGMII),
+};
+
 static const struct of_device_id am65_cpsw_nuss_of_mtable[] = {
 	{ .compatible = "ti,am654-cpsw-nuss", .data = &am65x_sr1_0},
 	{ .compatible = "ti,j721e-cpsw-nuss", .data = &j721e_pdata},
 	{ .compatible = "ti,am642-cpsw-nuss", .data = &am64x_cpswxg_pdata},
+	{ .compatible = "ti,j7200-cpswxg-nuss", .data = &j7200_cpswxg_pdata},
 	{ /* sentinel */ },
 };
 MODULE_DEVICE_TABLE(of, am65_cpsw_nuss_of_mtable);
diff --git a/drivers/net/ethernet/ti/am65-cpsw-nuss.h b/drivers/net/ethernet/ti/am65-cpsw-nuss.h
index ac945631bf2f..2c9850fdfcb6 100644
--- a/drivers/net/ethernet/ti/am65-cpsw-nuss.h
+++ b/drivers/net/ethernet/ti/am65-cpsw-nuss.h
@@ -46,6 +46,7 @@ struct am65_cpsw_port {
 	const char			*name;
 	u32				port_id;
 	void __iomem			*port_base;
+	void __iomem			*sgmii_base;
 	void __iomem			*stat_base;
 	void __iomem			*fetch_ram_base;
 	bool				disabled;
@@ -88,6 +89,7 @@ struct am65_cpsw_rx_chn {
 
 struct am65_cpsw_pdata {
 	u32	quirks;
+	u64	extra_modes;
 	enum k3_ring_mode fdqring_mode;
 	const char	*ale_dev_id;
 };
-- 
2.25.1


^ permalink raw reply related

* [PATCH v4 3/3] net: ethernet: ti: am65-cpsw: Move phy_set_mode_ext() to correct location
From: Siddharth Vadapalli @ 2022-08-16  6:01 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, robh+dt, krzysztof.kozlowski+dt,
	linux, vladimir.oltean, grygorii.strashko, vigneshr, nsekhar
  Cc: netdev, devicetree, linux-kernel, kishon, s-vadapalli
In-Reply-To: <20220816060139.111934-1-s-vadapalli@ti.com>

In TI's J7200 SoC CPSW5G ports, each of the 4 ports can be configured
as a QSGMII main or QSGMII-SUB port. This configuration is performed
by phy-gmii-sel driver on invoking the phy_set_mode_ext() function.

It is necessary for the QSGMII main port to be configured before any of
the QSGMII-SUB interfaces are brought up. Currently, the QSGMII-SUB
interfaces come up before the QSGMII main port is configured.

Fix this by moving the call to phy_set_mode_ext() from
am65_cpsw_nuss_ndo_slave_open() to am65_cpsw_nuss_init_slave_ports(),
thereby ensuring that the QSGMII main port is configured before any of
the QSGMII-SUB ports are brought up.

Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
---
 drivers/net/ethernet/ti/am65-cpsw-nuss.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/ti/am65-cpsw-nuss.c b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
index 033b40649308..7ef5d8208a4e 100644
--- a/drivers/net/ethernet/ti/am65-cpsw-nuss.c
+++ b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
@@ -593,11 +593,6 @@ static int am65_cpsw_nuss_ndo_slave_open(struct net_device *ndev)
 	/* mac_sl should be configured via phy-link interface */
 	am65_cpsw_sl_ctl_reset(port);
 
-	ret = phy_set_mode_ext(port->slave.ifphy, PHY_MODE_ETHERNET,
-			       port->slave.phy_if);
-	if (ret)
-		goto error_cleanup;
-
 	ret = phylink_of_phy_connect(port->slave.phylink, port->slave.phy_node, 0);
 	if (ret)
 		goto error_cleanup;
@@ -1898,6 +1893,10 @@ static int am65_cpsw_nuss_init_slave_ports(struct am65_cpsw_common *common)
 			goto of_node_put;
 		}
 
+		ret = phy_set_mode_ext(port->slave.ifphy, PHY_MODE_ETHERNET, port->slave.phy_if);
+		if (ret)
+			goto of_node_put;
+
 		ret = of_get_mac_address(port_np, port->slave.mac_addr);
 		if (ret) {
 			am65_cpsw_am654_get_efuse_macid(port_np,
-- 
2.25.1


^ permalink raw reply related

* [PATCH v4 1/3] dt-bindings: net: ti: k3-am654-cpsw-nuss: Update bindings for J7200 CPSW5G
From: Siddharth Vadapalli @ 2022-08-16  6:01 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, robh+dt, krzysztof.kozlowski+dt,
	linux, vladimir.oltean, grygorii.strashko, vigneshr, nsekhar
  Cc: netdev, devicetree, linux-kernel, kishon, s-vadapalli
In-Reply-To: <20220816060139.111934-1-s-vadapalli@ti.com>

Update bindings for TI K3 J7200 SoC which contains 5 ports (4 external
ports) CPSW5G module and add compatible for it.

Changes made:
    - Add new compatible ti,j7200-cpswxg-nuss for CPSW5G.
    - Extend pattern properties for new compatible.
    - Change maximum number of CPSW ports to 4 for new compatible.

Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
---
 .../bindings/net/ti,k3-am654-cpsw-nuss.yaml     | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml b/Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml
index b8281d8be940..5366a367c387 100644
--- a/Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml
+++ b/Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml
@@ -57,6 +57,7 @@ properties:
       - ti,am654-cpsw-nuss
       - ti,j721e-cpsw-nuss
       - ti,am642-cpsw-nuss
+      - ti,j7200-cpswxg-nuss
 
   reg:
     maxItems: 1
@@ -110,7 +111,7 @@ properties:
         const: 0
 
     patternProperties:
-      port@[1-2]:
+      "^port@[1-4]$":
         type: object
         description: CPSWxG NUSS external ports
 
@@ -119,7 +120,7 @@ properties:
         properties:
           reg:
             minimum: 1
-            maximum: 2
+            maximum: 4
             description: CPSW port number
 
           phys:
@@ -151,6 +152,18 @@ properties:
 
     additionalProperties: false
 
+if:
+  not:
+    properties:
+      compatible:
+        contains:
+          const: ti,j7200-cpswxg-nuss
+then:
+  properties:
+    ethernet-ports:
+      patternProperties:
+        "^port@[3-4]$": false
+
 patternProperties:
   "^mdio@[0-9a-f]+$":
     type: object
-- 
2.25.1


^ permalink raw reply related

* KASAN: slab-out-of-bounds in ipvlan_queue_xmit
From: Jiacheng Xu @ 2022-08-16  5:37 UTC (permalink / raw)
  To: linux-kernel, davem, edumazet; +Cc: security, netdev

Hello,

When using modified Syzkaller to fuzz the latest Linux kernel, the
following crash was triggered.

HEAD commit: 3d7cb6b04c3f Linux-5.19
git tree: upstream

kernel config: https://drive.google.com/file/d/1wgIUDwP5ho29AM-K7HhysSTfWFpfXYkG/view?usp=sharing
syz repro: https://drive.google.com/file/d/1GcF5GvQ2LkFCm_ij_gCwb2qOUSAbdzRT/view?usp=sharing
C reproducer: https://drive.google.com/file/d/1XcEYzdr7cGzdjP8epEUe2HMJ0DgbZRWl/view?usp=sharing

Environment:
Ubuntu 20.04 on Linux 5.4.0
QEMU 4.2.1:
qemu-system-x86_64 \
  -m 2G \
  -smp 2 \
  -kernel /home/workdir/bzImage \
  -append "console=ttyS0 root=/dev/sda earlyprintk=serial net.ifnames=0" \
  -drive file=/home/workdir/stretch.img,format=raw \
  -net user,host=10.0.2.10,hostfwd=tcp:127.0.0.1:10021-:22 \
  -net nic,model=e1000 \
  -enable-kvm \
  -nographic \
  -pidfile vm.pid \
  2>&1 | tee vm.log

If you fix this issue, please add the following tag to the commit:
Credits to Jiacheng Xu<stitch@zju.edu.cn>

==================================================================
BUG: KASAN: slab-out-of-bounds in ipvlan_queue_xmit+0x16c0/0x1950
Read of size 4 at addr ffff888042046fff by task repro/6447

CPU: 1 PID: 6447 Comm: repro Not tainted 5.19.0 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 4
Call Trace:
<TASK>
dump_stack_lvl+0xcd/0x134
print_report.cold+0xe5/0x659
? ipvlan_queue_xmit+0x16c0/0x1950
kasan_report+0x8a/0x1b0
? audit_kill_trees+0x2f0/0x300
? ipvlan_queue_xmit+0x16c0/0x1950
ipvlan_queue_xmit+0x16c0/0x1950
? ipvlan_handle_mode_l3+0x120/0x120
? skb_csum_hwoffload_help+0x1a0/0x1a0
? __sanitizer_cov_trace_pc+0x1a/0x40
? validate_xmit_xfrm+0x49d/0x10a0
? __sanitizer_cov_trace_pc+0x1a/0x40
? netif_skb_features+0x45b/0xbb0
? __sanitizer_cov_trace_pc+0x1a/0x40
? validate_xmit_skb+0x878/0xec0
? __sanitizer_cov_trace_pc+0x1a/0x40
ipvlan_start_xmit+0x45/0x190
__dev_direct_xmit+0x42d/0x630
? validate_xmit_skb_list+0x140/0x140
? packet_poll+0x4d0/0x4d0
? __sanitizer_cov_trace_pc+0x1a/0x40
? netdev_pick_tx+0x14f/0xbe0
packet_direct_xmit+0x1b8/0x2b0
packet_sendmsg+0x223e/0x4d50
? __sanitizer_cov_trace_pc+0x1a/0x40
? aa_label_sk_perm+0x89/0xe0
? __sanitizer_cov_trace_pc+0x1a/0x40
? aa_sk_perm+0x30f/0xa90
? tpacket_rcv+0x32c0/0x32c0
? aa_af_perm+0x230/0x230
? __sanitizer_cov_trace_pc+0x1a/0x40
? __sanitizer_cov_trace_pc+0x1a/0x40
? tpacket_rcv+0x32c0/0x32c0
sock_sendmsg+0xc3/0x120
__sys_sendto+0x21a/0x330
? __ia32_sys_getpeername+0xb0/0xb0
? x86_pmu_start+0x30/0x270
? syscall_enter_from_user_mode+0x1c/0x70
? rcu_read_lock_sched_held+0x9c/0xd0
? rcu_read_lock_bh_held+0xb0/0xb0
__x64_sys_sendto+0xdd/0x1b0
? lockdep_hardirqs_on+0x79/0x100
? syscall_enter_from_user_mode+0x21/0x70
do_syscall_64+0x35/0xb0
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f2dc0ae4469
Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 8
RSP: 002b:00007fff157b1d38 EFLAGS: 00000216 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f2dc0ae4469
RDX: 000000000000000e RSI: 0000000020000100 RDI: 0000000000000005
RBP: 00007fff157b1d60 R08: 0000000020000140 R09: 0000000000000014
R10: 0000000000000000 R11: 0000000000000216 R12: 000055dc98e00f10
R13: 00007fff157b1e70 R14: 0000000000000000 R15: 0000000000000000
</TASK>
Allocated by task 0:
(stack is not available)
The buggy address belongs to the object at ffff888042046000
which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 2047 bytes to the right of
2048-byte region [ffff888042046000, ffff888042046800)

The buggy address belongs to the physical page:
page:ffffea0001081000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x00
head:ffffea0001081000 order:3 compound_mapcount:0 compound_pincount:0
flags: 0x4fff00000010200(slab|head|node=1|zone=1|lastcpupid=0x7ff)
raw: 04fff00000010200 0000000000000000 dead000000000122 ffff888011842000
raw: 0000000000000000 0000000080080008 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd2a20(GFP_A4
prep_new_page+0x297/0x330
get_page_from_freelist+0x215f/0x3c50
__alloc_pages+0x321/0x710
alloc_pages+0x119/0x250
new_slab+0x2a9/0x3f0
___slab_alloc+0xd5a/0x1140
__slab_alloc.isra.0+0x4d/0xa0
__kmalloc_node_track_caller+0x321/0x440
kmalloc_reserve+0x32/0xd0
pskb_expand_head+0x148/0x1060
netlink_trim+0x1ea/0x240
netlink_broadcast+0x5b/0xd00
nlmsg_notify+0x8f/0x280
rtmsg_ifinfo_event.part.0+0xb6/0xe0
rtmsg_ifinfo+0x7f/0xa0
__dev_notify_flags+0x235/0x2c0
page last free stack trace:
free_pcp_prepare+0x51f/0xd00
free_unref_page+0x19/0x5b0
__unfreeze_partials+0x3d2/0x3f0
___cache_free+0x12c/0x140
qlist_free_all+0x6a/0x170
kasan_quarantine_reduce+0x13d/0x180
__kasan_slab_alloc+0xa2/0xc0
slab_post_alloc_hook+0x4d/0x4f0
kmem_cache_alloc+0x1be/0x460
getname_flags+0xd2/0x5b0
vfs_fstatat+0x73/0xb0
__do_sys_newlstat+0x8b/0x110
do_syscall_64+0x35/0xb0
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Memory state around the buggy address:
ffff888042046e80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888042046f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888042046f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                                                                ^
ffff888042047000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888042047080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ==================================================================
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 6447 Comm: repro Not tainted 5.19.0 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 4
Call Trace:
<TASK>
dump_stack_lvl+0xcd/0x134
panic+0x2d7/0x636
? panic_print_sys_info.part.0+0x10b/0x10b
? asm_sysvec_apic_timer_interrupt+0x16/0x20
? ipvlan_queue_xmit+0x16c0/0x1950
end_report.part.0+0x3f/0x7c
kasan_report.cold+0x8/0x12
? audit_kill_trees+0x2f0/0x300
? ipvlan_queue_xmit+0x16c0/0x1950
ipvlan_queue_xmit+0x16c0/0x1950
? ipvlan_handle_mode_l3+0x120/0x120
? skb_csum_hwoffload_help+0x1a0/0x1a0
? __sanitizer_cov_trace_pc+0x1a/0x40
? validate_xmit_xfrm+0x49d/0x10a0
? __sanitizer_cov_trace_pc+0x1a/0x40
? netif_skb_features+0x45b/0xbb0
? __sanitizer_cov_trace_pc+0x1a/0x40
? validate_xmit_skb+0x878/0xec0
? __sanitizer_cov_trace_pc+0x1a/0x40
ipvlan_start_xmit+0x45/0x190
__dev_direct_xmit+0x42d/0x630
? validate_xmit_skb_list+0x140/0x140
? packet_poll+0x4d0/0x4d0
? __sanitizer_cov_trace_pc+0x1a/0x40
? netdev_pick_tx+0x14f/0xbe0
packet_direct_xmit+0x1b8/0x2b0
packet_sendmsg+0x223e/0x4d50
? __sanitizer_cov_trace_pc+0x1a/0x40
? aa_label_sk_perm+0x89/0xe0
? __sanitizer_cov_trace_pc+0x1a/0x40
? aa_sk_perm+0x30f/0xa90
? tpacket_rcv+0x32c0/0x32c0
? aa_af_perm+0x230/0x230
? __sanitizer_cov_trace_pc+0x1a/0x40
? __sanitizer_cov_trace_pc+0x1a/0x40
? tpacket_rcv+0x32c0/0x32c0
sock_sendmsg+0xc3/0x120
__sys_sendto+0x21a/0x330
? __ia32_sys_getpeername+0xb0/0xb0
? x86_pmu_start+0x30/0x270
? syscall_enter_from_user_mode+0x1c/0x70
? rcu_read_lock_sched_held+0x9c/0xd0
? rcu_read_lock_bh_held+0xb0/0xb0
__x64_sys_sendto+0xdd/0x1b0
? lockdep_hardirqs_on+0x79/0x100
? syscall_enter_from_user_mode+0x21/0x70
do_syscall_64+0x35/0xb0
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f2dc0ae4469
Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 8
RSP: 002b:00007fff157b1d38 EFLAGS: 00000216 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f2dc0ae4469
RDX: 000000000000000e RSI: 0000000020000100 RDI: 0000000000000005
RBP: 00007fff157b1d60 R08: 0000000020000140 R09: 0000000000000014
R10: 0000000000000000 R11: 0000000000000216 R12: 000055dc98e00f10
R13: 00007fff157b1e70 R14: 0000000000000000 R15: 0000000000000000
</TASK>

Best Regards,
Jiacheng

^ permalink raw reply

* [PATCH v4 6/6] virtio: Revert "virtio: find_vqs() add arg sizes"
From: Michael S. Tsirkin @ 2022-08-16  5:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Xuan Zhuo, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, virtualization, netdev,
	Linus Torvalds, Jens Axboe, James Bottomley, Martin K. Petersen,
	Guenter Roeck, Greg KH, Richard Weinberger, Anton Ivanov,
	Johannes Berg, Hans de Goede, Mark Gross, Vadim Pasternak,
	Bjorn Andersson, Mathieu Poirier, Cornelia Huck, Halil Pasic,
	Eric Farman, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, linux-um,
	platform-driver-x86, linux-remoteproc, linux-s390, kvm
In-Reply-To: <20220816053602.173815-1-mst@redhat.com>

This reverts commit a10fba0377145fccefea4dc4dd5915b7ed87e546: the
proposed API isn't supported on all transports but no
effort was made to address this.

It might not be hard to fix if we want to: maybe just
rename size to size_hint and make sure legacy
transports ignore the hint.

But it's not sure what the benefit is in any case, so
let's drop it.

Fixes: a10fba037714 ("virtio: find_vqs() add arg sizes")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 arch/um/drivers/virtio_uml.c             |  2 +-
 drivers/platform/mellanox/mlxbf-tmfifo.c |  1 -
 drivers/remoteproc/remoteproc_virtio.c   |  1 -
 drivers/s390/virtio/virtio_ccw.c         |  1 -
 drivers/virtio/virtio_mmio.c             |  1 -
 drivers/virtio/virtio_pci_common.c       |  2 +-
 drivers/virtio/virtio_pci_common.h       |  2 +-
 drivers/virtio/virtio_pci_modern.c       |  7 ++-----
 drivers/virtio/virtio_vdpa.c             |  1 -
 include/linux/virtio_config.h            | 14 +++++---------
 10 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/arch/um/drivers/virtio_uml.c b/arch/um/drivers/virtio_uml.c
index 79e38afd4b91..e719af8bdf56 100644
--- a/arch/um/drivers/virtio_uml.c
+++ b/arch/um/drivers/virtio_uml.c
@@ -1011,7 +1011,7 @@ static struct virtqueue *vu_setup_vq(struct virtio_device *vdev,
 
 static int vu_find_vqs(struct virtio_device *vdev, unsigned nvqs,
 		       struct virtqueue *vqs[], vq_callback_t *callbacks[],
-		       const char * const names[], u32 sizes[], const bool *ctx,
+		       const char * const names[], const bool *ctx,
 		       struct irq_affinity *desc)
 {
 	struct virtio_uml_device *vu_dev = to_virtio_uml_device(vdev);
diff --git a/drivers/platform/mellanox/mlxbf-tmfifo.c b/drivers/platform/mellanox/mlxbf-tmfifo.c
index 8be13d416f48..1ae3c56b66b0 100644
--- a/drivers/platform/mellanox/mlxbf-tmfifo.c
+++ b/drivers/platform/mellanox/mlxbf-tmfifo.c
@@ -928,7 +928,6 @@ static int mlxbf_tmfifo_virtio_find_vqs(struct virtio_device *vdev,
 					struct virtqueue *vqs[],
 					vq_callback_t *callbacks[],
 					const char * const names[],
-					u32 sizes[],
 					const bool *ctx,
 					struct irq_affinity *desc)
 {
diff --git a/drivers/remoteproc/remoteproc_virtio.c b/drivers/remoteproc/remoteproc_virtio.c
index 81c4f5776109..0f7706e23eb9 100644
--- a/drivers/remoteproc/remoteproc_virtio.c
+++ b/drivers/remoteproc/remoteproc_virtio.c
@@ -158,7 +158,6 @@ static int rproc_virtio_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 				 struct virtqueue *vqs[],
 				 vq_callback_t *callbacks[],
 				 const char * const names[],
-				 u32 sizes[],
 				 const bool * ctx,
 				 struct irq_affinity *desc)
 {
diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
index 896896e32664..a10dbe632ef9 100644
--- a/drivers/s390/virtio/virtio_ccw.c
+++ b/drivers/s390/virtio/virtio_ccw.c
@@ -637,7 +637,6 @@ static int virtio_ccw_find_vqs(struct virtio_device *vdev, unsigned nvqs,
 			       struct virtqueue *vqs[],
 			       vq_callback_t *callbacks[],
 			       const char * const names[],
-			       u32 sizes[],
 			       const bool *ctx,
 			       struct irq_affinity *desc)
 {
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index dfcecfd7aba1..3ff746e3f24a 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -474,7 +474,6 @@ static int vm_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 		       struct virtqueue *vqs[],
 		       vq_callback_t *callbacks[],
 		       const char * const names[],
-		       u32 sizes[],
 		       const bool *ctx,
 		       struct irq_affinity *desc)
 {
diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
index 7ad734584823..ad258a9d3b9f 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -396,7 +396,7 @@ static int vp_find_vqs_intx(struct virtio_device *vdev, unsigned int nvqs,
 /* the config->find_vqs() implementation */
 int vp_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 		struct virtqueue *vqs[], vq_callback_t *callbacks[],
-		const char * const names[], u32 sizes[], const bool *ctx,
+		const char * const names[], const bool *ctx,
 		struct irq_affinity *desc)
 {
 	int err;
diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h
index a5ff838b85a5..23112d84218f 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -110,7 +110,7 @@ void vp_del_vqs(struct virtio_device *vdev);
 /* the config->find_vqs() implementation */
 int vp_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 		struct virtqueue *vqs[], vq_callback_t *callbacks[],
-		const char * const names[], u32 sizes[], const bool *ctx,
+		const char * const names[], const bool *ctx,
 		struct irq_affinity *desc);
 const char *vp_bus_name(struct virtio_device *vdev);
 
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index be51ec849252..c3b9f2761849 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -347,15 +347,12 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 static int vp_modern_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 			      struct virtqueue *vqs[],
 			      vq_callback_t *callbacks[],
-			      const char * const names[],
-			      u32 sizes[],
-			      const bool *ctx,
+			      const char * const names[], const bool *ctx,
 			      struct irq_affinity *desc)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
 	struct virtqueue *vq;
-	int rc = vp_find_vqs(vdev, nvqs, vqs, callbacks, names, sizes, ctx,
-			     desc);
+	int rc = vp_find_vqs(vdev, nvqs, vqs, callbacks, names, ctx, desc);
 
 	if (rc)
 		return rc;
diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c
index 832d2c5b1b19..9670cc79371d 100644
--- a/drivers/virtio/virtio_vdpa.c
+++ b/drivers/virtio/virtio_vdpa.c
@@ -269,7 +269,6 @@ static int virtio_vdpa_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 				struct virtqueue *vqs[],
 				vq_callback_t *callbacks[],
 				const char * const names[],
-				u32 sizes[],
 				const bool *ctx,
 				struct irq_affinity *desc)
 {
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 888f7e96f0c7..36ec7be1f480 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -55,7 +55,6 @@ struct virtio_shm_region {
  *		include a NULL entry for vqs that do not need a callback
  *	names: array of virtqueue names (mainly for debugging)
  *		include a NULL entry for vqs unused by driver
- *	sizes: array of virtqueue sizes
  *	Returns 0 on success or error status
  * @del_vqs: free virtqueues found by find_vqs().
  * @synchronize_cbs: synchronize with the virtqueue callbacks (optional)
@@ -104,9 +103,7 @@ struct virtio_config_ops {
 	void (*reset)(struct virtio_device *vdev);
 	int (*find_vqs)(struct virtio_device *, unsigned nvqs,
 			struct virtqueue *vqs[], vq_callback_t *callbacks[],
-			const char * const names[],
-			u32 sizes[],
-			const bool *ctx,
+			const char * const names[], const bool *ctx,
 			struct irq_affinity *desc);
 	void (*del_vqs)(struct virtio_device *);
 	void (*synchronize_cbs)(struct virtio_device *);
@@ -215,7 +212,7 @@ struct virtqueue *virtio_find_single_vq(struct virtio_device *vdev,
 	const char *names[] = { n };
 	struct virtqueue *vq;
 	int err = vdev->config->find_vqs(vdev, 1, &vq, callbacks, names, NULL,
-					 NULL, NULL);
+					 NULL);
 	if (err < 0)
 		return ERR_PTR(err);
 	return vq;
@@ -227,8 +224,7 @@ int virtio_find_vqs(struct virtio_device *vdev, unsigned nvqs,
 			const char * const names[],
 			struct irq_affinity *desc)
 {
-	return vdev->config->find_vqs(vdev, nvqs, vqs, callbacks, names, NULL,
-				      NULL, desc);
+	return vdev->config->find_vqs(vdev, nvqs, vqs, callbacks, names, NULL, desc);
 }
 
 static inline
@@ -237,8 +233,8 @@ int virtio_find_vqs_ctx(struct virtio_device *vdev, unsigned nvqs,
 			const char * const names[], const bool *ctx,
 			struct irq_affinity *desc)
 {
-	return vdev->config->find_vqs(vdev, nvqs, vqs, callbacks, names, NULL,
-				      ctx, desc);
+	return vdev->config->find_vqs(vdev, nvqs, vqs, callbacks, names, ctx,
+				      desc);
 }
 
 /**
-- 
MST


^ permalink raw reply related

* [PATCH v4 5/6] virtio_vdpa: Revert "virtio_vdpa: support the arg sizes of find_vqs()"
From: Michael S. Tsirkin @ 2022-08-16  5:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Xuan Zhuo, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, virtualization, netdev,
	Linus Torvalds, Jens Axboe, James Bottomley, Martin K. Petersen,
	Guenter Roeck, Greg KH, Bo Liu
In-Reply-To: <20220816053602.173815-1-mst@redhat.com>

This reverts commit 99e8927d8a4da8eb8a8a5904dc13a3156be8e7c0:
proposed API isn't supported on all transports but no
effort was made to address this.

It might not be hard to fix if we want to: maybe just rename size to
size_hint and make sure legacy transports ignore the hint.

But it's not sure what the benefit is in any case, so let's drop it.

Fixes: 99e8927d8a4d ("virtio_vdpa: support the arg sizes of find_vqs()")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_vdpa.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c
index 9bc4d110b800..832d2c5b1b19 100644
--- a/drivers/virtio/virtio_vdpa.c
+++ b/drivers/virtio/virtio_vdpa.c
@@ -131,7 +131,7 @@ static irqreturn_t virtio_vdpa_virtqueue_cb(void *private)
 static struct virtqueue *
 virtio_vdpa_setup_vq(struct virtio_device *vdev, unsigned int index,
 		     void (*callback)(struct virtqueue *vq),
-		     const char *name, u32 size, bool ctx)
+		     const char *name, bool ctx)
 {
 	struct virtio_vdpa_device *vd_dev = to_virtio_vdpa_device(vdev);
 	struct vdpa_device *vdpa = vd_get_vdpa(vdev);
@@ -168,17 +168,14 @@ virtio_vdpa_setup_vq(struct virtio_device *vdev, unsigned int index,
 		goto error_new_virtqueue;
 	}
 
-	if (!size || size > max_num)
-		size = max_num;
-
 	if (ops->get_vq_num_min)
 		min_num = ops->get_vq_num_min(vdpa);
 
-	may_reduce_num = (size == min_num) ? false : true;
+	may_reduce_num = (max_num == min_num) ? false : true;
 
 	/* Create the vring */
 	align = ops->get_vq_align(vdpa);
-	vq = vring_create_virtqueue(index, size, align, vdev,
+	vq = vring_create_virtqueue(index, max_num, align, vdev,
 				    true, may_reduce_num, ctx,
 				    virtio_vdpa_notify, callback, name);
 	if (!vq) {
@@ -288,9 +285,9 @@ static int virtio_vdpa_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 			continue;
 		}
 
-		vqs[i] = virtio_vdpa_setup_vq(vdev, queue_idx++, callbacks[i],
-						  names[i], sizes ? sizes[i] : 0,
-						  ctx ? ctx[i] : false);
+		vqs[i] = virtio_vdpa_setup_vq(vdev, queue_idx++,
+					      callbacks[i], names[i], ctx ?
+					      ctx[i] : false);
 		if (IS_ERR(vqs[i])) {
 			err = PTR_ERR(vqs[i]);
 			goto err_setup_vq;
-- 
MST


^ permalink raw reply related

* [PATCH v4 5/6] virtio: Revert "virtio_vdpa: support the arg sizes of find_vqs()"
From: Michael S. Tsirkin @ 2022-08-16  5:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Xuan Zhuo, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, virtualization, netdev,
	Linus Torvalds, Jens Axboe, James Bottomley, Martin K. Petersen,
	Guenter Roeck, Greg KH, Bo Liu
In-Reply-To: <20220816053602.173815-1-mst@redhat.com>

This reverts commit 99e8927d8a4da8eb8a8a5904dc13a3156be8e7c0:
proposed API isn't supported on all transports but no
effort was made to address this.

It might not be hard to fix if we want to: maybe just rename size to
size_hint and make sure legacy transports ignore the hint.

But it's not sure what the benefit is in any case, so let's drop it.

Fixes: 99e8927d8a4d ("virtio_vdpa: support the arg sizes of find_vqs()")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_vdpa.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c
index 9bc4d110b800..832d2c5b1b19 100644
--- a/drivers/virtio/virtio_vdpa.c
+++ b/drivers/virtio/virtio_vdpa.c
@@ -131,7 +131,7 @@ static irqreturn_t virtio_vdpa_virtqueue_cb(void *private)
 static struct virtqueue *
 virtio_vdpa_setup_vq(struct virtio_device *vdev, unsigned int index,
 		     void (*callback)(struct virtqueue *vq),
-		     const char *name, u32 size, bool ctx)
+		     const char *name, bool ctx)
 {
 	struct virtio_vdpa_device *vd_dev = to_virtio_vdpa_device(vdev);
 	struct vdpa_device *vdpa = vd_get_vdpa(vdev);
@@ -168,17 +168,14 @@ virtio_vdpa_setup_vq(struct virtio_device *vdev, unsigned int index,
 		goto error_new_virtqueue;
 	}
 
-	if (!size || size > max_num)
-		size = max_num;
-
 	if (ops->get_vq_num_min)
 		min_num = ops->get_vq_num_min(vdpa);
 
-	may_reduce_num = (size == min_num) ? false : true;
+	may_reduce_num = (max_num == min_num) ? false : true;
 
 	/* Create the vring */
 	align = ops->get_vq_align(vdpa);
-	vq = vring_create_virtqueue(index, size, align, vdev,
+	vq = vring_create_virtqueue(index, max_num, align, vdev,
 				    true, may_reduce_num, ctx,
 				    virtio_vdpa_notify, callback, name);
 	if (!vq) {
@@ -288,9 +285,9 @@ static int virtio_vdpa_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 			continue;
 		}
 
-		vqs[i] = virtio_vdpa_setup_vq(vdev, queue_idx++, callbacks[i],
-						  names[i], sizes ? sizes[i] : 0,
-						  ctx ? ctx[i] : false);
+		vqs[i] = virtio_vdpa_setup_vq(vdev, queue_idx++,
+					      callbacks[i], names[i], ctx ?
+					      ctx[i] : false);
 		if (IS_ERR(vqs[i])) {
 			err = PTR_ERR(vqs[i]);
 			goto err_setup_vq;
-- 
MST


^ permalink raw reply related

* [PATCH v4 0/6] virtio: drop sizing vqs during init
From: Michael S. Tsirkin @ 2022-08-16  5:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Xuan Zhuo, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, virtualization, netdev,
	Linus Torvalds, Jens Axboe, James Bottomley, Martin K. Petersen,
	Guenter Roeck, Greg KH



Supplying size during init does not work for all transports.
In fact for legacy pci doing that causes a memory
corruption which was reported on Google Cloud.

We might get away with changing size to size_hint so it's
safe to ignore and then fixing legacy to ignore the hint.

But the benefit is unclear in any case, so let's revert for now.
Any new version will have to come with
- documentation of performance gains
- performance testing showing existing workflows
  are not harmed materially. especially ones with
  bursty traffic
- report of testing on legacy devices


Huge shout out to Andres Freund for the effort spent reproducing and
debugging!  Thanks to Guenter Roeck for help with testing!


changes from v3
	added a vdpa revert
changes from v2
	drop unrelated patches
changes from v1
	revert the ring size api, it's unused now

Michael S. Tsirkin (5):
  virtio_net: Revert "virtio_net: set the default max ring size by
    find_vqs()"


Michael S. Tsirkin (6):
  virtio_net: Revert "virtio_net: set the default max ring size by
    find_vqs()"


Michael S. Tsirkin (6):
  virtio_net: Revert "virtio_net: set the default max ring size by
    find_vqs()"
  virtio: Revert "virtio: add helper virtio_find_vqs_ctx_size()"
  virtio-mmio: Revert "virtio_mmio: support the arg sizes of find_vqs()"
  virtio_pci: Revert "virtio_pci: support the arg sizes of find_vqs()"
  virtio_vdpa: Revert "virtio_vdpa: support the arg sizes of find_vqs()"
  virtio: Revert "virtio: find_vqs() add arg sizes"

 arch/um/drivers/virtio_uml.c             |  2 +-
 drivers/net/virtio_net.c                 | 42 +++---------------------
 drivers/platform/mellanox/mlxbf-tmfifo.c |  1 -
 drivers/remoteproc/remoteproc_virtio.c   |  1 -
 drivers/s390/virtio/virtio_ccw.c         |  1 -
 drivers/virtio/virtio_mmio.c             |  9 ++---
 drivers/virtio/virtio_pci_common.c       | 20 +++++------
 drivers/virtio/virtio_pci_common.h       |  3 +-
 drivers/virtio/virtio_pci_legacy.c       |  6 +---
 drivers/virtio/virtio_pci_modern.c       | 17 +++-------
 drivers/virtio/virtio_vdpa.c             | 16 ++++-----
 include/linux/virtio_config.h            | 26 +++------------
 12 files changed, 34 insertions(+), 110 deletions(-)

-- 
MST


^ permalink raw reply

* [PATCH v4 1/6] virtio_net: Revert "virtio_net: set the default max ring size by find_vqs()"
From: Michael S. Tsirkin @ 2022-08-16  5:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Xuan Zhuo, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, virtualization, netdev,
	Linus Torvalds, Jens Axboe, James Bottomley, Martin K. Petersen,
	Guenter Roeck, Greg KH, Andres Freund
In-Reply-To: <20220816053602.173815-1-mst@redhat.com>

This reverts commit 762faee5a2678559d3dc09d95f8f2c54cd0466a7.

This has been reported to trip up guests on GCP (Google Cloud).
The reason is that virtio_find_vqs_ctx_size is broken on legacy
devices. We can in theory fix virtio_find_vqs_ctx_size but
in fact the patch itself has several other issues:

- It treats unknown speed as < 10G
- It leaves userspace no way to find out the ring size set by hypervisor
- It tests speed when link is down
- It ignores the virtio spec advice:
        Both \field{speed} and \field{duplex} can change, thus the driver
        is expected to re-read these values after receiving a
        configuration change notification.
- It is not clear the performance impact has been tested properly

Revert the patch for now.

Reported-by: Andres Freund <andres@anarazel.de>
Link: https://lore.kernel.org/r/20220814212610.GA3690074%40roeck-us.net
Link: https://lore.kernel.org/r/20220815070203.plwjx7b3cyugpdt7%40awork3.anarazel.de
Link: https://lore.kernel.org/r/3df6bb82-1951-455d-a768-e9e1513eb667%40www.fastmail.com
Link: https://lore.kernel.org/r/FCDC5DDE-3CDD-4B8A-916F-CA7D87B547CE%40anarazel.de
Fixes: 762faee5a267 ("virtio_net: set the default max ring size by find_vqs()")
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Andres Freund <andres@anarazel.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
---
 drivers/net/virtio_net.c | 42 ++++------------------------------------
 1 file changed, 4 insertions(+), 38 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d934774e9733..ece00b84e3a7 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3432,29 +3432,6 @@ static unsigned int mergeable_min_buf_len(struct virtnet_info *vi, struct virtqu
 		   (unsigned int)GOOD_PACKET_LEN);
 }
 
-static void virtnet_config_sizes(struct virtnet_info *vi, u32 *sizes)
-{
-	u32 i, rx_size, tx_size;
-
-	if (vi->speed == SPEED_UNKNOWN || vi->speed < SPEED_10000) {
-		rx_size = 1024;
-		tx_size = 1024;
-
-	} else if (vi->speed < SPEED_40000) {
-		rx_size = 1024 * 4;
-		tx_size = 1024 * 4;
-
-	} else {
-		rx_size = 1024 * 8;
-		tx_size = 1024 * 8;
-	}
-
-	for (i = 0; i < vi->max_queue_pairs; i++) {
-		sizes[rxq2vq(i)] = rx_size;
-		sizes[txq2vq(i)] = tx_size;
-	}
-}
-
 static int virtnet_find_vqs(struct virtnet_info *vi)
 {
 	vq_callback_t **callbacks;
@@ -3462,7 +3439,6 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
 	int ret = -ENOMEM;
 	int i, total_vqs;
 	const char **names;
-	u32 *sizes;
 	bool *ctx;
 
 	/* We expect 1 RX virtqueue followed by 1 TX virtqueue, followed by
@@ -3490,15 +3466,10 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
 		ctx = NULL;
 	}
 
-	sizes = kmalloc_array(total_vqs, sizeof(*sizes), GFP_KERNEL);
-	if (!sizes)
-		goto err_sizes;
-
 	/* Parameters for control virtqueue, if any */
 	if (vi->has_cvq) {
 		callbacks[total_vqs - 1] = NULL;
 		names[total_vqs - 1] = "control";
-		sizes[total_vqs - 1] = 64;
 	}
 
 	/* Allocate/initialize parameters for send/receive virtqueues */
@@ -3513,10 +3484,8 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
 			ctx[rxq2vq(i)] = true;
 	}
 
-	virtnet_config_sizes(vi, sizes);
-
-	ret = virtio_find_vqs_ctx_size(vi->vdev, total_vqs, vqs, callbacks,
-				       names, sizes, ctx, NULL);
+	ret = virtio_find_vqs_ctx(vi->vdev, total_vqs, vqs, callbacks,
+				  names, ctx, NULL);
 	if (ret)
 		goto err_find;
 
@@ -3536,8 +3505,6 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
 
 
 err_find:
-	kfree(sizes);
-err_sizes:
 	kfree(ctx);
 err_ctx:
 	kfree(names);
@@ -3897,9 +3864,6 @@ static int virtnet_probe(struct virtio_device *vdev)
 		vi->curr_queue_pairs = num_online_cpus();
 	vi->max_queue_pairs = max_queue_pairs;
 
-	virtnet_init_settings(dev);
-	virtnet_update_settings(vi);
-
 	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
 	err = init_vqs(vi);
 	if (err)
@@ -3912,6 +3876,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 	netif_set_real_num_tx_queues(dev, vi->curr_queue_pairs);
 	netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);
 
+	virtnet_init_settings(dev);
+
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
 		vi->failover = net_failover_create(vi->dev);
 		if (IS_ERR(vi->failover)) {
-- 
MST


^ permalink raw reply related

* [PATCH v4 3/6] virtio-mmio: Revert "virtio_mmio: support the arg sizes of find_vqs()"
From: Michael S. Tsirkin @ 2022-08-16  5:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Xuan Zhuo, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, virtualization, netdev,
	Linus Torvalds, Jens Axboe, James Bottomley, Martin K. Petersen,
	Guenter Roeck, Greg KH
In-Reply-To: <20220816053602.173815-1-mst@redhat.com>

This reverts commit fbed86abba6e0472d98079790e58060e4332608a.
The API is now unused, let's not carry dead code around.

Fixes: fbed86abba6e ("virtio_mmio: support the arg sizes of find_vqs()")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_mmio.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index c492a57531c6..dfcecfd7aba1 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -360,7 +360,7 @@ static void vm_synchronize_cbs(struct virtio_device *vdev)
 
 static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned int index,
 				  void (*callback)(struct virtqueue *vq),
-				  const char *name, u32 size, bool ctx)
+				  const char *name, bool ctx)
 {
 	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
 	struct virtio_mmio_vq_info *info;
@@ -395,11 +395,8 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned int in
 		goto error_new_virtqueue;
 	}
 
-	if (!size || size > num)
-		size = num;
-
 	/* Create the vring */
-	vq = vring_create_virtqueue(index, size, VIRTIO_MMIO_VRING_ALIGN, vdev,
+	vq = vring_create_virtqueue(index, num, VIRTIO_MMIO_VRING_ALIGN, vdev,
 				 true, true, ctx, vm_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
@@ -503,7 +500,6 @@ static int vm_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 		}
 
 		vqs[i] = vm_setup_vq(vdev, queue_idx++, callbacks[i], names[i],
-				     sizes ? sizes[i] : 0,
 				     ctx ? ctx[i] : false);
 		if (IS_ERR(vqs[i])) {
 			vm_del_vqs(vdev);
-- 
MST


^ permalink raw reply related

* [PATCH v4 2/6] virtio: Revert "virtio: add helper virtio_find_vqs_ctx_size()"
From: Michael S. Tsirkin @ 2022-08-16  5:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Xuan Zhuo, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, virtualization, netdev,
	Linus Torvalds, Jens Axboe, James Bottomley, Martin K. Petersen,
	Guenter Roeck, Greg KH
In-Reply-To: <20220816053602.173815-1-mst@redhat.com>

This reverts commit fe3dc04e31aa51f91dc7f741a5f76cc4817eb5b4: the
API is now unused and in fact can't be implemented on top of a legacy
device.

Fixes: fe3dc04e31aa ("virtio: add helper virtio_find_vqs_ctx_size()")
Cc: "Xuan Zhuo" <xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/virtio_config.h | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 6adff09f7170..888f7e96f0c7 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -241,18 +241,6 @@ int virtio_find_vqs_ctx(struct virtio_device *vdev, unsigned nvqs,
 				      ctx, desc);
 }
 
-static inline
-int virtio_find_vqs_ctx_size(struct virtio_device *vdev, u32 nvqs,
-			     struct virtqueue *vqs[],
-			     vq_callback_t *callbacks[],
-			     const char * const names[],
-			     u32 sizes[],
-			     const bool *ctx, struct irq_affinity *desc)
-{
-	return vdev->config->find_vqs(vdev, nvqs, vqs, callbacks, names, sizes,
-				      ctx, desc);
-}
-
 /**
  * virtio_synchronize_cbs - synchronize with virtqueue callbacks
  * @vdev: the device
-- 
MST


^ permalink raw reply related

* [PATCH v4 4/6] virtio_pci: Revert "virtio_pci: support the arg sizes of find_vqs()"
From: Michael S. Tsirkin @ 2022-08-16  5:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Xuan Zhuo, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, virtualization, netdev,
	Linus Torvalds, Jens Axboe, James Bottomley, Martin K. Petersen,
	Guenter Roeck, Greg KH, Andres Freund
In-Reply-To: <20220816053602.173815-1-mst@redhat.com>

This reverts commit cdb44806fca2d0ad29ca644cbf1505433902ee0c: the legacy
path is wrong and in fact can not support the proposed API since for a
legacy device we never communicate the vq size to the hypervisor.

Reported-by: Andres Freund <andres@anarazel.de>
Fixes: cdb44806fca2 ("virtio_pci: support the arg sizes of find_vqs()")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_pci_common.c | 18 ++++++++----------
 drivers/virtio/virtio_pci_common.h |  1 -
 drivers/virtio/virtio_pci_legacy.c |  6 +-----
 drivers/virtio/virtio_pci_modern.c | 10 +++-------
 4 files changed, 12 insertions(+), 23 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
index 00ad476a815d..7ad734584823 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -174,7 +174,6 @@ static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
 static struct virtqueue *vp_setup_vq(struct virtio_device *vdev, unsigned int index,
 				     void (*callback)(struct virtqueue *vq),
 				     const char *name,
-				     u32 size,
 				     bool ctx,
 				     u16 msix_vec)
 {
@@ -187,7 +186,7 @@ static struct virtqueue *vp_setup_vq(struct virtio_device *vdev, unsigned int in
 	if (!info)
 		return ERR_PTR(-ENOMEM);
 
-	vq = vp_dev->setup_vq(vp_dev, info, index, callback, name, size, ctx,
+	vq = vp_dev->setup_vq(vp_dev, info, index, callback, name, ctx,
 			      msix_vec);
 	if (IS_ERR(vq))
 		goto out_info;
@@ -284,7 +283,7 @@ void vp_del_vqs(struct virtio_device *vdev)
 
 static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned int nvqs,
 		struct virtqueue *vqs[], vq_callback_t *callbacks[],
-		const char * const names[], u32 sizes[], bool per_vq_vectors,
+		const char * const names[], bool per_vq_vectors,
 		const bool *ctx,
 		struct irq_affinity *desc)
 {
@@ -327,8 +326,8 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned int nvqs,
 		else
 			msix_vec = VP_MSIX_VQ_VECTOR;
 		vqs[i] = vp_setup_vq(vdev, queue_idx++, callbacks[i], names[i],
-				     sizes ? sizes[i] : 0,
-				     ctx ? ctx[i] : false, msix_vec);
+				     ctx ? ctx[i] : false,
+				     msix_vec);
 		if (IS_ERR(vqs[i])) {
 			err = PTR_ERR(vqs[i]);
 			goto error_find;
@@ -358,7 +357,7 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned int nvqs,
 
 static int vp_find_vqs_intx(struct virtio_device *vdev, unsigned int nvqs,
 		struct virtqueue *vqs[], vq_callback_t *callbacks[],
-		const char * const names[], u32 sizes[], const bool *ctx)
+		const char * const names[], const bool *ctx)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
 	int i, err, queue_idx = 0;
@@ -380,7 +379,6 @@ static int vp_find_vqs_intx(struct virtio_device *vdev, unsigned int nvqs,
 			continue;
 		}
 		vqs[i] = vp_setup_vq(vdev, queue_idx++, callbacks[i], names[i],
-				     sizes ? sizes[i] : 0,
 				     ctx ? ctx[i] : false,
 				     VIRTIO_MSI_NO_VECTOR);
 		if (IS_ERR(vqs[i])) {
@@ -404,15 +402,15 @@ int vp_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 	int err;
 
 	/* Try MSI-X with one vector per queue. */
-	err = vp_find_vqs_msix(vdev, nvqs, vqs, callbacks, names, sizes, true, ctx, desc);
+	err = vp_find_vqs_msix(vdev, nvqs, vqs, callbacks, names, true, ctx, desc);
 	if (!err)
 		return 0;
 	/* Fallback: MSI-X with one vector for config, one shared for queues. */
-	err = vp_find_vqs_msix(vdev, nvqs, vqs, callbacks, names, sizes, false, ctx, desc);
+	err = vp_find_vqs_msix(vdev, nvqs, vqs, callbacks, names, false, ctx, desc);
 	if (!err)
 		return 0;
 	/* Finally fall back to regular interrupts. */
-	return vp_find_vqs_intx(vdev, nvqs, vqs, callbacks, names, sizes, ctx);
+	return vp_find_vqs_intx(vdev, nvqs, vqs, callbacks, names, ctx);
 }
 
 const char *vp_bus_name(struct virtio_device *vdev)
diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h
index c0448378b698..a5ff838b85a5 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -80,7 +80,6 @@ struct virtio_pci_device {
 				      unsigned int idx,
 				      void (*callback)(struct virtqueue *vq),
 				      const char *name,
-				      u32 size,
 				      bool ctx,
 				      u16 msix_vec);
 	void (*del_vq)(struct virtio_pci_vq_info *info);
diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c
index d75e5c4e637f..2257f1b3d8ae 100644
--- a/drivers/virtio/virtio_pci_legacy.c
+++ b/drivers/virtio/virtio_pci_legacy.c
@@ -112,7 +112,6 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 				  unsigned int index,
 				  void (*callback)(struct virtqueue *vq),
 				  const char *name,
-				  u32 size,
 				  bool ctx,
 				  u16 msix_vec)
 {
@@ -126,13 +125,10 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 	if (!num || vp_legacy_get_queue_enable(&vp_dev->ldev, index))
 		return ERR_PTR(-ENOENT);
 
-	if (!size || size > num)
-		size = num;
-
 	info->msix_vector = msix_vec;
 
 	/* create the vring */
-	vq = vring_create_virtqueue(index, size,
+	vq = vring_create_virtqueue(index, num,
 				    VIRTIO_PCI_VRING_ALIGN, &vp_dev->vdev,
 				    true, false, ctx,
 				    vp_notify, callback, name);
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index f7965c5dd36b..be51ec849252 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -293,7 +293,6 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 				  unsigned int index,
 				  void (*callback)(struct virtqueue *vq),
 				  const char *name,
-				  u32 size,
 				  bool ctx,
 				  u16 msix_vec)
 {
@@ -311,18 +310,15 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 	if (!num || vp_modern_get_queue_enable(mdev, index))
 		return ERR_PTR(-ENOENT);
 
-	if (!size || size > num)
-		size = num;
-
-	if (size & (size - 1)) {
-		dev_warn(&vp_dev->pci_dev->dev, "bad queue size %u", size);
+	if (num & (num - 1)) {
+		dev_warn(&vp_dev->pci_dev->dev, "bad queue size %u", num);
 		return ERR_PTR(-EINVAL);
 	}
 
 	info->msix_vector = msix_vec;
 
 	/* create the vring */
-	vq = vring_create_virtqueue(index, size,
+	vq = vring_create_virtqueue(index, num,
 				    SMP_CACHE_BYTES, &vp_dev->vdev,
 				    true, true, ctx,
 				    vp_notify, callback, name);
-- 
MST


^ permalink raw reply related

* [PATCH v1 net 15/15] net: Fix data-races around sysctl_max_skb_frags.
From: Kuniyuki Iwashima @ 2022-08-16  5:23 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel,
	Hans Westgaard Ry
In-Reply-To: <20220816052347.70042-1-kuniyu@amazon.com>

While reading sysctl_max_skb_frags, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.

Fixes: 5f74f82ea34c ("net:Add sysctl_max_skb_frags")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
CC: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
---
 net/ipv4/tcp.c       | 4 ++--
 net/mptcp/protocol.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 970e9a2cca4a..9a6fe3d6ab26 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1000,7 +1000,7 @@ static struct sk_buff *tcp_build_frag(struct sock *sk, int size_goal, int flags,
 
 	i = skb_shinfo(skb)->nr_frags;
 	can_coalesce = skb_can_coalesce(skb, i, page, offset);
-	if (!can_coalesce && i >= sysctl_max_skb_frags) {
+	if (!can_coalesce && i >= READ_ONCE(sysctl_max_skb_frags)) {
 		tcp_mark_push(tp, skb);
 		goto new_segment;
 	}
@@ -1354,7 +1354,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 
 			if (!skb_can_coalesce(skb, i, pfrag->page,
 					      pfrag->offset)) {
-				if (i >= sysctl_max_skb_frags) {
+				if (i >= READ_ONCE(sysctl_max_skb_frags)) {
 					tcp_mark_push(tp, skb);
 					goto new_segment;
 				}
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index da4257504fad..d398f3810662 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1263,7 +1263,7 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 
 		i = skb_shinfo(skb)->nr_frags;
 		can_coalesce = skb_can_coalesce(skb, i, dfrag->page, offset);
-		if (!can_coalesce && i >= sysctl_max_skb_frags) {
+		if (!can_coalesce && i >= READ_ONCE(sysctl_max_skb_frags)) {
 			tcp_mark_push(tcp_sk(ssk), skb);
 			goto alloc_skb;
 		}
-- 
2.30.2


^ permalink raw reply related

* [PATCH v1 net 14/15] net: Fix a data-race around netdev_budget.
From: Kuniyuki Iwashima @ 2022-08-16  5:23 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel,
	Stephen Hemminger
In-Reply-To: <20220816052347.70042-1-kuniyu@amazon.com>

While reading netdev_budget, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.

Fixes: 51b0bdedb8e7 ("[NET]: Separate two usages of netdev_max_backlog.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
CC: Stephen Hemminger <shemminger@osdl.org>
---
 net/core/dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 4705e6630efa..c83e23cfc57d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6666,7 +6666,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
 	unsigned long time_limit = jiffies +
 		usecs_to_jiffies(netdev_budget_usecs);
-	int budget = netdev_budget;
+	int budget = READ_ONCE(netdev_budget);
 	LIST_HEAD(list);
 	LIST_HEAD(repoll);
 
-- 
2.30.2


^ permalink raw reply related

* [PATCH v1 net 13/15] net: Fix a data-race around sysctl_net_busy_read.
From: Kuniyuki Iwashima @ 2022-08-16  5:23 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel,
	Eliezer Tamir
In-Reply-To: <20220816052347.70042-1-kuniyu@amazon.com>

While reading sysctl_net_busy_read, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.

Fixes: 2d48d67fa8cd ("net: poll/select low latency socket support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
CC: Eliezer Tamir <eliezer.tamir@linux.intel.com>
---
 net/core/sock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 95abf4604d88..788c1372663c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3367,7 +3367,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	sk->sk_napi_id		=	0;
-	sk->sk_ll_usec		=	sysctl_net_busy_read;
+	sk->sk_ll_usec		=	READ_ONCE(sysctl_net_busy_read);
 #endif
 
 	sk->sk_max_pacing_rate = ~0UL;
-- 
2.30.2


^ permalink raw reply related

* [PATCH v1 net 12/15] net: Fix a data-race around sysctl_net_busy_poll.
From: Kuniyuki Iwashima @ 2022-08-16  5:23 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel,
	Eliezer Tamir
In-Reply-To: <20220816052347.70042-1-kuniyu@amazon.com>

While reading sysctl_net_busy_poll, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.

Fixes: 060212928670 ("net: add low latency socket poll")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
CC: Eliezer Tamir <eliezer.tamir@linux.intel.com>
---
 include/net/busy_poll.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index c4898fcbf923..f90f0021f5f2 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -33,7 +33,7 @@ extern unsigned int sysctl_net_busy_poll __read_mostly;
 
 static inline bool net_busy_loop_on(void)
 {
-	return sysctl_net_busy_poll;
+	return READ_ONCE(sysctl_net_busy_poll);
 }
 
 static inline bool sk_can_busy_loop(const struct sock *sk)
-- 
2.30.2


^ permalink raw reply related

* [PATCH v1 net 10/15] net: Fix data-races around sysctl_optmem_max.
From: Kuniyuki Iwashima @ 2022-08-16  5:23 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel
In-Reply-To: <20220816052347.70042-1-kuniyu@amazon.com>

While reading sysctl_optmem_max, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/core/bpf_sk_storage.c | 5 +++--
 net/core/filter.c         | 9 +++++----
 net/core/sock.c           | 8 +++++---
 net/ipv4/ip_sockglue.c    | 6 +++---
 net/ipv6/ipv6_sockglue.c  | 4 ++--
 5 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index 1b7f385643b4..94374d529ea4 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -310,11 +310,12 @@ BPF_CALL_2(bpf_sk_storage_delete, struct bpf_map *, map, struct sock *, sk)
 static int bpf_sk_storage_charge(struct bpf_local_storage_map *smap,
 				 void *owner, u32 size)
 {
+	int optmem_max = READ_ONCE(sysctl_optmem_max);
 	struct sock *sk = (struct sock *)owner;
 
 	/* same check as in sock_kmalloc() */
-	if (size <= sysctl_optmem_max &&
-	    atomic_read(&sk->sk_omem_alloc) + size < sysctl_optmem_max) {
+	if (size <= optmem_max &&
+	    atomic_read(&sk->sk_omem_alloc) + size < optmem_max) {
 		atomic_add(size, &sk->sk_omem_alloc);
 		return 0;
 	}
diff --git a/net/core/filter.c b/net/core/filter.c
index c4f14ad82029..c191db80ce93 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1214,10 +1214,11 @@ void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp)
 static bool __sk_filter_charge(struct sock *sk, struct sk_filter *fp)
 {
 	u32 filter_size = bpf_prog_size(fp->prog->len);
+	int optmem_max = READ_ONCE(sysctl_optmem_max);
 
 	/* same check as in sock_kmalloc() */
-	if (filter_size <= sysctl_optmem_max &&
-	    atomic_read(&sk->sk_omem_alloc) + filter_size < sysctl_optmem_max) {
+	if (filter_size <= optmem_max &&
+	    atomic_read(&sk->sk_omem_alloc) + filter_size < optmem_max) {
 		atomic_add(filter_size, &sk->sk_omem_alloc);
 		return true;
 	}
@@ -1548,7 +1549,7 @@ int sk_reuseport_attach_filter(struct sock_fprog *fprog, struct sock *sk)
 	if (IS_ERR(prog))
 		return PTR_ERR(prog);
 
-	if (bpf_prog_size(prog->len) > sysctl_optmem_max)
+	if (bpf_prog_size(prog->len) > READ_ONCE(sysctl_optmem_max))
 		err = -ENOMEM;
 	else
 		err = reuseport_attach_prog(sk, prog);
@@ -1615,7 +1616,7 @@ int sk_reuseport_attach_bpf(u32 ufd, struct sock *sk)
 		}
 	} else {
 		/* BPF_PROG_TYPE_SOCKET_FILTER */
-		if (bpf_prog_size(prog->len) > sysctl_optmem_max) {
+		if (bpf_prog_size(prog->len) > READ_ONCE(sysctl_optmem_max)) {
 			err = -ENOMEM;
 			goto err_prog_put;
 		}
diff --git a/net/core/sock.c b/net/core/sock.c
index 303af52f3b79..95abf4604d88 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2536,7 +2536,7 @@ struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
 
 	/* small safe race: SKB_TRUESIZE may differ from final skb->truesize */
 	if (atomic_read(&sk->sk_omem_alloc) + SKB_TRUESIZE(size) >
-	    sysctl_optmem_max)
+	    READ_ONCE(sysctl_optmem_max))
 		return NULL;
 
 	skb = alloc_skb(size, priority);
@@ -2554,8 +2554,10 @@ struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
  */
 void *sock_kmalloc(struct sock *sk, int size, gfp_t priority)
 {
-	if ((unsigned int)size <= sysctl_optmem_max &&
-	    atomic_read(&sk->sk_omem_alloc) + size < sysctl_optmem_max) {
+	int optmem_max = READ_ONCE(sysctl_optmem_max);
+
+	if ((unsigned int)size <= optmem_max &&
+	    atomic_read(&sk->sk_omem_alloc) + size < optmem_max) {
 		void *mem;
 		/* First do the add, to avoid the race if kmalloc
 		 * might sleep.
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index a8a323ecbb54..e49a61a053a6 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -772,7 +772,7 @@ static int ip_set_mcast_msfilter(struct sock *sk, sockptr_t optval, int optlen)
 
 	if (optlen < GROUP_FILTER_SIZE(0))
 		return -EINVAL;
-	if (optlen > sysctl_optmem_max)
+	if (optlen > READ_ONCE(sysctl_optmem_max))
 		return -ENOBUFS;
 
 	gsf = memdup_sockptr(optval, optlen);
@@ -808,7 +808,7 @@ static int compat_ip_set_mcast_msfilter(struct sock *sk, sockptr_t optval,
 
 	if (optlen < size0)
 		return -EINVAL;
-	if (optlen > sysctl_optmem_max - 4)
+	if (optlen > READ_ONCE(sysctl_optmem_max) - 4)
 		return -ENOBUFS;
 
 	p = kmalloc(optlen + 4, GFP_KERNEL);
@@ -1233,7 +1233,7 @@ static int do_ip_setsockopt(struct sock *sk, int level, int optname,
 
 		if (optlen < IP_MSFILTER_SIZE(0))
 			goto e_inval;
-		if (optlen > sysctl_optmem_max) {
+		if (optlen > READ_ONCE(sysctl_optmem_max)) {
 			err = -ENOBUFS;
 			break;
 		}
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 222f6bf220ba..e0dcc7a193df 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -210,7 +210,7 @@ static int ipv6_set_mcast_msfilter(struct sock *sk, sockptr_t optval,
 
 	if (optlen < GROUP_FILTER_SIZE(0))
 		return -EINVAL;
-	if (optlen > sysctl_optmem_max)
+	if (optlen > READ_ONCE(sysctl_optmem_max))
 		return -ENOBUFS;
 
 	gsf = memdup_sockptr(optval, optlen);
@@ -244,7 +244,7 @@ static int compat_ipv6_set_mcast_msfilter(struct sock *sk, sockptr_t optval,
 
 	if (optlen < size0)
 		return -EINVAL;
-	if (optlen > sysctl_optmem_max - 4)
+	if (optlen > READ_ONCE(sysctl_optmem_max) - 4)
 		return -ENOBUFS;
 
 	p = kmalloc(optlen + 4, GFP_KERNEL);
-- 
2.30.2


^ permalink raw reply related

* [PATCH v1 net 11/15] net: Fix a data-race around sysctl_tstamp_allow_data.
From: Kuniyuki Iwashima @ 2022-08-16  5:23 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel,
	Willem de Bruijn
In-Reply-To: <20220816052347.70042-1-kuniyu@amazon.com>

While reading sysctl_tstamp_allow_data, it can be changed
concurrently.  Thus, we need to add READ_ONCE() to its reader.

Fixes: b245be1f4db1 ("net-timestamp: no-payload only sysctl")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
CC: Willem de Bruijn <willemb@google.com>
---
 net/core/skbuff.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 974bbbbe7138..174f34124c06 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4798,7 +4798,7 @@ static bool skb_may_tx_timestamp(struct sock *sk, bool tsonly)
 {
 	bool ret;
 
-	if (likely(sysctl_tstamp_allow_data || tsonly))
+	if (likely(READ_ONCE(sysctl_tstamp_allow_data) || tsonly))
 		return true;
 
 	read_lock_bh(&sk->sk_callback_lock);
-- 
2.30.2


^ permalink raw reply related

* [PATCH v1 net 08/15] net: Fix data-races around netdev_tstamp_prequeue.
From: Kuniyuki Iwashima @ 2022-08-16  5:23 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel
In-Reply-To: <20220816052347.70042-1-kuniyu@amazon.com>

While reading netdev_tstamp_prequeue, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.

Fixes: 3b098e2d7c69 ("net: Consistent skb timestamping")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/core/dev.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 07da69c1ac0a..4705e6630efa 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4928,7 +4928,7 @@ static int netif_rx_internal(struct sk_buff *skb)
 {
 	int ret;
 
-	net_timestamp_check(netdev_tstamp_prequeue, skb);
+	net_timestamp_check(READ_ONCE(netdev_tstamp_prequeue), skb);
 
 	trace_netif_rx(skb);
 
@@ -5281,7 +5281,7 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
 	int ret = NET_RX_DROP;
 	__be16 type;
 
-	net_timestamp_check(!netdev_tstamp_prequeue, skb);
+	net_timestamp_check(!READ_ONCE(netdev_tstamp_prequeue), skb);
 
 	trace_netif_receive_skb(skb);
 
@@ -5664,7 +5664,7 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
 {
 	int ret;
 
-	net_timestamp_check(netdev_tstamp_prequeue, skb);
+	net_timestamp_check(READ_ONCE(netdev_tstamp_prequeue), skb);
 
 	if (skb_defer_rx_timestamp(skb))
 		return NET_RX_SUCCESS;
@@ -5694,7 +5694,7 @@ void netif_receive_skb_list_internal(struct list_head *head)
 
 	INIT_LIST_HEAD(&sublist);
 	list_for_each_entry_safe(skb, next, head, list) {
-		net_timestamp_check(netdev_tstamp_prequeue, skb);
+		net_timestamp_check(READ_ONCE(netdev_tstamp_prequeue), skb);
 		skb_list_del_init(skb);
 		if (!skb_defer_rx_timestamp(skb))
 			list_add_tail(&skb->list, &sublist);
-- 
2.30.2


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox