* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
From: Jason Wang @ 2018-12-25 10:05 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181224125237-mutt-send-email-mst@kernel.org>
On 2018/12/25 2:10 AM, Michael S. Tsirkin wrote:
> On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
>> On 2018/12/14 8:36 PM, Michael S. Tsirkin wrote:
>>> On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
>>>> On 2018/12/13 11:44 PM, Michael S. Tsirkin wrote:
>>>>> On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
>>>>>> It was noticed that the copy_user() friends that were used to access
>>>>>> virtqueue metadata tend to be very expensive for dataplane
>>>>>> implementations like vhost since they involve lots of software checks,
>>>>>> speculation barriers, and hardware feature toggling (e.g. SMAP). The
>>>>>> extra cost will be more obvious when transferring small packets.
>>>>>>
>>>>>> This patch tries to eliminate that overhead by pinning the vq metadata
>>>>>> pages and accessing them through vmap(). During SET_VRING_ADDR, we set up
>>>>>> those mappings, and the memory accessors are modified to use pointers to
>>>>>> access the metadata directly.
>>>>>>
>>>>>> Note, this is only done when device IOTLB is not enabled. We could
>>>>>> use a similar method to optimize that case in the future.
>>>>>>
>>>>>> Tests show about a 24% improvement in TX PPS when using virtio-user +
>>>>>> vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
>>>>>>
>>>>>> Before: ~5.0Mpps
>>>>>> After: ~6.1Mpps
>>>>>>
>>>>>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>>>>>> ---
>>>>>> drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
>>>>>> drivers/vhost/vhost.h | 11 +++
>>>>>> 2 files changed, 189 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>>>> index bafe39d2e637..1bd24203afb6 100644
>>>>>> --- a/drivers/vhost/vhost.c
>>>>>> +++ b/drivers/vhost/vhost.c
>>>>>> @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>>>>>> vq->indirect = NULL;
>>>>>> vq->heads = NULL;
>>>>>> vq->dev = dev;
>>>>>> + memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
>>>>>> + memset(&vq->used_ring, 0, sizeof(vq->used_ring));
>>>>>> + memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
>>>>>> mutex_init(&vq->mutex);
>>>>>> vhost_vq_reset(dev, vq);
>>>>>> if (vq->handle_kick)
>>>>>> @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>>>>>> spin_unlock(&dev->iotlb_lock);
>>>>>> }
>>>>>> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
>>>>>> + size_t size, int write)
>>>>>> +{
>>>>>> + struct page **pages;
>>>>>> + int npages = DIV_ROUND_UP(size, PAGE_SIZE);
>>>>>> + int npinned;
>>>>>> + void *vaddr;
>>>>>> +
>>>>>> + pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
>>>>>> + if (!pages)
>>>>>> + return -ENOMEM;
>>>>>> +
>>>>>> + npinned = get_user_pages_fast(uaddr, npages, write, pages);
>>>>>> + if (npinned != npages)
>>>>>> + goto err;
>>>>>> +
>>>>> As I said I have doubts about the whole approach, but this
>>>>> implementation in particular isn't a good idea
>>>>> as it keeps the page around forever.
>>
>> The pages will be released during set features.
>>
>>
>>>>> So no THP, no NUMA rebalancing,
>>
>> For THP, we will probably miss 2 or 4 pages, but does this really matter
>> considering the gain we have?
> We as in vhost? Networking isn't the only thing the guest does.
> We don't even know if this guest does a lot of networking.
> You don't know what else is in this huge page. It can be something very
> important that the guest touches all the time.
Well, the probability should be very small considering that we usually give
several gigabytes to the guest. The rest of the pages, which don't sit in
the same hugepage as the metadata, can still be merged by THP. Anyway, I
can test the difference.
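A rough way to put numbers on that estimate is sketched below. It is only an illustration in userspace C, not code from the patch: the 4 KiB base page and 2 MiB huge page sizes are assumptions (the usual x86-64 values), and blocks_spanned() and pinned_hpage_fraction() are hypothetical helper names. The sketch counts from the start address, so an unaligned range can touch one more page than its size alone suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE  4096UL              /* assumed base page size */
#define SKETCH_HPAGE_SIZE (2UL * 1024 * 1024) /* assumed huge page size */

/* Number of unit-sized, unit-aligned blocks touched by [addr, addr + size). */
static unsigned long blocks_spanned(uint64_t addr, size_t size,
                                    unsigned long unit)
{
    assert(size > 0 && unit > 0);
    uint64_t first = addr / unit;
    uint64_t last = (addr + size - 1) / unit;
    return (unsigned long)(last - first + 1);
}

/* Share of the guest's huge pages that contain at least one pinned page. */
static double pinned_hpage_fraction(unsigned long pinned_hpages,
                                    uint64_t guest_bytes)
{
    assert(guest_bytes >= SKETCH_HPAGE_SIZE);
    return (double)pinned_hpages /
           (double)(guest_bytes / SKETCH_HPAGE_SIZE);
}
```

With a 4 GiB guest and three metadata areas that each land in a different huge page, this gives 3 out of 2048 huge pages. It says nothing about what else lives in those huge pages, which is the open question in this thread.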
>
>> For NUMA rebalancing, I'm not even sure that
>> it helps in the case of IPC (vhost). It looks to me that in the worst case it
>> may cause pages to thrash between nodes if vhost and userspace are running
>> on two different nodes.
>
> So again it's a gain for vhost but has a completely unpredictable effect on
> other functionality of the guest.
>
> That's what bothers me with this approach.
So:
- The rest of the pages could still be balanced to other nodes, no?
- Whether it makes sense to balance the metadata pages themselves (which
belong to cooperating processes) is still questionable.
>
>
>
>
>>>> This is the price of all GUP users not only vhost itself.
>>> Yes. GUP is just not a great interface for vhost to use.
>>
>> The zerocopy code (enabled by default) has used it for years.
> But only for TX and temporarily. We pin, read, unpin.
Probably not. There are several reasons why a page may not be released
soon, or may be held for a very long time or even forever.
>
> Your patch is different
>
> - it writes into memory and GUP has known issues with file
> backed memory
The ordinary use case for vhost is anonymous pages, I think?
> - it keeps pages pinned forever
>
>
>
>>>> What's more
>>>> important, the goal is not to be left too much behind for other backends
>>>> like DPDK or AF_XDP (all of which are using GUP).
>>> So these guys assume userspace knows what it's doing.
>>> We can't assume that.
>>
>> What kind of assumption do they have?
>>
>>
>>>>> userspace-controlled
>>>>> amount of memory locked up and not accounted for.
>>>> It's pretty easy to add this since the slow path is still kept. If we
>>>> exceed the limit, we can switch back to the slow path.
>>>>
>>>>> Don't get me wrong it's a great patch in an ideal world.
>>>>> But then in an ideal world no barriers smap etc are necessary at all.
>>>> Again, this is only for accessing metadata, not the data, which has been used
>>>> for years in real use cases.
>>>>
>>>> For SMAP, it makes sense for addresses that the kernel cannot forecast. But
>>>> that's not the case for the vhost metadata since we know the addresses will be
>>>> accessed very frequently. As for the speculation barrier, it does nothing for the
>>>> data path of vhost, which is a kthread.
>>> I don't see how a kthread makes any difference. We do have a validation
>>> step which makes some difference.
>>
>> The problem is not the kthread but the userspace address. The
>> addresses of vq metadata tend to stay the same for a while, and vhost knows
>> they will be accessed frequently. SMAP doesn't help much in this case.
>>
>> Thanks.
> It's true for real-life applications, but a malicious one
> can call the setup ioctls any number of times. And SMAP is
> all about malicious applications.
We don't do this in the ioctl path, and there's no context switch between
userspace and the kernel in the worker thread. SMAP is used to prevent the
kernel from accessing userspace pages unexpectedly, which is not the case
for metadata access.
Thanks
>
>>>> Packet or AF_XDP benefit from
>>>> accessing metadata directly, we should do it as well.
>>>>
>>>> Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
* Re: [PATCH net V2 4/4] vhost: log dirty page correctly
From: Jason Wang @ 2018-12-25 9:43 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Jintack Lim, netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181224123654-mutt-send-email-mst@kernel.org>
On 2018/12/25 1:41 AM, Michael S. Tsirkin wrote:
> On Mon, Dec 24, 2018 at 11:43:31AM +0800, Jason Wang wrote:
>> On 2018/12/14 9:20 PM, Michael S. Tsirkin wrote:
>>> On Fri, Dec 14, 2018 at 10:43:03AM +0800, Jason Wang wrote:
>>>> On 2018/12/13 10:31 PM, Michael S. Tsirkin wrote:
>>>>>> Just to make sure I understand this. It looks to me we should:
>>>>>>
>>>>>> - allow passing GIOVA->GPA through UAPI
>>>>>>
>>>>>> - cache GIOVA->GPA somewhere but still use GIOVA->HVA in device IOTLB for
>>>>>> performance
>>>>>>
>>>>>> Is this what you suggest?
>>>>>>
>>>>>> Thanks
>>>>> Not really. We already have GPA->HVA, so I suggested a flag to pass
>>>>> GIOVA->GPA in the IOTLB.
>>>>>
>>>>> This has advantages for security since a single table needs
>>>>> then to be validated to ensure guest does not corrupt
>>>>> QEMU memory.
>>>>>
>>>> I wonder how much we can gain from this. Currently, the qemu IOMMU gives the
>>>> GIOVA->GPA mapping, and the qemu vhost code translates GPA to HVA and then passes
>>>> GIOVA->HVA to vhost. I see no difference.
>>>>
>>>> Thanks
>>> The difference is in security not in performance. Getting a bad HVA
>>> corrupts QEMU memory and it might be guest controlled. Very risky.
>> How can this be controlled by the guest? The HVA is generated from qemu RAM blocks,
>> which are totally under the control of the qemu memory core rather than the guest.
>>
>>
>> Thanks
> It is ultimately under guest influence as guest supplies IOVA->GPA
> translations. qemu translates GPA->HVA and gives the translated result
> to the kernel. If it's not buggy and kernel isn't buggy it's all
> fine.
If qemu provides a buggy GPA->HVA translation, we can't work around it. And I
don't see why we would even want to try. Buggy qemu code can crash
itself in many ways.
>
> But that's the approach that was proven not to work in the 20th century.
> In the 21st century we are trying a defence-in-depth approach.
>
> My point is that a single code path that is responsible for
> the HVA translations is better than two.
>
So the difference is whether or not to use the memory table information.
Current:
1) SET_MEM_TABLE: GPA->HVA
2) Qemu GIOVA->GPA
3) Qemu GPA->HVA
4) IOTLB_UPDATE: GIOVA->HVA
If I understand correctly, you want to drop step 3 because it might be
buggy, but it is just 19 lines of code in qemu
(vhost_memory_region_lookup()). Dropping it will require:
1) Doing the GPA->HVA translation in the IOTLB_UPDATE path (I believe we won't
want to do it during device IOTLB lookup).
2) Extra bits to enable this capability.
So this looks like it needs more code in the kernel than what qemu does in
userspace. Is this really worthwhile?
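For reference, the chain in the current flow amounts to two range lookups composed together. The following is a minimal userspace sketch of that composition only. It is not the qemu or vhost code: struct xlate_region, xlate() and giova_to_hva() are hypothetical names, the tables are flat arrays, and only a single address is translated (a buffer spanning two regions is not handled).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical region: maps [start, start + size) onto base + offset. */
struct xlate_region {
    uint64_t start;
    uint64_t size;
    uint64_t base;
};

/* Linear lookup. Returns 0 and stores the result, or -1 if unmapped. */
static int xlate(const struct xlate_region *tbl, size_t n,
                 uint64_t addr, uint64_t *out)
{
    assert(out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (addr >= tbl[i].start && addr - tbl[i].start < tbl[i].size) {
            *out = tbl[i].base + (addr - tbl[i].start);
            return 0;
        }
    }
    return -1;
}

/*
 * Steps 2 and 3 of the current flow: GIOVA->GPA, then GPA->HVA.
 * The result is what step 4 (IOTLB_UPDATE) would carry to vhost.
 */
static int giova_to_hva(const struct xlate_region *iommu, size_t n_iommu,
                        const struct xlate_region *mem, size_t n_mem,
                        uint64_t giova, uint64_t *hva)
{
    uint64_t gpa;

    if (xlate(iommu, n_iommu, giova, &gpa) < 0)
        return -1;
    return xlate(mem, n_mem, gpa, hva);
}
```

Moving the second lookup into the kernel would mean running the equivalent of the inner xlate() call against the SET_MEM_TABLE regions on every IOTLB_UPDATE.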
Thanks
* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
From: Michael S. Tsirkin @ 2018-12-24 19:09 UTC (permalink / raw)
To: Jason Wang; +Cc: netdev, linux-kernel, David Miller, kvm, virtualization
In-Reply-To: <f6ce1fbb-b634-b17d-e9cf-36c662f49d75@redhat.com>
On Mon, Dec 24, 2018 at 04:44:14PM +0800, Jason Wang wrote:
>
> On 2018/12/17 3:57 AM, Michael S. Tsirkin wrote:
> > On Sat, Dec 15, 2018 at 11:43:08AM -0800, David Miller wrote:
> > > From: Jason Wang <jasowang@redhat.com>
> > > Date: Fri, 14 Dec 2018 12:29:54 +0800
> > >
> > > > On 2018/12/14 4:12 AM, Michael S. Tsirkin wrote:
> > > > > On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> > > > > > Hi:
> > > > > >
> > > > > > > This series tries to access virtqueue metadata through a kernel virtual
> > > > > > > address instead of copy_user() friends, since they have too much
> > > > > > > overhead from checks, spec barriers, or even hardware feature
> > > > > > > toggling.
> > > > > > >
> > > > > > > Tests show about a 24% improvement in TX PPS. It should benefit other
> > > > > > > cases as well.
> > > > > >
> > > > > > Please review
> > > > > I think the idea of speeding up userspace access is a good one.
> > > > > However I think that moving all checks to start is way too aggressive.
> > > >
> > > > > So did packet and AF_XDP. Anyway, sharing the address space and accessing
> > > > > it directly is the fastest way. Performance is the major
> > > > > consideration for people choosing a backend. Compared to a userspace
> > > > > implementation, vhost does not have security advantages at any
> > > > > level. If vhost is still slow, people will start to develop backends
> > > > > based on e.g. AF_XDP.
> > > Exactly, this is precisely how this kind of problem should be solved.
> > >
> > > Michael, I strongly support the approach Jason is taking here, and I
> > > would like to ask you to seriously reconsider your objections.
> > >
> > > Thank you.
> > Okay. Won't be the first time I'm wrong.
> >
> > Let's say we ignore security aspects, but we need to make sure the
> > following all keep working (broken with this revision):
> > - file backed memory (I didn't see where we mark memory dirty -
> > if we don't we get guest memory corruption on close, if we do
> > then host crash as https://lwn.net/Articles/774411/ seems to apply here?)
>
>
> We only pin metadata pages, so I don't think they can be used for DMA, and it
> is probably not an issue. The real issue is the zerocopy code; maybe it's time
> to disable it by default?
>
>
> > - THP
>
>
> We will miss 2 or 4 pages for THP; I wonder whether it's measurable at all.
>
>
> > - auto-NUMA
>
>
> I'm not sure auto-NUMA will help in the case of IPC. In the worst case it can
> hurt performance if vhost and userspace are running on two
> different nodes. Anyway, I can measure.
>
>
> >
> > Because vhost isn't like AF_XDP where you can just tell people "use
> > hugetlbfs" and "data is removed on close" - people are using it in lots
> > of configurations with guest memory shared between rings and unrelated
> > data.
>
>
> This series doesn't share data, only metadata is shared.
Let me clarify - I mean that the metadata is in the same huge page as
unrelated guest data.
>
> >
> > Jason, thoughts on these?
> >
>
> Based on the above, I can measure the impact on THP to see how much it matters.
>
> The unsafe variants only help when we can batch the accesses, and
> they need a non-trivial rework of the vhost code, with an unpredictable amount of
> work for archs other than x86. I'm not sure it's worth trying.
>
> Thanks
Yes I think we need better APIs in vhost. Right now
we have an API to get and translate a single buffer.
We should have one that gets a batch of descriptors
and stores it, then one that translates this batch.
IMHO this will benefit everyone even if we do vmap due to
better code locality.
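To make the shape of such an API concrete, here is a minimal userspace sketch of the two phases. It is an assumption-laden illustration, not vhost code: struct sketch_desc, struct sketch_batch, fetch_batch() and translate_batch() are hypothetical names, the ring is a plain array of descriptors without the avail-ring indirection, and translation is a single flat offset mapping instead of a real memory table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH_MAX 16

/* Hypothetical, simplified descriptor: guest address and length only. */
struct sketch_desc {
    uint64_t addr;
    uint32_t len;
};

struct sketch_batch {
    struct sketch_desc desc[BATCH_MAX]; /* phase 1 output: raw descriptors */
    uint64_t hva[BATCH_MAX];            /* phase 2 output: translated addresses */
    size_t count;
};

/* Phase 1: copy up to BATCH_MAX descriptors out of the ring and store them. */
static size_t fetch_batch(const struct sketch_desc *ring, size_t ring_size,
                          size_t *head, size_t avail, struct sketch_batch *b)
{
    assert(ring_size > 0);
    size_t n = avail < BATCH_MAX ? avail : BATCH_MAX;

    for (size_t i = 0; i < n; i++)
        b->desc[i] = ring[(*head + i) % ring_size];
    *head = (*head + n) % ring_size;
    b->count = n;
    return n;
}

/*
 * Phase 2: translate the stored batch in one pass, assuming a flat mapping
 * of [guest_base, guest_base + guest_size) onto hva_base.
 * Returns -1 at the first descriptor that falls outside the guest range.
 */
static int translate_batch(struct sketch_batch *b, uint64_t guest_base,
                           uint64_t guest_size, uint64_t hva_base)
{
    for (size_t i = 0; i < b->count; i++) {
        uint64_t a = b->desc[i].addr;
        uint64_t len = b->desc[i].len;

        if (a < guest_base || a - guest_base >= guest_size ||
            len > guest_size - (a - guest_base))
            return -1;
        b->hva[i] = hva_base + (a - guest_base);
    }
    return 0;
}
```

Keeping the fetch loop and the translate loop separate is what gives the locality benefit: each loop touches one kind of data.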
--
MST
* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
From: Michael S. Tsirkin @ 2018-12-24 18:12 UTC (permalink / raw)
To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <b10c99a2-a9c3-595f-983e-2547325e64ad@redhat.com>
On Mon, Dec 24, 2018 at 04:32:39PM +0800, Jason Wang wrote:
>
> On 2018/12/14 8:33 PM, Michael S. Tsirkin wrote:
> > On Fri, Dec 14, 2018 at 11:42:18AM +0800, Jason Wang wrote:
> > > On 2018/12/13 11:27 PM, Michael S. Tsirkin wrote:
> > > > On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> > > > > Hi:
> > > > >
> > > > > This series tries to access virtqueue metadata through a kernel virtual
> > > > > address instead of copy_user() friends, since they have too much
> > > > > overhead from checks, spec barriers, or even hardware feature
> > > > > toggling.
> > > > Userspace accesses through remapping tricks and next time there's a need
> > > > for a new barrier we are left to figure it out by ourselves.
> > >
> > > I don't follow here; do you mean spec barriers?
> > I mean the next barrier people decide to put into userspace
> > memory accesses.
> >
> > > It's completely unnecessary for
> > > vhost, which is a kernel thread.
> > It's defence in depth. Take a look at the commit that added them.
> > And yes quite possibly in most cases we actually have a spec
> > barrier in the validation phase. If we do let's use the
> > unsafe variants so they can be found.
>
>
> unsafe variants can only work if you can batch userspace access. This is not
> necessarily the case for light load.
Do we care a lot about the light load? How would you benchmark it?
>
> >
> > > And even if you're right, vhost is not the
> > > only place, there's lots of vmap() based accessing in kernel.
> > For sure. But if one can get by without get user pages, one
> > really should. Witness recently uncovered mess with file
> > backed storage.
>
>
> We only pin metadata pages, I don't believe they will be used by any DMA.
It doesn't really matter: if you dirty pages behind the MM's back,
the problem is there.
>
> >
> > > Looking at it
> > > from another direction, this means we won't suffer from unnecessary barriers for
> > > kthreads like vhost in the future; we will manually pick the ones we really
> > > need
> > I personally think we should err on the side of caution not on the side of
> > performance.
>
>
> So what you suggest may lead to an unnecessary performance regression (10%-20%),
> and avoiding that is part of the goal of this series. We should audit and use only the
> ones we really need instead of depending on copy_user() friends.
>
> If we do it on our own, we could be slower to pick up security fixes, but it's no less
> safe than before, with performance kept.
>
>
> >
> > > (but it should have little possibility).
> > History seems to teach otherwise.
>
>
> What case did you mean here?
>
>
> >
> > > Please note that we only access metadata through remapping, not the data itself.
> > > This idea has been used by high-speed userspace backends for years, e.g.
> > > packet sockets or, more recently, AF_XDP.
> > I think their justification for the higher risk is that they are mostly
> > designed for privileged userspace.
>
>
> I think it's the same with TUN/TAP: privileged processes can pass them to
> unprivileged ones.
>
>
> >
> > > The only difference is that the page is remapped
> > > from kernel to userspace.
> > At least that avoids the g.u.p mess.
>
>
> I'm still not very clear on this point. We only pin 2 or 4 pages; there are
> several other cases that pin many more.
>
>
> >
> > > > I don't
> > > > like the idea I have to say. As a first step, why don't we switch to
> > > > unsafe_put_user/unsafe_get_user etc?
> > >
> > > Several reasons:
> > >
> > > - They only have an x86 variant, so it won't make any difference for the rest of
> > > the architectures.
> > Is there an issue on other architectures? If yes they can be extended
> > there.
>
>
> Consider the unpredictable amount of work, and that in the best case it gives the
> same performance as vmap(). I'm not sure it's worth it.
>
>
> >
> > > - unsafe_put_user/unsafe_get_user is not sufficient for accessing structures
> > > (e.g accessing descriptor) or arrays (batching).
> > So you want unsafe_copy_xxx_user? I can do this. Hang on will post.
> >
> > > - Unless we can batch accesses to at least two of the three areas (avail,
> > > used and descriptor) in one run, there will be no difference. E.g. we
> > > can batch updates to the used ring, but it won't make any difference in this case.
> > >
> > So let's batch them all?
>
>
> Batching might not help for the case of light load. And we need to measure
> the gain/cost of batching itself.
>
>
> >
> >
> > > > That would be more of an apples to apples comparison, would it not?
> > >
> > > An apples-to-apples comparison only helps if we are No. 1. But the fact is we
> > > are not. If we want to compete with e.g. dpdk or AF_XDP, vmap() is the
> > > fastest method AFAIK.
> > >
> > >
> > > Thanks
> > We need to speed up the packet access itself too though.
> > You can't vmap all of guest memory.
>
>
> This series only pins and vmaps very few pages (metadata).
>
> Thanks
>
>
> >
> >
> > > >
> > > > > > Tests show about a 24% improvement in TX PPS. It should benefit other
> > > > > cases as well.
> > > > >
> > > > > Please review
> > > > >
> > > > > Jason Wang (3):
> > > > > vhost: generalize adding used elem
> > > > > vhost: fine grain userspace memory accessors
> > > > > vhost: access vq metadata through kernel virtual address
> > > > >
> > > > > drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
> > > > > drivers/vhost/vhost.h | 11 ++
> > > > > 2 files changed, 266 insertions(+), 26 deletions(-)
> > > > >
> > > > > --
> > > > > 2.17.1
* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
From: Michael S. Tsirkin @ 2018-12-24 18:10 UTC (permalink / raw)
To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <2ea274df-a79a-250f-648f-12927529d78a@redhat.com>
On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
>
> On 2018/12/14 8:36 PM, Michael S. Tsirkin wrote:
> > On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
> > > On 2018/12/13 11:44 PM, Michael S. Tsirkin wrote:
> > > > On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
> > > > > It was noticed that the copy_user() friends that were used to access
> > > > > virtqueue metadata tend to be very expensive for dataplane
> > > > > implementations like vhost since they involve lots of software checks,
> > > > > speculation barriers, and hardware feature toggling (e.g. SMAP). The
> > > > > extra cost will be more obvious when transferring small packets.
> > > > >
> > > > > This patch tries to eliminate that overhead by pinning the vq metadata
> > > > > pages and accessing them through vmap(). During SET_VRING_ADDR, we set up
> > > > > those mappings, and the memory accessors are modified to use pointers to
> > > > > access the metadata directly.
> > > > >
> > > > > Note, this is only done when device IOTLB is not enabled. We could
> > > > > use a similar method to optimize that case in the future.
> > > > >
> > > > > Tests show about a 24% improvement in TX PPS when using virtio-user +
> > > > > vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
> > > > >
> > > > > Before: ~5.0Mpps
> > > > > After: ~6.1Mpps
> > > > >
> > > > > Signed-off-by: Jason Wang<jasowang@redhat.com>
> > > > > ---
> > > > > drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
> > > > > drivers/vhost/vhost.h | 11 +++
> > > > > 2 files changed, 189 insertions(+)
> > > > >
> > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > index bafe39d2e637..1bd24203afb6 100644
> > > > > --- a/drivers/vhost/vhost.c
> > > > > +++ b/drivers/vhost/vhost.c
> > > > > @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
> > > > > vq->indirect = NULL;
> > > > > vq->heads = NULL;
> > > > > vq->dev = dev;
> > > > > + memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
> > > > > + memset(&vq->used_ring, 0, sizeof(vq->used_ring));
> > > > > + memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
> > > > > mutex_init(&vq->mutex);
> > > > > vhost_vq_reset(dev, vq);
> > > > > if (vq->handle_kick)
> > > > > @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
> > > > > spin_unlock(&dev->iotlb_lock);
> > > > > }
> > > > > +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
> > > > > + size_t size, int write)
> > > > > +{
> > > > > + struct page **pages;
> > > > > + int npages = DIV_ROUND_UP(size, PAGE_SIZE);
> > > > > + int npinned;
> > > > > + void *vaddr;
> > > > > +
> > > > > + pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
> > > > > + if (!pages)
> > > > > + return -ENOMEM;
> > > > > +
> > > > > + npinned = get_user_pages_fast(uaddr, npages, write, pages);
> > > > > + if (npinned != npages)
> > > > > + goto err;
> > > > > +
> > > > As I said I have doubts about the whole approach, but this
> > > > implementation in particular isn't a good idea
> > > > as it keeps the page around forever.
>
>
> The pages will be released during set features.
>
>
> > > > So no THP, no NUMA rebalancing,
>
>
> For THP, we will probably miss 2 or 4 pages, but does this really matter
> considering the gain we have?
We as in vhost? Networking isn't the only thing the guest does.
We don't even know if this guest does a lot of networking.
You don't know what else is in this huge page. It can be something very
important that the guest touches all the time.
> For NUMA rebalancing, I'm not even sure that
> it helps in the case of IPC (vhost). It looks to me that in the worst case it
> may cause pages to thrash between nodes if vhost and userspace are running
> on two different nodes.
So again it's a gain for vhost but has a completely unpredictable effect on
other functionality of the guest.
That's what bothers me with this approach.
>
> > >
> > > This is the price of all GUP users not only vhost itself.
> > Yes. GUP is just not a great interface for vhost to use.
>
>
> The zerocopy code (enabled by default) has used it for years.
But only for TX and temporarily. We pin, read, unpin.
Your patch is different
- it writes into memory and GUP has known issues with file
backed memory
- it keeps pages pinned forever
>
> >
> > > What's more
> > > important, the goal is not to be left too much behind for other backends
> > > like DPDK or AF_XDP (all of which are using GUP).
> >
> > So these guys assume userspace knows what it's doing.
> > We can't assume that.
>
>
> What kind of assumption do they have?
>
>
> >
> > > > userspace-controlled
> > > > amount of memory locked up and not accounted for.
> > >
> > > It's pretty easy to add this since the slow path is still kept. If we
> > > exceed the limit, we can switch back to the slow path.
> > >
> > > > Don't get me wrong it's a great patch in an ideal world.
> > > > But then in an ideal world no barriers smap etc are necessary at all.
> > >
> > > Again, this is only for accessing metadata, not the data, which has been used
> > > for years in real use cases.
> > >
> > > For SMAP, it makes sense for addresses that the kernel cannot forecast. But
> > > that's not the case for the vhost metadata since we know the addresses will be
> > > accessed very frequently. As for the speculation barrier, it does nothing for the
> > > data path of vhost, which is a kthread.
> > I don't see how a kthread makes any difference. We do have a validation
> > step which makes some difference.
>
>
> The problem is not the kthread but the userspace address. The
> addresses of vq metadata tend to stay the same for a while, and vhost knows
> they will be accessed frequently. SMAP doesn't help much in this case.
>
> Thanks.
It's true for real-life applications, but a malicious one
can call the setup ioctls any number of times. And SMAP is
all about malicious applications.
>
> >
> > > Packet or AF_XDP benefit from
> > > accessing metadata directly, we should do it as well.
> > >
> > > Thanks
* Re: [PATCH net V2 4/4] vhost: log dirty page correctly
From: Michael S. Tsirkin @ 2018-12-24 17:41 UTC (permalink / raw)
To: Jason Wang; +Cc: Jintack Lim, netdev, linux-kernel, kvm, virtualization
In-Reply-To: <55b3d55a-950f-eeaf-1908-bed78a1a9200@redhat.com>
On Mon, Dec 24, 2018 at 11:43:31AM +0800, Jason Wang wrote:
>
> On 2018/12/14 9:20 PM, Michael S. Tsirkin wrote:
> > On Fri, Dec 14, 2018 at 10:43:03AM +0800, Jason Wang wrote:
> > > On 2018/12/13 10:31 PM, Michael S. Tsirkin wrote:
> > > > > Just to make sure I understand this. It looks to me we should:
> > > > >
> > > > > - allow passing GIOVA->GPA through UAPI
> > > > >
> > > > > - cache GIOVA->GPA somewhere but still use GIOVA->HVA in device IOTLB for
> > > > > performance
> > > > >
> > > > > Is this what you suggest?
> > > > >
> > > > > Thanks
> > > > Not really. We already have GPA->HVA, so I suggested a flag to pass
> > > > GIOVA->GPA in the IOTLB.
> > > >
> > > > This has advantages for security since a single table needs
> > > > then to be validated to ensure guest does not corrupt
> > > > QEMU memory.
> > > >
> > > I wonder how much we can gain from this. Currently, the qemu IOMMU gives the
> > > GIOVA->GPA mapping, and the qemu vhost code translates GPA to HVA and then passes
> > > GIOVA->HVA to vhost. I see no difference.
> > >
> > > Thanks
> > The difference is in security not in performance. Getting a bad HVA
> > corrupts QEMU memory and it might be guest controlled. Very risky.
>
>
> How can this be controlled by the guest? The HVA is generated from qemu RAM blocks,
> which are totally under the control of the qemu memory core rather than the guest.
>
>
> Thanks
It is ultimately under guest influence as guest supplies IOVA->GPA
translations. qemu translates GPA->HVA and gives the translated result
to the kernel. If it's not buggy and kernel isn't buggy it's all
fine.
But that's the approach that was proven not to work in the 20th century.
In the 21st century we are trying a defence-in-depth approach.
My point is that a single code path that is responsible for
the HVA translations is better than two.
>
> > If
> > translations to HVA are done in a single place through a single table
> > it's safer as there's a single risky place.
> >
* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
From: Jason Wang @ 2018-12-24 8:44 UTC (permalink / raw)
To: Michael S. Tsirkin, David Miller
Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181216144200-mutt-send-email-mst@kernel.org>
On 2018/12/17 3:57 AM, Michael S. Tsirkin wrote:
> On Sat, Dec 15, 2018 at 11:43:08AM -0800, David Miller wrote:
>> From: Jason Wang <jasowang@redhat.com>
>> Date: Fri, 14 Dec 2018 12:29:54 +0800
>>
>>> On 2018/12/14 4:12 AM, Michael S. Tsirkin wrote:
>>>> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>>>>> Hi:
>>>>>
>>>>> This series tries to access virtqueue metadata through a kernel virtual
>>>>> address instead of copy_user() friends, since they have too much
>>>>> overhead from checks, spec barriers, or even hardware feature
>>>>> toggling.
>>>>>
>>>>> Tests show about a 24% improvement in TX PPS. It should benefit other
>>>>> cases as well.
>>>>>
>>>>> Please review
>>>> I think the idea of speeding up userspace access is a good one.
>>>> However I think that moving all checks to start is way too aggressive.
>>>
>>> So did packet and AF_XDP. Anyway, sharing the address space and accessing
>>> it directly is the fastest way. Performance is the major
>>> consideration for people choosing a backend. Compared to a userspace
>>> implementation, vhost does not have security advantages at any
>>> level. If vhost is still slow, people will start to develop backends
>>> based on e.g. AF_XDP.
>> Exactly, this is precisely how this kind of problem should be solved.
>>
>> Michael, I strongly support the approach Jason is taking here, and I
>> would like to ask you to seriously reconsider your objections.
>>
>> Thank you.
> Okay. Won't be the first time I'm wrong.
>
> Let's say we ignore security aspects, but we need to make sure the
> following all keep working (broken with this revision):
> - file backed memory (I didn't see where we mark memory dirty -
> if we don't we get guest memory corruption on close, if we do
> then host crash as https://lwn.net/Articles/774411/ seems to apply here?)
We only pin metadata pages, so I don't think they can be used for DMA,
and it is probably not an issue. The real issue is the zerocopy code; maybe
it's time to disable it by default?
> - THP
We will miss 2 or 4 pages for THP; I wonder whether it's measurable at all.
> - auto-NUMA
I'm not sure auto-NUMA will help in the case of IPC. In the worst case it can
hurt performance if vhost and userspace are running on two
different nodes. Anyway, I can measure.
>
> Because vhost isn't like AF_XDP where you can just tell people "use
> hugetlbfs" and "data is removed on close" - people are using it in lots
> of configurations with guest memory shared between rings and unrelated
> data.
This series doesn't share data, only metadata is shared.
>
> Jason, thoughts on these?
>
Based on the above, I can measure how much THP matters here.
The unsafe variants only help when we can batch the accesses, and they
need a non-trivial rework of the vhost code, plus an unexpected amount of
work for archs other than x86. I'm not sure it's worth trying.
Thanks
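The claim that only a handful of metadata pages get pinned can be checked with a little arithmetic. The sketch below is a userspace illustration, not code from the patch. It assumes 4 KiB pages and the split-virtqueue layout from the virtio specification (16-byte descriptors, 2-byte avail entries, 8-byte used entries, a 4-byte flags/idx header and a 2-byte event-index field). All helper names are hypothetical. It also counts the extra page touched when a region does not start on a page boundary.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL  /* assumption: 4 KiB pages */

/* Bytes occupied by each split-virtqueue area for a ring of `num` entries. */
static size_t desc_bytes(unsigned int num)  { return 16UL * num; }
static size_t avail_bytes(unsigned int num) { return 4 + 2UL * num + 2; }
static size_t used_bytes(unsigned int num)  { return 4 + 8UL * num + 2; }

/* Pages touched by [uaddr, uaddr + size), counting partial pages at both ends. */
static unsigned long pages_spanned(unsigned long uaddr, size_t size)
{
	if (size == 0)
		return 0;
	return (uaddr + size - 1) / SKETCH_PAGE_SIZE - uaddr / SKETCH_PAGE_SIZE + 1;
}
```

For a 256-entry ring the three areas take 4096, 518 and 2054 bytes, so each one touches a single page when it fits inside a page and two when it straddles a page boundary.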
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
From: Jason Wang @ 2018-12-24 8:32 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181214072334-mutt-send-email-mst@kernel.org>
On 2018/12/14 8:33 PM, Michael S. Tsirkin wrote:
> On Fri, Dec 14, 2018 at 11:42:18AM +0800, Jason Wang wrote:
>> On 2018/12/13 11:27 PM, Michael S. Tsirkin wrote:
>>> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>>>> Hi:
>>>>
>>>> This series tries to access virtqueue metadata through kernel virtual
>>>> address instead of copy_user() friends since they had too much
>>>> overheads like checks, spec barriers or even hardware feature
>>>> toggling.
>>> Userspace accesses through remapping tricks and next time there's a need
>>> for a new barrier we are left to figure it out by ourselves.
>>
>> I don't get here, do you mean spec barriers?
> I mean the next barrier people decide to put into userspace
> memory accesses.
>
>> It's completely unnecessary for
>> vhost which is kernel thread.
> It's defence in depth. Take a look at the commit that added them.
> And yes quite possibly in most cases we actually have a spec
> barrier in the validation phase. If we do let's use the
> unsafe variants so they can be found.
The unsafe variants only help if you can batch userspace accesses, which
is not necessarily the case under light load.
>
>> And even if you're right, vhost is not the
>> only place, there's lots of vmap() based accessing in kernel.
> For sure. But if one can get by without get user pages, one
> really should. Witness recently uncovered mess with file
> backed storage.
We only pin metadata pages; I don't believe they will be used by any DMA.
>
>> Think in
>> another direction, this means we won't suffer from unnecessary barriers for
>> kthread like vhost in the future, we will manually pick the one we really
>> need
> I personally think we should err on the side of caution not on the side of
> performance.
So what you suggest may lead to an unnecessary performance regression
(10%-20%), and removing that overhead is part of the goal of this series.
We should audit and use only the barriers we really need instead of
depending on copy_user() and friends.
If we do it on our own, we could be slower to pick up a security fix, but
it is no less safe than before and the performance is kept.
>
>> (but it should have little possibility).
> History seems to teach otherwise.
What case did you mean here?
>
>> Please notice we only access metadata through remapping not the data itself.
>> This idea has been used for high speed userspace backend for years, e.g
>> packet socket or recent AF_XDP.
> I think their justification for the higher risk is that they are mostly
> designed for privileged userspace.
I think it's the same with TUN/TAP: a privileged process can pass them to
unprivileged ones.
>
>> The only difference is the page was remapped
>> from kernel to userspace.
> At least that avoids the g.u.p mess.
I'm still not clear on this point. We only pin 2 or 4 pages; there are
several other cases that pin many more.
>
>>> I don't
>>> like the idea I have to say. As a first step, why don't we switch to
>>> unsafe_put_user/unsafe_get_user etc?
>>
>> Several reasons:
>>
>> - They only have x86 variant, it won't have any difference for the rest of
>> architecture.
> Is there an issue on other architectures? If yes they can be extended
> there.
Consider the unexpected amount of work, and that in the best case it can
only match the performance of vmap(). I'm not sure it's worth it.
>
>> - unsafe_put_user/unsafe_get_user is not sufficient for accessing structures
>> (e.g accessing descriptor) or arrays (batching).
> So you want unsafe_copy_xxx_user? I can do this. Hang on will post.
>
>> - Unless we can batch at least the accessing of two places in three of
>> avail, used and descriptor in one run. There will be no difference. E.g we
>> can batch updating used ring, but it won't make any difference in this case.
>>
> So let's batch them all?
Batching might not help for the case of light load. And we need to
measure the gain/cost of batching itself.
>
>
>>> That would be more of an apples to apples comparison, would it not?
>>
>> Apples to apples comparison only help if we are the No.1. But the fact is we
>> are not. If we want to compete with e.g dpdk or AF_XDP, vmap() is the
>> fastest method AFAIK.
>>
>>
>> Thanks
> We need to speed up the packet access itself too though.
> You can't vmap all of guest memory.
This series only pins and vmaps a very few pages (the metadata).
Thanks
>
>
>>>
>>>> Test shows about 24% improvement on TX PPS. It should benefit other
>>>> cases as well.
>>>>
>>>> Please review
>>>>
>>>> Jason Wang (3):
>>>> vhost: generalize adding used elem
>>>> vhost: fine grain userspace memory accessors
>>>> vhost: access vq metadata through kernel virtual address
>>>>
>>>> drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
>>>> drivers/vhost/vhost.h | 11 ++
>>>> 2 files changed, 266 insertions(+), 26 deletions(-)
>>>>
>>>> --
>>>> 2.17.1
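The disagreement about the unsafe variants comes down to amortization. With user_access_begin()/user_access_end(), the toggling cost is paid once per batch rather than once per access, so the saving grows with batch size and vanishes for a batch of one. A toy cost model, using made-up cost units and hypothetical function names (these are not measurements from this series), might look like:

```c
/* Toy model: costs are abstract units, not measured numbers. */

/* Every access pays the toggle (e.g. SMAP off/on) plus the access itself. */
static unsigned long cost_unbatched(unsigned long n, unsigned long toggle,
				    unsigned long access)
{
	return n * (toggle + access);
}

/* One toggle covers the whole batch of n accesses. */
static unsigned long cost_batched(unsigned long n, unsigned long toggle,
				  unsigned long access)
{
	return n ? toggle + n * access : 0;
}
```

With a batch of one the two costs are equal, which is the light-load case raised above. The gap only opens as the batch grows.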
^ permalink raw reply
* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
From: Jason Wang @ 2018-12-24 7:53 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181214073332-mutt-send-email-mst@kernel.org>
On 2018/12/14 8:36 PM, Michael S. Tsirkin wrote:
> On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
>> On 2018/12/13 11:44 PM, Michael S. Tsirkin wrote:
>>> On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
>>>> It was noticed that the copy_user() friends that was used to access
>>>> virtqueue metdata tends to be very expensive for dataplane
>>>> implementation like vhost since it involves lots of software check,
>>>> speculation barrier, hardware feature toggling (e.g SMAP). The
>>>> extra cost will be more obvious when transferring small packets.
>>>>
>>>> This patch tries to eliminate those overhead by pin vq metadata pages
>>>> and access them through vmap(). During SET_VRING_ADDR, we will setup
>>>> those mappings and memory accessors are modified to use pointers to
>>>> access the metadata directly.
>>>>
>>>> Note, this was only done when device IOTLB is not enabled. We could
>>>> use similar method to optimize it in the future.
>>>>
>>>> Tests shows about ~24% improvement on TX PPS when using virtio-user +
>>>> vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
>>>>
>>>> Before: ~5.0Mpps
>>>> After: ~6.1Mpps
>>>>
>>>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>>>> ---
>>>> drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
>>>> drivers/vhost/vhost.h | 11 +++
>>>> 2 files changed, 189 insertions(+)
>>>>
>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>> index bafe39d2e637..1bd24203afb6 100644
>>>> --- a/drivers/vhost/vhost.c
>>>> +++ b/drivers/vhost/vhost.c
>>>> @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>>>> vq->indirect = NULL;
>>>> vq->heads = NULL;
>>>> vq->dev = dev;
>>>> + memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
>>>> + memset(&vq->used_ring, 0, sizeof(vq->used_ring));
>>>> + memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
>>>> mutex_init(&vq->mutex);
>>>> vhost_vq_reset(dev, vq);
>>>> if (vq->handle_kick)
>>>> @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>>>> spin_unlock(&dev->iotlb_lock);
>>>> }
>>>> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
>>>> + size_t size, int write)
>>>> +{
>>>> + struct page **pages;
>>>> + int npages = DIV_ROUND_UP(size, PAGE_SIZE);
>>>> + int npinned;
>>>> + void *vaddr;
>>>> +
>>>> + pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
>>>> + if (!pages)
>>>> + return -ENOMEM;
>>>> +
>>>> + npinned = get_user_pages_fast(uaddr, npages, write, pages);
>>>> + if (npinned != npages)
>>>> + goto err;
>>>> +
>>> As I said I have doubts about the whole approach, but this
>>> implementation in particular isn't a good idea
>>> as it keeps the page around forever.
The pages will be released during set features.
>>> So no THP, no NUMA rebalancing,
For THP, we will probably miss 2 or 4 pages, but does this really matter
considering the gain we get? For NUMA rebalancing, I'm not even sure it
helps in the IPC (vhost) case. It looks to me that in the worst case it
may cause pages to thrash between nodes if vhost and userspace are
running on two different nodes.
>>
>> This is the price of all GUP users not only vhost itself.
> Yes. GUP is just not a great interface for vhost to use.
The zerocopy code (enabled by default) has used them for years.
>
>> What's more
>> important, the goal is not to be left too much behind for other backends
>> like DPDK or AF_XDP (all of which are using GUP).
>
> So these guys assume userspace knows what it's doing.
> We can't assume that.
What kind of assumptions do they have?
>
>>> userspace-controlled
>>> amount of memory locked up and not accounted for.
>>
>> It's pretty easy to add this since the slow path was still kept. If we
>> exceeds the limitation, we can switch back to slow path.
>>
>>> Don't get me wrong it's a great patch in an ideal world.
>>> But then in an ideal world no barriers smap etc are necessary at all.
>>
>> Again, this is only for metadata accessing not the data which has been used
>> for years for real use cases.
>>
>> For SMAP, it makes sense for addresses that the kernel cannot forecast. But
>> it's not the case for the vhost metadata since we know the address will be
>> accessed very frequently. For speculation barrier, it helps nothing for the
>> data path of vhost which is a kthread.
> I don't see how a kthread makes any difference. We do have a validation
> step which makes some difference.
The point is not the kthread but the userspace addresses. The
addresses of vq metadata tend to stay the same for a while, and vhost
knows they will be accessed frequently. SMAP doesn't help much in this case.
Thanks.
>
>> Packet or AF_XDP benefit from
>> accessing metadata directly, we should do it as well.
>>
>> Thanks
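The accounting concern can be handled without giving up the fast path, since the copy_user() path is still there to fall back to. A minimal userspace sketch of that decision, assuming a per-device budget of pinned pages (the struct, enum and function names are hypothetical and do not come from the patch):

```c
#include <stdbool.h>

/* Hypothetical per-device budget of pinned pages; pinned <= limit always. */
struct pin_budget {
	unsigned long limit;
	unsigned long pinned;
};

enum access_path {
	PATH_DIRECT,    /* pinned + vmap()ed metadata */
	PATH_COPY_USER, /* existing slow path */
};

/* Charge npages against the budget; refuse if it would exceed the limit. */
static bool budget_charge(struct pin_budget *b, unsigned long npages)
{
	if (npages > b->limit - b->pinned)
		return false;
	b->pinned += npages;
	return true;
}

/* Return npages to the budget, e.g. when a mapping is torn down. */
static void budget_uncharge(struct pin_budget *b, unsigned long npages)
{
	b->pinned = npages > b->pinned ? 0 : b->pinned - npages;
}

/* Use the direct path only while the budget allows it. */
static enum access_path choose_path(struct pin_budget *b, unsigned long npages)
{
	return budget_charge(b, npages) ? PATH_DIRECT : PATH_COPY_USER;
}
```

A caller that gets PATH_COPY_USER simply keeps using the existing accessors, so exceeding the limit costs performance but not correctness.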
^ permalink raw reply
* Re: [PATCH net V2 4/4] vhost: log dirty page correctly
From: Jason Wang @ 2018-12-24 3:43 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Jintack Lim, netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181214081821-mutt-send-email-mst@kernel.org>
On 2018/12/14 9:20 PM, Michael S. Tsirkin wrote:
> On Fri, Dec 14, 2018 at 10:43:03AM +0800, Jason Wang wrote:
>> On 2018/12/13 10:31 PM, Michael S. Tsirkin wrote:
>>>> Just to make sure I understand this. It looks to me we should:
>>>>
>>>> - allow passing GIOVA->GPA through UAPI
>>>>
>>>> - cache GIOVA->GPA somewhere but still use GIOVA->HVA in device IOTLB for
>>>> performance
>>>>
>>>> Is this what you suggest?
>>>>
>>>> Thanks
>>> Not really. We already have GPA->HVA, so I suggested a flag to pass
>>> GIOVA->GPA in the IOTLB.
>>>
>>> This has advantages for security since a single table needs
>>> then to be validated to ensure guest does not corrupt
>>> QEMU memory.
>>>
>> I wonder how much we can gain through this. Currently, qemu IOMMU gives
>> GIOVA->GPA mapping, and qemu vhost code will translate GPA to HVA then pass
>> GIOVA->HVA to vhost. It looks no difference to me.
>>
>> Thanks
> The difference is in security not in performance. Getting a bad HVA
> corrupts QEMU memory and it might be guest controlled. Very risky.
How can this be controlled by the guest? The HVA is generated from qemu
RAM blocks, which are entirely under the control of the qemu memory core,
not the guest.
Thanks
> If
> translations to HVA are done in a single place through a single table
> it's safer as there's a single risky place.
>
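The two positions can be made concrete with a small model. The sketch below is a userspace illustration with hypothetical names, not vhost or QEMU code. It chains a GIOVA->GPA table into a GPA->HVA table, so an IOTLB entry that points outside guest RAM is rejected by the second lookup instead of yielding an arbitrary HVA.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Maps [start, start + size) onto [target, target + size). */
struct range_map {
	uint64_t start;
	uint64_t size;
	uint64_t target;
};

/* Translate [addr, addr + len) if it lies entirely inside one entry. */
static bool translate(const struct range_map *tbl, size_t n,
		      uint64_t addr, uint64_t len, uint64_t *out)
{
	for (size_t i = 0; i < n; i++) {
		const struct range_map *m = &tbl[i];

		/* Written to avoid overflow in addr + len. */
		if (addr >= m->start && len <= m->size &&
		    addr - m->start <= m->size - len) {
			*out = m->target + (addr - m->start);
			return true;
		}
	}
	return false;
}

/* GIOVA -> GPA via the IOTLB, then GPA -> HVA via the memory table. */
static bool giova_to_hva(const struct range_map *iotlb, size_t n_iotlb,
			 const struct range_map *mem, size_t n_mem,
			 uint64_t giova, uint64_t len, uint64_t *hva)
{
	uint64_t gpa;

	return translate(iotlb, n_iotlb, giova, len, &gpa) &&
	       translate(mem, n_mem, gpa, len, hva);
}
```

In this model the GPA->HVA table is the only place that produces host addresses, which is the "single risky place" argument. Passing GIOVA->HVA directly would skip that second check.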
^ permalink raw reply
* Re: [PATCH v2] x86, kbuild: revert macrofying inline assembly code
From: Andi Kleen @ 2018-12-21 18:44 UTC (permalink / raw)
To: Masahiro Yamada
Cc: linux-arch, Juergen Gross, Michal Marek, Richard Biener,
Arnd Bergmann, Segher Boessenkool, linux-kbuild, x86,
linux-kernel, virtualization, linux-sparse, Ingo Molnar,
Borislav Petkov, H . Peter Anvin, Nadav Amit, Thomas Gleixner,
Alok Kataria, Luc Van Oostenryck
In-Reply-To: <1544928632-9717-1-git-send-email-yamada.masahiro@socionext.com>
Masahiro Yamada <yamada.masahiro@socionext.com> writes:
> Revert the following 9 commits:
FWIW the -Wa addition also broke LTO builds, because LTO doesn't really
support -Wa for individual files.
So I'm glad they got reverted.
-Andi
^ permalink raw reply
* Call for Papers - MICRADS 2019, Rio de Janeiro, Brazil | Deadline: December 28
From: ML @ 2018-12-21 18:10 UTC (permalink / raw)
To: virtualization
***** Proceedings by Springer
--------------------------------------
MICRADS´19 - The 2019 Multidisciplinary International Conference of Research Applied to Defense and Security
Rio de Janeiro, Brazil, 8 - 10 May 2019
http://www.micrads.org/
------------------------------------------------------------------------------------------------------------------------------
Scope
MICRADS´19 - The 2019 Multidisciplinary International Conference of Research Applied to Defense and Security, to be held at Rio de Janeiro, Brazil, 8 - 10 May 2019, is an international forum for researchers and practitioners to present and discuss the most recent innovations, trends, results, experiences and concerns in the several perspectives of Defense and Security.
We are pleased to invite you to submit your papers to MICRADS´19. They can be written in English, Spanish or Portuguese. All submissions will be reviewed on the basis of relevance, originality, importance and clarity.
Topics
Submitted papers should be related with one or more of the main themes proposed for the Conference:
Area A: Systems, Communication and Defense
A1) Information and Communication Technology in Education
A2) Simulation and computer vision in military applications
A3) Analysis and Signal Processing
A4) Cybersecurity and Cyberdefense
A5) Computer Networks, Mobility and Pervasive Systems
Area B: Strategy and political-administrative vision in Defense
B1) Safety and Maritime Protection
B2) Strategy, Geopolitics and Oceanopolitics
B3) Planning, economy and logistics applied to Defense
B4) Leadership and e-leadership
B5) Military Marketing
B6) Health informatics in military applications
Area C: Engineering and technologies applied to Defense
C1) Wearable Technology and Assistance Devices
C2) Military Naval Engineering
C3) Weapons and Combat Systems
C4) Chemical, Biological and Nuclear Defense
C5) Defense Engineering (General)
Submission and Decision
Submitted papers written in English (up to a 10-page limit) must comply with the format of the Smart Innovation, Systems and Technologies series (see Instructions for Authors at Springer Website <https://www.springer.com/us/authors-editors/conference-proceedings/conference-proceedings-guidelines>), must not have been published before, must not be under review for any other conference or publication, and must not include any information leading to the authors’ identification. Therefore, the authors’ names, affiliations and e-mails should not be included in the version for evaluation by the Scientific Committee. This information should only be included in the camera-ready version, saved in Word or LaTeX format and also in PDF format. These files must be accompanied by the completed Consent to Publish form <http://www.micrads.org/consent.doc>, in a ZIP file, and uploaded to the conference management system.
Submitted papers written in Spanish or Portuguese (up to a 15-page limit) must comply with the format of RISTI <http://www.risti.xyz/> - Revista Ibérica de Sistemas e Tecnologias de Informação (download instructions/template for authors in Spanish <http://www.risti.xyz/formato-es.doc> or Portuguese <http://www.risti.xyz/formato-pt.doc>), must not have been published before, must not be under review for any other conference or publication, and must not include any information leading to the authors’ identification. Therefore, the authors’ names, affiliations and e-mails should not be included in the version for evaluation by the Scientific Committee. This information should only be included in the camera-ready version, saved in Word. These files must be uploaded to the conference management system in a ZIP file.
All papers will be subjected to a “blind review” by at least two members of the Scientific Committee.
Based on Scientific Committee evaluation, a paper can be rejected or accepted by the Conference Chairs. In the latter case, it can be accepted as paper or poster.
The authors of papers accepted as posters must build and print a poster to be exhibited during the Conference. This poster must follow an A1 or A2 vertical format. The Conference may include Work Sessions where these posters are presented and orally discussed, with a 7 minute limit per poster.
The authors of accepted papers will have 15 minutes to present their work in a Conference Work Session; approximately 5 minutes of discussion will follow each presentation.
Publication and Indexing
To ensure that an accepted paper is published, at least one of the authors must be fully registered by February 11, 2019, and the paper must comply with the suggested layout and page limit (up to 10 pages). Additionally, all recommended changes must be addressed by the authors before they submit the camera-ready version.
No more than one paper per registration will be published. An extra fee must be paid for publication of additional papers, with a maximum of one additional paper per registration. One registration permits only the participation of one author in the conference.
Papers can be written in English, Spanish or Portuguese. Accepted and registered papers written in English will be published in Proceedings by Springer, in a book of its SIST series, and will be submitted for indexing by ISI, SCOPUS, EI-Compendex, SpringerLink, and Google Scholar.
Important Dates
Paper Submission: December 28, 2018
Notification of Acceptance: January 28, 2019
Payment of Registration, to ensure the inclusion of an accepted paper in the conference proceedings: February 11, 2019.
Camera-ready Submission: February 11, 2019
Website of MICRADS'19: http://www.micrads.org/
^ permalink raw reply
* Re: 4.20-rc6: WARNING: CPU: 30 PID: 197360 at net/core/flow_dissector.c:764 __skb_flow_dissect
From: Willem de Bruijn @ 2018-12-21 14:42 UTC (permalink / raw)
To: Christian Borntraeger
Cc: Willem de Bruijn, Michael S Tsirkin, Network Development,
linux-kernel@vger.kernel.org,
virtualization@lists.linux-foundation.org, Ido Schimmel
In-Reply-To: <e25104e5-4385-965c-993c-85952db254c9@de.ibm.com>
On Fri, Dec 21, 2018 at 1:45 AM Christian Borntraeger
<borntraeger@de.ibm.com> wrote:
>
>
>
> On 20.12.2018 18:23, Willem de Bruijn wrote:
> > On Thu, Dec 20, 2018 at 11:17 AM Ido Schimmel <idosch@idosch.org> wrote:
> >>
> >> On Thu, Dec 20, 2018 at 03:09:22PM +0100, Christian Borntraeger wrote:
> >>> On 20.12.2018 10:12, Ido Schimmel wrote:
> >>>> +Willem
> >>>>
> >>>> On Thu, Dec 20, 2018 at 08:45:40AM +0100, Christian Borntraeger wrote:
> >>>>> Folks,
> >>>>>
> >>>>> I got this warning today. I cant tell when and why this happened, so I do not know yet how to reproduce.
> >>>>> Maybe someone has a quick idea.
> >>>>>
> >>>>> [85109.572032] WARNING: CPU: 30 PID: 197360 at net/core/flow_dissector.c:764 __skb_flow_dissect+0x1f0/0x1318
> >>>>
> >>>> I managed to trigger this warning as well the other day, but from a
> >>>> different call path:
> >>>
> >>> FWIW, it also seems to happen on 4.20-rc1. 4.19.0 seems fine. bisect seem to have failed so
> >>> my reproducer is not reliable.
> >>
> >> Yes, it is caused by commit d0e13a1488ad ("flow_dissector: lookup netns
> >> by skb->sk if skb->dev is NULL")
> >>
> >> $ git tag --contains d0e13a1488ad
> >> v4.20-rc1
> >> v4.20-rc2
> >> v4.20-rc3
> >> v4.20-rc4
> >> v4.20-rc5
> >> v4.20-rc6
> >
> > That tap_get_user_xdp path is also new for 4.20-rc1:
> >
> > commit 0efac27791ee068075d80f07c55a229b1335ce12
> > tap: accept an array of XDP buffs through sendmsg()
> >
> > $ git describe --contains 0efac27791ee
> > v4.20-rc1~14^2~382^2~1
> >
> > In v4.19 and before all packets went through tap_get_user.
>
> Hmmm, so maybe my bisect wasnt broken at all? It pointed to
>
> commit 105bc1306e9b29c2aa2783b9524f7aec9b5a5b1f
> Merge: 3475372ff60e4 d0e13a1488ad3
> Author: David S. Miller <davem@davemloft.net>
> AuthorDate: Tue Sep 25 20:29:38 2018 -0700
> Commit: David S. Miller <davem@davemloft.net>
> CommitDate: Tue Sep 25 20:29:38 2018 -0700
>
> Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Yes, that's the right commit. The flow dissector change went in
through bpf-next.
^ permalink raw reply
* Re: 4.20-rc6: WARNING: CPU: 30 PID: 197360 at net/core/flow_dissector.c:764 __skb_flow_dissect
From: Christian Borntraeger @ 2018-12-21 6:45 UTC (permalink / raw)
To: Willem de Bruijn, Ido Schimmel
Cc: Network Development, Willem de Bruijn,
virtualization@lists.linux-foundation.org,
linux-kernel@vger.kernel.org, Michael S Tsirkin
In-Reply-To: <CAF=yD-KBkX3NxtDt6mv-PoujFfdtJJ66XsnGzx-KQop6EU6LRw@mail.gmail.com>
On 20.12.2018 18:23, Willem de Bruijn wrote:
> On Thu, Dec 20, 2018 at 11:17 AM Ido Schimmel <idosch@idosch.org> wrote:
>>
>> On Thu, Dec 20, 2018 at 03:09:22PM +0100, Christian Borntraeger wrote:
>>> On 20.12.2018 10:12, Ido Schimmel wrote:
>>>> +Willem
>>>>
>>>> On Thu, Dec 20, 2018 at 08:45:40AM +0100, Christian Borntraeger wrote:
>>>>> Folks,
>>>>>
>>>>> I got this warning today. I cant tell when and why this happened, so I do not know yet how to reproduce.
>>>>> Maybe someone has a quick idea.
>>>>>
>>>>> [85109.572032] WARNING: CPU: 30 PID: 197360 at net/core/flow_dissector.c:764 __skb_flow_dissect+0x1f0/0x1318
>>>>
>>>> I managed to trigger this warning as well the other day, but from a
>>>> different call path:
>>>
>>> FWIW, it also seems to happen on 4.20-rc1. 4.19.0 seems fine. bisect seem to have failed so
>>> my reproducer is not reliable.
>>
>> Yes, it is caused by commit d0e13a1488ad ("flow_dissector: lookup netns
>> by skb->sk if skb->dev is NULL")
>>
>> $ git tag --contains d0e13a1488ad
>> v4.20-rc1
>> v4.20-rc2
>> v4.20-rc3
>> v4.20-rc4
>> v4.20-rc5
>> v4.20-rc6
>
> That tap_get_user_xdp path is also new for 4.20-rc1:
>
> commit 0efac27791ee068075d80f07c55a229b1335ce12
> tap: accept an array of XDP buffs through sendmsg()
>
> $ git describe --contains 0efac27791ee
> v4.20-rc1~14^2~382^2~1
>
> In v4.19 and before all packets went through tap_get_user.
Hmmm, so maybe my bisect wasn't broken at all? It pointed to
commit 105bc1306e9b29c2aa2783b9524f7aec9b5a5b1f
Merge: 3475372ff60e4 d0e13a1488ad3
Author: David S. Miller <davem@davemloft.net>
AuthorDate: Tue Sep 25 20:29:38 2018 -0700
Commit: David S. Miller <davem@davemloft.net>
CommitDate: Tue Sep 25 20:29:38 2018 -0700
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
^ permalink raw reply
* CfP for VHPC ‘19: HPC Virtualization Paper Registration due January 25, 2019 - 14th Virtualization in High-Performance Cloud Computing Workshop@ISC
From: VHPC 19 @ 2018-12-20 20:48 UTC (permalink / raw)
To: virtualization
Please accept our apologies if you receive multiple copies of this Call for Papers
====================================================================
CALL FOR PAPERS
14th Workshop on Virtualization in High-Performance Cloud Computing (VHPC '19)
held in conjunction with the International Supercomputing Conference - High Performance,
June 16-20, 2019, Frankfurt, Germany.
(Springer LNCS Proceedings)
====================================================================
Date: June 20, 2019
Workshop URL: http://vhpc.org
Abstract Submission Deadline: January 25, 2019
Springer LNCS, rolling abstract submission
Abstract/Paper Submission Link: https://edas.info/newPaper.php?c=25685
Call for Papers
Containers and virtualization technologies constitute key enabling factors
for flexible resource management in modern data centers, and particularly
in cloud environments. Cloud providers need to manage complex
infrastructures in a seamless fashion to support the highly dynamic and
heterogeneous workloads and hosted applications customers deploy.
Similarly, HPC environments have been increasingly adopting techniques
that enable flexible management of vast computing and networking
resources, close to marginal provisioning cost, which is unprecedented in
the history of scientific and commercial computing.
Various virtualization-containerization technologies contribute to the
overall picture in different ways: machine virtualization, with its
capability to enable consolidation of multiple underutilized servers with
heterogeneous software and operating systems (OSes), and its capability to
live-migrate a fully operating virtual machine (VM) with a very short
downtime, enables novel and dynamic ways to manage physical servers;
OS-level virtualization (i.e., containerization), with its capability to
isolate multiple user-space environments and to allow for their
coexistence within the same OS kernel, promises to provide many of the
advantages of machine virtualization with high levels of responsiveness
and performance; lastly, unikernels provide for many virtualization
benefits with a minimized OS/library surface. I/O Virtualization in turn
allows physical network interfaces to take traffic from multiple VMs or
containers; network virtualization, with its capability to create logical
network overlays that are independent of the underlying physical topology
is furthermore enabling virtualization of HPC infrastructures.
Publication
Accepted papers will be published in a Springer LNCS proceedings volume.
Topics of Interest
The VHPC program committee solicits original, high-quality submissions
related to virtualization across the entire software stack with a special
focus on the intersection of HPC, containers-virtualization and the cloud.
Major Topics:
- HPC on Containers and VMs
- Containerized applications with OS-level virtualization
- Lightweight applications with Unikernels
- HPC-as-a-Service
each major topic encompassing design/architecture, management, performance
management, modeling and configuration/tooling:
Design / Architecture:
- Containers and OS-level virtualization (LXC, Docker, rkt, Singularity,
Shifter, i.a.)
- Hypervisor support for heterogeneous resources (GPUs, co-processors,
FPGAs, etc.)
- Hypervisor extensions to mitigate side-channel attacks
([micro-]architectural timing attacks, privilege escalation)
- VM & Container trust and security models
- Multi-environment coupling, system software supporting in-situ analysis
with HPC simulation
- Cloud reliability, fault-tolerance and high-availability
- Energy-efficient and power-aware virtualization
- Containers inside VMs with hypervisor isolation
- Virtualization support for emerging memory technologies
- Lightweight/specialized operating systems in conjunction with virtual
machines
- Novel unikernels and use cases for virtualized HPC environments
- ARM-based hypervisors, ARM virtualization extensions
Management:
- Container and VM management for HPC and cloud environments
- HPC services integration, services to support HPC
- Service and on-demand scheduling & resource management
- Dedicated workload management with VMs or containers
- Workflow coupling with VMs and containers
- Unikernel, lightweight VM application management
- Environments and tools for operating containerized environments (batch,
orchestration)
- Novel models for non-HPC workload provisioning on HPC resources
Performance Measurements and Modeling:
- Performance improvements for or driven by unikernels
- Optimizations of virtual machine monitor platforms and hypervisors
- Scalability analysis of VMs and/or containers at large scale
- Performance measurement, modeling and monitoring of virtualized/cloud
workloads
- Virtualization in supercomputing environments, HPC clusters, HPC in the
cloud
Configuration / Tooling:
- Tool support for unikernels: configuration/build environments, debuggers,
profilers
- Job scheduling/control/policy and container placement in virtualized
environments
- Operating MPI in containers/VMs and Unikernels
- Software defined networks and network virtualization
- GPU virtualization operationalization
The Workshop on Virtualization in High-Performance Cloud Computing (VHPC)
aims to bring together researchers and industrial practitioners facing the
challenges posed by virtualization in order to foster discussion,
collaboration, mutual exchange of knowledge and experience, enabling
research to ultimately provide novel solutions for virtualized computing
systems of tomorrow.
The workshop will be one day in length, composed of 20 min paper
presentations, each followed by 10 min discussion sections, plus lightning
talks that are limited to 5 minutes. Presentations may be accompanied by
interactive demonstrations.
Important Dates
January 25, 2019 - Abstract Registration deadline
Apr 19th, 2019 - Paper submission deadline (Springer LNCS)
May 3, 2019 - Acceptance notification
June 20th, 2019 - Workshop Day
July 10th, 2019 - Camera-ready version due
Chair
Michael Alexander (chair), University of Vienna, Austria
Anastassios Nanos (co-chair), SunLight.io, UK
Andrew Younge (co-chair), Sandia National Laboratories
Program committee
Stergios Anastasiadis, University of Ioannina, Greece
Jakob Blomer, CERN, Europe
Eduardo César, Universidad Autonoma de Barcelona, Spain
Stephen Crago, USC ISI, USA
Tommaso Cucinotta, St. Anna School of Advanced Studies, Italy
Christoffer Dall, Columbia University, USA
Patrick Dreher, MIT, USA
Kyle Hale, Northwestern University, USA
Brian Kocoloski, Washington University, USA
John Lange, University of Pittsburgh, USA
Giuseppe Lettieri, University of Pisa, Italy
Qing Liu, Oak Ridge National Laboratory, USA
Nikos Parlavantzas, IRISA, France
Kevin Pedretti, Sandia National Laboratories, USA
Amer Qouneh, Western New England University, USA
Carlos Reaño, Queen’s University Belfast, UK
Borja Sotomayor, University of Chicago, USA
Joe Stubbs, Texas Advanced Computing Center, USA
Anata Tiwari, San Diego Supercomputer Center, USA
Kurt Tutschku, Blekinge Institute of Technology, Sweden
Yasuhiro Watashiba, Osaka University, Japan
Chao-Tung Yang, Tunghai University, Taiwan
Na Zhang, VMware, USA
Paper Submission-Publication
Papers submitted to the workshop will be reviewed by at least two
members of the program committee and external reviewers. Submissions
should include abstract, keywords, the e-mail address of the
corresponding author, and must not exceed 10 pages, including tables
and figures at a main font size no smaller than 11 point. Submission
of a paper should be regarded as a commitment that, should the paper
be accepted, at least one of the authors will register and attend the
conference to present the work. Accepted papers will be published in a
Springer LNCS volume.
The format must be according to the Springer LNCS Style. Initial
submissions are in PDF; authors of accepted papers will be requested
to provide source files.
Format Guidelines:
ftp://ftp.springernature.com/cs-proceeding/llncs/llncs2e.zip
Abstract, Paper Submission Link:
https://edas.info/newPaper.php?c=25685
Lightning Talks
Lightning Talks are non-paper track, synoptical in nature and are strictly
limited to 5 minutes. They can be used to gain early feedback on ongoing
research, for demonstrations, to present research results, early research
ideas, perspectives and positions of interest to the community. Submit
abstract via the main submission link.
General Information
The workshop is one day in length and will be held in conjunction with the
International Supercomputing Conference - High Performance (ISC) 2019,
June 16-20, Frankfurt, Germany.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH v6 0/7] Add virtio-iommu driver
From: Michael S. Tsirkin @ 2018-12-20 18:17 UTC (permalink / raw)
To: Jean-Philippe Brucker
Cc: Mark Rutland, virtio-dev@lists.oasis-open.org, Lorenzo Pieralisi,
tnowicki@caviumnetworks.com, devicetree@vger.kernel.org,
Marc Zyngier, linux-pci@vger.kernel.org, Joerg Roedel,
Will Deacon, virtualization@lists.linux-foundation.org,
eric.auger@redhat.com, iommu@lists.linux-foundation.org,
robh+dt@kernel.org, bhelgaas@google.com, Robin Murphy,
kvmarm@lists.cs.columbia.edu
In-Reply-To: <e1c7aee9-083d-103a-87a9-b59d5e63d7aa@arm.com>
On Thu, Dec 20, 2018 at 05:59:46PM +0000, Jean-Philippe Brucker wrote:
> On 19/12/2018 23:09, Michael S. Tsirkin wrote:
> > On Thu, Dec 13, 2018 at 12:50:29PM +0000, Jean-Philippe Brucker wrote:
> >>>> [3] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.9.1
> >>>> git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.9
> >>>
> >>> Unfortunately gitweb seems to be broken on linux-arm.org. What is missing
> >>> in this patch-set to make this work on x86?
> >>
> >> You should be able to access it here:
> >> http://www.linux-arm.org/git?p=linux-jpb.git;a=shortlog;h=refs/heads/virtio-iommu/devel
> >>
> >> That branch contains missing bits for x86 support:
> >>
> >> * ACPI support. We have the code but it's waiting for an IORT spec
> >> update, to reserve the IORT node ID. I expect it to take a while, given
> >> that I'm alone requesting a change for something that's not upstream or
> >> in hardware.
> >
> > Frankly I think you should take a hard look at just getting the data
> > needed from the PCI device itself. You don't need to depend on virtio,
> > it can be a small driver that gets you that data from the device config
> > space and then just goes away.
> >
> > If you want help with writing such a small driver let me know.
> >
> > If there's an advantage to virtio-iommu then that would be its
> > portability, and it all goes out of the window because
> > of dependencies on ACPI and DT and OF and the rest of the zoo.
>
> But the portable solutions are ACPI and DT.
>
> Describing the DMA dependency through a device would require the guest
> to probe the device before all others. How do we communicate this?
> * pass a kernel parameter saying something like "probe_first=00:01.0"
> * make sure that the PCI root complex is probed before any other
> platform device (since the IOMMU can manage DMA of platform devices).
My idea was to just find and probe the specific device.
> * change DT, ACPI and PCI core code to handle this probe_first kernel
> parameter.
>
> Better go with something standard, that any OS and hypervisor knows how
> to use, and that other IOMMU devices already use.
>
> >> * DMA ops for x86 (see "HACK" commit). I'd like to use dma-iommu but I'm
> >> not sure how to implement the glue that sets dma_ops properly.
> >>
> >> Thanks,
> >> Jean
> >
> > OK so IIUC you are looking into Christoph's suggestions to fix that up?
>
> Eventually yes. I'll give it a try next year, once the dma-iommu changes
> are on the list. It's not a priority for me, given that x86 already has
> a pvIOMMU with VT-d, and that Arm still needs one.
Well, that's kind of a weak use case, isn't it?
Can we just build VT-d on ARM?
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index d9a25715650e..009fa98e9363 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -174,7 +174,7 @@ config DMAR_TABLE
 
 config INTEL_IOMMU
 	bool "Support for Intel IOMMU using DMA Remapping Devices"
-	depends on PCI_MSI && ACPI && (X86 || IA64_GENERIC)
+	depends on PCI_MSI && ACPI && (X86 || IA64_GENERIC || ARM)
 	select IOMMU_API
 	select IOMMU_IOVA
 	select NEED_DMA_MAP_STATE
didn't try this one ...
> It shouldn't block
> this series.
>
> > There's still a bit of time left before the merge window,
> > maybe you can make above changes.
>
> I'll wait to see if Joerg has other concerns about the design or the
> code, and resend in January. I think that IOMMU driver changes should go
> through his tree.
>
> Thanks,
> Jean
Sorry which changes do you mean?
--
MST
^ permalink raw reply related
* Re: [PATCH v6 0/7] Add virtio-iommu driver
From: Jean-Philippe Brucker @ 2018-12-20 17:59 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Mark Rutland, virtio-dev@lists.oasis-open.org, Lorenzo Pieralisi,
tnowicki@caviumnetworks.com, devicetree@vger.kernel.org,
Marc Zyngier, linux-pci@vger.kernel.org, Joerg Roedel,
Will Deacon, virtualization@lists.linux-foundation.org,
eric.auger@redhat.com, iommu@lists.linux-foundation.org,
robh+dt@kernel.org, bhelgaas@google.com, Robin Murphy,
kvmarm@lists.cs.columbia.edu
In-Reply-To: <20181219180417-mutt-send-email-mst@kernel.org>
On 19/12/2018 23:09, Michael S. Tsirkin wrote:
> On Thu, Dec 13, 2018 at 12:50:29PM +0000, Jean-Philippe Brucker wrote:
>>>> [3] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.9.1
>>>> git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.9
>>>
>>> Unfortunately gitweb seems to be broken on linux-arm.org. What is missing
>>> in this patch-set to make this work on x86?
>>
>> You should be able to access it here:
>> http://www.linux-arm.org/git?p=linux-jpb.git;a=shortlog;h=refs/heads/virtio-iommu/devel
>>
>> That branch contains missing bits for x86 support:
>>
>> * ACPI support. We have the code but it's waiting for an IORT spec
>> update, to reserve the IORT node ID. I expect it to take a while, given
>> that I'm alone requesting a change for something that's not upstream or
>> in hardware.
>
> Frankly I think you should take a hard look at just getting the data
> needed from the PCI device itself. You don't need to depend on virtio,
> it can be a small driver that gets you that data from the device config
> space and then just goes away.
>
> If you want help with writing such a small driver let me know.
>
> If there's an advantage to virtio-iommu then that would be its
> portability, and it all goes out of the window because
> of dependencies on ACPI and DT and OF and the rest of the zoo.
But the portable solutions are ACPI and DT.
Describing the DMA dependency through a device would require the guest
to probe the device before all others. How do we communicate this?
* pass a kernel parameter saying something like "probe_first=00:01.0"
* make sure that the PCI root complex is probed before any other
platform device (since the IOMMU can manage DMA of platform devices).
* change DT, ACPI and PCI core code to handle this probe_first kernel
parameter.
Better go with something standard, that any OS and hypervisor knows how
to use, and that other IOMMU devices already use.
>> * DMA ops for x86 (see "HACK" commit). I'd like to use dma-iommu but I'm
>> not sure how to implement the glue that sets dma_ops properly.
>>
>> Thanks,
>> Jean
>
> OK so IIUC you are looking into Christoph's suggestions to fix that up?
Eventually yes. I'll give it a try next year, once the dma-iommu changes
are on the list. It's not a priority for me, given that x86 already has
a pvIOMMU with VT-d, and that Arm still needs one. It shouldn't block
this series.
> There's still a bit of time left before the merge window,
> maybe you can make above changes.
I'll wait to see if Joerg has other concerns about the design or the
code, and resend in January. I think that IOMMU driver changes should go
through his tree.
Thanks,
Jean
^ permalink raw reply
* Re: 4.20-rc6: WARNING: CPU: 30 PID: 197360 at net/core/flow_dissector.c:764 __skb_flow_dissect
From: Willem de Bruijn @ 2018-12-20 17:23 UTC (permalink / raw)
To: Ido Schimmel
Cc: Willem de Bruijn, Michael S Tsirkin, Network Development,
linux-kernel@vger.kernel.org,
virtualization@lists.linux-foundation.org
In-Reply-To: <20181220141752.GB861@splinter>
On Thu, Dec 20, 2018 at 11:17 AM Ido Schimmel <idosch@idosch.org> wrote:
>
> On Thu, Dec 20, 2018 at 03:09:22PM +0100, Christian Borntraeger wrote:
> > On 20.12.2018 10:12, Ido Schimmel wrote:
> > > +Willem
> > >
> > > On Thu, Dec 20, 2018 at 08:45:40AM +0100, Christian Borntraeger wrote:
> > >> Folks,
> > >>
> > >> I got this warning today. I can't tell when or why this happened, so I do not yet know how to reproduce it.
> > >> Maybe someone has a quick idea.
> > >>
> > >> [85109.572032] WARNING: CPU: 30 PID: 197360 at net/core/flow_dissector.c:764 __skb_flow_dissect+0x1f0/0x1318
> > >
> > > I managed to trigger this warning as well the other day, but from a
> > > different call path:
> >
> > FWIW, it also seems to happen on 4.20-rc1. 4.19.0 seems fine. Bisect seems to have failed, so
> > my reproducer is not reliable.
>
> Yes, it is caused by commit d0e13a1488ad ("flow_dissector: lookup netns
> by skb->sk if skb->dev is NULL")
>
> $ git tag --contains d0e13a1488ad
> v4.20-rc1
> v4.20-rc2
> v4.20-rc3
> v4.20-rc4
> v4.20-rc5
> v4.20-rc6
That tap_get_user_xdp path is also new for 4.20-rc1:
commit 0efac27791ee068075d80f07c55a229b1335ce12
tap: accept an array of XDP buffs through sendmsg()
$ git describe --contains 0efac27791ee
v4.20-rc1~14^2~382^2~1
In v4.19 and before all packets went through tap_get_user.
^ permalink raw reply
* Re: 4.20-rc6: WARNING: CPU: 30 PID: 197360 at net/core/flow_dissector.c:764 __skb_flow_dissect
From: Willem de Bruijn @ 2018-12-20 14:41 UTC (permalink / raw)
To: Ido Schimmel
Cc: Willem de Bruijn, Michael S Tsirkin, Network Development,
linux-kernel@vger.kernel.org,
virtualization@lists.linux-foundation.org
In-Reply-To: <CAF=yD-K6Y=Jt=KNLOtg-_c32bnp__c_3RM4XvO6Q-Zye-nd4=A@mail.gmail.com>
On Thu, Dec 20, 2018 at 9:34 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Thu, Dec 20, 2018 at 9:16 AM Ido Schimmel <idosch@idosch.org> wrote:
> >
> > On Thu, Dec 20, 2018 at 09:04:25AM -0500, Willem de Bruijn wrote:
> > > On Thu, Dec 20, 2018 at 6:15 AM Ido Schimmel <idosch@idosch.org> wrote:
> > > >
> > > > +Willem
> > > >
> > > > On Thu, Dec 20, 2018 at 08:45:40AM +0100, Christian Borntraeger wrote:
> > > > > Folks,
> > > > >
> > > > > I got this warning today. I can't tell when or why this happened, so I do not yet know how to reproduce it.
> > > > > Maybe someone has a quick idea.
> > > > >
> > > > > [85109.572032] WARNING: CPU: 30 PID: 197360 at net/core/flow_dissector.c:764 __skb_flow_dissect+0x1f0/0x1318
> > > >
> > > > I managed to trigger this warning as well the other day, but from a
> > > > different call path:
> > > >
> > > > [280155.348610] fib_multipath_hash+0x28c/0x2d0
> > > > [280155.348613] ? fib_multipath_hash+0x28c/0x2d0
> > > > [280155.348619] fib_select_path+0x241/0x32f
> > > > [280155.348622] ? __fib_lookup+0x6a/0xb0
> > > > [280155.348626] ip_route_output_key_hash_rcu+0x650/0xa30
> > > > [280155.348631] ? __alloc_skb+0x9b/0x1d0
> > > > [280155.348634] inet_rtm_getroute+0x3f7/0xb80
> > >
> > > inet_rtm_getroute builds a new packet with inet_rtm_getroute_build_skb
> > > here without dev or sk.
> >
> > Ack
> >
> > >
> > > > Problem is the synthesized skb for output route resolution does not have
> > > > skb->dev or skb->sk set. When a multipath route is hit and
> > > > net.ipv4.fib_multipath_hash_policy is set the flow dissector is called
> > > > with this skb and the warning is triggered.
> > > >
> > > > I plan to fix it by setting skb->dev to net->loopback_dev.
> > >
> > > The device can be chosen based on iif in inet_rtm_getroute? A first
> > > thought, I don't know this code very well.
> >
> > Yes, but iif is for input routes. I'm talking about output routes.
> >
> > > Let me know if you want me to take a stab at that patch. IPv6 probably
> > > will need the same.
> >
> > Yes, I'll try it now and post later today if everything is OK. IPv6 is
> > using flow info and not an skb, so no problem there. I also checked
> > other getroute implementations and none of them call into the flow
> > dissector with an skb, so I think we're fine.
> >
> > >
> > > > I assume we
> > > > want to keep this warning to prevent call paths which will otherwise
> > > > silently fallback to standard flow dissector instead of the BPF one.
> > >
> > > Indeed, the warning is there to sniff out paths that do not follow
> > > what I thought was an invariant. If there are too many exceptions, I
> > > may have to revisit that assumption. But for now, let's see if we can
> > > address these edge cases.
> >
> > Ack
> >
> > >
> > > > I'm not familiar with tap code, so someone else will need to patch this
> > > > case, but it looks like:
> > > >
> > > > tap_sendmsg()
> > > > tap_get_user()
> > > > skb_probe_transport_header()
> > > > skb_flow_dissect_flow_keys_basic()
> > > > __skb_flow_dissect()
> > > >
> > > > skb->dev is only set later in the code.
> > >
> > > tap_get_user uses sock_alloc_send_pskb (through tap_alloc_skb) to
> > > allocate the skb. So skb->sk should be set at the time of
> > > skb_probe_transport_header. I'm not sure how this path triggers the
> > > warning.
> >
> > Maybe it's:
> >
> > tap_sendmsg()
> > tap_get_user_xdp()
> > build_skb()
> > skb_probe_transport_header()
> > skb_flow_dissect_flow_keys_basic()
> > __skb_flow_dissect()
>
> Oh, indeed. I completely overlooked that path.
>
> I will call skb_set_owner_w there and will audit the other users of build_skb.
Uhm, no, that may not be the right solution if these packets may
be injected into the receive path. This also affects the tun device
through tun_xdp_one, which calls netif_receive_skb.
I'll need to take a closer look. Another approach is to move the
assignment skb->dev = tap->dev earlier.
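The lookup order discussed in this thread can be sketched in plain
userspace C. This is only a model under stated assumptions, not the
kernel code: the struct and function names below (netns_model,
skb_model, skb_netns_source and so on) are hypothetical stand-ins for
struct net, struct sk_buff and the check in __skb_flow_dissect(). It
shows why an skb from a build_skb()-style path, with neither dev nor sk
set, ends up in the branch that warns, and why assigning the device
before dissecting avoids it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures named in the thread. */
struct netns_model { const void *flow_dissector_prog; };
struct net_device_model { struct netns_model *net; };
struct sock_model { struct netns_model *net; };
struct skb_model {
	struct net_device_model *dev;	/* models skb->dev */
	struct sock_model *sk;		/* models skb->sk */
};

enum netns_source { NETNS_FROM_DEV, NETNS_FROM_SK, NETNS_UNKNOWN };

/*
 * Pick the network namespace for an skb: prefer the device, fall back
 * to the owning socket. NETNS_UNKNOWN models the case where the real
 * code emits the warning and uses the standard dissector.
 */
static enum netns_source skb_netns_source(const struct skb_model *skb,
					  struct netns_model **out)
{
	assert(skb != NULL && out != NULL);

	if (skb->dev) {
		*out = skb->dev->net;
		return NETNS_FROM_DEV;
	}
	if (skb->sk) {
		*out = skb->sk->net;
		return NETNS_FROM_SK;
	}
	*out = NULL;
	return NETNS_UNKNOWN;
}
```

In this model, setting skb.dev before calling skb_netns_source()
corresponds to moving the skb->dev assignment earlier, while setting
skb.sk corresponds to the skb_set_owner_w() idea.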
^ permalink raw reply
* Re: 4.20-rc6: WARNING: CPU: 30 PID: 197360 at net/core/flow_dissector.c:764 __skb_flow_dissect
From: Willem de Bruijn @ 2018-12-20 14:34 UTC (permalink / raw)
To: Ido Schimmel
Cc: Willem de Bruijn, Michael S Tsirkin, Network Development,
linux-kernel@vger.kernel.org,
virtualization@lists.linux-foundation.org
In-Reply-To: <20181220141609.GA861@splinter>
On Thu, Dec 20, 2018 at 9:16 AM Ido Schimmel <idosch@idosch.org> wrote:
>
> On Thu, Dec 20, 2018 at 09:04:25AM -0500, Willem de Bruijn wrote:
> > On Thu, Dec 20, 2018 at 6:15 AM Ido Schimmel <idosch@idosch.org> wrote:
> > >
> > > +Willem
> > >
> > > On Thu, Dec 20, 2018 at 08:45:40AM +0100, Christian Borntraeger wrote:
> > > > Folks,
> > > >
> > > > I got this warning today. I can't tell when or why this happened, so I do not yet know how to reproduce it.
> > > > Maybe someone has a quick idea.
> > > >
> > > > [85109.572032] WARNING: CPU: 30 PID: 197360 at net/core/flow_dissector.c:764 __skb_flow_dissect+0x1f0/0x1318
> > >
> > > I managed to trigger this warning as well the other day, but from a
> > > different call path:
> > >
> > > [280155.348610] fib_multipath_hash+0x28c/0x2d0
> > > [280155.348613] ? fib_multipath_hash+0x28c/0x2d0
> > > [280155.348619] fib_select_path+0x241/0x32f
> > > [280155.348622] ? __fib_lookup+0x6a/0xb0
> > > [280155.348626] ip_route_output_key_hash_rcu+0x650/0xa30
> > > [280155.348631] ? __alloc_skb+0x9b/0x1d0
> > > [280155.348634] inet_rtm_getroute+0x3f7/0xb80
> >
> > inet_rtm_getroute builds a new packet with inet_rtm_getroute_build_skb
> > here without dev or sk.
>
> Ack
>
> >
> > > Problem is the synthesized skb for output route resolution does not have
> > > skb->dev or skb->sk set. When a multipath route is hit and
> > > net.ipv4.fib_multipath_hash_policy is set the flow dissector is called
> > > with this skb and the warning is triggered.
> > >
> > > I plan to fix it by setting skb->dev to net->loopback_dev.
> >
> > The device can be chosen based on iif in inet_rtm_getroute? A first
> > thought, I don't know this code very well.
>
> Yes, but iif is for input routes. I'm talking about output routes.
>
> > Let me know if you want me to take a stab at that patch. IPv6 probably
> > will need the same.
>
> Yes, I'll try it now and post later today if everything is OK. IPv6 is
> using flow info and not an skb, so no problem there. I also checked
> other getroute implementations and none of them call into the flow
> dissector with an skb, so I think we're fine.
>
> >
> > > I assume we
> > > want to keep this warning to prevent call paths which will otherwise
> > > silently fallback to standard flow dissector instead of the BPF one.
> >
> > Indeed, the warning is there to sniff out paths that do not follow
> > what I thought was an invariant. If there are too many exceptions, I
> > may have to revisit that assumption. But for now, let's see if we can
> > address these edge cases.
>
> Ack
>
> >
> > > I'm not familiar with tap code, so someone else will need to patch this
> > > case, but it looks like:
> > >
> > > tap_sendmsg()
> > > tap_get_user()
> > > skb_probe_transport_header()
> > > skb_flow_dissect_flow_keys_basic()
> > > __skb_flow_dissect()
> > >
> > > skb->dev is only set later in the code.
> >
> > tap_get_user uses sock_alloc_send_pskb (through tap_alloc_skb) to
> > allocate the skb. So skb->sk should be set at the time of
> > skb_probe_transport_header. I'm not sure how this path triggers the
> > warning.
>
> Maybe it's:
>
> tap_sendmsg()
> tap_get_user_xdp()
> build_skb()
> skb_probe_transport_header()
> skb_flow_dissect_flow_keys_basic()
> __skb_flow_dissect()
Oh, indeed. I completely overlooked that path.
I will call skb_set_owner_w there and will audit the other users of build_skb.
^ permalink raw reply
* Re: 4.20-rc6: WARNING: CPU: 30 PID: 197360 at net/core/flow_dissector.c:764 __skb_flow_dissect
From: Christian Borntraeger @ 2018-12-20 14:09 UTC (permalink / raw)
To: Ido Schimmel, willemb
Cc: netdev, virtualization@lists.linux-foundation.org,
linux-kernel@vger.kernel.org, Michael S Tsirkin
In-Reply-To: <20181220091207.GA25942@splinter>
On 20.12.2018 10:12, Ido Schimmel wrote:
> +Willem
>
> On Thu, Dec 20, 2018 at 08:45:40AM +0100, Christian Borntraeger wrote:
>> Folks,
>>
>> I got this warning today. I can't tell when or why this happened, so I do not yet know how to reproduce it.
>> Maybe someone has a quick idea.
>>
>> [85109.572032] WARNING: CPU: 30 PID: 197360 at net/core/flow_dissector.c:764 __skb_flow_dissect+0x1f0/0x1318
>
> I managed to trigger this warning as well the other day, but from a
> different call path:
FWIW, it also seems to happen on 4.20-rc1. 4.19.0 seems fine. Bisect seems to have failed, so
my reproducer is not reliable.
>
> [280155.348526] WARNING: CPU: 1 PID: 24819 at net/core/flow_dissector.c:764 __skb_flow_dissect+0x314/0x16b0
> [280155.348529] Modules linked in: dummy vrf intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul leds_mlxreg i2c_mux_reg i2c_mlxcpld crc32_pclmul mlxreg_hotplug mlxreg_io i2c_mux ghash_clmulni_intel iTCO_wdt gpio_ich iTCO_vendor_support mlx_platform aesni_intel aes_x86_64 crypto_simd cryptd glue_helper intel_cstate mac_hid lpc_ich ip_tables x_tables autofs4 mlxsw_spectrum mlxfw vxlan ip6_udp_tunnel udp_tunnel ip6_tunnel tunnel6 objagg psample parman bridge stp llc mlxsw_pci igb ahci mlxsw_core dca i2c_algo_bit libahci devlink i2c_ismt
> [280155.348570] CPU: 1 PID: 24819 Comm: ip Not tainted 4.20.0-rc6-nn181213 #1
> [280155.348572] Hardware name: Mellanox Technologies Ltd. MSN2100/SA001390, BIOS 5.6.5 06/07/2016
> [280155.348576] RIP: 0010:__skb_flow_dissect+0x314/0x16b0
> [280155.348579] Code: 85 19 0e 00 00 45 0f b7 6c 24 04 41 0f b7 44 24 06 4d 01 fd 48 85 db 4d 8d 14 07 74 0f 48 8b 43 18 48 85 c0 0f 85 e5 02 00 00 <0f> 0b 41 f6 04 24 40 0f 85 a4 02 00 00 c7 85 30 ff ff ff 00 00 00
> [280155.348581] RSP: 0018:ffffa0df41fdf650 EFLAGS: 00010246
> [280155.348584] RAX: 0000000000000000 RBX: ffff8bcded232000 RCX: 0000000000000000
> [280155.348586] RDX: ffffa0df41fdf7e0 RSI: ffffffff98e415a0 RDI: ffff8bcded232000
> [280155.348588] RBP: ffffa0df41fdf760 R08: 0000000000000000 R09: 0000000000000000
> [280155.348590] R10: ffffa0df41fdf7e8 R11: ffff8bcdf27a3000 R12: ffffffff98e415a0
> [280155.348591] R13: ffffa0df41fdf7e0 R14: ffffffff98dd2980 R15: ffffa0df41fdf7e0
> [280155.348594] FS: 00007f46f6897680(0000) GS:ffff8bcdf7a80000(0000) knlGS:0000000000000000
> [280155.348596] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [280155.348598] CR2: 000055933e95f9a0 CR3: 000000021e636000 CR4: 00000000001006e0
> [280155.348600] Call Trace:
> [280155.348610] fib_multipath_hash+0x28c/0x2d0
> [280155.348613] ? fib_multipath_hash+0x28c/0x2d0
> [280155.348619] fib_select_path+0x241/0x32f
> [280155.348622] ? __fib_lookup+0x6a/0xb0
> [280155.348626] ip_route_output_key_hash_rcu+0x650/0xa30
> [280155.348631] ? __alloc_skb+0x9b/0x1d0
> [280155.348634] inet_rtm_getroute+0x3f7/0xb80
> [280155.348640] ? __alloc_pages_nodemask+0x11c/0x2c0
> [280155.348646] rtnetlink_rcv_msg+0x1d9/0x2f0
> [280155.348650] ? rtnl_calcit.isra.24+0x120/0x120
> [280155.348654] netlink_rcv_skb+0x54/0x130
> [280155.348657] rtnetlink_rcv+0x15/0x20
> [280155.348661] netlink_unicast+0x20a/0x2c0
> [280155.348664] netlink_sendmsg+0x2d1/0x3d0
> [280155.348669] sock_sendmsg+0x39/0x50
> [280155.348672] ___sys_sendmsg+0x2a0/0x2f0
> [280155.348677] ? filemap_map_pages+0x16b/0x360
> [280155.348682] ? __handle_mm_fault+0x108e/0x13d0
> [280155.348686] __sys_sendmsg+0x63/0xa0
> [280155.348688] ? __sys_sendmsg+0x63/0xa0
> [280155.348692] __x64_sys_sendmsg+0x1f/0x30
> [280155.348697] do_syscall_64+0x5a/0x120
> [280155.348701] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [280155.348704] RIP: 0033:0x7f46f5b80d04
> [280155.348707] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 48 8d 05 01 dc 2c 00 8b 00 85 c0 75 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 41 89 d4 53 48 89 f5
> [280155.348709] RSP: 002b:00007fff82d62778 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
> [280155.348712] RAX: ffffffffffffffda RBX: 000000005c1900ae RCX: 00007f46f5b80d04
> [280155.348713] RDX: 0000000000000000 RSI: 00007fff82d627e0 RDI: 0000000000000003
> [280155.348715] RBP: 00007fff82d628d8 R08: 0000000000000001 R09: 0000000000000000
> [280155.348717] R10: 00007f46f5bfccc0 R11: 0000000000000246 R12: 0000000000000001
> [280155.348718] R13: 000055933eb90020 R14: 0000000000000000 R15: 00007fff82d63030
> [280155.348722] ---[ end trace e14023d76a175374 ]---
>
> Problem is the synthesized skb for output route resolution does not have
> skb->dev or skb->sk set. When a multipath route is hit and
> net.ipv4.fib_multipath_hash_policy is set the flow dissector is called
> with this skb and the warning is triggered.
>
> I plan to fix it by setting skb->dev to net->loopback_dev. I assume we
> want to keep this warning to prevent call paths which will otherwise
> silently fallback to standard flow dissector instead of the BPF one.
>
>> [85109.572036] Modules linked in: vhost_net vhost macvtap macvlan tap vfio_ap vfio_mdev mdev vfio_iommu_type1 vfio kvm xt_CHECKSUM ipt_MASQUERADE tun bridge stp llc xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ip6table_filter ip6_tables iptable_filter dm_service_time rpcrdma sunrpc rdma_ucm rdma_cm configfs iw_cm ib_cm mlx4_ib ib_uverbs ib_core pkey ghash_s390 prng aes_s390 des_s390 des_generic sha512_s390 sha1_s390 zcrypt_cex4 zcrypt rng_core eadm_sch sch_fq_codel ip_tables x_tables mlx4_en mlx4_core sha256_s390 sha_common dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_mirror dm_region_hash dm_log dm_mod autofs4
>> [85109.572072] CPU: 30 PID: 197360 Comm: vhost-197330 Not tainted 4.20.0-20181213.rc6.git0.407d079170c1.300.fc29.s390x #1
>> [85109.572074] Hardware name: IBM 2964 NC9 712 (LPAR)
>> [85109.572075] Krnl PSW : 0704c00180000000 000000000092e320 (__skb_flow_dissect+0x1f0/0x1318)
>> [85109.572078] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
>> [85109.572080] Krnl GPRS: 000000000000002a 0000000000000000 000003e0385bfc84 0000000000d91e30
>> [85109.572081] 000003e0385bfc84 0000000000000000 0000000000000000 0000000000d91e30
>> [85109.572083] 000003e0385bfc84 000000000000000e 0000007e3eb88100 0000007ff3561e00
>> [85109.572084] 0000000000000806 0000000000b4f288 000003e0385bfbb8 000003e0385bfab0
>> [85109.572115] Krnl Code: 000000000092e312: e310b0180002 ltg %r1,24(%r11)
>> 000000000092e318: a7740271 brc 7,92e7fa
>> #000000000092e31c: a7f40001 brc 15,92e31e
>> >000000000092e320: 91407003 tm 3(%r7),64
>> 000000000092e324: a7740257 brc 7,92e7d2
>> 000000000092e328: 5810f0b4 l %r1,180(%r15)
>> 000000000092e32c: e54cf0c80000 mvhi 200(%r15),0
>> 000000000092e332: c01b00000008 nilf %r1,8
>> [85109.572129] Call Trace:
>> [85109.572130] ([<0000000000000000>] (null))
>> [85109.572134] [<000003ff800c81e4>] tap_sendmsg+0x384/0x430 [tap]
>
> I'm not familiar with tap code, so someone else will need to patch this
> case, but it looks like:
>
> tap_sendmsg()
> tap_get_user()
> skb_probe_transport_header()
> skb_flow_dissect_flow_keys_basic()
> __skb_flow_dissect()
>
> skb->dev is only set later in the code.
>
>> [85109.572137] [<000003ff801acdee>] vhost_tx_batch.isra.10+0x66/0xe0 [vhost_net]
>> [85109.572138] [<000003ff801ad61c>] handle_tx_copy+0x18c/0x568 [vhost_net]
>> [85109.572140] [<000003ff801adab4>] handle_tx+0xbc/0x100 [vhost_net]
>> [85109.572145] [<000003ff8019bbe8>] vhost_worker+0xc8/0x128 [vhost]
>> [85109.572148] [<00000000001690b8>] kthread+0x140/0x160
>> [85109.572152] [<0000000000a84266>] kernel_thread_starter+0x6/0x10
>> [85109.572154] [<0000000000a84260>] kernel_thread_starter+0x0/0x10
>> [85109.572155] Last Breaking-Event-Address:
>> [85109.572156] [<000000000092e31c>] __skb_flow_dissect+0x1ec/0x1318
>> [85109.572158] ---[ end trace 97c040a6691bc000 ]---
>
^ permalink raw reply
* Re: 4.20-rc6: WARNING: CPU: 30 PID: 197360 at net/core/flow_dissector.c:764 __skb_flow_dissect
From: Willem de Bruijn @ 2018-12-20 14:04 UTC (permalink / raw)
To: Ido Schimmel
Cc: Willem de Bruijn, Michael S Tsirkin, Network Development,
linux-kernel@vger.kernel.org,
virtualization@lists.linux-foundation.org
In-Reply-To: <20181220091207.GA25942@splinter>
On Thu, Dec 20, 2018 at 6:15 AM Ido Schimmel <idosch@idosch.org> wrote:
>
> +Willem
>
> On Thu, Dec 20, 2018 at 08:45:40AM +0100, Christian Borntraeger wrote:
> > Folks,
> >
> > I got this warning today. I can't tell when or why this happened, so I do not yet know how to reproduce it.
> > Maybe someone has a quick idea.
> >
> > [85109.572032] WARNING: CPU: 30 PID: 197360 at net/core/flow_dissector.c:764 __skb_flow_dissect+0x1f0/0x1318
>
> I managed to trigger this warning as well the other day, but from a
> different call path:
>
> [280155.348610] fib_multipath_hash+0x28c/0x2d0
> [280155.348613] ? fib_multipath_hash+0x28c/0x2d0
> [280155.348619] fib_select_path+0x241/0x32f
> [280155.348622] ? __fib_lookup+0x6a/0xb0
> [280155.348626] ip_route_output_key_hash_rcu+0x650/0xa30
> [280155.348631] ? __alloc_skb+0x9b/0x1d0
> [280155.348634] inet_rtm_getroute+0x3f7/0xb80
inet_rtm_getroute builds a new packet with inet_rtm_getroute_build_skb
here without dev or sk.
> Problem is the synthesized skb for output route resolution does not have
> skb->dev or skb->sk set. When a multipath route is hit and
> net.ipv4.fib_multipath_hash_policy is set the flow dissector is called
> with this skb and the warning is triggered.
>
> I plan to fix it by setting skb->dev to net->loopback_dev.
The device can be chosen based on iif in inet_rtm_getroute? A first
thought, I don't know this code very well.
Let me know if you want me to take a stab at that patch. IPv6 probably
will need the same.
> I assume we
> want to keep this warning to prevent call paths which will otherwise
> silently fallback to standard flow dissector instead of the BPF one.
Indeed, the warning is there to sniff out paths that do not follow
what I thought was an invariant. If there are too many exceptions, I
may have to revisit that assumption. But for now, let's see if we can
address these edge cases.
> I'm not familiar with tap code, so someone else will need to patch this
> case, but it looks like:
>
> tap_sendmsg()
> tap_get_user()
> skb_probe_transport_header()
> skb_flow_dissect_flow_keys_basic()
> __skb_flow_dissect()
>
> skb->dev is only set later in the code.
tap_get_user uses sock_alloc_send_pskb (through tap_alloc_skb) to
allocate the skb. So skb->sk should be set at the time of
skb_probe_transport_header. I'm not sure how this path triggers the
warning.
^ permalink raw reply
* Re: [PATCH v3] drm/bochs: add edid present check
From: Daniel Vetter @ 2018-12-20 10:51 UTC (permalink / raw)
To: Gerd Hoffmann
Cc: andr2000, daniel.vetter, open list, dri-devel,
open list:DRM DRIVER FOR BOCHS VIRTUAL GPU, David Airlie
In-Reply-To: <20181220101122.16153-1-kraxel@redhat.com>
On Thu, Dec 20, 2018 at 11:11:21AM +0100, Gerd Hoffmann wrote:
> Check header before trying to read the complete edid blob, to avoid the
> log being spammed in case qemu has no edid support (old qemu or edid
> support turned off).
>
> Fixes: 01f23459cf drm/bochs: add edid support.
> Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> ---
> drivers/gpu/drm/bochs/bochs_hw.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/bochs/bochs_hw.c b/drivers/gpu/drm/bochs/bochs_hw.c
> index c90a0d492f..d0b4e1cee8 100644
> --- a/drivers/gpu/drm/bochs/bochs_hw.c
> +++ b/drivers/gpu/drm/bochs/bochs_hw.c
> @@ -86,9 +86,16 @@ static int bochs_get_edid_block(void *data, u8 *buf,
>
> int bochs_hw_load_edid(struct bochs_device *bochs)
> {
> + u8 header[8];
> +
> if (!bochs->mmio)
> return -1;
>
> + /* check header to detect whenever edid support is enabled in qemu */
> + bochs_get_edid_block(bochs, header, 0, ARRAY_SIZE(header));
> + if (drm_edid_header_is_valid(header) != 8)
> + return -1;
> +
> kfree(bochs->edid);
> bochs->edid = drm_do_get_edid(&bochs->connector,
> bochs_get_edid_block, bochs);
> --
> 2.9.3
>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
^ permalink raw reply
* [PATCH v3] drm/bochs: add edid present check
From: Gerd Hoffmann @ 2018-12-20 10:11 UTC (permalink / raw)
To: dri-devel
Cc: andr2000, daniel.vetter, open list,
open list:DRM DRIVER FOR BOCHS VIRTUAL GPU, David Airlie
Check header before trying to read the complete edid blob, to avoid the
log being spammed in case qemu has no edid support (old qemu or edid
support turned off).
Fixes: 01f23459cf drm/bochs: add edid support.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
---
drivers/gpu/drm/bochs/bochs_hw.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/gpu/drm/bochs/bochs_hw.c b/drivers/gpu/drm/bochs/bochs_hw.c
index c90a0d492f..d0b4e1cee8 100644
--- a/drivers/gpu/drm/bochs/bochs_hw.c
+++ b/drivers/gpu/drm/bochs/bochs_hw.c
@@ -86,9 +86,16 @@ static int bochs_get_edid_block(void *data, u8 *buf,
int bochs_hw_load_edid(struct bochs_device *bochs)
{
+ u8 header[8];
+
if (!bochs->mmio)
return -1;
+ /* check header to detect whether edid support is enabled in qemu */
+ bochs_get_edid_block(bochs, header, 0, ARRAY_SIZE(header));
+ if (drm_edid_header_is_valid(header) != 8)
+ return -1;
+
kfree(bochs->edid);
bochs->edid = drm_do_get_edid(&bochs->connector,
bochs_get_edid_block, bochs);
--
2.9.3
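
For readers following the thread, here is a minimal userspace sketch of the probe-then-load flow the patch describes: read the first 8 bytes, compare them against the fixed EDID header pattern, and skip the full read when they do not match. The names `fake_dev`, `fake_read_block`, `fake_load_edid` and `header_score` are hypothetical stand-ins for illustration only; they are not the driver's functions, and the byte buffer merely simulates the MMIO window.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fixed 8-byte pattern that starts every EDID block. */
static const uint8_t edid_header[8] = {
	0x00, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00
};

/* Hypothetical device: 'regs' stands in for the MMIO window. */
struct fake_dev {
	const uint8_t *regs;
	size_t regs_len;
};

/* Count matching header bytes (0..8); 8 means a complete match. */
static int header_score(const uint8_t *raw)
{
	int score = 0;

	for (size_t i = 0; i < sizeof(edid_header); i++)
		if (raw[i] == edid_header[i])
			score++;
	return score;
}

/* Copy 'len' bytes starting at 'start'; -1 if the range is out of bounds. */
static int fake_read_block(const struct fake_dev *dev, uint8_t *buf,
			   size_t start, size_t len)
{
	if (start > dev->regs_len || len > dev->regs_len - start)
		return -1;
	memcpy(buf, dev->regs + start, len);
	return 0;
}

/* Probe the header first; read the 128-byte base block only on a match. */
static int fake_load_edid(const struct fake_dev *dev, uint8_t out[128])
{
	uint8_t header[8];

	if (fake_read_block(dev, header, 0, sizeof(header)) != 0)
		return -1;
	if (header_score(header) != 8)
		return -1;	/* no EDID present: bail out quietly */
	return fake_read_block(dev, out, 0, 128);
}
```

The early return is the point of the change: when the header is absent, the caller never enters the full EDID read path, so that path has no chance to log errors.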
* Re: [PATCH v2] drm/bochs: add edid present check
From: Daniel Vetter @ 2018-12-20 8:30 UTC (permalink / raw)
To: Gerd Hoffmann
Cc: andr2000, David Airlie, open list, dri-devel,
open list:DRM DRIVER FOR BOCHS VIRTUAL GPU
In-Reply-To: <20181220082826.GE21184@phenom.ffwll.local>
On Thu, Dec 20, 2018 at 09:28:26AM +0100, Daniel Vetter wrote:
> On Thu, Dec 20, 2018 at 07:50:01AM +0100, Gerd Hoffmann wrote:
> > Check first two header bytes before trying to read the edid blob,
> > to avoid the log being spammed in case qemu has no edid support (old
> > qemu or edid turned off).
> >
> > Fixes: 01f23459cf drm/bochs: add edid support.
> > Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
>
> It's a bit of a hack, but makes sense.
On 2nd thought, maybe make it less of a hack by reading all 8 bytes of the
header and checking it with drm_edid_header_is_valid().
-Daniel
>
> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> > ---
> > drivers/gpu/drm/bochs/bochs_hw.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/bochs/bochs_hw.c b/drivers/gpu/drm/bochs/bochs_hw.c
> > index c90a0d492f..e1f8ffce00 100644
> > --- a/drivers/gpu/drm/bochs/bochs_hw.c
> > +++ b/drivers/gpu/drm/bochs/bochs_hw.c
> > @@ -89,6 +89,14 @@ int bochs_hw_load_edid(struct bochs_device *bochs)
> > if (!bochs->mmio)
> > return -1;
> >
> > + /*
> > + * Check first two EDID blob header bytes to figure out whether
> > + * edid support is enabled in qemu.
> > + */
> > + if (readb(bochs->mmio + 0) != 0x00 ||
> > + readb(bochs->mmio + 1) != 0xff)
> > + return -1;
> > +
> > kfree(bochs->edid);
> > bochs->edid = drm_do_get_edid(&bochs->connector,
> > bochs_get_edid_block, bochs);
> > --
> > 2.9.3
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
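
As an illustration of the suggestion above, the following sketch contrasts a two-byte probe with a check of the full 8-byte EDID header. It is a standalone userspace example under stated assumptions: `probe_two_bytes` and `probe_full_header` are hypothetical names, and the input is a plain byte buffer rather than the device's MMIO region.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Fixed 8-byte pattern that starts every EDID block. */
static const uint8_t edid_header[8] = {
	0x00, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00
};

/* v2-style probe: looks at the first two bytes only. */
static bool probe_two_bytes(const uint8_t *raw)
{
	return raw[0] == 0x00 && raw[1] == 0xff;
}

/* Full probe: all eight header bytes must match. */
static bool probe_full_header(const uint8_t *raw)
{
	return memcmp(raw, edid_header, sizeof(edid_header)) == 0;
}
```

A buffer that happens to begin with 00 ff but carries unrelated data afterwards passes the two-byte probe and fails the full one, which is why checking the whole header is the less hacky test.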