To: "Enrico Weigelt, metux IT consult", Jing Zhang, KVM, KVMARM,
    LinuxMIPS, KVMPPC, LinuxS390, Linuxkselftest, Marc Zyngier,
    James Morse, Julien Thierry, Suzuki K Poulose, Will Deacon,
    Huacai Chen, Aleksandar Markovic, Thomas Bogendoerfer,
    Paul Mackerras, Christian Borntraeger, Janosch Frank,
    David Hildenbrand, Cornelia Huck, Claudio Imbrenda,
    Sean Christopherson, Vitaly Kuznetsov, Jim Mattson, Peter Shier,
    Oliver Upton, David Rientjes, Emanuele Giuseppe Esposito,
    David Matlack, Ricardo Koller, Krish Sadhukhan, Fuad Tabba
References: <20210614212155.1670777-1-jingzhangos@google.com>
From: Paolo Bonzini
Subject: Re: [PATCH v9 0/5] KVM statistics data fd-based binary interface
Date: Tue, 15 Jun 2021 13:31:10 +0200
X-Mailing-List: linux-mips@vger.kernel.org

On 15/06/21 10:37, Enrico Weigelt, metux IT consult wrote:
> * why is it binary instead of text ? is it so very high volume that
>   it really matters ?

The main reason to have a binary format is not the high volume actually
(though it also has its part). Rather, we would really like to include
the schema to make the statistics self-describing. This includes stuff
like whether the unit of measure of a statistic is clock cycles,
nanoseconds, pages or whatnot; having this kind of information in text
leads to awkwardness in the parsers. trace-cmd is another example where
the data consists of a schema followed by binary data.
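
To make the "schema followed by binary data" idea concrete, here is a
minimal sketch of a self-describing blob: descriptors (name and unit)
come first, and the raw values follow. The byte layout, the unit codes
and the names encode_stats/decode_stats are assumptions invented for
this illustration; this is not the layout used by the KVM patches.

```python
import struct

# Hypothetical unit codes, made up for this sketch.
UNITS = {0: "none", 1: "cycles", 2: "nanoseconds", 3: "pages"}

# Hypothetical little-endian layout:
#   u32 count
#   count descriptors: u8 unit code, u8 name length, name bytes
#   count values: one u64 each, in descriptor order


def encode_stats(stats):
    """Pack [(name, unit_code, value), ...] as schema followed by data."""
    out = [struct.pack("<I", len(stats))]
    for name, unit, _ in stats:
        raw = name.encode("ascii")
        out.append(struct.pack("<BB", unit, len(raw)) + raw)
    for _, _, value in stats:
        out.append(struct.pack("<Q", value))
    return b"".join(out)


def decode_stats(blob):
    """Read the schema first, then use it to interpret the values."""
    (count,) = struct.unpack_from("<I", blob, 0)
    offset = 4
    schema = []
    for _ in range(count):
        unit, name_len = struct.unpack_from("<BB", blob, offset)
        offset += 2
        name = blob[offset:offset + name_len].decode("ascii")
        offset += name_len
        schema.append((name, UNITS.get(unit, "unknown")))
    values = struct.unpack_from(f"<{count}Q", blob, offset)
    return {name: (value, unit) for (name, unit), value in zip(schema, values)}
```

A reader needs no prior knowledge of which statistics exist: calling
decode_stats(encode_stats([("wait_time", 2, 1500)])) recovers the name,
the value and the unit from the blob alone, which is the property that a
plain text dump makes awkward to parse.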

Text format could certainly be added if there's a use case, but for
developer use debugfs is usually a suitable replacement.

Last year we tried the opposite direction: we built a
one-value-per-file filesystem with a common API that any subsystem
could use (e.g. providing ethtool stats, /proc/interrupts, etc. in
addition to KVM stats). We started with text, similar to sysfs, with
the plan of extending it to a binary format later. However, other
subsystems expressed very little interest in this, so instead we
decided to go with something that is designed around KVM needs.

Still, the binary format that KVM uses is designed not to be
KVM-specific. If other subsystems want to publish high-volume,
self-describing statistic information, they are welcome to share the
binary format and the code. Perhaps it may even make sense in some
cases to have them in sysfs (e.g. /sys/kernel/slab/*/.stats). As Greg
said, sysfs is currently one value per file, but perhaps that could be
changed if the binary format is an additional way to access the
information and not the only one (not that I'm planning to do it).

> * how will possible future extensions of the telemetry packets work ?

The format includes a schema, so it's possible to add more statistics
in the future. The exact list of statistics varies per architecture
and is not part of the userspace API (obvious caveat:
https://xkcd.com/1172/).

> * aren't there other means to get this fd instead of an ioctl() on the
>   VM fd ? something more from the outside (eg. sysfs/procfs)

Not yet, but if there's a need it can be added. It'd be plausible to
publish system-wide statistics via an ioctl on /dev/kvm, for example.
We'd have to check how this compares with stuff that is world-readable
in procfs and sysfs, but I don't think there are security concerns in
exposing that.

There's also pidfd_getfd(2), which can be used to pull a VM file
descriptor from another running process. That can be used to avoid the
issue of KVM file descriptors being unnamed.

> * how will that relate to other hypervisors ?

Other hypervisors do not run as part of the Linux kernel (at least they
are not upstream). These statistics only apply to Linux *hosts*, not
guests.

As far as I know, there is no standard that Xen or the proprietary
hypervisors use to communicate their telemetry info to monitoring
tools, and also no standard binary format used by exporters to talk to
monitoring tools. If this format is adopted by other hypervisors or
any random software, I will be happy.

> Some notes from the operating perspective:
>
> In typical datacenters we've got various monitoring tools that are able
> to catch up lots of data from different sources (especially files). If
> an operator e.g. is interested in something in happening in some file
> (e.g. in /proc of /sys), it's quite trivial - just configure yet another
> probe (maybe some regex for parsing) and done. Automatically fed in his
> $monitoring_solution (e.g. nagios, ELK, Splunk, whatsnot)

... but in practice what you do is you have prebuilt exporters that
talk to $monitoring_solution. Monitoring individual files is the
exception, not the rule.

But indeed Libvirt already has I/O and network statistics and there is
an exporter for Prometheus, so we should add support for this new
method as well to both QEMU (exporting the file descriptor) and
Libvirt.

I hope this helps clarify your doubts!

Paolo

> With your approach, it's not that simple: now the operator needs to
> create (and deploy and manage) a separate agent that somehow receives
> that fd from the VMM, reads and parses that specific binary stream
> and finally pushes it into the monitoring infrastructure. Or the VMM
> writes it into some file, where some monitoring agent can pick it up.
> In any case, not actually trivial from ops perspective.