* Re: kdbus: add documentation
From: Florian Weimer @ 2014-11-30 8:56 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: arnd, ebiederm, gnomes, teg, jkosina, luto, linux-api,
linux-kernel, daniel, dh.herrmann, tixxdz
In-Reply-To: <1416546149-24799-2-git-send-email-gregkh@linuxfoundation.org>
* Greg Kroah-Hartman:
> +The focus of this document is an overview of the low-level, native kernel D-Bus
> +transport called kdbus. Kdbus exposes its functionality via files in a
> +filesystem called 'kdbusfs'. All communication between processes takes place
> +via ioctls on files exposed through the mount point of a kdbusfs. The default
> +mount point of kdbusfs is /sys/fs/kdbus.
Does this mean the bus does not enforce the correctness of the D-Bus
introspection metadata? That's really unfortunate. Classic D-Bus
does not do this, either, and combined with the variety of approaches
used to implement D-Bus endpoints, it makes it really difficult to
figure out what D-Bus services, exactly, a process provides.
^ permalink raw reply
* Re: kdbus: add documentation
From: Florian Weimer @ 2014-11-30 9:02 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: arnd-r2nGTMty4D4, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io, teg-B22kvLQNl6c,
jkosina-AlSwsSmVLrQ, luto-kltTT9wpgjJwATOyAt5JVQ,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
daniel-cYrQPVfZoowdnm+yROfE0A, dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
tixxdz-Umm1ozX2/EEdnm+yROfE0A
In-Reply-To: <1416546149-24799-2-git-send-email-gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>
* Greg Kroah-Hartman:
> +7.4 Receiving messages
> +Also, if the connection allowed for file descriptor to be passed
> +(KDBUS_HELLO_ACCEPT_FD), and if the message contained any, they will be
> +installed into the receiving process after the KDBUS_CMD_MSG_RECV ioctl
> +returns. The receiving task is obliged to close all of them appropriately.
What happens if this is not possible because the file descriptor limit
of the processes would be exceeded? EMFILE, and the message will not
be received?
^ permalink raw reply
* Re: kdbus: add documentation
From: Florian Weimer @ 2014-11-30 9:08 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Greg Kroah-Hartman, Arnd Bergmann, Eric W. Biederman,
One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Linux API,
linux-kernel@vger.kernel.org, Daniel Mack, David Herrmann,
Djalal Harouni
In-Reply-To: <CALCETrWUekPq6pquuiVperBihQHgcvCF=WzVJTZziHhUJYsh4Q@mail.gmail.com>
* Andy Lutomirski:
> At the risk of opening a can of worms, wouldn't this be much more
> useful if you could share a pool between multiple connections?
They would also be useful to reduce context switches when receiving
data from all kinds of descriptors. At present, when polling, you
receive notification, and then you have to call into the kernel,
again, to actually fetch the data and associated information. The
kernel could also queue the data for one specific recipient,
addressing the same issue that SO_REUSEPORT tries to solve (on poll
notification, the kernel does not know which recipient will eventually
retrieve the data, so it has to notify and wake up all of them).
^ permalink raw reply
* Re: [PATCH 5/7] KVM: arm64: guest debug, add support for single-step
From: Christoffer Dall @ 2014-11-30 10:10 UTC (permalink / raw)
To: Peter Maydell
Cc: Alex Bennée, kvm-devel, arm-mail-list,
kvmarm-FPEHb7Xf0XXUo1n7N8X6UoWGPAHP3yOg@public.gmane.org,
Marc Zyngier, Alexander Graf, J. Kiszka, David Hildenbrand,
Bharat Bhushan, bp-l3A5Bk7waGM, Paolo Bonzini, Gleb Natapov,
Russell King, Catalin Marinas, Will Deacon, Lorenzo Pieralisi,
open list, open list:ABI/API
In-Reply-To: <CAFEAcA-cjM9yUXi6tc79UP0fBKAagtFKQgV3iAjz5DWr9yxZUQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Wed, Nov 26, 2014 at 07:27:06PM +0000, Peter Maydell wrote:
> On 25 November 2014 at 16:10, Alex Bennée <alex.bennee-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
> > This adds support for single-stepping the guest. As userspace can and
> > will manipulate guest registers before restarting any tweaking of the
> > registers has to occur just before control is passed back to the guest.
> > Furthermore while guest debugging is in effect we need to squash the
> > ability of the guest to single-step itself as we have no easy way of
> > re-entering the guest after the exception has been delivered to the
> > hypervisor.
>
> A corner case I don't think this patch handles: if the debugger
> tries to single step an insn which is emulated by the
> hypervisor (because it's a load/store which is trapped and
> handled as emulated mmio in userspace) then we won't
> correctly update the single-step state machine (and so we'll end
> up incorrectly stopping after the following insn rather than
> before, I think).
>
> You should be able to achieve this effect by simply always clearing
> the guest's PSTATE.SS when you advance the PC to skip the emulated
> instruction (cf the comment in the pseudocode SSAdvance() function).
>
> I think we should also be doing this PC advance on return from
> userspace's handling of the mmio rather than before we drop back
> to userspace as we do now, but I can't remember why I think that.
> Christoffer, I don't suppose you recall, do you? I think it was
> you I had this conversation with on IRC a month or so back...
>
I don't remember clearly, no. Was it not during lunch at LCU we had
this conversation?
In any case, I think it was related to how userspace observes the state
of the CPU, because when you do the MMIO operation emulation in
userspace, currently if you observe the PC though GET_ONE_REG, you'll
see a PC pointing to the next instruction, not the one you're emulating
which is strange.
Not sure what the relation to a guest single-stepping itself was.
-Christoffer
^ permalink raw reply
* Re: [PATCH 5/7] KVM: arm64: guest debug, add support for single-step
From: Peter Maydell @ 2014-11-30 10:20 UTC (permalink / raw)
To: Christoffer Dall
Cc: Alex Bennée, kvm-devel, arm-mail-list,
kvmarm-FPEHb7Xf0XXUo1n7N8X6UoWGPAHP3yOg@public.gmane.org,
Marc Zyngier, Alexander Graf, J. Kiszka, David Hildenbrand,
Bharat Bhushan, bp-l3A5Bk7waGM, Paolo Bonzini, Gleb Natapov,
Russell King, Catalin Marinas, Will Deacon, Lorenzo Pieralisi,
open list, open list:ABI/API
In-Reply-To: <20141130101056.GB484@cbox>
On 30 November 2014 at 10:10, Christoffer Dall
<christoffer.dall-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
> In any case, I think it was related to how userspace observes the state
> of the CPU, because when you do the MMIO operation emulation in
> userspace, currently if you observe the PC though GET_ONE_REG, you'll
> see a PC pointing to the next instruction, not the one you're emulating
> which is strange.
Also if we ever add support for userspace to say "this MMIO should
fault" then we definitely need the PC-advance to happen afterwards,
not before.
> Not sure what the relation to a guest single-stepping itself was.
I think it just came up in the course of that discussion, because
single-step handling also needs to perform an action (clear PSTATE.SS)
as part of the "advance over this insn" operation. But I think that
you're right that doing the advance before dropping out to userspace
is no worse for singlestep than it is for any other case.
-- PMM
^ permalink raw reply
* Re: [PATCH 5/7] KVM: arm64: guest debug, add support for single-step
From: Christoffer Dall @ 2014-11-30 10:21 UTC (permalink / raw)
To: Alex Bennée
Cc: kvm, linux-arm-kernel, kvmarm, marc.zyngier, peter.maydell, agraf,
jan.kiszka, dahi, r65777, bp, pbonzini, Gleb Natapov,
Russell King, Catalin Marinas, Will Deacon, Lorenzo Pieralisi,
open list, open list:ABI/API
In-Reply-To: <1416931805-23223-6-git-send-email-alex.bennee@linaro.org>
On Tue, Nov 25, 2014 at 04:10:03PM +0000, Alex Bennée wrote:
> This adds support for single-stepping the guest. As userspace can and
> will manipulate guest registers before restarting any tweaking of the
> registers has to occur just before control is passed back to the guest.
> Furthermore while guest debugging is in effect we need to squash the
> ability of the guest to single-step itself as we have no easy way of
> re-entering the guest after the exception has been delivered to the
> hypervisor.
Admittedly this is a corner case, but wouldn't the only really nasty bit
of this be to emulate the guest debug exception?
>
> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
>
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index 48d26bb..a76daae 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -38,6 +38,7 @@
> #include <asm/tlbflush.h>
> #include <asm/cacheflush.h>
> #include <asm/virt.h>
> +#include <asm/debug-monitors.h>
> #include <asm/kvm_arm.h>
> #include <asm/kvm_asm.h>
> #include <asm/kvm_mmu.h>
> @@ -300,6 +301,17 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> kvm_arm_set_running_vcpu(NULL);
> }
>
> +/**
> + * kvm_arch_vcpu_ioctl_set_guest_debug - Setup guest debugging
> + * @kvm: pointer to the KVM struct
> + * @kvm_guest_debug: the ioctl data buffer
> + *
> + * This sets up the VM for guest debugging. Care has to be taken when
> + * manipulating guest registers as these will be set/cleared by the
> + * hyper-visor controller, typically before each kvm_run event. As a
hypervisor
> + * result modification of the guest registers needs to take place
> + * after they have been restored in the hyp.S trampoline code.
I don't understand this??
> + */
> int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
> struct kvm_guest_debug *dbg)
> {
> @@ -317,8 +329,8 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
>
> /* Single Step */
> if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
> - kvm_info("SS requested, not yet implemented\n");
> - return -EINVAL;
> + kvm_info("SS requested\n");
> + route_el2 = true;
> }
>
> /* Software Break Points */
> diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> index 8da1043..78e5ae1 100644
> --- a/arch/arm64/kernel/asm-offsets.c
> +++ b/arch/arm64/kernel/asm-offsets.c
> @@ -121,6 +121,7 @@ int main(void)
> DEFINE(VCPU_FAR_EL2, offsetof(struct kvm_vcpu, arch.fault.far_el2));
> DEFINE(VCPU_HPFAR_EL2, offsetof(struct kvm_vcpu, arch.fault.hpfar_el2));
> DEFINE(VCPU_DEBUG_FLAGS, offsetof(struct kvm_vcpu, arch.debug_flags));
> + DEFINE(GUEST_DEBUG, offsetof(struct kvm_vcpu, guest_debug));
> DEFINE(VCPU_HCR_EL2, offsetof(struct kvm_vcpu, arch.hcr_el2));
> DEFINE(VCPU_MDCR_EL2, offsetof(struct kvm_vcpu, arch.mdcr_el2));
> DEFINE(VCPU_IRQ_LINES, offsetof(struct kvm_vcpu, arch.irq_lines));
> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
> index 28dc92b..6def054 100644
> --- a/arch/arm64/kvm/handle_exit.c
> +++ b/arch/arm64/kvm/handle_exit.c
> @@ -91,6 +91,25 @@ static int kvm_handle_bkpt(struct kvm_vcpu *vcpu, struct kvm_run *run)
> return 0;
> }
>
> +/**
> + * kvm_handle_ss - handle single step exceptions
> + *
> + * @vcpu: the vcpu pointer
> + *
> + * See: ARM ARM D2.12 for the details. While the host is routing debug
> + * exceptions to it's handlers we have to suppress the ability of the
its handlers
> + * guest to trigger exceptions.
not really sure why this comment is here? Does it really help anyone
reading this specific function or does it just confuse people more?
> + */
> +static int kvm_handle_ss(struct kvm_vcpu *vcpu, struct kvm_run *run)
> +{
> + WARN_ON(!(vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP));
is this something that can actually happen or should it be a BUG_ON() -
which may even go away once you're doing hacking on this?
> +
> + run->exit_reason = KVM_EXIT_DEBUG;
> + run->debug.arch.exit_type = KVM_DEBUG_EXIT_SINGLE_STEP;
> + run->debug.arch.address = *vcpu_pc(vcpu);
> + return 0;
> +}
> +
> static exit_handle_fn arm_exit_handlers[] = {
> [ESR_EL2_EC_WFI] = kvm_handle_wfx,
> [ESR_EL2_EC_CP15_32] = kvm_handle_cp15_32,
> @@ -105,6 +124,7 @@ static exit_handle_fn arm_exit_handlers[] = {
> [ESR_EL2_EC_SYS64] = kvm_handle_sys_reg,
> [ESR_EL2_EC_IABT] = kvm_handle_guest_abort,
> [ESR_EL2_EC_DABT] = kvm_handle_guest_abort,
> + [ESR_EL2_EC_SOFTSTP] = kvm_handle_ss,
> [ESR_EL2_EC_BKPT32] = kvm_handle_bkpt,
> [ESR_EL2_EC_BRK64] = kvm_handle_bkpt,
> };
> diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
> index 3c733ea..c0bc218 100644
> --- a/arch/arm64/kvm/hyp.S
> +++ b/arch/arm64/kvm/hyp.S
> @@ -16,6 +16,7 @@
> */
>
> #include <linux/linkage.h>
> +#include <linux/kvm.h>
>
> #include <asm/assembler.h>
> #include <asm/memory.h>
> @@ -168,6 +169,31 @@
> // x19-x29, lr, sp*, elr*, spsr*
> restore_common_regs
>
> + // After restoring the guest registers but before we return to the guest
> + // we may want to make some final tweaks to support guest debugging.
"we may want" sounds like we're not sure what we'll be doing here. We
probably want to write something like "If the guest is being debugged we
need to set blah blah blah".
> + ldr x3, [x0, #GUEST_DEBUG]
> + tbz x3, #KVM_GUESTDBG_ENABLE_SHIFT, 2f // No guest debug
> +
> + // x0 - preserved as VCPU ptr
> + // x1 - spsr
> + // x2 - mdscr
not sure we need this comment
> + mrs x1, spsr_el2
> + mrs x2, mdscr_el1
> +
> + // See ARM ARM D2.12.3 The software step state machine
> + // If we are doing Single Step - set MDSCR_EL1.SS and PSTATE.SS
> + orr x1, x1, #DBG_SPSR_SS
> + orr x2, x2, #DBG_MDSCR_SS
> + tbnz x3, #KVM_GUESTDBG_SINGLESTEP_SHIFT, 1f
> + // If we are not doing Single Step we want to prevent the guest doing so
> + // as otherwise we will have to deal with the re-routed exceptions as we
> + // are doing other guest debug related things
> + eor x1, x1, #DBG_SPSR_SS
> + eor x2, x2, #DBG_MDSCR_SS
this really confuses me: so you're setting the SS bits in both
registers, and then if we're not single-stepping the guest, you clear
both bits again?
Wouldn't it be much simper to mask off the bits with a 'bic' and then
setting the bits when needed?
Alternatively, we could manage all these registers from C code and just
save/restore them off the VCPU struct.
> +1:
> + msr spsr_el2, x1
> + msr mdscr_el1, x2
> +2:
> // Last bits of the 64bit state
> pop x2, x3
> pop x0, x1
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 523f476..347e5b0 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -7,6 +7,8 @@
> * Note: you must update KVM_API_VERSION if you change this interface.
> */
>
> +#ifndef __ASSEMBLY__
> +
> #include <linux/types.h>
> #include <linux/compiler.h>
> #include <linux/ioctl.h>
> @@ -515,11 +517,6 @@ struct kvm_s390_irq {
> } u;
> };
>
> -/* for KVM_SET_GUEST_DEBUG */
> -
> -#define KVM_GUESTDBG_ENABLE 0x00000001
> -#define KVM_GUESTDBG_SINGLESTEP 0x00000002
> -
> struct kvm_guest_debug {
> __u32 control;
> __u32 pad;
> @@ -1189,4 +1186,15 @@ struct kvm_assigned_msix_entry {
> __u16 padding[3];
> };
>
> +#endif /* __ASSEMBLY__ */
> +
> +/* for KVM_SET_GUEST_DEBUG */
> +
> +#define KVM_GUESTDBG_ENABLE_SHIFT 0
> +#define KVM_GUESTDBG_ENABLE (1 << KVM_GUESTDBG_ENABLE_SHIFT)
> +#define KVM_GUESTDBG_SINGLESTEP_SHIFT 1
> +#define KVM_GUESTDBG_SINGLESTEP (1 << KVM_GUESTDBG_SINGLESTEP_SHIFT)
> +
> +
> +
> #endif /* __LINUX_KVM_H */
> --
> 2.1.3
>
^ permalink raw reply
* Re: [PATCH 7/7] KVM: arm64: guest debug, HW assisted debug support
From: Christoffer Dall @ 2014-11-30 10:34 UTC (permalink / raw)
To: Alex Bennée
Cc: kvm, linux-arm-kernel, kvmarm, marc.zyngier, peter.maydell, agraf,
jan.kiszka, dahi, r65777, bp, pbonzini, Gleb Natapov,
Jonathan Corbet, Russell King, Catalin Marinas, Will Deacon,
Lorenzo Pieralisi, AKASHI Takahiro, Srivatsa S. Bhat,
open list:DOCUMENTATION, open list, open list:ABI/API
In-Reply-To: <1416931805-23223-8-git-send-email-alex.bennee@linaro.org>
On Tue, Nov 25, 2014 at 04:10:05PM +0000, Alex Bennée wrote:
> This adds support for userspace to control the HW debug registers for
> guest debug. We'll only copy the $ARCH defined number across as that's
> all that hyp.S will use anyway. I've moved some helper functions into
> the hw_breakpoint.h header for re-use.
>
> As with single step we need to tweak the guest registers to enable the
> exceptions but we don't want to overwrite the guest copy of these
> registers so this is done close to the guest entry.
>
> Two new capabilities have been added to the KVM_EXTENSION ioctl to allow
> userspace to query the number of hardware break and watch points
> available on the host hardware.
>
> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 9383359..5e8c673 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2593,7 +2593,7 @@ The top 16 bits of the control field are architecture specific control
> flags which can include the following:
>
> - KVM_GUESTDBG_USE_SW_BP: using software breakpoints [x86, arm64]
> - - KVM_GUESTDBG_USE_HW_BP: using hardware breakpoints [x86, s390]
> + - KVM_GUESTDBG_USE_HW_BP: using hardware breakpoints [x86, s390, arm64]
> - KVM_GUESTDBG_INJECT_DB: inject DB type exception [x86]
> - KVM_GUESTDBG_INJECT_BP: inject BP type exception [x86]
> - KVM_GUESTDBG_EXIT_PENDING: trigger an immediate guest exit [s390]
> @@ -2606,7 +2606,10 @@ we need to ensure the guest vCPUs architecture specific registers are
> updated to the correct (supplied) values.
>
> The second part of the structure is architecture specific and
> -typically contains a set of debug registers.
> +typically contains a set of debug registers. For arm64 the number of
new paragraph for the arch-specific stuff.
> +debug registers is implementation defined and can be determined by
> +querying the KVM_CAP_GUEST_DEBUG_HW_BPS and KVM_CAP_GUEST_DEBUG_HW_WPS
> +capabilities.
>
> When debug events exit the main run loop with the reason
> KVM_EXIT_DEBUG with the kvm_debug_exit_arch part of the kvm_run
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index a76daae..c8ec23a 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -39,6 +39,7 @@
> #include <asm/cacheflush.h>
> #include <asm/virt.h>
> #include <asm/debug-monitors.h>
> +#include <asm/hw_breakpoint.h>
> #include <asm/kvm_arm.h>
> #include <asm/kvm_asm.h>
> #include <asm/kvm_mmu.h>
> @@ -341,8 +342,37 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
>
> /* Hardware assisted Break and Watch points */
> if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP) {
> - kvm_info("HW BP support requested, not yet implemented\n");
> - return -EINVAL;
> + int i;
> + int nb = get_num_brps();
> + int nw = get_num_wrps();
> +
> + /* Copy across up to IMPDEF debug registers to our
coding style
> + * shadow copy in the vcpu structure. The hyp.S code
> + * will then set them up before we re-enter the guest.
> + */
> + memcpy(vcpu->arch.guest_debug_regs.dbg_bcr,
> + dbg->arch.dbg_bcr, sizeof(__u64)*nb);
> + memcpy(vcpu->arch.guest_debug_regs.dbg_bvr,
> + dbg->arch.dbg_bvr, sizeof(__u64)*nb);
> + memcpy(vcpu->arch.guest_debug_regs.dbg_wcr,
> + dbg->arch.dbg_wcr, sizeof(__u64)*nw);
> + memcpy(vcpu->arch.guest_debug_regs.dbg_wvr,
> + dbg->arch.dbg_wvr, sizeof(__u64)*nw);
> +
> + kvm_info("HW BP support requested\n");
> + for (i = 0; i < nb; i++) {
> + kvm_info("%d: dbg_bcr=0x%llx dbg_bvr=0x%llx\n",
> + i,
> + vcpu->arch.guest_debug_regs.dbg_bcr[i],
> + vcpu->arch.guest_debug_regs.dbg_bvr[i]);
> + }
> + for (i = 0; i < nw; i++) {
> + kvm_info("%d: dbg_wcr=0x%llx dbg_wvr=0x%llx\n",
> + i,
> + vcpu->arch.guest_debug_regs.dbg_wcr[i],
> + vcpu->arch.guest_debug_regs.dbg_wvr[i]);
> + }
> + route_el2 = true;
> }
>
> /* If we are going to handle any debug exceptions we need to
> diff --git a/arch/arm64/include/asm/hw_breakpoint.h b/arch/arm64/include/asm/hw_breakpoint.h
> index 52b484b..c450552 100644
> --- a/arch/arm64/include/asm/hw_breakpoint.h
> +++ b/arch/arm64/include/asm/hw_breakpoint.h
> @@ -130,6 +130,18 @@ static inline void ptrace_hw_copy_thread(struct task_struct *task)
> }
> #endif
>
> +/* Determine number of BRP registers available. */
> +static inline int get_num_brps(void)
> +{
> + return ((read_cpuid(ID_AA64DFR0_EL1) >> 12) & 0xf) + 1;
> +}
> +
> +/* Determine number of WRP registers available. */
> +static inline int get_num_wrps(void)
> +{
> + return ((read_cpuid(ID_AA64DFR0_EL1) >> 20) & 0xf) + 1;
> +}
> +
> extern struct pmu perf_ops_bp;
>
> #endif /* __KERNEL__ */
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 38b0f07..e386bf4 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -103,8 +103,9 @@ struct kvm_vcpu_arch {
> /* Exception Information */
> struct kvm_vcpu_fault_info fault;
>
> - /* Debug state */
> + /* Guest debug state */
> u64 debug_flags;
> + struct kvm_guest_debug_arch guest_debug_regs;
>
> /* Pointer to host CPU context */
> kvm_cpu_context_t *host_cpu_context;
> diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> index 78e5ae1..c9ecfd3 100644
> --- a/arch/arm64/kernel/asm-offsets.c
> +++ b/arch/arm64/kernel/asm-offsets.c
> @@ -122,6 +122,10 @@ int main(void)
> DEFINE(VCPU_HPFAR_EL2, offsetof(struct kvm_vcpu, arch.fault.hpfar_el2));
> DEFINE(VCPU_DEBUG_FLAGS, offsetof(struct kvm_vcpu, arch.debug_flags));
> DEFINE(GUEST_DEBUG, offsetof(struct kvm_vcpu, guest_debug));
> + DEFINE(GUEST_DEBUG_BCR, offsetof(struct kvm_vcpu, arch.guest_debug_regs.dbg_bcr));
> + DEFINE(GUEST_DEBUG_BVR, offsetof(struct kvm_vcpu, arch.guest_debug_regs.dbg_bvr));
> + DEFINE(GUEST_DEBUG_WCR, offsetof(struct kvm_vcpu, arch.guest_debug_regs.dbg_wcr));
> + DEFINE(GUEST_DEBUG_WVR, offsetof(struct kvm_vcpu, arch.guest_debug_regs.dbg_wvr));
> DEFINE(VCPU_HCR_EL2, offsetof(struct kvm_vcpu, arch.hcr_el2));
> DEFINE(VCPU_MDCR_EL2, offsetof(struct kvm_vcpu, arch.mdcr_el2));
> DEFINE(VCPU_IRQ_LINES, offsetof(struct kvm_vcpu, arch.irq_lines));
> diff --git a/arch/arm64/kernel/hw_breakpoint.c b/arch/arm64/kernel/hw_breakpoint.c
> index df1cf15..45dcc6f 100644
> --- a/arch/arm64/kernel/hw_breakpoint.c
> +++ b/arch/arm64/kernel/hw_breakpoint.c
> @@ -49,18 +49,6 @@ static DEFINE_PER_CPU(int, stepping_kernel_bp);
> static int core_num_brps;
> static int core_num_wrps;
>
> -/* Determine number of BRP registers available. */
> -static int get_num_brps(void)
> -{
> - return ((read_cpuid(ID_AA64DFR0_EL1) >> 12) & 0xf) + 1;
> -}
> -
> -/* Determine number of WRP registers available. */
> -static int get_num_wrps(void)
> -{
> - return ((read_cpuid(ID_AA64DFR0_EL1) >> 20) & 0xf) + 1;
> -}
> -
> int hw_breakpoint_slots(int type)
> {
> /*
> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
> index 6def054..d024e47 100644
> --- a/arch/arm64/kvm/handle_exit.c
> +++ b/arch/arm64/kvm/handle_exit.c
> @@ -110,6 +110,42 @@ static int kvm_handle_ss(struct kvm_vcpu *vcpu, struct kvm_run *run)
> return 0;
> }
>
> +/**
> + * kvm_handle_hw_bp - handle HW assisted break point
> + *
> + * @vcpu: the vcpu pointer
> + *
> + */
> +static int kvm_handle_hw_bp(struct kvm_vcpu *vcpu, struct kvm_run *run)
> +{
> + WARN_ON(!(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP));
> +
> + run->exit_reason = KVM_EXIT_DEBUG;
> + run->debug.arch.exit_type = KVM_DEBUG_EXIT_HW_BKPT;
> + run->debug.arch.address = *vcpu_pc(vcpu);
> + return 0;
> +}
> +
> +/**
> + * kvm_handle_watch - handle HW assisted watch point
> + *
> + * @vcpu: the vcpu pointer
> + *
> + * These are basically the same as breakpoints (and indeed may use the
> + * breakpoint in a linked fashion). However they generate a specific
> + * exception so we trap it here for reporting to the guest.
> + *
> + */
> +static int kvm_handle_watch(struct kvm_vcpu *vcpu, struct kvm_run *run)
> +{
> + WARN_ON(!(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP));
> +
> + run->exit_reason = KVM_EXIT_DEBUG;
> + run->debug.arch.exit_type = KVM_DEBUG_EXIT_HW_WTPT;
> + run->debug.arch.address = *vcpu_pc(vcpu);
> + return 0;
> +}
the usual comments about the @run documentation and let's get rid of all
these WARN_ON() statements.
> +
> static exit_handle_fn arm_exit_handlers[] = {
> [ESR_EL2_EC_WFI] = kvm_handle_wfx,
> [ESR_EL2_EC_CP15_32] = kvm_handle_cp15_32,
> @@ -125,6 +161,8 @@ static exit_handle_fn arm_exit_handlers[] = {
> [ESR_EL2_EC_IABT] = kvm_handle_guest_abort,
> [ESR_EL2_EC_DABT] = kvm_handle_guest_abort,
> [ESR_EL2_EC_SOFTSTP] = kvm_handle_ss,
> + [ESR_EL2_EC_WATCHPT] = kvm_handle_watch,
> + [ESR_EL2_EC_BREAKPT] = kvm_handle_hw_bp,
> [ESR_EL2_EC_BKPT32] = kvm_handle_bkpt,
> [ESR_EL2_EC_BRK64] = kvm_handle_bkpt,
> };
> diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
> index b38ce3d..96f71ab 100644
> --- a/arch/arm64/kvm/hyp.S
> +++ b/arch/arm64/kvm/hyp.S
> @@ -18,6 +18,7 @@
> #include <linux/linkage.h>
> #include <linux/kvm.h>
>
> +#include <uapi/asm/kvm.h>
> #include <asm/assembler.h>
> #include <asm/memory.h>
> #include <asm/asm-offsets.h>
> @@ -174,6 +175,7 @@
> ldr x3, [x0, #GUEST_DEBUG]
> tbz x3, #KVM_GUESTDBG_ENABLE_SHIFT, 2f // No guest debug
>
> + // Both Step and HW BP/WP ops need to modify spsr_el2 and mdscr_el1
> // x0 - preserved as VCPU ptr
> // x1 - spsr
> // x2 - mdscr
> @@ -191,6 +193,11 @@
> eor x1, x1, #DBG_SPSR_SS
> eor x2, x2, #DBG_MDSCR_SS
> 1:
> + // If we are doing HW BP/WP - set MDSCR_EL1.KDE/MDE
> + tbz x3, #KVM_GUESTDBG_USE_HW_BP_SHIFT, 3f
> + orr x2, x2, #DBG_MDSCR_KDE
> + orr x2, x2, #DBG_MDSCR_MDE
> +3:
> msr spsr_el2, x1
> msr mdscr_el1, x2
> 2:
> @@ -815,6 +822,33 @@ __restore_debug:
>
> ret
>
> +/* Setup debug state for debug of guest */
> +__setup_debug:
> + // x0: vcpu base address
> + // x3: ptr to guest registers passed to setup_debug_registers
> + // x5..x20/x26: trashed
> +
> + mrs x26, id_aa64dfr0_el1
> + ubfx x24, x26, #12, #4 // Extract BRPs
> + ubfx x25, x26, #20, #4 // Extract WRPs
> + mov w26, #15
> + sub w24, w26, w24 // How many BPs to skip
> + sub w25, w26, w25 // How many WPs to skip
> +
> + mov x4, x24
> + add x3, x0, #GUEST_DEBUG_BCR
> + setup_debug_registers dbgbcr
> + add x3, x0, #GUEST_DEBUG_BVR
> + setup_debug_registers dbgbvr
> +
> + mov x4, x25
> + add x3, x0, #GUEST_DEBUG_WCR
> + setup_debug_registers dbgwcr
> + add x3, x0, #GUEST_DEBUG_WVR
> + setup_debug_registers dbgwvr
> +
> + ret
> +
> __save_fpsimd:
> save_fpsimd
> ret
> @@ -861,6 +895,13 @@ ENTRY(__kvm_vcpu_run)
> bl __restore_sysregs
> bl __restore_fpsimd
>
> + // Now is the time to set-up the debug registers if we
> + // are debugging the guest
> + ldr x3, [x0, #GUEST_DEBUG]
> + tbz x3, #KVM_GUESTDBG_USE_HW_BP_SHIFT, 2f
> + bl __setup_debug
> + b 1f
> +2:
> skip_debug_state x3, 1f
> bl __restore_debug
> 1:
> @@ -881,6 +922,11 @@ __kvm_vcpu_return:
> bl __save_fpsimd
> bl __save_sysregs
>
> + // If we are debugging the guest don't save debug registers
> + // otherwise we'll be trashing are only good copy we have.
> + ldr x3, [x0, #GUEST_DEBUG]
> + tbnz x3, #KVM_GUESTDBG_USE_HW_BP_SHIFT, 1f
> +
we're introducing an awful lot of conditionals in the assembly code with
these patches, can you re-consider if there's a cleaner abstraction that
allows us to deal with some of this stuff in C-code?
-Christoffer
^ permalink raw reply
* [PATCH v7 07/46] virtio: add virtio 1.0 feature bit
From: Michael S. Tsirkin @ 2014-11-30 15:09 UTC (permalink / raw)
To: linux-kernel
Cc: thuth, rusty, linux-api, virtualization, dahi, pbonzini,
David Miller
In-Reply-To: <1417359787-10138-1-git-send-email-mst@redhat.com>
Based on original patches by Rusty Russell, Thomas Huth
and Cornelia Huck.
Note: at this time, we do not negotiate this feature bit
in core, drivers have to declare VERSION_1 support explicitly.
For this reason we treat this bit as a device bit
and not as a transport bit for now.
After all drivers are converted, we will be able to
move VERSION_1 to core and drop it from all drivers.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
---
include/uapi/linux/virtio_config.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
index 3ce768c..80e7381 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -54,4 +54,7 @@
/* Can the device handle any descriptor layout? */
#define VIRTIO_F_ANY_LAYOUT 27
+/* v1.0 compliant. */
+#define VIRTIO_F_VERSION_1 32
+
#endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
--
MST
^ permalink raw reply related
* [PATCH v7 08/46] virtio: memory access APIs
From: Michael S. Tsirkin @ 2014-11-30 15:09 UTC (permalink / raw)
To: linux-kernel
Cc: Bjarke Istrup Pedersen, thuth, Anup Patel, rusty,
Greg Kroah-Hartman, David Drysdale, dahi, Ashwin Chaugule,
Sakari Ailus, linux-api, pbonzini, Andy Grover, virtualization,
David Miller, Alexei Starovoitov
In-Reply-To: <1417359787-10138-1-git-send-email-mst@redhat.com>
virtio 1.0 makes all memory structures LE, so
we need APIs to conditionally do a byteswap on BE
architectures.
To make it easier to check code statically,
add virtio specific types for multi-byte integers
in memory.
Add low level wrappers that do a byteswap conditionally, these will be
useful e.g. for vhost. Add high level wrappers that
query device endian-ness and act accordingly.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
include/linux/virtio_byteorder.h | 59 +++++++++++++++++++++++++++++++++++++++
include/linux/virtio_config.h | 32 +++++++++++++++++++++
include/uapi/linux/virtio_ring.h | 45 ++++++++++++++---------------
include/uapi/linux/virtio_types.h | 46 ++++++++++++++++++++++++++++++
include/uapi/linux/Kbuild | 1 +
5 files changed, 161 insertions(+), 22 deletions(-)
create mode 100644 include/linux/virtio_byteorder.h
create mode 100644 include/uapi/linux/virtio_types.h
diff --git a/include/linux/virtio_byteorder.h b/include/linux/virtio_byteorder.h
new file mode 100644
index 0000000..51865d0
--- /dev/null
+++ b/include/linux/virtio_byteorder.h
@@ -0,0 +1,59 @@
+#ifndef _LINUX_VIRTIO_BYTEORDER_H
+#define _LINUX_VIRTIO_BYTEORDER_H
+#include <linux/types.h>
+#include <uapi/linux/virtio_types.h>
+
+/*
+ * Low-level memory accessors for handling virtio in modern little endian and in
+ * compatibility native endian format.
+ */
+
+static inline u16 __virtio16_to_cpu(bool little_endian, __virtio16 val)
+{
+ if (little_endian)
+ return le16_to_cpu((__force __le16)val);
+ else
+ return (__force u16)val;
+}
+
+static inline __virtio16 __cpu_to_virtio16(bool little_endian, u16 val)
+{
+ if (little_endian)
+ return (__force __virtio16)cpu_to_le16(val);
+ else
+ return (__force __virtio16)val;
+}
+
+static inline u32 __virtio32_to_cpu(bool little_endian, __virtio32 val)
+{
+ if (little_endian)
+ return le32_to_cpu((__force __le32)val);
+ else
+ return (__force u32)val;
+}
+
+static inline __virtio32 __cpu_to_virtio32(bool little_endian, u32 val)
+{
+ if (little_endian)
+ return (__force __virtio32)cpu_to_le32(val);
+ else
+ return (__force __virtio32)val;
+}
+
+static inline u64 __virtio64_to_cpu(bool little_endian, __virtio64 val)
+{
+ if (little_endian)
+ return le64_to_cpu((__force __le64)val);
+ else
+ return (__force u64)val;
+}
+
+static inline __virtio64 __cpu_to_virtio64(bool little_endian, u64 val)
+{
+ if (little_endian)
+ return (__force __virtio64)cpu_to_le64(val);
+ else
+ return (__force __virtio64)val;
+}
+
+#endif /* _LINUX_VIRTIO_BYTEORDER */
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index f517884..02f0acb 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -4,6 +4,7 @@
#include <linux/err.h>
#include <linux/bug.h>
#include <linux/virtio.h>
+#include <linux/virtio_byteorder.h>
#include <uapi/linux/virtio_config.h>
/**
@@ -199,6 +200,37 @@ int virtqueue_set_affinity(struct virtqueue *vq, int cpu)
return 0;
}
+/* Memory accessors */
+static inline u16 virtio16_to_cpu(struct virtio_device *vdev, __virtio16 val)
+{
+ return __virtio16_to_cpu(virtio_has_feature(vdev, VIRTIO_F_VERSION_1), val);
+}
+
+static inline __virtio16 cpu_to_virtio16(struct virtio_device *vdev, u16 val)
+{
+ return __cpu_to_virtio16(virtio_has_feature(vdev, VIRTIO_F_VERSION_1), val);
+}
+
+static inline u32 virtio32_to_cpu(struct virtio_device *vdev, __virtio32 val)
+{
+ return __virtio32_to_cpu(virtio_has_feature(vdev, VIRTIO_F_VERSION_1), val);
+}
+
+static inline __virtio32 cpu_to_virtio32(struct virtio_device *vdev, u32 val)
+{
+ return __cpu_to_virtio32(virtio_has_feature(vdev, VIRTIO_F_VERSION_1), val);
+}
+
+static inline u64 virtio64_to_cpu(struct virtio_device *vdev, __virtio64 val)
+{
+ return __virtio64_to_cpu(virtio_has_feature(vdev, VIRTIO_F_VERSION_1), val);
+}
+
+static inline __virtio64 cpu_to_virtio64(struct virtio_device *vdev, u64 val)
+{
+ return __cpu_to_virtio64(virtio_has_feature(vdev, VIRTIO_F_VERSION_1), val);
+}
+
/* Config space accessors. */
#define virtio_cread(vdev, structname, member, ptr) \
do { \
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index a99f9b7..61c818a 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -32,6 +32,7 @@
*
* Copyright Rusty Russell IBM Corporation 2007. */
#include <linux/types.h>
+#include <linux/virtio_types.h>
/* This marks a buffer as continuing via the next field. */
#define VRING_DESC_F_NEXT 1
@@ -61,32 +62,32 @@
/* Virtio ring descriptors: 16 bytes. These can chain together via "next". */
struct vring_desc {
/* Address (guest-physical). */
- __u64 addr;
+ __virtio64 addr;
/* Length. */
- __u32 len;
+ __virtio32 len;
/* The flags as indicated above. */
- __u16 flags;
+ __virtio16 flags;
/* We chain unused descriptors via this, too */
- __u16 next;
+ __virtio16 next;
};
struct vring_avail {
- __u16 flags;
- __u16 idx;
- __u16 ring[];
+ __virtio16 flags;
+ __virtio16 idx;
+ __virtio16 ring[];
};
/* u32 is used here for ids for padding reasons. */
struct vring_used_elem {
/* Index of start of used descriptor chain. */
- __u32 id;
+ __virtio32 id;
/* Total length of the descriptor chain which was used (written to) */
- __u32 len;
+ __virtio32 len;
};
struct vring_used {
- __u16 flags;
- __u16 idx;
+ __virtio16 flags;
+ __virtio16 idx;
struct vring_used_elem ring[];
};
@@ -109,25 +110,25 @@ struct vring {
* struct vring_desc desc[num];
*
* // A ring of available descriptor heads with free-running index.
- * __u16 avail_flags;
- * __u16 avail_idx;
- * __u16 available[num];
- * __u16 used_event_idx;
+ * __virtio16 avail_flags;
+ * __virtio16 avail_idx;
+ * __virtio16 available[num];
+ * __virtio16 used_event_idx;
*
* // Padding to the next align boundary.
* char pad[];
*
* // A ring of used descriptor heads with free-running index.
- * __u16 used_flags;
- * __u16 used_idx;
+ * __virtio16 used_flags;
+ * __virtio16 used_idx;
* struct vring_used_elem used[num];
- * __u16 avail_event_idx;
+ * __virtio16 avail_event_idx;
* };
*/
/* We publish the used event index at the end of the available ring, and vice
* versa. They are at the end for backwards compatibility. */
#define vring_used_event(vr) ((vr)->avail->ring[(vr)->num])
-#define vring_avail_event(vr) (*(__u16 *)&(vr)->used->ring[(vr)->num])
+#define vring_avail_event(vr) (*(__virtio16 *)&(vr)->used->ring[(vr)->num])
static inline void vring_init(struct vring *vr, unsigned int num, void *p,
unsigned long align)
@@ -135,15 +136,15 @@ static inline void vring_init(struct vring *vr, unsigned int num, void *p,
vr->num = num;
vr->desc = p;
vr->avail = p + num*sizeof(struct vring_desc);
- vr->used = (void *)(((unsigned long)&vr->avail->ring[num] + sizeof(__u16)
+ vr->used = (void *)(((unsigned long)&vr->avail->ring[num] + sizeof(__virtio16)
+ align-1) & ~(align - 1));
}
static inline unsigned vring_size(unsigned int num, unsigned long align)
{
- return ((sizeof(struct vring_desc) * num + sizeof(__u16) * (3 + num)
+ return ((sizeof(struct vring_desc) * num + sizeof(__virtio16) * (3 + num)
+ align - 1) & ~(align - 1))
- + sizeof(__u16) * 3 + sizeof(struct vring_used_elem) * num;
+ + sizeof(__virtio16) * 3 + sizeof(struct vring_used_elem) * num;
}
/* The following is used with USED_EVENT_IDX and AVAIL_EVENT_IDX */
diff --git a/include/uapi/linux/virtio_types.h b/include/uapi/linux/virtio_types.h
new file mode 100644
index 0000000..e845e8c
--- /dev/null
+++ b/include/uapi/linux/virtio_types.h
@@ -0,0 +1,46 @@
+#ifndef _UAPI_LINUX_VIRTIO_TYPES_H
+#define _UAPI_LINUX_VIRTIO_TYPES_H
+/* Type definitions for virtio implementations.
+ *
+ * This header is BSD licensed so anyone can use the definitions to implement
+ * compatible drivers/servers.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of IBM nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * Copyright (C) 2014 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <mst@redhat.com>
+ */
+#include <linux/types.h>
+
+/*
+ * __virtio{16,32,64} have the following meaning:
+ * - __u{16,32,64} for virtio devices in legacy mode, accessed in native endian
+ * - __le{16,32,64} for standard-compliant virtio devices
+ */
+
+typedef __u16 __bitwise__ __virtio16;
+typedef __u32 __bitwise__ __virtio32;
+typedef __u64 __bitwise__ __virtio64;
+
+#endif /* _UAPI_LINUX_VIRTIO_TYPES_H */
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 4c94f31..44a5581 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -423,6 +423,7 @@ header-y += virtio_blk.h
header-y += virtio_config.h
header-y += virtio_console.h
header-y += virtio_ids.h
+header-y += virtio_types.h
header-y += virtio_net.h
header-y += virtio_pci.h
header-y += virtio_ring.h
--
MST
^ permalink raw reply related
* [PATCH v7 12/46] virtio: set FEATURES_OK
From: Michael S. Tsirkin @ 2014-11-30 15:10 UTC (permalink / raw)
To: linux-kernel
Cc: thuth, rusty, linux-api, virtualization, dahi, pbonzini,
David Miller
In-Reply-To: <1417359787-10138-1-git-send-email-mst@redhat.com>
set FEATURES_OK as per virtio 1.0 spec
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
include/uapi/linux/virtio_config.h | 2 ++
drivers/virtio/virtio.c | 29 ++++++++++++++++++++++-------
2 files changed, 24 insertions(+), 7 deletions(-)
diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
index 80e7381..4d05671 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -38,6 +38,8 @@
#define VIRTIO_CONFIG_S_DRIVER 2
/* Driver has used its parts of the config, and is happy */
#define VIRTIO_CONFIG_S_DRIVER_OK 4
+/* Driver has finished configuring features */
+#define VIRTIO_CONFIG_S_FEATURES_OK 8
/* We've given up on this device. */
#define VIRTIO_CONFIG_S_FAILED 0x80
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 746d350..9248125 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -160,6 +160,7 @@ static int virtio_dev_probe(struct device *_d)
struct virtio_device *dev = dev_to_virtio(_d);
struct virtio_driver *drv = drv_to_virtio(dev->dev.driver);
u64 device_features;
+ unsigned status;
/* We have a driver! */
add_status(dev, VIRTIO_CONFIG_S_DRIVER);
@@ -183,18 +184,32 @@ static int virtio_dev_probe(struct device *_d)
dev->config->finalize_features(dev);
+ if (virtio_has_feature(dev, VIRTIO_F_VERSION_1)) {
+ add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
+ status = dev->config->get_status(dev);
+ if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) {
+ dev_err(_d, "virtio: device refuses features: %x\n",
+ status);
+ err = -ENODEV;
+ goto err;
+ }
+ }
+
err = drv->probe(dev);
if (err)
- add_status(dev, VIRTIO_CONFIG_S_FAILED);
- else {
- add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
- if (drv->scan)
- drv->scan(dev);
+ goto err;
- virtio_config_enable(dev);
- }
+ add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
+ if (drv->scan)
+ drv->scan(dev);
+
+ virtio_config_enable(dev);
+ return 0;
+err:
+ add_status(dev, VIRTIO_CONFIG_S_FAILED);
return err;
+
}
static int virtio_dev_remove(struct device *_d)
--
MST
^ permalink raw reply related
* [PATCH v7 15/46] virtio_net: v1.0 endianness
From: Michael S. Tsirkin @ 2014-11-30 15:10 UTC (permalink / raw)
To: linux-kernel
Cc: David Miller, cornelia.huck, rusty, nab, pbonzini, thuth, dahi,
Rusty Russell, virtualization, netdev, linux-api
In-Reply-To: <1417359787-10138-1-git-send-email-mst@redhat.com>
Based on patches by Rusty Russell, Cornelia Huck.
Note: more code changes are needed for 1.0 support
(due to different header size).
So we don't advertize support for 1.0 yet.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
include/uapi/linux/virtio_net.h | 15 ++++++++-------
drivers/net/virtio_net.c | 33 ++++++++++++++++++++-------------
2 files changed, 28 insertions(+), 20 deletions(-)
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 172a7f0..b5f1677 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -28,6 +28,7 @@
#include <linux/types.h>
#include <linux/virtio_ids.h>
#include <linux/virtio_config.h>
+#include <linux/virtio_types.h>
#include <linux/if_ether.h>
/* The feature bitmap for virtio net */
@@ -84,17 +85,17 @@ struct virtio_net_hdr {
#define VIRTIO_NET_HDR_GSO_TCPV6 4 // GSO frame, IPv6 TCP
#define VIRTIO_NET_HDR_GSO_ECN 0x80 // TCP has ECN set
__u8 gso_type;
- __u16 hdr_len; /* Ethernet + IP + tcp/udp hdrs */
- __u16 gso_size; /* Bytes to append to hdr_len per frame */
- __u16 csum_start; /* Position to start checksumming from */
- __u16 csum_offset; /* Offset after that to place checksum */
+ __virtio16 hdr_len; /* Ethernet + IP + tcp/udp hdrs */
+ __virtio16 gso_size; /* Bytes to append to hdr_len per frame */
+ __virtio16 csum_start; /* Position to start checksumming from */
+ __virtio16 csum_offset; /* Offset after that to place checksum */
};
/* This is the version of the header to use when the MRG_RXBUF
* feature has been negotiated. */
struct virtio_net_hdr_mrg_rxbuf {
struct virtio_net_hdr hdr;
- __u16 num_buffers; /* Number of merged rx buffers */
+ __virtio16 num_buffers; /* Number of merged rx buffers */
};
/*
@@ -149,7 +150,7 @@ typedef __u8 virtio_net_ctrl_ack;
* VIRTIO_NET_F_CTRL_MAC_ADDR feature is available.
*/
struct virtio_net_ctrl_mac {
- __u32 entries;
+ __virtio32 entries;
__u8 macs[][ETH_ALEN];
} __attribute__((packed));
@@ -193,7 +194,7 @@ struct virtio_net_ctrl_mac {
* specified.
*/
struct virtio_net_ctrl_mq {
- __u16 virtqueue_pairs;
+ __virtio16 virtqueue_pairs;
};
#define VIRTIO_NET_CTRL_MQ 4
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b0bc8ea..c07e030 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -347,13 +347,14 @@ err:
}
static struct sk_buff *receive_mergeable(struct net_device *dev,
+ struct virtnet_info *vi,
struct receive_queue *rq,
unsigned long ctx,
unsigned int len)
{
void *buf = mergeable_ctx_to_buf_address(ctx);
struct skb_vnet_hdr *hdr = buf;
- int num_buf = hdr->mhdr.num_buffers;
+ u16 num_buf = virtio16_to_cpu(rq->vq->vdev, hdr->mhdr.num_buffers);
struct page *page = virt_to_head_page(buf);
int offset = buf - page_address(page);
unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
@@ -369,7 +370,9 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
ctx = (unsigned long)virtqueue_get_buf(rq->vq, &len);
if (unlikely(!ctx)) {
pr_debug("%s: rx error: %d buffers out of %d missing\n",
- dev->name, num_buf, hdr->mhdr.num_buffers);
+ dev->name, num_buf,
+ virtio16_to_cpu(rq->vq->vdev,
+ hdr->mhdr.num_buffers));
dev->stats.rx_length_errors++;
goto err_buf;
}
@@ -454,7 +457,7 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
}
if (vi->mergeable_rx_bufs)
- skb = receive_mergeable(dev, rq, (unsigned long)buf, len);
+ skb = receive_mergeable(dev, vi, rq, (unsigned long)buf, len);
else if (vi->big_packets)
skb = receive_big(dev, rq, buf, len);
else
@@ -473,8 +476,8 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
if (hdr->hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
pr_debug("Needs csum!\n");
if (!skb_partial_csum_set(skb,
- hdr->hdr.csum_start,
- hdr->hdr.csum_offset))
+ virtio16_to_cpu(vi->vdev, hdr->hdr.csum_start),
+ virtio16_to_cpu(vi->vdev, hdr->hdr.csum_offset)))
goto frame_err;
} else if (hdr->hdr.flags & VIRTIO_NET_HDR_F_DATA_VALID) {
skb->ip_summed = CHECKSUM_UNNECESSARY;
@@ -514,7 +517,8 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
if (hdr->hdr.gso_type & VIRTIO_NET_HDR_GSO_ECN)
skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
- skb_shinfo(skb)->gso_size = hdr->hdr.gso_size;
+ skb_shinfo(skb)->gso_size = virtio16_to_cpu(vi->vdev,
+ hdr->hdr.gso_size);
if (skb_shinfo(skb)->gso_size == 0) {
net_warn_ratelimited("%s: zero gso size.\n", dev->name);
goto frame_err;
@@ -876,16 +880,19 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
if (skb->ip_summed == CHECKSUM_PARTIAL) {
hdr->hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
- hdr->hdr.csum_start = skb_checksum_start_offset(skb);
- hdr->hdr.csum_offset = skb->csum_offset;
+ hdr->hdr.csum_start = cpu_to_virtio16(vi->vdev,
+ skb_checksum_start_offset(skb));
+ hdr->hdr.csum_offset = cpu_to_virtio16(vi->vdev,
+ skb->csum_offset);
} else {
hdr->hdr.flags = 0;
hdr->hdr.csum_offset = hdr->hdr.csum_start = 0;
}
if (skb_is_gso(skb)) {
- hdr->hdr.hdr_len = skb_headlen(skb);
- hdr->hdr.gso_size = skb_shinfo(skb)->gso_size;
+ hdr->hdr.hdr_len = cpu_to_virtio16(vi->vdev, skb_headlen(skb));
+ hdr->hdr.gso_size = cpu_to_virtio16(vi->vdev,
+ skb_shinfo(skb)->gso_size);
if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4)
hdr->hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
else if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6)
@@ -1112,7 +1119,7 @@ static int virtnet_set_queues(struct virtnet_info *vi, u16 queue_pairs)
if (!vi->has_cvq || !virtio_has_feature(vi->vdev, VIRTIO_NET_F_MQ))
return 0;
- s.virtqueue_pairs = queue_pairs;
+ s.virtqueue_pairs = cpu_to_virtio16(vi->vdev, queue_pairs);
sg_init_one(&sg, &s, sizeof(s));
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MQ,
@@ -1189,7 +1196,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
sg_init_table(sg, 2);
/* Store the unicast list and count in the front of the buffer */
- mac_data->entries = uc_count;
+ mac_data->entries = cpu_to_virtio32(vi->vdev, uc_count);
i = 0;
netdev_for_each_uc_addr(ha, dev)
memcpy(&mac_data->macs[i++][0], ha->addr, ETH_ALEN);
@@ -1200,7 +1207,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
/* multicast list and count fill the end */
mac_data = (void *)&mac_data->macs[uc_count][0];
- mac_data->entries = mc_count;
+ mac_data->entries = cpu_to_virtio32(vi->vdev, mc_count);
i = 0;
netdev_for_each_mc_addr(ha, dev)
memcpy(&mac_data->macs[i++][0], ha->addr, ETH_ALEN);
--
MST
^ permalink raw reply related
* [PATCH v7 16/46] virtio_blk: v1.0 support
From: Michael S. Tsirkin @ 2014-11-30 15:10 UTC (permalink / raw)
To: linux-kernel
Cc: thuth, rusty, linux-api, virtualization, dahi, pbonzini,
David Miller
In-Reply-To: <1417359787-10138-1-git-send-email-mst@redhat.com>
Based on patch by Cornelia Huck.
Note: for consistency, and to avoid sparse errors,
convert all fields, even those no longer in use
for virtio v1.0.
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
include/uapi/linux/virtio_blk.h | 15 ++++-----
drivers/block/virtio_blk.c | 70 ++++++++++++++++++++++++-----------------
2 files changed, 49 insertions(+), 36 deletions(-)
diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
index 9ad67b2..247c8ba 100644
--- a/include/uapi/linux/virtio_blk.h
+++ b/include/uapi/linux/virtio_blk.h
@@ -28,6 +28,7 @@
#include <linux/types.h>
#include <linux/virtio_ids.h>
#include <linux/virtio_config.h>
+#include <linux/virtio_types.h>
/* Feature bits */
#define VIRTIO_BLK_F_BARRIER 0 /* Does host support barriers? */
@@ -114,18 +115,18 @@ struct virtio_blk_config {
/* This is the first element of the read scatter-gather list. */
struct virtio_blk_outhdr {
/* VIRTIO_BLK_T* */
- __u32 type;
+ __virtio32 type;
/* io priority. */
- __u32 ioprio;
+ __virtio32 ioprio;
/* Sector (ie. 512 byte offset) */
- __u64 sector;
+ __virtio64 sector;
};
struct virtio_scsi_inhdr {
- __u32 errors;
- __u32 data_len;
- __u32 sense_len;
- __u32 residual;
+ __virtio32 errors;
+ __virtio32 data_len;
+ __virtio32 sense_len;
+ __virtio32 residual;
};
/* And this is the final byte of the write scatter-gather list. */
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index c6a27d5..f601f16 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -80,7 +80,7 @@ static int __virtblk_add_req(struct virtqueue *vq,
{
struct scatterlist hdr, status, cmd, sense, inhdr, *sgs[6];
unsigned int num_out = 0, num_in = 0;
- int type = vbr->out_hdr.type & ~VIRTIO_BLK_T_OUT;
+ __virtio32 type = vbr->out_hdr.type & ~cpu_to_virtio32(vq->vdev, VIRTIO_BLK_T_OUT);
sg_init_one(&hdr, &vbr->out_hdr, sizeof(vbr->out_hdr));
sgs[num_out++] = &hdr;
@@ -91,19 +91,19 @@ static int __virtblk_add_req(struct virtqueue *vq,
* block, and before the normal inhdr we put the sense data and the
* inhdr with additional status information.
*/
- if (type == VIRTIO_BLK_T_SCSI_CMD) {
+ if (type == cpu_to_virtio32(vq->vdev, VIRTIO_BLK_T_SCSI_CMD)) {
sg_init_one(&cmd, vbr->req->cmd, vbr->req->cmd_len);
sgs[num_out++] = &cmd;
}
if (have_data) {
- if (vbr->out_hdr.type & VIRTIO_BLK_T_OUT)
+ if (vbr->out_hdr.type & cpu_to_virtio32(vq->vdev, VIRTIO_BLK_T_OUT))
sgs[num_out++] = data_sg;
else
sgs[num_out + num_in++] = data_sg;
}
- if (type == VIRTIO_BLK_T_SCSI_CMD) {
+ if (type == cpu_to_virtio32(vq->vdev, VIRTIO_BLK_T_SCSI_CMD)) {
sg_init_one(&sense, vbr->req->sense, SCSI_SENSE_BUFFERSIZE);
sgs[num_out + num_in++] = &sense;
sg_init_one(&inhdr, &vbr->in_hdr, sizeof(vbr->in_hdr));
@@ -119,12 +119,13 @@ static int __virtblk_add_req(struct virtqueue *vq,
static inline void virtblk_request_done(struct request *req)
{
struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
+ struct virtio_blk *vblk = req->q->queuedata;
int error = virtblk_result(vbr);
if (req->cmd_type == REQ_TYPE_BLOCK_PC) {
- req->resid_len = vbr->in_hdr.residual;
- req->sense_len = vbr->in_hdr.sense_len;
- req->errors = vbr->in_hdr.errors;
+ req->resid_len = virtio32_to_cpu(vblk->vdev, vbr->in_hdr.residual);
+ req->sense_len = virtio32_to_cpu(vblk->vdev, vbr->in_hdr.sense_len);
+ req->errors = virtio32_to_cpu(vblk->vdev, vbr->in_hdr.errors);
} else if (req->cmd_type == REQ_TYPE_SPECIAL) {
req->errors = (error != 0);
}
@@ -173,25 +174,25 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req,
vbr->req = req;
if (req->cmd_flags & REQ_FLUSH) {
- vbr->out_hdr.type = VIRTIO_BLK_T_FLUSH;
+ vbr->out_hdr.type = cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_FLUSH);
vbr->out_hdr.sector = 0;
- vbr->out_hdr.ioprio = req_get_ioprio(vbr->req);
+ vbr->out_hdr.ioprio = cpu_to_virtio32(vblk->vdev, req_get_ioprio(vbr->req));
} else {
switch (req->cmd_type) {
case REQ_TYPE_FS:
vbr->out_hdr.type = 0;
- vbr->out_hdr.sector = blk_rq_pos(vbr->req);
- vbr->out_hdr.ioprio = req_get_ioprio(vbr->req);
+ vbr->out_hdr.sector = cpu_to_virtio64(vblk->vdev, blk_rq_pos(vbr->req));
+ vbr->out_hdr.ioprio = cpu_to_virtio32(vblk->vdev, req_get_ioprio(vbr->req));
break;
case REQ_TYPE_BLOCK_PC:
- vbr->out_hdr.type = VIRTIO_BLK_T_SCSI_CMD;
+ vbr->out_hdr.type = cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_SCSI_CMD);
vbr->out_hdr.sector = 0;
- vbr->out_hdr.ioprio = req_get_ioprio(vbr->req);
+ vbr->out_hdr.ioprio = cpu_to_virtio32(vblk->vdev, req_get_ioprio(vbr->req));
break;
case REQ_TYPE_SPECIAL:
- vbr->out_hdr.type = VIRTIO_BLK_T_GET_ID;
+ vbr->out_hdr.type = cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_GET_ID);
vbr->out_hdr.sector = 0;
- vbr->out_hdr.ioprio = req_get_ioprio(vbr->req);
+ vbr->out_hdr.ioprio = cpu_to_virtio32(vblk->vdev, req_get_ioprio(vbr->req));
break;
default:
/* We don't put anything else in the queue. */
@@ -204,9 +205,9 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req,
num = blk_rq_map_sg(hctx->queue, vbr->req, vbr->sg);
if (num) {
if (rq_data_dir(vbr->req) == WRITE)
- vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
+ vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_OUT);
else
- vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
+ vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_IN);
}
spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
@@ -476,7 +477,8 @@ static int virtblk_get_cache_mode(struct virtio_device *vdev)
struct virtio_blk_config, wce,
&writeback);
if (err)
- writeback = virtio_has_feature(vdev, VIRTIO_BLK_F_WCE);
+ writeback = virtio_has_feature(vdev, VIRTIO_BLK_F_WCE) ||
+ virtio_has_feature(vdev, VIRTIO_F_VERSION_1);
return writeback;
}
@@ -821,25 +823,35 @@ static const struct virtio_device_id id_table[] = {
{ 0 },
};
-static unsigned int features[] = {
+static unsigned int features_legacy[] = {
VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE, VIRTIO_BLK_F_SCSI,
VIRTIO_BLK_F_WCE, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
VIRTIO_BLK_F_MQ,
+}
+;
+static unsigned int features[] = {
+ VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
+ VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
+ VIRTIO_BLK_F_TOPOLOGY,
+ VIRTIO_BLK_F_MQ,
+ VIRTIO_F_VERSION_1,
};
static struct virtio_driver virtio_blk = {
- .feature_table = features,
- .feature_table_size = ARRAY_SIZE(features),
- .driver.name = KBUILD_MODNAME,
- .driver.owner = THIS_MODULE,
- .id_table = id_table,
- .probe = virtblk_probe,
- .remove = virtblk_remove,
- .config_changed = virtblk_config_changed,
+ .feature_table = features,
+ .feature_table_size = ARRAY_SIZE(features),
+ .feature_table_legacy = features_legacy,
+ .feature_table_size_legacy = ARRAY_SIZE(features_legacy),
+ .driver.name = KBUILD_MODNAME,
+ .driver.owner = THIS_MODULE,
+ .id_table = id_table,
+ .probe = virtblk_probe,
+ .remove = virtblk_remove,
+ .config_changed = virtblk_config_changed,
#ifdef CONFIG_PM_SLEEP
- .freeze = virtblk_freeze,
- .restore = virtblk_restore,
+ .freeze = virtblk_freeze,
+ .restore = virtblk_restore,
#endif
};
--
MST
^ permalink raw reply related
* [PATCH v7 37/46] tun: move internal flag defines out of uapi
From: Michael S. Tsirkin @ 2014-11-30 15:12 UTC (permalink / raw)
To: linux-kernel
Cc: David Miller, cornelia.huck, rusty, nab, pbonzini, thuth, dahi,
Jason Wang, Zhi Yong Wu, Herbert Xu, Tom Herbert, Ben Hutchings,
Masatake YAMATO, Xi Wang, netdev, linux-api
In-Reply-To: <1417359787-10138-1-git-send-email-mst@redhat.com>
TUN_ flags are internal and never exposed
to userspace. Any application using it is almost
certainly buggy.
Move them out to tun.c.
Note: we remove these completely in follow-up patches,
this code movement is split out for ease of review.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
include/uapi/linux/if_tun.h | 16 ++--------
drivers/net/tun.c | 72 +++++++++++++--------------------------------
2 files changed, 24 insertions(+), 64 deletions(-)
diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index e9502dd..277a260 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -22,21 +22,11 @@
/* Read queue size */
#define TUN_READQ_SIZE 500
-
-/* TUN device flags */
-#define TUN_TUN_DEV 0x0001
-#define TUN_TAP_DEV 0x0002
+/* TUN device type flags: deprecated. Use IFF_TUN/IFF_TAP instead. */
+#define TUN_TUN_DEV IFF_TUN
+#define TUN_TAP_DEV IFF_TAP
#define TUN_TYPE_MASK 0x000f
-#define TUN_FASYNC 0x0010
-#define TUN_NOCHECKSUM 0x0020
-#define TUN_NO_PI 0x0040
-/* This flag has no real effect */
-#define TUN_ONE_QUEUE 0x0080
-#define TUN_PERSIST 0x0100
-#define TUN_VNET_HDR 0x0200
-#define TUN_TAP_MQ 0x0400
-
/* Ioctl defines */
#define TUNSETNOCSUM _IOW('T', 200, int)
#define TUNSETDEBUG _IOW('T', 201, int)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 9dd3746..b4e4ca0 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -103,6 +103,21 @@ do { \
} while (0)
#endif
+/* TUN device flags */
+
+/* IFF_ATTACH_QUEUE is never stored in device flags,
+ * overload it to mean fasync when stored there.
+ */
+#define TUN_FASYNC IFF_ATTACH_QUEUE
+#define TUN_NO_PI IFF_NO_PI
+/* This flag has no real effect */
+#define TUN_ONE_QUEUE IFF_ONE_QUEUE
+#define TUN_PERSIST IFF_PERSIST
+#define TUN_VNET_HDR IFF_VNET_HDR
+#define TUN_TAP_MQ IFF_MULTI_QUEUE
+
+#define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
+ IFF_MULTI_QUEUE)
#define GOODCOPY_LEN 128
#define FLT_EXACT_COUNT 8
@@ -1521,32 +1536,7 @@ static struct proto tun_proto = {
static int tun_flags(struct tun_struct *tun)
{
- int flags = 0;
-
- if (tun->flags & TUN_TUN_DEV)
- flags |= IFF_TUN;
- else
- flags |= IFF_TAP;
-
- if (tun->flags & TUN_NO_PI)
- flags |= IFF_NO_PI;
-
- /* This flag has no real effect. We track the value for backwards
- * compatibility.
- */
- if (tun->flags & TUN_ONE_QUEUE)
- flags |= IFF_ONE_QUEUE;
-
- if (tun->flags & TUN_VNET_HDR)
- flags |= IFF_VNET_HDR;
-
- if (tun->flags & TUN_TAP_MQ)
- flags |= IFF_MULTI_QUEUE;
-
- if (tun->flags & TUN_PERSIST)
- flags |= IFF_PERSIST;
-
- return flags;
+ return tun->flags & (TUN_FEATURES | IFF_PERSIST | IFF_TUN | IFF_TAP);
}
static ssize_t tun_show_flags(struct device *dev, struct device_attribute *attr,
@@ -1706,28 +1696,8 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
tun_debug(KERN_INFO, tun, "tun_set_iff\n");
- if (ifr->ifr_flags & IFF_NO_PI)
- tun->flags |= TUN_NO_PI;
- else
- tun->flags &= ~TUN_NO_PI;
-
- /* This flag has no real effect. We track the value for backwards
- * compatibility.
- */
- if (ifr->ifr_flags & IFF_ONE_QUEUE)
- tun->flags |= TUN_ONE_QUEUE;
- else
- tun->flags &= ~TUN_ONE_QUEUE;
-
- if (ifr->ifr_flags & IFF_VNET_HDR)
- tun->flags |= TUN_VNET_HDR;
- else
- tun->flags &= ~TUN_VNET_HDR;
-
- if (ifr->ifr_flags & IFF_MULTI_QUEUE)
- tun->flags |= TUN_TAP_MQ;
- else
- tun->flags &= ~TUN_TAP_MQ;
+ tun->flags = (tun->flags & ~TUN_FEATURES) |
+ (ifr->ifr_flags & TUN_FEATURES);
/* Make sure persistent devices do not get stuck in
* xoff state.
@@ -1890,9 +1860,9 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
if (cmd == TUNGETFEATURES) {
/* Currently this just means: "what IFF flags are valid?".
* This is needed because we never checked for invalid flags on
- * TUNSETIFF. */
- return put_user(IFF_TUN | IFF_TAP | IFF_NO_PI | IFF_ONE_QUEUE |
- IFF_VNET_HDR | IFF_MULTI_QUEUE,
+ * TUNSETIFF.
+ */
+ return put_user(IFF_TUN | IFF_TAP | TUN_FEATURES,
(unsigned int __user*)argp);
} else if (cmd == TUNSETQUEUE)
return tun_set_queue(file, &ifr);
--
MST
^ permalink raw reply related
* [PATCH v7 39/46] tun: add VNET_LE flag
From: Michael S. Tsirkin @ 2014-11-30 15:12 UTC (permalink / raw)
To: linux-kernel
Cc: David Miller, cornelia.huck, rusty, nab, pbonzini, thuth, dahi,
netdev, linux-api
In-Reply-To: <1417359787-10138-1-git-send-email-mst@redhat.com>
virtio 1.0 modified virtio net header format,
making all fields little endian.
Users can tweak header format before submitting it to tun,
but this means more data copies where none were necessary.
And if the iovec is in RO memory, this means we might
need to split iovec also means we might in theory overflow
iovec max size.
This patch adds a simpler way for applications to handle this,
using new "little endian" flag in tun.
As a result, tun simply byte-swaps header fields as appropriate.
This is a NOP on LE architectures.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
include/uapi/linux/if_tun.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index 277a260..18b2403 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -57,6 +57,7 @@
#define IFF_ONE_QUEUE 0x2000
#define IFF_VNET_HDR 0x4000
#define IFF_TUN_EXCL 0x8000
+#define IFF_VNET_LE 0x10000
#define IFF_MULTI_QUEUE 0x0100
#define IFF_ATTACH_QUEUE 0x0200
#define IFF_DETACH_QUEUE 0x0400
--
MST
^ permalink raw reply related
* [PATCH v7 44/46] virtio_scsi: export to userspace
From: Michael S. Tsirkin @ 2014-11-30 15:12 UTC (permalink / raw)
To: linux-kernel
Cc: David Miller, cornelia.huck, rusty, nab, pbonzini, thuth, dahi,
Rusty Russell, Greg Kroah-Hartman, Alexei Starovoitov,
Lad, Prabhakar, Bjarke Istrup Pedersen, Andy Grover,
stephen hemminger, Sakari Ailus, Anup Patel, linux-api,
virtualization
In-Reply-To: <1417359787-10138-1-git-send-email-mst@redhat.com>
Replace uXX by __uXX and _packed by __attribute((packed))
as seems to be the norm for userspace headers.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
---
include/uapi/linux/virtio_scsi.h | 74 ++++++++++++++++++++--------------------
include/uapi/linux/Kbuild | 1 +
2 files changed, 38 insertions(+), 37 deletions(-)
diff --git a/include/uapi/linux/virtio_scsi.h b/include/uapi/linux/virtio_scsi.h
index af44864..42b9370 100644
--- a/include/uapi/linux/virtio_scsi.h
+++ b/include/uapi/linux/virtio_scsi.h
@@ -34,78 +34,78 @@
/* SCSI command request, followed by data-out */
struct virtio_scsi_cmd_req {
- u8 lun[8]; /* Logical Unit Number */
+ __u8 lun[8]; /* Logical Unit Number */
__virtio64 tag; /* Command identifier */
- u8 task_attr; /* Task attribute */
- u8 prio; /* SAM command priority field */
- u8 crn;
- u8 cdb[VIRTIO_SCSI_CDB_SIZE];
-} __packed;
+ __u8 task_attr; /* Task attribute */
+ __u8 prio; /* SAM command priority field */
+ __u8 crn;
+ __u8 cdb[VIRTIO_SCSI_CDB_SIZE];
+} __attribute__((packed));
/* SCSI command request, followed by protection information */
struct virtio_scsi_cmd_req_pi {
- u8 lun[8]; /* Logical Unit Number */
+ __u8 lun[8]; /* Logical Unit Number */
__virtio64 tag; /* Command identifier */
- u8 task_attr; /* Task attribute */
- u8 prio; /* SAM command priority field */
- u8 crn;
+ __u8 task_attr; /* Task attribute */
+ __u8 prio; /* SAM command priority field */
+ __u8 crn;
__virtio32 pi_bytesout; /* DataOUT PI Number of bytes */
__virtio32 pi_bytesin; /* DataIN PI Number of bytes */
- u8 cdb[VIRTIO_SCSI_CDB_SIZE];
-} __packed;
+ __u8 cdb[VIRTIO_SCSI_CDB_SIZE];
+} __attribute__((packed));
/* Response, followed by sense data and data-in */
struct virtio_scsi_cmd_resp {
__virtio32 sense_len; /* Sense data length */
__virtio32 resid; /* Residual bytes in data buffer */
__virtio16 status_qualifier; /* Status qualifier */
- u8 status; /* Command completion status */
- u8 response; /* Response values */
- u8 sense[VIRTIO_SCSI_SENSE_SIZE];
-} __packed;
+ __u8 status; /* Command completion status */
+ __u8 response; /* Response values */
+ __u8 sense[VIRTIO_SCSI_SENSE_SIZE];
+} __attribute__((packed));
/* Task Management Request */
struct virtio_scsi_ctrl_tmf_req {
__virtio32 type;
__virtio32 subtype;
- u8 lun[8];
+ __u8 lun[8];
__virtio64 tag;
-} __packed;
+} __attribute__((packed));
struct virtio_scsi_ctrl_tmf_resp {
- u8 response;
-} __packed;
+ __u8 response;
+} __attribute__((packed));
/* Asynchronous notification query/subscription */
struct virtio_scsi_ctrl_an_req {
__virtio32 type;
- u8 lun[8];
+ __u8 lun[8];
__virtio32 event_requested;
-} __packed;
+} __attribute__((packed));
struct virtio_scsi_ctrl_an_resp {
__virtio32 event_actual;
- u8 response;
-} __packed;
+ __u8 response;
+} __attribute__((packed));
struct virtio_scsi_event {
__virtio32 event;
- u8 lun[8];
+ __u8 lun[8];
__virtio32 reason;
-} __packed;
+} __attribute__((packed));
struct virtio_scsi_config {
- u32 num_queues;
- u32 seg_max;
- u32 max_sectors;
- u32 cmd_per_lun;
- u32 event_info_size;
- u32 sense_size;
- u32 cdb_size;
- u16 max_channel;
- u16 max_target;
- u32 max_lun;
-} __packed;
+ __u32 num_queues;
+ __u32 seg_max;
+ __u32 max_sectors;
+ __u32 cmd_per_lun;
+ __u32 event_info_size;
+ __u32 sense_size;
+ __u32 cdb_size;
+ __u16 max_channel;
+ __u16 max_target;
+ __u32 max_lun;
+} __attribute__((packed));
/* Feature Bits */
#define VIRTIO_SCSI_F_INOUT 0
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 44a5581..e61947c 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -428,6 +428,7 @@ header-y += virtio_net.h
header-y += virtio_pci.h
header-y += virtio_ring.h
header-y += virtio_rng.h
+header-y += virtio_scsi.h
header=y += vm_sockets.h
header-y += vt.h
header-y += wait.h
--
MST
^ permalink raw reply related
* Re: [RFC PATCH] proc, pidns: Add highpid
From: David Herrmann @ 2014-11-30 16:45 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Linux API, Andrew Morton, systemd Mailing List, linux-kernel,
Eric W. Biederman
In-Reply-To: <b0f6c4df0e8ef8afcc7b786edecb4be8c752941e.1417215468.git.luto@amacapital.net>
Hi Andy
On Sat, Nov 29, 2014 at 12:05 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> Pid reuse is common, which means that it's difficult or impossible
> to read information about a pid from /proc without races.
>
> This introduces a second number associated with each (task, pidns)
> pair called highpid. Highpid is a 64-bit number, and, barring
> extremely unlikely circumstances or outright error, a (highpid, pid)
> will never be reused.
>
> With just this change, a program can open /proc/PID/status, read the
> "Highpid" field, and confirm that it has the expected value. If the
> pid has been reused, then highpid will be different.
>
> The initial implementation is straightforward: highpid is simply a
> 64-bit counter. If a high-end system can fork every 3 ns (which
> would be amazing, given that just allocating a pid requires at
> atomic operation), it would take well over 1000 years for highpid to
> wrap.
>
> For CRIU's benefit, the next highpid can be set by a privileged
> user.
>
> NB: The sysctl stuff only works on 64-bit systems. If the approach
> looks good, I'll fix that somehow.
>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>
> If this goes in, there's plenty of room to add new interfaces to
> make this more useful. For example, we could add a fancier tgkill
> that adds and validates hightgid and highpid, and we might want to
> add a syscall to read one's own hightgid and highpid. These would
> be quite useful for pidfiles.
>
> David, would this be useful for kdbus?
Much appreciated! This would serve well as replacement for
'starttime'. I'd prefer if pid_t was 64bit, but I guess that ship
sailed long ago. Though, your patch might in the end just introduce a
new pid64, which replaces the old pid and lives in parallel.
Anyway, considering that we actually want the same pid-reuse
protection for tid, tgid, ppid and so on, we'd have to add a
'starttime' for all of them. Sounds ugly.. so we might just end up
dropping 'starttime' and introduce KDBUS_ITEM_PIDS2 whenever a 'fix'
is merged upstream.
Thanks a lot!
David
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel
^ permalink raw reply
* Re: kdbus: add documentation
From: David Herrmann @ 2014-11-30 17:12 UTC (permalink / raw)
To: Florian Weimer
Cc: Andy Lutomirski, Greg Kroah-Hartman, Arnd Bergmann,
Eric W. Biederman, One Thousand Gnomes, Tom Gundersen,
Jiri Kosina, Linux API, linux-kernel@vger.kernel.org, Daniel Mack,
Djalal Harouni
In-Reply-To: <87tx1hxdgy.fsf@mid.deneb.enyo.de>
Hi
On Sun, Nov 30, 2014 at 10:08 AM, Florian Weimer <fw@deneb.enyo.de> wrote:
> * Andy Lutomirski:
>
>> At the risk of opening a can of worms, wouldn't this be much more
>> useful if you could share a pool between multiple connections?
>
> They would also be useful to reduce context switches when receiving
> data from all kinds of descriptors. At present, when polling, you
> receive notification, and then you have to call into the kernel,
> again, to actually fetch the data and associated information.
poll(2) and friends cannot return data for changed descriptors. I
think a single trap for each KDBUS_CMD_MSG_RECV is acceptable. If this
turns out to be a bottleneck, we can provide bulk-operations in the
future. Anyway, I don't see how a _shared_ pool would change any of
this?
> kernel could also queue the data for one specific recipient,
> addressing the same issue that SO_REUSEPORT tries to solve (on poll
> notification, the kernel does not know which recipient will eventually
> retrieve the data, so it has to notify and wake up all of them).
We already queue data only for the addressed recipients. We *do* know
all recipients of a message at poll-notification time. We only wake up
recipients that actually got a message queued.
Not sure how this is related to SO_REUSEPORT. Can you elaborate on
your optimizations?
Thanks
David
^ permalink raw reply
* Re: kdbus: add documentation
From: David Herrmann @ 2014-11-30 17:15 UTC (permalink / raw)
To: Florian Weimer
Cc: Greg Kroah-Hartman, Arnd Bergmann, Eric W. Biederman,
One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Andy Lutomirski,
Linux API, linux-kernel, Daniel Mack, Djalal Harouni
In-Reply-To: <871tolysbb.fsf-ZqZwdwZz9NfTBotR3TxKnbNAH6kLmebB@public.gmane.org>
Hi
On Sun, Nov 30, 2014 at 10:02 AM, Florian Weimer <fw-d32yF4oPJVt0XxTmqZlbVQ@public.gmane.org> wrote:
> * Greg Kroah-Hartman:
>
>> +7.4 Receiving messages
>
>> +Also, if the connection allowed for file descriptor to be passed
>> +(KDBUS_HELLO_ACCEPT_FD), and if the message contained any, they will be
>> +installed into the receiving process after the KDBUS_CMD_MSG_RECV ioctl
>> +returns. The receiving task is obliged to close all of them appropriately.
>
> What happens if this is not possible because the file descriptor limit
> of the processes would be exceeded? EMFILE, and the message will not
> be received?
The message is returned without installing the FDs. This is signaled
by EMFILE, but a valid pool offset.
Thanks
David
^ permalink raw reply
* Re: kdbus: add documentation
From: David Herrmann @ 2014-11-30 17:17 UTC (permalink / raw)
To: Florian Weimer
Cc: Greg Kroah-Hartman, Arnd Bergmann, Eric W. Biederman,
One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Andy Lutomirski,
Linux API, linux-kernel, Daniel Mack, Djalal Harouni
In-Reply-To: <8761dxysl0.fsf-ZqZwdwZz9NfTBotR3TxKnbNAH6kLmebB@public.gmane.org>
Hi
On Sun, Nov 30, 2014 at 9:56 AM, Florian Weimer <fw-d32yF4oPJVt0XxTmqZlbVQ@public.gmane.org> wrote:
> * Greg Kroah-Hartman:
>
>> +The focus of this document is an overview of the low-level, native kernel D-Bus
>> +transport called kdbus. Kdbus exposes its functionality via files in a
>> +filesystem called 'kdbusfs'. All communication between processes takes place
>> +via ioctls on files exposed through the mount point of a kdbusfs. The default
>> +mount point of kdbusfs is /sys/fs/kdbus.
>
> Does this mean the bus does not enforce the correctness of the D-Bus
> introspection metadata? That's really unfortunate. Classic D-Bus
> does not do this, either, and combined with the variety of approaches
> used to implement D-Bus endpoints, it makes it really difficult to
> figure out what D-Bus services, exactly, a process provides.
kdbus operates on the transport-level only. We never touch or look at
transferred data. As such, DBus introspection data as defined by
org.freedesktop.DBus.Introspectable is not verified by the transport
layer.
Thanks
David
^ permalink raw reply
* Re: kdbus: add documentation
From: Florian Weimer @ 2014-11-30 17:22 UTC (permalink / raw)
To: David Herrmann
Cc: Andy Lutomirski, Greg Kroah-Hartman, Arnd Bergmann,
Eric W. Biederman, One Thousand Gnomes, Tom Gundersen,
Jiri Kosina, Linux API, linux-kernel@vger.kernel.org, Daniel Mack,
Djalal Harouni
In-Reply-To: <CANq1E4QJhbypbs1PueKdW7AmBPiiYEg1dN-fsAewDJap82feyg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
* David Herrmann:
> poll(2) and friends cannot return data for changed descriptors. I
> think a single trap for each KDBUS_CMD_MSG_RECV is acceptable. If this
> turns out to be a bottleneck, we can provide bulk-operations in the
> future. Anyway, I don't see how a _shared_ pool would change any of
> this?
I responded to Andy's messages because it seemed to be about
generalizing the pool functionality.
>> kernel could also queue the data for one specific recipient,
>> addressing the same issue that SO_REUSEPORT tries to solve (on poll
>> notification, the kernel does not know which recipient will eventually
>> retrieve the data, so it has to notify and wake up all of them).
>
> We already queue data only for the addressed recipients. We *do* know
> all recipients of a message at poll-notification time. We only wake up
> recipients that actually got a message queued.
Exactly, but poll on, say, UDP sockets, does not work this way. What
I'm trying to say is that this functionality is interesting for much
more than kdbus.
> Not sure how this is related to SO_REUSEPORT. Can you elaborate on
> your optimizations?
Without something like SO_REUSEPORT, it is a bad idea to have multiple
threads polling the same socket. The semantics are such that the
kernel has to wake *all* the waiting threads, and one of them will
eventually pick up the datagram with a separate system call. But the
kernel does not know which thread this will be.
With SO_REUSEPORT and a separately created socket for each polling
thread, the kernel will only signal one poll operation because it
assumes that any of the waiting threads will process the datagram, so
it's sufficient just to notify one of them.
kdbus behaves like the latter, but also saves the need to separately
obtain the datagram and related data from the kernel.
^ permalink raw reply
* Re: kdbus: add documentation
From: Florian Weimer @ 2014-11-30 17:23 UTC (permalink / raw)
To: David Herrmann
Cc: Greg Kroah-Hartman, Arnd Bergmann, Eric W. Biederman,
One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Andy Lutomirski,
Linux API, linux-kernel, Daniel Mack, Djalal Harouni
In-Reply-To: <CANq1E4RjwCVEVT7031CeiKrNvRFkLBZc4-X1uKdrdzpf=EwJ+A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
* David Herrmann:
> On Sun, Nov 30, 2014 at 10:02 AM, Florian Weimer <fw-d32yF4oPJVt0XxTmqZlbVQ@public.gmane.org> wrote:
>> * Greg Kroah-Hartman:
>>
>>> +7.4 Receiving messages
>> What happens if this is not possible because the file descriptor limit
>> of the processes would be exceeded? EMFILE, and the message will not
>> be received?
>
> The message is returned without installing the FDs. This is signaled
> by EMFILE, but a valid pool offset.
Oh. This is really surprising, so it needs documentation. But it's
probably better than the alternative (return EMFILE and leave the
message stuck, so that you receive it immediately again—this behavior
makes non-blocking accept rather difficult to use correctly).
^ permalink raw reply
* Re: [PATCH v7 44/46] virtio_scsi: export to userspace
From: Prabhakar Lad @ 2014-11-30 21:43 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: LKML, David Miller, cornelia.huck-tA70FqPdS9bQT0dZR+AlfA,
rusty-8fk3Idey6ehBDgjK7y7TUQ, nab-IzHhD5pYlfBP7FQvKIMDCQ,
pbonzini-H+wXaHxf7aLQT0dZR+AlfA,
thuth-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
dahi-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Rusty Russell,
Greg Kroah-Hartman, Alexei Starovoitov, Bjarke Istrup Pedersen,
Andy Grover, stephen hemminger, Sakari Ailus, Anup Patel,
linux-api,
virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
In-Reply-To: <1417359787-10138-45-git-send-email-mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Hi Michael,
Thanks for the patch.
On Sun, Nov 30, 2014 at 3:12 PM, Michael S. Tsirkin <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Replace uXX by __uXX and _packed by __attribute((packed))
> as seems to be the norm for userspace headers.
>
> Signed-off-by: Michael S. Tsirkin <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Acked-by: Paolo Bonzini <pbonzini-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
> include/uapi/linux/virtio_scsi.h | 74 ++++++++++++++++++++--------------------
> include/uapi/linux/Kbuild | 1 +
Probably worth making a separate patch for this ? as it doesnt
match with patch description aswel.
Apart from that patch looks good.
Acked-by: Lad, Prabhakar <prabhakar.csengg-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Regards,
--Prabhakar Lad
^ permalink raw reply
* Re: [RFC PATCH] proc, pidns: Add highpid
From: Andy Lutomirski @ 2014-11-30 22:05 UTC (permalink / raw)
To: David Herrmann
Cc: Linux API, Andrew Morton, systemd Mailing List,
linux-kernel@vger.kernel.org, Eric W. Biederman
In-Reply-To: <CANq1E4Qv1pukOr8ocUtE0QzLgWFEb3pT=5JXvERj98m-j9FS3Q@mail.gmail.com>
On Nov 30, 2014 9:45 AM, "David Herrmann" <dh.herrmann@gmail.com> wrote:
>
> Hi Andy
>
> On Sat, Nov 29, 2014 at 12:05 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> > Pid reuse is common, which means that it's difficult or impossible
> > to read information about a pid from /proc without races.
> >
> > This introduces a second number associated with each (task, pidns)
> > pair called highpid. Highpid is a 64-bit number, and, barring
> > extremely unlikely circumstances or outright error, a (highpid, pid)
> > will never be reused.
> >
> > With just this change, a program can open /proc/PID/status, read the
> > "Highpid" field, and confirm that it has the expected value. If the
> > pid has been reused, then highpid will be different.
> >
> > The initial implementation is straightforward: highpid is simply a
> > 64-bit counter. If a high-end system can fork every 3 ns (which
> > would be amazing, given that just allocating a pid requires at
> > atomic operation), it would take well over 1000 years for highpid to
> > wrap.
> >
> > For CRIU's benefit, the next highpid can be set by a privileged
> > user.
> >
> > NB: The sysctl stuff only works on 64-bit systems. If the approach
> > looks good, I'll fix that somehow.
> >
> > Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> > ---
> >
> > If this goes in, there's plenty of room to add new interfaces to
> > make this more useful. For example, we could add a fancier tgkill
> > that adds and validates hightgid and highpid, and we might want to
> > add a syscall to read one's own hightgid and highpid. These would
> > be quite useful for pidfiles.
> >
> > David, would this be useful for kdbus?
>
> Much appreciated! This would serve well as replacement for
> 'starttime'. I'd prefer if pid_t was 64bit, but I guess that ship
> sailed long ago. Though, your patch might in the end just introduce a
> new pid64, which replaces the old pid and lives in parallel.
>
> Anyway, considering that we actually want the same pid-reuse
> protection for tid, tgid, ppid and so on, we'd have to add a
> 'starttime' for all of them. Sounds ugly.. so we might just end up
> dropping 'starttime' and introduce KDBUS_ITEM_PIDS2 whenever a 'fix'
> is merged upstream.
I don't understand "we want". Doesn't kdbus currently only think
about tid and tgid?
In any event, I just looked at the kdbus code
(https://github.com/gregkh/kdbus/blob/master/metadata.c), and I see:
meta->pid = get_pid(task_pid(current));
meta->tgid = get_pid(task_tgid(current));
meta->starttime = current->start_time,
I don't understand how that concept of starttime is particularly
useful. Isn't that the start time associated with tid? How can that
be useful when looking up tgid?
Regardless, my patch here associates a highpid with each struct pid,
and both tid and tgid are just struct pids, so you get both. I agree
that adding Highppid to /proc might be useful, but that seems like a
future extension that has nothing to do with kdbus.
--Andy
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel
^ permalink raw reply
* Re: [RFC PATCH] proc, pidns: Add highpid
From: Andy Lutomirski @ 2014-11-30 22:08 UTC (permalink / raw)
To: Florian Weimer
Cc: systemd Mailing List, Pavel Emelyanov, Linux API,
linux-kernel@vger.kernel.org, criu, Eric W. Biederman,
Cyrill Gorcunov, Andrew Morton
In-Reply-To: <87k32d13d5.fsf@mid.deneb.enyo.de>
On Nov 30, 2014 1:47 AM, "Florian Weimer" <fw@deneb.enyo.de> wrote:
>
> * Andy Lutomirski:
>
> > The initial implementation is straightforward: highpid is simply a
> > 64-bit counter. If a high-end system can fork every 3 ns (which
> > would be amazing, given that just allocating a pid requires at
> > atomic operation), it would take well over 1000 years for highpid to
> > wrap.
>
> I'm not sure if I'm reading the patch correctly, but is the counter
> namespaced? If yes, why?
It's namespaced so that CRIU can migrate/restore a whole pid namespace.
--Andy
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel
^ permalink raw reply
* Re: [PATCH v17 1/7] mm: support madvise(MADV_FREE)
From: Minchan Kim @ 2014-11-30 23:56 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Michael Kerrisk,
linux-api-u79uwXL29TY76Z2rM5mHXA, Hugh Dickins, Johannes Weiner,
Rik van Riel, KOSAKI Motohiro, Mel Gorman, Jason Evans,
zhangyanfei-BthXqXjhjHXQFUHtdCDX3A, Kirill A. Shutemov,
Kirill A. Shutemov
In-Reply-To: <20141127144725.GB19157-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
Hello Michal,
On Thu, Nov 27, 2014 at 03:47:25PM +0100, Michal Hocko wrote:
> [Late but I didn't get to this soone - I hope this is still up-to-date
> version]
>
> On Mon 20-10-14 19:11:58, Minchan Kim wrote:
> > Linux doesn't have an ability to free pages lazy while other OS
> > already have been supported that named by madvise(MADV_FREE).
> >
> > The gain is clear that kernel can discard freed pages rather than
> > swapping out or OOM if memory pressure happens.
> >
> > Without memory pressure, freed pages would be reused by userspace
> > without another additional overhead(ex, page fault + allocation
> > + zeroing).
> >
> > How to work is following as.
> >
> > When madvise syscall is called, VM clears dirty bit of ptes of
> > the range. If memory pressure happens, VM checks dirty bit of
> > page table and if it found still "clean", it means it's a
> > "lazyfree pages" so VM could discard the page instead of swapping out.
> > Once there was store operation for the page before VM peek a page
> > to reclaim, dirty bit is set so VM can swap out the page instead of
> > discarding.
>
> Is there any patch for madvise man page? I guess the semantic will be
> same/similar to FreeBSD:
> http://www.freebsd.org/cgi/man.cgi?query=madvise&sektion=2
I postponed because I didn't know when we release the feature into mainline
but I should write down in man page ("MADV_FREE since Linux x.x.x").
However, early posting is not harmful.
Here it goes.
Most of content was copied from FreeBSD man page.
>From 2edd6890f92fa4943ce3c452194479458582d88c Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Date: Mon, 1 Dec 2014 08:53:55 +0900
Subject: [PATCH] madvise.2: Document MADV_FREE
Signed-off-by: Minchan Kim <minchan-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
man2/madvise.2 | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/man2/madvise.2 b/man2/madvise.2
index 032ead7..33aa936 100644
--- a/man2/madvise.2
+++ b/man2/madvise.2
@@ -265,6 +265,19 @@ file (see
.BR MADV_DODUMP " (since Linux 3.4)"
Undo the effect of an earlier
.BR MADV_DONTDUMP .
+.TP
+.BR MADV_FREE " (since Linux 3.19)"
+Gives the VM system the freedom to free pages, and tells the system that
+information in the specified page range is no longer important.
+This is an efficient way of allowing
+.BR malloc (3)
+to free pages anywhere in the address space, while keeping the address space
+valid. The next time that the page is referenced, the page might be demand
+zeroed, or might contain the data that was there before the MADV_FREE call.
+References made to that address space range will not make the VM system page the
+information back in from backing store until the page is modified again.
+It works only with private anonymous pages (see
+.BR mmap (2)).
.SH RETURN VALUE
On success
.BR madvise ()
--
2.0.0
>
> I guess the changelog should be more specific that this is only for the
> private MAP_ANON mappings (same applies to the patch for man).
>
> > Firstly, heavy users would be general allocators(ex, jemalloc,
> > tcmalloc and hope glibc supports it) and jemalloc/tcmalloc already
> > have supported the feature for other OS(ex, FreeBSD)
> >
> [...]
> >
> > Cc: Michael Kerrisk <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> > Cc: Linux API <linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
> > Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> > Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> > Cc: KOSAKI Motohiro <kosaki.motohiro-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> > Cc: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
> > Cc: Jason Evans <je-b10kYP2dOMg@public.gmane.org>
> > Acked-by: Kirill A. Shutemov <kirill.shutemov-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> > Acked-by: Zhang Yanfei <zhangyanfei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> > Acked-by: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > Signed-off-by: Minchan Kim <minchan-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
>
> Reviewed-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> [...]
> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo-Bw31MaZKKs0EbZ0PF+XxCw@public.gmane.org For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org"> email-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org </a>
--
Kind regards,
Minchan Kim
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox