* [PATCH v2 0/3] util/userfaultfd: Support /dev/userfaultfd
@ 2023-02-01 21:10 Peter Xu
  2023-02-01 21:10 ` [PATCH v2 1/3] linux-headers: Update to v6.1 Peter Xu
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Peter Xu @ 2023-02-01 21:10 UTC
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, Juan Quintela, Michal Prívozník,
      Daniel P . Berrangé, peterx, Philippe Mathieu-Daudé,
      Dr . David Alan Gilbert

v2:
- Added R-bs from Phil
- Move open_mode into uffd_detect_open_mode() [Phil]
- Document uffd_open() in the header file [Phil]
- [Discussed with Daniel/Michal, decided to leave fd support for later]

The new /dev/userfaultfd handle is superior to the system call: it offers
better permission control and also works in a restricted seccomp
environment.

The device was only introduced in Linux v6.1, so we need a header update.

Please have a look, thanks.

Peter Xu (3):
  linux-headers: Update to v6.1
  util/userfaultfd: Add uffd_open()
  util/userfaultfd: Support /dev/userfaultfd

 include/qemu/userfaultfd.h                    |   8 +
 include/standard-headers/drm/drm_fourcc.h     |  34 ++++-
 include/standard-headers/linux/ethtool.h      |  63 +++++++-
 include/standard-headers/linux/fuse.h         |   6 +-
 .../linux/input-event-codes.h                 |   1 +
 include/standard-headers/linux/virtio_blk.h   |  19 +++
 linux-headers/asm-generic/hugetlb_encode.h    |  26 ++--
 linux-headers/asm-generic/mman-common.h       |   2 +
 linux-headers/asm-mips/mman.h                 |   2 +
 linux-headers/asm-riscv/kvm.h                 |   4 +
 linux-headers/linux/kvm.h                     |   1 +
 linux-headers/linux/psci.h                    |  14 ++
 linux-headers/linux/userfaultfd.h             |   4 +
 linux-headers/linux/vfio.h                    | 142 ++++++++++++++++++
 migration/postcopy-ram.c                      |  11 +-
 tests/qtest/migration-test.c                  |   3 +-
 util/trace-events                             |   1 +
 util/userfaultfd.c                            |  50 +++++-
 18 files changed, 362 insertions(+), 29 deletions(-)

-- 
2.37.3
* [PATCH v2 1/3] linux-headers: Update to v6.1 2023-02-01 21:10 [PATCH v2 0/3] util/userfaultfd: Support /dev/userfaultfd Peter Xu @ 2023-02-01 21:10 ` Peter Xu 2023-02-02 10:53 ` Juan Quintela 2023-02-01 21:10 ` [PATCH v2 2/3] util/userfaultfd: Add uffd_open() Peter Xu 2023-02-01 21:10 ` [PATCH v2 3/3] util/userfaultfd: Support /dev/userfaultfd Peter Xu 2 siblings, 1 reply; 12+ messages in thread From: Peter Xu @ 2023-02-01 21:10 UTC (permalink / raw) To: qemu-devel Cc: Leonardo Bras Soares Passos, Juan Quintela, Michal Prívozník, Daniel P . Berrangé, peterx, Philippe Mathieu-Daudé, Dr . David Alan Gilbert Signed-off-by: Peter Xu <peterx@redhat.com> --- include/standard-headers/drm/drm_fourcc.h | 34 ++++- include/standard-headers/linux/ethtool.h | 63 +++++++- include/standard-headers/linux/fuse.h | 6 +- .../linux/input-event-codes.h | 1 + include/standard-headers/linux/virtio_blk.h | 19 +++ linux-headers/asm-generic/hugetlb_encode.h | 26 ++-- linux-headers/asm-generic/mman-common.h | 2 + linux-headers/asm-mips/mman.h | 2 + linux-headers/asm-riscv/kvm.h | 4 + linux-headers/linux/kvm.h | 1 + linux-headers/linux/psci.h | 14 ++ linux-headers/linux/userfaultfd.h | 4 + linux-headers/linux/vfio.h | 142 ++++++++++++++++++ 13 files changed, 298 insertions(+), 20 deletions(-) diff --git a/include/standard-headers/drm/drm_fourcc.h b/include/standard-headers/drm/drm_fourcc.h index 48b620cbef..b868488f93 100644 --- a/include/standard-headers/drm/drm_fourcc.h +++ b/include/standard-headers/drm/drm_fourcc.h @@ -98,18 +98,42 @@ extern "C" { #define DRM_FORMAT_INVALID 0 /* color index */ +#define DRM_FORMAT_C1 fourcc_code('C', '1', ' ', ' ') /* [7:0] C0:C1:C2:C3:C4:C5:C6:C7 1:1:1:1:1:1:1:1 eight pixels/byte */ +#define DRM_FORMAT_C2 fourcc_code('C', '2', ' ', ' ') /* [7:0] C0:C1:C2:C3 2:2:2:2 four pixels/byte */ +#define DRM_FORMAT_C4 fourcc_code('C', '4', ' ', ' ') /* [7:0] C0:C1 4:4 two pixels/byte */ #define DRM_FORMAT_C8 fourcc_code('C', '8', ' ', ' ') /* [7:0] C */ -/* 8 bpp 
Red */ +/* 1 bpp Darkness (inverse relationship between channel value and brightness) */ +#define DRM_FORMAT_D1 fourcc_code('D', '1', ' ', ' ') /* [7:0] D0:D1:D2:D3:D4:D5:D6:D7 1:1:1:1:1:1:1:1 eight pixels/byte */ + +/* 2 bpp Darkness (inverse relationship between channel value and brightness) */ +#define DRM_FORMAT_D2 fourcc_code('D', '2', ' ', ' ') /* [7:0] D0:D1:D2:D3 2:2:2:2 four pixels/byte */ + +/* 4 bpp Darkness (inverse relationship between channel value and brightness) */ +#define DRM_FORMAT_D4 fourcc_code('D', '4', ' ', ' ') /* [7:0] D0:D1 4:4 two pixels/byte */ + +/* 8 bpp Darkness (inverse relationship between channel value and brightness) */ +#define DRM_FORMAT_D8 fourcc_code('D', '8', ' ', ' ') /* [7:0] D */ + +/* 1 bpp Red (direct relationship between channel value and brightness) */ +#define DRM_FORMAT_R1 fourcc_code('R', '1', ' ', ' ') /* [7:0] R0:R1:R2:R3:R4:R5:R6:R7 1:1:1:1:1:1:1:1 eight pixels/byte */ + +/* 2 bpp Red (direct relationship between channel value and brightness) */ +#define DRM_FORMAT_R2 fourcc_code('R', '2', ' ', ' ') /* [7:0] R0:R1:R2:R3 2:2:2:2 four pixels/byte */ + +/* 4 bpp Red (direct relationship between channel value and brightness) */ +#define DRM_FORMAT_R4 fourcc_code('R', '4', ' ', ' ') /* [7:0] R0:R1 4:4 two pixels/byte */ + +/* 8 bpp Red (direct relationship between channel value and brightness) */ #define DRM_FORMAT_R8 fourcc_code('R', '8', ' ', ' ') /* [7:0] R */ -/* 10 bpp Red */ +/* 10 bpp Red (direct relationship between channel value and brightness) */ #define DRM_FORMAT_R10 fourcc_code('R', '1', '0', ' ') /* [15:0] x:R 6:10 little endian */ -/* 12 bpp Red */ +/* 12 bpp Red (direct relationship between channel value and brightness) */ #define DRM_FORMAT_R12 fourcc_code('R', '1', '2', ' ') /* [15:0] x:R 4:12 little endian */ -/* 16 bpp Red */ +/* 16 bpp Red (direct relationship between channel value and brightness) */ #define DRM_FORMAT_R16 fourcc_code('R', '1', '6', ' ') /* [15:0] R little endian */ /* 16 bpp RG 
*/ @@ -204,7 +228,9 @@ extern "C" { #define DRM_FORMAT_VYUY fourcc_code('V', 'Y', 'U', 'Y') /* [31:0] Y1:Cb0:Y0:Cr0 8:8:8:8 little endian */ #define DRM_FORMAT_AYUV fourcc_code('A', 'Y', 'U', 'V') /* [31:0] A:Y:Cb:Cr 8:8:8:8 little endian */ +#define DRM_FORMAT_AVUY8888 fourcc_code('A', 'V', 'U', 'Y') /* [31:0] A:Cr:Cb:Y 8:8:8:8 little endian */ #define DRM_FORMAT_XYUV8888 fourcc_code('X', 'Y', 'U', 'V') /* [31:0] X:Y:Cb:Cr 8:8:8:8 little endian */ +#define DRM_FORMAT_XVUY8888 fourcc_code('X', 'V', 'U', 'Y') /* [31:0] X:Cr:Cb:Y 8:8:8:8 little endian */ #define DRM_FORMAT_VUY888 fourcc_code('V', 'U', '2', '4') /* [23:0] Cr:Cb:Y 8:8:8 little endian */ #define DRM_FORMAT_VUY101010 fourcc_code('V', 'U', '3', '0') /* Y followed by U then V, 10:10:10. Non-linear modifier only */ diff --git a/include/standard-headers/linux/ethtool.h b/include/standard-headers/linux/ethtool.h index 4537da20cc..1dc56cdc0a 100644 --- a/include/standard-headers/linux/ethtool.h +++ b/include/standard-headers/linux/ethtool.h @@ -736,6 +736,51 @@ enum ethtool_module_power_mode { ETHTOOL_MODULE_POWER_MODE_HIGH, }; +/** + * enum ethtool_podl_pse_admin_state - operational state of the PoDL PSE + * functions. IEEE 802.3-2018 30.15.1.1.2 aPoDLPSEAdminState + * @ETHTOOL_PODL_PSE_ADMIN_STATE_UNKNOWN: state of PoDL PSE functions are + * unknown + * @ETHTOOL_PODL_PSE_ADMIN_STATE_DISABLED: PoDL PSE functions are disabled + * @ETHTOOL_PODL_PSE_ADMIN_STATE_ENABLED: PoDL PSE functions are enabled + */ +enum ethtool_podl_pse_admin_state { + ETHTOOL_PODL_PSE_ADMIN_STATE_UNKNOWN = 1, + ETHTOOL_PODL_PSE_ADMIN_STATE_DISABLED, + ETHTOOL_PODL_PSE_ADMIN_STATE_ENABLED, +}; + +/** + * enum ethtool_podl_pse_pw_d_status - power detection status of the PoDL PSE. 
+ * IEEE 802.3-2018 30.15.1.1.3 aPoDLPSEPowerDetectionStatus: + * @ETHTOOL_PODL_PSE_PW_D_STATUS_UNKNOWN: PoDL PSE + * @ETHTOOL_PODL_PSE_PW_D_STATUS_DISABLED: "The enumeration “disabled” is + * asserted true when the PoDL PSE state diagram variable mr_pse_enable is + * false" + * @ETHTOOL_PODL_PSE_PW_D_STATUS_SEARCHING: "The enumeration “searching” is + * asserted true when either of the PSE state diagram variables + * pi_detecting or pi_classifying is true." + * @ETHTOOL_PODL_PSE_PW_D_STATUS_DELIVERING: "The enumeration “deliveringPower” + * is asserted true when the PoDL PSE state diagram variable pi_powered is + * true." + * @ETHTOOL_PODL_PSE_PW_D_STATUS_SLEEP: "The enumeration “sleep” is asserted + * true when the PoDL PSE state diagram variable pi_sleeping is true." + * @ETHTOOL_PODL_PSE_PW_D_STATUS_IDLE: "The enumeration “idle” is asserted true + * when the logical combination of the PoDL PSE state diagram variables + * pi_prebiased*!pi_sleeping is true." + * @ETHTOOL_PODL_PSE_PW_D_STATUS_ERROR: "The enumeration “error” is asserted + * true when the PoDL PSE state diagram variable overload_held is true." + */ +enum ethtool_podl_pse_pw_d_status { + ETHTOOL_PODL_PSE_PW_D_STATUS_UNKNOWN = 1, + ETHTOOL_PODL_PSE_PW_D_STATUS_DISABLED, + ETHTOOL_PODL_PSE_PW_D_STATUS_SEARCHING, + ETHTOOL_PODL_PSE_PW_D_STATUS_DELIVERING, + ETHTOOL_PODL_PSE_PW_D_STATUS_SLEEP, + ETHTOOL_PODL_PSE_PW_D_STATUS_IDLE, + ETHTOOL_PODL_PSE_PW_D_STATUS_ERROR, +}; + /** * struct ethtool_gstrings - string set for data tagging * @cmd: Command number = %ETHTOOL_GSTRINGS @@ -1840,6 +1885,20 @@ static inline int ethtool_validate_duplex(uint8_t duplex) #define MASTER_SLAVE_STATE_SLAVE 3 #define MASTER_SLAVE_STATE_ERR 4 +/* These are used to throttle the rate of data on the phy interface when the + * native speed of the interface is higher than the link speed. These should + * not be used for phy interfaces which natively support multiple speeds (e.g. + * MII or SGMII). 
+ */ +/* No rate matching performed. */ +#define RATE_MATCH_NONE 0 +/* The phy sends pause frames to throttle the MAC. */ +#define RATE_MATCH_PAUSE 1 +/* The phy asserts CRS to prevent the MAC from transmitting. */ +#define RATE_MATCH_CRS 2 +/* The MAC is programmed with a sufficiently-large IPG. */ +#define RATE_MATCH_OPEN_LOOP 3 + /* Which connector port. */ #define PORT_TP 0x00 #define PORT_AUI 0x01 @@ -2033,8 +2092,8 @@ enum ethtool_reset_flags { * reported consistently by PHYLIB. Read-only. * @master_slave_cfg: Master/slave port mode. * @master_slave_state: Master/slave port state. + * @rate_matching: Rate adaptation performed by the PHY * @reserved: Reserved for future use; see the note on reserved space. - * @reserved1: Reserved for future use; see the note on reserved space. * @link_mode_masks: Variable length bitmaps. * * If autonegotiation is disabled, the speed and @duplex represent the @@ -2085,7 +2144,7 @@ struct ethtool_link_settings { uint8_t transceiver; uint8_t master_slave_cfg; uint8_t master_slave_state; - uint8_t reserved1[1]; + uint8_t rate_matching; uint32_t reserved[7]; uint32_t link_mode_masks[]; /* layout of link_mode_masks fields: diff --git a/include/standard-headers/linux/fuse.h b/include/standard-headers/linux/fuse.h index bda06258be..713d259768 100644 --- a/include/standard-headers/linux/fuse.h +++ b/include/standard-headers/linux/fuse.h @@ -194,6 +194,9 @@ * - add FUSE_SECURITY_CTX init flag * - add security context to create, mkdir, symlink, and mknod requests * - add FUSE_HAS_INODE_DAX, FUSE_ATTR_DAX + * + * 7.37 + * - add FUSE_TMPFILE */ #ifndef _LINUX_FUSE_H @@ -225,7 +228,7 @@ #define FUSE_KERNEL_VERSION 7 /** Minor version number of this interface */ -#define FUSE_KERNEL_MINOR_VERSION 36 +#define FUSE_KERNEL_MINOR_VERSION 37 /** The node ID of the root inode */ #define FUSE_ROOT_ID 1 @@ -533,6 +536,7 @@ enum fuse_opcode { FUSE_SETUPMAPPING = 48, FUSE_REMOVEMAPPING = 49, FUSE_SYNCFS = 50, + FUSE_TMPFILE = 51, /* CUSE specific 
operations */ CUSE_INIT = 4096, diff --git a/include/standard-headers/linux/input-event-codes.h b/include/standard-headers/linux/input-event-codes.h index 50790aee5a..815f7a1dff 100644 --- a/include/standard-headers/linux/input-event-codes.h +++ b/include/standard-headers/linux/input-event-codes.h @@ -862,6 +862,7 @@ #define ABS_TOOL_WIDTH 0x1c #define ABS_VOLUME 0x20 +#define ABS_PROFILE 0x21 #define ABS_MISC 0x28 diff --git a/include/standard-headers/linux/virtio_blk.h b/include/standard-headers/linux/virtio_blk.h index 2dcc90826a..e81715cd70 100644 --- a/include/standard-headers/linux/virtio_blk.h +++ b/include/standard-headers/linux/virtio_blk.h @@ -40,6 +40,7 @@ #define VIRTIO_BLK_F_MQ 12 /* support more than one vq */ #define VIRTIO_BLK_F_DISCARD 13 /* DISCARD is supported */ #define VIRTIO_BLK_F_WRITE_ZEROES 14 /* WRITE ZEROES is supported */ +#define VIRTIO_BLK_F_SECURE_ERASE 16 /* Secure Erase is supported */ /* Legacy feature bits */ #ifndef VIRTIO_BLK_NO_LEGACY @@ -119,6 +120,21 @@ struct virtio_blk_config { uint8_t write_zeroes_may_unmap; uint8_t unused1[3]; + + /* the next 3 entries are guarded by VIRTIO_BLK_F_SECURE_ERASE */ + /* + * The maximum secure erase sectors (in 512-byte sectors) for + * one segment. + */ + __virtio32 max_secure_erase_sectors; + /* + * The maximum number of secure erase segments in a + * secure erase command. + */ + __virtio32 max_secure_erase_seg; + /* Secure erase commands must be aligned to this number of sectors. */ + __virtio32 secure_erase_sector_alignment; + } QEMU_PACKED; /* @@ -153,6 +169,9 @@ struct virtio_blk_config { /* Write zeroes command */ #define VIRTIO_BLK_T_WRITE_ZEROES 13 +/* Secure erase command */ +#define VIRTIO_BLK_T_SECURE_ERASE 14 + #ifndef VIRTIO_BLK_NO_LEGACY /* Barrier before this op. 
*/ #define VIRTIO_BLK_T_BARRIER 0x80000000 diff --git a/linux-headers/asm-generic/hugetlb_encode.h b/linux-headers/asm-generic/hugetlb_encode.h index 4f3d5aaa11..de687009bf 100644 --- a/linux-headers/asm-generic/hugetlb_encode.h +++ b/linux-headers/asm-generic/hugetlb_encode.h @@ -20,18 +20,18 @@ #define HUGETLB_FLAG_ENCODE_SHIFT 26 #define HUGETLB_FLAG_ENCODE_MASK 0x3f -#define HUGETLB_FLAG_ENCODE_16KB (14 << HUGETLB_FLAG_ENCODE_SHIFT) -#define HUGETLB_FLAG_ENCODE_64KB (16 << HUGETLB_FLAG_ENCODE_SHIFT) -#define HUGETLB_FLAG_ENCODE_512KB (19 << HUGETLB_FLAG_ENCODE_SHIFT) -#define HUGETLB_FLAG_ENCODE_1MB (20 << HUGETLB_FLAG_ENCODE_SHIFT) -#define HUGETLB_FLAG_ENCODE_2MB (21 << HUGETLB_FLAG_ENCODE_SHIFT) -#define HUGETLB_FLAG_ENCODE_8MB (23 << HUGETLB_FLAG_ENCODE_SHIFT) -#define HUGETLB_FLAG_ENCODE_16MB (24 << HUGETLB_FLAG_ENCODE_SHIFT) -#define HUGETLB_FLAG_ENCODE_32MB (25 << HUGETLB_FLAG_ENCODE_SHIFT) -#define HUGETLB_FLAG_ENCODE_256MB (28 << HUGETLB_FLAG_ENCODE_SHIFT) -#define HUGETLB_FLAG_ENCODE_512MB (29 << HUGETLB_FLAG_ENCODE_SHIFT) -#define HUGETLB_FLAG_ENCODE_1GB (30 << HUGETLB_FLAG_ENCODE_SHIFT) -#define HUGETLB_FLAG_ENCODE_2GB (31 << HUGETLB_FLAG_ENCODE_SHIFT) -#define HUGETLB_FLAG_ENCODE_16GB (34 << HUGETLB_FLAG_ENCODE_SHIFT) +#define HUGETLB_FLAG_ENCODE_16KB (14U << HUGETLB_FLAG_ENCODE_SHIFT) +#define HUGETLB_FLAG_ENCODE_64KB (16U << HUGETLB_FLAG_ENCODE_SHIFT) +#define HUGETLB_FLAG_ENCODE_512KB (19U << HUGETLB_FLAG_ENCODE_SHIFT) +#define HUGETLB_FLAG_ENCODE_1MB (20U << HUGETLB_FLAG_ENCODE_SHIFT) +#define HUGETLB_FLAG_ENCODE_2MB (21U << HUGETLB_FLAG_ENCODE_SHIFT) +#define HUGETLB_FLAG_ENCODE_8MB (23U << HUGETLB_FLAG_ENCODE_SHIFT) +#define HUGETLB_FLAG_ENCODE_16MB (24U << HUGETLB_FLAG_ENCODE_SHIFT) +#define HUGETLB_FLAG_ENCODE_32MB (25U << HUGETLB_FLAG_ENCODE_SHIFT) +#define HUGETLB_FLAG_ENCODE_256MB (28U << HUGETLB_FLAG_ENCODE_SHIFT) +#define HUGETLB_FLAG_ENCODE_512MB (29U << HUGETLB_FLAG_ENCODE_SHIFT) +#define HUGETLB_FLAG_ENCODE_1GB (30U << 
HUGETLB_FLAG_ENCODE_SHIFT) +#define HUGETLB_FLAG_ENCODE_2GB (31U << HUGETLB_FLAG_ENCODE_SHIFT) +#define HUGETLB_FLAG_ENCODE_16GB (34U << HUGETLB_FLAG_ENCODE_SHIFT) #endif /* _ASM_GENERIC_HUGETLB_ENCODE_H_ */ diff --git a/linux-headers/asm-generic/mman-common.h b/linux-headers/asm-generic/mman-common.h index 6c1aa92a92..6ce1f1ceb4 100644 --- a/linux-headers/asm-generic/mman-common.h +++ b/linux-headers/asm-generic/mman-common.h @@ -77,6 +77,8 @@ #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ +#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/linux-headers/asm-mips/mman.h b/linux-headers/asm-mips/mman.h index 1be428663c..c6e1fc77c9 100644 --- a/linux-headers/asm-mips/mman.h +++ b/linux-headers/asm-mips/mman.h @@ -103,6 +103,8 @@ #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ +#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/linux-headers/asm-riscv/kvm.h b/linux-headers/asm-riscv/kvm.h index 7351417afd..8985ff234c 100644 --- a/linux-headers/asm-riscv/kvm.h +++ b/linux-headers/asm-riscv/kvm.h @@ -48,6 +48,7 @@ struct kvm_sregs { /* CONFIG registers for KVM_GET_ONE_REG and KVM_SET_ONE_REG */ struct kvm_riscv_config { unsigned long isa; + unsigned long zicbom_block_size; }; /* CORE registers for KVM_GET_ONE_REG and KVM_SET_ONE_REG */ @@ -98,6 +99,9 @@ enum KVM_RISCV_ISA_EXT_ID { KVM_RISCV_ISA_EXT_M, KVM_RISCV_ISA_EXT_SVPBMT, KVM_RISCV_ISA_EXT_SSTC, + KVM_RISCV_ISA_EXT_SVINVAL, + KVM_RISCV_ISA_EXT_ZIHINTPAUSE, + KVM_RISCV_ISA_EXT_ZICBOM, KVM_RISCV_ISA_EXT_MAX, }; diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h index ebdafa576d..b2783c5202 100644 --- a/linux-headers/linux/kvm.h +++ b/linux-headers/linux/kvm.h @@ -1175,6 +1175,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_VM_DISABLE_NX_HUGE_PAGES 220 #define KVM_CAP_S390_ZPCI_OP 221 #define 
KVM_CAP_S390_CPU_TOPOLOGY 222 +#define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223 #ifdef KVM_CAP_IRQ_ROUTING diff --git a/linux-headers/linux/psci.h b/linux-headers/linux/psci.h index 213b2a0f70..e60dfd8907 100644 --- a/linux-headers/linux/psci.h +++ b/linux-headers/linux/psci.h @@ -48,12 +48,26 @@ #define PSCI_0_2_FN64_MIGRATE_INFO_UP_CPU PSCI_0_2_FN64(7) #define PSCI_1_0_FN_PSCI_FEATURES PSCI_0_2_FN(10) +#define PSCI_1_0_FN_CPU_FREEZE PSCI_0_2_FN(11) +#define PSCI_1_0_FN_CPU_DEFAULT_SUSPEND PSCI_0_2_FN(12) +#define PSCI_1_0_FN_NODE_HW_STATE PSCI_0_2_FN(13) #define PSCI_1_0_FN_SYSTEM_SUSPEND PSCI_0_2_FN(14) #define PSCI_1_0_FN_SET_SUSPEND_MODE PSCI_0_2_FN(15) +#define PSCI_1_0_FN_STAT_RESIDENCY PSCI_0_2_FN(16) +#define PSCI_1_0_FN_STAT_COUNT PSCI_0_2_FN(17) + #define PSCI_1_1_FN_SYSTEM_RESET2 PSCI_0_2_FN(18) +#define PSCI_1_1_FN_MEM_PROTECT PSCI_0_2_FN(19) +#define PSCI_1_1_FN_MEM_PROTECT_CHECK_RANGE PSCI_0_2_FN(19) +#define PSCI_1_0_FN64_CPU_DEFAULT_SUSPEND PSCI_0_2_FN64(12) +#define PSCI_1_0_FN64_NODE_HW_STATE PSCI_0_2_FN64(13) #define PSCI_1_0_FN64_SYSTEM_SUSPEND PSCI_0_2_FN64(14) +#define PSCI_1_0_FN64_STAT_RESIDENCY PSCI_0_2_FN64(16) +#define PSCI_1_0_FN64_STAT_COUNT PSCI_0_2_FN64(17) + #define PSCI_1_1_FN64_SYSTEM_RESET2 PSCI_0_2_FN64(18) +#define PSCI_1_1_FN64_MEM_PROTECT_CHECK_RANGE PSCI_0_2_FN64(19) /* PSCI v0.2 power state encoding for CPU_SUSPEND function */ #define PSCI_0_2_POWER_STATE_ID_MASK 0xffff diff --git a/linux-headers/linux/userfaultfd.h b/linux-headers/linux/userfaultfd.h index a3a377cd44..ba5d0df52f 100644 --- a/linux-headers/linux/userfaultfd.h +++ b/linux-headers/linux/userfaultfd.h @@ -12,6 +12,10 @@ #include <linux/types.h> +/* ioctls for /dev/userfaultfd */ +#define USERFAULTFD_IOC 0xAA +#define USERFAULTFD_IOC_NEW _IO(USERFAULTFD_IOC, 0x00) + /* * If the UFFDIO_API is upgraded someday, the UFFDIO_UNREGISTER and * UFFDIO_WAKE ioctls should be defined as _IOW and not as _IOR. 
In diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h index ede44b5572..bee7e42198 100644 --- a/linux-headers/linux/vfio.h +++ b/linux-headers/linux/vfio.h @@ -986,6 +986,148 @@ enum vfio_device_mig_state { VFIO_DEVICE_STATE_RUNNING_P2P = 5, }; +/* + * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power + * state with the platform-based power management. Device use of lower power + * states depends on factors managed by the runtime power management core, + * including system level support and coordinating support among dependent + * devices. Enabling device low power entry does not guarantee lower power + * usage by the device, nor is a mechanism provided through this feature to + * know the current power state of the device. If any device access happens + * (either from the host or through the vfio uAPI) when the device is in the + * low power state, then the host will move the device out of the low power + * state as necessary prior to the access. Once the access is completed, the + * device may re-enter the low power state. For single shot low power support + * with wake-up notification, see + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below. Access to mmap'd + * device regions is disabled on LOW_POWER_ENTRY and may only be resumed after + * calling LOW_POWER_EXIT. + */ +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3 + +/* + * This device feature has the same behavior as + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user + * provides an eventfd for wake-up notification. When the device moves out of + * the low power state for the wake-up, the host will not allow the device to + * re-enter a low power state without a subsequent user call to one of the low + * power entry device feature IOCTLs. Access to mmap'd device regions is + * disabled on LOW_POWER_ENTRY_WITH_WAKEUP and may only be resumed after the + * low power exit. 
The low power exit can happen either through LOW_POWER_EXIT + * or through any other access (where the wake-up notification has been + * generated). The access to mmap'd device regions will not trigger low power + * exit. + * + * The notification through the provided eventfd will be generated only when + * the device has entered and is resumed from a low power state after + * calling this device feature IOCTL. A device that has not entered low power + * state, as managed through the runtime power management core, will not + * generate a notification through the provided eventfd on access. Calling the + * LOW_POWER_EXIT feature is optional in the case where notification has been + * signaled on the provided eventfd that a resume from low power has occurred. + */ +struct vfio_device_low_power_entry_with_wakeup { + __s32 wakeup_eventfd; + __u32 reserved; +}; + +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4 + +/* + * Upon VFIO_DEVICE_FEATURE_SET, disallow use of device low power states as + * previously enabled via VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features. + * This device feature IOCTL may itself generate a wakeup eventfd notification + * in the latter case if the device had previously entered a low power state. + */ +#define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5 + +/* + * Upon VFIO_DEVICE_FEATURE_SET start/stop device DMA logging. + * VFIO_DEVICE_FEATURE_PROBE can be used to detect if the device supports + * DMA logging. + * + * DMA logging allows a device to internally record what DMAs the device is + * initiating and report them back to userspace. It is part of the VFIO + * migration infrastructure that allows implementing dirty page tracking + * during the pre copy phase of live migration. Only DMA WRITEs are logged, + * and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE. 
+ * + * When DMA logging is started a range of IOVAs to monitor is provided and the + * device can optimize its logging to cover only the IOVA range given. Each + * DMA that the device initiates inside the range will be logged by the device + * for later retrieval. + * + * page_size is an input that hints what tracking granularity the device + * should try to achieve. If the device cannot do the hinted page size then + * it's the driver choice which page size to pick based on its support. + * On output the device will return the page size it selected. + * + * ranges is a pointer to an array of + * struct vfio_device_feature_dma_logging_range. + * + * The core kernel code guarantees to support by minimum num_ranges that fit + * into a single kernel page. User space can try higher values but should give + * up if the above can't be achieved as of some driver limitations. + * + * A single call to start device DMA logging can be issued and a matching stop + * should follow at the end. Another start is not allowed in the meantime. + */ +struct vfio_device_feature_dma_logging_control { + __aligned_u64 page_size; + __u32 num_ranges; + __u32 __reserved; + __aligned_u64 ranges; +}; + +struct vfio_device_feature_dma_logging_range { + __aligned_u64 iova; + __aligned_u64 length; +}; + +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_START 6 + +/* + * Upon VFIO_DEVICE_FEATURE_SET stop device DMA logging that was started + * by VFIO_DEVICE_FEATURE_DMA_LOGGING_START + */ +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP 7 + +/* + * Upon VFIO_DEVICE_FEATURE_GET read back and clear the device DMA log + * + * Query the device's DMA log for written pages within the given IOVA range. + * During querying the log is cleared for the IOVA range. + * + * bitmap is a pointer to an array of u64s that will hold the output bitmap + * with 1 bit reporting a page_size unit of IOVA. 
The mapping of IOVA to bits + * is given by: + * bitmap[(addr - iova)/page_size] & (1ULL << (addr % 64)) + * + * The input page_size can be any power of two value and does not have to + * match the value given to VFIO_DEVICE_FEATURE_DMA_LOGGING_START. The driver + * will format its internal logging to match the reporting page size, possibly + * by replicating bits if the internal page size is lower than requested. + * + * The LOGGING_REPORT will only set bits in the bitmap and never clear or + * perform any initialization of the user provided bitmap. + * + * If any error is returned userspace should assume that the dirty log is + * corrupted. Error recovery is to consider all memory dirty and try to + * restart the dirty tracking, or to abort/restart the whole migration. + * + * If DMA logging is not enabled, an error will be returned. + * + */ +struct vfio_device_feature_dma_logging_report { + __aligned_u64 iova; + __aligned_u64 length; + __aligned_u64 page_size; + __aligned_u64 bitmap; +}; + +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT 8 + /* -------- API for Type1 VFIO IOMMU -------- */ /** -- 2.37.3 ^ permalink raw reply related [flat|nested] 12+ messages in thread
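A note on the hugetlb_encode.h hunk above, which only adds a U suffix to each constant. The patch gives no rationale; a likely one is that 34 << 26 does not fit in a signed 32-bit int (34 * 2^26 = 2281701376, above INT_MAX = 2147483647), so the unsuffixed shift is undefined behaviour in C. The sketch below shows the encoding scheme itself, log2 of the page size placed in the six bits starting at bit 26, using hypothetical sketch_ names that do not exist in the header.

```c
#include <stdint.h>

/* Same shift and mask values as in the header hunk. */
#define SKETCH_ENCODE_SHIFT 26
#define SKETCH_ENCODE_MASK  0x3f

/* Encode log2(page size) into the upper flag bits, using unsigned math. */
uint32_t sketch_hugetlb_encode(unsigned int log2_size)
{
    return (uint32_t)(log2_size & SKETCH_ENCODE_MASK) << SKETCH_ENCODE_SHIFT;
}

/* Recover log2(page size) from an encoded flag word. */
unsigned int sketch_hugetlb_decode(uint32_t flags)
{
    return (flags >> SKETCH_ENCODE_SHIFT) & SKETCH_ENCODE_MASK;
}
```

For example, a 2 MB page is 2^21 bytes, so sketch_hugetlb_encode(21) matches HUGETLB_FLAG_ENCODE_2MB, and the 16 GB case (2^34) is the one that needs the unsigned type.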
* Re: [PATCH v2 1/3] linux-headers: Update to v6.1
  2023-02-01 21:10 ` [PATCH v2 1/3] linux-headers: Update to v6.1 Peter Xu
@ 2023-02-02 10:53   ` Juan Quintela
  2023-02-02 19:49     ` Peter Xu
  0 siblings, 1 reply; 12+ messages in thread
From: Juan Quintela @ 2023-02-02 10:53 UTC
  To: Peter Xu
  Cc: qemu-devel, Leonardo Bras Soares Passos, Michal Prívozník,
      Daniel P . Berrangé, Philippe Mathieu-Daudé,
      Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> Signed-off-by: Peter Xu <peterx@redhat.com>

How does this change get into the tree?

I know that it is "automagically" generated, but who decides when that
goes into the tree?

As we need that for the following patch?

Later, Juan.
* Re: [PATCH v2 1/3] linux-headers: Update to v6.1
  2023-02-02 10:53   ` Juan Quintela
@ 2023-02-02 19:49     ` Peter Xu
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Xu @ 2023-02-02 19:49 UTC
  To: Juan Quintela, Peter Maydell
  Cc: qemu-devel, Leonardo Bras Soares Passos, Michal Prívozník,
      Daniel P . Berrangé, Philippe Mathieu-Daudé,
      Dr . David Alan Gilbert

On Thu, Feb 02, 2023 at 11:53:34AM +0100, Juan Quintela wrote:
> Peter Xu <peterx@redhat.com> wrote:
> > Signed-off-by: Peter Xu <peterx@redhat.com>
>
> How does this change get into the tree?
>
> I know that it is "automagically" generated, but who decides when that
> goes into the tree?
>
> As we need that for the following patch?

Copying PeterM.  Peter, could you share how we used to do linux header
updates?

Thanks,

-- 
Peter Xu
* [PATCH v2 2/3] util/userfaultfd: Add uffd_open()
  2023-02-01 21:10 [PATCH v2 0/3] util/userfaultfd: Support /dev/userfaultfd Peter Xu
  2023-02-01 21:10 ` [PATCH v2 1/3] linux-headers: Update to v6.1 Peter Xu
@ 2023-02-01 21:10 ` Peter Xu
  2023-02-02 10:27   ` Juan Quintela
  2023-02-01 21:10 ` [PATCH v2 3/3] util/userfaultfd: Support /dev/userfaultfd Peter Xu
  2 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2023-02-01 21:10 UTC
  To: qemu-devel
  Cc: Leonardo Bras Soares Passos, Juan Quintela, Michal Prívozník,
      Daniel P . Berrangé, peterx, Philippe Mathieu-Daudé,
      Dr . David Alan Gilbert

Add a helper to create the uffd handle.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/qemu/userfaultfd.h   |  8 ++++++++
 migration/postcopy-ram.c     | 11 +++++------
 tests/qtest/migration-test.c |  3 ++-
 util/userfaultfd.c           | 13 +++++++++++--
 4 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/include/qemu/userfaultfd.h b/include/qemu/userfaultfd.h
index 6b74f92792..2101115f70 100644
--- a/include/qemu/userfaultfd.h
+++ b/include/qemu/userfaultfd.h
@@ -17,6 +17,14 @@
 #include "exec/hwaddr.h"
 #include <linux/userfaultfd.h>
 
+/**
+ * uffd_open(): Open a userfaultfd handle for the current context.
+ *
+ * @flags: The flags we want to pass in when creating the handle.
+ *
+ * Returns: the uffd handle if >=0, or <0 if error happens.
+ */
+int uffd_open(int flags);
 int uffd_query_features(uint64_t *features);
 int uffd_create_fd(uint64_t features, bool non_blocking);
 void uffd_close_fd(int uffd_fd);
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index b9a37ef255..0c55df0e52 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -37,6 +37,7 @@
 #include "qemu-file.h"
 #include "yank_functions.h"
 #include "tls.h"
+#include "qemu/userfaultfd.h"
 
 /* Arbitrary limit on size of each discard command,
  * keeps them around ~200 bytes
@@ -226,11 +227,9 @@ static bool receive_ufd_features(uint64_t *features)
     int ufd;
     bool ret = true;
 
-    /* if we are here __NR_userfaultfd should exists */
-    ufd = syscall(__NR_userfaultfd, O_CLOEXEC);
+    ufd = uffd_open(O_CLOEXEC);
     if (ufd == -1) {
-        error_report("%s: syscall __NR_userfaultfd failed: %s", __func__,
-                     strerror(errno));
+        error_report("%s: uffd_open() failed: %s", __func__, strerror(errno));
         return false;
     }
 
@@ -375,7 +374,7 @@ bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
         goto out;
     }
 
-    ufd = syscall(__NR_userfaultfd, O_CLOEXEC);
+    ufd = uffd_open(O_CLOEXEC);
     if (ufd == -1) {
         error_report("%s: userfaultfd not available: %s", __func__,
                      strerror(errno));
@@ -1160,7 +1159,7 @@ static int postcopy_temp_pages_setup(MigrationIncomingState *mis)
 int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
 {
     /* Open the fd for the kernel to give us userfaults */
-    mis->userfault_fd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+    mis->userfault_fd = uffd_open(O_CLOEXEC | O_NONBLOCK);
     if (mis->userfault_fd == -1) {
         error_report("%s: Failed to open userfault fd: %s", __func__,
                      strerror(errno));
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 1dd32c9506..7a5d1922dd 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -62,13 +62,14 @@ static bool uffd_feature_thread_id;
 #include <sys/eventfd.h>
 #include <sys/ioctl.h>
 #include <linux/userfaultfd.h>
+#include "qemu/userfaultfd.h"
 
 static bool ufd_version_check(void)
 {
     struct uffdio_api api_struct;
     uint64_t ioctl_mask;
 
-    int ufd = syscall(__NR_userfaultfd, O_CLOEXEC);
+    int ufd = uffd_open(O_CLOEXEC);
 
     if (ufd == -1) {
         g_test_message("Skipping test: userfaultfd not available");
diff --git a/util/userfaultfd.c b/util/userfaultfd.c
index f1cd6af2b1..9845a2ec81 100644
--- a/util/userfaultfd.c
+++ b/util/userfaultfd.c
@@ -19,6 +19,15 @@
 #include <sys/syscall.h>
 #include <sys/ioctl.h>
 
+int uffd_open(int flags)
+{
+#if defined(__linux__) && defined(__NR_userfaultfd)
+    return syscall(__NR_userfaultfd, flags);
+#else
+    return -EINVAL;
+#endif
+}
+
 /**
  * uffd_query_features: query UFFD features
  *
@@ -32,7 +41,7 @@ int uffd_query_features(uint64_t *features)
     struct uffdio_api api_struct = { 0 };
     int ret = -1;
 
-    uffd_fd = syscall(__NR_userfaultfd, O_CLOEXEC);
+    uffd_fd = uffd_open(O_CLOEXEC);
     if (uffd_fd < 0) {
         trace_uffd_query_features_nosys(errno);
         return -1;
@@ -69,7 +78,7 @@ int uffd_create_fd(uint64_t features, bool non_blocking)
     uint64_t ioctl_mask = BIT(_UFFDIO_REGISTER) | BIT(_UFFDIO_UNREGISTER);
 
     flags = O_CLOEXEC | (non_blocking ? O_NONBLOCK : 0);
-    uffd_fd = syscall(__NR_userfaultfd, flags);
+    uffd_fd = uffd_open(flags);
     if (uffd_fd < 0) {
         trace_uffd_create_fd_nosys(errno);
         return -1;
-- 
2.37.3
* Re: [PATCH v2 2/3] util/userfaultfd: Add uffd_open() 2023-02-01 21:10 ` [PATCH v2 2/3] util/userfaultfd: Add uffd_open() Peter Xu @ 2023-02-02 10:27 ` Juan Quintela 0 siblings, 0 replies; 12+ messages in thread From: Juan Quintela @ 2023-02-02 10:27 UTC (permalink / raw) To: Peter Xu Cc: qemu-devel, Leonardo Bras Soares Passos, Michal Prívozník, Daniel P . Berrangé, Philippe Mathieu-Daudé, Dr . David Alan Gilbert Peter Xu <peterx@redhat.com> wrote: > Add a helper to create the uffd handle. > > Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> > Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> I can get this one through migration tree. ^ permalink raw reply [flat|nested] 12+ messages in thread
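A side note on the error convention in this helper, offered as a sketch rather than as part of the patch: the converted callers test `ufd == -1` and print `strerror(errno)`, while the `#else` branch of uffd_open() returns -EINVAL without setting errno. Whether that branch is reachable for these callers is not clear from the diff alone. The variant below is hypothetical (the name uffd_open_errno_sketch and the choice of ENOSYS are assumptions, not taken from the series) and shows one way to keep both branches on the same -1-plus-errno convention.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

/*
 * Hypothetical variant of uffd_open(), not the code from the patch.
 * Both branches fail with -1 and errno set, so a caller that tests
 * "== -1" and prints strerror(errno) sees a meaningful error.
 */
int uffd_open_errno_sketch(int flags)
{
#if defined(__NR_userfaultfd)
    /* syscall() itself returns -1 and sets errno on failure. */
    return (int)syscall(__NR_userfaultfd, flags);
#else
    (void)flags;
    errno = ENOSYS; /* assumption: "not implemented" is the closest fit */
    return -1;
#endif
}
```

With this shape, the existing `if (ufd == -1)` checks in the callers would behave the same way on both branches.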
* [PATCH v2 3/3] util/userfaultfd: Support /dev/userfaultfd 2023-02-01 21:10 [PATCH v2 0/3] util/userfaultfd: Support /dev/userfaultfd Peter Xu 2023-02-01 21:10 ` [PATCH v2 1/3] linux-headers: Update to v6.1 Peter Xu 2023-02-01 21:10 ` [PATCH v2 2/3] util/userfaultfd: Add uffd_open() Peter Xu @ 2023-02-01 21:10 ` Peter Xu 2023-02-02 10:52 ` Juan Quintela 2 siblings, 1 reply; 12+ messages in thread From: Peter Xu @ 2023-02-01 21:10 UTC (permalink / raw) To: qemu-devel Cc: Leonardo Bras Soares Passos, Juan Quintela, Michal Prívozník, Daniel P . Berrangé, peterx, Philippe Mathieu-Daudé, Dr . David Alan Gilbert Teach QEMU to use /dev/userfaultfd when it exists, and fall back to the system call if the device is missing or QEMU lacks permission to open it. First, as long as the application has permission to access /dev/userfaultfd, it always has the ability to trap kernel faults, which is what QEMU mostly wants. Second, in some contexts (e.g. containers) the userfaultfd syscall can be forbidden, so the device can be the main way to use postcopy in a restricted environment with a strict seccomp setup. 
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Peter Xu <peterx@redhat.com> --- util/trace-events | 1 + util/userfaultfd.c | 37 +++++++++++++++++++++++++++++++++++++ 2 files changed, 38 insertions(+) diff --git a/util/trace-events b/util/trace-events index c8f53d7d9f..16f78d8fe5 100644 --- a/util/trace-events +++ b/util/trace-events @@ -93,6 +93,7 @@ qemu_vfio_region_info(const char *desc, uint64_t region_ofs, uint64_t region_siz qemu_vfio_pci_map_bar(int index, uint64_t region_ofs, uint64_t region_size, int ofs, void *host) "map region bar#%d addr 0x%"PRIx64" size 0x%"PRIx64" ofs 0x%x host %p" #userfaultfd.c +uffd_detect_open_mode(int mode) "%d" uffd_query_features_nosys(int err) "errno: %i" uffd_query_features_api_failed(int err) "errno: %i" uffd_create_fd_nosys(int err) "errno: %i" diff --git a/util/userfaultfd.c b/util/userfaultfd.c index 9845a2ec81..7dceab51d6 100644 --- a/util/userfaultfd.c +++ b/util/userfaultfd.c @@ -18,10 +18,47 @@ #include <poll.h> #include <sys/syscall.h> #include <sys/ioctl.h> +#include <fcntl.h> + +typedef enum { + UFFD_UNINITIALIZED = 0, + UFFD_USE_DEV_PATH, + UFFD_USE_SYSCALL, +} uffd_open_mode; + +static int uffd_dev; + +static uffd_open_mode uffd_detect_open_mode(void) +{ + static uffd_open_mode open_mode; + + if (open_mode == UFFD_UNINITIALIZED) { + /* + * Make /dev/userfaultfd the default approach because it has better + * permission controls, meanwhile allows kernel faults without any + * privilege requirement (e.g. SYS_CAP_PTRACE). 
+ */ + uffd_dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC); + if (uffd_dev >= 0) { + open_mode = UFFD_USE_DEV_PATH; + } else { + /* Fallback to the system call */ + open_mode = UFFD_USE_SYSCALL; + } + trace_uffd_detect_open_mode(open_mode); + } + + return open_mode; +} int uffd_open(int flags) { #if defined(__linux__) && defined(__NR_userfaultfd) + if (uffd_detect_open_mode() == UFFD_USE_DEV_PATH) { + assert(uffd_dev >= 0); + return ioctl(uffd_dev, USERFAULTFD_IOC_NEW, flags); + } + return syscall(__NR_userfaultfd, flags); #else return -EINVAL; -- 2.37.3 ^ permalink raw reply related [flat|nested] 12+ messages in thread
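The open-then-fall-back logic of this patch can be tried outside QEMU. The following is a minimal standalone sketch, not the QEMU code: the function name uffd_open_sketch and the `how` out-parameter are hypothetical, the device fd is closed after use instead of being kept for the process lifetime as the patch does, and the device path is only compiled in when the installed headers define USERFAULTFD_IOC_NEW (v6.1 or later).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>

/*
 * Standalone sketch.  Returns a new userfaultfd, or -1 with errno set.
 * *how is set to the name of the path that was used.
 */
int uffd_open_sketch(int flags, const char **how)
{
#ifdef USERFAULTFD_IOC_NEW
    /* Prefer the device node when the headers know about it. */
    int dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);

    if (dev >= 0) {
        int uffd = ioctl(dev, USERFAULTFD_IOC_NEW, flags);
        int saved_errno = errno;

        /* Assumption: the new fd does not need dev to stay open. */
        close(dev);
        errno = saved_errno;
        *how = "/dev/userfaultfd";
        return uffd;
    }
#endif
    /* Device missing or not accessible: fall back to the system call. */
    *how = "syscall";
#ifdef __NR_userfaultfd
    return (int)syscall(__NR_userfaultfd, flags);
#else
    (void)flags;
    errno = ENOSYS;
    return -1;
#endif
}
```

A caller would pass the same flags as in the patch, for example `uffd_open_sketch(O_CLOEXEC | O_NONBLOCK, &how)`, and can log `how` next to `strerror(errno)` when the call fails.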
* Re: [PATCH v2 3/3] util/userfaultfd: Support /dev/userfaultfd 2023-02-01 21:10 ` [PATCH v2 3/3] util/userfaultfd: Support /dev/userfaultfd Peter Xu @ 2023-02-02 10:52 ` Juan Quintela 2023-02-02 20:41 ` Peter Xu 0 siblings, 1 reply; 12+ messages in thread From: Juan Quintela @ 2023-02-02 10:52 UTC (permalink / raw) To: Peter Xu Cc: qemu-devel, Leonardo Bras Soares Passos, Michal Prívozník, Daniel P . Berrangé, Philippe Mathieu-Daudé, Dr . David Alan Gilbert Peter Xu <peterx@redhat.com> wrote: > Teach QEMU to use /dev/userfaultfd when it existed and fallback to the > system call if either it's not there or doesn't have enough permission. > > Firstly, as long as the app has permission to access /dev/userfaultfd, it > always have the ability to trap kernel faults which QEMU mostly wants. > Meanwhile, in some context (e.g. containers) the userfaultfd syscall can be > forbidden, so it can be the major way to use postcopy in a restricted > environment with strict seccomp setup. > > Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> > Signed-off-by: Peter Xu <peterx@redhat.com> Hi, can we change this code so that it does not use the global variable? 
> --- > util/trace-events | 1 + > util/userfaultfd.c | 37 +++++++++++++++++++++++++++++++++++++ > 2 files changed, 38 insertions(+) > > diff --git a/util/trace-events b/util/trace-events > index c8f53d7d9f..16f78d8fe5 100644 > --- a/util/trace-events > +++ b/util/trace-events > @@ -93,6 +93,7 @@ qemu_vfio_region_info(const char *desc, uint64_t region_ofs, uint64_t region_siz > qemu_vfio_pci_map_bar(int index, uint64_t region_ofs, uint64_t region_size, int ofs, void *host) "map region bar#%d addr 0x%"PRIx64" size 0x%"PRIx64" ofs 0x%x host %p" > > #userfaultfd.c > +uffd_detect_open_mode(int mode) "%d" > uffd_query_features_nosys(int err) "errno: %i" > uffd_query_features_api_failed(int err) "errno: %i" > uffd_create_fd_nosys(int err) "errno: %i" > diff --git a/util/userfaultfd.c b/util/userfaultfd.c > index 9845a2ec81..7dceab51d6 100644 > --- a/util/userfaultfd.c > +++ b/util/userfaultfd.c > @@ -18,10 +18,47 @@ > #include <poll.h> > #include <sys/syscall.h> > #include <sys/ioctl.h> > +#include <fcntl.h> > + > +typedef enum { > + UFFD_UNINITIALIZED = 0, > + UFFD_USE_DEV_PATH, > + UFFD_USE_SYSCALL, > +} uffd_open_mode; > + > +static int uffd_dev; > + > +static uffd_open_mode uffd_detect_open_mode(void) > +{ > + static uffd_open_mode open_mode; > + > + if (open_mode == UFFD_UNINITIALIZED) { > + /* > + * Make /dev/userfaultfd the default approach because it has better > + * permission controls, meanwhile allows kernel faults without any > + * privilege requirement (e.g. SYS_CAP_PTRACE). 
> + */ > + uffd_dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC); > + if (uffd_dev >= 0) { > + open_mode = UFFD_USE_DEV_PATH; > + } else { > + /* Fallback to the system call */ > + open_mode = UFFD_USE_SYSCALL; > + } > + trace_uffd_detect_open_mode(open_mode); > + } > + > + return open_mode; > +} > > int uffd_open(int flags) > { > #if defined(__linux__) && defined(__NR_userfaultfd) > + if (uffd_detect_open_mode() == UFFD_USE_DEV_PATH) { > + assert(uffd_dev >= 0); > + return ioctl(uffd_dev, USERFAULTFD_IOC_NEW, flags); > + } > + > return syscall(__NR_userfaultfd, flags); > #else > return -EINVAL; static int open_userfaultd(void) { /* * Make /dev/userfaultfd the default approach because it has better * permission controls, meanwhile allows kernel faults without any * privilege requirement (e.g. SYS_CAP_PTRACE). */ int uffd = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC); if (uffd >= 0) { return uffd; } return -1; } int uffd_open(int flags) { #if defined(__linux__) && defined(__NR_userfaultfd) static int uffd = -2; if (uffd == -2) { uffd = open_userfaultd(); } if (uffd >= 0) { return ioctl(uffd, USERFAULTFD_IOC_NEW, flags); } return syscall(__NR_userfaultfd, flags); #else return -EINVAL; 27 lines vs 42 No need for enum type No need for global variable What do you think? Later, Juan. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 3/3] util/userfaultfd: Support /dev/userfaultfd 2023-02-02 10:52 ` Juan Quintela @ 2023-02-02 20:41 ` Peter Xu 2023-02-03 21:01 ` Juan Quintela 0 siblings, 1 reply; 12+ messages in thread From: Peter Xu @ 2023-02-02 20:41 UTC (permalink / raw) To: Juan Quintela Cc: qemu-devel, Leonardo Bras Soares Passos, Michal Prívozník, Daniel P . Berrangé, Philippe Mathieu-Daudé, Dr . David Alan Gilbert On Thu, Feb 02, 2023 at 11:52:21AM +0100, Juan Quintela wrote: > Peter Xu <peterx@redhat.com> wrote: > > Teach QEMU to use /dev/userfaultfd when it existed and fallback to the > > system call if either it's not there or doesn't have enough permission. > > > > Firstly, as long as the app has permission to access /dev/userfaultfd, it > > always have the ability to trap kernel faults which QEMU mostly wants. > > Meanwhile, in some context (e.g. containers) the userfaultfd syscall can be > > forbidden, so it can be the major way to use postcopy in a restricted > > environment with strict seccomp setup. > > > > Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> > > Signed-off-by: Peter Xu <peterx@redhat.com> > > > Hi Hi, Juan, > > Can we change this code to not use the global variable. 
> > > --- > > util/trace-events | 1 + > > util/userfaultfd.c | 37 +++++++++++++++++++++++++++++++++++++ > > 2 files changed, 38 insertions(+) > > > > diff --git a/util/trace-events b/util/trace-events > > index c8f53d7d9f..16f78d8fe5 100644 > > --- a/util/trace-events > > +++ b/util/trace-events > > @@ -93,6 +93,7 @@ qemu_vfio_region_info(const char *desc, uint64_t region_ofs, uint64_t region_siz > > qemu_vfio_pci_map_bar(int index, uint64_t region_ofs, uint64_t region_size, int ofs, void *host) "map region bar#%d addr 0x%"PRIx64" size 0x%"PRIx64" ofs 0x%x host %p" > > > > #userfaultfd.c > > +uffd_detect_open_mode(int mode) "%d" > > uffd_query_features_nosys(int err) "errno: %i" > > uffd_query_features_api_failed(int err) "errno: %i" > > uffd_create_fd_nosys(int err) "errno: %i" > > diff --git a/util/userfaultfd.c b/util/userfaultfd.c > > index 9845a2ec81..7dceab51d6 100644 > > --- a/util/userfaultfd.c > > +++ b/util/userfaultfd.c > > @@ -18,10 +18,47 @@ > > #include <poll.h> > > #include <sys/syscall.h> > > #include <sys/ioctl.h> > > +#include <fcntl.h> > > + > > +typedef enum { > > + UFFD_UNINITIALIZED = 0, > > + UFFD_USE_DEV_PATH, > > + UFFD_USE_SYSCALL, > > +} uffd_open_mode; > > + > > +static int uffd_dev; > > + > > +static uffd_open_mode uffd_detect_open_mode(void) > > +{ > > + static uffd_open_mode open_mode; > > + > > + if (open_mode == UFFD_UNINITIALIZED) { > > + /* > > + * Make /dev/userfaultfd the default approach because it has better > > + * permission controls, meanwhile allows kernel faults without any > > + * privilege requirement (e.g. SYS_CAP_PTRACE). 
> > + */ > > + uffd_dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC); > > + if (uffd_dev >= 0) { > > + open_mode = UFFD_USE_DEV_PATH; > > + } else { > > + /* Fallback to the system call */ > > + open_mode = UFFD_USE_SYSCALL; > > + } > > + trace_uffd_detect_open_mode(open_mode); > > + } > > + > > + return open_mode; > > +} > > > > int uffd_open(int flags) > > { > > #if defined(__linux__) && defined(__NR_userfaultfd) > > + if (uffd_detect_open_mode() == UFFD_USE_DEV_PATH) { > > + assert(uffd_dev >= 0); > > + return ioctl(uffd_dev, USERFAULTFD_IOC_NEW, flags); > > + } > > + > > return syscall(__NR_userfaultfd, flags); > > #else > > return -EINVAL; > > static int open_userfaultd(void) > { > /* > * Make /dev/userfaultfd the default approach because it has better > * permission controls, meanwhile allows kernel faults without any > * privilege requirement (e.g. SYS_CAP_PTRACE). > */ > int uffd = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC); > if (uffd >= 0) { > return uffd; > } > return -1; > } > > int uffd_open(int flags) > { > #if defined(__linux__) && defined(__NR_userfaultfd) > static int uffd = -2; > if (uffd == -2) { > uffd = open_userfaultd(); > } > if (uffd >= 0) { > return ioctl(uffd, USERFAULTFD_IOC_NEW, flags); > } > return syscall(__NR_userfaultfd, flags); > #else > return -EINVAL; > > 27 lines vs 42 > > No need for enum type > No need for global variable > > What do you think? Yes, as I replied to Phil earlier, I think it can be simplified. I did it this way mainly for (1) better readability, and (2) being crystal clear about which way we used to open /dev/userfaultfd, and then guaranteeing that we keep using it. So at least I prefer keeping things like trace_uffd_detect_open_mode(). I also plan to add another mode once fd-mode is there, even if it reuses the same USERFAULTFD_IOC_NEW; the mode can be useful information when a failure happens. Though if you insist, I can switch to the simple version too. -- Peter Xu ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 3/3] util/userfaultfd: Support /dev/userfaultfd 2023-02-02 20:41 ` Peter Xu @ 2023-02-03 21:01 ` Juan Quintela 2023-02-06 21:31 ` Peter Xu 0 siblings, 1 reply; 12+ messages in thread From: Juan Quintela @ 2023-02-03 21:01 UTC (permalink / raw) To: Peter Xu Cc: qemu-devel, Leonardo Bras Soares Passos, Michal Prívozník, Daniel P . Berrangé, Philippe Mathieu-Daudé, Dr . David Alan Gilbert Peter Xu <peterx@redhat.com> wrote: > On Thu, Feb 02, 2023 at 11:52:21AM +0100, Juan Quintela wrote: >> Peter Xu <peterx@redhat.com> wrote: >> > Teach QEMU to use /dev/userfaultfd when it existed and fallback to the >> > system call if either it's not there or doesn't have enough permission. >> > >> > Firstly, as long as the app has permission to access /dev/userfaultfd, it >> > always have the ability to trap kernel faults which QEMU mostly wants. >> > Meanwhile, in some context (e.g. containers) the userfaultfd syscall can be >> > forbidden, so it can be the major way to use postcopy in a restricted >> > environment with strict seccomp setup. >> > >> > Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> >> > Signed-off-by: Peter Xu <peterx@redhat.com> >> >> >> Hi > > Hi, Juan, >> static int open_userfaultd(void) >> { >> /* >> * Make /dev/userfaultfd the default approach because it has better >> * permission controls, meanwhile allows kernel faults without any >> * privilege requirement (e.g. SYS_CAP_PTRACE). >> */ >> int uffd = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC); >> if (uffd >= 0) { >> return uffd; >> } >> return -1; >> } >> >> int uffd_open(int flags) >> { >> #if defined(__linux__) && defined(__NR_userfaultfd) Just an aside: checkpatch doesn't like that you use __linux__. This file is compiled under CONFIG_LINUX, so you can drop it. 
>> static int uffd = -2; >> if (uffd == -2) { >> uffd = open_userfaultd(); >> } >> if (uffd >= 0) { >> return ioctl(uffd, USERFAULTFD_IOC_NEW, flags); >> } >> return syscall(__NR_userfaultfd, flags); >> #else >> return -EINVAL; >> >> 27 lines vs 42 >> >> No need for enum type >> No need for global variable >> >> What do you think? > > Yes, as I used to reply to Phil I think it can be simplified. I did this > major for (1) better readability, and (2) being crystal clear on which way > we used to open /dev/userfaultfd, then guarantee we're keeping using it. so > at least I prefer keeping things like trace_uffd_detect_open_mode(). The trace is ok for me. I just forgot to copy it on the rework, sorry. > I also plan to add another mode when fd-mode is there even if it'll reuse > the same USERFAULTFD_IOC_NEW; they can be useful information when a failure > happens. The other fd mode will change the uffd. What I *kind of* object to is: - Using a global variable when it is not needed, i.e. for me using a global variable means that anything else is worse. Not the case IMHO. - Calling uffd_detect_open_mode() on every call, when we know that it cannot change: it is always going to return the same value, so cache it. > Though if you insist, I can switch to the simple version too. I have always said that the person who wrote the patch has the last word on style. I prefer my version, but it is up to you to take it or not. Later, Juan. ^ permalink raw reply [flat|nested] 12+ messages in thread
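The function-scoped static proposed here and the trace point Peter wants to keep do not exclude each other. A minimal sketch of how they might be combined, under stated assumptions: the names uffd_dev_fd_sketch and uffd_open_cached_sketch are hypothetical, the static lives in a small helper rather than inside uffd_open() as in Juan's version, and an fprintf() to stderr stands in for trace_uffd_detect_open_mode(), which only exists inside QEMU.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>

/* Probe the device once; the result lives in a function-scoped static. */
int uffd_dev_fd_sketch(void)
{
    static int dev = -2;    /* -2: not probed yet, -1: use the syscall */

    if (dev == -2) {
        dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
        /* Stand-in for trace_uffd_detect_open_mode(): logged once. */
        fprintf(stderr, "uffd open mode: %s\n",
                dev >= 0 ? "/dev/userfaultfd" : "syscall");
    }
    return dev;
}

int uffd_open_cached_sketch(int flags)
{
    int dev = uffd_dev_fd_sketch();

#ifdef USERFAULTFD_IOC_NEW
    if (dev >= 0) {
        return ioctl(dev, USERFAULTFD_IOC_NEW, flags);
    }
#endif
    (void)dev;
#ifdef __NR_userfaultfd
    return (int)syscall(__NR_userfaultfd, flags);
#else
    (void)flags;
    errno = ENOSYS;
    return -1;
#endif
}
```

The sentinel value -2 plays the role of UFFD_UNINITIALIZED in the original patch, so the enum is not needed to express "not probed yet"; a third mode, as Peter plans, would need a different encoding.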
* Re: [PATCH v2 3/3] util/userfaultfd: Support /dev/userfaultfd 2023-02-03 21:01 ` Juan Quintela @ 2023-02-06 21:31 ` Peter Xu 2023-02-07 0:11 ` Juan Quintela 0 siblings, 1 reply; 12+ messages in thread From: Peter Xu @ 2023-02-06 21:31 UTC (permalink / raw) To: Juan Quintela Cc: qemu-devel, Leonardo Bras Soares Passos, Michal Prívozník, Daniel P . Berrangé, Philippe Mathieu-Daudé, Dr . David Alan Gilbert On Fri, Feb 03, 2023 at 10:01:04PM +0100, Juan Quintela wrote: > Peter Xu <peterx@redhat.com> wrote: > > On Thu, Feb 02, 2023 at 11:52:21AM +0100, Juan Quintela wrote: > >> Peter Xu <peterx@redhat.com> wrote: > >> > Teach QEMU to use /dev/userfaultfd when it existed and fallback to the > >> > system call if either it's not there or doesn't have enough permission. > >> > > >> > Firstly, as long as the app has permission to access /dev/userfaultfd, it > >> > always have the ability to trap kernel faults which QEMU mostly wants. > >> > Meanwhile, in some context (e.g. containers) the userfaultfd syscall can be > >> > forbidden, so it can be the major way to use postcopy in a restricted > >> > environment with strict seccomp setup. > >> > > >> > Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> > >> > Signed-off-by: Peter Xu <peterx@redhat.com> > >> > >> > >> Hi > > > > Hi, Juan, > > > >> static int open_userfaultd(void) > >> { > >> /* > >> * Make /dev/userfaultfd the default approach because it has better > >> * permission controls, meanwhile allows kernel faults without any > >> * privilege requirement (e.g. SYS_CAP_PTRACE). > >> */ > >> int uffd = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC); > >> if (uffd >= 0) { > >> return uffd; > >> } > >> return -1; > >> } > >> > >> int uffd_open(int flags) > >> { > >> #if defined(__linux__) && defined(__NR_userfaultfd) > > Just an incise, checkpatch don't liue that you use __linux__ > > This file is compiled under CONFIG_LINUX, so you can drop it. Yes indeed. I'll drop it. 
> > >> static int uffd = -2; > >> if (uffd == -2) { > >> uffd = open_userfaultd(); > >> } > >> if (uffd >= 0) { > >> return ioctl(uffd, USERFAULTFD_IOC_NEW, flags); > >> } > >> return syscall(__NR_userfaultfd, flags); > >> #else > >> return -EINVAL; > >> > >> 27 lines vs 42 > >> > >> No need for enum type > >> No need for global variable > >> > >> What do you think? > > > > Yes, as I used to reply to Phil I think it can be simplified. I did this > > major for (1) better readability, and (2) being crystal clear on which way > > we used to open /dev/userfaultfd, then guarantee we're keeping using it. so > > at least I prefer keeping things like trace_uffd_detect_open_mode(). > > The trace is ok for me. I just forgot to copy it on the rework, sorry. > > > I also plan to add another mode when fd-mode is there even if it'll reuse > > the same USERFAULTFD_IOC_NEW; they can be useful information when a failure > > happens. > > The other fd mode will change the uffd. > > What I *kind* of object is: > - Using a global variable when it is not needed > i.e. for me using a global variable means that anything else is worse. > Not the case IMHO. IMHO globals are evil when they're used in multiple places; that's bad for readability. That is not the case here because it's set once and for all. I wanted to have an easy and clear way to peek at which mode was chosen even without tracing enabled (e.g. from a dump or a live process). > - Call uffd_open_mode() for every call, when we know that it can change, > it is going to return always the same value, so cache it. uffd_detect_open_mode() caches the result already? Or maybe you meant something else? > > > Though if you insist, I can switch to the simple version too. > > I always told that the person who did the patch has the last word on > style. I preffer my version, but it is up to you to take it or not. Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 3/3] util/userfaultfd: Support /dev/userfaultfd 2023-02-06 21:31 ` Peter Xu @ 2023-02-07 0:11 ` Juan Quintela 0 siblings, 0 replies; 12+ messages in thread From: Juan Quintela @ 2023-02-07 0:11 UTC (permalink / raw) To: Peter Xu Cc: qemu-devel, Leonardo Bras Soares Passos, Michal Prívozník, Daniel P . Berrangé, Philippe Mathieu-Daudé, Dr . David Alan Gilbert Peter Xu <peterx@redhat.com> wrote: > On Fri, Feb 03, 2023 at 10:01:04PM +0100, Juan Quintela wrote: >> Peter Xu <peterx@redhat.com> wrote: >> > On Thu, Feb 02, 2023 at 11:52:21AM +0100, Juan Quintela wrote: >> >> Peter Xu <peterx@redhat.com> wrote: >> >> > Teach QEMU to use /dev/userfaultfd when it existed and fallback to the >> >> > system call if either it's not there or doesn't have enough permission. >> >> > >> >> > Firstly, as long as the app has permission to access /dev/userfaultfd, it >> >> > always have the ability to trap kernel faults which QEMU mostly wants. >> >> > Meanwhile, in some context (e.g. containers) the userfaultfd syscall can be >> >> > forbidden, so it can be the major way to use postcopy in a restricted >> >> > environment with strict seccomp setup. >> >> > >> >> > Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> >> >> > Signed-off-by: Peter Xu <peterx@redhat.com> >> >> >> >> >> >> Hi >> > >> > Hi, Juan, >> >> >> >> static int open_userfaultd(void) >> >> { >> >> /* >> >> * Make /dev/userfaultfd the default approach because it has better >> >> * permission controls, meanwhile allows kernel faults without any >> >> * privilege requirement (e.g. SYS_CAP_PTRACE). >> >> */ >> >> int uffd = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC); >> >> if (uffd >= 0) { >> >> return uffd; >> >> } >> >> return -1; >> >> } >> >> >> >> int uffd_open(int flags) >> >> { >> >> #if defined(__linux__) && defined(__NR_userfaultfd) >> >> Just an incise, checkpatch don't liue that you use __linux__ >> >> This file is compiled under CONFIG_LINUX, so you can drop it. > > Yes indeed. 
I'll drop it. > >> >> >> static int uffd = -2; >> >> if (uffd == -2) { >> >> uffd = open_userfaultd(); >> >> } >> >> if (uffd >= 0) { >> >> return ioctl(uffd, USERFAULTFD_IOC_NEW, flags); >> >> } >> >> return syscall(__NR_userfaultfd, flags); >> >> #else >> >> return -EINVAL; >> >> >> >> 27 lines vs 42 >> >> >> >> No need for enum type >> >> No need for global variable >> >> >> >> What do you think? >> > >> > Yes, as I used to reply to Phil I think it can be simplified. I did this >> > major for (1) better readability, and (2) being crystal clear on which way >> > we used to open /dev/userfaultfd, then guarantee we're keeping using it. so >> > at least I prefer keeping things like trace_uffd_detect_open_mode(). >> >> The trace is ok for me. I just forgot to copy it on the rework, sorry. >> >> > I also plan to add another mode when fd-mode is there even if it'll reuse >> > the same USERFAULTFD_IOC_NEW; they can be useful information when a failure >> > happens. >> >> The other fd mode will change the uffd. >> >> What I *kind* of object is: >> - Using a global variable when it is not needed >> i.e. for me using a global variable means that anything else is worse. >> Not the case IMHO. > > IMHO globals are evil when they're used in multiple places; that's bad to > readability. Here it's not the case because it's set once and for > all. That is part of the problem. int foo; I need to search all the code to see where it is used. int bar(...) { static int foo; .... } I am really sure that: - foo value is preserved between calls - it is not used anywhere else without a single grep through the code. > I > wanted to have an easy and clear way to peek what's the mode chosen even > without tracing enabled (e.g. from a dump or a live process). I haven't thought about this. But you want something different than this? 
(fada)$ cat /tmp/kk.c
int bar(void)
{
    static int foo = 42;

    return foo++;
}

int main(void)
{
    int a = 7 + 1;

    return a + bar();
}
(fada)$ gcc -Wall /tmp/kk.c -o /tmp/kkk -g
(fada)$ gdb /tmp/kkk
(gdb) b main
Breakpoint 1 at 0x401123: file /tmp/kk.c, line 10.
(gdb) p bar::foo
$1 = 42
(gdb)

And yes, I had to search for how this is done O:-)

>> - Call uffd_open_mode() for every call, when we know that it can change,
>>   it is going to return always the same value, so cache it.
>
> uffd_detect_open_mode() caches the result already?  Or maybe you meant
> something else?

What I did: only call the equivalent function once.  You are calling it
every time that uffd_open() is called.

>
>>
>> > Though if you insist, I can switch to the simple version too.
>>
>> I always told that the person who did the patch has the last word on
>> style.  I preffer my version, but it is up to you to take it or not.
>
> Thanks,

Later, Juan.

^ permalink raw reply [flat|nested] 12+ messages in thread