linux

mirror of https://github.com/torvalds/linux.git synced 2026-03-08 01:24:47 +01:00

Author	SHA1	Message	Date
Paolo Bonzini	70295a479d	KVM: always define KVM_CAP_SYNC_MMU KVM_CAP_SYNC_MMU is provided by KVM's MMU notifiers, which are now always available. Move the definition from individual architectures to common code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-02-28 15:31:35 +01:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Paolo Bonzini	bf2c3138ae	Merge tag 'kvm-x86-pmu-6.20' of https://github.com/kvm-x86/linux into HEAD KVM mediated PMU support for 6.20 Add support for mediated PMUs, where KVM gives the guest full ownership of PMU hardware (contexted switched around the fastpath run loop) and allows direct access to data MSRs and PMCs (restricted by the vPMU model), but intercepts access to control registers, e.g. to enforce event filtering and to prevent the guest from profiling sensitive host state. To keep overall complexity reasonable, mediated PMU usage is all or nothing for a given instance of KVM (controlled via module param). The Mediated PMU is disabled default, partly to maintain backwards compatilibity for existing setup, partly because there are tradeoffs when running with a mediated PMU that may be non-starters for some use cases, e.g. the host loses the ability to profile guests with mediated PMUs, the fastpath run loop is also a blind spot, entry/exit transitions are more expensive, etc. Versus the emulated PMU, where KVM is "just another perf user", the mediated PMU delivers more accurate profiling and monitoring (no risk of contention and thus dropped events), with significantly less overhead (fewer exits and faster emulation/programming of event selectors) E.g. when running Specint-2017 on a single-socket Sapphire Rapids with 56 cores and no-SMT, and using perf from within the guest: Perf command: a. basic-sampling: perf record -F 1000 -e 6-instructions -a --overwrite b. multiplex-sampling: perf record -F 1000 -e 10-instructions -a --overwrite Guest performance overhead: --------------------------------------------------------------------------- \| Test case \| emulated vPMU \| all passthrough \| passthrough with \| \| \| \| \| event filters \| --------------------------------------------------------------------------- \| basic-sampling \| 33.62% \| 4.24% \| 6.21% \| --------------------------------------------------------------------------- \| multiplex-sampling \| 79.32% \| 7.34% \| 10.45% \| ---------------------------------------------------------------------------	2026-02-11 12:45:40 -05:00
Marc Zyngier	6316366129	Merge branch kvm-arm64/misc-6.20 into kvmarm-master/next * kvm-arm64/misc-6.20: : . : Misc KVM/arm64 changes for 6.20 : : - Trivial FPSIMD cleanups : : - Calculate hyp VA size only once, avoiding potential mapping issues when : VA bits is smaller than expected : : - Silence sparse warning for the HYP stack base : : - Fix error checking when handling FFA_VERSION : : - Add missing trap configuration for DBGWCR15_EL1 : : - Don't try to deal with nested S2 when NV isn't enabled for a guest : : - Various spelling fixes : . KVM: arm64: nv: Avoid NV stage-2 code when NV is not supported KVM: arm64: Fix various comments KVM: arm64: nv: Add trap config for DBGWCR<15>_EL1 KVM: arm64: Fix error checking for FFA_VERSION KVM: arm64: Fix missing <asm/stackpage/nvhe.h> include KVM: arm64: Calculate hyp VA size only once KVM: arm64: Remove ISB after writing FPEXC32_EL2 KVM: arm64: Shuffle KVM_HOST_DATA_FLAG_* indices KVM: arm64: Fix comment in fpsimd_lazy_switch_to_host() Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-02-05 09:17:58 +00:00
Marc Zyngier	183ac2b2ad	Merge branch kvm-arm64/pkvm-no-mte into kvmarm-master/next * kvm-arm64/pkvm-no-mte: : . : pKVM updates preventing the host from using MTE-related system : sysrem registers when the feature is disabled from the kernel : command-line (arm64.nomte), courtesy of Fuad Taba. : : From the cover letter: : : "If MTE is supported by the hardware (and is enabled at EL3), it remains : available to lower exception levels by default. Disabling it in the host : kernel (e.g., via 'arm64.nomte') only stops the kernel from advertising : the feature; it does not physically disable MTE in the hardware. : : The ability to disable MTE in the host kernel is used by some systems, : such as Android, so that the physical memory otherwise used as tag : storage can be used for other things (i.e. treated just like the rest of : memory). In this scenario, a malicious host could still access tags in : pages donated to a guest using MTE instructions (e.g., STG and LDG), : bypassing the kernel's configuration." : . KVM: arm64: Use kvm_has_mte() in pKVM trap initialization KVM: arm64: Inject UNDEF when accessing MTE sysregs with MTE disabled KVM: arm64: Trap MTE access and discovery when MTE is disabled KVM: arm64: Remove dead code resetting HCR_EL2 for pKVM Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-02-05 09:16:31 +00:00
Fuad Tabba	f35abcbb8a	KVM: arm64: Trap MTE access and discovery when MTE is disabled If MTE is not supported by the hardware, or is disabled in the kernel configuration (`CONFIG_ARM64_MTE=n`) or command line (`arm64.nomte`), the kernel stops advertising MTE to userspace and avoids using MTE instructions. However, this is a software-level disable only. When MTE hardware is present and enabled by EL3 firmware, leaving `HCR_EL2.ATA` set allows the host to execute MTE instructions (STG, LDG, etc.) and access allocation tags in physical memory. Prevent this by clearing `HCR_EL2.ATA` when MTE is disabled. Remove it from the `HCR_HOST_NVHE_FLAGS` default, and conditionally set it in `cpu_prepare_hyp_mode()` only when `system_supports_mte()` returns true. This causes MTE instructions to trap to EL2 when `HCR_EL2.ATA` is cleared. Additionally, set `HCR_EL2.TID5` when MTE is disabled. This traps reads of `GMID_EL1` (Multiple tag transfer ID register) to EL2, preventing the discovery of MTE parameters (such as tag block size) when the feature is suppressed. Early boot code in `head.S` temporarily keeps `HCR_ATA` set to avoid special-casing initialization paths. This is safe because this code executes before untrusted code runs and will clear `HCR_ATA` if MTE is disabled. Signed-off-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20260122112218.531948-3-tabba@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-01-23 11:28:48 +00:00
Marc Zyngier	c983b3e276	Merge branch kvm-arm64/pkvm-features-6.20 into kvmarm-master/next * kvm-arm64/pkvm-features-6.20: : . : pKVM guest feature trapping fixes, courtesy of Fuad Tabba. : . KVM: arm64: Prevent host from managing timer offsets for protected VMs KVM: arm64: Check whether a VM IOCTL is allowed in pKVM KVM: arm64: Track KVM IOCTLs and their associated KVM caps KVM: arm64: Do not allow KVM_CAP_ARM_MTE for any guest in pKVM KVM: arm64: Include VM type when checking VM capabilities in pKVM KVM: arm64: Introduce helper to calculate fault IPA offset KVM: arm64: Fix MTE flag initialization for protected VMs KVM: arm64: Fix Trace Buffer trap polarity for protected VMs KVM: arm64: Fix Trace Buffer trapping for protected VMs Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-01-23 10:04:47 +00:00
Fuad Tabba	b12b3b04f6	KVM: arm64: Check whether a VM IOCTL is allowed in pKVM Certain VM IOCTLs are tied to specific VM features. Since pKVM does not support all features, restrict which IOCTLs are allowed depending on whether the associated feature is supported. Use the existing VM capability check as the source of truth to whether an IOCTL is allowed for a particular VM by mapping the IOCTLs with their associated capabilities. Suggested-by: Oliver Upton <oupton@kernel.org> Signed-off-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20251211104710.151771-9-tabba@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-01-15 15:55:50 +00:00
Fuad Tabba	8823485a69	KVM: arm64: Track KVM IOCTLs and their associated KVM caps Track KVM IOCTLs (VM IOCTLs for now), and the associated KVM capability that enables that IOCTL. Add a function that performs the lookup. This will be used by CoCo VM Hypervisors (e.g., pKVM) to determine whether a particular KVM IOCTL is allowed for its VMs. Suggested-by: Oliver Upton <oupton@kernel.org> Signed-off-by: Fuad Tabba <tabba@google.com> [maz: don't expose KVM_CAP_BASIC to userspace, and rely on NR_VCPUS as a proxy for this] Link: https://patch.msgid.link/20251211104710.151771-8-tabba@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-01-15 15:46:57 +00:00
Fuad Tabba	43a21a0f0c	KVM: arm64: Include VM type when checking VM capabilities in pKVM Certain features and capabilities are restricted in protected mode. Most of these features are restricted only for protected VMs, but some are restricted for ALL VMs in protected mode. Extend the pKVM capability check to pass the VM (kvm), and use that when determining supported features. Signed-off-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20251211104710.151771-6-tabba@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-01-15 15:43:15 +00:00
Ben Dooks	4b16ad0bf8	KVM: arm64: Fix missing <asm/stackpage/nvhe.h> include Include <asm/stackpage/nvhe.h> for kvm_arm_hyp_stack_base declaration which fixes the following sparse warning: arch/arm64/kvm/arm.c:63:1: warning: symbol 'kvm_arm_hyp_stack_base' was not declared. Should it be static? Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Link: https://patch.msgid.link/20260112160413.603493-1-ben.dooks@codethink.co.uk Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-01-14 11:21:31 +00:00
Petteri Kangaslampi	8e8eb10c10	KVM: arm64: Calculate hyp VA size only once Calculate the hypervisor's VA size only once to maintain consistency between the memory layout and MMU initialization logic. Previously the two would be inconsistent when the kernel is configured for less than IDMAP_VA_BITS of VA space. Signed-off-by: Petteri Kangaslampi <pekangas@google.com> Tested-by: Vincent Donnefort <vdonnefort@google.com> Link: https://patch.msgid.link/20260113194409.2970324-2-pekangas@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-01-14 10:40:11 +00:00
Sascha Bischoff	5e8b511c39	KVM: arm64: gic: Check for vGICv3 when clearing TWI Explicitly check for the vgic being v3 when disabling TWI. Failure to check this can result in using the wrong view of the vgic CPU IF union causing undesirable/unexpected behaviour. Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://msgid.link/20260106165154.3321753-1-sascha.bischoff@arm.com Signed-off-by: Oliver Upton <oupton@kernel.org>	2026-01-08 12:49:21 -08:00
Sean Christopherson	4b24910c05	KVM: Add a simplified wrapper for registering perf callbacks Add a parameter-less API for registering perf callbacks in anticipation of introducing another x86-only parameter for handling mediated PMU PMIs. No functional change intended. Acked-by: Anup Patel <anup@brainfault.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:51:39 -08:00
Linus Torvalds	51d90a15fe	ARM: - Support for userspace handling of synchronous external aborts (SEAs), allowing the VMM to potentially handle the abort in a non-fatal manner. - Large rework of the VGIC's list register handling with the goal of supporting more active/pending IRQs than available list registers in hardware. In addition, the VGIC now supports EOImode==1 style deactivations for IRQs which may occur on a separate vCPU than the one that acked the IRQ. - Support for FEAT_XNX (user / privileged execute permissions) and FEAT_HAF (hardware update to the Access Flag) in the software page table walkers and shadow MMU. - Allow page table destruction to reschedule, fixing long need_resched latencies observed when destroying a large VM. - Minor fixes to KVM and selftests Loongarch: - Get VM PMU capability from HW GCFG register. - Add AVEC basic support. - Use 64-bit register definition for EIOINTC. - Add KVM timer test cases for tools/selftests. RISC/V: - SBI message passing (MPXY) support for KVM guest - Give a new, more specific error subcode for the case when in-kernel AIA virtualization fails to allocate IMSIC VS-file - Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually in small chunks - Fix guest page fault within HLV* instructions - Flush VS-stage TLB after VCPU migration for Andes cores s390: - Always allocate ESCA (Extended System Control Area), instead of starting with the basic SCA and converting to ESCA with the addition of the 65th vCPU. The price is increased number of exits (and worse performance) on z10 and earlier processor; ESCA was introduced by z114/z196 in 2010. - VIRT_XFER_TO_GUEST_WORK support - Operation exception forwarding support - Cleanups x86: - Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO SPTE caching is disabled, as there can't be any relevant SPTEs to zap. - Relocate a misplaced export. - Fix an async #PF bug where KVM would clear the completion queue when the guest transitioned in and out of paging mode, e.g. when handling an SMI and then returning to paged mode via RSM. - Leave KVM's user-return notifier registered even when disabling virtualization, as long as kvm.ko is loaded. On reboot/shutdown, keeping the notifier registered is ok; the kernel does not use the MSRs and the callback will run cleanly and restore host MSRs if the CPU manages to return to userspace before the system goes down. - Use the checked version of {get,put}_user(). - Fix a long-lurking bug where KVM's lack of catch-up logic for periodic APIC timers can result in a hard lockup in the host. - Revert the periodic kvmclock sync logic now that KVM doesn't use a clocksource that's subject to NTP corrections. - Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the latter behind CONFIG_CPU_MITIGATIONS. - Context switch XCR0, XSS, and PKRU outside of the entry/exit fast path; the only reason they were handled in the fast path was to paper of a bug in the core #MC code, and that has long since been fixed. - Add emulator support for AVX MOV instructions, to play nice with emulated devices whose guest drivers like to access PCI BARs with large multi-byte instructions. x86 (AMD): - Fix a few missing "VMCB dirty" bugs. - Fix the worst of KVM's lack of EFER.LMSLE emulation. - Add AVIC support for addressing 4k vCPUs in x2AVIC mode. - Fix incorrect handling of selective CR0 writes when checking intercepts during emulation of L2 instructions. - Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on VMRUN and #VMEXIT. - Fix a bug where KVM corrupt the guest code stream when re-injecting a soft interrupt if the guest patched the underlying code after the VM-Exit, e.g. when Linux patches code with a temporary INT3. - Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to userspace, and extend KVM "support" to all policy bits that don't require any actual support from KVM. x86 (Intel): - Use the root role from kvm_mmu_page to construct EPTPs instead of the current vCPU state, partly as worthwhile cleanup, but mostly to pave the way for tracking per-root TLB flushes, and elide EPT flushes on pCPU migration if the root is clean from a previous flush. - Add a few missing nested consistency checks. - Rip out support for doing "early" consistency checks via hardware as the functionality hasn't been used in years and is no longer useful in general; replace it with an off-by-default module param to WARN if hardware fails a check that KVM does not perform. - Fix a currently-benign bug where KVM would drop the guest's SPEC_CTRL[63:32] on VM-Enter. - Misc cleanups. - Overhaul the TDX code to address systemic races where KVM (acting on behalf of userspace) could inadvertantly trigger lock contention in the TDX-Module; KVM was either working around these in weird, ugly ways, or was simply oblivious to them (though even Yan's devilish selftests could only break individual VMs, not the host kernel) - Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a TDX vCPU, if creating said vCPU failed partway through. - Fix a few sparse warnings (bad annotation, 0 != NULL). - Use struct_size() to simplify copying TDX capabilities to userspace. - Fix a bug where TDX would effectively corrupt user-return MSR values if the TDX Module rejects VP.ENTER and thus doesn't clobber host MSRs as expected. Selftests: - Fix a math goof in mmu_stress_test when running on a single-CPU system/VM. - Forcefully override ARCH from x86_64 to x86 to play nice with specifying ARCH=x86_64 on the command line. - Extend a bunch of nested VMX to validate nested SVM as well. - Add support for LA57 in the core VM_MODE_xxx macro, and add a test to verify KVM can save/restore nested VMX state when L1 is using 5-level paging, but L2 is not. - Clean up the guest paging code in anticipation of sharing the core logic for nested EPT and nested NPT. guest_memfd: - Add NUMA mempolicy support for guest_memfd, and clean up a variety of rough edges in guest_memfd along the way. - Define a CLASS to automatically handle get+put when grabbing a guest_memfd from a memslot to make it harder to leak references. - Enhance KVM selftests to make it easer to develop and debug selftests like those added for guest_memfd NUMA support, e.g. where test and/or KVM bugs often result in hard-to-debug SIGBUS errors. - Misc cleanups. Generic: - Use the recently-added WQ_PERCPU when creating the per-CPU workqueue for irqfd cleanup. - Fix a goof in the dirty ring documentation. - Fix choice of target for directed yield across different calls to kvm_vcpu_on_spin(); the function was always starting from the first vCPU instead of continuing the round-robin search. -----BEGIN PGP SIGNATURE----- iQFIBAABCgAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmkvMa8UHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroMlFwf+Ow7zOYUuELSQ+Jn+hOYXiCNrdBDx ZamvMU8kLPr7XX0Zog6HgcMm//qyA6k5nSfqCjfsQZrIhRA/gWJ61jz1OX/Jxq18 pJ9Vz6epnEPYiOtBwz+v8OS8MqDqVNzj2i6W1/cLPQE50c1Hhw64HWS5CSxDQiHW A7PVfl5YU12lW1vG3uE0sNESDt4Eh/spNM17iddXdF4ZUOGublserjDGjbc17E7H 8BX3DkC2plqkJKwtjg0ae62hREkITZZc7RqsnftUkEhn0N0H9+rb6NKUyzIVh9NZ bCtCjtrKN9zfZ0Mujnms3ugBOVqNIputu/DtPnnFKXtXWSrHrgGSNv5ewA== =PEcw -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM updates from Paolo Bonzini: "ARM: - Support for userspace handling of synchronous external aborts (SEAs), allowing the VMM to potentially handle the abort in a non-fatal manner - Large rework of the VGIC's list register handling with the goal of supporting more active/pending IRQs than available list registers in hardware. In addition, the VGIC now supports EOImode==1 style deactivations for IRQs which may occur on a separate vCPU than the one that acked the IRQ - Support for FEAT_XNX (user / privileged execute permissions) and FEAT_HAF (hardware update to the Access Flag) in the software page table walkers and shadow MMU - Allow page table destruction to reschedule, fixing long need_resched latencies observed when destroying a large VM - Minor fixes to KVM and selftests Loongarch: - Get VM PMU capability from HW GCFG register - Add AVEC basic support - Use 64-bit register definition for EIOINTC - Add KVM timer test cases for tools/selftests RISC/V: - SBI message passing (MPXY) support for KVM guest - Give a new, more specific error subcode for the case when in-kernel AIA virtualization fails to allocate IMSIC VS-file - Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually in small chunks - Fix guest page fault within HLV* instructions - Flush VS-stage TLB after VCPU migration for Andes cores s390: - Always allocate ESCA (Extended System Control Area), instead of starting with the basic SCA and converting to ESCA with the addition of the 65th vCPU. The price is increased number of exits (and worse performance) on z10 and earlier processor; ESCA was introduced by z114/z196 in 2010 - VIRT_XFER_TO_GUEST_WORK support - Operation exception forwarding support - Cleanups x86: - Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO SPTE caching is disabled, as there can't be any relevant SPTEs to zap - Relocate a misplaced export - Fix an async #PF bug where KVM would clear the completion queue when the guest transitioned in and out of paging mode, e.g. when handling an SMI and then returning to paged mode via RSM - Leave KVM's user-return notifier registered even when disabling virtualization, as long as kvm.ko is loaded. On reboot/shutdown, keeping the notifier registered is ok; the kernel does not use the MSRs and the callback will run cleanly and restore host MSRs if the CPU manages to return to userspace before the system goes down - Use the checked version of {get,put}_user() - Fix a long-lurking bug where KVM's lack of catch-up logic for periodic APIC timers can result in a hard lockup in the host - Revert the periodic kvmclock sync logic now that KVM doesn't use a clocksource that's subject to NTP corrections - Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the latter behind CONFIG_CPU_MITIGATIONS - Context switch XCR0, XSS, and PKRU outside of the entry/exit fast path; the only reason they were handled in the fast path was to paper of a bug in the core #MC code, and that has long since been fixed - Add emulator support for AVX MOV instructions, to play nice with emulated devices whose guest drivers like to access PCI BARs with large multi-byte instructions x86 (AMD): - Fix a few missing "VMCB dirty" bugs - Fix the worst of KVM's lack of EFER.LMSLE emulation - Add AVIC support for addressing 4k vCPUs in x2AVIC mode - Fix incorrect handling of selective CR0 writes when checking intercepts during emulation of L2 instructions - Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on VMRUN and #VMEXIT - Fix a bug where KVM corrupt the guest code stream when re-injecting a soft interrupt if the guest patched the underlying code after the VM-Exit, e.g. when Linux patches code with a temporary INT3 - Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to userspace, and extend KVM "support" to all policy bits that don't require any actual support from KVM x86 (Intel): - Use the root role from kvm_mmu_page to construct EPTPs instead of the current vCPU state, partly as worthwhile cleanup, but mostly to pave the way for tracking per-root TLB flushes, and elide EPT flushes on pCPU migration if the root is clean from a previous flush - Add a few missing nested consistency checks - Rip out support for doing "early" consistency checks via hardware as the functionality hasn't been used in years and is no longer useful in general; replace it with an off-by-default module param to WARN if hardware fails a check that KVM does not perform - Fix a currently-benign bug where KVM would drop the guest's SPEC_CTRL[63:32] on VM-Enter - Misc cleanups - Overhaul the TDX code to address systemic races where KVM (acting on behalf of userspace) could inadvertantly trigger lock contention in the TDX-Module; KVM was either working around these in weird, ugly ways, or was simply oblivious to them (though even Yan's devilish selftests could only break individual VMs, not the host kernel) - Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a TDX vCPU, if creating said vCPU failed partway through - Fix a few sparse warnings (bad annotation, 0 != NULL) - Use struct_size() to simplify copying TDX capabilities to userspace - Fix a bug where TDX would effectively corrupt user-return MSR values if the TDX Module rejects VP.ENTER and thus doesn't clobber host MSRs as expected Selftests: - Fix a math goof in mmu_stress_test when running on a single-CPU system/VM - Forcefully override ARCH from x86_64 to x86 to play nice with specifying ARCH=x86_64 on the command line - Extend a bunch of nested VMX to validate nested SVM as well - Add support for LA57 in the core VM_MODE_xxx macro, and add a test to verify KVM can save/restore nested VMX state when L1 is using 5-level paging, but L2 is not - Clean up the guest paging code in anticipation of sharing the core logic for nested EPT and nested NPT guest_memfd: - Add NUMA mempolicy support for guest_memfd, and clean up a variety of rough edges in guest_memfd along the way - Define a CLASS to automatically handle get+put when grabbing a guest_memfd from a memslot to make it harder to leak references - Enhance KVM selftests to make it easer to develop and debug selftests like those added for guest_memfd NUMA support, e.g. where test and/or KVM bugs often result in hard-to-debug SIGBUS errors - Misc cleanups Generic: - Use the recently-added WQ_PERCPU when creating the per-CPU workqueue for irqfd cleanup - Fix a goof in the dirty ring documentation - Fix choice of target for directed yield across different calls to kvm_vcpu_on_spin(); the function was always starting from the first vCPU instead of continuing the round-robin search" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (260 commits) KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS KVM: arm64: at: Use correct HA bit in TCR_EL2 when regime is EL2 KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX} KVM: arm64: Fix spelling mistake "Unexpeced" -> "Unexpected" KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot() KVM: arm64: Add endian casting to kvm_swap_s[12]_desc() KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n KVM: arm64: selftests: Add test for AT emulation KVM: arm64: nv: Expose hardware access flag management to NV guests KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW KVM: arm64: Implement HW access flag management in stage-1 SW PTW KVM: arm64: Propagate PTW errors up to AT emulation KVM: arm64: Add helper for swapping guest descriptor KVM: arm64: nv: Use pgtable definitions in stage-2 walk KVM: arm64: Handle endianness in read helper for emulated PTW KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW KVM: arm64: Call helper for reading descriptors directly KVM: arm64: nv: Advertise support for FEAT_XNX KVM: arm64: Teach ptdump about FEAT_XNX permissions KVM: s390: Use generic VIRT_XFER_TO_GUEST_WORK functions ...	2025-12-05 17:01:20 -08:00
Linus Torvalds	44fc84337b	arm64 updates for 6.19: Core features: - Basic Arm MPAM (Memory system resource Partitioning And Monitoring) driver under drivers/resctrl/ which makes use of the fs/rectrl/ API Perf and PMU: - Avoid cycle counter on multi-threaded CPUs - Extend CSPMU device probing and add additional filtering support for NVIDIA implementations - Add support for the PMUs on the NoC S3 interconnect - Add additional compatible strings for new Cortex and C1 CPUs - Add support for data source filtering to the SPE driver - Add support for i.MX8QM and "DB" PMU in the imx PMU driver Memory managemennt: - Avoid broadcast TLBI if page reused in write fault - Elide TLB invalidation if the old PTE was not valid - Drop redundant cpu_set__tcr_t0sz() macros - Propagate pgtable_alloc() errors outside of __create_pgd_mapping() - Propagate return value from __change_memory_common() ACPI and EFI: - Call EFI runtime services without disabling preemption - Remove unused ACPI function Miscellaneous: - ptrace support to disable streaming on SME-only systems - Improve sysreg generation to include a 'Prefix' descriptor - Replace __ASSEMBLY__ with __ASSEMBLER__ - Align register dumps in the kselftest zt-test - Remove some no longer used macros/functions - Various spelling corrections -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmkvMjkACgkQa9axLQDI XvGaGg//dtT/ZAqrWa6Yniv1LOlh837C07YdxAYTTuJ+I87DnrxIqjwbW+ye+bF+ 61RTkioeCUm3PH+ncO9gPVNi4ASZ1db3/Rc8Fb6rr1TYOI1sMIeBsbbVdRJgsbX6 zu9197jOBHscTAeDceB6jZBDyW8iSLINPZ7LN6lGxXsZM/Vn5zfE0heKEEio6Fsx +AzO2vos0XcwBR9vFGXtiCDx57T+/cXUtrWfA0Cjz4nvHSgD8+ghS+Jwv+kHMt1L zrarqbeQfj+Iixm9PVHiazv+8THo9QdNl1yGLxDmJ4LEVPewjW5jBs8+5e8e3/Gj p5JEvmSyWvKTTbFoM5vhxC72A7yuT1QwAk2iCyFIxMbQ25PndHboKVp/569DzOkT +6CjI88sVSP6D7bVlN6pFlzc/Fa07YagnDMnMCSfk4LBjUfE3jYb+usaFydyv/rl jwZbJrnSF/H+uQlyoJFgOEXSoQdDsll3dv6yEsUCwbd8RqXbAe3svbguOUHSdvIj sCViezGZQ7Rkn6D21AfF9j6e7ceaSDaf5DWMxPI3dAxFKG8TJbCBsToR59NnoSj+ bNEozbZ1mCxmwH8i43wZ6P0RkClvJnoXcvRA+TJj02fSZACO39d3XDNswfXWL41r KiWGUJZyn2lPKtiAWVX6pSBtDJ+5rFhuoFgADLX6trkxDe9/EMQ= =4Sb6 -----END PGP SIGNATURE----- Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: "These are the arm64 updates for 6.19. The biggest part is the Arm MPAM driver under drivers/resctrl/. There's a patch touching mm/ to handle spurious faults for huge pmd (similar to the pte version). The corresponding arm64 part allows us to avoid the TLB maintenance if a (huge) page is reused after a write fault. There's EFI refactoring to allow runtime services with preemption enabled and the rest is the usual perf/PMU updates and several cleanups/typos. Summary: Core features: - Basic Arm MPAM (Memory system resource Partitioning And Monitoring) driver under drivers/resctrl/ which makes use of the fs/rectrl/ API Perf and PMU: - Avoid cycle counter on multi-threaded CPUs - Extend CSPMU device probing and add additional filtering support for NVIDIA implementations - Add support for the PMUs on the NoC S3 interconnect - Add additional compatible strings for new Cortex and C1 CPUs - Add support for data source filtering to the SPE driver - Add support for i.MX8QM and "DB" PMU in the imx PMU driver Memory managemennt: - Avoid broadcast TLBI if page reused in write fault - Elide TLB invalidation if the old PTE was not valid - Drop redundant cpu_set__tcr_t0sz() macros - Propagate pgtable_alloc() errors outside of __create_pgd_mapping() - Propagate return value from __change_memory_common() ACPI and EFI: - Call EFI runtime services without disabling preemption - Remove unused ACPI function Miscellaneous: - ptrace support to disable streaming on SME-only systems - Improve sysreg generation to include a 'Prefix' descriptor - Replace __ASSEMBLY__ with __ASSEMBLER__ - Align register dumps in the kselftest zt-test - Remove some no longer used macros/functions - Various spelling corrections" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (94 commits) arm64/mm: Document why linear map split failure upon vm_reset_perms is not problematic arm64/pageattr: Propagate return value from __change_memory_common arm64/sysreg: Remove unused define ARM64_FEATURE_FIELD_BITS KVM: arm64: selftests: Consider all 7 possible levels of cache KVM: arm64: selftests: Remove ARM64_FEATURE_FIELD_BITS and its last user arm64: atomics: lse: Remove unused parameters from ATOMIC_FETCH_OP_AND macros Documentation/arm64: Fix the typo of register names ACPI: GTDT: Get rid of acpi_arch_timer_mem_init() perf: arm_spe: Add support for filtering on data source perf: Add perf_event_attr::config4 perf/imx_ddr: Add support for PMU in DB (system interconnects) perf/imx_ddr: Get and enable optional clks perf/imx_ddr: Move ida_alloc() from ddr_perf_init() to ddr_perf_probe() dt-bindings: perf: fsl-imx-ddr: Add compatible string for i.MX8QM, i.MX8QXP and i.MX8DXL arm64: remove duplicate ARCH_HAS_MEM_ENCRYPT arm64: mm: use untagged address to calculate page index MAINTAINERS: new entry for MPAM Driver arm_mpam: Add kunit tests for props_mismatch() arm_mpam: Add kunit test for bitmap reset arm_mpam: Add helper to reset saved mbwu state ...	2025-12-02 17:03:55 -08:00
Paolo Bonzini	f58e70cc31	KVM/arm64 updates for 6.19 - Support for userspace handling of synchronous external aborts (SEAs), allowing the VMM to potentially handle the abort in a non-fatal manner. - Large rework of the VGIC's list register handling with the goal of supporting more active/pending IRQs than available list registers in hardware. In addition, the VGIC now supports EOImode==1 style deactivations for IRQs which may occur on a separate vCPU than the one that acked the IRQ. - Support for FEAT_XNX (user / privileged execute permissions) and FEAT_HAF (hardware update to the Access Flag) in the software page table walkers and shadow MMU. - Allow page table destruction to reschedule, fixing long need_resched latencies observed when destroying a large VM. - Minor fixes to KVM and selftests -----BEGIN PGP SIGNATURE----- iIgEABYKADAWIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCaS3m5RIcb3VwdG9uQGtl cm5lbC5vcmcACgkQor51iCR83Rb4NAD8C1fGoiCErb6htQMHf1I7ua0ThdIx7OnY Mk1EysNWu94BAI/VKEYgz+UC5uapHh+gnsoOdVTMJZedI/OPrnKa3QIA =/Vl1 -----END PGP SIGNATURE----- Merge tag 'kvmarm-6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 6.19 - Support for userspace handling of synchronous external aborts (SEAs), allowing the VMM to potentially handle the abort in a non-fatal manner. - Large rework of the VGIC's list register handling with the goal of supporting more active/pending IRQs than available list registers in hardware. In addition, the VGIC now supports EOImode==1 style deactivations for IRQs which may occur on a separate vCPU than the one that acked the IRQ. - Support for FEAT_XNX (user / privileged execute permissions) and FEAT_HAF (hardware update to the Access Flag) in the software page table walkers and shadow MMU. - Allow page table destruction to reschedule, fixing long need_resched latencies observed when destroying a large VM. - Minor fixes to KVM and selftests	2025-12-02 18:36:26 +01:00
Oliver Upton	938309b028	Merge branch 'kvm-arm64/vgic-lr-overflow' into kvmarm/next * kvm-arm64/vgic-lr-overflow: (50 commits) : Support for VGIC LR overflows, courtesy of Marc Zyngier : : Address deficiencies in KVM's GIC emulation when a vCPU has more active : IRQs than can be represented in the VGIC list registers. Sort the AP : list to prioritize inactive and pending IRQs, potentially spilling : active IRQs outside of the LRs. : : Handle deactivation of IRQs outside of the LRs for both EOImode=0/1, : which involves special consideration for SPIs being deactivated from a : different vCPU than the one that acked it. KVM: arm64: Convert ICH_HCR_EL2_TDIR cap to EARLY_LOCAL_CPU_FEATURE KVM: arm64: selftests: vgic_irq: Add timer deactivation test KVM: arm64: selftests: vgic_irq: Add Group-0 enable test KVM: arm64: selftests: vgic_irq: Add asymmetric SPI deaectivation test KVM: arm64: selftests: vgic_irq: Perform EOImode==1 deactivation in ack order KVM: arm64: selftests: vgic_irq: Remove LR-bound limitation KVM: arm64: selftests: vgic_irq: Exclude timer-controlled interrupts KVM: arm64: selftests: vgic_irq: Change configuration before enabling interrupt KVM: arm64: selftests: vgic_irq: Fix GUEST_ASSERT_IAR_EMPTY() helper KVM: arm64: selftests: gic_v3: Disable Group-0 interrupts by default KVM: arm64: selftests: gic_v3: Add irq group setting helper KVM: arm64: GICv2: Always trap GICV_DIR register KVM: arm64: GICv2: Handle deactivation via GICV_DIR traps KVM: arm64: GICv2: Handle LR overflow when EOImode==0 KVM: arm64: GICv3: Force exit to sync ICH_HCR_EL2.En KVM: arm64: GICv3: nv: Plug L1 LR sync into deactivation primitive KVM: arm64: GICv3: nv: Resync LRs/VMCR/HCR early for better MI emulation KVM: arm64: GICv3: Avoid broadcast kick on CPUs lacking TDIR KVM: arm64: GICv3: Handle in-LR deactivation when possible KVM: arm64: GICv3: Add SPI tracking to handle asymmetric deactivation ... Signed-off-by: Oliver Upton <oupton@kernel.org>	2025-12-01 00:47:32 -08:00
Oliver Upton	11b8e6edc1	Merge branch 'kvm-arm64/sea-user' into kvmarm/next * kvm-arm64/sea-user: : Userspace handling of SEAs, courtesy of Jiaqi Yan : : Add support for processing external aborts in userspace in situations : where the host has failed to do so, allowing the VMM to potentially : reinject an external abort into the VM. Documentation: kvm: new UAPI for handling SEA KVM: selftests: Test for KVM_EXIT_ARM_SEA KVM: arm64: VM exit to userspace to handle SEA Signed-off-by: Oliver Upton <oupton@kernel.org>	2025-12-01 00:47:20 -08:00
Paolo Bonzini	de8e8ebb1a	KVM TDX changes for 6.19: - Overhaul the TDX code to address systemic races where KVM (acting on behalf of userspace) could inadvertantly trigger lock contention in the TDX-Module, which KVM was either working around in weird, ugly ways, or was simply oblivious to (as proven by Yan tripping several KVM_BUG_ON()s with clever selftests). - Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a vCPU if creating said vCPU failed partway through. - Fix a few sparse warnings (bad annotation, 0 != NULL). - Use struct_size() to simplify copying capabilities to userspace. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmkmVkAACgkQOlYIJqCj N/18Ow//cWPmXAdJcM0fRtnSGwzIZszGSD63htgdh5UDeJIFVyUGKH7uGhndQUwK Uo8jCJ4ikwMxDdCijv+e4eqCCMZjb7HQhFKaauPVCJZOhmZn0br3EB5xX24Qgp8R YN5gTheiTCHHVaxAMl9grgi1xTRi6pJRufRebOmtyGKNQkclctXcuSdtw7IEhqdM wKM3eyb7qUhUrmt5tBkSyFAioGcPJIHE3vqLjImqDgduinbXJdQa1sek4Br0sX45 rfISZ2geXDj/Sh7EPrPU1ne5LQbtgzp1WTG6MRCidYfP86riMQUlEMY6odEYAgIX kCd+z248OJShF5EYcEmjc894YLHJ0vVXIXKx/qh0+Jiobz3bujk+whaxTNa26rj0 3qLPGzFpYugtxkGqBYH4q90oUTovEk4922+jPsQ9GKQ26f0q3XzvriEUSOgrvo0Z O26OyK7BezqSM5WMMSf/EGI1ESuli5lbBLYDOaNZS35di2YcDEgtaikRETpWwy82 TGxrjyeW9Pu6M3iTtQsOVHNxA4hU//Qd5HcDj5rcXOg1rgiPV9n2OaCEMwc6qi+V VytbGm4IlMsz6AVHqyv3SUIt1Z4LNAZ/FwK8oeBRVd6LNfm6nfyrW6eQFQVLoIpA 1nyi9XjMg7xj6ubiSEQSTSl9gto8FzVWwLKwZ8dLH7SPvqlz+zY= =qGpA -----END PGP SIGNATURE----- Merge tag 'kvm-x86-tdx-6.19' of https://github.com/kvm-x86/linux into HEAD KVM TDX changes for 6.19: - Overhaul the TDX code to address systemic races where KVM (acting on behalf of userspace) could inadvertantly trigger lock contention in the TDX-Module, which KVM was either working around in weird, ugly ways, or was simply oblivious to (as proven by Yan tripping several KVM_BUG_ON()s with clever selftests). - Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a vCPU if creating said vCPU failed partway through. - Fix a few sparse warnings (bad annotation, 0 != NULL). - Use struct_size() to simplify copying capabilities to userspace.	2025-11-26 09:36:37 +01:00
Marc Zyngier	cd4f6ee99b	KVM: arm64: GICv3: Handle deactivation via ICV_DIR_EL1 traps Deactivation via ICV_DIR_EL1 is both relatively straightforward (we have the interrupt that needs deactivation) and really awkward. The main issue is that the interrupt may either be in an LR on another CPU, or ourside of any LR. In the former case, we process the deactivation is if ot was a write to GICD_CACTIVERn, which is already implemented as a big hammer IPI'ing all vcpus. In the latter case, we just perform a normal deactivation, similar to what we do for EOImode==0. Another annoying aspect is that we need to tell the CPU owning the interrupt that its ap_list needs laudering. We use a brand new vcpu request to that effect. Note that this doesn't address deactivation via the GICV MMIO view, which will be taken care of in a later change. Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-29-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>	2025-11-24 14:29:13 -08:00
Marc Zyngier	cf72ee6371	KVM: arm64: Eagerly save VMCR on exit We currently save/restore the VMCR register in a pretty lazy way (on load/put, consistently with what we do with the APRs). However, we are going to need the group-enable bits that are backed by VMCR on each entry (so that we can avoid injecting interrupts for disabled groups). Move the synchronisation from put to sync, which results in some minor churn in the nVHE hypercalls to simplify things. Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-21-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>	2025-11-24 14:29:13 -08:00
Oliver Upton	cb17d79ff5	KVM: arm64: Use kvzalloc() for kvm struct allocation Physically-allocated KVM structs aren't necessary when in VHE mode as there's no need to share with the hyp's address space. Of course, there can still be a performance benefit from physical allocations. Use kvzalloc() for opportunistic physical allocations. Acked-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Joey Gouly <joey.gouly@arm.com> Link: https://msgid.link/20251119093822.2513142-3-oupton@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>	2025-11-19 12:20:57 -08:00
Oliver Upton	297877069b	KVM: arm64: Drop useless __GFP_HIGHMEM from kvm struct allocation A recent change on the receiving end of vmalloc() started warning about unsupported GFP flags passed by the caller. Nathan reports that this warning fires in kvm_arch_alloc_vm(), owing to the fact that KVM is passing a meaningless __GFP_HIGHMEM. Do as the warning says and fix the code. Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/kvmarm/20251118224448.GA998046@ax162/ Acked-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Joey Gouly <joey.gouly@arm.com> Link: https://msgid.link/20251119093822.2513142-2-oupton@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>	2025-11-19 12:20:56 -08:00
Bo Liu	337f7e3a4b	arm64: Fix double word in comments Remove the repeated word "the" in comments. Signed-off-by: Bo Liu <liubo03@inspur.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2025-11-12 17:07:59 +00:00
Alexandru Elisei	85592114ff	KVM: arm64: VHE: Compute fgt traps before activating them On VHE, the Fine Grain Traps registers are written to hardware in kvm_arch_vcpu_load()->..->__activate_traps_hfgxtr(), but the fgt array is computed later, in kvm_vcpu_load_fgt(). This can lead to zero being written to the FGT registers the first time a VCPU is loaded. Also, any changes to the fgt array will be visible only after the VCPU is scheduled out, and then back in, which is not the intended behaviour. Fix it by computing the fgt array just before the fgt traps are written to hardware. Fixes: `fb10ddf35c` ("KVM: arm64: Compute per-vCPU FGTs at vcpu_load()") Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Oliver Upton <oupton@kernel.org> Link: https://patch.msgid.link/20251112102853.47759-1-alexandru.elisei@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-11-12 10:52:58 +00:00
Jiaqi Yan	ad9c62bd89	KVM: arm64: VM exit to userspace to handle SEA When APEI fails to handle a stage-2 synchronous external abort (SEA), today KVM injects an asynchronous SError to the VCPU then resumes it, which usually results in unpleasant guest kernel panic. One major situation of guest SEA is when vCPU consumes recoverable uncorrected memory error (UER). Although SError and guest kernel panic effectively stops the propagation of corrupted memory, guest may re-use the corrupted memory if auto-rebooted; in worse case, guest boot may run into poisoned memory. So there is room to recover from an UER in a more graceful manner. Alternatively KVM can redirect the synchronous SEA event to VMM to - Reduce blast radius if possible. VMM can inject a SEA to VCPU via KVM's existing KVM_SET_VCPU_EVENTS API. If the memory poison consumption or fault is not from guest kernel, blast radius can be limited to the triggering thread in guest userspace, so VM can keep running. - Allow VMM to protect from future memory poison consumption by unmapping the page from stage-2, or to interrupt guest of the poisoned page so guest kernel can unmap it from stage-1 page table. - Allow VMM to track SEA events that VM customers care about, to restart VM when certain number of distinct poison events have happened, to provide observability to customers in log management UI. Introduce an userspace-visible feature to enable VMM handle SEA: - KVM_CAP_ARM_SEA_TO_USER. As the alternative fallback behavior when host APEI fails to claim a SEA, userspace can opt in this new capability to let KVM exit to userspace during SEA if it is not owned by host. - KVM_EXIT_ARM_SEA. A new exit reason is introduced for this. KVM fills kvm_run.arm_sea with as much as possible information about the SEA, enabling VMM to emulate SEA to guest by itself. - Sanitized ESR_EL2. The general rule is to keep only the bits useful for userspace and relevant to guest memory. - Flags indicating if faulting guest physical address is valid. - Faulting guest physical and virtual addresses if valid. Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> Co-developed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://msgid.link/20251013185903.1372553-2-jiaqiyan@google.com Signed-off-by: Oliver Upton <oupton@kernel.org>	2025-11-12 01:27:12 -08:00
Sean Christopherson	50efc2340a	KVM: Rename kvm_arch_vcpu_async_ioctl() to kvm_arch_vcpu_unlocked_ioctl() Rename the "async" ioctl API to "unlocked" so that upcoming usage in x86's TDX code doesn't result in a massive misnomer. To avoid having to retry SEAMCALLs, TDX needs to acquire kvm->lock and all vcpu->mutex locks, and acquiring all of those locks after/inside the current vCPU's mutex is a non-starter. However, TDX also needs to acquire the vCPU's mutex and load the vCPU, i.e. the handling is very much not async to the vCPU. No functional change intended. Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-05 11:03:11 -08:00
Sean Christopherson	0a0da3f921	KVM: Make support for kvm_arch_vcpu_async_ioctl() mandatory Implement kvm_arch_vcpu_async_ioctl() "natively" in x86 and arm64 instead of relying on an #ifdef'd stub, and drop HAVE_KVM_VCPU_ASYNC_IOCTL in anticipation of using the API on x86. Once x86 uses the API, providing a stub for one architecture and having all other architectures opt-in requires more code than simply implementing the API in the lone holdout. Eliminating the Kconfig will also reduce churn if the API is renamed in the future (spoiler alert). No functional change intended. Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-05 11:03:10 -08:00
Oliver Upton	fb10ddf35c	KVM: arm64: Compute per-vCPU FGTs at vcpu_load() To date KVM has used the fine-grained traps for the sake of UNDEF enforcement (so-called FGUs), meaning the constituent parts could be computed on a per-VM basis and folded into the effective value when programmed. Prepare for traps changing based on the vCPU context by computing the whole mess of them at vcpu_load(). Aggressively inline all the helpers to preserve the build-time checks that were there before. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Joey Gouly <joey.gouly@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-10-13 14:44:37 +01:00
Oliver Upton	0aa1b76fe1	KVM: arm64: Prevent access to vCPU events before init Another day, another syzkaller bug. KVM erroneously allows userspace to pend vCPU events for a vCPU that hasn't been initialized yet, leading to KVM interpreting a bunch of uninitialized garbage for routing / injecting the exception. In one case the injection code and the hyp disagree on whether the vCPU has a 32bit EL1 and put the vCPU into an illegal mode for AArch64, tripping the BUG() in exception_target_el() during the next injection: kernel BUG at arch/arm64/kvm/inject_fault.c:40! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP CPU: 3 UID: 0 PID: 318 Comm: repro Not tainted 6.17.0-rc4-00104-g10fd0285305d #6 PREEMPT Hardware name: linux,dummy-virt (DT) pstate: 21402009 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) pc : exception_target_el+0x88/0x8c lr : pend_serror_exception+0x18/0x13c sp : ffff800082f03a10 x29: ffff800082f03a10 x28: ffff0000cb132280 x27: 0000000000000000 x26: 0000000000000000 x25: ffff0000c2a99c20 x24: 0000000000000000 x23: 0000000000008000 x22: 0000000000000002 x21: 0000000000000004 x20: 0000000000008000 x19: ffff0000c2a99c20 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 00000000200000c0 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000 x8 : ffff800082f03af8 x7 : 0000000000000000 x6 : 0000000000000000 x5 : ffff800080f621f0 x4 : 0000000000000000 x3 : 0000000000000000 x2 : 000000000040009b x1 : 0000000000000003 x0 : ffff0000c2a99c20 Call trace: exception_target_el+0x88/0x8c (P) kvm_inject_serror_esr+0x40/0x3b4 __kvm_arm_vcpu_set_events+0xf0/0x100 kvm_arch_vcpu_ioctl+0x180/0x9d4 kvm_vcpu_ioctl+0x60c/0x9f4 __arm64_sys_ioctl+0xac/0x104 invoke_syscall+0x48/0x110 el0_svc_common.constprop.0+0x40/0xe0 do_el0_svc+0x1c/0x28 el0_svc+0x34/0xf0 el0t_64_sync_handler+0xa0/0xe4 el0t_64_sync+0x198/0x19c Code: f946bc01 b4fffe61 9101e020 17fffff2 (d4210000) Reject the ioctls outright as no sane VMM would call these before KVM_ARM_VCPU_INIT anyway. Even if it did the exception would've been thrown away by the eventual reset of the vCPU's state. Cc: stable@vger.kernel.org # 6.17 Fixes: `b7b27facc7` ("arm/arm64: KVM: Add KVM_GET/SET_VCPU_EVENTS") Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-10-13 14:17:03 +01:00
Linus Torvalds	2215336295	hyperv-next for v6.18 -----BEGIN PGP SIGNATURE----- iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmjkpakTHHdlaS5saXVA a2VybmVsLm9yZwAKCRB2FHBfkEGgXip5B/48MvTFJ1qwRGPVzevZQ8Z4SDogEREp 69VS/xRf1YCIzyXyanwqf1dXLq8NAqicSp6ewpJAmNA55/9O0cwT2EtohjeGCu61 krPIvS3KT7xI0uSEniBdhBtALYBscnQ0e3cAbLNzL7bwA6Q6OmvoIawpBADgE/cW aZNCK9jy+WUqtXc6lNtkJtST0HWGDn0h04o2hjqIkZ+7ewjuEEJBUUB/JZwJ41Od UxbID0PAcn9O4n/u/Y/GH65MX+ddrdCgPHEGCLAGAKT24lou3NzVv445OuCw0c4W ilALIRb9iea56ZLVBW5O82+7g9Ag41LGq+841MNlZjeRNONGykaUpTWZ =OR26 -----END PGP SIGNATURE----- Merge tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - Unify guest entry code for KVM and MSHV (Sean Christopherson) - Switch Hyper-V MSI domain to use msi_create_parent_irq_domain() (Nam Cao) - Add CONFIG_HYPERV_VMBUS and limit the semantics of CONFIG_HYPERV (Mukesh Rathor) - Add kexec/kdump support on Azure CVMs (Vitaly Kuznetsov) - Deprecate hyperv_fb in favor of Hyper-V DRM driver (Prasanna Kumar T S M) - Miscellaneous enhancements, fixes and cleanups (Abhishek Tiwari, Alok Tiwari, Nuno Das Neves, Wei Liu, Roman Kisel, Michael Kelley) * tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: hyperv: Remove the spurious null directive line MAINTAINERS: Mark hyperv_fb driver Obsolete fbdev/hyperv_fb: deprecate this in favor of Hyper-V DRM driver Drivers: hv: Make CONFIG_HYPERV bool Drivers: hv: Add CONFIG_HYPERV_VMBUS option Drivers: hv: vmbus: Fix typos in vmbus_drv.c Drivers: hv: vmbus: Fix sysfs output format for ring buffer index Drivers: hv: vmbus: Clean up sscanf format specifier in target_cpu_store() x86/hyperv: Switch to msi_create_parent_irq_domain() mshv: Use common "entry virt" APIs to do work in root before running guest entry: Rename "kvm" entry code assets to "virt" to genericize APIs entry/kvm: KVM: Move KVM details related to signal/-EINTR into KVM proper mshv: Handle NEED_RESCHED_LAZY before transferring to guest x86/hyperv: Add kexec/kdump support on Azure CVMs Drivers: hv: Simplify data structures for VMBus channel close message Drivers: hv: util: Cosmetic changes for hv_utils_transport.c mshv: Add support for a new parent partition configuration clocksource: hyper-v: Skip unnecessary checks for the root partition hyperv: Add missing field to hv_output_map_device_interrupt	2025-10-07 08:40:15 -07:00
Sean Christopherson	6d0386ea99	entry/kvm: KVM: Move KVM details related to signal/-EINTR into KVM proper Move KVM's morphing of pending signals into userspace exits into KVM proper, and drop the @vcpu param from xfer_to_guest_mode_handle_work(). How KVM responds to -EINTR is a detail that really belongs in KVM itself, and invoking kvm_handle_signal_exit() from kernel code creates an inverted module dependency. E.g. attempting to move kvm_handle_signal_exit() into kvm_main.c would generate an linker error when building kvm.ko as a module. Dropping KVM details will also converting the KVM "entry" code into a more generic virtualization framework so that it can be used when running as a Hyper-V root partition. Lastly, eliminating usage of "struct kvm_vcpu" outside of KVM is also nice to have for KVM x86 developers, as keeping the details of kvm_vcpu purely within KVM allows changing the layout of the structure without having to boot into a new kernel, e.g. allows rebuilding and reloading kvm.ko with a modified kvm_vcpu structure as part of debug/development. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Wei Liu <wei.liu@kernel.org>	2025-09-30 22:50:18 +00:00
Paolo Bonzini	924ebaefce	KVM/arm64 updates for 6.18 - Add support for FF-A 1.2 as the secure memory conduit for pKVM, allowing more registers to be used as part of the message payload. - Change the way pKVM allocates its VM handles, making sure that the privileged hypervisor is never tricked into using uninitialised data. - Speed up MMIO range registration by avoiding unnecessary RCU synchronisation, which results in VMs starting much quicker. - Add the dump of the instruction stream when panic-ing in the EL2 payload, just like the rest of the kernel has always done. This will hopefully help debugging non-VHE setups. - Add 52bit PA support to the stage-1 page-table walker, and make use of it to populate the fault level reported to the guest on failing to translate a stage-1 walk. - Add NV support to the GICv3-on-GICv5 emulation code, ensuring feature parity for guests, irrespective of the host platform. - Fix some really ugly architecture problems when dealing with debug in a nested VM. This has some bad performance impacts, but is at least correct. - Add enough infrastructure to be able to disable EL2 features and give effective values to the EL2 control registers. This then allows a bunch of features to be turned off, which helps cross-host migration. - Large rework of the selftest infrastructure to allow most tests to transparently run at EL2. This is the first step towards enabling NV testing. - Various fixes and improvements all over the map, including one BE fix, just in time for the removal of the feature. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmjVhSsACgkQI9DQutE9 ekOSIg//YdbA/zo17GrLnJnlGdUnS0e3/357n3e5Lypx1UFRDTmNacpVPw4VG/jt eVpQn7AgYwyvKfCq46eD+hBBqNv1XTn4DeXttv7CmVqhCRythsEvDkTBWSt7oYUZ xfYXCMKNhqUElH4AbYYx3y7nb2E9/KVGr+NBn6Vf5c14OZ3MGVc/fyp4jM1ih5dR kcV2onAYohlIGvFEyZMBtJ+jkYkqfIbfxfqCL0RAET0aEBFcmM1aXybWZj47hlLM f2j+E6cFQ0ZzUt+3pFhT75wo43lHGtIFDjVd60uishyU+NXTVvqRmXDTRU4k546W 18HHX1yijbzuXIatVhVRo2hIq3jKU37T9wtj46BejbDHRdAPENEyN/Qopm7rNS+X mCwOT7He6KR+H4rU6nFaTcsS7bNRCvIbZP9i9zb6NElbvXu5QnM8BUQsYFCDUa/n xtbtQlckbo/7zeoUsBDrGj2XmCf0d45FTHb7fdWOYEmMSmJhXYpUKdM4JcLyhKoQ DD0ox2S+pt2lwNw3XOSABdES0KJxCvDASAMIgn2h2sGpY8FxsBcVW/BufXopdafG UeInxWaILp2iCDM4tH2GLjKqlvMAOwcA+mAEZToXypxlJAnYA6J1pXCF8WEaM6+D BGrLli8Zd8JRs87byq6K7tp8oLNzZdliJp73j5jfOHTJA4MnvFI= =iAL3 -----END PGP SIGNATURE----- Merge tag 'kvmarm-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 6.18 - Add support for FF-A 1.2 as the secure memory conduit for pKVM, allowing more registers to be used as part of the message payload. - Change the way pKVM allocates its VM handles, making sure that the privileged hypervisor is never tricked into using uninitialised data. - Speed up MMIO range registration by avoiding unnecessary RCU synchronisation, which results in VMs starting much quicker. - Add the dump of the instruction stream when panic-ing in the EL2 payload, just like the rest of the kernel has always done. This will hopefully help debugging non-VHE setups. - Add 52bit PA support to the stage-1 page-table walker, and make use of it to populate the fault level reported to the guest on failing to translate a stage-1 walk. - Add NV support to the GICv3-on-GICv5 emulation code, ensuring feature parity for guests, irrespective of the host platform. - Fix some really ugly architecture problems when dealing with debug in a nested VM. This has some bad performance impacts, but is at least correct. - Add enough infrastructure to be able to disable EL2 features and give effective values to the EL2 control registers. This then allows a bunch of features to be turned off, which helps cross-host migration. - Large rework of the selftest infrastructure to allow most tests to transparently run at EL2. This is the first step towards enabling NV testing. - Various fixes and improvements all over the map, including one BE fix, just in time for the removal of the feature.	2025-09-30 13:23:28 -04:00
Marc Zyngier	d9476fd356	Merge branch kvm-arm64/gic-v5-nv into kvmarm-master/next * kvm-arm64/gic-v5-nv: : . : Add NV support to GICv5 in GICv3 emulation mode, ensuring that the v3 : guest support is identical to that of a pure v3 platform. : : Patches courtesy of Sascha Bischoff (20250828105925.3865158-1-sascha.bischoff@arm.com) : . irqchip/gic-v5: Drop has_gcie_v3_compat from gic_kvm_info KVM: arm64: Use ARM64_HAS_GICV5_LEGACY for GICv5 probing arm64: cpucaps: Add GICv5 Legacy vCPU interface (GCIE_LEGACY) capability KVM: arm64: Enable nested for GICv5 host with FEAT_GCIE_LEGACY KVM: arm64: Don't access ICC_SRE_EL2 if GICv3 doesn't support v2 compatibility Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-09-20 12:26:05 +01:00
Sascha Bischoff	d5a012af34	KVM: arm64: Enable nested for GICv5 host with FEAT_GCIE_LEGACY Extend the NV check to pass for a GICv5 host that has FEAT_GCIE_LEGACY. The has_gcie_v3_compat flag is only set on GICv5 hosts (that explicitly support FEAT_GCIE_LEGACY), and hence the explicit check for a VGIC_V5 is omitted. As of this change, vGICv3-based VMs can run with nested on a compatible GICv5 host. Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-09-17 17:41:02 +01:00
Marc Zyngier	3064cee8c3	Merge branch kvm-arm64/pkvm_vm_handle into kvmarm-master/next * kvm-arm64/pkvm_vm_handle: : pKVM VM handle allocation fixes, courtesy of Fuad Tabba. : : From the cover letter (20250909072437.4110547-1-tabba@google.com): : : "In pKVM, this handle is allocated when the VM is initialized at the : hypervisor, which is on the first vCPU run. However, the host starts : initializing the VM and setting up its data structures earlier. MMU : notifiers for the VMs are also registered before VM initialization at : the hypervisor, and rely on the handle to identify the VM. : : Therefore, there is a potential gap between when the VM is (partially) : setup at the host, but still without a valid pKVM handle to identify it : when communicating with the hypervisor." KVM: arm64: Reserve pKVM handle during pkvm_init_host_vm() KVM: arm64: Introduce separate hypercalls for pKVM VM reservation and initialization KVM: arm64: Consolidate pKVM hypervisor VM initialization logic KVM: arm64: Separate allocation and insertion of pKVM VM table entries KVM: arm64: Decouple hyp VM creation state from its handle KVM: arm64: Clarify comments to distinguish pKVM mode from protected VMs KVM: arm64: Rename 'host_kvm' to 'kvm' in pKVM host code KVM: arm64: Rename pkvm.enabled to pkvm.is_protected KVM: arm64: Add build-time check for duplicate DECLARE_REG use Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-09-15 10:49:04 +01:00
Fuad Tabba	07aeb70707	KVM: arm64: Reserve pKVM handle during pkvm_init_host_vm() When a pKVM guest is active, TLB invalidations triggered by host MMU notifiers require a valid hypervisor handle. Currently, this handle is only allocated when the first vCPU is run. However, the guest's memory is associated with the host MMU much earlier, during kvm_arch_init_vm(). This creates a window where an MMU invalidation could occur after the kvm_pgtable pointer checked by the notifiers is set but before the pKVM handle has been created. Fix this by reserving the pKVM handle when the host VM is first set up. Move the call to the __pkvm_reserve_vm hypercall from the first-vCPU-run path into pkvm_init_host_vm(), which is called during initial VM setup. This ensures the handle is available before any subsystem can trigger an MMU notification for the VM. The VM destruction path is updated to call __pkvm_unreserve_vm for cases where a VM was reserved but never fully created at the hypervisor, ensuring the handle is properly released. This fix leverages the two-stage reservation/initialization hypercall interface introduced in preceding patches. Signed-off-by: Fuad Tabba <tabba@google.com> Tested-by: Mark Brown <broonie@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-09-15 10:46:55 +01:00
Alexandru Elisei	efad60e460	KVM: arm64: Initialize PMSCR_EL1 when in VHE According to the pseudocode for StatisticalProfilingEnabled() from Arm DDI0487L.b, PMSCR_EL1 controls profiling at EL1 and EL0: - PMSCR_EL1.E1SPE controls profiling at EL1. - PMSCR_EL1.E0SPE controls profiling at EL0 if HCR_EL2.TGE=0. These two fields reset to UNKNOWN values. When KVM runs in VHE mode and profiling is enabled in the host, before entering a guest, KVM does not touch any of the SPE registers, leaving the buffer enabled, and it clears HCR_EL2.TGE. As a result, depending on the reset value for the E1SPE and E0SPE fields, KVM might unintentionally profile a guest. Make the behaviour consistent and predictable by clearing PMSCR_EL1 when KVM initialises the host debug configuration. Note that this is not a problem for nVHE, because KVM clears PMSCR_EL1.{E1SPE,E0SPE} before entering the guest. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20250902130833.338216-2-alexandru.elisei@arm.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-09-10 02:56:19 -07:00
Paolo Bonzini	42a0305ab1	KVM/arm64 changes for 6.17, take #2 - Correctly handle 'invariant' system registers for protected VMs - Improved handling of VNCR data aborts, including external aborts - Fixes for handling of FEAT_RAS for NV guests, providing a sane fault context during SEA injection and preventing the use of RASv1p1 fault injection hardware - Ensure that page table destruction when a VM is destroyed gives an opportunity to reschedule - Large fix to KVM's infrastructure for managing guest context loaded on the CPU, addressing issues where the output of AT emulation doesn't get reflected to the guest - Fix AT S12 emulation to actually perform stage-2 translation when necessary - Avoid attempting vLPI irqbypass when GICv4 has been explicitly disabled for a VM - Minor KVM + selftest fixes -----BEGIN PGP SIGNATURE----- iI0EABYIADUWIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCaLC0JBccb2xpdmVyLnVw dG9uQGxpbnV4LmRldgAKCRCivnWIJHzdFogJAQCyxHd5tuvXWWT/iC2EYFlPWYkU LOQbNhus16QjQ9f2ggD8CoA+6UAxzYW7ZU6IzYkDhJkN/3dKQEQhh8Cx0GXXRAs= =uky+ -----END PGP SIGNATURE----- Merge tag 'kvmarm-fixes-6.17-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 changes for 6.17, take #2 - Correctly handle 'invariant' system registers for protected VMs - Improved handling of VNCR data aborts, including external aborts - Fixes for handling of FEAT_RAS for NV guests, providing a sane fault context during SEA injection and preventing the use of RASv1p1 fault injection hardware - Ensure that page table destruction when a VM is destroyed gives an opportunity to reschedule - Large fix to KVM's infrastructure for managing guest context loaded on the CPU, addressing issues where the output of AT emulation doesn't get reflected to the guest - Fix AT S12 emulation to actually perform stage-2 translation when necessary - Avoid attempting vLPI irqbypass when GICv4 has been explicitly disabled for a VM - Minor KVM + selftest fixes	2025-08-29 12:57:31 -04:00
Marc Zyngier	0843e0ced3	KVM: arm64: Get rid of ARM64_FEATURE_MASK() The ARM64_FEATURE_MASK() macro was a hack introduce whilst the automatic generation of sysreg encoding was introduced, and was too unreliable to be entirely trusted. We are in a better place now, and we could really do without this macro. Get rid of it altogether. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250817202158.395078-7-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-08-21 16:31:56 -07:00
Paolo Bonzini	314b40b3b6	KVM/arm64 changes for 6.17, round #1 - Host driver for GICv5, the next generation interrupt controller for arm64, including support for interrupt routing, MSIs, interrupt translation and wired interrupts. - Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on GICv5 hardware, leveraging the legacy VGIC interface. - Userspace control of the 'nASSGIcap' GICv3 feature, allowing userspace to disable support for SGIs w/o an active state on hardware that previously advertised it unconditionally. - Map supporting endpoints with cacheable memory attributes on systems with FEAT_S2FWB and DIC where KVM no longer needs to perform cache maintenance on the address range. - Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the guest hypervisor to inject external aborts into an L2 VM and take traps of masked external aborts to the hypervisor. - Convert more system register sanitization to the config-driven implementation. - Fixes to the visibility of EL2 registers, namely making VGICv3 system registers accessible through the VGIC device instead of the ONE_REG vCPU ioctls. - Various cleanups and minor fixes. -----BEGIN PGP SIGNATURE----- iI0EABYIADUWIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCaIezbRccb2xpdmVyLnVw dG9uQGxpbnV4LmRldgAKCRCivnWIJHzdFr/eAQDY5NIG5cR6ZcAWnPQLmGWpz2ou pq4Jhn9E/mGR3n5L1AEAsJpfLLpOsmnLBdwfbjmW59gGsa8k3i5tjWEOJ6yzAwk= =r+sp -----END PGP SIGNATURE----- Merge tag 'kvmarm-6.17' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 changes for 6.17, round #1 - Host driver for GICv5, the next generation interrupt controller for arm64, including support for interrupt routing, MSIs, interrupt translation and wired interrupts. - Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on GICv5 hardware, leveraging the legacy VGIC interface. - Userspace control of the 'nASSGIcap' GICv3 feature, allowing userspace to disable support for SGIs w/o an active state on hardware that previously advertised it unconditionally. - Map supporting endpoints with cacheable memory attributes on systems with FEAT_S2FWB and DIC where KVM no longer needs to perform cache maintenance on the address range. - Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the guest hypervisor to inject external aborts into an L2 VM and take traps of masked external aborts to the hypervisor. - Convert more system register sanitization to the config-driven implementation. - Fixes to the visibility of EL2 registers, namely making VGICv3 system registers accessible through the VGIC device instead of the ONE_REG vCPU ioctls. - Various cleanups and minor fixes.	2025-07-29 12:27:40 -04:00
Paolo Bonzini	f02b1bcc73	Merge tag 'kvm-x86-irqs-6.17' of https://github.com/kvm-x86/linux into HEAD KVM IRQ changes for 6.17 - Rework irqbypass to track/match producers and consumers via an xarray instead of a linked list. Using a linked list leads to O(n^2) insertion times, which is hugely problematic for use cases that create large numbers of VMs. Such use cases typically don't actually use irqbypass, but eliminating the pointless registration is a future problem to solve as it likely requires new uAPI. - Track irqbypass's "token" as "struct eventfd_ctx " instead of a "void ", to avoid making a simple concept unnecessarily difficult to understand. - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC, PIC, and PIT emulation at compile time. - Drop x86's irq_comm.c, and move a pile of IRQ related code into irq.c. - Fix a variety of flaws and bugs in the AVIC device posted IRQ code. - Inhibited AVIC if a vCPU's ID is too big (relative to what hardware supports) instead of rejecting vCPU creation. - Extend enable_ipiv module param support to SVM, by simply leaving IsRunning clear in the vCPU's physical ID table entry. - Disable IPI virtualization, via enable_ipiv, if the CPU is affected by erratum #1235, to allow (safely) enabling AVIC on such CPUs. - Dedup x86's device posted IRQ code, as the vast majority of functionality can be shared verbatime between SVM and VMX. - Harden the device posted IRQ code against bugs and runtime errors. - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1) instead of O(n). - Generate GA Log interrupts if and only if the target vCPU is blocking, i.e. only if KVM needs a notification in order to wake the vCPU. - Decouple device posted IRQs from VFIO device assignment, as binding a VM to a VFIO group is not a requirement for enabling device posted IRQs. - Clean up and document/comment the irqfd assignment code. - Disallow binding multiple irqfds to an eventfd with a priority waiter, i.e. ensure an eventfd is bound to at most one irqfd through the entire host, and add a selftest to verify eventfd:irqfd bindings are globally unique.	2025-07-29 08:35:46 -04:00
Oliver Upton	3318e42b81	Merge branch 'kvm-arm64/doublefault2' into kvmarm/next * kvm-arm64/doublefault2: (33 commits) : NV Support for FEAT_RAS + DoubleFault2 : : Delegate the vSError context to the guest hypervisor when in a nested : state, including registers related to ESR propagation. Additionally, : catch up KVM's external abort infrastructure to the architecture, : implementing the effects of FEAT_DoubleFault2. : : This has some impact on non-nested guests, as SErrors deemed unmasked at : the time they're made pending are now immediately injected with an : emulated exception entry rather than using the VSE bit. KVM: arm64: Make RAS registers UNDEF when RAS isn't advertised KVM: arm64: Filter out HCR_EL2 bits when running in hypervisor context KVM: arm64: Check for SYSREGS_ON_CPU before accessing the CPU state KVM: arm64: Commit exceptions from KVM_SET_VCPU_EVENTS immediately KVM: arm64: selftests: Test ESR propagation for vSError injection KVM: arm64: Populate ESR_ELx.EC for emulated SError injection KVM: arm64: selftests: Catch up set_id_regs with the kernel KVM: arm64: selftests: Add SCTLR2_EL1 to get-reg-list KVM: arm64: selftests: Test SEAs are taken to SError vector when EASE=1 KVM: arm64: selftests: Add basic SError injection test KVM: arm64: Don't retire MMIO instruction w/ pending (emulated) SError KVM: arm64: Advertise support for FEAT_DoubleFault2 KVM: arm64: Advertise support for FEAT_SCTLR2 KVM: arm64: nv: Enable vSErrors when HCRX_EL2.TMEA is set KVM: arm64: nv: Honor SError routing effects of SCTLR2_ELx.NMEA KVM: arm64: nv: Take "masked" aborts to EL2 when HCRX_EL2.TMEA is set KVM: arm64: Route SEAs to the SError vector when EASE is set KVM: arm64: nv: Ensure Address size faults affect correct ESR KVM: arm64: Factor out helper for selecting exception target EL KVM: arm64: Describe SCTLR2_ELx RESx masks ... Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-07-26 08:47:22 -07:00
Oliver Upton	77ee70a073	KVM: arm64: nv: Honor SError exception routing / masking To date KVM has used HCR_EL2.VSE to track the state of a pending SError for the guest. With this bit set, hardware respects the EL1 exception routing / masking rules and injects the vSError when appropriate. This isn't correct for NV guests as hardware is oblivious to vEL2's intentions for SErrors. Better yet, with FEAT_NV2 the guest can change the routing behind our back as HCR_EL2 is redirected to memory. Cope with this mess by: - Using a flag (instead of HCR_EL2.VSE) to track the pending SError state when SErrors are unconditionally masked for the current context - Resampling the routing / masking of a pending SError on every guest entry/exit - Emulating exception entry when SError routing implies a translation regime change Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250708172532.1699409-7-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-07-08 11:36:31 -07:00
Oliver Upton	aae35f4ffb	KVM: arm64: Treat vCPU with pending SError as runnable Per R_VRLPB, a pending SError is a WFI wakeup event regardless of PSTATE.A, meaning that the vCPU is runnable. Sample VSE in addition to the other IRQ lines. Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250708172532.1699409-5-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-07-08 10:40:30 -07:00
Marc Zyngier	1d6fea7663	KVM: arm64: Add helper to identify a nested context A common idiom in the KVM code is to check if we are currently dealing with a "nested" context, defined as having NV enabled, but being in the EL1&0 translation regime. This is usually expressed as: if (vcpu_has_nv(vcpu) && !is_hyp_ctxt(vcpu) ... ) which is a mouthful and a bit hard to read, specially when followed by additional conditions. Introduce a new helper that encapsulate these two terms, allowing the above to be written as if (is_nested_context(vcpu) ... ) which is both shorter and easier to read, and makes more obvious the potential for simplification on some code paths. Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250708172532.1699409-4-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-07-08 10:40:30 -07:00
Ankit Agrawal	f55ce5a6cd	KVM: arm64: Expose new KVM cap for cacheable PFNMAP Introduce a new KVM capability to expose to the userspace whether cacheable mapping of PFNMAP is supported. The ability to safely do the cacheable mapping of PFNMAP is contingent on S2FWB and ARM64_HAS_CACHE_DIC. S2FWB allows KVM to avoid flushing the D cache, ARM64_HAS_CACHE_DIC allows KVM to avoid flushing the icache and turns icache_inval_pou() into a NOP. The cap would be false if those requirements are missing and is checked by making use of kvm_arch_supports_cacheable_pfnmap. This capability would allow userspace to discover the support. It could for instance be used by userspace to prevent live-migration across FWB and non-FWB hosts. CC: Catalin Marinas <catalin.marinas@arm.com> CC: Jason Gunthorpe <jgg@nvidia.com> CC: Oliver Upton <oliver.upton@linux.dev> CC: David Hildenbrand <david@redhat.com> Suggested-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Donald Dutile <ddutile@redhat.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20250705071717.5062-7-ankita@nvidia.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-07-07 16:54:52 -07:00
Mark Rutland	42ce432522	KVM: arm64: Remove kvm_arch_vcpu_run_map_fp() Historically KVM hyp code saved the host's FPSIMD state into the hosts's fpsimd_state memory, and so it was necessary to map this into the hyp Stage-1 mappings before running a vCPU. This is no longer necessary as of commits: * `fbc7e61195` ("KVM: arm64: Unconditionally save+flush host FPSIMD/SVE/SME state") * `8eca7f6d51` ("KVM: arm64: Remove host FPSIMD saving for non-protected KVM") Since those commits, we eagerly save the host's FPSIMD state before calling into hyp to run a vCPU, and hyp code never reads nor writes the host's fpsimd_state memory. There's no longer any need to map the host's fpsimd_state memory into the hyp Stage-1, and kvm_arch_vcpu_run_map_fp() is unnecessary but benign. Remove kvm_arch_vcpu_run_map_fp(). Currently there is no code to perform a corresponding unmap, and we never mapped the host's SVE or SME state into the hyp Stage-1, so no other code needs to be removed. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Will Deacon <will@kernel.org> Cc: kvmarm@lists.linux.dev Reviewed-by: Mark Brown <broonie@kernel.org> Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Link: https://lore.kernel.org/r/20250619134817.4075340-1-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-07-03 10:39:24 +01:00
Quentin Perret	0e02219f9c	KVM: arm64: Don't free hyp pages with pKVM on GICv2 Marc reported that enabling protected mode on a device with GICv2 doesn't fail gracefully as one would expect, and leads to a host kernel crash. As it turns out, the first half of pKVM init happens before the vgic probe, and so by the time we find out we have a GICv2 we're already committed to keeping the pKVM vectors installed at EL2 -- pKVM rejects stub HVCs for obvious security reasons. However, the error path on KVM init leads to teardown_hyp_mode() which unconditionally frees hypervisor allocations (including the EL2 stacks and per-cpu pages) under the assumption that a previous cpu_hyp_uninit() execution has reset the vectors back to the stubs, which is false with pKVM. Interestingly, host stage-2 protection is not enabled yet at this point, so this use-after-free may go unnoticed for a while. The issue becomes more obvious after the finalize_pkvm() call. Fix this by keeping track of the CPUs on which pKVM is initialized in the kvm_hyp_initialized per-cpu variable, and use it from teardown_hyp_mode() to skip freeing pages that are in fact used. Fixes: `a770ee80e6` ("KVM: arm64: pkvm: Disable GICv2 support") Reported-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20250626101014.1519345-1-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2025-06-26 11:39:15 +01:00

1 2 3 4 5 ...

504 commits