master
1438 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
cad4b3d4ab |
KVM: s390: Change virtual to physical address access in diag 0x258 handler
The parameters for the diag 0x258 are real addresses, not virtual, but
KVM was using them as virtual addresses. This only happened to work, since
the Linux kernel as a guest used to have a 1:1 mapping for physical vs
virtual addresses.
Fix KVM so that it correctly uses the addresses as real addresses.
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
e8061f0618 |
KVM: s390: gaccess: Check if guest address is in memslot
Previously, access_guest_page() did not check whether the given guest
address is inside of a memslot. This is not a problem, since
kvm_write_guest_page/kvm_read_guest_page return -EFAULT in this case.
However, -EFAULT is also returned when copy_to/from_user fails.
When emulating a guest instruction, the address being outside a memslot
usually means that an addressing exception should be injected into the
guest.
Failure in copy_to/from_user however indicates that something is wrong
in userspace and hence should be handled there.
To be able to distinguish these two cases, return PGM_ADDRESSING in
access_guest_page() when the guest address is outside guest memory. In
access_guest_real(), populate vcpu->arch.pgm.code such that
kvm_s390_inject_prog_cond() can be used in the caller for injecting into
the guest (if applicable).
Since this adds a new return value to access_guest_page(), we need to make
sure that other callers are not confused by the new positive return value.
There are the following users of access_guest_page():
- access_guest_with_key() does the checking itself (in
guest_range_to_gpas()), so this case should never happen. Even if, the
handling is set up properly.
- access_guest_real() just passes the return code to its callers, which
are:
- read_guest_real() - see below
- write_guest_real() - see below
There are the following users of read_guest_real():
- ar_translation() in gaccess.c which already returns PGM_*
- setup_apcb10(), setup_apcb00(), setup_apcb11() in vsie.c which always
return -EFAULT on read_guest_read() nonzero return - no change
- shadow_crycb(), handle_stfle() always present this as validity, this
could be handled better but doesn't change current behaviour - no change
There are the following users of write_guest_real():
- kvm_s390_store_status_unloaded() always returns -EFAULT on
write_guest_real() failure.
Fixes:
|
||
|
|
09c38ad044 |
KVM: s390: Fix SORTL and DFLTCC instruction format error in __insn32_query
The __insn32_query() function incorrectly uses the RRF instruction format
for both the SORTL (RRE format) and DFLTCC (RRF format) instructions.
To fix this issue, add separate query functions for SORTL and DFLTCC that
use the appropriate instruction formats.
Additionally pass the query operand as a pointer to the entire array
of 32 elements to slightly optimize performance and readability.
Fixes:
|
||
|
|
5a44bb061d |
KVM: s390: fix validity interception issue when gisa is switched off
We might run into a SIE validity if gisa has been disabled either via using
kernel parameter "kvm.use_gisa=0" or by setting the related sysfs
attribute to N (echo N >/sys/module/kvm/parameters/use_gisa).
The validity is caused by an invalid value in the SIE control block's
gisa designation. That happens because we pass the uninitialized gisa
origin to virt_to_phys() before writing it to the gisa designation.
To fix this we return 0 in kvm_s390_get_gisa_desc() if the origin is 0.
kvm_s390_get_gisa_desc() is used to determine which gisa designation to
set in the SIE control block. A value of 0 in the gisa designation disables
gisa usage.
The issue surfaces in the host kernel with the following kernel message as
soon a new kvm guest start is attemted.
kvm: unhandled validity intercept 0x1011
WARNING: CPU: 0 PID: 781237 at arch/s390/kvm/intercept.c:101 kvm_handle_sie_intercept+0x42e/0x4d0 [kvm]
Modules linked in: vhost_net tap tun xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nft_compat x_tables nf_nat_tftp nf_conntrack_tftp vfio_pci_core irqbypass vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables sunrpc mlx5_ib ib_uverbs ib_core mlx5_core uvdevice s390_trng eadm_sch vfio_ccw zcrypt_cex4 mdev vfio_iommu_type1 vfio sch_fq_codel drm i2c_core loop drm_panel_orientation_quirks configfs nfnetlink lcs ctcm fsm dm_service_time ghash_s390 prng chacha_s390 libchacha aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common dm_mirror dm_region_hash dm_log zfcp scsi_transport_fc scsi_dh_rdac scsi_dh_emc scsi_dh_alua pkey zcrypt dm_multipath rng_core autofs4 [last unloaded: vfio_pci]
CPU: 0 PID: 781237 Comm: CPU 0/KVM Not tainted 6.10.0-08682-gcad9f11498ea #6
Hardware name: IBM 3931 A01 701 (LPAR)
Krnl PSW : 0704c00180000000 000003d93deb0122 (kvm_handle_sie_intercept+0x432/0x4d0 [kvm])
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
Krnl GPRS: 000003d900000027 000003d900000023 0000000000000028 000002cd00000000
000002d063a00900 00000359c6daf708 00000000000bebb5 0000000000001eff
000002cfd82e9000 000002cfd80bc000 0000000000001011 000003d93deda412
000003ff8962df98 000003d93de77ce0 000003d93deb011e 00000359c6daf960
Krnl Code: 000003d93deb0112: c020fffe7259 larl %r2,000003d93de7e5c4
000003d93deb0118: c0e53fa8beac brasl %r14,000003d9bd3c7e70
#000003d93deb011e: af000000 mc 0,0
>000003d93deb0122: a728ffea lhi %r2,-22
000003d93deb0126: a7f4fe24 brc 15,000003d93deafd6e
000003d93deb012a: 9101f0b0 tm 176(%r15),1
000003d93deb012e: a774fe48 brc 7,000003d93deafdbe
000003d93deb0132: 40a0f0ae sth %r10,174(%r15)
Call Trace:
[<000003d93deb0122>] kvm_handle_sie_intercept+0x432/0x4d0 [kvm]
([<000003d93deb011e>] kvm_handle_sie_intercept+0x42e/0x4d0 [kvm])
[<000003d93deacc10>] vcpu_post_run+0x1d0/0x3b0 [kvm]
[<000003d93deaceda>] __vcpu_run+0xea/0x2d0 [kvm]
[<000003d93dead9da>] kvm_arch_vcpu_ioctl_run+0x16a/0x430 [kvm]
[<000003d93de93ee0>] kvm_vcpu_ioctl+0x190/0x7c0 [kvm]
[<000003d9bd728b4e>] vfs_ioctl+0x2e/0x70
[<000003d9bd72a092>] __s390x_sys_ioctl+0xc2/0xd0
[<000003d9be0e9222>] __do_syscall+0x1f2/0x2e0
[<000003d9be0f9a90>] system_call+0x70/0x98
Last Breaking-Event-Address:
[<000003d9bd3c7f58>] __warn_printk+0xe8/0xf0
Cc: stable@vger.kernel.org
Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Fixes:
|
||
|
|
2c9b351240 |
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- Initial infrastructure for shadow stage-2 MMUs, as part of nested
virtualization enablement
- Support for userspace changes to the guest CTR_EL0 value, enabling
(in part) migration of VMs between heterogenous hardware
- Fixes + improvements to pKVM's FF-A proxy, adding support for v1.1
of the protocol
- FPSIMD/SVE support for nested, including merged trap configuration
and exception routing
- New command-line parameter to control the WFx trap behavior under
KVM
- Introduce kCFI hardening in the EL2 hypervisor
- Fixes + cleanups for handling presence/absence of FEAT_TCRX
- Miscellaneous fixes + documentation updates
LoongArch:
- Add paravirt steal time support
- Add support for KVM_DIRTY_LOG_INITIALLY_SET
- Add perf kvm-stat support for loongarch
RISC-V:
- Redirect AMO load/store access fault traps to guest
- perf kvm stat support
- Use guest files for IMSIC virtualization, when available
s390:
- Assortment of tiny fixes which are not time critical
x86:
- Fixes for Xen emulation
- Add a global struct to consolidate tracking of host values, e.g.
EFER
- Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the
effective APIC bus frequency, because TDX
- Print the name of the APICv/AVIC inhibits in the relevant
tracepoint
- Clean up KVM's handling of vendor specific emulation to
consistently act on "compatible with Intel/AMD", versus checking
for a specific vendor
- Drop MTRR virtualization, and instead always honor guest PAT on
CPUs that support self-snoop
- Update to the newfangled Intel CPU FMS infrastructure
- Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as
it reads '0' and writes from userspace are ignored
- Misc cleanups
x86 - MMU:
- Small cleanups, renames and refactoring extracted from the upcoming
Intel TDX support
- Don't allocate kvm_mmu_page.shadowed_translation for shadow pages
that can't hold leafs SPTEs
- Unconditionally drop mmu_lock when allocating TDP MMU page tables
for eager page splitting, to avoid stalling vCPUs when splitting
huge pages
- Bug the VM instead of simply warning if KVM tries to split a SPTE
that is non-present or not-huge. KVM is guaranteed to end up in a
broken state because the callers fully expect a valid SPTE, it's
all but dangerous to let more MMU changes happen afterwards
x86 - AMD:
- Make per-CPU save_area allocations NUMA-aware
- Force sev_es_host_save_area() to be inlined to avoid calling into
an instrumentable function from noinstr code
- Base support for running SEV-SNP guests. API-wise, this includes a
new KVM_X86_SNP_VM type, encrypting/measure the initial image into
guest memory, and finalizing it before launching it. Internally,
there are some gmem/mmu hooks needed to prepare gmem-allocated
pages before mapping them into guest private memory ranges
This includes basic support for attestation guest requests, enough
to say that KVM supports the GHCB 2.0 specification
There is no support yet for loading into the firmware those signing
keys to be used for attestation requests, and therefore no need yet
for the host to provide certificate data for those keys.
To support fetching certificate data from userspace, a new KVM exit
type will be needed to handle fetching the certificate from
userspace.
An attempt to define a new KVM_EXIT_COCO / KVM_EXIT_COCO_REQ_CERTS
exit type to handle this was introduced in v1 of this patchset, but
is still being discussed by community, so for now this patchset
only implements a stub version of SNP Extended Guest Requests that
does not provide certificate data
x86 - Intel:
- Remove an unnecessary EPT TLB flush when enabling hardware
- Fix a series of bugs that cause KVM to fail to detect nested
pending posted interrupts as valid wake eents for a vCPU executing
HLT in L2 (with HLT-exiting disable by L1)
- KVM: x86: Suppress MMIO that is triggered during task switch
emulation
Explicitly suppress userspace emulated MMIO exits that are
triggered when emulating a task switch as KVM doesn't support
userspace MMIO during complex (multi-step) emulation
Silently ignoring the exit request can result in the
WARN_ON_ONCE(vcpu->mmio_needed) firing if KVM exits to userspace
for some other reason prior to purging mmio_needed
See commit
|
||
|
|
1c7d0c3af5 |
Merge tag 's390-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Vasily Gorbik:
- Remove restrictions on PAI NNPA and crypto counters, enabling
concurrent per-task and system-wide sampling and counting events
- Switch to GENERIC_CPU_DEVICES by setting up the CPU present mask in
the architecture code and letting the generic code handle CPU
bring-up
- Add support for the diag204 busy indication facility to prevent
undesirable blocking during hypervisor logical CPU utilization
queries. Implement results caching
- Improve the handling of Store Data SCLP events by suppressing
unnecessary warning, preventing buffer release in I/O during
failures, and adding timeout handling for Store Data requests to
address potential firmware issues
- Provide optimized __arch_hweight*() implementations
- Remove the unnecessary CPU KOBJ_CHANGE uevents generated during
topology updates, as they are unused and also not present on other
architectures
- Cleanup atomic_ops, optimize __atomic_set() for small values and
__atomic_cmpxchg_bool() for compilers supporting flag output
constraint
- Couple of cleanups for KVM:
- Move and improve KVM struct definitions for DAT tables from
gaccess.c to a new header
- Pass the asce as parameter to sie64a()
- Make the crdte() and cspg() page table handling wrappers return a
boolean to indicate success, like the other existing "compare and
swap" wrappers
- Add documentation for HWCAP flags
- Switch to obtaining total RAM pages from memblock instead of
totalram_pages() during mm init, to ensure correct calculation of
zero page size, when defer_init is enabled
- Refactor lowcore access and switch to using the get_lowcore()
function instead of the S390_lowcore macro
- Cleanups for PG_arch_1 and folio handling in UV and hugetlb code
- Add missing MODULE_DESCRIPTION() macros
- Fix VM_FAULT_HWPOISON handling in do_exception()
* tag 's390-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (54 commits)
s390/mm: Fix VM_FAULT_HWPOISON handling in do_exception()
s390/kvm: Move bitfields for dat tables
s390/entry: Pass the asce as parameter to sie64a()
s390/sthyi: Use cached data when diag is busy
s390/sthyi: Move diag operations
s390/hypfs_diag: Diag204 busy loop
s390/diag: Add busy-indication-facility requirements
s390/diag: Diag204 add busy return errno
s390/diag: Return errno's from diag204
s390/sclp: Diag204 busy indication facility detection
s390/atomic_ops: Make use of flag output constraint
s390/atomic_ops: Improve __atomic_set() for small values
s390/atomic_ops: Use symbolic names
s390/smp: Switch to GENERIC_CPU_DEVICES
s390/hwcaps: Add documentation for HWCAP flags
s390/pgtable: Make crdte() and cspg() return a value
s390/topology: Remove CPU KOBJ_CHANGE uevents
s390/sclp: Add timeout to Store Data requests
s390/sclp: Prevent release of buffer in I/O
s390/sclp: Suppress unnecessary Store Data warning
...
|
||
|
|
86014c1e20 |
Merge tag 'kvm-x86-generic-6.11' of https://github.com/kvm-x86/linux into HEAD
KVM generic changes for 6.11 - Enable halt poll shrinking by default, as Intel found it to be a clear win. - Setup empty IRQ routing when creating a VM to avoid having to synchronize SRCU when creating a split IRQCHIP on x86. - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag that arch code can use for hooking both sched_in() and sched_out(). - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid truncating a bogus value from userspace, e.g. to help userspace detect bugs. - Mark a vCPU as preempted if and only if it's scheduled out while in the KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest memory when retrieving guest state during live migration blackout. - A few minor cleanups |
||
|
|
c8b8b8190a |
Merge tag 'loongarch-kvm-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.11 1. Add ParaVirt steal time support. 2. Add some VM migration enhancement. 3. Add perf kvm-stat support for loongarch. |
||
|
|
275d05ce06 |
s390/kvm: Move bitfields for dat tables
Move and improve the struct definitions for DAT tables from gaccess.c to a new header. Once in a separate header, the structs become available everywhere. One possible usecase is to merge them in the s390 pte_t and p?d_t definitions, which is left as an exercise for the reader. Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Nico Boehr <nrb@linux.ibm.com> Link: https://lore.kernel.org/r/20240703155900.103783-3-imbrenda@linux.ibm.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
|
|
723ac2d6ba |
s390/entry: Pass the asce as parameter to sie64a()
Pass the guest ASCE explicitly as parameter, instead of having sie64a() take it from lowcore. This removes hidden state from lowcore, and makes things look cleaner. Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Nico Boehr <nrb@linux.ibm.com> Link: https://lore.kernel.org/r/20240703155900.103783-2-imbrenda@linux.ibm.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
|
|
7816e58967 |
kvm: s390: Reject memory region operations for ucontrol VMs
This change rejects the KVM_SET_USER_MEMORY_REGION and
KVM_SET_USER_MEMORY_REGION2 ioctls when called on a ucontrol VM.
This is necessary since ucontrol VMs have kvm->arch.gmap set to 0 and
would thus result in a null pointer dereference further in.
Memory management needs to be performed in userspace and using the
ioctls KVM_S390_UCAS_MAP and KVM_S390_UCAS_UNMAP.
Also improve s390 specific documentation for KVM_SET_USER_MEMORY_REGION
and KVM_SET_USER_MEMORY_REGION2.
Signed-off-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Fixes:
|
||
|
|
33a729a177 |
KVM: s390: vsie: retry SIE instruction on host intercepts
It's possible that SIE exits for work that the host needs to perform rather than something that is intended for the guest. A Linux guest will ignore this intercept code since there is nothing for it to do, but a more robust solution would rewind the PSW back to the SIE instruction. This will transparently resume the guest once the host completes its work, without the guest needing to process what is effectively a NOP and re-issue SIE itself. Signed-off-by: Eric Farman <farman@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20240301204342.3217540-1-farman@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20240301204342.3217540-1-farman@linux.ibm.com> |
||
|
|
4c6abb7f7b |
KVM: s390: fix LPSWEY handling
in rare cases, e.g. for injecting a machine check we do intercept all
load PSW instructions via ICTL_LPSW. With facility 193 a new variant
LPSWEY was added. KVM needs to handle that as well.
Fixes:
|
||
|
|
a6816314af |
KVM: Introduce vcpu->wants_to_run
Introduce vcpu->wants_to_run to indicate when a vCPU is in its core run loop, i.e. when the vCPU is running the KVM_RUN ioctl and immediate_exit was not set. Replace all references to vcpu->run->immediate_exit with !vcpu->wants_to_run to avoid TOCTOU races with userspace. For example, a malicious userspace could invoked KVM_RUN with immediate_exit=true and then after KVM reads it to set wants_to_run=false, flip it to false. This would result in the vCPU running in KVM_RUN with wants_to_run=false. This wouldn't cause any real bugs today but is a dangerous landmine. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240503181734.1467938-2-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
|
|
208da1d5fc |
s390: Replace S390_lowcore by get_lowcore()
Replace all S390_lowcore usages in arch/s390/ by get_lowcore(). Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
|
|
c4201bd24f |
KVM: s390: Don't re-setup dummy routing when KVM_CREATE_IRQCHIP
Now that KVM sets up empty irq routing in kvm_create_vm(), there's no need to setup dummy routing when KVM_CREATE_IRQCHIP. Note, userspace could very theoretically use KVM_CREATE_IRQCHIP after KVM_SET_GSI_ROUTING to reset to empty IRQ routing, but it's extremely unlikely any VMM does that, e.g. the main reason s390 does anything for KVM_CREATE_IRQCHIP is to that s390 doesn't need to be special cased by the VMM. Signed-off-by: Yi Wang <foxywang@tencent.com> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20240506101751.3145407-4-foxywang@tencent.com [sean: keep use_irqchip check, call out KVM_SET_GSI_ROUTING impact] Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
|
|
d65e1a0f30 |
Merge tag 's390-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Alexander Gordeev: - Store AP Query Configuration Information in a static buffer - Rework the AP initialization and add missing cleanups to the error path - Swap IRQ and AP bus/device registration to avoid race conditions - Export prot_virt_guest symbol - Introduce AP configuration changes notifier interface to facilitate modularization of the AP bus - Add CONFIG_AP kernel configuration option to allow modularization of the AP bus - Rework CONFIG_ZCRYPT_DEBUG kernel configuration option description and dependency and rename it to CONFIG_AP_DEBUG - Convert sprintf() and snprintf() to sysfs_emit() in CIO code - Adjust indentation of RELOCS command build step - Make crypto performance counters upward compatible - Convert make_page_secure() and gmap_make_secure() to use folio - Rework channel-utilization-block (CUB) handling in preparation of introducing additional CUBs - Use attribute groups to simplify registration, removal and extension of measurement-related channel-path sysfs attributes - Add a per-channel-path binary "ext_measurement" sysfs attribute that provides access to extended channel-path measurement data - Export measurement data for all channel-measurement-groups (CMG), not only for a specific ones. This enables support of new CMG data formats in userspace without the need for kernel changes - Add a per-channel-path sysfs attribute "speed_bps" that provides the operating speed in bits per second or 0 if the operating speed is not available - The CIO tracepoint subchannel-type field "st" is incorrectly set to the value of subchannel-enabled SCHIB "ena" field. Fix that - Do not forcefully limit vmemmap starting address to MAX_PHYSMEM_BITS - Consider the maximum physical address available to a DCSS segment (512GB) when memory layout is set up - Simplify the virtual memory layout setup by reducing the size of identity mapping vs vmemmap overlap - Swap vmalloc and Lowcore/Real Memory Copy areas in virtual memory. This will allow to place the kernel image next to kernel modules - Move everyting KASLR related from <asm/setup.h> to <asm/page.h> - Put virtual memory layout information into a structure to improve code generation - Currently __kaslr_offset is the kernel offset in both physical and virtual memory spaces. Uncouple these offsets to allow uncoupling of the addresses spaces - Currently the identity mapping base address is implicit and is always set to zero. Make it explicit by putting into __identity_base persistent boot variable and use it in proper context - Introduce .amode31 section start and end macros AMODE31_START and AMODE31_END - Introduce OS_INFO entries that do not reference any data in memory, but rather provide only values - Store virtual memory layout in OS_INFO. It is read out by makedumpfile, crash and other tools - Store virtual memory layout in VMCORE_INFO. It is read out by crash and other tools when /proc/kcore device is used - Create additional PT_LOAD ELF program header that covers kernel image only, so that vmcore tools could locate kernel text and data when virtual and physical memory spaces are uncoupled - Uncouple physical and virtual address spaces - Map kernel at fixed location when KASLR mode is disabled. The location is defined by CONFIG_KERNEL_IMAGE_BASE kernel configuration value. - Rework deployment of kernel image for both compressed and uncompressed variants as defined by CONFIG_KERNEL_UNCOMPRESSED kernel configuration value - Move .vmlinux.relocs section in front of the compressed kernel. The interim section rescue step is avoided as result - Correct modules thunk offset calculation when branch target is more than 2GB away - Kernel modules contain their own set of expoline thunks. Now that the kernel modules area is less than 4GB away from kernel expoline thunks, make modules use kernel expolines. Also make EXPOLINE_EXTERN the default if the compiler supports it - userfaultfd can insert shared zeropages into processes running VMs, but that is not allowed for s390. Fallback to allocating a fresh zeroed anonymous folio and insert that instead - Re-enable shared zeropages for non-PV and non-skeys KVM guests - Rename hex2bitmap() to ap_hex2bitmap() and export it for external use - Add ap_config sysfs attribute to provide the means for setting or displaying adapters, domains and control domains assigned to a vfio-ap mediated device in a single operation - Make vfio_ap_mdev_link_queue() ignore duplicate link requests - Add write support to ap_config sysfs attribute to allow atomic update a vfio-ap mediated device state - Document ap_config sysfs attribute - Function os_info_old_init() is expected to be called only from a regular kdump kernel. Enable it to be called from a stand-alone dump kernel - Address gcc -Warray-bounds warning and fix array size in struct os_info - s390 does not support SMBIOS, so drop unneeded CONFIG_DMI checks - Use unwinder instead of __builtin_return_address() with ftrace to prevent returning of undefined values - Sections .hash and .gnu.hash are only created when CONFIG_PIE_BUILD kernel is enabled. Drop these for the case CONFIG_PIE_BUILD is disabled - Compile kernel with -fPIC and link with -no-pie to allow kpatch feature always succeed and drop the whole CONFIG_PIE_BUILD option-enabled code - Add missing virt_to_phys() converter for VSIE facility and crypto control blocks * tag 's390-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (54 commits) Revert "s390: Relocate vmlinux ELF data to virtual address space" KVM: s390: vsie: Use virt_to_phys for crypto control block s390: Relocate vmlinux ELF data to virtual address space s390: Compile kernel with -fPIC and link with -no-pie s390: vmlinux.lds.S: Drop .hash and .gnu.hash for !CONFIG_PIE_BUILD s390/ftrace: Use unwinder instead of __builtin_return_address() s390/pci: Drop unneeded reference to CONFIG_DMI s390/os_info: Fix array size in struct os_info s390/os_info: Initialize old os_info in standalone dump kernel docs: Update s390 vfio-ap doc for ap_config sysfs attribute s390/vfio-ap: Add write support to sysfs attr ap_config s390/vfio-ap: Ignore duplicate link requests in vfio_ap_mdev_link_queue s390/vfio-ap: Add sysfs attr, ap_config, to export mdev state s390/ap: Externalize AP bus specific bitmap reading function s390/mm: Re-enable the shared zeropage for !PV and !skeys KVM guests mm/userfaultfd: Do not place zeropages when zeropages are disallowed s390/expoline: Make modules use kernel expolines s390/nospec: Correct modules thunk offset calculation s390/boot: Do not rescue .vmlinux.relocs section s390/boot: Rework deployment of the kernel image ... |
||
|
|
22a49f6d30 |
Merge branch 'shared-zeropage' into features
David Hildenbrand says:
===================
This series fixes one issue with uffd + shared zeropages on s390x and
fixes that "ordinary" KVM guests can make use of shared zeropages again.
userfaultfd could currently end up mapping shared zeropages into processes
that forbid shared zeropages. This only apples to s390x, relevant for
handling PV guests and guests that use storage kets correctly. Fix it
by placing a zeroed folio instead of the shared zeropage during
UFFDIO_ZEROPAGE instead.
I stumbled over this issue while looking into a customer scenario that
is using:
(1) Memory ballooning for dynamic resizing. Start a VM with, say, 100 GiB
and inflate the balloon during boot to 60 GiB. The VM has ~40 GiB
available and additional memory can be "fake hotplugged" to the VM
later on demand by deflating the balloon. Actual memory overcommit is
not desired, so physical memory would only be moved between VMs.
(2) Live migration of VMs between sites to evacuate servers in case of
emergency.
Without the shared zeropage, during (2), the VM would suddenly consume
100 GiB on the migration source and destination. On the migration source,
where we don't excpect memory overcommit, we could easilt end up crashing
the VM during migration.
Independent of that, memory handed back to the hypervisor using "free page
reporting" would end up consuming actual memory after the migration on the
destination, not getting freed up until reused+freed again.
While there might be ways to optimize parts of this in QEMU, we really
should just support the shared zeropage again for ordinary VMs.
We only expect legcy guests to make use of storage keys, so let's handle
zeropages again when enabling storage keys or when enabling PV. To not
break userfaultfd like we did in the past, don't zap the shared zeropages,
but instead trigger unsharing faults, just like we do for unsharing
KSM pages in break_ksm().
Unsharing faults will simply replace the shared zeropage by a zeroed
anonymous folio. We can already trigger the same fault path using GUP,
when trying to long-term pin a shared zeropage, but also when unmerging
a KSM-placed zeropages, so this is nothing new.
Patch #1 tested on 86-64 by forcing mm_forbids_zeropage() to be 1, and
running the uffd selftests.
Patch #2 tested on s390x: the live migration scenario now works as
expected, and kvm-unit-tests that trigger usage of skeys work well, whereby
I can see detection and unsharing of shared zeropages.
Further (as broken in v2), I tested that the shared zeropage is no
longer populated after skeys are used -- that mm_forbids_zeropage() works
as expected:
./s390x-run s390x/skey.elf \
-no-shutdown \
-chardev socket,id=monitor,path=/var/tmp/mon,server,nowait \
-mon chardev=monitor,mode=readline
Then, in another shell:
# cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
Rss: 31484 kB
# echo "dump-guest-memory tmp" | sudo nc -U /var/tmp/mon
...
# cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
Rss: 160452 kB
-> Reading guest memory does not populate the shared zeropage
Doing the same with selftest.elf (no skeys)
# cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
Rss: 30900 kB
# echo "dump-guest-memory tmp" | sudo nc -U /var/tmp/mon
...
# cat /proc/`pgrep qemu`/smaps_rollup | grep Rsstmp/mon
Rss: 30924 kB
-> Reading guest memory does populate the shared zeropage
===================
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
||
|
|
175f2f5bcd |
KVM: s390: Check kvm pointer when testing KVM_CAP_S390_HPAGE_1M
KVM allows issuing the KVM_CHECK_EXTENSION ioctl either on the /dev/kvm
fd or the VM fd. In the first case, kvm_vm_ioctl_check_extension() is
called with kvm==NULL. Ensure we don't dereference the pointer in that
case.
Fixes:
|
||
|
|
cc4edb92f5 |
KVM: s390: vsie: Use virt_to_phys for crypto control block
The address of the crypto control block in the (shadow) SIE block is absolute/physical. Convert from virtual to physical when shadowing the guest's control block during VSIE. Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/r/20240429171512.879215-1-nsg@linux.ibm.com Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
|
|
06201e00ee |
s390/mm: Re-enable the shared zeropage for !PV and !skeys KVM guests
commit |
||
|
|
22fdd8ba61 |
KVM: s390: vsie: Use virt_to_phys for facility control block
In order for SIE to interpretively execute STFLE, it requires the real or absolute address of a facility-list control block. Before writing the location into the shadow SIE control block, convert it from a virtual address. We currently do not run into this bug because the lower 31 bits are the same for virtual and physical addresses. Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com> Link: https://lore.kernel.org/r/20240319164420.4053380-3-nsg@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-Id: <20240319164420.4053380-3-nsg@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
|
|
4f712ee0cb |
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"S390:
- Changes to FPU handling came in via the main s390 pull request
- Only deliver to the guest the SCLP events that userspace has
requested
- More virtual vs physical address fixes (only a cleanup since
virtual and physical address spaces are currently the same)
- Fix selftests undefined behavior
x86:
- Fix a restriction that the guest can't program a PMU event whose
encoding matches an architectural event that isn't included in the
guest CPUID. The enumeration of an architectural event only says
that if a CPU supports an architectural event, then the event can
be programmed *using the architectural encoding*. The enumeration
does NOT say anything about the encoding when the CPU doesn't
report support the event *in general*. It might support it, and it
might support it using the same encoding that made it into the
architectural PMU spec
- Fix a variety of bugs in KVM's emulation of RDPMC (more details on
individual commits) and add a selftest to verify KVM correctly
emulates RDMPC, counter availability, and a variety of other
PMC-related behaviors that depend on guest CPUID and therefore are
easier to validate with selftests than with custom guests (aka
kvm-unit-tests)
- Zero out PMU state on AMD if the virtual PMU is disabled, it does
not cause any bug but it wastes time in various cases where KVM
would check if a PMC event needs to be synthesized
- Optimize triggering of emulated events, with a nice ~10%
performance improvement in VM-Exit microbenchmarks when a vPMU is
exposed to the guest
- Tighten the check for "PMI in guest" to reduce false positives if
an NMI arrives in the host while KVM is handling an IRQ VM-Exit
- Fix a bug where KVM would report stale/bogus exit qualification
information when exiting to userspace with an internal error exit
code
- Add a VMX flag in /proc/cpuinfo to report 5-level EPT support
- Rework TDP MMU root unload, free, and alloc to run with mmu_lock
held for read, e.g. to avoid serializing vCPUs when userspace
deletes a memslot
- Tear down TDP MMU page tables at 4KiB granularity (used to be
1GiB). KVM doesn't support yielding in the middle of processing a
zap, and 1GiB granularity resulted in multi-millisecond lags that
are quite impolite for CONFIG_PREEMPT kernels
- Allocate write-tracking metadata on-demand to avoid the memory
overhead when a kernel is built with i915 virtualization support
but the workloads use neither shadow paging nor i915 virtualization
- Explicitly initialize a variety of on-stack variables in the
emulator that triggered KMSAN false positives
- Fix the debugregs ABI for 32-bit KVM
- Rework the "force immediate exit" code so that vendor code
ultimately decides how and when to force the exit, which allowed
some optimization for both Intel and AMD
- Fix a long-standing bug where kvm_has_noapic_vcpu could be left
elevated if vCPU creation ultimately failed, causing extra
unnecessary work
- Cleanup the logic for checking if the currently loaded vCPU is
in-kernel
- Harden against underflowing the active mmu_notifier invalidation
count, so that "bad" invalidations (usually due to bugs elsehwere
in the kernel) are detected earlier and are less likely to hang the
kernel
x86 Xen emulation:
- Overlay pages can now be cached based on host virtual address,
instead of guest physical addresses. This removes the need to
reconfigure and invalidate the cache if the guest changes the gpa
but the underlying host virtual address remains the same
- When possible, use a single host TSC value when computing the
deadline for Xen timers in order to improve the accuracy of the
timer emulation
- Inject pending upcall events when the vCPU software-enables its
APIC to fix a bug where an upcall can be lost (and to follow Xen's
behavior)
- Fall back to the slow path instead of warning if "fast" IRQ
delivery of Xen events fails, e.g. if the guest has aliased xAPIC
IDs
RISC-V:
- Support exception and interrupt handling in selftests
- New self test for RISC-V architectural timer (Sstc extension)
- New extension support (Ztso, Zacas)
- Support userspace emulation of random number seed CSRs
ARM:
- Infrastructure for building KVM's trap configuration based on the
architectural features (or lack thereof) advertised in the VM's ID
registers
- Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
x86's WC) at stage-2, improving the performance of interacting with
assigned devices that can tolerate it
- Conversion of KVM's representation of LPIs to an xarray, utilized
to address serialization some of the serialization on the LPI
injection path
- Support for _architectural_ VHE-only systems, advertised through
the absence of FEAT_E2H0 in the CPU's ID register
- Miscellaneous cleanups, fixes, and spelling corrections to KVM and
selftests
LoongArch:
- Set reserved bits as zero in CPUCFG
- Start SW timer only when vcpu is blocking
- Do not restart SW timer when it is expired
- Remove unnecessary CSR register saving during enter guest
- Misc cleanups and fixes as usual
Generic:
- Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically
always true on all architectures except MIPS (where Kconfig
determines the available depending on CPU capabilities). It is
replaced either by an architecture-dependent symbol for MIPS, and
IS_ENABLED(CONFIG_KVM) everywhere else
- Factor common "select" statements in common code instead of
requiring each architecture to specify it
- Remove thoroughly obsolete APIs from the uapi headers
- Move architecture-dependent stuff to uapi/asm/kvm.h
- Always flush the async page fault workqueue when a work item is
being removed, especially during vCPU destruction, to ensure that
there are no workers running in KVM code when all references to
KVM-the-module are gone, i.e. to prevent a very unlikely
use-after-free if kvm.ko is unloaded
- Grab a reference to the VM's mm_struct in the async #PF worker
itself instead of gifting the worker a reference, so that there's
no need to remember to *conditionally* clean up after the worker
Selftests:
- Reduce boilerplate especially when utilize selftest TAP
infrastructure
- Add basic smoke tests for SEV and SEV-ES, along with a pile of
library support for handling private/encrypted/protected memory
- Fix benign bugs where tests neglect to close() guest_memfd files"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits)
selftests: kvm: remove meaningless assignments in Makefiles
KVM: riscv: selftests: Add Zacas extension to get-reg-list test
RISC-V: KVM: Allow Zacas extension for Guest/VM
KVM: riscv: selftests: Add Ztso extension to get-reg-list test
RISC-V: KVM: Allow Ztso extension for Guest/VM
RISC-V: KVM: Forward SEED CSR access to user space
KVM: riscv: selftests: Add sstc timer test
KVM: riscv: selftests: Change vcpu_has_ext to a common function
KVM: riscv: selftests: Add guest helper to get vcpu id
KVM: riscv: selftests: Add exception handling support
LoongArch: KVM: Remove unnecessary CSR register saving during enter guest
LoongArch: KVM: Do not restart SW timer when it is expired
LoongArch: KVM: Start SW timer only when vcpu is blocking
LoongArch: KVM: Set reserved bits as zero in CPUCFG
KVM: selftests: Explicitly close guest_memfd files in some gmem tests
KVM: x86/xen: fix recursive deadlock in timer injection
KVM: pfncache: simplify locking and make more self-contained
KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery
KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled
KVM: x86/xen: improve accuracy of Xen timers
...
|
||
|
|
17193ced2d |
Merge tag 'kvm-s390-next-6.9-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
- Memop selftest rotate fix - SCLP event bits over indication fix - Missing virt_to_phys for the CRYCB fix |
||
|
|
691632f0e8 |
Merge tag 's390-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Heiko Carstens:
- Various virtual vs physical address usage fixes
- Fix error handling in Processor Activity Instrumentation device
driver, and export number of counters with a sysfs file
- Allow for multiple events when Processor Activity Instrumentation
counters are monitored in system wide sampling
- Change multiplier and shift values of the Time-of-Day clock source to
improve steering precision
- Remove a couple of unneeded GFP_DMA flags from allocations
- Disable mmap alignment if randomize_va_space is also disabled, to
avoid a too small heap
- Various changes to allow s390 to be compiled with LLVM=1, since
ld.lld and llvm-objcopy will have proper s390 support witch clang 19
- Add __uninitialized macro to Compiler Attributes. This is helpful
with s390's FPU code where some users have up to 520 byte stack
frames. Clearing such stack frames (if INIT_STACK_ALL_PATTERN or
INIT_STACK_ALL_ZERO is enabled) before they are used contradicts the
intention (performance improvement) of such code sections.
- Convert switch_to() to an out-of-line function, and use the generic
switch_to header file
- Replace the usage of s390's debug feature with pr_debug() calls
within the zcrypt device driver
- Improve hotplug support of the Adjunct Processor device driver
- Improve retry handling in the zcrypt device driver
- Various changes to the in-kernel FPU code:
- Make in-kernel FPU sections preemptible
- Convert various larger inline assemblies and assembler files to
C, mainly by using singe instruction inline assemblies. This
increases readability, but also allows makes it easier to add
proper instrumentation hooks
- Cleanup of the header files
- Provide fast variants of csum_partial() and
csum_partial_copy_nocheck() based on vector instructions
- Introduce and use a lock to synchronize accesses to zpci device data
structures to avoid inconsistent states caused by concurrent accesses
- Compile the kernel without -fPIE. This addresses the following
problems if the kernel is compiled with -fPIE:
- It uses dynamic symbols (.dynsym), for which the linker refuses
to allow more than 64k sections. This can break features which
use '-ffunction-sections' and '-fdata-sections', including
kpatch-build and function granular KASLR
- It unnecessarily uses GOT relocations, adding an extra layer of
indirection for many memory accesses
- Fix shared_cpu_list for CPU private L2 caches, which incorrectly were
reported as globally shared
* tag 's390-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (117 commits)
s390/tools: handle rela R_390_GOTPCDBL/R_390_GOTOFF64
s390/cache: prevent rebuild of shared_cpu_list
s390/crypto: remove retry loop with sleep from PAES pkey invocation
s390/pkey: improve pkey retry behavior
s390/zcrypt: improve zcrypt retry behavior
s390/zcrypt: introduce retries on in-kernel send CPRB functions
s390/ap: introduce mutex to lock the AP bus scan
s390/ap: rework ap_scan_bus() to return true on config change
s390/ap: clarify AP scan bus related functions and variables
s390/ap: rearm APQNs bindings complete completion
s390/configs: increase number of LOCKDEP_BITS
s390/vfio-ap: handle hardware checkstop state on queue reset operation
s390/pai: change sampling event assignment for PMU device driver
s390/boot: fix minor comment style damages
s390/boot: do not check for zero-termination relocation entry
s390/boot: make type of __vmlinux_relocs_64_start|end consistent
s390/boot: sanitize kaslr_adjust_relocs() function prototype
s390/boot: simplify GOT handling
s390: vmlinux.lds.S: fix .got.plt assertion
s390/boot: workaround current 'llvm-objdump -t -j ...' behavior
...
|
||
|
|
e9a2bba476 |
Merge tag 'kvm-x86-xen-6.9' of https://github.com/kvm-x86/linux into HEAD
KVM Xen and pfncache changes for 6.9: - Rip out the half-baked support for using gfn_to_pfn caches to manage pages that are "mapped" into guests via physical addresses. - Add support for using gfn_to_pfn caches with only a host virtual address, i.e. to bypass the "gfn" stage of the cache. The primary use case is overlay pages, where the guest may change the gfn used to reference the overlay page, but the backing hva+pfn remains the same. - Add an ioctl() to allow mapping Xen's shared_info page using an hva instead of a gpa, so that userspace doesn't need to reconfigure and invalidate the cache/mapping if the guest changes the gpa (but userspace keeps the resolved hva the same). - When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation. - Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior). - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs. - Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to refresh), and drop a now-redundant acquisition of xen_lock (that was protecting the shared_info cache) to fix a deadlock due to recursively acquiring xen_lock. |
||
|
|
01be7f53df |
KVM: s390: fix access register usage in ioctls
The routine ar_translation() can be reached by both the instruction intercept path (where the access registers had been loaded with the guest register contents), and the MEM_OP ioctls (which hadn't). Since this routine saves the current registers to vcpu->run, this routine erroneously saves host registers into the guest space. Introduce a boolean in the kvm_vcpu_arch struct to indicate whether the registers contain guest contents. If they do (the instruction intercept path), the save can be performed and the AR translation is done just as it is today. If they don't (the MEM_OP path), the AR can be read from vcpu->run without stashing the current contents. Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Signed-off-by: Eric Farman <farman@linux.ibm.com> Link: https://lore.kernel.org/r/20240220211211.3102609-2-farman@linux.ibm.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
|
|
85a19b3054 |
KVM: s390: only deliver the set service event bits
The SCLP driver code masks off the last two bits of the parameter [1]
to determine if a read is required, but doesn't care about the
contents of those bits. Meanwhile, the KVM code that delivers
event interrupts masks off those two bits but sends both to the
guest, even if only one was specified by userspace [2].
This works for the driver code, but it means any nuances of those
bits gets lost. Use the event pending mask as an actual mask, and
only send the bit(s) that were specified in the pending interrupt.
[1] Linux: sclp_interrupt_handler() (drivers/s390/char/sclp.c:658)
[2] QEMU: service_interrupt() (hw/s390x/sclp.c:360..363)
Fixes:
|
||
|
|
4a59932874 |
KVM: s390: introduce kvm_s390_fpu_(store|load)
It's a bit nicer than having multiple lines and will help if there's another re-work since we'll only have to change one location. Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
|
|
9e7325acb3 |
KVM: s390: Refactor kvm_is_error_gpa() into kvm_is_gpa_in_memslot()
Rename kvm_is_error_gpa() to kvm_is_gpa_in_memslot() and invert the
polarity accordingly in order to (a) free up kvm_is_error_gpa() to match
with kvm_is_error_{hva,page}(), and (b) to make it more obvious that the
helper is doing a memslot lookup, i.e. not simply checking for INVALID_GPA.
No functional change intended.
Link: https://lore.kernel.org/r/20240215152916.1158-9-paul@xen.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
||
|
|
066c40918b |
s390/fpu: decrease stack usage for some cases
The kernel_fpu structure has a quite large size of 520 bytes. In order to reduce stack footprint introduce several kernel fpu structures with different and also smaller sizes. This way every kernel fpu user must use the correct variant. A compile time check verifies that the correct variant is used. There are several users which use only 16 instead of all 32 vector registers. For those users the new kernel_fpu_16 structure with a size of only 266 bytes can be used. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
|
|
ed3a0a011a |
s390/kvm: convert to regular kernel fpu user
KVM modifies the kernel fpu's regs pointer to its own area to implement its custom version of preemtible kernel fpu context. With general support for preemptible kernel fpu context there is no need for the extra complexity in KVM code anymore. Therefore convert KVM to a regular kernel fpu user. In particular this means that all TIF_FPU checks can be removed, since the fpu register context will never be changed by other kernel fpu users, and also the fpu register context will be restored if a thread is preempted. Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
|
|
87c5c70036 |
s390/fpu: rename save_fpu_regs() to save_user_fpu_regs(), etc
Rename save_fpu_regs(), load_fpu_regs(), and struct thread_struct's fpu member to save_user_fpu_regs(), load_user_fpu_regs(), and ufpu. This way the function and variable names reflect for which context they are supposed to be used. This large and trivial conversion is a prerequisite for making the kernel fpu usage preemptible. Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
|
|
419abc4d38 |
s390/fpu: convert FPU CIF flag to regular TIF flag
The FPU state, as represented by the CIF_FPU flag reflects the FPU state of a task, not the CPU it is running on. Therefore convert the flag to a regular TIF flag. This removes the magic in switch_to() where a save_fpu_regs() call for the currently (previous) running task sets the per-cpu CIF_FPU flag, which is required to restore FPU register contents of the next task, when it returns to user space. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
|
|
fd2527f209 |
s390/fpu: move, rename, and merge header files
Move, rename, and merge the fpu and vx header files. This way fpu header files have a consistent naming scheme (fpu*.h). Also get rid of the fpu subdirectory and move header files to asm directory, so that all fpu and vx header files can be found at the same location. Merge internal.h header file into other header files, since the internal helpers are used at many locations. so those helper functions are really not internal. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
|
|
7b2411e793 |
KVM: s390: fix virtual vs physical address confusion
Fix virtual vs physical address confusion. This does not fix a bug since virtual and physical address spaces are currently the same. Suggested-by: Janosch Frank <frankja@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Acked-by: Janosch Frank <frankja@linux.ibm.com> Acked-by: Anthony Krowiak <akrowiak@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com> |
||
|
|
304103736b |
s390/acrs: cleanup access register handling
save_access_regs() and restore_access_regs() are only available by including switch_to.h. This is done by a couple of C files, which have nothing to do with switch_to(), but only need these functions. Move both functions to a new header file and improve the implementation: - Get rid of typedef - Add memory access instrumentation support - Use long displacement instructions lamy/stamy instead of lam/stam - all current users end up with better code because of this Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
|
|
f48212ee8e |
treewide: remove CONFIG_HAVE_KVM
It has no users anymore. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
|
|
3f5198c7f6 |
Merge tag 'kvm-s390-master-6.8-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
pqap instruction missing cc fix vsie shadow creation race fix |
||
|
|
09d1c6a80f |
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"Generic:
- Use memdup_array_user() to harden against overflow.
- Unconditionally advertise KVM_CAP_DEVICE_CTRL for all
architectures.
- Clean up Kconfigs that all KVM architectures were selecting
- New functionality around "guest_memfd", a new userspace API that
creates an anonymous file and returns a file descriptor that refers
to it. guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be
resized. guest_memfd files do however support PUNCH_HOLE, which can
be used to switch a memory area between guest_memfd and regular
anonymous memory.
- New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
per-page attributes for a given page of guest memory; right now the
only attribute is whether the guest expects to access memory via
guest_memfd or not, which in Confidential SVMs backed by SEV-SNP,
TDX or ARM64 pKVM is checked by firmware or hypervisor that
guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in
the case of pKVM).
x86:
- Support for "software-protected VMs" that can use the new
guest_memfd and page attributes infrastructure. This is mostly
useful for testing, since there is no pKVM-like infrastructure to
provide a meaningfully reduced TCB.
- Fix a relatively benign off-by-one error when splitting huge pages
during CLEAR_DIRTY_LOG.
- Fix a bug where KVM could incorrectly test-and-clear dirty bits in
non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with
a non-huge SPTE.
- Use more generic lockdep assertions in paths that don't actually
care about whether the caller is a reader or a writer.
- let Xen guests opt out of having PV clock reported as "based on a
stable TSC", because some of them don't expect the "TSC stable" bit
(added to the pvclock ABI by KVM, but never set by Xen) to be set.
- Revert a bogus, made-up nested SVM consistency check for
TLB_CONTROL.
- Advertise flush-by-ASID support for nSVM unconditionally, as KVM
always flushes on nested transitions, i.e. always satisfies flush
requests. This allows running bleeding edge versions of VMware
Workstation on top of KVM.
- Sanity check that the CPU supports flush-by-ASID when enabling SEV
support.
- On AMD machines with vNMI, always rely on hardware instead of
intercepting IRET in some cases to detect unmasking of NMIs
- Support for virtualizing Linear Address Masking (LAM)
- Fix a variety of vPMU bugs where KVM fail to stop/reset counters
and other state prior to refreshing the vPMU model.
- Fix a double-overflow PMU bug by tracking emulated counter events
using a dedicated field instead of snapshotting the "previous"
counter. If the hardware PMC count triggers overflow that is
recognized in the same VM-Exit that KVM manually bumps an event
count, KVM would pend PMIs for both the hardware-triggered overflow
and for KVM-triggered overflow.
- Turn off KVM_WERROR by default for all configs so that it's not
inadvertantly enabled by non-KVM developers, which can be
problematic for subsystems that require no regressions for W=1
builds.
- Advertise all of the host-supported CPUID bits that enumerate
IA32_SPEC_CTRL "features".
- Don't force a masterclock update when a vCPU synchronizes to the
current TSC generation, as updating the masterclock can cause
kvmclock's time to "jump" unexpectedly, e.g. when userspace
hotplugs a pre-created vCPU.
- Use RIP-relative address to read kvm_rebooting in the VM-Enter
fault paths, partly as a super minor optimization, but mostly to
make KVM play nice with position independent executable builds.
- Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
CONFIG_HYPERV as a minor optimization, and to self-document the
code.
- Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV
"emulation" at build time.
ARM64:
- LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base
granule sizes. Branch shared with the arm64 tree.
- Large Fine-Grained Trap rework, bringing some sanity to the
feature, although there is more to come. This comes with a prefix
branch shared with the arm64 tree.
- Some additional Nested Virtualization groundwork, mostly
introducing the NV2 VNCR support and retargetting the NV support to
that version of the architecture.
- A small set of vgic fixes and associated cleanups.
Loongarch:
- Optimization for memslot hugepage checking
- Cleanup and fix some HW/SW timer issues
- Add LSX/LASX (128bit/256bit SIMD) support
RISC-V:
- KVM_GET_REG_LIST improvement for vector registers
- Generate ISA extension reg_list using macros in get-reg-list
selftest
- Support for reporting steal time along with selftest
s390:
- Bugfixes
Selftests:
- Fix an annoying goof where the NX hugepage test prints out garbage
instead of the magic token needed to run the test.
- Fix build errors when a header is delete/moved due to a missing
flag in the Makefile.
- Detect if KVM bugged/killed a selftest's VM and print out a helpful
message instead of complaining that a random ioctl() failed.
- Annotate the guest printf/assert helpers with __printf(), and fix
the various bugs that were lurking due to lack of said annotation"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits)
x86/kvm: Do not try to disable kvmclock if it was not enabled
KVM: x86: add missing "depends on KVM"
KVM: fix direction of dependency on MMU notifiers
KVM: introduce CONFIG_KVM_COMMON
KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd
KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache
RISC-V: KVM: selftests: Add get-reg-list test for STA registers
RISC-V: KVM: selftests: Add steal_time test support
RISC-V: KVM: selftests: Add guest_sbi_probe_extension
RISC-V: KVM: selftests: Move sbi_ecall to processor.c
RISC-V: KVM: Implement SBI STA extension
RISC-V: KVM: Add support for SBI STA registers
RISC-V: KVM: Add support for SBI extension registers
RISC-V: KVM: Add SBI STA info to vcpu_arch
RISC-V: KVM: Add steal-update vcpu request
RISC-V: KVM: Add SBI STA extension skeleton
RISC-V: paravirt: Implement steal-time support
RISC-V: Add SBI STA extension definitions
RISC-V: paravirt: Add skeleton for pv-time support
RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr()
...
|
||
|
|
de927f6c0b |
Merge tag 's390-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Alexander Gordeev: - Add machine variable capacity information to /proc/sysinfo. - Limit the waste of page tables and always align vmalloc area size and base address on segment boundary. - Fix a memory leak when an attempt to register interruption sub class (ISC) for the adjunct-processor (AP) guest failed. - Reset response code AP_RESPONSE_INVALID_GISA to understandable by guest AP_RESPONSE_INVALID_ADDRESS in response to a failed interruption sub class (ISC) registration attempt. - Improve reaction to adjunct-processor (AP) AP_RESPONSE_OTHERWISE_CHANGED response code when enabling interrupts on behalf of a guest. - Fix incorrect sysfs 'status' attribute of adjunct-processor (AP) queue device bound to the vfio_ap device driver when the mediated device is attached to a guest, but the queue device is not passed through. - Rework struct ap_card to hold the whole adjunct-processor (AP) card hardware information. As result, all the ugly bit checks are replaced by simple evaluations of the required bit fields. - Improve handling of some weird scenarios between service element (SE) host and SE guest with adjunct-processor (AP) pass-through support. - Change local_ctl_set_bit() and local_ctl_clear_bit() so they return the previous value of the to be changed control register. This is useful if a bit is only changed temporarily and the previous content needs to be restored. - The kernel starts with machine checks disabled and is expected to enable it once trap_init() is called. However the implementation allows machine checks early. Consistently enable it in trap_init() only. - local_mcck_disable() and local_mcck_enable() assume that machine checks are always enabled. Instead implement and use local_mcck_save() and local_mcck_restore() to disable machine checks and restore the previous state. - Modification of floating point control (FPC) register of a traced process using ptrace interface may lead to corruption of the FPC register of the tracing process. Fix this. - kvm_arch_vcpu_ioctl_set_fpu() allows to set the floating point control (FPC) register in vCPU, but may lead to corruption of the FPC register of the host process. Fix this. - Use READ_ONCE() to read a vCPU floating point register value from the memory mapped area. This avoids that, depending on code generation, a different value is tested for validity than the one that is used. - Get rid of test_fp_ctl(), since it is quite subtle to use it correctly. Instead copy a new floating point control register value into its save area and test the validity of the new value when loading it. - Remove superfluous save_fpu_regs() call. - Remove s390 support for ARCH_WANTS_DYNAMIC_TASK_STRUCT. All machines provide the vector facility since many years and the need to make the task structure size dependent on the vector facility does not exist. - Remove the "novx" kernel command line option, as the vector code runs without any problems since many years. - Add the vector facility to the z13 architecture level set (ALS). All hypervisors support the vector facility since many years. This allows compile time optimizations of the kernel. - Get rid of MACHINE_HAS_VX and replace it with cpu_has_vx(). As result, the compiled code will have less runtime checks and less code. - Convert pgste_get_lock() and pgste_set_unlock() ASM inlines to C. - Convert the struct subchannel spinlock from pointer to member. * tag 's390-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (24 commits) Revert "s390: update defconfigs" s390/cio: make sch->lock spinlock pointer a member s390: update defconfigs s390/mm: convert pgste locking functions to C s390/fpu: get rid of MACHINE_HAS_VX s390/als: add vector facility to z13 architecture level set s390/fpu: remove "novx" option s390/fpu: remove ARCH_WANTS_DYNAMIC_TASK_STRUCT support KVM: s390: remove superfluous save_fpu_regs() call s390/fpu: get rid of test_fp_ctl() KVM: s390: use READ_ONCE() to read fpc register value KVM: s390: fix setting of fpc register s390/ptrace: handle setting of fpc register correctly s390/nmi: implement and use local_mcck_save() / local_mcck_restore() s390/nmi: consistently enable machine checks in trap_init() s390/ctlreg: return old register contents when changing bits s390/ap: handle outband SE bind state change s390/ap: store TAPQ hwinfo in struct ap_card s390/vfio-ap: fix sysfs status attribute for AP queue devices s390/vfio-ap: improve reaction to response code 07 from PQAP(AQIC) command ... |
||
|
|
83303a4c77 |
KVM: s390: fix cc for successful PQAP
The various errors that are possible when processing a PQAP
instruction (the absence of a driver hook, an error FROM that
hook), all correctly set the PSW condition code to 3. But if
that processing works successfully, CC0 needs to be set to
convey that everything was fine.
Fix the check so that the guest can examine the condition code
to determine whether GPR1 has meaningful data.
Fixes:
|
||
|
|
fb872da8e7 |
Merge tag 'kvm-x86-generic-6.8' of https://github.com/kvm-x86/linux into HEAD
Common KVM changes for 6.8: - Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures. |
||
|
|
caadf876bb |
KVM: introduce CONFIG_KVM_COMMON
CONFIG_HAVE_KVM is currently used by some architectures to either
enabled the KVM config proper, or to enable host-side code that is
not part of the KVM module. However, CONFIG_KVM's "select" statement
in virt/kvm/Kconfig corresponds to a third meaning, namely to
enable common Kconfigs required by all architectures that support
KVM.
These three meanings can be replaced respectively by an
architecture-specific Kconfig, by IS_ENABLED(CONFIG_KVM), or by
a new Kconfig symbol that is in turn selected by the
architecture-specific "config KVM".
Start by introducing such a new Kconfig symbol, CONFIG_KVM_COMMON.
Unlike CONFIG_HAVE_KVM, it is selected by CONFIG_KVM, not by
architecture code, and it brings in all dependencies of common
KVM code. In particular, INTERVAL_TREE was missing in loongarch
and riscv, so that is another thing that is fixed.
Fixes:
|
||
|
|
731859dde8 |
Merge tag 'kvm-s390-next-6.8-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
- uvdevice fixed additional data return length - stfle (feature indication) vsie fixes and minor cleanup |
||
|
|
136292522e |
Merge tag 'loongarch-kvm-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.8 1. Optimization for memslot hugepage checking. 2. Cleanup and fix some HW/SW timer issues. 3. Add LSX/LASX (128bit/256bit SIMD) support. |
||
|
|
682dbf430d |
KVM: s390: vsie: Fix length of facility list shadowed
The length of the facility list accessed when interpretively executing STFLE is the same as the hosts facility list (in case of format-0) The memory following the facility list doesn't need to be accessible. The current VSIE implementation accesses a fixed length that exceeds the guest/host facility list length and can therefore wrongly inject a validity intercept. Instead, find out the host facility list length by running STFLE and copy only as much as necessary when shadowing. Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20231219140854.1042599-3-nsg@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20231219140854.1042599-3-nsg@linux.ibm.com> |
||
|
|
2731d605d5 |
KVM: s390: vsie: Fix STFLE interpretive execution identification
STFLE can be interpretively executed.
This occurs when the facility list designation is unequal to zero.
Perform the check before applying the address mask instead of after.
Fixes:
|
||
|
|
fe752331d4 |
KVM: s390: vsie: fix race during shadow creation
Right now it is possible to see gmap->private being zero in
kvm_s390_vsie_gmap_notifier resulting in a crash. This is due to the
fact that we add gmap->private == kvm after creation:
static int acquire_gmap_shadow(struct kvm_vcpu *vcpu,
struct vsie_page *vsie_page)
{
[...]
gmap = gmap_shadow(vcpu->arch.gmap, asce, edat);
if (IS_ERR(gmap))
return PTR_ERR(gmap);
gmap->private = vcpu->kvm;
Let children inherit the private field of the parent.
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Fixes:
|
||
|
|
18564756ab |
s390/fpu: get rid of MACHINE_HAS_VX
Get rid of MACHINE_HAS_VX and replace it with cpu_has_vx() which is a short readable wrapper for "test_facility(129)". Facility bit 129 is set if the vector facility is present. test_facility() returns also true for all bits which are set in the architecture level set of the cpu that the kernel is compiled for. This means that test_facility(129) is a compile time constant which returns true for z13 and later, since the vector facility bit is part of the z13 kernel ALS. In result the compiled code will have less runtime checks, and less code. Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |