Expand SoundWire MBQ register map support

Merge series from Charles Keepax <ckeepax@opensource.cirrus.com>:

The current SDCA MBQ (Multi-Byte Quantities) register map only
supports 16-bit types, add support for more sizes and then update
the rt722 driver to use the new support. We also add support for
the deferring feature of MBQs to allow hardware to indicate it is
not currently ready to service a read/write.

Afraid I don't have hardware to test the rt722 change so it is
only build tested, but I thought it good to include a change to
demonstrate the new features in use.
This commit is contained in:
Mark Brown
2025-01-07 23:28:07 +00:00
845 changed files with 13965 additions and 8249 deletions
+2 -1
View File
@@ -435,7 +435,7 @@ Martin Kepplinger <martink@posteo.de> <martin.kepplinger@ginzinger.com>
Martin Kepplinger <martink@posteo.de> <martin.kepplinger@puri.sm>
Martin Kepplinger <martink@posteo.de> <martin.kepplinger@theobroma-systems.com>
Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> <martyna.szapar-mudlaw@intel.com>
Mathieu Othacehe <m.othacehe@gmail.com> <othacehe@gnu.org>
Mathieu Othacehe <othacehe@gnu.org> <m.othacehe@gmail.com>
Mat Martineau <martineau@kernel.org> <mathew.j.martineau@linux.intel.com>
Mat Martineau <martineau@kernel.org> <mathewm@codeaurora.org>
Matthew Wilcox <willy@infradead.org> <matthew.r.wilcox@intel.com>
@@ -735,6 +735,7 @@ Wolfram Sang <wsa@kernel.org> <w.sang@pengutronix.de>
Wolfram Sang <wsa@kernel.org> <wsa@the-dreams.de>
Yakir Yang <kuankuan.y@gmail.com> <ykk@rock-chips.com>
Yanteng Si <si.yanteng@linux.dev> <siyanteng@loongson.cn>
Ying Huang <huang.ying.caritas@gmail.com> <ying.huang@intel.com>
Yusuke Goda <goda.yusuke@renesas.com>
Zack Rusin <zack.rusin@broadcom.com> <zackr@vmware.com>
Zhu Yanjun <zyjzyj2000@gmail.com> <yanjunz@nvidia.com>
@@ -4822,6 +4822,11 @@
can be preempted anytime. Tasks will also yield
contended spinlocks (if the critical section isn't
explicitly preempt disabled beyond the lock itself).
lazy - Scheduler controlled. Similar to full but instead
of preempting the task immediately, the task gets
one HZ tick time to yield itself before the
preemption will be forced. One preemption is when the
task returns to user space.
print-fatal-signals=
[KNL] debug: print fatal signals
@@ -445,8 +445,10 @@ event code Key Notes
0x1008 0x07 FN+F8 IBM: toggle screen expand
Lenovo: configure UltraNav,
or toggle screen expand.
On newer platforms (2024+)
replaced by 0x131f (see below)
On 2024 platforms replaced by
0x131f (see below) and on newer
platforms (2025 +) keycode is
replaced by 0x1401 (see below).
0x1009 0x08 FN+F9 -
@@ -506,9 +508,11 @@ event code Key Notes
0x1019 0x18 unknown
0x131f ... FN+F8 Platform Mode change.
0x131f ... FN+F8 Platform Mode change (2024 systems).
Implemented in driver.
0x1401 ... FN+F8 Platform Mode change (2025 + systems).
Implemented in driver.
... ... ...
0x1020 0x1F unknown
+1 -1
View File
@@ -436,7 +436,7 @@ AnonHugePmdMapped).
The number of file transparent huge pages mapped to userspace is available
by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
To identify what applications are mapping file transparent huge pages, it
is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
is necessary to read ``/proc/PID/smaps`` and count the FilePmdMapped fields
for each mapping.
Note that reading the smaps file is expensive and reading it
+1 -3
View File
@@ -251,9 +251,7 @@ performance supported in `AMD CPPC Performance Capability <perf_cap_>`_).
In some ASICs, the highest CPPC performance is not the one in the ``_CPC``
table, so we need to expose it to sysfs. If boost is not active, but
still supported, this maximum frequency will be larger than the one in
``cpuinfo``. On systems that support preferred core, the driver will have
different values for some cores than others and this will reflect the values
advertised by the platform at bootup.
``cpuinfo``.
This attribute is read-only.
``amd_pstate_lowest_nonlinear_freq``
@@ -114,8 +114,9 @@ patternProperties:
table that specifies the PPID to LIODN mapping. Needed if the PAMU is
used. Value is a 12 bit value where value is a LIODN ID for this JR.
This property is normally set by boot firmware.
$ref: /schemas/types.yaml#/definitions/uint32
maximum: 0xfff
$ref: /schemas/types.yaml#/definitions/uint32-array
items:
- maximum: 0xfff
'^rtic@[0-9a-f]+$':
type: object
@@ -186,8 +187,9 @@ patternProperties:
Needed if the PAMU is used. Value is a 12 bit value where value
is a LIODN ID for this JR. This property is normally set by boot
firmware.
$ref: /schemas/types.yaml#/definitions/uint32
maximum: 0xfff
$ref: /schemas/types.yaml#/definitions/uint32-array
items:
- maximum: 0xfff
fsl,rtic-region:
description:
@@ -90,7 +90,7 @@ properties:
adi,dsi-lanes:
description: Number of DSI data lanes connected to the DSI host.
$ref: /schemas/types.yaml#/definitions/uint32
enum: [ 1, 2, 3, 4 ]
enum: [ 2, 3, 4 ]
"#sound-dai-cells":
const: 0
@@ -82,7 +82,7 @@ examples:
uimage@100000 {
reg = <0x0100000 0x200000>;
compress = "lzma";
compression = "lzma";
};
};
@@ -113,11 +113,8 @@ allOf:
maxItems: 1
- if:
properties:
compatible:
contains:
enum:
- fsl,imx95-usb-phy
required:
- orientation-switch
then:
$ref: /schemas/usb/usb-switch.yaml#
@@ -18,6 +18,7 @@ properties:
compatible:
enum:
- qcom,qca6390-pmu
- qcom,wcn6750-pmu
- qcom,wcn6855-pmu
- qcom,wcn7850-pmu
@@ -27,6 +28,9 @@ properties:
vddaon-supply:
description: VDD_AON supply regulator handle
vddasd-supply:
description: VDD_ASD supply regulator handle
vdddig-supply:
description: VDD_DIG supply regulator handle
@@ -42,6 +46,9 @@ properties:
vddio1p2-supply:
description: VDD_IO_1P2 supply regulator handle
vddrfa0p8-supply:
description: VDD_RFA_0P8 supply regulator handle
vddrfa0p95-supply:
description: VDD_RFA_0P95 supply regulator handle
@@ -51,12 +58,18 @@ properties:
vddrfa1p3-supply:
description: VDD_RFA_1P3 supply regulator handle
vddrfa1p7-supply:
description: VDD_RFA_1P7 supply regulator handle
vddrfa1p8-supply:
description: VDD_RFA_1P8 supply regulator handle
vddrfa1p9-supply:
description: VDD_RFA_1P9 supply regulator handle
vddrfa2p2-supply:
description: VDD_RFA_2P2 supply regulator handle
vddpcie1p3-supply:
description: VDD_PCIE_1P3 supply regulator handle
@@ -119,6 +132,20 @@ allOf:
- vddpcie1p3-supply
- vddpcie1p9-supply
- vddio-supply
- if:
properties:
compatible:
contains:
const: qcom,wcn6750-pmu
then:
required:
- vddaon-supply
- vddasd-supply
- vddpmu-supply
- vddrfa0p8-supply
- vddrfa1p2-supply
- vddrfa1p7-supply
- vddrfa2p2-supply
- if:
properties:
compatible:
@@ -35,6 +35,7 @@ properties:
fsl,liodn:
$ref: /schemas/types.yaml#/definitions/uint32-array
maxItems: 2
description: See pamu.txt. Two LIODN(s). DQRR LIODN (DLIODN) and Frame LIODN
(FLIODN)
@@ -69,6 +70,7 @@ patternProperties:
type: object
properties:
fsl,liodn:
$ref: /schemas/types.yaml#/definitions/uint32-array
description: See pamu.txt, PAMU property used for static LIODN assignment
fsl,iommu-parent:
@@ -51,7 +51,7 @@ properties:
description: Power supply for AVDD, providing 1.8V.
cpvdd-supply:
description: Power supply for CPVDD, providing 3.5V.
description: Power supply for CPVDD, providing 1.8V.
hp-detect-gpios:
description:
+850
View File
@@ -3,3 +3,853 @@
=================
Process Addresses
=================
.. toctree::
:maxdepth: 3
Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
'VMA's of type :c:struct:`!struct vm_area_struct`.
Each VMA describes a virtually contiguous memory range with identical
attributes, each described by a :c:struct:`!struct vm_area_struct`
object. Userland access outside of VMAs is invalid except in the case where an
adjacent stack VMA could be extended to contain the accessed address.
All VMAs are contained within one and only one virtual address space, described
by a :c:struct:`!struct mm_struct` object which is referenced by all tasks (that is,
threads) which share the virtual address space. We refer to this as the
:c:struct:`!mm`.
Each mm object contains a maple tree data structure which describes all VMAs
within the virtual address space.
.. note:: An exception to this is the 'gate' VMA which is provided by
architectures which use :c:struct:`!vsyscall` and is a global static
object which does not belong to any specific mm.
-------
Locking
-------
The kernel is designed to be highly scalable against concurrent read operations
on VMA **metadata** so a complicated set of locks are required to ensure memory
corruption does not occur.
.. note:: Locking VMAs for their metadata does not have any impact on the memory
they describe nor the page tables that map them.
Terminology
-----------
* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock`
which locks at a process address space granularity which can be acquired via
:c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
as a read/write semaphore in practice. A VMA read lock is obtained via
:c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
automatically when the mmap write lock is released). To take a VMA write lock
you **must** have already acquired an :c:func:`!mmap_write_lock`.
* **rmap locks** - When trying to access VMAs through the reverse mapping via a
:c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object
(reachable from a folio via :c:member:`!folio->mapping`). VMAs must be stabilised via
:c:func:`!anon_vma_[try]lock_read` or :c:func:`!anon_vma_[try]lock_write` for
anonymous memory and :c:func:`!i_mmap_[try]lock_read` or
:c:func:`!i_mmap_[try]lock_write` for file-backed memory. We refer to these
locks as the reverse mapping locks, or 'rmap locks' for brevity.
We discuss page table locks separately in the dedicated section below.
The first thing **any** of these locks achieve is to **stabilise** the VMA
within the MM tree. That is, guaranteeing that the VMA object will not be
deleted from under you nor modified (except for some specific fields
described below).
Stabilising a VMA also keeps the address space described by it around.
Lock usage
----------
If you want to **read** VMA metadata fields or just keep the VMA stable, you
must do one of the following:
* Obtain an mmap read lock at the MM granularity via :c:func:`!mmap_read_lock` (or a
suitable variant), unlocking it with a matching :c:func:`!mmap_read_unlock` when
you're done with the VMA, *or*
* Try to obtain a VMA read lock via :c:func:`!lock_vma_under_rcu`. This tries to
acquire the lock atomically so might fail, in which case fall-back logic is
required to instead obtain an mmap read lock if this returns :c:macro:`!NULL`,
*or*
* Acquire an rmap lock before traversing the locked interval tree (whether
anonymous or file-backed) to obtain the required VMA.
If you want to **write** VMA metadata fields, then things vary depending on the
field (we explore each VMA field in detail below). For the majority you must:
* Obtain an mmap write lock at the MM granularity via :c:func:`!mmap_write_lock` (or a
suitable variant), unlocking it with a matching :c:func:`!mmap_write_unlock` when
you're done with the VMA, *and*
* Obtain a VMA write lock via :c:func:`!vma_start_write` for each VMA you wish to
modify, which will be released automatically when :c:func:`!mmap_write_unlock` is
called.
* If you want to be able to write to **any** field, you must also hide the VMA
from the reverse mapping by obtaining an **rmap write lock**.
VMA locks are special in that you must obtain an mmap **write** lock **first**
in order to obtain a VMA **write** lock. A VMA **read** lock however can be
obtained without any other lock (:c:func:`!lock_vma_under_rcu` will acquire then
release an RCU lock to lookup the VMA for you).
This constrains the impact of writers on readers, as a writer can interact with
one VMA while a reader interacts with another simultaneously.
.. note:: The primary users of VMA read locks are page fault handlers, which
means that without a VMA write lock, page faults will run concurrent with
whatever you are doing.
Examining all valid lock states:
.. table::
========= ======== ========= ======= ===== =========== ==========
mmap lock VMA lock rmap lock Stable? Read? Write most? Write all?
========= ======== ========= ======= ===== =========== ==========
\- \- \- N N N N
\- R \- Y Y N N
\- \- R/W Y Y N N
R/W \-/R \-/R/W Y Y N N
W W \-/R Y Y Y N
W W W Y Y Y Y
========= ======== ========= ======= ===== =========== ==========
.. warning:: While it's possible to obtain a VMA lock while holding an mmap read lock,
attempting to do the reverse is invalid as it can result in deadlock - if
another task already holds an mmap write lock and attempts to acquire a VMA
write lock that will deadlock on the VMA read lock.
All of these locks behave as read/write semaphores in practice, so you can
obtain either a read or a write lock for each of these.
.. note:: Generally speaking, a read/write semaphore is a class of lock which
permits concurrent readers. However a write lock can only be obtained
once all readers have left the critical region (and pending readers
made to wait).
This renders read locks on a read/write semaphore concurrent with other
readers and write locks exclusive against all others holding the semaphore.
VMA fields
^^^^^^^^^^
We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
easier to explore their locking characteristics:
.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
are in effect an internal implementation detail.
.. table:: Virtual layout fields
===================== ======================================== ===========
Field Description Write lock
===================== ======================================== ===========
:c:member:`!vm_start` Inclusive start virtual address of range mmap write,
VMA describes. VMA write,
rmap write.
:c:member:`!vm_end` Exclusive end virtual address of range mmap write,
VMA describes. VMA write,
rmap write.
:c:member:`!vm_pgoff` Describes the page offset into the file, mmap write,
the original page offset within the VMA write,
virtual address space (prior to any rmap write.
:c:func:`!mremap`), or PFN if a PFN map
and the architecture does not support
:c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.
===================== ======================================== ===========
These fields describes the size, start and end of the VMA, and as such cannot be
modified without first being hidden from the reverse mapping since these fields
are used to locate VMAs within the reverse mapping interval trees.
.. table:: Core fields
============================ ======================================== =========================
Field Description Write lock
============================ ======================================== =========================
:c:member:`!vm_mm` Containing mm_struct. None - written once on
initial map.
:c:member:`!vm_page_prot` Architecture-specific page table mmap write, VMA write.
protection bits determined from VMA
flags.
:c:member:`!vm_flags` Read-only access to VMA flags describing N/A
attributes of the VMA, in union with
private writable
:c:member:`!__vm_flags`.
:c:member:`!__vm_flags` Private, writable access to VMA flags mmap write, VMA write.
field, updated by
:c:func:`!vm_flags_*` functions.
:c:member:`!vm_file` If the VMA is file-backed, points to a None - written once on
struct file object describing the initial map.
underlying file, if anonymous then
:c:macro:`!NULL`.
:c:member:`!vm_ops` If the VMA is file-backed, then either None - Written once on
the driver or file-system provides a initial map by
:c:struct:`!struct vm_operations_struct` :c:func:`!f_ops->mmap()`.
object describing callbacks to be
invoked on VMA lifetime events.
:c:member:`!vm_private_data` A :c:member:`!void *` field for Handled by driver.
driver-specific metadata.
============================ ======================================== =========================
These are the core fields which describe the MM the VMA belongs to and its attributes.
.. table:: Config-specific fields
================================= ===================== ======================================== ===============
Field Configuration option Description Write lock
================================= ===================== ======================================== ===============
:c:member:`!anon_name` CONFIG_ANON_VMA_NAME A field for storing a mmap write,
:c:struct:`!struct anon_vma_name` VMA write.
object providing a name for anonymous
mappings, or :c:macro:`!NULL` if none
is set or the VMA is file-backed. The
underlying object is reference counted
and can be shared across multiple VMAs
for scalability.
:c:member:`!swap_readahead_info` CONFIG_SWAP Metadata used by the swap mechanism mmap read,
to perform readahead. This field is swap-specific
accessed atomically. lock.
:c:member:`!vm_policy` CONFIG_NUMA :c:type:`!mempolicy` object which mmap write,
describes the NUMA behaviour of the VMA write.
VMA. The underlying object is reference
counted.
:c:member:`!numab_state` CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which mmap read,
describes the current state of numab-specific
NUMA balancing in relation to this VMA. lock.
Updated under mmap read lock by
:c:func:`!task_numa_work`.
:c:member:`!vm_userfaultfd_ctx` CONFIG_USERFAULTFD Userfaultfd context wrapper object of mmap write,
type :c:type:`!vm_userfaultfd_ctx`, VMA write.
either of zero size if userfaultfd is
disabled, or containing a pointer
to an underlying
:c:type:`!userfaultfd_ctx` object which
describes userfaultfd metadata.
================================= ===================== ======================================== ===============
These fields are present or not depending on whether the relevant kernel
configuration option is set.
.. table:: Reverse mapping fields
=================================== ========================================= ============================
Field Description Write lock
=================================== ========================================= ============================
:c:member:`!shared.rb` A red/black tree node used, if the mmap write, VMA write,
mapping is file-backed, to place the VMA i_mmap write.
in the
:c:member:`!struct address_space->i_mmap`
red/black interval tree.
:c:member:`!shared.rb_subtree_last` Metadata used for management of the mmap write, VMA write,
interval tree if the VMA is file-backed. i_mmap write.
:c:member:`!anon_vma_chain` List of pointers to both forked/CoWd mmap read, anon_vma write.
:c:type:`!anon_vma` objects and
:c:member:`!vma->anon_vma` if it is
non-:c:macro:`!NULL`.
:c:member:`!anon_vma` :c:type:`!anon_vma` object used by When :c:macro:`NULL` and
anonymous folios mapped exclusively to setting non-:c:macro:`NULL`:
this VMA. Initially set by mmap read, page_table_lock.
:c:func:`!anon_vma_prepare` serialised
by the :c:macro:`!page_table_lock`. This When non-:c:macro:`NULL` and
is set as soon as any page is faulted in. setting :c:macro:`NULL`:
mmap write, VMA write,
anon_vma write.
=================================== ========================================= ============================
These fields are used to both place the VMA within the reverse mapping, and for
anonymous mappings, to be able to access both related :c:struct:`!struct anon_vma` objects
and the :c:struct:`!struct anon_vma` in which folios mapped exclusively to this VMA should
reside.
.. note:: If a file-backed mapping is mapped with :c:macro:`!MAP_PRIVATE` set
then it can be in both the :c:type:`!anon_vma` and :c:type:`!i_mmap`
trees at the same time, so all of these fields might be utilised at
once.
Page tables
-----------
We won't speak exhaustively on the subject but broadly speaking, page tables map
virtual addresses to physical ones through a series of page tables, each of
which contain entries with physical addresses for the next page table level
(along with flags), and at the leaf level the physical addresses of the
underlying physical data pages or a special entry such as a swap entry,
migration entry or other special marker. Offsets into these pages are provided
by the virtual address itself.
In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
pages might eliminate one or two of these levels, but when this is the case we
typically refer to the leaf level as the PTE level regardless.
.. note:: In instances where the architecture supports fewer page tables than
five the kernel cleverly 'folds' page table levels, that is stubbing
out functions related to the skipped levels. This allows us to
conceptually act as if there were always five levels, even if the
compiler might, in practice, eliminate any code relating to missing
ones.
There are four key operations typically performed on page tables:
1. **Traversing** page tables - Simply reading page tables in order to traverse
them. This only requires that the VMA is kept stable, so a lock which
establishes this suffices for traversal (there are also lockless variants
which eliminate even this requirement, such as :c:func:`!gup_fast`).
2. **Installing** page table mappings - Whether creating a new mapping or
modifying an existing one in such a way as to change its identity. This
requires that the VMA is kept stable via an mmap or VMA lock (explicitly not
rmap locks).
3. **Zapping/unmapping** page table entries - This is what the kernel calls
clearing page table mappings at the leaf level only, whilst leaving all page
tables in place. This is a very common operation in the kernel performed on
file truncation, the :c:macro:`!MADV_DONTNEED` operation via
:c:func:`!madvise`, and others. This is performed by a number of functions
including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`.
The VMA need only be kept stable for this operation.
4. **Freeing** page tables - When finally the kernel removes page tables from a
userland process (typically via :c:func:`!free_pgtables`) extreme care must
be taken to ensure this is done safely, as this logic finally frees all page
tables in the specified range, ignoring existing leaf entries (it assumes the
caller has both zapped the range and prevented any further faults or
modifications within it).
.. note:: Modifying mappings for reclaim or migration is performed under rmap
lock as it, like zapping, does not fundamentally modify the identity
of what is being mapped.
**Traversing** and **zapping** ranges can be performed holding any one of the
locks described in the terminology section above - that is the mmap lock, the
VMA lock or either of the reverse mapping locks.
That is - as long as you keep the relevant VMA **stable** - you are good to go
ahead and perform these operations on page tables (though internally, kernel
operations that perform writes also acquire internal page table locks to
serialise - see the page table implementation detail section for more details).
When **installing** page table entries, the mmap or VMA lock must be held to
keep the VMA stable. We explore why this is in the page table locking details
section below.
.. warning:: Page tables are normally only traversed in regions covered by VMAs.
If you want to traverse page tables in areas that might not be
covered by VMAs, heavier locking is required.
See :c:func:`!walk_page_range_novma` for details.
**Freeing** page tables is an entirely internal memory management operation and
has special requirements (see the page freeing section below for more details).
.. warning:: When **freeing** page tables, it must not be possible for VMAs
containing the ranges those page tables map to be accessible via
the reverse mapping.
The :c:func:`!free_pgtables` function removes the relevant VMAs
from the reverse mappings, but no other VMAs can be permitted to be
accessible and span the specified range.
Lock ordering
-------------
As we have multiple locks across the kernel which may or may not be taken at the
same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
the **order** in which locks are acquired and released becomes very important.
.. note:: Lock inversion occurs when two threads need to acquire multiple locks,
but in doing so inadvertently cause a mutual deadlock.
For example, consider thread 1 which holds lock A and tries to acquire lock B,
while thread 2 holds lock B and tries to acquire lock A.
Both threads are now deadlocked on each other. However, had they attempted to
acquire locks in the same order, one would have waited for the other to
complete its work and no deadlock would have occurred.
The opening comment in :c:macro:`!mm/rmap.c` describes in detail the required
ordering of locks within memory management code:
.. code-block::
inode->i_rwsem (while writing or truncating, not reading or faulting)
mm->mmap_lock
mapping->invalidate_lock (in filemap_fault)
folio_lock
hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
vma_start_write
mapping->i_mmap_rwsem
anon_vma->rwsem
mm->page_table_lock or pte_lock
swap_lock (in swap_duplicate, swap_info_get)
mmlist_lock (in mmput, drain_mmlist and others)
mapping->private_lock (in block_dirty_folio)
i_pages lock (widely used)
lruvec->lru_lock (in folio_lruvec_lock_irq)
inode->i_lock (in set_page_dirty's __mark_inode_dirty)
bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
sb_lock (within inode_lock in fs/fs-writeback.c)
i_pages lock (widely used, in set_page_dirty,
in arch-dependent flush_dcache_mmap_lock,
within bdi.wb->list_lock in __sync_single_inode)
There is also a file-system specific lock ordering comment located at the top of
:c:macro:`!mm/filemap.c`:
.. code-block::
->i_mmap_rwsem (truncate_pagecache)
->private_lock (__free_pte->block_dirty_folio)
->swap_lock (exclusive_swap_page, others)
->i_pages lock
->i_rwsem
->invalidate_lock (acquired by fs in truncate path)
->i_mmap_rwsem (truncate->unmap_mapping_range)
->mmap_lock
->i_mmap_rwsem
->page_table_lock or pte_lock (various, mainly in memory.c)
->i_pages lock (arch-dependent flush_dcache_mmap_lock)
->mmap_lock
->invalidate_lock (filemap_fault)
->lock_page (filemap_fault, access_process_vm)
->i_rwsem (generic_perform_write)
->mmap_lock (fault_in_readable->do_page_fault)
bdi->wb.list_lock
sb_lock (fs/fs-writeback.c)
->i_pages lock (__sync_single_inode)
->i_mmap_rwsem
->anon_vma.lock (vma_merge)
->anon_vma.lock
->page_table_lock or pte_lock (anon_vma_prepare and various)
->page_table_lock or pte_lock
->swap_lock (try_to_unmap_one)
->private_lock (try_to_unmap_one)
->i_pages lock (try_to_unmap_one)
->lruvec->lru_lock (follow_page_mask->mark_page_accessed)
->lruvec->lru_lock (check_pte_range->folio_isolate_lru)
->private_lock (folio_remove_rmap_pte->set_page_dirty)
->i_pages lock (folio_remove_rmap_pte->set_page_dirty)
bdi.wb->list_lock (folio_remove_rmap_pte->set_page_dirty)
->inode->i_lock (folio_remove_rmap_pte->set_page_dirty)
bdi.wb->list_lock (zap_pte_range->set_page_dirty)
->inode->i_lock (zap_pte_range->set_page_dirty)
->private_lock (zap_pte_range->block_dirty_folio)
Please check the current state of these comments which may have changed since
the time of writing of this document.
------------------------------
Locking Implementation Details
------------------------------
.. warning:: Locking rules for PTE-level page tables are very different from
locking rules for page tables at other levels.
Page table locking details
--------------------------
In addition to the locks described in the terminology section above, we have
additional locks dedicated to page tables:
* **Higher level page table locks** - Higher level page tables, that is PGD, P4D
and PUD each make use of the process address space granularity
:c:member:`!mm->page_table_lock` lock when modified.
* **Fine-grained page table locks** - PMDs and PTEs each have fine-grained locks
either kept within the folios describing the page tables or allocated
separated and pointed at by the folios if :c:macro:`!ALLOC_SPLIT_PTLOCKS` is
set. The PMD spin lock is obtained via :c:func:`!pmd_lock`, however PTEs are
mapped into higher memory (if a 32-bit system) and carefully locked via
:c:func:`!pte_offset_map_lock`.
These locks represent the minimum required to interact with each page table
level, but there are further requirements.
Importantly, note that on a **traversal** of page tables, sometimes no such
locks are taken. However, at the PTE level, at least concurrent page table
deletion must be prevented (using RCU) and the page table must be mapped into
high memory, see below.
Whether care is taken on reading the page table entries depends on the
architecture, see the section on atomicity below.
Locking rules
^^^^^^^^^^^^^
We establish basic locking rules when interacting with page tables:
* When changing a page table entry the page table lock for that page table
**must** be held, except if you can safely assume nobody can access the page
tables concurrently (such as on invocation of :c:func:`!free_pgtables`).
* Reads from and writes to page table entries must be *appropriately*
atomic. See the section on atomicity below for details.
* Populating previously empty entries requires that the mmap or VMA locks are
held (read or write), doing so with only rmap locks would be dangerous (see
the warning below).
* As mentioned previously, zapping can be performed while simply keeping the VMA
stable, that is holding any one of the mmap, VMA or rmap locks.
.. warning:: Populating previously empty entries is dangerous as, when unmapping
VMAs, :c:func:`!vms_clear_ptes` has a window of time between
zapping (via :c:func:`!unmap_vmas`) and freeing page tables (via
:c:func:`!free_pgtables`), where the VMA is still visible in the
rmap tree. :c:func:`!free_pgtables` assumes that the zap has
already been performed and removes PTEs unconditionally (along with
all other page tables in the freed range), so installing new PTE
entries could leak memory and also cause other unexpected and
dangerous behaviour.
There are additional rules applicable when moving page tables, which we discuss
in the section on this topic below.
PTE-level page tables are different from page tables at other levels, and there
are extra requirements for accessing them:
* On 32-bit architectures, they may be in high memory (meaning they need to be
mapped into kernel memory to be accessible).
* When empty, they can be unlinked and RCU-freed while holding an mmap lock or
rmap lock for reading in combination with the PTE and PMD page table locks.
In particular, this happens in :c:func:`!retract_page_tables` when handling
:c:macro:`!MADV_COLLAPSE`.
So accessing PTE-level page tables requires at least holding an RCU read lock;
but that only suffices for readers that can tolerate racing with concurrent
page table updates such that an empty PTE is observed (in a page table that
has actually already been detached and marked for RCU freeing) while another
new page table has been installed in the same location and filled with
entries. Writers normally need to take the PTE lock and revalidate that the
PMD entry still refers to the same PTE-level page table.
To access PTE-level page tables, a helper like :c:func:`!pte_offset_map_lock` or
:c:func:`!pte_offset_map` can be used depending on stability requirements.
These map the page table into kernel memory if required, take the RCU lock, and
depending on variant, may also look up or acquire the PTE lock.
See the comment on :c:func:`!__pte_offset_map_lock`.
Atomicity
^^^^^^^^^
Regardless of page table locks, the MMU hardware concurrently updates accessed
and dirty bits (perhaps more, depending on architecture). Additionally, page
table traversal operations in parallel (though holding the VMA stable) and
functionality like GUP-fast locklessly traverses (that is reads) page tables,
without even keeping the VMA stable at all.
When performing a page table traversal and keeping the VMA stable, whether a
read must be performed once and only once or not depends on the architecture
(for instance x86-64 does not require any special precautions).
If a write is being performed, or if a read informs whether a write takes place
(on an installation of a page table entry say, for instance in
:c:func:`!__pud_install`), special care must always be taken. In these cases we
can never assume that page table locks give us entirely exclusive access, and
must retrieve page table entries once and only once.
If we are reading page table entries, then we need only ensure that the compiler
does not rearrange our loads. This is achieved via :c:func:`!pXXp_get`
functions - :c:func:`!pgdp_get`, :c:func:`!p4dp_get`, :c:func:`!pudp_get`,
:c:func:`!pmdp_get`, and :c:func:`!ptep_get`.
Each of these uses :c:func:`!READ_ONCE` to guarantee that the compiler reads
the page table entry only once.
However, if we wish to manipulate an existing page table entry and care about
the previously stored data, we must go further and use an hardware atomic
operation as, for example, in :c:func:`!ptep_get_and_clear`.
Equally, operations that do not rely on the VMA being held stable, such as
GUP-fast (see :c:func:`!gup_fast` and its various page table level handlers like
:c:func:`!gup_fast_pte_range`), must very carefully interact with page table
entries, using functions such as :c:func:`!ptep_get_lockless` and equivalent for
higher level page table levels.
Writes to page table entries must also be appropriately atomic, as established
by :c:func:`!set_pXX` functions - :c:func:`!set_pgd`, :c:func:`!set_p4d`,
:c:func:`!set_pud`, :c:func:`!set_pmd`, and :c:func:`!set_pte`.
Equally functions which clear page table entries must be appropriately atomic,
as in :c:func:`!pXX_clear` functions - :c:func:`!pgd_clear`,
:c:func:`!p4d_clear`, :c:func:`!pud_clear`, :c:func:`!pmd_clear`, and
:c:func:`!pte_clear`.
Page table installation
^^^^^^^^^^^^^^^^^^^^^^^
Page table installation is performed with the VMA held stable explicitly by an
mmap or VMA lock in read or write mode (see the warning in the locking rules
section for details as to why).
When allocating a P4D, PUD or PMD and setting the relevant entry in the above
PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
:c:func:`!__pmd_alloc` respectively.
.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
:c:func:`!pud_lockptr` in turn, however at the time of writing it ultimately
references the :c:member:`!mm->page_table_lock`.
Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, a lock embedded in the PMD
physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
:c:func:`!__pte_alloc`.
Finally, modifying the contents of the PTE requires special treatment, as the
PTE page table lock must be acquired whenever we want stable and exclusive
access to entries contained within a PTE, especially when we wish to modify
them.
This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
ensure that the PTE hasn't changed from under us, ultimately invoking
:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
must be released via :c:func:`!pte_unmap_unlock`.
.. note:: There are some variants on this, such as
:c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable but
for brevity we do not explore this. See the comment for
:c:func:`!__pte_offset_map_lock` for more details.
When modifying data in ranges we typically only wish to allocate higher page
tables as necessary, using these locks to avoid races or overwriting anything,
and set/clear data at the PTE level as required (for instance when page faulting
or zapping).
A typical pattern taken when traversing page table entries to install a new
mapping is to optimistically determine whether the page table entry in the table
above is empty, if so, only then acquiring the page table lock and checking
again to see if it was allocated underneath us.
This allows for a traversal with page table locks only being taken when
required. An example of this is :c:func:`!__pud_alloc`.
At the leaf page table, that is the PTE, we can't entirely rely on this pattern
as we have separate PMD and PTE locks and a THP collapse for instance might have
eliminated the PMD entry as well as the PTE from under us.
This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry
for the PTE, carefully checking it is as expected, before acquiring the
PTE-specific lock, and then *again* checking that the PMD entry is as expected.
If a THP collapse (or similar) were to occur then the lock on both pages would
be acquired, so we can ensure this is prevented while the PTE lock is held.
Installing entries this way ensures mutual exclusion on write.
Page table freeing
^^^^^^^^^^^^^^^^^^
Tearing down page tables themselves is something that requires significant
care. There must be no way that page tables designated for removal can be
traversed or referenced by concurrent tasks.
It is insufficient to simply hold an mmap write lock and VMA lock (which will
prevent racing faults, and rmap operations), as a file-backed mapping can be
truncated under the :c:struct:`!struct address_space->i_mmap_rwsem` alone.
As a result, no VMA which can be accessed via the reverse mapping (either
through the :c:struct:`!struct anon_vma->rb_root` or the :c:member:`!struct
address_space->i_mmap` interval trees) can have its page tables torn down.
The operation is typically performed via :c:func:`!free_pgtables`, which assumes
either the mmap write lock has been taken (as specified by its
:c:member:`!mm_wr_locked` parameter), or that the VMA is already unreachable.
It carefully removes the VMA from all reverse mappings, however it's important
that no new ones overlap these or any route remain to permit access to addresses
within the range whose page tables are being torn down.
Additionally, it assumes that a zap has already been performed and steps have
been taken to ensure that no further page table entries can be installed between
the zap and the invocation of :c:func:`!free_pgtables`.
Since it is assumed that all such steps have been taken, page table entries are
cleared without page table locks (in the :c:func:`!pgd_clear`, :c:func:`!p4d_clear`,
:c:func:`!pud_clear`, and :c:func:`!pmd_clear` functions.
.. note:: It is possible for leaf page tables to be torn down independent of
the page tables above it as is done by
:c:func:`!retract_page_tables`, which is performed under the i_mmap
read lock, PMD, and PTE page table locks, without this level of care.
Page table moving
^^^^^^^^^^^^^^^^^
Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
page tables). Most notable of these is :c:func:`!mremap`, which is capable of
moving higher level page tables.
In these instances, it is required that **all** locks are taken, that is
the mmap lock, the VMA lock and the relevant rmap locks.
You can observe this in the :c:func:`!mremap` implementation in the functions
:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the rmap
side of lock acquisition, invoked ultimately by :c:func:`!move_page_tables`.
VMA lock internals
------------------
Overview
^^^^^^^^
VMA read locking is entirely optimistic - if the lock is contended or a competing
write has started, then we do not obtain a read lock.
A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu`, which first
calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
before releasing the RCU lock via :c:func:`!rcu_read_unlock`.
VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
via :c:func:`!vma_end_read`.
VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
VMA is about to be modified, unlike :c:func:`!vma_start_read` the lock is always
acquired. An mmap write lock **must** be held for the duration of the VMA write
lock, releasing or downgrading the mmap write lock also releases the VMA write
lock so there is no :c:func:`!vma_end_write` function.
Note that a semaphore write lock is not held across a VMA lock. Rather, a
sequence number is used for serialisation, and the write semaphore is only
acquired at the point of write lock to update this.
This ensures the semantics we require - VMA write locks provide exclusive write
access to the VMA.
Implementation details
^^^^^^^^^^^^^^^^^^^^^^
The VMA lock mechanism is designed to be a lightweight means of avoiding the use
of the heavily contended mmap lock. It is implemented using a combination of a
read/write semaphore and sequence numbers belonging to the containing
:c:struct:`!struct mm_struct` and the VMA.
Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
operation, i.e. it tries to acquire a read lock but returns false if it is
unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
called to release the VMA read lock.
Invoking :c:func:`!vma_start_read` requires that :c:func:`!rcu_read_lock` has
been called first, establishing that we are in an RCU critical section upon VMA
read lock acquisition. Once acquired, the RCU lock can be released as it is only
required for lookup. This is abstracted by :c:func:`!lock_vma_under_rcu` which
is the interface a user should use.
Writing requires the mmap to be write-locked and the VMA lock to be acquired via
:c:func:`!vma_start_write`, however the write lock is released by the termination or
downgrade of the mmap write lock so no :c:func:`!vma_end_write` is required.
All this is achieved by the use of per-mm and per-VMA sequence counts, which are
used in order to reduce complexity, especially for operations which write-lock
multiple VMAs at once.
If the mm sequence count, :c:member:`!mm->mm_lock_seq` is equal to the VMA
sequence count :c:member:`!vma->vm_lock_seq` then the VMA is write-locked. If
they differ, then it is not.
Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked which
also increments :c:member:`!mm->mm_lock_seq` via
:c:func:`!mm_lock_seqcount_end`.
This way, we ensure that, regardless of the VMA's sequence number, a write lock
is never incorrectly indicated and that when we release an mmap write lock we
efficiently release **all** VMA write locks contained within the mmap at the
same time.
Since the mmap write lock is exclusive against others who hold it, the automatic
release of any VMA locks on its release makes sense, as you would never want to
keep VMAs locked across entirely separate write operations. It also maintains
correct lock ordering.
Each time a VMA read lock is acquired, we acquire a read lock on the
:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
the sequence count of the VMA does not match that of the mm.
If it does, the read lock fails. If it does not, we hold the lock, excluding
writers, but permitting other readers, who will also obtain this lock under RCU.
Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
are also RCU safe, so the whole read lock operation is guaranteed to function
correctly.
On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
read/write semaphore, before setting the VMA's sequence number under this lock,
also simultaneously holding the mmap write lock.
This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
until these are finished and mutual exclusion is achieved.
After setting the VMA's sequence number, the lock is released, avoiding
complexity with a long-term held write lock.
This clever combination of a read/write semaphore and sequence count allows for
fast RCU-based per-VMA lock acquisition (especially on page fault, though
utilised elsewhere) with minimal complexity around lock ordering.
mmap write lock downgrading
---------------------------
When an mmap write lock is held one has exclusive access to resources within the
mmap (with the usual caveats about requiring VMA write locks to avoid races with
tasks holding VMA read locks).
It is then possible to **downgrade** from a write lock to a read lock via
:c:func:`!mmap_write_downgrade` which, similar to :c:func:`!mmap_write_unlock`,
implicitly terminates all VMA write locks via :c:func:`!vma_end_write_all`, but
importantly does not relinquish the mmap lock while downgrading, therefore
keeping the locked virtual address space stable.
An interesting consequence of this is that downgraded locks are exclusive
against any other task possessing a downgraded lock (since a racing task would
have to acquire a write lock first to downgrade it, and the downgraded lock
prevents a new write lock from being obtained until the original lock is
released).
For clarity, we map read (R)/downgraded write (D)/write (W) locks against one
another showing which locks exclude the others:
.. list-table:: Lock exclusivity
:widths: 5 5 5 5
:header-rows: 1
:stub-columns: 1
* -
- R
- D
- W
* - R
- N
- N
- Y
* - D
- N
- Y
- Y
* - W
- Y
- Y
- Y
Here a Y indicates the locks in the matching row/column are mutually exclusive,
and N indicates that they are not.
Stack expansion
---------------
Stack expansion throws up additional complexities in that we cannot permit there
to be racing page faults, as a result we invoke :c:func:`!vma_start_write` to
prevent this in :c:func:`!expand_downwards` or :c:func:`!expand_upwards`.
+31 -29
View File
@@ -22,65 +22,67 @@ definitions:
doc: unused event
-
name: created
doc:
token, family, saddr4 | saddr6, daddr4 | daddr6, sport, dport
doc: >-
A new MPTCP connection has been created. It is the good time to
allocate memory and send ADD_ADDR if needed. Depending on the
traffic-patterns it can take a long time until the
MPTCP_EVENT_ESTABLISHED is sent.
Attributes: token, family, saddr4 | saddr6, daddr4 | daddr6, sport,
dport, server-side.
-
name: established
doc:
token, family, saddr4 | saddr6, daddr4 | daddr6, sport, dport
doc: >-
A MPTCP connection is established (can start new subflows).
Attributes: token, family, saddr4 | saddr6, daddr4 | daddr6, sport,
dport, server-side.
-
name: closed
doc:
token
doc: >-
A MPTCP connection has stopped.
Attribute: token.
-
name: announced
value: 6
doc:
token, rem_id, family, daddr4 | daddr6 [, dport]
doc: >-
A new address has been announced by the peer.
Attributes: token, rem_id, family, daddr4 | daddr6 [, dport].
-
name: removed
doc:
token, rem_id
doc: >-
An address has been lost by the peer.
Attributes: token, rem_id.
-
name: sub-established
value: 10
doc:
token, family, loc_id, rem_id, saddr4 | saddr6, daddr4 | daddr6, sport,
dport, backup, if_idx [, error]
doc: >-
A new subflow has been established. 'error' should not be set.
Attributes: token, family, loc_id, rem_id, saddr4 | saddr6, daddr4 |
daddr6, sport, dport, backup, if_idx [, error].
-
name: sub-closed
doc:
token, family, loc_id, rem_id, saddr4 | saddr6, daddr4 | daddr6, sport,
dport, backup, if_idx [, error]
doc: >-
A subflow has been closed. An error (copy of sk_err) could be set if an
error has been detected for this subflow.
Attributes: token, family, loc_id, rem_id, saddr4 | saddr6, daddr4 |
daddr6, sport, dport, backup, if_idx [, error].
-
name: sub-priority
value: 13
doc:
token, family, loc_id, rem_id, saddr4 | saddr6, daddr4 | daddr6, sport,
dport, backup, if_idx [, error]
doc: >-
The priority of a subflow has changed. 'error' should not be set.
Attributes: token, family, loc_id, rem_id, saddr4 | saddr6, daddr4 |
daddr6, sport, dport, backup, if_idx [, error].
-
name: listener-created
value: 15
doc:
family, sport, saddr4 | saddr6
doc: >-
A new PM listener is created.
Attributes: family, sport, saddr4 | saddr6.
-
name: listener-closed
doc:
family, sport, saddr4 | saddr6
doc: >-
A PM listener is closed.
Attributes: family, sport, saddr4 | saddr6.
attribute-sets:
-
@@ -306,8 +308,8 @@ operations:
attributes:
- addr
-
name: flush-addrs
doc: flush addresses
name: flush-addrs
doc: Flush addresses
attribute-set: endpoint
dont-validate: [ strict ]
flags: [ uns-admin-perm ]
@@ -351,7 +353,7 @@ operations:
- addr-remote
-
name: announce
doc: announce new sf
doc: Announce new address
attribute-set: attr
dont-validate: [ strict ]
flags: [ uns-admin-perm ]
@@ -362,7 +364,7 @@ operations:
- token
-
name: remove
doc: announce removal
doc: Announce removal
attribute-set: attr
dont-validate: [ strict ]
flags: [ uns-admin-perm ]
@@ -373,7 +375,7 @@ operations:
- loc-id
-
name: subflow-create
doc: todo
doc: Create subflow
attribute-set: attr
dont-validate: [ strict ]
flags: [ uns-admin-perm ]
@@ -385,7 +387,7 @@ operations:
- addr-remote
-
name: subflow-destroy
doc: todo
doc: Destroy subflow
attribute-set: attr
dont-validate: [ strict ]
flags: [ uns-admin-perm ]
+6
View File
@@ -2170,6 +2170,12 @@ nexthop_compat_mode - BOOLEAN
understands the new API, this sysctl can be disabled to achieve full
performance benefits of the new API by disabling the nexthop expansion
and extraneous notifications.
Note that as a backward-compatible mode, dumping of modern features
might be incomplete or wrong. For example, resilient groups will not be
shown as such, but rather as just a list of next hops. Also weights that
do not fit into 8 bits will show incorrectly.
Default: true (backward compat mode)
fib_notify_on_flag_change - INTEGER
+3 -1
View File
@@ -347,7 +347,9 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h:
`int pm_runtime_resume_and_get(struct device *dev);`
- run pm_runtime_resume(dev) and if successful, increment the device's
usage counter; return the result of pm_runtime_resume
usage counter; returns 0 on success (whether or not the device's
runtime PM status was already 'active') or the error code from
pm_runtime_resume() on failure.
`int pm_request_idle(struct device *dev);`
- submit a request to execute the subsystem-level idle callback for the
+13 -9
View File
@@ -1797,7 +1797,6 @@ F: include/uapi/linux/if_arcnet.h
ARM AND ARM64 SoC SUB-ARCHITECTURES (COMMON PARTS)
M: Arnd Bergmann <arnd@arndb.de>
M: Olof Johansson <olof@lixom.net>
L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
L: soc@lists.linux.dev
S: Maintained
@@ -3608,6 +3607,7 @@ F: drivers/phy/qualcomm/phy-ath79-usb.c
ATHEROS ATH GENERIC UTILITIES
M: Kalle Valo <kvalo@kernel.org>
M: Jeff Johnson <jjohnson@kernel.org>
L: linux-wireless@vger.kernel.org
S: Supported
F: drivers/net/wireless/ath/*
@@ -3893,7 +3893,7 @@ W: http://www.baycom.org/~tom/ham/ham.html
F: drivers/net/hamradio/baycom*
BCACHE (BLOCK LAYER CACHE)
M: Coly Li <colyli@suse.de>
M: Coly Li <colyli@kernel.org>
M: Kent Overstreet <kent.overstreet@linux.dev>
L: linux-bcache@vger.kernel.org
S: Maintained
@@ -7347,7 +7347,7 @@ F: drivers/gpu/drm/panel/panel-novatek-nt36672a.c
DRM DRIVER FOR NVIDIA GEFORCE/QUADRO GPUS
M: Karol Herbst <kherbst@redhat.com>
M: Lyude Paul <lyude@redhat.com>
M: Danilo Krummrich <dakr@redhat.com>
M: Danilo Krummrich <dakr@kernel.org>
L: dri-devel@lists.freedesktop.org
L: nouveau@lists.freedesktop.org
S: Supported
@@ -8453,7 +8453,7 @@ F: include/video/s1d13xxxfb.h
EROFS FILE SYSTEM
M: Gao Xiang <xiang@kernel.org>
M: Chao Yu <chao@kernel.org>
R: Yue Hu <huyue2@coolpad.com>
R: Yue Hu <zbestahu@gmail.com>
R: Jeffle Xu <jefflexu@linux.alibaba.com>
R: Sandeep Dhavale <dhavale@google.com>
L: linux-erofs@lists.ozlabs.org
@@ -8924,7 +8924,7 @@ F: include/linux/arm_ffa.h
FIRMWARE LOADER (request_firmware)
M: Luis Chamberlain <mcgrof@kernel.org>
M: Russ Weight <russ.weight@linux.dev>
M: Danilo Krummrich <dakr@redhat.com>
M: Danilo Krummrich <dakr@kernel.org>
L: linux-kernel@vger.kernel.org
S: Maintained
F: Documentation/firmware_class/
@@ -14756,7 +14756,7 @@ F: drivers/memory/mtk-smi.c
F: include/soc/mediatek/smi.h
MEDIATEK SWITCH DRIVER
M: Arınç ÜNAL <arinc.unal@arinc9.com>
M: Chester A. Unal <chester.a.unal@arinc9.com>
M: Daniel Golle <daniel@makrotopia.org>
M: DENG Qingfang <dqfext@gmail.com>
M: Sean Wang <sean.wang@mediatek.com>
@@ -15345,7 +15345,7 @@ M: Daniel Machon <daniel.machon@microchip.com>
M: UNGLinuxDriver@microchip.com
L: netdev@vger.kernel.org
S: Maintained
F: drivers/net/ethernet/microchip/lan969x/*
F: drivers/net/ethernet/microchip/sparx5/lan969x/*
MICROCHIP LCDFB DRIVER
M: Nicolas Ferre <nicolas.ferre@microchip.com>
@@ -16337,6 +16337,7 @@ F: Documentation/networking/
F: Documentation/networking/net_cachelines/
F: Documentation/process/maintainer-netdev.rst
F: Documentation/userspace-api/netlink/
F: include/linux/ethtool.h
F: include/linux/framer/framer-provider.h
F: include/linux/framer/framer.h
F: include/linux/in.h
@@ -16351,6 +16352,7 @@ F: include/linux/rtnetlink.h
F: include/linux/seq_file_net.h
F: include/linux/skbuff*
F: include/net/
F: include/uapi/linux/ethtool.h
F: include/uapi/linux/genetlink.h
F: include/uapi/linux/hsr_netlink.h
F: include/uapi/linux/in.h
@@ -18458,7 +18460,7 @@ F: Documentation/devicetree/bindings/pinctrl/mediatek,mt8183-pinctrl.yaml
F: drivers/pinctrl/mediatek/
PIN CONTROLLER - MEDIATEK MIPS
M: Arınç ÜNAL <arinc.unal@arinc9.com>
M: Chester A. Unal <chester.a.unal@arinc9.com>
M: Sergio Paracuellos <sergio.paracuellos@gmail.com>
L: linux-mediatek@lists.infradead.org (moderated for non-subscribers)
L: linux-mips@vger.kernel.org
@@ -19502,7 +19504,7 @@ S: Maintained
F: arch/mips/ralink
RALINK MT7621 MIPS ARCHITECTURE
M: Arınç ÜNAL <arinc.unal@arinc9.com>
M: Chester A. Unal <chester.a.unal@arinc9.com>
M: Sergio Paracuellos <sergio.paracuellos@gmail.com>
L: linux-mips@vger.kernel.org
S: Maintained
@@ -20905,6 +20907,8 @@ F: kernel/sched/
SCHEDULER - SCHED_EXT
R: Tejun Heo <tj@kernel.org>
R: David Vernet <void@manifault.com>
R: Andrea Righi <arighi@nvidia.com>
R: Changwoo Min <changwoo@igalia.com>
L: linux-kernel@vger.kernel.org
S: Maintained
W: https://github.com/sched-ext/scx
+1 -1
View File
@@ -2,7 +2,7 @@
VERSION = 6
PATCHLEVEL = 13
SUBLEVEL = 0
EXTRAVERSION = -rc2
EXTRAVERSION = -rc6
NAME = Baby Opossum Posse
# *DOCUMENTATION*
+3 -2
View File
@@ -6,6 +6,7 @@
config ARC
def_bool y
select ARC_TIMERS
select ARCH_HAS_CPU_CACHE_ALIASING
select ARCH_HAS_CACHE_LINE_SIZE
select ARCH_HAS_DEBUG_VM_PGTABLE
select ARCH_HAS_DMA_PREP_COHERENT
@@ -297,7 +298,6 @@ config ARC_PAGE_SIZE_16K
config ARC_PAGE_SIZE_4K
bool "4KB"
select HAVE_PAGE_SIZE_4KB
depends on ARC_MMU_V3 || ARC_MMU_V4
endchoice
@@ -474,7 +474,8 @@ config HIGHMEM
config ARC_HAS_PAE40
bool "Support for the 40-bit Physical Address Extension"
depends on ISA_ARCV2
depends on ARC_MMU_V4
depends on !ARC_PAGE_SIZE_4K
select HIGHMEM
select PHYS_ADDR_T_64BIT
help
+1 -1
View File
@@ -6,7 +6,7 @@
KBUILD_DEFCONFIG := haps_hs_smp_defconfig
ifeq ($(CROSS_COMPILE),)
CROSS_COMPILE := $(call cc-cross-prefix, arc-linux- arceb-linux-)
CROSS_COMPILE := $(call cc-cross-prefix, arc-linux- arceb-linux- arc-linux-gnu-)
endif
cflags-y += -fno-common -pipe -fno-builtin -mmedium-calls -D__linux__
+1 -1
View File
@@ -54,7 +54,7 @@
compatible = "snps,dw-apb-gpio-port";
gpio-controller;
#gpio-cells = <2>;
snps,nr-gpios = <30>;
ngpios = <30>;
reg = <0>;
interrupt-controller;
#interrupt-cells = <2>;
+1 -1
View File
@@ -62,7 +62,7 @@
compatible = "snps,dw-apb-gpio-port";
gpio-controller;
#gpio-cells = <2>;
snps,nr-gpios = <30>;
ngpios = <30>;
reg = <0>;
interrupt-controller;
#interrupt-cells = <2>;
+1 -1
View File
@@ -69,7 +69,7 @@
compatible = "snps,dw-apb-gpio-port";
gpio-controller;
#gpio-cells = <2>;
snps,nr-gpios = <30>;
ngpios = <30>;
reg = <0>;
interrupt-controller;
#interrupt-cells = <2>;
+6 -6
View File
@@ -250,7 +250,7 @@
compatible = "snps,dw-apb-gpio-port";
gpio-controller;
#gpio-cells = <2>;
snps,nr-gpios = <32>;
ngpios = <32>;
reg = <0>;
};
@@ -258,7 +258,7 @@
compatible = "snps,dw-apb-gpio-port";
gpio-controller;
#gpio-cells = <2>;
snps,nr-gpios = <8>;
ngpios = <8>;
reg = <1>;
};
@@ -266,7 +266,7 @@
compatible = "snps,dw-apb-gpio-port";
gpio-controller;
#gpio-cells = <2>;
snps,nr-gpios = <8>;
ngpios = <8>;
reg = <2>;
};
};
@@ -281,7 +281,7 @@
compatible = "snps,dw-apb-gpio-port";
gpio-controller;
#gpio-cells = <2>;
snps,nr-gpios = <30>;
ngpios = <30>;
reg = <0>;
};
@@ -289,7 +289,7 @@
compatible = "snps,dw-apb-gpio-port";
gpio-controller;
#gpio-cells = <2>;
snps,nr-gpios = <10>;
ngpios = <10>;
reg = <1>;
};
@@ -297,7 +297,7 @@
compatible = "snps,dw-apb-gpio-port";
gpio-controller;
#gpio-cells = <2>;
snps,nr-gpios = <8>;
ngpios = <8>;
reg = <2>;
};
};
+1 -1
View File
@@ -308,7 +308,7 @@
compatible = "snps,dw-apb-gpio-port";
gpio-controller;
#gpio-cells = <2>;
snps,nr-gpios = <24>;
ngpios = <24>;
reg = <0>;
};
};
+1 -1
View File
@@ -146,7 +146,7 @@
#ifndef __ASSEMBLY__
#include <soc/arc/aux.h>
#include <soc/arc/arc_aux.h>
/* Helpers */
#define TO_KB(bytes) ((bytes) >> 10)
+8
View File
@@ -0,0 +1,8 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef __ASM_ARC_CACHETYPE_H
#define __ASM_ARC_CACHETYPE_H
#define cpu_dcache_is_aliasing() false
#define cpu_icache_is_aliasing() true
#endif
+1 -1
View File
@@ -48,7 +48,7 @@
\
switch(sizeof((_p_))) { \
case 1: \
_prev_ = (__typeof__(*(ptr)))cmpxchg_emu_u8((volatile u8 *)_p_, (uintptr_t)_o_, (uintptr_t)_n_); \
_prev_ = (__typeof__(*(ptr)))cmpxchg_emu_u8((volatile u8 *__force)_p_, (uintptr_t)_o_, (uintptr_t)_n_); \
break; \
case 4: \
_prev_ = __cmpxchg(_p_, _o_, _n_); \
+1 -1
View File
@@ -9,7 +9,7 @@
#ifndef _ASM_ARC_MMU_ARCV2_H
#define _ASM_ARC_MMU_ARCV2_H
#include <soc/arc/aux.h>
#include <soc/arc/arc_aux.h>
/*
* TLB Management regs
+1 -1
View File
@@ -2916,7 +2916,7 @@ bool check_jmp_32(u32 curr_off, u32 targ_off, u8 cond)
addendum = (cond == ARC_CC_AL) ? 0 : INSN_len_normal;
disp = get_displacement(curr_off + addendum, targ_off);
if (ARC_CC_AL)
if (cond == ARC_CC_AL)
return is_valid_far_disp(disp);
else
return is_valid_near_disp(disp);
+1
View File
@@ -6,6 +6,7 @@ menuconfig ARCH_MXC
select CLKSRC_IMX_GPT
select GENERIC_IRQ_CHIP
select GPIOLIB
select PINCTRL
select PM_OPP if PM
select SOC_BUS
select SRAM
+1 -1
View File
@@ -233,7 +233,7 @@
#interrupt-cells = <0x1>;
compatible = "pci-host-ecam-generic";
device_type = "pci";
bus-range = <0x0 0x1>;
bus-range = <0x0 0xff>;
reg = <0x0 0x40000000 0x0 0x10000000>;
ranges = <0x2000000 0x0 0x50000000 0x0 0x50000000 0x0 0x10000000>;
interrupt-map = <0 0 0 1 &gic 0 0 GIC_SPI 168 IRQ_TYPE_LEVEL_HIGH>,
+4 -4
View File
@@ -67,7 +67,7 @@
l2_cache_l0: l2-cache-l0 {
compatible = "cache";
cache-size = <0x80000>;
cache-line-size = <128>;
cache-line-size = <64>;
cache-sets = <1024>; //512KiB(size)/64(line-size)=8192ways/8-way set
cache-level = <2>;
cache-unified;
@@ -91,7 +91,7 @@
l2_cache_l1: l2-cache-l1 {
compatible = "cache";
cache-size = <0x80000>;
cache-line-size = <128>;
cache-line-size = <64>;
cache-sets = <1024>; //512KiB(size)/64(line-size)=8192ways/8-way set
cache-level = <2>;
cache-unified;
@@ -115,7 +115,7 @@
l2_cache_l2: l2-cache-l2 {
compatible = "cache";
cache-size = <0x80000>;
cache-line-size = <128>;
cache-line-size = <64>;
cache-sets = <1024>; //512KiB(size)/64(line-size)=8192ways/8-way set
cache-level = <2>;
cache-unified;
@@ -139,7 +139,7 @@
l2_cache_l3: l2-cache-l3 {
compatible = "cache";
cache-size = <0x80000>;
cache-line-size = <128>;
cache-line-size = <64>;
cache-sets = <1024>; //512KiB(size)/64(line-size)=8192ways/8-way set
cache-level = <2>;
cache-unified;
+2 -2
View File
@@ -87,7 +87,7 @@
1 << PMSCR_EL2_PA_SHIFT)
msr_s SYS_PMSCR_EL2, x0 // addresses and physical counter
.Lskip_spe_el2_\@:
mov x0, #(MDCR_EL2_E2PB_MASK << MDCR_EL2_E2PB_SHIFT)
mov x0, #MDCR_EL2_E2PB_MASK
orr x2, x2, x0 // If we don't have VHE, then
// use EL1&0 translation.
@@ -100,7 +100,7 @@
and x0, x0, TRBIDR_EL1_P
cbnz x0, .Lskip_trace_\@ // If TRBE is available at EL2
mov x0, #(MDCR_EL2_E2TB_MASK << MDCR_EL2_E2TB_SHIFT)
mov x0, #MDCR_EL2_E2TB_MASK
orr x2, x2, x0 // allow the EL1&0 translation
// to own it.
+2 -2
View File
@@ -114,8 +114,8 @@ SYM_CODE_START_LOCAL(__finalise_el2)
// Use EL2 translations for SPE & TRBE and disable access from EL1
mrs x0, mdcr_el2
bic x0, x0, #(MDCR_EL2_E2PB_MASK << MDCR_EL2_E2PB_SHIFT)
bic x0, x0, #(MDCR_EL2_E2TB_MASK << MDCR_EL2_E2TB_SHIFT)
bic x0, x0, #MDCR_EL2_E2PB_MASK
bic x0, x0, #MDCR_EL2_E2TB_MASK
msr mdcr_el2, x0
// Transfer the MM state from EL1 to EL2
+48 -35
View File
@@ -36,15 +36,8 @@
#include <asm/traps.h>
#include <asm/vdso.h>
#ifdef CONFIG_ARM64_GCS
#define GCS_SIGNAL_CAP(addr) (((unsigned long)addr) & GCS_CAP_ADDR_MASK)
static bool gcs_signal_cap_valid(u64 addr, u64 val)
{
return val == GCS_SIGNAL_CAP(addr);
}
#endif
/*
* Do a signal return; undo the signal stack. These are aligned to 128-bit.
*/
@@ -1062,8 +1055,7 @@ static int restore_sigframe(struct pt_regs *regs,
#ifdef CONFIG_ARM64_GCS
static int gcs_restore_signal(void)
{
unsigned long __user *gcspr_el0;
u64 cap;
u64 gcspr_el0, cap;
int ret;
if (!system_supports_gcs())
@@ -1072,7 +1064,7 @@ static int gcs_restore_signal(void)
if (!(current->thread.gcs_el0_mode & PR_SHADOW_STACK_ENABLE))
return 0;
gcspr_el0 = (unsigned long __user *)read_sysreg_s(SYS_GCSPR_EL0);
gcspr_el0 = read_sysreg_s(SYS_GCSPR_EL0);
/*
* Ensure that any changes to the GCS done via GCS operations
@@ -1087,22 +1079,23 @@ static int gcs_restore_signal(void)
* then faults will be generated on GCS operations - the main
* concern is to protect GCS pages.
*/
ret = copy_from_user(&cap, gcspr_el0, sizeof(cap));
ret = copy_from_user(&cap, (unsigned long __user *)gcspr_el0,
sizeof(cap));
if (ret)
return -EFAULT;
/*
* Check that the cap is the actual GCS before replacing it.
*/
if (!gcs_signal_cap_valid((u64)gcspr_el0, cap))
if (cap != GCS_SIGNAL_CAP(gcspr_el0))
return -EINVAL;
/* Invalidate the token to prevent reuse */
put_user_gcs(0, (__user void*)gcspr_el0, &ret);
put_user_gcs(0, (unsigned long __user *)gcspr_el0, &ret);
if (ret != 0)
return -EFAULT;
write_sysreg_s(gcspr_el0 + 1, SYS_GCSPR_EL0);
write_sysreg_s(gcspr_el0 + 8, SYS_GCSPR_EL0);
return 0;
}
@@ -1421,7 +1414,7 @@ static int get_sigframe(struct rt_sigframe_user_layout *user,
static int gcs_signal_entry(__sigrestore_t sigtramp, struct ksignal *ksig)
{
unsigned long __user *gcspr_el0;
u64 gcspr_el0;
int ret = 0;
if (!system_supports_gcs())
@@ -1434,18 +1427,20 @@ static int gcs_signal_entry(__sigrestore_t sigtramp, struct ksignal *ksig)
* We are entering a signal handler, current register state is
* active.
*/
gcspr_el0 = (unsigned long __user *)read_sysreg_s(SYS_GCSPR_EL0);
gcspr_el0 = read_sysreg_s(SYS_GCSPR_EL0);
/*
* Push a cap and the GCS entry for the trampoline onto the GCS.
*/
put_user_gcs((unsigned long)sigtramp, gcspr_el0 - 2, &ret);
put_user_gcs(GCS_SIGNAL_CAP(gcspr_el0 - 1), gcspr_el0 - 1, &ret);
put_user_gcs((unsigned long)sigtramp,
(unsigned long __user *)(gcspr_el0 - 16), &ret);
put_user_gcs(GCS_SIGNAL_CAP(gcspr_el0 - 8),
(unsigned long __user *)(gcspr_el0 - 8), &ret);
if (ret != 0)
return ret;
gcspr_el0 -= 2;
write_sysreg_s((unsigned long)gcspr_el0, SYS_GCSPR_EL0);
gcspr_el0 -= 16;
write_sysreg_s(gcspr_el0, SYS_GCSPR_EL0);
return 0;
}
@@ -1462,10 +1457,33 @@ static int setup_return(struct pt_regs *regs, struct ksignal *ksig,
struct rt_sigframe_user_layout *user, int usig)
{
__sigrestore_t sigtramp;
int err;
if (ksig->ka.sa.sa_flags & SA_RESTORER)
sigtramp = ksig->ka.sa.sa_restorer;
else
sigtramp = VDSO_SYMBOL(current->mm->context.vdso, sigtramp);
err = gcs_signal_entry(sigtramp, ksig);
if (err)
return err;
/*
* We must not fail from this point onwards. We are going to update
* registers, including SP, in order to invoke the signal handler. If
* we failed and attempted to deliver a nested SIGSEGV to a handler
* after that point, the subsequent sigreturn would end up restoring
* the (partial) state for the original signal handler.
*/
regs->regs[0] = usig;
if (ksig->ka.sa.sa_flags & SA_SIGINFO) {
regs->regs[1] = (unsigned long)&user->sigframe->info;
regs->regs[2] = (unsigned long)&user->sigframe->uc;
}
regs->sp = (unsigned long)user->sigframe;
regs->regs[29] = (unsigned long)&user->next_frame->fp;
regs->regs[30] = (unsigned long)sigtramp;
regs->pc = (unsigned long)ksig->ka.sa.sa_handler;
/*
@@ -1506,14 +1524,7 @@ static int setup_return(struct pt_regs *regs, struct ksignal *ksig,
sme_smstop();
}
if (ksig->ka.sa.sa_flags & SA_RESTORER)
sigtramp = ksig->ka.sa.sa_restorer;
else
sigtramp = VDSO_SYMBOL(current->mm->context.vdso, sigtramp);
regs->regs[30] = (unsigned long)sigtramp;
return gcs_signal_entry(sigtramp, ksig);
return 0;
}
static int setup_rt_frame(int usig, struct ksignal *ksig, sigset_t *set,
@@ -1537,14 +1548,16 @@ static int setup_rt_frame(int usig, struct ksignal *ksig, sigset_t *set,
err |= __save_altstack(&frame->uc.uc_stack, regs->sp);
err |= setup_sigframe(&user, regs, set, &ua_state);
if (err == 0) {
if (ksig->ka.sa.sa_flags & SA_SIGINFO)
err |= copy_siginfo_to_user(&frame->info, &ksig->info);
if (err == 0)
err = setup_return(regs, ksig, &user, usig);
if (ksig->ka.sa.sa_flags & SA_SIGINFO) {
err |= copy_siginfo_to_user(&frame->info, &ksig->info);
regs->regs[1] = (unsigned long)&frame->info;
regs->regs[2] = (unsigned long)&frame->uc;
}
}
/*
* We must not fail if setup_return() succeeded - see comment at the
* beginning of setup_return().
*/
if (err == 0)
set_handler_user_access_state();
+7 -25
View File
@@ -26,7 +26,6 @@ enum kunwind_source {
KUNWIND_SOURCE_CALLER,
KUNWIND_SOURCE_TASK,
KUNWIND_SOURCE_REGS_PC,
KUNWIND_SOURCE_REGS_LR,
};
union unwind_flags {
@@ -138,8 +137,10 @@ kunwind_recover_return_address(struct kunwind_state *state)
orig_pc = ftrace_graph_ret_addr(state->task, &state->graph_idx,
state->common.pc,
(void *)state->common.fp);
if (WARN_ON_ONCE(state->common.pc == orig_pc))
if (state->common.pc == orig_pc) {
WARN_ON_ONCE(state->task == current);
return -EINVAL;
}
state->common.pc = orig_pc;
state->flags.fgraph = 1;
}
@@ -178,23 +179,8 @@ int kunwind_next_regs_pc(struct kunwind_state *state)
state->regs = regs;
state->common.pc = regs->pc;
state->common.fp = regs->regs[29];
state->source = KUNWIND_SOURCE_REGS_PC;
return 0;
}
static __always_inline int
kunwind_next_regs_lr(struct kunwind_state *state)
{
/*
* The stack for the regs was consumed by kunwind_next_regs_pc(), so we
* cannot consume that again here, but we know the regs are safe to
* access.
*/
state->common.pc = state->regs->regs[30];
state->common.fp = state->regs->regs[29];
state->regs = NULL;
state->source = KUNWIND_SOURCE_REGS_LR;
state->source = KUNWIND_SOURCE_REGS_PC;
return 0;
}
@@ -215,12 +201,12 @@ kunwind_next_frame_record_meta(struct kunwind_state *state)
case FRAME_META_TYPE_FINAL:
if (meta == &task_pt_regs(tsk)->stackframe)
return -ENOENT;
WARN_ON_ONCE(1);
WARN_ON_ONCE(tsk == current);
return -EINVAL;
case FRAME_META_TYPE_PT_REGS:
return kunwind_next_regs_pc(state);
default:
WARN_ON_ONCE(1);
WARN_ON_ONCE(tsk == current);
return -EINVAL;
}
}
@@ -274,11 +260,8 @@ kunwind_next(struct kunwind_state *state)
case KUNWIND_SOURCE_FRAME:
case KUNWIND_SOURCE_CALLER:
case KUNWIND_SOURCE_TASK:
case KUNWIND_SOURCE_REGS_LR:
err = kunwind_next_frame_record(state);
break;
case KUNWIND_SOURCE_REGS_PC:
err = kunwind_next_regs_lr(state);
err = kunwind_next_frame_record(state);
break;
default:
err = -EINVAL;
@@ -436,7 +419,6 @@ static const char *state_source_string(const struct kunwind_state *state)
case KUNWIND_SOURCE_CALLER: return "C";
case KUNWIND_SOURCE_TASK: return "T";
case KUNWIND_SOURCE_REGS_PC: return "P";
case KUNWIND_SOURCE_REGS_LR: return "L";
default: return "U";
}
}
+9 -2
View File
@@ -739,8 +739,15 @@ static u64 compute_par_s12(struct kvm_vcpu *vcpu, u64 s1_par,
final_attr = s1_parattr;
break;
default:
/* MemAttr[2]=0, Device from S2 */
final_attr = s2_memattr & GENMASK(1,0) << 2;
/*
* MemAttr[2]=0, Device from S2.
*
* FWB does not influence the way that stage 1
* memory types and attributes are combined
* with stage 2 Device type and attributes.
*/
final_attr = min(s2_memattr_to_attr(s2_memattr),
s1_parattr);
}
} else {
/* Combination of R_HMNDG, R_TNHFM and R_GQFSF */
+2 -2
View File
@@ -126,7 +126,7 @@ static void pvm_init_traps_aa64dfr0(struct kvm_vcpu *vcpu)
/* Trap SPE */
if (!FIELD_GET(ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_PMSVer), feature_ids)) {
mdcr_set |= MDCR_EL2_TPMS;
mdcr_clear |= MDCR_EL2_E2PB_MASK << MDCR_EL2_E2PB_SHIFT;
mdcr_clear |= MDCR_EL2_E2PB_MASK;
}
/* Trap Trace Filter */
@@ -143,7 +143,7 @@ static void pvm_init_traps_aa64dfr0(struct kvm_vcpu *vcpu)
/* Trap External Trace */
if (!FIELD_GET(ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_ExtTrcBuff), feature_ids))
mdcr_clear |= MDCR_EL2_E2TB_MASK << MDCR_EL2_E2TB_SHIFT;
mdcr_clear |= MDCR_EL2_E2TB_MASK;
vcpu->arch.mdcr_el2 |= mdcr_set;
vcpu->arch.mdcr_el2 &= ~mdcr_clear;
+2 -1
View File
@@ -2618,7 +2618,8 @@ static const struct sys_reg_desc sys_reg_descs[] = {
ID_WRITABLE(ID_AA64MMFR0_EL1, ~(ID_AA64MMFR0_EL1_RES0 |
ID_AA64MMFR0_EL1_TGRAN4_2 |
ID_AA64MMFR0_EL1_TGRAN64_2 |
ID_AA64MMFR0_EL1_TGRAN16_2)),
ID_AA64MMFR0_EL1_TGRAN16_2 |
ID_AA64MMFR0_EL1_ASIDBITS)),
ID_WRITABLE(ID_AA64MMFR1_EL1, ~(ID_AA64MMFR1_EL1_RES0 |
ID_AA64MMFR1_EL1_HCX |
ID_AA64MMFR1_EL1_TWED |
+11 -1
View File
@@ -608,12 +608,22 @@ static void vgic_its_cache_translation(struct kvm *kvm, struct vgic_its *its,
lockdep_assert_held(&its->its_lock);
vgic_get_irq_kref(irq);
old = xa_store(&its->translation_cache, cache_key, irq, GFP_KERNEL_ACCOUNT);
/*
* Put the reference taken on @irq if the store fails. Intentionally do
* not return the error as the translation cache is best effort.
*/
if (xa_is_err(old)) {
vgic_put_irq(kvm, irq);
return;
}
/*
* We could have raced with another CPU caching the same
* translation behind our back, ensure we don't leak a
* reference if that is the case.
*/
old = xa_store(&its->translation_cache, cache_key, irq, GFP_KERNEL_ACCOUNT);
if (old)
vgic_put_irq(kvm, old);
}
+6
View File
@@ -32,3 +32,9 @@ KBUILD_LDFLAGS += $(ldflags-y)
TIR_NAME := r19
KBUILD_CFLAGS += -ffixed-$(TIR_NAME) -DTHREADINFO_REG=$(TIR_NAME) -D__linux__
KBUILD_AFLAGS += -DTHREADINFO_REG=$(TIR_NAME)
# Disable HexagonConstExtenders pass for LLVM versions prior to 19.1.0
# https://github.com/llvm/llvm-project/issues/99714
ifneq ($(call clang-min-version, 190100),y)
KBUILD_CFLAGS += -mllvm -hexagon-cext=false
endif
+5 -5
View File
@@ -143,11 +143,11 @@ static int show_cpuinfo(struct seq_file *m, void *v)
" DIV:\t\t%s\n"
" BMX:\t\t%s\n"
" CDX:\t\t%s\n",
cpuinfo.has_mul ? "yes" : "no",
cpuinfo.has_mulx ? "yes" : "no",
cpuinfo.has_div ? "yes" : "no",
cpuinfo.has_bmx ? "yes" : "no",
cpuinfo.has_cdx ? "yes" : "no");
str_yes_no(cpuinfo.has_mul),
str_yes_no(cpuinfo.has_mulx),
str_yes_no(cpuinfo.has_div),
str_yes_no(cpuinfo.has_bmx),
str_yes_no(cpuinfo.has_cdx));
seq_printf(m,
"Icache:\t\t%ukB, line length: %u\n",
+2
View File
@@ -239,6 +239,8 @@ handler: ;\
/* =====================================================[ exceptions] === */
__REF
/* ---[ 0x100: RESET exception ]----------------------------------------- */
EXCEPTION_ENTRY(_tng_kernel_start)
+17 -15
View File
@@ -26,15 +26,15 @@
#include <asm/asm-offsets.h>
#include <linux/of_fdt.h>
#define tophys(rd,rs) \
l.movhi rd,hi(-KERNELBASE) ;\
#define tophys(rd,rs) \
l.movhi rd,hi(-KERNELBASE) ;\
l.add rd,rd,rs
#define CLEAR_GPR(gpr) \
#define CLEAR_GPR(gpr) \
l.movhi gpr,0x0
#define LOAD_SYMBOL_2_GPR(gpr,symbol) \
l.movhi gpr,hi(symbol) ;\
#define LOAD_SYMBOL_2_GPR(gpr,symbol) \
l.movhi gpr,hi(symbol) ;\
l.ori gpr,gpr,lo(symbol)
@@ -326,21 +326,21 @@
l.addi r1,r1,-(INT_FRAME_SIZE) ;\
/* r1 is KSP, r30 is __pa(KSP) */ ;\
tophys (r30,r1) ;\
l.sw PT_GPR12(r30),r12 ;\
l.sw PT_GPR12(r30),r12 ;\
l.mfspr r12,r0,SPR_EPCR_BASE ;\
l.sw PT_PC(r30),r12 ;\
l.mfspr r12,r0,SPR_ESR_BASE ;\
l.sw PT_SR(r30),r12 ;\
/* save r31 */ ;\
EXCEPTION_T_LOAD_GPR30(r12) ;\
l.sw PT_GPR30(r30),r12 ;\
l.sw PT_GPR30(r30),r12 ;\
/* save r10 as was prior to exception */ ;\
EXCEPTION_T_LOAD_GPR10(r12) ;\
l.sw PT_GPR10(r30),r12 ;\
/* save PT_SP as was prior to exception */ ;\
l.sw PT_GPR10(r30),r12 ;\
/* save PT_SP as was prior to exception */ ;\
EXCEPTION_T_LOAD_SP(r12) ;\
l.sw PT_SP(r30),r12 ;\
l.sw PT_GPR13(r30),r13 ;\
l.sw PT_GPR13(r30),r13 ;\
/* --> */ ;\
/* save exception r4, set r4 = EA */ ;\
l.sw PT_GPR4(r30),r4 ;\
@@ -357,6 +357,8 @@
/* =====================================================[ exceptions] === */
__HEAD
/* ---[ 0x100: RESET exception ]----------------------------------------- */
.org 0x100
/* Jump to .init code at _start which lives in the .head section
@@ -394,7 +396,7 @@ _dispatch_do_ipage_fault:
.org 0x500
EXCEPTION_HANDLE(_timer_handler)
/* ---[ 0x600: Alignment exception ]-------------------------------------- */
/* ---[ 0x600: Alignment exception ]------------------------------------- */
.org 0x600
EXCEPTION_HANDLE(_alignment_handler)
@@ -424,7 +426,7 @@ _dispatch_do_ipage_fault:
.org 0xc00
EXCEPTION_HANDLE(_sys_call_handler)
/* ---[ 0xd00: Floating point exception ]--------------------------------- */
/* ---[ 0xd00: Floating point exception ]-------------------------------- */
.org 0xd00
EXCEPTION_HANDLE(_fpe_trap_handler)
@@ -506,10 +508,10 @@ _dispatch_do_ipage_fault:
/* .text*/
/* This early stuff belongs in HEAD, but some of the functions below definitely
/* This early stuff belongs in the .init.text section, but some of the functions below definitely
* don't... */
__HEAD
__INIT
.global _start
_start:
/* Init r0 to zero as per spec */
@@ -816,7 +818,7 @@ secondary_start:
#endif
/* ========================================[ cache ]=== */
/* ==========================================================[ cache ]=== */
/* alignment here so we don't change memory offsets with
* memory controller defined
+1 -2
View File
@@ -50,6 +50,7 @@ SECTIONS
.text : AT(ADDR(.text) - LOAD_OFFSET)
{
_stext = .;
HEAD_TEXT
TEXT_TEXT
SCHED_TEXT
LOCK_TEXT
@@ -83,8 +84,6 @@ SECTIONS
. = ALIGN(PAGE_SIZE);
__init_begin = .;
HEAD_TEXT_SECTION
/* Page aligned */
INIT_TEXT_SECTION(PAGE_SIZE)
+1
View File
@@ -208,6 +208,7 @@ CONFIG_FB_ATY=y
CONFIG_FB_ATY_CT=y
CONFIG_FB_ATY_GX=y
CONFIG_FB_3DFX=y
CONFIG_BACKLIGHT_CLASS_DEVICE=y
# CONFIG_VGA_CONSOLE is not set
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_LOGO=y
+1
View File
@@ -716,6 +716,7 @@ CONFIG_FB_TRIDENT=m
CONFIG_FB_SM501=m
CONFIG_FB_IBM_GXT4500=y
CONFIG_LCD_PLATFORM=m
CONFIG_BACKLIGHT_CLASS_DEVICE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y
CONFIG_LOGO=y
+36
View File
@@ -464,7 +464,43 @@ static vm_fault_t vas_mmap_fault(struct vm_fault *vmf)
return VM_FAULT_SIGBUS;
}
/*
* During mmap() paste address, mapping VMA is saved in VAS window
* struct which is used to unmap during migration if the window is
* still open. But the user space can remove this mapping with
* munmap() before closing the window and the VMA address will
* be invalid. Set VAS window VMA to NULL in this function which
* is called before VMA free.
*/
static void vas_mmap_close(struct vm_area_struct *vma)
{
struct file *fp = vma->vm_file;
struct coproc_instance *cp_inst = fp->private_data;
struct vas_window *txwin;
/* Should not happen */
if (!cp_inst || !cp_inst->txwin) {
pr_err("No attached VAS window for the paste address mmap\n");
return;
}
txwin = cp_inst->txwin;
/*
* task_ref.vma is set in coproc_mmap() during mmap paste
* address. So it has to be the same VMA that is getting freed.
*/
if (WARN_ON(txwin->task_ref.vma != vma)) {
pr_err("Invalid paste address mmaping\n");
return;
}
mutex_lock(&txwin->task_ref.mmap_mutex);
txwin->task_ref.vma = NULL;
mutex_unlock(&txwin->task_ref.mmap_mutex);
}
static const struct vm_operations_struct vas_vm_ops = {
.close = vas_mmap_close,
.fault = vas_mmap_fault,
};
+3 -1
View File
@@ -22,7 +22,9 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
else
set_pte(pte, __pte(pte_val(ptep_get(pte)) | _PAGE_PRESENT));
flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
preempt_disable();
local_flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
preempt_enable();
return true;
}
+9 -3
View File
@@ -36,9 +36,15 @@ bool arch_jump_label_transform_queue(struct jump_entry *entry,
insn = RISCV_INSN_NOP;
}
mutex_lock(&text_mutex);
patch_insn_write(addr, &insn, sizeof(insn));
mutex_unlock(&text_mutex);
if (early_boot_irqs_disabled) {
riscv_patch_in_stop_machine = 1;
patch_insn_write(addr, &insn, sizeof(insn));
riscv_patch_in_stop_machine = 0;
} else {
mutex_lock(&text_mutex);
patch_insn_write(addr, &insn, sizeof(insn));
mutex_unlock(&text_mutex);
}
return true;
}
+1 -1
View File
@@ -227,7 +227,7 @@ static void __init init_resources(void)
static void __init parse_dtb(void)
{
/* Early scan of device tree from init memory */
if (early_init_dt_scan(dtb_early_va, __pa(dtb_early_va))) {
if (early_init_dt_scan(dtb_early_va, dtb_early_pa)) {
const char *name = of_flat_dt_get_machine_name();
if (name) {
+1 -1
View File
@@ -590,7 +590,7 @@ void kvm_riscv_aia_enable(void)
csr_set(CSR_HIE, BIT(IRQ_S_GEXT));
/* Enable IRQ filtering for overflow interrupt only if sscofpmf is present */
if (__riscv_isa_extension_available(NULL, RISCV_ISA_EXT_SSCOFPMF))
csr_write(CSR_HVIEN, BIT(IRQ_PMU_OVF));
csr_set(CSR_HVIEN, BIT(IRQ_PMU_OVF));
}
void kvm_riscv_aia_disable(void)
+4 -3
View File
@@ -1566,7 +1566,7 @@ static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
pmd_clear(pmd);
}
static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud, bool is_vmemmap)
{
struct page *page = pud_page(*pud);
struct ptdesc *ptdesc = page_ptdesc(page);
@@ -1579,7 +1579,8 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
return;
}
pagetable_pmd_dtor(ptdesc);
if (!is_vmemmap)
pagetable_pmd_dtor(ptdesc);
if (PageReserved(page))
free_reserved_page(page);
else
@@ -1703,7 +1704,7 @@ static void __meminit remove_pud_mapping(pud_t *pud_base, unsigned long addr, un
remove_pmd_mapping(pmd_base, addr, next, is_vmemmap, altmap);
if (pgtable_l4_enabled)
free_pmd_table(pmd_base, pudp);
free_pmd_table(pmd_base, pudp, is_vmemmap);
}
}
+2
View File
@@ -234,6 +234,8 @@ static unsigned long get_vmem_size(unsigned long identity_size,
vsize = round_up(SZ_2G + max_mappable, rte_size) +
round_up(vmemmap_size, rte_size) +
FIXMAP_SIZE + MODULES_LEN + KASLR_LEN;
if (IS_ENABLED(CONFIG_KMSAN))
vsize += MODULES_LEN * 2;
return size_add(vsize, vmalloc_size);
}
+3 -3
View File
@@ -306,7 +306,7 @@ static void pgtable_pte_populate(pmd_t *pmd, unsigned long addr, unsigned long e
pages++;
}
}
if (mode == POPULATE_DIRECT)
if (mode == POPULATE_IDENTITY)
update_page_count(PG_DIRECT_MAP_4K, pages);
}
@@ -339,7 +339,7 @@ static void pgtable_pmd_populate(pud_t *pud, unsigned long addr, unsigned long e
}
pgtable_pte_populate(pmd, addr, next, mode);
}
if (mode == POPULATE_DIRECT)
if (mode == POPULATE_IDENTITY)
update_page_count(PG_DIRECT_MAP_1M, pages);
}
@@ -372,7 +372,7 @@ static void pgtable_pud_populate(p4d_t *p4d, unsigned long addr, unsigned long e
}
pgtable_pmd_populate(pud, addr, next, mode);
}
if (mode == POPULATE_DIRECT)
if (mode == POPULATE_IDENTITY)
update_page_count(PG_DIRECT_MAP_2G, pages);
}
+1 -1
View File
@@ -270,7 +270,7 @@ static ssize_t sys_##_prefix##_##_name##_store(struct kobject *kobj, \
if (len >= sizeof(_value)) \
return -E2BIG; \
len = strscpy(_value, buf, sizeof(_value)); \
if (len < 0) \
if ((ssize_t)len < 0) \
return len; \
strim(_value); \
return len; \
+12 -1
View File
@@ -429,6 +429,16 @@ static struct event_constraint intel_lnc_event_constraints[] = {
EVENT_CONSTRAINT_END
};
static struct extra_reg intel_lnc_extra_regs[] __read_mostly = {
INTEL_UEVENT_EXTRA_REG(0x012a, MSR_OFFCORE_RSP_0, 0xfffffffffffull, RSP_0),
INTEL_UEVENT_EXTRA_REG(0x012b, MSR_OFFCORE_RSP_1, 0xfffffffffffull, RSP_1),
INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(0x01cd),
INTEL_UEVENT_EXTRA_REG(0x02c6, MSR_PEBS_FRONTEND, 0x9, FE),
INTEL_UEVENT_EXTRA_REG(0x03c6, MSR_PEBS_FRONTEND, 0x7fff1f, FE),
INTEL_UEVENT_EXTRA_REG(0x40ad, MSR_PEBS_FRONTEND, 0xf, FE),
INTEL_UEVENT_EXTRA_REG(0x04c2, MSR_PEBS_FRONTEND, 0x8, FE),
EVENT_EXTRA_END
};
EVENT_ATTR_STR(mem-loads, mem_ld_nhm, "event=0x0b,umask=0x10,ldlat=3");
EVENT_ATTR_STR(mem-loads, mem_ld_snb, "event=0xcd,umask=0x1,ldlat=3");
@@ -6422,7 +6432,7 @@ static __always_inline void intel_pmu_init_lnc(struct pmu *pmu)
intel_pmu_init_glc(pmu);
hybrid(pmu, event_constraints) = intel_lnc_event_constraints;
hybrid(pmu, pebs_constraints) = intel_lnc_pebs_event_constraints;
hybrid(pmu, extra_regs) = intel_rwc_extra_regs;
hybrid(pmu, extra_regs) = intel_lnc_extra_regs;
}
static __always_inline void intel_pmu_init_skt(struct pmu *pmu)
@@ -7135,6 +7145,7 @@ __init int intel_pmu_init(void)
case INTEL_METEORLAKE:
case INTEL_METEORLAKE_L:
case INTEL_ARROWLAKE_U:
intel_pmu_init_hybrid(hybrid_big_small);
x86_pmu.pebs_latency_data = cmt_latency_data;
+2 -1
View File
@@ -1489,7 +1489,7 @@ void intel_pmu_pebs_enable(struct perf_event *event)
* hence we need to drain when changing said
* size.
*/
intel_pmu_drain_large_pebs(cpuc);
intel_pmu_drain_pebs_buffer();
adaptive_pebs_record_size_update();
wrmsrl(MSR_PEBS_DATA_CFG, pebs_data_cfg);
cpuc->active_pebs_data_cfg = pebs_data_cfg;
@@ -2517,6 +2517,7 @@ void __init intel_ds_init(void)
x86_pmu.large_pebs_flags |= PERF_SAMPLE_TIME;
break;
case 6:
case 5:
x86_pmu.pebs_ept = 1;
fallthrough;
+1
View File
@@ -1910,6 +1910,7 @@ static const struct x86_cpu_id intel_uncore_match[] __initconst = {
X86_MATCH_VFM(INTEL_ATOM_GRACEMONT, &adl_uncore_init),
X86_MATCH_VFM(INTEL_ATOM_CRESTMONT_X, &gnr_uncore_init),
X86_MATCH_VFM(INTEL_ATOM_CRESTMONT, &gnr_uncore_init),
X86_MATCH_VFM(INTEL_ATOM_DARKMONT_X, &gnr_uncore_init),
{},
};
MODULE_DEVICE_TABLE(x86cpu, intel_uncore_match);
+1
View File
@@ -452,6 +452,7 @@
#define X86_FEATURE_SME_COHERENT (19*32+10) /* AMD hardware-enforced cache coherency */
#define X86_FEATURE_DEBUG_SWAP (19*32+14) /* "debug_swap" AMD SEV-ES full debug state swap support */
#define X86_FEATURE_SVSM (19*32+28) /* "svsm" SVSM present */
#define X86_FEATURE_HV_INUSE_WR_ALLOWED (19*32+30) /* Allow Write to in-use hypervisor-owned pages */
/* AMD-defined Extended Feature 2 EAX, CPUID level 0x80000021 (EAX), word 20 */
#define X86_FEATURE_NO_NESTED_DATA_BP (20*32+ 0) /* No Nested Data Breakpoints */
+2
View File
@@ -230,6 +230,8 @@ static inline unsigned long long l1tf_pfn_limit(void)
return BIT_ULL(boot_cpu_data.x86_cache_bits - 1 - PAGE_SHIFT);
}
void init_cpu_devs(void);
void get_cpu_vendor(struct cpuinfo_x86 *c);
extern void early_cpu_init(void);
extern void identify_secondary_cpu(struct cpuinfo_x86 *);
extern void print_cpu_info(struct cpuinfo_x86 *);
+15
View File
@@ -65,4 +65,19 @@
extern bool __static_call_fixup(void *tramp, u8 op, void *dest);
extern void __static_call_update_early(void *tramp, void *func);
#define static_call_update_early(name, _func) \
({ \
typeof(&STATIC_CALL_TRAMP(name)) __F = (_func); \
if (static_call_initialized) { \
__static_call_update(&STATIC_CALL_KEY(name), \
STATIC_CALL_TRAMP_ADDR(name), __F);\
} else { \
WRITE_ONCE(STATIC_CALL_KEY(name).func, _func); \
__static_call_update_early(STATIC_CALL_TRAMP_ADDR(name),\
__F); \
} \
})
#endif /* _ASM_STATIC_CALL_H */
+3 -3
View File
@@ -8,7 +8,7 @@
#include <asm/special_insns.h>
#ifdef CONFIG_X86_32
static inline void iret_to_self(void)
static __always_inline void iret_to_self(void)
{
asm volatile (
"pushfl\n\t"
@@ -19,7 +19,7 @@ static inline void iret_to_self(void)
: ASM_CALL_CONSTRAINT : : "memory");
}
#else
static inline void iret_to_self(void)
static __always_inline void iret_to_self(void)
{
unsigned int tmp;
@@ -55,7 +55,7 @@ static inline void iret_to_self(void)
* Like all of Linux's memory ordering operations, this is a
* compiler barrier as well.
*/
static inline void sync_core(void)
static __always_inline void sync_core(void)
{
/*
* The SERIALIZE instruction is the most straightforward way to
+22 -14
View File
@@ -39,9 +39,11 @@
#include <linux/string.h>
#include <linux/types.h>
#include <linux/pgtable.h>
#include <linux/instrumentation.h>
#include <trace/events/xen.h>
#include <asm/alternative.h>
#include <asm/page.h>
#include <asm/smap.h>
#include <asm/nospec-branch.h>
@@ -86,11 +88,20 @@ struct xen_dm_op_buf;
* there aren't more than 5 arguments...)
*/
extern struct { char _entry[32]; } hypercall_page[];
void xen_hypercall_func(void);
DECLARE_STATIC_CALL(xen_hypercall, xen_hypercall_func);
#define __HYPERCALL "call hypercall_page+%c[offset]"
#define __HYPERCALL_ENTRY(x) \
[offset] "i" (__HYPERVISOR_##x * sizeof(hypercall_page[0]))
#ifdef MODULE
#define __ADDRESSABLE_xen_hypercall
#else
#define __ADDRESSABLE_xen_hypercall __ADDRESSABLE_ASM_STR(__SCK__xen_hypercall)
#endif
#define __HYPERCALL \
__ADDRESSABLE_xen_hypercall \
"call __SCT__xen_hypercall"
#define __HYPERCALL_ENTRY(x) "a" (x)
#ifdef CONFIG_X86_32
#define __HYPERCALL_RETREG "eax"
@@ -148,7 +159,7 @@ extern struct { char _entry[32]; } hypercall_page[];
__HYPERCALL_0ARG(); \
asm volatile (__HYPERCALL \
: __HYPERCALL_0PARAM \
: __HYPERCALL_ENTRY(name) \
: __HYPERCALL_ENTRY(__HYPERVISOR_ ## name) \
: __HYPERCALL_CLOBBER0); \
(type)__res; \
})
@@ -159,7 +170,7 @@ extern struct { char _entry[32]; } hypercall_page[];
__HYPERCALL_1ARG(a1); \
asm volatile (__HYPERCALL \
: __HYPERCALL_1PARAM \
: __HYPERCALL_ENTRY(name) \
: __HYPERCALL_ENTRY(__HYPERVISOR_ ## name) \
: __HYPERCALL_CLOBBER1); \
(type)__res; \
})
@@ -170,7 +181,7 @@ extern struct { char _entry[32]; } hypercall_page[];
__HYPERCALL_2ARG(a1, a2); \
asm volatile (__HYPERCALL \
: __HYPERCALL_2PARAM \
: __HYPERCALL_ENTRY(name) \
: __HYPERCALL_ENTRY(__HYPERVISOR_ ## name) \
: __HYPERCALL_CLOBBER2); \
(type)__res; \
})
@@ -181,7 +192,7 @@ extern struct { char _entry[32]; } hypercall_page[];
__HYPERCALL_3ARG(a1, a2, a3); \
asm volatile (__HYPERCALL \
: __HYPERCALL_3PARAM \
: __HYPERCALL_ENTRY(name) \
: __HYPERCALL_ENTRY(__HYPERVISOR_ ## name) \
: __HYPERCALL_CLOBBER3); \
(type)__res; \
})
@@ -192,7 +203,7 @@ extern struct { char _entry[32]; } hypercall_page[];
__HYPERCALL_4ARG(a1, a2, a3, a4); \
asm volatile (__HYPERCALL \
: __HYPERCALL_4PARAM \
: __HYPERCALL_ENTRY(name) \
: __HYPERCALL_ENTRY(__HYPERVISOR_ ## name) \
: __HYPERCALL_CLOBBER4); \
(type)__res; \
})
@@ -206,12 +217,9 @@ xen_single_call(unsigned int call,
__HYPERCALL_DECLS;
__HYPERCALL_5ARG(a1, a2, a3, a4, a5);
if (call >= PAGE_SIZE / sizeof(hypercall_page[0]))
return -EINVAL;
asm volatile(CALL_NOSPEC
asm volatile(__HYPERCALL
: __HYPERCALL_5PARAM
: [thunk_target] "a" (&hypercall_page[call])
: __HYPERCALL_ENTRY(call)
: __HYPERCALL_CLOBBER5);
return (long)__res;
-5
View File
@@ -142,11 +142,6 @@ static bool skip_addr(void *dest)
if (dest >= (void *)relocate_kernel &&
dest < (void*)relocate_kernel + KEXEC_CONTROL_CODE_MAX_SIZE)
return true;
#endif
#ifdef CONFIG_XEN
if (dest >= (void *)hypercall_page &&
dest < (void*)hypercall_page + PAGE_SIZE)
return true;
#endif
return false;
}
+30
View File
@@ -81,6 +81,34 @@ static void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code)
static __ro_after_init bool ibt_fatal = true;
/*
* By definition, all missing-ENDBRANCH #CPs are a result of WFE && !ENDBR.
*
* For the kernel IBT no ENDBR selftest where #CPs are deliberately triggered,
* the WFE state of the interrupted context needs to be cleared to let execution
* continue. Otherwise when the CPU resumes from the instruction that just
* caused the previous #CP, another missing-ENDBRANCH #CP is raised and the CPU
* enters a dead loop.
*
* This is not a problem with IDT because it doesn't preserve WFE and IRET doesn't
* set WFE. But FRED provides space on the entry stack (in an expanded CS area)
* to save and restore the WFE state, thus the WFE state is no longer clobbered,
* so software must clear it.
*/
static void ibt_clear_fred_wfe(struct pt_regs *regs)
{
/*
* No need to do any FRED checks.
*
* For IDT event delivery, the high-order 48 bits of CS are pushed
* as 0s into the stack, and later IRET ignores these bits.
*
* For FRED, a test to check if fred_cs.wfe is set would be dropped
* by compilers.
*/
regs->fred_cs.wfe = 0;
}
static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
{
if ((error_code & CP_EC) != CP_ENDBR) {
@@ -90,6 +118,7 @@ static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
if (unlikely(regs->ip == (unsigned long)&ibt_selftest_noendbr)) {
regs->ax = 0;
ibt_clear_fred_wfe(regs);
return;
}
@@ -97,6 +126,7 @@ static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
if (!ibt_fatal) {
printk(KERN_DEFAULT CUT_HERE);
__warn(__FILE__, __LINE__, (void *)regs->ip, TAINT_WARN, regs, NULL);
ibt_clear_fred_wfe(regs);
return;
}
BUG();
+22 -16
View File
@@ -867,7 +867,7 @@ static void cpu_detect_tlb(struct cpuinfo_x86 *c)
tlb_lld_4m[ENTRIES], tlb_lld_1g[ENTRIES]);
}
static void get_cpu_vendor(struct cpuinfo_x86 *c)
void get_cpu_vendor(struct cpuinfo_x86 *c)
{
char *v = c->x86_vendor_id;
int i;
@@ -1649,15 +1649,11 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
detect_nopl();
}
void __init early_cpu_init(void)
void __init init_cpu_devs(void)
{
const struct cpu_dev *const *cdev;
int count = 0;
#ifdef CONFIG_PROCESSOR_SELECT
pr_info("KERNEL supported cpus:\n");
#endif
for (cdev = __x86_cpu_dev_start; cdev < __x86_cpu_dev_end; cdev++) {
const struct cpu_dev *cpudev = *cdev;
@@ -1665,20 +1661,30 @@ void __init early_cpu_init(void)
break;
cpu_devs[count] = cpudev;
count++;
}
}
void __init early_cpu_init(void)
{
#ifdef CONFIG_PROCESSOR_SELECT
unsigned int i, j;
pr_info("KERNEL supported cpus:\n");
#endif
init_cpu_devs();
#ifdef CONFIG_PROCESSOR_SELECT
{
unsigned int j;
for (j = 0; j < 2; j++) {
if (!cpudev->c_ident[j])
continue;
pr_info(" %s %s\n", cpudev->c_vendor,
cpudev->c_ident[j]);
}
for (i = 0; i < X86_VENDOR_NUM && cpu_devs[i]; i++) {
for (j = 0; j < 2; j++) {
if (!cpu_devs[i]->c_ident[j])
continue;
pr_info(" %s %s\n", cpu_devs[i]->c_vendor,
cpu_devs[i]->c_ident[j]);
}
#endif
}
#endif
early_identify_cpu(&boot_cpu_data);
}
+58
View File
@@ -223,6 +223,63 @@ static void hv_machine_crash_shutdown(struct pt_regs *regs)
hyperv_cleanup();
}
#endif /* CONFIG_CRASH_DUMP */
static u64 hv_ref_counter_at_suspend;
static void (*old_save_sched_clock_state)(void);
static void (*old_restore_sched_clock_state)(void);
/*
* Hyper-V clock counter resets during hibernation. Save and restore clock
* offset during suspend/resume, while also considering the time passed
* before suspend. This is to make sure that sched_clock using hv tsc page
* based clocksource, proceeds from where it left off during suspend and
* it shows correct time for the timestamps of kernel messages after resume.
*/
static void save_hv_clock_tsc_state(void)
{
hv_ref_counter_at_suspend = hv_read_reference_counter();
}
static void restore_hv_clock_tsc_state(void)
{
/*
* Adjust the offsets used by hv tsc clocksource to
* account for the time spent before hibernation.
* adjusted value = reference counter (time) at suspend
* - reference counter (time) now.
*/
hv_adj_sched_clock_offset(hv_ref_counter_at_suspend - hv_read_reference_counter());
}
/*
* Functions to override save_sched_clock_state and restore_sched_clock_state
* functions of x86_platform. The Hyper-V clock counter is reset during
* suspend-resume and the offset used to measure time needs to be
* corrected, post resume.
*/
static void hv_save_sched_clock_state(void)
{
old_save_sched_clock_state();
save_hv_clock_tsc_state();
}
static void hv_restore_sched_clock_state(void)
{
restore_hv_clock_tsc_state();
old_restore_sched_clock_state();
}
static void __init x86_setup_ops_for_tsc_pg_clock(void)
{
if (!(ms_hyperv.features & HV_MSR_REFERENCE_TSC_AVAILABLE))
return;
old_save_sched_clock_state = x86_platform.save_sched_clock_state;
x86_platform.save_sched_clock_state = hv_save_sched_clock_state;
old_restore_sched_clock_state = x86_platform.restore_sched_clock_state;
x86_platform.restore_sched_clock_state = hv_restore_sched_clock_state;
}
#endif /* CONFIG_HYPERV */
static uint32_t __init ms_hyperv_platform(void)
@@ -579,6 +636,7 @@ static void __init ms_hyperv_init_platform(void)
/* Register Hyper-V specific clocksource */
hv_init_clocksource();
x86_setup_ops_for_tsc_pg_clock();
hv_vtl_init_platform();
#endif
/*
+1
View File
@@ -13,6 +13,7 @@
#include <asm/pgtable_types.h>
#include <asm/nospec-branch.h>
#include <asm/unwind_hints.h>
#include <asm/asm-offsets.h>
/*
* Must be relocatable PIC code callable as a C function, in particular
+9
View File
@@ -172,6 +172,15 @@ void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
}
EXPORT_SYMBOL_GPL(arch_static_call_transform);
noinstr void __static_call_update_early(void *tramp, void *func)
{
BUG_ON(system_state != SYSTEM_BOOTING);
BUG_ON(!early_boot_irqs_disabled);
BUG_ON(static_call_initialized);
__text_gen_insn(tramp, JMP32_INSN_OPCODE, tramp, func, JMP32_INSN_SIZE);
sync_core();
}
#ifdef CONFIG_MITIGATION_RETHUNK
/*
* This is called by apply_returns() to fix up static call trampolines,
-4
View File
@@ -519,14 +519,10 @@ INIT_PER_CPU(irq_stack_backing_store);
* linker will never mark as relocatable. (Using just ABSOLUTE() is not
* sufficient for that).
*/
#ifdef CONFIG_XEN
#ifdef CONFIG_XEN_PV
xen_elfnote_entry_value =
ABSOLUTE(xen_elfnote_entry) + ABSOLUTE(startup_xen);
#endif
xen_elfnote_hypercall_page_value =
ABSOLUTE(xen_elfnote_hypercall_page) + ABSOLUTE(hypercall_page);
#endif
#ifdef CONFIG_PVH
xen_elfnote_phys32_entry_value =
ABSOLUTE(xen_elfnote_phys32_entry) + ABSOLUTE(pvh_start_xen - LOAD_OFFSET);
+26 -5
View File
@@ -36,6 +36,26 @@
u32 kvm_cpu_caps[NR_KVM_CPU_CAPS] __read_mostly;
EXPORT_SYMBOL_GPL(kvm_cpu_caps);
struct cpuid_xstate_sizes {
u32 eax;
u32 ebx;
u32 ecx;
};
static struct cpuid_xstate_sizes xstate_sizes[XFEATURE_MAX] __ro_after_init;
void __init kvm_init_xstate_sizes(void)
{
u32 ign;
int i;
for (i = XFEATURE_YMM; i < ARRAY_SIZE(xstate_sizes); i++) {
struct cpuid_xstate_sizes *xs = &xstate_sizes[i];
cpuid_count(0xD, i, &xs->eax, &xs->ebx, &xs->ecx, &ign);
}
}
u32 xstate_required_size(u64 xstate_bv, bool compacted)
{
int feature_bit = 0;
@@ -44,14 +64,15 @@ u32 xstate_required_size(u64 xstate_bv, bool compacted)
xstate_bv &= XFEATURE_MASK_EXTEND;
while (xstate_bv) {
if (xstate_bv & 0x1) {
u32 eax, ebx, ecx, edx, offset;
cpuid_count(0xD, feature_bit, &eax, &ebx, &ecx, &edx);
struct cpuid_xstate_sizes *xs = &xstate_sizes[feature_bit];
u32 offset;
/* ECX[1]: 64B alignment in compacted form */
if (compacted)
offset = (ecx & 0x2) ? ALIGN(ret, 64) : ret;
offset = (xs->ecx & 0x2) ? ALIGN(ret, 64) : ret;
else
offset = ebx;
ret = max(ret, offset + eax);
offset = xs->ebx;
ret = max(ret, offset + xs->eax);
}
xstate_bv >>= 1;
+1
View File
@@ -31,6 +31,7 @@ int kvm_vcpu_ioctl_get_cpuid2(struct kvm_vcpu *vcpu,
bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
u32 *ecx, u32 *edx, bool exact_only);
void __init kvm_init_xstate_sizes(void);
u32 xstate_required_size(u64 xstate_bv, bool compacted);
int cpuid_query_maxphyaddr(struct kvm_vcpu *vcpu);
-12
View File
@@ -3364,18 +3364,6 @@ static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu,
return true;
}
static bool is_access_allowed(struct kvm_page_fault *fault, u64 spte)
{
if (fault->exec)
return is_executable_pte(spte);
if (fault->write)
return is_writable_pte(spte);
/* Fault was on Read access */
return spte & PT_PRESENT_MASK;
}
/*
* Returns the last level spte pointer of the shadow page walk for the given
* gpa, and sets *spte to the spte value. This spte may be non-preset. If no
+17
View File
@@ -461,6 +461,23 @@ static inline bool is_mmu_writable_spte(u64 spte)
return spte & shadow_mmu_writable_mask;
}
/*
* Returns true if the access indicated by @fault is allowed by the existing
* SPTE protections. Note, the caller is responsible for checking that the
* SPTE is a shadow-present, leaf SPTE (either before or after).
*/
static inline bool is_access_allowed(struct kvm_page_fault *fault, u64 spte)
{
if (fault->exec)
return is_executable_pte(spte);
if (fault->write)
return is_writable_pte(spte);
/* Fault was on Read access */
return spte & PT_PRESENT_MASK;
}
/*
* If the MMU-writable flag is cleared, i.e. the SPTE is write-protected for
* write-tracking, remote TLBs must be flushed, even if the SPTE was read-only,
+5
View File
@@ -985,6 +985,11 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
if (fault->prefetch && is_shadow_present_pte(iter->old_spte))
return RET_PF_SPURIOUS;
if (is_shadow_present_pte(iter->old_spte) &&
is_access_allowed(fault, iter->old_spte) &&
is_last_spte(iter->old_spte, iter->level))
return RET_PF_SPURIOUS;
if (unlikely(!fault->slot))
new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
else
+6
View File
@@ -1199,6 +1199,12 @@ bool avic_hardware_setup(void)
return false;
}
if (cc_platform_has(CC_ATTR_HOST_SEV_SNP) &&
!boot_cpu_has(X86_FEATURE_HV_INUSE_WR_ALLOWED)) {
pr_warn("AVIC disabled: missing HvInUseWrAllowed on SNP-enabled system\n");
return false;
}
if (boot_cpu_has(X86_FEATURE_AVIC)) {
pr_info("AVIC enabled\n");
} else if (force_avic) {
-9
View File
@@ -3201,15 +3201,6 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
if (data & ~supported_de_cfg)
return 1;
/*
* Don't let the guest change the host-programmed value. The
* MSR is very model specific, i.e. contains multiple bits that
* are completely unknown to KVM, and the one bit known to KVM
* is simply a reflection of hardware capabilities.
*/
if (!msr->host_initiated && data != svm->msr_decfg)
return 1;
svm->msr_decfg = data;
break;
}
+1 -1
View File
@@ -2,7 +2,7 @@
#ifndef __KVM_X86_VMX_POSTED_INTR_H
#define __KVM_X86_VMX_POSTED_INTR_H
#include <linux/find.h>
#include <linux/bitmap.h>
#include <asm/posted_intr.h>
void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu);
+10 -1
View File
@@ -9976,7 +9976,7 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
{
u64 ret = vcpu->run->hypercall.ret;
if (!is_64_bit_mode(vcpu))
if (!is_64_bit_hypercall(vcpu))
ret = (u32)ret;
kvm_rax_write(vcpu, ret);
++vcpu->stat.hypercalls;
@@ -12724,6 +12724,13 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
kvm_hv_init_vm(kvm);
kvm_xen_init_vm(kvm);
if (ignore_msrs && !report_ignored_msrs) {
pr_warn_once("Running KVM with ignore_msrs=1 and report_ignored_msrs=0 is not a\n"
"a supported configuration. Lying to the guest about the existence of MSRs\n"
"may cause the guest operating system to hang or produce errors. If a guest\n"
"does not run without ignore_msrs=1, please report it to kvm@vger.kernel.org.\n");
}
return 0;
out_uninit_mmu:
@@ -13997,6 +14004,8 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_rmp_fault);
static int __init kvm_x86_init(void)
{
kvm_init_xstate_sizes();
kvm_mmu_x86_module_init();
mitigate_smt_rsb &= boot_cpu_has_bug(X86_BUG_SMT_RSB) && cpu_smt_possible();
return 0;
+64 -1
View File
@@ -2,6 +2,7 @@
#include <linux/console.h>
#include <linux/cpu.h>
#include <linux/instrumentation.h>
#include <linux/kexec.h>
#include <linux/memblock.h>
#include <linux/slab.h>
@@ -21,7 +22,8 @@
#include "xen-ops.h"
EXPORT_SYMBOL_GPL(hypercall_page);
DEFINE_STATIC_CALL(xen_hypercall, xen_hypercall_hvm);
EXPORT_STATIC_CALL_TRAMP(xen_hypercall);
/*
* Pointer to the xen_vcpu_info structure or
@@ -68,6 +70,67 @@ EXPORT_SYMBOL(xen_start_flags);
*/
struct shared_info *HYPERVISOR_shared_info = &xen_dummy_shared_info;
static __ref void xen_get_vendor(void)
{
init_cpu_devs();
cpu_detect(&boot_cpu_data);
get_cpu_vendor(&boot_cpu_data);
}
void xen_hypercall_setfunc(void)
{
if (static_call_query(xen_hypercall) != xen_hypercall_hvm)
return;
if ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
boot_cpu_data.x86_vendor == X86_VENDOR_HYGON))
static_call_update(xen_hypercall, xen_hypercall_amd);
else
static_call_update(xen_hypercall, xen_hypercall_intel);
}
/*
* Evaluate processor vendor in order to select the correct hypercall
* function for HVM/PVH guests.
* Might be called very early in boot before vendor has been set by
* early_cpu_init().
*/
noinstr void *__xen_hypercall_setfunc(void)
{
void (*func)(void);
/*
* Xen is supported only on CPUs with CPUID, so testing for
* X86_FEATURE_CPUID is a test for early_cpu_init() having been
* run.
*
* Note that __xen_hypercall_setfunc() is noinstr only due to a nasty
* dependency chain: it is being called via the xen_hypercall static
* call when running as a PVH or HVM guest. Hypercalls need to be
* noinstr due to PV guests using hypercalls in noinstr code. So we
* can safely tag the function body as "instrumentation ok", since
* the PV guest requirement is not of interest here (xen_get_vendor()
* calls noinstr functions, and static_call_update_early() might do
* so, too).
*/
instrumentation_begin();
if (!boot_cpu_has(X86_FEATURE_CPUID))
xen_get_vendor();
if ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
boot_cpu_data.x86_vendor == X86_VENDOR_HYGON))
func = xen_hypercall_amd;
else
func = xen_hypercall_intel;
static_call_update_early(xen_hypercall, func);
instrumentation_end();
return func;
}
static int xen_cpu_up_online(unsigned int cpu)
{
xen_init_lock_cpu(cpu);
+5 -8
View File
@@ -106,15 +106,8 @@ static void __init init_hvm_pv_info(void)
/* PVH set up hypercall page in xen_prepare_pvh(). */
if (xen_pvh_domain())
pv_info.name = "Xen PVH";
else {
u64 pfn;
uint32_t msr;
else
pv_info.name = "Xen HVM";
msr = cpuid_ebx(base + 2);
pfn = __pa(hypercall_page);
wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
}
xen_setup_features();
@@ -300,6 +293,10 @@ static uint32_t __init xen_platform_hvm(void)
if (xen_pv_domain())
return 0;
/* Set correct hypercall function. */
if (xen_domain)
xen_hypercall_setfunc();
if (xen_pvh_domain() && nopv) {
/* Guest booting via the Xen-PVH boot entry goes here */
pr_info("\"nopv\" parameter is ignored in PVH guest\n");
+3 -1
View File
@@ -1341,6 +1341,9 @@ asmlinkage __visible void __init xen_start_kernel(struct start_info *si)
xen_domain_type = XEN_PV_DOMAIN;
xen_start_flags = xen_start_info->flags;
/* Interrupts are guaranteed to be off initially. */
early_boot_irqs_disabled = true;
static_call_update_early(xen_hypercall, xen_hypercall_pv);
xen_setup_features();
@@ -1431,7 +1434,6 @@ asmlinkage __visible void __init xen_start_kernel(struct start_info *si)
WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_pv, xen_cpu_dead_pv));
local_irq_disable();
early_boot_irqs_disabled = true;
xen_raw_console_write("mapping kernel into physical memory\n");
xen_setup_kernel_pagetable((pgd_t *)xen_start_info->pt_base,
-7
View File
@@ -129,17 +129,10 @@ static void __init pvh_arch_setup(void)
void __init xen_pvh_init(struct boot_params *boot_params)
{
u32 msr;
u64 pfn;
xen_pvh = 1;
xen_domain_type = XEN_HVM_DOMAIN;
xen_start_flags = pvh_start_info.flags;
msr = cpuid_ebx(xen_cpuid_base() + 2);
pfn = __pa(hypercall_page);
wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
x86_init.oem.arch_setup = pvh_arch_setup;
x86_init.oem.banner = xen_banner;
+41 -9
View File
@@ -20,9 +20,32 @@
#include <linux/init.h>
#include <linux/linkage.h>
#include <linux/objtool.h>
#include <../entry/calling.h>
.pushsection .noinstr.text, "ax"
/*
* PV hypercall interface to the hypervisor.
*
* Called via inline asm(), so better preserve %rcx and %r11.
*
* Input:
* %eax: hypercall number
* %rdi, %rsi, %rdx, %r10, %r8: args 1..5 for the hypercall
* Output: %rax
*/
SYM_FUNC_START(xen_hypercall_pv)
ANNOTATE_NOENDBR
push %rcx
push %r11
UNWIND_HINT_SAVE
syscall
UNWIND_HINT_RESTORE
pop %r11
pop %rcx
RET
SYM_FUNC_END(xen_hypercall_pv)
/*
* Disabling events is simply a matter of making the event mask
* non-zero.
@@ -176,7 +199,6 @@ SYM_CODE_START(xen_early_idt_handler_array)
SYM_CODE_END(xen_early_idt_handler_array)
__FINIT
hypercall_iret = hypercall_page + __HYPERVISOR_iret * 32
/*
* Xen64 iret frame:
*
@@ -186,17 +208,28 @@ hypercall_iret = hypercall_page + __HYPERVISOR_iret * 32
* cs
* rip <-- standard iret frame
*
* flags
* flags <-- xen_iret must push from here on
*
* rcx }
* r11 }<-- pushed by hypercall page
* rsp->rax }
* rcx
* r11
* rsp->rax
*/
.macro xen_hypercall_iret
pushq $0 /* Flags */
push %rcx
push %r11
push %rax
mov $__HYPERVISOR_iret, %eax
syscall /* Do the IRET. */
#ifdef CONFIG_MITIGATION_SLS
int3
#endif
.endm
SYM_CODE_START(xen_iret)
UNWIND_HINT_UNDEFINED
ANNOTATE_NOENDBR
pushq $0
jmp hypercall_iret
xen_hypercall_iret
SYM_CODE_END(xen_iret)
/*
@@ -301,8 +334,7 @@ SYM_CODE_START(xen_entry_SYSENTER_compat)
ENDBR
lea 16(%rsp), %rsp /* strip %rcx, %r11 */
mov $-ENOSYS, %rax
pushq $0
jmp hypercall_iret
xen_hypercall_iret
SYM_CODE_END(xen_entry_SYSENTER_compat)
SYM_CODE_END(xen_entry_SYSCALL_compat)
+83 -24
View File
@@ -6,9 +6,11 @@
#include <linux/elfnote.h>
#include <linux/init.h>
#include <linux/instrumentation.h>
#include <asm/boot.h>
#include <asm/asm.h>
#include <asm/frame.h>
#include <asm/msr.h>
#include <asm/page_types.h>
#include <asm/percpu.h>
@@ -20,28 +22,6 @@
#include <xen/interface/xen-mca.h>
#include <asm/xen/interface.h>
.pushsection .noinstr.text, "ax"
.balign PAGE_SIZE
SYM_CODE_START(hypercall_page)
.rept (PAGE_SIZE / 32)
UNWIND_HINT_FUNC
ANNOTATE_NOENDBR
ANNOTATE_UNRET_SAFE
ret
/*
* Xen will write the hypercall page, and sort out ENDBR.
*/
.skip 31, 0xcc
.endr
#define HYPERCALL(n) \
.equ xen_hypercall_##n, hypercall_page + __HYPERVISOR_##n * 32; \
.type xen_hypercall_##n, @function; .size xen_hypercall_##n, 32
#include <asm/xen-hypercalls.h>
#undef HYPERCALL
SYM_CODE_END(hypercall_page)
.popsection
#ifdef CONFIG_XEN_PV
__INIT
SYM_CODE_START(startup_xen)
@@ -87,6 +67,87 @@ SYM_CODE_END(xen_cpu_bringup_again)
#endif
#endif
.pushsection .noinstr.text, "ax"
/*
* Xen hypercall interface to the hypervisor.
*
* Input:
* %eax: hypercall number
* 32-bit:
* %ebx, %ecx, %edx, %esi, %edi: args 1..5 for the hypercall
* 64-bit:
* %rdi, %rsi, %rdx, %r10, %r8: args 1..5 for the hypercall
* Output: %[er]ax
*/
SYM_FUNC_START(xen_hypercall_hvm)
ENDBR
FRAME_BEGIN
/* Save all relevant registers (caller save and arguments). */
#ifdef CONFIG_X86_32
push %eax
push %ebx
push %ecx
push %edx
push %esi
push %edi
#else
push %rax
push %rcx
push %rdx
push %rdi
push %rsi
push %r11
push %r10
push %r9
push %r8
#ifdef CONFIG_FRAME_POINTER
pushq $0 /* Dummy push for stack alignment. */
#endif
#endif
/* Set the vendor specific function. */
call __xen_hypercall_setfunc
/* Set ZF = 1 if AMD, Restore saved registers. */
#ifdef CONFIG_X86_32
lea xen_hypercall_amd, %ebx
cmp %eax, %ebx
pop %edi
pop %esi
pop %edx
pop %ecx
pop %ebx
pop %eax
#else
lea xen_hypercall_amd(%rip), %rbx
cmp %rax, %rbx
#ifdef CONFIG_FRAME_POINTER
pop %rax /* Dummy pop. */
#endif
pop %r8
pop %r9
pop %r10
pop %r11
pop %rsi
pop %rdi
pop %rdx
pop %rcx
pop %rax
#endif
/* Use correct hypercall function. */
jz xen_hypercall_amd
jmp xen_hypercall_intel
SYM_FUNC_END(xen_hypercall_hvm)
SYM_FUNC_START(xen_hypercall_amd)
vmmcall
RET
SYM_FUNC_END(xen_hypercall_amd)
SYM_FUNC_START(xen_hypercall_intel)
vmcall
RET
SYM_FUNC_END(xen_hypercall_intel)
.popsection
ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS, .asciz "linux")
ELFNOTE(Xen, XEN_ELFNOTE_GUEST_VERSION, .asciz "2.6")
ELFNOTE(Xen, XEN_ELFNOTE_XEN_VERSION, .asciz "xen-3.0")
@@ -116,8 +177,6 @@ SYM_CODE_END(xen_cpu_bringup_again)
#else
# define FEATURES_DOM0 0
#endif
ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, .globl xen_elfnote_hypercall_page;
xen_elfnote_hypercall_page: _ASM_PTR xen_elfnote_hypercall_page_value - .)
ELFNOTE(Xen, XEN_ELFNOTE_SUPPORTED_FEATURES,
.long FEATURES_PV | FEATURES_PVH | FEATURES_DOM0)
ELFNOTE(Xen, XEN_ELFNOTE_LOADER, .asciz "generic")
+9
View File
@@ -326,4 +326,13 @@ static inline void xen_smp_intr_free_pv(unsigned int cpu) {}
static inline void xen_smp_count_cpus(void) { }
#endif /* CONFIG_SMP */
#ifdef CONFIG_XEN_PV
void xen_hypercall_pv(void);
#endif
void xen_hypercall_hvm(void);
void xen_hypercall_amd(void);
void xen_hypercall_intel(void);
void xen_hypercall_setfunc(void);
void *__xen_hypercall_setfunc(void);
#endif /* XEN_OPS_H */
+1 -2
View File
@@ -155,8 +155,7 @@ int set_blocksize(struct file *file, int size)
struct inode *inode = file->f_mapping->host;
struct block_device *bdev = I_BDEV(inode);
/* Size must be a power of two, and between 512 and PAGE_SIZE */
if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
if (blk_validate_block_size(size))
return -EINVAL;
/* Size cannot be smaller than the size supported by the device */
+1 -1
View File
@@ -1171,7 +1171,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
}
EXPORT_SYMBOL_GPL(__bio_release_pages);
void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
void bio_iov_bvec_set(struct bio *bio, const struct iov_iter *iter)
{
WARN_ON_ONCE(bio->bi_max_vecs);
+5 -1
View File
@@ -1324,10 +1324,14 @@ void blkcg_unpin_online(struct cgroup_subsys_state *blkcg_css)
struct blkcg *blkcg = css_to_blkcg(blkcg_css);
do {
struct blkcg *parent;
if (!refcount_dec_and_test(&blkcg->online_pin))
break;
parent = blkcg_parent(blkcg);
blkcg_destroy_blkgs(blkcg);
blkcg = blkcg_parent(blkcg);
blkcg = parent;
} while (blkcg);
}
+8 -1
View File
@@ -1098,7 +1098,14 @@ static void __propagate_weights(struct ioc_gq *iocg, u32 active, u32 inuse,
inuse = DIV64_U64_ROUND_UP(active * iocg->child_inuse_sum,
iocg->child_active_sum);
} else {
inuse = clamp_t(u32, inuse, 1, active);
/*
* It may be tempting to turn this into a clamp expression with
* a lower limit of 1 but active may be 0, which cannot be used
* as an upper limit in that situation. This expression allows
* active to clamp inuse unless it is 0, in which case inuse
* becomes 1.
*/
inuse = min(inuse, active) ?: 1;
}
iocg->last_inuse = iocg->inuse;
+1 -1
View File
@@ -574,7 +574,7 @@ static int blk_rq_map_user_bvec(struct request *rq, const struct iov_iter *iter)
bio = blk_rq_map_bio_alloc(rq, 0, GFP_KERNEL);
if (!bio)
return -ENOMEM;
bio_iov_bvec_set(bio, (struct iov_iter *)iter);
bio_iov_bvec_set(bio, iter);
/* check that the data layout matches the hardware restrictions */
ret = bio_split_rw_at(bio, lim, &nsegs, max_bytes);
+14 -7
View File
@@ -1544,19 +1544,17 @@ static void blk_mq_requeue_work(struct work_struct *work)
while (!list_empty(&rq_list)) {
rq = list_entry(rq_list.next, struct request, queuelist);
list_del_init(&rq->queuelist);
/*
* If RQF_DONTPREP ist set, the request has been started by the
* If RQF_DONTPREP is set, the request has been started by the
* driver already and might have driver-specific data allocated
* already. Insert it into the hctx dispatch list to avoid
* block layer merges for the request.
*/
if (rq->rq_flags & RQF_DONTPREP) {
list_del_init(&rq->queuelist);
if (rq->rq_flags & RQF_DONTPREP)
blk_mq_request_bypass_insert(rq, 0);
} else {
list_del_init(&rq->queuelist);
else
blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
}
}
while (!list_empty(&flush_list)) {
@@ -4414,6 +4412,15 @@ struct gendisk *blk_mq_alloc_disk_for_queue(struct request_queue *q,
}
EXPORT_SYMBOL(blk_mq_alloc_disk_for_queue);
/*
* Only hctx removed from cpuhp list can be reused
*/
static bool blk_mq_hctx_is_reusable(struct blk_mq_hw_ctx *hctx)
{
return hlist_unhashed(&hctx->cpuhp_online) &&
hlist_unhashed(&hctx->cpuhp_dead);
}
static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
struct blk_mq_tag_set *set, struct request_queue *q,
int hctx_idx, int node)
@@ -4423,7 +4430,7 @@ static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
/* reuse dead hctx first */
spin_lock(&q->unused_hctx_lock);
list_for_each_entry(tmp, &q->unused_hctx_list, hctx_list) {
if (tmp->numa_node == node) {
if (tmp->numa_node == node && blk_mq_hctx_is_reusable(tmp)) {
hctx = tmp;
break;
}
+1 -1
View File
@@ -263,7 +263,7 @@ static ssize_t queue_nr_zones_show(struct gendisk *disk, char *page)
static ssize_t queue_iostats_passthrough_show(struct gendisk *disk, char *page)
{
return queue_var_show(blk_queue_passthrough_stat(disk->queue), page);
return queue_var_show(!!blk_queue_passthrough_stat(disk->queue), page);
}
static ssize_t queue_iostats_passthrough_store(struct gendisk *disk,
+225 -285
View File
@@ -41,7 +41,6 @@ static const char *const zone_cond_name[] = {
/*
* Per-zone write plug.
* @node: hlist_node structure for managing the plug using a hash table.
* @link: To list the plug in the zone write plug error list of the disk.
* @ref: Zone write plug reference counter. A zone write plug reference is
* always at least 1 when the plug is hashed in the disk plug hash table.
* The reference is incremented whenever a new BIO needing plugging is
@@ -63,7 +62,6 @@ static const char *const zone_cond_name[] = {
*/
struct blk_zone_wplug {
struct hlist_node node;
struct list_head link;
refcount_t ref;
spinlock_t lock;
unsigned int flags;
@@ -80,8 +78,8 @@ struct blk_zone_wplug {
* - BLK_ZONE_WPLUG_PLUGGED: Indicates that the zone write plug is plugged,
* that is, that write BIOs are being throttled due to a write BIO already
* being executed or the zone write plug bio list is not empty.
* - BLK_ZONE_WPLUG_ERROR: Indicates that a write error happened which will be
* recovered with a report zone to update the zone write pointer offset.
* - BLK_ZONE_WPLUG_NEED_WP_UPDATE: Indicates that we lost track of a zone
* write pointer offset and need to update it.
* - BLK_ZONE_WPLUG_UNHASHED: Indicates that the zone write plug was removed
* from the disk hash table and that the initial reference to the zone
* write plug set when the plug was first added to the hash table has been
@@ -91,11 +89,9 @@ struct blk_zone_wplug {
* freed once all remaining references from BIOs or functions are dropped.
*/
#define BLK_ZONE_WPLUG_PLUGGED (1U << 0)
#define BLK_ZONE_WPLUG_ERROR (1U << 1)
#define BLK_ZONE_WPLUG_NEED_WP_UPDATE (1U << 1)
#define BLK_ZONE_WPLUG_UNHASHED (1U << 2)
#define BLK_ZONE_WPLUG_BUSY (BLK_ZONE_WPLUG_PLUGGED | BLK_ZONE_WPLUG_ERROR)
/**
* blk_zone_cond_str - Return string XXX in BLK_ZONE_COND_XXX.
* @zone_cond: BLK_ZONE_COND_XXX.
@@ -115,6 +111,30 @@ const char *blk_zone_cond_str(enum blk_zone_cond zone_cond)
}
EXPORT_SYMBOL_GPL(blk_zone_cond_str);
struct disk_report_zones_cb_args {
struct gendisk *disk;
report_zones_cb user_cb;
void *user_data;
};
static void disk_zone_wplug_sync_wp_offset(struct gendisk *disk,
struct blk_zone *zone);
static int disk_report_zones_cb(struct blk_zone *zone, unsigned int idx,
void *data)
{
struct disk_report_zones_cb_args *args = data;
struct gendisk *disk = args->disk;
if (disk->zone_wplugs_hash)
disk_zone_wplug_sync_wp_offset(disk, zone);
if (!args->user_cb)
return 0;
return args->user_cb(zone, idx, args->user_data);
}
/**
* blkdev_report_zones - Get zones information
* @bdev: Target block device
@@ -139,6 +159,11 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector,
{
struct gendisk *disk = bdev->bd_disk;
sector_t capacity = get_capacity(disk);
struct disk_report_zones_cb_args args = {
.disk = disk,
.user_cb = cb,
.user_data = data,
};
if (!bdev_is_zoned(bdev) || WARN_ON_ONCE(!disk->fops->report_zones))
return -EOPNOTSUPP;
@@ -146,7 +171,8 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector,
if (!nr_zones || sector >= capacity)
return 0;
return disk->fops->report_zones(disk, sector, nr_zones, cb, data);
return disk->fops->report_zones(disk, sector, nr_zones,
disk_report_zones_cb, &args);
}
EXPORT_SYMBOL_GPL(blkdev_report_zones);
@@ -427,7 +453,7 @@ static inline void disk_put_zone_wplug(struct blk_zone_wplug *zwplug)
{
if (refcount_dec_and_test(&zwplug->ref)) {
WARN_ON_ONCE(!bio_list_empty(&zwplug->bio_list));
WARN_ON_ONCE(!list_empty(&zwplug->link));
WARN_ON_ONCE(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED);
WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_UNHASHED));
call_rcu(&zwplug->rcu_head, disk_free_zone_wplug_rcu);
@@ -441,8 +467,8 @@ static inline bool disk_should_remove_zone_wplug(struct gendisk *disk,
if (zwplug->flags & BLK_ZONE_WPLUG_UNHASHED)
return false;
/* If the zone write plug is still busy, it cannot be removed. */
if (zwplug->flags & BLK_ZONE_WPLUG_BUSY)
/* If the zone write plug is still plugged, it cannot be removed. */
if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)
return false;
/*
@@ -525,12 +551,11 @@ again:
return NULL;
INIT_HLIST_NODE(&zwplug->node);
INIT_LIST_HEAD(&zwplug->link);
refcount_set(&zwplug->ref, 2);
spin_lock_init(&zwplug->lock);
zwplug->flags = 0;
zwplug->zone_no = zno;
zwplug->wp_offset = sector & (disk->queue->limits.chunk_sectors - 1);
zwplug->wp_offset = bdev_offset_from_zone_start(disk->part0, sector);
bio_list_init(&zwplug->bio_list);
INIT_WORK(&zwplug->bio_work, blk_zone_wplug_bio_work);
zwplug->disk = disk;
@@ -574,115 +599,22 @@ static void disk_zone_wplug_abort(struct blk_zone_wplug *zwplug)
}
/*
* Abort (fail) all plugged BIOs of a zone write plug that are not aligned
* with the assumed write pointer location of the zone when the BIO will
* be unplugged.
*/
static void disk_zone_wplug_abort_unaligned(struct gendisk *disk,
struct blk_zone_wplug *zwplug)
{
unsigned int wp_offset = zwplug->wp_offset;
struct bio_list bl = BIO_EMPTY_LIST;
struct bio *bio;
while ((bio = bio_list_pop(&zwplug->bio_list))) {
if (disk_zone_is_full(disk, zwplug->zone_no, wp_offset) ||
(bio_op(bio) != REQ_OP_ZONE_APPEND &&
bio_offset_from_zone_start(bio) != wp_offset)) {
blk_zone_wplug_bio_io_error(zwplug, bio);
continue;
}
wp_offset += bio_sectors(bio);
bio_list_add(&bl, bio);
}
bio_list_merge(&zwplug->bio_list, &bl);
}
static inline void disk_zone_wplug_set_error(struct gendisk *disk,
struct blk_zone_wplug *zwplug)
{
unsigned long flags;
if (zwplug->flags & BLK_ZONE_WPLUG_ERROR)
return;
/*
* At this point, we already have a reference on the zone write plug.
* However, since we are going to add the plug to the disk zone write
* plugs work list, increase its reference count. This reference will
* be dropped in disk_zone_wplugs_work() once the error state is
* handled, or in disk_zone_wplug_clear_error() if the zone is reset or
* finished.
*/
zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
refcount_inc(&zwplug->ref);
spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
list_add_tail(&zwplug->link, &disk->zone_wplugs_err_list);
spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
}
static inline void disk_zone_wplug_clear_error(struct gendisk *disk,
struct blk_zone_wplug *zwplug)
{
unsigned long flags;
if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR))
return;
/*
* We are racing with the error handling work which drops the reference
* on the zone write plug after handling the error state. So remove the
* plug from the error list and drop its reference count only if the
* error handling has not yet started, that is, if the zone write plug
* is still listed.
*/
spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
if (!list_empty(&zwplug->link)) {
list_del_init(&zwplug->link);
zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
disk_put_zone_wplug(zwplug);
}
spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
}
/*
* Set a zone write plug write pointer offset to either 0 (zone reset case)
* or to the zone size (zone finish case). This aborts all plugged BIOs, which
* is fine to do as doing a zone reset or zone finish while writes are in-flight
* is a mistake from the user which will most likely cause all plugged BIOs to
* fail anyway.
* Set a zone write plug write pointer offset to the specified value.
* This aborts all plugged BIOs, which is fine as this function is called for
* a zone reset operation, a zone finish operation or if the zone needs a wp
* update from a report zone after a write error.
*/
static void disk_zone_wplug_set_wp_offset(struct gendisk *disk,
struct blk_zone_wplug *zwplug,
unsigned int wp_offset)
{
unsigned long flags;
spin_lock_irqsave(&zwplug->lock, flags);
/*
* Make sure that a BIO completion or another zone reset or finish
* operation has not already removed the plug from the hash table.
*/
if (zwplug->flags & BLK_ZONE_WPLUG_UNHASHED) {
spin_unlock_irqrestore(&zwplug->lock, flags);
return;
}
lockdep_assert_held(&zwplug->lock);
/* Update the zone write pointer and abort all plugged BIOs. */
zwplug->flags &= ~BLK_ZONE_WPLUG_NEED_WP_UPDATE;
zwplug->wp_offset = wp_offset;
disk_zone_wplug_abort(zwplug);
/*
* Updating the write pointer offset puts back the zone
* in a good state. So clear the error flag and decrement the
* error count if we were in error state.
*/
disk_zone_wplug_clear_error(disk, zwplug);
/*
* The zone write plug now has no BIO plugged: remove it from the
* hash table so that it cannot be seen. The plug will be freed
@@ -690,8 +622,58 @@ static void disk_zone_wplug_set_wp_offset(struct gendisk *disk,
*/
if (disk_should_remove_zone_wplug(disk, zwplug))
disk_remove_zone_wplug(disk, zwplug);
}
static unsigned int blk_zone_wp_offset(struct blk_zone *zone)
{
switch (zone->cond) {
case BLK_ZONE_COND_IMP_OPEN:
case BLK_ZONE_COND_EXP_OPEN:
case BLK_ZONE_COND_CLOSED:
return zone->wp - zone->start;
case BLK_ZONE_COND_FULL:
return zone->len;
case BLK_ZONE_COND_EMPTY:
return 0;
case BLK_ZONE_COND_NOT_WP:
case BLK_ZONE_COND_OFFLINE:
case BLK_ZONE_COND_READONLY:
default:
/*
* Conventional, offline and read-only zones do not have a valid
* write pointer.
*/
return UINT_MAX;
}
}
static void disk_zone_wplug_sync_wp_offset(struct gendisk *disk,
struct blk_zone *zone)
{
struct blk_zone_wplug *zwplug;
unsigned long flags;
zwplug = disk_get_zone_wplug(disk, zone->start);
if (!zwplug)
return;
spin_lock_irqsave(&zwplug->lock, flags);
if (zwplug->flags & BLK_ZONE_WPLUG_NEED_WP_UPDATE)
disk_zone_wplug_set_wp_offset(disk, zwplug,
blk_zone_wp_offset(zone));
spin_unlock_irqrestore(&zwplug->lock, flags);
disk_put_zone_wplug(zwplug);
}
static int disk_zone_sync_wp_offset(struct gendisk *disk, sector_t sector)
{
struct disk_report_zones_cb_args args = {
.disk = disk,
};
return disk->fops->report_zones(disk, sector, 1,
disk_report_zones_cb, &args);
}
static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio,
@@ -700,6 +682,7 @@ static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio,
struct gendisk *disk = bio->bi_bdev->bd_disk;
sector_t sector = bio->bi_iter.bi_sector;
struct blk_zone_wplug *zwplug;
unsigned long flags;
/* Conventional zones cannot be reset nor finished. */
if (!bdev_zone_is_seq(bio->bi_bdev, sector)) {
@@ -707,6 +690,15 @@ static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio,
return true;
}
/*
* No-wait reset or finish BIOs do not make much sense as the callers
* issue these as blocking operations in most cases. To avoid issues
* the BIO execution potentially failing with BLK_STS_AGAIN, warn about
* REQ_NOWAIT being set and ignore that flag.
*/
if (WARN_ON_ONCE(bio->bi_opf & REQ_NOWAIT))
bio->bi_opf &= ~REQ_NOWAIT;
/*
* If we have a zone write plug, set its write pointer offset to 0
* (reset case) or to the zone size (finish case). This will abort all
@@ -716,7 +708,9 @@ static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio,
*/
zwplug = disk_get_zone_wplug(disk, sector);
if (zwplug) {
spin_lock_irqsave(&zwplug->lock, flags);
disk_zone_wplug_set_wp_offset(disk, zwplug, wp_offset);
spin_unlock_irqrestore(&zwplug->lock, flags);
disk_put_zone_wplug(zwplug);
}
@@ -727,6 +721,7 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
{
struct gendisk *disk = bio->bi_bdev->bd_disk;
struct blk_zone_wplug *zwplug;
unsigned long flags;
sector_t sector;
/*
@@ -738,7 +733,9 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
sector += disk->queue->limits.chunk_sectors) {
zwplug = disk_get_zone_wplug(disk, sector);
if (zwplug) {
spin_lock_irqsave(&zwplug->lock, flags);
disk_zone_wplug_set_wp_offset(disk, zwplug, 0);
spin_unlock_irqrestore(&zwplug->lock, flags);
disk_put_zone_wplug(zwplug);
}
}
@@ -746,9 +743,25 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
return false;
}
static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
struct bio *bio, unsigned int nr_segs)
static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
struct blk_zone_wplug *zwplug)
{
/*
* Take a reference on the zone write plug and schedule the submission
* of the next plugged BIO. blk_zone_wplug_bio_work() will release the
* reference we take here.
*/
WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
refcount_inc(&zwplug->ref);
queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
}
static inline void disk_zone_wplug_add_bio(struct gendisk *disk,
struct blk_zone_wplug *zwplug,
struct bio *bio, unsigned int nr_segs)
{
bool schedule_bio_work = false;
/*
* Grab an extra reference on the BIO request queue usage counter.
* This reference will be reused to submit a request for the BIO for
@@ -764,6 +777,16 @@ static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
*/
bio_clear_polled(bio);
/*
* REQ_NOWAIT BIOs are always handled using the zone write plug BIO
* work, which can block. So clear the REQ_NOWAIT flag and schedule the
* work if this is the first BIO we are plugging.
*/
if (bio->bi_opf & REQ_NOWAIT) {
schedule_bio_work = !(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED);
bio->bi_opf &= ~REQ_NOWAIT;
}
/*
* Reuse the poll cookie field to store the number of segments when
* split to the hardware limits.
@@ -777,6 +800,11 @@ static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
* at the tail of the list to preserve the sequential write order.
*/
bio_list_add(&zwplug->bio_list, bio);
zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
if (schedule_bio_work)
disk_zone_wplug_schedule_bio_work(disk, zwplug);
}
/*
@@ -889,13 +917,23 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
{
struct gendisk *disk = bio->bi_bdev->bd_disk;
/*
* If we lost track of the zone write pointer due to a write error,
* the user must either execute a report zones, reset the zone or finish
* the to recover a reliable write pointer position. Fail BIOs if the
* user did not do that as we cannot handle emulated zone append
* otherwise.
*/
if (zwplug->flags & BLK_ZONE_WPLUG_NEED_WP_UPDATE)
return false;
/*
* Check that the user is not attempting to write to a full zone.
* We know such BIO will fail, and that would potentially overflow our
* write pointer offset beyond the end of the zone.
*/
if (disk_zone_wplug_is_full(disk, zwplug))
goto err;
return false;
if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
/*
@@ -914,24 +952,18 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
bio_set_flag(bio, BIO_EMULATES_ZONE_APPEND);
} else {
/*
* Check for non-sequential writes early because we avoid a
* whole lot of error handling trouble if we don't send it off
* to the driver.
* Check for non-sequential writes early as we know that BIOs
* with a start sector not unaligned to the zone write pointer
* will fail.
*/
if (bio_offset_from_zone_start(bio) != zwplug->wp_offset)
goto err;
return false;
}
/* Advance the zone write pointer offset. */
zwplug->wp_offset += bio_sectors(bio);
return true;
err:
/* We detected an invalid write BIO: schedule error recovery. */
disk_zone_wplug_set_error(disk, zwplug);
kblockd_schedule_work(&disk->zone_wplugs_work);
return false;
}
static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
@@ -970,7 +1002,10 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
zwplug = disk_get_and_lock_zone_wplug(disk, sector, gfp_mask, &flags);
if (!zwplug) {
bio_io_error(bio);
if (bio->bi_opf & REQ_NOWAIT)
bio_wouldblock_error(bio);
else
bio_io_error(bio);
return true;
}
@@ -978,18 +1013,20 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
/*
* If the zone is already plugged or has a pending error, add the BIO
* to the plug BIO list. Otherwise, plug and let the BIO execute.
* If the zone is already plugged, add the BIO to the plug BIO list.
* Do the same for REQ_NOWAIT BIOs to ensure that we will not see a
* BLK_STS_AGAIN failure if we let the BIO execute.
* Otherwise, plug and let the BIO execute.
*/
if (zwplug->flags & BLK_ZONE_WPLUG_BUSY)
if ((zwplug->flags & BLK_ZONE_WPLUG_PLUGGED) ||
(bio->bi_opf & REQ_NOWAIT))
goto plug;
/*
* If an error is detected when preparing the BIO, add it to the BIO
* list so that error recovery can deal with it.
*/
if (!blk_zone_wplug_prepare_bio(zwplug, bio))
goto plug;
if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
spin_unlock_irqrestore(&zwplug->lock, flags);
bio_io_error(bio);
return true;
}
zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
@@ -998,8 +1035,7 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
return false;
plug:
zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
blk_zone_wplug_add_bio(zwplug, bio, nr_segs);
disk_zone_wplug_add_bio(disk, zwplug, bio, nr_segs);
spin_unlock_irqrestore(&zwplug->lock, flags);
@@ -1083,19 +1119,6 @@ bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
}
EXPORT_SYMBOL_GPL(blk_zone_plug_bio);
static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
struct blk_zone_wplug *zwplug)
{
/*
* Take a reference on the zone write plug and schedule the submission
* of the next plugged BIO. blk_zone_wplug_bio_work() will release the
* reference we take here.
*/
WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
refcount_inc(&zwplug->ref);
queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
}
static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
struct blk_zone_wplug *zwplug)
{
@@ -1103,16 +1126,6 @@ static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
spin_lock_irqsave(&zwplug->lock, flags);
/*
* If we had an error, schedule error recovery. The recovery work
* will restart submission of plugged BIOs.
*/
if (zwplug->flags & BLK_ZONE_WPLUG_ERROR) {
spin_unlock_irqrestore(&zwplug->lock, flags);
kblockd_schedule_work(&disk->zone_wplugs_work);
return;
}
/* Schedule submission of the next plugged BIO if we have one. */
if (!bio_list_empty(&zwplug->bio_list)) {
disk_zone_wplug_schedule_bio_work(disk, zwplug);
@@ -1155,12 +1168,13 @@ void blk_zone_write_plug_bio_endio(struct bio *bio)
}
/*
* If the BIO failed, mark the plug as having an error to trigger
* recovery.
* If the BIO failed, abort all plugged BIOs and mark the plug as
* needing a write pointer update.
*/
if (bio->bi_status != BLK_STS_OK) {
spin_lock_irqsave(&zwplug->lock, flags);
disk_zone_wplug_set_error(disk, zwplug);
disk_zone_wplug_abort(zwplug);
zwplug->flags |= BLK_ZONE_WPLUG_NEED_WP_UPDATE;
spin_unlock_irqrestore(&zwplug->lock, flags);
}
@@ -1216,6 +1230,7 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
*/
spin_lock_irqsave(&zwplug->lock, flags);
again:
bio = bio_list_pop(&zwplug->bio_list);
if (!bio) {
zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
@@ -1224,10 +1239,8 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
}
if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
/* Error recovery will decide what to do with the BIO. */
bio_list_add_head(&zwplug->bio_list, bio);
spin_unlock_irqrestore(&zwplug->lock, flags);
goto put_zwplug;
blk_zone_wplug_bio_io_error(zwplug, bio);
goto again;
}
spin_unlock_irqrestore(&zwplug->lock, flags);
@@ -1249,120 +1262,6 @@ put_zwplug:
disk_put_zone_wplug(zwplug);
}
static unsigned int blk_zone_wp_offset(struct blk_zone *zone)
{
switch (zone->cond) {
case BLK_ZONE_COND_IMP_OPEN:
case BLK_ZONE_COND_EXP_OPEN:
case BLK_ZONE_COND_CLOSED:
return zone->wp - zone->start;
case BLK_ZONE_COND_FULL:
return zone->len;
case BLK_ZONE_COND_EMPTY:
return 0;
case BLK_ZONE_COND_NOT_WP:
case BLK_ZONE_COND_OFFLINE:
case BLK_ZONE_COND_READONLY:
default:
/*
* Conventional, offline and read-only zones do not have a valid
* write pointer.
*/
return UINT_MAX;
}
}
static int blk_zone_wplug_report_zone_cb(struct blk_zone *zone,
unsigned int idx, void *data)
{
struct blk_zone *zonep = data;
*zonep = *zone;
return 0;
}
static void disk_zone_wplug_handle_error(struct gendisk *disk,
struct blk_zone_wplug *zwplug)
{
sector_t zone_start_sector =
bdev_zone_sectors(disk->part0) * zwplug->zone_no;
unsigned int noio_flag;
struct blk_zone zone;
unsigned long flags;
int ret;
/* Get the current zone information from the device. */
noio_flag = memalloc_noio_save();
ret = disk->fops->report_zones(disk, zone_start_sector, 1,
blk_zone_wplug_report_zone_cb, &zone);
memalloc_noio_restore(noio_flag);
spin_lock_irqsave(&zwplug->lock, flags);
/*
* A zone reset or finish may have cleared the error already. In such
* case, do nothing as the report zones may have seen the "old" write
* pointer value before the reset/finish operation completed.
*/
if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR))
goto unlock;
zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
if (ret != 1) {
/*
* We failed to get the zone information, meaning that something
* is likely really wrong with the device. Abort all remaining
* plugged BIOs as otherwise we could endup waiting forever on
* plugged BIOs to complete if there is a queue freeze on-going.
*/
disk_zone_wplug_abort(zwplug);
goto unplug;
}
/* Update the zone write pointer offset. */
zwplug->wp_offset = blk_zone_wp_offset(&zone);
disk_zone_wplug_abort_unaligned(disk, zwplug);
/* Restart BIO submission if we still have any BIO left. */
if (!bio_list_empty(&zwplug->bio_list)) {
disk_zone_wplug_schedule_bio_work(disk, zwplug);
goto unlock;
}
unplug:
zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
if (disk_should_remove_zone_wplug(disk, zwplug))
disk_remove_zone_wplug(disk, zwplug);
unlock:
spin_unlock_irqrestore(&zwplug->lock, flags);
}
static void disk_zone_wplugs_work(struct work_struct *work)
{
struct gendisk *disk =
container_of(work, struct gendisk, zone_wplugs_work);
struct blk_zone_wplug *zwplug;
unsigned long flags;
spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
while (!list_empty(&disk->zone_wplugs_err_list)) {
zwplug = list_first_entry(&disk->zone_wplugs_err_list,
struct blk_zone_wplug, link);
list_del_init(&zwplug->link);
spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
disk_zone_wplug_handle_error(disk, zwplug);
disk_put_zone_wplug(zwplug);
spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
}
spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
}
static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
{
return 1U << disk->zone_wplugs_hash_bits;
@@ -1371,8 +1270,6 @@ static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
void disk_init_zone_resources(struct gendisk *disk)
{
spin_lock_init(&disk->zone_wplugs_lock);
INIT_LIST_HEAD(&disk->zone_wplugs_err_list);
INIT_WORK(&disk->zone_wplugs_work, disk_zone_wplugs_work);
}
/*
@@ -1471,8 +1368,6 @@ void disk_free_zone_resources(struct gendisk *disk)
if (!disk->zone_wplugs_pool)
return;
cancel_work_sync(&disk->zone_wplugs_work);
if (disk->zone_wplugs_wq) {
destroy_workqueue(disk->zone_wplugs_wq);
disk->zone_wplugs_wq = NULL;
@@ -1669,6 +1564,8 @@ static int blk_revalidate_seq_zone(struct blk_zone *zone, unsigned int idx,
if (!disk->zone_wplugs_hash)
return 0;
disk_zone_wplug_sync_wp_offset(disk, zone);
wp_offset = blk_zone_wp_offset(zone);
if (!wp_offset || wp_offset >= zone->capacity)
return 0;
@@ -1799,6 +1696,7 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
memalloc_noio_restore(noio_flag);
return ret;
}
ret = disk->fops->report_zones(disk, 0, UINT_MAX,
blk_revalidate_zone_cb, &args);
if (!ret) {
@@ -1835,6 +1733,48 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
}
EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones);
/**
* blk_zone_issue_zeroout - zero-fill a block range in a zone
* @bdev: blockdev to write
* @sector: start sector
* @nr_sects: number of sectors to write
* @gfp_mask: memory allocation flags (for bio_alloc)
*
* Description:
* Zero-fill a block range in a zone (@sector must be equal to the zone write
* pointer), handling potential errors due to the (initially unknown) lack of
* hardware offload (See blkdev_issue_zeroout()).
*/
int blk_zone_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask)
{
int ret;
if (WARN_ON_ONCE(!bdev_is_zoned(bdev)))
return -EIO;
ret = blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
BLKDEV_ZERO_NOFALLBACK);
if (ret != -EOPNOTSUPP)
return ret;
/*
* The failed call to blkdev_issue_zeroout() advanced the zone write
* pointer. Undo this using a report zone to update the zone write
* pointer to the correct current value.
*/
ret = disk_zone_sync_wp_offset(bdev->bd_disk, sector);
if (ret != 1)
return ret < 0 ? ret : -EIO;
/*
* Retry without BLKDEV_ZERO_NOFALLBACK to force the fallback to a
* regular write with zero-pages.
*/
return blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, 0);
}
EXPORT_SYMBOL_GPL(blk_zone_issue_zeroout);
#ifdef CONFIG_BLK_DEBUG_FS
int queue_zone_wplugs_show(void *data, struct seq_file *m)
+1 -4
View File
@@ -698,8 +698,6 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
list_add(&rq->queuelist, &per_prio->dispatch);
rq->fifo_time = jiffies;
} else {
struct list_head *insert_before;
deadline_add_rq_rb(per_prio, rq);
if (rq_mergeable(rq)) {
@@ -712,8 +710,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
* set expire time and add to fifo list
*/
rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
insert_before = &per_prio->fifo_list[data_dir];
list_add_tail(&rq->queuelist, insert_before);
list_add_tail(&rq->queuelist, &per_prio->fifo_list[data_dir]);
}
}
+14 -31
View File
@@ -163,10 +163,6 @@ static int rsassa_pkcs1_sign(struct crypto_sig *tfm,
struct rsassa_pkcs1_inst_ctx *ictx = sig_instance_ctx(inst);
const struct hash_prefix *hash_prefix = ictx->hash_prefix;
struct rsassa_pkcs1_ctx *ctx = crypto_sig_ctx(tfm);
unsigned int child_reqsize = crypto_akcipher_reqsize(ctx->child);
struct akcipher_request *child_req __free(kfree_sensitive) = NULL;
struct scatterlist in_sg[3], out_sg;
struct crypto_wait cwait;
unsigned int pad_len;
unsigned int ps_end;
unsigned int len;
@@ -187,37 +183,25 @@ static int rsassa_pkcs1_sign(struct crypto_sig *tfm,
pad_len = ctx->key_size - slen - hash_prefix->size - 1;
child_req = kmalloc(sizeof(*child_req) + child_reqsize + pad_len,
GFP_KERNEL);
if (!child_req)
return -ENOMEM;
/* RFC 8017 sec 8.2.1 step 1 - EMSA-PKCS1-v1_5 encoding generation */
in_buf = (u8 *)(child_req + 1) + child_reqsize;
in_buf = dst;
memmove(in_buf + pad_len + hash_prefix->size, src, slen);
memcpy(in_buf + pad_len, hash_prefix->data, hash_prefix->size);
ps_end = pad_len - 1;
in_buf[0] = 0x01;
memset(in_buf + 1, 0xff, ps_end - 1);
in_buf[ps_end] = 0x00;
/* RFC 8017 sec 8.2.1 step 2 - RSA signature */
crypto_init_wait(&cwait);
sg_init_table(in_sg, 3);
sg_set_buf(&in_sg[0], in_buf, pad_len);
sg_set_buf(&in_sg[1], hash_prefix->data, hash_prefix->size);
sg_set_buf(&in_sg[2], src, slen);
sg_init_one(&out_sg, dst, dlen);
akcipher_request_set_tfm(child_req, ctx->child);
akcipher_request_set_crypt(child_req, in_sg, &out_sg,
ctx->key_size - 1, dlen);
akcipher_request_set_callback(child_req, CRYPTO_TFM_REQ_MAY_SLEEP,
crypto_req_done, &cwait);
err = crypto_akcipher_decrypt(child_req);
err = crypto_wait_req(err, &cwait);
if (err)
/* RFC 8017 sec 8.2.1 step 2 - RSA signature */
err = crypto_akcipher_sync_decrypt(ctx->child, in_buf,
ctx->key_size - 1, in_buf,
ctx->key_size);
if (err < 0)
return err;
len = child_req->dst_len;
len = err;
pad_len = ctx->key_size - len;
/* Four billion to one */
@@ -239,8 +223,8 @@ static int rsassa_pkcs1_verify(struct crypto_sig *tfm,
struct rsassa_pkcs1_ctx *ctx = crypto_sig_ctx(tfm);
unsigned int child_reqsize = crypto_akcipher_reqsize(ctx->child);
struct akcipher_request *child_req __free(kfree_sensitive) = NULL;
struct scatterlist in_sg, out_sg;
struct crypto_wait cwait;
struct scatterlist sg;
unsigned int dst_len;
unsigned int pos;
u8 *out_buf;
@@ -259,13 +243,12 @@ static int rsassa_pkcs1_verify(struct crypto_sig *tfm,
return -ENOMEM;
out_buf = (u8 *)(child_req + 1) + child_reqsize;
memcpy(out_buf, src, slen);
crypto_init_wait(&cwait);
sg_init_one(&in_sg, src, slen);
sg_init_one(&out_sg, out_buf, ctx->key_size);
sg_init_one(&sg, out_buf, slen);
akcipher_request_set_tfm(child_req, ctx->child);
akcipher_request_set_crypt(child_req, &in_sg, &out_sg,
slen, ctx->key_size);
akcipher_request_set_crypt(child_req, &sg, &sg, slen, slen);
akcipher_request_set_callback(child_req, CRYPTO_TFM_REQ_MAY_SLEEP,
crypto_req_done, &cwait);
+1 -1
View File
@@ -409,7 +409,7 @@ static void ivpu_bo_print_info(struct ivpu_bo *bo, struct drm_printer *p)
mutex_lock(&bo->lock);
drm_printf(p, "%-9p %-3u 0x%-12llx %-10lu 0x%-8x %-4u",
bo, bo->ctx->id, bo->vpu_addr, bo->base.base.size,
bo, bo->ctx ? bo->ctx->id : 0, bo->vpu_addr, bo->base.base.size,
bo->flags, kref_read(&bo->base.base.refcount));
if (bo->base.pages)
+7 -3
View File
@@ -612,18 +612,22 @@ int ivpu_mmu_reserved_context_init(struct ivpu_device *vdev)
if (!ivpu_mmu_ensure_pgd(vdev, &vdev->rctx.pgtable)) {
ivpu_err(vdev, "Failed to allocate root page table for reserved context\n");
ret = -ENOMEM;
goto unlock;
goto err_ctx_fini;
}
ret = ivpu_mmu_cd_set(vdev, vdev->rctx.id, &vdev->rctx.pgtable);
if (ret) {
ivpu_err(vdev, "Failed to set context descriptor for reserved context\n");
goto unlock;
goto err_ctx_fini;
}
unlock:
mutex_unlock(&vdev->rctx.lock);
return ret;
err_ctx_fini:
mutex_unlock(&vdev->rctx.lock);
ivpu_mmu_context_fini(vdev, &vdev->rctx);
return ret;
}
void ivpu_mmu_reserved_context_fini(struct ivpu_device *vdev)

Some files were not shown because too many files have changed in this diff Show More