Commit Graph

48444 Commits

Author SHA1 Message Date
Linus Torvalds
7e7bc8335b vfs-6.17-rc1.bpf
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCjwAKCRCRxhvAZXjc
 osnVAQCv4rM7sF4yJvGlm1myIJcJy5Sabk2q31qMdI1VHmkcOwD+Mxs7d1aByTS8
 /6djhVleq6lcT2LpP9j8YI3Rb+x30QY=
 =PF3o
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.bpf' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs bpf updates from Christian Brauner:
 "These changes allow bpf to read extended attributes from cgroupfs.

  This is useful in redirecting AF_UNIX socket connections based on
  cgroup membership of the socket. One use-case is the ability to
  implement log namespaces in systemd so services and containers are
  redirected to different journals"

* tag 'vfs-6.17-rc1.bpf' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  selftests/kernfs: test xattr retrieval
  selftests/bpf: Add tests for bpf_cgroup_read_xattr
  bpf: Mark cgroup_subsys_state->cgroup RCU safe
  bpf: Introduce bpf_cgroup_read_xattr to read xattr of cgroup's node
  kernfs: remove iattr_mutex
2025-07-28 14:42:31 -07:00
Linus Torvalds
672dcda246 vfs-6.17-rc1.pidfs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCiQAKCRCRxhvAZXjc
 orltAQDq3y1anYETz5/FD6P2gXY1W5hXdSm3EHHeacQ1JjTXvgEA2g1lWO7J4anf
 oOVE8aSvMow/FOjivLZBYmI65pkYJAE=
 =oDKB
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull pidfs updates from Christian Brauner:

 - persistent info

   Persist exit and coredump information independent of whether anyone
   currently holds a pidfd for the struct pid.

   The current scheme allocated pidfs dentries on-demand repeatedly.
   This scheme is reaching it's limits as it makes it impossible to pin
   information that needs to be available after the task has exited or
   coredumped and that should not be lost simply because the pidfd got
   closed temporarily. The next opener should still see the stashed
   information.

   This is also a prerequisite for supporting extended attributes on
   pidfds to allow attaching meta information to them.

   If someone opens a pidfd for a struct pid a pidfs dentry is allocated
   and stashed in pid->stashed. Once the last pidfd for the struct pid
   is closed the pidfs dentry is released and removed from pid->stashed.

   So if 10 callers create a pidfs dentry for the same struct pid
   sequentially, i.e., each closing the pidfd before the other creates a
   new one then a new pidfs dentry is allocated every time.

   Because multiple tasks acquiring and releasing a pidfd for the same
   struct pid can race with each another a task may still find a valid
   pidfs entry from the previous task in pid->stashed and reuse it. Or
   it might find a dead dentry in there and fail to reuse it and so
   stashes a new pidfs dentry. Multiple tasks may race to stash a new
   pidfs dentry but only one will succeed, the other ones will put their
   dentry.

   The current scheme aims to ensure that a pidfs dentry for a struct
   pid can only be created if the task is still alive or if a pidfs
   dentry already existed before the task was reaped and so exit
   information has been was stashed in the pidfs inode.

   That's great except that it's buggy. If a pidfs dentry is stashed in
   pid->stashed after pidfs_exit() but before __unhash_process() is
   called we will return a pidfd for a reaped task without exit
   information being available.

   The pidfds_pid_valid() check does not guard against this race as it
   doens't sync at all with pidfs_exit(). The pid_has_task() check might
   be successful simply because we're before __unhash_process() but
   after pidfs_exit().

   Introduce a new scheme where the lifetime of information associated
   with a pidfs entry (coredump and exit information) isn't bound to the
   lifetime of the pidfs inode but the struct pid itself.

   The first time a pidfs dentry is allocated for a struct pid a struct
   pidfs_attr will be allocated which will be used to store exit and
   coredump information.

   If all pidfs for the pidfs dentry are closed the dentry and inode can
   be cleaned up but the struct pidfs_attr will stick until the struct
   pid itself is freed. This will ensure minimal memory usage while
   persisting relevant information.

   The new scheme has various advantages. First, it allows to close the
   race where we end up handing out a pidfd for a reaped task for which
   no exit information is available. Second, it minimizes memory usage.
   Third, it allows to remove complex lifetime tracking via dentries
   when registering a struct pid with pidfs. There's no need to get or
   put a reference. Instead, the lifetime of exit and coredump
   information associated with a struct pid is bound to the lifetime of
   struct pid itself.

 - extended attributes

   Now that we have a way to persist information for pidfs dentries we
   can start supporting extended attributes on pidfds. This will allow
   userspace to attach meta information to tasks.

   One natural extension would be to introduce a custom pidfs.* extended
   attribute space and allow for the inheritance of extended attributes
   across fork() and exec().

   The first simple scheme will allow privileged userspace to set
   trusted extended attributes on pidfs inodes.

 - Allow autonomous pidfs file handles

   Various filesystems such as pidfs and drm support opening file
   handles without having to require a file descriptor to identify the
   filesystem. The filesystem are global single instances and can be
   trivially identified solely on the information encoded in the file
   handle.

   This makes it possible to not have to keep or acquire a sentinal file
   descriptor just to pass it to open_by_handle_at() to identify the
   filesystem. That's especially useful when such sentinel file
   descriptor cannot or should not be acquired.

   For pidfs this means a file handle can function as full replacement
   for storing a pid in a file. Instead a file handle can be stored and
   reopened purely based on the file handle.

   Such autonomous file handles can be opened with or without specifying
   a a file descriptor. If no proper file descriptor is used the
   FD_PIDFS_ROOT sentinel must be passed. This allows us to define
   further special negative fd sentinels in the future.

   Userspace can trivially test for support by trying to open the file
   handle with an invalid file descriptor.

 - Allow pidfds for reaped tasks with SCM_PIDFD messages

   This is a logical continuation of the earlier work to create pidfds
   for reaped tasks through the SO_PEERPIDFD socket option merged in
   923ea4d4482b ("Merge patch series "net, pidfs: enable handing out
   pidfds for reaped sk->sk_peer_pid"").

 - Two minor fixes:

    * Fold fs_struct->{lock,seq} into a seqlock

    * Don't bother with path_{get,put}() in unix_open_file()

* tag 'vfs-6.17-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (37 commits)
  don't bother with path_get()/path_put() in unix_open_file()
  fold fs_struct->{lock,seq} into a seqlock
  selftests: net: extend SCM_PIDFD test to cover stale pidfds
  af_unix: enable handing out pidfds for reaped tasks in SCM_PIDFD
  af_unix: stash pidfs dentry when needed
  af_unix/scm: fix whitespace errors
  af_unix: introduce and use scm_replace_pid() helper
  af_unix: introduce unix_skb_to_scm helper
  af_unix: rework unix_maybe_add_creds() to allow sleep
  selftests/pidfd: decode pidfd file handles withou having to specify an fd
  fhandle, pidfs: support open_by_handle_at() purely based on file handle
  uapi/fcntl: add FD_PIDFS_ROOT
  uapi/fcntl: add FD_INVALID
  fcntl/pidfd: redefine PIDFD_SELF_THREAD_GROUP
  uapi/fcntl: mark range as reserved
  fhandle: reflow get_path_anchor()
  pidfs: add pidfs_root_path() helper
  fhandle: rename to get_path_anchor()
  fhandle: hoist copy_from_user() above get_path_from_fd()
  fhandle: raise FILEID_IS_DIR in handle_type
  ...
2025-07-28 14:10:15 -07:00
Linus Torvalds
117eab5c6e vfs-6.17-rc1.coredump
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINAYAAKCRCRxhvAZXjc
 opJiAQDXGs+gQcxJ+4BpV4QszT2OJC19oI/f5AQ4PWMJdHgr4AEA7fc6NbBrpmW7
 L/tbdAwIiWp8bL1Q8Wy7Q2qldHtcggM=
 =KbD9
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull coredump updates from Christian Brauner:
 "This contains an extension to the coredump socket and a proper rework
  of the coredump code.

   - This extends the coredump socket to allow the coredump server to
     tell the kernel how to process individual coredumps. This allows
     for fine-grained coredump management. Userspace can decide to just
     let the kernel write out the coredump, or generate the coredump
     itself, or just reject it.

     * COREDUMP_KERNEL
       The kernel will write the coredump data to the socket.

     * COREDUMP_USERSPACE
       The kernel will not write coredump data but will indicate to the
       parent that a coredump has been generated. This is used when
       userspace generates its own coredumps.

     * COREDUMP_REJECT
       The kernel will skip generating a coredump for this task.

     * COREDUMP_WAIT
       The kernel will prevent the task from exiting until the coredump
       server has shutdown the socket connection.

     The flexible coredump socket can be enabled by using the "@@"
     prefix instead of the single "@" prefix for the regular coredump
     socket:

       @@/run/systemd/coredump.socket

   - Cleanup the coredump code properly while we have to touch it
     anyway.

     Split out each coredump mode in a separate helper so it's easy to
     grasp what is going on and make the code easier to follow. The core
     coredump function should now be very trivial to follow"

* tag 'vfs-6.17-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (31 commits)
  cleanup: add a scoped version of CLASS()
  coredump: add coredump_skip() helper
  coredump: avoid pointless variable
  coredump: order auto cleanup variables at the top
  coredump: add coredump_cleanup()
  coredump: auto cleanup prepare_creds()
  cred: add auto cleanup method
  coredump: directly return
  coredump: auto cleanup argv
  coredump: add coredump_write()
  coredump: use a single helper for the socket
  coredump: move pipe specific file check into coredump_pipe()
  coredump: split pipe coredumping into coredump_pipe()
  coredump: move core_pipe_count to global variable
  coredump: prepare to simplify exit paths
  coredump: split file coredumping into coredump_file()
  coredump: rename do_coredump() to vfs_coredump()
  selftests/coredump: make sure invalid paths are rejected
  coredump: validate socket path in coredump_parse()
  coredump: don't allow ".." in coredump socket path
  ...
2025-07-28 11:50:36 -07:00
Linus Torvalds
b711733e89 A single fix for the PTP systemcounter mechanism:
The rework of this mechanism added a 'use_nsec' member to struct
   system_counterval. get_device_system_crosststamp() instantiates that
   struct on the stack and hands a pointer to the driver callback.
 
   Only the drivers which set use_nsec to true, initialize that field, but
   all others ignore it. As get_device_system_crosststamp() does not
   initialize the struct, the use_nsec field contains random stack content
   in those cases. That causes a miscalulation usually resulting in a
   failing range check in the best case.
 
   Initialize the structure before handing it to the drivers to cure that.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmiGFA4THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoRcsEACvQI0LmKTOigzSZvBT1CZnGcwpeqYi
 Ez0v/w+tpyfbwQgf9kxR+ZbjNdwqCYFnR8PZPFKFuvsanWRTIcYaTkIQWvDhcEX/
 U4AFI3VkdZUFckCEY/fv7j3/jkp7pbLVHMq001Z9xaMMcE+ox1AlHpEW0Khd3gqL
 VFLXU5S7Q9H6J6ujjFAXAMuhgjk6WOz8q+ew3hnc3dxwyuEBAz83jOScH/be3dTl
 10ydzoxFEa+ZlacAHX+SqZ7nhS7ExxNlwlUuTYj/EkBCQ8UIoS93YLA5bYMcWCao
 W5rs6vFJmMO6NR6lkqwfKmKyjovx79jHMVNKoxydZGvkqcNMtfc/eUfByxAkyCDP
 gmTCFwgKVGdjGsYwkGqafejmJt5OFrD1hMyWfBhGWQ/Z8CXuuJNEa/8trSyUK/CS
 DFD1InOLltbYuw7rY5gRxb+xmgBTxUMj8gF/hXYs7wNzJqNJXXNae/2Sue+Xi+mV
 iieEF8UonmpMe9k9w3+fFGGDWYa4lYnT5O3VMQ0nEjj6dt5RVQqRvjTa+GtQJzUs
 h4fUs+BIKyCkh6DgRKyIsDruzryOSnZ+vqMcGMm0gvPttc3cGYksLiQVlWYjQhxs
 pTFrHNGOSXMT5WBQ7KWKzGypHlf3WYhVWk1+dmPJedrdyr23AfgKAGM6zva1Oqjc
 81w9DvppBL0sOA==
 =j6Po
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Thomas Gleixner:
 "A single fix for the PTP systemcounter mechanism:

  The rework of this mechanism added a 'use_nsec' member to struct
  system_counterval. get_device_system_crosststamp() instantiates that
  struct on the stack and hands a pointer to the driver callback.

  Only the drivers which set use_nsec to true, initialize that field,
  but all others ignore it. As get_device_system_crosststamp() does not
  initialize the struct, the use_nsec field contains random stack
  content in those cases. That causes a miscalulation usually resulting
  in a failing range check in the best case.

  Initialize the structure before handing it to the drivers to cure
  that"

* tag 'timers-urgent-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Zero initialize system_counterval when querying time from phc drivers
2025-07-27 09:31:32 -07:00
Linus Torvalds
2942242dde 11 hotfixes. 9 are cc:stable and the remainder address post-6.15 issues
or aren't considered necessary for -stable kernels.
 
 7 are for MM.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaILYBgAKCRDdBJ7gKXxA
 jo0uAQDvTlAjH6TcgRW/cbqHRIeiRoZ9Bwh/RUlJXM9neDR2LgEA41B+ohTsxUmZ
 OhM3Ce94tiGrHnVlW3SsmVaO+1TjGAU=
 =KUR9
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2025-07-24-18-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "11 hotfixes. 9 are cc:stable and the remainder address post-6.15
  issues or aren't considered necessary for -stable kernels.

  7 are for MM"

* tag 'mm-hotfixes-stable-2025-07-24-18-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  sprintf.h requires stdarg.h
  resource: fix false warning in __request_region()
  mm/damon/core: commit damos_quota_goal->nid
  kasan: use vmalloc_dump_obj() for vmalloc error reports
  mm/ksm: fix -Wsometimes-uninitialized from clang-21 in advisor_mode_show()
  mm: update MAINTAINERS entry for HMM
  nilfs2: reject invalid file types when reading inodes
  selftests/mm: fix split_huge_page_test for folio_split() tests
  mailmap: add entry for Senozhatsky
  mm/zsmalloc: do not pass __GFP_MOVABLE if CONFIG_COMPACTION=n
  mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
2025-07-24 19:13:30 -07:00
Akinobu Mita
91a229bb7b resource: fix false warning in __request_region()
A warning is raised when __request_region() detects a conflict with a
resource whose resource.desc is IORES_DESC_DEVICE_PRIVATE_MEMORY.

But this warning is only valid for iomem_resources.
The hmem device resource uses resource.desc as the numa node id, which can
cause spurious warnings.

This warning appeared on a machine with multiple cxl memory expanders. 
One of the NUMA node id is 6, which is the same as the value of
IORES_DESC_DEVICE_PRIVATE_MEMORY.

In this environment it was just a spurious warning, but when I saw the
warning I suspected a real problem so it's better to fix it.

This change fixes this by restricting the warning to only iomem_resource.
This also adds a missing new line to the warning message.

Link: https://lkml.kernel.org/r/20250719112604.25500-1-akinobu.mita@gmail.com
Fixes: 7dab174e2e27 ("dax/hmem: Move hmem device registration to dax_hmem.ko")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24 17:57:59 -07:00
Markus Blöchl
67c632b4a7 timekeeping: Zero initialize system_counterval when querying time from phc drivers
Most drivers only populate the fields cycles and cs_id of system_counterval
in their get_time_fn() callback for get_device_system_crosststamp(), unless
they explicitly provide nanosecond values.

When the use_nsecs field was added to struct system_counterval, most
drivers did not care.  Clock sources other than CSID_GENERIC could then get
converted in convert_base_to_cs() based on an uninitialized use_nsecs field,
which usually results in -EINVAL during the following range check.

Pass in a fully zero initialized system_counterval_t to cure that.

Fixes: 6b2e29977518 ("timekeeping: Provide infrastructure for converting to/from a base clock")
Signed-off-by: Markus Blöchl <markus@blochl.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250720-timekeeping_uninit_crossts-v2-1-f513c885b7c2@blochl.de
2025-07-22 14:25:21 +02:00
Linus Torvalds
2013e8c2e6 Tracing fixes for 6.16:
- Fix timerlat with use of FORTIFY_SOURCE
 
   FORTIFY_SOURCE was added to the stack tracer where it compares the
   entry->caller array to having entry->size elements.
 
   timerlat has the following:
 
      memcpy(&entry->caller, fstack->calls, size);
      entry->size = size;
 
   Which triggers FORTIFY_SOURCE as the caller is populated before the
   entry->size is initialized.
 
   Swap the order to satisfy FORTIFY_SOURCE logic.
 
 - Add down_write(trace_event_sem) when adding trace events in modules
 
   Trace events being added to the ftrace_events array are protected by
   the trace_event_sem semaphore. But when loading modules that have
   trace events, the addition of the events are not protected by the
   semaphore and loading two modules that have events at the same time
   can corrupt the list.
 
   Also add a lockdep_assert_held(trace_event_sem) to
   _trace_add_event_dirs() to confirm its held when iterating the list.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaH06gBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qoJsAP0a+/E0f+5g7O/OtYPVEDSCREv1vj9c
 3dr0iWopqaOC7gEAw8Vc5iWIHKcB/JuJ+GqALoutL+lihruG26MWkFFsOgU=
 =zH5J
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix timerlat with use of FORTIFY_SOURCE

   FORTIFY_SOURCE was added to the stack tracer where it compares the
   entry->caller array to having entry->size elements.

   timerlat has the following:

      memcpy(&entry->caller, fstack->calls, size);
      entry->size = size;

   Which triggers FORTIFY_SOURCE as the caller is populated before the
   entry->size is initialized.

   Swap the order to satisfy FORTIFY_SOURCE logic.

 - Add down_write(trace_event_sem) when adding trace events in modules

   Trace events being added to the ftrace_events array are protected by
   the trace_event_sem semaphore. But when loading modules that have
   trace events, the addition of the events are not protected by the
   semaphore and loading two modules that have events at the same time
   can corrupt the list.

   Also add a lockdep_assert_held(trace_event_sem) to
   _trace_add_event_dirs() to confirm it is held when iterating the
   list.

* tag 'trace-v6.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Add down_write(trace_event_sem) when adding trace event
  tracing/osnoise: Fix crash in timerlat_dump_stack()
2025-07-20 13:03:31 -07:00
Linus Torvalds
62347e2790 A single fix for the scheduler. A recent commit changed the runqueue
counter nr_uninterruptible to an unsigned int. Due to the fact that the
 counters are not updated on migration of a uninterruptble task to a
 different CPU, these counters can exceed INT_MAX. The counter is cast to
 long in the load average calculation, which means that the cast expands
 into negative space resulting in bogus load average values. Convert it back
 to unsigned long to fix this.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmh82RYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoXNlEACyQxl/OCXSmKpcxrIvMjnalXh3Ibs2
 KqbZ3MA/hY1ZWhSKXBDMv8hkxE00PY8WXsPDtLeRz0n6GegbUx4zVsUQpzn24gRe
 uEl+qIuANaz5uMu2qlmQwSuxVQ0SDqLVFObrNgKQ554jJckjXKcgvrBjwczTw9lJ
 u6yTVLrknf0IsbQ79yxToaNf6jD3HcSIGX06Hs4EYucPzYd54VScr6LGGwux3rmI
 CN0RSA9RduUzcf/aKhe9/tmM6oss4tdByNuVSdx/yZ5QvVw5oOrWdI8sU0BlZsrC
 IjvqLvvLFKRPZre24UeFnIVxhtv+cwPXxrA6tR/SM2eUbHKzU8jHz2TlPu8xwOwb
 sZMMBjPx1Y+fpVQoGxEC1cEWbCUPSyQ7NGN2uS3Quk4XoQLRz6t//B+63rT65bfV
 wMAZpsgZ9/c23yDuPG11XxBMQYVbS0sFyLDymBC93Co8vh8PowDIf2xmgPFLxM78
 HdOCCsHsBPqZutQGQ7/qw//e7T/9fv0MDP8D4fuyLrhUdvBAvGHZp85T4YmOUcjp
 bTTE28Rw1sBgdAnTNjGWfwRN7Hwc1DzaLh40Um+nnjMA+yG9uq7gjTYDx4w6W/EA
 V70tx9tfJOM11g/qfIURkJDpf8PezJ3KHg1ZOiPn9e+I08LoPVMeIqbFEnJ9VAc7
 VStWpQd+SG99AQ==
 =Aqcr
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2025-07-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Thomas Gleixner:
 "A single fix for the scheduler.

  A recent commit changed the runqueue counter nr_uninterruptible to an
  unsigned int. Due to the fact that the counters are not updated on
  migration of a uninterruptble task to a different CPU, these counters
  can exceed INT_MAX.

  The counter is cast to long in the load average calculation, which
  means that the cast expands into negative space resulting in bogus
  load average values.

  Convert it back to unsigned long to fix this.

* tag 'sched-urgent-2025-07-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Change nr_uninterruptible type to unsigned long
2025-07-20 11:08:51 -07:00
Steven Rostedt
b5e8acc14d tracing: Add down_write(trace_event_sem) when adding trace event
When a module is loaded, it adds trace events defined by the module. It
may also need to modify the modules trace printk formats to replace enum
names with their values.

If two modules are loaded at the same time, the adding of the event to the
ftrace_events list can corrupt the walking of the list in the code that is
modifying the printk format strings and crash the kernel.

The addition of the event should take the trace_event_sem for write while
it adds the new event.

Also add a lockdep_assert_held() on that semaphore in
__trace_add_event_dirs() as it iterates the list.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/20250718223158.799bfc0c@batman.local.home
Reported-by: Fusheng Huang(黄富生)  <Fusheng.Huang@luxshare-ict.com>
Closes: https://lore.kernel.org/all/20250717105007.46ccd18f@batman.local.home/
Fixes: 110bf2b764eb6 ("tracing: add protection around module events unload")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-19 13:54:59 -04:00
Linus Torvalds
bf61759db4 sched_ext: Fixes for v6.16-rc6
- Fix handling of migration disabled tasks in default idle selection.
 
 - update_locked_rq() called __this_cpu_write() spuriously with NULL when @rq
   was not locked. As the writes were spurious, it didn't break anything
   directly. However, the function could be called in a preemptible leading
   to a context warning in __this_cpu_write(). Skip the spurious NULL writes.
 
 - Selftest fix on UP.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaHvPZw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGabMAP4jSAr4gYWEBOUaD9btwnPxZwlSiAEQtqBDBVRb
 /UunFAD/WBwUPk/u7BchLHjuH3sYW5gQb40kbtUnmNvB+RNUUgc=
 =3WAD
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - Fix handling of migration disabled tasks in default idle selection

 - update_locked_rq() called __this_cpu_write() spuriously with NULL
   when @rq was not locked. As the writes were spurious, it didn't break
   anything directly. However, the function could be called in a
   preemptible leading to a context warning in __this_cpu_write(). Skip
   the spurious NULL writes.

 - Selftest fix on UP

* tag 'sched_ext-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: idle: Handle migration-disabled tasks in idle selection
  sched/ext: Prevent update_locked_rq() calls with NULL rq
  selftests/sched_ext: Fix exit selftest hang on UP
2025-07-19 10:40:30 -07:00
Linus Torvalds
c64004df89 cgroup: Fixes for v6.16-rc6
An earlier commit to suppress a warning introduced a race condition where
 tasks can escape cgroup1 freezer. Revert the commit and simply remove the
 warning which was spurious to begin with.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaHvMvw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGadfAP0cT4QXwtw0VXyiNr5PMqxQ74rYsngJ+NevRbod
 fK6hIwD/T+owQc/ivYp5/N/XUgpT+Ixp7YRj2RIzQbL6SPjzOwE=
 =IlrN
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:
 "An earlier commit to suppress a warning introduced a race condition
  where tasks can escape cgroup1 freezer. Revert the commit and simply
  remove the warning which was spurious to begin with"

* tag 'cgroup-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  Revert "cgroup_freezer: cgroup_freezing: Check if not frozen"
  sched,freezer: Remove unnecessary warning in __thaw_task
2025-07-19 10:00:47 -07:00
Tomas Glozar
85a3bce695 tracing/osnoise: Fix crash in timerlat_dump_stack()
We have observed kernel panics when using timerlat with stack saving,
with the following dmesg output:

memcpy: detected buffer overflow: 88 byte write of buffer size 0
WARNING: CPU: 2 PID: 8153 at lib/string_helpers.c:1032 __fortify_report+0x55/0xa0
CPU: 2 UID: 0 PID: 8153 Comm: timerlatu/2 Kdump: loaded Not tainted 6.15.3-200.fc42.x86_64 #1 PREEMPT(lazy)
Call Trace:
 <TASK>
 ? trace_buffer_lock_reserve+0x2a/0x60
 __fortify_panic+0xd/0xf
 __timerlat_dump_stack.cold+0xd/0xd
 timerlat_dump_stack.part.0+0x47/0x80
 timerlat_fd_read+0x36d/0x390
 vfs_read+0xe2/0x390
 ? syscall_exit_to_user_mode+0x1d5/0x210
 ksys_read+0x73/0xe0
 do_syscall_64+0x7b/0x160
 ? exc_page_fault+0x7e/0x1a0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

__timerlat_dump_stack() constructs the ftrace stack entry like this:

struct stack_entry *entry;
...
memcpy(&entry->caller, fstack->calls, size);
entry->size = fstack->nr_entries;

Since commit e7186af7fb26 ("tracing: Add back FORTIFY_SOURCE logic to
kernel_stack event structure"), struct stack_entry marks its caller
field with __counted_by(size). At the time of the memcpy, entry->size
contains garbage from the ringbuffer, which under some circumstances is
zero, triggering a kernel panic by buffer overflow.

Populate the size field before the memcpy so that the out-of-bounds
check knows the correct size. This is analogous to
__ftrace_trace_stack().

Cc: stable@vger.kernel.org
Cc: John Kacur <jkacur@redhat.com>
Cc: Luis Goncalves <lgoncalv@redhat.com>
Cc: Attila Fazekas <afazekas@redhat.com>
Link: https://lore.kernel.org/20250716143601.7313-1-tglozar@redhat.com
Fixes: e7186af7fb26 ("tracing: Add back FORTIFY_SOURCE logic to kernel_stack event structure")
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-18 15:51:35 -04:00
Linus Torvalds
d786aba320 bpf-fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmh5r+QACgkQ6rmadz2v
 bTqrFA/+OW+z+vh8k43C9TVZttqC5poGcFqF5zRlYTArQT3AB+QuhG/CRjVQFbGL
 2YfVbq+5pxNo0I/FtCoWVui2OMf1UsRKKvM0pSn50yn3ytRfotZjQ/AWACm/9t5y
 fyRLBIS3ArjashQ9/S71tAIfG6l/B+FGX81wOVa1uL50ab15+4NrplhZHY421o9a
 lH2E2wnpy/BnrB9F/FO4iQbelixvBfMwj8epruhCVbipfx6BOKPMzKVtcm61FVT1
 hDsQZ0bIpVKgpRNBlTUHjVyzYo8oeXzqVhhY7hsmpHxJSiol7KLWyHEJD5ExS9Qg
 XVPK34b9IPgAfS8f/DgGAkWsAht7BMLsR0GUWyVIiacHHqTinRPVfWbzqWa5yjdD
 +8Vp4RVrcUONx69upx+IDrb4uMfQYktdpcvQtSl0SSinsG/INXurT1Vyz8aBPfkv
 WbiBeXhW/dCD9NuL5D9gnyZWaPXIAmbK7+pXJOSIpfKC24WRXTONDXhGP1b6ef31
 zHQu3r98ekYnHr3hbsvdHOWB7LKkJ1bcg2+OsmtYUUmnCiQTM1H8ILTwbSQ4EfXJ
 6iRxYeFp+VJOPScRzmNU/A3ibQWfV+foiO4S6hmazJOy3mmHX6hgPZoj2fjV8Ejf
 xZeOpQbCaZQCzbxxOdtjykwfe+zPWGnyRPnQpVIVdi7Abk1EZaE=
 =eUL7
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:

 - Fix handling of BPF arena relocations (Andrii Nakryiko)

 - Fix race in bpf_arch_text_poke() on s390 (Ilya Leoshkevich)

 - Fix use of virt_to_phys() on arm64 when mmapping BTF (Lorenz Bauer)

 - Reject %p% format string in bprintf-like BPF helpers (Paul Chaignon)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  libbpf: Fix handling of BPF arena relocations
  btf: Fix virt_to_phys() on arm64 when mmapping BTF
  selftests/bpf: Stress test attaching a BPF prog to another BPF prog
  s390/bpf: Fix bpf_arch_text_poke() with new_addr == NULL again
  selftests/bpf: Add negative test cases for snprintf
  bpf: Reject %p% format string in bprintf-like helpers
2025-07-18 11:46:26 -07:00
Lorenz Bauer
2e2713ae1a btf: Fix virt_to_phys() on arm64 when mmapping BTF
Breno Leitao reports that arm64 emits the following warning
with CONFIG_DEBUG_VIRTUAL:

    [   58.896157] virt_to_phys used for non-linear address: 000000009fea9737
      (__start_BTF+0x0/0x685530)
    [   23.988669] WARNING: CPU: 25 PID: 1442 at arch/arm64/mm/physaddr.c:15
      __virt_to_phys (arch/arm64/mm/physaddr.c:?)

        ...

    [   24.075371] Tainted: [E]=UNSIGNED_MODULE, [N]=TEST
    [   24.080276] Hardware name: Quanta S7GM 20S7GCU0010/S7G MB (CG1), BIOS 3D22
      07/03/2024
    [   24.088295] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
    [   24.098440] pc : __virt_to_phys (arch/arm64/mm/physaddr.c:?)
    [   24.105398] lr : __virt_to_phys (arch/arm64/mm/physaddr.c:?)

	...

    [   24.197257] Call trace:
    [   24.199761] __virt_to_phys (arch/arm64/mm/physaddr.c:?) (P)
    [   24.206883] btf_sysfs_vmlinux_mmap (kernel/bpf/sysfs_btf.c:27)
    [   24.214264] sysfs_kf_bin_mmap (fs/sysfs/file.c:179)
    [   24.218536] kernfs_fop_mmap (fs/kernfs/file.c:462)
    [   24.222461] mmap_region (./include/linux/fs.h:? mm/internal.h:167
       mm/vma.c:2405 mm/vma.c:2467 mm/vma.c:2622 mm/vma.c:2692)

It seems that the memory layout on arm64 maps the kernel image in vmalloc space
which is different than x86. This makes virt_to_phys emit the warning.

Fix this by translating the address using __pa_symbol as suggested by
Breno instead.

Reported-by: Breno Leitao <leitao@debian.org>
Closes: https://lore.kernel.org/bpf/g2gqhkunbu43awrofzqb4cs4sxkxg2i4eud6p4qziwrdh67q4g@mtw3d3aqfgmb/
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Tested-by: Breno Leitao <leitao@debian>
Fixes: a539e2a6d51d ("btf: Allow mmap of vmlinux btf")
Link: https://lore.kernel.org/r/20250717-vmlinux-mmap-pa-symbol-v1-1-970be6681158@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-17 11:33:52 -07:00
Andrea Righi
06efc9fe0b sched_ext: idle: Handle migration-disabled tasks in idle selection
When SCX_OPS_ENQ_MIGRATION_DISABLED is enabled, migration-disabled tasks
are also routed to ops.enqueue(). A scheduler may attempt to dispatch
such tasks directly to an idle CPU using the default idle selection
policy via scx_bpf_select_cpu_and() or scx_bpf_select_cpu_dfl().

This scenario must be properly handled by the built-in idle policy to
avoid returning an idle CPU where the target task isn't allowed to run.
Otherwise, it can lead to errors such as:

 EXIT: runtime error (SCX_DSQ_LOCAL[_ON] cannot move migration disabled Chrome_ChildIOT[291646] from CPU 3 to 14)

Prevent this by explicitly handling migration-disabled tasks in the
built-in idle selection logic, maintaining their CPU affinity.

Fixes: a730e3f7a48bc ("sched_ext: idle: Consolidate default idle CPU selection kfuncs")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-07-17 08:19:38 -10:00
Chen Ridong
14a67b42cb Revert "cgroup_freezer: cgroup_freezing: Check if not frozen"
This reverts commit cff5f49d433fcd0063c8be7dd08fa5bf190c6c37.

Commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if not
frozen") modified the cgroup_freezing() logic to verify that the FROZEN
flag is not set, affecting the return value of the freezing() function,
in order to address a warning in __thaw_task.

A race condition exists that may allow tasks to escape being frozen. The
following scenario demonstrates this issue:

CPU 0 (get_signal path)		CPU 1 (freezer.state reader)
try_to_freeze			read freezer.state
__refrigerator			freezer_read
				update_if_frozen
WRITE_ONCE(current->__state, TASK_FROZEN);
				...
				/* Task is now marked frozen */
				/* frozen(task) == true */
				/* Assuming other tasks are frozen */
				freezer->state |= CGROUP_FROZEN;
/* freezing(current) returns false */
/* because cgroup is frozen (not freezing) */
break out
__set_current_state(TASK_RUNNING);
/* Bug: Task resumes running when it should remain frozen */

The existing !frozen(p) check in __thaw_task makes the
WARN_ON_ONCE(freezing(p)) warning redundant. Removing this warning enables
reverting the commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check
if not frozen") to resolve the issue.

The warning has been removed in the previous patch. This patch revert the
commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if not
frozen") to complete the fix.

Fixes: cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if not frozen")
Reported-by: Zhong Jiawei<zhongjiawei1@huawei.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-07-17 07:57:02 -10:00
Chen Ridong
9beb8c5e77 sched,freezer: Remove unnecessary warning in __thaw_task
Commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if not
frozen") modified the cgroup_freezing() logic to verify that the FROZEN
flag is not set, affecting the return value of the freezing() function,
in order to address a warning in __thaw_task.

A race condition exists that may allow tasks to escape being frozen. The
following scenario demonstrates this issue:

CPU 0 (get_signal path)		CPU 1 (freezer.state reader)
try_to_freeze			read freezer.state
__refrigerator			freezer_read
				update_if_frozen
WRITE_ONCE(current->__state, TASK_FROZEN);
				...
				/* Task is now marked frozen */
				/* frozen(task) == true */
				/* Assuming other tasks are frozen */
				freezer->state |= CGROUP_FROZEN;
/* freezing(current) returns false */
/* because cgroup is frozen (not freezing) */
break out
__set_current_state(TASK_RUNNING);
/* Bug: Task resumes running when it should remain frozen */

The existing !frozen(p) check in __thaw_task makes the
WARN_ON_ONCE(freezing(p)) warning redundant. Removing this warning enables
reverting commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if
not frozen") to resolve the issue.

This patch removes the warning from __thaw_task. A subsequent patch will
revert commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if
not frozen") to complete the fix.

Reported-by: Zhong Jiawei<zhongjiawei1@huawei.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-07-17 07:56:50 -10:00
Linus Torvalds
e6e82e5bed Power management fixes for 6.16-rc7
- Fix a deadlock that may occur on asynchronous device suspend
    failures due to missing completion updates in error paths (Rafael
    Wysocki).
 
  - Drop a misplaced pm_restore_gfp_mask() call, which may cause
    swap to be accessed too early if system suspend fails, from
    suspend_devices_and_enter() (Rafael Wysocki).
 
  - Remove duplicate filesystems_freeze/thaw() calls, which sometimes
    cause systems to be unable to resume, from enter_state() (Zihuan
    Zhang).
 -----BEGIN PGP SIGNATURE-----
 
 iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmh5IE4SHHJqd0Byand5
 c29ja2kubmV0AAoJEO5fvZ0v1OO12LYH/3CULHOIoshuWu+G9nIKokqO0oNYmxh1
 qgkh+o9sBz9uTyfCSd1qDT9j1LjzUnOJUe67IzHJFuZcHbnWU4k9VYWV+H8TKyNp
 CcQ+9g5gCqOzxWH7G7C2ekciSnnBlObwJ7ZsDlUOeuJ16GVCjqrFPZbJ6No0A+Hz
 8Ed7R4o1MKrURLU9IZWpqV1a54Z9ySv2yrx9T4G0c8WV2VRJZJ76e1hAGcOr4owj
 kM1+MPnsfU/RvBUUEKjUEm70ZBXGbXT+D9p/L/AuoYyhI94kvoImK1/2An5noHCO
 czK5nDB867z6hu5jTVPt/RoIK/49H/a2CDNYl3ZiZnVVZIoPN/wt3C8=
 =wkHb
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These address three issues introduced during the current development
  cycle and related to system suspend and hibernation, one triggering
  when asynchronous suspend of devices fails, one possibly affecting
  memory management in the core suspend code error path, and one due to
  duplicate filesystems freezing during system suspend:

   - Fix a deadlock that may occur on asynchronous device suspend
     failures due to missing completion updates in error paths (Rafael
     Wysocki)

   - Drop a misplaced pm_restore_gfp_mask() call, which may cause swap
     to be accessed too early if system suspend fails, from
     suspend_devices_and_enter() (Rafael Wysocki)

   - Remove duplicate filesystems_freeze/thaw() calls, which sometimes
     cause systems to be unable to resume, from enter_state() (Zihuan
     Zhang)"

* tag 'pm-6.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: sleep: Update power.completion for all devices on errors
  PM: suspend: clean up redundant filesystems_freeze/thaw() handling
  PM: suspend: Drop a misplaced pm_restore_gfp_mask() call
2025-07-17 09:46:37 -07:00
Breno Leitao
e14fd98c6d sched/ext: Prevent update_locked_rq() calls with NULL rq
Avoid invoking update_locked_rq() when the runqueue (rq) pointer is NULL
in the SCX_CALL_OP and SCX_CALL_OP_RET macros.

Previously, calling update_locked_rq(NULL) with preemption enabled could
trigger the following warning:

    BUG: using __this_cpu_write() in preemptible [00000000]

This happens because __this_cpu_write() is unsafe to use in preemptible
context.

rq is NULL when an ops invoked from an unlocked context. In such cases, we
don't need to store any rq, since the value should already be NULL
(unlocked). Ensure that update_locked_rq() is only called when rq is
non-NULL, preventing calling __this_cpu_write() on preemptible context.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Fixes: 18853ba782bef ("sched_ext: Track currently locked rq")
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v6.15
2025-07-16 15:02:12 -10:00
Nathan Chancellor
1ed171a3af tracing/probes: Avoid using params uninitialized in parse_btf_arg()
After a recent change in clang to strengthen uninitialized warnings [1],
it points out that in one of the error paths in parse_btf_arg(), params
is used uninitialized:

  kernel/trace/trace_probe.c:660:19: warning: variable 'params' is uninitialized when used here [-Wuninitialized]
    660 |                         return PTR_ERR(params);
        |                                        ^~~~~~

Match many other NO_BTF_ENTRY error cases and return -ENOENT, clearing
up the warning.

Link: https://lore.kernel.org/all/20250715-trace_probe-fix-const-uninit-warning-v1-1-98960f91dd04@kernel.org/

Cc: stable@vger.kernel.org
Closes: https://github.com/ClangBuiltLinux/linux/issues/2110
Fixes: d157d7694460 ("tracing/probes: Support BTF field access from $retval")
Link: 2464313eef [1]
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-07-16 14:01:54 +09:00
Zihuan Zhang
228b9deded PM: suspend: clean up redundant filesystems_freeze/thaw() handling
The recently introduced support for freezing filesystems during system
suspend included calls to filesystems_freeze() in both suspend_prepare()
and enter_state(), as well as calls to filesystems_thaw() in both
suspend_finish() and the Unlock path in enter_state(). These are
redundant.

Moreover, calling filesystems_freeze() twice, from both suspend_prepare()
and enter_state(), leads to a black screen and makes the system unable
to resume in some cases.

Address this as follows:

 - filesystems_freeze() is already called in suspend_prepare(), which
   is the proper and consistent place to handle pre-suspend operations.
   The second call in enter_state() is unnecessary and so remove it.

 - filesystems_thaw() is invoked in suspend_finish(), which covers
   successful suspend/resume paths. In the failure case, add a call
   to filesystems_thaw() only when needed, avoiding the duplicate call
   in the general Unlock path.

This change simplifies the suspend code and avoids repeated freeze/thaw
calls, while preserving correct ordering and behavior.

Fixes: eacfbf74196f ("power: freeze filesystems during suspend/resume")
Signed-off-by: Zihuan Zhang <zhangzihuan@kylinos.cn>
Link: https://patch.msgid.link/20250712030824.81474-1-zhangzihuan@kylinos.cn
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-07-15 14:55:00 +02:00
Rafael J. Wysocki
75b63ce2c9 PM: suspend: Drop a misplaced pm_restore_gfp_mask() call
The pm_restore_gfp_mask() call added by commit 12ffc3b1513e ("PM:
Restrict swap use to later in the suspend sequence") to
suspend_devices_and_enter() is done too early because it takes
place before calling dpm_resume() in dpm_resume_end() and some
swap-backing devices may not be ready at that point.  Moreover,
dpm_resume_end() called subsequently in the same code path invokes
pm_restore_gfp_mask() again and calling it twice in a row is
pointless.

Drop the misplaced pm_restore_gfp_mask() call from
suspend_devices_and_enter() to address this issue.

Fixes: 12ffc3b1513e ("PM: Restrict swap use to later in the suspend sequence")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://patch.msgid.link/2810409.mvXUDI8C0e@rjwysocki.net
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-07-15 14:54:26 +02:00
Aruna Ramakrishna
36569780b0 sched: Change nr_uninterruptible type to unsigned long
The commit e6fe3f422be1 ("sched: Make multiple runqueue task counters
32-bit") changed nr_uninterruptible to an unsigned int. But the
nr_uninterruptible values for each of the CPU runqueues can grow to
large numbers, sometimes exceeding INT_MAX. This is valid, if, over
time, a large number of tasks are migrated off of one CPU after going
into an uninterruptible state. Only the sum of all nr_interruptible
values across all CPUs yields the correct result, as explained in a
comment in kernel/sched/loadavg.c.

Change the type of nr_uninterruptible back to unsigned long to prevent
overflows, and thus the miscalculation of load average.

Fixes: e6fe3f422be1 ("sched: Make multiple runqueue task counters 32-bit")

Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250709173328.606794-1-aruna.ramakrishna@oracle.com
2025-07-14 10:59:31 +02:00
Linus Torvalds
0a197b7576 - Prevent perf_sigtrap() from observing an exiting task and warning
about it
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhzhRwACgkQEsHwGGHe
 VUpz3xAAwLM/anyGvUXMP9Wx3X3kXM0NMw0NfkSucE22p7R//DfVWwLgz26hE8VX
 haUp8Idnmp2EV6BsA/6SmslSbzYeBNKSIWwQiRI67goIACK0MDqzYkTlTnS170BU
 bw4Lvd5PzA9DZiHTpqNCmg2m9ouGCQHOsxfKzs6DUMgpca6y8imFYW3gMZwGyA0m
 Zkanw+BqPAhYSbTjZ1ZgtYmtPuvunXu/K/meEbonW2prRXfhZ9tbU31q/iRtHy3c
 v9nRiK83pbqIJHUfYraTlfvmkMz975gvuN7mtmj5H2v+gc1S9adp1VrcI3tYFZDB
 d5FVDsQpqWNBwewHG6VyYww6wCfm0IRg6Ys+rujhYl/ICjmthDUQJg9tZnZ03yxz
 sv/b9E84Nq5BiJeLB88Ue0vRhH4ct9MTUr3Z9zhSKGN3IArpMPOm2wlmUmPoGerZ
 g1jR80oG6Ikwl660MBpEX9yqyG8P5b7jnLGMJC0QJX9c0TLSY8yQgnnWKDVUvLVx
 SrCjACQ2PnxqS9v2RlB3omoBYeEZfv7coy/vKIXZygXg7aNaDMemUktveihjUxLg
 MLdOTH3Ygq5lASIGhhjqJenZ/844XIhZ9rgUcz3HLzlkXLGt8NmjOT+rW1IHAZHo
 VpKbYPedr4uTqcxSDp4Qr/53m9xm4OTPe+3XEkXmeN5WQXbEh4A=
 =vP1/
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fix from Borislav Petkov:

 - Prevent perf_sigtrap() from observing an exiting task and warning
   about it

* tag 'perf_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix WARN in perf_sigtrap()
2025-07-13 10:34:47 -07:00
Linus Torvalds
3f31a806a6 19 hotfixes. A whopping 16 are cc:stable and the remainder address
post-6.15 issues or aren't considered necessary for -stable kernels.
 
 14 are for MM.  Three gdb-script fixes and a kallsyms build fix.
 -----BEGIN PGP SIGNATURE-----
 
 iHQEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaHGbTgAKCRDdBJ7gKXxA
 jowqAPiCWBFfcFaX20BxVaMU1PjC3Lh9llDXqQwBhBNdcadSAP44SGQ8nrfV+piB
 OcNz2AEwBBfS354G0Etlh4k08YoAAw==
 =IDDc
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2025-07-11-16-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "19 hotfixes. A whopping 16 are cc:stable and the remainder address
  post-6.15 issues or aren't considered necessary for -stable kernels.

  14 are for MM.  Three gdb-script fixes and a kallsyms build fix"

* tag 'mm-hotfixes-stable-2025-07-11-16-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  Revert "sched/numa: add statistics of numa balance task"
  mm: fix the inaccurate memory statistics issue for users
  mm/damon: fix divide by zero in damon_get_intervals_score()
  samples/damon: fix damon sample mtier for start failure
  samples/damon: fix damon sample wsse for start failure
  samples/damon: fix damon sample prcl for start failure
  kasan: remove kasan_find_vm_area() to prevent possible deadlock
  scripts: gdb: vfs: support external dentry names
  mm/migrate: fix do_pages_stat in compat mode
  mm/damon/core: handle damon_call_control as normal under kdmond deactivation
  mm/rmap: fix potential out-of-bounds page table access during batched unmap
  mm/hugetlb: don't crash when allocating a folio if there are no resv
  scripts/gdb: de-reference per-CPU MCE interrupts
  scripts/gdb: fix interrupts.py after maple tree conversion
  maple_tree: fix mt_destroy_walk() on root leaf node
  mm/vmalloc: leave lazy MMU mode on PTE mapping error
  scripts/gdb: fix interrupts display after MCP on x86
  lib/alloc_tag: do not acquire non-existent lock in alloc_tag_top_users()
  kallsyms: fix build without execinfo
2025-07-12 10:30:47 -07:00
Linus Torvalds
a0f8361c3c dma-mapping fix for Linux 6.16
- small fix relevant to arm64 server and custom CMA configuration
   (Feng Tang)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaHCzdQAKCRCJp1EFxbsS
 RMrMAQDghOwKZqYuC27kJt5T7lgG47YCNE5em1v8WsTSvwQAugEA4AlWIpqQ34eI
 Es6ObfMt8Q9gArubFZ0ZDFtmZq9NpA0=
 =+z0i
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.16-2025-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux

Pull dma-mapping fix from Marek Szyprowski:

 - small fix relevant to arm64 server and custom CMA configuration (Feng
   Tang)

* tag 'dma-mapping-6.16-2025-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  dma-contiguous: hornor the cma address limit setup by user
2025-07-11 08:49:25 -07:00
Chen Yu
db6cc3f4ac Revert "sched/numa: add statistics of numa balance task"
This reverts commit ad6b26b6a0a79166b53209df2ca1cf8636296382.

This commit introduces per-memcg/task NUMA balance statistics, but
unfortunately it introduced a NULL pointer exception due to the following
race condition: After a swap task candidate was chosen, its mm_struct
pointer was set to NULL due to task exit.  Later, when performing the
actual task swapping, the p->mm caused the problem.

CPU0                                   CPU1
:
...
task_numa_migrate
     task_numa_find_cpu
      task_numa_compare
        # a normal task p is chosen
        env->best_task = p

                                          # p exit:
                                          exit_signals(p);
                                             p->flags |= PF_EXITING
                                          exit_mm
                                             p->mm = NULL;

      migrate_swap_stop
        __migrate_swap_task((arg->src_task, arg->dst_cpu)
         count_memcg_event_mm(p->mm, NUMA_TASK_SWAP)# p->mm is NULL

task_lock() should be held and the PF_EXITING flag needs to be checked to
prevent this from happening.  After discussion, the conclusion was that
adding a lock is not worthwhile for some statistics calculations.  Revert
the change and rely on the tracepoint for this purpose.

Link: https://lkml.kernel.org/r/20250704135620.685752-1-yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/20250708064917.BBD13C4CEED@smtp.kernel.org
Fixes: ad6b26b6a0a7 ("sched/numa: add statistics of numa balance task")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Reported-by: Jirka Hladky <jhladky@redhat.com>
Closes: https://lore.kernel.org/all/CAE4VaGBLJxpd=NeRJXpSCuw=REhC5LWJpC29kDy-Zh2ZDyzQZA@mail.gmail.com/
Reported-by: Srikanth Aithal <Srikanth.Aithal@amd.com>
Reported-by: Suneeth D <Suneeth.D@amd.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Hladky <jhladky@redhat.com>
Cc: Libo Chen <libo.chen@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 21:07:56 -07:00
Tetsuo Handa
3da6bb4197 perf/core: Fix WARN in perf_sigtrap()
Since exit_task_work() runs after perf_event_exit_task_context() updated
ctx->task to TASK_TOMBSTONE, perf_sigtrap() from perf_pending_task() might
observe event->ctx->task == TASK_TOMBSTONE.

Swap the early exit tests in order not to hit WARN_ON_ONCE().

Closes: https://syzkaller.appspot.com/bug?extid=2fe61cb2a86066be6985
Reported-by: syzbot <syzbot+2fe61cb2a86066be6985@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/b1c224bd-97f9-462c-a3e3-125d5e19c983@I-love.SAKURA.ne.jp
2025-07-09 13:40:17 +02:00
Sebastian Andrzej Siewior
570db4b39f module: Make sure relocations are applied to the per-CPU section
The per-CPU data section is handled differently than the other sections.
The memory allocations requires a special __percpu pointer and then the
section is copied into the view of each CPU. Therefore the SHF_ALLOC
flag is removed to ensure move_module() skips it.

Later, relocations are applied and apply_relocations() skips sections
without SHF_ALLOC because they have not been copied. This also skips the
per-CPU data section.
The missing relocations result in a NULL pointer on x86-64 and very
small values on x86-32. This results in a crash because it is not
skipped like NULL pointer would and can't be dereferenced.

Such an assignment happens during static per-CPU lock initialisation
with lockdep enabled.

Allow relocation processing for the per-CPU section even if SHF_ALLOC is
missing.

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202506041623.e45e4f7d-lkp@intel.com
Fixes: 1a6100caae425 ("Don't relocate non-allocated regions in modules.") #v2.6.1-rc3
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Link: https://lore.kernel.org/r/20250610163328.URcsSUC1@linutronix.de
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Message-ID: <20250610163328.URcsSUC1@linutronix.de>
2025-07-08 20:52:30 +02:00
Petr Pavlu
eb0994a954 module: Avoid unnecessary return value initialization in move_module()
All error conditions in move_module() set the return value by updating the
ret variable. Therefore, it is not necessary to the initialize the variable
when declaring it.

Remove the unnecessary initialization.

Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Link: https://lore.kernel.org/r/20250618122730.51324-3-petr.pavlu@suse.com
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Message-ID: <20250618122730.51324-3-petr.pavlu@suse.com>
2025-07-08 20:52:29 +02:00
Petr Pavlu
ca3881f6fd module: Fix memory deallocation on error path in move_module()
The function move_module() uses the variable t to track how many memory
types it has allocated and consequently how many should be freed if an
error occurs.

The variable is initially set to 0 and is updated when a call to
module_memory_alloc() fails. However, move_module() can fail for other
reasons as well, in which case t remains set to 0 and no memory is freed.

Fix the problem by initializing t to MOD_MEM_NUM_TYPES. Additionally, make
the deallocation loop more robust by not relying on the mod_mem_type_t enum
having a signed integer as its underlying type.

Fixes: c7ee8aebf6c0 ("module: add stop-grap sanity check on module memcpy()")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Link: https://lore.kernel.org/r/20250618122730.51324-2-petr.pavlu@suse.com
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Message-ID: <20250618122730.51324-2-petr.pavlu@suse.com>
2025-07-08 20:52:29 +02:00
Al Viro
a683a5b2ba
fold fs_struct->{lock,seq} into a seqlock
The combination of spinlock_t lock and seqcount_spinlock_t seq
in struct fs_struct is an open-coded seqlock_t (see linux/seqlock_types.h).
	Combine and switch to equivalent seqlock_t primitives.  AFAICS,
that does end up with the same sequence of underlying operations in all
cases.
	While we are at it, get_fs_pwd() is open-coded verbatim in
get_path_from_fd(); rather than applying conversion to it, replace with
the call of get_fs_pwd() there.  Not worth splitting the commit for that,
IMO...

	A bit of historical background - conversion of seqlock_t to
use of seqcount_spinlock_t happened several months after the same
had been done to struct fs_struct; switching fs_struct to seqlock_t
could've been done immediately after that, but it looks like nobody
had gotten around to that until now.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/20250702053437.GC1880847@ZenIV
Acked-by: Ahmed S. Darwish <darwi@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-08 10:25:19 +02:00
Linus Torvalds
772b78c2ab - Fix the calculation of the deadline server task's runtime as this mishap was
preventing realtime tasks from running
 
 - Avoid a race condition during migrate-swapping two tasks
 
 - Fix the string reported for the "none" dynamic preemption option
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhqM/8ACgkQEsHwGGHe
 VUqxng/+P/CQXrijxNOTSlN0NeDfuVPMtpmaijDONxa+m/BAxDjNKVuJefZY/tGa
 jV14hTUMIQkrjuSapIdN2Io02dK7p371ozsOxjNB+kJvDI6kKkOkOn1tWLOGyI+e
 oTIrpJvuxTkVmJOud+3Bl6OR/k+mrQ2R5ud5xJ/exgmBz+wRaRMxIYwQBlmCAZ7I
 uzrR94VL++sZdIuWrBt/5qFQMiwJ3xdrruhz/wdWoq6OQJovNECV1TGFZifKh2Rh
 4DXoMR46gPRXV0r5JoP8BSyw0V2PGwFnVoM3PsOCcN1guJgdiKszCGp89lzN5Z2x
 ySDegu6rnpYoaCmQLjBngGlzBnaEKWKUz9IYrXr/qGjVR8GIvoWjAhOQWvbXjyS2
 5CHRsUBlSJhwlTPJc5RGt8+O9ahWkBGPBCSsnImygTMGl2JIxsZUEEv8ELxaUq5K
 qTAZKYBwzOb2aA3FNe51Pwpz8SI3TKcDLWujHvcNeOSlbO23Bg/TTa3OCy1c3gGg
 HJ7dKw5lSi89VzKhpWwhqBKL1vu/fuVTZ52GCu0BiiwYfCVJwYD40vNNKgiiG1oq
 X2Sr4DUCtwzpFcIMfo9yJ9scqaT5gJywydnB4+oHlbg5OCLDOuWCs0EGGOCPd4LY
 Gi3ft9MBepwYeuCv7DELKKO62jIrlDeOU2FmW+9/RC7/z5egWI4=
 =ubM9
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

 - Fix the calculation of the deadline server task's runtime as this
   mishap was preventing realtime tasks from running

 - Avoid a race condition during migrate-swapping two tasks

 - Fix the string reported for the "none" dynamic preemption option

* tag 'sched_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Fix dl_server runtime calculation formula
  sched/core: Fix migrate_swap() vs. hotplug
  sched: Fix preemption string of preempt_dynamic_none
2025-07-06 11:17:47 -07:00
Linus Torvalds
a1639ce5e5 - Revert uprobes to using CAP_SYS_ADMIN again as currently they can
destructively modify kernel code from an unprivileged process
 
 - Move a warning to where it belongs
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhqMXoACgkQEsHwGGHe
 VUrSMRAAhwOkHt/Snwd2iT4G7xwbrOvW8q5Ed3WnlHSvy8ygkdbDEF0WPAuQasb6
 qc8iTuQv4i96UtHXWacIuk+q+P/oS/n1wdAyU8nGEavQZGQCaGaTw7gPxy1YcBJB
 mZGtP5HpIIlZpH74lvbBp8Q7T+BJFYSYt+KL2e7Qrc+AyuBUWSvaMnRkT7Ek520S
 aOY75BO104SI2QLY4lx9fTlumgowX44a1tJNEYjntAoMQd+MFMi71zP2FdObtYe6
 Cv4YAU6tSYZML0cOs+YJi50Qk0qg9EGOZKGopuFEs5jIXp4fXCTlCyowqQWFeC5M
 MNlHH1sg2mp+PdDYRatQiarO4gXDMhsT+G+K+TRtBtNwuL0WnbSUxJyNPOkCxfwQ
 nBup5knS9vzPXtuox3az8pYr/VS3H0efBVnDElwG/FhsYHGbOBfOLz54iv3j+Bbe
 CylXJPYPWfJ2UvIJeGRI9NJ3pGKHBLkvUzkwAsGBouCrrZcZIQZeXS1h2IggxDXf
 ooD66aAPcYMgIDKxlpVa8BSYlFzrB0+eq1CsuxoHbr/UcfSjaWSvK1qY+b6EoSaT
 R6L60vuSXyX5s9sHQ8QZoL3qkYIb4oCy8zTFlFjZry8vwHEL4XlzgHq+PTE5oXv4
 VT1uoiybzrx3/X7etz+AO7Vmd/yasyxZSpzd7b6+FVvLhUMXoKg=
 =dYTn
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:

 - Revert uprobes to using CAP_SYS_ADMIN again as currently they can
   destructively modify kernel code from an unprivileged process

 - Move a warning to where it belongs

* tag 'perf_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Revert to requiring CAP_SYS_ADMIN for uprobes
  perf/core: Fix the WARN_ON_ONCE is out of lock protected region
2025-07-06 10:49:27 -07:00
Rafael J. Wysocki
250d0579da Merge branch 'pm-sleep'
Merge fixes related to system sleep for 6.16-rc5:

 - Fix typo in the ABI documentation (Sumanth Gavini).

 - Allow swap to be used a bit longer during system suspend and
   hibernation to avoid suspend failures under memory pressure (Mario
   Limonciello).

* pm-sleep:
  PM: sleep: docs: Replace "diasble" with "disable"
  PM: Restrict swap use to later in the suspend sequence
2025-07-04 21:54:55 +02:00
kuyo chang
fc975cfb36 sched/deadline: Fix dl_server runtime calculation formula
In our testing with 6.12 based kernel on a big.LITTLE system, we were
seeing instances of RT tasks being blocked from running on the LITTLE
cpus for multiple seconds of time, apparently by the dl_server. This
far exceeds the default configured 50ms per second runtime.

This is due to the fair dl_server runtime calculation being scaled
for frequency & capacity of the cpu.

Consider the following case under a Big.LITTLE architecture:
Assume the runtime is: 50,000,000 ns, and Frequency/capacity
scale-invariance defined as below:
Frequency scale-invariance: 100
Capacity scale-invariance: 50
First by Frequency scale-invariance,
the runtime is scaled to 50,000,000 * 100 >> 10 = 4,882,812
Then by capacity scale-invariance,
it is further scaled to 4,882,812 * 50 >> 10 = 238,418.
So it will scaled to 238,418 ns.

This smaller "accounted runtime" value is what ends up being
subtracted against the fair-server's runtime for the current period.
Thus after 50ms of real time, we've only accounted ~238us against the
fair servers runtime. This 209:1 ratio in this example means that on
the smaller cpu the fair server is allowed to continue running,
blocking RT tasks, for over 10 seconds before it exhausts its supposed
50ms of runtime.  And on other hardware configurations it can be even
worse.

For the fair deadline_server, to prevent realtime tasks from being
unexpectedly delayed, we really do want to use fixed time, and not
scaled time for smaller capacity/frequency cpus. So remove the scaling
from the fair server's accounting to fix this.

Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: John Stultz <jstultz@google.com>
Signed-off-by: kuyo chang <kuyo.chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: John Stultz <jstultz@google.com>
Tested-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20250702021440.2594736-1-kuyo.chang@mediatek.com
2025-07-04 10:35:56 +02:00
Peter Zijlstra
ba677dbe77 perf: Revert to requiring CAP_SYS_ADMIN for uprobes
Jann reports that uprobes can be used destructively when used in the
middle of an instruction. The kernel only verifies there is a valid
instruction at the requested offset, but due to variable instruction
length cannot determine if this is an instruction as seen by the
intended execution stream.

Additionally, Mark Rutland notes that on architectures that mix data
in the text segment (like arm64), a similar things can be done if the
data word is 'mistaken' for an instruction.

As such, require CAP_SYS_ADMIN for uprobes.

Fixes: c9e0924e5c2b ("perf/core: open access to probes for CAP_PERFMON privileged process")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/CAG48ez1n4520sq0XrWYDHKiKxE_+WCfAK+qt9qkY4ZiBGmL-5g@mail.gmail.com
2025-07-03 10:33:55 +02:00
Song Liu
5bc9557c9f
bpf: Mark cgroup_subsys_state->cgroup RCU safe
Mark struct cgroup_subsys_state->cgroup as safe under RCU read lock. This
will enable accessing css->cgroup from a bpf css iterator.

Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/20250623063854.1896364-4-song@kernel.org
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-02 14:18:20 +02:00
Song Liu
b95ee9049c
bpf: Introduce bpf_cgroup_read_xattr to read xattr of cgroup's node
BPF programs, such as LSM and sched_ext, would benefit from tags on
cgroups. One common practice to apply such tags is to set xattrs on
cgroupfs folders.

Introduce kfunc bpf_cgroup_read_xattr, which allows reading cgroup's
xattr.

Note that, we already have bpf_get_[file|dentry]_xattr. However, these
two APIs are not ideal for reading cgroupfs xattrs, because:

  1) These two APIs only works in sleepable contexts;
  2) There is no kfunc that matches current cgroup to cgroupfs dentry.

bpf_cgroup_read_xattr is generic and can be useful for many program
types. It is also safe, because it requires trusted or rcu protected
argument (KF_RCU). Therefore, we make it available to all program types.

Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/20250623063854.1896364-3-song@kernel.org
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-02 14:18:20 +02:00
Paul Chaignon
f824274587 bpf: Reject %p% format string in bprintf-like helpers
static const char fmt[] = "%p%";
    bpf_trace_printk(fmt, sizeof(fmt));

The above BPF program isn't rejected and causes a kernel warning at
runtime:

    Please remove unsupported %\x00 in format string
    WARNING: CPU: 1 PID: 7244 at lib/vsprintf.c:2680 format_decode+0x49c/0x5d0

This happens because bpf_bprintf_prepare skips over the second %,
detected as punctuation, while processing %p. This patch fixes it by
not skipping over punctuation. %\x00 is then processed in the next
iteration and rejected.

Reported-by: syzbot+e2c932aec5c8a6e1d31c@syzkaller.appspotmail.com
Fixes: 48cac3f4a96d ("bpf: Implement formatted output helpers with bstr_printf")
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/a0e06cc479faec9e802ae51ba5d66420523251ee.1751395489.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-01 15:22:46 -07:00
Peter Zijlstra
009836b4fa sched/core: Fix migrate_swap() vs. hotplug
On Mon, Jun 02, 2025 at 03:22:13PM +0800, Kuyo Chang wrote:

> So, the potential race scenario is:
>
> 	CPU0							CPU1
> 	// doing migrate_swap(cpu0/cpu1)
> 	stop_two_cpus()
> 							  ...
> 							 // doing _cpu_down()
> 							      sched_cpu_deactivate()
> 								set_cpu_active(cpu, false);
> 								balance_push_set(cpu, true);
> 	cpu_stop_queue_two_works
> 	    __cpu_stop_queue_work(stopper1,...);
> 	    __cpu_stop_queue_work(stopper2,..);
> 	stop_cpus_in_progress -> true
> 		preempt_enable();
> 								...
> 							1st balance_push
> 							stop_one_cpu_nowait
> 							cpu_stop_queue_work
> 							__cpu_stop_queue_work
> 							list_add_tail  -> 1st add push_work
> 							wake_up_q(&wakeq);  -> "wakeq is empty.
> 										This implies that the stopper is at wakeq@migrate_swap."
> 	preempt_disable
> 	wake_up_q(&wakeq);
> 	        wake_up_process // wakeup migrate/0
> 		    try_to_wake_up
> 		        ttwu_queue
> 		            ttwu_queue_cond ->meet below case
> 		                if (cpu == smp_processor_id())
> 			         return false;
> 			ttwu_do_activate
> 			//migrate/0 wakeup done
> 		wake_up_process // wakeup migrate/1
> 	           try_to_wake_up
> 		    ttwu_queue
> 			ttwu_queue_cond
> 		        ttwu_queue_wakelist
> 			__ttwu_queue_wakelist
> 			__smp_call_single_queue
> 	preempt_enable();
>
> 							2nd balance_push
> 							stop_one_cpu_nowait
> 							cpu_stop_queue_work
> 							__cpu_stop_queue_work
> 							list_add_tail  -> 2nd add push_work, so the double list add is detected
> 							...
> 							...
> 							cpu1 get ipi, do sched_ttwu_pending, wakeup migrate/1
>

So this balance_push() is part of schedule(), and schedule() is supposed
to switch to stopper task, but because of this race condition, stopper
task is stuck in WAKING state and not actually visible to be picked.

Therefore CPU1 can do another schedule() and end up doing another
balance_push() even though the last one hasn't been done yet.

This is a confluence of fail, where both wake_q and ttwu_wakelist can
cause crucial wakeups to be delayed, resulting in the malfunction of
balance_push.

Since there is only a single stopper thread to be woken, the wake_q
doesn't really add anything here, and can be removed in favour of
direct wakeups of the stopper thread.

Then add a clause to ttwu_queue_cond() to ensure the stopper threads
are never queued / delayed.

Of all 3 moving parts, the last addition was the balance_push()
machinery, so pick that as the point the bug was introduced.

Fixes: 2558aacff858 ("sched/hotplug: Ensure only per-cpu kthreads run during hotplug")
Reported-by: Kuyo Chang <kuyo.chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kuyo Chang <kuyo.chang@mediatek.com>
Link: https://lkml.kernel.org/r/20250605100009.GO39944@noisy.programming.kicks-ass.net
2025-07-01 15:02:03 +02:00
Thomas Weißschuh
3ebb1b6522 sched: Fix preemption string of preempt_dynamic_none
Zero is a valid value for "preempt_dynamic_mode", namely
"preempt_dynamic_none".

Fix the off-by-one in preempt_model_str(), so that "preempty_dynamic_none"
is correctly formatted as PREEMPT(none) instead of PREEMPT(undef).

Fixes: 8bdc5daaa01e ("sched: Add a generic function to return the preemption string")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20250626-preempt-str-none-v2-1-526213b70a89@linutronix.de
2025-07-01 15:02:02 +02:00
Luo Gengkun
7b4c5a3754 perf/core: Fix the WARN_ON_ONCE is out of lock protected region
commit 3172fb986666 ("perf/core: Fix WARN in perf_cgroup_switch()") try to
fix a concurrency problem between perf_cgroup_switch and
perf_cgroup_event_disable. But it does not to move the WARN_ON_ONCE into
lock-protected region, so the warning is still be triggered.

Fixes: 3172fb986666 ("perf/core: Fix WARN in perf_cgroup_switch()")
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250626135403.2454105-1-luogengkun@huaweicloud.com
2025-06-30 09:32:49 +02:00
Linus Torvalds
2fc18d0b89 - Make sure an AUX perf event is really disabled when it overruns
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhg/fMACgkQEsHwGGHe
 VUpi5BAAwBTf3vpsGZvVQNhZhTM9uy9EG0ZmNzPihhJ+e2Ko4BMlWmnBfB0olYgN
 SUBypUQQwkneh5qnUnNe7MEsFof2NONRK4EBwr2l2GWcO8YhEKe6DH+ow+wT+fB0
 B5ifBiEGua1Cv+G276c54WJr35Tkc7XqyfRorvT5LdmynbawU7raS1JK7lQRmKFD
 TzBcTqb8OSTq3tJ+G3eXB5rA9XbYd/TeVCDWYXGOl+BhCt1hnHph+p1xEz/o5PAV
 orCbR8tgv0+tBCvsnSDGQ3TEfAqdPnGYOzIyXte5r9/FaXPhyL8K8x3ixVx1zjnE
 8i+HCUvK7aQs0jFuQ6rfIGnKwNURmM8qVjL65MsFglTJenfXwa7WBYti7dlKUai3
 riaW0FQaEmRt5UhadB3OZJFMzQXKw3ZsxUHjTeYKlx8csangdb03pzwVvMz2o0VO
 xAhJ1i0jgRXaMOFOORtzU7FOZFUuhV8pDKergSObMpimmMG69reNU3MAZPJToYaO
 0Dxx2R/yWsnZMUctVWkcQPL5Qb2e63ecTcYOBUsMfOBuj2WNNLSnh9z6VmHPcT22
 n5nmeAwcGFD33C7CqyT76ruY2687pQi6DxvWxF3ED8vNOkXnP/URkHjpMcRA9fr0
 rUvglIeAxZSXus79ScMy+9Yu985AMljn6ZuMKlGapMWw4+BQAVQ=
 =yQqt
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fix from Borislav Petkov:

 - Make sure an AUX perf event is really disabled when it overruns

* tag 'perf_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/aux: Fix pending disable flow when the AUX ring buffer overruns
2025-06-29 08:16:02 -07:00
Linus Torvalds
ded779017a tracing fixes for v6.16:
- Fix possible UAF on error path in filter_free_subsystem_filters()
 
   When freeing a subsystem filter, the filter for the subsystem is passed in
   to be freed and all the events within the subsystem will have their filter
   freed too. In order to free without waiting for RCU synchronization, list
   items are allocated to hold what is going to be freed to free it via a
   call_rcu(). If the allocation of these items fails, it will call the
   synchronization directly and free after that (causing a bit of delay for
   the user).
 
   The subsystem filter is first added to this list and then the filters for
   all the events under the subsystem. The bug is if one of the allocations
   of the list items for the event filters fail to allocate, it jumps to the
   "free_now" label which will free the subsystem filter, then all the items
   on the allocated list, and then the event filters that were not added to
   the list yet. But because the subsystem filter was added first, it gets
   freed twice.
 
   The solution is to add the subsystem filter after the events, and then if
   any of the allocations fail it will not try to free any of them twice
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaF/yIRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qpoNAP9AuI6SzS+E14UFbA7lEPVtQAgaj6rv
 xURhlmZdsGJ2AQEA3ZTv6Lf3DbnSHzPDOUnK9ItQZE7UHPh4Yed0QrriEAM=
 =hFZ1
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fix from Steven Rostedt:

 - Fix possible UAF on error path in filter_free_subsystem_filters()

   When freeing a subsystem filter, the filter for the subsystem is
   passed in to be freed and all the events within the subsystem will
   have their filter freed too. In order to free without waiting for RCU
   synchronization, list items are allocated to hold what is going to be
   freed to free it via a call_rcu(). If the allocation of these items
   fails, it will call the synchronization directly and free after that
   (causing a bit of delay for the user).

   The subsystem filter is first added to this list and then the filters
   for all the events under the subsystem. The bug is if one of the
   allocations of the list items for the event filters fail to allocate,
   it jumps to the "free_now" label which will free the subsystem
   filter, then all the items on the allocated list, and then the event
   filters that were not added to the list yet. But because the
   subsystem filter was added first, it gets freed twice.

   The solution is to add the subsystem filter after the events, and
   then if any of the allocations fail it will not try to free any of
   them twice

* tag 'trace-v6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix filter logic error
2025-06-28 11:39:24 -07:00
Linus Torvalds
0fd39af24e 16 hotfixes. 6 are cc:stable and the remainder address post-6.15 issues
or aren't considered necessary for -stable kernels.  5 are for MM.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaF8vtQAKCRDdBJ7gKXxA
 jlK9AP9Syx5isoE7MAMKjr9iI/2z+NRaCCro/VM4oQk8m2cNFgD/ZsL9YMhjZlcL
 bMIVUZ9E+yf1w9dLeHLoDba+pnF7Wwc=
 =vdkO
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "16 hotfixes.

  6 are cc:stable and the remainder address post-6.15 issues or aren't
  considered necessary for -stable kernels. 5 are for MM"

* tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  MAINTAINERS: add Lorenzo as THP co-maintainer
  mailmap: update Duje Mihanović's email address
  selftests/mm: fix validate_addr() helper
  crashdump: add CONFIG_KEYS dependency
  mailmap: correct name for a historical account of Zijun Hu
  mailmap: add entries for Zijun Hu
  fuse: fix runtime warning on truncate_folio_batch_exceptionals()
  scripts/gdb: fix dentry_name() lookup
  mm/damon/sysfs-schemes: free old damon_sysfs_scheme_filter->memcg_path on write
  mm/alloc_tag: fix the kmemleak false positive issue in the allocation of the percpu variable tag->counters
  lib/group_cpus: fix NULL pointer dereference from group_cpus_evenly()
  mm/hugetlb: remove unnecessary holding of hugetlb_lock
  MAINTAINERS: add missing files to mm page alloc section
  MAINTAINERS: add tree entry to mm init block
  mm: add OOM killer maintainer structure
  fs/proc/task_mmu: fix PAGE_IS_PFNZERO detection for the huge zero folio
2025-06-27 20:34:10 -07:00
Edward Adam Davis
6921d1e07c tracing: Fix filter logic error
If the processing of the tr->events loop fails, the filter that has been
added to filter_head will be released twice in free_filter_list(&head->rcu)
and __free_filter(filter).

After adding the filter of tr->events, add the filter to the filter_head
process to avoid triggering uaf.

Link: https://lore.kernel.org/tencent_4EF87A626D702F816CD0951CE956EC32CD0A@qq.com
Fixes: a9d0aab5eb33 ("tracing: Fix regression of filter waiting a long time on RCU synchronization")
Reported-by: syzbot+daba72c4af9915e9c894@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=daba72c4af9915e9c894
Tested-by: syzbot+daba72c4af9915e9c894@syzkaller.appspotmail.com
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-27 15:51:36 -04:00
Mario Limonciello
12ffc3b151 PM: Restrict swap use to later in the suspend sequence
Currently swap is restricted before drivers have had a chance to do
their prepare() PM callbacks. Restricting swap this early means that if
a driver needs to evict some content from memory into sawp in it's
prepare callback, it won't be able to.

On AMD dGPUs this can lead to failed suspends under memory pressure
situations as all VRAM must be evicted to system memory or swap.

Move the swap restriction to right after all devices have had a chance
to do the prepare() callback.  If there is any problem with the sequence,
restore swap in the appropriate dpm resume callbacks or error handling
paths.

Closes: https://github.com/ROCm/ROCK-Kernel-Driver/issues/174
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/2362
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Nat Wittstock <nat@fardog.io>
Tested-by: Lucian Langa <lucilanga@7pot.org>
Link: https://patch.msgid.link/20250613214413.4127087-1-superm1@kernel.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-06-26 20:39:34 +02:00
Leo Yan
1476b21832 perf/aux: Fix pending disable flow when the AUX ring buffer overruns
If an AUX event overruns, the event core layer intends to disable the
event by setting the 'pending_disable' flag. Unfortunately, the event
is not actually disabled afterwards.

In commit:

  ca6c21327c6a ("perf: Fix missing SIGTRAPs")

the 'pending_disable' flag was changed to a boolean. However, the
AUX event code was not updated accordingly. The flag ends up holding a
CPU number. If this number is zero, the flag is taken as false and the
IRQ work is never triggered.

Later, with commit:

  2b84def990d3 ("perf: Split __perf_pending_irq() out of perf_pending_irq()")

a new IRQ work 'pending_disable_irq' was introduced to handle event
disabling. The AUX event path was not updated to kick off the work queue.

To fix this bug, when an AUX ring buffer overrun is detected, call
perf_event_disable_inatomic() to initiate the pending disable flow.

Also update the outdated comment for setting the flag, to reflect the
boolean values (0 or 1).

Fixes: 2b84def990d3 ("perf: Split __perf_pending_irq() out of perf_pending_irq()")
Fixes: ca6c21327c6a ("perf: Fix missing SIGTRAPs")
Signed-off-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: James Clark <james.clark@linaro.org>
Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Liang Kan <kan.liang@linux.intel.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-perf-users@vger.kernel.org
Link: https://lore.kernel.org/r/20250625170737.2918295-1-leo.yan@arm.com
2025-06-26 10:50:37 +02:00