twx-linux

Author	SHA1	Message	Date
Thomas Gleixner	3db6b38dfe	rseq: Switch to fast path processing on exit to user Now that all bits and pieces are in place, hook the RSEQ handling fast path function into exit_to_user_mode_prepare() after the TIF work bits have been handled. If case of fast path failure, TIF_NOTIFY_RESUME has been raised and the caller needs to take another turn through the TIF handling slow path. This only works for architectures which use the generic entry code. Architectures who still have their own incomplete hacks are not supported and won't be. This results in the following improvements: Kernel build Before After Reduction exit to user 80692981 80514451 signal checks: 32581 121 99% slowpath runs: 1201408 1.49% 198 0.00% 100% fastpath runs: 675941 0.84% N/A id updates: 1233989 1.53% 50541 0.06% 96% cs checks: 1125366 1.39% 0 0.00% 100% cs cleared: 1125366 100% 0 100% cs fixup: 0 0% 0 RSEQ selftests Before After Reduction exit to user: 386281778 387373750 signal checks: 35661203 0 100% slowpath runs: 140542396 36.38% 100 0.00% 100% fastpath runs: 9509789 2.51% N/A id updates: 176203599 45.62% 9087994 2.35% 95% cs checks: 175587856 45.46% 4728394 1.22% 98% cs cleared: 172359544 98.16% 1319307 27.90% 99% cs fixup: 3228312 1.84% 3409087 72.10% The 'cs cleared' and 'cs fixup' percentages are not relative to the exit to user invocations, they are relative to the actual 'cs check' invocations. While some of this could have been avoided in the original code, like the obvious clearing of CS when it's already clear, the main problem of going through TIF_NOTIFY_RESUME cannot be solved. In some workloads the RSEQ notify handler is invoked more than once before going out to user space. Doing this once when everything has stabilized is the only solution to avoid this. The initial attempt to completely decouple it from the TIF work turned out to be suboptimal for workloads, which do a lot of quick and short system calls. Even if the fast path decision is only 4 instructions (including a conditional branch), this adds up quickly and becomes measurable when the rate for actually having to handle rseq is in the low single digit percentage range of user/kernel transitions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.701201365@linutronix.de	2025-11-04 08:34:39 +01:00
Thomas Gleixner	05b44aef70	rseq: Implement fast path for exit to user Implement the actual logic for handling RSEQ updates in a fast path after handling the TIF work and at the point where the task is actually returning to user space. This is the right point to do that because at this point the CPU and the MM CID are stable and cannot longer change due to yet another reschedule. That happens when the task is handling it via TIF_NOTIFY_RESUME in resume_user_mode_work(), which is invoked from the exit to user mode work loop. The function is invoked after the TIF work is handled and runs with interrupts disabled, which means it cannot resolve page faults. It therefore disables page faults and in case the access to the user space memory faults, it: - notes the fail in the event struct - raises TIF_NOTIFY_RESUME - returns false to the caller The caller has to go back to the TIF work, which runs with interrupts enabled and therefore can resolve the page faults. This happens mostly on fork() when the memory is marked COW. If the user memory inspection finds invalid data, the function returns false as well and sets the fatal flag in the event struct along with TIF_NOTIFY_RESUME. The slow path notify handler has to evaluate that flag and terminate the task with SIGSEGV as documented. The initial decision to invoke any of this is based on one flags in the event struct: @sched_switch. The decision is in pseudo ASM: load tsk::event::sched_switch jnz inspect_user_space mov $0, tsk::event::events ... leave So for the common case where the task was not scheduled out, this really boils down to three instructions before going out if the compiler is not completely stupid (and yes, some of them are). If the condition is true, then it checks, whether CPU ID or MM CID have changed. If so, then the CPU/MM IDs have to be updated and are thereby cached for the next round. The update unconditionally retrieves the user space critical section address to spare another user*begin/end() pair. If that's not zero and tsk::event::user_irq is set, then the critical section is analyzed and acted upon. If either zero or the entry came via syscall the critical section analysis is skipped. If the comparison is false then the critical section has to be analyzed because the event flag is then only true when entry from user was by interrupt. This is provided without the actual hookup to let reviewers focus on the implementation details. The hookup happens in the next step. Note: As with quite some other optimizations this depends on the generic entry infrastructure and is not enabled to be sucked into random architecture implementations. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.638929615@linutronix.de	2025-11-04 08:34:18 +01:00
Thomas Gleixner	39a167560a	rseq: Optimize event setting After removing the various condition bits earlier it turns out that one extra information is needed to avoid setting event::sched_switch and TIF_NOTIFY_RESUME unconditionally on every context switch. The update of the RSEQ user space memory is only required, when either the task was interrupted in user space and schedules or the CPU or MM CID changes in schedule() independent of the entry mode Right now only the interrupt from user information is available. Add an event flag, which is set when the CPU or MM CID or both change. Evaluate this event in the scheduler to decide whether the sched_switch event and the TIF bit need to be set. It's an extra conditional in context_switch(), but the downside of unconditionally handling RSEQ after a context switch to user is way more significant. The utilized boolean logic minimizes this to a single conditional branch. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.578058898@linutronix.de	2025-11-04 08:34:03 +01:00
Thomas Gleixner	e2d4f42271	rseq: Rework the TIF_NOTIFY handler Replace the whole logic with a new implementation, which is shared with signal delivery and the upcoming exit fast path. Contrary to the original implementation, this ignores invocations from KVM/IO-uring, which invoke resume_user_mode_work() with the @regs argument set to NULL. The original implementation updated the CPU/Node/MM CID fields, but that was just a side effect, which was addressing the problem that this invocation cleared TIF_NOTIFY_RESUME, which in turn could cause an update on return to user space to be lost. This problem has been addressed differently, so that it's not longer required to do that update before entering the guest. That might be considered a user visible change, when the hosts thread TLS memory is mapped into the guest, but as this was never intentionally supported, this abuse of kernel internal implementation details is not considered an ABI break. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.517640811@linutronix.de	2025-11-04 08:33:54 +01:00
Thomas Gleixner	9f6ffd4ceb	rseq: Separate the signal delivery path Completely separate the signal delivery path from the notify handler as they have different semantics versus the event handling. The signal delivery only needs to ensure that the interrupted user context was not in a critical section or the section is aborted before it switches to the signal frame context. The signal frame context does not have the original instruction pointer anymore, so that can't be handled on exit to user space. No point in updating the CPU/CID ids as they might change again before the task returns to user space for real. The fast path optimization, which checks for the 'entry from user via interrupt' condition is only available for architectures which use the generic entry code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.455429038@linutronix.de	2025-11-04 08:33:47 +01:00
Thomas Gleixner	0f085b4188	rseq: Provide and use rseq_set_ids() Provide a new and straight forward implementation to set the IDs (CPU ID, Node ID and MM CID), which can be later inlined into the fast path. It does all operations in one scoped_user_rw_access() section and retrieves also the critical section member (rseq::cs_rseq) from user space to avoid another user..begin/end() pair. This is in preparation for optimizing the fast path to avoid extra work when not required. On rseq registration set the CPU ID fields to RSEQ_CPU_ID_UNINITIALIZED and node and MM CID to zero. That's the same as the kernel internal reset values. That makes the debug validation in the exit code work correctly on the first exit to user space. Use it to replace the whole related zoo in rseq.c Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.393972266@linutronix.de	2025-11-04 08:33:33 +01:00
Thomas Gleixner	eaa9088d56	rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y Make the syscall exit debug mechanism available via the static branch on architectures which utilize the generic entry code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.333440475@linutronix.de	2025-11-04 08:33:27 +01:00
Thomas Gleixner	f7ee1964ac	rseq: Replace the original debug implementation Just utilize the new infrastructure and put the original one to rest. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.212510692@linutronix.de	2025-11-04 08:33:12 +01:00
Thomas Gleixner	abc850e761	rseq: Provide and use rseq_update_user_cs() Provide a straight forward implementation to check for and eventually clear/fixup critical sections in user space. The non-debug version does only the minimal sanity checks and aims for efficiency. There are two attack vectors, which are checked for: 1) An abort IP which is in the kernel address space. That would cause at least x86 to return to kernel space via IRET. 2) A rogue critical section descriptor with an abort IP pointing to some arbitrary address, which is not preceded by the RSEQ signature. If the section descriptors are invalid then the resulting misbehaviour of the user space application is not the kernels problem. The kernel provides a run-time switchable debug slow path, which implements the full zoo of checks including termination of the task when one of the gazillion conditions is not met. Replace the zoo in rseq.c with it and invoke it from the TIF_NOTIFY_RESUME handler. Move the remainders into the CONFIG_DEBUG_RSEQ section, which will be replaced and removed in a subsequent step. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.151465632@linutronix.de	2025-11-04 08:32:57 +01:00
Thomas Gleixner	9c37cb6e80	rseq: Provide static branch for runtime debugging Config based debug is rarely turned on and is not available easily when things go wrong. Provide a static branch to allow permanent integration of debug mechanisms along with the usual toggles in Kconfig, command line and debugfs. Requested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.089270547@linutronix.de	2025-11-04 08:32:49 +01:00
Thomas Gleixner	5412910487	rseq: Expose lightweight statistics in debugfs Analyzing the call frequency without actually using tracing is helpful for analysis of this infrastructure. The overhead is minimal as it just increments a per CPU counter associated to each operation. The debugfs readout provides a racy sum of all counters. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084307.027916598@linutronix.de	2025-11-04 08:32:41 +01:00
Thomas Gleixner	dab344753e	rseq: Provide tracepoint wrappers for inline code Provide tracepoint wrappers for the upcoming RSEQ exit to user space inline fast path, so that the header can be safely included by code which defines actual trace points. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084306.967114316@linutronix.de	2025-11-04 08:32:35 +01:00
Thomas Gleixner	4b7de6df20	rseq: Cache CPU ID and MM CID values In preparation for rewriting RSEQ exit to user space handling provide storage to cache the CPU ID and MM CID values which were written to user space. That prepares for a quick check, which avoids the update when nothing changed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084306.841964081@linutronix.de	2025-11-04 08:32:14 +01:00
Thomas Gleixner	7702a9c285	entry: Inline irqentry_enter/exit_from/to_user_mode() There is no point to have this as a function which just inlines enter_from_user_mode(). The function call overhead is larger than the function itself. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084306.715309918@linutronix.de	2025-11-04 08:31:47 +01:00
Thomas Gleixner	54a5ab5624	entry: Remove syscall_enter_from_user_mode_prepare() Open code the only user in the x86 syscall code and reduce the zoo of functions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084306.652839989@linutronix.de	2025-11-04 08:31:37 +01:00
Thomas Gleixner	faba9d250e	rseq: Introduce struct rseq_data In preparation for a major rewrite of this code, provide a data structure for rseq management. Put all the rseq related data into it (except for the debug part), which allows to simplify fork/execve by using memset() and memcpy() instead of adding new fields to initialize over and over. Create a storage struct for event management as well and put the sched_switch event and a indicator for RSEQ on a task into it as a start. That uses a union, which allows to mask and clear the whole lot efficiently. The indicators are explicitly not a bit field. Bit fields generate abysmal code. The boolean members are defined as u8 as that actually guarantees that it fits. There seem to be strange architecture ABIs which need more than 8 bits for a boolean. The has_rseq member is redundant vs. task::rseq, but it turns out that boolean operations and quick checks on the union generate better code than fiddling with separate entities and data types. This struct will be extended over time to carry more information. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084306.527086690@linutronix.de	2025-11-04 08:30:50 +01:00
Thomas Gleixner	566d8015f7	rseq: Avoid CPU/MM CID updates when no event pending There is no need to update these values unconditionally if there is no event pending. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084306.462964916@linutronix.de	2025-11-04 08:30:43 +01:00
Thomas Gleixner	83409986f4	rseq, virt: Retrigger RSEQ after vcpu_run() Hypervisors invoke resume_user_mode_work() before entering the guest, which clears TIF_NOTIFY_RESUME. The @regs argument is NULL as there is no user space context available to them, so the rseq notify handler skips inspecting the critical section, but updates the CPU/MM CID values unconditionally so that the eventual pending rseq event is not lost on the way to user space. This is a pointless exercise as the task might be rescheduled before actually returning to user space and it creates unnecessary work in the vcpu_run() loops. It's way more efficient to ignore that invocation based on @regs == NULL and let the hypervisors re-raise TIF_NOTIFY_RESUME after returning from the vcpu_run() loop before returning from the ioctl(). This ensures that a pending RSEQ update is not lost and the IDs are updated before returning to user space. Once the RSEQ handling is decoupled from TIF_NOTIFY_RESUME, this turns into a NOOP. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20251027084306.399495855@linutronix.de	2025-11-04 08:30:23 +01:00
Thomas Gleixner	d923739e2e	rseq: Simplify the event notification Since commit 0190e4198e47 ("rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_* flags") the bits in task::rseq_event_mask are meaningless and just extra work in terms of setting them individually. Aside of that the only relevant point where an event has to be raised is context switch. Neither the CPU nor MM CID can change without going through a context switch. Collapse them all into a single boolean which simplifies the code a lot and remove the pointless invocations which have been sprinkled all over the place for no value. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084306.336978188@linutronix.de	2025-11-04 08:30:09 +01:00
Thomas Gleixner	067b3b41b4	rseq: Simplify registration There is no point to read the critical section element in the newly registered user space RSEQ struct first in order to clear it. Just clear it and be done with it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084306.274661227@linutronix.de	2025-11-04 08:30:05 +01:00
Thomas Gleixner	77f19e4d4f	rseq: Move algorithm comment to top Move the comment which documents the RSEQ algorithm to the top of the file, so it does not create horrible diffs later when the actual implementation is fed into the mincer. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084306.149519580@linutronix.de	2025-11-04 08:29:52 +01:00
Thomas Gleixner	3ca59da7aa	rseq: Avoid pointless evaluation in __rseq_notify_resume() The RSEQ critical section mechanism only clears the event mask when a critical section is registered, otherwise it is stale and collects bits. That means once a critical section is installed the first invocation of that code when TIF_NOTIFY_RESUME is set will abort the critical section, even when the TIF bit was not raised by the rseq preempt/migrate/signal helpers. This also has a performance implication because TIF_NOTIFY_RESUME is a multiplexing TIF bit, which is utilized by quite some infrastructure. That means every invocation of __rseq_notify_resume() goes unconditionally through the heavy lifting of user space access and consistency checks even if there is no reason to do so. Keeping the stale event mask around when exiting to user space also prevents it from being utilized by the upcoming time slice extension mechanism. Avoid this by reading and clearing the event mask before doing the user space critical section access with interrupts or preemption disabled, which ensures that the read and clear operation is CPU local atomic versus scheduling and the membarrier IPI. This is correct as after re-enabling interrupts/preemption any relevant event will set the bit again and raise TIF_NOTIFY_RESUME, which makes the user space exit code take another round of TIF bit clearing. If the event mask was non-zero, invoke the slow path. On debug kernels the slow path is invoked unconditionally and the result of the event mask evaluation is handed in. Add a exit path check after the TIF bit loop, which validates on debug kernels that the event mask is zero before exiting to user space. While at it reword the convoluted comment why the pt_regs pointer can be NULL under certain circumstances. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251027084306.022571576@linutronix.de	2025-11-04 08:28:38 +01:00
Thomas Gleixner	e4e28fd698	futex: Convert to get/put_user_inline() Replace the open coded implementation with the new get/put_user_inline() helpers. This might be replaced by a regular get/put_user(), but that needs a proper performance evaluation. No functional change intended. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251027083745.736737934@linutronix.de	2025-11-04 08:28:23 +01:00
Linus Torvalds	ba36dd5ee6	bpf-fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmkFVBIACgkQ6rmadz2v bTrWVg//chctHGZZcP2ZbLxDDwLOwfjjsUY2COaD9P3ZN8/vWX6GEbvElLulkLgD Hwv3pe2C6NzHN9QH37M+WtJmLE1vI5aRuMXzpBKhOtOFAE5BfHzXeON0M4pswZd1 jh7f4w7mBdW3MMoR2Dg/l+lbGxDKFfb9jfD1blm+uOuBodHdbIpa66Mscakannrx tNWoauPDcu7fu7b+KCItnICC+VewaoDmhr20Q8X/kwvqbNPZ98D/tzUw7YlngO1d p+K/oKVAfXbWbW79agNoqD+zVDKAos7dQgqCDY/cuZhJNzt4xBZfTkM62SXdHU7g aCXHg+qxoWMrYTWGGueAhwf4gB3YIe0atKxP9w5gbjtxbWa5Y6oTyIpgGKvO5SMj 7qsmg/m338kS4aKQVjr9D042W+qqxRjrn2eF/x4Sth1GXMJd1ny14NpoNGEk/xsU TZfBdFgNOYUa1jeK3N3oEDdlxx8ITA9gsNPzSy9O8Ke6WRHp5u9Ob/7UIJsiVYWw 6SPdIhagv719m93GvAC4Xe3BrRi1dmf5UX39oOqpnGKkg4lT/xNu4aYP89UbFVGW XgTPX+Cm7kRKb32Fv9GiLC0sTQEWVAiB0jVTGB9E8v15P7ybJ/9IrcRNcwcrKGNS ny+cn1SR+CmX6c8TdliSzLdtgGuPk3QrXkwWs4442IphtbPnhE4= =t7MS -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Mark migrate_disable/enable() as always_inline to avoid issues with partial inlining (Yonghong Song) - Fix powerpc stack register definition in libbpf bpf_tracing.h (Andrii Nakryiko) - Reject negative head_room in __bpf_skb_change_head (Daniel Borkmann) - Conditionally include dynptr copy kfuncs (Malin Jonsson) - Sync pending IRQ work before freeing BPF ring buffer (Noorain Eqbal) - Do not audit capability check in x86 do_jit() (Ondrej Mosnacek) - Fix arm64 JIT of BPF_ST insn when it writes into arena memory (Puranjay Mohan) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf/arm64: Fix BPF_ST into arena memory bpf: Make migrate_disable always inline to avoid partial inlining bpf: Reject negative head_room in __bpf_skb_change_head bpf: Conditionally include dynptr copy kfuncs libbpf: Fix powerpc's stack register definition in bpf_tracing.h bpf: Do not audit capability check in do_jit() bpf: Sync pending IRQ work before freeing ring buffer	2025-10-31 18:22:26 -07:00
Linus Torvalds	a5dbbb39e1	Power management fixes for 6.18-rc4 - Add an exit latency check to the menu cpuidle governor in the case when it considers using a real idle state instead of a polling one to address a performance regression (Rafael Wysocki) - Revert an attempted cleanup of a system suspend code path that introduced a regression elsewhere (Samuel Wu) - Allow pm_restrict_gfp_mask() to be called multiple times in a row and adjust pm_restore_gfp_mask() accordingly to avoid having to play nasty games with these calls during hibernation (Rafael Wysocki) -----BEGIN PGP SIGNATURE----- iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmkDvsQSHHJqd0Byand5 c29ja2kubmV0AAoJEO5fvZ0v1OO1ICQH/1haZ2p7muRXiyP/rmyt41wQm/VZPMFV JBguNH0nd9r9szVntE0ic+lfv/nT5q0C26/tDHCnDPrtqT4aEj8uQgecfb71r1Sn 4cp4Y3BDp/9v6K2AAdo/FBYBMG63qlKlMaSXG2hewH3MreaP1V86AmELhGz6jvSV hFzJCi/bPoS7ot2mmKCE3MSGU9XEI1Hce4YAGfI3j/6RK9UD921g9gZWuEgqsQbB QHM3Wqp348sPW0JgDNYtFv6X6N3+JmSyO0oYLSSTbYKVNVkd7o/3zxEx5pnnLDjD l7y9pUbJ155fBkCbLIrU0NUwCo4PZfMs6KS3L/0Cu3Dp4JPa1F/BsN4= =p+g/ -----END PGP SIGNATURE----- Merge tag 'pm-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix three regressions, two recent ones and one introduced during the 6.17 development cycle: - Add an exit latency check to the menu cpuidle governor in the case when it considers using a real idle state instead of a polling one to address a performance regression (Rafael Wysocki) - Revert an attempted cleanup of a system suspend code path that introduced a regression elsewhere (Samuel Wu) - Allow pm_restrict_gfp_mask() to be called multiple times in a row and adjust pm_restore_gfp_mask() accordingly to avoid having to play nasty games with these calls during hibernation (Rafael Wysocki)" * tag 'pm-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: sleep: Allow pm_restrict_gfp_mask() stacking cpuidle: governors: menu: Select polling state in some more cases Revert "PM: sleep: Make pm_wakeup_clear() call more clear"	2025-10-30 19:02:16 -07:00
Rafael J. Wysocki	590c5cd106	Merge branches 'pm-cpuidle' and 'pm-sleep' Merge a cpuidle fix and two fixes related to system sleep for 6.18-rc4: - Add an exit latency check to the menu cpuidle governor in the case when it considers using a real idle state instead of a polling one to address a performance regression (Rafael Wysocki) - Revert an attempted cleanup of a system suspend code path that introduced a regression elsewhere (Samuel Wu) - Allow pm_restrict_gfp_mask() to be called multiple times in a row and adjust pm_restore_gfp_mask() accordingly to avoid having to play nasty games with these calls during hibernation (Rafael Wysocki) * pm-cpuidle: cpuidle: governors: menu: Select polling state in some more cases * pm-sleep: PM: sleep: Allow pm_restrict_gfp_mask() stacking Revert "PM: sleep: Make pm_wakeup_clear() call more clear"	2025-10-30 20:25:18 +01:00
Rafael J. Wysocki	35e4a69b20	PM: sleep: Allow pm_restrict_gfp_mask() stacking Allow pm_restrict_gfp_mask() to be called many times in a row to avoid issues with calling dpm_suspend_start() when the GFP mask has been already restricted. Only the first invocation of pm_restrict_gfp_mask() will actually restrict the GFP mask and the subsequent calls will warn if there is a mismatch between the expected allowed GFP mask and the actual one. Moreover, if pm_restrict_gfp_mask() is called many times in a row, pm_restore_gfp_mask() needs to be called matching number of times in a row to actually restore the GFP mask. Calling it when the GFP mask has not been restricted will cause it to warn. This is necessary for the GFP mask restriction starting in hibernation_snapshot() to continue throughout the entire hibernation flow until it completes or it is aborted (either by a wakeup event or by an error). Fixes: 449c9c02537a1 ("PM: hibernate: Restrict GFP mask in hibernation_snapshot()") Fixes: 469d80a3712c ("PM: hibernate: Fix hybrid-sleep") Reported-by: Askar Safin <safinaskar@gmail.com> Closes: https://lore.kernel.org/linux-pm/20251025050812.421905-1-safinaskar@gmail.com/ Link: https://lore.kernel.org/linux-pm/20251028111730.2261404-1-safinaskar@gmail.com/ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Tested-by: Mario Limonciello (AMD) <superm1@kernel.org> Cc: 6.16+ <stable@vger.kernel.org> # 6.16+ Link: https://patch.msgid.link/5935682.DvuYhMxLoT@rafael.j.wysocki	2025-10-29 18:55:32 +01:00
Linus Torvalds	fd57572253	sched_ext: Fixes for v6.18-rc3 - Fix scx_kick_pseqs corruption when multiple schedulers are loaded concurrently - Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc() to handle systems with large CPU counts - Defer queue_balance_callback() until after ops.dispatch to fix callback ordering issues - Sync error_irq_work before freeing scx_sched to prevent use-after-free - Mark scx_bpf_dsq_move_set_[slice\|vtime]() with KF_RCU for proper RCU protection - Fix flag check for deferred callbacks -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaP+iWg4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGZ+MAQCVyGxKbEvCRqPwEkwxRdTTBBkHlxEzgeFAK5GN UrQ6mwEAq7cdmdjpPZ22iHEeyRfr2EZZww4oAlX9JpU0Pipj4QM= =MbR0 -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-6.18-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix scx_kick_pseqs corruption when multiple schedulers are loaded concurrently - Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc() to handle systems with large CPU counts - Defer queue_balance_callback() until after ops.dispatch to fix callback ordering issues - Sync error_irq_work before freeing scx_sched to prevent use-after-free - Mark scx_bpf_dsq_move_set_[slice\|vtime]() with KF_RCU for proper RCU protection - Fix flag check for deferred callbacks * tag 'sched_ext-for-6.18-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: fix flag check for deferred callbacks sched_ext: Fix scx_kick_pseqs corruption on concurrent scheduler loads sched_ext: Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc() sched_ext: defer queue_balance_callback() until after ops.dispatch sched_ext: Sync error_irq_work before freeing scx_sched sched_ext: Mark scx_bpf_dsq_move_set_[slice\|vtime]() with KF_RCU	2025-10-27 10:52:18 -07:00
Linus Torvalds	5fee0dafba	- Restore the original buslock locking in a couple of places in the irq core subsystem after a rework -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmj+EwoACgkQEsHwGGHe VUqm6xAAuPDn4E0wuxgD5l6gXYDWXx7xoHEDT0KuL2J9OsfbWoHl8OwObBRmD7ls au/SuuJUSs3NEntQwLfTklyi7UignkTzcyOYLqb2fMYPFLk+nRXWSjvxsQMQV/u3 wwSXyK1YaZ4qaEKqIAPm5Uvs4E1DQFu6zzBdjVTKB+w1n0Lh9P4xBdDaHgwc/dV/ 8jKt39JsInLzCy+8aDLeabeU5X5qDscnbpJ3LEHf/6scMBCAvQbnfeICvDijzLgf FF4qw+O7qGzFQTKRB2B4pymoFhKGOnGR4jtygejjm3wDO/k2QKS3OwoJo8mzIM3S p/HimQ7Uy0KEU11Vo37ANdE8XErkeoj7meoBNGFiU4KZzRU99CnRz0EDap9RUvlx clat0CC/3NSGau2hcbYDrTSsjkoWVbEtQJ2XbvHavnE0MscHUMIf1vIQjWzvVG06 0u5R1OPD+0czeCIXKZQVDGyRcRmmAF1+na3AuBUDq1h0i+KT4V/Y1vX64IFkDdd3 NaMk6GVmQu3bDpJ4LBpdhVl7cGV50kAbGl77VHST4pERvWQ1EWwwutDp0CK96zo0 WQnQfjF4/5Ja9l5nCLK7kffQtjdFg/jY/wyixwASEWJDM81T+fZSf32VGkP6Wf0N tQYfjKOEj1l/ilRRarSxW8opazhZuN7t7k5e8IxPYP9LmbgAbnQ= =oIwu -----END PGP SIGNATURE----- Merge tag 'irq_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Borislav Petkov: - Restore the original buslock locking in a couple of places in the irq core subsystem after a rework * tag 'irq_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq/manage: Add buslock back in to enable_irq() genirq/manage: Add buslock back in to __disable_irq_nosync() genirq/chip: Add buslock back in to irq_set_handler()	2025-10-26 09:54:36 -07:00
Linus Torvalds	1bc9743b64	- Make sure a CFS runqueue on a throttled hierarchy has its PELT clock throttled otherwise task movement and manipulation would lead to dangling cfs_rq references and an eventual crash -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmj+Cf0ACgkQEsHwGGHe VUqkvA/8D1ItoOslMeTpD6YtcaNN9oxzQ7Zow1QaWaPqirUsc+2l/zZ/3R5s0Zlt 9n0mUNdZ6EC03ZGPwYCNVLk2PvTywmMdwXOypya303PXLez2bPigekJIyXJeW5FV YuJWTJBQWtZwiFf2ekP1OmHRceOA4KuBIwmWvfW4YwdXlUGfDLn+X6a4z8GsH/z+ ss8iUTfbEraBoFFaF16xq1zxrvRDw5vZpX2HkcHADiTVdkHcuXrf+33AeW/URWKz FrwimiW+HJdue9trFNwLKUggHCPDoUpHLPA/kmWFiGCZWRXBPpmZ56NGRgfoadGa 4/Hb9ASMjMFl8Y9gnkOqLyomhQ8vJ8LkNqDChiJ5AiQQFYRekrPuZw+zuCENtzVZ miAmp/kXCGSCWTMNZKlztxJGhmn/yiH+sVegmyHyDqGfqnuEBF3sebkf/DDkDAvu 88SG1YB8OlgmDIxShhfHQqw1nZa7BshLkViak6110n4fP6fbZrbY0MwBLHX2VVpQ jJeFuvQ2pZuEl1LKVDsy+ROIShkQITZ8IOeabnm6vAeHEpjomDvmlZOmc5f9NfHV wH6SmrHzSaEam70EJflzoglujYy+JMtVIUd7QC/jYXtPOYj1fcHPgwqlnv25uW9e 4IrwjFNwc2u0MAemKcqRO4DUEwAczD0y+dL/6eVKK8niVmat4f8= =8MbE -----END PGP SIGNATURE----- Merge tag 'sched_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: - Make sure a CFS runqueue on a throttled hierarchy has its PELT clock throttled otherwise task movement and manipulation would lead to dangling cfs_rq references and an eventual crash * tag 'sched_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled	2025-10-26 09:42:19 -07:00
Linus Torvalds	7ea5092f52	- Do not create more than eight (max supported) AUX clocks sysfs hierarchies -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmj+BzAACgkQEsHwGGHe VUrGkg//aG7SmeAZiKL1SCyEZgNitZRzuxD7lHS+gky7qsjZu8ieGwV4WvcPoG2B wQq7dfH7ZiafiOV/IjGaRwWDbg90934QBZNW9hswMyNsg5ci8vhDCRu6TSWhZN5O HSezJNElq9t3QK1+40qY8m5Zw7DlnJJOQhOyDlkc/tdzZjJs+KSDctEytQ1RjKn6 kR1x7GE6UUJpdKP0/fvlvJVALSrR7hyzqmv+G9HBLhuz8E/lj6IUhRpeouQ3u4tn 7TNYHi5/HEIUE/T87YEsIIYeAVe6hQ33M73rJ6UMYDhuwfLmePVYtm5gsPt8D6Zi 9z73CTZKhry1fpMR0X8pow+zHsyMyYtErX93mFmmvCiPtC8FlvUSLWTVx0n8itBQ jyGZOVPAJWiZ1FCuSaZeaBB3s4/AGDFAOOZIS+l0oRzZHE4xU23y1cgyNbf1pp6H 3i2UR0UZc8D5YVaT6LnmhVDosjZrV7V+GPCqcLNxb8QCqr+dkHjb8fv/G2b0RB34 YUg798hulYOYhU+mKr/qOJs9SPztx8VgmirURU/4wU8aM7vpPPdn5IsgeohzBmSA 2LLE3M2KFCtDem5O5I2HUrBsJBr85om+XMAP2O7ctoWzl3GC4j0Fqr4dK2M0UUfD sTMamGyci7QfiUvzfe/Qh+T1dp6b/4kmU/1rsoyA1fn8V2q+khA= =PFVB -----END PGP SIGNATURE----- Merge tag 'timers_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Borislav Petkov: - Do not create more than eight (max supported) AUX clocks sysfs hierarchies * tag 'timers_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping: Fix aux clocks sysfs initialization loop bound	2025-10-26 09:40:16 -07:00
Andy Shevchenko	53abe3e1c1	sched: Remove never used code in mm_cid_get() Clang is not happy with set but unused variable (this is visible with `make W=1` build: kernel/sched/sched.h:3744:18: error: variable 'cpumask' set but not used [-Werror,-Wunused-but-set-variable] It seems like the variable was never used along with the assignment that does not have side effects as far as I can see. Remove those altogether. Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Tested-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: Breno Leitao <leitao@debian.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2025-10-24 16:55:46 -07:00
Malin Jonsson	8ce93aabbf	bpf: Conditionally include dynptr copy kfuncs Since commit a498ee7576de ("bpf: Implement dynptr copy kfuncs"), if CONFIG_BPF_EVENTS is not enabled, but BPF_SYSCALL and DEBUG_INFO_BTF are, the build will break like so: BTFIDS vmlinux.unstripped WARN: resolve_btfids: unresolved symbol bpf_probe_read_user_str_dynptr WARN: resolve_btfids: unresolved symbol bpf_probe_read_user_dynptr WARN: resolve_btfids: unresolved symbol bpf_probe_read_kernel_str_dynptr WARN: resolve_btfids: unresolved symbol bpf_probe_read_kernel_dynptr WARN: resolve_btfids: unresolved symbol bpf_copy_from_user_task_str_dynptr WARN: resolve_btfids: unresolved symbol bpf_copy_from_user_task_dynptr WARN: resolve_btfids: unresolved symbol bpf_copy_from_user_str_dynptr WARN: resolve_btfids: unresolved symbol bpf_copy_from_user_dynptr make[2]: * [scripts/Makefile.vmlinux:72: vmlinux.unstripped] Error 255 make[2]: * Deleting file 'vmlinux.unstripped' make[1]: * [/repo/malin/upstream/linux/Makefile:1242: vmlinux] Error 2 make: * [Makefile:248: __sub-make] Error 2 Guard these symbols with #ifdef CONFIG_BPF_EVENTS to resolve the problem. Fixes: a498ee7576de ("bpf: Implement dynptr copy kfuncs") Reported-by: Yong Gu <yong.g.gu@ericsson.com> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Malin Jonsson <malin.jonsson@est.tech> Link: https://lore.kernel.org/r/20251024151436.139131-1-malin.jonsson@est.tech Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-24 09:44:47 -07:00
Charles Keepax	ef3330b99c	genirq/manage: Add buslock back in to enable_irq() The locking was changed from a buslock to a plain lock, but the patch description states there was no functional change. Assuming this was accidental so reverting to using the buslock. Fixes: bddd10c55407 ("genirq/manage: Rework enable_irq()") Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251023154901.1333755-4-ckeepax@opensource.cirrus.com	2025-10-24 11:38:39 +02:00
Charles Keepax	56363e25f7	genirq/manage: Add buslock back in to __disable_irq_nosync() The locking was changed from a buslock to a plain lock, but the patch description states there was no functional change. Assuming this was accidental so reverting to using the buslock. Fixes: 1b7444446724 ("genirq/manage: Rework __disable_irq_nosync()") Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251023154901.1333755-3-ckeepax@opensource.cirrus.com	2025-10-24 11:38:39 +02:00
Charles Keepax	5d7e45dd67	genirq/chip: Add buslock back in to irq_set_handler() The locking was changed from a buslock to a plain lock, but the patch description states there was no functional change. Assuming this was accidental so reverting to using the buslock. Fixes: 5cd05f3e2315 ("genirq/chip: Rework irq_set_handler() variants") Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251023154901.1333755-2-ckeepax@opensource.cirrus.com	2025-10-24 11:38:39 +02:00
Linus Torvalds	5121062e83	A couple of fixes for Runtime Verification - A bug caused a kernel panic when reading enabled_monitors was reported. Change callbacks functions to always use list_head iterators and by doing so, fix the wrong pointer that was leading to the panic. - The rtapp/pagefault monitor relies on the MMU to be present (pagefaults exist) but that was not enforced via kconfig, leading to potential build errors on systems without an MMU. Add that kconfig dependency. -----BEGIN PGP SIGNATURE----- iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaPqYBRQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qrZjAQC02FScGQM+TTQ2kvIFAscKEfZddks9 iiodkKWxMTZoxwD/VxXKQUD8CP2HW9uSpJw/O3Zv+NAU80Eq8V2f7/0d9gw= =AnZN -----END PGP SIGNATURE----- Merge tag 'trace-rv-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: "A couple of fixes for Runtime Verification: - A bug caused a kernel panic when reading enabled_monitors was reported. Change callback functions to always use list_head iterators and by doing so, fix the wrong pointer that was leading to the panic. - The rtapp/pagefault monitor relies on the MMU to be present (pagefaults exist) but that was not enforced via kconfig, leading to potential build errors on systems without an MMU. Add that kconfig dependency" * tag 'trace-rv-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rv: Make rtapp/pagefault monitor depends on CONFIG_MMU rv: Fully convert enabled_monitors to use list_head as iterator	2025-10-23 16:50:25 -07:00
Samuel Wu	79816d4b9e	Revert "PM: sleep: Make pm_wakeup_clear() call more clear" This reverts commit 56a232d93cea0ba14da5e3157830330756a45b4c. The above commit changed the position of pm_wakeup_clear() for the suspend call path, but other call paths with references to freeze_processes() were not updated. This means that other call paths, such as hibernate(), will not have pm_wakeup_clear() called. Suggested-by: Saravana Kannan <saravanak@google.com> Signed-off-by: Samuel Wu <wusamuel@google.com> [ rjw: Changelog edits ] Link: https://patch.msgid.link/20251022222830.634086-1-wusamuel@google.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-10-23 12:48:04 +02:00
Linus Torvalds	0f3ad9c610	17 hotfixes. 12 are cc:stable and 14 are for MM. There's a two-patch DAMON series from SeongJae Park which addresses a missed check and possible memory leak. Apart from that it's all singletons - please see the changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaPkz/QAKCRDdBJ7gKXxA js6eAQCdnA10LouzzVdqA+HuYh206z8qE2KGsGKpUGDfJv40uAEA4ZbxYrMJmwhU MXFn7Czphh/NOfFFCnrDnOlAFH7MmQc= =hc/P -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2025-10-22-12-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "17 hotfixes. 12 are cc:stable and 14 are for MM. There's a two-patch DAMON series from SeongJae Park which addresses a missed check and possible memory leak. Apart from that it's all singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2025-10-22-12-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: csky: abiv2: adapt to new folio flags field mm/damon/core: use damos_commit_quota_goal() for new goal commit mm/damon/core: fix potential memory leak by cleaning ops_filter in damon_destroy_scheme hugetlbfs: move lock assertions after early returns in huge_pmd_unshare() vmw_balloon: indicate success when effectively deflating during migration mm/damon/core: fix list_add_tail() call on damon_call() mm/mremap: correctly account old mapping after MREMAP_DONTUNMAP remap mm: prevent poison consumption when splitting THP ocfs2: clear extent cache after moving/defragmenting extents mm: don't spin in add_stack_record when gfp flags don't allow dma-debug: don't report false positives with DMA_BOUNCE_UNALIGNED_KMALLOC mm/damon/sysfs: dealloc commit test ctx always mm/damon/sysfs: catch commit test ctx alloc failure hung_task: fix warnings caused by unaligned lock pointers	2025-10-22 14:57:35 -10:00
K Prateek Nayak	0e4a169d1a	sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled Matteo reported hitting the assert_list_leaf_cfs_rq() warning from enqueue_task_fair() post commit fe8d238e646e ("sched/fair: Propagate load for throttled cfs_rq") which transitioned to using cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in propagate_entity_cfs_rq(). The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all its descendants should have their PELT frozen too or weird things can happen as a result of children accumulating PELT signals when the parents have their PELT clock stopped. Another side effect of this is the loss of integrity of the leaf cfs_rq list. As debugged by Aaron, consider the following hierarchy: root(#) / \ A(#) B() \| C <--- new cgroup \| D <--- new cgroup # - Already on leaf cfs_rq list - Throttled with PELT frozen The newly created cgroups don't have their "pelt_clock_throttled" signal synced with cgroup B. Next, the following series of events occur: 1. online_fair_sched_group() for cgroup D will call propagate_entity_cfs_rq(). (Same can happen if a throttled task is moved to cgroup C and enqueue_task_fair() returns early.) propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to "rq->tmp_alone_branch" since its PELT clock is not marked throttled and cfs_rq of cgroup B is not on the list. cfs_rq of cgroup B is skipped since its PELT is throttled. root cfs_rq already exists on cfs_rq leading to list_add_leaf_cfs_rq() returning early. The cfs_rq of cgroup C is left dangling on the "rq->tmp_alone_branch". 2. A new task wakes up on cgroup A. Since the whole hierarchy is already on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps returning early without any modifications to "rq->tmp_alone_branch". The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch". !!! Splat !!! Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not enough since the new cfs_rq is not yet enqueued on the hierarchy. A dequeue on other subtree on the throttled hierarchy can freeze the PELT clock for the parent hierarchy without setting the indicators for this newly added cfs_rq which was never enqueued. Since there are no tasks on the new hierarchy, start a cfs_rq on a throttled hierarchy with its PELT clock throttled. The first enqueue, or the distribution (whichever happens first) will unfreeze the PELT clock and queue the cfs_rq on the leaf cfs_rq list. While at it, add an assert_list_leaf_cfs_rq() in propagate_entity_cfs_rq() to catch such cases in the future. Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/ Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model") Reported-by: Matteo Martelli <matteo.martelli@codethink.co.uk> Suggested-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Aaron Lu <ziqianlu@bytedance.com> Tested-by: Aaron Lu <ziqianlu@bytedance.com> Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk> Link: https://patch.msgid.link/20251021053522.37583-1-kprateek.nayak@amd.com	2025-10-22 15:21:52 +02:00
Noorain Eqbal	4e90776383	bpf: Sync pending IRQ work before freeing ring buffer Fix a race where irq_work can be queued in bpf_ringbuf_commit() but the ring buffer is freed before the work executes. In the syzbot reproducer, a BPF program attached to sched_switch triggers bpf_ringbuf_commit(), queuing an irq_work. If the ring buffer is freed before this work executes, the irq_work thread may accesses freed memory. Calling `irq_work_sync(&rb->work)` ensures that all pending irq_work complete before freeing the buffer. Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") Reported-by: syzbot+2617fc732430968b45d2@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=2617fc732430968b45d2 Tested-by: syzbot+2617fc732430968b45d2@syzkaller.appspotmail.com Signed-off-by: Noorain Eqbal <nooraineqbal@gmail.com> Link: https://lore.kernel.org/r/20251020180301.103366-1-nooraineqbal@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-21 09:57:48 -07:00
Linus Torvalds	6548d364a3	cgroup: Fixes for v6.18-rc2 - Fix seqcount lockdep assertion failure in cgroup freezer on PREEMPT_RT. Plain seqcount_t expects preemption disabled, but PREEMPT_RT spinlocks don't disable preemption. Switch to seqcount_spinlock_t to properly associate css_set_lock with the freeze timing seqcount. - Misc changes including kernel-doc warning fix for misc_res_type enum and improved selftest diagnostics. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaPZ7wg4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGTiuAP4swnb+BmcqOu37rxIuKXS+6FwRb0Om2wBs4m5W bgASMAEAva+DWpWx0l8GFV4kf14wRV6rYlkMRcP3KSyqXNLmLg4= =8A4u -----END PGP SIGNATURE----- Merge tag 'cgroup-for-6.18-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix seqcount lockdep assertion failure in cgroup freezer on PREEMPT_RT. Plain seqcount_t expects preemption disabled, but PREEMPT_RT spinlocks don't disable preemption. Switch to seqcount_spinlock_t to properly associate css_set_lock with the freeze timing seqcount. - Misc changes including kernel-doc warning fix for misc_res_type enum and improved selftest diagnostics. * tag 'cgroup-for-6.18-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/misc: fix misc_res_type kernel-doc warning selftests: cgroup: Use values_close_report in test_cpu selftests: cgroup: add values_close_report helper cgroup: Fix seqcount lockdep assertion in cgroup freezer	2025-10-20 09:41:27 -10:00
Haofeng Li	39a9ed0fb6	timekeeping: Fix aux clocks sysfs initialization loop bound The loop in tk_aux_sysfs_init() uses `i <= MAX_AUX_CLOCKS` as the termination condition, which results in 9 iterations (i=0 to 8) when MAX_AUX_CLOCKS is defined as 8. However, the kernel is designed to support only up to 8 auxiliary clocks. This off-by-one error causes the creation of a 9th sysfs entry that exceeds the intended auxiliary clock range. Fix the loop bound to use `i < MAX_AUX_CLOCKS` to ensure exactly 8 auxiliary clock entries are created, matching the design specification. Fixes: 7b95663a3d96 ("timekeeping: Provide interface to control auxiliary clocks") Signed-off-by: Haofeng Li <lihaofeng@kylinos.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/tencent_2376993D9FC06A3616A4F981B3DE1C599607@qq.com	2025-10-20 19:56:12 +02:00
Nam Cao	3d62f95bd8	rv: Make rtapp/pagefault monitor depends on CONFIG_MMU There is no page fault without MMU. Compiling the rtapp/pagefault monitor without CONFIG_MMU fails as page fault tracepoints' definitions are not available. Make rtapp/pagefault monitor depends on CONFIG_MMU. Fixes: 9162620eb604 ("rv: Add rtapp_pagefault monitor") Signed-off-by: Nam Cao <namcao@linutronix.de> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202509260455.6Z9Vkty4-lkp@intel.com/ Cc: stable@vger.kernel.org Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Link: https://lore.kernel.org/r/20251002082317.973839-1-namcao@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>	2025-10-20 12:47:40 +02:00
Nam Cao	103541e6a5	rv: Fully convert enabled_monitors to use list_head as iterator The callbacks in enabled_monitors_seq_ops are inconsistent. Some treat the iterator as struct rv_monitor , while others treat the iterator as struct list_head . This causes a wrong type cast and crashes the system as reported by Nathan. Convert everything to use struct list_head * as iterator. This also makes enabled_monitors consistent with available_monitors. Fixes: de090d1ccae1 ("rv: Fix wrong type cast in enabled_monitors_next()") Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/linux-trace-kernel/20250923002004.GA2836051@ax162/ Signed-off-by: Nam Cao <namcao@linutronix.de> Cc: stable@vger.kernel.org Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Link: https://lore.kernel.org/r/20251002082235.973099-1-namcao@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>	2025-10-20 12:47:40 +02:00
Linus Torvalds	d9043c79ba	- Make sure the check for lost pelt idle time is done unconditionally to have correct lost idle time accounting - Stop the deadline server task before a CPU goes offline -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmj0yvsACgkQEsHwGGHe VUqiYRAAncYon7a++87nuCHIw2ktAcjn4PJTz0F1VGw9ZvcbWThUhNoA17jd4uOz XCzSH1rnHnlz359cJIzFwgVYjkBIaqT8GBN0al9ODra37laZCo89bKLmOeAlH81H 1xJXrDwn7U8dYBjgf6E6OGCdAx40kspCBxmpxrFW1VrGDvfNjEAKezm5GWeSED0Z umA93dBr82i4IvfARUkK8s35ctHyx+o+7lCvCSsKSJgM02WWrKqAA/lv6jFjIgdE 0UuYJv+5A2e1Iog2KNSbvSPn23VaMnsZtvXfJoRLFHEsNTiL9NliTnwrOY6xx0Z8 9+GUeWsbobKwcKSk4dctOh0g/4afNbxWe2aAPmScHJNHtXHSeejps+zy4xFCLTZn 2muHCdZ2zo6YSL+og4TQax+FnLYnGUtPFDOQYsNxv/Cp1H+cbgvG5Qp08XXt8Tfl Mt82g25GKklc28AN5Ui7FKTFmV2K363pV04YVZjXOwmxwiEYbwKw8gKfxi7CRW7S fl4nW6Kp8BFtJQxc/RCXDIiX3h0wRlTOmF5FzyFYxgdsmO5AdGqS9tqknLrV2NlH JVtj7alnrmCU34LwtTVfCvYQZiNd4IN+B6/htsL3AzrcLnqJz4O/T/Eyv9UL4yUs yvQuO+yStCyk0BFYaGM3/E0xp87NYjaLiHnpM2jia3DT3UT1t7Q= =uqJW -----END PGP SIGNATURE----- Merge tag 'sched_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Make sure the check for lost pelt idle time is done unconditionally to have correct lost idle time accounting - Stop the deadline server task before a CPU goes offline * tag 'sched_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix pelt lost idle time detection sched/deadline: Stop dl_server before CPU goes offline	2025-10-19 04:59:43 -10:00
Linus Torvalds	343b4b44a1	- Make sure perf reporting works correctly in setups using overlayfs or FUSE - Move the uprobe optimization to a better location logically -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmj0x3YACgkQEsHwGGHe VUo6nQ/9E8LWC6PJG40QUXNZuj5qLe9VVaiWTW7w/zgeCf9nxkt6OhlOIu4fCMKz 6n5marqnvOoG9EXetUz5+n0wJvc9vDACESC0m6ESddaI4PGXULNJIsN2C5dR3UZ3 RULxaXvz9PVVkW3UIuM/U9az7fsG/ttH1rtrWQOsUYQZEO7vA9g+8KtASwnB7yBa 29WzVDYQIuHigdFPkVOuKBEdhslOjNjMM/N/shFOyFS62MGgwwFG/f4xv0c2GanJ 9gS2HPGhwOXLm8x/1Y6D8eKjiT5lvqZcDcRnui8bj7L7YGx+HU4PhRIIg7sBvGqA QQGolxA9Xo2BTufUTxEQK9v2fSvg0f9wuKbkDbRUdyUeWiZZjEeBM/m0AkzEEeKf FUrLCi3V/mN5J/sXSgIwjuCtYctwmsfaukL2bz6DB7feoTHceQmHunKCtBlDZtLE Md/4hzMNYM+T/3nx27quGz8Cepxn9PSObN7W+DddWr0TxOxg2Pq6iMbnd7MulueP K/AMvqDtbbVUB1XpsFvadRLcYUYYfXT9tiOCxa9O2w2NXDG8qeB6FZwScBaWuz1N 9GpKBhVMgZT8m0d3N8NoBi0+h32UVZnsJJ3UhHnceE8UyYf4kSO5L2K3nPHJa301 AavIPkH7+YOl5TAg6JlyYbRRdwfoUzxKUqY/hQ6Q8aLvwb2Jing= =huy7 -----END PGP SIGNATURE----- Merge tag 'perf_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Borislav Petkov: - Make sure perf reporting works correctly in setups using overlayfs or FUSE - Move the uprobe optimization to a better location logically * tag 'perf_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix MMAP2 event device with backing files perf/core: Fix MMAP event path names with backing files perf/core: Fix address filter match with backing files uprobe: Move arch_uprobe_optimize right after handlers execution	2025-10-19 04:54:08 -10:00
Emil Tsalapatis	a3c4a0a42e	sched_ext: fix flag check for deferred callbacks When scheduling the deferred balance callbacks, check SCX_RQ_BAL_CB_PENDING instead of SCX_RQ_BAL_PENDING. This way schedule_deferred() properly tests whether there is already a pending request for queue_balance_callback() to be invoked at the end of .balance(). Fixes: a8ad873113d3 ("sched_ext: defer queue_balance_callback() until after ops.dispatch") Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-16 08:34:00 -10:00
Shardul Bankar	f6fddc6df3	bpf: Fix memory leak in __lookup_instance error path When __lookup_instance() allocates a func_instance structure but fails to allocate the must_write_set array, it returns an error without freeing the previously allocated func_instance. This causes a memory leak of 192 bytes (sizeof(struct func_instance)) each time this error path is triggered. Fix by freeing 'result' on must_write_set allocation failure. Fixes: b3698c356ad9 ("bpf: callchain sensitive stack liveness tracking using CFG") Reported-by: BPF Runtime Fuzzer (BRF) Signed-off-by: Shardul Bankar <shardulsb08@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://patch.msgid.link/20251016063330.4107547-1-shardulsb08@gmail.com	2025-10-16 10:45:17 -07:00
Marek Szyprowski	03521c892b	dma-debug: don't report false positives with DMA_BOUNCE_UNALIGNED_KMALLOC Commit 370645f41e6e ("dma-mapping: force bouncing if the kmalloc() size is not cache-line-aligned") introduced DMA_BOUNCE_UNALIGNED_KMALLOC feature and permitted architecture specific code configure kmalloc slabs with sizes smaller than the value of dma_get_cache_alignment(). When that feature is enabled, the physical address of some small kmalloc()-ed buffers might be not aligned to the CPU cachelines, thus not really suitable for typical DMA. To properly handle that case a SWIOTLB buffer bouncing is used, so no CPU cache corruption occurs. When that happens, there is no point reporting a false-positive DMA-API warning that the buffer is not properly aligned, as this is not a client driver fault. [m.szyprowski@samsung.com: replace is_swiotlb_allocated() with is_swiotlb_active(), per Catalin] Link: https://lkml.kernel.org/r/20251010173009.3916215-1-m.szyprowski@samsung.com Link: https://lkml.kernel.org/r/20251009141508.2342138-1-m.szyprowski@samsung.com Fixes: 370645f41e6e ("dma-mapping: force bouncing if the kmalloc() size is not cache-line-aligned") Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: Robin Murohy <robin.murphy@arm.com> Cc: "Isaac J. Manjarres" <isaacmanjarres@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-10-15 13:24:33 -07:00

1 2 3 4 5 ...

49595 Commits