Commit Graph

1215159 Commits

Author SHA1 Message Date
Sebastian Siewior 11e11d0a7d printk: replace local_irq_save with local_lock for safe mode
Safe mode disables interrupts in order to minimize the window where
printk calls use deferred printing. Currently local_irq_save() is
used for this, however on PREEMPT_RT this can lead to large latencies
because safe mode is enabled for the duration of printing a record.

Use a local_lock instead of local_irq_save(). For !PREEMPT_RT it
has the same affect of disabling interrupts for that CPU. For
PREEMPT_RT it will disable preemption, which is enough to prevent
interruption from the irq threads.

Note that disabling preemption for PREEMPT_RT is also very bad
since it is still blocking RT tasks. The atomic/threaded (NOBKL)
consoles were developed such that safe mode is not needed. So it
is expected that a PREEMPT_RT machine does not run with any legacy
consoles registered.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:31 +02:00
John Ogness 07173eb6b3 printk: Add threaded printing support for BKL consoles.
Add threaded printing support for BKL consoles on PREEMPT_RT.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:31 +02:00
John Ogness 5cce771b3a printk: only disable if actually unregistered
Currently in unregister_console() a printk message is generated and
the console is disabled, even it was never registered. There are
code paths (such as uart_remove_one_port()) that call
unregister_console() even if the console is not registered.

It is confusing to see messages about consoles being disabled that
were never disabled. Move the printk and disabling later, when it
is known that the console is actually registered.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:31 +02:00
John Ogness 587e5b5882 printk: Perform atomic flush in console_flush_on_panic()
Typically the panic() function will take care of atomic flushing the
non-BKL consoles on panic. However, there are several users of
console_flush_on_panic() outside of panic().

Also perform atomic flushing in console_flush_on_panic(). A new
function cons_force_seq() is implemented to support the
mode=CONSOLE_REPLAY_ALL feature.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:31 +02:00
John Ogness 37552a17fa rcu: Add atomic write enforcement for rcu stalls
Invoke the atomic write enforcement functions for rcu stalls to
ensure that the information gets out to the consoles.

It is important to note that if there are any legacy consoles
registered, they will be attempting to directly print from the
printk-caller context, which may jeopardize the reliability of the
atomic consoles. Optimally there should be no legacy consoles
registered.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:30 +02:00
Thomas Gleixner b793716044 kernel/panic: Add atomic write enforcement to warn/panic
Invoke the atomic write enforcement functions for warn/panic to
ensure that the information gets out to the consoles.

For the panic case, add explicit intermediate atomic flush calls to
ensure immediate flushing at important points. Otherwise the atomic
flushing only occurs when dropping out of the elevated priority,
which for panic may never happen.

It is important to note that if there are any legacy consoles
registered, they will be attempting to directly print from the
printk-caller context, which may jeopardize the reliability of the
atomic consoles. Optimally there should be no legacy consoles
registered.

Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:30 +02:00
John Ogness 078e0c5cf0 proc: consoles: Add support for non-BKL consoles
Show 'W' if a non-BKL console implements write_thread() or
write_atomic().

Add a new flag 'N' to show if it is a non-BKL console.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:30 +02:00
John Ogness 52f01b5cb0 tty: tty_io: Show non-BKL consoles as active
/sys/class/tty/console/active shows the consoles that are currently
registered and enabled and are able to print (i.e. have implemented
a write() callback). This is used by userspace programs such as
systemd's systemd-getty-generator to determine where consoles are
in order to automatically start a getty instance.

The non-BKL consoles do not implement write() but also should be
shown as an active console. Expand the conditions to also check if
the callbacks write_thread() or write_atomic() are implemented for
non-BKL consoles.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:30 +02:00
John Ogness cf3b42a10d printk: nobkl: Stop threads on shutdown/reboot
Register a syscore_ops shutdown function to stop all threaded
printers on shutdown/reboot. This allows printk to transition back
to atomic printing in order to provide a robust mechanism for
outputting the final messages.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:30 +02:00
Thomas Gleixner ab8a7d8f04 printk: nobkl: Provide functions for atomic write enforcement
Threaded printk is the preferred mechanism to tame the noisyness of
printk, but WARN/OOPS/PANIC require printing out immediately since
the printer threads might not be able to run.

Add per CPU state to denote the priority/urgency of the output and
provide functions to flush the printk backlog for priority elevated
contexts and when the printing threads are not available (such as
early boot).

Note that when a CPU is in a priority elevated state, flushing only
occurs when dropping back to a lower priority. This allows the full
set of printk records (WARN/OOPS/PANIC output) to be stored in the
ringbuffer before beginning to flush the backlog.

Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:30 +02:00
Thomas Gleixner 6f124e0c58 printk: nobkl: Add write context storage for atomic writes
The number of consoles is unknown at compile time and allocating
write contexts on stack in emergency/panic situations is not desired
either.

Allocate a write context array (one for each priority level) along
with the per CPU output buffers, thus allowing atomic contexts on
multiple CPUs and priority levels to execute simultaneously without
clobbering each other's write context.

Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:29 +02:00
Thomas Gleixner f99552d8e4 printk: nobkl: Add printer thread wakeups
Add a function to wakeup the printer threads. Use the new function
when:

  - records are added to the printk ringbuffer
  - consoles are started
  - consoles are resumed

The actual waking is performed via irq_work so that the wakeup can
be triggered from any context.

Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:29 +02:00
Thomas Gleixner 3f58ef3573 printk: nobkl: Introduce printer threads
Add the infrastructure to create a printer thread per console along
with the required thread function, which is takeover/handover aware.

Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:29 +02:00
Thomas Gleixner fdf2dc1e54 printk: nobkl: Add emit function and callback functions for atomic printing
Implement an emit function for non-BKL consoles to output printk
messages. It utilizes the lockless printk_get_next_message() and
console_prepend_dropped() functions to retrieve/build the output
message. The emit function includes the required safety points to
check for handover/takeover and calls a new write_atomic callback
of the console driver to output the message. It also includes proper
handling for updating the non-BKL console sequence number.

Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:29 +02:00
Thomas Gleixner d803bcd90a printk: nobkl: Add print state functions
Provide three functions which are related to the safe handover
mechanism and allow console drivers to denote takeover unsafe
sections:

 - console_can_proceed()

   Invoked by a console driver to check whether a handover request
   is pending or whether the console was taken over in a hostile
   fashion.

 - console_enter/exit_unsafe()

   Invoked by a console driver to denote that the driver output
   function is about to enter or to leave an critical region where a
   hostile take over is unsafe. These functions are also
   cancellation points.

   The unsafe state is stored in the console state and allows a
   takeover attempt to make informed decisions whether to take over
   and/or output on such a console at all. The unsafe state is also
   available to the driver in the write context for the
   atomic_write() output function so the driver can make informed
   decisions about the required actions or take a special emergency
   path.

Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:29 +02:00
Thomas Gleixner 54388ba872 printk: nobkl: Add sequence handling
On 64bit systems the sequence tracking is embedded into the atomic
console state, on 32bit it has to be stored in a separate atomic
member. The latter needs to handle the non-atomicity in hostile
takeover cases, while 64bit can completely rely on the state
atomicity.

The ringbuffer sequence number is 64bit, but having a 32bit
representation in the console is sufficient. If a console ever gets
more than 2^31 records behind the ringbuffer then this is the least
of the problems.

On acquire() the atomic 32bit sequence number is expanded to 64 bit
by folding the ringbuffer's sequence into it carefully.

Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:28 +02:00
Thomas Gleixner 377c6aa303 printk: nobkl: Add buffer management
In case of hostile takeovers it must be ensured that the previous
owner cannot scribble over the output buffer of the emergency/panic
context. This is achieved by:

 - Adding a global output buffer instance for early boot (pre per CPU
   data being available).

 - Allocating an output buffer per console for threaded printers once
   printer threads become available.

 - Allocating per CPU output buffers per console for printing from
   all contexts not covered by the other buffers.

 - Choosing the appropriate buffer is handled in the acquire/release
   functions.

The output buffer is wrapped into a separate data structure so other
context related fields can be added in later steps.

Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:28 +02:00
Thomas Gleixner d387cf62ad printk: nobkl: Add acquire/release logic
Add per console acquire/release functionality. The console 'locked'
state is a combination of several state fields:

  - The 'locked' bit

  - The 'cpu' field that denotes on which CPU the console is locked

  - The 'cur_prio' field that contains the severity of the printk
    context that owns the console. This field is used for decisions
    whether to attempt friendly handovers and also prevents takeovers
    from a less severe context, e.g. to protect the panic CPU.

The acquire mechanism comes with several flavours:

  - Straight forward acquire when the console is not contended

  - Friendly handover mechanism based on a request/grant handshake

    The requesting context:

      1) Puts the desired handover state (CPU nr, prio) into a
         separate handover state

      2) Sets the 'req_prio' field in the real console state

      3) Waits (with a timeout) for the owning context to handover

    The owning context:

      1) Observes the 'req_prio' field set

      2) Hands the console over to the requesting context by
         switching the console state to the handover state that was
         provided by the requester

  - Hostile takeover

      The new owner takes the console over without handshake

      This is required when friendly handovers are not possible,
      i.e. the higher priority context interrupted the owning context
      on the same CPU or the owning context is not able to make
      progress on a remote CPU.

The release is the counterpart which either releases the console
directly or hands it gracefully over to a requester.

All operations on console::atomic_state[CUR|REQ] are atomic
cmpxchg based to handle concurrency.

The acquire/release functions implement only minimal policies:

  - Preference for higher priority contexts
  - Protection of the panic CPU

All other policy decisions have to be made at the call sites.

The design allows to implement the well known:

    acquire()
    output_one_line()
    release()

algorithm, but also allows to avoid the per line acquire/release for
e.g. panic situations by doing the acquire once and then relying on
the panic CPU protection for the rest.

Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:28 +02:00
Thomas Gleixner fe2ba02c78 printk: Add non-BKL console basic infrastructure
The current console/printk subsystem is protected by a Big Kernel Lock,
(aka console_lock) which has ill defined semantics and is more or less
stateless. This puts severe limitations on the console subsystem and
makes forced takeover and output in emergency and panic situations a
fragile endavour which is based on try and pray.

The goal of non-BKL consoles is to break out of the console lock jail
and to provide a new infrastructure that avoids the pitfalls and
allows console drivers to be gradually converted over.

The proposed infrastructure aims for the following properties:

  - Per console locking instead of global locking
  - Per console state which allows to make informed decisions
  - Stateful handover and takeover

As a first step state is added to struct console. The per console state
is an atomic_long_t with a 32bit bit field and on 64bit also a 32bit
sequence for tracking the last printed ringbuffer sequence number. On
32bit the sequence is separate from state for obvious reasons which
requires handling a few extra race conditions.

Reserve state bits, which will be populated later in the series. Wire
it up into the console register/unregister functionality and exclude
such consoles from being handled in the console BKL mechanisms. Since
the non-BKL consoles will not depend on the console lock/unlock dance
for printing, only perform said dance if a BKL console is registered.

The decision to use a bitfield was made as using a plain u32 with
mask/shift operations turned out to result in uncomprehensible code.

Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:28 +02:00
Sebastian Andrzej Siewior b9ca69dd0f time: Allow to preempt after a callback.
The TIMER_SOFTIRQ handler invokes timer callbacks of the expired timers.
Before each invocation the timer_base::lock is dropped. The only lock
that is still held is the timer_base::expiry_lock and the per-CPU
bh-lock as part of local_bh_disable(). The former is released as part
of lock up prevention if the timer is preempted by the caller which is
waiting for its completion.

Both locks are already released as part of timer_sync_wait_running().
This can be extended by also releasing in bh-lock. The timer core does
not rely on any state that is serialized by the bh-lock. The timer
callback expects the bh-state to be serialized by the lock but there is
no need to keep state synchronized while invoking multiple callbacks.

Preempt handling softirqs and release all locks after a timer invocation
if the current has inherited priority.

Link: https://lore.kernel.org/r/20230804113039.419794-4-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:28 +02:00
Sebastian Andrzej Siewior b2d0254952 softirq: Add function to preempt serving softirqs.
Add a functionality for the softirq handler to preempt its current work
if needed. The softirq core has no particular state. It reads and resets
the pending softirq bits and then processes one after the other.
It can already be preempted while it invokes a certain softirq handler.

By enabling the BH the softirq core releases the per-CPU bh lock which
serializes all softirq handler. It is safe to do as long as the code
does not expect any serialisation in between. A typical scenarion would
after the invocation of callback where no state needs to be preserved
before the next callback is invoked.

Add functionaliry to preempt the serving softirqs.

Link: https://lore.kernel.org/r/20230804113039.419794-3-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:28 +02:00
Sebastian Andrzej Siewior c6d3d3b3fe sched/core: Provide a method to check if a task is PI-boosted.
Provide a method to check if a task inherited the priority from another
task. This happens if a task owns a lock which is requested by a task
with higher priority. This can be used as a hint to add a preemption
point to the critical section.

Provide a function which reports true if the task is PI-boosted.

Link: https://lore.kernel.org/r/20230804113039.419794-2-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:27 +02:00
Sebastian Andrzej Siewior c15abad8f7 preempt: Put preempt_enable() within an instrumentation*() section.
Callers of preempt_enable() can be within an noinstr section leading to:
| vmlinux.o: warning: objtool: native_sched_clock+0x97: call to preempt_schedule_notrace_thunk() leaves .noinstr.text section
| vmlinux.o: warning: objtool: kvm_clock_read+0x22: call to preempt_schedule_notrace_thunk() leaves .noinstr.text section
| vmlinux.o: warning: objtool: local_clock+0xb4: call to preempt_schedule_notrace_thunk() leaves .noinstr.text section
| vmlinux.o: warning: objtool: enter_from_user_mode+0xea: call to preempt_schedule_thunk() leaves .noinstr.text section
| vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x140: call to preempt_schedule_thunk() leaves .noinstr.text section
| vmlinux.o: warning: objtool: syscall_enter_from_user_mode_prepare+0xf2: call to preempt_schedule_thunk() leaves .noinstr.text section
| vmlinux.o: warning: objtool: irqentry_enter_from_user_mode+0xea: call to preempt_schedule_thunk() leaves .noinstr.text section

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20230309072724.3F6zRkvw@linutronix.de
2023-09-13 09:10:27 +02:00
Mike Galbraith 047a6b7f0e zram: Replace bit spinlocks with spinlock_t for PREEMPT_RT.
The bit spinlock disables preemption. The spinlock_t lock becomes a sleeping
lock on PREEMPT_RT and it can not be acquired in this context. In this locked
section, zs_free() acquires a zs_pool::lock, and there is access to
zram::wb_limit_lock.

Use a spinlock_t on PREEMPT_RT for locking and set/ clear ZRAM_LOCK bit after
the lock has been acquired/ dropped.

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/YqIbMuHCPiQk+Ac2@linutronix.de
Link: https://lore.kernel.org/20230323161830.jFbWCosd@linutronix.de
2023-09-13 09:10:27 +02:00
Junxiao Chang 7119d7c8c7 softirq: Wake ktimers thread also in softirq.
If the hrtimer is raised while a softirq is processed then it does not
wake the corresponding ktimers thread. This is due to the optimisation in the
irq-exit path which is also used to wake the ktimers thread. For the other
softirqs, this is okay because the additional softirq bits will be handled by
the currently running softirq handler.
The timer related softirq bits are added to a different variable and rely on
the ktimers thread.
As a consuequence the wake up of ktimersd is delayed until the next timer tick.

Always wake the ktimers thread if a timer related softirq is pending.

Reported-by: Peh, Hock Zhang <hock.zhang.peh@intel.com>
Signed-off-by: Junxiao Chang <junxiao.chang@intel.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:27 +02:00
Frederic Weisbecker 2f6cac08f4 tick: Fix timer storm since introduction of timersd
If timers are pending while the tick is reprogrammed on nohz_mode, the
next expiry is not armed to fire now, it is delayed one jiffy forward
instead so as not to raise an inextinguishable timer storm with such
scenario:

1) IRQ triggers and queue a timer
2) ksoftirqd() is woken up
3) IRQ tail: timer is reprogrammed to fire now
4) IRQ exit
5) TIMER interrupt
6) goto 3)

...all that until we finally reach ksoftirqd.

Unfortunately we are checking the wrong softirq vector bitmask since
timersd kthread has split from ksoftirqd. Timers now have their own
vector state field that must be checked separately. As a result, the
old timer storm is back. This shows up early on boot with extremely long
initcalls:

	[  333.004807] initcall dquot_init+0x0/0x111 returned 0 after 323822879 usecs

and the cause is uncovered with the right trace events showing just
10 microseconds between ticks (~100 000 Hz):

|swapper/-1 1dn.h111 60818582us : hrtimer_expire_entry: hrtimer=00000000e0ef0f6b function=tick_sched_timer now=60415486608
|swapper/-1 1dn.h111 60818592us : hrtimer_expire_entry: hrtimer=00000000e0ef0f6b function=tick_sched_timer now=60415496082
|swapper/-1 1dn.h111 60818601us : hrtimer_expire_entry: hrtimer=00000000e0ef0f6b function=tick_sched_timer now=60415505550

Fix this by checking the right timer vector state from the nohz code.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20220405010752.1347437-2-frederic@kernel.org
2023-09-13 09:10:27 +02:00
Frederic Weisbecker 0b3cadf084 rcutorture: Also force sched priority to timersd on boosting test.
ksoftirqd is statically boosted to the priority level right above the
one of rcu_torture_boost() so that timers, which torture readers rely on,
get a chance to run while rcu_torture_boost() is polling.

However timers processing got split from ksoftirqd into their own kthread
(timersd) that isn't boosted. It has the same SCHED_FIFO low prio as
rcu_torture_boost() and therefore timers can't preempt it and may
starve.

The issue can be triggered in practice on v5.17.1-rt17 using:

	./kvm.sh --allcpus --configs TREE04 --duration 10m --kconfig "CONFIG_EXPERT=y CONFIG_PREEMPT_RT=y"

Fix this with statically boosting timersd just like is done with
ksoftirqd in commit
   ea6d962e80 ("rcutorture: Judge RCU priority boosting on grace periods, not callbacks")

Suggested-by: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20220405010752.1347437-1-frederic@kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:27 +02:00
Sebastian Andrzej Siewior 7f49c1dd9a softirq: Use a dedicated thread for timer wakeups.
A timer/hrtimer softirq is raised in-IRQ context. With threaded
interrupts enabled or on PREEMPT_RT this leads to waking the ksoftirqd
for the processing of the softirq.
Once the ksoftirqd is marked as pending (or is running) it will collect
all raised softirqs. This in turn means that a softirq which would have
been processed at the end of the threaded interrupt, which runs at an
elevated priority, is now moved to ksoftirqd which runs at SCHED_OTHER
priority and competes with every regular task for CPU resources.
This introduces long delays on heavy loaded systems and is not desired
especially if the system is not overloaded by the softirqs.

Split the TIMER_SOFTIRQ and HRTIMER_SOFTIRQ processing into a dedicated
timers thread and let it run at the lowest SCHED_FIFO priority.
RT tasks are are woken up from hardirq context so only timer_list timers
and hrtimers for "regular" tasks are processed here. The higher priority
ensures that wakeups are performed before scheduling SCHED_OTHER tasks.

Using a dedicated variable to store the pending softirq bits values
ensure that the timer are not accidentally picked up by ksoftirqd and
other threaded interrupts.
It shouldn't be picked up by ksoftirqd since it runs at lower priority.
However if the timer bits are ORed while a threaded interrupt is
running, then the timer softirq would be performed at higher priority.
The new timer thread will block on the softirq lock before it starts
softirq work. This "race window" isn't closed because while timer thread
is performing the softirq it can get PI-boosted via the softirq lock by
a random force-threaded thread.
The timer thread can pick up pending softirqs from ksoftirqd but only
if the softirq load is high. It is not be desired that the picked up
softirqs are processed at SCHED_FIFO priority under high softirq load
but this can already happen by a PI-boost by a force-threaded interrupt.

Reported-by: kernel test robot <lkp@intel.com> [ static timer_threads ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:26 +02:00
Sebastian Andrzej Siewior baf33250e0 sched/rt: Don't try push tasks if there are none.
I have a RT task X at a high priority and cyclictest on each CPU with
lower priority than X's. If X is active and each CPU wakes their own
cylictest thread then it ends in a longer rto_push storm.
A random CPU determines via balance_rt() that the CPU on which X is
running needs to push tasks. X has the highest priority, cyclictest is
next in line so there is nothing that can be done since the task with
the higher priority is not touched.

tell_cpu_to_push() increments rto_loop_next and schedules
rto_push_irq_work_func() on X's CPU. The other CPUs also increment the
loop counter and do the same. Once rto_push_irq_work_func() is active it
does nothing because it has _no_ pushable tasks on its runqueue. Then
checks rto_next_cpu() and decides to queue irq_work on the local CPU
because another CPU requested a push by incrementing the counter.

I have traces where ~30 CPUs request this ~3 times each before it
finally ends. This greatly increases X's runtime while X isn't making
much progress.

Teach rto_next_cpu() to only return CPUs which also have tasks on their
runqueue which can be pushed away. This does not reduce the
tell_cpu_to_push() invocations (rto_loop_next counter increments) but
reduces the amount of issued rto_push_irq_work_func() if nothing can be
done. As the result the overloaded CPU is blocked less often.

There are still cases where the "same job" is repeated several times
(for instance the current CPU needs to resched but didn't yet because
the irq-work is repeated a few times and so the old task remains on the
CPU) but the majority of request end in tell_cpu_to_push() before an IPI
is issued.

Reviewed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20230801152648._y603AS_@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:26 +02:00
Sebastian Andrzej Siewior 55a38d0ab4 x86: Enable RT also on 32bit
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-09-13 09:10:26 +02:00
Sebastian Andrzej Siewior 3ea3c069f2 x86: Allow to enable RT
Allow to select RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-09-13 09:10:26 +02:00
Sebastian Andrzej Siewior 1f7514c9c7 net: Avoid the IPI to free the
skb_attempt_defer_free() collects a skbs, which was allocated on a
remote CPU, on a per-CPU list. These skbs are either freed on that
remote CPU once the CPU enters NET_RX or an remote IPI function is
invoked in to raise the NET_RX softirq if a threshold of pending skb has
been exceeded.
This remote IPI can cause the wakeup of ksoftirqd on PREEMPT_RT if the
remote CPU idle was idle. This is undesired because once the ksoftirqd
is running it will acquire all pending softirqs and they will not be
executed as part of the threaded interrupt until ksoftird goes idle
again.

To void all this, schedule the deferred clean up from a worker.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:26 +02:00
Sebastian Andrzej Siewior ecc2d44f55 seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
It was brought up by Tetsuo that the following sequence
   write_seqlock_irqsave()
   printk_deferred_enter()

could lead to a deadlock if the lockdep annotation within
write_seqlock_irqsave() triggers. The problem is that the sequence
counter is incremented before the lockdep annotation is performed. The
lockdep splat would then attempt to invoke printk() but the reader side,
of the same seqcount, could have a tty_port::lock acquired waiting for
the sequence number to become even again.

The other lockdep annotations come before the actual locking because "we
want to see the locking error before it happens". There is no reason why
seqcount should be different here.

Do the lockdep annotation first then perform the locking operation (the
sequence increment).

Fixes: 1ca7d67cf5 ("seqcount: Add lockdep functionality to seqcount/seqlock structures")
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: https://lore.kernel.org/20230621130641.-5iueY1I@linutronix.de
Link: https://lore.kernel.org/r/20230623171232.892937-2-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:25 +02:00
Sebastian Andrzej Siewior 3b1c90abea signal: Don't disable preemption in ptrace_stop() on PREEMPT_RT.
On PREEMPT_RT keeping preemption disabled during the invocation of
cgroup_enter_frozen() is a problem because the function acquires css_set_lock
which is a sleeping lock on PREEMPT_RT and must not be acquired with disabled
preemption.
The preempt-disabled section is only for performance optimisation
reasons and can be avoided.

Extend the comment and don't disable preemption before scheduling on
PREEMPT_RT.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20230803100932.325870-3-bigeasy@linutronix.de
2023-09-13 09:10:25 +02:00
Sebastian Andrzej Siewior bf1069f8c0 signal: Add proper comment about the preempt-disable in ptrace_stop().
Commit 53da1d9456 ("fix ptrace slowness") added a preempt-disable section
between read_unlock() and the following schedule() invocation without
explaining why it is needed.

Replace the comment with an explanation why this is needed. Clarify that
it is needed for correctness but for performance reasons.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20230803100932.325870-2-bigeasy@linutronix.de
2023-09-13 09:10:25 +02:00
Sebastian Andrzej Siewior 701e060705 locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
After rt_mutex_wait_proxy_lock() task_struct::pi_blocked_on is cleared
if current owns the lock. If the operation has been interrupted by a
signal or timeout then pi_blocked_on can be set. This means spin_lock()
*can* overwrite pi_blocked_on on PREEMPT_RT. This has been noticed by
the recently added lockdep-asserts…

The rt_mutex_cleanup_proxy_lock() operation will clear pi_blocked_on
(and update pending waiters as expected) but it must happen under the hb
lock to ensure the same state in rtmutex and userland.

Given all the possibilities it is probably the simplest option to
try-lock the hb lock. In case the lock is occupied a quick nap is
needed. A busy loop can lock up the system if performed by a task with
high priorioty preventing the owner from running.

The rt_mutex_post_schedule() needs to be put before try-lock-loop
because otherwie the schedule() in schedule_hrtimeout() will trip over
the !sched_rt_mutex assert.

Introduce futex_trylock_hblock() to try-lock the hb lock and sleep until
the try-lock operation succeeds. Use it after rt_mutex_wait_proxy_lock()
to acquire the lock.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230831095314.fTliy0Bh@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:25 +02:00
Thomas Gleixner b945fd3a9d locking/rtmutex: Add a lockdep assert to catch potential nested blocking
There used to be a BUG_ON(current->pi_blocked_on) in the lock acquisition
functions, but that vanished in one of the rtmutex overhauls.

Bring it back in form of a lockdep assert to catch code paths which take
rtmutex based locks with current::pi_blocked_on != NULL.

Reported-by: Crystal Wood <swood@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230427111937.2745231-5-bigeasy@linutronix.de
Link: https://lore.kernel.org/r/20230815111430.488430699@infradead.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:25 +02:00
Sebastian Andrzej Siewior 7529b7e0ae locking/rtmutex: Use rt_mutex specific scheduler helpers
Have rt_mutex use the rt_mutex specific scheduler helpers to avoid
recursion vs rtlock on the PI state.

[[ peterz: adapted to new names ]]

Reported-by: Crystal Wood <swood@redhat.com>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230815111430.421408298@infradead.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:25 +02:00
Peter Zijlstra ba05f9cd37 sched: Provide rt_mutex specific scheduler helpers
With PREEMPT_RT there is a rt_mutex recursion problem where
sched_submit_work() can use an rtlock (aka spinlock_t). More
specifically what happens is:

  mutex_lock() /* really rt_mutex */
    ...
      __rt_mutex_slowlock_locked()
	task_blocks_on_rt_mutex()
          // enqueue current task as waiter
          // do PI chain walk
        rt_mutex_slowlock_block()
          schedule()
            sched_submit_work()
              ...
              spin_lock() /* really rtlock */
                ...
                  __rt_mutex_slowlock_locked()
                    task_blocks_on_rt_mutex()
                      // enqueue current task as waiter *AGAIN*
                      // *CONFUSION*

Fix this by making rt_mutex do the sched_submit_work() early, before
it enqueues itself as a waiter -- before it even knows *if* it will
wait.

[[ basically Thomas' patch but with different naming and a few asserts
   added ]]

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230815111430.355375399@infradead.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:24 +02:00
Thomas Gleixner 04e68a27e5 sched: Extract __schedule_loop()
There are currently two implementations of this basic __schedule()
loop, and there is soon to be a third.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230427111937.2745231-2-bigeasy@linutronix.de
Link: https://lore.kernel.org/r/20230815111430.288063671@infradead.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:24 +02:00
Sebastian Andrzej Siewior 47deef3dbc locking/rtmutex: Avoid unconditional slowpath for DEBUG_RT_MUTEXES
With DEBUG_RT_MUTEXES enabled the fast-path rt_mutex_cmpxchg_acquire()
always fails and all lock operations take the slow path.

Provide a new helper inline rt_mutex_try_acquire() which maps to
rt_mutex_cmpxchg_acquire() in the non-debug case. For the debug case
it invokes rt_mutex_slowtrylock() which can acquire a non-contended
rtmutex under full debug coverage.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230427111937.2745231-4-bigeasy@linutronix.de
Link: https://lore.kernel.org/r/20230815111430.220899937@infradead.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:24 +02:00
Peter Zijlstra a70db40b7a sched: Constrain locks in sched_submit_work()
Even though sched_submit_work() is ran from preemptible context,
it is discouraged to have it use blocking locks due to the recursion
potential.

Enforce this.

Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230815111430.154558666@infradead.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-09-13 09:10:24 +02:00
Linus Torvalds 0bb80ecc33 Linux 6.6-rc1 v6.6-rc1 2023-09-10 16:28:41 -07:00
Linus Torvalds 1548b060d6 Merge tag 'topic/drm-ci-2023-08-31-1' of git://anongit.freedesktop.org/drm/drm
Pull drm ci scripts from Dave Airlie:
 "This is a bunch of ci integration for the freedesktop gitlab instance
  where we currently do upstream userspace testing on diverse sets of
  GPU hardware. From my perspective I think it's an experiment worth
  going with and seeing how the benefits/noise playout keeping these
  files useful.

  Ideally I'd like to get this so we can do pre-merge testing on PRs
  eventually.

  Below is some info from danvet on why we've ended up making the
  decision and how we can roll it back if we decide it was a bad plan.

  Why in upstream?

   - like documentation, testcases, tools CI integration is one of these
     things where you can waste endless amounts of time if you
     accidentally have a version that doesn't match your source code

   - but also like the above, there's a balance, this is the initial cut
     of what we think makes sense to keep in sync vs out-of-tree,
     probably needs adjustment

   - gitlab supports out-of-repo gitlab integration and that's what's
     been used for the kernel in drm, but it results in per-driver
     fragmentation and lots of duplicated effort. the simple act of
     smashing an arbitrary winner into a topic branch already started
     surfacing patches on dri-devel and sparking good cross driver team
     discussions

  Why gitlab?

   - it's not any more shit than any of the other CI

   - drm userspace uses it extensively for everything in userspace, we
     have a lot of people and experience with this, including
     integration of hw testing labs

   - media userspace like gstreamer is also on gitlab.fd.o, and there's
     discussion to extend this to the media subsystem in some fashion

  Can this be shared?

   - there's definitely a pile of code that could move to scripts/ if
     other subsystem adopt ci integration in upstream kernel git. other
     bits are more drm/gpu specific like the igt-gpu-tests/tools
     integration

   - docker images can be run locally or in other CI runners

  Will we regret this?

   - it's all in one directory, intentionally, for easy deletion

   - probably 1-2 years in upstream to see whether this is worth it or a
     Big Mistake. that's roughly what it took to _really_ roll out solid
     CI in the bigger userspace projects we have on gitlab.fd.o like
     mesa3d"

* tag 'topic/drm-ci-2023-08-31-1' of git://anongit.freedesktop.org/drm/drm:
  drm: ci: docs: fix build warning - add missing escape
  drm: Add initial ci/ subdirectory
2023-09-10 11:55:26 -07:00
Linus Torvalds e56b2b6057 Merge tag 'x86-urgent-2023-09-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "Fix preemption delays in the SGX code, remove unnecessarily
  UAPI-exported code, fix a ld.lld linker (in)compatibility quirk and
  make the x86 SMP init code a bit more conservative to fix kexec()
  lockups"

* tag 'x86-urgent-2023-09-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sgx: Break up long non-preemptible delays in sgx_vepc_release()
  x86: Remove the arch_calc_vm_prot_bits() macro from the UAPI
  x86/build: Fix linker fill bytes quirk/incompatibility for ld.lld
  x86/smp: Don't send INIT to non-present and non-booted CPUs
2023-09-10 10:39:31 -07:00
Linus Torvalds e79dbf03d8 Merge tag 'perf-urgent-2023-09-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf event fix from Ingo Molnar:
 "Work around a firmware bug in the uncore PMU driver, affecting certain
  Intel systems"

* tag 'perf-urgent-2023-09-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/uncore: Correct the number of CHAs on EMR
2023-09-10 10:34:46 -07:00
Linus Torvalds 535a265d7f Merge tag 'perf-tools-for-v6.6-1-2023-09-05' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
Pull perf tools updates from Arnaldo Carvalho de Melo:
 "perf tools maintainership:

   - Add git information for perf-tools and perf-tools-next trees and
     branches to the MAINTAINERS file. That is where development now
     takes place and myself and Namhyung Kim have write access, more
     people to come as we emulate other maintainer groups.

  perf record:

   - Record kernel data maps when 'perf record --data' is used, so that
     global variables can be resolved and used in tools that do data
     profiling.

  perf trace:

   - Remove the old, experimental support for BPF events in which a .c
     file was passed as an event: "perf trace -e hello.c" to then get
     compiled and loaded.

     The only known usage for that, that shipped with the kernel as an
     example for such events, augmented the raw_syscalls tracepoints and
     was converted to a libbpf skeleton, reusing all the user space
     components and the BPF code connected to the syscalls.

     In the end just the way to glue the BPF part and the user space
     type beautifiers changed, now being performed by libbpf skeletons.

     The next step is to use BTF to do pretty printing of all syscall
     types, as discussed with Alan Maguire and others.

     Now, on a perf built with BUILD_BPF_SKEL=1 we get most if not all
     path/filenames/strings, some of the networking data structures,
     perf_event_attr, etc, i.e. systemwide tracing of nanosleep calls
     and perf_event_open syscalls while 'perf stat' runs 'sleep' for 5
     seconds:

      # perf trace -a -e *nanosleep,perf* perf stat -e cycles,instructions sleep 5
         0.000 (   9.034 ms): perf/327641 perf_event_open(attr_uptr: { type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0 (PERF_COUNT_HW_CPU_CYCLES), sample_type: IDENTIFIER, read_format: TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING, disabled: 1, inherit: 1, enable_on_exec: 1, exclude_guest: 1 }, pid: 327642 (perf), cpu: -1, group_fd: -1, flags: FD_CLOEXEC) = 3
         9.039 (   0.006 ms): perf/327641 perf_event_open(attr_uptr: { type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0x1 (PERF_COUNT_HW_INSTRUCTIONS), sample_type: IDENTIFIER, read_format: TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING, disabled: 1, inherit: 1, enable_on_exec: 1, exclude_guest: 1 }, pid: 327642 (perf-exec), cpu: -1, group_fd: -1, flags: FD_CLOEXEC) = 4
             ? (           ): gpm/991  ... [continued]: clock_nanosleep())               = 0
        10.133 (           ): sleep/327642 clock_nanosleep(rqtp: { .tv_sec: 5, .tv_nsec: 0 }, rmtp: 0x7ffd36f83ed0) ...
             ? (           ): pool-gsd-smart/3051  ... [continued]: clock_nanosleep())   = 0
        30.276 (           ): gpm/991 clock_nanosleep(rqtp: { .tv_sec: 2, .tv_nsec: 0 }, rmtp: 0x7ffcc6f73710) ...
       223.215 (1000.430 ms): pool-gsd-smart/3051 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7f6e7fffec90) = 0
        30.276 (2000.394 ms): gpm/991  ... [continued]: clock_nanosleep())               = 0
      1230.814 (           ): pool-gsd-smart/3051 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7f6e7fffec90) ...
      1230.814 (1000.404 ms): pool-gsd-smart/3051  ... [continued]: clock_nanosleep())   = 0
      2030.886 (           ): gpm/991 clock_nanosleep(rqtp: { .tv_sec: 2, .tv_nsec: 0 }, rmtp: 0x7ffcc6f73710) ...
      2237.709 (1000.153 ms): pool-gsd-smart/3051 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7f6e7fffec90) = 0
             ? (           ): crond/1172  ... [continued]: clock_nanosleep())            = 0
      3242.699 (           ): pool-gsd-smart/3051 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7f6e7fffec90) ...
      2030.886 (2000.385 ms): gpm/991  ... [continued]: clock_nanosleep())               = 0
      3728.078 (           ): crond/1172 clock_nanosleep(rqtp: { .tv_sec: 60, .tv_nsec: 0 }, rmtp: 0x7ffe0971dcf0) ...
      3242.699 (1000.158 ms): pool-gsd-smart/3051  ... [continued]: clock_nanosleep())   = 0
      4031.409 (           ): gpm/991 clock_nanosleep(rqtp: { .tv_sec: 2, .tv_nsec: 0 }, rmtp: 0x7ffcc6f73710) ...
        10.133 (5000.375 ms): sleep/327642  ... [continued]: clock_nanosleep())          = 0

      Performance counter stats for 'sleep 5':

             2,617,347      cycles
             1,855,997      instructions                     #    0.71  insn per cycle

           5.002282128 seconds time elapsed

           0.000855000 seconds user
           0.000852000 seconds sys

  perf annotate:

   - Building with binutils' libopcode now is opt-in (BUILD_NONDISTRO=1)
     for licensing reasons, and we missed a build test on
     tools/perf/tests makefile.

     Since we now default to NDEBUG=1, we ended up segfaulting when
     building with BUILD_NONDISTRO=1 because a needed initialization
     routine was being "error checked" via an assert.

     Fix it by explicitly checking the result and aborting instead if it
     fails.

     We better back propagate the error, but at least 'perf annotate' on
     samples collected for a BPF program is back working when perf is
     built with BUILD_NONDISTRO=1.

  perf report/top:

   - Add back TUI hierarchy mode header, that is seen when using 'perf
     report/top --hierarchy'.

   - Fix the number of entries for 'e' key in the TUI that was
     preventing navigation of lines when expanding an entry.

  perf report/script:

   - Support cross platform register handling, allowing a perf.data file
     collected on one architecture to have registers sampled correctly
     displayed when analysis tools such as 'perf report' and 'perf
     script' are used on a different architecture.

   - Fix handling of event attributes in pipe mode, i.e. when one uses:

  	perf record -o - | perf report -i -

     When no perf.data files are used.

   - Handle files generated via pipe mode with a version of perf and
     then read also via pipe mode with a different version of perf,
     where the event attr record may have changed, use the record size
     field to properly support this version mismatch.

  perf probe:

   - Accessing global variables from uprobes isn't supported, make the
     error message state that instead of stating that some minimal
     kernel version is needed to have that feature. This seems just a
     tool limitation, the kernel probably has all that is needed.

  perf tests:

   - Fix a reference count related leak in the dlfilter v0 API where the
     result of a thread__find_symbol_fb() is not matched with an
     addr_location__exit() to drop the reference counts of the resolved
     components (machine, thread, map, symbol, etc). Add a dlfilter test
     to make sure that doesn't regresses.

   - Lots of fixes for the 'perf test' written in shell script related
     to problems found with the shellcheck utility.

   - Fixes for 'perf test' shell scripts testing features enabled when
     perf is built with BUILD_BPF_SKEL=1, such as 'perf stat' bpf
     counters.

   - Add perf record sample filtering test, things like the following
     example, that gets implemented as a BPF filter attached to the
     event:

       # perf record -e task-clock -c 10000 --filter 'ip < 0xffffffff00000000'

   - Improve the way the task_analyzer test checks if libtraceevent is
     linked, using 'perf version --build-options' instead of the more
     expensinve 'perf record -e "sched:sched_switch"'.

   - Add support for riscv in the mmap-basic test. (This went as well
     via the RiscV tree, same contents).

  libperf:

   - Implement riscv mmap support (This went as well via the RiscV tree,
     same contents).

  perf script:

   - New tool that converts perf.data files to the firefox profiler
     format so that one can use the visualizer at
     https://profiler.firefox.com/. Done by Anup Sharma as part of this
     year's Google Summer of Code.

     One can generate the output and upload it to the web interface but
     Anup also automated everything:

       perf script gecko -F 99 -a sleep 60

   - Support syscall name parsing on arm64.

   - Print "cgroup" field on the same line as "comm".

  perf bench:

   - Add new 'uprobe' benchmark to measure the overhead of uprobes
     with/without BPF programs attached to it.

   - breakpoints are not available on power9, skip that test.

  perf stat:

   - Add #num_cpus_online literal to be used in 'perf stat' metrics, and
     add this extra 'perf test' check that exemplifies its purpose:

  	TEST_ASSERT_VAL("#num_cpus_online",
                         expr__parse(&num_cpus_online, ctx, "#num_cpus_online") == 0);
  	TEST_ASSERT_VAL("#num_cpus", expr__parse(&num_cpus, ctx, "#num_cpus") == 0);
  	TEST_ASSERT_VAL("#num_cpus >= #num_cpus_online", num_cpus >= num_cpus_online);

  Miscellaneous:

   - Improve tool startup time by lazily reading PMU, JSON, sysfs data.

   - Improve error reporting in the parsing of events, passing YYLTYPE
     to error routines, so that the output can show were the parsing
     error was found.

   - Add 'perf test' entries to check the parsing of events
     improvements.

   - Fix various leak for things detected by -fsanitize=address, mostly
     things that would be freed at tool exit, including:

       - Free evsel->filter on the destructor.

       - Allow tools to register a thread->priv destructor and use it in
         'perf trace'.

       - Free evsel->priv in 'perf trace'.

       - Free string returned by synthesize_perf_probe_point() when the
         caller fails to do all it needs.

   - Adjust various compiler options to not consider errors some
     warnings when building with broken headers found in things like
     python, flex, bison, as we otherwise build with -Werror. Some for
     gcc, some for clang, some for some specific version of those, some
     for some specific version of flex or bison, or some specific
     combination of these components, bah.

   - Allow customization of clang options for BPF target, this helps
     building on gentoo where there are other oddities where BPF targets
     gets passed some compiler options intended for the native build, so
     building with WERROR=0 helps while these oddities are fixed.

   - Dont pass ERR_PTR() values to perf_session__delete() in 'perf top'
     and 'perf lock', fixing some segfaults when handling some odd
     failures.

   - Add LTO build option.

   - Fix format of unordered lists in the perf docs
     (tools/perf/Documentation)

   - Overhaul the bison files, using constructs such as YYNOMEM.

   - Remove unused tokens from the bison .y files.

   - Add more comments to various structs.

   - A few LoongArch enablement patches.

  Vendor events (JSON):

   - Add JSON metrics for Yitian 710 DDR (aarch64). Things like:

  	EventName, BriefDescription
  	visible_window_limit_reached_rd, "At least one entry in read queue reaches the visible window limit.",
  	visible_window_limit_reached_wr, "At least one entry in write queue reaches the visible window limit.",
  	op_is_dqsosc_mpc	       , "A DQS Oscillator MPC command to DRAM.",
  	op_is_dqsosc_mrr	       , "A DQS Oscillator MRR command to DRAM.",
  	op_is_tcr_mrr		       , "A Temperature Compensated Refresh(TCR) MRR command to DRAM.",

   - Add AmpereOne metrics (aarch64).

   - Update N2 and V2 metrics (aarch64) and events using Arm telemetry
     repo.

   - Update scale units and descriptions of common topdown metrics on
     aarch64. Things like:
       - "MetricExpr": "stall_slot_frontend / (#slots * cpu_cycles)",
       - "BriefDescription": "Frontend bound L1 topdown metric",
       + "MetricExpr": "100 * (stall_slot_frontend / (#slots * cpu_cycles))",
       + "BriefDescription": "This metric is the percentage of total slots that were stalled due to resource constraints in the frontend of the processor.",

   - Update events for intel: meteorlake to 1.04, sapphirerapids to
     1.15, Icelake+ metric constraints.

   - Update files for the power10 platform"

* tag 'perf-tools-for-v6.6-1-2023-09-05' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (217 commits)
  perf parse-events: Fix driver config term
  perf parse-events: Fixes relating to no_value terms
  perf parse-events: Fix propagation of term's no_value when cloning
  perf parse-events: Name the two term enums
  perf list: Don't print Unit for "default_core"
  perf vendor events intel: Fix modifier in tma_info_system_mem_parallel_reads for skylake
  perf dlfilter: Avoid leak in v0 API test use of resolve_address()
  perf metric: Add #num_cpus_online literal
  perf pmu: Remove str from perf_pmu_alias
  perf parse-events: Make common term list to strbuf helper
  perf parse-events: Minor help message improvements
  perf pmu: Avoid uninitialized use of alias->str
  perf jevents: Use "default_core" for events with no Unit
  perf test stat_bpf_counters_cgrp: Enhance perf stat cgroup BPF counter test
  perf test shell stat_bpf_counters: Fix test on Intel
  perf test shell record_bpf_filter: Skip 6.2 kernel
  libperf: Get rid of attr.id field
  perf tools: Convert to perf_record_header_attr_id()
  libperf: Add perf_record_header_attr_id()
  perf tools: Handle old data in PERF_RECORD_ATTR
  ...
2023-09-09 20:06:17 -07:00
Linus Torvalds fd3a5940e6 Merge tag '6.6-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:

 - six smb3 client fixes including ones to allow controlling smb3
   directory caching timeout and limits, and one debugging improvement

 - one fix for nls Kconfig (don't need to expose NLS_UCS2_UTILS option)

 - one minor spnego registry update

* tag '6.6-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
  spnego: add missing OID to oid registry
  smb3: fix minor typo in SMB2_GLOBAL_CAP_LARGE_MTU
  cifs: update internal module version number for cifs.ko
  smb3: allow controlling maximum number of cached directories
  smb3: add trace point for queryfs (statfs)
  nls: Hide new NLS_UCS2_UTILS
  smb3: allow controlling length of time directory entries are cached with dir leases
  smb: propagate error code of extract_sharename()
2023-09-09 19:56:23 -07:00
David Howells a3c57ab79a iov_iter: Kunit tests for page extraction
Add some kunit tests for page extraction for ITER_BVEC, ITER_KVEC and
ITER_XARRAY type iterators.  ITER_UBUF and ITER_IOVEC aren't dealt with
as they require userspace VM interaction.  ITER_DISCARD isn't dealt with
either as that can't be extracted.

Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-09-09 15:11:49 -07:00
David Howells 2d71340ff1 iov_iter: Kunit tests for copying to/from an iterator
Add some kunit tests for page extraction for ITER_BVEC, ITER_KVEC and
ITER_XARRAY type iterators.  ITER_UBUF and ITER_IOVEC aren't dealt with
as they require userspace VM interaction.  ITER_DISCARD isn't dealt with
either as that does nothing.

Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-09-09 15:11:49 -07:00