twx-linux/kernel
Linus Torvalds 00845eb968 sched: don't cause task state changes in nested sleep debugging
Commit 8eb23b9f35aa ("sched: Debug nested sleeps") added code to report
on nested sleep conditions, which we generally want to avoid because the
inner sleeping operation can re-set the thread state to TASK_RUNNING,
but that will then cause the outer sleep loop not actually sleep when it
calls schedule.

However, that's actually valid traditional behavior, with the inner
sleep being some fairly rare case (like taking a sleeping lock that
normally doesn't actually need to sleep).

And the debug code would actually change the state of the task to
TASK_RUNNING internally, which makes that kind of traditional and
working code not work at all, because now the nested sleep doesn't just
sometimes cause the outer one to not block, but will cause it to happen
every time.

In particular, it will cause the cardbus kernel daemon (pccardd) to
basically busy-loop doing scheduling, converting a laptop into a heater,
as reported by Bruno Prémont.  But there may be other legacy uses of
that nested sleep model in other drivers that are also likely to never
get converted to the new model.

This fixes both cases:

 - don't set TASK_RUNNING when the nested condition happens (note: even
   if WARN_ONCE() only _warns_ once, the return value isn't whether the
   warning happened, but whether the condition for the warning was true.
   So despite the warning only happening once, the "if (WARN_ON(..))"
   would trigger for every nested sleep.

 - in the cases where we knowingly disable the warning by using
   "sched_annotate_sleep()", don't change the task state (that is used
   for all core scheduling decisions), instead use '->task_state_change'
   that is used for the debugging decision itself.

(Credit for the second part of the fix goes to Oleg Nesterov: "Can't we
avoid this subtle change in behaviour DEBUG_ATOMIC_SLEEP adds?" with the
suggested change to use 'task_state_change' as part of the test)

Reported-and-bisected-by: Bruno Prémont <bonbons@linux-vserver.org>
Tested-by: Rafael J Wysocki <rjw@rjwysocki.net>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>,
Cc: Ilya Dryomov <ilya.dryomov@inktank.com>,
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>,
Cc: Davidlohr Bueso <dave@stgolabs.net>,
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-01 12:23:32 -08:00
..
bpf Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-01-27 13:55:36 -08:00
configs
debug Surprising number of fixes this merge window :( 2015-01-23 06:40:36 +12:00
events perf: Tighten (and fix) the grouping condition 2015-01-28 13:17:35 +01:00
gcov
irq
locking
power
printk
rcu
sched sched: don't cause task state changes in nested sleep debugging 2015-02-01 12:23:32 -08:00
time Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-01-25 17:47:34 -08:00
trace This holds a few fixes to the ftrace infrastructure as well as 2015-01-17 07:55:52 +13:00
.gitignore
acct.c
async.c
audit_tree.c
audit_watch.c
audit.c
audit.h
auditfilter.c
auditsc.c
backtracetest.c
bounds.c
capability.c
cgroup_freezer.c
cgroup.c cgroup: prevent mount hang due to memory controller lifetime 2015-01-22 10:26:43 -05:00
compat.c
configs.c
context_tracking.c
cpu_pm.c
cpu.c
cpuset.c
crash_dump.c
cred.c
delayacct.c
dma.c
elfcore.c
exec_domain.c
exit.c
extable.c
fork.c
freezer.c
futex_compat.c
futex.c
groups.c
hung_task.c
irq_work.c
jump_label.c
kallsyms.c
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kexec.c
kmod.c
kprobes.c module: remove mod arg from module_free, rename module_memfree(). 2015-01-20 11:38:33 +10:30
ksysfs.c
kthread.c
latencytop.c
Makefile
module_signing.c
module-internal.h
module.c module: make module_refcount() a signed integer. 2015-01-22 11:15:54 +10:30
notifier.c
nsproxy.c
padata.c
panic.c
params.c param: fix uninitialized read with CONFIG_DEBUG_LOCK_ALLOC 2015-01-20 11:38:31 +10:30
pid_namespace.c
pid.c
profile.c
ptrace.c
range.c kernel: avoid overflow in cmp_range 2015-01-17 10:02:23 +13:00
reboot.c
relay.c
resource.c
seccomp.c
signal.c
smp.c
smpboot.c
smpboot.h
softirq.c
stacktrace.c
stop_machine.c
sys_ni.c
sys.c x86, mpx: Strictly enforce empty prctl() args 2015-01-22 21:11:06 +01:00
sysctl_binary.c
sysctl.c
system_certificates.S
system_keyring.c
task_work.c
taskstats.c
test_kprobes.c
torture.c
tracepoint.c
tsacct.c
uid16.c
up.c
user_namespace.c
user-return-notifier.c
user.c
utsname_sysctl.c
utsname.c
watchdog.c
workqueue_internal.h
workqueue.c workqueue: fix subtle pool management issue which can stall whole worker_pool 2015-01-16 14:21:16 -05:00