twx-linux/Documentation
Paul E. McKenney 96d3fd0d31 rcu: Break call_rcu() deadlock involving scheduler and perf
Dave Jones got the following lockdep splat:

>  ======================================================
>  [ INFO: possible circular locking dependency detected ]
>  3.12.0-rc3+ #92 Not tainted
>  -------------------------------------------------------
>  trinity-child2/15191 is trying to acquire lock:
>   (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
>
> but task is already holding lock:
>   (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 (&ctx->lock){-.-...}:
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
>         [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0
>         [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0
>         [<ffffffff81732052>] __schedule+0x1d2/0xa20
>         [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0
>         [<ffffffff817352b6>] retint_kernel+0x26/0x30
>         [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50
>         [<ffffffff813f0504>] pty_write+0x54/0x60
>         [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0
>         [<ffffffff813e5838>] tty_write+0x158/0x2d0
>         [<ffffffff811c4850>] vfs_write+0xc0/0x1f0
>         [<ffffffff811c52cc>] SyS_write+0x4c/0xa0
>         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> -> #2 (&rq->lock){-.-.-.}:
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
>         [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0
>         [<ffffffff81054336>] do_fork+0x126/0x460
>         [<ffffffff81054696>] kernel_thread+0x26/0x30
>         [<ffffffff8171ff93>] rest_init+0x23/0x140
>         [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403
>         [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c
>         [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4
>
> -> #1 (&p->pi_lock){-.-.-.}:
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
>         [<ffffffff810979d1>] try_to_wake_up+0x31/0x350
>         [<ffffffff81097d62>] default_wake_function+0x12/0x20
>         [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40
>         [<ffffffff8108ea38>] __wake_up_common+0x58/0x90
>         [<ffffffff8108ff59>] __wake_up+0x39/0x50
>         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
>         [<ffffffff81111450>] __call_rcu+0x140/0x820
>         [<ffffffff81111b8d>] call_rcu+0x1d/0x20
>         [<ffffffff81093697>] cpu_attach_domain+0x287/0x360
>         [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0
>         [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a
>         [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202
>         [<ffffffff817200be>] kernel_init+0xe/0x190
>         [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0
>
> -> #0 (&rdp->nocb_wq){......}:
>         [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
>         [<ffffffff8108ff43>] __wake_up+0x23/0x50
>         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
>         [<ffffffff81111450>] __call_rcu+0x140/0x820
>         [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
>         [<ffffffff81149abf>] put_ctx+0x4f/0x70
>         [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
>         [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
>         [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
>         [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
>         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> other info that might help us debug this:
>
> Chain exists of:
>   &rdp->nocb_wq --> &rq->lock --> &ctx->lock
>
>   Possible unsafe locking scenario:
>
>         CPU0                    CPU1
>         ----                    ----
>    lock(&ctx->lock);
>                                 lock(&rq->lock);
>                                 lock(&ctx->lock);
>    lock(&rdp->nocb_wq);
>
>  *** DEADLOCK ***
>
> 1 lock held by trinity-child2/15191:
>  #0:  (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> stack backtrace:
> CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92
>  ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
>  ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
>  ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
> Call Trace:
>  [<ffffffff8172a363>] dump_stack+0x4e/0x82
>  [<ffffffff81726741>] print_circular_bug+0x200/0x20f
>  [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
>  [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60
>  [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80
>  [<ffffffff810cc243>] lock_acquire+0x93/0x200
>  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
>  [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
>  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
>  [<ffffffff8108ff43>] __wake_up+0x23/0x50
>  [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
>  [<ffffffff81111450>] __call_rcu+0x140/0x820
>  [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50
>  [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
>  [<ffffffff81149abf>] put_ctx+0x4f/0x70
>  [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
>  [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
>  [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0
>  [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10
>  [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
>  [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
>  [<ffffffff8173d4e4>] tracesys+0xdd/0xe2

The underlying problem is that perf is invoking call_rcu() with the
scheduler locks held, but in NOCB mode, call_rcu() will with high
probability invoke the scheduler -- which just might want to use its
locks.  The reason that call_rcu() needs to invoke the scheduler is
to wake up the corresponding rcuo callback-offload kthread, which
does the job of starting up a grace period and invoking the callbacks
afterwards.

One solution (championed on a related problem by Lai Jiangshan) is to
simply defer the wakeup to some point where scheduler locks are no longer
held.  Since we don't want to unnecessarily incur the cost of such
deferral, the task before us is threefold:

1.	Determine when it is likely that a relevant scheduler lock is held.

2.	Defer the wakeup in such cases.

3.	Ensure that all deferred wakeups eventually happen, preferably
	sooner rather than later.

We use irqs_disabled_flags() as a proxy for relevant scheduler locks
being held.  This works because the relevant locks are always acquired
with interrupts disabled.  We may defer more often than needed, but that
is at least safe.

The wakeup deferral is tracked via a new field in the per-CPU and
per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.

This flag is checked by the RCU core processing.  The __rcu_pending()
function now checks this flag, which causes rcu_check_callbacks()
to initiate RCU core processing at each scheduling-clock interrupt
where this flag is set.  Of course this is not sufficient because
scheduling-clock interrupts are often turned off (the things we used to
be able to count on!).  So the flags are also checked on entry to any
state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
state and NO_HZ_FULL user-mode-execution state.

This approach should allow call_rcu() to be invoked regardless of what
locks you might be holding, the key word being "should".

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2013-12-03 10:10:18 -08:00
..
ABI Main batch of InfiniBand/RDMA changes for 3.13: 2013-11-18 15:36:04 -08:00
accounting Documentation/accounting/getdelays.c: avoid strncpy in accounting tool 2013-07-03 16:08:06 -07:00
acpi gpiolib / ACPI: document the GPIO descriptor based interface 2013-10-19 23:32:50 +02:00
aoe aoe: remove do-nothing NAME="%k" term from example udev rules 2013-09-11 15:59:28 -07:00
arm Allwinner sunXi SoCs machine additions for 3.13 2013-10-28 10:19:38 -07:00
arm64 arm64: Use 42-bit address space with 64K pages 2013-11-05 17:23:52 +00:00
auxdisplay
backlight backlight: lp855x_bl: support new LP8555 device 2013-11-13 12:09:14 +09:00
blackfin
block block: change config option name for cmdline partition parsing 2013-09-30 14:31:02 -07:00
blockdev floppy: Correct documentation of driver options when used as a module. 2013-11-08 09:10:31 -07:00
bus-devices
cdrom
cgroups memcg: support hierarchical memory.numa_stats 2013-11-13 12:09:06 +09:00
connector connector - documentation: simplify netlink message length assignment 2013-10-02 16:03:51 -04:00
console
cpu-freq cpufreq: Implement light weight ->target_index() routine 2013-10-25 22:42:24 +02:00
cpuidle cpuidle: remove cpuidle_unregister_governor() 2013-10-30 01:21:24 +01:00
cris
crypto
development-process Documentation: development-process: Update -mm and -next URLs 2013-07-25 12:37:24 +02:00
device-mapper dm cache: resolve small nits and improve Documentation 2013-11-12 13:11:09 -05:00
devicetree ARM: SoC fixes for 3.13-rc 2013-11-26 11:18:37 -08:00
DocBook doc: fix generation of device-drivers 2013-11-27 20:40:04 -08:00
driver-model spi: Updates for v3.13 2013-11-12 15:01:39 +09:00
dvb
early-userspace Documentation: remove reference to 2.7 kernel in early-userspace 2013-08-20 12:47:28 +02:00
EDID
extcon extcon: Simplify extcon_dev_register() prototype by removing unnecessary parameter 2013-09-27 09:37:01 +09:00
fault-injection
fb Documentation/fb/viafb.modes fix a typo 2013-08-20 12:41:11 +02:00
filesystems Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs 2013-11-22 08:38:55 -08:00
firmware_class
fmc doc: Fix typo "is is" in Documentations 2013-08-27 10:50:52 +02:00
frv
gpio Documentation: gpiolib: document new interface 2013-11-25 09:02:30 +01:00
hid HID: uhid: use generic hidinput_input_event() 2013-07-31 10:33:05 +02:00
hwmon Merge branch 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging 2013-11-15 16:35:10 -08:00
i2c i2c: i801: Add Device IDs for Intel Wildcat Point-LP PCH 2013-11-14 18:38:04 +01:00
i2o
ia64
ide
infiniband
input Input: clarify gamepad API ABS values 2013-10-15 23:42:07 -07:00
ioctl ALSA: add DICE driver 2013-10-17 21:18:32 +02:00
isdn
ja_JP HOWTO ja_JP sync 2013-07-24 22:06:34 -07:00
kbuild Documentation/kbuild/kconfig.txt: 'make listnewconfig' replaces: yes "" | make oldconfig 2013-10-08 23:51:50 +02:00
kdump Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-07-04 11:40:58 -07:00
ko_KR Correct unfaithful translation on HOWTO in ko_KR 2013-08-12 17:43:13 -07:00
laptops thinkpad-acpi: Add mute and mic-mute LED functionality 2013-10-17 14:38:44 +02:00
leds Documentation: leds-lp5521,lp5523: update device attribute information 2013-08-26 17:22:13 -07:00
m68k
make
memory-devices
metag
mic misc: mic: Enable OSPM suspend and resume support. 2013-10-05 18:01:42 -07:00
mips
misc-devices
mmc
mn10300
mtd doc: Fix typo "is is" in Documentations 2013-08-27 10:50:52 +02:00
namespaces
netlabel
networking tcp: tsq: restore minimal amount of queueing 2013-11-14 16:25:14 -05:00
nfc
parisc parisc: document the shadow registers 2013-07-09 22:09:19 +02:00
PCI PCI: Update pci_find_slot() description in pci.txt 2013-09-25 15:43:33 -06:00
pcmcia
power More ACPI and power management updates for 3.13-rc1 2013-11-20 13:25:04 -08:00
powerpc powerpc: Update the 00-Index in Documentation/powerpc 2013-08-27 14:44:27 +10:00
pps USB: serial: invoke dcd_change ldisc's handler. 2013-09-26 09:45:40 -07:00
prctl
pti
ptp ptp: add the PTP_SYS_OFFSET ioctl to the testptp program 2013-09-23 16:46:17 -04:00
rapidio doc: Fix typo in doucmentations 2013-07-25 12:34:15 +02:00
RCU rcu: Break call_rcu() deadlock involving scheduler and perf 2013-12-03 10:10:18 -08:00
s390 s390/s390dbf: add debug_level_enabled() function 2013-10-24 17:16:53 +02:00
scheduler H8/300 has been dead for several years, the kernel for it has 2013-11-12 14:13:14 +09:00
scsi SCSI misc on 20130915 2013-09-15 17:41:30 -04:00
security ima: new templates management mechanism 2013-10-25 17:17:04 -04:00
serial serial: core: delete .set_wake() callback 2013-10-16 13:16:19 -07:00
sh
sound ALSA: Fix typo in documentation/alsa 2013-10-29 11:38:04 +01:00
spi spi/documentation: Fix usage of __initdata 2013-08-20 12:52:28 +02:00
sysctl vsprintf: check real user/group id for %pK 2013-11-13 12:09:14 +09:00
target target: Remove TF_CIT_TMPL macro 2013-10-16 13:35:02 -07:00
thermal thermal: thermal_core: allow binding with limits on bind_params 2013-09-03 09:10:24 -04:00
timers doc: add missing files to timers/00-INDEX 2013-10-27 21:55:50 +00:00
tpm drivers/tpm: add xen tpmfront interface 2013-08-09 10:57:06 -04:00
trace Documentation/trace/tracepoints.txt: add links to TRACE_EVENT documentation 2013-11-13 12:09:32 +09:00
usb doc: usb: Fix typo in Documentation/usb/gadget_configs.txt 2013-10-31 13:31:39 +01:00
vDSO
video4linux [media] V4L: Add support for integer menu controls with standard menu items 2013-08-18 07:12:59 -03:00
virtual Merge branch 'kvm-ppc-queue' of git://github.com/agraf/linux-2.6 into queue 2013-11-04 10:20:57 +02:00
vm x86, mm: do not leak page->ptl for pmd page tables 2013-11-21 16:42:28 -08:00
w1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-07-04 11:40:58 -07:00
watchdog watchdog: delete mpcore_wdt driver 2013-07-11 21:47:58 +02:00
wimax
x86 Merge branch 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-11-12 10:48:30 +09:00
xtensa
zh_CN Documentation/zh_CN/SubmittingPatches fix a typo 2013-08-20 12:41:25 +02:00
.gitignore
00-INDEX doc: fix a typo in Documentation/00-INDEX 2013-08-27 10:53:07 +02:00
applying-patches.txt
assoc_array.txt Add a generic associative array implementation. 2013-09-24 10:35:17 +01:00
atomic_ops.txt
bad_memory.txt
basic_profiling.txt
bcache.txt Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-07-04 11:40:58 -07:00
binfmt_misc.txt
braille-console.txt
bt8xxgpio.txt
btmrvl.txt
BUG-HUNTING
bus-virt-phys-mapping.txt
cachetlb.txt Documentation: fix typo and update version in cachetlb.txt 2013-08-20 12:46:52 +02:00
Changes remove obsolete references to powertweak 2013-11-27 20:34:32 -08:00
circular-buffers.txt
clk.txt clk: add support for clock reparent on set_rate 2013-08-19 12:27:17 -07:00
coccinelle.txt
CodingStyle
cpu-hotplug.txt MAINTAINERS: update Zwane Mwaikambo's e-mail address 2013-11-13 12:09:14 +09:00
cpu-load.txt
cputopology.txt doc: Documentation/cputopology.txt fix typo 2013-09-04 12:59:47 +02:00
crc32.txt
dcdbas.txt
debugging-modules.txt
debugging-via-ohci1394.txt
dell_rbu.txt
devices.txt cuse: add fix minor number to /dev/cuse 2013-10-01 16:44:54 +02:00
digsig.txt
DMA-API-HOWTO.txt DMA-API: provide a helper to set both DMA and coherent DMA masks 2013-09-17 15:32:37 +01:00
DMA-API.txt DMA-API: provide a helper to set both DMA and coherent DMA masks 2013-09-17 15:32:37 +01:00
DMA-attributes.txt doc: Documentation/DMA-attributes.txt fix typo 2013-10-14 15:50:53 +02:00
dma-buf-sharing.txt dma-buf: Expose buffer size to userspace (v2) 2013-09-10 11:36:45 +05:30
DMA-ISA-LPC.txt
dmaengine.txt
dmatest.txt dmatest: add a 'wait' parameter 2013-11-14 11:04:40 -08:00
dontdiff
dynamic-debug-howto.txt
edac.txt
efi-stub.txt EFI stub documentation updates 2013-09-25 12:34:32 +01:00
eisa.txt
email-clients.txt
flexible-arrays.txt
futex-requeue-pi.txt
gcov.txt gcov: compile specific gcov implementation based on gcc version 2013-11-13 12:09:34 +09:00
highuid.txt
HOWTO
hw_random.txt
hwspinlock.txt doc: documentation/hwspinlock.txt fix typo 2013-08-27 10:46:02 +02:00
init.txt
initrd.txt
intel_txt.txt
Intel-IOMMU.txt
io_ordering.txt
io-mapping.txt
iostats.txt
IPMI.txt
IRQ-affinity.txt doc: fix a typo about irq affinity 2013-08-20 12:59:18 +02:00
IRQ-domain.txt
IRQ.txt
irqflags-tracing.txt
isapnp.txt
java.txt
kernel-doc-nano-HOWTO.txt
kernel-docs.txt
kernel-parameters.txt Merge branch 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2013-11-21 19:46:00 -08:00
kernel-per-CPU-kthreads.txt kthread: Add pointer to vmstat-avoidance patch 2013-09-25 06:49:46 -07:00
kmemcheck.txt Documentation/kmemcheck: update kmemcheck documentation 2013-08-27 10:47:05 +02:00
kmemleak.txt
kobject.txt
kprobes.txt
kref.txt
ldm.txt
local_ops.txt
lockdep-design.txt
lockstat.txt lockstat: Report avg wait and hold times 2013-10-09 08:19:08 +02:00
lockup-watchdogs.txt
logo.gif
logo.txt
magic-number.txt
Makefile
ManagementStyle
md.txt
media-framework.txt Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media 2013-07-13 12:09:57 -07:00
memory-barriers.txt doc: Fix memory-barrier control-dependency example 2013-08-19 21:39:42 -07:00
memory-hotplug.txt Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-09-06 09:36:28 -07:00
mono.txt
mutex-design.txt locking/doc: Update references to kernel/mutex.c 2013-11-11 12:41:33 +01:00
nommu-mmap.txt
numastat.txt
oops-tracing.txt
padata.txt
parport-lowlevel.txt
parport.txt
percpu-rw-semaphore.txt
phy.txt drivers: phy: add generic PHY framework 2013-09-27 17:35:41 -07:00
pi-futex.txt
pinctrl.txt pinctrl: add documentation for pinctrl_get_group_pins() 2013-10-16 15:35:21 +02:00
pnp.txt
preempt-locking.txt
printk-formats.txt Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-09-06 09:36:28 -07:00
pwm.txt Documentation/pwm: Fix trivial typos 2013-10-24 10:51:33 +02:00
ramoops.txt
rbtree.txt
remoteproc.txt
rfkill.txt
robust-futex-ABI.txt
robust-futexes.txt
rpmsg.txt
rt-mutex-design.txt
rt-mutex.txt
rtc.txt
SAK.txt
SecurityBugs
serial-console.txt
sgi-ioc4.txt
sgi-visws.txt
SM501.txt
smsc_ece1099.txt
sparse.txt
spinlocks.txt
stable_api_nonsense.txt
stable_kernel_rules.txt
static-keys.txt
SubmitChecklist
SubmittingDrivers
SubmittingPatches Documentation/SubmittingPatches: Request summaries for commit references 2013-08-20 12:58:15 +02:00
svga.txt
sysfs-rules.txt doc: Fix typo in doucmentations 2013-07-25 12:34:15 +02:00
sysrq.txt sysrq: Allow magic SysRq key functions to be disabled through Kconfig 2013-10-16 13:01:44 -07:00
this_cpu_ops.txt
unaligned-memory-access.txt
unicode.txt
unshare.txt
vfio.txt vfio: fix documentation 2013-09-05 16:36:21 -06:00
VGA-softcursor.txt
vgaarbiter.txt
video-output.txt
vme_api.txt
volatile-considered-harmful.txt
workqueue.txt workqueue: Correct/Drop references to gcwq in Documentation 2013-08-21 10:32:09 -04:00
ww-mutex-design.txt
xz.txt
zorro.txt