twx-linux/include/uapi/linux
Wang Nan 9ecda41acb perf/core: Add ::write_backward attribute to perf event
This patch introduces a 'write_backward' bit to perf_event_attr, which
controls the direction of a ring buffer. When it is set, the corresponding
ring buffer is written from end to beginning. This feature is designed to
support reading from overwritable ring buffers.

A ring buffer can be created by mapping a perf event fd. The kernel puts
event records into the ring buffer, and user tooling like perf fetches them
from the address returned by mmap(). To prevent racing between the kernel
and tooling, they communicate with each other through 'head' and 'tail'
pointers. The kernel maintains the 'head' pointer, pointing it to the next
free area (the tail of the last record). Tooling maintains the 'tail'
pointer, pointing it to the tail of the last consumed record (a record that
has already been fetched). The kernel determines the available space in a
ring buffer using these two pointers, to avoid overwriting unfetched records.
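
As an illustration of the head/tail protocol described above, here is a
minimal sketch in C. It assumes 'head' and 'tail' are free-running byte
counters (only masked when indexing the buffer); the helper names
rb_free_space() and rb_can_write() are hypothetical and are not kernel
functions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes the producer may still write without
 * clobbering records the consumer has not fetched yet.
 * head: total bytes ever written (maintained by the producer)
 * tail: total bytes ever consumed (maintained by the consumer)
 * size: capacity of the ring buffer in bytes
 */
static uint64_t rb_free_space(uint64_t head, uint64_t tail, uint64_t size)
{
    /* The consumer can never be ahead of the producer. */
    assert(head >= tail && head - tail <= size);
    return size - (head - tail);
}

/* Hypothetical helper: would a record of 'len' bytes fit right now? */
static int rb_can_write(uint64_t head, uint64_t tail, uint64_t size,
                        uint64_t len)
{
    return rb_free_space(head, tail, size) >= len;
}
```

With a 4096-byte buffer, head == 4096 and tail == 0 leaves no free space,
so a normal ring buffer would refuse the next record until tooling
advances 'tail'.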

By mapping without 'PROT_WRITE', an overwritable ring buffer is created.
Unlike a normal ring buffer, tooling is unable to maintain the 'tail'
pointer because writing is forbidden. Therefore, for this type of ring
buffer, the kernel overwrites old records unconditionally, working like a
flight recorder. This feature would be useful if reading from an
overwritable ring buffer were as easy as reading from a normal ring buffer.
However, there's an obscure problem.

The following figure demonstrates a full overwritable ring buffer. In
this figure, the 'head' pointer points to the end of the last record, and a
long record 'E' is pending. For a normal ring buffer, a 'tail' pointer
would have pointed to position (X), so the kernel would know there's no more
space in the ring buffer. However, for an overwritable ring buffer, the
kernel ignores the 'tail' pointer.

   (X)                              head
    .                                |
    .                                V
    +------+-------+----------+------+---+
    |A....A|B.....B|C........C|D....D|   |
    +------+-------+----------+------+---+

Record 'A' is overwritten by event 'E':

      head
       |
       V
    +--+---+-------+----------+------+---+
    |.E|..A|B.....B|C........C|D....D|E..|
    +--+---+-------+----------+------+---+

Now tooling decides to read from this ring buffer. However, neither of the
two natural positions, 'head' and the start of this ring buffer, points to
the head of a record. Even though the full ring buffer can be accessed by
tooling, it is unable to find a position to start decoding.

The first attempt to solve this problem, AFAIK, can be found in
[1]. It makes the kernel maintain the 'tail' pointer, updating it when the
ring buffer is half full. However, this approach introduces overhead to the
fast path. Test results show a 1% overhead [2]. In addition, this method
utilizes no more than 50% of the records.

Another attempt can be found in [3], which allows putting the size of
an event at the end of each record. This approach allows tooling to find
records in a backward manner from the 'head' pointer by reading the size of
a record from its tail. However, because of alignment requirements, it
needs 8 bytes to record the size of a record, which is a huge waste. Its
performance is also not good, because more data needs to be written.
This approach also introduces some extra branch instructions to the fast
path.

'write_backward' is a better solution to this problem.

The following figure demonstrates the state of the overwritable ring buffer
when 'write_backward' is set, before overwriting:

       head
        |
        V
    +---+------+----------+-------+------+
    |   |D....D|C........C|B.....B|A....A|
    +---+------+----------+-------+------+

and after overwriting:

                                     head
                                      |
                                      V
    +---+------+----------+-------+---+--+
    |..E|D....D|C........C|B.....B|A..|E.|
    +---+------+----------+-------+---+--+

In each situation, 'head' points to the beginning of the newest record.
From this record, tooling can iterate over the full ring buffer and fetch
records one by one.
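
The backward-writing scheme in the figures above can be modelled in a few
lines of C. This is only a user-space sketch under stated assumptions, not
the kernel implementation: the 64-byte buffer, the 4-byte 'struct rec_hdr'
and the rb_*() helpers are all hypothetical, and 'head' is modelled as a
free-running counter that decreases on every write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RB_SIZE 64u                     /* toy capacity, power of two */

struct rec_hdr { uint16_t size; uint16_t id; };  /* hypothetical header */

struct rb {
    uint8_t  data[RB_SIZE];
    uint64_t head;                      /* decreases on every write */
};

/* Copy bytes into/out of the buffer, wrapping at RB_SIZE. */
static void rb_copy_in(struct rb *rb, uint64_t pos, const void *src, size_t len)
{
    const uint8_t *s = src;
    for (size_t i = 0; i < len; i++)
        rb->data[(pos + i) & (RB_SIZE - 1)] = s[i];
}

static void rb_copy_out(const struct rb *rb, uint64_t pos, void *dst, size_t len)
{
    uint8_t *d = dst;
    for (size_t i = 0; i < len; i++)
        d[i] = rb->data[(pos + i) & (RB_SIZE - 1)];
}

/* Write one record of 'size' bytes (header included) below 'head'. */
static void rb_write_backward(struct rb *rb, uint16_t id, uint16_t size)
{
    struct rec_hdr h = { .size = size, .id = id };
    uint8_t fill = (uint8_t)id;

    assert(size >= sizeof(h) && size <= RB_SIZE);
    rb->head -= size;                   /* unsigned wrap-around is intended */
    rb_copy_in(rb, rb->head, &h, sizeof(h));
    for (uint64_t i = sizeof(h); i < size; i++)
        rb_copy_in(rb, rb->head + i, &fill, 1);   /* dummy payload */
}

/* Walk from 'head' (newest) towards older records; return how many are intact. */
static int rb_read_newest_first(const struct rb *rb, uint16_t *ids, int max)
{
    uint64_t p = rb->head, walked = 0;
    int n = 0;

    while (n < max) {
        struct rec_hdr h;

        if (walked + sizeof(h) > RB_SIZE)
            break;                      /* next header was overwritten */
        rb_copy_out(rb, p, &h, sizeof(h));
        if (h.size < sizeof(h) || walked + h.size > RB_SIZE)
            break;                      /* unused space or truncated record */
        ids[n++] = h.id;
        p += h.size;
        walked += h.size;
    }
    return n;
}
```

Writing records A (16 bytes), B (16), C (16) and D (8) into a zeroed
'struct rb' and then reading yields D, C, B, A. After a further 16-byte
record E, which overwrites the tail of A, the same walk yields E, D, C, B
and stops at the truncated A.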

The only limitation that needs to be considered is back-to-back reading.
Due to the non-determinism of user programs, it is impossible to ensure
the ring buffer stays stable during reading. Consider an extreme situation:
tooling is scheduled out after reading record 'D', then a burst of events
comes and eats up the whole ring buffer (one or multiple rounds). When the
tooling process comes back, reading after 'D' is now incorrect.

To prevent this problem, we need to find a way to ensure the ring buffer
is stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is
suggested because its overhead is lower than
ioctl(PERF_EVENT_IOC_ENABLE).
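
A pause-read-resume sequence might look like the sketch below. It assumes
a kernel and headers that provide PERF_EVENT_IOC_PAUSE_OUTPUT (a non-zero
argument pauses output, zero resumes it); read_paused() and its 'drain'
callback are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>

#ifndef PERF_EVENT_IOC_PAUSE_OUTPUT
/* Fallback for older headers; value as defined in linux/perf_event.h. */
#define PERF_EVENT_IOC_PAUSE_OUTPUT _IOW('$', 9, __u32)
#endif

/*
 * Hypothetical helper: pause output on 'perf_fd', let the caller drain
 * the mapped ring buffer through 'drain', then resume output.
 * Returns 0 on success and -1 if either ioctl() fails.
 */
static int read_paused(int perf_fd, void (*drain)(void *ctx), void *ctx)
{
    if (ioctl(perf_fd, PERF_EVENT_IOC_PAUSE_OUTPUT, 1) < 0)
        return -1;              /* could not pause: do not read */

    if (drain)
        drain(ctx);             /* buffer is stable while paused */

    return ioctl(perf_fd, PERF_EVENT_IOC_PAUSE_OUTPUT, 0) < 0 ? -1 : 0;
}
```

Events that fire while output is paused are dropped rather than queued,
so the paused window should be kept short.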

By carefully verifying the 'head' pointer, a reader can avoid pausing the
ring buffer. For example:

    /* A union of all possible events */
    union perf_event event;

    p = head = perf_mmap__read_head();
    while (true) {
        /* copy header of next event */
        fetch(&event.header, p, sizeof(event.header));

        /* read 'head' pointer */
        head = perf_mmap__read_head();

        /* check overwritten: is the header good? */
        if (!verify(sizeof(event.header), p, head))
            break;

        /* copy the whole event */
        fetch(&event, p, event.header.size);

        /* read 'head' pointer again */
        head = perf_mmap__read_head();

        /* is the whole event good? */
        if (!verify(event.header.size, p, head))
            break;
        p += event.header.size;
    }

However, the overhead is high because:

 a) In-place decoding is not safe.
    Copying-verifying-decoding is required.
 b) Fetching 'head' pointer requires additional synchronization.
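
The verify() step in the loop above is left undefined. One possible check,
shown here only as a sketch, assumes 'p' and 'head' are free-running
offsets and that 'head' only ever decreases (backward writing);
rb_span_still_valid() and its extra 'size' parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical check: is the span [p, p + len) still intact?
 * The kernel has written everything in [head, p) since the span was
 * produced. The span survives as long as it ends no more than 'size'
 * bytes above the current 'head'. Unsigned wrap-around is harmless
 * because only the difference is used.
 */
static int rb_span_still_valid(uint64_t len, uint64_t p, uint64_t head,
                               uint64_t size)
{
    assert(len <= size);
    return (p + len) - head <= size;
}
```

In a 64-byte buffer, a 16-byte record that starts 56 bytes above 'head'
ends 72 bytes above it, so the check reports it as overwritten.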

(From Alexei Starovoitov:

Even when this trick works, pause is needed for more than stability of
reading. When we collect the events into overwrite buffer we're waiting
for some other trigger (like all cpu utilization spike or just one cpu
running and all others are idle) and when it happens the buffer has
valuable info from the past. At this point new events are no longer
interesting and buffer should be paused, events read and unpaused until
next trigger comes.)

This patch utilizes the event's default overflow_handler introduced
previously. perf_event_output_backward() is created as the default
overflow handler for backward ring buffers. To avoid extra overhead to the
fast path, the original perf_event_output() becomes __perf_event_output()
and is marked '__always_inline'. In theory, there's no extra overhead
introduced to the fast path.

Performance testing:

Call 'close(-1)' 3000000 times and use gettimeofday() to measure the
duration.  Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
system calls. Results are in ns.

Testing environment:

  CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  Kernel : v4.5.0

                    MEAN         STDVAR
 BASE            800214.950    2853.083
 PRE1           2253846.700    9997.014
 PRE2           2257495.540    8516.293
 POST           2250896.100    8933.921

Here 'BASE' is the performance without capturing, 'PRE1' is the test
result on a pure 'v4.5.0' kernel, 'PRE2' is the test result before this
patch, and 'POST' is the test result after this patch. See [4] for the
detailed experimental setup.

Considering the stdvar, this patch doesn't introduce performance
overhead to the fast path.

 [1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html
 [2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html
 [3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html
 [4] http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com

Signed-off-by: Wang Nan <wangnan0@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: <acme@kernel.org>
Cc: <pi3orama@163.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/r/1459865478-53413-1-git-send-email-wangnan0@huawei.com
[ Fixed the changelog some more. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>

2016-04-23 14:12:39 +02:00
..
android
byteorder include/uapi/linux/byteorder, swab: force inlining of some byteswap operations 2016-03-17 15:09:34 -07:00
caif
can
cifs
dvb [media] include/uapi/linux/dvb/video.h: remove stdint.h include 2015-11-19 08:18:38 -02:00
genwqe
hdlc
hsi
iio iio: ph: add IIO_PH channel type 2016-01-30 16:27:17 +00:00
isdn
mmc
netfilter netfilter: Remove IP_CT_NEW_REPLY definition. 2016-03-14 23:47:27 +01:00
netfilter_arp netfilter: fix include files for compilation 2015-11-23 17:54:38 +01:00
netfilter_bridge netfilter: fix include files for compilation 2015-11-23 17:54:38 +01:00
netfilter_ipv4 netfilter: fix include files for compilation 2015-11-23 17:54:38 +01:00
netfilter_ipv6 netfilter: fix include files for compilation 2015-11-23 17:54:38 +01:00
nfsd
raid drivers: md: use ktime_get_real_seconds() 2016-01-06 11:39:53 +11:00
spi
sunrpc
tc_act introduce IFE action 2016-03-01 17:15:22 -05:00
tc_ematch
usb usb: ch9: Fix SSP Device Cap wFunctionalitySupport type 2016-03-29 13:26:04 +03:00
wimax
a.out.h
acct.h
adb.h
adfs_fs.h
affs_hardblocks.h
agpgart.h include/uapi/linux/agpgart.h: include stdlib.h in userspace 2015-12-10 12:33:23 +01:00
aio_abi.h
am437x-vpfe.h
apm_bios.h
arcfb.h
atalk.h
atm_eni.h
atm_he.h
atm_idt77105.h
atm_nicstar.h
atm_tcp.h
atm_zatm.h
atm.h
atmapi.h
atmarp.h
atmbr2684.h
atmclip.h
atmdev.h
atmioc.h
atmlec.h
atmmpc.h
atmppp.h
atmsap.h
atmsvc.h
audit.h audit: stop an old auditd being starved out by a new auditd 2016-01-25 18:04:15 -05:00
auto_fs4.h autofs4: fix some white space errors 2016-03-15 16:55:16 -07:00
auto_fs.h autofs4: fix some white space errors 2016-03-15 16:55:16 -07:00
auxvec.h
ax25.h
b1lli.h
baycom.h
bcache.h
bcm933xx_hcs.h
bfs_fs.h
binfmts.h
blkpg.h
blktrace_api.h
bpf_common.h
bpf.h bpf: make padding in bpf_tunnel_key explicit 2016-03-30 19:01:33 -04:00
bpqether.h
bsg.h
btrfs.h
can.h
capability.h
capi.h
cciss_defs.h
cciss_ioctl.h
cdrom.h
cgroupstats.h
chio.h
cm4000_cs.h
cn_proc.h
coda_psdev.h
coda.h
coff.h
connector.h
const.h
cramfs_fs.h
cryptouser.h
cuda.h
cyclades.h
cycx_cfm.h
dcbnl.h
dccp.h
devlink.h Introduce devlink infrastructure 2016-03-01 16:07:29 -05:00
dlm_device.h
dlm_netlink.h
dlm_plock.h
dlm.h
dlmconstants.h
dm-ioctl.h
dm-log-userspace.h
dma-buf.h dma-buf: Add ioctls to allow userspace to flush 2016-02-12 16:01:32 +01:00
dn.h
dqblk_xfs.h quota: add new quotactl Q_XGETNEXTQUOTA 2016-02-08 11:21:50 +11:00
edd.h
efs_fs_sb.h
elf-em.h include/uapi/linux/elf-em.h: remove v850 2016-03-17 15:09:34 -07:00
elf-fdpic.h
elf.h
elfcore.h
errno.h
errqueue.h
ethtool.h ethtool: minor doc update 2016-03-22 15:45:44 -04:00
eventpoll.h epoll: add EPOLLEXCLUSIVE flag 2016-01-20 17:09:18 -08:00
fadvise.h
falloc.h
fanotify.h
fb.h
fcntl.h
fd.h
fdreg.h
fib_rules.h
fiemap.h
filter.h
firewire-cdev.h
firewire-constants.h
flat.h
fou.h
fs.h Merge tag 'for-f2fs-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs 2016-03-21 11:03:02 -07:00
fsl_hypervisor.h
fuse.h
futex.h
gameport.h
gen_stats.h
genetlink.h openvswitch: allow management from inside user namespaces 2016-02-11 09:53:19 -05:00
gfs2_ondisk.h gfs2: change gfs2 readdir cookie 2015-12-14 12:19:37 -06:00
gigaset_dev.h
gpio.h gpio: uapi: use 0xB4 as ioctl() major 2016-03-10 16:02:52 +07:00
gsmmux.h
hash_info.h keys, trusted: select hash algorithm for TPM2 chips 2015-12-20 15:27:12 +02:00
hdlc.h
hdlcdrv.h
hdreg.h
hid.h
hiddev.h
hidraw.h
hpet.h
hsr_netlink.h
hw_breakpoint.h
hyperv.h tools: hv: report ENOSPC errors in hv_fcopy_daemon 2015-12-14 19:12:21 -08:00
hysdn_if.h
i2c-dev.h
i2c.h
i2o-dev.h
i8k.h
icmp.h
icmpv6.h
if_addr.h
if_addrlabel.h
if_alg.h
if_arcnet.h
if_arp.h
if_bonding.h
if_bridge.h bridge: mcast: add support for more router port information dumping 2016-03-01 16:55:07 -05:00
if_cablemodem.h
if_eql.h
if_ether.h uapi: add MACsec bits 2016-03-13 22:40:24 -04:00
if_fc.h
if_fddi.h
if_frad.h
if_hippi.h
if_infiniband.h
if_link.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-03-23 23:25:14 -07:00
if_ltalk.h
if_macsec.h uapi: add MACsec bits 2016-03-13 22:40:24 -04:00
if_packet.h
if_phonet.h
if_plip.h
if_ppp.h
if_pppol2tp.h
if_pppox.h
if_slip.h
if_team.h
if_tun.h
if_tunnel.h
if_vlan.h
if_x25.h
if.h net: fix a comment typo 2016-03-18 19:40:27 -04:00
igmp.h
ila.h ila: Add generic ILA translation facility 2015-12-15 23:25:20 -05:00
in6.h ipv6: add IPV6_HDRINCL option for raw sockets 2015-12-17 15:12:28 -05:00
in_route.h
in.h
inet_diag.h
inotify.h
input-event-codes.h
input.h Input: synaptics-rmi4 - add support for Synaptics RMI4 devices 2016-03-10 16:02:39 -08:00
ioctl.h
ip6_tunnel.h
ip_vs.h
ip.h ipv4: add option to drop gratuitous ARP packets 2016-02-11 04:27:35 -05:00
ipc.h
ipmi_msgdefs.h
ipmi.h
ipsec.h
ipv6_route.h
ipv6.h net: ipv6: Make address flushing on ifdown optional 2016-02-25 21:45:15 -05:00
ipx.h
irda.h
irqnr.h
isdn_divertif.h
isdn_ppp.h
isdn.h
isdnif.h
iso_fs.h
ivtv.h
ivtvfb.h
ixjuser.h
jffs2.h
joystick.h
Kbuild rapidio: add mport char device driver 2016-03-22 15:36:02 -07:00
kcm.h kcm: Kernel Connection Multiplexor module 2016-03-09 16:36:14 -05:00
kcmp.h
kcov.h kernel: add kcov code coverage 2016-03-22 15:36:02 -07:00
kd.h
kdev_t.h
kernel-page-flags.h
kernel.h uapi: define DIV_ROUND_UP for userland 2016-03-04 16:10:36 -05:00
kernelcapi.h
kexec.h
keyboard.h
keyctl.h
kfd_ioctl.h
kvm_para.h
kvm.h KVM/ARM updates for 4.6 2016-03-09 11:50:42 +01:00
l2tp.h
libc-compat.h
lightnvm.h lightnvm: introduce factory reset 2016-01-12 08:21:18 -07:00
limits.h
lirc.h [media] bz#75751: Move internal header file lirc.h to uapi/ 2015-11-17 06:47:43 -02:00
llc.h
loop.h
lp.h
lwtunnel.h
magic.h Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs 2016-01-21 12:20:46 -08:00
major.h
map_to_7segment.h
matroxfb.h
mdio.h
media-bus-format.h
media.h Merge commit '840f5b0572ea' into v4l_for_linus 2016-03-15 07:48:28 -03:00
mei.h
membarrier.h
memfd.h
mempolicy.h
meye.h
mic_common.h
mic_ioctl.h
mii.h
minix_fs.h
mman.h
mmtimer.h
module.h
mpls_iptunnel.h
mpls.h
mqueue.h
mroute6.h uapi: define DIV_ROUND_UP for userland 2016-03-04 16:10:36 -05:00
mroute.h net: ipmr: fix code and comment style 2015-11-23 15:06:38 -05:00
msdos_fs.h
msg.h
mtio.h
n_r3964.h
nbd.h
ncp_fs.h
ncp_mount.h
ncp_no.h
ncp.h
ndctl.h nfit, libnvdimm: clear poison command support 2016-03-05 18:06:14 -08:00
neighbour.h
net_dropmon.h
net_namespace.h
net_tstamp.h
net.h
netconf.h netconf: add macro to represent all attributes 2016-03-13 21:54:44 -04:00
netdevice.h
netfilter_arp.h
netfilter_bridge.h netfilter: fix include files for compilation 2015-11-23 17:54:38 +01:00
netfilter_decnet.h
netfilter_ipv4.h
netfilter_ipv6.h
netfilter.h
netlink_diag.h netlink: remove mmapped netlink support 2016-02-18 11:42:18 -05:00
netlink.h netlink: remove mmapped netlink support 2016-02-18 11:42:18 -05:00
netrom.h
nfc.h
nfs2.h
nfs3.h
nfs4_mount.h
nfs4.h
nfs_fs.h
nfs_idmap.h
nfs_mount.h
nfs.h nfs: use btrfs ioctl defintions for clone 2015-11-23 21:53:08 -05:00
nfsacl.h
nl80211.h cfg80211: Add global RRM capability 2016-02-24 09:04:41 +01:00
nubus.h
nvme_ioctl.h
nvram.h
omap3isp.h
omapfb.h
oom.h
openvswitch.h openvswitch: Interface with NAT. 2016-03-14 23:47:29 +01:00
packet_diag.h
param.h
parport.h
patchkey.h
pci_regs.h
pci.h
perf_event.h perf/core: Add ::write_backward attribute to perf event 2016-04-23 14:12:39 +02:00
personality.h
pfkeyv2.h
pg.h
phantom.h
phonet.h
pkt_cls.h net/flower: Introduce hardware offload support 2016-03-10 16:24:02 -05:00
pkt_sched.h net, sched: add clsact qdisc 2016-01-10 22:13:15 -05:00
pktcdvd.h
pmu.h
poll.h
posix_types.h
ppdev.h
ppp_defs.h
ppp-comp.h
ppp-ioctl.h
pps.h
pr.h
prctl.h
psci.h
ptp_clock.h ptp: Add PTP_SYS_OFFSET_PRECISE for driver crosstimestamping 2016-03-03 14:23:43 -08:00
ptrace.h
qnx4_fs.h
qnxtypes.h
quota.h quota: add new quotactl Q_GETNEXTQUOTA 2016-02-08 11:22:21 +11:00
radeonfb.h
random.h
raw.h
rds.h
reboot.h
reiserfs_fs.h
reiserfs_xattr.h
resource.h
rfkill.h rfkill: Update userspace API documentation 2016-02-24 09:04:25 +01:00
romfs_fs.h
rose.h
route.h
rtc.h
rtnetlink.h ipv6: allow routes to be configured with expire values 2015-12-17 15:08:51 -05:00
scc.h
sched.h sched: new clone flag CLONE_NEWCGROUP for cgroup namespace 2016-02-16 13:04:58 -05:00
scif_ioctl.h
screen_info.h
sctp.h
sdla.h
seccomp.h
securebits.h
selinux_netlink.h
sem.h
serial_core.h serial: mvebu-uart: initial support for Armada-3700 serial port 2016-03-07 16:11:14 -08:00
serial_reg.h
serial.h serial: support 16-bit register interface for console 2015-12-13 19:59:48 -08:00
serio.h Input: add eGalaxTouch serial touchscreen driver 2015-12-16 11:31:33 -08:00
shm.h
signal.h
signalfd.h
smiapp.h
snmp.h
sock_diag.h net: diag: Add the ability to destroy a socket. 2015-12-15 23:26:51 -05:00
socket.h
sockios.h include/uapi/linux/sockios.h: mark SIOCRTMSG unused 2016-01-05 16:44:06 -05:00
sonet.h
sonypi.h
sound.h
soundcard.h
stat.h
stddef.h uapi/linux/stddef.h: Provide __always_inline to userspace headers 2016-03-30 12:50:17 +02:00
stm.h
string.h
suspend_ioctls.h
swab.h include/uapi/linux/byteorder, swab: force inlining of some byteswap operations 2016-03-17 15:09:34 -07:00
synclink.h
sysctl.h
sysinfo.h
target_core_user.h target/user: Report capability of handling out-of-order completions to userspace 2016-03-10 21:49:09 -08:00
taskstats.h
tcp_metrics.h
tcp.h tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/In 2016-03-14 14:55:26 -04:00
telephony.h
termios.h
thermal.h
time.h
times.h
timex.h
tiocl.h
tipc_config.h
tipc_netlink.h
tipc.h
toshiba.h
tty_flags.h
tty.h
types.h
udf_fs_i.h
udp.h
uhid.h
uinput.h Input: uinput - rework ABS validation 2015-12-18 17:48:51 -08:00
uio.h
ultrasound.h
un.h
unistd.h
unix_diag.h
usbdevice_fs.h usb: devio: Add ioctl to disallow detaching kernel USB drivers. 2016-03-05 12:05:01 -08:00
usbip.h
userfaultfd.h
userio.h
utime.h
utsname.h
uuid.h
uvcvideo.h
v4l2-common.h [media] media: v4l: Dual license v4l2-common.h under GPL v2 and BSD licenses 2016-02-01 08:47:05 -02:00
v4l2-controls.h [media] v4l: add V4L2_CID_MPEG_VIDEO_FORCE_KEY_FRAME 2016-02-19 08:10:35 -02:00
v4l2-dv-timings.h
v4l2-mediabus.h
v4l2-subdev.h
veth.h
vfio.h vfio/pci: Intel IGD host and LCP bridge config space access 2016-02-22 16:10:09 -07:00
vhost.h vhost_net: basic polling support 2016-03-11 02:18:53 +02:00
videodev2.h [media] UVC: Add support for R200 depth camera 2016-03-03 06:49:20 -03:00
virtio_9p.h
virtio_balloon.h virtio_balloon: export 'available' memory to balloon statistics 2016-03-17 15:09:34 -07:00
virtio_blk.h virtio_blk: VIRTIO_BLK_F_WCE->VIRTIO_BLK_F_FLUSH 2016-03-02 17:01:59 +02:00
virtio_config.h virtio: add VIRTIO_CONFIG_S_NEEDS_RESET device status bit 2016-04-07 15:16:41 +03:00
virtio_console.h
virtio_gpu.h include/uapi/linux/virtio_gpu.h: use __u8 from <linux/types.h> 2015-12-10 12:33:23 +01:00
virtio_ids.h Revert "Merge branch 'vsock-virtio'" 2015-12-08 21:55:49 -05:00
virtio_input.h
virtio_net.h
virtio_pci.h
virtio_ring.h
virtio_rng.h
virtio_scsi.h
virtio_types.h
vm_sockets.h
vsp1.h
vt.h
wait.h
wanrouter.h
watchdog.h
wil6210_uapi.h
wimax.h
wireless.h
x25.h
xattr.h
xfrm.h
xilinx-v4l2-controls.h
zorro_ids.h
zorro.h