Blame - src/kernel/linux/v4.14/tools/perf/Documentation/intel-pt.txt - T103

				1	Intel Processor Trace
				2	=====================
				3
				4	Overview
				5	========
				6
				7	Intel Processor Trace (Intel PT) is an extension of Intel Architecture that
				8	collects information about software execution such as control flow, execution
				9	modes and timings and formats it into highly compressed binary packets.
				10	Technical details are documented in the Intel 64 and IA-32 Architectures
				11	Software Developer Manuals, Chapter 36 Intel Processor Trace.
				12
				13	Intel PT is first supported in Intel Core M and 5th generation Intel Core
				14	processors that are based on the Intel micro-architecture code name Broadwell.
				15
				16	Trace data is collected by 'perf record' and stored within the perf.data file.
				17	See below for options to 'perf record'.
				18
				19	Trace data must be 'decoded' which involves walking the object code and matching
				20	the trace data packets. For example a TNT packet only tells whether a
				21	conditional branch was taken or not taken, so to make use of that packet the
				22	decoder must know precisely which instruction was being executed.
				23
				24	Decoding is done on-the-fly. The decoder outputs samples in the same format as
				25	samples output by perf hardware events, for example as though the "instructions"
				26	or "branches" events had been recorded. Presently 3 tools support this:
				27	'perf script', 'perf report' and 'perf inject'. See below for more information
				28	on using those tools.
				29
				30	The main distinguishing feature of Intel PT is that the decoder can determine
				31	the exact flow of software execution. Intel PT can be used to understand why
				32	and how software got to a certain point, or behaved a certain way. The
				33	software does not have to be recompiled, so Intel PT works with debug or release
				34	builds. However, the executed images are needed, which makes use in JIT-compiled
				35	environments, or with self-modifying code, a challenge. Also, symbols need to be
				36	provided to make sense of addresses.
				37
				38	A limitation of Intel PT is that it produces huge amounts of trace data
				39	(hundreds of megabytes per second per core) which takes a long time to decode,
				40	for example two or three orders of magnitude longer than it took to collect.
				41	Another limitation is the performance impact of tracing, something that will
				42	vary depending on the use-case and architecture.
				43
				44
				45	Quickstart
				46	==========
				47
				48	It is important to start small. That is because it is easy to capture vastly
				49	more data than can possibly be processed.
				50
				51	The simplest thing to do with Intel PT is userspace profiling of small programs.
				52	Data is captured with 'perf record' e.g. to trace 'ls' userspace-only:
				53
				54	perf record -e intel_pt//u ls
				55
				56	And profiled with 'perf report' e.g.
				57
				58	perf report
				59
				60	Tracing kernel space as well presents a problem, namely kernel self-modifying
				61	code. A fairly good kernel image is available in /proc/kcore but to get an
				62	accurate image a copy of /proc/kcore needs to be made under the same conditions
				63	as the data capture. A script perf-with-kcore can do that, but beware that the
				64	script makes use of 'sudo' to copy /proc/kcore. If you have perf installed
				65	locally from the source tree you can do:
				66
				67	~/libexec/perf-core/perf-with-kcore record pt_ls -e intel_pt// -- ls
				68
				69	which will create a directory named 'pt_ls' and put the perf.data file and
				70	copies of /proc/kcore, /proc/kallsyms and /proc/modules into it. The
				71	'perf report' command then becomes:
				72
				73	~/libexec/perf-core/perf-with-kcore report pt_ls
				74
				75	Because samples are synthesized after-the-fact, the sampling period can be
				76	selected for reporting. e.g. sample every microsecond
				77
				78	~/libexec/perf-core/perf-with-kcore report pt_ls --itrace=i1usge
				79
				80	See the sections below for more information about the --itrace option.
				81
				82	Beware: the smaller the period, the more samples are produced, and the
				83	longer it takes to process them.
				84
				85	Also note that the coarseness of Intel PT timing information will start to
				86	distort the statistical value of the sampling as the sampling period becomes
				87	smaller.
				88
				89	To represent software control flow, "branches" samples are produced. By default
				90	a branch sample is synthesized for every single branch. To get an idea what
				91	data is available you can use the 'perf script' tool with no parameters, which
				92	will list all the samples.
				93
				94	perf record -e intel_pt//u ls
				95	perf script
				96
				97	An interesting field that is not printed by default is 'flags' which can be
				98	displayed as follows:
				99
				100	perf script -Fcomm,tid,pid,time,cpu,event,trace,ip,sym,dso,addr,symoff,flags
				101
				102	The flags are "bcrosyiABEx" which stand for branch, call, return, conditional,
				103	system, asynchronous, interrupt, transaction abort, trace begin, trace end, and
				104	in transaction, respectively.
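
As a small illustration of the letter list above, the following sketch expands a
flags string into the names given in the text. The helper name pt_flag_names is
hypothetical (it is not part of perf), and the sketch assumes the flags field
contains only the letters listed above.

```shell
# Hypothetical helper: expand a flags string such as "bc" into names.
# The letter-to-name mapping is copied from the list in the text above.
pt_flag_names() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}            # everything after the first character
        c=${s%"$rest"}         # the first character itself
        s=$rest
        case $c in
            b) name="branch" ;;
            c) name="call" ;;
            r) name="return" ;;
            o) name="conditional" ;;
            s) name="system" ;;
            y) name="asynchronous" ;;
            i) name="interrupt" ;;
            A) name="transaction abort" ;;
            B) name="trace begin" ;;
            E) name="trace end" ;;
            x) name="in transaction" ;;
            *) name="unknown($c)" ;;
        esac
        out="${out:+$out, }$name"
    done
    printf '%s\n' "$out"
}

pt_flag_names bc
```

Letters outside the documented set are reported as unknown rather than dropped,
so unexpected input stays visible.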
				105
				106	While it is possible to create scripts to analyze the data, an alternative
				107	approach is available to export the data to a sqlite or postgresql database.
				108	Refer to script export-to-sqlite.py or export-to-postgresql.py for more details,
				109	and to script call-graph-from-sql.py for an example of using the database.
				110
				111	There is also script intel-pt-events.py which provides an example of how to
				112	unpack the raw data for power events and PTWRITE.
				113
				114	As mentioned above, it is easy to capture too much data. One way to limit the
				115	data captured is to use 'snapshot' mode which is explained further below.
				116	Refer to 'new snapshot option' and 'Intel PT modes of operation' further below.
				117
				118	Another problem that will be experienced is decoder errors. They can be caused
				119	by inability to access the executed image, self-modified or JIT-ed code, or the
				120	inability to match side-band information (such as context switches and mmaps)
				121	which results in the decoder not knowing what code was executed.
				122
				123	There is also the problem of perf not being able to copy the data fast enough,
				124	resulting in lost data because the buffer was full. See 'Buffer handling' below
				125	for more details.
				126
				127
				128	perf record
				129	===========
				130
				131	new event
				132	---------
				133
				134	The Intel PT kernel driver creates a new PMU for Intel PT. PMU events are
				135	selected by providing the PMU name followed by the "config" separated by slashes.
				136	An enhancement has been made to allow default "config" e.g. the option
				137
				138	-e intel_pt//
				139
				140	will use a default config value. Currently that is the same as
				141
				142	-e intel_pt/tsc,noretcomp=0/
				143
				144	which is the same as
				145
				146	-e intel_pt/tsc=1,noretcomp=0/
				147
				148	Note there are now new config terms - see section 'config terms' further below.
				149
				150	The config terms are listed in /sys/devices/intel_pt/format. They are bit
				151	fields within the config member of the struct perf_event_attr which is
				152	passed to the kernel by the perf_event_open system call. They correspond to bit
				153	fields in the IA32_RTIT_CTL MSR. Here is a list of them and their definitions:
				154
				155	$ grep -H . /sys/bus/event_source/devices/intel_pt/format/*
				156	/sys/bus/event_source/devices/intel_pt/format/cyc:config:1
				157	/sys/bus/event_source/devices/intel_pt/format/cyc_thresh:config:19-22
				158	/sys/bus/event_source/devices/intel_pt/format/mtc:config:9
				159	/sys/bus/event_source/devices/intel_pt/format/mtc_period:config:14-17
				160	/sys/bus/event_source/devices/intel_pt/format/noretcomp:config:11
				161	/sys/bus/event_source/devices/intel_pt/format/psb_period:config:24-27
				162	/sys/bus/event_source/devices/intel_pt/format/tsc:config:10
				163
				164	Note that the default config must be overridden for each term i.e.
				165
				166	-e intel_pt/noretcomp=0/
				167
				168	is the same as:
				169
				170	-e intel_pt/tsc=1,noretcomp=0/
				171
				172	So, to disable TSC packets use:
				173
				174	-e intel_pt/tsc=0/
				175
				176	It is also possible to specify the config value explicitly:
				177
				178	-e intel_pt/config=0x400/
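
To see how an explicit config value relates to the terms, the sketch below
assembles a value from the bit positions in the format listing above (mtc is
bit 9, tsc bit 10, noretcomp bit 11, mtc_period bits 14-17). The function name
pt_config is hypothetical and only illustrates the arithmetic; the bit
positions on a given system should be taken from its own format directory.

```shell
# Hypothetical helper: build a config value from a few config terms.
# Usage: pt_config TSC NORETCOMP MTC MTC_PERIOD
# Bit positions are copied from the example format listing in the text.
pt_config() {
    tsc=$1 noretcomp=$2 mtc=$3 mtc_period=$4
    printf '0x%x\n' $(( (mtc << 9) | (tsc << 10) | (noretcomp << 11) | (mtc_period << 14) ))
}

pt_config 1 0 0 0    # tsc=1,noretcomp=0
pt_config 1 0 1 3    # tsc=1,noretcomp=0,mtc=1,mtc_period=3
```

The first call yields 0x400, matching the explicit config value shown above.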
				179
				180	Note that, as with all events, the event can be suffixed with event modifiers:
				181
				182	u userspace
				183	k kernel
				184	h hypervisor
				185	G guest
				186	H host
				187	p precise ip
				188
				189	'h', 'G' and 'H' are for virtualization which is not supported by Intel PT.
				190	'p' is also not relevant to Intel PT. So only options 'u' and 'k' are
				191	meaningful for Intel PT.
				192
				193	perf_event_attr is displayed if the -vv option is used e.g.
				194
				195	------------------------------------------------------------
				196	perf_event_attr:
				197	type 6
				198	size 112
				199	config 0x400
				200	{ sample_period, sample_freq } 1
				201	sample_type IP|TID|TIME|CPU|IDENTIFIER
				202	read_format ID
				203	disabled 1
				204	inherit 1
				205	exclude_kernel 1
				206	exclude_hv 1
				207	enable_on_exec 1
				208	sample_id_all 1
				209	------------------------------------------------------------
				210	sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8
				211	sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8
				212	sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8
				213	sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8
				214	------------------------------------------------------------
				215
				216
				217	config terms
				218	------------
				219
				220	The June 2015 version of Intel 64 and IA-32 Architectures Software Developer
				221	Manuals, Chapter 36 Intel Processor Trace, defined new Intel PT features.
				222	Some of the features are reflected in new config terms. All the config terms are
				223	described below.
				224
				225	tsc Always supported. Produces TSC timestamp packets to provide
				226	timing information. In some cases it is possible to decode
				227	without timing information, for example a per-thread context
				228	that does not overlap executable memory maps.
				229
				230	The default config selects tsc (i.e. tsc=1).
				231
				232	noretcomp Always supported. Disables "return compression" so a TIP packet
				233	is produced when a function returns. Causes more packets to be
				234	produced but might make decoding more reliable.
				235
				236	The default config does not select noretcomp (i.e. noretcomp=0).
				237
				238	psb_period Allows the frequency of PSB packets to be specified.
				239
				240	The PSB packet is a synchronization packet that provides a
				241	starting point for decoding or recovery from errors.
				242
				243	Support for psb_period is indicated by:
				244
				245	/sys/bus/event_source/devices/intel_pt/caps/psb_cyc
				246
				247	which contains "1" if the feature is supported and "0"
				248	otherwise.
				249
				250	Valid values are given by:
				251
				252	/sys/bus/event_source/devices/intel_pt/caps/psb_periods
				253
				254	which contains a hexadecimal value, the bits of which represent
				255	valid values e.g. bit 2 set means value 2 is valid.
				256
				257	The psb_period value is converted to the approximate number of
				258	trace bytes between PSB packets as:
				259
				260	2 ^ (value + 11)
				261
				262	e.g. value 3 means 16KiB between PSBs
				263
				264	If an invalid value is entered, the error message
				265	will give a list of valid values e.g.
				266
				267	$ perf record -e intel_pt/psb_period=15/u uname
				268	Invalid psb_period for intel_pt. Valid values are: 0-5
				269
				270	If MTC packets are selected, the default config selects a value
				271	of 3 (i.e. psb_period=3) or the nearest lower value that is
				272	supported (0 is always supported). Otherwise the default is 0.
				273
				274	If decoding is expected to be reliable and the buffer is large
				275	then a large PSB period can be used.
				276
				277	Because a TSC packet is produced with PSB, the PSB period can
				278	also affect the granularity of timing information in the absence
				279	of MTC or CYC.
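
The psb_period conversion above can be tabulated with a few lines of shell.
This is only a sketch of the formula 2 ^ (value + 11); the function name
psb_period_bytes is hypothetical, and the range 0-5 is taken from the example
error message above, not from any particular hardware.

```shell
# Approximate number of trace bytes between PSB packets: 2 ^ (value + 11).
psb_period_bytes() {
    echo $(( 1 << ($1 + 11) ))
}

# Values 0-5 are the ones reported as valid in the example error message.
for v in 0 1 2 3 4 5; do
    printf 'psb_period=%d -> %d bytes\n' "$v" "$(psb_period_bytes "$v")"
done
```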
				280
				281	mtc Produces MTC timing packets.
				282
				283	MTC packets provide finer grain timestamp information than TSC
				284	packets. MTC packets record time using the hardware crystal
				285	clock (CTC) which is related to TSC packets using a TMA packet.
				286
				287	Support for this feature is indicated by:
				288
				289	/sys/bus/event_source/devices/intel_pt/caps/mtc
				290
				291	which contains "1" if the feature is supported and
				292	"0" otherwise.
				293
				294	The frequency of MTC packets can also be specified - see
				295	mtc_period below.
				296
				297	mtc_period Specifies how frequently MTC packets are produced - see mtc
				298	above for how to determine if MTC packets are supported.
				299
				300	Valid values are given by:
				301
				302	/sys/bus/event_source/devices/intel_pt/caps/mtc_periods
				303
				304	which contains a hexadecimal value, the bits of which represent
				305	valid values e.g. bit 2 set means value 2 is valid.
				306
				307	The mtc_period value is converted to the MTC frequency as:
				308
				309	CTC-frequency / (2 ^ value)
				310
				311	e.g. value 3 means one eighth of CTC-frequency
				312
				313	Where CTC is the hardware crystal clock, the frequency of which
				314	can be related to TSC via values provided in cpuid leaf 0x15.
				315
				316	If an invalid value is entered, the error message
				317	will give a list of valid values e.g.
				318
				319	$ perf record -e intel_pt/mtc_period=15/u uname
				320	Invalid mtc_period for intel_pt. Valid values are: 0,3,6,9
				321
				322	The default value is 3 or the nearest lower value
				323	that is supported (0 is always supported).
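
The two calculations described for mtc_period can be sketched as follows. The
function names are hypothetical. The mask 249 is chosen so that it decodes to
the values 0,3,6,9 from the example error message, the 24 MHz CTC frequency is
purely an illustrative assumption, and the sketch assumes the caps file holds
bare hexadecimal digits without a 0x prefix.

```shell
# List the values encoded in a hexadecimal capability mask
# (bit N set means value N is valid).
caps_valid_values() {
    mask=$(( 0x$1 ))
    bit=0
    out=
    while [ "$mask" -ne 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            out="${out:+$out,}$bit"
        fi
        mask=$(( mask >> 1 ))
        bit=$(( bit + 1 ))
    done
    echo "$out"
}

# MTC frequency in Hz: CTC-frequency / (2 ^ value).
# Usage: mtc_frequency CTC_HZ VALUE
mtc_frequency() {
    echo $(( $1 >> $2 ))
}

caps_valid_values 249        # hypothetical mask
mtc_frequency 24000000 3     # hypothetical 24 MHz CTC, mtc_period=3
```

On a real system the mask argument would come from the mtc_periods file named
above rather than from a literal.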
				324
				325	cyc Produces CYC timing packets.
				326
				327	CYC packets provide even finer grain timestamp information than
				328	MTC and TSC packets. A CYC packet contains the number of CPU
				329	cycles since the last CYC packet. Unlike MTC and TSC packets,
				330	CYC packets are only sent when another packet is also sent.
				331
				332	Support for this feature is indicated by:
				333
				334	/sys/bus/event_source/devices/intel_pt/caps/psb_cyc
				335
				336	which contains "1" if the feature is supported and
				337	"0" otherwise.
				338
				339	The number of CYC packets produced can be reduced by specifying
				340	a threshold - see cyc_thresh below.
				341
				342	cyc_thresh Specifies how frequently CYC packets are produced - see cyc
				343	above for how to determine if CYC packets are supported.
				344
				345	Valid cyc_thresh values are given by:
				346
				347	/sys/bus/event_source/devices/intel_pt/caps/cycle_thresholds
				348
				349	which contains a hexadecimal value, the bits of which represent
				350	valid values e.g. bit 2 set means value 2 is valid.
				351
				352	The cyc_thresh value represents the minimum number of CPU cycles
				353	that must have passed before a CYC packet can be sent. The
				354	number of CPU cycles is:
				355
				356	2 ^ (value - 1)
				357
				358	e.g. value 4 means 8 CPU cycles must pass before a CYC packet
				359	can be sent. Note a CYC packet is still only sent when another
				360	packet is sent, not at, e.g. every 8 CPU cycles.
				361
				362	If an invalid value is entered, the error message
				363	will give a list of valid values e.g.
				364
				365	$ perf record -e intel_pt/cyc,cyc_thresh=15/u uname
				366	Invalid cyc_thresh for intel_pt. Valid values are: 0-12
				367
				368	CYC packets are not requested by default.
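
The cyc_thresh conversion can be checked with a short sketch. The function name
cyc_thresh_cycles is hypothetical. The text only gives the formula
2 ^ (value - 1), so treating value 0 as "no threshold" (printed as 0) is an
assumption made here, not something stated above.

```shell
# Minimum number of CPU cycles before a CYC packet can be sent.
cyc_thresh_cycles() {
    if [ "$1" -eq 0 ]; then
        echo 0                        # assumption: 0 means no threshold
    else
        echo $(( 1 << ($1 - 1) ))     # 2 ^ (value - 1)
    fi
}

cyc_thresh_cycles 4
```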
				369
				370	pt Specifies pass-through which enables the 'branch' config term.
				371
				372	The default config selects 'pt' if it is available, so a user will
				373	never need to specify this term.
				374
				375	branch Enable branch tracing. Branch tracing is enabled by default so to
				376	disable branch tracing use 'branch=0'.
				377
				378	The default config selects 'branch' if it is available.
				379
				380	ptw Enable PTWRITE packets which are produced when a ptwrite instruction
				381	is executed.
				382
				383	Support for this feature is indicated by:
				384
				385	/sys/bus/event_source/devices/intel_pt/caps/ptwrite
				386
				387	which contains "1" if the feature is supported and
				388	"0" otherwise.
				389
				390	fup_on_ptw Enable a FUP packet to follow the PTWRITE packet. The FUP packet
				391	provides the address of the ptwrite instruction. In the absence of
				392	fup_on_ptw, the decoder will use the address of the previous branch
				393	if branch tracing is enabled, otherwise the address will be zero.
				394	Note that fup_on_ptw will work even when branch tracing is disabled.
				395
				396	pwr_evt Enable power events. The power events provide information about
				397	changes to the CPU C-state.
				398
				399	Support for this feature is indicated by:
				400
				401	/sys/bus/event_source/devices/intel_pt/caps/power_event_trace
				402
				403	which contains "1" if the feature is supported and
				404	"0" otherwise.
				405
				406
				407	new snapshot option
				408	-------------------
				409
				410	The difference between full trace and snapshot from the kernel's perspective is
				411	that in full trace we don't overwrite trace data that the user hasn't collected
				412	yet (and indicated that by advancing aux_tail), whereas in snapshot mode we let
				413	the trace run and overwrite older data in the buffer so that whenever something
				414	interesting happens, we can stop it and grab a snapshot of what was going on
				415	around that interesting moment.
				416
				417	To select snapshot mode a new option has been added:
				418
				419	-S
				420
				421	Optionally it can be followed by the snapshot size e.g.
				422
				423	-S0x100000
				424
				425	The default snapshot size is the auxtrace mmap size. If neither auxtrace mmap size
				426	nor snapshot size is specified, then the default is 4MiB for privileged users
				427	(or if /proc/sys/kernel/perf_event_paranoid < 0), 128KiB for unprivileged users.
				428	If an unprivileged user does not specify mmap pages, the mmap pages will be
				429	reduced as described in the 'new auxtrace mmap size option' section below.
				430
				431	The snapshot size is displayed if the option -vv is used e.g.
				432
				433	Intel PT snapshot size: %zu
				434
				435
				436	new auxtrace mmap size option
				437	-----------------------------
				438
				439	Intel PT buffer size is specified by an addition to the -m option e.g.
				440
				441	-m,16
				442
				443	selects a buffer size of 16 pages i.e. 64KiB.
				444
				445	Note that the existing functionality of -m is unchanged. The auxtrace mmap size
				446	is specified by the optional addition of a comma and the value.
				447
				448	The default auxtrace mmap size for Intel PT is 4MiB/page_size for privileged users
				449	(or if /proc/sys/kernel/perf_event_paranoid < 0), 128KiB for unprivileged users.
				450	If an unprivileged user does not specify mmap pages, the mmap pages will be
				451	reduced from the default 512KiB/page_size to 256KiB/page_size, otherwise the
				452	user is likely to get an error as they exceed their mlock limit (Max locked
				453	memory as shown in /proc/self/limits). Note that perf does not count the first
				454	512KiB (actually /proc/sys/kernel/perf_event_mlock_kb minus 1 page) per cpu
				455	against the mlock limit so an unprivileged user is allowed 512KiB per cpu plus
				456	their mlock limit (which defaults to 64KiB but is not multiplied by the number
				457	of cpus).
				458
				459	In full-trace mode, powers of two are allowed for buffer size, with a minimum
				460	size of 2 pages. In snapshot mode, it is the same but the minimum size is
				461	1 page.
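
The size rules above (a power of two, at least 2 pages in full-trace mode and
at least 1 page in snapshot mode) can be sketched as a pre-check before running
'perf record'. The function name aux_buffer_bytes is hypothetical, and the
default page size of 4096 bytes is an assumption; pass the real page size as
the third argument if it differs.

```shell
# Usage: aux_buffer_bytes PAGES MODE [PAGE_SIZE]
# MODE is "full" or "snapshot". Prints the buffer size in bytes, or fails
# if PAGES breaks the rules described in the text.
aux_buffer_bytes() {
    pages=$1 mode=$2 page_size=${3:-4096}
    case $mode in
        full) min=2 ;;
        snapshot) min=1 ;;
        *) echo "unknown mode: $mode" >&2; return 2 ;;
    esac
    # A power of two has exactly one bit set, so n & (n - 1) is zero.
    if [ "$pages" -lt "$min" ] || [ $(( pages & (pages - 1) )) -ne 0 ]; then
        echo "invalid: need a power of two, at least $min page(s)" >&2
        return 1
    fi
    echo $(( pages * page_size ))
}

aux_buffer_bytes 16 full    # corresponds to -m,16
```

With the assumed 4096-byte page, 16 pages gives 65536 bytes, i.e. the 64KiB
mentioned for -m,16.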
				462
				463	The mmap size and auxtrace mmap size are displayed if the -vv option is used e.g.
				464
				465	mmap length 528384
				466	auxtrace mmap length 4198400
				467
				468
				469	Intel PT modes of operation
				470	---------------------------
				471
				472	Intel PT can be used in 2 modes:
				473	full-trace mode
				474	snapshot mode
				475
				476	Full-trace mode traces continuously e.g.
				477
				478	perf record -e intel_pt//u uname
				479
				480	Snapshot mode captures the available data when a signal is sent e.g.
				481
				482	perf record -v -e intel_pt//u -S ./loopy 1000000000 &
				483	[1] 11435
				484	kill -USR2 11435
				485	Recording AUX area tracing snapshot
				486
				487	Note that the signal sent is SIGUSR2.
				488	Note that "Recording AUX area tracing snapshot" is displayed because the -v
				489	option is used.
				490
				491	The 2 modes cannot be used together.
				492
				493
				494	Buffer handling
				495	---------------
				496
				497	There may be buffer limitations (i.e. single ToPa entry) which means that actual
				498	buffer sizes are limited to powers of 2 up to 4MiB (MAX_ORDER). In order to
				499	provide other sizes, and in particular an arbitrarily large size, multiple
				500	buffers are logically concatenated. However an interrupt must be used to switch
				501	between buffers. That has two potential problems:
				502	a) the interrupt may not be handled in time so that the current buffer
				503	becomes full and some trace data is lost.
				504	b) the interrupts may slow the system and affect the performance
				505	results.
				506
				507	If trace data is lost, the driver sets 'truncated' in the PERF_RECORD_AUX event
				508	which the tools report as an error.
				509
				510	In full-trace mode, the driver waits for data to be copied out before allowing
				511	the (logical) buffer to wrap-around. If data is not copied out quickly enough,
				512	again 'truncated' is set in the PERF_RECORD_AUX event. If the driver has to
				513	wait, the intel_pt event gets disabled. Because it is difficult to know when
				514	that happens, perf tools always re-enable the intel_pt event after copying out
				515	data.
				516
				517
				518	Intel PT and build ids
				519	----------------------
				520
				521	By default "perf record" post-processes the event stream to find all build ids
				522	for executables for all addresses sampled. Deliberately, Intel PT is not
				523	decoded for that purpose (it would take too long). Instead the build ids for
				524	all executables encountered (due to mmap, comm or task events) are included
				525	in the perf.data file.
				526
				527	To see buildids included in the perf.data file use the command:
				528
				529	perf buildid-list
				530
				531	If the perf.data file contains Intel PT data, that is the same as:
				532
				533	perf buildid-list --with-hits
				534
				535
				536	Snapshot mode and event disabling
				537	---------------------------------
				538
				539	In order to make a snapshot, the intel_pt event is disabled using an IOCTL,
				540	namely PERF_EVENT_IOC_DISABLE. However doing that can also disable the
				541	collection of side-band information. In order to prevent that, a dummy
				542	software event has been introduced that permits tracking events (like mmaps) to
				543	continue to be recorded while intel_pt is disabled. That is important to ensure
				544	there is complete side-band information to allow the decoding of subsequent
				545	snapshots.
				546
				547	A test has been created for that. To find the test:
				548
				549	perf test list
				550	...
				551	23: Test using a dummy software event to keep tracking
				552
				553	To run the test:
				554
				555	perf test 23
				556	23: Test using a dummy software event to keep tracking : Ok
				557
				558
				559	perf record modes (nothing new here)
				560	------------------------------------
				561
				562	perf record essentially operates in one of three modes:
				563	per thread
				564	per cpu
				565	workload only
				566
				567	"per thread" mode is selected by -t or by --per-thread (with -p or -u or just a
				568	workload).
				569	"per cpu" is selected by -C or -a.
				570	"workload only" mode is selected by not using the other options but providing a
				571	command to run (i.e. the workload).
				572
				573	In per-thread mode an exact list of threads is traced. There is no inheritance.
				574	Each thread has its own event buffer.
				575
				576	In per-cpu mode all processes (or processes from the selected cgroup i.e. -G
				577	option, or processes selected with -p or -u) are traced. Each cpu has its own
				578	buffer. Inheritance is allowed.
				579
				580	In workload-only mode, the workload is traced but with per-cpu buffers.
				581	Inheritance is allowed. Note that you can now trace a workload in per-thread
				582	mode by using the --per-thread option.
				583
				584
				585	Privileged vs non-privileged users
				586	----------------------------------
				587
				588	Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged users
				589	have memory limits imposed upon them. That affects what buffer sizes they can
				590	have as outlined above.
				591
				592	The v4.2 kernel introduced support for a context switch metadata event,
				593	PERF_RECORD_SWITCH, which allows unprivileged users to see when their processes
				594	are scheduled out and in, just not by whom, which is left for the
				595	PERF_RECORD_SWITCH_CPU_WIDE, that is only accessible in system wide context,
				596	which in turn requires CAP_SYS_ADMIN.
				597
				598	Please see the 45ac1403f564 ("perf: Add PERF_RECORD_SWITCH to indicate context
				599	switches") commit, that introduces these metadata events for further info.
				600
				601	When working with kernels < v4.2, the following must be taken into account,
				602	as the sched:sched_switch tracepoints will be used to receive such information:
				603
				604	Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged users are
				605	not permitted to use tracepoints which means there is insufficient side-band
				606	information to decode Intel PT in per-cpu mode, and potentially workload-only
				607	mode too if the workload creates new processes.
				608
				609	Note also, that to use tracepoints, read-access to debugfs is required. So if
				610	debugfs is not mounted or the user does not have read-access, it will again not
				611	be possible to decode Intel PT in per-cpu mode.
				612
				613
				614	sched_switch tracepoint
				615	-----------------------
				616
				617	The sched_switch tracepoint is used to provide side-band data for Intel PT
				618	decoding in kernels where the PERF_RECORD_SWITCH metadata event isn't
				619	available.
				620
				621	The sched_switch events are automatically added. e.g. the second event shown
				622	below:
				623
				624	$ perf record -vv -e intel_pt//u uname
				625	------------------------------------------------------------
				626	perf_event_attr:
				627	type 6
				628	size 112
				629	config 0x400
				630	{ sample_period, sample_freq } 1
				631	sample_type IP|TID|TIME|CPU|IDENTIFIER
				632	read_format ID
				633	disabled 1
				634	inherit 1
				635	exclude_kernel 1
				636	exclude_hv 1
				637	enable_on_exec 1
				638	sample_id_all 1
				639	------------------------------------------------------------
				640	sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8
				641	sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8
				642	sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8
				643	sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8
				644	------------------------------------------------------------
				645	perf_event_attr:
				646	type 2
				647	size 112
				648	config 0x108
				649	{ sample_period, sample_freq } 1
				650	sample_type IP|TID|TIME|CPU|PERIOD|RAW|IDENTIFIER
				651	read_format ID
				652	inherit 1
				653	sample_id_all 1
				654	exclude_guest 1
				655	------------------------------------------------------------
				656	sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8
				657	sys_perf_event_open: pid -1 cpu 1 group_fd -1 flags 0x8
				658	sys_perf_event_open: pid -1 cpu 2 group_fd -1 flags 0x8
				659	sys_perf_event_open: pid -1 cpu 3 group_fd -1 flags 0x8
				660	------------------------------------------------------------
				661	perf_event_attr:
				662	type 1
				663	size 112
				664	config 0x9
				665	{ sample_period, sample_freq } 1
				666	sample_type IP|TID|TIME|IDENTIFIER
				667	read_format ID
				668	disabled 1
				669	inherit 1
				670	exclude_kernel 1
				671	exclude_hv 1
				672	mmap 1
				673	comm 1
				674	enable_on_exec 1
				675	task 1
				676	sample_id_all 1
				677	mmap2 1
				678	comm_exec 1
				679	------------------------------------------------------------
				680	sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8
				681	sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8
				682	sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8
				683	sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8
				684	mmap size 528384B
				685	AUX area mmap length 4194304
				686	perf event ring buffer mmapped per cpu
				687	Synthesizing auxtrace information
				688	Linux
				689	[ perf record: Woken up 1 times to write data ]
				690	[ perf record: Captured and wrote 0.042 MB perf.data ]
				691
				692	Note, the sched_switch event is only added if the user is permitted to use it
				693	and only in per-cpu mode.
				694
				695	Note also, the sched_switch event is only added if TSC packets are requested.
				696	That is because, in the absence of timing information, the sched_switch events
				697	cannot be matched against the Intel PT trace.
				698
				699
				700	perf script
				701	===========
				702
				703	By default, perf script will decode trace data found in the perf.data file.
				704	This can be further controlled by new option --itrace.
				705
				706
				707	New --itrace option
				708	-------------------
				709
				710	Having no option is the same as
				711
				712	--itrace
				713
				714	which, in turn, is the same as
				715
				716	--itrace=ibxwpe
				717
				718	The letters are:
				719
				720	i synthesize "instructions" events
				721	b synthesize "branches" events
				722	x synthesize "transactions" events
				723	w synthesize "ptwrite" events
				724	p synthesize "power" events
				725	c synthesize branches events (calls only)
				726	r synthesize branches events (returns only)
				727	e synthesize tracing error events
				728	d create a debug log
				729	g synthesize a call chain (use with i or x)
				730	l synthesize last branch entries (use with i or x)
				731	s skip initial number of events
				732
				733	"Instructions" events look like they were recorded by "perf record -e
				734	instructions".
				735
				736	"Branches" events look like they were recorded by "perf record -e branches". "c"
				737	and "r" can be combined to get calls and returns.
				738
				739	"Transactions" events correspond to the start or end of transactions. The
				740	'flags' field can be used in perf script to determine whether the event is a
				741	transaction start, commit or abort.
				742
				743	Note that "instructions", "branches" and "transactions" events depend on code
				744	flow packets which can be disabled by using the config term "branch=0". Refer
				745	to the config terms section above.
				746
				747	"ptwrite" events record the payload of the ptwrite instruction and whether
				748	"fup_on_ptw" was used. "ptwrite" events depend on PTWRITE packets which are
				749	recorded only if the "ptw" config term was used. Refer to the config terms
				750	section above. perf script "synth" field displays "ptwrite" information like
				751	this: "ip: 0 payload: 0x123456789abcdef0" where "ip" is 1 if "fup_on_ptw" was
				752	used.
				753
				754	"Power" events correspond to power event packets and CBR (core-to-bus ratio)
				755	packets. While CBR packets are always recorded when tracing is enabled, power
				756	event packets are recorded only if the "pwr_evt" config term was used. Refer to
				757	the config terms section above. The power events record information about
				758	C-state changes, whereas CBR is indicative of CPU frequency. perf script
				759	"event,synth" fields display information like this:
				760		cbr: cbr: 22 freq: 2189 MHz (200%)
				761		mwait: hints: 0x60 extensions: 0x1
				762		pwre: hw: 0 cstate: 2 sub-cstate: 0
				763		exstop: ip: 1
				764		pwrx: deepest cstate: 2 last cstate: 2 wake reason: 0x4
				765	Where:
				766		"cbr" includes the frequency and the percentage of maximum non-turbo
				767		frequency
				768		"mwait" shows mwait hints and extensions
				769		"pwre" shows C-state transitions (to a C-state deeper than C0) and
				770		whether initiated by hardware
				771		"exstop" indicates execution stopped and whether the IP was recorded
				772		exactly
				772		"pwrx" indicates return to C0
				773	For more details refer to the Intel 64 and IA-32 Architectures Software
				774	Developer Manuals.
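
As a rough illustration of post-processing the "event,synth" text shown above,
the sketch below splits one such line into an event name and its numeric
fields. The helper name parse_power_line is hypothetical (it is not part of
perf), and the regular expression assumes exactly the "key: number" layout of
the examples quoted in this section. Non-numeric trailers such as
"MHz (200%)" are ignored.

```python
import re

# Hypothetical helper, not part of perf. It assumes each line has the shape
# "<event>: <key>: <number> <key>: <number> ..." as in the examples above.
_FIELD_RE = re.compile(r"([a-z][a-z\- ]*?):\s*(0x[0-9a-fA-F]+|\d+)")


def parse_power_line(line):
    """Split one "event,synth" line into (event_name, {key: int_value})."""
    event, _, synth = line.strip().partition(":")
    fields = {}
    for key, value in _FIELD_RE.findall(synth):
        # Values are printed either as decimal or as 0x-prefixed hex.
        if value.startswith("0x"):
            fields[key.strip()] = int(value, 16)
        else:
            fields[key.strip()] = int(value)
    return event.strip(), fields


if __name__ == "__main__":
    for text in (
        "cbr: cbr: 22 freq: 2189 MHz (200%)",
        "pwrx: deepest cstate: 2 last cstate: 2 wake reason: 0x4",
    ):
        print(parse_power_line(text))
```

Keys made of several words, such as "deepest cstate", are kept as a single
dictionary key.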
				775
				776	Error events show where the decoder lost the trace. They are important
				777	because users must know whether what they are seeing is a complete
				778	picture.
				779
				780	The "d" option will cause the creation of a file "intel_pt.log" containing all
				781	decoded packets and instructions. Note that this option slows down the decoder
				782	and that the resulting file may be very large.
				783
				784	In addition, the period of the "instructions" event can be specified. e.g.
				785
				786	--itrace=i10us
				787
				788	sets the period to 10us i.e. one instruction sample is synthesized for each 10
				789	microseconds of trace. Alternatives to "us" are "ms" (milliseconds),
				790	"ns" (nanoseconds), "t" (TSC ticks) or "i" (instructions).
				791
				792	"ms", "us" and "ns" are converted to TSC ticks.
				793
				794	The timing information included with Intel PT does not give the time of every
				795	instruction. Consequently, for the purpose of sampling, the decoder estimates
				796	the time since the last timing packet based on 1 tick per instruction. The time
				797	on the sample is not adjusted and reflects the last known value of TSC.
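
The estimation described above can be modelled with a small sketch. This is a
simplified model and not the decoder's actual code: the function names, the
("tsc", value) / ("insn", count) stream representation and the tsc_hz
parameter are all assumptions made for illustration. It shows the two points
made in the text: time periods become tick counts, and a sample carries the
last known TSC value even though the decision to sample uses an estimate of
1 tick per instruction.

```python
# Simplified model, not perf's decoder. All names and the stream layout
# are hypothetical; tsc_hz (TSC ticks per second) is an assumed input.

def period_to_ticks(value, unit, tsc_hz):
    """Convert an "ms", "us" or "ns" period to TSC ticks ("t" is unchanged)."""
    per_second = {"ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}
    if unit == "t":
        return value
    return value * tsc_hz // per_second[unit]


def sample_times(stream, period_ticks):
    """Yield the timestamp placed on each synthesized sample.

    stream holds ("tsc", value) for a timing packet and ("insn", count)
    for a run of decoded instructions.
    """
    last_tsc = 0        # last TSC value actually seen in the trace
    estimate = 0        # last_tsc plus 1 tick per instruction since then
    next_sample = None  # estimated time at which the next sample is due
    for kind, value in stream:
        if kind == "tsc":
            last_tsc = estimate = value
            if next_sample is None:
                next_sample = value + period_ticks
            continue
        for _ in range(value):
            estimate += 1
            if next_sample is not None and estimate >= next_sample:
                # The sample time is NOT the estimate: it is the last TSC.
                yield last_tsc
                next_sample = estimate + period_ticks


if __name__ == "__main__":
    ticks = period_to_ticks(10, "us", 2_000_000_000)
    print(ticks)
    stream = [("tsc", 1000), ("insn", 250), ("tsc", 2000), ("insn", 100)]
    print(list(sample_times(stream, 100)))
```

In this model, several consecutive samples can share one timestamp when no
timing packet arrives between them.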
				798
				799	For Intel PT, the default period is 100us.
				800
				801	Setting it to a zero period means "as often as possible".
				802
				803	In the case of Intel PT that is the same as a period of 1 and a unit of
				804	'instructions' (i.e. --itrace=i1i).
				805
				806	Also the call chain size (default 16, max. 1024) for instructions or
				807	transactions events can be specified. e.g.
				808
				809	--itrace=ig32
				810	--itrace=xg32
				811
				812	Also the number of last branch entries (default 64, max. 1024) for instructions or
				813	transactions events can be specified. e.g.
				814
				815	--itrace=il10
				816	--itrace=xl10
				817
				818	Note that last branch entries are cleared for each sample, so there is no overlap
				819	from one sample to the next.
				820
				821	To disable trace decoding entirely, use the option --no-itrace.
				822
				823	It is also possible to skip events generated (instructions, branches, transactions)
				824	at the beginning. This is useful to ignore initialization code.
				825
				826	--itrace=i0nss1000000
				827
				828	skips the first million instructions.
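
To make the option syntax above concrete, here is a minimal sketch of how an
--itrace option string could be broken down. It is not perf's parser: the
function name, the result dictionary and the error handling (for example,
rejecting "s" without a count) are assumptions for illustration. The defaults
and limits are the ones stated in this section.

```python
import re

# Simplified sketch, not perf's parser. All names are hypothetical; the
# defaults and limits are the ones stated in the text above.
DEFAULT_OPTS = "ibxwpe"               # what a bare --itrace means for perf script
LETTERS = "ibxwpcredgls"
UNITS = ("ms", "us", "ns", "t", "i")  # period units accepted after "i<number>"
SIZE_DEFAULTS = {"g": ("callchain", 16), "l": ("last_branch", 64)}
SIZE_MAX = 1024
_NUMBER_RE = re.compile(r"\d+")


def parse_itrace(opts=""):
    """Return a dict describing an --itrace option string."""
    opts = opts or DEFAULT_OPTS
    result = {"flags": set(), "period": None, "period_unit": None,
              "callchain": None, "last_branch": None, "skip": 0}
    pos = 0
    while pos < len(opts):
        letter = opts[pos]
        pos += 1
        if letter not in LETTERS:
            raise ValueError(f"unknown --itrace letter: {letter!r}")
        # An optional decimal argument may directly follow the letter.
        m = _NUMBER_RE.match(opts, pos)
        number = int(m.group()) if m else None
        if m:
            pos = m.end()
        if letter == "i":
            if number is not None:
                result["period"] = number
                for unit in UNITS:
                    if opts.startswith(unit, pos):
                        result["period_unit"] = unit
                        pos += len(unit)
                        break
                if number == 0:
                    # Zero means "as often as possible", i.e. i1i for Intel PT.
                    result["period"], result["period_unit"] = 1, "i"
        elif letter in SIZE_DEFAULTS:
            name, default = SIZE_DEFAULTS[letter]
            size = default if number is None else number
            if size > SIZE_MAX:
                raise ValueError(f"{name} size exceeds {SIZE_MAX}")
            result[name] = size
        elif letter == "s":
            if number is None:
                raise ValueError("'s' needs a count in this sketch")
            result["skip"] = number
        elif number is not None:
            raise ValueError(f"letter {letter!r} takes no number")
        result["flags"].add(letter)
    return result


if __name__ == "__main__":
    print(parse_itrace("i0nss1000000"))
```

With "i0nss1000000" the zero period is normalized to one sample per
instruction, so skipping the first 1000000 events skips the first million
instructions.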
				829
				830	dump option
				831	-----------
				832
				833	perf script has an option (-D) to "dump" the events i.e. display the binary
				834	data.
				835
				836	When -D is used, Intel PT packets are displayed. The packet decoder does not
				837	pay attention to PSB packets, but just decodes the bytes - so the packets seen
				838	by the actual decoder may not be identical in places where the data is corrupt.
				839	One example of that would be when the buffer-switching interrupt has been too
				840	slow, and the buffer has been filled completely. In that case, the last packet
				841	in the buffer might be truncated and immediately followed by a PSB as the trace
				842	continues in the next buffer.
				843
				844	To disable the display of Intel PT packets, combine the -D option with
				845	--no-itrace.
				846
				847
				848	perf report
				849	===========
				850
				851	By default, perf report will decode trace data found in the perf.data file.
				852	This can be further controlled by the --itrace option, exactly as for
				853	perf script, except that the default is --itrace=igxe.
				854
				855
				856	perf inject
				857	===========
				858
				859	perf inject also accepts the --itrace option, in which case tracing data is
				860	removed and replaced with the synthesized events. e.g.
				861
				862	perf inject --itrace -i perf.data -o perf.data.new
				863
				864	Below is an example of using Intel PT with autofdo. It requires autofdo
				865	(https://github.com/google/autofdo) and gcc version 5. The bubble
				866	sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tutorial)
				867	amended to take the number of elements as a parameter.
				868
				869	$ gcc-5 -O3 sort.c -o sort_optimized
				870	$ ./sort_optimized 30000
				871	Bubble sorting array of 30000 elements
				872	2254 ms
				873
				874	$ cat ~/.perfconfig
				875	[intel-pt]
				876	mispred-all = on
				877
				878	$ perf record -e intel_pt//u ./sort 3000
				879	Bubble sorting array of 3000 elements
				880	58 ms
				881	[ perf record: Woken up 2 times to write data ]
				882	[ perf record: Captured and wrote 3.939 MB perf.data ]
				883	$ perf inject -i perf.data -o inj --itrace=i100usle --strip
				884	$ ./create_gcov --binary=./sort --profile=inj --gcov=sort.gcov -gcov_version=1
				885	$ gcc-5 -O3 -fauto-profile=sort.gcov sort.c -o sort_autofdo
				886	$ ./sort_autofdo 30000
				887	Bubble sorting array of 30000 elements
				888	2155 ms
				889
				890	Note there is currently no advantage to using Intel PT instead of LBR, but
				891	that may change in the future if greater use is made of the data.