perf-c2c(1)
===========

NAME
----
perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS
--------
[verse]
'perf c2c record' [<options>] <command>
'perf c2c record' [<options>] -- [<record command options>] <command>
'perf c2c report' [<options>]

DESCRIPTION
-----------
C2C stands for Cache To Cache.

The perf c2c tool provides means for Shared Data C2C/HITM analysis. It allows
you to track down cacheline contention.

The tool is based on x86's load latency and precise store facility events
provided by Intel CPUs. These events provide:
  - memory address of the access
  - type of the access (load and store details)
  - latency (in cycles) of the load access

The c2c tool provides means to record this data and to report access details
for the cachelines with the highest contention, that is, the highest number
of HITM accesses.
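
The sampled memory address is what ties an access to a cacheline. The
following is a minimal sketch of that mapping, assuming 64-byte cachelines
(the size is an assumption; it depends on the CPU) and a hypothetical helper
name 'split_address'. It is an illustration, not the tool's own code.

```python
CACHELINE_SIZE = 64  # assumption: 64-byte cachelines, a power of two


def split_address(addr, line_size=CACHELINE_SIZE):
    """Return (cacheline base address, offset within the cacheline)."""
    base = addr & ~(line_size - 1)  # clear the low bits to get the line start
    return base, addr - base


base, offset = split_address(0x7F3A1C2048)
print(hex(base), hex(offset))
```

Two accesses contend for the same cacheline when their base addresses are
equal, even if their offsets differ.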

The basic workflow with this tool follows the standard record/report phases:
use the record command to record event data and the report command to
display it.


RECORD OPTIONS
--------------
-e::
--event=::
	Select the PMU event. Use 'perf mem record -e list'
	to list available events.

-v::
--verbose::
	Be more verbose (show counter open errors, etc).

-l::
--ldlat::
	Configure mem-loads latency.

-k::
--all-kernel::
	Configure all used events to run in kernel space.

-u::
--all-user::
	Configure all used events to run in user space.

REPORT OPTIONS
--------------
-k::
--vmlinux=<file>::
	vmlinux pathname

-v::
--verbose::
	Be more verbose (show counter open errors, etc).

-i::
--input::
	Specify the input file to process.

-N::
--node-info::
	Show extra node info in report (see NODE INFO section).

-c::
--coalesce::
	Specify sorting fields for single cacheline display.
	The following fields are available: tid,pid,iaddr,dso
	(see COALESCE).

-g::
--call-graph::
	Set up callchain parameters.
	Please refer to the perf-report man page for details.

--stdio::
	Force the stdio output (see STDIO OUTPUT).

--stats::
	Display only statistic tables and force stdio mode.

--full-symbols::
	Display full length of symbols.

--no-source::
	Do not display Source:Line column.

--show-all::
	Show all captured HITM lines, regardless of the HITM % 0.0005 limit.

-f::
--force::
	Don't do ownership validation.

-d::
--display::
	Select the HITM type (rmt, lcl) to display and sort on. Total HITMs is
	the default.

C2C RECORD
----------
The perf c2c record command sets up options related to HITM cacheline analysis
and calls the standard perf record command.

The following perf record options are configured by default
(check the perf record man page for details):

  -W,-d,--sample-cpu

Unless specified otherwise with the '-e' option, the following events are
monitored by default:

  cpu/mem-loads,ldlat=30/P
  cpu/mem-stores/P

Any 'perf record' option can be passed after the '--' mark, for example to
enable callchains and system-wide monitoring:

  $ perf c2c record -- -g -a

Please check the RECORD OPTIONS section for specific c2c record options.
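
The split between c2c record options and plain 'perf record' options can be
sketched as follows. This is only an illustration of the command line layout
from the SYNOPSIS; the helper name 'c2c_record_argv' and the 'sleep 5'
workload are assumptions, and the sketch only builds and prints the command
without running it.

```python
import shlex


def c2c_record_argv(command, c2c_opts=(), record_opts=()):
    """Assemble 'perf c2c record [<options>] -- [<record options>] <command>'."""
    argv = ["perf", "c2c", "record", *c2c_opts]
    if record_opts:
        # Everything after '--' is handed to perf record unchanged.
        argv += ["--", *record_opts]
    argv += list(command)
    return argv


argv = c2c_record_argv(["sleep", "5"],
                       c2c_opts=["--ldlat", "50", "--all-user"],
                       record_opts=["-g", "-a"])
print(shlex.join(argv))
```

Options listed in RECORD OPTIONS go before the '--' mark; everything after it
is interpreted by perf record.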

C2C REPORT
----------
The perf c2c report command displays shared data analysis. It comes in two
display modes: stdio and tui (default).

The report command workflow is as follows:
  - sort all the data based on the cacheline address
  - store access details for each cacheline
  - sort all cachelines based on user settings
  - display data
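
The workflow above can be sketched roughly as follows. This is a simplified
illustration and not the actual implementation: the sample format, the access
labels ('rmt_hitm', 'lcl_hitm', 'load', 'store'), the 64-byte cacheline size
and the function name 'build_report' are all assumptions.

```python
from collections import defaultdict

CACHELINE_SIZE = 64  # assumption: 64-byte cachelines


def build_report(samples, sort_key="tot_hitm"):
    """Group (data_addr, kind) samples by cacheline and rank the cachelines."""
    lines = defaultdict(lambda: {"records": 0, "rmt_hitm": 0,
                                 "lcl_hitm": 0, "tot_hitm": 0})
    for addr, kind in samples:
        # Steps 1 and 2: key by cacheline address and store access details.
        stats = lines[addr & ~(CACHELINE_SIZE - 1)]
        stats["records"] += 1
        if kind in ("rmt_hitm", "lcl_hitm"):
            stats[kind] += 1
            stats["tot_hitm"] += 1
    # Step 3: sort the cachelines on the selected counter, highest first.
    return sorted(lines.items(), key=lambda item: item[1][sort_key],
                  reverse=True)


samples = [(0x1000, "rmt_hitm"), (0x1008, "lcl_hitm"), (0x1010, "store"),
           (0x1040, "load"), (0x1044, "rmt_hitm")]
# Step 4: display data.
for index, (cacheline, stats) in enumerate(build_report(samples)):
    print(index, hex(cacheline), stats)
```

Passing 'rmt_hitm' or 'lcl_hitm' as sort_key corresponds loosely to choosing
the HITM type to sort on with the -d option.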

In general, perf c2c report output consists of 2 basic views:
  1) most expensive cachelines list
  2) offsets details for each cacheline

For each cacheline in the 1) list the following data is displayed
(both stdio and TUI modes follow the same fields output):

  Index
  - zero based index to identify the cacheline

  Cacheline
  - cacheline address (hex number)

  Total records
  - sum of all cacheline accesses

  Rmt/Lcl Hitm
  - cacheline percentage of all Remote/Local HITM accesses

  LLC Load Hitm - Total, Lcl, Rmt
  - count of Total/Local/Remote load HITMs

  Store Reference - Total, L1Hit, L1Miss
    Total - all store accesses
    L1Hit - store accesses that hit L1
    L1Miss - store accesses that missed L1

  Load Dram
  - count of local and remote DRAM accesses

  LLC Ld Miss
  - count of all accesses that missed LLC

  Total Loads
  - sum of all load accesses

  Core Load Hit - FB, L1, L2
  - count of load hits in FB (Fill Buffer), L1 and L2 cache

  LLC Load Hit - Llc, Rmt
  - count of LLC and Remote load hits
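
The 'Rmt/Lcl Hitm' columns are percentages relative to all captured HITMs of
the given type, not relative to the cacheline's own accesses. A minimal
sketch of that calculation, assuming per-cacheline HITM counts are already
available in a dictionary (the data layout and the name 'hitm_share' are
hypothetical):

```python
def hitm_share(cachelines):
    """Map each cacheline address to its (remote %, local %) HITM share."""
    total_rmt = sum(c["rmt_hitm"] for c in cachelines.values())
    total_lcl = sum(c["lcl_hitm"] for c in cachelines.values())

    def pct(part, whole):
        # Guard against a run that captured no HITMs of this type.
        return 100.0 * part / whole if whole else 0.0

    return {addr: (pct(c["rmt_hitm"], total_rmt), pct(c["lcl_hitm"], total_lcl))
            for addr, c in cachelines.items()}


counts = {0x1000: {"rmt_hitm": 3, "lcl_hitm": 1},
          0x1040: {"rmt_hitm": 1, "lcl_hitm": 3}}
for addr, (rmt, lcl) in hitm_share(counts).items():
    print(hex(addr), f"rmt={rmt:.1f}%", f"lcl={lcl:.1f}%")
```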

For each offset in the 2) list the following data is displayed:

  HITM - Rmt, Lcl
  - % of Remote/Local HITM accesses for given offset within cacheline

  Store Refs - L1 Hit, L1 Miss
  - % of store accesses that hit/missed L1 for given offset within cacheline

  Data address - Offset
  - offset address

  Pid
  - pid of the process responsible for the accesses

  Tid
  - tid of the thread responsible for the accesses

  Code address
  - code address responsible for the accesses

  cycles - rmt hitm, lcl hitm, load
  - sum of cycles for given accesses - Remote/Local HITM and generic load

  cpu cnt
  - number of cpus that participated in the access

  Symbol
  - code symbol related to the 'Code address' value

  Shared Object
  - shared object name related to the 'Code address' value

  Source:Line
  - source information related to the 'Code address' value

  Node
  - nodes participating in the access (see NODE INFO section)

NODE INFO
---------
The 'Node' field displays the nodes that access the given cacheline
offset. Its output comes in 3 flavors:
  - node IDs separated by ','
  - node IDs with stats for each ID, in the following format:
      Node{cpus %hitms %stores}
  - node IDs with a list of affected CPUs, in the following format:
      Node{cpu list}

The flavors above can be switched with the -N option, or interactively
with the 'n' key in TUI mode.

COALESCE
--------
The user can specify how to sort offsets for a cacheline.

The following fields are available and govern the final
set of output fields for the cacheline offsets output:

  tid   - coalesced by process TIDs
  pid   - coalesced by process PIDs
  iaddr - coalesced by code address, following fields are displayed:
             Code address, Code symbol, Shared Object, Source line
  dso   - coalesced by shared object

By default the coalescing is set up with 'pid,iaddr'.
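
Coalescing can be pictured as merging per-offset entries that agree on the
selected fields. The following is a rough sketch of that idea, not the tool's
code: the entry layout, the counters and the name 'coalesce' are
hypothetical, and keeping the offset as part of every key is an assumption.

```python
from collections import defaultdict


def coalesce(entries, fields=("pid", "iaddr")):
    """Merge entries of one cacheline that share the offset and given fields."""
    groups = defaultdict(lambda: {"hitm": 0, "stores": 0})
    for entry in entries:
        # Assumption: the offset always takes part in the grouping key.
        key = (entry["offset"],) + tuple(entry[f] for f in fields)
        groups[key]["hitm"] += entry["hitm"]
        groups[key]["stores"] += entry["stores"]
    return dict(groups)


entries = [
    {"offset": 0x08, "pid": 100, "tid": 101, "iaddr": 0x400A, "hitm": 2, "stores": 0},
    {"offset": 0x08, "pid": 100, "tid": 102, "iaddr": 0x400A, "hitm": 3, "stores": 1},
    {"offset": 0x08, "pid": 100, "tid": 102, "iaddr": 0x400F, "hitm": 0, "stores": 4},
    {"offset": 0x10, "pid": 200, "tid": 200, "iaddr": 0x400A, "hitm": 1, "stores": 0},
]
for key, stats in coalesce(entries).items():
    print(key, stats)
```

With the default 'pid,iaddr' the first two entries collapse into one line
because they differ only in tid; coalescing on 'tid,iaddr' keeps them apart.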

STDIO OUTPUT
------------
The stdio output displays data on standard output.

The following tables are displayed:
  Trace Event Information
  - overall statistics of memory accesses

  Global Shared Cache Line Event Information
  - overall statistics on shared cachelines

  Shared Data Cache Line Table
  - list of most expensive cachelines

  Shared Cache Line Distribution Pareto
  - list of all accessed offsets for each cacheline

TUI OUTPUT
----------
The TUI output provides an interactive interface to navigate
through the cacheline list and to display offset details.

For details please refer to the help window by pressing the '?' key.

CREDITS
-------
Although Don Zickus, Dick Fowles and Joe Mario worked together
to get this implemented, we got lots of early help from Arnaldo
Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

C2C BLOG
--------
Check Joe's blog on the c2c tool for a detailed use case explanation:
  https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-mem[1]