Bringing up BPI-F3 - Part 2.5

#bpif3

This is a sort of intermission.

Getting perf to work up to a point

Apparently the OpenSBI-mediated access to the performance counters does not map the usual cycles and instructions events in a way that works with perf record. I got this board mainly to help with dav1d development efforts, so not having perf support would make it harder to reason about performance.

The best workaround, found after a discussion in the forums, is to build perf with pmu-events so that it includes the custom events, and then rely on the overly precise CPU-specific events instead:

$ perf list | grep cycle
  bus-cycles                                         [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  ref-cycles                                         [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  m_mode_cycle
       [M-mode cycles]
  rtu_flush_cycle
  s_mode_cycle
       [S-mode cycles]
  stalled_cycle_backend
       [Stalled cycles backend]
  stalled_cycle_frontend
       [Stalled cycles frontend]
  u_mode_cycle
       [U-mode cycles]
  vidu_total_cycle
  vidu_vec0_cycle
  vidu_vec1_cycle
...

$ perf list | grep inst
  branch-instructions OR branches                    [Hardware event]
  instructions                                       [Hardware event]
  br_inst
       [Branch instructions]
  cond_br_inst
       [Conditional branch instructions]
  indirect_br_inst
       [Indirect branch instructions]
  taken_cond_br_inst
       [Taken conditional branch instructions]
  uncond_br_inst
       [Unconditional branch instructions]
instruction:
  alu_inst
       [ALU (integer) instructions]
  amo_inst
       [AMO instructions]
  atomic_inst
       [Atomic instructions]
  bus_fence_inst
       [Bus FENCE instructions]
  csr_inst
       [CSR instructions]
  div_inst
       [Division instructions]
  ecall_inst
       [ECALL instructions]
  failed_sc_inst
       [Failed SC instructions]
  fence_inst
       [FENCE instructions]
  fp_div_inst
       [Floating-point division instructions]
  fp_inst
       [Floating-point instructions]
  fp_load_inst
       [Floating-point load instructions]
  fp_store_inst
       [Floating-point store instructions]
  load_inst
       [Load instructions]
  lr_inst
       [LR instructions]
  mult_inst
       [Multiplication instructions]
  sc_inst
       [SC instructions]
  store_inst
       [Store instructions]
  unaligned_load_inst
       [Unaligned load instructions]
  unaligned_store_inst
       [Unaligned store instructions]
  vector_div_inst
       [Vector division instructions]
  vector_inst
       [Vector instructions]
  vector_load_inst
       [Vector load instructions]
  vector_store_inst
       [Vector store instructions]
  id_inst_pipedown
       [ID instruction pipedowns]
  id_one_inst_pipedown
       [ID one instruction pipedowns]
  issued_inst
       [Issued instructions]
  rf_inst_pipedown
       [RF instruction pipedowns]
  rf_one_inst_pipedown
       [RF one instruction pipedowns]

Building perf

Perf's way of dealing with CPU-specific events is through some machinery called jevents.

It lives in tools/perf/pmu-events and you can trigger it manually with:

./jevents.py riscv arch pmu-events.c

This produces C code from a bunch of JSON files and a CSV map file.

The first time I built the sources I tried to cut the build down by setting most of the NO_* make variables, and I left NO_JEVENTS=1 in place, which skips the jevents step. Luckily I fixed it after noticing that my output differed from the one posted in the forum.

## I assume the custom Linux sources are here
cd /usr/src/pi-linux/tools/perf
## Being lazy, I disabled about everything instead of installing dependencies; one time I disabled too much.
make -j 8 V=1 VF=1 \
  HOSTCC=riscv64-unknown-linux-gnu-gcc HOSTLD=riscv64-unknown-linux-gnu-ld \
  CC=riscv64-unknown-linux-gnu-gcc CXX=riscv64-unknown-linux-gnu-g++ \
  AR=riscv64-unknown-linux-gnu-ar LD=riscv64-unknown-linux-gnu-ld \
  NM=riscv64-unknown-linux-gnu-nm PKG_CONFIG=riscv64-unknown-linux-gnu-pkg-config \
  prefix=/usr bindir_relative=bin tipdir=share/doc/perf-6.8 \
  'EXTRA_CFLAGS=-O2 -pipe' 'EXTRA_LDFLAGS=-Wl,-O1 -Wl,--as-needed' \
  ARCH=riscv BUILD_BPF_SKEL= BUILD_NONDISTRO=1 JDIR= CORESIGHT= GTK2= \
  feature-gtk2-infobar= NO_AUXTRACE= NO_BACKTRACE= NO_DEMANGLE= NO_JEVENTS=0 \
  NO_JVMTI=1 NO_LIBAUDIT=1 NO_LIBBABELTRACE=1 NO_LIBBIONIC=1 NO_LIBBPF=1 \
  NO_LIBCAP=1 NO_LIBCRYPTO= NO_LIBDW_DWARF_UNWIND= NO_LIBELF= NO_LIBNUMA=1 \
  NO_LIBPERL=1 NO_LIBPFM4=1 NO_LIBPYTHON=1 NO_LIBTRACEEVENT= NO_LIBUNWIND=1 \
  NO_LIBZSTD=1 NO_SDT=1 NO_SLANG=1 NO_LZMA=1 NO_ZLIB= TCMALLOC= WERROR=0 \
  LIBDIR=/usr/libexec/perf-core libdir=/usr/lib64 plugindir=/usr/lib64/perf/plugins \
  -f Makefile.perf install
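
A quick way to confirm that the jevents step really ran is to look for one of the CPU-specific events in the output of perf list. The snippet below is only a minimal sketch: has_event is a hypothetical helper name, and u_mode_cycle is simply one of the events from the listing earlier in this post.

```shell
#!/bin/sh
# has_event NAME: succeed if NAME appears as a whole word on stdin.
# Hypothetical helper, meant to be fed the output of `perf list`.
has_event() {
    # -q: quiet, -w: whole-word match, -F: treat NAME as a fixed string
    grep -qwF -- "$1"
}

# Usage on the board (shown as a comment, since it needs the freshly built perf):
#   perf list | has_event u_mode_cycle && echo "custom events are built in"
```

If the check fails, NO_JEVENTS=1 left over in the make invocation is the first thing to look for.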

Now I have a perf where cycles and instructions still do not work with perf record. I wonder if there is a way, at the OpenSBI or kernel level, to aggregate events to make it work properly, but I have never had to look into perf internals, so I will probably poke at it much later if nobody addresses it in the meantime. Anyway,

perf record --group -e u_mode_cycle,m_mode_cycle,s_mode_cycle

produces something close enough for cycles; in fact, u_mode_cycle alone is enough.

For instructions the situation is a bit more annoying:

perf record --group -e alu_inst,amo_inst,atomic_inst,fp_div_inst,fp_inst,fp_load_inst,fp_store_inst,load_inst,lr_inst,mult_inst,sc_inst,store_inst,unaligned_load_inst,unaligned_store_inst

comes close to counting all the scalar instructions, but trying to add vector_div_inst,vector_inst,vector_load_inst,vector_store_inst somehow makes perf record silently stop collecting samples. Adding just 3 more events works though, so I guess I can be happy with u_mode_cycle,alu_inst,atomic_inst,fp_inst,vector_inst at least.
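
Why the longer list makes perf record go quiet is not established here. One possible way around it, assuming the problem is the size of a single group, is to split the event list into smaller groups and record each group in its own run. The following is a rough sketch under that assumption: chunk_events and ./my-benchmark are hypothetical names, the group size of 8 is an arbitrary starting point to tune by experiment, and the loop only prints the commands instead of running them.

```shell
#!/bin/sh
# chunk_events MAX LIST: split a comma-separated event LIST into groups of
# at most MAX events, printing one comma-separated group per line.
# Hypothetical helper, not part of perf.
chunk_events() {
    printf '%s\n' "$2" | tr ',' '\n' | awk -v max="$1" '
        NF == 0 { next }                                   # skip empty entries
        { line = (n % max == 0) ? $0 : line "," $0; n++ }  # start or extend a group
        n % max == 0 { print line; line = "" }             # group is full
        END { if (line != "") print line }                 # last, partial group
    '
}

events=alu_inst,amo_inst,atomic_inst,fp_div_inst,fp_inst,fp_load_inst,fp_store_inst,load_inst,lr_inst,mult_inst,sc_inst,store_inst,unaligned_load_inst,unaligned_store_inst,vector_div_inst,vector_inst,vector_load_inst,vector_store_inst

# Dry run: print one perf record command per group.
# ./my-benchmark stands in for the real workload.
n=0
chunk_events 8 "$events" | while IFS= read -r group; do
    n=$((n + 1))
    echo "perf record --group -e $group -o perf.$n.data -- ./my-benchmark"
done
```

The obvious cost is that the workload has to run once per group, so the counts come from different runs and are only comparable if the workload is reasonably deterministic.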
