Monday, March 29, 2010

Reading notes: Linux kernel perf event source

For Linux kernel 2.6.33

perf event enumeration

The kernel has no interface for enumerating hardware perf events.

  • Tool command line: perf list
  • The available abstract hardware perf events are hard-coded in the perf user space tool. Raw hardware perf events are not enumerated by the tool; they are specified as a 64-bit number on the tool command line. Abstract hardware perf events are implemented on top of the raw hardware perf events. Whether a given abstract hardware perf event is available can be queried via the perf_event_open syscall.
  • Software trace points are exported in debugfs (/debugfs/tracing/events).
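
To make the "64-bit number on the command line" point concrete, here is a minimal sketch of turning a raw event spec into an event type plus config value. It assumes the rNNN hex syntax accepted by perf's -e option; the sketch_* names and the enum are hypothetical stand-ins, not the tool's real parser or the kernel's constants.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the event classes discussed in the notes. */
enum sketch_event_type {
    SKETCH_TYPE_HARDWARE, /* abstract event, hard-coded in the tool */
    SKETCH_TYPE_RAW       /* raw, CPU-specific event code */
};

struct sketch_event {
    enum sketch_event_type type;
    uint64_t config;      /* the 64-bit number handed to the kernel */
};

/*
 * Parse "r<hex digits>" into a raw event.
 * Returns 0 on success, -1 if the spec is not a well-formed raw event.
 */
int sketch_parse_raw(const char *spec, struct sketch_event *out)
{
    char *end;
    unsigned long long value;

    assert(out != NULL);
    if (spec == NULL || spec[0] != 'r')
        return -1;
    /* Require a hex digit right after 'r' so signs and spaces are rejected. */
    if (!isxdigit((unsigned char)spec[1]))
        return -1;

    errno = 0;
    value = strtoull(spec + 1, &end, 16);
    if (errno != 0 || *end != '\0')
        return -1; /* overflow or trailing garbage */

    out->type = SKETCH_TYPE_RAW;
    out->config = (uint64_t)value;
    return 0;
}
```

With this sketch, "r1a8" yields a raw event with config 0x1a8, while a name such as "cycles" is rejected and would have to go through the tool's hard-coded table of abstract events instead.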

perf event management

  • There is at most one perf event context for each task and for each CPU. The context is used to manage all perf events attached to that task or CPU.
  • All perf events in a context are linked in a list at context->event_list.
  • Perf events in one context are organized into groups. All group leaders in a context are linked in a list at context->group_list. All perf events in a group are linked in a list at group_leader->sibling_list.
  • It seems that the perf events in a group share some attributes, such as enable/disable, cpu, inherit, etc. In "perf record", for task events, one group is created for the perf events of each CPU. __perf_event_sched_in assumes that all perf events in one group are for the same CPU. But group support cannot currently be turned on in "perf record", so groups are not used at all for now.
  • Perf events inherited by child tasks are linked in a list at parent_perf_event->child_list.
  • In "perf record", for task profiling, one perf event per event type is created for each task and each CPU. For counting alone, one perf event per task would appear to be sufficient, because a task can only run on one CPU at any time. But perf events may be inherited by child perf events in child tasks (forked tasks), and the child perf events use the original perf events for sample output. To keep the sample ring buffer a per-CPU data structure, perf events are created for each CPU too. So when parent and child tasks run on different CPUs simultaneously, they all output through perf events of the parent task, but each uses the perf event for its own CPU.
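
The context, group and sibling relationship above can be sketched in a few lines of C. This is only an illustration of the shape: the kernel links these with its doubly linked struct list_head, while the sketch uses plain singly linked pointers, and all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One perf event; a group leader also anchors its siblings. */
struct sketch_event {
    const char *name;
    struct sketch_event *next_group;   /* next group leader in the context */
    struct sketch_event *next_sibling; /* next member of the same group */
};

/* One context per task or per CPU. */
struct sketch_context {
    struct sketch_event *group_list;   /* head of the group leader list */
};

/* Put a new group leader at the front of the context's group list. */
void sketch_add_leader(struct sketch_context *ctx, struct sketch_event *leader)
{
    assert(ctx != NULL && leader != NULL);
    leader->next_group = ctx->group_list;
    ctx->group_list = leader;
}

/* Put an event at the front of a leader's sibling list. */
void sketch_add_sibling(struct sketch_event *leader, struct sketch_event *ev)
{
    assert(leader != NULL && ev != NULL);
    ev->next_sibling = leader->next_sibling;
    leader->next_sibling = ev;
}

/* Visit every leader, then every sibling of that leader, and count them. */
size_t sketch_count_events(const struct sketch_context *ctx)
{
    size_t n = 0;
    const struct sketch_event *leader, *sib;

    for (leader = ctx->group_list; leader != NULL; leader = leader->next_group) {
        n++;
        for (sib = leader->next_sibling; sib != NULL; sib = sib->next_sibling)
            n++;
    }
    return n;
}
```

Walking leaders first and siblings second is the same two-level traversal that scheduling a whole group in or out would need.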

perf event state tracking

  • task schedule in/out (perf_event_task_sched_in/out): the counter is recorded and disabled before schedule out, and restored and enabled after schedule in. The schedule in/out event is recorded too.
  • task fork
    • perf_event_fork: records the fork event.
    • perf_event_init_task: perf_event->attr.inherit controls whether to inherit. When inheriting, a child perf event is created for the child task. Information from child events is accumulated into some statistics. Samples collected via child events are sent to the corresponding parent event for the same CPU.
  • task exit
    • perf_event_exit_task: feeds event values back to the parent events (child_total_time_running, etc). Frees all resources of the child events.
  • fd closed, perf_release: frees all resources. The perf event in the original task is managed via the corresponding fd. Child perf events in child tasks are freed in perf_event_exit_task.
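
The life cycle above (count only while scheduled in, inherit on fork, feed back on exit) can be modelled with a toy counter. This is a sketch under simplifying assumptions: one event, a fake hardware counter value passed in by the caller, and hypothetical sketch_* names and fields rather than the kernel's own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_event {
    uint64_t count;              /* accumulated while the task was running */
    uint64_t child_count;        /* fed back from exited child events */
    uint64_t start;              /* hardware counter snapshot at sched in */
    int active;                  /* 1 between sched in and sched out */
    struct sketch_event *parent; /* NULL for the original event */
};

/* Schedule in: remember where the hardware counter stood. */
void sketch_sched_in(struct sketch_event *ev, uint64_t hw_now)
{
    ev->start = hw_now;
    ev->active = 1;
}

/* Schedule out: add only the delta seen while this task was running. */
void sketch_sched_out(struct sketch_event *ev, uint64_t hw_now)
{
    if (!ev->active)
        return;
    ev->count += hw_now - ev->start;
    ev->active = 0;
}

/* Fork with inherit set: the child gets a fresh event pointing at the parent. */
struct sketch_event sketch_inherit(struct sketch_event *parent)
{
    struct sketch_event child = { 0, 0, 0, 0, NULL };

    assert(parent != NULL);
    child.parent = parent;
    return child;
}

/* Child exit: feed the child's totals back into the parent event. */
void sketch_exit_child(struct sketch_event *child)
{
    if (child->parent == NULL)
        return;
    child->parent->child_count += child->count + child->child_count;
    child->parent = NULL;
}

/* What a reader of the parent fd would see when inheritance is on. */
uint64_t sketch_read(const struct sketch_event *ev)
{
    return ev->count + ev->child_count;
}
```

For example, a parent that runs for 50 counter ticks and a child that runs for 30 before exiting leave the parent event reading 80.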

Dataflow

For perf event samples:

hardware counter --+-> perf event mmap data page -> user space tool (perf)
                   |
soft trace point --/

For ftrace and tracing points:

 software trace point -> tracing ring buffer -> user space 

User space interface

  • File based interface. A file descriptor is obtained via a new syscall, perf_event_open. One file descriptor is needed for each perf event. A perf event may be for a specific task, CPU, etc.
  • Memory map. Sample data is passed from kernel to user space mainly via memory map, which is used to implement a shared memory based ring buffer. Details are under ring buffer.
  • ring buffer
    • Works in both per-CPU and not-per-CPU mode.
    • Write side in kernel, read side in user space. This is shared memory communication between kernel and user space, which performs better than memory copy.
    • Lock-less write side, similar to Steven Rostedt's unified tracing buffer. perf_mmap_data.head is used to reserve space, while perf_mmap_data.user_page->data_head is used for committed data.
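
The reserve/commit split between head and data_head can be sketched as a single-threaded model. This is an assumption-laden illustration: there are no atomics or memory barriers here, the buffer is a fixed 64 bytes, data_tail is used as the reader's consumed position, and the sketch_* names are hypothetical rather than the kernel's functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BUF_SIZE 64u /* power of two, so masking wraps offsets */

struct sketch_ring {
    uint64_t head;      /* writer-private: space reserved so far */
    uint64_t data_head; /* published: data committed, safe to read */
    uint64_t data_tail; /* advanced by the reader: data already consumed */
    unsigned char data[SKETCH_BUF_SIZE];
};

/* Reserve len bytes. Returns 0 and the start offset, or -1 if it won't fit. */
int sketch_reserve(struct sketch_ring *rb, size_t len, uint64_t *offset)
{
    uint64_t used = rb->head - rb->data_tail;

    if (len > SKETCH_BUF_SIZE - used)
        return -1;
    *offset = rb->head;
    rb->head += len;
    return 0;
}

/* Copy a record into previously reserved space, wrapping at the end. */
void sketch_copy_in(struct sketch_ring *rb, uint64_t offset,
                    const void *src, size_t len)
{
    const unsigned char *p = src;

    for (size_t i = 0; i < len; i++)
        rb->data[(offset + i) & (SKETCH_BUF_SIZE - 1)] = p[i];
}

/* Commit: make everything reserved so far visible to the reader. */
void sketch_commit(struct sketch_ring *rb)
{
    rb->data_head = rb->head;
}

/* Reader: consume up to max committed bytes, then advance data_tail. */
size_t sketch_read(struct sketch_ring *rb, void *dst, size_t max)
{
    size_t avail = (size_t)(rb->data_head - rb->data_tail);
    size_t n = avail < max ? avail : max;
    unsigned char *p = dst;

    for (size_t i = 0; i < n; i++)
        p[i] = rb->data[(rb->data_tail + i) & (SKETCH_BUF_SIZE - 1)];
    rb->data_tail += n;
    return n;
}
```

The point of the two counters is that a reader never sees a half-written record: bytes between data_head and head are reserved but not yet published. A real reader in user space would also need a read barrier after loading data_head, which this model leaves out.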

Code generation for software tracing point

  • A set of macros is defined for generating software tracing point related code, such as the binary format description, the trace point functions, etc.
  • Some macros are un-defined and re-defined again and again, and some files are included again and again under different sets of macro definitions. This is tricky but really powerful. Source code is in include/trace/define_trace.h and include/trace/ftrace.h.
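
The un-define/re-define trick is easier to see in miniature. The sketch below is a single-file analogue, not the kernel's macros: instead of re-including a header, it expands one event list three times under three different definitions of the same macro. The event names and field counts are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The single description of all events (hypothetical names and field counts). */
#define SKETCH_EVENT_LIST       \
    SKETCH_EVENT(demo_open, 2)  \
    SKETCH_EVENT(demo_write, 3) \
    SKETCH_EVENT(demo_close, 1)

/* Pass 1: expand the list into enum constants. */
#define SKETCH_EVENT(name, nfields) SKETCH_EV_##name,
enum sketch_event_id { SKETCH_EVENT_LIST SKETCH_EV_MAX };
#undef SKETCH_EVENT

/* Pass 2: expand the same list into a table of names. */
#define SKETCH_EVENT(name, nfields) #name,
static const char *const sketch_names[] = { SKETCH_EVENT_LIST };
#undef SKETCH_EVENT

/* Pass 3: expand it again into a table of field counts. */
#define SKETCH_EVENT(name, nfields) nfields,
static const int sketch_nfields[] = { SKETCH_EVENT_LIST };
#undef SKETCH_EVENT

/* Name of an event, or NULL if the id is out of range. */
const char *sketch_event_name(int id)
{
    return (id >= 0 && id < SKETCH_EV_MAX) ? sketch_names[id] : NULL;
}

/* Number of fields of an event, or -1 if the id is out of range. */
int sketch_event_nfields(int id)
{
    return (id >= 0 && id < SKETCH_EV_MAX) ? sketch_nfields[id] : -1;
}

/* Look an event up by name; returns its id or -1 if unknown. */
int sketch_event_lookup(const char *name)
{
    assert(name != NULL);
    for (int i = 0; i < SKETCH_EV_MAX; i++)
        if (strcmp(sketch_names[i], name) == 0)
            return i;
    return -1;
}
```

Adding an event means touching only the list; the enum and both tables stay in sync automatically, which is the property that makes the repeated-expansion approach attractive despite being tricky to read.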

Tracing and perf event

  • The lock-less trace ring buffer is not used in perf event; another simple ring buffer implementation for the mmap data pages is used instead.
  • For software tracing points, the sample collecting code is shared between tracing and perf event.

Random thought

  • The ring buffer in perf event has some special features.
    • It works in both per-CPU and not-per-CPU mode.
    • Shared memory communication between kernel and user space, with the write side in kernel and the read side in user space.
  • In the code generation for software tracing points:
    • Macros are un-defined and re-defined again and again, and some files are included again and again under different sets of macro definitions.

1 comment:

BE said...

So, where do you find the raw hardware perf event codes? Are the only ones that perf will understand in arch/x86/kernel/cpu/perf_event_intel.c?