Saturday, May 16, 2009

Buffers in the Linux kernel

For 2.6.30

Oprofile Buffer

  • One cpu_buffer for each CPU
    • The NMI or IRQ handler adds records to the cpu_buffer
    • A timer-based per-CPU work queue removes records from the cpu_buffer
    • One reader and one writer, so mutual exclusion and synchronization are easy
    • Fixed size; may overflow if written too fast or not drained quickly enough
  • One event_buffer
    • The timer-based per-CPU work queues move records from the cpu_buffers into the event_buffer
    • A user-space program removes records from the event_buffer; it is woken up based on a threshold
    • One reader and multiple writers; a mutex is used on the write side, because the writers run in process context (work queue)
    • Fixed size; may overflow if written too fast or not read out quickly enough
  • Synchronization among the cpu_buffers, the event_buffer and the user-space program
    • Used for profiling, so the record flow bandwidth is predictable
      • Maximum bandwidth out of the cpu_buffers: CPU count * (cpu_buffer size / timer interval)
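The single-reader/single-writer discipline of a cpu_buffer can be sketched in user-space C11. This is a hypothetical illustration, not oprofile's actual code: atomic loads/stores with acquire/release ordering stand in for the kernel's barriers, the writer (the NMI/IRQ side) drops records on overflow, and the reader (the work-queue side) drains them later. `struct cpu_buffer`, `cb_add`, and `cb_remove` are names invented for this sketch.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical sketch of a single-producer/single-consumer fixed-size
 * buffer in the style of oprofile's cpu_buffer: one writer (NMI/IRQ
 * handler), one reader (per-CPU work queue), records dropped on overflow. */
#define BUF_SIZE 8 /* must be a power of two */

struct cpu_buffer {
    unsigned long samples[BUF_SIZE];
    _Atomic size_t head; /* next slot to write, touched only by the writer */
    _Atomic size_t tail; /* next slot to read, touched only by the reader  */
    unsigned long lost;  /* records dropped on overflow */
};

/* Writer side: no lock needed, since only one producer advances head. */
static int cb_add(struct cpu_buffer *b, unsigned long sample)
{
    size_t head = atomic_load_explicit(&b->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&b->tail, memory_order_acquire);

    if (head - tail >= BUF_SIZE) { /* full: drop the record */
        b->lost++;
        return -1;
    }
    b->samples[head & (BUF_SIZE - 1)] = sample;
    /* release: the sample must be visible before the new head */
    atomic_store_explicit(&b->head, head + 1, memory_order_release);
    return 0;
}

/* Reader side: no lock needed, since only one consumer advances tail. */
static int cb_remove(struct cpu_buffer *b, unsigned long *out)
{
    size_t tail = atomic_load_explicit(&b->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&b->head, memory_order_acquire);

    if (tail == head)
        return -1; /* empty */
    *out = b->samples[tail & (BUF_SIZE - 1)];
    atomic_store_explicit(&b->tail, tail + 1, memory_order_release);
    return 0;
}
```

Because each index has exactly one writer, no cmpxchg loop is needed; this is exactly why "one reader and one writer" makes the synchronization easy.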

Unified trace ring buffer

  • One ring_buffer_per_cpu for each CPU
    • In fact there are two levels of records: the first level is struct buffer_page, the second is struct ring_buffer_event.
    • Preempt disabling provides mutual exclusion between process contexts.
    • Inside one struct buffer_page, an atomic operation (local_add_return) provides mutual exclusion for struct ring_buffer_event allocation. This works because process context can only be preempted by IRQ and NMI, and the preempted context cannot continue until the IRQ/NMI handler finishes. Since the length of struct buffer_page is known, an over-length allocation can be detected (local_add_return may return an offset exceeding the buffer_page length).
    • On the writer side, "write" indicates allocated records, while "commit" indicates completed records.
    • A spinlock (ring_buffer_per_cpu->lock) together with IRQ disabling provides mutual exclusion over struct buffer_page (reader vs. writer). In NMI context, if the spin_trylock fails the record is discarded; this is acceptable for tracing.
    • At least 3 pages are needed for each CPU: 2 buffer pages and 1 reader page.
    • ring_buffer_per_cpu->reader_lock provides mutual exclusion among multiple readers, while ring_buffer_per_cpu->lock provides mutual exclusion between reader and writer.
    • Fixed size; may overflow if written too fast or not consumed quickly enough
    • An iterator (iter) can be used on the reader side
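The "write" vs. "commit" split inside one page can be sketched as follows. This is a simplified user-space illustration of the idea, not the kernel's implementation: an atomic add-and-return (standing in for local_add_return) allocates space, the caller detects when the returned end offset runs past the known page length, and "commit" only advances once the record is completely written. `struct buffer_page` here mirrors the kernel name, but `bp_reserve` and `bp_write` are invented for this sketch, and nested-writer commit handling is omitted.

```c
#include <stdatomic.h>
#include <string.h>

/* Hypothetical sketch of reserve/commit inside one buffer_page. */
#define PAGE_LEN 64

struct buffer_page {
    char data[PAGE_LEN];
    _Atomic unsigned int write; /* end of allocated (reserved) space */
    unsigned int commit;        /* end of completed records          */
};

/* Reserve len bytes; returns the start offset, or -1 if the page is
 * full. A single atomic add-and-return is safe against interruption
 * by IRQ/NMI writers on the same CPU. */
static int bp_reserve(struct buffer_page *bp, unsigned int len)
{
    unsigned int end = atomic_fetch_add_explicit(&bp->write, len,
                                                 memory_order_relaxed) + len;
    if (end > PAGE_LEN) {
        /* Allocation ran past the page length: detected exactly because
         * PAGE_LEN is known. Give the space back (simplified; the kernel
         * instead moves on to the next page). */
        atomic_fetch_sub_explicit(&bp->write, len, memory_order_relaxed);
        return -1;
    }
    return (int)(end - len);
}

static int bp_write(struct buffer_page *bp, const char *rec, unsigned int len)
{
    int start = bp_reserve(bp, len);
    if (start < 0)
        return -1;
    memcpy(bp->data + start, rec, len);
    /* Readers never look past commit, so a half-written record between
     * write and commit stays invisible. Simplified: assumes in-order
     * completion with no nested writers. */
    bp->commit = (unsigned int)start + len;
    return start;
}
```

The gap between `write` and `commit` is precisely the set of allocated-but-incomplete records described above.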

Mcelog buffer

  • One global buffer with fixed record length and fixed size. May overflow.
  • The writer side is lock-less, implemented with cmpxchg() plus a per-record finished flag. Records are added in NMI/timer context.
  • Memory ordering must be considered explicitly because of the lock-less design.
  • The reader side is protected by a mutex, because normally there is only one reader. Records are removed in process context.
  • With multiple writers and one reader, the throughput bottleneck lies on the reader side.
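The cmpxchg-plus-finished-flag scheme can be sketched in user-space C11. This is a hedged illustration of the technique, not the kernel's mcelog code: writers claim a slot with a compare-and-swap on the next-index counter, fill the record, then set a finished flag with release ordering so the reader never sees a half-written record. `struct mce_log`, `mce_log_add`, and `mce_log_read` are names invented for this sketch.

```c
#include <stdatomic.h>

/* Hypothetical sketch of a lock-less multi-writer log in the style of
 * mcelog: fixed record length, fixed size, records lost on overflow. */
#define MCE_LOG_LEN 16

struct mce_entry {
    unsigned long status;
    _Atomic int finished; /* set only after the record is fully written */
};

struct mce_log {
    _Atomic unsigned int next;
    struct mce_entry entry[MCE_LOG_LEN];
};

/* Writer: NMI-safe and lock-less. Returns 0, or -1 on overflow. */
static int mce_log_add(struct mce_log *log, unsigned long status)
{
    unsigned int idx;

    for (;;) {
        idx = atomic_load_explicit(&log->next, memory_order_relaxed);
        if (idx >= MCE_LOG_LEN)
            return -1; /* buffer full: the record is lost */
        /* cmpxchg claims slot idx; if another writer raced in, retry */
        if (atomic_compare_exchange_weak_explicit(&log->next, &idx, idx + 1,
                memory_order_relaxed, memory_order_relaxed))
            break;
    }
    log->entry[idx].status = status;
    /* release: status must be globally visible before finished is set */
    atomic_store_explicit(&log->entry[idx].finished, 1, memory_order_release);
    return 0;
}

/* Reader (process context; serialized by a mutex in the kernel, so no
 * reader-side atomics on the index are shown here). */
static int mce_log_read(struct mce_log *log, unsigned int idx,
                        unsigned long *status)
{
    if (!atomic_load_explicit(&log->entry[idx].finished,
                              memory_order_acquire))
        return -1; /* slot claimed but the writer has not finished yet */
    *status = log->entry[idx].status;
    return 0;
}
```

The finished flag is what makes the explicit memory ordering necessary: without the release/acquire pair, the reader could observe the flag before the record contents.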
