A CPU sampling profiler in less than 200 lines


While working on a new version of wtrace, I am analyzing the PerfView source code to learn how its various features work internally. One of such features is the call stack resolution for ETW events. This post will show you how to use the TraceEvent library to decode call stacks, and, as an exercise, we will write a sampling process profiler. Before we start, remember to set DisablePagingExecutive to 1. That is a requirement to make call stacks work for ETW sessions.

❗ ❗ Visit wtrace.net to receive updates on wtrace and my other troubleshooting tools.

Collecting profiling data

First, we need to collect trace data that we will later analyze. Our profiler will rely on SampledProfile events. Enabling them makes the Kernel emit a profiling event every 1ms for every CPU in the system. The SampledProfile event has only two fields: InstructionPointer and ThreadId. InstructionPointer is an address in the memory, so to find a method name that matches it, we need information about the images loaded by a process (the ImageLoad event). Our code to start the kernel session could look as follows:

use session = new TraceEventSession(sessionName, etlFilePath)

let traceFlags = NtKeywords.Profile ||| NtKeywords.ImageLoad
let stackFlags = NtKeywords.Profile
session.EnableKernelProvider(traceFlags, stackFlags) |> ignore

That is enough to resolve call stacks from the native images (NGEN/R2R), where CPU executes the image’s compiled code. However, resolving managed call stacks requires some more data. For managed code, the instruction pointer points to a block of JIT-compiled code that resides in a process private memory and is valid only for the process lifetime. The Kernel does not know about the CLR internals, so we need to query the CLR ETW provider. Fortunately, the ClrTraceEventParser already has a flag which contains all the necessary keywords:

public sealed class ClrTraceEventParser : TraceEventParser
{
    ...
    /// <summary>
    ///  Keywords are passed to TraceEventSession.EnableProvider to enable particular sets of
    /// </summary>
    [Flags]
    public enum Keywords : long
    {
        ...
        /// <summary>
        /// What is needed to get symbols for JIT compiled code.  
        /// </summary>
        JITSymbols = Jit | StopEnumeration | JittedMethodILToNativeMap | SupressNGen | Loader,
        ...
    };
}

We are ready to enable the CLR provider in our ETW session:

let keywords = ClrKeywords.JITSymbols
session.EnableProvider(ClrTraceEventParser.ProviderGuid, TraceEventLevel.Always, uint64(keywords)) |> ignore

Our session is now open and saves ETW events to the .etl file we specified. It’s time to perform the action we want to profile, for example, start a new process. When done, we should call Stop or Dispose to stop the ETW session and close the .etl file. For .NET processes that didn’t start or finish when the session was running, we also need to collect CLR rundown events. Without this data, we will lack JIT mappings, and we won’t be able to resolve some of the managed stacks. Here is the code for creating a rundown session:

let collectClrRundownEvents () = 
    let rec waitForRundownEvents filePath =
        // Poll until 2 seconds goes by without growth.
        let len = FileInfo(filePath).Length
        Thread.Sleep(2000)
        if FileInfo(filePath).Length = len then
            waitForRundownEvents filePath
        else ()

    printfn "Forcing rundown of JIT methods."
    let rundownFileName = Path.ChangeExtension(etlFilePath, ".clrRundown.etl")
    use rundownSession = new TraceEventSession(sessionName + "-rundown", rundownFileName)
    rundownSession.EnableProvider(ClrRundownTraceEventParser.ProviderGuid, 
        TraceEventLevel.Verbose, uint64(ClrRundownTraceEventParser.Keywords.Default)) |> ignore
    waitForRundownEvents rundownFileName
    printfn "Done with rundown."

The final steps are to merge the .etl files, add necessary module information, and prepare symbol files for the pre-compiled assemblies. Fortunately, the TraceEvent library has one method that performs all the operations mentioned above: ZippedETLWriter.WriteArchive. Internally, this method uses KernelTraceControl.dll!CreateMergedTraceFile (from WPT) to merge the .etl files and generate ‘synthetic’ events for all the modules which were loaded by processes during the trace session. If you’d like to see the PInvoke signature of this method, decrypt the OSExtensions.cs file. For all the pre-compiled assemblies (ETWTraceEventSource.GetModulesNeedingSymbols) TraceEvent generates the .pdb files (SymbolReader.GenerateNGenSymbolsForModule) and packs them into the final .zip file.

Our final recording code looks as follows:

do
    use session = new TraceEventSession(sessionName, etlFilePath)

    let traceFlags = NtKeywords.Profile ||| NtKeywords.ImageLoad // NtKeywords.Process and NtKeywords.Thread are added automatically
    let stackFlags = NtKeywords.Profile
    session.EnableKernelProvider(traceFlags, stackFlags) |> ignore

    // we need CLR to resolve managed stacks
    let keywords = ClrKeywords.JITSymbols
    session.EnableProvider(ClrTraceEventParser.ProviderGuid, TraceEventLevel.Always, uint64(keywords)) |> ignore

    printfn "Press enter to stop the tracing session"
    Console.ReadLine() |> ignore

    session.Stop() |> ignore

    collectClrRundownEvents ()

printfn "Collecting the data required for stack resolution..."
let writer = ZippedETLWriter(etlFilePath, log)
if not (writer.WriteArchive(Compression.CompressionLevel.NoCompression)) then
    printfn "Error occured while merging the data"
else
    printfn "Trace session completed"

It is very similar to the 40_SimpleTraceLog sample from the Perfview repository. So if you are looking for C# code, check it out.

Analyzing call stacks with TraceLog

Once we finished and post-processed the trace session, it’s time to analyze the collected events. Usually, when we want to process ETW events, we create the TraceEventSession instance and assign callbacks for interesting events. Using callbacks to the TraceEventSession source is the most efficient way of processing ETW events. However, this approach has some drawbacks. TraceEventParser instances are usually stateless, so if you need to work on aggregated data (processes, threads, modules), you need to implement the aggregation logic. For example, to list process lifetimes saved in the trace, we need to process the ProcessStartGroup and ProcessEndGroup events. Resolving call stacks is even more demanding and involves examining different groups of events. Fortunately, the TraceEvent library again gives us a hand and provides the TraceLog class. As stated in the documentation, TraceLog represents a higher-level event processing. It keeps a backlog of various session objects, including processes, threads, and call stacks. Our task is to instantiate the SymbolReader class, download or unpack the required symbol files, and process the event call stacks. The code below does all that:

type CallStackNode (callsCount : int32) =
    inherit Dictionary<string, CallStackNode>(StringComparer.OrdinalIgnoreCase)
    member val CallsCount = callsCount with get, set

let loadSymbols (traceLog : TraceLog) (proc : TraceProcess) =
    use symbolReader = new SymbolReader(log)

    proc.LoadedModules |> Seq.where (fun m -> not (isNull m.ModuleFile))
                       |> Seq.iter (fun m -> traceLog.CodeAddresses.LookupSymbolsForModule(symbolReader, m.ModuleFile))

let buildCallStacksTree (traceLog : TraceLog) pid tid =
    let perfInfoTaskGuid = Guid(int32 0xce1dbfb4, int16 0x137e, int16 0x4da6, byte 0x87, byte 0xb0, byte 0x3f, byte 0x59, byte 0xaa, byte 0x10, byte 0x2c, byte 0xbc)
    let perfInfoOpcode = 46

    let callStacks = CallStackNode(0)
    let processCallStack (callStack : TraceCallStack) =
        let addOrUpdateChildNode (node : CallStackNode) (callStack : TraceCallStack) =
            let decodedAddress = sprintf "%s!%s" callStack.CodeAddress.ModuleName callStack.CodeAddress.FullMethodName
            match node.TryGetValue(decodedAddress) with
            | (true, childNode) ->
                childNode.CallsCount <- childNode.CallsCount + 1
                childNode
            | (false, _) ->
                let childNode = CallStackNode(1)
                node.Add(decodedAddress, childNode)
                childNode

        let rec processStackFrame (callStackNode : CallStackNode) (callStack : TraceCallStack) =
            let caller = callStack.Caller
            if isNull caller then // root node
                callStackNode.CallsCount <- callStackNode.CallsCount + 1
            
            let childNode = 
                if isNull caller then addOrUpdateChildNode callStackNode callStack
                else processStackFrame callStackNode caller

            addOrUpdateChildNode childNode callStack
        processStackFrame callStacks callStack |> ignore
    
    traceLog.Events
    |> Seq.filter (fun ev -> ev.ProcessID = pid && ev.ThreadID = tid && ev.TaskGuid = perfInfoTaskGuid && (int32 ev.Opcode = perfInfoOpcode))
    |> Seq.iter (fun ev -> processCallStack (ev.CallStack()))

    callStacks

let tryFindProcess (traceLog : TraceLog) processNameOrPid =
    let (|Pid|ProcessName|) (s : string) =
        match Int32.TryParse(s) with
        | (true, pid) -> Pid pid
        | (false, _) -> ProcessName s

    let filter =
        match processNameOrPid with
        | Pid pid -> fun (p : TraceProcess) -> p.ProcessID = pid
        | ProcessName name -> fun (p : TraceProcess) -> String.Equals(p.Name, name, StringComparison.OrdinalIgnoreCase)

    let processes = traceLog.Processes |> Seq.where filter |> Seq.toArray
    if processes.Length = 0 then
        None
    elif processes.Length = 1 then
        Some processes.[0]
    else
        processes |> Seq.iteri (fun i p -> printfn "[%d] %s (%d): '%s'" i p.Name p.ProcessID p.CommandLine)
        printf "Which process to analyze []: "
        match Int32.TryParse(Console.ReadLine()) with
        | (false, _) -> None
        | (true, n) -> processes |> Seq.tryItem n

let analyze processNameOrPid maxDepth =

    let reader = ZippedETLReader(etlFilePath + ".zip", log)
    reader.UnpackArchive()
    
    let options = TraceLogOptions(ConversionLog = log)
    let traceLog = TraceLog.OpenOrConvert(etlFilePath, options)

    match tryFindProcess traceLog processNameOrPid with
    | None -> printfn "No matching process found in the trace"
    | Some proc ->
        printfn "%s [%s] (%d)" proc.Name proc.CommandLine proc.ProcessID

        let sw = Stopwatch.StartNew()
        printfn "[%6d ms] Loading symbols for modules in process %s (%d)" sw.ElapsedMilliseconds proc.Name proc.ProcessID
        loadSymbols traceLog proc

        // usually, system process has PID 4, but TraceEvent attaches the drivers to the Idle process (0)
        let systemProc = traceLog.Processes |> Seq.where (fun p -> p.ProcessID = 0) |> Seq.exactlyOne

        printfn "[%6d ms] Loading symbols for system modules" sw.ElapsedMilliseconds
        loadSymbols traceLog systemProc

        printfn "[%6d ms] Starting call stack analysis" sw.ElapsedMilliseconds
        let printThreadCallStacks (thread : TraceThread) =
            let callStacks = buildCallStacksTree traceLog proc.ProcessID thread.ThreadID

            let rec getCallStack depth name (callStack : CallStackNode) =
                let folder frames (kv : KeyValuePair<string, CallStackNode>) =
                    if kv.Value.Count = 0 || (maxDepth > 0 && depth >= maxDepth) then frames
                    else getCallStack (depth + 1) kv.Key kv.Value |> List.append frames

                callStack
                |> Seq.fold folder [ (sprintf "%s├─ %s [%d]" ("│ " |> String.replicate depth) name callStack.CallsCount) ]

            let stacks = getCallStack 0 (sprintf "Thread (%d) '%s'" thread.ThreadID thread.ThreadInfo) callStacks
            stacks |> List.iter (fun s -> printfn "%s" s)
        
        proc.Threads |> Seq.iter printThreadCallStacks
        printfn "[%6d ms] Completed call stack analysis" sw.ElapsedMilliseconds

An example output might look as follows (I removed some lines for brevity):

PS netcoreapp3.1> .\etwprofiler.exe analyze testproc 10
testproc ["C:\Users\me\testproc.exe" ] (15416)
[     2 ms] Loading symbols for modules in process testproc (15416)
[   428 ms] Loading symbols for system modules
[ 21400 ms] Starting call stack analysis
├─ Thread (2412) 'Startup Thread' [121]
│ ├─ ntdll!LdrInitializeThunk [11]
│ │ ├─ ntdll!LdrInitializeThunk [11]
│ │ │ ├─ ntdll!LdrpInitialize [11]
│ │ │ │ ├─ ntdll!??_LdrpInitialize [11]
│ │ │ │ │ ├─ ntdll!LdrpInitializeProcess [11]
│ │ │ │ │ │ ├─ ntdll!LdrpInitializeTls [1]
│ │ │ │ │ │ ├─ ntdll!LdrLoadDll [2]
│ │ │ │ │ │ │ ├─ ntdll!LdrpLoadDll [2]
...
├─ Thread (24168) '' [1]
│ ├─ ntdll!RtlUserThreadStart [1]
│ │ ├─ ntdll!RtlUserThreadStart [1]
│ │ │ ├─ kernel32!BaseThreadInitThunk [1]
│ │ │ │ ├─ ntdll!TppWorkerThread [1]
│ │ │ │ │ ├─ ntdll!TppAlpcpExecuteCallback [1]
│ │ │ │ │ │ ├─ rpcrt4!LrpcIoComplete [1]
│ │ │ │ │ │ │ ├─ rpcrt4!LRPC_ADDRESS::ProcessIO [1]
├─ Thread (9808) '' [0]
├─ Thread (22856) '.NET ThreadPool Worker' [1]
│ ├─ ntdll!RtlUserThreadStart [1]
│ │ ├─ ntdll!RtlUserThreadStart [1]
│ │ │ ├─ kernel32!BaseThreadInitThunk [1]
│ │ │ │ ├─ coreclr!Thread::intermediateThreadProc [1]
│ │ │ │ │ ├─ coreclr!ThreadpoolMgr::WorkerThreadStart [1]
│ │ │ │ │ │ ├─ coreclr!UnManagedPerAppDomainTPCount::DispatchWorkItem [1]
│ │ │ │ │ │ │ ├─ coreclr!ThreadpoolMgr::AsyncTimerCallbackCompletion [1]
│ │ │ │ │ │ │ │ ├─ coreclr!AppDomainTimerCallback [1]
│ │ │ │ │ │ │ │ │ ├─ coreclr!ManagedThreadBase_DispatchOuter [1]
│ │ │ │ │ │ │ │ │ │ ├─ coreclr!ManagedThreadBase_DispatchMiddle [1]
├─ Thread (12352) '.NET ThreadPool Gate' [0]
[ 21835 ms] Completed call stack analysis

The full profiler code is available in my blog samples repository. For my Dotnetos session about tracing, I also prepared the C# version so you may check it too. And, of course, I invite you to watch the whole session! 🙂

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.