This repository was archived by the owner on Nov 26, 2021. It is now read-only.

Add a generic DF join API #250

@derkling


In many analyses we are interested in joining information coming from different DFs.
For example, let's say we have a trace like this:

            adbd-5709  [007]  2943.184105: sched_contrib_scale_f: cpu=7 cpu_scale_factor=1
            adbd-5709  [007]  2943.184105: sched_load_avg_cpu:   cpu=7 util_avg=825
     ->transport-5713  [006]  2943.184106: sched_load_avg_cpu:   cpu=6 util_avg=292
     ->transport-5713  [006]  2943.184107: sched_contrib_scale_f: cpu=6 cpu_scale_factor=2
            adbd-5709  [007]  2943.184108: sched_load_avg_cpu:   cpu=7 util_avg=850
            adbd-5709  [007]  2943.184109: sched_contrib_scale_f: cpu=7 cpu_scale_factor=3
            adbd-5709  [007]  2943.184110: sched_load_avg_cpu:   cpu=6 util_avg=315

Currently we can easily build two DFs, one for sched_load_avg_cpu and another for sched_contrib_scale_f.

However, in some analyses it could be useful to correlate the information from these two events, thus getting a single DF which gives a consistent view of the most recent information from both.

In these cases we have a "master_df" (the primary_df of the API proposed below), e.g. sched_load_avg_cpu, into which we want to propagate the information from a "secondary_df", e.g. sched_contrib_scale_f.

This would require us to:

  1. Join the master_df with the secondary_df

  2. Fix any index collisions that may occur; for example, in the small trace above we can see that at the exact time 2943.184105 there is one event from each of master_df and secondary_df, both on CPU7.

A join of these two DFs should guarantee that:

  • the order of the events is consistent with the trace ordering, i.e. sched_contrib_scale_f should come before sched_load_avg_cpu on CPU7 but after it on CPU6
  • the time difference between the two events is barely noticeable; we should probably fix the overlapping timestamps by adding 1ns to each duplicated index, thus removing the index collision without risking a new collision with a following event.
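The join and collision fix described above could be sketched as follows. This is only an illustration in plain pandas, not an existing trappy API: the helper name `join_keep_trace_order` and the `line` column (assumed to hold each event's position in the original trace) are hypothetical. Timestamps are relative seconds, as in the example DFs.

```python
import pandas as pd

# Hypothetical inputs: "line" is assumed to record the position of each
# event in the original trace, so that ties can be ordered correctly.
load = pd.DataFrame(
    {"cpu": [7, 6, 7, 6], "util_avg": [825, 292, 850, 315], "line": [1, 2, 4, 6]},
    index=pd.Index([0.0, 1e-6, 3e-6, 5e-6], name="Time"),
)
scale = pd.DataFrame(
    {"cpu": [7, 6, 7], "cpu_scale_factor": [1, 2, 3], "line": [0, 3, 5]},
    index=pd.Index([0.0, 2e-6, 4e-6], name="Time"),
)


def join_keep_trace_order(primary, secondary, order_col="line", step=1e-9):
    """Stack both DFs, keep trace order, and de-duplicate timestamps."""
    time_col = primary.index.name
    joined = pd.concat([primary, secondary], sort=False).reset_index()
    # Order by timestamp first, then by position in the trace for ties.
    joined = joined.sort_values([time_col, order_col])
    # Rank of each row among the rows sharing its timestamp: 0, 1, 2, ...
    dup_rank = joined.groupby(time_col).cumcount()
    # Shift the 2nd, 3rd, ... duplicate by 1 ns, 2 ns, ...
    joined[time_col] = joined[time_col] + dup_rank * step
    return joined.set_index(time_col).drop(columns=order_col)


joined = join_keep_trace_order(load, scale)
print(joined)
```

The 1ns step is safe only as long as the trace timestamps have a coarser resolution (microseconds in the trace above) and fewer than a thousand events share the same timestamp.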

Then we need to:

  1. forward-propagate each secondary_df column by considering the value of a "pivot" column which is shared between the two DFs; for example, the cpu value can be used to forward-propagate the other sched_contrib_scale_f columns (i.e. freq_scale_factor and cpu_scale_factor) into the sched_load_avg_cpu rows

  2. remove all the secondary_df rows whose values have already been propagated into the following primary_df rows
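A minimal sketch of these two steps, assuming the two DFs have already been joined into one DF with unique timestamps. The helper `propagate_and_prune` is hypothetical, and it assumes that secondary_df rows can be recognised by having NaN in all the primary-only columns, which may not hold for every event.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Joined DF for the example trace (first collision already shifted by 1 ns).
joined = pd.DataFrame(
    {
        "cpu": [7, 7, 6, 6, 7, 7, 6],
        "util_avg": [nan, 825, 292, nan, 850, nan, 315],
        "cpu_scale_factor": [1, nan, nan, 2, nan, 3, nan],
    },
    index=pd.Index([0.0, 1e-9, 1e-6, 2e-6, 3e-6, 4e-6, 5e-6], name="Time"),
)


def propagate_and_prune(joined, pivot, primary_cols, secondary_cols):
    """Forward-fill secondary columns per pivot value, drop secondary rows."""
    out = joined.copy()
    # Step 1: within each pivot group (e.g. each CPU), carry the last seen
    # secondary values forward into the following rows.
    out[secondary_cols] = out.groupby(pivot)[secondary_cols].ffill()
    # Step 2: rows with no primary data at all are assumed to come from
    # secondary_df; their values have been propagated, so drop them.
    return out.dropna(subset=primary_cols, how="all")


merged = propagate_and_prune(joined, "cpu", ["util_avg"], ["cpu_scale_factor"])
print(merged)
```

Grouping by the pivot before filling is what keeps a CPU7 scale factor from leaking into a CPU6 row.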

All these operations together should be supported by a new generic convenience API which, once called with something like:

trappy.ftrace.utils.merge_df(primary_df = 'sched_load_avg_cpu',
                             secondary_df='sched_contrib_scale_f',
                             pivot='cpu')

where primary_df is:

          cpu  util_avg
Time                   
0.000000    7       825
0.000001    6       292
0.000003    7       850
0.000005    6       315

and secondary_df is:

          cpu  cpu_scale_factor
Time                           
0.000000    7                 1
0.000002    6                 2
0.000004    7                 3

should return a single DF:

          cpu  util_avg  cpu_scale_factor
Time                                     
0.000000  7.0     825.0               1.0
0.000001  6.0     292.0               NaN  <- no valid secondary_df entry seen yet for this CPU
0.000003  7.0     850.0               1.0  <- propagation of the previous value
0.000005  6.0     315.0               2.0

Here is a notebook to play with the same example:
https://gist.github.com/derkling/786e911ae01ca170377e1893d6696384
It shows that the current join API needs to be extended to get the exact result described above.
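For comparison, stock pandas can get close to the expected result with `pandas.merge_asof`, which performs a backward "as-of" join per key. This is a sketch of an alternative technique, not the proposed `merge_df` API and not what the notebook does. Note that it keeps `cpu` and `util_avg` as integers, and that for events sharing the exact same timestamp it cannot know their order in the trace.

```python
import pandas as pd

primary_df = pd.DataFrame(
    {"cpu": [7, 6, 7, 6], "util_avg": [825, 292, 850, 315]},
    index=pd.Index([0.0, 1e-6, 3e-6, 5e-6], name="Time"),
)
secondary_df = pd.DataFrame(
    {"cpu": [7, 6, 7], "cpu_scale_factor": [1, 2, 3]},
    index=pd.Index([0.0, 2e-6, 4e-6], name="Time"),
)

# For each primary row, pick the most recent secondary row with the same
# "cpu" whose Time is <= the primary Time (exact matches are allowed by
# default). Both inputs must be sorted by Time.
merged = pd.merge_asof(
    primary_df.reset_index(),
    secondary_df.reset_index(),
    on="Time",
    by="cpu",
    direction="backward",
).set_index("Time")
print(merged)
```

Because exact timestamp matches are always taken, a secondary event that appears after the primary one in the trace but carries the same timestamp (the CPU6 ordering mentioned earlier) would be picked up too early. That case still needs the explicit collision handling described in this issue.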
