Add a generic DF join API

In many analyses we are interested in joining information coming from different DFs.
For example, let's say we have a trace like this:

```
            adbd-5709  [007]  2943.184105: sched_contrib_scale_f: cpu=7 cpu_scale_factor=1
            adbd-5709  [007]  2943.184105: sched_load_avg_cpu:   cpu=7 util_avg=825
     ->transport-5713  [006]  2943.184106: sched_load_avg_cpu:   cpu=6 util_avg=292
     ->transport-5713  [006]  2943.184107: sched_contrib_scale_f: cpu=6 cpu_scale_factor=2
            adbd-5709  [007]  2943.184108: sched_load_avg_cpu:   cpu=7 util_avg=850
            adbd-5709  [007]  2943.184109: sched_contrib_scale_f: cpu=7 cpu_scale_factor=3
            adbd-5709  [007]  2943.184110: sched_load_avg_cpu:   cpu=6 util_avg=315
```

Currently we can easily build two DFs, one for sched_load_avg_cpu and another for sched_contrib_scale_f.

However, in some analyses it could be useful to correlate the information from these two events, thus getting a single DF that gives a consistent view of the most recent information from both.

In these cases we have a **"master_df"** (called primary_df below), e.g. sched_load_avg_cpu, into which we want to propagate the information from a **"secondary_df"**, e.g. sched_contrib_scale_f.

This would require us to:

1. Join the master_df with the secondary_df

2. Fix any index collisions that may occur. For example, in the small trace above we can see that at the exact time 2943.184105 we have one event for both master_df and secondary_df (both on CPU7).

A join of these two DFs should guarantee that:
   - the order of the events is consistent with the trace ordering, i.e. sched_contrib_scale_f should come before sched_load_avg_cpu on CPU7 but after it on CPU6
   - the time difference between the two colliding events is barely noticeable; we should probably fix overlapping timestamps by adding 1ns to each duplicated index, which removes the index collision without risking a new one with a following event.
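Steps 1 and 2 could be sketched with plain pandas as follows. This is only a sketch under assumptions: `concat_in_trace_order` is a hypothetical helper name, and the `line` column (the position of each event in the trace) is assumed to be available in both DFs, since without something like it the original ordering of events sharing a timestamp cannot be recovered. It also assumes timestamps are in seconds with microsecond resolution, so that a few 1ns shifts cannot reach the next event.

```python
import pandas as pd


def concat_in_trace_order(primary_df, secondary_df, line_col="line", step=1e-9):
    """Stack two event DFs in trace order and make the time index unique.

    `line_col` is a hypothetical column holding each event's line number
    in the trace; it is only used to break ties between equal timestamps.
    """
    merged = pd.concat([primary_df, secondary_df], sort=False)
    merged = merged.rename_axis("Time").reset_index()
    # Order by timestamp first, then by position in the trace.
    merged = merged.sort_values(["Time", line_col])
    # 0 for the first event at a timestamp, 1 for the second, and so on.
    rank = merged.groupby("Time").cumcount()
    # Shift each duplicate by `rank` nanoseconds (timestamps are in seconds).
    merged["Time"] = merged["Time"] + rank * step
    return merged.set_index("Time")


# First four events of the trace above, with times relative to the first one.
primary_df = pd.DataFrame(
    {"cpu": [7, 6], "util_avg": [825, 292], "line": [2, 3]},
    index=pd.Index([0.0, 1e-6], name="Time"),
)
secondary_df = pd.DataFrame(
    {"cpu": [7, 6], "cpu_scale_factor": [1, 2], "line": [1, 4]},
    index=pd.Index([0.0, 2e-6], name="Time"),
)
joined = concat_in_trace_order(primary_df, secondary_df)
```

Here the sched_load_avg_cpu event on CPU7 is moved from 0.0 to 1ns, so it still follows the sched_contrib_scale_f event it collided with.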

Then we need to:

3. forward propagate each secondary_df column according to the value of a "pivot" column which is shared between the two DFs. For example, the value of `cpu` can be used to forward propagate the other `sched_contrib_scale_f` columns (i.e. `freq_scale_factor` and `cpu_scale_factor`) into the `sched_load_avg_cpu` rows

4. remove all the secondary_df rows whose values have already been propagated into the following primary_df rows
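Steps 3 and 4 could look roughly like the sketch below. `propagate_by_pivot` is a hypothetical name, and the input is assumed to be the already joined DF with a unique time index, where rows coming from the secondary DF have NaN in the primary-only columns (which is how we tell them apart here).

```python
import numpy as np
import pandas as pd


def propagate_by_pivot(joined_df, pivot, primary_cols, secondary_cols):
    """Forward-fill secondary columns per pivot value, then drop secondary rows."""
    out = joined_df.copy()
    # Step 3: forward propagate within each pivot group (e.g. per CPU).
    out[secondary_cols] = out.groupby(pivot)[secondary_cols].ffill()
    # Step 4: rows with no primary data at all came from the secondary DF.
    return out.dropna(subset=primary_cols, how="all")


nan = np.nan
# The example trace after the join, with the collision at 0.0 already fixed.
joined_df = pd.DataFrame(
    {
        "cpu": [7, 7, 6, 6, 7, 7, 6],
        "util_avg": [nan, 825, 292, nan, 850, nan, 315],
        "cpu_scale_factor": [1, nan, nan, 2, nan, 3, nan],
    },
    index=pd.Index([0.0, 1e-9, 1e-6, 2e-6, 3e-6, 4e-6, 5e-6], name="Time"),
)
result = propagate_by_pivot(joined_df, "cpu", ["util_avg"], ["cpu_scale_factor"])
```

Only the four sched_load_avg_cpu rows remain in `result`, each carrying the latest `cpu_scale_factor` seen on its own CPU, or NaN if none was seen yet.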

All these operations together should be supported by a new generic convenience API which, once called with something like:

```python
trappy.ftrace.utils.merge_df(primary_df='sched_load_avg_cpu',
                             secondary_df='sched_contrib_scale_f',
                             pivot='cpu')
```
where primary_df is:

```
          cpu  util_avg
Time                   
0.000000    7       825
0.000001    6       292
0.000003    7       850
0.000005    6       315
```

and secondary_df is:
```
          cpu  cpu_scale_factor
Time                           
0.000000    7                 1
0.000002    6                 2
0.000004    7                 3
```

should return a single DF which is:

```
          cpu  util_avg  cpu_scale_factor
Time                                     
0.000000  7.0     825.0               1.0
0.000001  6.0     292.0               NaN   <- no earlier valid secondary_df entry for this CPU
0.000003  7.0     850.0               1.0  <- propagation of the previous value
0.000005  6.0     315.0               2.0
```
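For comparison, pandas' own `merge_asof` already gets close to this result for the example above. This is a sketch, not the proposed API: it assumes that when a primary and a secondary event share a timestamp, the secondary one counts as already seen (the default `allow_exact_matches=True`). That matches the trace here, but it does not look at the trace ordering in general.

```python
import pandas as pd

primary_df = pd.DataFrame(
    {"cpu": [7, 6, 7, 6], "util_avg": [825, 292, 850, 315]},
    index=pd.Index([0.0, 1e-6, 3e-6, 5e-6], name="Time"),
)
secondary_df = pd.DataFrame(
    {"cpu": [7, 6, 7], "cpu_scale_factor": [1, 2, 3]},
    index=pd.Index([0.0, 2e-6, 4e-6], name="Time"),
)

# For each primary row, take the latest secondary row at or before it
# that has the same value in the pivot column.
merged = pd.merge_asof(
    primary_df.reset_index(),
    secondary_df.reset_index(),
    on="Time",
    by="cpu",
    direction="backward",
).set_index("Time")

# cpu_scale_factor per row: 1.0, NaN, 1.0, 2.0
```

Unlike the output above, `cpu` and `util_avg` keep their integer dtype here, because the primary rows are never mixed with NaN-padded secondary rows.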

Here is a notebook to play with the same example:
https://gist.github.com/derkling/786e911ae01ca170377e1893d6696384
It shows that the current join API needs to be extended to get the exact result described above.
