improve TimeSeries split performance#3933
Conversation
Signed-off-by: Samir Romdhani <samir.romdhani_externe@rte-france.com>
CalculatedTimeSeries is not concerned by compact array, calc split returns copies and not create NaN Signed-off-by: Samir Romdhani <samir.romdhani_externe@rte-france.com>
46fb6ab to
6893ad2
Compare
Signed-off-by: Samir Romdhani <samir.romdhani_externe@rte-france.com>
fc02f7c to
2865686
Compare
Signed-off-by: Samir Romdhani <samir.romdhani_externe@rte-france.com>
64b7a54 to
08453cf
Compare
Signed-off-by: Samir Romdhani <samir.romdhani_externe@rte-france.com>
Signed-off-by: Samir Romdhani <samir.romdhani_externe@rte-france.com>
Signed-off-by: Samir Romdhani <samir.romdhani_externe@rte-france.com>
Signed-off-by: Samir Romdhani <samir.romdhani_externe@rte-france.com>
Signed-off-by: Samir Romdhani <samir.romdhani_externe@rte-france.com>
6811406 to
1a27bd3
Compare
Signed-off-by: Samir Romdhani <samir.romdhani_externe@rte-france.com>
Signed-off-by: Samir Romdhani <samir.romdhani_externe@rte-france.com>
Signed-off-by: Samir Romdhani <samir.romdhani_externe@rte-france.com>
Signed-off-by: Samir Romdhani <samir.romdhani_externe@rte-france.com>
Signed-off-by: Samir Romdhani <samir.romdhani_externe@rte-france.com>
Signed-off-by: Samir Romdhani <samir.romdhani_externe@rte-france.com>
1a27bd3 to
57690a9
Compare
…nce-for-many-small-chunks
…nce-for-many-small-chunks
|
rolnico
left a comment
There was a problem hiding this comment.
I'm not sure if we go the right way on this.
What if we tried to change the method AbstractTimeSeries.split(int) instead?
public List<T> split(int newChunkSize) {
if (chunks.isEmpty()) {
return List.of();
}
int minOffset = getMinOffset();
// Sort chunks by offset
List<C> sortedChunks = getSortedChunks();
// Map from bucket index -> list of chunk pieces that fall in that bucket.
Map<Integer, List<C>> bucketMap = new LinkedHashMap<>();
for (C chunk : sortedChunks) {
int chunkStart = chunk.getOffset();
int chunkEnd = chunkStart + chunk.getLength() - 1;
int firstBucket = (chunkStart - minOffset) / newChunkSize;
int lastBucket = (chunkEnd - minOffset) / newChunkSize;
for (int b = firstBucket; b <= lastBucket; b++) {
int bucketStart = minOffset + b * newChunkSize;
int bucketEnd = bucketStart + newChunkSize - 1;
// Intersection of chunk and bucket
int intersectStart = Math.max(chunkStart, bucketStart);
int intersectEnd = Math.min(chunkEnd, bucketEnd);
// Trim the chunk to [intersectStart, intersectEnd]
C slice = chunk;
if (intersectStart > chunkStart) {
slice = slice.splitAt(intersectStart).getChunk2();
}
int sliceEnd = intersectStart + slice.getLength() - 1;
if (sliceEnd > intersectEnd) {
slice = slice.splitAt(intersectEnd + 1).getChunk1();
}
bucketMap.computeIfAbsent(b, k -> new ArrayList<>()).add(slice);
}
}
// Build one time series per non-empty bucket
List<T> result = new ArrayList<>(bucketMap.size());
for (List<C> pieces : bucketMap.values()) {
result.add(createTimeSeries(pieces));
}
return result;
}With something like this, we would avoid getting chunks filled with NaN. However, there are still issues because the method TimeSeries.split(List, int) expect a chunkCount based on the index, so we would have to change that or to generate empty chunks?
What do you think of it?
| } | ||
|
|
||
| @Override | ||
| public List<DoubleTimeSeries> split(int newChunkSize) { |
There was a problem hiding this comment.
This code would potentially generate chunks/time series with only NaN values if there is a gap between the initial chunks:
@Test
void testSplitIssue() {
RegularTimeSeriesIndex index = RegularTimeSeriesIndex.create(Interval.parse("2015-01-01T00:00:00Z/2015-01-01T01:45:00Z"), Duration.ofMinutes(15));
TimeSeriesMetadata metadata = new TimeSeriesMetadata("ts1", TimeSeriesDataType.DOUBLE, index);
UncompressedDoubleDataChunk chunk1 = new UncompressedDoubleDataChunk(0, new double[]{1d, 2d, 3d});
UncompressedDoubleDataChunk chunk2 = new UncompressedDoubleDataChunk(6, new double[]{7d, 8d});
StoredDoubleTimeSeries timeSeries = new StoredDoubleTimeSeries(metadata, chunk1, chunk2);
// Split on multiple sizes
List<DoubleTimeSeries> split2TimeSeries = timeSeries.split(2);
List<DoubleTimeSeries> split3TimeSeries = timeSeries.split(3);
List<DoubleTimeSeries> split4TimeSeries = timeSeries.split(4);
assertEquals(4, split2TimeSeries.size());
assertEquals(3, split3TimeSeries.size());
assertEquals(2, split4TimeSeries.size());
assertArrayEquals(new double[]{1d, 2d}, split2TimeSeries.get(0).toCompactArray(), 0d);
assertArrayEquals(new double[]{3d, NaN}, split2TimeSeries.get(1).toCompactArray(), 0d);
assertArrayEquals(new double[]{NaN, NaN}, split2TimeSeries.get(2).toCompactArray(), 0d);
assertArrayEquals(new double[]{7d, 8d}, split2TimeSeries.get(3).toCompactArray(), 0d);
}Do we want this?
There was a problem hiding this comment.
You're right, i think if the range is entirely a gap, it should return empty series.
There was a problem hiding this comment.
I will keep an empty series instead of either skipping it or filling it with NaN values, this preserves positional contract for split(List, int) and preserves the memory goal (e.g splitTestHuge)
…nce-for-many-small-chunks
|
Hi @rolnico, thanks for looping me in — happy to share a view. I like the bucket approach: slicing each chunk into aligned On your open question, I think it's the real constraint and it points to the answer. int chunkCount = computeChunkCount(index, newChunkSize); // ceil(pointCount / newChunkSize)
for (int i = 0; i < chunkCount; i++) {
splitList.get(i).add(split.get(i));
}so it needs every series' So instead of "change split(List, int)" or "fill with NaN", maybe a third option: emit all Two things I'd check first: If that direction sounds right, it would also mean #3941 is fixed by this PR directly — happy to drop my separate adjacency-guard patch and instead add a few gapped-chunk test cases for the bucket version (incl. the split(4) repro). Thanks! |
It seems like an interesting idea worth testing. Could you open a new PR with your proposal, so that we can have a look, compare and test it? |
|
Quick update — I prototyped the bucket version locally and ran it against the time-series suite. Testing surfaced the real trade-off behind your "change split(List,int) or generate empty chunks?" question:
So the crux is the one you flagged: to keep buckets sparse and alignable, Before I open the PR: which contract do you prefer — (a) dense grid of (Small aside: bucket alignment should be to absolute index 0, not |
Signed-off-by: Samir Romdhani <samir.romdhani_externe@rte-france.com>
|
Hello, thanks for the feedback, I agree that the bucket slicing is clearer than the current recursive split, but it change split semantics, and here are the two options, as @rolnico noted :
I'm not against that direction, but I'd rather keep it separate:
What do you think ? |
…ct chunk view - Rewrite AbstractTimeSeries.split(int) with compact chunk - Remove recursive split and splitChunk helper (inused) - Add tests Signed-off-by: Samir Romdhani <samir.romdhani_externe@rte-france.com>
After testing cases, it seems that the bucket approach needs remerging and not only alignment with
I'm fine moving further changes to a separate PR if they're considered out of the perf scope. still, these changes can be justified by the points discussed in this issue:
|
|
|
Thanks @samirromdhani — that compact-chunk approach is neat. Compacting to one continuous block before splitAt sidesteps exactly the remerge problem I hit prototyping the bucket version (adjacent slices from different source chunks landing in the same window, as in splitMultiChunkTimeSeriesTest), and getting an empty series for full gaps for free is a clean way to keep existing behaviour while dropping the recursive split. Nice. |



Please check if the PR fulfills these requirements
Does this PR already have an issue describing the problem?
Fixes #1634
What kind of change does this PR introduce?
What is the current behavior?
What is the new behavior (if this is a feature change)?
Does this PR introduce a breaking change or deprecate an API?
If yes, please check if the following requirements are fulfilled
What changes might users need to make in their application due to this PR? (migration steps)
splitmethod was changed to improve performance in both execution time and memory consumption.toCompactArray, split now uses that.toArraystill available, users can choosetoCompactArrayfor improved memory usage.Other information:
powsybl-benchmark way with tsSize = 100000: