Accelerate aggregation and parse #11

takohei · 2024-05-05T15:30:56Z

Close Issues

Benchmark

data set

name	source	size	file
Takushoku Bus	the fixture of test	small	fx_TakushokuBus.zip
Tokyo Toei Bus	#6	medium	P6_ToeiBus.zip
Vancouver TransLink	MIERUNE/GTFS-GO#77	large	G77_TransLink.zip
Swiss	#1	extra large	GTFS_FP2021_2021-12-08_09-10.zip

results

read_stops(): 1.2 - 1.7 times faster
read_routes() noshape: 8.0 - 12.7 times faster
read_routes() shape: 4.1 - 6.6 times faster
aggregator nounify: 2.9 - 3.9 times faster
aggregator unify: 21.0 - 37.2 times faster

remarks

Current read_routes()@parse.py has 2 bugs. I fix them temporary and use for measurement.
Like this:

        # Point -> LineString: group by route_id and trip_id
        lines = merged.groupby(["route_id", "trip_id"])["stop_pt"].apply(tuple)
        line_df = lines.reset_index()[["route_id", "stop_pt"]].drop_duplicates()

Description（変更内容）

Accelerate and refactor for aggregation, especially for stop unification

Change the unification from apply per stops to process for whole dataframe.
Change merge procedures so that they are done on smaller dataframes.
Decrease max_distance_degree 0.01 to 0.003 and add process of join near groups.
- This change reduces the number of unified stops in test cases because there are stops with a same name that are 400 meters apart from each other.
Exclude parent stops from aggregation with no_unify_stops option.
- This change reduces the number of aggregated stops in the test case with no_unify_stops option.
Break large __aggregate_similar_stops() into small static methods.
Remove path_id and related procedures in read_route_frequency()

Add read_stop_relations() to Aggregator

Add read_stop_relations() that returns relations of stops and similar stops as a list of dictionaries.
Eliminate modification of gtfs data frames in the Aggregator class.
Eliminate GTFS-GO (gtfs_go_dialog.py) getting stop relations from inside Aggregator.

Acclerate and refactor read_routes()

Unify stop patterns by stop_id, not by line string.
Minimize number of coordinate pairing and linestring generation.
Fix sort procedure for stop sequence.
- The route collapse problem in PR Geopandas #9 was caused by a difference in sorting behavior between Pandas 2.2 and 1.1 (built into QGIS 3.28).
Change the method of generating features from iterrows() to to_dict().
Break large read_routes() into some small methods.

Acclerate read_stops()

Change merge/join procedure to be faster.
Change the method of generating features from iterrows() to to_dict().

Add test cases for aggregation

Add test cases of delimiter, time filter.
Add tests for record values.
Set target date to feed_start_date.
Arrange the test cases from per-method to per-option.

Manual Testing（手動テスト）

Please check in QGIS on your Mac OS.

for Pandas 1.1.3 / QGIS 3.34

coderabbitai · 2024-05-05T15:31:05Z

ウォークスルー

GTFSデータ解析のためのgtfs_parserライブラリが大幅に更新されました。主な変更点はAggregatorクラスのメソッドの追加と削除、データフレームの処理方法の改善、およびGeoJSON機能生成のための新しい関数の導入です。これにより、より効率的かつ柔軟なデータ処理が可能になります。

変更

ファイル	変更概要
`gtfs_parser/aggregate.py`	`Aggregator`クラスのメソッドの追加・削除、`__init__`メソッドの更新
`gtfs_parser/gtfs.py`	`load_df`関数に`location_type`列の処理を追加
`gtfs_parser/parse.py`	停留所とルートの読み取り処理の変更、GeoJSON機能生成のための新しい関数
`tests/test_aggregator.py`	新しいテスト関数の追加

🐰✨
新しい道へと進むたび
コードの森を駆け巡る
データの流れを整え
地図上に軌跡を描く
変化を楽しむ我らが旅
🌟🌲🌲🌲🌲🌲🌲

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share

Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai generate interesting stats about this repository and render them as a table.
- @coderabbitai show all the console.log statements in this repository.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

codecov · 2024-05-05T15:31:40Z

Codecov Report

Attention: Patch coverage is 91.10169% with 21 lines in your changes are missing coverage. Please review.

Project coverage is 91.24%. Comparing base (98f7fad) to head (34daec8).
Report is 1 commits behind head on main.

Files	Patch %	Lines
tests/test_aggregator.py	85.91%	10 Missing ⚠️
gtfs_parser/aggregate.py	93.16%	8 Missing ⚠️
gtfs_parser/parse.py	93.47%	3 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main      #11       +/-   ##
===========================================
+ Coverage   80.61%   91.24%   +10.63%     
===========================================
  Files           8        8               
  Lines         294      377       +83     
===========================================
+ Hits          237      344      +107     
+ Misses         57       33       -24

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

coderabbitai

Actionable comments posted: 0

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 851052a and 34daec8.

Files selected for processing (4)

gtfs_parser/aggregate.py (7 hunks)
gtfs_parser/gtfs.py (1 hunks)
gtfs_parser/parse.py (3 hunks)
tests/test_aggregator.py (1 hunks)

Additional Context Used

GitHub Check Runs (1)

codecov/patch success (13)

gtfs_parser/aggregate.py: [warning] 51-51: gtfs_parser/aggregate.py#L51
Added line #L51 was not covered by tests

gtfs_parser/aggregate.py: [warning] 96-96: gtfs_parser/aggregate.py#L96
Added line #L96 was not covered by tests

gtfs_parser/aggregate.py: [warning] 143-143: gtfs_parser/aggregate.py#L143
Added line #L143 was not covered by tests

gtfs_parser/aggregate.py: [warning] 151-151: gtfs_parser/aggregate.py#L151
Added line #L151 was not covered by tests

gtfs_parser/aggregate.py: [warning] 154-154: gtfs_parser/aggregate.py#L154
Added line #L154 was not covered by tests

gtfs_parser/aggregate.py: [warning] 203-205: gtfs_parser/aggregate.py#L203-L205
Added lines #L203 - L205 were not covered by tests

gtfs_parser/parse.py: [warning] 101-101: gtfs_parser/parse.py#L101
Added line #L101 was not covered by tests

gtfs_parser/parse.py: [warning] 107-108: gtfs_parser/parse.py#L107-L108
Added lines #L107 - L108 were not covered by tests

tests/test_aggregator.py: [warning] 13-13: tests/test_aggregator.py#L13
Added line #L13 was not covered by tests

tests/test_aggregator.py: [warning] 15-15: tests/test_aggregator.py#L15
Added line #L15 was not covered by tests

tests/test_aggregator.py: [warning] 18-22: tests/test_aggregator.py#L18-L22
Added lines #L18 - L22 were not covered by tests

tests/test_aggregator.py: [warning] 33-34: tests/test_aggregator.py#L33-L34
Added lines #L33 - L34 were not covered by tests

tests/test_aggregator.py: [warning] 41-41: tests/test_aggregator.py#L41
Added line #L41 was not covered by tests

Additional comments not posted (17)

gtfs_parser/gtfs.py (1)

22-23: "location_type" 列の処理が適切に行われています。データの整合性と型の問題を防ぐために、欠損値を "0" で埋め、整数型に変換しています。

gtfs_parser/parse.py (4)

17-32: 停留所データの処理が最適化されており、"ignore_no_route" フラグに基づいて接続されていない停留所を適切に処理しています。

70-72: 形状データの有無と "ignore_shapes" フラグに基づいて、適切な解析戦略を選択しています。

75-109: 形状データの処理が効率的に行われており、特徴生成のためのデータ準備に必要なソート、グルーピング、マージ操作が含まれています。

112-147: 形状データを使用せずにルート特徴を生成する処理が明確で効果的に実装されています。

tests/test_aggregator.py (5)

4-7: 座標が指定されたイプシロン内で互いに近いかどうかをチェックするロジックが正しく、効率的に比較を処理しています。

10-22: 異なるジオメトリタイプ（ポイント、ラインストリング、マルチラインストリング）を処理し、is_coordinate_close を使用して座標を比較するロジックが健全でよく構造化されています。

25-34: 特徴がプロパティを比較することによって特徴のリストに存在するかどうかを効果的にチェックし、必要に応じてジオメトリの比較に is_geometry_close を使用します。ロジックは明確で、チェックを効率的に処理します。

37-41: プロパティのセットがプロパティのリストに存在するかどうかを、キーと値のペアを比較することによって直接的にチェックするロジックがシンプルで効果的です。

44-72: no_unify_stops フラグが設定されているときの Aggregator クラスの動作をチェックするテストが、Aggregator メソッドによって返される関係と特徴の数をチェックします。アサーションは、期待される動作を検証するために正しく配置されています。

gtfs_parser/aggregate.py (7)

9-41: __init__ メソッドは、さまざまなパラメータを使用して Aggregator クラスを初期化し、これらのパラメータに基づいて停留所の時間を処理します。フィルタリングと統一のための静的メソッドの使用は、モジュール性のための良い実践です。

42-63: このメソッドは、日付と時間のパラメータに基づいて停留所の時間をフィルタリングします。出発時間が null 可能であることを処理し、比較のために整数に変換するロジックがうまく実装されています。

66-80: このメソッドは、"location_type" に基づいて統一せずに類似の停留所を取得します。停留所をフィルタリングし、それらの間の関係を作成するロジックが明確で効果的に実装されています。

83-102: このメソッドは、親駅、区切り文字、距離に基づいて類似の停留所を統一します。異なる統一戦略を処理するロジックは、よく構造化され、モジュラーです。

105-127: このメソッドは、親駅に基づいて子停留所を統一します。停留所間の関係を作成し、"location_type" 列を処理するロジックが効果的に実装されています。

130-168: このメソッドは、区切り文字と距離を基準にしてソロ停留所を統一します。異なる統一条件を処理し、結果をマージするロジックが明確でモジュラーです。

171-189: このメソッドは、距離を基準に近くの停留所のペアを効果的に計算します。近接性に基づいて停留所をマージし、"stop_name" 列を処理するロジックがうまく実装されています。

Kanahiro · 2024-05-08T05:38:43Z

LGTM!

takohei added 9 commits May 3, 2024 19:12

fix: improve performance of aggregation

71252bf

add: clustering near stops

940bc5a

fix: unify solo stop

d546ef5

fix: modify sort of aggregation

72788c0

for Pandas 1.1.3 / QGIS 3.34

fix: mod location_type, grouping names, comments

2b2e7a8

add: aggregation test variety

5417895

accelerate and refactor read_routes()

2d40276

accelerate read_stops()

f0bc7db

refactor, accelerate read_stops()

34daec8

coderabbitai bot reviewed May 5, 2024

View reviewed changes

This was referenced May 5, 2024

Update gtfs parser interface MIERUNE/GTFS-GO#81

Merged

Chaos on Vancouver, Canada GTFS dataset MIERUNE/GTFS-GO#77

Closed

Skip unused files #13

Merged

Kanahiro merged commit 425087d into MIERUNE:main May 8, 2024
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Accelerate aggregation and parse #11

Accelerate aggregation and parse #11

takohei commented May 5, 2024

coderabbitai bot commented May 5, 2024 •

edited

Loading

Chat

CodeRabbit Commands (invoked as PR comments)

CodeRabbit Configration File (`.coderabbit.yaml`)

Documentation and Community

codecov bot commented May 5, 2024

coderabbitai bot left a comment

Kanahiro commented May 8, 2024

Accelerate aggregation and parse #11

Accelerate aggregation and parse #11

Conversation

takohei commented May 5, 2024

Close Issues

Benchmark

data set

results

remarks

Description（変更内容）

Accelerate and refactor for aggregation, especially for stop unification

Add read_stop_relations() to Aggregator

Acclerate and refactor read_routes()

Acclerate read_stops()

Add test cases for aggregation

Manual Testing（手動テスト）

coderabbitai bot commented May 5, 2024 • edited Loading

ウォークスルー

変更

Chat

CodeRabbit Commands (invoked as PR comments)

CodeRabbit Configration File (.coderabbit.yaml)

Documentation and Community

codecov bot commented May 5, 2024

Codecov Report

coderabbitai bot left a comment

Choose a reason for hiding this comment

Kanahiro commented May 8, 2024

coderabbitai bot commented May 5, 2024 •

edited

Loading

CodeRabbit Configration File (`.coderabbit.yaml`)