Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accelerate aggregation and parse #11

Merged
merged 9 commits into from
May 8, 2024
Merged

Conversation

takohei
Copy link
Contributor

@takohei takohei commented May 5, 2024

Close Issues

Benchmark

image

data set

name source size file
Takushoku Bus the fixture of test small fx_TakushokuBus.zip
Tokyo Toei Bus #6 medium P6_ToeiBus.zip
Vancouver TransLink MIERUNE/GTFS-GO#77 large G77_TransLink.zip
Swiss #1 extra large GTFS_FP2021_2021-12-08_09-10.zip

results

  • read_stops(): 1.2 - 1.7 times faster
  • read_routes() noshape: 8.0 - 12.7 times faster
  • read_routes() shape: 4.1 - 6.6 times faster
  • aggregator nounify: 2.9 - 3.9 times faster
  • aggregator unify: 21.0 - 37.2 times faster

remarks

Current read_routes()@parse.py has 2 bugs. I fix them temporary and use for measurement.
Like this:

        # Point -> LineString: group by route_id and trip_id
        lines = merged.groupby(["route_id", "trip_id"])["stop_pt"].apply(tuple)
        line_df = lines.reset_index()[["route_id", "stop_pt"]].drop_duplicates()

Description(変更内容)

Accelerate and refactor for aggregation, especially for stop unification

  • Change the unification from apply per stops to process for whole dataframe.
  • Change merge procedures so that they are done on smaller dataframes.
  • Decrease max_distance_degree 0.01 to 0.003 and add process of join near groups.
    • This change reduces the number of unified stops in test cases because there are stops with a same name that are 400 meters apart from each other.
  • Exclude parent stops from aggregation with no_unify_stops option.
    • This change reduces the number of aggregated stops in the test case with no_unify_stops option.
  • Break large __aggregate_similar_stops() into small static methods.
  • Remove path_id and related procedures in read_route_frequency()

Add read_stop_relations() to Aggregator

  • Add read_stop_relations() that returns relations of stops and similar stops as a list of dictionaries.
  • Eliminate modification of gtfs data frames in the Aggregator class.
  • Eliminate GTFS-GO (gtfs_go_dialog.py) getting stop relations from inside Aggregator.

Acclerate and refactor read_routes()

  • Unify stop patterns by stop_id, not by line string.
  • Minimize number of coordinate pairing and linestring generation.
  • Fix sort procedure for stop sequence.
    • The route collapse problem in PR Geopandas #9 was caused by a difference in sorting behavior between Pandas 2.2 and 1.1 (built into QGIS 3.28).
  • Change the method of generating features from iterrows() to to_dict().
  • Break large read_routes() into some small methods.

Acclerate read_stops()

  • Change merge/join procedure to be faster.
  • Change the method of generating features from iterrows() to to_dict().

Add test cases for aggregation

  • Add test cases of delimiter, time filter.
  • Add tests for record values.
  • Set target date to feed_start_date.
  • Arrange the test cases from per-method to per-option.

Manual Testing(手動テスト)

Please check in QGIS on your Mac OS.

Copy link

coderabbitai bot commented May 5, 2024

ウォークスルー

GTFSデータ解析のためのgtfs_parserライブラリが大幅に更新されました。主な変更点はAggregatorクラスのメソッドの追加と削除、データフレームの処理方法の改善、およびGeoJSON機能生成のための新しい関数の導入です。これにより、より効率的かつ柔軟なデータ処理が可能になります。

変更

ファイル 変更概要
gtfs_parser/aggregate.py Aggregatorクラスのメソッドの追加・削除、__init__メソッドの更新
gtfs_parser/gtfs.py load_df関数にlocation_type列の処理を追加
gtfs_parser/parse.py 停留所とルートの読み取り処理の変更、GeoJSON機能生成のための新しい関数
tests/test_aggregator.py 新しいテスト関数の追加

🐰✨
新しい道へと進むたび
コードの森を駆け巡る
データの流れを整え
地図上に軌跡を描く
変化を楽しむ我らが旅
🌟🌲🌲🌲🌲🌲🌲


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

codecov bot commented May 5, 2024

Codecov Report

Attention: Patch coverage is 91.10169% with 21 lines in your changes are missing coverage. Please review.

Project coverage is 91.24%. Comparing base (98f7fad) to head (34daec8).
Report is 1 commits behind head on main.

Files Patch % Lines
tests/test_aggregator.py 85.91% 10 Missing ⚠️
gtfs_parser/aggregate.py 93.16% 8 Missing ⚠️
gtfs_parser/parse.py 93.47% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main      #11       +/-   ##
===========================================
+ Coverage   80.61%   91.24%   +10.63%     
===========================================
  Files           8        8               
  Lines         294      377       +83     
===========================================
+ Hits          237      344      +107     
+ Misses         57       33       -24     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits Files that changed from the base of the PR and between 851052a and 34daec8.
Files selected for processing (4)
  • gtfs_parser/aggregate.py (7 hunks)
  • gtfs_parser/gtfs.py (1 hunks)
  • gtfs_parser/parse.py (3 hunks)
  • tests/test_aggregator.py (1 hunks)
Additional Context Used
GitHub Check Runs (1)
codecov/patch success (13)

gtfs_parser/aggregate.py: [warning] 51-51: gtfs_parser/aggregate.py#L51
Added line #L51 was not covered by tests


gtfs_parser/aggregate.py: [warning] 96-96: gtfs_parser/aggregate.py#L96
Added line #L96 was not covered by tests


gtfs_parser/aggregate.py: [warning] 143-143: gtfs_parser/aggregate.py#L143
Added line #L143 was not covered by tests


gtfs_parser/aggregate.py: [warning] 151-151: gtfs_parser/aggregate.py#L151
Added line #L151 was not covered by tests


gtfs_parser/aggregate.py: [warning] 154-154: gtfs_parser/aggregate.py#L154
Added line #L154 was not covered by tests


gtfs_parser/aggregate.py: [warning] 203-205: gtfs_parser/aggregate.py#L203-L205
Added lines #L203 - L205 were not covered by tests


gtfs_parser/parse.py: [warning] 101-101: gtfs_parser/parse.py#L101
Added line #L101 was not covered by tests


gtfs_parser/parse.py: [warning] 107-108: gtfs_parser/parse.py#L107-L108
Added lines #L107 - L108 were not covered by tests


tests/test_aggregator.py: [warning] 13-13: tests/test_aggregator.py#L13
Added line #L13 was not covered by tests


tests/test_aggregator.py: [warning] 15-15: tests/test_aggregator.py#L15
Added line #L15 was not covered by tests


tests/test_aggregator.py: [warning] 18-22: tests/test_aggregator.py#L18-L22
Added lines #L18 - L22 were not covered by tests


tests/test_aggregator.py: [warning] 33-34: tests/test_aggregator.py#L33-L34
Added lines #L33 - L34 were not covered by tests


tests/test_aggregator.py: [warning] 41-41: tests/test_aggregator.py#L41
Added line #L41 was not covered by tests

Additional comments not posted (17)
gtfs_parser/gtfs.py (1)

22-23: "location_type" 列の処理が適切に行われています。データの整合性と型の問題を防ぐために、欠損値を "0" で埋め、整数型に変換しています。

gtfs_parser/parse.py (4)

17-32: 停留所データの処理が最適化されており、"ignore_no_route" フラグに基づいて接続されていない停留所を適切に処理しています。


70-72: 形状データの有無と "ignore_shapes" フラグに基づいて、適切な解析戦略を選択しています。


75-109: 形状データの処理が効率的に行われており、特徴生成のためのデータ準備に必要なソート、グルーピング、マージ操作が含まれています。


112-147: 形状データを使用せずにルート特徴を生成する処理が明確で効果的に実装されています。

tests/test_aggregator.py (5)

4-7: 座標が指定されたイプシロン内で互いに近いかどうかをチェックするロジックが正しく、効率的に比較を処理しています。


10-22: 異なるジオメトリタイプ(ポイント、ラインストリング、マルチラインストリング)を処理し、is_coordinate_close を使用して座標を比較するロジックが健全でよく構造化されています。


25-34: 特徴がプロパティを比較することによって特徴のリストに存在するかどうかを効果的にチェックし、必要に応じてジオメトリの比較に is_geometry_close を使用します。ロジックは明確で、チェックを効率的に処理します。


37-41: プロパティのセットがプロパティのリストに存在するかどうかを、キーと値のペアを比較することによって直接的にチェックするロジックがシンプルで効果的です。


44-72: no_unify_stops フラグが設定されているときの Aggregator クラスの動作をチェックするテストが、Aggregator メソッドによって返される関係と特徴の数をチェックします。アサーションは、期待される動作を検証するために正しく配置されています。

gtfs_parser/aggregate.py (7)

9-41: __init__ メソッドは、さまざまなパラメータを使用して Aggregator クラスを初期化し、これらのパラメータに基づいて停留所の時間を処理します。フィルタリングと統一のための静的メソッドの使用は、モジュール性のための良い実践です。


42-63: このメソッドは、日付と時間のパラメータに基づいて停留所の時間をフィルタリングします。出発時間が null 可能であることを処理し、比較のために整数に変換するロジックがうまく実装されています。


66-80: このメソッドは、"location_type" に基づいて統一せずに類似の停留所を取得します。停留所をフィルタリングし、それらの間の関係を作成するロジックが明確で効果的に実装されています。


83-102: このメソッドは、親駅、区切り文字、距離に基づいて類似の停留所を統一します。異なる統一戦略を処理するロジックは、よく構造化され、モジュラーです。


105-127: このメソッドは、親駅に基づいて子停留所を統一します。停留所間の関係を作成し、"location_type" 列を処理するロジックが効果的に実装されています。


130-168: このメソッドは、区切り文字と距離を基準にしてソロ停留所を統一します。異なる統一条件を処理し、結果をマージするロジックが明確でモジュラーです。


171-189: このメソッドは、距離を基準に近くの停留所のペアを効果的に計算します。近接性に基づいて停留所をマージし、"stop_name" 列を処理するロジックがうまく実装されています。

@Kanahiro
Copy link
Member

Kanahiro commented May 8, 2024

LGTM!

@Kanahiro Kanahiro merged commit 425087d into MIERUNE:main May 8, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants