
Jan open data -- draft for now #1356

Draft: wants to merge 15 commits into main
Conversation

edasmalchi
Member

Finished running the pipeline with some necessary tweaks but creating this as a draft to get some additional review before considering it done.

Speedmaps had a lot of missing data (the site will revert to December for now), but I don't know whether this is due to:

  • My downloading vehicle positions the day after (last Thursday, with an analysis date of last Wednesday -- is that just too soon?)
  • Changes I made here to fix various errors
  • Something else

I'll try redownloading vehicle positions and rerunning rt_segment_speeds later today, but @tiffanychu90, if anything I did here jumps out at you as a possible cause, please let me know.

Otherwise I'll make a separate issue once I learn a little more.

Example of missing data for SFMTA; I looked at a few others and the pattern seemed to be the same:

segments_new.time_of_day.value_counts() #  january

Early AM    686
Evening     572
PM Peak     569
AM Peak     539
Midday      483
Owl         259
Name: time_of_day, dtype: int64

speedmap_segs.time_of_day.value_counts() #  december

Early AM    5194
PM Peak     5189
Evening     4836
AM Peak     4645
Midday      4296
Owl         2044
Name: time_of_day, dtype: int64
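
Side note: a quick way to quantify the drop (just a sketch, assuming segments_new and speedmap_segs are the January/December frames from the counts above):

import pandas as pd

# Line up January vs. December time_of_day counts and the share retained.
counts = pd.concat(
    {
        "january": segments_new.time_of_day.value_counts(),
        "december": speedmap_segs.time_of_day.value_counts(),
    },
    axis=1,
)
counts["jan_pct_of_dec"] = (counts.january / counts.december * 100).round(1)
counts

Every time-of-day bucket comes out around 11-13% of December's volume, so the drop looks fairly uniform across periods rather than concentrated in one bin.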

@tiffanychu90
Member

tiffanychu90 commented Jan 22, 2025

My thoughts from looking at the changes:

  • I think I saw that vp_idx is commented out...I'm wondering how you're able to link back to the vp_usable_dwell table afterward? The nearest neighbor approach keys into that column and uses it to link vp without carrying a mess of columns. It also takes varying forms (it can be an array in the intermediate steps), so leaving it off and trying to merge it in later would not be equivalent.
  • The logs look suspiciously fast...even compared to prior months.
    • Having run this for the last 2 years, I'd say that times do not vary much, and any time there's a glaring drop in times without big refactors, I would check each staged parquet to see where the difference occurs.
  • I would compare the first 3 tables for Dec 2024 and Jan 2025 and look at the number of rows. They should be the same order of magnitude from month to month: ~15M in vp_usable, then ~12M in vp_usable_dwell, and vp_condensed_line should have the same number of rows as the number of trip_instance_keys (see the row-count sketch after this list).
    speeds_tables:
     raw_vp: vp
     usable_vp: vp_usable
     vp_dwell: vp_usable_dwell
     vp_condensed_line: condensed/vp_condensed
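
In case it's helpful, the kind of row-count check I mean is sketched below (the GCS prefix and the date-suffixed file names are assumptions on my part, so point it at wherever the staged parquets actually live):

import gcsfs
import pyarrow.parquet as pq

# Assumed prefix/naming for the staged parquets; adjust to the real paths.
SEGMENT_GCS = "gs://calitp-analytics-data/data-analyses/rt_segment_speeds/"
TABLES = ["vp_usable", "vp_usable_dwell", "condensed/vp_condensed"]

fs = gcsfs.GCSFileSystem()

def row_counts(analysis_date: str) -> dict:
    # Read only the parquet metadata so we don't pull ~15M rows locally.
    counts = {}
    for tbl in TABLES:
        with fs.open(f"{SEGMENT_GCS}{tbl}_{analysis_date}.parquet", "rb") as f:
            counts[tbl] = pq.ParquetFile(f).metadata.num_rows
    return counts

# Compare row_counts("2024-12-11") vs row_counts("2025-01-15"):
# expect the same order of magnitude month to month (~15M in vp_usable,
# ~12M in vp_usable_dwell, and one vp_condensed row per trip_instance_key).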
    

Specifically...these logs jump out to me:

  • The very large decrease in times seems to indicate that there's some large reduction in an input df, so either that's happening in one of the scripts captured in vp_preprocessing.log or in the initial download (your initial guess). It could be in either place, but the other logs don't show that large a decrease, which seems to rule out the downloading of vp.
  • Why would something that takes 30-40 min take 5 min now?
2024-12-18 10:42:34.461 | INFO     | nearest_vp_to_stop:nearest_neighbor_for_stop:189 - nearest neighbor for stop_segments 2024-12-11: 0:36:50.316084
2024-12-18 14:32:17.520 | INFO     | nearest_vp_to_stop:nearest_neighbor_for_stop:189 - nearest neighbor for rt_stop_times 2024-12-11: 0:40:52.717197
2024-12-18 14:54:17.879 | INFO     | nearest_vp_to_stop:nearest_neighbor_for_stop:189 - nearest neighbor for speedmap_segments 2024-12-11: 0:08:09.784862
2025-01-21 13:27:30.117 | INFO     | nearest_vp_to_stop:nearest_neighbor_for_stop:188 - nearest neighbor for stop_segments 2025-01-15: 0:05:06.574214
2025-01-21 17:11:06.855 | INFO     | nearest_vp_to_stop:nearest_neighbor_for_stop:188 - nearest neighbor for stop_segments 2025-01-15: 0:04:14.577150
2025-01-21 18:22:23.308 | INFO     | nearest_vp_to_stop:nearest_neighbor_for_stop:188 - nearest neighbor for rt_stop_times 2025-01-15: 0:39:52.283432
2025-01-21 18:47:14.614 | INFO     | nearest_vp_to_stop:nearest_neighbor_for_stop:188 - nearest neighbor for speedmap_segments 2025-01-15: 0:08:43.206096
  • Why would something that took 15 min now take 1 min?
2024-12-17 17:43:47.305 | INFO     | interpolate_stop_arrival:interpolate_stop_arrivals:279 - interpolate arrivals for stop_segments 2024-12-11:  2024-12-11: 0:14:09.409333
2024-12-18 10:52:54.623 | INFO     | interpolate_stop_arrival:interpolate_stop_arrivals:236 - interpolate arrivals for stop_segments 2024-12-11:  2024-12-11: 0:10:20.061296
2024-12-18 14:42:51.535 | INFO     | interpolate_stop_arrival:interpolate_stop_arrivals:236 - interpolate arrivals for rt_stop_times 2024-12-11:  2024-12-11: 0:10:33.922215
2024-12-18 14:56:54.836 | INFO     | interpolate_stop_arrival:interpolate_stop_arrivals:236 - interpolate arrivals for speedmap_segments 2024-12-11:  2024-12-11: 0:02:36.918147
2025-01-21 17:12:27.738 | INFO     | interpolate_stop_arrival:interpolate_stop_arrivals:236 - interpolate arrivals for stop_segments 2025-01-15:  2025-01-15: 0:01:20.848747
2025-01-21 18:35:30.275 | INFO     | interpolate_stop_arrival:interpolate_stop_arrivals:236 - interpolate arrivals for rt_stop_times 2025-01-15:  2025-01-15: 0:13:06.847024
2025-01-21 18:49:29.708 | INFO     | interpolate_stop_arrival:interpolate_stop_arrivals:236 - interpolate arrivals for speedmap_segments 2025-01-15:  2025-01-15: 0:02:15.045884

@edasmalchi
Member Author

edasmalchi commented Jan 22, 2025

Thanks for the pointers!

My thoughts from looking at the changes:

  • I think I saw vp_idx is commented out...I'm wondering how you're able to link back to the vp_usable_dwell table after? The nearest neighbor approach keys into that and uses it to link vp without carrying a mess of columns. And it takes varying forms (it can be an array in the intermediate steps, so leaving it off and trying to merge it in would not be equivalent)
  • The logs look suspiciously fast...even compared to prior months.

Hm, I did comment out the sort_cols argument since it's not part of vp_transform.condense_point_geom_to_line.

vp_condensed = vp_transform.condense_point_geom_to_line(
    vp_gdf,
    group_cols = ["trip_instance_key"],
    # sort_cols = ["trip_instance_key", "vp_idx"], not used?
    array_cols = ["vp_idx", "geometry"]
)

def condense_point_geom_to_line(
    df: gpd.GeoDataFrame,
    group_cols: list,
    geom_col: str = "geometry",
    array_cols: list = []
) -> gpd.GeoDataFrame:

vp_idx was in both sort_cols and array_cols, so it seemed to me like it would still be expanded here:

**{c: lambda x: list(x) for c in array_cols},

It's not clear to me what the behavior of sort_cols was supposed to be or if it should be restored somehow.
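
To sanity-check my reading of it: as far as I can tell the list() aggregation just preserves whatever row order the groupby sees, so pre-sorting on trip_instance_key / vp_idx before calling condense should be equivalent to what sort_cols did. A toy example (pure pandas, made-up values):

import pandas as pd

# Toy data: one trip with vp rows deliberately out of order.
vp = pd.DataFrame({
    "trip_instance_key": ["a", "a", "a"],
    "vp_idx": [2, 0, 1],
})

# Without sorting, the arrays keep the original (unsorted) row order...
unsorted_arrays = vp.groupby("trip_instance_key").agg({"vp_idx": lambda x: list(x)})
print(unsorted_arrays.vp_idx.iloc[0])   # [2, 0, 1]

# ...so sorting first is what guarantees increasing vp_idx (and timestamps) per trip.
sorted_arrays = (
    vp.sort_values(["trip_instance_key", "vp_idx"])
      .groupby("trip_instance_key")
      .agg({"vp_idx": lambda x: list(x)})
)
print(sorted_arrays.vp_idx.iloc[0])     # [0, 1, 2]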

Yeah, makes sense to look at the logs. Some sort of log checking/comparison step might be cool to make it somewhat automatic (perhaps print something if there's a major deviation like this...).
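
Rough idea of what that could look like, assuming the loguru format above stays the same (the file names and the 50% threshold are just placeholders):

import re

def parse_line(line: str):
    # Pull (step, seconds) out of a line like the ones above; None if it doesn't match.
    try:
        _, msg = line.strip().split(" - ", 1)
        desc, elapsed = msg.rsplit(": ", 1)
        h, m, s = elapsed.split(":")
        # Drop the analysis dates so the same step keys line up month to month.
        step = re.sub(r"\s*\d{4}-\d{2}-\d{2}:?", "", desc).strip()
        return step, int(h) * 3600 + int(m) * 60 + float(s)
    except ValueError:
        return None

def step_durations(log_path: str) -> dict:
    durations = {}
    with open(log_path) as f:
        for line in f:
            parsed = parse_line(line)
            if parsed:
                durations[parsed[0]] = parsed[1]
    return durations

def flag_deviations(current: dict, prior: dict, threshold: float = 0.5) -> None:
    # Placeholder threshold: warn when a step's runtime moved >50% vs. the prior month.
    for step, secs in current.items():
        if step in prior and prior[step] > 0:
            change = (secs - prior[step]) / prior[step]
            if abs(change) > threshold:
                print(f"{step}: {prior[step] / 60:.1f} min -> {secs / 60:.1f} min ({change:+.0%})")

# flag_deviations(step_durations("current_month.log"), step_durations("prior_month.log"))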

Ahh, ok: specifically, the sorting was to guarantee that for each trip, timestamps are increasing (sorting on trip_instance_key and vp_idx). I think I remembered to do the sorting ahead of time, but left the arg there so it could also be set in case the gdf wasn't already sorted.

@edasmalchi
Member Author

  • I would compare the first 3 tables for Dec 2024 and Jan 2025 and look at the number of rows. They should be the same magnitude from month to month. 15M in vp_usable then 12M in vp_usable_dwell, and then vp_condensed_line should be the same number of rows as number of trip_instance_keys.
speeds_tables:
 raw_vp: vp
 usable_vp: vp_usable
 vp_dwell: vp_usable_dwell
 vp_condensed_line: condensed/vp_condensed

These all look about the same to me so far:

https://github.com/cal-itp/data-analyses/blob/jan-open-data/rt_segment_speeds/45_diff_tables.ipynb

@tiffanychu90
Member

  • I would compare the first 3 tables for Dec 2024 and Jan 2025 and look at the number of rows. They should be the same magnitude from month to month. 15M in vp_usable then 12M in vp_usable_dwell, and then vp_condensed_line should be the same number of rows as number of trip_instance_keys.
speeds_tables:
 raw_vp: vp
 usable_vp: vp_usable
 vp_dwell: vp_usable_dwell
 vp_condensed_line: condensed/vp_condensed

These all look about the same to me so far:

https://github.com/cal-itp/data-analyses/blob/jan-open-data/rt_segment_speeds/45_diff_tables.ipynb

Then I would move further down into rt_segment_speeds portion to start debugging.

Since the logs for the times in nearest_vp_to_stop and interpolate_stop_arrival look similar to previous months for rt_stop_times / speedmap_segments, I would actually look next at the average_speedmap_segments script and compare the number of rows, shapes, route-dir combinations, operators, everything, to see where you're getting a huge drop-off. average_speedmap_segments starts with a trip-stop grain, so if all your trips are represented, then it's definitely something in the averaging portion.

If it's not there, then I'd move backwards from there into other staged parquets to see.
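
Something like this is the comparison I'd start with; a sketch only, since the exact file names and the columns at that grain (shape_array_key, schedule_gtfs_dataset_key, etc.) may differ slightly from what I've written:

import pandas as pd

# Assumed staged outputs at trip-stop grain; adjust names/paths as needed.
DEC = "speedmap_segments_2024-12-11.parquet"
JAN = "speedmap_segments_2025-01-15.parquet"

def grain_summary(path: str) -> pd.Series:
    df = pd.read_parquet(path)
    return pd.Series({
        "rows": len(df),
        "trips": df.trip_instance_key.nunique(),
        "shapes": df.shape_array_key.nunique(),
        "route_dir": df[["route_id", "direction_id"]].drop_duplicates().shape[0],
        "operators": df.schedule_gtfs_dataset_key.nunique(),
    }, name=path)

# pd.concat([grain_summary(DEC), grain_summary(JAN)], axis=1)
# If trips/shapes/operators look comparable here but rows collapse after averaging,
# the problem is in the averaging step rather than upstream.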
