
Jan open data -- draft for now #1356

Draft: wants to merge 15 commits into main
Conversation

edasmalchi
Member

Finished running the pipeline with some necessary tweaks but creating this as a draft to get some additional review before considering it done.

Speedmaps had a lot of missing data (the site will revert to December for now), but I don't know whether this is due to:

  • My downloading vehicle positions the day after (last Thursday, with an analysis date of last Wednesday -- is that just too soon?)
  • Changes I made here to fix various errors
  • Something else

I'll try redownloading vehicle positions and rerunning rt_segment_speeds later today, but @tiffanychu90, if anything I did here jumps out at you as a possible cause, please let me know.

Otherwise I'll make a separate issue once I learn a little more.

Example of missing data for SFMTA; I looked at a few others and the pattern seemed to be the same:

segments_new.time_of_day.value_counts() #  january

Early AM    686
Evening     572
PM Peak     569
AM Peak     539
Midday      483
Owl         259
Name: time_of_day, dtype: int64

speedmap_segs.time_of_day.value_counts() #  december

Early AM    5194
PM Peak     5189
Evening     4836
AM Peak     4645
Midday      4296
Owl         2044
Name: time_of_day, dtype: int64
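
Side note: a quick way to quantify the drop (just a sketch, assuming segments_new and speedmap_segs are the January/December frames from the counts above):

import pandas as pd

# Line up January vs. December time_of_day counts and the share retained.
counts = pd.concat(
    {
        "january": segments_new.time_of_day.value_counts(),
        "december": speedmap_segs.time_of_day.value_counts(),
    },
    axis=1,
)
counts["jan_pct_of_dec"] = (counts.january / counts.december * 100).round(1)
counts

Every time-of-day bucket comes out around 11-13% of December's volume, so the drop looks fairly uniform across periods rather than concentrated in one bin.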

@tiffanychu90
Member

tiffanychu90 commented Jan 22, 2025

My thoughts from looking at the changes:

  • I think I saw that vp_idx is commented out...I'm wondering how you're able to link back to the vp_usable_dwell table afterward? The nearest neighbor approach keys into that column and uses it to link vp without carrying a mess of columns. It also takes varying forms (it can be an array in the intermediate steps), so leaving it off and trying to merge it in later would not be equivalent.
  • The logs look suspiciously fast...even compared to prior months.
    • Having run this for the last 2 years, I'd say that times do not vary much, and any time there's a glaring drop in times without big refactors, I would check each staged parquet to see where the difference occurs.
  • I would compare the first 3 tables for Dec 2024 and Jan 2025 and look at the number of rows. They should be the same order of magnitude from month to month: ~15M in vp_usable, then ~12M in vp_usable_dwell, and vp_condensed_line should have the same number of rows as the number of trip_instance_keys (see the row-count sketch after this list).
    speeds_tables:
     raw_vp: vp
     usable_vp: vp_usable
     vp_dwell: vp_usable_dwell
     vp_condensed_line: condensed/vp_condensed
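
In case it's helpful, the kind of row-count check I mean is sketched below (the GCS prefix and the date-suffixed file names are assumptions on my part, so point it at wherever the staged parquets actually live):

import gcsfs
import pyarrow.parquet as pq

# Assumed prefix/naming for the staged parquets; adjust to the real paths.
SEGMENT_GCS = "gs://calitp-analytics-data/data-analyses/rt_segment_speeds/"
TABLES = ["vp_usable", "vp_usable_dwell", "condensed/vp_condensed"]

fs = gcsfs.GCSFileSystem()

def row_counts(analysis_date: str) -> dict:
    # Read only the parquet metadata so we don't pull ~15M rows locally.
    counts = {}
    for tbl in TABLES:
        with fs.open(f"{SEGMENT_GCS}{tbl}_{analysis_date}.parquet", "rb") as f:
            counts[tbl] = pq.ParquetFile(f).metadata.num_rows
    return counts

# Compare row_counts("2024-12-11") vs row_counts("2025-01-15"):
# expect the same order of magnitude month to month (~15M in vp_usable,
# ~12M in vp_usable_dwell, and one vp_condensed row per trip_instance_key).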
    

Specifically...these logs jump out to me:

  • The very large decrease in times seems to indicate that there's some large reduction in an input df, so either that's happening in one of the scripts captured in vp_preprocessing.log or in the initial download (your initial guess). It could be in either place, but the other logs don't show that large a decrease, which seems to rule out the downloading of vp.
  • Why would something that takes 30-40 min take 5 min now?
2024-12-18 10:42:34.461 | INFO     | nearest_vp_to_stop:nearest_neighbor_for_stop:189 - nearest neighbor for stop_segments 2024-12-11: 0:36:50.316084
2024-12-18 14:32:17.520 | INFO     | nearest_vp_to_stop:nearest_neighbor_for_stop:189 - nearest neighbor for rt_stop_times 2024-12-11: 0:40:52.717197
2024-12-18 14:54:17.879 | INFO     | nearest_vp_to_stop:nearest_neighbor_for_stop:189 - nearest neighbor for speedmap_segments 2024-12-11: 0:08:09.784862
2025-01-21 13:27:30.117 | INFO     | nearest_vp_to_stop:nearest_neighbor_for_stop:188 - nearest neighbor for stop_segments 2025-01-15: 0:05:06.574214
2025-01-21 17:11:06.855 | INFO     | nearest_vp_to_stop:nearest_neighbor_for_stop:188 - nearest neighbor for stop_segments 2025-01-15: 0:04:14.577150
2025-01-21 18:22:23.308 | INFO     | nearest_vp_to_stop:nearest_neighbor_for_stop:188 - nearest neighbor for rt_stop_times 2025-01-15: 0:39:52.283432
2025-01-21 18:47:14.614 | INFO     | nearest_vp_to_stop:nearest_neighbor_for_stop:188 - nearest neighbor for speedmap_segments 2025-01-15: 0:08:43.206096
  • Why would something that took 15 min now take 1 min?
2024-12-17 17:43:47.305 | INFO     | interpolate_stop_arrival:interpolate_stop_arrivals:279 - interpolate arrivals for stop_segments 2024-12-11:  2024-12-11: 0:14:09.409333
2024-12-18 10:52:54.623 | INFO     | interpolate_stop_arrival:interpolate_stop_arrivals:236 - interpolate arrivals for stop_segments 2024-12-11:  2024-12-11: 0:10:20.061296
2024-12-18 14:42:51.535 | INFO     | interpolate_stop_arrival:interpolate_stop_arrivals:236 - interpolate arrivals for rt_stop_times 2024-12-11:  2024-12-11: 0:10:33.922215
2024-12-18 14:56:54.836 | INFO     | interpolate_stop_arrival:interpolate_stop_arrivals:236 - interpolate arrivals for speedmap_segments 2024-12-11:  2024-12-11: 0:02:36.918147
2025-01-21 17:12:27.738 | INFO     | interpolate_stop_arrival:interpolate_stop_arrivals:236 - interpolate arrivals for stop_segments 2025-01-15:  2025-01-15: 0:01:20.848747
2025-01-21 18:35:30.275 | INFO     | interpolate_stop_arrival:interpolate_stop_arrivals:236 - interpolate arrivals for rt_stop_times 2025-01-15:  2025-01-15: 0:13:06.847024
2025-01-21 18:49:29.708 | INFO     | interpolate_stop_arrival:interpolate_stop_arrivals:236 - interpolate arrivals for speedmap_segments 2025-01-15:  2025-01-15: 0:02:15.045884

@edasmalchi
Member Author

edasmalchi commented Jan 22, 2025

Thanks for the pointers!

My thoughts from looking at the changes:

  • I think I saw vp_idx is commented out...I'm wondering how you're able to link back to the vp_usable_dwell table after? The nearest neighbor approach keys into that and uses it to link vp without carrying a mess of columns. And it takes varying forms (it can be an array in the intermediate steps, so leaving it off and trying to merge it in would not be equivalent)
  • The logs look suspiciously fast...even compared to prior months.

Hm, I did comment out the sort_cols argument since it's not part of vp_transform.condense_point_geom_to_line.

vp_condensed = vp_transform.condense_point_geom_to_line(
    vp_gdf,
    group_cols = ["trip_instance_key"],
    # sort_cols = ["trip_instance_key", "vp_idx"], not used?
    array_cols = ["vp_idx", "geometry"]
)

def condense_point_geom_to_line(
    df: gpd.GeoDataFrame,
    group_cols: list,
    geom_col: str = "geometry",
    array_cols: list = []
) -> gpd.GeoDataFrame:

vp_idx was in both sort_cols and array_cols, so it seemed to me like it would still be expanded here:

**{c: lambda x: list(x) for c in array_cols},

It's not clear to me what the behavior of sort_cols was supposed to be or if it should be restored somehow.
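
To sanity-check my reading of it: as far as I can tell the list() aggregation just preserves whatever row order the groupby sees, so pre-sorting on trip_instance_key / vp_idx before calling condense should be equivalent to what sort_cols did. A toy example (pure pandas, made-up values):

import pandas as pd

# Toy data: one trip with vp rows deliberately out of order.
vp = pd.DataFrame({
    "trip_instance_key": ["a", "a", "a"],
    "vp_idx": [2, 0, 1],
})

# Without sorting, the arrays keep the original (unsorted) row order...
unsorted_arrays = vp.groupby("trip_instance_key").agg({"vp_idx": lambda x: list(x)})
print(unsorted_arrays.vp_idx.iloc[0])   # [2, 0, 1]

# ...so sorting first is what guarantees increasing vp_idx (and timestamps) per trip.
sorted_arrays = (
    vp.sort_values(["trip_instance_key", "vp_idx"])
      .groupby("trip_instance_key")
      .agg({"vp_idx": lambda x: list(x)})
)
print(sorted_arrays.vp_idx.iloc[0])     # [0, 1, 2]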

Yeah, makes sense to look at the logs. Some sort of log checking/comparison step might be cool to make it somewhat automatic (perhaps print something if there's a major deviation like this...).
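
Rough idea of what that could look like, assuming the loguru format above stays the same (the file names and the 50% threshold are just placeholders):

import re

def parse_line(line: str):
    # Pull (step, seconds) out of a line like the ones above; None if it doesn't match.
    try:
        _, msg = line.strip().split(" - ", 1)
        desc, elapsed = msg.rsplit(": ", 1)
        h, m, s = elapsed.split(":")
        # Drop the analysis dates so the same step keys line up month to month.
        step = re.sub(r"\s*\d{4}-\d{2}-\d{2}:?", "", desc).strip()
        return step, int(h) * 3600 + int(m) * 60 + float(s)
    except ValueError:
        return None

def step_durations(log_path: str) -> dict:
    durations = {}
    with open(log_path) as f:
        for line in f:
            parsed = parse_line(line)
            if parsed:
                durations[parsed[0]] = parsed[1]
    return durations

def flag_deviations(current: dict, prior: dict, threshold: float = 0.5) -> None:
    # Placeholder threshold: warn when a step's runtime moved >50% vs. the prior month.
    for step, secs in current.items():
        if step in prior and prior[step] > 0:
            change = (secs - prior[step]) / prior[step]
            if abs(change) > threshold:
                print(f"{step}: {prior[step] / 60:.1f} min -> {secs / 60:.1f} min ({change:+.0%})")

# flag_deviations(step_durations("current_month.log"), step_durations("prior_month.log"))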

Ahh, ok: specifically, the sorting was to guarantee that for each trip, timestamps are increasing (sorting on trip_instance_key and vp_idx). I think I remembered to do the sorting ahead of time, but left the arg there so it could also be set in case the gdf wasn't already sorted.

@edasmalchi
Member Author

  • I would compare the first 3 tables for Dec 2024 and Jan 2025 and look at the number of rows. They should be the same magnitude from month to month. 15M in vp_usable then 12M in vp_usable_dwell, and then vp_condensed_line should be the same number of rows as number of trip_instance_keys.
speeds_tables:
 raw_vp: vp
 usable_vp: vp_usable
 vp_dwell: vp_usable_dwell
 vp_condensed_line: condensed/vp_condensed

These all look about the same to me so far:

https://github.com/cal-itp/data-analyses/blob/jan-open-data/rt_segment_speeds/45_diff_tables.ipynb

@tiffanychu90
Member

  • I would compare the first 3 tables for Dec 2024 and Jan 2025 and look at the number of rows. They should be the same magnitude from month to month. 15M in vp_usable then 12M in vp_usable_dwell, and then vp_condensed_line should be the same number of rows as number of trip_instance_keys.
speeds_tables:
 raw_vp: vp
 usable_vp: vp_usable
 vp_dwell: vp_usable_dwell
 vp_condensed_line: condensed/vp_condensed

These all look about the same to me so far:

https://github.com/cal-itp/data-analyses/blob/jan-open-data/rt_segment_speeds/45_diff_tables.ipynb

Then I would move further down into rt_segment_speeds portion to start debugging.

Since the logs for the times in nearest_vp_to_stop and interpolate_stop_arrival look similar to previous months for rt_stop_times / speedmap_segments, I would actually look next at the average_speedmap_segments script and compare the number of rows, shapes, route-dir combinations, operators, everything, to see where you're getting a huge drop-off. average_speedmap_segments starts with a trip-stop grain, so if all your trips are represented, then it's definitely something in the averaging portion.

If it's not there, then I'd move backwards from there into other staged parquets to see.
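
Something like this is the comparison I'd start with; a sketch only, since the exact file names and the columns at that grain (shape_array_key, schedule_gtfs_dataset_key, etc.) may differ slightly from what I've written:

import pandas as pd

# Assumed staged outputs at trip-stop grain; adjust names/paths as needed.
DEC = "speedmap_segments_2024-12-11.parquet"
JAN = "speedmap_segments_2025-01-15.parquet"

def grain_summary(path: str) -> pd.Series:
    df = pd.read_parquet(path)
    return pd.Series({
        "rows": len(df),
        "trips": df.trip_instance_key.nunique(),
        "shapes": df.shape_array_key.nunique(),
        "route_dir": df[["route_id", "direction_id"]].drop_duplicates().shape[0],
        "operators": df.schedule_gtfs_dataset_key.nunique(),
    }, name=path)

# pd.concat([grain_summary(DEC), grain_summary(JAN)], axis=1)
# If trips/shapes/operators look comparable here but rows collapse after averaging,
# the problem is in the averaging step rather than upstream.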
