Deferred execution #1

lopezvoliver · 2023-09-20T14:22:40Z

Hi, I worked on a major, but purely technical, update to the geeSEBAL Python API. The revision changes only how geeSEBAL communicates with the Earth Engine API; nothing in the SEBAL algorithm itself was changed. The goal was to defer the GEE processing as much as possible (e.g. by getting rid of .getInfo() calls). The improvement in runtime far outweighs some minor breaks in compatibility with the current version.

Here's the breakdown of the changes:

`tools.fexp_sensible_heat_flux`:

The iterative process was updated to use ee.ImageCollection.iterate. With this change, and with all .getInfo() calls removed, the function is fully asynchronous (execution is deferred until the result is requested).

Additionally, a max_iterations parameter (default: 15) was added.
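The shape of that change can be sketched locally. The snippet below is a plain-Python analogy, not the geeSEBAL code: `functools.reduce` stands in for `ee.ImageCollection.iterate`, and the `step` function (a Newton update for a square root) is only a placeholder for the sensible heat flux update, which is not reproduced here. The names `step` and `iterate_fixed` are hypothetical.

```python
from functools import reduce


def step(state, _iteration):
    """One update of the carried state (placeholder: Newton's method for sqrt(2))."""
    return 0.5 * (state + 2.0 / state)


def iterate_fixed(initial, max_iterations=15):
    # Fold `step` over a fixed-length sequence, carrying the state along.
    # No convergence check runs between steps on the client, which is what
    # allows the equivalent Earth Engine expression to stay deferred.
    return reduce(step, range(max_iterations), initial)


result = iterate_fixed(1.0)
print(result)
```

Note that in Earth Engine the callback passed to `iterate` receives the current element first and the accumulated state second, the reverse of `reduce`.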

`image.py`

Defined a new function, sebal, that implements the SEBAL algorithm for a single ee.Image, assuming all the necessary inputs are included as bands within the image.

`Image` class

Revised the code so that it builds the ee.Image with all the Landsat inputs (including T_RAD) and then calls the sebal function.

Note that because I also removed the calls to .getInfo(), most items in the Image object are now deferred and thus return ee objects. For example, Image.landsat_version now returns an ee.String. The user can call .getInfo() on it to get the result when needed. This break in compatibility is justified, as the improvement in runtime far outweighs the disadvantage.
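A toy model may help show where the time goes. The `Deferred` class below is hypothetical and runs locally; it is not part of `ee` or geeSEBAL, and the string it returns is a made-up placeholder. It only mimics the pattern: building the object costs nothing, and the work happens when `.getInfo()` is called.

```python
class Deferred:
    """Hypothetical stand-in for an ee object: it holds a recipe, not a value."""

    round_trips = 0  # counts how many times a value was actually requested

    def __init__(self, compute):
        self._compute = compute  # stored, not executed

    def getInfo(self):
        # The only place where work happens (a server request, in the real API).
        Deferred.round_trips += 1
        return self._compute()


landsat_version = Deferred(lambda: "LANDSAT_8")  # placeholder value
trips_after_build = Deferred.round_trips  # still 0: nothing was evaluated yet
value = landsat_version.getInfo()  # evaluation happens here
print(trips_after_build, Deferred.round_trips, value)
```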

Runtime improvement for `Image`

Here's the comparison using timeit on a simple instance of Image:

%timeit foo=Image("LANDSAT/LC08/C01/T1_SR/LC08_221071_20190714")

serveronly branch:

25.2 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

master branch:

12.7 s ± 1.97 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

That is, about 500 times faster.

When used in a simple script that starts an Export task for a single ee.Image, the improvement is less impressive (11 s vs. 20 s), because of the overhead of initializing the ee library and creating the Export task. However, as you will see below, the difference matters much more for Collection and TimeSeries.

Comparison

The LANDSAT/LC08/C01/T1_SR/LC08_221071_20190714 image was exported using the master and serveronly branches (only the R, GR, B, NDVI, and ET_24h bands were exported). They are publicly available here:

This gee code snapshot was prepared to compare the results, which are identical.

`landsatcollection.py`

Additions

set_landsat_index: This simple function is necessary to keep the original index from a Landsat image, when collections are merged or joined.

fexp_trad_8, fexp_trad_7, and fexp_trad_5: These new functions return an ee.ImageCollection where each image has the corresponding T_RAD band.

fexp_collection_filter: This new function filters a given ee.ImageCollection by a user-defined cloud cover threshold and by start and end dates, and optionally by path, row, and an ee.Geometry (e.g. a coordinate).

Changes

All fexp_landsat_NPathRow and fexp_landsat_NCoordinate functions (where N is one of {5,7,8}) were replaced by a single fexp_landsat_N function. These functions return the corresponding C01/T1_SR collection filtered using fexp_collection_filter, with the bands renamed as in the original code.

`Collection` class

The init method for this class was modified to leverage the fexp_landsat_N and fexp_trad_N functions, which are then joined into a single collection using ee.Join.inner. Furthermore, the python for loop was replaced by ee.ImageCollection.map, making use of the image.sebal function.

For compatibility, the Collection_ET item is given as an ee.Image (the ET_24h collection is cast into an ee.Image as bands).
As was the case with the Image class, breaking compatibility with some items in the Collection object was unavoidable.
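As a rough local analogy for the join step (plain Python, not the Earth Engine calls used in this PR), an inner join on `LANDSAT_INDEX` keeps only the scenes present in both collections and merges their bands. The function name, index labels, and band values below are made up for illustration.

```python
def inner_join(landsat, trad, key="LANDSAT_INDEX"):
    """Pair records that share `key` and merge their bands; drop unmatched ones."""
    trad_by_key = {record[key]: record for record in trad}
    joined = []
    for record in landsat:
        match = trad_by_key.get(record[key])
        if match is not None:
            # Merge the two records into one, like adding T_RAD as a band.
            joined.append({**record, **match})
    return joined


# Hypothetical scenes: only "scene_A" has a matching T_RAD record.
landsat = [
    {"LANDSAT_INDEX": "scene_A", "NDVI": 0.6},
    {"LANDSAT_INDEX": "scene_B", "NDVI": 0.4},
]
trad = [{"LANDSAT_INDEX": "scene_A", "T_RAD": 300.0}]

merged = inner_join(landsat, trad)
print(merged)
```

This is also why `set_landsat_index` matters: the join needs a property that survives merging and renaming of the collections.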

Runtime improvement for `Collection`

The following short test (3 images only) was used:

%timeit f=Collection(2019,7,1,2019,8,1,15,path=221,row=71)

serveronly branch:

87.4 ms ± 4.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

master branch:

32.1 s ± 5.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

That is, 367 times faster. Moreover, the time to generate a longer collection barely increases for the serveronly branch:

%timeit f=Collection(2000,1,1,2010,5,6,15,path=221,row=71)

87.1 ms ± 527 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

For the master branch, in contrast, the time does increase with a longer collection. For comparison, three images (only the ET_24h band) were exported using the master and serveronly branches. They are available in these image collections:

`TimeseriesAsync`

The TimeSeries class was left largely untouched (except for minor adjustments to use the fexp_landsat_N functions). The reason is that I feel the user would expect this class to simply get the time series on demand. However, a new TimeSeriesAsync class was defined.

This new class makes use of the Collection class at a given point, then selects the ET_24h band and performs reduceRegion on it. The et_collection item contains the result of this operation, while the Collection item contains the result of the call to the Collection class.

Additionally, three lists were defined that somewhat mimic the behavior of the TimeSeries class. However, these are returned as ee.Lists, so the user has the option to use .getInfo() on them:

List_ET is the result of et_collection.aggregate_array("ET_24h")
List_Date is the result of et_collection.aggregate_array("date")
List_index is the result of et_collection.aggregate_array("LANDSAT_INDEX")

Finally, two methods were added to export the ET table (date, LANDSAT_INDEX, and ET_24h columns) as a CSV file. This is the recommended way to export the table.

toDrive
toCloudStorage

The following example demonstrates the use of TimeSeriesAsync:

import ee
from etbrasil.geesebal import TimeSeriesAsync
ee.Initialize()
point=ee.Geometry.Point([-47.4522, -16.240119])
geesebal_timeseries=TimeSeriesAsync(2000,1,1,2010,5,6,15,coordinate=point)
geesebal_timeseries.toDrive("sebal-time-series-async")

This generates a sebal-time-series-async.csv file in Google Drive. The total process (Python runtime + Earth Engine task) took about 2 minutes.

The following example generates the same CSV file, but synchronously (note the .getInfo() calls):

import pandas as pd
import ee
from etbrasil.geesebal import TimeSeriesAsync
ee.Initialize()
point=ee.Geometry.Point([-47.4522, -16.240119])
geesebal_timeseries=TimeSeriesAsync(2000,1,1,2010,5,6,15,coordinate=point)

et_list = geesebal_timeseries.List_ET.getInfo()       
date_list = geesebal_timeseries.List_Date.getInfo()  
landsat_index_list = geesebal_timeseries.List_index.getInfo() 

pd.DataFrame({
    "date": date_list,
    "LANDSAT_INDEX": landsat_index_list,
    "ET_24h": et_list
}).to_csv("sebal-time-series-sync.csv", index=False)

This example took about 3 minutes to run, and the resulting csv file was identical to the previous one.

However, the preferred method should be the asynchronous one, especially for long collections, as described here ("Too many concurrent aggregations" error).

As was the case for Collection, the TimeSeriesAsync runtime is fast and does not depend on the image collection size. Here is my result using timeit:

90.3 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Meanwhile, the current version of geesebal took about 25 minutes ⚠️ on this rather short test (19 images):

import ee
from etbrasil.geesebal import TimeSeries
ee.Initialize()
# Synchronous TimeSeries class from the current (master) version
point=ee.Geometry.Point([-50.161317, -9.824870])
geeSEBAL_Collection=TimeSeries(2019,1,1,2019,12,31,15,point)

That is all for now. I hope I haven't left out any of my changes and that my explanations were clear.

Cheers,

Oliver.

This makes it possible to install geesebal using pip: "git+https://github.com/lopezvoliver/geeSEBAL@serveronly#subdirectory=etbrasil"

lopezvoliver added 19 commits September 18, 2023 14:58

Updated iterative sensible heat flux calculation

ad6a739

Server-only calibrated radiance. Dropping image_toa

55bb475

Select landsat bands using ee.Dictionary

467014e

Cloud mask and albedo mapping based on ee.Algorithms.If

87579c6

Removed geometry_download, NAME_FINAL

6e184ee

Image is now fully asynchronous (no getInfos)

f608f44

Added LANDSAT_INDEX property

03ff36b

Added T_RAD collection functions

72ca30d

Collections now include all required inputs

cdf7d64

Defined sebal algorithm as a separate function.

9b6a528

Collection is now fully asynchronous

4e7f1bd

Include max_iterations as user parameter.

4d894ad

Added Collection_ET as ee.Image (backwards compatible)

b7756fc

Single collection filter path,row,coord

d1fd23d

Added TimeSeriesAsync

5837e79

Updated init to include TimeSeriesAsync; README

994489e

Added max_iterations to TimeSeriesAsync

fbb5715

Fixed typo in self.coordinate

aa8c382

Create setup.py

2fb293c
