Sliderule Output to S3
Contents
Sliderule Output to S3#
SlideRule can either return in-memory Python objects or return links to Parquet files in Cloud Object Storage (AWS S3). Directing output to S3 can be helpful if you want to collaborate with others on analyzing the same dataset (there is no need for everyone to run the same processing requests). It is also a good approach for analyzing large areas of interest where results may require significant storage space.
This notebook walks through an example of generating SlideRule output as a Parquet file in AWS S3.
Learning Objectives
basics of Parquet and Geoparquet formats
how to output Sliderule results as parquet files
how to pass AWS credentials to write to S3
how to open parquet files on S3 for analysis
Tip
Parquet
is cloud-optimized format. At a very basic level, it is for tabular data. Unlike CSV files which are stored as plain text and written row-wise, Parquet is a columnar binary format that is well-suited to hosting on S3 for data analysis. GeoParquet
is a standard for storing geospatial vector data (Points, Lines, Polygons) in Parquet files.
Sliderule documentation has an extensive description of Parquet. And a Tutorial with code examples!
import os
os.environ['USE_PYGEOS'] = '0'
import geopandas as gpd
import boto3
from sliderule import sliderule, icesat2
import s3fs
Set Area of Interest#
We will use a geojson file from the Sliderule GitHub Repository over Grand Mesa, Colorado:
gfa = gpd.read_file('https://raw.githubusercontent.com/ICESat2-SlideRule/sliderule-python/main/data/grandmesa.geojson')
folium_map = gfa.explore(tiles="Stamen Terrain",
style_kwds=dict(fill=False, color='magenta'),
)
folium_map
Configure SlideRule#
Below we connect to the SlideRule Server and initialize processing parameters for determining precise elevations from ATL03 photon returns. Read more about processing parameters In the SlideRule Docs.
icesat2.init("slideruleearth.io")
parms = {
"poly": sliderule.toregion(gfa)["poly"],
"srt": icesat2.SRT_LAND,
"cnf": icesat2.CNF_SURFACE_HIGH,
"len": 40.0,
"res": 20.0,
"maxi": 6
}
Get Temporary AWS Credentials (JupyterHub)#
Warning
The code cell below will only work on CryoCloud JupyterHub where the required environment variables exist (AWS_WEB_IDENTITY_TOKEN_FILE
, AWS_ROLE_ARN
, and JUPYTERHUB_CLIENT_ID
)
client = boto3.client('sts')
with open(os.environ['AWS_WEB_IDENTITY_TOKEN_FILE']) as f:
TOKEN = f.read()
response = client.assume_role_with_web_identity(
RoleArn=os.environ['AWS_ROLE_ARN'],
RoleSessionName=os.environ['JUPYTERHUB_CLIENT_ID'],
WebIdentityToken=TOKEN,
DurationSeconds=3600
)
ACCESS_KEY_ID = response['Credentials']['AccessKeyId']
SECRET_ACCESS_KEY_ID = response['Credentials']['SecretAccessKey']
SESSION_TOKEN = response['Credentials']['SessionToken']
Configure Parquet and S3 Output#
By default, SlideRule processing returns a geopandas.GeoDataFrame. Below we instead request output be directly to a GeoParquet file in the s3://nasa-cryo-scratch
bucket. We must pass AWS Credentials so that the SlideRule servers processing ATL03 data have access to this bucket.
S3_OUTPUT = 's3://nasa-cryo-scratch/sliderule-example/grandmesa.parquet'
parms["output"] = {
"path": S3_OUTPUT,
"format": "parquet",
"open_on_complete": False,
"region": "us-west-2",
"credentials": {
"aws_access_key_id": ACCESS_KEY_ID,
"aws_secret_access_key": SECRET_ACCESS_KEY_ID,
"aws_session_token": SESSION_TOKEN
}
}
Run SlideRule processing#
Now run SlideRule! Depending on the area of interest, this can take some time. For the Grand Mesa polygon we’re using, as of 2023-07-21 there are 106 ATL03 Granules totaling over 237 Gigabytes in size! Fortunately, SlideRule runs computations on a scalable cluster in AWS us-west-2 so we get out results in less than a minute! In this case, the results are simply a URL pointing to the parquet file on S3, which we will load next.
%%time
# NOTE: At the time of writing the latest ATL03 archive release is 'version 006'
# https://nsidc.org/data/atl03/versions/6
output_path = icesat2.atl06p(parms, version='006')
output_path
CPU times: user 231 ms, sys: 1.84 ms, total: 233 ms
Wall time: 28.1 s
's3://nasa-cryo-scratch/sliderule-example/grandmesa.parquet'
Read output from S3#
Finally, we can read these results. Geopandas has a read_parquet()
function that accepts s3://
URLs:
gf = gpd.read_parquet(output_path)
print("Start:", gf.index.min().strftime('%Y-%m-%d'))
print("End:", gf.index.max().strftime('%Y-%m-%d'))
print("Reference Ground Tracks: {}".format(gf["rgt"].unique()))
print("Cycles: {}".format(gf["cycle"].unique()))
print("Elevation Measurements: {} ".format(gf.shape[0]))
gf.head(2)
Start: 2018-10-16
End: 2023-03-07
Reference Ground Tracks: [1156 714 272 1179 737 211 295 234]
Cycles: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 17 16 15 18]
Elevation Measurements: 328557
extent_id | distance | segment_id | rgt | rms_misfit | gt | dh_fit_dy | n_fit_photons | h_sigma | pflags | spot | h_mean | cycle | w_surface_window_final | dh_fit_dx | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
time | ||||||||||||||||
2018-12-13 08:01:25.295016704 | 5206161266950799362 | 1.571322e+07 | 784512 | 1156 | 0.291742 | 50 | 0.0 | 76 | 0.033481 | 0 | 2 | 1730.918671 | 1 | 3.0 | 0.001898 | POINT (-108.08440 39.16745) |
2018-12-13 08:01:25.638596608 | 5206161266950799363 | 1.571324e+07 | 784513 | 1156 | 0.358982 | 60 | 0.0 | 313 | 0.021044 | 0 | 1 | 1725.510855 | 1 | 3.0 | 0.016074 | POINT (-108.08544 39.16734) |
100,000+ is a lot of points to visualize! Let’s randomly sample 1000 of them and plot on our map
# Need to turn timestamps into strings first
points = gf.sample(1000).reset_index()
points['time'] = points.time.dt.strftime('%Y-%m-%d')
points.explore(column='h_mean', m=folium_map)
Summary#
We processed all ATL03 v006 data covering Grand Mesa, Colorado spanning 2018-10-16 to 2023-03-07 to ATL06-SR elevations. We output our results in GeoParquet format to an AWS S3 bucket and quickly visualized some of the results.
Next steps#
For very large Parquet datasets, the dask-geopandas
library has advanced functionality for distributed computations. Check out their documentation