Sliderule Output to S3#

SlideRule can either return in-memory Python objects or return links to Parquet files in Cloud Object Storage (AWS S3). Directing output to S3 can be helpful if you want to collaborate with others on analyzing the same dataset (there is no need for everyone to run the same processing requests). It is also a good approach for analyzing large areas of interest where results may require significant storage space.

This notebook walks through an example of generating SlideRule output as a Parquet file in AWS S3.

Learning Objectives

basics of Parquet and Geoparquet formats
how to output Sliderule results as parquet files
how to pass AWS credentials to write to S3
how to open parquet files on S3 for analysis

Tip

Parquet is cloud-optimized format. At a very basic level, it is for tabular data. Unlike CSV files which are stored as plain text and written row-wise, Parquet is a columnar binary format that is well-suited to hosting on S3 for data analysis. GeoParquet is a standard for storing geospatial vector data (Points, Lines, Polygons) in Parquet files.

Sliderule documentation has an extensive description of Parquet. And a Tutorial with code examples!

import os
os.environ['USE_PYGEOS'] = '0'
import geopandas as gpd

import boto3
from sliderule import sliderule, icesat2
import s3fs

Set Area of Interest#

We will use a geojson file from the Sliderule GitHub Repository over Grand Mesa, Colorado:

gfa = gpd.read_file('https://raw.githubusercontent.com/ICESat2-SlideRule/sliderule-python/main/data/grandmesa.geojson')

folium_map = gfa.explore(tiles="Stamen Terrain", 
                         style_kwds=dict(fill=False, color='magenta'),
                        )
folium_map

Make this Notebook Trusted to load map: File -> Trust Notebook

Configure SlideRule#

Below we connect to the SlideRule Server and initialize processing parameters for determining precise elevations from ATL03 photon returns. Read more about processing parameters In the SlideRule Docs.

icesat2.init("slideruleearth.io")

parms = {
    "poly": sliderule.toregion(gfa)["poly"],
    "srt": icesat2.SRT_LAND,
    "cnf": icesat2.CNF_SURFACE_HIGH,
    "len": 40.0,
    "res": 20.0,
    "maxi": 6
}

Get Temporary AWS Credentials (JupyterHub)#

Warning

The code cell below will only work on CryoCloud JupyterHub where the required environment variables exist (AWS_WEB_IDENTITY_TOKEN_FILE, AWS_ROLE_ARN, and JUPYTERHUB_CLIENT_ID)

client = boto3.client('sts')

with open(os.environ['AWS_WEB_IDENTITY_TOKEN_FILE']) as f:
    TOKEN = f.read()

response = client.assume_role_with_web_identity(
    RoleArn=os.environ['AWS_ROLE_ARN'],
    RoleSessionName=os.environ['JUPYTERHUB_CLIENT_ID'],
    WebIdentityToken=TOKEN,
    DurationSeconds=3600
)

ACCESS_KEY_ID = response['Credentials']['AccessKeyId']
SECRET_ACCESS_KEY_ID = response['Credentials']['SecretAccessKey']
SESSION_TOKEN = response['Credentials']['SessionToken']

Configure Parquet and S3 Output#

By default, SlideRule processing returns a geopandas.GeoDataFrame. Below we instead request output be directly to a GeoParquet file in the s3://nasa-cryo-scratch bucket. We must pass AWS Credentials so that the SlideRule servers processing ATL03 data have access to this bucket.

S3_OUTPUT = 's3://nasa-cryo-scratch/sliderule-example/grandmesa.parquet'

parms["output"] = {
    "path": S3_OUTPUT, 
    "format": "parquet", 
    "open_on_complete": False,
    "region": "us-west-2",
    "credentials": {
         "aws_access_key_id": ACCESS_KEY_ID,
         "aws_secret_access_key": SECRET_ACCESS_KEY_ID,
         "aws_session_token": SESSION_TOKEN
     }
}

Run SlideRule processing#

Now run SlideRule! Depending on the area of interest, this can take some time. For the Grand Mesa polygon we’re using, as of 2023-07-21 there are 106 ATL03 Granules totaling over 237 Gigabytes in size! Fortunately, SlideRule runs computations on a scalable cluster in AWS us-west-2 so we get out results in less than a minute! In this case, the results are simply a URL pointing to the parquet file on S3, which we will load next.

%%time

# NOTE: At the time of writing the latest ATL03 archive release is 'version 006'
# https://nsidc.org/data/atl03/versions/6
output_path = icesat2.atl06p(parms,  version='006')
output_path

CPU times: user 231 ms, sys: 1.84 ms, total: 233 ms
Wall time: 28.1 s

's3://nasa-cryo-scratch/sliderule-example/grandmesa.parquet'

Read output from S3#

Finally, we can read these results. Geopandas has a read_parquet() function that accepts s3:// URLs:

gf = gpd.read_parquet(output_path)

print("Start:", gf.index.min().strftime('%Y-%m-%d'))
print("End:", gf.index.max().strftime('%Y-%m-%d'))
print("Reference Ground Tracks: {}".format(gf["rgt"].unique()))
print("Cycles: {}".format(gf["cycle"].unique()))
print("Elevation Measurements: {} ".format(gf.shape[0]))
gf.head(2)

Start: 2018-10-16
End: 2023-03-07
Reference Ground Tracks: [1156  714  272 1179  737  211  295  234]
Cycles: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 17 16 15 18]
Elevation Measurements: 328557 

	extent_id	distance	segment_id	rgt	rms_misfit	gt	dh_fit_dy	n_fit_photons	h_sigma	pflags	spot	h_mean	cycle	w_surface_window_final	dh_fit_dx	geometry
time
2018-12-13 08:01:25.295016704	5206161266950799362	1.571322e+07	784512	1156	0.291742	50	0.0	76	0.033481	0	2	1730.918671	1	3.0	0.001898	POINT (-108.08440 39.16745)
2018-12-13 08:01:25.638596608	5206161266950799363	1.571324e+07	784513	1156	0.358982	60	0.0	313	0.021044	0	1	1725.510855	1	3.0	0.016074	POINT (-108.08544 39.16734)

100,000+ is a lot of points to visualize! Let’s randomly sample 1000 of them and plot on our map

# Need to turn timestamps into strings first
points = gf.sample(1000).reset_index()
points['time'] = points.time.dt.strftime('%Y-%m-%d')
points.explore(column='h_mean', m=folium_map)

Make this Notebook Trusted to load map: File -> Trust Notebook

Summary#

We processed all ATL03 v006 data covering Grand Mesa, Colorado spanning 2018-10-16 to 2023-03-07 to ATL06-SR elevations. We output our results in GeoParquet format to an AWS S3 bucket and quickly visualized some of the results.

Next steps#

For very large Parquet datasets, the dask-geopandas library has advanced functionality for distributed computations. Check out their documentation

ICESat-2 Hackweek 2023

Sliderule Output to S3

Contents

Sliderule Output to S3#

Set Area of Interest#

Configure SlideRule#

Get Temporary AWS Credentials (JupyterHub)#

Configure Parquet and S3 Output#

Run SlideRule processing#

Read output from S3#

Summary#

Next steps#