SlideRule Output to S3#

SlideRule can either return results as in-memory Python objects or write them to Parquet files in cloud object storage (AWS S3) and return a link. Directing output to S3 can be helpful if you want to collaborate with others on analyzing the same dataset (there is no need for everyone to run the same processing requests). It is also a good approach for analyzing large areas of interest, where results may require significant storage space.

This notebook walks through an example of generating SlideRule output as a Parquet file in AWS S3.

Learning Objectives

  • basics of the Parquet and GeoParquet formats

  • how to output SlideRule results as Parquet files

  • how to pass AWS credentials to write to S3

  • how to open Parquet files on S3 for analysis

Tip

Parquet is a cloud-optimized format for tabular data. Unlike CSV files, which are stored as plain text and written row-wise, Parquet is a columnar binary format that is well suited to hosting on S3 for data analysis. GeoParquet is a standard for storing geospatial vector data (Points, Lines, Polygons) in Parquet files.

The SlideRule documentation has an extensive description of Parquet, and a tutorial with code examples!
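To make this concrete, here is a minimal sketch of writing and reading a small GeoParquet file locally with GeoPandas (the toy data and filename are made up for illustration):

import geopandas as gpd
from shapely.geometry import Point

# A tiny GeoDataFrame with one attribute column and point geometries
toy = gpd.GeoDataFrame(
    {"h_mean": [1730.9, 1725.5]},
    geometry=[Point(-108.084, 39.167), Point(-108.085, 39.167)],
    crs="EPSG:4326",
)
toy.to_parquet("toy.parquet")    # write GeoParquet
gpd.read_parquet("toy.parquet")  # read it back as a GeoDataFrame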

import os
# Use Shapely instead of the deprecated PyGEOS backend (must be set before importing geopandas)
os.environ['USE_PYGEOS'] = '0'
import geopandas as gpd

import boto3
from sliderule import sliderule, icesat2
import s3fs

Set Area of Interest#

We will use a GeoJSON file over Grand Mesa, Colorado, from the SlideRule GitHub repository:

gfa = gpd.read_file('https://raw.githubusercontent.com/ICESat2-SlideRule/sliderule-python/main/data/grandmesa.geojson')
folium_map = gfa.explore(tiles="Stamen Terrain", 
                         style_kwds=dict(fill=False, color='magenta'),
                        )
folium_map

Configure SlideRule#

Below we connect to the SlideRule server and initialize processing parameters for determining precise elevations from ATL03 photon returns. Read more about processing parameters in the SlideRule Docs.

icesat2.init("slideruleearth.io")
parms = {
    "poly": sliderule.toregion(gfa)["poly"],  # area of interest polygon
    "srt": icesat2.SRT_LAND,                  # surface reference type: land
    "cnf": icesat2.CNF_SURFACE_HIGH,          # minimum photon confidence level
    "len": 40.0,                              # along-track length of each segment (m)
    "res": 20.0,                              # along-track step between segments (m)
    "maxi": 6                                 # maximum number of fitting iterations
}

Get Temporary AWS Credentials (JupyterHub)#

Warning

The code cell below will only work on the CryoCloud JupyterHub, where the required environment variables (AWS_WEB_IDENTITY_TOKEN_FILE, AWS_ROLE_ARN, and JUPYTERHUB_CLIENT_ID) exist.

# Exchange the JupyterHub web identity token for temporary AWS credentials via STS
client = boto3.client('sts')

with open(os.environ['AWS_WEB_IDENTITY_TOKEN_FILE']) as f:
    TOKEN = f.read()

response = client.assume_role_with_web_identity(
    RoleArn=os.environ['AWS_ROLE_ARN'],
    RoleSessionName=os.environ['JUPYTERHUB_CLIENT_ID'],
    WebIdentityToken=TOKEN,
    DurationSeconds=3600  # credentials valid for 1 hour
)

ACCESS_KEY_ID = response['Credentials']['AccessKeyId']
SECRET_ACCESS_KEY_ID = response['Credentials']['SecretAccessKey']
SESSION_TOKEN = response['Credentials']['SessionToken']
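
If you are not on the CryoCloud JupyterHub, one alternative (a sketch, assuming you have an AWS profile configured locally with write access to the target bucket) is to pull credentials from a regular boto3 session:

# Alternative: derive credentials from a locally configured AWS profile
session = boto3.Session()  # or boto3.Session(profile_name='my-profile')
frozen = session.get_credentials().get_frozen_credentials()
ACCESS_KEY_ID = frozen.access_key
SECRET_ACCESS_KEY_ID = frozen.secret_key
SESSION_TOKEN = frozen.token  # may be None for long-lived access keys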

Configure Parquet and S3 Output#

By default, SlideRule processing returns a geopandas.GeoDataFrame. Below we instead request that the output be written directly to a GeoParquet file in the s3://nasa-cryo-scratch bucket. We must pass AWS credentials so that the SlideRule servers processing the ATL03 data can write to this bucket.

S3_OUTPUT = 's3://nasa-cryo-scratch/sliderule-example/grandmesa.parquet'

parms["output"] = {
    "path": S3_OUTPUT, 
    "format": "parquet", 
    "open_on_complete": False,
    "region": "us-west-2",
    "credentials": {
         "aws_access_key_id": ACCESS_KEY_ID,
         "aws_secret_access_key": SECRET_ACCESS_KEY_ID,
         "aws_session_token": SESSION_TOKEN
     }
}
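
Since "open_on_complete" is set to False, the request will return the S3 path of the Parquet file rather than opening it into a GeoDataFrame; we read the file ourselves further below.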

Run SlideRule processing#

Now run SlideRule! Depending on the area of interest, this can take some time. For the Grand Mesa polygon we're using, as of 2023-07-21 there are 106 ATL03 granules totaling over 237 GB! Fortunately, SlideRule runs computations on a scalable cluster in AWS us-west-2, so we get our results in less than a minute! In this case, the result is simply a URL pointing to the Parquet file on S3, which we will load next.

%%time

# NOTE: At the time of writing the latest ATL03 archive release is 'version 006'
# https://nsidc.org/data/atl03/versions/6
output_path = icesat2.atl06p(parms,  version='006')
output_path
CPU times: user 231 ms, sys: 1.84 ms, total: 233 ms
Wall time: 28.1 s
's3://nasa-cryo-scratch/sliderule-example/grandmesa.parquet'
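
Before reading the results, you can optionally confirm that the file landed in the bucket using s3fs (imported above). A minimal sketch, assuming your temporary credentials also grant read access to the bucket:

# List the output prefix to verify the Parquet file exists
fs = s3fs.S3FileSystem(key=ACCESS_KEY_ID, secret=SECRET_ACCESS_KEY_ID, token=SESSION_TOKEN)
fs.ls('nasa-cryo-scratch/sliderule-example/')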

Read output from S3#

Finally, we can read these results. GeoPandas has a read_parquet() function that accepts s3:// URLs:

gf = gpd.read_parquet(output_path)
print("Start:", gf.index.min().strftime('%Y-%m-%d'))
print("End:", gf.index.max().strftime('%Y-%m-%d'))
print("Reference Ground Tracks: {}".format(gf["rgt"].unique()))
print("Cycles: {}".format(gf["cycle"].unique()))
print("Elevation Measurements: {} ".format(gf.shape[0]))
gf.head(2)
Start: 2018-10-16
End: 2023-03-07
Reference Ground Tracks: [1156  714  272 1179  737  211  295  234]
Cycles: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 17 16 15 18]
Elevation Measurements: 328557 
| time | extent_id | distance | segment_id | rgt | rms_misfit | gt | dh_fit_dy | n_fit_photons | h_sigma | pflags | spot | h_mean | cycle | w_surface_window_final | dh_fit_dx | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-12-13 08:01:25.295016704 | 5206161266950799362 | 1.571322e+07 | 784512 | 1156 | 0.291742 | 50 | 0.0 | 76 | 0.033481 | 0 | 2 | 1730.918671 | 1 | 3.0 | 0.001898 | POINT (-108.08440 39.16745) |
| 2018-12-13 08:01:25.638596608 | 5206161266950799363 | 1.571324e+07 | 784513 | 1156 | 0.358982 | 60 | 0.0 | 313 | 0.021044 | 0 | 1 | 1725.510855 | 1 | 3.0 | 0.016074 | POINT (-108.08544 39.16734) |
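
Because Parquet is a columnar format, you can also read just the columns you need instead of the whole table. A minimal sketch using column names from the preview above:

# Read only the mean elevation and geometry columns from S3
subset = gpd.read_parquet(output_path, columns=['h_mean', 'geometry'])
subset.head(2)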

Over 300,000 points is a lot to visualize! Let's randomly sample 1,000 of them and plot them on our map.

# Need to turn timestamps into strings first
points = gf.sample(1000).reset_index()
points['time'] = points.time.dt.strftime('%Y-%m-%d')
points.explore(column='h_mean', m=folium_map)
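If the interactive map does not render (for example in a static export of this notebook), a simple static alternative is a GeoPandas plot (requires matplotlib):

# Static fallback: scatter the sampled points colored by mean elevation
points.plot(column='h_mean', markersize=1, legend=True)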

Summary#

We processed all ATL03 v006 data covering Grand Mesa, Colorado, spanning 2018-10-16 to 2023-03-07, into ATL06-SR elevations. We wrote the results in GeoParquet format to an AWS S3 bucket and quickly visualized a sample of them.

Next steps#

For very large Parquet datasets, the dask-geopandas library has advanced functionality for distributed computations. Check out their documentation!
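
Here is a minimal sketch of what that could look like for our output file (assuming dask-geopandas is installed in your environment):

import dask_geopandas

# Lazily open the GeoParquet file on S3; nothing is read until .compute()
ddf = dask_geopandas.read_parquet(output_path)

# Example: mean elevation per reference ground track, computed in parallel
ddf.groupby('rgt')['h_mean'].mean().compute()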