Using icepyx to search, access and open ICESat-2 data in the cloud#
Overview#
icepyx is a Python package that simplifies querying, accessing and working with ICESat-2 data. It is also a community of ICESat-2 data users, developers and other members of the scientific community. icepyx is a fantastic example of the kind of collaborative software development, community and knowledge building that we are trying to foster in this hackweek: the project was started at the 2019 ICESat-2 Hackweek.
Learning Objectives#
In this tutorial you will learn how to:
Use icepyx to search for ICESat-2 data
Access and open the data directly from an S3 bucket in the NASA Earthdata Cloud
Subset data by variable
Prerequisites#
The workflow described in this tutorial forms the initial steps of an Analysis in Place workflow that would be run on an AWS cloud compute resource. You will need:
a JupyterHub, such as CryoHub, or an AWS EC2 instance in the us-west-2 region.
a NASA Earthdata Login. If you need to register for an Earthdata Login see the Getting an Earthdata Login section of the ICESat-2 Hackweek 2023 Jupyter Book.
a .netrc file that contains your Earthdata Login credentials, in your home directory. See Configure Programmatic Access to NASA Servers to create a .netrc file.
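For reference, the .netrc entry for NASA Earthdata takes the following form, shown here with placeholders rather than real credentials:
machine urs.earthdata.nasa.gov login <your_earthdata_username> password <your_earthdata_password>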
Credits#
This tutorial is based on a number of tutorials in the icepyx documentation. See this documentation for more information.
Computing Environment#
We will use icepyx to search for and subset data granules. However, currently, icepyx does not get AWS tokens. Fortunately, we can use earthaccess (see the earthaccess tutorial) to do this for us.
import warnings
warnings.filterwarnings("ignore")
import icepyx as ipx
import earthaccess
import h5py
import geopandas as gpd
from shapely.geometry import Point
Create an icepyx Query object#
icepyx uses similar search parameters to earthaccess. The data product is defined by the short_name. Spatial and temporal filters are defined using spatial_extent and a date_range. Spatial extent may be defined as a bounding box, a polygon, or a shapefile or GeoJSON vector file.
result = ipx.Query(
product="ATL06",
spatial_extent = [-134.7,58.9,-133.9,59.2],
date_range = ['2020-03-01','2020-04-30'],
)
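As a variation, the icepyx documentation describes passing spatial_extent as a polygon rather than a bounding box. A minimal sketch of the same region as a closed ring of (longitude, latitude) pairs, untested here:
result_poly = ipx.Query(
    product="ATL06",
    # Closed ring of (lon, lat) vertices covering the same region;
    # the first vertex is repeated to close the polygon
    spatial_extent=[(-134.7, 58.9), (-133.9, 58.9), (-133.9, 59.2), (-134.7, 59.2), (-134.7, 58.9)],
    date_range=['2020-03-01', '2020-04-30'],
)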
Products can also be queried by version, orbital cycle and reference ground track. See the icepyx API documentation for more information.
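For example, adding the version keyword pins the query to a specific product version; a sketch, assuming version '006' as seen in the granule names returned below:
result_v006 = ipx.Query(
    product="ATL06",
    spatial_extent=[-134.7, 58.9, -133.9, 59.2],
    date_range=['2020-03-01', '2020-04-30'],
    version='006',  # request this product version only
)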
Get information about available granules#
icepyx.Query returns a Query object that contains information about the number of granules found and the size of those granules. To return a list of granule ids, set the keyword ids=True. Setting cycles=True and tracks=True returns lists of orbital cycles and reference ground tracks.
gran_id = result.avail_granules()
gran_id
{'Number of available granules': 4,
'Average size of granules (MB)': 21.579402685125,
'Total size of all granules (MB)': 86.3176107405}
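Those keywords can be combined; a minimal sketch, assuming the lists come back nested in keyword order (as with the cloud=True example below, expect to index into the result):
# Request granule ids, orbital cycles and RGTs together.
# The return structure is assumed to be a list of lists in keyword order.
gran_ids, cycles, tracks = result.avail_granules(ids=True, cycles=True, tracks=True)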
To directly access ICESat-2 granules in the NASA Earthdata Cloud from an in-region AWS compute resource, you need the S3 links to these granules. To get these links, rerun the search query and then call avail_granules(cloud=True).
Warning
Calling avail_granules(cloud=True) without rerunning the search query returns an empty list. If you know you want granule ids, cycles and RGTs, in addition to the S3 links, set all four keywords to True.
result = ipx.Query(
product="ATL06",
spatial_extent = [-134.7,58.9,-133.9,59.2],
date_range = ['2020-03-01','2020-04-30'],
)
s3links = result.avail_granules(cloud=True)[0] # returns a nested list, get first element
s3links
['s3://nsidc-cumulus-prod-protected/ATLAS/ATL06/006/2020/03/10/ATL06_20200310121504_11420606_006_01.h5',
's3://nsidc-cumulus-prod-protected/ATLAS/ATL06/006/2020/03/12/ATL06_20200312233336_11800602_006_01.h5',
's3://nsidc-cumulus-prod-protected/ATLAS/ATL06/006/2020/04/10/ATL06_20200410220936_02350702_006_02.h5',
's3://nsidc-cumulus-prod-protected/ATLAS/ATL06/006/2020/04/12/ATL06_20200412104246_02580706_006_02.h5']
A similar search can be run using the Reference Ground Track and Orbital Cycle as search parameters. In this example, the search query is run for RGT 1142 of cycle 06, which matches the first granule returned by the search above.
result = ipx.Query(
product="ATL06",
spatial_extent = [-134.7,58.9,-133.9,59.2], # must be included
cycles = ['06'],
tracks = ['1142']
)
s3links = result.avail_granules(cloud=True)[0]
s3links
['s3://nsidc-cumulus-prod-protected/ATLAS/ATL06/006/2020/03/10/ATL06_20200310121504_11420606_006_01.h5']
Access and Open an ICESat-2 granule directly from the S3 bucket#
Search queries can be run without authenticating with Earthdata Login. However, to access data stored in an S3 bucket, you need to authenticate with Earthdata Login and start an S3 session. At the moment, icepyx is used to authenticate with Earthdata Login and get S3 credentials, and earthaccess is used to start an S3 session.
result.earthdata_login(s3token=True) # Authenticate and get credentials
EARTHDATA_USERNAME and EARTHDATA_PASSWORD are not set in the current environment, try setting them or use a different strategy (netrc, interactive)
You're now authenticated with NASA Earthdata Login
Using token with expiration date: 09/19/2023
Using .netrc file for EDL
s3 = earthaccess.get_s3fs_session(daac='NSIDC') # Start an S3 session
As with the earthaccess direct access example, we have to create a virtual local file system before we can read a file.
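As an aside, earthaccess.open may offer a one-step alternative that handles the virtual file system for you; a minimal sketch, assuming earthaccess.login() has already been called:
# Hypothetical alternative: let earthaccess open the S3 links directly
files = earthaccess.open(s3links)
fh_alt = h5py.File(files[0], 'r')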
We could open the file using xarray. However, to demonstrate another access method, we will use h5py, a package for reading HDF5 files, and then load the data into a geopandas.GeoDataFrame.
The h5py.File method returns a file object, which is the entry point to the HDF5 file.
file = s3.open(s3links[0], 'rb')
fh = h5py.File(file, 'r')
fh
<HDF5 file "ATL06_20200310121504_11420606_006_01.h5" (mode r)>
HDF5 files are hierarchical structures similar to the directory structure on your computer. Groups are analogous to directories and can contain other Groups or Datasets. The File object (fh) is the root Group. Datasets are variables collected under a Group; they contain multi-dimensional arrays of data values. Groups and Datasets can have attributes that contain metadata, for example the units of the data values in a Dataset.
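A short sketch of navigating this hierarchy with h5py; the paths follow the ATL06 layout used below, and the 'units' attribute name is an assumption based on typical ICESat-2 metadata:
# List the Groups at the root of the file
print(list(fh.keys()))
# List the Datasets and subgroups within one Group
print(list(fh['gt1l/land_ice_segments'].keys()))
# Read a metadata attribute attached to a Dataset (attribute name assumed)
print(fh['gt1l/land_ice_segments/h_li'].attrs['units'])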
To access a particular dataset, you have to know the path to that dataset. All ICESat-2 data products have a data dictionary that includes the path to all variables, along with datatype, standard name, units and a description.
In this example, I want to get Land Ice Height from the left beam of the first beam pair, along with geolocation, time, error and quality variables. From inspecting the data dictionary for ATL06, I can see that all of this data is included in the gt1l/land_ice_segments group. There are also subgroups within this group that contain other data but, in this case, I don't want that data.
icepyx has a handy variables lookup for ICESat-2 products. I can use this method to return a full list of paths to variables in the ATL06 file and filter this list to get just the variables under gt1l/land_ice_segments.
I use a Python dictionary comprehension to loop through varlist with two conditionals, to add only variables that are directly under the gt1l/land_ice_segments group. The first checks that the path is in the group. The second checks that the variable is in the first level of this group by counting the number of / characters in the path.
Finally, I use split('/')[-1] to extract the variable name as the last component of the path, and create a dictionary entry with this name as the key and the path as the value.
varlist = result.order_vars.avail()
variables_to_get = {
    var.split('/')[-1]: var for var in varlist if var.startswith('gt1l/land_ice_segments') and var.count('/') == 2
}
variables_to_get
{'atl06_quality_summary': 'gt1l/land_ice_segments/atl06_quality_summary',
'delta_time': 'gt1l/land_ice_segments/delta_time',
'h_li': 'gt1l/land_ice_segments/h_li',
'h_li_sigma': 'gt1l/land_ice_segments/h_li_sigma',
'latitude': 'gt1l/land_ice_segments/latitude',
'longitude': 'gt1l/land_ice_segments/longitude',
'segment_id': 'gt1l/land_ice_segments/segment_id',
'sigma_geo_h': 'gt1l/land_ice_segments/sigma_geo_h'}
I want to work with these variables in a geopandas.GeoDataFrame. Dataframes are convenient data structures for tabular data, and the Python package pandas is a powerful tool for working with them. geopandas is an extension to pandas that adds georeferencing. Essentially, geopandas adds an extra geometry column to a pandas.DataFrame and provides methods for geospatial queries.
Note
What is often called “GIS data” can be thought of as a table of data, where each row of that table is associated with some geospatial object: either a point, line or polygon. We can think of along-track ICESat-2 data as a set of georeferenced points, with each point having a set of attributes such as height, surface type, quality, etc.
There are several ways to create a geopandas.GeoDataFrame. In this example, I create a dictionary of variables read from the HDF5 file using the syntax fh[path_to_variable][:], with the variable name as the key for each entry. I also create a list of Point geometries from the latitude and longitude variables, which represent the center points of the land ice height segments. The data dictionary and geometry list are used to create the GeoDataFrame object. I also define the Coordinate Reference System (crs) for the geometry as WGS 84 using the EPSG code 4326. The CRS makes the data georeferenced, as it relates the longitude and latitude values of each point to a location on Earth.
data = {varname: fh[varpath][:] for varname, varpath in variables_to_get.items() if varname not in ['latitude', 'longitude']}
geometry = [Point(lon, lat) for lon, lat in zip(fh[variables_to_get['longitude']], fh[variables_to_get['latitude']])]
fh.close() # close the file
gdf = gpd.GeoDataFrame(data, geometry=geometry, crs="EPSG:4326")
gdf
|     | atl06_quality_summary | delta_time | h_li | h_li_sigma | segment_id | sigma_geo_h | geometry |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0   | 1 | 6.907771e+07 | 3.402823e+38 | 3.402823e+38 | 673338 | 3.402823e+38 | POINT (-134.37842 59.09455) |
| 1   | 1 | 6.907771e+07 | 3.402823e+38 | 3.402823e+38 | 673359 | 3.402823e+38 | POINT (-134.37922 59.09080) |
| 2   | 1 | 6.907771e+07 | 3.402823e+38 | 3.402823e+38 | 673473 | 3.402823e+38 | POINT (-134.38364 59.07043) |
| 3   | 1 | 6.907771e+07 | 3.402823e+38 | 3.402823e+38 | 673489 | 3.402823e+38 | POINT (-134.38425 59.06757) |
| 4   | 1 | 6.907771e+07 | 3.402823e+38 | 3.402823e+38 | 673490 | 3.402823e+38 | POINT (-134.38429 59.06739) |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 859 | 1 | 6.907776e+07 | 3.402823e+38 | 3.402823e+38 | 690101 | 3.402823e+38 | POINT (-134.94438 56.09580) |
| 860 | 1 | 6.907776e+07 | 3.402823e+38 | 3.402823e+38 | 690108 | 3.402823e+38 | POINT (-134.94461 56.09455) |
| 861 | 1 | 6.907776e+07 | 3.402823e+38 | 3.402823e+38 | 690111 | 3.402823e+38 | POINT (-134.94471 56.09401) |
| 862 | 1 | 6.907776e+07 | 3.402823e+38 | 3.402823e+38 | 690112 | 3.402823e+38 | POINT (-134.94474 56.09383) |
| 863 | 1 | 6.907776e+07 | 3.402823e+38 | 3.402823e+38 | 690213 | 3.402823e+38 | POINT (-134.94811 56.07576) |
864 rows × 7 columns
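The h_li values of 3.402823e+38 in this preview are floating-point fill values marking invalid segments (the usual ICESat-2 convention; check the Dataset attributes to confirm). A minimal sketch that drops them and reprojects the points, with EPSG:32608 (UTM zone 8N) chosen here only as a plausible projected CRS for this region:
# Drop rows where height is the fill value (~float32 max)
gdf_valid = gdf[gdf['h_li'] < 3.0e38]
# Reproject to a projected CRS for distance-based analysis
gdf_utm = gdf_valid.to_crs(epsg=32608)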
Summary#
This tutorial has introduced icepyx and shown how to search for ICESat-2 data using spatial and temporal filters, as well as reference ground tracks and orbital cycles. It has also shown how to access a file hosted on AWS S3, read that file using h5py, extract a subset of variables, and create a geopandas.GeoDataFrame.
References#
icepyx documentation
icepyx ICESat-2 cloud data access tutorial
GeoPandas documentation
h5py documentation
An introduction to HDF5
The O'Reilly book on HDF5 by Andrew Collette is also a useful introduction.