Site selection

Housen Chu

library(amerifluxr)
library(data.table)
library(pander)

amerifluxr is a programmatic interface to AmeriFlux. This vignette shows how to query a list of target sites based on general site information and the availability of metadata and data. A companion vignette on Data import is also available.

Get a site list with general info

AmeriFlux data are organized by individual sites, so a data query typically begins with site search and selection. A full list of AmeriFlux sites with general info can be obtained using the amf_site_info() function.

Convert the site list to a data.table for easier manipulation. Also see link for variable definitions.

# get a full list of sites with general info
sites <- amf_site_info()
sites_dt <- data.table::as.data.table(sites)

pander::pandoc.table(sites_dt[c(1:3), ])
Table continues below
SITE_ID  SITE_NAME                   COUNTRY    STATE         IGBP
AR-CCa   Carlos Casares agriculture  Argentina  Buenos Aires  CRO
AR-CCg   Carlos Casares grassland    Argentina  Buenos Aires  GRA
AR-TF1   Rio Moat bog                Argentina  NA            WET
Table continues below
TOWER_BEGAN  URL_AMERIFLUX                                    TOWER_END
2012         https://ameriflux.lbl.gov/sites/siteinfo/AR-CCa  NA
2018         https://ameriflux.lbl.gov/sites/siteinfo/AR-CCg  NA
2016         https://ameriflux.lbl.gov/sites/siteinfo/AR-TF1  2018
Table continues below
LOCATION_LAT  LOCATION_LONG  LOCATION_ELEV  CLIMATE_KOEPPEN  MAT   MAP
-35.62        -61.32         83             Cfa              16.1  1060
-35.92        -61.19         84             Cfa              16.1  1060
-54.97        -66.73         40             NA               NA    NA
DATA_POLICY  DATA_START  DATA_END
LEGACY       NA          NA
LEGACY       NA          NA
CCBY4.0      2016        2018

The site list provides a quick summary of all registered sites and sites with available data.

It’s often important to understand the data use policy under which the data are shared. In 2021, the AmeriFlux community moved to the AmeriFlux CC-BY-4.0 License. Most site PIs now share their sites’ data under the CC-BY-4.0 license. Data for some sites are shared under the historical AmeriFlux data-sharing policy, now called the AmeriFlux Legacy Data Policy.

Check link for data use policy and attribution guidelines.

# total number of registered sites
pander::pandoc.table(sites_dt[, .N])
562

# total number of sites with available data
pander::pandoc.table(sites_dt[!is.na(DATA_START), .N])
410

# get number of sites with available data, grouped by data use policy
pander::pandoc.table(sites_dt[!is.na(DATA_START), .N, by = .(DATA_POLICY)])
DATA_POLICY N
CCBY4.0 305
LEGACY 105
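
The counts above can also be read as shares. The following is a minimal sketch in base R that hard-codes the two counts printed above instead of querying AmeriFlux, so the numbers will drift as the network grows.

```r
# Counts copied from the output above (sites with available data, by policy)
policy_n <- c(CCBY4.0 = 305, LEGACY = 105)

# Share of each policy among all sites with available data
policy_share <- round(policy_n / sum(policy_n), 3)
print(policy_share)
```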

Sites can be further grouped by their IGBP (International Geosphere-Biosphere Programme) vegetation type.

# get a summary table of sites grouped by IGBP
pander::pandoc.table(sites_dt[, .N, by = "IGBP"])
IGBP N
CRO 110
GRA 79
WET 94
DNF 1
EBF 10
WSA 9
ENF 99
DBF 58
MF 14
OSH 39
WAT 9
CSH 12
URB 8
BSV 6
SAV 8
CVM 5
SNO 1

# get a summary table of sites with available data, & grouped by IGBP
pander::pandoc.table(sites_dt[!is.na(DATA_START), .N, by = "IGBP"])
IGBP N
WET 50
DNF 1
WSA 7
EBF 6
ENF 93
DBF 54
MF 12
CRO 63
GRA 61
OSH 31
CSH 11
BSV 4
CVM 2
SAV 6
URB 2
WAT 6
SNO 1

# get a summary table of sites with available data, 
#  & grouped by data use policy & IGBP
pander::pandoc.table(sites_dt[!is.na(DATA_START), .N, by = .(IGBP, DATA_POLICY)][order(IGBP)])
IGBP DATA_POLICY N
BSV CCBY4.0 2
BSV LEGACY 2
CRO CCBY4.0 49
CRO LEGACY 14
CSH LEGACY 5
CSH CCBY4.0 6
CVM CCBY4.0 2
DBF CCBY4.0 48
DBF LEGACY 6
DNF CCBY4.0 1
EBF LEGACY 4
EBF CCBY4.0 2
ENF CCBY4.0 68
ENF LEGACY 25
GRA LEGACY 15
GRA CCBY4.0 46
MF CCBY4.0 7
MF LEGACY 5
OSH LEGACY 9
OSH CCBY4.0 22
SAV CCBY4.0 6
SNO CCBY4.0 1
URB LEGACY 2
WAT CCBY4.0 6
WET CCBY4.0 34
WET LEGACY 16
WSA CCBY4.0 5
WSA LEGACY 2
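
The long table above omits combinations with no sites (e.g., CVM has no LEGACY row). A wide cross-tabulation makes such gaps explicit as zeros. The sketch below uses base R on a small made-up table: the site IDs and values in toy_sites are illustrative assumptions, not real AmeriFlux records.

```r
# Toy stand-in for the site table (illustrative values only)
toy_sites <- data.frame(
  SITE_ID     = c("XX-001", "XX-002", "XX-003", "XX-004", "XX-005"),
  IGBP        = c("CRO", "CRO", "GRA", "GRA", "WET"),
  DATA_POLICY = c("CCBY4.0", "LEGACY", "CCBY4.0", "CCBY4.0", "LEGACY"),
  DATA_START  = c(2015, 2010, NA, 2018, 2012),
  stringsAsFactors = FALSE
)

# Keep only sites with available data, then cross-tabulate IGBP by policy
with_data <- toy_sites[!is.na(toy_sites$DATA_START), ]
policy_by_igbp <- table(with_data$IGBP, with_data$DATA_POLICY)
print(policy_by_igbp)
```

Rows are IGBP classes and columns are data policies; a zero cell means no site with data falls in that combination.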

With these summaries in hand, users can query a target site list based on the desired criteria, e.g., IGBP, data availability, data policy, and geolocation.

# get a list of cropland and grassland sites with available data,
#  shared under the CC-BY-4.0 data policy,
#  located between 30 and 50 degrees N in latitude;
#  return a site list with site ID, name, and data starting/ending year
crop_ls <- sites_dt[IGBP %in% c("CRO", "GRA") &
                      !is.na(DATA_START) &
                      LOCATION_LAT > 30 &
                      LOCATION_LAT < 50 &
                      DATA_POLICY == "CCBY4.0",
                    .(SITE_ID, SITE_NAME, DATA_START, DATA_END)]
pander::pandoc.table(crop_ls[c(1:10),])
SITE_ID  SITE_NAME                                DATA_START  DATA_END
CA-ER1   Elora Research Station                   2015        2021
US-A32   ARM-SGP Medford hay pasture              2015        2017
US-A74   ARM SGP milo field                       2015        2017
US-AR1   ARM USDA UNL OSU Woodward Switchgrass 1  2009        2012
US-AR2   ARM USDA UNL OSU Woodward Switchgrass 2  2009        2012
US-ARM   ARM Southern Great Plains site- Lamont   2003        2021
US-BMM   Bangtail Mountain Meadow                 2016        2019
US-BRG   Bayles Road Grassland Tower              2016        2020
US-Bi1   Bouldin Island Alfalfa                   2016        2021
US-Bi2   Bouldin Island corn                      2017        2021
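
When the same kind of query is repeated with different criteria, it may help to wrap the filter in a small function. The sketch below is one possible way to do so in base R; select_sites() is a hypothetical helper, not part of amerifluxr, and toy_sites is made-up data that only mimics the column names shown above.

```r
# Hypothetical helper (not an amerifluxr function): filter a site table by
# IGBP class, latitude range, and data policy, keeping sites with data
select_sites <- function(sites, igbp, lat_range, policy) {
  keep <- sites$IGBP %in% igbp &
    !is.na(sites$DATA_START) &
    sites$LOCATION_LAT > lat_range[1] &
    sites$LOCATION_LAT < lat_range[2] &
    sites$DATA_POLICY %in% policy
  # which() drops rows where the condition is NA (e.g., missing latitude)
  sites[which(keep), c("SITE_ID", "DATA_START", "DATA_END")]
}

# Toy stand-in for the site table (illustrative values only)
toy_sites <- data.frame(
  SITE_ID      = c("XX-001", "XX-002", "XX-003", "XX-004", "XX-005", "XX-006"),
  IGBP         = c("CRO", "GRA", "CRO", "GRA", "ENF", "GRA"),
  LOCATION_LAT = c(40, 25, 45, 35, 42, 48),
  DATA_POLICY  = c("CCBY4.0", "CCBY4.0", "LEGACY", "CCBY4.0", "CCBY4.0", "CCBY4.0"),
  DATA_START   = c(2015, 2012, 2010, NA, 2014, 2016),
  DATA_END     = c(2020, 2018, 2019, NA, 2021, 2021),
  stringsAsFactors = FALSE
)

picked <- select_sites(toy_sites, igbp = c("CRO", "GRA"),
                       lat_range = c(30, 50), policy = "CCBY4.0")
print(picked)
```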

Get metadata availability

In some cases, users may want to know if certain types of metadata are available for the selected sites. The amf_list_metadata() function provides a quick summary of metadata availability before actually downloading the data and metadata.

By default, amf_list_metadata() returns a full site list with the number of available entries (i.e., counts) for each BADM (Biological, Ancillary, Disturbance, and Metadata) group. See the AmeriFlux website for definitions of all BADM groups.

# get metadata availability for all sites
metadata_aval <- data.table::as.data.table(amf_list_metadata())
pander::pandoc.table(metadata_aval[c(1:3), c(1:10)])
Table continues below
SITE_ID GRP_ACKNOWLEDGEMENT GRP_CLIM_AVG GRP_COUNTRY GRP_DOM_DIST_MGMT
AR-CCa 1 1 1 1
AR-CCg 1 1 1 2
AR-TF1 0 0 1 0
GRP_FLUX_MEASUREMENTS GRP_HEADER GRP_IGBP GRP_LAND_OWNERSHIP GRP_LOCATION
2 1 1 1 1
2 1 1 1 1
3 1 1 1 1

The site_set parameter of amf_list_metadata() can be used to subset the sites of interest.

metadata_aval_sub <- as.data.table(amf_list_metadata(site_set = crop_ls$SITE_ID))

# down-select cropland & grassland sites by a BADM group of interest,
#  e.g., canopy height (GRP_HEIGHTC)
crop_ls2 <- metadata_aval_sub[GRP_HEIGHTC > 0, .(SITE_ID, GRP_HEIGHTC)][order(-GRP_HEIGHTC)]
pander::pandoc.table(crop_ls2[c(1:10), ])
SITE_ID GRP_HEIGHTC
US-Ne2 196
US-Tw3 162
US-Twt 133
US-Ne3 128
US-Ne1 119
US-Bi1 112
US-Var 105
US-Snd 70
US-Bi2 54
US-ARM 45
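
The same idea extends to several BADM groups at once. The sketch below, in base R, keeps the sites that have at least one entry in every required group. The group names are taken from the output above, while toy_meta and its counts are made-up assumptions for illustration.

```r
# Toy stand-in for the metadata availability table (illustrative counts only)
toy_meta <- data.frame(
  SITE_ID               = c("XX-001", "XX-002", "XX-003"),
  GRP_HEIGHTC           = c(12, 0, 3),
  GRP_FLUX_MEASUREMENTS = c(2, 1, 0),
  stringsAsFactors = FALSE
)

# BADM groups that must all have at least one entry
required <- c("GRP_HEIGHTC", "GRP_FLUX_MEASUREMENTS")

# A site qualifies only if every required group has a count above zero
has_all <- rowSums(toy_meta[, required] > 0) == length(required)
sites_with_all <- toy_meta$SITE_ID[has_all]
print(sites_with_all)
```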

Get data availability

Users can use amf_list_data() to query the availability of specific variables in the data (i.e., flux/met data, the so-called BASE data product). It provides a quick summary of variable availability (per site and year) before downloading the data.

By default, amf_list_data() returns variable availability for all sites and all variables, expressed as the fraction of records available in each year (from 0 to 1). The site_set parameter of amf_list_data() can be used to subset the sites of interest.

# get data availability for selected sites
data_aval <- data.table::as.data.table(amf_list_data(site_set = crop_ls2$SITE_ID))
pander::pandoc.table(data_aval[c(1:10), ])
Table continues below
SITE_ID VARIABLE BASENAME GAP_FILLED Y1990 Y1991 Y1992 Y1993
US-AR1 CO2 CO2 FALSE 0 0 0 0
US-AR1 FC FC FALSE 0 0 0 0
US-AR1 G G FALSE 0 0 0 0
US-AR1 H H FALSE 0 0 0 0
US-AR1 H2O H2O FALSE 0 0 0 0
US-AR1 LE LE FALSE 0 0 0 0
US-AR1 LW_IN LW_IN FALSE 0 0 0 0
US-AR1 LW_OUT LW_OUT FALSE 0 0 0 0
US-AR1 NETRAD NETRAD FALSE 0 0 0 0
US-AR1 P P FALSE 0 0 0 0
Table continues below
Y1994 Y1995 Y1996 Y1997 Y1998 Y1999 Y2000 Y2001 Y2002 Y2003
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
Table continues below
Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012
0 0 0 0 0 0.5905 0.9866 0.9941 0.6654
0 0 0 0 0 0.6082 0.976 0.9886 0.6621
0 0 0 0 0 0.6421 0.9965 0.9999 0.9961
0 0 0 0 0 0.6123 0.9867 0.9938 0.6666
0 0 0 0 0 0.6092 0.971 0.9792 0.6633
0 0 0 0 0 0.6101 0.9816 0.9936 0.6647
0 0 0 0 0 0.6416 0.9965 0.9999 0.9961
0 0 0 0 0 0.6416 0.9965 0.9999 0.9961
0 0 0 0 0 0.5447 0.9964 0.9996 0.996
0 0 0 0 0 0.6422 0.9965 0.9999 0.9961
Y2013 Y2014 Y2015 Y2016 Y2017 Y2018 Y2019 Y2020 Y2021
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0

The variable availability can be used to subset sites that have certain variables in specific years. The BASENAME column indicates the variable’s base name (i.e., ignoring position qualifiers), and can be used to get a coarse-level variable availability.

See the AmeriFlux website for definitions of base names and qualifiers.

# down-select cropland & grassland sites by the availability of wind speed (WS)
#  and friction velocity (USTAR) data in 2015-2018, regardless of their qualifiers
data_aval_sub <- data_aval[BASENAME %in% c("WS", "USTAR"),
                           .(SITE_ID, BASENAME, Y2015, Y2016, Y2017, Y2018)]

# calculate mean availability of WS and USTAR (pooled) at each site in each year
data_aval_sub <- data_aval_sub[, lapply(.SD, mean),
                               by = .(SITE_ID),
                               .SDcols = c("Y2015", "Y2016", "Y2017", "Y2018")]

# sub-select sites whose WS and USTAR availability, averaged over
#  2015-2018, exceeds 75% (individual years may fall below the threshold)
crop_ls3 <- data_aval_sub[(Y2015 + Y2016 + Y2017 + Y2018) / 4 > 0.75]
pander::pandoc.table(crop_ls3)
SITE_ID Y2015 Y2016 Y2017 Y2018
US-ARM 0.5772 0.9871 0.9683 0.9826
US-Ne1 0.77 0.7861 0.756 0.7167
US-Ne2 0.7636 0.7878 0.7594 0.7442
US-SRG 0.9669 0.9851 0.9775 0.9997
US-Tw3 0.9689 0.9569 0.9763 0.4005
US-Var 0.9983 1 0.9455 1
US-Wkg 0.9973 0.9909 0.9965 0.9848
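
Because the selection above averages over variables and years, a site can pass even when one variable or one year is poorly covered (e.g., US-Tw3 in 2018). A stricter alternative is to require every variable in every year to exceed the threshold. The sketch below shows this in base R on made-up numbers; toy_aval only mimics the column layout of the availability table and is not real data.

```r
# Toy stand-in for the availability table (illustrative fractions only)
toy_aval <- data.frame(
  SITE_ID  = c("XX-001", "XX-001", "XX-002", "XX-002"),
  BASENAME = c("WS", "USTAR", "WS", "USTAR"),
  Y2015    = c(0.95, 0.90, 1.00, 0.60),
  Y2016    = c(0.98, 0.92, 1.00, 0.55),
  Y2017    = c(0.97, 0.91, 1.00, 0.50),
  Y2018    = c(0.96, 0.93, 1.00, 0.65),
  stringsAsFactors = FALSE
)
year_cols <- c("Y2015", "Y2016", "Y2017", "Y2018")

# Mean rule: average over years per variable, then over variables per site
site_mean <- tapply(rowMeans(toy_aval[, year_cols]), toy_aval$SITE_ID, mean)

# Strict rule: the worst variable-year at each site must pass the threshold
site_min <- tapply(apply(toy_aval[, year_cols], 1, min), toy_aval$SITE_ID, min)

mean_sites   <- names(site_mean)[as.vector(site_mean > 0.75)]
strict_sites <- names(site_min)[as.vector(site_min > 0.75)]
print(mean_sites)
print(strict_sites)
```

In this toy case, XX-002 passes the mean rule because its complete WS record offsets its sparse USTAR record, but it fails the strict rule.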

Finally, users sometimes look for sites with multiple measurements of similar variables (e.g., wind speed or soil temperature at multiple levels). The VARIABLE column in the variable availability can be used to get a fine-level variable availability.


# down-select cropland & grassland sites by available wind speed (WS) data,
#  using the mean availability of each WS variable during 2015-2018
data_aval_sub2 <- data_aval[BASENAME == "WS",
                            .(SITE_ID, VARIABLE, Y2015_2018 = (Y2015 + Y2016 + Y2017 + Y2018) / 4)]

# count the WS variables per site, keeping only variables that
#  have any data during 2015-2018
data_aval_sub2 <- data_aval_sub2[Y2015_2018 > 0,
                                 .(.N, Y2015_2018 = mean(Y2015_2018)),
                                 by = .(SITE_ID)]

# keep sites with more than one WS variable
crop_ls4 <- data_aval_sub2[N > 1, ]
pander::pandoc.table(crop_ls4)
SITE_ID N Y2015_2018
US-ARM 3 0.8766
US-Ne1 4 0.7027
US-Ne2 4 0.709
US-Ne3 4 0.7287
US-Wkg 2 0.9942
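
To see which positions the multiple variables correspond to, the trailing qualifiers in the variable names can be split out. The sketch below assumes the names end in three integer position indices (horizontal, vertical, replicate), as in WS_1_2_1; that reading of the suffix is an assumption here and should be checked against the AmeriFlux variable definitions. The sketch does not handle any other qualifiers that may precede the indices.

```r
# Example variable names; the first three follow the pattern seen in this vignette
vars <- c("WS_1_1_1", "WS_1_2_1", "WS_1_3_1", "LW_IN_1_1_1", "USTAR")

# Assumed pattern: <base name>_<H>_<V>_<R>, with three trailing integers
pattern <- "^(.*)_([0-9]+)_([0-9]+)_([0-9]+)$"
has_hvr <- grepl(pattern, vars)

qualifiers <- data.frame(
  VARIABLE = vars,
  BASENAME = ifelse(has_hvr, sub(pattern, "\\1", vars), vars),
  H = as.integer(ifelse(has_hvr, sub(pattern, "\\2", vars), NA_character_)),
  V = as.integer(ifelse(has_hvr, sub(pattern, "\\3", vars), NA_character_)),
  R = as.integer(ifelse(has_hvr, sub(pattern, "\\4", vars), NA_character_)),
  stringsAsFactors = FALSE
)

# Number of distinct vertical positions per base name
n_levels <- tapply(qualifiers$V, qualifiers$BASENAME,
                   function(v) length(unique(v[!is.na(v)])))
print(qualifiers)
print(n_levels)
```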

A companion function amf_plot_datayear() can be used for visualizing the data availability in an interactive figure. However, it is strongly advised to subset the sites, variables, and/or years for faster processing and better visualization.

#### not evaluated, to reduce vignette size
# plot data availability for WS & USTAR
#  for selected sites in 2015-2018
amf_plot_datayear(
  site_set = crop_ls4$SITE_ID,
  var_set = c("WS", "USTAR"),
  nonfilled_only = TRUE,
  year_set = c(2015:2018)
)

Get data summary

In addition, users can use amf_summarize_data() to query the summary statistics of specific variables in the BASE data. It provides summary statistics (e.g., percentiles) for each variable before downloading the data.

By default, amf_summarize_data() returns a variable summary (selected percentiles) for all variables and sites. The site_set and var_set parameters can be used to subset the sites or variables of interest.

## get data summary for selected sites & variables
data_sum <- amf_summarize_data(site_set = crop_ls3$SITE_ID,
                     var_set = c("WS", "USTAR"))
pander::pandoc.table(data_sum[c(1:10), ])
Table continues below
  SITE_ID VARIABLE BASENAME GAP_FILLED DATA_RECORD
3595 US-ARM WS_1_1_1 WS FALSE 324084
3598 US-ARM USTAR_1_1_1 USTAR FALSE 324084
3651 US-ARM WS_1_2_1 WS FALSE 324084
3654 US-ARM USTAR_1_2_1 USTAR FALSE 324084
3678 US-ARM WS_1_3_1 WS FALSE 324084
3681 US-ARM USTAR_1_3_1 USTAR FALSE 324084
8686 US-Ne1 USTAR_1_1_1 USTAR FALSE 175320
8758 US-Ne1 WS_1_1_1 WS FALSE 175320
8759 US-Ne1 WS_1_2_1 WS FALSE 175320
8760 US-Ne1 WS_1_3_1 WS FALSE 175320
Table continues below
  DATA_MISSING Q01 Q05 Q10 Q15 Q20
3595 22376 0.5105 1.064 1.451 1.757 2.027
3598 21995 0.02875 0.0533 0.07737 0.1022 0.1279
3651 31002 0.8269 1.728 2.395 2.903 3.342
3654 30042 0.03132 0.05613 0.07933 0.1022 0.1266
3678 49875 0.9939 2.136 3.012 3.704 4.292
3681 44634 0.03308 0.05806 0.08016 0.1015 0.1244
8686 15578 0.024 0.049 0.071 0.093 0.116
8758 118560 0.55 0.94 1.19 1.37 1.53
8759 10493 0.8 1.2 1.49 1.72 1.95
8760 11666 0.52 0.77 0.99 1.19 1.39
Table continues below
  Q25 Q30 Q35 Q40 Q45 Q50 Q55
3595 2.281 2.539 2.805 3.083 3.37 3.678 4.008
3598 0.1542 0.1803 0.2061 0.2314 0.2566 0.2819 0.3076
3651 3.741 4.12 4.484 4.844 5.211 5.579 5.958
3654 0.1525 0.1796 0.2063 0.2332 0.2604 0.2874 0.3145
3678 4.817 5.31 5.786 6.254 6.711 7.169 7.64
3681 0.1499 0.1782 0.2072 0.237 0.2666 0.2959 0.326
8686 0.14 0.164 0.188 0.213 0.238 0.263 0.289
8758 1.68 1.81 1.96 2.1 2.26 2.44 2.63
8759 2.17 2.4 2.64 2.89 3.16 3.44 3.75
8760 1.59 1.79 2 2.22 2.45 2.71 3.01
Table continues below
  Q60 Q65 Q70 Q75 Q80 Q85 Q90
3595 4.362 4.743 5.162 5.635 6.176 6.837 7.696
3598 0.334 0.3621 0.3927 0.4261 0.4644 0.5097 0.5673
3651 6.346 6.766 7.227 7.75 8.373 9.129 10.16
3654 0.3422 0.3712 0.4013 0.434 0.4713 0.5144 0.569
3678 8.12 8.606 9.112 9.653 10.25 10.97 11.94
3681 0.3572 0.3891 0.4231 0.4599 0.5007 0.549 0.6108
8686 0.315 0.343 0.373 0.406 0.4438 0.49 0.551
8758 2.85 3.1 3.38 3.69 4.04 4.48 5.06
8759 4.1 4.49 4.93 5.42 6 6.7 7.62
8760 3.34 3.73 4.17 4.69 5.28 5.98 6.87
  Q95 Q99
3595 8.947 11.25
3598 0.6527 0.828
3651 11.7 14.44
3654 0.6505 0.8178
3678 13.42 16.29
3681 0.7058 0.9155
8686 0.645 0.852
8758 5.93 7.61
8759 9.01 11.65
8760 8.19 10.61
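
A few derived quantities are often easier to compare across sites than the raw percentiles. The sketch below, in base R, computes the missing fraction and the interquartile range for two rows copied from the table above. It assumes DATA_MISSING counts the missing records out of the DATA_RECORD total, which is inferred from the column names only.

```r
# Two WS_1_1_1 rows copied from the summary table above
ws_sum <- data.frame(
  SITE_ID      = c("US-ARM", "US-Ne1"),
  VARIABLE     = c("WS_1_1_1", "WS_1_1_1"),
  DATA_RECORD  = c(324084, 175320),
  DATA_MISSING = c(22376, 118560),
  Q25          = c(2.281, 1.68),
  Q50          = c(3.678, 2.44),
  Q75          = c(5.635, 3.69),
  stringsAsFactors = FALSE
)

# Fraction of records that are missing (assumed meaning of DATA_MISSING)
ws_sum$MISSING_FRAC <- ws_sum$DATA_MISSING / ws_sum$DATA_RECORD

# Interquartile range as a simple measure of spread
ws_sum$IQR <- ws_sum$Q75 - ws_sum$Q25

print(ws_sum[, c("SITE_ID", "VARIABLE", "MISSING_FRAC", "Q50", "IQR")])
```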

Alternatively, a companion function amf_plot_datasummary() provides an interactive visualization of the data summary.

#### not evaluated, to reduce vignette size
## plot data summary of USTAR for selected sites
amf_plot_datasummary(
  site_set = crop_ls3$SITE_ID,
  var_set = c("USTAR")
)

#### not evaluated, to reduce vignette size
## plot data summary of WS for selected sites,
#  including clustering information
amf_plot_datasummary(
  site_set = crop_ls3$SITE_ID,
  var_set = c("WS"),
  show_cluster = TRUE
)

Once a target site list is finalized, users can download these sites’ data and metadata using the site IDs. See Data import for data download and import examples.
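
If several selection criteria were applied separately, the final list can be taken as the sites common to all of them. The sketch below hard-codes the site IDs from the two tables printed earlier in this vignette (the WS/USTAR availability selection and the multi-level WS selection) so that it runs without a live query; in practice, the SITE_ID columns of the corresponding tables would be used.

```r
# Site IDs copied from the tables printed earlier in this vignette
sites_ws_ustar <- c("US-ARM", "US-Ne1", "US-Ne2", "US-SRG",
                    "US-Tw3", "US-Var", "US-Wkg")            # as in crop_ls3
sites_multi_ws <- c("US-ARM", "US-Ne1", "US-Ne2",
                    "US-Ne3", "US-Wkg")                      # as in crop_ls4

# Sites that satisfy both criteria
target_sites <- intersect(sites_ws_ustar, sites_multi_ws)
print(target_sites)
```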