The purpose of this project is to produce data visualisations that
answer the question: How does eclipse visibility vary across different
states or regions in the US? First, we discuss what eclipses are.
Eclipses occur when the sun, earth and moon align precisely.
They can be categorised into total solar eclipses, where the moon
entirely blocks the sun; annular solar eclipses, where the moon passes
centrally in front of the sun but leaves a bright ring visible; partial
solar eclipses, where the moon only partially covers the sun; total
lunar eclipses, where the earth’s shadow completely covers the moon; and
partial lunar eclipses, where only a part of the moon passes into the
earth’s shadow. Using the following datasets:
“eclipse_annular_2023.csv”,
“eclipse_total_2024.csv”,
“eclipse_partial_2023.csv” and
“eclipse_partial_2024.csv”, it is possible to make
visualisations regarding two particular events: the annular solar
eclipse that occurred across the US on October 14, 2023, and the total
solar eclipse that occurred across the US on April 8, 2024.
The datasets record where each location lies and when the different
stages of the eclipse occurred there. Each row gives the US state, the
name of the location, and its latitude and longitude. The annular and
total datasets give six times for each location: when the moon first
contacts the sun, when the eclipse reaches 50%, when annularity or
totality begins, when annularity or totality ends, when the eclipse is
back to 50%, and when the moon last contacts the sun. The partial
datasets give five times for each location: when the moon first contacts
the sun, when the eclipse reaches 50% of this location’s maximum, when
it reaches 100% of this location’s maximum, when it is again at 50% of
this location’s maximum, and when the moon last contacts the sun. In
this project, we will produce visualisations and make
observations about these two particular events as seen in the US.
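The column layout described above can be summarised as a small lookup table. This is only an illustrative sketch: `phase_lookup` is a hypothetical helper name that does not appear in the datasets or in the analysis code, and the wording of the labels is ours.

```r
# Meaning of each time column, by kind of dataset (labels paraphrase the text above).
# `phase_lookup` is a hypothetical name used only for this illustration.
phase_lookup <- data.frame(
  column = paste0("eclipse_", 1:6),
  annular_or_total = c("First contact", "Eclipse at 50%",
                       "Annularity/totality begins", "Annularity/totality ends",
                       "Eclipse back to 50%", "Last contact"),
  # The partial datasets have only five time columns, so the sixth entry is NA
  partial = c("First contact", "50% of local maximum", "100% of local maximum",
              "Back to 50% of local maximum", "Last contact", NA),
  stringsAsFactors = FALSE
)

print(phase_lookup)
```

Note that the same column name can mean different things: eclipse_5 is the return to 50% in the annular and total datasets but the last contact in the partial datasets.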
To prepare the datasets for analysis and visualisation, several steps
were taken to clean and transform the data. Firstly, we loaded the
TidyTuesday datasets [1]. After this, we ran glimpse() on each dataset
to inspect its structure (the type and first few values of each column)
and summary() to obtain summary statistics for each recorded variable
(column). The code and its output are included below.
Notably, the eclipse annular and the eclipse total datasets have only
811 and 3,330 observations respectively, far fewer than the eclipse
partial 2023 and eclipse partial 2024 datasets, which have 31,363 and
28,844 observations respectively. Missing entries were checked for by
counting the NA values in each column with sapply(). No missing values
were found in the data.
Next, the function tidy_eclipse_data() was defined to
tidy the datasets. Column names were standardised by converting them to
lowercase and replacing spaces with underscores for consistency.
Time-related columns were converted to hms objects to allow for accurate
time-based operations and visualisations, and an eclipse_type column was
added to record which dataset each row came from. A combined eclipse
dataset was then created by binding the four datasets together, giving a
single comprehensive dataset for further analysis. Because the partial
datasets have no eclipse_6 column, that column is NA for their rows. The
combined data was then reshaped from wide to long format so that
eclipse_type and eclipse_event are recorded as variables and the times
as observations, and descriptive labels were assigned to the eclipse
types and event phases to make the data more interpretable. The data was
then grouped by eclipse_type so that further calculations
would be computed per eclipse type. Key summary statistics were computed
to get a better sense of the data and to decide which visualisations
would suit it best: the number of unique location names for each eclipse
type, the average latitude and longitude of the observation points, the
total number of recorded event rows (which, for the partial datasets,
includes the empty sixth phase), and the earliest and latest event
times. These summary statistics are shown below. ridgeline_data was
then derived by keeping only the annular and total eclipse events (that
is, filtering out the partial eclipse events), with the time column
stored as an hms object, for use in later visualisations.
relationships_data was created by selecting only latitude, longitude,
time and eclipse type and removing rows with missing times; it is also
intended for later visualisations. geo_heatmap_data was created by
grouping the combined tidy data by name, latitude, longitude and
eclipse_type, counting the events in each group, and
removing all partial eclipse observations. This is used in the map
visualisation.
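The wide-to-long reshaping can be illustrated on two toy rows. This is a minimal sketch rather than the project code: the town names "Town A" and "Town B" and the object names are made up, and the times are kept as plain strings (not hms objects) purely to keep the example short.

```r
library(dplyr)
library(tidyr)

# One row in the six-phase layout (annular/total) and one in the five-phase layout (partial).
# Town names are invented; times are plain strings for illustration only.
toy_total <- tibble(
  name = "Town A", eclipse_type = "total_2024",
  eclipse_1 = "17:30:40", eclipse_2 = "18:15:50", eclipse_3 = "18:47:35",
  eclipse_4 = "18:51:37", eclipse_5 = "19:23:40", eclipse_6 = "20:08:30"
)
toy_partial <- tibble(
  name = "Town B", eclipse_type = "partial_2024",
  eclipse_1 = "17:43:00", eclipse_2 = "18:24:10", eclipse_3 = "19:02:00",
  eclipse_4 = "19:39:20", eclipse_5 = "20:18:50"
)

# bind_rows() fills the missing eclipse_6 column with NA for the partial row;
# num_range() selects eclipse_1 ... eclipse_6 without also matching eclipse_type
toy_long <- bind_rows(toy_total, toy_partial) %>%
  pivot_longer(
    cols = num_range("eclipse_", 1:6),
    names_to = "eclipse_event",
    values_to = "time"
  )

print(toy_long)
```

Each location now contributes six rows, one per phase, and the partial row carries a single NA time for the phase it does not have. This is why row counts taken from the long data include an empty sixth phase for partial locations unless missing times are removed first.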
# Uncomment to install required R packages
# install.packages("tidytuesdayR")
# install.packages("hms")
# install.packages("ggridges")
# install.packages("viridis")
# install.packages("maps")
# install.packages("rnaturalearth")
# install.packages("gganimate")
# Load required packages
library(stringr)
library(tidyverse)
library(readxl)
library(lubridate)
library(jsonlite)
library(readr)
library(ggplot2)
library(ggridges)
library(viridis)
library(hms)
library(tidytuesdayR)
library(sf)
library(maps)
library(rnaturalearth)
data <- tidytuesdayR::tt_load('2024-04-09')
eclipse_annular_2023 <- data$eclipse_annular_2023
eclipse_partial_2023 <- data$eclipse_partial_2023
eclipse_partial_2024 <- data$eclipse_partial_2024
eclipse_total_2024 <- data$eclipse_total_2024
# Viewing the data
glimpse(eclipse_annular_2023)
## Rows: 811
## Columns: 10
## $ state <chr> "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", …
## $ name <chr> "Chilchinbito", "Chinle", "Del Muerto", "Dennehotso", "Fort …
## $ lat <dbl> 36.49200, 36.15115, 36.18739, 36.82900, 35.74750, 36.71717, …
## $ lon <dbl> -110.0492, -109.5787, -109.4359, -109.8757, -109.0680, -110.…
## $ eclipse_1 <time> 15:10:50, 15:11:10, 15:11:20, 15:10:50, 15:11:40, 15:10:40,…
## $ eclipse_2 <time> 15:56:20, 15:56:50, 15:57:00, 15:56:20, 15:57:40, 15:56:00,…
## $ eclipse_3 <time> 16:30:29, 16:31:21, 16:31:13, 16:29:50, 16:32:28, 16:29:54,…
## $ eclipse_4 <time> 16:33:31, 16:34:06, 16:34:31, 16:34:07, 16:34:35, 16:33:21,…
## $ eclipse_5 <time> 17:09:40, 17:10:30, 17:10:40, 17:09:40, 17:11:30, 17:09:10,…
## $ eclipse_6 <time> 18:02:10, 18:03:20, 18:03:30, 18:02:00, 18:04:30, 18:01:30,…
glimpse(eclipse_partial_2023)
## Rows: 31,363
## Columns: 9
## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", …
## $ name <chr> "Abanda", "Abbeville", "Adamsville", "Addison", "Akron", "Al…
## $ lat <dbl> 33.09163, 31.56471, 33.60231, 34.20268, 32.87907, 33.21435, …
## $ lon <dbl> -85.52703, -85.25912, -86.97153, -87.17800, -87.74090, -86.8…
## $ eclipse_1 <time> 15:41:20, 15:42:30, 15:38:20, 15:37:50, 15:37:20, 15:38:50,…
## $ eclipse_2 <time> 16:23:30, 16:25:50, 16:20:50, 16:19:50, 16:20:40, 16:21:30,…
## $ eclipse_3 <time> 17:11:10, 17:13:50, 17:07:50, 17:06:50, 17:07:30, 17:08:40,…
## $ eclipse_4 <time> 18:00:00, 18:03:10, 17:56:30, 17:55:10, 17:56:00, 17:57:20,…
## $ eclipse_5 <time> 18:45:10, 18:49:30, 18:42:10, 18:40:30, 18:42:50, 18:43:20,…
glimpse(eclipse_partial_2024)
## Rows: 28,844
## Columns: 9
## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", …
## $ name <chr> "Abanda", "Abbeville", "Adamsville", "Addison", "Akron", "Al…
## $ lat <dbl> 33.09163, 31.56471, 33.60231, 34.20268, 32.87907, 33.21435, …
## $ lon <dbl> -85.52703, -85.25912, -86.97153, -87.17800, -87.74090, -86.8…
## $ eclipse_1 <time> 17:43:00, 17:41:40, 17:41:00, 17:41:30, 17:38:40, 17:40:40,…
## $ eclipse_2 <time> 18:24:10, 18:21:40, 18:23:10, 18:24:10, 18:20:40, 18:22:40,…
## $ eclipse_3 <time> 19:02:00, 19:00:30, 19:00:00, 19:00:30, 18:58:00, 18:59:50,…
## $ eclipse_4 <time> 19:39:20, 19:38:50, 19:36:40, 19:36:40, 19:35:00, 19:36:50,…
## $ eclipse_5 <time> 20:18:50, 20:17:20, 20:17:30, 20:18:00, 20:15:50, 20:17:20,…
glimpse(eclipse_total_2024)
## Rows: 3,330
## Columns: 10
## $ state <chr> "AR", "AR", "AR", "AR", "AR", "AR", "AR", "AR", "AR", "AR", …
## $ name <chr> "Acorn", "Adona", "Alexander", "Alicia", "Alix", "Alleene", …
## $ lat <dbl> 34.63879, 35.03993, 34.61859, 35.89291, 35.42200, 33.76482, …
## $ lon <dbl> -94.20011, -92.89913, -92.45122, -91.08345, -93.72878, -94.2…
## $ eclipse_1 <time> 17:30:40, 17:33:20, 17:33:20, 17:37:30, 17:32:50, 17:29:10,…
## $ eclipse_2 <time> 18:15:50, 18:18:30, 18:18:30, 18:22:40, 18:17:50, 18:14:20,…
## $ eclipse_3 <time> 18:47:35, 18:50:08, 18:51:09, 18:54:29, 18:49:54, 18:46:15,…
## $ eclipse_4 <time> 18:51:37, 18:54:22, 18:53:38, 18:58:05, 18:53:00, 18:50:16,…
## $ eclipse_5 <time> 19:23:40, 19:26:10, 19:26:20, 19:29:50, 19:25:20, 19:22:30,…
## $ eclipse_6 <time> 20:08:30, 20:10:50, 20:11:10, 20:14:10, 20:10:00, 20:07:40,…
summary(eclipse_annular_2023)
## state name lat lon
## Length:811 Length:811 Min. :27.22 Min. :-124.45
## Class :character Class :character 1st Qu.:31.30 1st Qu.:-111.98
## Mode :character Mode :character Median :35.42 Median :-106.70
## Mean :35.41 Mean :-108.05
## 3rd Qu.:38.42 3rd Qu.:-101.36
## Max. :44.87 Max. : -96.72
## eclipse_1 eclipse_2 eclipse_3 eclipse_4
## Length:811 Length:811 Length:811 Length:811
## Class1:hms Class1:hms Class1:hms Class1:hms
## Class2:difftime Class2:difftime Class2:difftime Class2:difftime
## Mode :numeric Mode :numeric Mode :numeric Mode :numeric
##
##
## eclipse_5 eclipse_6
## Length:811 Length:811
## Class1:hms Class1:hms
## Class2:difftime Class2:difftime
## Mode :numeric Mode :numeric
##
##
summary(eclipse_partial_2023)
## state name lat lon
## Length:31363 Length:31363 Min. :17.96 Min. :-176.60
## Class :character Class :character 1st Qu.:35.36 1st Qu.: -97.50
## Mode :character Mode :character Median :39.56 Median : -89.26
## Mean :38.80 Mean : -91.97
## 3rd Qu.:41.93 3rd Qu.: -81.14
## Max. :71.25 Max. : 174.11
## eclipse_1 eclipse_2 eclipse_3 eclipse_4
## Length:31363 Length:31363 Length:31363 Length:31363
## Class1:hms Class1:hms Class1:hms Class1:hms
## Class2:difftime Class2:difftime Class2:difftime Class2:difftime
## Mode :numeric Mode :numeric Mode :numeric Mode :numeric
##
##
## eclipse_5
## Length:31363
## Class1:hms
## Class2:difftime
## Mode :numeric
##
##
summary(eclipse_partial_2024)
## state name lat lon
## Length:28844 Length:28844 Min. :17.96 Min. :-176.60
## Class :character Class :character 1st Qu.:35.24 1st Qu.: -99.08
## Mode :character Mode :character Median :39.52 Median : -90.30
## Mean :38.76 Mean : -93.00
## 3rd Qu.:42.04 3rd Qu.: -81.16
## Max. :71.25 Max. : 174.11
## eclipse_1 eclipse_2 eclipse_3 eclipse_4
## Length:28844 Length:28844 Length:28844 Length:28844
## Class1:hms Class1:hms Class1:hms Class1:hms
## Class2:difftime Class2:difftime Class2:difftime Class2:difftime
## Mode :numeric Mode :numeric Mode :numeric Mode :numeric
##
##
## eclipse_5
## Length:28844
## Class1:hms
## Class2:difftime
## Mode :numeric
##
##
summary(eclipse_total_2024)
## state name lat lon
## Length:3330 Length:3330 Min. :28.45 Min. :-101.16
## Class :character Class :character 1st Qu.:35.42 1st Qu.: -92.41
## Mode :character Mode :character Median :39.24 Median : -86.56
## Mean :38.33 Mean : -86.93
## 3rd Qu.:41.22 3rd Qu.: -82.31
## Max. :46.91 Max. : -67.43
## eclipse_1 eclipse_2 eclipse_3 eclipse_4
## Length:3330 Length:3330 Length:3330 Length:3330
## Class1:hms Class1:hms Class1:hms Class1:hms
## Class2:difftime Class2:difftime Class2:difftime Class2:difftime
## Mode :numeric Mode :numeric Mode :numeric Mode :numeric
##
##
## eclipse_5 eclipse_6
## Length:3330 Length:3330
## Class1:hms Class1:hms
## Class2:difftime Class2:difftime
## Mode :numeric Mode :numeric
##
##
sapply(eclipse_annular_2023, function(x) sum(is.na(x)))
## state name lat lon eclipse_1 eclipse_2 eclipse_3 eclipse_4
## 0 0 0 0 0 0 0 0
## eclipse_5 eclipse_6
## 0 0
sapply(eclipse_partial_2023, function(x) sum(is.na(x)))
## state name lat lon eclipse_1 eclipse_2 eclipse_3 eclipse_4
## 0 0 0 0 0 0 0 0
## eclipse_5
## 0
sapply(eclipse_partial_2024, function(x) sum(is.na(x)))
## state name lat lon eclipse_1 eclipse_2 eclipse_3 eclipse_4
## 0 0 0 0 0 0 0 0
## eclipse_5
## 0
sapply(eclipse_total_2024, function(x) sum(is.na(x)))
## state name lat lon eclipse_1 eclipse_2 eclipse_3 eclipse_4
## 0 0 0 0 0 0 0 0
## eclipse_5 eclipse_6
## 0 0
### Define function to tidy dataset
tidy_eclipse_data <- function(df, eclipse_type) {
df <- df %>%
# Standardize column names
rename_with(~ str_to_lower(.) %>% str_replace_all(" ", "_")) %>%
# Convert time columns to `hms` format (e.g., eclipse_1 to eclipse_6)
mutate(across(starts_with("eclipse_"), ~ as_hms(.), .names = "{.col}")) %>%
# Add a column for the eclipse type (annular, total, partial)
mutate(eclipse_type = eclipse_type)
}
### Tidy the dataset
eclipse_annular_2023 <- tidy_eclipse_data(eclipse_annular_2023, "annular_2023")
eclipse_total_2024 <- tidy_eclipse_data(eclipse_total_2024, "total_2024")
eclipse_partial_2023 <- tidy_eclipse_data(eclipse_partial_2023, "partial_2023")
eclipse_partial_2024 <- tidy_eclipse_data(eclipse_partial_2024, "partial_2024")
combined_eclipse_data <- bind_rows(
eclipse_annular_2023,
eclipse_total_2024,
eclipse_partial_2023,
eclipse_partial_2024
)
glimpse(combined_eclipse_data)
## Rows: 64,348
## Columns: 11
## $ state <chr> "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ…
## $ name <chr> "Chilchinbito", "Chinle", "Del Muerto", "Dennehotso", "Fo…
## $ lat <dbl> 36.49200, 36.15115, 36.18739, 36.82900, 35.74750, 36.7171…
## $ lon <dbl> -110.0492, -109.5787, -109.4359, -109.8757, -109.0680, -1…
## $ eclipse_1 <time> 15:10:50, 15:11:10, 15:11:20, 15:10:50, 15:11:40, 15:10:…
## $ eclipse_2 <time> 15:56:20, 15:56:50, 15:57:00, 15:56:20, 15:57:40, 15:56:…
## $ eclipse_3 <time> 16:30:29, 16:31:21, 16:31:13, 16:29:50, 16:32:28, 16:29:…
## $ eclipse_4 <time> 16:33:31, 16:34:06, 16:34:31, 16:34:07, 16:34:35, 16:33:…
## $ eclipse_5 <time> 17:09:40, 17:10:30, 17:10:40, 17:09:40, 17:11:30, 17:09:…
## $ eclipse_6 <time> 18:02:10, 18:03:20, 18:03:30, 18:02:00, 18:04:30, 18:01:…
## $ eclipse_type <chr> "annular_2023", "annular_2023", "annular_2023", "annular_…
combined_eclipse_data_long <- combined_eclipse_data %>%
pivot_longer(
cols = c(eclipse_1, eclipse_2, eclipse_3, eclipse_4, eclipse_5, eclipse_6),
names_to = "eclipse_event",
values_to = "time"
) %>% # Changing the combined dataset from wide to long format: eclipse event recorded as variables and timings as observations
mutate(
# Create readable labels for `eclipse_type`
eclipse_type = case_when(
eclipse_type == "annular_2023" ~ "2023 Annular Eclipse",
eclipse_type == "total_2024" ~ "2024 Total Eclipse",
eclipse_type == "partial_2023" ~ "2023 Partial Eclipse",
eclipse_type == "partial_2024" ~ "2024 Partial Eclipse"
),
# Make `eclipse_event` descriptive
eclipse_event = case_when(
eclipse_event == "eclipse_1" ~ "First Contact",
eclipse_event == "eclipse_2" ~ "50% Eclipse Start",
eclipse_event == "eclipse_3" ~ if_else(eclipse_type == "2023 Annular Eclipse" | eclipse_type == "2024 Total Eclipse", "Annularity/Totality Begins", "100% Eclipse Max"),
eclipse_event == "eclipse_4" ~ if_else(eclipse_type == "2023 Annular Eclipse" | eclipse_type == "2024 Total Eclipse", "Annularity/Totality Ends", "50% Eclipse End"),
eclipse_event == "eclipse_5" ~ if_else(eclipse_type == "2023 Annular Eclipse" | eclipse_type == "2024 Total Eclipse", "50% Eclipse End", "Last Contact"),
eclipse_event == "eclipse_6" ~ "Last Contact"
)
)
glimpse(combined_eclipse_data_long)
## Rows: 386,088
## Columns: 7
## $ state <chr> "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "A…
## $ name <chr> "Chilchinbito", "Chilchinbito", "Chilchinbito", "Chilchi…
## $ lat <dbl> 36.49200, 36.49200, 36.49200, 36.49200, 36.49200, 36.492…
## $ lon <dbl> -110.0492, -110.0492, -110.0492, -110.0492, -110.0492, -…
## $ eclipse_type <chr> "2023 Annular Eclipse", "2023 Annular Eclipse", "2023 An…
## $ eclipse_event <chr> "First Contact", "50% Eclipse Start", "Annularity/Totali…
## $ time <time> 15:10:50, 15:56:20, 16:30:29, 16:33:31, 17:09:40, 18:02…
head(combined_eclipse_data_long)
## # A tibble: 6 × 7
## state name lat lon eclipse_type eclipse_event time
## <chr> <chr> <dbl> <dbl> <chr> <chr> <time>
## 1 AZ Chilchinbito 36.5 -110. 2023 Annular Eclipse First Contact 15:10:50
## 2 AZ Chilchinbito 36.5 -110. 2023 Annular Eclipse 50% Eclipse Start 15:56:20
## 3 AZ Chilchinbito 36.5 -110. 2023 Annular Eclipse Annularity/Total… 16:30:29
## 4 AZ Chilchinbito 36.5 -110. 2023 Annular Eclipse Annularity/Total… 16:33:31
## 5 AZ Chilchinbito 36.5 -110. 2023 Annular Eclipse 50% Eclipse End 17:09:40
## 6 AZ Chilchinbito 36.5 -110. 2023 Annular Eclipse Last Contact 18:02:10
summary_statistics <- combined_eclipse_data_long %>%
group_by(eclipse_type) %>% # Subsequent calculations will be done by eclipse type
summarize(
num_locations = n_distinct(name), # Count unique locations per eclipse type
avg_latitude = mean(lat, na.rm = TRUE), # Average latitude
avg_longitude = mean(lon, na.rm = TRUE), # Average longitude
total_events = n(), # Total number of events for each eclipse type
earliest_event = hms::as_hms(min(time, na.rm = TRUE)), # Convert to readable format using hms
latest_event = hms::as_hms(max(time, na.rm = TRUE)) # Convert to readable format using hms
)
# Display summary statistics
head(summary_statistics)
## # A tibble: 4 × 7
## eclipse_type num_locations avg_latitude avg_longitude total_events
## <chr> <int> <dbl> <dbl> <int>
## 1 2023 Annular Eclipse 795 35.4 -108. 4866
## 2 2023 Partial Eclipse 20664 38.8 -92.0 188178
## 3 2024 Partial Eclipse 19576 38.8 -93.0 173064
## 4 2024 Total Eclipse 2938 38.3 -86.9 19980
## # ℹ 2 more variables: earliest_event <time>, latest_event <time>
ridgeline_data <- combined_eclipse_data_long %>%
mutate(time = as_hms(time)) %>%
filter(eclipse_type %in% c("2023 Annular Eclipse", "2024 Total Eclipse"))
glimpse(ridgeline_data)
## Rows: 24,846
## Columns: 7
## $ state <chr> "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "A…
## $ name <chr> "Chilchinbito", "Chilchinbito", "Chilchinbito", "Chilchi…
## $ lat <dbl> 36.49200, 36.49200, 36.49200, 36.49200, 36.49200, 36.492…
## $ lon <dbl> -110.0492, -110.0492, -110.0492, -110.0492, -110.0492, -…
## $ eclipse_type <chr> "2023 Annular Eclipse", "2023 Annular Eclipse", "2023 An…
## $ eclipse_event <chr> "First Contact", "50% Eclipse Start", "Annularity/Totali…
## $ time <time> 15:10:50, 15:56:20, 16:30:29, 16:33:31, 17:09:40, 18:02…
relationships_data <- combined_eclipse_data_long %>%
select(lat, lon, time, eclipse_type)%>%
filter(!is.na(time))
glimpse(relationships_data)
## Rows: 325,881
## Columns: 4
## $ lat <dbl> 36.49200, 36.49200, 36.49200, 36.49200, 36.49200, 36.4920…
## $ lon <dbl> -110.0492, -110.0492, -110.0492, -110.0492, -110.0492, -1…
## $ time <time> 15:10:50, 15:56:20, 16:30:29, 16:33:31, 17:09:40, 18:02:…
## $ eclipse_type <chr> "2023 Annular Eclipse", "2023 Annular Eclipse", "2023 Ann…
geo_heatmap_data <- combined_eclipse_data_long %>%
group_by(name, lat, lon, eclipse_type) %>%
summarise(event_count = n(), .groups = "drop") %>% # Count events per location
mutate(event_count= as.numeric(event_count)) %>%
filter(eclipse_type %in% c("2023 Annular Eclipse", "2024 Total Eclipse")) # %in% keeps rows matching either label; == would recycle the vector
glimpse(geo_heatmap_data)
## Rows: 4,141
## Columns: 5
## $ name <chr> "Abbott", "Abeytas", "Abington", "Ackerly", "Acomita Lake…
## $ lat <dbl> 31.88514, 34.46520, 39.73314, 32.52506, 35.06888, 34.6387…
## $ lon <dbl> -97.07411, -106.81376, -84.96796, -101.71585, -107.61456,…
## $ eclipse_type <chr> "2024 Total Eclipse", "2023 Annular Eclipse", "2024 Total…
## $ event_count <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, …
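Keeping only the annular and total rows means matching a column against two labels at once. The following sketch, on a made-up vector, shows how `%in%` and `==` differ in that situation; the object names `types` and `wanted` are hypothetical.

```r
# Made-up vector of eclipse types, in an arbitrary order
types  <- c("2023 Annular Eclipse", "2024 Total Eclipse",
            "2024 Total Eclipse", "2023 Annular Eclipse")
wanted <- c("2023 Annular Eclipse", "2024 Total Eclipse")

# `==` recycles `wanted` and compares position by position,
# so a row is kept only if it happens to line up with the recycled label
by_equals <- types == wanted    # TRUE TRUE FALSE FALSE

# `%in%` asks, for each element, whether it matches any of the wanted labels
by_in <- types %in% wanted      # TRUE TRUE TRUE TRUE

print(by_equals)
print(by_in)
```

With `==`, half of the matching rows in this toy vector would be silently dropped, which is why `%in%` is the safer choice when filtering on more than one value.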
This geographic map visualisation is designed to explore how eclipse
visibility varies across different states or regions in the United
States. It shows where each eclipse type was recorded across the
states. The variables used in this visualisation are the geographic
coordinates (longitude and latitude) of the
observation points, the eclipse_type (distinguishing
between "2023 Annular Eclipse" and
"2024 Total Eclipse"), and the state
boundaries provided by the ne_states() function [2]. The map
uses distinct colours (red for the 2023 Annular Eclipse and
blue for the 2024 Total Eclipse) to represent the two
eclipse types, plotted as points on a map of the US. This type of
visualisation is ideal for answering the question because it provides a
clear spatial representation of where in the US each eclipse type was
observed. Plotting the data geographically makes clear which areas lay
in the path of each eclipse, whether any areas lay in both paths, and
how the observation points are distributed. The state boundaries help
contextualise the observations, making it easier to interpret how the
visibility of the eclipse varies geographically. Additionally, the clean
design ensures that the focus remains on the relationship between
geographic location and eclipse type, effectively addressing the
question at hand.
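As a numerical complement to the map, one could tabulate how many listed locations each state has for each eclipse type. The sketch below uses only four rows visible in the glimpse() output (two in Arizona, two in Arkansas), so the counts are not the real per-state totals; `toy_locations` and `locations_by_state` are hypothetical names.

```r
library(dplyr)

# Four rows taken from the glimpse() output of the annular and total datasets
toy_locations <- tibble(
  state = c("AZ", "AZ", "AR", "AR"),
  name  = c("Chilchinbito", "Chinle", "Acorn", "Adona"),
  eclipse_type = c("2023 Annular Eclipse", "2023 Annular Eclipse",
                   "2024 Total Eclipse", "2024 Total Eclipse")
)

# Number of listed locations per state and eclipse type
locations_by_state <- toy_locations %>%
  count(state, eclipse_type, name = "n_locations")

print(locations_by_state)
```

Applied to the full data, a table like this would give exact figures for the patterns the map shows visually.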
us_map <- ne_states(country = "United States of America", returnclass = "sf")
# converting dataset into sf object for spatial plotting
# 'crs = 4326' defines the coordinate reference system as WGS84, standard for geographic data
eclipse_data_geom_2023 <- st_as_sf(geo_heatmap_data, coords = c("lon", "lat"), crs = 4326)
ggplot() +
geom_sf(data = us_map, fill = "lightblue", color = "black") + # Add the U.S. map layer using 'geom_sf', 'data = us_map' specifies the U.S. state boundaries
geom_sf(data = eclipse_data_geom_2023, aes(color = eclipse_type), size = 1.2) + # 'data = eclipse_data_geom_2023' specifies the eclipse data converted to spatial format
scale_color_manual(values = c("2023 Annular Eclipse" = "indianred", "2024 Total Eclipse" = "steelblue")) + # set the plot features
labs(title = "Eclipse Events Across the U.S. in 2023 and 2024",
color = "Eclipse Type") + # no shape aesthetic is mapped, so no shape label is needed
coord_sf(xlim = c(-125, -65), ylim = c(20, 52)) +
theme_minimal() +
theme(legend.position = "top",
plot.title = element_text(hjust = 0.5, size = 15, face = "bold"))
This ridgeline plot visualises how eclipse visibility varies across
different states in the US by showing the distribution of eclipse phases
over time for the 2023 Annular Eclipse and the
2024 Total Eclipse. The variables used in this
visualisation are the time of the different eclipse phases,
state and eclipse_type. The variable
eclipse_type is used as the fill aesthetic to distinguish
between the 2023 Annular Eclipse and the
2024 Total Eclipse, with a unique colour assigned to each
type (orange for 2023 and blue for 2024). The visualisation is ideal for
addressing the question because it provides a clear representation of
how the timing of eclipse visibility varies across states. The ridgeline
format allows for easy comparison of the time distributions for each
state, highlighting patterns such as whether certain states experienced
earlier or later phases of the eclipses. Furthermore, the height of each
ridgeline reflects how concentrated the recorded eclipse phase times are
within a given time interval for that state. Taller peaks mean that more
locations in the state recorded a phase around that time, while shorter
peaks mean that fewer did. Setting the x-axis limits to the earliest and
latest observed eclipse events restricts the x-axis to the actual time
range of the eclipse events. This prevents the ridgeline plot’s
smoothing effect from suggesting activity at times when no eclipse
events occurred. Furthermore, the distinct colours make it easy to tell
the 2023 Annular Eclipse and the
2024 Total Eclipse apart. The visualisation is both aesthetically
appealing and easy to interpret. This ridgeline plot effectively
captures temporal and spatial variations in eclipse visibility.
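The height of a ridge is a kernel density estimate of the recorded phase times. A minimal sketch of that idea, using base R's density() on made-up times for a single hypothetical state (the values are illustrative and not taken from the datasets):

```r
# Made-up phase times in seconds since midnight:
# five clustered between 17:30 and 17:34, and one isolated at 20:00
times_sec <- c(63000, 63060, 63120, 63180, 63240, 72000)

# Kernel density estimate with the default bandwidth
dens <- density(times_sec)

# The time at which the estimated density is highest
peak_time <- dens$x[which.max(dens$y)]

# The peak falls inside the cluster, not near the isolated 20:00 value
print(peak_time)
```

The curve is tall where many times fall close together and low around the isolated value, which is how the ridgeline plot should be read. Because the estimate is smoothed, it also extends slightly beyond the earliest and latest observations, which is the reason for restricting the x-axis limits.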
# Earliest and latest timing of eclipses
earliest_event <- summary_statistics %>%
filter(eclipse_type == "2023 Annular Eclipse") %>%
pull(earliest_event)
latest_event <- summary_statistics %>%
filter(eclipse_type == "2024 Total Eclipse") %>%
pull(latest_event)
#Creating ridgeline plot
ggplot(ridgeline_data %>%
filter(eclipse_type %in% c("2023 Annular Eclipse", "2024 Total Eclipse")), # Subset the data to include only the "2023 Annular Eclipse" and "2024 Total Eclipse" rows
aes(x = time, y = state, fill = eclipse_type)) +
geom_density_ridges(alpha = 0.5, scale = 1.2, # Adds the ridgeline plot layer
rel_min_height = 0.01) +
labs(title = "Distribution of Eclipse Phases by State",
x = "Time of Eclipse Phases", # Time of eclipse phases is mapped to the x-axis
y = "State") + # States are mapped to the y-axis as categories for ridgelines
theme_minimal(base_size = 15) + # Increase base font size
scale_fill_manual(values = c("2023 Annular Eclipse" = "#FFB74D", # Add more plot features
"2024 Total Eclipse" = "#64B5F6")) +
theme(
legend.title = element_blank(),
panel.background = element_rect(fill = "white"),
plot.title = element_text(hjust = 0.5, size = 15, face = "bold"),
axis.title = element_text(size = 13),
axis.text.y = element_text(size = 10),
axis.text.x = element_text(angle = 45, hjust = 1, size = 10)
)+
scale_x_time(
breaks = scales::breaks_width(1800),
labels = scales::time_format("%H:%M"),
limits = c(hms::as_hms(earliest_event ), hms::as_hms(latest_event)) # Define the limits for the x-axis
)
The pyramid barplot visualises how the number of eclipse observations of
each type is distributed across the different states in the US. The
variables used are state, count (the number of
eclipse event records) and eclipse_type. The y-axis
represents the number of eclipse event records (in hundreds),
with the lower half of the chart representing the 2023 eclipses (Annular
and Partial) and the upper half representing the 2024 eclipses (Total
and Partial). The x-axis represents U.S. states, sorted by total number
of records, with the states having the most records appearing
first. This highlights the contrast between states, showing
which states contain more locations where each eclipse was recorded.
Additionally, the pyramid bar structure contrasts the visibility of
annular or total eclipses with that of partial eclipses and allows the
two years to be compared. This makes it easy to tell
which states have more locations in the path of annularity or totality.
This type of visualisation is ideal for answering the question because
it allows for a direct comparison of eclipse visibility across states
while distinguishing between different types of eclipses and years. The
mirrored structure of the pyramid chart highlights the distribution of
eclipse observations in 2023 versus 2024. The use of colour
coding for the eclipse type further enhances clarity, enabling the
audience to easily differentiate between annular, total, and partial
eclipses. Additionally, the alignment of states along the x-axis allows
for quick identification of states with the highest and lowest eclipse
observations, making this visualisation both informative and visually
intuitive.
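The mirrored layout comes from giving the 2023 counts a negative sign so that their bars extend below zero. A small sketch of that step, with made-up counts for two states (the numbers and the object names `toy_counts` and `mirrored` are hypothetical):

```r
library(dplyr)

# Made-up record counts per state and eclipse type
toy_counts <- tibble(
  state = c("TX", "TX", "OH", "OH"),
  eclipse_type = c("2023 Annular Eclipse", "2024 Total Eclipse",
                   "2023 Partial Eclipse", "2024 Total Eclipse"),
  count = c(300, 500, 200, 400)
)

mirrored <- toy_counts %>%
  mutate(
    # Derive the year from the eclipse type label
    year = if_else(grepl("2023", eclipse_type), "2023", "2024"),
    # 2023 counts become negative so their bars are drawn below the axis
    signed_count = if_else(year == "2023", -count, count)
  )

print(mirrored)
```

When plotting, the axis labels are passed through abs() so that the bars below zero still show positive values.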
### Prepare Eclipse Counts
eclipse_counts <- combined_eclipse_data_long %>%
group_by(state, eclipse_type) %>%
summarise(count = n(), .groups = "drop") %>% # Count records per state and eclipse type, then drop the grouping
mutate(
year = if_else(str_detect(eclipse_type, "2023"), "2023", "2024"),
state = fct_reorder(state, count, .desc = TRUE)
)
# Calculate total eclipse counts for each state
eclipse_counts_total <- eclipse_counts %>%
group_by(state) %>%
summarise(total_count = sum(abs(count))) %>%
ungroup()
# Reorder `state` based on the total eclipse count in decreasing order
eclipse_counts_pyramid_sorted <- eclipse_counts %>%
left_join(eclipse_counts_total, by = "state") %>%
mutate(
count = if_else(year == "2023", -count, count), # Negative counts for 2023
state = fct_reorder(state, total_count, .desc = TRUE) # Reorder by total eclipse counts
) %>%
rename("Eclipse Type" = eclipse_type)
# Plotting the pyramid barplot
ggplot(eclipse_counts_pyramid_sorted, aes(x = state, y = count/100, fill = `Eclipse Type`)) +
geom_bar(stat = "identity", width = 0.7) +
scale_y_continuous(
name = "Total Number of Eclipses Observed\n(in Hundreds)",
labels = abs
) +
scale_x_discrete(name = "State") +
scale_fill_manual(values = c("2023 Annular Eclipse" = "orange",
"2024 Total Eclipse" = "blue",
"2023 Partial Eclipse" = "purple",
"2024 Partial Eclipse" = "green")) +
theme_classic()+
theme(
axis.text.x = element_text(angle = 90, hjust = 1,size = 10),
legend.position = "top",
plot.margin = margin(3,3 ,3 ,1),
legend.box.margin = margin(0, 10, 0, 0),
plot.title = element_text(hjust = 0.5, size = 15, face = "bold"),
plot.subtitle = element_text(hjust = 0.5, size = 10),
plot.background = element_rect(fill = "white", colour = "white"),
legend.key.size = unit(0.2, "cm")
) +
labs(
title = "Pyramid Barplot of Eclipse Type Counts by State",
subtitle = "2023 Eclipses on the Bottom, 2024 Eclipses on the Top"
)
Parani - Wrote the report and carried out final code cleaning
Harish - Carried out initial data cleaning and produced summary statistics
Siddharth - Produced geographical map summaries
Joo Kang - Produced pyramid barplot
Elena - Produced the ridgeline plot
[1] TidyTuesday: “TidyTuesday Project Dataset: 2024-04-09.” Retrieved from: https://github.com/rfordatascience/tidytuesday/blob/master/data/2024/2024-04-09/readme.md
[2] rnaturalearth: South, Andy. (2017). Natural earth: World Map Data from Natural Earth. R package version 0.1.0. https://CRAN.R-project.org/package=rnaturalearth
AI Prompts Used: “what is the best way to visualise the data with columns state, town, latitude, longitude ,eclipse_type, eclipse event I have in the form of US map?”