# Geospatial Plots#

Geospatial plots are great at:

1. Showing how an area is impacted relative to other areas.

2. Revealing which area is impacted.

3. Illustrateing a pattern in the location of something.

Geospatial plots are NOT good at:

1. Showing how one value correlates to another. (Use a line plot for this)

2. Emphasizing the amount of difference between one area and another.

## Contiguous States Only#

Let’s start off by showing the benefits of showing only the contiguous United States. Imagine that we are trying to show which states have the most electoral college votes and perhaps we want to emphasize that California has the most by a long shot.

One the left, we see what many students show–the full 50 states.
One the right, is a much improved image showing only the 48 contiguous states. fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,5))

# Left Side: Plot the entire United States

# Right Side: Plot the contiguous US with a legend
ax2.axis('off') # remove the box and the x,y ticks

# Set the extent of the axes to cut off Alaska and Hawaii
ax2.set_xlim(-130, -65)
ax2.set_ylim(24, 50)


This data is a merge of two different data sources: electoral_college.csv and a ShapeFile of the United States. You’ll note that in the code we were able to read column values "707,558" (a string with a literal comma as the thousands separator), by using the named parameter thousands in the read_csv call. Handy!! We also eliminated a duplicate column using drop. electoral = pd.read_csv('electoral_college.csv', thousands=',')
gdf = us_states.merge(electoral, left_on='NAME', right_on='State', how='left')
gdf = gdf.drop(columns='State') # this is a duplicate of 'NAME'


In this example, we have two geospatial in one figure where there are just three differences.

1. The box and ticks are hidden

2. We clip off Alaska & Hawaii

3. We show a legend

By hiding Alaska and Hawaii, we can bring a much greater focus on the states. Providing a legend is very helpful. And, eliminating the box with longitude and latitude ticks removes an unnecessary distraction since we really don’t care about the lat/long.

In the chart on the right, we can easily see that California is large and brilliant yellow which ephasizes its impact. There are two ways to hide these states. One way is to eliminate them from the GeoDataFrame:
gdf = gdf[(gdf['NAME'] != 'Alaska') & (gdf['NAME'] != 'Hawaii')]
Another way is to set some limits on the x & y axes: ax2.set_xlim(-130, -65)

Eliminating two states from the drawing isn’t so great. We’d like to show Alaska and Hawaii like other maps do using an inset. This can be done, but it does get complicated rather quickly. The inset is a separate plot with separate arguments. When legends are shown, the size of the figure changes and the placement of the inset changes. I provide this code because it is educational to see and play with, but I recommend the method below this.

## Modified Geometry#

It becomes much easier to deal with plotting the full United States if the GeoDataFrame’s geometry is modified to put Alaska and Hawaii in the “right” spot. The only drawbacks are:

• If you do geospatial sjoins, then it’ll all get messed up geographically.

• If you want to illustrate something on the Pacific Ocean, there are states in the way! from shapely.affinity import translate
from shapely.affinity import scale
from shapely.geometry import MultiPolygon

def modify_geometry(gdf, name_col='NAME'):
'''
Changes, in place, the geometry of Alaska & Hawaii in the GeoDataFrame to be "inset"
name is the column that contains the name of the state.
'''
def scale_translate(state_name, ratio=1.0, offset=(0,0)):
# get the geometry and save the index value using the walrus operator
multi_poly = gdf.loc[(index:=gdf[gdf[name_col] == state_name].index), 'geometry']

# scale each polygon in the state's multi-polygon in both x & y direction
multi_poly = MultiPolygon([scale(poly, ratio, ratio) for poly in multi_poly])
# move the state by the offset values
multi_poly = translate(multi_poly, xoff=offset, yoff=offset)
# set the value using the 'at' method because 'loc' won't work
gdf.at[index, 'geometry'] = multi_poly

scale_translate('Hawaii', offset=(45, 5))

fig, ax = plt.subplots(1, figsize=(10,5))

# Plot the entire United States
# remove the box and the x,y ticks
ax.axis('off')

# Set the extent of the main plot to cut off Alaska's long tail of regions
ax.set_xlim(-130, -65)
# without this limit, we get Alaska's tail looking like it's part of Hawaii
# It's better just to eliminate that portion of the plot
ax.set_ylim(24, 50)

# Call the methods above to generate the plot using modified geometry
modify_geometry(gdf)


This data is a merge of two different data sources: electoral_college.csv and a ShapeFile of the United States. You’ll note that in the code we were able to read column values "707,558" (a string with a literal comma as the thousands separator), by using the named parameter thousands in the read_csv call. Handy!! We also eliminated a duplicate column using drop. electoral = pd.read_csv('electoral_college.csv', thousands=',')
gdf = us_states.merge(electoral, left_on='NAME', right_on='State', how='left')
gdf = gdf.drop(columns='State') # this is a duplicate of 'NAME'


When the geometry of the GeoDataFrame is modified, the plotting code is pretty simple and allows for further customizations without crazy code. A few points worthy of calling out:

• We use a walrus operator (:=) to assign a sub-expression to a variable in the middle of a larger expression. This allows us to reuse the value again later on in the code. It would have been just as easy to write it with two lines of code instead of one, but… Bragging Rights!

• When modifying the GeoDataFrame, we used the at method instead of loc. Using loc results in an error: “ValueError: Must have equal len keys and value when setting with an iterable”.

## Annotated with Table Inset#

The problem with the plot above is that it is very hard to see if there is any difference between Orgegon and Nevada; the colors are just too close! To fix that we will annotate each state with the count of the votes printed onto the state’s location. def annotate_states(gdf, ax, value_col, not_states=None, name_col='NAME'):
# annotate the states in the list, states, with values in col_name
# Filter first so that we iterate through the rows that have the state name we want
gdf_states = gdf if not_states is None else gdf[~gdf[name_col].isin(not_states)]
for _, row in gdf_states.iterrows():
# get the location of the state from the geometry
x, y = row.geometry.centroid.x, row.geometry.centroid.y
# add the text with an xytext offset that seems right for our plots size
ax.annotate("{:.0f}".format(row[value_col]), (x, y), xytext=(-10, 0), textcoords='offset points')

# x, y, width, height as % of figure
ax_table = fig.add_axes([.77, .35, .15, .4])
ax_table.axis('off')
table = ax_table.table(cellText=data.values, colLabels=data.columns, cellLoc='center', loc='center')
# Adjust the font size of the table
table.auto_set_font_size(False)
table.set_fontsize(10)
table.scale(1, 1.5)

fig, ax = plt.subplots(1, figsize=(10,5))

# Plot the entire United States
# Choose the 'cool' colormap so the black text appears nicely
legend_kwds={'orientation':'horizontal'}, edgecolor='lightgray')
# remove the box and the x,y ticks
ax.axis('off')

# Set the extent of the main plot to cut off Alaska's long tail of regions
ax.set_xlim(-130, -65)
ax.set_ylim(24, 50)

small = ['District of Columbia', 'Rhode Island', 'New Jersey', 'Delaware', 'Vermont', 'New Hampshire', 'Connecticut']

# Before we can show our table, we need a DataFrame the data.
# Create the data of integer values of the smaller states, sorted by state name, with abbreviations
df = gdf[gdf['NAME'].isin(small)].sort_values(by='NAME')
d = { 'State': ['CN', 'DE', 'DC', 'NH', 'NJ', 'RI', 'VT'],

# show a table of the values for the smaller states

# Be sure we have our modified geometry before calling the code above
modify_geometry(gdf)


This data is a merge of two different data sources: electoral_college.csv and a ShapeFile of the United States. You’ll note that in the code we were able to read column values "707,558" (a string with a literal comma as the thousands separator), by using the named parameter thousands in the read_csv call. Handy!! We also eliminated a duplicate column using drop. electoral = pd.read_csv('electoral_college.csv', thousands=',')
gdf = us_states.merge(electoral, left_on='NAME', right_on='State', how='left')
gdf = gdf.drop(columns='State') # this is a duplicate of 'NAME'


We moved the legend to be horizontal, not because it was necessary, but because we can. This was accomplished by using the legend_kwds argument in the plot method. There are other ways to do this (as always).

There are several issues that we address in this plotting solution:

Problem

Fix

The black text doesn’t show up well with the default colormap.

Change the colormap to ‘cool’.

The smaller states don’t have room for the text. It all gets printed on top of each other and is unreadable.

Avoid annotating values on the states with a small geometry. We just manually do this, however, we could have used the ‘CENSUSAREA’ column to help us out with that. Doing it explicitly is good enough for us.

The smaller states’ values get lost when not shown!

Create an inset table that displays the values

The dataframe of values is floating point

Use the apply method and a lambda to convert all the values to integer. There is probably another way… there always is!

## Finding Obvious Correlations#

Is there a correlation between population and electoral votes? Does the size of a state in square miles correlate to the number of electoral votes? Geospatial provides only minimal insight; a scatter plot is better.

One the top-left, you’ll see a plot representing Electoral Votes by state. Were we to plot a geospatial plot of Population, we’d see something virtually identical and therefore a correrlation would be seen.
One the top-right, you’ll see a plot representing each state’s size in square miles. def plot_correlation(gdf):
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(10,5))

# Plot the entire United States
gdf.plot(ax=ax2, column='CENSUSAREA', legend=True)
# Set the extent of the main plot to cut off Alaska's long tail of regions
for ax in (ax1, ax2):
ax.set_xlim(-130, -65)
ax.set_ylim(24, 50)
ax2.set_title('Square Miles')
gdf.plot(ax=ax4, kind='scatter', y='Electoral Votes', x='CENSUSAREA', cmap='viridis')
gdf.plot(ax=ax3, kind='scatter', y='Electoral Votes', x='2023 Population')

for ax in (ax3, ax4):
ax.grid(linestyle="--", linewidth=0.5, color='.25', zorder=-10)


Same data as above. Were we to plot a geospatial plot of Population, we’d see something virtually identical to the top-left and, therefore, a correrlation would be seen. But, it would take two geospatial plots to show the correlation. And, in this specific case, the correlation is extremely strong; Electoral Votes is, in fact, calculated by a state’s population. In this example, we omitted the second geospatial plot in favor of a scatter plot.

One the top-right, you’ll see a plot representing each state’s size in square miles. Is there a correlation of a state’s size to the count of Electoral Votes? It’s a bit hard to determine from the geospatial plot. However, in combination with the scatter plots, one can be more confident in their assessment.

In summary: we show 4 different plots in a single figure that in combination provide insight. The geospatial plots provide some understanding about where the Electoral Votes go geographically. Showing it side-by-side to the geospatial plot on Square Miles gives you an idea that there is not an obvious correlation between votes and square miles. However, when we add the scatter plots, the correlations become a lot clearer. It is obvious that there is strong and direct relationship between population and votes, while there is very little, if any, correlation of square miles to votes.

## Correlations and Coefficient of Determination#

In this example, we create three values (A, B, C) for each state. There is a strong correlation in the data, but the geospatial plots makes it hard to see how they relate. See if you can spot the correlations in the geospatial plots? Notice how the plots for ‘Value A’ and ‘Value C’ appear to be nearly identical, yet the details of the relationship is somewhat hidden because of the scale on the colormap. Then, the scatter & line plots make it all very apparent. Read more in the Plot Comments and Code Comments tabs. from scipy import stats

def plot_correlations(gdf):
# sort along the x-axis to make line look like a line
gdf.sort_values(by='A')
res_b = stats.linregress(gdf['A'], gdf['B'])
res_c = stats.linregress(gdf['A'], gdf['C'])

fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8)) = plt.subplots(4, 2, figsize=(12,15))

# the linear plots are virtually on top of each other. Spread them out
# horizontally even though this makes the geospatial plots seem too var apart.

max_value = max(gdf['A'].max(), gdf['B'].max(), gdf['C'].max())

# Plot A, B, C values for the entire United States
gdf.plot(ax=ax1, column='A', legend=True)
gdf.plot(ax=ax2, column='B', legend=True, vmin=0, vmax=max_value)
gdf.plot(ax=ax3, column='C', legend=True)
gdf.plot(ax=ax4, column='A', legend=True, vmin=0, vmax=max_value)
gdf.plot(ax=ax5, column='C', legend=True, vmin=0, vmax=max_value)
axes = (ax1, ax2, ax3, ax4, ax5)
# Set the extent of the main plot to cut off Alaska's long tail of regions
for ax in axes:
ax.set_xlim(-130, -65)
ax.set_ylim(24, 50)
for ax, s in zip(axes, ['Value A', 'Value B', 'Value C', 'Value A (on B scale)', 'Value C (on B scale)']):
ax.set_title(s)

# create the scatter plots
for ax in (ax6, ax8):
gdf.plot(ax=ax, kind='scatter', x='A', y='B', color='blue', label='B Value')
for ax in (ax7, ax8):
gdf.plot(ax=ax, kind='scatter', x='A', y='C', marker='s', color='green', label='C Value')

# create a DataFrame with best fit line data
line = gdf[['A', 'B', 'C']].copy()
line['YB'] = res_b.intercept + res_b.slope*line['A']
line['YC'] = res_c.intercept + res_c.slope*line['A']

# plot best fit lines
for ax in (ax6, ax8):
line.plot(ax=ax, kind='line', x='A', y='YB', color='red', label='B fitted line')
for ax in (ax7, ax8):
line.plot(ax=ax, kind='line', x='A', y='YC', color='black', label='C fitted line')

# print the Coefficient of Determination onto the plots
ax6.text(70, 20, f'R2={res_b.rvalue**2:.3f}')
ax7.text(70, 20, f'R2={res_c.rvalue**2:.3f}')
ax8.text(70, 20, f'B R2={res_b.rvalue**2:.3f}\nC R2={res_c.rvalue**2:.3f}')
ax6.set_title('Value B vs Value A')
ax7.set_title('Value C vs Value A')
ax8.set_title('B & C vs Value A')


The values for A, B, and C are contrived. This is only the first 12 rows of the full set. There is a lot going on here. First, let’s discuss the geospatial plots. There appears to be some correlation in all the geospatial plots, but it is hard to figure out. One might think that Value A and Value C are identical. However, when we plot A, B, C all on the same color scale, you can see that Value A (on B scale) is quite different from Value C (on B scale). Can you see the relationship between Value A and Value B? I think you can see that there is some relationship, but it is hard to understand what it is and how strong it is.

Now, let’s look at the scatter and line plots. The dots alone shows a clear linear relationship. When we add the line, the correlation is bolstered.

But, how strong is the correlation? To determine this, we displayed the Coefficient of Determination value on the plot directly. The larger the value (closer to 1.0), the strong the correlation. The $$R^2$$ value basically tells us what percentage of the value on the y-axis is due to the value on the x-axis. We see that the values of $$R^2$$ are 0.736 and 0.948 respectively. These are quite high.

Lastly, the best plot is the last one which combines all the information into a single plot. If you’re trying to show correlation with Value A, the last plot is all you need.

In this plot, we generated the line using the method stats that we imported with: from scipy import stats. We calculated the slope and y-intercept from the data points and used that to plot the line. It was critically important to first sort the values by the x-axis (in our case, by Value A).

We could have generated the line in the plot using seaborn.regplot. While this would have drawn our line for us automatically, it does not provide the Coefficient of Determination. So, instead, we get all the Line of Best Fit information using stats.linregress.

To present the $$R^2$$ value on the plot, we need to choose where to “print” it using the ax.text method. To accomplish this, we plotted the graph with the text first, identified an open space, and then added the printing code. The $$R^2$$ value is a lot of digits after the decimal place that just get in the way. We choose to print the value using Python’s string formatting functionality as built into Pythons f-string. The API is a bit complicated to understand. It is similar to the way Java implements printf, only it has different escape sequences and specifics. In short, when a string literal is preceded with ‘f’ then the string is expecting to have some formatting sequence in it. A formatting sequence is embedded inside curly braces {} and the text inside the curly braces is interpreted. It would contain identifiers, method calls, and possibly formatting options following a colon. Here is the syantax for printing a floating point number with some text label:

f'text{<identifer>:.<integer>f}'


For example, let’s say we do the following:

gpa = 3.899
print(f'GPA={gpa:.2f}')


That code results in printing, GPA=3.90 (due to rounding to 2 decimal places).

To set the color scales to match in each of the geospatial plots, we set both vmin and vmax. When we don’t set these, the plot will customize the colors to match the data being plotted. This is often what you want to do, but if you’re attempting to see a relationship between two geospatial plots, setting vmin and vmax are important.

The vertical spacing between the all the plots was originally not good enough. The line plots had their x-axis’ labels occluded by the titles of the graphs below them. So, we added vertical spacing with plt.subplots_adjust(hspace=0.3). This actually makes the spacing be too large in the geospatial plots. We could have customized the spacing across all the subplots by using matplotlib GridSpec. But, since that code is pretty foreign and not at all the focus of this discussion, we left the spacing as-is.

## Cartopy#

You can make some cool plots that show the terrain of the Earth using a libary called cartopy. These types of plots are good to use when you’re attempting to illustrate the location of something as it would appear on a map.

• The plots are professional looking.

• Does not work on Replit

• Requires some background knowledge about different types of Coordinate and Projection systems.

You can still do some geodataframe-looking plots. For example, the Hurricane Florence plot is replicated on Hurricane Katrina using this code.

This shows a direct, “flat” Plate Carrée line between Seattle and New York as well as the curved, Geodetic line that an airplane would travel over the curved Earth, which is a more accurate reflection of reality. import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import cartopy.feature as cfeature

def cartopy_plot_of_us():
# create a figure & axis using the Robinson projection.
fig, ax = plt.subplots(1, subplot_kw={'projection': ccrs.Robinson()}, figsize=(10,4))

# show some texture & color on the map representing the terrain
ax.stock_img()

# crop the figure to be just the United States
# if we don't set_extent, the image is too big to be displayed (in Jupyter Notebook)
ax.set_extent([-130, -60, 24, 50], crs=ccrs.PlateCarree())

# set the latitude and longitude points for Seattle & New york
ny_lat, ny_lon = 40.73, -73.93
sea_lat, sea_lon = 47.60, -122.33

# draw a line using Geodetic() to give it a curved look over a curved Earth
plt.plot([ny_lon, sea_lon], [ny_lat, sea_lat], color='blue', linewidth=2, marker='o',
transform=ccrs.Geodetic())

# draw a line using PlateCarree() which assume a flat, paper-like map
plt.plot([ny_lon, sea_lon], [ny_lat, sea_lat], color='gray', linestyle='--',
transform=ccrs.PlateCarree())

# annotate the map with text of cities
plt.text(ny_lon-3, ny_lat-1, 'New York', horizontalalignment='right', transform=ccrs.Geodetic())
plt.text(sea_lon+1, sea_lat-2, 'Seattle', horizontalalignment='left', transform=ccrs.Geodetic())

# Show some features and borders

The Projection types are many. One popular projection is: ccrs.PlateCarree().