Notebook 1 - Exploratory Data Analysis (EDA)
2 Introduction
Welcome to the first notebook of Module 2 of the State IMPACT Collaborative Applied Data Analytics training! The notebooks in this module will dive into State IMPACT data that has been provided by the State of Arkansas. This State IMPACT data has also been linked to various other sources. All of these data tables will be provided to you in this workspace, and these notebooks will introduce them and provide possible research avenues and other ways to connect and utilize these data tables.
In this workbook, we will explore the primary employment outcome-focused data sets used in this course: the Participant Individual Record Layout (PIRL) data, the Arkansas Unemployment Insurance (UI) wage records, and the Arkansas Quarterly Census of Employment and Wages (QCEW) data. All of these data sets are available in standard formats across states, so any code introduced in this notebook should be portable to your own state data. As you know, there are many other data sources available to you as part of this training - we will walk through the primary ones in optional supplemental sections. For each data set, we will work to answer the following key data discovery questions:
What is the structure of the data set? What is the layout of the data? What variables are available? What are their types? How do we interpret the meaning of a single row?
What is the coverage of the data set? What years, geographic regions, and other subgroups are covered by the data?
What is the distribution of the variables in the data set? Are there missing observations, outliers, and other aspects we need to adjust for later in analysis? Are there any interesting patterns worth further investigation?
In order to access these tables, you will need to follow the instructions in the technical setup section below. For those not interested in learning the details of R and SQL programming, the technical sections can be skipped, and the rest of the notebook can be read to understand the structure of the available data and general data analysis concepts. However, we do encourage everyone to follow the technical setup at least once and get into the data as much as possible, as nothing can replace digging through the data yourself to truly understand it.
The Purpose of these Notebooks
You will now have the opportunity to apply the skills covered in both modules thus far to restricted use Arkansas data. With your team, you will carry out a detailed analysis of this data and prepare a final project showcasing your results.
These workbooks are here to help you along in this project development by showcasing how to apply the techniques discussed in class to the Arkansas data. Part of this notebook will be technical, providing basic code snippets your team can modify to begin parsing through the data. As always, however, there is also an applied data literacy component to these workbooks, which should help you develop a better understanding of the structure and use of the underlying data even if you never write a line of code.
The timeline for completing these workbooks will be provided on the training website and communicated to you in class.
3 Technical Setup
4 Basics of Exploratory Data Analysis
In this notebook we discuss exploratory data analysis (EDA). EDA is usually the first step in any plan of analysis. You can think of it as "kicking the tires" of a data table to see what might be possible to accomplish with the data. In other words, EDA is about determining which research questions might be answerable with the data, how strong any conclusions drawn from the data might be, what considerations are needed to strengthen those conclusions, whether additional data sources could be combined with the data to further your insights, and what caveats you might need to place on the data.
Typical EDA tasks include examining the size and shape of a data set, inspecting the unit of observation and coverage of the data set, inspecting individual columns and their data types (i.e. are they categorical or numeric?), and checking the data for missingness and outliers.
In this notebook, we will be conducting some exploratory data analysis on our Arkansas data. Before we start programming, however, it will be helpful to have some more background information on the available data tables.
5 Arkansas UI wage data: ds_ar_dws.ui_wages_lehd
The Arkansas UI wage data, stored on Redshift as ds_ar_dws.ui_wages_lehd, provides information on all UI-covered employment in Arkansas. This file is supplied to the U.S. Census Bureau for use in processing the Quarterly Workforce Indicators and the Longitudinal Employer-Household Dynamics (LEHD) program, and represents approximately 95% of private non-farm wage and salary employment for establishments located in the state.
dbplyr
In the examples that follow, we also introduce dbplyr, a package for interacting with databases without writing SQL code. dbplyr interfaces with a database using standard dplyr and tidyverse workflows, and complementary dbplyr code is shown for all SQL query examples. Just like the SQL code, it should be run from R using the connection we established in the collapsed "Environment Setup" section above.
Structure
Let’s begin answering our key data discovery questions. First - what is the structure of the UI wage data? There are two dimensions to this question:
- What variables are available in the UI Wage data?
- What does each row in the UI Wage data represent?
To start, just like in the Foundations Module, let’s glance at a few rows of the UI Wage data:
Note: Because the output is easier to parse through in an HTML document, we have written the SQL options so that you can directly copy and paste them into DBeaver. If you would like to run them in RStudio, and not use the dbplyr code, you can do so with the workflow we introduced in the Foundations Module: first run qry <- "INSERT SQL CODE" and then dbGetQuery(con, qry).
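As a reminder, that workflow looks like the minimal sketch below, assuming con is the database connection object created in the Environment Setup section:

```r
# Store the SQL query as a string, then execute it against the connection.
# dbGetQuery() sends the query and returns the result as a data frame.
qry <- "
SELECT *
FROM ds_ar_dws.ui_wages_lehd
LIMIT 5
"
df <- dbGetQuery(con, qry)
```

Any SQL snippet in this notebook can be substituted into qry in the same way.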
SELECT *
FROM ds_ar_dws.ui_wages_lehd
LIMIT 5

con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "ui_wages_lehd")) %>%
head(n = 5)

We can see that there are 16 columns available in the UI Wage data. The UI Wage data dictionary (right-click on the link to open) on the P: drive contains detailed descriptions of each variable.
Just like with the LAUS data, the data dictionary descriptions, while helpful, do not provide the entire context for these variables. We also need to have a clear definition of what each observation - or row - in the UI wage records represents.
To know what we are aiming for, we can find the total number of rows in the UI Wage data:
SELECT COUNT(*)
FROM ds_ar_dws.ui_wages_lehd

con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "ui_wages_lehd")) %>%
tally()

Let's think about the three categories of variables that might appear in our data set to uniquely define our rows.
Unit of observation
The first category to consider is variables that describe the unit of observation. Recall that the unit of observation refers to the type of entity or object about which data is collected. This could be a person, an organization, a state, or any other entity that is the focus of data collection.
In the case of the UI Wage data, our unit of observation is the person - the individual receiving wages in the data set. Besides leveraging background information on the nature of the data, we can identify this because employee_ssn is a person-level variable in our data set representing the smallest single unit identified in the data.
Importantly, recall that the unit of observation alone does not necessarily define a row in our data set. We can test this by counting the number of unique hashed social security numbers in the data relative to the total number of rows:
SELECT COUNT(*), COUNT(DISTINCT(employee_ssn))
FROM ds_ar_dws.ui_wages_lehd

con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "ui_wages_lehd")) %>%
summarise(
n_rows = n(),
n_people = n_distinct(employee_ssn)
)

We see that there are far fewer unique employee_ssn values than total rows in the table. This indicates that employee_ssn alone does not define a row of our data - some individuals must appear in the data more than once. Why might there be multiple observations for the same individual? One of the most common reasons is what we will discuss next: time.
Period of observation
One of the most common reasons why this might be the case is that data about each unit of observation is observed at multiple points in time, with information stored in separate rows corresponding to each distinct point in time. Recall that we can refer to the variables describing the period of observation as the time dimension(s). They represent how often the data is captured or updated. However, note that not every data set will have variables representing the period of observation.
In the UI Wage data, the period of observation is represented by the variables reporting_period_year and reporting_period_quarter, which indicate the calendar year and quarter associated with the employment observation, respectively. Note that in other instances, date variables often encode information about the period of observation.
Let's look at the number of unique combinations of employee_ssn, reporting_period_year, and reporting_period_quarter:
WITH dis as (
SELECT DISTINCT reporting_period_year, reporting_period_quarter, employee_ssn
FROM ds_ar_dws.ui_wages_lehd
)
SELECT COUNT(*)
FROM dis

# tally finds the number of rows
con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "ui_wages_lehd")) %>%
distinct(reporting_period_year, reporting_period_quarter, employee_ssn) %>%
tally()

We still see a difference here between the number of unique person-quarters and the total number of rows from our first query. Perhaps there are some other variables that help constitute a row in the data, or there may be duplicate observations.
Attributes
What about the other 13 variables described in the data dictionary?
These remaining variables represent the information that the wage records contains about our person-quarter observations. There are several different types of attributes that we can see in our data dictionary:
- Individual attributes: nhr_ui_wage_id, employee_first_name (hashed), employee_middle_name (hashed), and employee_last_name (hashed) describe the individual associated with the employment observation
- Employer attributes: ui_account_number, reporting_unit_number, federal_ein, and state_ein describe the employer associated with the employment observation
- Employment attributes: employee_wage_amount, hours, and weeks describe information about the employment. Note that hours and weeks are both not populated.
- Miscellaneous: reference_state and reporting_period (a concatenated version of reporting_period_year and reporting_period_quarter)
Here, with the UI wage records, since an individual could have multiple employers within our period of observation, a unique observation should also include the employer. We can verify that claim with the code below:
WITH dis as (
SELECT DISTINCT reporting_period_year, reporting_period_quarter, employee_ssn, federal_ein
FROM ds_ar_dws.ui_wages_lehd
)
SELECT COUNT(*)
FROM dis

con %>%
tbl(in_schema(
schema = "ds_ar_dws",
table = "ui_wages_lehd"
)) %>%
distinct(reporting_period_year, reporting_period_quarter, employee_ssn, federal_ein) %>%
tally()

Any duplication remaining at this granularity should be further investigated.
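As a sketch of what that investigation could look like - an illustrative example, not part of the original notebook - the query below flags person-employer-quarter combinations that appear more than once:

```sql
SELECT reporting_period_year, reporting_period_quarter,
       employee_ssn, federal_ein, COUNT(*) AS n_dups
FROM ds_ar_dws.ui_wages_lehd
GROUP BY reporting_period_year, reporting_period_quarter,
         employee_ssn, federal_ein
HAVING COUNT(*) > 1
ORDER BY n_dups DESC
LIMIT 10
```

Inspecting a few of the flagged combinations can help you decide whether the duplicates are true errors or, for example, corrected wage submissions.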
Coverage
The next step of our exploratory data analysis is to determine the data coverage: what time span, region, and other important subgroups are captured in the data? Luckily, this is pretty straightforward for the UI Wage data: we know that the geographical area is the entire state of Arkansas, and that the group covered is those employed by UI-covered employers. We can confirm this by finding unique values of reference_state in the data:
SELECT DISTINCT(reference_state)
FROM ds_ar_dws.ui_wages_lehd

con %>%
tbl(in_schema(
schema = "ds_ar_dws",
table = "ui_wages_lehd"
)) %>%
distinct(reference_state)

We can follow a similar approach for time periods. That being said, because we know the data is tracked over time, it might be helpful to find the number of rows corresponding to each time frame. To do so, we can group our data by quarter, this time using reporting_period, which combines reporting_period_year and reporting_period_quarter into a single string, and aggregate the number of rows.
Note: On the SQL tab, since we are planning on reading this output into R, we have used the qry <- "INSERT SQL CODE" and dbGetQuery(con, qry) workflow.
SELECT reporting_period, COUNT(*)
FROM ds_ar_dws.ui_wages_lehd
GROUP BY reporting_period
ORDER BY reporting_period

counts_by_qtr <- con %>%
tbl(in_schema(
schema = "ds_ar_dws",
table = "ui_wages_lehd"
)) %>%
select(reporting_period) %>%
group_by(reporting_period) %>%
tally() %>%
ungroup() %>%
arrange(reporting_period) %>%
collect() # extract data into memory
counts_by_qtr

We can read through the table above and begin to get a feeling for the number of individuals covered in each quarter, but you can also use this same summary data set to quickly calculate our coverage:
counts_by_qtr %>%
summarize(
min_qtr = min(reporting_period),
max_qtr = max(reporting_period)
)

From this, we see that the UI Wage data begins in Q1 2011 and ends in Q4 2022.
We can also find the quarter with the highest number of observations:
counts_by_qtr %>%
filter(
n == max(n)
)

Select Variable Distribution
The final part of data discovery is investigating the distribution of key variables in our data set, and documenting any irregularities such as missingness, outliers, invalid values, or other issues that need to be addressed in an analysis leveraging these data components. We’ll look at a few of the variables from the UI Wage data now, but this type of exploration is more of an art than a science, and the continued exploration of these variables will be an essential component of your project in this program. As a reminder, you are encouraged to use this code as inspiration in your own exploration.
reporting_period
Let’s begin by taking the same data pull we just used to examine the data coverage, and plot the data using a simple line plot:
counts_by_qtr %>%
ggplot(aes(
x = reporting_period,
y = as.numeric(n),
group = 1
)) +
geom_line() +
labs(x = "Quarter", y = "Count", title = "Count of UI Wage Recipients by Quarter") +
# rotate x-axis text to avoid quarter overlap
theme(axis.text.x = element_text(angle = 45, hjust = 1))

Immediately, we can see a seasonal pattern, with higher values in Q2 and Q3 each year and a decline in records in 2020, which should make sense given Arkansas's economic situation at the time.
Does anything else catch your eye? Take some time to examine this graph.
employee_wage_amount
We can also examine patterns in the average wage received by employees during each period. Here is a query to perform this calculation:
SELECT AVG(employee_wage_amount) as avg_wage, reporting_period
FROM ds_ar_dws.ui_wages_lehd
GROUP BY reporting_period
ORDER BY reporting_period

avg_by_qtr <- con %>%
tbl(
in_schema(schema = "ds_ar_dws", table = "ui_wages_lehd")
) %>%
select(employee_wage_amount, reporting_period) %>%
group_by(reporting_period) %>%
summarize(
avg_wage = mean(employee_wage_amount)
) %>%
ungroup() %>%
arrange(reporting_period) %>%
collect()
avg_by_qtr

And we can visualize the data using a simple line plot:
avg_by_qtr %>%
ggplot(aes(
x = reporting_period,
y = as.numeric(avg_wage),
group = 1
)) +
geom_line() +
labs(x = "Quarter", y = "Average Wage", title = "Average of UI Wages Received by Quarter") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

We also see seasonality in average wages received, with Q4 and Q1 of each year showing higher average wages. Unlike the total number of recipients in the previous plot, however, we do not see a decline in 2020.
6 Arkansas QCEW data: ds_ar_dws.qcew
The QCEW, or Quarterly Census of Employment and Wages, is a data set containing information on UI-covered employers in Arkansas, and often works in tandem with the UI wage data. You can find the QCEW data dictionary on the P drive.
Structure
First, just as we did in the previous section, we will explore the basic structure of the data by examining the first few rows of the table:
SELECT *
FROM ds_ar_dws.qcew
LIMIT 5

con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "qcew")) %>%
head(5)

We see there are 113 columns in the table. Consult the data dictionary for descriptions of these columns.
Similar to the one for the UI wage records, the QCEW data dictionary does not specifically define a row. As a reference point, we can find the total number of rows in the data:
SELECT COUNT(*)
FROM ds_ar_dws.qcew

con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "qcew")) %>%
tally()

Unit of observation
QCEW data is reported by employer, so the unit of observation is the employer. The employer, though, is referenced by a handful of variables - including the federal Employer Identification Number (EIN), the UI account number (or state EIN), and the reporting unit number - and may be defined differently depending on the situation. Here, we will define an employer by their federal EIN (ein in the data), as this corresponds to the employer paying into the UI trust fund and also allows for cross-state linkages. That being said, multi-unit employers may appear as separate rows with the same federal EIN but different reporting unit numbers.
Let’s find the number of unique employers in the data and compare that to the total number of rows:
SELECT COUNT(*), COUNT(DISTINCT(ein))
FROM ds_ar_dws.qcew

con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "qcew")) %>%
summarise(
n_rows = n(),
n_employers = n_distinct(ein)
)

We can go a step further and include the reporting unit number to account for multi-unit establishments:
WITH dis AS (
SELECT DISTINCT ein, reporting_unit_number
FROM ds_ar_dws.qcew
)
SELECT COUNT(*)
FROM dis

con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "qcew")) %>%
distinct(ein, reporting_unit_number) %>%
tally()

Beyond this, based on our exploration of the UI wage records, it makes sense that a row is not solely defined by an employer or an employer unit - recall that this information is tracked over time.
Period of observation
Whereas the period of observation in the UI Wage data is represented by reporting_period_year and reporting_period_quarter, the QCEW contains reference_year and reference_quarter variables.
We can find the number of unique combinations of ein, reporting_unit_number, reference_year, and reference_quarter combined:
WITH dis AS (
SELECT DISTINCT ein, reporting_unit_number, reference_year, reference_quarter
FROM ds_ar_dws.qcew
)
SELECT COUNT(*)
FROM dis

con %>%
tbl(in_schema(
schema = "ds_ar_dws",
table = "qcew"
)) %>%
distinct(ein, reporting_unit_number, reference_year, reference_quarter) %>%
tally()

As you can see, we do not have complete de-duplication at this level. Upon further investigation, you may find some federal EINs that correspond to multiple state EINs, or vice versa, as well as other examples of potential duplication. That is all to say that if you plan to use the QCEW data, be careful about any remaining duplication!
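One hedged sketch of how to start that investigation - assuming, per the data dictionary conventions used for the UI wage records, that the QCEW carries a state_ein column (check the QCEW data dictionary for the exact name) - is to count how many distinct state EINs map to each federal EIN:

```sql
SELECT ein, COUNT(DISTINCT state_ein) AS n_state_eins
FROM ds_ar_dws.qcew
GROUP BY ein
HAVING COUNT(DISTINCT state_ein) > 1
ORDER BY n_state_eins DESC
LIMIT 10
```

The same pattern, with the roles of the two identifiers swapped, surfaces state EINs that map to multiple federal EINs.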
Attributes
As we saw earlier, the QCEW has a large number of fields, and you should consult the data dictionary for the definitions of these fields. To see the column titles, we can use the queries below.
SHOW COLUMNS FROM TABLE projects.ds_ar_dws.qcew

con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "qcew")) %>%
names()

Between looking at the list of column names above and the data dictionary, you can see that these other variables contain a wealth of employer-related information, such as industry, location, and employment statistics.
Coverage
We will proceed by assessing the coverage of the QCEW data, first to confirm that all employers are located in Arkansas, which is tracked with the state_fips variable:
SELECT DISTINCT(state_fips)
FROM ds_ar_dws.qcew

con %>%
tbl(in_schema(
schema = "ds_ar_dws",
table = "qcew"
)) %>%
distinct(state_fips)

You may have noticed some other variables in the QCEW data that characterize an employer's location beyond the state. One of those variables is county_code, which, as the name implies, contains the county of the corresponding employer. We can see if we have data on employers in every county in Arkansas by counting the number of unique counties:
SELECT COUNT(DISTINCT(county_code))
FROM ds_ar_dws.qcew

con %>%
tbl(in_schema(
schema = "ds_ar_dws",
table = "qcew"
)) %>%
summarise(
n_counties = n_distinct(county_code)
)

For reference, there are 75 counties in Arkansas, and the county codes have not changed since at least the 1970s. But there seem to be more county codes here than that, so we can examine the specific codes:
SELECT DISTINCT(county_code)
FROM ds_ar_dws.qcew
GROUP BY county_code
ORDER BY county_code

con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "qcew")) %>%
distinct(county_code) %>%
arrange(county_code)

The county_code values of 000 and 900-999 are not valid county FIPS codes, and employers with these codes need to be investigated further if you use employer location information in your analysis.
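To gauge how much of the data is affected, a query along the following lines - an illustrative sketch, assuming county_code is stored as a zero-padded character column, as the 000 values suggest - counts the rows carrying each invalid code:

```sql
SELECT county_code, COUNT(*) AS n_rows
FROM ds_ar_dws.qcew
WHERE county_code = '000' OR county_code BETWEEN '900' AND '999'
GROUP BY county_code
ORDER BY n_rows DESC
```

If only a small share of rows carry these codes, dropping or separately flagging them may be sufficient for a location-based analysis.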
We can also look at coverage from a time perspective, counting the number of observations in the QCEW file by reference_year and reference_quarter:
SELECT count(*) as n,
q.reference_year, q.reference_quarter
FROM ds_ar_dws.qcew q
GROUP BY q.reference_year, q.reference_quarter
ORDER BY q.reference_year, q.reference_quarter

cnt_by_quarter <- con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "qcew")) %>%
group_by(reference_year, reference_quarter) %>%
summarise(n=n()) %>%
ungroup() %>%
arrange(reference_year, reference_quarter) %>%
collect()
cnt_by_quarter

cnt_by_quarter %>%
# combine year and quarter into one variable separated by "-"
mutate(period = paste(reference_year, reference_quarter, sep = "-")) %>%
ggplot(aes(x = period, y = as.numeric(n), group = 1))+
geom_line()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))

In this plot, we see continuous growth into 2020, when there is a large spike in the number of observations - this might indicate an underlying data quality issue in 2020.
Select Variable Distribution
naics_code
One of the most commonly used variables in the QCEW table is naics_code, which contains the North American Industry Classification System (NAICS) codes corresponding to specific industries. The codes traditionally exist at the six-digit level and can be truncated into broader codes (e.g., 624120 truncated to 62).
At the six-digit level, let’s see how many unique NAICS codes exist in the QCEW table:
SELECT COUNT(DISTINCT(naics_code))
FROM ds_ar_dws.qcew

con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "qcew")) %>%
summarize(num_codes = n_distinct(naics_code))

That's a lot of codes! In contrast, let's see if working with the 2-digit codes might be a bit more feasible:
SELECT COUNT(DISTINCT(SUBSTRING(naics_code, 1, 2)))
FROM ds_ar_dws.qcew

con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "qcew")) %>%
distinct(naics_code) %>%
collect() %>% # substring doesn't translate to postgresql, so pull into R first
summarize(n_distinct(substring(naics_code, 1, 2)))

This is a bit more workable. Let's find the most common two-digit NAICS codes across all rows:
SELECT SUBSTRING(naics_code, 1, 2), COUNT(*)
FROM ds_ar_dws.qcew
GROUP BY SUBSTRING(naics_code, 1, 2)
ORDER BY 2 DESC --2 references the second column in the SELECT statement
LIMIT 5

con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "qcew")) %>%
group_by(naics_code) %>%
summarize(n = n()) %>%
ungroup() %>%
collect() %>%
#substring doesn't work when translating to postgresql so need to pull into R first
# also convert n to numeric since it pulls in as a character variable
mutate(
two_digit = substring(naics_code, 1, 2),
n = as.numeric(n)
) %>%
group_by(two_digit) %>%
summarize(tot = sum(n)) %>%
arrange(desc(tot)) %>%
head(5)

There are a handful of NAICS-related lookup tables available for you in the ds_ar_dws schema. At the two-digit level, the most relevant table will be naics_sector, but keep in mind that this table has ranges of codes in some places (e.g., 44-45), which makes it a bit more complicated to join. For reference, the two-digit NAICS code of 62 corresponds to Health Care and Social Assistance.
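One hedged sketch of how those range codes could be handled - assuming, for illustration only, that the lookup stores ranges as hyphenated strings in a column here called naics_range, with the label in sector_title (check naics_sector for the actual column names) - is to expand each range into one row per two-digit code before joining:

```r
# Illustrative sketch: expand range codes like "44-45" into one row per
# two-digit code so the lookup can be joined on an exact match.
# Column names (naics_range, sector_title) are assumptions - verify them
# against the actual naics_sector table.
library(dplyr)
library(tidyr)

sector_lookup <- tibble(
  naics_range = c("44-45", "31-33", "62"),
  sector_title = c("Retail Trade", "Manufacturing",
                   "Health Care and Social Assistance")
)

sector_expanded <- sector_lookup %>%
  # split "44-45" into endpoints; single codes get hi = NA
  separate(naics_range, into = c("lo", "hi"), sep = "-", fill = "right") %>%
  mutate(hi = coalesce(hi, lo)) %>%
  # enumerate every code in the range (e.g., 31, 32, 33 for "31-33")
  rowwise() %>%
  mutate(two_digit = list(as.character(as.integer(lo):as.integer(hi)))) %>%
  ungroup() %>%
  unnest(two_digit) %>%
  select(two_digit, sector_title)
```

The expanded table can then be joined to a two-digit summary, for example left_join(summary_tbl, sector_expanded, by = c("naics_sub" = "two_digit")). Note that enumerating the range matters: splitting "31-33" on the hyphen alone would miss code 32.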
7 Linking UI Wages and QCEW
By joining the QCEW data to the UI wage records, we can look at wage patterns while taking into account characteristics of the employers, such as their industry. In this section, we will do just that by merging the ds_ar_dws.ui_wages_lehd table to the QCEW table ds_ar_dws.qcew.
To join the two tables, we will use federal_ein in the UI wages and ein in the QCEW, which both correspond to the federal EIN. We also need to join on the reference period - the year and quarter combination - to ensure we are getting the wages for the correct period. Before doing so, though, to ensure we are not joining multiple NAICS codes per employer, we will master the QCEW table to find the most common industry code for each EIN in a given quarter (and year). This approach is not perfect, but it does still join for cases where reporting unit number values are missing in the UI wage records but not in the QCEW.
And remember that, as we saw earlier, there may still be some duplication in the UI wage records unaccounted for in this approach, which could be mitigated through additional mastering.
We can use the first two digits of the NAICS codes to obtain the industrial sector for each business, then average the wages across these sectors for each quarter in 2019.
Note: Due to the complexity of the join, the dbplyr code for this example is broken into several intermediate steps.
with qcew_rank as (
select ein, reference_year, reference_quarter, substring(naics_code, 1, 2) as naics_sub,
ROW_NUMBER () OVER (PARTITION BY ein, reference_year, reference_quarter order by count(*) desc) as rank
from ds_ar_dws.qcew
group by ein, reference_year, reference_quarter, substring(naics_code, 1, 2)
),
qcew_mastered as (
select *
from qcew_rank
where rank = 1
)
select avg(uwl.employee_wage_amount) as avg_wage,
count(distinct uwl.employee_ssn) as n_employee,
count(distinct uwl.federal_ein) as n_employers,
q.reference_year,
q.reference_quarter,
naics_sub
from ds_ar_dws.ui_wages_lehd uwl
left join qcew_mastered q
on (uwl.federal_ein=q.ein
and uwl.reporting_period_year = q.reference_year
and uwl.reporting_period_quarter = q.reference_quarter)
where q.reference_year = '2019' and naics_sub !='00'
group by naics_sub, q.reference_year, q.reference_quarter
order by naics_sub, q.reference_year, q.reference_quarter

qcew <- con %>%
tbl(in_schema(schema = "ds_ar_dws", table = "qcew"))
uwl <- con %>%
tbl(in_schema(schema = "ds_ar_dws", table = "ui_wages_lehd"))
# Step 1: count observations per ein/year/quarter/naics_sub
qcew_counted <- qcew %>%
mutate(naics_sub = sql("SUBSTRING(naics_code FROM 1 FOR 2)")) %>%
group_by(ein, reference_year, reference_quarter, naics_sub) %>%
summarise(count_n = n(), .groups = "drop")
# Step 2: for each (ein, year, quarter), rank by count_n
qcew_ranked <- qcew_counted %>%
group_by(ein, reference_year, reference_quarter) %>%
arrange(desc(count_n)) %>%
mutate(rank = row_number()) %>%
ungroup()
# Step 3: keep only rank == 1 (most common naics_sub per ein/year/quarter)
qcew_mastered <- qcew_ranked %>%
filter(rank == 1) %>%
select(ein, reference_year, reference_quarter, naics_sub)
# Step 4: join and summarize
avg_by_naics <- uwl %>%
left_join(
qcew_mastered,
by = c(
"federal_ein" = "ein",
"reporting_period_year" = "reference_year",
"reporting_period_quarter" = "reference_quarter"
)
) %>%
filter(
reporting_period_year == "2019",
naics_sub != "00"
) %>%
group_by(naics_sub, reporting_period_year, reporting_period_quarter) %>%
summarise(
avg_wage = mean(employee_wage_amount, na.rm = TRUE),
n_employee = n_distinct(employee_ssn),
n_employers = n_distinct(federal_ein),
.groups = "drop"
) %>%
arrange(naics_sub, reporting_period_year, reporting_period_quarter) %>%
collect()
avg_by_naics

For the sake of this example, we will identify specific sectors by their two-digit NAICS codes:
- Agriculture, forestry, fishing and hunting: 11
- Retail Trade: 44-45
- Professional, scientific and technical: 54
- Healthcare and social assistance: 62
avg_by_naics %>%
filter(naics_sub %in% c("11", "62", "44", "45", "54")) %>%
mutate(sector = case_when(
naics_sub == "11" ~ "Agriculture",
naics_sub == "44" ~ "Retail - 44",
naics_sub == "45" ~ "Retail - 45",
naics_sub == "54" ~ "Technical & Professional",
naics_sub == "62" ~ "Healthcare"
),
reporting_period = paste(reporting_period_year,reporting_period_quarter, sep = '-')) %>%
ggplot(aes(
x = reporting_period,
y = as.numeric(avg_wage),
group = sector,
color = sector
)) +
geom_line() +
labs(
x = "Quarter",
y = "Average Quarterly Wage",
title = "Average Wages Received by Quarter",
subtitle = "By NAICS Sector in 2019"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

We see that some industries show marked seasonal variation in wages in 2019, while others do not. We can also see that workers in one of the Retail sectors have the lowest wages on average.
Use the UI Wage and QCEW data dictionaries to identify one or more further variables that might be relevant to your group’s analysis. Think through what these variables should look like, as well as what issues might arise. Working individually or with your group, examine the distribution of these variables. Document any EDA-related concerns and findings in your project template. Brainstorm as to what the cause of these issues might be, and how they could impact your analysis.
8 WIOA Program data
What is WIOA Adult Population?
The Workforce Innovation and Opportunity Act (WIOA) helps job seekers access employment, education, training, and support services to succeed in the labor market, while connecting employers with skilled workers to compete globally. States align core workforce programs through flexible four-year plans, promoting accountability via public performance goals, fostering regional collaboration, and improving the American Job Center (AJC) system. WIOA programs offer career services, job search assistance, workforce preparation, and training through classroom and work-based learning. Serving millions annually, these programs also support business by matching qualified candidates, filling job orders, and providing customized training, delivered through a network of 2,400 AJCs nationwide.
Types of WIOA programs
The relevant WIOA programs for this notebook and our data are those under Title I and Title III. WIOA Title I, focused on workforce development activities, includes programs such as the WIOA Youth Program, the Adult and Dislocated Worker Program, and Job Corps. WIOA Title III reflects the Wagner-Peyser Employment Service, which provides universal access to job search assistance, labor exchange services, and labor market information for both job seekers and employers, primarily delivered through American Job Centers.
PIRL table ds_ar_dws.pirl_update
The Participant Individual Record Layout (PIRL) is a standardized data format developed by the U.S. Department of Labor (DOL) for reporting individual-level participant data across various workforce development programs, particularly those funded through the Workforce Innovation and Opportunity Act (WIOA). It is used by state workforce agencies and other grantees to report information to the DOL. You can find the PIRL data dictionary on the P drive.
The PIRL contains detailed, longitudinal records on individuals who receive services through WIOA and other programs, including:
- Demographics
- Details of program participation
- Education and barriers
- Types of training received
- Program outcomes
Structure
First, just as we did in the previous section, we will explore the basic structure of the data by examining the first few rows of the table.
The PIRL table has over 500 columns, so consulting the data dictionary is essential before approaching the data. The AR PIRL table also combines multiple original sheets into the single file we have access to. The sheetnameproperty field identifies which sheet a given record comes from; because a person can participate in multiple WIOA programs, their records may appear across several sheets.
SELECT *
FROM ds_ar_dws.pirl_update
LIMIT 5

con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "pirl_update")) %>%
head() %>%
colnames()

EDA of specific fields
- The codebook is at P:\tr-state-impact-ada-training\Data documentation\PIRL_Layout_withAdultEd.xlsx
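Because sheetnameproperty determines which WIOA program a given row describes, a useful first check is to tally rows and people per sheet before drilling into specific fields. A minimal sketch using the same dbplyr pattern as above (assuming con is the open database connection from the technical setup):

```r
con %>%
  tbl(in_schema(schema = "ds_ar_dws",
                table = "pirl_update")) %>%
  group_by(sheetnameproperty) %>%
  summarise(nrows = n(),
            npeople = n_distinct(social_security_number))
```

Comparing these counts against the data dictionary confirms which program sheets are populated in your state's extract, and how many individuals appear in each.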
Below we can inspect the number of records and people in each WIOA training service. We also look at the earliest and latest program-related dates, service completion, and the average time in the program.
SELECT
"adult_wioa",
COUNT(DISTINCT "social_security_number") AS "npeople",
COUNT(*) AS "nrows"
FROM (
SELECT "adult_wioa", "social_security_number"
FROM "ds_ar_dws"."pirl_update"
WHERE ("sheetnameproperty" = 'Adult')
) "q01"
GROUP BY "adult_wioa"

con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "pirl_update")) %>%
filter(sheetnameproperty == "Adult") %>%
select(adult_wioa, social_security_number) %>%
group_by(adult_wioa) %>%
summarise(npeople = n_distinct(social_security_number),
nrows = n())

con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "pirl_update")) %>%
filter(sheetnameproperty == "Adult" & adult_wioa %in% c("1","2","3")) %>%
select(date_of_program_entry_wioa, date_of_program_exit_wioa) %>%
summarise("Min Program Entry Date" = min(date_of_program_entry_wioa, na.rm = TRUE),
"Max Program Entry Date" = max(date_of_program_entry_wioa, na.rm = TRUE),
"Min Program Exit Date" = min(date_of_program_exit_wioa, na.rm = TRUE),
"Max Program Exit Date" = max(date_of_program_exit_wioa, na.rm = TRUE))

SELECT "year", AVG("time_in_program") AS "Average time in program in days"
FROM (
SELECT
"pirl_update".*,
"date_of_program_exit_wioa" - "date_of_program_entry_wioa" AS "time_in_program",
EXTRACT(year FROM "date_of_program_entry_wioa") AS "year"
FROM "ds_ar_dws"."pirl_update"
WHERE ("sheetnameproperty" = 'Adult' AND "adult_wioa" IN ('1', '2', '3'))
) "q01"
GROUP BY "year"
ORDER BY "year"

con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "pirl_update")) %>%
filter(sheetnameproperty == "Adult" & adult_wioa %in% c("1","2","3")) %>%
mutate(time_in_program = date_of_program_exit_wioa - date_of_program_entry_wioa,
year = lubridate::year(date_of_program_entry_wioa)) %>%
group_by(year) %>%
summarise("Average time in program in days" = mean(time_in_program, na.rm = TRUE)) %>%
arrange(year)

Completers vs non-completers
SQL code
SELECT
"type_of_training_service_1_wioa",
"training_completed_1",
COUNT(DISTINCT "social_security_number") AS "freq"
FROM "ds_ar_dws"."pirl_update"
WHERE ("sheetnameproperty" = 'Adult' AND "adult_wioa" IN ('1', '2', '3'))
GROUP BY "type_of_training_service_1_wioa", "training_completed_1"
ORDER BY "type_of_training_service_1_wioa", "training_completed_1"

Completion for training service 2
SELECT
"type_of_training_service_2_wioa",
"training_completed_2",
COUNT(DISTINCT "social_security_number") AS "freq"
FROM "ds_ar_dws"."pirl_update"
WHERE ("sheetnameproperty" = 'Adult' AND "adult_wioa" IN ('1', '2', '3'))
GROUP BY "type_of_training_service_2_wioa", "training_completed_2"
ORDER BY "type_of_training_service_2_wioa", "training_completed_2"
Completion for training service 3
SELECT
"type_of_training_service_3_wioa",
"training_completed_3",
COUNT(DISTINCT "social_security_number") AS "freq"
FROM "ds_ar_dws"."pirl_update"
WHERE ("sheetnameproperty" = 'Adult' AND "adult_wioa" IN ('1', '2', '3'))
GROUP BY "type_of_training_service_3_wioa", "training_completed_3"
ORDER BY "type_of_training_service_3_wioa", "training_completed_3"

R code
# Now let's look at types of training services
# Completion for training service 1
con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "pirl_update")) %>%
filter(sheetnameproperty == "Adult" & adult_wioa %in% c("1","2","3")) %>%
select(type_of_training_service_1_wioa, training_completed_1, social_security_number) %>%
group_by(type_of_training_service_1_wioa, training_completed_1) %>%
summarise(freq = n_distinct(social_security_number))
# Completion for training service 2
con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "pirl_update")) %>%
filter(sheetnameproperty == "Adult" & adult_wioa %in% c("1","2","3")) %>%
select(type_of_training_service_2_wioa, training_completed_2, social_security_number) %>%
group_by(type_of_training_service_2_wioa, training_completed_2) %>%
summarise(freq = n_distinct(social_security_number))

# Completion for training service 3
con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "pirl_update")) %>%
filter(sheetnameproperty == "Adult" & adult_wioa %in% c("1","2","3")) %>%
select(type_of_training_service_3_wioa, training_completed_3, social_security_number) %>%
group_by(type_of_training_service_3_wioa, training_completed_3) %>%
summarise(freq = n_distinct(social_security_number))

Co-enrollment in more than 1 program
We may also be interested in whether a program participant is enrolled in more than one WIOA program. In this query, we examine the co-enrollment status between the Wagner-Peyser and Adult WIOA programs.
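The SQL below builds this flag with two CTEs and a full outer join. The same logic can be expressed in dplyr; here is a hedged sketch (assuming con is the open connection, and matching on exact program entry dates, as the SQL does). Note the result must be pulled into R with collect() before tabulating, which also shows where the coen object used with table() comes from:

```r
# Wagner-Peyser participants: one row per person/entry date
wagner <- con %>%
  tbl(in_schema(schema = "ds_ar_dws", table = "pirl_update")) %>%
  filter(sheetnameproperty == "Wagner-Peyser",
         wagner_peyser_employment_service_wioa == "1") %>%
  distinct(social_security_number, date_of_program_entry_wioa) %>%
  mutate(program2 = "wagner_peyser")

# Adult WIOA participants
adult <- con %>%
  tbl(in_schema(schema = "ds_ar_dws", table = "pirl_update")) %>%
  filter(sheetnameproperty == "Adult",
         adult_wioa %in% c("1", "2", "3")) %>%
  distinct(social_security_number, date_of_program_entry_wioa) %>%
  mutate(program1 = "adult wioa")

# Full join and classify enrollment status
coen <- adult %>%
  full_join(wagner,
            by = c("social_security_number", "date_of_program_entry_wioa")) %>%
  mutate(dualenroll = case_when(
    !is.na(program1) & !is.na(program2) ~ "dual enrolled",
    !is.na(program1)                    ~ "adult wioa only",
    !is.na(program2)                    ~ "wagner peyser only"
  )) %>%
  collect()

table(coen$dualenroll)
```

Requiring an exact entry-date match is a strict definition of co-enrollment; depending on your research question, you may instead want to test for overlapping enrollment spells.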
with wagner as (
select
p.social_security_number,
p.wagner_peyser_employment_service_wioa,
p.date_of_program_entry_wioa,
p.date_of_program_exit_wioa,
p.type_of_training_service_1_wioa as wagner_train1,
p.type_of_training_service_2_wioa as wagner_train2,
p.type_of_training_service_3_wioa as wagner_train3,
cast('wagner_peyser' as varchar) as program2
from ds_ar_dws.pirl_update p
where p.sheetnameproperty = 'Wagner-Peyser'
and p.wagner_peyser_employment_service_wioa = '1'
group by 1,2,3,4,5,6,7
),
adult_wioa as (
select
p.social_security_number,
p.date_of_program_entry_wioa,
p.date_of_program_exit_wioa,
p.type_of_training_service_1_wioa as aw_train1,
p.type_of_training_service_2_wioa as aw_train2,
p.type_of_training_service_3_wioa as aw_train3,
cast('adult wioa' as varchar) as program1
from ds_ar_dws.pirl_update p
where p.sheetnameproperty = 'Adult'
and p.adult_wioa in ('1', '2', '3')
group by 1,2,3,4,5,6
)
select
adult_wioa.social_security_number,
adult_wioa.date_of_program_entry_wioa,
adult_wioa.date_of_program_exit_wioa,
adult_wioa.program1,
wagner.program2,
case
when wagner.program2 = 'wagner_peyser' and adult_wioa.program1 is not null then 'dual enrolled'
when wagner.program2 is null and adult_wioa.program1 = 'adult wioa' then 'adult wioa only'
when wagner.program2 = 'wagner_peyser' and adult_wioa.program1 is null then 'wagner peyser only'
end as dualenroll
from adult_wioa
full outer join wagner
on adult_wioa.social_security_number = wagner.social_security_number
and adult_wioa.date_of_program_entry_wioa = wagner.date_of_program_entry_wioa
order by adult_wioa.date_of_program_entry_wioa;

table(coen$dualenroll)

9 Next Steps: Applying the workbook to your project
The workbook provides a structure for you to start your EDA process on the data within the scope of your project. The data coverage and row definitions for two of the primary data sets in this training are available, allowing you to focus on evaluating the distribution of variables potentially relevant to your analysis. Below, in the Additional Data Sources section, we apply the same approach to the other key tables we have access to in this workspace. You do not need to work through each section, but you should review the content at least for the tables you plan to use in your project. You can expand a specific section by clicking on its header below.
As you evaluate variable distributions, you can start by re-purposing the code in these sections. There are code snippets for distributions of numeric, time-based, and categorical variables that may be appropriate depending on the type of column you are interested in exploring.
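For categorical fields like the training-service and completion flags above, the re-purposing often amounts to swapping one column name into the same group-and-count pattern. One way to avoid copy-paste errors is to wrap that pattern in a small helper; this is our own sketch (the tally_column name and the use of rlang's {{ }} embracing are illustrative, not part of the course materials; assumes con as before):

```r
# Tally a categorical column in any lazy table, largest groups first
tally_column <- function(tbl_ref, col) {
  tbl_ref %>%
    count({{ col }}, sort = TRUE)
}

# e.g., distribution of a categorical PIRL field:
pirl <- con %>%
  tbl(in_schema(schema = "ds_ar_dws", table = "pirl_update"))

tally_column(pirl, sheetnameproperty)
```

Because count() is translated to SQL by dbplyr, the aggregation runs in the database; only the small summary table comes back to R.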
In doing so, as recommended in the checkpoint, note your findings in your team’s project template. As your project progresses, it will be helpful to look back at these notes, especially in thinking through how to most accurately and best communicate your team’s final product to an external audience. Ultimately, the EDA process is an essential step in the project development life cycle, as it provides helpful contextual information on the variables you may choose to use (or not use) in your analysis.
For all of these steps, remember not to take notes or discuss exact results outside the ADRF. Instead, create notes or output inside the ADRF, and store them either in your U: drive or in your team’s folder on the P: drive. When discussing results with your team, remember to speak broadly, and instead direct them to look at specific findings within the ADRF. And, as always, feel free to reach out to the Coleridge team if you have any questions as you get used to this workflow!