Xiang Cheng

Over 1.2 million high school students, or 7.1%, drop out every year. Dropping out of high school is costly for both the students and society: 30% of dropouts are unemployed, their average annual income is $20,241 (34% lower than that of high school graduates), and a dropout costs taxpayers an average of $292,000 over a lifetime due to unemployment, incarceration, etc. There have been countless efforts at different levels (federal, state, county, school) to prevent high-schoolers from dropping out and to increase on-time graduation rates. However, such efforts often make binary predictions (graduate or not) instead of continuous or ranked ones, which schools and teachers strongly prefer for prioritizing students.

Our partner, Muskingum Valley Educational Service Center (MVESC), works with 5 counties, 16 school districts, nearly 2,000 teachers, and 30,897 students in central Ohio to provide a variety of support services. In particular, MVESC is the largest of the 52 Educational Service Centers (ESCs) in Ohio, and is the only one with a dedicated data department. The data department gathers comprehensive data, including grades, test scores, daily attendance, current interventions, teacher information, etc. These datasets offer a unique opportunity to study dropout prevention on a large scale.

There are ongoing efforts in Ohio to prevent students from dropping out, but current interventions are applied ad hoc, not well tracked, and frequently come too late to effect meaningful changes in high school outcomes. MVESC has partnered with DSSG to build a data-driven early intervention system to identify and influence at-risk students in earlier years when there is more time for effective interventions.

Our team – fellows Xiang Cheng, Jacqueline Gutman, Johanna Torrence, and Zhe Zhang, along with technical mentors Ali Vanderveld and Kevin Wilson and project manager Chad Kenney – is working closely with MVESC to build practical, data-driven models to identify students in need of additional assistance to finish high school on time. In this project, the focus is not only on students who may be easily detected by simple rules based on poor high school academic performance or extreme disciplinary incidents. We hope to identify both a larger and younger group of students who may need assistance to be better prepared for graduation and college. What we accomplish with Muskingum Valley may be replicated throughout Ohio to enhance existing processes to identify students at risk of not graduating high school.

Our partners have provided 10 years of data from all 14 school districts in MVESC. The big data pool is an advantage, but as we began to explore the data we found that many students do not have clear outcomes.

From the graduation rates reported to the Ohio Department of Education, we expect to find about 8-10% of students in MVESC not graduating within 4 or 5 years. However, in the data we received we found that only about 2.4% of students definitively fall into this category. Many more students, about 22%, have outcomes that are uncertain. These are students who disappear without a withdrawal code or transfer out of MVESC schools. Based on these numbers we expect that at least some portion of these students are actually dropouts or late graduates.

These missing labels cannot be treated as a random subset of students, because struggling students often transfer to electronic high schools, dropout recovery programs, or vocational and technical institutes in the later years of high school. Schools also frequently do not follow up with transfer students to ensure they actually enroll in another institution, so some students coded as transfers may have actually dropped out. Students within this uncertain category likely make up a significant portion of students experiencing adverse outcomes, so we do not want to simply ignore them in our analysis. Instead, we are considering two different approaches to using this unlabeled data in our models.

First, as part of our initial exploration of the data, we mapped each student into a fine-grained outcome bucket based on the available data. Some of these, such as 4-year graduates and dropouts, map clearly to particular labels, while others break up the large uncertain category. For example, one bucket includes students who are marked as transfers, but for whom there is no IRN indicating which institution they transferred to. Once we decide on these divisions, we can look at the students within each bucket and assign it an outcome; perhaps we believe that students who transfer during their 12th grade year and for whom there is no follow-up information likely dropped out. We can also try assigning different outcome labels to different buckets and see how the model performance changes with different choices.
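As a rough illustration of this bucket-based approach, the sketch below maps buckets to outcome labels through a dictionary that can be swapped out to test a different assumption. The bucket names, label names, and student IDs are hypothetical and chosen only for illustration; they are not the actual categories in the MVESC data.

```python
from collections import Counter

# Hypothetical bucket names and labels; the real buckets come from the data.
# A value of None means "leave these students unlabeled for now".
BUCKET_TO_LABEL = {
    "graduate_4yr": "on_time",
    "dropout": "not_on_time",
    "transfer_no_irn_grade12": "not_on_time",  # the assumption being tested
    "transfer_no_irn_other": None,
}


def assign_labels(student_buckets, mapping):
    """Return {student_id: label}, skipping buckets that map to None."""
    labels = {}
    for student_id, bucket in student_buckets.items():
        label = mapping.get(bucket)
        if label is not None:
            labels[student_id] = label
    return labels


# Toy input: one made-up student per bucket.
students = {
    "s1": "graduate_4yr",
    "s2": "dropout",
    "s3": "transfer_no_irn_grade12",
    "s4": "transfer_no_irn_other",
}

labels = assign_labels(students, BUCKET_TO_LABEL)
counts = Counter(labels.values())

# An alternative choice: treat 12th-grade transfers as unlabeled instead.
alternative = dict(BUCKET_TO_LABEL, transfer_no_irn_grade12=None)
alt_labels = assign_labels(students, alternative)
```

Keeping the mapping separate from the labeling code makes it cheap to retrain a model under each candidate mapping and compare the results.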

Second, we can take a machine learning approach to assigning labels. In this approach, we will train a classifier on the clearly labeled data, then run the classifier on the uncertain students. The resulting labels can then be added for the uncertain students, and the full dataset can be used to train a new model which takes into account all the available data.
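This train, predict, and retrain loop is often called pseudo-labeling or self-training. The sketch below shows its shape with a deliberately simple nearest-centroid classifier written in plain Python. The classifier choice, the two features (absence rate and GPA), and all the numbers are assumptions made for illustration, not the models or data used in the project.

```python
def centroid(rows):
    """Mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def fit(X, y):
    """Fit a nearest-centroid classifier: one centroid per label."""
    rows_by_label = {}
    for features, label in zip(X, y):
        rows_by_label.setdefault(label, []).append(features)
    return {label: centroid(rows) for label, rows in rows_by_label.items()}


def predict(model, X):
    """Assign each row the label of the closest centroid."""
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [
        min(model, key=lambda label: squared_distance(model[label], row))
        for row in X
    ]


# Toy features: [absence rate, GPA]. Label 1 = did not graduate on time.
X_labeled = [[0.02, 3.6], [0.05, 3.1], [0.30, 1.4], [0.25, 1.9]]
y_labeled = [0, 0, 1, 1]
X_uncertain = [[0.28, 1.6], [0.03, 3.4]]

# Step 1: train on the clearly labeled students only.
first_model = fit(X_labeled, y_labeled)

# Step 2: predict labels for the students with uncertain outcomes.
pseudo_labels = predict(first_model, X_uncertain)

# Step 3: retrain on the labeled and pseudo-labeled students together.
final_model = fit(X_labeled + X_uncertain, y_labeled + pseudo_labels)
```

One caveat with this approach is that the first model's mistakes are fed back in as if they were true labels, so in practice it can help to keep only the predictions the classifier is most confident about.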

Over the coming weeks we will try both of these approaches and explore which features and types of models offer the best performance. By the end of this summer, we aim to deliver three things to MVESC:

  1. An analysis of the 10 years of data from all 14 school districts. The length of time and breadth of area covered by the data collected by MVESC should produce a statistically stronger analysis of the dropout problem than is typically available.
  2. Risk scores and top risk factors for the current students, which will allow our partners to validate the model. If our work makes sense to the schools and teachers, they can take action by prioritizing students using risk scores and determining appropriate interventions from top risk factors.
  3. Source code for all the data manipulation and modeling we have done this summer, with the intention that MVESC will easily be able to reproduce our work and even develop it further so the early intervention system can become a consistent part of the set of tools available to teachers and administrators in the years to come.
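To make the second deliverable concrete, the sketch below shows how a ranked list differs from a binary prediction: students are sorted by risk score, and a school takes as many from the top as it has capacity to support. The student IDs, scores, and the `top_k_at_risk` helper are hypothetical and shown only for illustration.

```python
def top_k_at_risk(risk_scores, k):
    """Return the k student ids with the highest risk scores, highest first."""
    ranked = sorted(risk_scores.items(), key=lambda item: item[1], reverse=True)
    return [student_id for student_id, _ in ranked[:k]]


# Made-up scores between 0 and 1; higher means higher estimated risk.
risk_scores = {"s1": 0.12, "s2": 0.87, "s3": 0.45, "s4": 0.91}

# A school with capacity to support two students would start with these.
priority = top_k_at_risk(risk_scores, k=2)
```

Because the output is an ordering, not a yes/no flag, the same scores serve a school that can support two students and one that can support twenty.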