Looking for a Model

Posted 2016-05-18

Updated: this CSV file has information on who taught when. The three columns are the person's unique identifier, the date on which they first qualified, and the dates on which they taught. (If someone has taught multiple times, there is one record for each teaching event.) People who haven't taught at all are at the bottom with empty values in the third column. Erin Becker's analysis of this data is posted on the Data Carpentry blog.

We rebooted instructor training in October 2015, and things have been going pretty well since then. If we average over all 23 new-style classes, it looks like two thirds of people who take part actually qualify as instructors within four months of finishing the class:

Date	Site(s)	Days Since	Participants	Completed	Percentage	Cum. Participants	Cum. Completed	Cum. %age
2015-10-15	online	170	48	30	62.5%	48	30	62.5%
2015-12-07	Paris	162	7	7	100.0%	55	37	67.2%
2015-12-07	Potsdam	162	5	5	100.0%	60	42	70.0%
2015-12-07	Thessaloniki	162	4	4	100.0%	64	46	71.8%
2015-12-07	Arlington	162	10	4	40.0%	74	50	67.5%
2015-12-07	Vancouver	162	5	4	80.0%	79	54	68.3%
2015-12-07	Wisconsin	162	7	5	71.4%	86	59	68.6%
2015-12-07	Australia	162	3	2	66.6%	89	61	68.5%
2015-12-07	Curitiba	162	3	3	100.0%	92	64	69.5%
2015-12-07	Toronto	162	14	12	85.7%	106	76	71.7%
2016-01-05	Oklahoma	133	19	5	26.3%	125	81	64.8%
2016-01-13	Lausanne	125	20	16	80.0%	145	97	66.9%
2016-01-18	Brisbane	120	20	14	70.0%	165	111	67.2%
2016-01-21	Melbourne	117	27	6	22.2%	192	117	60.9%
2016-01-21	Florida	117	25	8	32.0%	217	125	57.6%
2016-01-28	Auckland	111	20	7	35.0%	237	132	55.7%
2016-02-16	Online	91	26	8	30.7%	263	140	53.2%
2016-02-22	UC Davis	85	23	9	39.1%	286	149	52.1%
2016-03-09	U Washington	69	14	2	14.2%	300	151	50.3%
2016-04-13	online	34	33	1	3.0%	333	152	45.6%
2016-04-17	North West U	31	23	0	0.0%	356	152	42.7%
2016-05-04	Edinburgh	13	15	0	0.0%	371	152	40.9%
2016-05-11	Toronto	6	27	0	0.0%	398	152	38.1%

One of our goals for this year is to lower the majority completion time from four months to three; another is to increase the throughput from two thirds to three quarters. What I'd really like, though, is some help figuring out what statistical model to use for the other important aspect of our training and mentoring: how many of the people we train go on to actually teach workshops, and how quickly.

The data we have includes the following for each person:

unique personal identifier (we can easy anonymize individuals)
date(s) of the instructor training courses they took (someone may enroll, drop out, enroll again, and so on)
date(s) on which they were certified (they may have qualified for Software Carpentry and Data Carpentry at different times)
the date on which they taught their first workshop (if any)

"Mean time to teach first workshop" isn't a good metric, since roughly 1/3 of the people we've trained haven't taught yet. Should we use an inverted half-life measure, i.e., how long until the odds of someone having taught hit 50%? Or would something else give us more insight? Whatever we choose needs to be robust in the face of a big spike in our data in January 2016, when we retroactively certified a big batch of Data Carpentry instructors. If you have suggestions, comments on this post would be very welcome.

Categories: software-carpentry