How To Evaluate Training

By Phil La Duke

Last week I stated (for the umpteenth time) that a worker’s core competency may be the best predictor of safety. I went on to rant about how in many cases training is slapped together and shoddily delivered in an effort to check the almighty box. One of my readers asked how one can accurately assess the effectiveness of training. So here goes…

“The Kirkpatrick Model is the worldwide standard for evaluating the effectiveness of training. It considers the value of any type of training, formal or informal, across four levels. Level 1 Reaction evaluates how participants respond to the training. Level 2 Learning measures if they actually learned the material.”


Source: The Kirkpatrick Model – Kirkpatrick Partners

The Kirkpatrick Model is a simple and fairly accurate way to measure the effectiveness of adult learning events (aka training). Every six months or so some adult learning theorist comes up with another method, but the Kirkpatrick Model endures because of its elegant simplicity. The model consists of four levels, each designed to measure a specific element of the training.

Level One: Reaction

Kirkpatrick’s first level measures the learners’ reaction to the learning event. There is a strong correlation between learning retention and how much the learners enjoyed the time spent and found it valuable. Level one evaluations are typically completed immediately at the conclusion of the course using what trainers euphemistically call a “smile sheet” (a reference to how many smiles you counted at the end of a class). But a good level one evaluation should delve deeper than merely whether or not the people liked the course (people like The Housewives of The Jersey Shore and Survivor but that doesn’t make them good television).

A good course evaluation will concentrate on three elements: the course content, the physical environment, and the presentation skills of the instructor. You can glean important insights into the quality of your course if you have constructed a good course evaluation. Typically this means using a Likert Scale (asking participants to rate their agreement with a statement about the course on a scale of 1-5, where one indicates strong disagreement and five strong agreement).

To build an effective level one tool, always word the statements positively so that a score of one is consistently bad and a score of five is consistently good. Write the statements as complete sentences and don’t ask questions. Also, don’t write more than 10 statements: people tend to want to get out of the class as quickly as possible, and if you exceed one page your completion rate drops sharply. I like to finish the one-page evaluation with two questions: what did you like most about the course, and what could be improved?
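For anyone who wants to tally smile sheets in a script rather than by hand, here is a minimal sketch of how the 1-5 ratings could be averaged by element. The mapping of statements to elements, the function name `summarize`, and the ratings themselves are all made up for illustration; this is one way to do it, not a standard tool.

```python
from statistics import mean

# Hypothetical mapping: statement number -> the element it evaluates.
CATEGORIES = {
    1: "content", 2: "content", 3: "content",
    4: "environment", 5: "environment",
    6: "instructor", 7: "instructor", 8: "instructor",
}

def summarize(responses):
    """Average the 1-5 ratings per element across all participants.

    `responses` is a list of dicts, one per participant, mapping
    statement number -> rating (1 = strongly disagree, 5 = strongly agree).
    """
    by_category = {}
    for sheet in responses:
        for statement, rating in sheet.items():
            if not 1 <= rating <= 5:
                raise ValueError(f"rating {rating} is outside the 1-5 scale")
            by_category.setdefault(CATEGORIES[statement], []).append(rating)
    return {cat: round(mean(vals), 2) for cat, vals in by_category.items()}

# Made-up ratings from three participants, for illustration only.
sheets = [
    {1: 5, 2: 4, 3: 4, 4: 2, 5: 3, 6: 5, 7: 5, 8: 4},
    {1: 4, 2: 4, 3: 5, 4: 3, 5: 2, 6: 5, 7: 4, 8: 5},
    {1: 5, 2: 5, 3: 4, 4: 2, 5: 2, 6: 4, 7: 5, 8: 5},
]
print(summarize(sheets))
```

Because every statement is worded positively, a low average for one element (the environment, in this invented data) points straight at what needs fixing.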

Level Two: Learning

The second level of Kirkpatrick’s model is learning, that is, how much of the content the people actually learned as a result of the training session. This evaluation is typically achieved through the use of a pre- and posttest. This causes all sorts of consternation among people who don’t understand how to evaluate training. Many organizations flat out refuse to test the workers, and even those that do test balk at the idea of a pretest. Pre- and posttests are key to ascertaining whether or not the participants learned anything in the learning event. Identical (we’ll get back to that in a moment) pre- and posttests are essential because the difference between the pre- and posttest scores indicates the amount of learning that actually took place. Without a pretest you have no idea if they already knew the material before they came to the session, and unless the questions are the same you can’t be certain that they learned the material in the session. Of course it is important to ask the questions in a different order, and to list the answers in a different order, to prevent people from just memorizing the choices without having to think about the information.
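The arithmetic behind the pre/post comparison is simple enough to sketch in a few lines. The function name `learning_gain` and the scores below are invented for illustration, and the sketch assumes each score is the number of correct answers on the same set of questions.

```python
from statistics import mean

def learning_gain(pre_scores, post_scores):
    """Compare pretest and posttest scores for the same learners.

    Both lists hold the number of correct answers, in the same learner order.
    """
    if len(pre_scores) != len(post_scores):
        raise ValueError("each learner needs both a pretest and a posttest score")
    gains = [post - pre for pre, post in zip(pre_scores, post_scores)]
    return {
        "pre_mean": mean(pre_scores),
        "post_mean": mean(post_scores),
        "mean_gain": mean(gains),
        "gains": gains,
    }

# Made-up scores out of 20 for five learners.
pre = [8, 10, 6, 12, 9]
post = [17, 18, 15, 19, 16]
result = learning_gain(pre, post)
print(result)
```

A learner who scores high on the pretest shows a small gain no matter how good the class was, which is exactly why the pretest matters.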

I have always preferred a 20-question multiple-choice pre- and posttest. The odds of guessing a single true/false question are 50%, and even that assumes that the question doesn’t contain any language that tips people off. For example:

True or False: It is never safe to work on energized equipment without locking out.

I have seen variations of this question asked with one author of the question believing that the answer is “True” and another believing it to be “False”. It is pretty easy to guess true/false questions that have absolutes like “never” in them, because for the statement to be true there can be no possible scenario in which it is false. In other words, if I can find just one instance where it is safe to work on energized equipment (say during test mode, or other conditions that require power) I can be confident that the statement is false. Conversely, we need to have a clear definition of “safe”; if by “safe” we mean the absolute absence of risk of injury (a circumstance that is all but impossible) we can confidently answer “True”. Considering language ambiguity and tip-offs, I would put the odds of guessing a true/false question correctly at more like 65%.
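To see what those per-question odds add up to over a whole test, here is a rough sketch using the binomial distribution. The 20-question length comes from the text; the 16-out-of-20 pass mark, the 25% figure for a four-option question, and the function name `prob_at_least` are assumptions made only for the example.

```python
from math import comb

def prob_at_least(correct_needed, questions, p_guess):
    """Binomial probability of getting at least `correct_needed` right by guessing."""
    return sum(
        comb(questions, k) * p_guess**k * (1 - p_guess) ** (questions - k)
        for k in range(correct_needed, questions + 1)
    )

QUESTIONS = 20
PASS_MARK = 16  # hypothetical 80% pass mark, not from the article

for label, p in [
    ("true/false, pure guess", 0.50),
    ("true/false with tip-offs", 0.65),
    ("four-option multiple choice", 0.25),  # assumes all four options look plausible
]:
    expected = QUESTIONS * p
    chance = prob_at_least(PASS_MARK, QUESTIONS, p)
    print(f"{label}: expected score {expected:.1f}, chance of passing {chance:.4%}")
```

The higher the odds of guessing a single question, the more of the final score is luck rather than learning.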

Multiple-choice questions (or as people who mistakenly think that they are witty call them, “multiple guess”), if well written, provide us with a clearer picture of whether or not the learners actually learned. For example, a pretest question might read:

  1. The element with the lowest atomic weight is:

    a) Hydrogen

    b) Argon

    c) Helium

    d) I don’t know

I get laughed at for using “I don’t know” as a distractor, but you might be surprised how often people select that as an answer. Certain things make this a good question: there is only one correct answer, and the distractors (the wrong choices) are correct answers to other questions. Here again good grammar makes a difference. Suppose I were to ask a question like:

  1. A gas whose oxidation number of 0 prevents it from forming compounds readily is called an ________________ gas:

    a) Inert

    b) Low reaction

    c) Non-reactive

    d) I don’t know

Since the word “an” is used directly before the blank, basic grammar tells us that the correct answer begins with a vowel. If we have the brains God gave geese we can assume that d) I don’t know is incorrect, and by process of elimination conclude that a) must be the correct answer, as all other possibilities are grammatically incorrect. Never use answers like “a) and b) only” or “all of/none of the above” because you risk testing reading comprehension skills instead of knowledge acquisition.
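If you keep your questions in a file, a tip-off like this can be caught with a crude automated check. The sketch below is only an illustration: the function name `article_tip_off` is made up, and it uses a simple spelling rule (does the option start with a vowel?) that will misjudge words like “hour” or “unit”.

```python
import re

VOWELS = "aeiou"

def article_tip_off(stem, options):
    """Return the options that still fit grammatically after 'a' or 'an'.

    If only one option survives, the article before the blank gives the
    answer away. Options are judged by their first letter only.
    """
    match = re.search(r"\b(an?)\s+_{3,}", stem, flags=re.IGNORECASE)
    if match is None:
        return list(options)  # no article right before the blank
    needs_vowel = match.group(1).lower() == "an"
    return [opt for opt in options if (opt[0].lower() in VOWELS) == needs_vowel]

stem = "A gas that does not form compounds readily is called an ________________ gas:"
# "I don't know" is left out because no test-taker guesses it on purpose.
real_options = ["Inert", "Low reaction", "Non-reactive"]
print(article_tip_off(stem, real_options))
```

Writing the stem as “a(n)” or moving the article into each option removes the clue.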

I also get a fair amount of guff for having 20 questions: “It’s too many,” “It takes too long.” Fair criticisms, I suppose, but 20 is also enough to be statistically valid (assuming a couple of variables). Let’s assume we have 20 people in a class and each is taking a 20-question test. For a confidence level of 95% and a confidence interval of ±5%, you would need scores from 19 of those 20 people. Once you have validated the test you can then be reasonably certain that the difference between the pre- and posttest scores is the result of the learning event.
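The figure of 19 can be reproduced with a common sample-size formula (Cochran’s, with the finite population correction). That this is the calculation behind the number is my assumption; the function name `required_sample_size` and the default of p = 0.5 (the most conservative guess about how answers are split) are likewise choices made for this sketch.

```python
def required_sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Cochran's sample-size formula with the finite population correction.

    z=1.96 corresponds to a 95% confidence level; margin=0.05 is the
    plus-or-minus 5 point interval; p=0.5 is the most conservative split.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # size needed for a huge population
    return n0 / (1 + (n0 - 1) / population)      # shrink it for a small class

# A class of 20 people, as in the example above.
print(round(required_sample_size(20)))  # 19
```

Rounding to the nearest whole person gives 19; rounding up to be conservative would mean testing all 20.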

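When analyzing the test scores you should see them bunched toward the right-hand (high) end of the scale, with only a thin tail of low scores; statisticians call this a left, or negative, skew. In other words, the scores should be disproportionately high, indicating that most people mastered the content.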
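One quick way to check the shape of the posttest scores is to compute their skewness. In the sketch below, scores bunched at the high end with a thin tail of low scores produce a negative value. The function name `skewness` and the scores are invented for illustration.

```python
from statistics import mean, median, pstdev

def skewness(scores):
    """Population skewness: negative when the tail points toward low scores."""
    m, s = mean(scores), pstdev(scores)
    if s == 0:
        return 0.0  # everyone scored the same, so there is no skew
    return sum(((x - m) / s) ** 3 for x in scores) / len(scores)

# Made-up posttest scores out of 20: most are high, one learner struggled.
posttest = [19, 18, 20, 17, 19, 18, 20, 19, 12, 18]
print(f"mean={mean(posttest)}, median={median(posttest)}, skew={skewness(posttest):.2f}")
```

A mean that sits below the median tells the same story: a few low scores are dragging the average down while most of the class did well.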

You can further analyze the data using µ scores. A µ (pronounced moo) score is the average of the averages. If the µ score goes up, it indicates that the instructors are getting better at their jobs; if it goes down, it means that instructors are getting bored, taking shortcuts, or for some other reason failing to present the full content. Assuming the content and the test have not changed, the µ score is an accurate reflection of the performance of the instructor.
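As a minimal sketch of the average of the averages, assume each class is a list of posttest scores and the classes are grouped by the period in which they were taught. The function name `mu_score`, the grouping by quarter, and the numbers are all made up for illustration.

```python
from statistics import mean

def mu_score(class_scores):
    """Average of the class averages (each class counts equally)."""
    return mean([mean(scores) for scores in class_scores])

# Made-up posttest scores out of 20, two classes per quarter, same instructor.
first_quarter = [[15, 16, 14], [16, 17, 15]]
second_quarter = [[17, 18, 16], [18, 19, 17]]

print(mu_score(first_quarter), mu_score(second_quarter))
```

Because each class counts equally, a small class moves the µ score as much as a large one, so keep class sizes in mind when comparing periods.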

The other two levels of Kirkpatrick’s model are a bit too complex for laymen to dabble in, but this is how you can validate whether or not your training is effective.