Quality evaluations should not be taken for granted
California Agriculture 61(1):35-39. https://doi.org/10.3733/ca.v061n01p35
Published January 01, 2007
Subjective quality-evaluation errors in agriculture, such as discarding good-quality product and packing poor-quality product, can be costly to growers and workers. This study of workers and supervisors in a strawberry-plant packingshed revealed the danger in assuming that those responsible for quality control truly understand what is required. We found that the ability of workers to correctly count plants, and to retain or reject them (and explain why), varied considerably. The results highlight the need for employers to carefully define quality parameters, and then test employees and applicants. When top management does not agree on exactly what constitutes acceptable quality, it is difficult to expect quality-control inspectors and workers to understand. Testing, as a tool, can help growers and producers make better employee selection and placement decisions and can also be used for periodic training.
Workers in a California packingshed were tested on several tasks related to sorting and packing strawberry plants. Grower Bob Whitaker (right) and his top manager Areli Toledo banter as they review the the plant ratings for specimens that they did not agree on; Toledo scored highest on the test.
Most agricultural tasks require people to make important subjective decisions of a qualitative nature. For instance, should fruit be picked or left on the tree to reach optimal maturity? Should a cow be milked or moved to a hospital to be treated for mastitis? Does a field need to be irrigated? Should a cucumber on a conveyer belt be packed or discarded? Subjective decisions are made at all hierarchical levels, from farm owner to farmworker.
Over the last 2 decades, the author has carried out a number of informal studies in an attempt to measure “rater reliability” in California and Chile. At one operation in Chile, for example, several managers rated the quality of pruning in a fruit orchard and there was no agreement among them. On another occasion, several respected California viticulturalists were asked to rate the quality of 10 grapevines pruned by different employees. After the score sheets were returned, these raters were asked to go back and redo the evaluation. Often their new scores did not agree with their scores from half an hour before (Billikopf 1994, 2003).
While the consequences for incorrect decisions may vary, such qualitative decision-making is usually a key aspect of farming. But at all operational levels, people in agriculture are usually hired without testing their ability to make qualitative decisions. This casual approach to hiring even extends to research assistants, who are most often interviewed but seldom tested for rater reliability; likewise, inter-rater reliability is rarely checked before results are reported. Such a casual approach to selection compromises the integrity of research results as well as farm profits.
Effective human-resource management offers valuable tools to help improve such critical outcomes. Practical tests (also called “job samples“) are an effective and legal way to enhance selection and placement decisions, as well as the training and performance evaluation of present employees (Billikopf 1988, 2003; Federal Register 1978; US Department of Labor 1999).
While the literature mentions the use of testing in agriculture, and even testing that involves the need for test takers to make qualitative decisions (Billikopf 1988, 2003), little has been written on rater reliability in agricultural employment testing (as either a selection, evaluation, placement or testing tool). One exception is Campbell and Madden (1990), in which raters were asked to evaluate the percentage of plant disease incidence in particular samples.
Another example is Mcquillian (2001), who tested medical personnel for how accurately they made decisions regarding medical cases, based on specific guidelines. Much more common is research that studies the reliability of tools or instruments, such as the reliability of a medical survey instrument used for brain injury diagnosis (Desrosiers et al. 1999). Because medical decision-making can have life-and-death implications, it makes sense that much of the work in this field has been conducted in the medical arena.
Top left, strawberry=plant workers use a trim tool to cut off plant stems. Top right, study participants evaluated 150 numbered samples of strawberry plants, so that their scores could be compared. Left, plants suitable for packing should have crowns roughly the size of a pencil, or larger; this root crown is on the small side.
This study examines whether individuals vary in terms of their ability to make reliable and valid evaluative decisions (that is, their rater reliability), and if this can be measured through the use of a job sample or practical test.
Data was gathered at a California strawberry-plant packingshed. While the study could have been carried out in any agricultural industry, a task was selected in which workers make multiple quick decisions that can easily be measured against a known standard.
Strawberry plants (for replanting) are harvested in the field and brought to the shed in large, tangled clusters that are separated by workers. Plants are then sorted in terms of a single passing grade (the remaining plants are discarded). Good plants are bunched into groups of 100 units and then packed for shipping nationally and abroad. Sorters are responsible for all the tasks, from untangling the plant clusters to bunching them into 100-plant units. The sorter's most critical job is inspecting each plant and determining if it should be discarded or retained, a task that normally is carried out in less than a second per plant.
After the sorters have done their job, several levels of quality-control personnel inspect the plants. (We define quality control as a system to check that sorters are making correct evaluative decisions.) The two most important quality issues are ensuring that good plants (without defects) are packed and that each bunch contains 100 plants.
While sorters must recognize which plants to retain or reject, quality-control personnel must also be able to understand and describe the reason for rejecting particular plants. This extra detail is needed so that sorters can receive feedback on their performance.
Two salient and costly quality-evaluation errors are (1) discarding good product as not salable and (2) packing poor-quality plants. Discarding good plants is detrimental to both the grower and sorters. The grower loses good plants and all the costs involved in growing them; and the workers, who are often paid on a piece-rate basis, lose good plants they could have packed and earned money from.
A poor-quality pack also has negative economic consequences for the plant buyer, who may cultivate nonviable plants or need to re-sort them beforehand. In order to make up for defective plants, some growers ship an extra 10% free. Growers who ship a higher quality pack could gain a competitive edge and positive reputation while saving on plants.
Testing for accurate evaluations
Initial preliminary tests were carried out in December 2004, but the data reported here was collected in September and October 2005. During the initial tests, it became clear that we could not conduct an effective test until top management agreed on what constituted good quality and the reasons for rejecting plants. It took several weeks of negotiation and close work with management to develop a set of known criteria.
Through the testing process we set out to determine how accurately subjects would be able to: (1) count plants per bunch; (2) make reject-versus-retain decisions for each plant; and (3) label the reason for rejecting a plant. To be effective, sorters must make accurate decisions, but not necessarily explain these to someone else. Quality-control personnel, in contrast, must clearly articulate the reason for rejecting plants. Flexibility is required since clients buying the plants can vary in terms of quality pack requirements.
For practical reasons, distinct aspects of the job were tested separately. The first dealt with the accuracy of plant count, and the second with retain-ver-sus-reject reasons. For this study, six distinct reasons were agreed upon for discarding plants. From most serious to least serious, they were: (1) cut crown, (2) black roots, (3) inadequate number of healthy roots, (4) thick crowns, (5) thin crowns and (6) lack of root hairs. For instance, if a plant had a cut crown and black roots, the recorded reason for rejecting it should be the most serious, the cut crown.
Subjects (employees) were shown samples of each discard category and were encouraged to ask questions. Some clearly took better advantage of this opportunity than others.
Subjects, including, left, Luz Maria Romero and, right, Silvia Araiza, had to make retain-versus-reject decisions for 150 strawberry plants and provide the reason for rejection. Despite the apparent simplicity of the task, few subjects scored well against the known correct answers.
For the retain-versus-reject test, the statistical analysis was adapted from the Gage Repeatability and Reproducibility (Gage R&R) quality evaluation tool. The Gage R&R instrument is often used to test the consistency of a measuring gage in the hands of multiple raters. Here, the instrument being tested was a person rather than a gage.
For both tests we developed an answer key with the known criterion against which subjects would be compared. There was no null hypothesis to test, but rather the ability of each subject to make quick, accurate decisions.
Subjects tested included the grower/ shipper, top manager, super checker, checkers, counters and sorters. While the grower and top manager may communicate quality pack standards, it is the super checker who is responsible for checking the work of the regular checkers and counters. The checkers focus mostly on plant quality, while the counters focus on plant count. There is some overlap between the responsibilities of these two job categories.
Accuracy varied widely
Twenty-four subjects (22 female and two male) participated in the counting test. A total of 2,919 plants were spread out in uneven bunches at 12 stations (bunches ranged from 200 to 300 plants, with a mean of 243 plants).
One subject recorded 818 plants in a station that only had 222, throwing off her score by a large margin. The remaining participants ranged from a total of 12 mistakes (an average of one mistake per station or 0.4% error) to 163 mistakes (more than 13 mistakes per station or 5.6% error).
There was sufficient overlap in terms of subjects who participated in the counting test and the retain-versus-reject test to note that those who could count accurately were not necessarily the same as those who did well in the reject-versus-retain test, and vice versa (table 1).
Retain versus reject. Thirty-two subjects (29 female and three male) participated in the retain-versus-reject test. Two separate sets (A and B) consisted of 150 plant samples each. Subjects were given 5 seconds per plant to make and annotate their evaluative decisions. Plants were labeled from 1 to 150 (in groups of five plants per station, with 30 stations per set).
Subjects were divided into two groups, half in set A and half in set B. Each subject evaluated the set of samples to which she or he was assigned twice. Only after the first test was completed and the score sheets collected did subjects proceed to the retest (with a new, blank score sheet).
For each subject, we obtained: (1) a test score (test results compared to known criterion); (2) a retest score (how subjects scored against a known criterion when repeating the same test for a second time); (3) an average test-versus-retest score; and (4) a reliability score (for every decision, how consistently did each subject agree with herself or himself as they evaluated the same plants twice) (table 1).
The average test/retest scores ranged from a high of 95.3% (excellent by any standard) to a low of 58.7%. Had the low-scoring subject indiscriminately accepted all plants for packing without rejecting any, she would have scored better (60%). In fact, it was much more common for subjects to reject good plants than to pack bad ones. Campbell and Madden (1990) also found that experienced raters tended to overestimate plant disease incidence. Our results are similar to those of the medical decision-making study (Mcquillian 2001), in terms of finding a large variation between the best and worst rater in the group.
As test scores increased, reliability scores generally increased as well. Low reliability scores (i.e., assigning different quality scores to the same plants) mean that a subject does not see quality issues consistently. It is possible for individuals to have high reliability scores, yet do poorly in the test/retest. Such individuals may have a reliable eye for quality, but be calibrated to a different north.
We told prospective study participants that they must be able to read and write, but nonetheless had one subject who could not fill out the score sheet. Perhaps this individual felt trapped into making a face-saving move, or else wanted the hourly wage that the grower paid to study participants.
TABLE 1. Job category, number of completed samples and reliability score between the test and retest
Of the remaining 31 subjects, six turned in partial results. They recorded retain-versus-reject decisions, but not reject reasons. These six ranged from the second lowest score to the fourth highest of all participants in terms of their average test/retest scores (table 1).
Identifying discard reasons
As long as sorters understand quality parameters, it is not essential that they (1) can explain it, or (2) can read or write. In contrast, quality-control personnel must be able to do both. The remaining 25 subjects (23 female, two male) completed the final portion of the study, where the reasons for rejecting plants were incorporated into retain-versus-reject decisions. Average test/retest scores ranged from a low of 40% to a high of 92% (table 2).
Subjects who scored highly in the test/retest also tended to have higher reliability scores. Some of the quality-control personnel did quite poorly in this test, with the super checker doing worse than both the checkers and counters she was supposed to direct. Several checkers and counters showed great potential for a super-checker position and were likely to improve with additional training.
As expected, we found high variability among subjects in terms of consistently being able to count plants, make retain-versus-reject decisions and determine the reason for rejecting plants. This variability existed among subjects who were already employed and supposedly knowledgeable. Had we administered the tests to applicants unfamiliar with the industry, we would expect to see even greater variability.
Job samples for testing employees
Our tests involved straightforward, objective issues (such as counting), as well as more subjective questions (such as whether a strawberry plant has sufficient root hairs). We found that subjects who did well in one test did not necessarily do well on another. Consequently, employers should also consider the use of tests to make placement decisions.
Selection procedures for particular tasks vary widely in terms of how valid they are. Validity, in the context of employment testing and placement, deals with how well an instrument or test predicts on-the-job behavior. Intelligence and personality tests are of limited value for predicting job performance, while job samples are highly valid predictors.
A job sample involves asking subjects to perform portions of the actual job duties. Examples may include picking oranges, pruning a deciduous orchard, driving a tractor or treating a calf. Agriculture lends itself well to job sample testing. Farm employers can set up several stations with different job duties to test and evaluate (Billikopf 2003).
It is important to test for as many different types of job tasks as the person will perform on the job. Such practical tests can be easily submitted to content oriented validity and in some instances may also be validated through a criterion-oriented approach (Anastasi 1982; Billikopf 1988, 2003; Federal Register 1978; US Department of Labor 1999). Tests can be designed so that subjects need not be able to read or write. The individualized nature of these tests can make them more time-consuming, however.
The most common error in the reject-versus-retain test was discarding good plants. A combination of preselection testing and careful placement, as well as the use of testing as a performance evaluation and training tool, should reduce material waste and at the same time increase worker wages by a considerable percentage (such as when workers get paid for plants they were previously discarding). Without testing, management mistakes could lead to, for example, placing a super checker in a position of responsibility (such as training and evaluating) over more-skilled individuals.
TABLE 2. Reliability, average test/retest score, test score and retest score for retain-versus-reject test⋆
The objective of this study was to warn researchers involved in subjective evaluations, as well as farm employers whose personnel must evaluate quality on-the-job, that quality determinations should not be taken for granted. Even though the study was carried out in a specific industry, almost every agricultural industry should pay more careful attention to quality.