Forensic experts are surprisingly good at determining whether two written samples match


After the 2020 U.S. presidential election, handwriting analysis researcher Mara Merlino received a strange phone call. The man on the line appeared to be seeking her endorsement of a highly dubious report that claimed signatures were missing from the ballots. Merlino, a professor of psychology at Kentucky State University, asked some pointed questions. After learning, among other things, that the person who had verified the signatures was not a trained professional, she concluded that “everything in this report was bad and wrong.”

A new study supports Merlino’s intuition. It finds that forensic handwriting comparison is an effective method but that it takes years of professional training to ensure the validity of conclusions reached in judicial and administrative cases. According to the results, published on August 1 in the Proceedings of the National Academy of Sciences USA, highly trained professionals can reliably identify samples written by the same hand and make the correct call most of the time. In contrast, people with more limited training perform much worse.

In 2009, amid concerns that some forensic methods might be flawed, the National Research Council released a report noting that there was weak support for methods of comparing crime scene evidence samples, including hair, bite marks and handwriting. In 2016 a group of advisers to then-President Barack Obama prepared a report calling for further research into the accuracy of comparison methods.

The new findings show “the validity and reliability of this process,” says Merlino, who was not involved in the work. The results of this and other studies “point in the same direction, which is that fully trained document examiners are reasonably accurate in the calls they make.” A growing body of evidence helps address the criticism “that there hasn’t been research that empirically proves that people in the field can do what they claim they can do,” she adds.

David Weitz, a professor of physics at Harvard University and an associate editor at PNAS who oversaw the peer review process for the paper, says he thought it would be important as a “serious scientific study of forensic analysis,” which he is not “always convinced is done in a truly scientific way.” Scientific American contacted four authors of the FBI-funded study, but none were available for interviews prior to publication.

For the paper, 86 forensic document examiners, most of them civil servants, conducted 100 handwriting comparisons using digital images of handwriting produced by 230 people. Of the 100 tasks, 44 were comparisons of documents handwritten by one person, and the remaining 56 were comparisons of documents written by two people. Unknown to the participants, a tenth of the comparison sets were repetitions of sets they had already seen, a way to test how consistent each participant was over time.

Forensic examiners compare samples based on a long list of factors, says Linda Mitchell, a board-certified forensic document examiner and researcher who was not involved in the study. Features of interest include the spacing between letters, the way the letters are joined, and how far strokes drop below or rise above the body of a letter, such as the tail of a lowercase “g” or the upstroke of a lowercase “d”. “There are a lot of things,” she says.

Experts in the FBI-funded study expressed their conclusions on a five-level scale: definitive that the same writer did or did not write the compared samples, probable that the same writer did or did not write them, or no conclusion.

Overall, examiners wrongly concluded that samples from different writers came from the same writer in 3.1 percent of cases, a false positive. Different writers who were twins were more likely to mislead the examiners, producing a false-positive rate of 8.7 percent. The false-negative rate, for samples from one writer that were wrongly attributed to two different writers, was even lower, at 1.1 percent.

Twins are tricky because a shared environment can produce similar handwriting within some families, Mitchell says. “But at the end of the day,” she adds, there is always something that distinguishes one person’s handwriting from another’s.

Level of experience influenced examiners’ accuracy and confidence. The 63 examiners who had trained for two years or more performed better and were more cautious, tending to rely on “probable” conclusions more than novices did. The nine worst-performing participants were among the 23 who had the least training. “That’s what training is all about: making sure you know what you don’t know,” Mitchell says.

Of the 86 participants, 42 erroneously concluded that samples belonged to the same writer in at least one comparison. But a core group of eight participants was responsible for more than half of those errors: 61 of the 114 incorrect calls. On false negatives, the group as a whole did better in raw numbers: only 17 of the 86 made this mistake, and one of the 17 stood out for making more than a fifth of those incorrect calls.
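For readers who want to check the proportions, the counts quoted above can be turned into percentages with a few lines of Python. This is only an arithmetic sketch based on the figures reported in this article, not the study's own analysis; the variable names and the `share` helper are illustrative assumptions.

```python
# Counts quoted in the article; names here are illustrative, not from the study.
participants = 86
with_false_positive = 42     # examiners with at least one false positive
false_positive_calls = 114   # total false-positive calls across all examiners
core_group_calls = 61        # false-positive calls made by the core group of eight
with_false_negative = 17     # examiners with at least one false negative


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


core_share = share(core_group_calls, false_positive_calls)
fp_examiner_share = share(with_false_positive, participants)
fn_examiner_share = share(with_false_negative, participants)

print(f"Core group's share of false positives: {core_share}%")
print(f"Examiners with a false positive: {fp_examiner_share}%")
print(f"Examiners with a false negative: {fn_examiner_share}%")
```

By this arithmetic, the eight examiners in the core group account for roughly 54 percent of the false positives, while just under half of all participants made at least one such error and about a fifth made at least one false negative.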

The experts proved fairly consistent when asked to review the same documents again, reversing their conclusion entirely in only 1 percent of the cases they reviewed twice. They were more inclined to hedge, moving from definitive to probable or vice versa, than to completely reverse a finding. Only one person in the entire group came to the same conclusion on all of their second reviews. Conclusions were similarly consistent from examiner to examiner: different examiners reached completely opposite conclusions on the same comparison only 1.2 percent of the time.

The five-point scale, which expresses the strength of the examiner’s judgment, is a metric in flux, Merlino says. Some groups that develop these scales have tried as many as nine levels to express the strength of judgments, but a gap separates how laypeople and experts interpret the findings, with laypeople generally wanting a simple yes or no rather than a qualified conclusion. “From what I’ve seen, five levels isn’t enough, nine levels is probably optional, but seven levels might be about right,” Merlino adds. Sorting this out and clarifying how laypeople interpret this “language of judgment” may be the most important things to establish in further research, she says.
