The value of data in Artificial Intelligence

by | Feb 1, 2021 | Blog

Let’s start with this observation: modern software applications collect information from remote sources and then produce new data in turn. Storage is not a problem anymore: Cloud technology gives us almost infinite space at a sustainable price.

Software applications try to solve a real-world problem, for example: improving a production cycle. The more data we have, the better we can analyze them to discover hidden patterns that show us which direction to take. Unfortunately, these patterns are too complex for the human eye to discern. This is exactly what Machine Learning does: it looks for patterns by examining large amounts of data, then generates code that helps us recognize those patterns in new data. It can even analyze data that’s coming in from our application in real-time.

I discovered the world of data analysis during my studies, which then led me to work in the late ’90s with research groups at CERN in Geneva that were looking for the tau neutrino. That wonderful experience left a permanent mark on my understanding of the nature of data and how they should, so to speak, be nurtured.

2020 showed us how a huge amount of data may only generate confusion, leaving room for an infinite number of interpretations. I am referring, of course, to COVID-19: the lack of uniformity in the collection phase complicates every possible analysis and makes even a basic interpretation difficult. Nevertheless, this has not limited the number of chaotic articles in newspapers and websites worldwide.

Therefore, data must be collected according to a certain criterion: this includes taking responsibility when some of them need to be discarded. In the case of a broken sensor, it can be an easy decision to make, but when the human/political factor takes over, the discussion completely changes.
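To make the idea of a collection criterion concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the Reading type, the valid range and the sensor names are hypothetical and do not come from any real system. The point is only that the rule for discarding is written down in advance and that every discarded reading keeps the reason it was rejected.

```python
import math
from dataclasses import dataclass


@dataclass
class Reading:
    """A single measurement (hypothetical structure, for illustration only)."""
    sensor_id: str
    value: float


# Assumed physical limits of the sensor; a real criterion would come from its datasheet.
VALID_RANGE = (-40.0, 125.0)


def split_readings(readings, valid_range=VALID_RANGE):
    """Separate readings into kept and discarded, recording why each one was discarded."""
    low, high = valid_range
    kept, discarded = [], []
    for reading in readings:
        if math.isnan(reading.value):
            discarded.append((reading, "value is not a number"))
        elif not (low <= reading.value <= high):
            discarded.append((reading, f"value outside [{low}, {high}]"))
        else:
            kept.append(reading)
    return kept, discarded


readings = [
    Reading("s1", 21.5),
    Reading("s2", 999.0),         # e.g. a broken sensor stuck at full scale
    Reading("s3", float("nan")),  # e.g. a failed transmission
    Reading("s1", 22.1),
]
kept, discarded = split_readings(readings)
for reading, reason in discarded:
    print(f"discarded {reading.sensor_id}: {reason}")
```

Keeping the discarded readings together with a reason, instead of silently dropping them, is what makes it possible to take responsibility for the decision later.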

One of the first fields of study in Statistics, dating back to the eighteenth century, was the ratio of the number of males to the number of females at birth. This number should range between 1.03 and 1.06 males per female; whenever this ratio has increased significantly in some countries (meaning a high number of males compared to the number of females), very serious social and/or political problems have then come about.

Once you have decided which sample of data is “good”, your analysis starts. Here too, there may be hidden surprises. You get a result that is perhaps surprising, or even revolutionary. In the case of a scientific experiment, perhaps something new is discovered, or what was predicted by the theory is observed. The human factor is decisive here, too. Enthusiasm rises, but we try to contain it by repeating the analyses. The results are confirmed, and then articles are written and interviews are given. Then, after some time, a small systematic effect is discovered in one of the components of the experimental apparatus and all the previous results vanish into thin air.

For example, I remember when all the newspapers in the world announced that neutrinos traveled faster than light, in defiance of Einstein’s prediction. Two anomalies were later discovered: one in the calibration of the reference clock used to calculate the travel time of the particle, the other, trivially, in the state of the cable that connected the GPS system to a computer card in the experimental apparatus. You can read the consequences of this bitter discovery here.

What happens when there isn’t any plan for double-checking the results found by a machine learning algorithm? When nothing and no one is checking this well-oiled machine? In science fiction literature, the term “ghost in the machine” has been used to refer to the phenomenon in which artificial intelligence unexpectedly evolves beyond its original purposes.

I don’t want to conjure up scenarios as seen in Westworld, especially since reality is full of embarrassing examples of assistants from all brands and prices that malfunction. What should make us reflect, instead, is that the systematic effects, the “biases” in the scenarios of “social” networks, can produce unpredictable and disastrous effects as we’ve seen time and time again in the newspaper chronicles.

I want to keep things light and suggest visiting a web site where bizarre and completely random correlations are collected. For example, the number of divorces in Maine correlated with the per capita consumption of margarine.

Or the number of graduates with a PhD in Mathematics correlated with the amount of uranium stored at nuclear plants in the US.

One of the cornerstones of statistics is that correlation does not equal causation. In turn, strong evidence in the data does not necessarily imply that it is actually related to the effect we are looking for. Some time ago, the U.S. Army was testing an algorithm to look for missiles in a sample of photos. They tested it on a series of images taken in Germany and the results were excellent. Unfortunately, the algorithm was actually finding trees with some small pieces of a missile nearby. The same algorithm found nothing in images with missiles taken in the desert or gave a false positive with a bicycle in a forest. I guess TensorFlow hadn’t been invented yet!
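The gap between correlation and causation is easy to reproduce. The sketch below, in Python, uses made-up numbers (not real statistics) for two unrelated quantities that both happen to grow over ten years. The names series_a and series_b, and the choice of year-over-year differences as a way to remove the shared trend, are my own assumptions for this illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def differences(values):
    """Year-over-year changes; a steady linear trend becomes a constant and drops out."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Made-up data: each series is a straight upward trend plus its own small wiggle.
years = range(10)
wiggle_a = [0.3, -0.3, 0.3, -0.3, 0.3, -0.3, 0.3, -0.3, 0.3, -0.3]
wiggle_b = [0.2, 0.2, -0.2, -0.2, 0.2, 0.2, -0.2, -0.2, 0.2, 0.2]
series_a = [5.0 + 1.0 * t + w for t, w in zip(years, wiggle_a)]
series_b = [40.0 + 3.0 * t + w for t, w in zip(years, wiggle_b)]

r_raw = pearson(series_a, series_b)
r_changes = pearson(differences(series_a), differences(series_b))
print(f"correlation of the raw series:     {r_raw:.3f}")
print(f"correlation of the yearly changes: {r_changes:.3f}")
```

The raw series are almost perfectly correlated simply because both increase with time. Once the shared trend is removed, the correlation between the yearly changes is practically zero: the only thing the two quantities had in common was the calendar.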

Is it just folklore? What happens if an algorithm sifts through a number of candidates for a job by analyzing the content of their resumes and comparing them to the data of those who have worked in that company over the past 20 years? This is just a hypothetical scenario because I don’t want to quote articles without being able to cite reliable sources. A simple search will show you dozens of examples of racial bias in machine learning algorithms.

I realize I’ve raised a lot of doubts but that’s actually a good thing. To quote a famous joke:

Question to Radio Yerevan: “Is it correct that Grigori Grigorievich Grigoriev won a luxury car at the All-Union Championship in Moscow?”

Radio Yerevan answered: “In principle, yes. But first of all it was not Grigori Grigorievich Grigoriev, but Vassili Vassilievich Vassiliev; second, it was not at the All-Union Championship in Moscow, but at a Collective Farm Sports Festival in Smolensk; third, it was not a car, but a bicycle; and fourth he didn’t win it, but rather it was stolen from him.”

I joined Ellycode with the hope of going beyond the sensationalism behind AI and Machine Learning. I was aware that there is still so much to do and to study, but that the opportunity is too important to miss out on.

Written by

Salvatore Sorrentino