3.2. Production processes

The company’s AI engineer, Javier García, explains that “the program works like a human mind: it captures data, analyses them along with other examples, and generates the news”. In practice, however, the workflow for generating content with Gabriele is more complex.

The participant observation carried out within the framework of this research analysed how a news story about unemployment in Spain is produced. The process is divided into three phases:

The first is called “definition of the news design” and consists of the continuous input of data into the program. At this stage, different chronicles, reports, and news stories about unemployment in the country, among other journalistic genres, are fed into the system. These publications serve as a template from which the system automatically detects patterns, items, key words to be dealt with, context, and linguistic structure. For this reason, the selected texts are characterised by simple composition, little interpretive depth, and a clear linguistic structure. “The aim is to create a kind of library, or narrative archive, that serves as a model or guide for later texts”, explains Llorente. The company’s lead software architect, Alberto Moratilla, adds that the more examples are provided, the better the quality of the news created by the software. In sports, for example, approximately 10,000 pieces of information are generally used, while in financial matters the figure rises to about 50,000. Currently, the program has compiled around 10 million journalistic texts, equivalent to approximately 40 gigabytes. These examples are selected beforehand by a team of journalists, whose number varies but is usually two. This is the only stage where information professionals are involved in the production process. It is also the phase with the greatest amount of work a priori, but once it is under way, the system itself learns and improves.
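The article does not publish Gabriele’s internals, so the following Python sketch is only an illustration of what a “narrative archive” of this kind might look like: example texts are indexed by topic and scanned for recurring phrases that later stages could reuse as structural templates. All names and the bigram heuristic are assumptions, not the company’s actual design.

```python
from collections import Counter, defaultdict
import re

class NarrativeArchive:
    """Hypothetical sketch of a 'narrative archive': example texts are
    stored per topic and mined for recurring word bigrams, a crude proxy
    for the content patterns the real system is said to detect."""

    def __init__(self):
        self.texts = defaultdict(list)             # topic -> example texts
        self.phrase_counts = defaultdict(Counter)  # topic -> phrase counts

    def add_example(self, topic: str, text: str) -> None:
        """Ingest one example text, updating the pattern statistics."""
        self.texts[topic].append(text)
        words = re.findall(r"\w+", text.lower())
        self.phrase_counts[topic].update(zip(words, words[1:]))

    def common_patterns(self, topic: str, n: int = 10):
        """Return the n most frequent bigrams seen for a topic."""
        return self.phrase_counts[topic].most_common(n)

archive = NarrativeArchive()
archive.add_example("unemployment", "Registered unemployment fell in April.")
archive.add_example("unemployment", "Registered unemployment rose in March.")
print(archive.common_patterns("unemployment"))
```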

The second phase is called “machine learning comprehension” and is based on downloading and further processing the data entered in the previous stage. The system takes the selected, relevant information from the knowledge base and combines it with the template formats of the library, extracting the most pertinent data. Two algorithms come into play in this process: variability and similarity. The first creates possible initial structures from different combinations and detects the most relevant content patterns. In the case of the news on unemployment, these were as follows: the long-term unemployment rate; the number of unemployed people registered at Public Employment Offices; the change in that number compared to the previous month; job creation registered by Autonomous Regions; and the number of workers registered with Social Security, among others. The second algorithm detects similarities between texts and indicates which articles resemble one another. If two very similar news items have been created, the program rewrites them so that each client (media outlet) receives different information. At this stage, Gabriele can also customise the language and tone according to the editorial style of each outlet, ensuring consistency with the rest of its content. “Currently, the texts generated automatically by our software include English and Spanish, and we are the only company in the world that includes Arabic”, states García.
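The article names the similarity algorithm but not its implementation. A common, standard way to flag near-duplicate texts, shown here purely as an assumed stand-in, is cosine similarity over TF-IDF vectors; the threshold and sample sentences below are invented for illustration.

```python
# Hypothetical stand-in for the 'similarity' algorithm: the source does not
# disclose Gabriele's method, so this uses off-the-shelf TF-IDF + cosine.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def flag_near_duplicates(articles, threshold=0.85):
    """Return (i, j, score) for article pairs similar enough that one
    would need rewriting before delivery to different client outlets."""
    tfidf = TfidfVectorizer().fit_transform(articles)
    sims = cosine_similarity(tfidf)
    pairs = []
    for i in range(len(articles)):
        for j in range(i + 1, len(articles)):
            if sims[i, j] >= threshold:
                pairs.append((i, j, float(sims[i, j])))
    return pairs

drafts = [
    "Registered unemployment fell in April compared with March.",
    "Compared with March, registered unemployment fell in April.",
    "Social Security registrations rose last month.",
]
print(flag_near_duplicates(drafts))
```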

Once these patterns have been established, the third phase begins, called “matching”, in which the system works with a CSV file, a plain-text format that stores tabular data as comma-separated values arranged in rows and columns. This file feeds a graph that merges and indexes the data extracted in real time with the narratives already existing in the system. In the case of the news about unemployment, the program has taken the unemployment data recorded in the month of April, compared them with the
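As a minimal sketch of this matching step, assuming a CSV of fresh figures and a placeholder-style narrative template, the following Python snippet fills a stored narrative with real-time values. The field names and figures are invented sample values, not data from the article.

```python
# Minimal sketch of the 'matching' phase: merge real-time CSV figures into
# a narrative template. All field names and numbers are hypothetical.
import csv
import io

csv_data = """region,unemployed,monthly_change
Madrid,350000,-4200
Andalusia,780000,-9100
"""

template = ("In {region}, {unemployed} people are registered as unemployed, "
            "a change of {monthly_change} compared with the previous month.")

for row in csv.DictReader(io.StringIO(csv_data)):
    print(template.format(**row))
```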