The Royal Spanish Academy (RAE) and AWS have presented a tool created jointly within the framework of the Spanish Language and Artificial Intelligence (LEIA) project. The tool analyzes the Spanish language with the aim of assessing its status globally and across all areas, especially the digital sphere. The system has been built on AWS cloud-native technologies, with the RAE acting as advisor to the project.
In its first version, which is a test release rather than the definitive one, the tool contains 8,745,563 documents from Spain and the Spanish-speaking countries of the Americas. The sources it draws on focus on the spontaneous, current Spanish used in digital environments. It relies especially on informal texts obtained from social networks, forums and e-commerce platforms. However, it also includes a selection of journalistic texts, which serve to observe the differences between the various registers of the language.
Its functions are divided into three blocks. The first is dedicated to the study of foreign words and is responsible, among other things, for detecting their proportion in the texts it examines. The second analyzes vocabulary richness, measuring the diversity of words with the MTLD metric (measure of textual lexical diversity). The last block is a radar for linguistic errors, which identifies mistakes and catalogs them by type: grammatical, lexical, stylistic or typographic. For all of this, the tool has integrated rules derived from the RAE's normative works.
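To make the MTLD measure concrete, here is a minimal sketch of its forward pass: the text is scanned token by token, and each time the running type-token ratio falls to a conventional threshold (0.72 in the literature) a "factor" is counted and the window resets; MTLD is the token count divided by the number of factors. This is an illustrative simplification, not the project's actual implementation.

```python
def mtld_forward(tokens, threshold=0.72):
    """One forward pass of MTLD: count 'factors', i.e. stretches of
    text over which the type-token ratio (TTR) stays above threshold."""
    factors = 0.0
    types = set()
    token_count = 0
    for tok in tokens:
        token_count += 1
        types.add(tok.lower())
        ttr = len(types) / token_count
        if ttr <= threshold:
            factors += 1          # a full factor is complete
            types.clear()         # reset the window
            token_count = 0
    if token_count > 0:
        # credit a partial factor for the leftover stretch
        ttr = len(types) / token_count
        factors += (1 - ttr) / (1 - threshold)
    return len(tokens) / factors if factors > 0 else float(len(tokens))
```

A highly repetitive text yields a low MTLD (short factors), while a text with no repeated words never completes a factor and scores its own length; the full MTLD definition averages a forward and a backward pass.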
Both the RAE and AWS already have plans to keep expanding the tool's capabilities in future versions. Among other functions, they plan to equip it to analyze the clarity of administrative language, compare the quality of Spanish across eras, and detect common errors in voice assistants and other devices with artificial intelligence. For now, it is capable of working with millions of documents and offers online visualization with results filtered by country of origin, data source or date. Its data can be presented in graphs and visual maps.
This cloud-native linguistic analysis tool is built on a serverless, event-driven architecture. The analysis it performs on data sources has three phases. In the first, documents from the data sources are indexed. For this, the tool uses AWS Lambda, a cloud service that runs code without provisioning or managing servers, to index the documents in Amazon OpenSearch Service.
This is a highly scalable system that offers fast access, analysis and search over large volumes of data. Before indexing, however, another step verifies and validates that each document contains the fields needed to identify it: generation date, text, the country it belongs to, and that country's code. The data sources, as well as the results and metrics obtained from the input documents, are stored in Amazon S3, a storage service designed to make any volume of data accessible from anywhere.
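The validate-then-index step could look roughly like the sketch below, which checks the required fields and builds the NDJSON body for an OpenSearch `_bulk` indexing request. The field names (`generation_date`, `text`, `country`, `country_code`) and the index name are assumptions for illustration, not the project's actual schema.

```python
import json

# Assumed field names; the article only says each document must carry
# a generation date, its text, its country, and the country code.
REQUIRED_FIELDS = ("generation_date", "text", "country", "country_code")

def missing_fields(doc):
    """Return the required fields that are absent or empty in a document."""
    return [f for f in REQUIRED_FIELDS if not doc.get(f)]

def to_bulk_body(docs, index="leia-corpus"):
    """Build the NDJSON body of an OpenSearch _bulk request,
    skipping documents that fail validation."""
    lines = []
    for doc in docs:
        if missing_fields(doc):
            continue  # a real pipeline would log or quarantine these
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"
```

In a Lambda handler, a body like this would be POSTed to the OpenSearch `_bulk` endpoint each time a new batch of documents lands in S3.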
In the second phase, metrics characterizing the texts of each data source are computed according to various criteria: general statistics on the variability, frequency and richness of the text, and error counts obtained with natural language processing algorithms. The natural language processing algorithm, based on rules drawn from academic works, detects errors belonging to several categories.
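As a rough idea of what such per-source statistics might include, the following sketch computes token and type counts, a simple type-token ratio as a crude richness measure, and the most frequent tokens. These particular metrics are illustrative assumptions; the article does not detail the exact statistics the tool computes.

```python
from collections import Counter

def text_metrics(tokens):
    """Illustrative per-text statistics: size, vocabulary size,
    type-token ratio (crude richness), and top word frequencies."""
    freq = Counter(t.lower() for t in tokens)
    n = len(tokens)
    return {
        "tokens": n,
        "types": len(freq),
        "ttr": len(freq) / n if n else 0.0,
        "top": freq.most_common(3),
    }
```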
The third phase is the indexing of the analysis results for later visualization, which is also carried out with AWS Lambda. It feeds the data indexed per source into a visualization tool based on Amazon OpenSearch Dashboards. In this way, users can view and interact with the data once it has been processed, including dynamic filters that update the displayed results in real time.
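The dynamic filters described above (by country, data source and date) map naturally onto an OpenSearch bool-filter query. The sketch below builds such a query body; the field names are the same assumed ones as earlier and may differ from the tool's real mapping.

```python
def build_filter_query(country=None, source=None, date_from=None, date_to=None):
    """Build an OpenSearch query-DSL body combining optional filters,
    mimicking the dashboard's country/source/date facets (field names assumed)."""
    filters = []
    if country:
        filters.append({"term": {"country_code": country}})
    if source:
        filters.append({"term": {"source": source}})
    if date_from or date_to:
        date_range = {}
        if date_from:
            date_range["gte"] = date_from
        if date_to:
            date_range["lte"] = date_to
        filters.append({"range": {"generation_date": date_range}})
    return {"query": {"bool": {"filter": filters}}}
```

Because `filter` clauses do not affect relevance scoring and are cacheable, they are the idiomatic choice for exactly this kind of faceted dashboard filtering.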
In developing the project, the tool's creators have used Amazon SageMaker, a service for building, training and deploying machine learning models, to build and test algorithms and visualizations. They have also used AWS Batch, which dynamically provisions the most appropriate amount and type of compute resources for each job based on its volume and specific requirements.