
A research project carried out by the Department of Computer Engineering at the University of Cádiz and funded by the Ministry of University, Research, and Innovation has created REDIBAGG, a method that cuts the training time of artificial intelligence models by up to 70% by using less data, without losing accuracy. The technique could be used to analyze large volumes of information in fields as diverse as medicine, industry, and finance.
The tool is designed for classification tasks on large volumes of data, that is, situations where an algorithm must choose between a fixed set of options. In healthcare, for example, it could speed up automatic diagnosis systems without sacrificing reliability; in industry, it could be used to detect failures in real time with lower resource consumption; and in finance, it could process large records in less time to prevent fraud or analyze risks.
As explained in an article published in the journal ‘Engineering Applications of Artificial Intelligence’, the system performs well in very different contexts. «It is not a method oriented towards certain types of data, but rather it is very versatile and robust with any data volume that has a large number of features or instances,» notes Juan Francisco Cabrera, co-author of the study.
Another advantage of the tool is its ease of implementation. It can be readily applied in common artificial intelligence environments such as the Python programming language and standard libraries such as Scikit-learn, a library that makes machine learning techniques straightforward to use. This makes the method easier for researchers, companies, and institutions to adopt.
REDIBAGG is a variant of ‘bagging’ (short for ‘bootstrap aggregating’), a model combination (ensemble) method widely used to improve the accuracy of classifiers in artificial intelligence. Bagging creates multiple subsets from the original data sample. Each subset is used to train a base classifier, and the predictions of all the classifiers are then combined to make more reliable decisions. Bagging uses ‘bootstrap’ as its resampling method, a statistical technique that generates random sub-samples with replacement. This means that new data collections are created by randomly selecting examples from the original set, so some examples may appear several times and others not at all.
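The classic bagging procedure described above can be sketched in a few lines of plain Python. This is only an illustration of standard bagging, not the researchers' code: the toy data, the function names, and the very simple nearest-mean base classifier are all assumptions made for the example.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) examples uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]

def train_nearest_mean(sample):
    """Hypothetical base classifier: predict the class whose mean is closest."""
    sums, counts = {}, Counter()
    for x, label in sample:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] += 1
    means = {label: sums[label] / counts[label] for label in sums}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

def bagging_fit(data, n_models, seed=0):
    """Train one base classifier per bootstrap sample."""
    rng = random.Random(seed)
    return [train_nearest_mean(bootstrap_sample(data, rng))
            for _ in range(n_models)]

def bagging_predict(models, x):
    """Combine the base classifiers' predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy one-feature data set (assumed for illustration): (value, class label).
data = [(0.5, "low"), (1.0, "low"), (1.5, "low"), (2.0, "low"),
        (6.0, "high"), (6.5, "high"), (7.0, "high"), (7.5, "high")]

models = bagging_fit(data, n_models=15)
print(bagging_predict(models, 1.2), bagging_predict(models, 6.8))
```

Note that every bootstrap sample here has as many examples as the original set, which is the source of the cost discussed next.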
Although ‘bagging’ is effective, its main drawback is its high computational cost. Each model is trained on a sub-sample of the same size as the original set, which slows down learning and increases resource consumption. To address this limitation, the researchers applied a new resampling scheme that generates smaller but still representative subsets.
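The idea of training each model on a smaller subset can be illustrated as follows. The article does not describe REDIBAGG's actual resampling scheme, so this sketch swaps in a simple stand-in: class-stratified random sampling with replacement at a chosen fraction of the original size. The synthetic data, the `fraction` parameter, and all function names are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def reduced_sample(data, fraction, rng):
    """Stand-in resampler (NOT REDIBAGG's): draw about fraction * len(data)
    examples with replacement, keeping each class's share of the data."""
    by_class = defaultdict(list)
    for item in data:
        by_class[item[1]].append(item)
    sample = []
    for items in by_class.values():
        k = max(1, round(fraction * len(items)))  # at least one per class
        sample.extend(rng.choices(items, k=k))
    return sample

def train_nearest_mean(sample):
    """Hypothetical base classifier: predict the class whose mean is closest."""
    grouped = defaultdict(list)
    for x, label in sample:
        grouped[label].append(x)
    means = {label: sum(xs) / len(xs) for label, xs in grouped.items()}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

def fit_ensemble(data, n_models, fraction, seed=0):
    """Train n_models classifiers, each on its own reduced sample."""
    rng = random.Random(seed)
    return [train_nearest_mean(reduced_sample(data, fraction, rng))
            for _ in range(n_models)]

def predict(models, x):
    """Majority vote over the ensemble, as in classic bagging."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

def accuracy(models, test):
    return sum(predict(models, x) == y for x, y in test) / len(test)

def make_data(n_per_class, rng):
    """Synthetic two-class data (assumed): two overlapping Gaussians."""
    return ([(rng.gauss(0.0, 1.0), "a") for _ in range(n_per_class)] +
            [(rng.gauss(4.0, 1.0), "b") for _ in range(n_per_class)])

data_rng = random.Random(42)
train = make_data(500, data_rng)
test = make_data(100, data_rng)

full = fit_ensemble(train, n_models=11, fraction=1.0)   # full-size subsets
small = fit_ensemble(train, n_models=11, fraction=0.3)  # 30% subsets

print("full-size subsets:", accuracy(full, test))
print("30% subsets:", accuracy(small, test))
```

With `fraction=0.3`, each model sees 300 training examples instead of 1,000, so the ensemble processes fewer examples overall. In Scikit-learn, the closest built-in control is the `max_samples` parameter of `BaggingClassifier`, which draws plain random subsets rather than applying the scheme used in REDIBAGG.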
Based on these sub-samples, they trained several independent models and combined their predictions just as in classic ‘bagging’. «In the era of big data, where large data volumes are handled, methods that reduce learning times are welcome, especially when the reduction reaches up to 70% compared to the original method,» emphasizes Esther Lydia Silva, the study’s lead author.
To validate its effectiveness, they tested it on 30 real data sets using Urania, the supercomputer at the University of Cádiz. The data sets came from diverse areas such as medicine, biology, physics, and social sciences. The method was also applied with different types of classification algorithms, such as decision trees, neural networks, support vector machines, and Bayesian models.
Next Objectives
In all cases, the new approach achieved accuracy comparable to that of the original method. On average, training time fell by 35%, with reductions of up to 70% on very large data sets. «By working with less complex models, training hours and storage costs are reduced, making the method much more efficient,» explains the scientist.
Now, the researchers aim to release the method for use by the scientific community. They also plan to study how the tool could be applied to other machine learning systems besides ‘bagging’ and its variants, combine it with variable selection techniques to obtain even more efficient models, or explore its adaptation to regression tasks where numeric values are predicted instead of categories.
The work has been funded by the Ministry of University, Research, and Innovation and by FEDER funds (the European Regional Development Fund).