1. Introduction
The advent of transformer-based models has significantly changed the field of Natural Language Processing (NLP), enabling advanced machine learning techniques that can model languages with remarkable scale and depth. This evolution has not only enhanced our capacity to process and analyze natural language data but has also paved the way for specialized applications across various domains. A notable example is Bidirectional Encoder Representations from Transformers (BERT) [1], which serves as a foundational component for developing task-specific NLP models.
BERT’s architecture, which allows for the consideration of bidirectional context processing, has reshaped how machines understand language. By utilizing masked language modeling and next sentence prediction tasks during training, BERT captures complex relationships between words and phrases, leading to improved performance on various NLP benchmarks [2]. This capability has made BERT a cornerstone in tasks such as sentiment analysis, named entity recognition, and question answering, demonstrating its versatility and effectiveness in real-world applications [3].
As the landscape of NLP continues to advance, causal autoregressive models, particularly those in the Generative Pretrained Transformer (GPT) family, have made substantial strides by focusing on text generation and predictive tasks. Models such as GPT-2 [4], GPT-3 [5], and GPT-4 [6] exemplify this approach, which uses context from previously generated tokens to predict subsequent words in a sequence. This causal autoregressive methodology stands in contrast to bidirectional models such as BERT and DeBERTa (Decoding-enhanced BERT with disentangled attention), which employ both past and future tokens to create contextual representations. The success of GPT models in wide-ranging applications, including zero-shot and few-shot prompting, underscores their effectiveness in generating coherent and contextually relevant text.
The ability of GPT models to generate human-like text has opened up new avenues for creativity and automation in content creation. For instance, GPT-3 has shown strong performance in generating coherent and contextually appropriate text across various tasks, including translation, question-answering, and even creative writing [5,7]. Additionally, GPT has made significant strides in text understanding, particularly with the development of models such as GPT-3 and GPT-4. For instance, researchers have demonstrated that GPT-3 can solve complex reasoning problems and perform tasks from the perspective of cognitive psychology, highlighting its advanced text comprehension capabilities [8].
The impressive success of transformer-based models in various domains has led to the development of specialized models tailored for specific industries. For example, researchers have introduced several domain-specific models for financial tasks [9,10,11,12,13,14]. Expanding the availability of transformer-based models in multiple languages is crucial for maximizing their effectiveness and accessibility. However, developing financial language models tailored for languages other than English presents significant challenges, particularly due to the scarcity of high-quality, large-scale datasets in the financial sector, which creates barriers to achieving optimal performance in non-English financial contexts.
Over the past few years, the financial sector has experienced a significant increase in the volume and complexity of digital information, including documents, reports, and news articles that are essential for decision-making in investment, risk management, and policy formulation. Despite this growth, current natural language processing models face challenges in accurately interpreting the nuances of financial language, particularly in underrepresented languages such as Portuguese. This difficulty is exacerbated by domain-specific jargon, mixed terminology from related fields, and a lack of large-scale annotated datasets. To enhance financial text classification and enable real-time analysis, it is crucial to develop domain-specific language models that improve performance in financial decision-making contexts by effectively addressing these challenges [15,16,17].
When no specialized model is available for a given domain, general-domain pretrained language models (PLMs) that support Portuguese, including multilingual models such as XLM-RoBERTa [18] or Portuguese general-domain models such as BERTimbau [19], are commonly used as alternatives. Another option is cross-lingual transfer, a technique that allows models trained on one language to be applied to another, often without additional training. However, these alternatives may not fully capture the nuances of specialized domains such as finance. In response to this limitation, we developed a specialized financial language model tailored for Portuguese, designed to excel in financial language processing tasks.
The creation of these specialized PLMs is essential not only for enhancing performance but also for ensuring that these technologies are inclusive and representative of varied linguistic communities. By focusing on the unique characteristics of different knowledge domains, researchers can develop more effective tools that cater to specific user needs. For instance, financial institutions operating in Portuguese-speaking countries can benefit from tailored models that understand local terminologies and regulatory frameworks.
The proposed model has undergone mixed-domain pretraining, incorporating relevant data from various sectors within the financial domain, such as politics, business management, and accounting. This approach utilizes mixed-domain training as a form of transfer learning, which is advantageous when available training data are limited. The selected domains were chosen to specifically align with the objectives of mixed-domain training and avoid utilizing knowledge from a general domain that could undesirably reduce learning performance in the target domain, a phenomenon known as negative transfer [20].
The model was evaluated on four text classification tasks. The primary results demonstrate that the model matches, and in several cases exceeds, the natural language understanding performance of the baseline models. Notably, some of these baseline models were pretrained on significantly larger datasets; for instance, BERTimbau was pretrained on the BrWaC corpus [21], which is 2.5 times larger than the corpus used for this model’s pretraining. These findings underscore the critical role of mixed-domain pretraining in resource-limited settings, particularly when compared with continued pretraining on general-domain data.
To summarize, the main contributions of this study are as follows:
We introduce DeB3RTa, the first Portuguese financial transformer-based language model, trained using a comprehensive financial corpus.
We employ a mixed-domain pretraining strategy that incorporates financial, political, business, and accounting data, ensuring a nuanced understanding of financial texts.
We integrate advanced fine-tuning techniques, including layer reinitialization, mixout regularization, stochastic weight averaging, and layer-wise learning rate decay, to enhance model adaptability and performance.
We evaluate DeB3RTa against state-of-the-art NLP models, demonstrating its superiority in key financial NLP tasks such as sentiment analysis, fake news detection, regulatory risk classification, and hate speech detection.
We release the largest curated Portuguese financial corpus to foster further research in the domain.
2. Problem Statement
Natural language processing in the financial domain presents unique challenges, particularly in languages with limited annotated data and specialized terminology, such as Portuguese. While transformer-based models have significantly advanced NLP applications, the availability of pre-trained models tailored for Portuguese financial contexts remains scarce. Existing models, such as BERTimbau and XLM-RoBERTa, while effective in general-domain tasks, fail to capture the nuances of financial language, which includes technical jargon, regulatory expressions, and industry-specific terminology.
Moreover, domain-specific financial models, such as FinBERT and SEC-BERT, have demonstrated notable success in English but lack counterparts trained on Portuguese financial corpora. This gap limits the accuracy and applicability of NLP solutions in Portuguese-speaking financial markets, affecting tasks such as risk classification, sentiment analysis, and fake news detection.
To address this limitation, we introduce DeB3RTa, a transformer-based model specifically designed for the Portuguese financial domain. By employing a mixed-domain pretraining strategy, DeB3RTa integrates data from finance, politics, business management, and accounting, ensuring a more comprehensive understanding of financial text. Additionally, we implement advanced fine-tuning techniques, such as layer reinitialization, mixout regularization, stochastic weight averaging, and layer-wise learning rate decay, to enhance the model’s adaptability and robustness across financial NLP tasks.
This research aims to:
Develop the first Portuguese financial transformer-based model leveraging the largest curated financial corpus in Portuguese;
Demonstrate the effectiveness of mixed-domain pretraining in improving financial language understanding;
Evaluate DeB3RTa against state-of-the-art models in financial NLP benchmarks, showcasing its superior performance in sentiment analysis, fake news detection, and regulatory text classification.
By addressing these challenges, DeB3RTa represents a significant advancement in Portuguese financial NLP, offering a specialized tool for financial institutions, analysts, and policymakers seeking enhanced text analysis capabilities.
3. Background
Language model pretraining represents a significant advancement in NLP, allowing models to comprehend and produce human language with exceptional precision. The emergence of BERT is a pivotal development in this field, as it encapsulates the principles of vocabulary expansion, model structure, self-supervision, and advanced pretraining methodologies. BERT’s approach has set new standards for language understanding and generation tasks.
BERT’s vocabulary employs WordPiece tokenization, breaking words into smaller subword components. Operating within a limited vocabulary, this method effectively handles a wide-ranging set of words, including those not seen during training. This ensures that the model comprehends the complexity of language, such as morphemes and smaller units of meaning, thereby enhancing its learning and generalization capabilities.
BERT’s design, built upon the transformer model [22], employs self-attention mechanisms to assess the importance of different segments in the input data. In contrast to previous models that process text linearly, BERT reads text in both directions, considering the context of each word on the basis of the words around it. The model is composed of multiple transformer layers, each consisting of two sublayers: a multihead self-attention mechanism and a fully connected feedforward network.
Self-supervision, a more structured approach to unsupervised learning, is fundamental to BERT pretraining. BERT is pretrained on extensive datasets, such as Wikipedia, and uses tasks that derive labels from the input data. There are two main pretraining tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). In MLM, tokens are randomly selected and masked, and the language model is trained to predict their identity using only the context provided by the unmasked tokens. NSP trains the language model to predict if two sentences logically follow each other, which is essential for understanding the relationships between sentences.
BERT’s MLM task employs subword masking (SWM), which masks subwords (or tokens) at random. For example, in English, a word such as “Transformers” might be broken down into subword units such as “Transform” and “ers”. While SWM has demonstrated potential, it also raises challenges, such as determining the optimal granularity of the subword units to mask and predict. This decision is complex and may require careful consideration of the morphological characteristics of the language [23].
Whole-word masking (WWM) was introduced to address this issue by ensuring that all subword units comprising a word are masked. This method compels the model to rely on a broader context to predict the masked word rather than on the remaining subword units. The implementation of WWM leads to models that more effectively capture contextual semantic relationships. When a word is split into more than one unit, WWM is not only beneficial but essential for improved performance, as it requires the model to consider the context more thoroughly [24].
Research has suggested multiple alternative masking methods to enhance transformer models beyond SWM and WWM. An example of such a technique is PMI-Masking [25], which is based on the concept of Pointwise Mutual Information (PMI) and addresses the drawbacks of random uniform token masking by improving previous heuristic approaches such as WWM, entity/phrase masking, and random-span masking. This approach provides a systematic way to mask related spans in the input data, further enhancing the model’s ability to understand context and relationships within text.
Other models, such as SpanBERT [26] and DeBERTa, have made significant strides by adopting span masking as an additional strategy. This technique involves randomly masking contiguous spans of tokens rather than individual tokens [27], thereby extending the original SWM approach. The goal of span masking is to improve the model’s comprehension and prediction of text spans. By optimizing span-level objectives and masking contiguous spans of tokens, models such as SpanBERT have markedly improved tasks such as extractive question answering, coreference resolution, and relation extraction.
Many studies have been conducted to improve the efficiency of customizing PLMs for specific domain tasks. These investigations have analyzed various approaches, including the use of AdamW, AdamP, RAdam, and MADGRAD optimizers; layer reinitialization; layer-wise learning rate decay; stochastic weight averaging; and mixout regularization. Each of these techniques aims to address specific challenges in fine-tuning large language models (LLMs) for targeted applications.
AdamW, AdamP, RAdam, and MADGRAD are among the latest optimizers proposed after the widely used Adam optimizer [28]. These optimizers introduce key improvements that address specific limitations of Adam. AdamW [29], for instance, decouples weight decay from gradient updates, offering better regularization and generalization. AdamP [30] corrects the premature decay of effective step sizes in momentum-based gradient descent optimizers by removing the radial component, preserving the original convergence properties of gradient descent (GD) optimizers. RAdam [31] incorporates a rectification term to dynamically adjust learning rates based on variance and momentum, resulting in improved convergence, training stability, and accuracy during the initial training phase. MADGRAD [32], on the other hand, combines dual averaging with cube-root adaptive scaling and momentum, achieving state-of-the-art performance across both computer vision and language tasks by better handling early large gradients while maintaining strong convergence properties.
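For illustration, the snippet below builds any of these optimizers for a PyTorch training loop. It is a minimal sketch assuming the publicly released adamp and madgrad packages; the learning rate and weight decay values are illustrative rather than those used in our experiments.

```python
import torch
from torch.optim import AdamW, RAdam   # AdamW and RAdam ship with PyTorch
from adamp import AdamP                # pip install adamp (assumed available)
from madgrad import MADGRAD            # pip install madgrad (assumed available)

def build_optimizer(name: str, params, lr: float = 2e-5, weight_decay: float = 0.01):
    """Return one of the optimizers compared in the text; argument values are illustrative."""
    if name == "adamw":
        return AdamW(params, lr=lr, weight_decay=weight_decay)    # decoupled weight decay
    if name == "adamp":
        return AdamP(params, lr=lr, weight_decay=weight_decay)    # removes the radial component
    if name == "radam":
        return RAdam(params, lr=lr, weight_decay=weight_decay)    # rectified adaptive learning rate
    if name == "madgrad":
        return MADGRAD(params, lr=lr, weight_decay=weight_decay)  # dual averaging with momentum
    raise ValueError(f"Unknown optimizer: {name}")
```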
Researchers have found that transformer-based models can produce significantly different outcomes when different random seeds are used, with weight initialization being particularly sensitive to the choice of seed [33]. Various techniques have been introduced to address this instability, such as layer reinitialization and mixout regularization. These methods aim to improve the consistency and reliability of fine-tuned models across different initialization conditions.
The concept of layer reinitialization, a significant finding in computer vision research, suggests that pretrained lower layers typically acquire more general features. Conversely, higher layers, which are closer to the output, often focus on specific pretraining tasks [34]. Recent research on transformers has revealed that while utilizing the entire network may be the most effective approach in some cases, it could hinder the training process and negatively impact overall performance [35]. This insight has led to more nuanced approaches in fine-tuning, where different layers may be treated differently based on their role in the model.
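A minimal sketch of layer reinitialization for a Hugging Face-style encoder is given below; the attribute path model.base_model.encoder.layer and the use of the library’s _init_weights hook are assumptions that must be adapted to the specific architecture.

```python
def reinit_top_layers(model, num_layers: int = 1):
    """Reinitialize the top `num_layers` transformer blocks of a pretrained encoder.

    Sketch only: assumes a BERT/DeBERTa-like layout in which the encoder blocks are
    reachable as `model.base_model.encoder.layer` and the model exposes the standard
    Hugging Face `_init_weights` initializer.
    """
    encoder_layers = model.base_model.encoder.layer
    for layer in encoder_layers[-num_layers:]:
        layer.apply(model._init_weights)  # reset the block's weights to a fresh initialization
    return model
```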
Mixout [36], a regularization technique, is supported by strong empirical evidence as a complement to traditional methods such as Dropout [37] and DropConnect [38]. In mixout, parameters are replaced with their pretrained values during training with a probability denoted by p rather than being set to zero. Extensive assessments conducted by researchers have consistently shown that mixout enhances the accuracy and consistency of fine-tuning PLMs for subsequent tasks. In particular, mixout has proven advantageous in regulating the fine-tuning of BERT on the GLUE benchmark, demonstrating its effectiveness in real-world applications.
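To make the idea concrete, the following simplified sketch applies mixout to a linear layer, replacing each weight element with its pretrained value with probability p during training and rescaling so that the expected weight is unchanged; it is an illustration of the formulation in [36], not the authors’ reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixoutLinear(nn.Module):
    """Simplified mixout for a linear layer: during training, each weight element is
    swapped for its pretrained value with probability p, with rescaling so that the
    expected weight equals the current weight."""

    def __init__(self, pretrained_linear: nn.Linear, p: float = 0.3):
        super().__init__()
        self.p = p
        self.weight = nn.Parameter(pretrained_linear.weight.clone())
        self.bias = nn.Parameter(pretrained_linear.bias.clone())
        # Frozen copy of the pretrained weights used as the mixing target.
        self.register_buffer("target", pretrained_linear.weight.detach().clone())

    def forward(self, x):
        weight = self.weight
        if self.training and self.p > 0:
            mask = torch.bernoulli(torch.full_like(weight, self.p))
            # Mix current and pretrained weights, then rescale to keep the expectation.
            weight = ((1 - mask) * weight + mask * self.target - self.p * self.target) / (1 - self.p)
        return F.linear(x, weight, self.bias)
```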
Stochastic Weight Averaging (SWA) [39] is an optimization technique that enhances the generalization of neural networks by averaging the weights of models sampled at different points during training. This method encourages convergence to flatter minima, which are associated with better generalization performance [40]. In the context of natural language processing models, SWA has been shown to improve generalization without additional computational costs, outperforming traditional methods like knowledge distillation [41]. It also aids in better calibration and robustness against distribution shifts, particularly in large language models fine-tuned on small datasets, and, by exploring wider optima, helps in achieving more reliable and accurate predictions in NLP tasks [42,43].
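PyTorch provides SWA utilities matching this description. The sketch below starts averaging weights at 50% of training while switching to a constant, lower learning rate; the epoch count, SWA learning rate, and batch structure are illustrative assumptions.

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR

def finetune_with_swa(model, optimizer, train_loader, loss_fn, epochs=4, swa_lr=2e-6):
    """Sketch: collect the weight average during the second half of fine-tuning."""
    swa_model = AveragedModel(model)                  # keeps the running average of weights
    swa_scheduler = SWALR(optimizer, swa_lr=swa_lr)   # constant, lower learning rate for the SWA phase
    swa_start = epochs // 2

    for epoch in range(epochs):
        for batch in train_loader:                    # batch structure assumed: {"inputs": ..., "labels": ...}
            optimizer.zero_grad()
            loss = loss_fn(model(**batch["inputs"]), batch["labels"])
            loss.backward()
            optimizer.step()
        if epoch >= swa_start:
            swa_model.update_parameters(model)        # accumulate current weights into the average
            swa_scheduler.step()
    return swa_model
```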
An additional technique for optimizing PLMs is discriminative fine-tuning, which involves assigning different learning rates to various model layers during training. This is because different layers capture different types of information [34,44]. Discriminative fine-tuning, also known as layer-wise learning rate decay (LLRD), implements a strategy of using higher learning rates for the top layers and lower rates for the bottom layers when fine-tuning PLMs. This approach aims to modify the top layers related to the pretraining task while retaining the lower layers that contain more generalized information [35]. Recent research supports the effectiveness of LLRD in enhancing model performance across various tasks [45,46]. By implementing a higher learning rate for the top layer and progressively decreasing the learning rate for each subsequent layer toward the bottom, LLRD enables the model to balance task-specific fine-tuning in the higher layers while preserving the general knowledge encoded in the lower layers. This approach has been successfully employed in fine-tuning PLMs such as ELECTRA [45] and XLNet [47], demonstrating its effectiveness in adapting to new tasks while retaining valuable knowledge from the pretraining phase.
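In practice, LLRD is typically implemented by building per-layer parameter groups whose learning rates decay geometrically from the task head toward the embeddings. The sketch below assumes a BERT/DeBERTa-like layout (classifier, encoder.layer, embeddings); the base rate, head multiplier, and decay factor are illustrative. The resulting groups can be passed directly to an optimizer such as AdamW.

```python
def llrd_parameter_groups(model, base_lr=2e-5, decay=0.95, head_multiplier=1.5, weight_decay=0.01):
    """Sketch of layer-wise learning rate decay: the task head gets the highest rate,
    and each encoder layer below the top gets base_lr * decay**depth."""
    groups = []
    # Task-specific head: highest learning rate.
    groups.append({"params": model.classifier.parameters(),
                   "lr": base_lr * head_multiplier, "weight_decay": weight_decay})
    # Encoder blocks: progressively lower rates toward the input.
    layers = list(model.base_model.encoder.layer)
    for depth, layer in enumerate(reversed(layers)):   # depth 0 = topmost layer
        groups.append({"params": layer.parameters(),
                       "lr": base_lr * (decay ** depth), "weight_decay": weight_decay})
    # Embeddings: lowest learning rate of all.
    groups.append({"params": model.base_model.embeddings.parameters(),
                   "lr": base_lr * (decay ** len(layers)), "weight_decay": weight_decay})
    return groups
```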
To address the challenges of data scarcity and enhance model performance across various domains, researchers have developed several innovative approaches, including mixed-domain pretraining.
Mixed-domain pretraining has surfaced as a relevant technique in overcoming data limitations, particularly in low-resource languages or specialized domains. This approach involves training models on data from multiple domains, enabling them to learn more robust and generalizable features. By exposing models to heterogeneous inputs, mixed-domain pretraining enhances their ability to adapt to new data sources and perform effectively even with limited domain-specific data. This capability is especially valuable in real-world applications where certain domains may lack extensive datasets [48].
Building upon the concept of domain adaptation, recent scientific literature has highlighted the effectiveness of mixed-domain PLMs. Notable examples include SciBERT [49], which was developed for computer science and biomedical domains; BioBERTpt(all) [50], which was trained on Brazilian clinical text and biomedical text; and ABioNER [51], which was trained on a general Arabic corpus and biomedical Arabic text. By exploiting both domain-specific and general or related corpora during pretraining, these adapted models effectively combine broad language understanding with specialized knowledge, making them particularly valuable in fields such as scientific literature analysis and biomedical research.
While mixed-domain PLMs have shown promise in combining broad language understanding with specialized knowledge, another approach has gained traction: the development of highly focused, domain-specific language models. This strategy involves either adapting existing general-purpose models or training new models from scratch using large corpora of domain-specific text. The rationale behind this approach is that by immersing the model in a particular field’s language and concepts during pretraining, it can develop a deeper, more nuanced understanding of that domain. This specialization can lead to superior performance on domain-specific tasks, albeit potentially at the cost of reduced generalizability to other fields.
The financial sector has also benefited from the rise of domain-specific language models. One of the earliest examples is FinBERT-2019 [9], which was created by adapting the BERT model. Initialized from a PLM trained on general-domain data, FinBERT underwent continual pretraining on financial texts. This means that the model was first initialized from BERT and then further trained on financial data to specialize in financial language tasks. When sufficient domain-specific data became available, training from scratch became an alternative to continual pretraining, resulting, for example, in the creation of FinBERT-2020 [10], FinBERT-2021 [11], SEC-BERT [12], the set of FLANG (Financial LANGuage) models [13], FinSoSent [14], and BusinessBERT [52]. These models have shown significant improvements in tasks specific to financial text analysis, such as sentiment analysis of financial news and earnings call transcripts.
While BERT-like models have seen widespread use, GPT-based models have also emerged as powerful tools for financial text processing [53]. GPT models, which are pretrained on vast corpora, excel in tasks requiring generative capabilities. However, research shows that BERT and RoBERTa (Robustly Optimized BERT Pretraining Approach) often outperform GPT models in benchmarks that demand sentiment analysis, logical reasoning, and other comprehension tasks. For instance, studies have indicated that BERT consistently achieves greater accuracy in sentiment analysis within clinical conversations, effectively capturing emotional nuances better than GPT-2 does. RoBERTa has also demonstrated strong performance in identifying sentiments, especially neutral ones. These findings underscore the strengths of bidirectional models in tasks that require nuanced understanding of context and sentiment [54].
The field of NLP continues to progress rapidly, with innovations in pretraining techniques, fine-tuning strategies, and domain-specific applications driving significant improvements in model performance. As researchers continue to refine these approaches, we can expect further advancements in language understanding and generation across various domains and tasks. The ongoing development of increasingly sophisticated language models promises to reveal new potential in areas ranging from financial analysis to scientific research, paving the way for more intuitive and capable artificial intelligence (AI) systems in the future. The combination of general language understanding and domain-specific knowledge will likely lead to AI systems that can assist humans in increasingly complex and specialized tasks across various fields.
4. Materials and Methods
4.1. Problem Formulation
In this study, we address a set of text classification problems, including fake news detection, sentiment analysis, hate speech classification, and regulatory risk classification. Each of these tasks can be formally defined as a supervised learning problem, where a model learns to assign a label y to a given input text x.
4.1.1. Problem Definition
Let $\mathcal{X}$ be the space of input texts and $\mathcal{Y}$ the set of possible labels. The goal is to learn a classification function:
$$f: \mathcal{X} \rightarrow \mathcal{Y} \quad (1)$$
where $f$ is parameterized by a transformer-based model trained to minimize a loss function based on labeled training data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, with $N$ examples. Each classification task is defined as follows:
Hate Speech Classification: $\mathcal{Y} = \{\text{offensive}, \text{non-offensive}\}$;
Fake News Detection: $\mathcal{Y} = \{\text{fake}, \text{legitimate}\}$;
Sentiment Analysis: $\mathcal{Y} = \{\text{positive}, \text{negative}\}$;
Regulatory Risk Classification: $\mathcal{Y} = \{\text{relevant}, \text{irrelevant}\}$.
4.1.2. Input Representation
Each input text $x_i$ is transformed into a sequence of tokens:
$$x_i = (t_1, t_2, \ldots, t_T) \quad (2)$$
where $T$ is the number of tokens in the text, and each token $t_j$ is mapped to a dense vector representation using a pre-trained transformer model:
$$\mathbf{e}_j = \mathrm{Embed}(t_j), \quad \mathbf{e}_j \in \mathbb{R}^{d} \quad (3)$$
The full sequence representation is obtained via a transformer-based encoder, which outputs a contextualized vector representation for each token:
$$\mathbf{H} = \mathrm{Encoder}(\mathbf{e}_1, \ldots, \mathbf{e}_T), \quad \mathbf{H} \in \mathbb{R}^{T \times d} \quad (4)$$
where $d$ is the hidden dimension of the model. To obtain a single document representation, we apply a pooling function (e.g., the [CLS] token representation or mean pooling over all tokens):
$$\mathbf{h} = \mathrm{Pool}(\mathbf{H}), \quad \mathbf{h} \in \mathbb{R}^{d} \quad (5)$$
4.1.3. Classification Function
The final classification is performed using a fully connected layer followed by a softmax activation:
$$\hat{y} = \mathrm{softmax}(\mathbf{W}\mathbf{h} + \mathbf{b}) \quad (6)$$
where $\mathbf{W}$ and $\mathbf{b}$ are trainable parameters.
4.1.4. Optimization Objective
Training is conducted by minimizing the categorical cross-entropy loss:
$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c \in \mathcal{Y}} y_{i,c} \log \hat{y}_{i,c} \quad (7)$$
where $y_{i,c}$ is the one-hot encoded ground truth label, and $\hat{y}_{i,c}$ is the predicted probability for class $c$. To improve generalization and stability, we incorporate regularization techniques, including mixout regularization, stochastic weight averaging, and layer-wise learning rate decay, as described in Section 4.2.6.
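The formulation in Equations (2)–(7) corresponds to the following minimal sketch of an encoder with a [CLS]-style pooled classification head; the encoder checkpoint and pooling choice are illustrative assumptions rather than the exact implementation used in our experiments.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class FinancialTextClassifier(nn.Module):
    """Minimal sketch of the classification setup in Equations (2)-(7)."""

    def __init__(self, encoder_name: str, num_labels: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)   # contextual encoder (Eq. 4)
        hidden = self.encoder.config.hidden_size                 # d, the hidden dimension
        self.classifier = nn.Linear(hidden, num_labels)          # W and b (Eq. 6)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h = outputs.last_hidden_state[:, 0, :]                   # [CLS]-style pooling (Eq. 5)
        return self.classifier(h)                                # logits; softmax is applied in the loss

# Categorical cross-entropy over logits (Eq. 7):
# loss = nn.CrossEntropyLoss()(logits, labels)
```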
4.2. Research Workflow Overview
This study presents a thorough approach to evaluate various language models for specific natural language processing tasks. The research workflow, as illustrated in Figure 1, involves six main tasks: model selection, dataset definition, baseline modeling, corpus definition, model fine-tuning, and evaluation.
In the model selection phase, we introduce two versions of our proposed model, DeB3RTa (B3 refers to “Brasil, Bolsa, Balcão”, Brazil’s leading stock exchange and the 20th largest in the world by total market capitalization), together with the configurations chosen to balance performance and computational cost.
During the dataset definition phase, four datasets are incorporated: OFFCOMBR-3, FAKE.BR, CAROSIA, and BBRC (Brazilian Banking Regulation Corpora). These datasets serve as the foundation for training and testing DeB3RTa and the baseline models.
The baseline modeling stage includes four general purpose transformer-based models (Multilingual BERT, BERTimbau, XLM-RoBERTa, and DistilBERT), two financial domain-specific models (SEC-BERT and BusinessBERT), as well as a separate group of five GPT models (GPT-3.5-turbo, GPT-4o-mini, GPT-4o, GPT-4-turbo, and GPT-4).
Additionally, in the corpus definition phase, the corpus that is used in the pretraining of DeB3RTa, which incorporates data from various external sources such as Relevant Facts, Patents, Scielo, Wikipedia, and News, is introduced.
Moreover, in the model fine-tuning phase, downstream tasks were performed on DeB3RTa and the baseline models; in these tasks, techniques were applied to improve the performance of DeB3RTa compared with the baseline models.
Finally, the evaluation phase assesses the performance of all these models via the F1 score, recall, precision and PR-AUC (Area Under Precision–Recall Curve), providing a standardized measure of effectiveness across the different approaches.
4.2.1. Phase 1: Model Selection and Configuration
The model chosen for this study was DeBERTa-v2 [55], for which we defined custom configurations. In its default configuration, DeBERTa-v2 has 24 attention heads, 24 layers, an intermediate layer size of 6144, and a hidden size of 1536, with approximately 887 million trainable parameters. We created two versions of our model: a base version with 12 attention heads, 12 layers, an intermediate layer size of 3072, and a hidden size of 768, totaling nearly 426 million trainable parameters, and a smaller version with 6 attention heads, 12 layers, an intermediate layer size of 1536, and a hidden size of 384, totaling nearly 70 million trainable parameters. These hyperparameter values were chosen through preliminary evaluations aimed at balancing representation capacity, generalization, and computational cost.
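For reference, the two configurations can be expressed with the Hugging Face DebertaV2Config as sketched below; only the dimensions reported above and the vocabulary size from Section 4.2.5 are set, and all remaining arguments fall back to library defaults, which may differ from the exact pretraining setup.

```python
from transformers import DebertaV2Config, DebertaV2ForMaskedLM

# Base variant: 12 layers, 12 heads, hidden size 768, intermediate size 3072.
deb3rta_base_config = DebertaV2Config(
    vocab_size=128100,
    num_hidden_layers=12,
    num_attention_heads=12,
    hidden_size=768,
    intermediate_size=3072,
)

# Small variant: 12 layers, 6 heads, hidden size 384, intermediate size 1536.
deb3rta_small_config = DebertaV2Config(
    vocab_size=128100,
    num_hidden_layers=12,
    num_attention_heads=6,
    hidden_size=384,
    intermediate_size=1536,
)

model = DebertaV2ForMaskedLM(deb3rta_base_config)  # randomly initialized model for pretraining
```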
We chose DeBERTa on the basis of its demonstrated superiority over other encoders, with documented improvements of +0.9% on MNLI, +2.3% on SQuAD v2.0, and +3.6% on RACE compared with RoBERTa-large. Furthermore, it has surpassed human-level performance on the SuperGLUE benchmark. These achievements can be attributed to the incorporation of two innovative features: disentangled attention, which allows the model to focus independently on different aspects of the input sequence, and an enhanced mask decoder.
DeBERTa distinguishes itself from BERT by encoding each word in the input layer differently. In contrast to BERT’s method of using a single vector that combines word and position embeddings, DeBERTa uses two distinct vectors to capture the content and position of each word. The attention weights between words are computed using separate matrices that consider content and relative positions. This innovative approach allows the model to focus on the relationship between text content and position.
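Concretely, the disentangled attention score between tokens $i$ and $j$ decomposes into content-to-content, content-to-position, and position-to-content terms (notation follows the original DeBERTa formulation):
$$A_{i,j} = Q_i^{c}\,{K_j^{c}}^{\top} + Q_i^{c}\,{K_{\delta(i,j)}^{r}}^{\top} + K_j^{c}\,{Q_{\delta(j,i)}^{r}}^{\top}$$
where $Q^{c}$ and $K^{c}$ are projections of the content vectors, $Q^{r}$ and $K^{r}$ are projections of the relative position embeddings, and $\delta(i,j)$ denotes the relative distance between positions $i$ and $j$.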
DeBERTa uses a distinctive technique known as the enhanced masked decoder (EMD), which distinguishes it from BERT’s use of absolute positional encodings at the input layer. Instead of incorporating absolute positions at the input layer, DeBERTa integrates absolute positions after each transformer layer, just before the Softmax layer, which predicts masked tokens. This strategy allows the model to capture relative positions within all transformer layers while using absolute positions as supplementary information during decoding. By adopting this approach, DeBERTa can introduce valuable additional information during pretraining, enhancing the model’s utilization of positional data [56].
4.2.2. Phase 2: Corpus Definition for DeB3RTa Pretraining
The pretraining corpus integrates data extracted through web scraping from various sources, including news articles, patents, and financial reports, ensuring the dataset’s relevance to the financial domain. Specifically, the corpus consists of the following:
Relevant Facts: The data pertain to the companies featured in the May–August 2023 portfolio of IbrX 100, which indicates the performance of the 100 most significant assets in the Brazilian stock market. This information was sourced from the database of the Comissão de Valores Mobiliários (Brazilian Securities Commission) and covers the period from 2003 to 2023;
Google Patents: The dataset comprises G06Q, G07C, G07F, and G07G of the International Patent Classification System (IPC), covering 2006–2021;
Scielo: The dataset comprises research articles in Portuguese on Brazilian Scielo about finance, politics, economics, business management, and accountancy, spanning from 1961 to 2023;
Wikipedia: On 19 May 2023, we used the Wikipedia API Python library to crawl Portuguese-language articles. We began by exploring the “Economia” (“Economy” in Portuguese) category and included up to five subcategories. We discarded links to identical articles to ensure data quality and avoid repeating content;
News: The dataset contains articles from specialized and mainstream electronic newspapers in Brazil, Portugal, and Angola. These articles cover finance, economics, politics, and related topics and were selected between 1999 and 2023.
We cleaned each financial resource to create the definitive corpus. This involved segregating the data into individual sentences, removing noisy or poorly constructed sentences, and applying the MinHash algorithm [57], as described in [58], to remove duplicate entries. After completing this step, we merged the entire corpus and carried out an additional deduplication process to remove redundant content, resulting in a total of 1.05 billion tokens. Detailed statistics for each corpus can be found in Table 1.
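The deduplication step can be illustrated with the datasketch library, which implements MinHash and MinHash-LSH; the shingling scheme, number of permutations, and similarity threshold below are illustrative assumptions rather than the exact settings used to build the corpus.

```python
from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from word 3-grams of a sentence."""
    m = MinHash(num_perm=num_perm)
    words = text.lower().split()
    for i in range(max(1, len(words) - 2)):
        m.update(" ".join(words[i:i + 3]).encode("utf-8"))
    return m

def deduplicate(sentences, threshold: float = 0.8, num_perm: int = 128):
    """Keep only sentences whose signature has no near-duplicate among those already kept."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for idx, sentence in enumerate(sentences):
        sig = minhash_signature(sentence, num_perm)
        if not lsh.query(sig):            # no near-duplicate seen so far
            lsh.insert(f"s{idx}", sig)
            kept.append(sentence)
    return kept
```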
4.2.3. Phase 3: Baseline Definition
In the initial phase of our study, we conducted a systematic evaluation to identify the most effective models for our tasks and optimize their performance. We selected four baseline general purpose models: Multilingual BERT [1], BERTimbau, XLM-RoBERTa, and DistilBERT [59]. These models were chosen based on their proven effectiveness in various NLP tasks as reported in the literature and their capability to deal with the Portuguese language.
Multilingual BERT (mBERT) is recognized for its versatility across multiple languages, making it suitable for varied linguistic contexts. BERTimbau was specifically designed for Brazilian Portuguese, ensuring that it captures the unique linguistic characteristics of this language. XLM-RoBERTa offers robust cross-lingual capabilities, which are essential for tasks involving multilingual datasets, whereas DistilBERT provides a more efficient alternative with reduced model size and faster inference times.
In addition to these baseline models, we selected two domain-specific models for our experiments: SEC-BERT and BusinessBERT, both of which were trained on English corpora. We employed these models through cross-lingual transfer, enabling us to leverage the learning capabilities of models developed in one language, where abundant resources exist, to address tasks in another language.
Finally, state-of-the-art LLMs were incorporated for the experiment. Specifically, GPT-3.5-turbo, GPT-4o-mini, GPT-4o, GPT-4-turbo, and GPT-4 were utilized for the tasks. These models were selected because of their advanced capabilities and demonstrated success across various NLP applications [60].
4.2.4. Phase 4: Dataset Definition
Our research included four text classification tasks to assess the effectiveness of our model: hate speech detection, fake news detection, sentiment analysis, and document analysis.
Economic conditions, such as unemployment and income parity, have been shown to influence the incidence of hate crimes. For instance, higher unemployment rates are associated with increased violent hate crimes, suggesting that economic stress may exacerbate social tensions and lead to bias-motivated offenses [61,62]. Additionally, the economic framework of hate crimes suggests that individuals may weigh the intrinsic benefits of committing such crimes against the potential costs, including social esteem and legal repercussions, which can be influenced by the prevalence of hate speech in society. Hate speech can act as a signal of social attitudes, potentially normalizing bias and encouraging hate crimes when individuals perceive a supportive environment for their prejudices [63]. Furthermore, online hate speech has been linked to offline hate crimes, indicating that digital expressions of bias can translate into real-world violence, further complicating the economic and social landscape [64].
When information is disseminated, careful consideration of the impact of fake news is essential. Identifying and eradicating misinformation is crucial, mainly because of its potential to influence individuals and entire economies. False rumors and misleading news can significantly impact stock prices and the willingness to engage in large-scale investments. Thus, addressing this issue with the utmost care is critical [65].
Sentiment analysis is vital in finance, as it can forecast trends, identify potential crises, and guide investment decisions. Given the vast amount of data in the financial sector, sentiment analysis has become crucial for analysts and investors. By scrutinizing large datasets and recognizing market sentiment patterns, sentiment analysis can offer valuable insights into financial market behavior and assist investors in making well-informed decisions [66].
Document analysis in the financial and banking industry is pivotal for various reasons, primarily concerning efficiency, accuracy, compliance, and decision-making. Effective classification is crucial for managing this information because the financial sector generates vast amounts of data and documentation [67].
We carried out trials in which we implemented fine-tuning on four datasets:
OFFCOMBR-3 [68]: OFFCOMBR-2 was originally compiled with 1250 comments labeled by three annotators, with varying levels of agreement. Building on this, the authors created OFFCOMBR-3, a more refined dataset of 1033 comments, including only those for which all annotators reached unanimous agreement. Of these, 202 comments (19.5%) were labeled as offensive, while the remaining 831 were considered non-offensive, making the dataset unbalanced.
FAKE.BR [69]: The authors curated a dataset of 7200 news articles manually labeled as legitimate or fake. The dataset included an equal number of 3600 fake and 3600 legitimate news articles, with each fake article paired with a legitimate article of similar length. The majority of the articles were published between January 2016 and January 2018. Each fake article underwent manual verification to ensure that it contained only false information, avoiding the inclusion of half-truths. The articles were then categorized into six topics: economy; science and technology; society and daily news; politics; religion; and TV and celebrities. However, only the articles categorized under economy and politics were utilized for downstream tasks, as these topics are closely tied to the financial domain, resulting in 4224 news articles.
CAROSIA [70]: The authors compiled news updates about the Brazilian financial market, including 717 positive and 553 negative reports from trusted sources such as G1, Estadão, and Folha de São Paulo. These reports cover the Brazilian stock market index, Ibovespa, and the performance of prominent companies listed on the Brazilian stock market, such as Banco do Brasil, Itaú, Gerdau, and Ambev.
BBRC [71]: The BBRC dataset consists of 25 corpora containing banking regulatory risk data from various divisions of Banco do Brasil (Bank of Brazil). These corpora cover a wide range of topics, including investments, insurance, human resources, security, technology, treasury, loans, accounting, fraud, credit cards, payment methods, agribusiness, and risk management. The dataset includes 61,650 annotated documents, the majority of which range from half a page to three pages in length. The original paper details two experiments on document analysis, and in our trials, we followed the methodology of the second experiment, with the only modification being the elimination of duplicate entries. After this step, our dataset consisted of 337 documents classified as relevant and 295 classified as irrelevant.
In this study, no class balancing techniques, such as oversampling or undersampling, were applied to any of the datasets. Each dataset was used to reflect the natural class distributions. For the fine-tuning phase, each dataset was split into training, validation, and testing subsets following an 80/10/10 ratio, with stratified sampling to preserve the original class distributions across all subsets. Detailed statistics on each split of each dataset can be found in Table 2, Table 3, Table 4 and Table 5; for each split, these statistics capture the median number of words across all texts, as well as the minimum and maximum word counts found in any single text, and the distribution of labels. These metrics are crucial for understanding the challenges posed by each dataset, particularly in evaluating both the textual diversity and class balance.
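The 80/10/10 stratified splits can be reproduced with scikit-learn by splitting twice, as in the sketch below; the random seed is illustrative.

```python
from sklearn.model_selection import train_test_split

def stratified_80_10_10(texts, labels, seed=42):
    """Split into train/validation/test (80/10/10) while preserving class proportions."""
    x_train, x_tmp, y_train, y_tmp = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=seed)
    x_val, x_test, y_val, y_test = train_test_split(
        x_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```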
4.2.5. Phase 5: Model Pretraining
In both model versions, we utilized the DeBERTa-v2 xlarge tokenizer, which is based on SentencePiece [72] and employs subword units [73] and unigram language modeling [74]. Token sequences were truncated to 128 tokens with dynamic padding, and the vocabulary size was set to 128,100. Training took place at the Human Language Technology Lab of the Instituto de Engenharia de Sistemas e Computadores—Investigação e Desenvolvimento (INESC-ID) in Lisbon, Portugal, on an NVIDIA A100 (lasting 103 h for the base model and 83 h for the smaller model), and we employed the standard BERT masking procedure with a 15% masking probability for each example. The training process used the AdamW optimizer with a learning rate of 1 × 10−4 and linear decay. We dedicated the first 1% of the 80,650 training steps to warm-up, and the effective batch size was 1536 sentences, obtained from batches of 192 samples with gradient accumulation. The models were trained for 50 epochs, and these hyperparameters were chosen on the basis of extensive exploratory testing and prior work on pretraining under limited resources [75]. Figure 2 illustrates the base model’s training loss and convergence.
To achieve the significant benefits of reduced memory usage and faster computation, which are particularly advantageous for PLMs with high computational requirements and memory footprints, we conducted our model training via FP16 (also known as the half-precision floating-point format) [76]. This method uses 16 bits to represent a floating-point number, offering a smaller range of representable values than the default format FP32 does. The decrease in precision enables quicker computation and a smaller memory footprint, making it suitable for training and inference tasks, especially when specialized hardware optimized for lower precision is used. However, the reduced precision inherent in FP16 could impact numerical accuracy, particularly in complex training scenarios.
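The pretraining setup described above maps naturally onto the Hugging Face Trainer. The sketch below reflects the reported hyperparameters (15% masking, AdamW at 1 × 10−4 with linear decay, 1% warm-up, an effective batch of 1536 sentences, 50 epochs, FP16); the per-device batch size/gradient accumulation split and the dataset object are assumptions.

```python
from transformers import (DataCollatorForLanguageModeling, DebertaV2Config,
                          DebertaV2ForMaskedLM, DebertaV2TokenizerFast,
                          Trainer, TrainingArguments)

tokenizer = DebertaV2TokenizerFast.from_pretrained("microsoft/deberta-v2-xlarge")  # SentencePiece tokenizer
model = DebertaV2ForMaskedLM(DebertaV2Config(
    vocab_size=128100, num_hidden_layers=12, num_attention_heads=12,
    hidden_size=768, intermediate_size=3072))  # base variant, randomly initialized

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="deb3rta-base-pretraining",
    per_device_train_batch_size=192,     # assumed split: 192 samples x 8 accumulation steps = 1536 effective
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    warmup_ratio=0.01,                   # first 1% of steps used for warm-up
    num_train_epochs=50,
    fp16=True,                           # half-precision training
)

# `tokenized_corpus` is assumed to be the pretraining corpus tokenized into 128-token sequences.
trainer = Trainer(model=model, args=training_args,
                  data_collator=collator, train_dataset=tokenized_corpus)
# trainer.train()
```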
4.2.6. Phase 6: Model Fine-Tuning
In our optimization efforts for DeB3RTa on downstream tasks, we applied various techniques, including the AdamW, AdamP, RAdam, and MADGRAD optimizers, reinitialization of groups of layers, stochastic weight averaging, mixout regularization, and layer-wise learning rate decay. Our preliminary experiments revealed that the best performance was achieved when using the cosine scheduler with warmup in the configurations based on AdamP, RAdam, MADGRAD, and LLRD, rather than the linear schedule with warmup used in the Huggingface library’s default configuration.
Extensive hyperparameter searches were conducted to establish the optimal configuration for each experiment mentioned. This process involved using grid search to systematically explore a range of hyperparameter settings, such as learning rates and decay factors, to identify the best performance on the datasets. The model’s performance was evaluated using multiple metrics: macro F1 score, precision, recall, and PR-AUC. During hyperparameter optimization, the macro F1 score was chosen as the sole criterion for selecting the optimal configuration. This metric provides a balanced assessment by equally weighting each class, mitigating the risk of bias toward the majority class in imbalanced datasets.
A grid search procedure was conducted using training and validation splits. The training split was used to fit the models, while the validation split assessed different hyperparameter configurations based on their macro F1 scores. Once the optimal hyperparameters were identified, the model was evaluated on the test split. This final evaluation incorporated additional metrics—precision, recall, and PR-AUC—to provide a holistic understanding of the model’s performance across all classes.
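The selection procedure can be summarized by the sketch below, which loops over a hyperparameter grid, fine-tunes on the training split, and keeps the configuration with the highest macro F1 on the validation split; train_and_predict is a hypothetical helper standing in for the fine-tuning routine, and the dictionary-style data access is an assumption.

```python
from itertools import product
from sklearn.metrics import f1_score

def grid_search(train_data, val_data, learning_rates, batch_sizes, train_and_predict):
    """Pick the configuration that maximizes the macro F1 score on the validation split."""
    best_config, best_f1 = None, -1.0
    for lr, bs in product(learning_rates, batch_sizes):
        val_pred = train_and_predict(train_data, val_data, lr=lr, batch_size=bs)
        macro_f1 = f1_score(val_data["labels"], val_pred, average="macro")
        if macro_f1 > best_f1:
            best_config, best_f1 = {"lr": lr, "batch_size": bs}, macro_f1
    return best_config, best_f1
```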
We fine-tuned the models with the following hyperparameters: a maximum token length of 128 with padding and truncation, four epochs (a number that, according to [1], works well across tasks), and a warm-up period covering 10% of the training steps, since transformer models often struggle to stabilize learning without a gradual learning rate warm-up at the beginning of training [77]. Owing to variations in dataset sizes, the FAKE.BR tasks were run with training batch sizes of {32, 64}, whereas the CAROSIA and BBRC tasks used training batch sizes of {16, 32}. Furthermore, all DeB3RTa configurations followed the hyperparameters specified in Table 6, Table 7, Table 8 and Table 9, which list the complete set of hyperparameters for each configuration.
For the layer reinitialization configuration, layer 10, 11, or 12 was reinitialized. For the SWA configuration, in addition to the standard learning rates, we tested a set of SWA learning rates applied from a given point in training: 1 × 10−6, 2 × 10−6, 3 × 10−6, 4 × 10−6, and 5 × 10−6. Following the method proposed by [41], the SWA phase starts at 50% of the fine-tuning steps, after which a lower, constant learning rate is used. The mixout probability p was tested at five discrete values: 0.1, 0.3, 0.5, 0.7, and 0.9. In LLRD, different learning rates are applied to different layers of the network: the base learning rate is first determined and then typically multiplied by a factor greater than 1 to set the learning rate for the topmost layer (often the task-specific layer), and for each preceding layer, the learning rate is reduced by a decay factor, resulting in progressively lower learning rates for earlier layers (closer to the input). Additionally, a set of weight decay values was applied to prevent overfitting by adding a regularization term to the loss function, which penalizes large weights and encourages the model to maintain smaller, more generalizable weights. Figure 3 shows an example of the learning rate values during the training steps of a layer-wise learning rate decay configuration.
The baseline models were fine-tuned with their default configurations, while the GPT models were employed in a zero-shot setting without explicit training examples, with the model temperature set to zero to produce more deterministic outputs. The models were accessed via OpenAI API calls, in which prompts were supplied to execute the classification tasks. This zero-shot approach allowed us to assess the models’ capabilities in scenarios where labeled data are not available. The prompts used are specified in Table 10.
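The zero-shot calls can be sketched with the OpenAI Python client as follows; the prompt text shown is a hypothetical placeholder (the actual prompts are those in Table 10), and the helper name is our own.

```python
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def classify_zero_shot(text: str, model: str = "gpt-4o") -> str:
    """Zero-shot classification via a chat completion with temperature 0 for determinism."""
    # Hypothetical prompt; the actual task prompts are those specified in Table 10.
    prompt = ("Classify the following Brazilian financial news item as 'positive' or "
              f"'negative'. Answer with a single word.\n\nText: {text}")
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```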
Following the fine-tuning process described above, the results presented in the following section reflect the impact of various hyperparameter configurations, including optimizer selection and learning rate adjustments, on the performance of DeB3RTa across the datasets.
5. Results
This section presents a comprehensive evaluation of DeB3RTa’s performance across the text classification tasks in comparison with baseline models, including multilingual models (mBERT, XLM-RoBERTa, DistilBERT), Portuguese-specific models (BERTimbau), domain-specific models (SEC-BERT, BusinessBERT), and GPT-based models. The results, detailed in Table 11 and Table 12, underscore DeB3RTa’s robustness and superiority in financial NLP applications.
Our comprehensive empirical evaluation compared DeB3RTa against a diverse set of baselines: Portuguese-specific models (BERTimbau), multilingual transformers (mBERT, XLM-RoBERTa, DistilBERT), financial domain-specific models (SEC-BERT, BusinessBERT), and state-of-the-art GPT variants. The experiments spanned four distinct tasks, each targeting a different aspect of financial text processing.
The OFFCOMBR-3 dataset presented unique challenges due to its significant class imbalance (19.5% hate speech instances). While gpt-3.5-turbo achieved the highest F1 score at 0.8157, DeB3RTa’s performance merits closer examination through the lens of PR-AUC, which better reflects performance on imbalanced datasets. The DeB3RTa base model with AdamP configuration achieved a PR-AUC of 0.8081, significantly outperforming other transformer-based models and demonstrating robust threshold-independent performance. This is crucial for real-world applications where optimal classification thresholds may vary.
In the FAKE.BR financial news classification task, while XLM-RoBERTa large achieved the highest F1 score of 0.9953, DeB3RTa’s performance proved remarkably competitive. The DeB3RTa base model with MADGRAD optimization achieved an F1 score of 0.9906, falling short by only 0.47%. This performance is particularly notable given DeB3RTa’s significantly smaller parameter count. The model also demonstrated exceptional stability across different configurations, with multiple variants (MADGRAD, LLRD, Layer 10/11/12) consistently achieving F1 scores above 0.98.
On the CAROSIA dataset for financial sentiment analysis, BERTimbau base achieved the highest F1 score of 0.9363. DeB3RTa demonstrated strong performance, with its base (MADGRAD) configuration reaching 0.9207 while maintaining notably balanced precision (0.9193) and recall (0.9239) scores. This equilibrium between precision and recall is particularly valuable for financial sentiment analysis where both false positives and false negatives can have significant implications. The small variant of DeB3RTa achieved an F1 score of 0.8722, outperforming several larger models such as XLM-RoBERTa base (0.8326).
For the BBRC regulatory document classification task, DeB3RTa base with AdamP optimizer achieved the highest overall F1 score of 0.7609, surpassing all baseline models including specialized financial models like SEC-BERT (0.7478) and BusinessBERT (0.7143). This result is particularly significant, as it represents a substantial improvement of 8.12% over the next best model (BERTimbau large at 0.6797). Even more remarkable is that the DeB3RTa small model, with its reduced parameter count, achieved an F1 score of 0.6712, outperforming larger models such as XLM-RoBERTa large (0.6246).
Analyzing the PR-AUC scores across all datasets reveals DeB3RTa’s consistent ability to maintain strong performance across different operating points. This is particularly evident in the BBRC dataset, where DeB3RTa base with SWA configuration achieved a PR-AUC of 0.8402, and in FAKE.BR, where multiple configurations maintained PR-AUC scores above 0.99. These results indicate robust performance regardless of the chosen classification threshold.
The experimental results also highlight DeB3RTa’s consistent performance across varying text lengths and domain complexities. From short financial news snippets in CAROSIA to lengthy regulatory documents in BBRC, the model maintained competitive performance while using fewer parameters than some of its competitors. This efficiency–performance trade-off is particularly notable in the small variant, which consistently outperformed larger baseline models across multiple tasks.
In summary, our evaluation reveals five key strengths of DeB3RTa: (1) robust performance on imbalanced datasets, evidenced by superior PR-AUC scores; (2) competitive performance with significantly fewer parameters than current state-of-the-art models; (3) consistent performance across varying text lengths and domains; (4) state-of-the-art performance in regulatory document classification; and (5) balanced precision–recall trade-offs across all tasks. These findings demonstrate DeB3RTa’s effectiveness as a resource-efficient, versatile model for financial text processing tasks in Portuguese.
6. Discussion
The experimental results demonstrate DeB3RTa’s effectiveness across diverse financial domain tasks, with particularly strong performance in the regulatory document classification task (BBRC) and competitive results in the fake news detection task (FAKE.BR), sentiment analysis task (CAROSIA), and hate speech detection task (OFFCOMBR-3). While not consistently achieving the highest F1 scores, DeB3RTa shows remarkable efficiency and stability across different task characteristics and data distributions.
Our analysis of optimizer performance reveals important insights into model fine-tuning. Among the tested optimizers (AdamW, AdamP, RAdam, and MADGRAD), MADGRAD emerged as the most consistent performer, particularly excelling in the FAKE.BR (F1 = 0.9906) and CAROSIA (F1 = 0.9207) tasks. However, AdamP proved superior for BBRC (F1 = 0.7881) and showed strengths in OFFCOMBR-3 (PR-AUC = 0.8081). This variability underscores the importance of task-specific optimizer selection rather than adopting a one-size-fits-all approach.
Advanced fine-tuning techniques demonstrated varying degrees of effectiveness across the datasets. While these methods aim to prevent overfitting, their impact was notably dataset-dependent. In the BBRC dataset, simpler configurations often outperformed more complex ones, with the base AdamW configuration achieving better results than versions using layer reinitialization, mixout, or LLRD. This suggests that excessive regularization may have prevented the model from capturing important patterns in complex regulatory texts.
The performance of GPT variants, particularly on the FAKE.BR and BBRC datasets, revealed interesting limitations. Despite their sophisticated architecture and larger parameter counts, these models showed notably lower performance compared to transformer-based approaches. On FAKE.BR, gpt-3.5-turbo achieved an F1 score of only 0.6407, while on BBRC, it reached just 0.5205, substantially below DeB3RTa’s performance. This disparity was particularly evident in tasks requiring specialized financial domain knowledge.
The GPT models’ underperformance in these classification tasks likely stems from their generative architecture and zero-shot testing approach. Unlike fine-tuned transformer models, these models rely on prompt engineering and lack task-specific optimization. This is evident in a study where GPT-3 was used to classify questions related to data science. Researchers have reported that augmenting the training set with additional examples generated by GPT-3 itself significantly improved classification accuracy, but it still fell short of human accuracy [5,78].
Compared with previous studies, which show that models such as FinBERT and SEC-BERT achieve significant improvements in financial tasks, the results here underscore the importance of task-specific fine-tuning. For instance, GPT models such as GPT-3.5 and GPT-4 outperform fine-tuned non-generative models in few-shot text classification, i.e., tasks where the goal is to classify text into different categories using only a small number of labeled examples. However, their performance is significantly better when the models are provided with representative samples selected by human experts [79].
Our research underscores the importance of the upper layers in expediting learning and improving model performance. In fact, the grid search results across the FAKE.BR, CAROSIA, and BBRC datasets reveal important insights into the relationship between DeB3RTa’s positional embedding strategy and its fine-tuning behavior, as shown in Table 13. As DeBERTa (and by extension, DeB3RTa) incorporates relative positional embeddings from layers 1 to 10 and absolute positional embeddings in layers 11 and 12 [80], the optimal reinitialization of different layers across datasets suggests task-specific needs for handling positional information. This differs from BERT’s architecture, which applies absolute positional embeddings across all 12 layers. DeB3RTa’s hybrid approach appears to offer an advantage in capturing nuanced relationships at different levels of text representation, allowing the model to adapt effectively across tasks.
For the OFFCOMBR-3 dataset (hate speech detection task), reinitializing layer 10 achieved most of the highest F1 scores. This suggests that the hate speech detection task benefited from preserving the originally trained weights in the early layers (1–9) that handle relative positional embeddings, while reinitializing the final relative positional embedding layer (10) allows it to adapt to the specific task. This makes sense because hate speech often relies on contextual relationships between words and phrases that might be better captured by relative positional embeddings.
For FAKE.BR (fake news detection task), reinitializing layers 10, 11, or 12 resulted in nearly identical performance (F1 = 0.9905), indicating that both relative and absolute positional embeddings play equally important roles in detecting misinformation. Fake news often follows deceptive stylistic patterns and structural inconsistencies that require both local contextual relationships and global document structure to be effectively identified. The uniform performance across layers suggests that fake news detection benefits from a hybrid approach that balances these two forms of positional awareness.
For CAROSIA (sentiment analysis task), most of the optimal performance was observed when reinitializing layer 11 or 12. This indicates that the task benefits from the absolute positional embeddings introduced in the final two layers of DeB3RTa. In Portuguese financial news, sentiment can be heavily influenced by the position of key terms or phrases within a sentence. For example, the word “queda” (decline) can shift from a negative to a positive meaning depending on its position and the surrounding phrases (e.g., “queda nos juros”, “a decline in interest rates”). The position of negation or intensification words (e.g., “não” (not) or “muito” (very)) is also critical, as it can significantly alter sentiment, as can comparative and superlative forms (e.g., “maior lucro” (higher profit) or “menor lucro” (lower profit)). Absolute positional embeddings enable the model to capture these structural nuances, understanding not only the relationships between words but also their specific positions in the sentence. By reinitializing layer 11 or 12, the model leverages this positional information to make more accurate sentiment predictions based on how sentiment-laden phrases are positioned within the text.
For BBRC (regulatory document classification task), reinitializing layer 10 or 11 yielded the highest F1 score (0.7597), with layer 12 closely following (F1 = 0.7580). This pattern suggests that regulatory document classification requires a balanced integration of both relative and absolute positional embeddings. Regulatory texts follow structured formats with legal clauses, cross-references, and hierarchical organization, where local relationships between terms and their positioning within the broader document are equally important. The nearly equal effectiveness of layers 10, 11, and 12 highlights the need for a model that can adapt to both section-level dependencies and document-wide structure when classifying regulatory risks.
Furthermore, the grid search results on Mixout regularization underscore its impact, as shown in Table 14. Mixout’s ability to replace parameters with their pre-trained values appears to support generalization, particularly in smaller datasets such as CAROSIA, by mitigating overfitting. However, contrary to previous findings [36], the best results were often achieved with lower Mixout probabilities (e.g., 0.1–0.3) rather than the suggested 0.7–0.9. This suggests that while Mixout is beneficial, its optimal configuration may depend on dataset characteristics, highlighting the need for fine-tuning based on empirical evaluation. The observed F1 scores reinforce Mixout’s role as an effective regularization method, albeit with a different probability range than initially expected.
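For readers unfamiliar with the mechanism, the following is a minimal PyTorch sketch of a mixout-style linear layer in the spirit of [36] (our own illustration, not the implementation used in these experiments): during training, each fine-tuned weight is swapped back to its pretrained value with probability p, and the result is rescaled as in dropout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixoutLinear(nn.Module):
    """Linear layer whose weights are stochastically mixed with their
    pretrained values during training (mixout regularization)."""

    def __init__(self, pretrained: nn.Linear, p: float = 0.1):
        super().__init__()
        self.p = p
        self.weight = nn.Parameter(pretrained.weight.detach().clone())
        self.bias = nn.Parameter(pretrained.bias.detach().clone()) if pretrained.bias is not None else None
        # frozen copy of the pretrained weights used as the mixing target
        self.register_buffer("weight_pre", pretrained.weight.detach().clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and self.p > 0:
            mask = torch.rand_like(self.weight) < self.p            # which weights revert to pretrained
            mixed = torch.where(mask, self.weight_pre, self.weight)
            weight = (mixed - self.p * self.weight_pre) / (1 - self.p)  # dropout-style rescaling
        else:
            weight = self.weight
        return F.linear(x, weight, self.bias)
```

In practice, such a layer would replace the dropout-plus-linear blocks of the fine-tuned classifier; the grid in Table 14 varies p between 0.1 and 0.9.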
Looking at Table 15, we can observe several key patterns in the LLRD (Layer-wise Learning Rate Decay) configurations across the datasets.
OFFCOMBR-3 achieved its highest F1 score (0.8577) with a 2 × 10−4 learning rate, 0.90 decay rate, and 1 × 10−1 weight decay. The model showed consistency across different decay rates (0.90 and 0.95) while maintaining strong performance. FAKE.BR performed best at a 1 × 10−4 learning rate, suggesting that conservative updates were beneficial. The top configuration (F1 = 0.9953) used a 0.95 decay rate and 1 × 10−4 weight decay, reinforcing the importance of regularization in high-accuracy tasks. CAROSIA reached its highest F1 score (0.9033) with a 2 × 10−4 learning rate, 0.90 decay rate, and 1 × 10−3 weight decay, supporting the idea that the task benefited from moderate learning rates. BBRC required higher learning rates (3 × 10−4) for optimal performance, with its best F1 score (0.7580) observed using a 0.95 decay rate and 1 × 10−1 or 1 × 10−2 weight decay, suggesting that the task benefited from larger updates.
Across all datasets, learning rates between 1 × 10−4 and 3 × 10−4 consistently yielded the best results, whereas the higher learning rates in the grid (4 × 10−4 and 5 × 10−4) did not appear among the best configurations. This supports prior research on the risk of overshooting optimal solutions in transformer models when learning rates are excessive [81]. The effect of the decay rate (0.90 vs. 0.95) was generally minimal, with performance being more strongly influenced by the learning rate and weight decay. Weight decay values (1 × 10−4 to 1 × 10−1) had a moderate but dataset-dependent impact, indicating that their influence varied with task-specific characteristics. Additionally, the pooler multiplier was consistently 1.02 or 1.03 in the optimal configurations, suggesting stable fine-tuning behavior across datasets.
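A minimal sketch of how such LLRD parameter groups can be built is shown below; it assumes a DebertaV2-style classifier whose backbone parameters are named deberta.encoder.layer.* and whose head parameters are named pooler.* and classifier.*, and it omits the usual bias/LayerNorm weight-decay exclusions for brevity. It illustrates the technique rather than reproducing the exact training script.

```python
def llrd_param_groups(model, base_lr=1e-4, decay=0.95, weight_decay=1e-2,
                      pooler_mult=1.02, num_layers=12):
    """Assign geometrically decayed learning rates per encoder layer (the top layer
    gets base_lr, lower layers progressively less) and a boosted rate to the head."""
    buckets = {"head": [], "embeddings": []}
    buckets.update({i: [] for i in range(num_layers)})
    for name, param in model.named_parameters():
        if name.startswith(("pooler", "classifier")):
            buckets["head"].append(param)
        elif "encoder.layer." in name:
            idx = int(name.split("encoder.layer.")[1].split(".")[0])
            buckets[idx].append(param)
        else:
            buckets["embeddings"].append(param)  # embeddings and remaining backbone params
    groups = [{"params": buckets["head"], "lr": base_lr * pooler_mult, "weight_decay": weight_decay}]
    for i in range(num_layers - 1, -1, -1):      # top layer first, then decay downwards
        lr = base_lr * decay ** (num_layers - 1 - i)
        groups.append({"params": buckets[i], "lr": lr, "weight_decay": weight_decay})
    groups.append({"params": buckets["embeddings"], "lr": base_lr * decay ** num_layers,
                   "weight_decay": weight_decay})
    return groups

# e.g., the best FAKE.BR configuration from Table 15:
# optimizer = torch.optim.AdamW(llrd_param_groups(model, base_lr=1e-4, decay=0.95,
#                                                 weight_decay=1e-4, pooler_mult=1.03))
```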
DeB3RTa’s performance across all four datasets, including that of its smaller variant, demonstrates remarkable consistency, adaptability, and efficiency. On FAKE.BR, the base variant achieves competitive results, approaching the best-performing models, whereas on CAROSIA it delivers strong performance, outperforming several transformer-based baselines. These results emphasize that the DeB3RTa architecture, combined with mixed-domain pretraining, offers a balanced approach to handling varied text lengths and complex financial jargon. The results also suggest that the smaller version of DeB3RTa remains competitive against much larger models such as XLM-RoBERTa and GPT-4, making it a practical option for deployment in financial institutions that require real-time analysis with limited computational resources.
6.1. Failure Analysis
In the OFFCOMBR-3 hate speech detection task, DeB3RTa’s best configuration achieved an F1 score of 0.7539 (with the MADGRAD optimizer), significantly underperforming compared to gpt-3.5-turbo’s 0.8157. This gap became even more pronounced with the small variant, which achieved only 0.5460. The precision of 0.8016 indicates that when the model identifies something as hate speech, it is correct 80.16% of the time. However, the recall of 0.7262 shows that the model only successfully identifies 72.62% of all hate speech instances in the dataset.
On the CAROSIA sentiment analysis dataset, while DeB3RTa demonstrated strong and balanced performance (F1 = 0.9207, precision = 0.9193, recall = 0.9239), there remains a performance gap compared to BERTimbau base (F1 = 0.9363), indicating room for improvement even in tasks where the model shows strong baseline performance.
The most significant limitation appears in model size reduction. The smaller variant of DeB3RTa showed considerable performance degradation across tasks:
OFFCOMBR-3: Drop from 0.7539 to 0.5460 (F1 score)
CAROSIA: Despite outperforming larger models like XLM-RoBERTa base (0.8326), the small variant’s F1 score of 0.8722 still represents a notable drop from the base model’s 0.9207
BBRC: Drop from 0.7609 to 0.6712 (F1 score)
This consistent pattern of degradation with the smaller architecture indicates a significant challenge in maintaining performance while reducing model size. The BBRC regulatory document classification task particularly highlights this limitation, where the drop of nearly 9 percentage points suggests that complex document understanding is highly sensitive to model capacity.
6.2. Threats to Validity
This study’s internal validity is generally strong, given the use of reliable datasets such as FAKE.BR, CAROSIA, and BBRC, although, as with any data analysis, minor inconsistencies in data preprocessing and annotation cannot be ruled out.
The small size of these datasets cuts both ways: it allows for thorough verification, but it also poses a potential threat to the validity of the results. To address this, regularization techniques such as weight decay were employed to mitigate overfitting, a common challenge with smaller datasets. However, with training splits ranging from 505 to 3379 instances and test splits from 64 to 423 instances, the limited data size may still influence this study’s outcomes.
The class imbalance in datasets such as CAROSIA and BBRC presents a challenge common to many real-world machine learning problems. While this reflects realistic scenarios, it is important to consider its potential influence on the model’s performance across different classes.
The choice of evaluation metrics, including the F1 score, is appropriate for the tasks at hand, especially given the imbalanced nature of some datasets. However, as with any single metric, it may not capture every nuance of model performance.
The consistent outperformance of DeB3RTa across tasks provides strong evidence for its effectiveness. However, as is standard in machine learning research, it is important to acknowledge that small differences in performance metrics can be influenced by factors such as random seed selection and hyperparameter choices. This study’s use of advanced fine-tuning techniques such as LLRD and mixout contributes to its robust methodology, while also introducing task-specific optimizations that may influence generalizability.
The fine-tuning process, including the choice of optimizers and scheduling techniques, was crucial to the model’s success. While this level of optimization is a strength of this study, it also presents a common challenge in deep learning research: the sensitivity of results to specific hyperparameter configurations.
7. Conclusions
In this study, we introduced DeB3RTa, a financial domain-specific transformer-based model for the Portuguese language. Our model consistently outperformed general-domain models such as BERTimbau and multilingual models such as XLM-RoBERTa across critical financial tasks, including fake news detection, sentiment analysis, and regulatory risk classification. These findings underscore DeB3RTa’s ability to handle the complexities and nuances of financial language more effectively than its general-domain counterparts.
The model’s success is largely due to the use of mixed-domain pretraining, which enabled it to generalize better across various subdomains within finance. Furthermore, optimization techniques, including layer reinitialization, mixout regularization, and layer-wise learning rate decay (LLRD), were essential in enhancing its performance and avoiding overfitting, particularly in smaller datasets.
One avenue for improvement is the exploration of model compression techniques, such as pruning and quantization, to reduce computational costs without a significant loss of accuracy. Another promising strategy is the exploration of additional transfer learning techniques, such as continued pretraining, to further refine DeB3RTa’s ability to handle specialized financial texts. Additionally, extending the model to support multilingual financial environments would greatly benefit institutions operating in global markets. Finally, future work will evaluate the model’s effectiveness in real-world financial applications, assessing its performance in practical scenarios and industry-relevant tasks.
This study demonstrates that mixed-domain pretraining, fine-tuning techniques, and architectural choices play a decisive role in model performance within specialized domains such as finance, where DeB3RTa has proven to be a powerful and efficient solution.
To facilitate the reproducibility of the experiments, we have made the source code and detailed instructions available in the public repository:
Conceptualization, H.P., L.P. and J.P.C.; Methodology, H.P.; Validation, H.P., L.P. and J.P.C.; Writing—original draft, H.P.; Writing—review & editing, L.P. and J.P.C.; Supervision, L.P. and J.P.C. All authors have read and agreed to the published version of the manuscript.
The data presented in this study are available from the corresponding author upon request due to copyright and licensing restrictions associated with the original data sources.
This work was supported by the Instituto Federal de Educação, Ciência e Tecnologia do Maranhão and the Human Language Technology Lab in Instituto de Engenharia de Sistemas e Computadores—Investigação e Desenvolvimento (INESC-ID).
The authors declare no conflicts of interest.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Figure 2. Convergence of DeB3RTa base PLM: train loss with exponential moving average (smoothing factor: 0.9).
Figure 3. Layer-wise learning rate decay schedule: initial learning rate = 1 × 10−4, decay rate = 0.95, pooler multiplier = 1.02; warmup steps and cosine schedule with learning rates for embedding, pooler, and layers 0–11 are displayed.
Summary of the financial corpus by source, including token, sentence, and document counts.
Corpus | Tokens | Sentences | Documents |
---|---|---|---|
Relevant Facts | 8,983,628 | 200,396 | 9581 |
Google Patents | 40,306,323 | 1,055,227 | 7973 |
Scielo | 98,167,792 | 4,779,149 | 10,498 |
Wikipedia | 73,925,719 | 2,548,436 | 102,140 |
News | 828,975,676 | 27,107,736 | 2,347,217 |
Total | 1,050,359,095 | 35,690,944 | 2,477,409 |
Descriptive statistics of the OFFCOMBR-3 dataset.
Split | Median | Min | Max | Not Hate Speech | Hate Speech |
---|---|---|---|---|---|
Train | 11 | 1 | 91 | 664 | 162 |
Validation | 12 | 1 | 94 | 83 | 20 |
Test | 10 | 1 | 65 | 84 | 20 |
Descriptive statistics of the FAKE.BR dataset.
Split | Median | Min | Max | Fake | Legitimate |
---|---|---|---|---|---|
Train | 348 | 10 | 7517 | 1689 | 1690 |
Validation | 358.5 | 10 | 4050 | 211 | 211 |
Test | 354 | 17 | 4891 | 211 | 212 |
Descriptive statistics of the CAROSIA dataset.
Split | Median | Min | Max | Negative | Positive |
---|---|---|---|---|---|
Train | 13 | 2 | 26 | 442 | 574 |
Validation | 13 | 6 | 24 | 55 | 72 |
Test | 13 | 5 | 24 | 56 | 71 |
Descriptive statistics of the BBRC dataset.
Split | Median | Min | Max | Not | Relevant |
---|---|---|---|---|---|
Train | 526 | 52 | 219,610 | 236 | 269 |
Validation | 480 | 71 | 18,849 | 29 | 34 |
Test | 700 | 127 | 92,041 | 30 | 34 |
Optimizer-based configurations.
Model | Learning Rate | Optimizer |
---|---|---|
DeB3RTa small/base | {1 × 10−5, 2 × 10−5, 3 × 10−5, 4 × 10−5, 5 × 10−5} | AdamW, AdamP, RAdam, MADGRAD |
Layer reinitialization configurations.
Model | Learning Rate | Layer to Reinitialize |
---|---|---|
DeB3RTa base (Layer 10/11/12) | {1 × 10−5, 2 × 10−5, 3 × 10−5, 4 × 10−5, 5 × 10−5} | {10, 11, 12} |
Stochastic weight averaging configurations.
Model | Learning Rate | SWA Learning Rate |
---|---|---|
DeB3RTa base (SWA) | {1 × 10−5, 2 × 10−5, 3 × 10−5, 4 × 10−5, 5 × 10−5} | {1 × 10−6, 2 × 10−6, 3 × 10−6, 4 × 10−6, 5 × 10−6} |
Mixout and layer-wise learning rate decay configurations.
Model | Learning Rate | Hyperparameters |
---|---|---|
DeB3RTa base (Mixout) | {1 × 10−5, 2 × 10−5, 3 × 10−5, 4 × 10−5, 5 × 10−5} | Mixout probability: {0.1, 0.3, 0.5, 0.7, 0.9} |
DeB3RTa base (LLRD) | {1 × 10−4, 2 × 10−4, 3 × 10−4, 4 × 10−4, 5 × 10−4} | Decay rate: {0.90, 0.95}; weight decay: {1 × 10−4, 1 × 10−3, 1 × 10−2, 1 × 10−1}; pooler multiplier: {1.02, 1.03} |
Prompts used by GPT models for classification tasks in OFFCOMBR-3, FAKE.BR, CAROSIA, and BBRC datasets.
Dataset | Prompt |
---|---|
OFFCOMBR-3 | Classify the following web comment as either hate speech or not hate speech. Message: ‘text’. The output should only contain two words: hate or not hate. |
FAKE.BR | Classify the following text of news articles as either legitimate (true information) or fake (false information). Message: ‘text’. The output should only contain two words: legitimate or fake. |
CAROSIA | Classify the following text of news updates about the Brazilian financial market as either positive (indicating favorable market conditions) or negative (indicating unfavorable conditions). Message: ‘text’. The output should only contain two words: positive or negative. |
BBRC | Classify the following text of banking regulatory risk from different departments of Banco do Brasil as either relevant (impacting departmental compliance and operations) or not relevant (not impacting departmental compliance). Message: ‘text’. The output should only contain two words: relevant or not relevant. |
F1 scores and PR-AUC across datasets (highest values in bold underlined).
F1 Scores | ||||
---|---|---|---|---|
Model | OFFCOMBR-3 | FAKE.BR | CAROSIA | BBRC |
DeB3RTa small (AdamW) | 0.5460 | 0.9598 | 0.8722 | 0.6712 |
DeB3RTa base (AdamW) | 0.6836 | 0.9858 | 0.9120 | 0.7460 |
DeB3RTa base (AdamP) | 0.7424 | 0.9858 | 0.8795 | 0.7609 |
DeB3RTa base (RAdam) | 0.7206 | 0.9835 | 0.9038 | 0.7490 |
DeB3RTa base (MADGRAD) | 0.7539 | 0.9906 | 0.9207 | 0.7176 |
DeB3RTa base (SWA) | 0.6737 | 0.9716 | 0.8722 | 0.7290 |
DeB3RTa base (Layer 10/11/12) | 0.7080 | 0.9811 | 0.9201 | 0.7143 |
DeB3RTa base (Mixout) | 0.7102 | 0.9811 | 0.9038 | 0.7013 |
DeB3RTa base (LLRD) | 0.7424 | 0.9882 | 0.8627 | 0.6995 |
mBERT base | 0.6172 | 0.9835 | 0.8789 | 0.6995 |
BERTimbau base | 0.6877 | 0.9858 | 0.9363 | 0.7117 |
BERTimbau large | 0.6780 | 0.9906 | 0.9280 | 0.6797 |
XLM-RoBERTa base | 0.6737 | 0.9811 | 0.8326 | 0.6085 |
XLM-RoBERTa large | 0.4468 | 0.9929 | 0.9123 | 0.6246 |
DistilBERT base | 0.6810 | 0.9740 | 0.8550 | 0.6297 |
BusinessBERT | 0.6494 | 0.9693 | 0.7233 | 0.7143 |
SEC-BERT | 0.6905 | 0.9693 | 0.8326 | 0.7478 |
gpt-3.5-turbo | 0.8157 | 0.6407 | 0.6600 | 0.5205 |
gpt-4o-mini | 0.7984 | 0.7285 | 0.9280 | 0.4921 |
gpt-4o | 0.6590 | 0.7754 | 0.8424 | 0.5873 |
gpt-4-turbo | 0.6590 | 0.8345 | 0.9207 | 0.5377 |
gpt-4 | 0.6586 | 0.6698 | 0.8892 | 0.5536 |
PR-AUC Scores | ||||
Model | OFFCOMBR-3 | FAKE.BR | CAROSIA | BBRC |
DeB3RTa small (AdamW) | 0.6813 | 0.9925 | 0.8957 | 0.7812 |
DeB3RTa base (AdamW) | 0.7534 | 0.9948 | 0.9605 | 0.8315 |
DeB3RTa base (AdamP) | 0.8081 | 0.9943 | 0.9740 | 0.8290 |
DeB3RTa base (RAdam) | 0.7424 | 0.9943 | 0.9523 | 0.7861 |
DeB3RTa base (MADGRAD) | 0.7911 | 0.9960 | 0.9725 | 0.8203 |
DeB3RTa base (SWA) | 0.7424 | 0.9952 | 0.9356 | 0.8402 |
DeB3RTa base (Layer 10/11/12) | 0.7840 | 0.9937 | 0.9711 | 0.8125 |
DeB3RTa base (Mixout) | 0.7774 | 0.9947 | 0.9605 | 0.8105 |
DeB3RTa base (LLRD) | 0.7913 | 0.9964 | 0.9362 | 0.7527 |
mBERT base | 0.6295 | 0.9931 | 0.9726 | 0.7702 |
BERTimbau base | 0.7180 | 0.9972 | 0.9776 | 0.8157 |
BERTimbau large | 0.7534 | 0.9991 | 0.9714 | 0.7367 |
XLM-RoBERTa base | 0.6622 | 0.9946 | 0.8689 | 0.6508 |
XLM-RoBERTa large | 0.5607 | 0.9993 | 0.9705 | 0.6738 |
DistilBERT base | 0.6418 | 0.9970 | 0.9164 | 0.7233 |
BusinessBERT | 0.6962 | 0.9952 | 0.8250 | 0.8046 |
SEC-BERT | 0.7352 | 0.9966 | 0.8971 | 0.7927 |
gpt-3.5-turbo | 0.7392 | 0.7531 | 0.8360 | 0.7157 |
gpt-4o-mini | 0.7135 | 0.8214 | 0.9529 | 0.6667 |
gpt-4o | 0.6971 | 0.8442 | 0.9164 | 0.7268 |
gpt-4-turbo | 0.6971 | 0.8759 | 0.9559 | 0.6993 |
gpt-4 | 0.6978 | 0.7959 | 0.9382 | 0.7337 |
Recall and precision scores across datasets (highest values in bold underlined).
Recall Scores | ||||
---|---|---|---|---|
Model | OFFCOMBR-3 | FAKE.BR | CAROSIA | BBRC |
DeB3RTa small (AdamW) | 0.5452 | 0.9598 | 0.8722 | 0.6716 |
DeB3RTa base (AdamW) | 0.6643 | 0.9858 | 0.9112 | 0.7451 |
DeB3RTa base (AdamP) | 0.7202 | 0.9858 | 0.8774 | 0.7598 |
DeB3RTa base (RAdam) | 0.7083 | 0.9835 | 0.9023 | 0.7490 |
DeB3RTa base (MADGRAD) | 0.7262 | 0.9906 | 0.9239 | 0.7176 |
DeB3RTa base (SWA) | 0.6583 | 0.9716 | 0.8722 | 0.7284 |
DeB3RTa base (Layer 10/11/12) | 0.6893 | 0.9811 | 0.9201 | 0.7137 |
DeB3RTa base (Mixout) | 0.7024 | 0.9811 | 0.9023 | 0.7010 |
DeB3RTa base (LLRD) | 0.7333 | 0.9882 | 0.8595 | 0.6990 |
mBERT base | 0.6298 | 0.9835 | 0.8755 | 0.6990 |
BERTimbau base | 0.6774 | 0.9858 | 0.9380 | 0.7118 |
BERTimbau large | 0.6714 | 0.9906 | 0.9272 | 0.6804 |
XLM-RoBERTa base | 0.6583 | 0.9811 | 0.8332 | 0.6088 |
XLM-RoBERTa large | 0.5000 | 0.9929 | 0.9131 | 0.6294 |
DistilBERT base | 0.6845 | 0.9740 | 0.8525 | 0.6324 |
BusinessBERT | 0.6262 | 0.9692 | 0.7271 | 0.7137 |
SEC-BERT | 0.6905 | 0.9692 | 0.8332 | 0.7471 |
gpt-3.5-turbo | 0.8464 | 0.6590 | 0.6986 | 0.5343 |
gpt-4o-mini | 0.7762 | 0.7404 | 0.9272 | 0.4941 |
gpt-4o | 0.6250 | 0.7829 | 0.8516 | 0.5882 |
gpt-4-turbo | 0.6250 | 0.8345 | 0.9239 | 0.5402 |
gpt-4 | 0.6250 | 0.6929 | 0.8939 | 0.5657 |
Precision Scores | ||||
Model | OFFCOMBR-3 | FAKE.BR | CAROSIA | BBRC |
DeB3RTa small (AdamW) | 0.5990 | 0.9598 | 0.8722 | 0.6711 |
DeB3RTa base (AdamW) | 0.7190 | 0.9858 | 0.9129 | 0.7530 |
DeB3RTa base (AdamP) | 0.7772 | 0.9858 | 0.8826 | 0.7718 |
DeB3RTa base (RAdam) | 0.7366 | 0.9835 | 0.9058 | 0.7490 |
DeB3RTa base (MADGRAD) | 0.8016 | 0.9906 | 0.9193 | 0.7176 |
DeB3RTa base (SWA) | 0.6993 | 0.9716 | 0.8722 | 0.7390 |
DeB3RTa base (Layer 10/11/12) | 0.7382 | 0.9811 | 0.9201 | 0.7206 |
DeB3RTa base (Mixout) | 0.7196 | 0.9811 | 0.9058 | 0.7020 |
DeB3RTa base (LLRD) | 0.7532 | 0.9882 | 0.8688 | 0.7032 |
mBERT base | 0.6104 | 0.9835 | 0.8852 | 0.7032 |
BERTimbau base | 0.7015 | 0.9858 | 0.9352 | 0.7250 |
BERTimbau large | 0.6860 | 0.9906 | 0.9289 | 0.6917 |
XLM-RoBERTa base | 0.6993 | 0.9811 | 0.8321 | 0.6085 |
XLM-RoBERTa large | 0.4038 | 0.9929 | 0.9117 | 0.6310 |
DistilBERT base | 0.6779 | 0.9740 | 0.8594 | 0.6432 |
BusinessBERT | 0.7255 | 0.9696 | 0.7240 | 0.7206 |
SEC-BERT | 0.6905 | 0.9696 | 0.8321 | 0.7500 |
gpt-3.5-turbo | 0.7947 | 0.7008 | 0.7452 | 0.5409 |
gpt-4o-mini | 0.8295 | 0.7901 | 0.9289 | 0.4939 |
gpt-4o | 0.9242 | 0.8242 | 0.8485 | 0.5911 |
gpt-4-turbo | 0.9242 | 0.8345 | 0.9193 | 0.5421 |
gpt-4 | 0.9235 | 0.7554 | 0.8886 | 0.5784 |
Table 13. Five best grid search results for the OFFCOMBR-3, FAKE.BR, CAROSIA, and BBRC datasets with reinitialized layers and varying hyperparameters (highest values of each dataset in bold).
Dataset | F1 Score | Reinitialized Layer | Learning Rate | Batch Size |
---|---|---|---|---|
OFFCOMBR-3 | 0.8260 | 10 | 4 × 10−5 | 32 |
0.8239 | 11 | 5 × 10−5 | 32 | |
0.8139 | 10 | 3 × 10−5 | 32 | |
0.7980 | 10 | 5 × 10−5 | 16 | |
0.7944 | 10 | 3 × 10−5 | 16 | |
FAKE.BR | 0.9905 | 12 | 1 × 10−5 | 64 |
0.9905 | 11 | 1 × 10−5 | 64 | |
0.9905 | 10 | 5 × 10−5 | 64 | |
0.9905 | 12 | 4 × 10−5 | 32 | |
0.9905 | 11 | 2 × 10−5 | 32 | |
CAROSIA | 0.9120 | 11 | 3 × 10−5 | 16 |
0.9038 | 10 | 4 × 10−5 | 32 | |
0.9033 | 12 | 4 × 10−5 | 16 | |
0.8960 | 11 | 2 × 10−5 | 16 | |
0.8872 | 12 | 5 × 10−5 | 16 | |
BBRC | 0.7597 | 10 | 5 × 10−5 | 16 |
0.7597 | 11 | 4 × 10−5 | 16 | |
0.7597 | 11 | 2 × 10−5 | 16 | |
0.7580 | 12 | 5 × 10−5 | 16 | |
0.7580 | 12 | 4 × 10−5 | 16 |
Table 14. Five best grid search results for the OFFCOMBR-3, FAKE.BR, CAROSIA, and BBRC datasets with Mixout and varying hyperparameters (highest values of each dataset in bold).
Dataset | F1 Score | Mixout | Learning Rate | Batch Size |
---|---|---|---|---|
OFFCOMBR-3 | 0.8449 | 0.1 | 5 × 10−5 | 16 |
0.8260 | 0.7 | 4 × 10−5 | 16 |
0.8260 | 0.1 | 2 × 10−5 | 32 |
0.8188 | 0.3 | 3 × 10−5 | 32 | |
0.8188 | 0.5 | 5 × 10−5 | 16 | |
FAKE.BR | 0.9953 | 0.1 | 1 × 10−5 | 64 |
0.9953 | 0.1 | 4 × 10−5 | 32 | |
0.9929 | 0.3 | 1 × 10−5 | 64 | |
0.9929 | 0.9 | 3 × 10−5 | 64 | |
0.9929 | 0.3 | 3 × 10−5 | 64 | |
CAROSIA | 0.8867 | 0.7 | 2 × 10−5 | 32 |
0.8795 | 0.5 | 5 × 10−5 | 16 | |
0.8789 | 0.9 | 3 × 10−5 | 16 | |
0.8789 | 0.1 | 1 × 10−5 | 16 | |
0.8782 | 0.9 | 4 × 10−5 | 16 | |
BBRC | 0.7444 | 0.1 | 2 × 10−5 | 16 |
0.7429 | 0.3 | 5 × 10−5 | 16 | |
0.7291 | 0.9 | 5 × 10−5 | 32 | |
0.7277 | 0.3 | 4 × 10−5 | 32 | |
0.7277 | 0.3 | 4 × 10−5 | 16 |
Table 15. Five best grid search results for the OFFCOMBR-3, FAKE.BR, CAROSIA, and BBRC datasets with LLRD and varying hyperparameters (highest values of each dataset in bold).
Dataset | F1 Score | Learning Rate | Rate Decay | Weight Decay | Pooler Multiplier | Batch Size |
---|---|---|---|---|---|---|
OFFCOMBR-3 | 0.8577 | 2 × 10−4 | 0.90 | 1 × 10−1 | 1.03 | 16 |
0.8449 | 2 × 10−4 | 0.95 | 1 × 10−3 | 1.02 | 16 | |
0.8387 | 2 × 10−4 | 0.90 | 1 × 10−4 | 1.03 | 16 | |
0.8387 | 2 × 10−4 | 0.90 | 1 × 10−3 | 1.03 | 16 | |
0.8387 | 2 × 10−4 | 0.95 | 1 × 10−1 | 1.03 | 16 | |
FAKE.BR | 0.9953 | 1 × 10−4 | 0.95 | 1 × 10−4 | 1.03 | 32 |
0.9953 | 1 × 10−4 | 0.90 | 1 × 10−1 | 1.03 | 64 | |
0.9929 | 2 × 10−4 | 0.90 | 1 × 10−2 | 1.03 | 64 | |
0.9929 | 1 × 10−4 | 0.95 | 1 × 10−1 | 1.03 | 64 | |
0.9929 | 2 × 10−4 | 0.90 | 1 × 10−4 | 1.02 | 64 | |
CAROSIA | 0.9033 | 2 × 10−4 | 0.90 | 1 × 10−3 | 1.02 | 16 |
0.8950 | 3 × 10−4 | 0.90 | 1 × 10−4 | 1.03 | 16 | |
0.8872 | 3 × 10−4 | 0.90 | 1 × 10−4 | 1.02 | 32 | |
0.8853 | 1 × 10−4 | 0.90 | 1 × 10−4 | 1.03 | 32 | |
0.8853 | 1 × 10−4 | 0.90 | 1 × 10−2 | 1.03 | 32 | |
BBRC | 0.7580 | 3 × 10−4 | 0.95 | 1 × 10−1 | 1.02 | 32 |
0.7580 | 3 × 10−4 | 0.95 | 1 × 10−2 | 1.02 | 32 | |
0.7277 | 2 × 10−4 | 0.90 | 1 × 10−1 | 1.02 | 16 | |
0.7232 | 3 × 10−4 | 0.95 | 1 × 10−4 | 1.02 | 32 | |
0.7232 | 2 × 10−4 | 0.90 | 1 × 10−4 | 1.03 | 16 |
References
1. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Minneapolis, MN, USA, 2–7 June 2019; Burstein, J.; Doran, C.; Solorio, T. Association for Computational Linguistics: Kerrville, TX, USA, 2019; pp. 4171-4186.
2. Aksoy, Ç.; Ahmetoğlu, A.; Güngör, T. Hierarchical Multitask Learning Approach for BERT. arXiv; 2020; arXiv: 2011.04451
3. Zhang, Z.; Wu, Y.; Zhao, H.; Li, Z.; Zhang, S.; Zhou, X.; Zhou, X. Semantics-aware BERT for Language Understanding. arXiv; 2020; arXiv: 1909.02209[DOI: https://dx.doi.org/10.1609/aaai.v34i05.6510]
4. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog; 2019; 1, 9.
5. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. et al. Language models are few-shot learners. Proceedings of the NIPS ’20; Online, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020.
6. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S. et al. GPT-4 Technical Report. arXiv; 2024; arXiv: 2303.08774
7. Saravanan, S.; Sudha, K. GPT-3 Powered System for Content Generation and Transformation. Proceedings of the 2022 Fifth International Conference on Computational Intelligence and Communication Technologies (CCICT) 2022; Sonepat, India, 8–9 July 2022; pp. 514-519.
8. Binz, M.; Schulz, E. Using cognitive psychology to understand GPT-3. Proc. Natl. Acad. Sci. USA; 2023; 120, e2218523120. [DOI: https://dx.doi.org/10.1073/pnas.2218523120]
9. Araci, D. FinBERT: Financial Sentiment Analysis with Pre-trained Language Models. arXiv; 2019; arXiv: 1908.10063
10. Yang, Y.; Uy, M.C.S.; Huang, A. FinBERT: A Pretrained Language Model for Financial Communications. arXiv; 2020; arXiv: 2006.08097
11. Liu, Z.; Huang, D.; Huang, K.; Li, Z.; Zhao, J. FinBERT: A pre-trained financial language representation model for financial text mining. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI’20; Yokohama, Japan, 7–15 January 2021.
12. Loukas, L.; Fergadiotis, M.; Chalkidis, I.; Spyropoulou, E.; Malakasiotis, P.; Androutsopoulos, I.; Paliouras, G. FiNER: Financial Numeric Entity Recognition for XBRL Tagging. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Dublin, Ireland, 22–27 May 2022; Muresan, S.; Nakov, P.; Villavicencio, A. Association for Computational Linguistics: Kerrville, TX, USA, 2022; pp. 4419-4431.
13. Shah, R.; Chawla, K.; Eidnani, D.; Shah, A.; Du, W.; Chava, S.; Raman, N.; Smiley, C.; Chen, J.; Yang, D. When FLUE Meets FLANG: Benchmarks and Large Pretrained Language Model for Financial Domain. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing; Abu Dhabi, United Arab Emirates, 7–11 December 2022; Goldberg, Y.; Kozareva, Z.; Zhang, Y. Association for Computational Linguistics: Kerrville, TX, USA, 2022; pp. 2322-2335.
14. Delgadillo, J.; Kinyua, J.; Mutigwe, C. FinSoSent: Advancing Financial Market Sentiment Analysis through Pretrained Large Language Models. Big Data Cogn. Comput.; 2024; 8, 87. [DOI: https://dx.doi.org/10.3390/bdcc8080087]
15. Cao, Y.; Yang, L.; Wei, C.; Wang, H. Financial Text Sentiment Classification Based on Baichuan2 Instruction Finetuning Model. Proceedings of the 2023 5th International Conference on Frontiers Technology of Information and Computer (ICFTIC); Qingdao, China, 17–19 November 2023; pp. 403-406.
16. Lo, A.W.; Singh, M. ChatGPT. From ELIZA to ChatGPT: The evolution of natural language processing and financial applications. J. Portf. Manag.; 2023; 49, pp. 201-235. [DOI: https://dx.doi.org/10.3905/jpm.2023.1.512]
17. Inserte, P.R.; Nakhlé, M.; Qader, R.; Caillaut, G.; Liu, J. Large Language Model Adaptation for Financial Sentiment Analysis. arXiv; 2024; arXiv: 2401.14777
18. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Online, 5–10 July 2020; Jurafsky, D.; Chai, J.; Schluter, N.; Tetreault, J. Association for Computational Linguistics: Kerrville, TX, USA, 2020; pp. 8440-8451.
19. Souza, F.; Nogueira, R.; Lotufo, R. BERT models for Brazilian Portuguese: Pretraining, evaluation and tokenization analysis. Appl. Soft Comput.; 2023; 149, 110901. [DOI: https://dx.doi.org/10.1016/j.asoc.2023.110901]
20. Zhang, W.; Deng, L.; Zhang, L.; Wu, D. A Survey on Negative Transfer. IEEE/CAA J. Autom. Sin.; 2023; 10, pp. 305-329. [DOI: https://dx.doi.org/10.1109/JAS.2022.106004]
21. Wagner Filho, J.A.; Wilkens, R.; Idiart, M.; Villavicencio, A. The brWaC Corpus: A New Open Resource for Brazilian Portuguese. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018); Miyazaki, Japan, 7–12 May 2018; Calzolari, N.; Choukri, K.; Cieri, C.; Declerck, T.; Goggi, S.; Hasida, K.; Isahara, H.; Maegaard, B.; Mariani, J.; Mazo, H. et al. Association for Computational Linguistics: Kerrville, TX, USA, 2018.
22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv; 2023; arXiv: 1706.03762
23. Amrhein, C.; Sennrich, R. How Suitable Are Subword Segmentation Strategies for Translating Non-Concatenative Morphology?. Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021; Punta Cana, Dominican Republic, 7–11 November 2021; Moens, M.F.; Huang, X.; Specia, L.; Yih, S.W.T. Association for Computational Linguistics: Kerrville, TX, USA, 2021; pp. 689-705.
24. Dai, Y.; Li, L.; Zhou, C.; Feng, Z.; Zhao, E.; Qiu, X.; Li, P.; Tang, D. “Is Whole Word Masking Always Better for Chinese BERT?”: Probing on Chinese Grammatical Error Correction. Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022; Dublin, Ireland, 22–27 May 2022; Muresan, S.; Nakov, P.; Villavicencio, A. Association for Computational Linguistics: Kerrville, TX, USA, 2022; pp. 1-8.
25. Levine, Y.; Lenz, B.; Lieber, O.; Abend, O.; Leyton-Brown, K.; Tennenholtz, M.; Shoham, Y. PMI-Masking: Principled masking of correlated spans. arXiv; 2020; arXiv: 2010.01825
26. Joshi, M.; Chen, D.; Liu, Y.; Weld, D.S.; Zettlemoyer, L.; Levy, O. SpanBERT: Improving Pre-training by Representing and Predicting Spans. Trans. Assoc. Comput. Linguist.; 2020; 8, pp. 64-77. [DOI: https://dx.doi.org/10.1162/tacl_a_00300]
27. He, W.; Dai, Y.; Yang, M.; Sun, J.; Huang, F.; Si, L.; Li, Y. Unified dialog model pre-training for task-oriented dialog understanding and generation. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval 2022; Madrid, Spain, 11–15 July 2022.
28. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv; 2017; arXiv: 1412.6980
29. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv; 2019; arXiv: 1711.05101
30. Heo, B.; Chun, S.; Oh, S.J.; Han, D.; Yun, S.; Kim, G.; Uh, Y.; Ha, J.W. AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights. arXiv; 2021; arXiv: 2006.08217
31. Liu, L.; Jiang, H.; He, P.; Chen, W.; Liu, X.; Gao, J.; Han, J. On the Variance of the Adaptive Learning Rate and Beyond. arXiv; 2021; arXiv: 1908.03265
32. Defazio, A.; Jelassi, S. Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization. arXiv; 2021; arXiv: 2101.11075
33. Dodge, J.; Ilharco, G.; Schwartz, R.; Farhadi, A.; Hajishirzi, H.; Smith, N. Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping. arXiv; 2020; arXiv: 2002.06305
34. Yosinski, J.; Clune, J.; Bengio, Y.; Lipson, H. How transferable are features in deep neural networks?. Proceedings of the 27th International Conference on Neural Information Processing Systems—Volume 2, NIPS’14; Montreal, QC, Canada, 8–13 December 2014; pp. 3320-3328.
35. Zhang, T.; Wu, F.; Katiyar, A.; Weinberger, K.Q.; Artzi, Y. Revisiting Few-sample BERT Fine-tuning. arXiv; 2021; arXiv: 2006.05987
36. Lee, C.; Cho, K.; Kang, W. Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models. arXiv; 2020; arXiv: 1909.11299
37. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res.; 2014; 15, pp. 1929-1958.
38. Wan, L.; Zeiler, M.; Zhang, S.; Le Cun, Y.; Fergus, R. Regularization of Neural Networks using DropConnect. Proceedings of the 30th International Conference on Machine Learning, Proceedings of Machine Learning Research; Atlanta, GA, USA, 17–19 June 2013; Volume 28, pp. 1058-1066.
39. Izmailov, P.; Podoprikhin, D.; Garipov, T.; Vetrov, D.; Wilson, A.G. Averaging Weights Leads to Wider Optima and Better Generalization. arXiv; 2019; arXiv: 1803.05407
40. Guo, H.; Jin, J.; Liu, B. Stochastic Weight Averaging Revisited. Appl. Sci.; 2023; 13, 2935. [DOI: https://dx.doi.org/10.3390/app13052935]
41. Lu, P.; Kobyzev, I.; Rezagholizadeh, M.; Rashid, A.; Ghodsi, A.; Langlais, P. Improving Generalization of Pre-trained Language Models via Stochastic Weight Averaging. Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022; Abu Dhabi, United Arab Emirates, 7–11 December 2022; Goldberg, Y.; Kozareva, Z.; Zhang, Y. Association for Computational Linguistics: Kerrville, TX, USA, 2022; pp. 4948-4954.
42. Talman, A.; Celikkanat, H.; Virpioja, S.; Heinonen, M.; Tiedemann, J. Uncertainty-Aware Natural Language Inference with Stochastic Weight Averaging. Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa); Tórshavn, Faroe Islands, 22–24 May 2023; pp. 358-365.
43. Onal, E.; Flöge, K.; Caldwell, E.; Sheverdin, A.; Fortuin, V. Gaussian Stochastic Weight Averaging for Bayesian Low-Rank Adaptation of Large Language Models. arXiv; 2024; arXiv: 2405.03425
44. Howard, J.; Ruder, S. Universal Language Model Fine-tuning for Text Classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Melbourne, Australia, 15–20 July 2018; Gurevych, I.; Miyao, Y. Association for Computational Linguistics: Kerrville, TX, USA, 2018; pp. 328-339.
45. Clark, K.; Luong, M.T.; Le, Q.V.; Manning, C.D. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. arXiv; 2020; arXiv: 2003.10555
46. You, Y.; Li, J.; Reddi, S.; Hseu, J.; Kumar, S.; Bhojanapalli, S.; Song, X.; Demmel, J.; Keutzer, K.; Hsieh, C.J. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. arXiv; 2020; arXiv: 1904.00962
47. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized autoregressive pretraining for language understanding. Proceedings of the 33rd International Conference on Neural Information Processing Systems; Vancouver, BC, Canada, 8–14 December 2019; Curran Associates Inc.: Red Hook, NY, USA, 2019.
48. Trewartha, A.; Walker, N.; Huo, H.; Lee, S.; Cruse, K.; Dagdelen, J.; Dunn, A.; Persson, K.A.; Ceder, G.; Jain, A. Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science. Patterns; 2022; 3, 100488. [DOI: https://dx.doi.org/10.1016/j.patter.2022.100488]
49. Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A Pretrained Language Model for Scientific Text. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); Hong Kong, China, 3–7 November 2019; Inui, K.; Jiang, J.; Ng, V.; Wan, X. Association for Computational Linguistics: Kerrville, TX, USA, 2019; pp. 3615-3620.
50. Schneider, E.T.R.; de Souza, J.V.A.; Knafou, J.; Oliveira, L.E.S.e.; Copara, J.; Gumiel, Y.B.; Oliveira, L.F.A.d.; Paraiso, E.C.; Teodoro, D.; Barra, C.M.C.M. BioBERTpt—A Portuguese Neural Language Model for Clinical Named Entity Recognition. Proceedings of the 3rd Clinical Natural Language Processing Workshop; Online, 19 November 2020; Rumshisky, A.; Roberts, K.; Bethard, S.; Naumann, T. Association for Computational Linguistics: Kerrville, TX, USA, 2020; pp. 65-72.
51. Boudjellal, N.; Zhang, H.; Khan, A.; Ahmad, A.; Naseem, R.; Shang, J.; Dai, L. ABioNER: A BERT-Based Model for Arabic Biomedical Named-Entity Recognition. Complexity; 2021; 2021, 6633213. [DOI: https://dx.doi.org/10.1155/2021/6633213]
52. Borchert, P.; Coussement, K.; De Weerdt, J.; De Caigny, A. Industry-sensitive language modeling for business. Eur. J. Oper. Res.; 2024; 315, pp. 691-702. [DOI: https://dx.doi.org/10.1016/j.ejor.2024.01.023]
53. Niszczota, P.; Abbas, S. GPT has become financially literate: Insights from financial literacy tests of GPT and a preliminary test of how people use it as a source of advice. Financ. Res. Lett.; 2023; 58, 104333. [DOI: https://dx.doi.org/10.1016/j.frl.2023.104333]
54. Chatzimina, M.E.; Papadaki, H.A.; Pontikoglou, C.; Tsiknakis, M. A Comparative Sentiment Analysis of Greek Clinical Conversations Using BERT, RoBERTa, GPT-2, and XLNet. Bioengineering; 2024; 11, 521. [DOI: https://dx.doi.org/10.3390/bioengineering11060521]
55. He, P.; Liu, X.; Gao, J.; Chen, W. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. arXiv; 2021; arXiv: 2006.03654
56. Liang, W.; Liang, Y. BPDec: Unveiling the Potential of Masked Language Modeling Decoder in BERT pretraining. arXiv; 2024; arXiv: 2401.15861
57. Broder, A. On the resemblance and containment of documents. Proceedings of the Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171); Salerno, Italy, 13 June 1997; pp. 21-29.
58. Lee, K.; Ippolito, D.; Nystrom, A.; Zhang, C.; Eck, D.; Callison-Burch, C.; Carlini, N. Deduplicating Training Data Makes Language Models Better. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Dublin, Ireland, 22–27 May 2022; Muresan, S.; Nakov, P.; Villavicencio, A. Association for Computational Linguistics: Kerrville, TX, USA, 2022; pp. 8424-8445.
59. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv; 2020; arXiv: 1910.01108
60. Kalyan, K.S. A survey of GPT-3 family large language models including ChatGPT and GPT-4. Nat. Lang. Process. J.; 2024; 6, 100048. [DOI: https://dx.doi.org/10.1016/j.nlp.2023.100048]
61. Gale, L.R.; Heath, W.C.; Ressler, R.W. An Economic Analysis of Hate Crime. East. Econ. J.; 2002; 28, pp. 203-216.
62. Curthoys, A. Identifying the Effect of Unemployment on Hate Crime; Renée Crown University Honors Thesis Projects—All. 33 Syracuse University: Syracuse, NY, USA, 2013.
63. Dharmapala, D.; McAdams, R.H. Words that kill: An economic perspective on hate speech and hate crimes. SSRN Electron. J.; 2002; [DOI: https://dx.doi.org/10.2139/ssrn.300695]
64. Williams, M.L.; Burnap, P.; Javed, A.; Liu, H.; Ozalp, S. Hate in the Machine: Anti-Black and Anti-Muslim Social Media Posts as Predictors of Offline Racially and Religiously Aggravated Crime. Br. J. Criminol.; 2019; 60, pp. 93-117. [DOI: https://dx.doi.org/10.1093/bjc/azz049]
65. Vosoughi, S.; Roy, D.; Aral, S. The spread of true and false news online. Science; 2018; 359, pp. 1146-1151. [DOI: https://dx.doi.org/10.1126/science.aap9559]
66. Dong, S.; Liu, C. Sentiment Classification for Financial Texts Based on Deep Learning. Comput. Intell. Neurosci.; 2021; 2021, 9524705. [DOI: https://dx.doi.org/10.1155/2021/9524705] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34671395]
67. Singh, R.; Sharma, V.; Kashyap, R.; Manwal, M. Automated Multi-Page Document Classification and Information Extraction for Insurance Applications using Deep Learning Techniques. Proceedings of the 2024 11th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO); Noida, India, 14–15 March 2024; pp. 1-7.
68. de Pelle, R.; Moreira, V. Offensive Comments in the Brazilian Web: A dataset and baseline results. Proceedings of the Anais do VI Brazilian Workshop on Social Network Analysis and Mining; Porto Alegre, RS, Brazil, 5 July 2016; pp. 510-519.
69. Silva, R.M.; Santos, R.L.; Almeida, T.A.; Pardo, T.A. Towards automatically filtering fake news in Portuguese. Expert Syst. Appl.; 2020; 146, 113199. [DOI: https://dx.doi.org/10.1016/j.eswa.2020.113199]
70. Carosia, A.E.d.O.; Silva, A.E.A.d.; Coelho, G.P. Replication data for: Predicting the Brazilian Stock Market using Sentiment Analysis, Technical Indicators, and Stock Prices. Repos. Dados Pesqui. Unicamp; 2022; [DOI: https://dx.doi.org/10.25824/redu/GFJHFK]
71. Faria de Azevedo, R.; Eduardo Muniz, T.H.; Pimentel, C.; Jose de Assis Foureaux, G.; Caldeira Macedo, B.; Vasconcelos, D.d.L. BBRC: Brazilian Banking Regulation Corpora. Proceedings of the Joint Workshop of the 7th Financial Technology and Natural Language Processing, the 5th Knowledge Discovery from Unstructured Data in Financial Services, and the 4th Workshop on Economics and Natural Language Processing @ LREC-COLING 2024; Torino, Italy, 20 May 2024; pp. 150-166.
72. Kudo, T.; Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations; Brussels, Belgium, 31 October–4 November 2018; pp. 66-71.
73. Sennrich, R.; Haddow, B.; Birch, A. Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Berlin, Germany, 7–12 August 2016; Erk, K.; Smith, N.A. Association for Computational Linguistics: Kerrville, TX, USA, 2016; pp. 1715-1725.
74. Kudo, T. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Melbourne, Australia, 15–20 July 2018; Gurevych, I.; Miyao, Y. Association for Computational Linguistics: Kerrville, TX, USA, 2018; pp. 66-75.
75. Izsak, P.; Berchansky, M.; Levy, O. How to Train BERT with an Academic Budget. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; Online and Punta Cana, Dominican Republic, 7–11 November 2021; Moens, M.F.; Huang, X.; Specia, L.; Yih, S.W.T. Association for Computational Linguistics: Kerrville, TX, USA, 2021; pp. 10644-10652.
76. Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G. et al. Mixed Precision Training. arXiv; 2018; arXiv: 1710.03740
77. Xu, P.; Kumar, D.; Yang, W.; Zi, W.; Tang, K.; Huang, C.; Cheung, J.C.K.; Prince, S.J.; Cao, Y. Optimizing Deeper Transformers on Small Datasets. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers); Online, 2–5 August 2021; Zong, C.; Xia, F.; Li, W.; Navigli, R. Association for Computational Linguistics: Kerrville, TX, USA, 2021; pp. 2089-2102.
78. Balkus, S.V.; Yan, D. Improving short text classification with augmented data using GPT-3. Nat. Lang. Eng.; 2023; 30, pp. 943-972. [DOI: https://dx.doi.org/10.1017/S1351324923000438]
79. Loukas, L.; Stogiannidis, I.; Malakasiotis, P.; Vassos, S. Breaking the Bank with ChatGPT: Few-Shot Text Classification for Finance. Proceedings of the Fifth Workshop on Financial Technology and Natural Language Processing and the Second Multimodal AI For Financial Forecasting; Macao, China, 20 August 2023; Chen, C.C.; Takamura, H.; Mathur, P.; Sawhney, R.; Huang, H.H.; Chen, H.H. Association for Computational Linguistics: Kerrville, TX, USA, 2023; pp. 74-80.
80. Jeong, Y.; Kim, E. SciDeBERTa: Learning DeBERTa for Science Technology Documents and Fine-Tuning Information Extraction Tasks. IEEE Access; 2022; 10, pp. 60805-60813. [DOI: https://dx.doi.org/10.1109/ACCESS.2022.3180830]
81. Wortsman, M.; Liu, P.J.; Xiao, L.; Everett, K.; Alemi, A.; Adlam, B.; Co-Reyes, J.D.; Gur, I.; Kumar, A.; Novak, R. et al. Small-scale proxies for large-scale Transformer training instabilities. arXiv; 2023; arXiv: 2309.14322
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
The complex and specialized terminology of financial language in Portuguese-speaking markets creates significant challenges for natural language processing (NLP) applications, which must capture nuanced linguistic and contextual information to support accurate analysis and decision-making. This paper presents DeB3RTa, a transformer-based model specifically developed through a mixed-domain pretraining strategy that combines extensive corpora from finance, politics, business management, and accounting to enable a nuanced understanding of financial language. DeB3RTa was evaluated against prominent models—including BERTimbau, XLM-RoBERTa, SEC-BERT, BusinessBERT, and GPT-based variants—and consistently achieved significant gains across key financial NLP benchmarks. To maximize adaptability and accuracy, DeB3RTa integrates advanced fine-tuning techniques such as layer reinitialization, mixout regularization, stochastic weight averaging, and layer-wise learning rate decay, which together enhance its performance across varied and high-stakes NLP tasks. These findings underscore the efficacy of mixed-domain pretraining in building high-performance language models for specialized applications. With its robust performance in complex analytical and classification tasks, DeB3RTa offers a powerful tool for advancing NLP in the financial sector and supporting nuanced language processing needs in Portuguese-speaking contexts.
1 Education Department, Federal Institute of Maranhão, Pinheiro 65200-000, MA, Brazil
2 Department of Electrical Engineering, Federal University of Maranhão, São Luís 65080-805, MA, Brazil;
3 Instituto de Engenharia de Sistemas e Computadores–Investigação e Desenvolvimento (INESC-ID)/Instituto Superior Técnico, Universidade de Lisboa, 1000-029 Lisbon, Portugal;