Natural language processing is among the fields most strongly affected by breakthroughs in deep learning. New state-of-the-art techniques push the boundaries almost every day, so keeping up with the latest achievements and testing their capabilities has become increasingly difficult.
In this post, we focus on the Named Entity Recognition (NER) task. It is one of the most widely studied problems in natural language processing: finding entities of interest in documents enables many applications, since the extracted entities can later be used to detect and extract relations between them. We consider developing an NER tagger for financial documents, which are characterized by very formal language, a specific, domain-focused vocabulary, and many entities of interest to a financial analyst who investigates them to make decisions in auditing, investing, or financial reporting, to extract suspicious events, or to perform network analysis that links entities across multiple documents.
In this tutorial, we apply state-of-the-art deep learning techniques: bidirectional long short-term memory (Bi-LSTM) neural networks combined with conditional random fields (CRF). The CRF layer explicitly models dependencies between labels as a transition matrix, which generally tends to improve performance at the final prediction step. Similar techniques are employed in many related papers to solve tasks such as POS tagging, general NER, biomedical NER, or electronic health record processing. Moreover, it is easy to plug in state-of-the-art pre-trained language models, such as BERT, ELMo, RoBERTa, and XLNet (most of them available through the well-known transformers package), and thereby significantly improve the performance of the final tagger. While a custom implementation would be rather tedious, the task is vastly simplified by Zalando’s Flair framework, which provides such an implementation out of the box. It also includes several additional features, such as integration with the transformers package, the ability to combine several embedding models as input (their output vectors are simply concatenated), and annealing-based learning rate schedulers. Hence, one can easily test multiple models without the complexity of model-specific tokenization and encoding, data loading and batching, output postprocessing, or evaluation.
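To make the role of the CRF transition matrix more concrete, here is a minimal sketch of Viterbi decoding in plain Python. It is not Flair's implementation: the three labels, the per-token scores (standing in for Bi-LSTM outputs), and the transition values are all made up for illustration. It only shows how a transition matrix can rule out an invalid label sequence that per-token (greedy) decoding would produce.

```python
LABELS = ["O", "B-ORG", "I-ORG"]
FORBIDDEN = -1e4  # large penalty standing in for an impossible transition

# TRANSITIONS[i][j]: score of moving from label i to label j (illustrative values)
TRANSITIONS = [
    # to: O    B-ORG  I-ORG
    [0.0, 0.0, FORBIDDEN],  # from O: I-ORG must not follow O
    [0.0, 0.0, 1.0],        # from B-ORG
    [0.0, 0.0, 0.5],        # from I-ORG
]

# Hypothetical per-token scores for "Goldman Sachs rose"
EMISSIONS = [
    [1.0, 0.9, 0.2],  # "Goldman": O narrowly beats B-ORG
    [0.1, 0.3, 1.5],  # "Sachs": I-ORG is the clear favourite
    [2.0, 0.1, 0.1],  # "rose": O
]


def viterbi(emissions, transitions):
    """Return the label index sequence with the highest total score."""
    n_labels = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(n_labels):
            # Best previous label for landing on label j at this position
            best_i = max(range(n_labels), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    # Walk back from the best final label
    path = [max(range(n_labels), key=lambda j: scores[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


greedy = [LABELS[max(range(len(LABELS)), key=lambda j: e[j])] for e in EMISSIONS]
decoded = [LABELS[i] for i in viterbi(EMISSIONS, TRANSITIONS)]
print("greedy: ", greedy)
print("viterbi:", decoded)
```

With these made-up numbers, greedy decoding yields the invalid sequence O, I-ORG, O, while decoding with the transition matrix yields B-ORG, I-ORG, O.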
The setup
The Ontonotes corpus, which is widely used to train general NER taggers, was selected for benchmarking. It contains a considerable amount of tagged text, including formal and business-related material, drawn from news, conversational telephone speech, weblogs, Usenet newsgroups, broadcasts, and talk shows. We excluded texts which are not relevant, such as religious texts. Additionally, we used a text corpus of tagged SEC documents which can be found in this repository. Unfortunately, it is not possible to merge these datasets or to use one of them to fine-tune a tagger trained on the other, as their label sets are not compatible: the Ontonotes corpus contains 18 entity types, while the SEC corpus contains only the most basic types, namely Person, Location, Organization, and Miscellaneous.
The following models were considered for the experiment:
- Google’s BERT (Bidirectional Encoder Representations from Transformers), as the original transformer model which sparked the whole transformers revolution
- Facebook’s RoBERTa, as a more robust and optimized version of the original BERT model
- ELMo model with deep contextualized word representations, created by the Allen Institute
- Flair contextual string embeddings created by Zalando. For better performance, we used both forward and backward embeddings
- XLNet model which applies autoregressive pretraining and outperforms BERT on a multitude of tasks
- XLM-RoBERTa, a large multilingual model based on Facebook’s RoBERTa
Moreover, we looked for models pretrained on financial document corpora and used FinBERT, trained by the Hong Kong University of Science and Technology specifically for financial communication analysis. FinBERT is trained on about 4.9 billion tokens: corporate 10-K and 10-Q reports (2.5B tokens), earnings call transcripts (1.3B tokens), and analyst reports (1.1B tokens). Although it was originally trained for financial sentiment analysis tasks, it can easily be used as a general language model. It fits our needs and goals well, so it is used alongside the previously described models.
For our experiments, we used an HPZ Data Science Workstation with two Nvidia RTX 8000 GPUs (48 GB of VRAM each) and 370 GB of RAM. Given the availability of such resources, we considered using the largest versions of the language models discussed above. The following parameters were used:
- learning rate set to 0.1;
- 100 epochs for training;
- LSTM hidden layer size set to 256;
- early stopping (patience) parameter equal to 3 (initial runs were performed with the patience parameter set to 10, but this did not prove to be beneficial);
- batch size of 32, which is the more conservative setting;
- batch size of 512, to test performance for cases where larger datasets have to be processed.
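The parameters above could be passed to Flair roughly as follows. This is a sketch under several assumptions rather than the code used in the experiment: it assumes a recent Flair release (older releases use `make_tag_dictionary` instead of `make_label_dictionary`), a hypothetical two-column CoNLL-style data folder, and `roberta-large` as an example of a "largest version" model. The function name and its arguments are illustrative.

```python
# Hyperparameters as listed in the post.
LSTM_HIDDEN_SIZE = 256
TRAIN_PARAMS = {
    "learning_rate": 0.1,
    "max_epochs": 100,
    "patience": 3,          # epochs without improvement before the learning rate is annealed
    "mini_batch_size": 32,  # 512 in the large-batch runs
}


def train_tagger(data_folder, output_dir, model_name="roberta-large"):
    """Train a Bi-LSTM-CRF tagger on top of a pre-trained language model.

    data_folder is assumed to contain train/dev/test files with one token and
    its NER tag per line (a hypothetical layout, not the post's actual files).
    """
    # Imported lazily so the configuration above can be inspected without Flair installed.
    from flair.datasets import ColumnCorpus
    from flair.embeddings import TransformerWordEmbeddings
    from flair.models import SequenceTagger
    from flair.trainers import ModelTrainer

    corpus = ColumnCorpus(data_folder, {0: "text", 1: "ner"})
    label_dict = corpus.make_label_dictionary(label_type="ner")

    tagger = SequenceTagger(
        hidden_size=LSTM_HIDDEN_SIZE,
        embeddings=TransformerWordEmbeddings(model_name),
        tag_dictionary=label_dict,
        tag_type="ner",
        use_crf=True,  # CRF layer on top of the Bi-LSTM
    )
    ModelTrainer(tagger, corpus).train(output_dir, **TRAIN_PARAMS)
    return tagger
```

Switching to another language model would then only require changing `model_name` (or swapping the embeddings class), which is what makes comparing several models comparatively cheap.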
Results for the Ontonotes dataset
Initial runs were performed with a batch size of 32.
The training performance on the validation set is shown in the figure below.
Performance in terms of precision and F1-score is summarized in the two tables below, one per metric.

Precision

| Entity | bert-ner | elmo-ner | finbert-ner | flair-ner | glove-char-ner | roberta-ner | xlnet-ner |
|---|---|---|---|---|---|---|---|
| CARDINAL | 0.76 | 0.78 | 0.74 | 0.77 | 0.75 | 0.77 | 0.75 |
| DATE | 0.80 | 0.79 | 0.77 | 0.80 | 0.78 | 0.79 | 0.78 |
| EVENT | 0.53 | 0.57 | 0.49 | 0.53 | 0.61 | 0.62 | 0.61 |
| FAC | 0.68 | 0.69 | 0.53 | 0.71 | 0.62 | 0.64 | 0.70 |
| GPE | 0.88 | 0.89 | 0.85 | 0.89 | 0.88 | 0.89 | 0.89 |
| LANGUAGE | 0.90 | 0.80 | 0.56 | 0.89 | 0.73 | 0.78 | 0.54 |
| LAW | 0.58 | 0.51 | 0.42 | 0.61 | 0.54 | 0.43 | 0.59 |
| LOC | 0.62 | 0.68 | 0.63 | 0.72 | 0.67 | 0.64 | 0.69 |
| MONEY | 0.88 | 0.87 | 0.86 | 0.86 | 0.85 | 0.86 | 0.86 |
| NORP | 0.82 | 0.87 | 0.78 | 0.86 | 0.84 | 0.82 | 0.88 |
| ORDINAL | 0.65 | 0.69 | 0.59 | 0.69 | 0.67 | 0.68 | 0.68 |
| ORG | 0.80 | 0.84 | 0.76 | 0.82 | 0.80 | 0.83 | 0.82 |
| PERCENT | 0.86 | 0.86 | 0.85 | 0.86 | 0.86 | 0.85 | 0.86 |
| PERSON | 0.88 | 0.89 | 0.82 | 0.87 | 0.86 | 0.87 | 0.84 |
| PRODUCT | 0.65 | 0.71 | 0.74 | 0.73 | 0.66 | 0.67 | 0.64 |
| QUANTITY | 0.74 | 0.76 | 0.71 | 0.76 | 0.64 | 0.74 | 0.77 |
| TIME | 0.58 | 0.60 | 0.54 | 0.61 | 0.56 | 0.64 | 0.60 |
| WORK_OF_ART | 0.69 | 0.73 | 0.66 | 0.64 | 0.62 | 0.68 | 0.55 |

F1-Score

| Entity | bert-ner | elmo-ner | finbert-ner | flair-ner | glove-char-ner | roberta-ner | xlnet-ner |
|---|---|---|---|---|---|---|---|
| CARDINAL | 0.81 | 0.80 | 0.76 | 0.78 | 0.78 | 0.80 | 0.77 |
| DATE | 0.85 | 0.83 | 0.81 | 0.84 | 0.82 | 0.82 | 0.82 |
| EVENT | 0.54 | 0.58 | 0.45 | 0.54 | 0.52 | 0.62 | 0.61 |
| FAC | 0.64 | 0.60 | 0.41 | 0.60 | 0.47 | 0.60 | 0.59 |
| GPE | 0.91 | 0.91 | 0.86 | 0.91 | 0.90 | 0.92 | 0.91 |
| LANGUAGE | 0.75 | 0.67 | 0.43 | 0.70 | 0.64 | 0.61 | 0.52 |
| LAW | 0.57 | 0.52 | 0.31 | 0.56 | 0.51 | 0.44 | 0.55 |
| LOC | 0.64 | 0.69 | 0.62 | 0.71 | 0.70 | 0.66 | 0.71 |
| MONEY | 0.90 | 0.88 | 0.88 | 0.89 | 0.86 | 0.88 | 0.88 |
| NORP | 0.85 | 0.89 | 0.80 | 0.87 | 0.87 | 0.85 | 0.88 |
| ORDINAL | 0.75 | 0.77 | 0.71 | 0.76 | 0.75 | 0.77 | 0.77 |
| ORG | 0.83 | 0.87 | 0.74 | 0.85 | 0.82 | 0.85 | 0.83 |
| PERCENT | 0.88 | 0.88 | 0.87 | 0.88 | 0.88 | 0.87 | 0.88 |
| PERSON | 0.90 | 0.90 | 0.82 | 0.89 | 0.88 | 0.89 | 0.87 |
| PRODUCT | 0.59 | 0.67 | 0.50 | 0.64 | 0.55 | 0.65 | 0.53 |
| QUANTITY | 0.71 | 0.70 | 0.63 | 0.70 | 0.60 | 0.67 | 0.72 |
| TIME | 0.63 | 0.63 | 0.57 | 0.62 | 0.59 | 0.63 | 0.63 |
| WORK_OF_ART | 0.58 | 0.67 | 0.29 | 0.58 | 0.51 | 0.65 | 0.50 |
With a batch size of 512, training of the ELMo-based tagger failed with RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED and could not be processed; the largest batch size at which it could be run was only 64. Training the tagger with both Flair embedding models and a batch size of 512 also failed, so the batch size had to be reduced to 256 to complete training successfully. Results per entity type are not shown in this post, but those who are interested can find them in the GitHub repository with the relevant code.
| Model | F1-Micro (batch size = 32) | F1-Macro (batch size = 32) | Time (batch size = 32) | F1-Micro (batch size = 512) | F1-Macro (batch size = 512) | Time (batch size = 512) |
|---|---|---|---|---|---|---|
| roberta-ner | 0.8657 | 0.7685 | 9.3 | 0.847 | 0.7485 | 3.1 |
| xlnet-ner | 0.8522 | 0.7559 | 8.1 | 0.8452 | 0.7397 | 3.6 |
| elmo-ner | 0.8515 | 0.7507 | 9 | | | |
| xlm-roberta-ner | 0.8462 | 0.7318 | 10.5 | 0.8399 | 0.7213 | 3.3 |
| finbert-ner | 0.795 | 0.6663 | 10.1 | 0.7927 | 0.6623 | 2.9 |
| flair-ner | 0.8464 | 0.7397 | 8.2 | 0.8412 | 0.7396 | 4.4 |
| bert-ner | 0.8453 | 0.7408 | 7.8 | 0.849 | 0.7345 | 2.7 |
The results indicate that using large batches had only a marginal effect on the final scores (F1-Micro changed by less than 0.02 for every model), while the time required for training was reduced by a factor of roughly 2 to 3.5, depending on the model.
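The speedup per model can be checked directly from the training times in the table above. The short sketch below only does that arithmetic; the time unit is not stated in the post, which does not matter for a ratio. ELMo is left out because it has no large-batch run, and the flair-ner large-batch run used a batch size of 256 rather than 512.

```python
# (time at batch size 32, time at the large batch size), copied from the table above
TRAIN_TIMES = {
    "roberta-ner": (9.3, 3.1),
    "xlnet-ner": (8.1, 3.6),
    "xlm-roberta-ner": (10.5, 3.3),
    "finbert-ner": (10.1, 2.9),
    "flair-ner": (8.2, 4.4),  # large-batch run used 256, not 512
    "bert-ner": (7.8, 2.7),
}

# Ratio of small-batch to large-batch training time
speedups = {model: small / large for model, (small, large) in TRAIN_TIMES.items()}

for model, factor in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {factor:.2f}x faster")
```

The ratios range from about 1.9 (flair-ner) to about 3.5 (finbert-ner).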
Server load analysis
We also measured GPU load during the training process using the nvidia-smi utility. This tool provides a multitude of GPU usage and utilization metrics; the following ones were of interest to us:
- GPU utilization: defined as “percent of the time over the past sample period during which one or more kernels was executing on the GPU”. It is used to measure the level of GPU utilization during the training process;
- Memory utilization: the documentation defines it as “percent of the time over the past sample period during which global (device) memory was being read or written”. In this experiment, it helps to evaluate the efficiency of memory use;
- Used memory percentage: the percentage of GPU memory in use at a particular time.
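The three metrics above can be sampled programmatically through nvidia-smi's CSV query interface. The following is a sketch of one way to do it, not the monitoring code used in the experiment; the function names and the dictionary keys are our own, and the sample line in the usage note is made up.

```python
import subprocess

# nvidia-smi query fields: kernel-time utilization, memory read/write utilization,
# and used/total memory in MiB
QUERY_FIELDS = ["utilization.gpu", "utilization.memory", "memory.used", "memory.total"]


def parse_gpu_line(line):
    """Parse one CSV line (noheader, nounits) into the three metrics."""
    gpu_util, mem_util, mem_used, mem_total = (float(v) for v in line.split(","))
    return {
        "gpu_util_pct": gpu_util,
        "mem_util_pct": mem_util,
        "used_mem_pct": 100.0 * mem_used / mem_total,
    }


def sample_gpus():
    """Return one metrics dict per GPU; requires nvidia-smi on the PATH."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_gpu_line(line) for line in result.stdout.strip().splitlines()]
```

For example, `parse_gpu_line("45, 31, 12288, 49152")` gives a GPU utilization of 45%, a memory utilization of 31%, and a used memory percentage of 25%. Calling `sample_gpus()` periodically during training and averaging the values would give mean utilization figures of the kind discussed below.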
We measured server load for both small and large batches. The figures below describe GPU utilization over the whole training period for a batch size of 32. GPUs were generally utilized at a level of 40-50%, which suggests underutilization; this level stayed similar for each model.
We also calculated mean resource utilization values over the whole training process. To our surprise, the ELMo-based model proved to be rather resource-demanding, as it used almost all available GPU memory. The XLNet and XLM-RoBERTa models also required a larger amount of memory, yet this is less surprising, as those models are really large and fully loading them onto the GPU requires a significant amount of resources. As expected, GPU utilization is much higher with a large batch size; the figure below illustrates this as well. Unfortunately, these results do not include ELMo-based model training, which failed with any batch size larger than 64. Again, the mean resource utilization chart indicates that there were still resources available for even larger batch sizes if required; however, this might be difficult for XLNet- and XLM-RoBERTa-based model training.
Training tagger for financial document tagging
Finally, we trained NER taggers on the tagged SEC data. Their performance results are presented below. Due to the relatively small training dataset, it did not take long to train the taggers: training took less than 10 minutes for each model.
Precision

| Entity | elmo-ner | finbert-ner | flair-ner | roberta-ner | xlnet-ner |
|---|---|---|---|---|---|
| LOC | 0.50 | 0.52 | 0.79 | 0.52 | 0.54 |
| MISC | 1 | 1 | 0 | 1 | 0 |
| ORG | 0.36 | 0.60 | 0.46 | 0.47 | 0.55 |
| PERSON | 0.96 | 0.92 | 0.92 | 0.96 | 0.95 |

F1-Score

| Entity | elmo-ner | finbert-ner | flair-ner | roberta-ner | xlnet-ner |
|---|---|---|---|---|---|
| LOC | 0.55 | 0.57 | 0.69 | 0.58 | 0.60 |
| MISC | 0.44 | 0.25 | 0 | 0.44 | 0 |
| ORG | 0.44 | 0.56 | 0.50 | 0.55 | 0.58 |
| PERSON | 0.96 | 0.94 | 0.92 | 0.97 | 0.94 |
The results above show that the FinBERT-based tagger is quite competitive with the other taggers. It achieved the highest precision in detecting organization names (0.60), although the XLNet-based tagger reached a slightly higher F1-score for this entity type (0.58 versus 0.56).
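Since only precision and F1-score are reported, recall can be recovered from them: from F1 = 2PR / (P + R) it follows that R = F1 * P / (2P - F1). The sketch below applies this to a few values from the results above. The inputs are rounded to two decimals, so the recovered recall is approximate; the function name is our own.

```python
def recall_from(precision, f1):
    """Recover recall from precision and F1, using R = F1 * P / (2P - F1)."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        # Happens when precision and F1 are both 0: no correct predictions at all
        return 0.0
    return f1 * precision / denominator


# (precision, F1) pairs taken from the SEC results
examples = {
    "elmo-ner MISC": (1.0, 0.44),
    "finbert-ner MISC": (1.0, 0.25),
    "finbert-ner ORG": (0.60, 0.56),
    "xlnet-ner ORG": (0.55, 0.58),
}

for name, (p, f1) in examples.items():
    print(f"{name}: precision={p:.2f}, F1={f1:.2f}, recall~{recall_from(p, f1):.2f}")
```

This makes the MISC rows easier to read: a precision of 1 with an F1-score of 0.44 corresponds to a recall of only about 0.28, so the few MISC predictions were correct but most MISC entities were missed. For ORG, the recovered recall is about 0.53 for FinBERT and about 0.61 for XLNet.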
| Model | F1-Micro | F1-Macro | Time |
|---|---|---|---|
| roberta-ner | 0.8206 | 0.636 | 9.5 |
| finbert-ner | 0.8143 | 0.58 | 7.8 |
| xlnet-ner | 0.8143 | 0.5293 | 7.8 |
| flair-ner | 0.7987 | 0.5275 | 6.8 |
| elmo-ner | 0.7857 | 0.5982 | 8.1 |
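The gap between F1-Micro and F1-Macro in this table follows from how the two averages are computed: the macro average is the unweighted mean of the per-entity F1-scores, so a rare class with a low score (here MISC) pulls it down regardless of how few examples it has. As a quick check, the sketch below averages the rounded per-entity F1-scores reported earlier; the results match the F1-Macro column up to rounding.

```python
from statistics import mean

# Per-entity F1-scores for the SEC taggers, as reported in the post (rounded to 2 decimals)
PER_ENTITY_F1 = {
    "elmo-ner": {"LOC": 0.55, "MISC": 0.44, "ORG": 0.44, "PERSON": 0.96},
    "finbert-ner": {"LOC": 0.57, "MISC": 0.25, "ORG": 0.56, "PERSON": 0.94},
    "flair-ner": {"LOC": 0.69, "MISC": 0.0, "ORG": 0.50, "PERSON": 0.92},
    "roberta-ner": {"LOC": 0.58, "MISC": 0.44, "ORG": 0.55, "PERSON": 0.97},
    "xlnet-ner": {"LOC": 0.60, "MISC": 0.0, "ORG": 0.58, "PERSON": 0.94},
}

# Macro F1: every entity type counts equally, no matter how frequent it is
macro_f1 = {model: mean(scores.values()) for model, scores in PER_ENTITY_F1.items()}

for model, value in macro_f1.items():
    print(f"{model}: macro F1 ~ {value:.4f}")
```

The micro average, in contrast, pools all entity predictions before computing the score, so it is dominated by the frequent and well-recognized types such as PERSON.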
Final points
In this post, we explored how to train models for named entity recognition using recent state-of-the-art architectures and pre-trained language models. While selecting the optimal configuration can be quite challenging, the performance of the taggers in this experiment did not differ very significantly (although the RoBERTa-based tagger achieved the best results). Moreover, we tested a BERT model pretrained specifically for financial communication, and it turned out to be useful when NER is targeted at processing financial documents. As next steps, one could improve the tagger further: optimize its architecture (e.g., try Bi-GRU instead of Bi-LSTM), perform fine-tuning, or apply other size and performance optimizations. It would also be interesting to apply this model to other tasks which could benefit from NER, such as matching or linking entities across various financial documents and extracting relations between different entities or subjects.
The code of the whole experiment, together with preprocessed datasets, notebooks, and results, is available on GitHub at https://github.com/Intellerts/bert-ner
Darius Dilijonas
CTO & Head of Data Science Team; Partnership Professor at Vilnius University
Paulius Danėnas
Data scientist / Software developer / Researcher / ML Engineer