Which solution is financially better for heavy data science tasks? (Public Cloud -vs- HP Z8 G4 + NVIDIA GPUs)

Summary: Companies that take data science seriously shouldn’t be discouraged by the initial investment in a dedicated workstation. As this experiment shows, an investment in an HP Z8 G4 breaks even with the cloud after only 7 months. In two years, the total cost of the cloud reaches almost 120.000 dollars. For roughly that amount, you could buy four HP Z8 G4 machines delivering four times the performance.


Data science is a computationally heavy task. As methods improve and breakthroughs are made, the new state-of-the-art techniques push the boundaries of modern hardware. Natural language processing is a fast-evolving field. Keeping up with the latest achievements, implementing the newest models, and testing their capabilities has become increasingly difficult.

From an IT perspective, there are two main ways you can support your data scientists. You can either invest in local hardware or use the computation power of the cloud. In this business case, we’ll pit an HP Z8 G4 machine with two NVIDIA GPUs against an offering of similar performance from a major cloud provider. Our aim is to find out whether it’s more cost-effective to use the HP Z8 G4 machine, which requires a sizable upfront investment, or to embrace the cloud.

To make our case tangible, we’ll run some realistic data science experiments as a starting point for our comparison. We’ll define a realistic workload and calculate the cost for said workload over time using both the HP Z8 G4 machine and a comparable cloud instance.

In this white paper, we’ll first elaborate on the data science workload we’ll use as a basis for our calculations. Next, we’ll take a look at the hardware required. We’ll first describe how the workload runs on a local desktop machine and then look at a comparable bare metal cloud instance. Finally, we’ll calculate the total cost of both options to find out what is more financially sound: an upfront investment in a local data science workstation such as the HP Z8 G4 or monthly payments to a public cloud provider.

Starting from a realistic workload

For our comparison, we need a realistic data science workload. We decided to experiment with Named Entity Recognition (NER) tasks. NER is a subfield of natural language processing (NLP) with a myriad of fascinating applications. While the underlying models for named entity recognition are complex, the principle is not.

With NER, an algorithm aims to extract information from unstructured text by locating and classifying so-called named entities. Let’s look at a simplified example. You might encounter a document containing the following sentence:

“Big Business Corp. has paid 10.000 dollars in regard to an invoice sent by the Spanish division of Desktop Inc.”

Using NER, an algorithm could recognize [Big Business Corp.] and [Desktop Inc.] as company names, [10.000 dollars] as an amount, [Spanish] as an adjective pertaining to a location, and so forth. Fig. 1 illustrates this further by applying NER to text from Wikipedia. Applying NER to a large amount of unstructured data could yield interesting insights for analysts who make decisions in auditing, investing, and financial reporting, or who need to identify suspicious events. Further analysis of the structured data generated by NER could reveal connections not previously known.
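To make the input and output of NER concrete, the following minimal sketch tags the example sentence with a hand-written lookup table and one regular expression. It is purely illustrative: the models discussed in this white paper learn such patterns from annotated data rather than from fixed rules, and the label names (ORG, MONEY, NORP) as well as the helper `tag_entities` are assumptions made for this example.

```python
import re

# Hypothetical lookup table for this example only; real NER models learn
# entities from annotated training data instead of a fixed list.
KNOWN_ENTITIES = {
    "Big Business Corp.": "ORG",
    "Desktop Inc.": "ORG",
    "Spanish": "NORP",  # nationality-style adjective
}

# Amounts written like "10.000 dollars" (dot as thousands separator).
MONEY_PATTERN = re.compile(r"\d{1,3}(?:\.\d{3})* dollars")


def tag_entities(text):
    """Return (start, end, surface text, label) tuples sorted by position."""
    found = []
    for surface, label in KNOWN_ENTITIES.items():
        for match in re.finditer(re.escape(surface), text):
            found.append((match.start(), match.end(), surface, label))
    for match in MONEY_PATTERN.finditer(text):
        found.append((match.start(), match.end(), match.group(), "MONEY"))
    return sorted(found)


sentence = (
    "Big Business Corp. has paid 10.000 dollars in regard to an invoice "
    "sent by the Spanish division of Desktop Inc."
)
for start, end, surface, label in tag_entities(sentence):
    print(f"{label:5} [{start}:{end}] {surface}")
```

The output is a list of text spans with a label each, which is the structured data that downstream analysis would work with.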

In a real-life scenario, data scientists would train, run, and finetune models to achieve actionable results usable by the business. As we’ll see further along in the white paper, this is a continuous iterative task.

Fig.1: NER extracts information from unstructured text by classifying entities.

Setting up the test

For our test, we’ll analyze the cost of running some popular NER models. We’ll look at financial information specifically, since this is a realistic application of NER. First, we have to pick models to use. Named entity recognition is a well-studied area of natural language processing, so there are plenty of models to choose from. For our experiment, we picked the following models:

  • Google’s BERT (Bidirectional Encoder Representations from Transformers), the model that sparked the wave of transformer-based language models;
  • Facebook’s RoBERTa, a more robust and optimized version of the original BERT model;
  • ELMo, with deep contextualized word representations, created by the Allen Institute for AI;
  • Flair contextual string embeddings, created by Zalando. For better performance, we used both forward and backward embeddings;
  • XLNet, which applies autoregressive pretraining and outperforms BERT on a multitude of tasks;
  • XLM-RoBERTa, a large multilingual model based on Facebook’s RoBERTa.

For training the models, we use data from OntoNotes. OntoNotes contains a massive amount of textual data from telephone conversations, newswire, newsgroups, broadcast news, broadcast conversation, and weblogs. We exclude the available religious texts, as they are irrelevant for training our benchmark models.

In addition, we use data from the US Securities and Exchange Commission (SEC). The documents in this dataset are very relevant for training our financial NER model. The SEC data contains only basic types of entities such as Person, Location, Organization, or Miscellaneous, while the OntoNotes data can be used to train a model to recognize 20 types of entities. We also used FinBERT, a model pretrained by the Hong Kong University of Science and Technology.

The specifics of each model are relevant to data scientists. From a business point of view, however, they are all similar. They need to be trained and will perform NER on a set of unstructured data. The training, running, and finetuning of the models consume a lot of system resources. We performed the experiment with different models to ensure there isn’t any bias in regard to the underlying hardware.

The hardware

First, we’ll run our experiments on local hardware. As NER is a resource-heavy task, we need a capable machine. In this business case, we’re using an HP Z8 G4 Data Science Workstation with two state-of-the-art NVIDIA RTX 8000 GPUs, equipped with 48 GB of VRAM each. The system itself has 376 GB of RAM. This gives us an abundance of system resources to work with, which enables us to use the largest versions of all language models mentioned above. Lastly, the system is equipped with dual Intel Xeon Gold 6242R CPUs (20 cores each), a 4 TB HP Z Turbo data drive, and a 1.450-watt power supply, and it runs Ubuntu 20.04.

Fig.2: The HP Z8 G4 configuration used in our experiment.

NER, just like other natural language applications and deep learning in general, performs best when the software models have access to hardware accelerators like the NVIDIA GPUs. That’s because training and inference aren’t necessarily complex tasks in and of themselves. More specifically, a single calculation required for a single step in the process isn’t that computationally heavy. However, to run a complex model in an acceptable time frame, the hardware must be able to complete a great many small calculations simultaneously.

CPUs are very good at crunching fewer heavy operations, while GPUs as accelerators excel in the parallel execution of smaller instructions, which is what we need now. Furthermore, the amount of memory dictates the size of the model and the amount of data a system can handle. As you can see, our HP Z8 G4 is well equipped for the task with a decent amount of memory and two powerful GPUs.

Testing and tuning

During our experiment, we tried to use our HP machine to the fullest. This means we increased the size of the batches of data to be crunched gradually until some models started to run out of available resources. In general, we tried to optimize the models as much as possible.
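The batch-size tuning described above can be sketched as a simple doubling search. This is an illustration under assumptions, not the procedure used in the experiment: `fits_in_memory` is a hypothetical callback that would run one training step at the given batch size and report whether it completed without exhausting GPU memory.

```python
def largest_fitting_batch_size(fits_in_memory, start=8, limit=4096):
    """Double the batch size until it no longer fits; return the last size that did.

    fits_in_memory: hypothetical callable(int) -> bool supplied by the caller,
    e.g. a function that runs one training step and catches an out-of-memory error.
    Returns None if even the starting size does not fit.
    """
    best = None
    size = start
    while size <= limit and fits_in_memory(size):
        best = size
        size *= 2
    return best


# Stand-in for a real memory check: pretend anything above 96 samples
# per batch exhausts the GPU memory.
def pretend_fits(batch_size):
    return batch_size <= 96


print(largest_fitting_batch_size(pretend_fits))  # prints 64
```

A doubling search only finds the largest power-of-two multiple of the starting size; a finer search between the last fitting and the first failing size could squeeze out a little more.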

Data science is not an easy task. Creating an accurate and efficient model that delivers relevant results may take hundreds of tries. In data science in general and our NER experiment specifically, there is no final state or perfect answer to be found. As a data scientist, you must work on the model and keep working on it, making it ever more relevant and precise. This means you’ll have to do a lot of experiments, tune some parameters on each model, and run them all again.

On average, running one experiment with a single model takes 3 hours and 46 minutes. At first, running all models took 23 hours. From there on out, we finetuned the models over twenty different iterations. After each run, it’s time to evaluate the results, iterate, and go at it again. In the end, running all models took only eight and a half hours using the full GPU capacity of the HP Z8 G4.
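As a quick sanity check on these figures, six models at the average runtime of 3 hours and 46 minutes add up to roughly the 23 hours reported for the first full run. The sketch below assumes that the full set consists of the six models listed earlier.

```python
from datetime import timedelta

MODEL_COUNT = 6  # assumption: the six models in the list above
AVERAGE_RUN = timedelta(hours=3, minutes=46)

# Total time for one pass over all models at the average runtime.
total = MODEL_COUNT * AVERAGE_RUN
print(total)  # 22:36:00

# Ratio between the first full run and the final, tuned run.
final_run = timedelta(hours=8, minutes=30)
speedup = timedelta(hours=23) / final_run
print(f"{speedup:.1f}x faster after tuning")  # 2.7x faster after tuning
```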

This process of running the full set of models, fine-tuning, and iterating will be our benchmark for calculating the cost of data science on both the HP Z8 G4 and in the cloud.

To the cloud

The public cloud claims to offer an interesting alternative to our high-end data science machine. One might be tempted to run the experiments in the cloud. The upfront cost is admittedly significantly lower. However, as we’ve discussed in the previous section, data science isn’t a one-off project but requires continuous iteration and improvement. How will the cloud compare to the desktop when we take the entire lifecycle of a NER project into account, including continuous improvement?

First off, we need an instance comparable to our HP Z8 G4 workstation. AWS offers the g4dn.metal instance. We want a bare metal instance as we intend to run the same software on it, and don’t want to be limited to a specific data science service from one cloud provider.

The bare metal instance is powered by eight NVIDIA T4 Tensor Core GPUs, each with 320 Turing Tensor cores, 2.560 CUDA cores, and 16 GB of memory. The instance comes with 384 GiB of RAM. The CPU is a custom Xeon Scalable processor with up to 64 vCPUs. That’s less than our HP workstation, but the impact shouldn’t be significant on our experimental workload.

When compared to the HP Z8 G4, the AWS bare metal instance is noticeably more powerful in some regards. It has approximately double the number of GPU cores. The cores in the HP Z8 G4, however, operate at nearly twice the clock speed. This means the AWS instance can perform twice as many parallel computations as the desktop, but each calculation takes twice as long. These differences roughly cancel each other out. Other resources are similar. Generally speaking, the HP Z8 G4 and the AWS g4dn.metal will perform on the same level. The HP Z8 G4 is available with flexible configurations, while the AWS instance is not: you must adapt to whichever offering most closely matches your needs.

Fig.3: Even though there are differences in the configuration of the AWS bare metal instance and the HP Z8 G4 workstation, they perform similarly.

It takes a similar amount of time to run our final iteration of experiments on the AWS instance and the HP Z8 G4: 8 hours and 30 minutes. Let’s look at AWS pricing. For our comparison, we’ll choose on-demand pricing as this strategy emphasizes the flexibility of the cloud.

The on-demand price for one AWS g4dn.metal instance is 9,78 dollars per hour. Running one iteration of the experiment takes more than 8 hours, so we count 9 billed hours: 9 hours times 9,78 dollars equals 88,02 dollars for one run. We still have to include VAT. In Belgium and the Netherlands, the standard VAT rate is 21 percent, which brings the final price for one iteration of our NER experiment to 106,50 dollars.
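The per-run price can be reproduced with a few lines of arithmetic. The sketch assumes, as the text does, that the 8,5-hour run is billed as 9 full hours; the function name `on_demand_cost` is introduced here for illustration only.

```python
import math
from decimal import Decimal

HOURLY_RATE = Decimal("9.78")  # g4dn.metal on-demand price per hour, in dollars
VAT_RATE = Decimal("0.21")     # standard VAT rate in Belgium and the Netherlands


def on_demand_cost(runtime_hours, hourly_rate=HOURLY_RATE, vat_rate=VAT_RATE):
    """Return (cost excluding VAT, cost including VAT) for one run.

    The runtime is rounded up to whole hours, following the article's assumption.
    """
    billed_hours = math.ceil(runtime_hours)
    net = billed_hours * hourly_rate
    gross = net * (1 + vat_rate)
    return net, gross


net, gross = on_demand_cost(8.5)
print(net)                               # 88.02
print(gross.quantize(Decimal("0.01")))   # 106.50
```

`Decimal` is used instead of floats so that the currency amounts are not affected by binary rounding.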

Now let’s look at the entire workflow. Our final run takes 8 hours and 30 minutes, but getting there required a lot of optimization. Our very first iteration took 23 hours, and over the course of 20 iterations we brought the runtime down to 8 hours and 30 minutes. Taking the average of the first and the (rounded-up) last runtime, we performed 320 hours ((23h + 9h)/2 * 20 iterations) of calculations with the HP Z8 G4 machine. Using the AWS g4dn.metal instance at 9,78 dollars an hour for 320 hours costs 3.129,60 dollars. Including VAT, the total cost of our experiment in the cloud is 3.786,82 dollars, excluding networking, storage, or extra layers of security.
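The 320-hour figure follows from averaging the first and last runtimes, which amounts to assuming that the runtime fell linearly from 23 hours to a rounded-up 9 hours across the 20 iterations. The sketch below makes that assumption explicit; the linear decline is an illustration, not a measurement reported in this white paper.

```python
from decimal import Decimal
from fractions import Fraction

FIRST_RUN_HOURS = 23
LAST_RUN_HOURS = 9   # 8 hours 30 minutes, rounded up
ITERATIONS = 20

# Assumed runtimes: evenly spaced steps from the first to the last run.
step = Fraction(FIRST_RUN_HOURS - LAST_RUN_HOURS, ITERATIONS - 1)
runtimes = [FIRST_RUN_HOURS - i * step for i in range(ITERATIONS)]

# Exact sum of the assumed runtimes (Fraction avoids float rounding).
total_hours = sum(runtimes)
print(total_hours)  # 320

# On-demand cost of those hours, excluding VAT.
net = int(total_hours) * Decimal("9.78")
print(net)  # 3129.60
```

If the real runtimes dropped quickly in the first few iterations and then flattened out, the true total would be lower than 320 hours, so the estimate depends on how the tuning actually progressed.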


The HP Z8 G4 isn’t cheap. Without significant discounts from HP, the configuration of the data science workstation used costs approximately 30.000 dollars. Running the workstation isn’t free either. At full load, the HP Z8 G4 draws up to 1.450 watts, or 1,45 kWh for every hour of runtime. Using the cost of electricity in the Netherlands at the time of the experiment, the entire 320-hour runtime yields an electricity bill of 53,36 dollars. In the grand scheme of things, this is negligible. Finally, having your own desktop requires some maintenance. Let’s say the maintenance cost is 300 dollars each month. The cost of running the experiment locally for the first month is therefore 30.000 dollars for the machine plus 300 dollars for maintenance and 53 dollars for electricity, for a total of 30.353 dollars.
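The workstation-side arithmetic can be sketched the same way. The electricity price of 0,115 dollars per kWh is not stated in the text; it is the rate implied by the 53,36-dollar bill and is treated as an assumption here, as is the worst-case premise that the machine draws the full rating of its power supply for the entire runtime.

```python
from decimal import Decimal

MACHINE_PRICE = Decimal("30000")        # dollars, approximate price of the configuration
MAINTENANCE_PER_MONTH = Decimal("300")  # the article's maintenance estimate
POWER_DRAW_KW = Decimal("1.45")         # power supply rating, used as worst case
PRICE_PER_KWH = Decimal("0.115")        # assumption: rate implied by the article's bill
HOURS_PER_MONTH = 320


def electricity_cost(hours, power_kw=POWER_DRAW_KW, price_per_kwh=PRICE_PER_KWH):
    """Energy used in kWh multiplied by the price per kWh."""
    return hours * power_kw * price_per_kwh


monthly_electricity = electricity_cost(HOURS_PER_MONTH)
first_month_total = MACHINE_PRICE + MAINTENANCE_PER_MONTH + monthly_electricity
print(monthly_electricity.quantize(Decimal("0.01")))  # 53.36
print(first_month_total.quantize(Decimal("1")))       # 30353
```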

Using AWS for one 320-hour run costs roughly 3.790 dollars. Adding an estimated 30 percent for other necessary services such as storage and networking, we calculate a final monthly price of 4.927 dollars.

Now in a real-life scenario, data scientists wouldn’t just stop running experiments after one month. They’d continue to finetune models and implement new ones. Let’s say they keep running our experiment month after month, each time needing 320 hours of calculations.

Using the HP Z8 G4 workstation, the total cost of running the experiment for seven months, including the initial hardware investment, maintenance, and electricity, is 32.471 dollars. Relying on the AWS g4dn.metal instance for seven months costs 34.489 dollars. In other words, for a dedicated data science project, it only takes 7 months for the investment in an HP Z8 G4 desktop to pay off. From there, the gains only increase. Running experiments on the workstation for one entire year costs 34.236 dollars. If you had opted to use the cloud instead of a local workstation, the cost would have ballooned to 59.124 dollars.
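The break-even point can be checked by accumulating both cost series month by month. The sketch uses the article's rounded figures (353 dollars of monthly running costs for the workstation, 4.927 dollars per month for the cloud); the function name `break_even_month` is hypothetical.

```python
WORKSTATION_PRICE = 30_000      # one-time purchase, in dollars
WORKSTATION_MONTHLY = 300 + 53  # maintenance plus electricity (rounded)
CLOUD_MONTHLY = 4_927           # AWS instance plus other services


def break_even_month(max_months=120):
    """Return the first month in which the cumulative workstation cost
    drops below the cumulative cloud cost, or None if it never does."""
    for month in range(1, max_months + 1):
        workstation_total = WORKSTATION_PRICE + WORKSTATION_MONTHLY * month
        cloud_total = CLOUD_MONTHLY * month
        if workstation_total < cloud_total:
            return month
    return None


print(break_even_month())  # 7
```

Lower monthly usage reduces the cloud bill but not the purchase price, so the break-even month moves out accordingly.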

If the data science team keeps using the workstation to the fullest for two years, the TCO still won’t have exceeded 40.000 dollars (38.472 dollars to be precise). Using the cloud for a similar amount of time costs 118.248 dollars.

[Table: costs per month. Columns for the HP Z8 G4 machine: cost of machine, maintenance, electricity costs, cumulative. Columns for the AWS g4dn.metal On-Demand Instance: monthly price, cumulative. Columns for the AWS g4dn.metal On-Demand Instance plus other services (+30%): monthly price, cumulative.]

Fig.4: Using the HP Z8 G4 workstation in a realistic manner, the initial investment pays off after approximately 7 months compared to the cloud. From there on out, savings accumulate.

Fig.5: The cost increase over time using the cloud compared to using a dedicated workstation is obvious. The initial investment in a data science machine pays off after approximately 7 months, depending on usage.
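The cumulative totals for any month can be regenerated from the monthly figures quoted in the text. This is a sketch based on the article's rounded numbers, not a reproduction of the original table; the helper `cumulative_costs` is introduced for illustration.

```python
def cumulative_costs(month):
    """Return cumulative costs in dollars after the given number of months."""
    workstation = 30_000 + (300 + 53) * month  # purchase + maintenance + electricity
    cloud_instance = 3_790 * month             # AWS instance only, including VAT
    cloud_all_in = 4_927 * month               # instance plus 30% for other services
    return workstation, cloud_instance, cloud_all_in


# Print a small comparison table for a few selected months.
print(f"{'Month':>5} {'HP Z8 G4':>10} {'AWS':>10} {'AWS+30%':>10}")
for month in (1, 6, 7, 12, 24):
    workstation, cloud_instance, cloud_all_in = cumulative_costs(month)
    print(f"{month:>5} {workstation:>10} {cloud_instance:>10} {cloud_all_in:>10}")
```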


It’s clear the public cloud has a place in data science. Had our experiment been a one-off, the AWS instance would have been an interesting solution. However, if you’re planning to run data science tasks on a regular basis, the HP Z8 G4 machine should pay off after approximately 7 months. The more you use the machine, the faster you’ll start saving money compared to the cloud. Our experiment used a 320-hour runtime each month as a baseline, based on real-life training and use of NER models. It’s certainly possible to further maximize the usage of the machine in daily operations.

Companies that take data science seriously shouldn’t be discouraged by the initial investment in a dedicated workstation. As this experiment shows, an investment in an HP Z8 G4 breaks even with the cloud after only 7 months. In two years, the total cost of the cloud reaches almost 120.000 dollars. For roughly that amount, you could buy four HP Z8 G4 machines delivering four times the performance.

