Subsidence, where the ground beneath a building sinks or collapses, is a living nightmare for property owners: it often causes extensive structural damage that costs tens of thousands of euros to repair. However, a new AI-enabled predictive model could reduce its devastating impact.
After analyzing more than 136 million records, our foundation repair model could identify buildings with a high risk of subsidence, allowing the region to make crucial savings on both manual investigation and any ensuing repair costs.
Zaanstad’s Sinking Feeling
Subsidence is a serious problem in Zaanstad, a municipality of 152,000 residents where approximately 70,000 buildings are built on wooden piles in soft ground. The wood is susceptible to bacterial damage, which reduces the foundations’ load-bearing capacity.
Consequently, one-third of the municipality’s total housing stock is predicted to have foundation problems, but the true extent of this risk is unknown for the vast majority of properties. This is because it is both expensive and time-consuming to manually investigate a property’s foundations and fix any damage, with an average repair costing between 30,000 and 50,000 euros per house.
So, Intellerts was tasked with answering this question for Zaanstad: can a predictive model accurately assess the quality of the foundations of a building to reduce these costs and timescales?
Big Data for a Big Problem
From a data analytics perspective, this was a complex challenge. Bad foundations do not necessarily lead to catastrophic subsidence right away, and many other factors also had to be considered: the soil type (mostly clay or peat), the water level, the presence of trees that may draw moisture from the ground underneath, the drainage system, and vibrations from any nearby building work. Intellerts’ data scientists had plenty of information to work with: 140 gigabytes of data, equivalent to around 136 million records.
The data came not only from the municipality and the housing association Parteon, but also from external parties such as the KNMI, the country’s meteorological institute, and the Kadaster government agency. “Linking all these files was a time-consuming task,” explains Martin Haagoort, managing director and data scientist at Intellerts. “There were a considerable number of transformations required to get all the data at the same level so that it could be analyzed as a whole.” Satellite data containing detailed records was also used in the analysis. By mathematically analyzing the satellites’ radar images, the subsidence of each building could be determined to the nearest millimeter.
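The idea of bringing data "to the same level" before linking can be illustrated with a minimal sketch. It assumes hypothetical record layouts and field names (`building_id`, `soil_type`, `subsidence_mm`) that are not taken from the project; it only shows point-level measurements being averaged to one value per building and then joined to the building attributes.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical building attributes, one row per building (field names are assumptions).
buildings = [
    {"building_id": "B1", "soil_type": "peat"},
    {"building_id": "B2", "soil_type": "clay"},
    {"building_id": "B3", "soil_type": "peat"},
]

# Hypothetical satellite measurements in millimeters, several points per building.
measurements = [
    {"building_id": "B1", "subsidence_mm": 2.0},
    {"building_id": "B1", "subsidence_mm": 4.0},
    {"building_id": "B2", "subsidence_mm": 1.0},
]


def aggregate_to_building(rows, key, value):
    """Bring point-level rows to building level by averaging `value` per `key`."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row[value])
    return {k: mean(v) for k, v in grouped.items()}


def link(building_rows, per_building, column):
    """Left join: keep every building, use None where no measurement exists."""
    return [
        {**b, column: per_building.get(b["building_id"])}
        for b in building_rows
    ]


avg_subsidence = aggregate_to_building(measurements, "building_id", "subsidence_mm")
linked = link(buildings, avg_subsidence, "avg_subsidence_mm")
```

Keeping every building in the result, even those without a measurement, matters here because the model is later applied to buildings for which no measurements or expert estimates exist.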
Using all of this information, Intellerts developed a predictive model with a range of cutting-edge machine learning algorithms.
By using algorithms that continually learn from data, machine learning ensures that new trends, patterns and insights can be uncovered that may not have been identified by manual analysis methods alone. The resulting foundation repair model was then used to categorize the foundations of every building in Zaanstad, even though measurements or expert estimates were not available for most properties. “This has unprecedented advantages in terms of planning and costs in the management of foundation repair,” according to Levinus Jongmans, who is responsible for foundation repair at the Municipality of Zaanstad.
The results from this model were also in agreement with Parteon’s existing building information and foundation reports. “The outcome was very encouraging,” says Jurgen de Ruiter, Parteon’s CFO. “Not only is the model of high value, but it gives extensive information that allows separation into five risk categories, from low (monitoring is enough) to very high (action required).”
“Now, we have on record that 11,000 homes are at risk. That is between 15 and 20 percent of the total housing stock in Zaanstad. We have a better idea of which buildings we have to work on, and when. This means we can make crucial savings on investigation costs.”
Intellerts helps a wide range of businesses unlock the power of AI and data science to stay ahead of the competition.
Going Deeper on Modeling with HP Data Science Workstations
To achieve our goals, we had to create a discriminative classifier that would distinguish between buildings that require inspection, monitoring, or intervention. The accuracy of this classifier is a core requirement, as failing to identify buildings that need attention or supervision would result in significant losses. However, finding an optimal classifier by hand is a time-consuming and tedious task.
Fortunately, recent developments in automated machine learning (AutoML) enable an intelligent search for optimal classifiers. AutoML tools may also identify whole pipelines and find additional relevant features by automating preprocessing, feature construction, and feature selection together with classifier selection and optimization. This can be less computationally expensive than an exhaustive search over the full parameter space, which may become infeasible in high dimensions and may require significant resources, plus additional effort to use them efficiently through distributed computation. We therefore decided to employ and test an AutoML library that provides these capabilities out of the box.
After an initial search, we settled on the TPOT library, which applies genetic programming to search over thousands of possible parameter combinations for the best pipeline it can find. It is customizable through the available configurations (custom ones may be used as well) and custom tuning. It can also run in parallel using Dask, with the added benefit that execution can be monitored in the Dask dashboard. By its nature, AutoML neither guarantees an optimal solution nor ensures the same solution for the same dataset. The final pipeline should therefore be treated as a guideline or recommendation for further tuning and for implementing the final classifier.
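To make the search idea concrete, here is a toy evolutionary loop. It is a sketch only and not TPOT's actual algorithm: TPOT evolves whole pipeline trees with genetic programming, whereas this example mutates two hypothetical hyperparameters (`learning_rate`, `max_depth`) and scores them with a synthetic `fitness` function that stands in for cross-validated accuracy.

```python
import random


def fitness(candidate):
    """Synthetic stand-in for cross-validated accuracy (an assumption).

    The score peaks at learning_rate=0.1 and max_depth=5.
    """
    learning_rate, max_depth = candidate
    return 1.0 - (learning_rate - 0.1) ** 2 - 0.01 * (max_depth - 5) ** 2


def random_candidate(rng):
    """Draw a random point from the (hypothetical) search ranges."""
    return (rng.uniform(0.01, 1.0), rng.randint(1, 10))


def mutate(candidate, rng):
    """Perturb a candidate slightly while staying inside the search ranges."""
    learning_rate, max_depth = candidate
    learning_rate = min(1.0, max(0.01, learning_rate + rng.gauss(0, 0.05)))
    max_depth = min(10, max(1, max_depth + rng.choice([-1, 0, 1])))
    return (learning_rate, max_depth)


def evolve(generations=10, population_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_candidate(rng) for _ in range(population_size)]
    best_per_generation = [max(fitness(c) for c in population)]
    for _ in range(generations):
        # Offspring are mutated copies of randomly chosen parents.
        offspring = [
            mutate(rng.choice(population), rng) for _ in range(population_size)
        ]
        # Survivors are the fittest individuals among parents and offspring.
        ranked = sorted(population + offspring, key=fitness, reverse=True)
        population = ranked[:population_size]
        best_per_generation.append(fitness(population[0]))
    return population[0], best_per_generation


best, history = evolve()
```

Because the starting population and the mutations are random, a different `seed` may end on a different candidate, which mirrors the caveat that such a search guarantees neither an optimal nor a repeatable result.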
We limited TPOT to 10 generations and left the population size at 100. This means that it had to evaluate 1,100 candidates in total (the initial population of 100 plus 100 for each of the 10 further generations). As our initial features had been carefully crafted and tested, we decided that a larger population or more generations would add too much overhead, and we focused on selecting the classification technique. TPOT can select among the classifiers available in the scikit-learn library, such as naive Bayes, decision trees, k-nearest neighbors, linear SVM, logistic regression, stochastic gradient descent (SGD), random forests, extra trees, and gradient boosting classifiers; it can additionally search over deep neural networks implemented in PyTorch and GPU-accelerated classifiers available in the CUDA ML library (cuML, part of the NVIDIA RAPIDS framework).
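The evaluation count quoted above follows from simple arithmetic, sketched below. The helper `total_evaluations` is hypothetical; the parameter names mirror the `generations`, `population_size`, and `offspring_size` settings of the classic TPOT API, and the sketch assumes that each generation produces as many offspring as the population size when no offspring size is given.

```python
def total_evaluations(population_size, generations, offspring_size=None):
    """Initial population plus the offspring produced in every generation."""
    if offspring_size is None:
        # Assumption: offspring per generation equals the population size.
        offspring_size = population_size
    return population_size + generations * offspring_size


# Settings described in the text: 10 generations, population size 100.
settings = {"generations": 10, "population_size": 100}
evaluated = total_evaluations(**settings)  # 100 + 10 * 100 = 1100
```

Doubling either setting roughly doubles the number of evaluations, which is why both were kept modest here.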
We did not consider the latter because of deployment constraints. However, they look like an interesting option for further experiments once the dataset becomes much more extensive. We set aside a quarter of the dataset as a holdout subset so that the final classifier could be evaluated on unseen data.
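A holdout split of this kind can be sketched in a few lines. This is a minimal illustration, not the project's code: the helper name `holdout_split`, the seed, and the use of a plain random shuffle are assumptions (the text does not say whether the split was stratified by class).

```python
import random


def holdout_split(rows, test_fraction=0.25, seed=42):
    """Randomly set aside `test_fraction` of the rows as an unseen test set."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_indices = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_indices]
    test = [row for i, row in enumerate(rows) if i in test_indices]
    return train, test


# Placeholder data: 1,000 integers standing in for building records.
rows = list(range(1000))
train, test = holdout_split(rows)
```

The holdout rows must stay out of the AutoML search entirely; otherwise the reported test accuracy would no longer reflect performance on unseen data.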
After running TPOT on our dataset, we obtained the best-performing pipeline found by the search. It was no surprise that TPOT selected a tuned gradient boosting (GBM) classifier. It achieved 92.19% test accuracy, a significant improvement over the 87% accuracy achieved previously with random forest classifiers without any specific tuning.
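One way to read the size of this improvement is in terms of errors rather than accuracy. The short calculation below uses only the two figures quoted above and assumes that both were measured on comparable test data, which the text does not state.

```python
def error_rate(accuracy_pct):
    """Share of misclassified cases, in percent."""
    return 100.0 - accuracy_pct


previous_accuracy = 87.0   # untuned random forest, from the text
tuned_accuracy = 92.19     # tuned gradient boosting, from the text

# Absolute gain in percentage points.
gain_points = tuned_accuracy - previous_accuracy

# Fraction of the previous errors that the tuned model removes.
relative_error_reduction = (
    error_rate(previous_accuracy) - error_rate(tuned_accuracy)
) / error_rate(previous_accuracy)
```

Under that assumption, the gain of about 5.19 percentage points corresponds to roughly 40% fewer misclassified buildings.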
Factor 4 in Speed with HP Data Science Workstations
An investigation of the other classifiers considered during the search indicated that random forest, neural network, and XGBoost classifiers were also among the best-performing ones. Moreover, classifier selection took a little more than 2 hours on the HP Z8 workstation with 80 cores, thanks to the high degree of parallelization.
For comparison, we ran the same workload on a regular gaming desktop PC (Intel Core i7 with 8 cores, 16 GB of RAM, and an NVIDIA GTX 1060 GPU with 6 GB of VRAM). It took more than 8 hours, roughly four times as long.
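The factor in the section title can be checked with a quick calculation. The run times are the rounded figures from the text ("a little more than 2 hours" and "more than 8 hours"), so the result is approximate; the helper name `speedup` is just for illustration.

```python
def speedup(baseline_hours, accelerated_hours):
    """How many times faster the accelerated run is than the baseline."""
    return baseline_hours / accelerated_hours


desktop_hours = 8.0      # rounded from "more than 8 hours"
workstation_hours = 2.0  # rounded from "a little more than 2 hours"

approx_speedup = speedup(desktop_hours, workstation_hours)  # 4.0

# Ratio of core counts (80 vs. 8), shown only for context.
core_ratio = 80 / 8  # 10.0
```

The speedup is smaller than the ratio of core counts, which is not unusual: the two machines use different processors, and not every part of the workload runs in parallel.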
We can conclude that recent advancements in hardware push the boundaries of automated machine learning and data science further.