We are currently experiencing a huge shortage of data scientists. In fact, the European Commission has identified the need for 346,000 more data scientists by 2020 across the continent, and organisations are struggling to fill this void.
Presently, companies are looking for data scientists who have computer science skills, knowledge of statistics, and domain expertise relevant to problems specific to their business. Such candidates are proving difficult to find; however, companies may find success by concentrating on the third proficiency: domain expertise.
Unfortunately, this skill is often overlooked, and given its business focus, that oversight can prove detrimental to companies on the lookout for data scientists. Domain expertise is needed to make judgement calls throughout the development of any analytical model. It facilitates the differentiation between correlation and causation, between noise and signal, and between an anomaly worth further investigation and one that can be overlooked. It is incredibly hard to teach because domain knowledge requires on-the-job experience, time to develop and mentorship.
This type of expertise is usually apparent in research and engineering departments that have established cultures grounded in knowledge of the products they design and build. Familiar with the systems they work on, these teams frequently use technical computing tools and statistical methods within their design procedures, meaning the step up to the machine learning algorithms and big data tools of the data analytics world is far less daunting.
Industries are beginning to acknowledge data science as an important differentiator, engineering included. Alongside domain knowledge, engineers need flexible and scalable environments that allow them to use the tools of the data scientist. Requirements vary from problem to problem: employees may need traditional analysis techniques such as optimisation and statistics, data-specific techniques such as signal and image processing, or newer functionality including machine learning algorithms. Having these tools organised in one environment is crucial, given the high cost of learning a new tool for each technique.
Being up-to-date and accommodating
How, then, can engineers with domain expertise gain access to newer techniques such as machine learning?
Machine learning identifies underlying trends and structure in data by fitting a statistical model to a dataset. Yet with so many models to choose from, it is hard to know which will be most suitable when presented with new data. Testing and evaluating a selection of model types can take too much time with 'bleeding edge' machine learning algorithms: each has an interface unique to the algorithm and to the criteria of the researcher who developed it, making it far too time intensive to try a range of different models and compare approaches. One solution is to let engineers try out trusted machine learning algorithms in a secure environment, which supports best practices such as preventing over-fitting.
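The comparison-and-over-fitting point above can be made concrete with a minimal sketch. This hypothetical example compares two simple candidate models behind one common interface, scoring each with k-fold cross-validation so every model is judged on data it was not trained on; the model and function names are illustrative, not any particular library's API.

```python
import random

def mean_model(train_xy):
    """Baseline: predict the mean of the training targets, ignoring the input."""
    mean_y = sum(y for _, y in train_xy) / len(train_xy)
    return lambda x: mean_y

def linear_model(train_xy):
    """Least-squares fit of y = a*x + b on the training pairs."""
    n = len(train_xy)
    sx = sum(x for x, _ in train_xy)
    sy = sum(y for _, y in train_xy)
    sxx = sum(x * x for x, _ in train_xy)
    sxy = sum(x * y for x, y in train_xy)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda x: a * x + b

def cross_validate(model, data, k=5):
    """Mean squared error averaged over k held-out folds."""
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        test = folds[i]
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        predict = model(train)
        errors.append(sum((predict(x) - y) ** 2 for x, y in test) / len(test))
    return sum(errors) / k

random.seed(0)
# Synthetic data: y = 2x + 1 plus a little noise.
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.1)) for x in range(50)]
scores = {m.__name__: cross_validate(m, data)
          for m in (mean_model, linear_model)}
```

Because every candidate is evaluated only on held-out folds, a model that merely memorises its training data cannot win the comparison, which is exactly the over-fitting safeguard described above.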
Gartner recently stated that engineers with domain expertise 'can bridge the gap between mainstream self-service analytics by business users and the advanced analytics techniques of data scientists. They are now able to perform sophisticated analysis that would previously have required more expertise, enabling them to deliver advanced analytics without having the skills that characterise data scientists.'
Alongside the evolution of technology, businesses must rapidly analyse, verify and visualise reams of data to provide up-to-date insights and exploit business opportunities. Rather than spending time and money seeking out those rare data scientists, organisations can stay ahead of the pack by empowering their engineers to carry out data science, providing a flexible tool environment that allows engineers and scientists to become data scientists and makes data accessible to the masses.
Diving into data analytics technologies
The tsunami of data being created in the digital age provides businesses with an opportunity to optimise processes and provide differentiated products. A new set of algorithms and infrastructure has emerged that allows organisations to use key data analytics techniques such as big data or machine learning to capitalise on these opportunities.
Additionally, this new infrastructure behind big data and machine learning brings with it a host of technologies that support the iterative process of building a data analytics algorithm. It is this early stage of building the algorithm that can set a business up for success. It involves trying several strategies, such as finding other sources of data, different machine learning approaches and feature transformations.
Given the potentially unlimited number of combinations to try, it is crucial to iterate quickly. Domain experts are well suited to this task, as they can use their knowledge and intuition to avoid approaches that are unlikely to yield strong results. The faster an engineer with domain knowledge can combine their knowledge with the tools that enable quick iterations, the faster the business can gain a competitive advantage.
As more and more data is generated, systems need to evolve to process it all. In this big data space, open source projects such as Hadoop and Spark (both part of the Apache Software Foundation) are making it easier and cheaper to store and analyse large amounts of data, and have the potential to greatly impact engineers' work. For engineers accustomed to working with data in files on desktop machines, on network drives, or in traditional databases, these new tools require a different way of accessing the data before analysis can even be considered.
In many cases, artificial data silos and inefficiencies can be created – such as when someone needs to be contacted to pull data out of the big data system each time a new analysis is performed.
Another challenge engineers face when working with big data is the need to change their computational approach. When data is small enough to fit in memory, the standard workflow is to load the data in and perform the computation, which is typically fast because the data is already in memory. With big data, however, there are often disk reads and writes, as well as data transfers across networks, which slow down computations.
When engineers are designing a new algorithm, they need to be able to iterate quickly over many designs. The result is a new workflow that involves grabbing a sample of the data and working with it locally, enabling quick updates and easy usage of helpful development tools such as debuggers. Once the algorithm has been vetted on the sample, it is then run against the full dataset in the big data system.
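The sample-first workflow above can be sketched in a few lines. This is an illustrative example, not a real big data API: the dataset, the planted anomaly and the `detect_anomalies` function are all made up, and in practice step 2 would execute on the cluster rather than locally.

```python
import random

random.seed(42)
# Stand-in for a dataset living in a big data system.
full_dataset = [random.gauss(100.0, 5.0) for _ in range(100_000)]
full_dataset[12345] = 250.0  # a planted anomaly

# Step 1: grab a small sample and develop the algorithm locally,
# where iteration is fast and a debugger can be attached.
sample = random.sample(full_dataset, 1_000)

def detect_anomalies(readings, n_sigma=4.0):
    """Flag readings more than n_sigma standard deviations from the mean."""
    n = len(readings)
    mean = sum(readings) / n
    std = (sum((r - mean) ** 2 for r in readings) / n) ** 0.5
    return [r for r in readings if abs(r - mean) > n_sigma * std]

# Develop and vet the algorithm against the local sample first...
_ = detect_anomalies(sample)

# Step 2: ...then run the same, unchanged code against the full dataset.
anomalies = detect_anomalies(full_dataset)
```

The point is that the code vetted on the sample is identical to the code run at scale, so nothing is rewritten between prototyping and production.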
The solution to these challenges is a system that lets engineers use a familiar environment to write code that runs both on the data sample locally and on the full dataset in the big data system. Data samples can be downloaded, and algorithms prototyped locally. New computational models that utilise a deferred evaluation framework are then used to run the algorithm on the full dataset in a performance-optimised manner. For the iterative analysis common to engineering and data science workflows, this deferred evaluation model is key to reducing the time an analysis takes to complete on a full dataset, which can often be on the order of minutes or hours.
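Deferred evaluation can be illustrated with a minimal sketch: each transformation is recorded rather than executed, and nothing runs until a result is actually requested, at which point the whole pipeline executes in a single pass. The `DeferredArray` class here is hypothetical; real frameworks such as Spark apply the same pattern at cluster scale.

```python
class DeferredArray:
    """Records transformations lazily; executes them only on collect()."""

    def __init__(self, source):
        self._source = source
        self._ops = []          # recorded, not yet executed

    def map(self, fn):
        self._ops.append(("map", fn))
        return self

    def filter(self, pred):
        self._ops.append(("filter", pred))
        return self

    def collect(self):
        """Run the whole recorded pipeline in one pass over the source."""
        result = []
        for item in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif kind == "filter" and not fn(item):
                    keep = False
                    break
            if keep:
                result.append(item)
        return result

pipeline = (DeferredArray(range(10))
            .map(lambda x: x * x)
            .filter(lambda x: x % 2 == 0))
squares = pipeline.collect()  # only now does any computation happen
```

Because the framework sees the entire pipeline before running it, it can fuse the steps into one pass over the data instead of materialising an intermediate dataset after every operation, which is what makes iteration on large datasets affordable.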
Big data technologies have been a key enabler in the growth of data science. With large amounts of data collected, new algorithms were needed to reason on this data, which has led to a boom in the use of machine learning.
Machine learning is used to identify the underlying trends and structures in data, and is split into unsupervised learning and supervised learning.
In unsupervised learning, we try to uncover relationships in data, such as groups of data points that are similar. For example, we may want to look at driving data to see if there are distinct modes that people operate their cars in. From cluster analysis, we may discover different trends such as city versus highway driving.
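The driving example above can be sketched with a tiny one-dimensional k-means, written out by hand rather than taken from a library. The trip speeds are synthetic and the two driving modes are assumptions of the illustration.

```python
import random

random.seed(1)
# Synthetic average trip speeds: a city mode and a highway mode.
city = [random.gauss(30.0, 5.0) for _ in range(50)]       # ~30 km/h trips
highway = [random.gauss(100.0, 10.0) for _ in range(50)]  # ~100 km/h trips
speeds = city + highway  # the algorithm sees only the unlabelled mixture

def kmeans_1d(points, centres, iterations=20):
    """Lloyd's algorithm in one dimension."""
    for _ in range(iterations):
        # Assign each point to its nearest centre.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)),
                          key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Move each centre to the mean of its assigned points.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)

centres = kmeans_1d(speeds, centres=[min(speeds), max(speeds)])
```

Without ever being told which trip was which, the two centres settle near the city and highway means, which is exactly the kind of structure cluster analysis uncovers.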
In supervised learning, we are given input and output data, and the goal is to train a model that, given new inputs, can predict the new outputs. Supervised learning is commonly used in applications such as predictive maintenance, fraud detection and face recognition in images.
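To contrast with the unsupervised case, here is a minimal supervised sketch in the spirit of predictive maintenance: the model is trained on labelled input/output pairs and then predicts labels for unseen inputs. A nearest-centroid classifier keeps the example short; the sensor readings and labels are invented for illustration.

```python
def train_centroids(examples):
    """examples: list of (features, label). Returns label -> mean features."""
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

def predict(centroids, features):
    """Return the label whose centroid is closest in Euclidean distance."""
    def dist(centre):
        return sum((a - b) ** 2 for a, b in zip(features, centre)) ** 0.5
    return min(centroids, key=lambda lab: dist(centroids[lab]))

# (vibration, temperature) readings labelled by inspection outcome
training = [
    ((0.2, 60.0), "healthy"), ((0.3, 62.0), "healthy"),
    ((0.9, 85.0), "failing"), ((1.1, 90.0), "failing"),
]
centroids = train_centroids(training)
label = predict(centroids, (1.0, 88.0))  # a new, unseen reading
```

The labelled outputs in the training data are what make this supervised: the model learns the mapping from readings to outcomes, then applies it to readings it has never seen.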
Combined, big data and machine learning are poised to bring new solutions to long-standing business problems. The underlying technology, in the hands of domain experts who are intimately familiar with these business problems, can yield significant results. For example, engineers at Baker Hughes used machine learning techniques to predict when pumps on their gas and oil extraction trucks would fail.
They collected nearly a terabyte of data from these trucks, then used signal processing techniques to identify relevant frequency content. Domain knowledge was crucial here, as they needed to be aware of other systems on the truck that might show up in sensor readings, but that weren’t helpful at predicting pump failures. They applied machine learning techniques that can distinguish a healthy pump from an unhealthy one. The resulting system is projected to reduce overall costs by $10m. Throughout the process, their knowledge of the systems on the pumping trucks enabled them to dig into the data and iterate quickly.
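The general approach described above can be sketched as follows. This is a hypothetical illustration, not Baker Hughes' actual method: the signals are synthetic, the 25 Hz fault frequency and the threshold are made-up assumptions, and the single-bin DFT stands in for the signal processing used to extract relevant frequency content.

```python
import cmath
import math

def dft_amplitude(signal, freq_hz, sample_rate):
    """Amplitude of one frequency component via a single-bin DFT."""
    n = len(signal)
    s = sum(signal[k] * cmath.exp(-2j * math.pi * freq_hz * k / sample_rate)
            for k in range(n))
    return 2.0 * abs(s) / n

sample_rate = 1000  # Hz
t = [k / sample_rate for k in range(1000)]  # one second of samples

# Synthetic vibration signals: both pumps share a normal 5 Hz component;
# the failing pump adds a strong component at the assumed fault frequency.
healthy = [0.1 * math.sin(2 * math.pi * 5 * x) for x in t]
failing = [0.1 * math.sin(2 * math.pi * 5 * x)
           + 0.8 * math.sin(2 * math.pi * 25 * x) for x in t]

def pump_is_failing(signal, fault_freq=25, threshold=0.4):
    """Classify pump health from the amplitude at the fault frequency."""
    return dft_amplitude(signal, fault_freq, sample_rate) > threshold

```

This is where domain knowledge does the real work: knowing which frequencies correspond to pump wear, and which come from other systems on the truck, is what separates a useful feature from noise.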
By leveraging tools for processing big data and applying machine learning, engineers such as those at Baker Hughes are well positioned to tackle problems and improve business outcomes. With their domain knowledge of these complex systems, engineers can take these tools far beyond traditional web and marketing applications.