I used to be a faculty member in the Department of Bioengineering at the University of Washington in Seattle. My lab's research was motivated by the question of how to engineer microbes that are safe: preventing unwanted mutations and making the behavior of newly implemented cellular functions predictable. To answer these questions, I took diverse mathematical and statistical approaches. One of my recent approaches is Bayesian inference, which I applied to identify and characterize mathematical models that explain observed noisy time-series cellular signals.
Now at Blueprint Consulting Services, I push my boundaries toward industrial applications of various mathematical and statistical approaches, tackling diverse problems in data science including sales and marketing, intelligent crude-oil pump systems, advertisement bidding, and cyber-security, as well as specific fields within the biotech industry.
Time-Series Data with Limited Sample Size
Time series data are often found in diverse areas of industry, and can help answer the following questions, to name a few:
- "How likely is it that individual customers will re-subscribe to my online services?"
- "What are the future sales of my products? What are the probabilities?"
- "When are my devices going to fail to operate? What are their chances of failure in a certain timeframe?"
These questions are often answered by understanding the relationships among data at multiple time points, taking seasonality and trends into account. This is typically done by applying autoregressive integrated moving average (ARIMA) models. However, when time series data are scarce or highly variable, more appropriate analysis methods may need to be introduced. One such alternative is Bayesian inference.
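To make the autoregressive idea concrete, here is a minimal sketch of fitting the simplest such model, an AR(1), by least squares. The data are synthetic and the numbers are purely illustrative, not taken from any of the business cases above:

```python
import numpy as np

# Synthetic illustration (not real sales data): an AR(1) series
# x[t] = phi * x[t-1] + noise, with a true coefficient phi = 0.8.
rng = np.random.default_rng(0)
phi_true = 0.8
x = np.zeros(200)
for t in range(1, 200):
    x[t] = phi_true * x[t - 1] + rng.normal(scale=0.5)

# Least-squares estimate of phi from consecutive (lag-1) pairs.
phi_hat = np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1])
print(f"estimated phi: {phi_hat:.2f}")

# One-step-ahead point forecast from the fitted model.
forecast = phi_hat * x[-1]
```

With 200 observations the estimate lands close to the true value; with only a dozen noisy points it can be far off, which is exactly the small-sample regime where the Bayesian alternative below becomes attractive.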
Bayesian Inference: Approximate Bayesian Computation
Here, I introduce a Bayesian approach that is not yet well recognized in the data science industry, so-called Approximate Bayesian Computation (ABC). This approach is composed of three steps: (1) sampling parameter values of a mathematical model based on prior knowledge of your system, (2) generating synthetic data by running simulations of the model and comparing them with the observed data, and (3) selecting the parameter values that best reproduce the observations. Thus, the ABC method provides a probability distribution over the parameter values of a mathematical model that explains a given dataset.
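The three steps above can be sketched in a few lines as ABC rejection sampling. The observed data, the prior range, and the tolerance are all assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observed data (assumed for illustration): 20 noisy
# measurements from a Normal distribution with unknown mean.
observed = rng.normal(loc=3.0, scale=1.0, size=20)

# Step 1: sample candidate parameter values from the prior
# (here, a wide uniform prior over the unknown mean).
n_draws = 100_000
prior_mu = rng.uniform(-10, 10, size=n_draws)

# Step 2: simulate synthetic data for each candidate and compare
# it to the observed data via a summary-statistic distance.
sim_means = rng.normal(loc=prior_mu[:, None], scale=1.0,
                       size=(n_draws, 20)).mean(axis=1)
distance = np.abs(sim_means - observed.mean())

# Step 3: keep ("accept") only the candidates whose simulations
# land close to the data; these form the approximate posterior.
epsilon = 0.1
posterior = prior_mu[distance < epsilon]

print(f"posterior mean ~ {posterior.mean():.2f}")
```

Note that no likelihood function is ever written down: the simulator plus the distance threshold stand in for it, which is what lets ABC handle models whose likelihoods are intractable.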
This approach can be used for various classes of problems, whether the observed data are (nearly) continuous or discrete, smooth (deterministic) or noisy. More importantly, owing to its Bayesian nature, mathematical models can be trained and improved even on small datasets. The approach does not produce one definitive prediction, but rather a range of predictions with their corresponding probabilities. Such probability-based predictions tell us how confident we are in a forecast, and therefore how confidently actions can be prescribed toward your desired goals.
Forecasting Sales in Supply Chain Management
Let us consider the problem of forecasting sales for wireless phone service providers. In this business, it is important to forecast demand for phones and to supply individual stores right on time. If sales can be forecasted accurately, demand for new phones can be reliably predicted and phones can be supplied in a timely manner.
Here, sales can be affected by a number of factors that fluctuate randomly over time. Forecasting sales is therefore inherently a matter of probability, and Bayesian approaches are well suited to this case.
To forecast sales from historical sales data, mathematical models can be proposed that incorporate various factors such as promotions by the company and its competitors, weekly or monthly visits to its local stores, weekly or monthly sales, and local demographic information. Once the models are built, their parameter values can be inferred by starting from a best guess (more specifically, a prior probability distribution over the values) and then selecting the parameter values that explain the observed data well. This Bayesian approach gives you the updated (posterior) probability distribution of the parameter values based on your observed data, and the procedure can be repeated until the parameter distributions no longer change. The final selection of parameter values provides a collection of sales predictions with their corresponding likelihoods, from which future sales can be forecasted with a given confidence interval.
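As a toy version of this procedure, the sketch below fits a deliberately simple linear-trend sales model by ABC and reports a forecast with an uncertainty interval. The sales history, the model form, the prior ranges, and the 1% acceptance rule are all assumptions made for illustration; a real model would include the promotional, store-visit, and demographic factors described above:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical 12-month sales history (assumed, not real data),
# following a roughly linear trend plus noise.
months = np.arange(12)
sales = 100 + 5 * months + rng.normal(scale=8, size=12)

# Prior guesses for a toy model: sales = base + growth * month.
n = 200_000
base = rng.uniform(50, 150, size=n)
growth = rng.uniform(0, 10, size=n)

# Simulate each candidate model (noise omitted for brevity)
# and measure its misfit to the observed history (RMSE).
sims = base[:, None] + growth[:, None] * months[None, :]
rmse = np.sqrt(((sims - sales) ** 2).mean(axis=1))

# Keep the best-fitting 1% of candidates as the approximate posterior.
keep = rmse < np.quantile(rmse, 0.01)
post_base, post_growth = base[keep], growth[keep]

# Posterior predictive forecast for the next month, with a 90% interval.
forecast = post_base + post_growth * 12
lo, mid, hi = np.percentile(forecast, [5, 50, 95])
print(f"month-12 forecast: {mid:.0f} (90% interval {lo:.0f} to {hi:.0f})")
```

The output is not a single number but a distribution of forecasts, and the width of the 90% interval is exactly the "how confident are we" information that the prose above argues a point forecast cannot provide.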
Decision-tree-based approaches such as Random Forest, XGBoost, and AdaBoost can be used for sales forecasting as well. These approaches can provide high predictive power without overfitting. However, the prescriptive actions that can be derived from them are not intuitive, simply because these approaches build black-box models rather than mechanistic ones. The ABC method may face challenges in coming up with reasonable mathematical models, but once appropriate models are proposed, you can systematically (in the Bayesian way) infer the parameter values and prescribe action plans based on the mechanisms built into the models!
Various machine learning algorithms have been developed to meet the prediction needs of different systems. Bayesian methods such as ABC were developed to bypass the construction of likelihood functions and to apply to various classes of systems, including nonlinear stochastic dynamical systems. This approach can provide a mechanistic understanding of time series data, and even prescriptive action plans for the future, once appropriate mathematical models are proposed based on the underlying mechanisms. The ABC approach has been widely applied in medical, biological, and bioengineering fields, including systems and synthetic biology, population genetics, ecology, epidemiology, and oncology. I believe it will come to be appreciated further in other fields of data science research and industry.
More about Kyung - Kyung Hyuk Kim is a Data Scientist at Blueprint Consulting Services, where he helps deliver meaningful, information-driven solutions for a range of industries, including biotech. He was previously an Acting Assistant Professor in the Department of Bioengineering at the University of Washington, where he received a series of grant awards from the National Science Foundation and led a group of researchers as a Principal Investigator. He holds a Ph.D. in Physics from the University of Washington, Seattle.