Suppose you conduct a series of experiments to check the effect of a large number of input variables on your output variable. You would like to determine the most significant input variables affecting the output variable, but there are too many variables and conducting experiments is hard and expensive. What do you do then?
The purpose of this article is to show how to build a data-driven model from scarce experimental data, strengthen it with the bootstrap method, and validate it against physical results.
We conducted various physical experiments with multiple input variables affecting our output variable. We combined our data set with data sets from the literature, leading to 157 experimental data points. Symbolic regression is used on the experimental data to produce a number of analytical expressions describing the interactive effect of the input variables without prior knowledge of an underlying physical process. We selected a simple model with only one fitting parameter to compare with the experimental data. A sensitivity analysis of the chosen model shows the input variables affecting the predicted output variable in order of importance. A bootstrap method, a statistical resampling technique explained later in the text, determines the precision of the model parameter. In addition, the model indicates the variable spaces for which we need more experiments.
This article is a simplified version of the following scientific article:
R. Thorat and H. Bruining. “Determination of the most significant variables affecting the steady state pressure drop in selected foam flow experiments”. In: Journal of Petroleum Science and Engineering 141 (2016), pp. 144–156. https://doi.org/10.1016/j.petrol.2015.12.001
Introduction
The traditional approach of using the same data both to build the model and to estimate its predictive performance tends to bias the estimate of the model-prediction error. The fitted parameters are tied to the original data set and therefore cannot necessarily be used for different data sets [122]. The approach of cutting the data in half (one half for modeling and the other half for validation) has the drawback of not using precious data points for model building. In addition to criteria for model choice, proper model verification and validation is also important to assess the merit of the selected model. Therefore, for the purpose of validating the model, we used a bootstrap method [51, 52] to generate 50 simulated data sets different from the original data set. A given bootstrap sample consists of original data points, some of which are repeated in the set, some appearing only once and some not at all [123]. We used the standard deviation obtained from the 50 data sets to determine the precision of the fitting parameter of the model.
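As a small illustration of this sampling behaviour (not from the original article), the Python sketch below draws one bootstrap sample of 157 points and counts how many distinct original points it contains; on average about 63% of the points appear at least once, so roughly 1/e (about 37%) are left out of any given sample.

```python
import random

random.seed(0)
n = 157  # number of experimental data points, as in the article

# Draw one bootstrap sample: n indices drawn with replacement.
sample = [random.randrange(n) for _ in range(n)]

# Count how many distinct original points made it into the sample.
unique = len(set(sample))
print(f"{unique} of {n} original points appear in this sample")
```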
The symbolic regression software characterizes the sensitivity of the predicted output variable to each variable in the model. Further, we compare the observed output and the predicted output by plotting them on the Y and X axes, respectively.
Experimental data
Based on our experiments and a literature survey, we selected six independent variables affecting the output variable. We coupled our experimental data with data from the literature to get 157 experimental data points, explained in detail in the original scientific article.
Procedure
In the following subsections we explain the application of symbolic regression to the 157 experimental data points. A procedure is given for an ideal choice of the selected rational expression with fitting parameter A0. Furthermore, we use the simplest bootstrap method to generate 50 simulated data sets. The model from symbolic regression with the fitting parameter A0 is applied to the 50 data sets. A parameter for each simulated data set (AS1 to AS50) is found by variance minimization between the predicted and observed output variable. The deviation of those simulated parameters with respect to the predicted parameter estimates the error in the predicted parameter. The experimental and statistical results in the following subsections are compared for the relationship between the variables and their interactive effect on the output variable.
In the analysis we argue that experiments in the parameter subspaces with insufficient data are necessary to find the interdependence of the variables affecting the output variable. Further, we discuss the merits and drawbacks of symbolic regression and the bootstrap method. We end with some conclusions about the experimental procedure, the symbolic regression and the estimate of the observed output variable.
Statistical modeling
For the modeling part we used Eureqa® [49], a software package which searches for the form of the equations and their fitting parameters simultaneously [44].
Symbolic regression
The software, based on symbolic regression, finds the relation between the independent variables and the dependent variable, i.e. the output. The program first computes numerical partial derivatives between pairs of variables from the data; from the candidate symbolic functions it then derives symbolic partial derivatives for the same pairs of variables and compares the two. The program repeats these steps to get the best solutions. The background article [44] and an article on genetic programming [124] give details.
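The core idea of searching over candidate expressions can be illustrated with a much cruder method than Eureqa's genetic programming: an exhaustive search over a tiny grammar. The data, the hidden relation, and the grammar below are invented for illustration only.

```python
import itertools
import random

random.seed(2)

# Toy data generated from a hidden relation y = x1*x2/x3,
# which the search is supposed to rediscover.
data = []
for _ in range(40):
    x1, x2, x3 = (random.uniform(1, 4) for _ in range(3))
    data.append(((x1, x2, x3), x1 * x2 / x3))

VARS = ["x1", "x2", "x3"]
OPS = ["+", "-", "*", "/"]

def mse(expr):
    """Mean squared error of a candidate expression on the toy data."""
    err = 0.0
    for (x1, x2, x3), y in data:
        err += (eval(expr) - y) ** 2
    return err / len(data)

# Enumerate all two-operator candidates of the form ((a op b) op2 c).
candidates = [f"(({a} {op} {b}) {op2} {c})"
              for a, b, c in itertools.product(VARS, repeat=3)
              for op, op2 in itertools.product(OPS, repeat=2)]

best = min(candidates, key=mse)
print("best expression:", best)
```

Real symbolic regression tools replace the brute-force enumeration with evolutionary search, which scales to much larger expression spaces.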
Fitting parameter for the model to the experimental data:
We follow the criticism attributed to Von Neumann [89] and avoid models with too many fitting parameters. Indeed, we made a trade-off between error and complexity to select a model equation among the expressions; the chosen equation contains three independent input variables and a single fitting parameter.
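The error-versus-complexity trade-off can be sketched as selecting the Pareto front of candidate expressions. The expressions, complexity scores and errors below are hypothetical, not the ones Eureqa actually produced.

```python
# Hypothetical candidates: (expression, complexity, mean squared error).
candidates = [
    ("A0*x1", 2, 0.90),
    ("A0*x1 + x2", 4, 0.95),          # dominated: more complex AND worse
    ("A0*x1*x2", 3, 0.40),
    ("A0*x1*x2/x3", 4, 0.12),
    ("A0*x1*x2/x3 + x4", 6, 0.11),
    ("A0*x1*x2/x3 + x4*x5 - x6", 9, 0.10),
]

def pareto_front(cands):
    """Keep expressions not dominated in both complexity and error."""
    front = []
    for name, c, e in cands:
        dominated = any(c2 <= c and e2 <= e and (c2, e2) != (c, e)
                        for _, c2, e2 in cands)
        if not dominated:
            front.append((name, c, e))
    return front

front = pareto_front(candidates)
for name, c, e in front:
    print(f"{name}: complexity={c}, error={e}")
```

On the front, one then picks the "knee": the expression after which extra complexity buys only a marginal error reduction (here, the model with three variables and one parameter).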
Sensitivity analysis of the variables affecting the output:
The calculation of how a single input variable influences the predicted output at all input data points is as follows. Eureqa uses the average absolute value of the partial derivative of the output variable with respect to the single input variable, the standard deviation of that input variable in the input data, and the standard deviation of the output variable. If the sensitivity value is 0.5, then when the input variable is changed by one standard deviation, the output variable changes by 0.5 of its standard deviation [125]. The percentage of data points for which the change in the output variable is greater than zero is denoted as % positive, i.e. the data points for which an increase in the input variable would lead to an increase in the output variable. The percentage of data points for which the change in the output variable is less than zero is denoted as % negative, i.e. the data points for which an increase in the input variable would lead to a decrease in the output variable. We calculated the magnitude of the positive and negative changes for the respective data points.
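In code, this sensitivity measure might look as follows. The model form, the parameter value and the data points are placeholders invented for the sketch, not the actual expression or data from the article.

```python
import statistics

# Hypothetical model with one fitting parameter A0; the real
# expression is given only in the original scientific article.
A0 = 1.8
def model(x1, x2, x3):
    return A0 * x1 * x2 / x3

# Toy input data standing in for the 157 experimental points.
data = [(1.0, 2.0, 0.5), (1.5, 1.0, 0.8), (2.0, 3.0, 1.2), (0.8, 2.5, 0.6)]

def sensitivity(var_index, h=1e-6):
    """Average |dy/dx| over all points, normalised by std(x)/std(y)."""
    derivs, xs, ys = [], [], []
    for point in data:
        bumped = list(point)
        bumped[var_index] += h
        dy = (model(*bumped) - model(*point)) / h  # numerical partial derivative
        derivs.append(abs(dy))
        xs.append(point[var_index])
        ys.append(model(*point))
    return statistics.mean(derivs) * statistics.pstdev(xs) / statistics.pstdev(ys)

for i, name in enumerate(["x1", "x2", "x3"]):
    print(name, round(sensitivity(i), 3))
```

The sign counting (% positive and % negative) would follow the same pattern, tallying how many of the `dy` values are above or below zero instead of averaging their absolute values.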
Verification of the model:
The model is verified by comparing the observed output variable and the predicted output variable by plotting them on the Y and X axes respectively. A linear relationship, y(x) = y(x|a,b) = a + bx, is considered, where x is the predicted output and y the observed output; the fitting parameter “a” is the intercept and the fitting parameter “b” is the slope of the line. As the error in the observed output was not known, we assumed that all measurements have the same standard deviation. The formulae, derived from minimization of the chi-square merit function, calculate the intercept, the slope and their respective standard deviations (Chapter 15, Modeling of Data, in Numerical Recipes [52]).
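A minimal implementation of this straight-line fit, using the equal-error formulas from Numerical Recipes chapter 15, could look like this; the data points below are invented for illustration.

```python
import math

def fit_line(x, y):
    """Least-squares fit y = a + b*x assuming equal, unknown measurement
    errors (formulas as in Numerical Recipes, chapter 15)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(u * v for u, v in zip(x, y))
    delta = n * sxx - sx * sx
    b = (n * sxy - sx * sy) / delta
    a = (sxx * sy - sx * sxy) / delta
    # Estimate the common standard deviation from the residuals,
    # then propagate it to the uncertainties of a and b.
    sigma2 = sum((v - a - b * u) ** 2 for u, v in zip(x, y)) / (n - 2)
    sd_a = math.sqrt(sigma2 * sxx / delta)
    sd_b = math.sqrt(sigma2 * n / delta)
    return a, b, sd_a, sd_b

# Toy data: predicted output on x, observed output on y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1]
a, b, sd_a, sd_b = fit_line(x, y)
print(f"intercept a = {a:.3f} +/- {sd_a:.3f}, slope b = {b:.3f} +/- {sd_b:.3f}")
```

A slope near 1 and an intercept near 0 would indicate that the model tracks the observations without systematic bias.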
Validation of the model with Bootstrap:
For validation, it is necessary to assume that the data points are independently and identically distributed. The bootstrap procedure involves drawing 157 data points at a time with replacement from the original set by independent random sampling. Because of the replacement, each generated data set (created using Visual Basic in Excel®) has a random fraction of the original points (typically 1/e ≈ 37%) [52] replaced by duplicated original points. Fifty such synthetic data sets, each with 157 data points, are subjected to the same model as the original data. The fitting parameter for each simulated data set is found by minimizing the sum of squares of differences between predicted and observed output with the Generalized Reduced Gradient (GRG) nonlinear engine in Excel®. To obtain the error, we compared the 50 values of the fitted parameters (AS1 to AS50) for the simulated data sets to the original fitted parameter A0.
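The whole validation loop can be sketched in a few lines of Python. The synthetic data stand in for the real experimental table, and the closed-form least-squares fit (valid because the toy model is linear in its single parameter) stands in for Excel's GRG solver.

```python
import random
import statistics

random.seed(1)

# Toy stand-in for the experimental table: f is the model's variable part,
# so the fitted model is y_pred = A0 * f, linear in the single parameter A0.
f = [random.uniform(0.5, 5.0) for _ in range(157)]
y = [1.8 * v + random.gauss(0, 0.3) for v in f]  # synthetic "observed" output

def fit_A0(fs, ys):
    # Least-squares A0 for y = A0*f has a closed form: sum(f*y) / sum(f*f).
    return sum(a * b for a, b in zip(fs, ys)) / sum(a * a for a in fs)

A0 = fit_A0(f, y)

# 50 bootstrap replicates: resample the 157 points with replacement,
# refit, and collect the simulated parameters AS1 ... AS50.
sims = []
for _ in range(50):
    idx = [random.randrange(len(f)) for _ in range(len(f))]
    sims.append(fit_A0([f[i] for i in idx], [y[i] for i in idx]))

spread = statistics.pstdev(sims)
print(f"A0 = {A0:.3f}, bootstrap std = {spread:.3f} "
      f"({100 * spread / A0:.1f}% relative error)")
```

The standard deviation of the 50 simulated parameters, relative to A0, is the precision estimate reported for the fitting parameter.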
RESULTS
Predicted output variable vs. experimental output variable
We calculated the predicted output variable values by using the model equation from the set of expressions generated by symbolic regression on the original experimental data from our work and from the literature. The procedure of generating these equations is briefly discussed in the subsection on statistical modeling, while readers can find the details of the procedure in the work by Schmidt and Lipson [44].
The observed output variable deviated from the predicted one by about 10% for a typical data point. The intercept of the straight line is noted along with its standard deviation. The slope between the observed output variable and the predicted output variable is 0.85 with a standard deviation of 0.03.
Error in fitting parameter
When we applied the model to the 50 bootstrap data sets, we observed a small error of 3% in the fitting parameter A0.
Statistical results vs experimental results
The model does not match the experimental results on the contribution of one input variable, for which the observed output variable increases with an increase in that variable while the other input variables are held constant.
The model shows considerable discrepancy in the fit for the data points with the lowest observed value of the output variable; for these points, the observed output variable in this work is higher than the predicted one. Moreover, the model also shows a large discrepancy for the data points with the highest observed output variable from the literature.
DISCUSSION
How useful is the statistical model really?
The arithmetic expressions given by the symbolic regression allow us to express the observed output variable in terms of only three variables. Equations with fewer variables show a worse match. If we include more data in the symbolic regression, we might have to include other physical parameters to explain the observed output variable. Extending the data set would also force us to consider more parameters on which the predicted output variable depends. However, it is not possible to include every data set that might bring the predicted results closer to the experimental results. Although the equation shows some agreement with the data from the literature, the interactive effect of input variables on the output variable as explained in the original article is not observed. Therefore, we conclude that a relatively important variable masks the relationships between the variables in the subsets of data. We need more experiments for the conditions under which large deviations between observed and predicted output variable occur.
Disclaimer
We interpret the model in a statistical sense: for the available data points, the effect of an input variable on the predicted output variable cannot be viewed in isolation. The model is therefore a data-driven equation and not based on a physical mechanism, as explicitly stated.
CONCLUSION
Symbolic regression: We can apply symbolic regression to our data along with data from the literature to produce a number of analytical expressions without prior knowledge of an underlying physical process.
A simple model: A simple model with a single fitting parameter can describe the output variable with three out of the six variables.
Sensitivity analysis: We used a sensitivity analysis of the chosen model to rank the input variables in order of importance.
Model validation: We validated the model by estimating the error in the model parameter by the applied bootstrap method.
Scope of the model: The scope of the derived data driven model is not to replace the models based on physical processes, i.e. mechanistic models.
Drawbacks: Our data set and the data set from the literature show significant deviations from the chosen symbolic regression model, which shows that the model has limitations.
Limitations: The model from symbolic regression is only able to explain the general behavior and hierarchy of the variables affecting the output variable. The model gives the variable spaces for which we need more experiments. Considering an entire data set shows that the trends obtained from a subset of the data are not necessarily valid for the complete dataset.
REFERENCES
[28] R. Mason, R. Gunst, and J. Hess, Statistical Design and Analysis of Experiments, 2nd ed. (John Wiley & Sons, Inc., Hoboken, New Jersey, 2003).
[44] M. Schmidt and H. Lipson, Distilling free-form natural laws from experimental data, Science 324, 81 (2009).
[45] K. Vladislavleva, K. Veeramachaneni, U.-M. O'Reilly, M. Burland, and J. Parcon, Learning a lot from only a little: Genetic programming for panel segmentation on sparse sensory evaluation data, in Genetic Programming, Lecture Notes in Computer Science, Vol. 6021, edited by A. Esparcia-Alcázar, A. Ekárt, S. Silva, S. Dignum, and A. Uyar (Springer Berlin Heidelberg, 2010) pp. 244–255.
[49] Nutonian, Eureqa® Desktop, (2015), www.nutonian.com/products/eureqa/.
[50] H. Akaike, A new look at the statistical model identification, IEEE Transactions on Automatic Control 19, 716 (1974).
[51] B. Efron and R. Tibshirani, An Introduction to the Bootstrap, Monographs on Statistics and Applied Probability, Vol. 57 (Chapman & Hall/CRC, 1993).
[52] W. H. Press, S. A. Teukolsky, W. Vetterling, and B. Flannery, Numerical Recipes, 3rd ed. (Cambridge University Press, New York, 2007).
[74] G. E. P. Box, Science and statistics, Journal of the American Statistical Association 71, 791 (1976).
[89] F. Dyson, A meeting with Enrico Fermi, Nature 427, 297 (2004).
[122] J. D. Olden and D. A. Jackson, Torturing data for the sake of generality: How valid are our regression models?, Ecoscience 7, 501 (2000).
[123] E. Vittinghoff, D. Glidden, S. Shiboski, and C. McCulloch, Basic statistical methods, in Regression Methods in Biostatistics, Statistics for Biology and Health (Springer US, 2012) pp. 27–67.
[124] K. Veeramachaneni, E. Vladislavleva, and U. M. O'Reilly, Knowledge mining sensory evaluation data: Genetic programming, statistical techniques, and swarm optimization, Genetic Programming and Evolvable Machines 13, 103 (2012).
[125] A. Raynolds, Sensitivity interpretation, (2014), formulize.nutonian.com/forum.