HighTech and Innovation

The production of hydrocarbon resources at an oil field is accompanied by the formation of scale inside the reservoir rock, which impairs its permeability and hinders flow. Historically, the phenomenon has been attributed to the effect of ions; nevertheless, considerable ambiguity remains about their significance relative to other factors, and about the relative effectiveness of different ion types. The present work applies a data mining strategy to unveil the hierarchy of parameters driving the process within the two major rock categories, sandstone and carbonate, with respect to a target functionality. The functionalities considered are maximizing oil recovery and minimizing permeability impairment/scale damage. A pool of experimental as well as field data, accumulating the bulk of the available literature data, was used for this purpose. The methods used for data analysis in the present work included the Bayesian Network, Random Forest, Deep Neural Network, and Recursive Partitioning. The results indicate a shifting importance for the different ion species, altering under each functionality, with no ion ranked as the most influential parameter in any case. For the oil recovery target, our results quantify a distinction between the sources of an ion of a single type in terms of its influencing rank in the process. This latter deduction is the first proposal of its kind, suggesting a new perspective for research. Moreover, the machine learning methodology was found to capture the data reliably, as evidenced by the minimal errors in the bootstrapped results.


Introduction
As a long-standing issue in petroleum production, the formation of scales continues to impede flow and imposes an economic burden on the upstream sector estimated in excess of billions of dollars worldwide [1]. Scale formation reduces the permeability of the formation [2], which adversely affects the recovery of hydrocarbon resources. Given its importance, several studies have focused on understanding the effects of different parameters on the scale formation phenomenon and on proposing relevant mechanisms. According to the literature, scale formation at the oilfield is linked with operational parameters, such as field type, as well as laboratory parameters, such as the concentrations of selected ions, the latter tested in both inhibitor-free and inhibitor-containing environments [3][4][5][6]. Nevertheless, the theories proposed for the deposition mechanism are non-overlapping, and considerable ambiguity remains about the influencing rank of the identified parameters, a question which this work attempts to address.
A plausible deduction on the actual interplay between the parameters affecting the formation damage/oil recovery process can only be derived by considering all the parameters involved simultaneously. This calls for a big-data framework, with a favorably large sample size, to which machine learning strategies can subsequently be applied. In practice, however, the available data in the literature adheres to a study conducted on a given functionality: maximizing oil recovery, or minimizing permeability impairment/scale damage. It is therefore logical to construct a specific database for each target functionality, for which the data is separately available.
The application of machine learning strategies has been widely practiced in oil and gas development. These attempts have covered aspects of enhanced oil recovery [7][8][9][10][11][12][13][14], fracture detection [15], development plan optimization [15,16], dynamic production prediction [18][19][20][21] and asphaltene precipitation prediction [22]. Some studies have also focused on applying machine learning strategies to model permeability impairment due to mineral scale deposition [23][24][25] and to predict the success of an inhibition scenario in the field [4]. The bulk of these works have adopted an Artificial Neural Network (ANN) technique for their analysis [26], although some hybrid strategies have also been tested. In essence, these hybrid methods adjust the neuron weights (inside ANN settings) through meta-heuristic algorithms, such as the Imperialist Competitive Algorithm (ICA), the Genetic Algorithm (GA), Particle Swarm Optimization (PSO), or a combination thereof (HGAPSO). The modifications have reportedly improved overall accuracy; nevertheless, some of the developed models can become trapped in local optima, making their predictions unreliable over a certain range of the data spectrum [23]. This creates a need for other machine learning techniques to be evaluated for the same target.
A common feature of the recent machine-learning investigations on reservoir mineral scale prediction [23][24][25] has been the adoption of a single data bank as the model input, which reports experimental results on sandstone rocks. This brings a limitation to the established results (algorithm efficiency or parameter importance rank) as being specific to the given rock type, unless tested otherwise. The present work contributes to the existing literature in this field in several ways: testing the efficiency of new algorithms within both sandstone and carbonate rocks, and providing an in-depth view of the parameter importance rank for a specific functionality. In this regard, the authors have also accumulated a dataset of experimental results on oil recovery from carbonate rocks, from both the literature and our own experiments, with an extended parameter list that includes the source of ions, so that a further importance-level classification could be deduced from the data processing. A flowchart is presented in Figure 1 to illustrate our research methodology.

Data
The data used in the present work was obtained from the open literature as well as our own experiments [4,25,27-30]. As explained earlier, the data was collected so as to target three main functionalities: minimizing permeability impairment (I), minimizing the possibility of scale damage in the field (II), and maximizing oil recovery from the matrix (III). As such, three distinct sets of data were acquired, with essentially different parameter lists. The list alters slightly under each category owing to the original recording scheme. In essence, the lists include parameters related to the fluid/matrix properties as well as the experimental/field conditions under which the data was obtained. Tables 1 to 3 provide a description of the parameters considered under each functionality. The embedding rock type for the data in category (I) is sandstone, whereas in the other two categories the data refers to the carbonate case.

[Table fragment; recoverable entries: porosity of the core (-); 6 So, initial oil saturation of the core (-); 7 Ko, relative permeability of oil in the core (mD); 8 Kw, relative permeability of water in the core (mD); 9 CaCO3, weight percent of the core (-); 12 Acid number of oil (mg KOH/g oil)]

Methods
Several methods have been employed in the present study for regression as well as classification of parameters, including the Bayesian Network (BN), Classification and Regression Trees (CART), Random Forest (RF) and Deep Neural Network (DNN). In order to keep this manuscript within a reasonable length, only the BN and RF methods are explained in this section, as they outperformed the other applied techniques in terms of their established accuracy [31,32].

Bayesian Network
A Bayesian network belongs to a class of graphical models which concisely represent the probabilistic dependencies between a given set of (random) variables $\mathbf{X} = \{X_1, X_2, \dots, X_n\}$ in the form of a directed acyclic graph (DAG). The DAG is shaped such that its nodes represent the variables and its arrows represent probabilistic dependencies between the nodes. In this structure, an arrow goes from an influencing parent node to an influenced child node, in a one-directional way. Such a graphical structure enables estimation of the joint probability distribution. Provided that a variable only depends on its parent nodes, the DAG defines a factorization of the joint probability distribution into a set of local probability distribution functions, one for each variable, in which the form of the factorization is given by the Markov property of the network. The Markov framework takes the product of the conditional (local) probability distributions associated with each variable $X_i$ as the (global) joint probability distribution of the variables in $\mathbf{X}$ [33]. For the continuous case, the factorization of the joint density function can be obtained following Nagarajan et al. (2013) [34]:

$$f(\mathbf{X}) = \prod_{i=1}^{n} f(X_i \mid \Pi_{X_i}) \quad (1)$$

in which $\Pi_{X_i}$ represents the set of parents of $X_i$. For random variables of a discrete nature, the factorization of the joint probability distribution is obtained by:

$$P(\mathbf{X} = \mathbf{x}) = \prod_{i=1}^{n} P(X_i = x_i \mid \Pi_{X_i}) \quad (2)$$

For each three disjoint subsets of nodes in the DAG, say (A, B, C), a directed separation (d-separation) criterion is evaluated. Assuming node set C to d-separate nodes A and B, then along every sequence of arcs between nodes A and B there exists a node $v$ which either is positioned in C (not having any converging arcs), or has converging arcs (being pointed to along the network path by two arcs) while neither $v$ nor the nodes that can be reached from it (its descendants) are in C [34].
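As an illustration of this factorization, the following minimal sketch computes a joint probability as the product of local conditional distributions over a hypothetical three-node DAG (the node names and probability values are illustrative only, not taken from the paper):

```python
# Minimal sketch of the BN factorization P(X) = prod_i P(X_i | parents(X_i))
# for a hypothetical DAG with a converging connection: Field -> Scale <- SO4.

p_field = {"A": 0.6, "B": 0.4}                       # P(Field)
p_so4 = {"low": 0.7, "high": 0.3}                    # P(SO4)
p_scale = {                                          # P(Scale | Field, SO4)
    ("A", "low"): {"yes": 0.1, "no": 0.9},
    ("A", "high"): {"yes": 0.5, "no": 0.5},
    ("B", "low"): {"yes": 0.2, "no": 0.8},
    ("B", "high"): {"yes": 0.7, "no": 0.3},
}

def joint(field, so4, scale):
    """Joint probability via the Markov factorization over the DAG."""
    return p_field[field] * p_so4[so4] * p_scale[(field, so4)][scale]

print(joint("A", "high", "yes"))   # 0.6 * 0.3 * 0.5 ≈ 0.09
```

Because the local tables are proper conditional distributions, the joint probabilities over all value combinations sum to one, as required of a valid factorization.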
As the situation of a converging connection violates the d-separation criterion for the child node (Figure 2a), it can be assumed that the parent nodes (A and B) are not independent given the child node. As such, the Markov property stipulates:

$$P(A, B \mid C) \neq P(A \mid C)\,P(B \mid C) \quad (3)$$

For the other two scenarios, the serial and diverging connections (Figures 2b and 2c), the corresponding relations (Equations 4 and 5, respectively) express the conditional independence of A and B given C:

$$P(A, B \mid C) = P(A \mid C)\,P(B \mid C) \quad (4, 5)$$

Figure 2. The graphical separation for the fundamental connections in a DAG: converging (a), serial (b) and diverging (c)
The BN learning process aims to find an optimal structure in addition to its underlying parameters. In this respect, two approaches have been devised. The first analyzes the probabilistic relationships governed by the Markov property of Bayesian networks with conditional independence tests, and subsequently constructs a graph that satisfies the corresponding d-separation statements (constraint-based algorithms). The other approach assigns a score to each candidate BN and maximizes it with a heuristic algorithm (score-based algorithms) [34].
Once a Bayesian network has been established, approximate inference on an unknown value can be made by taking advantage of the BN's fundamental properties, with the added advantage of evading the curse of dimensionality, owing to its use of only the local distributions [35]. In other words, the posterior probability of a target node can be computed from data generated by applying stochastic simulation to the distribution network for a large number of cases. For this purpose, two algorithms have been proposed: Logical Sampling (LS) and Likelihood Weighting (LW). Traversing the nodes from the parent nodes down to the child nodes, the LS algorithm generates a case by selecting values for each node, weighted by the probability of those values occurring at random. At each step, the weighting probability is either the prior or the conditional probability table entry for the sampled parent values. An instantiation of all the nodes in the BN is created once the whole structure has been visited. The collection of instantiations enables estimation of the posterior probability of a node given the evidence. The LW algorithm works in a similar way, except that it adds the fractional likelihood of the evidence combination to the run count, instead of one [35].
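The likelihood-weighting idea can be sketched on a toy two-node network (A -> B; the node names and probabilities are hypothetical, not from the paper): the non-evidence node is sampled from its prior, while the evidence node contributes its likelihood to the sample weight instead of being sampled.

```python
import random

random.seed(0)

p_a = {True: 0.3, False: 0.7}                         # P(A)
p_b_given_a = {True: {True: 0.8, False: 0.2},         # P(B | A)
               False: {True: 0.1, False: 0.9}}

def lw_query(evidence_b, n=100_000):
    """Estimate P(A=True | B=evidence_b) by likelihood weighting."""
    num = den = 0.0
    for _ in range(n):
        a = random.random() < p_a[True]               # sample non-evidence node
        w = p_b_given_a[a][evidence_b]                # weight by evidence likelihood
        num += w * a
        den += w
    return num / den

est = lw_query(True)
# Exact posterior: 0.3*0.8 / (0.3*0.8 + 0.7*0.1) = 0.24/0.31 ≈ 0.774
print(est)
```

Each run count is incremented by the fractional likelihood `w` rather than by one, which is precisely the distinction between LW and logical sampling noted above.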

Random Forest
The Random Forest is a class of ensemble learning techniques which aggregates a collection of random decision trees. The individual trees are not necessarily optimal and are randomly perturbed. Such diversity enables a more extensive exploration of the space of tree predictors, enhancing the RF predictive performance. Each tree is composed of root, branch and leaf nodes, and is generated based on bootstrap sampling from the original training data. The optimal node-splitting feature is selected, for each node of a tree, from a set of $m$ features randomly drawn from a feature space of size $M$ [36]. If the number of selected features is less than the size of the feature space ($m < M$), the node-splitting feature selection decreases the correlation between different trees, which in turn makes the average response of multiple regression trees have a lower expected variance than that of the individual regression trees. Nevertheless, an improvement in the predictive capability of the individual trees, alongside an increase in the number of selected features, can increase the correlation between trees and therefore void any gains obtained from averaging multiple predictions.
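The importance-ranking workflow used throughout this study can be sketched with scikit-learn on a synthetic stand-in dataset (the paper's data is not reproduced here; the feature roles and coefficients below are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: four hypothetical predictors, of which feature 0
# dominates the response and feature 2 contributes weakly.
rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=n)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Rank features by their impurity-based importance, highest first.
ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking)   # feature 0 should rank first, feature 2 second
```

The `feature_importances_` attribute aggregates, over all trees, how much each feature reduces the node cost at its splits, which is the quantity behind the importance tables reported in the Results section.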

Consider $x_{ij}$ and $y_i$ as the training input features and output responses, respectively, for samples $i = 1, 2, \dots, N$ and features $j = 1, 2, \dots, M$. The node-splitting process then attempts to select a feature $j$ from the set of $m$ candidate features and partition node $t$ into two child nodes with respect to a threshold $s$. The child nodes, left and right, satisfy the conditions $(x \in t,\ x_j \le s)$ and $(x \in t,\ x_j > s)$, respectively. Assume the node cost to be the sum of squared deviances:

$$D(t) = \sum_{i \in t} \left(y_i - \mu(t)\right)^2 \quad (6)$$

where $\mu(t)$ denotes the expected value of the responses in node $t$. Consequently, the objective function to optimize is the reduction in cost for partitioning node $t$, the reward function (Equation 7):

$$\Delta D = D(t) - \left[D(t_{\mathrm{left}}) + D(t_{\mathrm{right}})\right] \quad (7)$$
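This node-splitting rule can be made concrete with a minimal sketch (the data is hypothetical) that scans candidate thresholds for a single continuous feature and keeps the one that maximizes the cost reduction:

```python
import numpy as np

def sse(y):
    """Sum of squared deviances of responses about their mean (node cost)."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(x, y):
    """Find the threshold s maximizing D(t) - [D(t_left) + D(t_right)]."""
    parent = sse(y)
    best_s, best_reward = None, -np.inf
    for s in np.unique(x)[:-1]:                  # candidate thresholds
        left, right = y[x <= s], y[x > s]
        reward = parent - (sse(left) + sse(right))
        if reward > best_reward:
            best_s, best_reward = s, reward
    return best_s, best_reward

# Toy example: two clearly separated response clusters.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0.1, 0.2, 0.1, 5.0, 5.1, 4.9])
print(best_split(x, y))   # splits at s = 3.0, separating the two clusters
```

This exhaustive scan over thresholds is what makes node splitting costly, motivating the complexity-reduction schemes discussed next.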
The optimal selection, $s^* \in S$, maximizes the reward function. The node-splitting process can be computationally expensive, as the complexity associated with each node split is of order $O(N)$, requiring the checking of up to $N$ candidate partitions for a continuous feature with $N$ samples [36]. To deal with this complexity in the tree construction process, several recommendations have been proposed, such as applying Principal Component Analysis (PCA) to the response matrix or using basis functions to represent the response variables within the node cost [36]. The corresponding node cost functions to use in tree construction would then respectively take the forms of Equations 8 and 9:

$$D(t) = \sum_{i \in t} \left(\tilde{y}_i - \bar{\tilde{y}}(t)\right)^2 \quad (8)$$

$$D(t) = \sum_{i \in t} \left(c_i - \bar{c}(t)\right)^{\mathrm{T}} \Theta \left(c_i - \bar{c}(t)\right) \quad (9)$$

where $\tilde{y}_i$ is the response obtained based on the principal components, $\bar{\tilde{y}}(t)$ is the principal components' mean vector, $c_i$ is the vector of basis coefficients, $\bar{c}(t)$ is the expected value of the basis-coefficients vector and $\Theta$ represents the matrix of inner vector products.
The RF methodology relies on fitting each tree to bootstrap samples from the training data, $[(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)]$, while the randomized feature selection process is in effect. Assuming that $\bar{R}(x, \Phi)$ represents the partition (leaf) containing a test sample $x$ for the tree $\Phi$, the response of the tree can be obtained by Equation 10, with the corresponding weights $w_i(x, \Phi)$ given by Equation 11 [36]:

$$\hat{y}(x, \Phi) = \sum_{i=1}^{N} w_i(x, \Phi)\, y_i \quad (10)$$

$$w_i(x, \Phi) = \frac{\mathbb{1}\left[x_i \in \bar{R}(x, \Phi)\right]}{\#\left\{k : x_k \in \bar{R}(x, \Phi)\right\}} \quad (11)$$

Should a collection of trees be accumulated, $\Phi_1, \Phi_2, \dots, \Phi_T$, the average RF prediction for the test sample can be obtained by incorporating the average weights over the forest:

$$\hat{y}(x) = \sum_{i=1}^{N} \left(\frac{1}{T} \sum_{t=1}^{T} w_i(x, \Phi_t)\right) y_i \quad (12)$$
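The weighted-average view of the forest prediction described above can be verified numerically with scikit-learn on synthetic stand-in data (bootstrapping is disabled in this sketch so that each tree's leaf mean is taken over the full training set, which keeps the check exact):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

rf = RandomForestRegressor(n_estimators=25, bootstrap=False,
                           max_features=1.0, random_state=0).fit(X, y)

x_test = rng.normal(size=(1, 3))
leaf_train = rf.apply(X)             # leaf index of each training sample, per tree
leaf_test = rf.apply(x_test)[0]      # leaf index of the test sample, per tree

# Per-sample weights: 1/|leaf| for samples sharing the test point's leaf,
# averaged over the forest (the forest-level weights of the weighted average).
weights = np.zeros(len(y))
for t in range(rf.n_estimators):
    in_leaf = leaf_train[:, t] == leaf_test[t]
    weights[in_leaf] += 1.0 / in_leaf.sum()
weights /= rf.n_estimators

manual = float(weights @ y)
print(np.isclose(manual, rf.predict(x_test)[0]))   # True
```

The weighted sum of training responses reproduces `rf.predict` exactly, confirming that the forest prediction is indeed a data-weighted average rather than an opaque black-box output.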

Results
The results obtained are applicable to the three sets of target functionalities based on which the data was originally established. In addition, the results provide a performance benchmark amongst the different methods applied to the selected target functionalities. Table 4 lists the importance rank of the influencing parameters related to target functionality (I), obtained by the RF method. The results are valid for analyzing permeability impairment in a sandstone matrix. In this setting, the ions in the formation water have shown an influencing rank in the order SO4(2-) > Ba2+ > Sr2+ > Ca2+. In essence, two sets of parameters are listed in this table; namely, ion-related parameters (micro-scale) and parameters related to the physics of the field (such as initial permeability) or the experimental conditions devised (macro-scale). As evident from the table, both micro-scale and macro-scale parameters have played an influencing role in the permeability impairment process; nevertheless, the macro-scale parameters have ranked higher in importance. Figure 3 depicts the underlying Bayesian network of influencing parameters related to target functionality (I) for the same set of data. The figure is informative, as it provides the first illustration of the exact interplay amongst the different parameters in the data on target functionality (I). Only the statistically significant arcs have been drawn. Noticeably, the earlier conclusion on the micro-scale/macro-scale parameter importance comparison is also confirmed by the Bayesian network, which places mostly macro-scale parameters in the parent nodes. The results presented in Table 4 are further confirmed by applying tree pruning to the data (Figure 4).
As can be seen, the highly influencing parameters in target functionality (I), such as pore volume, initial permeability or SO4(2-) concentration, are detected at higher splitting positions in the optimally pruned tree for the data in target functionality (I). The numbers at the ends of the branches in Figure 4 refer to the number of cases received at that terminal. Using the improved knowledge of the parameters' roles mentioned above, this work attempted to improve the predictive ability for target functionality (I). Figure 5 shows our results for the final permeability in the sandstone matrix, obtained by the RF method. The results pertain to bootstrapped runs with a fraction of 70% for training, which attest to the accuracy of the method. As evident, our results have outperformed the competing results of the gene expression method of Rostami et al. [25]. Our methodology is also competitive with other hybrid schemes proposed for the same data [23,24], yielding a coefficient of determination (R2) of 0.987 and 0.978 on bootstrapped results for the RF and BN methods, respectively. It should be noted that the mentioned hybrid machine learning attempts on the same data merely report their R2 measurements on the whole data set, and not on a bootstrapped sample, which would have been different if tested otherwise. Table 5 provides a performance benchmark of the different machine learning algorithms analyzed for target functionality (I), reporting the mean error percentile of the bootstrapped results in each case. For the deep learning case, the H2O AutoML scheme was used [37], which allows for the automatic inspection of over 270 neural network models for optimal detection. Based on the results, the RF method provides the most accurate output for target functionality (I), with a mean error percentile of less than 5% on the bootstrapped results.
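The bootstrapped evaluation described above, repeated random splits with a 70% training fraction and held-out scoring, can be sketched as follows on synthetic stand-in data (the split fraction matches the paper; the dataset and model settings here are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = 2 * X[:, 0] - X[:, 3] + rng.normal(scale=0.2, size=400)

scores = []
for seed in range(10):
    # Repeated random 70/30 resampling of the data.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7,
                                              random_state=seed)
    model = RandomForestRegressor(n_estimators=100,
                                  random_state=seed).fit(X_tr, y_tr)
    scores.append(r2_score(y_te, model.predict(X_te)))

print(f"mean R2 = {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

Reporting the spread of held-out R2 over resamples, rather than a single whole-dataset fit, is what distinguishes this evaluation from the hybrid-method comparisons noted in the text.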
In an analogous way, the data related to target functionality (II) were analyzed, and the importance rank of the influencing parameters was ascertained (Table 6). The data related to this functionality were of the categorical type (i.e. TRUE/FALSE), referring to the occurrence of calcium carbonate scale formation in the field [4]. The data was also more limited than that of the other two functionalities in terms of the number of macro-scale parameters listed, mostly considering ion data for a practical purpose. However, the results again indicate that the highest-ranking parameter in the group is of a macro-scale type (i.e. the "field" parameter). As determined by the RF method, the hierarchy of the ion effects in the process has been in the order Ca2+ > Na+ > Mg2+ > HCO3- > SO4(2-) in the injecting fluid, while the pH parameter holds the least influencing rank. Table 7 lists the performance of the different machine learning algorithms for target functionality (II). Given the categorical type of the output, the performance is reported based on the accuracy of the confusion matrix for bootstrapped results. Both the RF and BN methods have outperformed the other machine learning methods in terms of accuracy. This performance comparison makes the RF results more trustworthy, including the deductions made earlier on the ranking hierarchy of the ions. Table 8 lists the RF results obtained for the importance rank of the influencing parameters related to target functionality (III). As evident, the highest-ranking parameter in the list is again of a macro-scale type, in the carbonate matrix. Since the data accounted for the source of ion introduction (injecting or formation water), it was possible to determine the importance rank of each ion species based on its introduction source, which is the first analysis of its kind, to the knowledge of the authors.
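The accuracy-from-confusion-matrix evaluation used for this categorical (TRUE/FALSE scale occurrence) target can be sketched on synthetic stand-in data (the labels and decision boundary below are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: hypothetical scale/no-scale labels.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

cm = confusion_matrix(y_te, clf.predict(X_te))
accuracy = np.trace(cm) / cm.sum()      # correct predictions / all predictions
print(cm)
print(accuracy)
```

The diagonal of the confusion matrix counts correct TRUE and FALSE predictions, so its trace divided by the total is exactly the accuracy figure benchmarked in Table 7.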
The dataset used for target functionality (III) in the present article cannot be distributed due to licensing issues.

Table 4. The importance rank of influencing parameters related to target functionality (I) in sandstone matrix
Based on the RF results (Table 8), the effective ions have shown a different influencing hierarchy compared to the other sectors studied. For a given type of ion, the influencing rank on oil recovery has also differed based on the introduction source into the carbonate matrix. For instance, the importance level of the HCO3- ion in the formation water has been higher than that in the injecting water. For some other ions, such as SO4(2-), the injecting water content has proved more important than the content in the formation water. This finding potentially reveals a more complex phenomenon underlying the oil recovery process, which requires further investigation. On the other hand, the overall effectiveness of the ions shows an altered arrangement compared to the other carbonate case (item II), which nullifies any general deduction on the global effectiveness of one ion over the others, suggesting case dependency.
For instance, in a low-salinity water injection context, a recent study [38] reports that neither the cation/SO4(2-) concentration nor the change in sulfate ion concentration between brines shows effectiveness towards tertiary oil recovery. The data mining results for target functionality (III) indicate that the most influential parameter in the list is of a macro-scale type, a finding which holds across all the functionalities/environments studied.

Conclusion
The data mining results indicate a shifting importance for the confluent effect of the considered ion species, which alters under different environments. This essentially rejects prior propositions on the existence of a global order list for the effectiveness of ions for the selected functionalities. For the carbonate matrix, the random forest results clearly distinguish between the sources of ion introduction into the matrix (injecting or formation water) in terms of the importance rank for oil recovery purposes. This latter conclusion is notable, as it provides the first quantitative confirmation of the significance of the ion source towards its overall functionality, bringing a novel concept to the field. For all the target functionalities/matrix environments studied, the most influential parameter was detected as being of a macro-scale type, which does not include an ion. In other words, in none of the cases studied was an ion ranked as the most influential parameter in the list. The minimal errors obtained over the bootstrapped results indicate that the machine learning methodologies applied have successfully captured the experimental/field data within the major rock types (carbonate and sandstone) over the studied functionalities. The random forest and Bayesian network methods stand out as the most accurate techniques amongst the machine learning strategies applied to the sandstone case, with results outperforming the most recent hybrid method predictions on the same data.

Data Availability Statement
The data presented in this study are available in the article.