Dendrogram Analysis and Statistical Examination for Total Microbiological Mesophilic Aerobic Count of Municipal Water Distribution Network System

The microbiological quality of water for human consumption is a critical safety aspect that should not be overlooked, especially in facilities for healthcare and the treatment of ill populations. Thus, the biological stability of water is crucial for the distribution network that delivers potable water to the final users for consumption and other human activities. The present work aimed to study a municipal distribution network system for city water within a healthcare facility. The statistical analysis was conducted on data collected over the long term, and the microbiological counts of water samples from different points-of-use were compared using the non-parametric Kruskal-Wallis test. The comparative study involved a preliminary general one-way Analysis of Variance (ANOVA) followed by post hoc pairwise comparison. The statistical study also involved a correlation matrix and a dendrogram to elucidate the level of association between different sections in the network. The ports C4 and C13 were at the trough in the microbiological count, in contrast to C5, which showed the highest average microbial density. Despite a low to moderate level of correlation between the datasets of the water network, the tree diagram (dendrogram) analysis showed remarkable clustering. Use points could be grouped into three dense groups based on abrupt cuts in the similarity value. The study was useful in analyzing the pattern and behavior of the microbial quality in a water distribution network in a specific area of study. This work, in turn, would help in identifying areas for improvement and spotting defects, in addition to assessing the biological stability of the water distribution system. The study could be extended to cover other processed water networks, such as distilled, deionized, and purified water, as well as Water-For-Injection (WFI).


Introduction
With the uninterrupted increase in the number of ill populations, immunocompromised patients, and individuals with impaired health, appropriate control of products for human consumption becomes a critical task, especially in the healthcare industry and settings [1]. One of the important components in everyday human activities is water [2]. There are several quality characteristics that must be considered in the monitoring and control of municipal water [3]. One of the most critical properties is the microbial count, or bioburden content, of the city water.
In this type of dynamic system, the biological stability of the water distribution network is essential to avoid any unexpected excursions in microbial quality [4]. The fewer the fluctuations in the bioburden count (in frequency and magnitude), the lower the risk of an out-of-control (OOC) status and the safer the water for human use and consumption [5]. Previous research has monitored this aspect using Statistical Process Control (SPC) and control or trending charts [6]. However, in a compound or complex system, it is plausible to reduce the dimensionality of the variables through cluster analysis to reveal groupings in terms of microbial count [7]. In turn, this will highlight sections for improvement, defective regions, and acceptable parts.
Considering the importance of the homogeneity of the microbiological count in the water distribution network, the present study aimed to provide a statistical stability analysis of municipal water systems in a selected area of a healthcare facility. The work would cover a two-dimensional analysis of the pattern of the microbial count dispersion as a function of time and location. The distribution of data was analyzed and compared. Correlation and similarity studies could be executed to spot a unique grouping tendency in different segments of the water distribution network system.

Material and Methods
The study subject of this work was focused on the distribution network of the city water system in a selected healthcare facility. This network provides service for different partitions in the plant and each section possessed its own point-of-use port(s). The healthcare facility received its supply of municipal water lines from two sources. These stations were subject to regular monitoring for quality inspection characteristics [8]. An important monitored trait to be considered was the total microbiological aerobic, mesophilic viable count, or Total Viable Count (TVC).
The microbiological sampling was conducted aseptically over a 45-month period for the determination of the water plate count. Water samples were collected in sterile bottles and analyzed by a standard technique described by other researchers [9]. After incubation, the Heterotrophic Plate Count (HPC) was quantified as the number of Colony Forming Units (CFU) per milliliter.
The reported data were collected chronologically in a column database of the plate count for each use point in the municipal water distribution system and processed using statistical software packages. The examination included general descriptive statistics, cumulative histograms, distribution pattern investigation, global variance analysis, multiple comparison tests, correlation matrix analysis, and a tree diagram (dendrogram). XLSTAT Premium version 2021.1.1 was used to draw the correlogram for the correlation matrix and the P value plot [10]. This Excel-integrated software was also used to illustrate cumulative histograms for all dataset columns. GraphPad Prism for Windows version 6.01 [11] was used to create the descriptive statistics, graphical summary, overall bioburden comparison, and multiple pairwise comparisons in the table and figure. Finally, Minitab version 17.1.0 was used for cluster analysis using a tree diagram and to show a detailed tabulated similarity level [12].
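The descriptive step listed above (mean, median, and SEM per use point) can be sketched in Python as follows. This is only an illustrative sketch, not the workflow of the software packages used in the study: the counts are synthetic, and the number of observations per port (45) and the lognormal generator are assumptions made purely for demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: synthetic plate counts (CFU/mL), NOT the study's records.
# 45 observations per port is an assumption for illustration only.
ports = [f"C{i}" for i in range(1, 14)]
counts = {p: rng.lognormal(mean=2.0, sigma=1.0, size=45).round() for p in ports}


def describe(values):
    """Return basic descriptive statistics for one use point."""
    values = np.asarray(values, dtype=float)
    return {
        "n": values.size,
        "mean": values.mean(),
        "median": np.median(values),
        # Standard error of the mean: sample SD (ddof=1) divided by sqrt(n).
        "sem": stats.sem(values),
    }


summary = {p: describe(v) for p, v in counts.items()}
for port, s in summary.items():
    print(f"{port}: mean={s['mean']:.1f} median={s['median']:.1f} SEM={s['sem']:.2f}")
```

The same dictionary layout can be reused for the real column database, with one array of counts per use point.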

Results and Discussion
The mean microbial count of city water for each point-of-use, along with the overall value with the Standard Error of the Mean (SEM), is shown in Figure 1. The maximum HPC was found at port C5, and the minimum average microbial levels were detected at C4 and C13. Figure 2 shows the distribution analysis of the microbial count as a cumulative histogram [13]. The best-fitting distribution for the discrete datasets varied, although the lognormal fit was the most common among the examined use points. Exceptions were found in C4, C9, and C13 with Weibull III, exponential, and Gamma II distributions, respectively. In the same line, Figure 3 illustrates the pattern of data using a probability-probability or percent-percent (P-P) plot, which compares the true distribution of the data against the theoretical expected best fit. All dispersions were akin to each other, suggesting close spreading of the datasets [14]. All the results were far from the usual shape of the Gaussian distribution, and this finding was expected based on the dispersion pattern found in other studies of the microbial count distribution in water samples.
Multiple comparison analysis was performed on the microbial count trends of all working lines of the city water distribution system within the healthcare facility. The overall Analysis of Variance (ANOVA) yielded an approximate p-value of 0.0187, below the significance level of 0.05. This general (global) effect points to a significant variation in microbial quality between different sections of the water distribution network in the plant, due to a significant variation between medians. Nevertheless, the multiple comparison (post hoc) pairwise analysis using the non-parametric Kruskal-Wallis test yielded non-significant differences (Figure 4). Thus, the outcome was a significant ANOVA with non-significant multiple pairwise comparisons: the p-value computed by the ANOVA was lower than the alpha (α) significance level (e.g., 0.05) [15], whereas all the p-values computed by the pairwise multiple comparisons test were higher than the α significance level.
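The pattern of a global test followed by adjusted pairwise comparisons can be sketched as below. This is a hedged illustration on synthetic data, not a reproduction of the reported analysis: the global test is `scipy.stats.kruskal`, and the pairwise step uses Mann-Whitney U tests with a Bonferroni adjustment as a stand-in, since the exact post hoc procedure applied by the statistical package is not stated here. The variable names and the sample size per port are assumptions.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data: synthetic counts for 13 use points, NOT the study's records.
ports = [f"C{i}" for i in range(1, 14)]
groups = {p: rng.lognormal(mean=2.0, sigma=1.0, size=45) for p in ports}

# Global non-parametric comparison across all use points.
h_stat, p_global = stats.kruskal(*groups.values())

# Pairwise comparisons with a Bonferroni adjustment (stand-in post hoc method).
pairs = list(combinations(ports, 2))
adjusted = {}
for a, b in pairs:
    _, p_raw = stats.mannwhitneyu(groups[a], groups[b], alternative="two-sided")
    # Multiply by the number of comparisons and cap at 1.
    adjusted[(a, b)] = min(1.0, p_raw * len(pairs))

n_significant = sum(p <= 0.05 for p in adjusted.values())
print(f"Global test: H={h_stat:.2f}, p={p_global:.4f}")
print(f"{n_significant} of {len(pairs)} pairwise comparisons significant after adjustment")
```

With 13 use points there are 78 pairs, so each raw pairwise p-value is multiplied by 78 before being compared with α, which is how a significant global result can coexist with no significant pair.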

Figure 1. Mean/SEM of the microbial count by distribution point
Several reasons may explain why a post hoc test appears non-significant while the overall effect is significant. The first is a weakly significant global effect, in which the p-value of the ANOVA table is very close to the significance level; this was not the case in the present study, so this reason was excluded. The second is a lack of statistical power, for instance when treatments have small sample sizes: when multiple comparisons tests are not statistically powerful, they are less likely to detect significant differences. This was also not the case here, as the number of values per group was reasonably high. The third is a conservative multiple comparisons test: the more conservative the test, the more likely it is to miss differences between groups that are meaningful in reality. In addition, a high number of factor levels can also be an explanation, as in the current situation of Table 1. The more pairwise comparisons there are, the more the p-values are penalized in order to decrease the risk of rejecting null hypotheses that are true [15]. Thus, there is a reasonable basis for considering that a significant variation exists in the microbiological water quality between different sections in the distribution network.
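The penalty that grows with the number of factor levels can be made concrete with a small calculation. The sketch below assumes a Bonferroni-type correction purely for illustration (the correction actually applied by the software may differ) and uses a hypothetical helper named `bonferroni_threshold`.

```python
from math import comb


def bonferroni_threshold(levels, alpha=0.05):
    """Return (number of pairwise comparisons, per-comparison threshold)."""
    m = comb(levels, 2)  # all unordered pairs of factor levels
    return m, alpha / m


for k in (3, 6, 13):
    m, threshold = bonferroni_threshold(k)
    print(f"{k} levels -> {m} comparisons, per-test threshold {threshold:.5f}")
```

For the 13 use points C1 to C13 this gives 78 pairwise comparisons, so each individual comparison would have to reach a far stricter threshold than the global α of 0.05.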

Figure 3. P-P plot (C1 to C13), derived from plotting the cumulative distribution functions (CDFs) of the actual vs. expected records
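The coordinates of a P-P plot of this kind can be computed by pairing the empirical cumulative probability of each sorted observation with the cumulative probability predicted by a fitted distribution. The following is a minimal sketch on synthetic, strictly positive data; the lognormal fit with the location fixed at zero and the plotting positions (i - 0.5)/n are assumptions, and real plate counts containing zeros would need separate handling.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical data: synthetic positive counts, NOT the study's records.
sample = rng.lognormal(mean=2.0, sigma=1.0, size=500)

# Fit a two-parameter lognormal (location fixed at 0).
shape, loc, scale = stats.lognorm.fit(sample, floc=0)

x = np.sort(sample)
n = x.size
empirical = (np.arange(1, n + 1) - 0.5) / n            # observed cumulative probability
theoretical = stats.lognorm.cdf(x, shape, loc=loc, scale=scale)  # fitted lognormal CDF

# Largest vertical departure from the diagonal of the P-P plot.
max_gap = np.max(np.abs(empirical - theoretical))

# Same comparison against a Gaussian fit, for contrast.
theoretical_norm = stats.norm.cdf(x, loc=x.mean(), scale=x.std(ddof=1))
max_gap_norm = np.max(np.abs(empirical - theoretical_norm))

print(f"Max P-P deviation, lognormal fit: {max_gap:.3f}")
print(f"Max P-P deviation, normal fit:    {max_gap_norm:.3f}")
```

Plotting `theoretical` against `empirical` gives the P-P plot itself; points close to the diagonal indicate a good fit.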

Figure 4. Kruskal-Wallis test (p < 0.05) showing ranking distribution for all use points of water distribution system
The correlation matrix between different segments of the municipal water distribution system is shown in Figure 5 [16]. A low level of correlation existed between the point-of-use ports, with moderate values observed at best.
Table 2 shows the formation of clusters for dendrogram creation at each step and the similarity (or distance) levels of the clusters formed. The pattern of how similarity or distance values change from step to step can aid in the selection of the final clustering for the database [17]. The step where the values change abruptly might be identified as an acceptable point to define the final grouping. The decision about the final grouping is also called "cutting the dendrogram", which is akin to drawing a line across the dendrogram to specify the final clustering [17]. In this sense, an inflection could be observed between step 10 (8 observations) and step 11 (12 observations). The dendrogram (tree diagram) was used to display the groups formed by aggregation of variables at each increment and to show their similarity levels. In Figure 6, the similarity levels, which could be displayed as distance levels as well, were quantified along the y-axis by measuring the corresponding horizontal line at each step, and the variables (distribution network sampling points) were listed along the x-axis. Accordingly, three clusters could be identified at the cut, viz. 1(4), 5(8), and 13(1), i.e., groups of four, eight, and one use point(s).
While the present analysis is limited to the TVC, the study could be extended in the future to cover other quality aspects of water, such as Total Organic Carbon (TOC) and conductivity, in addition to the traceability of specific objectionable and pathogenic microorganisms in water that are a potential risk to the health of humans and other living organisms. The advantage of combining these statistical techniques with new technologies in microbial enumeration and detection should not be underestimated.
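The clustering of use points from their correlation structure and the cutting of the resulting tree can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure of the software used in the study: the data are synthetic, the Pearson correlation, the distance 1 - r, and average linkage are all choices made here for demonstration, and the cut simply requests at most three clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(3)

# Hypothetical data: 45 synthetic observations (rows) for 13 use points (columns).
ports = [f"C{i}" for i in range(1, 14)]
data = rng.lognormal(mean=2.0, sigma=1.0, size=(45, 13))

# Correlation matrix between use points (columns are the variables).
corr = np.corrcoef(data, rowvar=False)

# Convert correlation to a distance: highly correlated ports are "close".
dist = 1.0 - corr
np.fill_diagonal(dist, 0.0)
condensed = squareform(dist, checks=False)

# Agglomerative clustering; each row of `tree` is one amalgamation step.
tree = linkage(condensed, method="average")

# "Cutting the dendrogram": request at most three final clusters.
labels = fcluster(tree, t=3, criterion="maxclust")

# Leaf order of the dendrogram, computed without drawing a figure.
layout = dendrogram(tree, labels=ports, no_plot=True)

for step, (a, b, height, size) in enumerate(tree, start=1):
    print(f"step {step:2d}: distance={height:.3f}, observations in new cluster={int(size)}")
print(dict(zip(ports, labels)))
```

Inspecting how the merge distance jumps between consecutive steps is the programmatic counterpart of looking for an abrupt change in the similarity level before choosing where to cut.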

Conclusion
Monitoring the biological stability of the water distribution system is crucial for the safety of human consumption and other associated activities. One of the important quality criteria of the distribution network is the control and monitoring of HPC. Statistical analysis of a well-established database might reveal useful information for reporting quality issues, patterns, and trends that would support decision-making in continuous improvement projects. A single type of analysis might not reveal outcomes that could be revealed by another one, such as global ANOVA against pairwise multiple comparisons and correlation matrix against dendrogram. The overall comparison revealed a significant variation between different parts of the distribution system, yet this was not evident in the pairwise comparative analysis. In the same line, the correlation matrix did not yield associations between different water sampling ports that were strong enough to demonstrate a pattern. In the present case, a long-term study of a municipal distribution network showed a clustering tendency in the system segments related to the bioburden level, identifying three main groups. While two clusters consisted of four and eight points-of-use, the last group consisted of only one line that stood at the end of the system, segregated from the adjacent network groups.

Data Availability Statement
The data presented in this study are available on request from the corresponding author.